TokenSavingMiddleware is an AgentMiddleware subclass that does three independent things, each toggleable:
  1. Tool-output compression — head+tail truncates ToolMessage bodies once they exceed compress_threshold_chars, leaving the most recent N tool messages untouched.
  2. Early-exit nudge — once the agent has run at least early_exit_min_call_index model calls, evaluates trajectory signals and injects a HumanMessage telling the agent to submit when it appears stuck.
  3. Perplexity compression (opt-in) — applies word-level keep/drop classification to stale messages.
Failures inside the hook are caught and logged; the middleware never breaks the agent loop.
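The catch-and-log behaviour described above can be pictured with a small decorator. This is a minimal sketch, not the library's implementation: `fail_safe`, `flaky_hook`, and `healthy_hook` are hypothetical names, and returning `None` on failure is an assumption about how a hook would signal "no state update".

```python
import functools
import logging

logger = logging.getLogger("token_saving_sketch")


def fail_safe(hook):
    """Wrap a hook so any exception is logged and swallowed (hypothetical helper)."""

    @functools.wraps(hook)
    def wrapper(*args, **kwargs):
        try:
            return hook(*args, **kwargs)
        except Exception:
            # Log the traceback, then return None so the caller carries on
            # as if the hook had produced no update (an assumption).
            logger.exception("hook %s failed; skipping", hook.__name__)
            return None

    return wrapper


@fail_safe
def flaky_hook(state):
    raise ValueError("boom")


@fail_safe
def healthy_hook(state):
    return {"messages": []}
```

With this shape, a bug in the compression or early-exit logic costs one skipped optimisation rather than a crashed agent run.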

Standalone setup

from langchain.agents import create_agent
from reasonblocks import TokenSavingMiddleware

agent = create_agent(
    model="anthropic:claude-haiku-4-5-20251001",
    tools=[...],
    system_prompt="...",
    middleware=[
        TokenSavingMiddleware(
            compress_threshold_chars=1800,
            keep_recent_tool_messages=2,
        ),
    ],
)

Stack with ReasonBlocksMiddleware

When stacking with ReasonBlocksMiddleware, place TokenSavingMiddleware last. It runs after steering injections are queued, so any injected content goes out compressed.
from langchain.agents import create_agent
from reasonblocks import ReasonBlocks, TokenSavingMiddleware

rb = ReasonBlocks(api_key="rb_live_...")

agent = create_agent(
    model="anthropic:claude-haiku-4-5-20251001",
    tools=[...],
    system_prompt="...",
    middleware=[
        rb.middleware(agent_name="bugfixer"),
        TokenSavingMiddleware(),
    ],
)
When you assemble the stack via ReasonBlocksConfig + build_middleware, set enable_token_saving=True and the ordering is handled for you. See ReasonBlocksConfig.

Tool-output compression

Old ToolMessage bodies in the message history are head+tail truncated once they exceed compress_threshold_chars. The middleware replaces same-id messages via LangGraph’s add_messages reducer, so the history actually shrinks rather than growing. The most recent keep_recent_tool_messages tool messages are exempt — the agent always has full visibility into its current reasoning step.
TokenSavingMiddleware(
    compress_threshold_chars=1800,   # default
    head_keep_chars=900,             # default — keep this much from the start
    tail_keep_chars=700,             # default — keep this much from the end
    keep_recent_tool_messages=2,     # default — exempt these from compression
    enable_compression=True,         # default
)
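To make the "exempt the most recent N, replace by id" behaviour concrete, here is a simplified model in plain Python. It is a sketch under assumptions: messages are represented as dicts with `id`, `type`, and `content` keys, `select_stale_tool_ids` and `merge_by_id` are hypothetical helpers, and `merge_by_id` only imitates the same-id replacement idea rather than calling LangGraph. The size threshold check is left out for brevity.

```python
def select_stale_tool_ids(messages, keep_recent=2):
    """Return ids of tool messages older than the most recent `keep_recent`."""
    tool_ids = [m["id"] for m in messages if m["type"] == "tool"]
    if keep_recent <= 0:
        return tool_ids  # nothing is exempt
    return tool_ids[:-keep_recent]


def merge_by_id(history, updates):
    """Simplified same-id replacement: an update whose id already exists
    overwrites that entry in place; an unknown id is appended."""
    merged = list(history)
    index = {m["id"]: i for i, m in enumerate(merged)}
    for update in updates:
        if update["id"] in index:
            merged[index[update["id"]]] = update
        else:
            index[update["id"]] = len(merged)
            merged.append(update)
    return merged


history = [
    {"id": "t1", "type": "tool", "content": "x" * 5000},
    {"id": "a1", "type": "ai", "content": "thinking"},
    {"id": "t2", "type": "tool", "content": "y" * 5000},
    {"id": "t3", "type": "tool", "content": "z" * 5000},
]
stale = select_stale_tool_ids(history, keep_recent=2)
updates = [{"id": i, "type": "tool", "content": "[compressed]"} for i in stale]
new_history = merge_by_id(history, updates)
```

Because the update reuses the original id, the history keeps the same number of entries and only the stale body gets smaller.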
compress_tool_output is also exposed as a standalone helper:
from reasonblocks import compress_tool_output

compressed = compress_tool_output(
    raw_observation,
    threshold_chars=1800,
    head_chars=900,
    tail_chars=700,
)
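For intuition about what head+tail truncation does to a long observation, a minimal sketch follows. It is not the source of `compress_tool_output`: `head_tail_truncate` is a hypothetical function, and the omission marker format is an assumption made for illustration.

```python
def head_tail_truncate(text, threshold_chars=1800, head_chars=900, tail_chars=700):
    """Keep the first `head_chars` and last `tail_chars` of an over-long string."""
    if len(text) <= threshold_chars:
        return text  # at or under the threshold: leave untouched
    omitted = len(text) - head_chars - tail_chars
    if omitted <= 0:
        return text  # head and tail would overlap, so there is nothing to cut
    head = text[:head_chars]
    # text[-0:] would return the whole string, so handle tail_chars == 0 explicitly.
    tail = text[-tail_chars:] if tail_chars > 0 else ""
    marker = f"\n[... {omitted} chars omitted ...]\n"  # hypothetical marker format
    return head + marker + tail


sample = "H" * 900 + "M" * 3000 + "T" * 700
result = head_tail_truncate(sample)
```

Keeping both ends is a reasonable default for tool output, where the start often carries the command or header and the end carries the error or summary.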

Early-exit nudge

Once the call index reaches early_exit_min_call_index (default 40), the middleware calls signals_fn(steps) in each before_model hook. If the returned signals indicate a loop, it injects a single HumanMessage containing early_exit_text. The fire condition reads three keys from the dict your signals_fn returns (any key it omits is treated as 0.0):
streak > 0.7
OR (hedge > 0.6 AND diversity > 0.5)
There is no built-in signals function — you supply one. It takes the trajectory steps and returns those loop-likelihood signals in [0, 1]. Compute them however you like; the server-side monitor scores you already collect via telemetry are one source.
from reasonblocks import TokenSavingMiddleware

def my_signals(steps: list[dict]) -> dict[str, float]:
    # Return loop-likelihood signals in [0, 1].
    return {"streak": 0.0, "hedge": 0.0, "diversity": 0.0}

TokenSavingMiddleware(
    early_exit_min_call_index=40,   # default
    enable_early_exit=True,         # default
    signals_fn=my_signals,          # required for early-exit to fire
)
signals_fn defaults to None, which disables the early-exit check even when enable_early_exit=True. You must pass a signals_fn for early-exit to do anything.
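As one illustration of a signals function, the sketch below derives a `streak` score from repeated actions at the end of the trajectory. The step schema is an assumption: each step is taken to be a dict with hypothetical `tool` and `args` keys, which may not match what your steps actually contain. `trailing_repeat_streak` and `repeat_signals` are likewise hypothetical names.

```python
def trailing_repeat_streak(steps, window=10):
    """Length of the unbroken run of identical actions at the end of the
    trajectory, divided by `window` and capped at 1.0."""
    if not steps:
        return 0.0

    def key(step):
        # Assumed schema: "tool" and "args" keys identify an action.
        return (step.get("tool"), repr(step.get("args")))

    last = key(steps[-1])
    run = 0
    for step in reversed(steps[-window:]):
        if key(step) != last:
            break
        run += 1
    return min(run / window, 1.0)


def repeat_signals(steps):
    # Only "streak" is computed here; the other two keys are left at 0.0.
    return {"streak": trailing_repeat_streak(steps), "hedge": 0.0, "diversity": 0.0}
```

A function like this could be passed as `signals_fn=repeat_signals`; in practice you would likely compute `hedge` and `diversity` from your own scoring as well.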
The default nudge text:
You appear to be stuck in a loop. Stop investigating and submit your current best answer now using whatever submission tool your task expects. Do not start another investigation.
Override it with early_exit_text="...".
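Putting this section together, the gating logic can be sketched as a single predicate. This is a reading of the rules stated above, not the middleware's code: `early_exit_due` is a hypothetical name, and the sketch ignores the fire-once bookkeeping and the `enable_early_exit` flag.

```python
def early_exit_due(call_index, signals_fn, steps, min_call_index=40):
    """Return True when the documented fire condition holds."""
    # No signals function, or too early in the run: never fire.
    if signals_fn is None or call_index < min_call_index:
        return False
    signals = signals_fn(steps)
    # Any key the signals function omits is treated as 0.0.
    streak = signals.get("streak", 0.0)
    hedge = signals.get("hedge", 0.0)
    diversity = signals.get("diversity", 0.0)
    return streak > 0.7 or (hedge > 0.6 and diversity > 0.5)
```

Note that the comparisons are strict, so a `streak` of exactly 0.7 does not fire.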

Inspect TokenSavingStats

TokenSavingMiddleware.stats is a TokenSavingStats dataclass with running counters.
ts = TokenSavingMiddleware()
mw = rb.middleware(agent_name="bugfixer")   # bind the ReasonBlocks middleware so it can be used below
agent = create_agent(..., middleware=[mw, ts])

with mw:
    result = agent.invoke(...)

print(ts.stats.compressions)              # tool messages head+tail compressed
print(ts.stats.chars_saved)               # total characters removed (head+tail)
print(ts.stats.early_exits)               # early-exit nudges injected
print(ts.stats.replacements_emitted)      # list of per-step replacement counts
print(ts.stats.perplexity_compressions)   # if perplexity compression is enabled
print(ts.stats.perplexity_chars_saved)
print(ts.stats.perplexity_cache_hits)

Perplexity compression (opt-in)

For long trajectories where head+tail isn’t enough, the middleware can apply LLMLingua-2-style word-level keep/drop classification to stale AIMessage and ToolMessage content. There are two staleness tiers, each with its own keep ratio. Decisions are cached per (message_id, target_keep_ratio), so each message is classified only once per ratio.
Provide a WordClassifier callable. The shipped factory uses Anthropic Haiku as the classifier:
import anthropic
from reasonblocks.token_saving import (
    TokenSavingMiddleware,
    make_anthropic_classifier,
)

client = anthropic.Anthropic()

classifier = make_anthropic_classifier(
    client,
    model="claude-haiku-4-5-20251001",
    target_keep_ratio=0.5,
)

ts = TokenSavingMiddleware(
    enable_perplexity_compression=True,
    perplexity_classifier=classifier,
)
With enable_perplexity_compression=True but perplexity_classifier=None, perplexity compression is silently skipped.
Perplexity compression calls an LLM classifier per stale-message window. Latency and cost scale with the number of stale windows; the per-(id, ratio) cache amortizes the cost across repeat calls but doesn’t eliminate it. Keep perplexity_recent_cutoff >= 3 so the agent never sees its current reasoning compressed.
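The interaction between word-level keep/drop decisions and the per-(id, ratio) cache can be sketched as follows. Everything here is illustrative: `toy_classifier` is a stand-in that keeps the longest words instead of calling an LLM, its `(words, target_keep_ratio)` signature is an assumption rather than the real WordClassifier interface, and `CachedCompressor` is a hypothetical class.

```python
def toy_classifier(words, target_keep_ratio):
    """Stand-in for an LLM classifier: mark the longest words as 'keep'."""
    n_keep = max(1, round(len(words) * target_keep_ratio))
    ranked = sorted(range(len(words)), key=lambda i: len(words[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(words))]


class CachedCompressor:
    """Apply a keep/drop mask to a message, caching by (message_id, ratio)."""

    def __init__(self, classifier):
        self.classifier = classifier
        self.cache = {}
        self.cache_hits = 0

    def compress(self, message_id, text, target_keep_ratio):
        key = (message_id, target_keep_ratio)
        if key in self.cache:
            self.cache_hits += 1  # no classifier call on a repeat request
            return self.cache[key]
        words = text.split()
        if not words:
            result = text
        else:
            mask = self.classifier(words, target_keep_ratio)
            result = " ".join(w for w, keep in zip(words, mask) if keep)
        self.cache[key] = result
        return result
```

The cache only helps when the same message is revisited at the same ratio; a message that moves to a different staleness tier is classified again, which is why the cost is amortized rather than eliminated.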

Use ReasonBlocksConfig for full control

If you assemble the stack via ReasonBlocksConfig + build_middleware, every field above maps to a ts_* config field. See ReasonBlocksConfig.
from reasonblocks import ReasonBlocksAPI, ReasonBlocksConfig, build_middleware

api = ReasonBlocksAPI(api_key="rb_live_...")

config = ReasonBlocksConfig(
    enable_token_saving=True,
    ts_compress_threshold_chars=1800,
    ts_keep_recent_tool_messages=2,
    ts_enable_early_exit=True,
    ts_enable_perplexity_compression=False,
)

middleware = build_middleware(
    config, api,
    score_fn=score_fn, fsm=fsm, state_manager=state_manager,
)