TokenSavingMiddleware is an AgentMiddleware subclass that does three independent things, each toggleable:
- Tool-output compression: head+tail truncates ToolMessage bodies once they exceed compress_threshold_chars, leaving the most recent N tool messages untouched.
- Early-exit nudge: once the agent has run at least early_exit_min_call_index model calls, evaluates trajectory signals and injects a HumanMessage telling the agent to submit when it appears stuck.
- Perplexity compression (opt-in): applies word-level keep/drop classification to stale messages.
Failures inside the hook are caught and logged; the middleware never breaks the agent loop.
Standalone setup
from langchain.agents import create_agent
from reasonblocks import TokenSavingMiddleware
agent = create_agent(
    model="anthropic:claude-haiku-4-5-20251001",
    tools=[...],
    system_prompt="...",
    middleware=[
        TokenSavingMiddleware(
            compress_threshold_chars=1800,
            keep_recent_tool_messages=2,
        ),
    ],
)
Stack with ReasonBlocksMiddleware
When stacking with ReasonBlocksMiddleware, place TokenSavingMiddleware last. It runs after steering injections are queued, so any injected content goes out compressed.
from langchain.agents import create_agent
from reasonblocks import ReasonBlocks, TokenSavingMiddleware
rb = ReasonBlocks(api_key="rb_live_...")
agent = create_agent(
    model="anthropic:claude-haiku-4-5-20251001",
    tools=[...],
    system_prompt="...",
    middleware=[
        rb.middleware(agent_name="bugfixer"),
        TokenSavingMiddleware(),  # last, so steering injections are compressed too
    ],
)
When you assemble the stack via ReasonBlocksConfig + build_middleware, set enable_token_saving=True and the ordering is handled for you. See ReasonBlocksConfig.
Tool-output compression
Old ToolMessage bodies in the message history are head+tail truncated once they exceed compress_threshold_chars. The middleware replaces same-id messages via LangGraph’s add_messages reducer, so the history actually shrinks rather than growing.
The most recent keep_recent_tool_messages tool messages are exempt — the agent always has full visibility into its current reasoning step.
TokenSavingMiddleware(
    compress_threshold_chars=1800,  # default
    head_keep_chars=900,            # default; keep this much from the start
    tail_keep_chars=700,            # default; keep this much from the end
    keep_recent_tool_messages=2,    # default; exempt these from compression
    enable_compression=True,        # default
)
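The exemption rule above can be sketched in plain Python. This is an illustrative model only, not the library's code: messages are represented as dicts with hypothetical `id`, `type`, and `content` keys, and `select_compressible` is a made-up helper name.

```python
def select_compressible(messages, threshold_chars=1800, keep_recent=2):
    """Return the ids of tool messages that would be eligible for compression.

    Assumption: each message is a dict with "id", "type", and "content" keys.
    """
    tool_msgs = [m for m in messages if m["type"] == "tool"]
    # The most recent `keep_recent` tool messages are always exempt.
    older = tool_msgs[:-keep_recent] if keep_recent > 0 else tool_msgs
    # Only bodies longer than the threshold are candidates.
    return [m["id"] for m in older if len(m["content"]) > threshold_chars]


history = [
    {"id": "t1", "type": "tool", "content": "x" * 5000},
    {"id": "a1", "type": "ai", "content": "thinking"},
    {"id": "t2", "type": "tool", "content": "y" * 100},
    {"id": "t3", "type": "tool", "content": "z" * 5000},
    {"id": "t4", "type": "tool", "content": "w" * 5000},
]
eligible = select_compressible(history)
```

In this sketch only `t1` is eligible: `t2` is below the threshold, and `t3` and `t4` are the two most recent tool messages.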
compress_tool_output is also exposed as a standalone helper:
from reasonblocks import compress_tool_output
compressed = compress_tool_output(
    raw_observation,
    threshold_chars=1800,
    head_chars=900,
    tail_chars=700,
)
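For intuition, head+tail truncation can be sketched as follows. This is a minimal stand-in under stated assumptions, not the implementation of compress_tool_output: the function name `head_tail_truncate` and the format of the omission marker are hypothetical.

```python
def head_tail_truncate(text, threshold_chars=1800, head_chars=900, tail_chars=700):
    """Keep the first `head_chars` and last `tail_chars` of an over-long text."""
    if len(text) <= threshold_chars:
        return text  # short enough, leave untouched
    omitted = len(text) - head_chars - tail_chars
    if omitted <= 0:
        return text  # head and tail already cover the whole text
    # Hypothetical marker; the real helper may mark the gap differently.
    marker = f"\n... [{omitted} chars omitted] ...\n"
    head = text[:head_chars]
    tail = text[len(text) - tail_chars:]  # safe even when tail_chars == 0
    return head + marker + tail


raw = "A" * 1000 + "B" * 3000 + "C" * 1000
short = head_tail_truncate(raw)
```

The start and end of a tool output (command echo, final status, error trailer) tend to carry the most useful context, which is the usual motivation for dropping the middle rather than the tail.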
Early-exit nudge
Once the call index reaches early_exit_min_call_index (default 40), the middleware calls signals_fn(steps) on each before_model hook. If the returned signals indicate a loop, it injects a single HumanMessage with early_exit_text.
The fire condition reads three keys from the dict your signals_fn returns (any key it omits is treated as 0.0):
streak > 0.7
OR (hedge > 0.6 AND diversity > 0.5)
There is no built-in signals function — you supply one. It takes the trajectory steps and returns those loop-likelihood signals in [0, 1]. Compute them however you like; the server-side monitor scores you already collect via telemetry are one source.
from reasonblocks import TokenSavingMiddleware
def my_signals(steps: list[dict]) -> dict[str, float]:
    # Return loop-likelihood signals in [0, 1].
    return {"streak": 0.0, "hedge": 0.0, "diversity": 0.0}
TokenSavingMiddleware(
    early_exit_min_call_index=40,  # default
    enable_early_exit=True,        # default
    signals_fn=my_signals,         # required for early-exit to fire
)
signals_fn defaults to None, which disables the early-exit check even when enable_early_exit=True. You must pass a signals_fn for early-exit to do anything.
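The fire condition described above can be written out as a small predicate, which is handy for unit-testing a signals_fn before wiring it in. This is a sketch of the documented rule, not the middleware's code; `should_nudge` is a hypothetical name.

```python
def should_nudge(signals, call_index, min_call_index=40):
    """Evaluate the documented early-exit condition for one model call."""
    if call_index < min_call_index:
        return False  # too early in the trajectory to consider a nudge
    # Keys the signals function omits are treated as 0.0.
    streak = signals.get("streak", 0.0)
    hedge = signals.get("hedge", 0.0)
    diversity = signals.get("diversity", 0.0)
    return streak > 0.7 or (hedge > 0.6 and diversity > 0.5)
```

Note that the comparisons are strict, so a streak of exactly 0.7 does not fire in this sketch.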
The default nudge text:
You appear to be stuck in a loop. Stop investigating and submit your current best answer now using whatever submission tool your task expects. Do not start another investigation.
Override it with early_exit_text="...".
Inspect TokenSavingStats
TokenSavingMiddleware.stats is a TokenSavingStats dataclass with running counters.
mw = rb.middleware()  # rb is the ReasonBlocks client from the stacking example
ts = TokenSavingMiddleware()
agent = create_agent(..., middleware=[mw, ts])
with mw:
    result = agent.invoke(...)
print(ts.stats.compressions) # tool messages head+tail compressed
print(ts.stats.chars_saved) # total characters removed (head+tail)
print(ts.stats.early_exits) # early-exit nudges injected
print(ts.stats.replacements_emitted) # list of per-step replacement counts
print(ts.stats.perplexity_compressions) # if perplexity compression is enabled
print(ts.stats.perplexity_chars_saved)
print(ts.stats.perplexity_cache_hits)
Perplexity compression (opt-in)
For long trajectories where head+tail isn’t enough, the middleware can apply LLMLingua-2-style word-level keep/drop classification to stale AIMessage and ToolMessage content. Two staleness tiers, each with its own keep ratio. Decisions are cached per (message_id, target_keep_ratio) so each message is classified only once.
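The caching behaviour can be pictured with a dictionary keyed by (message_id, target_keep_ratio). The class below is an illustrative sketch, not the library's cache; `DecisionCache` and `get_or_compute` are hypothetical names, and the `hits` counter only loosely mirrors what a counter like perplexity_cache_hits might track.

```python
class DecisionCache:
    """Memoize compression results per (message_id, keep_ratio) key."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def get_or_compute(self, message_id, keep_ratio, compute):
        key = (message_id, keep_ratio)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = compute()  # e.g. one or more classifier calls
        self._store[key] = result
        return result


calls = []

def fake_classify():
    # Stand-in for an expensive classifier call.
    calls.append(1)
    return "compressed text"

cache = DecisionCache()
cache.get_or_compute("msg-1", 0.55, fake_classify)
cache.get_or_compute("msg-1", 0.55, fake_classify)  # served from the cache
cache.get_or_compute("msg-1", 0.30, fake_classify)  # new ratio, new key
```

Under this model, a message whose keep ratio changes gets a new key and is classified again at the new ratio.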
Provide a WordClassifier callable. The shipped factory uses Anthropic Haiku as the classifier:
import anthropic
from reasonblocks.token_saving import (
    TokenSavingMiddleware,
    make_anthropic_classifier,
)
client = anthropic.Anthropic()
classifier = make_anthropic_classifier(
    client,
    model="claude-haiku-4-5-20251001",
    target_keep_ratio=0.5,
)
ts = TokenSavingMiddleware(
    enable_perplexity_compression=True,
    perplexity_classifier=classifier,
)
With enable_perplexity_compression=True but perplexity_classifier=None, perplexity compression is silently skipped.
TokenSavingMiddleware(
    enable_perplexity_compression=True,
    perplexity_classifier=classifier,
    perplexity_recent_cutoff=3,       # last N calls keep full fidelity
    perplexity_mid_cutoff=10,         # cutoff between mid and old tier
    perplexity_keep_ratio_mid=0.55,   # keep 55% of words in mid tier
    perplexity_keep_ratio_old=0.30,   # keep 30% in old tier
    perplexity_window_words=50,       # words per classifier call
    perplexity_min_content_words=30,  # skip messages shorter than this
)
| Tier | Range (calls back) | Default keep ratio |
|---|---|---|
| Recent | 0..perplexity_recent_cutoff | full fidelity |
| Mid | perplexity_recent_cutoff..perplexity_mid_cutoff | 0.55 |
| Old | >= perplexity_mid_cutoff | 0.30 |
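The tier table can be read as a simple lookup from "calls back" to keep ratio. The sketch below assumes half-open ranges (a message exactly perplexity_recent_cutoff calls back falls in the mid tier) and uses 1.0 to stand for full fidelity; the table does not spell out either detail, and `keep_ratio_for` is a hypothetical helper.

```python
def keep_ratio_for(calls_back, recent_cutoff=3, mid_cutoff=10,
                   ratio_mid=0.55, ratio_old=0.30):
    """Map a message's age, in model calls, to its target keep ratio."""
    if calls_back < recent_cutoff:
        return 1.0  # recent tier: left uncompressed
    if calls_back < mid_cutoff:
        return ratio_mid  # mid tier
    return ratio_old  # old tier


ratios = [keep_ratio_for(n) for n in (0, 2, 3, 9, 10, 25)]
```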
Perplexity compression calls an LLM classifier per stale-message window. Latency and cost scale with the number of stale windows; the per-(id, ratio) cache amortizes the cost across repeat calls but doesn’t eliminate it. Keep perplexity_recent_cutoff >= 3 so the agent never sees its current reasoning compressed.
Use ReasonBlocksConfig for full control
If you assemble the stack via ReasonBlocksConfig + build_middleware, every field above maps to a ts_* config field. See ReasonBlocksConfig.
from reasonblocks import ReasonBlocksAPI, ReasonBlocksConfig, build_middleware
api = ReasonBlocksAPI(api_key="rb_live_...")
config = ReasonBlocksConfig(
    enable_token_saving=True,
    ts_compress_threshold_chars=1800,
    ts_keep_recent_tool_messages=2,
    ts_enable_early_exit=True,
    ts_enable_perplexity_compression=False,
)
middleware = build_middleware(
    config, api,
    score_fn=score_fn, fsm=fsm, state_manager=state_manager,
)