TokenSavingMiddleware is an optional, domain-agnostic middleware that reduces token consumption in long-running agent trajectories. It provides two independent mechanisms: tool-output compression and early-exit nudging. Both are enabled by default and can be toggled independently, although early-exit nudging has no effect until you supply a signals_fn. A third, opt-in mechanism, perplexity-based word-level compression, is available when you supply a classifier. Failures inside the middleware hook are logged and swallowed, so the middleware never interrupts the agent loop.
TokenSavingMiddleware stacks alongside ReasonBlocksMiddleware rather than being embedded inside it. You can use either independently.
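from langchain.agents import create_agent  # assumed import path (LangChain v1); adjust to your framework
from reasonblocks import ReasonBlocks, TokenSavingMiddleware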

rb = ReasonBlocks(api_key="rb_live_...")

agent = create_agent(
    model=...,
    tools=...,
    middleware=[
        rb.middleware(agent_name="reviewer", task="Review PR #42"),
        TokenSavingMiddleware(),  # always last
    ],
)

Constructor

compress_threshold_chars
integer
default:"1800"
Minimum character length a ToolMessage body must reach before it is compressed. Messages shorter than this threshold are left unchanged.
head_keep_chars
integer
default:"900"
Number of characters to keep from the start of a tool output when compressing. The head tends to contain the most actionable content.
tail_keep_chars
integer
default:"700"
Number of characters to keep from the end of a tool output when compressing. The tail often contains closing context, error messages, or final values.
keep_recent_tool_messages
integer
default:"2"
Number of the most recent ToolMessage objects to exempt from compression. These are the messages the agent is actively reasoning about; compressing them would degrade step quality.
early_exit_min_call_index
integer
default:"40"
Minimum number of model calls that must have occurred before an early-exit nudge can be injected. This prevents the nudge from firing on short, healthy runs.
early_exit_text
string
default:"\"You appear to be stuck in a loop...\""
The text injected as a HumanMessage when an early-exit nudge fires. The default message instructs the agent to stop investigating and submit its current best answer. Override this to match your agent’s specific submission instructions.
signals_fn
callable
default:"None"
A function (steps: list[dict]) -> dict[str, float] that evaluates the agent’s trajectory and returns loop-likelihood signals in [0, 1]. The middleware checks the "streak", "hedge", and "diversity" keys to decide whether to fire the early-exit nudge (any omitted key is treated as 0.0). There is no built-in implementation — supply your own. When None (the default), the early-exit lever is disabled even if enable_early_exit=True.
enable_compression
boolean
default:"True"
Whether to enable head+tail tool-output compression. Set to False to disable compression entirely while keeping the early-exit lever active.
enable_early_exit
boolean
default:"True"
Whether to enable the early-exit nudge. Set to False to disable the nudge entirely while keeping compression active.
enable_perplexity_compression
boolean
default:"False"
Whether to enable word-level perplexity-based compression. Off by default. Requires perplexity_classifier to be set; if perplexity_classifier is None and this is True, no perplexity compression occurs.
perplexity_classifier
callable
A WordClassifier callable, (words: list[str]) -> list[bool], that returns a keep/drop decision for each word. Use make_anthropic_classifier() to build one backed by a small Anthropic model, or supply your own heuristic. Must be set for enable_perplexity_compression=True to have any effect.
perplexity_recent_cutoff
integer
default:"3"
Messages from fewer than this many model calls ago are considered “recent” and are excluded from perplexity compression. Keeps the agent’s most active context at full fidelity.
perplexity_mid_cutoff
integer
default:"10"
Messages from at least perplexity_recent_cutoff but fewer than this many calls ago are in the “mid” tier and are compressed at perplexity_keep_ratio_mid. Messages from this many calls ago or more are in the “old” tier.
perplexity_keep_ratio_mid
number
default:"0.55"
Target fraction of words to keep in “mid” tier messages (3–9 model calls ago with the default cutoffs). 0.55 means the classifier aims to keep roughly 55% of words.
perplexity_keep_ratio_old
number
default:"0.30"
Target fraction of words to keep in “old” tier messages (10+ model calls ago with the default cutoffs). More aggressive than the mid tier.
perplexity_window_words
integer
default:"50"
The number of words per window passed to the classifier in a single call. Larger windows give the classifier more context but cost more tokens per call.
perplexity_min_content_words
integer
default:"30"
Texts shorter than this many words are returned unchanged — the classifier overhead is not worth it for short messages. Applies per-message when deciding whether to invoke the classifier at all.
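
The three perplexity tiers can be read as a simple mapping from a message's age (in model calls) to a target keep ratio. The following is a minimal sketch of that mapping using the default cutoffs and ratios listed above; the function name keep_ratio_for_age is hypothetical and this is not the middleware's actual implementation.

```python
from typing import Optional


def keep_ratio_for_age(
    calls_ago: int,
    recent_cutoff: int = 3,
    mid_cutoff: int = 10,
    keep_ratio_mid: float = 0.55,
    keep_ratio_old: float = 0.30,
) -> Optional[float]:
    """Return the target keep ratio for a message, or None if it is exempt.

    Hypothetical helper: mirrors the documented tier rules, not library code.
    """
    if calls_ago < recent_cutoff:
        return None  # "recent" tier: left at full fidelity
    if calls_ago < mid_cutoff:
        return keep_ratio_mid  # "mid" tier
    return keep_ratio_old  # "old" tier


# Print the tier decision for a few representative ages.
for age in (0, 2, 3, 9, 10, 25):
    print(age, keep_ratio_for_age(age))
```

Raising perplexity_recent_cutoff widens the exempt window, while lowering perplexity_mid_cutoff moves messages into the more aggressive "old" tier sooner.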

Stats attribute

Every TokenSavingMiddleware instance exposes a stats attribute of type TokenSavingStats that accumulates counters across all before_model calls.
mw = TokenSavingMiddleware()
# ... run the agent ...
print(mw.stats.compressions)       # number of head+tail compressions applied
print(mw.stats.chars_saved)        # total characters removed by head+tail compression
print(mw.stats.early_exits)        # number of early-exit nudges injected
print(mw.stats.perplexity_compressions)  # word-level compressions applied
print(mw.stats.perplexity_chars_saved)   # characters removed by word-level compression
print(mw.stats.perplexity_cache_hits)    # cached compression decisions reused

TokenSavingStats dataclass

TokenSavingStats is a plain dataclass. All fields default to 0.
compressions
integer
Running count of head+tail compressions applied to ToolMessage objects.
chars_saved
integer
Total characters removed across all head+tail compressions.
early_exits
integer
Number of times the early-exit nudge was injected into the message history.
perplexity_compressions
integer
Number of word-level perplexity compressions applied. Only increments when enable_perplexity_compression=True.
perplexity_chars_saved
integer
Total characters removed by word-level perplexity compression.
perplexity_cache_hits
integer
Number of times a cached compression decision was reused instead of calling the classifier again. Cache keys are (message_id, target_keep_ratio).
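
To illustrate why the cache key includes the keep ratio, here is a small sketch of a cache keyed by (message_id, target_keep_ratio). The class CompressionCache and its method names are hypothetical, and the middleware's internal cache may be structured differently.

```python
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, float]  # (message_id, target_keep_ratio)


class CompressionCache:
    """Hypothetical memo of compressed text, keyed as described above."""

    def __init__(self) -> None:
        self._store: Dict[CacheKey, str] = {}
        self.hits = 0  # plays the role of perplexity_cache_hits

    def get_or_compute(
        self,
        message_id: str,
        target_keep_ratio: float,
        compute: Callable[[], str],
    ) -> str:
        key = (message_id, target_keep_ratio)
        if key in self._store:
            self.hits += 1  # reuse the earlier decision, no classifier call
            return self._store[key]
        result = compute()  # stands in for the classifier-backed compression
        self._store[key] = result
        return result
```

Under this keying, a message that ages from the mid tier into the old tier gets a new target ratio and therefore a new key, so it is compressed again rather than served from the cache.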

Standalone utilities

compress_tool_output()

Applies head+tail truncation to a single tool output string when it exceeds a character threshold, and returns the content unchanged otherwise. You can call this directly when you want to compress a string outside of the middleware lifecycle.
from reasonblocks import compress_tool_output

compressed = compress_tool_output(
    long_output,
    threshold_chars=1800,
    head_chars=900,
    tail_chars=700,
)
content
string
required
The tool output string to compress.
threshold_chars
integer
default:"1800"
Character length above which compression is applied. Strings at or below this length are returned unchanged.
head_chars
integer
default:"900"
Characters to keep from the start of the string.
tail_chars
integer
default:"700"
Characters to keep from the end of the string.
return
string
The original string if it’s within the threshold, otherwise a head + omission notice + tail string of the form "{head}\n\n[... N chars truncated ...]\n\n{tail}".
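
For readers who want to see the shape of the transformation, the following is a minimal standalone sketch of head+tail truncation that produces the return format described above. The name head_tail_truncate is hypothetical. The sketch assumes that N is the number of characters dropped between head and tail, and it adds a guard for overlapping head and tail that the library may or may not have.

```python
def head_tail_truncate(
    content: str,
    threshold_chars: int = 1800,
    head_chars: int = 900,
    tail_chars: int = 700,
) -> str:
    """Illustrative re-implementation; not the library's compress_tool_output."""
    # Strings at or below the threshold are returned unchanged.
    if len(content) <= threshold_chars:
        return content
    # Assumed guard: if head and tail would cover the whole string, keep it as-is.
    if head_chars + tail_chars >= len(content):
        return content
    head = content[:head_chars]
    tail = content[-tail_chars:] if tail_chars > 0 else ""
    omitted = len(content) - len(head) - len(tail)
    return f"{head}\n\n[... {omitted} chars truncated ...]\n\n{tail}"


# A 2000-character input keeps its first 900 and last 700 characters.
sample = "a" * 1000 + "b" * 1000
print(len(head_tail_truncate(sample)))
```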

make_anthropic_classifier()

Wraps an anthropic.Anthropic-compatible client as a WordClassifier for use with perplexity-based compression. The classifier asks a small Anthropic model to label each word keep or drop (LLMLingua-2 style, prompt-only — not true log-probability perplexity). Falls back to the built-in heuristic classifier on any failure (parse error, timeout, rate limit), so the middleware never breaks because of a classifier error.
import anthropic
from reasonblocks import TokenSavingMiddleware, make_anthropic_classifier

client = anthropic.Anthropic()
classifier = make_anthropic_classifier(
    client,
    model="claude-haiku-4-5-20251001",
    target_keep_ratio=0.5,
)

mw = TokenSavingMiddleware(
    enable_perplexity_compression=True,
    perplexity_classifier=classifier,
)
client
object
required
An anthropic.Anthropic-compatible client instance. Must expose a client.messages.create() method with the standard Anthropic Messages API signature.
model
string
default:"\"claude-haiku-4-5-20251001\""
The model used to classify words. A small, fast model such as Haiku is recommended to keep classification costs low.
target_keep_ratio
number
default:"0.5"
The fraction of words the classifier should aim to keep. This value is included in the system prompt so the model can calibrate its labeling. 0.5 means aim for roughly 50% retention.
return
WordClassifier
A WordClassifier callable with signature (words: list[str]) -> list[bool]. Pass this to TokenSavingMiddleware(perplexity_classifier=...).
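
If you would rather not call a model, any callable with the (words: list[str]) -> list[bool] signature can serve as the classifier. The sketch below is one possible heuristic that drops common English stopwords and keeps everything else. The name stopword_classifier and the stopword list are illustrative assumptions, and this heuristic makes no attempt to hit a particular keep ratio.

```python
from typing import List

# Small illustrative stopword list; extend it for your domain.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "that", "it"}


def stopword_classifier(words: List[str]) -> List[bool]:
    """Return True (keep) or False (drop) for each word, in order."""
    decisions = []
    for word in words:
        core = word.strip(".,;:!?\"'()").lower()
        # Only purely alphabetic stopwords are dropped; numbers, paths,
        # and identifiers are always kept.
        decisions.append(not (core.isalpha() and core in STOPWORDS))
    return decisions


print(stopword_classifier(["The", "error", "is", "in", "line", "42"]))
```

Such a function can be passed as perplexity_classifier together with enable_perplexity_compression=True.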

build_steps_from_messages()

Converts a LangChain message history into the step dict format your signals_fn receives. Pairs each AIMessage’s tool calls with their matching ToolMessage objects via tool_call_id. Use it to build the steps argument when writing a custom signals_fn.
from reasonblocks.token_saving import build_steps_from_messages

steps = build_steps_from_messages(state["messages"])
messages
list
required
A list of LangChain messages (AIMessage, ToolMessage, HumanMessage, etc.) representing the agent’s trajectory so far.
return
list[dict]
A list of step dicts, one per AIMessage (or one per tool call when an AIMessage has multiple tool calls). Each dict contains:

Full example

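from langchain.agents import create_agent  # assumed import path (LangChain v1); adjust to your framework
from langchain_core.messages import HumanMessage
from reasonblocks import ReasonBlocks, TokenSavingMiddleware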

rb = ReasonBlocks(api_key="rb_live_...")

mw = TokenSavingMiddleware(
    compress_threshold_chars=1800,
    keep_recent_tool_messages=2,
)

agent = create_agent(
    model=...,
    tools=...,
    middleware=[rb.middleware(agent_name="reviewer", task="Review PR #5"), mw],
)

result = agent.invoke({"messages": [HumanMessage(content="Review PR #5")]})

print(f"Compressions: {mw.stats.compressions}")
print(f"Chars saved:  {mw.stats.chars_saved}")
print(f"Early exits:  {mw.stats.early_exits}")
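
The example above leaves signals_fn unset, so the early-exit lever stays inactive. The following is a rough sketch of what a custom signals_fn might look like. It assumes each step dict carries a "tool_name" key, which is a hypothetical field name, so check the actual output of build_steps_from_messages(). It also assumes that a higher "diversity" value means a more loop-like trajectory. The "hedge" key is deliberately omitted, since omitted keys are treated as 0.0.

```python
from typing import Dict, List


def simple_signals(steps: List[dict]) -> Dict[str, float]:
    """Hypothetical signals_fn returning loop-likelihood scores in [0, 1]."""
    # ASSUMPTION: "tool_name" is an illustrative key, not a documented field.
    names = [step.get("tool_name", "") for step in steps]
    if not names:
        return {}

    # streak: length of the trailing run of identical tool names,
    # scaled so that 5 or more repeats saturates at 1.0.
    run = 0
    for name in reversed(names):
        if name != names[-1]:
            break
        run += 1
    streak = min(run / 5.0, 1.0)

    # diversity: 0.0 when the last 10 calls all differ, approaching 1.0
    # when they are all the same tool.
    recent = names[-10:]
    diversity = 1.0 - len(set(recent)) / len(recent)

    return {"streak": streak, "diversity": diversity}


print(simple_signals([{"tool_name": "grep"}] * 6))
```

Pass the function as signals_fn when constructing TokenSavingMiddleware. The thresholds the middleware applies to these values are not documented here, so treat the scaling constants as starting points to tune.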