TokenSavingMiddleware is an optional, domain-agnostic middleware that reduces token consumption in long-running agent trajectories. It provides two independent mechanisms: tool-output compression and early-exit nudging. Both levers are enabled by default and can be toggled independently, although the early-exit nudge stays inactive until you supply a signals_fn. A third, opt-in mechanism, perplexity-based word-level compression, is available when you supply a classifier.
Failures inside the middleware hook are logged and swallowed. The middleware never interrupts the agent loop.
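The log-and-swallow behaviour can be pictured with a small decorator. This is only a sketch of the general pattern, not the middleware's actual code: the `swallow_errors` decorator, the logger name, and the `flaky_hook` function are hypothetical, and returning `None` to mean "no changes" is an assumption.

```python
import logging
from functools import wraps
from typing import Any, Callable, Optional

logger = logging.getLogger("token_saving_sketch")  # hypothetical logger name


def swallow_errors(hook: Callable[..., Optional[dict]]) -> Callable[..., Optional[dict]]:
    """Wrap a hook so any exception is logged and turned into a no-op (None)."""

    @wraps(hook)
    def wrapper(*args: Any, **kwargs: Any) -> Optional[dict]:
        try:
            return hook(*args, **kwargs)
        except Exception:
            # Log with traceback, then report "no changes" instead of raising.
            logger.exception("middleware hook %s failed; continuing", hook.__name__)
            return None

    return wrapper


@swallow_errors
def flaky_hook(state: dict) -> Optional[dict]:
    # Hypothetical hook body that raises KeyError on malformed state.
    return {"message_count": len(state["messages"])}
```

With this shape, a malformed state produces a log entry and a `None` result, so the caller's loop keeps running.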
TokenSavingMiddleware stacks alongside ReasonBlocksMiddleware rather than being embedded inside it. You can use either independently.
Constructor
Minimum character length a ToolMessage body must reach before it is compressed. Messages shorter than this threshold are left unchanged.
Number of characters to keep from the start of a tool output when compressing. The head tends to contain the most actionable content.
Number of characters to keep from the end of a tool output when compressing. The tail often contains closing context, error messages, or final values.
Number of the most recent ToolMessage objects to exempt from compression. These are the messages the agent is actively reasoning about; compressing them would degrade step quality.
Minimum number of model calls that must have occurred before an early-exit nudge can be injected. This prevents the nudge from firing on short, healthy runs.
The text injected as a HumanMessage when an early-exit nudge fires. The default message instructs the agent to stop investigating and submit its current best answer. Override this to match your agent’s specific submission instructions.
A function (steps: list[dict]) -> dict[str, float] that evaluates the agent’s trajectory and returns loop-likelihood signals in [0, 1]. The middleware checks the "streak", "hedge", and "diversity" keys to decide whether to fire the early-exit nudge (any omitted key is treated as 0.0). There is no built-in implementation, so you must supply your own. When None (the default), the early-exit lever is disabled even if enable_early_exit=True.
Whether to enable head+tail tool-output compression. Set to False to disable compression entirely while keeping the early-exit lever active.
Whether to enable the early-exit nudge. Set to False to disable the nudge entirely while keeping compression active.
Whether to enable word-level perplexity-based compression. Off by default. Requires perplexity_classifier to be set; if perplexity_classifier is None and this is True, no perplexity compression occurs.
A WordClassifier callable, (words: list[str]) -> list[bool], that returns a keep/drop decision for each word. Use make_anthropic_classifier() to build one backed by a small Anthropic model, or supply your own heuristic. Required when enable_perplexity_compression=True.
Messages from fewer than this many model calls ago are considered “recent” and are excluded from perplexity compression. Keeps the agent’s most active context at full fidelity.
Messages from between perplexity_recent_cutoff and this many calls ago are in the “mid” tier and compressed at perplexity_keep_ratio_mid. Messages older than this are in the “old” tier.
Target fraction of words to keep in “mid” tier messages (3–9 model calls ago). 0.55 means the classifier aims to keep roughly 55% of words.
Target fraction of words to keep in “old” tier messages (10+ model calls ago). More aggressive than the mid tier.
The number of words per window passed to the classifier in a single call. Larger windows give the classifier more context but cost more tokens per call.
Texts shorter than this many words are returned unchanged — the classifier overhead is not worth it for short messages. Applies per-message when deciding whether to invoke the classifier at all.
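The tiering rules for perplexity compression can be summarised in a few lines. The sketch below is an illustration under stated assumptions rather than the middleware's implementation: the function and parameter names are hypothetical, the cutoffs of 3 and 10 are inferred from the "3–9" and "10+" wording above, 0.55 is the example mid-tier ratio mentioned above, and the old-tier ratio and minimum word count are placeholder values.

```python
from typing import Optional


def keep_ratio_for_age(
    calls_ago: int,
    recent_cutoff: int = 3,        # inferred from "3-9 model calls ago"
    old_cutoff: int = 10,          # inferred from "10+ model calls ago"
    keep_ratio_mid: float = 0.55,  # example value from the parameter description
    keep_ratio_old: float = 0.35,  # placeholder; the real default is not stated
) -> Optional[float]:
    """Return the target keep ratio for a message, or None to leave it untouched."""
    if calls_ago < recent_cutoff:
        return None  # "recent" tier: kept at full fidelity
    if calls_ago < old_cutoff:
        return keep_ratio_mid  # "mid" tier
    return keep_ratio_old  # "old" tier: compressed more aggressively


def should_classify(text: str, min_words: int = 20) -> bool:
    """Skip the classifier for short texts (min_words is a placeholder value)."""
    return len(text.split()) >= min_words
```

A message that is 4 model calls old would fall in the mid tier here, while one that is 12 calls old would get the more aggressive old-tier ratio.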
Stats attribute
Every TokenSavingMiddleware instance exposes a stats attribute of type TokenSavingStats that accumulates counters across all before_model calls.
TokenSavingStats dataclass
TokenSavingStats is a plain dataclass. All fields default to 0.
Running count of head+tail compressions applied to ToolMessage objects.
Total characters removed across all head+tail compressions.
Number of times the early-exit nudge was injected into the message history.
Number of word-level perplexity compressions applied. Only increments when enable_perplexity_compression=True.
Total characters removed by word-level perplexity compression.
Number of times a cached compression decision was reused instead of calling the classifier again. Cache keys are (message_id, target_keep_ratio).
Standalone utilities
compress_tool_output()
Head+tail truncates a single tool output string when it exceeds a character threshold. Returns the content unchanged if it is within the threshold. You can call this directly when you want to compress a string outside of the middleware lifecycle.
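To make the head+tail behaviour concrete, here is a minimal re-implementation sketch. It is not the library's source: the function name, parameter names, and default values are hypothetical, and it assumes that N in the omission notice is the number of characters dropped between head and tail. The notice format follows the return-value description in this section.

```python
def compress_tool_output_sketch(
    content: str,
    threshold: int = 2000,   # hypothetical default
    head_chars: int = 800,   # hypothetical default
    tail_chars: int = 400,   # hypothetical default
) -> str:
    """Keep the head and tail of an over-long string, noting what was dropped."""
    # Strings at or below the threshold are returned unchanged.
    if len(content) <= threshold:
        return content

    # If head and tail together cover the whole string, there is nothing to omit.
    omitted = len(content) - head_chars - tail_chars
    if omitted <= 0:
        return content

    head = content[:head_chars]
    # content[-0:] would return the whole string, so handle a zero tail explicitly.
    tail = content[-tail_chars:] if tail_chars > 0 else ""
    return f"{head}\n\n[... {omitted} chars truncated ...]\n\n{tail}"
```

For example, a 100-character string compressed with a threshold of 50, a 5-character head, and a 5-character tail keeps 10 characters and reports 90 as truncated.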
The tool output string to compress.
Character length above which compression is applied. Strings at or below this length are returned unchanged.
Characters to keep from the start of the string.
Characters to keep from the end of the string.
The original string if it’s within the threshold, otherwise a head + omission notice + tail string of the form "{head}\n\n[... N chars truncated ...]\n\n{tail}".
make_anthropic_classifier()
Wraps an anthropic.Anthropic-compatible client as a WordClassifier for use with perplexity-based compression. The classifier asks a small Anthropic model to label each word keep or drop (LLMLingua-2 style, prompt-only — not true log-probability perplexity).
Falls back to the built-in heuristic classifier on any failure (parse error, timeout, rate limit), so the middleware never breaks because of a classifier error.
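If you would rather not call a model at all, any callable with the WordClassifier signature can be used instead. The following is a hypothetical heuristic written for illustration; it is not the built-in fallback mentioned above, and the filler-word list and the "looks like an identifier" rule are arbitrary choices.

```python
import re
from typing import Callable, List

# Type alias matching the documented signature: (words) -> keep/drop flags.
WordClassifier = Callable[[List[str]], List[bool]]

# Arbitrary, illustrative set of low-information words.
_FILLER = {"the", "a", "an", "of", "to", "at", "and", "that", "is", "was", "very", "just", "really"}


def heuristic_classifier(words: List[str]) -> List[bool]:
    """Return True (keep) or False (drop) for each word, in order."""
    decisions: List[bool] = []
    for word in words:
        core = word.strip(".,;:!?()[]{}\"'").lower()
        # Digits, paths, and identifier-like punctuation usually carry signal.
        has_signal = bool(re.search(r"[\d_/\\.=:-]", core))
        decisions.append(has_signal or (len(core) > 0 and core not in _FILLER))
    return decisions
```

The returned list always has one entry per input word, which is the contract the documented signature implies.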
An anthropic.Anthropic-compatible client instance. Must expose a client.messages.create() method with the standard Anthropic Messages API signature.
The model used to classify words. A small, fast model such as Haiku is recommended to keep classification costs low.
The fraction of words the classifier should aim to keep. This value is included in the system prompt so the model can calibrate its labeling. 0.5 means aim for roughly 50% retention.
A WordClassifier callable with signature (words: list[str]) -> list[bool]. Pass this to TokenSavingMiddleware(perplexity_classifier=...).
build_steps_from_messages()
Converts a LangChain message history into the step dict format your signals_fn receives. Pairs each AIMessage’s tool calls with their matching ToolMessage objects via tool_call_id. Use it to build the steps argument when writing a custom signals_fn.
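A custom signals_fn might look like the sketch below. It computes only the "streak" signal and relies on the documented behaviour that omitted keys are treated as 0.0. The step keys "tool" and "args" are hypothetical stand-ins, since the exact step dict fields are not shown here, and the saturation point of five repeats is an arbitrary choice.

```python
from typing import Dict, List


def streak_signals_fn(steps: List[dict]) -> Dict[str, float]:
    """Score how long the trajectory has been repeating the same tool call."""
    if not steps:
        return {"streak": 0.0}

    def call_key(step: dict) -> tuple:
        # "tool" and "args" are hypothetical keys; adapt to the real step format.
        return (step.get("tool"), repr(step.get("args")))

    # Count how many trailing steps are identical to the most recent one.
    last = call_key(steps[-1])
    run = 0
    for step in reversed(steps):
        if call_key(step) != last:
            break
        run += 1

    # A run of 1 is normal; saturate at 1.0 once the same call repeats 5 times.
    saturation = 5
    return {"streak": min(1.0, (run - 1) / (saturation - 1))}
```

Three identical trailing calls score 0.5 under this scheme, and five or more score 1.0; "hedge" and "diversity" are left out on purpose.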
A list of LangChain messages (AIMessage, ToolMessage, HumanMessage, etc.) representing the agent’s trajectory so far.
A list of step dicts, one per AIMessage (or one per tool call when an AIMessage has multiple tool calls). Each dict contains:
