TokenSavingMiddleware (compression + early-exit) for maximum cost cut at unchanged accuracy, and the general monitor (enable_general_monitor) for an accuracy lift. This page shows how to assemble each.
Validated headline
Paired n=75 runs on ScaleAI/SWE-bench_Pro, claude-sonnet-4-6, real Docker grading, same task ids across arms.
| Arm | Configuration | Pass rate | Mean input tokens | Delta |
|---|---|---|---|---|
| baseline | no middleware | 25.3% | 1,257,316 | — |
| token-saving | TokenSavingMiddleware | 25.3% | 606,212 | −51.8% tokens, flat accuracy |
| + general monitor | enable_general_monitor=True | 36.0% | 1,136,946 | +10.7pp accuracy, −9.6% tokens |
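The Delta column can be re-derived from the other columns. The short check below uses only the figures in the table; the helper names (`pct_change`, `implied_passes`) are illustrative and not part of any library.

```python
# Figures copied from the table above; helper names are illustrative only.
N_TASKS = 75

def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0

def implied_passes(pass_rate_pct: float, n: int = N_TASKS) -> int:
    """Number of passing tasks implied by a pass rate over n tasks."""
    return round(pass_rate_pct / 100.0 * n)

baseline_tokens = 1_257_316
token_saving_tokens = 606_212
monitor_tokens = 1_136_946

print(round(pct_change(token_saving_tokens, baseline_tokens), 1))  # -51.8
print(round(pct_change(monitor_tokens, baseline_tokens), 1))       # -9.6
print(round(36.0 - 25.3, 1))                                       # 10.7 (percentage points)
print(implied_passes(25.3), implied_passes(36.0))                  # 19 27
```

With 75 tasks, one task is worth about 1.3 percentage points, so the accuracy lift corresponds to 8 additional passing tasks.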
Token-saving stack (the −51.8% arm)
Head+tail tool-output compression plus an early-exit nudge. Compression works on its own; the early-exit nudge requires a signals_fn that you supply (there is no built-in; see token saving).
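To make "head+tail compression" concrete, here is a minimal standalone sketch of the general idea: keep the first and last few lines of a long tool output and replace the middle with a marker. It is not the TokenSavingMiddleware implementation; the function name, the default line counts, and the marker text are all assumptions chosen for illustration.

```python
def compress_head_tail(text: str, head_lines: int = 20, tail_lines: int = 20) -> str:
    """Keep the first `head_lines` and last `tail_lines` lines of `text`.

    Hypothetical helper: the name, defaults, and marker format are
    illustrative and are not taken from the library.
    """
    lines = text.splitlines()
    if len(lines) <= head_lines + tail_lines:
        return text  # short outputs pass through untouched
    omitted = len(lines) - head_lines - tail_lines
    marker = f"[... {omitted} lines omitted ...]"
    # Guard tail_lines == 0: lines[-0:] would return the whole list.
    tail = lines[-tail_lines:] if tail_lines > 0 else []
    return "\n".join(lines[:head_lines] + [marker] + tail)


long_output = "\n".join(f"line {i}" for i in range(100))
print(compress_head_tail(long_output, head_lines=2, tail_lines=2))
```

The head usually carries the command echo and first errors, and the tail carries the final status, which is why this shape tends to preserve the useful parts of build and test logs.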
Add the general monitor (the accuracy-lift arm)
GeneralMonitorMiddleware runs the v1 rule-firing detector pack (semantic loop, verification skip, and the rest — see Monitors) and injects a short corrective hint when a rule fires. Place it before TokenSavingMiddleware so injected hints go out compressed.
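The ordering argument can be illustrated with a toy chain that does not use the library at all. The sketch assumes stages run in list order over the outgoing messages; `toy_monitor`, `toy_compressor`, and `run_chain` are hypothetical stand-ins, and the repeated-message check is a crude placeholder rather than the v1 detector pack.

```python
from typing import Callable, List

Messages = List[str]
Stage = Callable[[Messages], Messages]

HINT = "HINT: the last three steps repeat; try a different approach."

def toy_monitor(messages: Messages) -> Messages:
    """Append a hint when the last three messages are identical (crude loop check)."""
    if len(messages) >= 3 and len(set(messages[-3:])) == 1:
        return messages + [HINT]
    return messages

def toy_compressor(max_chars: int) -> Stage:
    """Build a stage that truncates every message to max_chars characters."""
    def stage(messages: Messages) -> Messages:
        return [m if len(m) <= max_chars else m[:max_chars] + "..." for m in messages]
    return stage

def run_chain(stages: List[Stage], messages: Messages) -> Messages:
    """Apply stages in list order (assumed ordering semantics)."""
    for stage in stages:
        messages = stage(messages)
    return messages

history = ["run tests"] * 3
monitor_first = run_chain([toy_monitor, toy_compressor(20)], history)
compress_first = run_chain([toy_compressor(20), toy_monitor], history)

print(monitor_first[-1])   # HINT: the last three...
print(compress_first[-1])  # the full, uncompressed hint
```

When the monitor stage comes first, its hint passes through the compressor like any other message; when it comes second, the hint bypasses compression entirely.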
Assemble via the unified config
If you compose the stack through ReasonBlocksConfig / build_middleware, the ordering (general monitor before token-saving) is handled for you.
The early-exit nudge fires only when you pass a signals_fn (config field ts_signals_fn). Without one, the token-saving stack still delivers the bulk of the savings through tool-output compression.
A/B test this stack
To measure the cost/accuracy impact on your own tasks, route runs through ab_middleware and attach the code-review middleware only on the on arm, keyed off mw.arm, so the control stays a vanilla agent with telemetry only.
mw.arm is "on" / "off" (or "" outside an experiment). Run your eval set,
then pull the per-arm report — see Run an A/B evaluation.
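As a rough picture of arm-keyed attachment and a per-arm summary, the sketch below uses plain Python stand-ins rather than the library. `RunRecord`, `stack_for_arm`, and `per_arm_report` are hypothetical names, the stage labels are just strings, and the sample records are made-up numbers for demonstration, not measured results.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

@dataclass
class RunRecord:
    arm: str            # "on", "off", or "" outside an experiment
    passed: bool
    input_tokens: int

def stack_for_arm(arm: str) -> List[str]:
    """Only the "on" arm gets the code-review stack; every other arm stays vanilla."""
    return ["steering", "general_monitor", "token_saving"] if arm == "on" else []

def per_arm_report(records: List[RunRecord]) -> Dict[str, dict]:
    """Aggregate pass rate and mean input tokens for each arm."""
    report = {}
    for arm in sorted({r.arm for r in records}):
        runs = [r for r in records if r.arm == arm]
        report[arm] = {
            "n": len(runs),
            "pass_rate": sum(r.passed for r in runs) / len(runs),
            "mean_input_tokens": mean(r.input_tokens for r in runs),
        }
    return report

# Made-up sample data, for illustration only.
records = [
    RunRecord("on", True, 600), RunRecord("on", False, 400),
    RunRecord("off", False, 1000), RunRecord("off", True, 1200),
]
report = per_arm_report(records)
print(stack_for_arm("on"), stack_for_arm("off"))
print(report["on"], report["off"])
```

Keeping the same task ids in both arms, as in the paired runs above, is what makes the per-arm numbers directly comparable.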
The on arm now reflects the full code-review stack (steering + general monitor + compression); the off arm is the vanilla baseline.
Requires reasonblocks>=0.2.0.
See also
- Run an A/B evaluation: per-arm cost/accuracy report
- Reduce token usage: TokenSavingMiddleware compression, early-exit, and the signals_fn contract
- Monitors: the v1 detector pack the general monitor runs
- ReasonBlocksConfig: every ts_* / gm_* knob

