ab_middleware() flips a deterministic coin per run and returns either the full ReasonBlocks pipeline (the on arm) or a vanilla agent (the off arm). Both arms stream telemetry, so you can pull one report that compares them side by side: token/cost deltas, task-accuracy non-inferiority, a sample-ratio check, and a per-day learning curve.
Use this for a head-to-head evaluation (“does ReasonBlocks help, and by how much?”). For normal production use, reach for middleware() directly.
How the two arms differ
| | on (full ReasonBlocks) | off (vanilla control) |
|---|---|---|
| E1/E2/E3 retrieval | ✅ | — |
| Monitor steering | ✅ | — |
| Model routing | ✅ (if configured) | — |
| System-prompt rewrite | ✅ | — (true passthrough) |
| Live telemetry | ✅ | ✅ |
| Run scored server-side | ✅ | ✅ |
The off arm is a true passthrough: the model request is left untouched, so the control isn’t quietly getting ReasonBlocks’ prompt-cache optimization. It still streams telemetry, so every control run is scored and recorded just like the treatment arm.
Run the experiment
Disable the intervention cap for the eval window
Set INTERVENTION_CAP_ENABLED=false on rb-api. Otherwise a free-tier org that crosses its monthly cap mid-experiment has its on arm silently downgraded to vanilla, corrupting the comparison with no error.
Route each run through ab_middleware()
Call it once per run with a stable experiment_id and a per-unit unit_id. Everything else mirrors middleware(). This is a complete, runnable harness:
Make outcome a mechanical function of run artifacts (a checked answer, tests passing, exit code), not a judgment that can see the arm. That’s what makes the accuracy guardrail credible.
experiment_id, arm, assignment_unit, and rb_version are written onto the run row, and rb-api treats experiment_id + arm as immutable once set, so a resume or retry can’t relabel a run.
If you only need the arm decision (to route at a different layer), call the assignment function directly:
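The library's own assignment function is not reproduced here. As a general illustration of what a deterministic per-unit coin flip can look like, the sketch below hashes the experiment and unit identifiers into a bucket. The name assign_arm, the SHA-256 scheme, and the on_fraction parameter are all assumptions made for illustration; they are not ReasonBlocks' actual API or algorithm.

```python
import hashlib


def assign_arm(experiment_id: str, unit_id: str, on_fraction: float = 0.5) -> str:
    """Hypothetical helper: deterministically map a unit to "on" or "off".

    The same (experiment_id, unit_id) pair always yields the same arm,
    so a retry or resume of the same unit cannot land in the other arm.
    """
    if not 0.0 <= on_fraction <= 1.0:
        raise ValueError("on_fraction must be between 0 and 1")

    # Hash the pair; including experiment_id decorrelates assignments
    # across different experiments that share unit ids.
    key = f"{experiment_id}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()

    # Interpret the first 8 bytes as an unsigned 64-bit bucket and compare
    # against an integer threshold (avoids float rounding at the edges).
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(on_fraction * 2**64)
    return "on" if bucket < threshold else "off"
```

Hashing rather than calling a random number generator is what makes the coin deterministic: no state has to be stored to reproduce the decision later.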
The report
- Per-arm rollup: n_runs, outcomes (success/failure/other/unfinished), success_rate, tokens_per_run (median + winsorized mean), steps_per_run, cost (input/output/cache-read tokens + cost_per_run_usd), and reasoning_health.
- The deltas: cost_per_run_usd and tokens_per_run_median (ON vs OFF + pct), and success_rate with a Newcombe ci_95 plus success_rate_stratified (inverse-variance across task_profile).
- Sample-ratio-mismatch check: observed vs expected split, chi-square, p_value, flagged.
- Per-day, per-arm cost + accuracy: the learning curve.
- Auto-generated warnings (wide CI, low token-split coverage, degenerate strata, non-stationarity).
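To make the sample-ratio-mismatch fields concrete, here is a minimal sketch of a one-degree-of-freedom chi-square goodness-of-fit test on the arm counts. It is an illustration of the standard technique, not rb-api's implementation: the function name srm_check, the 0.001 flagging threshold, and every dictionary key other than p_value and flagged are assumptions.

```python
import math


def srm_check(n_on: int, n_off: int, expected_on: float = 0.5,
              alpha: float = 0.001) -> dict:
    """Hypothetical SRM check: compare the observed arm split to the expected one."""
    total = n_on + n_off
    if total <= 0:
        raise ValueError("need at least one run")
    if not 0.0 < expected_on < 1.0:
        raise ValueError("expected_on must be strictly between 0 and 1")

    exp_on = total * expected_on
    exp_off = total - exp_on

    # Pearson chi-square statistic over the two arms (1 degree of freedom).
    chi_square = (n_on - exp_on) ** 2 / exp_on + (n_off - exp_off) ** 2 / exp_off

    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))

    return {
        "observed_on_share": n_on / total,
        "expected_on_share": expected_on,
        "chi_square": chi_square,
        "p_value": p_value,
        "flagged": p_value < alpha,  # alpha is an assumed threshold
    }
```

A strict threshold such as 0.001 is a common choice for this check because a flag is meant to signal a pipeline bug rather than ordinary sampling noise.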
Reading the numbers
- Cost / tokens are the headline — objective and hard to dispute. Cost is priced from the per-step input/output/cache-read split per the model actually used, so model routing and prompt-cache effects are both credited.
- success_rate is the guardrail, not a win condition. Frame it as non-inferiority (“accuracy didn’t regress”), and read the CI width before claiming “no change”: a wide interval at small N means “not enough data”, not “equal”.
- srm.flagged usually means a bug (the on path dropping runs before they’re tagged), not bad luck. Investigate before trusting the rest.
- timeseries is a learning curve: distillation runs on both arms, so the on library grows during the window. Treat the earliest (cold-library) buckets as the stationary baseline.
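The point about CI width can be checked by hand. The sketch below assumes that the "Newcombe" interval refers to Newcombe's hybrid score interval for a difference of two proportions, built from per-arm Wilson intervals; the report's exact method is not confirmed here, and the function names are hypothetical.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def wilson_interval(successes: int, n: int, z: float = Z_95) -> tuple:
    """Wilson score interval for a single proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def newcombe_diff_ci(s_on: int, n_on: int, s_off: int, n_off: int,
                     z: float = Z_95) -> tuple:
    """Return (diff, lower, upper) for success_rate(on) - success_rate(off)."""
    l_on, u_on = wilson_interval(s_on, n_on, z)
    l_off, u_off = wilson_interval(s_off, n_off, z)
    p_on, p_off = s_on / n_on, s_off / n_off
    diff = p_on - p_off
    # Combine the per-arm Wilson half-widths on the relevant sides.
    lower = diff - math.sqrt((p_on - l_on) ** 2 + (u_off - p_off) ** 2)
    upper = diff + math.sqrt((u_on - p_on) ** 2 + (p_off - l_off) ** 2)
    return diff, lower, upper
```

With identical 80% success rates in both arms, 10 runs per arm give an interval spanning roughly plus or minus 0.34, while 1,000 runs per arm give roughly plus or minus 0.035. A non-inferiority claim would compare the lower bound against a margin chosen before the experiment; the margin itself is a decision this sketch does not make.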
Related
- ab_middleware() reference: Parameters and the on/off lifecycle.
- Reduce token usage: Add compression to the ON bundle for the full cost story.

