ab_middleware() flips a deterministic coin per run and returns either the full ReasonBlocks pipeline (the on arm) or a vanilla agent (the off arm). Both arms stream telemetry, so you can pull one report that compares them side by side: token/cost deltas, task-accuracy non-inferiority, a sample-ratio check, and a per-day learning curve.
Use this for a head-to-head evaluation (“does ReasonBlocks help, and by how much?”). For normal production use, reach for middleware() directly.

How the two arms differ

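| Capability | on (full ReasonBlocks) | off (vanilla control) |
| --- | --- | --- |
| E1/E2/E3 retrieval | ✅ | no |
| Monitor steering | ✅ | no |
| Model routing | ✅ (if configured) | no |
| System-prompt rewrite | ✅ | no (true passthrough) |
| Live telemetry | ✅ | ✅ |
| Run scored server-side | ✅ | ✅ |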
The off arm is a true passthrough — the model request is left untouched, so the control isn’t quietly getting ReasonBlocks’ prompt-cache optimization. It still streams telemetry, so every control run is scored and recorded just like the treatment arm.

Run the experiment

1. Disable the intervention cap for the eval window

Set INTERVENTION_CAP_ENABLED=false on rb-api. Otherwise a free-tier org that crosses its monthly cap mid-experiment has its on arm silently downgraded to vanilla — corrupting the comparison with no error.
Experiment-tagged runs are excluded from the billing meter regardless of this flag, but the cap gate (which zeroes retrieval when an org is over quota) only lifts when INTERVENTION_CAP_ENABLED=false.
2. Route each run through ab_middleware()

Call it once per run with a stable experiment_id and a per-unit unit_id. Everything else mirrors middleware(). This is a complete, runnable harness:
import re
from langchain.agents import create_agent
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from reasonblocks import ReasonBlocks

rb = ReasonBlocks(api_key="rb_live_...")
model = ChatAnthropic(model="claude-haiku-4-5-20251001", max_tokens=512)

@tool
def calc(expression: str) -> str:
    """Evaluate a simple arithmetic expression, e.g. '17 * 23'."""
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression or ""):
        return "invalid expression"
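    try:
        # Builtins are disabled and the regex above limits input to arithmetic characters.
        return str(eval(expression, {"__builtins__": {}}, {}))
    except (ArithmeticError, SyntaxError, TypeError):
        # e.g. "1/0" or "2 +": report it to the model instead of crashing the run.
        return "invalid expression"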

# A fixed eval set with checkable answers -> a mechanical outcome label.
TASKS = [("t-1", "compute 17 * 23", "391"), ("t-2", "compute 256 + 389", "645")]

for task_id, prompt, expected in TASKS:
    mw = rb.ab_middleware(
        experiment_id="cust-acme-q2",
        unit_id=task_id,            # stable id -> retries stay in the same arm
        on_fraction=0.5,            # 50/50 split
        agent_name="calc-agent",
        task=prompt,
        org_id="acme",
    )
    agent = create_agent(
        model=model,
        tools=[calc],
        system_prompt="You are a precise calculator. Use the calc tool, then state the result.",
        middleware=[mw],
    )
    outcome = "failure"
    try:
        result = agent.invoke({"messages": [("user", prompt)]})
        final = next(
            (m.content for m in reversed(result["messages"])
             if getattr(m, "content", None) and not getattr(m, "tool_calls", None)),
            "",
        )
        outcome = "success" if expected in str(final) else "failure"
    finally:
        mw.flush_session(outcome_status=outcome)   # records the run's outcome
        mw.close(timeout=10)
Pass a stable unit_id (a task/ticket id, not a fresh per-attempt run id). Assignment is hash(experiment_id, unit_id), so a retried task lands in the same arm and can’t contaminate the comparison. Uniform hashing also keeps any sub-population — one repo, one task type — split at ~on_fraction for free.
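To make the stable-assignment property concrete, here is a minimal sketch of deterministic hash-based bucketing. It is not the SDK's implementation: the function name sketch_assign_arm, the SHA-256 hash, and the separator are all assumptions chosen for illustration, so its output will not necessarily match assign_arm or ab_middleware().

```python
import hashlib


def sketch_assign_arm(experiment_id: str, unit_id: str, on_fraction: float = 0.5) -> str:
    """Illustrative only: map (experiment_id, unit_id) to "on" or "off" deterministically."""
    if not 0.0 <= on_fraction <= 1.0:
        raise ValueError("on_fraction must be in [0, 1]")
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from hashing identically.
    key = f"{experiment_id}\x00{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Read the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "on" if bucket < on_fraction else "off"


# Same inputs always give the same arm, so a retry cannot switch arms.
first = sketch_assign_arm("cust-acme-q2", "t-1")
retry = sketch_assign_arm("cust-acme-q2", "t-1")

# Across many units the observed split should sit close to on_fraction.
arms = [sketch_assign_arm("cust-acme-q2", f"t-{i}") for i in range(10_000)]
observed_on_share = arms.count("on") / len(arms)
```

Because the arm depends only on the two ids and on_fraction, no assignment table needs to be stored; changing on_fraction mid-experiment would move some units between arms, which is one reason to fix it up front.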
Make outcome a mechanical function of run artifacts (a checked answer, tests passing, exit code) — not a judgment that can see the arm. That’s what makes the accuracy guardrail credible.
3. Pull the report

curl -sS \
  "https://rb-api.reasonblocks.com/v1/monitor/experiments/cust-acme-q2/report?org_id=acme&on_fraction=0.5" \
  -H "Authorization: Bearer $RB_API_KEY"
on_fraction is your configured ON probability — used only for the sample-ratio-mismatch (SRM) check. The report is computed live from telemetry on every call; nothing is cached.
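If you would rather pull the report from Python than from curl, a standard-library sketch might look like the following. The endpoint, query parameters, and bearer header are copied from the curl call above; the helper names report_url and fetch_report are hypothetical and not part of the reasonblocks SDK.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://rb-api.reasonblocks.com"


def report_url(experiment_id: str, org_id: str, on_fraction: float) -> str:
    """Build the report URL used in the curl example (hypothetical helper)."""
    path = f"/v1/monitor/experiments/{urllib.parse.quote(experiment_id, safe='')}/report"
    query = urllib.parse.urlencode({"org_id": org_id, "on_fraction": on_fraction})
    return f"{BASE_URL}{path}?{query}"


def fetch_report(experiment_id: str, org_id: str, on_fraction: float,
                 api_key: str, timeout: float = 30.0) -> dict:
    """GET the report and parse the JSON body (hypothetical helper)."""
    request = urllib.request.Request(
        report_url(experiment_id, org_id, on_fraction),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (performs a network call, so it is left commented out here):
#   import os
#   report = fetch_report("cust-acme-q2", "acme", 0.5, os.environ["RB_API_KEY"])
#   print(report["comparison"]["success_rate"])
```

Pass the same on_fraction you gave ab_middleware(); it only feeds the SRM check, so a mismatched value would produce a misleading flag rather than change the arm rollups.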
The middleware stamps experiment_id, arm, assignment_unit, and rb_version onto the run row, and rb-api treats experiment_id + arm as immutable once set — a resume or retry can’t relabel a run. If you only need the arm decision (to route at a different layer), call the assignment function directly:
from reasonblocks import assign_arm

arm = assign_arm("cust-acme-q2", "t-1", on_fraction=0.5)  # -> "on" | "off"

The report

  • arms.on / arms.off (object): per-arm rollup with n_runs, outcomes (success/failure/other/unfinished), success_rate, tokens_per_run (median + winsorized mean), steps_per_run, cost (input/output/cache-read tokens + cost_per_run_usd), and reasoning_health.
  • comparison (object): the deltas. cost_per_run_usd and tokens_per_run_median (ON vs OFF + pct), and success_rate with a Newcombe ci_95 plus success_rate_stratified (inverse-variance across task_profile).
  • srm (object): sample-ratio-mismatch check with observed vs expected split, chi-square, p_value, flagged.
  • timeseries (array): per-day, per-arm cost + accuracy, i.e. the learning curve.
  • caveats (array): auto-generated warnings (wide CI, low token-split coverage, degenerate strata, non-stationarity).
A trimmed real response:
{
  "arms": {
    "on":  {"n_runs": 6, "success_rate": 0.83,
            "tokens_per_run": {"median": 1996},
            "cost": {"cache_read_tokens": 14400, "cost_per_run_usd": 0.0101}},
    "off": {"n_runs": 6, "success_rate": 0.83,
            "tokens_per_run": {"median": 12600},
            "cost": {"cache_read_tokens": 0, "cost_per_run_usd": 0.063}}
  },
  "comparison": {
    "cost_per_run_usd":      {"on": 0.0101, "off": 0.063, "pct": -0.84},
    "tokens_per_run_median": {"on": 1996,  "off": 12600, "pct": -0.84},
    "success_rate": {"on": 0.83, "off": 0.83, "delta": 0.0,
                     "ci_95": [-0.42, 0.42], "method": "newcombe"}
  },
  "srm": {"observed": {"on": 6, "off": 6}, "p_value": 1.0, "flagged": false},
  "timeseries": [{"day": "2026-05-19", "arm": "on", "tokens_per_run": 1996, "success_rate": 1.0}],
  "caveats": ["Accuracy CI is wide (±42%); N is too small to bound a tight margin ..."]
}
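To see where a ci_95 of roughly ±0.42 comes from at this sample size, the sketch below computes a Newcombe interval for a difference of two proportions from per-arm Wilson score intervals. It uses the uncorrected hybrid-score form; whether rb-api uses exactly this variant is an assumption, as is reading success_rate 0.83 with n_runs 6 as 5 successes out of 6.

```python
import math

Z_95 = 1.959964  # two-sided 95% normal quantile


def wilson_interval(successes: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval for a single proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def newcombe_diff_ci(s_on: int, n_on: int, s_off: int, n_off: int,
                     z: float = Z_95) -> tuple[float, float, float]:
    """Return (delta, lower, upper) for p_on - p_off, without continuity correction."""
    p_on, p_off = s_on / n_on, s_off / n_off
    l_on, u_on = wilson_interval(s_on, n_on, z)
    l_off, u_off = wilson_interval(s_off, n_off, z)
    delta = p_on - p_off
    lower = delta - math.sqrt((p_on - l_on) ** 2 + (u_off - p_off) ** 2)
    upper = delta + math.sqrt((u_on - p_on) ** 2 + (p_off - l_off) ** 2)
    return delta, lower, upper


# Assumed counts behind the sample response: 5/6 successes in each arm.
delta, lower, upper = newcombe_diff_ci(5, 6, 5, 6)
print(f"delta={delta:.2f} ci_95=[{lower:.2f}, {upper:.2f}]")
```

With six runs per arm the interval spans about 84 percentage points, which is why the sample report attaches a wide-CI caveat even though the two success rates are identical.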

Reading the numbers

  • Cost / tokens are the headline — objective and hard to dispute. Cost is priced from the per-step input/output/cache-read split per the model actually used, so model routing and prompt-cache effects are both credited.
  • success_rate is the guardrail, not a win condition. Frame it as non-inferiority (“accuracy didn’t regress”), and read the CI width before claiming “no change” — a wide interval at small N means “not enough data”, not “equal”.
  • srm.flagged usually means a bug (the on path dropping runs before they’re tagged), not bad luck — investigate before trusting the rest.
  • timeseries is a learning curve: distillation runs on both arms, so the on library grows during the window. Treat the earliest (cold-library) buckets as the stationary baseline.
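The reading order above (check SRM first, then treat accuracy as a non-inferiority guardrail) can be sketched as follows. This is an illustration, not the server's logic: the function names, the 0.001 SRM threshold, and the 5-point non-inferiority margin are all assumptions you would replace with your own choices.

```python
import math


def srm_check(n_on: int, n_off: int, on_fraction: float, alpha: float = 0.001) -> dict:
    """Chi-square goodness-of-fit test (1 degree of freedom) on the arm counts."""
    total = n_on + n_off
    if total == 0 or not 0.0 < on_fraction < 1.0:
        raise ValueError("need at least one run and 0 < on_fraction < 1")
    expected_on = total * on_fraction
    expected_off = total - expected_on
    chi_square = ((n_on - expected_on) ** 2 / expected_on
                  + (n_off - expected_off) ** 2 / expected_off)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return {"chi_square": chi_square, "p_value": p_value, "flagged": p_value < alpha}


def read_report(report: dict, margin: float = 0.05) -> str:
    """Return a coarse verdict from the srm and comparison sections of a report."""
    if report["srm"]["flagged"]:
        return "investigate SRM first"
    ci_low, _ci_high = report["comparison"]["success_rate"]["ci_95"]
    # Non-inferiority holds only if the whole interval sits above -margin.
    if ci_low >= -margin:
        return "non-inferior"
    return "inconclusive"


# Values taken from the trimmed sample response.
sample = {
    "srm": {"flagged": False},
    "comparison": {"success_rate": {"ci_95": [-0.42, 0.42]}},
}
verdict = read_report(sample)        # "inconclusive": the CI is far wider than the margin
balanced = srm_check(6, 6, 0.5)      # chi_square 0.0, p_value 1.0, not flagged
lopsided = srm_check(700, 300, 0.5)  # a 70/30 split on a 50/50 design is flagged
```

The point of the ordering is that a flagged SRM makes every other number suspect, so the accuracy interval is only consulted once the split looks healthy.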
Two honest limitations:
  • The accuracy label is the self-reported outcome. For a defensible non-inferiority claim, validate a blind sample of labels and keep outcome mechanical.
  • Token-saving compression isn’t in ab_middleware by default. To A/B the full code-review stack, attach TokenSavingMiddleware / GeneralMonitorMiddleware to the on arm only via mw.arm — see A/B test this stack (requires reasonblocks>=0.2.0). On short, clean tasks the on arm adds retrieval/steering text with no compression payoff and can cost more; the win shows on long, messy trajectories.
