Run an A/B evaluation

ab_middleware() flips a deterministic coin per run and returns either the full ReasonBlocks pipeline (the on arm) or a vanilla agent (the off arm). Both arms stream telemetry, so you can pull one report that compares them side by side: token/cost deltas, task-accuracy non-inferiority, a sample-ratio check, and a per-day learning curve.

Use this for a head-to-head evaluation (“does ReasonBlocks help, and by how much?”). For normal production use, reach for middleware() directly.

How the two arms differ

Capability	`on` (full ReasonBlocks)	`off` (vanilla control)
E1/E2/E3 retrieval	✅	—
Monitor steering	✅	—
Model routing	✅ (if configured)	—
System-prompt rewrite	✅	— (true passthrough)
Live telemetry	✅	✅
Run scored server-side	✅	✅

The off arm is a true passthrough — the model request is left untouched, so the control isn’t quietly getting ReasonBlocks’ prompt-cache optimization. It still streams telemetry, so every control run is scored and recorded just like the treatment arm.

Run the experiment

Disable the intervention cap for the eval window

Set INTERVENTION_CAP_ENABLED=false on rb-api. Otherwise a free-tier org that crosses its monthly cap mid-experiment has its on arm silently downgraded to vanilla — corrupting the comparison with no error.

Experiment-tagged runs are excluded from the billing meter regardless of this flag, but the cap gate (which zeroes retrieval when an org is over quota) only lifts when INTERVENTION_CAP_ENABLED=false.

Route each run through ab_middleware()

Call it once per run with a stable experiment_id and a per-unit unit_id. Everything else mirrors middleware(). This is a complete, runnable harness:

import re
from langchain.agents import create_agent
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from reasonblocks import ReasonBlocks

rb = ReasonBlocks(api_key="rb_live_...")
model = ChatAnthropic(model="claude-haiku-4-5-20251001", max_tokens=512)

@tool
def calc(expression: str) -> str:
    """Evaluate a simple arithmetic expression, e.g. '17 * 23'."""
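    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression or ""):
        return "invalid expression"
    try:
        # Builtins are stripped, and the regex above admits only digits and operators.
        return str(eval(expression, {"__builtins__": {}}, {}))
    except (SyntaxError, ArithmeticError, TypeError):
        # e.g. "1/0" or "((" pass the regex but cannot be evaluated.
        return "invalid expression"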

# A fixed eval set with checkable answers -> a mechanical outcome label.
TASKS = [("t-1", "compute 17 * 23", "391"), ("t-2", "compute 256 + 389", "645")]

for task_id, prompt, expected in TASKS:
    mw = rb.ab_middleware(
        experiment_id="cust-acme-q2",
        unit_id=task_id,            # stable id -> retries stay in the same arm
        on_fraction=0.5,            # 50/50 split
        agent_name="calc-agent",
        task=prompt,
        org_id="acme",
    )
    agent = create_agent(
        model=model,
        tools=[calc],
        system_prompt="You are a precise calculator. Use the calc tool, then state the result.",
        middleware=[mw],
    )
    outcome = "failure"
    try:
        result = agent.invoke({"messages": [("user", prompt)]})
        final = next(
            (m.content for m in reversed(result["messages"])
             if getattr(m, "content", None) and not getattr(m, "tool_calls", None)),
            "",
        )
        outcome = "success" if expected in str(final) else "failure"
    finally:
        mw.flush_session(outcome_status=outcome)   # records the run's outcome
        mw.close(timeout=10)

Pass a stable unit_id (a task/ticket id, not a fresh per-attempt run id). Assignment is hash(experiment_id, unit_id), so a retried task lands in the same arm and can’t contaminate the comparison. Uniform hashing also keeps any sub-population — one repo, one task type — split at ~on_fraction for free.
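The deterministic assignment described above can be sketched roughly as follows. This is an illustration only: the hash function (SHA-256), the key format, and the name sketch_assign_arm are assumptions, not the SDK's actual implementation, so arms computed here will not necessarily match what assign_arm returns.

```python
import hashlib


def sketch_assign_arm(experiment_id: str, unit_id: str, on_fraction: float = 0.5) -> str:
    """Illustrative hash-based assignment (NOT the SDK's real algorithm)."""
    # Assumed key format; the real SDK may combine the ids differently.
    key = f"{experiment_id}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep the top 53 bits so the result is an exact float in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return "on" if bucket < on_fraction else "off"


# Same inputs always give the same arm, so a retry cannot switch arms.
first = sketch_assign_arm("cust-acme-q2", "t-1")
retry = sketch_assign_arm("cust-acme-q2", "t-1")

# Over many units the split settles near on_fraction.
arms = [sketch_assign_arm("cust-acme-q2", f"t-{i}") for i in range(10_000)]
on_share = arms.count("on") / len(arms)
```

Because the arm depends only on the two ids, any layer that knows them can recompute it without shared state.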

Make outcome a mechanical function of run artifacts (a checked answer, tests passing, exit code) — not a judgment that can see the arm. That’s what makes the accuracy guardrail credible.
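One way to keep the label mechanical is to isolate it in a function that sees only the run's artifacts and never the arm. The sketch below is a stricter variant of a plain substring check; label_outcome and its "last number in the answer" rule are illustrative assumptions, not part of the SDK.

```python
import re


def label_outcome(final_text: str, expected: str) -> str:
    """Return "success" only if the last number in the answer equals `expected`.

    The function takes no arm argument, so the label cannot depend on it.
    """
    # Pull out integer or decimal tokens, ignoring thousands separators.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", final_text.replace(",", ""))
    if not numbers:
        return "failure"
    return "success" if numbers[-1] == expected else "failure"
```

Unlike a substring test, this rejects an answer of 3910 when 391 is expected.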

Pull the report

curl -sS \
  "https://rb-api.reasonblocks.com/v1/monitor/experiments/cust-acme-q2/report?org_id=acme&on_fraction=0.5" \
  -H "Authorization: Bearer $RB_API_KEY"

on_fraction is your configured ON probability — used only for the sample-ratio-mismatch (SRM) check. The report is computed live from telemetry on every call; nothing is cached.
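The same request can be made from Python. This is a minimal sketch using only the standard library: the URL, query parameters, and bearer header come from the curl call above, while report_url, fetch_report, and reading the key from the RB_API_KEY environment variable are illustrative choices.

```python
import json
import os
import urllib.parse
import urllib.request

BASE_URL = "https://rb-api.reasonblocks.com/v1/monitor/experiments"


def report_url(experiment_id: str, org_id: str, on_fraction: float) -> str:
    """Build the report URL with the same query parameters as the curl call."""
    query = urllib.parse.urlencode({"org_id": org_id, "on_fraction": on_fraction})
    return f"{BASE_URL}/{urllib.parse.quote(experiment_id, safe='')}/report?{query}"


def fetch_report(experiment_id: str, org_id: str, on_fraction: float) -> dict:
    """GET the report and decode the JSON body (performs a network call)."""
    request = urllib.request.Request(
        report_url(experiment_id, org_id, on_fraction),
        headers={"Authorization": f"Bearer {os.environ['RB_API_KEY']}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


url = report_url("cust-acme-q2", "acme", 0.5)
```

Calling fetch_report("cust-acme-q2", "acme", 0.5) should return the report as a dict, assuming RB_API_KEY is set and the endpoint is reachable.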

The middleware stamps experiment_id, arm, assignment_unit, and rb_version onto the run row, and rb-api treats experiment_id + arm as immutable once set — a resume or retry can’t relabel a run. If you only need the arm decision (to route at a different layer), call the assignment function directly:

from reasonblocks import assign_arm

arm = assign_arm("cust-acme-q2", "t-1", on_fraction=0.5)  # -> "on" | "off"

The report

arms (object)

Per-arm rollup: n_runs, outcomes (success/failure/other/unfinished), success_rate, tokens_per_run (median + winsorized mean), steps_per_run, cost (input/output/cache-read tokens + cost_per_run_usd), and reasoning_health.

comparison (object)

The deltas: cost_per_run_usd and tokens_per_run_median (ON vs OFF + pct), and success_rate with a Newcombe ci_95 plus success_rate_stratified (inverse-variance across task_profile).

srm (object)

Sample-ratio-mismatch check: observed vs expected split, chi-square, p_value, flagged.

timeseries (array)

Per-day, per-arm cost and accuracy (the learning curve).

caveats (array)

Auto-generated warnings (wide CI, low token-split coverage, degenerate strata, non-stationarity).
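For intuition, a sample-ratio check of this kind can be sketched as a one-degree-of-freedom chi-square goodness-of-fit test. This is the textbook test, not necessarily rb-api's exact implementation; srm_check and the 0.001 flag threshold are assumptions.

```python
import math


def srm_check(n_on: int, n_off: int, on_fraction: float, alpha: float = 0.001) -> dict:
    """Chi-square goodness-of-fit test of the observed split against on_fraction.

    Assumes 0 < on_fraction < 1 and at least one run in total.
    """
    total = n_on + n_off
    expected_on = total * on_fraction
    expected_off = total * (1.0 - on_fraction)
    chi_square = (
        (n_on - expected_on) ** 2 / expected_on
        + (n_off - expected_off) ** 2 / expected_off
    )
    # With one degree of freedom the chi-square survival function reduces to erfc.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return {"chi_square": chi_square, "p_value": p_value, "flagged": p_value < alpha}


balanced = srm_check(6, 6, on_fraction=0.5)      # an even 6/6 split
skewed = srm_check(600, 400, on_fraction=0.5)    # a split that should raise a flag
```

An even split gives a chi-square of 0 and a p-value of 1.0, while a 600/400 split at a configured 50/50 is flagged.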

A trimmed real response:

{
  "arms": {
    "on":  {"n_runs": 6, "success_rate": 0.83,
            "tokens_per_run": {"median": 1996},
            "cost": {"cache_read_tokens": 14400, "cost_per_run_usd": 0.0101}},
    "off": {"n_runs": 6, "success_rate": 0.83,
            "tokens_per_run": {"median": 12600},
            "cost": {"cache_read_tokens": 0, "cost_per_run_usd": 0.063}}
  },
  "comparison": {
    "cost_per_run_usd":      {"on": 0.0101, "off": 0.063, "pct": -0.84},
    "tokens_per_run_median": {"on": 1996,  "off": 12600, "pct": -0.70},
    "success_rate": {"on": 0.83, "off": 0.83, "delta": 0.0,
                     "ci_95": [-0.42, 0.42], "method": "newcombe"}
  },
  "srm": {"observed": {"on": 6, "off": 6}, "p_value": 1.0, "flagged": false},
  "timeseries": [{"day": "2026-05-19", "arm": "on", "tokens_per_run": 1996, "success_rate": 1.0}],
  "caveats": ["Accuracy CI is wide (±42%); N is too small to bound a tight margin ..."]
}
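The ci_95 in this response can be reproduced with Newcombe's hybrid score interval, which combines a Wilson interval for each arm. The sketch below is the standard formula and assumes the 0.83 success rates correspond to 5 successes out of 6 runs; the function names are illustrative, and rb-api's own code may differ in detail.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a single proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def newcombe_diff_ci(s_on: int, n_on: int, s_off: int, n_off: int) -> tuple[float, float]:
    """95% CI for (on success rate - off success rate), Newcombe's hybrid score method."""
    p_on, p_off = s_on / n_on, s_off / n_off
    l_on, u_on = wilson_interval(s_on, n_on)
    l_off, u_off = wilson_interval(s_off, n_off)
    delta = p_on - p_off
    lower = delta - math.sqrt((p_on - l_on) ** 2 + (u_off - p_off) ** 2)
    upper = delta + math.sqrt((u_on - p_on) ** 2 + (p_off - l_off) ** 2)
    return lower, upper


low, high = newcombe_diff_ci(5, 6, 5, 6)
print(round(low, 2), round(high, 2))  # -0.42 0.42
```

With only six runs per arm, the interval spans 42 points in each direction, which is why the response carries a wide-CI caveat.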

Reading the numbers

Cost / tokens are the headline: objective and hard to dispute. Cost is priced from each step's input/output/cache-read token split at the rates of the model actually used, so model routing and prompt-cache effects are both credited.
success_rate is the guardrail, not a win condition. Frame it as non-inferiority (“accuracy didn’t regress”), and read the CI width before claiming “no change” — a wide interval at small N means “not enough data”, not “equal”.
srm.flagged usually means a bug (the on path dropping runs before they’re tagged), not bad luck — investigate before trusting the rest.
timeseries is a learning curve: distillation runs on both arms, so the on library grows during the window. Treat the earliest (cold-library) buckets as the stationary baseline.
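The reading order above can be sketched as a small triage function: check SRM first, then compare the lower bound of the accuracy CI against a non-inferiority margin, and only then quote the cost delta. The 5-point margin and the summarize helper are assumptions for illustration; the field names follow the trimmed response.

```python
def summarize(report: dict, margin: float = 0.05) -> str:
    """Turn a report into a one-line verdict (margin is a hypothetical choice)."""
    if report["srm"]["flagged"]:
        return "invalid: sample-ratio mismatch, investigate before reading further"

    low, _high = report["comparison"]["success_rate"]["ci_95"]
    cost_pct = report["comparison"]["cost_per_run_usd"]["pct"]
    if low < -margin:
        # The interval still allows a regression larger than the margin.
        return f"inconclusive on accuracy (CI lower bound {low:+.2f}); cost {cost_pct:+.0%}"
    return f"non-inferior within {margin:.0%}; cost {cost_pct:+.0%}"


# Values copied from the trimmed response.
sample = {
    "srm": {"flagged": False},
    "comparison": {
        "cost_per_run_usd": {"pct": -0.84},
        "success_rate": {"ci_95": [-0.42, 0.42]},
    },
}
verdict = summarize(sample)
```

On the trimmed response this yields an "inconclusive on accuracy" verdict: the cost delta is large, but the interval is too wide to rule out a regression.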

Two honest limitations:

The accuracy label is the self-reported outcome. For a defensible non-inferiority claim, validate a blind sample of labels and keep outcome mechanical.
Token-saving compression isn’t in ab_middleware by default. To A/B the full code-review stack, attach TokenSavingMiddleware / GeneralMonitorMiddleware to the on arm only via mw.arm — see A/B test this stack (requires reasonblocks>=0.2.0). On short, clean tasks the on arm adds retrieval/steering text with no compression payoff and can cost more; the win shows on long, messy trajectories.
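Validating a blind sample of labels can be as simple as drawing a seeded random subset of runs and stripping the arm before a reviewer sees them. This is a minimal sketch; the record shape (run_id, arm, outcome) and blind_sample are hypothetical, not an SDK or rb-api structure.

```python
import random


def blind_sample(runs: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Pick k runs reproducibly and drop the arm field for blind review."""
    rng = random.Random(seed)
    chosen = rng.sample(runs, min(k, len(runs)))
    return [{key: value for key, value in run.items() if key != "arm"} for run in chosen]


# Hypothetical run records, for illustration only.
runs = [
    {"run_id": f"r-{i}", "arm": "on" if i % 2 else "off", "outcome": "success"}
    for i in range(20)
]
for_review = blind_sample(runs, k=5)
```

Fixing the seed makes the audit sample reproducible, so a second reviewer can check the same runs.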

Related

ab_middleware() reference

Parameters and the on/off lifecycle.

Reduce token usage

Add compression to the ON bundle for the full cost story.
