Health monitors for agent trajectories

ReasonBlocks ships six health monitors that evaluate your agent’s accumulated step trace on every scored step. Each monitor produces a score between 0 and 1, where 1 is the worst possible health, and a weighted sum of the six scores gives a single composite health score. When a monitor’s individual score reaches 0.6 (the default fire threshold), the monitor is considered “fired,” and its name appears in the monitors_fired list for that step.

Monitor results gate E1 retrieval: when any monitor fires, or when the composite score exceeds 0.15, ReasonBlocks queries your pattern library for instance-level guidance. This keeps pattern-store lookups off healthy steps where no intervention is needed.

The six monitors

streak — same action repeated consecutively (weight: 0.35)

The streak monitor detects when the agent calls the same tool back-to-back. It walks the trace in reverse and counts the length of the current run of identical actions, ignoring steps that produced no action.

Score: min(run_length / 5, 1.0). A streak of 1 (no repetition) scores 0.0. A streak of 5 or more scores 1.0.

Fires at: score ≥ 0.6, which corresponds to 3 or more consecutive identical actions.

Why it matters: Repeated identical actions with different inputs are normal (e.g., reading multiple files). But the same action with the same or similar inputs usually means the agent is stuck in a loop and not making progress.

This monitor carries the highest weight (0.35) because action loops are the most common and most costly failure pattern in long-running agents.
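The run-length walk described above can be sketched as follows. This is a minimal illustration, not the ReasonBlocks implementation: the function name `streak_score` and the representation of the trace as a list of action names (with `None` for steps that produced no action) are assumptions. The text gives both the formula min(run_length / 5, 1.0) and a score of 0.0 for a streak of 1, so the sketch treats a run of 1 as a special case; that reading is also an assumption.

```python
from typing import Optional, Sequence


def streak_score(actions: Sequence[Optional[str]], cap: int = 5) -> float:
    """Score the run of identical actions at the end of the trace (hypothetical helper)."""
    run_length = 0
    current: Optional[str] = None
    # Walk the trace in reverse, skipping steps that produced no action.
    for action in reversed(actions):
        if action is None:
            continue
        if current is None:
            current = action  # the most recent action defines the run
        elif action != current:
            break  # the run ends at the first different action
        run_length += 1
    # Assumption: a run of 0 or 1 means no repetition and scores 0.0.
    if run_length <= 1:
        return 0.0
    return min(run_length / cap, 1.0)
```

With this sketch, a trace ending in three `grep` calls (even with an action-less step in between) scores 0.6 and reaches the default fire threshold.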

call_count — tool-call budget consumed (weight: 0.15)

The call_count monitor tracks how many tool calls the agent has made relative to a budget. It counts every step that produced an action.

Score: min(total_tool_calls / 20, 1.0). The score rises linearly from 0.0 at zero calls to 1.0 at 20 calls (the default budget).

Fires at: score ≥ 0.6, which corresponds to 12 or more tool calls.

Why it matters: A high call count on its own is not a failure, but in combination with other monitors it signals that the agent has consumed a large share of its budget without finishing. The composite score picks this up even when no single monitor is dominant.

The budget defaults to 20 calls. For agents that are expected to make many tool calls, you can adjust the monitor weights when evaluating the composite score.
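As a quick check of the arithmetic above, here is a small sketch of the budget ratio. The function name `call_count_score` and the list-of-action-names trace are illustrative assumptions, not part of the ReasonBlocks API.

```python
from typing import Optional, Sequence


def call_count_score(actions: Sequence[Optional[str]], budget: int = 20) -> float:
    """Share of the tool-call budget consumed, capped at 1.0 (hypothetical helper)."""
    # Only steps that produced an action count as tool calls.
    total_tool_calls = sum(1 for action in actions if action is not None)
    return min(total_tool_calls / budget, 1.0)
```

Under the default budget of 20, eleven calls score 0.55 and stay below the fire threshold, while the twelfth call brings the score to 0.6.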

edit_revert — edits that undo each other (weight: 0.15)

The edit_revert monitor detects two patterns of file-editing thrash:

Content revert: The most recent edit’s content is more similar to the edit before the previous one than to the previous edit itself — meaning the agent has effectively undone its prior change.
Edit-fail-edit cycles: The agent edits a file, receives an error observation, and then edits the same file again — repeating this cycle two or more times on the same path.

Score: 1.0 on a detected revert or two or more fail-edit cycles on the same file path; 0.0 on a clean edit history.

Fires at: any detected revert or thrash cycle scores directly at 1.0, which is above the 0.6 threshold.

Why it matters: Edit thrashing wastes tool calls and often indicates the agent is chasing a misdiagnosed root cause. Catching it early allows E-trace guidance to redirect the agent toward a different strategy.

Edit-like tools recognized by default: edit, write, str_replace, str_replace_editor, patch, apply_patch, create_file, overwrite.
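The two thrash patterns can be sketched roughly as below. Everything here is an assumption made for illustration: the `EditStep` record, the `failed` flag standing in for an error observation, the use of `difflib.SequenceMatcher` as the similarity measure, and the choice to compare only edits to the same path as the latest edit. The source does not say how ReasonBlocks measures similarity.

```python
from collections import Counter
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List

EDIT_TOOLS = {"edit", "write", "str_replace", "str_replace_editor",
              "patch", "apply_patch", "create_file", "overwrite"}


@dataclass
class EditStep:  # hypothetical step record
    action: str
    path: str
    content: str
    failed: bool = False  # True if the step's observation was an error


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def edit_revert_score(trace: List[EditStep]) -> float:
    edits = [s for s in trace if s.action in EDIT_TOOLS]
    if not edits:
        return 0.0
    # Content revert: the latest edit resembles the edit before the previous
    # one more than it resembles the previous edit itself.
    same_file = [e for e in edits if e.path == edits[-1].path]
    if len(same_file) >= 3:
        older, previous, latest = same_file[-3:]
        if similarity(latest.content, older.content) > similarity(latest.content, previous.content):
            return 1.0
    # Edit-fail-edit: a failed edit followed by another edit to the same path.
    cycles = Counter()
    for earlier, later in zip(edits, edits[1:]):
        if earlier.failed and earlier.path == later.path:
            cycles[earlier.path] += 1
    return 1.0 if any(count >= 2 for count in cycles.values()) else 0.0
```

Because the score is binary, any detection lands directly above the 0.6 fire threshold, which matches the behavior described above.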

test_repeat — same test failure repeating (weight: 0.15)

The test_repeat monitor detects when the agent runs a test, receives a failure, and then runs the same test again with the same failure output, without making a substantive change in between.

It normalizes error messages before comparing them, stripping volatile fragments such as temporary file paths, UUIDs, memory addresses, PIDs, timestamps, and large numbers. This ensures the same underlying failure compares equal across re-runs even when the surrounding output differs.

Score: 1.0 when the same normalized failure signature appears two or more times consecutively; 0.0 otherwise.

Fires at: any repeated test failure scores directly at 1.0.

Why it matters: Running the same failing test repeatedly without making a code change is a clear sign the agent has not understood what the test failure is telling it. Because of the normalization, minor output variation between runs does not mask a repeated failure.

Test steps are identified by tool names (pytest, test, run_tests, npm test, cargo test) or by error keywords in the observation text.
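The normalization step is the part most easily shown in code. The sketch below is an approximation under stated assumptions: the regular expressions, the placeholder tokens, and the names `normalize_failure` and `repeat_failure_score` are all hypothetical, and the check for a substantive change between runs is left out.

```python
import re
from typing import Sequence

# Assumed patterns for volatile fragments; order matters (specific before generic).
_VOLATILE = [
    (re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"), "<uuid>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<addr>"),
    (re.compile(r"/tmp/\S+"), "<tmp>"),
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?"), "<ts>"),
    (re.compile(r"\d{4,}"), "<num>"),  # PIDs and other large numbers
]


def normalize_failure(text: str) -> str:
    """Replace volatile fragments with placeholders and collapse whitespace."""
    for pattern, placeholder in _VOLATILE:
        text = pattern.sub(placeholder, text)
    return " ".join(text.split())


def repeat_failure_score(failures: Sequence[str]) -> float:
    """failures: failure outputs of consecutive test runs, oldest first."""
    if len(failures) < 2:
        return 0.0
    same = normalize_failure(failures[-1]) == normalize_failure(failures[-2])
    return 1.0 if same else 0.0
```

Two runs that differ only in a memory address, a temporary directory, a PID, and a timestamp reduce to the same signature, so the second run scores 1.0.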

diversity — collapsed tool exploration (weight: 0.10)

The diversity monitor detects when the agent’s recent tool usage has collapsed to two or fewer distinct tools across a window of five consecutive calls, after the run has accumulated at least eight total tool calls.

Score:

1.0 if only one tool has been used in the last 5 calls.
0.7 if exactly two tools have been used but they are not healthily alternating (an a-b-a-b-a pattern is considered intentional and is excluded).
0.0 if three or more distinct tools appear in the window, or if the run has not yet reached the minimum call count.

Fires at: score ≥ 0.6, which covers both the single-tool (1.0) and non-alternating two-tool (0.7) cases.

Why it matters: A healthy agent explores its tool set as the task progresses. Collapsing to a single repeated tool, especially after many calls, often means the agent has narrowed its strategy in an unproductive way.
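The three scoring cases above map onto a short function. This is a sketch under assumptions: the name `diversity_score` and the list-of-tool-names input are illustrative, and the a-b-a-b-a exclusion is read as "every adjacent pair in the window differs", scoring 0.0.

```python
from typing import Sequence


def diversity_score(tools: Sequence[str], window: int = 5, min_calls: int = 8) -> float:
    """Score collapsed tool usage over the most recent calls (hypothetical helper)."""
    # The monitor stays silent until the run has enough tool calls.
    if len(tools) < min_calls:
        return 0.0
    recent = list(tools[-window:])
    distinct = set(recent)
    if len(distinct) == 1:
        return 1.0
    if len(distinct) == 2:
        # Healthy alternation (a-b-a-b-a): no two adjacent calls are the same.
        alternating = all(a != b for a, b in zip(recent, recent[1:]))
        return 0.0 if alternating else 0.7
    return 0.0
```

For example, a window of `a, a, b, b, a` scores 0.7 and fires, while `a, b, a, b, a` is treated as intentional and scores 0.0.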

hedge — rising hedging and retraction density (weight: 0.10)

The hedge monitor tracks whether the agent’s language is becoming increasingly uncertain or self-contradictory across the run. It operates on the thought text of each step.

It measures two things:

Retraction phrases: Explicit reversals such as “I was wrong”, “never mind”, “disregard that”, or “the bug is actually not…”. Any retraction anywhere in the trace scores 1.0 immediately.
Hedging density ratio: The density of hedging words (e.g., “maybe”, “perhaps”, “might”, “not sure”, “on second thought”) in the later half of the run compared to the early half. A ratio above 2× scores above 0.0; a ratio at or above 4× scores 1.0.

Score: 1.0 on any retraction; a continuous value based on the late/early hedge ratio otherwise.

Fires at: score ≥ 0.6.

Why it matters: Rising hedging language is a leading indicator that the agent is losing confidence in its approach. Catching this early, before the agent backtracks through many steps, can save significant tool-call budget.
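One possible reading of the hedge scoring is sketched below. Several details are assumptions not confirmed by the text: the phrase lists are limited to the examples quoted above, density is measured as hedge hits per word, the score rises linearly between the 2× and 4× ratios, and an early half with no hedging at all is treated as a maximal rise. The names `hedge_density` and `hedge_score` are hypothetical.

```python
import re
from typing import Sequence

RETRACTIONS = ("i was wrong", "never mind", "disregard that")
HEDGES = ("maybe", "perhaps", "might", "not sure", "on second thought")


def hedge_density(thoughts: Sequence[str]) -> float:
    """Hedging phrases per word across the given thoughts."""
    text = " ".join(thoughts).lower()
    words = len(text.split())
    if words == 0:
        return 0.0
    hits = sum(len(re.findall(r"\b" + re.escape(h) + r"\b", text)) for h in HEDGES)
    return hits / words


def hedge_score(thoughts: Sequence[str]) -> float:
    lowered = [t.lower() for t in thoughts]
    # Any retraction anywhere in the trace scores 1.0 immediately.
    if any(phrase in t for t in lowered for phrase in RETRACTIONS):
        return 1.0
    mid = len(thoughts) // 2
    early = hedge_density(thoughts[:mid])
    late = hedge_density(thoughts[mid:])
    if late == 0.0:
        return 0.0
    if early == 0.0:
        return 1.0  # assumption: hedging appeared from nothing
    ratio = late / early
    # Assumed linear ramp: 0.0 at a 2x ratio, 1.0 at a 4x ratio or above.
    return min(max((ratio - 2.0) / 2.0, 0.0), 1.0)
```

Under this ramp, a late/early ratio of 3× scores about 0.5 and does not fire, while a ratio of roughly 3.2× or more reaches the 0.6 threshold.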

Composite score and the fire threshold

After running all six monitors, evaluate_all computes a weighted sum:

composite = 0.35 × streak
          + 0.15 × call_count
          + 0.15 × edit_revert
          + 0.15 × test_repeat
          + 0.10 × diversity
          + 0.10 × hedge

The composite score is between 0 and 1. Individual monitors fire independently when their own score reaches the DEFAULT_FIRE_THRESHOLD of 0.6. The composite score is used separately by the E1 gate (threshold: 0.15) and surfaced in the ReasonBlocks dashboard.
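To make the weighted sum concrete, here is a small sketch using the weights and DEFAULT_FIRE_THRESHOLD stated above. The helper names `composite_score` and `fired_monitors`, and the dictionary of per-monitor scores, are illustrative assumptions rather than the signature of evaluate_all.

```python
from typing import Dict, List

WEIGHTS = {
    "streak": 0.35,
    "call_count": 0.15,
    "edit_revert": 0.15,
    "test_repeat": 0.15,
    "diversity": 0.10,
    "hedge": 0.10,
}
DEFAULT_FIRE_THRESHOLD = 0.6


def composite_score(scores: Dict[str, float]) -> float:
    """Weighted sum of monitor scores; missing monitors count as 0.0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


def fired_monitors(scores: Dict[str, float],
                   threshold: float = DEFAULT_FIRE_THRESHOLD) -> List[str]:
    """Names of monitors whose individual score reaches the fire threshold."""
    return [name for name in WEIGHTS if scores.get(name, 0.0) >= threshold]


# Two moderately elevated monitors, neither of which fires on its own.
example = {"streak": 0.4, "call_count": 0.5}
print(round(composite_score(example), 3), fired_monitors(example))
```

The example illustrates why the composite is tracked separately: a streak of 0.4 and a call_count of 0.5 fire nothing individually, yet the composite of 0.215 already exceeds the 0.15 E1 gate.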

How monitor results gate E1 retrieval

When any monitor fires or the composite score exceeds 0.15, E1 injection is permitted to query your pattern library for instance-level guidance. When neither condition is met, the E1 query is skipped entirely. The gate also has a two-step lookback: if any monitor fired on either of the two previous steps, E1 remains open on the current step. This provides continuity of guidance across consecutive at-risk steps.

E1 allowed if:
  any monitor fired this step
  OR composite score this step > 0.15
  OR any monitor fired on the previous 2 steps
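The gate conditions above can be expressed as a single predicate. This is a sketch only: the function name `e1_allowed` and the shape of `fired_history` (one list of fired monitor names per step, oldest first, current step last) are assumptions made for illustration.

```python
from typing import List, Sequence


def e1_allowed(fired_history: Sequence[List[str]], composite: float,
               lookback: int = 2, composite_gate: float = 0.15) -> bool:
    """Decide whether E1 retrieval may run on the current step (hypothetical helper)."""
    current = fired_history[-1] if fired_history else []
    # The `lookback` steps immediately before the current one.
    previous = fired_history[-(lookback + 1):-1]
    return (
        bool(current)                     # any monitor fired this step
        or composite > composite_gate     # composite strictly exceeds the gate
        or any(previous)                  # any monitor fired in the lookback window
    )
```

A monitor that fired two steps ago keeps the gate open on a quiet step; one that fired three steps ago does not.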

Monitor injection text is also subject to cooldowns and a per-run cap (5 injections maximum) to prevent over-injection on long runs. The cooldown is shorter when the FSM is in SLOW or SKIP state (every 2 steps) and longer in FAST state (every 5 steps).
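The cooldown and cap can be pictured with the sketch below. It assumes that "every N steps" means at least N steps must have passed since the last injection, and that the first injection of a run is not subject to a cooldown; the names `may_inject`, `COOLDOWN_STEPS`, and `MAX_INJECTIONS_PER_RUN` are hypothetical.

```python
from typing import Optional

# Cooldown lengths per FSM state, taken from the text above.
COOLDOWN_STEPS = {"SLOW": 2, "SKIP": 2, "FAST": 5}
MAX_INJECTIONS_PER_RUN = 5


def may_inject(step: int, fsm_state: str,
               last_injection_step: Optional[int], injections_so_far: int) -> bool:
    """Check the per-run cap and the state-dependent cooldown (hypothetical helper)."""
    if injections_so_far >= MAX_INJECTIONS_PER_RUN:
        return False  # per-run cap reached
    if last_injection_step is None:
        return True  # assumption: no cooldown before the first injection
    return step - last_injection_step >= COOLDOWN_STEPS[fsm_state]
```

With these values, an injection at step 10 permits the next one at step 12 in SLOW or SKIP state, but not until step 15 in FAST state.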
