ReasonBlocks ships six health monitors that evaluate your agent’s accumulated step trace on every scored step. Each monitor produces a score between 0 and 1 — where 1 means maximum badness — and a weighted sum produces a single composite health score. When a monitor’s individual score reaches 0.6 (the default fire threshold), it is considered “fired,” and its name appears in the monitors_fired list for that step.
Monitor results gate E1 retrieval: when any monitor fires, or when the composite score exceeds 0.15, ReasonBlocks queries for instance-level guidance from your pattern library. This keeps pattern-store lookups off of healthy steps where no intervention is needed.
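The gating decision above can be sketched in a few lines. This is an illustrative implementation, not the actual ReasonBlocks API; the function and parameter names are assumptions.

```python
# Sketch of the E1 retrieval gate: query the pattern store when any monitor
# fires (score >= 0.6) or the weighted composite exceeds the gate threshold.
# Names here are illustrative, not the real ReasonBlocks API.

FIRE_THRESHOLD = 0.6       # per-monitor fire threshold (the default)
E1_GATE_THRESHOLD = 0.15   # composite threshold for a pattern-store lookup

def should_query_pattern_store(monitor_scores: dict, weights: dict) -> bool:
    """Return True when any monitor fires or the composite score exceeds the gate."""
    composite = sum(weights[name] * score for name, score in monitor_scores.items())
    any_fired = any(score >= FIRE_THRESHOLD for score in monitor_scores.values())
    return any_fired or composite > E1_GATE_THRESHOLD
```

Healthy steps fail both conditions, so no lookup is issued for them.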
The six monitors
streak — same action repeated consecutively (weight: 0.35)
The streak monitor detects when the agent calls the same tool back-to-back. It walks the trace in reverse and counts the length of the current run of identical actions, ignoring steps that produced no action.

Score: min(run_length / 5, 1.0). A streak of 1 (no repetition) scores 0.0. A streak of 5 or more scores 1.0.

Fires at: score ≥ 0.6, which corresponds to 3 or more consecutive identical actions.

Why it matters: Repeated identical actions with different inputs are normal (e.g., reading multiple files). But the same action with the same or similar inputs usually means the agent is stuck in a loop and not making progress.

This monitor carries the highest weight (0.35) because action loops are the most common and most costly failure pattern in long-running agents.
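A minimal sketch of the reverse walk, assuming each step is a dict with an optional "action" key (the real step shape in ReasonBlocks may differ):

```python
# Illustrative streak score: walk the trace in reverse, count the trailing
# run of identical actions, skip steps that produced no action.

def streak_score(trace: list) -> float:
    run_length = 0
    current = None
    for step in reversed(trace):
        action = step.get("action")
        if action is None:
            continue  # steps with no action are ignored
        if current is None:
            current = action
        if action != current:
            break  # the run of identical actions ends here
        run_length += 1
    return min(run_length / 5, 1.0)
```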
call_count — tool-call budget consumed (weight: 0.15)
The call_count monitor tracks how many tool calls the agent has made relative to a budget. It counts every step that produced an action.

Score: min(total_tool_calls / 20, 1.0). The score rises linearly from 0.0 at zero calls to 1.0 at 20 calls (the default budget).

Fires at: score ≥ 0.6, which corresponds to 12 or more tool calls.

Why it matters: A high call count on its own is not a failure, but in combination with other monitors it signals that the agent has consumed a large share of its budget without finishing. The composite score picks this up even when no single monitor is dominant.

The budget defaults to 20 calls. For agents that are expected to make many tool calls, you can adjust the monitor weights when evaluating the composite score.
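The score is a straight linear ramp; a sketch, assuming the same dict-shaped steps as above:

```python
# Illustrative call-count score: count every step that produced an action,
# then scale against the budget (default 20 per the docs above).

def call_count_score(trace: list, budget: int = 20) -> float:
    total_calls = sum(1 for step in trace if step.get("action") is not None)
    return min(total_calls / budget, 1.0)
```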
edit_revert — edits that undo each other (weight: 0.15)
The edit_revert monitor detects two patterns of file-editing thrash:
- Content revert: The most recent edit’s content is more similar to the edit before the previous one than to the previous edit itself — meaning the agent has effectively undone its prior change.
- Edit-fail-edit cycles: The agent edits a file, receives an error observation, and then edits the same file again — repeating this cycle two or more times on the same path.
Edit steps are identified by tool names: edit, write, str_replace, str_replace_editor, patch, apply_patch, create_file, overwrite.
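The content-revert check can be sketched with a standard-library similarity measure. This is an assumption about the mechanism, using difflib rather than whatever similarity metric ReasonBlocks actually ships:

```python
import difflib

# Hedged sketch of the content-revert check: the newest edit is compared
# against the previous edit and the one before that. If it resembles the
# edit two steps back more than the last one, the last change was undone.

def is_content_revert(edits: list) -> bool:
    if len(edits) < 3:
        return False
    latest, previous, before_previous = edits[-1], edits[-2], edits[-3]
    sim_to_before = difflib.SequenceMatcher(None, latest, before_previous).ratio()
    sim_to_previous = difflib.SequenceMatcher(None, latest, previous).ratio()
    return sim_to_before > sim_to_previous
```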
test_repeat — same test failure repeating (weight: 0.15)
The test_repeat monitor detects when the agent runs a test, receives a failure, and then runs the same test again with the same failure output — without making a substantive change in between.

It normalizes error messages before comparing them, stripping volatile fragments such as temporary file paths, UUIDs, memory addresses, PIDs, timestamps, and large numbers. This ensures the same underlying failure compares equal across re-runs even when the surrounding output differs.

Score: 1.0 when the same normalized failure signature appears two or more times consecutively; 0.0 otherwise.

Fires at: any repeated test failure scores directly at 1.0.

Why it matters: Running the same failing test repeatedly without making a code change is a clear sign the agent has not understood what the test failure is telling it. This monitor fires reliably on this pattern without false positives from minor output variation.

Test steps are identified by tool names (pytest, test, run_tests, npm test, cargo test) or by error keywords in the observation text.
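The normalization step can be sketched as a series of regex substitutions. The exact patterns ReasonBlocks uses are not documented here; these are illustrative stand-ins for the volatile fragments listed above:

```python
import re

# Hedged sketch of failure-signature normalization: replace volatile
# fragments (temp paths, UUIDs, memory addresses, timestamps, PIDs/large
# numbers) with stable tokens so re-runs of the same failure compare equal.

_VOLATILE = [
    (re.compile(r"/tmp/\S+"), "<path>"),                      # temp file paths
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-"
                r"[0-9a-f]{4}-[0-9a-f]{12}"), "<uuid>"),      # UUIDs
    (re.compile(r"0x[0-9a-fA-F]+"), "<addr>"),                # memory addresses
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "<ts>"),  # timestamps
    (re.compile(r"\b\d{4,}\b"), "<num>"),                     # PIDs, large numbers
]

def normalize_failure(text: str) -> str:
    for pattern, token in _VOLATILE:
        text = pattern.sub(token, text)
    return text
```

Two failure outputs that differ only in these fragments normalize to the same signature, so a repeat is detected without string-equality false negatives.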
diversity — collapsed tool exploration (weight: 0.10)
The diversity monitor detects when the agent’s recent tool usage has collapsed to two or fewer distinct tools across a window of five consecutive calls, after the run has accumulated at least eight total tool calls.

Score:
- 1.0 if only one tool has been used in the last 5 calls.
- 0.7 if exactly two tools have been used but they are not healthily alternating (an a-b-a-b-a pattern is considered intentional and is excluded).
- 0.0 if three or more distinct tools appear in the window, or if the run has not yet reached the minimum call count.
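The scoring rules above can be sketched as follows; helper names and the exact alternation test are assumptions, not the actual ReasonBlocks code:

```python
# Illustrative diversity score over the last five tool calls, applied only
# once the run has accumulated at least eight calls.

WINDOW = 5
MIN_CALLS = 8

def is_alternating(tools: list) -> bool:
    """An a-b-a-b-a pattern: exactly two tools, no two adjacent calls equal."""
    return (len(set(tools)) == 2
            and all(tools[i] != tools[i + 1] for i in range(len(tools) - 1)))

def diversity_score(tool_calls: list) -> float:
    if len(tool_calls) < MIN_CALLS:
        return 0.0  # too early in the run to judge
    window = tool_calls[-WINDOW:]
    distinct = len(set(window))
    if distinct == 1:
        return 1.0
    if distinct == 2 and not is_alternating(window):
        return 0.7
    return 0.0
```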
hedge — rising hedging and retraction density (weight: 0.10)
The hedge monitor tracks whether the agent’s language is becoming increasingly uncertain or self-contradictory across the run. It operates on the thought text of each step.

It measures two things:

- Retraction phrases: Explicit reversals such as “I was wrong”, “never mind”, “disregard that”, or “the bug is actually not…”. Any retraction anywhere in the trace scores 1.0 immediately.
- Hedging density ratio: The density of hedging words (e.g., “maybe”, “perhaps”, “might”, “not sure”, “on second thought”) in the later half of the run compared to the early half. A ratio above 2× scores above 0.0; a ratio at or above 4× scores 1.0.
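A sketch of the density-ratio half, with two stated assumptions: the hedge-word list is illustrative, and the interpolation between the 2× and 4× thresholds is assumed linear (the docs only pin down the endpoints):

```python
# Hedging-density ratio: compare hedge-word frequency in the later half of
# the run against the early half. Word list and interpolation are assumptions.

HEDGES = ("maybe", "perhaps", "might", "not sure", "on second thought")

def hedge_density(thoughts: list) -> float:
    text = " ".join(thoughts).lower()
    words = max(len(text.split()), 1)
    hits = sum(text.count(h) for h in HEDGES)
    return hits / words

def hedge_ratio_score(thoughts: list) -> float:
    half = len(thoughts) // 2
    early = hedge_density(thoughts[:half])
    late = hedge_density(thoughts[half:])
    if early == 0:
        early = 1e-9  # avoid division by zero when the early half has no hedges
    ratio = late / early
    if ratio <= 2:
        return 0.0
    if ratio >= 4:
        return 1.0
    return (ratio - 2) / 2  # assumed linear ramp between the 2x and 4x thresholds
```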
Composite score and the fire threshold
After running all six monitors, evaluate_all computes a weighted sum:

composite = 0.35 × streak + 0.15 × call_count + 0.15 × edit_revert + 0.15 × test_repeat + 0.10 × diversity + 0.10 × hedge

A monitor is considered fired when its individual score reaches the DEFAULT_FIRE_THRESHOLD of 0.6. The composite score is used separately by the E1 gate (threshold: 0.15) and surfaced in the ReasonBlocks dashboard.
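In code, the weighted sum looks like this (weights taken from the six sections above; they sum to 1.0, so the composite stays in [0, 1]):

```python
# Composite health score: weighted sum of the six monitor scores, using the
# default weights documented above. Missing monitors contribute 0.0.

WEIGHTS = {
    "streak": 0.35,
    "call_count": 0.15,
    "edit_revert": 0.15,
    "test_repeat": 0.15,
    "diversity": 0.10,
    "hedge": 0.10,
}

def composite_score(scores: dict) -> float:
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())
```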