How to Review Production Agent Failures Without Guessing

A production AI agent failure rarely looks like a clean exception. The run may finish successfully. The model may return a confident answer. The tool API may return 200 OK. The incident only becomes visible later, when a customer complains, an operations queue drifts, or a human reviewer notices that the agent took the wrong action for a plausible reason.

That makes post-incident review harder than normal backend debugging. You are not only asking which line of code failed. You are asking what the agent saw, which decisions it made, which tool state it trusted, which side effects it created, and where the system should have stopped it.

This guide gives developers a practical workflow for reviewing production AI agent failures without guessing from scattered logs. We will use a fictional company, Northstar Outfitters, as the running example. Northstar runs a support agent that can look up orders, read refund policy, inspect shipment status, issue refunds, add internal notes, and escalate edge cases to a human.

The incident: a customer returned a damaged jacket. The agent approved store credit instead of refunding the original payment method. No service crashed. The refund tool accepted the call. The final response looked helpful. But the business outcome was wrong.

Start with a review packet, not a Slack theory

The first mistake in agent incident review is beginning with opinions. Someone says the model hallucinated. Someone else blames retrieval. Another developer suspects the refund tool. All of those may be true, but none of them should be the starting point.

Start by assembling a review packet. The packet should be small enough that one engineer can read it end to end, but complete enough to reconstruct the run without opening five dashboards.

Evidence

The complete trace: one logical run from user input through model calls, tools, tool results, retries, and final output.
External state: the order, policy, shipment, ticket, and refund records as they existed before and after the run.
Versions: prompt version, model, tool schema version, retrieval index version, feature flags, and deployment commit.
Expected behavior: the business rule or product expectation that defines why the observed outcome was wrong.

Opswald traces are useful here because they preserve the logical execution story of an agent run. The goal is not to collect more logs. The goal is to create one reviewable artifact that lets the team answer: what happened, why did the agent do it, and where should the system have caught it?
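As a rough illustration, the packet can be modeled as one small record with a completeness check. This is a minimal sketch, not an Opswald data format: the `ReviewPacket` class, its field names, and `missing_evidence` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewPacket:
    """One reviewable artifact per incident. All field names are hypothetical."""

    incident_id: str
    trace: list = field(default_factory=list)         # ordered steps of one logical run
    state_before: dict = field(default_factory=dict)  # external records before the run
    state_after: dict = field(default_factory=dict)   # the same records after the run
    versions: dict = field(default_factory=dict)      # prompt, model, schema, index, flags, commit
    expected_behavior: str = ""                       # the business rule that was violated

    def missing_evidence(self) -> list:
        """Return the names of evidence sections that are still empty."""
        sections = {
            "trace": self.trace,
            "state_before": self.state_before,
            "state_after": self.state_after,
            "versions": self.versions,
            "expected_behavior": self.expected_behavior,
        }
        return [name for name, value in sections.items() if not value]


packet = ReviewPacket(
    incident_id="ORD-8142",
    trace=[{"step": 1, "kind": "user_message"}],
    expected_behavior="Damaged items are refunded to the original payment method.",
)
print(packet.missing_evidence())  # ['state_before', 'state_after', 'versions']
```

A check like this makes it obvious when the review is about to start from an incomplete packet.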

Step 1: Define the failed outcome precisely

A vague incident title produces a vague review. “Agent gave wrong refund” is not enough. Write the failure as a specific expected versus actual statement:

incident-statement.txt

// Too vague:
The support agent made a bad refund decision.

// Reviewable:
For damaged item return ORD-8142, policy requires refund_original_payment.
The agent called create_store_credit for €129.00 instead.
The tool call succeeded and changed customer account balance.

This statement anchors the review. It separates the bad outcome from secondary symptoms like latency, token count, or a retrieval warning. Those signals may matter later, but the review should stay centered on the decision that produced the wrong business result.

Step 2: Replay the path before editing the prompt

When an agent fails, the tempting fix is to add another sentence to the prompt: “Always refund damaged items to the original payment method.” Sometimes that is the right fix. Often it is just a patch over an unknown cause.

Replay the run first. Replay should let you walk the original path with the original user request, retrieved context, tool outputs, and model steps. You are looking for the point where the agent’s internal story diverged from the expected business story.

For Northstar, replay might show that the agent retrieved two policy snippets. One said opened items normally receive store credit. Another said damaged items are refunded to the original payment method. The agent used the first snippet and ignored the exception in the second.

That finding is different from “the model is bad.” It tells you the failure might involve retrieval ranking, context presentation, decision inspection, or a missing validation guard. Replay turns a general complaint into a concrete branch in the decision path.

Review rule

Do not change prompts, tools, or retrieval settings until you can point to the first incorrect assumption in replay. Otherwise you are tuning around a story you have not verified.
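The review rule can be sketched as a small search over recorded steps. The example below assumes each replayed step carries a `belief` dictionary describing what the agent held to be true at that point. That field, the step layout, and `first_incorrect_assumption` are hypothetical and only illustrate the idea of locating the first divergence.

```python
from typing import Optional

# Hypothetical recorded steps from the Northstar run.
recorded_steps = [
    {"step": 1, "kind": "user_message", "belief": {"item_condition": "damaged"}},
    {"step": 2, "kind": "tool:read_policy", "belief": {"item_condition": "damaged"}},
    {"step": 3, "kind": "model_decision",
     "belief": {"item_condition": "opened", "refund_method": "store_credit"}},
    {"step": 4, "kind": "tool:create_store_credit",
     "belief": {"item_condition": "opened", "refund_method": "store_credit"}},
]

# What the business rule says should have been true for this order.
ground_truth = {"item_condition": "damaged", "refund_method": "original_payment"}


def first_incorrect_assumption(steps: list, truth: dict) -> Optional[dict]:
    """Walk the steps in order and return the first belief that contradicts the truth."""
    for step in steps:
        for key, believed in step["belief"].items():
            if key in truth and believed != truth[key]:
                return {
                    "step": step["step"],
                    "kind": step["kind"],
                    "field": key,
                    "believed": believed,
                    "expected": truth[key],
                }
    return None


divergence = first_incorrect_assumption(recorded_steps, ground_truth)
print(divergence)
```

Here the first divergence is at step 3, where the agent treats the item as opened rather than damaged. Everything after that step is a consequence, not a cause.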

Step 3: Inspect the decision graph, not only the timeline

A timeline shows the order of events. That is necessary, but it does not always show causality. Agent failures often hide in branches: the option not taken, the fact skipped, the retry path, the assumption carried from a summarized context window.

A decision graph helps answer questions that a flat timeline cannot:

Which observations fed into the refund method decision?
Was the damaged item exception visible to the model at the decision point?
Did the agent choose between multiple tools or default to the first plausible one?
Did a previous tool result overwrite or summarize away an important fact?
Was there a validation node before the mutating refund tool?

In the Northstar case, the decision graph might show that the agent moved from read_policy to create_store_credit through a node labeled “opened item return.” It might also show a disconnected observation: “customer reports product arrived torn.” That observation existed in the trace, but it did not influence the final refund path.

That is the bug. The agent did not lack information. It failed to connect the right fact to the right decision.
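A disconnected observation can be found mechanically once the graph is available as data. The sketch below assumes a plain adjacency dictionary in which edges point from an input to the node it influenced. The node naming scheme (`obs:`, `tool:`, `decision:`) and the order lookup node are assumptions made for this example.

```python
# Hypothetical decision graph for the Northstar run.
edges = {
    "obs:customer reports product arrived torn": [],
    "obs:order ORD-8142 lookup": ["tool:read_policy"],
    "tool:read_policy": ["decision:opened item return"],
    "decision:opened item return": ["tool:create_store_credit"],
    "tool:create_store_credit": [],
}


def reaches(graph: dict, start: str, target: str) -> bool:
    """Depth-first search: is there a path from start to target?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False


def disconnected_observations(graph: dict, target: str) -> list:
    """List observations that never influenced the target node."""
    return [
        node for node in graph
        if node.startswith("obs:") and not reaches(graph, node, target)
    ]


print(disconnected_observations(edges, "tool:create_store_credit"))
# ['obs:customer reports product arrived torn']
```

Any observation this returns is a fact the agent had but never used on the path to the mutating call.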

Step 4: Compare expected versus actual tool state

Agent post-incident review must cross the model boundary. A model decision is only half the story. Production failures often involve external state: records read, records written, idempotency keys, retries, and tool side effects.

For each tool call in the trace, compare three things:

Expected pre-state: what the agent should have known before the call.
Actual arguments: the exact tool name and payload the agent sent.
Actual post-state: what changed in the external system after the call.

tool-state-review.txt

expected_tool = "refund_original_payment"
actual_tool   = "create_store_credit"

expected_guard = "damaged_item_exception_checked"
actual_guard   = "not_present"

post_state = "customer_credit_balance increased by €129.00"

This is where many agent reviews become uncomfortable in a useful way. The model made a weak decision, but the system also allowed a mutating financial action without a deterministic policy guard. If a business invariant matters, it should not exist only in natural language.
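One way to move the invariant out of natural language is a deterministic check that runs before the mutating call. The following is a minimal sketch under stated assumptions: the rule table, the `return_reason` parameter, `check_refund_policy`, and the handler signature are hypothetical and do not describe Northstar's actual tool.

```python
class PolicyViolation(Exception):
    """Raised when a mutating call would break a business invariant."""


# Assumed rule table: return reason -> the only refund tool allowed for it.
REQUIRED_REFUND_TOOL = {"damaged": "refund_original_payment"}


def check_refund_policy(tool_name: str, return_reason: str) -> None:
    """Deterministic guard that runs before any refund side effect."""
    required = REQUIRED_REFUND_TOOL.get(return_reason)
    if required is not None and tool_name != required:
        raise PolicyViolation(
            f"{tool_name} is not allowed for a {return_reason} return; expected {required}"
        )


def create_store_credit(order_id: str, amount_cents: int, return_reason: str) -> dict:
    """Hypothetical tool handler: the policy check comes first, the side effect after."""
    check_refund_policy("create_store_credit", return_reason)
    # The real balance update would happen here, only after the guard passes.
    return {"order_id": order_id, "credit_cents": amount_cents}


try:
    create_store_credit("ORD-8142", 12900, return_reason="damaged")
except PolicyViolation as error:
    print(f"blocked: {error}")
```

With a guard like this, the incident call fails loudly at the tool boundary instead of succeeding with the wrong outcome, regardless of what the model decided.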

Step 5: Write the root cause as a chain

The root cause of an agent failure is usually a chain, not a single sentence. “The model chose the wrong tool” is an observation. It is not yet a root cause.

A better root cause record connects decision, context, tool validation, and prevention:

The damaged item fact was present in the initial user message.
The policy exception was retrieved but ranked below the general opened item policy.
The model selected store credit based on the general policy.
The refund tool accepted the call because it validated schema, not policy eligibility.
No regression checked damaged item returns against refund method selection.

That chain gives the team multiple repair points. You can improve retrieval ranking, adjust context formatting, add a decision check before refund tool selection, enforce policy eligibility inside the tool, and add an eval that reproduces the incident.
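If the team wants the chain in a form that tooling can read, one option is a list of small records, each pairing a finding with its repair. This is only a sketch. The `ChainLink` type and the layer labels are hypothetical, and the pairing of repairs to findings is one reasonable reading of the chain above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChainLink:
    layer: str             # where in the system the finding sits
    finding: str           # what the review established
    repair: Optional[str]  # None when the link is a fact, not a defect


root_cause_chain = [
    ChainLink("context", "Damaged item fact was present in the initial user message.", None),
    ChainLink("retrieval", "Policy exception ranked below the general opened item policy.",
              "Improve retrieval ranking and context formatting."),
    ChainLink("decision", "Model selected store credit based on the general policy.",
              "Add a decision check before refund tool selection."),
    ChainLink("validation", "Refund tool validated schema, not policy eligibility.",
              "Enforce policy eligibility inside the tool."),
    ChainLink("regression", "No regression covered damaged item refund method selection.",
              "Add an eval that reproduces the incident."),
]

# Only links with a repair become action items.
repair_points = [link for link in root_cause_chain if link.repair]
for number, link in enumerate(repair_points, start=1):
    print(f"{number}. [{link.layer}] {link.repair}")
```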

Step 6: Add a regression that fails for the right reason

The review is not done when the team understands the incident. It is done when the system can catch it next time. For production agents, that usually means a regression at the level where the failure occurred.

If the failure was a tool guard issue, add a deterministic unit test around the tool validation. If it was context assembly, add a fixture that verifies the damaged item exception appears in the model context. If it was decision quality, add an eval or replay-based test that expects the agent to choose the original payment refund path.

The best regression is specific enough to catch the incident, but general enough to protect the class of failures. Do not only test “ORD-8142 returns card refund.” Test “damaged items override opened item store credit policy when both snippets are present.”

regression-shape.txt

given damaged item return + opened item policy
when support agent selects refund method
then selected tool is refund_original_payment
and create_store_credit is rejected without explicit exception evidence
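The shape above could be written as pytest-style test functions. In this sketch, `select_refund_tool` is a hypothetical stand-in for the decision under test, where a real suite would call the agent or a replay harness instead. The escalation fallback is also an assumption made so the example is complete.

```python
POLICY_SNIPPETS = [
    "Opened items normally receive store credit.",
    "Damaged items are refunded to the original payment method.",
]


def select_refund_tool(return_reason: str, policy_snippets: list) -> str:
    """Stand-in for the decision under test; replace with the agent or a replay call."""
    text = " ".join(policy_snippets).lower()
    if return_reason == "damaged" and "damaged items are refunded" in text:
        return "refund_original_payment"
    if return_reason == "opened" and "store credit" in text:
        return "create_store_credit"
    return "escalate_to_human"


def test_damaged_item_overrides_opened_item_policy():
    # Both snippets are present, as in the incident.
    assert select_refund_tool("damaged", POLICY_SNIPPETS) == "refund_original_payment"


def test_opened_item_still_receives_store_credit():
    # The fix must not break the general rule.
    assert select_refund_tool("opened", POLICY_SNIPPETS) == "create_store_credit"


def test_damaged_item_without_exception_snippet_escalates():
    # Without the exception snippet, store credit must still not be chosen.
    assert select_refund_tool("damaged", POLICY_SNIPPETS[:1]) == "escalate_to_human"
```

The first test reproduces the incident. The other two protect the class of failure by checking that the general rule still works and that a missing exception snippet does not silently fall back to store credit.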

Step 7: Turn the review into an operating loop

A single careful post-incident review is useful. A repeatable operating loop is better. Every production agent failure should leave behind the same artifacts: trace link, replay notes, decision graph finding, tool state comparison, root cause chain, regression link, and rollout verification.

That structure makes agent reliability compound. The first incident teaches the team how to capture evidence. The next incident is faster because the trace format is consistent. Over time, the regression suite becomes a map of real production risk instead of a generic prompt benchmark.

Opswald is built around this workflow: capture the trace, replay the run, inspect the decision graph, and connect model choices to tool behavior. The point is not to make every failure impossible. The point is to make failures visible, reviewable, reproducible, and harder to repeat.

The practical checklist

Write the failed outcome as expected versus actual behavior.
Assemble a review packet before debating theories.
Replay the original path with original context and tool results.
Inspect the decision graph for skipped facts, wrong branches, and missing guards.
Compare expected versus actual tool state before and after side effects.
Record root cause as a chain across context, decision, validation, and regression gaps.
Add a regression that catches the class of failure, not only the exact ticket.

Production AI agent failures are not random mysteries. They only feel random when the evidence is scattered. With traces, replay, and decision graphs, a team can review agent failures like engineering incidents instead of guessing from anecdotes.
