How to Investigate an AI Agent Failure Step by Step

Short answer: To investigate an AI agent failure, start with the wrong outcome, find the complete trace for the user turn, replay the disputed steps, and identify the first bad decision before changing code. That is the practical difference between general observability and AI agent debugging.

Useful references for this workflow include the OpenTelemetry GenAI semantic conventions for trace structure and the OpenAI Agents SDK tracing guide for agent run, tool, and handoff traces.

Your AI agent failed at 2:13 a.m. Maybe it approved the wrong vendor invoice. Maybe it escalated the wrong support case. Maybe it completed successfully and still made the wrong decision. By the time you notice, the run is over, the context is gone, and all you have left is a flat log and a vague sense that something went wrong.

This is where most teams get stuck. They know that the agent failed, but they cannot reconstruct why. They have request logs, timing metrics, maybe a prompt transcript, but not the actual execution story.

Investigating an AI agent failure requires a different workflow than investigating a normal backend incident. You are not only debugging code. You are debugging a sequence of decisions. That means you need traces, replay, and a decision graph that tells you how the agent got from the user goal to the final wrong outcome.

Here is the practical workflow we use for investigating production agent failures.

Step 1: Start with the failed outcome, not the infrastructure metric

Most teams begin with whatever alert fired first: high latency, timeout, token spike, retry count, or 500 rate. Those signals matter, but they are often secondary. The first question should be:

What was the wrong outcome the agent produced?

bad-start.txt

// Too generic:
alert = "LLM latency exceeded threshold"

// Better:
incident = "Procurement agent approved invoice INV-8421 even though the vendor was over the spending threshold"

The concrete wrong outcome gives you the anchor for the whole investigation. Without that anchor, you risk optimizing infrastructure while missing the actual decision failure.
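
One way to keep that anchor explicit is to write the incident down as a small record before opening any dashboard. The sketch below is illustrative only: the `Incident` class and its field names are assumptions, not part of any tracing product, and the alert is kept as secondary context rather than as the headline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Anchors an investigation on the wrong outcome, not on an alert."""
    agent: str
    affected_entity: str                  # e.g. the invoice or ticket ID
    wrong_outcome: str                    # what the agent actually did
    expected_outcome: str                 # what it should have done
    related_alerts: tuple[str, ...] = ()  # secondary signals, kept for context

    def summary(self) -> str:
        return (f"{self.agent} on {self.affected_entity}: "
                f"{self.wrong_outcome} (expected: {self.expected_outcome})")


incident = Incident(
    agent="procurement-agent",
    affected_entity="INV-8421",
    wrong_outcome="approved invoice",
    expected_outcome="reject, vendor over spending threshold",
    related_alerts=("LLM latency exceeded threshold",),
)
print(incident.summary())
```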

Step 2: Find the single logical trace for the whole turn

The right unit of investigation is not an individual API call. It is the logical trace for the user turn: from the opening user message, through the model processing and tool work, to the final assistant response or external side effect. See the dedicated guide to AI agent tracing for the trace fields that make this practical.

That trace should contain a readable story like:

User message
Processing input
Decision: use tool X
Tool execution
Tool result
Processing tool results
Assistant message

If the run is split across multiple unrelated traces, or if late provider noise keeps extending the same trace after closure, your investigation is already compromised. You need one durable timeline of truth.

Bad investigation unit: a single prompt log, a single tool call, or one isolated HTTP span without the surrounding turn context.

Correct investigation unit: one logical trace that covers user input, model decisions, tool calls, tool results, and final output.
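
As a rough sketch of what "one logical trace" means in practice, the snippet below groups flat span records by turn and checks that each timeline opens with the user message and closes on a final response or external write. The `turn_id`, `seq`, and `kind` fields are hypothetical names chosen for illustration; real tracing backends will use their own schema.

```python
from collections import defaultdict

# Hypothetical flat span records; "turn_id" is an assumed correlation field.
spans = [
    {"turn_id": "t1", "seq": 2, "kind": "tool_call"},
    {"turn_id": "t1", "seq": 0, "kind": "user_message"},
    {"turn_id": "t2", "seq": 0, "kind": "user_message"},
    {"turn_id": "t1", "seq": 1, "kind": "llm_call"},
    {"turn_id": "t1", "seq": 3, "kind": "tool_result"},
    {"turn_id": "t1", "seq": 4, "kind": "assistant_message"},
]

CLOSING_KINDS = {"assistant_message", "external_write"}


def logical_traces(spans):
    """Group spans into one ordered timeline per user turn."""
    by_turn = defaultdict(list)
    for span in spans:
        by_turn[span["turn_id"]].append(span)
    return {turn: sorted(items, key=lambda s: s["seq"])
            for turn, items in by_turn.items()}


def is_complete(trace):
    """A usable trace opens with the user message and ends on a closing event."""
    return (bool(trace)
            and trace[0]["kind"] == "user_message"
            and trace[-1]["kind"] in CLOSING_KINDS)


traces = logical_traces(spans)
for turn, trace in traces.items():
    print(turn, [s["kind"] for s in trace], "complete:", is_complete(trace))
```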

Step 3: Reconstruct the execution story before touching code

This is where traces beat logs. Logs tell you what happened in chronological fragments. A good trace tells you the execution story: what the agent saw, what it decided, what it did, and what happened next.

Readability first (user message): Confirm the exact opening request or trigger that started the run.
Processing (model reasoning rounds): Identify the LLM calls, retries, and where a tool decision or wrong interpretation appeared.
Actions (tool execution): Map every external action back to the decision that caused it.
Closure (final answer or side effect): Anchor the end of the investigation on the final assistant message or external write.

Do this before proposing a fix. If you skip reconstruction, you will optimize for symptoms. If you reconstruct the execution story first, the real failure class usually becomes obvious.
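
A minimal sketch of the reconstruction step, assuming the trace is already an ordered list of events: turn each event into one numbered, readable line. The event kinds, the `detail` field, and the `approve_invoice` tool name are all hypothetical.

```python
# Assumed mapping from event kinds to the labels a human would scan.
LABELS = {
    "user_message": "User message",
    "llm_call": "Processing input",
    "tool_decision": "Decision",
    "tool_call": "Tool execution",
    "tool_result": "Tool result",
    "assistant_message": "Assistant message",
}


def render_story(trace):
    """Render an ordered trace as a numbered, human-readable story."""
    lines = []
    for i, step in enumerate(trace, start=1):
        label = LABELS.get(step["kind"], step["kind"])
        detail = step.get("detail", "")
        lines.append(f"{i}. {label}" + (f": {detail}" if detail else ""))
    return "\n".join(lines)


trace = [
    {"kind": "user_message", "detail": "Pay invoice INV-8421"},
    {"kind": "llm_call"},
    {"kind": "tool_decision", "detail": "use tool approve_invoice"},
    {"kind": "tool_call", "detail": "approve_invoice(INV-8421)"},
    {"kind": "tool_result", "detail": "approved"},
    {"kind": "assistant_message", "detail": "Invoice approved."},
]
story = render_story(trace)
print(story)
```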

Step 4: Separate observed facts from inferred explanations

This is where many teams lose discipline. They see a bad output and immediately invent a story about “the model got confused” or “the framework retried badly.” That may be true, but your trace needs to separate what was observed from what is merely inferred.

Observed facts include:

the exact provider request JSON
the tool call proposed in the response
the tool result observed in the next request
the final assistant message or external write

Inferred explanations include:

“the agent preferred speed over accuracy”
“it probably misunderstood the policy”
“this looks like context pressure”

You need both, but not mixed together. Raw-first debugging only works if the evidence layer stays honest.
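
One way to enforce that separation is to tag every investigation note and refuse observed facts that do not cite where in the trace they came from. This is a sketch under assumptions: the `Note` structure, the `add_note` helper, and the example sources are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    OBSERVED = "observed"
    INFERRED = "inferred"


@dataclass(frozen=True)
class Note:
    kind: Kind
    text: str
    source: str = ""  # for observed facts: where in the trace it came from


def add_note(notes, kind, text, source=""):
    """Append a note; observed facts must point at concrete trace evidence."""
    if kind is Kind.OBSERVED and not source:
        raise ValueError("observed facts must cite a trace source")
    notes.append(Note(kind, text, source))


notes = []
add_note(notes, Kind.OBSERVED, "response proposed tool call approve_invoice",
         source="step 3, provider response")
add_note(notes, Kind.INFERRED, "it probably misunderstood the policy")

observed = [n for n in notes if n.kind is Kind.OBSERVED]
inferred = [n for n in notes if n.kind is Kind.INFERRED]
print(len(observed), "observed,", len(inferred), "inferred")
```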

Step 5: Use replay to test the failure hypothesis

Once you reconstruct the execution story, the next move is not to edit code immediately. It is to test your hypothesis against a replay or equivalent deterministic reconstruction.

For example:

If you believe the agent read the wrong section of a policy document, replay the run at the step where the document was first used.
If you believe retries polluted the trace and hid the real answer, replay the turn and examine which attempt actually won semantically.
If you believe a tool result arrived late and reopened the turn incorrectly, replay the lifecycle and compare evidence ordering against final closure.

Agent replay matters because it forces discipline. It turns debugging from storytelling into falsifiable analysis.
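
The idea can be sketched as a small replay harness: re-run a decision function over the recorded inputs from the disputed step, first unchanged to confirm the replay reproduces the failure, then with one input patched to test the hypothesis. Everything here is a simplification. The `decide` function is a deterministic stand-in for the model call, and the recorded steps and field names are hypothetical.

```python
def decide(inputs):
    """Deterministic stand-in for the model's approval decision (assumption)."""
    return "approve" if inputs["amount"] <= inputs["threshold"] else "reject"


def replay_from(recorded_steps, start, decide, patch=None):
    """Re-run `decide` on recorded inputs from `start`.

    Returns (index, replayed_decision, recorded_decision) tuples. `patch`
    overrides inputs of the first replayed step only, to test a hypothesis.
    """
    results = []
    for i, step in enumerate(recorded_steps[start:], start=start):
        inputs = dict(step["input"])
        if patch and i == start:
            inputs.update(patch)
        results.append((i, decide(inputs), step["decision"]))
    return results


# Two recorded approval checks; the second is the disputed one.
recorded = [
    {"input": {"amount": 200, "threshold": 1_000}, "decision": "approve"},
    {"input": {"amount": 5, "threshold": 1_000}, "decision": "approve"},
]

# 1. Unpatched replay should reproduce the recorded decision.
baseline = replay_from(recorded, 1, decide)
# 2. Hypothesis: the amount was misread as 5 instead of 5,000,000.
patched = replay_from(recorded, 1, decide, patch={"amount": 5_000_000})
print(baseline, patched)
```

If the patched replay flips the decision while the baseline reproduces it, the hypothesis survives; if nothing changes, it is falsified and the investigation moves to an earlier step.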

⚠️ Common mistake

Teams often patch the first visible failure, then discover later that the real problem was one step earlier in the decision chain. Replay is what prevents fix-on-fix layering.

Step 6: Find the first bad decision, not the loudest error

The loudest error is rarely the root cause. The root cause is usually the first bad decision that forced everything downstream into a bad state.

Examples:

A send-email tool failed because the wrong address was extracted two steps earlier.
An invoice was approved because the agent interpreted “M” as monthly instead of millions in an earlier spreadsheet-read step.
A lifecycle mismatch appeared because the turn boundary was already semantically wrong before the close event was emitted.

This is why decision graphs matter. A flat trace shows the visible break. A decision graph shows the first wrong branch.
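
As a simplified illustration, the search for the first wrong branch can be written as a forward walk over the steps that stops at the earliest one failing its check, regardless of which step raised the visible error. The step names, outputs, and checks below are hypothetical, and a real decision graph would branch rather than run in a straight line.

```python
def first_bad_decision(steps, checks):
    """Return (index, name) of the earliest step that fails its check, or None."""
    for index, step in enumerate(steps):
        check = checks.get(step["name"])
        if check is not None and not check(step):
            return index, step["name"]
    return None


steps = [
    {"name": "extract_address", "output": "ops@exmaple.com"},  # typo introduced here
    {"name": "draft_email", "output": "Hello, ..."},
    {"name": "send_email", "status": "error"},                 # the loud failure
]

checks = {
    # Assumed ground truth taken from the opening user message.
    "extract_address": lambda s: s["output"] == "ops@example.com",
    "send_email": lambda s: s["status"] == "ok",
}

print(first_bad_decision(steps, checks))
```

The send step is where the error surfaced, but the walk reports the extraction step, which is the one worth fixing.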

Step 7: Classify the failure before proposing the fix

Not all agent failures are the same. Before changing code, classify the failure. We usually see four major buckets:

Boundary bugs: one user turn becomes multiple traces or one trace absorbs too much late activity.
Lifecycle bugs: closure, retry, or completion semantics are wrong.
Decision bugs: the agent chose the wrong action based on incomplete or misread context.
Presentation bugs: the readable story in the dashboard does not match the backend truth.

A lot of wasted debugging time comes from fixing a presentation issue as if it were a boundary bug, or a lifecycle bug as if it were a model quality issue.
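
To make the classification explicit rather than implicit, the four buckets can be encoded as an enum with a small rule function. The finding keys and the order of the checks are assumptions made for this sketch (structural problems are checked before decision quality); they are not a definitive taxonomy.

```python
from enum import Enum


class FailureClass(Enum):
    BOUNDARY = "boundary"
    LIFECYCLE = "lifecycle"
    DECISION = "decision"
    PRESENTATION = "presentation"


def classify(findings):
    """Map investigation findings (hypothetical keys) to one failure class."""
    if findings.get("traces_for_turn", 1) != 1 or findings.get("late_activity_absorbed"):
        return FailureClass.BOUNDARY
    if findings.get("closure_wrong") or findings.get("retry_semantics_wrong"):
        return FailureClass.LIFECYCLE
    if findings.get("dashboard_differs_from_backend"):
        return FailureClass.PRESENTATION
    if findings.get("wrong_action"):
        return FailureClass.DECISION
    return None  # not enough evidence to classify yet


print(classify({"traces_for_turn": 3}))
print(classify({"wrong_action": True}))
```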

Step 8: Fix the invariant, then add the regression

The best fix is not the one that passes today’s repro. It is the one that restores the broken invariant.

Examples of invariants:

One user turn equals one logical trace.
The final assistant message is the semantic closure anchor.
Late evidence must not semantically reopen a closed turn.
Tool decisions come from the provider response, and tool results come from the next provider request.

good-fix-checklist.txt

// A correct fix should do all 3:
1. restore the invariant
2. add a regression test on the concrete repro
3. prove the new behavior on a fresh production message

If your fix only addresses the visible symptom, the next trace will break in a slightly different way.
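
A sketch of what "restore the invariant, then add the regression" might look like in code: an invariant checker that runs on any trace, plus a regression test pinned to the concrete repro. The event shape and the `late` flag are assumptions; the invariant wording is taken from the list above.

```python
def check_invariants(trace):
    """Return the violated invariants for one logical trace (empty means healthy)."""
    violations = []
    if sum(e["kind"] == "user_message" for e in trace) != 1:
        violations.append("one user turn equals one logical trace")
    # Events flagged "late" are attached as evidence but do not reopen the turn.
    live = [e for e in trace if not e.get("late")]
    if not live or live[-1]["kind"] != "assistant_message":
        violations.append("the final assistant message is the semantic closure anchor")
    return violations


# Concrete repro: a tool result arrived after closure and reopened the turn.
repro_before_fix = [
    {"kind": "user_message"},
    {"kind": "tool_call"},
    {"kind": "assistant_message"},
    {"kind": "tool_result"},
]
# After the fix, the same event is recorded as late evidence.
repro_after_fix = repro_before_fix[:3] + [{"kind": "tool_result", "late": True}]


def test_late_tool_result_does_not_reopen_turn():
    assert check_invariants(repro_after_fix) == []


test_late_tool_result_does_not_reopen_turn()
print(check_invariants(repro_before_fix))
```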

Step 9: Verify the readable story in the product, not only the test suite

This is the part many teams skip. The tests go green, the API payload looks fine, and everyone declares victory. Then the real dashboard still shows a generic title, a weird grouping, or a closure that feels wrong to a human operator.

For agent debugging infrastructure, product verification matters. The point is not only to store the right facts. The point is to make the run legible to a human investigating the failure.

That means the final check is:

does the trace title come from the real opening user message?
does the step list read like a human story?
are retries visible inside one grouping, not split into nonsense?
does the final assistant message clearly close the turn?
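
Some of these checks can be automated against whatever the dashboard renders, although whether the step list "reads like a human story" still needs a person. The sketch below assumes a hypothetical `view` payload with a `title` and a list of `steps`; the field names and the example vendor are invented for illustration.

```python
def verify_view(view, trace):
    """Check a rendered view (hypothetical shape) against the backend trace."""
    opening = next(e["text"] for e in trace if e["kind"] == "user_message")
    steps = view["steps"]
    retry_groups = {s.get("group") for s in steps if s.get("retry")}
    return {
        "title_from_opening_message": bool(view["title"]) and view["title"] in opening,
        "retries_in_one_group": len(retry_groups) <= 1,
        "closes_on_assistant_message": bool(steps) and steps[-1]["kind"] == "assistant_message",
    }


trace = [
    {"kind": "user_message", "text": "Pay invoice INV-8421 for Acme"},
    {"kind": "assistant_message", "text": "Invoice approved."},
]
view = {
    "title": "Pay invoice INV-8421",
    "steps": [
        {"kind": "user_message"},
        {"kind": "llm_call", "retry": True, "group": "g1"},
        {"kind": "llm_call", "retry": True, "group": "g1"},
        {"kind": "assistant_message"},
    ],
}
result = verify_view(view, trace)
print(result)
```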

What good agent investigation infrastructure looks like

At this point the pattern should be clear. Debugging agents is not about collecting more logs. It is about making decisions inspectable.

The infrastructure you want looks like this:

Traces that preserve the full logical turn
Replay that lets you reconstruct exactly what happened
Decision graphs that show where the wrong branch was taken
Readable step narratives that humans can scan in seconds

That is the difference between knowing that an agent failed and knowing why it failed.

The practical checklist

When the next agent failure happens, follow this sequence:

Define the exact wrong outcome.
Find the single logical trace for the whole turn.
Reconstruct the execution story.
Separate observed evidence from inferred explanation.
Use replay to test the failure hypothesis.
Find the first bad decision, not the loudest error.
Classify the failure correctly.
Fix the invariant and add the regression.
Verify the readable story in the product.

If your current stack cannot support that sequence, you do not yet have agent debugging infrastructure. You have observability for agent-shaped problems.

Related debugging resources

For a replay-first investigation workflow, see AI agent replay debugging. If the failure involved a bad or repeated tool call, use the tool-calling failure debugging guide to inspect arguments, retries, side effects, and durable receipts.
