
How to Investigate an AI Agent Failure Step by Step

A practical workflow for debugging agent failures with traces, replay, and decision graphs instead of guessing from logs.

Short answer: To investigate an AI agent failure, start with the wrong outcome, find the complete trace for the user turn, replay the disputed steps, and identify the first bad decision before changing code. That is the practical difference between general observability and AI agent debugging.

Useful references for this workflow include the OpenTelemetry GenAI semantic conventions for trace structure and the OpenAI Agents SDK tracing guide for agent run, tool, and handoff traces.

Your AI agent failed at 2:13 a.m. Maybe it approved the wrong vendor invoice. Maybe it escalated the wrong support case. Maybe it completed successfully and still made the wrong decision. By the time you notice, the run is over, the context is gone, and all you have left is a flat log and a vague sense that something went wrong.

This is where most teams get stuck. They know that the agent failed, but they cannot reconstruct why. They have request logs, timing metrics, maybe a prompt transcript, but not the actual execution story.

Investigating an AI agent failure requires a different workflow than investigating a normal backend incident. You are not only debugging code. You are debugging a sequence of decisions. That means you need traces, replay, and a decision graph that tells you how the agent got from the user goal to the final wrong outcome.

Here is the practical workflow we use for investigating production agent failures.

Step 1: Start with the failed outcome, not the infrastructure metric

Most teams begin with whatever alert fired first: high latency, a timeout, a token spike, a retry count, or the HTTP 500 rate. Those signals matter, but they are often secondary. The first question should be:

What was the wrong outcome the agent produced?

bad-start.txt
// Too generic:
alert = "LLM latency exceeded threshold"

// Better:
incident = "Procurement agent approved invoice INV-8421 even though the vendor was over the spending threshold"

The concrete wrong outcome gives you the anchor for the whole investigation. Without that anchor, you risk optimizing infrastructure while missing the actual decision failure.
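One way to keep that anchor explicit is to write the incident down as a small structured record before opening any dashboard. The sketch below is only an illustration: the `Incident` class, its field names, and the expected outcome text are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record anchored on the wrong outcome."""
    agent: str
    wrong_outcome: str     # what the agent actually did
    expected_outcome: str  # what it should have done (assumed for this example)
    entity_id: str         # the business object that was affected

    def summary(self) -> str:
        return (
            f"{self.agent}: {self.wrong_outcome} "
            f"(expected: {self.expected_outcome}) [{self.entity_id}]"
        )


incident = Incident(
    agent="procurement-agent",
    wrong_outcome="approved invoice although the vendor was over the spending threshold",
    expected_outcome="reject or escalate the invoice",
    entity_id="INV-8421",
)
print(incident.summary())
```

Every later step can then be checked against this record: does the evidence explain this outcome for this entity, or only a nearby metric?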

Step 2: Find the single logical trace for the whole turn

The right unit of investigation is not an individual API call. It is the logical trace for the user turn: from the opening user message, through the model processing and tool work, to the final assistant response or external side effect. See the dedicated guide to AI agent tracing for the trace fields that make this practical.

That trace should contain a readable story like:

If the run is split across multiple unrelated traces, or if late-arriving provider events keep extending a trace after it has closed, your investigation is already compromised. You need one durable timeline to treat as the source of truth.

Bad investigation unit

Single prompt log, single tool call, or one isolated HTTP span without the surrounding turn context.

Correct investigation unit

One logical trace that covers user input, model decisions, tool calls, tool results, and final output.
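A quick completeness check makes the difference between the two units concrete. The sketch below assumes spans arrive as flat records with a `trace_id`, a `kind`, and a timestamp; those field names and the set of required kinds are assumptions for illustration, not a real tracing schema.

```python
from collections import defaultdict

# Hypothetical flat span records; field names are assumptions.
spans = [
    {"trace_id": "t1", "kind": "user_input", "ts": 1},
    {"trace_id": "t1", "kind": "model_decision", "ts": 2},
    {"trace_id": "t1", "kind": "tool_call", "ts": 3},
    {"trace_id": "t1", "kind": "tool_result", "ts": 4},
    {"trace_id": "t1", "kind": "final_output", "ts": 5},
    {"trace_id": "t2", "kind": "tool_call", "ts": 6},  # isolated call, no turn context
]

# A turn is only investigable if it has at least these span kinds.
REQUIRED_KINDS = {"user_input", "model_decision", "final_output"}


def group_by_trace(spans):
    """Group spans by trace id and order each group by timestamp."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    for trace in traces.values():
        trace.sort(key=lambda s: s["ts"])
    return dict(traces)


def missing_kinds(trace):
    """Return the required span kinds that this trace does not contain."""
    return REQUIRED_KINDS - {s["kind"] for s in trace}


traces = group_by_trace(spans)
for trace_id, trace in traces.items():
    gaps = missing_kinds(trace)
    status = "complete" if not gaps else f"incomplete, missing {sorted(gaps)}"
    print(trace_id, status)
```

Here `t1` is a usable investigation unit, while `t2` is the isolated tool call described above as the bad unit.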

Step 3: Reconstruct the execution story before touching code

This is where traces beat logs. Logs tell you what happened in chronological fragments. A good trace tells you the execution story: what the agent saw, what it decided, what it did, and what happened next.

  1. User message (readability first): Confirm the exact opening request or trigger that started the run.
  2. Model reasoning rounds (processing): Identify the LLM calls, retries, and where a tool decision or wrong interpretation appeared.
  3. Tool execution (actions): Map every external action back to the decision that caused it.
  4. Final answer or side effect (closure): Anchor the end of the investigation on the final assistant message or external write.

Do this before proposing a fix. If you skip reconstruction, you will optimize for symptoms. If you reconstruct the execution story first, the real failure class usually becomes obvious.
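Reconstruction can be as simple as rendering the trace into one numbered, human-readable timeline. The sketch below is a minimal illustration: the trace contents, the `check_vendor_limit` tool name, and the mapping from span kinds to phases are all hypothetical.

```python
# Hypothetical trace for one user turn; structure and values are assumptions.
trace = [
    {"kind": "user_input", "text": "Process invoice INV-8421"},
    {"kind": "model_decision", "text": "call tool check_vendor_limit"},
    {"kind": "tool_call", "text": "check_vendor_limit(invoice='INV-8421')"},
    {"kind": "tool_result", "text": "over_threshold=True"},
    {"kind": "final_output", "text": "Invoice approved"},
]

# Map each span kind to the phase of the execution story it belongs to.
PHASES = {
    "user_input": "User message",
    "model_decision": "Processing",
    "tool_call": "Actions",
    "tool_result": "Actions",
    "final_output": "Closure",
}


def render_story(trace):
    """Turn ordered spans into numbered, readable timeline lines."""
    return [
        f"{step}. [{PHASES[span['kind']]}] {span['kind']}: {span['text']}"
        for step, span in enumerate(trace, start=1)
    ]


for line in render_story(trace):
    print(line)
```

Read top to bottom, this made-up timeline already shows a contradiction: the tool reported the vendor over threshold, yet the closure is an approval.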

Step 4: Separate observed facts from inferred explanations

This is where many teams lose discipline. They see a bad output and immediately invent a story about “the model got confused” or “the framework retried badly.” That may be true, but your trace needs to separate what was observed from what is merely inferred.

Observed facts include:

Inferred explanations include:

You need both, but not mixed together. Raw-first debugging only works if the evidence layer stays honest.
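One way to enforce that separation is to make every finding declare whether it is observed, and to require an observed finding to cite the spans that back it. This is a sketch under assumptions: the `Finding` class, the span ids, and the example statements are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical investigation note that keeps evidence and inference apart."""
    statement: str
    observed: bool  # True only if the trace data shows this directly
    span_ids: list = field(default_factory=list)

    def __post_init__(self):
        # An observed fact with no supporting span is really an inference.
        if self.observed and not self.span_ids:
            raise ValueError("an observed fact must cite at least one span")


findings = [
    Finding("tool returned over_threshold=True", observed=True, span_ids=["span-4"]),
    Finding("final output was 'Invoice approved'", observed=True, span_ids=["span-5"]),
    Finding("model ignored the tool result", observed=False),
]

observed = [f.statement for f in findings if f.observed]
inferred = [f.statement for f in findings if not f.observed]
print("Observed:", observed)
print("Inferred:", inferred)
```

The inferred list is then the set of hypotheses still waiting to be tested, not part of the evidence.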

Step 5: Use replay to test the failure hypothesis

Once you reconstruct the execution story, the next move is not to edit code immediately. It is to test your hypothesis against a replay or equivalent deterministic reconstruction.

For example:

Agent replay matters because it forces discipline. It turns debugging from storytelling into falsifiable analysis.

⚠️ Common mistake
Teams often patch the first visible failure, then discover later that the real problem was one step earlier in the decision chain. Replay is what prevents fix-on-fix layering.
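A replay does not have to be elaborate to be falsifiable. The sketch below re-runs a decision step against recorded tool output instead of calling anything live. It assumes the disputed step can be isolated as a deterministic function; the recorded values, function names, and the bug itself are hypothetical stand-ins.

```python
# Recorded inputs from the failed run (hypothetical values).
recorded = {"invoice_id": "INV-8421", "tool_result": {"over_threshold": True}}


def decide_buggy(tool_result):
    # Stand-in for the agent's decision step. The bug is a wrong key name,
    # so the threshold flag silently defaults to False.
    return "reject" if tool_result.get("overThreshold", False) else "approve"


def decide_fixed(tool_result):
    # Candidate fix: read the key the tool actually returns.
    return "reject" if tool_result.get("over_threshold", False) else "approve"


def replay(decide, recorded):
    """Re-run a decision step against recorded tool output; no live calls."""
    return decide(recorded["tool_result"])


# Hypothesis: the decision step misreads the tool result.
before = replay(decide_buggy, recorded)
after = replay(decide_fixed, recorded)
print("buggy decision:", before)
print("fixed decision:", after)
```

If the buggy step had not reproduced the approval on the recorded inputs, the hypothesis would be wrong and the search would move one step earlier in the chain.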

Step 6: Find the first bad decision, not the loudest error

The loudest error is rarely the root cause. The root cause is usually the first bad decision that forced everything downstream into a bad state.

Examples:

This is why decision graphs matter. A flat trace shows the visible break. A decision graph shows the first wrong branch.
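As a rough sketch of that idea, a decision graph can be modeled as nodes with parent pointers, then walked from the node that raised the error back toward the root. The node contents, the `ok` verdicts, and the error name below are made up for illustration.

```python
# Hypothetical decision graph for one run: each node points at the decision
# that led to it. "ok" is a reviewer's verdict; "error" is what the logs showed.
nodes = {
    "n1": {"parent": None, "decision": "parse invoice", "ok": True, "error": None},
    "n2": {"parent": "n1", "decision": "skip vendor limit check", "ok": False, "error": None},
    "n3": {"parent": "n2", "decision": "approve invoice", "ok": False, "error": None},
    "n4": {"parent": "n3", "decision": "post payment", "ok": False, "error": "PaymentRejected"},
}


def first_bad_decision(nodes, failed_id):
    """Walk from the failing node back to the root and return the bad node
    closest to the root, i.e. the earliest wrong branch on that path."""
    first_bad = None
    node_id = failed_id
    while node_id is not None:
        node = nodes[node_id]
        if not node["ok"]:
            first_bad = node_id
        node_id = node["parent"]
    return first_bad


loudest = next(node_id for node_id, node in nodes.items() if node["error"])
root_cause = first_bad_decision(nodes, loudest)
print("loudest error:", loudest, "-", nodes[loudest]["decision"])
print("first bad decision:", root_cause, "-", nodes[root_cause]["decision"])
```

In this example the only logged error sits on `n4`, but the first wrong branch is `n2`, two decisions earlier and silent in the logs.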

Step 7: Classify the failure before proposing the fix

Not all agent failures are the same. Before changing code, classify the failure. We usually see four major buckets:

A lot of wasted debugging time comes from fixing a presentation issue as if it were a boundary bug, or a lifecycle bug as if it were a model quality issue.
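Tagging each incident with an explicit class makes that kind of misclassification easier to spot. The sketch below takes its bucket names from the paragraph above (presentation, boundary, lifecycle, model quality). The incidents, and which bucket each one is assigned to, are illustrative guesses rather than definitions.

```python
from collections import Counter
from enum import Enum


class FailureClass(Enum):
    # Bucket names follow the surrounding text; everything else is made up.
    PRESENTATION = "presentation"
    BOUNDARY = "boundary"
    LIFECYCLE = "lifecycle"
    MODEL_QUALITY = "model quality"


# Hypothetical incidents with an illustrative (not authoritative) classification.
incidents = [
    ("INV-8421 approved over threshold", FailureClass.BOUNDARY),
    ("run shown under a generic title", FailureClass.PRESENTATION),
    ("trace extended after closure", FailureClass.LIFECYCLE),
    ("duplicate closure event", FailureClass.LIFECYCLE),
]


def tally(incidents):
    """Count incidents per failure class."""
    return Counter(failure_class for _, failure_class in incidents)


for failure_class, count in tally(incidents).most_common():
    print(failure_class.value, count)
```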

Step 8: Fix the invariant, then add the regression

The best fix is not the one that passes today’s repro. It is the one that restores the broken invariant.

Examples of invariants:

good-fix-checklist.txt
// A correct fix should do all 3:
1. restore the invariant
2. add a regression test on the concrete repro
3. prove the new behavior on a fresh production message

If your fix only addresses the visible symptom, the next trace will break in a slightly different way.
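The first two checklist items can be sketched as plain test functions. The example assumes the broken invariant is "never approve an invoice that pushes the vendor over its spending threshold", based on the incident from Step 1. The `decide` function, the field names, and all amounts are hypothetical.

```python
def decide(invoice):
    # Invariant: never approve when the invoice would push the vendor
    # over its spending limit.
    if invoice["vendor_spend"] + invoice["amount"] > invoice["vendor_limit"]:
        return "escalate"
    return "approve"


def test_regression_inv_8421():
    # Regression on the concrete repro (amounts are made up for the example).
    repro = {"id": "INV-8421", "amount": 500, "vendor_spend": 9800, "vendor_limit": 10000}
    assert decide(repro) == "escalate"


def test_invariant_holds_across_amounts():
    # The invariant itself, checked beyond the single repro.
    for amount in range(0, 1001, 100):
        invoice = {"id": "X", "amount": amount, "vendor_spend": 9800, "vendor_limit": 10000}
        if invoice["vendor_spend"] + amount > invoice["vendor_limit"]:
            assert decide(invoice) != "approve"


test_regression_inv_8421()
test_invariant_holds_across_amounts()
print("invariant and regression checks passed")
```

The third item, proving the behavior on a fresh production message, cannot be covered by a unit test and still has to happen against the live system.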

Step 9: Verify the readable story in the product, not only the test suite

This is the part many teams skip. The tests go green, the API payload looks fine, and everyone declares victory. Then the real dashboard still shows a generic title, a weird grouping, or a closure that feels wrong to a human operator.

For agent debugging infrastructure, product verification matters. The point is not only to store the right facts. The point is to make the run legible to a human investigating the failure.

That means the final check is:

What good agent investigation infrastructure looks like

At this point the pattern should be clear. Debugging agents is not about collecting more logs. It is about making decisions inspectable.

The infrastructure you want looks like this:

That is the difference between knowing that an agent failed and knowing why it failed.

The practical checklist

When the next agent failure happens, follow this sequence:

  1. Define the exact wrong outcome.
  2. Find the single logical trace for the whole turn.
  3. Reconstruct the execution story.
  4. Separate observed evidence from inferred explanation.
  5. Use replay to test the failure hypothesis.
  6. Find the first bad decision, not the loudest error.
  7. Classify the failure correctly.
  8. Fix the invariant and add the regression.
  9. Verify the readable story in the product.

If your current stack cannot support that sequence, you do not yet have agent debugging infrastructure. You have observability for agent-shaped problems.
