How to Debug Tool Calling Failures in AI Agents

Short answer: To debug AI agent tool calling failures, treat each tool call as a model decision plus an API event. You need the model context, selected tool, arguments, result, next decision, and side effects in one trace; the dedicated AI agent tool calling debugging guide covers that workflow in depth.

Relevant implementation references include the OpenAI function calling guide for tool schemas, the Model Context Protocol specification for MCP tools, and the OWASP Top 10 for LLM Applications for unsafe tool and output risks.

Tool calling is where an AI agent stops being a chatbot and starts changing the world. It searches orders, reads policies, updates tickets, issues refunds, sends emails, creates records, and chains those actions together until the user goal is done.

That power also changes the debugging problem. A bad answer is visible. A bad tool call can be invisible until a customer complains, finance reconciles a strange number, or a support team notices that the agent has been doing the right action for the wrong reason.

Imagine a fictional ecommerce company called Northstar Outfitters. They run a refunds agent that helps support reps resolve return requests. The agent can look up orders, read refund policy, inspect shipment status, create a refund, add an internal note, and escalate edge cases to a human.

One morning the team notices that several customers received store credit when they should have received a card refund. There were no crashes. The provider returned 200 OK. The tool endpoint accepted the request. The agent even left tidy support notes. From the outside, everything looked successful.

This is the core challenge when you debug AI agent tool calls: the failure is often not that a tool call failed. The failure is that the agent called the wrong tool, called the right tool with the wrong arguments, ignored the result, repeated the call, or created a silent side effect that looked valid at the API boundary.

Start with the tool call as a decision, not an API event

Traditional observability treats tool calls like HTTP spans: name, latency, status, payload, response. That is useful, but it is not enough. In an agent, every tool call is the output of a model decision. To debug it, you need to inspect the decision that produced it.

tool-call-debugging.txt

// API-level view:
create_store_credit({ order_id: "ORD-8139", amount: 89.00 })
status = "success"

// Agent-debugging view:
decision = "refund by store credit"
because = "policy lookup appeared to say opened items are credit-only"
missing_fact = "item was defective, so original payment refund applies"

If you only look at the API event, the tool succeeded. If you inspect the decision, the bug becomes visible: the agent selected the wrong refund path because it misread the policy exception.

The practical rule: every tool call should be traceable back to the model message, context, retrieved data, and intermediate decision that caused it. If you cannot answer “why did the agent call this tool now?”, you do not have enough AI agent debugging data.

The five common tool calling failure modes

Failure 1: Wrong tool choice. The agent chooses create_store_credit when it should choose refund_original_payment.
Failure 2: Malformed arguments. The agent passes the wrong order ID, amount, currency, reason code, or customer identifier.
Failure 3: Missing tool result. The tool result exists, but the next model step does not actually include or use it correctly.
Failure 4: Repeated calls. The agent retries a non-idempotent action and creates duplicate refunds, notes, or escalations.

The fifth failure mode is the most dangerous: silent side effects. The tool call succeeds, mutates external state, and produces a plausible final answer, but the side effect is wrong. No exception is thrown. The customer record has changed, and the agent moves on.

Debugging these failures requires more than console logs. You need a trace of the full agent run, replay to reproduce the path, and a decision graph that shows how each tool choice depended on earlier observations.

Step 1: Reconstruct the complete trace

Start by finding the single trace for the failed user turn. Do not begin with the refund endpoint logs. Do not begin with model latency. Begin with the complete execution story:

the original customer request or support rep instruction
the retrieved order data
the policy snippets or knowledge base entries shown to the model
the model response that proposed the tool call
the exact tool name and arguments
the tool result returned to the agent
the next model step after the result
the final answer and any external writes

The key is continuity. A tool call without the surrounding model context is just an API event. A model transcript without the tool result is just half the story. A useful trace ties the full loop together.
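One way to keep that continuity is to store every step of a turn in a single record. The sketch below is a minimal illustration, not a schema from any particular tracing product. The class and field names (ToolCallStep, AgentTrace, model_context, result_shown_to_model, and so on) are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCallStep:
    """One model decision plus the API event it produced (hypothetical shape)."""
    model_context: list[str]                       # messages and snippets the model saw
    model_response: str                            # the response that proposed the call
    tool_name: str
    arguments: dict[str, Any]
    raw_result: Optional[dict[str, Any]] = None    # what the tool returned
    result_shown_to_model: Optional[str] = None    # what the next step actually saw
    side_effects: list[str] = field(default_factory=list)  # external writes


@dataclass
class AgentTrace:
    """The full execution story for a single user turn."""
    trace_id: str
    user_request: str
    steps: list[ToolCallStep] = field(default_factory=list)
    final_answer: Optional[str] = None

    def first_mutating_step(self) -> Optional[ToolCallStep]:
        """Return the first step that changed external state, if any."""
        for step in self.steps:
            if step.side_effects:
                return step
        return None

    def missing_links(self) -> list[str]:
        """List gaps that would break continuity when debugging."""
        gaps = []
        for i, step in enumerate(self.steps):
            if not step.model_context:
                gaps.append(f"step {i}: no model context recorded")
            if step.raw_result is None:
                gaps.append(f"step {i}: no raw tool result recorded")
            if step.result_shown_to_model is None:
                gaps.append(f"step {i}: no record of what the model saw next")
        return gaps
```

A non-empty missing_links() result is a sign that the trace cannot yet answer why a tool was called, whatever the endpoint logs say.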

Step 2: Validate the arguments before judging the model

When a tool call looks wrong, developers often jump straight to prompt changes. Resist that instinct. First validate the arguments like you would validate any production write path.

For the Northstar refunds agent, inspect whether the tool arguments matched the observed facts:

Was order_id copied from the current customer record or from a previous search result?
Was refund_amount calculated from the returned items or from the full order total?
Was refund_method derived from policy or guessed from a summary?
Was reason_code specific enough for downstream reporting?
Was an idempotency key included for any mutating action?

Debugging rule

Treat model-generated tool arguments as untrusted input. Schema validation catches malformed values, but trace review catches plausible values that are semantically wrong.

This distinction matters. A JSON schema can verify that refund_method is one of card or store_credit. It cannot verify that store credit is the correct business decision for this customer. That judgment lives in the agent trace.
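The difference between the two kinds of check can be sketched in a few lines. This is an illustration under assumptions, not a production validator: the argument names follow the questions above, while the observed-facts keys (current_order_id, returned_items_total, policy_refund_method) are hypothetical stand-ins for whatever your trace records.

```python
ALLOWED_METHODS = {"card", "store_credit"}


def schema_errors(args: dict) -> list[str]:
    """Shape checks: the kind of thing a JSON schema would also catch."""
    errors = []
    if not isinstance(args.get("order_id"), str):
        errors.append("order_id must be a string")
    amount = args.get("refund_amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        errors.append("refund_amount must be a positive number")
    if args.get("refund_method") not in ALLOWED_METHODS:
        errors.append("refund_method must be card or store_credit")
    if not args.get("idempotency_key"):
        errors.append("idempotency_key is required for mutating tools")
    return errors


def semantic_errors(args: dict, observed: dict) -> list[str]:
    """Fact checks: compare plausible-looking values with what the trace observed."""
    errors = []
    if args.get("order_id") != observed["current_order_id"]:
        errors.append("order_id does not match the current customer record")
    # Exact comparison keeps the sketch short; real money handling would
    # normally use Decimal or integer minor units.
    if args.get("refund_amount") != observed["returned_items_total"]:
        errors.append("refund_amount does not match the returned items total")
    if args.get("refund_method") != observed["policy_refund_method"]:
        errors.append("refund_method disagrees with the applicable policy")
    return errors


if __name__ == "__main__":
    args = {"order_id": "ORD-8139", "refund_amount": 89.00,
            "refund_method": "store_credit", "idempotency_key": "refund-ORD-8139-1"}
    observed = {"current_order_id": "ORD-8139", "returned_items_total": 89.00,
                "policy_refund_method": "card"}  # defective item exception applies
    print(schema_errors(args))
    print(semantic_errors(args, observed))
```

With the Northstar arguments in the example, the schema check passes and only the fact check flags refund_method, which is the store-credit bug described earlier.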

Step 3: Check whether the tool result reached the next decision

A common tool calling bug is not the call itself. It is the handoff after the call. The tool returns useful data, but the agent does not incorporate it into the next model step.

In the refunds example, lookup_order might return that the item was marked defective by warehouse inspection. The next model step should use that fact to choose an original payment refund. If the next prompt contains only a short summary like “item opened, return requested,” the decisive exception has disappeared before the model could act on it.

When debugging missing tool results, compare three artifacts:

The raw tool response.
The normalized or summarized result passed back to the model.
The model decision that followed.

If the raw response contains the correct fact but the next model input does not, the bug is in your adapter, summarizer, memory layer, or context assembly. If the fact reaches the model and the model still chooses the wrong tool, the bug is more likely in prompting, policy representation, tool descriptions, or task constraints.
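That triage rule can be roughed out in code. The sketch below assumes you can name the decisive fields for a given decision, and it uses a plain substring match to see whether each value reached the next model input. Both the function names and the inspection_result field are hypothetical.

```python
def dropped_facts(raw_result: dict, model_input: str, decisive_fields: list[str]) -> list[str]:
    """Return decisive fields whose raw value never appears in the next model input.

    A substring match is a crude assumption: it can flag a summary that
    silently dropped a fact, but it cannot prove the model used the fact.
    """
    dropped = []
    text = model_input.lower()
    for name in decisive_fields:
        value = raw_result.get(name)
        if value is None:
            continue  # the tool never returned it, so nothing was dropped
        if str(value).lower() not in text:
            dropped.append(name)
    return dropped


def locate_bug(raw_result: dict, model_input: str,
               decisive_fields: list[str], decision_was_correct: bool) -> str:
    """Apply the triage rule: context-assembly bug versus model-side bug."""
    if decision_was_correct:
        return "no bug at this step"
    if dropped_facts(raw_result, model_input, decisive_fields):
        return "context assembly: adapter, summarizer, memory layer"
    return "model side: prompting, policy representation, tool descriptions, constraints"


if __name__ == "__main__":
    raw = {"order_id": "ORD-8139", "inspection_result": "defective"}
    summary = "item opened, return requested"
    print(locate_bug(raw, summary, ["inspection_result"], decision_was_correct=False))
```

In the Northstar case the summary never mentions the defect, so the check points at context assembly rather than at the prompt.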

This is where agent replay becomes useful. With replay, you can rerun the same trace with the same tool result and change one variable at a time: the tool description, the policy wording, the argument schema, or the model prompt. Without replay, every attempted fix is mixed with fresh model variance and fresh environment state.
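The one-variable-at-a-time discipline can be enforced by the replay harness itself. The following is a toy sketch, not how any specific replay product works: the recorded inputs are invented for the Northstar example, and stub_decide is a deterministic stand-in for the real model call, which you would normally run with pinned sampling settings.

```python
from typing import Callable

# A recorded decision point: the inputs the model saw when it chose a tool.
RECORDED_INPUTS = {
    "tool_description": "create_store_credit: refund an order as store credit",
    "policy_text": "Opened items are credit-only.",
    "tool_result": "item opened, return requested",
}


def replay(recorded: dict, decide: Callable[[dict], str], **override: str) -> str:
    """Rerun one decision with recorded inputs, changing at most one variable."""
    if len(override) > 1:
        raise ValueError("change one variable per replay so the effect is attributable")
    unknown = set(override) - set(recorded)
    if unknown:
        raise KeyError(f"not a recorded input: {sorted(unknown)}")
    inputs = {**recorded, **override}  # everything else stays frozen
    return decide(inputs)


def stub_decide(inputs: dict) -> str:
    """Deterministic stand-in for the model call, used only for this sketch."""
    if "defective" in inputs["tool_result"].lower():
        return "refund_original_payment"
    return "create_store_credit"


if __name__ == "__main__":
    print(replay(RECORDED_INPUTS, stub_decide))
    print(replay(RECORDED_INPUTS, stub_decide,
                 tool_result="item opened, marked defective by warehouse inspection"))
```

Rejecting more than one override per run is a design choice: it keeps each replay attributable to a single change instead of a bundle of fixes.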

Step 4: Use the decision graph to find the branch where the run went wrong

Linear timelines are helpful, but tool failures often happen at branch points. The agent considered multiple possible actions, then chose one. A decision graph makes those branch points explicit.

For Northstar, the decision graph might show this path:

Customer asks for refund on opened hiking jacket.
Agent retrieves order and policy.
Branch A: opened item means store credit.
Branch B: defective item exception means original payment refund.
Agent chooses Branch A because the defect flag was missing from the policy summary.
Agent calls create_store_credit.

That graph tells you the root cause is not “the refund tool is broken.” It is “the defect exception was not available at the decision point.” The fix is different: preserve the relevant fact, strengthen the tool result summary, change the policy retrieval chunk, or add a precondition that blocks refund tool calls until required facts are present.
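The last of those fixes, a precondition on refund tools, might look like the sketch below. The tool names come from the example above. The required fact names (item_condition, defect_inspection, applicable_policy) and the MissingFactsError class are assumptions made for illustration.

```python
# Facts that must be known before each mutating refund tool may run (hypothetical).
REQUIRED_FACTS = {
    "create_store_credit": {"item_condition", "defect_inspection", "applicable_policy"},
    "refund_original_payment": {"item_condition", "defect_inspection", "applicable_policy"},
}


class MissingFactsError(Exception):
    """Raised when a mutating tool is requested before its facts are known."""


def check_preconditions(tool_name: str, known_facts: dict) -> None:
    """Block the call unless every required fact is present and not None."""
    required = REQUIRED_FACTS.get(tool_name, set())
    missing = sorted(name for name in required if known_facts.get(name) is None)
    if missing:
        raise MissingFactsError(
            f"{tool_name} blocked; missing facts: {', '.join(missing)}"
        )


if __name__ == "__main__":
    facts = {
        "item_condition": "opened",
        "applicable_policy": "opened items are credit-only",
        "defect_inspection": None,  # the flag that went missing at the decision point
    }
    try:
        check_preconditions("create_store_credit", facts)
    except MissingFactsError as exc:
        print(exc)
```

Read-only tools have no entry in REQUIRED_FACTS, so they pass through unchecked. Only calls that mutate state are held until the decision point has the facts it needs.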

Step 5: Make repeated calls impossible, not merely unlikely

Repeated tool calls are especially risky when tools have side effects. If the model does not see a result, times out, receives an ambiguous response, or gets prompted to “try again,” it may call the same mutating tool twice.

Do not rely on the model to remember that a refund has already been created. Use engineering controls:

Require idempotency keys for mutating tools.
Record tool call fingerprints in the trace.
Show prior side effects clearly in the next model step.
Block duplicate calls when the same customer, order, amount, and reason appear twice.
Escalate instead of retrying when a mutating call has ambiguous status.

The trace should make duplicate intent obvious. The replay should let you reproduce the loop. The decision graph should show whether the second call came from a retry policy, a missing observation, or a model decision that lost state.
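A few of the controls listed above fit in a short sketch: a fingerprint over customer, order, amount, and reason, plus a guard that blocks exact repeats and escalates when the earlier outcome is ambiguous. The class and function names are hypothetical, and the in-memory dictionary stands in for the durable store a real system would need.

```python
import hashlib
import json


def fingerprint(tool_name: str, customer_id: str, order_id: str,
                amount: float, reason: str) -> str:
    """Stable hash of the fields that define 'the same' mutating call."""
    payload = json.dumps(
        {"tool": tool_name, "customer": customer_id, "order": order_id,
         "amount": f"{amount:.2f}", "reason": reason},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class DuplicateGuard:
    """In-memory duplicate blocker; assumes a single process, for illustration."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}  # fingerprint -> outcome of the earlier call

    def decide(self, fp: str) -> str:
        """Return 'execute', 'block', or 'escalate' for a mutating call."""
        status = self._seen.get(fp)
        if status is None:
            return "execute"
        if status == "ambiguous":
            return "escalate"  # unknown outcome: hand off instead of retrying
        return "block"         # already succeeded: do not repeat the side effect

    def record(self, fp: str, status: str) -> None:
        """Store the outcome of an attempted call."""
        if status not in {"success", "ambiguous"}:
            raise ValueError("status must be 'success' or 'ambiguous'")
        self._seen[fp] = status
```

The same fingerprint could also be sent to the tool as its idempotency key, so the downstream API rejects a duplicate even if the guard is bypassed.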

What a good debugging workflow looks like

A reliable workflow for debugging AI agent tool calls looks like this:

Start from the wrong business outcome, not the first metric that looks strange.
Open the complete trace for the user turn.
Find the first tool call that changed external state or narrowed the path.
Inspect the model context that produced that call.
Validate the tool name and arguments against observed facts.
Compare the raw tool result with what the model saw next.
Use replay to reproduce the failure with controlled changes.
Use the decision graph to identify the exact wrong branch.
Add a regression: schema guard, prompt constraint, retrieval fix, idempotency rule, or escalation condition.

Opswald is built around this style of investigation. Traces capture the agent run as a coherent execution story. Replay helps developers reproduce failures instead of guessing from stale logs. Decision graphs expose the branching decisions behind tool choices, so teams can see where an agent selected the wrong path.

The goal is not to make every tool call perfect. Production systems fail. Policies change. APIs return surprising results. Models make imperfect decisions. The goal is to make failures inspectable, reproducible, and fixable.

The bottom line

If you want to debug AI agent tool calls, do not stop at “the tool succeeded” or “the JSON was valid.” Those are API checks. Agent debugging asks a deeper question: did the agent choose the right action, with the right arguments, using the right evidence, and correctly interpret the result before moving on?

For simple chatbots, logs may be enough. For agents that call tools and mutate business systems, you need traces, replay, and decision graphs. If those tools are exposed through Model Context Protocol servers, use the MCP debugging checklist as part of the same investigation. Otherwise, the most important bug in your system may look exactly like success.

Debug the decision, not just the API call

Opswald gives engineering teams traces, replay, and decision graphs for production AI agents.
