AI agent replay

Replay failed agent runs with the context that caused them.

Opswald turns production failures into replayable traces. Pin prompts, retrieved context, tool outputs, model decisions, and state transitions so teams can reproduce incidents and test fixes.

By Opswald Team, AI agent replay and debugging specialists • Last updated May 18, 2026

Request Early Access → Read the debugging guides

agent-run.trace

01 Prompt + retrieved context captured

02 Planner chose tool with incomplete state

03 Tool output contradicted the next decision

04 Replay pins the first divergent step

Direct answer

What is AI agent replay?

AI agent replay is the controlled reproduction of a prior agent run with the original prompt, retrieved context, memory, tool outputs, decisions, and side-effect receipts preserved. Instead of simply rerunning a non-deterministic workflow, replay pins the evidence that caused the incident and varies one change at a time so engineers can prove a fix.
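As a minimal sketch of what "pinning the evidence" can mean in code, the container below holds the evidence groups named in the definition and reports which ones are empty. The class and function names are hypothetical and are not Opswald's API. Treating an empty group as missing is a simplifying assumption, since some runs legitimately have no memory reads.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvidenceBundle:
    """Hypothetical container for the pinned evidence of one agent run."""
    prompt: str
    context: tuple[str, ...]        # retrieved documents, in retrieval order
    memory: tuple[str, ...]         # memory reads the agent saw
    tool_outputs: tuple[dict, ...]  # captured tool responses, in call order
    receipts: tuple[str, ...]       # side-effect receipt IDs


def missing_groups(bundle: EvidenceBundle) -> list[str]:
    """Return the names of evidence groups that are empty."""
    return [f.name for f in fields(bundle) if not getattr(bundle, f.name)]
```

A replay runner could refuse to start, or at least warn, when `missing_groups` returns anything, because a replay with a missing group is closer to a rerun than a reproduction.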

5 evidence groups to pin before replaying an agent incident: prompt, context, memory, tool outputs, and side-effect receipts

0 production mutations allowed during safe incident replay: the Opswald replay workflow uses stubs, dry-run tools, or sandbox accounts

1 change to vary per replay when isolating root cause: prompt, model, retrieval, schema, or orchestration change
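The one-change rule can be enforced mechanically. The sketch below assumes each replay configuration is a plain dict keyed by the five dimensions listed above, with any comparable value (a version string, a hash) per key. The function names and the dict layout are illustrative assumptions.

```python
VARIABLE_DIMENSIONS = ("prompt", "model", "retrieval", "schema", "orchestration")


def changed_dimensions(baseline: dict, candidate: dict) -> list[str]:
    """List the dimensions whose values differ between two replay configs."""
    return [d for d in VARIABLE_DIMENSIONS if baseline.get(d) != candidate.get(d)]


def assert_single_change(baseline: dict, candidate: dict) -> str:
    """Return the single changed dimension, or raise if zero or several changed."""
    changed = changed_dimensions(baseline, candidate)
    if len(changed) != 1:
        raise ValueError(f"replay must vary exactly one dimension, got {changed}")
    return changed[0]
```

Rejecting a candidate that changes both the prompt and the model keeps the comparison attributable: if the decision path moves, there is only one possible cause.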

OpenAI Agents SDK tracing: Agent trace concepts for workflows with LLM generations, tool calls, handoffs, guardrails, and replayable evidence.

OpenTelemetry traces: Trace and span model for reconstructing distributed work before turning failures into fixtures.

LangSmith datasets: Dataset examples show how captured runs can become durable evaluation and regression cases.

What breaks

Agent failures are rarely a single stack trace.

They happen across prompts, memory, retrieved documents, tool schemas, model choices, retries, and side effects. Opswald is built to make that chain inspectable instead of asking engineers to reconstruct it from logs.

The failure disappears on rerun

Temperature, retrieval, cache state, and tool availability change between production and local debugging.

Fixtures are incomplete

Unit tests usually preserve inputs and outputs, but not the intermediate evidence the agent used to choose actions.

Prompt changes are risky

A prompt patch can fix the visible bug while changing other decisions in the same workflow.

Regression tests lag reality

Incident traces rarely become durable tests because reconstructing the run is too manual.

Replay should answer four questions

A good replay system does not simply run the agent again. It controls what changed so engineers can isolate the first meaningful divergence.

Same evidence? Pin the original prompt, memory, retrieved documents, tool outputs, and environment metadata.
Same path? Compare the original decision graph to the replayed path at each branch, retry, and tool call.
Same side effects? Stub or sandbox mutations while preserving receipts and resulting state.
Fixed safely? Run the trace against proposed prompts, schemas, or orchestration changes before shipping.

replay-plan.yml

pin: prompt, context, memory, tool_outputs
stub: charge_card, send_email, write_ticket
compare: decision_graph, tool_args, final_state
expect: no duplicate mutation and policy evidence cited
promote: incident trace becomes regression fixture
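The `expect` line in the plan above can be read as an assertion over the replayed tool calls. Here is one possible sketch of the "no duplicate mutation" half. It assumes each recorded call is a dict with `tool` and `args` keys and treats two calls to the same mutating tool with identical arguments as a duplicate. That record shape and the duplicate rule are assumptions for illustration, not a documented Opswald format.

```python
import json

# Tool names taken from the `stub:` line of the plan above.
MUTATING_TOOLS = {"charge_card", "send_email", "write_ticket"}


def duplicate_mutations(calls: list[dict]) -> list[dict]:
    """Return every mutating call that repeats an earlier call with the same args."""
    seen: set[tuple[str, str]] = set()
    duplicates = []
    for call in calls:
        if call["tool"] not in MUTATING_TOOLS:
            continue
        # Canonical JSON makes argument order irrelevant to the comparison.
        key = (call["tool"], json.dumps(call["args"], sort_keys=True))
        if key in seen:
            duplicates.append(call)
        else:
            seen.add(key)
    return duplicates
```

A replay passes this part of the expectation when `duplicate_mutations(calls)` is empty.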

Practical debugging

Where replay pays off

Incident reproduction

Recreate the exact context behind a production failure without asking the customer to trigger it again.

Prompt regression testing

See whether a prompt or tool-description change alters important decision paths.

Tool contract hardening

Replay old traces against stricter schemas and safer retry behavior.

Replay walkthrough

Example: replay a refund agent that charged twice after retrying

A realistic replay starts with a failed production run, not a clean-room test. This walkthrough shows the evidence Opswald would preserve before engineers vary the prompt, tool schema, or retry policy.

Freeze the failed run

Captured evidence: Customer request, trace ID, agent version, model, system prompt, retrieved policy snippets, approval token, and the original refund tool arguments.

Replay check: The replay starts from the same prompt/context bundle and does not fetch fresh policy text or memory.
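One way to check that a replay really starts from the same bundle is to fingerprint it at capture time and verify the fingerprint before each replay. The sketch below assumes the bundle is JSON-serializable. The function names are hypothetical.

```python
import hashlib
import json


def fingerprint(bundle: dict) -> str:
    """Hash a canonical JSON form of the bundle so key order does not matter."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_frozen(bundle: dict, expected: str) -> None:
    """Raise if the bundle differs from the one captured with the incident."""
    if fingerprint(bundle) != expected:
        raise ValueError("evidence bundle changed since capture; refusing to replay")
```

If a replay quietly fetched fresh policy text, the fingerprint would no longer match and the run would stop before producing a misleading result.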

Pin tool responses and receipts

Captured evidence: First `refund_customer` response, payment-provider receipt ID, timeout envelope, retry count, and final customer-support ticket state.

Replay check: The tool is stubbed with captured receipts so the agent sees the same timeout without issuing a second refund.
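A stub that replays captured responses might look like the sketch below. It assumes captured responses are stored as dicts in call order and that a timeout was recorded as `{"error": "timeout"}`. Both the class and that envelope shape are illustrative assumptions rather than a real capture format.

```python
class ReplayStub:
    """Stand-in for a side-effecting tool that only replays captured responses."""

    def __init__(self, name: str, captured: list[dict]):
        self.name = name
        self._captured = list(captured)  # copy so the caller's list is untouched
        self.calls: list[dict] = []      # arguments the agent passed during replay

    def __call__(self, **args) -> dict:
        self.calls.append(args)
        if not self._captured:
            raise RuntimeError(f"{self.name}: more calls than were captured")
        response = self._captured.pop(0)
        if response.get("error") == "timeout":
            # Surface the captured timeout the same way a live client might.
            raise TimeoutError(f"{self.name}: captured timeout")
        return response
```

Because the stub never reaches the payment provider, the agent can hit the same timeout and take the same retry branch without a second refund. The recorded `calls` list is then available for comparing tool arguments against the original run.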

Compare the decision graph

Captured evidence: Planner step, approval gate, retry branch, duplicate-mutation guard result, and the model message that justified retrying.

Replay check: Opswald highlights the first divergent branch when a patched prompt or schema chooses `stop_and_escalate` instead of retrying.
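Finding the first divergent branch reduces to a pairwise comparison if the decision graph is flattened into an ordered list of step names. That flattening is a simplification made for this sketch, and the function name is hypothetical.

```python
from itertools import zip_longest
from typing import Optional


def first_divergence(
    original: list[str], replayed: list[str]
) -> Optional[tuple[int, Optional[str], Optional[str]]]:
    """Return (index, original_step, replayed_step) at the first mismatch, else None."""
    # zip_longest pads the shorter path with None, so a path that ends early
    # is reported as a divergence instead of being silently ignored.
    for index, (a, b) in enumerate(zip_longest(original, replayed)):
        if a != b:
            return index, a, b
    return None
```

For the refund incident, comparing `["plan", "approval_gate", "refund_customer", "retry"]` with a patched run ending in `stop_and_escalate` returns `(3, "retry", "stop_and_escalate")`.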

Promote the trace to a regression fixture

Captured evidence: Normalized timestamps, redacted customer fields, expected tool-call sequence, and final state assertion.

Replay check: Future prompt, model, retrieval, and tool-schema changes must pass the fixture before shipping.
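Promotion can be sketched as a pure function from a captured trace to a fixture. The version below assumes a trace is a dict with a `steps` list, where each step has a numeric `ts`, a `tool` name, and an `args` dict. The field names in `REDACTED_FIELDS` are placeholders, and a real redaction list would come from the team's data policy.

```python
import copy

# Placeholder field names; assumed for illustration only.
REDACTED_FIELDS = {"customer_email", "customer_name"}


def to_fixture(trace: dict) -> dict:
    """Return a redacted, time-normalized copy of a trace plus its expected tool order."""
    fixture = copy.deepcopy(trace)  # never mutate the captured incident
    start = fixture["steps"][0]["ts"]
    for step in fixture["steps"]:
        step["ts"] -= start  # store offsets so the fixture is date-independent
        for key in step["args"].keys() & REDACTED_FIELDS:
            step["args"][key] = "[REDACTED]"
    fixture["expected_tools"] = [step["tool"] for step in fixture["steps"]]
    return fixture
```

A CI job could then replay the fixture against each proposed change and fail when the observed tool sequence differs from `expected_tools`.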

Comparison

Opswald vs traditional observability for AI agents

| Capability | Traditional logs and APM | Opswald |
| --- | --- | --- |
| Incident reproduction | Engineers rerun locally with fresh retrieval, current tools, and whatever state is available today. | Replays the original evidence chain so the team can reproduce the path that actually failed. |
| Side-effect safety | Debug reruns risk charging cards, sending emails, updating tickets, or mutating customer state again. | Uses stubs, dry-run modes, sandbox accounts, and captured receipts to preserve behavior without repeating mutations. |
| Regression coverage | Incidents become notes or screenshots that are hard to run in CI. | Promotes failed traces into replay fixtures for prompts, tools, retrieval, and orchestration changes. |

Keep reading

Related Opswald guides

Replay is the missing primitive: Why replay matters for reproducing multi-step agent failures.

Decision graphs vs timelines: Why replay works better when the run is represented as causality, not only time.

Debug tool calling failures: Use replay to validate tool-call fixes and side-effect handling.

AI agent debugging: Explore the broader debugging workflow for production agents.

FAQ

Questions teams ask before instrumenting agents

Is replay just deterministic execution?

No. Agent replay is controlled comparison. You pin the original evidence and deliberately vary one part—prompt, model, tool schema, retrieval, or orchestration—to see what changes.

How do you replay side-effecting tools safely?

Use stubs, dry-run modes, sandbox accounts, and captured receipts so the agent sees realistic state without mutating production again.

What should be captured before replaying an AI agent run?

Capture the original user request, system prompt, retrieved documents, memory reads, model responses, tool schemas, tool arguments, tool outputs, retries, errors, permissions, and external side-effect receipts.

How does replay help with prompt changes?

Replay lets teams run the same incident trace against a proposed prompt and compare decision paths, tool calls, and final state before shipping the change.

Can replay become a regression test?

Yes. A failed production trace can become a pinned fixture that runs against future prompt, model, retrieval, schema, and orchestration changes.

Debug the next failed agent run with evidence.

Opswald is in early access for teams shipping AI agents that call tools, use MCP servers, or run multi-step workflows in production.
