# Interactive Replay

Replay any AI agent session step by step to debug failures, understand unexpected behavior, and verify fixes.
## How Replay Works

Opswald captures every step of your agent's execution:

- **Deterministic capture** - All inputs, outputs, and decisions are recorded
- **Perfect reproduction** - Replay the exact same sequence with identical results
- **Interactive debugging** - Step through, pause, and examine state at any point
- **Fork and modify** - Change inputs mid-session to test different outcomes
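The capture-and-reproduce idea can be sketched in plain Python. This is an illustration of the pattern only, not the Opswald SDK: each live step records its inputs and output, and replay serves the recorded output instead of re-executing.

```python
class StepRecorder:
    """Illustrative record/replay sketch (not the Opswald SDK): capture each
    step's inputs and output so the session can be reproduced exactly."""

    def __init__(self):
        self.log = []          # ordered record of executed steps
        self.cursor = 0        # next recorded step to serve during replay
        self.replaying = False

    def step(self, step_type, fn, **inputs):
        if self.replaying:
            # Reproduction: return the recorded output instead of re-running.
            recorded = self.log[self.cursor]
            self.cursor += 1
            assert recorded["type"] == step_type, "replay diverged from capture"
            return recorded["output"]
        output = fn(**inputs)  # live execution
        self.log.append({"type": step_type, "input": inputs, "output": output})
        return output

    def start_replay(self):
        self.replaying = True
        self.cursor = 0


recorder = StepRecorder()
live = recorder.step("llm_call", lambda prompt: prompt.upper(),
                     prompt="analyze this sales data")

recorder.start_replay()
# Even with a different function, replay returns the captured output.
replayed = recorder.step("llm_call", lambda prompt: "something else",
                         prompt="analyze this sales data")
assert replayed == live
```

Because replay never re-runs the underlying call, results are identical even when the live dependency (an LLM, a flaky tool) is nondeterministic.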
## Starting a Replay

### From Dashboard
1. Go to your traces
2. Click any completed session
3. Click the **Replay** button
4. Choose a replay mode:
   - **Step-by-step** - Manual control of each step
   - **Real-time** - Replay at original speed
   - **Fast forward** - Skip to specific steps
### From API

```python
from opswald import OpsClient

client = OpsClient("your-api-key")

# Start an interactive replay
replay = client.replays.start(
    session_id="session_456",
    mode="interactive",
)

print(f"Replay URL: {replay.url}")
```

## Replay Interface
### Timeline Scrubber

Navigate through the session timeline:
```
[====|====|====|====]  Step 8 of 15
 ^         ^
Start   Current
```

- **Drag** to jump to any step instantly
- **Arrow keys** to step forward/backward
- **Space** to play/pause
- Click timestamps to jump to specific moments
### Step Inspector

For each step, see:

#### Request Details

```json
{
  "step": 8,
  "type": "llm_call",
  "model": "gpt-4o",
  "timestamp": "2026-03-17T14:15:32Z",
  "input": {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant..."},
      {"role": "user", "content": "Analyze this sales data"}
    ],
    "temperature": 0.1,
    "max_tokens": 2000
  }
}
```

#### Response Details

```json
{
  "output": {
    "content": "Based on the sales data, I can see three key trends...",
    "finish_reason": "stop",
    "usage": {
      "prompt_tokens": 245,
      "completion_tokens": 189,
      "total_tokens": 434
    }
  },
  "cost": 0.0087,
  "duration_ms": 2140
}
```

#### Context State
View the full agent state at this step:
- **Memory contents** - What the agent remembers
- **Tool outputs** - Results from previous tool calls
- **Variables** - Any state variables or flags
- **Session data** - User context and conversation history
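A state snapshot is essentially a record holding those four buckets. A minimal sketch of that shape (the field names here are assumptions for illustration, not the exact Opswald schema):

```python
from dataclasses import dataclass, field

@dataclass
class ContextState:
    """Hypothetical shape of a per-step agent state snapshot."""
    memory: dict = field(default_factory=dict)        # what the agent remembers
    tool_outputs: list = field(default_factory=list)  # results of prior tool calls
    variables: dict = field(default_factory=dict)     # state variables and flags
    session: dict = field(default_factory=dict)       # user context, history

state = ContextState(
    memory={"user_preference": "quarterly reports"},
    tool_outputs=[{"tool": "load_sales_csv", "rows": 1200}],
    variables={"report_type": "sales"},
)
print(state.memory["user_preference"])
```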
## Debugging Features

### Pause & Examine
Stop at any step to inspect:
```python
# Pause at the step where the error occurred
replay.pause_at(step=12)

# Examine the exact state
state = replay.get_state(step=12)
print(f"Memory: {state.memory}")
print(f"Tools available: {state.tools}")
print(f"Last output: {state.last_output}")
```

### Compare Steps
See what changed between steps:
```diff
Step 7 → Step 8 Changes:
+ memory.user_preference = "quarterly reports"
+ context.report_type = "sales"
- context.pending_tasks[0] (task completed)
```

### Error Analysis

When replaying failed sessions:
```
🔴 Step 12: Tool Call Failed
───────────────────────────────
Tool: send_email
Input: {
  "to": "team@company.com",
  "subject": "Q1 Sales Report",
  "body": "Please find the report attached...",
  "attachments": ["q1-sales.pdf"]
}

Error: FileNotFoundError: q1-sales.pdf
───────────────────────────────
Agent State Before Error:
- Working directory: /tmp/session_456/
- Generated files: ["summary.txt", "charts.png"]
- Missing file: q1-sales.pdf (expected but not created)

Suggested Fix:
The agent attempted to attach a file it never created.
Check steps 8-11 for missing file-generation logic.
```

## Fork & Modify
### Change Inputs
Test different outcomes by modifying inputs mid-session:
```python
# Fork from step 5 with different user input
fork = replay.fork_from(
    step=5,
    modifications={
        "user_input": "Focus only on revenue trends, not customer data"
    },
)

# Replay continues with the new input
fork.continue_from(step=5)
```

### Alternative Paths
Explore what would happen with different agent decisions:
```python
# At step 8 the agent chose tool A. Try tool B instead:
fork = replay.fork_from(
    step=8,
    modifications={
        "selected_tool": "analyze_charts",  # Instead of "generate_summary"
        "tool_params": {"chart_type": "line", "period": "monthly"},
    },
)
```

### Model Comparison
Replay the same session with different models:
```python
# Replay with a different model
model_comparison = replay.fork_from(
    step=1,
    modifications={
        "model": "claude-3-sonnet",  # Was gpt-4o
        "temperature": 0.0,          # Make it more deterministic
    },
)

# Compare outcomes
original_result = replay.final_output
new_result = model_comparison.final_output

print("Original:", original_result.summary)
print("Claude:", new_result.summary)
```

## Batch Replay
### Regression Testing
Replay multiple sessions to verify fixes:
```python
# Test a fix against historical failures
failed_sessions = client.traces.list(error=True, limit=20)

results = client.replays.batch_replay(
    session_ids=[s.id for s in failed_sessions],
    modifications={
        "agent_version": "v2.1.4",  # New version
        "timeout": 30,              # Longer timeout
    },
)

print(f"Success rate: {results.success_rate}")
print(f"Still failing: {results.still_failing}")
```

### A/B Testing
Compare agent performance across variations:
```python
# Test two different prompting strategies
test_sessions = client.traces.list(limit=10)

for session in test_sessions:
    # Version A: Original prompt
    replay_a = client.replays.start(
        session_id=session.id,
        modifications={"prompt_style": "detailed"},
    )

    # Version B: Concise prompt
    replay_b = client.replays.start(
        session_id=session.id,
        modifications={"prompt_style": "concise"},
    )

    # Compare results
    compare_outcomes(replay_a.result, replay_b.result)
```

## Golden Tests
### Save Important Sessions
Pin critical sessions as regression tests:
```python
# Mark a session as a golden test
golden = client.golden_tests.create(
    session_id="session_456",
    name="Quarterly Report Generation",
    description="Complete flow from data upload to email delivery",
    tags=["reports", "automation", "critical"],
)
```

### Run Golden Tests
Verify your agent still works correctly:
```bash
# Run all golden tests
curl -X POST https://api.opswald.com/v1/golden-tests/run \
  -H "Authorization: Bearer your-api-key"
```

Results:

```json
{
  "total": 15,
  "passed": 14,
  "failed": 1,
  "failed_tests": [
    {
      "name": "Customer Support Escalation",
      "error": "Tool 'escalate_ticket' not found",
      "suggestion": "Tool was removed in recent update"
    }
  ]
}
```

### CI Integration
Add golden tests to your deployment pipeline:
```yaml
- name: Run Opswald Golden Tests
  run: |
    response=$(curl -X POST https://api.opswald.com/v1/golden-tests/run \
      -H "Authorization: Bearer ${{ secrets.OPSPWALD_API_KEY || secrets.OPSWALD_API_KEY }}")

    passed=$(echo "$response" | jq '.passed')
    total=$(echo "$response" | jq '.total')

    if [ "$passed" != "$total" ]; then
      echo "Golden tests failed: $passed/$total passed"
      exit 1
    fi
```

## Performance Replay
### Latency Analysis
Replay sessions to identify bottlenecks:
```python
# Replay with timing analysis
replay = client.replays.start(
    session_id="session_456",
    mode="performance_analysis",
)

# Get step-by-step timing
for step in replay.steps:
    if step.duration > 5000:  # >5 seconds
        print(f"Slow step {step.number}: {step.type} took {step.duration}ms")
        print(f"  Details: {step.description}")
```

### Cost Analysis
Understand where money was spent:
```python
# Replay with cost tracking
replay = client.replays.start(
    session_id="session_456",
    mode="cost_analysis",
)

total_cost = 0
for step in replay.steps:
    if step.cost > 0:
        print(f"Step {step.number}: ${step.cost:.4f} ({step.type})")
        total_cost += step.cost

print(f"Total session cost: ${total_cost:.4f}")
```

## Advanced Features
### Custom Replay Hooks
Add custom logic during replay:
```python
class DebuggerHooks:
    def before_step(self, step):
        print(f"About to execute: {step.type}")

    def after_step(self, step, result):
        if step.type == "llm_call":
            print(f"Tokens used: {result.tokens}")

    def on_error(self, step, error):
        print(f"Error in {step.type}: {error}")

# Use hooks during replay
replay = client.replays.start(
    session_id="session_456",
    hooks=DebuggerHooks(),
)
```

### Conditional Breakpoints
Set automatic pause conditions:
```python
# Pause when cost exceeds a threshold
replay.add_breakpoint(
    condition="cost > 0.50",
    action="pause",
)

# Pause on specific tool calls
replay.add_breakpoint(
    condition="tool_name == 'send_email'",
    action="pause",
)

# Log when memory changes
replay.add_breakpoint(
    condition="memory.changed",
    action="log",
)
```

## Best Practices
### Effective Debugging
1. **Start broad** - Replay the entire session first
2. **Narrow down** - Focus on the problematic steps
3. **Compare states** - Look at before/after conditions
4. **Test theories** - Use forks to verify hypotheses
5. **Document findings** - Add notes to important sessions
### Performance Tips
- **Limit scope** - Replay only the relevant sections of large sessions
- **Use filters** - Focus on specific step types (LLM calls, tools, errors)
- **Batch operations** - Group related replay sessions
- **Cache results** - Save replay outputs for comparison
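The caching tip can be as simple as memoizing replay results by session and modifications. A hedged sketch of that idea (the `cached_replay` helper and cache layout are illustrative, not an Opswald feature):

```python
import json

_replay_cache = {}  # (session_id, serialized modifications) -> result

def cached_replay(client, session_id, modifications=None):
    """Return a cached replay result when this exact replay has run before."""
    # JSON with sorted keys gives a stable, hashable cache key.
    key = (session_id, json.dumps(modifications or {}, sort_keys=True))
    if key not in _replay_cache:
        _replay_cache[key] = client.replays.start(
            session_id=session_id,
            modifications=modifications,
        )
    return _replay_cache[key]
```

Repeated comparisons against the same baseline then reuse the stored result instead of replaying again.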
### Privacy & Security
- **Filtered replay** - Hide sensitive data while preserving structure
- **Secure sharing** - Generate temporary replay links for team members
- **Access logs** - Track who accessed which replays
- **Data retention** - Set replay retention policies
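Filtered replay amounts to redacting sensitive values while keeping the step structure intact. A minimal sketch of that idea (the key list and `redact` helper are illustrative, not the product's filter):

```python
SENSITIVE_KEYS = {"api_key", "email", "password", "ssn"}  # illustrative list

def redact(obj):
    """Recursively mask sensitive values while preserving structure."""
    if isinstance(obj, dict):
        return {
            k: "[REDACTED]" if k in SENSITIVE_KEYS else redact(v)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [redact(item) for item in obj]
    return obj

step = {
    "type": "tool_call",
    "input": {"email": "team@company.com", "subject": "Q1 Sales Report"},
}
print(redact(step))
# The email is masked; the step shape and non-sensitive fields survive.
```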
Replay is your most powerful debugging tool. Use it to understand not just what your agent did, but why it made those decisions and how you can improve its behavior.