# Interactive Replay

Replay any AI agent session step by step to debug failures, understand unexpected behavior, and verify fixes.
## How Replay Works

Opswald captures every step of your agent's execution:

- **Deterministic capture** - All inputs, outputs, and decisions are recorded
- **Perfect reproduction** - Replay the exact same sequence with identical results
- **Interactive debugging** - Step through, pause, and examine state at any point
- **Fork and modify** - Change inputs mid-session to test different outcomes
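The capture-and-reproduce idea can be sketched in plain Python. This is an illustration of the pattern only, not the Opswald SDK: each live step records its inputs and output, and replay serves the recorded output instead of re-executing.

```python
class StepRecorder:
    """Illustrative record/replay sketch (not the Opswald SDK): capture each
    step's inputs and output so the session can be reproduced exactly."""

    def __init__(self):
        self.log = []          # ordered record of executed steps
        self.cursor = 0        # next recorded step to serve during replay
        self.replaying = False

    def step(self, step_type, fn, **inputs):
        if self.replaying:
            # Reproduction: return the recorded output instead of re-running.
            recorded = self.log[self.cursor]
            self.cursor += 1
            assert recorded["type"] == step_type, "replay diverged from capture"
            return recorded["output"]
        output = fn(**inputs)  # live execution
        self.log.append({"type": step_type, "input": inputs, "output": output})
        return output

    def start_replay(self):
        self.replaying = True
        self.cursor = 0


recorder = StepRecorder()
live = recorder.step("llm_call", lambda prompt: prompt.upper(),
                     prompt="analyze this sales data")

recorder.start_replay()
# Even with a different function, replay returns the captured output.
replayed = recorder.step("llm_call", lambda prompt: "something else",
                         prompt="analyze this sales data")
assert replayed == live
```

Because replay never re-runs the underlying call, results are identical even when the live dependency (an LLM, a flaky tool) is nondeterministic.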
## Starting a Replay

### From Dashboard
1. Go to your traces
2. Click any completed session
3. Click the **Replay** button
4. Choose a replay mode:
   - **Step-by-step** - Manual control of each step
   - **Real-time** - Replay at original speed
   - **Fast forward** - Skip to specific steps
### From API

```python
from opswald import OpsClient

client = OpsClient("your-api-key")

# Start an interactive replay
replay = client.replays.start(
    session_id="session_456",
    mode="interactive",
)

print(f"Replay URL: {replay.url}")
```

## Replay Interface
### Timeline Scrubber

Navigate through the session timeline:
```
[====|====|====|====]  Step 8 of 15
 ^         ^
Start   Current
```

- **Drag** to jump to any step instantly
- **Arrow keys** to step forward/backward
- **Space** to play/pause
- Click timestamps to jump to specific moments
### Step Inspector

For each step, see:

#### Request Details

```json
{
  "step": 8,
  "type": "llm_call",
  "model": "gpt-4o",
  "timestamp": "2026-03-17T14:15:32Z",
  "input": {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant..."},
      {"role": "user", "content": "Analyze this sales data"}
    ],
    "temperature": 0.1,
    "max_tokens": 2000
  }
}
```

#### Response Details

```json
{
  "output": {
    "content": "Based on the sales data, I can see three key trends...",
    "finish_reason": "stop",
    "usage": {
      "prompt_tokens": 245,
      "completion_tokens": 189,
      "total_tokens": 434
    }
  },
  "cost": 0.0087,
  "duration_ms": 2140
}
```

#### Context State
View the full agent state at this step:
- **Memory contents** - What the agent remembers
- **Tool outputs** - Results from previous tool calls
- **Variables** - Any state variables or flags
- **Session data** - User context and conversation history
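A state snapshot is essentially a record holding those four buckets. A minimal sketch of that shape (the field names here are assumptions for illustration, not the exact Opswald schema):

```python
from dataclasses import dataclass, field

@dataclass
class ContextState:
    """Hypothetical shape of a per-step agent state snapshot."""
    memory: dict = field(default_factory=dict)        # what the agent remembers
    tool_outputs: list = field(default_factory=list)  # results of prior tool calls
    variables: dict = field(default_factory=dict)     # state variables and flags
    session: dict = field(default_factory=dict)       # user context, history

state = ContextState(
    memory={"user_preference": "quarterly reports"},
    tool_outputs=[{"tool": "load_sales_csv", "rows": 1200}],
    variables={"report_type": "sales"},
)
print(state.memory["user_preference"])
```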
## Debugging Features

### Pause & Examine
Stop at any step to inspect:
```python
# Pause at the step where the error occurred
replay.pause_at(step=12)

# Examine the exact state
state = replay.get_state(step=12)
print(f"Memory: {state.memory}")
print(f"Tools available: {state.tools}")
print(f"Last output: {state.last_output}")
```

### Compare Steps
See what changed between steps:
```diff
Step 7 → Step 8 Changes:
+ memory.user_preference = "quarterly reports"
+ context.report_type = "sales"
- context.pending_tasks[0] (task completed)
```

### Error Analysis

When replaying failed sessions:
```
🔴 Step 12: Tool Call Failed
───────────────────────────────
Tool: send_email
Input: {
  "to": "team@company.com",
  "subject": "Q1 Sales Report",
  "body": "Please find the report attached...",
  "attachments": ["q1-sales.pdf"]
}

Error: FileNotFoundError: q1-sales.pdf
───────────────────────────────
Agent State Before Error:
- Working directory: /tmp/session_456/
- Generated files: ["summary.txt", "charts.png"]
- Missing file: q1-sales.pdf (expected but not created)

Suggested Fix:
The agent attempted to attach a file it never created.
Check steps 8-11 for missing file-generation logic.
```

## Fork & Modify
### Change Inputs
Test different outcomes by modifying inputs mid-session:
```python
# Fork from step 5 with different user input
fork = replay.fork_from(
    step=5,
    modifications={
        "user_input": "Focus only on revenue trends, not customer data"
    },
)

# Replay continues with the new input
fork.continue_from(step=5)
```

### Alternative Paths
Explore what would happen with different agent decisions:
```python
# At step 8 the agent chose tool A. Try tool B instead:
fork = replay.fork_from(
    step=8,
    modifications={
        "selected_tool": "analyze_charts",  # Instead of "generate_summary"
        "tool_params": {"chart_type": "line", "period": "monthly"},
    },
)
```

### Model Comparison
Replay the same session with different models:
```python
# Replay with a different model
model_comparison = replay.fork_from(
    step=1,
    modifications={
        "model": "claude-3-sonnet",  # Was gpt-4o
        "temperature": 0.0,          # Make it more deterministic
    },
)

# Compare outcomes
original_result = replay.final_output
new_result = model_comparison.final_output

print("Original:", original_result.summary)
print("Claude:", new_result.summary)
```

## Batch Replay
### Regression Testing
Replay multiple sessions to verify fixes:
```python
# Test a fix against historical failures
failed_sessions = client.traces.list(error=True, limit=20)

results = client.replays.batch_replay(
    session_ids=[s.id for s in failed_sessions],
    modifications={
        "agent_version": "v2.1.4",  # New version
        "timeout": 30,              # Longer timeout
    },
)

print(f"Success rate: {results.success_rate}")
print(f"Still failing: {results.still_failing}")
```

### A/B Testing
Compare agent performance across variations:
```python
# Test two different prompting strategies
test_sessions = client.traces.list(limit=10)

for session in test_sessions:
    # Version A: Original prompt
    replay_a = client.replays.start(
        session_id=session.id,
        modifications={"prompt_style": "detailed"},
    )

    # Version B: Concise prompt
    replay_b = client.replays.start(
        session_id=session.id,
        modifications={"prompt_style": "concise"},
    )

    # Compare results
    compare_outcomes(replay_a.result, replay_b.result)
```

## Golden Tests
### Save Important Sessions
Pin critical sessions as regression tests:
```python
# Mark a session as a golden test
golden = client.golden_tests.create(
    session_id="session_456",
    name="Quarterly Report Generation",
    description="Complete flow from data upload to email delivery",
    tags=["reports", "automation", "critical"],
)
```

### Run Golden Tests
Verify your agent still works correctly:
```bash
# Run all golden tests
curl -X POST https://api.opswald.com/v1/golden-tests/run \
  -H "Authorization: Bearer your-api-key"
```

Results:

```json
{
  "total": 15,
  "passed": 14,
  "failed": 1,
  "failed_tests": [
    {
      "name": "Customer Support Escalation",
      "error": "Tool 'escalate_ticket' not found",
      "suggestion": "Tool was removed in recent update"
    }
  ]
}
```

### CI Integration
Add golden tests to your deployment pipeline:
```yaml
- name: Run Opswald Golden Tests
  run: |
    response=$(curl -X POST https://api.opswald.com/v1/golden-tests/run \
      -H "Authorization: Bearer ${{ secrets.OPSPWALD_API_KEY || secrets.OPSWALD_API_KEY }}")

    passed=$(echo "$response" | jq '.passed')
    total=$(echo "$response" | jq '.total')

    if [ "$passed" != "$total" ]; then
      echo "Golden tests failed: $passed/$total passed"
      exit 1
    fi
```

## Performance Replay
### Latency Analysis
Replay sessions to identify bottlenecks:
```python
# Replay with timing analysis
replay = client.replays.start(
    session_id="session_456",
    mode="performance_analysis",
)

# Get step-by-step timing
for step in replay.steps:
    if step.duration > 5000:  # >5 seconds
        print(f"Slow step {step.number}: {step.type} took {step.duration}ms")
        print(f"  Details: {step.description}")
```

### Cost Analysis
Understand where money was spent:
```python
# Replay with cost tracking
replay = client.replays.start(
    session_id="session_456",
    mode="cost_analysis",
)

total_cost = 0
for step in replay.steps:
    if step.cost > 0:
        print(f"Step {step.number}: ${step.cost:.4f} ({step.type})")
        total_cost += step.cost

print(f"Total session cost: ${total_cost:.4f}")
```

## Advanced Features
### Custom Replay Hooks
Add custom logic during replay:
```python
class DebuggerHooks:
    def before_step(self, step):
        print(f"About to execute: {step.type}")

    def after_step(self, step, result):
        if step.type == "llm_call":
            print(f"Tokens used: {result.tokens}")

    def on_error(self, step, error):
        print(f"Error in {step.type}: {error}")

# Use hooks during replay
replay = client.replays.start(
    session_id="session_456",
    hooks=DebuggerHooks(),
)
```

### Conditional Breakpoints
Set automatic pause conditions:
```python
# Pause when cost exceeds a threshold
replay.add_breakpoint(
    condition="cost > 0.50",
    action="pause",
)

# Pause on specific tool calls
replay.add_breakpoint(
    condition="tool_name == 'send_email'",
    action="pause",
)

# Log when memory changes
replay.add_breakpoint(
    condition="memory.changed",
    action="log",
)
```

## Best Practices
### Effective Debugging
1. **Start broad** - Replay the entire session first
2. **Narrow down** - Focus on the problematic steps
3. **Compare states** - Look at before/after conditions
4. **Test theories** - Use forks to verify hypotheses
5. **Document findings** - Add notes to important sessions
### Performance Tips
- **Limit scope** - Replay only the relevant sections of large sessions
- **Use filters** - Focus on specific step types (LLM calls, tools, errors)
- **Batch operations** - Group related replay sessions
- **Cache results** - Save replay outputs for comparison
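The caching tip can be as simple as memoizing replay results by session and modifications. A hedged sketch of that idea (the `cached_replay` helper and cache layout are illustrative, not an Opswald feature):

```python
import json

_replay_cache = {}  # (session_id, serialized modifications) -> result

def cached_replay(client, session_id, modifications=None):
    """Return a cached replay result when this exact replay has run before."""
    # JSON with sorted keys gives a stable, hashable cache key.
    key = (session_id, json.dumps(modifications or {}, sort_keys=True))
    if key not in _replay_cache:
        _replay_cache[key] = client.replays.start(
            session_id=session_id,
            modifications=modifications,
        )
    return _replay_cache[key]
```

Repeated comparisons against the same baseline then reuse the stored result instead of replaying again.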
### Privacy & Security
- **Filtered replay** - Hide sensitive data while preserving structure
- **Secure sharing** - Generate temporary replay links for team members
- **Access logs** - Track who accessed which replays
- **Data retention** - Set replay retention policies
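Filtered replay amounts to redacting sensitive values while keeping the step structure intact. A minimal sketch of that idea (the key list and `redact` helper are illustrative, not the product's filter):

```python
SENSITIVE_KEYS = {"api_key", "email", "password", "ssn"}  # illustrative list

def redact(obj):
    """Recursively mask sensitive values while preserving structure."""
    if isinstance(obj, dict):
        return {
            k: "[REDACTED]" if k in SENSITIVE_KEYS else redact(v)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [redact(item) for item in obj]
    return obj

step = {
    "type": "tool_call",
    "input": {"email": "team@company.com", "subject": "Q1 Sales Report"},
}
print(redact(step))
# The email is masked; the step shape and non-sensitive fields survive.
```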
Replay is your most powerful debugging tool. Use it to understand not just what your agent did, but why it made those decisions and how you can improve its behavior.