medium_risk (where structural reasoning can only say “might differ”).
When to use replay
Confirm a structural prediction
The regression check said medium_risk (model swap, prompt rewrite). Replay actually runs the affected traces through the new manifest so you can see the new outputs side-by-side with the originals.
Reproduce a flaky bug
A trace failed in production. Replay it against the same manifest to see if the failure is deterministic, then against your fix branch to verify it’s resolved.
Build training data from drift
Replay flagged-for-repair traces against a known-good manifest, then export as JSONL for SFT. This is the bridge between trace history and the Datasets API.
Re-score with a new evaluator
Add a new @eval function. Replay traces under the same manifest to re-score them without re-running the agent.
Lifecycle
Endpoints at a glance
| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/replay/batches | Create a new replay batch (selects which traces to run) |
| GET | /api/v1/replay/batches/{batch_id} | Track progress + retrieve aggregate results |
| GET | /api/v1/replay/export | Export trace prompts or replay results as JSONL |
| POST | /api/v1/replay/tasks/{task_id}/submit | Submit a single task’s result (called by replay workers) |
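The paths in the table above can be assembled into full URLs with a few small helpers. This is a minimal sketch: the paths come from the table, but `BASE_URL` is a placeholder and the helper names are hypothetical, not part of any published client.

```python
from urllib.parse import quote, urljoin

# Placeholder host (assumption): substitute your own deployment's base URL.
BASE_URL = "https://example.invalid"


def batches_url(base: str = BASE_URL) -> str:
    """URL for creating a replay batch (POST)."""
    return urljoin(base, "/api/v1/replay/batches")


def batch_url(batch_id: str, base: str = BASE_URL) -> str:
    """URL for tracking one batch (GET). The ID is percent-encoded."""
    return urljoin(base, f"/api/v1/replay/batches/{quote(batch_id, safe='')}")


def export_url(base: str = BASE_URL) -> str:
    """URL for the JSONL export (GET). Query parameters are not shown here."""
    return urljoin(base, "/api/v1/replay/export")


def task_submit_url(task_id: str, base: str = BASE_URL) -> str:
    """URL a replay worker would POST a single task result to."""
    return urljoin(base, f"/api/v1/replay/tasks/{quote(task_id, safe='')}/submit")
```

Percent-encoding the IDs keeps an unexpected character such as `/` from changing which endpoint the request reaches.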
Quick start
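As a rough starting point, the sketch below builds (but does not send) a request to create a replay batch. Only the method and path come from the endpoint table. The payload fields `trace_ids` and `manifest_id`, the bearer-token header, and the function name are assumptions made for illustration, so check the actual request schema before relying on them.

```python
import json
import urllib.request


def build_create_batch_request(base_url, api_key, trace_ids, manifest_id):
    """Build a POST request for /api/v1/replay/batches without sending it."""
    # Hypothetical payload shape: the real field names may differ.
    payload = {"trace_ids": list(trace_ids), "manifest_id": manifest_id}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/replay/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme: replace with whatever your deployment uses.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_create_batch_request(
        "https://example.invalid", "YOUR_API_KEY", ["trace-1", "trace-2"], "manifest-new"
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the URL and body first. The response would then presumably carry a batch identifier to use with the GET endpoint for tracking progress.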
Related
- Replay Guide — when to replay vs. when to repair
- Regression Check — the pre-deploy companion that flags candidates for replay
- Datasets API — export replay results as training data