How It Works
DecimalAI is an observability platform, not an execution platform. Replay follows a pull-based workflow: you pull stale prompts from DecimalAI, run them through your agent on your own infrastructure, and submit the results back.
Replayability Classification
Not every trace can be replayed. DecimalAI classifies traces into three categories:
| Category | Definition | Example |
|---|---|---|
| Fully replayable | All inputs are captured — can re-run exactly | Pure text in/out, tool calls with recorded args |
| Partially replayable | Some inputs depend on external state | RAG queries where the knowledge base may have changed |
| Not replayable | Inputs depend on real-time context | Streaming data, user sessions, time-sensitive queries |
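The three categories above can be read as a simple precedence rule. The sketch below is illustrative only: `TraceInputs`, its two flags, and `classify` are hypothetical names, not part of the DecimalAI SDK, and the platform's real classification logic may consider more signals.

```python
from dataclasses import dataclass
from enum import Enum


class Replayability(Enum):
    FULLY = "fully_replayable"
    PARTIALLY = "partially_replayable"
    NOT = "not_replayable"


@dataclass
class TraceInputs:
    # Hypothetical flags; the real trace schema may differ.
    depends_on_realtime_context: bool = False  # e.g. streaming data, live sessions
    depends_on_external_state: bool = False    # e.g. a RAG knowledge base


def classify(inputs: TraceInputs) -> Replayability:
    # Real-time dependence is the strongest constraint, so it is checked first.
    if inputs.depends_on_realtime_context:
        return Replayability.NOT
    if inputs.depends_on_external_state:
        return Replayability.PARTIALLY
    # Nothing outside the captured inputs is needed to re-run the trace.
    return Replayability.FULLY
```

Under these assumptions, a RAG query would be modeled as `TraceInputs(depends_on_external_state=True)` and come back as partially replayable.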
Running a Replay
Via the SDK
Via the CLI
Manual Flow
For teams with custom execution environments, you can decouple prompt export from agent execution:
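A minimal sketch of that decoupled flow, assuming each exported prompt carries the ID of its original trace and that your own runner returns the ID of the trace it produced. `ExportedPrompt`, `ReplayLink`, and `run_manual_replay` are hypothetical helpers for illustration; they are not DecimalAI SDK types.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ExportedPrompt:
    # Hypothetical shape of one exported prompt.
    original_trace_id: str
    prompt: str


@dataclass
class ReplayLink:
    # Pairs an original trace with the trace produced by re-running it.
    original_trace_id: str
    replayed_trace_id: str


def run_manual_replay(
    prompts: Iterable[ExportedPrompt],
    run_agent: Callable[[str], str],
) -> List[ReplayLink]:
    """Run each prompt through your own agent and record which traces belong together.

    `run_agent` is your code: it takes the prompt text and returns the ID of the
    new trace it produced on your infrastructure.
    """
    links = []
    for item in prompts:
        replayed_trace_id = run_agent(item.prompt)
        links.append(ReplayLink(item.original_trace_id, replayed_trace_id))
    return links
```

Each resulting pair supplies the two arguments that link(original_trace_id=..., replayed_trace_id=...) expects, so export, execution, and linking can run as separate jobs.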
What Happens After Replay
Replay results feed back into the platform:
- New traces are created and tagged as replay outputs
- Evaluators score the new outputs with the same eval suite as production
- Side-by-side comparison: original output vs. replay output
- DPO pairs are generated for preference training, typically original (rejected) + replay (chosen)
Pairwise Evaluation
When scoring replayed traces, the platform runs a pairwise comparison: “Given the same task, which trajectory is better, original (v1) or replayed (v2)?” This comparison considers:
- Eval scores from both versions
- Output quality and completeness
- Tool call correctness
| Verdict | Meaning | DPO result |
|---|---|---|
| Win | The new output is better | DPO pair (new = chosen, old = rejected) |
| Loss | The old output was better | Reverse DPO pair (old = chosen, new = rejected) |
| Tie | No significant difference | Skipped for DPO |
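The verdict table maps directly onto a small function. This is a sketch of the mapping only, not the platform's implementation; `dpo_pair` and the `chosen`/`rejected` dictionary keys are assumed names.

```python
from typing import Optional


def dpo_pair(verdict: str, original: str, replayed: str) -> Optional[dict]:
    """Turn a pairwise verdict into a preference pair, or None when it is skipped."""
    v = verdict.strip().lower()
    if v == "win":
        # The new output is better: new = chosen, old = rejected.
        return {"chosen": replayed, "rejected": original}
    if v == "loss":
        # The old output was better: reverse pair.
        return {"chosen": original, "rejected": replayed}
    if v == "tie":
        # No significant difference: nothing to learn from, so skip.
        return None
    raise ValueError(f"unknown verdict: {verdict!r}")
```

Returning `None` for a tie keeps the caller's filtering explicit: ties are dropped instead of being emitted as a pair with an arbitrary preference.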
DPO Pair Generation
When a replay produces a better output than the original, DecimalAI can generate DPO training pairs.
Export Prompts
You can export stale prompts for offline processing, which is useful when you run your agent on infrastructure that can’t pull from the SDK directly.
Dashboard: Click “Export Prompts” on the Replay tab (or “Replay Prompts” on the Impact Report banner) to download a JSONL file of all replay-eligible prompts.
API: The export is also available through the REST API (see Replay API under Next Steps).
After running the exported prompts, link each result with link(original_trace_id=..., replayed_trace_id=...) (see Manual Flow above) so the platform can score the comparison and generate DPO pairs.
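A downloaded JSONL export holds one JSON object per line, so it can be streamed without loading the whole file. The sketch below uses only the Python standard library; the `trace_id` and `prompt` field names in the sample are assumptions, since the export schema is not shown here.

```python
import io
import json
from typing import Iterator, TextIO


def read_prompts(stream: TextIO) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON") from exc


# Stand-in for an exported file; field names are hypothetical.
sample = io.StringIO(
    '{"trace_id": "t-1", "prompt": "Summarize the ticket"}\n'
    "\n"
    '{"trace_id": "t-2", "prompt": "Draft a reply"}\n'
)
records = list(read_prompts(sample))
```

For a real export, pass an open file handle in place of `sample`, for example `open(path, encoding="utf-8")`.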
When to Replay
| Scenario | Replay? | Why |
|---|---|---|
| Added a new tool | Usually not | Existing traces are still valid |
| Changed system prompt | Yes | Outputs may differ significantly |
| Upgraded model | Yes | Compare outputs across models |
| Removed a tool | Yes | Traces using that tool are stale |
| Fixed a bug in a tool | Yes | Re-run to get corrected outputs |
Repair vs Replay: For schema changes (tool renamed, parameter changed), use Repair instead; it’s instant and costs nothing. Use replay only when agent behavior changed (for example, a new prompt or model).
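The table and the Repair vs Replay note can be combined into one lookup. This is an illustrative sketch: the change labels and the `recommended_action` helper are hypothetical, and "Usually not" is simplified to no action.

```python
from enum import Enum


class Action(Enum):
    REPLAY = "replay"
    REPAIR = "repair"
    NONE = "none"


# Hypothetical change labels mapped to the guidance above.
_ACTIONS = {
    "tool_added": Action.NONE,               # existing traces are still valid
    "system_prompt_changed": Action.REPLAY,  # outputs may differ significantly
    "model_upgraded": Action.REPLAY,         # compare outputs across models
    "tool_removed": Action.REPLAY,           # traces using that tool are stale
    "tool_bug_fixed": Action.REPLAY,         # re-run to get corrected outputs
    "tool_renamed": Action.REPAIR,           # schema change only
    "parameter_changed": Action.REPAIR,      # schema change only
}


def recommended_action(change: str) -> Action:
    """Look up the suggested follow-up for a single kind of change."""
    try:
        return _ACTIONS[change]
    except KeyError:
        raise ValueError(f"unknown change type: {change!r}") from None
```

Unknown change types raise an error instead of defaulting to no action, so a new kind of change cannot silently skip a needed replay.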
Next Steps
Replay API
REST reference for create batch, submit results, export prompts.
Datasets
Build DPO datasets from replay pairs (original = rejected, new = chosen).
Manifests
Replay is triggered by manifest changes — understand the diff first.