Overview
When you update your agent (change a tool, rewrite a prompt, swap a model), DecimalAI automatically detects the change and classifies every existing trace into one of five actions. This classification determines which traces stay in your training datasets, which need repair, which should be re-run, which need a human look, and which are no longer valid. The compatibility policy controls how changes at each severity level map to actions. You can use built-in presets or create custom rules per agent.
Severity and action are two different things. Severity answers “how much did this surface change?” (minor / moderate / major). Action answers “what do we do with the trace?” (keep / repair / flag / replay / drop). The compatibility policy is the lookup table from one to the other, and it’s the only place the mapping lives. The same major tool change becomes replay under the default preset but drop under strict.
The Five Actions
| Action | What it means | What happens to the trace | Cost |
|---|---|---|---|
| Keep ✅ | Trace is fully compatible with the new agent version | Stays in training datasets as-is | Free |
| Repair 🔧 | Trace can be fixed with a deterministic data migration | Tool calls are rewritten to match new schema (e.g., rename param) | Free (no LLM) |
| Flag 🚩 | Trace might be fine, but the change is ambiguous enough to want a human look | Stays in datasets, but is marked for review in the Impact Report — no automatic action is taken | Free |
| Replay 🔄 | Trace input is still valuable, but output is stale | Original prompt is re-run through the new agent to get fresh output | LLM cost |
| Drop ✕ | Trace is incompatible — both input and output are invalid | Excluded from datasets entirely | Free |
flag is a real action, not a label. It’s the “don’t decide automatically” verdict: the trace stays usable, but it’s surfaced in the Impact Report so a person can choose keep, replay, or drop for it. The permissive preset leans on flag heavily — it never auto-drops, it only flags major changes for review. When two actions tie on a trace, priority runs drop > replay > flag > repair > keep.
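The priority rule can be sketched as a small resolver. This is a minimal illustration, assuming "tie" means that several surfaces propose different actions for the same trace; `ACTION_PRIORITY` and `resolve_action` are hypothetical names, not part of DecimalAI.

```python
from typing import Iterable

# Priority order quoted in the text: drop > replay > flag > repair > keep.
ACTION_PRIORITY = ["drop", "replay", "flag", "repair", "keep"]


def resolve_action(candidates: Iterable[str]) -> str:
    """Return the highest-priority action among the candidates.

    An empty input resolves to "keep" (assumption: nothing changed).
    """
    rank = {action: i for i, action in enumerate(ACTION_PRIORITY)}
    best = "keep"
    for action in candidates:
        if action not in rank:
            raise ValueError(f"unknown action: {action!r}")
        if rank[action] < rank[best]:  # lower index means higher priority
            best = action
    return best


# A trace touched by a repairable tool change and a flagged prompt change.
print(resolve_action(["repair", "flag"]))  # flag
```

Keeping the order in a single list means the resolver and any reporting code read the same source of truth.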
When Each Action Applies
The tables below show the raw severity the diff engine assigns to each kind of change, plus the action the default preset resolves it to. These are the before-policy defaults: switching presets or overriding a surface changes the action column (see The Three Presets), but the severity classification in the “Scenario” rows stays the same.
Tool Changes
Tools are the most common source of agent changes. Here’s how different types of tool changes are classified (default preset):
| Scenario | Example | Severity | Default Action | Why |
|---|---|---|---|---|
| Tool unchanged | search_docs is identical in v1 and v2 | none | Keep | Nothing changed — trace is still valid |
| Optional parameter added | check_inventory gains optional region param | minor | Keep | Old calls still work as-is; nothing to rewrite |
| Parameter renamed | product_id → item_id | moderate | Repair | Deterministic rewrite — find-and-replace in trace records |
| Parameter removed | check_inventory no longer accepts verbose flag | moderate | Repair | Strip the removed param from stored tool calls |
| Description changed | Tool description updated but schema identical | minor | Keep | No structural impact on stored calls |
| Required parameter added | refund_order now requires reason field | major | Replay | Old calls are missing a required field — re-run to capture it correctly |
| Type changed | quantity changed from string to integer | moderate | Repair | Cast the stored value to the new type during the data migration |
| Tool renamed | check_inventory → lookup_stock (new name, same capability) | major | Replay ¹ | Capability still exists under a new name; re-run captures the new tool usage |
| Tool removed entirely | get_pricing deleted from agent | major | Replay | Default replays major tool changes; override to drop if the capability is truly gone |
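The three Repair rows (rename, removal, type change) are all deterministic rewrites of stored tool-call arguments. The sketch below shows one way such a migration could look; `repair_args` and its parameters are hypothetical, and the sample call combines parameters from several table rows purely for illustration.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def repair_args(
    args: Dict[str, Any],
    renames: Optional[Dict[str, str]] = None,
    removed: Iterable[str] = (),
    casts: Optional[Dict[str, Callable[[Any], Any]]] = None,
) -> Dict[str, Any]:
    """Return a repaired copy of one stored tool call's arguments."""
    renames = renames or {}
    casts = casts or {}
    dropped = set(removed)
    repaired: Dict[str, Any] = {}
    for name, value in args.items():
        if name in dropped:
            continue  # parameter removed in the new schema
        new_name = renames.get(name, name)  # parameter renamed
        if new_name in casts:
            value = casts[new_name](value)  # type changed: cast the stored value
        repaired[new_name] = value
    return repaired


old_args = {"product_id": "SKU-1234", "quantity": "3", "verbose": True}
new_args = repair_args(
    old_args,
    renames={"product_id": "item_id"},
    removed=["verbose"],
    casts={"quantity": int},
)
print(new_args)  # {'item_id': 'SKU-1234', 'quantity': 3}
```

The function returns a copy instead of mutating the stored call, so the original trace record stays available for comparison.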
¹ Tool renames appear as a removal + addition to the compatibility engine; both land in the tool_registry surface as a major change. Under the default preset that resolves to Replay (re-running picks up the new tool name). Under the strict preset a major tool change resolves to Drop instead; switch presets or override individual traces in the Impact Report.
These are tool_registry surface actions under the default preset: minor → keep, moderate → repair, major → replay. The per-scenario rows above just show which severity bucket each kind of change lands in.
Prompt Changes
Prompt changes are classified by the text diff percentage between old and new versions (prompt_stack surface, default preset: minor → keep, moderate → flag, major → replay):
| Scenario | Diff Threshold | Severity | Default Action | Why |
|---|---|---|---|---|
| Typo fix | ≤5% diff | minor | Keep | Negligible impact on agent behavior |
| Paragraph added | 5–30% diff | moderate | Flag | Modest behavioral shift; a human should review, but data is likely still valid |
| Complete rewrite | >30% diff | major | Replay | Agent behavior fundamentally changed — old outputs don’t reflect new instructions |
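The thresholds in the table translate directly into a bucketing function. The sketch below is illustrative only: how DecimalAI computes the diff percentage is not described here, so `diff_percent` uses Python's `difflib` as a stand-in, and the "none" bucket for identical prompts is an assumption borrowed from the tool table.

```python
import difflib


def diff_percent(old: str, new: str) -> float:
    """Rough text diff percentage (assumption: 100 * (1 - similarity ratio))."""
    ratio = difflib.SequenceMatcher(None, old, new).ratio()
    return (1.0 - ratio) * 100.0


def prompt_severity(diff_pct: float) -> str:
    """Map a diff percentage to the severity buckets from the table."""
    if diff_pct <= 0:
        return "none"      # identical prompt (assumed bucket)
    if diff_pct <= 5:
        return "minor"     # <=5% diff, e.g. a typo fix
    if diff_pct <= 30:
        return "moderate"  # 5-30% diff, e.g. a paragraph added
    return "major"         # >30% diff, e.g. a complete rewrite


print(prompt_severity(diff_percent("You are a helpful agent.", "You are a helpful agent.")))  # none
```

The boundaries follow the table as written: exactly 5% counts as minor and exactly 30% counts as moderate.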
Model Changes
Model changes are classified on the model_runtime surface (default preset: minor → keep, moderate → flag, major → drop):
| Scenario | Example | Severity | Default Action | Why |
|---|---|---|---|---|
| Config tweak | Temperature 0.7 → 0.8 | minor | Keep | Same model, minor sampling change |
| Version bump (same family) | gpt-4o-2024-05 → gpt-4o-2024-08 | moderate | Flag | Same family, likely similar — but worth a human look |
| Different model (same provider) | gpt-4 → gpt-4o | major | Drop | Different model architecture — output distributions differ significantly |
| Provider change | gpt-4o → claude-sonnet-4-6 | major | Drop | Response distributions are fundamentally different — outputs from one model shouldn’t train another |
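The four rows can be read as a cascade of comparisons, from the coarsest difference to the finest. The sketch below assumes the model identity has already been split into provider, family, version, and sampling config; the `ModelRef` structure and `model_severity` function are hypothetical and do not reflect how DecimalAI parses model names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRef:
    provider: str       # e.g. "openai"
    family: str         # e.g. "gpt-4o"
    version: str        # e.g. "2024-05"
    temperature: float  # stands in for any sampling config


def model_severity(old: ModelRef, new: ModelRef) -> str:
    """Classify a model change using the buckets from the table."""
    if old.provider != new.provider or old.family != new.family:
        return "major"     # different model or different provider
    if old.version != new.version:
        return "moderate"  # version bump within the same family
    if old != new:
        return "minor"     # only the config changed
    return "none"


v1 = ModelRef("openai", "gpt-4o", "2024-05", 0.7)
v2 = ModelRef("openai", "gpt-4o", "2024-08", 0.7)
print(model_severity(v1, v2))  # moderate
```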
Drop vs Replay — The Decision Guide
The distinction between Drop and Replay is the most important decision in the policy. Here’s a simple framework:
| Question | If yes → | If no → |
|---|---|---|
| Is the trace’s input (user question) still meaningful for the new agent? | Replay | Drop |
| Could the new agent handle this request successfully? | Replay | Drop |
| Does the capability still exist in some form (renamed, moved, or replaced)? | Replay | Drop |
| Was the tool/feature permanently removed with no replacement? | Drop | Replay |
Examples
Replay: You renamed check_inventory to lookup_stock. A trace asking “Is SKU-1234 in stock?” is still a perfectly valid customer question, and the new agent can answer it using lookup_stock. Re-run the trace to capture the new tool usage.
Drop: You removed the get_pricing tool because pricing is now handled by a separate microservice your agent doesn’t access. A trace asking “What’s the price of SKU-1234?” can’t be answered by the new agent — the capability is gone. Drop the trace.
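One way to read the decision guide is as a conjunction: replay only when every question points to Replay. The guide lists the questions independently, so combining them with AND is an interpretation, and `drop_or_replay` is a hypothetical helper.

```python
def drop_or_replay(
    input_still_meaningful: bool,
    new_agent_can_handle: bool,
    capability_still_exists: bool,
) -> str:
    """Return "replay" only if all three checks pass, otherwise "drop"."""
    if input_still_meaningful and new_agent_can_handle and capability_still_exists:
        return "replay"
    return "drop"


# check_inventory renamed to lookup_stock: the question is still answerable.
print(drop_or_replay(True, True, True))     # replay
# get_pricing removed with no replacement: the capability is gone.
print(drop_or_replay(False, False, False))  # drop
```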
The Three Presets
DecimalAI ships with three policy presets. Each preset configures how severity levels map to actions for every surface:
Preset Comparison
| Surface | Severity | Strict | Default | Permissive |
|---|---|---|---|---|
| Tools | Minor (optional param, description) | Keep | Keep | Keep |
| Tools | Moderate (schema change, repairable) | Drop | Repair | Keep |
| Tools | Major (tool removed, required param added) | Drop | Replay | Flag |
| Prompts | Minor (≤5% diff) | Flag | Keep | Keep |
| Prompts | Moderate (5–30% diff) | Drop | Flag | Keep |
| Prompts | Major (>30% diff) | Drop | Replay | Flag |
| Model | Minor (config tweak) | Keep | Keep | Keep |
| Model | Moderate (version bump) | Drop | Flag | Keep |
| Model | Major (different model/provider) | Drop | Drop | Flag |
| Output Contract | Minor | Flag | Keep | Keep |
| Output Contract | Moderate | Drop | Repair | Keep |
| Output Contract | Major | Drop | Drop | Flag |
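The comparison table can be read as a nested lookup. The sketch below assumes the Tools, Prompts, Model, and Output Contract rows correspond to the tool_registry, prompt_stack, model_runtime, and output_contract surfaces; the `PRESETS` dict and `action_for` helper are illustrative names, not part of DecimalAI.

```python
def _rules(minor: str, moderate: str, major: str) -> dict:
    """Build one surface's severity-to-action mapping."""
    return {"on_minor": minor, "on_moderate": moderate, "on_major": major}


# Values transcribed from the Preset Comparison table.
PRESETS = {
    "strict": {
        "tool_registry": _rules("keep", "drop", "drop"),
        "prompt_stack": _rules("flag", "drop", "drop"),
        "model_runtime": _rules("keep", "drop", "drop"),
        "output_contract": _rules("flag", "drop", "drop"),
    },
    "default": {
        "tool_registry": _rules("keep", "repair", "replay"),
        "prompt_stack": _rules("keep", "flag", "replay"),
        "model_runtime": _rules("keep", "flag", "drop"),
        "output_contract": _rules("keep", "repair", "drop"),
    },
    "permissive": {
        "tool_registry": _rules("keep", "keep", "flag"),
        "prompt_stack": _rules("keep", "keep", "flag"),
        "model_runtime": _rules("keep", "keep", "flag"),
        "output_contract": _rules("keep", "keep", "flag"),
    },
}


def action_for(preset: str, surface: str, severity: str) -> str:
    """Look up the action for a severity; "none" always keeps (assumption)."""
    if severity == "none":
        return "keep"
    return PRESETS[preset][surface][f"on_{severity}"]


# The same major tool change resolves differently per preset.
for name in ("strict", "default", "permissive"):
    print(name, action_for(name, "tool_registry", "major"))
```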
When to Use Each Preset
- Strict
- Default
- Permissive
Strict
Use when: Training data purity is critical. You’re fine-tuning for production deployment and can’t afford any stale data.
Behavior: Aggressively drops traces when changes are detected. Only keeps traces that are fully compatible. No automatic repair.
Trade-off: Maximum data quality, minimum dataset size.
Why the Defaults Are Set This Way
Tools: major → replay (not drop)
Most tool “removals” are actually renames or replacements. The user’s question is still valid, and re-running it captures how the new agent handles the same request. That’s why the default preset maps tool_registry on_major to replay rather than drop. Teams that want stricter behavior can switch to the strict preset, where a major tool change drops instead.
Prompts: major → replay
A prompt rewrite changes agent behavior, not the domain. “How do I reset my password?” is still a valid question even if the agent’s personality, tone, and instructions changed completely. Re-running produces fresh training data aligned with the new instructions, so prompt_stack on_major defaults to replay.
Model: major → drop
Switching model providers (OpenAI → Anthropic) produces fundamentally different output distributions. Training model B on model A’s outputs is generally counterproductive: the response styles, reasoning patterns, and formatting differ too much. This is the one case where the old output is genuinely harmful to keep, so model_runtime on_major defaults to drop. (Distillation and synthetic traces are the exception; see the source-type overrides, which let those skip the model check.)
Output contract: major → drop
If the expected output format changed entirely (e.g., from plain text to structured JSON), old outputs can’t be used for training the new format. The data is structurally incompatible, so output_contract on_major defaults to drop. A moderate change (a field type shift) defaults to repair instead, since those are often mechanically fixable.
Customizing Policies
The fastest way to set a policy is the dashboard. For automation, the same configuration is available over the REST API. (There is no decimalai policy CLI command and no Python SDK helper; policies are configured through the dashboard or the API below.)
Using Presets (Dashboard)
Navigate to your agent’s Policy tab to configure rules visually:
- Select a preset as your starting point (strict, default, or permissive)
- Adjust individual surface rules using the dropdown matrix
- Preview the impact on your existing traces before saving
- Save to apply the policy
Using Presets (API)
Create a policy for a specific agent by POSTing a preset name. The available presets are listed at GET /api/v1/manifests/policies/presets, and the policy currently active for an agent is at GET /api/v1/manifests/policies/active?agent_name=support-agent.
Custom Per-Surface Rules
Override individual surfaces by passing rules_json instead of (or alongside) a preset. The keys are the manifest surface names; each maps severity (on_minor / on_moderate / on_major) to an action:
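As an illustration of that shape, the sketch below assembles and sanity-checks a request body locally without sending it. Only `rules_json`, the surface names, the `on_*` keys, and the action names come from this page; the `agent_name` and `preset` body fields and the `build_policy_body` helper are assumptions.

```python
import json
from typing import Dict, Optional

SURFACES = {"tool_registry", "prompt_stack", "model_runtime", "output_contract"}
SEVERITY_KEYS = {"on_minor", "on_moderate", "on_major"}
ACTIONS = {"keep", "repair", "flag", "replay", "drop"}


def build_policy_body(
    agent_name: str,
    preset: Optional[str] = None,
    rules_json: Optional[Dict[str, Dict[str, str]]] = None,
) -> dict:
    """Assemble a policy body and reject unknown surfaces, keys, or actions."""
    body: dict = {"agent_name": agent_name}  # assumed field name
    if preset is not None:
        body["preset"] = preset  # assumed field name
    if rules_json is not None:
        for surface, mapping in rules_json.items():
            if surface == "source_type_overrides":
                continue  # reserved key, not a surface; not validated here
            if surface not in SURFACES:
                raise ValueError(f"unknown surface: {surface!r}")
            for key, action in mapping.items():
                if key not in SEVERITY_KEYS or action not in ACTIONS:
                    raise ValueError(f"bad rule {surface}.{key} = {action!r}")
        body["rules_json"] = rules_json
    return body


# Start from the default preset, but never replay on prompt changes.
body = build_policy_body(
    "support-agent",
    preset="default",
    rules_json={"prompt_stack": {"on_minor": "keep", "on_moderate": "keep", "on_major": "flag"}},
)
print(json.dumps(body, indent=2))
```

Validating locally catches a misspelled surface or action before the request is made.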
rules_json field reference
The full per-surface policy. Keys are surface names; omit a surface to inherit the named preset. One reserved key, source_type_overrides, is not a surface (see below).
Source-Type Overrides
Distillation and synthetic traces can skip model compatibility checks, since they were generated by a teacher model and the student model’s identity doesn’t matter. Add a source_type_overrides block to rules_json:
Built-in distillation, synthetic, and manual source-type overrides for model_runtime (each set to keep on every severity) are already in place; pass your own block only when you need to widen or narrow them.
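The effect of an override can be sketched as a lookup that checks the trace's source type before falling back to the surface rule. The nesting used here (source type, then surface, then `on_*` key) is an assumption about the `source_type_overrides` shape, and `action_with_overrides` is a hypothetical helper.

```python
def action_with_overrides(rules_json: dict, source_type: str, surface: str, severity: str) -> str:
    """Resolve an action, letting a source-type override win over the surface rule."""
    key = f"on_{severity}"
    overrides = rules_json.get("source_type_overrides", {})
    override = overrides.get(source_type, {}).get(surface, {})
    if key in override:
        return override[key]
    return rules_json[surface][key]


rules = {
    "model_runtime": {"on_minor": "keep", "on_moderate": "flag", "on_major": "drop"},
    # Assumed shape: source type -> surface -> severity key -> action.
    "source_type_overrides": {
        "distillation": {
            "model_runtime": {"on_minor": "keep", "on_moderate": "keep", "on_major": "keep"}
        }
    },
}

print(action_with_overrides(rules, "distillation", "model_runtime", "major"))  # keep
print(action_with_overrides(rules, "production", "model_runtime", "major"))    # drop
```

The "production" source type in the second call is a placeholder for any trace without an override.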
To update an existing policy, PUT /api/v1/manifests/policies/{policy_id} with the same body shape.
How It Works End-to-End
- Detect — The SDK automatically captures your agent’s manifest (tools, prompts, model) on each run
- Diff — When a new manifest is detected, DecimalAI computes a field-level diff against the previous version
- Classify — Each trace is individually classified based on which components it actually used
- Review — The Impact Report shows the distribution and lets you override individual verdicts
- Act — Repair traces, replay stale ones, build clean datasets, or export replay prompts
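The Classify step depends on which components a trace actually used, so a change to a tool the trace never called should not affect it. The sketch below illustrates that idea for tools only; the "worst severity wins" rule and the `trace_tool_severity` helper are assumptions, not DecimalAI's documented behavior.

```python
from typing import Dict, Iterable

SEVERITY_ORDER = ["none", "minor", "moderate", "major"]


def trace_tool_severity(tools_used: Iterable[str], changed_tools: Dict[str, str]) -> str:
    """Return the worst severity among changes to tools this trace called."""
    worst = "none"
    for tool in tools_used:
        severity = changed_tools.get(tool, "none")  # untouched tools count as "none"
        if SEVERITY_ORDER.index(severity) > SEVERITY_ORDER.index(worst):
            worst = severity
    return worst


# Per-tool severities from a hypothetical diff between two manifests.
changed = {"check_inventory": "moderate", "get_pricing": "major"}

print(trace_tool_severity(["search_docs"], changed))                     # none
print(trace_tool_severity(["search_docs", "check_inventory"], changed))  # moderate
print(trace_tool_severity(["check_inventory", "get_pricing"], changed))  # major
```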
Next Steps
Manifests
How manifests get captured and what’s in the diff.
Versioning concepts
Component types, severity, verdicts.
Regression Check
Get the analysis as a PR comment.