How you’ll use it day-to-day
The whole flow is a one-time setup, after which it runs invisibly on every PR. Here’s the actual user journey:
Set it up once (~10 minutes)
Add a `DECIMAL_API_KEY` secret to your repo, drop a YAML file in `.github/workflows/`, and commit. That’s it. You never touch this again unless you want to tune the `fail-on` threshold. See Setup below.
Open a PR that changes your agent
- Bump the model (`gpt-4o-mini` → `gpt-4o`)
- Edit the system prompt
- Add, remove, or rename a tool
- Change a tool’s argument schema
- Install or remove a skill
Get an auto-posted PR comment in ~30 seconds
Decide: merge, fix, or accept the break
What to do with the result
Read the verdicts in order: structural first, then eval-weighted.
🟢 Structural is LOW IMPACT or no_change: just merge
🟡 Structural is MEDIUM IMPACT: review the affected traces, then decide
- The change is intentional (you rewrote the refund prompt because the old one was wrong): merge and accept that those traces will behave differently in production. The next time you build a dataset, those traces are now flagged as “may differ from current logic” — you decide whether to keep, repair, or replay them.
- The change has unexpected reach: revert the offending part of the PR and re-push. The Action re-runs automatically.
With the default `fail-on: high`, MEDIUM doesn’t block the PR; it’s informational.
🔴 Structural is HIGH IMPACT: the Action blocks the PR by default
- Tool removed that historical traces called → those traces can’t be replayed against the new manifest at all
- Tool’s required argument changed → existing call signatures are now invalid
- Skill removed that traces activated → behavior pipeline is gone
- Fix it: re-add the tool / argument / skill, push, the check re-runs and unblocks.
- Knowingly ship the break: flip `fail-on` to `none` (warn-only) for this run if you’ve decided the change is acceptable, then merge. Leaving a PR comment explaining why is good hygiene for your team’s review trail.
⚠️ Eval verdict is REGRESSION LIKELY: slow down
Check the Currently passing eval column on the affected-trace table: those are the traces that need a manual eyeball before merge.
ℹ️ Eval verdict is EXPECTED IMPACT: cleanup ships
✅ Eval verdict is CLEAN: green light
“First run” verdict on my very first PR
How the two verdicts combine
The 2×2 you’ll see most often:

| Structural | Eval | What it means |
|---|---|---|
| HIGH IMPACT | REGRESSION LIKELY | Real regression — passing traffic will behave differently |
| HIGH IMPACT | EXPECTED IMPACT | Intentional cleanup — removing something that was broken |
| LOW IMPACT | REGRESSION LIKELY | Small-but-targeted change to working traffic — read carefully |
| LOW IMPACT | CLEAN | Nothing to see; merge |
After merge: baseline rolls over automatically
When your PR merges and your production deploy registers the new manifest with the SDK (via `decimalai.init()` running in prod), that manifest becomes the new active baseline. The next PR’s check diffs against it. There is no manual baseline management; the baseline tracks whatever’s actually deployed.
What it is (and what it isn’t)
What it is:
- A pre-deploy structural impact analysis based on the manifest diff between your PR branch and the baseline (current production)
- A GitHub Action that runs in your CI, takes ~30 seconds, costs <$0.001 per run
- Manifest-aware — uses your production traces as the test set
- A PR comment with severity-classified affected traces
What it isn’t:
- Full agentic replay. By default the check is purely structural; it doesn’t run your agent or capture new outputs. (A preview of behavioral verification ships now via the Action’s `behavioral-check` input: for a model swap it re-issues one recorded model call per affected trace against the candidate model and diffs the outputs. It defaults to `mock` (no token spend); `real` is same-provider only. It’s a stateless single model call; it does not run your agent end-to-end.)
- A replacement for evaluation. You can still run evals; this is complementary.
- A deployment gate. DecimalAI doesn’t deploy code; the PR-blocking is advisory and overridable.
What’s in a manifest
The diff that drives every regression check operates on the manifest, DecimalAI’s content-addressed snapshot of your agent. A manifest captures six component types, each hashed independently:
- Model (`openai/gpt-4o`, `anthropic/claude-sonnet-4-6`, etc.). Sampling params if pinned.
- Prompt
- Tools
- Skills (e.g. `refund_policy_v3`): versioned bundles of prompts + tools + rules.
- Subagents
- Output schema
Manifests also carry a `graph_topology_hash` for multi-agent topologies (LangGraph, OpenAI Agents handoffs) and a `workflow` component type for orchestrator state machines; both are captured automatically by the SDK’s framework adapters.

| Component | Baseline | Candidate (this PR) | Diff |
|---|---|---|---|
| Model | gpt-4o-mini | gpt-4o | 🔄 swapped |
| Prompt | v3 (a3f7…) | v4 (9c21…) | 🔄 revised |
| Tools | refund, lookup | refund, lookup, escalate | ➕ escalate added |
| Skills | refund_policy_v3 | refund_policy_v3 | no change |
| Subagents | — | — | no change |
| Output schema | — | — | no change |
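To make the "content-addressed, each component hashed independently" idea concrete, here is a minimal sketch of per-component hashing and diffing. It is an illustration only, not DecimalAI's implementation: the canonical-JSON serialization, the SHA-256 choice, the function names, and the manifest field names are all assumptions made for this example.

```python
import hashlib
import json


def component_hash(value) -> str:
    """Hash one component from a canonical JSON form (sorted keys, fixed separators).

    Canonical serialization is what makes the hash deterministic: the same
    content always produces the same digest, regardless of key order.
    """
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def manifest_hashes(manifest: dict) -> dict:
    """Hash every component independently, keyed by component name."""
    return {name: component_hash(value) for name, value in manifest.items()}


def diff_manifests(baseline: dict, candidate: dict) -> dict:
    """Compare two manifests component by component using only their hashes."""
    base_h = manifest_hashes(baseline)
    cand_h = manifest_hashes(candidate)
    report = {}
    for name in sorted(set(base_h) | set(cand_h)):
        if name not in base_h:
            report[name] = "added"
        elif name not in cand_h:
            report[name] = "removed"
        elif base_h[name] != cand_h[name]:
            report[name] = "changed"
        else:
            report[name] = "no change"
    return report


# Hypothetical manifests loosely mirroring the example table above.
baseline = {
    "model": "gpt-4o-mini",
    "prompt": "You are a refund assistant. (v3)",
    "tools": ["lookup", "refund"],
    "skills": ["refund_policy_v3"],
    "subagents": [],
    "output_schema": None,
}
candidate = {
    **baseline,
    "model": "gpt-4o",
    "prompt": "You are a refund assistant. (v4)",
    "tools": ["escalate", "lookup", "refund"],
}

report = diff_manifests(baseline, candidate)
# model, prompt and tools are reported as "changed"; the other three as "no change".
```

Because each component has its own hash, an unchanged component can be ruled out without inspecting its contents, and the diff gives the same answer on every run.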
Why it works this way
Four properties that fall out of pairing CI with manifest-aware versioning:
- Deterministic diffing
- Per-trace attribution
- Reasoning-aware impact: … `medium_risk`, removing a tool you actually used is `high_risk`.
- Replay-ready: every `medium_risk` and `high_risk` trace is automatically eligible for replay against the candidate, so you can verify behavior changes empirically instead of guessing.
Prerequisites
You’ll need:
- The DecimalAI Python SDK installed and your agent instrumented (see Quickstart steps 1–3)
- At least some production traces ingested (the regression check has no signal until traces exist)
- A GitHub repo with a `DECIMAL_API_KEY` secret configured
Setup
Step 1: Add a thin init entry point
DecimalAI extracts your agent’s manifest by running your existing initialization code in CI, the same code path that registers manifests in production. You need a script that calls your agent factory and exits. Create it at `scripts/init_for_decimal.py`.
Step 2: Add the GitHub Action workflow
Create `.github/workflows/decimal.yml`.
Step 3: Add the secret
In your GitHub repo: Settings → Secrets and variables → Actions → New repository secret
- Name: `DECIMAL_API_KEY`
- Value: your API key from app.decimal.ai/settings
What you’ll see
On your first PR after installation
DecimalAI has no baseline manifest registered for this agent yet. The Action records your candidate manifest as the baseline and posts a first-run comment.
On subsequent PRs

How severity is determined
For each surface change in the manifest diff, DecimalAI runs a deterministic query against your trace store:

| Surface change | Severity |
|---|---|
| Tool removed → traces that called this tool | 🔴 HIGH |
| Tool schema added required param → traces missing the param | 🔴 HIGH |
| Tool schema added optional param → traces that called this tool | 🟢 LOW |
| Tool schema removed param → traces that passed the now-removed param | 🟡 MEDIUM |
| Tool added (no historical traces affected) | 🟢 LOW |
| Prompt section rewritten >X% → traces overlapping the changed section | 🟡 MEDIUM |
| Skill removed → traces that activated this skill | 🔴 HIGH |
| Skill modified → traces that activated this skill | 🟡 MEDIUM |
| Model changed → all traces affected | 🟡 MEDIUM (with caveat — see below) |
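The table can be read as a lookup from surface change to severity, paired with a query that selects the traces each change touches. The sketch below illustrates that shape with an in-memory list standing in for the trace store. The change-kind keys, the `Trace` structure, and both query helpers are hypothetical names invented for this example, not DecimalAI's schema or API.

```python
from dataclasses import dataclass, field

# Severity per surface change, transcribed from the table above.
# The keys are hypothetical labels chosen for this sketch.
SEVERITY = {
    "tool_removed": "HIGH",
    "tool_required_param_added": "HIGH",
    "tool_optional_param_added": "LOW",
    "tool_param_removed": "MEDIUM",
    "tool_added": "LOW",
    "prompt_section_rewritten": "MEDIUM",
    "skill_removed": "HIGH",
    "skill_modified": "MEDIUM",
    "model_changed": "MEDIUM",
}


@dataclass
class Trace:
    """Simplified trace: one recorded argument dict per tool name (an assumption)."""
    trace_id: str
    tool_calls: dict = field(default_factory=dict)  # tool name -> args passed


def affected_by_tool_removal(traces, tool):
    """Traces that called a tool the candidate manifest no longer has."""
    return [t.trace_id for t in traces if tool in t.tool_calls]


def affected_by_new_required_param(traces, tool, param):
    """Traces that called the tool without the newly required parameter."""
    return [
        t.trace_id
        for t in traces
        if tool in t.tool_calls and param not in t.tool_calls[tool]
    ]


traces = [
    Trace("t1", {"refund": {"order_id": "A1"}}),
    Trace("t2", {"lookup": {"order_id": "A2"}}),
    Trace("t3", {"refund": {"order_id": "A3", "reason": "damaged"}}),
]

removed = affected_by_tool_removal(traces, "refund")
missing_reason = affected_by_new_required_param(traces, "refund", "reason")
# removed lists t1 and t3; missing_reason lists only t1.
```

Both helpers are plain filters over recorded data, which is why the same diff against the same traces always yields the same affected set.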
Caveat for model changes: the structural check can only say that a model change “may differ”. The `behavioral-check` input (preview) closes part of this gap: it re-issues one recorded model call per affected trace against the candidate model and diffs the outputs. It defaults to `mock` (no token spend), and `real` is same-provider only; it is a stateless single model call, not your agent end-to-end. For large prompt rewrites (and to confirm real-world behavior after merge), we recommend a careful canary deploy and post-deploy bisect.
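As a rough mental model of the three `behavioral-check` modes, the sketch below counts eligible calls in `mock` mode and, in `real` mode, re-issues each recorded call against a candidate and diffs the output. Everything here is an assumption for illustration: the record fields (`trace_id`, `prompt`, `output`), the return shape, and the `candidate` callable, which stands in for a model call so the example runs without any provider.

```python
import difflib
from typing import Callable, Optional


def behavioral_check(
    recorded_calls: list,
    mode: str = "mock",
    candidate: Optional[Callable[[str], str]] = None,
):
    """Sketch of off / mock / real behavior for one recorded call per affected trace."""
    if mode == "off":
        return None  # verification skipped entirely
    if mode == "mock":
        # Count what would be re-issued; no model is called, so no tokens are spent.
        return {"eligible": len(recorded_calls), "diffs": {}}
    if mode != "real":
        raise ValueError(f"unknown mode: {mode!r}")
    if candidate is None:
        raise ValueError("real mode needs a candidate model callable")

    diffs = {}
    for call in recorded_calls:
        # One stateless call per trace: the recorded prompt goes to the candidate.
        new_output = candidate(call["prompt"])
        if new_output != call["output"]:
            diffs[call["trace_id"]] = list(
                difflib.unified_diff(
                    call["output"].splitlines(),
                    new_output.splitlines(),
                    fromfile="baseline",
                    tofile="candidate",
                    lineterm="",
                )
            )
    return {"eligible": len(recorded_calls), "diffs": diffs}


def fake_candidate(prompt: str) -> str:
    """Stand-in for the candidate model; a real run would call the provider."""
    return "Refund approved." if "refund" in prompt else "OK"


recorded = [
    {"trace_id": "t1", "prompt": "Customer asks for a refund", "output": "Refund approved."},
    {"trace_id": "t2", "prompt": "Customer asks for order status", "output": "Your order shipped."},
]

mock_result = behavioral_check(recorded, mode="mock")
real_result = behavioral_check(recorded, mode="real", candidate=fake_candidate)
# mock_result counts 2 eligible calls with no diffs; real_result has a diff for t2 only.
```

The sketch never runs tools or multi-step loops, which matches the "single model call, not your agent end-to-end" limit described above.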
Configuration options
with: parameters on the Action
| Parameter | Default | Description |
|---|---|---|
| `api-key` | (required) | Your DecimalAI API key |
| `agent-name` | (required) | Agent name used in `decimalai.init(agent_name=...)` |
| `fail-on` | `high` | When to fail the PR: `high` (any HIGH IMPACT), `medium` (any MEDIUM+), `none` (warn only) |
| `comment-mode` | `update` | `update` (single comment, updated per push) or `new` (new comment per push) |
| `trace-window-days` | (action default) | How many days of production traces to query for impact |
| `behavioral-check` | `mock` | Behavioral verification for model swaps (preview): `off`, `mock` (count eligible calls, no token spend), or `real` (re-issue one recorded model call per affected trace against the candidate and diff outputs; same-provider only) |
Blocking behavior
The `fail-on` input is the gate. With the default `fail-on: high`, a HIGH IMPACT verdict fails the CI step. You unblock by fixing the PR (re-adding the tool/argument/skill) or by flipping `fail-on` to `none` for that run if you’ve decided the break is acceptable. Note that `medium` is stricter than `high`, so it does not unblock a HIGH IMPACT verdict. The PR-blocking is advisory: GitHub branch protection, not DecimalAI, is what ultimately holds the merge.
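The gate reduces to a threshold comparison over the severities in the report. A minimal sketch, assuming the three `fail-on` values and severity levels from the parameter table; the function and constant names are invented for this example and are not part of the Action:

```python
# Ordering of severities, lowest to highest.
SEVERITY_RANK = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}

# Minimum severity rank that fails the step for each fail-on value.
# None means warn-only: the step never fails.
THRESHOLD = {"high": 3, "medium": 2, "none": None}


def should_fail(fail_on: str, severities) -> bool:
    """Return True when any reported severity meets the fail-on threshold."""
    if fail_on not in THRESHOLD:
        raise ValueError(f"unknown fail-on value: {fail_on!r}")
    threshold = THRESHOLD[fail_on]
    if threshold is None:
        return False
    return any(SEVERITY_RANK[s] >= threshold for s in severities)


default_blocks_high = should_fail("high", ["LOW", "HIGH"])
default_allows_medium = should_fail("high", ["LOW", "MEDIUM"])
medium_blocks_high = should_fail("medium", ["HIGH"])
none_never_blocks = should_fail("none", ["HIGH"])
```

Under this reading, `medium` is a stricter setting than `high`, so only `none` lets a HIGH IMPACT report pass.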
The `behavioral-check` input runs behavioral verification for model swaps. The structural check can only say a model change “may differ”; `behavioral-check: real` re-issues one recorded model call per affected trace against the candidate model and diffs the outputs. It’s a stateless single model call, not your agent end-to-end. It defaults to `mock` (counts eligible calls, no token spend); `real` is same-provider only and spends tokens. This only runs when the manifest diff contains a model change.
Running outside CI
You can drive the same impact analysis from your laptop, a notebook, or any CI system that isn’t GitHub Actions. Every snippet is pre-filled with a placeholder agent name and candidate manifest.
What can go wrong
“Manifest extraction failed”
Common causes:
- Your init script imports something that requires a real API key, database connection, or other production resource
- A dependency in your `pip install` failed
- A required env var isn’t set
The fix: make `build_agent()` (or equivalent) callable in isolation. Mock or stub anything that needs production resources. The init only needs to register tools, prompts, and models with the SDK; it doesn’t need to actually run the agent.
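One way to make a factory callable in isolation is to let it accept its production dependencies as arguments, so CI can pass a stub. The sketch below shows the pattern in plain Python. `StubOrderDB`, `connect_real_db`, the `ORDERS_DB_DSN` variable, and the shape of the returned agent are all hypothetical; a real init script would also call `decimalai.init(agent_name=...)`, which is left out so the example runs without the SDK.

```python
import os


class StubOrderDB:
    """Stand-in for a production database client; never opens a connection."""

    def lookup(self, order_id: str) -> dict:
        return {"order_id": order_id, "status": "stub"}


def connect_real_db():
    """Hypothetical production path: needs a DSN that CI typically lacks."""
    dsn = os.environ["ORDERS_DB_DSN"]  # raises KeyError when the variable is unset
    raise NotImplementedError(f"real connection to {dsn} is out of scope for this sketch")


def build_agent(db=None) -> dict:
    """Hypothetical agent factory: wires the model and tools, runs nothing.

    Passing `db` lets CI inject a stub; leaving it as None takes the
    production path and connects to the real database.
    """
    if db is None:
        db = connect_real_db()
    return {
        "model": "gpt-4o",
        "tools": {"lookup": db.lookup},
    }


# What a CI init script could do: build with a stub, then exit.
agent = build_agent(db=StubOrderDB())
```

The stub only has to satisfy the factory's wiring, since manifest extraction registers components and never executes a tool call.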
“First run: no baseline”
Normal on the first PR after installing. The Action records your candidate manifest as the baseline and exits with success. Future PRs will diff against this.
Empty or small impact reports
If the report shows “no traces affected” but you know your change should affect some traffic, check:
- Have you ingested production traces for this agent? Manifest impact analysis needs historical traces to query.
- Is the `agent-name` in your Action config the same as the one passed to `decimalai.init()`? Mismatched names produce empty results.
- Has enough time passed since your manifest changes for traces to accumulate? Fresh changes need traffic.
Stale baseline
The baseline manifest updates only when you deploy a new manifest to production (registered via the SDK in normal mode). If your baseline is many versions behind your actual production state, your impact reports will look exaggerated. The fix: re-deploy the SDK in production to refresh the baseline.
How this compares to eval-based regression checks
Other tools (LangSmith, Braintrust, Langfuse) implement regression checks by running your eval suite on the new version and comparing pass rates. DecimalAI works fundamentally differently:

| | Eval-driven (other tools) | Manifest-aware (DecimalAI) |
|---|---|---|
| Requires writing eval cases | Yes — substantial ongoing work | No — production traces are the test set |
| Requires running your agent in CI | Yes — your CI runs the eval suite | No — pure trace-store query |
| Knows the blast radius | No — full eval suite runs every time | Yes — identifies exactly which traces touched the change |
| Catches removed-tool regressions | Only if eval coverage exists | Always |
| Cost per check | Variable ($) | <$0.001 |
| Stochasticity issues | Yes (LLM-graded evals vary by run) | No (deterministic structural query) |
FAQ
What if I have no baseline yet?
The Action records your candidate manifest as the baseline and reports `first_run`; no impacts are flagged. Subsequent PRs are diffed against it.
Does it cost tokens?
Not by default. The `medium_risk` and `high_risk` traces it flags can optionally be replayed against the candidate, which costs tokens since it re-runs them through the model. The base regression check itself is free in terms of model cost; it counts against your `regression_checks` plan limit.
What's the difference between `fail-on: high` and `fail-on: medium`?
`fail-on: high` (default) only fails the CI step on `high_risk` verdicts; use this if you want to ship behavior changes deliberately. `fail-on: medium` is stricter: it also fails on `medium_risk` verdicts. Pick based on how aggressive your team wants the gate to be. `fail-on: none` always passes (warn-only mode).
Can I run the check on a branch that isn't open as a PR?
Yes. Pass `--candidate-manifest-id` explicitly to the CLI (`decimalai regression-check`), or POST to `/api/v1/regression-check` from any script.
How is this different from running evals on a PR?
medium_risk or higher. See How this compares to eval-based regression checks for the full side-by-side.