Regression Check (GitHub Action)

DecimalAI’s regression check is a GitHub Action that runs on every PR with an agent change. It identifies the structural blast radius of your change — which historical production traces will break, which may behave differently, and which are unaffected — and posts the report as a PR comment. Unlike eval-based regression testing — which re-runs your eval cases against the new version and compares pass rates — DecimalAI computes a structural impact report against your real production traces. No eval cases to write, no provider keys in CI, no model calls. The diff between two manifests is the test; your traffic is the test set. This guide covers the full setup, what to expect, and how to customize.

How you’ll use it day-to-day

The whole flow is one-time setup, then it runs invisibly on every PR. Here’s the actual user journey:

Set it up once (~10 minutes)

Add a DECIMAL_API_KEY secret to your repo, drop a YAML file in .github/workflows/, commit. That’s it. You never touch this again unless you want to tune the fail-on threshold. Jump to setup →

Open a PR that changes your agent

Anything that touches your manifest:

Bump the model (gpt-4o-mini → gpt-4o)
Edit the system prompt
Add, remove, or rename a tool
Change a tool’s argument schema
Install or remove a skill

No special commit message, no manual trigger — the Action runs on every PR automatically.

Get an auto-posted PR comment in ~30 seconds

The Action posts an Impact Report directly on your PR: the manifest changes, a severity-classified breakdown of affected production traces, and the eval-weighted verdict. See the full sample below. The comment updates in place on every push to the branch, so you don’t accumulate stale comments. Click “View full report” to drill into individual affected traces in the dashboard. Every report carries two orthogonal verdicts: the structural impact (did the change touch traffic?) and the eval-weighted verdict (was that traffic working?).

Decide: merge, fix, or accept the break

Three things you can do with the result. See the decision tree below.

What to do with the result

Read the verdicts in order — structural first, then eval-weighted:

🟢 Structural is LOW IMPACT or no_change — just merge

The diff doesn’t intersect any historical traces (e.g. you added a tool nobody’s called yet). Safe to merge. The Action exits success and doesn’t block.

🟡 Structural is MEDIUM IMPACT — review the affected traces, then decide

Click “View full report” in the PR comment to see exactly which traces the diff touched. Two common cases:

The change is intentional (you rewrote the refund prompt because the old one was wrong): merge and accept that those traces will behave differently in production. The next time you build a dataset, those traces are now flagged as “may differ from current logic” — you decide whether to keep, repair, or replay them.
The change has unexpected reach: revert the offending part of the PR and re-push. The Action re-runs automatically.

Default fail-on: high means MEDIUM doesn’t block the PR — it’s informational.

🔴 Structural is HIGH IMPACT — the Action blocks the PR by default

Something will definitely change for affected traffic. Most common causes:

Tool removed that historical traces called → those traces can’t be replayed against the new manifest at all
Tool’s required argument changed → existing call signatures are now invalid
Skill removed that traces activated → behavior pipeline is gone

Options:

Fix it: re-add the tool / argument / skill, push, the check re-runs and unblocks.
Knowingly ship the break: flip fail-on to none (warn-only) for this run if you’ve decided the change is acceptable, then merge. Leaving a PR comment explaining why is good hygiene for your team’s review trail.

⚠️ Eval verdict is REGRESSION LIKELY — slow down

This is the verdict to take seriously regardless of structural severity. Even a “LOW IMPACT” structural change can carry REGRESSION LIKELY if the handful of affected traces were all currently passing eval — you’re quietly degrading working traffic. Open the full report and look at the Currently passing eval column on the affected-trace table: those are the traces that need a manual eyeball before merge.

ℹ️ Eval verdict is EXPECTED IMPACT — cleanup ships

The change touched production traces, but every affected trace was already failing eval. This is the canonical “intentional tool removal” or “deprecated skill cleanup” case: the structural diff is real, but the affected traffic wasn’t working anyway. Merge with low risk, and monitor post-deploy metrics for any unanticipated changes.

✅ Eval verdict is CLEAN — green light

No passing-eval traffic affected, no failing-eval traffic affected (either zero affected, or only unscored). The change is structurally contained. Merge confidently. The structural verdict tells you whether to read the affected-trace list at all.
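The three eval verdicts above reduce to a small decision rule over the affected traces. Here is a minimal sketch of that rule as described in this guide; the `AffectedCounts` class and `eval_verdict` function are hypothetical names for illustration, not part of the DecimalAI SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffectedCounts:
    """Eval state of the traces a change touches (hypothetical shape)."""
    passing: int   # affected traces currently passing eval
    failing: int   # affected traces currently failing eval
    unscored: int  # affected traces with no eval score


def eval_verdict(counts: AffectedCounts) -> str:
    # Any passing-eval trace in the blast radius is regression risk.
    if counts.passing > 0:
        return "REGRESSION LIKELY"
    # Only already-failing traffic was touched: intentional cleanup.
    if counts.failing > 0:
        return "EXPECTED IMPACT"
    # Zero affected traces, or only unscored ones.
    return "CLEAN"


print(eval_verdict(AffectedCounts(passing=3, failing=10, unscored=0)))
print(eval_verdict(AffectedCounts(passing=0, failing=10, unscored=2)))
print(eval_verdict(AffectedCounts(passing=0, failing=0, unscored=5)))
```

Note that the rule checks passing traces first: a single passing-eval trace in the blast radius is enough to get REGRESSION LIKELY, no matter how many failing traces sit next to it.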

“First run” verdict on my very first PR

Normal. On the first PR after installing, no baseline exists yet — the Action records your current manifest as the baseline and exits success. Every PR after this point gets a real diff.

How the two verdicts combine

The four combinations you’ll see most often:

Structural	Eval	What it means
HIGH IMPACT	REGRESSION LIKELY	Real regression — passing traffic will behave differently
HIGH IMPACT	EXPECTED IMPACT	Intentional cleanup — removing something that was broken
LOW IMPACT	REGRESSION LIKELY	Small-but-targeted change to working traffic — read carefully
LOW IMPACT	CLEAN	Nothing to see; merge

The structural verdict alone over-alerts on intentional changes. The eval verdict alone under-alerts when a change has a small surface but lands precisely on traffic that was working. You want both.
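If you want to act on the two verdicts in your own tooling (a Slack bot, a merge queue hook), the table above maps naturally onto a lookup. This is a sketch only: `recommended_action` is a hypothetical helper, and the fallback for unlisted combinations is an assumption, not documented behavior.

```python
# Readings taken from the table above, keyed by (structural, eval) verdict.
READINGS = {
    ("HIGH IMPACT", "REGRESSION LIKELY"):
        "Real regression: passing traffic will behave differently",
    ("HIGH IMPACT", "EXPECTED IMPACT"):
        "Intentional cleanup: removing something that was broken",
    ("LOW IMPACT", "REGRESSION LIKELY"):
        "Small-but-targeted change to working traffic: read carefully",
    ("LOW IMPACT", "CLEAN"):
        "Nothing to see; merge",
}


def recommended_action(structural: str, eval_verdict: str) -> str:
    # Assumption: combinations the table does not list fall back to a
    # manual review of the full report.
    return READINGS.get((structural, eval_verdict), "Review the full report")


print(recommended_action("HIGH IMPACT", "EXPECTED IMPACT"))
print(recommended_action("MEDIUM IMPACT", "CLEAN"))
```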

After merge: baseline rolls over automatically

When your PR merges and your production deploy registers the new manifest with the SDK (via decimalai.init() running in prod), that manifest becomes the new active baseline. Next PR’s check diffs against it. No manual baseline management — it tracks whatever’s actually deployed.

What it is (and what it isn’t)

What it is:

A pre-deploy structural impact analysis based on the manifest diff between your PR branch and the baseline (current production)
A GitHub Action that runs in your CI, takes ~30 seconds, costs <$0.001 per run
Manifest-aware — uses your production traces as the test set
A PR comment with severity-classified affected traces

What it isn’t:

Full agentic replay — by default the check is purely structural; it doesn’t run your agent or capture new outputs. (A preview of behavioral verification ships now via the Action’s behavioral-check input: for a model swap it re-issues one recorded model call per affected trace against the candidate model and diffs the outputs. It defaults to mock (no token spend); real is same-provider only. It’s a stateless single model call — it does not run your agent end-to-end.)
A replacement for evaluation — you can still run evals; this is complementary
A deployment gate — DecimalAI doesn’t deploy code; the PR-blocking is advisory and overridable

What’s in a manifest

The diff that drives every regression check operates on the manifest — DecimalAI’s content-addressed snapshot of your agent. A manifest captures six component types, each hashed independently:

Model

Provider, name, and version of the LLM the agent runs on (openai/gpt-4o, anthropic/claude-sonnet-4-6, etc.). Sampling params if pinned.

Prompt

System instructions, any few-shot examples, and templated variables, captured by content hash so that changing a single character changes the hash.

Tools

For each tool the agent can call: function name, argument schema, and return shape.

Skills

Installed workflow plugins (e.g. refund_policy_v3) — versioned bundles of prompts + tools + rules.

Subagents

Other agents this one can hand off to. Each sub-agent has its own manifest reference.

Output schema

The expected response shape (for structured-output agents). Catches breaking schema changes.

There’s also graph_topology_hash for multi-agent topologies (LangGraph, OpenAI Agents handoffs) and a workflow component type for orchestrator state machines — both captured automatically by the SDK’s framework adapters.

When any component’s content changes, the manifest’s overall hash changes. The regression check compares the two manifests component by component and labels every difference, so a PR that bumps the model, edits the prompt, and adds a tool reads as:

Component	Baseline	Candidate (this PR)	Diff
Model	`gpt-4o-mini`	`gpt-4o`	🔄 swapped
Prompt	v3 (`a3f7…`)	v4 (`9c21…`)	🔄 revised
Tools	`refund`, `lookup`	`refund`, `lookup`, `escalate`	➕ `escalate` added
Skills	`refund_policy_v3`	`refund_policy_v3`	no change
Subagents	—	—	no change
Output schema	—	—	no change
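To make the table above concrete, here is a rough sketch of a component-by-component diff. It assumes a manifest can be modeled as a plain dict, with set-valued components for tools and skills; the real manifest format and the `diff_components` helper are not part of the SDK, and the prompt values are the truncated hash prefixes from the table used as opaque placeholders.

```python
def diff_components(baseline: dict, candidate: dict) -> dict:
    """Label per-component differences between two manifests."""
    labels = {}
    for name in sorted(baseline.keys() | candidate.keys()):
        old, new = baseline.get(name), candidate.get(name)
        if old == new:
            labels[name] = "no change"
        elif isinstance(old, (set, frozenset)) and isinstance(new, (set, frozenset)):
            # Set-valued components (tools, skills) get per-item labels.
            parts = [f"added {item}" for item in sorted(new - old)]
            parts += [f"removed {item}" for item in sorted(old - new)]
            labels[name] = ", ".join(parts)
        else:
            # Scalar components (model, prompt hash) are simply swapped.
            labels[name] = f"changed {old} -> {new}"
    return labels


baseline = {
    "model": "gpt-4o-mini",
    "prompt": "a3f7",
    "tools": {"refund", "lookup"},
    "skills": {"refund_policy_v3"},
}
candidate = {
    "model": "gpt-4o",
    "prompt": "9c21",
    "tools": {"refund", "lookup", "escalate"},
    "skills": {"refund_policy_v3"},
}

for component, label in diff_components(baseline, candidate).items():
    print(f"{component}: {label}")
```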

Why it works this way

Four properties that fall out of pairing CI with manifest-aware versioning:

Deterministic diffing

Manifests are content-hashed. Two versions with the same tools + model + prompts always produce the same hash, so formatting or comment-only changes to your source code produce no false positives.
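Content hashing of this kind can be sketched in a few lines. The example below assumes SHA-256 over canonical JSON (sorted keys, fixed separators); this guide does not state which hash function or serialization DecimalAI actually uses, so treat `component_hash` and `manifest_hash` as illustrative only.

```python
import hashlib
import json


def component_hash(component: dict) -> str:
    """Hash a component by its content, independent of key order."""
    canonical = json.dumps(component, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def manifest_hash(components: dict) -> str:
    """Hash each component independently, then hash the set of hashes."""
    per_component = {name: component_hash(body) for name, body in components.items()}
    return component_hash(per_component)


a = {
    "model": {"provider": "openai", "name": "gpt-4o"},
    "prompt": {"system": "Be helpful."},
}
# Same content, different key order: the hash is identical.
b = {
    "prompt": {"system": "Be helpful."},
    "model": {"name": "gpt-4o", "provider": "openai"},
}
# One character of the prompt changed: the hash differs.
c = {
    "model": {"provider": "openai", "name": "gpt-4o"},
    "prompt": {"system": "Be helpful!"},
}

print(manifest_hash(a) == manifest_hash(b))
print(manifest_hash(a) == manifest_hash(c))
```

Hashing the runtime content rather than the source files is what makes the result insensitive to how the code is laid out.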

Per-trace attribution

Because every trace was ingested under a specific manifest, the check knows which traces were authored against the baseline and computes a per-trace verdict, not just an aggregate.

Reasoning-aware impact

The diff engine understands what changed (tool removed vs. tool argument renamed vs. model swapped) and applies different heuristics. A model swap is medium_risk; removing a tool you actually used is high_risk.

Replay-ready

Every medium_risk and high_risk trace is automatically eligible for replay against the candidate, so you can verify behavior changes empirically — not guess.

Prerequisites

You’ll need:

The DecimalAI Python SDK installed and your agent instrumented (see Quickstart steps 1–3)
At least some production traces ingested (the regression check has no signal until traces exist)
A GitHub repo with a DECIMAL_API_KEY secret configured

Setup

Step 1: Add a thin init entry point

DecimalAI extracts your agent’s manifest by running your existing initialization code in CI — same code path that registers manifests in production. You need a script that calls your agent factory and exits. Create scripts/init_for_decimal.py:

"""Entry point for DecimalAI manifest extraction in CI.

This file runs your existing agent initialization in 'manifest_only' mode.
The SDK captures tools, prompts, and models from the runtime objects
(no source-code parsing) and exits.
"""
from myapp.agent import build_agent  # adjust to your agent factory

if __name__ == "__main__":
    build_agent()

If you don’t have a single-function agent factory, write a thin wrapper:

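def build_agent():
    from langchain_openai import ChatOpenAI
    from langchain.agents import create_react_agent

    # search_tool, refund_tool, and load_prompt stand in for your own
    # tool objects and prompt loader; import them from your app code.
    llm = ChatOpenAI(model="gpt-4o")
    tools = [search_tool, refund_tool]
    prompt = load_prompt()
    return create_react_agent(llm, tools, prompt)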

Step 2: Add the GitHub Action workflow

Create .github/workflows/decimal.yml:

name: Decimal Manifest Impact
on: [pull_request]

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install -e .

      - name: Manifest extraction
        env:
          DECIMALAI_MODE: manifest_only
          DECIMAL_API_KEY: ${{ secrets.DECIMAL_API_KEY }}
          # Placeholder values for required env vars in your init code.
          # These are NOT called — manifest_only mode suppresses LLM calls.
          OPENAI_API_KEY: dummy_for_init
        run: python scripts/init_for_decimal.py

      - name: Impact check
        uses: decimal-labs/regression-check@v1
        with:
          api-key: ${{ secrets.DECIMAL_API_KEY }}
          agent-name: support-agent  # the same name you use in decimalai.init()

Step 3: Add the secret

In your GitHub repo: Settings → Secrets and variables → Actions → New repository secret

Name: DECIMAL_API_KEY
Value: your API key from app.decimal.ai/settings

That’s the entire setup. Open a PR and the Action will run.

What you’ll see

On your first PR after installation

DecimalAI has no baseline manifest registered for this agent yet. The Action records your candidate manifest as the baseline and posts:

✅ Decimal Manifest Impact — support-agent

First run for this agent. Recorded the current manifest as the baseline.
Future PRs will diff against this.

The PR passes. Subsequent PRs will get real impact reports.

On subsequent PRs

🔍 Decimal Manifest Impact — support-agent

Manifest diff vs baseline (h_current_abc):
  + tool: refund_order
  ~ tool: search_docs (schema added required 'language' param)
  ~ system_prompt (28% rewritten — refund handling section)
  - tool: compare_competitors

Impact on your last 30 days of production traces (2,002 total):

  🔴  HIGH IMPACT — 247 traces will change behavior
       247 traces called the removed `compare_competitors` tool.

  🟡  MEDIUM IMPACT — 501 traces may behave differently
       412 traces called `search_docs` without specifying `language`
       89 traces depended on the rewritten refund handling section

  🟢  LOW IMPACT — 1,254 traces unaffected

Affected by current eval state:
  ✅  183 currently passing eval  ← regression-risk surface
  ❌   41 currently failing eval  ← opportunity if your change fixes them
  —  1,778 unscored

Structural: 🔴 HIGH IMPACT — 247 traces affected.
Eval:       ⚠️ REGRESSION LIKELY — 183 passing-eval traces affected.

Sample affected traces:
  • trace_a8f2 — "How do I compare with competitor X?"
  • trace_b3c4 — "Search for refund policies"
  • trace_d9e1 — "Refund order #4521"

[View full impact report → https://app.decimal.ai/agents/support-agent/regression/r_xyz]

The text above is the copy-reference for what the comment contains. On a real PR it renders as native GitHub markdown. The full impact report renders in the dashboard alongside a regressions tab that lists every check run for the agent.

[Figure: Regressions tab showing pre-deploy manifest impact checks with severity badges]

How severity is determined

For each surface change in the manifest diff, DecimalAI runs a deterministic query against your trace store:

Surface change	Severity
Tool removed → traces that called this tool	🔴 HIGH
Tool schema added required param → traces missing the param	🔴 HIGH
Tool schema added optional param → traces that called this tool	🟢 LOW
Tool schema removed param → traces that passed the now-removed param	🟡 MEDIUM
Tool added (no historical traces affected)	🟢 LOW
Prompt section rewritten >X% → traces overlapping the changed section	🟡 MEDIUM
Skill removed → traces that activated this skill	🔴 HIGH
Skill modified → traces that activated this skill	🟡 MEDIUM
Model changed → all traces affected	🟡 MEDIUM (with caveat — see below)
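The rules in the table can be read as a per-trace classifier: apply every surface change to a trace and keep the worst severity. The sketch below assumes a simplified trace shape (a dict of tool name to the argument dicts it was called with, plus a set of activated skills); the change and trace formats are hypothetical, and the prompt-rewrite rule is left out because it depends on a threshold this guide does not specify.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


def change_severity(change: dict, trace: dict) -> Severity:
    """Severity of one surface change for one historical trace."""
    kind = change["kind"]
    calls = trace["tool_calls"].get(change.get("tool"), [])
    if kind == "tool_removed":
        return Severity.HIGH if calls else Severity.LOW
    if kind == "required_param_added":
        missing = any(change["param"] not in args for args in calls)
        return Severity.HIGH if missing else Severity.LOW
    if kind == "param_removed":
        passed = any(change["param"] in args for args in calls)
        return Severity.MEDIUM if passed else Severity.LOW
    if kind == "skill_removed":
        return Severity.HIGH if change["skill"] in trace["skills"] else Severity.LOW
    if kind == "skill_modified":
        return Severity.MEDIUM if change["skill"] in trace["skills"] else Severity.LOW
    if kind == "model_changed":
        return Severity.MEDIUM  # every trace is potentially affected
    return Severity.LOW  # tool added, optional param added


def classify_trace(trace: dict, changes: list) -> Severity:
    """A trace takes the worst severity across all changes in the diff."""
    return max((change_severity(c, trace) for c in changes), default=Severity.LOW)


changes = [
    {"kind": "tool_removed", "tool": "compare_competitors"},
    {"kind": "required_param_added", "tool": "search_docs", "param": "language"},
]
trace = {"tool_calls": {"search_docs": [{"query": "refund policies"}]}, "skills": set()}
print(classify_trace(trace, changes).name)
```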

Honest caveat for model swaps and large prompt rewrites: these are changes where structural reasoning can only say “everything may be affected.” The impact report will tell you “all N traces at risk” but cannot predict behavioral direction. For a model swap, the behavioral-check input (preview) closes part of this gap: it re-issues one recorded model call per affected trace against the candidate model and diffs the outputs. It defaults to mock (no token spend), and real is same-provider only. It is a stateless single model call, not your agent end-to-end. For large prompt rewrites (and to confirm real-world behavior after merge), we recommend a careful canary deploy and post-deploy bisect.

Configuration options

`with:` parameters on the Action

Parameter	Default	Description
`api-key`	(required)	Your DecimalAI API key
`agent-name`	(required)	Agent name used in `decimalai.init(agent_name=...)`
`fail-on`	`high`	When to fail the PR: `high` (any HIGH IMPACT), `medium` (any MEDIUM+), `none` (warn only)
`comment-mode`	`update`	`update` (single comment, updated per push) or `new` (new comment per push)
`trace-window-days`	(action default)	How many days of production traces to query for impact
`behavioral-check`	`mock`	Behavioral verification for model swaps (preview): `off`, `mock` (count eligible calls, no token spend), or `real` (re-issue one recorded model call per affected trace against the candidate and diff outputs; same-provider only)

Blocking behavior

The fail-on input is the gate. With the default fail-on: high, a HIGH IMPACT verdict fails the CI step. You unblock by fixing the PR (re-adding the tool/argument/skill) or by flipping fail-on to none for that run if you’ve decided the break is acceptable; fail-on: medium would still fail, because it gates on MEDIUM and above. The PR-blocking is advisory: GitHub branch protection, not DecimalAI, is what ultimately holds the merge.
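The gate logic described by the fail-on parameter can be sketched as a threshold comparison. `should_fail` is a hypothetical helper that mirrors the parameter descriptions in the table above, not the Action’s actual implementation.

```python
# Ordering of structural verdicts, lowest to highest.
_RANK = {"low": 0, "medium": 1, "high": 2}


def should_fail(structural: str, fail_on: str = "high") -> bool:
    """Return True when the CI step should fail for this verdict."""
    if fail_on == "none":
        return False  # warn-only mode never fails the step
    if fail_on not in ("medium", "high"):
        raise ValueError(f"unsupported fail-on value: {fail_on!r}")
    # Fail when the verdict is at or above the configured threshold.
    return _RANK[structural] >= _RANK[fail_on]


for verdict in ("low", "medium", "high"):
    print(verdict, should_fail(verdict), should_fail(verdict, "medium"))
```

Because the threshold is inclusive, fail-on: medium also fails on HIGH IMPACT verdicts; only fail-on: none lets a HIGH IMPACT change through.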

Behavioral verification (preview) — ships now. The behavioral-check input runs behavioral verification for model swaps. The structural check can only say a model change “may differ”; behavioral-check: real re-issues one recorded model call per affected trace against the candidate model and diffs the outputs. It’s a stateless single model call, not your agent end-to-end. It defaults to mock (counts eligible calls, no token spend); real is same-provider only and spends tokens. This only runs when the manifest diff contains a model change.

Deferred. Cross-provider model swaps (e.g. OpenAI → Anthropic) and full end-to-end agent replay are not in this preview. The shipped call-replay verifies a single recorded model call against the candidate of the same provider — it does not re-run your full agent loop.

Running outside CI

You can drive the same impact analysis from your laptop, a notebook, or any CI system that isn’t GitHub Actions — every snippet is pre-filled with a placeholder agent name and candidate manifest:

# After editing your agent code, register the candidate manifest:
python -c "
import decimalai
decimalai.init()
decimalai.flush_manifest_for_ci(agent_name='support-agent')
"

# Then run the check (auto-discovers the manifest ID from $GITHUB_OUTPUT
# or ./decimal_manifest_id.txt):
decimalai regression-check --agent-name support-agent

# Or pass an explicit manifest ID for any previously-registered version:
decimalai regression-check \
  --agent-name support-agent \
  --candidate-manifest-id mfst_xyz \
  --fail-on medium \
  --trace-window-days 60

Pass --dry-run (CLI) — or ?dry_run=true (curl) — to compute the report without persisting it or consuming your metered quota. Great for exploring “what would happen if I…” scenarios.
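If you call the CLI from a Python script (a notebook, or a non-GitHub CI runner), a thin wrapper keeps the flags in one place. This sketch uses only the flags shown above; `build_command` and `run_check` are hypothetical helper names, and how you interpret the CLI’s exit code is up to you.

```python
import shlex
import subprocess
from typing import List, Optional


def build_command(
    agent_name: str,
    candidate_manifest_id: Optional[str] = None,
    fail_on: Optional[str] = None,
    trace_window_days: Optional[int] = None,
    dry_run: bool = False,
) -> List[str]:
    """Assemble the `decimalai regression-check` argv from optional settings."""
    cmd = ["decimalai", "regression-check", "--agent-name", agent_name]
    if candidate_manifest_id is not None:
        cmd += ["--candidate-manifest-id", candidate_manifest_id]
    if fail_on is not None:
        cmd += ["--fail-on", fail_on]
    if trace_window_days is not None:
        cmd += ["--trace-window-days", str(trace_window_days)]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def run_check(cmd: List[str]) -> int:
    """Run the CLI and return its exit code (requires the CLI on PATH)."""
    return subprocess.run(cmd, check=False).returncode


# Print the command instead of running it, so this is safe to try anywhere.
print(shlex.join(build_command("support-agent", fail_on="medium", dry_run=True)))
```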

What can go wrong

“Manifest extraction failed”

Common causes:

Your init script imports something that requires a real API key, database connection, or other production resource
A dependency in your pip install failed
A required env var isn’t set

Behavior: The Action posts a “manifest extraction failed — see logs” warning and does not block the PR. Setup issues are not regressions; we won’t hold your PR hostage.

Fix: Make your build_agent() (or equivalent) callable in isolation. Mock or stub anything that needs production resources. The init only needs to register tools, prompts, and models with the SDK; it doesn’t need to actually run the agent.

“First run — no baseline”

Normal on the first PR after installing. The Action records your candidate manifest as the baseline and exits success. Future PRs will diff against this.

Empty or small impact reports

If the report shows “no traces affected” but you know your change should affect some traffic, check:

Have you ingested production traces for this agent? Manifest impact analysis needs historical traces to query.
Is the agent-name in your Action config the same as the one passed to decimalai.init()? Mismatched names produce empty results.
Has enough time passed since your manifest changes for traces to accumulate? Fresh changes need traffic.

Stale baseline

The baseline manifest updates only when you deploy a new manifest to production (registered via the SDK in normal mode). If your baseline is many versions behind your actual production state, your impact reports will look exaggerated. The fix: redeploy your agent so the SDK registers the current manifest in production and refreshes the baseline.

How this compares to eval-based regression checks

Other tools (LangSmith, Braintrust, Langfuse) implement regression checks by running your eval suite on the new version and comparing pass rates. DecimalAI works fundamentally differently:

	Eval-driven (other tools)	Manifest-aware (DecimalAI)
Requires writing eval cases	Yes — substantial ongoing work	No — production traces are the test set
Requires running your agent in CI	Yes — your CI runs the eval suite	No — pure trace-store query
Knows the blast radius	No — full eval suite runs every time	Yes — identifies exactly which traces touched the change
Catches removed-tool regressions	Only if eval coverage exists	Always
Cost per check	Variable ($)	<$0.001
Stochasticity issues	Yes (LLM-graded evals vary by run)	No (deterministic structural query)

The two approaches are complementary. Use both if you want behavioral verification on the eval surface; use DecimalAI alone if you don’t have an eval suite yet.

FAQ

What if I have no baseline yet?

On the very first run, the candidate becomes the baseline automatically. The verdict is first_run and no impacts are flagged. Subsequent PRs are diffed against it.

Does it cost tokens?

No. The diff is purely structural: it inspects the manifest content, not LLM outputs. The medium_risk and high_risk traces it flags can optionally be replayed against the candidate, which does cost tokens because it re-runs them through the model. The base regression check itself incurs no model cost, but it counts against your regression_checks plan limit.

What's the difference between `fail-on: high` and `fail-on: medium`?

fail-on: high (default) only fails the CI step on high_risk verdicts. Use this if you want to ship behavior changes deliberately. fail-on: medium is stricter: it fails on any medium_risk or high_risk verdict. Pick based on how aggressive your team wants the gate to be. fail-on: none always passes (warn-only mode).

Can I run the check on a branch that isn't open as a PR?

Yes. PR context is best-effort metadata for traceability. The check works on any candidate manifest — pass --candidate-manifest-id explicitly to the CLI (decimalai regression-check), or POST to /api/v1/regression-check from any script.

How is this different from running evals on a PR?

Evals score the output of running your candidate against fixed inputs; the regression check inspects the structure of the change to tell you which existing traces are affected — fast, free, no model calls. Most teams run the check on every PR and gate evals to the cases it flags medium_risk or higher. See How this compares to eval-based regression checks for the full side-by-side.

What’s next

Manifests Guide

What manifests capture and how diffs work under the hood.

Skills Guide

Skills are first-class manifest surfaces — they appear in impact reports.

Datasets Guide

Use the same manifest awareness for SFT data integrity.

Concepts

How traces, manifests, evals, and datasets connect.

Troubleshooting

Action not commenting, or says “no manifest”? Common fixes.
