Post-deploy Bisect

You shipped a change. A few hours later, a metric tanked — pass rate dropped, latency spiked, or your error budget burned through. You don’t know which deploy caused it because you shipped three changes today. The Post-deploy Bisect view answers that question: which manifest version introduced this regression, and what changed between it and the previous version? This is the post-deploy companion to the Regression Check Action — together they form the full safety net.

When to use this

A metric dropped suddenly

Pass rate, latency, error rate, or a custom evaluator score is materially worse than yesterday.

Multiple deploys today

You shipped 3 PRs in a row. The regression check passed on each. You need to identify which one broke things.

The pre-deploy check was clean

The structural diff said “no high-risk traces” — but behavior changed anyway. Common when the change was a model swap or a large prompt rewrite.

You want to confirm a rollback worked

You reverted. Verify the metric recovered by clicking back through the timeline.

How it works

The bisect view computes one metric value per manifest version across your trace history, then highlights the version where the metric dropped by more than a configurable threshold (default: 10% relative drop).

[Figure: illustrative bisect chart. v1.0–v1.2 hold steady in the low 90s; v1.3 drops to 68% (−25% vs baseline) and v1.4 stays regressed. The cliff at the v1.2 → v1.3 transition is the regression boundary.]

The regression boundary, meaning the transition from healthy to bad, gives you the candidate manifest version. In the example above, the view then surfaces the diff between v1.2 and v1.3: which tools were added or removed, which prompts changed, and whether the model was swapped. This is the same diff data as the pre-deploy check, just retrieved retroactively.
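The boundary detection described above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the function name is hypothetical, the per-version values are made up to match the illustrative chart, and the baseline is assumed to be the mean of all earlier versions (the docs describe the real baseline only as "rolling").

```python
from statistics import mean

def find_regression_boundary(series, threshold=0.10):
    """Return (last_healthy, first_regressed, relative_drop), or None.

    `series` is an ordered list of (version, metric_value) pairs, oldest first.
    ASSUMPTION: the baseline is the mean of every earlier version.
    """
    for i in range(1, len(series)):
        version, value = series[i]
        baseline = mean(v for _, v in series[:i])
        if baseline <= 0:
            continue  # a relative drop is undefined without a positive baseline
        drop = (baseline - value) / baseline
        if drop > threshold:
            return series[i - 1][0], version, drop
    return None

# Illustrative values only: low 90s through v1.2, then the cliff at v1.3.
pass_rate_by_version = [
    ("v1.0", 0.91), ("v1.1", 0.92), ("v1.2", 0.91), ("v1.3", 0.68), ("v1.4", 0.69),
]
boundary = find_regression_boundary(pass_rate_by_version)
if boundary is not None:
    healthy, regressed, drop = boundary
    print(f"boundary: {healthy} -> {regressed} ({drop:.1%} relative drop)")
```

The sketch stops at the first flagged version, so v1.4 (still regressed) does not produce a second boundary.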

Bisect uses observed trace metrics, not behavioral replay. No tokens are spent re-running traces — it’s purely a query over data you’ve already ingested.

Step by step

Open the regression timeline for the affected agent

Navigate to Agents → <your-agent> → Regression. The default view shows the pass rate over the trailing window, broken down by manifest version. If your metric isn’t pass rate, switch via the metric picker (top right). The timeline supports these metrics:

Pass rate (pass_rate)
Error rate (error_rate) — % of traces with status=error
Cost (cost) — spend per trace
p95 latency (latency_p95)
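To make the four metrics concrete, here is a rough sketch of how each could be derived from raw traces for a single manifest version. The trace field names (`passed`, `status`, `cost_usd`, `latency_ms`) are hypothetical, and the p95 uses the nearest-rank convention, which may differ from what the timeline actually uses.

```python
import math

def p95(values):
    """95th percentile by the nearest-rank method (one of several conventions)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

def summarize(traces):
    """Compute the four timeline metrics for one manifest version.

    ASSUMPTION: each trace is a dict with hypothetical keys `passed` (bool),
    `status` (str), `cost_usd` (float), and `latency_ms` (float).
    """
    n = len(traces)
    if n == 0:
        raise ValueError("no traces for this version")
    return {
        "pass_rate": sum(t["passed"] for t in traces) / n,
        "error_rate": sum(t["status"] == "error" for t in traces) / n,
        "cost": sum(t["cost_usd"] for t in traces) / n,  # mean spend per trace
        "latency_p95": p95([t["latency_ms"] for t in traces]),
    }

# Made-up traces for illustration.
traces = [
    {"passed": True, "status": "ok", "cost_usd": 0.002, "latency_ms": 800},
    {"passed": True, "status": "ok", "cost_usd": 0.004, "latency_ms": 950},
    {"passed": False, "status": "error", "cost_usd": 0.001, "latency_ms": 2400},
    {"passed": True, "status": "ok", "cost_usd": 0.005, "latency_ms": 1100},
]
metrics = summarize(traces)
print(metrics)
```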

Identify the regression boundary

The view auto-flags the first version where the metric dropped by more than 10% relative to the rolling baseline. The flagged version appears with a red ⚠ REGRESSION badge and the magnitude of the drop. Click the badge to expand the diff: which manifest components changed between this version and the previous healthy version.

Inspect the suspect change

The expanded panel shows the same severity-tagged surface changes you’d see on a pre-deploy check:

🔴 Tools added/removed
🟡 Prompts revised
🟡 Model swapped
🟢 Cosmetic changes (whitespace, comments)

Cross-reference against your Git history — the manifest hash is in the commit metadata if you used the GitHub Action.
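If you want to reason about the severity tags offline, the classification can be approximated as below. This is a sketch under stated assumptions: the manifest shape (`tools`, `prompts`, `model`) and the severity labels are hypothetical stand-ins, and treating whitespace-only prompt edits as cosmetic is a guess at how the real diff behaves.

```python
def diff_manifests(old, new):
    """Return (severity, description) pairs for changes between two manifests.

    ASSUMPTION: a manifest is a dict with hypothetical keys `tools`
    (list of names), `prompts` (name -> text), and `model` (str).
    """
    changes = []
    old_tools, new_tools = set(old["tools"]), set(new["tools"])
    for name in sorted(new_tools - old_tools):
        changes.append(("red", f"tool added: {name}"))
    for name in sorted(old_tools - new_tools):
        changes.append(("red", f"tool removed: {name}"))
    for name in sorted(set(old["prompts"]) | set(new["prompts"])):
        before, after = old["prompts"].get(name), new["prompts"].get(name)
        if before == after:
            continue
        # Whitespace-only edits count as cosmetic; anything else, including
        # a prompt that was added or removed, is reported as a revision.
        if before is not None and after is not None and before.split() == after.split():
            changes.append(("green", f"prompt whitespace changed: {name}"))
        else:
            changes.append(("yellow", f"prompt revised: {name}"))
    if old["model"] != new["model"]:
        changes.append(("yellow", f"model swapped: {old['model']} -> {new['model']}"))
    return changes

# Made-up manifests for illustration.
v1_2 = {"tools": ["search_kb", "create_ticket"],
        "prompts": {"system": "You are a support agent."},
        "model": "model-a"}
v1_3 = {"tools": ["search_kb"],
        "prompts": {"system": "You are a  support agent."},
        "model": "model-b"}
changes = diff_manifests(v1_2, v1_3)
for severity, description in changes:
    print(severity, description)
```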

Decide: rollback, fix-forward, or accept

Three paths from here:

Rollback: redeploy the previous manifest. The system picks it up automatically; the next ingested trace registers under the old version.
Fix forward: push a follow-up PR. The pre-deploy check will run on it; once merged and deployed, the timeline will show whether the metric recovered.
Accept: sometimes the regression is intentional (you’re trading latency for accuracy). Mark the version “expected” via the Acknowledge button; it stops flagging without changing the data.

Setting up regression alerts

If you want the system to ping you the moment a regression is detected, rather than discovering it on your next dashboard visit, configure an alert. Create one from the dashboard’s Regression tab, or via the REST API:

curl -X POST https://api.decimal.ai/api/v1/regression-alerts \
  -H "Authorization: Bearer dai_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "agent_name": "support-agent",
    "metric": "pass_rate",
    "threshold_drop": 0.10,
    "notify_webhook": "https://hooks.slack.com/services/..."
  }'

The alert lifecycle (acknowledge / resolve) is available at POST /api/v1/regression-alerts/{id}/acknowledge and POST /api/v1/regression-alerts/{id}/resolve. Alerts fire when a new manifest version’s metric drops more than threshold_drop (relative) below the trailing 7-day baseline. The webhook payload follows the regression.detected event schema.
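As a rough sketch, the two lifecycle calls could be scripted like this. The endpoint paths, host, and bearer-token header come from this page; the helper name and the alert ID `alert_123` are hypothetical, and the sketch assumes the endpoints need no request body. It only builds the request, so nothing is sent until you call `urlopen`.

```python
import urllib.request

BASE_URL = "https://api.decimal.ai/api/v1"

def lifecycle_request(alert_id, action, api_key):
    """Build (but do not send) the POST request for an alert lifecycle action."""
    if action not in ("acknowledge", "resolve"):
        raise ValueError(f"unsupported action: {action}")
    return urllib.request.Request(
        f"{BASE_URL}/regression-alerts/{alert_id}/{action}",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )

# `alert_123` is a placeholder; the real ID format is not documented here.
req = lifecycle_request("alert_123", "acknowledge", "dai_sk_...")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```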

Pre-deploy vs post-deploy: when each one helps

	Pre-deploy regression check	Post-deploy bisect (this page)
When it runs	On every PR	On demand, plus alerts on detection
What it answers	“What will my change break?”	“What change broke this metric?”
Data source	Structural diff vs historical traces	Observed metrics vs manifest version
Cost	<$0.001 per check (no model calls)	Free (query over already-ingested data)
Catches model-swap regressions	⚠ Partial (flags risk but can’t predict direction)	✓ Yes (observes actual behavior)
Latency to detection	Seconds	Hours to days (depends on traffic)

The two are designed to complement, not replace, each other. The pre-deploy check eliminates the obvious structural breaks; the post-deploy bisect catches the behavioral surprises that no structural diff can predict.

Limitations

Needs traffic: bisect requires enough traces on each manifest version to compute a stable metric. The detector only flags a regression when both compared versions have ≥ 50 traces — it won’t cry wolf on thin data. The timeline endpoint reports this min_trace_count so the UI can tell “insufficient data” apart from “no regression.”
Fixed metric set: the timeline computes pass rate, error rate, cost, and p95 latency from native trace and eval fields. Pass rate uses whatever evaluators you’ve attached to the agent (see Evaluations); the other three need no setup.
Window: the timeline aggregates over a trailing window (default 7 days, up to 90) via the window_days query param on GET /api/v1/agents/{agent}/regression-timeline.
A/B testing isn’t supported yet: if you ship two versions concurrently and split traffic, the bisect view still groups by manifest version and won’t surface that they ran in parallel. Support is on the roadmap.
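Two of the limitations above, the minimum trace count and the window bounds, translate into simple guards. The sketch below is illustrative only: the function names and status labels are hypothetical, the lower window bound of 1 day is an assumption, and the timeline endpoint is assumed to live on the same host as the alerts API.

```python
from urllib.parse import quote, urlencode

MIN_TRACE_COUNT = 50  # both compared versions need at least this many traces

def classify(baseline_count, candidate_count, relative_drop, threshold=0.10):
    """Separate 'insufficient data' from 'no regression' and 'regression'."""
    if min(baseline_count, candidate_count) < MIN_TRACE_COUNT:
        return "insufficient_data"
    return "regression" if relative_drop > threshold else "no_regression"

def timeline_url(agent, window_days=7):
    """Build the timeline URL, rejecting windows outside 1 to 90 days.

    ASSUMPTION: same host as the alerts API; lower bound of 1 day.
    """
    if not 1 <= window_days <= 90:
        raise ValueError("window_days must be between 1 and 90")
    query = urlencode({"window_days": window_days})
    path = f"/api/v1/agents/{quote(agent, safe='')}/regression-timeline"
    return f"https://api.decimal.ai{path}?{query}"

print(classify(400, 12, 0.30))   # thin data on the new version
print(classify(400, 120, 0.30))  # enough data, large drop
print(timeline_url("support-agent", window_days=30))
```

Checking the trace count before the threshold is what keeps a large drop on a handful of traces from being reported as a regression.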

What’s next

Regression Check (pre-deploy)

The companion that runs on every PR. Catches most issues before they ship.

Webhooks

Wire the regression.detected event into Slack, PagerDuty, or your incident tooling.

Manifests

How manifest versions are tracked and what gets hashed.

Evaluations

Define the evaluators that determine pass rate, one of the metrics bisect can target.
