When to use this
A metric dropped suddenly
Pass rate, latency, error rate, or a custom evaluator score is materially worse than yesterday.
Multiple deploys today
You shipped 3 PRs in a row. The regression check passed on each. You need to identify which one broke things.
The pre-deploy check was clean
The structural diff said “no high-risk traces” — but behavior changed anyway. Common when the change was a model swap or a large prompt rewrite.
You want to confirm a rollback worked
You reverted. Verify the metric recovered by clicking back through the timeline.
How it works
The bisect view computes one metric value per manifest version across your trace history, then highlights the version where the metric dropped by more than a configurable threshold (default: 10% relative drop). In the illustrative bisect chart, the cliff at v1.3 is the regression boundary:
v1.0–v1.2 hold steady in the low-90s; v1.3 drops to 68% (−25% vs baseline) and v1.4 stays regressed. The boundary is the v1.2 → v1.3 transition.
The “regression boundary” — the transition from healthy to bad — gives you the candidate manifest version. The view then surfaces the diff between v1.2 and v1.3: which tools were added/removed, which prompts changed, what the model swap was. Same diff data as the pre-deploy check, just retrieved retroactively.
Bisect uses observed trace metrics, not behavioral replay. No tokens are spent re-running traces — it’s purely a query over data you’ve already ingested.
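The boundary detection described above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the function name, the choice of baseline (the mean of all earlier versions), and the sample values (picked to match the low-90s and 68% figures in the chart description) are assumptions. It also assumes a higher-is-better metric such as pass rate.

```python
from statistics import mean
from typing import Optional


def find_regression_boundary(
    metric_by_version: list[tuple[str, float]],
    threshold: float = 0.10,
) -> Optional[tuple[str, str, float]]:
    """Return (last_healthy, first_regressed, relative_drop), or None.

    Versions must be listed in deploy order. Assumes higher is better.
    """
    for i in range(1, len(metric_by_version)):
        version, value = metric_by_version[i]
        # Assumed baseline: mean metric over every earlier version.
        baseline = mean(v for _, v in metric_by_version[:i])
        if baseline <= 0:
            continue  # a relative drop is undefined for a zero baseline
        drop = (baseline - value) / baseline
        if drop > threshold:
            return metric_by_version[i - 1][0], version, drop
    return None


# Illustrative pass rates per manifest version (hypothetical values).
timeline = [
    ("v1.0", 0.91),
    ("v1.1", 0.92),
    ("v1.2", 0.90),
    ("v1.3", 0.68),
    ("v1.4", 0.69),
]
boundary = find_regression_boundary(timeline)
```

With these values the baseline before v1.3 is 0.91, so the relative drop is (0.91 − 0.68) / 0.91 ≈ 25%, and the function reports the v1.2 → v1.3 transition. For lower-is-better metrics such as error rate or latency, the sign of the comparison would need to be inverted.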
Step by step
Open the regression timeline for the affected agent
Navigate to Agents → <your-agent> → Regression. The default view shows the pass rate over the trailing window, broken down by manifest version.
If your metric isn’t pass rate, switch via the metric picker (top right). The timeline supports these metrics:
- Pass rate (pass_rate)
- Error rate (error_rate): % of traces with status=error
- Cost (cost): spend per trace
- p95 latency (latency_p95)
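To make the four metric definitions concrete, here is a rough sketch of how they could be computed for the traces of a single manifest version. The `Trace` fields (`passed`, `status`, `cost_usd`, `latency_ms`) are hypothetical names, and the nearest-rank percentile is an assumption; the actual trace schema and p95 method may differ.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    passed: bool        # hypothetical: evaluator verdict for this trace
    status: str         # hypothetical: "ok" or "error"
    cost_usd: float     # hypothetical: spend for this trace
    latency_ms: float   # hypothetical: end-to-end latency


def summarize(traces: list[Trace]) -> dict[str, float]:
    """Compute the four timeline metrics for one manifest version."""
    n = len(traces)
    if n == 0:
        raise ValueError("need at least one trace")
    latencies = sorted(t.latency_ms for t in traces)
    # Nearest-rank p95: smallest sample with at least 95% of samples
    # at or below it. Integer arithmetic avoids float rounding.
    rank = (95 * n + 99) // 100
    return {
        "pass_rate": sum(t.passed for t in traces) / n,
        "error_rate": sum(t.status == "error" for t in traces) / n,
        "cost": sum(t.cost_usd for t in traces) / n,
        "latency_p95": latencies[rank - 1],
    }


# Twenty synthetic traces: every 4th fails evaluation, every 10th errors.
sample = [
    Trace(
        passed=(i % 4 != 0),
        status="error" if i % 10 == 0 else "ok",
        cost_usd=0.02,
        latency_ms=i * 100.0,
    )
    for i in range(1, 21)
]
metrics = summarize(sample)
```

Running `summarize` once per manifest version yields the per-version values that the timeline plots.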
Identify the regression boundary
The view auto-flags the first version where the metric dropped > 10% relative to the rolling baseline. The flagged version appears with a red ⚠ REGRESSION badge and the magnitude of the drop.
Click into the badge to expand the diff: which manifest components changed between this version and the previous healthy version.
Inspect the suspect change
The expanded panel shows the same severity-tagged surface changes you’d see on a pre-deploy check:
- 🔴 Tools added/removed
- 🟡 Prompts revised
- 🟡 Model swapped
- 🟢 Cosmetic changes (whitespace, comments)
Decide: rollback, fix-forward, or accept
Three paths from here:
- Rollback: redeploy the previous manifest. The system picks it up automatically; the next ingested trace registers under the old version.
- Fix forward: push a follow-up PR. The pre-deploy check will run on it; once it is merged and deployed, the timeline will show whether the metric recovered.
- Accept: sometimes the regression is intentional (for example, you’re trading latency for accuracy). Mark the version “expected” via the Acknowledge button; the view stops flagging it without changing the data.
Setting up regression alerts
If you want the system to ping you the moment a regression is detected, rather than discovering it on your next dashboard visit, configure an alert. Create one from the dashboard’s Regression tab, or via the REST API. Existing alerts can be acknowledged or resolved with POST /api/v1/regression-alerts/{id}/acknowledge and POST /api/v1/regression-alerts/{id}/resolve.
Alerts fire when a new manifest version’s metric drops more than threshold_drop (relative) below the trailing 7-day baseline. The webhook payload follows the regression.detected event schema.
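As a sketch of how the acknowledge and resolve endpoints named above might be called from a script, the helper below builds the request without sending it. The host name, the bearer-token header, and the function name are assumptions for illustration; check your deployment's base URL and authentication scheme before using anything like this.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical placeholder host; replace with your deployment's base URL.
DEFAULT_BASE_URL = "https://example.invalid"


def alert_action_request(
    alert_id: str,
    action: str,
    token: str,
    base_url: str = DEFAULT_BASE_URL,
) -> Request:
    """Build a POST request that acknowledges or resolves an alert."""
    if action not in {"acknowledge", "resolve"}:
        raise ValueError(f"unsupported action: {action!r}")
    # Percent-encode the id so it cannot alter the path structure.
    path = f"/api/v1/regression-alerts/{quote(alert_id, safe='')}/{action}"
    return Request(
        base_url + path,
        method="POST",
        # Assumed auth scheme: bearer token.
        headers={"Authorization": f"Bearer {token}"},
    )


req = alert_action_request("al_123", "acknowledge", token="TOKEN")
```

Passing the returned object to `urllib.request.urlopen` would send it; building the request separately keeps the URL construction easy to test.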
Pre-deploy vs post-deploy: when each one helps
| | Pre-deploy regression check | Post-deploy bisect (this page) |
|---|---|---|
| When it runs | On every PR | On demand, plus alerts on detection |
| What it answers | “What will my change break?” | “What change broke this metric?” |
| Data source | Structural diff vs historical traces | Observed metrics vs manifest version |
| Cost | <$0.001 per check (no model calls) | Free (query over already-ingested data) |
| Catches model-swap regressions | ⚠ Partial (flags risk but can’t predict direction) | ✓ Yes (observes actual behavior) |
| Latency to detection | Seconds | Hours to days (depends on traffic) |
Limitations
- Needs traffic: bisect requires enough traces on each manifest version to compute a stable metric. The detector only flags a regression when both compared versions have ≥ 50 traces, so it won’t cry wolf on thin data. The timeline endpoint reports this min_trace_count so the UI can tell “insufficient data” apart from “no regression.”
- Fixed metric set: the timeline computes pass rate, error rate, cost, and p95 latency from native trace and eval fields. Pass rate uses whatever evaluators you’ve attached to the agent (see Evaluations); the other three need no setup.
- Window: the timeline aggregates over a trailing window (default 7 days, up to 90) via the window_days query param on GET /api/v1/agents/{agent}/regression-timeline.
- A/B testing isn’t supported yet: if you ship two versions concurrently and split traffic, the bisect view still groups by manifest version and won’t surface that they ran in parallel. Roadmap.
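The window and trace-count limits above can be illustrated with a small sketch. It builds the timeline path with a validated window_days value and shows how a client might separate “insufficient data” from “no regression.” The function names, the lower bound of 1 day, and the state labels are assumptions; the response format of the endpoint is not shown here because the page does not specify it.

```python
from urllib.parse import quote, urlencode

MIN_TRACE_COUNT = 50  # per the limitation above: both versions need >= 50 traces


def timeline_path(agent: str, window_days: int = 7) -> str:
    """Build the regression-timeline path with a validated window."""
    if not 1 <= window_days <= 90:  # lower bound of 1 is an assumption
        raise ValueError("window_days must be between 1 and 90")
    query = urlencode({"window_days": window_days})
    return f"/api/v1/agents/{quote(agent, safe='')}/regression-timeline?{query}"


def comparison_state(
    prev_traces: int,
    cur_traces: int,
    relative_drop: float,
    threshold: float = 0.10,
    min_trace_count: int = MIN_TRACE_COUNT,
) -> str:
    """Classify one version-to-version comparison (hypothetical labels)."""
    # Thin data on either side means the comparison is not trustworthy.
    if min(prev_traces, cur_traces) < min_trace_count:
        return "insufficient_data"
    return "regression" if relative_drop > threshold else "no_regression"


path = timeline_path("support-bot", window_days=30)
state = comparison_state(prev_traces=400, cur_traces=12, relative_drop=0.25)
```

In this example the 25% drop is not reported as a regression because the newer version has only 12 traces; the same drop with 50 or more traces on both sides would be flagged.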
What’s next
Regression Check (pre-deploy)
The companion that runs on every PR. Catches most issues before they ship.
Webhooks
Wire the regression.detected event into Slack, PagerDuty, or your incident tooling.
Manifests
How manifest versions are tracked and what gets hashed.
Evaluations
Define custom metrics that bisect can use as targets.