keep / repair / replay / drop). It accepts scores from multiple sources: built-in deterministic checks, your @eval functions, LLM judges, and scores pushed from external tools such as DeepEval, LangSmith, or your CI pipeline.
Together with the Compatibility Policies guide, this is how DecimalAI decides what’s training-data-grade.
When to use this API
Push scores from an external eval pipeline
Already running DeepEval or a homegrown harness? POST results to /traces/{id}/eval-scores and they show up in the same dashboard view as your built-ins, tagged by source.

Re-score a trace on demand
Call /traces/{id}/evaluate to re-run the configured eval policy (built-ins + LLM judges + your @eval functions) without re-running the agent itself.

Override a verdict manually
A human reviewer looked at the trace and disagrees with the auto-verdict. /traces/{id}/decision writes the override; the original auto-verdict is preserved for audit.

Bulk classify after a policy change
Tighten your eval policy and want every existing trace re-classified? /traces/batch-decision runs the new policy against existing scores without re-running anything.

Score sources (the source field)
Every score row carries a source so the dashboard can show it as a tagged badge:
| Source | Where it comes from |
|---|---|
| built_in | Server-side deterministic checks (completion / has_output / tool_compliance / latency / token_efficiency). Always attached automatically at ingest. |
| sdk / custom | Your Python @eval-decorated functions, computed SDK-side before upload. |
| llm_judge | Server-side LLM-as-judge against a rubric you configured. |
| deepeval / langsmith / external | Pushed via POST /eval-scores from an external pipeline. |
| compat_engine | Computed by the manifest impact engine (relative to a target manifest). |
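As a rough illustration of how the source tag separates scores, the sketch below groups score rows by source on the client side. The source names come from the table above; the row fields `name` and `value` and the helper `group_by_source` are hypothetical and are not taken from the DecimalAI schema.

```python
from collections import defaultdict

# Source tags as listed in the table above.
KNOWN_SOURCES = {
    "built_in", "sdk", "custom", "llm_judge",
    "deepeval", "langsmith", "external", "compat_engine",
}


def group_by_source(rows):
    """Group score rows by their source tag (hypothetical helper).

    Raises ValueError for a tag that is not in the table above.
    """
    grouped = defaultdict(list)
    for row in rows:
        source = row["source"]
        if source not in KNOWN_SOURCES:
            raise ValueError(f"unknown score source: {source!r}")
        grouped[source].append(row)
    return dict(grouped)


# Illustrative rows; 'name' and 'value' are assumed field names.
rows = [
    {"name": "completion", "value": 1.0, "source": "built_in"},
    {"name": "faithfulness", "value": 0.82, "source": "deepeval"},
    {"name": "latency", "value": 0.9, "source": "built_in"},
]
badges = group_by_source(rows)
print(sorted(badges))
```

The check against `KNOWN_SOURCES` is a choice made for this sketch; whether the server rejects unlisted source values is not stated here.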
Endpoints at a glance
| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/traces/{id}/eval-scores | Push N quality scores from any source onto a trace |
| GET | /api/v1/traces/{id}/eval-scores | Read every score for a trace, grouped by source |
| GET | /api/v1/traces/{id}/eval-breakdown | Score view with provenance + decision-engine reasons |
| GET | /api/v1/traces/eval/stats | Aggregate eval stats across a workspace |
| POST | /api/v1/traces/{id}/evaluate | Re-run the trace’s configured evaluators |
| POST | /api/v1/traces/{id}/decision | Override the auto-verdict manually |
| POST | /api/v1/traces/batch-decision | Recompute verdicts for many traces under a new policy |
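One way to keep these routes in a single place in client code is a small lookup keyed by a short name. The methods and path templates below are copied from the table; the short names and the `resolve` helper are hypothetical conveniences for this sketch, not part of any official SDK.

```python
from urllib.parse import quote

# (HTTP method, path template) pairs copied from the table above.
# The dictionary keys are illustrative names chosen for this sketch.
ENDPOINTS = {
    "push_scores": ("POST", "/api/v1/traces/{id}/eval-scores"),
    "read_scores": ("GET", "/api/v1/traces/{id}/eval-scores"),
    "breakdown": ("GET", "/api/v1/traces/{id}/eval-breakdown"),
    "stats": ("GET", "/api/v1/traces/eval/stats"),
    "evaluate": ("POST", "/api/v1/traces/{id}/evaluate"),
    "decision": ("POST", "/api/v1/traces/{id}/decision"),
    "batch_decision": ("POST", "/api/v1/traces/batch-decision"),
}


def resolve(name, trace_id=None):
    """Return (method, path) for an endpoint, filling in {id} when present."""
    method, template = ENDPOINTS[name]
    if "{id}" in template:
        if trace_id is None:
            raise ValueError(f"endpoint {name!r} requires a trace id")
        # Percent-encode the id so it is safe as a single path segment.
        return method, template.replace("{id}", quote(trace_id, safe=""))
    return method, template


print(resolve("evaluate", "tr_123"))
print(resolve("stats"))
```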
Quick start
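A minimal sketch of pushing one externally computed score onto a trace, using only the Python standard library. The path is the POST /api/v1/traces/{id}/eval-scores endpoint listed above. Everything else is an assumption made for illustration: the base URL is a placeholder, the Bearer-token header is a guess at the auth scheme, and the body shape (`scores`, `name`, `value`, `source`) is hypothetical. Check the endpoint reference for the real request schema before relying on it.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder, not the real host


def build_push_request(trace_id, scores, api_key, base_url=BASE_URL):
    """Build (but do not send) a POST request for the eval-scores endpoint.

    The JSON body shape and the Authorization header are assumptions.
    """
    body = json.dumps({"scores": scores}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/traces/{trace_id}/eval-scores",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


req = build_push_request(
    "trace-123",
    [{"name": "faithfulness", "value": 0.82, "source": "deepeval"}],
    api_key="YOUR_API_KEY",
)
print(req.get_method(), req.full_url)
```

To actually send the request, pass `req` to `urllib.request.urlopen`. Building the request separately from sending it makes the payload easy to inspect or log first.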
Related
- Evaluations Guide: the conceptual model + @eval decorator
- Compatibility Policies: how scores aggregate into verdicts
- Datasets API: datasets filter on verdict + score
- Replay API: replay flagged-for-repair traces to recover them