The Evaluations API attaches quality scores to traces and aggregates them into a per-trace verdict (keep / repair / replay / drop). It accepts scores from multiple sources: built-in deterministic checks, your @eval functions, LLM judges, and scores pushed from external tools such as DeepEval, LangSmith, or your CI pipeline. Together with the Compatibility Policies guide, this is how DecimalAI decides which traces are training-data-grade.

When to use this API

Push scores from an external eval pipeline

Already running DeepEval or a homegrown harness? POST results to /traces/{id}/eval-scores and they show up in the same dashboard view as your built-ins, tagged by source.
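As a rough sketch of preparing that request body, the helper below builds the JSON payload in the same shape the Quick start on this page uses (`source` plus a list of `name` / `score` / `passed` rows). The function name, the allow-list of sources, and the 0.0 to 1.0 range check are assumptions added for client-side sanity checking, not documented server behavior.

```python
from typing import Any

# Sources this guide lists as pushed via POST /eval-scores.
# Using them as an allow-list is an assumption made for this sketch.
PUSHED_SOURCES = {"deepeval", "langsmith", "external"}


def build_eval_scores_payload(source: str, results: list[dict[str, Any]]) -> dict[str, Any]:
    """Build a JSON body for POST /traces/{id}/eval-scores (hypothetical helper)."""
    if source not in PUSHED_SOURCES:
        raise ValueError(f"unexpected source for a pushed score: {source!r}")
    scores = []
    for row in results:
        value = float(row["score"])
        # Assumption: scores are normalized to [0.0, 1.0], like quality_avg.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score out of range for {row['name']!r}: {value}")
        scores.append({"name": row["name"], "score": value, "passed": bool(row["passed"])})
    return {"source": source, "scores": scores}


payload = build_eval_scores_payload(
    "deepeval",
    [{"name": "faithfulness", "score": 0.92, "passed": True}],
)
print(payload)
```

The resulting dict can be passed as the `json=` argument of the POST request shown in the Quick start.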

Re-score a trace on demand

Call /traces/{id}/evaluate to re-run the configured eval policy (built-ins + LLM judges + your @eval functions) without re-running the agent itself.

Override a verdict manually

When a human reviewer looks at a trace and disagrees with the auto-verdict, /traces/{id}/decision writes the override; the original auto-verdict is preserved for audit.

Bulk classify after a policy change

Tightened your eval policy and want every existing trace re-classified? /traces/batch-decision runs the new policy against existing scores without re-running anything.
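To illustrate why a bulk re-classification needs no re-runs, the sketch below treats a verdict as a pure function of stored scores and a policy. Everything here is hypothetical: the `Policy` class, its threshold fields, the use of a plain average, and the ordering keep > repair > replay > drop are assumptions for illustration only. The actual decision engine and policy format are not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical threshold policy, not DecimalAI's real policy format."""
    keep_at: float    # average >= keep_at   -> keep
    repair_at: float  # average >= repair_at -> repair
    replay_at: float  # average >= replay_at -> replay, otherwise drop


def classify(scores: list[float], policy: Policy) -> str:
    """Map already-stored scores to a verdict; nothing is re-evaluated."""
    if not scores:
        return "drop"  # assumption: a trace with no scores is dropped
    avg = sum(scores) / len(scores)
    if avg >= policy.keep_at:
        return "keep"
    if avg >= policy.repair_at:
        return "repair"
    if avg >= policy.replay_at:
        return "replay"
    return "drop"


# Illustrative stored scores, keyed by made-up trace IDs.
stored_scores = {
    "trc_a": [0.92, 0.88, 0.71],
    "trc_b": [0.55, 0.60],
}

old_policy = Policy(keep_at=0.80, repair_at=0.50, replay_at=0.30)
new_policy = Policy(keep_at=0.85, repair_at=0.60, replay_at=0.40)

before = {tid: classify(s, old_policy) for tid, s in stored_scores.items()}
after = {tid: classify(s, new_policy) for tid, s in stored_scores.items()}
print(before)
print(after)
```

Only the policy changes between the two passes; the stored scores are read as-is, which is the property the batch endpoint relies on.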

Score sources (the source field)

Every score row carries a source so the dashboard can show it as a tagged badge:
| Source | Where it comes from |
| --- | --- |
| built_in | Server-side deterministic checks (completion / has_output / tool_compliance / latency / token_efficiency). Always attached automatically at ingest. |
| sdk / custom | Your Python @eval-decorated functions, computed SDK-side before upload. |
| llm_judge | Server-side LLM-as-judge against a rubric you configured. |
| deepeval / langsmith / external | Pushed via POST /eval-scores from an external pipeline. |
| compat_engine | Computed by the manifest impact engine (relative to a target manifest). |
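A minimal client-side sketch of what "grouped by source" can look like: it takes flat score rows and produces one group per source with an average, using the `source`, `scores`, and `source_avg` field names that appear in the Quick start. The sample rows and the `group_by_source` helper are illustrative assumptions; how the server computes its own groups is not specified here.

```python
from collections import defaultdict

# Illustrative rows; the values are made up for this example.
rows = [
    {"source": "built_in", "name": "completion", "score": 1.0},
    {"source": "built_in", "name": "latency", "score": 0.5},
    {"source": "deepeval", "name": "faithfulness", "score": 0.92},
]


def group_by_source(score_rows):
    """Group score rows by their source tag and average each group."""
    grouped = defaultdict(list)
    for row in score_rows:
        grouped[row["source"]].append(row)
    return [
        {
            "source": source,
            "scores": scores,
            "source_avg": sum(s["score"] for s in scores) / len(scores),
        }
        for source, scores in grouped.items()  # keeps first-seen source order
    ]


groups = group_by_source(rows)
for group in groups:
    print(f"{group['source']}: {len(group['scores'])} scores, avg {group['source_avg']:.2f}")
```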

Endpoints at a glance

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/v1/traces/{id}/eval-scores | Push N quality scores from any source onto a trace |
| GET | /api/v1/traces/{id}/eval-scores | Read every score for a trace, grouped by source |
| GET | /api/v1/traces/{id}/eval-breakdown | Score view with provenance + decision-engine reasons |
| GET | /api/v1/traces/eval/stats | Aggregate eval stats across a workspace |
| POST | /api/v1/traces/{id}/evaluate | Re-run the trace’s configured evaluators |
| POST | /api/v1/traces/{id}/decision | Override the auto-verdict manually |
| POST | /api/v1/traces/batch-decision | Recompute verdicts for many traces under a new policy |
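One way to keep these paths in a single place is a small lookup that returns the HTTP method and full URL for each endpoint. The methods and paths are taken from the table above and the base URL from the Quick start; the short keys (`push_scores`, `evaluate`, and so on) and the `endpoint` helper are hypothetical names chosen for this sketch. Request bodies for `evaluate`, `decision`, and `batch-decision` are not documented on this page, so none are assumed.

```python
from urllib.parse import quote

BASE_URL = "https://api.decimal.ai/api/v1"

# Hypothetical short names mapped to (method, path) pairs from the table.
ENDPOINTS = {
    "push_scores": ("POST", "/traces/{id}/eval-scores"),
    "read_scores": ("GET", "/traces/{id}/eval-scores"),
    "breakdown": ("GET", "/traces/{id}/eval-breakdown"),
    "stats": ("GET", "/traces/eval/stats"),
    "evaluate": ("POST", "/traces/{id}/evaluate"),
    "decision": ("POST", "/traces/{id}/decision"),
    "batch_decision": ("POST", "/traces/batch-decision"),
}


def endpoint(name, trace_id=None):
    """Return (method, url) for a named endpoint, filling in the trace ID."""
    method, path = ENDPOINTS[name]
    if "{id}" in path:
        if trace_id is None:
            raise ValueError(f"endpoint {name!r} needs a trace_id")
        # Percent-encode the ID so it is safe as a single path segment.
        path = path.replace("{id}", quote(trace_id, safe=""))
    return method, BASE_URL + path


method, url = endpoint("evaluate", "trc_abc123")
print(method, url)
```

The returned pair can be handed to any HTTP client, for example as the method and URL of an `httpx` request with the same `Authorization` header used in the Quick start.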

Quick start

import httpx

# 1. Push DeepEval scores onto a trace
httpx.post(
    "https://api.decimal.ai/api/v1/traces/trc_abc123/eval-scores",
    headers={"Authorization": "Bearer dai_sk_..."},
    json={
        "source": "deepeval",
        "scores": [
            {"name": "faithfulness", "score": 0.92, "passed": True},
            {"name": "answer_relevance", "score": 0.88, "passed": True},
            {"name": "context_precision", "score": 0.71, "passed": True},
        ],
    },
)

# 2. Read the full breakdown
breakdown = httpx.get(
    "https://api.decimal.ai/api/v1/traces/trc_abc123/eval-breakdown",
    headers={"Authorization": "Bearer dai_sk_..."},
).json()
print(breakdown["eval_verdict"])  # keep / repair / replay / drop
print(breakdown["quality_avg"])   # 0.0 — 1.0
for group in breakdown["source_groups"]:  # a list, one entry per source
    print(f"  {group['source']}: {len(group['scores'])} scores, avg {group['source_avg']:.2f}")