Evaluators attach via your framework’s
install() call, not init().
decimalai.init(...) sets up tracing; the evals=[...] and
builtin_evals= arguments live on the framework integration’s
install() function (e.g. decimalai.langchain.install(...)). The
examples below use the LangChain install(); swap in your framework’s
module.

Pre-Built Evaluators
DecimalAI ships 10 ready-to-use evaluators in two tiers:

Deterministic Evaluators (Free)
These run instantly with zero external calls:

| Evaluator | What it checks | Example |
|---|---|---|
| `json_valid` | Output is valid JSON | `evals=[json_valid]` |
| `contains(patterns)` | Output contains required substrings | `contains(["source:", "http"])` |
| `not_contains(patterns)` | Output doesn’t contain banned strings | `not_contains(["TODO", "FIXME"])` |
| `regex_match(pattern)` | Output matches a regex pattern | `regex_match(r"\d{4}-\d{2}-\d{2}")` |
| `length_check(min, max)` | Word/char count within bounds | `length_check(min_words=10, max_words=500)` |
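
As a rough illustration of what these deterministic checks compute, the sketch below re-implements a few of them in plain Python. The `check_*` functions are hypothetical stand-ins written for this page, not DecimalAI's implementation. In particular, requiring every pattern in `contains` (rather than any one) and letting `regex_match` match anywhere in the output are assumptions.

```python
import json
import re


def check_json_valid(output: str) -> bool:
    """Pass if the output parses as JSON (stand-in for json_valid)."""
    try:
        json.loads(output)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def check_contains(output: str, patterns: list[str]) -> bool:
    """Pass only if every required substring is present (assumed: all, not any)."""
    return all(p in output for p in patterns)


def check_not_contains(output: str, patterns: list[str]) -> bool:
    """Pass only if none of the banned substrings appear."""
    return not any(p in output for p in patterns)


def check_regex_match(output: str, pattern: str) -> bool:
    """Pass if the pattern matches anywhere in the output (assumed: search, not full match)."""
    return re.search(pattern, output) is not None
```

None of these need a network call or a model, which is why checks of this kind can run on every trace at no cost.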
Built-in Auto-Checks (Always On)
In addition to the evaluators above, DecimalAI runs 5 automatic quality checks on every trace, with zero configuration and zero cost:

| Check | What it measures | Score logic |
|---|---|---|
| `completion` | Did the trace complete without error? | 1.0 if status is `success`, else 0.0 |
| `has_output` | Is there a non-empty final output? | 1.0 if output length > 0 |
| `tool_compliance` | Did all tool calls produce results? | Ratio of successful tool calls |
| `latency` | Response time score | Linear decay from 1.0 (0s) to 0.0 (10s+) |
| `token_efficiency` | Total token usage | Linear decay from 1.0 (0 tokens) to 0.0 (5000+) |

These checks are recorded with `source: builtin` and feed into the Decision Engine alongside your custom evaluators.
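
The score logic for the last three checks can be written out directly. This is a minimal sketch, assuming the decay is a straight line clamped to the 0.0–1.0 range with the endpoints listed in the table (10 seconds, 5000 tokens). The function names are illustrative and not part of the SDK. How `tool_compliance` treats a trace with no tool calls is not stated; the sketch assumes it scores 1.0.

```python
def linear_decay(value: float, zero_at: float) -> float:
    """Score 1.0 at value 0, falling linearly to 0.0 at `zero_at` and beyond."""
    return max(0.0, min(1.0, 1.0 - value / zero_at))


def latency_score(seconds: float) -> float:
    # 0 s -> 1.0, 10 s or more -> 0.0
    return linear_decay(seconds, zero_at=10.0)


def token_efficiency_score(total_tokens: int) -> float:
    # 0 tokens -> 1.0, 5000 tokens or more -> 0.0
    return linear_decay(total_tokens, zero_at=5000.0)


def tool_compliance_score(succeeded: int, total: int) -> float:
    # Assumption: a trace with no tool calls counts as fully compliant.
    return 1.0 if total == 0 else succeeded / total
```

Under these assumptions a 2.5 s response scores 0.75 on latency, and a trace that used 2500 tokens scores 0.5 on token efficiency.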
LLM-as-Judge Evaluators
These use an LLM to score outputs on a 0.0–1.0 scale with a pass/fail verdict:

| Evaluator | What it scores | Input needed |
|---|---|---|
| `Relevance()` | Does the output address the input? | input + output |
| `Factuality()` | Is the output grounded in facts? | input + output + context |
| `Faithfulness()` | Is output faithful to tool results? | output + tool_calls |
| `Toxicity()` | Is the content safe and non-harmful? | output |
| `Conciseness()` | Is the output appropriately concise? | input + output |
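
To make the judge flow concrete, here is a minimal sketch of LLM-as-judge scoring with the model call injected as a plain function. Everything in it is illustrative: `judge_relevance`, the prompt wording, the JSON reply format, and the 0.5 pass threshold are assumptions, not DecimalAI's implementation. In a real setup the `complete` callable would wrap an actual model call.

```python
import json
from typing import Callable


def judge_relevance(
    complete: Callable[[str], str],
    user_input: str,
    output: str,
    threshold: float = 0.5,  # assumed pass threshold
) -> dict:
    """Ask a judge model for a 0.0-1.0 relevance score and derive pass/fail."""
    prompt = (
        "Rate how well the OUTPUT addresses the INPUT on a 0.0-1.0 scale.\n"
        'Reply with JSON: {"score": <float>, "reason": "<short reason>"}\n\n'
        f"INPUT:\n{user_input}\n\nOUTPUT:\n{output}"
    )
    reply = complete(prompt)
    try:
        parsed = json.loads(reply)
        score = float(parsed["score"])
        reason = str(parsed.get("reason", ""))
    except (ValueError, KeyError, TypeError):
        # Unparseable judge reply: fail closed rather than guess a score.
        return {"score": 0.0, "passed": False, "reason": "unparseable judge reply"}
    score = max(0.0, min(1.0, score))  # clamp into the 0.0-1.0 range
    return {"score": score, "passed": score >= threshold, "reason": reason}


# A fake judge keeps the example runnable without network access.
def fake_judge(prompt: str) -> str:
    return '{"score": 0.9, "reason": "answers the question directly"}'


result = judge_relevance(fake_judge, "What is 2 + 2?", "4")
```

Injecting the model call keeps the parsing and thresholding logic testable on its own, separately from whichever judge model is chosen.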
Execution Modes
- Client-Side (BYO Key)
- Server-Side (Metered)

In client-side mode, evals run in your process using your own API key via `litellm`:
- Unlimited evaluations with no metering
- You choose the judge model: `gpt-4o-mini` for cheap, `gpt-4o` for quality, `claude-sonnet-4-6` for diversity
- Works offline: in CI pipelines, notebooks, local testing
- Privacy: trace data never leaves your infrastructure

Custom Evaluators
The @eval Decorator
Write custom evaluators that run on every trace:
| Return type | Pass logic | Example |
|---|---|---|
| `bool` | `True` = pass (score 1.0), `False` = fail (score 0.0) | `return "[source:" in trace.output` |
| `float` | 0.0–1.0 score, pass threshold at 0.5 | `return 0.85` |
| `dict` | Multi-score: each key becomes a separate score | `return {"clarity": 0.8, "accuracy": 0.9}` |
| `EvalResult` | Full control over score, pass/fail, and reason | `return EvalResult(score=0.9, passed=True, reason="...")` |
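
One way to read the table is as a normalization step that turns whatever the evaluator returns into scored results. The sketch below is an illustrative re-implementation, not the SDK's code. The local `EvalResult` dataclass is a stand-in with the three fields shown in the table. Treating a score of exactly 0.5 as a pass, applying the 0.5 threshold to each dict entry, and naming bool and float results after the evaluator are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Local stand-in mirroring the fields used in the table above."""
    score: float
    passed: bool
    reason: str = ""


def normalize(name: str, value) -> dict[str, EvalResult]:
    """Map an evaluator's return value to one EvalResult per score name."""
    if isinstance(value, EvalResult):
        return {name: value}
    if isinstance(value, bool):  # check bool before float: bool is an int subclass
        return {name: EvalResult(1.0 if value else 0.0, value)}
    if isinstance(value, (int, float)):
        score = float(value)
        return {name: EvalResult(score, score >= 0.5)}  # assumed: 0.5 itself passes
    if isinstance(value, dict):
        # Multi-score: each key becomes its own score.
        return {key: EvalResult(float(v), float(v) >= 0.5) for key, v in value.items()}
    raise TypeError(f"unsupported evaluator return type: {type(value).__name__}")
```

For example, `normalize("quality", {"clarity": 0.8, "accuracy": 0.9})` yields two separate passing scores, keyed `clarity` and `accuracy`.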
Sampling Rate
Expensive evaluators (especially LLM-as-judge) can run on a subset of traces.

TraceData Fields
| Field | Type | Description |
|---|---|---|
| `trace.input` | `str` | User input / query |
| `trace.output` | `str` | Final agent output |
| `trace.tool_calls` | `list[ToolCallView]` | Tool calls with args + results |
| `trace.llm_calls` | `list[LlmCallView]` | LLM calls with prompts + completions |
| `trace.active_skills` | `list[str]` | Skills that were activated |
| `trace.metadata` | `dict` | Custom metadata |
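
To show how these fields might be used together, the sketch below defines small stand-in classes and a check that every tool call returned a result. The stand-ins are hypothetical: the table only says a tool call carries args and results, so the attribute names `name`, `args`, and `result` are assumptions, as is treating `None` as a missing result.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCallView:  # hypothetical stand-in; attribute names are assumed
    name: str
    args: dict
    result: Optional[Any] = None


@dataclass
class TraceData:  # stand-in with a subset of the fields in the table
    input: str
    output: str
    tool_calls: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def all_tools_returned(trace: TraceData) -> bool:
    """Pass only if the trace has output and no tool call is missing a result."""
    if not trace.output:
        return False
    return all(call.result is not None for call in trace.tool_calls)


trace = TraceData(
    input="Weather in Oslo?",
    output="It is 4 degrees in Oslo.",
    tool_calls=[ToolCallView("get_weather", {"city": "Oslo"}, {"temp_c": 4})],
)
```

A function shaped like `all_tools_returned` returns a `bool`, so it would be scored as 1.0 or 0.0.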
Eval Dashboard
The Evaluate page provides a production-grade dashboard:
- Stat Cards: pass rate, evaluated count, average score, failed/review counts
- Pass Rate Over Time: stacked bar chart showing daily pass/fail trends with a trend line
- Score Distribution: histogram of eval scores across all evaluators
- Evaluator Breakdown: scores grouped by source (builtin, SDK, LLM judge, external)
- Verdict Filters: filter the trace table by pass, fail, review, or unevaluated

Auto-Scoring
Configure automatic evaluation of incoming traces per-agent:

| Mode | Cost | Description |
|---|---|---|
| Off | Free | Manual evaluation only |
| Deterministic | Free | Run built-in checks on every trace automatically |
| LLM Judge | Metered | AI-powered scoring with monthly budget limits |

The older `GET/PUT /api/v1/agents/{name}/eval-policy` route is deprecated. Configure auto-scoring through `/api/v1/evaluators` (above) instead.

External Score Import
Push scores from external evaluation tools (DeepEval, LangSmith, custom pipelines):

SDK Convenience Methods
REST API
LangSmith Webhook
For teams using LangSmith online evals, set up a webhook to push scores automatically.

Decision Engine
The decision engine aggregates scores from all sources (built-in, SDK, LLM judge, external) into a unified verdict.

The eval verdict (Pass / Fail / Review / Unevaluated) answers “is this output high quality?” It is orthogonal to the compatibility verdict (keep / repair / flag / replay / drop), which answers “what should we do with this trace for training?” The engine reads both as inputs but keeps the two axes distinct.

| Verdict | Meaning | Triggered when |
|---|---|---|
| Pass | Output is high quality | All evaluators pass, or average score > threshold |
| Fail | Output has quality issues | Any critical evaluator fails |
| Review | Uncertain — needs human review | Scores are borderline |
| Unevaluated | No evaluators have run | New trace, no auto-scoring configured |

The verdict feeds into:
- Dataset filtering: only `pass` traces are included in training data
- Compatibility scoring: combined with manifest verdicts (keep/repair/replay/drop)
- Dashboard analytics: pass rate trends and regression detection
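
The trigger rules in the verdict table can be sketched as a small aggregation function. This is one possible reading, not the engine's actual logic: the `critical` flag, the 0.7 pass threshold, and the 0.5 lower bound of the borderline band are all assumed values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Score:
    name: str
    score: float
    passed: bool
    critical: bool = False  # hypothetical flag marking must-pass evaluators


def verdict(scores: list[Score], pass_threshold: float = 0.7,
            review_floor: float = 0.5) -> str:
    """Fold per-evaluator scores into pass / fail / review / unevaluated."""
    if not scores:
        return "unevaluated"  # no evaluators have run
    if any(s.critical and not s.passed for s in scores):
        return "fail"  # any critical evaluator fails
    average = sum(s.score for s in scores) / len(scores)
    if all(s.passed for s in scores) or average > pass_threshold:
        return "pass"
    if average >= review_floor:
        return "review"  # borderline: route to a human
    return "fail"
```

Checking the critical failures before the average means a single failed must-pass evaluator cannot be outvoted by high scores elsewhere.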
Batch Evaluation
Run evaluators across multiple traces programmatically.

Next Steps
Eval Scores API
Push and retrieve evaluation scores per trace.
Evaluation concepts
Evaluators, eval scores, eval verdicts, the unified decision engine.
Datasets
Filter datasets by eval verdict — only train on high-quality traces.
Migration from other tools
Push scores from DeepEval, LangSmith, or your custom eval pipeline.