Evaluation - DecimalAI

Evaluations measure whether your agent’s outputs are good. DecimalAI provides built-in evaluators and supports custom ones.

Evaluator

An evaluator is a configured quality check that runs on traces. DecimalAI supports three categories:

Category	How It Works	Example
Deterministic	Rule-based checks, instant, zero cost	Response length, JSON validity, regex match
LLM-as-Judge	An LLM scores the output	Relevance, helpfulness, safety, coherence
Custom	User-defined via the SDK `@eval` decorator	Domain-specific checks, business rules
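As a rough illustration of the deterministic category, the checks named in the table can be written as plain functions. This is a minimal sketch, not the DecimalAI SDK: `CheckResult`, the function names, and the default length limit are all hypothetical, and the `@eval` decorator is deliberately not used because its signature is not described here.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical result shape: a 0.0 to 1.0 score plus a pass/fail flag."""
    name: str
    score: float
    passed: bool


def json_validity(output: str) -> CheckResult:
    """Score 1.0 if the output parses as JSON, otherwise 0.0."""
    try:
        json.loads(output)
        ok = True
    except json.JSONDecodeError:
        ok = False
    return CheckResult("json_validity", 1.0 if ok else 0.0, ok)


def response_length(output: str, limit: int = 500) -> CheckResult:
    """Pass if the output is at most `limit` characters (limit is an assumed default)."""
    ok = len(output) <= limit
    return CheckResult("response_length", 1.0 if ok else 0.0, ok)


def regex_match(output: str, pattern: str) -> CheckResult:
    """Pass if `pattern` is found anywhere in the output."""
    ok = re.search(pattern, output) is not None
    return CheckResult("regex_match", 1.0 if ok else 0.0, ok)
```

Each function is pure and needs no model call, which is what makes this category instant and free to run.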

Evaluators produce scores (0.0–1.0) and verdicts (pass/fail) that feed into:

The eval dashboard (trends, distributions)
Dataset filtering (only train on passing traces)
The decision engine (unified keep/repair/replay/drop)

→ See Evaluations for the full guide.

Eval Score

An eval score is a single evaluation result. Each score has:

Field	Description
`name`	Which evaluator produced it (e.g., “relevance”)
`score`	0.0 to 1.0
`passed`	Boolean — did it meet the threshold?
`source`	Who ran it: `built_in`, `llm_judge`, `sdk`, `compat_engine`
`category`	What it measures: `quality` or `compatibility`

A single trace can have many eval scores from different evaluators.
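The fields above map naturally onto a small record type. The following is a sketch of that shape under stated assumptions: the class name `EvalScore` and the validation logic are illustrative and are not taken from the DecimalAI SDK; only the field names and the allowed `source` and `category` values come from the table.

```python
from dataclasses import dataclass

# Allowed values as listed in the field table.
VALID_SOURCES = {"built_in", "llm_judge", "sdk", "compat_engine"}
VALID_CATEGORIES = {"quality", "compatibility"}


@dataclass(frozen=True)
class EvalScore:
    """Hypothetical in-memory representation of one eval score."""
    name: str
    score: float
    passed: bool
    source: str
    category: str

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be in [0.0, 1.0], got {self.score}")
        if self.source not in VALID_SOURCES:
            raise ValueError(f"unknown source: {self.source}")
        if self.category not in VALID_CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


# One trace can carry several scores from different evaluators.
trace_scores = [
    EvalScore("relevance", 0.85, True, "llm_judge", "quality"),
    EvalScore("json_validity", 1.0, True, "built_in", "quality"),
]
```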

Eval Verdict

The eval verdict is the aggregate outcome for a trace: pass, fail, or review. It is computed from all of that trace’s individual eval scores.

Eval Score vs Eval Verdict: An eval score is one evaluator’s result (e.g., “relevance: 0.85”). The eval verdict is the roll-up across all scores. A trace with 5 passing scores and 1 failing score gets verdict = fail.
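The roll-up described above (any failing score makes the verdict fail) can be sketched in a few lines. The function name `eval_verdict` is hypothetical. The conditions that produce a review verdict are not described here, so this sketch covers only pass and fail.

```python
from typing import Sequence


def eval_verdict(passed_flags: Sequence[bool]) -> str:
    """Roll up per-evaluator pass/fail flags into one verdict.

    Assumption: a single failing score is enough to fail the trace,
    matching the "5 passing, 1 failing" example. "review" is omitted
    because its trigger is not specified.
    """
    if not passed_flags:
        raise ValueError("no eval scores to aggregate")
    return "pass" if all(passed_flags) else "fail"


# Five passing scores and one failing score roll up to "fail".
print(eval_verdict([True, True, True, True, True, False]))  # fail
```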

Decision Engine

The decision engine is the final arbiter. It combines quality eval scores and compatibility scores into a single verdict per trace (keep, repair, replay, or drop). Rules are evaluated top to bottom: the first rule that matches wins and short-circuits the rest, so a higher rule overrides a lower one. For example, a drop-level compatibility score, or a quality average below 40%, beats a repair-level score. Rule 3 is the one that overrides repair: a trace whose surfaces are only repair-level still drops if its quality average falls below 40%, because a low-quality output isn’t worth mechanically fixing.
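The first-match-wins precedence can be illustrated with an ordered rule list. This is only a sketch of the pattern, not the engine's actual rule table: the per-surface level strings (`"ok"`, `"repair"`, `"drop"`), the function name `decide`, and the rule ordering are assumptions, the rule positions do not correspond to the real rule numbers, and replay is left out because its condition is not described here. The 0.40 floor is the 40% quality threshold expressed on the 0.0 to 1.0 score scale.

```python
from statistics import mean
from typing import Sequence

QUALITY_FLOOR = 0.40  # "quality below 40%" on the 0.0 to 1.0 scale


def decide(compat_levels: Sequence[str], quality_scores: Sequence[float]) -> str:
    """Return the outcome of the first matching rule (hypothetical rule set).

    compat_levels: assumed per-surface levels, each "ok", "repair", or "drop".
    quality_scores: quality eval scores for the trace (at least one required).
    """
    quality_avg = mean(quality_scores)

    # Ordered from highest to lowest precedence; the first match short-circuits.
    rules = [
        ("drop", lambda: "drop" in compat_levels),
        ("drop", lambda: quality_avg < QUALITY_FLOOR),
        ("repair", lambda: "repair" in compat_levels),
    ]
    for outcome, matches in rules:
        if matches():
            return outcome
    return "keep"


# Repair-level surfaces, but the quality average is below the floor: drop wins.
print(decide(["repair"], [0.30, 0.35]))  # drop
# Same surfaces with acceptable quality: the repair rule is reached.
print(decide(["repair"], [0.80]))  # repair
```

Keeping the rules in one ordered list makes the precedence explicit: moving a rule up or down changes which outcome wins without touching the evaluation loop.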

Quality Score vs Compatibility Score: Both are stored as eval scores but measure different things. Quality scores (category="quality") ask “is the output good?” Compatibility scores (category="compatibility") ask “does this trace work with the current agent version?” A trace must pass BOTH to be kept for training.
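The "must pass both" requirement can be sketched as a filter over a trace's eval scores, grouped by `category`. The function name `keep_for_training` and the plain-dict score shape are hypothetical. The sketch also assumes that a category with no scores counts as passing, which the text does not specify.

```python
from typing import Dict, List


def keep_for_training(scores: List[Dict]) -> bool:
    """True only if every quality score and every compatibility score passed.

    Assumption: a category with no scores is treated as passing.
    """
    quality_ok = all(s["passed"] for s in scores if s["category"] == "quality")
    compat_ok = all(s["passed"] for s in scores if s["category"] == "compatibility")
    return quality_ok and compat_ok


scores = [
    {"name": "relevance", "passed": True, "category": "quality"},
    {"name": "tool_schema", "passed": False, "category": "compatibility"},
]
# Good output, but it no longer works with the current agent version.
print(keep_for_training(scores))  # False
```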

Next

Skills & Data Pipeline

How filtered and scored traces become reproducible training datasets.

Evaluations Guide

Hands-on guide to setting up evaluators.
