Evaluations measure whether your agent’s outputs are good. DecimalAI provides built-in evaluators and supports custom ones.

Evaluator

An evaluator is a configured quality check that runs on traces. DecimalAI supports three categories:
| Category | How It Works | Example |
| --- | --- | --- |
| Deterministic | Rule-based checks, instant, zero cost | Response length, JSON validity, regex match |
| LLM-as-Judge | An LLM scores the output | Relevance, helpfulness, safety, coherence |
| Custom | User-defined via the SDK @eval decorator | Domain-specific checks, business rules |
Evaluators produce scores (0.0–1.0) and verdicts (pass/fail) that feed into:
  • The eval dashboard (trends, distributions)
  • Dataset filtering (only train on passing traces)
  • The decision engine (unified keep/repair/replay/drop)
→ See Evaluations for the full guide.
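
To make the deterministic category concrete, here is a minimal sketch of two rule-based checks that each return a score in the 0.0–1.0 range and a pass/fail verdict. It is plain Python and does not use the DecimalAI SDK; the names `EvalResult`, `json_validity`, and `response_length`, along with the `limit` and `threshold` parameters, are hypothetical and chosen only for illustration.

```python
import json
from typing import NamedTuple


class EvalResult(NamedTuple):
    """Hypothetical result type: one score plus its pass/fail verdict."""
    score: float   # 0.0 to 1.0
    passed: bool   # True if the score met the threshold


def json_validity(output: str) -> EvalResult:
    """Score 1.0 if the output parses as JSON, otherwise 0.0."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return EvalResult(0.0, False)
    return EvalResult(1.0, True)


def response_length(output: str, limit: int = 500,
                    threshold: float = 1.0) -> EvalResult:
    """Score 1.0 within the limit; decay proportionally once it is exceeded.

    The limit and threshold defaults are illustrative assumptions.
    """
    length = len(output)
    score = 1.0 if length <= limit else limit / length
    return EvalResult(score, score >= threshold)


if __name__ == "__main__":
    print(json_validity('{"answer": 42}'))
    print(response_length("x" * 1000, limit=500))
```

Both checks are pure functions of the output text, which is what makes this category instant and free to run compared with an LLM judge.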

Eval Score

An eval score is a single evaluation result. Each score has:
| Field | Description |
| --- | --- |
| name | Which evaluator produced it (e.g., “relevance”) |
| score | 0.0 to 1.0 |
| passed | Boolean: whether the score met the threshold |
| source | Who ran it: built_in, llm_judge, sdk, compat_engine |
| category | What it measures: quality or compatibility |
A single trace can have many eval scores from different evaluators.
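
The fields above can be pictured as a small record type. The following is a sketch only: the field names and allowed values come from the table, but the `EvalScore` class itself, its validation, and the sample values are assumptions, not the DecimalAI data model.

```python
from dataclasses import dataclass
from typing import Literal

Source = Literal["built_in", "llm_judge", "sdk", "compat_engine"]
Category = Literal["quality", "compatibility"]


@dataclass(frozen=True)
class EvalScore:
    """Hypothetical record mirroring the documented eval score fields."""
    name: str          # which evaluator produced the score
    score: float       # 0.0 to 1.0
    passed: bool       # whether the score met the evaluator's threshold
    source: Source     # who ran the evaluator
    category: Category # what the score measures

    def __post_init__(self) -> None:
        # Reject scores outside the documented 0.0 to 1.0 range.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be in [0.0, 1.0], got {self.score}")


# One trace can carry several scores from different evaluators.
trace_scores = [
    EvalScore("relevance", 0.85, True, "llm_judge", "quality"),
    EvalScore("json_validity", 1.0, True, "built_in", "quality"),
]
```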

Eval Verdict

The eval verdict is the aggregate outcome for a trace: pass, fail, or review. It is computed from all of the trace's individual eval scores.
Eval Score vs Eval Verdict: An eval score is one evaluator’s result (e.g., “relevance: 0.85”). The eval verdict is the roll-up across all scores. A trace with 5 passing scores and 1 failing score gets verdict = fail.
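
The roll-up described above can be sketched as a function in which a single failing score is enough to fail the trace. This is an illustration under stated assumptions: the `roll_up` name and the tuple layout are hypothetical, and the `review` outcome is left out because this section does not say when it applies.

```python
from typing import Iterable, Tuple

# (evaluator name, score, passed): a simplified stand-in for an eval score.
Score = Tuple[str, float, bool]


def roll_up(scores: Iterable[Score]) -> str:
    """Return "fail" if any individual score failed, otherwise "pass".

    The "review" verdict is intentionally not modeled here.
    """
    scores = list(scores)
    if not scores:
        # Behavior for a trace with no scores is not specified in the text.
        raise ValueError("cannot roll up a trace with no eval scores")
    return "fail" if any(not passed for _, _, passed in scores) else "pass"


# Five passing scores and one failing score still yield a failing verdict.
example = [
    ("relevance", 0.85, True),
    ("helpfulness", 0.90, True),
    ("safety", 1.00, True),
    ("coherence", 0.80, True),
    ("json_validity", 1.00, True),
    ("response_length", 0.30, False),
]
print(roll_up(example))
```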

Decision Engine

The decision engine is the final arbiter. It combines quality eval scores and compatibility scores into a single decision per trace: keep, repair, replay, or drop. Rules are evaluated top to bottom: the first rule that matches wins and short-circuits the rest, so a higher rule overrides a lower one (for example, a drop-level compatibility score, or a quality average below 40%, beats a repair-level score). Rule 3 is the one that overrides repair: a trace whose surfaces are only repair-level is still dropped if its quality average falls below 40%, because a low-quality output isn’t worth mechanically fixing.
Quality Score vs Compatibility Score: Both are stored as eval scores but measure different things. Quality scores (category="quality") ask “is the output good?” Compatibility scores (category="compatibility") ask “does this trace work with the current agent version?” A trace must pass BOTH to be kept for training.
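
The first-match-wins precedence can be sketched as an ordered rule list. Treat this as a rough illustration rather than the engine's actual rule table, which is not reproduced in this section: the `decide` function, the `"ok"`/`"repair"`/`"drop"` compatibility labels, and the relative order of the two drop rules are assumptions. The `replay` outcome is omitted because the text does not describe when it applies.

```python
from statistics import mean
from typing import List

QUALITY_FLOOR = 0.40  # "quality below 40%" from the text, on the 0.0 to 1.0 scale


def decide(quality_scores: List[float], compat_levels: List[str]) -> str:
    """Return the decision of the first matching rule, checked top to bottom."""
    rules = [
        # Any drop-level compatibility result drops the trace outright.
        (lambda: "drop" in compat_levels, "drop"),
        # A low quality average drops the trace even if it is only repair-level.
        (lambda: bool(quality_scores) and mean(quality_scores) < QUALITY_FLOOR,
         "drop"),
        # Otherwise a repair-level compatibility result requests a repair.
        (lambda: "repair" in compat_levels, "repair"),
    ]
    for matches, decision in rules:
        if matches():
            return decision  # first match wins; later rules are never checked
    return "keep"


print(decide([0.9, 0.8], ["ok"]))      # no rule matches
print(decide([0.3, 0.2], ["repair"]))  # quality floor outranks repair
print(decide([0.9], ["repair"]))       # repair-level only
```

Keeping the rules in one ordered list makes the precedence explicit: moving the quality rule below the repair rule would change the second example's outcome from drop to repair.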

Next

Skills & Data Pipeline

How filtered + scored traces become reproducible training datasets.

Evaluations Guide

Hands-on guide to setting up evaluators.