Evaluator
An evaluator is a configured quality check that runs on traces. DecimalAI supports three categories:

| Category | How It Works | Example |
|---|---|---|
| Deterministic | Rule-based checks, instant, zero cost | Response length, JSON validity, regex match |
| LLM-as-Judge | An LLM scores the output | Relevance, helpfulness, safety, coherence |
| Custom | User-defined via the SDK @eval decorator | Domain-specific checks, business rules |

Evaluator results feed into:
- The eval dashboard (trends, distributions)
- Dataset filtering (only train on passing traces)
- The decision engine (unified keep/repair/replay/drop)
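
To make the deterministic category concrete, here is a minimal sketch of three rule-based checks in plain Python. The function names, the `max_chars` default, and the convention of returning 1.0 or 0.0 are assumptions for illustration; registering a check with the SDK is not shown because this page does not document that API.

```python
import json
import re


def json_validity(output: str) -> float:
    """Return 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


def length_within(output: str, max_chars: int = 2000) -> float:
    """Return 1.0 if the output is at most max_chars long (hypothetical limit)."""
    return 1.0 if len(output) <= max_chars else 0.0


def regex_match(output: str, pattern: str) -> float:
    """Return 1.0 if the pattern occurs anywhere in the output."""
    return 1.0 if re.search(pattern, output) else 0.0
```

Checks like these need no model call, which is why the table describes them as instant and zero cost.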
Eval Score
An eval score is a single evaluation result. Each score has:

| Field | Description |
|---|---|
| name | Which evaluator produced it (e.g., “relevance”) |
| score | 0.0 to 1.0 |
| passed | Boolean: did it meet the threshold? |
| source | Who ran it: built_in, llm_judge, sdk, compat_engine |
| category | What it measures: quality or compatibility |
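
The fields above can be sketched as a small Python record. This is an illustrative model, not the SDK's actual class: the `EvalScore` and `make_score` names, the range check, and the "score at or above threshold" reading of `passed` are assumptions.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class EvalScore:
    name: str       # which evaluator produced it, e.g. "relevance"
    score: float    # 0.0 to 1.0
    passed: bool    # did it meet the threshold?
    source: Literal["built_in", "llm_judge", "sdk", "compat_engine"]
    category: Literal["quality", "compatibility"]

    def __post_init__(self) -> None:
        # Reject values outside the documented 0.0 to 1.0 range.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be in [0.0, 1.0], got {self.score}")


def make_score(name: str, value: float, threshold: float,
               source: str, category: str) -> EvalScore:
    """Build a score; assumes 'passed' means value >= threshold."""
    return EvalScore(name, value, value >= threshold, source, category)
```

For example, `make_score("relevance", 0.85, 0.7, "llm_judge", "quality")` yields a passing quality score, where the 0.7 threshold is a made-up value.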
Eval Verdict
The eval verdict is the aggregate outcome for a trace: pass, fail, or review. It is computed from all individual eval scores.
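
A minimal sketch of the roll-up, assuming that any failing score fails the whole trace (consistent with the 5-pass, 1-fail example in this section). This page does not say when `review` is returned, so mapping a trace with no scores to `review` is purely an assumption made to keep the function total.

```python
from typing import Iterable, Literal

Verdict = Literal["pass", "fail", "review"]


def eval_verdict(passed_flags: Iterable[bool]) -> Verdict:
    """Roll individual pass/fail results up into one verdict."""
    flags = list(passed_flags)
    if not flags:
        # Assumption: nothing was evaluated, so flag the trace for review.
        return "review"
    if not all(flags):
        # A single failing score is enough to fail the trace.
        return "fail"
    return "pass"
```

With five passing scores and one failing score, `eval_verdict([True] * 5 + [False])` returns `"fail"`.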
Eval Score vs Eval Verdict: An eval score is one evaluator’s result (e.g., “relevance: 0.85”). The eval verdict is the roll-up across all scores. A trace with 5 passing scores and 1 failing score gets verdict = fail.
Decision Engine
The decision engine is the final arbiter. It combines quality eval scores AND compatibility scores into a single verdict per trace. Precedence runs top to bottom: the first rule that matches wins and short-circuits the rest, so a higher rule overrides a lower one (e.g., a drop-level compat score, or quality below 40%, beats a repair-level score). Rule 3 is the one that overrides repair: a trace whose surfaces are only repair-level still drops if its quality average falls below 40%, because a low-quality output isn’t worth mechanically fixing.
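
The first-match-wins precedence can be sketched as an ordered rule list. Only two things here come from the text: drop-level compatibility and a quality average below 40% both outrank repair. The `TraceSummary` type, the per-surface level strings, the rule names, the relative order of replay and repair, and the `keep` default are assumptions, and the rule positions do not claim to match the product's own rule numbering.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TraceSummary:
    quality_avg: float        # mean of quality scores, 0.0 to 1.0
    compat_levels: List[str]  # hypothetical per-surface levels: "ok", "repair", "replay", "drop"


Rule = Tuple[str, Callable[[TraceSummary], bool], str]

# Evaluated top to bottom; the first matching rule decides.
RULES: List[Rule] = [
    ("compat drop", lambda t: "drop" in t.compat_levels, "drop"),
    ("low quality", lambda t: t.quality_avg < 0.40, "drop"),
    ("compat replay", lambda t: "replay" in t.compat_levels, "replay"),
    ("compat repair", lambda t: "repair" in t.compat_levels, "repair"),
]


def decide(trace: TraceSummary) -> Tuple[str, str]:
    """Return (decision, name of the rule that fired)."""
    for name, matches, decision in RULES:
        if matches(trace):
            return decision, name  # short-circuit: lower rules never run
    return "keep", "default"
```

A trace with only repair-level surfaces and a quality average of 0.35 is dropped by the low-quality rule before the repair rule is ever checked, which is the override the paragraph describes.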
Quality Score vs Compatibility Score: Both are stored as eval scores but measure different things. Quality scores (category="quality") ask “is the output good?” Compatibility scores (category="compatibility") ask “does this trace work with the current agent version?” A trace must pass BOTH to be kept for training.
Next
Skills & Data Pipeline
How filtered + scored traces become reproducible training datasets.
Evaluations Guide
Hands-on guide to setting up evaluators.