DecimalAI provides a comprehensive evaluation framework for assessing the quality of your agent’s outputs. Evaluations can be run client-side (in your process), server-side (via the platform), or through external tools like DeepEval and LangSmith.
Evaluators attach via your framework’s install() call, not init(). decimalai.init(...) sets up tracing; the evals=[...] and builtin_evals= arguments live on the framework integration’s install() function (e.g. decimalai.langchain.install(...)). The examples below use the LangChain install(); swap in your framework’s module.

Pre-Built Evaluators

DecimalAI ships 10 ready-to-use evaluators in two tiers:

Deterministic Evaluators (Free)

These run instantly with zero external calls:
Evaluator | What it checks | Example
json_valid | Output is valid JSON | evals=[json_valid]
contains(patterns) | Output contains required substrings | contains(["source:", "http"])
not_contains(patterns) | Output doesn’t contain banned strings | not_contains(["TODO", "FIXME"])
regex_match(pattern) | Output matches a regex pattern | regex_match(r"\d{4}-\d{2}-\d{2}")
length_check(min, max) | Word/char count within bounds | length_check(min_words=10, max_words=500)
import decimalai
from decimalai.evals import json_valid, contains, length_check
from decimalai.langchain import install

decimalai.init(api_key="dai_sk_...")
install(
    agent_name="my-bot",
    evals=[json_valid, contains(["source:"]), length_check(min_words=20)],
)
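
To make the deterministic tier concrete, here is a minimal standalone sketch of the kind of check each evaluator performs. It is not DecimalAI's implementation: the function names are hypothetical, and details such as "all patterns must match", "regex search anywhere in the output", and counting words by whitespace are assumptions.

```python
import json
import re
from typing import Optional


def is_json_valid(output: str) -> bool:
    """True when the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def contains_all(output: str, patterns: list[str]) -> bool:
    """Assumption: every required substring must be present."""
    return all(p in output for p in patterns)


def contains_none(output: str, patterns: list[str]) -> bool:
    """True when none of the banned substrings appear."""
    return not any(p in output for p in patterns)


def matches_regex(output: str, pattern: str) -> bool:
    """Assumption: the pattern may match anywhere in the output."""
    return re.search(pattern, output) is not None


def word_count_in_bounds(
    output: str, min_words: int = 0, max_words: Optional[int] = None
) -> bool:
    """Assumption: words are whitespace-separated tokens."""
    n = len(output.split())
    return n >= min_words and (max_words is None or n <= max_words)
```

Because none of these touch the network, they are cheap enough to run on every trace.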

Built-in Auto-Checks (Always On)

In addition to the evaluators above, DecimalAI runs 5 automatic quality checks on every trace — zero configuration, zero cost:
Check | What it measures | Score logic
completion | Did the trace complete without error? | 1.0 if status is success, else 0.0
has_output | Is there a non-empty final output? | 1.0 if output length > 0
tool_compliance | Did all tool calls produce results? | Ratio of successful tool calls
latency | Response time score | Linear decay from 1.0 (0s) to 0.0 (10s+)
token_efficiency | Total token usage | Linear decay from 1.0 (0 tokens) to 0.0 (5000+)
These appear in the dashboard labeled source: builtin and feed into the Decision Engine alongside your custom evaluators.
Built-in evals run SDK-side, not server-side. They are attached to the trace payload by install()’s wrapper before the trace is sent. If you ingest traces via bare HTTP POST /api/v1/traces (without the SDK), no built-in scores are computed automatically — push them yourself via POST /api/v1/traces/{id}/eval-scores or use the SDK.
You don’t need to configure or register these when using the SDK — they run automatically on every trace. To disable them, set builtin_evals=False in install().
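
The score logic for latency, token_efficiency, and tool_compliance can be written out as a short sketch. This follows the formulas stated in the table, but the function names are hypothetical, and the behavior for a trace with no tool calls (scored 1.0 here) is an assumption.

```python
def linear_decay(value: float, zero_at: float) -> float:
    """Score 1.0 at value 0, falling linearly to 0.0 at `zero_at` and beyond."""
    return max(0.0, min(1.0, 1.0 - value / zero_at))


def latency_score(seconds: float) -> float:
    # 1.0 at 0s, 0.0 at 10s or more
    return linear_decay(seconds, zero_at=10.0)


def token_efficiency_score(total_tokens: int) -> float:
    # 1.0 at 0 tokens, 0.0 at 5000 tokens or more
    return linear_decay(total_tokens, zero_at=5000.0)


def tool_compliance_score(succeeded: int, total: int) -> float:
    # Assumption: a trace with no tool calls counts as fully compliant
    return 1.0 if total == 0 else succeeded / total
```

Under these formulas a 2.5-second trace gets a latency score of 0.75, and a trace that used 2,500 tokens gets a token_efficiency score of 0.5.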

LLM-as-Judge Evaluators

These use an LLM to score outputs on a 0.0–1.0 scale with a pass/fail verdict:
Evaluator | What it scores | Input needed
Relevance() | Does the output address the input? | input + output
Factuality() | Is the output grounded in facts? | input + output + context
Faithfulness() | Is output faithful to tool results? | output + tool_calls
Toxicity() | Is the content safe and non-harmful? | output
Conciseness() | Is the output appropriately concise? | input + output
import decimalai
from decimalai.evals import Relevance, Toxicity, Faithfulness
from decimalai.langchain import install

decimalai.init(api_key="dai_sk_...")
install(
    agent_name="my-rag-bot",
    evals=[Relevance(), Toxicity(), Faithfulness()],
)

# Use a specific model for judging:
install(
    agent_name="my-bot",
    evals=[Relevance(model="claude-haiku-4-5")],
)

Execution Modes

Client-side evals run in your process, using your own API key via litellm. Install the evals extra:
pip install decimalai[evals]
  • Unlimited evaluations — no metering
  • You choose the judge model: gpt-4o-mini for low cost, gpt-4o for quality, claude-sonnet-4-6 for diversity
  • Works offline — in CI pipelines, notebooks, local testing
  • Privacy — trace data never leaves your infrastructure
Set your model’s API key in the environment:
export OPENAI_API_KEY="sk-..."       # for gpt-4o judge
# or
export ANTHROPIC_API_KEY="sk-ant-..."  # for claude judge

Custom Evaluators

The @eval Decorator

The name eval refers to two different things. The @eval decorator that defines a custom evaluator lives at decimalai.evals.eval (import it with from decimalai.evals import eval). The top-level decimalai.eval(...) is a different function: it pushes an already-computed score to a trace (see External Score Import) and is not the decorator. Reach for decimalai.evals.eval to define a check, and decimalai.eval to record a result.
Write custom evaluators that run on every trace:
from decimalai.evals import eval, TraceData

@eval(name="has_citation")
def check_citation(trace: TraceData) -> bool:
    """Returns True if the output includes a citation."""
    return "[source:" in trace.output

@eval(name="response_quality")
def check_quality(trace: TraceData) -> float:
    """Returns a 0.0-1.0 score based on response length and structure."""
    has_structure = any(h in trace.output for h in ["##", "1.", "- "])
    has_length = len(trace.output.split()) > 50
    return (0.5 * has_structure) + (0.5 * has_length)

import decimalai
from decimalai.langchain import install

decimalai.init(api_key="dai_sk_...")
install(
    agent_name="my-bot",
    evals=[check_citation, check_quality],
)
An evaluator’s return type determines how its result becomes a score:
Return type | Pass logic | Example
bool | True = pass (score 1.0), False = fail (score 0.0) | return "[source:" in trace.output
float | 0.0–1.0 score, pass threshold at 0.5 | return 0.85
dict | Multi-score: each key becomes a separate score | return {"clarity": 0.8, "accuracy": 0.9}
EvalResult | Full control over score, pass/fail, and reason | return EvalResult(score=0.9, passed=True, reason="...")
from decimalai.evals import eval, TraceData, EvalResult

@eval(name="tone_check")
def check_tone(trace: TraceData) -> EvalResult:
    return EvalResult(score=0.9, passed=True, reason="Professional tone")
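
The return-type rules can be sketched as a small normalization step. This is an illustration only, not the SDK's code: the Score class and normalize function are hypothetical, and treating the 0.5 threshold as inclusive (>=) is an assumption. The EvalResult case is omitted because it already carries its own score and verdict.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """Hypothetical stand-in for a recorded score (not decimalai's EvalResult)."""
    name: str
    score: float
    passed: bool


PASS_THRESHOLD = 0.5  # threshold from the docs; inclusive comparison is assumed


def normalize(name: str, value) -> list[Score]:
    """Turn an evaluator's raw return value into one or more scores."""
    # bool is checked first because bool is a subclass of int in Python
    if isinstance(value, bool):
        return [Score(name, 1.0 if value else 0.0, value)]
    if isinstance(value, (int, float)):
        s = float(value)
        return [Score(name, s, s >= PASS_THRESHOLD)]
    if isinstance(value, dict):
        # Each key becomes its own score
        return [
            Score(key, float(v), float(v) >= PASS_THRESHOLD)
            for key, v in value.items()
        ]
    raise TypeError(f"unsupported evaluator return type: {type(value).__name__}")
```

For example, a dict return of {"clarity": 0.8, "accuracy": 0.9} yields two separate passing scores named clarity and accuracy.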

Sampling Rate

Expensive evaluators (especially LLM-as-judge) can run on a subset of traces:
from decimalai.evals import eval, TraceData

@eval(name="expensive_llm_review", sampling_rate=0.1)  # Only 10% of traces
def llm_review(trace: TraceData) -> float:
    # Call external LLM for detailed review
    ...
Traces not sampled are skipped silently — no score is recorded for them.
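
Conceptually, sampling is a random gate in front of the evaluator. The sketch below shows one way such a gate could work. It is not how the SDK implements sampling_rate: the sampled decorator and its rng parameter are hypothetical names introduced for illustration.

```python
import functools
import random
from typing import Optional


def sampled(rate: float, rng: Optional[random.Random] = None):
    """Run the wrapped evaluator on roughly `rate` of calls; return None otherwise."""
    rng = rng or random.Random()

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(trace):
            # rng.random() is in [0.0, 1.0), so rate=0.0 never runs
            # and rate=1.0 always runs
            if rng.random() >= rate:
                return None  # skipped: no score is recorded
            return fn(trace)

        return wrapper

    return decorator


@sampled(rate=0.1)
def expensive_review(trace) -> float:
    # Placeholder for a costly check
    return 1.0
```

Because skipped traces record nothing, aggregate pass rates for a sampled evaluator are computed only over the traces it actually scored.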

TraceData Fields

Field | Type | Description
trace.input | str | User input / query
trace.output | str | Final agent output
trace.tool_calls | list[ToolCallView] | Tool calls with args + results
trace.llm_calls | list[LlmCallView] | LLM calls with prompts + completions
trace.active_skills | list[str] | Skills that were activated
trace.metadata | dict | Custom metadata
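
Since an evaluator only reads attributes off the trace, its logic can be unit-tested against a simple stand-in object before it is registered. The FakeTrace class below is a hypothetical test double that mirrors the field names listed above. It is not part of the SDK, and the element types of tool_calls and llm_calls are left as plain lists because the shape of ToolCallView and LlmCallView is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTrace:
    """Hypothetical test double exposing the TraceData field names."""
    input: str = ""
    output: str = ""
    tool_calls: list = field(default_factory=list)
    llm_calls: list = field(default_factory=list)
    active_skills: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def has_citation(trace) -> bool:
    """Plain evaluator logic, tested here without the decorator applied."""
    return "[source:" in trace.output


def used_a_tool(trace) -> bool:
    """True when the trace recorded at least one tool call."""
    return len(trace.tool_calls) > 0


cited = FakeTrace(input="What is X?", output="X is Y [source: handbook]")
print(has_citation(cited), used_a_tool(cited))
```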

Eval Dashboard

The Evaluate page provides a production-grade dashboard:
  • Stat Cards — Pass rate, evaluated count, average score, failed/review counts
  • Pass Rate Over Time — Stacked bar chart showing daily pass/fail trends with a trend line
  • Score Distribution — Histogram of eval scores across all evaluators
  • Evaluator Breakdown — Scores grouped by source (builtin, SDK, LLM judge, external)
  • Verdict Filters — Filter the trace table by pass, fail, review, or unevaluated

Auto-Scoring

Configure automatic evaluation of incoming traces per-agent:
Mode | Cost | Description
Off | Free | Manual evaluation only
Deterministic | Free | Run built-in checks on every trace automatically
LLM Judge | Metered | AI-powered scoring with monthly budget limits
Configure auto-scoring from the Evaluate dashboard’s Auto-Scoring panel, or by registering evaluators via the API. Each evaluator you create for an agent runs automatically on that agent’s incoming traces:
# List the evaluators currently attached to an agent
curl -H "Authorization: Bearer $API_KEY" \
  "https://api.decimal.ai/api/v1/evaluators?agent_name=my-bot"

# Register a deterministic evaluator (runs automatically on new traces)
curl -X POST -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  https://api.decimal.ai/api/v1/evaluators \
  -d '{
    "agent_name": "my-bot",
    "name": "json_valid",
    "eval_type": "deterministic",
    "category": "quality",
    "enabled": true
  }'

# Register an LLM-judge evaluator (metered)
curl -X POST -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  https://api.decimal.ai/api/v1/evaluators \
  -d '{
    "agent_name": "my-bot",
    "template_id": "relevance",
    "eval_type": "llm_judge",
    "enabled": true
  }'
The older GET/PUT /api/v1/agents/{name}/eval-policy route is deprecated. Configure auto-scoring through /api/v1/evaluators (above) instead.

External Score Import

Push scores from external evaluation tools (DeepEval, LangSmith, custom pipelines):

SDK Convenience Methods

import decimalai

# Push individual scores
decimalai.eval(
    trace_id="abc-123",
    name="correctness",
    score=0.91,
    reason="Factually accurate",
    source="custom",
)

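# Push DeepEval results directly
# (test_cases and the correctness / faithfulness metrics are defined elsewhere in your DeepEval setup)
from deepeval import evaluate
results = evaluate(test_cases, [correctness, faithfulness])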

decimalai.push_deepeval_results(
    trace_id="abc-123",
    results=results,
)

REST API

POST /api/v1/traces/{trace_id}/eval-scores
Content-Type: application/json
Authorization: Bearer dai_sk_...

{
  "scores": [
    {"name": "faithfulness", "score": 0.92, "source": "deepeval"},
    {"name": "answer_relevancy", "score": 0.85, "source": "ragas"}
  ]
}
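
For pipelines that do not use the SDK, the same request can be assembled with the Python standard library. This is a minimal sketch: the helper name build_eval_scores_request is hypothetical, and the host is assumed to be the one shown in the curl examples on this page. The request is only built here, not sent.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.decimal.ai"  # assumed: host from the curl examples


def build_eval_scores_request(
    trace_id: str, scores: list[dict], api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a POST to the eval-scores endpoint."""
    body = json.dumps({"scores": scores}).encode("utf-8")
    safe_id = urllib.parse.quote(trace_id, safe="")  # escape unusual trace ids
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/traces/{safe_id}/eval-scores",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_eval_scores_request(
    "abc-123",
    [{"name": "faithfulness", "score": 0.92, "source": "deepeval"}],
    api_key="dai_sk_...",
)
# To send it: urllib.request.urlopen(req)
```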

LangSmith Webhook

For teams using LangSmith online evals, set up a webhook to push scores automatically:
LangSmith online eval completes → fires webhook →
POST /api/v1/traces/{trace_id}/eval-scores
{ "source": "langsmith", "scores": [{"name": "correctness", "score": 0.85}] }
External scores appear alongside built-in scores in the dashboard and feed into the same decision engine.

Decision Engine

The decision engine aggregates scores from all sources (built-in, SDK, LLM judge, external) into a unified eval verdict.
The eval verdict (Pass / Fail / Review / Unevaluated) answers “is this output high quality?” It is orthogonal to the compatibility verdict (keep / repair / flag / replay / drop), which answers “what should we do with this trace for training?” The engine reads both as inputs but keeps the two axes distinct.
Verdict | Meaning | Triggered when
Pass | Output is high quality | All evaluators pass, or average score > threshold
Fail | Output has quality issues | Any critical evaluator fails
Review | Uncertain, needs human review | Scores are borderline
Unevaluated | No evaluators have run | New trace, no auto-scoring configured
Verdicts feed into:
  • Dataset filtering — only pass traces are included in training data
  • Compatibility scoring — combined with manifest verdicts (keep/repair/replay/drop)
  • Dashboard analytics — pass rate trends and regression detection
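
One way to picture the verdict rules is as a small aggregation function. The sketch below is illustrative only and does not reflect the platform's actual thresholds or precedence: the Score class, the critical flag, the 0.7 pass threshold, and the 0.5 floor for the "borderline" band are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """Hypothetical score record; `critical` is an assumed per-evaluator flag."""
    name: str
    score: float
    passed: bool
    critical: bool = False


def verdict(
    scores: list[Score],
    pass_threshold: float = 0.7,    # assumed value
    review_threshold: float = 0.5,  # assumed lower edge of "borderline"
) -> str:
    if not scores:
        return "unevaluated"  # no evaluators have run
    if any(s.critical and not s.passed for s in scores):
        return "fail"  # any critical evaluator failing is decisive
    average = sum(s.score for s in scores) / len(scores)
    if all(s.passed for s in scores) or average > pass_threshold:
        return "pass"
    if average >= review_threshold:
        return "review"  # borderline: route to a human
    return "fail"
```

Checking the critical failures first means a single failed safety check cannot be averaged away by high scores elsewhere.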

Batch Evaluation

Run evaluators across multiple traces programmatically:
from decimalai.evals import batch_eval, Relevance, Toxicity

results = batch_eval(
    trace_ids=["abc", "def", "ghi"],
    evals=[Relevance(), Toxicity()],
    max_workers=4,
)

print(results["summary"])
# {"relevance": {"passed": 2, "failed": 1}, "toxicity": {"passed": 3, "failed": 0}}
Batch eval fetches traces from the backend, runs your evaluators in parallel, and pushes scores back. Useful for backfilling scores on existing traces or running offline eval passes.
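
The parallel "run, then summarize" part of that flow can be sketched locally with a thread pool. This is not batch_eval itself: run_batch is a hypothetical function, traces are reduced to an id-to-output mapping, evaluators are plain callables returning a bool, and fetching from and pushing to the backend are left out. Only the shape of the summary follows the example output above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(traces: dict, evaluators: dict, max_workers: int = 4) -> dict:
    """traces: trace id -> output text; evaluators: name -> callable returning bool."""

    def score_one(item):
        trace_id, output = item
        return trace_id, {name: fn(output) for name, fn in evaluators.items()}

    # Evaluate traces concurrently; pool.map preserves input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_trace = dict(pool.map(score_one, traces.items()))

    # Tally pass/fail counts per evaluator
    summary = {name: {"passed": 0, "failed": 0} for name in evaluators}
    for results in per_trace.values():
        for name, passed in results.items():
            summary[name]["passed" if passed else "failed"] += 1
    return {"scores": per_trace, "summary": summary}


results = run_batch(
    traces={"abc": "short", "def": "a much longer answer here", "ghi": "TODO later on"},
    evaluators={
        "long_enough": lambda out: len(out.split()) >= 3,
        "no_todo": lambda out: "TODO" not in out,
    },
)
print(results["summary"])
```

Threads suit this workload when evaluators spend most of their time waiting on network calls, as LLM judges do.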

Next Steps

Eval Scores API

Push and retrieve evaluation scores per trace.

Evaluation concepts

Evaluators, eval scores, eval verdicts, the unified decision engine.

Datasets

Filter datasets by eval verdict — only train on high-quality traces.

Migration from other tools

Push scores from DeepEval, LangSmith, or your custom eval pipeline.