Evaluations - DecimalAI

DecimalAI provides a comprehensive evaluation framework for assessing the quality of your agent’s outputs. Evaluations can be run client-side (in your process), server-side (via the platform), or through external tools like DeepEval and LangSmith.

Evaluators attach via your framework’s install() call, not init(). decimalai.init(...) sets up tracing; the evals=[...] and builtin_evals= arguments live on the framework integration’s install() function (e.g. decimalai.langchain.install(...)). The examples below use the LangChain install(); swap in your framework’s module.

Pre-Built Evaluators

DecimalAI ships 10 ready-to-use evaluators in two tiers:

Deterministic Evaluators (Free)

These run instantly with zero external calls:

Evaluator	What it checks	Example
`json_valid`	Output is valid JSON	`evals=[json_valid]`
`contains(patterns)`	Output contains required substrings	`contains(["source:", "http"])`
`not_contains(patterns)`	Output doesn’t contain banned strings	`not_contains(["TODO", "FIXME"])`
`regex_match(pattern)`	Output matches a regex pattern	`regex_match(r"\d{4}-\d{2}-\d{2}")`
`length_check(min, max)`	Word/char count within bounds	`length_check(min_words=10, max_words=500)`

import decimalai
from decimalai.evals import json_valid, contains, length_check
from decimalai.langchain import install

decimalai.init(api_key="dai_sk_...")
install(
    agent_name="my-bot",
    evals=[json_valid, contains(["source:"]), length_check(min_words=20)],
)
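To make the table above concrete, here is a minimal pure-Python sketch of what checks of this kind compute. It is not DecimalAI's implementation: the function names are hypothetical, and the matching semantics (all required substrings must be present, the regex may match anywhere in the output) are assumptions.

```python
import json
import re
from typing import Optional


def is_json_valid(output: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def contains_all(output: str, patterns: list[str]) -> bool:
    """True if every required substring appears (assumed 'all', not 'any')."""
    return all(p in output for p in patterns)


def contains_none(output: str, patterns: list[str]) -> bool:
    """True if no banned substring appears."""
    return not any(p in output for p in patterns)


def matches_regex(output: str, pattern: str) -> bool:
    """True if the pattern matches anywhere in the output (assumed search semantics)."""
    return re.search(pattern, output) is not None


def word_count_ok(output: str, min_words: int = 0, max_words: Optional[int] = None) -> bool:
    """True if the whitespace-separated word count is within bounds."""
    count = len(output.split())
    if count < min_words:
        return False
    return max_words is None or count <= max_words
```

Checks like these are cheap enough to run on every trace, which is why this tier needs no external calls.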

Built-in Auto-Checks (Always On)

In addition to the evaluators above, DecimalAI runs 5 automatic quality checks on every trace — zero configuration, zero cost:

Check	What it measures	Score logic
`completion`	Did the trace complete without error?	`1.0` if status is success, else `0.0`
`has_output`	Is there a non-empty final output?	`1.0` if output length > 0
`tool_compliance`	Did all tool calls produce results?	Ratio of successful tool calls
`latency`	Response time score	Linear decay from `1.0` (0s) to `0.0` (10s+)
`token_efficiency`	Total token usage	Linear decay from `1.0` (0 tokens) to `0.0` (5000+)

These appear in the dashboard labeled source: builtin and feed into the Decision Engine alongside your custom evaluators.

Built-in evals run SDK-side, not server-side. They are attached to the trace payload by install()’s wrapper before the trace is sent. If you ingest traces via bare HTTP POST /api/v1/traces (without the SDK), no built-in scores are computed automatically — push them yourself via POST /api/v1/traces/{id}/eval-scores or use the SDK.

You don’t need to configure or register these when using the SDK — they run automatically on every trace. To disable them, set builtin_evals=False in install().
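The score logic in the auto-check table can be sketched in a few lines. This is an illustration under stated assumptions, not the SDK's code: the function names are hypothetical, and scoring a trace with no tool calls as 1.0 is a guess.

```python
def linear_decay(value: float, zero_at: float) -> float:
    """Return 1.0 at value 0, falling linearly to 0.0 at `zero_at` and beyond."""
    if zero_at <= 0:
        raise ValueError("zero_at must be positive")
    return max(0.0, min(1.0, 1.0 - value / zero_at))


def latency_score(seconds: float) -> float:
    # Per the table: 1.0 at 0s, 0.0 at 10s or more.
    return linear_decay(seconds, 10.0)


def token_efficiency_score(total_tokens: int) -> float:
    # Per the table: 1.0 at 0 tokens, 0.0 at 5000 tokens or more.
    return linear_decay(total_tokens, 5000.0)


def tool_compliance_score(tool_call_succeeded: list[bool]) -> float:
    # Ratio of successful tool calls. Treating a trace with no tool calls
    # as fully compliant (1.0) is an assumption of this sketch.
    if not tool_call_succeeded:
        return 1.0
    return sum(tool_call_succeeded) / len(tool_call_succeeded)
```

Under these formulas a 5-second response and a 2,500-token trace both score 0.5, which is useful to keep in mind when reading dashboard averages.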

LLM-as-Judge Evaluators

These use an LLM to score outputs on a 0.0–1.0 scale with a pass/fail verdict:

Evaluator	What it scores	Input needed
`Relevance()`	Does the output address the input?	input + output
`Factuality()`	Is the output grounded in facts?	input + output + context
`Faithfulness()`	Is output faithful to tool results?	output + tool_calls
`Toxicity()`	Is the content safe and non-harmful?	output
`Conciseness()`	Is the output appropriately concise?	input + output

import decimalai
from decimalai.evals import Relevance, Toxicity, Faithfulness
from decimalai.langchain import install

decimalai.init(api_key="dai_sk_...")
install(
    agent_name="my-rag-bot",
    evals=[Relevance(), Toxicity(), Faithfulness()],
)

# Use a specific model for judging:
install(
    agent_name="my-bot",
    evals=[Relevance(model="claude-haiku-4-5")],
)

Execution Modes

Client-Side (BYO Key)
Server-Side (Metered)

Client-side (BYO key): evals run in your process using your own API key via litellm:

pip install decimalai[evals]

Unlimited evaluations — no metering
You choose the judge model: gpt-4o-mini for low cost, gpt-4o for quality, claude-sonnet-4-6 for diversity
Works offline — in CI pipelines, notebooks, local testing
Privacy — trace data never leaves your infrastructure

Set your model’s API key in the environment:

export OPENAI_API_KEY="sk-..."       # for gpt-4o judge
# or
export ANTHROPIC_API_KEY="sk-ant-..."  # for claude judge

Server-side (metered): evals run on DecimalAI’s infrastructure using DecimalAI’s API key:

evals=[Relevance(use_server=True)]

No API key needed — DecimalAI handles the LLM call
Metered by your billing plan
Consistent scoring — same model across all evaluations

Custom Evaluators

The `@eval` Decorator

The name eval refers to two different things. The @eval decorator that defines a custom evaluator lives at decimalai.evals.eval (import it with from decimalai.evals import eval). The top-level decimalai.eval(...) is a different function: it pushes an already-computed score to a trace (see External Score Import) and is not the decorator. Reach for decimalai.evals.eval to define a check, and decimalai.eval to record a result.

Write custom evaluators that run on every trace:

from decimalai.evals import eval, TraceData

@eval(name="has_citation")
def check_citation(trace: TraceData) -> bool:
    """Returns True if the output includes a citation."""
    return "[source:" in trace.output

@eval(name="response_quality")
def check_quality(trace: TraceData) -> float:
    """Returns a 0.0-1.0 score based on response length and structure."""
    has_structure = any(h in trace.output for h in ["##", "1.", "- "])
    has_length = len(trace.output.split()) > 50
    return (0.5 * has_structure) + (0.5 * has_length)

import decimalai
from decimalai.langchain import install

decimalai.init(api_key="dai_sk_...")
install(
    agent_name="my-bot",
    evals=[check_citation, check_quality],
)

An evaluator’s return type determines how its result becomes a score:

Return type	Pass logic	Example
`bool`	`True` = pass (score 1.0), `False` = fail (score 0.0)	`return "[source:" in trace.output`
`float`	0.0–1.0 score, pass threshold at 0.5	`return 0.85`
`dict`	Multi-score: each key becomes a separate score	`return {"clarity": 0.8, "accuracy": 0.9}`
`EvalResult`	Full control over score, pass/fail, and reason	`return EvalResult(score=0.9, passed=True, reason="...")`

from decimalai.evals import eval, TraceData, EvalResult

@eval(name="tone_check")
def check_tone(trace: TraceData) -> EvalResult:
    return EvalResult(score=0.9, passed=True, reason="Professional tone")
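The return-type table can be read as a normalization step. The sketch below shows one way such a step could work. The Score class and normalize function are hypothetical stand-ins, not SDK types, and treating a score of exactly 0.5 as a pass is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """Hypothetical stand-in for a recorded eval score."""
    name: str
    score: float
    passed: bool


def normalize(name: str, value, threshold: float = 0.5) -> list[Score]:
    """Turn an evaluator's raw return value into one or more scores."""
    # bool must be checked before float because bool is a subclass of int.
    if isinstance(value, bool):
        return [Score(name, 1.0 if value else 0.0, value)]
    if isinstance(value, (int, float)):
        score = float(value)
        return [Score(name, score, score >= threshold)]
    if isinstance(value, dict):
        # Each key becomes its own score, named after the key.
        return [Score(key, float(v), float(v) >= threshold) for key, v in value.items()]
    raise TypeError(f"unsupported evaluator return type: {type(value).__name__}")


print(normalize("has_citation", True))
print(normalize("response_quality", 0.85))
print(normalize("multi", {"clarity": 0.8, "accuracy": 0.3}))
```

Returning an EvalResult skips this inference entirely, which is why it is the option to choose when the pass/fail decision should not follow the 0.5 threshold.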

Sampling Rate

Expensive evaluators (especially LLM-as-judge) can run on a subset of traces:

from decimalai.evals import eval, TraceData

@eval(name="expensive_llm_review", sampling_rate=0.1)  # Only 10% of traces
def llm_review(trace: TraceData) -> float:
    # Call external LLM for detailed review
    ...

Traces not sampled are skipped silently — no score is recorded for them.
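Conceptually, a sampling rate is a per-trace coin flip. The sketch below illustrates that idea with a hypothetical should_evaluate helper. How the SDK actually decides which traces to sample (for example, random draw versus hashing the trace ID) is not stated here, so treat the random draw as an assumption.

```python
import random


def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Return True for roughly `sampling_rate` of calls."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")
    # rng.random() is in [0.0, 1.0), so 1.0 always runs and 0.0 never runs.
    return rng.random() < sampling_rate


rng = random.Random(42)  # seeded only to make the demo repeatable
sampled = sum(should_evaluate(0.1, rng) for _ in range(10_000))
print(f"evaluated {sampled} of 10000 traces")
```

Because unsampled traces record no score, pass rates for a sampled evaluator are computed over the sampled subset only.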

TraceData Fields

Field	Type	Description
`trace.input`	`str`	User input / query
`trace.output`	`str`	Final agent output
`trace.tool_calls`	`list[ToolCallView]`	Tool calls with args + results
`trace.llm_calls`	`list[LlmCallView]`	LLM calls with prompts + completions
`trace.active_skills`	`list[str]`	Skills that were activated
`trace.metadata`	`dict`	Custom metadata
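Because evaluators only read these fields, their logic can be unit-tested without the SDK by passing in a stand-in object with the same attribute names. The FakeTrace class below is a hypothetical test double, not the real TraceData, and the expected_skill metadata key is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTrace:
    """Test double mirroring the TraceData field names from the table above."""
    input: str = ""
    output: str = ""
    tool_calls: list = field(default_factory=list)
    llm_calls: list = field(default_factory=list)
    active_skills: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def has_citation(trace) -> bool:
    """Pass if the output includes a citation marker."""
    return "[source:" in trace.output


def used_expected_skill(trace) -> bool:
    """Pass if the skill named in metadata (hypothetical key) was activated."""
    expected = trace.metadata.get("expected_skill")
    return expected is None or expected in trace.active_skills


trace = FakeTrace(
    output="Paris is the capital [source: atlas]",
    active_skills=["search"],
    metadata={"expected_skill": "search"},
)
print(has_citation(trace), used_expected_skill(trace))
```

Once the plain functions behave as expected, they can be wrapped with the @eval decorator shown earlier.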

Eval Dashboard

The Evaluate page provides a production-grade dashboard:

Stat Cards — Pass rate, evaluated count, average score, failed/review counts
Pass Rate Over Time — Stacked bar chart showing daily pass/fail trends with a trend line
Score Distribution — Histogram of eval scores across all evaluators
Evaluator Breakdown — Scores grouped by source (builtin, SDK, LLM judge, external)
Verdict Filters — Filter the trace table by pass, fail, review, or unevaluated

Auto-Scoring

Configure automatic evaluation of incoming traces per-agent:

Mode	Cost	Description
Off	Free	Manual evaluation only
Deterministic	Free	Run built-in checks on every trace automatically
LLM Judge	Metered	AI-powered scoring with monthly budget limits

Configure auto-scoring from the Evaluate dashboard’s Auto-Scoring panel, or by registering evaluators via the API. Each evaluator you create for an agent runs automatically on that agent’s incoming traces:

# List the evaluators currently attached to an agent
curl -H "Authorization: Bearer $API_KEY" \
  "https://api.decimal.ai/api/v1/evaluators?agent_name=my-bot"

# Register a deterministic evaluator (runs automatically on new traces)
curl -X POST -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  https://api.decimal.ai/api/v1/evaluators \
  -d '{
    "agent_name": "my-bot",
    "name": "json_valid",
    "eval_type": "deterministic",
    "category": "quality",
    "enabled": true
  }'

# Register an LLM-judge evaluator (metered)
curl -X POST -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  https://api.decimal.ai/api/v1/evaluators \
  -d '{
    "agent_name": "my-bot",
    "template_id": "relevance",
    "eval_type": "llm_judge",
    "enabled": true
  }'

The older GET/PUT /api/v1/agents/{name}/eval-policy route is deprecated. Configure auto-scoring through /api/v1/evaluators (above) instead.

External Score Import

Push scores from external evaluation tools (DeepEval, LangSmith, custom pipelines):

SDK Convenience Methods

import decimalai

# Push individual scores
decimalai.eval(
    trace_id="abc-123",
    name="correctness",
    score=0.91,
    reason="Factually accurate",
    source="custom",
)

# Push DeepEval results directly
# (test_cases, correctness, and faithfulness are your own DeepEval test cases and metrics)
from deepeval import evaluate
results = evaluate(test_cases, [correctness, faithfulness])

decimalai.push_deepeval_results(
    trace_id="abc-123",
    results=results,
)

REST API

POST /api/v1/traces/{trace_id}/eval-scores
Content-Type: application/json
Authorization: Bearer dai_sk_...

{
  "scores": [
    {"name": "faithfulness", "score": 0.92, "source": "deepeval"},
    {"name": "answer_relevancy", "score": 0.85, "source": "ragas"}
  ]
}
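For pipelines that do not use the SDK, the same request can be assembled with the Python standard library. This is a minimal sketch: the build_eval_scores_request helper is hypothetical, and the host is assumed to be the one used in the curl examples on this page.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.decimal.ai"  # assumed: host from the curl examples


def build_eval_scores_request(trace_id: str, scores: list[dict], api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that pushes scores to a trace."""
    # Quote the trace ID so unusual characters cannot alter the URL path.
    safe_id = urllib.parse.quote(trace_id, safe="")
    url = f"{BASE_URL}/api/v1/traces/{safe_id}/eval-scores"
    body = json.dumps({"scores": scores}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


request = build_eval_scores_request(
    "abc-123",
    [{"name": "faithfulness", "score": 0.92, "source": "deepeval"}],
    api_key="dai_sk_...",
)
print(request.get_method(), request.full_url)
```

To send it, pass the request to urllib.request.urlopen inside a try/except for urllib.error.HTTPError. The response format is not documented on this page, so the sketch stops at building the request.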

LangSmith Webhook

For teams using LangSmith online evals, set up a webhook to push scores automatically:

LangSmith online eval completes → fires webhook →
POST /api/v1/traces/{trace_id}/eval-scores
{ "source": "langsmith", "scores": [{"name": "correctness", "score": 0.85}] }

External scores appear alongside built-in scores in the dashboard and feed into the same decision engine.

Decision Engine

The decision engine aggregates scores from all sources (built-in, SDK, LLM judge, external) into a unified verdict, summarized in the table below.

The eval verdict (Pass / Fail / Review / Unevaluated) answers “is this output high quality?” It is orthogonal to the compatibility verdict (keep / repair / flag / replay / drop), which answers “what should we do with this trace for training?” The engine reads both as inputs but keeps the two axes distinct.

Verdict	Meaning	Triggered when
Pass	Output is high quality	All evaluators pass, or average score > threshold
Fail	Output has quality issues	Any critical evaluator fails
Review	Uncertain — needs human review	Scores are borderline
Unevaluated	No evaluators have run	New trace, no auto-scoring configured
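The rules in the table can be approximated by a small aggregation function. The sketch below is one plausible reading, not the engine's actual logic: the Score class, the critical flag, the 0.7 pass threshold, and the 0.1 review margin are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """Hypothetical score record; `critical` marks evaluators that can force a fail."""
    name: str
    score: float
    passed: bool
    critical: bool = False


def verdict(scores: list[Score], pass_threshold: float = 0.7, review_margin: float = 0.1) -> str:
    """Aggregate scores into pass / fail / review / unevaluated."""
    if not scores:
        return "unevaluated"  # no evaluators have run
    if any(s.critical and not s.passed for s in scores):
        return "fail"  # any critical evaluator failing is decisive
    average = sum(s.score for s in scores) / len(scores)
    if all(s.passed for s in scores) or average > pass_threshold:
        return "pass"
    if average >= pass_threshold - review_margin:
        return "review"  # borderline: hand off to a human
    return "fail"


print(verdict([]))
print(verdict([Score("relevance", 0.9, True), Score("toxicity", 0.4, False)]))
```

Checking critical failures before the average means a high overall score cannot mask a failed safety check.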

Verdicts feed into:

Dataset filtering — only pass traces are included in training data
Compatibility scoring — combined with manifest verdicts (keep/repair/flag/replay/drop)
Dashboard analytics — pass rate trends and regression detection

Batch Evaluation

Run evaluators across multiple traces programmatically:

from decimalai.evals import batch_eval, Relevance, Toxicity

results = batch_eval(
    trace_ids=["abc", "def", "ghi"],
    evals=[Relevance(), Toxicity()],
    max_workers=4,
)

print(results["summary"])
# {"relevance": {"passed": 2, "failed": 1}, "toxicity": {"passed": 3, "failed": 0}}

Batch eval fetches traces from the backend, runs your evaluators in parallel, and pushes scores back. Useful for backfilling scores on existing traces or running offline eval passes.
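The run-in-parallel-then-summarize part of that flow can be sketched with the standard library. This is not batch_eval itself: the run_batch helper is hypothetical, it takes already-loaded traces (plain strings here) instead of fetching by ID, and it does not push scores anywhere.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_batch(traces: list, evaluators: dict[str, Callable], max_workers: int = 4) -> dict:
    """Run every evaluator on every trace in parallel and tally pass/fail counts."""

    def evaluate_one(trace) -> dict[str, bool]:
        return {name: bool(check(trace)) for name, check in evaluators.items()}

    # pool.map preserves input order, so results line up with `traces`.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_trace = list(pool.map(evaluate_one, traces))

    summary = {name: {"passed": 0, "failed": 0} for name in evaluators}
    for result in per_trace:
        for name, ok in result.items():
            summary[name]["passed" if ok else "failed"] += 1
    return {"results": per_trace, "summary": summary}


outputs = ["Answer [source: a]", "No citation here", "Short [source: b]"]
report = run_batch(
    outputs,
    {
        "has_citation": lambda text: "[source:" in text,
        "not_empty": lambda text: len(text) > 0,
    },
)
print(report["summary"])
```

Threads suit this workload because LLM-judge evaluators spend most of their time waiting on network calls rather than using the CPU.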

Next Steps

Eval Scores API

Push and retrieve evaluation scores per trace.

Evaluation concepts

Evaluators, eval scores, eval verdicts, the unified decision engine.

Datasets

Filter datasets by eval verdict — only train on high-quality traces.

Migration from other tools

Push scores from DeepEval, LangSmith, or your custom eval pipeline.
