skillevaluation is an open specification and reference runner for benchmarking an agent skill via declarative A/B test cases. The format lives next to SKILL.md as eval.yaml and answers a single question:
Does this skill actually help an agent? By how much?
The spec and runner are independent of DecimalAI — pip install "skillevaluation[runner]" executes a full A/B benchmark on your machine, on your own API key, with no account. DecimalAI is a conforming hosted runner of the same spec (it imports the same judge and validator code) and adds what a local run can’t: history, verified results, rankings, and distribution.

Package

skillevaluation on PyPI — Apache 2.0

Schema

JSON Schema for eval.yaml — ships in the wheel: from skillevaluation.resources import load_schema

Why a separate spec

SKILL.md tells an agent how to do something. eval.yaml tells DecimalAI (or any conforming runner) how to measure whether it’s working. Keeping the two side-by-side on disk means:
  • Skill authors version eval cases with the skill itself
  • A skill pulled via decimalai skills pull brings its eval suite with it
  • A regression is detectable against the same cases that proved the skill worked

The 60-second format

# eval.yaml — lives next to SKILL.md
cases:
  - name: tracks_with_id
    prompt: "Classify these schema fields: email, ip_address, name, age."
    expectations:
      - "The response classifies email as PII"
      - "The response identifies ip_address as pseudonymous"
    validators:
      - cmd: "jq -e '.email.category == \"PII\"' output.json"

  - name: ignores_weather
    prompt: "What's the weather in Paris?"
    expectations:
      - "The response answers the weather question directly"
      - "The response does not invoke the GDPR classifier"
Each case runs twice — once with the skill loaded into the agent’s manifest, once without. The runner classifies each case into one of five outcomes:
| Outcome | Meaning |
| --- | --- |
| flip_to_pass | The skill rescued the agent (the “win”) |
| pass_kept | Both arms passed; the skill didn’t discriminate on this case |
| fail_kept | Both arms failed; the skill didn’t rescue the agent |
| flip_to_fail | The skill HURT: regression marker |
| error | A transient failure prevented evaluation |
The aggregate is what becomes the registry headline: “+34 pts pass rate, −46% agent turns.”
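The outcome table and the pass-rate headline can be sketched in a few lines of Python. This is an illustrative sketch, not the reference runner's code: the function names, the `(with_skill, without_skill, errored)` tuple layout, and the choice to leave errored cases out of the pass-rate denominator are all assumptions.

```python
from collections import Counter


def classify(with_skill_passed, without_skill_passed, errored=False):
    """Map one case's two arm results to an outcome label from the table."""
    if errored:
        return "error"
    if with_skill_passed and not without_skill_passed:
        return "flip_to_pass"
    if with_skill_passed and without_skill_passed:
        return "pass_kept"
    if not with_skill_passed and not without_skill_passed:
        return "fail_kept"
    return "flip_to_fail"


def pass_rate_delta_pts(results):
    """Pass-rate difference (with-skill minus without-skill) in percentage points.

    Assumption: errored cases are excluded from both pass rates.
    """
    scored = [(w, wo) for w, wo, err in results if not err]
    if not scored:
        return 0.0
    with_passes = sum(w for w, _ in scored)
    without_passes = sum(wo for _, wo in scored)
    return 100.0 * (with_passes - without_passes) / len(scored)


# Hypothetical results: (with_skill_passed, without_skill_passed, errored)
results = [
    (True, False, False),
    (True, True, False),
    (False, False, False),
    (False, True, False),
    (True, False, False),
    (False, False, True),
]
outcomes = Counter(classify(*r) for r in results)
print(dict(outcomes))
print(f"{pass_rate_delta_pts(results):+.0f} pts pass rate")
```

Here the with-skill arm passes 3 of 5 scored cases and the without-skill arm passes 2 of 5, so the delta is +20 points.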

Assertion kinds

Each case can mix two assertion kinds:
  • expectations — natural-language claims, graded by an LLM judge
  • validators — shell commands, graded by exit code
A case MUST have at least one of either. Use validators when a precise structural check is possible (cheaper, deterministic); use expectations for genuinely semantic claims.
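A minimal sketch of how a runner might enforce the "at least one assertion" rule and grade a validator by exit code. The helper names, the 30-second timeout, and the reading that exit status 0 means "pass" are assumptions for illustration; the runner contract in the package is the authority on the actual behavior.

```python
import subprocess


def has_assertion(case):
    """True if the case declares at least one expectation or validator."""
    return bool(case.get("expectations")) or bool(case.get("validators"))


def run_validator(cmd, cwd=None, timeout=30):
    """Run a validator shell command; treat exit status 0 as a pass.

    Assumption: a non-zero exit status means the validator failed.
    """
    completed = subprocess.run(
        cmd, shell=True, cwd=cwd, capture_output=True, timeout=timeout
    )
    return completed.returncode == 0


# Hypothetical case using trivial shell commands in place of a real jq check.
case = {
    "name": "demo",
    "validators": [{"cmd": "exit 0"}, {"cmd": "exit 3"}],
}

if not has_assertion(case):
    raise ValueError(f"case {case['name']!r} has no expectations or validators")

validator_results = [run_validator(v["cmd"]) for v in case["validators"]]
print(validator_results)
```

Because the grade depends only on the exit status, a validator such as the `jq -e` check in the example above needs no LLM call and gives the same answer on every run.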

Run it locally (free)

The open-source package ships a complete reference runner. Your own API key, your machine — nothing is sent to DecimalAI:
pip install "skillevaluation[runner]"
export ANTHROPIC_API_KEY=sk-ant-...

skillevaluation run ./my-skill --model claude-haiku-4-5
Each case executes twice (with the skill / without), both arms are graded with the same expectations + validators, and you get the delta table plus a results.json conforming to the open wire schema. The without-skill baseline is cached locally, so re-runs while you iterate on SKILL.md cost half. Gate it in CI with --fail-on-verdict fail --min-delta-pts 10, or dry-run the plumbing for free with --adapter mock. Local runs are unlimited and unmetered — iterate as much as you like.

Push results to DecimalAI

Attach a local run to your skill’s Benchmark tab (free — it’s a JSON upload, no quota consumed):
decimalai skills push results.json
Pushed runs are tagged unverified: they show on your skill’s own page but never feed registry rankings — self-reported numbers can’t poison the leaderboard.

Verified runs (hosted)

A verified run is one the DecimalAI runner executed — same open-spec judge and validators (the platform literally imports them from the skillevaluation package), but in a trusted environment, stamped with model + date. Only verified runs feed registry cards, rankings, and SkillScore. Two ways to get one:
decimalai skills benchmark ./my-skill/    # metered against your plan's monthly case quota
…or publish the skill — publishing automatically triggers a verification run, free and quota-exempt. Hosted runs are metered in cases (one metered case = one eval case executed, both arms + judge included; errored cases refunded). See Pricing → Quota enforcement.
| | Local run | Pushed result | Verified run |
| --- | --- | --- | --- |
| Costs you | your LLM tokens | nothing | plan quota (or free on publish) |
| Shows on your skill page | ✗ | ✓ (tagged unverified) | ✓ |
| Feeds rankings / registry cards | ✗ | ✗ | ✓ |

Bring your own runner

The full runner contract (spec/runner-contract.md) and the golden compatibility-tests/ fixtures ship inside the skillevaluation package — load them with from skillevaluation.resources import load_schema. Any implementation that reproduces those fixtures is conforming; the reference runner itself passes the suite — compare against it.

Composing with agentversion

A skillevaluation run produces a numeric score. That score can be recorded on an AgentVersion manifest’s evaluation.gates[] via the skillevaluation:// URI scheme:
{
  "evaluation": {
    "gates": [
      {
        "name": "skillevaluation:gdpr-pii-classifier",
        "actual_score": 0.92,
        "threshold": 0.80,
        "passed": true,
        "evaluator_ref": "skillevaluation://abc123def456@v0.1.0",
        "ran_at": "2026-05-28T14:00:00Z"
      }
    ]
  }
}
Each gate object carries the fields below:
  • name (string, required): Human-readable identifier for the gate, e.g. skillevaluation:gdpr-pii-classifier.
  • actual_score (number, required): The score the run produced, 0.0 to 1.0 (pass rate of the with-skill arm).
  • threshold (number, required): The minimum actual_score required for the gate to pass.
  • passed (boolean, required): Whether actual_score met threshold.
  • evaluator_ref (string, required): A skillevaluation:// URI pinning the exact eval suite + version that produced the score, e.g. skillevaluation://abc123def456@v0.1.0.
  • ran_at (string): ISO 8601 timestamp of when the run completed.
This lets a lifecycle transition cite a specific eval suite’s verdict as evidence — “this manifest reached production because the gdpr-pii-classifier benchmark passed at 92%.”
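As a rough sketch, a gate entry like the one above could be assembled as follows. The `make_gate` helper is hypothetical and not part of either package, and treating "met threshold" as `actual_score >= threshold` is an assumption.

```python
import json


def make_gate(name, actual_score, threshold, evaluator_ref, ran_at=None):
    """Build one evaluation.gates[] entry from a run's score.

    Assumption: the gate passes when actual_score >= threshold.
    """
    if not 0.0 <= actual_score <= 1.0:
        raise ValueError("actual_score must be between 0.0 and 1.0")
    gate = {
        "name": name,
        "actual_score": actual_score,
        "threshold": threshold,
        "passed": actual_score >= threshold,
        "evaluator_ref": evaluator_ref,
    }
    if ran_at is not None:  # ran_at is the only field not marked required
        gate["ran_at"] = ran_at
    return gate


gate = make_gate(
    name="skillevaluation:gdpr-pii-classifier",
    actual_score=0.92,
    threshold=0.80,
    evaluator_ref="skillevaluation://abc123def456@v0.1.0",
    ran_at="2026-05-28T14:00:00Z",
)
manifest_fragment = {"evaluation": {"gates": [gate]}}
print(json.dumps(manifest_fragment, indent=2))
```

Deriving `passed` from the two numbers, rather than setting it by hand, keeps the recorded verdict consistent with the score it cites.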

Status

  • Package is at v0.2.3 (pre-stable; breaking changes possible before v1.0) — the 0.2 line added the reference runner + CLI, and the wheel now ships spec/ + schemas/ (from skillevaluation.resources import load_schema)
  • DecimalAI’s hosted runner consumes the same package — judge and validator behavior is shared by construction, not by copy
  • Conformance suite + JSON Schemas ship inside the skillevaluation package (from skillevaluation.resources import load_schema)
  • Want a different language implementation? The package’s CONFORMANCE.md + golden in/out fixtures define conformance — anything that reproduces them conforms