skillevaluation (eval.yaml)

skillevaluation is an open specification and reference runner for benchmarking an agent skill via declarative A/B test cases. The format lives next to SKILL.md as eval.yaml and answers two questions:

Does this skill actually help an agent? By how much?

The spec and runner are independent of DecimalAI — pip install "skillevaluation[runner]" executes a full A/B benchmark on your machine, on your own API key, with no account. DecimalAI is a conforming hosted runner of the same spec (it imports the same judge and validator code) and adds what a local run can’t: history, verified results, rankings, and distribution.

Package

skillevaluation on PyPI — Apache 2.0

Schema

JSON Schema for eval.yaml — ships in the wheel: from skillevaluation.resources import load_schema

Why a separate spec

SKILL.md tells an agent how to do something. eval.yaml tells DecimalAI (or any conforming runner) how to measure whether it’s working. Keeping the two side-by-side on disk means:

Skill authors version eval cases with the skill itself
A skill pulled via decimalai skills pull brings its eval suite with it
A regression is detectable against the same cases that proved the skill worked

The 60-second format

# eval.yaml — lives next to SKILL.md
cases:
  - name: tracks_with_id
    prompt: "Classify these schema fields: email, ip_address, name, age."
    expectations:
      - "The response classifies email as PII"
      - "The response identifies ip_address as pseudonymous"
    validators:
      - cmd: "jq -e '.email.category == \"PII\"' output.json"

  - name: ignores_weather
    prompt: "What's the weather in Paris?"
    expectations:
      - "The response answers the weather question directly"
      - "The response does not invoke the GDPR classifier"

Each case runs twice — once with the skill loaded into the agent’s manifest, once without. The runner classifies each case into one of five outcomes:

Outcome	Meaning
`flip_to_pass`	The skill rescued the agent — the “win”
`pass_kept`	Both arms passed; skill didn’t discriminate on this case
`fail_kept`	Both arms failed; skill didn’t rescue
`flip_to_fail`	The skill HURT — regression marker
`error`	A transient failure prevented evaluation
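
The outcome table can be read as a small decision function over the two arms. The sketch below is illustrative only: the function name `classify_outcome` and the use of `None` for an arm that could not be evaluated are assumptions of this example, not the reference runner's API.

```python
from typing import Optional


def classify_outcome(with_skill: Optional[bool], without_skill: Optional[bool]) -> str:
    """Map the pass/fail result of both arms to one of the five outcomes.

    None stands for an arm that could not be evaluated (an assumption of
    this sketch; the reference runner may represent errors differently).
    """
    if with_skill is None or without_skill is None:
        return "error"
    if with_skill and not without_skill:
        return "flip_to_pass"   # skill rescued the agent
    if with_skill and without_skill:
        return "pass_kept"      # both arms passed
    if not with_skill and not without_skill:
        return "fail_kept"      # both arms failed
    return "flip_to_fail"       # baseline passed, skill arm failed
```

For example, `classify_outcome(True, False)` yields `flip_to_pass`, the case where the skill made the difference.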

The aggregate is what becomes the registry headline: “+34 pts pass rate, −46% agent turns.”
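
One plausible way to arrive at a "pts pass rate" figure is to subtract the without-skill pass rate from the with-skill pass rate, in percentage points. The sketch below assumes exactly that, and additionally assumes that `error` cases are left out of the denominator; the reference runner may aggregate differently. The function name `pass_rate_delta_pts` is hypothetical.

```python
from typing import Sequence


def pass_rate_delta_pts(outcomes: Sequence[str]) -> float:
    """Pass-rate delta (with-skill minus without-skill) in percentage points.

    Assumption of this sketch: cases with outcome "error" are excluded.
    """
    graded = [o for o in outcomes if o != "error"]
    if not graded:
        return 0.0
    # The with-skill arm passed on flip_to_pass and pass_kept.
    with_skill_passes = sum(o in ("flip_to_pass", "pass_kept") for o in graded)
    # The without-skill arm passed on pass_kept and flip_to_fail.
    without_skill_passes = sum(o in ("pass_kept", "flip_to_fail") for o in graded)
    return 100.0 * (with_skill_passes - without_skill_passes) / len(graded)


outcomes = ["flip_to_pass", "flip_to_pass", "pass_kept", "fail_kept", "error"]
delta = pass_rate_delta_pts(outcomes)  # with-skill 3/4, without-skill 1/4
```

Note that `pass_kept` and `fail_kept` cases cancel out of the difference: only the flips move the headline number.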

Assertion kinds

Each case can mix two assertion kinds:

expectations — natural-language claims, graded by an LLM judge
validators — shell commands, graded by exit code

A case MUST have at least one assertion of either kind. Use validators when a precise structural check is possible (cheaper, deterministic); use expectations for genuinely semantic claims.
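
A minimal sketch of both rules, assuming a case is a plain dict shaped like the eval.yaml example and that a validator passes when its command exits with status 0 (the usual shell convention). The helper names and the timeout are hypothetical and not part of the skillevaluation package.

```python
import subprocess


def has_assertion(case: dict) -> bool:
    """True when the case carries at least one expectation or validator."""
    return bool(case.get("expectations")) or bool(case.get("validators"))


def run_validator(cmd: str, cwd: str = ".", timeout_s: float = 30.0) -> bool:
    """Run a validator shell command; pass when it exits with status 0.

    The timeout is an assumption of this sketch: a hung command is
    treated as a failed validator.
    """
    try:
        proc = subprocess.run(
            cmd, shell=True, cwd=cwd, capture_output=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0
```

Because the verdict is just an exit code, a validator gives the same answer on every run for the same output, which is why it is the cheaper and more deterministic choice when a structural check is available.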

Run it locally (free)

The open-source package ships a complete reference runner. Your own API key, your machine — nothing is sent to DecimalAI:

pip install "skillevaluation[runner]"
export ANTHROPIC_API_KEY=sk-ant-...

skillevaluation run ./my-skill --model claude-haiku-4-5

Each case executes twice (with the skill and without), both arms are graded against the same expectations and validators, and you get the delta table plus a results.json conforming to the open wire schema. The without-skill baseline is cached locally, so re-runs while you iterate on SKILL.md cost about half as much. Gate it in CI with --fail-on-verdict fail --min-delta-pts 10, or dry-run the plumbing for free with --adapter mock. Local runs are unlimited and unmetered, so iterate as much as you like.
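
To see why a cached baseline halves the cost of a re-run, consider a cache keyed only on inputs that the without-skill arm depends on. The sketch below is a hypothetical illustration, not the reference runner's cache: the key derivation, the class name, and the in-memory store are all assumptions (a real cache would persist to disk).

```python
import hashlib
import json


def baseline_cache_key(case: dict, model: str) -> str:
    """Derive a key from the case content and model only.

    SKILL.md is deliberately not part of the key: the without-skill arm
    never sees it, so editing the skill leaves the baseline valid.
    """
    payload = json.dumps({"case": case, "model": model}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class BaselineCache:
    """In-memory stand-in for a local baseline cache (hypothetical)."""

    def __init__(self):
        self._store = {}

    def get_or_run(self, case: dict, model: str, run_arm):
        """Return the cached baseline result, running the arm only on a miss."""
        key = baseline_cache_key(case, model)
        if key not in self._store:
            self._store[key] = run_arm(case, model)
        return self._store[key]
```

Under these assumptions, a second run after editing SKILL.md pays only for the with-skill arm, while changing the model or the case itself produces a new key and a fresh baseline.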

Push results to DecimalAI

Attach a local run to your skill’s Benchmark tab (free — it’s a JSON upload, no quota consumed):

decimalai skills push results.json

Pushed runs are tagged unverified: they show on your skill’s own page but never feed registry rankings — self-reported numbers can’t poison the leaderboard.

Verified runs (hosted)

A verified run is one the DecimalAI runner executed — same open-spec judge and validators (the platform literally imports them from the skillevaluation package), but in a trusted environment, stamped with model + date. Only verified runs feed registry cards, rankings, and SkillScore. Two ways to get one:

decimalai skills benchmark ./my-skill/    # metered against your plan's monthly case quota

…or publish the skill — publishing automatically triggers a verification run, free and quota-exempt. Hosted runs are metered in cases (one metered case = one eval case executed, both arms + judge included; errored cases refunded). See Pricing → Quota enforcement.

	Local run	Pushed result	Verified run
Costs you	your LLM tokens	nothing	plan quota (or free on publish)
Shows on your skill page	—	✓ (tagged unverified)	✓
Feeds rankings / registry cards	—	—	✓

Bring your own runner

The full runner contract (spec/runner-contract.md) and the golden compatibility-tests/ fixtures ship inside the skillevaluation package — load them with from skillevaluation.resources import load_schema. Any implementation that reproduces those fixtures is conforming; the reference runner itself passes the suite — compare against it.

Composing with agentversion

A skillevaluation run produces a numeric score. That score can be recorded on an AgentVersion manifest’s evaluation.gates[] via the skillevaluation:// URI scheme:

{
  "evaluation": {
    "gates": [
      {
        "name": "skillevaluation:gdpr-pii-classifier",
        "actual_score": 0.92,
        "threshold": 0.80,
        "passed": true,
        "evaluator_ref": "skillevaluation://abc123def456@v0.1.0",
        "ran_at": "2026-05-28T14:00:00Z"
      }
    ]
  }
}

Each gate object carries the fields below:

name (string, required): Human-readable identifier for the gate, e.g. skillevaluation:gdpr-pii-classifier.
actual_score (number, required): The score the run produced, 0.0–1.0 (pass rate of the with-skill arm).
threshold (number, required): The minimum actual_score required for the gate to pass.
passed (boolean, required): Whether actual_score met threshold.
evaluator_ref (string, required): A skillevaluation:// URI pinning the exact eval suite + version that produced the score, e.g. skillevaluation://abc123def456@v0.1.0.
ran_at (string): ISO 8601 timestamp of when the run completed.

This lets a lifecycle transition cite a specific eval suite’s verdict as evidence — “this manifest reached production because the gdpr-pii-classifier benchmark passed at 92%.”
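
As a rough illustration of how such a gate entry could be assembled from a run's score, the sketch below builds the dict by hand. `make_gate` is a hypothetical helper, not part of skillevaluation or AgentVersion, and it assumes that "met threshold" means `actual_score >= threshold`.

```python
from datetime import datetime, timezone
from typing import Optional


def make_gate(
    skill_name: str,
    actual_score: float,
    threshold: float,
    evaluator_ref: str,
    ran_at: Optional[datetime] = None,
) -> dict:
    """Build one evaluation.gates[] entry from a run's score."""
    if not 0.0 <= actual_score <= 1.0:
        raise ValueError("actual_score must be within 0.0 and 1.0")
    gate = {
        "name": f"skillevaluation:{skill_name}",
        "actual_score": actual_score,
        "threshold": threshold,
        # Assumption: the gate passes when the score is at or above threshold.
        "passed": actual_score >= threshold,
        "evaluator_ref": evaluator_ref,
    }
    if ran_at is not None:
        # Expects a timezone-aware datetime; emitted as ISO 8601 in UTC.
        gate["ran_at"] = ran_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return gate


gate = make_gate(
    "gdpr-pii-classifier",
    0.92,
    0.80,
    "skillevaluation://abc123def456@v0.1.0",
    datetime(2026, 5, 28, 14, 0, tzinfo=timezone.utc),
)
```

Deriving `passed` from the two numbers, rather than accepting it as an argument, keeps the recorded verdict from drifting out of sync with the score it cites.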

Status

Package is at v0.2.3 (pre-stable; breaking changes possible before v1.0) — the 0.2 line added the reference runner + CLI, and the wheel now ships spec/ + schemas/ (from skillevaluation.resources import load_schema)
DecimalAI’s hosted runner consumes the same package — judge and validator behavior is shared by construction, not by copy
Conformance suite + JSON Schemas ship inside the skillevaluation package (from skillevaluation.resources import load_schema)
Want a different language implementation? The package’s CONFORMANCE.md + golden in/out fixtures define conformance — anything that reproduces them conforms
