skillevaluation is an open specification and reference runner for benchmarking an agent skill via declarative A/B test cases. The format lives next to SKILL.md as eval.yaml and answers a single question:
Does this skill actually help an agent? By how much?
The spec and runner are independent of DecimalAI: after `pip install "skillevaluation[runner]"`, the reference runner executes a full A/B benchmark on your machine, on your own API key, with no account. DecimalAI is a conforming hosted runner of the same spec (it imports the same judge and validator code) and adds what a local run can’t: history, verified results, rankings, and distribution.
Package
skillevaluation on PyPI, Apache 2.0
Schema
JSON Schema for eval.yaml, ships in the wheel: `from skillevaluation.resources import load_schema`
Why a separate spec
SKILL.md tells an agent how to do something. eval.yaml tells DecimalAI (or any conforming runner) how to measure whether it’s working. Keeping the two side-by-side on disk means:
- Skill authors version eval cases with the skill itself
- A skill pulled via `decimalai skills pull` brings its eval suite with it
- A regression is detectable against the same cases that proved the skill worked
The 60-second format
| Outcome | Meaning |
|---|---|
| flip_to_pass | The skill rescued the agent (the “win”) |
| pass_kept | Both arms passed; skill didn’t discriminate on this case |
| fail_kept | Both arms failed; skill didn’t rescue |
| flip_to_fail | The skill HURT: regression marker |
| error | A transient failure prevented evaluation |
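The outcome table reduces to a lookup on the pass/fail result of the two arms. The following is a minimal Python sketch of that mapping, not the reference runner's code: the function name `classify_outcome` and the use of `None` for "this arm could not be evaluated" are assumptions made for illustration.

```python
from typing import Optional


def classify_outcome(baseline_passed: Optional[bool],
                     with_skill_passed: Optional[bool]) -> str:
    """Map the two arms' results to an outcome label from the table.

    None stands in for "this arm could not be evaluated" (an assumption of
    this sketch, not a documented convention).
    """
    if baseline_passed is None or with_skill_passed is None:
        return "error"
    if not baseline_passed and with_skill_passed:
        return "flip_to_pass"   # baseline failed, with-skill passed
    if baseline_passed and with_skill_passed:
        return "pass_kept"      # both arms passed
    if not baseline_passed and not with_skill_passed:
        return "fail_kept"      # both arms failed
    return "flip_to_fail"       # baseline passed, with-skill failed


print(classify_outcome(False, True))
```

Only the two flip outcomes carry signal about the skill itself; the two kept outcomes say the case did not separate the arms.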
Assertion kinds
Each case can mix two assertion kinds:
- expectations: natural-language claims, graded by an LLM judge
- validators: shell commands, graded by exit code
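To make "graded by exit code" concrete, here is a rough Python sketch of how a runner might grade a single validator. It assumes the usual shell convention that exit code 0 means pass and anything else means fail; the function name `run_validator`, the timeout, and the treatment of a timeout as a failure are illustrative choices, not the reference runner's behavior.

```python
import subprocess


def run_validator(command: str, cwd: str = ".", timeout_s: float = 60.0) -> bool:
    """Run one shell validator; exit code 0 counts as a pass (assumed convention)."""
    try:
        completed = subprocess.run(
            command,
            shell=True,                 # validators are shell commands
            cwd=cwd,                    # directory holding the agent's output
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,  # only the exit code is graded here
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return False  # a hung validator counts as a failure in this sketch
    return completed.returncode == 0


print(run_validator("exit 0"), run_validator("exit 3"))
```

Validators are deterministic and cheap, so they suit checks like "the file exists" or "the tests pass", while expectations cover claims that need judgment.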
Run it locally (free)
The open-source package ships a complete reference runner. It uses your own API key on your own machine, and nothing is sent to DecimalAI. The run writes a results.json conforming to the open wire schema. The without-skill baseline is cached locally, so re-runs while you iterate on SKILL.md cost half as much. Gate it in CI with `--fail-on-verdict fail --min-delta-pts 10`, or dry-run the plumbing for free with `--adapter mock`.
Local runs are unlimited and unmetered — iterate as much as you like.
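As an illustration of what a delta-points gate such as `--min-delta-pts 10` could be checking, the sketch below computes the with-skill pass rate minus the baseline pass rate, in percentage points, from per-case outcome labels. This definition, the exclusion of `error` cases, and the function name `delta_points` are assumptions of the sketch; the runner's actual computation may differ.

```python
def delta_points(outcomes: list[str]) -> float:
    """With-skill pass rate minus baseline pass rate, in percentage points.

    Cases labelled "error" are excluded (an assumption of this sketch).
    """
    scored = [o for o in outcomes if o != "error"]
    if not scored:
        return 0.0
    # The with-skill arm passed on flip_to_pass and pass_kept.
    with_skill = sum(o in ("flip_to_pass", "pass_kept") for o in scored)
    # The baseline arm passed on pass_kept and flip_to_fail.
    baseline = sum(o in ("pass_kept", "flip_to_fail") for o in scored)
    return 100.0 * (with_skill - baseline) / len(scored)


outcomes = ["flip_to_pass", "flip_to_pass", "pass_kept", "fail_kept", "flip_to_fail"]
print(delta_points(outcomes))  # 3 of 5 with skill vs 2 of 5 baseline -> 20.0
```

Under this reading, a gate of 10 points would pass the example run, since the skill improved the pass rate by 20 points.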
Push results to DecimalAI
Attach a local run to your skill’s Benchmark tab (free: it’s a JSON upload, no quota consumed).
Verified runs (hosted)
A verified run is one the DecimalAI runner executed: same open-spec judge and validators (the platform literally imports them from the skillevaluation package), but in a trusted environment, stamped with model + date. Only verified runs feed registry cards, rankings, and SkillScore.
Two ways to get one:
| | Local run | Pushed result | Verified run |
|---|---|---|---|
| Costs you | your LLM tokens | nothing | plan quota (or free on publish) |
| Shows on your skill page | — | ✓ (tagged unverified) | ✓ |
| Feeds rankings / registry cards | — | — | ✓ |
Bring your own runner
The full runner contract (spec/runner-contract.md) and the golden compatibility-tests/ fixtures ship inside the skillevaluation package; load them with `from skillevaluation.resources import load_schema`. Any implementation that reproduces those fixtures is conforming. The reference runner itself passes the suite, so you can compare your own output against it.
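The idea of "reproducing the fixtures" can be sketched as a golden-output comparison. The Python below is only an outline under stated assumptions: the fixture shape (`name`, `input`, `expected` keys), the function name `check_conformance`, and the toy runner are all hypothetical, and the real fixture layout is defined by the files that ship in the package.

```python
from typing import Callable


def check_conformance(fixtures: list[dict],
                      runner: Callable[[dict], dict]) -> list[str]:
    """Return the names of fixtures whose output differs from the golden output."""
    failures = []
    for fixture in fixtures:
        actual = runner(fixture["input"])
        # dict equality ignores key order, so only the content is compared
        if actual != fixture["expected"]:
            failures.append(fixture["name"])
    return failures


# Hypothetical fixtures and a toy runner, for illustration only.
fixtures = [
    {"name": "both-pass", "input": {"baseline": True, "with_skill": True},
     "expected": {"outcome": "pass_kept"}},
    {"name": "rescued", "input": {"baseline": False, "with_skill": True},
     "expected": {"outcome": "flip_to_pass"}},
]


def toy_runner(case: dict) -> dict:
    # Deliberately incomplete: it never reports a flip.
    return {"outcome": "pass_kept"}


print(check_conformance(fixtures, toy_runner))
```

An empty list would mean every fixture was reproduced; here the toy runner fails the "rescued" fixture.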
Composing with agentversion
A skillevaluation run produces a numeric score. That score can be recorded on an AgentVersion manifest’s evaluation.gates[] via the skillevaluation:// URI scheme. Each gate entry carries:
- Human-readable identifier for the gate, e.g. skillevaluation:gdpr-pii-classifier.
- The score the run produced, 0.0–1.0 (pass rate of the with-skill arm).
- The minimum actual_score required for the gate to pass.
- Whether actual_score met threshold.
- A skillevaluation:// URI pinning the exact eval suite + version that produced the score, e.g. skillevaluation://abc123def456@v0.1.0.
- ISO 8601 timestamp of when the run completed.
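To show how these fields fit together, here is a small Python sketch that assembles one gate entry. Only `actual_score` and `threshold` are names taken from the text above; the other keys (`name`, `passed`, `source`, `completed_at`), the function `build_gate`, and the example score and threshold are hypothetical, so check the AgentVersion manifest schema for the real field names. "Met" is read here as greater than or equal to.

```python
from datetime import datetime, timezone


def build_gate(name: str, actual_score: float, threshold: float,
               source_uri: str) -> dict:
    """Assemble one gate entry; key names other than actual_score and
    threshold are placeholders chosen for this sketch."""
    if not 0.0 <= actual_score <= 1.0:
        raise ValueError("actual_score must be within 0.0 and 1.0")
    if not source_uri.startswith("skillevaluation://"):
        raise ValueError("source_uri must use the skillevaluation:// scheme")
    return {
        "name": name,                          # hypothetical key
        "actual_score": actual_score,          # pass rate of the with-skill arm
        "threshold": threshold,                # minimum score for the gate to pass
        "passed": actual_score >= threshold,   # hypothetical key
        "source": source_uri,                  # hypothetical key
        # ISO 8601 timestamp in UTC; hypothetical key
        "completed_at": datetime.now(timezone.utc).isoformat(),
    }


gate = build_gate(
    name="skillevaluation:gdpr-pii-classifier",
    actual_score=0.9,   # illustrative value
    threshold=0.8,      # illustrative value
    source_uri="skillevaluation://abc123def456@v0.1.0",
)
print(gate["passed"])
```

Pinning the suite and version in the URI means a later reader can tell exactly which eval cases produced the recorded score.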
Status
- Package is at v0.2.3 (pre-stable; breaking changes possible before v1.0). The 0.2 line added the reference runner + CLI, and the wheel now ships spec/ + schemas/ (`from skillevaluation.resources import load_schema`)
- DecimalAI’s hosted runner consumes the same package, so judge and validator behavior is shared by construction, not by copy
- Conformance suite + JSON Schemas ship inside the skillevaluation package (`from skillevaluation.resources import load_schema`)
- Want a different language implementation? The package’s CONFORMANCE.md + golden in/out fixtures define conformance: anything that reproduces them conforms