Every skill carries a SkillScore (0–100) that answers one question: how well has this skill been proven to work? It’s built from real evidence — not stars, not download counts.

The three signals

A SkillScore blends up to three independent kinds of proof:

Benchmark

Example tasks are run with the skill and without it. Did the skill raise the pass rate?

Live eval

On real production runs that were evaluated, did the skill pass?

AI rating

An AI judge scores the quality of real production outputs.
A skill can have one, two, or all three. More signals make a score more trustworthy, but even a single signal earns a visible score. The skill detail page shows which signals are in and which are still pending (e.g. “Backed by 1 of 3 signals”), and how to earn the rest.

The signals feed into a single composite: whichever are present are combined into one 0–100 number.
A skill with no evidence yet shows New rather than a number — not a failure, just nothing measured.
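The blending described above can be sketched as follows. This is a minimal illustration, not the actual formula: the page does not state how signals are weighted, so the equal-weight average, the signal keys, and the function names `skill_score` and `badge` are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical keys for the three signals named on this page.
SIGNALS = ("benchmark", "live_eval", "ai_rating")


def skill_score(signals: dict) -> Optional[int]:
    """Blend whichever signals are present into one 0-100 number.

    ASSUMPTION: equal weighting. The real composite may weight signals differently.
    Each present signal is expected to already be on a 0-100 scale.
    """
    present = [signals[name] for name in SIGNALS if signals.get(name) is not None]
    if not present:
        return None  # no evidence yet: displayed as "New", not as 0
    return round(sum(present) / len(present))


def badge(signals: dict) -> str:
    """Render the score the way the skill detail page describes it."""
    score = skill_score(signals)
    if score is None:
        return "New"
    backed = sum(1 for name in SIGNALS if signals.get(name) is not None)
    return f"{score} (Backed by {backed} of {len(SIGNALS)} signals)"


print(badge({"benchmark": 80}))
print(badge({"benchmark": 80, "live_eval": 90}))
print(badge({}))
```

The point of the sketch is the shape of the rule: missing signals are skipped rather than counted as zero, and a skill with no signals gets no number at all.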

Where the evidence comes from

The score is always computed from the evidence attached to that specific skill. What differs is whose runs count and who can see the result:
| The skill is… | Scored from | Inherits a score? | Visible to |
| --- | --- | --- | --- |
| Public | Everyone who uses it, anonymized (its cross-org reputation) | n/a (it is the public reputation) | Everyone |
| Private to your team | Your team’s own runs | No (scored from your team’s data only) | Your team |
| A copy you forked | Its own runs (starts at New, earns its own) | No (starts fresh, builds its own) | Your team |

Forking starts a fresh score

When you fork a skill you get an independent, editable copy. Because you can change it, it does not inherit the original’s score — it builds its own from your runs. The original keeps its public score.

Public reputation vs. your results

For a public skill your team uses, you’ll see two complementary numbers. They answer different questions, so we keep them separate:

SkillScore (public)

“Is this proven to work, across everyone?” The cross-org reputation — use it to decide whether to adopt a skill.

Your results

“Is it working for us?” Your team’s own pass rate, usage, and ratings on the skill — use it to catch a version that regressed for your workload, then pin a version or fork your own.
There is only ever one SkillScore per skill (the public reputation). “Your results” is your own raw production data shown alongside it — not a second, competing score.
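As a rough illustration of using “Your results” to catch a regressed version, the sketch below groups a team’s own runs by skill version and compares pass rates. The `(version, passed)` run shape, the function names, and the 5-point tolerance are assumptions for the example, not part of the product.

```python
from collections import defaultdict


def pass_rate_by_version(runs):
    """runs: iterable of (version, passed) tuples from your own team's runs."""
    totals = defaultdict(lambda: [0, 0])  # version -> [passes, total runs]
    for version, passed in runs:
        totals[version][0] += int(passed)
        totals[version][1] += 1
    return {version: passes / total for version, (passes, total) in totals.items()}


def regressed(rates, old, new, tolerance=0.05):
    """True if `new` passes noticeably less often than `old`.

    ASSUMPTION: `tolerance` is an arbitrary threshold chosen for illustration.
    """
    return rates[new] < rates[old] - tolerance


# Hypothetical data: version 1.0 passed 9 of 10 runs, version 1.1 passed 6 of 10.
runs = [("1.0", True)] * 9 + [("1.0", False)] + [("1.1", True)] * 6 + [("1.1", False)] * 4
rates = pass_rate_by_version(runs)
if regressed(rates, old="1.0", new="1.1"):
    print("1.1 regressed for this workload; consider pinning 1.0 or forking")
```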

How it’s calculated

A background job regularly reads each skill’s benchmark runs, evaluated traces, and AI ratings, computes the 0–100, and stores it with a per-signal breakdown. Public skills are scored from anonymized cross-org evidence; private and forked skills are scored from your team’s data only and kept private to your org.
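The evidence-selection rule in this paragraph might look like the sketch below. The `Run` record, the `visibility` strings, and both function names are hypothetical, and converting a pass rate directly to a 0–100 signal is an assumption; the page does not describe the job’s internals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Run:
    """Hypothetical record of one evaluated production run."""
    skill_id: str
    org_id: str
    passed: bool


def evidence_for(skill_id: str, visibility: str, owner_org: str, runs: List[Run]) -> List[bool]:
    """Select the outcomes that count toward a skill's score.

    Public skills use runs from every org; private and forked skills use only
    the owning team's runs. Only pass/fail outcomes are returned, so org ids
    never leave this function (a stand-in for anonymization).
    """
    selected = [r for r in runs if r.skill_id == skill_id]
    if visibility != "public":
        selected = [r for r in selected if r.org_id == owner_org]
    return [r.passed for r in selected]


def live_eval_signal(outcomes: List[bool]) -> Optional[float]:
    """ASSUMPTION: the live-eval signal is the pass rate scaled to 0-100."""
    if not outcomes:
        return None  # signal still pending
    return 100.0 * sum(outcomes) / len(outcomes)


runs = [
    Run("summarize", "org-a", True),
    Run("summarize", "org-a", True),
    Run("summarize", "org-b", False),
    Run("summarize", "org-c", True),
]
print(live_eval_signal(evidence_for("summarize", "public", "org-a", runs)))
print(live_eval_signal(evidence_for("summarize", "private", "org-a", runs)))
```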

On the leaderboard

The registry leaderboard ranks skills by SkillScore. A skill needs at least one signal to appear — and the more corroborating signals it has, the more its score can be trusted.
Want to raise a skill’s score? Run a verified benchmark, route real traffic through it so live evals accrue, and let the AI rater sample your production traces. The first signal unlocks leaderboard ranking, and each additional signal raises confidence in the score.
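A minimal sketch of the leaderboard rule: skills with no signal are excluded, and the rest are ranked by SkillScore. The dictionary fields and the tie-break (more signals first, then name) are assumptions for illustration; the page only says that ranking is by SkillScore.

```python
def leaderboard(skills):
    """Rank skills by score, highest first.

    skills: list of dicts with 'name', 'score' (int or None), and 'signals'
    (how many of the three signals are present).
    ASSUMPTION: ties are broken by signal count, then alphabetically.
    """
    eligible = [s for s in skills if s["signals"] >= 1 and s["score"] is not None]
    return sorted(eligible, key=lambda s: (-s["score"], -s["signals"], s["name"]))


skills = [
    {"name": "pdf-extract", "score": 91, "signals": 1},
    {"name": "sql-review", "score": 91, "signals": 3},
    {"name": "brand-new", "score": None, "signals": 0},  # shown as "New", not ranked
    {"name": "changelog", "score": 78, "signals": 2},
]
for rank, skill in enumerate(leaderboard(skills), start=1):
    print(rank, skill["name"], skill["score"])
```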

Public Registry

Browse and install skills ranked by SkillScore.

skillevaluation

The A/B benchmark that produces the strongest SkillScore signal.

Skills

Author, auto-discover, and track SKILL.md files.