SkillScore - DecimalAI

Every skill carries a SkillScore (0–100) that answers one question: how well has this skill been proven to work? It’s built from real evidence — not stars, not download counts.

The three signals

A SkillScore blends up to three independent kinds of proof:

Benchmark

The skill is run on example tasks with it vs. without it. Did it raise the pass rate?

Live eval

On real production runs that were evaluated, did the skill pass?

AI rating

An AI judge scores the quality of real production outputs.
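The benchmark signal compares the same tasks with and without the skill. A minimal sketch of that comparison follows; the `BenchmarkRun` record and both function names are hypothetical, chosen only to illustrate a pass-rate uplift, and are not DecimalAI's actual data model.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    task_id: str
    with_skill: bool  # True if the skill was enabled for this run
    passed: bool


def pass_rate(runs: list[BenchmarkRun], with_skill: bool) -> float:
    """Fraction of runs in the chosen arm that passed (0.0 if the arm is empty)."""
    arm = [r for r in runs if r.with_skill == with_skill]
    return sum(r.passed for r in arm) / len(arm) if arm else 0.0


def pass_rate_uplift(runs: list[BenchmarkRun]) -> float:
    """Pass rate with the skill minus pass rate without it."""
    return pass_rate(runs, True) - pass_rate(runs, False)


# Hypothetical data: four tasks, each run once per arm.
runs = [
    BenchmarkRun("t1", True, True),
    BenchmarkRun("t2", True, True),
    BenchmarkRun("t3", True, False),
    BenchmarkRun("t4", True, True),
    BenchmarkRun("t1", False, True),
    BenchmarkRun("t2", False, False),
    BenchmarkRun("t3", False, False),
    BenchmarkRun("t4", False, True),
]
uplift = pass_rate_uplift(runs)  # 0.75 with the skill vs. 0.50 without
```

A positive uplift means the skill raised the pass rate on these tasks; how that uplift maps onto the 0–100 scale is not specified here.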

A skill can have one, two, or all three. More signals make for a more trustworthy score, but even a single signal earns a visible score. The skill detail page shows which signals are in and which are still pending (e.g. “Backed by 1 of 3 signals”), and how to earn the rest. Whichever signals are present are combined into a single composite 0–100 number.

A skill with no evidence yet shows New rather than a number — not a failure, just nothing measured.
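The exact blending formula is not given on this page. As a rough sketch only, the code below assumes each signal has already been normalized to 0–100 and that present signals are averaged with equal weight. Both assumptions, and all function names, are illustrative rather than the real implementation.

```python
from typing import Optional


def composite_score(
    benchmark: Optional[float] = None,
    live_eval: Optional[float] = None,
    ai_rating: Optional[float] = None,
) -> Optional[float]:
    """Combine whichever signals are present; None means no evidence yet."""
    present = [s for s in (benchmark, live_eval, ai_rating) if s is not None]
    if not present:
        return None
    # Assumption: equal weights. The real weighting is not documented here.
    return sum(present) / len(present)


def display(score: Optional[float]) -> str:
    """Show 'New' instead of a number when nothing has been measured."""
    return "New" if score is None else str(round(score))


def backing_label(*signals: Optional[float]) -> str:
    """Build the 'Backed by N of 3 signals' style label."""
    present = sum(s is not None for s in signals)
    return f"Backed by {present} of {len(signals)} signals"


one_signal = composite_score(benchmark=80.0)
two_signals = composite_score(benchmark=80.0, live_eval=60.0)
no_signals = composite_score()
```

Under these assumptions a skill with only a benchmark signal still gets a number, and a skill with no signals displays `New`.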

Where the evidence comes from

The score is always computed from the evidence attached to that specific skill. What differs is whose runs count and who can see the result:

The skill is…	Scored from	Inherits a score?	Visible to
Public	Everyone who uses it, anonymized — its cross-org reputation	n/a — it is the public reputation	Everyone
Private to your team	Your team’s own runs	No — scored from your team’s data only	Your team
A copy you forked	Its own runs — starts at New, earns its own	No — starts fresh, builds its own	Your team

Forking starts a fresh score

When you fork a skill you get an independent, editable copy. Because you can change it, it does not inherit the original’s score — it builds its own from your runs. The original keeps its public score.
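The fork behaviour can be pictured as copying the skill's content but not its evidence. The `Skill` class and `fork` function below are hypothetical names used for illustration; they only model the rule that a fork starts with no evidence and leaves the original untouched.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Skill:
    name: str
    visibility: str  # "public" or "private"
    content: str
    evidence: list = field(default_factory=list)  # runs that feed the score
    forked_from: Optional[str] = None


def fork(original: Skill, team: str) -> Skill:
    """Return an editable copy that carries over content but no evidence."""
    return Skill(
        name=f"{team}/{original.name}",
        visibility="private",
        content=original.content,
        evidence=[],  # starts at New and earns its own score
        forked_from=original.name,
    )


original = Skill("summarize", "public", "# SKILL.md body", evidence=["run-1", "run-2"])
copy = fork(original, team="acme")
```

Because the copy's evidence list is empty, it would show `New` until the team's own runs accumulate, while the original keeps its evidence and its public score.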

Public reputation vs. your results

For a public skill your team uses, you’ll see two complementary numbers. They answer different questions, so we keep them separate:

SkillScore (public)

“Is this proven to work, across everyone?” The cross-org reputation — use it to decide whether to adopt a skill.

Your results

“Is it working for us?” Your team’s own pass rate, usage, and ratings on the skill — use it to catch a version that regressed for your workload, then pin a version or fork your own.

There is only ever one SkillScore per skill (the public reputation). “Your results” is your own raw production data shown alongside it — not a second, competing score.
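One way to use “Your results” to spot a regressed version is to compare your team's pass rate per version. The sketch below assumes runs are available as `(version, passed)` pairs and uses an arbitrary 5-point tolerance; both the data shape and the threshold are assumptions for illustration.

```python
from collections import defaultdict


def pass_rate_by_version(runs: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each skill version to the fraction of your team's runs that passed."""
    totals = defaultdict(lambda: [0, 0])  # version -> [passed, total]
    for version, passed in runs:
        totals[version][0] += int(passed)
        totals[version][1] += 1
    return {v: p / t for v, (p, t) in totals.items()}


def regressed(rates: dict[str, float], old: str, new: str, tolerance: float = 0.05) -> bool:
    """True if the new version's pass rate fell below the old one by more than tolerance."""
    return rates[new] < rates[old] - tolerance


# Hypothetical team data: version 1.0 passes 9 of 10 runs, version 1.1 passes 6 of 10.
runs = (
    [("1.0", True)] * 9 + [("1.0", False)]
    + [("1.1", True)] * 6 + [("1.1", False)] * 4
)
rates = pass_rate_by_version(runs)
needs_action = regressed(rates, old="1.0", new="1.1")
```

If `needs_action` is true, pinning the earlier version or forking are the two remedies named above.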

How it’s calculated

A background job regularly reads each skill’s benchmark runs, evaluated traces, and AI ratings, computes the 0–100 score, and stores it with a per-signal breakdown. Public skills are scored from anonymized cross-org evidence; private and forked skills are scored from your team’s data only and kept private to your org.

On the leaderboard

The registry leaderboard ranks skills by SkillScore. A skill needs at least one signal to appear — and the more corroborating signals it has, the more its score can be trusted.
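The ranking rule can be sketched as a filter followed by a sort. The dictionary fields below are hypothetical, and the tie-break (more signals first, then name) is an assumption added to make the ordering deterministic; the page only states that skills are ranked by SkillScore and need at least one signal.

```python
def leaderboard(skills: list[dict]) -> list[dict]:
    """Keep skills with at least one signal and rank them by score, highest first."""
    eligible = [s for s in skills if s["signals"] >= 1 and s["score"] is not None]
    # Assumption: ties are broken by signal count, then alphabetically by name.
    return sorted(eligible, key=lambda s: (-s["score"], -s["signals"], s["name"]))


skills = [
    {"name": "summarize", "score": 82, "signals": 3},
    {"name": "translate", "score": 82, "signals": 1},
    {"name": "classify", "score": 91, "signals": 2},
    {"name": "brand-new", "score": None, "signals": 0},  # shows as New, not ranked
]
ranked_names = [s["name"] for s in leaderboard(skills)]
```

Here `brand-new` is excluded because it has no signals, and the two skills tied at 82 are ordered by how many signals back them.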

Want to raise a skill’s score? Run a verified benchmark, route real traffic through it so live evals accrue, and let the AI rater sample your production traces. The first signal unlocks leaderboard ranking, and each added signal raises confidence in the score.

Related

Public Registry

Browse and install skills ranked by SkillScore.

skillevaluation

The A/B benchmark that produces the strongest SkillScore signal.

Skills

Author, auto-discover, and track SKILL.md files.