Playground uses your own LLM API keys (BYOK — Bring Your Own Key). DecimalAI doesn’t subsidize LLM calls — you pay your provider directly. Configure keys in Settings → Credentials.
Playground vs. skillevaluation. The Playground is for interactive, single-trace exploration — re-run one trace, eyeball the side-by-side, iterate by hand. To measure whether a skill actually helps across a batch of cases (a with-skill vs. without-skill A/B that produces a SkillScore), use the skillevaluation benchmark instead — see Skills. Use the Playground to form a hypothesis; use a benchmark to prove it.
Getting Started
Navigate to the Playground page from the sidebar, or open it contextually:
- From a trace: Click “Open in Playground” on any trace detail page
- From a skill: Click “Test in Playground” on any skill detail page
- Direct URL: /playground or /playground?skill=code-review
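The direct URL form can also be generated programmatically, for example when linking to the Playground from an internal tool. The sketch below is a minimal illustration that assumes only the `skill` query parameter shown above; the host `https://app.example.com` and the helper name `playground_url` are hypothetical placeholders, not part of DecimalAI.

```python
from typing import Optional
from urllib.parse import urlencode


def playground_url(base: str, skill: Optional[str] = None) -> str:
    """Build a Playground link. `base` is a placeholder for your own host."""
    url = base.rstrip("/") + "/playground"
    if skill:
        # `skill` is the only query parameter documented above.
        url += "?" + urlencode({"skill": skill})
    return url


# Hypothetical host, used purely for illustration.
print(playground_url("https://app.example.com", skill="code-review"))
```

Using `urlencode` rather than string concatenation keeps the link valid if a skill name ever contains characters that need escaping.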
Three Modes
The Playground runs in one of three modes; pick the one that matches what you’re iterating on.
- Import from Trace
- Skill Testing
- Scratch Pad

In Import from Trace mode, you re-run a production trace with modifications:
- Select an agent from the dropdown
- Select a trace — the system prompt and user message auto-populate
- The original output appears on the right for comparison
- Edit the system prompt or user message
- Choose a model and temperature
- Click Run (or press ⌘+Enter)
- Compare the new output against the original side-by-side
Model Selection
The Playground page supports multiple LLM providers:

| Provider | Models | API Key Env Var |
|---|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, o1-preview, o3-mini | OPENAI_API_KEY |
| Google | gemini-3.5-flash, gemini-2.5-pro | GEMINI_API_KEY |
| Anthropic | claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5 | ANTHROPIC_API_KEY |
All three providers run directly in the Playground with your own keys (BYOK) — set each provider’s key in Settings.
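If you also keep these keys in a local shell environment (for scripts that run outside the Playground), a quick lookup can show which provider a model belongs to and which keys are missing. This is a minimal local sanity check built from the table above, not a DecimalAI API. The function names are hypothetical, and the provider label "Google" for the Gemini models is an assumption inferred from `GEMINI_API_KEY`.

```python
import os
from typing import List, Mapping, Optional

# Provider table copied from the documentation above.
# "Google" as the label for the Gemini models is an assumption.
PROVIDERS = {
    "OpenAI": {
        "env_var": "OPENAI_API_KEY",
        "models": ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "o1-preview", "o3-mini"],
    },
    "Google": {
        "env_var": "GEMINI_API_KEY",
        "models": ["gemini-3.5-flash", "gemini-2.5-pro"],
    },
    "Anthropic": {
        "env_var": "ANTHROPIC_API_KEY",
        "models": ["claude-opus-4-7", "claude-sonnet-4-6", "claude-haiku-4-5"],
    },
}


def provider_for(model: str) -> str:
    """Return the provider name that lists `model` in the table."""
    for name, info in PROVIDERS.items():
        if model in info["models"]:
            return name
    raise ValueError(f"unknown model: {model}")


def missing_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List the env vars from the table that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [info["env_var"] for info in PROVIDERS.values() if not env.get(info["env_var"])]


print(provider_for("gemini-3.5-flash"))
print("Missing keys:", missing_keys())
```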
gemini-3.5-flash is the default model.
Temperature Control
Adjust the temperature slider (0.0–2.0) to control output randomness:

| Temperature | Behavior |
|---|---|
| 0.0 | Near-deterministic: repeated runs usually return the same output, though providers do not guarantee it |
| 0.3–0.5 | Focused but varied — good for code and structured outputs |
| 0.7 | Default — balanced creativity and precision |
| 1.0–2.0 | More creative — good for brainstorming and open-ended tasks |
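To build intuition for the table above, the sketch below shows how temperature is commonly applied when sampling: token scores (logits) are divided by the temperature before a softmax turns them into probabilities. This illustrates the general technique, not DecimalAI's or any provider's exact implementation. The logits are made-up numbers, and treating temperature 0 as a greedy pick of the top token is an assumption about the limiting case.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Limiting case: all probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.0, 0.3, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it across alternatives, which is why high settings suit brainstorming and low settings suit code and structured output.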
Comparing Outputs
When importing from a trace or testing a skill, the page shows a side-by-side comparison:

| Left Panel | Right Panel |
|---|---|
| Original output — what the agent produced in production | New output — what the agent produces with your modified prompt |
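For long outputs, small differences between the two panels can be hard to spot by eye. If you copy both outputs out of the Playground, a plain text diff highlights what changed. The sketch below uses Python's standard `difflib` module; it runs outside the Playground and is not a built-in feature, and the sample strings are invented for illustration.

```python
import difflib


def diff_outputs(original: str, new: str) -> str:
    """Return a unified diff between two outputs (empty string if identical)."""
    lines = difflib.unified_diff(
        original.splitlines(),
        new.splitlines(),
        fromfile="original",
        tofile="new",
        lineterm="",
    )
    return "\n".join(lines)


# Invented sample outputs, standing in for the left and right panels.
original_output = "Summary: looks fine.\nNo issues found."
new_output = "Summary: looks fine.\nFound 1 issue: unused import."
print(diff_outputs(original_output, new_output))
```

Lines prefixed with `-` appear only in the original output, and lines prefixed with `+` appear only in the new one.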
Saving Skill Changes
In Skill Testing mode, after running a modified skill body:
- If the output improves, click “Save as New Version”
- This creates a new SkillVersion with your edited body
- The version is automatically tracked in the skill’s version history
- All future activations of this skill use the updated body
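The versioning behavior described above can be pictured as an append-only history in which the newest entry is the active one. The sketch below is a conceptual model only: apart from the name SkillVersion, every field and method is hypothetical and does not reflect DecimalAI's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SkillVersion:
    # Hypothetical fields; the real SkillVersion schema is not documented here.
    number: int
    body: str


@dataclass
class Skill:
    name: str
    versions: List[SkillVersion] = field(default_factory=list)

    def save_new_version(self, body: str) -> SkillVersion:
        """Append a new version; earlier versions stay in the history untouched."""
        version = SkillVersion(number=len(self.versions) + 1, body=body)
        self.versions.append(version)
        return version

    def active_body(self) -> str:
        """Future activations use the most recently saved body."""
        return self.versions[-1].body


skill = Skill(name="code-review")
skill.save_new_version("Review the diff for bugs.")
skill.save_new_version("Review the diff for bugs and missing tests.")
print(skill.active_body())
```

Because saving never overwrites an earlier entry, the previous body remains available in the history for comparison.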
Keyboard Shortcuts
| Shortcut | Action |
|---|---|
| ⌘+Enter (Mac) / Ctrl+Enter (Win) | Run the prompt |
Workflow Examples
Debugging a Bad Output
Hand-Tuning a Skill Body
Pick a few traces
Select 3–5 recent traces from the trace dropdown to sanity-check your edit against real conversations.
Confirm it helps
Monitor the new version’s effectiveness on the Skills dashboard. To confirm the edit actually helps across a batch — not just on the handful you eyeballed — run a skillevaluation benchmark.
Testing Different Models
Next Steps
Tracing
Open any production trace in the Playground to iterate on it.
Skills
Test skill changes in the Playground before saving a new version.
Evaluations
Score Playground runs with the same evaluators used in production.