The Playground page lets you experiment with your agent’s prompts in a safe sandbox. Re-run real production traces with modified instructions, compare outputs side-by-side, and save successful edits as new skill versions.
Playground uses your own LLM API keys (BYOK — Bring Your Own Key). DecimalAI doesn’t subsidize LLM calls — you pay your provider directly. Configure keys in Settings → Credentials.
Playground vs. skillevaluation. The Playground is for interactive, single-trace exploration — re-run one trace, eyeball the side-by-side, iterate by hand. To measure whether a skill actually helps across a batch of cases (a with-skill vs. without-skill A/B that produces a SkillScore), use the skillevaluation benchmark instead — see Skills. Use the Playground to form a hypothesis; use a benchmark to prove it.

Getting Started

Navigate to the Playground page from the sidebar, or open it contextually:
  • From a trace: Click “Open in Playground” on any trace detail page
  • From a skill: Click “Test in Playground” on any skill detail page
  • Direct URL: /playground or /playground?skill=code-review
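The direct-URL form can also be generated from a script, for example when linking to the Playground from an internal tool. The following is a minimal sketch: the `/playground` path and the `skill` query parameter come from the list above, while the host `https://app.example.com` and the helper name `playground_url` are placeholders invented for illustration.

```python
from urllib.parse import urlencode


def playground_url(base_url, skill=None):
    """Build a Playground link, optionally preselecting a skill by its slug."""
    url = base_url.rstrip("/") + "/playground"
    if skill:
        # urlencode escapes any characters that are not URL-safe.
        url += "?" + urlencode({"skill": skill})
    return url


# The host below is a placeholder, not a real DecimalAI address.
print(playground_url("https://app.example.com"))
print(playground_url("https://app.example.com", skill="code-review"))
```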

Three Modes

The Playground runs in one of three modes — pick the one that matches what you’re iterating on.
Re-run a production trace with modifications:
  1. Select an agent from the dropdown
  2. Select a trace — the system prompt and user message auto-populate
  3. The original output appears on the right for comparison
  4. Edit the system prompt or user message
  5. Choose a model and temperature
  6. Click Run (or press ⌘+Enter)
  7. Compare the new output against the original side-by-side
This is the default mode — ideal for debugging unexpected outputs or testing prompt changes against real conversations.
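Conceptually, a re-run copies the trace's inputs, applies your edits, and calls the model again while leaving the original output untouched for comparison. The sketch below models that flow under stated assumptions: `TraceRun`, `rerun`, and `fake_model` are hypothetical names, and `fake_model` is a stand-in for a real provider call, not part of any DecimalAI API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TraceRun:
    """Hypothetical record of one run: the inputs plus the output they produced."""
    system_prompt: str
    user_message: str
    model: str
    temperature: float
    output: str = ""


def rerun(trace, call_model, **changes):
    """Copy the trace, apply the edits, and attach the freshly generated output."""
    edited = replace(trace, **changes)
    return replace(edited, output=call_model(edited))


def fake_model(run):
    # Stand-in for a provider call so the sketch runs without an API key.
    return f"[{run.model} @ {run.temperature}] {run.system_prompt}"


original = TraceRun("Be brief.", "Summarize the incident.", "gpt-4o", 0.7,
                    output="(production output)")
new = rerun(original, fake_model, system_prompt="Be brief. Use bullet points.")

print("original:", original.output)
print("new:     ", new.output)
```

Because the record is frozen, the original trace cannot be modified by accident, which mirrors the way the original output stays visible on the right while you iterate.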

Model Selection

The Playground page supports multiple LLM providers:
Provider | Models | API Key Env Var
OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, o1-preview, o3-mini | OPENAI_API_KEY
Google | gemini-3.5-flash, gemini-2.5-pro | GEMINI_API_KEY
Anthropic | claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5 | ANTHROPIC_API_KEY
Select a provider and model from the dropdowns. If no API key is configured for the selected provider, you’ll see a friendly error with a link to Settings.
All three providers run directly in the Playground with your own keys (BYOK) — set each provider’s key in Settings. gemini-3.5-flash is the default model.
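If you also keep these keys in a local environment (for example, to run the same prompts from a script), a quick check can show which providers are missing a key. This is a sketch only: the variable names are copied from the table above, but reading them from the process environment is an assumption made for illustration, and `missing_keys` is a hypothetical helper. The Playground itself reads keys from Settings → Credentials.

```python
import os

# Provider -> environment variable name, copied from the table above.
PROVIDER_KEY_ENV = {
    "OpenAI": "OPENAI_API_KEY",
    "Google": "GEMINI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
}


def missing_keys(env=None):
    """Return the providers whose API key is unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEY_ENV.items() if not env.get(var)]


# Example with an explicit mapping instead of the real environment.
print(missing_keys({"OPENAI_API_KEY": "sk-test"}))  # ['Google', 'Anthropic']
```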

Temperature Control

Adjust the temperature slider (0.0–2.0) to control output randomness:
Temperature | Behavior
0.0 | Near-deterministic: repeated runs usually produce the same output
0.3–0.5 | Focused but varied: good for code and structured outputs
0.7 | Default: balanced creativity and precision
1.0–2.0 | More creative: good for brainstorming and open-ended tasks
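To see why lower temperatures give more repeatable output, it helps to look at the standard temperature-scaled softmax used when sampling tokens. The sketch below is the textbook formulation, not a description of how any particular provider implements sampling; the logits are made-up scores for three candidate tokens, and treating temperature 0 as a pure arg-max is a simplifying assumption.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 is treated as greedy decoding,
        # putting all probability on the highest-scoring token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.0, 0.3, 0.7, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 2) for p in probs])
```

As the temperature rises, probability shifts from the top-scoring token toward the alternatives, which is why higher settings produce more varied output.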

Comparing Outputs

When importing from a trace or testing a skill, the page shows a side-by-side comparison:
Left Panel | Right Panel
Original output: what the agent produced in production | New output: what the agent produces with your modified prompt
This makes it easy to spot differences and judge whether your changes improved the output.
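For long outputs, a line-level diff can supplement the visual comparison. The sketch below uses Python's standard `difflib` on two outputs you have copied out of the Playground; the sample strings and the helper name `output_diff` are invented for illustration, and this is not a feature of the Playground itself.

```python
import difflib


def output_diff(original, new):
    """Return a unified diff between the production output and the re-run output."""
    return "\n".join(
        difflib.unified_diff(
            original.splitlines(),
            new.splitlines(),
            fromfile="original",
            tofile="new",
            lineterm="",
        )
    )


# Made-up outputs standing in for the left and right panels.
original = "Summary:\nThe function is fine."
new = "Summary:\nThe function leaks a file handle."
print(output_diff(original, new))
```

Lines prefixed with `-` appear only in the original output and lines prefixed with `+` appear only in the new one; identical outputs produce an empty diff.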

Saving Skill Changes

In Skill Testing mode, after running a modified skill body:
  1. If the output improves, click “Save as New Version”
  2. This creates a new SkillVersion with your edited body
  3. The version is automatically tracked in the skill’s version history
  4. All future activations of this skill use the updated body
Saving overwrites the skill’s current body. The previous version is preserved in the version history, so you can always revert from the Skills page.
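The save-and-revert behavior can be pictured as an append-only history in which the newest entry is the active body. The sketch below is a hypothetical in-memory model, not DecimalAI's actual data model: the `Skill` class, its methods, and the choice to implement revert as re-saving an older body are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical in-memory model of a skill and its version history."""
    name: str
    versions: list = field(default_factory=list)  # oldest first

    @property
    def current_body(self):
        """The body that future activations would use (the newest version)."""
        return self.versions[-1] if self.versions else None

    def save_new_version(self, body):
        """Append the edited body and return its 1-based version number."""
        self.versions.append(body)
        return len(self.versions)

    def revert_to(self, version_number):
        """Assumption: reverting re-saves an earlier body as the newest version."""
        return self.save_new_version(self.versions[version_number - 1])


skill = Skill("code-review")
skill.save_new_version("Review the diff for bugs.")
skill.save_new_version("Review the diff for bugs and missing tests.")
skill.revert_to(1)

print(skill.current_body)   # Review the diff for bugs.
print(len(skill.versions))  # 3
```

Nothing is deleted in this model: every saved body stays in the history, so a revert is itself recorded as a new version.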

Keyboard Shortcuts

Shortcut | Action
⌘+Enter (Mac) / Ctrl+Enter (Win) | Run the prompt

Workflow Examples

Debugging a Bad Output

  1. Find the trace: Find a trace with a poor output on the Traces page.
  2. Open in Playground: Click “Open in Playground” on the trace detail page.
  3. Refine the prompt: Edit the system prompt to add more specific instructions.
  4. Iterate: Run, compare, and repeat until the output improves.
  5. Apply the fix: Apply the improved prompt to your agent’s configuration.

Hand-Tuning a Skill Body

  1. Open the skill: Open a skill and click “Test in Playground”.
  2. Pick a few traces: Select 3–5 recent traces from the trace dropdown to sanity-check your edit against real conversations.
  3. Edit and compare: For each trace, edit the skill body, run it, and compare the outputs side-by-side.
  4. Save a new version: When satisfied, click “Save as New Version”.
  5. Confirm it helps: Monitor the new version’s effectiveness on the Skills dashboard. To confirm the edit helps across a batch, not just on the handful of traces you eyeballed, run a skillevaluation benchmark.

Testing Different Models

  1. Set up the prompt: Import a trace or write a scratch prompt.
  2. Run the first model: Run with gpt-4o and note the output.
  3. Run a second model: Switch to claude-sonnet-4-6 and run again.
  4. Choose the best fit: Compare the outputs to choose the best model for your use case.

Next Steps

  • Tracing: Open any production trace in the playground to iterate on it.
  • Skills: Test skill changes in the playground before saving a new version.
  • Evaluations: Score playground runs with the same evaluators used in production.