Datasets API - DecimalAI

A dataset is a curated collection of training examples built from filtered production traces. Filtering on both manifest compatibility and eval scores keeps the training data current (recorded against the active agent config) and high-quality (passed evaluation). Each dataset version locks the manifest and filter set used to build it, so builds are reproducible.

Lifecycle

Common patterns

Build from keep + pass traces

POST /datasets/{id}/build with allowed_verdicts=["keep"] and min_eval_score=0.7 is the canonical SFT recipe.
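As a rough illustration of what those two filters select, the sketch below applies the same rule to in-memory records. The record shape (`trace_id`, `verdict`, `eval_score`), the `"discard"` verdict value, and the inclusive threshold are all assumptions made for the example; the real filtering runs server-side and its exact semantics are defined by the API.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical record shape, not the actual DecimalAI trace schema.
    trace_id: str
    verdict: str
    eval_score: float


def select_for_sft(traces, allowed_verdicts=("keep",), min_eval_score=0.7):
    """Keep traces whose verdict is allowed and whose score meets the threshold.

    The comparison is inclusive (>=); that is an assumption of this sketch.
    """
    allowed = set(allowed_verdicts)
    return [
        t for t in traces
        if t.verdict in allowed and t.eval_score >= min_eval_score
    ]


traces = [
    Trace("t1", "keep", 0.91),     # allowed verdict, score above threshold
    Trace("t2", "keep", 0.55),     # allowed verdict, score too low
    Trace("t3", "discard", 0.98),  # high score, but verdict not allowed
]
selected = select_for_sft(traces)
print([t.trace_id for t in selected])
```

Only the first record satisfies both conditions, which is the point of combining the verdict and score filters.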

Export to JSONL or Parquet

GET /datasets/{id}/versions/{v}/export?format=jsonl returns the rows as JSONL, ready for OpenAI fine-tuning or HuggingFace tooling.
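A minimal sketch of calling the export endpoint and streaming the result to disk. It assumes the endpoint lives under the same base URL and uses the same bearer-token header as the build call in the quick start, and that the response body is the exported file itself. The helper names `export_url` and `download_export` are hypothetical, and any `format` value other than `jsonl` is a guess.

```python
from pathlib import Path
from urllib.parse import quote

# Base URL taken from the quick start; assumed to apply to export as well.
BASE_URL = "https://api.decimal.ai/api/v1"


def export_url(dataset_id: str, version: str) -> str:
    """Build the export URL for one dataset version (hypothetical helper)."""
    ds = quote(str(dataset_id), safe="")
    ver = quote(str(version), safe="")
    return f"{BASE_URL}/datasets/{ds}/versions/{ver}/export"


def download_export(dataset_id, version, dest, api_key, fmt="jsonl"):
    """Stream an export to `dest` and return its path.

    Assumes the response body is the file content; adjust if the API
    returns a download link or a JSON envelope instead.
    """
    import httpx  # imported lazily so export_url works without httpx installed

    headers = {"Authorization": f"Bearer {api_key}"}
    with httpx.stream(
        "GET", export_url(dataset_id, version), headers=headers, params={"format": fmt}
    ) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in resp.iter_bytes():
                f.write(chunk)
    return Path(dest)


# Example call (not executed here):
# download_export("ds_abc123", version_id, "train.jsonl", api_key="dai_sk_...")
```

Streaming avoids holding a large export in memory before it is written.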

Pull as HuggingFace Dataset

decimalai.load_hf_dataset(...) returns a materialized datasets.Dataset object that plugs into open-source trainers such as TRL, Axolotl, and Unsloth. To download to a local JSONL/Parquet file instead, use decimalai.pull_dataset(dataset_id, path).

Compare versions

Versions are immutable. To see what changed between v1 and v2, fetch both and compare their row counts and source breakdowns.
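The comparison can be sketched as a small diff over two version summaries. The summary shape here is an assumption: `row_count` mirrors the key returned by `pull_dataset` in the quick start, while `source_breakdown` and the source names (`production`, `repair`, `replay`) are hypothetical placeholders for whatever the version endpoint actually returns.

```python
def diff_versions(old: dict, new: dict) -> dict:
    """Return the row-count delta and per-source deltas between two versions.

    Both arguments are assumed to look like
    {"row_count": int, "source_breakdown": {source_name: int}}.
    """
    old_sources = old.get("source_breakdown", {})
    new_sources = new.get("source_breakdown", {})

    source_deltas = {}
    for name in sorted(set(old_sources) | set(new_sources)):
        delta = new_sources.get(name, 0) - old_sources.get(name, 0)
        if delta:  # report only sources whose counts changed
            source_deltas[name] = delta

    return {
        "row_count_delta": new["row_count"] - old["row_count"],
        "source_deltas": source_deltas,
    }


# Made-up numbers, for illustration only.
v1 = {"row_count": 1200, "source_breakdown": {"production": 1000, "repair": 200}}
v2 = {
    "row_count": 1500,
    "source_breakdown": {"production": 1150, "repair": 200, "replay": 150},
}
diff = diff_versions(v1, v2)
print(diff)
```

Because versions are immutable, a diff computed once stays valid and can be stored next to the training run that used it.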

Formats

Format	Full Name	Use Case
SFT	Supervised Fine-Tuning	One input→output row per LLM call. Imitation learning. Most common.
DPO	Direct Preference Optimization	One row per (input, chosen, rejected) triple. Replay-driven: when v2 outperforms v1 on the same input, the v2 output becomes chosen and the v1 output rejected.
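To make the two row shapes concrete, here is a small sketch that builds one row of each kind. The key names follow the wording of the table (`input`, `output`, `chosen`, `rejected`), but the exported schema may differ. Using numeric scores to decide that v2 "outperforms" v1, and skipping the pair otherwise, are assumptions of this example.

```python
import json


def sft_row(prompt: str, completion: str) -> dict:
    """One imitation-learning example: a single input mapped to one output."""
    return {"input": prompt, "output": completion}


def dpo_row(prompt, v1_output, v2_output, v1_score, v2_score):
    """Build a preference triple when v2 beats v1 on the same input.

    Returns None when v2 does not score strictly higher; how the real
    pipeline treats ties or regressions is not specified here.
    """
    if v2_score <= v1_score:
        return None
    return {"input": prompt, "chosen": v2_output, "rejected": v1_output}


prompt = "Summarize the ticket in one sentence."
sft = sft_row(prompt, "Customer cannot reset their password.")
dpo = dpo_row(
    prompt,
    v1_output="The customer has a problem.",
    v2_output="Customer cannot reset their password.",
    v1_score=0.4,
    v2_score=0.9,
)

# Each row serializes to one JSONL line.
print(json.dumps(sft))
print(json.dumps(dpo))
```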

Quick start

import decimalai
import httpx

# Build a new SFT version
resp = httpx.post(
    "https://api.decimal.ai/api/v1/datasets/ds_abc123/build",
    headers={"Authorization": "Bearer dai_sk_..."},
    json={"allowed_verdicts": ["keep"], "min_eval_score": 0.7},
)
resp.raise_for_status()  # fail fast on auth or validation errors
version_id = resp.json()["version_id"]

# Load as a HuggingFace Dataset (in memory; plugs straight into TRL/Axolotl/Unsloth)
decimalai.init()  # reads DECIMAL_API_KEY; or init(api_key="dai_sk_...")
ds = decimalai.load_hf_dataset("ds_abc123", version=version_id)
print(f"{len(ds)} rows ready for training")

# Or download to a local file instead; this returns a summary dict, not a Dataset
result = decimalai.pull_dataset("ds_abc123", "./train.jsonl", version=version_id)
print(f"Wrote {result['row_count']} rows to {result['file_path']}")

Related

Datasets Guide: filter strategies and recipes
Training Pipeline Tutorial: end-to-end, trace → eval → fine-tune
Skills & Data Pipeline: SFT vs DPO, repair, replay