The datasets API materializes a versioned DecimalAI dataset into a form your training stack can consume. Three patterns, in increasing convenience:
  1. pull_dataset — writes a local file (JSONL or Parquet). Works with any tool.
  2. push_to_hub — pushes to HuggingFace Hub. Instantly loadable by Axolotl, Unsloth, TRL, etc.
  3. load_hf_dataset — returns a datasets.Dataset directly, no file needed.
See the Datasets guide for versioning and verdict-based filtering.

decimalai.pull_dataset()

Download a versioned dataset to a local file.
import decimalai

# Pull the latest version as JSONL
result = decimalai.pull_dataset("ds_abc123", "./training_data.jsonl")
print(f"Wrote {result['row_count']} rows to {result['file_path']}")

# Pull specific version as Parquet
result = decimalai.pull_dataset("ds_abc123", "./data.parquet", version="v2")
dataset_id (str, required): The dataset ID.
path (str, required): Local file path. The format is inferred from the extension (.jsonl or .parquet).
version (str, default: "latest"): Version specifier. Accepts None/"latest", "v3"/"3", or a full UUID.
format (str, default: "auto"): Output format override: "jsonl" or "parquet". By default the format is auto-detected from the file extension.
Returns: {"row_count": 500, "file_path": "./data.jsonl", "bytes_written": 12345, "format": "jsonl"}
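The interaction between path and format described above can be sketched in a few lines. This is only an illustration of the documented rule, not DecimalAI's implementation: resolve_format is a hypothetical helper, and raising ValueError for an unknown extension or format is an assumption, since the source does not say what happens in that case.

```python
from pathlib import Path

# Hypothetical helper: mirrors the documented rule, not the library's real code.
_EXTENSION_TO_FORMAT = {".jsonl": "jsonl", ".parquet": "parquet"}


def resolve_format(path: str, format: str = "auto") -> str:
    """Return "jsonl" or "parquet" for an output path and a format argument."""
    if format != "auto":
        # An explicit format overrides whatever the extension suggests.
        if format not in _EXTENSION_TO_FORMAT.values():
            raise ValueError(f"unsupported format: {format!r}")
        return format
    suffix = Path(path).suffix.lower()
    if suffix not in _EXTENSION_TO_FORMAT:
        # Assumption: an unrecognized extension is treated as an error.
        raise ValueError(f"cannot infer format from extension {suffix!r}")
    return _EXTENSION_TO_FORMAT[suffix]


print(resolve_format("./training_data.jsonl"))           # jsonl
print(resolve_format("./export.dat", format="parquet"))  # parquet
```

Under this reading, passing format explicitly is only needed when the file name does not end in .jsonl or .parquet.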

decimalai.push_to_hub()

Push a dataset to HuggingFace Hub. Makes the dataset instantly loadable by Axolotl, Unsloth, TRL, and any tool supporting load_dataset().
import decimalai

result = decimalai.push_to_hub(
    "ds_abc123",
    "my-org/support-agent-sft",
    version="latest",
)
print(f"Pushed to {result['repo_url']}")

# Now usable in training:
# from datasets import load_dataset
# ds = load_dataset("my-org/support-agent-sft")
dataset_id (str, required): The DecimalAI dataset ID.
repo_id (str, required): HuggingFace repo in "org/dataset-name" format.
version (str, default: "latest"): Version specifier. Accepts None/"latest", "v3"/"3", or a UUID.
token (str, default: HF_TOKEN env var): HuggingFace API token. Falls back to the HF_TOKEN environment variable or a cached login.
private (bool, default: True): Create a private repo.
split (str, default: "train"): Dataset split name.
Returns: {"repo_url": "...", "repo_id": "...", "row_count": 500, "version_id": "...", "split": "train"}
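The token fallback order can be read as a simple precedence chain. The sketch below is an assumption about that order, not the library's code: resolve_hf_token is a hypothetical helper, and returning None stands for "leave it to the cached login".

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Pick a token: explicit argument first, then the HF_TOKEN variable.

    Returns None when neither is set, which here stands for deferring to a
    cached login (an assumption about how the fallback is ordered).
    """
    env = os.environ if env is None else env
    if token:
        return token
    return env.get("HF_TOKEN") or None


# The explicit argument wins over the environment variable.
print(resolve_hf_token("hf_explicit", env={"HF_TOKEN": "hf_from_env"}))  # hf_explicit
print(resolve_hf_token(None, env={"HF_TOKEN": "hf_from_env"}))           # hf_from_env
```

In CI, setting HF_TOKEN in the environment keeps the token out of source code while still letting a caller override it per call.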
Requires pip install huggingface_hub datasets. These are optional dependencies.
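Because these packages are optional, it can help to check for them before calling push_to_hub(). The following is a minimal sketch using only the standard library; missing_optional_packages is a hypothetical helper name, and the package list is taken from the pip install line above.

```python
import importlib.util
from typing import Iterable, List

# Import names for the optional dependencies named in the install command.
OPTIONAL_PACKAGES = ("huggingface_hub", "datasets")


def missing_optional_packages(packages: Iterable[str] = OPTIONAL_PACKAGES) -> List[str]:
    """Return the packages from `packages` that cannot be imported."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = missing_optional_packages()
if missing:
    print("Missing optional dependencies: pip install " + " ".join(missing))
else:
    print("Optional dependencies are available.")
```

The check only looks the packages up; it does not import them, so it stays cheap to run at startup.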

decimalai.load_hf_dataset()

Load a dataset directly as a HuggingFace Dataset object — no intermediate file needed.
import decimalai

ds = decimalai.load_hf_dataset("ds_abc123", version="v2")
# Dataset({features: ['messages'], num_rows: 500})

# Plug directly into TRL (`model` is assumed to be defined earlier)
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    # ... remaining trainer arguments
)
dataset_id (str, required): The DecimalAI dataset ID.
version (str, default: "latest"): Version specifier.
Returns: A datasets.Dataset object.
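The version specifier forms listed for pull_dataset() and push_to_hub() (None/"latest", "v3"/"3", or a UUID) can be told apart mechanically. The sketch below shows one way to do that; classify_version is a hypothetical helper, DecimalAI's own parsing may differ, and the source does not spell out whether load_hf_dataset() accepts exactly the same forms.

```python
import re
import uuid
from typing import Optional, Tuple

# Matches "3" or "v3"; illustration only, not the library's parser.
_NUMBERED = re.compile(r"v?(\d+)")


def classify_version(version: Optional[str]) -> Tuple[str, Optional[str]]:
    """Classify a version specifier as ("latest" | "number" | "uuid", value)."""
    if version is None or version == "latest":
        return ("latest", None)
    match = _NUMBERED.fullmatch(version)
    if match:
        return ("number", match.group(1))
    try:
        # uuid.UUID raises ValueError for strings that are not valid UUIDs.
        return ("uuid", str(uuid.UUID(version)))
    except ValueError:
        raise ValueError(f"unrecognized version specifier: {version!r}") from None


print(classify_version(None))   # ('latest', None)
print(classify_version("v3"))   # ('number', '3')
print(classify_version("3"))    # ('number', '3')
```

Pinning a numbered version or UUID instead of "latest" keeps a training run reproducible when the dataset later gains new versions.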

What’s next

Datasets guide

Verdict filtering, versioning, and how to build training-ready splits.

Training tutorial

End-to-end SFT recipe using a DecimalAI dataset.