Introducing DI-0

DI-0 is our first robotics data library: paired human demonstrations and teleoperation traces with synchronized vision, language, and action. It is built so teams can train and fine-tune advanced vision-language-action (VLA) foundation models for manipulation, mobile robots, and humanoids without stitching together ad hoc exports.
What is DI-0?
A curated multimodal dataset—not a weights release. Episodes combine what cameras see, what human demonstrators or teleoperators do, and how robot state evolves over time. That alignment is what modern VLAs need to connect natural-language intent to contact-rich behavior in the real world.
What the library includes
DI-0 is structured for embodied pretraining and fine-tuning (a hypothetical episode schema is sketched after this list):
- Human demonstration segments: synchronized video or depth, proprioception, gripper or hand state, and time-aligned language intent where available.
- Teleoperation data: expert pilots driving arms, mobile manipulators, or humanoids—including recoveries and corrections you rarely capture in scripted simulation-only corpora.
- Embodiment-aware tracks: calibrations and action formats that map industrial arms, AMRs with manipulators, and bipedal stacks onto common training interfaces.
- Operational metadata: scene and skill tags, success and failure boundaries, consent and retention flags—so teams know which slices belong in public research mixes versus locked-down fine-tunes.
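To make the episode structure concrete, here is a minimal schema sketch in Python. It assumes hypothetical names and shapes (`Episode`, `LanguageSpan`, `rgb`, `proprioception`, and the retention flag values are illustrative, not DI-0's actual record format):

```python
from dataclasses import dataclass, field
from typing import Optional

import numpy as np


@dataclass
class LanguageSpan:
    """A sparse language annotation tied to a temporal segment of the episode."""
    text: str        # e.g. "pick up the blue tote and place it on the cart"
    start_s: float   # segment start, in seconds from episode start
    end_s: float     # segment end, in seconds from episode start


@dataclass
class Episode:
    """One demonstration or teleoperation episode with time-aligned streams."""
    episode_id: str
    embodiment: str                # e.g. "industrial_arm", "amr_manipulator", "humanoid"
    rgb: np.ndarray                # (T, H, W, 3) uint8 camera frames
    depth: Optional[np.ndarray]    # (T, H, W) float32 depth, if recorded
    proprioception: np.ndarray     # (T, D) joint positions and velocities
    gripper_state: np.ndarray      # (T, G) gripper or hand state
    actions: np.ndarray            # (T, A) commanded actions in a shared format
    language: list[LanguageSpan] = field(default_factory=list)
    tags: dict[str, str] = field(default_factory=dict)  # scene, skill, success/failure labels
    retention: str = "research_ok"  # consent and retention flag used for slicing
```

The point of the sketch is the alignment: every stream shares the same time axis T, which is what lets language spans and success boundaries index into specific moments of contact.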
How teams use DI-0
Typical workflows look like this:
- Pretraining mix-ins: add DI-0 slices to large VLA pools to improve contact, clutter, and language grounding before site-specific tuning (see the sampling sketch after this list).
- Private fine-tuning: combine DI-0 with your own Ground-Log exports to close the sim-to-real gap without leaking sensitive facility detail.
- Humanoid bootstrapping: lean on teleoperation-heavy trajectories to teach whole-body coordination before closed-loop policies run on hardware.
- Shared benchmarks: evaluate checkpoints on held-out DI-0 splits so labs compare architectures in one embodied metric space—not only internet video baselines.
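One way to implement a pretraining mix-in is weighted sampling across source pools. The sketch below is a generic approach under assumed names (`mixed_episode_stream`, `web_episodes`, `di0_episodes`, and the 0.8/0.2 weights are illustrative, not DI-0 tooling):

```python
import random
from typing import Iterator


def mixed_episode_stream(pools: dict[str, list], weights: dict[str, float],
                         seed: int = 0) -> Iterator:
    """Yield episodes by sampling each source pool in proportion to its mix weight."""
    rng = random.Random(seed)
    names = list(pools)
    probs = [weights[n] for n in names]
    while True:
        # Pick a source pool first, then an episode from it, so small pools
        # still contribute at their intended rate regardless of their size.
        name = rng.choices(names, weights=probs, k=1)[0]
        yield rng.choice(pools[name])


# Example: blend a web-scale VLA pool with a DI-0 manipulation slice at 80/20.
# stream = mixed_episode_stream(
#     pools={"web_vla": web_episodes, "di0_manip": di0_episodes},
#     weights={"web_vla": 0.8, "di0_manip": 0.2},
# )
```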
Curation and access
Useful physical AI data is as much about filtering as it is about collection:
- Quality gates: remove degenerate episodes, sensor dropouts, and redundant traversals so every hour on disk teaches something new.
- Deduplication: embedding- and trajectory-space dedup to prevent overfitting to the same aisle or workcell (see the sketch after this list).
- Label contracts: sparse language spans tied to temporal segments instead of vague file-level captions.
- Licensed access: partner and academic programs with audit trails, aligned with how we expand into larger DI-1 releases.
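For the deduplication step, embedding-space dedup can be as simple as a greedy cosine-similarity filter over per-episode embeddings. The sketch below assumes embeddings are already computed; the helper name and the 0.95 threshold are illustrative, not part of DI-0's actual pipeline:

```python
import numpy as np


def dedup_by_embedding(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate filter: keep an episode only if its embedding is not
    too similar (by cosine) to any episode already kept."""
    # Normalize rows so a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-8, None)
    kept: list[int] = []
    for i in range(len(unit)):
        if all(float(unit[i] @ unit[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

Trajectory-space dedup follows the same pattern, with embeddings replaced by downsampled state-action traces and a distance suited to sequences (for example, dynamic time warping).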
Who it is for
DI-0 supports teams who need robotics-native corpora:
- Research labs training open or proprietary VLAs for dexterous manipulation and human-scale robots.
- Manufacturers and 3PL pilot programs validating policies before committing to fleet-wide OTA rollouts.
- Partners benchmarking new model families against shared embodied data—not only scraped web video.
- Humanoid programs that require rich teleoperation and demonstration signal beyond static pose datasets.
Roadmap
DI-0 is the first slice of a growing program: more sites, embodiments, languages, and edge cases—and tighter coupling to Bench-Fabric regression suites so library improvements show up directly in promotion gates.
Pair DI-0 with Dynamic intelligence capture products—Ground-Log for governed multimodal truth, Fleet-Tape for production replay, and Bench-Fabric for scenario gates—so fine-tunes stay anchored to the floors where robots actually work.