Why Real Robot Data Beats Simulation

The single biggest bottleneck in deploying learned robot manipulation policies is not model architecture — it is data. Simulation has improved dramatically with tools like Isaac Sim, MuJoCo, and SAPIEN, but the sim-to-real gap remains stubbornly wide for contact-rich tasks. Here is why real-world data remains essential in 2026.

The Sim-to-Real Gap Is Not Closing Fast Enough

Objects deform, slip, and interact in ways that even the best physics engines approximate poorly. A simulated cloth behaves nothing like a real cotton towel. A simulated cardboard box does not buckle the same way a real one does under lateral pressure. Lighting, surface textures, camera noise, and lens distortion in real environments create visual distributions that synthetic rendering cannot fully replicate, even with neural rendering techniques.

The numbers tell the story. In 2023, Google DeepMind's RT-2 achieved 62% success on novel object manipulation using 130,000 real-world demonstrations. Toyota Research Institute's Diffusion Policy reached 94% success on trained tasks with just 200 demonstrations — but required those demonstrations to be high-quality, consistent, and collected on the exact hardware configuration used for deployment. The lesson: for contact-rich manipulation, real-world data is not optional.

When You Need Real Data

Real-world data collection is justified when your task involves any of the following:

  • Contact-rich manipulation: Insertion, threading, assembly, peg-in-hole, and any task where force feedback matters. Sim contact models are not accurate enough for sub-millimeter tolerance tasks.
  • Deformable objects: Cloth, cables, food items, bags. Deformable simulation is improving but still produces qualitatively different behaviors than real materials.
  • Visual diversity: Tasks where the policy must generalize across real-world lighting conditions, object textures, backgrounds, and camera viewpoints that synthetic rendering cannot fully cover.
  • Hardware-specific deployment: When your policy will run on a specific robot arm with specific cameras, collecting data on that exact setup eliminates the embodiment gap entirely.
  • Human-robot interaction: Any task involving handoffs, collaborative assembly, or shared workspace with humans. Human behavior cannot be realistically simulated.

Domain Randomization Has Limits

Domain randomization — varying textures, lighting, physics parameters, and camera poses in simulation — helps bridge the sim-to-real gap, but it has well-documented limits. Randomizing physics parameters over a wide range produces unrealistic behaviors that confuse the policy. Randomizing visual properties without matching real-world distributions creates a "reality gap in reverse" where the policy expects visual noise that does not exist. For most practical manipulation tasks in 2026, the most efficient path is: small-scale real data collection with a managed service (like SVRC) to validate your approach, then scale with simulation augmentation where it helps.

The 4 Teleoperation Methods Compared

Teleoperation — a human operator remotely controlling a robot to demonstrate desired behaviors — is the dominant method for collecting robot manipulation data. The choice of teleoperation interface directly affects data quality, collection speed, operator fatigue, and cost. See our dedicated teleoperation guide for deeper technical detail on each method.

Method 1: Leader-Follower Arms (ACT / ALOHA Style)

How it works: The operator physically moves a lightweight "leader" arm; a heavier "follower" arm replicates the leader's joint positions in real time over a direct USB Dynamixel bus with 3-8 ms latency. This is the approach used by ALOHA, Mobile ALOHA, and the majority of university research setups in 2026.

Why it produces the best data: The operator has proprioceptive feedback through the physical leader arm — they can feel resistance, near-singularity stiffness, and table contact. This makes demonstrations smoother, more consistent, and more natural than any screen-based interface. With gravity compensation enabled (Dynamixel current limits at 30-50% rated torque), operators can work 2-3 hour sessions without significant fatigue.
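
The mirroring loop itself is simple. Below is a minimal sketch using the ROBOTIS dynamixel_sdk, assuming X-series servos (Present Position and Goal Position live at registers 132 and 116 in the X-series control table); the port names, servo IDs, and loop rate are placeholders, not a reference implementation:

    import time
    from dynamixel_sdk import PortHandler, PacketHandler

    ADDR_PRESENT_POSITION = 132    # X-series control table
    ADDR_GOAL_POSITION = 116
    JOINT_IDS = [1, 2, 3, 4, 5, 6, 7]

    def open_bus(port_name, baud=1_000_000):
        port = PortHandler(port_name)
        port.openPort()
        port.setBaudRate(baud)
        return port

    leader = open_bus("/dev/ttyUSB0")      # placeholder ports
    follower = open_bus("/dev/ttyUSB1")
    packet = PacketHandler(2.0)            # Protocol 2.0

    while True:
        for dxl_id in JOINT_IDS:
            pos, _, _ = packet.read4ByteTxRx(leader, dxl_id, ADDR_PRESENT_POSITION)
            packet.write4ByteTxRx(follower, dxl_id, ADDR_GOAL_POSITION, pos)
        time.sleep(0.002)                  # ~200 Hz polling (placeholder rate)

Production setups use GroupSyncRead/GroupSyncWrite to read all joints in one bus transaction, which is how the 3-8 ms latency figure is achieved.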

Hardware cost: $3,000-$8,000 for the leader arm, plus the follower/production arm. Total per-station cost including cameras and compute: $15,000-$35,000.

Throughput: 20-35 demonstrations per hour for tabletop manipulation tasks. The fastest method for contact-rich tasks because the proprioceptive feedback reduces failed attempts.

Best for: Serious data collection campaigns, contact-rich manipulation (insertion, assembly), bimanual tasks, any scenario where data quality is the priority.

Method 2: VR Headset (Meta Quest 3)

How it works: The operator wears a VR headset and sees the robot's camera feed or a mixed-reality view. Hand controller positions are mapped to end-effector Cartesian positions via inverse kinematics running on the workstation.
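
A sketch of the per-frame control step, with solve_ik and the robot object as hypothetical stand-ins for your IK solver and arm driver (headset pose streaming varies by runtime and is omitted here):

    import numpy as np

    def vr_step(robot, controller_pos, controller_quat, vr_to_base):
        # vr_to_base: 4x4 calibration transform captured at session start,
        # mapping VR tracking space into the robot base frame.
        target_pos = vr_to_base[:3, :3] @ controller_pos + vr_to_base[:3, 3]
        # (The orientation should be rotated into the base frame too; omitted.)
        q = solve_ik(robot.model, target_pos, controller_quat,  # hypothetical IK call
                     seed=robot.joint_positions())
        robot.command_joint_positions(q)                        # hypothetical arm API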

Latency: 15-40 ms end-to-end (WiFi + IK computation + arm command). Adequate for most manipulation tasks but noticeable for precision work.

Hardware cost: $500 for the headset. No additional hardware beyond the robot arm, cameras, and compute.

Throughput: 15-25 demonstrations per hour for tabletop manipulation. Slightly lower than leader-follower because the operator lacks proprioceptive feedback and makes more failed attempts on precision tasks.

Operator fatigue: VR headsets cause nausea in some operators after 60-90 minutes. Plan for 15-minute breaks every hour. Maximum practical session: 3-4 hours per day.

Best for: Teams on a budget, tasks that do not require sub-millimeter precision, rapid prototyping of data collection pipelines, gross manipulation (picking, placing, sorting).

Method 3: SpaceMouse / Keyboard

How it works: The operator uses a 6-axis SpaceMouse ($200) or dual-analog gamepad to control end-effector velocity or joint velocities. The operator watches the robot through a live camera feed on a monitor.
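
A minimal velocity-teleop sketch using the pyspacemouse package; the axis scaling constants and the robot.command_ee_velocity call are assumptions, not part of any specific driver:

    import pyspacemouse

    LIN_SCALE = 0.10    # m/s at full deflection (assumption)
    ANG_SCALE = 0.50    # rad/s at full deflection (assumption)

    if pyspacemouse.open():
        while True:
            s = pyspacemouse.read()    # axes normalized to roughly [-1, 1]
            linear = (s.x * LIN_SCALE, s.y * LIN_SCALE, s.z * LIN_SCALE)
            angular = (s.roll * ANG_SCALE, s.pitch * ANG_SCALE, s.yaw * ANG_SCALE)
            robot.command_ee_velocity(linear, angular)   # hypothetical arm API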

Latency: 1-5 ms (direct USB HID). The lowest latency of any method, but the limited control bandwidth more than offsets this advantage.

Throughput: 5-12 demonstrations per hour. The slowest method because the operator cannot directly specify 6-DOF poses, leading to jerky trajectories and suboptimal paths.

Data quality: Lower than all other methods. Demonstrations are less consistent between operators because each develops their own joystick control strategy. Trajectories contain more noise and hesitation.

Best for: Quick prototyping when no other interface is available, mobile robot navigation data, tasks where arm trajectory smoothness is not critical.

Method 4: Haptic Gloves / Exoskeleton

How it works: The operator wears a hand exoskeleton or haptic glove (SenseGlove Nova 2 at $8,000/pair or HaptX G1 at $20,000/pair) that tracks finger joint angles and maps them to a dexterous robot hand. Force feedback enables the operator to feel object contact and adjust grip force in real time.

Throughput: 8-15 demonstrations per hour. Dexterous tasks are inherently slower because they involve more complex sequences (multi-finger grasp, in-hand rotation, placement).

Operator fatigue: High. Gloves are physically demanding. 45-minute session limit with 10-minute breaks. Total daily session limit: 4-5 hours.

Best for: Dexterous manipulation research, multi-finger grasping, in-hand rotation, tool use, garment manipulation, any task where finger-level control is essential.

Comparison Table

Method          | Precision        | Operator Learning Curve | Setup Cost | Demos/Hour | Ideal Task Types
Leader-Follower | Highest          | 1-2 hours               | $15K-$35K  | 20-35      | Contact-rich, insertion, bimanual, assembly
VR / Quest 3    | Good             | 30 min - 1 hour         | $5K-$12K   | 15-25      | Pick-place, sorting, packing, gross manipulation
SpaceMouse      | Moderate         | 2-4 hours               | $3K-$8K    | 5-12       | Prototyping, navigation, simple manipulation
Haptic Gloves   | High (dexterous) | 3-6 hours               | $25K-$55K  | 8-15       | Multi-finger grasping, in-hand rotation, tool use

Episode Design Principles

Before collecting a single demonstration, you need to define what constitutes a good episode. Poor episode design is the most common source of wasted data collection effort. Here are the five design decisions that determine whether your dataset will train a successful policy.

Task Specification

Write a precise, unambiguous task description that an operator can execute consistently. Bad: "Pick up the object." Good: "Grasp the red cube from the left bin using a top-down pinch grasp, lift to 15 cm above the table, move to the center of the right bin, and release. The cube must land upright in the bin." The more specific the task specification, the more consistent the demonstrations, and the easier it is for the policy to learn.
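
It helps to encode the specification in machine-readable form so operators, QC scripts, and dataset metadata all share one source of truth. A minimal sketch (the field names are illustrative, not a standard schema):

    TASK_SPEC = {
        "name": "pick_red_cube_left_to_right",
        "object": "red cube, left bin",
        "grasp": "top-down pinch",
        "lift_height_cm": 15,
        "place_target": "center of right bin",
        "success": "cube lands upright in right bin",
    }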

Reset Consistency

Define the initial state of the scene precisely. What objects are on the table? Where are they positioned? What is the acceptable randomization range? How is the robot arm positioned at the start of each episode? A well-defined reset procedure should take 15-30 seconds between episodes. If your resets take longer than 60 seconds or involve ambiguous placement, you are losing throughput and introducing distribution noise that degrades training.

For tabletop tasks, create a template (printed paper with position markers, or a fixture with alignment guides) that operators use to reset the scene. Randomize object positions within a defined bounding box (typically 15-25 cm square) — document the exact randomization range.
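
Randomization is easy to make reproducible and auditable. A minimal sketch, assuming a 20 cm box within the 15-25 cm range above (the coordinates and fixed seed are illustrative):

    import numpy as np

    rng = np.random.default_rng(seed=0)    # fixed seed keeps resets auditable

    def sample_object_position(center_xy=(0.45, 0.00), box_size_m=0.20):
        # Uniform sample inside a documented square bounding box, expressed
        # in the robot base frame; the center point is a placeholder.
        half = box_size_m / 2.0
        dx, dy = rng.uniform(-half, half, size=2)
        return (center_xy[0] + dx, center_xy[1] + dy)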

Observation Space

Define exactly what the policy will observe at inference time, and record exactly those signals during collection. Typical observation spaces include:

  • Camera images: 2-4 views (overhead, wrist, side) at 480x640 resolution, 30 fps. Use Intel RealSense D405/D455 for RGB-D or standard USB cameras for RGB-only.
  • Joint positions (qpos): Float array of joint angles at 50 Hz. Include gripper aperture as the final dimension.
  • Joint velocities (qvel): Float array of joint angular velocities at 50 Hz.
  • End-effector pose: Optional 6-DOF (position + orientation) at 50 Hz. Useful for Cartesian-space policies.
  • Gripper state: Binary (open/closed) or continuous (aperture width) at 50 Hz.

Do not record signals you will not use at inference time. Extra unused observation dimensions add noise and storage cost without benefit.
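
As a concrete reference, a single timestep of the observation space above might be assembled like this (array shapes follow the list; the names are illustrative):

    import numpy as np

    obs_t = {
        "images": {
            "cam_high":  np.zeros((480, 640, 3), dtype=np.uint8),   # overhead RGB
            "cam_wrist": np.zeros((480, 640, 3), dtype=np.uint8),   # wrist RGB
        },
        "qpos": np.zeros(8, dtype=np.float32),   # 7 joints + gripper aperture
        "qvel": np.zeros(8, dtype=np.float32),   # matching joint velocities
    }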

Action Space

The action space defines what the policy predicts. For most manipulation policies in 2026, the action is one of:

  • Joint position targets: The most common choice for ACT and Diffusion Policy. Record the leader arm's joint positions as the action signal.
  • End-effector velocity: Used by some Cartesian-space policies. Record commanded end-effector velocity from the teleoperation interface.
  • Absolute end-effector pose: Less common but used by some VLA models. Record the target end-effector 6-DOF pose.

Important: Store actions in joint space whenever possible. Most modern policies (ACT, Diffusion Policy, pi-0) operate in joint space. Storing Cartesian poses as actions requires an IK solve at training time, which is error-prone and adds computational overhead.
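
In practice, joint-space action recording is one line per timestep: log the leader arm's joint positions (plus gripper aperture) as the action label. A minimal sketch, where leader stands in for your teleoperation interface (hypothetical API):

    import numpy as np

    episode_actions = []
    for _ in range(400):                        # e.g. an 8 s episode at 50 Hz
        qpos = leader.read_joint_positions()    # hypothetical teleop interface
        action = np.append(qpos, leader.gripper_aperture()).astype(np.float32)
        episode_actions.append(action)
    actions = np.stack(episode_actions)         # float32 [T x (n_joints + 1)]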

Episode Length

Keep episodes as short as possible while still completing the full task. Typical episode lengths:

  • Simple pick-and-place: 3-8 seconds (150-400 timesteps at 50 Hz)
  • Multi-step manipulation: 10-25 seconds (500-1,250 timesteps)
  • Complex assembly: 30-90 seconds (1,500-4,500 timesteps)

Long episodes (>60 seconds) are harder to learn from because compounding prediction errors and the credit assignment problem both grow with episode horizon. If your task naturally takes longer than 60 seconds, consider decomposing it into subtask episodes that are chained during inference.

How Much Data Do You Need?

The number of demonstrations required depends on your policy architecture, task complexity, and desired success rate. Here are evidence-based guidelines from published results and our experience at SVRC.

Policy Type | Episodes Needed | Evidence | Notes
ACT (Action Chunking with Transformers) | 50-200 | ALOHA paper: 50 demos for simple tasks, 200 for bimanual | Most data-efficient architecture for manipulation; quality matters more than quantity.
Diffusion Policy | 100-500 | TRI: 200 demos for 94% success on trained tasks | Handles multi-modal action distributions well; benefits from diverse demonstrations.
VLA Fine-Tuning (pi-0, OpenVLA, RT-2) | 500-2,000 | OpenVLA: 1K demos for single-task fine-tuning | Pre-trained on large data, but fine-tuning still needs substantial task-specific data.
From-Scratch Foundation Model | 5,000-100,000+ | RT-2: 130K demos; Open X-Embodiment: 2.2M episodes across 22 robots | Only relevant if building a generalist model; most teams should fine-tune instead.

The critical insight: 50 high-quality demonstrations will outperform 500 low-quality ones every time. If you cannot get 70%+ policy success with 100 demonstrations of a simple task, the problem is data quality or pipeline bugs, not data quantity. Fix those first before collecting more.

Data Formats Deep Dive

Choosing the right data format early prevents painful migration later. The robotics community has converged on three primary formats. For a full technical comparison, see our dedicated HDF5 vs RLDS vs LeRobot format guide.

HDF5 (Hierarchical Data Format 5)

HDF5 is the most widely used format for robot demonstration data. It stores heterogeneous data (joint positions, images, metadata) in a single file with efficient random access. The ACT and Diffusion Policy codebases use HDF5 natively.

Episode structure: Each episode is a group containing datasets for observations, actions, and metadata attributes. Here is the standard ACT/ALOHA layout:

/episode_0/
    observations/
        images/
            cam_high        # uint8 [T x 480 x 640 x 3]
            cam_wrist_left  # uint8 [T x 480 x 640 x 3]
            cam_wrist_right # uint8 [T x 480 x 640 x 3]
        qpos                # float32 [T x 14]   (7 joints per arm)
        qvel                # float32 [T x 14]
    action                  # float32 [T x 14]   (leader arm positions)
    attrs:
        task = "pick_cube_bimanual"
        operator_id = "op_03"
        success = True
        timestamp = "2026-04-10T14:32:00Z"
        robot_serial = "openarm_007"

When to use HDF5: As your primary collection and storage format. Convert to other formats on demand. HDF5 is the most inspection-friendly format, has mature Python tooling (h5py, HDFView), and supports efficient random access to any frame in any episode.
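
A minimal h5py writer for the layout above (the chunking and compression settings are typical choices, not requirements of the format):

    import h5py

    def write_episode(path, idx, images, qpos, qvel, actions, meta):
        # images: dict of camera name -> uint8 array [T x 480 x 640 x 3]
        # qpos/qvel/actions: float32 arrays [T x 14]; meta: dict of attributes
        with h5py.File(path, "a") as f:
            ep = f.create_group(f"episode_{idx}")
            obs = ep.create_group("observations")
            img = obs.create_group("images")
            for cam_name, frames in images.items():
                img.create_dataset(cam_name, data=frames,
                                   chunks=(1,) + frames.shape[1:],  # one frame per chunk
                                   compression="gzip")
            obs.create_dataset("qpos", data=qpos)
            obs.create_dataset("qvel", data=qvel)
            ep.create_dataset("action", data=actions)
            for key, value in meta.items():      # task, operator_id, success, ...
                ep.attrs[key] = value

One-frame-per-chunk storage keeps random access to individual timesteps fast, which matters for the shuffled sampling that ACT and Diffusion Policy training loops use.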

RLDS (Reinforcement Learning Datasets)

Google's RLDS format uses TFRecord files with a standardized schema. It is the native format for Open X-Embodiment (2.2M+ episodes across 22 robot types) and the Octo generalist policy.

Strengths: Standardized schema enables cross-dataset training without per-dataset adapters. Efficient streaming from cloud storage via tf.data. Community scale: 50+ datasets available in RLDS format.

Weaknesses: TensorFlow dependency adds friction for PyTorch teams. Sequential access patterns (no efficient random frame access). Schema is rigid — adding custom sensor types requires writing a custom DatasetBuilder.
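
For teams who do need RLDS, consumption looks like standard TFDS iteration. A sketch, assuming a dataset already registered with TensorFlow Datasets (the dataset name is a placeholder, and observation/action keys vary per dataset):

    import tensorflow_datasets as tfds

    ds = tfds.load("your_rlds_dataset", split="train")   # placeholder name
    for episode in ds.take(1):
        for step in episode["steps"]:         # RLDS episodes nest a steps dataset
            obs = step["observation"]         # keys vary per dataset
            action = step["action"]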

LeRobot Format (Hugging Face)

Hugging Face's LeRobot format uses Parquet files for tabular data and MP4 video for camera observations. It is designed for sharing on the Hugging Face Hub and integrates natively with the LeRobot training library.

Strengths: One-command upload to Hugging Face Hub. Built-in web visualization. Compact MP4 storage (5-10x smaller than raw HDF5). Growing community with 300+ public datasets. Native ACT and Diffusion Policy training support.

Weaknesses: MP4 encoding is lossy — not suitable as a source-of-truth format for contact-sensitive tasks. Video decoding adds latency during training. Newer format with evolving tooling.

Our recommendation: Collect and store in HDF5. Use lerobot.scripts.convert_dataset to export to LeRobot format for sharing on Hugging Face Hub, and write a custom DatasetBuilder for RLDS if you need Open X-Embodiment compatibility.

Quality Control Checklist

Collecting 1,000 bad demonstrations is worse than collecting 50 good ones. Quality directly determines policy performance. Here is the 10-point QC framework we use at SVRC and recommend to every team collecting robot data.

  1. Task success verification: Every episode must complete the full task successfully. Failed demonstrations (dropped objects, missed grasps, incomplete trajectories) actively harm policy training. Implement real-time quality review: an observer marks each episode as success/failure immediately after collection. Target: 95%+ success rate in your final dataset.
  2. Timestamp synchronization: All sensor streams (cameras, joint states, actions) must be timestamped and synchronized to <5 ms tolerance. Use hardware-triggered cameras (GPIO pulse from workstation) and a shared clock source (chrony NTP or IEEE 1588 PTP). Verify by recording a known event (arm contacts table) and checking all streams show the event within 5 ms.
  3. Trajectory consistency: For a given task, demonstrations should follow similar strategies. If operator A picks from the left and operator B picks from the right, the policy learns an ambiguous bimodal distribution that fails in both cases. Define approach strategy, grasp point, and placement position before collection begins. Audit trajectory distribution weekly.
  4. Scene diversity: Systematically vary object positions (random within defined bounding box), object instances (3-5 per category), lighting conditions (overhead, angled, dim), and background clutter. Document the diversity parameters for each collection session. Without diversity, the policy overfits to your lab's exact scene layout.
  5. Frame drop monitoring: Camera streams must be checked for dropped frames. A single dropped frame creates a temporal discontinuity that confuses sequence-based policies. Monitor frame drop rates in real time. Reject episodes with >2% frame loss and investigate the cause (USB bandwidth, compute bottleneck).
  6. Gripper state consistency: Validate gripper open/close signals against camera evidence. Phantom gripper events (the data says the gripper closed but the camera shows it did not) are surprisingly common with noisy gripper sensors. Flag and correct or discard affected episodes.
  7. Joint limit compliance: No episode should contain joint positions outside the robot's safe operating range. Near-singularity configurations produce extreme joint velocities that create outliers in the action distribution. Set software joint limits 5 degrees inside the hardware limits.
  8. Episode length bounds: Define minimum and maximum episode length for your task. Episodes that are significantly shorter (task not completed) or longer (operator hesitation, recovery from near-failure) than the expected range should be flagged for review.
  9. Metadata completeness: Every episode needs: task name, operator ID, timestamp, robot serial number, camera configuration, success/failure label, and free-text notes for anomalies. Incomplete metadata makes dataset management, quality analysis, and reproducibility impossible.
  10. Cross-operator consistency: If using multiple operators, compare trajectory distributions across operators weekly. An operator whose demonstrations deviate significantly from the consensus should be retrained or excluded. Use dynamic time warping (DTW) distance between joint trajectories as a quantitative consistency metric (a minimal sketch follows this list).
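
Point 10's DTW metric needs no special tooling. A minimal pure-NumPy sketch, assuming each trajectory is a [T x n_joints] array; downsample long trajectories first, since the classic algorithm is O(T1 x T2):

    import numpy as np

    def dtw_distance(traj_a, traj_b):
        # traj_a: [T1 x n_joints], traj_b: [T2 x n_joints] joint trajectories
        t1, t2 = len(traj_a), len(traj_b)
        cost = np.full((t1 + 1, t2 + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, t1 + 1):
            for j in range(1, t2 + 1):
                d = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j],        # step in traj_a
                                     cost[i, j - 1],        # step in traj_b
                                     cost[i - 1, j - 1])    # step in both
        return cost[t1, t2] / (t1 + t2)    # length-normalized alignment cost

A reasonable flagging rule: review any operator whose mean pairwise distance to the rest of the cohort exceeds, say, twice the cohort median.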

Scaling from Pilot to Production

The path from a proof-of-concept dataset to production-scale data requires deliberate infrastructure investment at each stage. Here is the 4-stage playbook.

Stage 1: Proof of Concept (50 episodes)

One operator, one robot, one task. The goal is to validate your teleoperation setup, data pipeline, and training code. Train an ACT or Diffusion Policy model on this data and evaluate on the real robot. If you cannot get 70%+ success with 50 demonstrations of a simple task, the problem is data quality or pipeline bugs, not data quantity. Do not collect more data until you fix the root cause.

Timeline: 1-3 days. Cost (in-house): $200-$500 in operator time.

Stage 2: Initial Training (200 episodes)

Add a second operator to validate that your collection protocol is operator-independent. Implement automated quality checks: episode length bounds, joint limit violations, success/failure classification. Start tracking quality metrics systematically. This is the stage where you should see 80-90%+ policy success on simple tasks.

Timeline: 1-2 weeks. Cost (in-house): $600-$1,500 in operator time.

Stage 3: Production Policy (1,000 episodes)

Deploy 2-4 parallel collection stations with identical hardware configurations. Use a central data pipeline that aggregates episodes from all stations, runs quality checks, and produces training-ready datasets nightly. At this scale, operator management becomes the bottleneck: recruit, train, and retain 4-8 operators who can maintain consistent quality over weeks.

Key investments:

  • Automated reset mechanisms: Reduce the 15-60 seconds of manual scene reset. Even saving 15 seconds per episode across 1,000 episodes saves 4+ hours of operator time.
  • Real-time quality dashboards: Per-station throughput, success rates, and quality metrics visible to operators and supervisors. The SVRC Data Platform provides this out of the box.
  • Standardized calibration: With multiple stations, camera extrinsics and robot base calibration must be standardized. A 2 mm calibration error across stations introduces systematic noise.

Timeline: 2-4 weeks with 2 stations. Cost (in-house): $3,000-$8,000 in operator time, $30K-$100K in hardware.

Stage 4: Foundation Model Contribution (10,000+ episodes)

At this scale, you are building a foundation dataset. Consider contributing to open datasets (Open X-Embodiment, DROID) for community benefit and citation impact. Use RLDS or LeRobot format for interoperability. Invest in task ontology and metadata standards so your dataset is navigable and reusable across research groups.

Timeline: 2-6 months with 4 stations. Cost (in-house): $15,000-$50,000+ in operator time.

Or: hire SVRC to collect at any of these stages and skip the infrastructure build entirely.

Cost Calculator: In-House vs. SVRC Managed

Understanding the true cost of data collection helps teams make informed build-vs-buy decisions. Here is a realistic comparison for a 500-episode single-task campaign.

Cost Category | In-House | SVRC Managed
Hardware (robot + teleop + cameras + compute) | $15,000 - $35,000 | $0 (included)
Infrastructure setup (workspace, calibration, software) | $2,000 - $5,000 | $0 (included)
Operator recruitment and training | $1,000 - $3,000 | $0 (included)
Operator labor (500 episodes at 20 demos/hr, $35/hr) | $875 | Included in campaign price
QA and quality review | $500 - $1,500 | $0 (included)
Engineer time (pipeline, debugging, monitoring) | $5,000 - $15,000 | $0
Total | $24,375 - $60,375 | $8,000 - $15,000
Effective cost per episode | $49 - $121 | $16 - $30

In-house costs include hardware amortized over a single 500-episode campaign. If you plan to collect 5,000+ episodes over 12 months, in-house becomes more economical at scale. For one-time or initial campaigns, managed collection is typically 2-4x more cost-effective when you account for engineer time and opportunity cost.
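
To adapt these figures to your own campaign, the arithmetic is simple enough to script. A small sketch using the low-end in-house numbers from the table as defaults:

    def in_house_cost(episodes=500, demos_per_hour=20, operator_rate=35,
                      hardware=15_000, setup=2_000, recruiting=1_000,
                      qa=500, engineering=5_000):
        labor = episodes / demos_per_hour * operator_rate    # 25 hr x $35 = $875
        total = hardware + setup + recruiting + labor + qa + engineering
        return total, total / episodes

    total, per_episode = in_house_cost()
    print(f"total ${total:,.0f}, per episode ${per_episode:.0f}")   # $24,375 / $49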

7 Common Mistakes Teams Make Collecting Their First Dataset

After working with dozens of teams on their first data collection campaigns, we see the same mistakes repeatedly. Avoid these and you will save weeks of wasted effort.

  1. Collecting data before validating the training pipeline. Teams buy hardware, collect 500 episodes, then discover their training code has a bug that makes all the data unusable. Always train on 10-20 test episodes first. Validate that your pipeline produces a policy that moves the robot sensibly (even if it fails the task) before investing in scale.
  2. Ignoring operator training. An untrained operator produces 3-5x worse data than a trained one. The first 20 demonstrations from a new operator should be considered calibration data and discarded. Invest 2-4 hours in operator training before any production data collection.
  3. Inconsistent resets. If the scene is not reset consistently between episodes, the policy learns a distribution that includes arbitrary starting states. This dramatically increases the data needed to learn a robust policy. Spend time building a repeatable reset procedure before collecting data.
  4. Not monitoring quality in real time. Teams often collect an entire dataset, then discover that the last 200 episodes had a camera that shifted position, or a gripper sensor that started producing phantom signals. Monitor quality metrics in real time and flag anomalies immediately.
  5. Recording in the wrong action space. Collecting Cartesian end-effector poses when your policy expects joint positions (or vice versa) means re-processing or re-collecting everything. Decide on your action space before collection, and validate that the recorded action format matches your training code exactly (see the sanity-check sketch after this list).
  6. Underestimating scene diversity. A policy trained on demonstrations with the object always in the same position will fail when the object is 5 cm to the left. Systematically randomize object positions, lighting, and object instances from the very first episode.
  7. Skipping metadata. "We'll add metadata later" means "we will never add metadata." Record task name, operator ID, timestamp, robot serial, camera config, and success/failure label on every single episode. This costs 10 seconds per episode and saves days of confusion later.
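
Mistakes 1 and 5 can both be caught with a pre-flight check before any large collection run. A minimal sketch, assuming the ACT-style HDF5 layout from earlier (paths and dimensions are illustrative):

    import h5py

    def sanity_check(path, n_joints=14):
        # Verify one episode before scaling up collection.
        with h5py.File(path, "r") as f:
            ep = f["episode_0"]
            T = ep["action"].shape[0]
            assert ep["action"].shape == (T, n_joints), "action space mismatch"
            assert ep["observations/qpos"].shape == (T, n_joints), "qpos mismatch"
            for cam in ep["observations/images"]:
                frames = ep["observations/images"][cam]
                assert frames.shape[0] == T, f"{cam}: frame count != timesteps"
                assert frames.dtype == "uint8", f"{cam}: unexpected dtype"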

Let SVRC Collect Your Training Data

Silicon Valley Robotics Center operates dedicated data collection infrastructure for ML teams who need high-quality robot demonstration data without building their own collection pipeline. We handle hardware, operators, QA, and format conversion — you get training-ready datasets delivered on your timeline.

  • Managed data collection: From 50-episode pilots to 10,000+ episode campaigns. You specify the task; we deliver the dataset in HDF5, RLDS, or LeRobot format.
  • Hardware packages: Pre-configured data collection stations (OpenArm + cameras + compute + software) ready to deploy in your lab.
  • Open datasets: Browse our library of publicly available robot manipulation datasets.
  • Data platform: Upload, visualize, QA, and export your datasets through the SVRC Fearless Platform.