
MANUS in Embodied AI: From Human Motion to Robotic Dexterity

Dexterous robot hands are becoming a cornerstone of embodied AI. Recent advances in motion capture, simulation, reinforcement learning, and self-supervised foundation models have enabled robots to perform increasingly human-like manipulation skills.

Replicating the precision, adaptability, and tactile control of the human hand, with its 27 bones, dozens of muscles and tendons, and nearly 30 degrees of freedom, remains one of the greatest challenges in robotics.

Building a robotic hand that matches human-level dexterity, sensitivity, and coordination requires not just mechanical design, but also advanced training pipelines capable of translating human motion into intelligent control.

This article compares two leading training strategies, RialTo and V-JEPA 2, and highlights how MANUS gloves support both approaches in developing dexterous, embodied intelligence.

Two Training Strategies: RialTo vs. V-JEPA 2

RialTo — Real-to-Sim-to-Real Training

RialTo combines Imitation Learning (IL) and Reinforcement Learning (RL) to train robust, data-efficient control policies for specific manipulation tasks.

The process begins with real-world demonstrations, expands learning in simulation, and brings the refined policy back to physical deployment.

How It Works

  • Capture expert demonstrations: Human operators perform the task via teleoperation or motion capture, providing high-quality demonstrations for initial learning.
  • Build a digital twin: The real environment is reconstructed in 3D with accurate physics and visuals, creating a simulation-ready training space.
  • Train in simulation: The robot performs reinforcement-learning trials with domain randomization of object positions, lighting, and physical disturbances to build robustness (a randomization sketch follows this list).
  • Sim-to-real deployment: The optimized vision-based policy is deployed on the robot, maintaining stability and precision under real-world conditions.
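To make the randomization step concrete, here is a minimal Python sketch of how an episode reset might sample randomized parameters for the digital twin. The parameter names and ranges are illustrative placeholders, not values from RialTo.

```python
import random
from dataclasses import dataclass

@dataclass
class EpisodeConfig:
    """Randomized parameters sampled at the start of each training episode."""
    object_xy: tuple        # object position on the table (m)
    light_intensity: float  # relative scene brightness
    friction: float         # contact friction coefficient
    mass_scale: float       # multiplier on the object's nominal mass

def sample_episode_config(rng: random.Random) -> EpisodeConfig:
    """Sample one randomized configuration for the digital twin.
    Ranges are illustrative placeholders, not values from RialTo."""
    return EpisodeConfig(
        object_xy=(rng.uniform(-0.10, 0.10), rng.uniform(-0.10, 0.10)),
        light_intensity=rng.uniform(0.5, 1.5),
        friction=rng.uniform(0.6, 1.2),
        mass_scale=rng.uniform(0.8, 1.2),
    )

rng = random.Random(0)
for episode in range(3):
    cfg = sample_episode_config(rng)
    # sim.reset(cfg) would apply these parameters to the scene here
    print(episode, cfg)
```

In a real pipeline, the sampled configuration is applied to the simulator's scene and physics engine before each rollout, and the ranges are tuned per task based on how well policies transfer back to hardware.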

This Real-to-Sim-to-Real (R2S2R) loop allows robots to practice safely at scale, bridge the simulation-to-reality gap, and achieve reliable performance in dynamic environments.

V-JEPA 2 — Self-Supervised World Modeling

V-JEPA 2 uses Self-Supervised Learning (SSL) to build a general world model that links perception and physical action.

Instead of mastering one task, it learns broad representations of motion and causality, enabling zero-shot planning across unfamiliar settings.

How It Works

  • Action-free pretraining (V-JEPA 2): A large Vision Transformer (ViT) learns a general understanding of motion, appearance, and physical dynamics by predicting masked regions across diverse online videos.
  • Action-conditioned post-training (V-JEPA 2-AC): A smaller transformer is fine-tuned on unlabeled robot videos from a large robot manipulation dataset, allowing it to learn the core relationship between executed actions and environmental changes.
  • Planning and control: The system uses Model Predictive Control (MPC) to imagine future states and select actions that move it closest to the goal, requiring no task-specific rewards or retraining (a planner sketch follows this list).
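To ground the planning step, here is a toy sketch of MPC over a learned latent world model. The encoder and predictor below are stub linear layers standing in for the pretrained V-JEPA 2 encoder and action-conditioned predictor, and the sketch uses simple random shooting rather than a stronger sampler such as the cross-entropy method; all dimensions are placeholders.

```python
import torch

# Toy stand-ins for the learned components; a real system would use the
# pretrained V-JEPA 2 encoder and the action-conditioned predictor.
LATENT, ACTION, HORIZON, CANDIDATES = 32, 7, 5, 256

encoder = torch.nn.Linear(64, LATENT)                 # observation -> latent (stub)
predictor = torch.nn.Linear(LATENT + ACTION, LATENT)  # (latent, action) -> next latent (stub)

def plan(current_obs: torch.Tensor, goal_obs: torch.Tensor) -> torch.Tensor:
    """Random-shooting MPC: sample action sequences, roll them out in
    latent space, and return the first action of the best sequence."""
    with torch.no_grad():
        z = encoder(current_obs).expand(CANDIDATES, -1)
        z_goal = encoder(goal_obs)
        actions = torch.randn(CANDIDATES, HORIZON, ACTION)
        for t in range(HORIZON):
            z = predictor(torch.cat([z, actions[:, t]], dim=-1))
        cost = torch.norm(z - z_goal, dim=-1)  # distance to goal in latent space
        best = torch.argmin(cost)
    return actions[best, 0]                    # execute one step, then replan

obs, goal = torch.randn(64), torch.randn(64)
print(plan(obs, goal))
```

The key property this illustrates is that the cost is computed entirely in the learned representation: no task-specific reward function or retraining is needed, only a goal observation.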

This self-supervised process enables robots to generalize across tasks, adapt to new environments, and act intelligently without prior task-specific experience.

Capturing Human Motion Through MANUS Gloves

Both RialTo and V-JEPA 2 rely on high-fidelity human and robot motion data to connect perception with action.

This data is often collected through teleoperation using advanced hand-tracking solutions such as MANUS gloves, which can capture fine finger movements with high precision and low latency.

By recording natural hand motion in real time, such devices provide the foundation for creating dexterous manipulation datasets that fuel embodied AI research.
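As an illustration, the snippet below records streamed hand frames to a JSONL file, a simple replayable format for building teleoperation datasets. The `stream_frames()` generator is a hypothetical stub; the actual MANUS SDK exposes its own streaming interface and data layout.

```python
import json
import math
import time

def stream_frames():
    """Hypothetical stand-in for a glove data stream; the real MANUS SDK
    delivers skeleton data through its own callbacks and data layout."""
    for i in range(5):
        yield {
            "timestamp": time.time(),
            # 20 finger joint angles in radians (synthetic values here)
            "joint_angles": [0.1 * math.sin(i + j) for j in range(20)],
        }

# Append each frame as one JSON line so downstream pipelines can
# replay the session frame by frame.
with open("demo_session.jsonl", "w") as f:
    for frame in stream_frames():
        f.write(json.dumps(frame) + "\n")
```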

Motion Data Representation

Joint Angle Data

The most common format directly maps human finger joint rotations to robotic hand joints. It is fast and simple but can lose accuracy when human and robot kinematics differ.
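A minimal sketch of this direct mapping, assuming a hypothetical 16-joint robot hand and a simple linear rescaling of human flexion into the robot's joint limits (a real mapping would read the limits from the robot's URDF):

```python
import numpy as np

# Assumed human flexion range per joint (radians) and per-joint limits
# for a hypothetical 16-joint robot hand; both are illustrative.
HUMAN_RANGE = (0.0, 1.6)
ROBOT_LIMITS = np.array([[0.0, 1.3]] * 16)

def retarget_joint_angles(human_angles: np.ndarray) -> np.ndarray:
    """Linearly rescale human joint angles into robot joint limits,
    then clamp. Fast and simple, but it ignores kinematic differences
    between the human and robot hand."""
    lo, hi = HUMAN_RANGE
    t = (human_angles - lo) / (hi - lo)  # normalize to [0, 1]
    robot = ROBOT_LIMITS[:, 0] + t * (ROBOT_LIMITS[:, 1] - ROBOT_LIMITS[:, 0])
    return np.clip(robot, ROBOT_LIMITS[:, 0], ROBOT_LIMITS[:, 1])

print(retarget_joint_angles(np.linspace(0.0, 1.6, 16)))
```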

Fingertip Trajectory Mapping

An emerging method focuses on fingertip positions and orientations instead of joint rotations. This reduces kinematic mismatches and improves precision for fine-grained tasks such as in-hand manipulation and tool use.
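A common way to implement this is inverse-kinematics retargeting: solve for the robot joint angles whose fingertip pose best matches the captured human fingertip. The sketch below does this for a toy three-joint planar finger using SciPy's bounded least squares; the link lengths, bounds, and target are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

# Phalanx lengths of a toy planar finger (m); illustrative values.
LINKS = np.array([0.045, 0.025, 0.020])

def tip_pose(q: np.ndarray) -> np.ndarray:
    """Forward kinematics: joint angles -> fingertip (x, y, orientation)."""
    a = np.cumsum(q)
    return np.array([np.sum(LINKS * np.cos(a)),
                     np.sum(LINKS * np.sin(a)),
                     a[-1]])

def retarget_fingertip(target_pose: np.ndarray, q0: np.ndarray) -> np.ndarray:
    """Solve for robot joint angles whose fingertip pose matches the
    captured human fingertip pose (bounded least-squares IK)."""
    res = least_squares(lambda q: tip_pose(q) - target_pose, q0,
                        bounds=(0.0, 1.6))
    return res.x

# Target fingertip position (m) and orientation (rad) from the glove data.
target = np.array([0.06, 0.04, 0.9])
print(retarget_fingertip(target, q0=np.full(3, 0.3)))
```

Because the optimization targets the fingertip pose rather than copying joint rotations, the same captured motion can be retargeted to hands with different link lengths or joint layouts.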

Simulated Policy Training

In RialTo-style pipelines, teleoperated motion data can be transferred into simulation, where reinforcement learning and domain randomization help robots adapt to variations in lighting, texture, and physical dynamics.
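As a sketch of that transfer, the snippet below replays a recorded glove session (in the JSONL format from the recording example above) inside a stub environment, collecting transitions that could seed imitation learning before RL fine-tuning. `StubSimEnv` is a placeholder, not an Isaac Sim API.

```python
import json

class StubSimEnv:
    """Minimal stand-in for a simulator task wrapper."""
    def reset(self):
        return {"qpos": [0.0] * 20}
    def step(self, action):
        return {"qpos": list(action)}, 0.0, False, {}

def replay_demonstration(path: str, env) -> list:
    """Replay a recorded glove session in simulation, collecting
    (obs, action, next_obs) tuples to seed imitation learning."""
    transitions = []
    obs = env.reset()
    with open(path) as f:
        for line in f:
            action = json.loads(line)["joint_angles"]  # glove angles as joint targets
            next_obs, _, done, _ = env.step(action)
            transitions.append((obs, action, next_obs))
            obs = next_obs
            if done:
                break
    return transitions

# Uses the demo_session.jsonl file produced by the recording sketch above.
print(len(replay_demonstration("demo_session.jsonl", StubSimEnv())))
```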

Simulation platforms such as NVIDIA Isaac Sim offer safe, scalable environments for robotic training.

With the Isaac Lab 2.3 robot training framework, researchers can accelerate whole-body control, integrate multiple teleoperation interfaces (including MANUS gloves, Apple Vision Pro, and Vive Hand Tracking), and evaluate policy performance more efficiently.

Conclusion

At the heart of both RialTo and V-JEPA 2 lies high-fidelity human motion data captured through teleoperation.

In RialTo, it provides the ground truth for imitation learning within digital twin simulations.

In V-JEPA 2, it offers causal grounding, helping AI understand how human movement creates real-world effects.

With millimeter-level accuracy and lifelike motion fidelity, MANUS gloves exemplify how precise motion capture can bridge human dexterity and machine intelligence.

As embodied AI continues to advance, this synergy between human motion and robotic learning will shape the next generation of dexterous, adaptive robots capable of interacting naturally with the world.

November 3, 2025
Comparisons & Research