Overview of Embodied AI¶
Embodied AI is the study of intelligent agents that learn to perceive, reason, and act in the physical world through embodied interaction. Unlike disembodied AI (language models, recommender systems), embodied agents must deal with continuous physics, real-time constraints, and the consequences of physical actions.
What Makes Embodied AI Different?¶
| Challenge | Disembodied AI | Embodied AI |
|---|---|---|
| State space | Structured (text, tabular) | Raw sensory (vision, proprioception, tactile) |
| Action space | Discrete tokens | Continuous torques, velocities |
| Feedback | Immediate (loss) | Delayed, sparse, noisy rewards |
| Safety | Output filtering | Physical damage risk |
| Data | Internet-scale | Limited, expensive |
| Latency | Flexible | Real-time control loop (often tens of milliseconds or less, depending on the robot) |
The Sim-to-Real Paradigm¶
Much of modern embodied AI research follows a sim-to-real pipeline:
- Simulate: Train policies in physics simulation (Isaac Gym, MuJoCo, PyBullet)
- Randomize: Apply domain randomization to bridge the sim-to-real gap
- Transfer: Deploy the trained policy on real hardware
- Fine-tune (optional): Adapt with small amounts of real-world data
```mermaid
graph LR
    SIM[Simulation Environment] -->|Domain Randomization| POL[RL Policy Training]
    POL -->|Zero-shot Transfer| REAL[Real Robot]
    REAL -->|Fine-tuning Data| POL
```
Why Simulation?¶
- Speed: Thousands of parallel environments, millions of steps per hour
- Safety: No risk of hardware damage during exploration
- Cost: Much cheaper than real-world experiments
- Reproducibility: Controlled, seedable environments that are easy to share (bit-exact determinism is not guaranteed on every simulator or GPU backend)
Domain Randomization¶
To make sim-trained policies robust to real-world variation, randomize simulation parameters:
| Category | Examples |
|---|---|
| Physics | Friction, mass, damping, motor strength, joint limits |
| Visual | Lighting, textures, camera position, backgrounds |
| Dynamics | Action delay, observation noise, actuator model |
| Morphology | Link lengths, body mass distribution |
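A policy trained across this variation treats the real world as one more sample from the training distribution. This only works if the randomized ranges are wide enough to include the real-world parameters; ranges that are too narrow leave a gap, and ranges that are too wide can make the task harder to learn.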
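As a rough illustration of the idea above, the sketch below draws a fresh set of simulation parameters at the start of each episode. The parameter names, ranges, and the `sample_episode_params` helper are assumptions made for this example and are not tied to any particular simulator's API; applying the sampled values to a simulator is left out.

```python
import random

# Assumed (low, high) ranges, chosen only for illustration. In practice the
# ranges come from measuring or estimating the target hardware.
RANGES = {
    "friction": (0.4, 1.2),         # ground friction coefficient
    "mass_scale": (0.8, 1.2),       # multiplier on nominal link masses
    "motor_strength": (0.9, 1.1),   # multiplier on nominal motor torque
    "obs_noise_std": (0.0, 0.05),   # std of Gaussian observation noise
}
MAX_ACTION_DELAY_STEPS = 3          # hypothetical upper bound on latency


def sample_episode_params(rng):
    """Draw one parameter set uniformly from the assumed ranges."""
    params = {name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}
    # Action delay is an integer number of control steps.
    params["action_delay_steps"] = rng.randint(0, MAX_ACTION_DELAY_STEPS)
    return params


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the draws are repeatable
    for episode in range(3):
        print(episode, sample_episode_params(rng))
```

Uniform sampling is the simplest choice; other distributions or curricula that widen the ranges over training are also used, and which works best depends on the task.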
Key Simulation Platforms¶
| Platform | Developer | Strengths |
|---|---|---|
| Isaac Gym / Isaac Lab | NVIDIA | GPU-accelerated, massive parallelism, robotics focus |
| MuJoCo | Google DeepMind | Accurate contact physics, widely used in research |
| PyBullet | Erwin Coumans | Open-source, good for manipulation |
| Gazebo / ROS | Open Robotics | Full ROS integration, diverse sensor simulation |
| SAPIEN | UC San Diego | Articulated object manipulation |
| Habitat | Meta | Indoor navigation, photorealistic rendering |
Isaac Gym / Isaac Lab
For locomotion and large-scale RL training, Isaac Gym and its successor Isaac Lab are popular choices because the physics simulation runs on the GPU, which allows thousands of parallel environments on a single device. Isaac Gym itself is no longer actively developed, so new projects typically start with Isaac Lab.
Types of Embodied AI Systems¶
By Robot Morphology¶
- Legged robots: Quadrupeds (ANYmal, Unitree Go and B series), bipeds (humanoids), hexapods
- Wheeled robots: Mobile bases, wheeled manipulation platforms
- Arms: Fixed-base manipulators (Franka, UR5, xArm)
- Hands: Dexterous hands (Allegro, Shadow, LEAP)
- Mobile manipulators: Arm on mobile base (Spot + arm, Mobile ALOHA)
- Humanoids: Full-body systems (Atlas, Figure, Unitree H1)
By Capability¶
| Capability | Description | Key Challenge |
|---|---|---|
| Locomotion | Walking, running, climbing over diverse terrain | Balance, terrain adaptation, energy efficiency |
| Manipulation | Grasping, placing, tool use | Contact-rich physics, dexterity |
| Loco-manipulation | Moving + manipulating simultaneously | Whole-body coordination |
| Navigation | Moving through environments to reach goals | Mapping, obstacle avoidance |
The Learning Pipeline¶
A typical embodied AI training pipeline:
- Task design: Define reward function, success criteria, initial state distribution
- Environment setup: Create simulation with robot URDF/MJCF, terrain, objects
- Policy architecture: Choose observation space, action space, network architecture
- Training: Run RL algorithm (typically PPO) across parallel environments
- Evaluation: Test in simulation, analyze failure modes
- Sim-to-real: Deploy on real hardware, evaluate transfer quality
- Iteration: Refine based on real-world performance
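Since the training step above names PPO, the following sketch shows the clipped surrogate objective at the core of that algorithm, computed over a small batch in plain Python. The function name `ppo_clip_objective` and the default `clip_eps=0.2` are choices made for this example; a real training loop would also include a value loss, an entropy bonus, and gradient-based optimization, all omitted here.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    ratios[i]     = pi_new(a_i | s_i) / pi_old(a_i | s_i)
    advantages[i] = advantage estimate for sample i
    """
    if not ratios or len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must be non-empty and equal length")
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        # Keep the probability ratio within [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)
```

The `min` over the clipped and unclipped terms removes the incentive to move the new policy far from the one that collected the data, which is one reason PPO tends to be stable with many parallel environments.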
Observation Spaces¶
Common observation inputs for embodied agents:
- Proprioception: Joint positions, velocities, torques, base orientation (IMU)
- Exteroception: Camera images, depth maps, LiDAR, tactile sensors
- Commands: Desired velocity, target position, language instruction
- History: Stacked past observations for handling latency and partial observability
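The history input listed above can be sketched as a fixed-length buffer of recent observations that is flattened into one vector for the policy. The `ObservationHistory` class, its zero-padding at the start of an episode, and the flat-list layout are assumptions made for this example, not a standard interface.

```python
from collections import deque


class ObservationHistory:
    """Keeps the last `length` observations and returns them as one flat list."""

    def __init__(self, obs_dim, length):
        self.obs_dim = obs_dim
        # Pre-fill with zeros so the stacked size is fixed from the first step.
        self.frames = deque(
            ([0.0] * obs_dim for _ in range(length)), maxlen=length
        )

    def push(self, obs):
        """Add the newest observation; the oldest one is dropped automatically."""
        if len(obs) != self.obs_dim:
            raise ValueError(f"expected {self.obs_dim} values, got {len(obs)}")
        self.frames.append(list(obs))

    def stacked(self):
        """Concatenate frames from oldest to newest."""
        return [value for frame in self.frames for value in frame]


if __name__ == "__main__":
    history = ObservationHistory(obs_dim=2, length=3)
    history.push([0.1, -0.2])  # e.g. one joint position and one joint velocity
    print(history.stacked())
```

Each pushed observation would typically already be the concatenation of proprioception, exteroception features, and the current command.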
Action Spaces¶
- Joint position targets: Specify desired joint angles (PD controller tracks them)
- Joint velocity targets: Specify desired joint velocities
- Joint torques: Direct torque commands (most flexible, hardest to learn)
- End-effector poses: Cartesian space targets (requires IK)
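To make the first option above concrete, here is a minimal sketch of how a PD controller can turn joint position targets into torques. The function name `pd_torques`, the shared scalar gains, and the symmetric torque limit are simplifications assumed for this example; real robots usually use per-joint gains and limits.

```python
def pd_torques(q_target, q, qd, kp, kd, torque_limit):
    """Compute tau_i = kp * (q_target_i - q_i) - kd * qd_i, clipped to the limit.

    q_target: desired joint angles (rad), e.g. the policy output
    q, qd:    measured joint angles (rad) and velocities (rad/s)
    """
    torques = []
    for target, angle, velocity in zip(q_target, q, qd):
        tau = kp * (target - angle) - kd * velocity
        # Saturate to what the actuator can deliver.
        torques.append(max(-torque_limit, min(tau, torque_limit)))
    return torques


if __name__ == "__main__":
    # Two joints: the first is 0.5 rad below its target, the second is on
    # target but still moving.
    print(pd_torques([0.5, 0.0], [0.0, 0.0], [0.0, 2.0],
                     kp=20.0, kd=0.5, torque_limit=30.0))
```

Because the PD loop usually runs at a higher rate than the policy, position targets give the policy a smoother, lower-frequency interface than direct torque commands.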
Key References¶
- Tan, J., et al. (2018). "Sim-to-Real: Learning Agile Locomotion For Quadruped Robots." RSS.
- Tobin, J., et al. (2017). "Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World." IROS.
- Makoviychuk, V., et al. (2021). "Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning." NeurIPS.