# GR00T Locomotion Policy Architecture

**Model**: GR00T-WholeBodyControl-Walk
**Source**: NVIDIA GR00T Whole Body Control
**Total Parameters**: 470,578

---

## Table of Contents

1. [Overview](#overview)
2. [Input Space (Observation)](#input-space-observation)
3. [Output Space (Action)](#output-space-action)
4. [Network Architecture](#network-architecture)
5. [Forward Pass Data Flow](#forward-pass-data-flow)

---

## Overview

The GR00T policy is an **asymmetric actor-critic** architecture designed for bipedal locomotion with whole-body control. It uses:

- A **temporal observation stack** (6 frames × 86D = 516D)
- An **estimator network** (encoder) that compresses the full observation into a compact latent space
- An **actor network** (decoder) that generates actions from the latent representation plus the most recent observation

### Key Features

- **Temporal context**: a 6-frame history for stability and dynamics understanding
- **Privileged learning**: the estimator sees all 29 DOF, while the actor controls only 15 DOF (legs + waist)
- **Height & orientation commands**: explicit control over base height and orientation
- **ELU activations**: smoother gradients than ReLU

---

## Input Space (Observation)

### Single Frame (86D)

One observation frame contains:

```
[0:3]   velocity_commands   (3D)  - [vx, vy, yaw_rate] desired velocities
[3]     height_command      (1D)  - desired base height
[4:7]   orientation_command (3D)  - [roll, pitch, yaw] desired orientation
[7:10]  angular_velocity    (3D)  - IMU angular velocity (scaled by 0.5)
[10:13] gravity_vector      (3D)  - projected gravity direction
[13:42] joint_positions     (29D) - all joint angles relative to default pose
[42:71] joint_velocities    (29D) - all joint velocities
[71:86] previous_actions    (15D) - actions from previous timestep
---
Total: 86D per frame
```

### Stacked Observation (516D)

The policy receives **6 consecutive frames** stacked together:

```
obs_516d = [frame_t-5, frame_t-4, frame_t-3, frame_t-2, frame_t-1, frame_t]
         = 6 × 86D = 516D
```

**Temporal reasoning**: The history allows the policy to:

- Infer velocities and accelerations
- Predict contact events and transitions
- Maintain gait stability across steps

### Observation Scaling

```python
# From the Unitree training config
ang_vel_scale = 0.5          # Angular velocity scaling
dof_pos_scale = 1.0          # Joint position scaling
dof_vel_scale = 0.05         # Joint velocity scaling
cmd_scale = [2.0, 2.0, 0.5]  # [vx, vy, yaw] command scaling
```
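To make the frame layout and scaling concrete, below is a minimal sketch of how one 86D frame could be assembled and stacked into the 516D input. The helper names (`build_frame`, `ObsBuffer`) and the NumPy-based buffering are illustrative assumptions, not the GR00T code; only the slice order and scale constants come from the tables above.

```python
import numpy as np
from collections import deque

ANG_VEL_SCALE = 0.5
DOF_VEL_SCALE = 0.05
CMD_SCALE = np.array([2.0, 2.0, 0.5])

def build_frame(cmd_vel, cmd_height, cmd_orient, ang_vel, gravity,
                qpos, qvel, default_angles, prev_action):
    """Assemble one 86D observation frame in the layout described above."""
    return np.concatenate([
        cmd_vel * CMD_SCALE,        # [0:3]   scaled velocity commands
        [cmd_height],               # [3]     desired base height
        cmd_orient,                 # [4:7]   desired roll/pitch/yaw
        ang_vel * ANG_VEL_SCALE,    # [7:10]  scaled IMU angular velocity
        gravity,                    # [10:13] projected gravity direction
        qpos - default_angles,      # [13:42] joint positions (dof_pos_scale = 1.0)
        qvel * DOF_VEL_SCALE,       # [42:71] scaled joint velocities
        prev_action,                # [71:86] previous 15D action
    ]).astype(np.float32)           # -> (86,)

class ObsBuffer:
    """Keep the 6 most recent frames; oldest first, newest last."""
    def __init__(self, num_frames=6, frame_dim=86):
        self.frames = deque([np.zeros(frame_dim, np.float32)] * num_frames,
                            maxlen=num_frames)

    def push(self, frame):
        self.frames.append(frame)
        return np.concatenate(list(self.frames))  # -> (516,)
```

Because `push` returns frames oldest-first, the stacked vector matches the `[frame_t-5, ..., frame_t]` ordering above, so `obs_516d[-86:]` is always the newest frame.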
### Default Joint Positions (Standing Pose)

```python
default_angles = [
    # Left leg (6D)
    -0.1, 0.0, 0.0, 0.3, -0.2, 0.0,     # hip_pitch, hip_roll, hip_yaw, knee, ankle_pitch, ankle_roll
    # Right leg (6D)
    -0.1, 0.0, 0.0, 0.3, -0.2, 0.0,
    # Waist (3D)
    0.0, 0.0, 0.0,                      # yaw, roll, pitch
    # Left arm (7D)
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,  # shoulder_pitch/roll/yaw, elbow, wrist_roll/pitch/yaw
    # Right arm (7D)
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
]
```

---

## Output Space (Action)

### Action Vector (15D)

```
[0:6]   left_leg_deltas  (6D) - position offsets for left leg joints
[6:12]  right_leg_deltas (6D) - position offsets for right leg joints
[12:15] waist_deltas     (3D) - position offsets for waist joints
---
Total: 15D
```

### Action Scaling & Application

```python
action_scale = 0.25  # Action multiplier

# Target joint positions
target_pos = default_angles[:15] + action * action_scale
```

### Controlled DOFs

The policy controls **only the lower body**:

- **12 leg joints**: full leg control for locomotion
- **3 waist joints**: yaw, roll, and pitch for balance

The **arms (14 DOF)** are NOT controlled by the locomotion policy and can be:

- Held at fixed positions (e.g., a slight forward/outward pose)
- Controlled by a separate arm policy
- Used for manipulation while walking

---

## Network Architecture

### Overview Diagram

```
                     Input (516D)
                          │
         ┌────────────────┴─────────────────┐
         │                                  │
         v                                  v
┌─────────────────┐                ┌────────────────┐
│    ESTIMATOR    │                │  Input Slice   │
│    (Encoder)    │                │   (Last 86D)   │
└─────────────────┘                └────────────────┘
         │                                  │
         │ (35D latent)                     │ (86D)
         │                                  │
         ├──► [:3] ───────────────┐         │
         │    (3D)                │         │
         │                        v         v
         └──► [3:] ─► Normalize ──► Concat (121D)
              (32D)     (32D)               │
                                            v
                                    ┌─────────────┐
                                    │    ACTOR    │
                                    │  (Decoder)  │
                                    └─────────────┘
                                            │
                                            v
                                      Output (15D)
```

---

## 1. Estimator Network (Encoder)

**Purpose**: Encode the full 516D observation into a compact 35D latent representation.

### Architecture

```python
Estimator(
    # Layer 1: Initial compression
    Linear(516, 256),  # Weight: [256, 516] = 132,096 params
    ELU(),             # Smooth activation

    # Layer 2: Feature refinement
    Linear(256, 256),  # Weight: [256, 256] = 65,536 params
    ELU(),

    # Layer 3: Final encoding
    Linear(256, 35),   # Weight: [35, 256] = 8,960 params
)
```

### Layer Details

#### Layer 1: `Linear(516 → 256)` + `ELU()`
- **Input**: 516D stacked observations (6 frames × 86D)
- **Output**: 256D intermediate features
- **Parameters**: 132,096 + 256 (bias) = 132,352
- **Purpose**: Initial dimensionality reduction while preserving temporal information

#### Layer 2: `Linear(256 → 256)` + `ELU()`
- **Input**: 256D intermediate features
- **Output**: 256D refined features
- **Parameters**: 65,536 + 256 (bias) = 65,792
- **Purpose**: Feature refinement and non-linear transformation

#### Layer 3: `Linear(256 → 35)`
- **Input**: 256D refined features
- **Output**: 35D latent encoding
- **Parameters**: 8,960 + 35 (bias) = 8,995
- **Purpose**: Final compression to latent space

**Total Estimator Parameters**: 207,139

### Latent Space Breakdown (35D)

The 35D latent is split into two components:

1. **First 3D** (indices 0-2): direct features (not normalized)
2. **Last 32D** (indices 3-34): dynamics features (L2-normalized for stability)

**Normalization**:

```python
latent_first = latent[:, :3]  # (batch, 3)
latent_rest = latent[:, 3:]   # (batch, 32)

# L2-normalize the rest
latent_rest_norm = torch.clamp(
    torch.norm(latent_rest, p=2, dim=-1, keepdim=True),
    min=1e-8
)
latent_rest_normalized = latent_rest / latent_rest_norm  # (batch, 32)
```

---

## 2. Actor Network (Decoder)

**Purpose**: Decode the latent representation plus the most recent observation into motor actions.

### Actor Input Construction (121D)

```python
actor_input = torch.cat([
    obs_recent,              # Last 86D from input (most recent frame)
    latent_first,            # First 3D from estimator (direct features)
    latent_rest_normalized,  # Last 32D from estimator (normalized dynamics)
], dim=-1)  # Total: 86D + 3D + 32D = 121D
```

### Architecture

```python
Actor(
    # Layer 1: Initial expansion
    Linear(121, 512),  # Weight: [512, 121] = 61,952 params
    ELU(),

    # Layer 2: Feature processing
    Linear(512, 256),  # Weight: [256, 512] = 131,072 params
    ELU(),

    # Layer 3: Feature refinement
    Linear(256, 256),  # Weight: [256, 256] = 65,536 params
    ELU(),

    # Layer 4: Action head
    Linear(256, 15),   # Weight: [15, 256] = 3,840 params
)
```

### Layer Details

#### Layer 1: `Linear(121 → 512)` + `ELU()`
- **Input**: 121D (recent obs + latent)
- **Output**: 512D expanded features
- **Parameters**: 61,952 + 512 (bias) = 62,464
- **Purpose**: Expand the compressed representation for action generation

#### Layer 2: `Linear(512 → 256)` + `ELU()`
- **Input**: 512D expanded features
- **Output**: 256D intermediate features
- **Parameters**: 131,072 + 256 (bias) = 131,328
- **Purpose**: Process features with large capacity

#### Layer 3: `Linear(256 → 256)` + `ELU()`
- **Input**: 256D intermediate features
- **Output**: 256D refined features
- **Parameters**: 65,536 + 256 (bias) = 65,792
- **Purpose**: Additional refinement for complex motor patterns

#### Layer 4: `Linear(256 → 15)`
- **Input**: 256D refined features
- **Output**: 15D actions (leg + waist deltas)
- **Parameters**: 3,840 + 15 (bias) = 3,855
- **Purpose**: Final action prediction

**Total Actor Parameters**: 263,439
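The totals above are easy to verify by instantiating both MLPs with the documented layer sizes and summing their parameters. This is a sanity-check sketch using plain `torch.nn`, not the official GR00T implementation:

```python
import torch.nn as nn

# Layer sizes taken from the architecture tables above
estimator = nn.Sequential(
    nn.Linear(516, 256), nn.ELU(),
    nn.Linear(256, 256), nn.ELU(),
    nn.Linear(256, 35),
)
actor = nn.Sequential(
    nn.Linear(121, 512), nn.ELU(),
    nn.Linear(512, 256), nn.ELU(),
    nn.Linear(256, 256), nn.ELU(),
    nn.Linear(256, 15),
)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(estimator))                 # 207139
print(count(actor))                     # 263439
print(count(estimator) + count(actor))  # 470578
```

These `nn.Sequential` stacks also provide the `estimator` and `actor` callables assumed by the forward pass below.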
---

## Forward Pass Data Flow

### Complete Forward Pass

```python
def forward(obs_516d):
    """
    Args:
        obs_516d: (batch, 516) stacked observations
    Returns:
        actions: (batch, 15) joint position deltas
    """
    # 1. Encode the full observation
    latent = estimator(obs_516d)  # (batch, 35)

    # 2. Split and normalize the latent
    latent_first = latent[:, :3]  # (batch, 3)
    latent_rest = latent[:, 3:]   # (batch, 32)
    latent_rest_norm = torch.clamp(
        torch.norm(latent_rest, p=2, dim=-1, keepdim=True),
        min=1e-8
    )
    latent_rest_normalized = latent_rest / latent_rest_norm  # (batch, 32)

    # 3. Extract the most recent observation
    obs_recent = obs_516d[:, -86:]  # Last frame (batch, 86)

    # 4. Concatenate the actor input
    actor_input = torch.cat([
        obs_recent,              # 86D
        latent_first,            # 3D
        latent_rest_normalized,  # 32D
    ], dim=-1)  # Total: 121D

    # 5. Generate actions
    actions = actor(actor_input)  # (batch, 15)
    return actions
```

### Computational Cost

```
Forward pass operations:
1. Estimator:     516 → 256 → 256 → 35        (~207K params)
2. Normalization: L2 norm on 32D              (~32 ops)
3. Concatenation: build 121D input            (memory copy)
4. Actor:         121 → 512 → 256 → 256 → 15  (~263K params)

Total: ~470K multiply-adds per inference
```

---

## References

1. **GR00T Project**: https://github.com/NVlabs/gr00t_wbc
2. **Paper**: "GR00T: Whole-Body Control for Humanoid Robots" (NVIDIA Research)
3. **Unitree SDK**: https://github.com/unitreerobotics/unitree_sdk2_python

---

## License

The GR00T model is provided by NVIDIA under the NVIDIA Open Model License. See `NVIDIA Open Model License` in the model directory for details.

---

**Last Updated**: 2025-01-20