© NABLAS Inc. All Rights Reserved 1
Paper Discussion
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
CONTENTS
● Introduction of VLAs
Vision-language-action models
Examples: OpenVLA, Pi_0, and others
How do they work?
VLM + action head
What do they run on?
A single robot arm or bimanual robot arms
Key limitations:
Restricted control frequency; limited performance on long-horizon tasks.
CONTENTS
● Introduction of VLAs
● What’s new in GR00T
Architecture (“system1-system2”)
System 2 (vision-language model): 10 Hz; System 1 (action generation): 120 Hz
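The two rates on this slide can be read as a dual-rate control loop. The sketch below is a minimal illustration, assuming (as the GR00T N1 paper describes) that the slower 10 Hz rate belongs to System 2 and the faster 120 Hz rate to System 1. The functions `system2_update` and `system1_step` are hypothetical stand-ins, not the model's actual components, and the real action module is more involved than one action per tick.

```python
# Sketch of a dual-rate loop: a slow module refreshes a context that a
# fast module reuses until the next refresh. All names are hypothetical.
SYSTEM2_HZ = 10    # slow rate shown on the slide
SYSTEM1_HZ = 120   # fast rate shown on the slide
TICKS_PER_CONTEXT = SYSTEM1_HZ // SYSTEM2_HZ  # 12 fast ticks per slow update


def system2_update(observation):
    """Stand-in for the slow vision-language module: returns a context."""
    return {"context_from_obs": observation}


def system1_step(context, tick):
    """Stand-in for the fast action module: returns one action per tick."""
    return (context["context_from_obs"], tick)


def run(num_ticks):
    """Run the fast loop, refreshing the slow context every 12th tick."""
    actions = []
    context = None
    slow_calls = 0
    for tick in range(num_ticks):
        if tick % TICKS_PER_CONTEXT == 0:
            context = system2_update(observation=tick)
            slow_calls += 1
        actions.append(system1_step(context, tick))
    return actions, slow_calls


# Simulate one second of control at the fast rate.
actions, slow_calls = run(SYSTEM1_HZ)
```

In this toy version the action at tick 13 still uses the context computed at tick 12, which is the sense in which the fast loop runs on slightly stale high-level information.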
Architecture (“system1-system2”)
DiT (Diffusion transformer)
CONTENTS
● Introduction of VLAs
● What’s new in GR00T
● Experiments
Data for training
Traditional large-scale robotics datasets are like an archipelago of “data islands”, caused by:
- Different robot embodiments
- Different sensors
- Other settings…
How are cross-dataset latent actions prepared?
- Latent action extraction from egocentric human demonstration videos
  - Based on “Latent Action Pretraining from Videos”
  - Use the frames x_t and x_{t+H} to generate the latent action z_t.
  - Use z_t and x_t to reconstruct the frame x_{t+H}.
  - Keep only the encoder; the latent action z_t is then used in pre-training.
- “Neural trajectories”
How are cross-dataset latent actions prepared?
- Latent action extraction from egocentric human demonstration videos
  - Based on “Latent Action Pretraining from Videos”
  - Use the frames x_t and x_{t+H} to generate the latent action z_t.
  - Use z_t and x_t to reconstruct the frame x_{t+H}.
  - Keep only the encoder; the latent actions z_t then serve as action labels in the training dataset.
- “Neural trajectories”
  - A video generation model is trained on the real data.
  - It generates new videos conditioned on the first frame.
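The encode/reconstruct steps for latent actions can be sketched as a tiny autoencoder. This is a minimal illustration with untrained linear maps standing in for the real encoder and decoder networks; the sizes `FRAME_DIM` and `LATENT_DIM`, the weights, and the flattened-vector "frames" are all assumptions made for the example, and details of the real model (architecture, training loop, any quantization of the latent) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, LATENT_DIM = 64, 8  # hypothetical sizes for illustration

# Untrained linear stand-ins for the encoder and decoder networks.
W_enc = rng.normal(scale=0.1, size=(LATENT_DIM, 2 * FRAME_DIM))
W_dec = rng.normal(scale=0.1, size=(FRAME_DIM, LATENT_DIM + FRAME_DIM))


def encode(x_t, x_tH):
    """Latent action z_t computed from the current and the future frame."""
    return W_enc @ np.concatenate([x_t, x_tH])


def decode(z_t, x_t):
    """Reconstruct the future frame from z_t and the current frame."""
    return W_dec @ np.concatenate([z_t, x_t])


def reconstruction_loss(x_t, x_tH):
    """Mean squared error between the true and reconstructed future frame."""
    z_t = encode(x_t, x_tH)
    x_tH_hat = decode(z_t, x_t)
    return float(np.mean((x_tH_hat - x_tH) ** 2)), z_t


# Two random vectors stand in for the frames x_t and x_{t+H}.
x_t = rng.normal(size=FRAME_DIM)
x_tH = rng.normal(size=FRAME_DIM)
loss, z_t = reconstruction_loss(x_t, x_tH)

# After training on this loss, only `encode` would be kept, and z_t would
# act as the action label for videos that have no recorded robot actions.
```

Because the decoder sees x_t directly, the latent z_t only has to carry what changed between the two frames, which is why it can be treated as an action.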
Pre-training and post-training
- Pre-training
  - Human demonstrations (latent actions)
  - Real humanoid data (real actions)
  - Augmented humanoid data (latent actions, inverse-dynamics actions)
  - Neural trajectories
- Post-training
  - Fine-tune on each single embodiment (real robot data)
  - Neural trajectories
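The data sources listed on this slide can be written down as a small lookup table. The sketch below is only a restatement of the slide as a data structure: the class `DataSource`, the source names, and the helper `sources_for` are hypothetical, and no mixing weights are given because the slide does not state any.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    action_labels: tuple  # kinds of action supervision named on the slide
    stages: tuple         # training stages that use this source


SOURCES = [
    DataSource("human_demonstration", ("latent",), ("pre-training",)),
    DataSource("real_humanoid", ("real",), ("pre-training",)),
    # "inverse" is the slide's "inverse action"; presumably actions
    # inferred by an inverse dynamics model (an assumption here).
    DataSource("augmented_humanoid", ("latent", "inverse"), ("pre-training",)),
    # The slide does not state the label type for neural trajectories.
    DataSource("neural_trajectories", (), ("pre-training", "post-training")),
    DataSource("single_embodiment_real_robot", ("real",), ("post-training",)),
]


def sources_for(stage):
    """Return the data sources used in the given training stage."""
    return [s for s in SOURCES if stage in s.stages]


pretraining_names = [s.name for s in sources_for("pre-training")]
posttraining_names = [s.name for s in sources_for("post-training")]
```

Laid out this way, neural trajectories are the only source that appears in both stages.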
CONTENTS
● Introduction of VLAs
● What’s new in GR00T
● Experiments
● Results and takeaways
Results
Takeaway
● A foundation model for humanoid robots
● Data pyramid for training
○ Use human demonstrations to form a large-scale training dataset
○ Leverage simulation, synthetic data, and real-world robot data together
○ Neural trajectories for augmenting “real” data
● The system1-system2 framework for real-time, smooth processing
Thanks for listening