π0.5: a Vision-Language-Action Model with Open-World Generalization
A paper by
Let’s talk about
● Transfusion
○ An architecture mixing autoregression and diffusion with a single Transformer
○ https://arxiv.org/abs/2408.11039
● π0
○ A robot foundation model based on Transfusion
○ https://arxiv.org/abs/2410.24164
● FAST
○ An action representation method
○ https://arxiv.org/abs/2501.09747
● π0.5
○ An improved π0 with better embodied reasoning and planning
○ https://www.physicalintelligence.company/download/pi05.pdf
Transfusion
● A single Transformer does diffusion and autoregressive token prediction
● At inference, when a BOI (beginning-of-image) token is emitted, the model switches into image diffusion mode
● N tokens of noise are injected and the diffusion process is run
● Afterwards, an EOI (end-of-image) token is emitted and autoregressive token prediction continues
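A minimal sketch of that inference loop. The token ids, sizes, and the sample_next_token / denoise stand-ins are all hypothetical placeholders, not Transfusion's actual interface:

import numpy as np

# Hypothetical token ids and dimensions; the real vocabulary and latent
# size come from the trained Transfusion model.
BOI_ID, EOI_ID, EOS_ID = 1, 2, 3
N_IMAGE_TOKENS, LATENT_DIM = 16, 8

def sample_next_token(context):
    # Stand-in for the Transformer's autoregressive text head.
    return np.random.choice([BOI_ID, EOS_ID, 10, 11, 12])

def denoise(latents, context, steps=10):
    # Stand-in for the diffusion denoising loop run by the same Transformer.
    for _ in range(steps):
        latents = latents - 0.1 * latents  # placeholder update
    return latents

def generate(max_len=64):
    context, images = [], []
    while len(context) < max_len:
        tok = sample_next_token(context)
        context.append(tok)
        if tok == EOS_ID:
            break
        if tok == BOI_ID:
            # Switch to diffusion mode: inject N noise tokens, denoise them,
            # then append EOI and resume autoregressive decoding.
            noise = np.random.randn(N_IMAGE_TOKENS, LATENT_DIM)
            images.append(denoise(noise, context))
            context.append(EOI_ID)
    return context, images

tokens, generated_images = generate()
print(tokens, len(generated_images))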
π0: A Vision-Language-Action Flow Model for General Robot Control
● VLM extended with an action expert
○ Similar to mixture-of-experts with 2 experts and a special routing method
● The VLM processes vision and language instruction
● The action expert uses flow-matching to generate actions
● Both interact through attention
● Trained to predict robot actions
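A minimal numpy sketch of flow-matching action generation at inference time: start from Gaussian noise and Euler-integrate a learned velocity field toward an action chunk. The velocity function, horizon, and action dimension are illustrative stand-ins, not π0's actual action expert or configuration:

import numpy as np

HORIZON, ACTION_DIM = 50, 7   # illustrative action-chunk shape

def velocity(actions, t, obs_embedding):
    # Stand-in for the action expert, which attends to the VLM's
    # representation and predicts a velocity field v(a_t, t | obs).
    return obs_embedding.mean() - actions

def sample_actions(obs_embedding, num_steps=10):
    # Start from Gaussian noise and integrate the learned flow
    # from t=0 to t=1 with simple Euler steps.
    a = np.random.randn(HORIZON, ACTION_DIM)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        a = a + dt * velocity(a, t, obs_embedding)
    return a

obs = np.random.randn(256)        # pretend VLM observation embedding
actions = sample_actions(obs)
print(actions.shape)              # (50, 7) chunk of continuous actions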
FAST
● Takes inspiration from image encoding techniques (JPEG)
● Compresses action sequences
● Uses the frequency space (a discrete cosine transform) to encode the action chunks
● Learns a BPE tokenizer on top
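A minimal sketch of the JPEG-style idea: DCT an action chunk along the time axis, quantize the coefficients, and keep the flattened integers as the sequence a BPE tokenizer would then be trained on. The shapes and quantization scale are illustrative, not FAST's actual settings:

import numpy as np
from scipy.fft import dct, idct

HORIZON, ACTION_DIM = 50, 7       # illustrative action-chunk shape
SCALE = 10.0                      # illustrative quantization scale

def encode(chunk):
    # DCT along the time axis concentrates energy in low frequencies,
    # so quantization mostly discards high-frequency detail.
    coeffs = dct(chunk, axis=0, norm="ortho")
    return np.round(coeffs * SCALE).astype(int)

def decode(tokens):
    return idct(tokens / SCALE, axis=0, norm="ortho")

chunk = np.cumsum(np.random.randn(HORIZON, ACTION_DIM) * 0.05, axis=0)
tokens = encode(chunk)            # integer sequence; BPE would run on top
recon = decode(tokens)
print(tokens.size, np.abs(recon - chunk).max())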
π0.5: A Vision-Language-Action Model with Open-World Generalization
● Same VLM/Action Expert architecture
● First trained to predict VQA-style information
○ Uses the FAST tokenizer to predict actions autoregressively
● Then the action expert is added and the model is post-trained to output continuous actions
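A compressed sketch of the two losses implied by that recipe, with placeholder tensors rather than the actual π0.5 training code: stage 1 is cross-entropy over autoregressively predicted FAST tokens, stage 2 is a flow-matching regression loss for the action expert (assuming a straight noise-to-action path):

import numpy as np

def cross_entropy(logits, target_ids):
    # Stage 1: autoregressive prediction of FAST action tokens (and VQA text).
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return -np.log(probs[np.arange(len(target_ids)), target_ids]).mean()

def flow_matching_loss(predicted_velocity, actions, noise):
    # Stage 2: with a straight path a_t = t*actions + (1-t)*noise, the target
    # velocity is constant in t (actions - noise); the action expert
    # regresses onto it.
    target_velocity = actions - noise
    return ((predicted_velocity - target_velocity) ** 2).mean()

logits = np.random.randn(20, 1024)              # 20 FAST tokens, toy vocab of 1024
targets = np.random.randint(0, 1024, size=20)
print(cross_entropy(logits, targets))

actions = np.random.randn(50, 7)                # toy continuous action chunk
noise = np.random.randn(50, 7)
pred_v = np.random.randn(50, 7)
print(flow_matching_loss(pred_v, actions, noise))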
Interesting bits
● π0.5 can do simple planning, which π0 could not do
○ π0 could be combined with other methods
○ Using GPT-4 as a planner doesn’t perform very well
● π0.5 has a stronger focus on cross-environment generalization than π0
● Evaluations are done with a scoring system that gives credit for partial success
○ Many evaluations of robotic systems are binary, which can be difficult to interpret when the goals are complex
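A tiny sketch of what such a partial-credit rubric can look like; the task steps below are made up for illustration:

# Hypothetical rubric: each step of a long-horizon task earns partial credit,
# so a run that completes 3 of 4 steps scores 0.75 instead of a flat failure.
rubric = ["pick up the sponge", "wipe the counter", "rinse the sponge", "put it away"]

def score_episode(completed_steps, rubric):
    return sum(step in completed_steps for step in rubric) / len(rubric)

print(score_episode({"pick up the sponge", "wipe the counter", "rinse the sponge"}, rubric))  # 0.75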
Let’s look at the robots
How to evaluate VLMs' embodied reasoning capabilities?
● Embodied reasoning is becoming more and more popular
● We can use the ERQA benchmark
○ https://github.com/embodiedreasoning/ERQA
○ Comes from Gemini Robotics
● Current scores:
Qwen2.5-VL-3B-Instruct
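ERQA is a multiple-choice benchmark; a generic sketch of how that kind of accuracy evaluation is computed, using a hypothetical item format and a random stand-in for the VLM (not ERQA's actual loader or API):

import random

# Hypothetical examples: ERQA-style items are multiple-choice questions
# about images; the fields and the ask_vlm stub are illustrative only.
examples = [
    {"question": "Which object is closest to the gripper?", "options": ["A", "B", "C", "D"], "answer": "B"},
    {"question": "Which drawer should be opened first?", "options": ["A", "B", "C", "D"], "answer": "A"},
]

def ask_vlm(question, options):
    # Stand-in for querying a real VLM with the image, the question,
    # and the lettered options.
    return random.choice(options)

correct = sum(ask_vlm(ex["question"], ex["options"]) == ex["answer"] for ex in examples)
print(f"accuracy: {correct / len(examples):.2f}")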
