π0.5: a Vision-Language-Action Model with Open-World Generalization
A paper by
Let’s talk about
● Transfusion
○ An architecture mixing autoregression and diffusion with a single Transformer
○ https://arxiv.org/abs/2408.11039
● π0
○ A robot foundation model based on Transfusion
○ https://arxiv.org/abs/2410.24164
● FAST
○ An action representation method
○ https://arxiv.org/abs/2501.09747
● π0.5
○ An improved π0 with better embodied reasoning and planning
○ https://www.physicalintelligence.company/download/pi05.pdf
Transfusion
● A single Transformer does diffusion and autoregressive token prediction
● At inference, when a BOI (beginning-of-image) token is emitted, the model switches into image diffusion mode
● N tokens of noise are injected and the diffusion process is run
● Afterwards, an EOI (end-of-image) token is emitted and autoregressive token prediction continues
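A minimal sketch of that inference loop. The token ids, sizes, and the sample_next_token / denoise stand-ins are all hypothetical placeholders, not Transfusion's actual interface:

import numpy as np

# Hypothetical token ids and dimensions; the real vocabulary and latent
# size come from the trained Transfusion model.
BOI_ID, EOI_ID, EOS_ID = 1, 2, 3
N_IMAGE_TOKENS, LATENT_DIM = 16, 8

def sample_next_token(context):
    # Stand-in for the Transformer's autoregressive text head.
    return np.random.choice([BOI_ID, EOS_ID, 10, 11, 12])

def denoise(latents, context, steps=10):
    # Stand-in for the diffusion denoising loop run by the same Transformer.
    for _ in range(steps):
        latents = latents - 0.1 * latents  # placeholder update
    return latents

def generate(max_len=64):
    context, images = [], []
    while len(context) < max_len:
        tok = sample_next_token(context)
        context.append(tok)
        if tok == EOS_ID:
            break
        if tok == BOI_ID:
            # Switch to diffusion mode: inject N noise tokens, denoise them,
            # then append EOI and resume autoregressive decoding.
            noise = np.random.randn(N_IMAGE_TOKENS, LATENT_DIM)
            images.append(denoise(noise, context))
            context.append(EOI_ID)
    return context, images

tokens, generated_images = generate()
print(tokens, len(generated_images))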
π0: A Vision-Language-Action Flow Model for General Robot Control
● VLM extended with an action expert
○ Similar to mixture-of-experts with 2 experts and a special routing method
● The VLM processes vision and language instruction
● The action expert uses flow-matching to generate actions
● Both interact through attention
● Trained to predict robot actions
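A minimal numpy sketch of flow-matching action generation at inference time: start from Gaussian noise and Euler-integrate a learned velocity field toward an action chunk. The velocity function, horizon, and action dimension are illustrative stand-ins, not π0's actual action expert or configuration:

import numpy as np

HORIZON, ACTION_DIM = 50, 7   # illustrative action-chunk shape

def velocity(actions, t, obs_embedding):
    # Stand-in for the action expert, which attends to the VLM's
    # representation and predicts a velocity field v(a_t, t | obs).
    return obs_embedding.mean() - actions

def sample_actions(obs_embedding, num_steps=10):
    # Start from Gaussian noise and integrate the learned flow
    # from t=0 to t=1 with simple Euler steps.
    a = np.random.randn(HORIZON, ACTION_DIM)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        a = a + dt * velocity(a, t, obs_embedding)
    return a

obs = np.random.randn(256)        # pretend VLM observation embedding
actions = sample_actions(obs)
print(actions.shape)              # (50, 7) chunk of continuous actions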
FAST
● Takes inspiration from image encoding techniques (JPEG)
● Compresses action sequences
● Uses the frequency space (a discrete cosine transform) to encode the action chunks
● Learns a BPE tokenizer on top
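A minimal sketch of the JPEG-style idea: DCT an action chunk along the time axis, quantize the coefficients, and keep the flattened integers as the sequence a BPE tokenizer would then be trained on. The shapes and quantization scale are illustrative, not FAST's actual settings:

import numpy as np
from scipy.fft import dct, idct

HORIZON, ACTION_DIM = 50, 7       # illustrative action-chunk shape
SCALE = 10.0                      # illustrative quantization scale

def encode(chunk):
    # DCT along the time axis concentrates energy in low frequencies,
    # so quantization mostly discards high-frequency detail.
    coeffs = dct(chunk, axis=0, norm="ortho")
    return np.round(coeffs * SCALE).astype(int)

def decode(tokens):
    return idct(tokens / SCALE, axis=0, norm="ortho")

chunk = np.cumsum(np.random.randn(HORIZON, ACTION_DIM) * 0.05, axis=0)
tokens = encode(chunk)            # integer sequence; BPE would run on top
recon = decode(tokens)
print(tokens.size, np.abs(recon - chunk).max())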
π0.5: A Vision-Language-Action Model with Open-World Generalization
● Same VLM/Action Expert architecture
● First trained to predict VQA-style information
○ Uses the FAST tokenizer to predict actions autoregressively
● Then the action expert is added and the model is post-trained to output continuous actions
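A compressed sketch of the two losses implied by that recipe, with placeholder tensors rather than the actual π0.5 training code: stage 1 is cross-entropy over autoregressively predicted FAST tokens, stage 2 is a flow-matching regression loss for the action expert (assuming a straight noise-to-action path):

import numpy as np

def cross_entropy(logits, target_ids):
    # Stage 1: autoregressive prediction of FAST action tokens (and VQA text).
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return -np.log(probs[np.arange(len(target_ids)), target_ids]).mean()

def flow_matching_loss(predicted_velocity, actions, noise):
    # Stage 2: with a straight path a_t = t*actions + (1-t)*noise, the target
    # velocity is constant in t (actions - noise); the action expert
    # regresses onto it.
    target_velocity = actions - noise
    return ((predicted_velocity - target_velocity) ** 2).mean()

logits = np.random.randn(20, 1024)              # 20 FAST tokens, toy vocab of 1024
targets = np.random.randint(0, 1024, size=20)
print(cross_entropy(logits, targets))

actions = np.random.randn(50, 7)                # toy continuous action chunk
noise = np.random.randn(50, 7)
pred_v = np.random.randn(50, 7)
print(flow_matching_loss(pred_v, actions, noise))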
Interesting bits
● π0.5 can do simple planning, which π0 could not do
○ π0 could be combined with other methods
○ Using GPT-4 as a planner doesn’t perform very well
● π0.5 has a stronger focus on cross-environment generalization than π0
● Evaluations are done with a scoring system that gives credit for partial success
○ Many evaluations of robotic systems are binary, which can be difficult to interpret when the goals are complex
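A tiny sketch of what such a partial-credit rubric can look like; the task steps below are made up for illustration:

# Hypothetical rubric: each step of a long-horizon task earns partial credit,
# so a run that completes 3 of 4 steps scores 0.75 instead of a flat failure.
rubric = ["pick up the sponge", "wipe the counter", "rinse the sponge", "put it away"]

def score_episode(completed_steps, rubric):
    return sum(step in completed_steps for step in rubric) / len(rubric)

print(score_episode({"pick up the sponge", "wipe the counter", "rinse the sponge"}, rubric))  # 0.75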
Let’s look at the robots
How to evaluate VLMs' embodied reasoning capabilities?
● Embodied reasoning is becoming more and more popular
● We can use the ERQA benchmark
○ https://github.com/embodiedreasoning/ERQA
○ Comes from Gemini Robotics
● Current scores:
Qwen2.5-VL-3B-Instruct
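ERQA is a multiple-choice benchmark; a generic sketch of how that kind of accuracy evaluation is computed, using a hypothetical item format and a random stand-in for the VLM (not ERQA's actual loader or API):

import random

# Hypothetical examples: ERQA-style items are multiple-choice questions
# about images; the fields and the ask_vlm stub are illustrative only.
examples = [
    {"question": "Which object is closest to the gripper?", "options": ["A", "B", "C", "D"], "answer": "B"},
    {"question": "Which drawer should be opened first?", "options": ["A", "B", "C", "D"], "answer": "A"},
]

def ask_vlm(question, options):
    # Stand-in for querying a real VLM with the image, the question,
    # and the lettered options.
    return random.choice(options)

correct = sum(ask_vlm(ex["question"], ex["options"]) == ex["answer"] for ex in examples)
print(f"accuracy: {correct / len(examples):.2f}")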
