Vision-Language Representations for Robotics
Dinesh Jayaraman
Assistant Professor,
University of Pennsylvania
Funding agencies:
A Robot’s “Control Loop”
[Figure: control loop linking observation, representation, and decision-making]
How should the robot represent the information in its visual observations?
What is a Good Visual Representation … for Recognition?
[Figure: Image 𝐱 → representation encoder 𝝓(⋅) → 𝝓(𝐱) → “dog”]
Current state-of-the-art for many computer vision
tasks involves learned representations that are
pretrained without supervision!
Good representations organize information conveniently for the task.
Background: Contrastive Unsupervised Learning
Pull two views of the same image together in the representation
What is to stop the representation from collapsing to 𝒛(𝒙) = 𝟎 ∀𝒙?
To prevent this, push different images to have different representations
Key Idea of Contrastive Learning:
Train representations from a collection of unlabeled images by
pulling together and pushing apart the right image pairs
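The pull/push idea can be sketched with an InfoNCE-style loss over a batch of paired views. This is only an illustration: the function name `info_nce`, the temperature value, and the use of cosine similarity are assumptions, not details taken from the slides.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style contrastive loss for two batches of embeddings.

    z_a[i] and z_b[i] embed two views of image i (pulled together);
    z_a[i] and z_b[j], j != i, come from different images (pushed apart).
    """
    # L2-normalise so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N) similarity matrix
    # Row-wise log-softmax, computed stably.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching view of each image sits on the diagonal.
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
views_a = rng.normal(size=(8, 16))
views_b = views_a + 0.01 * rng.normal(size=(8, 16))  # slightly perturbed second view
aligned = info_nce(views_a, views_b)
# A collapsed encoder maps every image to the same vector.
collapsed = info_nce(np.ones((8, 16)), np.ones((8, 16)))
```

In this toy setup the collapsed encoder cannot tell images apart, so its loss equals log(batch size), while the aligned embeddings score lower. That is the sense in which the "push apart" term discourages collapse.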
What is a Good Visual Representation for Robotics?
[Figure: Image 𝐨 → representation encoder 𝝓(⋅) → 𝝓(𝐨) → action “policy” 𝝅(⋅) → “rotate gripper +4°”]
𝒂 = 𝝅(𝝓(𝒐))
Overview: The Reinforcement Learning (RL) Formalism
[Figure: Agent ↔ Environment loop with labels “Action”, “State 𝑠𝑡”, and “unknown”]
Agent’s objective: maximize the discounted sum of “reward” over time by
executing a good action sequence 𝑎1, 𝑎2, …
$$\max_{\pi} R(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t, s_{t+1})\right]$$
States: 𝑠 ∈ 𝑆
Actions: 𝑎 ∈ 𝐴
Transition function: 𝑃(𝑠′|𝑠, 𝑎)
Task Reward function: 𝑟(𝑠, 𝑎, 𝑠′)
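As a small illustration of the objective above, the discounted sum can be computed for a finite reward sequence as follows. The helper name `discounted_return` and the example numbers are made up for this sketch.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Reward only on the third step: 0 + 0.5 * 0 + 0.25 * 1
example = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
```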
Task-Conditioned “Universal” Value Functions
• Optimal Value Function of A State
$$V^{*}(s_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t, s_{t+1})\right]$$
… Conditioned On A Task g [Schaul et al 2015]
$$V^{*}(s_0; g) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t, s_{t+1}; g)\right]$$
“How good is this state for completing the task 𝑔 (if acting optimally)?”
• Value functions 𝑉 are a useful abstraction:
  • Can guide policy improvement such as through RL
  • Well-known “Bellman equation” constraints connecting 𝑉 values at
    consecutive steps, permitting easy dynamic programming-style learning.
  • Don’t require known actions
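To make the Bellman constraint concrete, here is a toy sketch of dynamic-programming value learning on a 1-D chain with a single goal state. The environment, its size, and the reward (1 for the step that enters the goal) are assumptions chosen for illustration; nothing here comes from the slides beyond the idea that values at consecutive steps are linked.

```python
import numpy as np

def value_iteration_chain(n_states=5, goal=4, gamma=0.9, n_sweeps=50):
    """Optimal values on a chain where the agent steps left or right.

    Reward is 1 for the step that enters `goal` and 0 otherwise; the goal
    is terminal, so its own value stays 0.
    """
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        new_V = np.zeros(n_states)
        for s in range(n_states):
            if s == goal:
                continue  # terminal state keeps value 0
            candidates = []
            for step in (-1, +1):
                s_next = min(max(s + step, 0), n_states - 1)
                r = 1.0 if s_next == goal else 0.0
                # Bellman backup: reward now plus discounted value of the next state.
                candidates.append(r + gamma * V[s_next])
            new_V[s] = max(candidates)
        V = new_V
    return V

V_star = value_iteration_chain()
```

In this toy chain the resulting values decay by a factor of gamma per step of distance from the goal, so the value acts like a smooth "progress toward the goal" signal.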
Key Idea: Representations as Value Functions
[Figure: Image 𝐨 → representation encoder 𝝓(⋅) → 𝝓(𝐨) → policy 𝝅(⋅) → “rotate gripper +4°”; 𝒂 = 𝝅(𝝓(𝒐))]
[Figure: Image 𝐨 and a task specification 𝑔 (a goal image, or language such as “squeeze the brush dry”) → 𝑽∗(⋅ ; 𝒈) → 𝑽∗(𝒐; 𝒈) = 0.8]
[Figure: Image 𝐨 → 𝝓(⋅) → 𝝓(𝐨); Goal image 𝐠 → 𝝓(𝐠); distance → 𝑽∗(𝒐; 𝒈) = 0.8]
Representation 𝝓(⋅) should be rich enough so that it easily expresses 𝑉∗
[Figure: Image 𝐨 → 𝝓(⋅) → 𝝓(𝐨); Goal language 𝐠 (“squeeze the brush dry”) → 𝝓𝒍(⋅) → 𝝓𝒍(𝐠); distance → 𝑽∗(𝒐; 𝒈) = 0.8]
Train 𝝓, 𝝓𝒍 through training the value function:
$$V^{*}(\phi(o), \phi_l(g)) = d(\phi(o), \phi_l(g))$$
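A minimal sketch of reading a value off two embeddings, assuming 𝒅 is a Euclidean distance and that the value is its negative so that observations closer to the goal score higher. Both the choice of distance and the sign convention are assumptions of this sketch, and the vectors stand in for real encoder outputs.

```python
import numpy as np

def embedding_value(phi_o, phi_g):
    """Goal-conditioned value as negative embedding distance.

    Assumption: Euclidean distance, negated so that a smaller distance
    to the goal embedding means a higher value.
    """
    return -float(np.linalg.norm(np.asarray(phi_o) - np.asarray(phi_g)))

# Hypothetical 2-D embeddings standing in for encoder outputs.
goal_embedding = np.array([1.0, 0.0])
near = embedding_value(np.array([0.9, 0.1]), goal_embedding)
far = embedding_value(np.array([-1.0, 0.0]), goal_embedding)
```

Because the value is just a function of the two embeddings, the same code applies whether the goal embedding comes from an image encoder or a language encoder.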
Pre-Train on Pre-Recorded In-the-Wild Human Videos
Human videos are goal-directed, and abundant!
• Treat the final frame of any video as the goal
• Reward function? r = 1 for last step of video, 0 elsewhere.
• Actions not available, but no problem: we only care about 𝑉∗(𝒔)
Datasets: Ego4D (Grauman 2022), EpicKitchens (Damen 2021)
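The labelling recipe above can be sketched as a function that turns one video into reward-labelled transitions without any actions. The function name `label_video` and the tuple layout are hypothetical, and "last step" is interpreted here as the transition into the final frame.

```python
def label_video(frames):
    """Turn one video into (frame, next_frame, goal, reward) tuples.

    The final frame is treated as the goal; the reward is 1 on the
    transition into it and 0 elsewhere. No actions are required.
    """
    goal = frames[-1]
    transitions = []
    for t in range(len(frames) - 1):
        is_last_step = (t == len(frames) - 2)
        reward = 1.0 if is_last_step else 0.0
        transitions.append((frames[t], frames[t + 1], goal, reward))
    return transitions

# Frame names stand in for image arrays here.
demo = label_video(["f0", "f1", "f2", "f3"])
```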
Offline RL Value Function Training Objective
Encourages $V^{*}(\cdot\,; g) = \lVert \phi(\cdot) - \phi(g) \rVert_2$ to become a
valid (“Bellman-consistent”) value function.
Pulls every frame o preceding g to have high $V^{*}(o; g)$
• All frames leading up to the goal should be close to goal (pulls frames together)
• Consecutive frames should be at different distances from the goal (pushes frames apart)
Training representations as value functions with offline RL
generates a new control-aware contrastive learning objective!
Results: Image ↔ Language-Goal Distance 𝒅(𝝓(𝒐), 𝝓𝒍(𝒈))
Results: Image ↔ Image-Goal Distance 𝒅(𝝓(𝒐), 𝝓𝒍(𝒈))
On demo data, our representations predict smooth goal-conditioned V*
on human and robot videos.
What Can We Do With 𝝓(⋅) and 𝝓𝒍(⋅)?
• Use as representations for robot learning:
  • Training robot policies on image representation with:
    • behavior cloning
    • language-conditioned behavior cloning [Lynch ’20]
• Use as dense reward functions to guide reinforcement policy learning:
  • $R(o, a, o'; g) = V^{*}(o', g) - V^{*}(o, g) = \lVert \phi(o') - \phi(g) \rVert_2 - \lVert \phi(o) - \phi(g) \rVert_2$
  • offline RL (reward-weighted regression [Peters ’07]) for policy learning from noisy demos
  • online policy improvement with trajectory optimization and RL (natural policy gradient [Kakade ’01])
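The dense-reward use case can be sketched as follows, assuming the value is the negative Euclidean distance to the goal embedding (so that progress toward the goal yields positive reward; this sign convention is an assumption of the sketch). The exponentiated weighting in `rwr_weights` only illustrates the general idea behind reward-weighted regression; the function names, the temperature, and the toy embeddings are hypothetical and do not describe the experimental setup.

```python
import numpy as np

def value(phi_o, phi_g):
    # Assumed sign convention: higher value = closer to the goal embedding.
    return -np.linalg.norm(phi_o - phi_g)

def dense_rewards(embeddings, phi_g):
    """Per-step reward R_t = V(o_{t+1}; g) - V(o_t; g) along a trajectory."""
    values = np.array([value(e, phi_g) for e in embeddings])
    return values[1:] - values[:-1]

def rwr_weights(rewards, temperature=1.0):
    """Exponentiated-reward weights, normalised to sum to 1.

    Steps with higher reward get more weight when used to weight a
    behavior-cloning loss on (possibly noisy) demonstration steps.
    """
    scaled = (rewards - rewards.max()) / temperature  # shift for numerical stability
    w = np.exp(scaled)
    return w / w.sum()

# Toy trajectory in a 2-D embedding space; the third frame moves backwards.
trajectory = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.0], [3.0, 0.0]])
goal = np.array([3.0, 0.0])
rewards = dense_rewards(trajectory, goal)
weights = rwr_weights(rewards)
```

In the toy trajectory, the step that moves away from the goal receives a negative reward and the smallest weight, which is the behavior one would want when learning from noisy demos.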
Quantitative Results Summary
Results: Real-World BC / Offline RL From 20 Demos
Results: Language-Conditioned Behavior Cloning
Noisy demos → Language Goal-Based Policies
Goal: “Place the pineapple in the pot”
Noisy demos → Image Goal-Based Policies
• Close Drawer (100% success)
• Pick and Place Melon (100%)
• Push Bottle (90%)
• Fold Towel (90%)
Takeaways
For a robot to make decisions about good actions, in what format should it
internally represent the images from its camera stream?
• Modern visual representations leverage deep neural networks, self-supervised on
large unlabeled datasets of images, and largely focus on visual recognition use cases.
• No explicitly robot-focused representations before the work presented here.
• Training representations as goal-conditioned “universal value functions”: a
powerful new way to learn control-aware vision, language, (and other?)
representations.
Resources
The work presented here is covered in the following papers:
• Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bastani, Dinesh
Jayaraman. LIV: Language-Image Representations and Rewards for
Robotic Control. ICML 2023.
• Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert
Bastani, Vikash Kumar, Amy Zhang. VIP: Towards Universal Visual Reward
and Representation via Value-Implicit Pre-Training. ICLR 2023.
For further reading on self-supervised representations:
• Balestriero et al, A Cookbook of Self-Supervised Learning. arXiv 2023.

More Related Content

Similar to “Vision-language Representations for Robotics,” a Presentation from the University of Pennsylvania

Brains@Bay Meetup: The Increasing Role of Sensorimotor Experience in Artifici...
Brains@Bay Meetup: The Increasing Role of Sensorimotor Experience in Artifici...Brains@Bay Meetup: The Increasing Role of Sensorimotor Experience in Artifici...
Brains@Bay Meetup: The Increasing Role of Sensorimotor Experience in Artifici...Numenta
 
AI driven classification framework for advanced Test Automation
AI driven classification framework for advanced Test AutomationAI driven classification framework for advanced Test Automation
AI driven classification framework for advanced Test AutomationSTePINForum
 
Reward constrained interactive recommendation with natural language feedback ...
Reward constrained interactive recommendation with natural language feedback ...Reward constrained interactive recommendation with natural language feedback ...
Reward constrained interactive recommendation with natural language feedback ...Jeong-Gwan Lee
 
Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...
Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...
Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...Daniele Loiacono
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysYasutoTamura1
 
Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Sudeep Das, Ph.D.
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017StampedeCon
 
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Turi, Inc.
 
Semantic Genetic Programming Tutorial
Semantic Genetic Programming TutorialSemantic Genetic Programming Tutorial
Semantic Genetic Programming TutorialAlbertoMoraglio
 
Using binary classifiers
Using binary classifiersUsing binary classifiers
Using binary classifiersbutest
 
State representation learning for control: an overview
State representation learning for control: an overview State representation learning for control: an overview
State representation learning for control: an overview Natalia Díaz Rodríguez
 
[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing
[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing
[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processingNAVER Engineering
 
Robotic models of active perception
Robotic models of active perceptionRobotic models of active perception
Robotic models of active perceptionDimitri Ognibene
 
孫民/從電腦視覺看人工智慧 : 下一件大事
孫民/從電腦視覺看人工智慧 : 下一件大事孫民/從電腦視覺看人工智慧 : 下一件大事
孫民/從電腦視覺看人工智慧 : 下一件大事台灣資料科學年會
 
Fast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA HardwareFast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA HardwareTigerGraph
 

Similar to “Vision-language Representations for Robotics,” a Presentation from the University of Pennsylvania (20)

Brains@Bay Meetup: The Increasing Role of Sensorimotor Experience in Artifici...
Brains@Bay Meetup: The Increasing Role of Sensorimotor Experience in Artifici...Brains@Bay Meetup: The Increasing Role of Sensorimotor Experience in Artifici...
Brains@Bay Meetup: The Increasing Role of Sensorimotor Experience in Artifici...
 
AI driven classification framework for advanced Test Automation
AI driven classification framework for advanced Test AutomationAI driven classification framework for advanced Test Automation
AI driven classification framework for advanced Test Automation
 
Reward constrained interactive recommendation with natural language feedback ...
Reward constrained interactive recommendation with natural language feedback ...Reward constrained interactive recommendation with natural language feedback ...
Reward constrained interactive recommendation with natural language feedback ...
 
Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...
Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...
Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
 
Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it!
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Ppts21
Ppts21Ppts21
Ppts21
 
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015
 
Semantic Genetic Programming Tutorial
Semantic Genetic Programming TutorialSemantic Genetic Programming Tutorial
Semantic Genetic Programming Tutorial
 
Using binary classifiers
Using binary classifiersUsing binary classifiers
Using binary classifiers
 
Learning To Run
Learning To RunLearning To Run
Learning To Run
 
State representation learning for control: an overview
State representation learning for control: an overview State representation learning for control: an overview
State representation learning for control: an overview
 
[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing
[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing
[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing
 
Robotic models of active perception
Robotic models of active perceptionRobotic models of active perception
Robotic models of active perception
 
MILA DL & RL summer school highlights
MILA DL & RL summer school highlights MILA DL & RL summer school highlights
MILA DL & RL summer school highlights
 
Face detection system design seminar
Face detection system design seminarFace detection system design seminar
Face detection system design seminar
 
Session 4 .pdf
Session 4 .pdfSession 4 .pdf
Session 4 .pdf
 
孫民/從電腦視覺看人工智慧 : 下一件大事
孫民/從電腦視覺看人工智慧 : 下一件大事孫民/從電腦視覺看人工智慧 : 下一件大事
孫民/從電腦視覺看人工智慧 : 下一件大事
 
Fast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA HardwareFast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA Hardware
 

More from Edge AI and Vision Alliance

“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...Edge AI and Vision Alliance
 
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...Edge AI and Vision Alliance
 
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...Edge AI and Vision Alliance
 
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...Edge AI and Vision Alliance
 
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...Edge AI and Vision Alliance
 
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...Edge AI and Vision Alliance
 
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsightsEdge AI and Vision Alliance
 
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...Edge AI and Vision Alliance
 
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...Edge AI and Vision Alliance
 
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...Edge AI and Vision Alliance
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...Edge AI and Vision Alliance
 
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...Edge AI and Vision Alliance
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...Edge AI and Vision Alliance
 
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...Edge AI and Vision Alliance
 
“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from SamsaraEdge AI and Vision Alliance
 
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...Edge AI and Vision Alliance
 
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...Edge AI and Vision Alliance
 
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...Edge AI and Vision Alliance
 
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...Edge AI and Vision Alliance
 
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...Edge AI and Vision Alliance
 

More from Edge AI and Vision Alliance (20)

“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
 
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
 
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
 
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
 
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
 
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
 
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
 
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
 
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
 
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
 
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
 
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
 
“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara
 
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
 
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
 
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
 
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
 
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
 

Recently uploaded

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

“Vision-language Representations for Robotics,” a Presentation from the University of Pennsylvania

  • 1. Vision-Language Representations for Robotics. Dinesh Jayaraman, Assistant Professor, University of Pennsylvania. Funding agencies: [logos omitted]
  • 2. A Robot’s “Control Loop”: observation → representation (?) → decision-making. How should the robot represent the information in its visual observations?
  • 3. What is a Good Visual Representation? A representation encoder $\phi(\cdot)$ maps an image $\mathbf{x}$ to $\phi(\mathbf{x})$, from which a label such as “dog” is predicted. The current state of the art for many computer vision tasks involves learned representations that are pretrained without supervision!
  • 4. What is a Good Visual Representation … for Recognition? A representation encoder $\phi(\cdot)$ maps an image $\mathbf{x}$ to $\phi(\mathbf{x})$, from which a label such as “dog” is predicted. The current state of the art for many computer vision tasks involves learned representations that are pretrained without supervision! Good representations organize information conveniently for the task.
  • 5. Background: Contrastive Unsupervised Learning. Pull two views of the same image together in the representation. What is to stop the representation from collapsing to $z(x) = 0 \;\; \forall x$? To prevent this, push different images to have different representations.
  • 6. Background: Contrastive Unsupervised Learning. Pull two views of the same image together in the representation. What is to stop the representation from collapsing to $z(x) = 0 \;\; \forall x$? To prevent this, push different images to have different representations. Key Idea of Contrastive Learning: train representations from a collection of unlabeled images by pulling together and pushing apart the right image pairs.
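  • 7. What is a Good Visual Representation for Robotics? The representation encoder $\phi(\cdot)$ maps an image $\mathbf{o}$ to $\phi(\mathbf{o})$, and the action “policy” $\pi(\cdot)$ maps that representation to an action such as “rotate gripper +4°”: $a = \pi(\phi(o))$.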
  • 8. Overview: The Reinforcement Learning (RL) Formalism. The agent sends an action to the environment (unknown) and receives the state $s_t$. Agent’s objective: maximize the discounted sum of “reward” over time by executing a good action sequence $a_1, a_2, \ldots$: $\max_{\pi} R(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t, s_{t+1})\right]$. States: $s \in S$. Actions: $a \in A$. Transition function: $P(s' \mid s, a)$. Task reward function: $r(s, a, s')$.
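The objective on slide 8 is an expectation over infinite trajectories; for a single finite reward sequence, the quantity inside the expectation can be computed directly. The following is a minimal sketch under that assumption. The function name `discounted_return` and the example values are illustrative and do not come from the slides.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma**t * rewards[t] for one finite reward sequence."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Example: reward arrives only on the third step, so it is discounted twice.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))
```

Averaging this quantity over many sampled trajectories would give a Monte Carlo estimate of $R(\pi)$.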
  • 9. Task-Conditioned “Universal” Value Functions. Optimal value function of a state: $V^*(s_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t, s_{t+1})\right]$. Conditioned on a task $g$ [Schaul et al 2015]: $V^*(s_0; g) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t, s_{t+1}; g)\right]$, which answers “How good is this state for completing the task $g$ (if acting optimally)?” Value functions $V$ are a useful abstraction: • They can guide policy improvement, such as through RL. • Well-known “Bellman equation” constraints connect $V$ values at consecutive steps, permitting easy dynamic programming-style learning. • They don’t require known actions.
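Slide 9 mentions the “Bellman equation” constraints without writing them out. In the notation of slide 8, the standard Bellman optimality equation for the task-conditioned value function would read as follows; this is the textbook form, not a formula taken from the slides.

```latex
V^*(s; g) \;=\; \max_{a \in A} \; \mathbb{E}_{s' \sim P(\cdot \mid s, a)}
  \left[ \, r(s, a, s'; g) + \gamma \, V^*(s'; g) \, \right]
```

The equation ties the value at one step to the value at the next, which is what allows dynamic programming-style learning.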
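  • 10. Key Idea: Representations as Value Functions. The representation encoder $\phi(\cdot)$ maps an image $\mathbf{o}$ to $\phi(\mathbf{o})$, and the policy $\pi(\cdot)$ maps that representation to an action such as “rotate gripper +4°”: $a = \pi(\phi(o))$.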
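  • 11. Key Idea: Representations as Value Functions. The encoder $\phi(\cdot)$ maps an image $\mathbf{o}$ to $\phi(\mathbf{o})$, and the value function $V^*(\cdot\,; g)$ maps it to a value $V^*(o; g)$, e.g. 0.8, for a task specification $g$ (a goal image, or language such as “squeeze the brush dry”).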
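  • 12. Key Idea: Representations as Value Functions. The image $\mathbf{o}$ and the goal image $\mathbf{g}$ are both encoded by $\phi(\cdot)$; the distance between $\phi(\mathbf{o})$ and $\phi(\mathbf{g})$ gives $V^*(o; g)$, e.g. 0.8. The representation $\phi(\cdot)$ should be rich enough so that it easily expresses $V^*$.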
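  • 13. Key Idea: Representations as Value Functions. The image $\mathbf{o}$ is encoded by $\phi(\cdot)$ and the language goal $\mathbf{g}$, e.g. “squeeze the brush dry”, by $\phi_l(\cdot)$; the distance between $\phi(\mathbf{o})$ and $\phi_l(\mathbf{g})$ gives $V^*(o; g)$, e.g. 0.8. Train $\phi$, $\phi_l$ through training the value function: $V^*(\phi(o), \phi_l(g)) = d(\phi(o), \phi_l(g))$.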
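To make the idea on slides 12 and 13 concrete, here is a minimal sketch of reading a value off two embeddings. It assumes that $d$ is the negative L2 distance, so that observations nearer the goal score higher; the slides leave $d$ unspecified. The 2-D vectors are hypothetical stand-ins for encoder outputs, and no real encoder is involved.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def value(phi_o, phi_g):
    """Assumed form of d: negative L2 distance between the embeddings."""
    return -l2_distance(phi_o, phi_g)


# Hypothetical embeddings of a goal, a distant frame, and a nearby frame.
phi_goal = [1.0, 1.0]
phi_far = [0.0, 0.0]
phi_near = [0.9, 1.0]

# A frame embedded closer to the goal receives the higher value.
assert value(phi_near, phi_goal) > value(phi_far, phi_goal)
```

The same function applies whether the goal embedding comes from the image encoder or from a language encoder, as long as both map into the same space.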
  • 14. Pre-Train on Pre-Recorded In-the-Wild Human Videos. Human videos are goal-directed, and abundant! • Treat the final frame of any video as the goal. • Reward function? $r = 1$ for the last step of the video, 0 elsewhere. • Actions are not available, but no problem: we only care for $V^*(s)$. Datasets: Ego4D (Grauman 2022), EpicKitchens (Damen 2021).
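The labeling scheme on slide 14 can be sketched in a few lines: the final frame is the goal, the last transition earns reward 1, and every other transition earns 0. The sketch below also computes the discounted return from each frame along the recorded video, which is only a stand-in for $V^*$ since the video need not be optimal. The function name `label_video` and the discount value are assumptions.

```python
def label_video(num_frames, gamma=0.98):
    """Label one video as a goal-reaching episode.

    rewards[t] is the reward for the transition from frame t to frame t + 1:
    1 for the last transition, 0 elsewhere.
    returns[t] is the discounted return from frame t along the video.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to form a transition")
    num_steps = num_frames - 1
    rewards = [0.0] * (num_steps - 1) + [1.0]
    # Only the final transition pays out, so the return from frame t is
    # gamma raised to the number of steps remaining before that transition.
    returns = [gamma ** (num_steps - 1 - t) for t in range(num_steps)]
    return rewards, returns


rewards, returns = label_video(num_frames=4, gamma=0.5)
print(rewards)  # [0.0, 0.0, 1.0]
print(returns)  # [0.25, 0.5, 1.0]
```

Frames closer to the end of the video receive higher returns, which matches the intuition that they are better states for reaching the goal.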
  • 15. Offline RL Value Function Training Objective. Encourages $V^*(\cdot\,; g) = -\|\phi(\cdot) - \phi(g)\|_2$ to become a valid (“Bellman-consistent”) value function. It pulls every frame $o$ preceding $g$ to have high $V^*(o; g)$: all frames leading up to the goal should be close to the goal, which pulls frames together. Consecutive frames should be at different distances from the goal, which pushes frames apart. Training representations as value functions with offline RL generates a new control-aware contrastive learning objective!
  • 16. Results: Image ↔ Language-Goal Distance $d(\phi(o), \phi_l(g))$
  • 17. Results: Image ↔ Image-Goal Distance $d(\phi(o), \phi(g))$. On demo data, our representations predict smooth goal-conditioned $V^*$ on human and robot videos.
  • 18. What Can We Do With $\phi(\cdot)$ and $\phi_l(\cdot)$? • Use as representations for robot learning, training robot policies on the image representation with behavior cloning or language-conditioned behavior cloning [Lynch ’20]. • Use as dense reward functions to guide reinforcement policy learning: $R(o, a, o'; g) = V^*(o'; g) - V^*(o; g) = -\|\phi(o') - \phi(g)\|_2 + \|\phi(o) - \phi(g)\|_2$. This supports offline RL (reward-weighted regression [Peters ’07]) for policy learning from noisy demos, and online policy improvement with trajectory optimization and RL (natural policy gradient [Kakade ’01]).
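The dense reward on slide 18 is the change in value between consecutive observations. The sketch below computes it from a list of per-frame values and then turns the rewards into exponentiated weights, in the spirit of reward-weighted regression. The weighting is a simplified illustration rather than the exact procedure used in the work; `temperature` and the example values are hypothetical.

```python
import math


def dense_rewards(values):
    """R_t = V(o_{t+1}; g) - V(o_t; g) for each pair of consecutive frames."""
    return [nxt - cur for cur, nxt in zip(values, values[1:])]


def exponentiated_weights(rewards, temperature=1.0):
    """Weights proportional to exp(reward / temperature), summing to 1."""
    top = max(rewards)  # subtract the maximum for numerical stability
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-frame values along a noisy demo; the third frame regresses.
values = [-3.0, -2.0, -2.5, -0.5]
rewards = dense_rewards(values)
weights = exponentiated_weights(rewards)
print(rewards)  # [1.0, -0.5, 2.0]
print(weights)
```

In a weighted behavior-cloning loss, transitions that move toward the goal would then count for more than transitions that move away from it, which is one way noisy demos can still yield a useful policy.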
  • 19. Quantitative Results Summary. Results: Real-World BC / Offline RL From 20 Demos. Results: Language-Conditioned Behavior Cloning. [plots omitted]
  • 20. Noisy demos → Language Goal-Based Policies. Goal: “Place the pineapple in the pot”
  • 21. Noisy demos → Image Goal-Based Policies: Close Drawer (100% success), Pick and Place Melon (100%), Push Bottle (90%), Fold Towel (90%).
  • 22. Takeaways. For a robot to make decisions about good actions, in what format should it internally represent the images from its camera stream? • Modern visual representations leverage deep neural networks, self-supervised from large unlabeled datasets of images, and largely focus on visual recognition use cases. • There were no explicitly robot-focused representations before the work presented here. • Training representations as goal-conditioned “universal value functions” is a powerful new way to learn control-aware vision, language, (and other?) representations.
  • 23. Resources. The work presented here is covered in the following papers: • Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bastani, Dinesh Jayaraman. LIV: Language-Image Representations and Rewards for Robotic Control. ICML 2023. • Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, Amy Zhang. VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training. ICLR 2023. For further reading on self-supervised representations: • Balestriero et al. A Cookbook of Self-Supervised Learning. arXiv 2023.