Frontiers in Perceptual AI:
First-Person Video and Multimodal Perception
Kristen Grauman
University of Texas at Austin
FAIR, Meta AI
The third-person Web perceptual experience
Caltech 101 (2004), Caltech 256 (2006), PASCAL (2007-12), ImageNet (2009), LabelMe (2007), MS COCO (2014), SUN (2010), Places (2014), BSD (2001), Visual Genome (2016), AVA (2018), Kinetics (2017), ActivityNet (2015)
A curated “disembodied” moment in time
from a spectator’s perspective
First-person “egocentric” perceptual experience
Uncurated long-form video stream driven by
the agent’s goals, interactions, and attention
First-person perception and learning
Status quo:
Learning and inference with
“disembodied” images/videos.
On the horizon:
Visual learning in the context
of agent goals, interaction, and
multi-sensory observations.
Why egocentric video?
Robot learning and augmented reality
Existing first-person video datasets
Inspire our effort, but call for greater scale, content, diversity
EPIC Kitchens (Damen et al. 2020): 45 people, 100 hrs, kitchens only
UT Ego (Lee et al. 2012): 4 people, 17 hrs, daily life, in/outdoors
EGTEA Gaze+ (Li et al. 2018): 32 people, 28 hrs, kitchens only
Charades-Ego (Sigurdsson et al. 2018): 71 people, 34 hrs, indoor
ADL (Pirsiavash & Ramanan 2012): 20 people, 10 hrs, apartment
Ego4D's goal: scale far beyond these datasets along all three axes: #people, #hours, #scenes.
Existing first-person video datasets
[Chart: # Participants vs. # Hours, comparing EPIC-Kitchens-100 with all prior egocentric datasets combined (2008-2020)]
Ego4D: a massive-scale egocentric dataset
3,670 hours of in-the-wild daily life activity
931 participants from 74 worldwide locations
Multimodal: audio, 3D scans, IMU, stereo, multi-camera
Benchmark tasks to catalyze research
[Grauman et al. CVPR 2022]
Ego4D: everyday activity around the world
Ego4D: FAIR + university consortium
Towards diverse geographic coverage: IIIT Hyderabad, KAUST, U. Tokyo, U. Bristol, U. Catania, Georgia Tech, CMU, MIT, U. Minnesota, U. Indiana, UPenn, U. of Los Andes, National U. of Singapore, CMU Africa, Facebook
931 unique camera wearers
Towards diverse demographic coverage
Wearable cameras
We deploy a variety of head-mounted cameras.
GoPro, Vuzix Blade, Pupil Labs, ZShade, WeeView
Everyday activities in the home:
●Sleeping
●Daily hygiene
●Doing hair/make-up
●Cleaning / laundry
●Cooking
●Talking with family members
●Hosting a party
●Eating
●Yardwork / shoveling snow
●Household management - care for kids
●Fixing something in the home
●Playing with pets
●Crafting/knitting/sewing/drawing/painting/etc.
Entertainment/Leisure:
●Watching movies at the cinema
●Watching TV
●Reading books
●Playing games / video games
●Attending sporting events - watching and participating
●Attending a play/ballet
●Attending concerts
●Hanging out with friends at a bar
●Eating at a restaurant
●Eating at a friend's home
●Attending a party
●Talking on the phone
●Listening to music
●BBQ'ing/picnics
●Going to a salon (nail, hair, spa)
●Getting a tattoo / piercing
●Volunteering
●Practicing a musical instrument
●Attending a festival or fair
●Hanging out at a coffee shop
Exercise:
●Going to the gym
●Yoga practice
●Swimming in a pool/ocean
●Working out at home
●Cycling / jogging
●Dancing
●Working out outside
●Walking on the street
●Going to the park
●Hiking
●Tourism
Work:
●Working at a desk
●Participating in a meeting
●Attending a lecture/class
●Writing on a whiteboard
●Video calls
●Eating at the cafeteria
●Making coffee
●Talking to colleagues
Errands:
●Grocery shopping
●Clothes shopping
●Getting the car fixed
●Going to the bank
●Walking the dog
●Washing the dog / pet, grooming a horse
●Appointments: doctor, dentist, hair
Transportation:
●Car - commuting, road trip
●Bus
●Train
●Airplane
●Bike
●Skateboard/scooter
Unscripted, daily-life scenarios
How people spend their days: US Bureau of Labor Statistics
https://www.bls.gov/news.release/atus.nr0.htm
Ego4D data: 3D environment scans
Available for 491 hours of video
Ego4D data: multi-camera and eye gaze
Multiple simultaneous egocentric cameras; eye gaze (Indiana University)
Ego4D annotations: text narrations
Dense descriptive text of each camera wearer's activity + clip-level summaries
13 sentences per minute
4M+ sentences
Privacy and ethics
• Review: Each partner underwent a separate, months-long IRB review process overseeing ethical and privacy standards for data collection, management, and informed consent.
• Consent: Forms were signed by all recorded people, where relevant.
• De-identification: State-of-the-art de-identification processes, featuring both automated and manual reviews for faces, screens, credit cards, and other identifiers.
Ego4D benchmark suite
Episodic Memory (past): "Where is my X?"
Hands & Objects (present): "What am I doing, and how?"
Audio-visual Diarization (present): "Who said what, when?"
Social Interaction (present): "Who is attending to whom?"
Forecasting (future): "What will I do next?"
Ego4D challenges
Broad and growing participation at CVPR 2022, ECCV 2022, CVPR 2023
Hosted on EvalAI
https://ego4d-data.org/docs/challenge/
Ego4D is enabling new research in the broader community
[Chart: papers citing Ego4D over time. Source: Google Scholar; slide credit: David Crandall]
Ego4D: computer vision and beyond
Computer Vision, Audio, Augmented Reality, Language, Speech, Robotics, 3D Sensing
Ego4D team
Kristen Grauman, Andrew Westbury, Eugene Byrne*, Zachary Chavis*, Antonino Furnari*, Rohit Girdhar*, Jackson Hamburger*,
Hao Jiang*, Miao Liu*, Xingyu Liu*, Miguel Martin*, Tushar Nagarajan*, Ilija Radosavovic*, Santhosh Kumar Ramakrishnan*,
Fiona Ryan*, Jayant Sharma*, Michael Wray*, Mengmeng Xu*, Eric Zhongcong Xu*, Chen Zhao*, Siddhant Bansal, Dhruv
Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni,
Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym
Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava
Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari,
Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei
Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem,
Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park,
James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik
First-person perception and learning
Status quo:
Learning and inference with
“disembodied” images/videos.
On the horizon:
Visual learning in the context
of agent goals, interaction, and
multi-sensory observations.
Ego4D: large-scale multimodal dataset
Grauman et al. CVPR 2022
Recall: Ego4D text narrations
Dense descriptive text of each camera wearer's activity + clip-level summaries
13 sentences per minute
4M+ sentences
Hierarchical video-language learning: HierVL
Kumar et al. CVPR 2023
Existing methods match short clips to their corresponding narrations (Lin et al. 2022; Miech et al. 2020; Bain et al. 2021, ...).
Our idea: video-language embedding learning that represents the hierarchical relationship between action descriptions (child level: what) and higher-level summaries (parent level: why).
[t-SNE visualization of embedding spaces: HierVL (Kumar et al. CVPR 2023) vs. Lin et al. NeurIPS 2022]
HierVL on downstream tasks
Kumar et al. CVPR 2023
EPIC-KITCHENS Multi-Instance Retrieval
Ego4D Long Term Anticipation
CharadesEgo Action Recognition
Pretrained HierVL features provide strong
performance on multiple downstream video tasks
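To make the two-level objective concrete, here is a minimal PyTorch sketch of a hierarchical contrastive loss in the spirit of HierVL; the encoders, aggregation, and training schedule are placeholders, not the paper's exact recipe.

```python
# Sketch: two-level contrastive objective in the spirit of HierVL.
# Assumed form: the real method's encoders, aggregation, and schedule differ.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of aligned embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature            # (B, B) similarities
    targets = torch.arange(a.size(0))           # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_loss(clip_emb, narration_emb, video_emb, summary_emb):
    """Child level: align clips with step-by-step narrations ("what").
    Parent level: align long-video aggregates with summaries ("why")."""
    return (info_nce(clip_emb, narration_emb) +
            info_nce(video_emb, summary_emb))

# Toy usage: 8 clips from 2 long videos (4 clips each), 256-d embeddings.
clips, narrations = torch.randn(8, 256), torch.randn(8, 256)
videos = clips.view(2, 4, 256).mean(dim=1)      # aggregate clips per video
summaries = torch.randn(2, 256)
print(hierarchical_loss(clips, narrations, videos, summaries))
```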
Listening to learn about the visual world
Material properties, object identity, ambient scene, conversation, egocentric activities, emotion, 3D space
1 drum kit, 5 different spaces
Source: Shred Shed Studio
Spatial effects in audio
Agent’s spatial hearing cues:
• Interaural time difference (ITD)
• Interaural level difference (ILD)
• Spectral detail (from pinna reflections)
Factors from 3D environment:
• Geometry of the space
• Materials in the room
• Position of source and receiver
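As a concrete illustration of the first two cues, the following sketch estimates ITD from a cross-correlation peak and ILD as an RMS ratio; this is illustrative signal-processing code only, not part of any system discussed here.

```python
# Sketch: estimating the two classic binaural cues from a stereo clip.
import numpy as np
from scipy.signal import correlate

def itd_ild(left, right, sr):
    """ITD from the cross-correlation peak; ILD as an RMS ratio in dB."""
    corr = correlate(left, right, mode="full", method="fft")
    lag = np.argmax(corr) - (len(right) - 1)    # lag < 0: right channel lags,
    itd = lag / sr                              # i.e., sound hit the left ear first
    ild = 20 * np.log10(np.sqrt(np.mean(left ** 2)) /
                        np.sqrt(np.mean(right ** 2)))
    return itd, ild

# Toy example: right channel is a delayed, attenuated copy of the left,
# as if the source were on the listener's left.
sr = 16000
left = np.random.randn(sr)
right = 0.5 * np.roll(left, 8)                  # 8-sample delay (~0.5 ms)
print(itd_ild(left, right, sr))                 # itd ~ -0.5 ms, ild ~ +6 dB
```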
SoundSpaces audio simulation platform
We introduce the SoundSpaces audio simulation platform:
• Visually realistic real-world 3D environments (Matterport3D, Replica, Gibson, HM3D, ...)
• Acoustically realistic (geometry, materials, source location) binaural sound rendered in real time, for the waveform of your choice
• Room impulse response (RIR) for any source x receiver location
• Habitat-compatible
C. Chen*, U. Jain*, et al., SoundSpaces, ECCV 2020; C. Chen et al. SoundSpaces 2.0, NeurIPS 2022
SoundSpaces: https://github.com/facebookresearch/sound-spaces
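The core rendering operation behind such a platform is convolving a dry source waveform with the binaural RIR of a chosen source/receiver pair. A minimal sketch of that operation follows (illustrative only; this is not the SoundSpaces API).

```python
# Sketch: render binaural audio by convolving a mono source with a 2-channel
# room impulse response (RIR). The RIR here is random toy data; a simulator
# like SoundSpaces would supply a physical one per source/receiver pair.
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(dry, rir_lr):
    """dry: (T,) mono waveform; rir_lr: (2, K) left/right impulse responses."""
    return np.stack([fftconvolve(dry, rir_lr[0]),
                     fftconvolve(dry, rir_lr[1])])   # (2, T + K - 1)

sr = 16000
dry = np.random.randn(sr)                            # 1 s of source audio
decay = np.exp(-np.linspace(0, 8, sr // 4))          # toy 0.25 s reverb tail
rir = np.random.randn(2, sr // 4) * decay
print(render_binaural(dry, rir).shape)               # (2, 19999)
```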
Which smoke alarm is beeping?
Agent view Top-down map (unknown to the agent)
Listen with headphones online for the spatial sound experience: http://vision.cs.utexas.edu/projects/audio_visual_navigation/
SoundSpaces audio simulation
C. Chen*, U. Jain*, et al., SoundSpaces, ECCV 2020 & SoundSpaces 2.0, NeurIPS 2022
Check out the
SoundSpaces challenge in the
EmbodiedAI workshop at CVPR!
https://soundspaces.org/challenge
Recovering the shape of the scene
Daredevil (2003): the character Matt Murdock "sees" by listening
Our idea: VisualEchoes, feature learning via echolocation
Goal: learn an image representation via echolocation that benefits downstream (visual-only) spatial tasks: monocular depth prediction, surface normal estimation, and visual navigation.
Key insight: supervision comes from acoustically interacting with the physical world.
Gao et al. ECCV 2020
Echolocation in SoundSpaces
Emit a chirp at the receiver position and capture the resulting echoes.
[Figure: top-down view of a Replica scene; for several agent locations and camera viewpoints, the RGB view, depth, and left/right (L/R) echo spectrograms]
Gao et al. ECCV 2020
VisualEchoes approach
Learn a visual representation from the (in)congruence of echo and view.
[Architecture: the agent's view feeds VisualEcho-Net; the left-ear/right-ear echo response is converted to STFT spectrograms and fed to Echo-Net; audio-visual fusion and an FC layer predict orientation]
The echo is received from one of four orientations:
• the same orientation as the agent's current view
• the orientation if the agent turns right by 90°
• the opposite orientation to the agent's current view
• the orientation if the agent turns left by 90°
The learned encoding for RGB images alone benefits from this prior audio spatial learning.
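A minimal sketch of this pretext task follows; encoder depths and layer sizes are toy placeholders (the paper's networks are deeper), but the 4-way orientation classification from (view, echo) pairs is the idea above.

```python
# Sketch: 4-way orientation classification from (view, echo) pairs, the
# VisualEchoes pretext task. All layer sizes are toy placeholders.
import torch
import torch.nn as nn

class VisualEchoNet(nn.Module):
    def __init__(self, dim=128, n_orient=4):
        super().__init__()
        self.rgb_net = nn.Sequential(               # stand-in visual encoder
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))
        self.echo_net = nn.Sequential(              # 2 channels = L/R ears
            nn.Conv2d(2, 16, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))
        self.classifier = nn.Linear(2 * dim, n_orient)

    def forward(self, rgb, echo_spec):
        fused = torch.cat([self.rgb_net(rgb), self.echo_net(echo_spec)], -1)
        return self.classifier(fused)               # logits over 4 orientations

# Toy batch: 8 views (3x128x128) and binaural echo spectrograms (2x257x100).
model = VisualEchoNet()
logits = model(torch.randn(8, 3, 128, 128), torch.randn(8, 2, 257, 100))
nn.CrossEntropyLoss()(logits, torch.randint(0, 4, (8,))).backward()
# After pretraining, rgb_net alone is reused for depth/normals/navigation.
```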
VisualEchoes for downstream tasks
Pre-train a monocular depth prediction CNN with VisualEchoes; then, with no audio input, test on real images (NYU-V2 dataset): sim2real transfer.
[Gao et al., ECCV 2020] | Hu et al., Revisiting single image depth estimation, WACV 2019
VisualEchoes for downstream tasks
Monocular depth prediction
Surface normal estimation
Visual navigation
Competitive with (or even better than) supervised pre-training!
[Gao et al., ECCV 2020]
Audio-visual floorplans
Given a short video, can we infer the layout of the entire home?
Purushwalkam et al., ICCV 2021
Audio-visual floorplans
Purushwalkam et al., ICCV 2021
Supervised training of a multi-modal encoder-decoder network
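A minimal sketch of such a multimodal encoder-decoder appears below; all shapes and layer choices are placeholders, not the paper's architecture.

```python
# Sketch: fuse visual and audio features and decode a top-down floorplan map.
# Toy stand-in for the supervised multimodal encoder-decoder described above.
import torch
import torch.nn as nn

class AVFloorplanNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(8))
        self.audio_enc = nn.Sequential(nn.Conv2d(2, 32, 5, stride=4), nn.ReLU(),
                                       nn.AdaptiveAvgPool2d(8))
        self.decoder = nn.Sequential(               # 8x8 -> 32x32 top-down map
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(16, 2, 1))                    # e.g., interior + room class

    def forward(self, rgb, audio_spec):
        feats = torch.cat([self.rgb_enc(rgb), self.audio_enc(audio_spec)], 1)
        return self.decoder(feats)

model = AVFloorplanNet()
out = model(torch.randn(2, 3, 128, 128), torch.randn(2, 2, 257, 100))
print(out.shape)                                    # torch.Size([2, 2, 32, 32])
```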
Audio-visual floorplans
SoTA visual mapping that extrapolates to unseen areas
[Ramakrishnan et al. ECCV 2020]
Purushwalkam et al., ICCV 2021
Audio-visual floorplans
Purushwalkam et al., ICCV 2021
Listening to learn about the visual world
Material properties, object identity, ambient scene, conversation, egocentric activities, emotion, 3D space
Cocktail-party problem
Self-supervising audio source separation:
“Mix-and-Separate”
Simpson et al. 2015; Huang et al. 2015; Yu et al. 2017; Ephrat et al. 2018; Owens & Efros
2018; Zhao et al. 2018; Afouras et al. 2018; Zhao et al. 2019; Gao et al. 2019
[Diagram: the audio tracks of Video 1 and Video 2 are mixed, and the network is trained to separate the mixture back into its components]
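A minimal sketch of the recipe follows; in the audio-visual versions cited above, the mask prediction is additionally conditioned on each video's frames, which is omitted here.

```python
# Sketch: "mix-and-separate" self-supervision. Mixing two tracks creates a
# training pair whose ground-truth components are known for free.
import torch
import torch.nn as nn
import torch.nn.functional as F

separator = nn.Sequential(                # toy stand-in for a U-Net separator
    nn.Linear(257, 512), nn.ReLU(),
    nn.Linear(512, 257), nn.Sigmoid())    # predicts a soft spectrogram mask

def mix_and_separate_step(spec1, spec2):
    """spec1, spec2: (B, T, F) magnitude spectrograms from two videos."""
    mixture = spec1 + spec2               # synthetic training mixture
    mask = separator(mixture)             # mask for source 1
    est1, est2 = mask * mixture, (1 - mask) * mixture
    # The original unmixed tracks supervise the output; no labels needed.
    return F.l1_loss(est1, spec1) + F.l1_loss(est2, spec2)

loss = mix_and_separate_step(torch.rand(4, 100, 257), torch.rand(4, 100, 257))
loss.backward()
```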
Facial appearance hints at voice qualities
Cross-modal embedding space
[Nagrani et al. ECCV’18, Nagrani et al. CVPR’18, Kim et al. ACCV’18, Chung et al. ICASSP’19, Wen et al. ICLR’19]
Prior work learns cross-modal face-voice embeddings
for person identification.
[Figure: faces and voices embedded in a shared space; matching pairs (e.g., Face 2 / Voice 2, Face 4 / Voice 4) lie close together]
Audio-visual speech separation
Cross-modal embedding space
Cross-modal face-to-voice matching
Pull matching face-voice pairs together; push mismatched pairs apart.
Distinctive voice tracks aid embedding learning, and the vocal and facial prior aids separation.
Our idea: mutually beneficial tasks!
[Gao & Grauman, CVPR 2021]
VisualVoice
Jointly learn audio-visual speech separation and cross-modal face-voice embeddings.
[Architecture: a speech mixture, plus each speaker's face embedding and lip motion, yields separated speech; a pull/push objective shapes the cross-modal embedding space]
[Gao & Grauman, CVPR 2021]
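A minimal sketch of the joint objective in an assumed form: an L1 separation loss plus a cross-modal triplet loss; the actual VisualVoice loss has further terms and details.

```python
# Sketch: joint separation + cross-modal embedding objective. Each speaker's
# face embedding is pulled toward their own voice embedding and pushed away
# from the other speaker's.
import torch
import torch.nn.functional as F

def visualvoice_style_loss(est_specs, gt_specs, face_emb, voice_emb,
                           margin=0.5):
    """est_specs, gt_specs: (B, 2, T, F), two separated speakers per mixture.
    face_emb, voice_emb: (B, 2, D), index-aligned by speaker."""
    sep = F.l1_loss(est_specs, gt_specs)            # separation quality
    anchor = F.normalize(face_emb, dim=-1)
    pos = F.normalize(voice_emb, dim=-1)
    neg = pos.flip(1)                               # the other speaker's voice
    crossmodal = F.triplet_margin_loss(
        anchor.flatten(0, 1), pos.flatten(0, 1), neg.flatten(0, 1),
        margin=margin)
    return sep + crossmodal

loss = visualvoice_style_loss(torch.rand(4, 2, 100, 257),
                              torch.rand(4, 2, 100, 257),
                              torch.randn(4, 2, 128), torch.randn(4, 2, 128))
print(loss)
```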
[Demos: a speech mixture separated into the voices of the left and right speakers; speech with background noise enhanced]
VisualVoice vs. prior state-of-the-art methods
Our method improves the state-of-the-art on all five datasets.
Summary
Towards embodied multimodal first-person perception
● Ego4D: massive multimodal first-person data and benchmark
● Hierarchical vision-language embedding to capture goals with actions
● Inferring the shape of a scene with echoes, sounds, and vision
● Audio-visual source separation to listen to the voice of interest
Kristen Grauman
UT Austin and FAIR
grauman@cs.utexas.edu
Collaborators: Changan Chen, Ziad Al-Halah, Ruohan Gao, Senthil Purushwalkam, Ashutosh Kumar, Rohit Girdhar, Lorenzo Torresani, and the Ego4D consortium