The Rapid Evolution and Future of Machine Perception
Jay Yagnik
4/12/18
Confidential & Proprietary
Google’s View of the World (2005)
Google’s View of the World (2015)
Perception Really Works!
(far better than I expected)
* Human performance based on analysis done by Andrej Karpathy.
[Chart: COCO object detection results up to the COCO deadline of 9/16/2016. Labeled entries: Inception V2 SSD; Inception Resnet SSD; Resnet Faster RCNN; Ensemble of Resnet Faster RCNN; Inception Resnet Faster RCNN; Ensemble of Resnet/Inception Resnet Faster RCNN; Ensemble of Resnet/Inception Resnet Faster RCNN w/multicrop]
41.6%: Last Google submission to server, with an intelligently selected ensemble of 5 Faster RCNN models with Resnet and Inception-Resnet
Perception Works -- Handwriting Recognition
Handwriting Recognition in 90 languages and 25 scripts across Google products
Perception Works -- Geo
Automatic business discovery from StreetView imagery
Agropecuaria Galão 0.933
Galego Automóveis 0.438
Spazio Del Corpo 0.210
Perception Works -- Image Captioning
Human
- A young girl asleep on the sofa cuddling a stuffed bear.
Google Brain
- A close up of a child holding a stuffed animal.
- A baby is asleep next to a teddy bear.
Perception Works -- YouTube
Smarter YouTube thumbnails
Deep-learned features greatly outperform hand-designed features on video annotation
[Chart legend: Handcrafted audio-visual features (~40K dim); Deep-learned visual features, Average pooling; Deep-learned visual features, Max pooling; Deep-learned visual features, VLAD variants (1024 - 8192 dim)]
Deep-learned features outperform handcrafted features by >50% with 3% of dimensions
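Speaker-note sketch (not from the original deck): the chart compares ways of pooling per-frame deep features into one video-level feature. A minimal illustration of average and max pooling, assuming each frame is already represented by a fixed-length feature vector; the function names and toy numbers are hypothetical.

```python
from typing import List, Sequence


def average_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Average each feature dimension over all frames of a video."""
    n_frames = len(frame_features)
    return [sum(column) / n_frames for column in zip(*frame_features)]


def max_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Take the maximum of each feature dimension over all frames."""
    return [max(column) for column in zip(*frame_features)]


# Toy input: 3 frames, each with a 3-dimensional feature vector (made-up values).
frames = [
    [0.0, 1.0, 4.0],
    [2.0, 1.0, 0.0],
    [4.0, 1.0, 2.0],
]

video_avg = average_pool(frames)  # [2.0, 1.0, 2.0]
video_max = max_pool(frames)      # [4.0, 1.0, 4.0]
print(video_avg, video_max)
```

Either way, the pooled vector has the same dimensionality as a single frame feature, regardless of video length.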
Combined Vision + Translation
Google Photos Search
Image Smart Reply
Bubble Zoom
Recent Progress Supported by Three Pillars
Novel Net Architectures
● Inception/ResNet
● Multibox++
● LSTMs
● FaceNet
● …
Nutritious Brain Food
NN models need lots of labeled data to improve
● Weak supervision
● Cross-domain xfer
● Smart data cleaning
● Data synthesis
Infrastructure
Shared components across audio, image & video
● Reusable code and approaches across domains & platforms
● Reusable data across domains
A key recurring theme... End-to-end deep learning
2010: Each layer learned separately; 4-5 layers.
2016: All layers trained together; 10s or 100s of layers.
Main Considerations
● Trainability
● Space-Time complexity
The main bottleneck -- Data
● Semi-supervised learning
● Active learning loops to minimize human compute
● Loop closure domain transfer
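Speaker-note sketch (not from the original deck): one common way to run an active learning loop is uncertainty sampling, where only the examples the current model is least sure about are sent to human raters. The slide does not say which strategy is used, so this is an assumption; the function name, scores, and budget below are hypothetical.

```python
from typing import Dict, List


def select_for_labeling(scores: Dict[str, float], budget: int) -> List[str]:
    """Pick the `budget` examples whose binary score is closest to 0.5.

    `scores` maps an example id to the current model's probability for the
    positive class. Scores near 0.5 are the most uncertain, so a human label
    on those examples is assumed to be the most informative.
    """
    ranked = sorted(scores, key=lambda example_id: abs(scores[example_id] - 0.5))
    return ranked[:budget]


# Toy model scores for five unlabeled examples (made-up values).
model_scores = {"a": 0.98, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.70}

to_label = select_for_labeling(model_scores, budget=2)
print(to_label)  # ['b', 'd']
```

In a full loop, the newly labeled examples would be added to the training set, the model retrained, and the selection repeated.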
Illustrative Example: Hard Positive Mining from YouTube ⇒ Image Models
Identify promising videos
Image search model: find good anchor frames
Ballet: 0.92
Propagate frame-level labels forward & back in time from anchor
How to ensure semantic consistency? Dense Flow Trajectories
Candidate Set of Hard Positives: Semantically consistent frames from dense trajectory tracking
Add candidates that score poorly for existing CNN to training set
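Speaker-note sketch (not from the original deck): the final selection step can be pictured as a simple filter over the candidate frames. It assumes each candidate already carries a label propagated from an anchor frame, and that the existing CNN is available as a scoring function. The `Candidate` type, `mine_hard_positives`, the threshold of 0.5, and the toy scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    frame_id: str
    label: str  # label propagated from the anchor frame


def mine_hard_positives(
    candidates: Iterable[Candidate],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Candidate]:
    """Keep candidates that the existing model scores below `threshold`.

    The candidates are assumed to be true positives (semantically consistent
    with the anchor), so a low score marks an example the model finds hard.
    """
    return [c for c in candidates if score_fn(c.frame_id, c.label) < threshold]


# Stand-in for the existing CNN: a lookup table of made-up scores.
toy_scores = {"f1": 0.92, "f2": 0.35, "f3": 0.10, "f4": 0.71}


def toy_score_fn(frame_id: str, label: str) -> float:
    return toy_scores[frame_id]


candidates = [Candidate(frame_id, "ballet") for frame_id in toy_scores]
hard = mine_hard_positives(candidates, toy_score_fn)
print([c.frame_id for c in hard])  # ['f2', 'f3']
```

The selected frames would then be added to the training set for the image model.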
Example Web Images vs. YouTube Frames for “Footwear”
Cluster centroids mined from web images: mostly posed images
Cluster centroids from hard positive mining: footwear “in the wild”
Domain Transfer
E.g. A user uploading this video to Google Photos
[Diagram: video frames annotated as “Toddler” and “Dancing” using three models]
● Image / photo annotation model: trained on web images
● YouTube frame annotation model: trained on video thumbnails
● Home video annotation model: trained on home videos
Infrastructure Progress
● Open source Machine Learning library
● Especially useful for Deep Learning
● For research and production
● Apache 2.0 license
Future Research Directions
The Future of Perception
[Diagram: Perception, Machine Learning, Robotics, Language, Graphics]
Future Directions: Cross-Modal Learning
Jointly-trained audio-visual models
Vision + Graphics: Synthetic training data for real-world tasks
Joint understanding of language & images
Future Directions: Scene Understanding & Prediction
Future Directions: Analysis ⇒ Synthesis
Neural-net based image and video compression
Generating & editing creative content
Magical photography and special effects
Future Directions: Beyond “Passive” Perception
Active Perception: attention, info gathering
“Greek sculpture of two men wrestling?”
“Centaur vs. Lapith”
Embedded & Interactive
Robotics & Intelligent Environments
Going deeper with language...
Perception and Language ground each other
1. Language provides deep context for perceptual grounding.
2. Perception provides sensory grounding for language.
● “Visualization” of a sentence. Could become the representation / embedding for it.
3. Knowledge from Perception → Common Sense
● E.g. Objects fall on the ground
Automatic business discovery from StreetView imagery
Conclusion
● Tremendous progress in recent years, fueled by:
○ Novel network architectures
○ Data augmentation methods
○ Infrastructure
● What’s next?
○ Joint learning of perception systems with other domains such as robotics, graphics, and language.
○ Shared progress across all areas, due to common ML components.
○ Progress on recognition in the long tail.
Thanks!
