Jay Y

The Rapid Evolution and Future of Machine Perception
Jay Yagnik
4/12/18

Confidential & Proprietary

Google’s View of the World (2005)

??
??
??
??
??

Perception Really Works!
(far better than I expected)

* Human Performance based on analysis done by Andrej Karpathy. More details here.

[Chart: COCO results for successive models, up to the COCO deadline of 9/16/2016. Models shown: Inception V2 SSD; Inception Resnet SSD; Resnet Faster RCNN; Ensemble of Resnet Faster RCNN; Inception Resnet Faster RCNN; Ensemble of Resnet/Inception Resnet Faster RCNN; Ensemble of Resnet/Inception Resnet Faster RCNN w/multicrop]
41.6%: Last Google submission to server, with an intelligently selected ensemble of 5 Faster RCNN models with Resnet and Inception-Resnet

Perception Works -- Handwriting Recognition
Handwriting Recognition in 90 languages and 25 scripts across Google products

Perception Works -- Geo
Automatic business discovery from StreetView imagery
Agropecuaria Galão 0.933
Galego Automóveis 0.438
Spazio Del Corpo 0.210

Perception Works -- Image Captioning
Human
- A young girl asleep on the sofa cuddling a stuffed bear.
Google Brain
- A close up of a child holding a stuffed animal.
- A baby is asleep next to a teddy bear.

Perception Works -- YouTube
Smarter YouTube thumbnails
Deep-learned features greatly outperform hand-designed features on video annotation
[Chart legend: Deep-learned visual features, VLAD variants (1024 - 8192 dim); Handcrafted audio-visual features (~40K dim); Deep-learned visual features, Average pooling; Deep-learned visual features, Max pooling]
Deep-learned features outperform handcrafted features by >50% with 3% of dimensions
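The chart legend above names average pooling and max pooling as two ways of turning per-frame features into one video-level feature. A minimal sketch of both follows, assuming each frame is already represented by a fixed-length feature vector. The feature values and function names are invented for illustration and are not taken from the deck.

```python
def average_pool(frames):
    """Element-wise mean over a list of per-frame feature vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def max_pool(frames):
    """Element-wise maximum over a list of per-frame feature vectors."""
    return [max(column) for column in zip(*frames)]


# Hypothetical toy input: 3 frames, each with a 3-dimensional feature vector.
frame_features = [
    [0.0, 2.0, 1.0],
    [4.0, 0.0, 1.0],
    [2.0, 1.0, 1.0],
]

video_avg = average_pool(frame_features)
video_max = max_pool(frame_features)
```

Both functions return a vector with the same dimensionality as a single frame, so the video-level feature stays compact however many frames are pooled.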

Combined Vision + Translation

Google Photos Search

Image Smart Reply
Bubble Zoom

Recent Progress Supported by Three Pillars
Novel Net Architectures
● Inception/ResNet
● Multibox++
● LSTMs
● FaceNet
● …
Nutritious Brain Food
NN models need lots of labeled data to improve
● Weak supervision
● Cross-domain transfer
● Smart data cleaning
● Data synthesis
Infrastructure
Shared components across audio, image & video
● Reusable code and approaches across domains & platforms
● Reusable data across domains

A key recurring theme... End-to-end deep learning
2010: Each layer learned separately... 4-5 layers.
2016: All layers trained together... 10s or 100s of layers.
Main Considerations
● Trainability
● Space-time complexity
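To make "all layers trained together" concrete, here is a minimal sketch of end-to-end training on a toy two-layer model, assuming a scalar input, one tanh hidden unit, and a squared-error loss. The model, data, learning rate, and function names are all illustrative assumptions and are not from the deck. The point is only that one final loss drives the updates of both layers through the chain rule.

```python
import math


def forward(w1, w2, x):
    """Two stacked layers: hidden = tanh(w1 * x), prediction = w2 * hidden."""
    hidden = math.tanh(w1 * x)
    return hidden, w2 * hidden


def mean_loss(data, w1, w2):
    """Mean of 0.5 * squared error over the dataset."""
    return sum(0.5 * (forward(w1, w2, x)[1] - y) ** 2 for x, y in data) / len(data)


def train_end_to_end(data, w1=0.5, w2=0.5, lr=0.1, epochs=200):
    """Update both layers from the same final loss (plain SGD)."""
    for _ in range(epochs):
        for x, y in data:
            hidden, pred = forward(w1, w2, x)
            err = pred - y                                   # dLoss/dpred
            grad_w2 = err * hidden                           # top layer
            grad_w1 = err * w2 * (1.0 - hidden * hidden) * x  # chain rule into layer 1
            w1 -= lr * grad_w1
            w2 -= lr * grad_w2
    return w1, w2


# Hypothetical toy data generated by the same model family (w1=1.5, w2=2.0).
data = [(i / 4.0, 2.0 * math.tanh(1.5 * i / 4.0)) for i in range(-8, 9)]

loss_before = mean_loss(data, 0.5, 0.5)
trained_w1, trained_w2 = train_end_to_end(data)
loss_after = mean_loss(data, trained_w1, trained_w2)
```

In a layer-wise scheme, the first layer would instead be fitted against its own local objective and then frozen before the second layer is trained.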

The main bottleneck -- Data
● Semi-supervised learning
● Active learning loops to minimize human compute
● Loop closure domain transfer
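One common way to build an active learning loop is uncertainty sampling: send humans only the items the current model is least sure about. The sketch below shows a single selection step under that assumption. The deck does not say which selection strategy was used, and the item ids, scores, and budget are hypothetical.

```python
def select_for_labeling(scores, budget):
    """Return the `budget` item ids whose model score is closest to 0.5.

    scores: dict mapping item id -> model probability for the positive class.
    Scores near 0.5 are treated as the most uncertain predictions.
    """
    ranked = sorted(scores, key=lambda item_id: abs(scores[item_id] - 0.5))
    return ranked[:budget]


# Hypothetical model scores on an unlabeled pool.
pool_scores = {"a": 0.98, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.70}

# Only these items go to human raters; the rest stay unlabeled this round.
to_label = select_for_labeling(pool_scores, budget=2)
```

After the selected items are labeled, the model is retrained and the pool is rescored, which closes the loop.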

Illustrative Example: Hard Positive Mining from YouTube ⇒ Image Models
Identify promising videos on YouTube

Image search model: find good anchor frames (e.g. Ballet: 0.92)
Propagate frame-level labels forward & back in time from anchor
How to ensure semantic consistency? Dense Flow Trajectories

Candidate Set of Hard Positives:
Semantically consistent frames from dense trajectory tracking

Add candidates that score poorly for existing CNN to training set
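The mining steps above can be sketched as follows. This is a simplified illustration, not the pipeline described in the deck: the per-frame `consistent` flags stand in for the dense-trajectory consistency check, and the `low_score` threshold of 0.5 and all example values are assumptions.

```python
def mine_hard_positives(frame_scores, anchor_index, consistent, low_score=0.5):
    """Return indices of frames to add to the training set.

    frame_scores: existing model's score for the anchor's label, one per frame.
    consistent:   one boolean per frame, True if tracking links the frame to
                  the anchor (hypothetical stand-in for dense flow trajectories).
    """
    hard = []
    # Walk backward, then forward, in time from the anchor. Stop in each
    # direction at the first frame that is no longer consistent with it.
    for step in (-1, 1):
        i = anchor_index + step
        while 0 <= i < len(frame_scores) and consistent[i]:
            if frame_scores[i] < low_score:  # the existing model does poorly here
                hard.append(i)
            i += step
    return sorted(hard)


# Hypothetical video: frame 3 is the high-confidence anchor.
scores = [0.20, 0.35, 0.80, 0.92, 0.60, 0.30, 0.10]
consistent = [False, True, True, True, True, True, False]

hard_positives = mine_hard_positives(scores, anchor_index=3, consistent=consistent)
```

Frames 0 and 6 also score poorly, but they are not linked to the anchor, so the anchor's label is not propagated to them.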

Example Web Images vs. YouTube Frames for “Footwear”
Cluster centroids mined from web images: mostly posed images
Cluster centroids from hard positive mining: footwear “in the wild”

Domain Transfer

E.g. A user uploading this video to Google Photos

Domain Transfer
● Image / photo annotation model: trained on web images
● YouTube frame annotation model: trained on video thumbnails
● Home video annotation model: trained on home videos
[Diagram: video frames annotated with the labels Toddler and Dancing]

Infrastructure Progress
● Open source Machine Learning library
● Especially useful for Deep Learning
● For research and production
● Apache 2.0 license

Future Research Directions

The Future of Perception
Perception
Machine Learning
Robotics
Language
Graphics

Future Directions: Cross-Modal Learning
● Jointly-trained audio-visual models
● Vision + Graphics: Synthetic training data for real-world tasks
● Joint understanding of language & images

Future Directions: Scene Understanding & Prediction

Future Directions: Analysis ⇒ Synthesis
● Neural-net based image and video compression
● Generating & editing creative content
● Magical photography and special effects

Future Directions: Beyond “Passive” Perception
● Active Perception: attention, info gathering
● Embedded & Interactive
● Robotics & Intelligent Environments
Example query: “Greek sculpture of two men wrestling?” Answer: “Centaur vs. Lapith”
Going deeper with language...

Perception and Language ground each other
1. Language provides deep context for perceptual grounding.
2. Perception provides sensory grounding for language.
● “Visualization” of a sentence. Could become the representation / embedding for it.
3. Knowledge from Perception → Common Sense
● E.g. Objects fall on the ground
Automatic business discovery from StreetView imagery

Conclusion
● Tremendous progress in recent years, fueled by:
○ Novel network architectures
○ Data augmentation methods
○ Infrastructure
● What’s next?
○ Joint learning of perception systems with other domains like robotics, graphics, language, etc.
○ Shared progress across all areas, due to common ML components.
○ Progress on long-tail recognition.
