Trends and Developments in
Deep Learning Research
Jeff Dean
Google Brain team
g.co/brain
In collaboration with many other people at Google
ConvNets
[Chart: accuracy vs. scale (data size, model size) — in the 1980s and 1990s, neural networks plateaued with scale and were overtaken by other approaches.]
[Chart: with more compute, neural network accuracy keeps improving with scale, surpassing other approaches — then, and now.]
[ImageNet image recognition error rates: 2011: 26% errors; humans: ~5% errors; 2016: 3% errors.]
Growing Use of Deep Learning at Google
Android
Apps
drug discovery
Gmail
Image understanding
Maps
Natural language
understanding
Photos
Robotics research
Speech
Translation
YouTube
… many others ...
Across many products/areas:
[Chart: growth over time in the number of directories containing model description files]
Google Brain Team
● Research: 27 papers in ICML, NIPS, and ICLR in 2016, plus
others in venues like ACL, ICASSP, CVPR, ICER, and OSDI;
Brain Residency Program, hosted ~50 interns in 2016, ...
● Product impact: Many dozens of high impact collaborations in
Google products/efforts like Search, Ads, Photos, Translate, GMail,
Maps, Cloud ML, speech recognition, self-driving cars, robotics, …
● Open-source tools for ML: TensorFlow, Magenta, visualization
tools, …
g.co/brain
Need to build the right tools
What do you want in a machine learning system?
● Ease of expression: for lots of crazy ML ideas/algorithms
● Scalability: can run experiments quickly
● Portability: can run on wide variety of platforms
● Reproducibility: easy to share and reproduce research
● Production readiness: go from research to real products
Open, standard software for
general machine learning
Great for Deep Learning in
particular
First released Nov 2015
Apache 2.0 license
http://tensorflow.org/
and
https://github.com/tensorflow/tensorflow
tensorflow.org/tutorials
Why Did We Build TensorFlow?
Wanted system that was flexible, scalable, and production-ready
DistBelief, our first system, was good on two of these but lacked flexibility
Most existing open-source packages were likewise good on two of the three, but not all three
TensorFlow Goals
Establish common platform for expressing machine learning ideas and systems
Make this platform the best in the world for both research and production use
Open source it so that it becomes a platform for everyone, not just Google
Facts and Figures
Launched on Nov. 9, 2015
Initial launch was reasonably full-featured:
auto differentiation, queues, control flow, a fairly comprehensive set of ops, …
tutorials made the system accessible
out-of-the-box support for CPUs, GPUs, multiple devices, and multiple platforms
Some Stats
500+ contributors, most of them outside Google
12,000+ commits since Nov. 2015 (~30 per day)
1M+ binary downloads
#15 most popular repository on GitHub by stars (just passed Linux!)
Used in ML classes at quite a few universities now:
Toronto, Berkeley, Stanford, …
Many companies/organizations using TensorFlow:
Google, DeepMind, OpenAI, Twitter, Snapchat, Airbus, Uber, ...
December 2015 (0.6): Python 3.3+, faster on GPUs
February 2016 (0.7): dynamic loading of kernels, cuDNN v4
April 2016 (0.8): distributed TensorFlow
June 2016 (0.9): iOS, GPUs on Mac
August 2016 (0.10): TF-Slim
October 2016 (0.11): cuDNN v5
November 2016 (0.12)
Languages
Version 1.0 upcoming: stability, usability, backwards compatibility, documentation, libraries, models
Appeared in OSDI 2016
https://arxiv.org/abs/1605.08695
http://tensorflow.org/whitepaper2015.pdf
Strong External Adoption
[Chart: GitHub stars over time for TensorFlow (launched Nov. 2015) vs. other ML frameworks (launched Jan. 2008, Jan. 2012, Sep. 2013, and Apr. 2015)]
1M+ binary installs since November 2015
Experiment Turnaround Time and Research Productivity
● Minutes, Hours:
○ Interactive research! Instant gratification!
● 1-4 days
○ Tolerable
○ Interactivity replaced by running many experiments in parallel
● 1-4 weeks
○ High value experiments only
○ Progress stalls
● >1 month
○ Don’t even try
Just-In-Time Compilation
via TensorFlow’s XLA, "Accelerated Linear Algebra" compiler
0x00000000 movq (%rdx), %rax                # load the input pointer
0x00000003 vmovaps (%rax), %xmm0            # load four packed floats
0x00000007 vmulps %xmm0, %xmm0, %xmm0       # square them element-wise
0x0000000b vmovaps %xmm0, (%rdi)            # store the result
...
TF graphs go in,
Optimized & specialized
assembly comes out.
Let's explain that!
Demo:
Inspect JIT code in
TensorFlow
IPython shell
XLA:CPU
XLA:GPU
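As a minimal sketch, XLA's JIT could be switched on through the TF 1.x-era Python API like this (the toy graph below mirrors the squared-vector computation in the assembly above; the tensor names are illustrative):

import tensorflow as tf  # TF 1.x-era API, contemporary with this talk

# Toy graph: element-wise square, the same computation as the vmulps above.
x = tf.placeholder(tf.float32, shape=[4], name="x")
y = x * x

# Ask the session to JIT-compile what it can via XLA.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

with tf.Session(config=config) as sess:
    print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0, 4.0]}))  # [1. 4. 9. 16.]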
What are some ways that
deep learning is having
a significant impact at Google?
All of these examples implemented using TensorFlow
or our predecessor system
Speech Recognition
[Diagram: acoustic input → deep recurrent neural network → text output: “How cold is it outside?”]
Reduced word errors by more than 30%
Google Research Blog - August 2012, August 2015
Google Photos Search
[Diagram: your photo → deep convolutional neural network → automatic tag: “ocean”]
Search personal photos without tags.
Google Research Blog - June 2013
Google Photos Search
Reuse same model for completely
different problems
Same basic model structure
trained on different data,
useful in completely different contexts
Example: given image → predict interesting pixels
We have tons of vision problems
Image search, StreetView, Satellite Imagery,
Translation, Robotics, Self-driving Cars,
www.google.com/sunroof
MEDICAL IMAGING
Using a similar model for detecting diabetic
retinopathy in retinal images
Performance on par or slightly better than the median of 8 U.S.
board-certified ophthalmologists (F-score of 0.95 vs. 0.91).
http://research.googleblog.com/2016/11/deep-learning-for-detection-of-diabetic.html
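For context, the F-score cited above is the standard harmonic mean of precision and recall; a minimal helper (not from the study itself):

def f1_score(precision, recall):
    # F-score at beta = 1: harmonic mean of precision and recall.
    return 2.0 * precision * recall / (precision + recall)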
Computers can now see
Large implications for machine learning for robotics
Combining Vision with Robotics
“Deep Learning for Robots: Learning
from Large-Scale Interaction”, Google
Research Blog, March, 2016
“Learning Hand-Eye Coordination for
Robotic Grasping with Deep Learning
and Large-Scale Data Collection”,
Sergey Levine, Peter Pastor, Alex
Krizhevsky, & Deirdre Quillen,
arXiv, arxiv.org/abs/1603.02199
Better language understanding
RankBrain in Google Search Ranking
[Diagram: query & document features → deep neural network → score for the (query, document) pair]
Query: “car parts for sale”; Doc: “Rebuilt transmissions …”
Launched in 2015
Third most important search ranking signal (of 100s)
Bloomberg, Oct 2015: “Google Turning Its Lucrative Web Search Over to AI Machines”
Sequence-to-Sequence Model [Sutskever & Vinyals & Le, NIPS 2014]
[Diagram: a deep LSTM reads the input sequence (A B C D), then emits the target sequence (X Y Z Q) one symbol at a time, feeding each output back in as the next input.]
Sequence-to-Sequence Model: Machine Translation
[Diagram: input sentence “Quelle est votre taille ?” <EOS> → target sentence decoded one word at a time: “How” → “How tall” → “How tall are” → “How tall are you?”]
At inference time: beam search to choose the most probable of the possible output sequences
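To make the inference step concrete, here is a minimal, generic beam-search sketch in plain Python; step_fn, which returns next-token probabilities for a given prefix, is a hypothetical stand-in for the LSTM decoder:

import math

def beam_search(step_fn, start_token, end_token, beam_size=4, max_len=20):
    beams = [([start_token], 0.0)]   # (token prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            # step_fn(prefix) -> {next_token: probability}
            for token, prob in step_fn(prefix).items():
                candidates.append((prefix + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Hypotheses that emitted the end token are complete.
            (finished if prefix[-1] == end_token else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]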
Smart Reply (Google Research Blog - Nov 2015)
April 1, 2009: April Fool’s Day joke
Nov 5, 2015: launched as a real product
Feb 1, 2016: >10% of mobile Inbox replies
[Diagram: incoming email → small feed-forward neural network → “Activate Smart Reply?” (yes/no); if yes, a deep recurrent neural network generates the suggested replies.]
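The two-stage design above (a cheap model gating an expensive one) can be sketched as follows; triage_model and reply_model are hypothetical stand-ins for the small feed-forward network and the deep recurrent network:

def smart_reply(email_text, triage_model, reply_model, threshold=0.5):
    # The cheap feed-forward model runs on every email.
    if triage_model.predict_proba(email_text) < threshold:
        return []  # no Smart Reply suggestions for this email
    # The expensive sequence model runs only when a short reply is likely.
    return reply_model.generate(email_text, num_replies=3)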
Combining vision and language
Image Captioning [Vinyals et al., CVPR 2015]
[Diagram: an LSTM generates the caption one word at a time: “A young girl asleep …”]
Image Captions Research
Model: A close up of a child holding a stuffed animal.
Human: A young girl asleep on the sofa cuddling a stuffed bear.
Model: A baby is asleep next to a teddy bear.
Translation as a sign of
better language understanding
Great quality improvements
...but challenging scalability issues
https://arxiv.org/abs/1609.08144
Google Neural Machine Translation Model
[Diagram: 8 layers of encoder LSTMs and 8 layers of decoder LSTMs connected by attention; each layer runs on its own GPU (GPU1–GPU8), and a softmax produces the output tokens (Y1, Y2, …, </s>) from the input tokens (…, X3, X2, </s>). One model replica = one machine with 8 GPUs.]
Model + Data Parallelism
[Diagram: many model replicas train in parallel; the parameters are distributed across many parameter server machines (Params, Params, …).]
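A minimal sketch of this parameter-server pattern in the TF 1.x-era distributed API (the host names and variable shapes are made up for illustration):

import tensorflow as tf  # TF 1.x-era API, contemporary with this talk

cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222", "ps1:2222"],              # hold the shared parameters
    "worker": ["worker0:2222", "worker1:2222"],  # model replicas
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter places variables round-robin on the ps tasks,
# while the computation stays on this worker (one model replica).
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    weights = tf.Variable(tf.zeros([1000, 10]))
    # ... build the replica's model and optimizer here ...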
Neural Machine Translation
[Chart: translation quality (0–6, where 6 = perfect translation) for phrase-based (PBMT), neural (GNMT), and human translations, across English>Spanish, English>French, English>Chinese, Spanish>English, French>English, and Chinese>English.]
Closes gap between old system
and human-quality translation
by 58% to 87%
Enables better communication
across the world
research.googleblog.com/2016/09/a-neural-network-for-machine.html
More widespread use of:
Transfer and multi-task learning, zero-shot learning
Currently:
Most models are trained from scratch for a single task
This is quite inefficient:
Data inefficient: needs lots of data for each task
Computation inefficient: starting from scratch is a lot of work
Human ML expert inefficient: substantial effort required for
each task
Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation,
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat,
Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean
https://arxiv.org/abs/1611.04558
https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html
Bigger models, but sparsely activated
[Diagram: per-example routing — a gating network activates only a few experts for each example.]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le & Jeff Dean
Submitted to ICLR 2017, https://openreview.net/pdf?id=B1ckMDqlg
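A toy numpy sketch of the per-example routing idea, not the paper's implementation (experts and gate_weights are illustrative names):

import numpy as np

def moe_layer(x, experts, gate_weights, k=2):
    # Score every expert, but only run the top-k for this example.
    logits = x @ gate_weights                  # one score per expert
    top_k = np.argsort(logits)[-k:]            # indices of the k best experts
    g = np.exp(logits[top_k] - logits[top_k].max())
    g /= g.sum()                               # softmax over the chosen experts
    # Most of the (outrageously large) network stays idle for this example.
    return sum(w * experts[i](x) for w, i in zip(g, top_k))

# Example: four tiny experts, inputs of size 8, route each example to 2 of them.
rng = np.random.RandomState(0)
experts = [lambda v, W=rng.randn(8, 8): v @ W for _ in range(4)]
gate_weights = rng.randn(8, 4)
print(moe_layer(rng.randn(8), experts, gate_weights).shape)  # (8,)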
Automated machine learning (“learning to learn”)
Current:
Solution = ML expertise + data + computation
Can we turn this into:
Solution = data + 100X computation
???
Idea: a model-generating model trained via RL
(1) Generate ten models
(2) Train them for a few hours
(3) Use the loss of the generated models as the reinforcement
learning signal (a rough sketch of this loop follows)
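A rough sketch of that loop in Python; controller (the model-generating model) and train_and_evaluate are hypothetical interfaces, not the actual system:

def architecture_search(controller, train_and_evaluate, num_rounds=100):
    for _ in range(num_rounds):
        # (1) Generate ten candidate model architectures.
        candidates = [controller.sample_architecture() for _ in range(10)]
        # (2) Train each candidate for a few hours; measure its loss.
        losses = [train_and_evaluate(arch) for arch in candidates]
        # (3) Negative loss is the reinforcement learning reward.
        controller.update(candidates, rewards=[-l for l in losses])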
[Results: CIFAR-10 image recognition task]
[Figure: a “normal” LSTM cell vs. the cell discovered by architecture search]
[Results: Penn Treebank language modeling task]
More computational power needed
Deep learning is transforming how we design computers
Special computation properties
Reduced precision is OK:
about 1.2 × about 0.6 ≈ about 0.7
NOT 1.21042 × 0.61127 = 0.73989343
A handful of specific operations
[Diagram: matrix × matrix = matrix]
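A toy numpy sketch of the reduced-precision idea: simple 8-bit linear quantization (not the TPU's actual scheme), showing that the worst-case error is about half a quantization step:

import numpy as np

def quantize_8bit(x):
    # Map floats linearly onto 256 levels (uint8).
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0
    return np.round((x - lo) / scale).astype(np.uint8), lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

x = np.random.randn(1000).astype(np.float32)       # stand-in activations
q, lo, scale = quantize_8bit(x)
print(np.abs(dequantize(q, lo, scale) - x).max())  # roughly scale / 2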
Tensor Processing Unit
Custom Google-designed chip for
neural net computations
In production use for >20 months: used on every
search query, for neural machine translation,
for the AlphaGo match, ...
Example queries of the future
Which of these eye
images shows
symptoms of diabetic
retinopathy?
Please fetch me a cup
of tea from the kitchen
Describe this video
in Spanish
Find me documents related to
reinforcement learning for
robotics and summarize them
in German
Conclusions
Deep neural networks are making significant strides in
speech, vision, language, search, robotics, healthcare, …
If you’re not considering how to use deep neural nets to solve
some problems, you almost certainly should be
g.co/brain
More info about our work
Thanks!
g.co/brain
