Training Recurrent Neural
Networks at Scale
Erich Elsen
Research Scientist
Natural User Interfaces
• Goal: Make interacting with computers as
natural as interacting with humans
• AI problems:
– Speech recognition
– Emotion recognition
– Semantic understanding
– Dialog systems
– Speech synthesis
Deep Speech Applications
• Voice controlled apps
• Peel Partnership
• English and Mandarin APIs in the US
• Integration into Baidu’s products in China
Deep Speech: End-to-end learning
• Deep neural network predicts
probability of characters directly from
audio
[Figure: network output over time, e.g. T H _ E … D O G]
Connectionist Temporal Classification
Deep Speech: CTC
Time step   1     2     3     4     5     6
E          .01   .05   .1    .1    .8    .05
H          .01   .1    .1    .6    .05   .05
T          .01   .8    .75   .2    .05   .1
BLANK      .97   .05   .05   .1    .1    .8
• Simplified sequence of network outputs
(probabilities)
• Generally many more timesteps than letters
• Need to look at all the ways we can write “the”
• Adjacent characters collapse
• TTTHEE, TTTTHE, TTHHEE, THEEEE, …
• Solve with dynamic programming
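The collapse rule and the dynamic program from the bullets above can be sketched in Python. This is a minimal illustration, not the warp-ctc implementation: the function names, the `_` symbol for BLANK, and the brute-force cross-check are assumptions introduced here. The probabilities are the six time steps shown on the slide.

```python
import itertools
import numpy as np

SYMBOLS = ["E", "H", "T", "_"]  # "_" stands for BLANK (naming is an assumption)
PROBS = np.array([
    [.01, .05, .10, .10, .80, .05],  # E
    [.01, .10, .10, .60, .05, .05],  # H
    [.01, .80, .75, .20, .05, .10],  # T
    [.97, .05, .05, .10, .10, .80],  # BLANK
])  # shape: (num_symbols, num_time_steps)

def ctc_collapse(path, blank="_"):
    """Merge adjacent repeats, then drop blanks: 'TTTHEE' -> 'THE'."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

def ctc_forward(probs, symbols, label, blank="_"):
    """Probability of `label`, summed over all alignments, via the CTC forward pass."""
    index = {s: i for i, s in enumerate(symbols)}
    ext = [blank]                      # label with blanks interleaved: _ T _ H _ E _
    for ch in label:
        ext += [ch, blank]
    S, T = len(ext), probs.shape[1]
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[index[ext[0]], 0]
    alpha[0, 1] = probs[index[ext[1]], 0]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]                  # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1, s - 1]         # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]         # skip the blank between two different letters
            alpha[t, s] = total * probs[index[ext[s]], t]
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]

def brute_force(probs, symbols, label, blank="_"):
    """Enumerate every path (4^6 here) and sum those that collapse to `label`."""
    total = 0.0
    for path in itertools.product(range(len(symbols)), repeat=probs.shape[1]):
        text = "".join(symbols[i] for i in path)
        if ctc_collapse(text, blank) == label:
            total += float(np.prod([probs[i, t] for t, i in enumerate(path)]))
    return total

p_dp = ctc_forward(PROBS, SYMBOLS, "THE")
p_bf = brute_force(PROBS, SYMBOLS, "THE")
print(f"P('THE') via dynamic programming: {p_dp:.6f}")
print(f"P('THE') via enumeration:         {p_bf:.6f}")
```

Enumeration grows exponentially with the number of time steps, which is why the dynamic program matters once utterances run to hundreds of frames.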
warp-ctc
• Recently open sourced our CTC
implementation
• Efficient, parallel CPU and GPU backend
• 100-400X faster than other implementations
• Apache license, C interface
https://github.com/baidu-research/warp-ctc
Accuracy scales with Data
[Figure: performance vs. data and model size, comparing deep learning algorithms with many previous methods]
• 40% error reduction for each 10x increase in dataset size
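Read as a rule of thumb, the 40% figure means the error is multiplied by 0.6 for every tenfold increase in data. The sketch below works that arithmetic through; the function name, the starting error rate, and the dataset sizes are hypothetical values chosen only for illustration.

```python
import math

def projected_error(base_error, base_size, new_size, reduction_per_decade=0.40):
    """Scale an error rate assuming a fixed relative reduction per 10x more data."""
    decades = math.log10(new_size / base_size)
    return base_error * (1.0 - reduction_per_decade) ** decades

# Hypothetical starting point: 20% error with 1,000 hours of audio.
for hours in (1_000, 10_000, 100_000):
    print(hours, round(projected_error(0.20, 1_000, hours), 4))
```

Under this assumption two decades of extra data cut the error to 36% of its starting value (0.6 × 0.6).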
Training sets
• Train on ~1½ years of data (and growing)
• English and Mandarin
• End-to-end deep learning is key to
assembling large datasets
• Datasets drive accuracy
Large Datasets = Large Models
[Figure: accuracy vs. dataset size for a big model and a small model]
• Models require over 20 exaflops of total computation to train
(exa = 10^18 floating-point operations)
• Trained on 4+ Terabytes of audio
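To put the compute budget in wall-clock terms, it can be divided by the sustained rate of 50 teraflops per second quoted on the conclusion slide. The sketch assumes that rate holds for the whole run, which is an idealization; the variable and function names are introduced here.

```python
TOTAL_OPERATIONS = 20e18        # 20 exaflops of total work, from the slide
SUSTAINED_RATE = 50e12          # 50 teraflops per second, from the conclusion slide

def training_time_seconds(total_operations, operations_per_second):
    """Wall-clock time if the sustained rate were held for the entire run."""
    return total_operations / operations_per_second

seconds = training_time_seconds(TOTAL_OPERATIONS, SUSTAINED_RATE)
days = seconds / 86_400
print(f"{seconds:.0f} s, about {days:.1f} days")
```

That comes to 400,000 seconds, a bit under five days per model, which is why experiment turnaround is a recurring theme in the talk.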
Virtuous Cycle of Innovation
[Figure: iteration cycle linking Design New Experiment, Perform Experiment, and Learn]
Experiment Scaling
• Batch Norm impact with deeper networks
• Sequence-wise normalization: [equation not recovered]
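The slide's formula did not survive extraction. One plausible reading, assumed here and not confirmed by the slide text, is batch normalization whose mean and variance are computed per feature over both the minibatch and the time axis. The NumPy sketch below illustrates that reading only; the function name, tensor layout, and `eps` value are assumptions, and padding of variable-length sequences is ignored.

```python
import numpy as np

def sequence_wise_batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize x of shape (batch, time, features) with statistics
    pooled over the batch and time axes, one mean/variance per feature."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 7, 5))   # 4 utterances, 7 steps, 5 features
y = sequence_wise_batch_norm(x, gamma=np.ones(5), beta=np.zeros(5))
print(y.mean(axis=(0, 1)).round(6), y.var(axis=(0, 1)).round(3))
```

Pooling over time gives each feature far more samples per statistic than normalizing each time step separately, which is one reason this variant is attractive for recurrent layers.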
Parallelism across GPUs
[Figure: Model Parallel vs. Data Parallel; the data-parallel diagram shows separate Training Data per GPU combined with MPI_Allreduce()]
For these models, Data Parallelism works best
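The data-parallel scheme can be illustrated without MPI: each worker computes a gradient on its own shard, and the gradients are summed and averaged, which is the role `MPI_Allreduce()` plays on the slide. The sketch uses a linear model as a stand-in for the RNN, and a plain Python list as a stand-in for the GPUs; all names are hypothetical.

```python
import numpy as np

def local_gradient(w, x, y):
    """Mean-squared-error gradient for a linear model on one worker's shard."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(y)

def data_parallel_gradient(w, x, y, num_workers):
    """Split the minibatch evenly, compute per-worker gradients, then average."""
    x_shards = np.split(x, num_workers)
    y_shards = np.split(y, num_workers)
    grads = [local_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    # Stand-in for an all-reduce (sum) followed by division by the worker count.
    return np.sum(grads, axis=0) / num_workers

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
y = rng.normal(size=64)
w = rng.normal(size=3)

full = local_gradient(w, x, y)
parallel = data_parallel_gradient(w, x, y, num_workers=4)
print(np.allclose(full, parallel))
```

With equal shard sizes the averaged gradient matches the single-worker gradient on the full minibatch, so the only extra cost is communicating one gradient-sized buffer per step.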
Performance for RNN training
• 55% of GPU FMA peak using a single GPU
• ~48% of peak using 8 GPUs in one node
• Weak scaling very efficient, albeit algorithmically
challenged
[Figure: TFLOP/s (1 to 512) vs. number of GPUs (1 to 128), with one-node and multi-node regions and a typical training run marked]
All-reduce
• We implemented our own all-reduce out of
send and receive
• Several algorithm choices based on size
• Careful attention to affinity and topology
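The slide does not say which algorithms were used. As one well-known example of building an all-reduce from point-to-point transfers, the sketch below simulates a ring all-reduce in NumPy: a reduce-scatter phase followed by an all-gather phase. Choosing the ring algorithm is an assumption made here, and the "send" and "receive" steps are simulated with in-memory copies and are not real GPU or network transfers.

```python
import numpy as np

def ring_allreduce(buffers):
    """Sum equally shaped 1-D buffers across n simulated workers arranged in a ring."""
    n = len(buffers)
    chunks = [np.array_split(b.astype(np.float64), n) for b in buffers]

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot outgoing data first so all sends in a step happen "simultaneously".
        outgoing = [((r - step) % n, chunks[r][(r - step) % n].copy()) for r in range(n)]
        for r, (idx, payload) in enumerate(outgoing):
            dst = (r + 1) % n                       # send to the right-hand neighbour
            chunks[dst][idx] = chunks[dst][idx] + payload

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        outgoing = [((r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy()) for r in range(n)]
        for r, (idx, payload) in enumerate(outgoing):
            dst = (r + 1) % n
            chunks[dst][idx] = payload

    return [np.concatenate(c) for c in chunks]

workers = [np.arange(10) * (rank + 1) for rank in range(4)]
results = ring_allreduce(workers)
expected = np.sum(workers, axis=0)
print(all(np.array_equal(r, expected) for r in results))
```

Each worker sends about 2(n-1)/n times the buffer size in total regardless of n, which suits large, bandwidth-bound gradients; for small buffers a latency-oriented algorithm may be preferable, consistent with the slide's point about choosing by size.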
Scalability
• Batch size is hard to increase
– algorithm, memory limits
• Performance at small batch sizes (32, 64)
leads to scalability limits
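One way to read this slide: with data parallelism and a global minibatch that cannot grow, adding GPUs shrinks the minibatch each GPU processes until it reaches the small sizes where per-GPU efficiency drops. The sketch below shows the arithmetic; the global batch size of 1024 is a hypothetical value, not a figure from the talk.

```python
def per_gpu_batch(global_batch, num_gpus):
    """Minibatch size seen by each GPU when a fixed global batch is split evenly."""
    if global_batch % num_gpus != 0:
        raise ValueError("global batch must divide evenly across GPUs")
    return global_batch // num_gpus

GLOBAL_BATCH = 1024  # hypothetical, for illustration only
for gpus in (8, 16, 32, 64):
    print(f"{gpus:3d} GPUs -> {per_gpu_batch(GLOBAL_BATCH, gpus)} examples per GPU")
```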
Precision
• FP16 also mostly works
– Use FP32 for softmax and weight updates
• More sensitive to labeling error
[Figure: Weight Distribution; count (1 to 100000000) vs. magnitude (-31 to 0)]
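The advice to keep weight updates in FP32 can be demonstrated with NumPy scalars. In FP16, representable values just below 1.0 are spaced about 4.9e-4 apart, so an update of 1e-4 is rounded away entirely; an FP32 master copy accumulates it. The function names, learning rate, and constant gradient are hypothetical values chosen to make the effect visible.

```python
import numpy as np

def sgd_fp16_only(w0, grad, lr, steps):
    """Keep the weight and do the update entirely in FP16."""
    w = np.float16(w0)
    for _ in range(steps):
        w = np.float16(w - np.float16(lr) * np.float16(grad))
    return w

def sgd_fp32_master(w0, grad, lr, steps):
    """Compute the gradient in FP16 but apply it to an FP32 master weight."""
    master = np.float32(w0)
    for _ in range(steps):
        g16 = np.float16(grad)                                   # FP16 gradient
        master = np.float32(master - np.float32(lr) * np.float32(g16))
    return master, np.float16(master)                            # FP16 copy for the forward pass

w_half = sgd_fp16_only(1.0, grad=1.0, lr=1e-4, steps=1000)
w_master, w_copy = sgd_fp32_master(1.0, grad=1.0, lr=1e-4, steps=1000)
print("FP16 only:  ", float(w_half))
print("FP32 master:", float(w_master))
```

The FP16-only weight never moves from 1.0, while the FP32 master weight drifts to roughly 0.9 as intended.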
Conclusion
• We have to do experiments at scale
• Pushing compute scaling for end-to-end
deep learning
• Efficient training for large datasets
– 50 Teraflops/second sustained on one model
– 20 Exaflops to train each model
• Thanks to Bryan Catanzaro, Carl Case, Adam Coates for donating some slides
Erich Elsen, Research Scientist, Baidu Research at MLconf NYC - 4/15/16

Editor's Notes

  1. Model parallel: latency sensitive. Data parallel: bandwidth sensitive.