Erich Elsen, Research Scientist, Baidu Research at MLconf NYC - 4/15/16

Training Recurrent Neural Networks at Scale: One of our projects at Baidu’s Silicon Valley AI Lab is using deep learning to develop state-of-the-art end-to-end speech recognition systems based on recurrent neural networks for multiple languages. The training set for each language is multiple terabytes in size, and each model requires in excess of 10 Exaflops to train. Training such models requires scale and techniques that are unusual for deep learning but more common in high performance computing. I will talk about the challenges involved and the software and hardware solutions that we employ.

  1. Training Recurrent Neural Networks at Scale (Erich Elsen, Research Scientist)
  2. Natural User Interfaces
     • Goal: Make interacting with computers as natural as interacting with humans
     • AI problems:
       – Speech recognition
       – Emotion recognition
       – Semantic understanding
       – Dialog systems
       – Speech synthesis
  3. Deep Speech Applications
     • Voice-controlled apps
     • Peel partnership
     • English and Mandarin APIs in the US
     • Integration into Baidu’s products in China
  4. Deep Speech: End-to-end learning
     • A deep neural network predicts the probability of characters directly from audio
     (diagram: the network emitting the character sequence “T H _ E … D O G” from the audio input)
  5. Connectionist Temporal Classification
  6. Deep Speech: CTC
     • Simplified sequence of network outputs (probabilities), one column per timestep:

       Time:     t1    t2    t3    t4    t5    t6
       E        .01   .05   .10   .10   .80   .05
       H        .01   .10   .10   .60   .05   .05
       T        .01   .80   .75   .20   .05   .10
       BLANK    .97   .05   .05   .10   .10   .80

     • Generally many more timesteps than letters
     • Need to look at all the ways we can write “the”
     • Adjacent characters collapse: TTTHEE, TTTTHE, TTHHEE, THEEEE, …
     • Solve with dynamic programming (see the sketch below)
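     A minimal sketch of the dynamic program the slide alludes to, written here in plain
     Python/numpy rather than taken from Baidu's code: the standard CTC forward recursion
     sums the probability of every alignment that collapses to "the", with optional blanks
     between characters. The alphabet indexing and the name ctc_forward are illustrative
     assumptions, not part of the talk.

        import numpy as np

        def ctc_forward(probs, labels, blank=0):
            """probs: (T, K) per-timestep distributions; labels: target character indices."""
            T = probs.shape[0]
            # Interleave blanks: "the" -> [blank, t, blank, h, blank, e, blank]
            ext = [blank]
            for l in labels:
                ext += [l, blank]
            S = len(ext)
            alpha = np.zeros((T, S))
            alpha[0, 0] = probs[0, ext[0]]
            alpha[0, 1] = probs[0, ext[1]]
            for t in range(1, T):
                for s in range(S):
                    a = alpha[t - 1, s]
                    if s > 0:
                        a += alpha[t - 1, s - 1]
                    # Skipping the previous blank is allowed when the symbol is not blank
                    # and differs from the symbol two positions back.
                    if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                        a += alpha[t - 1, s - 2]
                    alpha[t, s] = a * probs[t, ext[s]]
            # Valid paths end on the last label or the trailing blank.
            return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]

        # Toy alphabet 0=BLANK, 1=T, 2=H, 3=E, using the probabilities from the slide.
        probs = np.array([
            [0.97, 0.01, 0.01, 0.01],   # t1
            [0.05, 0.80, 0.10, 0.05],   # t2
            [0.05, 0.75, 0.10, 0.10],   # t3
            [0.10, 0.20, 0.60, 0.10],   # t4
            [0.10, 0.05, 0.05, 0.80],   # t5
            [0.80, 0.10, 0.05, 0.05],   # t6
        ])
        print(ctc_forward(probs, labels=[1, 2, 3]))   # P("the" | network outputs)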
  7. warp-ctc
     • Recently open-sourced our CTC implementation
     • Efficient, parallel CPU and GPU backend
     • 100-400X faster than other implementations
     • Apache license, C interface
     • https://github.com/baidu-research/warp-ctc
  8. Accuracy scales with Data
     • 40% error reduction for each 10x increase in dataset size (see the note below)
     (chart: performance vs. data & model size, comparing deep learning algorithms with many previous methods)
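     One way to read that bullet (my extrapolation, not a formula from the talk) is as a
     power law in dataset size: every factor of 10 multiplies the error rate by 0.6.

        # Hypothetical projection based on "40% relative error reduction per 10x data".
        import math

        def projected_error(err_at_n0, n0, n):
            # err(n) = err(n0) * 0.6 ** log10(n / n0)
            return err_at_n0 * 0.6 ** math.log10(n / n0)

        # e.g. 100x more data: 0.6**2 = 0.36, so a 20% error rate would drop to ~7.2%
        print(projected_error(0.20, 1_000, 100_000))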
  9. Training sets
     • Train on ~1½ years of data (and growing)
     • English and Mandarin
     • End-to-end deep learning is key to assembling large datasets
     • Datasets drive accuracy
  10. Large Datasets = Large Models
     • Models require over 20 Exaflops to train (exa = 10^18)
     • Trained on 4+ terabytes of audio
     (chart: accuracy vs. dataset size for a big model and a small model)
  11. Virtuous Cycle of Innovation
     (cycle diagram: Design New Experiment → Perform Experiment → Learn → Iterate)
  12. Experiment Scaling
     • Batch Norm impact with deeper networks
     • Sequence-wise normalization (see the sketch below)
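     A minimal numpy sketch of what sequence-wise normalization refers to here, under the
     assumption that it matches the description in the Deep Speech 2 work rather than the
     production code: batch-norm statistics are accumulated over both the minibatch and the
     time dimension, so each hidden unit is normalized once per batch of utterances rather
     than separately at every timestep. The layer sizes below are illustrative.

        import numpy as np

        def sequence_batch_norm(x, gamma, beta, eps=1e-5):
            """x: (time, batch, features) pre-activations of one recurrent layer."""
            mean = x.mean(axis=(0, 1), keepdims=True)   # statistics over time and batch
            var = x.var(axis=(0, 1), keepdims=True)
            return gamma * (x - mean) / np.sqrt(var + eps) + beta

        x = np.random.randn(300, 32, 1760).astype(np.float32)   # T x B x H, illustrative sizes
        gamma = np.ones((1, 1, 1760), dtype=np.float32)
        beta = np.zeros((1, 1, 1760), dtype=np.float32)
        y = sequence_batch_norm(x, gamma, beta)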
  13. Parallelism across GPUs
     • Model parallelism vs. data parallelism (gradients combined with MPI_Allreduce())
     • For these models, data parallelism works best (see the sketch below)
     (diagram: the two layouts, each consuming the training data)
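     A minimal sketch of the data-parallel pattern on the slide, using mpi4py for the
     MPI_Allreduce() call. This is an illustrative stand-in rather than Baidu's training
     loop, and the local gradient here is a random placeholder for real backpropagation.

        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD

        def data_parallel_step(weights, local_grad, lr=1e-3):
            """local_grad: gradient from this rank's shard of the minibatch (float32)."""
            global_grad = np.empty_like(local_grad)
            comm.Allreduce(local_grad, global_grad, op=MPI.SUM)   # sum across GPUs/ranks
            global_grad /= comm.Get_size()                        # average
            weights -= lr * global_grad                           # identical update everywhere
            return weights

        weights = np.zeros(1_000_000, dtype=np.float32)
        local_grad = np.random.randn(1_000_000).astype(np.float32)  # placeholder gradient
        weights = data_parallel_step(weights, local_grad)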
  14. Performance for RNN training
     • 55% of GPU FMA peak using a single GPU
     • ~48% of peak using 8 GPUs in one node
     • Weak scaling very efficient, albeit algorithmically challenged
     (chart: TFLOP/s vs. number of GPUs, 1-128, showing one-node and multi-node runs and a typical training run)
  15. All-reduce
     • We implemented our own all-reduce out of send and receive (see the sketch below)
     • Several algorithm choices based on size
     • Careful attention to affinity and topology
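     A sketch of how an all-reduce can be assembled from point-to-point sends and receives,
     in the spirit of the first bullet: a textbook ring reduce-scatter followed by a ring
     allgather, written with mpi4py. The real implementation also chose among algorithms by
     message size and accounted for affinity and topology, which this sketch ignores.

        import numpy as np
        from mpi4py import MPI

        def ring_allreduce(comm, x):
            p, r = comm.Get_size(), comm.Get_rank()
            chunks = np.array_split(x.copy(), p)
            right, left = (r + 1) % p, (r - 1) % p
            # Reduce-scatter: after p - 1 steps, chunk (r + 1) % p holds the full sum.
            for step in range(p - 1):
                send_idx = (r - step) % p
                recv_idx = (r - step - 1) % p
                recv_buf = np.empty_like(chunks[recv_idx])
                comm.Sendrecv(chunks[send_idx], dest=right, recvbuf=recv_buf, source=left)
                chunks[recv_idx] += recv_buf
            # Allgather: circulate the fully reduced chunks around the ring.
            for step in range(p - 1):
                send_idx = (r + 1 - step) % p
                recv_idx = (r - step) % p
                recv_buf = np.empty_like(chunks[recv_idx])
                comm.Sendrecv(chunks[send_idx], dest=right, recvbuf=recv_buf, source=left)
                chunks[recv_idx] = recv_buf
            return np.concatenate(chunks)

        comm = MPI.COMM_WORLD
        summed = ring_allreduce(comm, np.ones(1 << 20, dtype=np.float32))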
  16. Scalability
     • Batch size is hard to increase (algorithm and memory limits)
     • Performance at small batch sizes (32, 64) leads to scalability limits
  17. Precision
     • FP16 also mostly works (see the sketch below)
       – Use FP32 for softmax and weight updates
     • More sensitive to labeling error
     (histogram: weight distribution, count on a log scale vs. magnitude exponents from -31 to 0)
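     A small numpy illustration (mine, not the production code) of the recipe above: run the
     matrix multiplies in FP16, but compute the softmax and apply the weight update in FP32
     against a master copy of the weights. The shapes, learning rate, and dummy targets are
     assumptions for the example.

        import numpy as np

        def softmax_fp32(logits_fp16):
            z = logits_fp16.astype(np.float32)
            z -= z.max(axis=-1, keepdims=True)          # stabilize before exp
            e = np.exp(z)
            return e / e.sum(axis=-1, keepdims=True)

        master_w = np.random.randn(1024, 29).astype(np.float32)   # FP32 master weights
        w_fp16 = master_w.astype(np.float16)                      # FP16 copy used in fwd/bwd

        acts = np.random.randn(8, 1024).astype(np.float16)        # FP16 activations
        logits = acts @ w_fp16                                     # FP16 matmul
        probs = softmax_fp32(logits)                               # softmax done in FP32

        targets = np.zeros_like(probs)
        targets[np.arange(8), 0] = 1.0                             # dummy one-hot labels
        grad = acts.astype(np.float32).T @ (probs - targets) / 8   # FP32 gradient
        master_w -= 1e-3 * grad                                    # FP32 weight update
        w_fp16 = master_w.astype(np.float16)                       # round back to FP16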
  18. Conclusion
     • We have to do experiments at scale
     • Pushing compute scaling for end-to-end deep learning
     • Efficient training for large datasets (see the arithmetic below)
       – 50 Teraflops/second sustained on one model
       – 20 Exaflops to train each model
     • Thanks to Bryan Catanzaro, Carl Case, and Adam Coates for donating some slides
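     For a rough sense of what those two numbers imply together (my arithmetic, not a figure
     from the talk):

        total_flops = 20e18            # ~20 exaflops per model
        sustained = 50e12              # ~50 teraflops/second sustained
        seconds = total_flops / sustained
        print(seconds, seconds / 86_400)   # 400,000 s, i.e. about 4.6 days of pure compute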