Unprecedented progress
DEEP LEARNING CHALLENGES
François Courteille, Principal Solutions Architect @ NVIDIA – fcourteille@nvidia.com
AI Meetup / PowerAI – 17 May 2018
ACKNOWLEDGEMENTS
Adam Grzywaczewski – adamg@nvidia.com
AI TO TRANSFORM EVERY INDUSTRY
HEALTHCARE: >80% Accuracy & Immediate Alert to Radiologists
INFRASTRUCTURE: 50% Reduction in Emergency Road Repair Costs
IOT: >$6M / Year Savings and Reduced Risk of Outage
AI FOR INDUSTRIAL APPLICATIONS
FACTORY INSPECTION | FIELD INSPECTION | PREDICTIVE MAINTENANCE
• Quality Inspection
• Fault Detection & Classification
• Inventory Inspection
• Condition Based Maintenance
• Remaining Useful Life
• Sensor Time Series Analysis
• Failure Prediction
DEEP LEARNING COMES TO HPC
Accelerates Scientific Discovery
• UIUC & NCSA, ASTROPHYSICS: 5,000X LIGO signal processing
• U. FLORIDA & UNC, DRUG DISCOVERY: 300,000X molecular energetics prediction
• U.S. DoE, PARTICLE PHYSICS: 33% more accurate neutrino detection
• PRINCETON & ITER, CLEAN ENERGY: 50% higher accuracy for fusion sustainment
• LBNL, HIGH ENERGY PARTICLE PHYSICS: 100,000X faster simulated output with a GAN
• Univ. of Illinois, DRUG DISCOVERY: 2X accuracy for protein structure prediction
CHALLENGES
CHALLENGES
Key challenges when developing Deep Learning projects
• Data & Models
• Infrastructure
• Software
• Algorithms
• People
PEOPLE
DEEP LEARNING INSTITUTE
DLI mission: help the world solve the most challenging problems using AI and deep learning.
We help developers, data scientists, and engineers get started architecting, optimizing, and deploying neural networks to solve real-world problems in industries as diverse as autonomous vehicles, healthcare, robotics, media & entertainment, and game development.
MULTI-GPU DEEP LEARNING TRAINING
October 9-11, 2018 | Munich | #GTC18
www.gputechconf.eu
CONNECT
Connect with technology
experts from NVIDIA and
other leading organizations
LEARN
Gain insight and valuable
hands-on training through
hundreds of sessions and
research posters
DISCOVER
See how GPU technologies
are creating amazing
breakthroughs in important
fields such as deep learning
INNOVATE
Hear about disruptive
innovations as early-stage
companies and startups
present their work
Don’t miss the world’s most important event for GPU developers
October 9-11, 2018 in Munich
REGISTRATION IS OPEN AT WWW.GPUTECHCONF.EU
NEURAL NETWORKS ARE NOT NEW
Historically we never had large datasets
[Figure: algorithm performance in the small-data regime – accuracy vs. dataset size for a small neural network and other ML methods (ML1, ML2, ML3).]
The MNIST database (1999) contains 60,000 training images and 10,000 test images.
NEURAL NETWORKS ARE NOT NEW
Availability of data and compute
[Figure: algorithm performance in the big-data regime – accuracy vs. dataset size for a small neural network and other ML methods (ML1, ML2, ML3).]
NEURAL NETWORKS ARE NOT NEW
Data and model size are the key to accuracy
[Figure: algorithm performance in the big-data regime – accuracy vs. dataset size for Small NN, ML1, ML2, ML3, and a Big NN.]
NEURAL NETWORKS ARE NOT NEW
Surpassing human-level performance
[Figure: algorithm performance in the large-data regime – accuracy vs. dataset size for Small NN, ML1, ML2, ML3, Big NN, and a Bigger NN.]
DATA
EXPLODING DATASETS
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409.
Accuracy improves roughly logarithmically with dataset size (equivalently, error follows a power law in the amount of data – see the note after this list) across tasks such as:
• Translation
• Language Models
• Character Language Models
• Image Classification
• Attention Speech Models
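A compact way to state this, following the cited Hestness et al. paper (the exact constants vary by task and are not given in this deck):

$$\varepsilon(m) \approx \alpha\, m^{\beta_g}, \qquad \beta_g < 0,$$

where $\varepsilon$ is the generalization error, $m$ the number of training samples, and $\alpha, \beta_g$ task-dependent constants. On a log-log plot this is a straight line, which is what makes the scaling "predictable."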
EXPLODING DATASETS
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409.
Logarithmic relationship between dataset size and accuracy
EXPLODING DATASETS
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409.
The good news – you can calculate how much data you need
Step 1: Establish how much accuracy you need
EXPLODING DATASETS
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409.
The good news – you can calculate how much data you need
Step 2: Conduct a number of experiments with datasets of varying size (in the power-law region)
EXPLODING DATASETS
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409.
The good news – you can calculate how much data you need
Step 3: Fit the resulting curve and extrapolate to your target accuracy (see the sketch below)
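A minimal sketch of Steps 1-3. This is illustrative only: the function and variable names are my own, the measurements are made up, and the power-law form follows the cited Hestness et al. paper rather than any code from this deck.

```python
import numpy as np

# Measured validation error at a few training-set sizes
# (these numbers are made up for illustration).
sizes = np.array([1e4, 3e4, 1e5, 3e5])       # training examples
errors = np.array([0.30, 0.23, 0.17, 0.13])  # validation error

# Step 2: fit the power law  error ~ alpha * size**beta
# (a straight line in log-log space).
beta, log_alpha = np.polyfit(np.log(sizes), np.log(errors), 1)
alpha = np.exp(log_alpha)

# Steps 1 & 3: pick a target error and extrapolate the fitted
# curve to estimate how much data is needed to reach it.
target_error = 0.05
required_size = (target_error / alpha) ** (1.0 / beta)

print(f"fit: error ~ {alpha:.3f} * m^{beta:.3f}")
print(f"estimated data needed for {target_error:.0%} error: {required_size:,.0f} examples")
```

With the made-up numbers above, the fitted exponent is roughly -0.25 and the extrapolation lands in the tens of millions of examples, which is exactly the point of the next slide: the curve is on a log scale.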
EXPLODING DATASETS
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409.
The bad news – the dataset axis is logarithmic: each additional increment of accuracy can require an order of magnitude more data
MODEL
NEURAL NETWORK COMPLEXITY IS EXPLODING
To Tackle Increasingly Complex Challenges
• 2015 – Microsoft ResNet, superhuman image recognition: 7 ExaFLOPS, 60 million parameters
• 2016 – Baidu Deep Speech 2, superhuman voice recognition: 20 ExaFLOPS, 300 million parameters
• 2017 – Google Neural Machine Translation, near-human language translation: 100 ExaFLOPS, 8,700 million parameters
100 EXAFLOPS = 2 YEARS ON A DUAL-CPU SERVER
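A back-of-the-envelope check on that equivalence (my arithmetic, not from the slide): two years is about $6.3 \times 10^{7}$ seconds, so

$$\frac{100\ \text{ExaFLOP}}{2\ \text{years}} \approx \frac{10^{20}\ \text{FLOP}}{6.3 \times 10^{7}\ \text{s}} \approx 1.6 \times 10^{12}\ \text{FLOP/s} \approx 1.6\ \text{TFLOP/s},$$

i.e. the claim assumes a sustained throughput on the order of a couple of TFLOP/s for a dual-CPU server, which is a plausible real-world figure.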
EXPLODING MODEL COMPLEXITY
"Outrageously large neural networks" – size does matter
Shazeer, Noam, et al. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." arXiv preprint arXiv:1701.06538 (2017).
Simonyan, Karen, and Andrew Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition." arXiv preprint arXiv:1409.1556 (2014).
Nuts and Bolts of Applying Deep Learning, Andrew Ng, 2016 – https://youtu.be/F1ka6a13S9I
VGG-19 vs. Google's LSTM with a Sparsely-Gated Mixture-of-Experts layer
EXPLODING MODEL COMPLEXITY
Good news – model size scales sublinearly
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409.
[Figure: best model size vs. amount of data – model size grows sublinearly with the training set.]
EVIDENCE FROM IMAGE PROCESSING
Good news – model size scales sublinearly
Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." arXiv preprint arXiv:1707.07012 (2017).
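In the same power-law terms, "scales sublinearly" means the best-fit model size $s$ grows as

$$s(m) \approx c\, m^{\beta_p}, \qquad 0 < \beta_p < 1,$$

so doubling the dataset requires less than double the parameters. The symbols here are illustrative; the cited Hestness et al. paper measures such exponents empirically per task.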
IMPLICATIONS
Good and bad news
The good news: requirements are predictable.
• We can predict how much data we will need
• We can predict how much computing power we will need
The bad news: the required amounts of data and compute can be very large.
IMPLICATIONS
The experimental nature of deep learning makes long training times unacceptable
Idea → Code → Experiment (and repeat)
CONCLUSIONS
What does your research team do in the meantime?
(image: xkcd.com)
GPU COMPUTING
From Gaming to High Performance Computing
TESLA V100 32GB
WORLD’S MOST ADVANCED DATA CENTER GPU
NOW WITH 2X THE MEMORY
5,120 CUDA cores
640 NEW Tensor cores
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
20MB SM RF | 16MB Cache
32GB HBM2 @ 900GB/s | 300GB/s NVLink
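For context, the headline throughput figures follow from the core counts and the boost clock (roughly 1.53 GHz; my arithmetic, not from the slide). Each of the 640 Tensor Cores performs a 4x4x4 matrix multiply-accumulate per clock, i.e. 64 FMAs = 128 FLOP, so

$$640 \times 128\ \tfrac{\text{FLOP}}{\text{clock}} \times 1.53\ \text{GHz} \approx 125\ \text{Tensor TFLOPS},$$

and likewise $5120 \times 2 \times 1.53\ \text{GHz} \approx 15.7$ FP32 TFLOPS, with FP64 at half that rate (7.8 TFLOPS).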
FASTER RESULTS ON COMPLEX DL AND HPC
Up to 50% Faster Results With 2x The Memory
FASTER RESULTS
• Neural Machine Translation (NMT): 1.5X faster language translation – 1.2 step/sec (V100 32GB) vs. 0.8 step/sec (V100 16GB)
• 3D FFT (1k x 1k x 1k): 1.5X faster calculations – 3.8 TF (V100 32GB) vs. 2.5 TF (V100 16GB)
HIGHER RESOLUTION
• Unsupervised image translation GAN (input winter photo converted to summer): 4X higher resolution – 1024x1024 images (V100 32GB) vs. 512x512 (V100 16GB)
HIGHER ACCURACY
• R-CNN object detection: 40% lower error rate with ResNet-152 (152 layers, V100 32GB) vs. VGG-16 (16 layers, V100 16GB)
Benchmark notes: Dual E5-2698 v4 server, 512GB DDR4, Ubuntu 16.04, CUDA 9, cuDNN 7. NMT is GNMT-like, run with the TensorFlow NGC container 18.01 (batch size 128 for 16GB, 256 for 32GB). FFT is cufftbench 1k x 1k x 1k comparing 2x V100 16GB (DGX-1V) vs. 2x V100 32GB (DGX-1V). GAN by NVIDIA Research (https://arxiv.org/pdf/1703.00848.pdf), V100 16GB and V100 32GB with FP32. R-CNN for object detection at 1080p with Caffe; V100 16GB uses VGG-16, V100 32GB uses ResNet-152.
IMPLICATIONS
Automotive example
Most useful problems are too complex to train on a single GPU
                                          VERY CONSERVATIVE        CONSERVATIVE
Fleet size (data capture per hour)        100 cars / 1 TB/hour     125 cars / 1.5 TB/hour
Duration of data collection               260 days x 8 hours       325 days x 10 hours
Data compression factor                   0.0005                   0.0008
Total training set                        104 TB                   487.5 TB
InceptionV3 training time (1 Pascal GPU)  9.1 years                42.6 years
AlexNet training time (1 Pascal GPU)      1.1 years                5.4 years
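The "total training set" row follows directly from the rows above it, assuming the capture rate is per car. A quick check (function and variable names are mine, for illustration):

```python
# Raw data collected = cars * TB per car per hour * days * hours/day,
# then reduced by the compression factor applied before training.
def training_set_tb(cars, tb_per_car_hour, days, hours_per_day, compression):
    raw_tb = cars * tb_per_car_hour * days * hours_per_day
    return raw_tb * compression

print(training_set_tb(100, 1.0, 260, 8, 0.0005))   # very conservative -> 104.0 TB
print(training_set_tb(125, 1.5, 325, 10, 0.0008))  # conservative      -> 487.5 TB
```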
BALANCED SYSTEM DESIGNED FOR SCALE
PREFERRED NETWORKS
Training ImageNet in 15 minutes
The cluster consists of 128 nodes with 8 NVIDIA P100 GPUs each, for 1,024 GPUs in total. The nodes are connected with two FDR InfiniBand links (56 Gbps x 2).
Akiba, T., Suzuki, S., & Fukuda, K. (2017). Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes. arXiv preprint arXiv:1711.04325.
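Large-minibatch training of this kind typically scales the learning rate with the global batch size (the linear scaling rule popularized by Goyal et al. for large-batch ImageNet training) and ramps it up over a warm-up period. Below is a minimal sketch of that idea, not the exact recipe used in the Akiba et al. paper; all names and the example numbers are illustrative.

```python
def scaled_learning_rate(base_lr, base_batch, global_batch, step, warmup_steps):
    """Linear learning-rate scaling with warm-up for large-batch SGD.

    base_lr is the rate tuned for base_batch (e.g. 0.1 at batch 256 for
    ResNet-50); global_batch is the per-GPU batch times the number of GPUs.
    """
    target_lr = base_lr * global_batch / base_batch   # linear scaling rule
    if step < warmup_steps:                           # ramp up gradually to
        return target_lr * (step + 1) / warmup_steps  # avoid early divergence
    return target_lr

# Example: 1,024 GPUs with a per-GPU batch of 32 -> global batch 32,768.
print(scaled_learning_rate(0.1, 256, 1024 * 32, step=0, warmup_steps=500))
print(scaled_learning_rate(0.1, 256, 1024 * 32, step=1000, warmup_steps=500))
```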
DGX FAMILY
DGX-1V – 1 PetaFLOP FP16
IBM POWER9 ARCHITECTURE
Fast CPU-GPU connection with NVLink