GTC 2017
AI FOR INDUSTRY
A NEW ERA OF COMPUTING
PC INTERNET (1995): WinTel, Yahoo! | 1 billion PC users
MOBILE-CLOUD (2005): iPhone, Amazon AWS | 2.5 billion mobile users
AI & IOT (2015): Deep Learning, GPU | 100s of billions of devices
HOW A DEEP NEURAL NETWORK SEES
Image “Audi A7”
Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks” ICML 2009 & Comm. ACM 2011.
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng.
GPU DEEP LEARNING IS A NEW COMPUTING MODEL
TRAINING
[Diagram: training in the datacenter, deployment to devices]
Billions of Trillions of Operations
GPUs train larger models and accelerate time to market
GPU DEEP LEARNING IS A NEW COMPUTING MODEL
DATACENTER INFERENCING
[Diagram: training in the datacenter, deployment to devices]
10s of Billions of Image, Voice, and Video Queries per Day
GPU inference delivers fast response and maximizes datacenter throughput
NEURAL NETWORK COMPLEXITY IS EXPLODING
To Tackle Increasingly Complex Challenges
2015: Microsoft ResNet, Superhuman Image Recognition | 7 ExaFLOPS | 60 Million Parameters
2016: Baidu Deep Speech 2, Superhuman Voice Recognition | 20 ExaFLOPS | 300 Million Parameters
2017: Google Neural Machine Translation, Near-Human Language Translation | 100 ExaFLOPS | 8.7 Billion Parameters
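The ExaFLOPS figures above are total training compute. As a rough back-of-envelope check (my own arithmetic, not from the deck), here is how long each budget would take on a single GPU running at V100's quoted 120 Tensor TFLOPS with perfect utilization, which real training never reaches, so treat these as lower bounds:

```python
# Convert each network's total training budget (ops) into days on one
# ideal GPU. The per-GPU rate assumes V100's 120 Tensor TFLOPS peak.

TENSOR_FLOPS = 120e12  # V100 peak mixed-precision ops/sec (from the deck)

workloads = {
    "ResNet (2015)": 7e18,          # 7 ExaFLOPS total
    "Deep Speech 2 (2016)": 20e18,  # 20 ExaFLOPS total
    "GNMT (2017)": 100e18,          # 100 ExaFLOPS total
}

for name, total_ops in workloads.items():
    days = total_ops / TENSOR_FLOPS / 86400  # seconds per day
    print(f"{name}: {days:.1f} days on one ideal V100")
```

Even at peak throughput, the 2017-era budget is over a week of single-GPU time, which is why the deck's multi-node scaling results matter.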
GEFORCE NOW
You listen to music on Spotify. You watch movies on Netflix. GeForce Now lets you play games the same way: instantly stream the latest titles from our powerful cloud-gaming supercomputers. Think of it as your game console in the sky. Gaming is now easy and instant.
ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers
1.5X HPC Performance in 1 Year
[Chart: V100 performance normalized to P100: cuFFT 1.8x, Physics (QUDA) 1.5x, Seismic (RTM) 1.6x, STREAM 1.5x]
Summit Supercomputer: 200+ PetaFlops | ~3,400 Nodes | 10 Megawatts
System config: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.
NVIDIA SATURN V
124 DGX-1 Deep Learning Supercomputers
TESLA V100
THE MOST ADVANCED DATA CENTER GPU EVER BUILT
5,120 CUDA cores
640 NEW Tensor cores
7.5 FP64 TFLOPS | 15 FP32 TFLOPS
120 Tensor TFLOPS
20MB SM RF | 16MB Cache | 16GB HBM2 @ 900 GB/s
300 GB/s NVLink
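The spec sheet's compute-to-bandwidth ratio sets a roofline balance point: how many operations a kernel must do per byte fetched from HBM2 before it is compute-bound rather than memory-bound. A quick sketch of that arithmetic (my own derivation from the slide's numbers, not an NVIDIA figure):

```python
# Simple roofline balance points for V100, from the slide's peak numbers.
# A kernel below these FLOPs/byte ratios is limited by HBM2 bandwidth,
# not by the arithmetic units.

fp32_flops = 15e12       # 15 FP32 TFLOPS
tensor_flops = 120e12    # 120 Tensor TFLOPS
hbm2_bw = 900e9          # 900 GB/s HBM2 bandwidth

fp32_intensity = fp32_flops / hbm2_bw      # FLOPs per byte for FP32
tensor_intensity = tensor_flops / hbm2_bw  # FLOPs per byte for Tensor ops

print(f"FP32 balance point:   {fp32_intensity:.1f} FLOPs/byte")
print(f"Tensor balance point: {tensor_intensity:.1f} FLOPs/byte")
```

The Tensor Core balance point is roughly eight times higher than FP32's, which is why dense matrix multiply, with its high data reuse, is the workload these units target.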
NEW TENSOR CORE BUILT FOR AI
Delivering 120 TFLOPS of DL Performance
ALL MAJOR FRAMEWORKS | VOLTA-OPTIMIZED cuDNN
MATRIX DATA OPTIMIZATION: Dense Matrix of Tensor Compute
TENSOR-OP CONVERSION: FP32 to Tensor Op Data for Frameworks
VOLTA TENSOR CORE: 4x4 Matrix Processing Array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Optimized for Deep Learning
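The D = A*B + C line above can be emulated in NumPy to see what mixed precision buys: operands stored in FP16, but the multiply-accumulate carried in FP32 so partial sums keep full accuracy. A minimal sketch (a numerical emulation only; the hardware pipelines this per 4x4 tile):

```python
import numpy as np

# Emulate the Tensor Core primitive D[FP32] = A[FP16] * B[FP16] + C[FP32].
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = rng.standard_normal((4, 4)).astype(np.float32)

# Promote to FP32 *before* the matmul to model FP32 accumulation.
D = A.astype(np.float32) @ B.astype(np.float32) + C

# Contrast: accumulating in FP16 rounds every partial sum to half precision.
D_fp16_accum = (A @ B).astype(np.float32) + C

assert D.dtype == np.float32
print("max accumulation difference:", np.abs(D - D_fp16_accum).max())
```

The difference between the two accumulation modes grows with matrix size, which is why FP32 accumulation is the part of the primitive that makes FP16 inputs usable for training.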
REVOLUTIONARY AI PERFORMANCE
3X Faster DL Training Performance

Over 80x DL Training Performance in 3 Years
[Chart: GoogLeNet training speedup vs K80: 1x K80 + cuDNN2 (Q1'15), 4x M40 + cuDNN3 (Q3'15), 8x P100 + cuDNN6 (Q2'16), 8x V100 + cuDNN7 (Q2'17)]

85% Scale-Out Efficiency: Scales to 64 GPUs with Microsoft Cognitive Toolkit
[Chart: Multi-node training with NCCL 2.0 (ResNet-50): 8X P100 in 18 hours, 8X V100 in 7.4 hours, 64X V100 in 1 hour]
ResNet-50 training for 90 epochs with 1.28M-image dataset | Using Caffe2 | V100 performance measured on pre-production hardware.

3X Reduction in Time to Train Over P100
[Chart: LSTM training (Neural Machine Translation): 2X CPU in 15 days, 1X P100 in 18 hours, 1X V100 in 6 hours]
Neural Machine Translation training for 13 epochs | German->English, WMT15 subset | CPU = 2x Xeon E5-2699 v4 | V100 performance measured on pre-production hardware.
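Scale-out efficiency is achieved speedup divided by ideal linear speedup. A quick sanity check against the chart's own times (my arithmetic; the rounded chart figures land near, not exactly at, the quoted 85%):

```python
# Efficiency when going from 8 GPUs to 64 GPUs, using the rounded
# ResNet-50 times from the chart: 8x V100 = 7.4 h, 64x V100 = 1 h.

base_gpus, base_hours = 8, 7.4
big_gpus, big_hours = 64, 1.0

speedup = base_hours / big_hours  # achieved speedup from 8x more GPUs
ideal = big_gpus / base_gpus      # perfect linear speedup would be 8x
efficiency = speedup / ideal

print(f"{speedup:.1f}x of an ideal {ideal:.0f}x => {efficiency:.0%} efficiency")
```

The rounded times give roughly 93%, in the same ballpark as the slide's 85%, which presumably uses unrounded measurements.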
VOLTA DELIVERS 3X MORE INFERENCE THROUGHPUT
Low-Latency Performance with V100 and TensorRT
[Diagram: trained neural network -> TensorRT -> compiled real-time network: fuse layers, compact, optimize precision (FP32, FP16, INT8)]
3x More Throughput at 7ms Latency with V100 (ResNet-50)
[Chart: throughput @ 7ms (images/sec): CPU (33ms), Tesla P100 + TensorFlow (10ms), Tesla P100 + TensorRT (7ms), Tesla V100 + TensorRT (7ms, 3X)]
CPU Server: 2X Xeon E5-2660 V4; GPU: w/P100, w/V100 (@150W) | V100 performance measured on pre-production hardware.
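Throughput under a latency budget is just batch size over batch latency, and batching trades one for the other: a bigger batch delivers more images/sec but every image in it waits for the whole batch. The sketch below illustrates the trade-off; the batch timings are hypothetical, only the 7 ms budget comes from the slide:

```python
# Pick the highest-throughput batch size that still meets a latency budget.
# Candidate (batch_size, batch_latency_seconds) pairs are illustrative,
# made-up numbers, not measurements from the deck.

latency_budget = 0.007  # the slide's 7 ms service-level target

candidates = [(1, 0.0012), (8, 0.0030), (32, 0.0068), (64, 0.0110)]

best = max(
    ((n, t) for n, t in candidates if t <= latency_budget),
    key=lambda nt: nt[0] / nt[1],  # images per second
)
n, t = best
print(f"batch {n} at {t * 1e3:.1f} ms -> {n / t:.0f} images/sec within budget")
```

This is why the chart reports "throughput @ 7ms" rather than raw throughput: without the latency constraint, ever-larger batches would inflate the images/sec number at the cost of responsiveness.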
NVIDIA TENSORRT 3
World's Fastest Inference Platform
40X ResNet-50 Throughput @ 7ms Latency
[Chart: throughput @ 7ms (images/sec): CPU-only (14ms), Tesla V100 + TensorFlow, Tesla V100 + TRT 3 (FP16), Tesla P4 + TRT 3 (INT8)]
Workload: ResNet-50 | Dataset: ImageNet | CPU: Skylake | GPU: Tesla P4 or Tesla V100
GTC Taiwan 2017: Enterprise Deep Learning and AI Applications