19-May-2016
Frédéric Parienté, Business Development Manager
DEEP LEARNING UPDATE
#GTC16 TESLA ANNOUNCEMENTS
NVGRAPH
CUDA 8 & UNIFIED MEMORY
NVIDIA DGX-1
NVIDIA SDK
TESLA P100
Deep Learning Achieves
“Superhuman” Results
A NEW COMPUTING MODEL
Traditional Computer Vision
Experts + Time
Deep Learning Object Detection
DNN + Data + HPC
ImageNet
STATE OF DEEP LEARNING SYSTEMS
100k Deep Learning Systems Sold in 2015
[Chart: DL systems sold, by application and geography]
TOP 10 APPLICATIONS:
1. Machine Learning Algorithms
2. Image Recognition
3. Object Recognition
4. Big Data
5. Natural Language Processing
6. Action Recognition
7. Medical
8. Other
9. Facial Recognition
10. Speech Recognition
MICROSOFT: “SUPER DEEP NETWORKS”
Revolution of Depth: Microsoft Deep ResNet
http://arxiv.org/pdf/1512.03385v1.pdf
18 layers: 1.8 GF
152 layers: 11.3 GF
>6X MORE FLOPS
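The “>6x more FLOPS” figure follows directly from the two per-network costs quoted above. A quick sketch of the arithmetic (the GFLOP numbers are the slide's own):

```python
# Per-image forward-pass costs quoted on the slide (GFLOPs).
resnet18_gflops = 1.8
resnet152_gflops = 11.3

ratio = resnet152_gflops / resnet18_gflops
print(f"ResNet-152 / ResNet-18 FLOP ratio: {ratio:.1f}x")  # prints 6.3x, i.e. >6x
```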
BAIDU: DL DEVELOPERS NEED HPC
“Investments in computer systems — and I
think the bleeding-edge of AI, and deep
learning specifically, is shifting to HPC
(high performance computing) — can cut
down the time to run an experiment, and
therefore go around the circle, from a
week to a day and sometimes even faster.”
“Those of us that grew up doing machine
learning often didn’t grow up with an HPC
or computer systems background …
partnerships between machine learning
researchers and computer systems
researchers tend to help both teams drive
a lot of machine learning progress.”
— Andrew Ng
NVIDIA DGX-1
WORLD’S FIRST DEEP LEARNING SUPERCOMPUTER
170 TFLOPS FP16
8x Tesla P100 16GB
NVLink Hybrid Cube Mesh
Accelerates Major AI Frameworks
Dual Xeon
7 TB SSD Deep Learning Cache
Dual 10GbE, Quad IB 100Gb
3RU – 3200W
NVIDIA DGX-1 DEEP LEARNING SYSTEM
BENEFITS FOR AI RESEARCHERS
Design Big Networks
Reduce Training Times
Fastest DL Supercomputer
DL SDK, Ongoing Updates: cuDNN, NCCL, cuSPARSE, cuBLAS, cuFFT
NVIDIA DGX-1 SOFTWARE STACK
Optimized for Deep Learning Performance
Accelerated Deep Learning: cuDNN, NCCL, cuSPARSE, cuBLAS, cuFFT
Container-Based Applications: DIGITS, DL Frameworks, GPU Apps
NVIDIA Cloud Management
INTRODUCING TESLA P100
New GPU Architecture to Enable the World’s Fastest Compute Node
Pascal Architecture: Highest Compute Performance
NVLink: GPU Interconnect for Maximum Scalability
CoWoS HBM2: Unifying Compute & Memory in a Single Package
Page Migration Engine: Simple Parallel Programming with Virtually Unlimited Memory (Unified Memory)
[Diagram: dual CPUs and Tesla P100 GPUs connected via PCIe switches and NVLink]
GIANT LEAPS IN EVERYTHING
[Four charts comparing Tesla K40, M40 and P100]
PASCAL ARCHITECTURE: 21 Teraflops of FP16 for Deep Learning (Teraflops, FP32/FP16)
NVLINK: 5x GPU-GPU Bandwidth (Bi-directional BW, GB/s)
CoWoS HBM2 Stacked Mem: 3x Higher Bandwidth for Massive Data Workloads (Bandwidth, GB/s)
PAGE MIGRATION ENGINE: Virtually Unlimited Memory Space (Addressable Memory, GB)
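The “5x GPU-GPU Bandwidth” claim compares NVLink against PCIe Gen3 x16. A minimal sketch of the arithmetic, assuming the commonly quoted link rates (P100 has 4 NVLink links at 20 GB/s per direction each; PCIe Gen3 x16 carries roughly 16 GB/s per direction):

```python
# Assumed per-direction link rates (GB/s); not taken from the slide itself.
nvlink_links = 4       # NVLink links per P100
nvlink_per_link = 20   # GB/s per direction, per link
pcie_gen3_x16 = 16     # GB/s per direction

nvlink_bidir = nvlink_links * nvlink_per_link * 2  # 160 GB/s bidirectional
pcie_bidir = pcie_gen3_x16 * 2                     # 32 GB/s bidirectional
print(nvlink_bidir / pcie_bidir)  # prints 5.0
```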
HUGE JUMP IN PERFORMANCE

                      DUAL XEON   8X TESLA M40   DGX-1 (8X TESLA P100)
FLOPS (CPU + GPU)     3 TF        58 TF          170 TF
PROC-PROC BW          25 GB/s     64 GB/s        640 GB/s
ALEXNET TRAIN TIME    150 HOURS   9 HOURS        2 HOURS
# NODES FOR 3HR TAT   250         4              1
PERFORMANCE           1X          63X            250X
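The PERFORMANCE row above is not the raw single-node training-time speedup; it is the ratio of node counts needed to hit the same 3-hour turnaround (TAT). A small sketch, using only the table's numbers:

```python
# "# NODES FOR 3HR TAT" row from the table above.
nodes_for_3h = {"DUAL XEON": 250, "8X TESLA M40": 4, "DGX-1": 1}

# PERFORMANCE = baseline node count / system node count at fixed turnaround.
baseline = nodes_for_3h["DUAL XEON"]
performance = {k: baseline / v for k, v in nodes_for_3h.items()}
print(performance)  # {'DUAL XEON': 1.0, '8X TESLA M40': 62.5, 'DGX-1': 250.0}
```

The 62.5x rounds to the table's 63X, and the DGX-1 entry reproduces the 250X headline.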
HIGHEST ABSOLUTE PERFORMANCE DELIVERED
NVLink for Max Scalability, More than 45x Faster with 8x P100
[Chart: speed-up vs dual-socket Haswell CPU for Caffe/AlexNet, VASP, HOOMD-blue, COSMO, MILC, Amber and HACC, comparing 2x K80 (M40 for AlexNet), 2x P100, 4x P100 and 8x P100; scale 0x to 50x]
DATACENTER IN A RACK
1 Rack of Tesla P100 Delivers Performance of 6,000 CPUs
Tesla rack: 12 nodes, 8 P100s per node (96 P100s / 38 kW)
CPU racks: 36 nodes per rack, dual CPUs per node
[Chart: CPU-only equivalents per workload, spanning 1 to 84 racks]
Workloads: QUANTUM PHYSICS (MILC), WEATHER (COSMO), DEEP LEARNING (CAFFE/ALEXNET), MOLECULAR DYNAMICS (AMBER)
CPU equivalents shown: 638 CPUs / 186 kW, 650 CPUs / 190 kW, 2,900 CPUs / 850 kW, 6,000 CPUs / 1.8 MW
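The headline numbers on this slide also imply a large power saving. A quick check, using the slide's 96 P100s at 38 kW against the 6,000-CPU / 1.8 MW equivalent:

```python
p100_rack_kw = 38   # one rack: 96 P100s (from the slide)
cpu_equiv_kw = 1800 # 6,000-CPU equivalent at 1.8 MW (from the slide)

saving = cpu_equiv_kw / p100_rack_kw
print(f"~{saving:.0f}x lower power at equal throughput")  # prints ~47x
```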
TESLA P100 ACCELERATOR
Compute: 5.3 TF DP ∙ 10.6 TF SP ∙ 21.2 TF HP
Memory: HBM2, 720 GB/s ∙ 16 GB
Interconnect: NVLink (up to 8-way) + PCIe Gen3
Programmability: Page Migration Engine, Unified Memory
Availability: DGX-1 order now; Atos, Cray, Dell, HP, IBM systems in Q1 2017
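The three compute numbers in the table follow Pascal's precision ladder: each halving of precision doubles peak throughput (FP64 : FP32 : FP16 = 1 : 2 : 4). A quick consistency check on the quoted figures:

```python
import math

# DP / SP / HP peak throughputs (TF) from the table above.
fp64_tf, fp32_tf, fp16_tf = 5.3, 10.6, 21.2

assert math.isclose(fp32_tf / fp64_tf, 2.0)  # SP is 2x DP
assert math.isclose(fp16_tf / fp32_tf, 2.0)  # HP is 2x SP, key for DL training
print("1 : 2 : 4 precision ladder confirmed")
```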
END-TO-END PRODUCT FAMILY
HYPERSCALE HPC: Tesla M4, M40. Hyperscale deployment for DL training, inference, video & image processing.
MIXED-APPS HPC: Tesla K80. HPC data centers running a mix of CPU and GPU workloads.
STRONG-SCALING HPC: Tesla P100. Hyperscale & HPC data centers running apps that scale to multiple GPUs.
FULLY INTEGRATED DL SUPERCOMPUTER: DGX-1. For customers who need to get going now with a fully integrated solution.
NVIDIA DEEP LEARNING SDK
High Performance GPU-Acceleration for Deep Learning
APPLICATIONS
COMPUTER VISION: Image Classification, Object Detection
SPEECH AND AUDIO: Voice Recognition, Translation
BEHAVIOR: Recommendation Engines, Sentiment Analysis
FRAMEWORKS (e.g. Mocha.jl)
DEEP LEARNING SDK
DEEP LEARNING: cuDNN
MATH LIBRARIES: cuBLAS, cuSPARSE, cuFFT
MULTI-GPU: NCCL
NVIDIA CUDNN
Building blocks for accelerating deep
neural networks on GPUs
High performance deep neural
network training
Accelerates Deep Learning: Caffe,
CNTK, TensorFlow, Theano, Torch
Performance continues to improve
over time
“NVIDIA has improved the speed of cuDNN
with each release while extending the
interface to more operations and devices
at the same time.”
— Evan Shelhamer, Lead Caffe Developer, UC Berkeley
developer.nvidia.com/cudnn
AlexNet training throughput based on 20 iterations; CPU: 1x E5-2680v3, 12 cores, 2.5 GHz.
[Chart: AlexNet training speedup over CPU, 2014-2016: K40 (cuDNN v1), M40 (cuDNN v3), Pascal (cuDNN v5); scale 0x to 12x]
WHAT’S NEW IN CUDNN 5?
LSTM recurrent neural networks deliver up
to 6x speedup in Torch
Improved performance:
• Deep neural networks with 3x3 convolutions,
such as VGG, GoogLeNet and ResNets
• 3D Convolutions
• FP16 routines on Pascal GPUs
Pascal GPU, RNNs, Improved Performance
[Chart: speedup relative to torch-rnn (https://github.com/jcjohnson/torch-rnn): 5.9x for char-rnn RNN layers, 2.8x for DeepSpeech 2 RNN layers]
DeepSpeech 2: http://arxiv.org/abs/1512.02595
char-rnn: https://github.com/karpathy/char-rnn
DEEP LEARNING &
ARTIFICIAL INTELLIGENCE
Sep 28-29, 2016 | Amsterdam
www.gputechconf.eu #GTC16
SELF-DRIVING CARS VIRTUAL REALITY &
AUGMENTED REALITY
SUPERCOMPUTING & HPC
GTC Europe is a two-day conference designed to showcase the innovative ways developers, businesses and academics are
using parallel computing to transform our world.
EUROPE’S BRIGHTEST MINDS & BEST IDEAS
2 Days | 800 Attendees | 50+ Exhibitors | 50+ Speakers | 15+ Tracks | 15+ Workshops | 1-to-1 Meetings
