HPE demystifies deep learning for faster intelligence across all organizations
Edmondo Orlotti
HPC & AI Business Development Manager
October 2017
Data analytics and insights are fueling the digital transformation
• Enhanced customer experiences: personalized, real-time mobile insights for retail
• Improved products and services: genomics sequencing analytics for life sciences
• Optimized business processes: predictive maintenance insights for manufacturing
AI propels analytics and insights to a new dimension
Unleash automated intelligence from massive data volumes
• Data protection and archival to mitigate risk: HPE fraud detection using deep learning
• Infrastructure modernization for new data types and scale: user behavioral analytics for the data center using machine learning
• Next-generation analytics for real-time business: HPE Intelligent Edge real-time analytics with SAP Leonardo
• Insights from modeling and simulation: deep learning in HPC using GPU-accelerated computing
What’s all the “buzz” around AI?
Gain competitive advantage in the vibrant new market of AI
Source: McKinsey AI report, 2017
Overview of HPE’s GPU portfolio
HPE has a comprehensive, purpose-built portfolio for deep learning
• HPE Apollo 6500 – the enterprise bridge to accelerated computing; compute ideal for training models in the data center
• HPE SGI 8600 – petaflop scale for deep learning and HPC
• HPE Apollo 2000 – the bridge to enterprise scale-out architecture
• HPE Apollo sx40 – maximize GPU capacity and performance with lower TCO; compute for both training models and inference at the edge
• HPE Edgeline EL4000 – edge analytics and inference engine; unprecedented deep edge compute and high-capacity storage, built on open standards
• HPC storage: HPE Apollo 4520 with the HPC Data Management Framework software – a large-scale storage virtualization and tiered data management platform
• Choice of fabrics: Intel® Omni-Path Architecture, Mellanox InfiniBand, HPE FlexFabric networking, Arista networking
• AI software framework: easy setup and flexible OS, using Bright Computing’s distribution of deep learning software development components and workload management tool integration
• Services: advisory, professional and operational services, HPE Flexible Capacity, HPE Datacenter Care for Hyperscale
• Target segments: government, academia and industries; financial services; life sciences and health; autonomous vehicles and manufacturing
Introducing Tesla V100
TESLA V100: THE MOST ADVANCED DATA CENTER GPU EVER BUILT
• 5,120 CUDA cores
• 640 new Tensor Cores
• 7.5 FP64 TFLOPS | 15 FP32 TFLOPS | 120 Tensor TFLOPS
• 20 MB SM register file | 16 MB cache
• 16 GB HBM2 @ 900 GB/s
• 300 GB/s NVLink
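A quick sanity check on the Tensor TFLOPS headline (the sustained clock is an assumption; the slide does not state it): each Tensor Core executes one 4x4x4 matrix multiply-accumulate per clock, i.e. 64 FMAs or 128 floating-point operations.

```latex
640\ \text{Tensor Cores} \times 128\ \tfrac{\text{ops}}{\text{clock}} \times 1.46\ \text{GHz} \approx 120\ \text{TFLOPS}
```

At the full 1530 MHz boost clock the same product gives roughly 125 TFLOPS, which is why some later NVIDIA material quotes 125 rather than 120.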
VOLTA: A GIANT LEAP FOR DEEP LEARNING
[Charts: images per second, ResNet-50 training and inference]
• ResNet-50 training: V100 (Tensor Cores) is 2.4x faster than P100 (FP32)
• ResNet-50 inference at 7 ms latency with TensorRT: V100 (Tensor Cores) is 3.7x faster than P100 (FP16)
V100 measured on pre-production hardware.
INTRODUCING TESLA V100
The fastest and most productive GPU for deep learning and HPC
• Volta architecture – most productive GPU
• Tensor Core – 120 programmable TFLOPS of deep learning performance
• Improved SIMT model – enables new algorithms
• Volta MPS – higher inference utilization
• Improved NVLink & HBM2 – efficient bandwidth
TESLA V100 ARCHITECTURE
• 21B transistors, 815 mm²
• 80 SMs* with 5,120 CUDA cores and 640 Tensor Cores
• 16 GB HBM2 @ 900 GB/s
• 300 GB/s NVLink
*The full GV100 chip contains 84 SMs.
VOLTA V100 SM
• Completely new ISA
• Twice the schedulers
• Simplified issue logic
• Large, fast L1 cache
• Improved SIMT model
• Tensor acceleration
VOLTA NVLINK
• 300 GB/s total bandwidth
• 50% more links than Pascal
• 28% faster signaling
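The headline bandwidth decomposes as below; the per-link rate is an assumption drawn from public Volta documentation, not from the slide. Pascal used 4 links at 20 GT/s; Volta’s 6 links (the “50% more”) at roughly 25.8 GT/s (the “28% faster signaling”) give:

```latex
6\ \text{links} \times 25\ \tfrac{\text{GB/s}}{\text{direction}} \times 2\ \text{directions} = 300\ \text{GB/s aggregate}
```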
VOLTA MULTI-PROCESS SERVICE
[Diagram: CPU processes A, B, C submitting work through the CUDA Multi-Process Service to concurrent execution on a Volta GV100]
• Hardware-accelerated work submission
• Hardware isolation between clients
Volta MPS enhancements:
• Reduced launch latency
• Improved launch throughput
• Improved quality of service with scheduler partitioning
• More reliable performance
• 3x more clients than Pascal
VOLTA: INDEPENDENT THREAD SCHEDULING
• Pascal: lock-free algorithms only – threads in a warp cannot wait on one another
• Volta: starvation-free algorithms – threads may wait for messages from other threads
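To make the difference concrete, a minimal sketch (a hypothetical kernel, not from the deck) of an intra-warp spinlock: under Pascal’s lockstep SIMT, lanes spinning on the lock can starve the lane that holds it and livelock the warp; Volta’s per-thread program counters guarantee the holder eventually runs and releases it.

```cuda
#include <cstdio>

// Compile with: nvcc -arch=sm_70 spinlock.cu
// One lock and one counter shared by all threads.
__device__ int lock = 0;
__device__ int counter = 0;

__global__ void increment()
{
    // All 32 lanes of the warp contend for the same lock.
    bool done = false;
    while (!done) {
        if (atomicCAS(&lock, 0, 1) == 0) {  // try to acquire
            counter += 1;                    // critical section
            __threadfence();                 // publish the write
            atomicExch(&lock, 0);            // release
            done = true;
        }
        // Pre-Volta, lanes spinning here could keep the lock holder
        // from ever reaching the release; Volta schedules each thread
        // independently, so the loop is starvation-free.
    }
}

int main()
{
    increment<<<1, 32>>>();
    cudaDeviceSynchronize();
    int host = 0;
    cudaMemcpyFromSymbol(&host, counter, sizeof(int));
    printf("counter = %d (expect 32)\n", host);
    return 0;
}
```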
NEW TENSOR CORE BUILT FOR AI
Delivering 120 TFLOPS of DL performance
• Volta Tensor Core: a 4x4 matrix processing array computing D[FP32] = A[FP16] * B[FP16] + C[FP32], optimized for deep learning
• Volta-optimized cuDNN for all major frameworks
• Matrix data optimization: dense matrices for tensor compute
• Tensor-op conversion: FP32 to tensor-op data for the frameworks
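Frameworks reach the Tensor Cores through Volta-optimized cuDNN, but CUDA 9 also exposes them directly via the warp-level WMMA API. A minimal sketch of one warp computing a single 16x16x16 tile (the matrix layouts and leading dimensions are illustrative assumptions), mirroring the slide’s D[FP32] = A[FP16] * B[FP16] + C[FP32]:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Compile with: nvcc -arch=sm_70 wmma_tile.cu
// One warp computes D = A * B + C on a 16x16x16 tile:
// A and B are FP16; the accumulator C/D is FP32.
__global__ void tensor_op_16x16x16(const half* a, const half* b,
                                   const float* c, float* d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::load_matrix_sync(fa, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::load_matrix_sync(fc, c, 16, wmma::mem_row_major);

    wmma::mma_sync(fc, fa, fb, fc);      // issued to the Tensor Cores

    wmma::store_matrix_sync(d, fc, 16, wmma::mem_row_major);
}
```

Launched as `tensor_op_16x16x16<<<1, 32>>>(a, b, c, d)`; real kernels tile a large GEMM across many warps, which is what cuDNN 7 does internally.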
AI PERFORMANCE: 3X FASTER DL TRAINING PERFORMANCE
[Chart: GoogLeNet training speedup vs. K80, Q1’15–Q2’17]
Over 80x DL training performance in 3 years: 1x K80 with cuDNN2 (Q1’15), 4x M40 with cuDNN3 (Q3’15), 8x P100 with cuDNN6 (Q2’16), 8x V100 with cuDNN7 (Q2’17)
[Chart: multi-node ResNet-50 training time with NCCL 2.0]
85% scale-out efficiency, scaling to 64 GPUs with Microsoft Cognitive Toolkit: 18 hours on 8x P100, 7.4 hours on 8x V100, 1 hour on 64x V100
ResNet-50 training for 90 epochs on the 1.28M-image dataset, using Caffe2; V100 performance measured on pre-production hardware.
[Chart: LSTM training time, neural machine translation]
3x reduction in time to train over P100: 15 days on CPU, 18 hours on 1x P100, 6 hours on 1x V100
Neural machine translation training for 13 epochs, German→English, WMT15 subset; CPU = 2x Xeon E5-2699 v4; V100 performance measured on pre-production hardware.
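The multi-node numbers above lean on NCCL 2.0’s collectives. A minimal single-process sketch (buffer size and device count are illustrative) of the gradient all-reduce that data-parallel frameworks such as Caffe2 and the Cognitive Toolkit issue after every backward pass:

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

int main()
{
    const int nDev = 8;                       // one 8x V100 node
    const size_t count = 1 << 20;             // gradient elements per GPU
    int devs[nDev];
    float* grads[nDev];
    cudaStream_t streams[nDev];
    ncclComm_t comms[nDev];

    for (int i = 0; i < nDev; ++i) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaMalloc(&grads[i], count * sizeof(float));
        cudaMemset(grads[i], 0, count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, nDev, devs);       // one communicator per GPU

    // In-place ring all-reduce: every GPU ends with the summed gradients.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(grads[i], grads[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```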
VOLTA DELIVERS 3X MORE INFERENCE THROUGHPUT
Low-latency performance with V100 and TensorRT
TensorRT compiles a trained neural network into a compact real-time network, fusing layers and optimizing precision (FP32, FP16, INT8).
[Chart: ResNet-50 throughput at 7 ms target latency, images/sec, 0–5,000]
3x more throughput at 7 ms latency with V100: CPU (TensorFlow) at 33 ms, Tesla P100 (TensorFlow) at 10 ms, Tesla P100 (TensorRT) at 7 ms, Tesla V100 (TensorRT) at 7 ms with 3x the P100 throughput
CPU server: 2x Xeon E5-2660 v4; GPU: with P100 / with V100 (@150 W); V100 performance measured on pre-production hardware.
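For orientation, a minimal sketch of the TensorRT build step in C++. This follows the current ONNX-parser workflow rather than the 2017-era Caffe-parser API; `model.onnx`, the Logger class, and the omission of cleanup are assumptions for illustration:

```cpp
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <iostream>

// Minimal logger the TensorRT builder requires.
class Logger : public nvinfer1::ILogger {
    void log(Severity sev, const char* msg) noexcept override {
        if (sev <= Severity::kWARNING) std::cout << msg << "\n";
    }
} gLogger;

int main()
{
    using namespace nvinfer1;
    IBuilder* builder = createInferBuilder(gLogger);
    auto flags = 1U << static_cast<uint32_t>(
        NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition* network = builder->createNetworkV2(flags);

    // Import the trained network (path is a placeholder).
    auto* parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile("model.onnx",
                          static_cast<int>(ILogger::Severity::kWARNING));

    // Let the optimizer fuse layers and lower precision where safe;
    // kFP16 selects the Volta Tensor Core path.
    IBuilderConfig* config = builder->createBuilderConfig();
    config->setFlag(BuilderFlag::kFP16);

    // Runs layer fusion / kernel selection and emits a deployable plan.
    IHostMemory* plan = builder->buildSerializedNetwork(*network, *config);
    std::cout << "engine plan: " << plan->size() << " bytes\n";
    return 0;
}
```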
V100: A SINGLE UNIVERSAL GPU THAT BOOSTS ALL ACCELERATED WORKLOADS
• HPC: 1.5x vs. P100
• AI training: 3x vs. P100
• AI inference: 3x vs. P100
• Virtual desktop: 2x vs. M60
OPTIMIZED FOR DATACENTER EFFICIENCY
80% of peak performance at half the power means 40% more performance in a rack:
• V100 max performance: 13 kW rack, 4 nodes of 8x V100 – 13 ResNet-50 networks trained per day
• V100 max efficiency: 13 kW rack, 7 nodes of 8x V100 – 18 ResNet-50 networks trained per day
ResNet-50 training; max-efficiency run with V100 @ 160 W; V100 performance measured on pre-production hardware.
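The two rack configurations are consistent with the 80%-at-half-power claim: assuming each max-efficiency node delivers 80% of a max-performance node,

```latex
\frac{7\ \text{nodes} \times 0.8}{4\ \text{nodes} \times 1.0} = 1.4,
\qquad
\frac{18\ \text{networks/day}}{13\ \text{networks/day}} \approx 1.38
```

both matching the quoted ~40% more performance per rack.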
TESLA V100 SPECIFICATIONS
                 For NVLink servers                  For PCIe servers
Compute          7.5 TF DP | 15 TF SP | 120 TF DL    7 TF DP | 14 TF SP | 112 TF DL
Memory           16 GB HBM2 @ 900 GB/s               16 GB HBM2 @ 900 GB/s
Interconnect     NVLink (up to 300 GB/s) +           PCIe Gen3 (up to 32 GB/s)
                 PCIe Gen3 (up to 32 GB/s)
Power            300 W                               250 W
HPE enables an optimized deep learning experience
• Applications: fraud detection, predictive maintenance, patient diagnostics
• Deep learning frameworks
• Data infrastructure
• Hardware infrastructure
• Deep learning services
External announcement at NVIDIA GTC on May 10th, 2017
Thank you
Edmondo.Orlotti@HPE.com
