HPC DAY 2017 | NVIDIA Volta Architecture. Performance. Efficiency. Availability


Published on HPC DAY 2017 - http://www.hpcday.eu/

NVIDIA Volta Architecture. Performance. Efficiency. Availability

Edmondo Orlotti | EMEA HPC Marketing Manager at HPE


  1. HPE demystifies deep learning for faster intelligence across all organizations. Edmondo Orlotti, HPC & AI Business Development Manager. October 2017
  2. Data analytics and insights are fueling the digital transformation
     - Enhanced customer experiences: personalized, real-time mobile insights for retail
     - Improved products and services: genomics sequencing analytics for Life Sciences
     - Optimized business processes: predictive maintenance insights for manufacturing
  3. AI propels analytics and insights to a new dimension: unleash automated intelligence from massive data volumes
     - Data protection and archival to mitigate risk: HPE fraud detection using deep learning
     - Infrastructure modernization for new data types and scale: user behavioral analytics for the data center using machine learning
     - Next-generation analytics for real-time business: HPE Intelligent Edge real-time analytics with SAP Leonardo
     - Insights from modeling and simulation: deep learning in HPC using GPU-accelerated computing
  4. What’s all the “buzz” around AI? Gain competitive advantage in the vibrant new market of AI. (Source: McKinsey AI report, 2017)
  5. Overview of HPE’s GPU portfolio
  6. HPE has a comprehensive, purpose-built portfolio for deep learning
     - HPE Apollo 6500: compute ideal for training models in the data center; the enterprise bridge to accelerated computing
     - HPE SGI 8600: petaflop scale for deep learning and HPC, for government, academia and industries
     - HPE Apollo 2000: the bridge to enterprise scale-out architecture; compute for both training models and inference at the edge
     - HPE Edgeline EL4000: edge analytics and inference engine; unprecedented deep edge compute and high-capacity storage; open standards
     - HPE Apollo sx40: maximize GPU capacity and performance with lower TCO
     - HPE Apollo 4520: HPC storage
     - Choice of fabrics: Intel® Omni-Path Architecture, Mellanox InfiniBand, Arista Networking, HPE FlexFabric Network
     - Software: AI software frameworks; HPC Data Management Framework (large-scale storage virtualization and tiered data management platform); easy setup and flexible OS using Bright Computing’s distribution of deep learning software development components and workload-management tool integration
     - Services: advisory, professional and operational services; HPE Flexible Capacity; HPE Datacenter Care for Hyperscale
     - Target verticals: financial services; Life Sciences and health; government and academia; autonomous vehicles and manufacturing
  7. Introducing Tesla V100
  8. Tesla V100: the most advanced data center GPU ever built
     - 5,120 CUDA cores and 640 new Tensor Cores
     - 7.5 FP64 TFLOPS | 15 FP32 TFLOPS | 120 Tensor TFLOPS
     - 20 MB SM register file | 16 MB cache | 16 GB HBM2 at 900 GB/s
     - 300 GB/s NVLink
  9. Volta: a giant leap for deep learning
     - ResNet-50 training (images per second): V100 Tensor Cores deliver 2.4x the throughput of P100 FP32
     - ResNet-50 inference at 7 ms latency with TensorRT (images per second): V100 Tensor Cores deliver 3.7x the throughput of P100 FP16
     - V100 measured on pre-production hardware
  10. Introducing Tesla V100: the fastest and most productive GPU for deep learning and HPC
     - Volta architecture: most productive GPU
     - Tensor Core: 120 programmable TFLOPS for deep learning
     - Improved SIMT model: new algorithms
     - Volta MPS: inference utilization
     - Improved NVLink and HBM2: efficient bandwidth
  11. Tesla V100 architecture
     - 21B transistors on an 815 mm² die
     - 80 SMs (the full GV100 chip contains 84 SMs)
     - 5,120 CUDA cores and 640 Tensor Cores
     - 16 GB HBM2 at 900 GB/s
     - 300 GB/s NVLink
  12. Volta V100 SM
     - Completely new ISA
     - Twice the schedulers
     - Simplified issue logic
     - Large, fast L1 cache
     - Improved SIMT model
     - Tensor acceleration
  13. Volta NVLink: 300 GB/s aggregate bandwidth, with 50% more links and 28% faster signaling than Pascal
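Those percentages decompose cleanly. The link counts and signaling rates below are not on the slide; they are NVIDIA's published NVLink 1.0 (P100) and 2.0 (V100) figures, used here as assumptions for a quick sanity check:

```python
# Sanity-checking the NVLink claims with published (assumed) figures:
# P100: 4 links, 40 GB/s bidirectional each, 20 Gbit/s per-lane signaling.
# V100: 6 links, 50 GB/s bidirectional each, 25.78 Gbit/s per-lane signaling.
pascal = {"links": 4, "gbs_per_link": 40, "lane_gbps": 20.0}
volta = {"links": 6, "gbs_per_link": 50, "lane_gbps": 25.78}

total = volta["links"] * volta["gbs_per_link"]          # 300 GB/s aggregate
more_links = volta["links"] / pascal["links"] - 1       # 0.5 -> "50% more links"
faster = volta["lane_gbps"] / pascal["lane_gbps"] - 1   # ~0.29 -> "28% faster"

print(total, more_links, round(faster, 2))
```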
  14. Volta Multi-Process Service: hardware-accelerated work submission and hardware isolation
     - CPU processes A, B and C submit work through the CUDA Multi-Process Service control for GPU execution on Volta GV100
     - Volta MPS enhancements: reduced launch latency; improved launch throughput; improved quality of service with scheduler partitioning; more reliable performance; 3x more clients than Pascal
  15. Volta: independent thread scheduling
     - Pascal supports lock-free algorithms: threads cannot wait for messages
     - Volta supports starvation-free algorithms: threads may wait for messages
  16. New Tensor Core built for AI, delivering 120 TFLOPS of DL performance
     - 4x4 matrix processing array: D[FP32] = A[FP16] * B[FP16] + C[FP32]
     - Matrix data optimization: dense matrix of tensor compute
     - Tensor-op conversion: FP32 to tensor-op data for frameworks
     - Volta-optimized cuDNN supports all major frameworks; optimized for deep learning
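The mixed-precision contract D[FP32] = A[FP16] * B[FP16] + C[FP32] can be emulated in a few lines of NumPy. This is only an illustration of the numerics (FP16 inputs, FP32 accumulation); a real Tensor Core performs the whole 4x4 operation in hardware, reached through cuDNN, cuBLAS, or the CUDA WMMA API:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4)).astype(np.float16)  # FP16 input tile A
b = rng.standard_normal((4, 4)).astype(np.float16)  # FP16 input tile B
c = rng.standard_normal((4, 4)).astype(np.float32)  # FP32 accumulator C

# Multiply FP16 operands but accumulate in FP32, as Tensor Cores do;
# this keeps far more accuracy than doing everything in FP16.
d = a.astype(np.float32) @ b.astype(np.float32) + c

assert d.dtype == np.float32 and d.shape == (4, 4)
```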
  17. AI performance: 3x faster DL training
     - Over 80x DL training performance in 3 years (GoogLeNet training speedup vs K80): cuDNN2 with 1x K80 (Q1 2015), cuDNN3 with 4x M40 (Q3 2015), cuDNN6 with 8x P100 (Q2 2016), cuDNN7 with 8x V100 (Q2 2017)
     - 85% scale-out efficiency: scales to 64 GPUs with Microsoft Cognitive Toolkit
     - Multi-node training with NCCL 2.0 (ResNet-50, 90 epochs, 1.28M-image dataset, Caffe2): 18 hours on 8x P100, 7.4 hours on 8x V100, 1 hour on 64x V100
     - 3x reduction in time to train over P100 (LSTM neural machine translation, 13 epochs, German to English, WMT15 subset): 15 days on 2x Xeon E5-2699 v4 CPUs, 18 hours on 1x P100, 6 hours on 1x V100
     - V100 performance measured on pre-production hardware
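The headline ratios follow directly from the hours quoted on the slide (the 85% scale-out figure comes from the separate Cognitive Toolkit measurement, so it is not re-derived here):

```python
# Hours quoted on the slide for LSTM neural machine translation training.
lstm_hours = {"2x CPU": 15 * 24, "1x P100": 18, "1x V100": 6}
assert lstm_hours["1x P100"] / lstm_hours["1x V100"] == 3   # 3x over P100

# Hours quoted for multi-node ResNet-50 training with NCCL 2.0.
resnet_hours = {"8x P100": 18, "8x V100": 7.4, "64x V100": 1}
# Going from 8 to 64 V100s (8x the GPUs) cuts time by 7.4x in this run:
scaling = resnet_hours["8x V100"] / resnet_hours["64x V100"] / 8
print(round(scaling, 2))  # ~0.93 scaling efficiency for this particular run
```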
  18. Volta delivers 3x more inference throughput: low-latency performance with V100 and TensorRT
     - TensorRT compiles a trained neural network into a compact real-time network, fusing layers and optimizing precision (FP32, FP16, INT8)
     - 3x more throughput at 7 ms latency with V100 (ResNet-50), measured in images per second at the stated latency: CPU server at 33 ms; Tesla P100 (TensorFlow) at 10 ms; Tesla P100 (TensorRT) at 7 ms; Tesla V100 (TensorRT) at 7 ms
     - CPU server: 2x Xeon E5-2660 v4; GPU results with P100 and V100 at 150 W; V100 measured on pre-production hardware
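As a rough illustration of the "optimize precision" step, here is a symmetric linear INT8 quantizer in NumPy. The max-abs calibration used here is a simplifying assumption for the sketch; TensorRT actually chooses INT8 scales with an entropy-based calibration pass:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric linear quantization: map [-max|x|, max|x|] onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)

# INT8 storage is 4x smaller than FP32, and the round-trip error stays
# within half a quantization step of the original values.
assert float(np.max(np.abs(x - x_hat))) <= s / 2 + 1e-6
```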
  19. V100: a single universal GPU that boosts all accelerated workloads
     - HPC: 1.5x vs P100
     - AI training: 3x vs P100
     - AI inference: 3x vs P100
     - Virtual desktop: 2x vs M60
  20. Optimized for datacenter efficiency: 80% of the performance at half the power, and 40% more performance in a rack
     - V100 max performance: a 13 kW rack holds 4 nodes of 8x V100 and trains 13 ResNet-50 networks per day
     - V100 max efficiency: the same 13 kW rack holds 7 nodes of 8x V100 and trains 18 ResNet-50 networks per day
     - ResNet-50 training; max-efficiency run with V100 at 160 W; V100 measured on pre-production hardware
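The rack numbers are internally consistent. The 550 W per-node host budget below is an illustrative assumption (it is not on the slide) chosen so the quoted node counts fit the 13 kW envelope; the GPU powers come from the deck (300 W default, 160 W in the max-efficiency run):

```python
RACK_W, GPUS_PER_NODE, HOST_W = 13_000, 8, 550  # HOST_W is an assumption

def nodes_that_fit(gpu_w: int) -> int:
    """Whole nodes that fit the rack power budget at a given GPU power."""
    return RACK_W // (GPUS_PER_NODE * gpu_w + HOST_W)

assert nodes_that_fit(300) == 4   # max performance: 300 W per V100
assert nodes_that_fit(160) == 7   # max efficiency: 160 W per V100

per_node = (18 / 7) / (13 / 4)    # ~0.79: "80% perf at half the power"
rack_gain = 18 / 13               # ~1.38: "40% more performance in a rack"
assert 0.75 < per_node < 0.85 and 1.35 < rack_gain < 1.45
```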
  21. Tesla V100 specifications
     - For NVLink servers: compute 7.5 TF DP, 15 TF SP, 120 TF DL; memory 16 GB HBM2 at 900 GB/s; interconnect NVLink (up to 300 GB/s) plus PCIe Gen3 (up to 32 GB/s); power 300 W
     - For PCIe servers: compute 7 TF DP, 14 TF SP, 112 TF DL; memory 16 GB HBM2 at 900 GB/s; interconnect PCIe Gen3 (up to 32 GB/s); power 250 W
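The peak numbers in the table follow from unit counts and clock. The ~1455 MHz boost clock below is an assumption (it is not in the deck) that makes the arithmetic land on the quoted figures; each CUDA core retires one FMA (2 FLOPs) per clock, and each Tensor Core 64 FMAs:

```python
CLOCK_HZ = 1.455e9        # assumed boost clock; not stated in the deck
CUDA_CORES, TENSOR_CORES = 5120, 640

fp32_tflops = CUDA_CORES * 2 * CLOCK_HZ / 1e12          # FMA = 2 FLOPs/clock
fp64_tflops = fp32_tflops / 2                            # FP64 at half FP32 rate
tensor_tflops = TENSOR_CORES * 64 * 2 * CLOCK_HZ / 1e12  # 64 FMAs/core/clock

print(round(fp32_tflops, 1), round(fp64_tflops, 1), round(tensor_tflops))
# ~14.9, ~7.4, ~119: within rounding of the 15 / 7.5 / 120 TF on the slide
```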
  22. HPE enables an optimized deep learning experience across the stack
     - Applications: fraud detection, predictive maintenance, patient diagnostics
     - Deep learning frameworks
     - Deep learning services
     - Data infrastructure
     - Hardware infrastructure
     (HPE Confidential; external announcement at NVIDIA GTC on May 10th, 2017)
  23. Thank you. Edmondo.Orlotti@HPE.com (December 2015, #c03880772)
