ACCELERATING DATA SCIENCE
WITH GPUs
AGENDA
• Session 1:
○ NVIDIA-Iguazio Accelerated Solutions for Deep Learning and Machine
Learning
Dr. Gabriel Noaje
Senior Solutions Architect
E-mail: gnoaje@nvidia.com
http://bit.ly/GabrielNoaje
AGENDA
• Session 2:
○ Data Science as a service using GPUs
○ Demo
• Anant Gandhi, Solutions Engineer, Iguazio
He has 12 years of experience helping customers
across Banking, Aerospace & Telecom, with expertise in
the Big Data and Analytics ecosystem
https://www.linkedin.com/in/anant-gandhi-b5447614/
NVIDIA ACCELERATED SOLUTIONS FOR
DEEP LEARNING AND MACHINE LEARNING
Dr. Gabriel Noaje
Senior Solutions Architect, APAC South
gnoaje@nvidia.com
2
NVIDIA
The AI Computing Company
GAMING | DESIGN: Visualization
MACHINE LEARNING | HPC | DEEP LEARNING: Scientific Computing, AI and Data Analytics
TRANSPORTATION | HEALTHCARE: Industry Verticals
3
APPS &
FRAMEWORKS
CUDA-X &
NVIDIA SDKs
NVIDIA DATA CENTER PLATFORM
Single Platform Drives Utilization and Productivity
CUDA & CORE LIBRARIES - cuBLAS | NCCL
DEEP LEARNING: cuDNN
HPC: cuFFT | OpenACC | 600+ Applications (Amber, NAMD)
CUSTOMER USE CASES: Speech | Translate | Recommender
SCIENTIFIC APPLICATIONS
Molecular
Simulations
Weather
Forecasting
Seismic
Mapping
CONSUMER INTERNET & INDUSTRY APPLICATIONS: Manufacturing | Healthcare | Finance
MACHINE LEARNING: cuDF | cuML | cuGRAPH | cuDNN | CUTLASS | TensorRT
VIRTUAL GPU | VIRTUAL GRAPHICS: vDWS (Creative & Technical) | vPC (Knowledge Workers) | vApps
TESLA GPUs & SYSTEMS: TESLA GPU | NVIDIA HGX | EVERY OEM | EVERY MAJOR CLOUD
4
ONE ARCHITECTURE –
MULTIPLE USE CASES THROUGH NVIDIA SDKs
CLARA for Medical Imaging | DEEPSTREAM for Video Analytics
DRIVE for Autonomous Vehicles | RAPIDS for Machine Learning
VRWorks for Virtual Reality | ISAAC for Robotics
5
RAPIDS
6
GPU-ACCELERATED DATA SCIENCE
Use Cases in Every Industry
Ad Personalization
Click Through Rate Optimization
Churn Reduction
CONSUMER INTERNET
Claim fraud
Customer service chatbots/routing
Risk evaluation
FINANCIAL SERVICES
Remaining Useful Life Estimation
Failure Prediction
Demand Forecasting
MANUFACTURING
Detect Network/Security Anomalies
Forecasting Network Performance
Network Resource Optimization (SON)
TELCO
Supply Chain & Inventory Management
Price Management / Markdown Optimization
Promotion Prioritization And Ad Targeting
RETAIL
Personalization & Intelligent Customer Interactions
Connected Vehicle Predictive Maintenance
Forecasting, Demand, & Capacity Planning
AUTOMOTIVE
Sensor Data Tag Mapping
Anomaly Detection
Robust Fault Prediction
OIL & GAS
Improve Clinical Care
Drive Operational Efficiency
Speed Up Drug Discovery
HEALTHCARE
7
EXTENDING DL → BIG DATA ANALYTICS
From Business Intelligence to Data Science
Deep Learning | Traditional Machine Learning (regressions, decision trees, graph) | Analytics
DATA SCIENCE / ARTIFICIAL INTELLIGENCE
DENSE DATA TYPES (images, video, voice) | TABULAR/SPARSE DATA
8
ML WORKFLOW STIFLES INNOVATION
Data Sources → ETL → Data Lake → Wrangle Data → Train → Evaluate Predictions
Stages: Data Preparation → Train → Deploy
Time-consuming, inefficient workflow that wastes data science productivity
9
WHAT IS RAPIDS?
rapids.ai
Suite of open-source, end-to-end data
science tools
Built on CUDA
Pandas-like API for data cleaning and
transformation
Scikit-learn-like API
A unifying framework for GPU data
science
The New GPU Data Science Pipeline
10
DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA
DATA
DATA PREPARATION (cuDF)
GPU-accelerated compute for in-memory data preparation
Simplified implementation using familiar data science tools
Python drop-in Pandas replacement built on CUDA C++; GPU-accelerated Spark
PREDICTIONS
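To make the cuDF layer concrete, here is a minimal, hedged sketch of the Pandas-like API; the file name and column names are placeholders, not from the deck.

```python
# Minimal cuDF sketch (assumes a RAPIDS install and a GPU).
# "transactions.csv", "amount" and "customer_id" are hypothetical.
import cudf

gdf = cudf.read_csv("transactions.csv")        # loads straight into GPU memory
gdf = gdf[gdf["amount"] > 0]                   # same filtering syntax as pandas
summary = gdf.groupby("customer_id").agg({"amount": ["sum", "mean"]})
print(summary.head())
```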
11
DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA
MODEL TRAINING (cuML)
GPU-acceleration of today’s most popular ML algorithms
XGBoost, PCA, Kalman, K-means, k-NN, DBScan, tSVD …
DATA PREDICTIONS
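As an illustration of the scikit-learn-like API, the sketch below fits one of the listed algorithms (k-means) on a GPU DataFrame; the input file and cluster count are assumptions for the example.

```python
# Hedged cuML sketch: k-means clustering on data already resident on the GPU.
import cudf
from cuml.cluster import KMeans

gdf = cudf.read_csv("features.csv")            # hypothetical all-numeric feature table
km = KMeans(n_clusters=8, random_state=0)
km.fit(gdf)                                    # training runs entirely on the GPU
labels = km.predict(gdf)
print(labels[:10])
```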
12
DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA
VISUALIZATION (cuGRAPH)
Effortless exploration of datasets, billions of records in milliseconds
Dynamic interaction with data = faster ML model development
Data visualization ecosystem (Graphistry & OmniSci), integrated with RAPIDS
DATA PREDICTIONS
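For the graph side, a hedged cuGraph sketch: PageRank over an edge list held in a cuDF DataFrame. The edge file and column names are illustrative, not the demo shown in the session.

```python
# Illustrative cuGraph sketch: rank vertices of a hypothetical edge list.
import cudf
import cugraph

edges = cudf.read_csv("edges.csv", header=None, names=["src", "dst"])
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")
ranks = cugraph.pagerank(G)                    # cuDF DataFrame of vertex scores
print(ranks.sort_values("pagerank", ascending=False).head())
```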
13
DAY IN THE LIFE OF A DATA SCIENTIST
Dataset Collection → Analysis → Data Prep → Train → Inference
GPU-POWERED WORKFLOW (clock-face timeline): Dataset Downloads Overnight → Start → GET A COFFEE → Train Model → Validate → Test Model → Experiment with Optimizations and Repeat → Go Home on Time
CPU-POWERED WORKFLOW (clock-face timeline): Dataset Downloads Overnight → Configure Data Prep Workflow → GET A COFFEE → Start Data Prep Workflow → GET A COFFEE → Find Unexpected Null Values Stored as String… → Restart Data Prep Workflow → ANOTHER… GET A COFFEE → @*#! Forgot to Add a Feature → Restart Data Prep Workflow Again → Stay Late → Switch to Decaf
14
TRADITIONAL
DATA SCIENCE
CLUSTER
Workload Profile:
Fannie Mae Mortgage Data:
• 192GB data set
• 16 years, 68 quarters
• 34.7 Million single family mortgage loans
• 1.85 Billion performance records
• XGBoost training set: 50 features
300 Servers | $3M | 180 kW
15
GPU-ACCELERATED
MACHINE
LEARNING
CLUSTER
1 DGX-2 | 10 kW
1/8 the Cost | 1/15 the Space
1/18 the Power
DGX-2 and RAPIDS for
Predictive Analytics
[Bar chart: end-to-end runtime for 20, 30, 50 and 100 CPU-node clusters vs. a single DGX-2 and 5x DGX-1]
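A single DGX-2 stands in for the CPU cluster by spreading the DataFrame across its GPUs. A minimal sketch with dask-cuda and dask-cudf, assuming hypothetical mortgage CSV paths and column names:

```python
# Hedged multi-GPU sketch: one Dask worker per visible GPU.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()                       # spins up one worker per GPU
client = Client(cluster)

perf = dask_cudf.read_csv("mortgage/perf_*.csv")   # partitions land on different GPUs
acq = dask_cudf.read_csv("mortgage/acq_*.csv")
joined = perf.merge(acq, on="loan_id", how="left") # join executes across the GPUs
print(joined.head())
```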
16
FASTER SPEEDS, REAL WORLD BENEFITS
Time in seconds (shorter is better)
cuIO/cuDF (Load and Data Preparation): 20 CPU Nodes 2,741 | 30 CPU Nodes 1,675 | 50 CPU Nodes 715 | 100 CPU Nodes 379 | DGX-2 42 | 5x DGX-1 19
cuML (XGBoost): 20 CPU Nodes 2,290 | 30 CPU Nodes 1,956 | 50 CPU Nodes 1,999 | 100 CPU Nodes 1,948 | DGX-2 169 | 5x DGX-1 157
End-to-End: stacked totals of cuIO/cuDF load and data preparation, data conversion, and XGBoost for the same configurations
Benchmark: 200GB CSV dataset; data preparation includes joins and variable transformations
CPU Cluster Configuration: CPU nodes (61 GB of memory, 8 vCPUs, 64-bit platform), Apache Spark
DGX Cluster Configuration: 5x DGX-1 on an InfiniBand network
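The GPU XGBoost step in a benchmark like this can be sketched as follows; the file, label column and hyperparameters below are placeholders rather than the benchmark's actual configuration.

```python
# Hedged sketch: cuDF load followed by GPU-accelerated XGBoost training.
import cudf
import xgboost as xgb

gdf = cudf.read_csv("mortgage_features.csv")            # hypothetical prepared features
X = gdf.drop(columns=["delinquent"])
y = gdf["delinquent"]

dtrain = xgb.DMatrix(X, label=y)                        # built from GPU-resident columns
params = {"tree_method": "gpu_hist", "objective": "binary:logistic", "max_depth": 8}
booster = xgb.train(params, dtrain, num_boost_round=100)
```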
17
GTC 2019 RAPIDS TRAINING CONTENT
S9801 - RAPIDS: Deep Dive Into How the Platform Works
PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9801-rapids-deep-dive-into-how-the-platform-works.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9801/
S9577 - RAPIDS: The Platform Inside and Out
PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9577-rapids-the-platform-inside-and-out.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9577/
S9793 - cuDF: RAPIDS GPU-Accelerated Data Frame Library
PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9793-cudf-rapids-gpu-accelerated-data-frame-library.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9793/
S91043 - RAPIDS CUDA DataFrame Internals for C++ Developers
PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s91043-rapids-cuda-dataframe-internals-for-c++-developers.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S91043/
S9817 - RAPIDS cuML: A Library for GPU Accelerated Machine Learning
PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9817-rapids-cuml-a-library-for-gpu-accelerated-machine-learning.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9817/
S9783 - Accelerating Graph Algorithms with RAPIDS
PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9783-accelerating-graph-algorithms-with-rapids.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9783/
Many more sessions: 26 sessions on RAPIDS-related topics!
DEEP LEARNING
19
AI TRANSFORMING EVERY INDUSTRY
HEALTHCARE: >80% Accuracy & Immediate Alert to Radiologists
INFRASTRUCTURE: 50% Reduction in Emergency Road Repair Costs
IOT: >$6M / Year Savings and Reduced Risk of Outage
20
NVIDIA BREAKS RECORDS IN AI PERFORMANCE
MLPerf Records Both At Scale And Per Accelerator
Record Type Benchmark Record
Max Scale
(Minutes To
Train)
Object Detection (Heavy Weight) Mask R-CNN 18.47 Mins
Translation (Recurrent) GNMT 1.8 Mins
Reinforcement Learning (MiniGo) 13.57 Mins
Per Accelerator
(Hours To Train)
Object Detection (Heavy Weight) Mask R-CNN 25.39 Hrs
Object Detection (Light Weight) SSD 3.04 Hrs
Translation (Recurrent) GNMT 2.63 Hrs
Translation (Non-recurrent) Transformer 2.61 Hrs
Reinforcement Learning (MiniGo) 3.65 Hrs
Per Accelerator comparison using reported performance for MLPerf 0.6 NVIDIA DGX-2H (16 V100s) compared to other submissions at same scale except for MiniGo where NVIDIA DGX-1 (8 V100s) submission was used| MLPerf
ID Max Scale: Mask R-CNN: 0.6-23, GNMT: 0.6-26, MiniGo: 0.6-11 | MLPerf ID Per Accelerator: Mask R-CNN, SSD, GNMT, Transformer: all use 0.6-20, MiniGo: 0.6-10
21
NVIDIA DGX SUPERPOD BREAKS
AT SCALE AI RECORDS
Under 20 Minutes To Train Each MLPerf Benchmark
MLPerf At Scale Submissions: Minutes To Train (Lower Is Better)
Object Detection (Heavy Weight) Mask R-CNN: NVIDIA GPU 18.47 | Google TPU 35.6
Reinforcement Learning MiniGo: NVIDIA GPU 13.57 | Intel CPU 14.43 (no TPU submission)
Object Detection (Light Weight) SSD: NVIDIA GPU 2.23 | Google TPU 1.21
Translation (Recurrent) GNMT: NVIDIA GPU 1.8 | Google TPU 2.11
Translation (Non-recurrent) Transformer: NVIDIA GPU 1.59 | Google TPU 0.85
Image Classification ResNet-50 v1.5: NVIDIA GPU 1.33 | Google TPU 1.28
MLPerf 0.6 Performance at Max Scale | MLPerf ID at Scale: RN50 v1.5: 0.6-30, 0.6-6 | Transformer: 0.6-28, 0.6-6 | GNMT: 0.6-26, 0.6-5 | SSD: 0.6-27, 0.6-6 | MiniGo: 0.6-11, 0.6-7 | Mask R-CNN: 0.6-23, 0.6-3
22
UP TO 80% MORE PERFORMANCE ON SAME SERVER
Software Innovation Delivers Continuous MLPerf Improvements
MLPerf on DGX-2 Server (7-Month Improvement): relative speedup from MLPerf 0.5 to MLPerf 0.6
Image Classification RN50 v1.5: 1.2x | Translation (Non-recurrent) Transformer: 1.3x | Object Detection (Light Weight) SSD: 1.2x | Translation (Recurrent) GNMT: 1.5x | Object Detection (Heavy Weight) Mask R-CNN: 1.8x
Comparing the throughput of a single DGX-2H server on a single epoch (Single pass of the dataset through the neural network) | MLPerf ID 0.5/0.6 comparison: ResNet50 v1.5: 0.5-20/0.6-30 | Transformer: 0.5-21/0.6-20
| SSD: 0.5-21/0.6-20 | GNMT: 0.5-19/0.6-20 | Mask R-CNN: 0.5-21/0.6-20
23
DRAMATICALLY MORE FOR YOUR MONEY
CPU-Only Cluster: 300 self-hosted Broadwell CPU servers, 180 kW (deep learning training, ResNet-50 image training)
GPU-Accelerated: 1 DGX-2, 10 kW (deep learning training, ResNet-50 image training)
SAME THROUGHPUT | 1/8 THE COST | 1/18 THE POWER | 1/30 THE SPACE
24
NVIDIA DGX-2
Designed To Train The Previously Impossible
• Sixteen NVIDIA Tesla V100 32GB GPUs on two HGX-2 GPU motherboards (8 GPUs and 6 NVSwitches per board), 512GB total HBM2 memory, interconnected by a plane card
• Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
• Eight EDR InfiniBand / 100 GigE: 1,600 Gb/sec total bidirectional bandwidth
• Two High-Speed Ethernet ports: 10/25/40/100 GigE
• Two Intel Xeon Platinum CPUs
• 1.5 TB System Memory
• 30 TB NVMe SSD internal storage
TESLA V100
TENSOR CORE GPU
World’s Most Powerful
Data Center GPU
5,120 CUDA cores
640 NEW Tensor cores
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS
| 125 Tensor TFLOPS
20MB SM RF | 16MB Cache
32 GB HBM2 @ 900GB/s |
300GB/s NVLink
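Frameworks typically exercise the Tensor Cores through mixed precision. Below is a hedged PyTorch sketch using automatic mixed precision; the model, sizes and data are placeholders, not anything from the slide.

```python
# Illustrative automatic mixed precision training step on a V100-class GPU.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(256, 1024, device="cuda")
target = torch.randn(256, 1024, device="cuda")

with torch.cuda.amp.autocast():                # matmuls run in FP16 on Tensor Cores
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()                  # loss scaling avoids FP16 underflow
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```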
26
NVSWITCH
World’s Highest Bandwidth
On-node Switch
7.2 Terabits/sec or 900 GB/sec
18 NVLINK ports | 50GB/s per
port bi-directional
Fully-connected crossbar
2 billion transistors |
47.5mm x 47.5mm package
27
WORLD RECORDS FOR CONVERSATIONAL AI
BERT Training and Inference Records
Largest Transformer-Based Model Ever Trained | Code Available on GitHub
EXPLODING MODEL SIZE (complexity to train), number of parameters by network: Image Recognition 26M | NLP (Q&A, Translation) 340M | NLP Generative Tasks (Chatbots, Auto-completion) 1.5Bn and 8.3Bn
CONVERSATIONAL AI RECORDS:
BERT-Large training record: 53 minutes
GPT-2 8B: largest Transformer-based model trained, 8.3Bn parameters
BERT-Base inference record: 2.2ms latency (18X faster than CPU)
Scaling training performance on BERT: near-linear speedup on 1x, 16x, 64x and 92x DGX-2H servers with 16 NVIDIA V100 GPUs each; near-linear scaling of training GPUs requires leading AI infrastructure
BERT-Large Training Record: 1,472 Tesla V100-SXM3-32GB 450W GPUs | 92 DGX-2H Servers | 8 Mellanox InfiniBand Adapters per node
BERT-Base Inference Record: SQuAD Dataset | Tesla T4 16GB GPU | CPU: Intel Xeon Gold 6240 & OpenVINO v2
28
ML/DL
INFRASTRUCTURE
29
AI PLATFORM CONSIDERATIONS
Factors impacting deep learning platform decisions
TOTAL COST OF OWNERSHIP: “I have limited budget, need lowest up-front cost possible”
SCALING PERFORMANCE: “I want the most GPU bang for the buck”
DEVELOPER PRODUCTIVITY: “Must get started now, line of business wants to deliver results yesterday”
30
COMPARING AI COMPUTE ALTERNATIVES
AI/DL Expertise &
Innovation
AI/DL Software Stack
Operating System Image
Hardware Architecture
Looking beyond the “spec sheet”
Evaluation Criteria
31
NVIDIA DGX POD™:
HIGH-DENSITY COMPUTE REFERENCE ARCHITECTURE
Nine DGX-1 Servers
• Eight Tesla V100 GPUs
• NVIDIA GPUDirect™ over RDMA support
• Run at MaxQ
• 100 GbE networking (up to 4 x 100 GbE)
Twelve Storage Nodes
• 192 GB RAM
• 3.8 TB SSD
• 100 TB HDD (1.2 PB Total HDD)
• 50 GbE networking
Network
• In-rack: 100 GbE to DGX-1 servers
• In-rack: 50 GbE to storage nodes
• Out-of-rack: 4 x 100 GbE (up to 8)
Rack
• 35 kW Power
• 42U x 1200 mm x 700 mm (minimum)
• Rear Door Cooler
Four-POD design with cooling (DGX-1 POD)
• NVIDIA DGX POD
• Supports scalability to hundreds of nodes
• Based on the proven SATURNV architecture
NVIDIA DGX
SUPERPOD
AI LEADERSHIP REQUIRES
AI INFRASTRUCTURE LEADERSHIP
Test Bed for Highest Performance Scale-Up Systems
• 9.4 PF on HPL | ~200 AI PF | #22 on Top500 list
• <2 mins To Train RN-50
Modular & Scalable GPU SuperPOD Architecture
• Built in 3 Weeks
• Optimized For Compute, Networking, Storage & Software
Integrates Fully Optimized Software Stacks
• Freely Available Through NGC
• 96 DGX-2H
• 10 Mellanox EDR IB per node
• 1,536 V100 Tensor Core
GPUs
• 1 megawatt of power
Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC
33
SUPPORTING AI: ALTERNATIVE APPROACHES
Installed/running → Problem! → multiple paths to problem resolution: open source / forums and server, storage & network solution providers
Framework? Libraries? O/S? GPU? Drivers? Server? Network? Storage?
34
SUPPORTING AI WITH DGX REFERENCE
ARCHITECTURE SOLUTIONS
IT Admin: “My PyTorch CNN model is running 30% slower than yesterday!”
NPN Partner, with NVIDIA AI expertise: “Update to PyTorch container XX.XX”
Problem! → Running!
DGX Reference Architecture solution with storage
35
THE NEW NGC
GPU-optimized Software Hub. Simplifying DL, ML and HPC Workflows
NGC
50+ Containers
DL, ML, HPC
Pre-trained Models
NLP, Classification, Object Detection & more
Industry Workflows
Medical Imaging, Intelligent Video Analytics
Model Training Scripts
NLP, Image Classification, Object Detection & more
Innovate Faster
Deploy Anywhere
Simplify Deployments
ngc.nvidia.com
Solving the complexity of managing
distributed computing on GPU
AGENDA
• Session 2:
○ Data Science as a service using GPUs
○ Demo
3
Iguazio: Integrated and Open Data Science Platform
Layers: ML Pipelines | Serverless Functions & Notebooks | Services | Persistent Data & GPU Compute | Shared Resources
Components: Pandas | Dask | TensorFlow | PyTorch | Spark | Presto | Grafana | Prometheus | RAPIDS | Jupyter Notebook | Nuclio | Kubeflow
Workloads: DL workloads | ML workloads | Model Inferencing | GPU sharing
DEMO
Q & A
6
§ A quick way for data scientists to work on a cluster of GPUs
o Built-in GPU integration
o No DevOps required
§ Frees GPU resources after a Jupyter notebook becomes idle
§ Maximizes the efficiency of GPU usage across the data science team
Optimize GPU sharing
Enabling GPU at scale
7
Supporting a DGX cluster
§ Running data science workloads on a DGX cluster
§ Running Jupyter, Spark, TensorFlow and distributed Python on a DGX cluster
§ Monitoring jobs at the cluster level
8
§ Models run as functions at scale on a GPU cluster
§ High-performance parallel
execution engine
§ Easy control of GPU resources
per function
§ Quick deployment of models from
Jupyter notebooks
Running models in an inferencing layer with GPU
Quick deployment of models in
a serving layer
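A model served this way boils down to a Python handler deployed as a Nuclio function. The sketch below uses Nuclio's standard handler signature; the model file and request shape are assumptions for illustration, not the demo code.

```python
# Hedged sketch of a Nuclio inference handler.
import json
import joblib

model = joblib.load("model.pkl")               # hypothetical model, loaded once per replica

def handler(context, event):
    features = json.loads(event.body)["features"]
    prediction = model.predict([features])
    return context.Response(body=json.dumps({"prediction": prediction.tolist()}),
                            content_type="application/json",
                            status_code=200)
```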
9
§ Self-service on a managed platform
§ Job scheduling
§ Cloud experience for on-prem
§ Full and open data science
environment at the click of a
button
§ Built-in integration for Jupyter and
GPU
Ease of management and orchestration
Easy access to GPU
10
§ Direct writes/reads into/from the GPU’s memory using RAPIDS data frames
o Users can read data from the database and analyze it directly on the GPU without any intermediate layer
§ Streaming data in chunks directly into the GPU
§ Full parallelism: multiple nodes can read data, each reading only its own shard
Advanced integration with RAPIDS
RAPIDS software stack: PYTHON and DASK on top | RAPIDS (CUDF, CUML, CUGRAPH) and DEEP LEARNING FRAMEWORKS (CUDNN) over APACHE ARROW on GPU memory | CUDA underneath
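The pattern described above, streaming chunks straight into GPU memory as RAPIDS dataframes, can be sketched generically as below; this is illustrative only and is not Iguazio's client API (file names are placeholders).

```python
# Generic sketch: each worker reads only its own shard, chunk by chunk, into cuDF.
import cudf

def read_shard_chunks(chunk_paths):
    """Yield each chunk of this worker's shard as a GPU-resident cuDF DataFrame."""
    for path in chunk_paths:
        yield cudf.read_csv(path)

frames = list(read_shard_chunks(["shard0-chunk0.csv", "shard0-chunk1.csv"]))
gdf = cudf.concat(frames, ignore_index=True)   # merged result never leaves the GPU
print(len(gdf))
```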
11
§ Iguazio’s serverless functions (Nuclio) improve GPU utilization and sharing, resulting in almost four times faster application performance compared to using NVIDIA GPUs within monolithic architectures
§ Linear scalability
Serverless & GPU – Better Performance
4x FASTER
12
How we Enable Large Scale Data Processing on GPUs
Raw Data (10s-1000s of terabytes) → Filter → Filtered (TBs) → Partition → Partitioned (100s of GBs) → Chunk → Chunked (1-10s of GBs) → Merge → Final Results
DB + native support: 10s of GBs
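A hedged dask-cudf sketch of the filter, partition, chunk and merge flow above; paths, column names and the filter predicate are placeholders.

```python
# Illustrative pipeline: dask-cudf partitions the data into GPU-sized chunks,
# aggregates each chunk, and merges the partial results.
import dask_cudf

raw = dask_cudf.read_csv("raw/*.csv")                            # raw data, many partitions
filtered = raw[raw["event_type"] == "purchase"]                  # filter step
filtered = filtered.repartition(npartitions=128)                 # partition/chunk step
final = filtered.groupby("user_id")["amount"].sum().compute()    # per-chunk sums merged
print(final.head())
```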
13
§ Speed up data science projects
o Immediate access to GPU (training and inferencing)
§ Increase overall GPU utilization (90%)
o Helping customers to maximize their GPU utilization
§ Fully managed PaaS with built-in GPU integration
o Application provisioning, orchestration and managed notebooks enabling training at scale on a
shared GPU cluster
o Tight integration with NVIDIA TensorRT, RAPIDS and DeepOps
§ Simplify management of GPUs & DGX
o Automated workflow for a continuous data science pipeline
§ Improved performance by 4x
o By creating a shared resource pool and load balancing across all GPUs
Value for Nvidia customers
14
§ Built-in GPU monitoring dashboard integrated with NVIDIA DeepOps
§ Advanced troubleshooting: identify which service/app is utilizing the GPU resource
Integrated GPU monitoring (coming soon)
anantg@iguazio.com | www.iguazio.com
Thank You