TAIPEI | SEP. 21-22, 2016
Eric Kang 康勝閔, Sep. 21, 2016
The NVIDIA DGX-1 Supercomputer, Artificial Intelligence, and Deep Learning
2
GPU Computing
NVIDIA
Computing for the Most Demanding Users
Computing Human Imagination
Computing Human Intelligence
3
DEEP LEARNING EVERYWHERE
INTERNET & CLOUD: Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation
MEDIA & ENTERTAINMENT: Video Captioning, Video Search, Real-Time Translation
AUTONOMOUS MACHINES: Pedestrian Detection, Lane Tracking, Traffic Sign Recognition
SECURITY & DEFENSE: Face Detection, Video Surveillance, Satellite Imagery
MEDICINE & BIOLOGY: Cancer Cell Detection, Diabetic Grading, Drug Discovery
4
DEEP LEARNING APPROACH
Train: labeled images (dog, cat, raccoon, dog) are fed through a DNN; errors are propagated back to update the network
Deploy: the trained DNN classifies new images (dog, cat, honey badger)
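The train/deploy split above can be made concrete with a toy sketch: a minimal NumPy "network" trained on labeled points, then deployed to classify new inputs. This is illustrative only; real workloads use a DNN framework on GPUs.

```python
# Toy illustration of the train-then-deploy workflow:
# train a tiny one-layer model on labeled examples,
# then deploy it to classify new inputs.
import numpy as np

rng = np.random.default_rng(0)

# --- Train: labeled examples (two synthetic classes) ---
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w = np.zeros(2)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # forward pass
    err = p - y                             # errors drive the update
    w -= 0.1 * X.T @ err / len(y)           # gradient step
    b -= 0.1 * err.mean()

# --- Deploy: the trained model classifies new inputs ---
def predict(x):
    return int(x @ w + b > 0)

print(predict(np.array([-2.0, -2.0])))  # class 0
print(predict(np.array([2.0, 2.0])))    # class 1
```

The feedback loop in the slide's diagram (errors flowing back into the DNN during training) is exactly the `err`-driven update above, scaled down to two parameters.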
5
“SUPERHUMAN” RESULTS SPARK HYPERSCALE ADOPTION
[Chart: ImageNet accuracy %; deep learning results by year: 2010 72%, 2011 74%, 2012 84%, 2013 88%, 2014 93%, 2015 96%]
Cloud Services with AI Powered by NVIDIA
Alibaba/Aliyun Amazon Baidu eBay Facebook
Flickr Google iFLYTEK iQIYI JD.com
Orange Periscope Pinterest Qihoo 360 Shazam
Skype Sogou Twitter Yahoo Supermarket Yandex Yelp
[Chart baselines: hand-coded CV and human, marked at 74% and 76%]
6
Sources: IDC Worldwide Big Data and Analytics 2016 Predictions, November 2015; IDC FutureScape: Worldwide Digital Strategy Consulting 2016 Predictions, November 2015.
“By 2020, 80% of Big Data and Analytics
deployments will need distributed micro
analytics and 40% of all business analytics
software will incorporate prescriptive
analytics built on cognitive computing
functionality. Both of these trends require a
dramatic increase in processing power that
could be enabled by GPUs.”
— IDC
“By 2018, over 50% of developer teams will
embed cognitive services in their apps (vs 1%
today) providing U.S. enterprises with over
$60 billion annual savings by 2020.”
— IDC
AI — THE NEXT TRILLION $ IT OPPORTUNITY
7
Deep Learning is a massive opportunity
Data Scientist productivity is vital
NVIDIA is the choice of the deep learning world
DGX-1 is fast, instantly productive
NVIDIA DGX-1
The Essential Tool of
Deep Learning Scientists
170 TFLOPS | 8x Tesla P100 16GB | NVLink Hybrid Cube Mesh
2x Xeon | 8 TB RAID 0 | Quad IB 100Gbps, Dual 10GbE | 3U
8
TESLA P100 WITH NVLINK
New GPU Architecture to Enable the World’s Fastest Compute Node
Pascal Architecture: highest compute performance
NVLink: GPU interconnect for maximum scalability
CoWoS HBM2: unifying compute and memory in a single package
Page Migration Engine: simple parallel programming with virtually unlimited memory (Unified Memory)
[Diagram: two CPUs, each behind a PCIe switch, connected to Tesla P100 GPUs]
9
Engineered for deep learning | 170TF FP16 | 8x Tesla P100
NVLink hybrid cube mesh | Accelerates major AI frameworks
NVIDIA DGX-1
WORLD’S FIRST DEEP LEARNING SUPERCOMPUTER
10
NVIDIA DEEP LEARNING SDK
High-Performance GPU Acceleration for Deep Learning
APPLICATIONS: computer vision (image classification, object detection); speech and audio (voice recognition, translation); behavior (recommendation engines, sentiment analysis)
FRAMEWORKS: Mocha.jl and other major frameworks
DEEP LEARNING SDK: cuDNN (deep learning); cuBLAS, cuSPARSE, cuFFT (math libraries); NCCL (multi-GPU)
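The multi-GPU piece of the SDK, NCCL, implements collectives such as all-reduce. Below is a minimal single-process sketch of the ring all-reduce pattern, with NumPy arrays standing in for per-GPU buffers; it is illustrative only, not the NCCL API.

```python
# Sketch of the ring all-reduce pattern that multi-GPU libraries
# such as NCCL implement; NumPy arrays stand in for GPU buffers.
import numpy as np

def ring_allreduce(buffers):
    """Every 'device' ends up with the element-wise sum of all buffers,
    exchanging only one chunk per neighbor per step."""
    n = len(buffers)
    chunks = [np.array_split(np.asarray(b, dtype=float), n) for b in buffers]

    # Phase 1: reduce-scatter. After n-1 steps, device i owns the
    # fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        sent = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            j = (i + 1) % n                 # next neighbor in the ring
            chunks[j][(i - step) % n] += sent[i]

    # Phase 2: all-gather. Each reduced chunk circulates around the
    # ring until every device holds every chunk.
    for step in range(n - 1):
        sent = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            j = (i + 1) % n
            chunks[j][(i + 1 - step) % n] = sent[i]

    return [np.concatenate(c) for c in chunks]

out = ring_allreduce([np.ones(8) * (d + 1) for d in range(4)])
print(out[0])  # every element is 1 + 2 + 3 + 4 = 10
```

The bandwidth-optimal property of this pattern is why it maps well onto the DGX-1's NVLink hybrid cube mesh: each step moves only 1/n of the data between adjacent devices.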
11
NVIDIA CUDNN
Building blocks for accelerating deep
neural networks on GPUs
High performance deep neural network
training and inference
Accelerates Caffe, CNTK, TensorFlow, Theano, and Torch
Performance continues to improve over
time
“NVIDIA has improved the speed of cuDNN
with each release while extending the
interface to more operations and devices
at the same time.”
— Evan Shelhamer, Lead Caffe Developer, UC Berkeley
developer.nvidia.com/cudnn
[Chart: relative AlexNet training speed-up, 2014-2016: K40 (cuDNN v1), M40 (cuDNN v3), Pascal (cuDNN v5); up to 12x]
AlexNet training throughput based on 20 iterations; CPU baseline: 1x E5-2680v3, 12 cores, 2.5 GHz.
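The building blocks cuDNN provides are operations like the one below: 2D convolution (cross-correlation, as deep learning frameworks define it). This is a direct NumPy reference for the primitive, not the cuDNN API; cuDNN replaces loops like these with tuned GPU kernels.

```python
# Direct NumPy reference for 'valid' 2D cross-correlation,
# the core primitive that cuDNN accelerates on the GPU.
import numpy as np

def conv2d(x, k):
    """Slide kernel k over input x; no padding, stride 1."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3))          # simple 3x3 box filter
print(conv2d(x, k))          # [[45. 54.] [81. 90.]]
```

A framework calls this operation millions of times per training run, which is why the per-release kernel tuning in the chart above translates directly into end-to-end speed-up.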
12
NVIDIA DIGITS
Interactive Deep Learning GPU Training System
Test Image
Process Data | Configure DNN | Monitor Progress | Visualize Layers
developer.nvidia.com/digits
github.com/NVIDIA/DIGITS
13
Instant productivity — plug-and-play, supports every AI framework
Performance optimized across
the entire stack
Always up-to-date via the cloud
Mixed framework environments — containerized
Direct access to NVIDIA experts
DGX STACK
Fully integrated Deep Learning platform
14
NVIDIA DOCKER ON GITHUB
15
NVIDIA IMAGES
Prebuilt and ready to use
16
DGX-1 CONTAINER LAUNCH FLOW
Customer data stays on premise
compute.nvidia.com (via web browser): node management, user authentication, Docker image push/pull, scheduler UI, HW/SW metrics
LOCAL LAN (all application data): NFS storage, DIGITS UI, interactive sessions
1. User schedules containers to run
3. User interacts with the application
17
DIGITS FOR DGX-1
A complete GPU-accelerated deep learning workflow: MANAGE / AUGMENT → TRAIN / TEST → DEPLOY
DIGITS, Model Zoo
Deployment targets via the GPU Inference Engine: data center, automotive, embedded
18
BUILT FOR THE DATA CENTER
24/7 Uptime: maximize reliability
Scalable Performance: boost data center throughput
Data Center Ready: simplify system operations
19
END-TO-END DESIGN FOR SYSTEM UPTIME
Guaranteed Quality: system qualification tests (thermal, stress, airflow rate, shock & vibe); system monitoring and management for Tesla only; dedicated technical staff for failure analysis
Extensive Qualification & Testing: long burn-in testing; zero error tolerance at aggressive clocks; even with differentiated engineering, 5% of GPUs are screened out
Differentiated Engineering: low operating voltage for long-term reliability; large guard-band for guaranteed quality; Error Correction Code (ECC) for data integrity
20
DYNAMIC PAGE RETIREMENT MAXIMIZES UPTIME
[Diagram: GPU memory with and without Dynamic Page Retirement]
GPU without Dynamic Page Retirement (DPR): an uncorrectable data error crashes the application, and the weak memory page stays active.
1. Users lose productivity as jobs continue to crash
2. IT managers must physically open the server and remove the bad GPU
3. Customer satisfaction risk with the RMA process
Tesla GPU with Dynamic Page Retirement: the weak memory page is retired.
1. Removes bad memory with a simple reboot
2. No physical work required for IT
3. Negligible impact: <0.01% of memory is retired
21
DATA CENTER QUALIFIED BY SERVER OEMS
Server with Tesla GPU: designed for maximum airflow through the GPU; supports airflow front-to-back and back-to-front; lower power consumption; GPU temperature running Linpack: 54C
Server with unqualified GPU: works against server airflow; higher power consumption; lower reliability; GPU temperature running Linpack: 71C
22
SCALE-OUT PERFORMANCE IN THE DATA CENTER
Up to 2x Faster Application Performance at Scale with GPUDirect RDMA
GPUDirect RDMA: direct transfers between GPUs
67% Lower GPU-to-GPU Latency
5x Higher GPU-to-GPU MPI Bandwidth
[Chart: HOOMD-blue LJ liquid benchmark, 256K particles; time steps per second vs. number of nodes (8, 16, 32, 64, 96), with and without RDMA]
23
NVLINK DELIVERS SCALABLE PERFORMANCE
More than 45x Faster with 8x P100 Interconnected with NVLink
[Chart: speed-up vs. dual-socket Haswell CPU (0x to 50x) for Caffe/AlexNet, VASP, HOOMD-blue, COSMO, MILC, Amber, and HACC; configurations: 2x K80 (M40 for AlexNet), 2x P100, 4x P100, 8x P100]
24
DATA CENTER GPU MANAGEMENT
Enterprise-Grade Management Tool for Operating the Data Center
Device Management (all GPUs supported): device identification, board monitoring, clock management; per-GPU configuration and monitoring
Data Center GPU Manager (Tesla GPUs only):
Active Health Monitoring: runtime health checks, prologue checks, epilogue checks
Diagnostics & System Validation: deep HW diagnostics, system validation tests
Policy & Group Config Management: pre-configured policies, job-level accounting, stateful configuration
Power & Clock Management: dynamic power capping, synchronous clock boost
25
DATA CENTER GPU MANAGER
Integrated into Leading Industry Tools for HPC
3rd-party software: Moab Cluster Suite, TORQUE, PBS Professional, IBM Platform HPC, IBM Platform LSF, Bright Cluster Manager, StackIQ Boss for HPC with CUDA Pallet, Grid Engine
THANK YOU
