TAIPEI | SEP. 21-22, 2016
Marc Hamilton, VP Solutions Architecture & Engineering
AI, A NEW COMPUTING MODEL
GPU Computing
NVIDIA
Computing for the Most Demanding Users
Computing Human Imagination
Computing Human Intelligence
DEEP LEARNING — A NEW COMPUTING MODEL
“Software that writes software”
[Diagram: an image feeds a LEARNING ALGORITHM, which outputs the caption “little girl is eating piece of cake”; the process consumes “millions of trillions of FLOPS”]
AI IS EVERYWHERE
“Find where I parked my car”
“Find the bag I just saw in this magazine”
“What movie should I watch next?”
TOUCHING OUR LIVES
Bringing grandmother closer to family by bridging the language barrier
Predicting a sick baby’s vitals like heart rate, blood pressure, and survival rate
Enabling the blind to “see” their surroundings and read emotions on faces
FUELING ALL INDUSTRIES
Increasing public safety with smart video surveillance at airports & malls
Providing intelligent services in hotels, banks, and stores
Separating weeds as it harvests, reducing chemical usage by 90%
DEEP LEARNING DEMANDS NEW CLASS OF HPC
TRAINING (Scalable Performance)
Billions of TFLOPS per training run
Years of compute-days on Xeon CPU
GPU turns years to days
INFERENCING (Throughput + Efficiency, scaling with data and users)
Billions of FLOPS per inference
Seconds for response on Xeon CPU
GPU for instant response
BAIDU DEEP SPEECH 2
12K neurons (2.5x Deep Speech 1)
100M parameters (4x Deep Speech 1)
15 exaFLOPs per training run (10x Deep Speech 1)
Super-human accuracy
2 Months on CPU Server | 2 Days on DGX-1
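A back-of-envelope check on those training times, reading the 15 exaFLOPs above as the total operation count for one training run; the 50% sustained-utilization figure and the ~3 TFLOPS sustained CPU-server throughput are assumptions, not slide data:

    # Rough check of "2 Months on CPU Server | 2 Days on DGX-1".
    total_ops = 15e18           # 15 exaFLOPs per training run (from the slide)
    dgx1_peak = 170e12          # DGX-1 FP16 peak, FLOP/s (from the slide)
    utilization = 0.5           # assumed sustained fraction of peak
    cpu_sustained = 3e12        # assumed sustained CPU-server throughput, FLOP/s

    print(total_ops / (dgx1_peak * utilization) / 86400, "days on DGX-1")  # ~2.0
    print(total_ops / cpu_sustained / 86400, "days on a CPU server")       # ~58, i.e. ~2 months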
Word Error Rate: DS2 5% | Human 6% | DS1 8%
“Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”, 12/2015 | Dataset: LibriSpeech test-clean
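For context on those percentages, word error rate is the word-level edit distance between the recognized text and a reference transcript, divided by the reference length; a minimal sketch:

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """Word-level Levenshtein distance divided by reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        return dp[-1][-1] / len(ref)

    # One dropped word out of an 8-word reference -> 12.5% WER
    print(word_error_rate("little girl is eating a piece of cake",
                          "little girl is eating piece of cake"))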
MODERN AI NEEDS NEW INFERENCE SOLUTION
User Experience: From Seconds to Instant
“Where is the nearest Szechuan restaurant?”
[Chart: user wait time in seconds for text after speech is complete, for networks up to Deep Speech 2; CPU: 2.2 s to 6 s, Pascal GPU: 0.1 s]
Deep Speech 2 inference performance on a 16-user server | CPU: 170 ms of estimated compute time per 100 ms of speech sample | Pascal GPU: 51 ms of compute per 100 ms of speech sample
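One plausible reading of those footnote numbers, assuming recognition streams concurrently with the speech: compute slower than real time builds a backlog, and that backlog is the wait the user sees once they stop talking. A sketch of that model (the utterance lengths are illustrative, not from the slide):

    def wait_after_speech(utterance_s: float, compute_ms_per_100ms_speech: float) -> float:
        # Real-time factor: compute seconds needed per second of speech.
        rtf = compute_ms_per_100ms_speech / 100.0
        # Compute faster than real time keeps up; anything slower accumulates.
        return max(0.0, (rtf - 1.0) * utterance_s)

    for secs in (3.0, 8.5):  # assumed utterance lengths
        print(f"{secs}s utterance: CPU waits {wait_after_speech(secs, 170):.1f}s, "
              f"Pascal GPU waits {wait_after_speech(secs, 51):.1f}s")
    # CPU (1.7x real time) lags ~0.7 s per second of speech, giving the 2.2-6 s
    # range above; the GPU (0.51x real time) keeps up, so the wait is near zero.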
NVIDIA DGX-1
AI Supercomputer-in-a-Box
170 TFLOPS | 8x Tesla P100 16GB | NVLink Hybrid Cube Mesh
2x Xeon | 8 TB RAID 0 | Quad IB 100Gbps, Dual 10GbE | 3U — 3200W
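The 170 TFLOPS headline is the combined FP16 peak of the eight Tesla P100s (21.2 TFLOPS each for the SXM2 part):

    p100_fp16_peak = 21.2        # Tesla P100 SXM2 FP16 peak, TFLOPS
    print(8 * p100_fp16_peak)    # 169.6, rounded to the quoted 170 TFLOPS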
“FIVE MIRACLES”
Pascal Architecture | 16nm FinFET | CoWoS with HBM2 | NVLink | New AI Algorithms
DGX-1 — A LEAGUE OF ITS OWN
[Chart: relative training performance, 1x to 16x, on ResNet, Inception v3, AlexNet, VGG, and MSR, for GeForce® GTX TITAN X, GeForce® GTX 1080, Tesla® P100, DIGITS™ DevBox (4x GeForce GTX TITAN X), Quadro® VCA (8x Quadro M6000), and DGX-1™ (8x Tesla P100)]
Preliminary numbers. Caffe on DeepMark. GeForce TITAN X and GTX 1080 system: Intel Core i7-5930K @ 3.5 GHz, 64 GB system memory | Tesla P100 (SXM2) system: dual-CPU server, Intel E5-2698 v4 @ 2.2 GHz, 256 GB system memory
DGX STACK
Fully integrated Deep Learning platform
Instant productivity — plug-and-play, supports every AI framework
Performance optimized across the entire stack
Always up-to-date via the cloud
Mixed framework environments — containerized
Direct access to NVIDIA experts
DGX — THE ESSENTIAL TOOL OF DEEP LEARNING SCIENTISTS
The platform of AI pioneers
Reduce training time from weeks to days
A 250-node HPC Supercomputer-in-a-Box
INTRODUCING NVIDIA TensorRT
High Performance Inference Engine
User Experience: Instant Response
45x faster with Pascal + TensorRT
Faster, more responsive AI-powered services such as voice recognition and speech translation
Efficient inference on images, video, & other data in hyperscale production data centers
[Chart: inference execution time; 1x CPU (14 cores): 260 ms, Tesla P4: 11 ms, Tesla P40: 6 ms]
Based on VGG-19 from IntelCaffe GitHub: https://github.com/intel/caffe/tree/master/models/mkl2017_vgg_19
CPU: IntelCaffe, batch size = 4, Intel E5-2690 v4, using Intel MKL 2017 | GPU: Caffe, batch size = 4, using TensorRT internal version
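Since those measurements use batch size 4, the latencies convert directly to throughput; the per-image framing below is mine, not the slide's:

    batch = 4  # batch size from the benchmark configuration above
    for system, latency_ms in [("1x CPU (14 cores)", 260), ("Tesla P4", 11), ("Tesla P40", 6)]:
        print(f"{system}: {batch / (latency_ms / 1000):,.0f} images/sec")
    # ~15 img/s on the CPU vs ~667 img/s on the P40: roughly the 45x gap claimed.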
NVIDIA DEEPSTREAM SDK
Delivering Video Analytics at Scale
Simple, high-performance API for analyzing video
Hardware decode of H.264, HEVC, MPEG-2, MPEG-4, VP9
CUDA-optimized resize and scale
TensorRT inference
[Pipeline diagram: hardware decode, preprocess, inference; output captions such as “Boy playing soccer”]
[Chart: concurrent 720p30 video streams analyzed: 1x Tesla P4 server + DeepStream SDK vs 13x E5-2650 v4 servers]
720p30 decode | IntelCaffe on dual-socket E5-2650 v4 CPU servers, Intel MKL 2017 | Based on GoogLeNet optimized by Intel: https://github.com/intel/caffe/tree/master/models/mkl2017_googlenet_v2
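The decode, preprocess, infer pattern the SDK accelerates looks roughly like the single-stream sketch below, written against OpenCV with a stub classifier; this illustrates the pattern only, not the DeepStream API itself, and the input file name is hypothetical:

    import cv2  # OpenCV stands in for DeepStream's hardware decode/resize stages

    def classify(batch):
        """Stub for the TensorRT inference stage; returns one label per frame."""
        return ["boy playing soccer"] * len(batch)  # placeholder output

    def analyze(path, size=(224, 224), batch_size=4):
        cap = cv2.VideoCapture(path)                 # decode (hardware in DeepStream)
        batch = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            batch.append(cv2.resize(frame, size))    # preprocess: resize/scale
            if len(batch) == batch_size:
                yield from classify(batch)           # batched inference
                batch = []
        if batch:
            yield from classify(batch)
        cap.release()

    for label in analyze("stream.mp4"):              # hypothetical input
        print(label)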
PIONEERS ADOPTING HPC FOR DEEP LEARNING
“Investments in computer systems — and I think the bleeding-edge of AI, and deep learning specifically, is shifting to HPC — can cut down the time to run an experiment from a week to a day and sometimes even faster.”
Dr. Andrew Ng, Chief Scientist, Baidu
END-TO-END DATA CENTER PRODUCT FAMILY
STRONG-SCALE HPC: data centers running HPC and DL apps scaling to multiple GPUs (Tesla P100 with NVLink)
MIXED-APPS HPC: HPC data centers running a mix of CPU and GPU workloads (Tesla P100 with PCI-E)
HYPERSCALE HPC: hyperscale deployment for deep learning training & inference (training: Tesla P100; inference: Tesla P40 & P4)
NVIDIA EXPERTISE AT EVERY STEP
Solution Architects: 1:1 support, network training setup, network optimization
Deep Learning Institute: certified expert instructors, worldwide workshops, online courses
GTC Conferences: epicenter of industry leaders, onsite training, global reach
Global Network of Partners: NVIDIA Partner Network, OEMs, startups
NVIDIA DEEP LEARNING PARTNERS
DL Frameworks | Enterprise DL Services | Core Analytics Tech | Graph Analytics | Data Management | Enterprises
MOST PERVASIVE HPC PLATFORM EVER BUILT
ACCESS ANYWHERE | BUY ANYWHERE | LEARN EVERYWHERE
300K CUDA developers | 240+ resellers worldwide | 1,000 universities teaching CUDA | 78 countries
TAIPEI | SEP. 21-22, 2016
THANK YOU
