[Chart: TeraFLOPS, GPU vs. CPU, 2003 to 2013 (y-axis 0 to 5)]
GTC — GROWING AND EXPANDING 
2010: 397
2012: 429
2014: 729
FASTEST GROWING TOPICS 
Big Data Analytics 
Machine Learning 
Computer Vision 
FASTEST GROWING TOPICS 
Energy Exploration 
Life Science & Genomics 
Molecular Dynamics 
#1 TOPIC 
HPC / Supercomputing
2012 
2013 
2014 
FOSTERING THE GPU ECOSYSTEM Big Data / Cloud / Computer Vision 
AudioStreamTV
CUDA EVERYWHERE
Takayuki Aoki 
Global Scientific Information and Computing Center Tokyo Institute of Technology 
“Large-scale CFD Applications and a Full GPU Implementation of a Weather Prediction Code on the TSUBAME Supercomputer”
BANDWIDTH BOTTLENECKS 
[Diagram: CPU and GPU connected by PCIe]
PCI Express: 16GB/sec
CPU Memory: 60GB/sec
GPU Memory: 288GB/sec
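The gap between these three figures can be made concrete with a little arithmetic. The sketch below assumes the labels and values on the slide pair up in the order they appear (PCI Express 16, CPU Memory 60, GPU Memory 288 GB/sec). It treats each link as ideal, with no latency or protocol overhead. The 12 GB payload and the function names are hypothetical, chosen only for illustration.

```python
# Bandwidth figures from the slide, in GB/sec. The label-to-value pairing
# is an assumption based on the order in which the items appear.
BANDWIDTH_GB_PER_SEC = {
    "PCI Express": 16.0,
    "CPU Memory": 60.0,
    "GPU Memory": 288.0,
}


def transfer_seconds(size_gb: float, link: str) -> float:
    """Ideal time to move size_gb gigabytes over a link (no latency, no overhead)."""
    return size_gb / BANDWIDTH_GB_PER_SEC[link]


def slowdown_vs(link: str, reference: str = "GPU Memory") -> float:
    """How many times lower the link's bandwidth is than the reference's."""
    return BANDWIDTH_GB_PER_SEC[reference] / BANDWIDTH_GB_PER_SEC[link]


if __name__ == "__main__":
    payload_gb = 12.0  # hypothetical working set, for illustration only
    for link in BANDWIDTH_GB_PER_SEC:
        ms = transfer_seconds(payload_gb, link) * 1000.0
        print(f"{link:12s} {ms:8.1f} ms  ({slowdown_vs(link):.1f}x below GPU memory)")
```

Under these assumptions, data that the GPU can stream from its own memory arrives 18 times more slowly over PCI Express, which is the bottleneck the slide title refers to.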
INTRODUCING NVLINK 
CPU 
GPU 
PCIe 
Differential with embedded clock 
PCIe programming model (w/ DMA+) 
Unified Memory 
Cache coherency in Gen 2.0 
5 to 12X PCIe
5X More Bandwidth for Multi-GPU Scaling 
GPU 
PCIe SWITCH 
CPU 
GPU 
GPU 
GPU
3D MEMORY 
3D Chip-on-Wafer integration 
Many X bandwidth 
2.5X capacity 
4X energy efficiency 
[Chart: Memory Bandwidth, 2008 to 2016 (y-axis 0 to 1200)]
Blaise Pascal 
1623-1662 
Mechanical Calculator 
Probability Theory 
Pascal’s Theorem 
Pascal’s Law
PASCAL 
NVLink 
3D Memory 
Module 
5 to 12X PCIe 3.0 
2 to 4X memory BW & size 
1/3 size of PCIe card
GPU ROADMAP
[Chart: SGEMM / W Normalized, 2008 to 2016 (y-axis 0 to 20)]
Tesla: CUDA
Fermi: FP64
Kepler: Dynamic Parallelism
Maxwell: DX12
Pascal: Unified Memory, 3D Memory, NVLink
MACHINE LEARNING 
Branch of Artificial Intelligence 
Computers that learn from data 
[Example images labeled: person, car, helmet, motorcycle, bird, frog, person, dog, chair, person, hammer, flower pot, power drill]
Machine Learning using Deep Neural Networks 
Input 
Result
Building High-level Features Using Large Scale Unsupervised Learning 
Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng 
Stanford / Google 
1 billion connections 
10 million 200x200 pixel images 
1,000 machines (16,000 cores) 
3 days
1,000 CPU Servers 2,000 CPUs • 16,000 cores 
600 kWatts 
$5,000,000 
GOOGLE BRAIN 
Today’s Largest Networks 
1B connections 
10M images 
~3 days 
~30 ExaFLOPS 
Human Brain 
~100B neurons x 1000 connections 
500M images 
5,000,000X “Google Brain” 
~150 YottaFLOPS 
~40,000 “Google Brain-Years” 
SOURCE: Ian Goodfellow
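The figures on this slide are mutually consistent if the training cost is assumed to scale with connections times images. That scaling rule is an assumption made here, not something the slide states. The slide also appears to use "FLOPS" for a total operation count rather than a rate. A minimal check, with illustrative variable names:

```python
# Figures from the slide (attributed there to Ian Goodfellow).
NET_CONNECTIONS = 1_000_000_000        # "Google Brain": 1B connections
NET_IMAGES = 10_000_000                # 10M images
NET_DAYS = 3                           # ~3 days of training
NET_EXAFLOPS = 30                      # ~30 Exa (10^18) operations in total

HUMAN_NEURONS = 100_000_000_000        # ~100B neurons
CONNECTIONS_PER_NEURON = 1_000         # x 1000 connections
HUMAN_IMAGES = 500_000_000             # 500M images

EXA_PER_YOTTA = 10**6                  # yotta = 10^24, exa = 10^18

# Assumed scaling: cost grows with (connections) x (images).
connection_ratio = HUMAN_NEURONS * CONNECTIONS_PER_NEURON // NET_CONNECTIONS
image_ratio = HUMAN_IMAGES // NET_IMAGES
scale = connection_ratio * image_ratio

total_yotta = NET_EXAFLOPS * scale / EXA_PER_YOTTA
brain_years = NET_DAYS * scale / 365

if __name__ == "__main__":
    print(f"connections: {connection_ratio:,}x, images: {image_ratio}x")
    print(f"scale: {scale:,}x 'Google Brain'")
    print(f"total: ~{total_yotta:.0f} Yotta operations")
    print(f"time:  ~{brain_years:,.0f} 'Google Brain-Years'")
```

With this rule the scale factor comes out at exactly 5,000,000, the total at 150 Yotta, and the time at roughly 41,000 years, which the slide rounds to ~40,000.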
Deep Learning with COTS HPC Systems 
A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro 
Stanford / NVIDIA • ICML 2013 
STANFORD AI LAB 
3 GPU-Accelerated Servers 12 GPUs • 18,432 cores 
4 kWatts 
$33,000 
“Now You Can Build Google’s $1M Artificial Brain on the Cheap” -Wired
1,000 CPU Servers 2,000 CPUs • 16,000 cores 
600 kWatts 
$5,000,000 
GOOGLE BRAIN
DEMO: MACHINE LEARNING, SIMPLE TRAINING SET
DEMO: MACHINE LEARNING, NYU OVERFEAT
Image training set: 1.2M
Classes: 1000
Weeks of training: 2
GPUs: 7
EXAFLOPS total to train: 25
CUDA for MACHINE LEARNING 
Talks @ GTC 
Image Detection 
Face Recognition 
Gesture Recognition 
Video Search & Analytics 
Speech Recognition & Translation 
Recommendation Engines 
Indexing & Search 
Use Cases 
Early Adopters 
Image Analytics for Creative Cloud 
Image Classification 
Speech/Image Recognition 
Recommendation 
Hadoop 
Search Rankings
Big Data & Infinite Compute Turbocharge Deep Learning 
SOURCE: KPCB/Mary Meeker, company data. Unstructured data: IDC's Digital Universe Study. 
800M photos uploaded per day 
100 hours of video uploaded per minute 
Unstructured data exploding 
[Chart: photos uploaded per day, in millions, 2007 to 2014 (y-axis 0 to 900): Facebook, Instagram, Snapchat, Flickr]
[Chart: Hours (YouTube), 2007 to 2013 (y-axis 0 to 120)]
[Chart: Exabytes of data (y-axis 0 to 6,000): 2010: 1,104; 2015: 5,379]
DEMO: TITAN Z REVEAL
5,760 CUDA cores 
12GB memory 
8 TeraFLOPS 
$2999
STANFORD AI LAB 
1 Titan Z-Accelerated Server 3 Titan Zs • 17,280 cores 
2 kWatts $12,000 
1,000 CPU Servers 2,000 CPUs • 16,000 cores 
600 kWatts 
$5,000,000 
GOOGLE BRAIN 
300X energy efficiency 
400X lower cost 
Fits next to a desk
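The headline ratios on this slide follow from the listed specifications. The sketch below recomputes them. It assumes that "energy efficiency" here means the ratio of power draw for the same job, and that the Titan Z core count is three cards at 5,760 CUDA cores each. The class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    cores: int
    kilowatts: float
    cost_usd: float


# Figures as listed on the slides.
GOOGLE_BRAIN = System("Google Brain", 16_000, 600.0, 5_000_000.0)
TITAN_Z_SERVER = System("Titan Z server", 3 * 5_760, 2.0, 12_000.0)


def power_ratio(big: System, small: System) -> float:
    """Ratio of power draw; equals an efficiency gain only if both do the same job."""
    return big.kilowatts / small.kilowatts


def cost_ratio(big: System, small: System) -> float:
    """Ratio of purchase cost."""
    return big.cost_usd / small.cost_usd


if __name__ == "__main__":
    print(f"cores:  {TITAN_Z_SERVER.cores:,}")
    print(f"power:  {power_ratio(GOOGLE_BRAIN, TITAN_Z_SERVER):.0f}x lower")
    print(f"cost:   {cost_ratio(GOOGLE_BRAIN, TITAN_Z_SERVER):.0f}x lower")
```

The power ratio is exactly 300. The cost ratio works out to about 417, which the slide rounds down to 400X.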
RenderMan with programmable shading 
1.5 hours to render each frame 
CCI 6/32 minicomputer 
First CGI Film Nominated for an Academy Award®
State-of-the-art water simulator
48 hours to simulate the base water
250 hours to render each frame
2013 Academy Award® Winner BEST VISUAL EFFECTS
DEMO: WHALE
DEMO: FLEX
DEMO: FLAMEWORKS
DEMO: UE4
One is a photo, One is Iray…
Bunkspeed 
Maya 
Catia 
3ds Max 
IRAY VCA SCALABLE GPU RENDERING APPLIANCE 
GPUs: 8 Kepler-class
GPU memory: 12GB per GPU
CUDA cores: 23,040
Network: 2 x 1GigE, 2 x 10GigE, 1 x InfiniBand
DEMO: IRAY / HONDA
[Chart: Relative Performance (y-axis 0 to 80): CPU-only Workstation, Quadro K5000 Workstation, Iray VCA]
Bunkspeed 
Maya 
Catia 
3ds Max 
IRAY VCA SCALABLE GPU RENDERING APPLIANCE 
MSRP $50,000
GRID GPU in the Cloud
Ben Fathi 
Chief Technology Officer 
Horizon DaaS Platform
Mobile CUDA
“10 of the Top 10” Greenest Supercomputers Powered by CUDA GPUs
Unify GPU and Tegra Architecture 
192 fully programmable CUDA cores 
326 GFLOPS 
4X energy efficiency over A15 
TEGRA K1 Mobile Super Chip 
GPU ARCHITECTURE: Tesla, Fermi, Kepler, Maxwell
MOBILE ARCHITECTURE: Tegra 3, Tegra 4, Tegra K1
Computer Vision on CUDA 
Feature Detection / Tracking 
~30 GFLOPS @ 30 Hz 
Object Recognition / Tracking 
~180 GFLOPS @ 30 Hz 
3D Scene Interpretation 
~280 GFLOPS @ 30 Hz
JETSON TK1 1st MOBILE SUPERCOMPUTER FOR EMBEDDED SYSTEMS 
192 CUDA cores 
326 GFLOPS 
VisionWorks SDK 
$192
VISIONWORKS 
COMPUTER VISION ON CUDA 
Driver Assistance Computational Photography 
Augmented Reality Robotics 
CUDA 
Jetson TK1 
Your Code
Sample Pipelines: Object Detection / Tracking, Structure from Motion …
VisionWorks Primitives: Classifier, Corner Detection …
TEGRA ROADMAP
[Chart: Single Precision GFLOPS / W Normalized, 2011 to 2015 (y-axis 0 to 80)]
Tegra 2
Tegra 3
Tegra 4
Tegra K1: Kepler GPU, CUDA, 64b & 32b CPU
Erista: Maxwell GPU
Andreas Reich 
Head of Audi Pre-Development
VIDEO: AUDI ADAS
CUDA EVERYWHERE 
PASCAL 
PC 
CLOUD 
MOBILE
DEMO: PORTAL ON SHIELD
GPU Technology Conference 2014 Keynote
