Tesla Accelerated Computing Platform

Axel Koehler from Nvidia presented this deck at the 2016 HPC Advisory Council Switzerland Conference.

“Accelerated computing is transforming the data center, delivering unprecedented throughput and enabling new discoveries and services for end users. This talk will give an overview of the NVIDIA Tesla accelerated computing platform, including the latest developments in hardware and software. In addition, it will show how deep learning on GPUs is changing how we use computers to understand data.”


In related news, the GPU Technology Conference takes place April 4-7 in Silicon Valley.

Watch the video presentation: http://insidehpc.com/2016/03/tesla-accelerated-computing/

See more talks in the Swiss Conference Video Gallery:
http://insidehpc.com/2016-swiss-hpc-conference/

Sign up for our insideHPC Newsletter:
http://insidehpc.com/newsletter

Transcript:

  1. HPC Advisory Council Meeting Lugano | 22 March 2016. The Tesla Accelerated Computing Platform. Axel Koehler, Principal Solution Architect
  2. Agenda: Introduction; TESLA Platform for HPC; TESLA Platform for HYPERSCALE; TESLA Platform for MACHINE LEARNING; TESLA System Software and Tools (Data Center GPU Manager, Docker)
  3. ENTERPRISE | AUTO | GAMING | DATA CENTER | PRO VISUALIZATION
  4. TESLA PLATFORM PRODUCT STACK. HPC: Accelerated Computing Toolkit (software), Tesla K80 (accelerator). Enterprise Virtualization: GRID 2.0 (software), Tesla M60 and M6 (accelerators). Hyperscale / Web Services: Hyperscale Suite (software), Tesla M40 for DL training and Tesla M4 (accelerators). System tools & services across the stack: Enterprise Services · Data Center GPU Manager · Mesos · Docker
  5. TESLA PLATFORM FOR HPC
  6. HETEROGENEOUS COMPUTING MODEL: the CPU is optimized for serial tasks, the GPU accelerator for parallel tasks; complementary processors work together.
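     The complementary split on slide 6 is what CUDA exposes directly: serial setup runs on the CPU (host), the data-parallel part runs as a GPU kernel. A minimal CUDA C sketch, illustrative rather than taken from the deck:

         // CPU (host) sets up data and launches work; GPU (device) runs the parallel part.
         #include <cuda_runtime.h>
         #include <cstdio>

         __global__ void saxpy(int n, float a, const float *x, float *y) {
             int i = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per element
             if (i < n) y[i] = a * x[i] + y[i];
         }

         int main() {
             const int n = 1 << 20;
             float *x, *y;
             cudaMallocManaged(&x, n * sizeof(float));         // unified memory, visible to CPU and GPU
             cudaMallocManaged(&y, n * sizeof(float));
             for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }   // serial setup on the CPU

             saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);   // parallel work on the GPU
             cudaDeviceSynchronize();
             printf("y[0] = %f\n", y[0]);                      // expect 4.0
             cudaFree(x); cudaFree(y);
             return 0;
         }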
  7. COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS (x86): Libraries (e.g. AmgX, cuBLAS), Programming Languages, and Compiler Directives.
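     Of the three routes on slide 7, the compiler-directive one is the least invasive: an existing loop is annotated and an OpenACC-capable compiler offloads it to the GPU. A hedged sketch (the loop itself is illustrative); the library route would instead call a drop-in routine such as cublasSaxpy from cuBLAS.

         // Compiler-directive approach (OpenACC): annotate the loop, let the compiler offload it.
         #include <stdio.h>

         int main(void) {
             enum { N = 1 << 20 };
             static float x[N], y[N];
             for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

             #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
             for (int i = 0; i < N; ++i)
                 y[i] = 2.0f * x[i] + y[i];

             printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
             return 0;
         }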
  8. 370 GPU-Accelerated Applications: www.nvidia.com/appscatalog
  9. TESLA K80: World's Fastest Accelerator for HPC & Data Analytics. 5x faster AMBER performance, cutting simulation time from 1 month to 1 week (AMBER PME-JAC-NVE benchmark, 1 microsecond of simulation; chart compares number of days on a Tesla K80 server vs. a dual-CPU server; CPU: E5-2698v3 @ 2.3 GHz, 64 GB system memory, CentOS 6.2). Specs: CUDA Cores 2496, Peak DP 1.9 TFLOPS (2.9 TFLOPS with Boost), GDDR5 Memory 24 GB, Bandwidth 480 GB/s, Power 300 W, dynamic GPU Boost.
  10. VISUALIZE DATA INSTANTLY FOR FASTER SCIENCE. Traditional workflow (slower time to discovery): a CPU supercomputer plus a separate viz cluster; simulation takes a week, data transfer days, visualization a day; over multiple iterations, time to discovery is measured in months. Tesla Platform (faster time to discovery): a GPU-accelerated supercomputer lets you visualize while you simulate, without data transfers, and restart simulations instantly; over multiple iterations, time to discovery drops to weeks. Flexible, scalable, interactive.
  11. EGL CONTEXT MANAGEMENT. Today's top systems support OpenGL under X; EGL provides driver-based context management instead ("leaving it to the driver": ParaView/VMD talk to the Tesla driver with EGL on a Tesla GPU, with no X server). Support for full OpenGL*, not only OpenGL ES; available in e.g. VTK; new opportunities for CUDA/OpenGL** interop. (*Full OpenGL in r355.11; **CUDA interop in r358.7.)
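     A hedged sketch of what "driver-based context management" looks like in practice: creating a desktop OpenGL context through EGL's device platform with no X server. The functions and enums are standard EGL / EGL-extension entry points; error handling is omitted.

         #include <EGL/egl.h>
         #include <EGL/eglext.h>

         int main(void) {
             // Load the device-platform extension entry points and enumerate GPUs
             PFNEGLQUERYDEVICESEXTPROC queryDevices =
                 (PFNEGLQUERYDEVICESEXTPROC)eglGetProcAddress("eglQueryDevicesEXT");
             PFNEGLGETPLATFORMDISPLAYEXTPROC getPlatformDisplay =
                 (PFNEGLGETPLATFORMDISPLAYEXTPROC)eglGetProcAddress("eglGetPlatformDisplayEXT");

             EGLDeviceEXT devices[8];
             EGLint numDevices = 0;
             queryDevices(8, devices, &numDevices);

             // Get an EGL display for the first GPU and initialize it
             EGLDisplay dpy = getPlatformDisplay(EGL_PLATFORM_DEVICE_EXT, devices[0], NULL);
             eglInitialize(dpy, NULL, NULL);

             // Choose a config and create a full desktop OpenGL context (not GL ES)
             EGLint cfgAttribs[] = { EGL_SURFACE_TYPE, EGL_PBUFFER_BIT,
                                     EGL_RENDERABLE_TYPE, EGL_OPENGL_BIT, EGL_NONE };
             EGLConfig cfg; EGLint numCfg;
             eglChooseConfig(dpy, cfgAttribs, &cfg, 1, &numCfg);
             eglBindAPI(EGL_OPENGL_API);
             EGLContext ctx = eglCreateContext(dpy, cfg, EGL_NO_CONTEXT, NULL);

             // Make it current without a window surface; render into framebuffer objects
             eglMakeCurrent(dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, ctx);
             /* ... OpenGL rendering, e.g. by ParaView/VTK, goes here ... */
             eglTerminate(dpy);
             return 0;
         }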
  12. SCALABLE RENDERING AND COMPOSITING: NVIDIA IndeX. Large-scale (volume) data visualization; interactive visualization of terabytes of data; stand-alone or coupled into a simulation; hardware-accelerated remote rendering; a plugin for ParaView is available. http://www.nvidia-arc.com/products/nvidia-index.html (shown: dataset from NCSA Blue Waters)
  13. NVLINK: A HIGH-SPEED GPU INTERCONNECT (whitepaper: http://www.nvidia.com/object/nvlink.html). Two configurations: GPU to CPU via NVLink (a Pascal GPU with 16-32 GB HBM at ~1 TB/s linked to an NVLink-enabled CPU with 10s-100s GB of DDR4 at 50-75 GB/s) and GPU to GPU via NVLink (Pascal GPUs connected by NVLink, attached to an x86 CPU through a PCIe switch).
  14. U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS powered by the Tesla Platform: 100-300 PFLOPS peak, 10x in scientific app performance, IBM POWER9 CPU + NVIDIA Volta GPU, NVLink high-speed interconnect, 40 TFLOPS per node across >3,400 nodes, arriving in 2017. A major step forward on the path to exascale.
  15. TESLA PLATFORM FOR HYPERSCALE
  16. EXABYTES OF CONTENT PRODUCED DAILY: user-generated content dominates web services. 10M users and 40 years of video/day; 1.7M broadcasters, with users watching 1.5 hours/day; 6B queries/day, 10% using speech; 270M items sold/day, 43% on mobile devices; 8B video views/day, 400% growth in 6 months; 300 hours of video/minute, 50% on mobile devices. Challenge: harnessing the data tsunami in real time.
  17. TESLA FOR HYPERSCALE (10M users and 40 years of video/day; 270M items sold/day, 43% on mobile devices). TESLA M40 (POWERFUL: fastest deep learning performance), TESLA M4 (LOW POWER: highest hyperscale throughput), and the HYPERSCALE SUITE: GPU-accelerated FFmpeg, Image Compute Engine, GPU REST Engine.
  18. GPU REST Engine (GRE) SDK: accelerated microservices for web and mobile applications. Supercomputer performance for hyperscale datacenters; powerful nodes with low response time (~10 ms over HTTP); easy to develop new microservices; open source and integrates with existing infrastructure; easy to deploy and scale with a ready-to-run Docker file. Example services: image classification, speech recognition, image scaling, and more. developer.nvidia.com/gre
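     Since GRE services sit behind plain HTTP, a client simply POSTs its payload to the service endpoint; for instance (hypothetical host, port and path, shown only to illustrate the request shape):

         curl -X POST --data-binary @cat.jpg http://localhost:8000/api/classify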
  19. TESLA M4: highest-throughput hyperscale workload acceleration. Specs: CUDA Cores 1024, Peak SP 2.2 TFLOPS, GDDR5 Memory 4 GB, Bandwidth 88 GB/s, Form Factor PCIe low profile, Power 50-75 W. Target workloads: video transcode (H.264 & H.265, SD & HD), video processing (stabilization and enhancements), image processing (resize, filter, search, auto-enhance), and machine learning inference.
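     As one concrete example of the video-transcode workload the M4 targets, an FFmpeg build with NVENC support can push H.264 encoding onto the GPU (encoder name and flags vary by FFmpeg build; this is an illustrative invocation, not the deck's tuned pipeline):

         ffmpeg -i input.mp4 -c:v h264_nvenc -preset fast -b:v 5M output.mp4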
  20. JETSON TX1: embedded deep learning. Unmatched performance under 10 W; advanced tech for autonomous machines; smaller than a credit card. Specs: GPU 1 TFLOP/s 256-core Maxwell, CPU 64-bit ARM A57, Memory 4 GB LPDDR4 | 25.6 GB/s, Storage 16 GB eMMC, Wifi/BT 802.11 2x2 ac / BT ready, Networking 1 Gigabit Ethernet, Size 50 mm x 87 mm, Interface 400-pin board-to-board connector.
  21. HYPERSCALE DATACENTER NOW ACCELERATED by the Tesla Platform: servers for training scale with data (exabytes of content per day go in, a trained model comes out), and servers for inference and web services scale with users (the model is deployed on every server, serving billions of devices).
  22. TESLA PLATFORM FOR MACHINE LEARNING
  23. DEEP LEARNING EVERYWHERE. Internet & Cloud: image classification, speech recognition, language translation, language processing, sentiment analysis, recommendation. Media & Entertainment: video captioning, video search, real-time translation. Autonomous Machines: pedestrian detection, lane tracking, traffic-sign recognition. Security & Defense: face detection, video surveillance, satellite imagery. Medicine & Biology: cancer cell detection, diabetic grading, drug discovery.
  24. Why is Deep Learning Hot Now? Three factors: big data availability (350 million images uploaded per day, 2.5 petabytes of customer data hourly, 300 hours of video uploaded every minute), new ML techniques, and GPU acceleration.
  25. WHAT IS DEEP LEARNING? Example: an image is mapped to the label "Volvo XC90". Image source: "Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks", ICML 2009 & Comm. ACM 2011; Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng.
  26. Cars That See Better ... And Learn: camera inputs feed the DRIVE PX auto-pilot car computer, which runs a neural net model, trained on an NVIDIA GPU deep learning supercomputer, to produce classified objects.
  27. Deep Learning Platform in Medical: a medical compute center trains on camera and medical-device inputs to produce a neural net model; the hospital/doctor side runs inference on device inputs to obtain classified objects, with feedback flowing back into training.
  28. GPUS AND DEEP LEARNING: neural networks and GPUs are a natural match, both being inherently parallel and dominated by matrix operations, FLOPS and bandwidth. GPUs deliver the same or better prediction accuracy, faster results, a smaller footprint, and lower power.
  29. NVIDIA GPU: THE ENGINE OF DEEP LEARNING. Frameworks running on the NVIDIA CUDA accelerated computing platform: WATSON, CHAINER, THEANO, MATCONVNET, TENSORFLOW, CNTK, TORCH, CAFFE.
  30. cuDNN: Deep Learning Primitives, igniting artificial intelligence. GPU-accelerated deep learning subroutines; high-performance neural network training; accelerates major deep learning frameworks (Caffe, Theano, Torch); up to 3.5x faster AlexNet training in Caffe than a baseline GPU implementation; tiled FFT up to 2x faster than FFT. Chart: millions of images trained per day rising steadily from cuDNN 1 to cuDNN 4. developer.nvidia.com/cudnn
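     The frameworks above call cuDNN for layer primitives rather than hand-written kernels. A hedged sketch of the convolution-forward path using the cuDNN v5-era C API (setter signatures shift slightly between cuDNN versions; error checks and data initialization omitted):

         #include <cudnn.h>
         #include <cuda_runtime.h>

         int main(void) {
             cudnnHandle_t h;
             cudnnCreate(&h);

             // Input: batch of 32 RGB 224x224 images, NCHW float
             cudnnTensorDescriptor_t xDesc, yDesc;
             cudnnCreateTensorDescriptor(&xDesc);
             cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 32, 3, 224, 224);

             // Filters: 64 output channels, 3x3 kernels over 3 input channels
             cudnnFilterDescriptor_t wDesc;
             cudnnCreateFilterDescriptor(&wDesc);
             cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 64, 3, 3, 3);

             // Convolution: padding 1, stride 1
             cudnnConvolutionDescriptor_t convDesc;
             cudnnCreateConvolutionDescriptor(&convDesc);
             cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1, CUDNN_CROSS_CORRELATION);

             // Let cuDNN report the output shape, then describe the output tensor
             int n, c, ho, wo;
             cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &ho, &wo);
             cudnnCreateTensorDescriptor(&yDesc);
             cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, ho, wo);

             // Device buffers (left uninitialized in this sketch)
             float *x, *w, *y;
             cudaMalloc(&x, sizeof(float) * 32 * 3 * 224 * 224);
             cudaMalloc(&w, sizeof(float) * 64 * 3 * 3 * 3);
             cudaMalloc(&y, sizeof(float) * n * c * ho * wo);

             // Forward convolution with an algorithm that needs no extra workspace
             const float alpha = 1.0f, beta = 0.0f;
             cudnnConvolutionForward(h, &alpha, xDesc, x, wDesc, w, convDesc,
                                     CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
                                     NULL, 0, &beta, yDesc, y);

             cudaFree(x); cudaFree(w); cudaFree(y);
             cudnnDestroy(h);
             return 0;
         }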
  31. NVIDIA DIGITS: Interactive Deep Learning GPU Training System. Workflow: process data, configure the DNN, monitor progress, visualize layers, test images. http://developer.nvidia.com/digits
  32. TESLA M40: World's Fastest Accelerator for Deep Learning Training. 13x faster Caffe training, reducing training time from 13 days to just 1 day (chart compares number of days on a GPU server with 4x TESLA M40 vs. a dual-CPU server; Caffe benchmark with AlexNet, CPU server uses 2x E5-2680v3 12-core 2.5 GHz CPUs, 128 GB system memory, Ubuntu 14.04). Specs: CUDA Cores 3072, Peak SP 7 TFLOPS, GDDR5 Memory 12 GB, Bandwidth 288 GB/s, Power 250 W.
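     For reference, multi-GPU training of this kind is launched from the standard Caffe command line, assuming a Caffe build with multi-GPU support (the solver path below is the stock AlexNet example, used here only as an illustration):

         caffe train --solver=models/bvlc_alexnet/solver.prototxt --gpu all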
  33. Facebook's deep learning machine: purpose-built for deep learning training; 2x faster training for faster deployment; 2x larger networks for higher accuracy; powered by eight Tesla M40 GPUs; Open Rack compliant. "Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models" (Serkan Piantino, Engineering Director of Facebook AI Research).
  34. Designed for AI computing at large scale, built on the NVIDIA Tesla Platform: 8 Tesla M40s deliver an aggregate 96 GB of GDDR5 memory and 56 teraflops of SP performance; leverages the world's leading deep learning platform to tap into frameworks such as Torch and libraries such as cuDNN. Operational efficiency and serviceability: a free-air cooled design optimizes thermal and power efficiency; components are swappable without tools; configurable PCIe for versatility.
  35. NCCL: Accelerating Multi-GPU Communications for Deep Learning (github.com/NVIDIA/nccl). Goal: build a research library of accelerated collectives that is easily integrated and topology-aware, so as to improve the scalability of multi-GPU applications. Approach: pattern the library after MPI's collectives; handle intra-node communication in an optimal way; provide the necessary functionality for MPI to build on top of for inter-node communication.
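     A hedged sketch of the single-process pattern from the NCCL 1.x API (error handling omitted): one communicator per GPU, then an MPI-style all-reduce summing a buffer, such as gradients, across all GPUs in the node.

         #include <nccl.h>
         #include <cuda_runtime.h>

         #define NGPU  4
         #define COUNT (1 << 20)

         int main(void) {
             int devs[NGPU] = {0, 1, 2, 3};
             ncclComm_t comms[NGPU];
             ncclCommInitAll(comms, NGPU, devs);               // one communicator per GPU

             float *buf[NGPU];
             cudaStream_t streams[NGPU];
             for (int i = 0; i < NGPU; ++i) {
                 cudaSetDevice(devs[i]);
                 cudaMalloc(&buf[i], COUNT * sizeof(float));   // e.g. a per-GPU gradient buffer
                 cudaStreamCreate(&streams[i]);
             }

             // In-place sum across all GPUs: one call per device, results land in every buffer
             for (int i = 0; i < NGPU; ++i) {
                 cudaSetDevice(devs[i]);
                 ncclAllReduce(buf[i], buf[i], COUNT, ncclFloat, ncclSum, comms[i], streams[i]);
             }
             for (int i = 0; i < NGPU; ++i) {
                 cudaSetDevice(devs[i]);
                 cudaStreamSynchronize(streams[i]);
             }

             for (int i = 0; i < NGPU; ++i) {
                 cudaFree(buf[i]);
                 ncclCommDestroy(comms[i]);
             }
             return 0;
         }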
  36. TESLA SYSTEM SOFTWARE AND TOOLS
  37. DATA CENTER GPU MANAGEMENT, three tiers. Device Management (all GPUs supported, available today): board-level GPU configuration and monitoring, with device identification, configuration and monitoring, and clock management. Active Diagnostics (Tesla GPUs only): diagnostics, recovery and system validation, with GPU recovery and isolation, system validation, and comprehensive diagnostics. Health & Governance (Tesla GPUs only): proactive health, policy and power management, with real-time monitoring and analysis, governance policies, and power and clock management. The latter two tiers are delivered by the Data Center GPU Manager (DCGM).
  38. DATA CENTER GPU MANAGER (DCGM). On each compute node, the DC GPU Manager sits between the management software agent and the Tesla enterprise driver and GPUs, exposing APIs over the network to the data-center cluster management software on the management node; admins can also drive it directly from a CLI. DCGM is available as a library and a CLI, ready for integration into ISV management software (e.g. Bright Cluster Manager, IBM Platform Cluster Manager) and with HPC job schedulers (e.g. Altair PBS Works, Moab & Maui, IBM Platform LSF, SLURM, Univa Grid Engine). DCGM is currently in public beta: http://www.nvidia.com/object/data-center-gpu-manager.html
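     The CLI side is the dcgmi tool; a couple of representative invocations (option spellings may differ between DCGM releases, so treat these as indicative):

         # List the GPUs DCGM can see on this node
         dcgmi discovery -l
         # Run a quick (level 1) diagnostic across the default GPU group
         dcgmi diag -r 1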
  39. GROWING CONTAINER ADOPTION IN THE DATA CENTER: "Docker spreads like wildfire, especially in the enterprise" (RightScale 2016 Cloud Survey Report). More than 2x growth in Docker adoption in a year, across enterprise, cloud and HPC.
  40. GPU CONTAINERIZATION USING NVIDIA-DOCKER. A single command-line interface takes care of all deployment steps: discovery, config/setup and device allocation. Pre-built images on Docker Hub (CUDA, Caffe, DIGITS) give reproducible builds across heterogeneous targets; remote deployment is possible via the NVIDIA-Docker-Plugin and its REST interface. Key highlights: NVIDIA Docker on GitHub (experimental) is available now; bundling with the CUDA product is planned for future versions.
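     The usual smoke test with the experimental nvidia-docker wrapper is to run nvidia-smi inside a CUDA base image from Docker Hub (image tags vary over time, so take the exact name as illustrative):

         # Verify that the container sees the node's GPUs
         nvidia-docker run --rm nvidia/cuda nvidia-smi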
  41. Axel Koehler, akoehler@nvidia.com
