ALISON B LOWNDES
AI DevRel | EMEA
Fuelling the AI Revolution
with Gaming
@alisonblowndes
2
The day job
AUTOMOTIVE
Auto sensors reporting
location, problems
COMMUNICATIONS
Location-based advertising
CONSUMER PACKAGED GOODS
Sentiment analysis of
what’s hot, problems
$
FINANCIAL SERVICES
Risk & portfolio analysis
New products
EDUCATION & RESEARCH
Experiment sensor analysis
HIGH TECHNOLOGY /
INDUSTRIAL MFG.
Mfg. quality
Warranty analysis
LIFE SCIENCES MEDIA/ENTERTAINMENT
Viewers / advertising
effectiveness
ON-LINE SERVICES /
SOCIAL MEDIA
People & career matching
HEALTH CARE
Patient sensors,
monitoring, EHRs
OIL & GAS
Drilling exploration sensor
analysis
RETAIL
Consumer sentiment
TRAVEL &
TRANSPORTATION
Sensor analysis for
optimal traffic flows
UTILITIES
Smart Meter analysis
for network capacity,
LAW ENFORCEMENT
& DEFENSE
Threat analysis - social media
monitoring, photo analysis
www.FrontierDevelopmentLab.org
A MISSION CONTROL
FOR SPACESHIP EARTH
KICK OFF WORKSHOP
ESA ESRIN, ROME
25th - 29th June ‘18
RESEARCH SPRINT
30th June - 17th August
5
6
EMBED TORNADO VIDEO
7
8
NVIDIA RESEARCH
9
TITAN V
THE MOST POWERFUL PC GPU EVER
5,120 CUDA cores
640 NEW Tensor cores
12GB HBM2 Memory 1.7Gbps
110 Tensor TFLOPS
13.8 TFLOPS fp16 | 6.9 fp32
10
THE RISE OF GPU COMPUTING
Big Data Needs Algorithms and Compute That Scales
1980 1990 2000 2010 2020
Original data up to the year 2010 collected and plotted by M. Horowitz,
F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for 2010-2015 by K. Rupp
103
105
107
1.5X per year
END OF MOORE’S LAW
Single-threaded perf
GPU-Computing perf
1.5X per year
1.1X per year
CPU vs. GPU
11
“the machine equivalent of experience”
Definitions
14
2012
15
CONVOLUTION
0
0
0
0
0
0
0
0
1
1
1
0
0
0
0
1
2
2
1
1
1
0
1
2
2
2
1
1
0
1
2
2
2
1
1
0
0
1
1
1
1
1
0
0
0
0
0
0
0
4
0
0
0
0
0
0
0
-4
1
0
-8
Source
Pixel
Convolution
kernel
New pixel value
(destination
pixel)
Centre element of the kernel is placed over the source pixel.
The source pixel is then replaced with a weighted sum of
itself and nearby pixels.
http://www.biomachina.org/
FAST FOURIER TRANSFORMATION (FFT)
=
+
+
DEEP LEARNING
NEW PROGRAMMING MODEL
DEVICE
EDGE
CLOUD
19
Describe the differences between these 2 bikes……..
20
Image “Volvo XC90”
Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks” ICML 2009 & Comm. ACM 2011.
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng.
CONVOLUTIONAL NEURAL NETWORKS
21
http://distill.pub/2017/momentum/
22
Li et al, University of Maryland & US Naval Academy
https://arxiv.org/pdf/1712.09913.pdf
23
Residual Networks
24
ResNets vs Highway Nets (IDSIA)
https://arxiv.org/pdf/1612.07771.pdf
Klaus Greff, Rupesh K. Srivastava
Really great explanation of “representation”
Compares the two.. shows for language
modelling, translation HN >> RN.
Not quite as simple as each layer building a new
level of representation from the previous - since
removing any layer doesn’t critically disrupt.
25
Capsules & routing with EM, Hinton et al
https://arxiv.org/pdf/1710.09829.pdf [NIPS2017]
SegCaps: capsules for object segmentation https://arxiv.org/pdf/1804.04241.pdf University of Central Florida
26
PRUNING
27
TRAINING VS INFERENCE
28
Long short-term memory (LSTM)
Hochreiter (1991) analysed vanishing gradient “LSTM falls out of this almost naturally”
Gates control importance of
the corresponding
activations
Training
via
backprop
unfolded
in time
LSTM:
input
gate
output
gate
Long time dependencies are preserved until
input gate is closed (-) and forget gate is open (O)
forget
gate
Fig from Vinyals et al, Google April 2015 NIC Generator
Fig from Graves, Schmidhuber et al, Supervised
Sequence Labelling with RNNs
AlexNet
CAMBRIAN EXPLOSION
Convolutional
Networks
ReLuEncoder/Decoder
Dropout PoolingConcat
BatchNorm
Recurrent
Networks
GRULSTM
CTC
Beam Search
WaveNet Attention
Generative Adversarial
Networks
3D-GAN
Speech Enhancement GANCoupled GAN
Conditional GANMedGAN
Reinforcement
Learning
DQN Simulation
DDPG
New
Species
Neural Collaborative FilteringMixture of Experts
Block Sparse LSTM
Capsule Nets
31
Shakir Mohamed and Danilo Rezende, DeepMind, UAI 2017
32
IMAGINE NO
LIMITATIONS TO
DRUG DISCOVERY
It takes 12 years and $2.6B to bring a new drug to
market. Insilico Medicine takes an innovative approach
to speed drug discovery and uses GPU-accelerated
Generative Adversarial Networks (GANs) to "invent"
new molecular structures on demand. With GANs, in
only a few months’ time, Insilico did what would have
taken years with conventional research methods –
identified 69 new molecules. To date, Insilico has
generated 5,000+ new molecules, integrated GANs
into a comprehensive drug discovery framework, and
established working collaborations with large
pharmaceutical companies. This unique approach
could dramatically reduce the time and cost of
developing novel substances with medicinal properties.
33
Dreamcatcher
https://autodeskresearch.com/projects/dreamcatcher
34
35
“After finishing this book, you will have a deep understanding of how to set
technical direction for a machine learning project.”
www.mlyearning.org
36
37
38
Brain Computer Interfaces
Focused on treatment for
disease and dysfunction
eg epilepsy, depression,
Parkinsons but ultimately
to advance human
intelligence by restoring
and extending cognitive
vibrancy.
“We’re either going to have to merge with AI or be left behind”; Elon Musk
REINFORCEMENT LEARNING
“from our own mistakes”
40
RL TIMELINE
•1984: Rich Suttons thesis
•1989: Q-learning
•1991: REINFORCE
•2006: MCTS (Szepesvari)
•2009: David Silver’s thesis
•2013: DQN
•2015: DQN in Nature
•2016: A3C & DRL
•2017: Soft Q-learning/ AlphaGo Zero
•2018: GA3C
41
Tabish Rashid et al, Oxford/DeepMind QMIX: https://arxiv.org/pdf/1803.11485.pdf
42
DEEPMIND ALPHA* 1.0 => 2.0 => 0.0
* ..a deeply structured hybrid (Gary Marcus, Jan 2018)
43
ROBOTS
THINK /
REASON
SENSE /
PERCEIVE
ACT
44
ATLAS, BOSTON DYNAMICS
Inserting video: Insert/Video/Video from File.
Insert video by browsing your directory and selecting OK.
File types that works best in PowerPoint are mp4 or wmv
45
46
Holodeck: photorealistic, collaborative VR
environment
Real-world experience through sight, sound, and
haptics
Explore high-fidelity, full-resolution models
THE DESIGN LAB OF THE
FUTURE
IMMERSIVE AI - HOLODECK
47
48
49
NVIDIA ISAAC ROBOTICS PLATFORM
SIMULATION TRAINING DEPLOYMENT
SDK
https://developer.nvidia.com/isaac-sdk
50
51
INSERT TELEPORTATION VIDEO OF TIM DRIVING A VR VERSION OF A REAL SDC
A massive Deep Learning challenge
53
54
BRIDGING THE REALITY GAP
https://arxiv.org/pdf/1804.06516.pdf NVR+University of Toronto
“leveraging the power of synthetic data improves upon results obtained using real data alone”
55
END TO END SYSTEM FOR AV
Sources: NVIDIA, RAND Corporation
SIMULATETRAIN MODELSCOLLECT DATA RE-SIMULATE
Lanes Lights
Path
Signs
PedestriansCars
Up to 2 petabytes per
car per year
20 DNNs
1+ million images
per DNN
10B+ miles to
ensure safe driving
Validate and verify
for self-driving cars
by 2020
MAPPING
5 + DNNs
Create and update
HD maps
56
LEVEL 5 DEMANDS
EXTREME
COMPUTING
+ 10X camera resolution
+ Surround LIDAR point-cloud processing
+ Camera & LIDAR localization to HD map
+ Tracking all surrounding objects
+ New map generation
+ Sophisticated path planning & control
+ Algorithm diversity
+ Sensor & computing fail-operate
+ ASIL-D Functional Safety
+ Excess computing capacity
Level 2 Level 5
57
COMPUTATIONAL SCALE REQUIRED
3 million labeled images
1 DGX-1 trains 300k labeled images on 1 DNN in 1 day
10 DNNs required for self-driving
10 parallel experiments at all times
100 DGX-1 per car
58
Xavier (ASIL-D rated)
“Our in-depth technical assessment confirms the Xavier SoC
architecture is suitable for use in autonomous driving applications and
highlights NVIDIA’s commitment to enable safe autonomous driving”;
TÜV SÜD, Germany
59
XAVIER DLA
NOW OPEN SOURCE
Command Interface
Tensor Execution Micro-controller
Memory Interface
Input DMA
(Activations
and Weights)
Unified
512KB
Input
Buffer
Activations
and
Weights
Sparse Weight
Decompression
Native
Winograd
Input
Transform
MAC
Array
2048 Int8
or
1024 Int16
or
1024 FP16
Output
Accumulators
Output
Postprocess
or
(Activation
Function,
Pooling
etc.)
Output
DMA
WWW.NVDLA.ORG
60
Jetson
Xavier
developer.nvidia.com/
jetson-xavier
61
AI SAFETY:
• Understanding
• Trust
62
Dynamic Memory Networks
MetaMind now Salesforce (2yrs ago) https://arxiv.org/pdf/1603.01417v1.pdf
63
“The Building Blocks of
Interpretability”
https://distill.pub/2018/building-blocks/
Chris Olah, Google et al, March 2018
64
66
A PLETHORA OF HEALTHCARE STORIES
Molecular Energetics
For Drug Discovery
AI for Drug Discovery
Medical Decision
Making
Treatment Outcomes
Reducing Cancer
Diagnosis Errors by
85%
Predicting Toxicology
Predicting Growth
Problems
Image Processing Gene Mutations Detect Colon Polyps
Predicting Disease from
Medical Records
Enabling Detection of
Fatty Acid Liver Disease
67
68
Radiation Therapy with AI
ViewRay MRIdian (imaging & radiation dose)
V-NET CINEMATIC RENDERINGAUTOMAP
GPU DEEP LEARNING
REVOLUTIONIZING MEDICAL IMAGING
CUDA
Source: https://arxiv.org/pdf/1704.08841.pdf Source: https://arxiv.org/pdf/1606.04797.pdf Source: Elliot Fishman, M.D. & Siemens
70
https://arxiv.org/abs/1804.07723
INPAINTING
71
COMPUTATIONAL & DATA SCIENCE
AI SUPERCOMPUTING WILL TRANSFORM HPC
Extending Reach of HPC By Combining Computational & Data Science
DATA SCIENCECOMPUTATIONAL SCIENCE
Turbulent Flow Molecular Dynamics
Structural Analysis N-body Simulation “Next move?”
“Is there cancer?”“What’s happening?”
“What does she mean?” Understanding Universe
Clean EnergyDrug Discovery
Monitoring Climate
Change
72
GPU READY APPS
AWARENESS : ADOPTION : UTILIZATION
✓ Easy setup
✓ Best app practices
✓ Optimal results
✓ Higher productivity
✓ Faster discoveries
App Users
Life Science
Developers
Deep Learning
IT Managers
HPC/Data Centers
Simple, ISV-approved,
step-by-step
instructions
1.
2.
3.
4.
5.
Target
Customers
Quick Start
Guide
User
Benefits
www.nvidia.com/gpu-ready-apps
73
74
75
POWERING THE DEEP LEARNING ECOSYSTEM
NVIDIA SDK accelerates every major framework
COMPUTER VISION
OBJECT DETECTION IMAGE CLASSIFICATION
SPEECH & AUDIO
VOICE RECOGNITION LANGUAGE TRANSLATION
NATURAL LANGUAGE PROCESSING
RECOMMENDATION ENGINES SENTIMENT ANALYSIS
DEEP LEARNING FRAMEWORKS
NVIDIA DEEP LEARNING SDK
developer.nvidia.com/deep-learning-software
76
developer.nvidia.com
May 8-11, 2017 | Silicon Valley
CUDA 9
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
78NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
NVIDIA DEEP LEARNING SDK UPDATE
GPU-accelerated
DL Primitives
Faster training
Optimizations for RNNs
Leading frameworks support
cuDNN 7
Multi-node distributed
training (multiple machines)
Leading frameworks support
Multi-GPU &
Multi-node
NCCL 2
TensorFlow model reader
Object detection
INT8 RNNs support
High-performance
Inference Engine
TensorRT 4
79
Circa 2000 - Torch7 - 4th (using odd numbers only 1,3,5,7)
Web-scale learning in speech, image and video applications
Maintained by top researchers including
Soumith Chintala - Research Engineer @ Facebook
All the goodness of Torch7 with an intuitive Python frontend that focuses on rapid
prototyping, readable code & support for a wide variety of deep learning models.
https://pytorch.org/2018/05/02/road-to-1.0.html
80
Benchmarks Jan 2018https://github.com/u39kun/deep-learning-benchmark
81
Apex - A PyTorch Extension
● Goal: Raise PyTorch customer awareness and increase adoption of NVIDIA Tensor Cores
● Content: Provide an easy to use set of utility functions in PyTorch for mixed-precision optimizations
● Benefit: Few lines of code to achieve improved training speed while maintaining accuracy and
stability of single precision (Tensor Cores)
● Target audience: Deep learning researchers and developers of PyTorch with NVIDIA Volta
● Key Features: AMP (Auditor for mixed-precision) and Optimizer Wrapper (Dynamic loss scaling and
master parameters)
● Teams Involved: Leading NVIDIA PyTorch team and collaboration with external FB PyTorch team
Overview
82
DALI
Full input pipeline acceleration including
data loading and augmentation
Drop-in integration with direct plugins to
frameworks – MxNet, TensorFlow, PyTorch
Portable workflows through multiple input
formats and configurable graphs
Unblock CPU with GPU-accelerated DL pre-processing library
Version 0.1 supports:
• Resnet-50 image classification & segmentation training
• Input formats – JPEG, LMDB, RecordIO, TFRecord
• Python APIs to define, build and run an input pipelineAvailable as open source -
https://github.com/NVIDIA/DALI
• DALI Samples & Tutorial:
https://github.com/NVIDIA/DALI/blob/master/examples/Getting%20started.ipynb
• nvJPEG (Webpage, Documentation) - https://developer.nvidia.com/nvjpeg
NEW NVIDIA TENSORRT 4
DRIVE PX 2
JETSON TX2
NVIDIA DLA
TESLA P4
TensorRT
TESLA V100
Programmable Inference Accelerator
Compile and Optimize Neural Networks | Support for Every Framework
Optimize for Each Target Platform
84
TensorRT INTEGRATED WITH TensorFlow
8x faster Inference Than TensorFlow Only
*
14
86
705
0
100
200
300
400
500
600
700
800
CPU Only FP32 P4 FP32 P4 INT8 TensorRT
Images/sec
Throughput at < 7ms latency (TensorFlow
ResNet-50)
*
Available in TensorFlow 1.7
https://github.com/tensorflow/tensorflow
* Min CPU latency measured was 70 ms. It is not < 7 ms.
CPU: Skylake Gold 6140, 2.5GHz, Ubuntu 16.04; 18 CPU threads.
Pascal P4; CUDA (384.111; v9.0.176);
Batch size: CPU=1, TF_GPU=1 (latency 12 ms) , TF-TRT=4 w/ latency=6ms
85
NVIDIA TensorRT 4
Recommender,
Speech & Machine
Translation
RNN and MLP Layers ⚫ ONNX Import ⚫ NVIDIA DRIVE Support
Free download to members of NVIDIA Developer Program soon at
developer.nvidia.com/tensorrt
50x Faster Inference
for ONNX Models
Easily import and accelerate
inference for ONNX frameworks
with native ONNX parser in
TensorRT
Accelerate inference of
recommenders, speech and
machine translation apps with
new layers and optimizations
Deploy optimized deep learning
inference models NVIDIA DRIVE
Xavier
Support for NVIDIA
DRIVE Xavier
1
45x
0X
10X
20X
30X
40X
50X
CPU TensorRT
86
TESLA V100 32GB
THE MOST ADVANCED DATA CENTER GPU EVER BUILT
5,120 CUDA cores
640 NEW Tensor cores
7.5 FP64 TFLOPS | 15 FP32 TFLOPS
120 Tensor TFLOP
20MB SM RF | 16MB Cache | 32GB HBM2 @ 900 GB/s
300 GB/s NVLink
https://devblogs.nvidia.com/parallelforall/inside-volta/
87
TENSORCORES
D = A * B + C (4x4)
88
Tensor Cores: More resources
Further information
Mixed-precision blog: https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/
Mixed-precision best practices: https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
Mixed-precision arVix paper: https://arxiv.org/abs/1710.03740
GTC 2018 Sessions: Training with Mixed Precision: Theory and Practice and Training with Mixed Precision: Real Examples
Call to action:
1) Share these resources with your customers and partners
2) Provide feedback to help us prioritize new examples and improve examples
DGX/NGC deep learning framework containers: Latest versions of software stack and Tensor Core optimized examples - DGX/NGC Registry
Tools
TensorFlow OpenSeq2seq: https://nvidia.github.io/OpenSeq2Seq/html/mixed-precision.html & arVix paper
PyTorch Apex: https://nvidia.github.io/apex/fp16_utils.html & NVIDIA developer news article
Examples
New mixed-precision model examples: https://developer.nvidia.com/deep-learning-examples
GitHub: https://github.com/NVIDIA/DeepLearningExamples
89
NVSWITCH
WORLD’S HIGHEST BANDWIDTH ON-NODE SWITCH
7.2 Terabits/sec or 900 GB/sec
18 NVLINK ports | 50GB/s per port bi-directional
Fully-connected crossbar
2 billion transistors | 47.5mm x 47.5mm package
ANNOUNCING NVIDIA DGX-2
THE LARGEST GPU EVER CREATED
2 PFLOPS | 512GB HBM2 | 10 kW | 350 lbs
91
GROUNDBREAKING AI
AT YOUR DESK
Designed for your office
DGX Station is the world’s first
personal supercomputer for leading-
edge AI development. Built on the
same Deep Learning Stack powering
all NVIDIA DGX™ Systems, you can now
experiment at your desk and extend
your work across DGX Systems and
the cloud.
Now with 32GB per GPU, same cost
NVIDIA GPU CLOUD
Optimized Stacks for Every Cloud
20,000+ Registered Organizations | 30 Containers
NOW on AWS, GCP, AliCloud, Oracle Cloud, DGX
93
Containerized Applications
TF Tuned SW
NVIDIA Docker
CNTK Tuned SW
NVIDIA Docker
Caffe2 Tuned SW
NVIDIA Docker
PyTorch Tuned SW
NVIDIA Docker
CUDA RTCUDA RTCUDA RTCUDA RT
Linux Kernel + CUDA Driver
Tuned SW
NVIDIA Docker
CUDA RT
Other
Frameworks
and Apps. . .
ALWAYS UP-TO-DATE
Monthly Updates from NVIDIA to Frameworks and Containers
94
NVIDIA-DOCKER 1.0
github.com/NVIDIA/nvidia-docker
Open-source Project on Github since 2016
>1.5MM downloads
>5K stars
>4.5 MM pulls of CUDA images
Enabled Various Use-cases
DGX / NGC containers
Major deep learning frameworks
Customers Adopted
95
ANNOUNCING:
JETSON XAVIER
Computer for Autonomous Machines
AI Server Performance in 30W  15W  10W
512 Volta CUDA Cores  2x NVDLA
8 core CPU
30 DL TOPS
96
97
Sample Code
Deep Learning
CUDA, Linux4Tegra, ROS
Multimedia API
MediaComputer Vision Graphics
Nsight Developer Tools
Jetson Embedded Supercomputer: Advanced GPU, 64-bit CPU, Video CODEC, VIC, ISP
JETPACK SDK FOR AI @ THE EDGE
DEVELOPER.NVIDIA.COM/EMBEDDED-COMPUTING
TensorRT
cuDNN
VisionWorks
OpenCV
Vulkan
OpenGL
libargus
Video API
98
Available to Instructors Now!
developer.nvidia.com/teaching-kits
Robotics Teaching
Kit with ‘Jet’ - ServoCity
D E E P L E A R N I N G
Fundamentals
Accelerated Computing
Game Development &
Digital Content
Finance
NVIDIA DEEP LEARNING
INSTITUTE
Online self-paced labs and instructor-led
workshops on deep learning and
accelerated computing
Take self-paced labs at
www.nvidia.co.uk/dlilabs
View upcoming workshops and request a
workshop onsite at www.nvidia.co.uk/dli
Educators can join the University
Ambassador Program to teach DLI courses
on campus and access resources. Learn
more at www.nvidia.com/dli
Intelligent Video
Analytics
Healthcare
Robotics
Autonomous Vehicles
Virtual Reality
100
NVIDIA
INCEPTION
PROGRAM
Accelerates AI startups with a boost of
GPU tools, tech and deep learning expertise
Startup Qualifications
Driving advances in the field of AI
Business plan
Incorporated
Web presence
Technology
DL startup kit*
Pascal Titan X
Deep Learning Institute (DLI) credit
Connect with a DL tech expert
DGX-1 ISV discount*
Software release notification
Live webinar and office hours
*By application
Marketing
Inclusion in NVIDIA marketing efforts
GPU Technology Conference (GTC)
discount
Emerging Company Summit (ECS)
participation+
Marketing kit
One-page story template
eBook template
Inception web badge and banners
Social promotion request form
Event opportunities list
Promotion at industry events
GPU ventures+
+By invitation
www.nvidia.com/inception
101

NVIDIA @ Infinite Conference, London

  • 1.
    ALISON B LOWNDES AIDevRel | EMEA Fuelling the AI Revolution with Gaming @alisonblowndes
  • 2.
    2 The day job AUTOMOTIVE Autosensors reporting location, problems COMMUNICATIONS Location-based advertising CONSUMER PACKAGED GOODS Sentiment analysis of what’s hot, problems $ FINANCIAL SERVICES Risk & portfolio analysis New products EDUCATION & RESEARCH Experiment sensor analysis HIGH TECHNOLOGY / INDUSTRIAL MFG. Mfg. quality Warranty analysis LIFE SCIENCES MEDIA/ENTERTAINMENT Viewers / advertising effectiveness ON-LINE SERVICES / SOCIAL MEDIA People & career matching HEALTH CARE Patient sensors, monitoring, EHRs OIL & GAS Drilling exploration sensor analysis RETAIL Consumer sentiment TRAVEL & TRANSPORTATION Sensor analysis for optimal traffic flows UTILITIES Smart Meter analysis for network capacity, LAW ENFORCEMENT & DEFENSE Threat analysis - social media monitoring, photo analysis
  • 3.
  • 4.
    A MISSION CONTROL FORSPACESHIP EARTH KICK OFF WORKSHOP ESA ESRIN, ROME 25th - 29th June ‘18 RESEARCH SPRINT 30th June - 17th August
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
    9 TITAN V THE MOSTPOWERFUL PC GPU EVER 5,120 CUDA cores 640 NEW Tensor cores 12GB HBM2 Memory 1.7Gbps 110 Tensor TFLOPS 13.8 TFLOPS fp16 | 6.9 fp32
  • 10.
    10 THE RISE OFGPU COMPUTING Big Data Needs Algorithms and Compute That Scales 1980 1990 2000 2010 2020 Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for 2010-2015 by K. Rupp 103 105 107 1.5X per year END OF MOORE’S LAW Single-threaded perf GPU-Computing perf 1.5X per year 1.1X per year CPU vs. GPU
  • 11.
  • 12.
  • 14.
  • 15.
    15 CONVOLUTION 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 2 2 1 1 1 0 1 2 2 2 1 1 0 1 2 2 2 1 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 -4 1 0 -8 Source Pixel Convolution kernel New pixel value (destination pixel) Centreelement of the kernel is placed over the source pixel. The source pixel is then replaced with a weighted sum of itself and nearby pixels. http://www.biomachina.org/
  • 16.
  • 17.
    DEEP LEARNING NEW PROGRAMMINGMODEL DEVICE EDGE CLOUD
  • 19.
    19 Describe the differencesbetween these 2 bikes……..
  • 20.
    20 Image “Volvo XC90” Imagesource: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks” ICML 2009 & Comm. ACM 2011. Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng. CONVOLUTIONAL NEURAL NETWORKS
  • 21.
  • 22.
    22 Li et al,University of Maryland & US Naval Academy https://arxiv.org/pdf/1712.09913.pdf
  • 23.
  • 24.
    24 ResNets vs HighwayNets (IDSIA) https://arxiv.org/pdf/1612.07771.pdf Klaus Greff, Rupesh K. Srivastava Really great explanation of “representation” Compares the two.. shows for language modelling, translation HN >> RN. Not quite as simple as each layer building a new level of representation from the previous - since removing any layer doesn’t critically disrupt.
  • 25.
    25 Capsules & routingwith EM, Hinton et al https://arxiv.org/pdf/1710.09829.pdf [NIPS2017] SegCaps: capsules for object segmentation https://arxiv.org/pdf/1804.04241.pdf University of Central Florida
  • 26.
  • 27.
  • 28.
    28 Long short-term memory(LSTM) Hochreiter (1991) analysed vanishing gradient “LSTM falls out of this almost naturally” Gates control importance of the corresponding activations Training via backprop unfolded in time LSTM: input gate output gate Long time dependencies are preserved until input gate is closed (-) and forget gate is open (O) forget gate Fig from Vinyals et al, Google April 2015 NIC Generator Fig from Graves, Schmidhuber et al, Supervised Sequence Labelling with RNNs
  • 29.
  • 30.
    CAMBRIAN EXPLOSION Convolutional Networks ReLuEncoder/Decoder Dropout PoolingConcat BatchNorm Recurrent Networks GRULSTM CTC BeamSearch WaveNet Attention Generative Adversarial Networks 3D-GAN Speech Enhancement GANCoupled GAN Conditional GANMedGAN Reinforcement Learning DQN Simulation DDPG New Species Neural Collaborative FilteringMixture of Experts Block Sparse LSTM Capsule Nets
  • 31.
    31 Shakir Mohamed andDanilo Rezende, DeepMind, UAI 2017
  • 32.
    32 IMAGINE NO LIMITATIONS TO DRUGDISCOVERY It takes 12 years and $2.6B to bring a new drug to market. Insilico Medicine takes an innovative approach to speed drug discovery and uses GPU-accelerated Generative Adversarial Networks (GANs) to "invent" new molecular structures on demand. With GANs, in only a few months’ time, Insilico did what would have taken years with conventional research methods – identified 69 new molecules. To date, Insilico has generated 5,000+ new molecules, integrated GANs into a comprehensive drug discovery framework, and established working collaborations with large pharmaceutical companies. This unique approach could dramatically reduce the time and cost of developing novel substances with medicinal properties.
  • 33.
  • 34.
  • 35.
    35 “After finishing thisbook, you will have a deep understanding of how to set technical direction for a machine learning project.” www.mlyearning.org
  • 36.
  • 37.
  • 38.
    38 Brain Computer Interfaces Focusedon treatment for disease and dysfunction eg epilepsy, depression, Parkinsons but ultimately to advance human intelligence by restoring and extending cognitive vibrancy. “We’re either going to have to merge with AI or be left behind”; Elon Musk
  • 39.
  • 40.
    40 RL TIMELINE •1984: RichSuttons thesis •1989: Q-learning •1991: REINFORCE •2006: MCTS (Szepesvari) •2009: David Silver’s thesis •2013: DQN •2015: DQN in Nature •2016: A3C & DRL •2017: Soft Q-learning/ AlphaGo Zero •2018: GA3C
  • 41.
    41 Tabish Rashid etal, Oxford/DeepMind QMIX: https://arxiv.org/pdf/1803.11485.pdf
  • 42.
    42 DEEPMIND ALPHA* 1.0=> 2.0 => 0.0 * ..a deeply structured hybrid (Gary Marcus, Jan 2018)
  • 43.
  • 44.
    44 ATLAS, BOSTON DYNAMICS Insertingvideo: Insert/Video/Video from File. Insert video by browsing your directory and selecting OK. File types that works best in PowerPoint are mp4 or wmv
  • 45.
  • 46.
    46 Holodeck: photorealistic, collaborativeVR environment Real-world experience through sight, sound, and haptics Explore high-fidelity, full-resolution models THE DESIGN LAB OF THE FUTURE IMMERSIVE AI - HOLODECK
  • 47.
  • 48.
  • 49.
    49 NVIDIA ISAAC ROBOTICSPLATFORM SIMULATION TRAINING DEPLOYMENT SDK https://developer.nvidia.com/isaac-sdk
  • 50.
  • 51.
    51 INSERT TELEPORTATION VIDEOOF TIM DRIVING A VR VERSION OF A REAL SDC
  • 52.
    A massive DeepLearning challenge
  • 53.
  • 54.
    54 BRIDGING THE REALITYGAP https://arxiv.org/pdf/1804.06516.pdf NVR+University of Toronto “leveraging the power of synthetic data improves upon results obtained using real data alone”
  • 55.
    55 END TO ENDSYSTEM FOR AV Sources: NVIDIA, RAND Corporation SIMULATETRAIN MODELSCOLLECT DATA RE-SIMULATE Lanes Lights Path Signs PedestriansCars Up to 2 petabytes per car per year 20 DNNs 1+ million images per DNN 10B+ miles to ensure safe driving Validate and verify for self-driving cars by 2020 MAPPING 5 + DNNs Create and update HD maps
  • 56.
    56 LEVEL 5 DEMANDS EXTREME COMPUTING +10X camera resolution + Surround LIDAR point-cloud processing + Camera & LIDAR localization to HD map + Tracking all surrounding objects + New map generation + Sophisticated path planning & control + Algorithm diversity + Sensor & computing fail-operate + ASIL-D Functional Safety + Excess computing capacity Level 2 Level 5
  • 57.
    57 COMPUTATIONAL SCALE REQUIRED 3million labeled images 1 DGX-1 trains 300k labeled images on 1 DNN in 1 day 10 DNNs required for self-driving 10 parallel experiments at all times 100 DGX-1 per car
  • 58.
    58 Xavier (ASIL-D rated) “Ourin-depth technical assessment confirms the Xavier SoC architecture is suitable for use in autonomous driving applications and highlights NVIDIA’s commitment to enable safe autonomous driving”; TÜV SÜD, Germany
  • 59.
    59 XAVIER DLA NOW OPENSOURCE Command Interface Tensor Execution Micro-controller Memory Interface Input DMA (Activations and Weights) Unified 512KB Input Buffer Activations and Weights Sparse Weight Decompression Native Winograd Input Transform MAC Array 2048 Int8 or 1024 Int16 or 1024 FP16 Output Accumulators Output Postprocess or (Activation Function, Pooling etc.) Output DMA WWW.NVDLA.ORG
  • 60.
  • 61.
  • 62.
    62 Dynamic Memory Networks MetaMindnow Salesforce (2yrs ago) https://arxiv.org/pdf/1603.01417v1.pdf
  • 63.
    63 “The Building Blocksof Interpretability” https://distill.pub/2018/building-blocks/ Chris Olah, Google et al, March 2018
  • 64.
  • 66.
    66 A PLETHORA OFHEALTHCARE STORIES Molecular Energetics For Drug Discovery AI for Drug Discovery Medical Decision Making Treatment Outcomes Reducing Cancer Diagnosis Errors by 85% Predicting Toxicology Predicting Growth Problems Image Processing Gene Mutations Detect Colon Polyps Predicting Disease from Medical Records Enabling Detection of Fatty Acid Liver Disease
  • 67.
  • 68.
    68 Radiation Therapy withAI ViewRay MRIdian (imaging & radiation dose)
  • 69.
    V-NET CINEMATIC RENDERINGAUTOMAP GPUDEEP LEARNING REVOLUTIONIZING MEDICAL IMAGING CUDA Source: https://arxiv.org/pdf/1704.08841.pdf Source: https://arxiv.org/pdf/1606.04797.pdf Source: Elliot Fishman, M.D. & Siemens
  • 70.
  • 71.
    71 COMPUTATIONAL & DATASCIENCE AI SUPERCOMPUTING WILL TRANSFORM HPC Extending Reach of HPC By Combining Computational & Data Science DATA SCIENCECOMPUTATIONAL SCIENCE Turbulent Flow Molecular Dynamics Structural Analysis N-body Simulation “Next move?” “Is there cancer?”“What’s happening?” “What does she mean?” Understanding Universe Clean EnergyDrug Discovery Monitoring Climate Change
  • 72.
    72 GPU READY APPS AWARENESS: ADOPTION : UTILIZATION ✓ Easy setup ✓ Best app practices ✓ Optimal results ✓ Higher productivity ✓ Faster discoveries App Users Life Science Developers Deep Learning IT Managers HPC/Data Centers Simple, ISV-approved, step-by-step instructions 1. 2. 3. 4. 5. Target Customers Quick Start Guide User Benefits www.nvidia.com/gpu-ready-apps
  • 73.
  • 74.
  • 75.
    75 POWERING THE DEEPLEARNING ECOSYSTEM NVIDIA SDK accelerates every major framework COMPUTER VISION OBJECT DETECTION IMAGE CLASSIFICATION SPEECH & AUDIO VOICE RECOGNITION LANGUAGE TRANSLATION NATURAL LANGUAGE PROCESSING RECOMMENDATION ENGINES SENTIMENT ANALYSIS DEEP LEARNING FRAMEWORKS NVIDIA DEEP LEARNING SDK developer.nvidia.com/deep-learning-software
  • 76.
  • 77.
    May 8-11, 2017| Silicon Valley CUDA 9 https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
  • 78.
    78NVIDIA CONFIDENTIAL. DONOT DISTRIBUTE. NVIDIA DEEP LEARNING SDK UPDATE GPU-accelerated DL Primitives Faster training Optimizations for RNNs Leading frameworks support cuDNN 7 Multi-node distributed training (multiple machines) Leading frameworks support Multi-GPU & Multi-node NCCL 2 TensorFlow model reader Object detection INT8 RNNs support High-performance Inference Engine TensorRT 4
  • 79.
    79 Circa 2000 -Torch7 - 4th (using odd numbers only 1,3,5,7) Web-scale learning in speech, image and video applications Maintained by top researchers including Soumith Chintala - Research Engineer @ Facebook All the goodness of Torch7 with an intuitive Python frontend that focuses on rapid prototyping, readable code & support for a wide variety of deep learning models. https://pytorch.org/2018/05/02/road-to-1.0.html
  • 80.
  • 81.
    81 Apex - APyTorch Extension ● Goal: Raise PyTorch customer awareness and increase adoption of NVIDIA Tensor Cores ● Content: Provide an easy to use set of utility functions in PyTorch for mixed-precision optimizations ● Benefit: Few lines of code to achieve improved training speed while maintaining accuracy and stability of single precision (Tensor Cores) ● Target audience: Deep learning researchers and developers of PyTorch with NVIDIA Volta ● Key Features: AMP (Auditor for mixed-precision) and Optimizer Wrapper (Dynamic loss scaling and master parameters) ● Teams Involved: Leading NVIDIA PyTorch team and collaboration with external FB PyTorch team Overview
  • 82.
    82 DALI Full input pipelineacceleration including data loading and augmentation Drop-in integration with direct plugins to frameworks – MxNet, TensorFlow, PyTorch Portable workflows through multiple input formats and configurable graphs Unblock CPU with GPU-accelerated DL pre-processing library Version 0.1 supports: • Resnet-50 image classification & segmentation training • Input formats – JPEG, LMDB, RecordIO, TFRecord • Python APIs to define, build and run an input pipelineAvailable as open source - https://github.com/NVIDIA/DALI • DALI Samples & Tutorial: https://github.com/NVIDIA/DALI/blob/master/examples/Getting%20started.ipynb • nvJPEG (Webpage, Documentation) - https://developer.nvidia.com/nvjpeg
  • 83.
    NEW NVIDIA TENSORRT4 DRIVE PX 2 JETSON TX2 NVIDIA DLA TESLA P4 TensorRT TESLA V100 Programmable Inference Accelerator Compile and Optimize Neural Networks | Support for Every Framework Optimize for Each Target Platform
  • 84.
    84 TensorRT INTEGRATED WITHTensorFlow 8x faster Inference Than TensorFlow Only * 14 86 705 0 100 200 300 400 500 600 700 800 CPU Only FP32 P4 FP32 P4 INT8 TensorRT Images/sec Throughput at < 7ms latency (TensorFlow ResNet-50) * Available in TensorFlow 1.7 https://github.com/tensorflow/tensorflow * Min CPU latency measured was 70 ms. It is not < 7 ms. CPU: Skylake Gold 6140, 2.5GHz, Ubuntu 16.04; 18 CPU threads. Pascal P4; CUDA (384.111; v9.0.176); Batch size: CPU=1, TF_GPU=1 (latency 12 ms) , TF-TRT=4 w/ latency=6ms
  • 85.
    85 NVIDIA TensorRT 4 Recommender, Speech& Machine Translation RNN and MLP Layers ⚫ ONNX Import ⚫ NVIDIA DRIVE Support Free download to members of NVIDIA Developer Program soon at developer.nvidia.com/tensorrt 50x Faster Inference for ONNX Models Easily import and accelerate inference for ONNX frameworks with native ONNX parser in TensorRT Accelerate inference of recommenders, speech and machine translation apps with new layers and optimizations Deploy optimized deep learning inference models NVIDIA DRIVE Xavier Support for NVIDIA DRIVE Xavier 1 45x 0X 10X 20X 30X 40X 50X CPU TensorRT
  • 86.
    86 TESLA V100 32GB THEMOST ADVANCED DATA CENTER GPU EVER BUILT 5,120 CUDA cores 640 NEW Tensor cores 7.5 FP64 TFLOPS | 15 FP32 TFLOPS 120 Tensor TFLOP 20MB SM RF | 16MB Cache | 32GB HBM2 @ 900 GB/s 300 GB/s NVLink https://devblogs.nvidia.com/parallelforall/inside-volta/
  • 87.
    87 TENSORCORES D = A* B + C (4x4)
  • 88.
    88 Tensor Cores: Moreresources Further information Mixed-precision blog: https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/ Mixed-precision best practices: https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html Mixed-precision arVix paper: https://arxiv.org/abs/1710.03740 GTC 2018 Sessions: Training with Mixed Precision: Theory and Practice and Training with Mixed Precision: Real Examples Call to action: 1) Share these resources with your customers and partners 2) Provide feedback to help us prioritize new examples and improve examples DGX/NGC deep learning framework containers: Latest versions of software stack and Tensor Core optimized examples - DGX/NGC Registry Tools TensorFlow OpenSeq2seq: https://nvidia.github.io/OpenSeq2Seq/html/mixed-precision.html & arVix paper PyTorch Apex: https://nvidia.github.io/apex/fp16_utils.html & NVIDIA developer news article Examples New mixed-precision model examples: https://developer.nvidia.com/deep-learning-examples GitHub: https://github.com/NVIDIA/DeepLearningExamples
  • 89.
    89 NVSWITCH WORLD’S HIGHEST BANDWIDTHON-NODE SWITCH 7.2 Terabits/sec or 900 GB/sec 18 NVLINK ports | 50GB/s per port bi-directional Fully-connected crossbar 2 billion transistors | 47.5mm x 47.5mm package
  • 90.
    ANNOUNCING NVIDIA DGX-2 THELARGEST GPU EVER CREATED 2 PFLOPS | 512GB HBM2 | 10 kW | 350 lbs
  • 91.
    91 GROUNDBREAKING AI AT YOURDESK Designed for your office DGX Station is the world’s first personal supercomputer for leading- edge AI development. Built on the same Deep Learning Stack powering all NVIDIA DGX™ Systems, you can now experiment at your desk and extend your work across DGX Systems and the cloud. Now with 32GB per GPU, same cost
  • 92.
    NVIDIA GPU CLOUD OptimizedStacks for Every Cloud 20,000+ Registered Organizations | 30 Containers NOW on AWS, GCP, AliCloud, Oracle Cloud, DGX
  • 93.
    93 Containerized Applications TF TunedSW NVIDIA Docker CNTK Tuned SW NVIDIA Docker Caffe2 Tuned SW NVIDIA Docker PyTorch Tuned SW NVIDIA Docker CUDA RTCUDA RTCUDA RTCUDA RT Linux Kernel + CUDA Driver Tuned SW NVIDIA Docker CUDA RT Other Frameworks and Apps. . . ALWAYS UP-TO-DATE Monthly Updates from NVIDIA to Frameworks and Containers
  • 94.
    94 NVIDIA-DOCKER 1.0 github.com/NVIDIA/nvidia-docker Open-source Projecton Github since 2016 >1.5MM downloads >5K stars >4.5 MM pulls of CUDA images Enabled Various Use-cases DGX / NGC containers Major deep learning frameworks Customers Adopted
  • 95.
    95 ANNOUNCING: JETSON XAVIER Computer forAutonomous Machines AI Server Performance in 30W  15W  10W 512 Volta CUDA Cores  2x NVDLA 8 core CPU 30 DL TOPS
  • 96.
  • 97.
    97 Sample Code Deep Learning CUDA,Linux4Tegra, ROS Multimedia API MediaComputer Vision Graphics Nsight Developer Tools Jetson Embedded Supercomputer: Advanced GPU, 64-bit CPU, Video CODEC, VIC, ISP JETPACK SDK FOR AI @ THE EDGE DEVELOPER.NVIDIA.COM/EMBEDDED-COMPUTING TensorRT cuDNN VisionWorks OpenCV Vulkan OpenGL libargus Video API
  • 98.
    98 Available to InstructorsNow! developer.nvidia.com/teaching-kits Robotics Teaching Kit with ‘Jet’ - ServoCity D E E P L E A R N I N G
  • 99.
    Fundamentals Accelerated Computing Game Development& Digital Content Finance NVIDIA DEEP LEARNING INSTITUTE Online self-paced labs and instructor-led workshops on deep learning and accelerated computing Take self-paced labs at www.nvidia.co.uk/dlilabs View upcoming workshops and request a workshop onsite at www.nvidia.co.uk/dli Educators can join the University Ambassador Program to teach DLI courses on campus and access resources. Learn more at www.nvidia.com/dli Intelligent Video Analytics Healthcare Robotics Autonomous Vehicles Virtual Reality
  • 100.
    100 NVIDIA INCEPTION PROGRAM Accelerates AI startupswith a boost of GPU tools, tech and deep learning expertise Startup Qualifications Driving advances in the field of AI Business plan Incorporated Web presence Technology DL startup kit* Pascal Titan X Deep Learning Institute (DLI) credit Connect with a DL tech expert DGX-1 ISV discount* Software release notification Live webinar and office hours *By application Marketing Inclusion in NVIDIA marketing efforts GPU Technology Conference (GTC) discount Emerging Company Summit (ECS) participation+ Marketing kit One-page story template eBook template Inception web badge and banners Social promotion request form Event opportunities list Promotion at industry events GPU ventures+ +By invitation www.nvidia.com/inception
  • 101.