GPU Algorithms and Trends
Presentation, Mid 2018
Contents
• Why GPU ?
• Evolution of the GPU and its Programming models
• Typical Algorithms
• Image processing, Image Analysis, DB, VR, Graphics + Compute, Crypto
• Deep learning
• Bandwidth/ performance analysis tools
• Trends in GPU algorithms
• A journey, not a deal
Why GPU ?
Graphics Hardware Landscape
What does the GPU do ?
• Efficient Graphics processing
• High quality – Advanced shaders (Programmable)
• High efficiency – Discard unwanted pixels (Hardware)
• Co-processing with CPU
• Goals for the 2018+ Graphics Processor and beyond
• How can we keep the CPU at 0%, and the GPU 100% ?
• In other words, keep the GPU data-saturated, not data-starved
A historical perspective
• Embedded Graphics GPUs (and APIs pre-Vulkan)
• Non-existent communication between different blocks (Blackbox)
• Non-existent heterogeneity between CPU and GPU
• GPU output optimised only for display scanout (exceptions: video streaming, ..)
• Desktop GPUs (and APIs)
• Focused on Higher quality via Programmability
• Driven largely by Microsoft DirectX APIs, followed by OpenGL
From Pixels to FLOPs
• The need to control individual blocks and manage individual contexts better in Desktop APIs led to more and more programmable cores
• Less load on (dynamic) drivers, more (one-time) load on application
• Target 0% CPU API overhead, 100% GPU loading
HW architecture advances
• Graphics
• HDR, HEVC, Data compression
• Low-latency pre-emption
• Application specific HW units
• HDR
• Multi-GPU architectures (SLI, ..)
• Compute
• CUDA core micro-arch
• Memory hierarchies
• Thread-level pre-emption
• Common
• GDDR5 advances
• Memory controllers, interconnects
• Clocks, micro arch
• Board designs
• Fan/ Thermal designs/ Noise considerations
Compute Hardware Advances (continued)
Deep learning Hardware Landscape
https://www.forbes.com/sites/moorinsights/2017/03/03/a-machine-learning-landscape-where-amd-intel-nvidia-qualcomm-and-xilinx-ai-engines-live/#7c1a12fc742f
• “.. Understanding C Major took me 27 years …” - Ilaiyaraaja, composer
• AIVA technologies AI composer, available, today
Programming Models
• Languages
• CUDA
• Native language acceleration
• Numba
• C++ AMP
• New on the GPU
• Branching !
• Exceptions !
High level comparison
Power performance
• Nvidia power/performance
Power and Area
• Integrated chipsets
• AMD Llano, 32 nm
• Discrete Nvidia GPU Area
Quick introduction to GPU Programming
Organisation of the code
• Main.cpp
• main()
• Timing measurement code (cudaEvent*..)
• CUDA Acceleration code - Kernel.cu
• Kernel wrappers
• CUDA memory allocations
• Grid/block calculations
• Kernel calls
• Actual kernel
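The Kernel.cu side of this layout might look roughly like the sketch below. The names `scaleKernel` and `scaleOnGpu`, the block size of 256 and the scaling operation itself are illustrative assumptions, not taken from the talk; main() in Main.cpp would call only the wrapper.

```cuda
// Kernel.cu (sketch; all names here are hypothetical)
#include <cuda_runtime.h>
#include <vector>

// Actual kernel: each thread scales one element.
__global__ void scaleKernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Kernel wrapper: allocation, grid/block calculation, timed kernel call.
// Returns the kernel time in milliseconds, or -1.0f if a CUDA call failed.
float scaleOnGpu(std::vector<float>& host, float factor)
{
    const int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    const size_t bytes = host.size() * sizeof(float);

    // CUDA memory allocation and host-to-device copy
    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1.0f;
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // Grid/block calculation: round up so every element gets a thread
    const int blockSize = 256;  // assumed value, tune per device
    const int gridSize = (n + blockSize - 1) / blockSize;

    // Timing measurement with cudaEvent*
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scaleKernel<<<gridSize, blockSize>>>(dev, n, factor);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Check the launch, then copy the result back
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return (err == cudaSuccess) ? ms : -1.0f;
}
```

Keeping every CUDA call behind the wrapper means Main.cpp can stay a plain C++ file that only needs the wrapper's declaration.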
Moving an algorithm to GPU – Tips
• C++ file or .cu file ?
• 'cudaEvent_t': undeclared identifier
• Include cuda_runtime.h, not just cuda.h
• Dreaded - “0x4 unspecified launch failure”
• cudaOccupancyMaxPotentialBlockSize
• GridSize
• BlockSize
• Tool for memory bug-checks
• cuda-memcheck.exe
• %PROGFILES%\NVIDIA GPU Computing Toolkit\CUDA\vx.y\bin
• ========= Invalid __global__ write of size 4
• ========= at 0x000006b0 in ….
• CUDA errors can persist (be "sticky"), so be aware of the API behaviour
• Errors reported by some APIs, e.g. cudaThreadSynchronize() (deprecated in favour of cudaDeviceSynchronize()), are errors from previous calls !
• Large 1D arrays (e.g. 1M entries) can have issues (possibly the per-dimension grid size limit on older devices)
• Move to a 2D grid
• Each kernel composed of “a Grid of Blocks of Threads”
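One way to keep launch errors from surfacing later in an unrelated call is to check right after each launch and again after a synchronising call. The sketch below assumes a hypothetical helper `cudaOk`, macro `CUDA_OK`, kernel `fillKernel` and wrapper `launchFillChecked`; only the CUDA runtime calls themselves are real API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: report where a CUDA call failed; true on success.
inline bool cudaOk(cudaError_t err, const char* what, const char* file, int line)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s:%d: %s failed: %s\n",
                     file, line, what, cudaGetErrorString(err));
        return false;
    }
    return true;
}
#define CUDA_OK(call) cudaOk((call), #call, __FILE__, __LINE__)

// Illustrative kernel: write `value` into every element of `out`.
__global__ void fillKernel(int* out, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = value;
    }
}

// Launch and check in two stages.
bool launchFillChecked(int* devOut, int n, int value)
{
    const int blockSize = 256;  // assumed value
    const int gridSize = (n + blockSize - 1) / blockSize;

    fillKernel<<<gridSize, blockSize>>>(devOut, n, value);

    // Stage 1: bad launch configurations are reported immediately.
    if (!CUDA_OK(cudaGetLastError())) return false;

    // Stage 2: errors during execution (e.g. an invalid write) only
    // become visible at the next synchronising call.
    return CUDA_OK(cudaDeviceSynchronize());
}
```

Synchronising after every launch costs performance, so a check like stage 2 is probably best kept to debug builds.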
Specifying shared memory
• Static
• Declared in kernel
• Dynamic shared memory allocation
• The shared memory allocation size per thread block must be specified (in bytes) using an optional third execution configuration parameter in the kernel call
• myKernel <<<grids,blocks, memsizeBytes>>>();
• How to synchronise shared memory accesses across threads ?
• __syncthreads() in the kernel
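A minimal sketch of dynamic shared memory together with __syncthreads(), using a per-block sum as the example. The kernel `blockSumKernel`, the wrapper `sumOnGpu` and the block size of 256 are assumptions for illustration; error checking is omitted to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block sums its slice of `in` in shared memory and writes one
// partial sum. Assumes blockDim.x is a power of two.
__global__ void blockSumKernel(const float* in, float* partial, int n)
{
    // Dynamic shared memory: size comes from the third launch parameter.
    extern __shared__ float tile[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads must finish before anyone reads tile[]

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            tile[tid] += tile[tid + stride];
        }
        __syncthreads();  // reached by every thread, outside the if
    }

    if (tid == 0) {
        partial[blockIdx.x] = tile[0];
    }
}

// Host wrapper: launches the kernel and adds up the per-block results.
float sumOnGpu(const std::vector<float>& host)
{
    const int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;

    const int blocks = 256;                        // threads per block
    const int grids = (n + blocks - 1) / blocks;   // blocks in the grid
    const size_t memsizeBytes = blocks * sizeof(float);

    float* devIn = nullptr;
    float* devPartial = nullptr;
    cudaMalloc(&devIn, n * sizeof(float));
    cudaMalloc(&devPartial, grids * sizeof(float));
    cudaMemcpy(devIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSumKernel<<<grids, blocks, memsizeBytes>>>(devIn, devPartial, n);

    // This blocking copy also waits for the kernel to finish.
    std::vector<float> partial(grids);
    cudaMemcpy(partial.data(), devPartial, grids * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devPartial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

A statically declared array (`__shared__ float tile[256];`) would work the same way inside the kernel, but then the size is fixed at compile time and the third launch parameter is not needed.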
GPU Profiling
• cudaStreamCreate – startup takes several seconds
• DNNs in general – more Host to Device transfers than Device to Host
• YOLOv3 analysis:
• 23% in add-bias
• 16% in shortcut
• 15% in normalize
• 12% in fill
• C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1\libnvvp
• Using cuDNN for batch-norm reduces this time by about 20%
After profiling:
• Moving BatchNorm to cuDNN results in a 50% reduction
• 27% shortcut kernel
• 19% fill kernel
• 13% activate array
• 13% copy
Improving performance
• 3 fundamental steps
• Profile
• Profile
• Profile
• Bottlenecks ?
• Read data
• Write data
• Compute
• Avoid stalls - utilize internal memory judiciously
• Memory transfer and computation should be done in parallel
• Increase utilization – Occupancy
• Utilise helper APIs such as cudaOccupancyMaxPotentialBlockSize
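A sketch of how the occupancy helper could feed the GridSize and BlockSize of a launch. The kernel `saxpyKernel`, the struct `LaunchConfig` and the function `suggestSaxpyLaunch` are hypothetical names; the helper only suggests a block size that maximises theoretical occupancy, so profiling is still needed to confirm it is the fastest choice.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: y = a * x + y
__global__ void saxpyKernel(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

struct LaunchConfig {
    int gridSize;   // number of blocks
    int blockSize;  // threads per block
};

// Ask the runtime for a block size, then size the grid to cover n
// elements. Returns {0, 0} if the query fails.
LaunchConfig suggestSaxpyLaunch(int n)
{
    int minGridSize = 0;  // smallest grid that can reach full occupancy
    int blockSize = 0;    // suggested threads per block

    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, saxpyKernel,
        0,   // no dynamic shared memory in this kernel
        0);  // no upper limit on the block size
    if (err != cudaSuccess || blockSize <= 0) {
        return {0, 0};
    }

    int gridSize = (n + blockSize - 1) / blockSize;  // round up
    return {gridSize, blockSize};
}
```

The result would then be used as `saxpyKernel<<<cfg.gridSize, cfg.blockSize>>>(a, x, y, n);`.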
CUDA and cuDNN
• cuDNN is a library of functions, built using the CUDA API
• Focused on Neural networks
• Downloaded separately from the CUDA Toolkit
• What performance improvement does it bring ?
• Yolo with different options
Yolo – with different options (Tegra TK1)
[Bar chart: YOLOv2 Inference Time (Seconds) - Tegra TK1]
• CPU: 39
• CUDA: 0.53
• CUDNN: 0.01
Algorithms on GPU
Data exploration
• 4 free parameters – Can model an elephant
• http://neuralnetworksanddeeplearning.com/chap3.html#overfitting_and_regularization
Medicine – Drug discovery
• AtomNet - structure-based, deep convolutional neural network
designed to predict the bioactivity of small molecules for drug
discovery applications. (Atomwise company)
• apply the convolutional concepts of feature locality and hierarchical
composition to the modeling of bioactivity and chemical interactions
Segmentation – e.g. tumors in pancreas images
• Small organ segmentation
• Recurrent Saliency Transformation Network. The key innovation is a saliency
transformation module, which repeatedly converts the segmentation
probability map from the previous iteration as spatial weights and applies
these weights to the current iteration
Challenges – Availability of training data
• Significant challenge in object detection
• Why ?
• Solution - Synthetic data
• Image augmentation
• Lighting, transformations, transparency
• euclidaug
• Ray tracing
• Completely under our control
Challenges - Latency of Algorithms on GPU
• How to profile ? What tools ?
• Typical Graphics latencies
• VR example, framebuffer, display relation
• Compute - Average inference latency of Inception v2 with TF 1.5
• 33ms with batch size of 1
• 540ms with batch size of 32
• “GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed”
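The batch-size trade-off in the Inception v2 figures above can be worked through as a rough calculation. It assumes the quoted 33 ms and 540 ms are times per whole batch, which the slide does not state explicitly; T_b (time per batch of size b) and R_b (images per second) are notation introduced here.

```latex
\begin{align*}
R_1 &= \frac{1}{0.033\,\mathrm{s}} \approx 30.3\ \text{images/s}, &
R_{32} &= \frac{32}{0.540\,\mathrm{s}} \approx 59.3\ \text{images/s}, \\
\frac{R_{32}}{R_1} &\approx 1.96, &
\frac{T_{32}}{T_1} &= \frac{540}{33} \approx 16.4 .
\end{align*}
```

Under that assumption, batching roughly doubles throughput while each request waits about 16 times longer, which matters for latency-sensitive cases such as the VR example.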
Emerging – Compute-In-Flash
• Syntiant, Mythic Analog NN Implementation on Flash
http://www.calit2.uci.edu/uploads/Media/Text/HOLLEMAN.pdf
Emerging – DL and Operating Systems
• Windows
• Linaro
• Intelligence is not a single thing
• A group of intelligences working together
• Attention, reasoning, processing speed, movement
• Information and Intelligence not always visual !!
Conclusion
• Religion and Spirituality
• Future trends
• “near-chip-memory”
• Better atomics
• Process technologies
• Truly heterogeneous multi-core architectures
From IBM
Netscope
• http://ethereon.github.io/netscope/quickstart.html
• Tool for visualizing neural network architectures (or technically, any
directed acyclic graph). It currently supports Caffe's prototxt format.
Arrow - GDF
Visualisation H2O – from VW talk on analytics
• https://www.youtube.com/watch?v=-mBg-lFz5fQ
• VW – Use GPU for both – analysis+queries
What are we creating AI for ?
• Intelligence on earth
• Intelligence outside earth
• Space travel under 0-gravity
• Cardiovascular deterioration
• Decalcification
• Demineralisation of bones
• Muscular fitness
• Demineralisation recovery time high, perhaps not recoverable
• Reconnaissance missions

More Related Content

What's hot

CaffeOnSpark: Deep Learning On Spark Cluster
CaffeOnSpark: Deep Learning On Spark ClusterCaffeOnSpark: Deep Learning On Spark Cluster
CaffeOnSpark: Deep Learning On Spark ClusterJen Aman
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersCastLabKAIST
 
Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)
Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)
Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)Sharad Agarwal
 
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakLeveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakDatabricks
 
Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...
Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...
Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...Yahoo Developer Network
 
AI Chip Trends and Forecast
AI Chip Trends and ForecastAI Chip Trends and Forecast
AI Chip Trends and ForecastCastLabKAIST
 
Modern processor art
Modern processor artModern processor art
Modern processor artwaqasjadoon11
 
How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?Deepak Shankar
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleSpark Summit
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Clusterairbots
 
Video Analytics At Scale: DL, CV, ML On Databricks Platform
Video Analytics At Scale: DL, CV, ML On Databricks PlatformVideo Analytics At Scale: DL, CV, ML On Databricks Platform
Video Analytics At Scale: DL, CV, ML On Databricks PlatformDatabricks
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
GPU power consumption and performance trends
GPU power consumption and performance trendsGPU power consumption and performance trends
GPU power consumption and performance trendsAlessio Villardita
 

What's hot (18)

CaffeOnSpark: Deep Learning On Spark Cluster
CaffeOnSpark: Deep Learning On Spark ClusterCaffeOnSpark: Deep Learning On Spark Cluster
CaffeOnSpark: Deep Learning On Spark Cluster
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
 
Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)
Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)
Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)
 
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakLeveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
 
Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...
Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...
Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...
 
AI Chip Trends and Forecast
AI Chip Trends and ForecastAI Chip Trends and Forecast
AI Chip Trends and Forecast
 
Programming Models for Heterogeneous Chips
Programming Models for  Heterogeneous ChipsProgramming Models for  Heterogeneous Chips
Programming Models for Heterogeneous Chips
 
Modern processor art
Modern processor artModern processor art
Modern processor art
 
GPU Computing
GPU ComputingGPU Computing
GPU Computing
 
How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?
 
Danish presentation
Danish presentationDanish presentation
Danish presentation
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Cluster
 
Video Analytics At Scale: DL, CV, ML On Databricks Platform
Video Analytics At Scale: DL, CV, ML On Databricks PlatformVideo Analytics At Scale: DL, CV, ML On Databricks Platform
Video Analytics At Scale: DL, CV, ML On Databricks Platform
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
GPU power consumption and performance trends
GPU power consumption and performance trendsGPU power consumption and performance trends
GPU power consumption and performance trends
 

Similar to GPU Algorithms and trends 2018

Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architectureDhaval Kaneria
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...Edge AI and Vision Alliance
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitMilind Bhandarkar
 
Throughput oriented aarchitectures
Throughput oriented aarchitecturesThroughput oriented aarchitectures
Throughput oriented aarchitecturesNomy059
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
GPU Computing: A brief overview
GPU Computing: A brief overviewGPU Computing: A brief overview
GPU Computing: A brief overviewRajiv Kumar
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentationtestSri1
 
Mauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscteMauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-isctembreternitz
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAFacultad de Informática UCM
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
Deploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdfDeploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdfObject Automation
 
GPU enablement for data science on OpenShift | DevNation Tech Talk
GPU enablement for data science on OpenShift | DevNation Tech TalkGPU enablement for data science on OpenShift | DevNation Tech Talk
GPU enablement for data science on OpenShift | DevNation Tech TalkRed Hat Developers
 
Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures Subbu Rama
 
Performance Evaluation and Comparison of Service-based Image Processing based...
Performance Evaluation and Comparison of Service-based Image Processing based...Performance Evaluation and Comparison of Service-based Image Processing based...
Performance Evaluation and Comparison of Service-based Image Processing based...Matthias Trapp
 
High-Performance Computing with C++
High-Performance Computing with C++High-Performance Computing with C++
High-Performance Computing with C++JetBrains
 
Introduction to GPUs for Machine Learning
Introduction to GPUs for Machine LearningIntroduction to GPUs for Machine Learning
Introduction to GPUs for Machine LearningSri Ambati
 

Similar to GPU Algorithms and trends 2018 (20)

Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
 
Throughput oriented aarchitectures
Throughput oriented aarchitecturesThroughput oriented aarchitectures
Throughput oriented aarchitectures
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
GPU Computing: A brief overview
GPU Computing: A brief overviewGPU Computing: A brief overview
GPU Computing: A brief overview
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentation
 
Android performance
Android performanceAndroid performance
Android performance
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
Mauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscteMauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscte
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
Deploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdfDeploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdf
 
GPU enablement for data science on OpenShift | DevNation Tech Talk
GPU enablement for data science on OpenShift | DevNation Tech TalkGPU enablement for data science on OpenShift | DevNation Tech Talk
GPU enablement for data science on OpenShift | DevNation Tech Talk
 
Cuda
CudaCuda
Cuda
 
Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures
 
Performance Evaluation and Comparison of Service-based Image Processing based...
Performance Evaluation and Comparison of Service-based Image Processing based...Performance Evaluation and Comparison of Service-based Image Processing based...
Performance Evaluation and Comparison of Service-based Image Processing based...
 
High-Performance Computing with C++
High-Performance Computing with C++High-Performance Computing with C++
High-Performance Computing with C++
 
Introduction to GPUs for Machine Learning
Introduction to GPUs for Machine LearningIntroduction to GPUs for Machine Learning
Introduction to GPUs for Machine Learning
 

More from Prabindh Sundareson

Synthetic Data and Graphics Techniques in Robotics
Synthetic Data and Graphics Techniques in RoboticsSynthetic Data and Graphics Techniques in Robotics
Synthetic Data and Graphics Techniques in RoboticsPrabindh Sundareson
 
Machine learning in the Indian Context - IEEE talk at SRM Institute
Machine learning in the Indian Context - IEEE talk at SRM InstituteMachine learning in the Indian Context - IEEE talk at SRM Institute
Machine learning in the Indian Context - IEEE talk at SRM InstitutePrabindh Sundareson
 
ICCE Asia 2017 - Program Outline
ICCE Asia 2017 - Program OutlineICCE Asia 2017 - Program Outline
ICCE Asia 2017 - Program OutlinePrabindh Sundareson
 
Call for Papers - ICCE Asia 2017
Call for Papers - ICCE Asia 2017Call for Papers - ICCE Asia 2017
Call for Papers - ICCE Asia 2017Prabindh Sundareson
 
Technology, Innovation - A Perspective
Technology, Innovation - A PerspectiveTechnology, Innovation - A Perspective
Technology, Innovation - A PerspectivePrabindh Sundareson
 
IEEE - Consumer Electronics Trends Opportunities (2015)
IEEE - Consumer Electronics Trends Opportunities (2015)IEEE - Consumer Electronics Trends Opportunities (2015)
IEEE - Consumer Electronics Trends Opportunities (2015)Prabindh Sundareson
 
GFX part 8 - Three.js introduction and usage
GFX part 8 - Three.js introduction and usageGFX part 8 - Three.js introduction and usage
GFX part 8 - Three.js introduction and usagePrabindh Sundareson
 
GFX Part 7 - Introduction to Rendering Targets in OpenGL ES
GFX Part 7 - Introduction to Rendering Targets in OpenGL ESGFX Part 7 - Introduction to Rendering Targets in OpenGL ES
GFX Part 7 - Introduction to Rendering Targets in OpenGL ESPrabindh Sundareson
 
GFX Part 6 - Introduction to Vertex and Fragment Shaders in OpenGL ES
GFX Part 6 - Introduction to Vertex and Fragment Shaders in OpenGL ESGFX Part 6 - Introduction to Vertex and Fragment Shaders in OpenGL ES
GFX Part 6 - Introduction to Vertex and Fragment Shaders in OpenGL ESPrabindh Sundareson
 
GFX Part 5 - Introduction to Object Transformations in OpenGL ES
GFX Part 5 - Introduction to Object Transformations in OpenGL ESGFX Part 5 - Introduction to Object Transformations in OpenGL ES
GFX Part 5 - Introduction to Object Transformations in OpenGL ESPrabindh Sundareson
 
GFX Part 4 - Introduction to Texturing in OpenGL ES
GFX Part 4 - Introduction to Texturing in OpenGL ESGFX Part 4 - Introduction to Texturing in OpenGL ES
GFX Part 4 - Introduction to Texturing in OpenGL ESPrabindh Sundareson
 
GFX Part 3 - Vertices and interactions in OpenGL
GFX Part 3 - Vertices and interactions in OpenGLGFX Part 3 - Vertices and interactions in OpenGL
GFX Part 3 - Vertices and interactions in OpenGLPrabindh Sundareson
 
GFX Part 2 - Introduction to GPU Programming
GFX Part 2 - Introduction to GPU ProgrammingGFX Part 2 - Introduction to GPU Programming
GFX Part 2 - Introduction to GPU ProgrammingPrabindh Sundareson
 
GFX Part 1 - Introduction to GPU HW and OpenGL ES specifications
GFX Part 1 - Introduction to GPU HW and OpenGL ES specificationsGFX Part 1 - Introduction to GPU HW and OpenGL ES specifications
GFX Part 1 - Introduction to GPU HW and OpenGL ES specificationsPrabindh Sundareson
 
John Carmack talk at SMU, April 2014 - Virtual Reality
John Carmack talk at SMU, April 2014 - Virtual RealityJohn Carmack talk at SMU, April 2014 - Virtual Reality
John Carmack talk at SMU, April 2014 - Virtual RealityPrabindh Sundareson
 
Gfx2014 Graphics Workshop - Lab manual
Gfx2014 Graphics Workshop - Lab manualGfx2014 Graphics Workshop - Lab manual
Gfx2014 Graphics Workshop - Lab manualPrabindh Sundareson
 

More from Prabindh Sundareson (20)

Synthetic Data and Graphics Techniques in Robotics
Synthetic Data and Graphics Techniques in RoboticsSynthetic Data and Graphics Techniques in Robotics
Synthetic Data and Graphics Techniques in Robotics
 
Work and Life
Work and Life Work and Life
Work and Life
 
Machine learning in the Indian Context - IEEE talk at SRM Institute
Machine learning in the Indian Context - IEEE talk at SRM InstituteMachine learning in the Indian Context - IEEE talk at SRM Institute
Machine learning in the Indian Context - IEEE talk at SRM Institute
 
Students Hackathon - 2017
Students Hackathon - 2017Students Hackathon - 2017
Students Hackathon - 2017
 
ICCE Asia 2017 - Program Outline
ICCE Asia 2017 - Program OutlineICCE Asia 2017 - Program Outline
ICCE Asia 2017 - Program Outline
 
Call for Papers - ICCE Asia 2017
Call for Papers - ICCE Asia 2017Call for Papers - ICCE Asia 2017
Call for Papers - ICCE Asia 2017
 
Technology, Innovation - A Perspective
Technology, Innovation - A PerspectiveTechnology, Innovation - A Perspective
Technology, Innovation - A Perspective
 
Open Shading Language (OSL)
Open Shading Language (OSL)Open Shading Language (OSL)
Open Shading Language (OSL)
 
IEEE - Consumer Electronics Trends Opportunities (2015)
IEEE - Consumer Electronics Trends Opportunities (2015)IEEE - Consumer Electronics Trends Opportunities (2015)
IEEE - Consumer Electronics Trends Opportunities (2015)
 
GFX part 8 - Three.js introduction and usage
GFX part 8 - Three.js introduction and usageGFX part 8 - Three.js introduction and usage
GFX part 8 - Three.js introduction and usage
 
GFX Part 7 - Introduction to Rendering Targets in OpenGL ES
GFX Part 7 - Introduction to Rendering Targets in OpenGL ESGFX Part 7 - Introduction to Rendering Targets in OpenGL ES
GFX Part 7 - Introduction to Rendering Targets in OpenGL ES
 
GFX Part 6 - Introduction to Vertex and Fragment Shaders in OpenGL ES
GFX Part 6 - Introduction to Vertex and Fragment Shaders in OpenGL ESGFX Part 6 - Introduction to Vertex and Fragment Shaders in OpenGL ES
GFX Part 6 - Introduction to Vertex and Fragment Shaders in OpenGL ES
 
GFX Part 5 - Introduction to Object Transformations in OpenGL ES
GFX Part 5 - Introduction to Object Transformations in OpenGL ESGFX Part 5 - Introduction to Object Transformations in OpenGL ES
GFX Part 5 - Introduction to Object Transformations in OpenGL ES
 
GFX Part 4 - Introduction to Texturing in OpenGL ES
GFX Part 4 - Introduction to Texturing in OpenGL ESGFX Part 4 - Introduction to Texturing in OpenGL ES
GFX Part 4 - Introduction to Texturing in OpenGL ES
 
GFX Part 3 - Vertices and interactions in OpenGL
GFX Part 3 - Vertices and interactions in OpenGLGFX Part 3 - Vertices and interactions in OpenGL
GFX Part 3 - Vertices and interactions in OpenGL
 
GFX Part 2 - Introduction to GPU Programming
GFX Part 2 - Introduction to GPU ProgrammingGFX Part 2 - Introduction to GPU Programming
GFX Part 2 - Introduction to GPU Programming
 
GFX Part 1 - Introduction to GPU HW and OpenGL ES specifications
GFX Part 1 - Introduction to GPU HW and OpenGL ES specificationsGFX Part 1 - Introduction to GPU HW and OpenGL ES specifications
GFX Part 1 - Introduction to GPU HW and OpenGL ES specifications
 
John Carmack talk at SMU, April 2014 - Virtual Reality
John Carmack talk at SMU, April 2014 - Virtual RealityJohn Carmack talk at SMU, April 2014 - Virtual Reality
John Carmack talk at SMU, April 2014 - Virtual Reality
 
GFX2014 OpenGL ES Quiz
GFX2014 OpenGL ES QuizGFX2014 OpenGL ES Quiz
GFX2014 OpenGL ES Quiz
 
Gfx2014 Graphics Workshop - Lab manual
Gfx2014 Graphics Workshop - Lab manualGfx2014 Graphics Workshop - Lab manual
Gfx2014 Graphics Workshop - Lab manual
 

Recently uploaded

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

GPU Algorithms and trends 2018

  • 1. GPU Algorithms and Trends Presentation, Mid 2018
  • 2. Contents • Why GPU ? • Evolution of the GPU and its Programming models • Typical Algorithms • Image processing, Image Analysis, DB, VR, Graphics + Compute, Crypto • Deep learning • Bandwidth/ performance analysis tools • Trends in GPU algorithms • A journey, not a deal
  • 5. What does the GPU do ? • Efficient Graphics processing • High quality – Advanced shaders (Programmable) • High efficiency – Discard unwanted pixels (Hardware) • Co-processing with CPU • Goals for the 2018+ Graphics Processor and beyond • How can we keep the CPU at 0%, and the GPU 100% ? • In other words, keep data-saturated, not data-starved
  • 6. A historical perspective • Embedded Graphics GPUs (and APIs pre-Vulkan) • Non-existent communication between different blocks (Blackbox) • Non-existent heterogeneity between CPU and GPU • GPU output: optimised only for display scanout (exceptions – video streaming ..) • Desktop GPUs (and APIs) • Focused on Higher quality via Programmability • Driven largely by Microsoft DirectX APIs, followed by OpenGL
  • 7. From Pixels to FLOPs • Need for controlling individual blocks, and manage individual contexts better on Desktop APIs, led to more and more programmable cores • Less load on (dynamic) drivers, more (one-time) load on application • Target 0% CPU API overhead, 100% GPU loading
  • 8. HW architecture advances • Graphics • HDR, HEVC, Data compression • Low-latency pre-emption • Application specific HW units • HDR • Multi-GPU architectures (SLI, ..) • Compute • CUDA core micro-arch • Memory hierarchies • Thread-level pre-emption • Common • GDDR5 advances • Memory controllers, interconnects • Clocks, micro arch • Board designs • Fan/ Thermal designs/ Noise considerations
  • 10.
  • 11. Deep learning Hardware Landscape https://www.forbes.com/sites/moorinsights/2017/03/03/a-machine-learning-landscape-where-amd-intel-nvidia-qualcomm-and-xilinx-ai-engines-live/#7c1a12fc742f
  • 12. • “.. Understanding C Major took me 27 years …” - Illayaraja, composer • AIVA Technologies' AI composer, available today
  • 13. Programming Models • Languages • CUDA • Native language acceleration • Numba • C++ AMP • New on the GPU • Branching ! • Exceptions !
  • 15. Power performance • Nvidia power/performance
  • 16. Power and Area • Integrated chipsets • AMD Llano 32 nm • Discrete Nvidia GPU Area
  • 17. Quick introduction to GPU Programming
  • 18. Organisation of the code • Main.cpp • main() • Timing measurement code (cudaEvent*..) • CUDA Acceleration code - Kernel.cu • Kernel wrappers • CudaMem allocations • Grid/block calculations • Kernel calls • Actual kernel
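The layout on slide 18 can be sketched roughly as follows. This is a minimal illustration and not the presenter's code: the names `scaleKernel` and `scaleOnGpu` and the block size of 256 are assumptions made for the example. The wrapper would live in Kernel.cu and be called from main() in Main.cpp.

```cuda
#include <cuda_runtime.h>
#include <vector>

// "Actual kernel": each thread scales one element of the array.
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// "Kernel wrapper" (hypothetical name): device allocation, grid/block
// calculation, timed kernel call, copy back.
// Returns the kernel time in milliseconds, or -1 on a CUDA error.
float scaleOnGpu(std::vector<float>& host, float factor) {
    const int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    const size_t bytes = n * sizeof(float);

    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1.0f;
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // Grid/block calculation: enough blocks to cover all n elements.
    const int block = 256;  // assumed block size
    const int grid = (n + block - 1) / block;

    // Timing measurement with cudaEvent*.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    scaleKernel<<<grid, block>>>(dev, factor, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    const cudaError_t err = cudaGetLastError();

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return err == cudaSuccess ? ms : -1.0f;
}
```

Keeping all CUDA calls behind the wrapper means Main.cpp can stay a plain C++ file, which sidesteps the "C++ file or .cu file ?" question for the caller.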
  • 19. Moving an algorithm to GPU – Tips • C++ file or .cu file ? • 'cudaEvent_t': undeclared identifier • Include cuda_runtime.h, not just cuda.h • Dreaded - “0x4 unspecified launch failure” • cudaOccupancyMaxPotentialBlockSize • GridSize • BlockSize • Tool for memory bug-checks • cuda-memcheck.exe • %PROGFILES%\NVIDIA GPU Computing Toolkit\CUDA\vx.y\bin • ========= Invalid __global__ write of size 4 • ========= at 0x000006b0 in …. • CUDA errors can be persistent (sticky), so be aware of the API behaviour • Errors reported by some APIs, e.g. cudaThreadSynchronize(), are errors from previous calls ! • Large 1D arrays (e.g. 1M entries) have issues • Move to 2D • Each kernel is composed of “a Grid of Blocks of Threads”
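Several of the tips on slide 19 can be combined in one sketch: checking errors at both launch and synchronisation, choosing a block size with cudaOccupancyMaxPotentialBlockSize, and indexing a large array through a 2D grid. The names `cudaOk`, `fillKernel` and `fillOnGpu` are hypothetical, and the 32768-block row width is an arbitrary choice. The reading that "1D large arrays have issues" refers to the 65535 per-dimension grid limit of older devices is an assumption, not something the slide states. The sketch uses cudaDeviceSynchronize(), which replaced the deprecated cudaThreadSynchronize().

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: print a CUDA error and report success or failure.
inline bool cudaOk(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// The flat element index is rebuilt from a 2D grid of 1D blocks.
__global__ void fillKernel(int* out, int value, size_t n) {
    size_t blockId = (size_t)blockIdx.y * gridDim.x + blockIdx.x;
    size_t i = blockId * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;  // threads past the end do nothing
}

bool fillOnGpu(int* devOut, int value, size_t n) {
    if (n == 0) return true;

    // Let the runtime suggest a block size for this kernel.
    int minGridSize = 0, blockSize = 0;
    if (!cudaOk(cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                                   fillKernel, 0, 0),
                "occupancy query"))
        return false;

    // Fold the required block count into two grid dimensions.
    const size_t blocks = (n + blockSize - 1) / blockSize;
    const size_t maxX = 32768;  // assumed row width, below the 65535 limit
    const unsigned gx = static_cast<unsigned>(blocks < maxX ? blocks : maxX);
    const unsigned gy = static_cast<unsigned>((blocks + gx - 1) / gx);

    fillKernel<<<dim3(gx, gy), blockSize>>>(devOut, value, n);

    // Launch-configuration errors show up immediately...
    if (!cudaOk(cudaGetLastError(), "kernel launch")) return false;
    // ...while execution errors only surface at the next synchronising call.
    return cudaOk(cudaDeviceSynchronize(), "kernel execution");
}
```

Checking right after the launch and again after the synchronisation makes it easier to tell which call actually caused an error, rather than having it reported by an unrelated later API call.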
  • 20. Specifying shared memory • Static • Declared in kernel • Dynamic shared memory allocation • The shared memory allocation size per thread block must be specified (in bytes) using an optional third execution configuration parameter in the kernel call • myKernel<<<grids, blocks, memsizeBytes>>>(); • How to synchronise shared memory accesses across threads ? • __syncthreads() in the kernel
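As an illustration of slide 20, here is a minimal sketch of a per-block sum that uses dynamic shared memory and __syncthreads(). The kernel and wrapper names (`blockSum`, `sumOnGpu`) and the block size of 256 are assumptions for the example; only the third launch parameter and __syncthreads() come from the slide.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block sums its slice of `in` into partial[blockIdx.x].
__global__ void blockSum(const float* in, float* partial, int n) {
    // Dynamic shared memory: the size comes from the third launch parameter.
    // The static alternative would be: __shared__ float cache[256];
    extern __shared__ float cache[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads must finish before any thread reads cache

    // Tree reduction; requires blockDim.x to be a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();  // finish this level before starting the next
    }
    if (tid == 0) partial[blockIdx.x] = cache[0];
}

// Hypothetical host wrapper: writes the sum of host[0..n) to *total.
bool sumOnGpu(const float* host, int n, float* total) {
    *total = 0.0f;
    if (n <= 0) return true;
    const int block = 256;  // assumed block size, a power of two
    const int grid = (n + block - 1) / block;

    float* devIn = nullptr;
    float* devPartial = nullptr;
    if (cudaMalloc(&devIn, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&devPartial, grid * sizeof(float)) != cudaSuccess) {
        cudaFree(devIn);
        return false;
    }
    cudaMemcpy(devIn, host, n * sizeof(float), cudaMemcpyHostToDevice);

    const size_t memsizeBytes = block * sizeof(float);
    blockSum<<<grid, block, memsizeBytes>>>(devIn, devPartial, n);
    cudaError_t err = cudaGetLastError();

    std::vector<float> partial(grid);
    if (err == cudaSuccess)
        err = cudaMemcpy(partial.data(), devPartial, grid * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devPartial);
    if (err != cudaSuccess) return false;

    for (float p : partial) *total += p;  // finish the sum on the host
    return true;
}
```

Without the __syncthreads() calls, a thread could read a neighbour's cache slot before it has been written, which is the synchronisation hazard the slide's last bullet points at.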
  • 21. GPU Profiling • cudaStreamCreate – Startup takes several seconds • General DNNs – more Host to Device, than Device to Host • Yolov3 analysis: • 23% in add-bias • 16% in shortcut • 15% in normalize • 12% in fill • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1\libnvvp • Using CUDNN for batch-norm reduces this time by about 20% • After profiling: Moving BatchNorm to CUDNN results in 50% reduction • 27% shortcut kernel • 19% fill kernel • 13% activate array • 13% copy
  • 22. Improving performance • 3 fundamental steps • Profile • Profile • Profile • Bottlenecks ? • Read data • Write data • Compute • Avoid stalls - utilize internal memory judiciously • Memory transfer and computation should be done in parallel • Increase utilization – Occupancy • Utilise helper APIs “cudaOccupancyMaxPotentialBlockSize”
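The advice on slide 22 that memory transfer and computation should run in parallel is commonly implemented with CUDA streams. The sketch below is one such arrangement and not the presenter's code: `addOne` and `addOneOverlapped` are hypothetical names, and it assumes the host buffer was allocated with cudaMallocHost (pinned memory), which asynchronous copies need in order to overlap with kernels.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: adds 1 to every element of `pinnedHost`, splitting
// the work into `chunks` pieces with one stream per piece. While one
// chunk is being computed, another chunk's copy can be in flight.
bool addOneOverlapped(float* pinnedHost, int n, int chunks) {
    if (n <= 0 || chunks <= 0) return n == 0;

    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;

    std::vector<cudaStream_t> streams(chunks);
    for (auto& s : streams) cudaStreamCreate(&s);

    const int chunkLen = (n + chunks - 1) / chunks;
    const int block = 256;  // assumed block size
    for (int c = 0; c < chunks; ++c) {
        const int offset = c * chunkLen;
        if (offset >= n) break;
        const int count = (n - offset < chunkLen) ? (n - offset) : chunkLen;
        const size_t bytes = count * sizeof(float);

        // Copy in, compute, copy out: ordered within this stream,
        // but free to overlap with the other streams.
        cudaMemcpyAsync(dev + offset, pinnedHost + offset, bytes,
                        cudaMemcpyHostToDevice, streams[c]);
        addOne<<<(count + block - 1) / block, block, 0, streams[c]>>>(
            dev + offset, count);
        cudaMemcpyAsync(pinnedHost + offset, dev + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[c]);
    }

    const cudaError_t err = cudaDeviceSynchronize();  // wait for all streams
    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(dev);
    return err == cudaSuccess;
}
```

Whether the overlap actually happens depends on the device's copy engines, so profiling the timeline, as the slide insists, is the way to confirm it.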
  • 23. CUDA and CUDNN • CUDNN is a library of functions, built using the CUDA API • Focused on Neural networks • Downloaded separately from the CUDA toolkit • What performance improvement does it bring ? • Yolo with different options
  • 24. Yolo – with different options (Tegra TK1) • Chart: YOLOv2 Inference Time (Seconds) - Tegra TK1 • CPU: 39 • CUDA: 0.53 • CUDNN: 0.01
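Taking the three inference times from the slide 24 chart at face value (39 s, 0.53 s and 0.01 s), the implied speedups are simple ratios. The ratios are derived here and are not stated on the slide:

```latex
\[
\frac{t_{\text{CPU}}}{t_{\text{CUDA}}} = \frac{39}{0.53} \approx 74\times,
\qquad
\frac{t_{\text{CUDA}}}{t_{\text{CUDNN}}} = \frac{0.53}{0.01} = 53\times,
\qquad
\frac{t_{\text{CPU}}}{t_{\text{CUDNN}}} = \frac{39}{0.01} = 3900\times
\]
```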
  • 26. Data exploration • 4 free parameters – Can model an elephant • http://neuralnetworksanddeeplearning.com/chap3.html#overfitting_and_regularization
  • 27. Medicine – Drug discovery • AtomNet - structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. (Atomwise company) • apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions
  • 28. Segmentation – Ex Tumors in pancreas images • Small organ segmentation • Recurrent Saliency Transformation Network. The key innovation is a saliency transformation module, which repeatedly converts the segmentation probability map from the previous iteration as spatial weights and applies these weights to the current iteration
  • 29. Challenges – Availability of training data • Significant challenge in object detection • Why ? • Solution - Synthetic data • Image augmentation • Lighting, transformations, transparency • euclidaug • Ray tracing • Completely under our control
  • 30. Challenges - Latency of Algorithms on GPU • How to profile ? What tools ? • Typical Graphics latencies • VR example, framebuffer, display relation • Compute - Average inference latency of Inception v2 with TF 1.5 • 33ms with batch size of 1 • 540ms with batch size of 32 • “GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed”
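The two Inception v2 figures on slide 30 show the usual trade-off between latency and throughput. As a worked calculation from the slide's numbers, assuming each figure is the time to process one whole batch:

```latex
\[
\text{throughput}_{b=1} = \frac{1}{0.033\,\text{s}} \approx 30\ \text{images/s},
\qquad
\text{throughput}_{b=32} = \frac{32}{0.540\,\text{s}} \approx 59\ \text{images/s}
\]
\[
\frac{\text{latency}_{b=32}}{\text{latency}_{b=1}} = \frac{540\,\text{ms}}{33\,\text{ms}} \approx 16.4
\]
```

Under that assumption, batching roughly doubles throughput while each request waits about 16 times longer, which is why latency-sensitive workloads such as the VR example tend to favour small batches.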
  • 31. Emerging – Compute-In-Flash • Syntiant, Mythic • Analog NN Implementation on Flash • http://www.calit2.uci.edu/uploads/Media/Text/HOLLEMAN.pdf
  • 32. Emerging – DL and Operating Systems • Windows • Linaro
  • 33. • Intelligence is not a single thing • A group of intelligences working together • Attention, reasoning, processing speed, movement • Information and Intelligence not always visual !!
  • 34. Conclusion • Religion and Spirituality • Future trends • “near-chip-memory” • Better atomics • Process technologies • Truly heterogeneous multi-core architectures
  • 36. Netscope • http://ethereon.github.io/netscope/quickstart.html • Tool for visualizing neural network architectures (or technically, any directed acyclic graph). It currently supports Caffe's prototxt format.
  • 38. Visualisation H20 – from VW talk on analytics • https://www.youtube.com/watch?v=-mBg-lFz5fQ • VW – Use GPU for both – analysis+queries
  • 39. What are we creating AI for ? • Intelligence on earth • Intelligence outside earth • Space travel under 0-gravity • Cardiovascular deterioration • Decalcification • Demineralisation of bones • Muscular fitness • Demineralisation recovery time high, perhaps not recoverable • Reconnaissance missions