[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cliff Woolley, NVIDIA)
http://cs264.org http://j.mp/eF9Oj2
1.
Analysis-Driven Optimization
Presented by Cliff Woolley for Harvard CS264 © NVIDIA 2010
2.
Performance Optimization Process
- Use an appropriate performance metric for each kernel
  - For example, a Gflops-bound kernel
- Determine what limits kernel performance:
  - Memory throughput
  - Instruction throughput
  - Latency
  - A combination of the above
- Address the limiters in order of importance:
  - Determine how close to the HW limits the resource is being used
  - Analyze for possible inefficiencies
  - Apply optimizations
    - Often these will just fall out from how the HW operates
3.
Presentation Outline
- Identifying performance limiters
- Analyzing and optimizing:
  - Memory-bound kernels
  - Instruction (math) bound kernels
  - Kernels with poor latency hiding
- For each:
  - Brief background
  - How to analyze
  - How to judge whether a particular issue is problematic
  - How to optimize
- Note: most information is for Fermi GPUs
4.
New in CUDA Toolkit 4.0
- Automated performance analysis using the Visual Profiler
- Summary analysis & hints:
  - Per-Session
  - Per-Device
  - Per-Context
  - Per-Kernel
- New UI for kernel analysis:
  - Identify the limiting factor
  - Analyze instruction throughput
  - Analyze memory throughput
  - Analyze kernel occupancy
5.
Notes on profiler
- Most counters are reported per Streaming Multiprocessor (SM), not for the entire GPU
  - Exceptions: L2 and DRAM counters
- A single run can collect only a few counters
  - Multiple runs are needed when profiling more counters
  - Done automatically by the Visual Profiler
  - Has to be done manually with the command-line profiler
- Counter values may not be exactly the same for repeated runs
  - Threadblocks and warps are scheduled at run time
- See the profiler documentation for more information
6.
Identifying Performance Limiters
7.
Limited by Bandwidth or Arithmetic?
- Perfect instructions:bytes ratio for Fermi C2050:
  - ~4.5 : 1 with ECC on
  - ~3.6 : 1 with ECC off
  - These assume fp32 instructions; throughput for other instructions varies
- Algorithmic analysis: rough estimate of the arithmetic-to-bytes ratio
- Code likely uses more instructions and bytes than the algorithmic analysis suggests:
  - Instructions for loop control, pointer math, etc.
  - Address pattern may result in more memory fetches
- Two ways to investigate:
  - Use the profiler (quick, but approximate)
  - Use source code modification (more accurate, more work-intensive)
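The algorithmic-analysis step above can be sketched with back-of-the-envelope arithmetic. The SAXPY-style kernel used here is our own illustration, not one from the lecture; the balanced ratio is the ECC-off figure quoted above:

```cuda
#include <cstdio>

// Back-of-the-envelope instr:bytes estimate for y[i] = a * x[i] + y[i] (fp32).
// Per element: 1 FMA instruction; 12 bytes of DRAM traffic
// (load x: 4 B, load y: 4 B, store y: 4 B).
int main(void)
{
    double kernel_ratio   = 1.0 / 12.0;  // instructions per byte
    double balanced_ratio = 3.6;         // Fermi C2050, ECC off (see above)

    printf("kernel ratio   = %.3f instr/byte\n", kernel_ratio);
    printf("balanced ratio = %.1f instr/byte\n", balanced_ratio);
    if (kernel_ratio < balanced_ratio)
        printf("=> far below the balanced ratio: likely memory-bound\n");
    return 0;
}
```

Most real kernels land well below the balanced ratio, which is why the lecture treats memory throughput first.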
8.
A Note on Counting Global Memory Accesses
- Load/store instruction count can be lower than the number of actual memory transactions
  - Address pattern, different word sizes
- Counting requests from L1 to the rest of the memory system makes the most sense
  - Caching loads: count L1 misses
  - Non-caching loads and stores: count L2 read requests
  - Note that L2 counters are for the entire chip; L1 counters are per SM
- One 32-bit access instruction -> one 128-byte transaction per warp
- One 64-bit access instruction -> two 128-byte transactions per warp
- One 128-bit access instruction -> four 128-byte transactions per warp
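The word-size rules above can be annotated on a small kernel. This example (kernel and parameter names are ours) assumes fully coalesced, 128-byte-aligned accesses, so each warp's load width alone determines the transaction count:

```cuda
// Illustrative kernel: with coalesced, aligned accesses, the per-thread load
// width determines how many 128-byte transactions one instruction generates
// for a 32-thread warp.
__global__ void access_widths(const float  *f,   // 32-bit words
                              const float2 *f2,  // 64-bit words
                              const float4 *f4,  // 128-bit words
                              float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float  a = f[i];    // warp reads 32 * 4 B  = 128 B -> 1 transaction
    float2 b = f2[i];   // warp reads 32 * 8 B  = 256 B -> 2 transactions
    float4 c = f4[i];   // warp reads 32 * 16 B = 512 B -> 4 transactions
    out[i] = a + b.x + c.x;  // one store, keeps the loads live
}
```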
9.
Analysis with Profiler
- Profiler counters:
  - instructions_issued, instructions_executed
    - Both incremented by 1 per warp
  - gld_request, gst_request
    - Incremented by 1 per warp for each load/store instruction
  - l1_global_load_miss, l1_global_load_hit, global_store_transaction
    - Incremented by 1 per L1 line (a line is 128 B)
  - uncached_global_load_transaction
    - Incremented by 1 per group of 1, 2, 3, or 4 transactions
    - Better to look at the L2_read_request counter (incremented by 1 per 32 B transaction; per GPU, not per SM)
- Compare:
  - 32 * instructions_issued  /* 32 = warp size */
  - 128 B * (global_store_transaction + l1_global_load_miss)
10.
Analysis with Modified Source Code
- Time memory-only and math-only versions of the kernel
  - Harder for kernels with data-dependent control-flow or addressing
- Gives you good estimates for:
  - Time spent accessing memory
  - Time spent executing instructions
- Compare the times taken by the modified kernels
  - Helps decide whether the kernel is memory- or math-bound
- Compare the sum of the mem-only and math-only times to the full-kernel time
  - Shows how well memory operations are overlapped with arithmetic
  - Can reveal a latency bottleneck
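A minimal sketch of the timing harness this technique implies, using CUDA events. The function and variable names are ours; `launch` wraps whichever variant (full, mem-only, math-only) is being measured:

```cuda
#include <cuda_runtime.h>

// Times one kernel variant with CUDA events. `launch` wraps whichever
// variant (full, mem-only, math-only) is being measured.
float time_variant(void (*launch)(void))
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    launch();                    // launch the kernel variant here
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Interpretation, per the comparison described above:
//   t_full ~= max(t_mem, t_math) -> good overlap, bound by the larger one
//   t_full ~= t_mem + t_math     -> poor overlap, latency is likely a problem
```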
11.-14.
Some Example Scenarios
(Figure: slides 11-14 build up a bar chart one scenario at a time, plotting the mem-only, math-only, and full-kernel times side by side. The four scenarios:)
- Memory-bound: good mem-math overlap; latency likely not a problem (assuming memory throughput is not low compared to HW theory)
- Math-bound: good mem-math overlap; latency likely not a problem (assuming instruction throughput is not low compared to HW theory)
- Balanced: good mem-math overlap; latency likely not a problem (assuming memory/instruction throughput is not low compared to HW theory)
- Memory and latency bound: poor mem-math overlap; latency is a problem
15.
Source Modification
Memory-only: Remove as much arithmetic as possible
  Without changing the access pattern
  Use the profiler to verify that the load/store instruction count is the same
Store-only: Also remove the loads (to compare read time vs. write time)
Math-only: Remove global memory accesses
  Need to trick the compiler:
    Compiler throws away all code that it detects as not contributing to stores
    Put stores inside conditionals that always evaluate to false
      Condition should depend on the value about to be stored (prevents other optimizations)
      Condition outcome should not be known to the compiler
15 © NVIDIA 2010
16.
Source Modification for
Math-only

__global__ void fwd_3D( ..., int flag )
{
    ...
    value = temp + coeff * vsq;
    if( 1 == value * flag )
        g_output[out_idx] = value;
}

If you compare only the flag, the compiler may move the computation into the conditional as well.
16 © NVIDIA 2010
17.
Source Modification and
Occupancy
Removing pieces of code is likely to affect register count
  This could increase occupancy, skewing the results
  See slide 23 to see how that could affect throughput
Make sure to keep the same occupancy
  Check the occupancy with the profiler before modifications
  After modifications, if necessary add shared memory to the launch:
  kernel<<< grid, block, smem, ...>>>(...)
17 © NVIDIA 2010
18.
Case Study: Limiter
Analysis
3DFD of the wave equation, fp32
Time (ms):
  Full-kernel: 35.39
  Mem-only:    33.27
  Math-only:   16.25
Instructions issued:
  Full-kernel: 18,194,139
  Mem-only:     7,497,296
  Math-only:   16,839,792
Memory access transactions:
  Full-kernel: 1,708,032
  Mem-only:    1,708,032
  Math-only:   0
Analysis:
  Instr:byte ratio = ~2.66  ( 32*18,194,139 / 128*1,708,032 )
  Good overlap between math and mem: 2.12 ms of math-only time (13%) are not overlapped with mem
  App memory throughput: 62 GB/s (HW theory is 114 GB/s)
Conclusion:
  Code is memory-bound; math contributes very little to total time (2.12 out of 35.39 ms)
  Latency could be an issue too
  Optimizations should focus on memory throughput first
18 © NVIDIA 2010
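The case study's arithmetic can be reproduced directly: profiler instruction counters are per warp (so ×32 threads), each memory transaction moves 128 bytes, and the non-overlapped math time is full-kernel minus mem-only. A minimal sketch:

```cpp
#include <cassert>

// Instr:byte ratio: warp-level instruction count scaled by 32 threads,
// transaction count scaled by 128 bytes per transaction.
double instrByteRatio(long long warpInstrIssued, long long memTransactions) {
    return (32.0 * warpInstrIssued) / (128.0 * memTransactions);
}

// Math time not hidden behind memory time: full-kernel minus mem-only.
double unhiddenMathMs(double fullMs, double memMs) {
    return fullMs - memMs;
}
```

Plugging in the slide's counters gives the quoted ~2.66 ratio and 2.12 ms (13% of the 16.25 ms math-only time).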
19.
Summary: Limiter Analysis
Rough algorithmic analysis:
  How many bytes needed, how many instructions
Profiler analysis:
  Instruction count, memory request/transaction count
Analysis with source modification:
  Memory-only version of the kernel
  Math-only version of the kernel
  Examine how these times relate and overlap
19 © NVIDIA 2010
20.
Optimizations for Global
Memory 20 © NVIDIA 2010
21.
Memory Throughput Analysis
Throughput: application vs. hardware point of view
  From app point of view: count bytes requested by the application
  From HW point of view: count bytes moved by the hardware
The two can be different
  Scattered/misaligned pattern: not all transaction bytes are utilized
  Broadcast: the same small transaction serves many requests
Two aspects to analyze for performance impact:
  Addressing pattern
  Number of concurrent accesses in flight
21 © NVIDIA 2010
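The app-bytes vs. hardware-bytes distinction can be made concrete with a small model: count the distinct 128-byte lines one warp of 4-byte loads touches. This is an illustrative sketch, not profiler logic; a fully scattered warp moves 32x more bytes than it requests.

```cpp
#include <cassert>
#include <set>
#include <vector>

struct Traffic { long long appBytes, hwBytes; };

// Bytes requested by the app vs. bytes moved by the hardware in 128-byte
// lines, for one warp of 4-byte loads at the given byte addresses.
Traffic warpTraffic(const std::vector<long long>& addrs) {
    std::set<long long> lines;
    for (long long a : addrs) lines.insert(a / 128);
    return { (long long)addrs.size() * 4, (long long)lines.size() * 128 };
}
```

A coalesced warp (addresses 0, 4, ..., 124) requests 128 bytes and moves 128; a warp striding by 128 bytes requests 128 but moves 4096.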
22.
Memory Throughput Analysis
Determining that the access pattern is problematic:
  App throughput is much smaller than HW throughput
    Use the profiler to get HW throughput
  Profiler counters: access instruction count is significantly smaller than transaction count
    gld_request < ( l1_global_load_miss + l1_global_load_hit ) * ( word_size / 4B )
    gst_request < 4 * l2_write_requests * ( word_size / 4B )
    Make sure to adjust the transaction counters for word size (see slide 8)
Determining that the number of concurrent accesses is insufficient:
  Throughput from HW point of view is much lower than theoretical
22 © NVIDIA 2010
23.
Concurrent Accesses and
Performance
Increment a 64M element array
  Two accesses per thread (load then store, but they are dependent)
  Thus, each warp (32 threads) has one outstanding transaction at a time
Tesla C2050, ECC on, max achievable bandwidth: ~114 GB/s
Several independent smaller accesses have the same effect as one larger one.
For example: four 32-bit ~= one 128-bit
23 © NVIDIA 2010
24.
Optimization: Address Pattern
Coalesce the address pattern
  128-byte lines for caching loads
  32-byte segments for non-caching loads, stores
  Coalesce to maximize utilization of bus transactions
  Refer to the CUDA C Programming Guide / Best Practices Guide
Try using non-caching loads
  Smaller transactions (32B instead of 128B): more efficient for scattered or partially-filled patterns
Try fetching data from texture
  Smaller transactions and different caching
  Cache not polluted by other gmem loads
24 © NVIDIA 2010
25.
Optimizing Access Concurrency
Have enough concurrent accesses to saturate the bus
  Little's law: need ( mem_latency x bandwidth ) bytes in flight
  Fermi C2050 global memory: 400-800 cycle latency, 1.15 GHz clock, 144 GB/s bandwidth, 14 SMs
  Need 30-50 128-byte transactions in flight per SM
Ways to increase concurrent accesses:
  Increase occupancy
    Adjust threadblock dimensions to maximize occupancy at given register and smem requirements
    Reduce register count (-maxrregcount option, or __launch_bounds__)
  Modify code to process several elements per thread
25 © NVIDIA 2010
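The 30-50 transactions-per-SM figure falls out of Little's law with the C2050 numbers quoted above. A minimal sketch of the arithmetic:

```cpp
#include <cassert>

// Little's law: bytes in flight = latency x bandwidth. Divide by the number
// of SMs and by the 128-byte transaction size to get transactions per SM.
double transactionsInFlightPerSM(double latencyCycles, double clockGHz,
                                 double bandwidthGBs, int numSMs) {
    double latencySeconds = latencyCycles / (clockGHz * 1e9);
    double bytesInFlight  = latencySeconds * bandwidthGBs * 1e9;
    return bytesInFlight / numSMs / 128.0;
}
```

At 400 cycles this gives about 28 transactions per SM, at 800 cycles about 56, bracketing the slide's 30-50 range.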
26.
Case Study: Access
Pattern 1
Same 3DFD code as in the previous study
Using caching loads (compiler default):
  Memory throughput: 62 / 74 GB/s for app / hw
  Different enough to be interesting
Loads are coalesced:
  gld_request == ( l1_global_load_miss + l1_global_load_hit )
There are halo loads that use only 4 threads out of 32
  For these transactions only 16 bytes out of 128 are useful
Solution: try non-caching loads ( -Xptxas -dlcm=cg compiler option)
  Performance increase of 7% (not bad for just trying a compiler flag, no code change)
  Memory throughput: 66 / 67 GB/s for app / hw
26 © NVIDIA 2010
27.
Case Study: Accesses
in Flight
Continuing with the FD code
  Throughput from both app and hw point of view is 66-67 GB/s
  Now 30.84 out of 33.71 ms are due to mem
  1024 concurrent threads per SM, due to register count (24 per thread)
  A simple copy kernel reaches ~80% of achievable mem throughput at this thread count
Solution: increase accesses per thread
  Modified code so that each thread is responsible for 2 output points
  Doubles the load and store count per thread, saves some indexing math
  Doubles the tile size -> reduces bandwidth spent on halos
Further 25% increase in performance
  App and HW throughputs are now 82 and 84 GB/s, respectively
27 © NVIDIA 2010
28.
Case Study: Access
Pattern 2
Kernel from a climate simulation code
  Mostly fp64 (so, at least 2 transactions per mem access)
Profiler results:
  gld_request:          72,704
  l1_global_load_hit:  439,072
  l1_global_load_miss: 724,192
Analysis:
  L1 hit rate: 37.7%
  16 transactions per load instruction
    Indicates a bad access pattern (2 are expected due to 64-bit words)
    Of the 16, 10 miss in L1 and contribute to mem bus traffic
  So, we fetch 5x more bytes than needed by the app
28 © NVIDIA 2010
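The two derived figures above come straight from the three counters. A sketch of the arithmetic:

```cpp
#include <cassert>

struct LoadStats { double transactionsPerRequest, l1HitRate; };

// Transactions per load instruction and L1 hit rate, from the per-warp
// profiler counters quoted on the slide.
LoadStats analyzeLoads(long long gldRequest, long long l1Hit, long long l1Miss) {
    double transactions = double(l1Hit + l1Miss);
    return { transactions / gldRequest, l1Hit / transactions };
}
```

With the slide's counters this gives exactly 16 transactions per request and a 37.7% L1 hit rate.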
29.
Case Study: Access
Pattern 2
Looking closer at the access pattern:
  Each thread linearly traverses a contiguous memory region
  Expecting CPU-like L1 caching (cache-blocking for L1 and L2)
  One of the worst access patterns for GPUs
Solution:
  Transposed the code so that each warp accesses a contiguous memory region
  2.17 transactions per load instruction
  This and some other changes improved performance by 3x
29 © NVIDIA 2010
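The line-count difference between the two layouts can be modeled with the same distinct-lines idea. This is an illustrative sketch; `regionSize` (the per-thread region length) is an arbitrary assumed value, not from the slides.

```cpp
#include <cassert>
#include <set>

// 128-byte lines touched by one warp in one step, for 8-byte (fp64)
// elements, under the two layouts discussed above.
long long linesTouched(bool warpContiguous, long long step, long long regionSize) {
    std::set<long long> lines;
    for (long long t = 0; t < 32; ++t) {
        long long idx = warpContiguous
                            ? step * 32 + t           // warp walks one contiguous run
                            : t * regionSize + step;  // each thread walks its own region
        lines.insert(idx * 8 / 128);                  // byte address -> 128B line
    }
    return (long long)lines.size();
}
```

Per-thread traversal touches 32 lines per warp access; the transposed, warp-contiguous layout touches only 2 (the expected minimum for 64-bit words).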
30.
Optimizing with Compression
When all else has been optimized and the kernel is limited by the number of bytes needed, consider compression
Approaches:
  Int: conversion between 8-, 16-, 32-bit integers is 1 instruction (64-bit requires a couple)
  FP: conversion between fp16, fp32, fp64 is one instruction
    fp16 (1s5e10m) is storage only, no math instructions
  Range-based:
    Lower and upper limits are kernel arguments
    Data is an index for interpolation between the limits
Application in practice: Clark et al. http://arxiv.org/abs/0911.3191
30 © NVIDIA 2010
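The range-based approach can be sketched as an encode/decode pair: store each value as a small index between the kernel-argument limits and reconstruct by interpolation. The 8-bit index width here is an illustrative assumption, not from the slides.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Encode a value as an 8-bit index between limits [lo, hi].
uint8_t rangeEncode(float v, float lo, float hi) {
    float t = (v - lo) / (hi - lo);          // normalize into [0, 1]
    return (uint8_t)std::lround(t * 255.0f); // quantize to 8 bits
}

// Reconstruct by interpolating between the limits.
float rangeDecode(uint8_t i, float lo, float hi) {
    return lo + (hi - lo) * (float(i) / 255.0f);
}
```

This trades precision (here, 1/255 of the range) for a 4x reduction in bytes moved for fp32 data.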
31.
Summary: Memory Analysis
and Optimization
Analyze:
  Access pattern:
    Compare counts of access instructions and transactions
    Compare throughput from app and hw point of view
  Number of accesses in flight:
    Look at occupancy and independent accesses per thread
    Compare achieved throughput to theoretical throughput
    Also to simple memcpy throughput at the same occupancy
Optimizations:
  Coalesce address patterns per warp (nothing new here), consider texture
  Process more words per thread (if insufficient accesses in flight to saturate bus)
  Try the 4 combinations of L1 size and load type (caching and non-caching)
  Consider compression
31 © NVIDIA 2010
32.
Optimizations for Instruction
Throughput 32 © NVIDIA 2010
33.
Possible Limiting Factors
Raw instruction throughput
  Know the kernel instruction mix
    fp32, fp64, int, mem, transcendentals, etc. have different throughputs
    Refer to the CUDA C Programming Guide
  Can examine assembly, if needed:
    Look at SASS (machine assembly) by disassembling the .cubin with cuobjdump
Instruction serialization
  Occurs when threads in a warp issue the same instruction in sequence
    As opposed to the entire warp issuing the instruction at once
  Some causes:
    Shared memory bank conflicts
    Constant memory bank conflicts
33 © NVIDIA 2010
34.
Instruction Throughput: Analysis
Profiler counters (both incremented by 1 per warp):
  instructions executed: counts instructions encountered during execution
  instructions issued: also includes additional issues due to serialization
  Difference between the two: issues that happened due to serialization, instruction cache misses, etc.
Compare achieved throughput to HW capabilities
  Peak instruction throughput is documented in the Programming Guide
  Profiler also reports throughput:
    GT200: as a fraction of theoretical peak for fp32 instructions
    Fermi: as IPC (instructions per clock)
34 © NVIDIA 2010
35.
Instruction Throughput: Optimization
Use intrinsics where possible ( __sinf(), __sincosf(), __expf(), etc. )
  Available for a number of math.h functions
  Less accuracy, much higher throughput: refer to the CUDA C Programming Guide for details
  Often a single instruction, whereas a non-intrinsic is a SW sequence
Additional compiler flags that also help (select GT200-level precision):
  -ftz=true : flush denormals to 0
  -prec-div=false : faster fp division instruction sequence (some precision loss)
  -prec-sqrt=false : faster fp sqrt instruction sequence (some precision loss)
Make sure you do fp64 arithmetic only where you mean it:
  fp64 throughput is lower than fp32
35 © NVIDIA 2010
36.
Serialization: Profiler Analysis
Serialization is significant if instructions_issued is significantly higher than instructions_executed
Warp divergence
  Profiler counters: divergent_branch, branch
  Compare the two to see what percentage diverges
    However, this only counts the branches, not the rest of the serialized instructions
SMEM bank conflicts
  Profiler counters:
    l1_shared_bank_conflict: incremented by 1 per warp for each replay
      Double counts for 64-bit accesses
    shared_load, shared_store: incremented by 1 per warp per smem LD/ST instruction
  Bank conflicts are significant if both are true:
    Instruction throughput affects performance
    l1_shared_bank_conflict is significant compared to instructions_issued and to shared_load + shared_store
36 © NVIDIA 2010
37.
Serialization: Analysis with
Modified Code
Modify kernel code to assess the performance improvement if serialization were removed
  Helps decide whether optimizations are worth pursuing
Shared memory bank conflicts:
  Change indexing to be either broadcasts or just threadIdx.x
  Should also declare smem variables as volatile
Warp divergence:
  Change the condition to always take the same path
  Time both paths to see what each costs
37 © NVIDIA 2010
38.
Serialization: Optimization
Shared memory bank conflicts:
  Pad SMEM arrays
    See CUDA Best Practices Guide, Transpose SDK whitepaper
  Rearrange data in SMEM
Warp serialization:
  Try grouping threads that take the same path
    Rearrange the data, pre-process the data
    Rearrange how threads index data (may affect memory perf)
38 © NVIDIA 2010
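The bank-index arithmetic behind SMEM padding can be sketched on the host. This is an illustrative model of Fermi's 32 banks of 4-byte words, assuming a 32-wide fp32 tile: a column access maps every thread to the same bank, while padding each row to 33 words spreads the column over all banks.

```cpp
#include <cassert>

// Bank of element (row, col) in a row-major tile with rowWidth words.
// Fermi: 32 banks, 4-byte words.
int bankOf(int row, int col, int rowWidth) {
    return (row * rowWidth + col) % 32;
}

// Worst-case number of threads hitting one bank when a warp reads a column.
// 1 = conflict-free, 32 = fully serialized (32 replays).
int conflictDegree(int col, int rowWidth) {
    int counts[32] = {0};
    for (int row = 0; row < 32; ++row) counts[bankOf(row, col, rowWidth)]++;
    int worst = 0;
    for (int b = 0; b < 32; ++b) if (counts[b] > worst) worst = counts[b];
    return worst;
}
```

An unpadded 32-wide tile gives a 32-way conflict on column access; padding to 33 makes it conflict-free, which is exactly why the Transpose whitepaper pads.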
39.
Case Study: SMEM
Bank Conflicts
A different climate simulation code kernel, fp64
Profiler values:
  Instructions executed / issued: 2,406,426 / 2,756,140
    Difference: 349,714 (12.7% of issued)
  GMEM: total load and store transactions: 170,263
    Instr:byte ratio: 4, which suggests that instructions are a significant limiter (especially since there is a lot of fp64 math)
  SMEM load / store: 421,785 / 95,172
  Bank conflicts: 674,856 (really 337,428 because of double-counting for fp64)
    A total of 854,385 SMEM access issues (421,785 + 95,172 + 337,428): 39% replays
Solution: pad the shared memory array
  Performance increased by 15%
  Replayed instructions reduced down to 1%
39 © NVIDIA 2010
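The 39% replay figure follows from the counters quoted above. A sketch of the arithmetic, including the fp64 double-counting correction:

```cpp
#include <cassert>

// Replay fraction of all SMEM access issues. The l1_shared_bank_conflict
// counter double-counts for 64-bit accesses, so halve it for fp64 kernels.
double smemReplayFraction(long long loads, long long stores,
                          long long conflictCounter, bool fp64) {
    double replays = fp64 ? conflictCounter / 2.0 : (double)conflictCounter;
    return replays / (loads + stores + replays);
}
```

With the slide's counters this yields roughly 0.39, i.e. 39% of SMEM access issues are replays.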
40.
Instruction Throughput: Summary
Analyze:
  Check achieved instruction throughput
  Compare to HW peak (but must take instruction mix into consideration)
  Check the percentage of instructions due to serialization
Optimizations:
  Intrinsics, compiler options for expensive operations
  Group threads that are likely to follow the same execution path
  Avoid SMEM bank conflicts (pad, rearrange data)
40 © NVIDIA 2010
41.
Optimizations for Latency
41 © NVIDIA 2010
42.
Latency: Analysis
Suspect latency if:
  Neither memory nor instruction throughput rates are close to HW theoretical rates
  Poor overlap between mem and math
    Full-kernel time is significantly larger than max{mem-only, math-only}
Two possible causes:
  Insufficient concurrent threads per multiprocessor to hide latency
    Occupancy too low
    Too few threads in the kernel launch to load the GPU
  Too few concurrent threadblocks per SM when using __syncthreads()
    __syncthreads() can prevent overlap between math and mem within the same threadblock
42 © NVIDIA 2010
43.
Simplified View of
Latency and Syncs
[Timeline chart: memory-only time, math-only time, and full-kernel time with one large threadblock per SM]
Kernel where most math cannot be executed until all data is loaded by the threadblock
43 © NVIDIA 2010
44.
Simplified View of
Latency and Syncs
[Timeline chart: memory-only time, math-only time, full-kernel time with one large threadblock per SM, and full-kernel time with two threadblocks per SM (each half the size of one large one)]
Kernel where most math cannot be executed until all data is loaded by the threadblock
44 © NVIDIA 2010
45.
Latency: Optimization
Insufficient threads or workload:
  Increase the level of parallelism (more threads)
  If occupancy is already high but latency is not being hidden:
    Process several output elements per thread: gives more independent memory and arithmetic instructions (which get pipelined)
Barriers:
  Can assess impact on perf by commenting out __syncthreads()
    Incorrect result, but gives an upper bound on improvement
  Try running several smaller threadblocks
    In some cases that costs extra bandwidth due to halos
Check out Vasily Volkov's talk 2238 from GTC 2010 for a detailed treatment
45 © NVIDIA 2010
46.
Register Spilling
46 © NVIDIA 2010
47.
Register Spilling
Fermi HW limit is 63 registers per thread
Spills are also possible when the register limit is programmer-specified
  Common when trying to achieve a certain occupancy with the -maxrregcount compiler flag or __launch_bounds__ in source
lmem is like gmem, except that writes are cached in L1
  lmem load hit in L1 -> no bus traffic
  lmem load miss in L1 -> bus traffic (128 bytes per miss)
Compiler flag -Xptxas -v gives the register and lmem usage per thread
Potential impact on performance:
  Additional bandwidth pressure if evicted from L1
  Additional instructions
  Not always a problem; easy to investigate with quick profiler analysis
47 © NVIDIA 2010
48.
Register Spilling: Analysis
Profiler counters: l1_local_load_hit, l1_local_load_miss
Impact on instruction count:
  Compare to total instructions issued
Impact on memory throughput:
  Misses add 128 bytes per warp
  Compare 2 * l1_local_load_miss count to gmem access count (stores + loads)
    Multiply lmem load misses by 2: the missed line must have been evicted -> store across bus
  Comparing with caching loads: count only gmem misses in L1
  Comparing with non-caching loads: count all loads
48 © NVIDIA 2010
49.
Optimization for Register
Spilling
Try increasing the limit of registers per thread
  Use a higher limit in -maxrregcount, or a lower thread count for __launch_bounds__
  Will likely decrease occupancy, potentially making gmem accesses less efficient
  However, may still be an overall win: fewer total bytes being accessed in gmem
Non-caching loads for gmem
  Potentially fewer contentions with spilled registers in L1
Increase L1 size to 48KB
  Default is 16KB L1 / 48KB smem
49 © NVIDIA 2010
50.
Register Spilling: Case
Study
FD kernel (3D cross stencil)
fp32, so all gmem accesses are 4-byte words
Need higher occupancy to saturate memory bandwidth
Coalesced, non-caching loads:
  One gmem request = 128 bytes
  All gmem loads result in bus traffic
Larger threadblocks mean lower gmem pressure
  Halos (ghost cells) are smaller as a percentage
Aiming to have 1024 concurrent threads per SM
  Means no more than 32 registers per thread
  Compiled with -maxrregcount=32
50 © NVIDIA 2010
51.
Case Study: Register
Spilling 1
10th order in space kernel (31-point stencil)
32 registers per thread : 68 bytes of lmem per thread : up to 1024 threads per SM
Profiled counters:
  l1_local_load_miss = 36          inst_issued = 8,308,582
  l1_local_load_hit  = 70,956      gld_request = 595,200
  local_store        = 64,800      gst_request = 128,000
Conclusion: spilling is not a problem in this case
  The ratio of gmem to lmem bus traffic is approx 10,044 : 1 (hardly any bus traffic is due to spills)
  L1 contains most of the spills (99.9% hit rate for lmem loads)
  Only 1.6% of all instructions are due to spills
Comparison: 42 registers per thread : no spilling : up to 768 threads per SM
  Single 512-thread block per SM : 24% perf decrease
  Three 256-thread blocks per SM : 7% perf decrease
51 © NVIDIA 2010
52.
Case Study: Register
Spilling 2
12th order in space kernel (37-point stencil)
32 registers per thread : 80 bytes of lmem per thread : up to 1024 threads per SM
Profiled counters:
  l1_local_load_miss = 376,889     inst_issued = 10,154,216
  l1_local_load_hit  = 36,931      gld_request = 550,656
  local_store        = 71,176      gst_request = 115,200
Conclusion: spilling is a problem in this case
  The ratio of gmem to lmem bus traffic is approx 6 : 7 (53% of bus traffic is due to spilling)
  L1 does not contain the spills (8.9% hit rate for lmem loads)
  4.1% of all instructions are due to spills
Solution: increase the register limit per thread
  42 registers per thread : no spilling : up to 768 threads per SM
  Single 512-thread block per SM : 13% perf increase
  Three 256-thread blocks per SM : 37% perf increase
52 © NVIDIA 2010
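The spill-traffic comparison used in both case studies can be reproduced from the counters: each lmem load miss costs 128 bytes twice (the evicted line goes back out over the bus), measured against non-caching gmem accesses at 128 bytes per request. A sketch:

```cpp
#include <cassert>

// Share of bus traffic due to register spilling, per the slides' method:
// lmem bytes = 2 * l1_local_load_miss * 128 (miss + writeback of evicted
// line); gmem bytes = (loads + stores) * 128 for non-caching accesses.
double spillTrafficShare(long long lmemLoadMiss,
                         long long gldRequest, long long gstRequest) {
    double lmemBytes = 2.0 * lmemLoadMiss * 128.0;
    double gmemBytes = double(gldRequest + gstRequest) * 128.0;
    return lmemBytes / (lmemBytes + gmemBytes);
}
```

Case study 1's counters give a spill share well under 0.1% (spilling harmless); case study 2's give about 53%, matching the slide's 6:7 ratio.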
53.
Register Spilling: Summary
Register spilling can affect performance due to:
  Increased pressure on the memory bus
  Increased instruction count
Use the profiler to examine the impact by comparing:
  2 * l1_local_load_miss to all gmem accesses
  Local access count to total instructions issued
Impact is significant if:
  Memory-bound code: lmem misses are a significant percentage of total bus traffic
  Instruction-bound code: lmem accesses are a significant percentage of all instructions
53 © NVIDIA 2010
54.
Summary
Determining what limits your kernel most:
  Arithmetic, memory bandwidth, latency
Address the bottlenecks in order of importance:
  Analyze for inefficient use of hardware
  Estimate the impact on overall performance
  Optimize to most efficiently use hardware
More resources:
  Fundamental Optimizations talk
  Prior CUDA tutorials at Supercomputing: http://gpgpu.org/{sc2007,sc2008,sc2009}
  CUDA C Programming Guide, CUDA C Best Practices Guide
  CUDA webinars
54 © NVIDIA 2010
55.
Questions?
55 © NVIDIA 2010