CODE GPU WITH CUDA
OPTIMIZING MEMORY & CONTROL FLOW
Created by Marina Kolpakova (cuda.geek) for Itseez
OUTLINE
Memory types
Memory caching
Types of memory access patterns
Textures
Control flow performance limiters
List of common advice
MEMORY
OPTIMIZATION
MEMORY TYPES
Memory Scope Location Cached Access Lifetime
Register Thread On-chip N/A R/W Thread
Local Thread Off-chip L1/L2 R/W Thread
Shared Block On-chip N/A R/W Block
Global Grid + Host Off-chip L2 R/W App
Constant Grid + Host Off-chip L1,L2,L3 R App
Texture Grid + Host Off-chip L1,L2 R App
GPU CACHES
GPU caches are not intended for the same use as CPU caches.
Not aimed at temporal reuse. Much smaller than CPU caches, especially per thread (e.g. Fermi:
48 KB L1, 1536 threads in flight, so cache per thread = one 128-byte line).
Aimed at spatial reuse: intended to smooth some access patterns and to help with spilled
registers and stack.
Do not tile relying on block size: lines are likely evicted within the next few accesses.
Use smem for tiling instead. Same latency, fully programmable.
L2 is aimed at speeding up atomics and gmem writes.
GMEM
Learn your access pattern before thinking about latency hiding, and try not to thrash the
memory bus.
Four general categories of inefficient memory access patterns:
Misaligned (offset) warp addresses
Strided access between threads within a warp
Thread-affine (each thread in a warp accesses a large contiguous region)
Irregular (scattered) addresses
Always be aware of the bytes you actually need versus the bytes you transfer over the bus
GMEM: MISALIGNED
Add extra padding to the data to force alignment
Use the read-only texture L1
A combination of the above
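A minimal sketch of the padding approach, assuming a 2D image of floats (the kernel and names are illustrative, not from the slides). cudaMallocPitch pads each row so that every row starts on an alignment-friendly boundary, which keeps the first warp of each row on a 128-byte segment:

```cuda
#include <cuda_runtime.h>

__global__ void process(const float* img, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
    {
        // pitch is in bytes, so step between rows through a char* cast
        const float* row = (const float*)((const char*)img + y * pitch);
        float v = row[x];   // row base is aligned, so warp addresses are not offset
        // ... use v ...
    }
}

void launch(int width, int height)
{
    float* d_img;
    size_t pitch;
    // each row is padded up to the device's preferred pitch
    cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);
    dim3 block(32, 8), grid((width + 31) / 32, (height + 7) / 8);
    process<<<grid, block>>>(d_img, pitch, width, height);
    cudaFree(d_img);
}
```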
GMEM: STRIDED
If pattern is regular, try to change data layout: AoS -> SoA
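A sketch of the AoS -> SoA change on a hypothetical "particle" record (the struct and kernel names are illustrative). In the AoS form, lanes reading only one field stride through memory; in the SoA form, consecutive lanes read consecutive floats:

```cuda
struct ParticleAoS { float x, y, z, w; };     // one 16-byte struct per particle

struct ParticlesSoA { float *x, *y, *z, *w; }; // one array per field

// AoS: reading only .x gives a 16-byte stride between lanes,
// so 3/4 of every transferred cache line is wasted
__global__ void copyXAoS(const ParticleAoS* p, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p[i].x;
}

// SoA: a 4-byte stride between lanes -> fully coalesced loads
__global__ void copyXSoA(ParticlesSoA p, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p.x[i];
}
```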
GMEM: STRIDED
Use smem to correct the access pattern.
1. load gmem -> smem with best coalescing
2. synchronize
3. use
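The three steps above can be sketched with the classic 32x32 transpose tile, where the "use" order is transposed relative to the coalesced load order (a standard pattern, not code from the slides):

```cuda
#define TILE 32

__global__ void transpose(const float* in, float* out, int n)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column pads away bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // 1. load gmem -> smem with best coalescing (consecutive x per warp)
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    // 2. synchronize before any thread reads another thread's element
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;

    // 3. use: the transposed read hits smem, the gmem write stays coalesced
    if (tx < n && ty < n)
        out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}
```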
GMEM: STRIDED
Use warp shuffle to permute elements within a warp
1. load the elements needed by the warp with coalesced accesses
2. permute
3. use
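A sketch of the shuffle route (sm_30+). The rotate-by-one permutation is only an illustration; real code would compute whatever lane mapping the algorithm needs:

```cuda
__global__ void warpPermute(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int lane = threadIdx.x & 31;

    // 1. coalesced load: lane k reads element k of the warp's segment
    float v = in[i];

    // 2. permute through registers, no smem round-trip and no __syncthreads
    float rotated = __shfl(v, (lane + 1) & 31);   // __shfl_sync on newer GPUs

    // 3. use
    out[i] = rotated;
}
```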
GMEM: STRIDED
Use a proper caching strategy
cg – cache global (L2 only)
ldg – load through the read-only texture L1
cs – cache streaming (evict-first)
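The strategies above can be requested per load. __ldg is a real intrinsic (sm_35+); the .cg/.cs variants are spelled via inline PTX, as sketched here:

```cuda
__device__ float load_cg(const float* p)   // cache in L2 only, bypass L1
{
    float v;
    asm("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

__device__ float load_cs(const float* p)   // streaming: mark line evict-first
{
    float v;
    asm("ld.global.cs.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

__global__ void kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);   // route through the read-only texture L1
}
```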
GMEM: THREAD-AFFINE
Each thread accesses a relatively long contiguous memory region
Load big structures using AoS
A thread loads a contiguous region of data
All threads load the same data
GMEM: THREAD-AFFINE
Work distribution

int tid = blockIdx.x * blockDim.x + threadIdx.x;
int threadN = N / (blockDim.x * gridDim.x);

// blocked distribution: each thread walks its own contiguous chunk
for (size_t i = tid * threadN; i < (tid + 1) * threadN; ++i)
{
    sum += in[i];
}

// grid-stride distribution: consecutive threads access consecutive elements
for (size_t i = tid; i < N; i += blockDim.x * gridDim.x)
{
    sum += in[i];
}
UNIFORM LOAD
All threads in a block read the same address.
The memory operation uses the 3-level constant cache
Generated by the compiler
Also available as a PTX asm insertion
__device__ __forceinline__ float __ldu(const float* ptr)
{
    float val;
    asm("ldu.global.f32 %0, [%1];" : "=f"(val) : "l"(ptr));
    return val;
}
GMEM: IRREGULAR
Random memory access: threads in a warp access many lines, and strides are irregular.
Improve data locality
Try 2D-local arrays (Morton-ordered)
Use the read-only texture L1
Use kernel fission to localize the worst case.
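A sketch of the Morton (Z-order) indexing mentioned above: interleaving the bits of x and y keeps 2D-nearby elements in nearby addresses, so 2D-local access patterns touch fewer cache lines. This is the standard bit-interleave trick, not code from the slides:

```cuda
// spread the low 16 bits of v out to even bit positions
__device__ unsigned int part1by1(unsigned int v)
{
    v &= 0x0000ffff;
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// linear index for element (x, y) of a Morton-ordered 2D array
__device__ unsigned int mortonIndex(unsigned int x, unsigned int y)
{
    return part1by1(x) | (part1by1(y) << 1);
}
```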
TEXTURE
Smaller transactions and different caching (dedicated L1, 48 KB, ~104-clock latency)
The cache is not polluted by other GMEM loads; a separate partition for each warp scheduler
helps to prevent cache thrashing
Optional hardware interpolation (note: 9-bit fixed-point weights)
Hardware handling of out-of-bound accesses
Kepler improvements:
sm_30+: bindless textures. No global static variables; can be used in threaded code
sm_32+: GMEM access through the texture cache, bypassing the interpolation units
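A sketch of an sm_30+ bindless texture: the handle is a plain value passed as a kernel argument instead of a global texture<> variable. The function names are illustrative; the API calls are the standard texture-object runtime calls:

```cuda
#include <cuda_runtime.h>

__global__ void sample(cudaTextureObject_t tex, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
}

cudaTextureObject_t makeTexture(cudaArray_t array)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;

    cudaTextureDesc desc = {};
    desc.addressMode[0] = cudaAddressModeClamp;   // hardware out-of-bound handling
    desc.addressMode[1] = cudaAddressModeClamp;
    desc.filterMode = cudaFilterModeLinear;       // hardware interpolation
    desc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &desc, NULL);
    return tex;
}
```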
SMEM: BANKING
KEPLER: 32-BIT AND 64-BIT MODES
special case: 2D smem usage (Fermi example)
__shared__ float smem_buffer[32][32 + 1];   // +1 column pads away bank conflicts
SMEM
The common techniques are:
use smem to improve memory access pattern
use smem for stencil processing
But the gap between smem and math throughput is increasing
Tesla: 16 (32 bit) banks vs 8 thread processors (2:1)
GF100: 32 (32 bit) banks vs 32 thread processors (1:1)
GF104: 32 (32 bit) banks vs 48 thread processors (2:3)
Kepler: 32 (64 bit) banks vs 192 thread processors (1:3)
Max size 48 KB (49152 B), assume max occupancy 64x32,
so 24 bytes per thread.
More intensive memory usage affects occupancy.
SMEM (CONT.)
smem + L1 share the same 64 KB. Program-configurable split:
Fermi: 48:16, 16:48
Kepler: 48:16, 16:48, 32:32
cudaDeviceSetCacheConfig(), cudaFuncSetCacheConfig()
prefer L1 to improve lmem usage
prefer smem for stencil kernels
smem is often used for:
data sharing across the block
inter-warp communication
block-level buffers (for scan or reduction)
stencil code
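One of the uses listed above, a block-level reduction buffer, can be sketched as follows (the 256-thread block size and kernel name are assumptions):

```cuda
__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ float buf[256];   // one slot per thread of the block

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // tree reduction inside the block; smem is the communication medium
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];   // one partial sum per block
}
```

Since this kernel is smem-bound, it is the kind of code where preferring the smem-heavy L1/smem split (cudaFuncCachePreferShared) tends to pay off.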
LMEM
Local memory is a stack-memory analogue: call stack, register spilling. Note: local
memory reads and writes are both cached in L1.
Registers are for automatic variables
The volatile keyword forces spilling
Registers do not support indexing: local memory is used for local arrays
Register spilling leads to more instructions and memory traffic
int a = 42;             // automatic variable: kept in a register
int b[SIZE] = { 0, };   // indexed array: placed in local memory
SPILLING CONTROL
1. Use __launch_bounds__ to help the compiler select the maximum number of registers per thread.
2. Compile with -maxrregcount to force the compiler to optimize register usage and spill
registers if needed.
3. Without these controls, high register usage means you run fewer concurrent warps per SM by default.
__global__ void __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
kernel(...)
{
    // ...
}
CONTROL FLOW
CONTROL FLOW: PROBLEMS
Warp divergence: branching, early loop exits... Inspect SASS to find divergent pieces of
code
Workload is data-dependent: the code path depends on the input (as in classification tasks)
Too much synchronization logic: intensive use of parallel data structures, lots of
atomics, __syncthreads(), etc.
Resident warps: occupy resources but do nothing
Big blocks: tail effect
CONTROL FLOW: SOLUTIONS
Understand your problem. Select the best algorithm keeping the GPU architecture in mind.
Maximize independent parallelism
The compiler generates branch predication with -O3 when optimizing if/switch, but the
number of instructions must not exceed a threshold: 7 if there are lots of divergent
warps, 4 otherwise
Adjust thread block size
Try work queues
KERNEL FUSION AND FISSION
Fusion
Replace a chain of kernel calls with a single fused kernel
Helps to save memory reads/writes; intermediate results can be kept in registers
Enables further ILP optimizations
The kernels should have nearly the same access pattern
Fission
Replace one kernel call with a chain
Helps to localize ineffective memory access patterns
Insert small kernels that repack data (e.g. integral image)
TUNING BLOCK CONFIGURATION
Finding the optimal launch configuration is crucial for best performance. Launch
configuration affects occupancy:
low occupancy prevents full hardware utilization and lowers the ability to hide latency
high occupancy for kernels with large memory demands results in overloaded read or
write queues
Experiment to find the configuration (block and grid resolution, amount of work per
thread) that is optimal for your kernel.
FINAL WORDS
Basic CUDA Code Optimizations
use compiler flags
do not trick compiler
use structure of arrays
improve memory layout
load by cache line
process by row
cache data in registers
re-compute values instead of re-loading
keep data on GPU
FINAL WORDS
Conventional parallelization optimizations
use light-weight locking,
... atomics,
... and lock-free code.
minimize locking,
... memory fences,
... and volatile accesses.
FINAL WORDS
Conventional architectural optimizations
utilize shared memory,
... constant memory,
... streams,
... thread voting,
... and rsqrtf;
detect compute capability and number of SMs;
tune thread count,
... blocks per SM,
... launch bounds,
and L1 cache/shared memory configuration
THE END
by cuda.geek, 2013–2015

More Related Content

What's hot

Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...Marina Kolpakova
 
An evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loopsAn evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loopsLinaro
 
Arm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportArm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportLinaro
 
Compilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMCompilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMLinaro
 
TinyML - 4 speech recognition
TinyML - 4 speech recognition TinyML - 4 speech recognition
TinyML - 4 speech recognition 艾鍗科技
 
Performance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVEPerformance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVELinaro
 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsLinaro
 
Code GPU with CUDA - Applying optimization techniques
Code GPU with CUDA - Applying optimization techniquesCode GPU with CUDA - Applying optimization techniques
Code GPU with CUDA - Applying optimization techniquesMarina Kolpakova
 
Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)elliando dias
 
Tensorflow lite for microcontroller
Tensorflow lite for microcontrollerTensorflow lite for microcontroller
Tensorflow lite for microcontrollerRouyun Pan
 
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
 An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
An Open Discussion of RISC-V BitManip, trends, and comparisons _ ClaireRISC-V International
 
Q4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerQ4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerLinaro
 
GEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkGEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkAlexey Smirnov
 
EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5PRADEEP
 
Moving NEON to 64 bits
Moving NEON to 64 bitsMoving NEON to 64 bits
Moving NEON to 64 bitsChiou-Nan Chen
 
TLPI - 6 Process
TLPI - 6 ProcessTLPI - 6 Process
TLPI - 6 ProcessShu-Yu Fu
 

What's hot (20)

Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...
 
An evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loopsAn evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loops
 
Arm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportArm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler support
 
Compilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMCompilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVM
 
TinyML - 4 speech recognition
TinyML - 4 speech recognition TinyML - 4 speech recognition
TinyML - 4 speech recognition
 
ocelot
ocelotocelot
ocelot
 
Performance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVEPerformance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVE
 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON Intrinsics
 
Code GPU with CUDA - Applying optimization techniques
Code GPU with CUDA - Applying optimization techniquesCode GPU with CUDA - Applying optimization techniques
Code GPU with CUDA - Applying optimization techniques
 
Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)
 
Tensorflow lite for microcontroller
Tensorflow lite for microcontrollerTensorflow lite for microcontroller
Tensorflow lite for microcontroller
 
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
 An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
 
Assembly language part I
Assembly language part IAssembly language part I
Assembly language part I
 
eBPF/XDP
eBPF/XDP eBPF/XDP
eBPF/XDP
 
Q4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerQ4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-Vectorizer
 
GEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkGEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions Framework
 
EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5
 
Machine Trace Metrics
Machine Trace MetricsMachine Trace Metrics
Machine Trace Metrics
 
Moving NEON to 64 bits
Moving NEON to 64 bitsMoving NEON to 64 bits
Moving NEON to 64 bits
 
TLPI - 6 Process
TLPI - 6 ProcessTLPI - 6 Process
TLPI - 6 Process
 

Similar to Code GPU with CUDA - Optimizing memory and control flow

Chapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldChapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldPraveen Kumar
 
301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilogSrinivas Naidu
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance CachingScyllaDB
 
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docxECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docxtidwellveronique
 
ECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docxECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docxtidwellveronique
 
DIGITAL DESIGNS SLIDES 7 ENGINEERING 2ND YEAR
DIGITAL DESIGNS SLIDES 7 ENGINEERING  2ND YEARDIGITAL DESIGNS SLIDES 7 ENGINEERING  2ND YEAR
DIGITAL DESIGNS SLIDES 7 ENGINEERING 2ND YEARkasheen2803
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009Léia de Sousa
 
ECECS 472572 Final Exam ProjectRemember to check the errata
ECECS 472572 Final Exam ProjectRemember to check the errata ECECS 472572 Final Exam ProjectRemember to check the errata
ECECS 472572 Final Exam ProjectRemember to check the errata EvonCanales257
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningRenaldas Zioma
 
Please do ECE572 requirementECECS 472572 Final Exam Project (W.docx
Please do ECE572 requirementECECS 472572 Final Exam Project (W.docxPlease do ECE572 requirementECECS 472572 Final Exam Project (W.docx
Please do ECE572 requirementECECS 472572 Final Exam Project (W.docxARIV4
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory OptimizationWei Lin
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimizationguest3eed30
 
JerryScript: An ultra-lighteweight JavaScript Engine for the Internet of Things
JerryScript: An ultra-lighteweight JavaScript Engine for the Internet of ThingsJerryScript: An ultra-lighteweight JavaScript Engine for the Internet of Things
JerryScript: An ultra-lighteweight JavaScript Engine for the Internet of ThingsSamsung Open Source Group
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_partyOpen Party
 
Multi-core processor and Multi-channel memory architecture
Multi-core processor and Multi-channel memory architectureMulti-core processor and Multi-channel memory architecture
Multi-core processor and Multi-channel memory architectureUmair Amjad
 

Similar to Code GPU with CUDA - Optimizing memory and control flow (20)

1083 wang
1083 wang1083 wang
1083 wang
 
Chapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldChapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworld
 
301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Caching
 
L05 parallel
L05 parallelL05 parallel
L05 parallel
 
Memory management
Memory managementMemory management
Memory management
 
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docxECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
 
ECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docxECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docx
 
DIGITAL DESIGNS SLIDES 7 ENGINEERING 2ND YEAR
DIGITAL DESIGNS SLIDES 7 ENGINEERING  2ND YEARDIGITAL DESIGNS SLIDES 7 ENGINEERING  2ND YEAR
DIGITAL DESIGNS SLIDES 7 ENGINEERING 2ND YEAR
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009
 
Memoryhierarchy
MemoryhierarchyMemoryhierarchy
Memoryhierarchy
 
ECECS 472572 Final Exam ProjectRemember to check the errata
ECECS 472572 Final Exam ProjectRemember to check the errata ECECS 472572 Final Exam ProjectRemember to check the errata
ECECS 472572 Final Exam ProjectRemember to check the errata
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine Learning
 
Please do ECE572 requirementECECS 472572 Final Exam Project (W.docx
Please do ECE572 requirementECECS 472572 Final Exam Project (W.docxPlease do ECE572 requirementECECS 472572 Final Exam Project (W.docx
Please do ECE572 requirementECECS 472572 Final Exam Project (W.docx
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimization
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimization
 
Architecture Assignment Help
Architecture Assignment HelpArchitecture Assignment Help
Architecture Assignment Help
 
JerryScript: An ultra-lighteweight JavaScript Engine for the Internet of Things
JerryScript: An ultra-lighteweight JavaScript Engine for the Internet of ThingsJerryScript: An ultra-lighteweight JavaScript Engine for the Internet of Things
JerryScript: An ultra-lighteweight JavaScript Engine for the Internet of Things
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_party
 
Multi-core processor and Multi-channel memory architecture
Multi-core processor and Multi-channel memory architectureMulti-core processor and Multi-channel memory architecture
Multi-core processor and Multi-channel memory architecture
 

Recently uploaded

Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxLigayaBacuel1
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 

Recently uploaded (20)

Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 

Code GPU with CUDA - Optimizing memory and control flow

  • 1. CODE GPU WITH CUDA OPTIMIZING MEMORY & CONTROL FLOW CreatedbyMarinaKolpakova( )forcuda.geek Itseez PREVIOUS
  • 2. OUTLINE Memory types Memory caching Types of memory access patterns Textures control flow performance limiters list of common advices
  • 4. MEMORY TYPES Memory Scope Location Cached Access Lifetime Register Thread On-chip N/A R/W Thread Local Thread Off-chip L1/L2 R/W Thread Shared Block On-chip N/A R/W Block Global Grid + Host Off-chip L2 R/W App Constant Grid + Host Off-chip L1,L2,L3 R App Texture Grid + Host Off-chip L1,L2 R App
  • 7. GPU CACHES GPU caches are not intended for the same use as CPU's Not aimed at temporal reuse. Smaller than CPU size (especially per thread, e.g. Fermi: 48 KB L1, 1536 threads on fly, cache / thread = 1 x 128-byte line). Aimed at spatial reuse. Intended to smooth some access patterns, help with spilled registers and stack. Do not tile relying on block size. Lines likely become evicted next few access Use smem for tiling. Same latency, fully programmable L2 aimed to speed up atomics and gmem writes.
  • 8. GMEM Learn your access pattern before thinking about latency hiding and try not to thresh the memory bus. Four general categories of inefficient memory access patterns: Miss-aligned (offset) warp addresses Strided access between threads within a warp Thread-affine (each thread in a warp accesses a large contiguous region) Irregular (scattered) addresses Always be aware about bytes you actually need and bytes you transfer through the bus
  • 9. GMEM: MISS-ALIGNED Add extra padding for data to force alignment Use read-only texture L1 Combination of above
  • 10. GMEM: STRIDED If pattern is regular, try to change data layout: AoS -> SoA
  • 11. GMEM: STRIDED Use smem to correct access pattern. 1. load gmem -> smem with best coalescing 2. synchronize 3. use
  • 12. GMEM: STRIDED Use warp shuffle to permute elements for warp 1. coalescingly load elements needed by warp 2. permute 3. use
  • 13. GMEM: STRIDED Use proper caching strategy cg – cache global ldg – cache in texture L1 cs – cache streaming
  • 14. GMEM: THREAD-AFFINE Each thread accesses relatively long continuous memory region Load big structures using AoS Thread loads continuous region of data All threads load the same data
  • 15. GMEM: THREAD-AFFINE Work distribution i n t t i d = b l o c k I d x . x * b l o c k D i m . x + t h r e a d I d x . x ; i n t t h r e a d N = N / b l o c k D i m . x * g r i d D i m . x ; f o r ( s i z e _ t i = t i d * N ; i < ( t i d + 1 ) * N ; + + i ) { s u m = + i n [ i ] } f o r ( s i z e _ t i = t i d ; i < N ; i + = b l o c k D i m . x * g r i d D i m . x ) { s u m = + i n [ i ] }
  • 16. UNIFORM LOAD All threads in a block access the same address as read only. Memory operation uses 3-level constant cache Generated by compiler Available as PTX asm insertion _ _ d e v i c e _ _ _ _ f o r c e i n l i n e _ _ f l o a t _ _ l d u ( c o n s t f l o a t * p t r ) { f l o a t v a l ; a s m ( " l d u . g l o b a l . f 3 2 % 0 , [ % 1 ] ; " : " = " f ( v a l ) : l ( p t r ) ) ; r e t u r n v a l ; }
  • 17. GMEM: IRREGULAR Random memory access. Threads in a warp access many lines, strides are irregular. Improve data locality Try 2D-local arrays (Morton-ordered) Use read-only texture L1 Kernel fission to localize the worst case.
  • 18. TEXTURE Smaller transactions and different caching (dedicated L1, 48 KB, ~104 clock latency) Cache is not polluted by other GMEM loads, separate partition for each warp scheduler helps to prevent cache threshing Possible hardware interpolation (Note: 9-bit alpha) Hardware handling of out-of-bound access Kepler improvements: sm_30+ Bindless textures. No global static variables. Can be used in threaded code sm_32+ GMEM access through texture cache bypassing interpolation units
  • 19. SMEM: BANKING KEPLER: 32-BIT AND 64-BIT MODES special case: 2D smem usage (Fermi example)
    __shared__ float smem_buffer[32][32 + 1];
  • 20. SMEM The common techniques are: use smem to improve memory access patterns use smem for stencil processing But the gap between smem and math throughput is increasing Tesla: 16 (32-bit) banks vs 8 thread processors (2:1) GF100: 32 (32-bit) banks vs 32 thread processors (1:1) GF104: 32 (32-bit) banks vs 48 thread processors (2:3) Kepler: 32 (64-bit) banks vs 192 thread processors (1:3) Max size 48 KB (49152 B); assuming max occupancy of 64 warps x 32 threads, that is 24 bytes per thread. More intensive smem usage reduces occupancy.
  • 21. SMEM (CONT.) smem + L1 share the same 64 KB. Program-configurable split: Fermi: 48:16, 16:48 Kepler: 48:16, 16:48, 32:32 cudaDeviceSetCacheConfig(), cudaFuncSetCacheConfig() prefer L1 to improve lmem usage prefer smem for stencil kernels smem is often used for: data sharing across the block intra-block communication block-level buffers (for scan or reduction) stencil code
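A hedged host-side sketch of the split-selection calls named above (the `stencil` kernel is a placeholder):

```cuda
// Illustrative: choose the L1/smem split device-wide, then override per kernel.
__global__ void stencil(float* data) { /* smem-heavy stencil code */ }

void configureCaches()
{
    // device-wide default: favor L1 (helps spilled registers / lmem traffic)
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
    // per-kernel override: a stencil kernel prefers the larger smem partition
    cudaFuncSetCacheConfig(stencil, cudaFuncCachePreferShared);
}
```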
  • 22. LMEM Local memory is a stack memory analogue: call stack, register spilling. Note: local memory reads and writes are cached in L1. Registers are for automatic variables The volatile keyword enforces spilling Registers do not support indexing: local memory is used for local arrays Register spilling leads to more instructions and memory traffic
    int a = 42;           // automatic variable: lives in a register
    int b[SIZE] = { 0, }; // indexed local array: lives in local memory
  • 23. SPILLING CONTROL 1. Use __launch_bounds__ to help the compiler select the maximum number of registers. 2. Compile with -maxrregcount to enforce compiler optimization for register usage, with register spilling if needed. 3. Otherwise, by default you may run fewer concurrent warps per SM.
    __global__ void __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
    kernel(...)
    {
        // ...
    }
  • 25. CONTROL FLOW: PROBLEMS Warp divergence: branching, early loop exit... Inspect SASS to find divergent pieces of code Workload is data dependent: code path depends on input (as in a classification task) Too much synchronization logic: intensive use of parallel data structures, lots of atomics, __syncthreads(), etc. Resident warps: occupy resources but do nothing Big blocks: tail effect
  • 26. CONTROL FLOW: SOLUTIONS Understand your problem. Select the best algorithm keeping the GPU architecture in mind. Maximize independent parallelism The compiler generates branch predication with -O3 when optimizing if/switch, but the number of instructions must be less than or equal to a given threshold: 7 if there are lots of divergent warps, 4 otherwise Adjust the thread block size Try work queues
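A minimal sketch of a branch short enough to be predicated (the kernel and its body are illustrative assumptions, not from the deck):

```cuda
// Illustrative: a short conditional body below the predication threshold
// compiles to predicated/select instructions, so both "paths" execute in a
// few cycles and the warp never actually diverges.
__global__ void clampNegatives(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float v = data[i];
        data[i] = (v < 0.0f) ? 0.0f : v;  // emitted as a select, not a branch
    }
}
```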
  • 27. KERNEL FUSION AND FISSION Fusion Replace chain of kernel calls with fused one Helps to save memory reads/writes. Intermediate results can be kept in registers Enables further ILP optimizations Kernels should have almost the same access pattern Fission Replace one kernel call with a chain Helps to localize ineffective memory access patterns Insert small kernels that repack data (e.g. integral image)
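A toy illustration of fusion (the two original kernels and their arithmetic are hypothetical):

```cuda
// Illustrative fusion: instead of launching scale() and then offset()
// (two gmem round-trips), the fused kernel keeps the intermediate value
// in a register between the two stages.
__global__ void scaleThenOffset(float* data, float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float v = data[i] * a;  // was kernel 1: its result stayed in gmem
        data[i] = v + b;        // was kernel 2: it had to read that result back
    }
}
```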
  • 28. TUNING BLOCK CONFIGURATION Finding the optimal launch configuration is crucial to achieving the best performance. Launch configuration affects occupancy low occupancy prevents full hardware utilization and lowers the possibility to hide latency high occupancy for kernels with large memory demands results in overflowing read or write queues Experiment to find the configuration (block and grid resolutions, amount of work per thread) that is optimal for your kernel.
  • 30. FINAL WORDS Basic CUDA Code Optimizations use compiler flags do not trick compiler use structure of arrays improve memory layout load by cache line process by row cache data in registers re-compute values instead of re-loading keep data on GPU
  • 31. FINAL WORDS Conventional parallelization optimizations use light-weight locking, ... atomics, ... and lock-free code. minimize locking, ... memory fences, ... and volatile accesses.
  • 32. FINAL WORDS Conventional architectural optimizations utilize shared memory, ... constant memory, ... streams, ... thread voting, ... and rsqrtf; detect compute capability and number of SMs; tune thread count, ... blocks per SM, ... launch bounds, and L1 cache/shared memory configuration
  • 33. THE END Created by cuda.geek / 2013–2015