CODE GPU WITH CUDA
IDENTIFYING PERFORMANCE LIMITERS
Created by Marina Kolpakova (Itseez) for cuda.geek
OUTLINE
How to identify performance limiters?
What and how to measure?
Why profile?
Profiling case study: transpose
Code paths analysis
OUT OF SCOPE
Visual profiler opportunities
HOW TO IDENTIFY PERFORMANCE LIMITERS
Time
Subsample when measuring performance
Determine your code's wall time; that is what you'll optimize
Profile
Collect metrics and events
Determine limiting factors (e.g. memory, divergence)
HOW TO IDENTIFY PERFORMANCE LIMITERS
Prototype
Prototype kernel parts separately and time them
Determine memory access or data dependency patterns
(Micro)benchmark
Determine hardware characteristics
Tune for particular architecture, GPU class
Look into SASS
Check compiler optimizations
Look for further improvements
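One way to look at the SASS is the toolkit's cuobjdump disassembler; a minimal invocation (the binary path matches the demo used later in this deck):

$ cuobjdump -sass ./bin/demo_bench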
TIMING: WHAT TO MEASURE?
Wall time: user will see this time
GPU time: specific kernel time
CPU ⇔ GPU memory transfer time:
not considered in GPU time analysis
but significantly impacts wall time
Data-dependent cases timing:
worst case time
time of single iteration
consider probability
HOW TO MEASURE?
SYSTEM TIMER (UNIX)
#include <time.h>
#include <stdint.h>

double runKernel(const dim3 grid, const dim3 block)
{
    struct timespec startTime, endTime;
    clock_gettime(CLOCK_MONOTONIC, &startTime);
    kernel<<<grid, block>>>();
    cudaDeviceSynchronize(); // kernel launches are asynchronous: wait for completion
    clock_gettime(CLOCK_MONOTONIC, &endTime);
    int64_t startNs = (int64_t)startTime.tv_sec * 1000000000 + startTime.tv_nsec;
    int64_t endNs   = (int64_t)endTime.tv_sec   * 1000000000 + endTime.tv_nsec;
    return (endNs - startNs) / 1000000.; // ns -> ms
}
Preferred for wall time measurement
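To capture what the user actually experiences, the timed region can be extended to include the surrounding CPU ⇔ GPU copies, since transfers count toward wall time (see the previous slide).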
HOW TO MEASURE?
TIMING WITH CUDA EVENTS
double runKernel(const dim3 grid, const dim3 block)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    kernel<<<grid, block>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop); // block until the stop event has been recorded
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms;
}
Preferred for GPU time measurement
Can be used with CUDA streams without synchronizing the whole device
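A minimal sketch of the same pattern in a user-created stream (reusing start, stop, and ms from the listing above; the kernel name is illustrative). The events are recorded into the stream, so other streams keep running and no device-wide synchronization is required:

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaEventRecord(start, stream);       // timestamped in this stream's order
kernel<<<grid, block, 0, stream>>>();
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);           // waits only for the stop event
cudaEventElapsedTime(&ms, start, stop);
cudaStreamDestroy(stream);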
WHY PROFILE?
The profiler will not do your work for you,
but it helps:
to verify memory access patterns
to identify bottlenecks
to collect statistics in data-dependent workloads
to check your hypotheses
to understand how hardware behaves
Think about profiling and benchmarking
as scientific experiments
DEVICE CODE PROFILER
Events are hardware counters, usually reported per SM
the SM id is chosen by the profiler under the assumption that all SMs do
approximately the same amount of work
Exceptions: L2 and DRAM counters
Metrics are computed from event counts and hardware-specific properties (e.g. the
number of SMs)
A single run can collect only a few counters
The profiler repeats kernel launches to collect all of them
Results may vary between repeated runs
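For example, a sketch of collecting a few of the SM_35 events listed later in this deck by name (the binary is the case-study benchmark):

$ nvprof --events inst_issued1,inst_issued2,global_store_transaction ./bin/demo_bench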
PROFILING FOR MEMORY
Memory metrics
metrics with load or store in the name count from the software perspective (in
terms of memory requests), e.g. local_store_transactions
metrics with read or write in the name count from the hardware perspective (in
terms of bytes transferred), e.g. l2_subp0_read_sector_misses
Counters are incremented
per warp
per cache line/transaction size
per request/instruction
PROFILING FOR MEMORY
Access pattern efficiency
check the ratio between bytes requested by the application code and bytes
actually moved by the hardware (L2/DRAM)
use the g{ld,st}_transactions_per_request metric (see the sketch below)
Throughput analysis
compare the application's hardware throughput to the peak possible for your GPU
(found in the documentation)
g{ld,st}_requested_throughput
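A minimal sketch of two access patterns that g{ld,st}_transactions_per_request distinguishes (illustrative kernels, not from the original deck): the coalesced copy should report close to one transaction per warp request for 4-byte words, while the strided copy splits each request into many transactions:

__global__ void copy_coalesced(const float* in, float* out)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    out[i] = in[i];   // consecutive threads touch consecutive words
}

__global__ void copy_strided(const float* in, float* out, int stride)
{
    int i = (threadIdx.x + blockDim.x * blockIdx.x) * stride;
    out[i] = in[i];   // threads are `stride` words apart: one request, many transactions
}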
INSTRUCTIONS/BYTES RATIO
Profiler counters:
instructions_issued, instructions_executed
incremented per warp; “issued” includes replays
global_store_transaction, uncached_global_load_transaction
a transaction can be 32, 64, or 128 bytes; additional analysis is required to
determine the average size
Compute ratio:
(warpSize * instructions_issued) vs. (global_store_transaction +
l1_global_load_miss) * avgTransactionSize
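A worked example with made-up counter values: suppose instructions_issued = 1.0e6, global_store_transaction = 2.0e5, l1_global_load_miss = 3.0e5, and the average transaction size is 64 bytes. Then issued thread-instructions ≈ 32 * 1.0e6 = 3.2e7 and bytes moved ≈ (2.0e5 + 3.0e5) * 64 = 3.2e7, i.e. roughly one instruction per byte. Comparing this against the GPU's peak instruction-to-byte ratio suggests whether the kernel leans compute- or memory-bound.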
LIST OF EVENTS FOR SM_35
domain        events
texture (a)   tex{0,1,2,3}_cache_sector_{queries,misses}
              rocache_subp{0,1,2,3}_gld_warp_count_{32,64,128}b
              rocache_subp{0,1,2,3}_gld_thread_count_{32,64,128}b
L2 (b)        fb_subp{0,1}_{read,write}_sectors
              l2_subp{0,1,2,3}_total_{read,write}_sector_queries
              l2_subp{0,1,2,3}_{read,write}_{l1,system}_sector_queries
              l2_subp{0,1,2,3}_{read,write}_sector_misses
              l2_subp{0,1,2,3}_read_tex_sector_queries
              l2_subp{0,1,2,3}_read_{l1,tex}_hit_sectors
LD/ST (c)     g{ld,st}_inst_{8,16,32,64,128}bit
              rocache_gld_inst_{8,16,32,64,128}bit
LIST OF EVENTS FOR SM_35
domain        events
sm (d)        prof_trigger_0{0-7}
              {shared,local}_{load,store}
              g{ld,st}_request
              {local,l1_shared,__l1_global}_{load,store}_transactions
              l1_local_{load,store}_{hit,miss}
              l1_global_load_{hit,miss}
              uncached_global_load_transaction
              global_store_transaction
              shared_{load,store}_replay
              global_{ld,st}_mem_divergence_replays
LIST OF EVENTS FOR SM_35
domain        events
sm (d)        {threads,warps,sm_cta}_launched
              inst_issued{1,2}
              [thread_,not_predicated_off_thread_]inst_executed
              {atom,gred}_count
              active_{cycles,warps}
LIST OF METRICS FOR SM_35
metric
g{ld,st}_requested_throughput
tex_cache_{hit_rate,throughput}
dram_{read,write}_throughput
nc_gld_requested_throughput
{local,shared}_{load,store}_throughput
{l2,system}_{read,write}_throughput
g{st,ld}_{throughput,efficiency}
l2_{l1,texture}_read_{hit_rate,throughput}
l1_cache_{global,local}_hit_rate
LIST OF METRICS FOR SM_35
metric
{local,shared}_{load,store}_transactions[_per_request]
gl{d,st}_transactions[_per_request]
{sysmem,dram,l2}_{read,write}_transactions
tex_cache_transactions
{inst,shared,global,global_cache,local}_replay_overhead
local_memory_overhead
shared_efficiency
achieved_occupancy
sm_efficiency[_instance]
ipc[_instance]
issued_ipc
inst_per_warp
LIST OF METRICS FOR SM_35
metric
flops_{sp,dp}[_add,mul,fma]
warp_execution_efficiency
warp_nonpred_execution_efficiency
flops_sp_special
stall_{inst_fetch,exec_dependency,data_request,texture,sync,other}
{l1_shared,l2,tex,dram,system}_utilization
{cf,ldst}_{issued,executed}
{ldst,alu,cf,tex}_fu_utilization
issue_slot_utilization
inst_{issued,executed}
issue_slots
ROI PROFILING
#include <cuda_profiler_api.h>

// algorithm setup code
cudaProfilerStart();
perf_test_cuda_accelerated_code();
cudaProfilerStop();
Profile only the part you are optimizing right now
shorter and simpler profiler log
adds no significant overhead to the rest of the run
Used with the nvprof option --profile-from-start off
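Combined with the ROI markers above, a run that skips profiling until cudaProfilerStart() looks like:

$ nvprof --profile-from-start off ./bin/demo_bench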
CASE STUDY: MATRIX TRANSPOSE
$ nvprof --devices 2 ./bin/demo_bench
CASE STUDY: MATRIX TRANSPOSE
$ nvprof --devices 2 \
    --metrics gld_transactions_per_request,gst_transactions_per_request \
    ./bin/demo_bench
CASE STUDY: MATRIX TRANSPOSE
$ nvprof --devices 2 --metrics shared_replay_overhead ./bin/demo_bench
CODE PATHS ANALYSIS
The main idea: determine performance limiters by measuring different parts
independently
Simple case: time memory-only and math-only versions of the kernel
Shows how well memory operations are overlapped with arithmetic: compare the sum
of mem-only and math-only times to full-kernel time
template <typename T>
__global__ void
benchmark_contiguous_direct_load(T* s, typename T::value_type* r, bool doStore)
{
    int global_index = threadIdx.x + blockDim.x * blockIdx.x;
    T data = s[global_index];
    asm("" ::: "memory");           // compiler barrier: keeps the load alive
    if (s && doStore)
        r[global_index] = sum(data);
}
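Launched with doStore = false, the kernel exercises only the load path; the empty asm statement with a "memory" clobber acts as a compiler barrier, so the unused load is not optimized away and the memory-only time stays comparable to the full version.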
DEVICE SIDE TIMING
The device timer is located on the ROP or SM depending on hardware revision
It's relatively easy to compute per-thread values but hard to analyze kernel
performance due to grid serialization
sometimes suitable for benchmarking
template <typename T, typename D, typename L> __global__
void latency_kernel(T** a, int len, int stride, int inner_its,
                    D* latency, L func)
{
    D start_time, end_time;
    volatile D sum_time = 0;        // volatile: keep the timed accumulation ordered
    for (int k = 0; k < inner_its; ++k)
    {
        T* j = ((T*)a) + threadIdx.y * len + threadIdx.x;
        start_time = clock64();
        for (int curr = 0; curr < len / stride; ++curr) j = func(j); // pointer chase
        end_time = clock64();
        sum_time += (end_time - start_time);
    }
    if (!threadIdx.x) atomicAdd(latency, sum_time);
}
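For the pointer chase to measure latency, each element must hold the address of the element stride slots ahead; a minimal host-side setup sketch (hypothetical, not from the deck), building the chain with device addresses before copying it over:

// Hypothetical helper: h_buf is a host staging array, d_buf the device array.
// Element i stores the *device* address of element (i + stride) % len, so each
// dereference in latency_kernel depends on the previous one.
void build_chain(void** h_buf, void** d_buf, int len, int stride)
{
    for (int i = 0; i < len; ++i)
        h_buf[i] = (void*)(d_buf + (i + stride) % len);
    cudaMemcpy(d_buf, h_buf, len * sizeof(void*), cudaMemcpyHostToDevice);
}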
FINAL WORDS
Time
Profile
(Micro)benchmark
Prototype
Look into SASS
THE END
LIST OF PRESENTATIONS
BY CUDA.GEEK / 2013–2015
