
Code GPU with CUDA - Identifying performance limiters


The presentation describes how to identify performance limiters.


  1. CODE GPU WITH CUDA: IDENTIFYING PERFORMANCE LIMITERS
     Created by Marina Kolpakova (cuda.geek) for Itseez
  2. OUTLINE
     - How to identify performance limiters?
     - What and how to measure?
     - Why profile?
     - Profiling case study: transpose
     - Code path analysis
  3. OUT OF SCOPE
     - Visual Profiler capabilities
  4. HOW TO IDENTIFY PERFORMANCE LIMITERS
     Time
     - Subsample when measuring performance
     - Determine your code's wall time: that is what you will optimize
     Profile
     - Collect metrics and events
     - Determine limiting factors (e.g. memory, divergence)
  5. HOW TO IDENTIFY PERFORMANCE LIMITERS
     Prototype
     - Prototype kernel parts separately and time them
     - Determine memory access and data dependency patterns
     (Micro)benchmark
     - Determine hardware characteristics
     - Tune for a particular architecture or GPU class
     Look into SASS
     - Check compiler optimizations
     - Look for further improvements
  6. TIMING: WHAT TO MEASURE?
     - Wall time: the time the user will see
     - GPU time: time of a specific kernel
     - CPU ⇔ GPU memory transfer time: not considered in GPU time analysis,
       but significantly impacts wall time
     - Timing of data-dependent cases: worst-case time, time of a single
       iteration; consider the probability of each case
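The data-dependent-case advice above can be sketched numerically: weight each case's single-iteration time by how often it occurs, and keep the worst case separately. The helper names and all numbers below are illustrative, not from the slides.

```c
#include <assert.h>

/* Expected time over data-dependent cases: sum of prob[i] * time_ms[i].
 * Hypothetical helper, assuming probabilities were measured beforehand. */
double expected_time_ms(const double *time_ms, const double *prob, int n)
{
    double expected = 0.0;
    for (int i = 0; i < n; ++i)
        expected += prob[i] * time_ms[i];
    return expected;
}

/* Worst-case time is simply the slowest case. */
double worst_case_ms(const double *time_ms, int n)
{
    double worst = time_ms[0];
    for (int i = 1; i < n; ++i)
        if (time_ms[i] > worst) worst = time_ms[i];
    return worst;
}
```

For example, with cases taking 1, 4, and 16 ms at probabilities 0.90, 0.09, and 0.01, the expected time is 1.42 ms while the worst case is 16 ms; both numbers matter for user-visible latency.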
  7. HOW TO MEASURE? SYSTEM TIMER (UNIX)

     #include <time.h>
     #include <stdint.h>

     double runKernel(const dim3 grid, const dim3 block)
     {
         struct timespec startTime, endTime;
         clock_gettime(CLOCK_MONOTONIC, &startTime);
         kernel<<<grid, block>>>();
         cudaDeviceSynchronize();
         clock_gettime(CLOCK_MONOTONIC, &endTime);
         int64_t startNs = (int64_t)startTime.tv_sec * 1000000000 + startTime.tv_nsec;
         int64_t endNs   = (int64_t)endTime.tv_sec   * 1000000000 + endTime.tv_nsec;
         return (endNs - startNs) / 1000000.;  // ns to ms
     }

     Preferred for wall time measurement.
  8. HOW TO MEASURE? TIMING WITH CUDA EVENTS

     double runKernel(const dim3 grid, const dim3 block)
     {
         cudaEvent_t start, stop;
         cudaEventCreate(&start);
         cudaEventCreate(&stop);
         cudaEventRecord(start, 0);
         kernel<<<grid, block>>>();
         cudaEventRecord(stop, 0);
         cudaEventSynchronize(stop);
         float ms;
         cudaEventElapsedTime(&ms, start, stop);
         cudaEventDestroy(start);
         cudaEventDestroy(stop);
         return ms;
     }

     Preferred for GPU time measurement. Can be used with CUDA streams
     without host-side synchronization.
  9. WHY PROFILE?
     The profiler will not do your work for you, but it helps:
     - to verify memory access patterns
     - to identify bottlenecks
     - to collect statistics on data-dependent workloads
     - to check your hypotheses
     - to understand how the hardware behaves
     Think of profiling and benchmarking as scientific experiments.
  10. DEVICE CODE PROFILER
      - Events are hardware counters, usually reported per SM. The SM id is
        selected by the profiler under the assumption that all SMs do
        approximately the same amount of work. Exceptions: L2 and DRAM
        counters.
      - Metrics are computed from event counts and hardware-specific
        properties (e.g. the number of SMs).
      - A single run can collect only a few counters, so the profiler repeats
        kernel launches to collect all of them; results may vary between
        repeated runs.
  11. PROFILING FOR MEMORY
      Memory metrics:
      - Metrics with load or store in the name count from the software
        perspective (in terms of memory requests), e.g.
        local_store_transactions
      - Metrics with read or write in the name count from the hardware
        perspective (in terms of bytes transferred), e.g.
        l2_subp0_read_sector_misses
      Counters are incremented per warp, per cache line/transaction size, per
      request/instruction.
  12. PROFILING FOR MEMORY
      Access pattern efficiency
      - Check the ratio between bytes requested by the application code and
        bytes actually moved by the hardware (L2/DRAM)
      - Use the g{ld,st}_transactions_per_request metric
      Throughput analysis
      - Compare the application's hardware throughput to what is possible for
        your GPU (can be found in the documentation)
      - Use g{ld,st}_requested_throughput
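The transactions-per-request idea can be sketched with a toy model. Assuming 32-thread warps, 128-byte transactions, and aligned accesses, the number of transactions one warp request needs is roughly the number of distinct 128-byte lines the warp touches. The function and the model are illustrative only, not a real profiler API:

```c
#include <assert.h>

/* Toy model of gld_transactions_per_request: distinct 128-byte lines
 * touched by one 32-thread warp, assuming aligned, uniform-stride access. */
int transactions_per_request(int bytes_per_thread, int stride_bytes)
{
    const int warp_size = 32;
    const int line = 128;
    /* Span of addresses touched by the warp's 32 lanes. */
    int span = (warp_size - 1) * stride_bytes + bytes_per_thread;
    return (span + line - 1) / line;  /* round up to whole 128B lines */
}
```

Fully coalesced 4-byte loads (stride 4) give 1 transaction per request; a 128-byte stride gives 32, i.e. the hardware moves 32x more bytes than the threads asked for.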
  13. INSTRUCTIONS/BYTES RATIO
      Profiler counters:
      - instructions_issued, instructions_executed: incremented per warp, but
        "issued" includes replays
      - global_store_transaction, uncached_global_load_transaction: a
        transaction can be 32, 64, or 128 bytes, so additional analysis is
        required to determine the average size
      Compute the ratio:
      (warpSize x instructions_issued) vs.
      (global_store_transaction + l1_global_load_miss) x avgTransactionSize
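The ratio described above can be written out directly. The counter values fed in and the average transaction size are made-up examples; only the formula follows the slide:

```c
#include <assert.h>

/* Instructions-per-byte ratio from profiler counters, per the slide:
 * (warpSize * instructions_issued) /
 * ((global_store_transaction + l1_global_load_miss) * avgTransactionSize) */
double inst_byte_ratio(long long instructions_issued,
                       long long store_transactions,
                       long long load_miss_transactions,
                       int avg_transaction_bytes)
{
    const int warp_size = 32;
    long long insts = (long long)warp_size * instructions_issued;
    long long bytes = (store_transactions + load_miss_transactions)
                      * (long long)avg_transaction_bytes;
    return (double)insts / (double)bytes;
}
```

Comparing this ratio against the GPU's peak instruction throughput divided by peak memory bandwidth suggests whether the kernel is compute- or memory-limited.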
  14. LIST OF EVENTS FOR SM_35
      texture (a):
      - tex{0,1,2,3}_cache_sector_{queries,misses}
      - rocache_subp{0,1,2,3}_gld_warp_count_{32,64,128}b
      - rocache_subp{0,1,2,3}_gld_thread_count_{32,64,128}b
      L2 (b):
      - fb_subp{0,1}_{read,write}_sectors
      - l2_subp{0,1,2,3}_total_{read,write}_sector_queries
      - l2_subp{0,1,2,3}_{read,write}_{l1,system}_sector_queries
      - l2_subp{0,1,2,3}_{read,write}_sector_misses
      - l2_subp{0,1,2,3}_read_tex_sector_queries
      - l2_subp{0,1,2,3}_read_{l1,tex}_hit_sectors
      LD/ST (c):
      - g{ld,st}_inst_{8,16,32,64,128}bit
      - rocache_gld_inst_{8,16,32,64,128}bit
  15. LIST OF EVENTS FOR SM_35
      sm (d):
      - prof_trigger_0{0-7}
      - {shared,local}_{load,store}
      - g{ld,st}_request
      - {local,l1_shared,l1_global}_{load,store}_transactions
      - l1_local_{load,store}_{hit,miss}
      - l1_global_load_{hit,miss}
      - uncached_global_load_transaction
      - global_store_transaction
      - shared_{load,store}_replay
      - global_{ld,st}_mem_divergence_replays
  16. LIST OF EVENTS FOR SM_35
      sm (d), continued:
      - {threads,warps,sm_cta}_launched
      - inst_issued{1,2}
      - [thread_,not_predicated_off_thread_]inst_executed
      - {atom,gred}_count
      - active_{cycles,warps}
  17. LIST OF METRICS FOR SM_35
      - g{ld,st}_requested_throughput
      - tex_cache_{hit_rate,throughput}
      - dram_{read,write}_throughput
      - nc_gld_requested_throughput
      - {local,shared}_{load,store}_throughput
      - {l2,system}_{read,write}_throughput
      - g{st,ld}_{throughput,efficiency}
      - l2_{l1,texture}_read_{hit_rate,throughput}
      - l1_cache_{global,local}_hit_rate
  18. LIST OF METRICS FOR SM_35
      - {local,shared}_{load,store}_transactions[_per_request]
      - gl{d,st}_transactions[_per_request]
      - {sysmem,dram,l2}_{read,write}_transactions
      - tex_cache_transactions
      - {inst,shared,global,global_cache,local}_replay_overhead
      - local_memory_overhead
      - shared_efficiency
      - achieved_occupancy
      - sm_efficiency[_instance]
      - ipc[_instance]
      - issued_ipc
      - inst_per_warp
  19. LIST OF METRICS FOR SM_35
      - flops_{sp,dp}[_add,mul,fma]
      - warp_execution_efficiency
      - warp_nonpred_execution_efficiency
      - flops_sp_special
      - stall_{inst_fetch,exec_dependency,data_request,texture,sync,other}
      - {l1_shared,l2,tex,dram,system}_utilization
      - {cf,ldst}_{issued,executed}
      - {ldst,alu,cf,tex}_fu_utilization
      - issue_slot_utilization
      - inst_{issued,executed}
      - issue_slots
  20. ROI PROFILING

      #include <cuda_profiler_api.h>

      // algorithm setup code
      cudaProfilerStart();
      perf_test_cuda_accelerated_code();
      cudaProfilerStop();

      - Profile only the part you are optimizing right now: the profiler log
        becomes shorter and simpler
      - Does not add significant overhead to your code's runtime
      - Use together with the nvprof option --profile-from-start off
  21. CASE STUDY: MATRIX TRANSPOSE

      $ nvprof --devices 2 ./bin/demo_bench
  22. CASE STUDY: MATRIX TRANSPOSE

      $ nvprof --devices 2 --metrics gld_transactions_per_request,gst_transactions_per_request ./bin/demo_bench
  23. CASE STUDY: MATRIX TRANSPOSE

      $ nvprof --devices 2 --metrics shared_replay_overhead ./bin/demo_bench
  24. CODE PATH ANALYSIS
      The main idea: determine performance limiters by measuring different
      parts independently.
      Simple case: time memory-only and math-only versions of the kernel.
      This shows how well memory operations overlap with arithmetic: compare
      the sum of the mem-only and math-only times to the full-kernel time.

      template <typename T>
      __global__ void benchmark_contiguous_direct_load(T* s,
          typename T::value_type* r, bool doStore)
      {
          int global_index = threadIdx.x + blockDim.x * blockIdx.x;
          T data = s[global_index];
          asm("" ::: "memory");
          if (s && doStore) r[global_index] = sum(data);
      }
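The comparison described above can be condensed into a single overlap figure. If the full kernel takes about as long as the slower of the two stripped-down versions, memory and math overlap well; if it takes about their sum, they do not overlap at all. This normalization is my own illustrative sketch, not from the slides:

```c
#include <assert.h>

/* Overlap of memory and math work, estimated from three timings:
 * t_mem  - memory-only kernel time
 * t_math - math-only kernel time
 * t_full - full kernel time
 * Returns 1.0 for perfect overlap (t_full == max) and 0.0 for none
 * (t_full == t_mem + t_math). */
double overlap_fraction(double t_mem, double t_math, double t_full)
{
    double t_max = t_mem > t_math ? t_mem : t_math;
    double t_sum = t_mem + t_math;
    if (t_sum == t_max)
        return 1.0;  /* one part took no measurable time */
    return (t_sum - t_full) / (t_sum - t_max);
}
```

For example, mem-only 8 ms and math-only 4 ms with a full-kernel time of 12 ms means no overlap, while a full-kernel time of 8 ms means the arithmetic is entirely hidden behind the memory traffic.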
  25. DEVICE SIDE TIMING
      - The device timer is located on the ROP or the SM, depending on the
        hardware revision
      - It is relatively easy to compute per-thread values, but hard to
        analyze kernel performance due to grid serialization
      - Sometimes suitable for benchmarking

      template <typename T, typename D, typename L>
      __global__ void latency_kernel(T** a, int len, int stride,
          int inner_its, D* latency, L func)
      {
          D start_time, end_time;
          volatile D sum_time = 0;
          for (int k = 0; k < inner_its; ++k)
          {
              T* j = ((T*)a) + threadIdx.y * len + threadIdx.x;
              start_time = clock64();
              for (int curr = 0; curr < len / stride; ++curr)
                  j = func(j);
              end_time = clock64();
              sum_time += (end_time - start_time);
          }
          if (!threadIdx.x) atomicAdd(latency, sum_time);
      }
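clock64() counts SM cycles, so the accumulated counts have to be converted to time on the host. A minimal sketch, assuming the SM clock is given in kHz as cudaDeviceProp::clockRate reports it; the helper name is my own:

```c
#include <assert.h>

/* Convert device cycle counts (from clock64()) to nanoseconds.
 * clock_khz cycles elapse per millisecond, i.e. per 1e6 ns,
 * so one cycle lasts 1e6 / clock_khz nanoseconds. */
double cycles_to_ns(long long cycles, int clock_khz)
{
    return (double)cycles * 1.0e6 / (double)clock_khz;
}
```

At a 1 GHz SM clock (1,000,000 kHz), 1000 cycles correspond to 1000 ns, which is the scale at which pointer-chasing latency measurements like the kernel above are usually reported.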
  26. FINAL WORDS
      - Time
      - Profile
      - (Micro)benchmark
      - Prototype
      - Look into SASS
  27. THE END
      List of presentations by cuda.geek, 2013–2015
