Speaker notes:
  • Up to here, seen and explained how a GPU Works. Now we’ll start talking about the paper experimentation. The authors wanted to evaluate the workload on GPUs to get the relevant parameters that affect the performance and to be able to tune the executions. This were some of the characteristics they took in account
  • Methods used to analyze the characteristics mentioned above.

1. Exploring GPGPU Workloads
   Unai Lopez Novoa, Intelligent Systems Group, University of the Basque Country
2. Outline
   - Introduction to GPGPU
   - GPU Architecture
   - Workload Analysis
   - Conclusions
3. Introduction to HPC
   - HPC: High Performance Computing
   - Goals:
     - Solve computational problems with accuracy, speed, and efficiency
     - Exploit computing resources
   - Clusters, supercomputers
     - Collection of traditional "cores" + GPUs
4. Motivation
   - GPU: Graphics Processing Unit (e.g. ATI 4890)
   - Specialized for massively data-parallel computation
   - GPGPU = General-Purpose computing on GPUs
   - Price: ~100 - 400 € (low to medium end)
   - Speed-up: 2x - 50x!
5. Introduction
   - SISD: single-core CPU (1 execution thread)
   - MIMD: multi-core CPU, computing cluster (shared vs. distributed memory)
   - SIMD:
     - Vector instructions
     - GPU (using CUDA / OpenCL → GPGPU)
   [Figure: Flynn's taxonomy quadrants - SISD, SIMD, MISD, MIMD]
6. GPU Architecture
   - 1 GPU → several multiprocessors (SIMD engines)
   - 1 multiprocessor → several execution units (stream processing units)
     - Each one capable of running 1 thread (threads arranged in workgroups)
     - ATI 4890 → 800 stream processing units
   - Memory hierarchy (see the kernel sketch below):
     - Registers (per thread)
     - Local memory (per workgroup)
     - Global memory (for the whole GPU)
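These three levels map directly onto OpenCL C address-space qualifiers. A minimal sketch, assuming an OpenCL-capable device; the kernel name and buffer layout are illustrative and do not come from the slides:

    // Illustrative only: the three memory levels as OpenCL C address spaces.
    __kernel void memory_levels(__global const float* in,   // global: whole GPU
                                __global float* out,
                                __local float* scratch)     // local: per workgroup
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        float x = in[gid];            // private (register): per thread
        scratch[lid] = x * 2.0f;      // visible to the whole workgroup
        barrier(CLK_LOCAL_MEM_FENCE); // make the local write visible to all workitems
        out[gid] = scratch[lid];
    }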
7. GPU Architecture
   - Radeon HD 5870
   [Figure: OpenCL memory model of the compute device - each compute unit holds local memory and per-workitem private memory; all compute units share the global/constant memory data cache and the compute device (global) memory]
8. Programming GPGPU
   - Kernel: code to run on each thread
   - Extension of ANSI C (OpenCL C)

       __kernel void helloWorldKernel(__global int* a, …) { /* code */ }
9. Programming GPGPU
   Doubling every element of int a[8]:

   SISD:
       for (i = 0; i < N; i++)
           a[i] = a[i] * 2;

   MIMD (with OpenMP):
       #pragma omp for
       for (i = 0; i < N; i++)
           a[i] = a[i] * 2;

   SIMT:
       __kernel void helloWorld(__global int* a) {
           int i = get_global_id(0);
           a[i] = a[i] * 2;
       }

   [Figure: timeline of the 8 array elements - processed one after another (SISD), in parallel chunks across cores (MIMD), or one element per GPU thread at once (SIMT)]
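The SIMT kernel above only covers the device side. As a rough idea of what the host program has to do, here is a minimal sketch using the OpenCL host API: pick a device, build the kernel from source, copy the array, and launch one workitem per element. Error handling is omitted, and none of this host code appears in the original slides.

    /* Minimal host-side sketch (assumed, not from the slides). */
    #include <CL/cl.h>

    const char *src =
        "__kernel void helloWorld(__global int* a) {"
        "  int i = get_global_id(0);"
        "  a[i] = a[i] * 2;"
        "}";

    int main(void) {
        int a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        size_t n = 8;

        cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
        cl_device_id dev;     clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* build the kernel from source at run time */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "helloWorld", NULL);

        /* copy the array to the device and pass it as the kernel argument */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(a), a, NULL);
        clSetKernelArg(k, 0, sizeof(buf), &buf);

        /* one workitem per array element, then read the result back */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(a), a, 0, NULL, NULL);
        return 0;
    }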
10. Workload factors that affect efficiency
    - Dynamic instruction count
    - Floating-point instruction count
    - Memory instruction count
    - Branch instruction count
    - Atomic instruction count
    - …
    - Number of divergent warps
    - Thread count per kernel
    - Registers used per thread
    - Bytes transferred device-to-host / host-to-device
    - …
11. GPGPU-Sim
    - Software that models a modern GPU
    - Runs GPU applications to profile their characteristics
    - Models the timing of:
      - Multiprocessor cores
      - Caches
      - Interconnection network
      - Memory controllers & DRAM
    - http://groups.google.com/group/gpgpu-sim/
12. GPGPU-Sim
    - Different configurations:
13. Benchmarks
14. Statistical Methods
    - PCA: Principal Component Analysis
      - Goal: remove correlations among the input dimensions → reduce the dimensionality of the data
      - Forms a matrix with input dimensions (columns) and observations (rows)
      - Aims to lose as little information as possible
    - Example:
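For reference, the standard PCA formulation behind this slide (textbook definition, not copied from the paper): standardize the observation matrix $X$ (workloads as rows, characteristics as columns), take the eigenvectors of its covariance matrix, and project onto the leading ones, $V_k = [v_1, \dots, v_k]$.

\[
  \Sigma = \tfrac{1}{n-1}\, X^{\top} X, \qquad
  \Sigma\, v_i = \lambda_i\, v_i, \qquad
  \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p
\]
\[
  Z = X\, V_k \quad \text{(scores on the first } k \text{ principal components)}
\]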
15. Statistical Methods
    - Hierarchical Cluster Analysis
      - Classification technique based on similarity within the dataset
      - No need to know the number of clusters a priori
      - Produces a tree-like structure (dendrogram)
    - Example:
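The slide does not say which distance or linkage the authors used, so purely as one common choice for illustration: Euclidean distance between workloads in principal-component space, with single-linkage merging of clusters.

\[
  d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}, \qquad
  D(A, B) = \min_{x \in A,\; y \in B} d(x, y)
\]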
16. Experimental methodology
    - Run GPGPU-Sim with several workloads and 6 different configurations
    - Generate a 93 (dimensions) × 38 (workloads) matrix from the GPGPU-Sim output
    - Perform PCA in Statistica to obtain the principal components (16 PCs for 90% of the variance, 7 PCs for 70%)
    - Perform Hierarchical Cluster Analysis
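The PC counts follow the usual retained-variance criterion: keep the smallest $k$ whose cumulative explained variance reaches the chosen threshold. This is the standard rule, stated here only because it is consistent with the 90% / 70% figures above; the paper may phrase it differently.

\[
  k(\tau) = \min \left\{ k \; : \; \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \ge \tau \right\},
  \qquad k(0.9) = 16, \quad k(0.7) = 7
\]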
17. Results
    - Workload classification
      - Trade-off between available simulation time and accuracy
    - Microarchitecture impact
      - Workloads are independent of the underlying architecture (number of cores, register file size, interconnection network, …)
    - Dendrogram of workloads (16 principal components, 90% of variance)
    - Some benchmarks:
      - SS = Similarity Score (51K instructions)
      - SLA = Scan of Large Arrays (1310K instructions)
      - PR = Parallel Reduction (830K instructions)
      - BN = Binomial Options (131K instructions)
18. Results
    - GPGPU workload fundamental analysis
      - Branch divergence, high barrier and instruction counts, and heavyweight threads are the characteristics that affect performance the most (divergence sketched below)
    - Characteristics-based classification
      - Diverse behavior in terms of instructions executed per thread, total thread count, and total instruction count
    - Discussion and limitations
      - NVIDIA and ATI GPUs have similar architectures and behavior
      - Intel Larrabee differs in architecture, behavior, and programming model
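As a quick illustration of why branch divergence is costly on a SIMD engine (these kernels are hypothetical, not benchmarks from the paper): workitems in the same wavefront that take different branch paths are serialized, so both paths execute while half the lanes idle; folding the branch into arithmetic avoids this.

    // Neighboring workitems branch differently, so the SIMD engine runs both paths.
    __kernel void divergent(__global float* a) {
        int i = get_global_id(0);
        if (i % 2 == 0)
            a[i] = a[i] * 2.0f;    // path 1: odd lanes idle
        else
            a[i] = a[i] + 1.0f;    // path 2: even lanes idle
    }

    // Same result with no divergent branch: both paths folded into arithmetic.
    __kernel void uniform(__global float* a) {
        int i = get_global_id(0);
        float even = (float)(1 - (i & 1));   // 1.0 for even i, 0.0 for odd
        a[i] = even * (a[i] * 2.0f) + (1.0f - even) * (a[i] + 1.0f);
    }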
19. Conclusions
    - GPUs provide large computing power at low cost
    - But not every problem can be treated this way
    - It is important to take workload characteristics into account to exploit GPU capabilities
    - The benchmarks show the relevance of branch divergence, memory coalescing, and kernel size (coalescing sketched below)
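A minimal sketch of the memory-coalescing point (kernel names and the stride parameter are hypothetical): consecutive workitems reading consecutive global-memory addresses can be served by a few wide transactions, while strided access spreads a wavefront's reads over many transactions.

    // Coalesced: workitem i touches element i, so a wavefront reads one contiguous block.
    __kernel void coalesced_copy(__global const float* in, __global float* out) {
        int i = get_global_id(0);
        out[i] = in[i];
    }

    // Strided: neighboring workitems read far-apart elements, so the same copy
    // needs many more memory transactions.
    __kernel void strided_copy(__global const float* in, __global float* out,
                               const int stride) {
        int i = get_global_id(0);
        out[i] = in[i * stride];
    }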
20. Future of GPGPU
    - GPUs are used more and more
      - e.g. Top500 systems, new machines by Cray
    - GPU architectures keep evolving (e.g. NVIDIA Fermi)
      - Speed, memory, floating-point support… more computing than graphics power!
    - New tools
      - Example: PGI compilers that auto-parallelize code for GPUs
    - Hybrid computing: CPU + GPU simultaneously
      - Don't leave the CPU idle while the GPU is busy
21. Exploring GPGPU Workloads
    Unai Lopez Novoa, Intelligent Systems Group, University of the Basque Country