Exploring GPGPU Workloads
Unai Lopez Novoa
Intelligent Systems Group, University of Basque Country
Outline
- Introduction to GPGPU
- GPU Architecture
- Workload Analysis
- Conclusions
Introduction to HPC
- HPC: High Performance Computing
- Goals: solve computational problems with accuracy, speed, and efficiency; exploit the available computing resources
- Clusters, supercomputers: collections of traditional "cores" + GPUs
Motivation
- GPU: Graphics Processing Unit (e.g. the ATI 4890)
- Specialized for massively data-parallel computation
- GPGPU = General-Purpose computing on GPUs
- Price: ~100 - 400 € (low to medium end)
- Speed-up: 2x - 50x!
Introduction
- SISD: single-core CPU (1 execution thread)
- MIMD: multi-core CPU, computing cluster (shared vs. distributed memory)
- SIMD: vector instructions; GPU (using CUDA / OpenCL → GPGPU)
(Diagram: Flynn's taxonomy quadrants - SISD, SIMD, MISD, MIMD)
GPU Architecture
- 1 GPU → several multiprocessors (SIMD engines)
- 1 multiprocessor → several execution units ("stream processing units"), each capable of running 1 thread; threads are arranged in workgroups
- ATI 4890 → 800 "stream processing units"
- Memory hierarchy: registers (per thread), local memory (per workgroup), global memory (for the whole GPU) - see the kernel sketch below
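To make the memory hierarchy concrete, here is a minimal OpenCL kernel sketch (mine, not from the original slides) that touches all three levels. The kernel name and the workgroup-reduction pattern are illustrative, and the workgroup size is assumed to be a power of two.

/* Hedged sketch: one value per thread lives in a register (private),
 * a scratch array is shared per workgroup (local), and the input and
 * output arrays live in memory visible to the whole GPU (global). */
__kernel void partial_sum(__global const int* in,
                          __global int* out,
                          __local  int* scratch)
{
    int gid = get_global_id(0);   /* index among all threads */
    int lid = get_local_id(0);    /* index within this workgroup */
    int lsz = get_local_size(0);  /* workgroup size (assumed power of two) */

    int x = in[gid];              /* 'x' is private: a register per thread */
    scratch[lid] = x;             /* stage the value into local memory */
    barrier(CLK_LOCAL_MEM_FENCE); /* synchronize the workgroup */

    /* Tree reduction inside the workgroup, entirely in local memory */
    for (int stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0]; /* one partial sum per workgroup */
}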
GPU Architecture
(Diagram: OpenCL memory model of a compute device such as the Radeon HD 5870 - each work-item has private memory, each compute unit has local memory, and all compute units share the global/constant memory of the compute device through a data cache)
Programming GPGPU
- Kernel: code to run on each thread
- Written in an extension of ANSI C:

__kernel void helloWorldKernel (__global int* a, …) { /* code */ }
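The slides only show the device side. As context, here is a minimal sketch of the OpenCL 1.x host code that would build and launch a single-argument variant of the slide's helloWorldKernel on 8 integers; the original presentation does not include host code, and all error checking is omitted here for brevity (a real program must check every cl* return code).

#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void helloWorldKernel(__global int* a) {"
    "    size_t i = get_global_id(0);"
    "    a[i] = a[i] * 2;"
    "}";

int main(void)
{
    int a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    size_t n = 8;

    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Copy the input array into device (global) memory */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(a), a, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "helloWorldKernel", NULL);
    clSetKernelArg(k, 0, sizeof(buf), &buf);

    /* One work-item per array element */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(a), a, 0, NULL, NULL);

    for (int i = 0; i < 8; i++) printf("%d ", a[i]); /* 0 2 4 6 8 10 12 14 */
    return 0;
}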
Programming GPGPU
Doubling every element of int a[8]:

SISD:
for (i = 0; i < N; i++)
    a[i] = a[i] * 2;

MIMD (with OpenMP):
#pragma omp for
for (i = 0; i < N; i++)
    a[i] = a[i] * 2;

SIMT:
__kernel void helloWorld (__global int* a) {
    i = get_thread_id();
    a[i] = a[i] * 2;
}

(Diagram: timeline of elements 0-7 - processed one after another in SISD, in parallel chunks in MIMD, and all at once in SIMT)
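A side note (mine, not the slides'): get_thread_id() above is pseudocode; in standard OpenCL C the corresponding built-in is get_global_id(). A compilable version of the same kernel would look like:

__kernel void helloWorld (__global int* a)
{
    size_t i = get_global_id(0);  /* unique global index of this work-item */
    a[i] = a[i] * 2;              /* each work-item doubles one element */
}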
Workload factors that affect efficiency
- Dynamic instruction count
- Floating-point instruction count
- Memory instruction count
- Branch instruction count
- Atomic instruction count
- …
- Number of divergent warps (illustrated below)
- Thread count per kernel
- Registers used per thread
- Bytes transferred device-to-host / host-to-device
- …
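To illustrate one of these factors, here is a small hypothetical kernel (not from the slides) in which neighboring threads take different branches. On a SIMD engine, both paths of such a branch are executed serially with part of the warp masked off, which is what the divergent-warp count captures.

/* Hedged illustration: even and odd work-items in the same warp
 * follow different paths, so the hardware runs both branches
 * one after the other. */
__kernel void divergent(__global int* a)
{
    size_t i = get_global_id(0);
    if (i % 2 == 0)
        a[i] = a[i] * 2;  /* path taken by the even work-items */
    else
        a[i] = a[i] + 1;  /* path taken by the odd work-items */
}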
GPGPU-Sim
- Software that models a modern GPU
- Runs GPU applications to profile their characteristics
- Models the timing of: multiprocessor cores, caches, the interconnection network, and the memory controllers & DRAM
- http://groups.google.com/group/gpgpu-sim/
GPGPU-Sim
- Different configurations:
(Table: the six simulator configurations compared - contents not preserved in this transcript)
Benchmarks
(Table: the 38 GPGPU workloads used as benchmarks - contents not preserved in this transcript)
Statistical Methods
- PCA: Principal Component Analysis
- Goal: remove correlations among the input dimensions → reduce the dimensionality of the data
- Forms a matrix with the input dimensions as columns and the observations as rows
- Tries to lose as little information as possible
- Example: (figure not preserved; a generic sketch of the math follows)
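As a generic sketch of the technique (standard PCA, not a detail taken from the paper): for a centered data matrix $X$ with $n$ observations (workloads) as rows and $p$ dimensions (characteristics) as columns,

C = \frac{1}{n-1} X^{\top} X                 % sample covariance of the dimensions
C = W \Lambda W^{\top}                       % eigendecomposition; columns of W are the principal components
Y = X W_k                                    % projection onto the top-k eigenvectors
\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i}  % fraction of the variance retained by k components

This is how a statement like "16 PCs for 90% variance" on the methodology slide reads: the 16 largest eigenvalues account for 90% of the total variance.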
Statistical Methods
- HCA: Hierarchical Clustering Analysis
- Classification technique based on similarity within the dataset
- No need to know the number of clusters a priori
- Generates a tree-like structure (a dendrogram)
- Example: (figure not preserved; a generic sketch follows)
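Again as a generic sketch (the slides do not say which linkage rule the paper uses): agglomerative HCA starts with every observation as its own cluster and repeatedly merges the two closest clusters, e.g. under single linkage

d(A, B) = \min_{a \in A,\; b \in B} \lVert a - b \rVert

The sequence of merges is what the dendrogram records; cutting the tree at a chosen height yields the final clusters.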
Experimental Methodology
- Load GPGPU-Sim with several workloads and 6 different configurations
- Generate a 93 (dimensions) x 38 (workloads) matrix from the GPGPU-Sim output
- Perform PCA using Statistica to obtain the principal components (16 PCs for 90% of the variance, 7 PCs for 70%)
- Perform Hierarchical Cluster Analysis
Results
- Workload classification: trade-off between the available simulation time and the desired accuracy
- Microarchitecture impact: the workload characteristics are largely independent of the underlying architecture (number of cores, register file size, interconnection network, …)
(Figure: dendrogram of the workloads, 16 principal components, 90% variance)
- Some benchmarks: SS = Similarity Score (51K inst.), SLA = Scan of Large Arrays (1310K inst.), PR = Parallel Reduction (830K inst.), BN = Binomial Options (131K inst.)
Results
- GPGPU workload fundamental analysis: branch divergence, high barrier and instruction counts, and heavyweight threads are the characteristics that affect performance the most
- Characteristics-based classification: the workloads show diverse behavior in terms of instructions executed per thread, total thread count, and total instruction count
- Discussion and limitations: nVidia and ATI GPUs have similar architectures and behavior; Intel's Larrabee differs in architecture, behavior, and programming model
Conclusions
- GPUs provide big computing power at low cost, but not every problem can be treated this way
- It is important to take workload characteristics into account in order to exploit GPU capabilities
- The benchmarks show the relevance of branch divergence, memory coalescing, and kernel size
Future of GPGPU
- GPUs are used more and more, e.g. in the Top500 list and in new machines by Cray
- Evolving GPU architectures (e.g. nVidia Fermi): speed, memory, floating-point support… more computing than graphics power!
- New tools, e.g. PGI compilers that auto-parallelize code for GPUs
- Hybrid computing: CPU + GPU simultaneously - don't leave the CPU idle while the GPU is busy
Exploring GPGPU Workloads
Unai Lopez Novoa
Intelligent Systems Group, University of Basque Country

Editor's Notes

  • #11 Up to here, we have seen and explained how a GPU works. Now we start talking about the paper's experimentation. The authors wanted to evaluate workloads on GPUs in order to find the relevant parameters that affect performance and to be able to tune executions. These were some of the characteristics they took into account.
  • #15 Methods used to analyze the aforementioned characteristics.