Exploring GPGPU Workloads
Unai Lopez Novoa
Intelligent Systems Group, University of Basque Country
Outline
- Introduction to GPGPU
- GPU Architecture
- Workload Analysis
- Conclusions
Introduction to HPC
- HPC: High Performance Computing
- Goals: solve computational problems with accuracy, speed, and efficiency; exploit the available computing resources
- Clusters, supercomputers: collections of traditional "cores" + GPUs
Motivation
- GPU: Graphics Processing Unit (e.g. the ATI 4890)
- Specialized for massively data-parallel computation
- GPGPU = General-Purpose computing on GPUs
- Price: ~100 - 400 € (low to medium end)
- Speed-up: 2x - 50x!
Introduction
- SISD: single-core CPU (1 execution thread)
- MIMD: multi-core CPU, computing cluster (shared vs. distributed memory)
- SIMD: vector instructions; GPU (using CUDA / OpenCL → GPGPU)
(Diagram: Flynn's taxonomy quadrants - SISD, SIMD, MISD, MIMD)
GPU Architecture
- 1 GPU → several multiprocessors (SIMD engines)
- 1 multiprocessor → several execution units ("stream processing units"), each capable of running 1 thread; threads are arranged in workgroups
- ATI 4890 → 800 "stream processing units"
- Memory hierarchy: registers (per thread), local memory (per workgroup), global memory (for the whole GPU) - see the kernel sketch below
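To make the memory hierarchy concrete, here is a minimal OpenCL kernel sketch (mine, not from the original slides) that touches all three levels. The kernel name and the workgroup-reduction pattern are illustrative, and the workgroup size is assumed to be a power of two.

/* Hedged sketch: one value per thread lives in a register (private),
 * a scratch array is shared per workgroup (local), and the input and
 * output arrays live in memory visible to the whole GPU (global). */
__kernel void partial_sum(__global const int* in,
                          __global int* out,
                          __local  int* scratch)
{
    int gid = get_global_id(0);   /* index among all threads */
    int lid = get_local_id(0);    /* index within this workgroup */
    int lsz = get_local_size(0);  /* workgroup size (assumed power of two) */

    int x = in[gid];              /* 'x' is private: a register per thread */
    scratch[lid] = x;             /* stage the value into local memory */
    barrier(CLK_LOCAL_MEM_FENCE); /* synchronize the workgroup */

    /* Tree reduction inside the workgroup, entirely in local memory */
    for (int stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0]; /* one partial sum per workgroup */
}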
GPU Architecture
(Diagram: OpenCL memory model of a compute device such as the Radeon HD 5870 - each work-item has private memory, each compute unit has local memory, and all compute units share the global/constant memory of the compute device through a data cache)
Programming GPGPU
- Kernel: code to run on each thread
- Written in an extension of ANSI C:

__kernel void helloWorldKernel (__global int* a, …) { /* code */ }
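The slides only show the device side. As context, here is a minimal sketch of the OpenCL 1.x host code that would build and launch a single-argument variant of the slide's helloWorldKernel on 8 integers; the original presentation does not include host code, and all error checking is omitted here for brevity (a real program must check every cl* return code).

#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void helloWorldKernel(__global int* a) {"
    "    size_t i = get_global_id(0);"
    "    a[i] = a[i] * 2;"
    "}";

int main(void)
{
    int a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    size_t n = 8;

    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Copy the input array into device (global) memory */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(a), a, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "helloWorldKernel", NULL);
    clSetKernelArg(k, 0, sizeof(buf), &buf);

    /* One work-item per array element */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(a), a, 0, NULL, NULL);

    for (int i = 0; i < 8; i++) printf("%d ", a[i]); /* 0 2 4 6 8 10 12 14 */
    return 0;
}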
Programming GPGPU
Doubling every element of int a[8]:

SISD:
for (i = 0; i < N; i++)
    a[i] = a[i] * 2;

MIMD (with OpenMP):
#pragma omp for
for (i = 0; i < N; i++)
    a[i] = a[i] * 2;

SIMT:
__kernel void helloWorld (__global int* a) {
    i = get_thread_id();
    a[i] = a[i] * 2;
}

(Diagram: timeline of elements 0-7 - processed one after another in SISD, in parallel chunks in MIMD, and all at once in SIMT)
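A side note (mine, not the slides'): get_thread_id() above is pseudocode; in standard OpenCL C the corresponding built-in is get_global_id(). A compilable version of the same kernel would look like:

__kernel void helloWorld (__global int* a)
{
    size_t i = get_global_id(0);  /* unique global index of this work-item */
    a[i] = a[i] * 2;              /* each work-item doubles one element */
}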
Workload factors that affect efficiency
- Dynamic instruction count
- Floating-point instruction count
- Memory instruction count
- Branch instruction count
- Atomic instruction count
- …
- Number of divergent warps (illustrated below)
- Thread count per kernel
- Registers used per thread
- Bytes transferred device-to-host / host-to-device
- …
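To illustrate one of these factors, here is a small hypothetical kernel (not from the slides) in which neighboring threads take different branches. On a SIMD engine, both paths of such a branch are executed serially with part of the warp masked off, which is what the divergent-warp count captures.

/* Hedged illustration: even and odd work-items in the same warp
 * follow different paths, so the hardware runs both branches
 * one after the other. */
__kernel void divergent(__global int* a)
{
    size_t i = get_global_id(0);
    if (i % 2 == 0)
        a[i] = a[i] * 2;  /* path taken by the even work-items */
    else
        a[i] = a[i] + 1;  /* path taken by the odd work-items */
}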
GPGPU-Sim
- Software that models a modern GPU
- Runs GPU applications to profile their characteristics
- Models the timing of: multiprocessor cores, caches, the interconnection network, and the memory controllers & DRAM
- http://groups.google.com/group/gpgpu-sim/
GPGPU-Sim
- Different configurations:
(Table: the six simulator configurations compared - contents not preserved in this transcript)
Benchmarks
(Table: the 38 GPGPU workloads used as benchmarks - contents not preserved in this transcript)
Statistical Methods
- PCA: Principal Component Analysis
- Goal: remove correlations among the input dimensions → reduce the dimensionality of the data
- Forms a matrix with the input dimensions as columns and the observations as rows
- Tries to lose as little information as possible
- Example: (figure not preserved; a generic sketch of the math follows)
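As a generic sketch of the technique (standard PCA, not a detail taken from the paper): for a centered data matrix $X$ with $n$ observations (workloads) as rows and $p$ dimensions (characteristics) as columns,

C = \frac{1}{n-1} X^{\top} X                 % sample covariance of the dimensions
C = W \Lambda W^{\top}                       % eigendecomposition; columns of W are the principal components
Y = X W_k                                    % projection onto the top-k eigenvectors
\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i}  % fraction of the variance retained by k components

This is how a statement like "16 PCs for 90% variance" on the methodology slide reads: the 16 largest eigenvalues account for 90% of the total variance.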
Statistical Methods
- HCA: Hierarchical Clustering Analysis
- Classification technique based on similarity within the dataset
- No need to know the number of clusters a priori
- Generates a tree-like structure (a dendrogram)
- Example: (figure not preserved; a generic sketch follows)
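Again as a generic sketch (the slides do not say which linkage rule the paper uses): agglomerative HCA starts with every observation as its own cluster and repeatedly merges the two closest clusters, e.g. under single linkage

d(A, B) = \min_{a \in A,\; b \in B} \lVert a - b \rVert

The sequence of merges is what the dendrogram records; cutting the tree at a chosen height yields the final clusters.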
Experimental Methodology
- Load GPGPU-Sim with several workloads and 6 different configurations
- Generate a 93 (dimensions) x 38 (workloads) matrix from the GPGPU-Sim output
- Perform PCA using Statistica to obtain the principal components (16 PCs for 90% of the variance, 7 PCs for 70%)
- Perform Hierarchical Cluster Analysis
Results
- Workload classification: trade-off between the available simulation time and the desired accuracy
- Microarchitecture impact: the workload characteristics are largely independent of the underlying architecture (number of cores, register file size, interconnection network, …)
(Figure: dendrogram of the workloads, 16 principal components, 90% variance)
- Some benchmarks: SS = Similarity Score (51K inst.), SLA = Scan of Large Arrays (1310K inst.), PR = Parallel Reduction (830K inst.), BN = Binomial Options (131K inst.)
Results
- GPGPU workload fundamental analysis: branch divergence, high barrier and instruction counts, and heavyweight threads are the characteristics that affect performance the most
- Characteristics-based classification: the workloads show diverse behavior in terms of instructions executed per thread, total thread count, and total instruction count
- Discussion and limitations: nVidia and ATI GPUs have similar architectures and behavior; Intel's Larrabee differs in architecture, behavior, and programming model
Conclusions
- GPUs provide big computing power at low cost, but not every problem can be treated this way
- It is important to take workload characteristics into account in order to exploit GPU capabilities
- The benchmarks show the relevance of branch divergence, memory coalescing, and kernel size
Future of GPGPU
- GPUs are used more and more, e.g. in the Top500 list and in new machines by Cray
- Evolving GPU architectures (e.g. nVidia Fermi): speed, memory, floating-point support… more computing than graphics power!
- New tools, e.g. PGI compilers that auto-parallelize code for GPUs
- Hybrid computing: CPU + GPU simultaneously - don't leave the CPU idle while the GPU is busy
Exploring GPGPU Workloads
Unai Lopez Novoa
Intelligent Systems Group, University of Basque Country

Editor's Notes

  • #11 Up to here, we have seen and explained how a GPU works. Now we start talking about the paper's experimentation. The authors wanted to evaluate workloads on GPUs in order to find the relevant parameters that affect performance and to be able to tune executions. These were some of the characteristics they took into account.
  • #15 Methods used to analyze the aforementioned characteristics.