Newbie’s guide to the GPGPU universe
This is a light introduction to the world of GPGPU, designed as a presentation of about one hour.

Transcript

  • 1. Newbie’s guide to the GPGPU universe Ofer Rosenberg
  • 2. Agenda • GPU History • Anatomy of a Modern GPU • Typical GPGPU Models • The GPGPU universe
  • 3. GPU History A GPGPU perspective
  • 4. From Shaders to Compute (1) In the beginning, GPU HW was fixed & optimized for Graphics… Slide from: “GPU Architecture: Implications & Trends”, David Luebke, NVIDIA Research, SIGGRAPH 2008
  • 5. From Shaders to Compute (2) • GPUs evolved to become programmable (which made Gaming companies very happy…) Shader: a simple program that runs on a graphics processing unit and describes the traits of either a vertex or a pixel.
  • 6. The birth of GPGPU (1) • Interest from the academic world: a pixel shader runs the same program (1024 × 768 × 60) times per second, i.e. a highly efficient SPMD (Single Program, Multiple Data) machine • Used a fictitious graphics pipe to solve problems – Advanced Graphics problems – General Computational problems
  • 7. The birth of GPGPU (2) • In 2002, Mark Harris from NVIDIA coined the term GPGPU “General-Purpose computation on Graphics Processing Units” • Used a graphics language for general computation • Highly effective, but : – The developer needs to learn another (not intuitive) language – The developer was limited by the graphics language
  • 8. From Shaders to Compute (3) • GPUs needed one more evolutionary step → Unified Shaders
  • 9. Rise of modern GPGPU • Unified Architecture paved the way for modern GPGPU languages GeForce 8800 GTX (G80) was released in Nov. 2006; CUDA 0.8 was released in Feb. 2007 (first official Beta) ATI x1900 (R580) was released in Jan. 2006; CTM was released in Nov. 2006
  • 10. Evolution of Compute APIs (GPGPU) • CUDA & CTM led to two compute standards: DirectCompute & OpenCL • DirectCompute is a Microsoft standard – Released as part of Win7/DX11, a.k.a. Compute Shaders – Runs only on Windows – Microsoft C++ AMP maps to DirectCompute • OpenCL is a cross-OS / cross-Vendor standard – Managed by a working group in Khronos – Apple is the spec editor & conformance owner – Work can be scheduled on both GPUs and CPUs Timeline: CTM SDK Released Nov 2006, CUDA 1.0 Released June 2007, CUDA 2.0 Released Aug 2008, OpenCL 1.0 Released Dec 2008, DirectX 11 Released Oct 2009, CUDA 3.0 Released Mar 2010, OpenCL 1.1 Released June 2010, CUDA 4.0 Released May 2011, OpenCL 1.2 Released Nov 2011, CUDA 4.1 Released Jan 2012, CUDA 4.2 Released April 2012, C++ AMP 1.0 Released Aug 2012, CUDA 5.0 Released Oct 2012, CUDA 5.5 Released July 2013, OpenCL 2.0 Provisional Released July 2013
  • 11. GPGPU Evolution 2004 – Stanford University: Brook for GPUs 2006 – AMD releases CTM NVIDIA releases CUDA 2008 – OpenCL 1.0 released G80 – 346 GFLOPS R580 – 375 GFLOPS
  • 12. GPGPU Evolution Nov 2009 - First Hybrid SC in the Top10: Chinese Tianhe-1 1,024 Intel Xeon E5450 CPUs 5,120 Radeon 4870 X2 GPUs Nov 2010 – First Hybrid SC reaches #1 on Top500 list: Tianhe-1A 14,336 Xeon X5670 CPUs 7,168 Nvidia Tesla M2050 GPUs Source: http://www.top500.org/lists/
  • 13. GPGPU Evolution 2013 - OpenCL on : Nexus 4 (Qualcomm Adreno 320) Nexus 10 (ARM Mali T604) Android 4.2 adds GPU support for Renderscript 2014 – NVIDIA Tegra 5 will support CUDA 2013 – GPGPU Continuum becomes a reality
  • 14. The GPGPU Continuum Apple A6 GPU 25 GFLOPS < 2W ORNL TITAN SC 27 PFLOPS 8200 KW AMD G-T16R 46 GFLOPS* 4.5W NVIDIA GTX Titan 4500 GFLOPS 250W Intel i7-3770 511 GFLOPS* 77W * GFLOPS of CPU+GPU
  • 15. Anatomy of a Modern GPU GPGPU Perspective
  • 16. Massive Parallelism From a GPGPU perspective, a GPU is a highly multi-threaded, wide vector machine
  • 17. Parallelism detailed • Multi (Many) Cores • Wide Vector Unit • Multi-threaded (latency/stalls hiding)
    Many cores: NVIDIA K20: 14 SMXs; AMD HD7970: 32 Compute Units; Intel Xeon Phi 5110P: 60 Cores
    Wide vectors: NVIDIA K20: Warp = 32 floats, 6 Warps per SMX; AMD HD7970: Wavefront = 64 floats, 4 Wavefronts per CU; Intel Xeon Phi 5110P: VPU = 16 floats, 1 VPU per Core
    Multi-threading: NVIDIA K20: 64 Warps per SMX in flight; AMD HD7970: 40 Wavefronts per CU in flight
    (Image: NVIDIA GK110 SMX)
  • 18. Typical GPU Caveats • Wide vectors = SIMD (SIMT) execution – Conditional code has to be executed “vector wide” – Mitigation: Predication (execute all paths, using masks to commit per-lane results) – Performance hit on mixed execution, down to 1/N efficiency (where N is the vector width) • Many Cores & Small caches = High percentage of Stalls – Mitigation: • Hold multiple in-flight contexts (aka Warps/Wavefronts) per core • Stall = fast context switch between an in-flight context and the active context • Requires a huge register bank (NV & AMD: 256KB per SMX/CU) – Latency hiding depends on having enough in-flight contexts A must read (the images to the right are taken from this talk): “From Shader Code to a Teraflop: How GPU Shader Cores Work”, by Kayvon Fatahalian, Stanford University, and Mike Houston, Fellow, AMD
  • 19. Typical GPGPU Models This section describes some general GPGPU models, which apply to a wide range of languages
  • 20. Simplified System Model • Host runs the OS, Application, Drivers, etc. • GPU is connected to the Host through PCIe, Shared Memory, etc. The Application code contains API calls*, which use a Runtime environment, which provides GPU access. The Application code also contains “kernels”: short programs/functions that are loaded onto and executed by the GPU. * In some languages the API calls are abstracted through special syntax or directives (Diagram: Host running the Application on top of the Runtime; kernels executing on the GPU)
  • 21. GPGPU Execution Model (1) • A “kernel” is executed on a grid (1D/2D/3D) • Each point in the grid executes one instance of the kernel, orthogonally* • Per-instance read/write is accomplished by using the instance’s index * There are sync primitives on a group/block level (or whole device) (Diagram: OpenCL vs. CUDA grid terminology)

```cuda
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock(16, 16);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
}
```
  • 22. GPGPU Execution Model (2) • GPU execution model is asynchronous – Commands are sent down the stack – Kernels are executed based on GPU load & status (the GPU serves a few Apps) – Application code may wait on completion • Queueing Model – Explicit (OpenCL) – Default is implicit, advanced usage is explicit (CUDA) • SPMD → MPMD – GPUs used to be able to execute one kernel at a time – Modern languages support multiple simultaneous kernels
  • 23. GPGPU Memory Model Basically, a distributed memory system: • Separate Host memory / Device memory – Create a buffer/image on the host – Create a buffer/image on the device • Opaque handle (OpenCL) or device-side pointer (CUDA) • Sync operations between memories: – Read / Write – Map / Unmap (marshalling) • Pinned memory for faster sync • GPU can access Host mapped memory (CUDA) (Diagram: host creates a buffer through the Runtime, then writes it to the GPU)
  • 24. GPU Memory Model • A few memory types, driven by GPU architecture • Has an effect on performance – use the right type • Watch out for coherency issues – Not your typical MESI architecture…
  • 25. Compilation Model • Most GPGPU languages use dynamic compilation – A common practice in the world of GPUs – Different GPU architectures: no common ISA – ISA varies even between generations of the same vendor • Front-End converts the high-level language to IR (Intermediate Representation) – Assembly of a virtual machine – LLVM is very common in this world – In some languages, this happens at application compile time • Back-End(s) convert from IR to Binary – Some vendors use additional intermediate-to-intermediate stages • Most languages enable storing of IR & IL – Some do it implicitly (CUDA) (Diagram: OpenCL C / C for CUDA / Fortran / OpenACC front-ends → LLVM* IR → PTX IL → GK110 / GF104 binaries) * NVIDIA has “NVVM”, which is LLVM with a set of restrictions
  • 27. GPGPU usages CUDA usages: Advanced Graphics, Game Physics, Computer Vision, Cluster/HPC, Finance, Scientific, Media Processing • CUDA Community Showcase: ~900 applications from Academia • http://www.nvidia.com/object/cuda-apps-flash-new.html# (Examples shown from Johannes Gutenberg University Mainz, Imperial College London, UC Davis, California, and TU Darmstadt)
  • 28. GPGPU Languages • Welcome to the jungle…
  • 30. Vendor overview: NVIDIA GeForce: • GPU for Gaming • GTX680 Tesla: • GPU Accelerators • K10 / K20 Quadro: • Professional GFX • K5000 All running the same cores (Kepler GK104 or GK110)
  • 31. Vendor overview: AMD Radeon: • GPU for Gaming • HD7970 FirePro: • Professional GFX • W9000 APU: • CPU+GPU on same die • A10 All running the same cores (GCN)
  • 32. Vendor overview: Intel Xeon Phi: • Accelerator Card • 5110P CPU: • CPU+GPU on same die • Haswell Core i7-4xxx
  • 33. Leading Mobile GPU Vendors Vivante CG4000 • Unified Shaders • 4 Cores, SIMD4 each • Supports OpenCL 1.2 • 48 Gflops NVIDIA Tegra 4 • 6 X 4-wide Vertex shaders • 4 X 4-wide Pixel Shaders • No GPGPU support • 74 GFLOPS ARM Mali T604 • 4 Cores • Multiple “pipes” per core • Supports OpenCL 1.1 • 68 GFlops Imagination PowerVR 5xx • Apple, Samsung, Motorola, Intel • Unified Shaders • Supports OpenCL 1.1 EP (543) • 38 Gflops (Apple’s MP4 ver) Qualcomm Adreno 320 • Part of Snapdragon S4 • Unified Shader • Supports OpenCL 1.1 EP • 50 GFlops