Compute API – Past & Future

Presentation given at a workshop on GPU & Parallel Computing held in Israel on Jan 6th, 2011.

  1. Compute API – Past & Future. Ofer Rosenberg, Visual Computing Software
  2. Intro and acknowledgments
     • Who am I?
       – For the past two years, leading the Intel representation in the OpenCL working group @ Khronos
       – Additional background in Media, Signal Processing, etc.
       – http://il.linkedin.com/in/oferrosenberg
     • Acknowledgments: this presentation contains ideas based on talks with many people. Partial list:
       – AMD: Mike Houston, Ben Gaster
       – Apple: Aaftab Munshi
       – DICE: Johan Andersson
       – Intel: Aaron Lefohn, Stephen Junkins, David Blythe, Adam Lake, Yariv Aridor, Larry Seiler, and more…
       – And others…
  3. Agenda
     • The beginning: from Shaders to Compute
     • The Past/Present: 1st generation of Compute APIs
       – Caveats of the 1st generation
     • The Future: 2nd generation of Compute APIs
  4. From Shaders to Compute
     • In the beginning, GPU HW was fixed-function and optimized for graphics…
     Slide from: "GPU Architecture: Implications & Trends", David Luebke, NVIDIA Research, SIGGRAPH 2008
  5. From Shaders to Compute
     • Graphics stages became programmable → GPUs evolved…
     • This led to the traditional GPGPU approach…
     Slide from: "GPU Architecture: Implications & Trends", David Luebke, NVIDIA Research, SIGGRAPH 2008
  6. From Shaders to Compute: Traditional GPGPU
     • Write in a graphics language and use the GPU
     • Highly effective, but:
       – The developer needs to learn another (non-intuitive) language
       – The developer is limited by the graphics language
     • Then came CUDA & CTM…
     Slides from "General Purpose Computation on Graphics Processors (GPGPU)", Mike Houston, Stanford University Graphics Lab
  7. The cradle of GPU Compute APIs
     • GeForce 8800 GTX (G80) was released in Nov. 2006; CUDA 0.8 was released in Feb. 2007 (first official beta)
     • ATI x1900 (R580) was released in Jan. 2006; CTM was released in Nov. 2006
     Slides from "GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU", Ian Buck, NVIDIA, SC06, and "Close to the Metal", Justin Hensley, AMD, SIGGRAPH 2007
  8. The 1st generation of Platform Compute APIs
     • CUDA & CTM led the way to two compute standards: DirectCompute & OpenCL
     • DirectCompute is a Microsoft standard
       – Released as part of Win7/DX11, a.k.a. Compute Shaders
       – Only runs under Windows, on a GPU device
     • OpenCL is a cross-OS / cross-vendor standard
       – Managed by a working group in Khronos
       – Apple is the spec editor & conformance owner
       – Work can be scheduled on both GPUs and CPUs
     • Release timeline: CTM (Nov 2006), CUDA 1.0 (June 2007), StreamSDK (Dec 2007), CUDA 2.0 (Aug 2008), OpenCL 1.0 (Dec 2008), DirectX 11 (Oct 2009), CUDA 3.0 (Mar 2010), OpenCL 1.1 (June 2010)
     • The 1st generation was developed on GPU HW tuned for graphics usages, and simply extended it for general usage
  9. The 1st generation of Platform Compute APIs: Execution Model
     • The execution model was derived directly from shader programming in graphics ("fragment processing"):
       – Shader programming: initiate one instance of the shader per vertex/pixel
       – Compute: initiate one instance of the kernel for each point in an N-dimensional grid
     • Fits the GPU's vision of an array of scalar (or stream) processors
     Drawing from the OpenCL 1.1 Specification, Rev. 36
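The grid-based launch above can be sketched in plain Python as a conceptual stand-in for the runtime (the names `enqueue_nd_range` and `add_kernel` are illustrative, not real OpenCL or DirectCompute API):

```python
# Conceptual sketch of the 1st-generation execution model: the host
# enqueues a kernel over an N-dimensional grid, and the runtime launches
# one logical instance per grid point ("fragment processing" style).
# Illustrative only; not a real compute API.
import itertools

def enqueue_nd_range(kernel, global_size, *args):
    """Run one logical kernel instance per point of the grid."""
    results = {}
    for gid in itertools.product(*(range(n) for n in global_size)):
        results[gid] = kernel(gid, *args)
    return results

# Example "kernel": element-wise add over a 2x2 grid.
def add_kernel(gid, a, b):
    y, x = gid
    return a[y][x] + b[y][x]

a = [[1, 2], [3, 4]]
b = [[10, 20], [30, 40]]
out = enqueue_nd_range(add_kernel, (2, 2), a, b)
print(out[(1, 1)])  # 44
```

Each instance sees only its own grid point, which is exactly the property that makes the model map well onto an array of scalar/stream processors.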
  10. The 1st generation of Platform Compute APIs: Memory Model
      • Distributed memory system:
        – Abstraction: the application gets a "handle" to the memory object / resource
        – Explicit transactions: API for sync between Host & Device(s): read/write, map/unmap
      • (Diagram: application ↔ OpenCL runtime ↔ devices, with explicit host/device copies)
      • Three address spaces: Global, Local (Shared) & Private
        – Local/Shared memory is the non-trivial memory space…
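The handle-plus-explicit-transactions pattern can be sketched as follows (a toy model in Python; `Runtime`, `DeviceBuffer`, and the `enqueue_*` names are hypothetical stand-ins for the real buffer APIs):

```python
# Sketch of the 1st-generation distributed memory model: the app holds
# an opaque handle to a device allocation, and data only moves through
# explicit read/write transactions. Illustrative only; not a real API.

class DeviceBuffer:
    """Opaque handle to a device-side allocation."""
    def __init__(self, size):
        self._storage = bytearray(size)  # stands in for device memory

class Runtime:
    def create_buffer(self, size):
        return DeviceBuffer(size)

    def enqueue_write(self, buf, host_data):
        # Explicit host -> device copy.
        buf._storage[:len(host_data)] = host_data

    def enqueue_read(self, buf, nbytes):
        # Explicit device -> host copy.
        return bytes(buf._storage[:nbytes])

rt = Runtime()
buf = rt.create_buffer(4)
rt.enqueue_write(buf, b"abcd")   # host pushes data to the device
print(rt.enqueue_read(buf, 4))   # host pulls results back: b'abcd'
```

The key point is that the host never dereferences device memory directly; every byte crosses through an API call, which is exactly the porting burden later slides complain about.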
  11. Disclaimer: the next slides present my opinions and thoughts on caveats of, and future improvements to, the Platform Compute API.
  12. The 2nd generation of Platform Compute APIs
      • Recap:
        – The 1st generation: CUDA (until 3.0), OpenCL 1.x, DX11 CS
        – Defined on HW optimized for GFX, extended to general compute
      • The "cheese" has moved for GPUs: compute has become an important usage scenario
        – Advanced graphics: physics, advanced lighting effects, irregular shadow mapping, screen-space rendering
        – Media: video encoding & processing, image processing, image segmentation, face recognition
        – Throughput: scientific simulations, finance, oil exploration
        – Developer feedback on the 1st generation enables creating better HW/APIs
      • The second generation of Platform Compute APIs: "OpenCL Next", DirectX 12?
      • The 2nd generation of Compute APIs will run on HW designed with compute in mind
  13. Caveats of the 1st generation: Execution Model
      • Developer input:
        – Most "real world" compute usages are fine-grain (the grid is small: hundreds of items at best)
        – "Real world" kernels have sequential parts interleaved with the parallel code (reduction, condition testing, etc.)

        __kernel void foo()
        {
            // code here runs for each point in the grid
            barrier(CLK_LOCAL_MEM_FENCE);
            if (get_local_id(0) == 0) {
                // this code runs once per work-group
            }
            // code here runs for each point in the grid
            barrier(CLK_GLOBAL_MEM_FENCE);
            if (get_global_id(0) == 0) {
                // this code runs only once
            }
            // code here runs for each point in the grid
        }

      • Battlefield 2 execution-phase DAG (image courtesy Johan Andersson, DICE)
      • Using "fragment processing" for these usages results in inefficient use of the machine
  14. Caveats of the 1st generation: Execution Model
      • The "array of scalar/stream processors" model is not optimal for CPUs & GPUs
      • It works well for large grids (as in traditional graphics), but at finer grain there is a better model…
      • (Block diagrams: NV Fermi, AMD R600, Intel NHM)
      • CPUs and GPUs are better modeled as multi-threaded vector machines
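The "multi-threaded vector machine" view can be sketched like this (a sequential Python toy model; the SIMD width and thread count are made-up illustrative numbers, not tied to any of the chips named above):

```python
# Sketch: mapping a 1D grid onto a "multi-threaded vector machine",
# i.e. a handful of HW threads each advancing SIMD_WIDTH grid points
# per vector step, instead of one logical scalar processor per point.
# Widths and counts are illustrative.

SIMD_WIDTH = 8    # e.g. a slice of a GPU wavefront, or a wide CPU register
NUM_THREADS = 4   # illustrative HW thread count

def run_grid(kernel, n, data):
    out = [None] * n
    owner = [None] * n  # which HW thread handled each point (for illustration)
    for batch_start in range(0, n, SIMD_WIDTH):
        thread_id = (batch_start // SIMD_WIDTH) % NUM_THREADS
        # One "vector step": SIMD_WIDTH grid points advance together.
        for i in range(batch_start, min(batch_start + SIMD_WIDTH, n)):
            out[i] = kernel(i, data)
            owner[i] = thread_id
    return out, owner

square = lambda i, d: d[i] * d[i]
out, owner = run_grid(square, 10, list(range(10)))
print(out[3], owner[9])  # 9 1  (point 9 lands in the second batch, thread 1)
```

Exposing the batch structure is what lets a runtime schedule small grids sensibly, instead of pretending every grid point is an independent processor.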
  15. The 2nd generation of Platform Compute APIs: Ideas for a new execution model
      • Goals:
        – Support fine-grain task parallelism
        – Support complex application execution graphs
        – Better match HW evolution: target multi-threaded vector machines
        – Aligned with CPU evolution and SoC integration of CPU/GPU
      • Solution: a tasking system as the execution-model foundation
        – Task queues are mapped to independent HW units (~compute cores)
        – Device load balancing is enabled via task stealing from a shared task pool
        – OpenCL analogy: tasks execute at the "work-group level"
        – OpenCL task ≠ CPU task
        – More restricted: no preemption
        – Evolved: braided tasks (sequential parts & fine-grain parallel parts interleaved)
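The queues-plus-stealing scheme can be sketched as a sequential toy scheduler (Python here; `ComputeUnit`, `schedule`, and the stealing policy are illustrative assumptions, not a spec):

```python
# Sketch of the proposed tasking system: per-unit task queues plus a
# shared task pool that idle units steal from. This models only the
# scheduling policy, sequentially; names and policy are illustrative.
from collections import deque

class ComputeUnit:
    def __init__(self, name):
        self.name = name
        self.queue = deque()   # this unit's private task queue
        self.executed = []

def schedule(units, task_pool):
    """Run rounds until all queues and the shared pool are empty."""
    while True:
        progressed = False
        for unit in units:
            if not unit.queue and task_pool:
                unit.queue.append(task_pool.popleft())  # steal from pool
            if unit.queue:
                task = unit.queue.popleft()
                unit.executed.append(task())  # tasks run to completion: no preemption
                progressed = True
        if not progressed:
            return

units = [ComputeUnit("cu0"), ComputeUnit("cu1")]
units[0].queue.extend([lambda: "a", lambda: "b", lambda: "c"])
pool = deque([lambda: "d", lambda: "e"])
schedule(units, pool)
print(units[1].executed)  # ['d', 'e'] -- cu1 stayed busy by stealing
```

Because tasks run to completion, the model stays implementable on GPU-style HW that cannot preempt, while the shared pool keeps under-loaded units busy.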
  16. The 2nd generation of Platform Compute APIs: Ideas for a new execution model
      • There are others who think along the same lines…
      Slides from "Leading a new Era of Computing", Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day
  17. Caveats of the 1st generation: Memory Model
      • Developer input:
        – A growing number of compute workloads use complex data structures (linked lists, trees, etc.)
        – Performance: the cost of pointer marshaling & reconstruction on the device is high
        – Porting complexity: need to add explicit transactions, marshaling, etc.
        – Supporting a shared/unified address space (API & HW) is required
      • (Diagram: today's split host/device address spaces vs. a shared/unified address space between Host & Devices)
  18. The 2nd generation of Platform Compute APIs: Ideas for a new memory model
      • Baseline: memory objects / resources have the same starting address on Host & Devices (a shared address space)
      • Option 1: shared address space with relaxed consistency
        – Extends the existing OCL 1.x / DX11 memory model
        – Uses explicit API calls to sync between Host & Device
        – Suitable for disjoint memory architectures (discrete GPUs, for example…)
      • Option 2: shared address space with full coherency
        – New model: memory is coherent between Host & Device
        – Uses known "language level" mechanisms for concurrent access: atomics, volatile
        – Suitable for shared-memory architectures
  19. Some more thoughts for the 2nd generation (and beyond)
      • Promote heterogeneous processing, not GPU-only:
        – Which device should run the code depends on the problem size (graph: execution time vs. problem size for GPU and CPU)
        – A 16x16 matrix multiply should run on the CPU; a 1000x1000 matrix multiply should run on the GPU
        – Where is the decision point? Better to leave it to the runtime… (requires API support)
        – Load balancing is especially relevant on systems where the CPU & GPU are close in compute power
      • One API to rule them all:
        – The Compute API as the underlying infrastructure to run Media & GFX
        – Extend the API to contain a flexible pipeline, fixed-function HW, etc.
      Slide from "Parallel Future of a Game Engine", Johan Andersson, DICE
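A runtime-side device decision could be as simple as a tuned crossover point (the threshold and function names below are made-up illustrations of the idea, not a real API):

```python
# Sketch of runtime-driven device selection: small problems go to the
# CPU (launch overhead dominates), large ones to the GPU (throughput
# wins). The crossover value is a hypothetical runtime-tuned number.

CROSSOVER_N = 256  # illustrative break-even matrix size, found by profiling

def pick_device(matrix_n):
    """Choose a device for an N x N matrix multiply."""
    return "CPU" if matrix_n < CROSSOVER_N else "GPU"

print(pick_device(16))    # CPU  (tiny problem: launch cost dominates)
print(pick_device(1000))  # GPU  (large problem: throughput wins)
```

In practice the runtime would calibrate the crossover per platform, which is why the slide argues the decision belongs below the API rather than in application code.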
  20. References:
      • "GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU", Ian Buck, NVIDIA, SC06
        – http://gpgpu.org/static/sc2006/workshop/presentations/Buck_NVIDIA_Cuda.pdf
      • "GPU Architecture: Implications & Trends", David Luebke, NVIDIA Research, SIGGRAPH 2008
        – http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
      • "General Purpose Computation on Graphics Processors (GPGPU)", Mike Houston, Stanford University Graphics Lab
        – http://www-graphics.stanford.edu/~mhouston/public_talks/R520-mhouston.pdf
      • "Close to the Metal", Justin Hensley, AMD, SIGGRAPH 2007
        – http://gpgpu.org/static/s2007/slides/07-CTM-overview.pdf
      • "NVIDIA's Fermi: The First Complete GPU Computing Architecture", Peter N. Glaskowsky
        – http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIAFermi-TheFirstCompleteGPUComputingArchitecture.pdf
      • "Leading a new Era of Computing", Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day
        – http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9Njk3NTJ8Q2hpbGRJRD0tMXxUeXBlPTM=&t=1
      • "Parallel Future of a Game Engine", Johan Andersson, DICE
        – http://www.slideshare.net/repii/parallel-futures-of-a-game-engine-2478448
