HSA Introduction Hot Chips 2013

Speaker notes
  • We will be open on this. We will reach out to partners and collaborate to bring this to market in the right form.
  • Let's take a deeper dive here into the details of the architecture …
  • The memory model for a new architecture is key
  • Easily leverage the inherent power efficiency of GPU computing: common routines such as scan, sort, reduce, and transform, plus more advanced routines like heterogeneous pipelines. The Bolt library works with OpenCL or C++ AMP. Enjoy the unique advantages of the HSA platform: high-performance shared virtual memory, a tightly integrated CPU and GPU, and moving the computation, not the data. Finally, a single source code base for the CPU and GPU! Developers can focus on core algorithms while the library handles the details for different processors. The heterogeneous pipeline allows different stages to be targeted to the CPU or GPU, or tasks that the runtime can put on either. Parallel_filter is designed to solve problems like Haar face detection: early stages on the GPU, later stages on the CPU. Flexible customization is provided by C++ templates. Bolt directly interfaces with host data structures (e.g., std::vector) and uses the appropriate computation resource (CPU or GPU). Ben Sander's Bolt talk: 5:15PM–6:00PM, Wednesday, June 13, Hyatt: Evergreen H.
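As a concrete illustration of the note above, here is a minimal sketch of Bolt's STL-style interface, assuming the headers and functors of the Bolt 1.x OpenCL path; exact headers and overloads may differ across releases:

```cpp
#include <vector>
#include <bolt/cl/sort.h>
#include <bolt/cl/reduce.h>
#include <bolt/cl/functional.h>

int main()
{
    // Bolt interfaces directly with host data structures such as
    // std::vector; on HSA there is no explicit device buffer or copy.
    std::vector<int> v(1024 * 1024);
    for (size_t i = 0; i < v.size(); ++i)
        v[i] = static_cast<int>(v.size() - i);

    // STL-style calls; the library decides where to run them,
    // so the same source serves both the CPU and the GPU.
    bolt::cl::sort(v.begin(), v.end());
    int sum = bolt::cl::reduce(v.begin(), v.end(), 0,
                               bolt::cl::plus<int>());
    (void)sum;  // a real application would consume the result
    return 0;
}
```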
  • We have already released APARAPI, an acceleration library for Java, layered on OpenCL. The idea behind APARAPI is to allow a write-once, run-anywhere solution for Java that does not require the programmer to learn a new language. We supply a library with a kernel class that the Java programmer can override with their own Java code. Then at runtime the APARAPI library converts that bytecode to OpenCL, or falls back to a thread pool if an OpenCL device is not available. This has proven popular with the Java community.
  • Now let's analyze how it works in the Haar algorithm.
  • Now, just what do we do in each of those search boxes?
  • Each successive cascade stage searches for more features. The initial cascade stage is cheap; each successive stage is more expensive. Exit from a cascade stage means a face is not present in the rectangle; surviving all the cascades confirms a face is present. This is a nested data-parallel opportunity: significant chunks of parallel code inside control flow.
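A minimal C++ sketch of that early-out control flow, with purely illustrative types (this is not the OpenCV/Haar data layout):

```cpp
#include <vector>

// Illustrative types only.
struct SearchSquare { int x, y; };
struct Stage {
    float threshold;
    float score(const SearchSquare&) const { return 0.0f; } // stand-in feature sum
};
struct Cascade { std::vector<Stage> stages; };              // e.g., 22 stages

// Early-out structure described in the note above: cheap stages first,
// exit as soon as a stage's feature score falls below its threshold.
bool face_present(const SearchSquare& sq, const Cascade& cascade)
{
    for (const Stage& stage : cascade.stages) {
        if (stage.score(sq) < stage.threshold)
            return false;            // reject: no face in this rectangle
    }
    return true;                     // survived all stages: face confirmed
}
```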
  • The GPU runs at its peak efficiency when its SIMD engines are fully loaded with live work items. Let's see how this looks relative to the CPU at each stage.
  • Here is a 3D graph of the cascade depth used when searching for the face in the Mona Lisa. Note that most of the searches exit in 5 cascade stages or fewer; a considerable number run between 5 and 10. And we see the broad pillar that represents the face, where all 22 stages returned a positive result. This is not a classically data-parallel workload. Far from it.
  • Here we see what happens as more and more work items exit at each cascade stage. In the first stage, the GPU is fully loaded and more than 5 times as fast as the CPU. For the next few stages it is close, and the CPU is clearly better at the later stages. How best to run this workload? Run the early stages on the GPU and the later stages on the CPU: run each workload where it is most power efficient! HSA allows this since we are in shared coherent memory.
  • In this graph we show the range of performance and power savings possible by running the first N stages on the GPU, followed by the rest on the CPU. Note that running everything on the CPU or everything on the GPU gets a similarly bad result. If we run the first 3 stages on the GPU and then switch to the CPU, we get the best result. Even though this runs fewer stages on the GPU, most of the search squares have exited by the time we switch to the CPU.
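A sketch of that hybrid schedule, with hypothetical run_stages_gpu/run_stages_cpu helpers standing in for the real dispatch code; under HSA shared coherent memory no copy is needed at the handoff:

```cpp
#include <vector>
#include <utility>

// Hypothetical stand-ins for the real GPU/CPU dispatch code. Each takes
// the surviving search-square indices and returns the survivors after
// running cascade stages [first, last).
std::vector<int> run_stages_gpu(std::vector<int> s, int, int) { return s; }
std::vector<int> run_stages_cpu(std::vector<int> s, int, int) { return s; }

// The hybrid schedule from the note above: first N stages on the GPU,
// where SIMD lanes are still mostly full, remaining stages on the CPU.
std::vector<int> detect_hybrid(std::vector<int> squares,
                               int n_gpu_stages, int total_stages)
{
    squares = run_stages_gpu(std::move(squares), 0, n_gpu_stages);
    // Most squares have exited by now; the stragglers run better on the CPU.
    return run_stages_cpu(std::move(squares), n_gpu_stages, total_stages);
}
```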
  • And here is the result. Performance more than doubles, and we process each frame with 40% of the power on average. HSA allows the right processor …
  • Broad phase: finding out which objects to test against which other objects. The simplest method is to check every object against every other object. This is very fast when you have small groups of objects (fewer than 10, say) but extremely slow when you have a lot of objects (more than 1000, say): for 10 objects you would need 45 mid-phase checks, and for 1000 you would need 499,500. It doesn't scale to large numbers, and with many objects you'll spend more time doing mid-phase checks than anything else. To speed up the broad phase, you need a smarter algorithm such as sort-and-sweep, spatial hashing, quadtrees, or some other spatial index that only tries combinations of objects that are likely to be colliding. Mid phase: this is where you normally do a simple check such as a bounding-box or bounding-sphere test, because it is faster than the narrow phase. Most spatial index structures return some false positives, and it's best to catch them early instead of calling the more expensive narrow-phase collision routines. Narrow phase: the most specific and most expensive collision check, such as a polygon-to-polygon or pixel-to-pixel test. A lot of games do just fine with bounding spheres or bounding boxes and don't need a more specific narrow-phase check.
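The quoted numbers are just the all-pairs count n(n-1)/2, which a few lines of C++ confirm:

```cpp
#include <cstdio>

// All-pairs broad phase does n*(n-1)/2 checks, which is why it scales
// poorly: the numbers quoted in the note fall out directly.
constexpr long long pair_checks(long long n) { return n * (n - 1) / 2; }

int main()
{
    std::printf("%lld\n", pair_checks(10));    // 45
    std::printf("%lld\n", pair_checks(1000));  // 499500
}
```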
  • Key points: Writing optimal CPU implementations requires complex development too; programmers have to use both intrinsics for vector parallelism and TBB for multicore parallelism. OpenCL C is a widely known, fairly verbose C-based API, and it shows in the boilerplate initialization code, runtime-compile code, and kernel launch. OpenCL C++ removes initialization code by providing sensible defaults for platform, context, device, and command queue: no need to set these up, and no need to save them and drag them around for later OpenCL API calls. It reduces the code to compile by using C++ exceptions for error checking and automatic memory allocation (rather than calling an API to determine the size of return arguments); default arguments and type checking let the code focus on the relevant parameters. The host-side C++ support is available in a "cl.hpp" file which runs on any OpenCL implementation (including NVIDIA, Intel, etc.). In addition, the AMD OpenCL implementation supports a "static" C++ kernel language with classes, namespaces, and templates (not used in this implementation). In C++ AMP, initialization is handled through sensible defaults: C++ AMP eliminates platform and context, and accelerator_view combines device and queue. The single-source model eliminates runtime-compile code (compilation happens at compile time with the host code) and streamlines the kernel call convention (eliminating clSetKernelArg). The implementation here uses a C++11 lambda to reduce boilerplate for functor construction (the kernel can directly access local variables), and data transfer is reduced by the implicit transfers performed by array_view. Bolt moves the reduction code into the library, so what's left is the reduction operator, and it removes data transfer and copy-back because the interface works directly with host data structures. Bolt-for-C++ AMP uses lambda syntax; Bolt-for-OpenCL does not (not supported) and instead relies on the static C++ kernel language, recently introduced in AMD APP SDK 2.6 (beta) and 2.7 (production?). Other notes: the serial CPU version integrates the algorithm and the reduction, so we call it just "algorithm"; later implementations separate these for performance. "Launch" is argument setup and calling of the kernel or library routine. "Copy-back" includes code to copy data back to the host and run a host-side final reduction step. LOC includes appropriate spaces and comments; we attempted to use a similar coding style across all implementations. TBB init is one line to initialize the scheduler (tbb::task_scheduler_init).
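To make the C++ AMP points concrete, here is a minimal C++ AMP kernel (an illustrative SAXPY, not the Hessian kernel measured in the talk, assuming the Visual Studio 2012 amp.h): there is no platform/context/queue setup, the lambda is compiled with the host code, and array_view performs the transfers implicitly.

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

// Single-source C++ AMP: sensible defaults replace platform/context
// setup, and array_view handles host<->device data movement implicitly.
void saxpy(float a, std::vector<float>& x, const std::vector<float>& y)
{
    const int n = static_cast<int>(x.size());
    array_view<float, 1> xv(n, x);
    array_view<const float, 1> yv(n, y);

    // The kernel is a lambda compiled together with the host code.
    parallel_for_each(xv.extent, [=](index<1> i) restrict(amp) {
        xv[i] = a * xv[i] + yv[i];
    });
    xv.synchronize();  // copy results back into the host vector
}
```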
Slide transcript

    1. HETEROGENEOUS SYSTEM ARCHITECTURE OVERVIEW. HOT CHIPS TUTORIAL, AUGUST 2013. PHIL ROGERS, HSA FOUNDATION PRESIDENT, AMD CORPORATE FELLOW
    2. HSA FOUNDATION • Founded in June 2012 • Developing a new platform for heterogeneous systems • www.hsafoundation.com • Specifications under development in working groups • Our first specification, the HSA Programmer's Reference Manual, is already published and available on our web site • Additional specifications for System Architecture, Runtime Software and Tools are in process
    3. HSA FOUNDATION MEMBERSHIP — AUGUST 2013. Membership levels: Founders, Promoters, Supporters, Contributors, Academic, Associates
    4. SOCS HAVE PROLIFERATED — MAKE THEM BETTER • SOCs have arrived and are a tremendous advance over previous platforms • SOCs combine CPU cores, GPU cores and other accelerators, with high-bandwidth access to memory • How do we make them even better? Easier to program, easier to optimize, higher performance, lower power • HSA unites accelerators architecturally • Early focus on the GPU compute accelerator, but HSA goes well beyond the GPU
    5. INFLECTIONS IN PROCESSOR DESIGN. [Figure: three eras of processor design, with "we are here" marked at the start of the third.] Single-Core Era: single-thread performance over time; enabled by Moore's Law and voltage scaling; constrained by power and complexity; programming evolved from Assembly to C/C++ to Java. Multi-Core Era: throughput performance over time (number of processors); enabled by Moore's Law and SMP architecture; constrained by power, parallel software and scalability; programmed with pthreads, then OpenMP/TBB. Heterogeneous Systems Era: modern application performance over time (data-parallel exploitation); enabled by abundant data parallelism and power-efficient GPUs; temporarily constrained by programming models and communication overhead; programmed with shaders, then CUDA and OpenCL, moving toward C++ and Java.
    6. HIGH LEVEL FEATURES OF HSA • Features currently being defined in the HSA Working Groups** • Unified addressing across all processors • Operation into pageable system memory • Full memory coherency • User mode dispatch • Architected queuing language • High-level language support for GPU compute processors • Preemption and context switching. ** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups
    7. HSA — AN OPEN PLATFORM • Open architecture, membership open to all: HSA Programmer's Reference Manual, HSA System Architecture, HSA Runtime • Delivered via royalty-free standards: royalty-free IP, specifications and APIs • ISA agnostic for both CPU and GPU • Membership from all areas of computing: hardware companies, operating systems, tools and middleware
    8. HSA INTERMEDIATE LAYER — HSAIL • HSAIL is a virtual ISA for parallel programs • Finalized to ISA by a JIT compiler or "Finalizer" • ISA independent by design for CPU and GPU • Explicitly parallel: designed for data-parallel programming • Support for exceptions, virtual functions, and other high-level language features • Lower level than OpenCL SPIR: fits naturally in the OpenCL compilation stack • Suitable to support additional high-level languages and programming models: Java, C++, OpenMP, etc.
    9. HSA MEMORY MODEL • Defines visibility ordering between all threads in the HSA system • Designed to be compatible with the C++11, Java, OpenCL and .NET memory models • Relaxed-consistency memory model for parallel compute performance • Visibility controlled by: Load.Acquire, Store.Release, barriers
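The Load.Acquire / Store.Release visibility controls map directly onto the C++11 atomics the slide says the model is designed to be compatible with; a minimal producer/consumer sketch in plain C++11 (not HSAIL):

```cpp
#include <atomic>
#include <thread>
#include <cassert>

int payload = 0;
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                  // ordinary store
    ready.store(true, std::memory_order_release);  // Store.Release
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // Load.Acquire
        ;                                          // spin
    assert(payload == 42);  // release/acquire pairing makes this visible
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}
```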
    10. HSA QUEUING MODEL • User mode queuing for low-latency dispatch: the application dispatches directly, with no OS or driver in the dispatch path • Architected Queuing Layer: a single compute dispatch path for all hardware, with no driver translation, direct to hardware • Allows dispatch to queue from any agent, CPU or GPU • GPU self-enqueue enables lots of solutions: recursion, tree traversal, wavefront reforming
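A hypothetical sketch of what user-mode dispatch looks like in principle: the application writes a packet into a shared ring buffer and signals a device-visible doorbell, with no system call on the path. Names and packet layout here are illustrative, not the architected AQL format the HSA specifications define.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative packet; the real AQL packet format is architected by HSA.
struct DispatchPacket { uint64_t kernel; uint64_t args; };

struct UserQueue {
    DispatchPacket* ring;             // memory shared with the device
    uint64_t        size;             // power of two
    std::atomic<uint64_t> write_index{0};
    std::atomic<uint64_t>* doorbell;  // device-visible signal

    // Enqueue straight from the application: no OS or driver call.
    void dispatch(const DispatchPacket& p)
    {
        uint64_t slot = write_index.fetch_add(1, std::memory_order_relaxed);
        ring[slot & (size - 1)] = p;
        doorbell->store(slot, std::memory_order_release);  // device sees packet
    }
};
```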
    11. HSA SOFTWARE
    12. HSA SOFTWARE STACK. [Figure: today's driver stack vs the HSA software stack, both sitting on the hardware (APUs, CPUs, GPUs). Today: apps on domain libraries, OpenCL™/DX runtimes and user mode drivers, over a graphics kernel mode driver. HSA: apps on task-queuing libraries, HSA domain libraries and the OpenCL™ 2.x runtime, over the HSA runtime and HSA JIT, over an HSA kernel mode driver. The diagram distinguishes user mode components, kernel mode components, and components contributed by third parties.]
    13. OPENCL™ AND HSA • HSA is an optimized platform architecture for OpenCL™, not an alternative to OpenCL™ • OpenCL™ on HSA will benefit from: avoidance of wasteful copies, low-latency dispatch, an improved memory model, and pointers shared between CPU and GPU • OpenCL™ 2.0 shows considerable alignment with HSA • Many HSA member companies are also active with Khronos in the OpenCL™ working group
    14. BOLT — PARALLEL PRIMITIVES LIBRARY FOR HSA • Easily leverage the inherent power efficiency of GPU computing • Common routines such as scan, sort, reduce, transform • More advanced routines like heterogeneous pipelines • The Bolt library works with OpenCL and C++ AMP • Enjoy the unique advantages of the HSA platform • Move the computation, not the data • Finally, a single source code base for the CPU and GPU! • Developers can focus on core algorithms • Bolt version 1.0 for OpenCL and C++ AMP is available now at https://github.com/HSA-Libraries/Bolt
    15. HSA OPEN SOURCE SOFTWARE • HSA will feature an open-source Linux execution and compilation stack • Allows a single shared implementation for many components • Enables university research and collaboration in all areas • Because it's the right thing to do.
       Component             IHV or Common   Rationale
       HSA Bolt Library      Common          Enable understanding and debug
       HSAIL Code Generator  Common          Enable research
       LLVM Contributions    Common          Industry and academic collaboration
       HSAIL Assembler       Common          Enable understanding and debug
       HSA Runtime           Common          Standardize on a single runtime
       HSA Finalizer         IHV             Enable research and debug
       HSA Kernel Driver     IHV             For inclusion in Linux distros
    16. ACCELERATING JAVA: GOING BEYOND NATIVE LANGUAGES
    17. JAVA ENABLEMENT BY APARAPI. [Figure: flow from source to execution.] The developer creates Java™ source, which is compiled to class files (bytecode) using a standard compiler. Aparapi is a runtime capable of converting Java™ bytecode to OpenCL™ for execution on any OpenCL™ 1.1+ capable device, or of executing via a thread pool if OpenCL™ is not available.
    18. JAVA HETEROGENEOUS ENABLEMENT ROADMAP. [Figure: four stages, each showing the application and JVM above CPU ISA and GPU ISA. (1) Today: APARAPI targets the GPU through OpenCL™. (2) APARAPI emits HSAIL, finalized to the GPU ISA by the HSA Finalizer. (3) APARAPI sits on the HSA runtime, with an LLVM optimizer producing IR and HSAIL. (4) A Sumatra-enabled JVM dispatches directly through the HSA Finalizer and HSAIL, with no separate APARAPI layer.]
    19. SUMATRA PROJECT OVERVIEW • AMD/Oracle-sponsored open source (OpenJDK) project • Targeted at Java 9 (2015 release) • Allows developers to efficiently represent data-parallel algorithms in Java • Sumatra 'repurposes' Java 8's multi-core Stream/Lambda APIs to enable both CPU and GPU computing • At runtime, a Sumatra-enabled Java Virtual Machine (JVM) will dispatch 'selected' constructs to available HSA-enabled devices • Developers of Java libraries are already refactoring their library code to use these same constructs, so developers using existing libraries should see GPU acceleration without any code changes • http://openjdk.java.net/projects/sumatra/ • https://wikis.oracle.com/display/HotSpotInternals/Sumatra • http://mail.openjdk.java.net/pipermail/sumatra-dev/ [Figure: at development time Application.java is compiled to Application.class; at runtime the Sumatra-enabled JVM dispatches Lambda/Stream API constructs through the HSA Finalizer to CPU ISA and GPU ISA.]
    20. EXAMPLE WORKLOADS
    21. HAAR FACE DETECTION: CORNERSTONE TECHNOLOGY FOR COMPUTER VISION
    22. LOOKING FOR FACES IN ALL THE RIGHT PLACES. Quick HD calculations: search square = 21 x 21; pixels = 1920 x 1080 = 2,073,600; search squares = 1900 x 1060 = ~2 million
    23. LOOKING FOR DIFFERENT SIZE FACES — BY SCALING THE VIDEO FRAME. More HD calculations: 70% scaling in H and V; total pixels = 4.07 million; search squares = 3.8 million
    24. HAAR CASCADE STAGES. [Figure: each stage evaluates several features (k, l, m, p, q, r); if a face is still possible, proceed from Stage N to Stage N+1, otherwise reject the frame.]
    25. 22 CASCADE STAGES, EARLY OUT BETWEEN EACH. [Figure: stages 1 through 22 with an early out between each; exit means no face, surviving all 22 stages means face confirmed.] Final HD calculations: search squares = 3.8 million; average features per square = 124; calculations per feature = 100; calculations per frame = 47 GCalcs. Calculation rate: 30 frames/sec = 1.4 TCalcs/second; 60 frames/sec = 2.8 TCalcs/second … and this only gets front-facing faces
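The slide's arithmetic can be checked directly; a small C++ program reproducing the GCalcs and TCalcs/second figures:

```cpp
#include <cstdio>

int main()
{
    const double squares  = 3.8e6;   // search squares per frame
    const double features = 124;     // average features evaluated per square
    const double calcs    = 100;     // calculations per feature

    const double per_frame = squares * features * calcs;  // ~47.1e9
    std::printf("per frame: %.1f GCalcs\n", per_frame / 1e9);
    std::printf("30 fps:    %.2f TCalcs/s\n", per_frame * 30 / 1e12);  // ~1.4
    std::printf("60 fps:    %.2f TCalcs/s\n", per_frame * 60 / 1e12);  // ~2.8
}
```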
    26. UNBALANCING DUE TO EARLY EXITS • When running on the GPU, we run each search rectangle on a separate work item • Early-out algorithms like Haar exhibit divergence between work items: some work items exit early while their neighbors continue • SIMD packing suffers as a result. [Figure: SIMD lanes showing live and dead work items.]
    27. CASCADE DEPTH ANALYSIS. [Figure: 3D surface plot of cascade depth (0 to 25, in bands 0-5, 5-10, 10-15, 15-20, 20-25) across the search positions of the frame.]
    28. PROCESSING TIME/STAGE. [Figure: per-stage processing time (ms, 0-100) for stages 1-8 and 9-22, GPU vs CPU, on the "Trinity" A10-4600M (6 CU @ 497 MHz, 4 cores @ 2700 MHz).] Test system: AMD A10-4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)
    29. PERFORMANCE: CPU VS. GPU. [Figure: images/sec (0-12) vs number of cascade stages run on the GPU (0-8, 22), comparing CPU-only, HSA, and GPU-only on the "Trinity" A10-4600M (6 CU @ 497 MHz, 4 cores @ 2700 MHz).] Test system: AMD A10-4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)
    30. HAAR SOLUTION — RUN DIFFERENT CASCADES ON GPU AND CPU. +2.5x increased performance; -2.5x decreased energy per frame. By seamlessly sharing data between CPU and GPU, HSA allows the right processor to handle its appropriate workload
    31. ACCELERATING SUFFIX ARRAY CONSTRUCTION: A CLOUD SERVER WORKLOAD
    32. SUFFIX ARRAYS • Suffix arrays are a fundamental data structure • Designed for efficient searching of a large text: quickly locate every occurrence of a substring S in a text T • Suffix arrays are used to accelerate in-memory cloud workloads: full-text index search, lossless data compression, bio-informatics
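For readers unfamiliar with the data structure, here is a deliberately naive C++ construction, sorting suffix start positions by direct comparison. The talk's implementation uses the far more efficient skew algorithm (next slide); this sketch only shows what a suffix array is.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive O(n^2 log n) construction for illustration only.
std::vector<int> build_suffix_array(const std::string& text)
{
    std::vector<int> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        // Compare the suffixes starting at positions a and b.
        return text.compare(a, std::string::npos,
                            text, b, std::string::npos) < 0;
    });
    return sa;  // sa[k] = start of the k-th smallest suffix
}
```

Every occurrence of a substring S can then be located with a binary search over the sorted suffixes.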
    33. ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA. M. Deo, "Parallel Suffix Array Construction and Least Common Prefix for the GPU", submitted to Principles and Practice of Parallel Programming (PPoPP'13), February 2013. By offloading data-parallel computations to the GPU, HSA increases performance and reduces energy for suffix array construction versus a single-threaded CPU: +5.8x increased performance, -5x decreased energy. By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without the penalty of intermediate copies. [Figure: skew-algorithm pipeline for computing the suffix array, with stages split across processors: Radix Sort (GPU), Merge Sort (GPU), Compute SA (CPU), Lexical Rank (CPU), Radix Sort (GPU).] Test system: AMD A10-4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM
    34. GAMEPLAY RIGID BODY PHYSICS
    35. RIGID BODY PHYSICS SIMULATION • Rigid-body physics simulation is a way to animate and interact with objects, widely used in games and movie production; it drives game play and visual effects (eye candy) • Physics simulation is used in much of today's software: middleware physics engines such as Bullet, Havok and PhysX; games ranging from Angry Birds and Cut the Rope to Tomb Raider and Crysis 3; 3D authoring tools such as Autodesk Maya, Unity 3D, Houdini, Cinema 4D and Lightwave; industrial applications such as Siemens NX8 Mechatronics Concept Design; medical applications such as surgery trainers; robotics simulation • But GPU-accelerated rigid-body physics is not used in game play, only in effects
    36. RIGID BODY PHYSICS — ALGORITHM • Find potentially interacting object "pairs" using bounding-shape approximations (broad-phase collision detection) • Perform full overlap testing between potentially interacting pairs (mid-phase collision detection) • Compute exact contact information for various shape types (narrow-phase collision detection) • Compute constraint forces for natural motion and stable stacking. [Figure: pipeline of broad-phase, mid-phase and narrow-phase collision detection, followed by computing contact points, setting up constraints, and solving constraints.]
    37. RIGID BODY PHYSICS — CHALLENGES AND SOLUTIONS. Implementation challenges: the game engine and physics engine need to interact synchronously during simulation; the set of pairs can be huge and changes from frame to frame (thousands to millions for any given frame); narrow-phase algorithms cause thread divergence. Benefits of HSA: fast CPU round trips through user mode dispatch; unified addressing, pageable memory and coherency, supporting as large a pair list as the CPU can (the entire memory space, with dynamic memory allocation); improved handling of divergence via GPU enqueue.
    38. EASE OF PROGRAMMING: CODE COMPLEXITY VS. PERFORMANCE
    39. LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS. [Figure: lines of code (0-350, broken into init, compile, copy, launch, algorithm, and copy-back) and performance (0-35) for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt implementations of an exemplary ISV "Hessian" kernel.] Test system: AMD A10-5800K APU with Radeon™ HD Graphics; CPU: 4 cores, 3800 MHz (4200 MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800 MHz; 4GB RAM. Software: Windows 7 Professional SP1 (64-bit); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
    40. THE HSA FUTURE • Architected heterogeneous processing on the SOC • Programming of accelerators becomes much easier • Accelerated software that runs across multiple hardware vendors • Scalability from smart phones to supercomputers on a common architecture • GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model • The heterogeneous software ecosystem evolves at a much faster pace • Lower-power, more capable devices in your hand, on the wall, in the cloud
    41. THANK YOU
