ISCA Final Presentation - Intro


  • We will be open on this.
    We will reach out to partners and collaborate to bring this to market in the right form.
  • Let's take a deeper dive here into the details of the architecture …
  • The memory model for a new architecture is key.

  • Key Points:
    Writing optimal CPU implementations requires complex development, too.
    Programmers have to use both intrinsics for vector parallelism and TBB for multicore parallelism.
    OpenCL C:
    OpenCL C is a widely known, fairly verbose C-based API, and it shows in the boilerplate initialization code, runtime-compile code, and kernel launch.
    OpenCL C++:
    Removes initialization code by providing sensible defaults for platform, context, device, and command-queue. No need to set these up, and no need to save them and drag them around for later OCL API calls.
    Reduces compilation code by using C++ exceptions for error checking and automatic memory allocation (rather than calling the API to determine the size of return arguments).
    Default arguments and type checking let the code focus on the relevant parameters.
    The host-side support for C++ is available in a “cl.hpp” file which runs on any OpenCL implementation (including NV, Intel, etc).
    In addition, the AMD OpenCL implementation supports a “static” C++ kernel language — classes, namespaces, templates. (Not used in this implementation.)
    C++ AMP
    Initialization is handled through sensible defaults: C++ AMP eliminates platform and context, and accelerator_view combines device and queue.
    Single-source model: eliminates run-time compile code; this is done at compile time along with the host code.
    Single-source model: streamlined kernel call convention (eliminates clSetKernelArg).
    The implementation here uses C++11 lambda to reduce boilerplate code for functor construction (kernel can directly access local vars)
    Data xfer reduced by implicit transfers performed by array_view
    Moves reduction code into library — what’s left is reduction operator.
    Removes data xfer and copy-back — interface is directly with host data structures.
    Bolt-for-C++AMP uses lambda syntax; Bolt-For-OCL does not (not supported)
    Bolt-for-OCL implementation relies on C++ static kernel language — recently introduced in AMD APP SDK 2.6 (beta) and 2.7 (production?)

    Other notes:
    Serial CPU integrates algorithm and reduction, and we call it just “algorithm”; later implementations separate these for performance.
    Launch is argument setup and calling of the kernel or library routine.
    Copy-back includes code to copy data back to host and run a host-side final reduction step.
    LOC includes appropriate spaces and comments. We attempted to use similar coding style across all implementations.
    TBB init is one line to initialize the scheduler (tbb::task_scheduler_init).
  • 1. ISCA Final Presentation - Intro

    2. HSA FOUNDATION • Founded in June 2012 • Developing a new platform for heterogeneous systems • Specifications under development in working groups to define the platform • Membership consists of 43 companies and 16 universities • Adding 1-2 new members each month © Copyright 2014 HSA Foundation. All Rights Reserved
    3. DIVERSE PARTNERS DRIVING FUTURE OF HETEROGENEOUS COMPUTING (logo slide: Founders, Promoters, Supporters, Contributors, Academic)
    4. MEMBERSHIP TABLE
       Founder (6): AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd
       Promoter (1): LG Electronics
       Contributor (25): Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics, Inc, Sony Mobile Communications, Swarm 64 GmbH, Synopsys, Tensilica, Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation
       Supporter (13): Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC (Operator of Argonne National Laboratory)
       Academic (17): Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab, National Tsing Hua University, Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science
    5. HETEROGENEOUS PROCESSORS HAVE PROLIFERATED — MAKE THEM BETTER • Heterogeneous SOCs have arrived and are a tremendous advance over previous platforms • SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory • How do we make them even better? • Easier to program • Easier to optimize • Higher performance • Lower power • HSA unites accelerators architecturally • Early focus on the GPU compute accelerator, but HSA will go well beyond the GPU
    6. INFLECTIONS IN PROCESSOR DESIGN (chart: three eras, each plotting performance over time, “we are here” at the transition)
       Single-Core Era (single-thread performance): enabled by Moore’s Law and voltage scaling; constrained by power and complexity. Programming: Assembly, C/C++, Java …
       Multi-Core Era (throughput performance vs. number of processors): enabled by Moore’s Law and SMP architecture; constrained by power, parallel SW, scalability. Programming: pthreads, OpenMP / TBB …
       Heterogeneous Systems Era (modern application performance via data-parallel exploitation): enabled by abundant data parallelism and power-efficient GPUs; temporarily constrained by programming models and communication overhead. Programming: Shader, CUDA, OpenCL, C++ and Java
    7. LEGACY GPU COMPUTE (diagram: CPUs with coherent system memory; GPU compute units with separate non-coherent GPU memory; linked by PCIe™) The limiters: • Multiple memory pools • Multiple address spaces • High overhead dispatch • Data copies across PCIe • New languages for programming • Dual source development • Proprietary environments • Expert programmers only • Need to fix all of this to unleash our programmers
    8. EXISTING APUS AND SOCS (diagram: CPUs and GPU CUs physically integrated; coherent system memory plus separate non-coherent GPU memory) • Physical integration • Good first step • Some copies gone • Two memory pools remain • Still queue through the OS • Still requires expert programmers • Need to finish the job
    9. AN HSA ENABLED SOC (diagram: CPUs and GPU CUs sharing unified coherent memory) • Unified coherent memory enables data sharing across all processors • Processors architected to operate cooperatively • Designed to enable the application to run on different processors at different times
    10. PILLARS OF HSA* • Unified addressing across all processors • Operation into pageable system memory • Full memory coherency • User mode dispatch • Architected queuing language • Scheduling and context switching • HSA Intermediate Language (HSAIL) • High level language support for GPU compute processors (* All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors)
    11. HSA SPECIFICATIONS • HSA System Architecture Specification: Version 1.0 Provisional, released April 2014; defines discovery, memory model, queue management, atomics, etc. • HSA Programmers Reference Specification: Version 1.0 Provisional, released June 2014; defines the HSAIL language and object format • HSA Runtime Software Specification: Version 1.0 Provisional, expected to be released in July 2014; defines the APIs through which an HSA application uses the platform • All released specifications can be found at the HSA Foundation web site
    12. HSA - AN OPEN PLATFORM • Open architecture, membership open to all • HSA Programmers Reference Manual • HSA System Architecture • HSA Runtime • Delivered via royalty-free standards • Royalty-free IP, specifications and APIs • ISA agnostic for both CPU and GPU • Membership from all areas of computing • Hardware companies • Operating systems • Tools and middleware • Applications • Universities
    13. HSA INTERMEDIATE LAYER — HSAIL • HSAIL is a virtual ISA for parallel programs • Finalized to ISA by a JIT compiler or “Finalizer” • ISA independent by design for CPU & GPU • Explicitly parallel • Designed for data parallel programming • Support for exceptions, virtual functions, and other high level language features • Lower level than OpenCL SPIR • Fits naturally in the OpenCL compilation stack • Suitable to support additional high level languages and programming models: Java, C++, OpenMP, Python, etc.
    14. HSA MEMORY MODEL • Defines visibility ordering between all threads in the HSA system • Designed to be compatible with the C++11, Java, OpenCL and .NET memory models • Relaxed consistency memory model for parallel compute performance • Visibility controlled by: Load.Acquire, Store.Release, fences
    15. HSA QUEUING MODEL • User mode queuing for low latency dispatch: application dispatches directly, no OS or driver required in the dispatch path • Architected Queuing Layer: single compute dispatch path for all hardware, no driver translation, direct to hardware • Allows for dispatch to queue from any agent (CPU or GPU) • GPU self-enqueue enables lots of solutions: recursion, tree traversal, wavefront reforming
    16. HSA SOFTWARE
    17. EVOLUTION OF THE SOFTWARE STACK (diagram of two stacks)
       Legacy stack: Apps → Domain Libraries / OpenCL™, DX Runtimes, User Mode Drivers → Graphics Kernel Mode Driver → Driver Stack → Hardware (APUs, CPUs, GPUs)
       HSA software stack: Apps → Task Queuing Libraries / HSA Domain Libraries, OpenCL™ 2.x Runtime → HSA Runtime / HSA JIT → HSA Kernel Mode Driver → Hardware
       (Legend: user mode components, kernel mode components, components contributed by third parties)
    18. OPENCL™ AND HSA • HSA is an optimized platform architecture for OpenCL, not an alternative to OpenCL • OpenCL on HSA will benefit from: avoidance of wasteful copies, low latency dispatch, improved memory model, pointers shared between CPU and GPU • OpenCL 2.0 leverages HSA features: Shared Virtual Memory, platform atomics
    19. ADDITIONAL LANGUAGES ON HSA (in development)
       Language | Body | More Information
       Java | Sumatra | OpenJDK
       LLVM | LLVM | Code generator for HSAIL
       C++ AMP | Multicoreware | mp-driver-ng/wiki/Home
       OpenMP | GCC, AMD, Suse | /hsa/gcc/README.hsa?view=markup&pathrev=207425
    20. SUMATRA PROJECT OVERVIEW • AMD/Oracle sponsored open source (OpenJDK) project • Targeted at Java 9 (2015 release) • Allows developers to efficiently represent data parallel algorithms in Java • Sumatra ‘repurposes’ Java 8’s multi-core Stream/Lambda APIs to enable both CPU and GPU computing • At runtime, a Sumatra enabled Java Virtual Machine (JVM) will dispatch ‘selected’ constructs to available HSA enabled devices • Developers of Java libraries are already refactoring their library code to use these same constructs, so developers using existing libraries should see GPU acceleration without any code changes (diagram — development: Application → Java Compiler → Application.class; runtime: Lambda/Stream API → Sumatra Enabled JVM → CPU ISA, or → HSA Finalizer → GPU ISA)
    21. HSA OPEN SOURCE SOFTWARE • HSA will feature an open source Linux execution and compilation stack • Allows a single shared implementation for many components • Enables university research and collaboration in all areas • Because it’s the right thing to do
       Component Name | IHV or Common | Rationale
       HSA Bolt Library | Common | Enable understanding and debug
       HSAIL Code Generator | Common | Enable research
       LLVM Contributions | Common | Industry and academic collaboration
       HSAIL Assembler | Common | Enable understanding and debug
       HSA Runtime | Common | Standardize on a single runtime
       HSA Finalizer | IHV | Enable research and debug
       HSA Kernel Driver | IHV | For inclusion in Linux distros
    23. SUFFIX ARRAYS • Suffix arrays are a fundamental data structure • Designed for efficient searching of a large text: quickly locate every occurrence of a substring S in a text T • Suffix arrays are used to accelerate in-memory cloud workloads: full text index search, lossless data compression, bio-informatics
    24. ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA • By offloading data parallel computations to the GPU, HSA increases performance and reduces energy for suffix array construction • By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without the penalty of intermediate copies • +5.8x increased performance, 5x decreased energy (chart: skew algorithm for Compute SA, with stages Merge Sort::GPU, Radix Sort::GPU, Compute SA::CPU, Lexical Rank::CPU, Radix Sort::GPU) Reference: M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, submitted to Principles and Practice of Parallel Programming (PPoPP’13), February 2013. System: AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM
    26. LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS (exemplary ISV “Hessian” kernel; charts compare LOC — broken into Init, Compile, Copy, Launch, Algorithm, Copy-back — and performance across Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt) System: AMD A10-5800K APU with Radeon™ HD Graphics; CPU: 4 cores, 3800 MHz (4200 MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800 MHz; 4GB RAM. Software: Windows 7 Professional SP1 (64-bit); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
    27. THE HSA FUTURE • Architected heterogeneous processing on the SOC • Programming of accelerators becomes much easier • Accelerated software that runs across multiple hardware vendors • Scalability from smart phones to super computers on a common architecture • GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model • Heterogeneous software ecosystem evolves at a much faster pace • Lower power, more capable devices in your hand, on the wall, in the cloud