ISCA Final Presentation - Intro

2. HSA FOUNDATION
Founded in June 2012
Developing a new platform for heterogeneous
systems
www.hsafoundation.com
Specifications under development in working
groups to define the platform
Membership consists of 43 companies and 16
universities
Adding 1-2 new members each month
© Copyright 2014 HSA Foundation. All Rights Reserved
3. DIVERSE PARTNERS DRIVING FUTURE OF
HETEROGENEOUS COMPUTING
Founders
Promoters
Supporters
Contributors
Academic
4. MEMBERSHIP TABLE
Founder (6): AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd
Promoter (1): LG Electronics
Contributor (25): Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics Inc., Sony Mobile Communications, Swarm64 GmbH, Synopsys, Tensilica Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation
Supporter (13): Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC (Operator of Argonne National Laboratory)
Academic (17): Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab National Tsing Hua University, Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science
5. HETEROGENEOUS PROCESSORS HAVE
PROLIFERATED — MAKE THEM BETTER
Heterogeneous SOCs have arrived and are a
tremendous advance over previous platforms
SOCs combine CPU cores, GPU cores and
other accelerators, with high bandwidth access
to memory
How do we make them even better?
Easier to program
Easier to optimize
Higher performance
Lower power
HSA unites accelerators architecturally
Early focus on the GPU compute accelerator,
but HSA will go well beyond the GPU
6. INFLECTIONS IN PROCESSOR DESIGN
[Chart: three inflections in processor design, each plotted as performance over time, with a "we are here" marker on the current curve]

Single-Core Era: single-thread performance over time. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity. Programming: Assembly, C/C++, Java, ...
Multi-Core Era: throughput performance over time (# of processors). Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW scalability. Programming: pthreads, OpenMP / TBB, ...
Heterogeneous Systems Era: modern application performance over time (data-parallel exploitation). Enabled by: abundant data parallelism, power-efficient GPUs. Temporarily constrained by: programming models, communication overhead. Programming: Shader, CUDA, OpenCL, C++ and Java
7. LEGACY GPU COMPUTE
[Diagram: CPUs attached to coherent system memory, connected over PCIe™ to a GPU with compute units (CUs) and its own non-coherent GPU memory]

The limiters:
Multiple memory pools
Multiple address spaces
High overhead dispatch
Data copies across PCIe
New languages for programming
Dual source development
Proprietary environments
Expert programmers only
Need to fix all of this to unleash our programmers
8. EXISTING APUS AND SOCS
[Diagram: CPUs 1..N and GPU compute units 1..M physically integrated on one chip; the CPUs share coherent system memory while the GPU keeps separate non-coherent memory]

Physical integration is a good first step
Some copies gone
Two memory pools remain
Still queue through the OS
Still requires expert programmers
Need to finish the job
9. AN HSA ENABLED SOC
Unified coherent memory enables data sharing across all processors
Processors architected to operate cooperatively
Designed to enable the application to run on different processors at different times

[Diagram: CPUs 1..N and GPU compute units 1..M all sharing a single unified coherent memory]
10. PILLARS OF HSA*
Unified addressing across all processors
Operation into pageable system memory
Full memory coherency
User mode dispatch
Architected queuing language
Scheduling and context switching
HSA Intermediate Language (HSAIL)
High level language support for GPU compute processors
* All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors
11. HSA SPECIFICATIONS
HSA System Architecture Specification
Version 1.0 Provisional, Released April 2014
Defines discovery, memory model, queue management, atomics, etc
HSA Programmers Reference Specification
Version 1.0 Provisional, Released June 2014
Defines the HSAIL language and object format
HSA Runtime Software Specification
Version 1.0 Provisional, expected to be released in July 2014
Defines the APIs through which an HSA application uses the platform
All released specifications can be found at the HSA Foundation web site:
www.hsafoundation.com/standards
12. HSA - AN OPEN PLATFORM
Open Architecture, membership open to all
HSA Programmers Reference Manual
HSA System Architecture
HSA Runtime
Delivered via royalty free standards
Royalty Free IP, Specifications and APIs
ISA agnostic for both CPU and GPU
Membership from all areas of computing
Hardware companies
Operating Systems
Tools and Middleware
Applications
Universities
13. HSA INTERMEDIATE LAYER — HSAIL
HSAIL is a virtual ISA for parallel programs
Finalized to ISA by a JIT compiler or “Finalizer”
ISA independent by design for CPU & GPU
Explicitly parallel
Designed for data parallel programming
Support for exceptions, virtual functions,
and other high level language features
Lower level than OpenCL SPIR
Fits naturally in the OpenCL compilation stack
Suitable to support additional high level languages and programming models:
Java, C++, OpenMP, Python, etc.
14. HSA MEMORY MODEL
Defines visibility ordering between all
threads in the HSA System
Designed to be compatible with
C++11, Java, OpenCL and .NET
Memory Models
Relaxed consistency memory model
for parallel compute performance
Visibility controlled by:
Load.Acquire
Store.Release
Fences
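The visibility rules above can be sketched in plain Java, whose memory model HSA is designed to be compatible with. This is an illustrative analogue using Java 9 VarHandles, not HSA code; the class and field names are mine:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class AcquireRelease {
    static int payload;  // plain field, published by the release store below
    static int flag;     // accessed via a VarHandle for acquire/release
    static final VarHandle FLAG;
    static {
        try {
            FLAG = MethodHandles.lookup()
                    .findStaticVarHandle(AcquireRelease.class, "flag", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static int runOnce() {
        payload = 0;
        FLAG.setVolatile(0);
        Thread producer = new Thread(() -> {
            payload = 42;        // ordinary store, ordered before...
            FLAG.setRelease(1);  // ...the release store (cf. HSA Store.Release)
        });
        producer.start();
        // Acquire load (cf. HSA Load.Acquire): once flag reads 1, every
        // store made before the matching release store is visible here.
        while ((int) FLAG.getAcquire() == 0) {
            Thread.onSpinWait();
        }
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return payload;  // guaranteed to be 42
    }

    public static void main(String[] args) {
        System.out.println(runOnce());  // prints 42
    }
}
```

The point mirrors the slide: relaxed accesses stay cheap, and ordering cost is paid only at the acquire/release pair (with fences available for stronger ordering, cf. VarHandle.fullFence()).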
15. HSA QUEUING MODEL
User mode queuing for low latency dispatch
Application dispatches directly
No OS or driver required in the dispatch path
Architected Queuing Layer
Single compute dispatch path for all hardware
No driver translation, direct to hardware
Allows for dispatch to queue from any agent
CPU or GPU
GPU self enqueue enables lots of solutions
Recursion
Tree traversal
Wavefront reforming
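The user-mode dispatch idea, in which the application writes packets into a shared ring buffer and bumps a write index with no OS call in the path, can be sketched as a minimal single-producer/single-consumer queue. This is an illustrative analogue, not the architected AQL format; the names and packet layout are invented:

```java
import java.util.concurrent.atomic.AtomicLong;

public class UserModeQueue {
    private final long[] packets;  // stand-in for architected packet slots
    private final int mask;
    private final AtomicLong writeIndex = new AtomicLong(0);
    private final AtomicLong readIndex = new AtomicLong(0);

    public UserModeQueue(int sizePow2) {
        packets = new long[sizePow2];
        mask = sizePow2 - 1;
    }

    // Producer (the application): claim a slot, fill it, publish by
    // bumping the write index. No OS or driver call anywhere in this path.
    public boolean dispatch(long packet) {
        long w = writeIndex.get();
        if (w - readIndex.get() == packets.length) {
            return false;  // queue full
        }
        packets[(int) (w & mask)] = packet;
        writeIndex.set(w + 1);  // volatile store publishes the packet
        return true;
    }

    // Consumer (the packet processor): poll the next packet, if any.
    public Long poll() {
        long r = readIndex.get();
        if (r == writeIndex.get()) {
            return null;  // queue empty
        }
        long packet = packets[(int) (r & mask)];
        readIndex.set(r + 1);
        return packet;
    }
}
```

Because any agent can be the producer, the same shape supports CPU-to-GPU, GPU-to-CPU, and GPU self-enqueue.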
17. EVOLUTION OF THE SOFTWARE STACK

[Diagram: two software stacks side by side, both running on the same hardware (APUs, CPUs, GPUs)]

Driver stack (today): Apps, over Domain Libraries and OpenCL™/DX Runtimes with User Mode Drivers, over a Graphics Kernel Mode Driver
HSA software stack: Apps, over Task Queuing Libraries, HSA Domain Libraries and the OpenCL™ 2.x Runtime, over the HSA Runtime and HSA JIT, over the HSA Kernel Mode Driver
Legend: user mode components, kernel mode components, and components contributed by third parties
18. OPENCL™ AND HSA
HSA is an optimized platform architecture
for OpenCL
Not an alternative to OpenCL
OpenCL on HSA will benefit from
Avoidance of wasteful copies
Low latency dispatch
Improved memory model
Pointers shared between CPU and GPU
OpenCL 2.0 leverages HSA Features
Shared Virtual Memory
Platform Atomics
19. ADDITIONAL LANGUAGES ON HSA
In development
Language | Body | More Information
Java Sumatra | OpenJDK | http://openjdk.java.net/projects/sumatra/
LLVM | LLVM | Code generator for HSAIL
C++ AMP | Multicoreware | https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home
OpenMP, GCC | AMD, SUSE | https://gcc.gnu.org/viewcvs/gcc/branches/hsa/gcc/README.hsa?view=markup&pathrev=207425
20. SUMATRA PROJECT OVERVIEW
AMD/Oracle sponsored Open Source (OpenJDK) project
Targeted at Java 9 (2015 release)
Allows developers to efficiently represent data parallel algorithms in
Java
Sumatra ‘repurposes’ Java 8’s multi-core Stream/Lambda APIs to
enable both CPU and GPU computing
At runtime, a Sumatra-enabled Java Virtual Machine (JVM) will dispatch
‘selected’ constructs to available HSA-enabled devices
Developers of Java libraries are already refactoring their library code to
use these same constructs
So developers using existing libraries should see GPU acceleration
without any code changes
http://openjdk.java.net/projects/sumatra/
https://wikis.oracle.com/display/HotSpotInternals/Sumatra
http://mail.openjdk.java.net/pipermail/sumatra-dev/
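The Stream/Lambda constructs Sumatra repurposes look like this on a stock JVM, where they parallelize across CPU cores; a Sumatra-enabled JVM could dispatch the same lambda to an HSA device. An illustrative example, not Sumatra code:

```java
import java.util.stream.IntStream;

public class SumatraStyle {
    // A data-parallel sum of squares written with the Java 8 Stream API,
    // the kind of construct Sumatra targets. On a stock JVM this runs
    // across CPU cores via the fork/join pool; a Sumatra-enabled JVM
    // could dispatch the same lambda to an HSA-enabled GPU unchanged.
    public static long sumOfSquares(int n) {
        return IntStream.range(0, n)
                        .parallel()
                        .mapToLong(i -> (long) i * i)
                        .sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10));  // 0 + 1 + 4 + ... + 81 = 285
    }
}
```

This is why libraries refactored onto streams can pick up GPU acceleration without callers changing any code.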
[Diagram: Sumatra flow. Development: Application.java is compiled by the Java Compiler to Application.class. Runtime: the Sumatra-enabled JVM runs the application, sending Lambda/Stream API constructs through the HSA Finalizer to GPU ISA while the rest is compiled to CPU ISA]
21. HSA OPEN SOURCE SOFTWARE
HSA will feature an open source Linux execution and compilation stack
Allows a single shared implementation for many components
Enables university research and collaboration in all areas
Because it’s the right thing to do
Component Name | IHV or Common | Rationale
HSA Bolt Library | Common | Enable understanding and debug
HSAIL Code Generator | Common | Enable research
LLVM Contributions | Common | Industry and academic collaboration
HSAIL Assembler | Common | Enable understanding and debug
HSA Runtime | Common | Standardize on a single runtime
HSA Finalizer | IHV | Enable research and debug
HSA Kernel Driver | IHV | For inclusion in Linux distros
23. SUFFIX ARRAYS
Suffix Arrays are a fundamental data structure
Designed for efficient searching of a large text
Quickly locate every occurrence of a substring S in a text T
Suffix Arrays are used to accelerate in-memory cloud workloads
Full text index search
Lossless data compression
Bio-informatics
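For readers new to the data structure, a deliberately naive sketch: build the array of suffix start positions sorted by suffix text, then binary-search it to count every occurrence of a substring. All names here are mine, and production builders are far more efficient:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

public class SuffixArrayDemo {
    // Naive construction: sort all suffix start positions by suffix text.
    // Fine for a demo; real builders (including the GPU-accelerated skew
    // algorithm on the next slide) are much faster.
    public static Integer[] build(String t) {
        Integer[] sa = IntStream.range(0, t.length())
                                .boxed()
                                .toArray(Integer[]::new);
        Arrays.sort(sa, Comparator.comparing(t::substring));
        return sa;
    }

    // First index in sa whose suffix compares >= key (binary search).
    private static int lowerBound(String t, Integer[] sa, String key) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (t.substring(sa[mid]).compareTo(key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // All occurrences of s form one contiguous run of the sorted suffixes,
    // so two binary searches count them quickly once the array is built.
    public static int countOccurrences(String t, String s) {
        Integer[] sa = build(t);
        return lowerBound(t, sa, s + Character.MAX_VALUE)
             - lowerBound(t, sa, s);
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("banana", "ana"));  // 2
    }
}
```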
24. ACCELERATED SUFFIX ARRAY
CONSTRUCTION ON HSA
M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, submitted to Principles and Practice of Parallel Programming (PPoPP’13), February 2013.
AMD A10-4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM
By offloading data parallel computations to
GPU, HSA increases performance and
reduces energy for Suffix Array
Construction.
By efficiently sharing data between CPU and
GPU, HSA lets us move compute to data
without penalty of intermediate copies.
[Chart: +5.8x increased performance and 5x decreased energy for Suffix Array Construction on HSA. Skew Algorithm for Compute SA, with stages labeled Radix Sort::GPU, Lexical Rank::CPU, Radix Sort::GPU, Merge Sort::GPU, Compute SA::CPU]
26. LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT
PROGRAMMING MODELS
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
[Chart: lines of code (LOC, 0-350) and performance (0-35) for an exemplary ISV “Hessian” kernel implemented as Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt. LOC bars break down into Init, Compile, Copy, Launch, Algorithm, and Copy-back components]
27. THE HSA FUTURE
Architected heterogeneous processing on the SOC
Programming of accelerators becomes much easier
Accelerated software that runs across multiple hardware vendors
Scalability from smart phones to super computers on a common architecture
GPU acceleration of parallel processing is the initial target, with DSPs
and other accelerators coming to the HSA system architecture model
Heterogeneous software ecosystem evolves at a much faster pace
Lower power, more capable devices in your hand, on the wall, in the cloud
Editor's Notes

We will be open on this.
We will reach out to partners and collaborate to bring this to market in the right form
Lets take a deeper dive here into the details of the architecture …
The memory model for a new architecture is key.
Key Points:
Writing optimal CPU implementations requires complex development too.
Programmers have to use both intrinsics for vector parallelism, and TBB for multicore parallelism.
OpenCL C:
OpenCL-C is a widely known, fairly verbose C-based API, and it shows in the boilerplate initialization code, runtime-compile code, and kernel launch.
OpenCL C++ :
Removes initialization code by providing sensible defaults for platform, context, device, command-queue. No need to set these up, and no need to save them and drag them around for later OCL API calls.
Reduces the code to compile by using C++ exceptions for error-checking and automatic memory allocation (rather than calling an API to determine the size of return args).
Default arguments, type-checking — code focuses on relevant parameters.
The host-side support for C++ is available in a “cl.hpp” file which runs on any OpenCL implementation (including NV, Intel, etc).
In addition, the AMD OpenCL implementation supports a “static” C++ kernel language with classes, namespaces, and templates. (Not used in this implementation.)
C++ AMP
Initialization is handled through sensible defaults. C++ AMP eliminates platform and context; accelerator_view combines device and queue.
Single-source model: eliminates run-time compile code; this is done at compile time with the host code.
Single-source model: streamlined kernel call convention (eliminates clSetKernelArg).
The implementation here uses C++11 lambda to reduce boilerplate code for functor construction (kernel can directly access local vars)
Data xfer reduced by implicit transfers performed by array_view
BOLT
Moves reduction code into library — what’s left is reduction operator.
Removes data xfer and copy-back — interface is directly with host data structures.
Bolt-for-C++AMP uses lambda syntax; Bolt-For-OCL does not (not supported)
Bolt-for-OCL implementation relies on C++ static kernel language — recently introduced in AMD APP SDK 2.6 (beta) and 2.7 (production?)
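The Bolt point above, that the library owns the reduction skeleton and user code supplies only the operator, has a direct analogue in the Java stream API. An illustrative sketch, not Bolt code:

```java
import java.util.stream.IntStream;

public class LibraryReduction {
    // As with Bolt, the reduction skeleton lives in the library
    // (IntStream.reduce); the only user-supplied code is the operator.
    public static int sum(int[] data) {
        return IntStream.of(data).parallel().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3, 4}));  // prints 10
    }
}
```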
Other notes:
Serial CPU integrates algorithm and reduction, so we call it just “algorithm”; later implementations separate these for performance.
Launch is argument setup and calling of the kernel or library routine.
Copy-back includes code to copy data back to host and run a host-side final reduction step.
LOC includes appropriate spaces and comments. We attempted to use similar coding style across all implementations.
TBB init is one line to initialize the scheduler (tbb::task_scheduler_init).