This document provides an overview of the Heterogeneous System Architecture (HSA). HSA enables the CPU, GPU, and other processors to work together on a single chip by moving each task to the best-suited processor. It features unified memory access, so all processors share the same memory address space, which simplifies programming. The HSA Foundation is building an ecosystem around HSA through standards and by bringing together industry partners. HSA aims to provide a single scalable architecture for programming devices from smartphones to supercomputers.
2. OUTLINE
Introduction to HSA
Unified Memory Access
Power Management
HSA Programming Languages
Workloads
3. WHAT IS HSA?
An intelligent computing architecture that enables CPU, GPU, and other processors to work in harmony on a single piece of silicon by seamlessly moving the right tasks to the best-suited processing element.
[Diagram: the APU (Accelerated Processing Unit) runs serial workloads on the CPU and parallel workloads on the GPU over hUMA shared memory]
4. HSA EVOLUTION
Capabilities:
• Integrate CPU and GPU in silicon
• GPU can access CPU memory
• Uniform memory access for CPU and GPU
Benefits:
• Unified power efficiency
• Improved compute efficiency
• Simplified data sharing
5. STATE-OF-THE-ART HETEROGENEOUS PROCESSOR
The accelerated processing unit (APU) combines:
• Multi-threaded CPU cores
• A graphics processing unit (GPU) with 384 AMD Radeon™ cores
• Shared Northbridge access to overlapping CPU/GPU physical address spaces
Many resources are shared between the CPU and GPU, for example the memory hierarchy, power, and thermal capacity.
6. A NEW ERA OF PROCESSOR PERFORMANCE
[Chart: three eras of processor performance, each drawn as an S-curve over time, with a "we are here" marker showing today's position on each curve]

Single-Core Era (metric: single-thread performance vs. time)
• Enabled by: Moore's Law, voltage scaling
• Constrained by: power, complexity
• Programming: Assembly, C/C++, Java, ...

Multi-Core Era (metric: throughput performance vs. time, i.e. # of processors)
• Enabled by: Moore's Law, SMP architecture
• Constrained by: power, parallel SW, scalability
• Programming: pthreads, OpenMP / TBB, ...

Heterogeneous Systems Era (metric: modern application performance vs. time, i.e. data-parallel exploitation)
• Enabled by: abundant data parallelism, power-efficient GPUs
• Temporarily constrained by: programming models, communication overhead
• Programming: Shader, CUDA, OpenCL, C++ AMP, ...
7. EVOLUTION OF HETEROGENEOUS COMPUTING
[Chart: architecture maturity & programmer accessibility, rated from poor to excellent, across three eras]

Proprietary Drivers Era (2002 - 2008)
• Graphics & proprietary driver-based APIs (CUDA™, Brook+, etc.)
• "Adventurous" programmers
• Exploit early programmable "shader cores" in the GPU
• Make your program look like "graphics" to the GPU

Standards Drivers Era (2009 - 2011)
• OpenCL™, DirectCompute driver-based APIs
• Expert programmers
• C and C++ subsets
• Compute-centric APIs, data types
• Multiple address spaces with explicit data movement
• Specialized work-queue-based structures
• Kernel mode dispatch
• GPU as a co-processor

Architected Era (2012 - 2020)
• AMD Heterogeneous System Architecture: GPU as a peer processor
• Mainstream programmers
• Full C++
• Unified coherent address space
• Task-parallel runtimes
• Nested data-parallel programs
• User mode dispatch
• Pre-emption and context switching
8. HETEROGENEOUS PROCESSORS - EVERYWHERE
SMARTPHONES TO SUPER-COMPUTERS
Heterogeneous processors now span phones, tablets, notebooks, workstations, dense servers, and supercomputers. A single scalable architecture for the world's programmers is demanded at this point.
9. HOW DOES HSA MAKE THIS ALL WORK?
• Enables acceleration of languages like Java, C++ AMP, and Python
• All processors use the same addresses and can share data structures in place
• Heterogeneous computing can use all of virtual and physical memory
• Extends multicore coherency to the GPU and other processors
• Passes work quickly between the processors
• Enables quality of service
HSA FOUNDATION – BUILDING THE ECOSYSTEM
10. HSA FOUNDATION AT LAUNCH
BORN IN JUNE 2012
Founders
11. HSA FOUNDATION TODAY – DECEMBER 2013
A GROWING AND POWERFUL FAMILY
Membership tiers: Founders, Promoters, Supporters, Contributors (including ORACLE), and Universities (including the NTHU Programming Language Lab and the NTHU System Software Lab).
13. UNDERSTANDING UMA
The original meaning of UMA is Uniform Memory Access
• Refers to how processing cores in a system view and access memory
• All processing cores in a true UMA system share a single memory address space
The introduction of GPU compute created Non-Uniform Memory Access (NUMA)
• Requires data to be managed across multiple heaps with different address spaces
• Adds programming complexity due to frequent copies, synchronization, and address translation
HSA restores the GPU to Uniform Memory Access
• Heterogeneous computing replaces GPU computing
15. hUMA KEY FEATURES
• Coherent memory: hardware cache coherency ensures that CPU and GPU caches both see an up-to-date view of data
• Pageable memory: the GPU can seamlessly access virtual memory addresses that are not (yet) present in physical memory
• Entire memory space: both CPU and GPU can access and allocate any location in the system's virtual memory space
[Diagram: CPU and GPU caches kept coherent in hardware, backed by shared virtual and physical memory]
16. WITHOUT POINTERS AND DATA SHARING
Without hUMA:
• CPU explicitly copies data to GPU memory
• GPU completes computation
• CPU explicitly copies the result back to CPU memory
Only the data array can be copied, since the GPU cannot follow embedded data-structure links.
[Diagram: separate CPU memory and GPU memory, with the array copied between them]
17. WITH POINTERS AND DATA SHARING
With hUMA, the CPU can pass a pointer to the entire data structure, since the GPU can now follow embedded links. A sketch of the difference follows below.
[Diagram: a single CPU / GPU uniform memory holding the linked structure]
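To make the contrast concrete, here is a minimal, purely illustrative Java sketch. Plain Java stands in for the CPU and GPU agents; `runOnDiscreteGpu` is a hypothetical helper that mimics the copy-based model by shuttling only a flat array, while the hUMA-style path simply passes the object reference and processes the linked structure in place.

```java
import java.util.Arrays;

// Illustrative sketch only: contrasts copy-based GPU compute with
// hUMA-style pointer sharing. Helper names are invented for the sketch.
public class PointerSharing {
    // A node with an embedded link, as in a list or tree.
    static final class Node {
        int[] data;
        Node next; // embedded link a pre-hUMA GPU cannot follow
        Node(int[] data) { this.data = data; }
    }

    // Copy-based model: only the flat array crosses the boundary, so the
    // links are lost and the result must be copied back.
    static int[] runOnDiscreteGpu(Node head) {
        int[] deviceCopy = Arrays.copyOf(head.data, head.data.length); // "copy to GPU memory"
        for (int i = 0; i < deviceCopy.length; i++) deviceCopy[i] *= 2; // "GPU kernel"
        return deviceCopy;                                             // "copy result back"
    }

    // hUMA model: pass a reference; the "GPU" traverses embedded links in place.
    static void runOnHsaApu(Node head) {
        for (Node n = head; n != null; n = n.next)
            for (int i = 0; i < n.data.length; i++) n.data[i] *= 2;    // in place, no copies
    }

    public static void main(String[] args) {
        Node head = new Node(new int[]{1, 2, 3});
        head.next = new Node(new int[]{4, 5, 6});

        int[] firstNodeOnly = runOnDiscreteGpu(head);        // links were lost
        System.out.println(Arrays.toString(firstNodeOnly));  // [2, 4, 6]

        runOnHsaApu(head);                                   // whole structure, in place
        System.out.println(Arrays.toString(head.next.data)); // [8, 10, 12]
    }
}
```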
18. hUMA FEATURES
• Access to the entire memory space
• Pageable memory
• Bi-directional coherency
• Fast GPU access to system memory
• Dynamic memory allocation
20. KEY OBSERVATIONS
Applications exhibit varying degrees of CPU and GPU frequency sensitivity due to:
‒ Control divergence
‒ Interference at shared resources
‒ Performance coupling between CPU and GPU
Efficient energy management requires metrics that can predict frequency sensitivity (power) in heterogeneous processors.
Sensitivity metrics drive the coordinated setting of CPU and GPU power states.
21. STATE-OF-THE-ART: BI-DIRECTIONAL APPLICATION POWER MANAGEMENT (BAPM)
The chip is divided into BAPM-controlled thermal entities (TEs), e.g. the CU0 TE, CU1 TE, and GPU TE.
Power management algorithm:
1. Calculate a digital estimate of power consumption
2. Convert power to temperature (RC network model for heat transfer)
3. Assign new power budgets to TEs based on temperature headroom
4. TEs locally control (boost) their own DVFS states to maximize performance
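A minimal sketch of a BAPM-style control loop follows. All names, constants, and the first-order RC thermal model here are invented for illustration; the real controller lives in firmware and hardware, not application Java.

```java
// Illustrative BAPM-style control loop; constants and models are assumptions.
public class BapmSketch {
    static final double T_MAX = 100.0;           // junction limit (deg C), assumed
    static final double R_TH = 0.5, C_TH = 4.0;  // RC thermal parameters, assumed
    static final double DT = 0.001;              // control interval (s), assumed

    static final class ThermalEntity {
        double temp = 45.0;  // current temperature estimate (deg C)
        double budget;       // assigned power budget (W)
        int dvfsState = 0;   // higher = faster and more power

        // Steps 1-2: digital power estimate fed through a first-order RC model.
        void updateTemperature(double estimatedPowerW, double ambient) {
            double dTdt = (estimatedPowerW * R_TH - (temp - ambient)) / (R_TH * C_TH);
            temp += dTdt * DT;
        }
        // Step 4: locally boost or throttle DVFS against the local budget.
        void controlDvfs(double estimatedPowerW) {
            if (estimatedPowerW < budget && dvfsState < 7) dvfsState++;      // boost
            else if (estimatedPowerW > budget && dvfsState > 0) dvfsState--; // throttle
        }
    }

    // Step 3: divide the shared power among TEs in proportion to thermal headroom.
    static void assignBudgets(ThermalEntity[] tes, double totalPowerW) {
        double totalHeadroom = 0;
        for (ThermalEntity te : tes) totalHeadroom += Math.max(0, T_MAX - te.temp);
        for (ThermalEntity te : tes)
            te.budget = totalPowerW * Math.max(0, T_MAX - te.temp) / totalHeadroom;
    }

    public static void main(String[] args) {
        ThermalEntity[] tes = { new ThermalEntity(), new ThermalEntity(), new ThermalEntity() };
        double[] powerEstimates = {12.0, 8.0, 15.0}; // CU0, CU1, GPU (invented, W)
        for (int step = 0; step < 1000; step++) {
            assignBudgets(tes, 35.0);                // share an assumed 35 W chip budget
            for (int i = 0; i < tes.length; i++) {
                tes[i].updateTemperature(powerEstimates[i], 40.0);
                tes[i].controlDvfs(powerEstimates[i]);
            }
        }
        for (ThermalEntity te : tes)
            System.out.printf("T=%.1fC budget=%.1fW dvfs=%d%n", te.temp, te.budget, te.dvfsState);
    }
}
```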
22. DYNACO: RUN-TIME SYSTEM FOR COORDINATED ENERGY MANAGEMENT
DynaCo monitors computation performance metrics, computes CPU and GPU frequency sensitivity, and makes a coordinated CPU-GPU power-state decision:

  GPU frequency sensitivity   CPU frequency sensitivity   Decision
  High                        Low                         Shift power to GPU
  High                        High                        Proportional power allocation
  Low                         High                        Shift power to CPU
  Low                         Low                         Reduce power of both CPU and GPU

DynaCo is implemented as a run-time software policy overlaid on top of BAPM in real hardware.
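The decision table maps directly onto code. A tiny hedged sketch (the enum values, classification cutoff, and surrounding runtime are invented; only the four-row policy comes from the slide):

```java
// Sketch of DynaCo's coordinated power-state decision table.
public class DynaCoSketch {
    enum Sensitivity { LOW, HIGH }
    enum Decision { SHIFT_TO_GPU, PROPORTIONAL, SHIFT_TO_CPU, REDUCE_BOTH }

    static Decision decide(Sensitivity gpu, Sensitivity cpu) {
        if (gpu == Sensitivity.HIGH && cpu == Sensitivity.LOW)  return Decision.SHIFT_TO_GPU;
        if (gpu == Sensitivity.HIGH && cpu == Sensitivity.HIGH) return Decision.PROPORTIONAL;
        if (gpu == Sensitivity.LOW  && cpu == Sensitivity.HIGH) return Decision.SHIFT_TO_CPU;
        return Decision.REDUCE_BOTH;
    }

    // Classify a measured sensitivity (e.g. % performance change per % frequency
    // change) against a cutoff; the 0.5 cutoff is an assumption for the sketch.
    static Sensitivity classify(double sensitivity) {
        return sensitivity > 0.5 ? Sensitivity.HIGH : Sensitivity.LOW;
    }

    public static void main(String[] args) {
        System.out.println(decide(classify(0.8), classify(0.2))); // SHIFT_TO_GPU
    }
}
```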
24. PROGRAMMING LANGUAGES PROLIFERATING ON HSA
[Software stack diagram:]
• Applications: OpenCL™ apps, Java apps, C++ AMP apps, Python apps
• Language runtimes: OpenCL runtime, Java JVM (Sumatra), various runtimes, Fabric Engine RT
• Common interface: HSAIL (HSA Intermediate Language)
• System software: HSA helper libraries, HSA core runtime, HSA finalizer, Kernel Fusion Driver (KFD)
25. PROGRAMMING MODELS EMBRACING HSAIL AND HSA
THE RIGHT LEVEL OF ABSTRACTION
Under development:
• Java: Project Sumatra, OpenJDK 9
• OpenMP from SuSE
• C++ AMP, based on CLANG/LLVM
• Python and KL from Fabric Engine
Next:
• DSLs: Halide, Julia, Rust
• Fortran
• JavaScript
• Open Shading Language
• R
26. HSAIL
HSAIL (HSA Intermediate Language) as the SW interface:
‒ A virtual ISA for parallel programs
‒ Finalized to a native ISA by a finalizer/JIT
‒ Accommodates rapid innovation in native GPU architectures
‒ Expected to be stable and backward compatible across implementations
‒ Enables multiple hardware vendors to support HSA

Key design points and benefits for HSA compilers:
‒ Adopt a thin finalizer approach
‒ Enable fast translation time and robustness in the finalizer
‒ Drive performance optimizations through high-level compilers (HLCs)
‒ Take advantage of the strength and compilation-time budget of HLCs for aggressive optimizations

High-level compiler flow: OpenCL™ kernel → EDG or CLANG → SPIR → LLVM → HSAIL
Finalizer flow (runtime): HSAIL → HSAIL finalizer → hardware ISA

EDG – Edison Design Group; CLANG – LLVM front end; SPIR – Standard Portable Intermediate Representation
27. HSA ENABLEMENT OF JAVA

JAVA 7 – OpenCL ENABLED (APARAPI)
• AMD-initiated open source project
• APIs for data-parallel algorithms
  ‒ GPU-accelerate Java applications
  ‒ No need to learn OpenCL™
• Active community captured mindshare
  ‒ ~20 contributors
  ‒ >7000 downloads
  ‒ ~150 visits per day
• Stack: Java application → APARAPI API → OpenCL™ compiler → OpenCL™ → JVM → CPU ISA / GPU ISA

JAVA 8 – HSA ENABLED (APARAPI)
• Java 8 brings the Stream + Lambda API
  ‒ A more natural way of expressing data-parallel algorithms
  ‒ Initially targeted at multi-core
• APARAPI will:
  ‒ Support Java 8 lambdas
  ‒ Dispatch code to HSA-enabled devices at runtime via HSAIL
• Stack: Java application → APARAPI + Lambda API → HSAIL → HSA finalizer & runtime → JVM → CPU ISA / GPU ISA

JAVA 9 – HSA ENABLED JAVA (SUMATRA)
• Adds native GPU acceleration to the Java Virtual Machine (JVM)
• Developer uses the JDK Stream + Lambda APIs
• JVM uses the GRAAL compiler to generate HSAIL
• JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics
• Stack: Java application → JDK Stream + Lambda API → Java GRAAL JIT backend → HSAIL → HSA finalizer & runtime → JVM → CPU ISA / GPU ISA
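As a concrete illustration of the Stream + Lambda style that Sumatra targets, here is a plain Java 8 example. On a stock JVM this runs across CPU cores; the point of Sumatra is that a GRAAL-enabled JVM could route the same parallel stream to the GPU via HSAIL with no source changes. The SAXPY workload choice is ours, not from the slide.

```java
import java.util.stream.IntStream;

// Plain Java 8 Stream + Lambda data parallelism, the programming model
// Sumatra targets; no HSA-specific code appears in the application.
public class SaxpyStreams {
    public static void main(String[] args) {
        int n = 1_000_000;
        float a = 2.0f;
        float[] x = new float[n], y = new float[n];
        IntStream.range(0, n).forEach(i -> { x[i] = i; y[i] = 1.0f; });

        // Data-parallel loop: each index is independent, so the runtime is
        // free to execute it across CPU cores today, or a GPU under Sumatra.
        IntStream.range(0, n).parallel()
                 .forEach(i -> y[i] = a * x[i] + y[i]);

        System.out.println(y[10]); // 21.0
    }
}
```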
29. OVERVIEW OF B+ TREES
• B+ Trees are a special case of B-Trees
• A B+ Tree is a dynamic, multi-level index
• It is efficient for retrieval of data stored in a block-oriented context
• A fundamental data structure used in several popular database management systems
  ‒ SQLite
  ‒ CouchDB
• The order (b) of a B+ Tree measures the capacity of its nodes
[Diagram: example B+ Tree with keys 1-8 in its leaves, pointing to data records d1-d8]
30. HOW WE ACCELERATE
Utilize coarse-grained parallelism in B+ Tree searches:
‒ Perform many queries in parallel
‒ Increase memory bandwidth utilization with parallel reads
‒ Increase throughput (transactions per second for OLTP)
B+ Tree searches on an HSA-enabled APU:
‒ Allow much larger B+ Trees to be searched than traditional GPU compute
‒ Eliminate data copies, since CPU and GPU cores can access the same memory
A sketch of query-parallel search follows below.
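This minimal Java sketch shows the shape of the approach: one independent search per query over a shared, in-place tree. The tree layout is deliberately simplified (keys only, tiny fan-out) and parallel streams stand in for GPU work-items; this is not the benchmark code.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Minimal B+ Tree search sketch illustrating coarse-grained, query-per-work-item
// parallelism over a shared tree. Structure and sizes are invented for the sketch.
public class BPlusSearchSketch {
    static final class Node {
        int[] keys;      // sorted separator keys (internal) or stored keys (leaf)
        Node[] children; // null for leaves
        Node(int[] keys, Node[] children) { this.keys = keys; this.children = children; }
        boolean isLeaf() { return children == null; }
    }

    // Standard top-down search: follow the child whose key range covers the key.
    static boolean contains(Node root, int key) {
        Node n = root;
        while (!n.isLeaf()) {
            int i = 0;
            while (i < n.keys.length && key >= n.keys[i]) i++;
            n = n.children[i];
        }
        return Arrays.binarySearch(n.keys, key) >= 0;
    }

    public static void main(String[] args) {
        Node leafA = new Node(new int[]{1, 2}, null);
        Node leafB = new Node(new int[]{3, 4}, null);
        Node root  = new Node(new int[]{3}, new Node[]{leafA, leafB});

        int[] queries = {2, 3, 7};
        // Each query is independent: an HSA runtime could run one work-item per
        // query against the shared tree, with no data copies.
        boolean[] hits = new boolean[queries.length];
        IntStream.range(0, queries.length).parallel()
                 .forEach(q -> hits[q] = contains(root, queries[q]));
        System.out.println(Arrays.toString(hits)); // [true, true, false]
    }
}
```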
31. RESULTS
• 1M search queries in parallel
• Input B+ Tree contains 112 million keys and uses 6 GB of memory
• Hardware: AMD "Kaveri" APU with a quad-core CPU and 8 GCN compute units at 35 W TDP
• Software: OpenCL on HSA
• Baseline: 4-core OpenMP + hand-tuned SSE CPU implementation
32. REVERSE TIME MIGRATION (RTM)
• A technique for creating images from sensor data to improve seismic interpretations done by geophysicists
• RTM is run on massive data sets collected by land and marine survey crews
• A memory-intensive and highly parallel algorithm, and a natural scale-out algorithm
• Often run today on 100K-node CPU systems
• Bringing RTM to HSA- and APU-based supercomputing will increase performance for current sensor arrays, and allow more sensors and accuracy in the future
• However, speed of processing and interpretation is a critical bottleneck in making full use of acquisition assets
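At its core, RTM time-steps the wave equation over a huge grid. This tiny finite-difference sketch (1-D instead of 3-D, no reverse pass or imaging condition, invented grid size and coefficient) only illustrates why the algorithm is memory-bound: every time step streams entire wavefields with little data reuse.

```java
// Illustrative 1-D acoustic wave time-stepping, the kernel shape at the heart
// of RTM; real RTM is 3-D, runs forward and reverse passes, and applies an
// imaging condition. All constants here are invented for the sketch.
public class WaveStencilSketch {
    public static void main(String[] args) {
        int n = 1 << 20;              // grid points; real surveys use billions
        double c = 0.45;              // (v*dt/dx)^2, chosen for stability
        double[] prev = new double[n], curr = new double[n], next = new double[n];
        curr[n / 2] = 1.0;            // point source in the middle of the grid

        for (int t = 0; t < 100; t++) {
            // Each output point reads 3 inputs and writes 1: almost no reuse,
            // so throughput is bounded by memory bandwidth, not FLOPs.
            for (int i = 1; i < n - 1; i++) {
                next[i] = 2 * curr[i] - prev[i]
                        + c * (curr[i - 1] - 2 * curr[i] + curr[i + 1]);
            }
            double[] tmp = prev; prev = curr; curr = next; next = tmp; // rotate buffers
        }
        System.out.println(curr[n / 2]); // wavefield sample after 100 steps
    }
}
```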
33. TEXT ANALYTICS – HADOOP TERASORT AND BIG DATA SEARCH
MINING BIG DATA
• A multi-stage pipeline of parallel processing stages
• Traditional GPU compute is challenged by copies
• An APU with HSA accelerates each stage in place:
  ‒ Sort
  ‒ Compression
  ‒ Regular expression parsing
  ‒ CRC generation
• Acceleration of large-data search scales out across the cluster of APU nodes
[Diagram: Hadoop MapReduce dataflow from input HDFS (Hadoop Distributed File System) splits 0-2, through map, sort, copy, merge, and reduce, to output HDFS parts 0-1 with HDFS replication]
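For orientation, here is a TeraSort-shaped identity mapper skeleton using the standard Hadoop MapReduce API. It is a sketch, not the real TeraSort: in this pattern the heavy lifting happens in the framework's sort/shuffle stage, which is exactly the kind of stage an HSA APU could accelerate in place.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Identity mapper skeleton: map passes records through unchanged, and the
// MapReduce framework sorts them by key during the shuffle. Illustrative only.
public class IdentityMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value); // pass through; the framework sorts by key
    }
}
```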
37. AMD V1.3
AMD's comprehensive heterogeneous developer tool suite (version 1.3), including:
‒ CPU and GPU profiling
‒ GPU kernel debugging
‒ GPU kernel analysis
New features in version 1.3:
‒ Supports Java
‒ Integrated static kernel analysis
‒ Remote debugging/profiling
‒ Supports the latest AMD APU and GPU products
38. OPEN SOURCE LIBRARIES ACCELERATED BY AMD
• OpenCV: the most popular computer vision library, now with many OpenCL™-accelerated functions
• Bolt: a C++ template library that provides GPU offload for common data-parallel algorithms, now with cross-OS support and improved performance and functionality
• clMath: AMD released APPML as open source to create clMath; accelerated BLAS and FFT libraries, accessible from Fortran, C, and C++
• Aparapi: OpenCL™-accelerated Java 7; Java APIs for data-parallel algorithms (no need to learn OpenCL™)
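For a flavor of Aparapi's model, here is its canonical vector-add pattern, sketched from the AMD-era public API (Kernel, getGlobalId, Range under com.amd.aparapi). The project later moved, so package names and details may differ across versions.

```java
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

// Sketch of Aparapi's data-parallel style: the run() body is translated to
// OpenCL (or, with HSA, to HSAIL) at runtime; the developer writes only Java.
public class VectorAdd {
    public static void main(String[] args) {
        final int n = 1024;
        final float[] a = new float[n], b = new float[n], c = new float[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        Kernel kernel = new Kernel() {
            @Override public void run() {
                int i = getGlobalId(); // this work-item's index
                c[i] = a[i] + b[i];
            }
        };
        kernel.execute(Range.create(n)); // one work-item per element
        kernel.dispose();
        System.out.println(c[10]); // 30.0
    }
}
```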
Editor's Notes
So, here is what we will explore today
HSA will empower software developers to easily innovate and unleash new levels of performance and functionality on all your modern devices and lead to powerful new experiences such as visually rich, intuitive, human-like interactivity.
Trinity contains two dual-core x86 modules, or compute units (CUs), and Radeon™ GPU cores, along with miscellaneous other logic components such as a Northbridge and a Unified Video Decoder (UVD). Each CU is composed of two out-of-order cores that share the front end and floating-point units. In addition, each CU is paired with a 2MB L2 cache that is shared between the cores. The GPU consists of 384 Radeon™ cores, each capable of one single-precision fused multiply-add (FMAC) operation per cycle. The GPU is organized as six SIMD units, each containing sixteen processing units that are 4-way VLIW. The memory controller is shared between the CPU and the GPU.
Punch this one out.
Big emphasis on the last era.
… and now the unprecedented step … four year roadmap
http://acg.cis.upenn.edu/papers/cacm12_why_coherence.pdf.
Includes this quote:
"Continued coherence support lets programmers concentrate on what matters for parallel speedups: finding work to do in parallel with no undue communication and synchronization."
Field data is sent to a processing center (a cluster of nodes), where geophysicists process and interpret it.
Gaps in the data typically require further acquisition, either holding up crews or forcing their redeployment.
The problem is magnified due to multiple field crews depending on the same processing center
The problem cannot be trivially solved by a “truck full of discrete GPUs” solution on site, because RTM is a memory-bound problem that is better solved by APUs