This document provides an overview of the Heterogeneous System Architecture (HSA). HSA enables the CPU, GPU, and other processors to work together on a single chip by moving each task to the best-suited processor. It features unified memory access, so all processors share the same memory address space, which simplifies programming. The HSA Foundation is building an ecosystem around HSA through standards and by bringing together industry partners. HSA aims to provide a single scalable architecture for programming devices from smartphones to supercomputers.
2. OUTLINE
Introduction to HSA
Unified Memory Access
Power Management
HSA Programming Languages
Workloads
3. WHAT IS HSA?
An intelligent computing architecture that enables CPU, GPU, and other processors to work in harmony on a single piece of silicon by seamlessly moving the right tasks to the best-suited processing element.
[Diagram: the APU (Accelerated Processing Unit) runs serial workloads on the CPU and parallel workloads on the GPU over hUMA shared memory]
4. HSA EVOLUTION
Capabilities:
• Integrate CPU and GPU in silicon
• GPU can access CPU memory
• Uniform memory access for CPU and GPU
Benefits:
• Unified power efficiency
• Improved compute efficiency
• Simplified data sharing
5. STATE-OF-THE-ART HETEROGENEOUS PROCESSOR
The accelerated processing unit (APU) combines:
• Multi-threaded CPU cores
• A graphics processing unit (GPU) with 384 AMD Radeon™ cores
• Shared Northbridge access to overlapping CPU/GPU physical address spaces
Many resources are shared between the CPU and GPU, for example the memory hierarchy, power, and thermal capacity.
6. A NEW ERA OF PROCESSOR PERFORMANCE
[Chart: three eras of processor performance, each drawn as an S-curve over time, with a "we are here" marker showing today's position on each curve]

Single-Core Era (metric: single-thread performance vs. time)
• Enabled by: Moore's Law, voltage scaling
• Constrained by: power, complexity
• Programming: Assembly, C/C++, Java, ...

Multi-Core Era (metric: throughput performance vs. time, i.e. # of processors)
• Enabled by: Moore's Law, SMP architecture
• Constrained by: power, parallel SW, scalability
• Programming: pthreads, OpenMP / TBB, ...

Heterogeneous Systems Era (metric: modern application performance vs. time, i.e. data-parallel exploitation)
• Enabled by: abundant data parallelism, power-efficient GPUs
• Temporarily constrained by: programming models, communication overhead
• Programming: Shader, CUDA, OpenCL, C++ AMP, ...
7. EVOLUTION OF HETEROGENEOUS COMPUTING
[Chart: architecture maturity & programmer accessibility, rated from poor to excellent, across three eras]

Proprietary Drivers Era (2002 - 2008)
• Graphics & proprietary driver-based APIs (CUDA™, Brook+, etc.)
• "Adventurous" programmers
• Exploit early programmable "shader cores" in the GPU
• Make your program look like "graphics" to the GPU

Standards Drivers Era (2009 - 2011)
• OpenCL™, DirectCompute driver-based APIs
• Expert programmers
• C and C++ subsets
• Compute-centric APIs, data types
• Multiple address spaces with explicit data movement
• Specialized work-queue-based structures
• Kernel mode dispatch
• GPU as a co-processor

Architected Era (2012 - 2020)
• AMD Heterogeneous System Architecture: GPU as a peer processor
• Mainstream programmers
• Full C++
• Unified coherent address space
• Task-parallel runtimes
• Nested data-parallel programs
• User mode dispatch
• Pre-emption and context switching
8. HETEROGENEOUS PROCESSORS - EVERYWHERE
SMARTPHONES TO SUPER-COMPUTERS
Heterogeneous processors now span phones, tablets, notebooks, workstations, dense servers, and supercomputers. A single scalable architecture for the world's programmers is demanded at this point.
9. HOW DOES HSA MAKE THIS ALL WORK?
• Enables acceleration of languages like Java, C++ AMP, and Python
• All processors use the same addresses and can share data structures in place
• Heterogeneous computing can use all of virtual and physical memory
• Extends multicore coherency to the GPU and other processors
• Passes work quickly between the processors
• Enables quality of service
HSA FOUNDATION – BUILDING THE ECOSYSTEM
10. HSA FOUNDATION AT LAUNCH
BORN IN JUNE 2012
Founders
11. HSA FOUNDATION TODAY – DECEMBER 2013
A GROWING AND POWERFUL FAMILY
Membership tiers: Founders, Promoters, Supporters, Contributors (including ORACLE), and Universities (including the NTHU Programming Language Lab and the NTHU System Software Lab).
13. UNDERSTANDING UMA
The original meaning of UMA is Uniform Memory Access
• Refers to how processing cores in a system view and access memory
• All processing cores in a true UMA system share a single memory address space
The introduction of GPU compute created Non-Uniform Memory Access (NUMA)
• Requires data to be managed across multiple heaps with different address spaces
• Adds programming complexity due to frequent copies, synchronization, and address translation
HSA restores the GPU to Uniform Memory Access
• Heterogeneous computing replaces GPU computing
15. hUMA KEY FEATURES
• Coherent memory: hardware cache coherency ensures that CPU and GPU caches both see an up-to-date view of data
• Pageable memory: the GPU can seamlessly access virtual memory addresses that are not (yet) present in physical memory
• Entire memory space: both CPU and GPU can access and allocate any location in the system's virtual memory space
[Diagram: CPU and GPU caches kept coherent in hardware, backed by shared virtual and physical memory]
16. WITHOUT POINTERS AND DATA SHARING
Without hUMA:
• CPU explicitly copies data to GPU memory
• GPU completes computation
• CPU explicitly copies the result back to CPU memory
Only the data array can be copied, since the GPU cannot follow embedded data-structure links.
[Diagram: separate CPU memory and GPU memory, with the array copied between them]
17. WITH POINTERS AND DATA SHARING
With hUMA, the CPU can pass a pointer to the entire data structure, since the GPU can now follow embedded links. A sketch of the difference follows below.
[Diagram: a single CPU / GPU uniform memory holding the linked structure]
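To make the contrast concrete, here is a minimal, purely illustrative Java sketch. Plain Java stands in for the CPU and GPU agents; `runOnDiscreteGpu` is a hypothetical helper that mimics the copy-based model by shuttling only a flat array, while the hUMA-style path simply passes the object reference and processes the linked structure in place.

```java
import java.util.Arrays;

// Illustrative sketch only: contrasts copy-based GPU compute with
// hUMA-style pointer sharing. Helper names are invented for the sketch.
public class PointerSharing {
    // A node with an embedded link, as in a list or tree.
    static final class Node {
        int[] data;
        Node next; // embedded link a pre-hUMA GPU cannot follow
        Node(int[] data) { this.data = data; }
    }

    // Copy-based model: only the flat array crosses the boundary, so the
    // links are lost and the result must be copied back.
    static int[] runOnDiscreteGpu(Node head) {
        int[] deviceCopy = Arrays.copyOf(head.data, head.data.length); // "copy to GPU memory"
        for (int i = 0; i < deviceCopy.length; i++) deviceCopy[i] *= 2; // "GPU kernel"
        return deviceCopy;                                             // "copy result back"
    }

    // hUMA model: pass a reference; the "GPU" traverses embedded links in place.
    static void runOnHsaApu(Node head) {
        for (Node n = head; n != null; n = n.next)
            for (int i = 0; i < n.data.length; i++) n.data[i] *= 2;    // in place, no copies
    }

    public static void main(String[] args) {
        Node head = new Node(new int[]{1, 2, 3});
        head.next = new Node(new int[]{4, 5, 6});

        int[] firstNodeOnly = runOnDiscreteGpu(head);        // links were lost
        System.out.println(Arrays.toString(firstNodeOnly));  // [2, 4, 6]

        runOnHsaApu(head);                                   // whole structure, in place
        System.out.println(Arrays.toString(head.next.data)); // [8, 10, 12]
    }
}
```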
18. hUMA FEATURES
• Access to the entire memory space
• Pageable memory
• Bi-directional coherency
• Fast GPU access to system memory
• Dynamic memory allocation
20. KEY OBSERVATIONS
Applications exhibit varying degrees of CPU and GPU frequency sensitivity due to:
‒ Control divergence
‒ Interference at shared resources
‒ Performance coupling between CPU and GPU
Efficient energy management requires metrics that can predict frequency sensitivity (power) in heterogeneous processors.
Sensitivity metrics drive the coordinated setting of CPU and GPU power states.
21. STATE-OF-THE-ART: BI-DIRECTIONAL APPLICATION POWER MANAGEMENT (BAPM)
The chip is divided into BAPM-controlled thermal entities (TEs), e.g. the CU0 TE, CU1 TE, and GPU TE.
Power management algorithm:
1. Calculate a digital estimate of power consumption
2. Convert power to temperature (RC network model for heat transfer)
3. Assign new power budgets to TEs based on temperature headroom
4. TEs locally control (boost) their own DVFS states to maximize performance
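A minimal sketch of a BAPM-style control loop follows. All names, constants, and the first-order RC thermal model here are invented for illustration; the real controller lives in firmware and hardware, not application Java.

```java
// Illustrative BAPM-style control loop; constants and models are assumptions.
public class BapmSketch {
    static final double T_MAX = 100.0;           // junction limit (deg C), assumed
    static final double R_TH = 0.5, C_TH = 4.0;  // RC thermal parameters, assumed
    static final double DT = 0.001;              // control interval (s), assumed

    static final class ThermalEntity {
        double temp = 45.0;  // current temperature estimate (deg C)
        double budget;       // assigned power budget (W)
        int dvfsState = 0;   // higher = faster and more power

        // Steps 1-2: digital power estimate fed through a first-order RC model.
        void updateTemperature(double estimatedPowerW, double ambient) {
            double dTdt = (estimatedPowerW * R_TH - (temp - ambient)) / (R_TH * C_TH);
            temp += dTdt * DT;
        }
        // Step 4: locally boost or throttle DVFS against the local budget.
        void controlDvfs(double estimatedPowerW) {
            if (estimatedPowerW < budget && dvfsState < 7) dvfsState++;      // boost
            else if (estimatedPowerW > budget && dvfsState > 0) dvfsState--; // throttle
        }
    }

    // Step 3: divide the shared power among TEs in proportion to thermal headroom.
    static void assignBudgets(ThermalEntity[] tes, double totalPowerW) {
        double totalHeadroom = 0;
        for (ThermalEntity te : tes) totalHeadroom += Math.max(0, T_MAX - te.temp);
        for (ThermalEntity te : tes)
            te.budget = totalPowerW * Math.max(0, T_MAX - te.temp) / totalHeadroom;
    }

    public static void main(String[] args) {
        ThermalEntity[] tes = { new ThermalEntity(), new ThermalEntity(), new ThermalEntity() };
        double[] powerEstimates = {12.0, 8.0, 15.0}; // CU0, CU1, GPU (invented, W)
        for (int step = 0; step < 1000; step++) {
            assignBudgets(tes, 35.0);                // share an assumed 35 W chip budget
            for (int i = 0; i < tes.length; i++) {
                tes[i].updateTemperature(powerEstimates[i], 40.0);
                tes[i].controlDvfs(powerEstimates[i]);
            }
        }
        for (ThermalEntity te : tes)
            System.out.printf("T=%.1fC budget=%.1fW dvfs=%d%n", te.temp, te.budget, te.dvfsState);
    }
}
```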
22. DYNACO: RUN-TIME SYSTEM FOR COORDINATED ENERGY MANAGEMENT
DynaCo monitors computation performance metrics, computes CPU and GPU frequency sensitivity, and makes a coordinated CPU-GPU power-state decision:

  GPU frequency sensitivity   CPU frequency sensitivity   Decision
  High                        Low                         Shift power to GPU
  High                        High                        Proportional power allocation
  Low                         High                        Shift power to CPU
  Low                         Low                         Reduce power of both CPU and GPU

DynaCo is implemented as a run-time software policy overlaid on top of BAPM in real hardware.
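The decision table maps directly onto code. A tiny hedged sketch (the enum values, classification cutoff, and surrounding runtime are invented; only the four-row policy comes from the slide):

```java
// Sketch of DynaCo's coordinated power-state decision table.
public class DynaCoSketch {
    enum Sensitivity { LOW, HIGH }
    enum Decision { SHIFT_TO_GPU, PROPORTIONAL, SHIFT_TO_CPU, REDUCE_BOTH }

    static Decision decide(Sensitivity gpu, Sensitivity cpu) {
        if (gpu == Sensitivity.HIGH && cpu == Sensitivity.LOW)  return Decision.SHIFT_TO_GPU;
        if (gpu == Sensitivity.HIGH && cpu == Sensitivity.HIGH) return Decision.PROPORTIONAL;
        if (gpu == Sensitivity.LOW  && cpu == Sensitivity.HIGH) return Decision.SHIFT_TO_CPU;
        return Decision.REDUCE_BOTH;
    }

    // Classify a measured sensitivity (e.g. % performance change per % frequency
    // change) against a cutoff; the 0.5 cutoff is an assumption for the sketch.
    static Sensitivity classify(double sensitivity) {
        return sensitivity > 0.5 ? Sensitivity.HIGH : Sensitivity.LOW;
    }

    public static void main(String[] args) {
        System.out.println(decide(classify(0.8), classify(0.2))); // SHIFT_TO_GPU
    }
}
```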
24. PROGRAMMING LANGUAGES PROLIFERATING ON HSA
[Software stack diagram:]
• Applications: OpenCL™ apps, Java apps, C++ AMP apps, Python apps
• Language runtimes: OpenCL runtime, Java JVM (Sumatra), various runtimes, Fabric Engine RT
• Common interface: HSAIL (HSA Intermediate Language)
• System software: HSA helper libraries, HSA core runtime, HSA finalizer, Kernel Fusion Driver (KFD)
25. PROGRAMMING MODELS EMBRACING HSAIL AND HSA
THE RIGHT LEVEL OF ABSTRACTION
Under development:
• Java: Project Sumatra, OpenJDK 9
• OpenMP from SuSE
• C++ AMP, based on CLANG/LLVM
• Python and KL from Fabric Engine
Next:
• DSLs: Halide, Julia, Rust
• Fortran
• JavaScript
• Open Shading Language
• R
26. HSAIL
HSAIL (HSA Intermediate Language) as the SW interface:
‒ A virtual ISA for parallel programs
‒ Finalized to a native ISA by a finalizer/JIT
‒ Accommodates rapid innovation in native GPU architectures
‒ Expected to be stable and backward compatible across implementations
‒ Enables multiple hardware vendors to support HSA

Key design points and benefits for HSA compilers:
‒ Adopt a thin finalizer approach
‒ Enable fast translation time and robustness in the finalizer
‒ Drive performance optimizations through high-level compilers (HLCs)
‒ Take advantage of the strength and compilation-time budget of HLCs for aggressive optimizations

High-level compiler flow: OpenCL™ kernel → EDG or CLANG → SPIR → LLVM → HSAIL
Finalizer flow (runtime): HSAIL → HSAIL finalizer → hardware ISA

EDG – Edison Design Group; CLANG – LLVM front end; SPIR – Standard Portable Intermediate Representation
27. HSA ENABLEMENT OF JAVA

JAVA 7 – OpenCL ENABLED (APARAPI)
• AMD-initiated open source project
• APIs for data-parallel algorithms
  ‒ GPU-accelerate Java applications
  ‒ No need to learn OpenCL™
• Active community captured mindshare
  ‒ ~20 contributors
  ‒ >7000 downloads
  ‒ ~150 visits per day
• Stack: Java application → APARAPI API → OpenCL™ compiler → OpenCL™ → JVM → CPU ISA / GPU ISA

JAVA 8 – HSA ENABLED (APARAPI)
• Java 8 brings the Stream + Lambda API
  ‒ A more natural way of expressing data-parallel algorithms
  ‒ Initially targeted at multi-core
• APARAPI will:
  ‒ Support Java 8 lambdas
  ‒ Dispatch code to HSA-enabled devices at runtime via HSAIL
• Stack: Java application → APARAPI + Lambda API → HSAIL → HSA finalizer & runtime → JVM → CPU ISA / GPU ISA

JAVA 9 – HSA ENABLED JAVA (SUMATRA)
• Adds native GPU acceleration to the Java Virtual Machine (JVM)
• Developer uses the JDK Stream + Lambda APIs
• JVM uses the GRAAL compiler to generate HSAIL
• JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics
• Stack: Java application → JDK Stream + Lambda API → Java GRAAL JIT backend → HSAIL → HSA finalizer & runtime → JVM → CPU ISA / GPU ISA
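As a concrete illustration of the Stream + Lambda style that Sumatra targets, here is a plain Java 8 example. On a stock JVM this runs across CPU cores; the point of Sumatra is that a GRAAL-enabled JVM could route the same parallel stream to the GPU via HSAIL with no source changes. The SAXPY workload choice is ours, not from the slide.

```java
import java.util.stream.IntStream;

// Plain Java 8 Stream + Lambda data parallelism, the programming model
// Sumatra targets; no HSA-specific code appears in the application.
public class SaxpyStreams {
    public static void main(String[] args) {
        int n = 1_000_000;
        float a = 2.0f;
        float[] x = new float[n], y = new float[n];
        IntStream.range(0, n).forEach(i -> { x[i] = i; y[i] = 1.0f; });

        // Data-parallel loop: each index is independent, so the runtime is
        // free to execute it across CPU cores today, or a GPU under Sumatra.
        IntStream.range(0, n).parallel()
                 .forEach(i -> y[i] = a * x[i] + y[i]);

        System.out.println(y[10]); // 21.0
    }
}
```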
29. OVERVIEW OF B+ TREES
• B+ Trees are a special case of B-Trees
• A B+ Tree is a dynamic, multi-level index
• It is efficient for retrieval of data stored in a block-oriented context
• A fundamental data structure used in several popular database management systems
  ‒ SQLite
  ‒ CouchDB
• The order (b) of a B+ Tree measures the capacity of its nodes
[Diagram: example B+ Tree with keys 1-8 in its leaves, pointing to data records d1-d8]
30. HOW WE ACCELERATE
Utilize coarse-grained parallelism in B+ Tree searches:
‒ Perform many queries in parallel
‒ Increase memory bandwidth utilization with parallel reads
‒ Increase throughput (transactions per second for OLTP)
B+ Tree searches on an HSA-enabled APU:
‒ Allow much larger B+ Trees to be searched than traditional GPU compute
‒ Eliminate data copies, since CPU and GPU cores can access the same memory
A sketch of query-parallel search follows below.
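This minimal Java sketch shows the shape of the approach: one independent search per query over a shared, in-place tree. The tree layout is deliberately simplified (keys only, tiny fan-out) and parallel streams stand in for GPU work-items; this is not the benchmark code.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Minimal B+ Tree search sketch illustrating coarse-grained, query-per-work-item
// parallelism over a shared tree. Structure and sizes are invented for the sketch.
public class BPlusSearchSketch {
    static final class Node {
        int[] keys;      // sorted separator keys (internal) or stored keys (leaf)
        Node[] children; // null for leaves
        Node(int[] keys, Node[] children) { this.keys = keys; this.children = children; }
        boolean isLeaf() { return children == null; }
    }

    // Standard top-down search: follow the child whose key range covers the key.
    static boolean contains(Node root, int key) {
        Node n = root;
        while (!n.isLeaf()) {
            int i = 0;
            while (i < n.keys.length && key >= n.keys[i]) i++;
            n = n.children[i];
        }
        return Arrays.binarySearch(n.keys, key) >= 0;
    }

    public static void main(String[] args) {
        Node leafA = new Node(new int[]{1, 2}, null);
        Node leafB = new Node(new int[]{3, 4}, null);
        Node root  = new Node(new int[]{3}, new Node[]{leafA, leafB});

        int[] queries = {2, 3, 7};
        // Each query is independent: an HSA runtime could run one work-item per
        // query against the shared tree, with no data copies.
        boolean[] hits = new boolean[queries.length];
        IntStream.range(0, queries.length).parallel()
                 .forEach(q -> hits[q] = contains(root, queries[q]));
        System.out.println(Arrays.toString(hits)); // [true, true, false]
    }
}
```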
31. RESULTS
• 1M search queries in parallel
• Input B+ Tree contains 112 million keys and uses 6 GB of memory
• Hardware: AMD "Kaveri" APU with a quad-core CPU and 8 GCN compute units at 35 W TDP
• Software: OpenCL on HSA
• Baseline: 4-core OpenMP + hand-tuned SSE CPU implementation
32. REVERSE TIME MIGRATION (RTM)
• A technique for creating images from sensor data to improve seismic interpretations done by geophysicists
• RTM is run on massive data sets collected by land and marine survey crews
• A memory-intensive and highly parallel algorithm, and a natural scale-out algorithm
• Often run today on 100K-node CPU systems
• Bringing RTM to HSA- and APU-based supercomputing will increase performance for current sensor arrays, and allow more sensors and accuracy in the future
• However, speed of processing and interpretation is a critical bottleneck in making full use of acquisition assets
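At its core, RTM time-steps the wave equation over a huge grid. This tiny finite-difference sketch (1-D instead of 3-D, no reverse pass or imaging condition, invented grid size and coefficient) only illustrates why the algorithm is memory-bound: every time step streams entire wavefields with little data reuse.

```java
// Illustrative 1-D acoustic wave time-stepping, the kernel shape at the heart
// of RTM; real RTM is 3-D, runs forward and reverse passes, and applies an
// imaging condition. All constants here are invented for the sketch.
public class WaveStencilSketch {
    public static void main(String[] args) {
        int n = 1 << 20;              // grid points; real surveys use billions
        double c = 0.45;              // (v*dt/dx)^2, chosen for stability
        double[] prev = new double[n], curr = new double[n], next = new double[n];
        curr[n / 2] = 1.0;            // point source in the middle of the grid

        for (int t = 0; t < 100; t++) {
            // Each output point reads 3 inputs and writes 1: almost no reuse,
            // so throughput is bounded by memory bandwidth, not FLOPs.
            for (int i = 1; i < n - 1; i++) {
                next[i] = 2 * curr[i] - prev[i]
                        + c * (curr[i - 1] - 2 * curr[i] + curr[i + 1]);
            }
            double[] tmp = prev; prev = curr; curr = next; next = tmp; // rotate buffers
        }
        System.out.println(curr[n / 2]); // wavefield sample after 100 steps
    }
}
```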
33. TEXT ANALYTICS – HADOOP TERASORT AND BIG DATA SEARCH
MINING BIG DATA
• A multi-stage pipeline of parallel processing stages
• Traditional GPU compute is challenged by copies
• An APU with HSA accelerates each stage in place:
  ‒ Sort
  ‒ Compression
  ‒ Regular expression parsing
  ‒ CRC generation
• Acceleration of large-data search scales out across the cluster of APU nodes
[Diagram: Hadoop MapReduce dataflow from input HDFS (Hadoop Distributed File System) splits 0-2, through map, sort, copy, merge, and reduce, to output HDFS parts 0-1 with HDFS replication]
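For orientation, here is a TeraSort-shaped identity mapper skeleton using the standard Hadoop MapReduce API. It is a sketch, not the real TeraSort: in this pattern the heavy lifting happens in the framework's sort/shuffle stage, which is exactly the kind of stage an HSA APU could accelerate in place.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Identity mapper skeleton: map passes records through unchanged, and the
// MapReduce framework sorts them by key during the shuffle. Illustrative only.
public class IdentityMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value); // pass through; the framework sorts by key
    }
}
```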
37. AMD V1.3
AMD's comprehensive heterogeneous developer tool suite (version 1.3), including:
‒ CPU and GPU profiling
‒ GPU kernel debugging
‒ GPU kernel analysis
New features in version 1.3:
‒ Supports Java
‒ Integrated static kernel analysis
‒ Remote debugging/profiling
‒ Supports the latest AMD APU and GPU products
38. OPEN SOURCE LIBRARIES ACCELERATED BY AMD
• OpenCV: the most popular computer vision library, now with many OpenCL™-accelerated functions
• Bolt: a C++ template library that provides GPU offload for common data-parallel algorithms, now with cross-OS support and improved performance and functionality
• clMath: AMD released APPML as open source to create clMath; accelerated BLAS and FFT libraries, accessible from Fortran, C, and C++
• Aparapi: OpenCL™-accelerated Java 7; Java APIs for data-parallel algorithms (no need to learn OpenCL™)
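For a flavor of Aparapi's model, here is its canonical vector-add pattern, sketched from the AMD-era public API (Kernel, getGlobalId, Range under com.amd.aparapi). The project later moved, so package names and details may differ across versions.

```java
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

// Sketch of Aparapi's data-parallel style: the run() body is translated to
// OpenCL (or, with HSA, to HSAIL) at runtime; the developer writes only Java.
public class VectorAdd {
    public static void main(String[] args) {
        final int n = 1024;
        final float[] a = new float[n], b = new float[n], c = new float[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        Kernel kernel = new Kernel() {
            @Override public void run() {
                int i = getGlobalId(); // this work-item's index
                c[i] = a[i] + b[i];
            }
        };
        kernel.execute(Range.create(n)); // one work-item per element
        kernel.dispose();
        System.out.println(c[10]); // 30.0
    }
}
```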
Editor's Notes
So, here is what we will explore today
HSA will empower software developers to easily innovate and unleash new levels of performance and functionality on all your modern devices and lead to powerful new experiences such as visually rich, intuitive, human-like interactivity.
Trinity contains two dual-core x86 modules, or compute units (CUs), and Radeon™ GPU cores, along with miscellaneous other logic components such as a Northbridge and a Unified Video Decoder (UVD). Each CU is composed of two out-of-order cores that share the front end and floating-point units. In addition, each CU is paired with a 2MB L2 cache that is shared between the cores. The GPU consists of 384 Radeon™ cores, each capable of one single-precision fused multiply-add (FMAC) operation per cycle. The GPU is organized as six SIMD units, each containing sixteen processing units that are 4-way VLIW. The memory controller is shared between the CPU and the GPU.
Punch this one out.
Big emphasis on the last era.
… and now the unprecedented step … four year roadmap
http://acg.cis.upenn.edu/papers/cacm12_why_coherence.pdf.
Includes this quote:
"Continued coherence support lets programmers concentrate on what matters for parallel speedups: finding work to do in parallel with no undue communication and synchronization."
Field data is sent to a processing center (a cluster of nodes), where geophysicists process and interpret it.
Gaps in the data typically require further acquisition, either holding up crews or forcing their redeployment.
The problem is magnified due to multiple field crews depending on the same processing center
The problem cannot be trivially solved by a “truck full of discrete GPUs” solution on site, because RTM is a memory-bound problem that is better solved by APUs