Heterogeneous System Architecture Overview

VINOD TIPPARAJU
PRINCIPAL MEMBER OF TECHNICAL STAFF,
AMD AUSTIN.

INTRODUCTION AND OVERVIEW
TERMINOLOGY, WHAT MAKES HSA, ORIGINS AND
EVOLUTION, USAGE SCENARIOS

SOME TERMINOLOGY
•  HSA is Heterogeneous System Architecture, not just GPUs
•  HSA Component: IP that satisfies the architecture requirements and provides the identified features
•  SoC: System on Chip, a collection of various IPs
   •  E.g. the AMD APU (Accelerated Processing Unit) integrates AMD/ARM CPU cores and graphics IP
   •  It is possible to conceive of companies building just parts of the IP
•  HSAIL: HSA Intermediate Language, a very low-level SIMT language
•  HSA Agent: something that can participate in the HSA memory subsystem (i.e. respects page sizes, memory properties, atomics, etc.)


WHAT IS HSA?
Systems Architecture
•  The system architecture requirements, from a hardware point of view
•  Specifies shared memory, cache coherence domains, concept of clocks, context switching, memory based signaling, topology
•  Rules governing design and agent behavior

Programmers Reference (HSAIL)
•  An intermediate representation, very low level
•  Vendor independence, device compiler optimizations
•  Abstracts HW, or can serve as the lowest level instruction set

RUNTIME
•  API that wraps features like user mode queues, clocks, signalling, etc.
•  Provides execution control
•  Supports tools

TOOLS
•  Supporting profilers, debuggers and compilers
•  Unique debugging support that greatly simplifies implementing debuggers
•  Excellent profiling support with some user mode access

HSA ORIGINS, EVOLUTION IN COMPUTE
•  Next step from AMD in general purpose compute
•  Evolutionary step
   •  Exceptional graphics IP
   •  Lot of experience in building general purpose CPUs
   •  Natural to utilize graphics IP for doing compute
•  Prior step was HW integration phase
   •  GPU was pre-GCN (Graphics Core Next)
   •  Did not have all features to support HSA
   •  Memory management unit was still evolving


TAKING THE HW INTEGRATION TO ITS
NATURAL CONCLUSION
•  Architectural and system integration
•  Extend architecture to make the component a first class citizen on the SoC
   •  Fully-evolved MMU
   •  Provide same level of support for tools as CPU
   •  Provide context switching, preemption, full coherence
   •  Helps simulators, migrations, checkpoints, etc.
•  Future, other HSA IP


SOCS HAVE PROLIFERATED —
MAKE THEM BETTER
•  SOCs have arrived and are a tremendous advance over previous platforms
•  SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory
•  How do we make them even better?
   •  Easier to program
   •  Easier to optimize
   •  Higher performance
   •  Lower power
•  HSA unites accelerators architecturally
•  Early focus is APU (CPU with GPU compute accelerator), but HSA goes well beyond the GPU


HIGH LEVEL USAGE SCENARIOS
•  Bulk-Synchronous-Parallelism-like concurrent computation
   •  Rather large parallel sections followed by synchronization
•  Outstanding support for task-based parallelism
   •  256 threads sufficient to fully fill the pipeline
   •  Wavefront is 64 threads
   •  Launch is quick
   •  Support for execution schedules: excellent compiler target
   •  Architected Queueing Language (AQL), dependencies
•  Advanced language support
   •  Function calls
   •  Virtual functions
   •  Exception handling (throw-catch)


HSA FOUNDATION
•  Founded in June 2012
•  Developing a new platform for heterogeneous systems
•  www.hsafoundation.com
•  Specifications under development in working groups
•  Our first specification, HSA Programmers Reference Manual, is already published and available on our web site
•  Additional specifications for System Architecture, Runtime Software and Tools are in process


HSA FOUNDATION MEMBERSHIP —
AUGUST 2013
[Figure: member logos grouped by membership level: Founders, Promoters, Supporters, Contributors, Academic, Associates]


HSA — AN OPEN PLATFORM
•  Open Architecture, membership open to all
   •  HSA Programmers Reference Manual
   •  HSA System Architecture
   •  HSA Runtime
•  Delivered via royalty free standards
   •  Royalty Free IP, Specifications and APIs
•  ISA agnostic for both CPU and GPU
•  Membership from all areas of computing
   •  Hardware Companies
   •  Operating Systems
   •  Tools and Middleware


HSA MEMORY MODEL
•  Defines visibility ordering between all threads in the HSA System
•  Designed to be compatible with C++11, Java, OpenCL and .NET Memory Models
•  Relaxed consistency memory model for parallel compute performance
•  Visibility controlled by:
   •  Load.Acquire
   •  Store.Release
   •  Barriers


HSA QUEUING MODEL
•  User mode queuing for low latency dispatch
   •  Application dispatches directly
   •  No OS or driver in the dispatch path
•  Architected Queuing Language (AQL)
   •  Single compute dispatch path for all hardware
   •  No driver translation, direct to hardware
•  Allows for dispatch to queue from any agent
   •  CPU or GPU
•  GPU self enqueue enables lots of solutions
   •  Recursion
   •  Tree traversal
   •  Wavefront reforming
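
The tree traversal case can be illustrated with a loose sketch. It assumes a plain Python `deque` standing in for an HSA queue and an ordinary function standing in for a kernel. The name `traverse_with_self_enqueue` and the dictionary tree layout are hypothetical and are not part of any HSA API. The point is only that the host enqueues the root once, and every follow-up packet is enqueued by the work itself.

```python
from collections import deque


def traverse_with_self_enqueue(tree, root):
    """Visit every node reachable from root, breadth first.

    tree maps a node to a list of its children (hypothetical layout).
    """
    queue = deque([root])  # the host enqueues only the root packet
    visited = []
    while queue:
        node = queue.popleft()  # the device picks up the next packet
        visited.append(node)
        # Self enqueue: the running work item adds its children to the
        # same queue instead of returning to the host for a new dispatch.
        queue.extend(tree.get(node, ()))
    return visited


example_tree = {"a": ["b", "c"], "b": ["d"]}
order = traverse_with_self_enqueue(example_tree, "a")
```

On real hardware the enqueue would be a write into a shared queue in memory; the sketch only shows the control flow.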


HSAIL


HSA INTERMEDIATE LAYER — HSAIL
•  HSAIL is a virtual ISA for parallel programs
   •  Finalized to ISA by a JIT compiler or “Finalizer”
   •  ISA independent by design for CPU & GPU
•  Explicitly parallel
   •  Designed for data parallel programming
•  Support for exceptions, virtual functions, and other high level language features
•  Lower level than OpenCL SPIR
   •  Fits naturally in the OpenCL compilation stack
•  Suitable to support additional high level languages and programming models:
   •  Java, C++, OpenMP, Fortran etc


WHAT IS HSAIL?
•  HSAIL is the intermediate language for parallel compute in HSA
   •  Generated by a high level compiler (LLVM, gcc, Java VM, etc)
   •  Low-level IR, close to machine ISA level
   •  Compiled down to target ISA by an IHV “Finalizer”
   •  Finalizer may execute at run time, install time, or build time
•  Example: OpenCL™ Compilation Stack using HSAIL
   •  High-Level Compiler Flow (Developer): OpenCL™ Kernel -> EDG or CLANG -> SPIR -> LLVM -> HSAIL
   •  Finalizer Flow (Runtime): HSAIL -> Finalizer -> Hardware ISA


KEY HSAIL FEATURES
•  Parallel
•  Shared virtual memory
•  Portable across vendors in HSA Foundation
•  Stable across multiple product generations
•  Consistent numerical results (IEEE-754 with defined min accuracy)
•  Fast, robust, simple finalization step (no monthly updates)
•  Good performance (little need to write in ISA)
•  Supports all of OpenCL™ and C++ AMP™
•  Support Java, C++, and other languages as well


SIMT EXECUTION MODEL
•  HSAIL presents a “SIMT” execution model to the programmer
   •  “Single Instruction, Multiple Thread”
   •  Programmer writes program for a single thread of execution
   •  Each work-item appears to have its own program counter
   •  Branch instructions look natural
•  Hardware Implementation
   •  Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
   •  Actually one program counter for the entire SIMD instruction
   •  Branches implemented with predication
•  SIMT Advantages
   •  Easier to program (branch code in particular)
   •  Natural path for mainstream programming models
   •  Scales across a wide variety of hardware (programmer doesn’t see vector width)
   •  Cross-lane operations available for those who want peak performance
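
To make "branches implemented with predication" concrete, here is a minimal simulation written for this overview. It is not HSAIL and not any vendor's ISA. The helper names `masked_apply` and `run_branch` are hypothetical. The per-thread program is `if v is even: v // 2 else: 3 * v + 1`; the simulated hardware issues both sides of the branch for the whole vector and uses a per-lane mask to decide which lanes commit a result.

```python
def masked_apply(dst, src, mask, op):
    """Issue one vector instruction.

    Every lane is visited, but only lanes whose mask bit is set
    write a result into dst.
    """
    for lane, active in enumerate(mask):
        if active:
            dst[lane] = op(src[lane])


def run_branch(values):
    """Run the per-thread if/else over all lanes using predication."""
    mask = [v % 2 == 0 for v in values]  # one predicate bit per lane
    out = list(values)
    # "then" side: issued once for the whole vector
    masked_apply(out, values, mask, lambda v: v // 2)
    # "else" side: issued once with the mask inverted
    masked_apply(out, values, [not m for m in mask], lambda v: 3 * v + 1)
    return out


result = run_branch([1, 2, 3, 4])
```

Both sides of the branch cost an instruction issue even when only a few lanes take them, which is why divergent branches reduce SIMD efficiency while still looking like ordinary branches to the programmer.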


OPPORTUNITIES WITH LLVM BASED
COMPILATION

[Diagram: languages and frameworks shown around CLANG and LLVM: C99, C++ 11, C++AMP, Objective C, OpenCL, OpenMP, KL, OSL, Renderscript, UPC, Halide, Rust, Julia, Mono, Fortran, Haskell]

ARCHITECTURE DETAILS –
WALK THROUGH OF FEATURES
AND BENEFITS

HIGH LEVEL FEATURES OF HSA
•  Features currently being defined in the HSA Working Groups**
   •  Unified addressing across all processors
   •  Operation into pageable system memory
   •  Full memory coherency
   •  User mode dispatch
   •  Architected queuing language
   •  High level language support for GPU compute processors
   •  Preemption and context switching

** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups

STATE OF GPU COMPUTING
•  GPUs are fast and power efficient: high compute density per mm and per watt
•  But: can be hard to program

Today’s Challenges
•  Separate address spaces (CPU and GPU connected over PCIe)
   •  Copies
   •  Can’t share pointers
•  New language required for compute kernel
   •  EX: OpenCL™ runtime API
   •  Compute kernel compiled separately from host code

Emerging Solution
•  HSA Hardware
   •  Single address space
   •  Coherent
   •  Virtual address space
   •  Fast access from all components
   •  Can share pointers
•  Bring GPU computing to existing, popular programming models
   •  Single-source, fully supported by compiler
   •  HSAIL compiler IR (Cross-platform!)

MOTIVATION (TODAY’S PICTURE)
[Diagram: job flow across three columns (Application, OS, GPU): Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory]


SHARED VIRTUAL MEMORY (TODAY)
•  Multiple virtual memory address spaces

[Diagram: CPU0 uses VIRTUAL MEMORY1 (VA1->PA1) and the GPU uses VIRTUAL MEMORY2 (VA2->PA1); both map onto the same PHYSICAL MEMORY]

SHARED VIRTUAL MEMORY (HSA)
•  Common Virtual Memory for all HSA agents

[Diagram: CPU0 and the GPU share one VIRTUAL MEMORY (VA->PA for both), mapped onto PHYSICAL MEMORY]

SHARED VIRTUAL MEMORY
•  Advantages
   •  No mapping tricks, no copying back-and-forth between different PA addresses
   •  Send pointers (not data) back and forth between HSA agents.
•  Implications
   •  Common Page Tables (and common interpretation of architectural semantics such as shareability, protection, etc).
   •  Common mechanisms for address translation (and servicing address translation faults)
   •  Concept of a process address space ID (PASID) to allow multiple, per process virtual address spaces within the system.
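
A toy model may help show how common page tables and a PASID fit together. Everything in it is an assumption made for illustration: the 4 KiB page size, the class names `SharedPageTables` and `Agent`, and the dictionary standing in for physical memory. None of it is taken from the HSA specifications. Two agents consult the same per-PASID table, so a pointer handed from one to the other resolves to the same physical location.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class SharedPageTables:
    """Page tables keyed by PASID and consulted by every agent."""

    def __init__(self):
        self._tables = {}

    def map_page(self, pasid, virtual_page, physical_page):
        self._tables.setdefault(pasid, {})[virtual_page] = physical_page

    def translate(self, pasid, va):
        page, offset = divmod(va, PAGE_SIZE)
        table = self._tables.get(pasid, {})
        if page not in table:
            # Stands in for an address translation fault.
            raise LookupError(f"translation fault: PASID {pasid}, VA {va:#x}")
        return table[page] * PAGE_SIZE + offset


class Agent:
    """A CPU or GPU that translates through the shared tables."""

    def __init__(self, name, page_tables):
        self.name = name
        self.page_tables = page_tables

    def load(self, memory, pasid, va):
        return memory[self.page_tables.translate(pasid, va)]


tables = SharedPageTables()
tables.map_page(pasid=7, virtual_page=0x10, physical_page=0x2)
memory = {0x2 * PAGE_SIZE + 8: "payload"}  # toy physical memory
cpu = Agent("cpu0", tables)
gpu = Agent("gpu", tables)
pointer = 0x10 * PAGE_SIZE + 8  # the only thing passed between agents
```

Here `cpu.load(memory, 7, pointer)` and `gpu.load(memory, 7, pointer)` return the same value, while the same pointer under a different PASID raises a fault because that process has no such mapping.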


GETTING THERE …
[Diagram: the same Application / OS / GPU job flow as in MOTIVATION (TODAY’S PICTURE)]


SHARED VIRTUAL MEMORY
•  Specifics
   •  Minimum supported VA width is 48 bits for 64-bit systems, and 32 bits for 32-bit systems.
   •  HSA agents may reserve VA ranges for internal use via system software.
   •  All HSA agents other than the host unit must use the lowest privilege level.
   •  If present, read/write access flags for page tables must be maintained by all agents.
   •  Read/write permissions apply to all HSA agents equally.


CACHE COHERENCY DOMAINS (1/2)

•  Data accesses to the global memory segment from all HSA agents shall be coherent without the need for explicit cache maintenance.


CACHE COHERENCY DOMAINS (2/2)
•  Advantages
   •  Composability
   •  Reduced SW complexity when communicating between agents
   •  Lower barrier to entry when porting software
•  Implications
   •  Hardware coherency support between all HSA agents
   •  Can take many forms
      •  Stand alone Snoop Filters / Directories
      •  Combined L3/Filters
      •  Snoop-based systems (no filter)
      •  Etc …


GETTING CLOSER …
[Diagram: the same Application / OS / GPU job flow as in MOTIVATION (TODAY’S PICTURE)]


SIGNALING (1/2)
•  HSA agents support the ability to use signaling objects
   •  All creation and destruction of signaling objects occurs via HSA runtime APIs
      •  Object creation/destruction
   •  An HSA agent can access signaling objects directly
      •  Signal a signal object (this wakes up HSA agents waiting on the object)
      •  Query the current object
      •  Wait on the current object (various conditions supported)
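
The signal, query and wait operations can be mimicked with a small host-side sketch. It assumes Python's standard `threading.Condition` as the wake-up mechanism. The class `Signal` and its methods `store`, `load` and `wait_until` are hypothetical names chosen for this illustration and are not the HSA runtime API.

```python
import threading


class Signal:
    """A value that agents can set, query, and wait on."""

    def __init__(self, initial=0):
        self._value = initial
        self._cond = threading.Condition()

    def load(self):
        """Query the current value."""
        with self._cond:
            return self._value

    def store(self, value):
        """Signal the object and wake up every waiter."""
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def wait_until(self, predicate, timeout=None):
        """Block until predicate(value) holds; return False on timeout."""
        with self._cond:
            return self._cond.wait_for(
                lambda: predicate(self._value), timeout
            )


def offload(done, results):
    """Stand-in for work executed by another agent."""
    results.append(sum(range(10)))
    done.store(1)  # completion is announced through the signal


done = Signal(0)
results = []
worker = threading.Thread(target=offload, args=(done, results))
worker.start()
completed = done.wait_until(lambda v: v == 1, timeout=5.0)
worker.join()
```

The predicate argument plays the role of the "various conditions" a waiter can specify; the sketch goes through the operating system's thread primitives, whereas the slide describes signaling between agents that does not involve the kernel.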


SIGNALING (2/2)
•  Advantages
   •  Enables asynchronous interrupts between HSA agents, without involving the kernel
   •  Common idiom for work offload
   •  Low power waiting
•  Implications
   •  Runtime support required
   •  Commonly implemented on top of cache coherency flows


ALMOST THERE…
[Diagram: the same Application / OS / GPU job flow as in MOTIVATION (TODAY’S PICTURE)]


USER MODE QUEUEING (1/3)
•  User mode Queueing
   •  Enables user space applications to directly, without OS intervention, enqueue jobs (“Dispatch Packets”) for HSA agents.
   •  Dispatch packet is a job of work
   •  Support for multiple queues per PASID
   •  Multiple threads/agents within a PASID may enqueue Packets in the same Queue.
   •  Dependency mechanisms created for ensuring ordering between packets.
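
As a rough illustration of enqueueing through memory writes alone, the sketch below models a queue as a ring buffer with a write index, a read index, and a doorbell value. The layout, the power-of-two size rule, and the names `DispatchPacket`, `UserModeQueue`, `enqueue` and `process` are assumptions made for this example and are not the architected packet format. It is single-producer and single-threaded; a real multi-producer queue would need atomic index updates.

```python
from collections import namedtuple

# Hypothetical packet: a callable standing in for a kernel, plus arguments.
DispatchPacket = namedtuple("DispatchPacket", ["kernel", "args"])


class UserModeQueue:
    """Ring buffer written by the application and drained by an agent."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.size = size
        self.ring = [None] * size
        self.write_index = 0
        self.read_index = 0
        self.doorbell = -1  # index of the last packet made visible

    def enqueue(self, packet):
        """Producer side: plain memory writes, no system call."""
        if self.write_index - self.read_index == self.size:
            return False  # queue is full
        self.ring[self.write_index & (self.size - 1)] = packet
        self.doorbell = self.write_index  # "ring the doorbell"
        self.write_index += 1
        return True

    def process(self):
        """Consumer side: run every packet published via the doorbell."""
        outputs = []
        while self.read_index <= self.doorbell:
            packet = self.ring[self.read_index & (self.size - 1)]
            outputs.append(packet.kernel(*packet.args))
            self.read_index += 1
        return outputs


queue = UserModeQueue(4)
accepted = [
    queue.enqueue(DispatchPacket(pow, (2, n))) for n in range(5)
]
outputs = queue.process()
```

The fifth enqueue is rejected because the ring holds only four packets; once `process` has drained the queue, the slots can be reused.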


USER MODE QUEUEING (2/3)
•  Advantages
   •  Avoid involving the kernel/driver when dispatching work for an Agent.
   •  Lower latency job dispatch enables finer granularity of offload
   •  Standard memory protection mechanisms may be used to protect communication with the consuming agent.
•  Implications
   •  Packet formats/fields are Architected: standard across vendors!
      •  Guaranteed backward compatibility
   •  Packets are enqueued/dequeued via an Architected protocol (all via memory accesses and signalling)
   •  More on this later…


SUCCESS!
[Diagram: the same Application / OS / GPU job flow as in MOTIVATION (TODAY’S PICTURE)]


SUCCESS!
[Diagram: job flow across three columns (Application, OS, GPU), reduced to three steps: Queue Job, Start Job, Finish Job]


ACCELERATING SUFFIX ARRAY
CONSTRUCTION
CLOUD SERVER WORKLOAD

SUFFIX ARRAYS
•  Suffix Arrays are a fundamental data structure
   •  Designed for efficient searching of a large text
   •  Quickly locate every occurrence of a substring S in a text T
•  Suffix Arrays are used to accelerate in-memory cloud workloads
   •  Full text index search
   •  Lossless data compression
   •  Bio-informatics
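
For readers unfamiliar with the data structure, a minimal sketch follows. It builds the suffix array by naively sorting all suffixes, which is far slower than the skew algorithm used in the accelerated implementation, and then locates every occurrence of a substring S in a text T with two binary searches. The function names `build_suffix_array` and `find_occurrences` are chosen for this illustration only.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in sorted order.

    Naive construction for illustration; not the skew algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern in text."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


text = "banana"
sa = build_suffix_array(text)
hits = find_occurrences(text, sa, "ana")
```

All suffixes that begin with the pattern sit next to each other in the sorted order, so a search costs a logarithmic number of string comparisons once the array has been built.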


ACCELERATED SUFFIX ARRAY
CONSTRUCTION ON HSA
By efficiently sharing data between CPU and
GPU, HSA lets us move compute to data
without penalty of intermediate copies.

By offloading data parallel computations to
GPU, HSA increases performance and
reduces energy for Suffix Array Construction
versus Single Threaded CPU.

[Chart: Skew Algorithm for Compute SA, with stages Radix Sort::GPU, Lexical Rank::CPU, Compute SA::CPU, Radix Sort::GPU, Merge Sort::GPU]
INCREASED PERFORMANCE: +5.8x
DECREASED ENERGY: -5x

M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, submitted to “Principles and Practice of Parallel Programming (PPoPP’13)”, February 2013.
AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4 GB RAM


THE HSA FUTURE
Architected heterogeneous processing on the SOC
Programming of accelerators becomes much easier
Accelerated software that runs across multiple hardware vendors
Scalability from smart phones to super computers on a common architecture
GPU acceleration of parallel processing is the initial target, with DSPs
and other accelerators coming to the HSA system architecture model
Heterogeneous software ecosystem evolves at a much faster pace
Lower power, more capable devices in your hand, on the wall, in the cloud or at
your supercomputing center.

