GUIDE TO HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)
DIBYENDU DAS, PRAKASH RAGHAVENDRA
DEC 16TH 2013
OUTLINE
 Introduction to HSA
 Unified Memory Access
 Power Management
 HSA Programming Languages
 Workloads

INDO US HPC SUMMIT | January 2, 2014 | PUBLIC
WHAT IS HSA?
An intelligent computing architecture that enables the CPU, GPU and other processors to work in harmony on a single piece of silicon by seamlessly moving the right tasks to the best-suited processing element.

(Diagram: the APU, or Accelerated Processing Unit, runs serial workloads on the CPU and parallel workloads on the GPU, with hUMA shared memory between them.)
HSA EVOLUTION
Capabilities:
 Integrate CPU and GPU in silicon
 GPU can access CPU memory
 Uniform memory access for CPU and GPU
Benefits:
 Improved compute efficiency
 Simplified data sharing
 Unified power efficiency
STATE-OF-THE-ART HETEROGENEOUS PROCESSOR
 Accelerated processing unit (APU)
‒ Multi-threaded CPU cores
‒ Graphics processing unit (GPU): 384 AMD Radeon™ cores
 Shared Northbridge: access to overlapping CPU/GPU physical address spaces
 Many resources are shared between the CPU and GPU
‒ For example, memory hierarchy, power, and thermal capacity
A NEW ERA OF PROCESSOR PERFORMANCE
 Single-Core Era (single-thread performance over time)
‒ Enabled by: Moore’s Law, voltage scaling
‒ Constrained by: power, complexity
‒ Programming: Assembly → C/C++ → Java …
 Multi-Core Era (throughput performance over # of processors)
‒ Enabled by: Moore’s Law, SMP architecture
‒ Constrained by: power, parallel SW, scalability
‒ Programming: pthreads → OpenMP / TBB …
 Heterogeneous Systems Era (modern application performance over data-parallel exploitation; we are here)
‒ Enabled by: abundant data parallelism, power-efficient GPUs
‒ Temporarily constrained by: programming models, communication overhead
‒ Programming: Shader → CUDA, OpenCL, C++ AMP …
EVOLUTION OF HETEROGENEOUS COMPUTING
(Architecture maturity & programmer accessibility: Poor → Excellent)
 Proprietary Drivers Era (2002-2008): graphics & proprietary driver-based APIs (CUDA™, Brook+, etc.)
‒ “Adventurous” programmers
‒ Exploit early programmable “shader cores” in the GPU
‒ Make your program look like “graphics” to the GPU
 Standards Drivers Era (2009-2011): OpenCL™, DirectCompute driver-based APIs
‒ Expert programmers
‒ C and C++ subsets; compute-centric APIs and data types
‒ Multiple address spaces with explicit data movement
‒ Specialized work-queue based structures; kernel-mode dispatch
 Architected Era (2012-2020): AMD Heterogeneous System Architecture, GPU as a peer processor
‒ Mainstream programmers; full C++
‒ Unified coherent address space
‒ Task-parallel runtimes; nested data-parallel programs
‒ User-mode dispatch; pre-emption and context switching
HETEROGENEOUS PROCESSORS - EVERYWHERE
SMARTPHONES TO SUPER-COMPUTERS
Phone, tablet, notebook, workstation, dense server, supercomputer: a single scalable architecture for the world’s programmers is now demanded.
HOW DOES HSA MAKE THIS ALL WORK?
 Enables acceleration of languages like Java, C++ AMP and Python
 All processors use the same addresses and can share data structures in place
 Heterogeneous computing can use all of virtual and physical memory
 Extends multicore coherency to the GPU and other processors
 Passes work quickly between the processors
 Enables quality of service
HSA FOUNDATION - BUILDING THE ECOSYSTEM
HSA FOUNDATION AT LAUNCH
BORN IN JUNE 2012
Founders
HSA FOUNDATION TODAY - DECEMBER 2013
A GROWING AND POWERFUL FAMILY
Founders, Promoters, Supporters, Contributors (including ORACLE), and Universities (including the NTHU Programming Language Lab and NTHU System Software Lab)
Unified Memory Access
UNDERSTANDING UMA
 The original meaning of UMA is Uniform Memory Access
‒ Refers to how processing cores in a system view and access memory
‒ All processing cores in a true UMA system share a single memory address space
 The introduction of GPU compute created Non-Uniform Memory Access (NUMA)
‒ Requires data to be managed across multiple heaps with different address spaces
‒ Adds programming complexity due to frequent copies, synchronization, and address translation
 HSA restores the GPU to Uniform Memory Access
‒ Heterogeneous computing replaces GPU computing
INTRODUCING hUMA
(Diagram: under UMA, four CPU cores share one memory. Under NUMA, an APU’s CPU cores use CPU memory while its GPU cores use separate GPU memory. Under hUMA, an APU with HSA gives all CPU and GPU cores one shared memory.)
hUMA KEY FEATURES
 Coherent memory: HW cache coherency ensures that CPU and GPU caches both see an up-to-date view of data
 Pageable memory: the GPU can seamlessly access virtual memory addresses that are not (yet) present in physical memory
 Entire memory space: both CPU and GPU can access and allocate any location in the system’s virtual memory space
WITHOUT POINTERS AND DATA SHARING
Without hUMA:
 CPU explicitly copies data to GPU memory
 GPU completes computation
 CPU explicitly copies the result back to CPU memory
Only the data array can be copied, since the GPU cannot follow embedded data-structure links.
WITH POINTERS AND DATA SHARING
With hUMA, the CPU can pass a pointer to an entire data structure, since the GPU can now follow embedded links through the uniform CPU/GPU memory.
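The contrast on these two slides can be sketched in a toy host-side simulation. This is not a real GPU API; it only mimics the two offload models: without unified memory, the "device" can see only a flattened copy, while with unified memory it traverses the CPU's linked structure in place.

```python
# Toy illustration (not a real GPU API) of the two offload models described
# on the slides: copy-based offload must flatten linked structures, while a
# shared address space lets the "device" follow the CPU's references.

class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

def build_list(values):
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head

def offload_without_huma(head):
    # The GPU cannot chase host pointers, so the CPU flattens the
    # structure into a plain array and copies that to "GPU memory".
    flat = []
    node = head
    while node:
        flat.append(node.value)
        node = node.next
    device_copy = list(flat)      # explicit copy across memory spaces
    return sum(device_copy)       # device computes on the copy

def offload_with_huma(head):
    # With one shared address space, the device traverses the original
    # structure in place: no flattening, no copies.
    total, node = 0, head
    while node:
        total += node.value
        node = node.next
    return total

lst = build_list([1, 2, 3, 4])
assert offload_without_huma(lst) == offload_with_huma(lst) == 10
```

Both paths compute the same result; the difference is the flatten-and-copy step that hUMA removes.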
hUMA FEATURES
 Access to the entire memory space
 Pageable memory
 Bi-directional coherency
 Fast GPU access to system memory
 Dynamic memory allocation
Power Management
KEY OBSERVATIONS
 Applications exhibit varying degrees of CPU and GPU frequency sensitivity due to:
‒ Control divergence
‒ Interference at shared resources
‒ Performance coupling between CPU and GPU
 Efficient energy management requires metrics that can predict frequency (power) sensitivity in heterogeneous processors
 Sensitivity metrics drive the coordinated setting of CPU and GPU power states
STATE-OF-THE-ART: BI-DIRECTIONAL APPLICATION POWER MANAGEMENT (BAPM)
The chip is divided into BAPM-controlled thermal entities (TEs): CU0 TE, CU1 TE, GPU TE.
 Power management algorithm:
1. Calculate a digital estimate of power consumption
2. Convert power to temperature using an RC network model for heat transfer
3. Assign new power budgets to the TEs based on temperature headroom
4. Each TE locally controls (boosts) its own DVFS states to maximize performance
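The loop above can be sketched numerically. All constants and the headroom-proportional budget rule below are hypothetical stand-ins chosen for illustration; the slide only states that an RC model converts power to temperature and that budgets follow temperature headroom.

```python
# Minimal sketch of the BAPM loop: a first-order RC thermal model converts an
# entity's estimated power into temperature, and power budgets are reassigned
# in proportion to each entity's temperature headroom. All numbers are
# hypothetical, for illustration only.

T_AMBIENT = 45.0   # deg C (assumed)
T_MAX = 95.0       # junction limit, deg C (assumed)
R_TH = 0.8         # thermal resistance, deg C per W (assumed)

def rc_temperature(t_prev, power, dt=0.01, c_th=0.05):
    # First-order RC step: T moves toward its steady state T_amb + P*R
    # with time constant R*C, discretized with step dt.
    t_steady = T_AMBIENT + power * R_TH
    return t_prev + (t_steady - t_prev) * dt / (R_TH * c_th)

def assign_budgets(temps, total_budget):
    # Give each thermal entity a share proportional to its headroom.
    headroom = [max(T_MAX - t, 0.0) for t in temps]
    total = sum(headroom) or 1.0
    return [total_budget * h / total for h in headroom]

temps = [70.0, 80.0, 90.0]             # CU0, CU1, GPU (hypothetical)
budgets = assign_budgets(temps, 35.0)  # 35 W TDP, as on the "Kaveri" slide
assert abs(sum(budgets) - 35.0) < 1e-9
assert budgets[0] > budgets[2]         # cooler entity gets the larger budget
```

The real algorithm uses calibrated digital power estimates and a multi-node RC network; this sketch only shows the shape of the control loop.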
DYNACO: RUN-TIME SYSTEM FOR COORDINATED ENERGY MANAGEMENT
Performance-metric monitoring feeds a CPU-GPU frequency-sensitivity computation, which drives the CPU-GPU power-state decision:

GPU sensitivity | CPU sensitivity | Decision
High            | Low             | Shift power to GPU
High            | High            | Proportional power allocation
Low             | High            | Shift power to CPU
Low             | Low             | Reduce power of both CPU and GPU

 DynaCo is implemented as a run-time software policy overlaid on top of BAPM in real hardware
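The DynaCo decision table maps directly onto a lookup. Classifying a sensitivity as "high" is the hard part in practice; the threshold below is a hypothetical stand-in for the real metric.

```python
# The DynaCo decision table written as a lookup. The threshold that splits
# "high" from "low" sensitivity is hypothetical, for illustration only.

def dynaco_decision(gpu_sensitivity, cpu_sensitivity, threshold=0.5):
    gpu_high = gpu_sensitivity >= threshold
    cpu_high = cpu_sensitivity >= threshold
    table = {
        (True, False): "shift power to GPU",
        (True, True): "proportional power allocation",
        (False, True): "shift power to CPU",
        (False, False): "reduce power of both CPU and GPU",
    }
    return table[(gpu_high, cpu_high)]

assert dynaco_decision(0.9, 0.1) == "shift power to GPU"
assert dynaco_decision(0.1, 0.9) == "shift power to CPU"
```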
Programming Languages
PROGRAMMING LANGUAGES PROLIFERATING ON HSA
Language stack, top to bottom:
 Apps: OpenCL™ app, Java app, C++ AMP app, Python app
 Runtimes: OpenCL runtime, Java JVM (Sumatra), various runtimes, Fabric Engine RT
 HSAIL (HSA Intermediate Language)
 HSA helper libraries, HSA core runtime, Kernel Fusion Driver (KFD), HSA finalizer
PROGRAMMING MODELS EMBRACING HSAIL AND HSA
THE RIGHT LEVEL OF ABSTRACTION
 Under development: Java (Project Sumatra, OpenJDK 9), OpenMP from SuSE, C++ AMP based on CLANG/LLVM, Python and KL from Fabric Engine
 Next: DSLs (Halide, Julia, Rust), Fortran, JavaScript, Open Shading Language, R
HSAIL
 HSAIL (HSA Intermediate Language) is the SW interface
‒ A virtual ISA for parallel programs
‒ Finalized to a native ISA by a finalizer/JIT
‒ Accommodates rapid innovation in native GPU architectures
‒ Expected to be stable and backward compatible across implementations
‒ Enables multiple hardware vendors to support HSA
 High-level compiler flow: OpenCL™ kernel → EDG or CLANG → SPIR → LLVM → HSAIL
 Finalizer flow (runtime): HSAIL → finalizer → hardware ISA
 Key design points and benefits for HSA compilers
‒ Adopt a thin finalizer approach, enabling fast translation time and robustness in the finalizer
‒ Drive performance optimizations through high-level compilers (HLCs)
‒ Take advantage of the strength and compilation-time budget in HLCs for aggressive optimizations
(EDG: Edison Design Group; CLANG: LLVM front end; SPIR: Standard Portable Intermediate Representation)
HSA ENABLEMENT OF JAVA
 Java 7 - OpenCL-enabled Aparapi
‒ AMD-initiated open source project
‒ APIs for data-parallel algorithms: GPU-accelerate Java applications with no need to learn OpenCL™
‒ An active community captured mindshare: ~20 contributors, >7000 downloads, ~150 visits per day
‒ Stack: Java application → Aparapi API → OpenCL™ → OpenCL™ compiler → CPU ISA / GPU ISA
 Java 8 - HSA-enabled Aparapi
‒ Java 8 brings the Stream + Lambda API, a more natural way of expressing data-parallel algorithms, initially targeted at multi-core
‒ Aparapi will support Java 8 lambdas and dispatch code to HSA-enabled devices at runtime via HSAIL
‒ Stack: Java application → Aparapi + Lambda API → HSAIL → HSA finalizer & runtime → CPU/GPU
 Java 9 - HSA-enabled Java (Sumatra)
‒ Adds native GPU acceleration to the Java Virtual Machine (JVM)
‒ The developer uses the JDK Lambda and Stream APIs
‒ The JVM uses the GRAAL compiler to generate HSAIL
‒ The JVM decides at runtime whether to execute on the CPU or GPU depending on workload characteristics
Workloads
OVERVIEW OF B+ TREES
 B+ Trees are a special case of B Trees
 A B+ Tree:
‒ is a dynamic, multi-level index
‒ is efficient for retrieval of data stored in a block-oriented context
 A fundamental data structure used in several popular database management systems, e.g. SQLite and CouchDB
 The order (b) of a B+ Tree measures the capacity of its nodes
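A minimal search sketch matching the slide's description: internal nodes hold separator keys, all values live in leaves, and every lookup walks one root-to-leaf path. The tree below is built by hand to keep the sketch short (no insert or split logic).

```python
# B+ tree search sketch: separator keys in internal nodes, payloads in
# leaves, lookups descend one root-to-leaf path.

import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted keys (separators if internal)
        self.children = children  # child nodes, or None for a leaf
        self.values = values      # payloads, present only in leaves

def search(node, key):
    while node.children is not None:          # descend to the leaf level
        node = node.children[bisect.bisect_right(node.keys, key)]
    i = bisect.bisect_left(node.keys, key)    # then scan the leaf
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

# Hand-built tree over keys 1..6 with values d1..d6, echoing the slide's figure
leaf1 = Node([1, 2], values=["d1", "d2"])
leaf2 = Node([3, 4], values=["d3", "d4"])
leaf3 = Node([5, 6], values=["d5", "d6"])
root = Node([3, 5], children=[leaf1, leaf2, leaf3])

assert search(root, 4) == "d4"
assert search(root, 7) is None
```

Because each step narrows the search to one child, a lookup touches only O(log_b n) nodes, which is what makes the structure block-friendly.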
HOW WE ACCELERATE
 Utilize coarse-grained parallelism in B+ Tree searches
‒ Perform many queries in parallel
‒ Increase memory bandwidth utilization with parallel reads
‒ Increase throughput (transactions per second for OLTP)
 B+ Tree searches on an HSA-enabled APU
‒ Allow much larger B+ Trees to be searched than traditional GPU compute
‒ Eliminate data copies, since CPU and GPU cores can access the same memory
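The coarse-grained parallelism above is just a batch of independent lookups. In this sketch a sorted key array with `bisect` stands in for the B+ tree, and a thread pool stands in for dispatching the batch to GPU compute units; only the batching pattern is the point.

```python
# Batch of independent index lookups issued in parallel. The sorted array
# is a stand-in for the B+ tree; ThreadPoolExecutor is a host-side stand-in
# for GPU dispatch.

import bisect
from concurrent.futures import ThreadPoolExecutor

keys = list(range(0, 1000, 2))            # toy index: even keys 0..998
vals = [f"d{k}" for k in keys]

def lookup(q):
    i = bisect.bisect_left(keys, q)
    return vals[i] if i < len(keys) and keys[i] == q else None

queries = [10, 11, 500, 999]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lookup, queries))

assert results == ["d10", None, "d500", None]
```

Each query is read-only and touches disjoint state, so the batch parallelizes with no synchronization, which is exactly why many in-flight reads raise memory bandwidth utilization.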
RESULTS
 1M search queries in parallel
 Input B+ Tree contains 112 million keys and uses 6 GB of memory
 Hardware: AMD “Kaveri” APU with a quad-core CPU and 8 GCN compute units at 35 W TDP
 Software: OpenCL on HSA
 Baseline: 4-core OpenMP + hand-tuned SSE CPU implementation
REVERSE TIME MIGRATION (RTM)
 A technique for creating images from sensor data (gathered by marine and land crews) to improve seismic interpretations done by geophysicists
 A memory-intensive and highly parallel algorithm, run on massive data sets
 A natural scale-out algorithm, often run today on 100K-node CPU systems
 Bringing RTM to HSA- and APU-based supercomputing will increase performance for current sensor arrays, and allow more sensors and accuracy in the future
However, speed of processing and interpretation is a critical bottleneck in making full use of acquisition assets.
TEXT ANALYTICS - HADOOP TERASORT AND BIG DATA SEARCH
MINING BIG DATA
 A multi-stage pipeline of parallel processing stages: input HDFS (Hadoop Distributed File System) → splits → map → sort/copy/merge → reduce → output HDFS with replication
 Traditional GPU compute is challenged by copies
 An APU with HSA accelerates each stage in place:
‒ Sort
‒ Compression
‒ Regular expression parsing
‒ CRC generation
 Acceleration of large data search scales out across the cluster of APU nodes
DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.
BACKUP
Programming Tools
AMD V1.3
 AMD’s comprehensive heterogeneous developer tool suite, including:
‒ CPU and GPU profiling
‒ GPU kernel debugging
‒ GPU kernel analysis
 New features in version 1.3:
‒ Supports Java
‒ Integrated static kernel analysis
‒ Remote debugging/profiling
‒ Supports the latest AMD APU and GPU products
OPEN SOURCE LIBRARIES ACCELERATED BY AMD
 OpenCV
‒ Most popular computer vision library
‒ Now with many OpenCL™-accelerated functions
 Bolt
‒ C++ template library providing GPU offload for common data-parallel algorithms
‒ Now with cross-OS support and improved performance/functionality
 clMath
‒ AMD released APPML as open source to create clMath
‒ Accelerated BLAS and FFT libraries, accessible from Fortran, C and C++
 Aparapi
‒ OpenCL™-accelerated Java 7
‒ Java APIs for data-parallel algorithms (no need to learn OpenCL™)
