Guide to Heterogeneous System Architecture (HSA)
Speaker notes

  • So, here is what we will explore today.
  • HSA will empower software developers to innovate easily and unleash new levels of performance and functionality on all modern devices, leading to powerful new experiences such as visually rich, intuitive, human-like interactivity.
  • Trinity contains two dual-core x86 modules, or compute units (CUs), and Radeon™ GPU cores, along with miscellaneous other logic components such as a Northbridge and a Unified Video Decoder (UVD). Each CU is composed of two out-of-order cores that share the front end and floating-point units. In addition, each CU is paired with a 2 MB L2 cache that is shared between the cores. The GPU consists of 384 Radeon™ cores, each capable of one single-precision fused multiply-add (FMAC) operation per cycle. The GPU is organized as six SIMD units, each containing sixteen processing units that are 4-way VLIW. The memory controller is shared between the CPU and the GPU.
  • Punch this one out. Big emphasis on the last era. … and now the unprecedented step … four-year roadmap.
  • http://acg.cis.upenn.edu/papers/cacm12_why_coherence.pdf. Includes this quote: "Continued coherence support lets programmers concentrate on what matters for parallel speedups: finding work to do in parallel with no undue communication and synchronization."
  • Field data is sent to a center (a cluster of nodes) for processing and interpretation by geophysicists. Gaps in the data typically require further data acquisition, either holding up crews or causing redeployment. The problem is magnified because multiple field crews depend on the same processing center. It cannot be trivially solved by a "truck full of discrete GPUs" on site, because RTM is a memory-bound problem that is better solved by APUs.

Presentation Transcript

  • GUIDE TO HETEROGENEOUS SYSTEM ARCHITECTURE (HSA) Dibyendu Das and Prakash Raghavendra, Dec 16th 2013
  • OUTLINE
    ‒ Introduction to HSA
    ‒ Unified Memory Access
    ‒ Power Management
    ‒ HSA Programming Languages
    ‒ Workloads
  • WHAT IS HSA? An intelligent computing architecture that enables CPU, GPU, and other processors to work in harmony on a single piece of silicon by seamlessly moving the right tasks to the best-suited processing element. (Slide diagram: an APU, or Accelerated Processing Unit, running serial and parallel workloads over shared hUMA memory.)
  • HSA EVOLUTION
    ‒ Capabilities: integrate CPU and GPU in silicon; GPU can access CPU memory; uniform memory access for CPU and GPU
    ‒ Benefits: improved compute efficiency; simplified data sharing; unified power efficiency
  • STATE-OF-THE-ART HETEROGENEOUS PROCESSOR The accelerated processing unit (APU) combines multi-threaded CPU cores with a graphics processing unit (GPU) of 384 AMD Radeon™ cores. A shared Northbridge gives access to overlapping CPU/GPU physical address spaces. Many resources are shared between the CPU and GPU – for example, the memory hierarchy, power, and thermal capacity.
  • A NEW ERA OF PROCESSOR PERFORMANCE (Slide chart: three eras, each with a "we are here" marker.)
    ‒ Single-Core Era: single-thread performance over time. Enabled by Moore's Law and voltage scaling; constrained by power and complexity. Tools: Assembly  C/C++  Java …
    ‒ Multi-Core Era: throughput performance over time (# of processors). Enabled by Moore's Law and SMP architecture; constrained by power, parallel SW, and scalability. Tools: pthreads  OpenMP / TBB …
    ‒ Heterogeneous Systems Era: modern application performance over time (data-parallel exploitation). Enabled by abundant data parallelism and power-efficient GPUs; temporarily constrained by programming models and communication overhead. Tools: Shader  CUDA  OpenCL  C++ AMP …
  • EVOLUTION OF HETEROGENEOUS COMPUTING (Slide chart: architecture maturity and programmer accessibility, rising from poor to excellent.)
    ‒ Proprietary Drivers Era (2002 - 2008): graphics and proprietary driver-based APIs. "Adventurous" programmers exploit early programmable "shader cores" in the GPU and make their programs look like "graphics" to it. Accessibility: poor.
    ‒ Standards Drivers Era (2009 - 2011): driver-based APIs such as OpenCL™, DirectCompute, CUDA™, and Brook+. Expert programmers; C and C++ subsets; compute-centric APIs and data types; multiple address spaces with explicit data movement; specialized work-queue-based structures; kernel-mode dispatch.
    ‒ Architected Era (2012 - 2020): AMD Heterogeneous System Architecture, with the GPU as a peer processor. Mainstream programmers; full C++; unified coherent address space; task-parallel runtimes; nested data-parallel programs; user-mode dispatch; pre-emption and context switching. Accessibility: excellent.
  • HETEROGENEOUS PROCESSORS - EVERYWHERE From smartphones to supercomputers: phone, tablet, notebook, workstation, dense server, supercomputer. A SINGLE SCALABLE ARCHITECTURE FOR THE WORLD'S PROGRAMMERS IS DEMANDED AT THIS POINT.
  • HOW DOES HSA MAKE THIS ALL WORK? (HSA Foundation: building the ecosystem.)
    ‒ Enables acceleration of languages like Java, C++ AMP, and Python
    ‒ All processors use the same addresses and can share data structures in place
    ‒ Heterogeneous computing can use all of virtual and physical memory
    ‒ Extends multicore coherency to the GPU and other processors
    ‒ Passes work quickly between the processors
    ‒ Enables quality of service
  • HSA FOUNDATION AT LAUNCH Born in June 2012. (Slide shows the founders' logos.)
  • HSA FOUNDATION TODAY – DECEMBER 2013 A growing and powerful family of founders, promoters, supporters, and contributors (including Oracle), plus universities such as the NTHU Programming Language Lab and the NTHU System Software Lab. (Slide shows member logos.)
  • Unified Memory Access
  • UNDERSTANDING UMA The original meaning of UMA is Uniform Memory Access, which refers to how processing cores in a system view and access memory: all processing cores in a true UMA system share a single memory address space. The introduction of GPU compute created Non-Uniform Memory Access (NUMA), which requires data to be managed across multiple heaps with different address spaces and adds programming complexity due to frequent copies, synchronization, and address translation. HSA restores the GPU to uniform memory access: heterogeneous computing replaces GPU computing.
  • INTRODUCING hUMA (Slide diagram:) UMA: four CPU cores sharing one CPU memory. NUMA: an APU whose CPU cores use CPU memory while its GPU cores use a separate GPU memory. hUMA: an APU with HSA, where CPU and GPU cores share a single unified memory.
  • hUMA KEY FEATURES
    ‒ Coherent memory: hardware cache coherency ensures CPU and GPU caches both see an up-to-date view of data
    ‒ Pageable memory: the GPU can seamlessly access virtual memory addresses that are not (yet) present in physical memory
    ‒ Entire memory space: both CPU and GPU can access and allocate any location in the system's virtual memory space
  • WITHOUT POINTERS AND DATA SHARING Without hUMA, the CPU explicitly copies data to GPU memory, the GPU completes its computation, and the CPU explicitly copies the result back to CPU memory. Only the data array can be copied, since the GPU cannot follow embedded data-structure links. A sketch of this flow follows.
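    To make the copy-based flow concrete, here is a minimal host-side sketch in C against the OpenCL 1.x buffer API. The deck itself contains no code; the trivial kernel, sizes, and single-device setup are illustrative assumptions, and error handling is omitted for brevity.

    /* Pre-HSA flow: stage data into a separate device buffer, run the
     * kernel, copy the result back.  Assumes one OpenCL 1.x GPU device. */
    #include <CL/cl.h>
    #include <stdio.h>

    static const char *src =
        "__kernel void scale(__global float *a) {"
        "    size_t i = get_global_id(0);"
        "    a[i] *= 2.0f;"
        "}";

    int main(void) {
        enum { N = 1024 };
        float data[N];
        for (int i = 0; i < N; ++i) data[i] = (float)i;

        cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
        cl_device_id dev;     clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        /* 1. CPU explicitly copies data into a distinct GPU-side buffer. */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof data, NULL, NULL);
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, NULL);

        /* 2. GPU completes the computation. */
        clSetKernelArg(k, 0, sizeof buf, &buf);
        size_t gsz = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);

        /* 3. CPU explicitly copies the result back to host memory. */
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, NULL);
        printf("data[1] = %f\n", data[1]);   /* prints 2.0 */
        return 0;
    }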
  • WITH POINTERS AND DATA SHARING The CPU can pass a pointer to the entire data structure in uniform CPU/GPU memory, since the GPU can now follow embedded links. A sketch follows.
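    A corresponding hUMA-style sketch, assuming an OpenCL 2.0 runtime with fine-grained shared virtual memory (the API-level expression of the hUMA features above). The struct layout and helper function are illustrative; context, queue, and kernel setup are the same as in the previous sketch and elided.

    /* The kernel sees the same struct layout and simply follows the link. */
    #include <CL/cl.h>

    static const char *src =
        "typedef struct { __global float *payload; int n; } Table;"
        "__kernel void scale(__global Table *t) {"
        "    int i = (int)get_global_id(0);"
        "    if (i < t->n) t->payload[i] *= 2.0f;"
        "}";

    void run(cl_context ctx, cl_command_queue q, cl_kernel k) {
        enum { N = 1024 };
        typedef struct { float *payload; int n; } Table;

        /* Both allocations live in shared virtual memory: the same pointer
         * values are valid on the CPU and the GPU. */
        float *payload = clSVMAlloc(ctx,
            CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER, N * sizeof(float), 0);
        Table *t = clSVMAlloc(ctx,
            CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER, sizeof(Table), 0);
        for (int i = 0; i < N; ++i) payload[i] = (float)i;
        t->payload = payload;   /* embedded link the GPU can follow */
        t->n = N;

        /* Pass one pointer; declare the indirectly reached SVM allocation. */
        clSetKernelArgSVMPointer(k, 0, t);
        void *ptrs[] = { payload };
        clSetKernelExecInfo(k, CL_KERNEL_EXEC_INFO_SVM_PTRS, sizeof(ptrs), ptrs);

        size_t gsz = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
        clFinish(q);            /* result is already visible in CPU memory */

        clSVMFree(ctx, t);
        clSVMFree(ctx, payload);
    }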
  • hUMA FEATURES Access to the entire memory space, pageable memory, bi-directional coherency, fast GPU access to system memory, and dynamic memory allocation.
  • Power Management
  • KEY OBSERVATIONS
    ‒ Applications exhibit varying degrees of CPU and GPU frequency sensitivity due to control divergence, interference at shared resources, and performance coupling between CPU and GPU
    ‒ Efficient energy management requires metrics that can predict frequency sensitivity (power) in heterogeneous processors
    ‒ Sensitivity metrics drive the coordinated setting of CPU and GPU power states
  • STATE-OF-THE-ART: BI-DIRECTIONAL APPLICATION POWER MANAGEMENT (BAPM) The chip is divided into BAPM-controlled thermal entities (TEs): CU0, CU1, and the GPU. The power management algorithm (see the sketch below):
    1. Calculate a digital estimate of power consumption
    2. Convert power to temperature using an RC network model for heat transfer
    3. Assign new power budgets to TEs based on temperature headroom
    4. TEs locally control (boost) their own DVFS states to maximize performance
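    A compact sketch of what such a control loop could look like, in plain C. The structure follows the four steps above, but every constant and model function is a hypothetical placeholder, not AMD's implementation.

    #include <stddef.h>

    enum { CU0_TE, CU1_TE, GPU_TE, NUM_TES };   /* thermal entities from the slide */

    typedef struct {
        double est_power_w;   /* step 1: digital power estimate                 */
        double temp_c;        /* step 2: temperature from the RC network model  */
        double budget_w;      /* step 3: power budget from temperature headroom */
        int    dvfs_state;    /* step 4: locally chosen (boosted) DVFS state    */
    } ThermalEntity;

    /* Hypothetical model hooks: toy stand-ins for the real calibrated models. */
    static double read_power_estimate(int te) { return 10.0 + te; }
    static double rc_thermal_model(double p_w, double prev_c) {
        return 0.9 * prev_c + 0.1 * (45.0 + 2.0 * p_w);  /* first-order RC step */
    }
    static int pick_dvfs_state(double budget_w) { return budget_w > 15.0 ? 0 : 1; }

    void bapm_interval(ThermalEntity te[NUM_TES], double tj_max_c) {
        for (int i = 0; i < NUM_TES; ++i) {
            te[i].est_power_w = read_power_estimate(i);                      /* 1 */
            te[i].temp_c = rc_thermal_model(te[i].est_power_w, te[i].temp_c);/* 2 */
        }
        for (int i = 0; i < NUM_TES; ++i) {
            double headroom_c = tj_max_c - te[i].temp_c;
            te[i].budget_w = te[i].est_power_w + 0.5 * headroom_c;  /* 3: toy gain */
            te[i].dvfs_state = pick_dvfs_state(te[i].budget_w);     /* 4: boost    */
        }
    }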
  • DYNACO: RUN-TIME SYSTEM FOR COORDINATED ENERGY MANAGEMENT DynaCo monitors performance metrics, computes CPU and GPU frequency sensitivity, and decides CPU-GPU power states. It is implemented as a run-time software policy overlaid on top of BAPM in real hardware. Its decision table (coded as a function below):
    ‒ GPU sensitivity High, CPU sensitivity Low: shift power to GPU
    ‒ GPU sensitivity High, CPU sensitivity High: proportional power allocation
    ‒ GPU sensitivity Low, CPU sensitivity High: shift power to CPU
    ‒ GPU sensitivity Low, CPU sensitivity Low: reduce power of both CPU and GPU
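    The decision table maps directly onto a small policy function. A sketch in C, with the High/Low classification assumed to come from the sensitivity metrics above; names are illustrative, not AMD's implementation.

    typedef enum { LOW, HIGH } Sensitivity;
    typedef enum { SHIFT_TO_GPU, PROPORTIONAL, SHIFT_TO_CPU, REDUCE_BOTH } PowerDecision;

    PowerDecision dynaco_decide(Sensitivity gpu, Sensitivity cpu) {
        if (gpu == HIGH && cpu == LOW)  return SHIFT_TO_GPU;   /* GPU-bound phase   */
        if (gpu == HIGH && cpu == HIGH) return PROPORTIONAL;   /* coupled phase     */
        if (gpu == LOW  && cpu == HIGH) return SHIFT_TO_CPU;   /* CPU-bound phase   */
        return REDUCE_BOTH;                                    /* neither sensitive */
    }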
  • Programming Languages
  • PROGRAMMING LANGUAGES PROLIFERATING ON HSA (Slide stack diagram:) OpenCL™ apps on the OpenCL runtime; Java apps on the JVM (Sumatra); C++ AMP apps on various runtimes; Python apps on the Fabric Engine RT. All target HSAIL (the HSA Intermediate Language), supported by the HSA helper libraries, the HSA core runtime, the kernel fusion driver (KFD), and the HSA finalizer.
  • PROGRAMMING MODELS EMBRACING HSAIL AND HSA: THE RIGHT LEVEL OF ABSTRACTION
    ‒ Under development: Java (Project Sumatra, OpenJDK 9); OpenMP from SuSE; C++ AMP, based on CLANG/LLVM; Python and KL from Fabric Engine
    ‒ Next: DSLs (Halide, Julia, Rust); Fortran; JavaScript; Open Shading Language; R
  • HSAIL HSAIL (the HSA Intermediate Language) is the SW interface: a virtual ISA for parallel programs, finalized to a native ISA by a finalizer/JIT. It accommodates rapid innovation in native GPU architectures, is expected to remain stable and backward compatible across implementations, and enables multiple hardware vendors to support HSA. Key design points and benefits for HSA compilers (illustrated after this list):
    ‒ High-level compiler flow: OpenCL™ kernel  EDG or CLANG  SPIR  LLVM  HSAIL
    ‒ Finalizer flow (runtime): HSAIL  finalizer  hardware ISA
    ‒ A thin finalizer approach enables fast translation time and robustness in the finalizer
    ‒ Performance optimizations are driven through high-level compilers (HLCs), taking advantage of the strength and compilation-time budget of HLCs for aggressive optimization
    (EDG: Edison Design Group; CLANG: LLVM front end; SPIR: Standard Portable Intermediate Representation)
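    To make the flow concrete, here is a trivial OpenCL C kernel of the kind that enters the pipeline, with the slide's stages annotated as comments. The kernel itself is an illustrative example, not from the deck.

    /* 1. Source: an OpenCL C kernel written by the programmer. */
    __kernel void saxpy(float a,
                        __global const float *x,
                        __global float *y) {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }

    /* 2. A high-level front end (EDG or CLANG) parses it and emits SPIR,
     *    a portable LLVM-based intermediate representation.
     * 3. LLVM optimizes the SPIR aggressively; HSA places the heavy
     *    optimization budget here, in the high-level compiler.
     * 4. The back end emits HSAIL, the stable, vendor-neutral virtual ISA.
     * 5. At run time a thin finalizer/JIT translates HSAIL to the native
     *    GPU ISA quickly, relying on the optimizations already done. */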
  • HSA ENABLEMENT OF JAVA
    ‒ Java 7 – OpenCL-enabled Aparapi: an AMD-initiated open-source project providing APIs for data-parallel algorithms, GPU-accelerating Java applications with no need to learn OpenCL™. An active community captured mindshare: ~20 contributors, >7000 downloads, ~150 visits per day. (Stack: Java application  Aparapi API  OpenCL™ compiler  CPU/GPU ISA.)
    ‒ Java 8 – HSA-enabled Aparapi: Java 8 brings the Stream + Lambda APIs, a more natural way of expressing data-parallel algorithms, initially targeted at multi-core. Aparapi will support Java 8 lambdas and dispatch code to HSA-enabled devices at runtime via HSAIL. (Stack: Java application  JDK Stream + Lambda API  Aparapi + Lambda API  HSAIL  HSA finalizer and runtime  CPU/GPU.)
    ‒ Java 9 – HSA-enabled Java (Sumatra): adds native GPU acceleration to the Java Virtual Machine (JVM). The developer uses the JDK Lambda and Stream APIs; the JVM uses the Graal compiler to generate HSAIL and decides at runtime whether to execute on the CPU or the GPU depending on workload characteristics.
  • Workloads
  • OVERVIEW OF B+ TREES
    ‒ B+ Trees are a special case of B-Trees: a fundamental data structure used in several popular database management systems, such as SQLite and CouchDB
    ‒ A B+ Tree is a dynamic, multi-level index, efficient for retrieval of data stored in a block-oriented context
    ‒ The order (b) of a B+ Tree measures the capacity of its nodes
    (Slide shows an example tree with keys 1-8 in its leaves and data pointers d1-d8.)
  • HOW WE ACCELERATE
    ‒ Utilize coarse-grained parallelism in B+ Tree searches: perform many queries in parallel, increase memory-bandwidth utilization with parallel reads, and increase throughput (transactions per second for OLTP)
    ‒ B+ Tree searches on an HSA-enabled APU allow much larger B+ Trees to be searched than traditional GPU compute, and eliminate data copies since CPU and GPU cores can access the same memory (see the kernel sketch below)
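    A sketch of what the coarse-grained search kernel could look like in OpenCL C. Node layout, order, and names are illustrative assumptions; the point is that under hUMA each work-item can chase the CPU-built tree's embedded child pointers in place, so a 6 GB tree never has to be flattened or copied into a device buffer.

    #define ORDER 8   /* illustrative node capacity (the "order b" of the slide) */

    typedef struct Node {
        int num_keys;                               /* keys currently in the node */
        int is_leaf;
        int keys[ORDER];
        __global struct Node *children[ORDER + 1];  /* embedded links the GPU follows */
        ulong records[ORDER];                       /* data handles (leaves only) */
    } Node;

    /* One query per work-item: coarse-grained parallelism across 1M queries. */
    __kernel void batch_search(__global Node *root,
                               __global const int *queries,
                               __global ulong *results) {
        size_t q = get_global_id(0);
        int key = queries[q];
        __global Node *n = root;

        while (!n->is_leaf) {                       /* descend one level per step */
            int i = 0;
            while (i < n->num_keys && key >= n->keys[i]) ++i;
            n = n->children[i];
        }
        int i = 0;                                  /* linear scan within the leaf */
        while (i < n->num_keys && n->keys[i] != key) ++i;
        results[q] = (i < n->num_keys) ? n->records[i] : 0;
    }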
  • RESULTS 1M search queries in parallel. The input B+ Tree contains 112 million keys and uses 6 GB of memory. Hardware: an AMD "Kaveri" APU with a quad-core CPU and 8 GCN compute units at 35 W TDP. Software: OpenCL on HSA. Baseline: a 4-core OpenMP + hand-tuned SSE CPU implementation.
  • REVERSE TIME MIGRATION (RTM) A technique for creating images from sensor data to improve the seismic interpretations done by geophysicists. RTM is run on massive data sets gathered by land and marine crews; it is a memory-intensive, highly parallel algorithm and a natural scale-out algorithm, often run today on 100K-node CPU systems. However, speed of processing and interpretation is a critical bottleneck in making full use of acquisition assets. Bringing RTM to HSA- and APU-based supercomputing will increase performance for current sensor arrays and allow more sensors and greater accuracy in the future. The stencil sketch below shows the shape of the computation.
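    As an illustration of why RTM is memory-bound, here is the shape of its inner loop: explicit time-stepping of the acoustic wave equation on a grid. This 2-D C sketch is simplified (real RTM is 3-D, with absorbing boundaries, source injection, and an imaging condition); each output point needs several neighbor reads but only a handful of flops, so bandwidth and memory capacity, not compute, dominate.

    #include <stddef.h>

    /* One time step: u_next = 2*u - u_prev + (c*dt/dx)^2 * laplacian(u). */
    void rtm_step(size_t nx, size_t nz,
                  const float *u, const float *u_prev, float *u_next,
                  const float *vel2dt2 /* precomputed (c*dt/dx)^2 per cell */) {
        for (size_t z = 1; z + 1 < nz; ++z) {
            for (size_t x = 1; x + 1 < nx; ++x) {
                size_t i = z * nx + x;
                /* 5-point stencil: four neighbor reads plus the center. */
                float lap = u[i - 1] + u[i + 1] + u[i - nx] + u[i + nx] - 4.0f * u[i];
                u_next[i] = 2.0f * u[i] - u_prev[i] + vel2dt2[i] * lap;
            }
        }
    }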
  • TEXT ANALYTICS – HADOOP TERASORT AND BIG DATA SEARCH Mining big data is a multi-stage pipeline of parallel processing stages (input HDFS  splits  map  sort/copy/merge  reduce  output HDFS, with HDFS replication). Traditional GPU compute is challenged by the copies between stages; an APU with HSA accelerates each stage in place: sort, compression, regular-expression parsing, and CRC generation. Acceleration of large-data search scales out across a cluster of APU nodes. (HDFS: Hadoop Distributed File System.)
  • DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.
  • BACKUP
  • Programming Tools
  • AMD CodeXL v1.3 AMD's comprehensive heterogeneous developer tool suite, including CPU and GPU profiling, GPU kernel debugging, and GPU kernel analysis. New features in version 1.3: Java support, integrated static kernel analysis, remote debugging/profiling, and support for the latest AMD APU and GPU products.
  • OPEN SOURCE LIBRARIES ACCELERATED BY AMD
    ‒ OpenCV: the most popular computer vision library, now with many OpenCL™-accelerated functions
    ‒ Bolt: a C++ template library providing GPU offload for common data-parallel algorithms, now with cross-OS support and improved performance/functionality
    ‒ clMath: AMD released APPML as open source to create clMath; accelerated BLAS and FFT libraries, accessible from Fortran, C, and C++
    ‒ Aparapi: OpenCL™-accelerated Java 7; Java APIs for data-parallel algorithms (no need to learn OpenCL™)