HETEROGENEOUS SYSTEM
ARCHITECTURE OVERVIEW
VINOD TIPPARAJU
PRINCIPAL MEMBER OF TECHNICAL STAFF,
AMD AUSTIN.
AGENDA
u  Introduction (terminology)
u  Memory Model & Queues
u  HSAIL
u  Architecture Details
SOME TERMINOLOGY
u  HSA is heterogeneous systems architecture, not just GPUs
u  HSA Component – IP that satisfies architecture requirements and provides
identified features
u  SoC – system on Chip, collection of various IPs
u  E.g. AMD APU (Accelerated Processing Unit) integrates AMD/ARM CPU cores and
Graphics IP
u  It is possible to conceive companies just building parts of the IP
u  HSAIL -- HSA intermediate language very low-level SIMT language
u  HSA Agent – something that can participate in the HSA memory subsystem
(i.e. respect page sizes, memory properties, atomics, etc.)
WHAT IS HSA?
RUNTIME
u  API that wraps the features like user mode
queues, clocks, signalling, etc
u  Provides execution control
u  Supports tools
TOOLS
u  Supporting profilers, debuggers and
compilers
u  Unique debugging support that greatly
simplifies implementing debuggers
u  Excellent profiling support with some user
mode access
Systems Architecture
u  From a hardware point of view, system
architecture requirements necessary
u  Specifies shared memory, cache coherence
domains, concept of clocks, context
switching, memory based signaling, topology,
u  Rules governing design and agent behavior
Programmers Reference (HSAIL)
u  An intermediate representation, very low
level.
u  Vendor independence, device compiler
optimizations
u  Abstracts HW, or can serve as the lowest
level instruction set
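For a concrete sense of the RUNTIME box above, here is a minimal bring-up sketch using the HSA runtime C API as it was later published by the HSA Foundation; this talk predates the finalized runtime specification, so treat the header path and exact identifiers as illustrative.

```cpp
// Minimal HSA runtime bring-up: initialize, enumerate agents, shut down.
// (Sketch only; assumes the later-standardized HSA runtime C API.)
#include <hsa/hsa.h>   // header location varies by installation
#include <cstdio>

// Called once per HSA agent (CPU, GPU, ...) discovered by the runtime.
static hsa_status_t print_agent(hsa_agent_t agent, void*) {
  char name[64] = {0};
  hsa_agent_get_info(agent, HSA_AGENT_INFO_NAME, name);
  hsa_device_type_t type;
  hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
  std::printf("agent: %s (%s)\n", name,
              type == HSA_DEVICE_TYPE_GPU ? "GPU" : "CPU/other");
  return HSA_STATUS_SUCCESS;
}

int main() {
  hsa_init();                               // bring up the runtime
  hsa_iterate_agents(print_agent, nullptr); // enumerate all HSA agents
  hsa_shut_down();
  return 0;
}
```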
TAKING THE HW INTEGRATION TO ITS
NATURAL CONCLUSION
u  Architectural and System integration
u  Extend architecture to make the component a first class citizen on the SoC
u  Fully-evolved MMU
u  Provide same level of support for tools as CPU
u  Provide context switching, preemption, full-coherence
u  Helps simulators, migrations, checkpoints, etc
u  Future, other HSA IP
HIGH LEVEL USAGE SCENARIOS
u  Bulk-Synchronous Parallelism -like concurrent computation
u  Rather large parallel sections followed by synchronization
u  Outstanding support for task-based parallelism
u  Wavefront is 64 threads
u  256 threads sufficient to fully fill the pipeline
u  Launch in under a microsecond (same with nested launches)
u  Support for execution schedules – excellent compiler target
u  Architected Queueing Language (AQL), dependencies
u  Advanced language support
u  Function calls
u  Virtual functions
u  Exception handling (throw-catch)
MEMORY AND QUEUING MODEL
HSA MEMORY MODEL
u  Defines visibility ordering between all threads
in the HSA System
u  Designed to be compatible with C++11, Java,
OpenCL and .NET Memory Models
u  Relaxed consistency memory model for
parallel compute performance
u  Visibility controlled by:
u  Load.Acquire
u  Store.Release
u  Barriers
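The Load.Acquire / Store.Release pair above is the same acquire/release idiom found in the C++11 memory model the slide cites for compatibility. A minimal, plain C++11 sketch of the visibility guarantee (not an HSA-specific API):

```cpp
// Producer publishes data with a release store; consumer observes it with an
// acquire load. HSA's Load.Acquire/Store.Release provide the same ordering.
#include <atomic>

int payload = 0;            // ordinary data written by the producer
std::atomic<int> flag{0};   // synchronization variable

void producer() {
  payload = 42;                               // plain store
  flag.store(1, std::memory_order_release);   // "Store.Release" publishes payload
}

void consumer() {
  while (flag.load(std::memory_order_acquire) != 1) { }  // "Load.Acquire"
  // payload == 42 is now guaranteed to be visible here.
}
```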
HSA QUEUING MODEL
u  User mode queuing for low latency dispatch
u  Application dispatches directly
u  No OS or driver in the dispatch path
u  Architected Queuing Layer
u  Single compute dispatch path for all hardware
u  No driver translation, direct to hardware
u  Allows for dispatch to queue from any agent
u  CPU or GPU
u  GPU self enqueue enables lots of solutions
u  Recursion
u  Tree traversal
u  Wavefront reforming
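A hedged sketch of the user-mode dispatch path described above, using the HSA runtime C API and the AQL kernel-dispatch packet as they were later standardized. The talk predates the finalized specification, so the identifiers are illustrative; kernel-object and kernarg preparation (which come from the finalizer/loader) are omitted.

```cpp
// User-mode dispatch sketch: create a queue, reserve a packet slot, fill the
// AQL kernel-dispatch packet, publish its header, and ring the doorbell.
// Assumes hsa_init() has run and `gpu`, `kernel_object`, `kernargs`,
// `completion` were obtained elsewhere (agent discovery/finalizer omitted).
#include <hsa/hsa.h>

void dispatch(hsa_agent_t gpu, uint64_t kernel_object, void* kernargs,
              hsa_signal_t completion) {
  hsa_queue_t* q = nullptr;
  hsa_queue_create(gpu, 4096, HSA_QUEUE_TYPE_SINGLE, nullptr, nullptr,
                   UINT32_MAX, UINT32_MAX, &q);

  // Reserve a slot in the ring buffer -- no OS or driver call on this path.
  uint64_t index = hsa_queue_add_write_index_relaxed(q, 1);
  auto* pkt = reinterpret_cast<hsa_kernel_dispatch_packet_t*>(q->base_address)
              + (index & (q->size - 1));

  pkt->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;  // 1-D grid
  pkt->workgroup_size_x = 256; pkt->workgroup_size_y = 1; pkt->workgroup_size_z = 1;
  pkt->grid_size_x = 1024;     pkt->grid_size_y = 1;      pkt->grid_size_z = 1;
  pkt->private_segment_size = 0;
  pkt->group_segment_size   = 0;
  pkt->kernel_object        = kernel_object;
  pkt->kernarg_address      = kernargs;
  pkt->completion_signal    = completion;

  // Publishing the header last, with release semantics, makes the packet valid
  // under the architected AQL protocol (fence-scope bits omitted for brevity).
  uint16_t header = HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
  __atomic_store_n(&pkt->header, header, __ATOMIC_RELEASE);

  // Ring the user-mode doorbell so the packet processor picks up the work.
  hsa_signal_store_screlease(q->doorbell_signal,
                             static_cast<hsa_signal_value_t>(index));
}
```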
HSAIL
HSA INTERMEDIATE LAYER — HSAIL
u  HSAIL is a virtual ISA for parallel programs
u  Finalized to ISA by a JIT compiler or “Finalizer”
u  ISA independent by design for CPU & GPU
u  Explicitly parallel
u  Designed for data parallel programming
u  Support for exceptions, virtual functions,
and other high level language features
u  Lower level than OpenCL SPIR
u  Fits naturally in the OpenCL compilation stack
u  Suitable to support additional high level languages and programming models:
u  Java, C++, OpenMP, Fortran etc
WHAT IS HSAIL?
u  HSAIL is the intermediate language for parallel compute in HSA
u  Generated by a high level compiler (LLVM, gcc, Java VM, etc)
u  Low-level IR, close to machine ISA level
u  Compiled down to target ISA by an IHV “Finalizer”
u  Finalizer may execute at run time, install time, or build time
u  Example: OpenCL™ Compilation Stack using HSAIL
High-Level Compiler Flow (Developer): OpenCL™ Kernel → EDG or CLANG → SPIR → LLVM → HSAIL
Finalizer Flow (Runtime): HSAIL → Finalizer → Hardware ISA
KEY HSAIL FEATURES
u  Parallel
u  Shared virtual memory
u  Portable across vendors in HSA Foundation
u  Stable across multiple product generations
u  Consistent numerical results (IEEE-754 with defined min accuracy)
u  Fast, robust, simple finalization step (no monthly updates)
u  Good performance (little need to write in ISA)
u  Supports all of OpenCL™ and C++ AMP™
u  Support Java, C++, and other languages as well
SIMT EXECUTION MODEL
u  HSAIL Presents a “SIMT” execution model to the programmer
u  “Single Instruction, Multiple Thread”
u  Programmer writes program for a single thread of execution
u  Each work-item appears to have its own program counter
u  Branch instructions look natural
u  Hardware Implementation
u  Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
u  Actually one program counter for the entire SIMD instruction
u  Branches implemented with predication
u  SIMT Advantages
u  Easier to program (branch code in particular)
u  Natural path for mainstream programming models
u  Scales across a wide variety of hardware (programmer doesn’t see vector width)
u  Cross-lane operations available for those who want peak performance
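To make the SIMT point concrete, below is a hypothetical per-work-item function with a data-dependent branch. The programmer writes ordinary scalar code; SIMD hardware runs 64 such work-items per wavefront and implements the branch with predicated lanes.

```cpp
// What the programmer writes: one thread of execution, natural branches.
// (Hypothetical kernel body; in practice this would be an OpenCL or C++ AMP
//  kernel compiled to HSAIL, not a plain C++ function.)
void work_item(int id, const float* in, float* out) {
  float x = in[id];
  if (x > 0.0f)           // on SIMD hardware this becomes a predicate mask
    out[id] = x * 2.0f;   // lanes with x > 0 take this path
  else
    out[id] = -x;         // the remaining lanes take this path
}
```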
ARCHITECTURE DETAILS –
WALK THROUGH OF FEATURES
AND BENEFITS
HIGH LEVEL FEATURES OF HSA
u  Features currently being defined in the HSA Working Groups**
u  Unified addressing across all processors
u  Operation into pageable system memory
u  Full memory coherency
u  User mode dispatch
u  Architected queuing language
u  High level language support for GPU compute processors
u  Preemption and context switching
** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups
STATE OF GPU COMPUTING
Today’s Challenges
u  Separate address spaces
u  Copies
u  Can’t share pointers
u  New language required for compute kernel
u  EX: OpenCL™ runtime API
u  Compute kernel compiled separately
than host code
Emerging Solution
u  HSA Hardware
u  Single address space
u  Coherent
u  Virtual address space
u  Fast access from all components
u  Can share pointers
u  Bring GPU computing to existing, popular,
programming models
u  Single-source, fully supported by
compiler
u  HSAIL compiler IR (Cross-platform!)
• GPUs are fast and power efficient: high compute density per-mm and per-watt
• But: can be hard to program
[Diagram: discrete GPU attached over PCIe]
MOTIVATION (TODAY’S PICTURE)
[Diagram: today’s dispatch flow across Application / OS / GPU]
Application: Transfer buffer to GPU → Copy/Map Memory → Queue Job
OS: Schedule Job
GPU: Start Job → Finish Job
OS: Schedule Application
Application: Get Buffer → Copy/Map Memory
SHARED VIRTUAL MEMORY (TODAY)
u  Multiple virtual memory address spaces
[Diagram: CPU0 and GPU each have their own virtual address space — VIRTUAL MEMORY1 (VA1→PA1) and VIRTUAL MEMORY2 (VA2→PA1) — mapping into the same PHYSICAL MEMORY]
SHARED VIRTUAL MEMORY (HSA)
u  Common Virtual Memory for all HSA agents
[Diagram: CPU0 and GPU share a single virtual address space — VIRTUAL MEMORY (VA→PA for both) — backed by the same PHYSICAL MEMORY]
SHARED VIRTUAL MEMORY
u  Advantages
u  No mapping tricks, no copying back-and-forth between different PA
addresses
u  Send pointers (not data) back and forth between HSA agents.
u  Implications
u  Common Page Tables (and common interpretation of architectural
semantics such as shareability, protection, etc).
u  Common mechanisms for address translation (and servicing
address translation faults)
u  Concept of a process address space ID (PASID) to allow multiple,
per process virtual address spaces within the system.
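A small, hypothetical sketch of “send pointers, not data”: with a shared virtual address space, a kernel-argument block can carry ordinary CPU pointers that the GPU agent dereferences directly. Assumes a system where pageable allocations are visible to all HSA agents, per the “operation into pageable system memory” feature; the argument layout is illustrative.

```cpp
// Pointer passing under shared virtual memory (sketch; hypothetical layout).
#include <cstdlib>
#include <cstddef>

struct KernArgs {       // layout must match what the kernel expects
  const float* in;      // ordinary CPU pointer, valid as-is on the GPU agent
  float*       out;
  std::size_t  n;
};

KernArgs make_args(std::size_t n) {
  KernArgs a;
  a.in  = static_cast<float*>(std::malloc(n * sizeof(float)));  // pageable memory
  a.out = static_cast<float*>(std::malloc(n * sizeof(float)));
  a.n   = n;
  // No copies, no map/unmap: the GPU dereferences the same virtual addresses,
  // because all HSA agents share the process's page tables (same PASID).
  return a;
}
```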
GETTING THERE …
[Diagram: the dispatch flow from “Motivation (Today’s Picture)” repeated — Application: Transfer buffer to GPU → Copy/Map Memory → Queue Job; OS: Schedule Job; GPU: Start Job → Finish Job; OS: Schedule Application; Application: Get Buffer → Copy/Map Memory]
SHARED VIRTUAL MEMORY
u  Specifics
u  Minimum supported VA width is 48b for 64b systems, and 32b for
32b systems.
u  HSA agents may reserve VA ranges for internal use via system
software.
u  All HSA agents other than the host unit must use the lowest
privilege level
u  If present, read/write access flags for page tables must be
maintained by all agents.
u  Read/write permissions apply to all HSA agents, equally.
CACHE COHERENCY
CACHE COHERENCY DOMAINS (1/2)
u  Data accesses to global memory segment from all HSA Agents shall be
coherent without the need for explicit cache maintenance.
CACHE COHERENCY DOMAINS (2/2)
u  Advantages
u  Composability
u  Reduced SW complexity when communicating between agents
u  Lower barrier to entry when porting software
u  Implications
u  Hardware coherency support between all HSA agents
u  Can take many forms
u  Stand alone Snoop Filters / Directories
u  Combined L3/Filters
u  Snoop-based systems (no filter)
u  Etc …
GETTING CLOSER …
[Diagram: the same dispatch flow as above, now annotated “Cache coherency”]
SIGNALING
SIGNALING (1/2)
u  HSA agents support the ability to use signaling objects
u  All creation/destruction signaling objects occurs via HSA
runtime APIs
u Object creation/destruction
u  From an HSA Agent you can directly accessing
signaling objects.
u Signaling a signal object (this will wake up HSA
agents waiting upon the object)
u Query current object
u Wait on the current object (various conditions
supported).
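A hedged sketch of the operations listed above (creation/destruction, signaling, querying, waiting), using the HSA runtime C API names as later published; the exact identifiers are illustrative for a talk that predates the finalized specification.

```cpp
// Signal create / query / wait / destroy sketch (HSA runtime C API; illustrative).
#include <hsa/hsa.h>

void signaling_example() {
  hsa_signal_t sig;
  hsa_signal_create(1, 0, nullptr, &sig);   // creation goes through the runtime API

  // Another HSA agent -- e.g. a GPU dispatch's completion signal or another CPU
  // thread -- eventually signals the object:
  //   hsa_signal_store_screlease(sig, 0);

  // Query the current value without waiting.
  hsa_signal_value_t v = hsa_signal_load_scacquire(sig);
  (void)v;

  // Wait, in a low-power blocked state, until the value becomes 0.
  hsa_signal_wait_scacquire(sig, HSA_SIGNAL_CONDITION_EQ, 0,
                            UINT64_MAX, HSA_WAIT_STATE_BLOCKED);

  hsa_signal_destroy(sig);                  // destruction also via the runtime API
}
```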
SIGNALING (2/2)
u  Advantages
u  Enables asynchronous interrupts between HSA agents,
without involving the kernel
u  Common idiom for work offload
u  Low power waiting
u  Implications
u  Runtime support required
u  Commonly implemented on top of cache coherency
flows
ALMOST THERE…
[Diagram: the same dispatch flow as above, now annotated “signaling”]
USER MODE QUEUEING
USER MODE QUEUEING (1/3)
u  User mode Queueing
u  Enables user space applications to directly, without OS intervention, enqueue jobs
(“Dispatch Packets”) for HSA agents.
u  Dispatch packet is a job of work
u  Support for multiple queues per PASID
u  Multiple threads/agents within a PASID may enqueue Packets in the same Queue.
u  Dependency mechanisms created for ensuring ordering between packets.
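One architected dependency mechanism is the barrier-AND packet, which blocks later packets on a queue until its dependency signals reach zero. A hedged sketch using the packet layout as later standardized (see also the dispatch sketch earlier); identifiers are illustrative.

```cpp
// Barrier-AND packet sketch: later packets on this queue wait until the
// dependency signals all reach 0.
#include <hsa/hsa.h>
#include <cstring>

void enqueue_barrier(hsa_queue_t* q, hsa_signal_t dep0, hsa_signal_t dep1) {
  uint64_t index = hsa_queue_add_write_index_relaxed(q, 1);
  auto* pkt = reinterpret_cast<hsa_barrier_and_packet_t*>(q->base_address)
              + (index & (q->size - 1));

  std::memset(pkt, 0, sizeof(*pkt));   // clear reserved fields and unused slots
  pkt->dep_signal[0] = dep0;           // up to 5 dependency signals per packet
  pkt->dep_signal[1] = dep1;

  // Publish the header last, with release semantics, per the AQL protocol,
  // then ring the doorbell.
  uint16_t header = HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE;
  __atomic_store_n(&pkt->header, header, __ATOMIC_RELEASE);
  hsa_signal_store_screlease(q->doorbell_signal,
                             static_cast<hsa_signal_value_t>(index));
}
```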
USER MODE QUEUEING (2/3)
u  Advantages
u  Avoid involving the kernel/driver when dispatching work for an Agent.
u  Lower latency job dispatch enables finer granularity of offload
u  Standard memory protection mechanisms may be used to protect communication
with the consuming agent.
u  Implications
u  Packet formats/fields are Architected – standard across vendors!
u  Guaranteed backward compatibility
u  Packets are enqueued/dequeued via an Architected protocol (all via memory
accesses and signalling)
u  More on this later……
SUCCESS!
[Diagram: the dispatch flow from “Motivation (Today’s Picture)” repeated for comparison]
SUCCESS!
[Diagram: the HSA dispatch flow — Application: Queue Job; GPU: Start Job → Finish Job]
ACCELERATING SUFFIX ARRAY
CONSTRUCTION
CLOUD SERVER WORKLOAD
SUFFIX ARRAYS
u  Suffix Arrays are a fundamental data structure
u  Designed for efficient searching of a large text
u  Quickly locate every occurrence of a substring S in a text T	

u  Suffix Arrays are used to accelerate in-memory cloud workloads
u  Full text index search
u  Lossless data compression
u  Bio-informatics
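As a toy illustration of the data structure itself (not the GPU skew algorithm on the next slide), here is a naive C++ construction that sorts suffix indices lexicographically; a binary search over the result then locates every occurrence of a substring.

```cpp
// Naive O(n^2 log n) suffix array, for illustration only; the HSA work uses the
// skew (DC3) algorithm split across CPU and GPU phases.
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

std::vector<int> suffix_array(const std::string& t) {
  std::vector<int> sa(t.size());
  for (std::size_t i = 0; i < t.size(); ++i) sa[i] = static_cast<int>(i);
  std::sort(sa.begin(), sa.end(), [&](int a, int b) {
    return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
  });
  return sa;   // e.g. for "banana": {5, 3, 1, 0, 4, 2}
}
```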
ACCELERATED SUFFIX ARRAY
CONSTRUCTION ON HSA
M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, submitted to Principles and Practice of Parallel Programming (PPoPP’13), February 2013.
AMD A10-4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM
By offloading data parallel computations to
GPU, HSA increases performance and
reduces energy for Suffix Array Construction
versus Single Threaded CPU.
By efficiently sharing data between CPU and
GPU, HSA lets us move compute to data
without penalty of intermediate copies.
+5.8x increased performance and -5x decreased energy
[Chart: Skew Algorithm for Compute SA — stages labeled Merge Sort::GPU, Radix Sort::GPU, Compute SA::CPU, Lexical Rank::CPU, Radix Sort::GPU]
THE HSA FUTURE
Architected heterogeneous processing on the SOC
Programming of accelerators becomes much easier
Accelerated software that runs across multiple hardware vendors
Scalability from smart phones to super computers on a common architecture
GPU acceleration of parallel processing is the initial target, with DSPs
and other accelerators coming to the HSA system architecture model
Heterogeneous software ecosystem evolves at a much faster pace
Lower power, more capable devices in your hand, on the wall or in a super
computer.
BACKUP
SEGMENTS