HSA QUEUING MODEL
HAKAN PERSSON, SENIOR PRINCIPAL ENGINEER,
ARM
HSA QUEUEING, MOTIVATION
MOTIVATION (TODAY’S PICTURE)
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: dispatch flow today across Application, OS and GPU: Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory.]
HSA QUEUEING: REQUIREMENTS
REQUIREMENTS
 Three key technologies are used to build the user mode queueing
mechanism
 Shared Virtual Memory
 System Coherency
 Signaling
 AQL (Architected Queueing Language) enables any agent
to enqueue tasks
SHARED VIRTUAL MEMORY
SHARED VIRTUAL MEMORY (TODAY)
 Multiple virtual memory address spaces
[Figure: CPU0 and the GPU each have their own address space (VIRTUAL MEMORY1, VIRTUAL MEMORY2); VA1->PA1 and VA2->PA1 reach the same PHYSICAL MEMORY location through different virtual addresses.]
SHARED VIRTUAL MEMORY (HSA)
 Common Virtual Memory for all HSA agents
[Figure: CPU0 and the GPU share one VIRTUAL MEMORY address space; both use the same VA->PA translation to PHYSICAL MEMORY.]
SHARED VIRTUAL MEMORY
 Advantages
 No mapping tricks, no copying back-and-forth between different PA
addresses
 Send pointers (not data) back and forth between HSA agents.
 Implications
 Common Page Tables (and common interpretation of architectural
semantics such as shareability, protection, etc).
 Common mechanisms for address translation (and servicing address
translation faults)
 Concept of a process address space (PASID) to allow multiple, per
process virtual address spaces within the system.
SHARED VIRTUAL MEMORY
 Specifics
 Minimum supported VA width is 48b for 64b systems, and 32b for
32b systems.
 HSA agents may reserve VA ranges for internal use via system
software.
 All HSA agents other than the host unit must use the lowest privilege
level
 If present, read/write access flags for page tables must be
maintained by all agents.
 Read/write permissions apply to all HSA agents, equally.
GETTING THERE …
[Figure: the dispatch-flow diagram from the Motivation slide, repeated.]
CACHE COHERENCY
CACHE COHERENCY DOMAINS (1/3)
 Data accesses to global memory segment from all HSA Agents shall be
coherent without the need for explicit cache maintenance.
CACHE COHERENCY DOMAINS (2/3)
 Advantages
 Composability
 Reduced SW complexity when communicating between agents
 Lower barrier to entry when porting software
 Implications
 Hardware coherency support between all HSA agents
 Can take many forms
 Stand alone Snoop Filters / Directories
 Combined L3/Filters
 Snoop-based systems (no filter)
 Etc …
CACHE COHERENCY DOMAINS (3/3)
 Specifics
 No requirement for instruction memory accesses to be
coherent
 Only applies to the Primary memory type.
 No requirement for HSA agents to maintain coherency to any
memory location where the HSA agents do not specify the
same memory attributes
 Read-only image data is required to remain static during the
execution of an HSA kernel.
 No double mapping (via different attributes) in order to
modify. Must remain static
GETTING CLOSER …
[Figure: the dispatch-flow diagram from the Motivation slide, repeated.]
SIGNALING
SIGNALING (1/3)
 HSA agents support the ability to use signaling objects
 All creation/destruction of signaling objects occurs via HSA
runtime APIs
 From an HSA Agent you can directly access signaling objects.
 Signaling a signal object (this will wake up HSA agents
waiting upon the object)
 Query current object
 Wait on the current object (various conditions supported).
SIGNALING (2/3)
 Advantages
 Enables asynchronous events between HSA agents,
without involving the kernel
 Common idiom for work offload
 Low power waiting
 Implications
 Runtime support required
 Commonly implemented on top of cache coherency flows
SIGNALING (3/3)
 Specifics
 Only supported within a PASID
 Supported wait conditions are =, !=, < and >=
 Wait operations may return sporadically (no guarantee against
false positives)
 Programmer must test.
 Wait operations have a maximum duration before returning.
 The HSAIL atomic operations are supported on signal objects.
 Signal objects are opaque
 Must use dedicated HSAIL/HSA runtime operations
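Because a wait may return before its condition holds and has a bounded duration, callers re-test the value and wait again. A minimal sketch of that idiom follows; `ToySignal`, `toy_wait_eq_once` and `toy_wait_until_eq` are hypothetical stand-ins built on `std::atomic`, not HSA runtime calls.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for an opaque signal object (not the HSA runtime type).
struct ToySignal {
    std::atomic<int64_t> value{0};
};

// One bounded wait. Returns the value last observed; it may return without
// the condition holding (here: when max_wait elapses).
inline int64_t toy_wait_eq_once(ToySignal& s, int64_t expected,
                                std::chrono::microseconds max_wait) {
    assert(max_wait.count() > 0);
    const auto deadline = std::chrono::steady_clock::now() + max_wait;
    int64_t seen = s.value.load(std::memory_order_acquire);
    while (seen != expected && std::chrono::steady_clock::now() < deadline) {
        std::this_thread::yield();
        seen = s.value.load(std::memory_order_acquire);
    }
    return seen;
}

// Caller-side loop: test the returned value and wait again until it matches.
inline void toy_wait_until_eq(ToySignal& s, int64_t expected) {
    while (toy_wait_eq_once(s, expected, std::chrono::microseconds(100)) != expected) {
        // Returned early or timed out: the programmer must test and retry.
    }
}
```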
ALMOST THERE…
[Figure: the dispatch-flow diagram from the Motivation slide, repeated.]
USER MODE QUEUEING
ONE BLOCK LEFT
[Figure: the dispatch-flow diagram from the Motivation slide, repeated.]
USER MODE QUEUEING (1/3)
 User mode Queueing
 Enables user space applications to directly, without OS
intervention, enqueue jobs (“Dispatch Packets”) for HSA
agents.
 Queues are created/destroyed via calls to the HSA
runtime.
 One (or many) agents enqueue packets, a single agent
dequeues packets.
 Requires coherency and shared virtual memory.
USER MODE QUEUEING (2/3)
 Advantages
 Avoid involving the kernel/driver when dispatching work for an Agent.
 Lower latency job dispatch enables finer granularity of offload
 Standard memory protection mechanisms may be used to protect communication with
the consuming agent.
 Implications
 Packet formats/fields are Architected – standard across vendors!
 Guaranteed backward compatibility
 Packets are enqueued/dequeued via an Architected protocol (all via memory
accesses and signaling)
 More on this later……
SUCCESS!
[Figure: the dispatch-flow diagram from the Motivation slide, repeated.]
SUCCESS!
[Figure: the reduced dispatch flow across Application, OS and GPU: Queue Job, Start Job, Finish Job.]
ARCHITECTED QUEUEING
LANGUAGE, QUEUES
ARCHITECTED QUEUEING LANGUAGE
 HSA Queues look just like standard shared
memory queues, supporting multi-producer,
single-consumer
 Single producer variant defined with some
optimizations possible.
 Queues consist of storage, read/write indices, ID,
etc.
 Queues are created/destroyed via calls to the
HSA runtime
 “Packets” are placed in queues directly from user
mode, via an architected protocol
 Packet format is architected
[Figure: two producers and one consumer sharing a queue of packets; the read index, write index and packet storage reside in coherent, shared memory.]
ARCHITECTED QUEUEING LANGUAGE
 Packets are read and dispatched for execution from the queue in order, but
may complete in any order.
 There is no guarantee that more than one packet will be processed in parallel at a
time
 There may be many queues. A single agent may also consume from several
queues.
 Any HSA agent may enqueue packets
 CPUs
 GPUs
 Other accelerators
QUEUE STRUCTURE
Offset (bytes)  Size (bytes)  Field           Notes
0               4             queueType       Differentiate different queues
4               4             queueFeatures   Indicate supported features
8               8             baseAddress     Pointer to packet array
16              8             doorbellSignal  HSA signaling object handle
24              4             size            Packet array cardinality
28              4             queueId         Unique per process
32              8             serviceQueue    Queue for callback services
intrinsic       8             writeIndex      Packet array write index
intrinsic       8             readIndex       Packet array read index
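The visible part of this layout can be written down as a plain struct and checked at compile time. This is a sketch only: `ToyQueue` is a hypothetical name, and it assumes an 8-byte doorbellSignal handle, which is what the listed offsets imply (doorbellSignal starts at 16 and size at 24). The intrinsic writeIndex and readIndex are left out because they are not visible in the structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the visible queue fields (not the HSA runtime type).
struct ToyQueue {
    uint32_t queueType;       // offset 0
    uint32_t queueFeatures;   // offset 4
    uint64_t baseAddress;     // offset 8: pointer to the packet array
    uint64_t doorbellSignal;  // offset 16: signal handle (8 bytes assumed)
    uint32_t size;            // offset 24: packet array cardinality
    uint32_t queueId;         // offset 28
    uint64_t serviceQueue;    // offset 32
};

static_assert(offsetof(ToyQueue, queueType) == 0, "queueType offset");
static_assert(offsetof(ToyQueue, queueFeatures) == 4, "queueFeatures offset");
static_assert(offsetof(ToyQueue, baseAddress) == 8, "baseAddress offset");
static_assert(offsetof(ToyQueue, doorbellSignal) == 16, "doorbellSignal offset");
static_assert(offsetof(ToyQueue, size) == 24, "size offset");
static_assert(offsetof(ToyQueue, queueId) == 28, "queueId offset");
static_assert(offsetof(ToyQueue, serviceQueue) == 32, "serviceQueue offset");
static_assert(sizeof(ToyQueue) == 40, "visible fields span 40 bytes");
```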
QUEUE VARIANTS
 queueType and queueFeatures together define queue semantics and
capabilities
 Two queueType values defined, other values reserved:
 MULTI – queue supports multiple producers
 SINGLE – queue supports single producer
 queueFeatures is a bitfield indicating capabilities
 DISPATCH (bit 0) if set then queue supports DISPATCH packets
 AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets
 All other bits are reserved and must be 0
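As a sketch, the queueFeatures bitfield can be tested with simple masks. The bit positions come from the list above; the constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bit positions as listed above; identifiers are illustrative only.
constexpr uint32_t kFeatureDispatch      = 1u << 0;
constexpr uint32_t kFeatureAgentDispatch = 1u << 1;
constexpr uint32_t kFeatureKnownMask     = kFeatureDispatch | kFeatureAgentDispatch;

// All reserved bits must be 0.
constexpr bool features_valid(uint32_t f) { return (f & ~kFeatureKnownMask) == 0; }

constexpr bool supports_dispatch(uint32_t f) { return (f & kFeatureDispatch) != 0; }

constexpr bool supports_agent_dispatch(uint32_t f) {
    return (f & kFeatureAgentDispatch) != 0;
}
```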
QUEUE STRUCTURE DETAILS
 Queue doorbells are HSA signaling objects with restrictions
 Created as part of the queue – lifetime tied to queue object
 Atomic read-modify-write not allowed
 size field value must be a power of 2
 serviceQueue can be used by HSA kernel for callback services
 Provided by application when queue is created
 Can be mapped to HSA runtime provided serviceQueue, an application serviced
queue, or NULL if no serviceQueue required
READ/WRITE INDICES
 readIndex and writeIndex properties are part of the queue, but not visible in the queue structure
 Accessed through HSA runtime API and HSAIL operations
 HSA runtime/HSAIL operations defined to
 Read readIndex or writeIndex property
 Write readIndex or writeIndex property
 Add constant to writeIndex property (returns previous writeIndex value)
 CAS on writeIndex property
 readIndex & writeIndex operations treated as atomic in memory model
 relaxed, acquire, release and acquire-release variants defined as applicable
 readIndex and writeIndex never wrap
 PacketID – the index of a particular packet
 Uniquely identifies each packet of a queue
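Since the indices never wrap, a packetID maps to a slot in the packet array by masking with size-1, which relies on size being a power of 2. A small sketch with hypothetical helper names:

```cpp
#include <cassert>
#include <cstdint>

constexpr bool is_power_of_two(uint32_t n) { return n != 0 && (n & (n - 1)) == 0; }

// Slot in the packet array that a monotonically increasing packetID maps to.
inline uint32_t slot_for(uint64_t packetID, uint32_t size) {
    assert(is_power_of_two(size));
    return static_cast<uint32_t>(packetID & (size - 1));
}

// A reserved slot may be written once the consumer has moved past the
// packet that previously occupied it.
inline bool slot_free(uint64_t packetID, uint64_t readIndex, uint32_t size) {
    return packetID < readIndex + size;
}
```

With size 8, packetIDs 0, 8 and 16 all land in slot 0, yet each packetID still identifies exactly one packet.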
PACKET ENQUEUE
 Packet enqueue follows a few simple steps:
 Reserve space
 Multiple packets can be reserved at a time
 Write packet to queue
 Mark packet as valid
 Producer no longer allowed to modify packet
 Consumer is allowed to start processing packet
 Notify consumer of packet through the queue doorbell
 Multiple packets can be notified at a time
 Doorbell signal should be signaled with last packetID notified
 On small machine model the lower 32 bits of the packetID are used
PACKET RESERVATION
 Two flows envisaged
 Atomic add writeIndex with number of packets to reserve
 Producer must wait until packetID < readIndex + size before writing to packet
 Queue can be sized so that wait is unlikely (or impossible)
 Suitable when many threads use one queue
 Check queue not full first, then use atomic CAS to update writeIndex
 Can be inefficient if many threads use the same queue
 Allows different failure model if queue is congested
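The two reservation flows can be sketched side by side. Here plain `std::atomic` counters stand in for the writeIndex and readIndex properties; `ToyIndices`, `reserve_add`, `can_write` and `try_reserve_cas` are hypothetical names, and the code assumes C++17 for `std::optional`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the queue's index properties.
struct ToyIndices {
    std::atomic<uint64_t> writeIndex{0};
    std::atomic<uint64_t> readIndex{0};
    uint32_t size = 0;  // packet array cardinality, a power of 2
};

// Flow 1: unconditional atomic add. Returns the first reserved packetID;
// the caller must still wait for can_write() before touching the slot.
inline uint64_t reserve_add(ToyIndices& q, uint64_t count) {
    assert(count > 0);
    return q.writeIndex.fetch_add(count, std::memory_order_relaxed);
}

inline bool can_write(const ToyIndices& q, uint64_t packetID) {
    return packetID < q.readIndex.load(std::memory_order_acquire) + q.size;
}

// Flow 2: check that the queue is not full, then CAS writeIndex forward.
// Returns std::nullopt when the queue is full, so the caller can choose
// its own failure handling instead of waiting.
inline std::optional<uint64_t> try_reserve_cas(ToyIndices& q) {
    for (;;) {
        const uint64_t rd = q.readIndex.load(std::memory_order_acquire);
        uint64_t wr = q.writeIndex.load(std::memory_order_relaxed);
        if (wr - rd >= q.size) return std::nullopt;  // full
        if (q.writeIndex.compare_exchange_weak(wr, wr + 1,
                                               std::memory_order_relaxed)) {
            return wr;  // reserved packetID
        }
        // Another producer won the race (or the CAS failed spuriously): retry.
    }
}
```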
QUEUE OPTIMIZATIONS
 Queue behavior is loosely defined to allow optimizations
 Some potential producer behavior optimizations:
 Keep local copy of readIndex, update when required
 For single producer queues:
 Keep local copy of writeIndex
 Use store operation rather than add/cas atomic to update writeIndex
 Some potential consumer behavior optimizations:
 Use packet format field to determine whether a packet has been submitted rather than writeIndex
property
 Speculatively read multiple packets from the queue
 Not update readIndex for each packet processed
 Rely on value used for doorbellSignal to notify new packets
 Especially useful for single producer queues
POTENTIAL MULTI-PRODUCER ALGORITHM
// Allocate packet
uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);
// Wait until the queue is no longer full.
uint64_t rdIdx;
do {
    rdIdx = hsa_queue_load_read_index_relaxed(q);
} while (packetID >= (rdIdx + q->size));
// calculate index
uint32_t arrayIdx = packetID & (q->size-1);
// copy over the packet, the format field is INVALID
q->baseAddress[arrayIdx] = pkt;
// Update format field with release semantics
q->baseAddress[arrayIdx].hdr.format.store(DISPATCH, std::memory_order_release);
// ring doorbell (relaxed signal here; could also amortize over multiple packets)
hsa_signal_send_relaxed(q->doorbellSignal, packetID);
POTENTIAL CONSUMER ALGORITHM
// Get location of next packet
uint64_t readIndex = hsa_queue_load_read_index_relaxed(q);
// calculate the index
uint32_t arrayIdx = readIndex & (q->size-1);
// spin while empty (could also perform low-power wait on doorbell)
while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }
// copy over the packet
pkt = q->baseAddress[arrayIdx];
// set the format field to invalid
q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);
// Update the readIndex using HSA intrinsic
hsa_queue_store_read_index_relaxed(q, readIndex+1);
// Now process <pkt>!
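For experimentation, the producer and consumer steps above can be modelled with standard C++ atomics. This is a toy sketch, not the HSA runtime: `ToyRing`, `ToyPacket`, `toy_enqueue` and `toy_dequeue` are hypothetical names, the numeric format values are assumed, and the doorbell is omitted (the consumer spins on the format field instead).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

enum ToyFormat : uint8_t { INVALID = 1, DISPATCH = 2 };  // values assumed

struct ToyPacket {
    std::atomic<uint8_t> format{INVALID};
    uint64_t payload = 0;
};

struct ToyRing {
    explicit ToyRing(uint32_t n) : size(n), slots(n) {
        assert(n != 0 && (n & (n - 1)) == 0);  // size must be a power of 2
    }
    uint32_t size;
    std::vector<ToyPacket> slots;
    std::atomic<uint64_t> writeIndex{0};
    std::atomic<uint64_t> readIndex{0};
};

// Producer side: reserve, wait for space, write, then publish with release.
inline void toy_enqueue(ToyRing& q, uint64_t payload) {
    const uint64_t id = q.writeIndex.fetch_add(1, std::memory_order_relaxed);
    while (id >= q.readIndex.load(std::memory_order_acquire) + q.size) {
        std::this_thread::yield();  // queue full
    }
    ToyPacket& p = q.slots[id & (q.size - 1)];
    p.payload = payload;
    p.format.store(DISPATCH, std::memory_order_release);
}

// Consumer side (single consumer): wait for a valid packet, copy it out,
// invalidate the slot, then advance readIndex to free the slot.
inline uint64_t toy_dequeue(ToyRing& q) {
    const uint64_t id = q.readIndex.load(std::memory_order_relaxed);
    ToyPacket& p = q.slots[id & (q.size - 1)];
    while (p.format.load(std::memory_order_acquire) == INVALID) {
        std::this_thread::yield();  // queue empty
    }
    const uint64_t payload = p.payload;
    p.format.store(INVALID, std::memory_order_relaxed);
    q.readIndex.store(id + 1, std::memory_order_release);
    return payload;
}
```

The release store of readIndex pairs with the producer's acquire load, so a producer reusing a slot observes it as already invalidated.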
ARCHITECTED QUEUEING
LANGUAGE, PACKETS
PACKETS
 Packets come in three main types with architected layouts (Dispatch, Agent Dispatch and Barrier), alongside the Always reserved and Invalid formats
 Always reserved & Invalid
 Do not contain any valid tasks and are not processed (queue will not progress)
 Dispatch
 Specifies kernel execution over a grid
 Agent Dispatch
 Specifies a single function to perform with a set of parameters
 Barrier
 Used for task dependencies
COMMON PACKET HEADER
Start Offset (Bytes)  Format    Field Name           Description
0                     uint16_t  format:8             Contains the packet type (Always reserved, Invalid, Dispatch, Agent Dispatch, and Barrier). Other values are reserved and should not be used.
                                barrier:1            If set then processing of packet will only begin when all preceding packets are complete.
                                acquireFenceScope:2  Determines the scope and type of the memory fence operation applied before the packet enters the active phase. Must be 0 for Barrier Packets.
                                releaseFenceScope:2  Determines the scope and type of the memory fence operation applied after kernel completion but before the packet is completed.
                                reserved:3           Must be 0
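A sketch of packing and unpacking this 16-bit header follows. It assumes the fields are allocated from the least significant bit in the order listed (format in bits 0-7, barrier in bit 8, acquireFenceScope in bits 9-10, releaseFenceScope in bits 11-12); the slide gives only the widths, so the bit positions and all function names are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Bit positions assumed: fields packed LSB-first in the listed order.
constexpr uint16_t make_header(uint8_t format, bool barrier,
                               uint8_t acquireFenceScope,
                               uint8_t releaseFenceScope) {
    return static_cast<uint16_t>(
        static_cast<uint32_t>(format) |
        ((barrier ? 1u : 0u) << 8) |
        ((acquireFenceScope & 0x3u) << 9) |
        ((releaseFenceScope & 0x3u) << 11));  // bits 13-15 reserved, left 0
}

constexpr uint8_t header_format(uint16_t h) { return static_cast<uint8_t>(h & 0xFFu); }
constexpr bool header_barrier(uint16_t h) { return ((h >> 8) & 0x1u) != 0; }
constexpr uint8_t header_acquire_scope(uint16_t h) {
    return static_cast<uint8_t>((h >> 9) & 0x3u);
}
constexpr uint8_t header_release_scope(uint16_t h) {
    return static_cast<uint8_t>((h >> 11) & 0x3u);
}
```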
DISPATCH PACKET
Start Offset (Bytes)  Format    Field Name               Description
0                     uint16_t  header                   Packet header
2                     uint16_t  dimensions:2             Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
                                reserved:14              Must be 0.
4                     uint16_t  workgroupSize.x          x dimension of work-group (measured in work-items).
6                     uint16_t  workgroupSize.y          y dimension of work-group (measured in work-items).
8                     uint16_t  workgroupSize.z          z dimension of work-group (measured in work-items).
10                    uint16_t  reserved2                Must be 0.
12                    uint32_t  gridSize.x               x dimension of grid (measured in work-items).
16                    uint32_t  gridSize.y               y dimension of grid (measured in work-items).
20                    uint32_t  gridSize.z               z dimension of grid (measured in work-items).
24                    uint32_t  privateSegmentSizeBytes  Total size in bytes of private memory allocation request (per work-item).
28                    uint32_t  groupSegmentSizeBytes    Total size in bytes of group memory allocation request (per work-group).
32                    uint64_t  kernelObjectAddress      Address of an object in memory that includes an implementation-defined executable ISA image for the kernel.
40                    uint64_t  kernargAddress           Address of memory containing kernel arguments.
48                    uint64_t  reserved3                Must be 0.
56                    uint64_t  completionSignal         Address of HSA signaling object used to indicate completion of the job.
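The dispatch packet layout can be mirrored in a struct and verified against the listed offsets. This is a sketch with hypothetical names: `ToyDispatchPacket` keeps the 2-bit dimensions field and its 14 reserved bits in one `uint16_t`, and assumes dimensions occupies the low two bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the dispatch packet (not the HSA runtime type).
struct ToyDispatchPacket {
    uint16_t header;
    uint16_t dimensions;  // low 2 bits assumed to hold dimensions; rest reserved
    uint16_t workgroupSizeX;
    uint16_t workgroupSizeY;
    uint16_t workgroupSizeZ;
    uint16_t reserved2;
    uint32_t gridSizeX;
    uint32_t gridSizeY;
    uint32_t gridSizeZ;
    uint32_t privateSegmentSizeBytes;
    uint32_t groupSegmentSizeBytes;
    uint64_t kernelObjectAddress;
    uint64_t kernargAddress;
    uint64_t reserved3;
    uint64_t completionSignal;
};

static_assert(offsetof(ToyDispatchPacket, workgroupSizeX) == 4, "workgroupSize.x");
static_assert(offsetof(ToyDispatchPacket, gridSizeX) == 12, "gridSize.x");
static_assert(offsetof(ToyDispatchPacket, privateSegmentSizeBytes) == 24, "private");
static_assert(offsetof(ToyDispatchPacket, kernelObjectAddress) == 32, "kernel object");
static_assert(offsetof(ToyDispatchPacket, kernargAddress) == 40, "kernarg");
static_assert(offsetof(ToyDispatchPacket, completionSignal) == 56, "completion");
static_assert(sizeof(ToyDispatchPacket) == 64, "packet is 64 bytes");

// Valid dimension counts are 1, 2 or 3; the reserved bits must be 0.
inline bool dimensions_valid(const ToyDispatchPacket& p) {
    return p.dimensions >= 1 && p.dimensions <= 3;
}
```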
AGENT DISPATCH PACKET
Start Offset (Bytes)  Format    Field Name        Description
0                     uint16_t  header            Packet header
2                     uint16_t  type              The function to be performed by the destination Agent. The type value is split into the following ranges:
                                                    0x0000:0x3FFF – Vendor specific
                                                    0x4000:0x7FFF – HSA runtime
                                                    0x8000:0xFFFF – User registered function
4                     uint32_t  reserved2         Must be 0.
8                     uint64_t  returnLocation    Pointer to location to store the function return value in.
16                    uint64_t  arg[0]            64-bit direct or indirect arguments.
24                    uint64_t  arg[1]
32                    uint64_t  arg[2]
40                    uint64_t  arg[3]
48                    uint64_t  reserved3         Must be 0.
56                    uint64_t  completionSignal  Address of HSA signaling object used to indicate completion of the job.
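The three type ranges can be decoded with two comparisons. A sketch, with the enum and function names being hypothetical:

```cpp
#include <cassert>
#include <cstdint>

enum class AgentFunctionOwner { Vendor, HsaRuntime, User };

// Ranges as listed: 0x0000:0x3FFF vendor specific, 0x4000:0x7FFF HSA runtime,
// 0x8000:0xFFFF user registered function.
constexpr AgentFunctionOwner classify_type(uint16_t type) {
    return type <= 0x3FFF ? AgentFunctionOwner::Vendor
         : type <= 0x7FFF ? AgentFunctionOwner::HsaRuntime
                          : AgentFunctionOwner::User;
}
```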
BARRIER PACKET
 Used for specifying dependences between packets
 HSA agent will not launch any further packets from this queue until the barrier
packet signal conditions are met
 Used for specifying dependences on packets dispatched from any queue.
 Execution phase completes only when all of the dependent signals (up to five) have
been signaled (with the value of 0).
 Or if an error has occurred in one of the packets upon which we have a dependence.
BARRIER PACKET
Start Offset (Bytes)  Format    Field Name        Description
0                     uint16_t  header            Packet header, see 2.8.1 Packet header (p. 16).
2                     uint16_t  reserved2         Must be 0.
4                     uint32_t  reserved3         Must be 0.
8                     uint64_t  depSignal0        Address of dependent signaling objects to be evaluated by the packet processor.
16                    uint64_t  depSignal1
24                    uint64_t  depSignal2
32                    uint64_t  depSignal3
40                    uint64_t  depSignal4
48                    uint64_t  reserved4         Must be 0.
56                    uint64_t  completionSignal  Address of HSA signaling object used to indicate completion of the job.
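The barrier condition (all dependent signals have reached 0) can be sketched as below. `ToyBarrier` and `barrier_satisfied` are hypothetical, `std::atomic<int64_t>` stands in for a signal object, unused depSignal slots are assumed to be null, and the error path mentioned earlier is not modelled.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

using ToySignalValue = std::atomic<int64_t>;  // stand-in for a signal object

// Up to five dependencies; unused slots are assumed to be nullptr.
struct ToyBarrier {
    std::array<const ToySignalValue*, 5> depSignal{};
};

// True once every dependency that is present has been signaled with 0.
inline bool barrier_satisfied(const ToyBarrier& b) {
    for (const ToySignalValue* s : b.depSignal) {
        if (s != nullptr && s->load(std::memory_order_acquire) != 0) {
            return false;
        }
    }
    return true;
}
```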
DEPENDENCES
 A user may never assume more than one packet is being executed by an HSA
agent at a time.
 Implications:
 Packets can’t poll on shared memory values which will be set by packets issued from
other queues, unless the user has ensured the proper ordering.
 To ensure all previous packets from a queue have been completed, use the Barrier
bit.
 To ensure specific packets from any queue have completed, use the Barrier packet.
HSA QUEUEING, PACKET EXECUTION
PACKET EXECUTION
 Launch phase
 Initiated when launch conditions are met
 All preceding packets in the queue must have exited launch phase
 If the barrier bit in the packet header is set, then all preceding packets in the queue
must have exited completion phase
 Includes memory acquire fence
 Active phase
 Execute the packet
 Barrier packets remain in Active phase until conditions are met.
 Completion phase
 First step is memory release fence – make results visible.
 completionSignal field is then signaled with a decrementing atomic.
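The completion step can be sketched as a release-ordered atomic decrement that a waiter observes with acquire ordering. One possible use, assumed here rather than stated on the slide, is to initialize a shared signal to the number of outstanding packets and wait for 0. The names `complete_packet` and `all_done` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Completion phase of one packet: the release ordering publishes the
// packet's results before the decrement becomes visible.
// Returns the signal value after the decrement.
inline int64_t complete_packet(std::atomic<int64_t>& completionSignal) {
    return completionSignal.fetch_sub(1, std::memory_order_release) - 1;
}

// Waiter side: acquire pairs with the release decrement above.
inline bool all_done(const std::atomic<int64_t>& completionSignal) {
    return completionSignal.load(std::memory_order_acquire) == 0;
}
```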
PACKET EXECUTION – BARRIER BIT
[Figure: timeline. Pkt1 and Pkt2 launch, execute and complete; Pkt3 (barrier=1) launches only when all preceding packets in the queue have completed, then executes.]
PUTTING IT ALL TOGETHER (FFT)
[Figure: FFT example over time. Inputs X[0] to X[7] are processed in three stages: Packets 1 and 2, a barrier, Packets 3 and 4, a barrier, then Packets 5 and 6.]
PUTTING IT ALL TOGETHER
AQL Pseudo Code
// Send the packets to do the first stage.
aql_dispatch(pkt1);
aql_dispatch(pkt2);
// Send the next two packets, setting the barrier bit so we
// know packets 1 & 2 will be complete before 3 and 4 are
// launched.
aql_dispatch_with_barrier_bit(pkt3);
aql_dispatch(pkt4);
// Same as above (make sure 3 & 4 are done before issuing 5
// & 6)
aql_dispatch_with_barrier_bit(pkt5);
aql_dispatch(pkt6);
// This packet will notify us when 5 & 6 are complete
aql_dispatch_with_barrier_bit(finish_pkt);
PACKET EXECUTION – BARRIER PACKET
[Figure: timeline across two queues. Q1 holds task T1, whose completionSignal is signal X (initialized to 1). Q2 holds a Barrier packet with depSignal0 set to signal X, followed by task T2. T1 launches, executes and, on completion, decrements signal X. The Barrier completes when signal X is signalled with 0, and T2 launches once the barrier is complete.]
DEPTH FIRST CHILD TASK EXECUTION
 Consider two generations of child tasks
 Task T submits tasks T.1 & T.2
 Task T.1 submits tasks T.1.1 & T.1.2
 Task T.2 submits tasks T.2.1 & T.2.2
 Desired outcome
 Depth first child task execution
 I.e. T → T.1 → T.1.1 → T.1.2 → T.2 → T.2.1 → T.2.2
 T passed signal (allComplete) to decrement when all tasks are complete (T and its
children etc)
[Figure: task tree. T has children T.1 and T.2; T.1 has children T.1.1 and T.1.2; T.2 has children T.2.1 and T.2.2.]
HOW TO DO THIS WITH HSA QUEUES?
 Use a separate user mode queue for each recursion level
 Task T submits to queue Q1
 Tasks T.1 & T.2 submit tasks to queue Q2
 Queues could be passed in as parameters to task T
 Depth first requires ordering of T.1, T.2 and their children
 Use additional signal object (childrenComplete) to track completion of the children of
T.1 & T.2
 childrenComplete set to number of children (i.e. 2) by each of T.1 & T.2
A PICTURE SAYS MORE THAN 1000 WORDS
[Figure: the task tree alongside the two queues. Q1 holds T.1, Barrier, T.2, Barrier; Q2 holds T.1.1, T.1.2, T.2.1, T.2.2. Labels on the barriers: "Wait on childrenComplete" and "Signal allComplete".]
SUMMARY
KEY HSA TECHNOLOGIES
 HSA combines several mechanisms to enable low overhead task
dispatch
 Shared Virtual Memory
 System Coherency
 Signaling
 AQL
 User mode queues – from any compatible agent
 Architected packet format
 Rich dependency mechanism
 Flexible and efficient signaling of completion
QUESTIONS?

ISCA final presentation - Queuing Model

  • 1.
    HSA QUEUING MODEL HAKANPERSSON, SENIOR PRINCIPAL ENGINEER, ARM
  • 2.
  • 3.
    MOTIVATION (TODAY’S PICTURE) ©Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  • 4.
  • 5.
    REQUIREMENTS  Three keytechnologies are used to build the user mode queueing mechanism  Shared Virtual Memory  System Coherency  Signaling  AQL (Architected Queueing Language) enables any agent enqueue tasks © Copyright 2014 HSA Foundation. All Rights Reserved
  • 6.
  • 7.
    PHYSICAL MEMORY SHARED VIRTUALMEMORY (TODAY)  Multiple Virtual memory address spaces © Copyright 2014 HSA Foundation. All Rights Reserved CPU0 GPU VIRTUAL MEMORY1 PHYSICAL MEMORY VA1->PA1 VA2->PA1 VIRTUAL MEMORY2
  • 8.
    PHYSICAL MEMORY SHARED VIRTUALMEMORY (HSA)  Common Virtual Memory for all HSA agents © Copyright 2014 HSA Foundation. All Rights Reserved CPU0 GPU VIRTUAL MEMORY PHYSICAL MEMORY VA->PA VA->PA
  • 9.
    SHARED VIRTUAL MEMORY Advantages  No mapping tricks, no copying back-and-forth between different PA addresses  Send pointers (not data) back and forth between HSA agents.  Implications  Common Page Tables (and common interpretation of architectural semantics such as shareability, protection, etc).  Common mechanisms for address translation (and servicing address translation faults)  Concept of a process address space (PASID) to allow multiple, per process virtual address spaces within the system. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 10.
    SHARED VIRTUAL MEMORY Specifics  Minimum supported VA width is 48b for 64b systems, and 32b for 32b systems.  HSA agents may reserve VA ranges for internal use via system software.  All HSA agents other than the host unit must use the lowest privilege level  If present, read/write access flags for page tables must be maintained by all agents.  Read/write permissions apply to all HSA agents, equally. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 11.
    GETTING THERE … ©Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  • 12.
  • 13.
    CACHE COHERENCY DOMAINS(1/3)  Data accesses to global memory segment from all HSA Agents shall be coherent without the need for explicit cache maintenance. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 14.
    CACHE COHERENCY DOMAINS(2/3)  Advantages  Composability  Reduced SW complexity when communicating between agents  Lower barrier to entry when porting software  Implications  Hardware coherency support between all HSA agents  Can take many forms  Stand alone Snoop Filters / Directories  Combined L3/Filters  Snoop-based systems (no filter)  Etc … © Copyright 2014 HSA Foundation. All Rights Reserved
  • 15.
    CACHE COHERENCY DOMAINS(3/3)  Specifics  No requirement for instruction memory accesses to be coherent  Only applies to the Primary memory type.  No requirement for HSA agents to maintain coherency to any memory location where the HSA agents do not specify the same memory attributes  Read-only image data is required to remain static during the execution of an HSA kernel.  No double mapping (via different attributes) in order to modify. Must remain static © Copyright 2014 HSA Foundation. All Rights Reserved
  • 16.
    GETTING CLOSER … ©Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  • 17.
  • 18.
    SIGNALING (1/3)  HSAagents support the ability to use signaling objects  All creation/destruction signaling objects occurs via HSA runtime APIs  From an HSA Agent you can directly access signaling objects.  Signaling a signal object (this will wake up HSA agents waiting upon the object)  Query current object  Wait on the current object (various conditions supported). © Copyright 2014 HSA Foundation. All Rights Reserved
  • 19.
    SIGNALING (2/3)  Advantages Enables asynchronous events between HSA agents, without involving the kernel  Common idiom for work offload  Low power waiting  Implications  Runtime support required  Commonly implemented on top of cache coherency flows © Copyright 2014 HSA Foundation. All Rights Reserved
  • 20.
    SIGNALING (3/3)  Specifics Only supported within a PASID  Supported wait conditions are =, !=, < and >=  Wait operations may return sporadically (no guarantee against false positives)  Programmer must test.  Wait operations have a maximum duration before returning.  The HSAIL atomic operations are supported on signal objects.  Signal objects are opaque  Must use dedicated HSAIL/HSA runtime operations © Copyright 2014 HSA Foundation. All Rights Reserved
  • 21.
    ALMOST THERE… © Copyright2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  • 22.
  • 23.
    ONE BLOCK LEFT ©Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  • 24.
    USER MODE QUEUEING(1/3)  User mode Queueing  Enables user space applications to directly, without OS intervention, enqueue jobs (“Dispatch Packets”) for HSA agents.  Queues are created/destroyed via calls to the HSA runtime.  One (or many) agents enqueue packets, a single agent dequeues packets.  Requires coherency and shared virtual memory. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 25.
    USER MODE QUEUEING(2/3)  Advantages  Avoid involving the kernel/driver when dispatching work for an Agent.  Lower latency job dispatch enables finer granularity of offload  Standard memory protection mechanisms may be used to protect communication with the consuming agent.  Implications  Packet formats/fields are Architected – standard across vendors!  Guaranteed backward compatibility  Packets are enqueued/dequeued via an Architected protocol (all via memory accesses and signaling)  More on this later…… © Copyright 2014 HSA Foundation. All Rights Reserved
  • 26.
    SUCCESS! © Copyright 2014HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  • 27.
    SUCCESS! © Copyright 2014HSA Foundation. All Rights Reserved Application OS GPU Queue Job Start Job Finish Job
  • 28.
  • 29.
    ARCHITECTED QUEUEING LANGUAGE HSA Queues look just like standard shared memory queues, supporting multi-producer, single-consumer  Single producer variant defined with some optimizations possible.  Queues consist of storage, read/write indices, ID, etc.  Queues are created/destroyed via calls to the HSA runtime  “Packets” are placed in queues directly from user mode, via an architected protocol  Packet format is architected © Copyright 2014 HSA Foundation. All Rights Reserved Producer Producer Consumer Read Index Write Index Storage in coherent, shared memory Packets
  • 30.
    ARCHITECTED QUEUING LANGUAGE Packets are read and dispatched for execution from the queue in order, but may complete in any order.  There is no guarantee that more than one packet will be processed in parallel at a time  There may be many queues. A single agent may also consume from several queues.  Any HSA agent may enqueue packets  CPUs  GPUs  Other accelerators © Copyright 2014 HSA Foundation. All Rights Reserved
  • 31.
    QUEUE STRUCTURE © Copyright2014 HSA Foundation. All Rights Reserved Offset (bytes) Size (bytes) Field Notes 0 4 queueType Differentiate different queues 4 4 queueFeatures Indicate supported features 8 8 baseAddress Pointer to packet array 16 16 doorbellSignal HSA signaling object handle 24 4 size Packet array cardinality 28 4 queueId Unique per process 32 8 serviceQueue Queue for callback services intrinsic 8 writeIndex Packet array write index intrinsic 8 readIndex Packet array read index
  • 32.
    QUEUE VARIANTS  queueTypeand queueFeatures together define queue semantics and capabilities  Two queueType values defined, other values reserved:  MULTI – queue supports multiple producers  SINGLE – queue supports single producer  queueFeatures is a bitfield indicating capabilities  DISPATCH (bit 0) if set then queue supports DISPATCH packets  AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets  All other bits are reserved and must be 0 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 33.
    QUEUE STRUCTURE DETAILS Queue doorbells are HSA signaling objects with restrictions  Created as part of the queue – lifetime tied to queue object  Atomic read-modify-write not allowed  size field value must be aligned to a power of 2  serviceQueue can be used by HSA kernel for callback services  Provided by application when queue is created  Can be mapped to HSA runtime provided serviceQueue, an application serviced queue, or NULL if no serviceQueue required © Copyright 2014 HSA Foundation. All Rights Reserved
  • 34.
    READ/WRITE INDICES  readIndexand writeIndex properties are part of the queue, but not visible in the queue structure  Accessed through HSA runtime API and HSAIL operations  HSA runtime/HSAIL operations defined to  Read readIndex or writeIndex property  Write readIndex or writeIndex property  Add constant to writeIndex property (returns previous writeIndex value)  CAS on writeIndex property  readIndex & writeIndex operations treated as atomic in memory model  relaxed, acquire, release and acquire-release variants defined as applicable  readIndex and writeIndex never wrap  PacketID – the index of a particular packet  Uniquely identifies each packet of a queue © Copyright 2014 HSA Foundation. All Rights Reserved
  • 35.
    PACKET ENQUEUE  Packetenqueue follows a few simple steps:  Reserve space  Multiple packets can be reserved at a time  Write packet to queue  Mark packet as valid  Producer no longer allowed to modify packet  Consumer is allowed to start processing packet  Notify consumer of packet through the queue doorbell  Multiple packets can be notified at a time  Doorbell signal should be signaled with last packetID notified  On small machine model the lower 32 bits of the packetID are used © Copyright 2014 HSA Foundation. All Rights Reserved
  • 36.
    PACKET RESERVATION
     Two flows envisaged
       Atomic add on writeIndex with the number of packets to reserve
         Producer must wait until packetID < readIndex + size before writing to the packet
         Queue can be sized so that the wait is unlikely (or impossible)
         Suitable when many threads use one queue
       Check that the queue is not full first, then use atomic CAS to update writeIndex
         Can be inefficient if many threads use the same queue
         Allows a different failure model if the queue is congested
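The second flow (check for space, then CAS) might look like the sketch below. It models the indices with `std::atomic` rather than the HSA runtime calls, and `RingState` and `tryReserve` are hypothetical names. Returning `false` on a full queue is one possible "different failure model"; the slide does not prescribe it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Stand-in for the queue's hidden indices plus its size (hypothetical type).
struct RingState {
    std::atomic<uint64_t> readIndex{0};
    std::atomic<uint64_t> writeIndex{0};
    uint32_t size = 0;  // number of packet slots
};

// Try to reserve 'count' consecutive packets without ever overfilling the queue.
// On success, 'firstID' holds the first reserved packetID.
// On a full queue the function returns false instead of waiting.
inline bool tryReserve(RingState& q, uint32_t count, uint64_t& firstID) {
    uint64_t wr = q.writeIndex.load(std::memory_order_relaxed);
    for (;;) {
        uint64_t rd = q.readIndex.load(std::memory_order_acquire);
        // The last reserved packetID must satisfy packetID < readIndex + size.
        if (wr + count > rd + q.size) {
            return false;
        }
        // On failure 'wr' is refreshed with the current writeIndex and we retry.
        if (q.writeIndex.compare_exchange_weak(wr, wr + count,
                                               std::memory_order_relaxed)) {
            firstID = wr;
            return true;
        }
    }
}
```

Compared with the atomic-add flow, a producer never holds a reservation it cannot yet write to, at the cost of retries when many threads contend for the same queue.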
  • 37.
    QUEUE OPTIMIZATIONS
     Queue behavior is loosely defined to allow optimizations
     Some potential producer behavior optimizations:
       Keep a local copy of readIndex, update when required
       For single producer queues:
         Keep a local copy of writeIndex
         Use a store operation rather than an add/CAS atomic to update writeIndex
     Some potential consumer behavior optimizations:
       Use the packet format field, rather than the writeIndex property, to determine whether a packet has been submitted
       Speculatively read multiple packets from the queue
       Do not update readIndex for each packet processed
       Rely on the value used for doorbellSignal to notify new packets
         Especially useful for single producer queues
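The single-producer optimizations can be illustrated with a small sketch: the producer keeps private copies of both indices, refreshes its cached readIndex only when the queue looks full, and publishes writeIndex with a plain atomic store. The types and names here are hypothetical and the shared indices are modeled with `std::atomic`, not the HSA runtime API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Stand-in for the queue's hidden indices (hypothetical type).
struct SharedIndices {
    std::atomic<uint64_t> readIndex{0};
    std::atomic<uint64_t> writeIndex{0};
};

// Valid only when exactly one producer uses the queue.
class SingleProducer {
public:
    SingleProducer(SharedIndices& shared, uint32_t size)
        : shared_(shared), size_(size) {
        assert(size != 0 && (size & (size - 1)) == 0);  // size is a power of 2
    }

    // Reserve one packet; returns false if the queue is full.
    bool tryReserve(uint64_t& packetID) {
        if (localWrite_ >= cachedRead_ + size_) {
            // Only touch the shared readIndex when the cached copy says "full".
            cachedRead_ = shared_.readIndex.load(std::memory_order_acquire);
            if (localWrite_ >= cachedRead_ + size_) {
                return false;
            }
        }
        packetID = localWrite_++;
        // A store is enough: no other producer can change writeIndex.
        shared_.writeIndex.store(localWrite_, std::memory_order_relaxed);
        return true;
    }

private:
    SharedIndices& shared_;
    uint32_t size_;
    uint64_t localWrite_ = 0;  // local copy of writeIndex
    uint64_t cachedRead_ = 0;  // possibly stale copy of readIndex
};
```

A stale `cachedRead_` is safe because readIndex only grows, so the cached value can only underestimate the free space.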
  • 38.
    POTENTIAL MULTI-PRODUCER ALGORITHM
    // Allocate packet
    uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);

    // Wait until the queue is no longer full.
    uint64_t rdIdx;
    do {
      rdIdx = hsa_queue_load_read_index_relaxed(q);
    } while (packetID >= (rdIdx + q->size));

    // Calculate index (size is a power of 2)
    uint32_t arrayIdx = packetID & (q->size-1);

    // Copy over the packet, the format field is INVALID
    q->baseAddress[arrayIdx] = pkt;

    // Update format field with release semantics
    q->baseAddress[arrayIdx].hdr.format.store(DISPATCH, std::memory_order_release);

    // Ring doorbell (could also amortize over multiple packets)
    hsa_signal_send_relaxed(q->doorbellSignal, packetID);
  • 39.
    POTENTIAL CONSUMER ALGORITHM
    // Get location of next packet
    uint64_t readIndex = hsa_queue_load_read_index_relaxed(q);

    // Calculate the index
    uint32_t arrayIdx = readIndex & (q->size-1);

    // Spin while empty (could also perform low-power wait on doorbell)
    while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }

    // Copy over the packet
    pkt = q->baseAddress[arrayIdx];

    // Set the format field to invalid
    q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);

    // Update the readIndex using HSA intrinsic
    hsa_queue_store_read_index_relaxed(q, readIndex+1);

    // Now process <pkt>!
  • 40.
  • 41.
    PACKETS
     Packets come in three main types with architected layouts, in addition to the Always reserved and Invalid formats
       Always reserved & Invalid
         Do not contain any valid tasks and are not processed (queue will not progress)
       Dispatch
         Specifies kernel execution over a grid
       Agent Dispatch
         Specifies a single function to perform with a set of parameters
       Barrier
         Used for task dependencies
  • 42.
    COMMON PACKET HEADER
    Start Offset (Bytes) | Format | Field Name | Description
    0 | uint16_t | format:8 | Contains the packet type (Always reserved, Invalid, Dispatch, Agent Dispatch, and Barrier). Other values are reserved and should not be used.
      |          | barrier:1 | If set then processing of packet will only begin when all preceding packets are complete.
      |          | acquireFenceScope:2 | Determines the scope and type of the memory fence operation applied before the packet enters the active phase. Must be 0 for Barrier Packets.
      |          | releaseFenceScope:2 | Determines the scope and type of the memory fence operation applied after kernel completion but before the packet is completed.
      |          | reserved:3 | Must be 0.
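The header fields can be packed and unpacked with shifts and masks. The sketch below assumes the fields are laid out from the least significant bit in the order the table lists them (format in bits 0-7, barrier in bit 8, acquireFenceScope in bits 9-10, releaseFenceScope in bits 11-12); the slide does not state the bit order explicitly, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout (LSB first): format:8, barrier:1, acquireFenceScope:2,
// releaseFenceScope:2, reserved:3 (left as 0).
constexpr uint16_t makeHeader(uint8_t format, bool barrier,
                              uint8_t acquireScope, uint8_t releaseScope) {
    return static_cast<uint16_t>(
        static_cast<unsigned>(format) |
        (barrier ? (1u << 8) : 0u) |
        ((acquireScope & 0x3u) << 9) |
        ((releaseScope & 0x3u) << 11));
}

constexpr uint8_t headerFormat(uint16_t h) {
    return static_cast<uint8_t>(h & 0xFFu);
}

constexpr bool headerBarrier(uint16_t h) {
    return ((h >> 8) & 0x1u) != 0;
}

constexpr uint8_t headerAcquireScope(uint16_t h) {
    return static_cast<uint8_t>((h >> 9) & 0x3u);
}

constexpr uint8_t headerReleaseScope(uint16_t h) {
    return static_cast<uint8_t>((h >> 11) & 0x3u);
}

// The top three bits are reserved and must be 0.
constexpr bool headerReservedIsZero(uint16_t h) {
    return (h >> 13) == 0;
}
```

The numeric values of the packet types are not given on the slide, so `format` is passed through as an opaque 8-bit value.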
  • 43.
    DISPATCH PACKET
    Start Offset (Bytes) | Format | Field Name | Description
    0 | uint16_t | header | Packet header
    2 | uint16_t | dimensions:2 | Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
      |          | reserved:14 | Must be 0.
    4 | uint16_t | workgroupSize.x | x dimension of work-group (measured in work-items).
    6 | uint16_t | workgroupSize.y | y dimension of work-group (measured in work-items).
    8 | uint16_t | workgroupSize.z | z dimension of work-group (measured in work-items).
    10 | uint16_t | reserved2 | Must be 0.
    12 | uint32_t | gridSize.x | x dimension of grid (measured in work-items).
    16 | uint32_t | gridSize.y | y dimension of grid (measured in work-items).
    20 | uint32_t | gridSize.z | z dimension of grid (measured in work-items).
    24 | uint32_t | privateSegmentSizeBytes | Total size in bytes of private memory allocation request (per work-item).
    28 | uint32_t | groupSegmentSizeBytes | Total size in bytes of group memory allocation request (per work-group).
    32 | uint64_t | kernelObjectAddress | Address of an object in memory that includes an implementation-defined executable ISA image for the kernel.
    40 | uint64_t | kernargAddress | Address of memory containing kernel arguments.
    48 | uint64_t | reserved3 | Must be 0.
    56 | uint64_t | completionSignal | Address of HSA signaling object used to indicate completion of the job.
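One way to check a layout like this is to mirror it in a plain struct and let the compiler verify the offsets. The struct below is an illustrative sketch with hypothetical field names (the x/y/z members are flattened, and the 2-bit dimensions field shares a 16-bit word with its reserved bits); it is not the HSA runtime's own type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative mirror of the dispatch packet table; offsets in comments.
struct DispatchPacket {
    uint16_t header;                   //  0
    uint16_t dimensions;               //  2: low 2 bits used, upper 14 reserved
    uint16_t workgroupSizeX;           //  4
    uint16_t workgroupSizeY;           //  6
    uint16_t workgroupSizeZ;           //  8
    uint16_t reserved2;                // 10
    uint32_t gridSizeX;                // 12
    uint32_t gridSizeY;                // 16
    uint32_t gridSizeZ;                // 20
    uint32_t privateSegmentSizeBytes;  // 24
    uint32_t groupSegmentSizeBytes;    // 28
    uint64_t kernelObjectAddress;      // 32
    uint64_t kernargAddress;           // 40
    uint64_t reserved3;                // 48
    uint64_t completionSignal;         // 56
};

// Compile-time checks that the struct matches the table.
static_assert(sizeof(DispatchPacket) == 64, "packet must be 64 bytes");
static_assert(offsetof(DispatchPacket, gridSizeX) == 12, "gridSize.x offset");
static_assert(offsetof(DispatchPacket, kernelObjectAddress) == 32, "kernel object offset");
static_assert(offsetof(DispatchPacket, completionSignal) == 56, "completion signal offset");

// The table allows 1, 2 or 3 dimensions only.
constexpr bool dimensionsValid(uint16_t dimensionsField) {
    return (dimensionsField & 0x3u) >= 1 && (dimensionsField >> 2) == 0;
}
```

Every field in the table is naturally aligned, so no packing pragmas are needed for the offsets to line up.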
  • 44.
    AGENT DISPATCH PACKET
    Start Offset (Bytes) | Format | Field Name | Description
    0 | uint16_t | header | Packet header
    2 | uint16_t | type | The function to be performed by the destination Agent. The type value is split into the following ranges: 0x0000:0x3FFF – Vendor specific; 0x4000:0x7FFF – HSA runtime; 0x8000:0xFFFF – User registered function
    4 | uint32_t | reserved2 | Must be 0.
    8 | uint64_t | returnLocation | Pointer to location to store the function return value in.
    16 | uint64_t | arg[0] | 64-bit direct or indirect arguments.
    24 | uint64_t | arg[1] |
    32 | uint64_t | arg[2] |
    40 | uint64_t | arg[3] |
    48 | uint64_t | reserved3 | Must be 0.
    56 | uint64_t | completionSignal | Address of HSA signaling object used to indicate completion of the job.
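The three ranges of the type field can be decoded with two comparisons. The enum and function names below are hypothetical; only the range boundaries come from the table.

```cpp
#include <cassert>
#include <cstdint>

// Who owns a given agent dispatch function code (hypothetical enum).
enum class AgentFunctionOwner { Vendor, HsaRuntime, User };

// Ranges from the table:
//   0x0000:0x3FFF vendor specific
//   0x4000:0x7FFF HSA runtime
//   0x8000:0xFFFF user registered function
constexpr AgentFunctionOwner classifyAgentDispatchType(uint16_t type) {
    return type <= 0x3FFF ? AgentFunctionOwner::Vendor
         : type <= 0x7FFF ? AgentFunctionOwner::HsaRuntime
                          : AgentFunctionOwner::User;
}
```

Since the ranges are split on the top two bits, `type >> 14` would work equally well as a dispatch key.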
  • 45.
    BARRIER PACKET
     Used for specifying dependences between packets
     The HSA agent will not launch any further packets from this queue until the barrier packet signal conditions are met
     Used for specifying dependences on packets dispatched from any queue
     Execution phase completes only when all of the dependent signals (up to five) have been signaled (with the value of 0)
       Or if an error has occurred in one of the packets upon which we have a dependence
  • 46.
    BARRIER PACKET
    Start Offset (Bytes) | Format | Field Name | Description
    0 | uint16_t | header | Packet header, see 2.8.1 Packet header (p. 16).
    2 | uint16_t | reserved2 | Must be 0.
    4 | uint32_t | reserved3 | Must be 0.
    8 | uint64_t | depSignal0 | Address of dependent signaling objects to be evaluated by the packet processor.
    16 | uint64_t | depSignal1 |
    24 | uint64_t | depSignal2 |
    32 | uint64_t | depSignal3 |
    40 | uint64_t | depSignal4 |
    48 | uint64_t | reserved4 | Must be 0.
    56 | uint64_t | completionSignal | Address of HSA signaling object used to indicate completion of the job.
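The condition a packet processor evaluates for a barrier packet can be sketched as a check over the five dependency slots. This is a simplified model: signals are represented as `std::atomic<int64_t>` objects, unused slots are assumed to be null, and the error path mentioned on the previous slide is not modeled. All names are hypothetical.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Simplified stand-in for an HSA signaling object.
using Signal = std::atomic<int64_t>;

// True once every used dependency slot has reached the value 0.
// Assumption: an unused depSignal slot is represented by a null pointer.
inline bool barrierConditionsMet(const std::array<const Signal*, 5>& deps) {
    for (const Signal* s : deps) {
        if (s != nullptr && s->load(std::memory_order_acquire) != 0) {
            return false;
        }
    }
    return true;
}
```

A packet processor would re-evaluate this whenever one of the dependent signals changes, keeping the barrier packet in its active phase until the function returns true.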
  • 47.
    DEPENDENCES
     A user may never assume more than one packet is being executed by an HSA agent at a time.
     Implications:
       Packets can’t poll on shared memory values which will be set by packets issued from other queues, unless the user has ensured the proper ordering.
       To ensure all previous packets from a queue have been completed, use the Barrier bit.
       To ensure specific packets from any queue have completed, use the Barrier packet.
  • 48.
  • 49.
    PACKET EXECUTION
     Launch phase
       Initiated when launch conditions are met
         All preceding packets in the queue must have exited launch phase
         If the barrier bit in the packet header is set, then all preceding packets in the queue must have exited completion phase
       Includes memory acquire fence
     Active phase
       Execute the packet
       Barrier packets remain in Active phase until conditions are met.
     Completion phase
       First step is memory release fence – make results visible.
       completionSignal field is then signaled with a decrementing atomic.
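The completion phase can be sketched as a release fence followed by an atomic decrement of the completion signal, with the waiting side using an acquire load. The code below models the signal as a `std::atomic<int64_t>` and uses hypothetical function names; it illustrates the ordering only, not how an HSA agent implements it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simplified stand-in for an HSA signaling object.
using Signal = std::atomic<int64_t>;

// Completion phase: make the packet's results visible, then signal.
inline void completePacket(Signal& completionSignal) {
    // Step 1: memory release fence, so results are visible before the signal changes.
    std::atomic_thread_fence(std::memory_order_release);
    // Step 2: signal completion with a decrementing atomic.
    completionSignal.fetch_sub(1, std::memory_order_relaxed);
}

// Waiter side: a signal initialised to the number of outstanding packets
// reads 0 once all of them have completed.
inline bool allComplete(const Signal& completionSignal) {
    return completionSignal.load(std::memory_order_acquire) == 0;
}
```

Initialising the signal to N and sharing it between N packets gives a simple "all done" counter, which is how the later barrier packet examples use it.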
  • 50.
    PACKET EXECUTION – BARRIER BIT
    [Timeline figure: launch, execute and complete phases of Pkt1 and Pkt2, followed by Pkt3, which has barrier=1.]
    Pkt3 launches when all preceding packets in the queue have completed.
  • 51.
    PUTTING IT ALL TOGETHER (FFT)
    [Figure: FFT over inputs X[0] to X[7], executed over time as Packets 1 to 6 with two barriers between the stages.]
  • 52.
    PUTTING IT ALL TOGETHER
    AQL Pseudo Code
    // Send the packets to do the first stage.
    aql_dispatch(pkt1);
    aql_dispatch(pkt2);

    // Send the next two packets, setting the barrier bit so we
    // know packets 1 & 2 will be complete before 3 and 4 are
    // launched.
    aql_dispatch_with_barrier_bit(pkt3);
    aql_dispatch(pkt4);

    // Same as above (make sure 3 & 4 are done before issuing 5
    // & 6)
    aql_dispatch_with_barrier_bit(pkt5);
    aql_dispatch(pkt6);

    // This packet will notify us when 5 & 6 are complete
    aql_dispatch_with_barrier_bit(finish_pkt);
  • 53.
    PACKET EXECUTION – BARRIER PACKET
    [Timeline figure: task T1 on queue Q1; a Barrier packet followed by task T2 on queue Q2. Signal X is initialized to 1 and is both T1's completionSignal and the Barrier's depSignal0. T1 decrements signal X when it completes.]
    The Barrier completes when signal X is signalled with 0.
    T2 launches once the Barrier is complete.
  • 54.
    DEPTH FIRST CHILD TASK EXECUTION
     Consider two generations of child tasks
       Task T submits tasks T.1 & T.2
       Task T.1 submits tasks T.1.1 & T.1.2
       Task T.2 submits tasks T.2.1 & T.2.2
     Desired outcome
       Depth first child task execution
       I.e. T → T.1 → T.1.1 → T.1.2 → T.2 → T.2.1 → T.2.2
       T is passed a signal (allComplete) to decrement when all tasks are complete (T and its children etc.)
    [Figure: task tree with T at the root, children T.1 and T.2, and grandchildren T.1.1, T.1.2, T.2.1 and T.2.2.]
  • 55.
    HOW TO DO THIS WITH HSA QUEUES?
     Use a separate user mode queue for each recursion level
       Task T submits to queue Q1
       Tasks T.1 & T.2 submit tasks to queue Q2
       Queues could be passed in as parameters to task T
     Depth first requires ordering of T.1, T.2 and their children
       Use an additional signal object (childrenComplete) to track completion of the children of T.1 & T.2
       childrenComplete is set to the number of children (i.e. 2) by each of T.1 & T.2
  • 56.
    A PICTURE SAYS MORE THAN 1000 WORDS
    [Figure: the task tree (T; T.1, T.2; T.1.1, T.1.2, T.2.1, T.2.2) mapped onto two queues. Q1 holds T.1, Barrier, T.2, Barrier; Q2 holds T.1.1, T.1.2, T.2.1, T.2.2. Labels: "Wait on childrenComplete", "Signal allComplete".]
  • 57.
    SUMMARY
  • 58.
    KEY HSA TECHNOLOGIES
     HSA combines several mechanisms to enable low overhead task dispatch
       Shared Virtual Memory
       System Coherency
       Signaling
       AQL
     User mode queues – from any compatible agent
     Architected packet format
     Rich dependency mechanism
     Flexible and efficient signaling of completion
  • 59.
    QUESTIONS?