THE PROGRAMMER’S GUIDE
TO THE APU GALAXY
Phil Rogers, Corporate Fellow
AMD
THE OPPORTUNITY WE ARE SEIZING




Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today.



2 | The Programmer’s Guide to the APU Galaxy | June 2011
OUTLINE


The APU today and its programming environment

The future of the heterogeneous platform

AMD Fusion System Architecture

Roadmap

Software evolution

A visual view of the new command and data flow




APU: ACCELERATED PROCESSING UNIT


The APU has arrived and it is a great advance over previous platforms
Combines scalar processing on the CPU with parallel processing on the GPU, plus high-bandwidth access to memory
How do we make it even better going forward?
   – Easier to program
   – Easier to optimize
   – Easier to load balance
   – Higher performance
   – Lower power




LOW POWER E-SERIES AMD FUSION APU: “ZACATE”


    E-Series APU
 2 x86 “Bobcat” CPU cores
 Array of Radeon™ Cores
        Discrete-class DirectX® 11 performance
        80 Stream Processors
 3rd Generation Unified Video Decoder
 PCIe® Gen2
 Single-channel DDR3 @ 1066
 18W TDP



    Performance:
 Up to 8.5GB/s System Memory Bandwidth
 Up to 90 Gflops of Single Precision Compute



TABLET Z-SERIES AMD FUSION APU: “DESNA”


    Z-Series APU
 2 x86 “Bobcat” CPU cores
 Array of Radeon™ Cores
        Discrete-class DirectX® 11 performance
        80 Stream Processors
 3rd Generation Unified Video Decoder
 PCIe® Gen2
 Single-channel DDR3 @ 1066
 6W TDP w/ Local Hardware Thermal Control



    Performance:
 Up to 8.5GB/s System Memory Bandwidth
 Suitable for sealed, passively cooled designs



MAINSTREAM A-SERIES AMD FUSION APU: “LLANO”


    A-Series APU
 Up to four x86 CPU cores
        AMD Turbo CORE frequency acceleration
 Array of Radeon™ Cores
        Discrete-class DirectX® 11 performance
 3rd Generation Unified Video Decoder
 Blu-ray 3D stereoscopic display
 PCIe® Gen2
 Dual-channel DDR3
 45W TDP


    Performance:
 Up to 29GB/s System Memory Bandwidth
 Up to 500 Gflops of Single Precision Compute



COMMITTED TO OPEN STANDARDS


AMD drives open and de-facto standards

   – Compete on the best implementation

Open standards are the basis for large ecosystems

Open standards always win over time
   – SW developers want their applications to run on multiple platforms from multiple hardware vendors




A NEW ERA OF PROCESSOR PERFORMANCE


[Chart: three performance curves over time, each rising and then flattening, with "we are here" marking the present point in each era.]

Single-Core Era
   – Enabled by: Moore’s Law, voltage scaling
   – Constrained by: power, complexity
   – Programming: Assembly, C/C++, Java, …
   – Metric: single-thread performance over time

Multi-Core Era
   – Enabled by: Moore’s Law, SMP architecture
   – Constrained by: power, parallel SW scalability
   – Programming: pthreads, OpenMP / TBB, …
   – Metric: throughput over time (# of processors)

Heterogeneous Systems Era
   – Enabled by: abundant data parallelism, power-efficient GPUs
   – Temporarily constrained by: programming models, communication overhead
   – Programming: Shader, CUDA, OpenCL !!!
   – Metric: modern application performance over time (data-parallel exploitation)
EVOLUTION OF HETEROGENEOUS COMPUTING
[Chart: architecture maturity & programmer accessibility rising from Poor to Excellent across three eras.]

Proprietary Drivers Era (2002 - 2008)
   – Graphics & proprietary driver-based APIs
   – “Adventurous” programmers
   – Exploit early programmable “shader cores” in the GPU
   – Make your program look like “graphics” to the GPU
   – CUDA™, Brook+, etc.

Standards Drivers Era (2009 - 2011)
   – OpenCL™, DirectCompute driver-based APIs
   – GPU as a co-processor
   – Expert programmers
   – C and C++ subsets
   – Compute-centric APIs, data types
   – Multiple address spaces with explicit data movement
   – Specialized work-queue-based structures
   – Kernel mode dispatch

Architected Era (2012 - 2020)
   – AMD Fusion System Architecture: GPU as a peer processor
   – Mainstream programmers
   – Full C++
   – Unified coherent address space
   – Task parallel runtimes
   – Nested data parallel programs
   – User mode dispatch
   – Pre-emption and context switching

See Herb Sutter’s Keynote tomorrow for a cool example of plans for the architected era!
FSA FEATURE ROADMAP


Physical Integration
   – Integrate CPU & GPU in silicon
   – Unified Memory Controller
   – Common Manufacturing Technology

Optimized Platforms
   – GPU Compute C++ support
   – User mode scheduling
   – Bi-directional power management between CPU and GPU

Architectural Integration
   – Unified address space for CPU and GPU
   – GPU uses pageable system memory via CPU pointers
   – Fully coherent memory between CPU & GPU

System Integration
   – GPU compute context switch
   – GPU graphics pre-emption
   – Quality of Service
   – Extend to discrete GPU
FUSION SYSTEM ARCHITECTURE – AN OPEN PLATFORM

Open Architecture, published specifications
  – FSAIL virtual ISA
  – FSA memory model
  – FSA dispatch
ISA agnostic for both CPU and GPU

Inviting partners to join us, in all areas
   – Hardware companies
   – Operating Systems
   – Tools and Middleware
   – Applications

FSA review committee planned



FSA INTERMEDIATE LAYER - FSAIL

FSAIL is a virtual ISA for parallel programs
    – Finalized to a native ISA by a JIT compiler, or “Finalizer”

Explicitly parallel
    – Designed for data-parallel programming

Support for exceptions, virtual functions, and other high-level language features

Syscall methods
    – GPU code can call directly to system services, IO, printf, etc.

Debugging support
FSA MEMORY MODEL


Designed to be compatible with the C++0x, Java and .NET memory models

Relaxed consistency memory model for parallel compute performance

Loads and stores can be re-ordered by the finalizer

Visibility controlled by:
    – Load.Acquire, Store.Release
    – Fences
    – Barriers
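FSAIL itself is not shown here, but Load.Acquire and Store.Release map directly onto the C++0x (C++11) model the slide cites. A minimal host-side sketch, assuming nothing beyond standard atomics (`run_once` is a hypothetical driver for the example):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Producer publishes data with a plain store, then signals completion
// with a Store.Release; the consumer spins on a Load.Acquire before
// reading, so the payload is guaranteed to be visible.
std::atomic<bool> ready{false};
int payload = 0;

void producer() {
    payload = 42;                                  // ordinary store
    ready.store(true, std::memory_order_release);  // Store.Release
}

void consumer(int* out) {
    while (!ready.load(std::memory_order_acquire)) // Load.Acquire
        ;                                          // spin until published
    *out = payload;                                // sees 42, never 0
}

int run_once() {
    ready.store(false, std::memory_order_relaxed);
    int seen = 0;
    std::thread t1(producer);
    std::thread t2(consumer, &seen);
    t1.join();
    t2.join();
    return seen;
}
```

Without the acquire/release pairing (i.e. with relaxed ordering on both sides), the relaxed-consistency model would let the finalizer or hardware reorder the payload store past the flag store.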




DRIVER STACK VS. FSA SOFTWARE STACK

Driver Stack (today)
   – Apps
   – Domain Libraries
   – OpenCL™ 1.x, DX Runtimes, User Mode Drivers
   – Graphics Kernel Mode Driver

FSA Software Stack
   – Apps
   – FSA Domain Libraries
   – FSA JIT, Task Queuing Libraries, FSA Runtime
   – FSA Kernel Mode Driver

Both stacks run on the same hardware: APUs, CPUs, GPUs. AMD provides the user mode and kernel mode components shown; all other components are contributed by third parties or AMD.
OPENCL™ AND FSA

FSA is an optimized platform architecture for OpenCL™
    – Not an alternative to OpenCL™
OpenCL™ on FSA will benefit from:
    – Avoidance of wasteful copies
    – Low-latency dispatch
    – Improved memory model
    – Pointers shared between CPU and GPU
FSA also exposes a lower-level programming interface, for those that want the ultimate in control and performance
    – Optimized libraries may choose the lower-level interface
TASK QUEUING RUNTIMES

Popular pattern for task and data parallel programming on SMP systems today
Characterized by:
    – A work queue per core
    – A runtime library that divides large loops into tasks and distributes them to queues
    – A work stealing runtime that keeps the system balanced
FSA is designed to extend this pattern to run on heterogeneous systems
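The ingredients above (a queue per worker, loop chunking, work stealing) can be sketched in plain C++. Everything here is a hypothetical illustration: `WorkQueue` and `parallel_for` are invented names, and locked deques stand in for the lock-free structures a production runtime would use:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct WorkQueue {
    std::deque<std::function<void()>> tasks;
    std::mutex m;

    bool pop(std::function<void()>& t) {   // owner takes from the front
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return false;
        t = std::move(tasks.front());
        tasks.pop_front();
        return true;
    }
    bool steal(std::function<void()>& t) { // a thief takes from the back
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return false;
        t = std::move(tasks.back());
        tasks.pop_back();
        return true;
    }
};

void parallel_for(std::size_t n, std::size_t chunk,
                  const std::function<void(std::size_t)>& body) {
    unsigned nw = std::max(1u, std::thread::hardware_concurrency());
    std::vector<WorkQueue> queues(nw);

    // The runtime divides the large loop into chunk-sized tasks and
    // distributes them round-robin across the per-worker queues.
    for (std::size_t lo = 0, q = 0; lo < n; lo += chunk, q = (q + 1) % nw) {
        std::size_t hi = std::min(n, lo + chunk);
        queues[q].tasks.push_back([lo, hi, &body] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }

    // Each worker drains its own queue first, then steals to stay busy.
    auto worker = [&](unsigned self) {
        std::function<void()> t;
        for (;;) {
            if (queues[self].pop(t)) { t(); continue; }
            bool stole = false;
            for (unsigned v = 0; v < nw && !stole; ++v)
                if (v != self && queues[v].steal(t)) { t(); stole = true; }
            if (!stole) return; // every queue is empty: all work is done
        }
    };
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < nw; ++w) pool.emplace_back(worker, w);
    for (auto& th : pool) th.join();
}
```

FSA's extension of the pattern amounts to adding one more worker to this picture, backed by the GPU rather than an x86 core.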




TASK QUEUING RUNTIME ON CPUS

[Diagram: a work stealing runtime balances four work queues (Q), one per CPU worker thread, each worker running on an x86 CPU core; all workers share memory.]
TASK QUEUING RUNTIME ON THE FSA PLATFORM

[Diagram: the same work stealing runtime now feeds a fifth queue, served by a GPU Manager worker on a Radeon™ GPU alongside the four x86 CPU workers.]
TASK QUEUING RUNTIME ON THE FSA PLATFORM

[Diagram: inside the GPU, the GPU Manager’s queue drives a fetch and dispatch unit that spreads work across the GPU’s SIMD arrays, sharing memory with the CPU workers.]
FSA SOFTWARE EXAMPLE - REDUCTION

float foo(float);
float myArray[…];

Task<float, ReductionBin> task([myArray](IndexRange<1> index) [[device]] {
        float sum = 0.f;
        for (size_t i = index.begin(); i != index.end(); i++) {
                sum += foo(myArray[i]);
        }
        return sum;
});

float result = task.enqueueWithReduce(Partition<1, Auto>(1920),
        [](float x, float y) [[device]] { return x + y; }, 0.f);
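The Task/enqueueWithReduce API above is FSA-specific and not compilable today, but its semantics (each task sums `foo()` over its index range, then the partial sums are combined by the reduction operator) can be sketched in portable C++; `reduce_with_tasks`, the chunk size, and the squaring body for `foo` are hypothetical stand-ins:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

float foo(float x) { return x * x; }  // stand-in body for the slide's foo()

// Each "task" sums foo() over its own index range (the lambda body in
// the slide); the partials are then combined with the reduction
// operator, here ordinary addition with an initial value of 0.
float reduce_with_tasks(const std::vector<float>& a, std::size_t chunk) {
    std::vector<float> partials;
    for (std::size_t lo = 0; lo < a.size(); lo += chunk) {
        std::size_t hi = std::min(a.size(), lo + chunk);
        float sum = 0.f;
        for (std::size_t i = lo; i < hi; i++) sum += foo(a[i]);
        partials.push_back(sum);
    }
    return std::accumulate(partials.begin(), partials.end(), 0.f);
}
```

With `foo` squaring its input, `reduce_with_tasks({1, 2, 3, 4}, 2)` yields 1 + 4 + 9 + 16 = 30. On FSA the per-range bodies would run as GPU tasks and the combine step as the `[[device]]` reduction lambda.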



HETEROGENEOUS COMPUTE DISPATCH




How compute dispatch operates today in the driver model

How compute dispatch improves tomorrow under FSA




TODAY’S COMMAND AND DISPATCH FLOW

[Diagram: Application A’s commands pass through Direct3D and a user mode driver into a command buffer, then through a soft queue and the kernel mode driver into a DMA buffer, which feeds the GPU hardware’s single hardware queue.]
TODAY’S COMMAND AND DISPATCH FLOW

[Diagram: Applications A, B and C each run the same command flow (Direct3D, user mode driver, command buffer, soft queue, kernel mode driver, DMA buffer), all converging on the GPU hardware’s single hardware queue.]
TODAY’S COMMAND AND DISPATCH FLOW

[Diagram: packets from applications A, B and C interleave in the single hardware queue before reaching the GPU hardware.]
FUTURE COMMAND AND DISPATCH FLOW

[Diagram: applications A, B and C each own a hardware queue (with an optional dispatch buffer) and submit packets straight to the GPU hardware.]

 Application codes to the hardware
 User mode queuing
 Hardware scheduling
 Low dispatch times

 No APIs
 No soft queues
 No user mode drivers
 No kernel mode transitions
 No overhead!
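User mode queuing is the mechanism behind "no kernel mode transitions": the application writes packets into shared memory that the consumer polls, with no driver call on the path. A minimal single-producer sketch, where `UserModeQueue` and the packet format are hypothetical and a CPU-side reader stands in for the hardware:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct UserModeQueue {
    static const std::size_t N = 256;   // ring capacity (power of two)
    std::uint32_t packets[N];           // packet payloads
    std::atomic<std::size_t> head{0};   // consumer's read index
    std::atomic<std::size_t> tail{0};   // producer's write index

    bool enqueue(std::uint32_t pkt) {   // application side: no syscall
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N)
            return false;               // ring is full
        packets[t % N] = pkt;
        tail.store(t + 1, std::memory_order_release); // publish the packet
        return true;
    }
    bool dequeue(std::uint32_t& pkt) {  // consumer side (stand-in for HW)
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;               // ring is empty
        pkt = packets[h % N];
        head.store(h + 1, std::memory_order_release); // free the slot
        return true;
    }
};
```

The release store on `tail` pairs with the acquire load in `dequeue`, so a packet's payload is visible before its slot counts as filled; a hardware queue's write-index update plays the same publishing role.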
FUTURE COMMAND AND DISPATCH CPU <-> GPU

[Animation: the application/runtime dispatches work packets among CPU1, CPU2 and the GPU; under FSA, dispatch flows in both directions between CPU and GPU.]
WHERE ARE WE TAKING YOU?

Switch the compute, don’t move the data!
 Every processor now has serial and parallel cores
 All cores capable, with performance differences
 Simple and efficient programming model

Platform Design Goals
 Easy support of massive data sets
 Support for task based programming models
 Solutions for all platforms
 Open to all
THE FUTURE OF HETEROGENEOUS COMPUTING

The architectural path for the future is clear
    – Programming patterns established on Symmetric Multi-Processor (SMP) systems migrate to the heterogeneous world
    – An open architecture, with published specifications and an open source execution software stack
    – Heterogeneous cores working together seamlessly in coherent memory
    – Low latency dispatch
    – No software fault lines
 AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.”

AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.”

  • 1.
    THE PROGRAMMER’S GUIDE TOTHE APU GALAXY Phil Rogers, Corporate Fellow AMD
  • 2.
    THE OPPORTUNITY WEARE SEIZING Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today. 2 | The Programmer’s Guide to the APU Galaxy | June 2011
  • 3.
    OUTLINE The APU todayand its programming environment The future of the heterogeneous platform AMD Fusion System Architecture Roadmap Software evolution A visual view of the new command and data flow 3 | The Programmer’s Guide to the APU Galaxy | June 2011
  • 4.
    APU: ACCELERATED PROCESSINGUNIT The APU has arrived and it is a great advance over previous platforms Combines scalar processing on CPU with parallel processing on the GPU and high bandwidth access to memory How do we make it even better going forward? – Easier to program – Easier to optimize – Easier to load balance – Higher performance – Lower power 4 | The Programmer’s Guide to the APU Galaxy | June 2011
  • 5.
    LOW POWER E-SERIESAMD FUSION APU: “ZACATE” E-Series APU 2 x86 Bobcat CPU cores Array of Radeon™ Cores  Discrete-class DirectX® 11 performance  80 Stream Processors 3rd Generation Unified Video Decoder PCIe® Gen2 Single-channel DDR3 @ 1066 18W TDP Performance: Up to 8.5GB/s System Memory Bandwidth Up to 90 Gflop of Single Precision Compute 5 | The Programmer’s Guide to the APU Galaxy | June 2011
  • 6.
    TABLET Z-SERIES AMDFUSION APU: “DESNA” Z-Series APU 2 x86 “Bobcat” CPU cores Array of Radeon™ Cores  Discrete-class DirectX® 11 performance  80 Stream Processors 3rd Generation Unified Video Decoder PCIe® Gen2 Single-channel DDR3 @ 1066 6W TDP w/ Local Hardware Thermal Control Performance: Up to 8.5GB/s System Memory Bandwidth Suitable for sealed, passively cooled designs 6 | The Programmer’s Guide to the APU Galaxy | June 2011
  • 7.
    MAINSTREAM A-SERIES AMDFUSION APU: “LLANO” A-Series APU Up to four x86 CPU cores  AMD Turbo CORE frequency acceleration Array of Radeon™ Cores  Discrete-class DirectX® 11 performance 3rd Generation Unified Video Decoder Blu-ray 3D stereoscopic display PCIe® Gen2 Dual-channel DDR3 45W TDP Performance: Up to 29GB/s System Memory Bandwidth Up to 500 Gflops of Single Precision Compute 7 | The Programmer’s Guide to the APU Galaxy | June 2011
  • 8.
    COMMITTED TO OPENSTANDARDS AMD drives open and de-facto standards – Compete on the best implementation Open standards are the basis for large ecosystems Open standards always win over time DirectX® – SW developers want their applications to run on multiple platforms from multiple hardware vendors 8 | The Programmer’s Guide to the APU Galaxy | June 2011
  • 9.
    A NEW ERAOF PROCESSOR PERFORMANCE Heterogeneous Single-Core Era Multi-Core Era Systems Era Enabled by: Constrained by: Enabled by: Constrained by: Enabled by: Temporarily  Moore’s Law Power  Moore’s Law Power  Abundant data Constrained by:  Voltage Complexity  SMP Parallel SW parallelism Programming Scaling architecture Scalability  Power efficient models GPUs Comm.overhead Assembly  C/C++  Java … pthreads  OpenMP / TBB … Shader  CUDA OpenCL !!! Modern Application Single-thread Performance Performance ? Throughput Performance we are here we are here we are here Time Time (# of processors) Time (Data-parallel exploitation) 9 | The Programmer’s Guide to the APU Galaxy | June 2011
EVOLUTION OF HETEROGENEOUS COMPUTING

Proprietary Drivers Era (2002 - 2008): Graphics & proprietary driver-based APIs
   – “Adventurous” programmers
   – Exploit early programmable “shader cores” in the GPU
   – Make your program look like “graphics” to the GPU
   – CUDA™, Brook+, etc

Standards Drivers Era (2009 - 2011): Driver-based APIs – OpenCL™, DirectCompute
   – GPU as a co-processor
   – Expert programmers
   – C and C++ subsets
   – Compute centric APIs, data types
   – Multiple address spaces with explicit data movement
   – Specialized work queue based structures
   – Kernel mode dispatch

Architected Era (2012 - 2020): AMD Fusion System Architecture
   – GPU peer processor
   – Mainstream programmers
   – Full C++
   – Unified coherent address space
   – Task parallel runtimes
   – Nested data parallel programs
   – User mode dispatch
   – Pre-emption and context switching

The original chart plots these eras on a rising curve of architecture
maturity and programmer accessibility, from poor to excellent.

See Herb Sutter’s keynote tomorrow for a cool example of plans for the
architected era!

10 | The Programmer’s Guide to the APU Galaxy | June 2011
FSA FEATURE ROADMAP

Physical Integration
   – Integrate CPU & GPU in silicon
   – Unified Memory Controller
   – Common Manufacturing Technology

Optimized Platforms
   – GPU Compute C++ support
   – User mode scheduling
   – Bi-directional power management between CPU and GPU

Architectural Integration
   – Unified Address Space for CPU and GPU
   – GPU uses pageable system memory via CPU pointers
   – Fully coherent memory between CPU & GPU

System Integration
   – GPU compute context switch
   – GPU graphics pre-emption
   – Quality of Service
   – Extend to Discrete GPU

11 | The Programmer’s Guide to the APU Galaxy | June 2011
FUSION SYSTEM ARCHITECTURE – AN OPEN PLATFORM

Open architecture, published specifications
   – FSAIL virtual ISA
   – FSA memory model
   – FSA dispatch
ISA agnostic for both CPU and GPU
Inviting partners to join us, in all areas
   – Hardware companies
   – Operating systems
   – Tools and middleware
   – Applications
FSA review committee planned

12 | The Programmer’s Guide to the APU Galaxy | June 2011
FSA INTERMEDIATE LAYER - FSAIL

FSAIL is a virtual ISA for parallel programs
   – Finalized to the target ISA by a JIT compiler, or “Finalizer”
Explicitly parallel
   – Designed for data parallel programming
Support for exceptions, virtual functions, and other high-level
 language features
Syscall methods
   – GPU code can call directly to system services, IO, printf, etc.
Debugging support

13 | The Programmer’s Guide to the APU Galaxy | June 2011
FSA MEMORY MODEL

Designed to be compatible with the C++0x, Java and .NET memory models
Relaxed consistency memory model for parallel compute performance
Loads and stores can be re-ordered by the finalizer
Visibility controlled by:
   – Load.Acquire, Store.Release
   – Fences
   – Barriers

14 | The Programmer’s Guide to the APU Galaxy | June 2011
DRIVER STACK vs. FSA SOFTWARE STACK

Driver stack (today)
   – Apps
   – Domain Libraries
   – OpenCL™ 1.x, DX Runtimes, User Mode Drivers
   – Graphics Kernel Mode Driver
   – Hardware - APUs, CPUs, GPUs

FSA software stack
   – Apps
   – FSA Domain Libraries, Task Queuing Libraries
   – FSA Runtime, FSA JIT
   – FSA Kernel Mode Driver
   – Hardware - APUs, CPUs, GPUs

The original diagram marks the AMD user mode and AMD kernel mode
components; all others are contributed by third parties or AMD.

15 | The Programmer’s Guide to the APU Galaxy | June 2011
OPENCL™ AND FSA

FSA is an optimized platform architecture for OpenCL™
   – Not an alternative to OpenCL™
OpenCL™ on FSA will benefit from
   – Avoidance of wasteful copies
   – Low latency dispatch
   – Improved memory model
   – Pointers shared between CPU and GPU
FSA also exposes a lower level programming interface, for those who
 want the ultimate in control and performance
   – Optimized libraries may choose the lower level interface

16 | The Programmer’s Guide to the APU Galaxy | June 2011
TASK QUEUING RUNTIMES

A popular pattern for task and data parallel programming on SMP
 systems today
Characterized by:
   – A work queue per core
   – A runtime library that divides large loops into tasks and
     distributes them to the queues
   – A work stealing runtime that keeps the system balanced
FSA is designed to extend this pattern to run on heterogeneous systems

17 | The Programmer’s Guide to the APU Galaxy | June 2011
TASK QUEUING RUNTIME ON CPUS

The diagram shows a work stealing runtime managing one queue (Q) per
x86 CPU core, with a CPU worker draining each queue. All workers share
memory; at this stage only CPU threads exist.

18 | The Programmer’s Guide to the APU Galaxy | June 2011
TASK QUEUING RUNTIME ON THE FSA PLATFORM

The diagram adds a fifth queue to the CPU picture: alongside the four
CPU workers on x86 cores, a GPU manager feeds a queue serviced by the
Radeon™ GPU. CPU threads and GPU threads share the same memory.

19 | The Programmer’s Guide to the APU Galaxy | June 2011
TASK QUEUING RUNTIME ON THE FSA PLATFORM

The expanded diagram opens up the GPU manager: a memory-based fetch and
dispatch block feeds the GPU’s SIMD units, which run the GPU threads
alongside the CPU threads in shared memory.

20 | The Programmer’s Guide to the APU Galaxy | June 2011
FSA SOFTWARE EXAMPLE - REDUCTION

    float foo(float);
    float myArray[…];

    Task<float, ReductionBin> task(
        [myArray](IndexRange<1> index) [[device]] {
            float sum = 0.;
            for (size_t i = index.begin(); i != index.end(); i++) {
                sum += foo(myArray[i]);
            }
            return sum;
        });

    float result = task.enqueueWithReduce(
        Partition<1, Auto>(1920),
        [](int x, int y) [[device]] { return x + y; },
        0.);

21 | The Programmer’s Guide to the APU Galaxy | June 2011
HETEROGENEOUS COMPUTE DISPATCH

How compute dispatch operates today in the driver model
How compute dispatch improves tomorrow under FSA

22 | The Programmer’s Guide to the APU Galaxy | June 2011
TODAY’S COMMAND AND DISPATCH FLOW

The diagram traces a single application (A): commands flow from the
application into a soft queue, through the Direct3D user mode driver
into a command buffer, then through the kernel mode driver into a DMA
buffer, and finally into the GPU’s hardware queue. Data flows alongside
the command stream at each stage.

23 | The Programmer’s Guide to the APU Galaxy | June 2011
TODAY’S COMMAND AND DISPATCH FLOW

With multiple applications (A, B, C), each has its own soft queue,
user mode driver, command buffer, kernel mode driver and DMA buffer,
but all of them contend for the single GPU hardware queue.

24 | The Programmer’s Guide to the APU Galaxy | June 2011
TODAY’S COMMAND AND DISPATCH FLOW

The animation shows work from A, B and C interleaving in the hardware
queue (A B B C …): every dispatch funnels through the full driver
stack and serializes at the single hardware queue.

25 | The Programmer’s Guide to the APU Galaxy | June 2011
FUTURE COMMAND AND DISPATCH FLOW

Application codes to the hardware
User mode queuing
Hardware scheduling
Low dispatch times

In the diagram, each application (A, B, C) owns its own hardware
queue, with an optional dispatch buffer, feeding the GPU directly.

No APIs
No soft queues
No user mode drivers
No kernel mode transitions
No overhead!

27 | The Programmer’s Guide to the APU Galaxy | June 2011
FUTURE COMMAND AND DISPATCH: CPU <-> GPU

A short animation steps through command flow between the application /
runtime, CPU1, CPU2 and the GPU, showing dispatch in both directions
through the same queues.

28 | The Programmer’s Guide to the APU Galaxy | June 2011
WHERE ARE WE TAKING YOU?

Switch the compute, don’t move the data!

Platform design goals
   – Every processor now has serial and parallel cores
   – All cores capable, with performance differences
   – Easy support of massive data sets
   – Support for task based programming models
   – Simple and efficient program model
   – Solutions for all platforms
   – Open to all

32 | The Programmer’s Guide to the APU Galaxy | June 2011
THE FUTURE OF HETEROGENEOUS COMPUTING

The architectural path for the future is clear
   – Programming patterns established on Symmetric Multi-Processor
     (SMP) systems migrate to the heterogeneous world
   – An open architecture, with published specifications and an open
     source execution software stack
   – Heterogeneous cores working together seamlessly in coherent memory
   – Low latency dispatch
   – No software fault lines

33 | The Programmer’s Guide to the APU Galaxy | June 2011