AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.”


AFDS Keynote: “The Programmer’s Guide to the APU Galaxy.”
Phil Rogers, AMD Corporate Fellow

It’s a well-understood maxim in the technology industry that software and hardware must evolve in parallel, and be well matched, to achieve greatness. With the introduction of the world’s first APU in January 2011, AMD pointed the world toward a new way of computing. This was very much a first step in an architectural journey that is well underway at AMD. APUs combine different processing engines on a single chip to strike a unique balance between performance, power consumption and price. Hear how AMD is working to ease the programmer’s access to this new level of compute horsepower and to dramatically expand the processing resources available to modern applications.


Transcript

  • 1. THE PROGRAMMER’S GUIDE TO THE APU GALAXY
    Phil Rogers, Corporate Fellow, AMD
  • 2. THE OPPORTUNITY WE ARE SEIZING
    Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today.
    (2 | The Programmer’s Guide to the APU Galaxy | June 2011)
  • 3. OUTLINE
    – The APU today and its programming environment
    – The future of the heterogeneous platform
    – AMD Fusion System Architecture
    – Roadmap
    – Software evolution
    – A visual view of the new command and data flow
  • 4. APU: ACCELERATED PROCESSING UNIT
    The APU has arrived, and it is a great advance over previous platforms. It combines scalar processing on the CPU with parallel processing on the GPU and high-bandwidth access to memory.
    How do we make it even better going forward?
    – Easier to program
    – Easier to optimize
    – Easier to load balance
    – Higher performance
    – Lower power
  • 5. LOW POWER E-SERIES AMD FUSION APU: “ZACATE”
    E-Series APU:
    – 2 x86 “Bobcat” CPU cores
    – Array of Radeon™ Cores: discrete-class DirectX® 11 performance, 80 Stream Processors
    – 3rd Generation Unified Video Decoder
    – PCIe® Gen2
    – Single-channel DDR3 @ 1066
    – 18W TDP
    Performance: up to 8.5 GB/s system memory bandwidth; up to 90 Gflops of single-precision compute.
  • 6. TABLET Z-SERIES AMD FUSION APU: “DESNA”
    Z-Series APU:
    – 2 x86 “Bobcat” CPU cores
    – Array of Radeon™ Cores: discrete-class DirectX® 11 performance, 80 Stream Processors
    – 3rd Generation Unified Video Decoder
    – PCIe® Gen2
    – Single-channel DDR3 @ 1066
    – 6W TDP with local hardware thermal control
    Performance: up to 8.5 GB/s system memory bandwidth; suitable for sealed, passively cooled designs.
  • 7. MAINSTREAM A-SERIES AMD FUSION APU: “LLANO”
    A-Series APU:
    – Up to four x86 CPU cores with AMD Turbo CORE frequency acceleration
    – Array of Radeon™ Cores: discrete-class DirectX® 11 performance
    – 3rd Generation Unified Video Decoder; Blu-ray 3D stereoscopic display
    – PCIe® Gen2
    – Dual-channel DDR3
    – 45W TDP
    Performance: up to 29 GB/s system memory bandwidth; up to 500 Gflops of single-precision compute.
  • 8. COMMITTED TO OPEN STANDARDS
    – AMD drives open and de-facto standards (e.g., DirectX®) and competes on the best implementation.
    – Open standards are the basis for large ecosystems.
    – Open standards always win over time: SW developers want their applications to run on multiple platforms from multiple hardware vendors.
  • 9. A NEW ERA OF PROCESSOR PERFORMANCE
    – Single-Core Era: enabled by Moore’s Law and voltage scaling; constrained by power and complexity. Languages: Assembly, C/C++, Java, … Single-thread performance has plateaued (“we are here”).
    – Multi-Core Era: enabled by Moore’s Law and SMP architecture; constrained by power, parallel SW and scalability. Models: pthreads, OpenMP / TBB, … Throughput performance scales with the number of processors (“we are here”).
    – Heterogeneous Systems Era: enabled by abundant data parallelism and power-efficient GPUs; temporarily constrained by programming models and communication overhead. Models: Shader, CUDA, OpenCL. Modern application performance scales with data-parallel exploitation (“we are here”).
  • 10. EVOLUTION OF HETEROGENEOUS COMPUTING (architecture maturity and programmer accessibility over time)
    – Proprietary Drivers Era (2002–2008): “adventurous” and expert programmers; graphics and proprietary driver-based APIs; compute-centric APIs and data types; exploit early programmable “shader cores” in the GPU; make your program look like “graphics” to the GPU; CUDA™, Brook+, etc.
    – Standards Drivers Era (2009–2011): GPU as a co-processor; OpenCL™ and DirectCompute; driver-based APIs; C and C++ subsets; multiple address spaces with explicit data movement; specialized work-queue-based structures; kernel-mode dispatch.
    – Architected Era (2012–2020): AMD Fusion System Architecture; GPU as a peer processor; mainstream programmers; full C++; unified coherent address space; task-parallel runtimes; nested data-parallel programs; user-mode dispatch; pre-emption and context switching. See Herb Sutter’s keynote tomorrow for a cool example of plans for the architected era!
  • 11. FSA FEATURE ROADMAP
    – Physical Integration: integrate CPU and GPU in silicon; unified memory controller; common manufacturing technology.
    – Optimized Platforms: GPU compute C++ support; GPU uses pageable system memory via CPU pointers; bi-directional power management between CPU and GPU.
    – Architectural Integration: unified address space for CPU and GPU; fully coherent memory between CPU and GPU; user-mode scheduling.
    – System Integration: GPU compute context switch; GPU graphics pre-emption; quality of service; extend to discrete GPU.
  • 12. FUSION SYSTEM ARCHITECTURE – AN OPEN PLATFORM
    – Open architecture, published specifications: FSAIL virtual ISA, FSA memory model, FSA dispatch
    – ISA-agnostic for both CPU and GPU
    – Inviting partners to join us in all areas: hardware companies, operating systems, tools and middleware, applications
    – FSA review committee planned
  • 13. FSA INTERMEDIATE LAYER - FSAIL
    – FSAIL is a virtual ISA for parallel programs, finalized to the target ISA by a JIT compiler or “Finalizer”
    – Explicitly parallel: designed for data-parallel programming
    – Support for exceptions, virtual functions, and other high-level language features
    – Syscall methods: GPU code can call directly to system services, IO, printf, etc.
    – Debugging support
  • 14. FSA MEMORY MODEL
    – Designed to be compatible with the C++0x, Java and .NET memory models
    – Relaxed-consistency memory model for parallel compute performance
    – Loads and stores can be re-ordered by the finalizer
    – Visibility controlled by: Load.Acquire, Store.Release, fences, barriers
  • 15. DRIVER STACK vs. FSA SOFTWARE STACK
    [Diagram comparing the two stacks:]
    – Driver stack: Apps → Domain Libraries → OpenCL™ 1.x and DX Runtimes with User Mode Drivers → Graphics Kernel Mode Driver → hardware.
    – FSA software stack: Apps → FSA Domain Libraries and Task Queuing Libraries → FSA Runtime and FSA JIT → FSA Kernel Mode Driver → hardware (APUs, CPUs, GPUs).
    AMD provides the user-mode and kernel-mode components; all others are contributed by third parties or AMD.
  • 16. OPENCL™ AND FSA
    – FSA is an optimized platform architecture for OpenCL™, not an alternative to OpenCL™
    – OpenCL™ on FSA will benefit from: avoidance of wasteful copies, low-latency dispatch, an improved memory model, and pointers shared between CPU and GPU
    – FSA also exposes a lower-level programming interface for those who want the ultimate in control and performance; optimized libraries may choose the lower-level interface
  • 17. TASK QUEUING RUNTIMES
    – A popular pattern for task- and data-parallel programming on SMP systems today
    – Characterized by: a work queue per core; a runtime library that divides large loops into tasks and distributes them to queues; a work-stealing runtime that keeps the system balanced
    – FSA is designed to extend this pattern to run on heterogeneous systems
  • 18. TASK QUEUING RUNTIME ON CPUS
    [Diagram: a work-stealing runtime with one queue (Q) per CPU worker, one worker per x86 CPU core, all sharing memory.]
  • 19. TASK QUEUING RUNTIME ON THE FSA PLATFORM
    [Diagram: the same runtime with an added queue and a GPU Manager worker fronting a Radeon™ GPU alongside the four x86 CPU workers.]
  • 20. TASK QUEUING RUNTIME ON THE FSA PLATFORM
    [Diagram: detail of the GPU side. The GPU Manager feeds a fetch-and-dispatch unit that spreads work across SIMD units; CPU threads and GPU threads share the same memory.]
  • 21. FSA SOFTWARE EXAMPLE - REDUCTION

        float foo(float);
        float myArray[…];

        Task<float, ReductionBin> task([myArray](IndexRange<1> index) [[device]] {
            float sum = 0.;
            for (size_t i = index.begin(); i != index.end(); i++) {
                sum += foo(myArray[i]);
            }
            return sum;
        });

        float result = task.enqueueWithReduce(
            Partition<1, Auto>(1920),
            [] (int x, int y) [[device]] { return x + y; },
            0.);
  • 22. HETEROGENEOUS COMPUTE DISPATCH
    – How compute dispatch operates today in the driver model
    – How compute dispatch improves tomorrow under FSA
  • 23.–26. TODAY’S COMMAND AND DISPATCH FLOW
    [Diagram, built up over four slides: each application (A, B, C) submits work through its own soft queue, Direct3D, a user-mode driver that builds a command buffer, and a kernel-mode driver that builds a DMA buffer; command flow and data flow are drawn separately. All three applications’ packets are then serialized onto the single hardware queue feeding the GPU.]
  • 27. FUTURE COMMAND AND DISPATCH FLOW
    [Diagram: each application (A, B, C) owns a user-mode hardware queue, with an optional dispatch buffer, and writes packets into it directly; hardware scheduling multiplexes the queues onto the GPU.]
    – Application codes to the hardware; user-mode queuing; hardware scheduling; low dispatch times
    – No APIs, no soft queues, no user-mode drivers, no kernel-mode transitions: no overhead!
  • 28.–31. FUTURE COMMAND AND DISPATCH, CPU <-> GPU
    [Animation over four slides: the application/runtime dispatches work among CPU1, CPU2 and the GPU.]
  • 32. WHERE ARE WE TAKING YOU? Switch the compute, don’t move the data!
    Platform design goals:
    – Every processor now has serial and parallel cores
    – All cores capable, with performance differences
    – Easy support of massive data sets
    – Support for task-based programming models
    – Simple and efficient program model
    – Solutions for all platforms
    – Open to all
  • 33. THE FUTURE OF HETEROGENEOUS COMPUTING
    The architectural path for the future is clear:
    – Programming patterns established on Symmetric Multi-Processor (SMP) systems migrate to the heterogeneous world
    – An open architecture, with published specifications and an open-source execution software stack
    – Heterogeneous cores working together seamlessly in coherent memory
    – Low-latency dispatch
    – No software fault lines
