Sensor Integration and ImprovedUser Experiences at Lower PowerPhil RogersPresident, HSA FoundationCorporate Fellow, AMD
SENSOR INTEGRATION CHALLENGES      Sensors on all new devices, going up in       resolution and data rates      Extract ...
ABUNDANT PARALLEL WORKLOADS                                                        Biometric                              ...
A NEW ERA OF PROCESSORPERFORMANCE                                                                                         ...
EVOLUTION OF HETEROGENEOUS      COMPUTING   Architecture Maturity & Programmer Accessibility                              ...
HSA FEATURE ROADMAP         Physical                                 Optimized          Architectural            System   ...
HETEROGENEOUS SYSTEMARCHITECTURE – AN OPEN PLATFORM      Open Architecture, published specifications              HSAIL ...
HSA INTERMEDIATE LAYER - HSAIL        HSAIL is a virtual ISA for parallel programs               Finalized to ISA by a J...
HSA MEMORY MODEL      Designed to be compatible with C++11, Java and       .NET Memory Models      Relaxed consistency m...
THE HSA FOUNDATION                Founders                 Promoters                Supporters                Contributors...
HSA SOFTWARE STACKSAPPLICATIONS, SYSTEM SOFTWARE ANDPROGRAMMING MODELS
Driver Stack                                                 HSA Software Stack           Apps                            ...
HSA SOLUTION STACK                                           Application                                        Domain Spe...
OPENCL™ AND HSA     HSA is an optimized platform architecture for      OpenCL™            Not an alternative to OpenCL™...
MAKE GPUS EASIER TO PROGRAM:PRIMARY PROGRAMMING MODELS•     Microsoft C++AMP        •    Address large population of Windo...
AMD’S OPEN SOURCE COMMITMENT TO HSA     We will open source our linux execution and compilation stack            Jump st...
ACCELERATED WORKLOADSCLIENT AND SERVER EXAMPLES
HAAR Face DetectionCORNERSTONE TECHNOLOGYFOR COMPUTERVISION
LOOKING FOR FACES ONE PLACE AT A TIMEQuick HD CalculationsSearch square = 21 x 21Pixels = 1920 x 1080 = 2,073,600Search sq...
LOOKING FOR DIFFERENT SIZE FACES –BY SCALING THE VIDEO FRAME       More HD Calculations       70% scaling in H and V      ...
HAAR CASCADE STAGES                                                        Feature k                                      ...
22 CASCADE STAGES,EARLY OUT BETWEEN EACH                                                                                  ...
CASCADE DEPTH ANALYSIS                                                        Cascade Depth                               ...
PROCESSING TIME/STAGE                                                                  “Trinity” A10-4600M (6CU@497Mhz, 4 ...
PERFORMANCE CPU-VS-GPU                                                                       “Trinity” A10-4600M (6CU@497M...
HAAR SOLUTION – RUN DIFFERENTCASCADES ON GPU AND CPU                                 By seamlessly sharing data between CP...
ACCELERATING B+TREESEARCHESCLOUD SERVER WORKLOAD
B+TREE SEARCHES     B+Trees are used by enterprise DB applications            SQL: SQLite, MySQL, Oracle, and others    ...
PARALLEL B+TREE SEARCHES ON HSA With HSA, DB can be larger than GPU                                                       ...
ACCELERATING SUFFIX ARRAYCONSTRUCTIONCLOUD SERVER WORKLOAD
SUFFIX ARRAYS     Suffix Arrays are used to accelerate in-memory cloud workloads            Full text index search      ...
ACCELERATED SUFFIX ARRAYCONSTRUCTION ON HSABy efficiently sharing data between CPU and                                    ...
VASCULAR IMAGEENHANCEMENTEMBEDDED MEDICAL WORKLOAD
VASCULATURE IMAGE ENHANCEMENT     Automatic Image Enhancement            Reduces expertise required to analyze the image...
IMPROVED PERFORMANCE AND ENERGY                                        HSA increases performance and reduces energy versus...
EASE OF PROGRAMMINGCODE COMPLEXITY VS. PERFORMANCE
LINES-OF-CODE AND PERFORMANCE FORDIFFERENT PROGRAMMING MODELS         350                                                 ...
THE HSA FUTURE                                     Highly productive programmers                                 + Scalabl...
THE HSA FOUNDATION                Founders                 Promoters                Supporters                Contributors...
THANK YOU!                                                  www.hsafoundation.com© Copyright 2012 HSA Foundation. All Righ...
Upcoming SlideShare
Loading in …5
×

ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at Even Lower Power – HSA

2,383 views

Published on

HSA is a new computing platform architecture being standardized by the HSA Foundation which has as Founding members, AMD, ARM, Imagination, TI, Mediatek, Samsung and Qualcomm. HSA is intended to make the use of heterogeneous programming widespread by making purpose built architectures as easy to program as modern CPUs are. We start off by doing this with the GPU, the most widely deployed companion processor to the CPU and one which especially complements the CPU in low power and performance workloads. This requires some hardware architecture changes, that we have been working on for some time (in particular those that enable user mode scheduling, unified address space, unified shared memory, compute context switching, etc.) and which we have encapsulated into the spec currently under review by the HSA Foundation.
In short, HSA codifies the hardware architecture changes that are needed to enable mainstream programmers to develop heterogeneous application with the same facility that they do CPU only applications by seamlessly integrating the sequential programming capability of the CPU with the parallel compute capability of the GPU. We describe the software stacks that are needed for HSA, the benefits that accrue to both developers as well as end users, and describe our vision of the how HSA will help unify the ecosystems of the smartphone and tablet platforms as well as bring it closer to that of the traditional PC market. We will provide analysis of several examples which arise in applications and present data to validate the performance per watt benefit of HSA.

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,383
On SlideShare
0
From Embeds
0
Number of Embeds
599
Actions
Shares
0
Downloads
105
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at Even Lower Power – HSA

  1. 1. Sensor Integration and ImprovedUser Experiences at Lower PowerPhil RogersPresident, HSA FoundationCorporate Fellow, AMD
  2. 2. SENSOR INTEGRATION CHALLENGES  Sensors on all new devices, going up in resolution and data rates  Extract meaning from video, audio and location  Mix of scalar and parallel workloads  Process locally or in the cloud  Power must keep going down© Copyright 2012 HSA Foundation. All Rights Reserved. 2
  3. 3. ABUNDANT PARALLEL WORKLOADS Biometric Recognition Natural UI & Secure, fast, accurate: face, voice, fingerprints Augmented Gestures Reality Touch, gesture, Superimpose graphics, and voice audio, and other digital information as a virtual overlay Content AV Content Everywhere Management Content from any Searching, indexing and source to any display tagging of video & audio. seamlessly multimedia data mining Beyond HD Experiences Streaming media, new codecs, 3D, transcode, audio© Copyright 2012 HSA Foundation. All Rights Reserved. 3
  4. 4. A NEW ERA OF PROCESSORPERFORMANCE Heterogeneous Single-Core Era Multi-Core Era Systems Era Enabled by: Constrained by: Enabled by: Constrained by: Enabled by: Temporarily  Moore’s Power  Moore’s Law Power  Abundant data Constrained by: Law Complexity  SMP Parallel SW parallelism Programming  Voltage architecture Scalability  Power efficient models Scaling GPUs Comm.overhead Assembly  C/C++  Java … pthreads  OpenMP  TBB Shader  CUDA OpenCL  C++ and Java Modern Application Single-thread Performance Performance Performance Throughput ? we are we are here here we are here Time Time (# of processors) Time (Data-parallel exploitation)© Copyright 2012 HSA Foundation. All Rights Reserved. 4
  5. 5. EVOLUTION OF HETEROGENEOUS COMPUTING Architecture Maturity & Programmer Accessibility Architected Era Excellent Standards Drivers Era Heterogeneous System Architecture GPU Peer Processor OpenCL™, DirectCompute  Mainstream programmers Proprietary Drivers Era Driver-based APIs  Full C++  Expert programmers  GPU as a co-processor Graphics & Proprietary  C and C++ subsets  Unified coherent address space Driver-based APIs  Task parallel runtimes  Compute centric APIs , data types  Nested Data Parallel programs  “Adventurous” programmers  User mode dispatch  Multiple address spaces with explicit data movement  Pre-emption and context  Exploit early programmable  Specialized work queue based switching “shader cores” in the GPU structures  Make your program look like  Kernel mode dispatch “graphics” to the GPUPoor  CUDA™, Brook+, etc 2002 - 2008 2009 - 2011 2012 - 2020 2002 - 2008 2009 - 2011 2012 - 2020 © Copyright 2012 HSA Foundation. All Rights Reserved. 5
  6. 6. HSA FEATURE ROADMAP Physical Optimized Architectural System Integration Platforms Integration Integration Integrate CPU & GPU GPU Compute C++ Unified Address Space GPU compute in silicon support for CPU and GPU context switch GPU uses pageable Unified Memory GPU graphics User mode scheduling system memory via Controller pre-emption CPU pointers Common Bi-Directional Power Fully coherent memory Quality of Service Manufacturing Mgmt between CPU between CPU & GPU Technology and GPU© Copyright 2012 HSA Foundation. All Rights Reserved. 6
  7. 7. HETEROGENEOUS SYSTEMARCHITECTURE – AN OPEN PLATFORM  Open Architecture, published specifications  HSAIL virtual ISA  HSA memory model  HSA system architecture  ISA agnostic for both CPU and GPU  HSA Foundation formed in June 2012  Inviting partners to join us, in all areas  Hardware companies  Operating Systems  Tools and Middleware  Applications© Copyright 2012 HSA Foundation. All Rights Reserved. 7
  8. 8. HSA INTERMEDIATE LAYER - HSAIL  HSAIL is a virtual ISA for parallel programs  Finalized to ISA by a JIT compiler or “Finalizer”  ISA independent by design  Explicitly parallel  Designed for data parallel programming  Support for exceptions, virtual functions, and other high level language features  Syscall methods  GPU code can call directly to system services, IO, printf, etc  Debugging support© Copyright 2012 HSA Foundation. All Rights Reserved. 8
  9. 9. HSA MEMORY MODEL  Designed to be compatible with C++11, Java and .NET Memory Models  Relaxed consistency memory model for parallel compute performance  Loads and stores can be re-ordered by the finalizer  Visibility controlled by:  Load.Acquire  Store.Release  Barriers© Copyright 2012 HSA Foundation. All Rights Reserved. 9
  10. 10. THE HSA FOUNDATION Founders Promoters Supporters Contributors Associates
  11. 11. HSA SOFTWARE STACKSAPPLICATIONS, SYSTEM SOFTWARE ANDPROGRAMMING MODELS
  12. 12. Driver Stack HSA Software Stack Apps Apps Apps Apps Apps Apps Apps Apps Apps Apps Apps Apps Domain Libraries HSA Domain Libraries OpenCL™ 1.x, DX Runtimes, HSA Runtime User Mode Drivers Task Queuing HSA JIT Libraries HSA Kernel Graphics Kernel Mode Driver Mode Driver Hardware - APUs, CPUs, GPUs User mode component Kernel mode component Components contributed by third parties© Copyright 2012 HSA Foundation. All Rights Reserved. 12
  13. 13. HSA SOLUTION STACK Application Domain Specific Libs Application SW (Bolt, OpenCV™, … many others) OpenCL™ DirectX Other Runtime Runtime Runtime HSA Runtime Legacy Drivers HSA Software HSAIL Ctl Drivers HSA Finalizer Knl Driver GPU ISA Other Differentiated HW CPU(s) GPU(s) Accelerators
  14. 14. OPENCL™ AND HSA HSA is an optimized platform architecture for OpenCL™  Not an alternative to OpenCL™ OpenCL™ on HSA will benefit from  Avoidance of wasteful copies  Low latency dispatch  Improved memory model  Pointers shared between CPU and GPU HSA also exposes a lower level programming interface, for those that want the ultimate in control and performance  Can be exposed through OpenCL extensions© Copyright 2012 HSA Foundation. All Rights Reserved. 14
  15. 15. MAKE GPUS EASIER TO PROGRAM:PRIMARY PROGRAMMING MODELS• Microsoft C++AMP • Address large population of Windows developers • Integrated in Visual Studio and WinRT • Microsoft Community Promise License for open platform use• Java acceleration • Aparapi on OpenCL today • Project Sumatra to add HSA support in an OpenJDK for Java 8 • Driving to have Sumatra absorbed into Java 9© Copyright 2012 HSA Foundation. All Rights Reserved. 15
  16. 16. AMD’S OPEN SOURCE COMMITMENT TO HSA We will open source our linux execution and compilation stack  Jump start the ecosystem  Allow a single shared implementation where appropriate  Enable university research in all areas Component Name AMD Rationale Specific HSA Bolt Library No Enable understanding and debug HSAIL Code Generator No Enable research LLVM Contributions No Industry and academic collaboration HSA Assembler No Enable understanding and debug HSA Runtime No Standardize on a single runtime HSA Finalizer Yes Enable research and debug HSA Kernel Driver Yes For inclusion in linux distros© Copyright 2012 HSA Foundation. All Rights Reserved. 16
  17. 17. ACCELERATED WORKLOADSCLIENT AND SERVER EXAMPLES
  18. 18. HAAR Face DetectionCORNERSTONE TECHNOLOGYFOR COMPUTERVISION
  19. 19. LOOKING FOR FACES ONE PLACE AT A TIMEQuick HD CalculationsSearch square = 21 x 21Pixels = 1920 x 1080 = 2,073,600Search squares = 1900 x 1060 = ~2 Million
  20. 20. LOOKING FOR DIFFERENT SIZE FACES –BY SCALING THE VIDEO FRAME More HD Calculations 70% scaling in H and V Total Pixels = 4.07 Million Search squares = 3.8 Million© Copyright 2012 HSA Foundation. All Rights Reserved. 20
  21. 21. HAAR CASCADE STAGES Feature k Feature l Stage N Feature m Face still Yes possible? Feature p No Feature r Stage N+1 Feature q REJECT FRAME© Copyright 2012 HSA Foundation. All Rights Reserved. 21
  22. 22. 22 CASCADE STAGES,EARLY OUT BETWEEN EACH FACE STAGE 1 STAGE 2 STAGE 21 STAGE 22 CONFIRMED NO FACE Final HD Calculations Calculation Rate Search squares = 3.8 million 30 frames/sec = 1.4TCalcs/second Average features per square = 124 60 frames/sec = 2.8TCalcs/second Calculations per feature = 100 Calculations per frame = 47 GCalcs …and this only gets front-facing faces© Copyright 2012 HSA Foundation. All Rights Reserved. 22
  23. 23. CASCADE DEPTH ANALYSIS Cascade Depth 25 20 15 10 5 20-25 0 15-20 10-15 5-10 0-5© Copyright 2012 HSA Foundation. All Rights Reserved. 23
  24. 24. PROCESSING TIME/STAGE “Trinity” A10-4600M (6CU@497Mhz, 4 cores@2700Mhz) 100 90 80 70 Time (ms) 60 50 40 30 GPU 20 10 CPU 0 1 2 3 4 5 6 7 8 9-22 Cascade Stage AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 MHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)© Copyright 2012 HSA Foundation. All Rights Reserved. 24
  25. 25. PERFORMANCE CPU-VS-GPU “Trinity” A10-4600M (6CU@497Mhz, 4 cores@2700Mhz) 12 10 8 Images/Sec 6 4 CPU HSA 2 GPU 0 0 1 2 3 4 5 6 7 8 22 Number of Cascade Stages on GPU AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 MHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)© Copyright 2012 HSA Foundation. All Rights Reserved. 25
  26. 26. HAAR SOLUTION – RUN DIFFERENTCASCADES ON GPU AND CPU By seamlessly sharing data between CPU and GPU, HSA allows the right processor to handle its appropriate workload +2.5x -2.5x INCREASED DECREASED ENERGY PERFORMANCE PER FRAME© Copyright 2012 HSA Foundation. All Rights Reserved. 26
  27. 27. ACCELERATING B+TREESEARCHESCLOUD SERVER WORKLOAD
  28. 28. B+TREE SEARCHES B+Trees are used by enterprise DB applications  SQL: SQLite, MySQL, Oracle, and others  No-SQL: CouchDB, Tokyo Cabinet, and others  Audio search, video copy detection  DB Size: Can be many times larger than GPU memory! A simple B+Tree linking the keys 1-7. The linked list (red) allows rapid in-order traversal. B+Trees are a fundamental data structure  Used to reduce memory & disk access to locate a key  Can support index- and range-based queries  Can be updated efficiently© Copyright 2012 HSA Foundation. All Rights Reserved. 28
  29. 29. PARALLEL B+TREE SEARCHES ON HSA With HSA, DB can be larger than GPU HSA increases performance versus Multi memory, and can be shared with CPU. Threaded CPU, even for tree structures that reside in pinned host memory.  HSA lets us move compute to data Millions of Queries Per Second (MPQ) by B+Tree “order”  Parallel search can move to GPU 80  Sequential updates can remain on CPU 70 60 50 MQPS Platform Size < 3 GB Size > 3 GB 40 CPU 30 dGPU ✓ ✗ 20 APU (memory size = 3GB) 10 HSA ✓ ✓ 0 4 8 16 32 64 128 “Order” of B+Tree Node M. Daga, and M. Nutter, “Exploiting Coarse-Grained Parallelism in B+Tree Searches on an APU”, Accepted at ”Second Workshop on Irregular Applications: Algorithms and Architectures, (IA3)” November 2012. AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 MHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM© Copyright 2012 HSA Foundation. All Rights Reserved. 29
  30. 30. ACCELERATING SUFFIX ARRAYCONSTRUCTIONCLOUD SERVER WORKLOAD
  31. 31. SUFFIX ARRAYS Suffix Arrays are used to accelerate in-memory cloud workloads  Full text index search  Lossless data compression  Bio-informatics Suffix Arrays are a fundamental data structure  Designed for efficient searching of a large text  Quickly locate every occurrence of a substring S in a text T Suffix Array Construction contains parallel & sequential steps  Sorting, ideal for GPU  Nested recursion, ideal for CPU© Copyright 2012 HSA Foundation. All Rights Reserved. 31
  32. 32. ACCELERATED SUFFIX ARRAYCONSTRUCTION ON HSABy efficiently sharing data between CPU and By offloading data parallel computations toGPU, HSA lets us move compute to data GPU, HSA increases performance andwithout penalty of intermediate copies. reduces energy for Suffix Array Construction versus single threaded execution. Skew Algorithm for Compute SA Radix Sort::GPU +5.8x Lexical Rank::CPU Compute SA::CPU Radix Sort::GPU -5x Merge Sort::GPU INCREASED DECREASED PERFORMANCE ENERGY M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, Submitted to ”Principles and Practice of Parallel Programming, (PPoPP’13)” February 2013. AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 MHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM© Copyright 2012 HSA Foundation. All Rights Reserved. 32
  33. 33. VASCULAR IMAGEENHANCEMENTEMBEDDED MEDICAL WORKLOAD
  34. 34. VASCULATURE IMAGE ENHANCEMENT Automatic Image Enhancement  Reduces expertise required to analyze the image  Reduces time to diagnosis Vasculature Image Enhancement: Input Image Output Image GPU::Gaussian Convolution GPU::Hessian Matrix Compute Intermediate copied to CPU GPU::Eigen Decomposition Intermediate copied to CPU GPU::Vascular Network Compute Intermediate copied to CPU CPU::Analysis© Copyright 2012 HSA Foundation. All Rights Reserved. 34
  35. 35. IMPROVED PERFORMANCE AND ENERGY HSA increases performance and reduces energy versus single threaded execution by offloading data parallel compute to GPU +14x -14x INCREASED DECREASED PERFORMANCE ENERGY AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 MHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)© Copyright 2012 HSA Foundation. All Rights Reserved. 35
  36. 36. EASE OF PROGRAMMINGCODE COMPLEXITY VS. PERFORMANCE
  37. 37. LINES-OF-CODE AND PERFORMANCE FORDIFFERENT PROGRAMMING MODELS 350 35.00 (Exemplary ISV “Hessian” Kernel) 300 30.00 Init. 250 25.00 Launch Performance 200 Compile 20.00 LOC Compile Copy Copy 150 15.00 Launch Launch Launch Algorithm Launch 100 10.00 Launch Algorithm Algorithm Algorithm Launch 50 5.00 Algorithm Algorithm Algorithm Copy-back Copy-back Copy-back 0 0 Serial CPU TBB Intrinsics+TBB OpenCL™-C OpenCL™ -C++ C++ AMP HSA Bolt Copy-back Algorithm Launch Copy Compile Init Performance AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM. Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta© Copyright 2012 HSA Foundation. All Rights Reserved. 37
  38. 38. THE HSA FUTURE Highly productive programmers + Scalable performance + Power efficiency = AMAZING USER EXPERIENCES© Copyright 2012 HSA Foundation. All Rights Reserved. 38
  39. 39. THE HSA FOUNDATION Founders Promoters Supporters Contributors Associates
  40. 40. THANK YOU! www.hsafoundation.com© Copyright 2012 HSA Foundation. All Rights Reserved. 40

×