Power Optimization Through Manycore Multiprocessing


John Goodacre, ARM

Published in: Technology

  • The performance requirements of handsets and other mobile devices continue to grow exponentially, with new applications, advanced gaming, and traditional PC-type functionality migrating rapidly to these platforms. While this capability enables the next wave of the digital revolution, it comes at the price of increased power usage and potential thermal challenges. This presentation will investigate the issues and compromises traditionally required to push performance to the next level, and the challenges we face as an industry if we do not architecturally innovate on the implementation of advanced systems. We will demonstrate key advances in future processor designs and highlight the advantages and challenges faced as we look to deliver high performance in the low power world.
  • EXAMPLE: Digital camera sport mode (burst mode). Take a lot of pictures, then filter and JPEG-encode them on the go. Each picture is an independent work item, so instead of processing the pictures one at a time, one after the other, you can process them in parallel. Execution finishes sooner; then switch off the cores and go to sleep, with low leakage and no dynamic power consumption. ANOTHER EXAMPLE: Complex post-processing on a large RAW digital image. More than one thread can act concurrently on the input data and write to the output image (reads can overlap).
  • EXAMPLE: You have more than one application running at the same time. On a single core your multitasking OS will time-slice. On a multi-core system the applications run in parallel, so they execute in less time and are more responsive (i.e. the UI).
  • EXAMPLE: VIDEO CODEC. This works because a video codec processes a stream. Within a single frame, and within a group of frames, there are all sorts of dependencies, BUT this is a stream: while you are storing the result of an encoded frame, you can already be calculating the maths of the following frames, sampling the next one, and so on. Each core can have a task allocated to it, and the code needs to be modified so that these tasks synchronise and communicate with each other. Distribute different functional blocks of the decoder across the available processors. Multi-task pipeline, e.g. taskA -> taskB -> (multiple) taskC -> taskD. Split into defined functional threads. Uses passing of data blocks between threads to allocate work.
  • Start with a cheap package (high thermal resistance: 15C/W Thetajb, 30C/W Thetaja) and 60C Tjb (so we use Thetajb). 1.5 to 2W limit with stacked memory (including the memory Tj max of 85C). 3W w/o mems (20C advantage to play with, assuming 105C max Tj for the SOC). NB: This is an issue we need to understand a lot better.
  • What is DVM? Why does the slide say 3 masters and 2 slaves (looks like the other way around)?

    1. 1. Power Optimization Through Many-Core Multiprocessing: Delivering High Performance in a Low Power World. ChipEx2012. Haydn Povey, Marketing Director – Implementation & Security, ARM Processor Division. May 2, 2012
    2. 2. Billions of Connected Devices
        Performance expectations continue to increase exponentially but power efficiency and scalability are becoming formidable challenges
        Form Factor                 2015 TAM (m)
        Mobile Phones               1,750
        Media players               300
        Mobile Computers            750
        Desktop PCs                 150
        Digital TV/STB              500
        Automotive Infotainment     100
        Other*                      450
        Total                       4 billion
        *Includes PND, photo-frames, etc
        ABI Research, IDC, Gartner and ARM forecasts
    3. 3. Historic Technology Drivers
        • Up to 1980s, Mainframes/mini: Functionality
        • 1990s, The PC: Functionality / $
        • 2000s, Notebooks: Functionality / (Power × $)
        • 2010s, Mobile Computing: Functionality / (Energy × $)
    4. 4. Low Power Positioned for the Future
        • Going forward low power is necessary for everything from microcontrollers to servers
        • Low power is a design philosophy
            • Mindset, style, culture and working practice
            • Not something you change or acquire easily
        • Low power is a design reality
            • ARM is an efficient architecture
            • None of the legacy or CISC complexity
        • Low cost is a design & manufacturing partnership
            • Time to volume not time to niche markets
            • Speed-binning not good enough for mass-market
        (Figure: 2010s, Mobile Computing: Functionality / (Energy × $))
    5. 5. Limitations with Multiprocessing
        • Cost of offering the peak single thread performance on each CPU quickly exceeds chassis thermal limits
        • System and software bottlenecks limit overall scalability
        • Single die integration offered some roadmap
    6. 6. Evolution to Many-Core
        • Base theorem
            • Simpler and smaller processor designs require exponentially less energy to accomplish the same amount of compute as a more complex and larger processor design.
        • “Approximate rule of thumb”
            • To increase performance 50% you double the power and area cost of the processor design
            • Quickly reaches point of diminishing returns
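The rule of thumb on slide 6 can be turned into a small calculation. The sketch below assumes the rule extends to a power law (performance proportional to power raised to log 1.5 / log 2, about 0.585) and that the workload parallelises perfectly across cores; both are simplifying assumptions, and the function names are illustrative only.

```python
import math

# Slide 6 rule of thumb: +50% performance costs 2x power/area.
# Assumption: extend this to a power law, perf = power ** ALPHA.
ALPHA = math.log(1.5) / math.log(2.0)  # about 0.585


def single_core_perf(power: float) -> float:
    """Relative performance of one core given `power` units of budget."""
    return power ** ALPHA


def many_core_perf(power: float, cores: int) -> float:
    """Aggregate performance of `cores` equal cores sharing `power`.

    Assumes perfect parallel scaling, which real software rarely achieves.
    """
    return cores * single_core_perf(power / cores)


if __name__ == "__main__":
    budget = 2.0  # two units of power, relative to a baseline core
    print(f"one big core:    {single_core_perf(budget):.2f}x")
    print(f"two small cores: {many_core_perf(budget, 2):.2f}x")
```

Under these assumptions the same power budget buys more aggregate throughput from two small cores than from one large core, but only for work that can actually be split.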
    7. 7. Challenge of Many-Core
        • Many-core definition
            • Use ‘lots’ of smaller, more efficient processors to achieve a higher aggregate performance than can be reached through multiprocessing
        • Smaller processors are not capable of executing the same single thread as a higher performance processor in the same time, so they can’t execute existing applications effectively
        • Many threads cannot easily be decomposed into simpler, smaller tasks so as to benefit from multiprocessing on the smaller processor
        • Software development challenge
    8. 8. Software Data Decomposition
        • Each data item is independent
        • Split large quantity of DATA into smaller chunks that can be operated on in parallel
        (Figure: DATA split into TASKs, each assigned to its own CPU)
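Data decomposition as described on slide 8 can be sketched in a few lines. This is a minimal illustration, not code from the presentation: `process_chunk` is a hypothetical stand-in for per-item work such as filtering one picture, and a thread pool is used only to keep the example self-contained (CPU-bound Python work would normally use processes instead).

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk: list[int]) -> list[int]:
    """Hypothetical per-chunk work; each item is independent of the others."""
    return [value * 2 for value in chunk]


def split(data: list[int], parts: int) -> list[list[int]]:
    """Split `data` into at most `parts` contiguous chunks."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_in_parallel(data: list[int], workers: int = 4) -> list[int]:
    """Process independent chunks concurrently and reassemble in order."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_chunk, chunks))  # map keeps order
    return [value for chunk in results for value in chunk]


if __name__ == "__main__":
    print(process_in_parallel(list(range(10))))
```

Because no chunk depends on another, the only coordination needed is splitting the input and joining the results.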
    9. 9. Software Task Decomposition
        • Each task item is functionally independent
        • Functionally independent tasks can be executed concurrently
        (Figure: groups of TASKs distributed across CPUs)
    10. 10. Functional Block Partitioning
        • Functional blocks are serially dependent
            • But temporally independent
        • Distribute different functional blocks across available processors
            • Split into defined functional threads
            • Uses passing of data blocks between threads to allocate work
        • Requires code changes and fine tuning
        • Example: Real Time Video Encoding
        (Figure: simplified MPEG encoding functional block diagram over time, with the blocks Analogue Video Sampling, Remove Inter-Frame Redundancy, Motion Compensation, Remove Intra-Frame Redundancy, Quantise Samples, Run-Length Compress and Buffer Store distributed across CPU0 to CPU3)
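The functional-block pipeline on slide 10 can be sketched with one thread per block and queues carrying data blocks between them. This is an illustrative sketch only: the stage functions below are hypothetical placeholders operating on integers, not real codec stages, and `run_pipeline` is a name introduced here.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the stream


def stage(func, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Run one functional block: take a data block, process it, pass it on."""
    while True:
        block = inbox.get()
        if block is _END:
            outbox.put(_END)  # propagate shutdown to the next stage
            return
        outbox.put(func(block))


def run_pipeline(blocks, funcs):
    """Stream `blocks` through `funcs`, one thread per functional block."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for worker in workers:
        worker.start()
    for block in blocks:
        queues[0].put(block)
    queues[0].put(_END)
    results = []
    while True:
        block = queues[-1].get()
        if block is _END:
            break
        results.append(block)
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    # Placeholder stages standing in for, e.g., sampling and quantising.
    def add_one(block):
        return block + 1

    def double(block):
        return block * 2

    print(run_pipeline([1, 2, 3], [add_one, double]))
```

Each stage depends on the output of the previous one, yet while a later stage handles one block an earlier stage can already work on the next, which is the temporal independence the slide refers to.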
    11. 11. Strategy Focus: The Thermal Wall
        • SOC sustained power is limited in mobile devices by thermals:
            • 1.5W to 2W with low-cost POP and stacked memories
            • 3W without stacked memories
            • Responsiveness is a must
            • Complex active management is needed
        (Figure: power versus time, with the labels: Burst for responsiveness (e.g. Browsing); “Opportunistic Residency”; T >= Tjmax, Tskin; Managed Sustained Power; Tj >= Tmax; Tj < Tmax; Un-managed Max Power (@Tjmax); Sustained performance (e.g. HD Video Record, Gaming); Power Optimised Low End (e.g. e-Mail, Voice, MP3))
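The power limits on slide 11 are consistent with the figures in the speaker notes if one applies the usual steady-state relation, power = temperature difference / thermal resistance. The sketch below assumes that "60C Tjb" in the notes is the reference temperature at the board and that the 15C/W Thetajb value applies in both cases; that reading is an assumption, and the function name is illustrative.

```python
def sustained_power_w(tj_max_c: float, t_ref_c: float, theta_c_per_w: float) -> float:
    """Steady-state power that keeps the junction at or below tj_max_c.

    Uses P = (Tj_max - T_ref) / theta, the standard thermal-resistance model.
    """
    return (tj_max_c - t_ref_c) / theta_c_per_w


if __name__ == "__main__":
    THETA_JB = 15.0  # C/W, from the speaker notes
    T_BOARD = 60.0   # C, assumed meaning of "60C Tjb" in the notes

    # Stacked memory limits the junction to 85C.
    print(f"with stacked memory:    {sustained_power_w(85.0, T_BOARD, THETA_JB):.2f} W")
    # Without stacked memory, assuming a 105C SOC junction limit.
    print(f"without stacked memory: {sustained_power_w(105.0, T_BOARD, THETA_JB):.2f} W")
```

With these inputs the stacked-memory case lands inside the 1.5W to 2W range quoted on the slide, and the case without stacked memory gives 3W.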
    12. 12. Applying Nominal Use Case
        • Typical Day for Smartphone User
            • 90 min voice calling
            • 60 min email / social networking
            • 30 min reading web
            • 50 min Angry Birds / other gaming
            • 90 min jogging while listening to music and logging GPS co-ordinates
            • 10 min video recording
            • 7 hrs sleep with music alarm clock
        • OS typically executing ~28 active processes
        • Apps synching in background
    13. 13. Use Case Measurements
    14. 14. Use Case Conclusion
        Profiled CPU States    Minutes    % of Active
        Deep Sleep             1186       n/a
        200 MHz                154        60%
        500 MHz                69         27%
        800 MHz                18         7%
        1000 MHz               4          2%
        1200 MHz               10         4%
        If the phone was ARM big.LITTLE™ enabled... Active CPU time: 12% big, 88% LITTLE
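The percentages on slide 14 follow directly from the profiled minutes. The sketch below recomputes them; the slide does not say which frequency states would map to the LITTLE core, so the split at 500 MHz used here is an assumption, and the helper names are illustrative.

```python
# Active minutes per CPU frequency state (MHz), from slide 14.
ACTIVE_MINUTES = {200: 154, 500: 69, 800: 18, 1000: 4, 1200: 10}


def residency_percent(minutes: dict[int, int]) -> dict[int, float]:
    """Share of active time spent in each frequency state, in percent."""
    total = sum(minutes.values())
    return {mhz: 100.0 * spent / total for mhz, spent in minutes.items()}


def little_share(minutes: dict[int, int], little_max_mhz: int) -> float:
    """Percent of active time at or below `little_max_mhz`.

    Assumption: states up to this frequency could run on the LITTLE core.
    """
    total = sum(minutes.values())
    low = sum(spent for mhz, spent in minutes.items() if mhz <= little_max_mhz)
    return 100.0 * low / total


if __name__ == "__main__":
    for mhz, percent in residency_percent(ACTIVE_MINUTES).items():
        print(f"{mhz:>5} MHz: {percent:5.1f}%")
    print(f"LITTLE share (<= 500 MHz): {little_share(ACTIVE_MINUTES, 500):.1f}%")
```

Under the assumed 500 MHz cut-off the LITTLE share comes out slightly below the 88% quoted on the slide, so the presenter's exact mapping or rounding may differ.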
    15. 15. big.LITTLE Processing: Multiprocessing Capable, Many-Core Benefits
    16. 16. “big” Processor – Cortex-A15
        • ARM Cortex™-A15 Processor
            • 3.5+ DMIPS/MHz
            • 1-4 core MPCore™ configurable
        • Advanced Capabilities
            • Full ARMv7A architecture
            • Thumb®-2, TrustZone®, VFP, NEON™
            • Virtualization, large address extensions
            • AMBA® 4 ACE™ coherency
        • High Performance
            • Targeting 1.5GHz mobile implementation on 28nm
            • Hard Macro Quad-core Implementation @ 2GHz on 28HPM process
    17. 17. “LITTLE” Processor – Cortex-A7
        • ARM Cortex-A7 Processor
            • “LITTLE” to Cortex-A15 “big”
            • 1-4 core MPCore configurable
        • Same Architectural Capabilities
            • Full ARMv7A architecture
            • Thumb-2, TrustZone, VFP, NEON
            • Virtualization, large address extensions
            • AMBA 4 ACE Coherency
            • ISA identical to Cortex-A15 processor
        • High Performance
            • Up to 1.2GHz for mobile implementation on 28nm
    18. 18. Comparison of big.LITTLE Pipelines
    19. 19. Performance Comparison
    20. 20. Power Efficiency Comparison
    21. 21. Software Use Models
        • big.LITTLE Task Migration: one CPU active
            • Migrate between Cortex-A15 and Cortex-A7 depending on performance requirements
        • big.LITTLE MP: both CPUs can be active
            • Allocate threads that need high performance to Cortex-A15
            • Allocate threads that don’t require high performance to Cortex-A7 for best energy efficiency
            • AMBA 4 hardware coherency between Cortex-A15 and Cortex-A7
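The difference between the two use models on slide 21 can be illustrated with a toy placement policy. Everything below is hypothetical: the `demand` estimate, the `BIG_THRESHOLD` cut-off and both function names are invented for illustration and do not describe how any real big.LITTLE scheduler or switcher decides.

```python
from dataclasses import dataclass

BIG = "Cortex-A15"
LITTLE = "Cortex-A7"
BIG_THRESHOLD = 0.6  # hypothetical cut-off on the demand estimate


@dataclass
class Thread:
    name: str
    demand: float  # hypothetical load estimate between 0.0 and 1.0


def mp_placement(threads: list[Thread]) -> dict[str, str]:
    """big.LITTLE MP: each thread is placed on the core type it needs."""
    return {t.name: BIG if t.demand > BIG_THRESHOLD else LITTLE for t in threads}


def migration_placement(threads: list[Thread]) -> dict[str, str]:
    """Task migration: only one CPU type is active, so all threads share it."""
    needs_big = any(t.demand > BIG_THRESHOLD for t in threads)
    active = BIG if needs_big else LITTLE
    return {t.name: active for t in threads}


if __name__ == "__main__":
    workload = [Thread("game", 0.9), Thread("mail_sync", 0.1)]
    print("MP:       ", mp_placement(workload))
    print("Migration:", migration_placement(workload))
```

In this toy model a single demanding thread pulls every thread onto the big core under task migration, whereas the MP model leaves the light thread on the more energy-efficient core.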
    22. 22. Task Migration Mechanics
    23. 23. CCI-400 Cache Coherent Interconnect
        • AMBA 4 compliant, 128-bit single layer at up to ½ Cortex-A15 frequency
        • 2+3 (x3)
            • 2 full AMBA 4 ACE slave interfaces
            • +3 ACE-Lite I/O coherent slave interfaces
            • x3 master interfaces
        • CCI interfaces:
            • AMBA 4 ACE and ACE-Lite manage all coherency, shareability and barriers
        (Figure: system block diagram. Quad Cortex-A15 and Quad Cortex-A7 connect through ADB-400 bridges over 128b ACE links, and Mali-T604 Graphics, a Coherent I/O device, DMA and LCD connect through MMU-400 and NIC-400 (configurable AXI 4/AXI 3/AHB) over 128b ACE-Lite + DVM links, to the CoreLink™ CCI-400 Cache Coherent Interconnect (128 bit @ up to 0.5 Cortex-A15 frequency). 128b ACE-Lite master links lead to DMC-400 with PHYs for DDR3/2 and LPDDR2/3, and to a NIC-400 (configurable AXI 4/AXI 3/AHB/APB) serving other slaves. GIC-400 is also shown.)
    24. 24. Summary
        • Multiprocessing enables today’s applications to scale while maintaining single thread performance
            • Nicely addresses the multi-tasking of stacked usage scenarios
        • Many-core brings the energy advantages of simpler and smaller processors, but with the challenge of software complexity and a lack of backwards compatibility with respect to single thread performance
        • big.LITTLE processing, as delivered by the ARM Cortex-A15 and Cortex-A7, offers both the performance and compatibility advantages of multiprocessing along with the power efficiency and scalability advantages of many-core processing