Power Optimization Through Manycore Multiprocessing

John Goodacre, ARM




Usage Rights

© All Rights Reserved

  • The performance requirements of handsets and other mobile devices continue to grow exponentially, with new applications, advanced gaming, and traditional PC-type functionality migrating rapidly to these platforms. While this capability enables the next wave of the digital revolution, it comes at the price of increased power usage and potential thermal challenges. This presentation will investigate the issues and compromises traditionally required to push performance to the next level, and the challenges we face as an industry if we do not architecturally innovate on the implementation of advanced systems. We will demonstrate key advances in future processor designs and highlight the advantages and challenges faced as we look to deliver high performance in the low power world.
  • EXAMPLE: Digital camera sport mode (burst mode). Take a lot of pictures, then filter and JPEG-encode them on the go. Each picture is an independent work item and can be processed in parallel. Instead of processing the pictures one at a time, one after the other, you can process them in parallel. Execution finishes sooner, then you switch off cores and go to sleep: low leakage and no dynamic power consumption. ANOTHER EXAMPLE: Complex post-processing on a large RAW digital image. You can have more than one thread concurrently acting on the input data and writing to the output image (reads can overlap).
  • EXAMPLE: You have more than one application running at the same time. On a single core your multitasking OS will time-slice. On a multi-core system things will happen in parallel. They will execute in less time and be more responsive (e.g. in the UI).
  • EXAMPLE: VIDEO CODEC: This works because a video codec processes a stream. Within a single frame, and within a group of frames, there are all sorts of dependencies, BUT this is a stream, so while you are storing the result of an encoded frame you can already be calculating the maths of the following frames, sampling the next one, and so on. Each core can have a task allocated to it, and the code needs to be modified so that these tasks synchronise and communicate with each other. Distribute different functional blocks of the decoder across available processors. Multi-task pipeline, e.g. taskA -> taskB -> (multiple) taskC -> taskD. Split into defined functional threads. Uses passing of data blocks between threads to allocate work.
  • Start with a cheap package (high thermal resistance: 15C/W Thetajb, 30C/W Thetaja) and 60C Tjb (so we use Thetajb). This gives a limit of 1.5 to 2W with stacked memory (the memory's max Tj is 85C) and 3W without memories (a 20C advantage to play with, assuming a 105C max Tj for the SoC). NB: This is an issue we need to understand a lot better.
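The power limits in the note above appear to follow from dividing the available temperature headroom by the thermal resistance. The sketch below reproduces that arithmetic. It assumes the note's "60C Tjb" is the board-side reference temperature against which Thetajb is applied; that reading, and the function and variable names, are illustrative assumptions rather than anything stated in the slides.

```python
def sustained_power_limit(tj_max_c, t_ref_c, theta_c_per_w):
    """Maximum sustained power (W) that keeps the junction at or below tj_max_c."""
    return (tj_max_c - t_ref_c) / theta_c_per_w


THETA_JB = 15.0  # C/W, junction-to-board thermal resistance from the note
T_REF = 60.0     # C, the note's "60C Tjb", read here as the board reference (assumption)

# Stacked memory caps the junction at 85C; without it the SoC may reach 105C.
with_stacked_memory = sustained_power_limit(85.0, T_REF, THETA_JB)
without_memory = sustained_power_limit(105.0, T_REF, THETA_JB)

print(f"{with_stacked_memory:.2f} W, {without_memory:.2f} W")  # 1.67 W, 3.00 W
```

Under these assumptions the results land inside the quoted 1.5 to 2W range and exactly on the 3W figure.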
  • What is DVM? Why does the slide say 3 masters and 2 slaves? (It looks like the other way around.)

Presentation Transcript

  • Power Optimization Through Many-Core Multiprocessing
    Delivering High Performance in a Low Power World
    ChipEx2012
    Haydn Povey, Marketing Director, Implementation & Security, ARM Processor Division
    May 2, 2012
  • Billions of Connected Devices
    Performance expectations continue to increase exponentially but power efficiency and scalability are becoming formidable challenges
    Form Factor: 2015 TAM (millions)
      • Mobile Phones: 1,750
      • Media players: 300
      • Mobile Computers: 750
      • Desktop PCs: 150
      • Digital TV/STB: 500
      • Automotive Infotainment: 100
      • Other*: 450
      • Total: 4 billion
    *Includes PND, photo-frames, etc.
    Source: ABI Research, IDC, Gartner and ARM forecasts
  • Historic Technology Drivers
      • Up to 1980s, Mainframes/mini: Functionality
      • 1990s, The PC: Functionality, $
      • 2000s, Notebooks: Functionality, Power × $
      • 2010s, Mobile Computing: Functionality, Energy × $
  • Low Power Positioned for the Future
    Going forward, low power is necessary for everything from microcontrollers to servers
    Low power is a design philosophy
      • Mindset, style, culture and working practice
      • Not something you change or acquire easily
    Low power is a design reality
      • ARM is an efficient architecture
      • None of the legacy or CISC complexity
    Low cost is a design & manufacturing partnership
      • Time to volume, not time to niche markets
      • Speed-binning not good enough for mass-market
    2010s, Mobile Computing: Functionality, Energy × $
  • Limitations with Multiprocessing
    Cost of offering the peak single thread performance on each CPU quickly exceeds chassis thermal limits
    System and software bottlenecks limit overall scalability
    Single die integration offered some roadmap
  • Evolution to Many-Core
    Base theorem
      • Simpler and smaller processor designs require exponentially less energy to accomplish the same amount of compute as a more complex and larger processor design
    "Approximate rule of thumb"
      • To increase performance 50% you double the power and area cost of the processor design
      • Quickly reaches point of diminishing returns
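The rule of thumb on this slide can be made concrete with a few lines of arithmetic. The sketch below compares one progressively more complex core against a set of baseline cores filling the same power budget. The 1.5x and 2x factors come from the slide; the assumption that the small cores scale perfectly in parallel is an idealisation added here, and the function names are illustrative.

```python
PERF_GAIN_PER_STEP = 1.5   # from the slide: +50% performance ...
POWER_COST_PER_STEP = 2.0  # ... costs 2x the power and area


def big_core(steps):
    """Relative (performance, power) of one core after `steps` complexity doublings."""
    return PERF_GAIN_PER_STEP ** steps, POWER_COST_PER_STEP ** steps


def small_cores(power_budget):
    """Relative (performance, power) of baseline cores filling the same budget.

    Assumes perfect parallel scaling, which real software rarely achieves.
    """
    n = int(power_budget)
    return float(n), float(n)


for steps in range(4):
    perf, power = big_core(steps)
    many_perf, _ = small_cores(power)
    # columns: steps, power, single-core perf, perf per unit power, many-core perf
    print(steps, power, perf, perf / power, many_perf)
```

After three steps the single core delivers 3.375x the performance for 8x the power, so its performance per unit of power has fallen to about 0.42 of the baseline. That is the diminishing return the slide refers to, while the many-core figure is only an upper bound that depends on how well the software parallelises.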
  • Challenge of Many-Core
    Many-core definition
      • Use 'lots' of smaller, more efficient processors to achieve a higher aggregate performance than can be reached through multiprocessing
    Smaller processors are not capable of executing the same single thread as a higher performance processor in the same time, so they can't execute existing applications effectively
    Many threads cannot easily be decomposed into simpler, smaller tasks so as to benefit from multiprocessing on the smaller processors
    Software development challenge
  • Software Data Decomposition
    Each data item is independent
    Split large quantity of DATA into smaller chunks that can be operated on in parallel
    [Diagram: DATA split into TASKs, each TASK assigned to its own CPU]
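Data decomposition as described on this slide, and in the burst-mode camera example from the notes, can be sketched as follows. `process_picture` is a hypothetical stand-in for the real filter and JPEG step, and the tiny lists stand in for images. A thread pool is used to keep the sketch simple; in CPython, CPU-bound work would need `ProcessPoolExecutor` (which offers the same `map` interface) to actually run on several cores.

```python
from concurrent.futures import ThreadPoolExecutor


def process_picture(pixels):
    """Stand-in for filtering and JPEG encoding: here, just invert 8-bit values."""
    return [255 - p for p in pixels]


def process_burst(pictures, workers=4):
    """Process independent pictures concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_picture, pictures))


burst = [[i, i + 1, i + 2] for i in range(0, 40, 10)]  # four tiny fake pictures
results = process_burst(burst)
print(results[0])  # [255, 254, 253]
```

Because each picture is independent, no synchronisation is needed between workers; the only coordination is collecting the results in order.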
  • Software Task Decomposition
    Each task item is functionally independent
    Functionally independent tasks can be executed concurrently
    [Diagram: groups of TASKs distributed across multiple CPUs]
  • Functional Block Partitioning
    Functional blocks are serially dependent
      • But temporally independent
    Distribute different functional blocks across available processors
      • Split into defined functional threads
      • Uses passing of data blocks between threads to allocate work
    Requires code changes and fine tuning
    Example: Real Time Video Encoding
    [Diagram, simplified MPEG encoding functional blocks spread across CPU0 to CPU3 over TIME: Analogue Video Sampling; Remove Inter-Frame Redundancy; Motion Compensation; Remove Intra-Frame Redundancy; Quantise Samples; Run-Length Compress; Buffer Store]
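The partitioning described on this slide, where serially dependent blocks pass data blocks to each other while working on different frames at the same time, can be sketched as a queue-connected pipeline. This is a minimal illustration, not the encoder from the talk: the three lambda stages are placeholders for real functional blocks, and `stage`, `run_pipeline` and `STOP` are names invented here.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Run one functional block: take a data block, transform it, pass it on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(func(item))


def run_pipeline(frames, funcs):
    """Connect stages with queues, one thread per stage; return outputs in order."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for frame in frames:
        queues[0].put(frame)
    queues[0].put(STOP)
    out = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out


# Placeholder stages standing in for blocks such as sampling, quantise, compress
funcs = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
encoded = run_pipeline(range(5), funcs)
print(encoded)  # [-1, 1, 3, 5, 7]
```

While the last stage handles frame n, earlier stages can already be working on frames n+1 and n+2, which is the temporal independence the slide relies on. Balancing the stages so that no single block becomes the bottleneck is part of the fine tuning the slide mentions.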
  • Strategy Focus: The Thermal Wall
    SOC sustained power is limited in mobile devices by thermals:
      • 1.5W to 2W with low-cost POP and stacked memories
      • 3W without stacked memories
      • Responsiveness is a must
      • Complex active management is needed
    [Chart, Power against Time. Labels: Burst for responsiveness (e.g. browsing); "Opportunistic Residency"; T >= Tjmax, Tskin; Tj >= Tmax; Tj < Tmax; Un-managed Max Power (@Tjmax); Managed Sustained Power; Sustained performance (e.g. HD video record, gaming); Power Optimised Low End (e.g. e-mail, voice, MP3)]
  • Applying Nominal Use Case
    Typical Day for Smartphone User
      • 90 min voice calling
      • 60 min email / social networking
      • 30 min reading web
      • 50 min Angry Birds / other gaming
      • 90 min jogging while listening to music and logging GPS co-ordinates
      • 10 min video recording
      • 7 hrs sleep with music alarm clock
      • OS typically executing ~28 active processes
      • Apps synching in background
  • Use Case Measurements
  • Use Case Conclusion
    CPU State: Profiled CPU Minutes, % of Active
      • Deep Sleep: 1186, n/a
      • 200 MHz: 154, 60%
      • 500 MHz: 69, 27%
      • 800 MHz: 18, 7%
      • 1000 MHz: 4, 2%
      • 1200 MHz: 10, 4%
    If the phone was ARM big.LITTLE™ enabled...
      • Active CPU time: 12% big, 88% LITTLE
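The percentages in this table can be checked directly from the profiled minutes. The sketch below recomputes each state's share of active time and then estimates a big/LITTLE split. The cut-off `LITTLE_MAX_MHZ`, which treats the 200 MHz and 500 MHz states as LITTLE work, is an assumption made here for illustration; the slides do not say how the 12%/88% split was derived.

```python
# Profiled minutes per active CPU state (MHz -> minutes), from the slide
minutes = {200: 154, 500: 69, 800: 18, 1000: 4, 1200: 10}
deep_sleep = 1186

active = sum(minutes.values())
share = {mhz: round(100 * m / active) for mhz, m in minutes.items()}

LITTLE_MAX_MHZ = 500  # assumption: states at or below this would run on LITTLE
little = sum(m for mhz, m in minutes.items() if mhz <= LITTLE_MAX_MHZ)
little_pct = 100 * little / active

print(active, active + deep_sleep)  # 255 1441
print(share)                        # {200: 60, 500: 27, 800: 7, 1000: 2, 1200: 4}
print(round(little_pct, 1))         # 87.5
```

The per-state shares match the slide, and the total of 1441 minutes is about one day. With the assumed cut-off the LITTLE share comes out near 87.5%, close to but not identical to the slide's 88%.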
  • big.LITTLE Processing
    Multiprocessing capable
    Many-core benefits
  • "big" Processor: Cortex-A15
    ARM Cortex™-A15 Processor
      • 3.5+ DMIPS/MHz
      • 1-4 core MPCore™ configurable
    Advanced Capabilities
      • Full ARMv7A architecture
      • Thumb®-2, TrustZone®, VFP, NEON™
      • Virtualization, large address extensions
      • AMBA® 4 ACE™ coherency
    High Performance
      • Targeting 1.5GHz mobile implementation on 28nm
      • Hard Macro Quad-core Implementation @ 2GHz on 28HPM process
  • "LITTLE" Processor: Cortex-A7
    ARM Cortex-A7 Processor
      • "LITTLE" to Cortex-A15 "big"
      • 1-4 core MPCore configurable
    Same Architectural Capabilities
      • Full ARMv7A architecture
      • Thumb-2, TrustZone, VFP, NEON
      • Virtualization, large address extensions
      • AMBA 4 ACE Coherency
      • ISA identical to Cortex-A15 processor
    High Performance
      • Up to 1.2GHz for mobile implementation on 28nm
  • Comparison of big.LITTLE Pipelines
  • Performance Comparison
  • Power Efficiency Comparison
  • Software Use Models
    big.LITTLE Task Migration: one CPU active
      • Migrate between Cortex-A15 and Cortex-A7 depending on performance requirements
    big.LITTLE MP: both CPUs can be active
      • Allocate threads that need high performance to Cortex-A15
      • Allocate threads that don't require high performance to Cortex-A7 for best energy efficiency
      • AMBA 4 hardware coherency between Cortex-A15 and Cortex-A7
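The task migration model, in which only one CPU cluster is active and work moves between Cortex-A15 and Cortex-A7 according to performance demand, can be illustrated with a toy policy. Everything specific below is hypothetical: the thresholds, the single normalised load figure, and the function names are invented for the sketch and do not describe ARM's actual switching software.

```python
UP_THRESHOLD = 0.85    # hypothetical: move to "big" above this load
DOWN_THRESHOLD = 0.30  # hypothetical: move back to "LITTLE" below this load


def next_cluster(current, load):
    """Return 'big' or 'LITTLE' for the next interval, given load in [0, 1]."""
    if current == "LITTLE" and load > UP_THRESHOLD:
        return "big"
    if current == "big" and load < DOWN_THRESHOLD:
        return "LITTLE"
    return current  # between the thresholds, stay put to avoid ping-ponging


def simulate(loads, start="LITTLE"):
    """Apply the policy to a sequence of load samples and return the cluster trace."""
    trace, cluster = [], start
    for load in loads:
        cluster = next_cluster(cluster, load)
        trace.append(cluster)
    return trace


trace = simulate([0.2, 0.9, 0.5, 0.4, 0.1, 0.6])
print(trace)  # ['LITTLE', 'big', 'big', 'big', 'LITTLE', 'LITTLE']
```

The gap between the two thresholds provides hysteresis, so a load hovering around a single cut-off does not trigger repeated migrations. In the big.LITTLE MP model the equivalent decision would be made per thread rather than for the whole system.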
  • Task Migration Mechanics
  • CCI-400 Cache Coherent Interconnect
    AMBA 4 compliant, 128-bit single layer at up to ½ Cortex-A15 frequency
      • 2 full AMBA 4 ACE slave interfaces
      • +3 ACE-Lite I/O coherent slave interfaces
      • x3 master interfaces
    CCI interfaces:
      • AMBA 4 ACE and ACE-Lite manage all coherency, shareability and barriers
    [Block diagram labels: GIC-400; Quad Cortex-A15; Quad Cortex-A7; Mali-T604 Graphics; Coherent I/O device; DMA; LCD; CCI-400 2+3 (x3); ADB-400; MMU-400; NIC-400 (configurable AXI 4/AXI 3/AHB/APB); CoreLink™ CCI-400 Cache Coherent Interconnect, 128 bit @ up to 0.5 Cortex-A15 frequency; 128b ports of type ACE, ACE-Lite, ACE-Lite + DVM and AXI 4; DMC-400; PHY; DDR3/2; LPDDR2/3; Other Slaves]
  • Summary Multiprocessing enables the scaling of today’s application to grow while maintaining single thread performance  Addresses nicely the multi-tasking of stacked usage scenarios Many-core brings the energy advantages of simpler and smaller processor but with the challenge of software complexity and lack of backwards compatibility with respect to single thread performance The big.LITTLE processing as delivered by the ARM Cortex- A15 and Cortex-A7 offers both the performance and compatibility advantages of Multiprocessing along with the power efficiency and scalability advantages of many-core processing May 2, 2012 24