Energy Efficient Computing - 26mar13


(See: http://youtu.be/9rP-5TSk_dA)
Electronic Systems support every aspect of our lives today, both visibly and invisibly. Numbering in the tens of billions, they are the dominant form of computing we now experience. And whilst many dissipate just milliwatts, their sheer volume makes them a significant consumer of energy in their own right. Energy Efficiency in Computing has moved from the mainframe to become a consumer issue.
## By Ian Phillips http://ianp24.blogspot.co.uk/
## Opinions expressed are my own
(Moved from SlideShare 10mar14 with 1064 views)


  1. 1. Energy Efficient Computing - Through a 21c Looking Glass. Abstract: With the assistance of its global partners, ARM shipped 8.7 billion CPUs in 2012; a number which continues to grow at around 20% pa. The 40B shipped to date outnumber the total of PCs by more than 50 times; and today more than 75% of the things connected to the Internet are ARM based. The dominant nature of Computing in the 21c is very different to that of the Mainframe era. It is sobering to think that if each of those 8.7B CPUs were to dissipate just 100mW, then it would require the output of two modern power stations to drive them; with 2.4 next year, and 3 the year after that! So Electronic Systems are also defining where the real Energy Efficient Computing issue is! But with such a small footprint, surely it must be easy to measure and manage power optimisation? Not so: an increasing percentage of these are immensely complex systems, running significant multi-tasking and multi-threaded operating systems on platforms which include multi-processor CPU/GPU configurations, and GB of memory. Whilst their minimum dissipations are a few uW, their peak power exceeds the silicon's ability to dissipate it; so the penalty for power-unaware software design is huge. What has been done to manage this in Electronic Systems design, and what lessons can be transferred to the Classic Computing domain? Context: 40min Keynote at the Energy Efficient Computing Workshop at the University of Bristol, UK, 26mar13. By: The TSB’s Energy Efficient Computing SIG (EEC-SIG) and the UoBristol Energy Aware COmputing (EACO) initiative. https://connect.innovateuk.org/web/eec ..and.. http://www.Cs.bris.ac.uk/Research/Micro/eaco.jsp SlideCast and pdf available via http://ianp24.blogspot.co.uk/
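The arithmetic in the abstract can be checked with a few lines. This is only a minimal sketch: the shipment count, the 100mW figure, the ~20% growth and the "two power stations" come from the abstract, while the function names are illustrative and the station size is simply whatever those figures imply, not a quoted plant rating.

```python
# Figures quoted in the abstract.
CPUS_SHIPPED_2012 = 8.7e9   # CPUs shipped in 2012
WATTS_PER_CPU = 0.1         # the "what if" dissipation of 100 mW each
GROWTH_PER_YEAR = 0.20      # ~20% pa growth in shipments
STATIONS_2012 = 2.0         # "two modern power stations"


def fleet_power_watts(cpus, watts_per_cpu=WATTS_PER_CPU):
    """Total power if every CPU dissipates watts_per_cpu."""
    return cpus * watts_per_cpu


def grow(value, years, growth=GROWTH_PER_YEAR):
    """Compound a value forward by a number of years."""
    return value * (1.0 + growth) ** years


total_2012_w = fleet_power_watts(CPUS_SHIPPED_2012)
# Assumption: station size is inferred from the abstract, not a real plant.
implied_station_w = total_2012_w / STATIONS_2012

for years_ahead in range(3):
    power_w = fleet_power_watts(grow(CPUS_SHIPPED_2012, years_ahead))
    stations = power_w / implied_station_w
    print(f"2012+{years_ahead}: {power_w / 1e6:.0f} MW, {stations:.2f} stations")
```

Under these assumptions the 2012 shipments alone come to about 870MW, and compounding at 20% gives 2.4 and then roughly 2.9 stations, which matches the "2.4 next year, and 3 the year after" in the abstract.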
  2. 2. 1v1 Prof. Ian Phillips, Principal Staff Eng’r, ARM Ltd. ian.phillips@arm.com Visiting Prof. at ... Contribution to Industry Award 2008. Energy Efficient Computing Workshop, Uo.Bristol, 26mar13
  3. 3. Our 21c World ...
  4. 4. Electronic Systems are Everywhere ...
  5. 5. Electronic Systems are Everywhere ... Bringing Embedded Intelligence to the Consumer Market has changed the face of Computing!
  6. 6. Electronic Systems Will Create Our Future: ‘Old Drivers’ don’t go away ... they don’t dominate any longer. (Source: Adapted from Morgan Stanley, Nov 2009) Today: ~2% of our Energy Use goes on Computing and Electronics! [1] ... Tomorrow: It could easily be 20%! [1] National Academy of Sciences
  7. 7. ARM in the Digital World: 40+ billion CPUs to date; 150+ billion CPUs cumulative by 2020. 8.7B CPUs shipped in 2012 (growing 20% pa). 75% of the things connected to the Internet today are ARM Powered! (Gartner) [Chart: cumulative CPU shipments on a timeline marked 1998, 2012, 2020] http://www.arm.com/
  8. 8. Moore’s Law ... Gordon Moore, co-founder of Intel (1965). [Chart: Transistors/Chip (M) and Transistor/PM (K) against Approximate Process Geometry, from 100um down to 10nm; ITRS’99] ... x More Functionality on a Si Chip in 20 yrs! http://en.wikipedia.org/wiki/Moore’s_law
  9. 9. Is HPC The Pinnacle of Computing?
  10. 10. ... Or the Cloud?
  11. 11. ... Or the iGadget?
  12. 12. A Machine for Computing ... Computing: A general term for algebraic manipulation of data ... Numerated Phenomena IN (x) -> y=F(x,t,s) -> Processed Data/Information OUT (y) ... State and Time are normally factors in this. It can include phenomena ranging from human thinking to calculations with a narrower meaning. It is usually used to exercise analogies (models) of real-world situations; frequently in real-time (fast enough to be a stabilising factor in a loop). (Wikipedia) ... So what part do Hardware and Software play? ... And what about Energy?
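As an illustration of y=F(x,t,s), here is a minimal sketch of a "machine" whose output depends on the input x, the sample time t and remembered state s. The smoothing filter, the `tau` time constant and the state layout are assumptions chosen for illustration; the slide does not prescribe any particular F.

```python
def F(x, t, s, tau=1.0):
    """y = F(x, t, s): smooth input x using time t and state s.

    s is None on the first call, otherwise the tuple (last_t, last_y).
    Returns (y, new_state) so the caller carries the state forward.
    """
    if s is None:
        # No history yet: the output simply follows the input.
        return x, (t, x)
    last_t, last_y = s
    dt = t - last_t
    # The longer since the last sample, the more weight the new input gets.
    alpha = dt / (tau + dt)
    y = last_y + alpha * (x - last_y)
    return y, (t, y)


# Feed a step from 10.0 down to 0.0, sampled once per second.
state = None
outputs = []
for t, x in [(0.0, 10.0), (1.0, 0.0), (2.0, 0.0)]:
    y, state = F(x, t, state)
    outputs.append(y)
print(outputs)
```

The same input value (0.0) produces different outputs at t=1 and t=2, which is the point of the slide: state and time are part of the computation, whatever technology implements F.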
  13. 13. Antikythera c87BC ... Planet Motion Computer Mechanical Technology • Inventor: Hipparchos (c.190 BC – c.120 BC). • Ancient Greek Astronomer, Philosopher and Mathematician. Single-Task, Continuous Time, Analogue Mechanical Computing (With backlash!) See: http://www.youtube.com/watch?v=L1CuR29OajI
  14. 14. Orrery c1700 ... Planet Motion Computer Mechanical Technology • Inventor: George Graham (1674-1751). English Clock-Maker. • Single-Task, Continuous Time, Analogue Mechanical Computing (With backlash!)
  15. 15. Babbage's Difference Engine 1837, Mechanical Technology ((Re)construction c2000). Computer for Calculating Tables: A Basic ALU Engine. The difference engine consists of a number of columns, numbered from 1 to N. Each column is able to store one decimal number. The only operation the engine can do is add the value of column n + 1 to column n to produce the new value of n. Column N can only store a constant; column 1 displays (and possibly prints) the value of the calculation on the current iteration.
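The column description above maps directly onto the method of finite differences. The following is a minimal sketch of that idea, not a model of the real mechanism: the function name and the simple bottom-to-top ordering of the additions are assumptions for illustration.

```python
def difference_engine(columns, steps):
    """Tabulate a polynomial using repeated addition only.

    columns[0] is column 1 (the displayed result); columns[-1] is column N,
    which holds a constant difference and is never updated.
    """
    cols = list(columns)
    results = [cols[0]]
    for _ in range(steps):
        # Add column n+1 into column n, working from column 1 upward so each
        # addition uses its neighbour's value from the previous iteration.
        for n in range(len(cols) - 1):
            cols[n] += cols[n + 1]
        results.append(cols[0])
    return results


# f(x) = x^2: start value f(0) = 0, first difference 1, constant second difference 2.
squares = difference_engine([0, 1, 2], steps=4)
print(squares)
```

Seeding the columns with f(0) and its initial differences is the only "programming" required; everything after that is addition, which is why the slide calls it a basic ALU engine.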
  16. 16. “Enigma” c1940 Mechanical Technology Data Encryption/Decryption Computer
  17. 17. “Colossus” 1944 Valve/Mechanical Technology Code-Breaking Computer: A Data Processor
  18. 18. “Baby” 1948 (Reconstruction) Valve/Software Technology General Purpose, Quantised Time and Data, (Digital) Electronic Computing
  19. 19. The Analogue Computer: BTH Crystal Set c1925 (1 Diode); Tele-Verta Radio c1945 (4 Valves, 1 Rectifier Valve); Bush Radio c1960 (7 Transistors, 1 Diode); Evoke DAB Radio c2005 (100 M Transistors, 2-3 Embedded Processors)
  20. 20. Radio as Computation ... Valve, Transistor or ‘Integrated Circuit’ Technology. Signal chain Vi -> Vrf -> Vif -> Vro, where: Vrf=Vi*100; Vlo=Cos(t*1^6); Vif=Vrf*Vlo; Vro='Bandpass'(Vif*1000). Single-Task (Embedded), Real-Time, Analogue (Close-Enough) Computing
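The mixer stage, Vif=Vrf*Vlo, is where the "computation" in a radio is easiest to see: multiplying two cosines yields components at the sum and difference frequencies. The sketch below checks that numerically. The RF gain of 100 is from the slide; the carrier frequency, local-oscillator frequency and input amplitude are hypothetical values, and the band-pass stage is deliberately left out.

```python
import math

RF_GAIN = 100.0       # from the slide: Vrf = Vi*100
SIGNAL_HZ = 1.1e6     # hypothetical carrier frequency
LO_HZ = 1.0e6         # hypothetical local-oscillator frequency
AMPLITUDE_V = 1e-3    # hypothetical antenna signal amplitude


def v_i(t):
    """Antenna input: a bare carrier (no modulation in this sketch)."""
    return AMPLITUDE_V * math.cos(2 * math.pi * SIGNAL_HZ * t)


def v_rf(t):
    """RF amplifier stage."""
    return RF_GAIN * v_i(t)


def v_lo(t):
    """Local oscillator."""
    return math.cos(2 * math.pi * LO_HZ * t)


def v_if(t):
    """Mixer: a plain multiplication of the two signals."""
    return v_rf(t) * v_lo(t)


def v_if_sum_and_difference(t):
    """The same signal written as difference- and sum-frequency components."""
    diff = 2 * math.pi * (SIGNAL_HZ - LO_HZ) * t
    summ = 2 * math.pi * (SIGNAL_HZ + LO_HZ) * t
    return 0.5 * RF_GAIN * AMPLITUDE_V * (math.cos(diff) + math.cos(summ))


# Compare the two forms over 1000 samples spaced 10 ns apart.
worst_error = max(
    abs(v_if(k * 1e-8) - v_if_sum_and_difference(k * 1e-8)) for k in range(1000)
)
print(f"worst mismatch: {worst_error:.2e} V")
```

A band-pass filter centred on the difference frequency would then keep only the wanted component, which is the role of the 'Bandpass' stage on the slide.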
  21. 21. The Pinnacle is Era and Application Related ... Computing: is just Creating Output from Input ... Architecture: is the way this is done on the day. It is the Most Important Product Decision! (HW, SW, Analogue, Optics, Graphene, Mechanics, Steam, etc)
  22. 22. Computation in a Cool iCon ...
  23. 23. A lot of Cool Stuff in a Smart Phone ... ... Computation in many forms
  24. 24. Take a Look Inside... Level-1: Modules The Control Board. http://www.ifixit.com
  25. 25. Inside The Control Board (a-side) Level-2: Sub-Assemblies. Visible Computing Contributors ... Samsung: Flash Memory - NV-MOS (ARM Partner); Cirrus Logic: Audio Codec - Bi-CMOS (ARM Partner); AKM: Magnetic Sensor - MEM-CMOS; Texas Instruments: Touch Screen Controller and mobile DDR - Analogue-CMOS (ARM Partner); RF Filters - SAW Filter Technology. Invisible Computing Contributors ... OS, Drivers, Stacks, Applications, GSM, Security, Graphics, Video, Sound, etc; Software Tools, Debug Tools, etc. http://www.ifixit.com
  26. 26. Inside The Control Board (b-side) Level-2: Sub-Assemblies. More Visible Computing Contributors ... A4 Processor (Spec: Apple; Design & Mfr: Samsung) - Digital-CMOS (nm) ... Provides the iPhone 4 with its GP computing power (said to contain ARM A8 600 MHz CPU and other ARM IP); ST-Micro: 3 axis Gyroscope - MEM-CMOS (ARM Partner); Broadcom: Wi-Fi, Bluetooth, and GPS - Analogue-CMOS (ARM Partner); Skyworks: GSM - Analogue-Bipolar; Triquint: GSM PA - Analogue-GaAs; Infineon: GSM Transceiver - Analogue/Digital-CMOS (ARM Partner). Other labels: GPS; Bluetooth, EDR & FM. http://www.ifixit.com
  27. 27. Level-3: Processor (Nvidia Tegra 3, around 1B transistors). NB: The Tegra 3 is similar to the A4/5, but is not used in the iPhone
  28. 28. Architecting your Product  : Is the cumulative non-functional choices made to support the functional need  A Good Architecture is the one that ‘survives’  History is written by the winners (2nd is for losers) : Component Performance may be ‘poor’ as long as System Performance is ‘better’ for its use.  Architectural Options ... : Business Model (Cost-of-Ownership, ROI), TTM (Productivity, History, IP Availability, Know-How), Aesthetics (Power, Quality, Behaviour, Appearance)  : Analogue, Digital, Mechanical, Optical, RF, Software, Plastics, Metal-forming, Manufacturing, Glass, ...  : More than 99% of a Product is Reused from its Predecessor  ... is assumed (working is expected!) ... It used to be the only consideration!
  29. 29. Power Philosophy. Hardware Dissipates The Power ... Choose the underlying technology for best power efficiency; one size does not fit all (Products, Applications or Instances). ... But Software Tells It To! Chips can melt down under software ‘instruction’. Make computing hardware power as ‘Activity’ dependent as possible: Zero Activity => Zero Power. Make OS/Apps aware of the power/performance situation, and their options for controlling it (indicators and levers). Avoid Moving Data: it is becoming the dominant energy consumption in a system; Energy ∝ DataVolume x Speed x Distance^(>2(3)); bring the processing to the data ... Think System: It’s how the ‘box’ performs, not the components
  30. 30. All ARM Processors are Power Efficient
  31. 31. Choose The Horses for The Course: About 50MTr; About 50KTr ... Delivering ~5x speed (Architecture + Process + Clock)
  32. 32. Parallel is More Efficient. One processor running at f: Capacitance = C, Voltage = V, Frequency = f, Power = CV^2f. Two processors each running at f/2 between the same Input and Output: Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f, Power = 0.4CV^2f. ... The limit determined by Amdahl’s or Gustafson’s Law
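The slide's figures can be reproduced from P = CV^2f. The sketch below plugs in the slide's own scaling factors (2.2C, 0.6V, 0.5f) and adds the textbook form of Amdahl's Law as a reminder of the limit; the function names and the 90% parallel fraction in the example are illustrative assumptions.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic CMOS power, P = C * V^2 * f (arbitrary normalised units)."""
    return capacitance * voltage ** 2 * frequency


def amdahl_speedup(parallel_fraction, processors):
    """Amdahl's Law: speedup when only part of the work can be parallelised."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# One processor at full voltage and frequency (normalised to 1).
single = dynamic_power(1.0, 1.0, 1.0)
# Two processors at half frequency: total capacitance 2.2C, supply 0.6V.
parallel = dynamic_power(2.2, 0.6, 0.5)

print(f"parallel / single power: {parallel / single:.3f}")
# Hypothetical workload that is 90% parallelisable, on two processors.
print(f"Amdahl speedup (90% parallel, 2 cores): {amdahl_speedup(0.9, 2):.2f}")
```

The ratio comes out at 0.396, the slide's "0.4CV^2f". The saving depends on the work actually splitting across both processors, which is where Amdahl's (or Gustafson's) Law sets the limit.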
  33. 33. Multicore ARM On-Chip ... Heterogeneous Multicore Systems have been in ARM for a long time: Application (Cortex™-A8); UI & 3D graphics (Mali™-400 MP); Power Manager (Cortex-M3); joined by Interconnect and Memory
  34. 34. Coherent Multicore Cluster ... Homogeneous Multicore cluster, as part of a heterogeneous system: Cortex-A9 … Cortex-A9 with Coherency Logic; User Interface and 3D graphics (Mali-400 MP); Power Manager (Cortex-M3); Interconnect
  35. 35. Multiple Clusters ... Multiple Homogeneous Coherent Clusters: each cluster is Cortex-A15 … Cortex-A15 with Coherency Logic in L2 Cache; the clusters are joined by a Coherent Interconnect
  36. 36. Computer On a Chip c2010 ... Today’s Consumer requires a pocket ‘Super-Computer’ ... Silicon Technology provides a Billion transistors ... It will be supported with a few GB of memory ... Typically 10 Processors ... 4 x A9 Processors (2x2); 4 x MALI 400 Frag. Proc; 1 x MALI 400 Vertex Proc; 1 x MALI Video CoDec; Software Stacks, OS’s and Design Tools. ARM Technology gives chip/system designers ... Improved Productivity; Improved TTM; Improved Quality/Certainty. http://www.arm.com/
  37. 37. CoreLink™ CCN-504 and DMC-520 [Block diagram]: Up to 4 coherent clusters with up to 4 cores per cluster (Quad Cortex-A15, each with L2 cache, attached via ACE); heterogeneous processors: CPU, GPU, DSP and accelerators; virtualized interrupts (Interrupt Control); CoreLink™ CCN-504 Cache Coherent Network with Snoop Filter and integrated 8-16MB L3 cache; CoreLink™ DMC-520 dual channel DDR3/4 x72 memory controllers (PHY, x72 DDR4-3200); NIC-400 Network Interconnect; IO Virtualisation with System MMU; I/O blocks include DSP, PCIe, 10-40 GbE, DPI, Crypto, USB, SATA, AHB, Flash and GPIO; uniform System memory and Peripheral address space; up to 18 AMBA interfaces for I/O coherent accelerators and IO
  38. 38. Methodology As Well As Hardware  C/C++  Debug & Trace Development Energy Trace Modules  Middleware
  39. 39. Power Management. For Single-Processor systems, and Peripheral Circuitry... Variable/Gated clock domains; Variable/Switched power domains. Maximises power efficiency by ... minimising voltage/frequency (P=CV^2f) so that the processor has just enough performance for the current application need; being controlled by the OS and the Application SW; maximising ‘Activity Power’ dependence. Apply on/off-chip zones ... Methodology: Retention Flops/Latches, Level Shifters, Power-Switch Cells, PLLs
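"Just enough performance for the current application need" is the core of dynamic voltage and frequency scaling (DVFS). A minimal sketch of that selection follows, assuming a hypothetical table of operating points; the frequencies, voltages and names are invented for illustration and do not describe any real ARM device or OS governor.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    freq_mhz: float
    volts: float


# Hypothetical operating points, sorted by frequency (illustrative values only).
OPP_TABLE = [
    OperatingPoint(200, 0.80),
    OperatingPoint(600, 0.90),
    OperatingPoint(1000, 1.10),
    OperatingPoint(1400, 1.25),
]


def relative_power(opp, capacitance=1.0):
    """P = C * V^2 * f, in arbitrary units, for comparing operating points."""
    return capacitance * opp.volts ** 2 * opp.freq_mhz


def select_opp(required_mhz, table=OPP_TABLE):
    """Pick the slowest point that still meets demand; cap at the fastest."""
    for opp in table:
        if opp.freq_mhz >= required_mhz:
            return opp
    return table[-1]


chosen = select_opp(500)
saving = 1.0 - relative_power(chosen) / relative_power(OPP_TABLE[-1])
print(f"need 500 MHz -> run at {chosen.freq_mhz:.0f} MHz / {chosen.volts} V, "
      f"{saving:.0%} less power than the top point")
```

Because voltage enters as a square, dropping to a lower operating point saves proportionally more power than the loss in frequency, which is why the OS and application need the "levers" to request it.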
  40. 40. big.LITTLE Processing. For High-Performance systems... Tightly coupled combination of two ARM CPU clusters: Cortex-A15 and Cortex-A7, functionally identical; same programmer’s view, looks the same to OS and applications. big.LITTLE combines high-performance and low power: automatically selects the right processor for the right job; redefines the efficiency/performance trade-off. [Chart: current smartphone vs big.LITTLE] big, “Demanding tasks”: >2x Performance; LITTLE, “Always on, always connected tasks”: 30% of the Power (select use cases)
  41. 41. Fine-Tuned to Different Performance Points. LITTLE (Cortex-A7, Cortex-A53): Most energy-efficient applications processor from ARM; simple, in-order, 8 stage pipelines; performance better than mainstream, high-volume smartphones (Cortex-A8 and Cortex-A9). big (Cortex-A15, Cortex-A57): Highest performance in mobile power envelope; complex, out-of-order, multi-issue pipelines; up to 2x the performance of today’s high-end smartphones. [Pipeline diagrams: Queue, Issue, Integer]
  42. 42. big.LITTLE Software. CPU Migration: migrate a single processor workload to the appropriate CPU; Migration = save context then resume on another core; also known as Linaro “In Kernel Switcher”; DVFS driver modifications and kernel modifications; based on standard power management routines; small modification to OS and DVFS, ~600 lines of code. big.LITTLE MP: OS scheduler moves threads/tasks to appropriate CPU; based on CPU workload; based on dynamic thread performance requirements; enables highest peak performance by using all cores at once
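The CPU-migration idea, moving a workload between the LITTLE and big clusters according to load, can be sketched as a simple policy with hysteresis. This is not the Linaro In Kernel Switcher or any real scheduler code: the thresholds, names and load figures are hypothetical, chosen only to show the shape of the decision.

```python
# Hypothetical thresholds on CPU load (0.0 to 1.0); not taken from any real governor.
UP_THRESHOLD = 0.80     # above this, LITTLE is struggling: migrate to big
DOWN_THRESHOLD = 0.30   # below this, big is wasted: migrate back to LITTLE


def choose_cluster(current, load):
    """Decide which cluster should run the workload next.

    The gap between the two thresholds gives hysteresis, so the workload
    does not bounce between clusters on every small change in load.
    """
    if current == "LITTLE" and load > UP_THRESHOLD:
        return "big"
    if current == "big" and load < DOWN_THRESHOLD:
        return "LITTLE"
    return current


def run(loads, start="LITTLE"):
    """Replay a sequence of load samples and record the cluster used for each."""
    cluster = start
    history = []
    for load in loads:
        cluster = choose_cluster(cluster, load)
        history.append(cluster)
    return history


print(run([0.1, 0.9, 0.5, 0.2]))
```

In a real system each switch would also involve saving context and resuming on the other core, as the slide notes; the sketch covers only the policy decision.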
  43. 43. Bringing the Processing to the Data … Press Claims: Dell + Marvell, Copper BaiDu + Marvell, Baserock  288 server nodes in a 4U rack space Public Source: http://www.engadget.com/2011/11/02/hp-and-calxedas-moonshot-arm-servers-will-bring-all-the-boys-to/
  44. 44. ... Refining Data into Information
  45. 45. Transferable Lessons to GP Software. Moving data is Power Expensive ... Don’t move data; use it locally (Cache it). Refine it once, use it often (Pre-Process it). Your CPU Power is work-load independent ... So, get in; get the work done; and get out. Maximise the workload of your code; terminate when complete. Make your Processing work-load dependent: use a Hypervisor and turn off (or at least free) processors not in use.
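The "get in; get the work done; and get out" advice follows from energy being power multiplied by time: if a CPU draws much the same power whenever it is on, the only way to save energy is to be on for less time. A minimal sketch follows, with entirely hypothetical power and timing figures.

```python
def energy_joules(active_watts, active_seconds, idle_watts=0.0, idle_seconds=0.0):
    """Energy = power x time, summed over the active and idle periods."""
    return active_watts * active_seconds + idle_watts * idle_seconds


# Hypothetical figures for one 10-second window of work.
ACTIVE_W = 2.0   # power while the core is on, whatever its utilisation
OFF_W = 0.0      # power once the core has been released / switched off

# Dense code: finishes in 4 s, then the core is turned off for the other 6 s.
race_to_idle = energy_joules(ACTIVE_W, 4.0, OFF_W, 6.0)
# Sparse code: the same work dribbles out over the whole 10 s, core never released.
spread_out = energy_joules(ACTIVE_W, 10.0)

print(f"race to idle: {race_to_idle} J, spread out: {spread_out} J")
```

With these assumed figures the dense version uses 8J against 20J. The saving only materialises if the processor really is freed or switched off afterwards, which is the role the slide gives to the hypervisor.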
  46. 46. Society’s Challenges in the 21c: Urbanisation (Smart Cities); Health (eHealth); Transport; Energy (Smart Grid); Security; Environment; Food/Water; Ageing Society; Sustainability; Digital Inclusion; Economics. And whilst our technologies will be an essential part of all solutions, they cannot fix them without Society’s help and cooperation! ... Energy Efficient Computing will help but not avert the Energy (or other) challenges! Having a great time!
  47. 47. Conclusions. Putting the power of Computation into the hands of the masses has changed the face of Computing (again). Electronic Systems will become Ubiquitous in our Lives and Economy. Power Efficient ES is a major issue to Society, which faces a future with it as a significant energy consumer. Power Efficiency must be architected into the System Hardware and Software from the beginning: to realise the maximum potential out of your Silicon (avoiding Dark Si); architect & design HW as efficiently as possible (reflecting the task); strive for: No Work => No Power; equip HW with Indicators and Levers so the System/App can manage it. Bring Processing to the Data ... Don’t move Data; move Information. Process data locally. Energy ∝ DataVolume x Speed x Distance^(>2(3))
  48. 48. Computing the 21c … Enabling the Creation of High-Performance Electronic Systems ... Productively, Economically and Reliably through Hw & Sw Reuse Methodologies based on a family of CPU/GPU cores
