Energy Efficient Computing in the 21c

The early 21c has brought the power of the computer into the hands of the general population. Though each of these computers consumes only a small amount of energy, they are so numerous that their Energy Efficiency will soon become a major issue. This presentation looks at modern Computing, the ways that Energy Efficiency is currently being enhanced, and the principles behind this.

Published in: Technology, Business

  1. 1. Energy Efficient Computing ... In the early 21C. Abstract (opinions expressed are those of the author alone): With the assistance of its global partners, ARM shipped 8.7 billion CPUs in 2012, a number which continues to grow at around 20% pa. The 40B we have shipped to date outnumber the total of PCs more than 50 times, and today more than 75% of the things connected to the Internet are ARM based. The dominant nature of Computing in the 21c is very different to that of the Mainframe era. It is sobering to think that if each of those 8.7B CPUs were to dissipate just 100 mW, it would require the output of two modern power stations to drive them; rising to 2.4 next year, and 3 the year after that! So Electronic Systems are also defining where the real Energy Efficient Computing issue is! But with such a small footprint it must be easy to measure and manage power optimisation? An increasing percentage of these are immensely complex systems, running significant multi-tasking and multi-threaded operating systems on platforms which include multi-processor CPU/GPU configurations and GB of memory. Whilst their minimum dissipations are a few µW, their peak power exceeds the silicon's ability to dissipate it, so the penalty for power-unaware software design is huge. What has been done to manage this in Electronic Systems design, and can any lessons be transferred to the Classic Computing domains? Context: 1hr talk at The Centre for Robotics and Neural Systems (CRNS) at University of Plymouth, Devon, UK. The CRNS has a regular seminar series inviting national and international speakers. http://www.tech.plym.ac.uk/SOCCE/CRNS/ SlideCast and pdf available via http://ianp24.blogspot.co.uk/
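The fleet-power arithmetic in the abstract (slide 1) can be checked in a few lines. This is a minimal sketch, assuming the abstract's own figures (8.7B CPUs, 100 mW each, 20% annual growth). The per-station output is not stated in the talk; it is inferred here from the "two modern power stations" claim, and the function names are illustrative only.

```python
# Back-of-envelope check of the abstract's fleet-power claim.
CPUS_SHIPPED_2012 = 8.7e9   # from the abstract
POWER_PER_CPU_W = 0.1       # 100 mW, the abstract's "what if" figure
GROWTH_PER_YEAR = 0.20      # around 20% pa, from the abstract


def fleet_power_watts(cpus, watts_each=POWER_PER_CPU_W):
    """Total dissipation if every CPU draws watts_each."""
    return cpus * watts_each


def stations_by_year(years, stations_now=2.0, growth=GROWTH_PER_YEAR):
    """Power stations needed in each successive year under compound growth."""
    return [stations_now * (1 + growth) ** n for n in range(years)]


total_w = fleet_power_watts(CPUS_SHIPPED_2012)
# Assumption: "two modern power stations" implies this output per station.
implied_station_mw = total_w / 2 / 1e6

print(f"Fleet power: {total_w / 1e6:.0f} MW")
print(f"Implied station size: {implied_station_mw:.0f} MW")
print("Stations over three years:", [round(s, 2) for s in stations_by_year(3)])
```

Under these assumptions the growth series matches the abstract's "2.4 next year, and 3 the year after" once the third value is rounded up.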
  2. 2. Prof. Ian Phillips, Principal Staff Eng’r, ARM Ltd, ian.phillips@arm.com. Visiting Prof. at ... Contribution to Industry Award 2008. Centre for Robotics and Neural Systems, Uo.Plymouth, 1nov13 (1v0). Opinions expressed are those of the author alone. SlideCast and pdf available via http://ianp24.blogspot.co.uk/
  3. 3. Energy Efficient Computing ..?
  4. 4. Energy Efficient Computing ..?
  5. 5. Energy Efficient Computing ..?
  6. 6. The Visible Face of Computing Today
  7. 7. The Invisible Face of Computing Today: 100s of Billions of computers, each consuming mW! Bringing Embedded Intelligence to the Consumer Market has changed the Face of Computing! (Again)
  8. 8. Our 21c World ...
  9. 9. Markets provide the Growth Drivers. [Chart: Millions of Units vs. year, 1960 to 2020. 1st Era: Select work tasks. 2nd Era: Broad-based computing for specific tasks. 3rd Era: Computing as part of our lives.] Today: ~2% of our Energy Use goes on Computing and Electronics! ... Tomorrow: It could easily be 20%!
  10. 10. ARM in the Digital World: 8.7B CPUs shipped in 2012 (growing 20% pa). 75% of the things connected to the Internet today are ARM Powered! (Gartner) [Chart, time axis 1998 / 2012 / 2020: 40+ billion CPUs to date; 150+ billion CPUs cumulative by 2020.] http://www.arm.com/
  11. 11. Moore’s Law ... Gordon Moore, Co-founder of Intel (1965). [Chart: Transistors/Chip (M), Transistor/PM (K) and Approximate Process Geometry (100um, 10um, 1um, 100nm, 10nm); ITRS’99.] ... x More Functionality on a Si Chip in 20 yrs! http://en.wikipedia.org/wiki/Moore’s_law
  12. 12. A Machine for Computing ... Computing: A general term for algebraic manipulation of data ... Numerated Phenomena IN (x) ⇒ y=F(x,t,s) ⇒ Processed Data/Information OUT (y) ... State and Time are always factors (variable weight). It can include phenomena ranging from human thinking to calculations with a narrower meaning. Usually used to exercise analogies (models) of real-world situations; frequently in real-time (fast enough to be a stabilising factor in a loop). (Wikipedia) ... So what part do Hardware and Software play? ... And what about Energy?
  13. 13. Antikythera c87BC ... Planet Motion Computer. Mechanical Technology. Inventor: Hipparchos (c.190 BC – c.120 BC), Ancient Greek Astronomer, Philosopher and Mathematician. Single-Task, Continuous Time, Analogue Mechanical Computing (With backlash!) See: http://www.youtube.com/watch?v=L1CuR29OajI
  14. 14. Orrery c1700 ... Planet Motion Computer. Mechanical Technology. Inventor: George Graham (1674-1751), English Clock-Maker. Single-Task, Continuous Time, Analogue Mechanical Computing (With backlash!)
  15. 15. Babbage's Difference Engine 1837. Mechanical Technology ((Re)construction c2000). The difference engine consists of a number of columns, numbered from 1 to N. Each column is able to store one decimal number. The only operation the engine can do is add the value of column n + 1 to column n to produce the new value of n. Column N can only store a constant; column 1 displays (and possibly prints) the value of the calculation on the current iteration. Computer for Calculating Tables: A Basic ALU Engine
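The column-addition rule described on slide 15 is enough to tabulate a polynomial by the method of finite differences. The following is a minimal sketch of that idea, not a model of the real machine's timing or carry mechanism; the function name and the order in which columns are updated are assumptions made for illustration.

```python
def difference_engine(columns, steps):
    """Tabulate values using only 'add column n+1 into column n'.

    columns[0] plays the role of column 1 (the displayed result) and
    columns[-1] the role of column N (a constant). The initial contents are
    the function value and its forward differences at the starting point.
    """
    cols = list(columns)
    results = []
    for _ in range(steps):
        results.append(cols[0])  # column 1 shows the current value
        # Update the lowest column first so that each addition uses the
        # higher column's value from the previous iteration.
        for n in range(len(cols) - 1):
            cols[n] += cols[n + 1]
    return results


# f(x) = x^2 starting at x = 0: value 0, first difference 1, second difference 2.
print(difference_engine([0, 1, 2], 6))
```

Because the top column never changes, a three-column engine can tabulate any quadratic, and each extra column raises the polynomial degree it can handle by one.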
  16. 16. “Enigma” c1940. Mechanical Technology. Data Encryption/Decryption Computer
  17. 17. “Colossus” 1944. Valve/Mechanical Technology. Code-Breaking Computer: A Data Processor
  18. 18. “Baby” 1947 (Reconstruction). Valve/Software Technology. General Purpose, Quantised Time and Data, (Digital) Electronic Computing
  19. 19. Signal Processing. BTH Crystal Set, c1925: 1 Diode. Tele-Verta Radio, c1945: 4 Valves, 1 Rectifier Valve. Bush Radio, c1960: 7 Transistors, 1 Diode. Evoke DAB Radio, c2005: 100 M Transistors, 2-3 Embedded Processors.
  20. 20. Radio as Computation ... Vi ⇒ Vrf=Vi*100 ⇒ Vif=Vrf*Vlo, with Vlo=Cos(t*1^6) ⇒ Vro='Bandpass'(Vif*1000). Single-Task (Embedded), Real-Time, Analogue (Close-Enough) Computing
  21. 21. Radio as Computation ... Valve Technology. Vi ⇒ Vrf=Vi*100 ⇒ Vif=Vrf*Vlo, with Vlo=Cos(t*1^6) ⇒ Vro='Bandpass'(Vif*1000). Single-Task (Embedded), Real-Time, Analogue (Close-Enough) Computing
  22. 22. Radio as Computation ... Valve, Transistor and ‘Integrated Circuit’ Technology. Vi ⇒ Vrf=Vi*100 ⇒ Vif=Vrf*Vlo, with Vlo=Cos(t*1^6) ⇒ Vro='Bandpass'(Vif*1000). Single-Task (Embedded), Real-Time, Analogue (Close-Enough) Computing
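The radio slides (20 to 22) treat each stage as a function applied to a signal. A minimal sketch of the gain and mixer stages follows, assuming the local-oscillator term on the slide is meant as cos(t × 10^6); that reading is a guess at a lost superscript, and the band-pass stage is left out. The helper `mixer_products` is hypothetical and only demonstrates the product-to-sum identity that makes mixing useful.

```python
import math

LO_RAD_PER_S = 1e6  # assumed reading of the slide's local-oscillator term


def v_lo(t):
    """Local oscillator: Vlo = cos(t * 1e6) under the assumption above."""
    return math.cos(t * LO_RAD_PER_S)


def v_rf(v_i):
    """RF gain stage from the slide: Vrf = Vi * 100."""
    return v_i * 100


def v_if(v_i, t):
    """Mixer stage from the slide: Vif = Vrf * Vlo."""
    return v_rf(v_i) * v_lo(t)


def mixer_products(w_sig, w_lo, t):
    """cos(a)cos(b) = 0.5*(cos(a-b) + cos(a+b)): difference and sum tones."""
    return 0.5 * (math.cos((w_sig - w_lo) * t) + math.cos((w_sig + w_lo) * t))


# A tone slightly above the oscillator frequency, sampled at one instant.
w_signal = 1.1e6
t = 3.7e-6
mixed = v_if(math.cos(w_signal * t), t)
print(mixed, 100 * mixer_products(w_signal, LO_RAD_PER_S, t))
```

The two printed values agree: multiplying by the oscillator shifts the signal to a low difference frequency, which is the component the slide's band-pass stage would then select.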
  23. 23. Computing is Era and Application Related ... Computing: Creating Useful Output from Input ... Architecture: The way this is done on the day. It is the Most Important Product Decision! (HW, SW, Digital, Analogue, Optics, Graphene, Mechanics, Steam, etc)
  24. 24. Moore's Real Law: x2 Functionality Every 18mth! Cascade of Technologies supporting Functional growth ... [Chart: Functional Density (units), 10^0 to 10^12, vs. year, 1960 to 2020. Electronic era: 1975-2005. System era: 2003-2030.] ... The ‘Law’ started with Wood ⇒ Stone ⇒ Bronze ⇒ Iron
  25. 25. Computing in a Cool iCon ...
  26. 26. ‘A lot’ of Architecture in a Smart Phone ... Computation in many forms
  27. 27. Take a Look Inside ... Level-1: Modules. The Control Board. http://www.ifixit.com
  28. 28. Inside The Control Board (a-side). Level-2: Sub-Assemblies. Visible Computing Contributors: Samsung: Flash Memory - NV-MOS (ARM Partner); Cirrus Logic: Audio Codec - Bi-CMOS (ARM Partner); AKM: Magnetic Sensor - MEM-CMOS; Texas Instruments: Touch Screen Controller and mobile DDR - Analogue-CMOS (ARM Partner); RF Filters - SAW Filter Technology. Invisible Computing Contributors: OS, Drivers, Stacks, Applications, GSM, Security, Graphics, Video, Sound, etc; Software Tools, Debug Tools, etc. http://www.ifixit.com
  29. 29. Inside The Control Board (b-side). Level-2: Sub-Assemblies. More Visible Computing Contributors: A4 Processor. Spec: Apple, Design & Mfr: Samsung, Digital-CMOS (nm) ... Provides the iPhone 4 with its GP computing power (said to contain ARM A8 600 MHz CPU and other ARM IP); ST-Micro: 3 axis Gyroscope - MEM-CMOS (ARM Partner); Broadcom: Wi-Fi, Bluetooth, and GPS - Analogue-CMOS (ARM Partner); Skyworks: GSM - Analogue-Bipolar; Triquint: GSM PA - Analogue-GaAs; Infineon: GSM Transceiver - Anal/Digi-CMOS (ARM Partner); GPS, Bluetooth, EDR & FM. http://www.ifixit.com
  30. 30. Level-3: Processor (Nvidia Tegra 3, around 1B transistors). NB: The Tegra 3 is similar to the A4/5, but not used in the iPhone
  31. 31. Packing Technology into an iCon: Analogue and Digital Design; Embedded Software; Mechanics, Plastics and Glass; Micro-Machines (MEMs); Displays and Transducers; Robotics and Test; Knowledge and Know-How; Research, Education and Training; Components, Sub-Systems and Systems; Design, Assembly and Manufacture; Metrology, Methodology and Tools ... Involving Many Specialist Businesses ... Round and Round the World ... Not-Least from Europe
  32. 32. Architecting your Product   : Is the cumulative non-functional choices made to support the functional need  A Good Architecture is the one that ‘survives’  History is written by winners (2nd is for losers) : Component Performance may be ‘poor’ as long as System Performance is ‘better’ for its use.  Architectural Options ... : Business Model (Cost-of-Ownership, ROI), TTM (Productivity, History, IP Availability, Know-How), Aesthetics (Power, Quality, Behaviour, Appearance)  : Analogue, Digital, Mechanical, Optical, RF, Software, Plastics, Metal-forming, Manufacturing, Glass, ...  : More than 99% of a Product is Reused from its Predecessor  ... is assumed (working is expected!) ... It used to be the only consideration!
  33. 33. Power Philosophy. Hardware Dissipates Power ... Choose the Underlying Technology for best power efficiency; One size does not fit all (Products, Applications or Instances). ... Software Doesn’t (But it Tells Hardware To!): Chips can literally melt down under software ‘instruction’. Make computing hardware power as ‘Activity’ dependent as possible: Zero Activity => Zero Power. Make OS/Apps aware of the power/performance situation, and their options for controlling it (Need Indicators and Levers). ... Think System: It’s how the ‘box’ performs, not the components
  34. 34. Core Power Management. For Processor and Peripheral Circuitry: Variable/Gated Clock Domains; Variable/Switched Power Domains. Indicators and Levers: Allow the software to see and influence what is going on. Principles of Core Power Efficiency: Minimise voltage/frequency (P=CV²f) so that the processor has just enough performance for the current application need; Maximise ‘Activity Power’ dependence (Zero Activity => Zero Power); Management by the OS and the Application SW; Apply to all on & off-chip zones (not just the CPU) ... Methodology. Retention Flops/Latches, Level Shifters, Power-Switch Cells, PLLs
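The principle on slide 34, "just enough performance" under P=CV²f, can be sketched as a choice among voltage/frequency operating points. The operating points below are hypothetical values chosen for illustration, not figures for any ARM core, and leakage power is ignored.

```python
def dynamic_power(c_farads, v_volts, f_hz):
    """Switching power P = C * V^2 * f (leakage is ignored in this sketch)."""
    return c_farads * v_volts ** 2 * f_hz


# Hypothetical (volts, hertz) operating points, for illustration only.
OPERATING_POINTS = [(0.8, 400e6), (1.0, 800e6), (1.2, 1.2e9)]


def lowest_sufficient_point(required_hz, points=OPERATING_POINTS):
    """Pick the slowest point that still meets the demanded clock rate."""
    for volts, hz in sorted(points, key=lambda p: p[1]):
        if hz >= required_hz:
            return volts, hz
    # Demand exceeds every point: run at the fastest one available.
    return max(points, key=lambda p: p[1])


SWITCHED_CAPACITANCE = 1e-9  # hypothetical effective capacitance, in farads

for demand in (300e6, 500e6, 1.0e9):
    volts, hz = lowest_sufficient_point(demand)
    watts = dynamic_power(SWITCHED_CAPACITANCE, volts, hz)
    print(f"demand {demand / 1e6:.0f} MHz -> {volts} V, {hz / 1e6:.0f} MHz, {watts:.3f} W")
```

Because voltage enters squared, dropping from the fastest to the slowest point in this table cuts power by far more than the 3x reduction in clock rate alone would suggest.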
  35. 35. Architectural Energy Efficiency - Parallelism. One Processor at frequency f: Capacitance = C, Voltage = V, Frequency = f, Power = CV²f. Two Processors, each at f/2: Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f, Power = (2.2*0.6*0.6*0.5)CV²f = 0.4CV²f. To a limit determined by Amdahl’s or Gustafson’s Law ... Amdahl: Extracted parallelism from existing code (Reuse). Gustafson: Some needs only benefit from parallelism (Custom) ... Actual improvement is application specific.
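The power ratio on slide 35 and the two limiting laws it names can be written out directly. This is a minimal sketch using the slide's scale factors and the standard textbook forms of Amdahl's and Gustafson's laws; the 90% parallel fraction in the example is an arbitrary illustration.

```python
def relative_power(c_scale, v_scale, f_scale):
    """Power relative to the baseline CV^2f after scaling C, V and f."""
    return c_scale * v_scale ** 2 * f_scale


def amdahl_speedup(parallel_fraction, n_processors):
    """Fixed problem size: the serial part limits the achievable speedup."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)


def gustafson_speedup(parallel_fraction, n_processors):
    """Problem size grows with the machine: scaled speedup."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * n_processors


# The slide's two-processor case: 2.2x capacitance, 0.6x voltage, 0.5x clock.
print(f"relative power: {relative_power(2.2, 0.6, 0.5):.3f}")
# Illustrative workload that is 90% parallel, on two processors.
print(f"Amdahl: {amdahl_speedup(0.9, 2):.2f}x, Gustafson: {gustafson_speedup(0.9, 2):.2f}x")
```

The first figure reproduces the slide's 0.4CV²f (0.396 before rounding). The second line shows why the saving holds only "to a limit": two half-speed processors match the original throughput only for the parallel part of the work.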
  36. 36. Architectural Energy Efficiency - Data. Moving Data takes significant Energy: Becoming the dominant energy consumption in a system. Data Location: Avoid moving or copying Data; Energy ∝ DataVolume x Speed x Distance^(>2 (3)). Bring the Processing to the Data: Caching is good (depends on implementation); Write back is better than write-through; Local working memory is good (aka Software Caching) ... The Arrangement of your Data matters!
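Slide 36's point that write-back beats write-through can be illustrated by counting how many stores actually reach main memory. This is a toy model, assuming a direct-mapped cache with one word per line and ignoring loads entirely; the function and its parameters are hypothetical and do not describe any ARM cache.

```python
def memory_writes(addresses, policy, n_lines=4):
    """Count writes reaching main memory for a stream of store addresses."""
    if policy not in ("write-through", "write-back"):
        raise ValueError(f"unknown policy: {policy}")
    if policy == "write-through":
        return len(addresses)  # every store is forwarded to memory
    lines = {}  # cache index -> address held (every held line is dirty)
    writes = 0
    for addr in addresses:
        idx = addr % n_lines
        if idx in lines and lines[idx] != addr:
            writes += 1  # a dirty line is evicted: one write-back
        lines[idx] = addr
    return writes + len(lines)  # flush the remaining dirty lines at the end


# A loop that updates the same two counters 100 times each.
stores = [0, 1] * 100
print(memory_writes(stores, "write-through"), memory_writes(stores, "write-back"))
```

With good locality the write-back count collapses to the number of distinct lines touched. With a conflicting pattern such as alternating addresses 0 and 4 the two policies cost the same, which is one way the "depends on implementation" caveat on the slide shows up.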
  37. 37. All ARM Processors are Power Efficient
  38. 38. Choose The Horses for The Course. About 50KTr vs. about 50MTr ... Delivering ~5x speed (Architecture + Process + Clock)
  39. 39. Multicore ARM On-Chip ... Heterogeneous Multicore Systems have been in ARM for a long time. [Diagram: Cortex™-A8 (Application), Mali™-400 MP (UI & 3D Graphics), Cortex-M3 (Power Manager), Interconnect, Memory.]
  40. 40. Coherent Multicore Cluster ... Homogeneous Multicore cluster, as part of a heterogeneous system. [Diagram: Cortex-A9 … Cortex-A9 with Coherency Logic, Mali-400 MP (User Interface and 3D graphics), Cortex-M3 (Power Manager), Interconnect.]
  41. 41. Multiple Clusters ... Multiple Homogeneous Coherent Clusters. [Diagram: two clusters of Cortex-A15 … Cortex-A15, each with Coherency Logic in L2 Cache, joined by a Coherent Interconnect.]
  42. 42. Computer On a Chip c2010 ... Today’s Consumer requires a pocket ‘Super-Computer’ ... Silicon Technology Provides a Billion transistors ... It will be supported with a few GB of memory ... Typically 10 Processors: 4 x A9 Processors (2x2); 4 x MALI 400 Frag. Proc; 1 x MALI 400 Vertex Proc; 1 x MALI Video CoDec. Software Stacks, OS’s and Design Tools. ARM Technology gives chip/system designers: Improved Productivity; Improved TTM; Improved Quality/Certainty. http://www.arm.com/
  43. 43. CoreLink™ CCN-504 and DMC-520. [Block diagram: Heterogeneous processors (CPU, GPU, DSP and accelerators); up to 4 coherent clusters with up to 4 cores per cluster (Quad Cortex-A15, each with L2 cache, attached via ACE); Virtualized Interrupts and Interrupt Control; DSP, PCIe, DPI, Crypto, USB, SATA, AHB and 10-40 GbE attached through NIC-400 with IO Virtualisation with System MMU; CoreLink™ CCN-504 Cache Coherent Network with Snoop Filter and Integrated 8-16MB L3 cache; up to 18 AMBA interfaces for I/O coherent accelerators and IO; two CoreLink™ DMC-520 controllers (Dual channel DDR3/4 x72, PHY, x72 DDR4-3200) giving Uniform System memory; Flash and GPIO on a NIC-400 Network Interconnect in the Peripheral address space.]
  44. 44. Methodology As Well As Hardware: C/C++ Development; Debug & Trace; Energy Trace Modules; Middleware
  45. 45. big.LITTLE Processing. For High-Performance systems ... Tightly coupled combination of two ARM CPU clusters: Cortex-A15 and Cortex-A7, functionally identical; Same programmer's view, looks the same to OS and applications. big.LITTLE combines high-performance and low power: Automatically selects the right processor for the right job; Redefines the efficiency/performance trade-off. [Chart comparing a current smartphone with big.LITTLE: big, “Demanding tasks”, >2x Performance; LITTLE, “Always on, always connected tasks”, 30% of the Power (select use cases).]
  46. 46. Fine-Tuned to Different Performance Points. LITTLE (Cortex-A7, Cortex-A53): Most energy-efficient applications processor from ARM; Simple, in-order, 8 stage pipelines; Performance better than mainstream, high-volume smartphones (Cortex-A8 and Cortex-A9). big (Cortex-A15, Cortex-A57): Highest performance in mobile power envelope; Complex, out-of-order, multi-issue pipelines; Up to 2x the performance of today’s high-end smartphones. [Pipeline diagrams: Queue, Issue, Integer.]
  47. 47. big.LITTLE Software. CPU Migration: Migrate a single processor workload to the appropriate CPU; Migration = save context then resume on another core; Also known as Linaro “In Kernel Switcher”; DVFS driver modifications and kernel modifications; Based on standard power management routines; Small modification to OS and DVFS, ~600 lines of code. big.LITTLE MP: OS scheduler moves threads/tasks to appropriate CPU; Based on CPU workload; Based on dynamic thread performance requirements; Enables highest peak performance by using all cores at once
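The load-based placement described on slide 47 can be sketched as a threshold rule with hysteresis. This is an illustrative policy only: the thresholds, function names and load values are assumptions, and it does not reproduce the Linaro In Kernel Switcher or any real scheduler.

```python
UP_THRESHOLD = 0.80    # hypothetical: move to the big cluster above this load
DOWN_THRESHOLD = 0.30  # hypothetical: move back to LITTLE below this load


def place(current_cluster, load):
    """Return "big" or "LITTLE" for a task given its recent load (0.0 to 1.0)."""
    if current_cluster == "LITTLE" and load > UP_THRESHOLD:
        return "big"
    if current_cluster == "big" and load < DOWN_THRESHOLD:
        return "LITTLE"
    # Inside the hysteresis band: stay put to avoid repeated migrations.
    return current_cluster


def trace(loads, start="LITTLE"):
    """Replay a sequence of load samples and record where the task runs."""
    cluster, history = start, []
    for load in loads:
        cluster = place(cluster, load)
        history.append(cluster)
    return history


print(trace([0.1, 0.9, 0.5, 0.2]))
```

The gap between the two thresholds matters because each migration means saving context and resuming on another core, so a task hovering around a single threshold would otherwise bounce between clusters.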
  48. 48. Bringing the Processing to the Data … Press Claims: Dell + Marvell, Copper; BaiDu + Marvell; Baserock. 288 server nodes in a 4U rack space. Public Source: http://www.engadget.com/2011/11/02/hp-and-calxedas-moonshot-arm-servers-will-bring-all-the-boys-to/
  49. 49. ... Refining Data into Information
  50. 50. Transferrable Lessons to GP Software. Moving data is Power Expensive ... Don’t move data; use it locally (Cache it); Refine it once, use it often (Pre-Process it). Your CPU Power is work-load independent ... So, get in; get the work done; and get out. Maximise the workload of your code; terminate when complete. Make your Processing work-load dependent: Use a Hypervisor and turn off (at least free) processors not in use.
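The "get in; get the work done; and get out" advice on slide 50 follows from simple energy accounting: if active power does not depend on how useful the work is, energy scales with time spent active. A minimal sketch, with hypothetical power figures chosen only for illustration:

```python
def energy_joules(active_w, idle_w, busy_s, window_s):
    """Energy over a time window: active while busy, idle for the remainder."""
    if not 0 <= busy_s <= window_s:
        raise ValueError("busy time must fit inside the window")
    return active_w * busy_s + idle_w * (window_s - busy_s)


ACTIVE_W = 2.0  # hypothetical power while the CPU is running anything at all
IDLE_W = 0.1    # hypothetical power once the CPU has been released

# The same job over a 10 s window: efficient code finishes in 2 s,
# wasteful code (polling, redundant copies) keeps the CPU busy for 6 s.
efficient = energy_joules(ACTIVE_W, IDLE_W, busy_s=2.0, window_s=10.0)
wasteful = energy_joules(ACTIVE_W, IDLE_W, busy_s=6.0, window_s=10.0)
print(f"efficient: {efficient:.1f} J, wasteful: {wasteful:.1f} J")
```

The saving depends on the idle figure being genuinely low, which is why the slide pairs this advice with freeing or turning off processors that are not in use.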
  51. 51. Society's Challenges in the 21c: Urbanisation (Smart Cities); Health (eHealth); Transport; Energy (Smart Grid); Security; Environment; Food/Water; Ageing Society; Sustainability; Digital Inclusion; Economics. And whilst our technologies will be an essential part of all solutions, they cannot fix them without Society’s help and cooperation! ... Energy Efficient Computing will minimise the impact, not avert the challenges! Having a great time!
  52. 52. Conclusions. Putting the power of Computation into the hands of the masses has changed the face of Computing (again). Electronic Systems will become Essential to our Lives and the Economy. Power Efficient ES are a major issue to Society, which faces a future with them as a significant energy consumer in themselves. Power Efficiency must be architected into the System Hardware and Software from the beginning: To realise the maximum potential out of your Silicon (Avoiding Dark Si); Architect & Design HW as efficiently as possible (reflecting the task); Strive for: No Work => No Power; Equip HW with Indicators and Levers so the System/App can manage it. Bring Processing to the Data ... Don’t move Data; move Information; Process data Locally; Energy ∝ DataVolume x Speed x Distance^(>2 (3))
  53. 53. Computing at the heart of the 21c. ARM: Enabling the Creation of High-Performance Electronic Systems: Productively, Economically and Reliably; Through Hw/Sw Reuse Methodologies; Based on a family of CPU/GPU cores
