Improving Xen idle power efficiency


Published on

Power management has become increasingly important in large-scale datacenters to address costs and limitations in cooling or power delivery, and it is much critical in mobile client where battery lifecycle is considered as one of the critical characteristics of the platform of choice. Good power management helps to achieve great energy efficiency. Virtualization imposes additional challenge to power management. It involves multiple software layers: VMM, OS, APP. For example, a good OS software stack may result in bad power consumption, if the hypervisor is not the timer unalignment, etc.

In this session, we will introduce what we did to improve power efficiency to achieve better power efficiency in both server and client virtualization environment.

In server side, we will introduce additional optimization technologies (e.g., eliminate unnecessary activities, align periodic timers to create long-idle period), to improve package C6 residency to be within 5% overhead with native. In client side, we will share our client power optimization technologies (e.g. graphics, ATA and wireless), which successfully reduce XenClient idle power overhead to be within 5%.

Published in: Business, Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • The LCD brightness is control by ACPI. When we press the hotkey in the laptop to decrease the LCD brightness, it will trigger a ACPI event and the event handler will call the control methods to take the corresponding action. But we lack the control methods support in guest’s ACPI table, so we need add those control methods to guest ACPI. And when those control methods are called by guest, then ask dom0 to do the work.
  • Basically, we need to turn off the SCSI host that doesn’t attached any device to save the power. But previous client didn’t do this and this will waste power. Now our solution is to bring down all the SCSI host that do not attached any device to save more power.In previous Client, the ATA link only can be set statically: either to mini_power or to max_performance. Now, we add the dynamic solution: runtime check the system load, if idle, set to mini_power, or else, set to max_performance.
  • FLR means function level reset. FLR will reset the whole device and go to initial status(like power on). The issue is that FLR is required when pass through GFX to guest. Then it will clear all PM regs setting by BIOS which can save the power. The solution is that we save the PM regs before FLR and restore it after FLR.RC6(render standby) is a GPU’s technology that allows the GPU to go into a very low power consumption state when the GPU is idle. It is same with the C state in CPU.Turbo is the intel turbo boost. Refer to to get more detailsGPMT(Graphics Power Modulation Technology):Graphics Power Modulation Technology (Intel GPMT) is a method for saving power in the graphics adapter while continuing to display and process data in the adapter. This method will switch the render frequency and/or render voltage dynamically between higher and lower power states supported on the platform based on render engine workload
  • Process is an OS schedulable task/entity
  • Actually, we don’t know the proper value as the expiration window. The 50ns just a guess. As you know, different guest and different workload have the different requirement. It hard to give a fixing value as the expiration window. We may need lots of experiments to get the proper value. Unfortunately, we don’t have the time to do this. Also, we don’t have the time to implement the timer alignment in Xen. We did it in KVM. But the idea is same between Xen and KVM.
  • 1. Not VM. The RTC is emulated by Hypervisor. Here I mean the emulation logic in Hypervior is wrong, not the usage inside VM.2. Event channel is a mechanism used to notify events between hypervisor and VMs. Before, device model polls the buffered I/O(several times a second), and mostly, there are no new data arrived. Now, when hypervisor write the data to buffered I/O page, it will issue an event to notify device model that new data is arriving, then device model will wake up to get the data. With this way, we can eliminate the needless waken ups of device model to check the buffer I/O.
  • Improving Xen idle power efficiency

    1. 1. Xen Power ImprovementsWill Auld, Yang Z Zhang, Winston WangIntel Corporation
    2. 2. Legal DisclaimerINFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NOLICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUALPROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMSAND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER,AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OFINTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR APARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OROTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE INMEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.Intel may make changes to specifications and product descriptions at any time, without notice.All products, dates, and figures specified are preliminary based on current expectations, and are subject tochange without notice.Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, whichmay cause the product to deviate from published specifications. Current characterized errata are available onrequest.Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in theUnited States and other countries.*Other names and brands may be claimed as the property of others.Copyright © 2012 Intel Corporation. 2
    3. 3. Agenda• Background• Power saving in client• Power saving in server• Summary 3
    4. 4. Room to save POWER• Ideal/standard  Native OS power consumption• Reality  Hypervisor power consumption• LARGE DELTA (~40% for client at start) 4
    5. 5. Client architecture Client Xen Configuration Linux Win7 DomU Dom0 DomU VM VM VM Xen Hypervisor Hardware 5
    6. 6. Goal • Native OS power efficiency • Close the Power gap with Native Win7 Code Drop Fix Identify Code Gap Root Cause 6
    7. 7. Current results• ~40% idle power gap 2 years ago• ~5% idle power gap now Idle Power Gap 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% Project Start Project End• More?• Increasingly harder to extract 7
    8. 8. LCD brightness control LCD Display – ~20% idle power − Broken brightness controls Win7> Dom0> Fix: −Added emulation of ACPI video extension − Specifically, brightness control methods _BCL, _BCM, and _BQC − Added to VM guest ACPI BIOS − Pass through control knob output to Dom0 take platform action −Make sure Dom0 LCD brightness is really working 8
    9. 9. Runtime IO power managementDysfunctional IO power management• ~15% Idle power• 1st available in 2.6.32 kernel, but: − not functioning correctlyFix:• Enable energy-saving states at run time and auto suspended when idle• Gap dropped from ~25% to 6.8% after fix − HP 8440p mobile platform based on Nehalem processor 9
    10. 10. ATA_link power Max_PerfATA_link static power setting− ~6% idle power in max_performance Run Time− But performance suffers with min_power− Even worse: −All SCSI hosts active with/without attached devices Mim_PowerFix:− Runtime update for ATA_link power setting −Toggle min_power / max_performance, as needed− Disable clocks on deviceless ports 10
    11. 11. Network powerWired and Wi-Fi− ~16 % idle power (650mw)− Many interrupts break deep c state during idle Win7> Dom0>Fix:− Enable Wi-Fi and E1000 power saving mode in Dom0− Add Win7 power management PV driver to pass control settings to Dom0 11
    12. 12. GFX power managementiGFX power management inactive− ~16% idle power (650mw)− VT-d requires device reset −Reset clears all regs including BIOS enabled power management regs − Disables: RC6 (render standby), turbo, and GPMT (Graphics Power Modulation Technology) VT-d operation BIOS PM ON PM PM ON Reset OFF Save / RestoreFix:− Save/Restore PM registers around FLR 12
    13. 13. Client summary• Started with a ~40% gap• Ended with ~5% gap• Greatly improved and got close to the goal 13
    14. 14. Server power savings -- increasing idle time• Timer alignment• Power aware scheduling• Reducing periodic tasks 14
    15. 15. Timer alignment• Independent, frequent timer interrupts • Frequent wake-ups• Reduced idle time, greater power consumption intr arrived intr arrived Timer intr idle busy idle busy Cpu0: CPU idle CPU busy intr arrived Cpu1: Resultant Socket Socket C-state : 15
    16. 16. Timer alignment• Proposal • Configurable timer consolidate window, such as 50 ns • Compute timer interrupt moment • Shift timer handle moment to next timer consolidate moment• Benefit • Fewer interrupts  longer idle time  power savings• Challenges • Guest schedule impact– performance impact • Cross CPU timer synchronization • IPI frequency and synchronization 16
    17. 17. Timer alignment intr arrived intr arrived Timer intr idle busy idle busy Cpu0: CPU idle CPU busy New intr arrived intr arrived Cpu1: Resultant Socket Socket C-state : Gained C-State• Shifting CPU1’s interrupt to match CPU0’s Nice gain in C-State• Repeated over and over adds up 17
    18. 18. Power aware scheduling• ACPI modes – − Performance  Power hungry mode − Energy mode  Power savings mode − Balanced• Task to Scheduling − Performance − Schedule vCPUs one per physical core before pairing − Energy − Schedule vCPUs one per logical core  − power down more cores  − power down more sockets 18
    19. 19. Power saving scheduler packages pkg 0 pkg 1 cores core 0 core 1 core 0 core 1 HT cpu 0 cpu 1 cpu 2 cpu 3 cpu 4 cpu 5 cpu 6 cpu 7running task vcpu0 vcpu1 vcpu2 power aware scheduler Idle CPU/in deep C-state Busy CPU Not in deep C-state 19
    20. 20. Reduce periodic activity• Power-unfriendly RTC emulation: − VMM updates RTC clock twice per second − Solution − Update RTC clock only on Read If a clock ticks where no one can see it, does the time change?• Frequent Wake-ups to check buffered I/O: − Wakeup multiple times a second (Polling model) − Solution (Push model) − Event channel to notify buffered I/O change status No more polling 20
    21. 21. Server summary• Significant areas of work• Need to quantify the impacts 21
    22. 22. Overall summary• Every component counts – software and hardware• Make sure the basics are working• Still more to do 22
    23. 23. Questions? 23