The On-going Evolutions of Power Management in Xen

                                               Kevin Tian
  1. The On-going Evolutions of Power Management in Xen. Kevin Tian, Open Source Technology Center (OTC), Intel Corporation
  2. Agenda
     • Brief history
     • Evolutions of idle power management
     • Evolutions of run-time power management
     • Tools
     • Experimental data about Xen power efficiency
     • Power impact from VM
     Intel Confidential
  3. Brief History (timeline)
     • Milestones: Xen 3.1; Host S3; Dom0 controlled freq/vol scaling; Xen 3.2 released; preliminary cpufreq and cpuidle support in Xen; mature deep C-states support; Xen 3.3 released; …
     • Date labels: May 2007, Jul. 2007, Sep. 2007, Jan. 2008, Jun. 2008, Aug. 2008
     • Themes: enhanced green computing, improved stability, better usability
  4. Evolutions of idle power management (Xen cpuidle)
  5. Xen Summit Boston 2008
     • Block diagram labels, from Dom0 down to Xen: ACPI Parser, External Control, Schedulers, Registration Hypercall, Enter Idle, Ladder, Halt (C1), CpuIdle driver, Dynamic Tick, Timekeeping, C1, C2, Mwait/IO
  6. Enhanced C-states support
     • Chart: idle (Watt) and SPECpower (Score) as percentages for three configurations: noPM xpsp1 hvm, noC3 xpsp1 hvm, C3 xpsp1 hvm. Bar labels: 100%, 100%, 102.60%, 91.40%, 107.70%, 82.30%.
     • 'noPM' has both cpuidle and cpufreq disabled; the other two cases have both enabled. Compared to 'C3', 'noC3' has its maximum C-state limited to C2.
     • For idle watt, a lower value means greener. For SPECpower score, a higher value indicates better power efficiency.
     • Block diagram: same labels as the previous slide, plus Deep C-states, Timer, C1E and Xentrace.
  7. TSC freeze
     • Problem: the actual TSC falls behind the ideal TSC when the TSC freezes.
     • Symptoms: unsynchronized TSC (CPU0 vs CPU1), "Time went backwards" warnings, lots of lost ticks, fluctuating TSC scale factor, faster ToD, Xen system time skew.
     • Software compensation according to the elapsed counter of the platform timer, in three variants:
       – Restore TSC upon elapsed platform counter since the current entry, with a per-CPU platform-to-TSC scale
       – Restore TSC upon elapsed platform counter since the last calibration, with a per-CPU platform-to-TSC scale
       – Restore TSC upon elapsed platform counter since power on, with a global platform-to-TSC scale
     • Timeline labels: 0, 1 sec, 1 sec, scale error.
     • Hardware enhancement: the TSC is never stopped (e.g. on Intel Core-i7), i.e. an always running TSC.
     • Block diagram: same labels as the previous slide, plus TSC save/restore and Always running TSC.
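The third software variant on slide 7 (elapsed platform counter since power on, global scale) can be sketched as below. This is only an illustration: the two frequencies, the function name `restore_tsc` and the integer arithmetic are assumptions, not taken from the slides or from Xen's source.

```python
# Assumed frequencies, for illustration only.
PLATFORM_HZ = 14_318_180      # hypothetical platform counter frequency
TSC_HZ = 2_000_000_000        # hypothetical TSC frequency


def restore_tsc(tsc_ref, platform_ref, platform_now,
                tsc_hz=TSC_HZ, platform_hz=PLATFORM_HZ):
    """Rebuild a TSC value from the platform counter.

    tsc_ref / platform_ref are the two counters sampled at the same
    reference point (here: power on). The elapsed platform ticks are
    scaled into TSC ticks with one global platform-to-TSC ratio.
    """
    elapsed = platform_now - platform_ref
    return tsc_ref + elapsed * tsc_hz // platform_hz


# Two CPUs leave a deep C-state at the same platform time. Because both use
# the same reference point and the same global scale, they restore the same TSC.
platform_now = 10 * PLATFORM_HZ           # 10 seconds after power on
cpu0_tsc = restore_tsc(0, 0, platform_now)
cpu1_tsc = restore_tsc(0, 0, platform_now)
print(cpu0_tsc, cpu1_tsc)
```

With a per-CPU scale and a per-entry reference point, each restore would add its own rounding and scale error, which is one way to read the "scale error" label on the slide.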
  8. APIC timer freeze
     • Normal flow: the nearest deadline in the timer heap is used to reprogram the local APIC count-down timer; the APIC timer interrupts when it counts down to zero; the local APIC ISR leads to the timer softirq handler, which scans and executes expired timers.
     • Problem: when the APIC timer freezes, the timer softirq handler is delayed.
     • Solution: use a platform timer (PIT/HPET) to carry the nearest deadline when the APIC timer is halted in deep C-states.
       – Broadcast is required since the number of platform timer sources is less than the number of CPUs.
       – MSI-based HPET interrupts will come soon, giving more platform timer sources with reduced broadcast traffic.
       – An always running APIC timer will be supported in new CPUs soon.
     • Block diagram: same labels as the previous slide, plus Broadcast and PIT/HPET broadcast.
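The broadcast idea on slide 8 can be illustrated with a small model: CPUs hand their nearest deadline to one shared platform timer before entering a deep C-state, and the platform timer is programmed for the earliest of them. The class and method names here are hypothetical and do not correspond to Xen's actual code.

```python
class BroadcastTimer:
    """Toy model of one platform timer shared by several sleeping CPUs."""

    def __init__(self):
        self.deadlines = {}   # cpu id -> deadline handed over on deep C-state entry

    def enter_deep_cstate(self, cpu, nearest_deadline):
        """CPU's local APIC timer is about to stop; carry its deadline here."""
        self.deadlines[cpu] = nearest_deadline
        return self.programmed_deadline()

    def exit_deep_cstate(self, cpu):
        """CPU woke up for another reason; its local timer is usable again."""
        self.deadlines.pop(cpu, None)

    def programmed_deadline(self):
        """The single platform timer can only hold the earliest deadline."""
        return min(self.deadlines.values(), default=None)

    def fire(self, now):
        """Platform timer interrupt: return the CPUs that must be woken."""
        expired = sorted(cpu for cpu, d in self.deadlines.items() if d <= now)
        for cpu in expired:
            del self.deadlines[cpu]
        return expired


bt = BroadcastTimer()
bt.enter_deep_cstate(0, 500)
bt.enter_deep_cstate(1, 300)
woken = bt.fire(350)
print(woken, bt.programmed_deadline())
```

The model shows why broadcast traffic exists: one timer source serves many CPUs, so every expiry has to be fanned out to the right sleepers.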
  9. Menu governor
     • Ladder governor (inefficient): moves between C0, Cn and Cn+1 based on the expected minimal residency at Cn and at Cn+1.
       – Promotion if N continuous Cn residencies > expectation
       – Demotion if current Cn+1 residency < expectation
     • Menu governor: picks a C-state directly, based on
       – Nearest timer deadline
       – Last unexpected break event
       – Latency/power requirement
     • Example state table (state, latency, power): C1, 1ns, 20w; C2, 10ns, 15w; C3, 100ns, 5w; …
     • Menu result (HVM WinXPsp1): less idle watt consumed (-5.2%), higher SPECpower score (+1.6%). To be further tuned!
     • Block diagram: same labels as the previous slide, plus Menu.
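A minimal sketch of a menu-style pick, using the example state table from slide 9. The selection rule (lowest-power state whose latency fits both the expected idle time and the latency requirement) and the names `CState` and `menu_pick` are assumptions for illustration; the slide does not spell out the exact rule, and the handling of the last unexpected break event is left out.

```python
from typing import NamedTuple


class CState(NamedTuple):
    name: str
    latency_ns: int
    power_w: int


# Example table from the slide: (state, latency, power).
STATES = [CState("C1", 1, 20), CState("C2", 10, 15), CState("C3", 100, 5)]


def menu_pick(expected_idle_ns, max_latency_ns, states=STATES):
    """Pick the lowest-power state that fits the idle and latency budget.

    expected_idle_ns: time until the nearest timer deadline (assumed input).
    max_latency_ns:   latency requirement imposed on the system (assumed input).
    Falls back to the shallowest state when nothing fits.
    """
    budget = min(expected_idle_ns, max_latency_ns)
    eligible = [s for s in states if s.latency_ns <= budget]
    if not eligible:
        return states[0]
    return min(eligible, key=lambda s: s.power_w)


print(menu_pick(1000, 1000).name)   # long idle, relaxed latency
print(menu_pick(50, 1000).name)     # timer is close
print(menu_pick(1000, 5).name)      # strict latency requirement
```

Unlike the ladder, this picks the target state in one step instead of climbing one level per idle period.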
  10. Range timer
      • Each timer now accepts a range for expiration: [expiration, expiration + timer_slop] (default timer_slop is 50us).
      • Overlapping ranges can be merged to reduce the timer interrupt count.
      • Frequent C-state entry/exit may instead consume more power.
      • Figure: two 0 to 6 ms timelines showing transitions between C0 and Cn.
      • Block diagram: same labels as the previous slide, plus Range timer.
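The effect of merging overlapping ranges on slide 10 can be shown with a small calculation. The greedy strategy below (fire as late as the earliest pending timer allows, and serve every timer whose range covers that instant) is an assumed strategy for illustration, not necessarily what Xen does; the function name and the sample expirations are hypothetical.

```python
def count_interrupts(expirations_us, timer_slop_us):
    """Count timer interrupts needed when each timer accepts the range
    [expiration, expiration + timer_slop]."""
    count = 0
    fire_at = None
    for exp in sorted(expirations_us):
        if fire_at is None or exp > fire_at:
            # This timer cannot be served by the previous interrupt:
            # schedule a new one at the end of its accepted range.
            fire_at = exp + timer_slop_us
            count += 1
        # Otherwise exp <= fire_at <= exp + slop, so the pending
        # interrupt already falls inside this timer's range.
    return count


timers = [1000, 1020, 1040, 2000, 2900]   # hypothetical expirations in us
print(count_interrupts(timers, 0))        # no slop
print(count_interrupts(timers, 50))       # default 50us slop
print(count_interrupts(timers, 1000))     # 1ms slop
```

Fewer interrupts mean fewer C-state exits, which is the link between the range and the idle power numbers on the next slide.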
  11. Range timer effect
      • One UP HVM RHEL5u1:

          Range     Idle (watt)   SPECpower (score)
          50us      100%          100%
          1ms       92.50%        101.20%
          Change    -7.5%         +1.2%

      • Multiple idle UP HVM RHEL5u1: chart of timer interrupts/second (axis 0 to 10000) for 1HVM, 2HVM and 4HVM, comparing range=50us with range=1ms.
      • Collected on a two-core mobile platform.
  12. Current picture
      • Block diagram labels, from Dom0 down to Xen: Xenpm, ACPI Parser, External Control, Power Aware Schedulers, Registration Hypercall, Enter Idle, Statistics, Ladder, Menu, Halt (C1), CpuIdle driver, Dynamic Tick, Timekeeping, C1, C2, Mwait/IO, Deep C-states, Range timer, Timer, C1E, PIT/HPET broadcast, TSC save/restore, Always running TSC, Xentrace
  13. Evolutions of run-time power management (Xen cpufreq)
  14. Xen Summit Boston 2008
      • Block diagram labels, from Dom0 down to Xen: Linux PM Tools; Linux Cpufreq (On Demand, User Space, Others; Cpufreq core; ACPI cpufreq, PowerNow-K8, Others); ACPI Parser; External Control; Query Idle State; Registration; Schedulers; On Demand; Tiny cpufreq core; ACPI Cpufreq (IA32); Enable MSR Access Perm
  15. Current picture
      • Block diagram labels, from Dom0 down to Xen: Linux PM Tools; Xenpm; Linux Cpufreq, marked "?" (On Demand, User Space, Others; Cpufreq core; ACPI cpufreq, PowerNow-K8, Others); ACPI Parser; External Control; Query Idle State; Registration; Schedulers; Tiny cpufreq core; Enable MSR Access Perm
      • New since Xen Summit Boston 2008:
        – More governors: On Demand, User Space, Performance, Power Save
        – More drivers: ACPI Cpufreq (IA32), ACPI Cpufreq (IA64), PowerNow K8
        – Turbo mode
        – Enhanced User Control Interface
        – Statistics
        – Cpu offline / online
  16. Tools
  17. Tools in Dom0
      • Xenpm
        – Retrieve run-time statistics about Xen cpuidle and cpufreq
        – Apply user policy on exposed control knobs of Xen cpufreq (governor, set freq, etc.)
        – More capabilities to be added later, e.g. profile…
      • Xentrace
        – Log every state change for Xen cpuidle and cpufreq:
            CPU0 391365842416 (+ 21204) cpu_idle_entry [ idle to state 2 ]
            CPU0 391375951050 (+10108634) cpu_idle_exit [ return from state 2 ]
        – Raw data could be further processed by other scripts
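As an example of the "further processed by other scripts" remark on slide 17, the sketch below pairs `cpu_idle_entry` and `cpu_idle_exit` records and reports the residency of each idle period. The regular expression is written against the two sample lines shown on the slide only; real trace output may differ, and the function name is hypothetical. The residency is reported in whatever unit the trace timestamps use.

```python
import re

# Matches lines shaped like the two samples on the slide.
LINE_RE = re.compile(
    r"^(CPU\d+)\s+(\d+)\s+\(\+\s*\d+\)\s+"
    r"(cpu_idle_entry|cpu_idle_exit)\s+\[.*?state (\d+)\s*\]$"
)


def idle_residencies(lines):
    """Return (cpu, state, residency) for each matched entry/exit pair."""
    entered = {}      # (cpu, state) -> timestamp of the pending entry
    result = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore anything that is not an idle record
        cpu, ts, event, state = m.group(1), int(m.group(2)), m.group(3), int(m.group(4))
        if event == "cpu_idle_entry":
            entered[(cpu, state)] = ts
        elif (cpu, state) in entered:
            result.append((cpu, state, ts - entered.pop((cpu, state))))
    return result


SAMPLE = [
    "CPU0 391365842416 (+ 21204) cpu_idle_entry [ idle to state 2 ]",
    "CPU0 391375951050 (+10108634) cpu_idle_exit [ return from state 2 ]",
]
residencies = idle_residencies(SAMPLE)
print(residencies)
```

For the sample pair, the computed residency equals the delta printed in parentheses on the exit line.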
  18. Experimental data about Xen power efficiency
  19. • All data shown in this section:
        – For reference only and not guaranteed
        – On a two-core mobile platform, with one HVM guest created
          • Measurement of the server consolidation effect with multiple VMs/workloads is in progress
      • Improvement when Xen cpuidle and cpufreq are enabled
        – SPECpower score is normalized (100% noPM score is 1000 ssj_ops/watt)
        – Similarly, consumed watt is also normalized (idle noPM watt is 10w)
      • Chart: normalized score (ssj_ops/watt, axis 0 to 1500) and normalized power (watt, axis 0 to 25) against SPECpower workload (%, 0 to 100). Series: noPM-score, PM-score, noPM-watt, PM-watt. Annotations: "Reduced Power!", "Improved Efficiency".
  20. • Below is a more interesting comparison
        – Native WinXPsp1's idle watt and SPECpower score are both normalized to 100% as the base
      • Chart: normalized percentage (%, axis 50% to 140%) for idle (Watt) and SPECpower (ssj_ops/watt). Series: native xpsp1, native rhel5u1, xen xpsp1 hvm, kvm xpsp1 hvm, xen rhel5u1 hvm, kvm rhel5u1 hvm.
      • Annotations:
        – HZ=1000 in RHEL5u1 incurs a high timer interrupt rate
        – Similar idle power consumption for Xen and KVM
        – Xen is slightly more power efficient than KVM
  21. Power impact from VM
  22. • The VMM shouldn't be blamed as the only reason for high power consumption!
      • A 'bad' VM could eat power
        – Just like what a 'bad' application could do in a native OS
        – It causes frequent break events (e.g. timer interrupts) with short C-state residency
      • Which parts of a VM could draw high power?
        – 'Bad' applications hurt just as they could on native
        – How the guest OS is implemented also matters
          • Periodic tick frequency (HZ)
          • Timer usage in drivers
          • Time sub-system implementation
          • …
      • Green guest OS wins!
        – Smaller HZ, idle tickless or fully dynamic tick, range timer, etc.
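The link between HZ and C-state residency on slide 22 is simple arithmetic: a periodic tick at HZ wakes the guest every 1000/HZ milliseconds, so an otherwise idle guest cannot stay in a C-state longer than that. The sketch below assumes the periodic tick is the only break event, which is a simplification; the function names are hypothetical.

```python
def tick_period_ms(hz):
    """Period of a periodic guest tick in milliseconds. Under the
    simplifying assumption above, this bounds the idle residency."""
    return 1000.0 / hz


def tick_wakeups_per_second(hz, idle_vcpus):
    """Tick-driven wakeups per second for a number of idle vCPUs."""
    return hz * idle_vcpus


for hz in (1000, 250, 100):
    print(hz, tick_period_ms(hz), tick_wakeups_per_second(hz, 4))
```

A tickless or dynamic-tick guest removes this bound, leaving only real timer deadlines as break events.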
  23. Idle power consumption
      • Chart: normalized watt (axis 80% to 120%) and average residency (ms, axis 0 to 4). Series: idle power consumption, average C-state residency.
      • Category labels: bare, Dom0, PV, HVM, Winxp, HZ=1000, HZ=250, HZ=100, tickless.
      • Annotations: 'Bad' guest, Green, Green.
  24. Power efficiency
      • Chart: normalized score (ssj_ops/watt, axis 800 to 1020). Series: SPECpower score.
      • Category labels: bare, Dom0, PV, HVM, Winxp, HZ=1000, HZ=100, tickless.
  25. Legal Information
      • INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT.
      • Intel may make changes to specifications, product descriptions, and plans at any time, without notice.
      • All dates provided are subject to change without notice.
      • Intel is a trademark of Intel Corporation in the U.S. and other countries.
      • *Other names and brands may be claimed as the property of others.
      • Copyright © 2007, Intel Corporation. All rights are protected.