Driving efficiencies through measuring and monitoring in the data centre (London). Michael Rudgyard, Concurrent Thinking
Measuring and monitoring to support the EU Code of Conduct
Michael Rudgyard (CTO), Concurrent Thinking Ltd
EU Code of Conduct – Participant Commitments
• The participant commitments define minimum obligations (roughly):
  – Provision of monthly DCiE / PUE measurements
  – Provision of IT rated electrical load capacity of the DC
  – Target inlet temperature for IT equipment (optional)
  – External monthly average ambient temperature (optional)
  – External monthly average dew point temperature (optional)
• It also requires the DC to commit to an energy-saving action plan:
  – A number of potential ways to save energy are suggested
  – Most (all?) involve some level of monitoring
monitoring vs. Monitoring (1)
• It is simple (but neither cost-effective nor sensible) to monitor your data centre using the ‘man and a clipboard’ technique
• Sadly, this is the ‘state of the art’ for a lot of data centres, each housing many millions of pounds of high-tech IT equipment
• But information is power, and power is money…
monitoring vs. Monitoring (2)
• It is much more effective to Monitor at as fine-grained a level as possible:
  – To truly understand where energy savings can be made
  – To understand how factors vary over time, with load etc.
  – To give ample warning of potential (often critical) issues
  – To report factual information to management
  – To drive continuous iterative improvement over time
• Real energy and productivity savings require a ‘joined-up’ approach:
  – Managing buildings, data-centre facilities and IT in a unified manner
  – ...opening the door to the possibility of orchestration of the data centre
Monitoring Energy and PUE (or DCiE)
• The first step is to monitor power, then understand where the power is going.
• The next step is to measure PUE:
  – Most new data centres are being designed against PUE targets
  – Many existing data centres are looking to improve their PUE
  – Aim to reduce energy utilisation through incremental improvements to PUE
  – The average data centre has a PUE of 1.9 (Koomey, 2010), but most should be able to achieve a figure below 1.5 (??)
• Caveats:
  – Officially, PUE needs to be an annualised average, not a ‘snapshot’
  – However, continuous PUE ‘snapshots’ are useful to help drive improvement
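As a minimal sketch of the snapshot vs. annualised distinction above: PUE is total facility energy divided by IT energy, and DCiE is its reciprocal expressed as a percentage. The `PowerSample` type and the function names are illustrative assumptions, not part of the Code of Conduct or of any monitoring product.

```python
from dataclasses import dataclass


@dataclass
class PowerSample:
    """One metered interval (hypothetical structure); both values cover the same time window."""
    facility_kwh: float  # total energy drawn by the data centre
    it_kwh: float        # energy delivered to the IT equipment


def pue(facility_kwh: float, it_kwh: float) -> float:
    """PUE = total facility energy / IT energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return facility_kwh / it_kwh


def dcie_percent(facility_kwh: float, it_kwh: float) -> float:
    """DCiE = IT energy / total facility energy, as a percentage (the reciprocal of PUE)."""
    if facility_kwh <= 0:
        raise ValueError("facility energy must be positive")
    return 100.0 * it_kwh / facility_kwh


def annualised_pue(samples: list[PowerSample]) -> float:
    """Energy-weighted PUE over all samples: sum the energies first, then divide.

    This differs from averaging the individual snapshot PUEs whenever the
    IT load varies between intervals.
    """
    total_facility = sum(s.facility_kwh for s in samples)
    total_it = sum(s.it_kwh for s in samples)
    return pue(total_facility, total_it)
```

Snapshot values come from calling `pue` on a single interval; the annualised figure comes from passing a full year of samples to `annualised_pue`.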
Monitoring key infrastructure
• Cooling the data centre is the key overhead that is measured by PUE
  – But many do not continuously monitor the effectiveness of cooling equipment
  – Basic assumption: “if the air is cool enough, then the aircon is working…”
• But cooling infrastructure is generally depreciated over several years
  – Despite expensive support contracts, its efficiency may diminish significantly
  – Its efficiency may also be influenced by other changes in the data centre
  – When should cooling systems be replaced (OPEX vs. CAPEX)?
• Need to track fine-grained power utilisation to really understand issues
Environmental Monitoring
• There are significant opportunities for improvements in most data centres
  – The majority operate at temperatures more than 3-4 °C below the (old) ASHRAE recommendations (Paterson et al, 2009)
  – A 1 °C increase in temperature equates to a 2-4% reduction in energy (California Energy Commission, 2007; UK financial institution, 2011)
• It is critical to monitor temperature at as fine-grained a level as possible
  – To understand where hot-spots are, and how these change over time
  – To give ample warning of cooling failure with a smaller thermal ‘buffer’
  – Relating temperatures to energy use helps drive iterative improvement
• The more real-time measurements, the better
  – Ideally at the rack, sub-rack, server
  – ...or even processor level!
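The 2-4% per 1 °C figure above lends itself to a quick range estimate. The sketch below assumes the saving scales linearly with the temperature increase and applies to whatever baseline energy figure the rule was quoted against; both are simplifying assumptions, and `estimated_saving_kwh` is a hypothetical helper.

```python
def estimated_saving_kwh(baseline_kwh: float, delta_c: float,
                         low_rate: float = 0.02, high_rate: float = 0.04):
    """Return a (low, high) estimate of energy saved by raising the set-point.

    Assumes a linear 2-4% saving per degree Celsius (the range quoted on the
    slide) and caps the saving at 100% of the baseline.
    """
    if baseline_kwh < 0 or delta_c < 0:
        raise ValueError("baseline and temperature increase must be non-negative")
    low = baseline_kwh * min(1.0, low_rate * delta_c)
    high = baseline_kwh * min(1.0, high_rate * delta_c)
    return low, high
```

For example, a 3 °C increase against a 100,000 kWh baseline gives an estimated saving of between 6,000 and 12,000 kWh under these assumptions.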
Environmental Monitoring (cont…)
• Should monitor IT hardware (e.g. via IPMI) to fully optimise environmentals
  – To understand the effect of power used by (inefficient) server fans
  – To identify faulty equipment that we might be overcompensating for…
Driving End-User Behaviour
• With few exceptions, the most successful methodology for improving energy conservation across all sectors is:
  – Step 1: Identify who/what is responsible for significant energy waste
  – Step 2: Drive behaviour to ‘encourage’ change
• What is the implication for the Data Centre?
• Need to report (charge?) IT power by customer, department or end-user
  – Track energy (and energy efficiency) to the server, VM or even application level
  – Which users, applications or services are the worst offenders?
  – Management can use data to drive better practice
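Reporting IT power by customer or department can be sketched as a simple roll-up of per-server energy readings. The reading format, the `owner_of` mapping and the "unassigned" bucket below are illustrative assumptions; a real deployment would take these from its own metering and asset records.

```python
from collections import defaultdict


def energy_by_owner(readings, owner_of):
    """Sum per-server energy (kWh) by owner and rank the largest consumers first.

    readings: iterable of (server_id, kwh) tuples
    owner_of: dict mapping server_id to a customer, department or end-user
    Servers with no known owner are grouped under "unassigned".
    """
    totals = defaultdict(float)
    for server_id, kwh in readings:
        totals[owner_of.get(server_id, "unassigned")] += kwh
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The ranked output is the starting point for Step 1 above; the "unassigned" bucket is itself worth reporting, since energy with no owner cannot be charged to anyone.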
Next steps: DC design vs. operational efficiency
• Most new data centres are being designed against PUE targets
  – For a given IT hardware capacity, PUE is a good planning metric
  – However, it is often a poor operational metric
• Most importantly: what if the servers are not doing any useful work?
  – The data centre may still have a ‘good’ PUE, but it would be very inefficient by any business metric
• We really need to monitor IT utilisation:
  – Surveys imply that IT utilisation is between 5 and 10% for an un-virtualised DC, rising to between 10 and 20% for a fully virtualised DC
  – In a typical DC, 10% of running servers are not in use at all (Green Grid Survey, 2010)
‘ITUE’ – A better class of efficiency metrics?
• Some simple ITUE metrics may be derived, e.g.:
  – Normalised CPU utilisation/watt – for compute-bound tasks
  – IOPS/watt – when I/O is predominant
  – Bytes/watt – for network utilisation
  – All three!
• Some end-users may be interested in application-related metrics:
  – Database transactions/watt
  – Page refresh/watt
  – Search/watt
[Figure: chart of Compute, Network and Storage Utilisation Effectiveness on a scale of 0 to 1]
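A minimal sketch of how the simple per-watt metrics above could be computed from one set of measurements. The function and key names are hypothetical, and "bytes/watt" is interpreted here as network throughput in bytes per second divided by power, which is an assumption about the intended definition.

```python
def per_watt(work_rate: float, power_watts: float) -> float:
    """Divide a measured work rate by the power drawn while delivering it."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return work_rate / power_watts


def itue_metrics(cpu_utilisation: float, iops: float,
                 bytes_per_s: float, power_watts: float) -> dict:
    """Return the three simple per-watt metrics for one server or VM.

    cpu_utilisation is normalised to the range 0..1; iops and bytes_per_s are
    measured rates over the same interval as power_watts.
    """
    return {
        "cpu_util_per_watt": per_watt(cpu_utilisation, power_watts),
        "iops_per_watt": per_watt(iops, power_watts),
        "bytes_per_s_per_watt": per_watt(bytes_per_s, power_watts),
    }
```

The application-related metrics (transactions, page refreshes or searches per watt) follow the same pattern: pass the application's own rate to `per_watt`.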
Understanding IT utilisation
• Understanding IT utilisation and ITUE metrics can help reduce overall power utilisation very significantly
  – Remember that PUE is relative to IT power, so every watt saved in IT also saves its share of the facility overhead
• In particular, it can help us to identify:
  – Who is using the power they are assigned in an efficient way
  – Which servers/VMs/applications are delivering the best ‘value’
• ‘Sweating’ the IT assets may not be smart after all!
  – What is the efficiency of service delivery on individual platforms?
  – When do running costs exceed depreciation costs?
  – What replacement platform should be procured, etc.?
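The running-cost vs. depreciation question above can be sketched as simple arithmetic. The sketch assumes a constant average power draw, a constant PUE applied to the server's IT energy, a flat electricity price and straight-line depreciation; all names and figures are illustrative, not taken from the talk.

```python
HOURS_PER_YEAR = 8760  # non-leap year


def annual_running_cost(server_watts: float, pue: float, price_per_kwh: float) -> float:
    """Yearly energy cost of one server, including facility overhead via PUE."""
    it_kwh = server_watts / 1000.0 * HOURS_PER_YEAR
    return it_kwh * pue * price_per_kwh


def annual_depreciation(purchase_price: float, lifetime_years: float) -> float:
    """Straight-line depreciation per year (an assumed accounting model)."""
    if lifetime_years <= 0:
        raise ValueError("lifetime must be positive")
    return purchase_price / lifetime_years


def running_cost_exceeds_depreciation(server_watts, pue, price_per_kwh,
                                      purchase_price, lifetime_years) -> bool:
    """True when a year of energy costs more than a year of depreciation."""
    return (annual_running_cost(server_watts, pue, price_per_kwh)
            > annual_depreciation(purchase_price, lifetime_years))
```

With a 300 W server, a PUE of 1.9 and a price of 0.10 per kWh, the yearly energy cost is 499.32, which already exceeds the 400 per year depreciation of a 1,200 server written off over three years.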
Q: Isn’t Virtualisation the answer?
• A: It is (an important) part of the answer
• Typical human behaviour is:
  – A customer replaces a 3-year-old (then state-of-the-art) server with a new state-of-the-art server
  – He puts a number of VMs on his new (much faster) server rather than the single OS instance on his much slower server
  – He typically doubles his IT efficiency (from 10% to 20%)
• This demonstrates the need to spec new equipment based on historical application and user requirements
• As with hardware, some VMs may not be used at all over time…
Continuous Iterative Improvement
• Monitoring and reporting alone do not produce savings
• Use data to agree, plan and make iterative improvements:
  – E.g. make incremental changes to data centre environmentals; raise CRAC temperatures; find hotspots; move equipment; improve airflow
  – E.g. identify unused and underused servers and decommission them; identify servers that are not used at night, weekends etc. and employ active power management; define a virtualisation strategy based on real data, etc.
• This is not without its complexities
  – Requires cross-cultural change (IT, Facilities, Building Management)
  – Requires openness and end-user targeting (no-one is an angel…)
  – Requires detailed planning and (often) down-time
• Rewards can be significant, even by focussing on simple changes
  – >25% energy savings in 1st year?
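Identifying unused servers and servers that are idle at night can be sketched from utilisation samples alone. The sample format, the 2% idle threshold and the 00:00-06:00 night window are illustrative assumptions that a real site would tune to its own data.

```python
def classify_servers(samples, idle_threshold=0.02, night_hours=range(0, 6)):
    """Split servers into 'unused' and 'idle at night' candidates.

    samples: dict mapping server name to a list of (hour_of_day, cpu_utilisation)
             tuples, with cpu_utilisation normalised to 0..1
    Returns two sorted lists: servers idle in every sample (decommissioning
    candidates) and servers idle in every night-time sample but busy at other
    times (power-management candidates).
    """
    unused, idle_at_night = [], []
    for server, points in samples.items():
        if not points:
            continue  # no data: make no claim about this server
        if all(util <= idle_threshold for _, util in points):
            unused.append(server)
            continue
        night = [util for hour, util in points if hour in night_hours]
        if night and all(util <= idle_threshold for util in night):
            idle_at_night.append(server)
    return sorted(unused), sorted(idle_at_night)
```

The output is a list of candidates only; as noted above, acting on it still requires agreement with the owners and detailed planning.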
Conclusions
• Efficient DCs should monitor and manage both IT and Facilities systems in a coherent manner:
  – Environmental systems (temperature, humidity, air-conditioning...)
  – Power (at the distribution board, rack PDU and server PSU level...)
  – IT equipment (using standard protocols such as IPMI and SNMP...)
  – Operating systems and Virtual Machines (integrating with IT systems)
  – ...and perhaps applications themselves
• In the future, we will move to the autonomous data centre
  – Emphasis moves from monitoring to active management
  – Potential for very significant energy savings…