Data Centre Compute and Overhead Costs - Delivering End-to-end KPIs


Michael Rudgyard (CTO) - Concurrent Thinking Ltd

Published in: Technology

  1. Data Centre Compute and Overhead Costs - Delivering End-to-end KPIs
     Michael Rudgyard (CTO), Concurrent Thinking Ltd
  2. Our Background
     • Background in High Performance Computing & Scale-out Computing
       – Gives us a unique perspective on DCIM
     • Founded Concurrent Thinking in 2010
       – Focussed on tools for operational efficiency in the Data Centre
       – Exploit an existing & mature product that was originally developed for HPC
       – Investment from Carbon Trust Investments
       – Launched new products at DatacenterDynamics, Nov 2011
  3. Bridging the Divides – Facilities, IT & Management
     What constitutes an efficient data centre?
     • "It's all about virtualization"
     • "It's all about procurement"
     • "It's about staff efficiency"
     • "It's all about cooling"
  4. What we do: Data Centre Infrastructure Management
     • Continuous monitoring & active management of IT & Facilities systems
       – Building management systems
       – Environmental systems (temperature, humidity, air-conditioning...)
       – Power (at the distribution board, rack PDU and server PSU level...)
       – IT equipment (including server health)
       – Operating systems & Virtual Machines
       – Application Performance
     • We leverage standards-based protocols
       – OPC, Modbus, 1-wire, SNMP, IPMI, Intel Node Manager, WMI
     • ...and offer monitoring agents and extensible means to monitor non-standard M&E equipment
  5. Why Data Centre Infrastructure Management?
     • Aims
       – To truly understand where operational savings can be made
       – To understand how factors vary over time, with load, etc.
       – To give ample warning of potential (often critical) issues
       – To report factual information to management
       – To drive continuous iterative improvement over time
     • Real energy and productivity savings require a 'joined-up' approach
       – Managing buildings, data-centre facilities and IT in a unified manner
       – ...opening the door to the possibility of orchestration of the data centre
  6. Our Approach
     • We provide a tool that:
       – Tracks power to the server/network (and OS/VM/application) level
       – Allows for reporting by department, customer or end-user
       – Offers a simple interface to present data for different purposes
       – Has integrated IT asset management
       – Generates business intelligence on end-to-end service delivery
       – Is both user-extensible and built to scale (visually & architecturally)
  7. What are the important data centre metrics?
     • We don't push particular metrics (e.g. PUE, ITUE, ITEE, FVER...)
     • DCIM is a tool that should enable customers to define their own KPIs
     [Radar chart, scale 0 to 1, with axes: Compute Utilisation Effectiveness, Network Utilisation Effectiveness, Storage Utilisation Effectiveness]
  8. Example 1 – OS performance monitoring
     • Potential performance metrics:
       – CPU utilisation (* CPU benchmark) per watt
       – IOPS per watt
       – Bytes per watt
     • To produce these metrics we monitor:
       – OS metrics via SNMP (Linux/MS) or WMI (MS)
       – Server power usage (via a managed PDU or IPMI)
       – (CPU benchmark figure)
       – Power overhead for cooling, power distribution, etc. (with a share apportioned to this server)
       – Power cost (at different times)
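As a rough illustration of the first metric in Example 1, the sketch below computes "CPU utilisation × CPU benchmark per watt" for one server, including its share of the facility overhead. The slide does not say how overhead is apportioned; splitting it pro rata by IT power draw is an assumption made here, and the function names and all figures are hypothetical.

```python
def apportioned_overhead_w(server_w, total_it_w, total_overhead_w):
    """Overhead (cooling, power distribution) charged to one server.

    Assumption: overhead is shared in proportion to each server's IT power.
    """
    return total_overhead_w * server_w / total_it_w


def benchmark_work_per_watt(cpu_util, cpu_benchmark, server_w, overhead_w):
    """CPU utilisation (0..1) times a benchmark score, per watt of total power."""
    return cpu_util * cpu_benchmark / (server_w + overhead_w)


# Illustrative figures only: a 200 W server in a room drawing 10 kW of IT
# load and 5 kW of cooling and distribution overhead.
overhead = apportioned_overhead_w(server_w=200.0, total_it_w=10_000.0,
                                  total_overhead_w=5_000.0)
score = benchmark_work_per_watt(cpu_util=0.5, cpu_benchmark=1200.0,
                                server_w=200.0, overhead_w=overhead)
print(f"overhead share: {overhead:.0f} W, score: {score:.2f} per watt")
```

The same shape works for IOPS per watt or bytes per watt by swapping the numerator.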
  9. Example 2 – Microsoft Exchange
     • For a typical MS Exchange service, the most useful metrics might be:
       – Power usage per email (OPEX only)
       – Cost per email (OPEX or OPEX + CAPEX)
       – CO2 per email
     • WMI now provides the necessary application performance metrics
       – The number of email transactions
       – Server power usage (as above)
       – Power overhead for cooling, power distribution, etc. (as above)
       – Power cost (as above)
       – Asset depreciation model
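The per-email metrics in Example 2 reduce to simple arithmetic once the inputs are collected. The sketch below is one possible formulation, not the product's own: it assumes straight-line depreciation for the CAPEX term (the slide only says "asset depreciation model"), and the tariff, carbon factor, purchase price and message count are made-up figures.

```python
def straight_line_daily_capex(purchase_price, lifetime_years):
    """Daily depreciation charge, assuming straight-line depreciation."""
    return purchase_price / (lifetime_years * 365)


def per_email_kpis(emails, server_kwh, overhead_kwh,
                   price_per_kwh, kg_co2_per_kwh, capex=0.0):
    """Energy, cost and CO2 per email over one reporting period."""
    total_kwh = server_kwh + overhead_kwh
    return {
        "kwh_per_email": total_kwh / emails,
        "cost_per_email": (total_kwh * price_per_kwh + capex) / emails,
        "kg_co2_per_email": total_kwh * kg_co2_per_kwh / emails,
    }


# One day of a hypothetical Exchange server: 100,000 email transactions,
# 4 kWh at the server plus 2 kWh of apportioned overhead.
daily_capex = straight_line_daily_capex(purchase_price=2555.0, lifetime_years=5)
kpis = per_email_kpis(emails=100_000, server_kwh=4.0, overhead_kwh=2.0,
                      price_per_kwh=0.10, kg_co2_per_kwh=0.5,
                      capex=daily_capex)
print(kpis)
```

Passing `capex=0.0` gives the OPEX-only cost per email.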
  10. Example 3 – Linux MySQL Server
      • For a web service, the most useful metric might be:
        – Power per database query
        – Cost per database query
        – CO2 per database query
      • SNMP now provides the application performance information
  11. Example 4 – Linux Apache Web Server
      • For a web service, the most useful metric might be:
        – Power per HTML query
        – Cost per HTML query
        – CO2 per HTML query
      • Unfortunately, SNMP support for Apache is poor
        – Best option was to install the Apache 'status module'
        – Read the number of web transactions from the status module web page
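A minimal sketch of the Apache approach in Example 4: Apache's mod_status can emit a machine-readable page (the `?auto` variant) that includes a cumulative `Total Accesses` counter when extended status is enabled. The code below parses two such snapshots and divides the energy used in between by the number of requests served. Whether the original tool read this variant or the HTML page is not stated; the sample text, polling interval and power figure are invented for illustration, and fetching the page over HTTP is left out.

```python
def total_accesses(status_text):
    """Extract the cumulative request count from mod_status '?auto' output."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Total Accesses":
            return int(value)
    raise ValueError("'Total Accesses' not found; is extended status enabled?")


def joules_per_request(before_text, after_text, avg_power_w, interval_s):
    """Energy per request between two status snapshots."""
    requests = total_accesses(after_text) - total_accesses(before_text)
    if requests <= 0:
        raise ValueError("no requests served (or counter reset) in interval")
    return avg_power_w * interval_s / requests


# Two hypothetical snapshots taken 60 s apart on a server averaging 300 W
# (server power plus its apportioned overhead).
before = "Total Accesses: 1000\nTotal kBytes: 500\nUptime: 600\n"
after = "Total Accesses: 1600\nTotal kBytes: 800\nUptime: 660\n"
energy = joules_per_request(before, after, avg_power_w=300.0, interval_s=60.0)
print(f"{energy:.1f} J per request")
```

Because the counter is cumulative since server start, the difference between polls is what matters; a restart between polls shows up as a non-positive delta and is rejected.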
  12. Application performance on virtual machines
      • Assume a single application per virtual machine
      • Issue now is: what is the power used by a virtual machine?
      • Our solution: 'inferred metrics'
        – Use another metric (e.g. CPU utilisation) as a proxy for power usage
        – Attribute the power used by a server to individual VMs
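The 'inferred metrics' idea can be sketched as a proportional split: the measured power of the host is divided among its VMs according to each VM's share of CPU utilisation. This is only one way to do it, under the assumption that power scales linearly with CPU use; the VM names and figures are hypothetical.

```python
def apportion_power(server_w, vm_cpu_util):
    """Split a host's measured power across VMs by CPU utilisation share.

    Assumption: CPU utilisation is an adequate proxy for power draw.
    If no VM reports any CPU use, the power is split evenly.
    """
    total = sum(vm_cpu_util.values())
    if total == 0:
        even = server_w / len(vm_cpu_util)
        return {vm: even for vm in vm_cpu_util}
    return {vm: server_w * util / total for vm, util in vm_cpu_util.items()}


# A 300 W host running three VMs (CPU utilisation as a fraction of the host).
vm_power = apportion_power(300.0, {"vm-a": 0.6, "vm-b": 0.3, "vm-c": 0.1})
for vm, watts in vm_power.items():
    print(f"{vm}: {watts:.0f} W")
```

A purely proportional split also spreads the host's idle power by CPU share; a variant could divide idle power evenly and only the load-dependent part by utilisation.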
  13. Using this information (1)
      • Which servers are underused, inefficient, or should be virtualised?
      • Which servers are better at delivering a particular service?
        – Provides useful procurement information!
        – (or which application gives better performance on the same hardware?)
      • When should I retire old servers?
        – Sweating IT assets is often a very bad idea indeed!
  14. Using this information (2)
      • Which departments are using their IT resources wisely?
        – Define server groups and report by department
      • Charge departments for individual power usage
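Reporting and charging by department amounts to grouping per-server energy by a server-to-department mapping and applying a tariff. The sketch below assumes each server belongs to exactly one department and uses a single flat tariff; the server names, departments and price are hypothetical.

```python
from collections import defaultdict


def charge_by_department(server_kwh, server_dept, price_per_kwh):
    """Sum energy per department (server group) and convert it to a charge."""
    totals = defaultdict(float)
    for server, kwh in server_kwh.items():
        totals[server_dept[server]] += kwh
    return {dept: kwh * price_per_kwh for dept, kwh in totals.items()}


# Energy per server over a billing period, including apportioned overhead.
usage = {"web01": 10.0, "web02": 20.0, "db01": 30.0}
groups = {"web01": "marketing", "web02": "marketing", "db01": "finance"}
charges = charge_by_department(usage, groups, price_per_kwh=0.5)
print(charges)
```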
  15. Conclusions and open questions
      • It is straightforward to monitor many KPIs for a data centre
        – From PUE, to ITUE and "application utilisation efficiency"
        – Requires a proper monitoring & reporting tool, with inbuilt asset management
        – Requires power monitoring hardware (managed PDUs or modern servers)
        – Requires suitable configuration (relatively easy for small numbers of apps)
      • It is straightforward to apportion costs by rack, by server and by department (if application servers are not shared)
      • The ROI can be very significant
      • Can we monitor granular information by user at the app level?
        – Ongoing collaborations with University of Hertfordshire and Surrey University
        – Collaboration on HPC with HPC Wales and STFC Daresbury