Sustainable Data Centers: HP Labs

    Notes on slide 1

    The slide has all the text

    Let us consider one example of flexible building blocks. We will focus on server design for the cloud market. This picture shows where the total cost of ownership is spent in current servers. (We assume infrastructure operational costs and a three-year depreciation cycle; people costs are not considered, given the market.) The right focuses on hardware costs and the left focuses on *burdened* power and cooling costs. Each portion is further broken down into its subcomponents, such as CPU and memory. (Pause) There is a lot of interesting data in this slide, but I want to highlight two key insights. First, the cost of power and cooling is equal to the cost of hardware (similar to what others, including Google, have been saying). Second, and more importantly, there is no single large dominating component that can give us a factor of 2X or 5X. We need a *holistic* solution!

    This picture shows the holistic solution that we designed. We call our approach microblades and megaservers. There are three elements to our approach. The first one (point to figure) is a redesign of the server platform to be extremely disaggregated and efficient. In this particular case, we replace the classical server processor with a more power-efficient, more cost-efficient embedded processor from the high-volume mobile market (for example, an Atom processor or a mobile Core 2 Duo). The intuition is that many of these processors have significantly lower energy (4X to 10X lower) but do not have correspondingly lower performance (only 1.5X to 2X lower). Given the throughput-centric nature of the workloads, we can use more of these cheaper, lower-power systems to get the same performance at lower power. We also simplify the rest of the components on the server as much as we can. The second element of our design is co-designed packaging and system architecture; we do two things. The inset shows the first idea: we use naked packages in each microblade, put a sheet of aluminum in between the microblades (kind of like a sandwich), *aggregate* all the heat and have one single heat sink at the top. This can improve cooling efficiency by about 25%. The second idea is shown in the blue and red arrows that show airflow in the figure. We have designed a new kind of blade enclosure with a separate plenum that directs air in a more efficient manner (vertically instead of horizontally, among other things), which can further improve cooling efficiency by another 25% or more. The third element of our design is *memory disaggregation*. Just as we can think of storage area networks, what we have is memory across the network and a novel memory hierarchy design that provides support for a remote level of memory, including optimizations to dynamically “swap in” content as appropriate. This design also enables us to inject currently more costly technologies like non-volatile memory (or, in the future, memristors) without compromising commodity cost structures. Note that these three design aspects were *co-designed* with each other, going back to the holistic design motivation in the previous slide.
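    The processor intuition in these notes can be checked with simple arithmetic. The sketch below is only illustrative: it treats the quoted "4X to 10X lower energy" as a reduction in power draw, assumes throughput scales linearly with the number of nodes, and uses function names that are made up for this example.

```python
# Back-of-the-envelope check of the "more, smaller processors" intuition.
# Assumptions (not from the talk): "energy" is read as power draw, and
# throughput scales linearly with the number of nodes.

def perf_per_watt_gain(power_reduction: float, perf_reduction: float) -> float:
    """Performance-per-watt of the embedded part relative to the server part."""
    return power_reduction / perf_reduction


def relative_power_at_equal_throughput(power_reduction: float,
                                       perf_reduction: float) -> float:
    """Total power needed to match the server's throughput, as a fraction
    of the server's power, when using perf_reduction times as many nodes."""
    nodes_needed = perf_reduction
    return nodes_needed / power_reduction


# Ranges quoted in the notes: 4X to 10X lower power, 1.5X to 2X lower performance.
worst_case = perf_per_watt_gain(power_reduction=4.0, perf_reduction=2.0)
best_case = perf_per_watt_gain(power_reduction=10.0, perf_reduction=1.5)

print(f"performance per watt improves by {worst_case:.1f}X to {best_case:.1f}X")
print("power at equal throughput (worst case): "
      f"{relative_power_at_equal_throughput(4.0, 2.0):.0%} of the server part")
```

    Even the least favorable pairing of the quoted ranges leaves a gain in performance per watt, which is the point the notes make about throughput-centric workloads.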

    So how did we do? The figure shows the performance-per-dollar improvement for our design compared to current state-of-the-art servers. The x axis shows different workloads common to the cloud market: websearch, webmail (think Hotmail), video hosting (think YouTube), and MapReduce running different kinds of data processing; the last bar is the harmonic mean. We can see that this design does really well. On average we get a factor of 2 improvement. That is, for the same cost, we can double the performance. Interestingly, in some cases we do even better: 400 to 600% better performance for the same cost! While I focused on performance per TCO dollar (total cost of ownership) here, we get similar results if you look at performance per watt as well. We can see that this notion of building flexible building blocks from a holistic point of view really works.
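    The last bar in the figure is a harmonic mean, which is the usual way to average ratios such as performance per dollar. The sketch below shows the calculation; the per-workload factors are made-up placeholders, not the values from the slide.

```python
# Harmonic mean of per-workload improvement factors.
# The numbers in `improvements` are hypothetical placeholders chosen only
# to show the calculation; they are NOT the results reported in the talk.

def harmonic_mean(values):
    """Harmonic mean of strictly positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs a non-empty list of positive values")
    return len(values) / sum(1.0 / v for v in values)


improvements = {
    "websearch": 1.5,       # hypothetical
    "webmail": 2.0,         # hypothetical
    "video hosting": 6.0,   # hypothetical
    "mapreduce": 4.0,       # hypothetical
}

hmean = harmonic_mean(improvements.values())
amean = sum(improvements.values()) / len(improvements)
print(f"harmonic mean:   {hmean:.2f}X")
print(f"arithmetic mean: {amean:.2f}X")
```

    The harmonic mean is pulled toward the smaller factors, so a few very large wins cannot dominate the summary bar the way they would with an arithmetic mean.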

    Application of sustainable data center principles to provision compute, power and cooling resources based on the needs of the user can dramatically reduce the cost of ownership. As an illustration, HP consolidated 14 laboratory data centers into one large site in Bangalore, India, and applied a dynamic control system that adjusts the utilization of the air conditioning system based on 7,500 temperature sensors deployed throughout the data center. The 40 percent savings in cooling power consumption achieved in this facility translates to annual savings of approximately $1.2 million (U.S.) compared to the conventional approach. In addition, with 7,500 sensors gathering data every 10 seconds, we have built an enormous testbed for our knowledge discovery research. Chandrakant can provide some anecdotes that can serve as good examples if you like, e.g. “reverse outsourcing” to India by analyzing data through KD techniques in the USA.
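    To give a feel for the scale of the numbers quoted above, the following sketch works through the arithmetic. The implied conventional cooling cost is an inference, assuming cost scales directly with cooling power; it is not a figure stated in the talk.

```python
# Arithmetic on the figures quoted in the notes for the Bangalore site.

SENSORS = 7_500
SAMPLE_PERIOD_S = 10
SECONDS_PER_DAY = 24 * 60 * 60

# Volume of sensor data available to the knowledge discovery research.
readings_per_day = SENSORS * SECONDS_PER_DAY // SAMPLE_PERIOD_S

# Assumption: annual cooling cost is proportional to cooling power, so a
# 40% saving worth about $1.2M implies the conventional baseline below.
SAVINGS_FRACTION = 0.40
ANNUAL_SAVINGS_USD = 1_200_000
implied_conventional_cost = ANNUAL_SAVINGS_USD / SAVINGS_FRACTION
implied_cost_with_control = implied_conventional_cost - ANNUAL_SAVINGS_USD

print(f"sensor readings per day: {readings_per_day:,}")
print(f"implied conventional cooling cost: ${implied_conventional_cost:,.0f}/year")
print(f"implied cost with dynamic control: ${implied_cost_with_control:,.0f}/year")
```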

    How do we solve this problem? We call this the “no power struggles” architecture. Like we mentioned earlier, the first part is a coordinated cross-layer monitoring framework. In addition to the pervasive sensing, a key challenge is to communicate the information across layers. For example, the virtual machine layer needs to know what the hardware is doing, and the lowest hardware controller needs to know about SLAs. And we need to provide a way to communicate this without resorting to global information sharing or proprietary communication protocols, which don’t scale or which add to costs. The solution we have is something that we call m-channels and m-brokers. This figure shows an example of this architecture. You can see the platform elements at the bottom and the virtualization elements at the top. We assume two key elements, shown in the shaded areas. The m-channel abstractions provide a registry and proxy service for different sensing agents and management agents to register and communicate information across layers. We leverage industry-standard protocols like CIM (the Common Information Model) to be able to do that. The second key element is the m-brokers, which provide the mechanisms for coordinating the actuation. There are a lot of hard problems to be solved in the design of m-channels and m-brokers, including unified namespaces, multi-agent coordination, and doing all of that at scale. I am not going to have time to talk about these, but the paper has more details.
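    As a rough illustration of the registry-and-proxy idea, here is a minimal in-process sketch. The class names, method names, topic strings and the "most restrictive cap wins" policy are all hypothetical, chosen for this example; the notes do not describe the actual m-channel or m-broker interfaces, and a real system would build on standards such as CIM.

```python
# Hypothetical sketch of an m-channel (registry/proxy for cross-layer
# information) and an m-broker (coordination of actuation requests).
# All names and policies here are assumptions made for illustration.
from collections import defaultdict
from typing import Callable, Dict, List, Optional


class MChannel:
    """Agents register interest in a topic; publishers do not need to
    know which layer, if any, is listening."""

    def __init__(self) -> None:
        self._listeners: Dict[str, List[Callable[[str, float], None]]] = defaultdict(list)

    def register(self, topic: str, callback: Callable[[str, float], None]) -> None:
        self._listeners[topic].append(callback)

    def publish(self, topic: str, value: float) -> None:
        for callback in self._listeners.get(topic, []):
            callback(topic, value)


class MBroker:
    """Collects power-cap requests from several management agents and
    resolves them with one simple policy: the most restrictive cap wins."""

    def __init__(self) -> None:
        self._requests: Dict[str, Dict[str, float]] = defaultdict(dict)

    def request_cap(self, target: str, agent: str, watts: float) -> None:
        self._requests[target][agent] = watts

    def resolve(self, target: str) -> Optional[float]:
        requests = self._requests.get(target)
        return min(requests.values()) if requests else None


# Usage: the VM layer listens to a hardware power sensor, and two
# controllers ask for different caps on the same server.
channel = MChannel()
seen_by_vm_layer = []
channel.register("server1/power_watts", lambda t, v: seen_by_vm_layer.append((t, v)))
channel.publish("server1/power_watts", 250.0)

broker = MBroker()
broker.request_cap("server1", agent="enclosure_capper", watts=300.0)
broker.request_cap("server1", agent="group_capper", watts=260.0)
print(seen_by_vm_layer, broker.resolve("server1"))
```

    The point of the sketch is the decoupling: the sensor never addresses the VM layer directly, and the two cappers never talk to each other, yet a single consistent cap reaches the server.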

    Once we have the basic architecture, we need a coordination policy solution. The complicated (colorful) figure summarizes our approach. The list on the right indicates the various controllers that we coordinate (using the m-broker abstractions from the last slide). We have the efficiency controller (think of it as classic voltage/frequency scaling), a local power capper (that caps power at a certain point), an enclosure capper and a group capper that do the same thing at different levels, and then a virtual machine controller that migrates virtual machines and turns machines off. One of the advantages of the interdisciplinary people we have in HP Labs is that we were able to walk to the next cube and talk to the control theory folks, and you can see that at the core (in light blue) is a classical feedback controller. All we have done is to redefine the computer as a container that dynamically gets resized when the container is not utilized (in this case, CPU utilization is the utilization measurement and the resizing knob is voltage and frequency scaling). The local capper in red is now implemented as a controller that changes the preference on “how much we would like the container to be full”. The elegance of this approach is that we minimize the communication between the two controllers and essentially overload classical control theory terms to provide the coordination. Other controllers interface in the same way; you can see that the VMC changes the input to the controllers, for example. The advantage is that the same property of control theory that enables it to respond to dynamism and changing behavior of the workload now lets it learn about and respond to policies and actuations by other controllers! In addition, using control theory allows us to formally reason about stability. When you are talking to customers and you can give them a mathematical proof that the datacenter will not explode when they have 13 different controllers, you usually give them more confidence!
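    The nesting described above, where the capper only changes the utilization target that the efficiency controller tracks, can be sketched as a small simulation. This is a toy model under stated assumptions: the power model, the gains `lam` and `beta`, and all constants are invented for illustration and are not the controllers or parameters from the paper.

```python
# Toy simulation of two coordinated controllers.
# Assumed (not from the talk): the power model, gains and constants.

P_IDLE = 100.0           # watts drawn at idle (assumed)
P_DYN = 200.0            # extra watts at full speed and full load (assumed)
F_MIN, F_MAX = 0.3, 1.0  # normalized frequency range (assumed)


def power(freq: float, demand: float) -> float:
    """Toy power model: the busy fraction saturates at 1 and dynamic
    power grows with the square of frequency."""
    busy = min(1.0, demand / freq)
    return P_IDLE + P_DYN * busy * freq ** 2


def simulate(demand: float, cap_watts: float, steps: int = 300,
             r_base: float = 0.8, lam: float = 0.3, beta: float = 0.002):
    """Return (frequency, utilization target, watts) after `steps` steps."""
    freq, r_ref = F_MAX, r_base
    for _ in range(steps):
        util = demand / freq   # above 1.0 means demand exceeds capacity
        watts = power(freq, demand)
        # Efficiency controller: shrink the "container" (lower frequency)
        # when utilization is below the target, grow it when above.
        freq = min(F_MAX, max(F_MIN, freq - lam * (r_ref - util)))
        # Local capper: never touches frequency directly. It only raises
        # the utilization target while power is over the cap, and relaxes
        # it back toward the baseline target otherwise.
        r_ref = max(r_base, r_ref + beta * (watts - cap_watts))
    return freq, r_ref, power(freq, demand)


uncapped = simulate(demand=0.6, cap_watts=1e9)
capped = simulate(demand=0.6, cap_watts=170.0)
print("uncapped: freq=%.3f target=%.3f watts=%.1f" % uncapped)
print("capped:   freq=%.3f target=%.3f watts=%.1f" % capped)
```

    With no effective cap, the efficiency controller alone settles at the baseline utilization target. With a cap, the capper pushes the target up until power meets the cap, and the only value shared between the two loops is that target.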

    More generally, we evaluated our approach on 8 different customer traces, ranging from Web 2.0 customers to pharmaceutical, finance, etc. You can see three numbers: average power (how much of my electricity bill do I save, i.e. op-ex), peak power (how much of my peak provisioning can I reduce, i.e. capital expenditure), and performance. Red is the baseline and green is our new solution. We can see that green does really well: a staggering 65% reduction in electricity costs and a 20% reduction in capex, and most importantly with negligible changes to performance. This really illustrates the value of the coordination architecture in improving efficiency while providing stability and correctness.

    One last advantage of such a coordination architecture: when we start thinking of the solution holistically, we get a chance to rethink individual controllers. Do we *really* need all the controllers? Do we *really* need all the complexity in the policies for the individual controllers? Or can we change them to be simpler when we know that we have another controller to back them up? The paper talks about several interesting insights that we found, but here we show one example. As before, we have average power, peak power and performance. We have the red baseline and the green proposed solution, but we now have two more bars, a blue and a yellow bar, corresponding to different permutations of controllers in the overall coordinated architecture. What we can see here (looking at the yellow) is that, in this particular case for the systems and workloads considered, a solution that emphasizes the virtual machine controller can achieve most of the benefits of the overall solution, at the expense of deemphasizing some of the local control. More generally, such a coordination architecture allows us to get deep insights into the assignment of functionality at different layers of the stack. Going back to the original picture with the multiple verticals and horizontals, it once again motivates why we want to look at this problem holistically!

    In closing…

    Sustainable Data Centers: HP Labs - Presentation Transcript

    1. Sustainable Data Centers: Enabled by Supply and Demand Side Management
      Prith Banerjee
      Senior Vice President of Research and
      Director, Hewlett Packard Laboratories
      (Co-authors: Cullen Bash, Chandrakant Patel, Partha Ranganathan)
    2. HP Labs Research Areas
      DAC2009-50-2
      The next technology challenges and opportunities
      Digital Commercial Print
      Intelligent Infrastructure
      Content Transformation
      Sustainability
      Immersive Interaction
      Cloud
      Analytics
      Information Management
    3. Industry Challenge
      Create technologies, IT infrastructure and business models for the low-carbon economy
      IT must play a central role in addressing the global sustainability challenge.
      Total carbon emissions: IT industry 2%, the rest of the global economy 98%
      • As much as the aviation industry
      • Projected to double by 2020
      • IT can play a role in reducing this impact
      • To do so, IT solutions must take a lifecycle perspective
    4. Role of the IT Ecosystem: Data Centers at the Hub
      Sustainable Data Centers enabled by supply and demand side management of power, cooling and IT resources
    5. Supply and Demand Side Management
      Supply Side:
      Design of physical infrastructure with focus on lifecycle engineering and management, and the available energy required to extract, manufacture, operate and reclaim components;
      Utilization of local resources to minimize destruction of available energy in transmission, and construction of transmission infrastructure.
      Demand Side:
      Provisioning data center resources based on the needs and service level agreement of the user through use of flexible building blocks, pervasive sensing, knowledge discovery and policy based control
    6. Sustainable Data Center: Key Elements
      IT:SW
      IT:HW
      Power
      Cooling
      Autonomous Control
      Knowledge Discovery & Visualization
      Pervasive Cross-layer Sensing
      Flexible, Efficient, & Configurable Building Blocks
      Data Center Scale Lifecycle Design
      extraction
      operation
      manufacturing
      End of Life
    7. Lifecycle Design
      extraction
      operation
      manufacturing
      End of Life
    8. Lifecycle Design through Data Center Synthesis
      Automate design of datacenters based on lifecycle considerations
      Synthesis Process Flow
    9. Flexible Building Blocks
      From chips, to servers to data centers
      [Figure labels: Cooling Grid; Qdata center + ∑W; Power Grid: Wensemble; Wblower; Wpump; Wcompressor; Ground Coupled Loop; Qsystem; Wsystem; Qchip; Wchip; Outside Air]
    10. Microblades and Megaservers
      The Inefficiencies in the Cloud
      Hardware
      Power & cooling
    11. Disaggregation
      Efficient building blocks
    12. 2X performance/$
    13. Pervasive Sensing: IT & Facilities
      Cooling Infrastructure
      IT Hardware
      External Environment
    14. Knowledge Discovery in the IT ecosystem
      minimizing material and minimizing energy
      PCA-based Anomaly detection
      Mobile-enabled diagnostics
      Collaborative Fault Analytics
      Data Center Room Infrastructure
      Power Micro-Grid Infrastructure
      Cooling Grid Infrastructure
      Client Infrastructure
      Data Aggregation Pathways
      Event/Episode Detection
      Model Creation
      Causality
      Pre-Process
      Raw Data
      Useful Knowledge
      Visualization
      Expedient Assessment
      Early Warning System
      Lifetime Estimation
      Performance Metrics
      Efficiency Metrics
      Expert Systems
    15. Data center cooling infrastructure
      Example: Chiller Unit Ensemble
      Clustering
      [Figure labels: Air Mixture In; Air Mixture Out; Warm Water; Cooling Tower loop; Wp; QCond; Makeup Water; Return Water; Chiller Refrigerant loop; Wcomp; QEvap; Chilled Water loop; Data Center CRAC units]
    16. Active Control of Cooling Resources
      Conventional Mode
      With active control using rack inlet temperature
      35% Energy Savings
      Improved reliability
    17. Dynamic Workload Placement based on Cooling Efficiency
      Premise:
      • Hotspots exist that impact efficiency
      • Use LWPI to place workload in more efficient locations
      • Test using batch loads in real data center
      Local Workload Placement Index (LWPI):
      \[ \mathrm{LWPI} = \frac{(\text{Thermal Management Margin})_j + (\text{AC Margin})_j}{(\text{Hot Air Recirculation})_j} = \frac{(T_{set} - T_{in})_i + [(T_{SAT} - T_{SAT,min})_j \, TCI_j]_i}{(T_{in} - T_{SAT})_i} \]
      [Figure: batch load for tests, shown at 8am, 10am and 3pm, with labeled values LWPI = 0.33, LWPI = 3.8, LWPI = 4.9]
      Results
      • 32% Energy savings over random job placement
      • $1M/year in savings (for large DCs)
      • Improved thermal reliability of IT equipment
      • Increased uptime
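    The Local Workload Placement Index on this slide can be evaluated numerically once temperatures are known. The sketch below is one hedged reading of the slide: it assumes the AC margin is summed over the CRAC units that influence a location, weighted by a thermal correlation index (TCI), and that the denominator uses a single supply air temperature per location. The function name, parameter names and sample temperatures are all made up for illustration.

```python
# Hedged sketch of an LWPI calculation and a placement decision.
# Assumptions: AC margin summed over CRAC units, one supply temperature
# per location in the denominator, and invented sample values.
from typing import Iterable, Tuple


def lwpi(t_set: float, t_in: float, t_sat: float,
         crac_terms: Iterable[Tuple[float, float, float]]) -> float:
    """(thermal management margin + AC margin) / hot air recirculation.

    crac_terms holds (t_sat_j, t_sat_min_j, tci_j) per CRAC unit.
    """
    thermal_margin = t_set - t_in
    ac_margin = sum((sat - sat_min) * tci for sat, sat_min, tci in crac_terms)
    recirculation = t_in - t_sat
    if recirculation <= 0:
        raise ValueError("inlet temperature must exceed supply air temperature")
    return (thermal_margin + ac_margin) / recirculation


# Two hypothetical rack locations, temperatures in degrees Celsius.
locations = {
    "rack A": lwpi(t_set=30.0, t_in=28.0, t_sat=18.0, crac_terms=[(18.0, 15.0, 0.2)]),
    "rack B": lwpi(t_set=30.0, t_in=20.0, t_sat=18.0, crac_terms=[(18.0, 15.0, 0.9)]),
}

# Larger margins and less recirculation give a higher index, so the
# workload goes to the location with the highest LWPI.
best = max(locations, key=locations.get)
print(locations, "->", best)
```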
    18. Demonstration at Scale
      • Software Operations, Bangalore
      • Consolidation of 14 lab data centers
      5 floors @14k sq. ft.
      900kW cooling per floor
      IT Building Blocks
      • Servers
        • Non-Stop servers
        • Proliant servers
        • Blade servers
      • Custom Enclosures
      • Storage (XP/EVA)
      • Multiple Network topologies
      Facility Building Blocks
      • Sensor Network
        • 7500 sensors
      • Chillers
        • 3 air-cooled
        • 2 water-cooled
      • Pumps
        • 7 Primary
        • 5 Secondary
      • CRAC units
        • 55 units
      • Diesel Generators
        • 5 3MW units
      • Need based provisioning of compute, power and cooling resources based on available energy consumed (supply side)
      • Dynamic cooling control implemented
      • Data Analysis, Visualization and Knowledge Discovery to detect anomalies, improve reliability and minimize redundancy
      40% reduction in AHU power
      20% reduction in Infrastructure Power
      7,500 tons of CO2 prevented annually
      27 July 2009
    19. Power Struggles!
      CHAOS!! (“Power” Struggle)
      [Figure labels: Rack; heterogeneity; VM; Enclosure IAM; Server iLO; CPU; VM-res.all; Peak thermal power; Peak electrical power; OS-wlm; Average power; Vmotion; OS-gwlm; Local optima; LSF; SIM; global optima; performance]
    20. Cross Layer Monitoring and Management Framework
    21. [Figure only]
    22. It works!
    23. It works well!
      65% savings (OpEx)
      20% savings (CapEx)
      Similar performance
    24. Other interesting insights…
      No VMC
      VMC only
      Unified
    25. Conclusions
      • Environmental impact of IT is a growing worldwide concern
      • Governments are beginning to take notice and regulations are increasing
      • Management of available energy is required to run a cost-effective operation
      • An integrated, life-cycle approach to data center design and management is necessary to improve efficiency and reduce impact.
      • Demonstrated results with economic payback

    Hewlett-Packard, 4 months ago
