1. Rethinking data centers
   Chuck Thacker, Microsoft Research, October 2007
   "There is nothing more difficult to take in hand, more perilous to conduct, or more uncertain in its success, than to take the lead in the introduction of a new order of things." – Niccolò Machiavelli
2. Problems
   • Today, we need a new order of things.
   • Problem: today's data centers aren't designed as systems. We need to apply systems engineering to:
     ◦ Packaging
     ◦ Power distribution
     ◦ The computers themselves
     ◦ Management
     ◦ Networking
   • I'll discuss all of these.
   • Note: data centers are not mini-Internets.
3. System design: Packaging
   • We build large buildings, load them up with lots of expensive power and cooling, and only then start installing computers.
   • We pretty much need to design for the peak load (power distribution, cooling, and networking).
   • We build them in remote areas.
     ◦ Near hydro dams, but not near construction workers.
   • They must be human-friendly, since we have to tinker with them a lot.
4. Packaging: Another way
   • Sun's Black Box.
   • Use a shipping container – and build a parking lot instead of a building.
   • Doesn't need to be human-friendly.
   • Assembled at one location, computers and all. A global shipping infrastructure already exists.
   • Sun's version uses a 20-foot box; 40 feet might be better. (They've got other problems, too.)
   • Requires only networking, power, and cooled water, so build near a river. Needs a pump, but no chiller (might have environmental problems).
   • Expand as needed, in sensible increments.
   • Rackable has a similar system. So (rumored) does Google.
5. The Black Box
6. Sun's problems
   • Servers are off-the-shelf (with fans).
   • They go into the racks sideways, since they're designed for front-to-back airflow.
   • Servers have a lot of packaging (later slide).
   • Cables exit front and back (mostly back).
   • A rack must be extracted to replace a server.
     ◦ They have a tool, but the procedure is complex.
   • The rack as a whole is supported on springs.
   • Fixing these problems: later slides.
7. Packaging
   • A 40-foot container holds two rows of 16 racks.
   • Each rack holds 40 "1U" servers, plus a network switch. Total per container: 1,280 servers.
   • If each server draws 200 W, the rack is 8 kW and the container is 256 kW.
   • A 64-container data center is 16 MW, plus inefficiency and cooling: 82K computers.
   • Each container has independent fire suppression, which reduces insurance cost.
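The totals above follow directly from the per-rack figures; a quick back-of-the-envelope check (using only numbers from the slide):

```python
# Sanity check of the packaging arithmetic on the slide.

SERVERS_PER_RACK = 40
RACKS_PER_CONTAINER = 2 * 16      # two rows of 16 racks
WATTS_PER_SERVER = 200
CONTAINERS = 64

servers_per_container = SERVERS_PER_RACK * RACKS_PER_CONTAINER
rack_kw = SERVERS_PER_RACK * WATTS_PER_SERVER / 1000
container_kw = servers_per_container * WATTS_PER_SERVER / 1000
total_servers = servers_per_container * CONTAINERS
total_mw = container_kw * CONTAINERS / 1000

print(servers_per_container)  # 1280
print(rack_kw)                # 8.0  (kW per rack)
print(container_kw)           # 256.0  (kW per container)
print(total_servers)          # 81920  (~82K computers)
print(total_mw)               # 16.384 (~16 MW, before inefficiency and cooling)
```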
8. System design: Power distribution
   • The current picture:
     ◦ Servers run on 110/220 V, 60 Hz, single-phase AC.
     ◦ Grid -> transformer -> transfer switch -> servers.
     ◦ The transfer switch connects batteries -> UPS / diesel generator.
     ◦ Lots and lots of batteries. Batteries require safety systems (acid, hydrogen).
     ◦ The whole thing nets out at ~40–50% efficient => large power bills.
9. Power: Another way
   • Servers run on 12 V, 400 Hz, three-phase AC.
     ◦ Synchronous rectifiers on the motherboard make DC (with little filtering).
   • Each rack contains a step-down transformer. The distribution voltage is 480 VAC.
   • A diesel generator supplies backup power.
   • A flywheel stores energy until the generator comes up.
     ◦ When engaged, the frequency and voltage drop slightly, but nobody cares.
   • Distribution chain:
     ◦ Grid -> 60 Hz motor -> 400 Hz generator -> transformer -> servers.
     ◦ Much more efficient (probably > 85%).
   • The Cray-1 (300 kW, 1976) was powered this way.
   • Extremely high reliability.
10. The computers
   • We currently use commodity servers designed by HP, Rackable, and others.
   • Higher quality and reliability than the average PC, but they still operate in the PC ecosystem.
     ◦ IBM doesn't.
   • Why not roll our own?
11. Designing our own
   • Minimize SKUs:
     ◦ One for compute: lots of CPU, lots of memory, relatively few disks.
     ◦ One for storage: modest CPU and memory, lots of disks.
       ▪ Like Petabyte?
     ◦ Maybe one for DB apps. Properties TBD.
     ◦ Maybe flash memory has a home here.
   • What do they look like?
12. More on computers
   • Use custom motherboards. We design them, optimizing for our needs, not the needs of disparate applications.
   • Use commodity disks.
   • All cabling exits at the front of the rack.
     ◦ So that it's easy to extract a server from the rack.
   • Use no power supplies other than on-board inverters (we have these now).
   • Error-correct everything possible, and check what we can't correct.
   • When a component doesn't work to its spec, fix it or get another, rather than just turning the "feature" off.
   • For the storage SKU, don't necessarily use Intel/AMD processors.
     ◦ Although both seem to be beginning to understand that low power/energy efficiency is important.
13. Management
   • We're doing pretty well here.
     ◦ Automatic management is getting some traction; opex isn't overwhelming.
   • We could probably do better:
     ◦ Better management of servers, to provide finer-grained control than just reboot and reimage.
     ◦ Better sensors to diagnose problems and predict failures.
     ◦ Measuring airflow is very important (not just finding out whether the fan blades turn, since there are no fans in the server).
     ◦ Measure more temperatures than just the CPU's.
   • Modeling to predict failures?
     ◦ Machine learning is your friend.
14. Networking
   • A large part of the total cost.
   • Large routers are very expensive, and command astronomical margins.
   • They are relatively unreliable – we sometimes see correlated failures in replicated units.
   • Router software is large, old, and incomprehensible – frequently the cause of problems.
   • They are serviced by the manufacturer, who is never on site.
   • By designing our own, we save money and improve reliability.
   • We also get exactly what we want, rather than paying for a lot of features we don't need (data centers aren't mini-Internets).
15. Data center network differences
   • We know (and define) the topology.
   • The number of nodes is limited.
   • Broadcast/multicast is not required.
   • Security is simpler.
     ◦ No malicious attacks, just buggy software and misconfiguration.
   • We can distribute applications to servers to distribute load and minimize hot spots.
16. Data center network design goals
   • Reduce the need for large switches in the core.
   • Simplify the software. Push complexity to the edge of the network.
   • Improve reliability.
   • Reduce capital and operating cost.
17. One approach to data center networks
   • Use a 64-node destination-addressed hypercube at the core.
   • Within the containers, use source-routed trees.
   • Use standard link technology, but we don't need standard protocols.
     ◦ Mix of optical and copper cables: short runs are copper, long runs are optical.
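The source-routed trees keep the container switches trivial: the sender computes the whole path and each switch just follows (and consumes) the next port number in the packet header. A minimal sketch of that idea – all names here are illustrative, not from the talk:

```python
# Source routing sketch: the packet header carries the full list of output
# ports; each switch forwards on the head of the list and strips it, so
# switches need no routing tables or topology knowledge.

def forward(packet):
    """Return (output port for this hop, packet to send to the next hop)."""
    route = packet["route"]                 # head of list = this hop's port
    out_port, rest = route[0], route[1:]
    return out_port, {"route": rest, "payload": packet["payload"]}

# The sender pre-computes the port list for the whole path through the tree.
pkt = {"route": [3, 1, 7], "payload": b"hello"}

ports_taken = []
while pkt["route"]:
    port, pkt = forward(pkt)
    ports_taken.append(port)

print(ports_taken)  # [3, 1, 7]
```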
18. Basic switches
   • Type 1: forty 1 Gb/s ports, plus two 10 Gb/s ports.
     ◦ These are the leaf switches. The center uses 2,048 of them, one per rack.
     ◦ Not replicated (if a switch fails, we lose 1/2048 of total capacity).
   • Type 2: ten 10 Gb/s ports.
     ◦ These are the interior switches.
     ◦ There are two variants, one for destination routing and one for tree routing. The hardware is identical; only the FPGA bit streams differ.
     ◦ The hypercube uses 64 switches; the containers use 10 each. 704 switches total.
   • Both are implemented with Xilinx Virtex-5 FPGAs.
     ◦ Type 1 also uses gigabit Ethernet PHY chips.
     ◦ The FPGAs contain 10 Gb PHYs, plus buffer memories and logic.
   • We can prototype both types with BEE3.
     ◦ Real switches would use less expensive FPGAs.
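The switch counts above are consistent with the packaging slide (32 racks per container, 64 containers):

```python
# Check of the switch counts: one Type 1 leaf switch per rack, ten Type 2
# switches per container tree, plus the 64-switch hypercube core.

CONTAINERS = 64
RACKS_PER_CONTAINER = 32

leaf_switches = CONTAINERS * RACKS_PER_CONTAINER   # Type 1, one per rack
interior_switches = CONTAINERS * 10 + 64           # container trees + hypercube

print(leaf_switches)      # 2048
print(interior_switches)  # 704
```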
19. Data center network
20. Hypercube properties
   • Minimum hop count.
   • Even load distribution for all-to-all communication.
   • Reasonable bisection bandwidth.
   • Can route around switch/link failures.
   • Simple routing:
     ◦ Outport = f(Dest xor NodeNum)
     ◦ No routing tables.
   • All switches are identical.
     ◦ Use destination-based input buffering, since a given link carries packets for a limited number of destinations.
   • Link-by-link flow control to eliminate drops.
     ◦ Links are short enough that this doesn't cause a buffering problem.
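The slide leaves f unspecified; a common choice (assumed here, not stated in the deck) is "correct the lowest differing address bit", which makes the output port the index of the lowest set bit of Dest xor NodeNum:

```python
# Table-free hypercube routing: Outport = f(Dest xor NodeNum), with f
# assumed to pick the lowest differing address bit. Crossing port k
# connects a node to the neighbor whose address differs in bit k.

def outport(dest: int, node: int):
    """Port to forward on, or None if the packet has arrived."""
    diff = dest ^ node
    if diff == 0:
        return None                         # at the destination
    return (diff & -diff).bit_length() - 1  # index of lowest set bit

def route(src: int, dest: int):
    """Trace a packet's path from src to dest through the cube."""
    path, node = [src], src
    while (port := outport(dest, node)) is not None:
        node ^= 1 << port                   # crossing port k flips bit k
        path.append(node)
    return path

# In a 64-node (dimension-6) cube, node 0b000101 to node 0b110100:
print(route(0b000101, 0b110100))  # [5, 4, 20, 52]
```

The hop count is the Hamming distance between the two addresses, so the path is always minimal, as the slide claims.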
21. A 16-node (dimension 4) hypercube
22. Routing within a container
23. Bandwidths
   • Servers -> network: 82 Tb/s.
   • Containers -> cube: 2.5 Tb/s.
   • The average will be lower.
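The server-side aggregate follows from the earlier counts: 81,920 servers, each with a 1 Gb/s leaf-switch port. (The container-to-cube figure depends on how many 10 Gb/s uplinks each container tree presents to the hypercube, which the deck doesn't spell out, so only the first number is checked here.)

```python
# Check of the servers -> network aggregate quoted on the slide.

SERVERS = 64 * 32 * 40        # containers x racks x servers per rack
GBPS_PER_SERVER = 1           # each server has one 1 Gb/s port

total_tbps = SERVERS * GBPS_PER_SERVER / 1000
print(total_tbps)  # 81.92, i.e. the ~82 Tb/s on the slide
```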
24. Objections to this scheme
   • "Commodity hardware is cheaper."
     ◦ This is commodity hardware, even for one center (and we build many).
     ◦ And in the case of the network, it's not cheaper: large switches command very large margins.
   • "Standards are better."
     ◦ Yes, but only if they do what you need, at acceptable cost.
   • "It requires too many different skills."
     ◦ Not as many as you might think.
     ◦ And we would work with engineering/manufacturing partners who would be the ultimate manufacturers. This model has worked before.
   • "If this stuff is so great, why aren't others doing it?"
     ◦ They are.
25. Conclusion
   • By treating data centers as systems, and doing full-system optimization, we can achieve:
     ◦ Lower cost, both opex and capex.
     ◦ Higher reliability.
     ◦ Incremental scale-out.
     ◦ More rapid innovation as technology improves.
26. Questions?