An Introduction to the Design of Warehouse-Scale Computers
Alessio Villardita
A brief overview of the main factors involved in the design of Warehouse-Scale Computers (WSCs): the hardware, the cooling system, and overall plant energy efficiency, always keeping in mind the costs of such a large architecture.
Co-Author: Pietro Piscione (https://www.linkedin.com/pub/pietro-piscione/84/b37/926)
A work based on:
"The Datacenter as a Computer, An Introduction to the Design of Warehouse-Scale Machines, Second Edition"
by
Luiz André Barroso
Jimmy Clidaras
Urs Hölzle
2. Introduction
• Had scale been the only distinguishing feature
of these systems we might simply refer to
them as datacenters.
• Datacenters are buildings where multiple
servers and communication gear are
co-located because of their common
environmental requirements and physical
security needs, and for ease of maintenance.
• In that sense, a WSC is a type of datacenter.
3. Introduction
• Traditional datacenters, however, typically host a large
number of relatively small- or medium-sized
applications, each running on a dedicated hardware
infrastructure that is de-coupled and protected from
other systems in the same facility.
• Those datacenters host hardware and software for
multiple organizational units or even different
companies.
• Different computing systems within such a datacenter
often have little in common in terms of hardware,
software, or maintenance infrastructure, and tend not
to communicate with each other at all.
4. Introduction
• WSCs currently power the services offered by
companies such as Google, Amazon, Facebook, and
Microsoft’s online services division.
• They differ significantly from traditional datacenters:
1. They belong to a single organization.
2. Use a relatively homogeneous hardware and system
software platform
3. Share a common systems management layer.
• Often, much of the application, middleware, and
system software is built in-house compared to the
predominance of third-party software running in
conventional datacenters.
5. Introduction
• Most importantly, WSCs run a smaller number of
very large applications (or Internet services), and
the common resource management
infrastructure allows significant deployment
flexibility.
• The requirements of homogeneity, single-
organization control, and enhanced focus on cost
efficiency motivate designers to take new
approaches in constructing and operating these
systems.
6. Introduction
• Internet services must achieve high
availability, typically aiming for at least 99.99%
uptime (“four nines”, about an hour of
downtime per year).
• Achieving fault-free operation on a large
collection of hardware and system software is
hard and is made more difficult by the large
number of servers involved.
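As a quick check on the figure above, the downtime budget implied by a given availability target can be computed directly (a minimal sketch):

```python
def max_downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget per year, in minutes, for a given availability."""
    minutes_per_year = 365 * 24 * 60
    return (1 - availability) * minutes_per_year

# "Four nines" (99.99%) allows roughly 52.6 minutes of downtime per year,
# i.e., about an hour, matching the figure quoted above.
print(round(max_downtime_minutes_per_year(0.9999), 1))
```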
7. Introduction
• Although it might be theoretically possible to
prevent hardware failures in a collection of
10,000 servers, it would surely be extremely
expensive.
• Consequently, WSC workloads must be
designed to gracefully tolerate large numbers
of component faults with little or no impact
on service level performance and availability.
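To see why fault tolerance must come from software, consider the expected failure count at this scale. The sketch below assumes a hypothetical 4% annual server failure rate (an illustrative number, not one from the text):

```python
def expected_failures_per_day(num_servers: int, annual_failure_rate: float) -> float:
    """Expected server failures per day, assuming independent failures."""
    return num_servers * annual_failure_rate / 365

# With 10,000 servers and a hypothetical 4% annual failure rate,
# more than one server fails every day on average.
print(round(expected_failures_per_day(10_000, 0.04), 2))
```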
8. ARCHITECTURAL OVERVIEW OF WSCS
• The hardware implementation of a WSC will differ
significantly from one installation to the next.
• Even within a single organization such as Google,
systems deployed in different years use different
basic elements, reflecting the hardware
improvements provided by the industry.
• However, the architectural organization of these
systems has been relatively stable over the last few
years.
• Therefore, it is useful to describe this general
architecture at a high level as it sets the background for
subsequent discussions.
9. ARCHITECTURAL OVERVIEW OF WSCS
• Being satisfied with neither the metric nor the
US system, rack designers use “rack units” to
measure the height of servers.
• 1U is 1.75 inches or 44.45 mm; a typical rack is
42U high.
• The 19-inch (48.26-cm) rack is still the
standard framework to hold servers, despite
this standard going back to railroad hardware
from the 1930s.
10. ARCHITECTURAL OVERVIEW OF WSCS
Sketch of the typical elements in warehouse-scale systems: 1U server (left), 7’ rack with
Ethernet switch (middle), and diagram of a small cluster with a cluster-level Ethernet switch/
router (right).
11. ARCHITECTURAL OVERVIEW OF WSCS
• Previous Figure depicts the high-level building blocks
for WSCs.
• A set of low-end servers, typically in a 1U or blade
enclosure format, are mounted within a rack and
interconnected using a local Ethernet switch.
• These rack-level switches, which can use 1- or 10-Gbps
links, have a number of uplink connections to one or
more cluster-level (or datacenter-level) Ethernet
switches.
• This second-level switching domain can potentially
span more than ten thousand individual servers.
12. ARCHITECTURAL OVERVIEW OF WSCS
• In the case of a blade enclosure there is an
additional first level of networking
aggregation within the enclosure where
multiple processing blades connect to a small
number of networking blades through an I/O
bus such as PCIe.
13. ARCHITECTURAL OVERVIEW OF WSCS
• A 7-foot (213.36-cm) rack offers 48 U, so it’s not a
coincidence that the most popular switch for a
rack is a 48-port Ethernet switch.
• This product has become a commodity that costs
as little as $30 per port for a 1 Gbit/sec Ethernet
link in 2011.
• Note that the bandwidth within the rack is the
same for each server, so it does not matter where
the software places the sender and the receiver
as long as they are within the same rack.
14. ARCHITECTURAL OVERVIEW OF WSCS
• This flexibility is ideal from a software
perspective.
• These switches typically offer two to eight
uplinks, which leave the rack to go to the next
higher switch in the network hierarchy.
• Thus, the bandwidth leaving the rack is 6 to 24
times smaller—48/8 to 48/2—than the
bandwidth within the rack. This ratio is called
oversubscription.
• The uplink bandwidth is 48/n times lower, where
n = the number of uplink ports.
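The oversubscription ratio described above follows directly from the port counts (a minimal sketch):

```python
def oversubscription(ports: int = 48, uplinks: int = 2) -> float:
    """Ratio of intra-rack bandwidth to bandwidth leaving the rack."""
    return ports / uplinks

# A 48-port rack switch with 2 to 8 uplinks gives ratios between 24 and 6.
print(oversubscription(48, 8))   # 6.0
print(oversubscription(48, 2))   # 24.0
```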
15. ARCHITECTURAL OVERVIEW OF WSCS
• Alas, large oversubscription means
programmers must be aware of the
performance consequences when placing
senders and receivers in different racks.
• This increased software-scheduling burden is
another argument for network switches
designed specifically for the datacenter.
17. ARCHITECTURAL OVERVIEW OF WSCS
• Array Switch
• Switch that connects an array of racks.
• An array switch should have 10× the bisection
bandwidth of a rack switch.
• The cost of an n-port switch grows as n^2.
• Array switches often utilize content-addressable
memory chips and FPGAs to support high-speed
packet inspection.
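Taking the quadratic cost growth above at face value, the cost penalty of a larger switch can be estimated (a rough sketch built only on the n^2 claim):

```python
def relative_switch_cost(n_ports: int, base_ports: int = 48) -> float:
    """Cost of an n-port switch relative to a base switch, assuming cost ~ n^2."""
    return (n_ports / base_ports) ** 2

# Ten times the ports costs roughly a hundred times as much under this
# model, which is why large array switches are so expensive.
print(relative_switch_cost(480))  # 100.0
```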
20. ARCHITECTURAL OVERVIEW OF WSCS
• WSC Memory Hierarchy
• Previous figures show the latency, bandwidth, and
capacity of the memory hierarchy inside a WSC, and
also present the same data visually.
• Each server contains:
16 GBytes of memory with a 100-nanosecond access
time and transfers at 20 GBytes/sec and
2 terabytes of disk that offers a 10-millisecond access
time and transfers at 200 MBytes/sec.
• There are two sockets per board, and they share one
1 Gbit/sec Ethernet port.
21. ARCHITECTURAL OVERVIEW OF WSCS
• WSC Memory Hierarchy
• Every pair of racks includes one rack switch and holds
80 2U servers.
• Networking software plus switch overhead increases
the latency to DRAM to 100 microseconds and the disk
access latency to 11 milliseconds.
• Thus, the total storage capacity of a rack is roughly 1
terabyte of DRAM and 160 terabytes of disk storage.
• The 1 Gbit/sec Ethernet limits the remote bandwidth
to DRAM or disk within the rack to 100 MBytes/sec.
22. ARCHITECTURAL OVERVIEW OF WSCS
• WSC Memory Hierarchy
• The array switch can handle 30 racks, so storage
capacity of an array goes up by a factor of 30: 30
terabytes of DRAM and 4.8 petabytes of disk.
• The array switch hardware and software
increases latency to DRAM within an array to 500
microseconds and disk latency to 12 milliseconds.
• The bandwidth of the array switch limits the
remote bandwidth to either array DRAM or array
disk to 10 MBytes/sec.
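The rack and array capacities quoted in the preceding slides can be rolled up from the per-server figures (a minimal sketch using the numbers above):

```python
SERVER_DRAM_GB = 16        # per-server DRAM
SERVER_DISK_TB = 2         # per-server disk
SERVERS_PER_RACK_PAIR = 80
RACKS_PER_ARRAY = 30

rack_dram_tb = SERVERS_PER_RACK_PAIR * SERVER_DRAM_GB / 1000  # 1.28, "roughly 1 TB"
rack_disk_tb = SERVERS_PER_RACK_PAIR * SERVER_DISK_TB         # 160 TB
array_dram_tb = RACKS_PER_ARRAY * round(rack_dram_tb)         # 30 TB (rack DRAM rounded to 1 TB)
array_disk_pb = RACKS_PER_ARRAY * rack_disk_tb / 1000         # 4.8 PB

print(rack_dram_tb, rack_disk_tb, array_dram_tb, array_disk_pb)
```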
23. ARCHITECTURAL OVERVIEW OF WSCS
• WSC Memory Hierarchy
• Previous figures show that network overhead
dramatically increases latency from local
DRAM to rack DRAM and array DRAM, but
both still have more than 10 times better
latency than the local disk.
• The network collapses the difference in
bandwidth between rack DRAM and rack disk
and between array DRAM and array disk.
24. ARCHITECTURAL OVERVIEW OF WSCS
• WSC Memory Hierarchy
• What is the average DRAM access latency, assuming
that 90% of accesses are local to the server, 9% are
outside the server but local to the rack, and
1% are outside the rack but within the array?
• (90% × 0.1) + (9% × 100) + (1% × 300) = 12.09 microseconds
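The weighted average above can be reproduced in a couple of lines, using the latencies that appear in the formula (in microseconds):

```python
def avg_dram_latency_us(p_local=0.90, p_rack=0.09, p_array=0.01,
                        local_us=0.1, rack_us=100, array_us=300):
    """Weighted-average DRAM access latency in microseconds."""
    return p_local * local_us + p_rack * rack_us + p_array * array_us

# 0.09 + 9 + 3 = 12.09 microseconds
print(round(avg_dram_latency_us(), 2))
```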
25. ARCHITECTURAL OVERVIEW OF WSCS
• WSC Memory Hierarchy
• How long does it take to transfer 1000MB between disks
within the server, between servers in the rack, and
between servers in different racks of an array?
• Within server: 1000/200=5 sec
• Within rack: 1000/100=10 sec
• Within array: 1000/10= 100 sec
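These transfer times are just size divided by the bandwidth available at each level of the hierarchy (a minimal sketch):

```python
def transfer_seconds(size_mb: float, bandwidth_mb_per_s: float) -> float:
    """Time to move size_mb at the given sustained bandwidth."""
    return size_mb / bandwidth_mb_per_s

# Bandwidths from the hierarchy above: 200 MB/s local disk,
# 100 MB/s within the rack, 10 MB/s within the array.
for scope, bw in [("within server", 200), ("within rack", 100), ("within array", 10)]:
    print(scope, transfer_seconds(1000, bw), "sec")
```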
28. ARCHITECTURAL OVERVIEW OF WSCS
• The WSC needs 20 arrays to reach 50,000
servers, so there is one more level of the
networking hierarchy.
• Next Figure shows the conventional Layer 3
routers to connect the arrays together and to
the Internet.
29. ARCHITECTURAL OVERVIEW OF WSCS
The Layer 3 network used to link arrays together and to the Internet
[Greenberg et al. 2009].
Some WSCs use a separate border router to connect the Internet to the
datacenter Layer 3 switches.
31. ARCHITECTURAL OVERVIEW OF WSCS
• Another way to tackle network scalability is to
offload some traffic to a special-purpose
network.
• For example, if storage traffic is a big component
of overall traffic, we could build a separate
network to connect servers to storage units.
• If that traffic is more localized (not all servers
need to be attached to all storage units) we can
build smaller-scale networks, thus reducing costs.
32. ARCHITECTURAL OVERVIEW OF WSCS
• Historically, that’s how all storage was networked:
a SAN (storage area network) connected servers
to disks, typically using FibreChannel networks
rather than Ethernet.
• Today, Ethernet is becoming more common since
it offers comparable speeds, and protocols such
as FCoE (FibreChannel over Ethernet) and iSCSI
(SCSI over IP) allow Ethernet networks to
integrate well with traditional SANs.
33. ARCHITECTURAL OVERVIEW OF WSCS
• WSCs using VMs (or, more generally, task
migration) pose further challenges to
networks since connection endpoints (i.e., IP
address/port combinations) can move from
one physical machine to another.
• Typical networking hardware as well as
network management software doesn't
anticipate such moves and in fact often
explicitly assumes that they're not possible.
34. ARCHITECTURAL OVERVIEW OF WSCS
• For example, network designs often assume that
all machines in a given rack have IP addresses in a
common subnet, which simplifies administration
and minimizes the number of required
forwarding-table entries in routing tables.
• More importantly, frequent migration makes it
impossible to manage the network manually;
programming network elements needs to be
automated, so the same cluster manager that
decides the placement of computations also
needs to update the network state.
35. ARCHITECTURAL OVERVIEW OF WSCS
• The Need of SDN
• The need for a programmable network has led
to much interest in OpenFlow
[http://www.openflow.org/] and software-
defined networking (SDN), which moves the
network control plane out of the individual
switches into a logically centralized controller.
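To illustrate why centralizing the control plane simplifies these algorithms, the sketch below computes a route with plain BFS over a complete topology view, which is exactly what an SDN controller can do and an individual switch, seeing only its neighbors, cannot. The topology and names are hypothetical:

```python
from collections import deque

def shortest_path(graph, src, dst):
    """BFS shortest path: trivial when the controller sees the whole topology."""
    prev, frontier = {src: None}, deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nbr in graph[node]:
            if nbr not in prev:
                prev[nbr] = node
                frontier.append(nbr)
    return None

# Hypothetical two-rack topology: servers -> rack switches -> cluster switch.
topology = {
    "s1": ["rack1"], "s2": ["rack2"],
    "rack1": ["s1", "cluster"], "rack2": ["s2", "cluster"],
    "cluster": ["rack1", "rack2"],
}
print(shortest_path(topology, "s1", "s2"))
```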
37. ARCHITECTURAL OVERVIEW OF WSCS
• The Need of SDN
• Controlling a network from a logically centralized
server offers many advantages; in particular, common
networking algorithms such as computing reachability,
shortest paths, or max-flow traffic placement become
much simpler to solve, compared to their
implementation in current networks where each
individual router must solve the same problem while
dealing with limited visibility (direct neighbors only),
inconsistent network state (routers that are out of
synch with the current network state), and many
independent and concurrent actors (routers).
38. ARCHITECTURAL OVERVIEW OF WSCS
• STORAGE
• Disk drives or Flash devices can be connected
directly to each individual server and managed by
a global distributed file system (such as Google’s
GFS), or they can be part of Network Attached
Storage (NAS) devices directly connected to the
cluster-level switching fabric.
• A NAS tends to be a simpler solution to deploy
initially because it allows some of the data
management responsibilities to be outsourced to
a NAS appliance vendor.
39. ARCHITECTURAL OVERVIEW OF WSCS
• STORAGE
• Keeping storage separate from computing nodes also
makes it easier to enforce quality of service guarantees
since the NAS runs no compute jobs besides the
storage server.
• In contrast, attaching disks directly to compute nodes
can reduce hardware costs (the disks leverage the
existing server enclosure) and improve networking
fabric utilization (each server network port is
effectively dynamically shared between the computing
tasks and the file system).
40. ARCHITECTURAL OVERVIEW OF WSCS
• STORAGE
• The replication model between these two
approaches is also fundamentally different.
• A NAS tends to provide high availability through
replication or error correction capabilities within
each appliance, whereas systems like GFS
implement replication across different machines
and consequently will use more networking
bandwidth to complete write operations.
41. ARCHITECTURAL OVERVIEW OF WSCS
• STORAGE
• However, GFS-like systems are able to keep data
available even after the loss of an entire server
enclosure or rack and may allow higher aggregate
read bandwidth because the same data can be
sourced from multiple replicas.
• Trading off higher write overheads for lower cost,
higher availability, and increased read bandwidth
was the right solution for many of Google’s early
workloads.
42. ARCHITECTURAL OVERVIEW OF WSCS
• STORAGE
• An additional advantage of having disks co-
located with compute servers is that it enables
distributed system software to exploit data
locality.
• Given how networking performance has
outpaced disk performance over the last decades,
such locality advantages are less useful for disks,
but they may remain beneficial to faster modern
storage devices such as those using Flash storage.
43. ARCHITECTURAL OVERVIEW OF WSCS
• STORAGE
• NAND Flash technology has made Solid State Drives
(SSDs) affordable for a growing class of storage needs
in WSCs.
• While the cost per byte stored in SSDs will remain
much higher than in disks for the foreseeable future,
many Web services have I/O rates that cannot be easily
achieved with disk based systems.
• Since SSDs can deliver IO rates many orders of
magnitude higher than disks, they are increasingly
displacing disk drives as the repository of choice for
databases in Web services.
44. ARCHITECTURAL OVERVIEW OF WSCS
HDD interiors almost resemble a high-tech record player.
OCZ's Vector SSD is one of the fastest around
The OCZ RevoDrive Hybrid.
45. ARCHITECTURAL OVERVIEW OF WSCS
• STORAGE
• Types of NAND Flash
• There are primarily two types of NAND Flash widely used
today, Single-Level Cell (SLC) and Multi-Level Cell (MLC).
NAND Flash stores data in a large array of cells.
• Each cell stores data: one bit per cell for SLC NAND,
and two bits per cell for MLC. So, SLC NAND would store a
“0” or “1” in each cell, and MLC NAND would store “00”,
“01”, “10”, or “11” in each cell.
• SLC and MLC NAND offer different levels of performance
and endurance characteristics at different price points, with
SLC being the higher performing and more costly of the
two.
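The cell-level difference boils down to the number of charge states a cell must distinguish, which doubles with every extra bit (a minimal sketch):

```python
def nand_states(bits_per_cell: int) -> int:
    """Number of distinguishable charge levels a NAND cell must hold."""
    return 2 ** bits_per_cell

# SLC: 2 states per cell; MLC: 4 states per cell. More states mean finer
# voltage margins, hence MLC's lower endurance and performance.
print(nand_states(1), nand_states(2))
```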
46. ARCHITECTURAL OVERVIEW OF WSCS
• WSC STORAGE
• The data manipulated by WSC workloads tends to fall into
two categories:
• data that is private to individual running tasks and data that
is part of the shared state of the distributed workload.
• Private data tends to reside in local DRAM or disk, is rarely
replicated, and its management is simplified by virtue of its
single user semantics.
• In contrast, shared data must be much more durable and is
accessed by a large number of clients, and thus requires a
much more sophisticated distributed storage system.
47. ARCHITECTURAL OVERVIEW OF WSCS
• WSC STORAGE
• UNSTRUCTURED WSC STORAGE
• Google’s GFS is an example of a storage system with a
simple file-like abstraction (Google’s Colossus system has
since replaced GFS, but follows a similar architectural
philosophy so we choose to describe the better known GFS
here).
• GFS was designed to support the Web search indexing
system (the system that turned crawled Web pages into
index files for use in Web search), and therefore focuses on
high throughput for thousands of concurrent
readers/writers and robust performance under high
hardware failure rates.
48. ARCHITECTURAL OVERVIEW OF WSCS
• WSC STORAGE
• UNSTRUCTURED WSC STORAGE
• GFS users typically manipulate large quantities of
data, and thus GFS is further optimized for large
operations.
• The system architecture consists of a master,
which handles metadata operations, and
thousands of chunk server (slave) processes
running on every server with a disk drive, to
manage the data chunks on those drives.
49. ARCHITECTURAL OVERVIEW OF WSCS
• WSC STORAGE
• UNSTRUCTURED WSC STORAGE
• In GFS, fault tolerance is provided by replication
across machines instead of within them, as is the
case in RAID systems.
• Cross-machine replication allows the system to
tolerate machine and network failures and
enables fast recovery, since replicas for a given
disk or machine can be spread across thousands
of other machines.
50. ARCHITECTURAL OVERVIEW OF WSCS
• WSC STORAGE
• UNSTRUCTURED WSC STORAGE
• Although the initial version of GFS only
supported simple replication, today’s version
(Colossus) has added support for more space-
efficient Reed-Solomon codes, which tend to
reduce the space overhead of replication by
roughly a factor of two over simple replication
for the same level of availability.
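The roughly factor-of-two savings can be checked with illustrative parameters; the RS(6, 3) layout below (6 data chunks plus 3 parity chunks) is an assumed example, not necessarily the encoding Colossus actually uses:

```python
def replication_overhead(copies: int) -> float:
    """Total bytes stored per useful byte with n-way replication."""
    return float(copies)

def reed_solomon_overhead(data_chunks: int, parity_chunks: int) -> float:
    """Total bytes stored per useful byte with an RS(data, parity) code."""
    return (data_chunks + parity_chunks) / data_chunks

r3 = replication_overhead(3)      # 3-way replication: 3.0x raw storage
rs = reed_solomon_overhead(6, 3)  # RS with 6 data + 3 parity chunks: 1.5x
print(r3 / rs)                    # roughly the factor-of-two savings
```

Both layouts tolerate the loss of any three chunks, but the coded layout stores half as many raw bytes per useful byte.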
51. ARCHITECTURAL OVERVIEW OF WSCS
• WSC STORAGE
• UNSTRUCTURED WSC STORAGE
• An important factor in maintaining high availability is distributing file
chunks across the whole cluster in such a way that a small number of
correlated failures is extremely unlikely to lead to data loss.
• GFS takes advantage of knowledge about the known possible correlated
fault scenarios and attempts to distribute replicas in a way that avoids
their co-location in a single fault domain.
• Wide distribution of chunks across disks over a whole cluster is also key for
speeding up recovery.
• Since replicas of chunks in a given disk are spread across possibly all
machines in a storage cluster, reconstruction of lost data chunks is
performed in parallel at high speed.
• Quick recovery is important since long recovery time windows leave
under-replicated chunks vulnerable to data loss should additional faults
hit the cluster.
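The parallel-recovery argument can be sketched with back-of-the-envelope numbers; the disk size, peer count, and per-peer rebuild bandwidth below are all assumptions:

```python
def recovery_hours(disk_tb: float, peers: int, mb_per_s_per_peer: float) -> float:
    """Hours to re-replicate a failed disk's data, split evenly across peers."""
    total_mb = disk_tb * 1e6
    return total_mb / (peers * mb_per_s_per_peer) / 3600.0

serial = recovery_hours(2.0, 1, 40.0)       # one machine rebuilds alone
parallel = recovery_hours(2.0, 1000, 40.0)  # 1000 peers each rebuild a share
print(round(serial, 1), round(parallel * 3600))  # hours vs. seconds
```

Under these assumed numbers, spreading replicas across a thousand machines turns a rebuild that takes over half a day into one that completes in under a minute, shrinking the vulnerability window accordingly.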
52. ARCHITECTURAL OVERVIEW OF WSCS
• WSC STORAGE
• STRUCTURED WSC STORAGE
• The simple file abstraction of GFS and Colossus may suffice
for systems that manipulate large blobs of data, but
application developers also need the WSC equivalent of
database-like functionality, where data sets can be
structured and indexed for easy small updates or complex
queries.
• A blob (binary large object, basic large object, BLOB, or
BLOb) is a collection of binary data stored as a single entity
in a database management system. Blobs are typically
images, audio, or other multimedia objects, though
sometimes binary executable code is stored as a blob.
53. ARCHITECTURAL OVERVIEW OF WSCS
• WSC STORAGE
• STRUCTURED WSC STORAGE
• Structured distributed storage systems such as Google’s BigTable
and Amazon’s Dynamo were designed to fulfill those needs.
• Compared to traditional database systems, BigTable and Dynamo
sacrifice some features, such as richness of schema representation
and strong consistency, in favor of higher performance and
availability at massive scales.
• BigTable, for example, presents a simple multi-dimensional sorted
map consisting of row keys (strings) associated with multiple values
organized in columns, forming a distributed sparse table space.
Column values are associated with timestamps in order to support
versioning and time-series.
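The data model described above can be sketched as a toy in-memory structure; this illustrates only the abstraction (sparse sorted map of row key, column, and timestamp), not BigTable's actual API or implementation:

```python
class SparseTable:
    """Toy model of the BigTable data model: a sparse map of
    row key -> column -> {timestamp: value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, timestamp, value):
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Most recent value for (row, column), or None if absent."""
        cells = self.rows.get(row, {}).get(column)
        return cells[max(cells)] if cells else None

    def scan(self):
        """Rows are served in sorted key order, as in BigTable."""
        return sorted(self.rows)

t = SparseTable()
t.put("com.example/index", "contents", 1, "v1")
t.put("com.example/index", "contents", 2, "v2")   # newer timestamp
print(t.get("com.example/index", "contents"))     # latest version wins
```

Keeping every timestamped value per cell is what makes versioning and time-series queries natural in this model.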
54. ARCHITECTURAL OVERVIEW OF WSCS
• WSC STORAGE
• STRUCTURED WSC STORAGE
• The choice of eventual consistency in BigTable and Dynamo shifts
the burden of resolving temporary inconsistencies to the
applications using these systems.
• A number of application developers within Google have found it
inconvenient to deal with weak consistency models and the
limitations of the simple data schemes in BigTable.
• Second-generation structured storage systems such as MegaStore
and subsequently Spanner have been designed to address such
concerns.
• Both MegaStore and Spanner provide richer schemas and SQL-like
functionality while providing simpler, stronger consistency models.
55. ARCHITECTURAL OVERVIEW OF WSCS
Weak Consistency
• The protocol is said to support weak
consistency if:
• All accesses to synchronization
variables are seen by all processes (or
nodes, processors) in the same order
(sequentially) - these are
synchronization operations.
• Accesses to critical sections are seen
sequentially.
• All other accesses may be seen in
different order on different processes
(or nodes, processors).
• The set of both read and write
operations in between different
synchronization operations is the same
in each process.
Strong Consistency
• The protocol is said to support
strong consistency if:
• All accesses are seen by all
parallel processes (or nodes,
processors etc.) in the same
order (sequentially)
• Therefore only one consistent
state can be observed, as
opposed to weak consistency,
where different parallel
processes (or nodes etc.) can
perceive variables in different
states.
56. ARCHITECTURAL OVERVIEW OF WSCS
• WSC STORAGE
• INTERPLAY OF STORAGE AND NETWORKING TECHNOLOGY
• The success of WSC distributed storage systems can be
partially attributed to the evolution of datacenter
networking fabrics.
• The observation is that the gap between networking and disk
performance has widened to the point that disk locality is
no longer relevant in intra-datacenter computations.
• This observation enables dramatic simplifications in the
design of distributed disk-based storage systems as well as
utilization improvements since any disk byte in a WSC
facility can in principle be utilized by any task regardless of
its relative locality.
57. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER TIER CLASSIFICATIONS AND
SPECIFICATIONS
• The design of a datacenter is often classified as
belonging to “Tier I–IV”.
• The Uptime Institute, a professional services
organization specializing in datacenters, and the
Telecommunications Industry Association (TIA), an
industry group accredited by ANSI and made up of
approximately 400 member companies, both advocate
a 4-tier classification loosely based on the power
distribution, uninterruptible power supply (UPS),
cooling delivery and redundancy of the datacenter.
58. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER TIER CLASSIFICATIONS AND SPECIFICATIONS
• Tier I datacenters have a single path for power distribution, UPS,
and cooling distribution, without redundant components.
• Tier II adds redundant components to this design (N + 1), improving
availability.
• Tier III datacenters have one active and one alternate distribution
path for utilities. Each path has redundant components, and the
paths are concurrently maintainable; that is, they provide
redundancy even during maintenance.
• Tier IV datacenters have two simultaneously active power and
cooling distribution paths, redundant components in each path, and
are supposed to tolerate any single equipment failure without
impacting the load.
59. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER TIER CLASSIFICATIONS AND SPECIFICATIONS
• The Uptime Institute’s specification is generally
performance-based (with notable exceptions for the
amount of backup diesel fuel, water storage, and ASHRAE
temperature design points).
• The specification describes topology rather than
prescribing a specific list of components to meet the
requirements, so there are many architectures that can
achieve a given tier classification.
• In contrast, TIA-942 is very prescriptive and specifies a
variety of implementation details, such as building
construction, ceiling height, voltage levels, types of racks,
and patch cord labeling.
60. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER TIER CLASSIFICATIONS AND
SPECIFICATIONS
• Formally achieving tier classification certification is
difficult and requires a full review from one of the
granting bodies, and most datacenters are not formally
rated.
• Most commercial datacenters fall somewhere between
tiers III and IV, choosing a balance between
construction cost and reliability.
• Generally, the lowest of the individual subsystem
ratings (cooling, power, etc.) determines the overall
tier classification of the datacenter.
61. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER TIER CLASSIFICATIONS AND
SPECIFICATIONS
• Real-world datacenter reliability is strongly influenced
by the quality of the organization running the
datacenter, not just by the design.
• The Uptime Institute reports that over 70% of
datacenter outages are the result of human error,
including management decisions on staffing,
maintenance, and training.
• Theoretical availability estimates used in the industry
range from 99.7% for tier II datacenters to 99.98% and
99.995% for tiers III and IV, respectively.
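Those availability figures translate directly into expected downtime per year:

```python
HOURS_PER_YEAR = 24 * 365

def downtime_hours_per_year(availability_pct: float) -> float:
    """Expected unavailable hours per year for a given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

# Tier II ~26 h/year, Tier III ~1.8 h/year, Tier IV ~0.4 h/year
for tier, avail in [("II", 99.7), ("III", 99.98), ("IV", 99.995)]:
    print(f"Tier {tier}: {downtime_hours_per_year(avail):.2f} h/year")
```

The jump from Tier II to Tier III is more than an order of magnitude in expected downtime, which is why most commercial operators target that range.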
62. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER ENERGY EFFICIENCY
• The broadest definition of WSC energy efficiency would
measure the energy used to run a particular workload
(say, to sort a petabyte of data).
• Unfortunately, no two companies run the same
workload, and real-world application mixes change all
the time, so it is hard to benchmark real-world WSCs
this way.
• Thus, even though such benchmarks have been
contemplated as far back as 2008, none has yet
emerged, and we doubt one ever will.
63. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER ENERGY EFFICIENCY
• However, it is useful to view energy efficiency
as the product of three factors we can
independently measure and optimize:
• Efficiency = (1 / PUE) × (1 / SPUE) ×
(Computation / Total Energy to Electronic Components)
• The first term (a) measures facility efficiency,
the second (b) server power conversion efficiency,
and the third (c) measures the server’s
architectural efficiency.
64. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER ENERGY EFFICIENCY
• THE PUE METRIC
• Power usage effectiveness (PUE) reflects the
quality of the datacenter building infrastructure
itself, and captures the ratio of total building
power to IT power (the power consumed by the
actual computing and network equipment, etc.).
(Sometimes IT power is also referred to as
“critical power.”)
• PUE = (Facility power) / (IT Equipment power)
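The formula is straightforward to apply to metered readings; the 10 MW / 8 MW figures below are hypothetical:

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total building power over IT power."""
    return total_facility_kw / it_equipment_kw

# Hypothetical meter readings: 10 MW at the utility feed, 8 MW reaching
# the IT (critical) load -> PUE of 1.25, i.e. 25% overhead for cooling,
# power conversion, lighting, and so on.
print(pue(10_000, 8_000))
```

A PUE of 1.0 would mean every watt drawn from the utility reaches the computing equipment; real facilities are always above that.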
65. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER ENERGY EFFICIENCY
• THE PUE METRIC
• PUE has gained a lot of traction as a datacenter
efficiency metric since widespread reporting
started around 2009.
• We can easily measure PUE by adding electrical
meters to the lines powering the various parts of
a datacenter, thus determining how much power
is used by chillers or a UPS.
66. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER ENERGY EFFICIENCY
• THE PUE METRIC
• Historically, the PUE for the average
datacenter has been embarrassingly poor.
• According to a 2006 study, 85% of the datacenters
surveyed were estimated to have a PUE
greater than 3.0.
67. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER ENERGY EFFICIENCY
• THE PUE METRIC
• In other words, the building’s mechanical and electrical
systems consumed twice as much power as the actual
computing load! Only 5% had a PUE of 2.0 or better.
• A subsequent EPA survey of over 100 datacenters
reported an average PUE value of 1.91, and a 2012
Uptime Institute survey of over 1100 datacenters
covering a range of geographies and datacenter sizes
reported an average PUE value between 1.8 and 1.89.
69. ARCHITECTURAL OVERVIEW OF WSCS
• SOURCES OF EFFICIENCY LOSSES IN
DATACENTERS
• For illustration, let us walk through the losses
in a typical datacenter.
70. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER ENERGY EFFICIENCY
• The second term (b) accounts for overheads inside
servers or other IT equipment using a metric analogous
to PUE, server PUE (SPUE).
• SPUE consists of the ratio of total server input power to
its useful power, where useful power includes only the
power consumed by the electronic components
directly involved in the computation: motherboard,
disks, CPUs, DRAM, I/O cards, and so on.
• Substantial amounts of power may be lost in the
server’s power supply, voltage regulator modules
(VRMs), and cooling fans.
71. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER ENERGY EFFICIENCY
• The product of PUE and SPUE constitutes an
accurate assessment of the end-to-end
electromechanical efficiency of a WSC. This
true (or total) PUE metric (TPUE) is defined as
TPUE = PUE × SPUE.
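A quick numeric illustration of TPUE, with assumed PUE and SPUE values:

```python
def tpue(pue: float, spue: float) -> float:
    """True PUE: end-to-end overhead from utility feed to electronics."""
    return pue * spue

# Illustrative values: a facility PUE of 1.2 and a server SPUE of 1.25
# mean 1.5 W must be drawn from the utility per watt of power reaching
# the electronic components doing useful work.
print(tpue(1.2, 1.25))
```
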
72. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER ENERGY EFFICIENCY
• MEASURING ENERGY EFFICIENCY
• Similarly, server-level benchmarks such as Joulesort and
SPECpower characterize other aspects of computing
efficiency.
• Joulesort measures the total system energy to perform an
out-of-core sort and derives a metric that enables the
comparison of systems ranging from embedded devices to
supercomputers.
• SPECpower focuses on server-class systems and computes
the performance-to-power ratio of a system running a
typical business application on an enterprise Java platform.
73. ARCHITECTURAL OVERVIEW OF WSCS
• DATACENTER ENERGY EFFICIENCY
• MEASURING ENERGY EFFICIENCY
• Two separate benchmarking efforts aim to
characterize the efficiency of storage systems: the
Emerald Program by the Storage Networking
Industry Association (SNIA) and the SPC-2/E by
the Storage Performance Council.
• Both benchmarks measure storage servers under
different kinds of request activity and report
ratios of transaction throughput per Watt.
74. ARCHITECTURAL OVERVIEW OF WSCS
• Cost of a WSC
• To better understand the potential impact of energy-
related optimizations, let us examine the total cost of
ownership (TCO) of a datacenter.
• At the top level, costs split up into capital expenses
(Capex) and operational expenses (Opex).
• Capex refers to investments that must be made
upfront and that are then depreciated over a certain
time frame; examples are the construction cost of a
datacenter or the purchase price of a server.
75. ARCHITECTURAL OVERVIEW OF WSCS
• Cost of a WSC
• Opex refers to the recurring monthly costs of
actually running the equipment, excluding
depreciation: electricity costs, repairs and
maintenance, salaries of on-site personnel,
and so on.
• Thus, we have:
TCO = datacenter depreciation + datacenter Opex + server
depreciation + server Opex
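The TCO equation above can be sketched directly; the monthly dollar figures below are purely hypothetical:

```python
def monthly_tco(dc_depreciation: float, dc_opex: float,
                server_depreciation: float, server_opex: float) -> float:
    """TCO = datacenter depreciation + datacenter Opex
           + server depreciation + server Opex (all monthly figures)."""
    return dc_depreciation + dc_opex + server_depreciation + server_opex

# Hypothetical monthly figures, in millions of dollars:
print(monthly_tco(0.7, 0.4, 2.2, 0.5))
```

Note that server depreciation typically dominates, since servers are both expensive and amortized over a much shorter lifetime than the building.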
77. ARCHITECTURAL OVERVIEW OF WSCS
• Cost of a WSC
• The monthly depreciation cost (or amortization
cost) that results from the initial construction
expense depends on the duration over which the
investment is amortized (which is related to its
expected lifetime) and the assumed interest rate.
• Typically, datacenters are depreciated over
periods of 10–15 years.
• Under U.S. accounting rules, it is common to use
straight-line depreciation where the value of the
asset declines by a fixed amount each month.
78. ARCHITECTURAL OVERVIEW OF WSCS
• Cost of a WSC
• For example, if we depreciate a $12/W
datacenter over 12 years, the depreciation cost is
$0.08/W per month.
• If we had to take out a loan to finance
construction at an interest rate of 8%, the
associated monthly interest payments add an
additional cost of $0.05/W, for a total of $0.13/W
per month.
• Typical interest rates vary over time, but many
companies will pay interest in the 7–12% range.
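The $12/W example works out as follows, modeling the loan as a level-payment mortgage (an assumption, but one consistent with the quoted $0.13/W total):

```python
def monthly_payment_per_watt(capex_per_watt: float, years: int,
                             annual_rate: float) -> float:
    """Level monthly payment on a loan covering the construction capex."""
    n = years * 12
    r = annual_rate / 12
    return capex_per_watt * r / (1 - (1 + r) ** -n)

straight_line = 12 / (12 * 12)        # depreciation alone: ~$0.083/W/month
with_interest = monthly_payment_per_watt(12, 12, 0.08)  # ~$0.13/W/month
print(round(straight_line, 2), round(with_interest, 2))
```

The difference between the two figures, roughly $0.05/W per month, is the interest cost quoted on the slide.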
79. ARCHITECTURAL OVERVIEW OF WSCS
• Cost of a WSC
To put the cost of energy into
perspective, Hamilton did a case
study to estimate the costs of a WSC.
He determined that the CAPEX of this
8 MW facility was $88M, and
that the roughly 46,000 servers and
corresponding networking
equipment added another
$79M to the CAPEX for the WSC.
80. ARCHITECTURAL OVERVIEW OF WSCS
• Cost of a WSC
•We can now price the total cost of energy, since U.S. accounting rules allow us to
convert CAPEX into OPEX.
•We can just amortize CAPEX as a fixed amount each month for the effective life of the
equipment.
•Note that the amortization rates differ significantly, from 10 years for the facility to 4
years for the networking equipment and 3 years for the servers.
•Hence, the WSC facility lasts a decade, but you need to replace the servers every 3
years and the networking equipment every 4 years.
•By amortizing the CAPEX, Hamilton came up with a monthly OPEX, including accounting
for the cost of borrowing money (5% annually) to pay for the WSC.
•At $3.8M, the monthly OPEX is about 2% of the CAPEX.
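Hamilton's headline ratio can be double-checked from the CAPEX figures quoted above:

```python
facility_capex = 88e6    # Hamilton's facility construction estimate
it_capex = 79e6          # servers plus networking equipment
monthly_opex = 3.8e6     # amortized CAPEX plus recurring costs

ratio = monthly_opex / (facility_capex + it_capex)
print(f"{ratio:.1%}")    # about 2% of total CAPEX per month
```
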
81. ARCHITECTURAL OVERVIEW OF WSCS
• A Google Warehouse-Scale Computer
• Since many companies with WSCs are competing vigorously
in the marketplace, up until recently, they have been
reluctant to share their latest innovations with the public
(and each other).
• In 2009, Google described a state-of-the-art WSC as of
2005.
• Google graciously provided an update of the 2007 status of
their WSC, making this section the most up-to-date
description of a Google WSC.
• Even more recently, Facebook described their latest
datacenter as part of http://opencompute.org.
82. ARCHITECTURAL OVERVIEW OF WSCS
• A Google Warehouse-Scale Computer
• Containers
• Both Google and Microsoft have built WSCs using shipping
containers.
• The idea of building a WSC from containers is to make WSC
design modular.
• Each container is independent, and the only external
connections are networking, power, and water.
• The containers in turn supply networking, power, and
cooling to the servers placed inside them, so the job of the
WSC is to supply networking, power, and cold water to the
containers and to pump the resulting warm water to
external cooling towers and chillers.
84. ARCHITECTURAL OVERVIEW OF WSCS
• A Google Warehouse-Scale Computer
• Containers
The diagram is a cutaway drawing of a Google container.
• A container holds up to 1160 servers, so 45 containers
have space for 52,200 servers. (This WSC has about
40,000 servers.)
• The servers are stacked 20 high in racks that form two
long rows of 29 racks (also called bays) each, with one
row on each side of the container.
• The rack switches are 48-port, 1 Gbit/sec Ethernet
switches, which are placed in every other rack.
85. ARCHITECTURAL OVERVIEW OF WSCS
• A Google Warehouse-Scale Computer
• Containers
• The Google WSC that we are looking at contains 45
40-foot-long containers in a 300-foot by 250-foot space,
or 75,000 square feet (about 7000 square meters).
• To fit in the warehouse, 30 of the containers are
stacked two high, or 15 pairs of stacked containers.
• Although the location was not revealed, it was built at
the time that Google developed WSCs in The Dalles,
Oregon, which provides a moderate climate and is near
cheap hydroelectric power and Internet backbone
fiber.
86. ARCHITECTURAL OVERVIEW OF WSCS
• A Google Warehouse-Scale Computer
• Containers
• This WSC offers 10 megawatts with a PUE of 1.23 over the prior 12
months.
• Of that 0.230 of PUE overhead, 85% goes to cooling losses (0.195
PUE) and 15% (0.035) goes to power losses.
• The system went live in November 2005, and this section describes
its state as of 2007.
• A Google container can handle up to 250 kilowatts. That means the
container can handle 780 watts per square foot (0.09 square
meters), or 133 watts per square foot across the entire 75,000-
square-foot space with 40 containers.
• However, the containers in this WSC average just 222 kilowatts.
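The power-density arithmetic can be verified; the 8-foot container width below is an assumption (standard shipping containers are 8 ft wide):

```python
container_kw = 250
container_sqft = 40 * 8   # assumed 40 ft x 8 ft footprint per container

# Density inside one container vs. averaged across the whole floor:
watts_per_sqft_in_container = container_kw * 1000 / container_sqft
watts_per_sqft_facility = 40 * container_kw * 1000 / 75_000

print(round(watts_per_sqft_in_container), round(watts_per_sqft_facility))
```

The roughly 780 W/sq ft inside a container versus 133 W/sq ft across the facility shows how much of the floor space serves power, cooling, and access rather than compute.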
88. ARCHITECTURAL OVERVIEW OF WSCS
• A Google Warehouse-Scale Computer
• Containers
• Servers In A Google WSC
• The server in Figure 6.21 has two sockets, each containing a
dual-core AMD Opteron processor running at 2.2 GHz. The
photo shows eight DIMMs, and these servers are typically
deployed with 8 GB of DDR2 DRAM.
• A novel feature is that the memory bus is downclocked to
533 MHz from the standard 666 MHz, since the slower bus
has little impact on performance but a significant impact on
power.
• The baseline design has a single network interface card
(NIC) for a 1 Gbit/sec Ethernet link.
89. ARCHITECTURAL OVERVIEW OF WSCS
• A Google Warehouse-Scale Computer
• Containers
• Servers In A Google WSC
• Although the photo in Figure 6.21 shows two SATA disk drives, the
baseline server has just one.
• The peak power of the baseline is about 160 watts, and idle power is 85
watts.
• This baseline node is supplemented to offer a storage (or “diskfull”) node.
• First, a second tray containing 10 SATA disks is connected to the server.
• To get one more disk, a second disk is placed into the empty spot on the
motherboard, giving the storage node 12 SATA disks.
• Finally, since a storage node could saturate a single 1 Gbit/sec Ethernet
link, a second Ethernet NIC was added.
• Peak power for a storage node is about 300 watts, and it idles at 198
watts.