A brief overview of the main factors involved in the design of Warehouse-Scale Computers (WSCs), from the hardware and the cooling system to the overall plant energy efficiency, always keeping in mind the costs of such a large architecture.
Co-Author: Pietro Piscione (https://www.linkedin.com/pub/pietro-piscione/84/b37/926)
A work based on:
"The Datacenter as a Computer, An Introduction to the Design of Warehouse-Scale Machines, Second Edition"
by
Luiz André Barroso
Jimmy Clidaras
Urs Hölzle
An introduction to the Design of Warehouse-Scale Computers
1. An introduction to the Design of Warehouse-Scale Computers
Computer Architecture
A.Y. 2014/2015
Authors:
Piscione Pietro
Villardita Alessio
Degree: Computer Engineering
2. What is a WSC?
Warehouse-Scale Computer
● Scalable
● Distributed
● Cost-efficient
5. Is a WSC a data center?
Data centers:
● Not co-located
● Host services for multiple providers
● Third-party SW solutions
WSCs:
● Co-located
● Single organization
● Homogeneous SW and HW organization
6. Cost efficiency at scale
It requires more:
● Computing power
● Storage
● Throughput
● Reliability
More costs
9. Software Layers
● Platform-level software: common firmware, kernel, operating system distribution, and libraries
● Cluster-level infrastructure software: MapReduce, BigTable, Hadoop, Spanner, etc.
● Application-level software: Google search, Gmail, Google Maps, etc.
12. Large SMP vs low-end server nodes
Warehouse scale
13. Limits of very low-end cores
● Amdahl’s law: difficult to reduce serialization and communication overheads
● The larger the number of threads, the larger the variability in response times
Ex.: Web server latency per request
                      High-end cores    Low-end cores (3x slower)
Latency per request   1 s               2 s
CPU share of latency  50%               75%
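The web server example above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming that only the CPU-bound part of the request slows down on the low-end cores while the rest of the latency (network, disk, waiting) stays unchanged; the function name and its parameters are illustrative, not taken from the slides.

```python
def latency_on_slower_cores(latency_s, cpu_fraction, slowdown):
    """Return (new latency in seconds, new CPU share of that latency).

    Assumption: only the CPU-bound part of the request is stretched by
    `slowdown`; the non-CPU part keeps its original duration.
    """
    cpu_time = latency_s * cpu_fraction      # time spent on the CPU
    other_time = latency_s - cpu_time        # everything else
    new_cpu_time = cpu_time * slowdown       # CPU part on slower cores
    new_latency = other_time + new_cpu_time
    return new_latency, new_cpu_time / new_latency


# High-end cores: 1 s/request, 50% of it CPU time; low-end cores are 3x slower.
new_latency, new_cpu_share = latency_on_slower_cores(1.0, 0.5, 3.0)
print(f"{new_latency:.1f} s/request, {new_cpu_share:.0%} CPU")
# prints: 2.0 s/request, 75% CPU
```

A 3x slower core therefore only doubles the latency in this example, because half of the original request time was not CPU-bound.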
14. Network fabric
● Network scalability: hard to put into practice; one option is offloading some traffic to a special-purpose network
● Protocols: FCoE (Fibre Channel over Ethernet) and iSCSI (SCSI over IP)
● Programmable networks: OpenFlow and SDN
15. WSC architecture overview - Network
Characteristics       Ethernet cable   Optical fiber
Performance (Gb/s)    1-10             10-1000
MTBF (years)          >45              >10
Costs ($/km)          200-500          700-1200
Which protocol is used in the data center? InfiniBand and Ethernet
17. WSC architecture overview - Disks
Characteristics       HDD            SSD
Performance (MB/s)    R:59 W:60      R:100 W:80
Active power (W)      3.86           1
MTBF (Mh)             >2             <0.7
Costs ($/TB)          60-75          130-150
Which file system is used? GFS
18. Modelling costs
Total cost = Capital cost + Operational cost
Capital cost depends on:
● Design
● Size
● Location
● Speed of construction
Operational cost:
● Depends heavily on the applications
19. Capital Cost - example (Ref. [2])
● Servers: $2,997,090
● Power & Cooling: $1,296,902
● Power: $1,042,440
● Other: $284,686
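To see how the cost items above relate to each other, the following sketch sums them and computes each item's share of the total. The four figures are the ones on the slide; the dictionary and function names are illustrative assumptions, and the slide does not say which period or facility size the figures cover.

```python
# Cost items as listed on the slide (Ref. [2]); names are illustrative.
capital_costs = {
    "Servers": 2_997_090,
    "Power & Cooling": 1_296_902,
    "Power": 1_042_440,
    "Other": 284_686,
}


def cost_shares(costs):
    """Return (total, {item: fraction of total}) for a cost breakdown."""
    total = sum(costs.values())
    return total, {item: value / total for item, value in costs.items()}


total, shares = cost_shares(capital_costs)
print(f"Total: ${total:,}")
for item, share in shares.items():
    print(f"{item:<16} {share:6.1%}")
```

Servers account for slightly more than half of the total in this breakdown, with power-related items making up most of the rest.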
20. Operational Cost - example
● Power consumption
○ Cooling
○ Servers
  ○ Energy and power efficiency
  ○ Workload
● Repairs and failures
21. WSC Power Consumption: overview
● A datacenter uses 10-20% of the servers' power
● Cooling
● High efficiency in power conversion
[Chart: power consumption breakdown by CPUs, DRAM, Disks, Cooling]
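23. Energy and power efficiency
● Measures are workload dependent
● Distinguish between three main factors (facility, server, computing):
  Efficiency = (1/PUE) x (1/SPUE) x (C/TEEC)
  where 1/PUE is the facility factor, 1/SPUE is the server factor, and C/TEEC is the computing factor (C: computation performed; TEEC: total energy to electronic components)
● State-of-the-art: TPUE = PUE x SPUE, around 1.44
● Average data centers have TPUE = 3.2
For each productive watt, 2.2 more are consumed!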
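The TPUE figures above can be checked with a short calculation. This is a minimal sketch: the split of 1.44 into PUE = 1.2 and SPUE = 1.2 is an illustrative assumption (the slide gives only the product), and the function names are hypothetical.

```python
def tpue(pue, spue):
    """True PUE: facility overhead (PUE) times server overhead (SPUE)."""
    return pue * spue


def overhead_per_productive_watt(tpue_value):
    """Extra watts consumed for each watt that reaches the electronic components."""
    return tpue_value - 1.0


# Illustrative split: 1.2 x 1.2 gives the state-of-the-art value of about 1.44.
best = tpue(1.2, 1.2)
print(f"State of the art: TPUE = {best:.2f}, "
      f"overhead = {overhead_per_productive_watt(best):.2f} W per productive W")

# Average data center from the slide: TPUE = 3.2.
average = 3.2
print(f"Average: TPUE = {average:.1f}, "
      f"overhead = {overhead_per_productive_watt(average):.1f} W per productive W")
```

With TPUE = 3.2, only 1/3.2 (about 31%) of the energy drawn by the plant reaches the electronic components, which is where the "2.2 more watts" figure comes from.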
27. Energy efficiency key factors
● Efficient load distribution: live migration and Google File System
● Idle time must be kept to a minimum
● Energy-proportional computing
● Workload-peak prediction models (complex)
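The idea behind energy-proportional computing can be illustrated with a simple linear power model. This is only a sketch under stated assumptions: power is assumed to grow linearly from an idle level to peak, and the idle level of 50% of peak is a hypothetical value chosen for illustration, not a figure from the slides.

```python
def relative_power(utilization, idle_fraction):
    """Power draw as a fraction of peak, assuming a linear power model."""
    return idle_fraction + (1.0 - idle_fraction) * utilization


def relative_efficiency(utilization, idle_fraction):
    """Work done per unit of power, normalized so that 1.0 is the value at peak."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return utilization / relative_power(utilization, idle_fraction)


for u in (0.1, 0.3, 0.5, 1.0):
    typical = relative_efficiency(u, idle_fraction=0.5)   # hypothetical server
    ideal = relative_efficiency(u, idle_fraction=0.0)     # energy-proportional
    print(f"utilization {u:4.0%}: efficiency {typical:.2f} vs ideal {ideal:.2f}")
```

Under this model a server that idles at half of its peak power is well below its peak efficiency at low utilization, which is why both short idle times and energy proportionality appear in the list above.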
28. Energy efficiency benchmarks
● LINPACK: world’s top supercomputers
● Server-level (performance-to-power): JouleSort, SPECpower
● Storage (# of transactions per Watt): Emerald, SPC-2/E
● SPECpower_ssj2008: based on a broad class of server workloads
29. Dealing with failures and repairs
System: Available @ 99.9%, otherwise Unavailable
Causes of unavailability: Failure, HW upgrade, Maintenance
Tolerating faults, not hiding them
“A gracefully degraded service”
But how?
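To put the 99.9% availability target above in perspective, the sketch below converts an availability level into the downtime it allows per year. It assumes a 365-day year; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_hours_per_year(availability):
    """Maximum yearly downtime (in hours) compatible with a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


for availability in (0.99, 0.999, 0.9999):
    hours = downtime_hours_per_year(availability)
    print(f"{availability:.2%} available -> {hours:.2f} h of downtime per year")
```

At 99.9%, failures, hardware upgrades and maintenance together must fit in less than nine hours per year.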
30. Fault-Tolerant SW Infrastructure
Requirements:
● HW faults can be tolerated
● HW level: its faults must always be detected and reported to software
● Support for a broad class of operational procedures
These turn into the following pros:
● Inexpensive PC-class HW
● Cost savings and optimization
● Reactive containment and recovery actions
31. Truly faulty
Main causes of faults @Google:
● Software errors
● Human mistakes
● Wrong configurations
But also (10-25%):
● Hardware-related
○ Disk errors
○ DRAM soft errors
[Chart: fault causes by category: Config, SW, Human, HW, Net, Other]
32. The world is not perfect, and yet it holds on
And so do Google’s WSCs (hardware):
● 1.2-2 crashes per server per year (mature servers)
● with 2,000 servers, approximately 1 crash every 2.5 h (10 per day)
● ⅓ of servers are affected by correctable DRAM errors each year (on average, 1 error per server every 2.5 h)
● with ECC, only 1.3% of all machines experience uncorrectable memory errors per year
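The cluster-level crash figures above follow directly from the per-server rate. This is a minimal sketch, assuming crashes are spread evenly over a 365-day year; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def hours_between_crashes(n_servers, crashes_per_server_per_year):
    """Average time between two crashes somewhere in the cluster."""
    cluster_crashes_per_year = n_servers * crashes_per_server_per_year
    return HOURS_PER_YEAR / cluster_crashes_per_year


def crashes_per_day(n_servers, crashes_per_server_per_year):
    """Average number of crashes per day across the whole cluster."""
    return n_servers * crashes_per_server_per_year / 365


# 2,000 servers at the two ends of the 1.2-2 crashes/year range.
for rate in (1.2, 2.0):
    print(f"{rate} crashes/server/year: one crash every "
          f"{hours_between_crashes(2000, rate):.2f} h, "
          f"{crashes_per_day(2000, rate):.1f} per day")
```

The two ends of the range give one crash every 3.65 h and every 2.19 h respectively, so the slide's figure of roughly one crash every 2.5 h sits inside that interval.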
34. Google System Health
● Monitors servers’ configuration, activity, environmental, and error data
● Individual machine diagnostics
● Stability of new system software versions
● Suggests repair actions
35. Case study: web search
Web size? Nobody knows it.
Classification? Using PageRank.
QPS? Not possible to establish a priori.
[Figure: logical view of a web index]
39. Benchmark for Enterprise applications
16 x DELL M1000e
14 x IBM BladeCenter Model
16 x HP C7000
What are we going to test?
40. SPECpower_ssj2008 description
How is it composed?
● New Order (30.3%)
● Payment (30.3%)
● Order Status (3.0%)
● Delivery (3.0%)
● Stock Level (3.0%)
● Customer Report (30.3%)
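A transaction mix like the one above can be sketched as a weighted random choice. This is not the benchmark's actual implementation, only an illustration of how such percentages translate into a stream of transactions; the names `TRANSACTION_MIX` and `sample_transactions` are hypothetical.

```python
import random
from collections import Counter

# Transaction mix from the slide, in percent (the values sum to 99.9).
TRANSACTION_MIX = {
    "New Order": 30.3,
    "Payment": 30.3,
    "Order Status": 3.0,
    "Delivery": 3.0,
    "Stock Level": 3.0,
    "Customer Report": 30.3,
}


def sample_transactions(n, seed=0):
    """Draw n transaction types at random, weighted by the mix above."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = list(TRANSACTION_MIX)
    weights = list(TRANSACTION_MIX.values())
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_transactions(10_000))
for name in TRANSACTION_MIX:
    print(f"{name:<16} {counts[name] / 10_000:6.1%}")
```

Because `random.choices` normalizes the weights, the percentages do not need to sum to exactly 100.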
How does it work?
44. References
[1] Barroso, Clidaras, Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, 2013
[2] James Hamilton, AWS Team, http://perspectives.mvdirona.com/2008/11/cost-of-power-in-large-scale-data-centers/
Thank you for listening!