An introduction to the
Design of Warehouse-Scale Computers
Computer Architecture
A.Y. 2014/2015
Authors:
Piscione Pietro
Villardita Alessio
Degree: Computer Engineering
What is a WSC?
Warehouse-Scale Computer:
● Scalable
● Distributed
● Cost-efficient
[Overview diagram: WSC topics covered — VMs and applications, disks, networking, servers, cooling, energy proportionality, costs, repairs and failures, web search and QPS, e-commerce]
Why WSCs?
Motivations:
● Cloud services
● E-mail
● Social network
● News
● E-commerce
and so on...
Is a WSC a data center?
Data centers:
● Not co-located
● Host services for multiple providers
● Third-party SW solutions
WSCs:
● Co-located
● Single organization
● Homogeneous SW and HW organization
Cost efficiency at scale
Scale requires more:
● Computing power
● Storage
● Throughput
● Reliability
⇒ More costs
WSC architecture overview
[Diagram: from low-end servers to the full cluster]
WSC: SW and HW techniques
● Replication
● Error correction
● Sharding (see the sketch after this list)
● Load-balancing
● Health checking
● Compression
● Consistency
● Canaries
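As a rough illustration of two items from the list above (sharding and replication), here is a minimal sketch. It is not Google's actual implementation; the shard count, replication factor, and function names are made up for this example.

```python
# Minimal illustration of sharding + replication (illustrative only).
# A key is hashed to a primary shard; copies go to the next R-1 shards.
import hashlib

NUM_SHARDS = 16   # hypothetical number of shards
REPLICAS = 3      # each record is stored on 3 different shards/servers

def shard_for(key: str) -> int:
    """Map a key to its primary shard via a stable hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def placement(key: str) -> list[int]:
    """Primary shard plus replicas on the following shards (wrap around)."""
    primary = shard_for(key)
    return [(primary + i) % NUM_SHARDS for i in range(REPLICAS)]

print(placement("www.example.com/page"))   # primary shard plus two replica shards
```

Real systems usually rely on consistent hashing or an explicit placement service so that adding shards does not reshuffle most keys.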
● Platform-level software: common firmware,
kernel, operating system distribution, and
libraries
● Cluster-level infrastructure software:
MapReduce, BigTable, Hadoop, Spanner, etc.
● Application-level software: Google search,
Gmail, Google Maps, etc.
Software Layers
Platform-level software
Virtual machines
Pros: versatile, reliable, isolation, performance, encapsulation, costs, flexibility, checkpointing, live migration.
Cons: I/O-intensive workloads
Hardware Building Blocks
● Server hardware
● Network fabric
● Storage hierarchy components
Large SMP vs. low-end server nodes at warehouse scale
Limits of very low-end cores
● Amdahl’s law: it is difficult to reduce serialization and communication overheads (see the sketch below)
● The larger the number of threads, the larger the variability in response times
Ex.: web server latency per request
High-end cores:               1 s/request (50% CPU)
Low-end cores (3x slower):    2 s/request (75% CPU)
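To make the Amdahl's-law point concrete, here is a small sketch with illustrative numbers only (not taken from the slides): with a fixed serial fraction, adding more slow cores quickly stops helping.

```python
# Amdahl's law: speedup(N) = 1 / ((1 - p) + p / N), where p is the
# parallelizable fraction of the work. Illustrative numbers only.
def speedup(p: float, n_cores: int) -> float:
    return 1.0 / ((1.0 - p) + p / n_cores)

p = 0.90  # assume 90% of a request can be parallelized
for n in (1, 4, 16, 64):
    print(f"{n:3d} cores -> speedup {speedup(p, n):.2f}x")
# Even with 64 cores the speedup saturates near 1/(1-p) = 10x,
# so a core that is 3x slower cannot be fully compensated by adding cores.
```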
Network fabric
● Network scalability: hard to put into practice; offload some traffic to a special-purpose network
● Protocols: FCoE (Fibre Channel over Ethernet) and iSCSI (SCSI over IP)
● Programmable networks: OpenFlow and SDN
WSC architecture overview - Network
Characteristics        Ethernet cable    Optical fiber
Performance (Gb/s)     1-10              10-1000
MTBF (years)           >45               >10
Cost ($/km)            200-500           700-1200
Which protocols are used in the data center? InfiniBand and Ethernet
Storage hierarchy components
[Diagram: the storage hierarchy, trading off latency against size]
WSC architecture overview - Disks
Characteristics        HDD           SSD
Performance (MB/s)     R: 59, W: 60  R: 100, W: 80
Active power (W)       3.86          1
MTBF (Mh)              >2            <0.7
Cost ($/TB)            60-75         130-150
Which file system is used? GFS (Google File System)
Modelling costs
Total cost = capital cost + operational cost
Capital cost depends on:
● Design
● Size
● Location
● Speed of construction
Operational cost:
it depends heavily on the applications being run
Capital Cost - example (Ref. [2])
Servers:           $2,997,090
Power & cooling:   $1,296,902
Power:             $1,042,440
Other:             $284,686
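The shares implied by the figures above can be checked with a few lines of arithmetic; the script below only reuses the numbers from this slide (ref. [2]).

```python
# Share of each item in the example cost breakdown from ref. [2].
costs = {
    "Servers":         2_997_090,
    "Power & cooling": 1_296_902,
    "Power":           1_042_440,
    "Other":             284_686,
}
total = sum(costs.values())
print(f"Total: ${total:,}")
for name, value in costs.items():
    print(f"{name:<16} ${value:>9,}  ({value / total:5.1%})")
# Servers dominate (~53%), followed by power & cooling infrastructure (~23%)
# and the power bill itself (~19%).
```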
Operational Cost - example
● Power consumption
○ Cooling
○ Servers
○ Energy and power efficiency
○ Workload
● Repairs and failures
WSC Power Consumption: overview
● A data center uses 10-20% of the servers' power
● Cooling
● High efficiency in power conversion
[Chart: server power breakdown across CPUs, DRAM, disks, and cooling]
Closed Cooling System
Energy and power efficiency
● Measures are workload-dependent
● Distinguish between three main factors (facility, server, computing):
● State-of-the-art TPUE = PUE x SPUE is around 1.44
● Average data centers have TPUE = 3.2
Efficiency = (1/PUE) x (1/SPUE) x (Computation / Total Energy to Electronic Components)
             [facility]  [server]  [computing]
For each productive watt, 2.2 more are consumed!
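Read as a worked example (the 1.2 x 1.2 split is an illustrative assumption; only the products 1.44 and 3.2 come from the slide):

```latex
\[
\text{TPUE} = \text{PUE} \times \text{SPUE}, \qquad
\text{e.g. } 1.2 \times 1.2 = 1.44 \ \text{(state of the art)}
\]
\[
\text{TPUE} = 3.2 \;\Rightarrow\; 3.2 - 1 = 2.2 \ \text{W of overhead per productive watt}
\]
```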
Sources of Efficiency Losses
[Chart: losses split across IT equipment, cooling, UPS, and air movement]
[Chart: workload types: large continuous batch vs. a mix of online services]
Energy proportionality
Energy efficiency key factors
● Efficient load distribution: live migration and the Google File System
● Idle times must be kept short
● Energy-proportional computing (see the sketch after this list)
● Workload peak prediction models (complex)
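A minimal sketch of the energy-proportionality point from the list above; the idle and peak power figures are assumptions chosen only to illustrate the effect.

```python
# Energy proportionality sketch: a server that draws a large fraction of its
# peak power while idle is very inefficient at the low utilizations (10-50%)
# typical of WSCs. Power figures below are illustrative, not measured.
PEAK_W = 300.0   # assumed power at 100% utilization
IDLE_W = 150.0   # assumed power at 0% utilization (non-proportional server)

def power(util: float, idle_w: float = IDLE_W) -> float:
    """Linear power model between idle and peak."""
    return idle_w + (PEAK_W - idle_w) * util

for util in (0.1, 0.3, 0.5, 1.0):
    p_real = power(util)                 # non-proportional server
    p_ideal = power(util, idle_w=0.0)    # perfectly energy-proportional server
    print(f"util {util:4.0%}: {p_real:5.1f} W vs ideal {p_ideal:5.1f} W "
          f"-> {p_ideal / p_real:4.0%} of the energy does useful work")
```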
Energy efficiency benchmarks
● LINPACK: the world’s top supercomputers
● JouleSort
● SPECpower
● Emerald
● SPC-2/E
● SPECpower_ssj2008: based on a broad class of server workloads
Storage level: # of transactions per watt
Server level: performance-to-power ratio
Dealing with failures and repairs
[Diagram: a system available 99.9% of the time becomes unavailable due to failures, HW upgrades, and maintenance]
Tolerating faults, not hiding them
“A gracefully degraded service”
But how?
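As a quick sanity check, the availability figure above translates into downtime per year as follows (plain arithmetic over a 365-day year).

```python
# Downtime implied by a given availability over one year (365 days).
HOURS_PER_YEAR = 365 * 24

for availability in (0.999, 0.9984):
    downtime_h = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{availability:.2%} available -> ~{downtime_h:.1f} h of downtime/year")
# 99.90% -> ~8.8 h/year; 99.84% (the figure that appears later in the deck) -> ~14 h/year
```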
Fault-Tolerant SW Infrastructure
Requirements:
● HW faults can be tolerated
● At the HW level, faults must always be detected and reported to software
● Support for a broad class of operational procedures
Pros:
● Reactive containment and recovery actions
● Inexpensive PC-class HW turns into cost savings and optimization
Main fault causes @ Google:
● Software errors
● Human mistakes
● Wrong
configurations
But also (10-25%):
● Hardware-related
○ Disk errors
○ DRAM soft errors
[Chart: breakdown of disruption causes: configuration, software, human, hardware, network, other]
The world is not perfect, yet it keeps going
And so do Google’s WSCs:
● 1.2-2 crashes per year for a mature server
● with 2,000 servers, approximately 1 crash every 2.5 h (about 10 per day); see the sketch below
● ⅓ of servers are affected by correctable DRAM errors per year, on average (1 error per server every 2.5 h)
● with ECC, only 1.3% of all machines experience uncorrectable memory errors per year
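The crash figures above can be sanity-checked with a few lines of arithmetic; the script only reuses the numbers from this slide.

```python
# Sanity check of the crash figures above: 2,000 servers, 1.2-2 crashes
# per server per year.
SERVERS = 2000
for crashes_per_server_year in (1.2, 2.0):
    crashes_per_day = SERVERS * crashes_per_server_year / 365
    hours_between = 24 / crashes_per_day
    print(f"{crashes_per_server_year} crashes/server/year -> "
          f"{crashes_per_day:.1f} crashes/day, one every {hours_between:.1f} h")
# -> roughly 7-11 crashes per day, i.e. about one every 2-4 hours,
#    consistent with "~10 per day, one every ~2.5 h".
```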
Google’s Availability
[Chart: distribution of unavailability events (55%, 25%, 1% lasting longer than 1 day); overall availability 99.84%]
Google System Health
● Monitors servers’ configuration, activity, environmental, and error data
● Individual machine diagnostics
● Stability of new system software versions
● Suggests repair actions
Case study: web search
Web size? Nobody knows it.
Classification? Using PageRank.
QPS? Not possible to establish a priori.
Logical view of a web index (a minimal sketch follows)
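A minimal sketch of what "logical view of a web index" means: an inverted index mapping terms to the documents that contain them. The documents and names below are made up; real web indexes are sharded and also store positions, link data, and PageRank scores.

```python
# Minimal inverted index: the logical core of a web index. Each term maps to
# the set of document ids containing it. Documents below are made up.
from collections import defaultdict

docs = {
    1: "warehouse scale computers run web search",
    2: "web search serves thousands of queries per second",
    3: "energy proportionality matters at warehouse scale",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query: str):
    """Documents containing every term of the query (AND semantics)."""
    result = None
    for term in query.split():
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return result or set()

print(search("warehouse scale"))   # {1, 3}
print(search("web search"))        # {1, 2}
```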
Case study: web search - 2
No energy proportionality
[Chart: activity over the hours of the day]
CPU - energy proportionality
Voltage and frequency scaling (VFS) solution
Trade-off: performance vs. power consumption (a short power formula follows)
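The trade-off behind voltage and frequency scaling can be summarized with the standard dynamic-power relation (a textbook approximation, not a figure from these slides):

```latex
\[
P_{\text{dyn}} \approx \alpha \, C \, V^{2} f
\]
```

Because lowering the clock frequency also allows a lower supply voltage, power drops roughly with the cube of the scaling factor while performance drops only linearly, which is exactly the performance vs. power trade-off named above.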
A real-life example
Benchmark for enterprise applications:
● 16 x Dell M1000e
● 14 x IBM BladeCenter models
● 16 x HP C7000
What are we going to test?
What are we going to test?
SPECpower_ssj2008 description
How is it composed?
● New Order (30.3%)
● Payment (30.3%)
● Order Status (3.0%)
● Delivery (3.0%)
● Stock Level (3.0%)
● Customer Report (30.3%)
How does it work?
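A minimal sketch of how the benchmark's summary metric is obtained, to the best of our understanding of SPECpower_ssj2008: throughput (ssj_ops) and average power are measured at graduated load levels plus active idle, and the overall ssj_ops/watt is the ratio of the two sums. The measurements below are made up.

```python
# Sketch of how SPECpower_ssj2008's summary metric ("overall ssj_ops/watt")
# is derived: throughput and average power are measured at graduated load
# levels (100% .. 10%, plus active idle), then summed. Numbers are made up.
measurements = [  # (target load, ssj_ops, average power in W)
    (1.0, 300_000, 250.0),
    (0.7, 210_000, 205.0),
    (0.4, 120_000, 160.0),
    (0.1,  30_000, 120.0),
    (0.0,       0, 100.0),   # active idle
]

total_ops = sum(ops for _, ops, _ in measurements)
total_power = sum(power for _, _, power in measurements)
print(f"overall ssj_ops/watt = {total_ops / total_power:.1f}")
```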
Benchmark results
Lower is better
Benchmark results - 2
Higher is better
Conclusions
The Internet grows tirelessly!
User side
● Services
● Price
● Latency
● Availability
WSC side
● Hardware
● Costs
● Performance
● Reliability and
fault tolerance
References
[1] L. A. Barroso, J. Clidaras, U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, 2013.
[2] J. Hamilton (AWS Team), Cost of Power in Large-Scale Data Centers, http://perspectives.mvdirona.com/2008/11/cost-of-power-in-large-scale-data-centers/
Thank you for listening!
