An introduction to the
Design of Warehouse-Scale Computers
Computer Architecture
A.Y. 2014/2015
Authors:
Piscione Pietro
Villardita Alessio
Degree: Computer Engineering
What is a WSC
Warehouse-Scale
Computer
● Scalable
● Distributed
● Cost efficiency
[Figure: WSC topics: VMs and applications, disks, networking, servers, cooling, energy proportionality, costs, repair and failures, web search and QPS, e-commerce]
Why WSC
Motivations:
● Cloud services
● E-mail
● Social network
● News
● E-commerce
and so on...
Is a WSC a data center?
Data centers:
● Not co-located
● Host services for
multiple providers
● Third party SW
solution
WSCs:
● Co-located
● Single organization
● Homogeneous SW
and HW organization
Cost efficiency at scale
It requires more:
● Computing power
● Storage
● Throughput
● Reliability
More costs
WSC architecture overview
[Figure: low-end servers grouped into a cluster]
WSC: SW and HW techniques
● Replication
● Error correction
● Sharding
● Load-balancing
● Health checking
● Compression
● Consistency
● Canaries
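Sharding and replication from the list above can be illustrated together with a minimal sketch, assuming simple hash-based placement: a key is hashed to a primary shard, and the remaining copies go to the following shards. The function names and the placement rule are hypothetical and do not describe the scheme of any specific Google system.

```python
import hashlib


def primary_shard(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash (hypothetical placement rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replica_shards(key: str, num_shards: int, replicas: int) -> list[int]:
    """Primary shard plus the next (replicas - 1) shards, wrapping around the ring."""
    if not 1 <= replicas <= num_shards:
        raise ValueError("replicas must be between 1 and num_shards")
    first = primary_shard(key, num_shards)
    return [(first + i) % num_shards for i in range(replicas)]


for k in ("user:42", "user:43", "mail:7"):
    print(k, replica_shards(k, num_shards=8, replicas=3))
```

A stable hash (SHA-256 here) is used instead of Python's built-in `hash()`, which is randomized per process for strings. With plain modulo placement, changing `num_shards` remaps most keys; consistent hashing is a common alternative when shards are added or removed often.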
Software Layers
● Platform-level software: common firmware,
kernel, operating system distribution, and
libraries
● Cluster-level infrastructure software:
MapReduce, BigTable, Hadoop, Spanner, etc.
● Application-level software: Google Search,
Gmail, Google Maps, etc.
Platform-level software
Virtual machines
Pros: versatility, reliability, isolation, performance,
encapsulation, costs, flexibility, checkpointing,
live migration.
Cons:
overhead on I/O-intensive workloads
Hardware Building Blocks
● Server hardware
● Network fabric
● Storage hierarchy components
Large SMP vs low-end server nodes
[Figure: large SMP vs. low-end server nodes at warehouse scale]
Limits of very low-end cores
● Amdahl's law: it is difficult to reduce serialization
and communication overheads
● The larger the number of threads, the larger the
variability in response times
Ex.: web server latency per request
● High-end cores: 1 s/request (50% CPU)
● Low-end cores (3x slower): 2 s/request (75% CPU)
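The two bullets above can be quantified with a small sketch. Amdahl's law gives the speedup S(n) = 1 / ((1 - p) + p / n) for a parallel fraction p on n cores. The second function assumes that the sub-tasks of a request are slow independently with probability `p_slow` (a simplifying assumption), to show why spreading work over more threads widens the spread of response times. The inputs are illustrative, not measurements from the slide.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup on `cores` cores when only `parallel_fraction` of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def prob_request_is_slow(p_slow: float, subtasks: int) -> float:
    """Chance that a request waiting on `subtasks` independent pieces hits at least one slow piece."""
    return 1.0 - (1.0 - p_slow) ** subtasks


# With 5% serial work the speedup can never exceed 1 / 0.05 = 20x.
for n in (1, 4, 16, 64, 256):
    print(f"{n:>3} cores -> speedup {amdahl_speedup(0.95, n):.2f}")

# If each sub-task is slow 1% of the time, splitting a request 100 ways makes most requests slow.
print(f"{prob_request_is_slow(0.01, 100):.2f}")
```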
Network fabric
● Network scalability: hard to achieve in practice;
one option is offloading some traffic to a special-purpose
network
● Protocols: FCoE (FibreChannel over Ethernet)
and iSCSI (SCSI over IP)
● Programmable network: OpenFlow and SDN
WSC architecture overview - Network
Characteristic | Ethernet cable | Optical fiber
Performance (Gb/s) | 1-10 | 10-1000
MTBF (years) | >45 | >10
Cost ($/km) | 200-500 | 700-1200
What protocol is used in the data center? InfiniBand and Ethernet
Storage hierarchy components
[Figure: storage hierarchy components compared by latency and size]
WSC architecture overview - Disks
Characteristic | HDD | SSD
Performance (MB/s) | R: 59, W: 60 | R: 100, W: 80
Active power (W) | 3.86 | 1
MTBF (million hours) | >2 | <0.7
Cost ($/TB) | 60-75 | 130-150
Which file system is used? GFS (Google File System)
Modelling costs
Total cost = Capital cost + Operational cost
Capital cost depends on:
● Design
● Size
● Location
● Speed of construction
Operational cost:
it depends heavily
on the applications
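The cost model can be made concrete by spreading capital expenses over each asset's lifetime and adding the monthly operational cost. This is a minimal sketch, assuming straight-line amortization with no interest; the lifetimes and dollar amounts are hypothetical.

```python
def monthly_tco(capex_facility: float, facility_years: float,
                capex_servers: float, server_years: float,
                opex_per_month: float) -> float:
    """Monthly total cost = amortized capital cost + operational cost (straight-line, no interest)."""
    facility_per_month = capex_facility / (facility_years * 12)
    servers_per_month = capex_servers / (server_years * 12)
    return facility_per_month + servers_per_month + opex_per_month


# Hypothetical inputs: $120M facility over 10 years, $36M of servers over 3 years, $1M/month opex.
print(f"${monthly_tco(120e6, 10, 36e6, 3, 1e6):,.0f} per month")  # $3,000,000 per month
```

Servers usually get a much shorter amortization period than the building, so they can dominate the monthly figure even when their purchase price is lower.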
Capital Cost - example (Ref. [2])
● Servers: $2,997,090
● Power & Cooling: $1,296,902
● Power: $1,042,440
● Other: $284,686
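The example figures from Ref. [2] can be turned into shares of the total with a few lines of arithmetic; this only re-expresses the numbers on the slide.

```python
costs = {
    "Servers": 2_997_090,
    "Power & Cooling": 1_296_902,
    "Power": 1_042_440,
    "Other": 284_686,
}
total = sum(costs.values())
shares = {name: value / total for name, value in costs.items()}

print(f"Total: ${total:,}")
for name, share in shares.items():
    print(f"{name:<16} {share:6.1%}")
```

Servers account for about 53% of the total, while power and cooling infrastructure plus power together account for about 42%.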
Operational Cost - example
● Power consumption
○ Cooling
○ Servers
○ Energy and power efficiency
○ Workload
● Repairs and failures
WSC Power Consumption: overview
● A datacenter uses
10-20% of the
servers' power
● Cooling
● High efficiency in
power conversion
[Figure: power breakdown among CPUs, DRAM, disks, and cooling]
Closed Cooling System
Energy and power efficiency
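● Measures are workload dependent
● Distinguish between three main factors (facility, server, computing):
$$\text{Efficiency} = \frac{\text{Computation}}{\text{Total Energy}} = \underbrace{\frac{1}{\mathrm{PUE}}}_{\text{facility}} \times \underbrace{\frac{1}{\mathrm{SPUE}}}_{\text{server}} \times \underbrace{\frac{\text{Computation}}{\text{TEEC}}}_{\text{computing}}$$
where TEEC is the total energy delivered to the electronic components
● State-of-the-art TPUE = PUE × SPUE is around 1.44
● Average data centers have TPUE = 3.2
For each productive watt, 2.2 more are consumed!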
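The TPUE arithmetic above can be checked with a few lines. The split PUE = 1.2 and SPUE = 1.2 is an assumed example that reproduces the state-of-the-art value of 1.44; the slide does not give the individual factors.

```python
def tpue(pue: float, spue: float) -> float:
    """True PUE: facility overhead (PUE) times server-internal overhead (SPUE)."""
    return pue * spue


def wasted_watts_per_productive_watt(total_pue: float) -> float:
    """Watts lost in facility and server overheads for every watt reaching the electronic components."""
    return total_pue - 1.0


sota = tpue(1.2, 1.2)  # assumed split of the 1.44 state-of-the-art value
print(f"state of the art: TPUE {sota:.2f}, {wasted_watts_per_productive_watt(sota):.2f} W lost per productive W")
print(f"average: TPUE 3.2, {wasted_watts_per_productive_watt(3.2):.1f} W lost per productive W")
```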
Sources of Efficiency Losses
[Figure: sources of efficiency losses: IT equipment, cooling, UPS, air movement]
Workload
[Figure: workload types: large continuous batch vs. a mix including online services]
Energy proportionality
Energy efficiency key factors
● Efficient load distribution: Live migration and
Google File System
● Idle time must be kept low
● Energy-proportional computing
● Prediction models for workload peaks (complex)
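Energy-proportional computing can be illustrated with a linear power model, P(u) = P_idle + (P_peak - P_idle) × u, a common first approximation that is assumed here rather than measured. The sketch compares work per watt at partial load against work per watt at full load; the idle and peak wattages are hypothetical.

```python
def power_watts(utilization: float, idle_w: float, peak_w: float) -> float:
    """Linear server power model: idle power plus a part proportional to utilization (0..1)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return idle_w + (peak_w - idle_w) * utilization


def relative_efficiency(utilization: float, idle_w: float, peak_w: float) -> float:
    """Work per watt at `utilization`, normalized to work per watt at 100% load."""
    if utilization == 0.0:
        return 0.0
    return utilization / power_watts(utilization, idle_w, peak_w) * peak_w


for u in (0.1, 0.3, 0.5, 1.0):
    typical = relative_efficiency(u, idle_w=50.0, peak_w=100.0)   # idle draws half of peak
    ideal = relative_efficiency(u, idle_w=0.0, peak_w=100.0)      # perfectly proportional server
    print(f"u={u:.0%}: typical {typical:.2f}, proportional {ideal:.2f}")
```

A server that draws half of its peak power when idle delivers less than half of its full-load efficiency at 30% utilization, which matters because WSC servers often operate well below peak.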
Energy efficiency Benchmarks
● LINPACK: used to rank the world's top supercomputers
● JouleSort
● SPECpower
● Emerald
● SPC-2/E
● SPECpower_ssj2008: based on a broad class of
server workloads
Storage-level metric: number of transactions per watt
Server-level metric: performance-to-power ratio
Dealing with failures and repairs
[Figure: system availability; the system is available 99.9% of the time and unavailable because of failures, HW upgrades, and maintenance]
Tolerating faults, not hiding them
“A gracefully degraded service”
But how?
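Availability targets are easier to reason about as expected downtime. This sketch assumes a 365-day year and treats availability as a simple fraction of time.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of unavailability per year for an availability given as a fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be in [0, 1]")
    return (1.0 - availability) * HOURS_PER_YEAR


for a in (0.999, 0.9984, 0.9999):
    print(f"{a:.2%} available -> {downtime_hours_per_year(a):.1f} h/year down")
```

At 99.9%, a system is still down for almost 9 hours a year, which is why the goal is to degrade gracefully rather than to hide every fault.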
Fault-Tolerant SW Infrastructure
Requirements:
● HW faults can be tolerated by software
● At the HW level, faults must
always be detected and
reported to software
● Support a broad class of
operational procedures
Pros:
● Inexpensive PC-class HW can be used,
which in turn brings cost savings and
optimization
● Reactive containment and
recovery actions
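Reactive containment and recovery can be sketched at the request level: when a replica fails, the client records it as suspected down so later requests avoid it (containment) and retries the request on another copy (recovery). The `Replica` class, `ReplicaDown` exception, and `read_with_failover` function are hypothetical stand-ins, not part of any real Google infrastructure.

```python
class ReplicaDown(Exception):
    """Raised by a simulated replica that cannot serve requests."""


class Replica:
    """Hypothetical in-memory stand-in for a storage server holding one copy of the data."""

    def __init__(self, name: str, alive: bool = True) -> None:
        self.name = name
        self.alive = alive

    def get(self, key: str) -> str:
        if not self.alive:
            raise ReplicaDown(self.name)
        return f"{key}@{self.name}"


def read_with_failover(replicas: list[Replica], key: str, suspected_down: set[str]) -> str:
    """Serve from the first usable replica; a failing replica is recorded (containment)
    so later requests skip it, and the request is retried elsewhere (recovery)."""
    for replica in replicas:
        if replica.name in suspected_down:
            continue
        try:
            return replica.get(key)
        except ReplicaDown:
            suspected_down.add(replica.name)
    raise RuntimeError(f"no replica could serve {key!r}")


replicas = [Replica("a", alive=False), Replica("b"), Replica("c")]
down: set[str] = set()
print(read_with_failover(replicas, "doc:1", down))  # served by "b"; "a" is now suspected down
print(down)
```

In a real system a separate health checker would remove a repaired server from the suspected set; here it simply stays excluded.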
Truly faulty
Main causes of faults
@Google:
● Software errors
● Human mistakes
● Wrong
configurations
But also (10-25%):
● Hardware-related faults
○ Disk errors
○ DRAM soft errors
[Figure: breakdown of faults by cause: configuration, software, human, hardware, network, other]
The world is not perfect, but it holds on,
and so do Google's WSCs:
● 1.2-2 crashes per year (mature server)
● With 2,000 servers, approximately 1 crash every 2.5 h
(about 10 per day)
● About ⅓ of servers are affected by correctable DRAM errors
each year (on average, 1 error per server every 2.5 h)
● With ECC, only 1.3% of all machines experience
uncorrectable memory errors in a year
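The fleet-level crash numbers follow from simple arithmetic, assuming crashes are independent and spread evenly over the year.

```python
HOURS_PER_YEAR = 365 * 24


def hours_between_crashes(servers: int, crashes_per_server_year: float) -> float:
    """Mean interval between crashes anywhere in the fleet."""
    return HOURS_PER_YEAR / (servers * crashes_per_server_year)


def crashes_per_day(servers: int, crashes_per_server_year: float) -> float:
    """Expected number of crashes per day across the fleet."""
    return servers * crashes_per_server_year / 365


for rate in (1.2, 1.75, 2.0):
    print(f"{rate} crashes/server/year -> one crash every "
          f"{hours_between_crashes(2000, rate):.2f} h, {crashes_per_day(2000, rate):.1f} per day")
```

One crash every 2.5 h (about 10 per day) on 2,000 servers corresponds to roughly 1.75 crashes per server per year, inside the 1.2-2 range.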
Google's Availability
[Figure: availability of Google's machines, 99.84% overall; chart values include 55%, 25%, and 1% (> 1 day)]
Google System Health
● Monitors servers'
configuration, activity,
environmental, and
error data
● Individual machine
diagnostics
● Stability of new system
software versions
● Suggests repair actions
Case study: web search
Web size?
Nobody knows it.
Ranking?
Using PageRank.
QPS (queries per second)?
Not possible to
establish a priori
[Figure: logical view of a web index]
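A web index can be sketched as an inverted index (term to document IDs) partitioned by document across shards, so that every query fans out to all shards and the partial results are merged. This is a minimal sketch under that assumption; the modulo placement rule and function names are hypothetical, and ranking (e.g., PageRank) is left out.

```python
from collections import defaultdict


def build_shards(docs: dict[int, str], num_shards: int) -> list[dict[str, set[int]]]:
    """Document-partitioned index: each shard holds an inverted index of its own subset of docs."""
    shards: list[dict[str, set[int]]] = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        index = shards[doc_id % num_shards]  # hypothetical placement: doc id modulo shard count
        for term in text.lower().split():
            index[term].add(doc_id)
    return shards


def search(shards: list[dict[str, set[int]]], query: str) -> list[int]:
    """Send the query to every shard, AND the terms locally, then merge the partial hit lists."""
    terms = query.lower().split()
    if not terms:
        return []
    hits: set[int] = set()
    for index in shards:
        postings = [index.get(term, set()) for term in terms]
        hits |= set.intersection(*postings)
    return sorted(hits)


docs = {
    1: "warehouse scale computer",
    2: "scale out web search",
    3: "web search index",
    4: "warehouse cooling",
}
shards = build_shards(docs, num_shards=2)
print(search(shards, "web search"))
print(search(shards, "warehouse"))
```

Because every shard must answer every query, the slowest shard bounds the response time, and QPS capacity grows by adding replicas of each shard.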
Case study: web search - 2
No energy proportionality
[Figure: load and power plotted by hour of the day]
CPU - energy proportionality
Voltage and frequency scaling (VFS) solution
Trade-off
Performance vs.
Power consumption
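The VFS trade-off can be illustrated with the standard approximation for dynamic CMOS power, P = activity × C × V² × f. This sketch assumes voltage and frequency are lowered together, ignores static (leakage) power, and treats the task as fully CPU-bound, so its run time scales with 1/f; the operating points are hypothetical and normalized.

```python
def dynamic_power(voltage: float, frequency: float,
                  capacitance: float = 1.0, activity: float = 1.0) -> float:
    """Dynamic CMOS power, P = activity * C * V^2 * f (leakage/static power ignored)."""
    return activity * capacitance * voltage ** 2 * frequency


def task_energy(cycles: float, voltage: float, frequency: float) -> float:
    """Energy of a CPU-bound task: power times run time (cycles / f)."""
    return dynamic_power(voltage, frequency) * (cycles / frequency)


# Normalized operating points (hypothetical): nominal vs. V and f both scaled down by 20%.
nominal_p = dynamic_power(1.0, 1.0)
scaled_p = dynamic_power(0.8, 0.8)
print(f"power:  {scaled_p / nominal_p:.3f} of nominal")
print(f"time:   {1.0 / 0.8:.3f} of nominal")
print(f"energy: {task_energy(1.0, 0.8, 0.8) / task_energy(1.0, 1.0, 1.0):.3f} of nominal")
```

Power drops to about half while the task takes 25% longer, so energy per task still falls; the cost is higher latency, which is the performance vs. power trade-off on the slide.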
A real-life example
Benchmark for Enterprise applications
16 x DELL M1000e
14 x IBM BladeCenter
16 x HP C7000
What are we going to test?
SPECpower_ssj2008 description
How is it composed?
● New Order (30.3%)
● Payment (30.3%)
● Order Status (3.0%)
● Delivery (3.0%)
● Stock Level (3.0%)
● Customer Report (30.3%)
How does it work?
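As commonly described, SPECpower_ssj2008 drives the transaction mix above at graduated target loads from 100% down to 10% in 10% steps, plus an active-idle interval, and records throughput (ssj_ops) and average power at each level. The overall metric divides the sum of throughput by the sum of power. The sketch reproduces only that final arithmetic with hypothetical measurements; the official run and reporting rules are defined by SPEC.

```python
def overall_ssj_ops_per_watt(ssj_ops: list[float], avg_watts: list[float]) -> float:
    """Overall score: total throughput over all measured levels divided by total average power.
    Active idle adds 0 ops to the numerator but its power still counts in the denominator."""
    if len(ssj_ops) != len(avg_watts):
        raise ValueError("need one power reading per measurement level")
    return sum(ssj_ops) / sum(avg_watts)


# Hypothetical server: target loads 100%, 90%, ..., 10%, then active idle.
loads = list(range(100, 0, -10))
peak_ops = 300_000
ops = [peak_ops * load / 100 for load in loads] + [0.0]
watts = [100 + 150 * load / 100 for load in loads] + [100.0]  # 100 W idle, 250 W at full load

print(f"{overall_ssj_ops_per_watt(ops, watts):.1f} overall ssj_ops/watt")
```

Because idle power appears in the denominator at every level, servers with low idle power (closer to energy proportionality) score noticeably better.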
Benchmark results
Lower is better
Benchmark results - 2
Higher is better
Conclusions
The Internet grows tirelessly!
User side
● Services
● Price
● Latency
● Availability
WSC side
● Hardware
● Costs
● Performance
● Reliability and
fault tolerance
References
[1] Barroso, Clidaras, Hölzle, The Datacenter as a Computer: An
Introduction to the Design of Warehouse-Scale Machines, Morgan &
Claypool Publishers, 2013
[2] Hamilton (AWS Team), Cost of Power in Large-Scale Data Centers, 2008,
http://perspectives.mvdirona.com/2008/11/cost-of-power-in-large-scale-data-centers/
Thank you for
listening!