The Rise of
Cloud Computing Systems
Jeff Dean
Google, Inc.
(Describing the work of thousands of people!)
1
2
Utility computing: Corbató & Vyssotsky, “Introduction and Overview of the Multics
system”, AFIPS Conference, 1965.
Picture credit: http://www.multicians.org/
3
4
5
How Did We Get to Where We Are?
Prior to the mid-1990s, distributed systems emphasized:
● modest-scale systems in a single site (Grapevine, many
others), as well as
● widely distributed, decentralized systems (DNS)
6
Adjacent fields
High Performance Computing:
Heavy focus on performance, but not on fault-tolerance
Transactional processing systems/database systems:
Strong emphasis on structured data, consistency
Limited focus on very large scale, especially at low cost
7
Caveats
Very broad set of areas:
Can’t possibly cover all relevant work
Focus on a few important areas, systems, and trends
Will describe context behind systems with which I am most
familiar
8
What caused the need for such large systems?
Very resource-intensive interactive services like search
were key drivers
9
Growth of web
… from millions to hundreds of billions of pages
… and need to index it all,
… and search it millions and then billions of times per day
… with sub-second latencies
A Case for Networks of Workstations: NOW, Anderson, Culler, & Patterson. IEEE Micro,
1995
Cluster-Based Scalable Network Services, Fox, Gribble, Chawathe, Brewer, & Gauthier,
SOSP 1997.
Picture credit: http://now.cs.berkeley.edu/ and http://wikieducator.org/images/2/23/Inktomi.jpg
10
My Vantage Point
Joined DEC WRL in 1996, around the launch of AltaVista
Picture credit: http://research.microsoft.com/en-us/um/people/gbell/digital/timeline/1995-2.htm
11
My vantage point, continued:
Google, circa 1999
Early Google tenet:
Commodity PCs give high perf/$
Commodity components even better!
Aside: use of cork can land your computing platform in the Smithsonian
Picture credit: http://americanhistory.si.edu/exhibitions/preview-case-american-enterprise
12
At Modest Scale: Treat as Separate Machines
# Run buildindex over four buckets on each of 16 machines, by hand:
i=0
for m in a7 a8 a9 a10 a12 a13 a14 a16 a17 a18 \
         a19 a20 a21 a22 a23 a24; do
  ssh -n $m "cd /root/google; for j in `seq $i $[$i+3]`; do \
      j2=\`printf %02d \$j\`; \
      f=\`echo $files | sed s/bucket00/bucket\$j2/g\`; \
      fgrun bin/buildindex \$f; \
    done" &
  i=$[$i+4]
done
What happened to poor old a11 and a15?
13
At Larger Scale: Becomes Untenable
14
Reliability Must Come From Software
Typical first year for a new Google cluster (circa 2006)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packet loss)
~8 network maintenances (4 might cause ~30-min random connectivity losses)
~12 router reloads (takes out DNS and external VIPs for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for DNS
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc.
Long distance links: wild dogs, sharks, dead horses, drunken hunters, etc.
15
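A quick back-of-envelope sketch of what those numbers mean day to day (the cluster size below is assumed for illustration, not taken from the slide):

# Back-of-envelope: how often something fails in a large cluster.
# Assumed figures: ~2,000 machines and ~1,000 individual machine
# failures per year, matching the slide's order of magnitude.
machines = 2000
machine_failures_per_year = 1000

failures_per_day = machine_failures_per_year / 365
hours_between_failures = 24 / failures_per_day

print(f"~{failures_per_day:.1f} machine failures per day")
print(f"~one machine failure every {hours_between_failures:.1f} hours")
# With disk failures, rack failures, and network maintenance on top,
# some component is effectively always failing or recovering.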
A Series of Steps,
All With a Common Theme:
Provide Higher-Level View Than
“Large Collection of Individual Machines”
Self-manage and self-repair as much as possible
[Diagram: many individual machines, each running its own OS]
16
First Step:
Abstract Away Individual Disks
[Diagram: a distributed file system layered over many individual machines, each running its own OS]
17
Xerox Alto (1973), NFS (1984), many others:
File servers, distributed clients
AFS (Howard et al. ‘88):
1000s of clients, whole file caching, weakly consistent
xFS (Anderson et al. ‘95):
completely decentralized
Petal (Lee & Thekkath, ‘95), Frangipani (Thekkath et al., ‘96):
distributed virtual disks, plus file system on top of Petal
Long History of Distributed File Systems
18
Google File System (Ghemawat, Gobioff, & Leung, SOSP ’03)
[Diagram: GFS clients send metadata ops to a centralized master and read/write data directly to the chunk-serving machines, giving huge aggregate I/O bandwidth]
● Centralized master manages metadata
● Files split into 64 MB chunks, each replicated on 3 different servers
● High fault tolerance + automatic recovery, high availability
● 1000s of clients read/write directly to/from 1000s of disk serving processes
19
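A minimal sketch of the GFS-style split between metadata and data described above: clients ask the single master only for chunk locations, then read directly from the servers holding the chunks. All class and method names here are illustrative; this is not the actual GFS API.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, each replicated on several servers

class Master:
    """Holds only metadata: (file, chunk index) -> chunkserver locations."""
    def __init__(self, chunk_locations):
        # e.g. {("/logs/web.0", 0): ["cs12", "cs47", "cs81"], ...}
        self.chunk_locations = chunk_locations

    def lookup(self, path, chunk_index):
        return self.chunk_locations[(path, chunk_index)]

class Chunkserver:
    """Stores chunk replicas as raw bytes keyed by (path, chunk index)."""
    def __init__(self, chunks=None):
        self.chunks = chunks or {}

    def read(self, path, chunk_index, offset, length):
        return self.chunks[(path, chunk_index)][offset:offset + length]

class Client:
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers  # server name -> Chunkserver

    def read(self, path, offset, length):
        chunk_index = offset // CHUNK_SIZE
        replicas = self.master.lookup(path, chunk_index)   # metadata op to master
        for server in replicas:                            # data flows directly
            try:                                           # from a chunkserver
                return self.chunkservers[server].read(
                    path, chunk_index, offset % CHUNK_SIZE, length)
            except (KeyError, ConnectionError):
                continue                                   # replica missing or
        raise IOError("all replicas unavailable")          # unreachable: try next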
20
Disks in datacenter basically self-managing
21
Successful design pattern:
Centralized master for metadata/control, with
thousands of workers and thousands of clients
Once you can store data, you want to be able to
process it efficiently
Large datasets imply the need for highly parallel
computation
One important building block:
Scheduling jobs with 100s or 1000s of tasks
22
Multiple Approaches
● Virtual machines
● “Containers”: akin to a VM, but at the process level, not
a whole OS
23
Virtual Machines
● Early work done by MIT and IBM in the 1960s
○ Give separate users their own executing copy of the OS
● Reinvigorated by Bugnion, Rosenblum et al. in the late 1990s
○ simplifies effective utilization of multiprocessor machines
○ allows consolidation of servers
Raw VMs: key abstraction now offered by cloud service
providers
24
Cluster Scheduling Systems
● Goal: Place containers or VMs on physical machines
○ handle resource requirements, constraints
○ run multiple tasks per machine for efficiency
○ handle machine failures
Similar problem to earlier HPC scheduling and distributed
workstation cluster scheduling systems
e.g. Condor [Litzkow, Livny & Mutka, ‘88]
25
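As a toy illustration of the placement problem (not any particular system’s algorithm), a greedy scheduler that packs tasks onto machines subject to resource requirements and constraints might look like this:

from dataclasses import dataclass, field

@dataclass
class Machine:
    name: str
    free_cpu: float
    free_ram_gb: float
    labels: set = field(default_factory=set)   # e.g. {"ssd", "rack:r7"}
    tasks: list = field(default_factory=list)

@dataclass
class Task:
    name: str
    cpu: float
    ram_gb: float
    constraints: set = field(default_factory=set)  # labels the machine must have

def schedule(tasks, machines):
    """Greedy best-fit: put each task on the feasible machine with the
    least leftover CPU, so multiple tasks share machines for efficiency."""
    placements = {}
    for task in tasks:
        candidates = [m for m in machines
                      if m.free_cpu >= task.cpu
                      and m.free_ram_gb >= task.ram_gb
                      and task.constraints <= m.labels]
        if not candidates:
            placements[task.name] = None        # pending: nothing fits right now
            continue
        best = min(candidates, key=lambda m: m.free_cpu - task.cpu)
        best.free_cpu -= task.cpu
        best.free_ram_gb -= task.ram_gb
        best.tasks.append(task.name)
        placements[task.name] = best.name
    return placements

# On a machine failure, its tasks simply re-enter the pending set
# and are placed elsewhere by the same routine.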
Many Such Systems
● Proprietary:
○ Borg [Google: Verma et al., published 2015, in use since 2004]
(unpublished predecessor by Liang, Dean, Sercinoglu, et al. in use since 2002)
○ Autopilot [Microsoft: Isard et al., 2007]
○ Tupperware [Facebook, Narayanan slide deck, 2014]
○ Fuxi [Alibaba: Zhang et al., 2014]
● Open source:
○ Hadoop YARN
○ Apache Mesos [Hindman et al., 2011]
○ Apache Aurora [2014]
○ Kubernetes [2014]
26
Tension: Multiplexing resources & performance isolation
● Sharing machines across completely different jobs and
tenants is necessary for effective utilization
○ But leads to unpredictable performance blips
● Isolating while still sharing
○ Memory “ballooning” [Waldspurger, OSDI 2002]
○ Linux containers
○ ...
● Controlling tail latency very important [Dean & Barroso, 2013]
○ Especially in large fan-out systems
27
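One concrete tactic from the tail-latency work cited above is the hedged request: if the first replica has not answered within a small budget, send the same request to a second replica and take whichever answer arrives first. A rough sketch, using Python threads in place of a real RPC stack:

import concurrent.futures

def hedged_request(replicas, request, hedge_after_s=0.05):
    """Send the request to replicas[0]; if no reply within hedge_after_s,
    also send it to replicas[1] and return whichever result finishes first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(replicas[0], request)
        try:
            return first.result(timeout=hedge_after_s)
        except concurrent.futures.TimeoutError:
            backup = pool.submit(replicas[1], request)
            done, _ = concurrent.futures.wait(
                [first, backup],
                return_when=concurrent.futures.FIRST_COMPLETED)
            return done.pop().result()
            # A real system would also cancel the slower request
            # once an answer has arrived.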
Higher-Level Computation Frameworks
Give programmer a high-level abstraction for computation
Map computation automatically
onto a large cluster of machines
28
MapReduce
[Dean & Ghemawat, OSDI 2004]
● simple Map and Reduce abstraction
● hides messy details of locality, scheduling, fault
tolerance, dealing with slow machines, etc. in its
implementation
● makes it very easy to do a very wide variety of large-scale
computations
29
Hadoop - open source version of MapReduce
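The canonical example of the programming model is word count: the user writes only the two small functions below, while the MapReduce implementation handles partitioning the input, scheduling map and reduce tasks across thousands of machines, and re-running failed or slow tasks. A single-process sketch of the model (not the distributed implementation):

from collections import defaultdict

def map_fn(document):
    """Emit (word, 1) for every word in the document."""
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    """Sum all counts emitted for one word."""
    return word, sum(counts)

def mapreduce(documents, map_fn, reduce_fn):
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["the cat sat", "the dog sat"], map_fn, reduce_fn))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}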
Succession of Higher-Level Computation Systems
● Dryad [Isard et al., 2007] - general dataflow graphs
● Sawzall [Pike et al. 2005], Pig [Olston et al. 2008],
DryadLINQ [Yu et al. 2008], Flume [Chambers et al. 2010]
○ higher-level languages/systems using MapReduce/Hadoop/Dryad as
underlying execution engine
● Pregel [Malewicz et al., 2010] - graph computations
● Spark [Zaharia et al., 2010] - in-memory working sets
● ...
30
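To illustrate the idea of higher-level systems compiling down to an execution engine, here is a sketch of a tiny deferred pipeline API in the FlumeJava/Pig spirit; the operation names are invented for illustration and do not correspond to any real system’s API:

class Pipeline:
    """Deferred collection: each call records a step; nothing executes
    until run(), at which point the steps could be handed to an engine
    such as MapReduce/Hadoop/Dryad. Here they simply run locally."""
    def __init__(self, items):
        self.items = list(items)
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def group_by_key(self):
        self.steps.append(("group", None))
        return self

    def combine_values(self, fn):
        self.steps.append(("combine", fn))
        return self

    def run(self):
        data = self.items
        for kind, fn in self.steps:
            if kind == "map":
                data = [out for item in data for out in fn(item)]
            elif kind == "group":
                grouped = {}
                for k, v in data:
                    grouped.setdefault(k, []).append(v)
                data = list(grouped.items())
            elif kind == "combine":
                data = [(k, fn(vs)) for k, vs in data]
        return data

# Word count again, now written as a chain of high-level operations.
result = (Pipeline(["the cat sat", "the dog sat"])
          .map(lambda doc: [(w, 1) for w in doc.split()])
          .group_by_key()
          .combine_values(sum)
          .run())
print(dict(result))   # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}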
Many Applications Need To Update Structured State
With Low Latency and at Large Scale
31
Desires:
● Spread across many machines, grow and shrink automatically
● Handle machine failures quickly and transparently
● Often prefer low latency and high performance over consistency
[Diagram: key space spread across many machines]
TBs to 100s of PBs of data
10^6, 10^8, or more reqs/sec
Distributed Semi-Structured Storage Systems
● BigTable [Google: Chang et al. OSDI 2006]
○ higher-level storage system built on top of distributed file system (GFS)
○ data model: rows, columns, timestamps
○ no cross-row consistency guarantees
○ state managed in small pieces (tablets)
○ recovery fast (10s or 100s of machines each recover state of one tablet)
● Dynamo [Amazon: DeCandia et al., 2007]
○ versioning + app-assisted conflict resolution
● Spanner [Google: Corbett et al., 2012]
○ wide-area distribution, supports both strong and weak consistency
32
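To make the BigTable-style data model concrete, here is a sketch of a toy rows/columns/timestamps store as seen by a client. The API names are illustrative, not BigTable’s actual client library, and as in BigTable nothing spans rows: each put or get touches a single row only.

import time
from collections import defaultdict

class Table:
    """Toy in-memory table: row key -> column -> timestamped versions."""
    def __init__(self):
        # row key -> column -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        self.rows[row_key][column].insert(0, (ts, value))

    def get(self, row_key, column, num_versions=1):
        return self.rows[row_key][column][:num_versions]

webtable = Table()
webtable.put("com.example/index.html", "contents:", "<html>v1</html>")
webtable.put("com.example/index.html", "anchor:news.example.org", "Example")
print(webtable.get("com.example/index.html", "contents:"))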
33
Successful design pattern:
Give each machine hundreds or thousands of units
of work or state
Helps with:
dynamic capacity sizing
load balancing
faster failure recovery
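A rough illustration (with assumed numbers) of why many small units per machine speed up failure recovery: the failed machine’s tablets can be recovered in parallel by every surviving machine, instead of being reloaded onto a single replacement:

# Assumed figures for illustration only.
machines = 100
state_per_machine_gb = 800
recovery_bandwidth_gb_per_min = 10   # per recovering machine

# One big unit per machine: a single replacement reloads everything.
whole_machine_recovery_min = state_per_machine_gb / recovery_bandwidth_gb_per_min

# 1,000 small tablets per machine, scattered so that after a failure
# each surviving machine picks up ~10 tablets and recovers them in parallel.
tablets_per_machine = 1000
tablets_per_survivor = tablets_per_machine / (machines - 1)
parallel_recovery_min = (tablets_per_survivor / tablets_per_machine
                         * state_per_machine_gb / recovery_bandwidth_gb_per_min)

print(f"one big unit : ~{whole_machine_recovery_min:.0f} minutes to recover")
print(f"many tablets : ~{parallel_recovery_min:.1f} minutes to recover")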
34
The Public Cloud
Making these systems available to developers
everywhere
Cloud Service Providers
● Make computing resources available on demand
○ through a growing set of simple APIs
○ leverages economies of scale of large datacenters
○ … for anyone with a credit card
○ … at a large scale, if desired
35
Cloud Service Providers
Amazon: Queue API in 2004, EC2 launched in 2006
Google: AppEngine in 2005, other services starting in 2008
Microsoft: Azure launched in 2008.
Millions of customers using these services
Shift towards these services is accelerating
Comprehensiveness of APIs increasing over time
36
37
So where are we?
[Diagram: the layered stack built up over the talk]
  many machines, each running its own OS
  Distributed file system
  Cluster Scheduling System
  MapReduce, Dryad, Pregel, ...   BigTable, Dynamo, Spanner
  Powerful web services
  Amazon Web Services, Google Cloud Platform, Microsoft Azure
38
What’s next?
● Abstractions for interactive services with 100s of
subsystems
○ less configuration, much more automated operation,
self-tuning, …
● Systems to handle greater heterogeneity
○ e.g. automatically split computation between mobile
devices and datacenters
39
Thanks for listening!
Thanks to
Ken Birman, Eric Brewer, Peter Denning,
Sanjay Ghemawat, and Andrew Herbert for
comments on this presentation
40