2. GOOGLE TECHNOLOGY LAYERS
• Services and Applications
– Google™ search, Gmail™, Ads system, Google Maps™
• Distributed Computing
• Computing Platform
– Commodity PC Hardware, Linux, Physical Network
JavaBlend 2008, http://www.javablend.net/ 2
3. IMPLICATIONS OF GOOGLE ENVIRONMENT
• Single process performance does not matter
– Total throughput is more important
• Stuff breaks
– If you have one server, it may stay up three years
– If you have 10,000 servers, expect to lose ten a day
• “Ultra-reliable” hardware doesn’t really help
– At large scales, reliable hardware still fails, albeit less often
– Software still needs to be fault-tolerant
4. BUILDING BLOCKS OF google.com?
• Distributed data
– Google File System (GFS)
– BigTable
• Job manager
• Distributed computation
– MapReduce
• Distributed lock service
– Chubby
5. SCALABLE DISTRIBUTED FILE SYSTEM
Google File System
(GFS)
6. GFS: REQUIREMENTS
• High component failure rates
– Inexpensive commodity components fail all the time
• Modest number of huge files
– Just a few million files, most of them multi-GB
• Files are write-once, mostly appended to
– Perhaps concurrently
– Large streaming reads
7. GFS: DESIGN DECISIONS
• Files stored as chunks
– Fixed size (64MB)
• Reliability through replication
• Each chunk replicated 3+ times
• Single master to coordinate access, keep metadata
– Simple centralized management
• No data caching
– Little benefit due to large data sets, streaming reads
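The fixed 64 MB chunk size means a client can compute which chunk a byte offset falls in entirely on its own, and ask the master only for metadata. A minimal sketch (class and method names are illustrative, not the actual GFS API):

```java
// Sketch: mapping a file byte offset to a GFS chunk index.
// GFS stores files as fixed-size 64 MB chunks, each replicated 3+ times;
// the names below are hypothetical, chosen only for illustration.
public class GfsChunks {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB

    // A client computes the chunk index locally, then asks the master
    // only for that chunk's handle and replica locations (metadata).
    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(chunkIndex(0));                        // 0 (first chunk)
        System.out.println(chunkIndex(CHUNK_SIZE));               // 1 (second chunk)
        System.out.println(chunkIndex(10L * 1024 * 1024 * 1024)); // 160 (10 GB offset)
    }
}
```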
8. GFS: ARCHITECTURE
Where is the potential weakness of this design?
9. GFS: WEAK POINT - SINGLE MASTER
• From distributed systems theory, a single master is a
– Single point of failure
– Scalability bottleneck
• GFS solutions
– Shadow masters
– Minimize master involvement
• never move data through it, use only for metadata
• large chunk size
• master delegates authority to primary replicas in data mutations (chunk leases)
10. GFS: METADATA
• Global metadata is stored on the master
– File and chunk namespaces
– Mapping from files to chunks
– Locations of each chunk’s replicas
• All metadata kept in memory (~64 bytes / chunk)
• Master has an operation log for persistent logging of
critical metadata updates
– Persistent on local disk
– Replicated
– Checkpoints for faster recovery
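The ~64 bytes/chunk figure makes the in-memory design plausible with back-of-envelope arithmetic: at 64 MB chunks, one petabyte of file data needs about 16.8 million chunks, i.e. roughly 1 GB of master RAM. A small sketch of that calculation (names are illustrative):

```java
// Back-of-envelope check: how much master RAM does chunk metadata need?
// Assumes the slide's figures: 64 MB chunks, ~64 bytes of metadata per chunk.
public class MasterMemory {
    static final long CHUNK_SIZE = 64L << 20; // 64 MB per chunk
    static final long BYTES_PER_CHUNK = 64;   // in-memory metadata per chunk

    static long metadataBytes(long fileDataBytes) {
        long chunks = fileDataBytes / CHUNK_SIZE;
        return chunks * BYTES_PER_CHUNK;
    }

    public static void main(String[] args) {
        long onePetabyte = 1L << 50;
        System.out.println(onePetabyte / CHUNK_SIZE);          // 16777216 chunks
        System.out.println(metadataBytes(onePetabyte) >> 30);  // 1 (GiB of RAM per PB)
    }
}
```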
11. GFS: MUTATIONS
• Mutations must be done for all replicas
• Master picks one replica as primary; gives it a “lease” for mutations
– Primary defines a serial order of mutations
• Data flow decoupled from control flow
12. GFS: OPEN SOURCE ALTERNATIVES
• Hadoop Distributed File System - HDFS (Java)
– http://hadoop.apache.org/core/docs/current/hdfs_design.html
13. DISTRIBUTED STORAGE FOR LARGE STRUCTURED DATA SETS
Bigtable
14. BIGTABLE: REQUIREMENTS
• Want to store petabytes of structured data across
thousands of commodity servers
• Want a simple data format that supports dynamic control
over data layout and format
• Must support very high read/write rates
– millions of operations per second
• Latency requirements:
– backend bulk processing
– real-time data serving
16. BIGTABLE: EXAMPLE
• A web crawling system might use a Bigtable that stores web pages
– Each row key could represent a specific URL
– Columns represent page contents, the references to that page, and
other metadata
– The row range for a table is dynamically partitioned between servers
• Rows are clustered together on machines by key
– Using reversed URLs as keys minimizes the number of machines where
pages from a single domain are stored
– Each cell is timestamped so there could be multiple versions of the
same data in the table
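Reversing the hostname components of a URL makes pages from one domain sort next to each other in the lexicographic row order. A minimal sketch of such a key transformation (class and method names are hypothetical):

```java
// Sketch: building a Bigtable-style row key from a URL.
// Reversing the hostname ("maps.google.com" -> "com.google.maps") clusters
// all pages of a domain into a contiguous lexicographic row range.
public class RowKeys {
    static String rowKey(String host, String path) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.append(path).toString();
    }

    public static void main(String[] args) {
        System.out.println(rowKey("maps.google.com", "/index.html"));
        // -> com.google.maps/index.html
    }
}
```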
18. BIGTABLE: ROWS
• Name is an arbitrary string
– Access to data in a row is atomic
– Row creation is implicit upon storing data
• Rows ordered lexicographically
– Rows close together lexicographically are usually
on one or a small number of machines
19. BIGTABLE: TABLETS
• Row range for a table is dynamically
partitioned into tablets
• Tablet holds contiguous range of rows
– Reads over short row ranges are efficient
– Clients can choose row keys to achieve
locality
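Because tablets partition a sorted row space, finding the tablet (and server) responsible for a row key is a floor lookup over tablet start keys. A minimal in-memory sketch of that idea (all names are illustrative, not Bigtable's actual API):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: locating the tablet serving a row key.
// Tablets hold contiguous row ranges; each tablet is identified here by its
// start row. A TreeMap floor lookup finds the covering tablet.
public class TabletLocator {
    private final TreeMap<String, String> tabletStarts = new TreeMap<>();

    void addTablet(String startRow, String server) {
        tabletStarts.put(startRow, server);
    }

    // Greatest tablet start key <= rowKey is the tablet serving that row.
    String serverFor(String rowKey) {
        Map.Entry<String, String> e = tabletStarts.floorEntry(rowKey);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        TabletLocator loc = new TabletLocator();
        loc.addTablet("", "server1");           // tablet covering the start of the table
        loc.addTablet("com.google", "server2"); // tablet starting at "com.google"
        System.out.println(loc.serverFor("com.example/x")); // server1
        System.out.println(loc.serverFor("com.google.maps/index.html")); // server2
    }
}
```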
20. BIGTABLE: COLUMNS
• Columns have two-level name structure
<column_family>:[<column_qualifier>]
• Column family:
– Creation must be explicit
– Has associated type information and other metadata
– Unit of access control
• Column qualifier
– Unbounded number of columns
– Creation of column within a family is implicit at updates
• Timestamps add an additional dimension
21. BIGTABLE: TIMESTAMPS
• Used to store different versions of data in a cell
– New writes default to current time
– Can also be set explicitly by clients
• Lookup options
– Return all values
– Return most recent K values
– Return all values in timestamp range
• Column families can be marked with attributes
– Only retain most recent K values in a cell
– Keep values until they are older than K seconds
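The versioning and garbage-collection behavior above can be sketched as a tiny in-memory cell keyed by timestamp, with the "retain most recent K values" attribute enforced on write (a minimal sketch; all names are hypothetical, not Bigtable's API):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: one Bigtable-style cell holding multiple timestamped versions,
// garbage-collecting all but the most recent K values.
public class VersionedCell {
    private final TreeMap<Long, String> versions = new TreeMap<>();
    private final int maxVersions; // the "retain most recent K values" attribute

    VersionedCell(int maxVersions) { this.maxVersions = maxVersions; }

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions)
            versions.pollFirstEntry(); // drop the oldest retained version
    }

    // Lookup: most recent value at or before the given timestamp.
    String get(long timestamp) {
        Map.Entry<Long, String> e = versions.floorEntry(timestamp);
        return e == null ? null : e.getValue();
    }

    String latest() { return versions.lastEntry().getValue(); }
}
```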
22. BIGTABLE: AT GOOGLE
• Good match for most of our applications:
– Google Earth™
– Google Maps™
– Google Talk™
– Google Finance™
– Orkut™
24. PROGRAMMING MODEL FOR PROCESSING LARGE DATA SETS
MapReduce
25. MAPREDUCE: REQUIREMENTS
• Want to process lots of data ( > 1 TB)
• Want to run it on thousands of commodity PCs
• Must be robust
• … And simple to use
26. MAPREDUCE: DESCRIPTION
• A simple programming model that applies to many large-scale
computing problems
– Based on principles of functional languages
– Scalable, robust
• Hide messy details in MapReduce runtime library:
– automatic parallelization
– load balancing
– network and disk transfer optimization
– handling of machine failures
– robustness
• Improvements to core library benefit all users of library!
27. MAPREDUCE: FUNCTIONAL PROGRAMMING
• Functions don’t change data structures
– They always create new ones
– Input data remain unchanged
• Functions don’t have side effects
• Data flows are implicit in program design
• Order of operations does not matter
z := f(g(x), h(x, y), k(y))
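The expression above illustrates why purity matters: if f, g, h, and k have no side effects, the runtime may evaluate g(x), h(x, y), and k(y) in any order, or in parallel, and z is the same. A small illustration with arbitrary pure functions (the function bodies are made up for the example):

```java
// Sketch: with side-effect-free functions, evaluation order is irrelevant --
// the property MapReduce exploits to parallelize freely.
// The function bodies here are arbitrary placeholders.
public class PureFunctions {
    static int g(int x) { return x + 1; }
    static int h(int x, int y) { return x * y; }
    static int k(int y) { return y - 1; }
    static int f(int a, int b, int c) { return a + b + c; }

    static int z(int x, int y) {
        // Evaluate the arguments in one order...
        int c = k(y);
        int b = h(x, y);
        int a = g(x);
        int z1 = f(a, b, c);
        // ...and in another; pure functions guarantee the same result.
        int z2 = f(g(x), h(x, y), k(y));
        assert z1 == z2;
        return z1;
    }

    public static void main(String[] args) {
        System.out.println(z(2, 3)); // 11
    }
}
```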
28. MAPREDUCE: TYPICAL EXECUTION FLOW
• Read a lot of data
• Map: extract something you care about from each record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Write the results
Outline stays the same, map and reduce change to fit the problem
29. MAPREDUCE: PROGRAMMING INTERFACE
User must implement two functions
Map(input_key, input_value)
→ (output_key, intermediate_value)
Reduce(output_key, intermediate_value_list)
→ output_value_list
30. MAPREDUCE: MAP
• Records from the data source …
– lines out of files
– rows of a database
– etc.
• … are fed into the map function as (key, value) pairs
– filename, line
– etc.
• map produces zero, one or more intermediate values
along with an output key from the input
31. MAPREDUCE: REDUCE
• After the map phase is over, all the
intermediate values for a given output key
are combined together into a list
• reduce combines those intermediate
values into zero, one or more final values
for that same output key
32. MAPREDUCE: EXAMPLE - WORD FREQUENCY 1/5
• Input is files with one document per record
• Specify a map function that takes a key/value pair
– key = document name
– value = document contents
• Output of map function is zero, one or more key/value
pairs
– In our case, output (word, “1”) once per word in the document
33. MAPREDUCE: EXAMPLE - WORD FREQUENCY 2/5
“To be or not to be?”
“document1”
“to”, “1”
“be”, “1”
“or”, “1”
…
34. MAPREDUCE: EXAMPLE - WORD FREQUENCY 3/5
• MapReduce library gathers together all pairs
with the same key
– shuffle/sort
• reduce function combines the values for a key
– In our case, compute the sum
• Output of reduce is zero, one or more values
paired with key and saved
36. MAPREDUCE: EXAMPLE - WORD FREQUENCY 5/5
Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String output_key, Iterator intermediate_values):
  // output_key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
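The pseudocode above can be played out as a single-machine Java sketch: the real library distributes these phases across thousands of workers, but the map → shuffle/sort → reduce structure is the same (class and method names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: in-memory word-frequency MapReduce on one machine.
public class WordCount {
    // map: emit (word, 1) for every word in the document.
    static List<Map.Entry<String, Integer>> map(String docName, String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : contents.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        return out;
    }

    // shuffle/sort: gather all intermediate values sharing a key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // reduce: sum the counts for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> groups =
            shuffle(map("document1", "To be or not to be?"));
        groups.forEach((w, cs) -> System.out.println(w + " " + reduce(w, cs)));
        // be 2 / not 1 / or 1 / to 2
    }
}
```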
39. MAPREDUCE: PARALLEL FLOW 1/2
• map functions run in parallel, creating different
intermediate values from different input data sets
• reduce functions also run in parallel, each
working on a different output key
– All values are processed independently
• Bottleneck
– reduce phase can’t start until map phase is
completely finished
41. MAPREDUCE: WIDELY APPLICABLE
• distributed grep
• distributed sort
• document clustering
• machine learning
• web access log stats
• inverted index construction
• statistical machine translation
• etc.
42. MAPREDUCE: EXAMPLE - LANGUAGE MODEL STATISTICS
• Used in our statistical machine translation system
• Need to count # of times every 5-word sequence occurs
in large corpus of documents (and keep all those where
count >= 4)
• map:
– extract 5-word sequences => count from document
• reduce:
– summarize counts
– keep those where count >= 4
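The map and reduce steps above can be sketched in a few lines of Java: map slides a 5-word window over each document, reduce sums the counts and discards rare sequences (a minimal single-machine sketch; the names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: 5-word-sequence counting from the language-model example.
public class FiveGrams {
    // map step: emit every 5-word sequence found in a document.
    static List<String> fiveGrams(String doc) {
        String[] w = doc.split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 5 <= w.length; i++)
            out.add(String.join(" ", Arrays.copyOfRange(w, i, i + 5)));
        return out;
    }

    // reduce step: total the counts, keep only sequences seen >= threshold times.
    static Map<String, Integer> frequent(List<String> grams, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String g : grams) counts.merge(g, 1, Integer::sum);
        counts.values().removeIf(c -> c < threshold);
        return counts;
    }
}
```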
43. MAPREDUCE: EXAMPLE - JOINING WITH OTHER DATA
• Generate per-doc summary, but include per-host
information (e.g. # of pages on host, important terms on
host)
– per-host information might involve RPC to a set of machines
containing data for all sites
• map:
– extract host name from URL, lookup per-host info, combine with
per-doc data and emit
• reduce:
– identity function (just emit input value directly)
44. MAPREDUCE: FAULT TOLERANCE
• Master detects worker failures
– Re-executes failed map tasks
– Re-executes reduce tasks
• Master notices particular input key/values
cause crashes in map
– Skips those values on re-execution
45. MAPREDUCE: LOCAL OPTIMIZATIONS
• Master program divides up tasks based on
location of data
– tries to have map tasks on same machine as
physical file data, or at least same rack
46. MAPREDUCE: SLOW MAP TASKS
• reduce phase cannot start before the map phase
completes
– One slow disk controller can slow down the whole system
• Master redundantly starts slow-moving map task
– Uses the results of whichever copy finishes first
47. MAPREDUCE: COMBINE
• combine is a mini-reduce phase that runs
on the same machine as map phase
– It aggregates the results of local map phases
– Saves network bandwidth
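The bandwidth saving is easy to see with the word-count workload: a mapper that emits six (word, 1) pairs can combine them into four (word, partial-sum) pairs before the shuffle moves anything across the network. A minimal sketch (names are illustrative):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a combine (mini-reduce) step running on the mapper's own machine.
// It collapses the local (word, 1) pairs into (word, partialSum) pairs,
// so the shuffle transfers partial sums instead of raw pairs.
public class Combiner {
    static Map<String, Integer> combine(List<String> emittedWords) {
        Map<String, Integer> partial = new HashMap<>();
        for (String w : emittedWords) partial.merge(w, 1, Integer::sum);
        return partial;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("to", "be", "or", "not", "to", "be");
        // 6 intermediate pairs shrink to 4 before leaving the machine.
        System.out.println(combine(emitted).size()); // 4
    }
}
```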
48. MAPREDUCE: CONCLUSION
• MapReduce proved to be extremely useful
abstraction
– It greatly simplifies the processing of huge amounts of
data
• MapReduce is easy to use
– Programmer can focus on the problem
– MapReduce takes care of the messy details
49. MAPREDUCE: OPEN SOURCE ALTERNATIVES
• Hadoop (Java)
– http://hadoop.apache.org/
• Disco (Erlang, Python)
– http://discoproject.org/
• etc.
50. LOCK SERVICE FOR LOOSELY-COUPLED DISTRIBUTED SYSTEMS
Chubby
51. CHUBBY: SYNCHRONIZED ACCESS TO SHARED RESOURCES
• Key element of distributed architecture at Google:
– Used by GFS, Bigtable and Mapreduce
• Interface similar to distributed file system with advisory locks
– Access control list
– No links
• Every Chubby file can hold a small amount of data
• Every Chubby file or directory can be used as read or write lock
– Locks are advisory, not mandatory
• Clients must be well-behaved
• A client that does not hold a lock can still read the content of a Chubby file
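"Advisory" means the service records who holds a lock but never blocks access for anyone else: well-behaved clients must check the lock themselves. A minimal in-memory sketch of that semantics (all names are hypothetical, not Chubby's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: advisory locking over small named files, Chubby-style.
// Acquiring a lock only records the holder; reads and writes by other
// clients are never blocked -- cooperation is up to the clients.
public class AdvisoryLocks {
    private final Map<String, String> holders = new HashMap<>();  // file -> client
    private final Map<String, String> contents = new HashMap<>(); // file -> data

    synchronized boolean acquire(String file, String client) {
        return holders.putIfAbsent(file, client) == null;
    }

    synchronized void release(String file, String client) {
        holders.remove(file, client); // only the holder can release
    }

    // Reads succeed whether or not the caller holds the lock (advisory!).
    synchronized String read(String file) { return contents.get(file); }

    synchronized void write(String file, String data) { contents.put(file, data); }
}
```

A typical use (as in GFS/Bigtable master election) is that the winner of `acquire` writes its identity into the file, and everyone else reads it to find the master.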
52. CHUBBY: DESIGN
• Design emphasis not on high performance, but
on availability and reliability
• Reading and writing is atomic
• Chubby service is composed of 5 active replicas
– One of them elected as master
– Requires the majority of replicas to be alive
53. CHUBBY: EVENTS
• Client can subscribe for various events:
– file contents modified
– child node added, removed, or modified
– lock acquired
– conflicting lock request from another client
– etc.
54. REFERENCES
• Bibliography:
– Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 29-43. ACM Press.
– Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2006). Bigtable: A distributed storage system for structured data. In Operating Systems Design and Implementation, pages 205-218.
– Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th symposium on Operating systems design and implementation, pages 137-150.
– Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. In PACT '06: Proceedings of the 15th international conference on Parallel Architectures and Compilation Techniques. ACM.
– Burrows, M. (2006). The Chubby lock service for loosely-coupled distributed systems. In OSDI '06: Proceedings of the 7th symposium on Operating systems design and implementation, pages 335-350.
• Partially based on:
– Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 2: MapReduce
Theory and Implementation. Retrieved September 6, 2008, from
http://code.google.com/edu/submissions/mapreduce-minilecture/lec2-mapred.ppt
– Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. Retrieved
September 6, 2008, from http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf
– Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 3: Distributed
Filesystems. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-
minilecture/lec3-dfs.ppt
– Stokely, M. (2007). Distributed Computing at Google. Retrieved September 6, 2008, from
http://www.swinog.ch/meetings/swinog15/SRE-Recruiting-SwiNOG2007.ppt