2. GOOGLE TECHNOLOGY LAYERS
• Services and Applications
– Google™ search, Gmail™, Ads system, Google Maps™
• Distributed Computing
• Computing Platform
– Commodity PC Hardware, Linux, Physical Network
JavaBlend 2008, http://www.javablend.net/ 2
3. IMPLICATIONS OF GOOGLE ENVIRONMENT
• Single process performance does not matter
– Total throughput is more important
• Stuff breaks
– If you have one server, it may stay up three years
– If you have 10,000 servers, expect to lose ten a day
• “Ultra-reliable” hardware doesn’t really help
– At large scales, reliable hardware still fails, albeit less often
– Software still needs to be fault-tolerant
4. BUILDING BLOCKS OF google.com?
• Distributed data
– Google File System (GFS)
– BigTable
• Job manager
• Distributed computation
– MapReduce
• Distributed lock service
– Chubby
5. SCALABLE DISTRIBUTED FILE SYSTEM
Google File System
(GFS)
6. GFS: REQUIREMENTS
• High component failure rates
– Inexpensive commodity components fail all the time
• Modest number of huge files
– Just a few million files, most of them multi-GB
• Files are write-once, mostly appended to
– Perhaps concurrently
– Large streaming reads
7. GFS: DESIGN DECISIONS
• Files stored as chunks
– Fixed size (64MB)
• Reliability through replication
• Each chunk replicated 3+ times
• Single master to coordinate access, keep metadata
– Simple centralized management
• No data caching
– Little benefit due to large data sets, streaming reads
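The fixed 64 MB chunk size means a client can compute which chunk a byte offset falls in entirely on its own, and ask the master only for metadata. A minimal sketch (class and method names are illustrative, not the actual GFS API):

```java
// Sketch: mapping a file byte offset to a GFS chunk index.
// GFS stores files as fixed-size 64 MB chunks, each replicated 3+ times;
// the names below are hypothetical, chosen only for illustration.
public class GfsChunks {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB

    // A client computes the chunk index locally, then asks the master
    // only for that chunk's handle and replica locations (metadata).
    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(chunkIndex(0));                        // 0 (first chunk)
        System.out.println(chunkIndex(CHUNK_SIZE));               // 1 (second chunk)
        System.out.println(chunkIndex(10L * 1024 * 1024 * 1024)); // 160 (10 GB offset)
    }
}
```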
8. GFS: ARCHITECTURE
Where is the potential weakness of this design?
9. GFS: WEAK POINT - SINGLE MASTER
• From distributed systems theory, a single master is a
– Single point of failure
– Scalability bottleneck
• GFS solutions
– Shadow masters
– Minimize master involvement
• never move data through it, use only for metadata
• large chunk size
• master delegates authority to primary replicas in data mutations (chunk leases)
10. GFS: METADATA
• Global metadata is stored on the master
– File and chunk namespaces
– Mapping from files to chunks
– Locations of each chunk’s replicas
• All metadata kept in memory (~64 bytes / chunk)
• Master has an operation log for persistent logging of
critical metadata updates
– Persistent on local disk
– Replicated
– Checkpoints for faster recovery
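The ~64 bytes/chunk figure makes the in-memory design plausible with back-of-envelope arithmetic: at 64 MB chunks, one petabyte of file data needs about 16.8 million chunks, i.e. roughly 1 GB of master RAM. A small sketch of that calculation (names are illustrative):

```java
// Back-of-envelope check: how much master RAM does chunk metadata need?
// Assumes the slide's figures: 64 MB chunks, ~64 bytes of metadata per chunk.
public class MasterMemory {
    static final long CHUNK_SIZE = 64L << 20; // 64 MB per chunk
    static final long BYTES_PER_CHUNK = 64;   // in-memory metadata per chunk

    static long metadataBytes(long fileDataBytes) {
        long chunks = fileDataBytes / CHUNK_SIZE;
        return chunks * BYTES_PER_CHUNK;
    }

    public static void main(String[] args) {
        long onePetabyte = 1L << 50;
        System.out.println(onePetabyte / CHUNK_SIZE);          // 16777216 chunks
        System.out.println(metadataBytes(onePetabyte) >> 30);  // 1 (GiB of RAM per PB)
    }
}
```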
11. GFS: MUTATIONS
• Mutations must be done for all replicas
• Master picks one replica as primary; gives it a “lease” for mutations
– Primary defines a serial order of mutations
• Data flow decoupled from control flow
12. GFS: OPEN SOURCE ALTERNATIVES
• Hadoop Distributed File System - HDFS (Java)
– http://hadoop.apache.org/core/docs/current/hdfs_design.html
13. DISTRIBUTED STORAGE FOR LARGE STRUCTURED DATA SETS
Bigtable
14. BIGTABLE: REQUIREMENTS
• Want to store petabytes of structured data across
thousands of commodity servers
• Want a simple data format that supports dynamic control
over data layout and format
• Must support very high read/write rates
– millions of operations per second
• Latency requirements:
– backend bulk processing
– real-time data serving
16. BIGTABLE: EXAMPLE
• A web crawling system might use a Bigtable that stores web pages
– Each row key could represent a specific URL
– Columns represent page contents, the references to that page, and
other metadata
– The row range for a table is dynamically partitioned between servers
• Rows are clustered together on machines by key
– Using reversed URLs as keys minimizes the number of machines where
pages from a single domain are stored
– Each cell is timestamped so there could be multiple versions of the
same data in the table
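Reversing the hostname components of a URL makes pages from one domain sort next to each other in the lexicographic row order. A minimal sketch of such a key transformation (class and method names are hypothetical):

```java
// Sketch: building a Bigtable-style row key from a URL.
// Reversing the hostname ("maps.google.com" -> "com.google.maps") clusters
// all pages of a domain into a contiguous lexicographic row range.
public class RowKeys {
    static String rowKey(String host, String path) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.append(path).toString();
    }

    public static void main(String[] args) {
        System.out.println(rowKey("maps.google.com", "/index.html"));
        // -> com.google.maps/index.html
    }
}
```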
18. BIGTABLE: ROWS
• Name is an arbitrary string
– Access to data in a row is atomic
– Row creation is implicit upon storing data
• Rows ordered lexicographically
– Rows close together lexicographically are usually
on one or a small number of machines
19. BIGTABLE: TABLETS
• Row range for a table is dynamically
partitioned into tablets
• Tablet holds contiguous range of rows
– Reads over short row ranges are efficient
– Clients can choose row keys to achieve
locality
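Because tablets partition a sorted row space, finding the tablet (and server) responsible for a row key is a floor lookup over tablet start keys. A minimal in-memory sketch of that idea (all names are illustrative, not Bigtable's actual API):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: locating the tablet serving a row key.
// Tablets hold contiguous row ranges; each tablet is identified here by its
// start row. A TreeMap floor lookup finds the covering tablet.
public class TabletLocator {
    private final TreeMap<String, String> tabletStarts = new TreeMap<>();

    void addTablet(String startRow, String server) {
        tabletStarts.put(startRow, server);
    }

    // Greatest tablet start key <= rowKey is the tablet serving that row.
    String serverFor(String rowKey) {
        Map.Entry<String, String> e = tabletStarts.floorEntry(rowKey);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        TabletLocator loc = new TabletLocator();
        loc.addTablet("", "server1");           // tablet covering the start of the table
        loc.addTablet("com.google", "server2"); // tablet starting at "com.google"
        System.out.println(loc.serverFor("com.example/x")); // server1
        System.out.println(loc.serverFor("com.google.maps/index.html")); // server2
    }
}
```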
20. BIGTABLE: COLUMNS
• Columns have two-level name structure
<column_family>:[<column_qualifier>]
• Column family:
– Creation must be explicit
– Has associated type information and other metadata
– Unit of access control
• Column qualifier
– Unbounded number of columns
– Creation of column within a family is implicit at updates
• Timestamps add an additional dimension
21. BIGTABLE: TIMESTAMPS
• Used to store different versions of data in a cell
– New writes default to current time
– Can also be set explicitly by clients
• Lookup options
– Return all values
– Return most recent K values
– Return all values in timestamp range
• Column families can be marked with attributes
– Only retain most recent K values in a cell
– Keep values until they are older than K seconds
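The versioning and garbage-collection behavior above can be sketched as a tiny in-memory cell keyed by timestamp, with the "retain most recent K values" attribute enforced on write (a minimal sketch; all names are hypothetical, not Bigtable's API):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: one Bigtable-style cell holding multiple timestamped versions,
// garbage-collecting all but the most recent K values.
public class VersionedCell {
    private final TreeMap<Long, String> versions = new TreeMap<>();
    private final int maxVersions; // the "retain most recent K values" attribute

    VersionedCell(int maxVersions) { this.maxVersions = maxVersions; }

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions)
            versions.pollFirstEntry(); // drop the oldest retained version
    }

    // Lookup: most recent value at or before the given timestamp.
    String get(long timestamp) {
        Map.Entry<Long, String> e = versions.floorEntry(timestamp);
        return e == null ? null : e.getValue();
    }

    String latest() { return versions.lastEntry().getValue(); }
}
```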
22. BIGTABLE: AT GOOGLE
• Good match for most of our applications:
– Google Earth™
– Google Maps™
– Google Talk™
– Google Finance™
– Orkut™
24. PROGRAMMING MODEL FOR PROCESSING LARGE DATA SETS
MapReduce
25. MAPREDUCE: REQUIREMENTS
• Want to process lots of data ( > 1 TB)
• Want to run it on thousands of commodity PCs
• Must be robust
• … And simple to use
26. MAPREDUCE: DESCRIPTION
• A simple programming model that applies to many large-scale
computing problems
– Based on principles of functional languages
– Scalable, robust
• Hide messy details in MapReduce runtime library:
– automatic parallelization
– load balancing
– network and disk transfer optimization
– handling of machine failures
– robustness
• Improvements to core library benefit all users of library!
27. MAPREDUCE: FUNCTIONAL PROGRAMMING
• Functions don’t change data structures
– They always create new ones
– Input data remain unchanged
• Functions don’t have side effects
• Data flows are implicit in program design
• Order of operations does not matter
z := f(g(x), h(x, y), k(y))
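The expression above illustrates why purity matters: if f, g, h, and k have no side effects, the runtime may evaluate g(x), h(x, y), and k(y) in any order, or in parallel, and z is the same. A small illustration with arbitrary pure functions (the function bodies are made up for the example):

```java
// Sketch: with side-effect-free functions, evaluation order is irrelevant --
// the property MapReduce exploits to parallelize freely.
// The function bodies here are arbitrary placeholders.
public class PureFunctions {
    static int g(int x) { return x + 1; }
    static int h(int x, int y) { return x * y; }
    static int k(int y) { return y - 1; }
    static int f(int a, int b, int c) { return a + b + c; }

    static int z(int x, int y) {
        // Evaluate the arguments in one order...
        int c = k(y);
        int b = h(x, y);
        int a = g(x);
        int z1 = f(a, b, c);
        // ...and in another; pure functions guarantee the same result.
        int z2 = f(g(x), h(x, y), k(y));
        assert z1 == z2;
        return z1;
    }

    public static void main(String[] args) {
        System.out.println(z(2, 3)); // 11
    }
}
```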
28. MAPREDUCE: TYPICAL EXECUTION FLOW
• Read a lot of data
• Map: extract something you care about from each record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Write the results
Outline stays the same, map and reduce change to fit the problem
29. MAPREDUCE: PROGRAMMING INTERFACE
User must implement two functions
Map(input_key, input_value)
→ (output_key, intermediate_value)
Reduce(output_key, intermediate_value_list)
→ output_value_list
30. MAPREDUCE: MAP
• Records from the data source …
– lines out of files
– rows of a database
– etc.
• … are fed into the map function as (key, value) pairs
– filename, line
– etc.
• map produces zero, one or more intermediate values
along with an output key from the input
31. MAPREDUCE: REDUCE
• After the map phase is over, all the
intermediate values for a given output key
are combined together into a list
• reduce combines those intermediate
values into zero, one or more final values
for that same output key
32. MAPREDUCE: EXAMPLE - WORD FREQUENCY 1/5
• Input is files with one document per record
• Specify a map function that takes a key/value pair
– key = document name
– value = document contents
• Output of map function is zero, one or more key/value
pairs
– In our case, output (word, “1”) once per word in the document
33. MAPREDUCE: EXAMPLE - WORD FREQUENCY 2/5
“To be or not to be?”
“document1”
“to”, “1”
“be”, “1”
“or”, “1”
…
34. MAPREDUCE: EXAMPLE - WORD FREQUENCY 3/5
• MapReduce library gathers together all pairs
with the same key
– shuffle/sort
• reduce function combines the values for a key
– In our case, compute the sum
• Output of reduce is zero, one or more values
paired with key and saved
36. MAPREDUCE: EXAMPLE - WORD FREQUENCY 5/5
Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String output_key, Iterator intermediate_values):
  // output_key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
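The pseudocode above can be played out as a single-machine Java sketch: the real library distributes these phases across thousands of workers, but the map → shuffle/sort → reduce structure is the same (class and method names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: in-memory word-frequency MapReduce on one machine.
public class WordCount {
    // map: emit (word, 1) for every word in the document.
    static List<Map.Entry<String, Integer>> map(String docName, String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : contents.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        return out;
    }

    // shuffle/sort: gather all intermediate values sharing a key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // reduce: sum the counts for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> groups =
            shuffle(map("document1", "To be or not to be?"));
        groups.forEach((w, cs) -> System.out.println(w + " " + reduce(w, cs)));
        // be 2 / not 1 / or 1 / to 2
    }
}
```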
39. MAPREDUCE: PARALLEL FLOW 1/2
• map functions run in parallel, creating different
intermediate values from different input data sets
• reduce functions also run in parallel, each
working on a different output key
– All values are processed independently
• Bottleneck
– reduce phase can’t start until map phase is
completely finished
41. MAPREDUCE: WIDELY APPLICABLE
• distributed grep
• distributed sort
• document clustering
• machine learning
• web access log stats
• inverted index construction
• statistical machine translation
• etc.
42. MAPREDUCE: EXAMPLE - LANGUAGE MODEL STATISTICS
• Used in our statistical machine translation system
• Need to count # of times every 5-word sequence occurs
in large corpus of documents (and keep all those where
count >= 4)
• map:
– extract 5-word sequences => count from document
• reduce:
– summarize counts
– keep those where count >= 4
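The map and reduce steps above can be sketched in a few lines of Java: map slides a 5-word window over each document, reduce sums the counts and discards rare sequences (a minimal single-machine sketch; the names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: 5-word-sequence counting from the language-model example.
public class FiveGrams {
    // map step: emit every 5-word sequence found in a document.
    static List<String> fiveGrams(String doc) {
        String[] w = doc.split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 5 <= w.length; i++)
            out.add(String.join(" ", Arrays.copyOfRange(w, i, i + 5)));
        return out;
    }

    // reduce step: total the counts, keep only sequences seen >= threshold times.
    static Map<String, Integer> frequent(List<String> grams, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String g : grams) counts.merge(g, 1, Integer::sum);
        counts.values().removeIf(c -> c < threshold);
        return counts;
    }
}
```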
43. MAPREDUCE: EXAMPLE - JOINING WITH OTHER DATA
• Generate per-doc summary, but include per-host
information (e.g. # of pages on host, important terms on
host)
– per-host information might involve RPC to a set of machines
containing data for all sites
• map:
– extract host name from URL, lookup per-host info, combine with
per-doc data and emit
• reduce:
– identity function (just emit input value directly)
44. MAPREDUCE: FAULT TOLERANCE
• Master detects worker failures
– Re-executes failed map tasks
– Re-executes reduce tasks
• Master notices particular input key/values
cause crashes in map
– Skips those values on re-execution
45. MAPREDUCE: LOCAL OPTIMIZATIONS
• Master program divides up tasks based on
location of data
– tries to have map tasks on same machine as
physical file data, or at least same rack
46. MAPREDUCE: SLOW MAP TASKS
• reduce phase cannot start before the map phase
completes
– One slow disk controller can slow down the whole system
• Master redundantly starts slow-moving map task
– Uses the results of whichever copy finishes first
47. MAPREDUCE: COMBINE
• combine is a mini-reduce phase that runs
on the same machine as map phase
– It aggregates the results of local map phases
– Saves network bandwidth
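The bandwidth saving is easy to see with the word-count workload: a mapper that emits six (word, 1) pairs can combine them into four (word, partial-sum) pairs before the shuffle moves anything across the network. A minimal sketch (names are illustrative):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a combine (mini-reduce) step running on the mapper's own machine.
// It collapses the local (word, 1) pairs into (word, partialSum) pairs,
// so the shuffle transfers partial sums instead of raw pairs.
public class Combiner {
    static Map<String, Integer> combine(List<String> emittedWords) {
        Map<String, Integer> partial = new HashMap<>();
        for (String w : emittedWords) partial.merge(w, 1, Integer::sum);
        return partial;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("to", "be", "or", "not", "to", "be");
        // 6 intermediate pairs shrink to 4 before leaving the machine.
        System.out.println(combine(emitted).size()); // 4
    }
}
```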
48. MAPREDUCE: CONCLUSION
• MapReduce proved to be extremely useful
abstraction
– It greatly simplifies the processing of huge amounts of
data
• MapReduce is easy to use
– Programmer can focus on the problem
– MapReduce takes care of the messy details
49. MAPREDUCE: OPEN SOURCE ALTERNATIVES
• Hadoop (Java)
– http://hadoop.apache.org/
• Disco (Erlang, Python)
– http://discoproject.org/
• etc.
50. LOCK SERVICE FOR LOOSELY-COUPLED DISTRIBUTED SYSTEMS
Chubby
51. CHUBBY: SYNCHRONIZED ACCESS TO SHARED RESOURCES
• Key element of distributed architecture at Google:
– Used by GFS, Bigtable and Mapreduce
• Interface similar to distributed file system with advisory locks
– Access control list
– No links
• Every Chubby file can hold a small amount of data
• Every Chubby file or directory can be used as read or write lock
– Locks are advisory, not mandatory
• Clients must be well-behaved
• A client that does not hold a lock can still read the content of a Chubby file
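"Advisory" means the service records who holds a lock but never blocks access for anyone else: well-behaved clients must check the lock themselves. A minimal in-memory sketch of that semantics (all names are hypothetical, not Chubby's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: advisory locking over small named files, Chubby-style.
// Acquiring a lock only records the holder; reads and writes by other
// clients are never blocked -- cooperation is up to the clients.
public class AdvisoryLocks {
    private final Map<String, String> holders = new HashMap<>();  // file -> client
    private final Map<String, String> contents = new HashMap<>(); // file -> data

    synchronized boolean acquire(String file, String client) {
        return holders.putIfAbsent(file, client) == null;
    }

    synchronized void release(String file, String client) {
        holders.remove(file, client); // only the holder can release
    }

    // Reads succeed whether or not the caller holds the lock (advisory!).
    synchronized String read(String file) { return contents.get(file); }

    synchronized void write(String file, String data) { contents.put(file, data); }
}
```

A typical use (as in GFS/Bigtable master election) is that the winner of `acquire` writes its identity into the file, and everyone else reads it to find the master.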
52. CHUBBY: DESIGN
• Design emphasis not on high performance, but
on availability and reliability
• Reading and writing is atomic
• Chubby service is composed of 5 active replicas
– One of them elected as master
– Requires the majority of replicas to be alive
53. CHUBBY: EVENTS
• Client can subscribe for various events:
– file contents modified
– child node added, removed, or modified
– lock acquired
– conflicting lock request from another client
– etc.
54. REFERENCES
• Bibliography:
– Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 29-43. ACM Press.
– Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2006). Bigtable: A distributed storage system for structured data. In Operating Systems Design and Implementation, pages 205-218.
– Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th symposium on Operating systems design and implementation, pages 137-150.
– Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. In PACT '06: Proceedings of the 15th international conference on Parallel Architectures and Compilation Techniques. ACM.
– Burrows, M. (2006). The Chubby lock service for loosely-coupled distributed systems. In OSDI '06: Proceedings of the 7th symposium on Operating systems design and implementation, pages 335-350.
• Partially based on:
– Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 2: MapReduce
Theory and Implementation. Retrieved September 6, 2008, from
http://code.google.com/edu/submissions/mapreduce-minilecture/lec2-mapred.ppt
– Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. Retrieved
September 6, 2008, from http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf
– Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 3: Distributed
Filesystems. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-
minilecture/lec3-dfs.ppt
– Stokely, M. (2007). Distributed Computing at Google. Retrieved September 6, 2008, from
http://www.swinog.ch/meetings/swinog15/SRE-Recruiting-SwiNOG2007.ppt