1. DISTRIBUTED COMPUTING IN PRACTICE
   GFS, BIGTABLE, MAPREDUCE, CHUBBY

   Dominik Roblek
   Software Engineer
   Google Inc.
2. GOOGLE TECHNOLOGY LAYERS
   (layers, top to bottom)
 • Services and Applications: Google™ search, Gmail™, Ads system, Google Maps™
 • Distributed Computing
 • Computing Platform: Commodity PC Hardware, Linux
 • Physical Network

   JavaBlend 2008, http://www.javablend.net/
3. IMPLICATIONS OF GOOGLE ENVIRONMENT
 • Single-process performance does not matter
    – Total throughput is more important
 • Stuff breaks
    – If you have one server, it may stay up three years
    – If you have 10,000 servers, expect to lose ten a day
 • “Ultra-reliable” hardware doesn’t really help
    – At large scale, reliable hardware still fails, albeit less often
    – Software still needs to be fault-tolerant

4. BUILDING BLOCKS OF google.com?
 • Distributed data
    – Google File System (GFS)
    – Bigtable
 • Job manager
 • Distributed computation
    – MapReduce
 • Distributed lock service
    – Chubby

5. SCALABLE DISTRIBUTED FILE SYSTEM
   Google File System (GFS)
6. GFS: REQUIREMENTS
 • High component failure rates
    – Inexpensive commodity components fail all the time
 • Modest number of huge files
    – Just a few million, most of them multi-GB
 • Files are write-once, mostly appended to
    – Perhaps concurrently
 • Large streaming reads
7. GFS: DESIGN DECISIONS
 • Files stored as chunks
    – Fixed size (64 MB)
 • Reliability through replication
    – Each chunk replicated 3+ times
 • Single master to coordinate access, keep metadata
    – Simple centralized management
 • No data caching
    – Little benefit due to large data sets, streaming reads
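One consequence of the fixed 64 MB chunk size is that locating a chunk for a given byte offset is plain arithmetic on the client side. A minimal sketch (the function name is ours, not part of any GFS API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed GFS chunk size

def chunk_index(offset):
    """Every byte offset within a file falls into exactly one chunk."""
    return offset // CHUNK_SIZE

# A 1 GiB file spans exactly 16 chunks; byte 200,000,000 is in chunk 2.
assert chunk_index(200_000_000) == 2
assert chunk_index(1024**3 - 1) == 15
```

The client asks the master only for the replica locations of that chunk index; the data itself never flows through the master.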
8. GFS: ARCHITECTURE
   Where is a potential weakness of this design?
9. GFS: WEAK POINT - SINGLE MASTER
 • From distributed systems we know this is a
    – Single point of failure
    – Scalability bottleneck
 • GFS solutions
    – Shadow masters
    – Minimize master involvement
       • never move data through it; use it only for metadata
       • large chunk size
       • master delegates authority to primary replicas in data mutations (chunk leases)
10. GFS: METADATA
 • Global metadata is stored on the master
    – File and chunk namespaces
    – Mapping from files to chunks
 • Locations of each chunk’s replicas
    – All in memory (64 bytes / chunk)
 • Master has an operation log for persistent logging of critical metadata updates
    – Persistent on local disk
    – Replicated
    – Checkpoints for faster recovery
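The 64 bytes/chunk figure is what makes a single in-memory master plausible; a back-of-the-envelope calculation (the helper is ours, for illustration):

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks
METADATA_PER_CHUNK = 64         # bytes of master memory per chunk

def master_memory_bytes(total_file_bytes):
    """Rough lower bound on master RAM: one metadata record per chunk."""
    chunks = -(-total_file_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * METADATA_PER_CHUNK

# 1 PiB of file data costs the master only ~1 GiB of metadata memory.
assert master_memory_bytes(1024**5) == 1024**3
```

So even petabyte-scale deployments keep all metadata resident, which is why metadata operations stay fast despite centralization.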
11. GFS: MUTATIONS
 • Mutations must be done for all replicas
 • Master picks one replica as primary; gives it a “lease” for mutations
    – Primary defines a serial order of mutations
 • Data flow decoupled from control flow

12. GFS: OPEN SOURCE ALTERNATIVES
 • Hadoop Distributed File System - HDFS (Java)
    – http://hadoop.apache.org/core/docs/current/hdfs_design.html

13. DISTRIBUTED STORAGE FOR LARGE STRUCTURED DATA SETS
   Bigtable

14. BIGTABLE: REQUIREMENTS
 • Want to store petabytes of structured data across thousands of commodity servers
 • Want a simple data format that supports dynamic control over data layout and format
 • Must support very high read/write rates
    – millions of operations per second
 • Latency requirements:
    – backend bulk processing
    – real-time data serving
15. BIGTABLE: STRUCTURE
 • Bigtable is a multi-dimensional map:
    – sparse
    – persistent
    – distributed
 • Key:
    – Row name
    – Column name
    – Timestamp
 • Value:
    – array of bytes

   (rowName: string, columnName: string, timestamp: long) → byte[]
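The signature above can be mimicked with an ordinary dictionary; a toy in-memory model (illustration only, not the real Bigtable client API):

```python
# Sparse map from (row, column, timestamp) to an uninterpreted byte string.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def get(row, column, timestamp):
    return table.get((row, column, timestamp))

put("com.cnn.www", "contents:", 6, b"<html>...")
assert get("com.cnn.www", "contents:", 6) == b"<html>..."
# "Sparse" means absent cells simply do not exist, costing nothing:
assert get("com.cnn.www", "contents:", 7) is None
```

The real system adds what the dictionary cannot: persistence, distribution across tablet servers, and atomicity per row.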
16. BIGTABLE: EXAMPLE
 • A web crawling system might use a Bigtable that stores web pages
    – Each row key could represent a specific URL
    – Columns represent page contents, the references to that page, and other metadata
    – The row range for a table is dynamically partitioned between servers
 • Rows are clustered together on machines by key
    – Using reversed URLs as keys minimizes the number of machines where pages from a single domain are stored
    – Each cell is timestamped, so there can be multiple versions of the same data in the table
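The reversed-URL trick works because rows are sorted lexicographically; reversing the hostname makes the domain the most significant sort component. A small sketch:

```python
def row_key(url_host):
    """Reverse hostname components so pages from one domain sort together."""
    return ".".join(reversed(url_host.split(".")))

assert row_key("www.cnn.com") == "com.cnn.www"

# Lexicographic order now groups each domain's pages contiguously:
hosts = ["maps.google.com", "www.cnn.com", "money.cnn.com"]
assert sorted(row_key(h) for h in hosts) == [
    "com.cnn.money", "com.cnn.www", "com.google.maps"]
```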
17. BIGTABLE: EXAMPLE

                    “contents:”      “anchor:cnnsi.com”   “anchor:my.look.ca”
   “com.cnn.www”    “<html>…” (t3)   “CNN” (t9)           “CNN.com” (t8)
                    “<html>…” (t5)
                    “<html>…” (t6)

18. BIGTABLE: ROWS
 • Name is an arbitrary string
    – Access to data in a row is atomic
    – Row creation is implicit upon storing data
 • Rows ordered lexicographically
    – Rows close together lexicographically are usually on one or a small number of machines

19. BIGTABLE: TABLETS
 • Row range for a table is dynamically partitioned into tablets
 • Tablet holds a contiguous range of rows
    – Reads over short row ranges are efficient
    – Clients can choose row keys to achieve locality
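Because each tablet owns a contiguous, sorted row range, finding the tablet for a key is a binary search over the range boundaries. A sketch with hypothetical boundaries (the data and function are ours, for illustration):

```python
import bisect

# Hypothetical split points: tablet i serves row keys in
# [boundaries[i-1], boundaries[i]) under lexicographic order.
boundaries = ["com.cnn", "com.google", "org.wikipedia"]

def tablet_for_row(key):
    """Binary-search the sorted boundaries to find the serving tablet."""
    return bisect.bisect_right(boundaries, key)

assert tablet_for_row("com.apple") == 0    # before "com.cnn"
assert tablet_for_row("com.cnn.www") == 1  # between "com.cnn" and "com.google"
```

This is also why short row-range scans are cheap: neighboring keys usually live in the same tablet, hence on the same server.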
20. BIGTABLE: COLUMNS
 • Columns have a two-level name structure:
      <column_family>:[<column_qualifier>]
 • Column family:
    – Creation must be explicit
    – Has associated type information and other metadata
    – Unit of access control
 • Column qualifier:
    – Unbounded number of columns
    – Creation of a column within a family is implicit at update time
 • Additional dimensions

21. BIGTABLE: TIMESTAMPS
 • Used to store different versions of data in a cell
    – New writes default to the current time
    – Can also be set explicitly by clients
 • Lookup options:
    – Return all values
    – Return the most recent K values
    – Return all values in a timestamp range
 • Column families can be marked with attributes:
    – Only retain the most recent K values in a cell
    – Keep values until they are older than K seconds
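The "most recent K values" lookup and the "retain K" garbage-collection attribute are the same operation viewed from two sides; a toy sketch over a list of versions (not the real API):

```python
# A toy cell: a list of (timestamp, value) versions.
cell = [(3, b"v-old"), (6, b"v-mid"), (9, b"v-new")]

def most_recent(versions, k):
    """Return the k newest versions, newest first."""
    return sorted(versions, key=lambda tv: tv[0], reverse=True)[:k]

assert most_recent(cell, 2) == [(9, b"v-new"), (6, b"v-mid")]
# A family marked "retain most recent 2" would eventually discard (3, b"v-old").
```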
22. BIGTABLE: AT GOOGLE
 • Good match for most of our applications:
    – Google Earth™
    – Google Maps™
    – Google Talk™
    – Google Finance™
    – Orkut™

23. BIGTABLE: OPEN SOURCE ALTERNATIVES
 • HBase (Java)
    – http://hadoop.apache.org/hbase/
 • Hypertable (C++)
    – http://www.hypertable.org/

24. PROGRAMMING MODEL FOR PROCESSING LARGE DATA SETS
   MapReduce

25. MAPREDUCE: REQUIREMENTS
 • Want to process lots of data (> 1 TB)
 • Want to run it on thousands of commodity PCs
 • Must be robust
 • … and simple to use
26. MAPREDUCE: DESCRIPTION
 • A simple programming model that applies to many large-scale computing problems
    – Based on principles of functional languages
    – Scalable, robust
 • Hide messy details in the MapReduce runtime library:
    – automatic parallelization
    – load balancing
    – network and disk transfer optimization
    – handling of machine failures
    – robustness
 • Improvements to the core library benefit all users of the library!

27. MAPREDUCE: FUNCTIONAL PROGRAMMING
 • Functions don’t change data structures
    – They always create new ones
    – Input data remain unchanged
 • Functions don’t have side effects
 • Data flows are implicit in program design
 • Order of operations does not matter

      z := f(g(x), h(x, y), k(y))
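The expression above is why purity matters for MapReduce: since g, h, and k have no side effects, the runtime is free to evaluate them in any order, or concurrently. A small demonstration (the functions are ours, chosen arbitrarily):

```python
from concurrent.futures import ThreadPoolExecutor

# Pure functions: results depend only on arguments, so the three
# sub-expressions of z = f(g(x), h(x, y), k(y)) may run in any order.
def g(x): return x + 1
def h(x, y): return x * y
def k(y): return y - 1
def f(a, b, c): return a + b + c

x, y = 2, 5
with ThreadPoolExecutor() as pool:
    fg, fh, fk = pool.submit(g, x), pool.submit(h, x, y), pool.submit(k, y)
    z = f(fg.result(), fh.result(), fk.result())

# Concurrent evaluation gives the same answer as sequential evaluation.
assert z == f(g(x), h(x, y), k(y)) == 17
```

map invocations over different records are exactly this situation at scale, which is what makes automatic parallelization safe.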
28. MAPREDUCE: TYPICAL EXECUTION FLOW
 • Read a lot of data
 • Map: extract something you care about from each record
 • Shuffle and sort
 • Reduce: aggregate, summarize, filter, or transform
 • Write the results

   The outline stays the same; map and reduce change to fit the problem
29. MAPREDUCE: PROGRAMMING INTERFACE

   User must implement two functions:

   Map(input_key, input_value)
     → (output_key, intermediate_value)

   Reduce(output_key, intermediate_value_list)
     → output_value_list

30. MAPREDUCE: MAP
 • Records from the data source …
    – lines out of files
    – rows of a database
    – etc.
 • … are fed into the map function as (key, value) pairs
    – (filename, line)
    – etc.
 • map produces zero, one, or more intermediate values along with an output key from the input
31. MAPREDUCE: REDUCE
 • After the map phase is over, all the intermediate values for a given output key are combined together into a list
 • reduce combines those intermediate values into zero, one, or more final values for that same output key

32. MAPREDUCE: EXAMPLE - WORD FREQUENCY 1/5
 • Input is files with one document per record
 • Specify a map function that takes a key/value pair
    – key = document name
    – value = document contents
 • Output of the map function is zero, one, or more key/value pairs
    – In our case, output (word, “1”) once per word in the document
33. MAPREDUCE: EXAMPLE - WORD FREQUENCY 2/5

   map(“document1”, “To be or not to be?”)
     → (“to”, “1”), (“be”, “1”), (“or”, “1”), …
34. MAPREDUCE: EXAMPLE - WORD FREQUENCY 3/5
 • MapReduce library gathers together all pairs with the same key
    – shuffle/sort
 • reduce function combines the values for a key
    – In our case, compute the sum
 • Output of reduce is zero, one, or more values, paired with the key and saved
35. MAPREDUCE: EXAMPLE - WORD FREQUENCY 4/5

   key = “be”           key = “not”    key = “or”     key = “to”
   values = “1”, “1”    values = “1”   values = “1”   values = “1”, “1”

   “be”, “2”            “not”, “1”     “or”, “1”      “to”, “2”
36. MAPREDUCE: EXAMPLE - WORD FREQUENCY 5/5

   Map(String input_key, String input_value):
     // input_key: document name
     // input_value: document contents
     for each word w in input_value:
       EmitIntermediate(w, "1");

   Reduce(String output_key, Iterator intermediate_values):
     // output_key: a word, same for input and output
     // intermediate_values: a list of counts
     int result = 0;
     for each v in intermediate_values:
       result += ParseInt(v);
     Emit(AsString(result));
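The pseudocode above can be turned into a runnable single-process simulation; the map/shuffle/reduce driver below is ours (a dict stands in for the distributed shuffle), but the two user functions mirror the slide:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """Emit (word, 1) once per word in the document."""
    for word in contents.replace("?", "").lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Sum the counts emitted for one word."""
    return (word, sum(counts))

def run(documents):
    shuffled = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            shuffled[key].append(value)          # shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in sorted(shuffled.items()))

result = run({"document1": "To be or not to be?"})
assert result == {"be": 2, "not": 1, "or": 1, "to": 2}
```

Everything the real library adds, partitioning, distribution, fault tolerance, lives in `run`; the user-visible contract stays these two functions.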
37. MAPREDUCE: DISTRIBUTED EXECUTION

38. MAPREDUCE: LOGICAL FLOW

39. MAPREDUCE: PARALLEL FLOW 1/2
 • map functions run in parallel, creating different intermediate values from different input data sets
 • reduce functions also run in parallel, each working on a different output key
    – All values are processed independently
 • Bottleneck
    – reduce phase can’t start until the map phase is completely finished

40. MAPREDUCE: PARALLEL FLOW 2/2

41. MAPREDUCE: WIDELY APPLICABLE
 • distributed grep
 • distributed sort
 • document clustering
 • machine learning
 • web access log stats
 • inverted index construction
 • statistical machine translation
 • etc.
42. MAPREDUCE: EXAMPLE - LANGUAGE MODEL STATISTICS
 • Used in our statistical machine translation system
 • Need to count # of times every 5-word sequence occurs in a large corpus of documents (and keep all those where count >= 4)
 • map:
    – extract 5-word sequences => count from document
 • reduce:
    – summarize counts
    – keep those where count >= 4
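The map step for this job is a sliding window over each document's words; a sketch (function name and threshold handling are ours, for illustration):

```python
def map_ngrams(doc_name, text, n=5):
    """Emit (5-word sequence, 1) for every window in the document."""
    words = text.split()
    for i in range(len(words) - n + 1):
        yield (" ".join(words[i:i + n]), 1)

pairs = list(map_ngrams("doc1", "the quick brown fox jumps over the lazy dog"))
assert len(pairs) == 5                                   # 9 words -> 5 windows
assert pairs[0] == ("the quick brown fox jumps", 1)
# The reduce step then sums the 1s per sequence and drops counts below 4.
```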
43. MAPREDUCE: EXAMPLE - JOINING WITH OTHER DATA
 • Generate a per-doc summary, but include per-host information (e.g. # of pages on host, important terms on host)
    – per-host information might involve an RPC to a set of machines containing data for all sites
 • map:
    – extract host name from URL, look up per-host info, combine with per-doc data and emit
 • reduce:
    – identity function (just emit input value directly)

44. MAPREDUCE: FAULT TOLERANCE
 • Master detects worker failures
    – Re-executes failed map tasks
    – Re-executes failed reduce tasks
 • Master notices that particular input key/values cause crashes in map
    – Skips those values on re-execution
45. MAPREDUCE: LOCALITY OPTIMIZATIONS
 • Master program divides up tasks based on the location of data
    – Tries to have map tasks on the same machine as the physical file data, or at least the same rack
46. MAPREDUCE: SLOW MAP TASKS
 • reduce phase cannot start before the map phase completes
    – One slow disk controller can slow down the whole system
 • Master redundantly starts slow-moving map tasks
    – Uses the results of whichever copy finishes first
47. MAPREDUCE: COMBINE
 • combine is a mini-reduce phase that runs on the same machine as the map phase
    – It aggregates the results of local map phases
    – Saves network bandwidth
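For word count, a combiner can pre-sum on the map machine so only one pair per distinct word crosses the network; a sketch (the function is ours, for illustration):

```python
from collections import defaultdict

def combine(local_pairs):
    """Mini-reduce on the map machine: pre-sum counts per key."""
    sums = defaultdict(int)
    for word, count in local_pairs:
        sums[word] += count
    return list(sums.items())

local = [("to", 1), ("be", 1), ("to", 1)]
# 3 pairs shrink to 2 before hitting the network; the final reduce
# still sums correctly because addition is associative and commutative.
assert dict(combine(local)) == {"to": 2, "be": 1}
```

Note the correctness condition this relies on: combining is only safe when the reduce function is associative and commutative, as summation is.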
48. MAPREDUCE: CONCLUSION
 • MapReduce proved to be an extremely useful abstraction
    – It greatly simplifies the processing of huge amounts of data
 • MapReduce is easy to use
    – Programmer can focus on the problem
    – MapReduce takes care of the messy details
49. MAPREDUCE: OPEN SOURCE ALTERNATIVES
 • Hadoop (Java)
    – http://hadoop.apache.org/
 • Disco (Erlang, Python)
    – http://discoproject.org/
 • etc.

50. LOCK SERVICE FOR LOOSELY-COUPLED DISTRIBUTED SYSTEMS
   Chubby
51. CHUBBY: SYNCHRONIZED ACCESS TO SHARED RESOURCES
 • Key element of distributed architecture at Google:
    – Used by GFS, Bigtable, and MapReduce
 • Interface similar to a distributed file system with advisory locks
    – Access control list
    – No links
 • Every Chubby file can hold a small amount of data
 • Every Chubby file or directory can be used as a read or write lock
    – Locks are advisory, not mandatory
       • Clients must be well-behaved
       • A client that does not hold a lock can still read the content of a Chubby file
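"Advisory" means the service tracks lock ownership but never enforces it on reads; clients are expected to cooperate. A toy illustration of that semantics (class and method names are ours, not the Chubby API):

```python
class AdvisoryLockService:
    """Toy model: files hold small data blobs; locks are tracked, not enforced."""

    def __init__(self):
        self.contents = {}   # path -> small data blob
        self.holders = {}    # path -> client currently holding the lock

    def acquire(self, path, client):
        if self.holders.get(path) not in (None, client):
            return False     # held by someone else; well-behaved clients back off
        self.holders[path] = client
        return True

    def read(self, path):
        # Advisory: no lock check here, any client may read at any time.
        return self.contents.get(path)

svc = AdvisoryLockService()
svc.contents["/ls/cell/master"] = b"master=serverA"
assert svc.acquire("/ls/cell/master", "clientA")
assert not svc.acquire("/ls/cell/master", "clientB")       # lock is held
assert svc.read("/ls/cell/master") == b"master=serverA"    # but reads succeed
```

This combination, a tiny file plus an advisory lock on it, is how systems like GFS use Chubby for master election: whoever holds the lock writes its identity into the file.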
52. CHUBBY: DESIGN
 • Design emphasis is not on high performance, but on availability and reliability
 • Reading and writing are atomic
 • Chubby service is composed of 5 active replicas
    – One of them is elected as master
    – Requires a majority of replicas to be alive
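The five-replica figure implies a concrete fault budget, since a majority quorum must stay alive:

```python
# With 5 replicas, the smallest majority is 3, so the service keeps
# operating through any 2 simultaneous replica failures.
replicas = 5
quorum = replicas // 2 + 1
tolerated_failures = replicas - quorum

assert quorum == 3
assert tolerated_failures == 2
```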
53. CHUBBY: EVENTS
 • Clients can subscribe to various events:
    – file contents modified
    – child node added, removed, or modified
    – lock acquired
    – conflicting lock request from another client
    – etc.
54. REFERENCES
 • Bibliography:
    – Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google File System. In SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 29–43. ACM Press.
    – Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2006). Bigtable: A Distributed Storage System for Structured Data. In Operating Systems Design and Implementation, pages 205–218.
    – Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04: Proceedings of the 6th Symposium on Operating Systems Design and Implementation, pages 137–150.
    – Dean, J. (2006). Experiences with MapReduce, an Abstraction for Large-Scale Computation. In PACT '06: Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques. ACM.
    – Burrows, M. (2006). The Chubby Lock Service for Loosely-Coupled Distributed Systems. In OSDI '06: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pages 335–350.
 • Partially based on:
    – Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 2: MapReduce Theory and Implementation. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-minilecture/lec2-mapred.ppt
    – Dean, J. (2006). Experiences with MapReduce, an Abstraction for Large-Scale Computation. Retrieved September 6, 2008, from http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf
    – Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 3: Distributed Filesystems. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-minilecture/lec3-dfs.ppt
    – Stokely, M. (2007). Distributed Computing at Google. Retrieved September 6, 2008, from http://www.swinog.ch/meetings/swinog15/SRE-Recruiting-SwiNOG2007.ppt