[Roblek] Distributed computing in practice

    [Roblek] Distributed computing in practice - Presentation Transcript

    1. DISTRIBUTED COMPUTING IN PRACTICE
       GFS, BIGTABLE, MAPREDUCE, CHUBBY
       Dominik Roblek, Software Engineer, Google Inc.
    2. GOOGLE TECHNOLOGY LAYERS
       • Services and Applications: Google™ search, Gmail™, Ads system, Google Maps™
       • Distributed Computing
       • Computing Platform: Commodity PC Hardware, Linux, Physical Network
       JavaBlend 2008, http://www.javablend.net/
    3. IMPLICATIONS OF GOOGLE ENVIRONMENT
       • Single process performance does not matter
         – Total throughput is more important
       • Stuff breaks
         – If you have one server, it may stay up three years
         – If you have 10,000 servers, expect to lose ten a day
       • “Ultra-reliable” hardware doesn’t really help
         – At large scales, reliable hardware still fails, albeit less often
         – Software still needs to be fault-tolerant
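
The "ten a day" figure on slide 3 follows from simple arithmetic. A minimal sketch, assuming servers fail independently and that "three years" is the mean time between failures of a single server (both are simplifying assumptions, and the function name is illustrative):

```python
DAYS_PER_YEAR = 365


def expected_failures_per_day(servers: int, mtbf_years: float) -> float:
    """Expected number of server failures per day across a fleet.

    Assumes every server fails independently, on average once every
    `mtbf_years` years (an assumption made for this illustration).
    """
    return servers / (mtbf_years * DAYS_PER_YEAR)


# 10,000 servers, each failing about once in three years.
print(round(expected_failures_per_day(10_000, 3), 1))  # 9.1
```

Roughly nine failures a day, which the slide rounds to ten.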
    4. BUILDING BLOCKS OF google.com?
       • Distributed data
         – Google File System (GFS)
         – BigTable
       • Job manager
       • Distributed computation
         – MapReduce
       • Distributed lock service
         – Chubby
    5. SCALABLE DISTRIBUTED FILE SYSTEM
       Google File System (GFS)
    6. GFS: REQUIREMENTS
       • High component failure rates
         – Inexpensive commodity components fail all the time
       • Modest number of huge files
         – Just a few million, most of them multi-GB
       • Files are write-once, mostly appended to
         – Perhaps concurrently
         – Large streaming reads
    7. GFS: DESIGN DECISION
       • Files stored as chunks
         – Fixed size (64MB)
       • Reliability through replication
       • Each chunk replicated 3+ times
       • Single master to coordinate access, keep metadata
         – Simple centralized management
       • No data caching
         – Little benefit due to large data sets, streaming reads
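
Because chunks have a fixed size, mapping a byte offset in a file to a chunk is plain integer division. The sketch below is illustrative only: the function names are made up for this example and are not part of any GFS client API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed chunk size from slide 7


def chunk_index(byte_offset: int) -> int:
    """Index of the chunk that contains the given byte offset of a file."""
    return byte_offset // CHUNK_SIZE


def chunks_for_range(offset: int, length: int) -> range:
    """Indices of all chunks touched by reading `length` bytes at `offset`."""
    if length <= 0:
        return range(0)
    first = chunk_index(offset)
    last = chunk_index(offset + length - 1)
    return range(first, last + 1)


# A 2-byte read that straddles the first chunk boundary touches two chunks.
print(list(chunks_for_range(CHUNK_SIZE - 1, 2)))  # [0, 1]
```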
    8. GFS: ARCHITECTURE
       Where is a potential weakness of this design?
    9. GFS: WEAK POINT - SINGLE MASTER
       • From distributed systems we know this is a
         – Single point of failure
         – Scalability bottleneck
       • GFS solutions
         – Shadow masters
         – Minimize master involvement
           • never move data through it, use only for metadata
           • large chunk size
           • master delegates authority to primary replicas in data mutations (chunk leases)
    10. GFS: METADATA
        • Global metadata is stored on the master
          – File and chunk namespaces
          – Mapping from files to chunks
          – Locations of each chunk’s replicas
          – All in memory (64 bytes / chunk)
        • Master has an operation log for persistent logging of critical metadata updates
          – Persistent on local disk
          – Replicated
          – Checkpoints for faster recovery
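
The 64 bytes per chunk figure shows why keeping all metadata in memory is feasible. A back-of-the-envelope sketch, assuming 64 MB chunks and counting only the per-chunk metadata (namespace entries and anything else the master holds are ignored here):

```python
CHUNK_SIZE = 64 * 2**20          # 64 MB chunks
METADATA_BYTES_PER_CHUNK = 64    # figure quoted on slide 10


def master_metadata_bytes(total_data_bytes: int) -> int:
    """Rough in-memory chunk metadata size for a given amount of file data."""
    chunks = -(-total_data_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * METADATA_BYTES_PER_CHUNK


# One pebibyte of file data needs about one gibibyte of chunk metadata.
print(master_metadata_bytes(2**50) / 2**30)  # 1.0
```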
    11. GFS: MUTATIONS
        • Mutations must be done for all replicas
        • Master picks one replica as primary; gives it a “lease” for mutations
          – Primary defines a serial order of mutations
        • Data flow decoupled from control flow
    12. GFS: OPEN SOURCE ALTERNATIVES
        • Hadoop Distributed File System - HDFS (Java)
          – http://hadoop.apache.org/core/docs/current/hdfs_design.html
    13. DISTRIBUTED STORAGE FOR LARGE STRUCTURED DATA SETS
        Bigtable
    14. BIGTABLE: REQUIREMENTS
        • Want to store petabytes of structured data across thousands of commodity servers
        • Want a simple data format that supports dynamic control over data layout and format
        • Must support very high read/write rates
          – millions of operations per second
        • Latency requirements:
          – backend bulk processing
          – real-time data serving
    15. BIGTABLE: STRUCTURE
        • Bigtable is a multi-dimensional map:
          – sparse
          – persistent
          – distributed
        • Key:
          – Row name
          – Column name
          – Timestamp
        • Value:
          – array of bytes
        (rowName: string, columnName: string, timestamp: long) → byte[]
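
The map signature on slide 15 can be modelled with an ordinary dictionary. This is only a sketch of the data model: the class and method names are invented for illustration, and the "persistent" and "distributed" properties are not modelled at all.

```python
from typing import Dict, Optional, Tuple

CellKey = Tuple[str, str, int]  # (row name, column name, timestamp)


class SparseTable:
    """In-memory stand-in for the (row, column, timestamp) -> bytes map."""

    def __init__(self) -> None:
        # Sparse: only cells that were actually written take up space.
        self._cells: Dict[CellKey, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str, timestamp: int) -> Optional[bytes]:
        return self._cells.get((row, column, timestamp))


table = SparseTable()
table.put("com.cnn.www", "contents:", 3, b"<html>...")
print(table.get("com.cnn.www", "contents:", 3))      # b'<html>...'
print(table.get("com.cnn.www", "anchor:cnnsi.com", 9))  # None
```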
    16. BIGTABLE: EXAMPLE
        • A web crawling system might use a Bigtable that stores web pages
          – Each row key could represent a specific URL
          – Columns represent page contents, the references to that page, and other metadata
          – The row range for a table is dynamically partitioned between servers
        • Rows are clustered together on machines by key
          – Using reversed URLs as keys minimizes the number of machines where pages from a single domain are stored
          – Each cell is timestamped so there could be multiple versions of the same data in the table
    17. BIGTABLE: EXAMPLE
        Row “com.cnn.www”:
        • “contents:” → “<html>…” at t3, “<html>…” at t5, “<html>…” at t6
        • “anchor:cnnsi.com” → “CNN” at t9
        • “anchor:my.look.ca” → “CNN.com” at t8
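
The row key “com.cnn.www” in the example is the host name with its components reversed. A small sketch of how such keys might be built, and why they sort pages of one domain next to each other (the function name and the sample URLs are illustrative):

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a row key with the host name components reversed.

    "http://www.cnn.com/index.html" becomes "com.cnn.www/index.html".
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + parts.path


urls = [
    "http://www.cnn.com/index.html",
    "http://maps.google.com/",
    "http://money.cnn.com/markets",
]
# Sorting by key places both cnn.com hosts next to each other.
for key in sorted(row_key(u) for u in urls):
    print(key)
```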
    18. BIGTABLE: ROWS
        • Name is an arbitrary string
          – Access to data in a row is atomic
          – Row creation is implicit upon storing data
        • Rows ordered lexicographically
          – Rows close together lexicographically usually reside on one or a small number of machines
    19. BIGTABLE: TABLETS
        • Row range for a table is dynamically partitioned into tablets
        • Tablet holds contiguous range of rows
          – Reads over short row ranges are efficient
          – Clients can choose row keys to achieve locality
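
Since each tablet holds a contiguous range of lexicographically ordered rows, finding the tablet for a row key is a binary search over the split points. The sketch below is a simplification under stated assumptions: the class, the split keys, and the server names are hypothetical, and real tablet location lookup in Bigtable is more involved.

```python
import bisect
from typing import List


class TabletMap:
    """Maps row keys to tablets using sorted split keys.

    split_keys[i] is the first row key of tablet i + 1, so n split keys
    define n + 1 tablets covering the whole key space.
    """

    def __init__(self, split_keys: List[str], servers: List[str]) -> None:
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per tablet")
        self.split_keys = sorted(split_keys)
        self.servers = servers

    def tablet_for(self, row_key: str) -> int:
        # Number of split keys <= row_key is the tablet index.
        return bisect.bisect_right(self.split_keys, row_key)

    def server_for(self, row_key: str) -> str:
        return self.servers[self.tablet_for(row_key)]


tablets = TabletMap(["com.cnn.www", "com.google"], ["ts-a", "ts-b", "ts-c"])
print(tablets.server_for("com.cnn.money"))    # ts-a
print(tablets.server_for("com.cnn.www"))      # ts-b
print(tablets.server_for("com.google.maps"))  # ts-c
```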
    20. BIGTABLE: COLUMNS
        • Columns have a two-level name structure: <column_family>:[<column_qualifier>]
        • Column family:
          – Creation must be explicit
          – Has associated type information and other metadata
          – Unit of access control
        • Column qualifier:
          – Unbounded number of columns
          – Creation of column within a family is implicit at updates
        • Additional dimensions
    21. BIGTABLE: TIMESTAMPS
        • Used to store different versions of data in a cell
          – New writes default to current time
          – Can also be set explicitly by clients
        • Lookup options
          – Return all values
          – Return most recent K values
          – Return all values in timestamp range
        • Column families can be marked with attributes
          – Only retain most recent K values in a cell
          – Keep values until they are older than K seconds
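
The lookup options and retention attributes on slide 21 can be sketched over a plain list of (timestamp, value) pairs for one cell. The function names are invented for this example, and treating the timestamp range as half-open is an assumption:

```python
from typing import List, Tuple

Version = Tuple[int, bytes]  # (timestamp, value)


def newest_first(versions: List[Version]) -> List[Version]:
    return sorted(versions, key=lambda v: v[0], reverse=True)


def most_recent(versions: List[Version], k: int) -> List[Version]:
    """Lookup: the K most recent values. Also models 'retain most recent K'."""
    return newest_first(versions)[:k]


def in_range(versions: List[Version], start: int, end: int) -> List[Version]:
    """Lookup: all values with start <= timestamp < end."""
    return [v for v in newest_first(versions) if start <= v[0] < end]


def drop_older_than(versions: List[Version], now: int, max_age: int) -> List[Version]:
    """Retention: keep values until they are older than max_age."""
    return [v for v in newest_first(versions) if now - v[0] <= max_age]


contents = [(3, b"v3"), (5, b"v5"), (6, b"v6")]
print(most_recent(contents, 2))          # [(6, b'v6'), (5, b'v5')]
print(in_range(contents, 4, 6))          # [(5, b'v5')]
print(drop_older_than(contents, 10, 5))  # [(6, b'v6'), (5, b'v5')]
```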
    22. BIGTABLE: AT GOOGLE
        • Good match for most of our applications:
          – Google Earth™
          – Google Maps™
          – Google Talk™
          – Google Finance™
          – Orkut™
    23. BIGTABLE: OPEN SOURCE ALTERNATIVES
        • HBase (Java)
          – http://hadoop.apache.org/hbase/
        • Hypertable (C++)
          – http://www.hypertable.org/
    24. PROGRAMMING MODEL FOR PROCESSING LARGE DATA SETS
        MapReduce
    25. MAPREDUCE: REQUIREMENTS
        • Want to process lots of data (> 1 TB)
        • Want to run it on thousands of commodity PCs
        • Must be robust
        • … And simple to use
    26. MAPREDUCE: DESCRIPTION
        • A simple programming model that applies to many large-scale computing problems
          – Based on principles of functional languages
          – Scalable, robust
        • Hide messy details in MapReduce runtime library:
          – automatic parallelization
          – load balancing
          – network and disk transfer optimization
          – handling of machine failures
          – robustness
        • Improvements to core library benefit all users of library!
    27. MAPREDUCE: FUNCTIONAL PROGRAMMING
        • Functions don’t change data structures
          – They always create new ones
          – Input data remain unchanged
        • Functions don’t have side effects
        • Data flows are implicit in program design
        • Order of operations does not matter
        z := f(g(x), h(x, y), k(y))
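
The expression on slide 27 can be tried out directly. In the sketch below, f, g, h, and k are arbitrary side-effect-free functions chosen only for illustration; because none of them changes shared state, evaluating g, h, and k in any order gives the same z, which is what lets a runtime run them in parallel.

```python
from itertools import permutations


def g(x): return x + 1
def h(x, y): return x * y
def k(y): return y - 1
def f(a, b, c): return a + b + c


def evaluate(x, y, order):
    """Compute z = f(g(x), h(x, y), k(y)), evaluating g, h, k in `order`."""
    steps = {"g": lambda: g(x), "h": lambda: h(x, y), "k": lambda: k(y)}
    results = {name: steps[name]() for name in order}
    return f(results["g"], results["h"], results["k"])


# All six evaluation orders produce the same value.
print({evaluate(3, 4, order) for order in permutations("ghk")})  # {19}
```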
    28. MAPREDUCE: TYPICAL EXECUTION FLOW
        • Read a lot of data
        • Map: extract something you care about from each record
        • Shuffle and Sort
        • Reduce: aggregate, summarize, filter, or transform
        • Write the results
        Outline stays the same, map and reduce change to fit the problem
    29. MAPREDUCE: PROGRAMMING INTERFACE
        User must implement two functions:
        Map(input_key, input_value) → (output_key, intermediate_value)
        Reduce(output_key, intermediate_value_list) → output_value_list
    30. MAPREDUCE: MAP
        • Records from the data source …
          – lines out of files
          – rows of a database
          – etc.
        • … are fed into the map function as (key, value) pairs
          – filename, line
          – etc.
        • map produces zero, one or more intermediate values along with an output key from the input
    31. MAPREDUCE: REDUCE
        • After the map phase is over, all the intermediate values for a given output key are combined together into a list
        • reduce combines those intermediate values into zero, one or more final values for that same output key
    32. MAPREDUCE: EXAMPLE - WORD FREQUENCY 1/5
        • Input is files with one document per record
        • Specify a map function that takes a key/value pair
          – key = document name
          – value = document contents
        • Output of map function is zero, one or more key/value pairs
          – In our case, output (word, “1”) once per word in the document
    33. MAPREDUCE: EXAMPLE - WORD FREQUENCY 2/5
        Input: key = “document1”, value = “To be or not to be?”
        Map output: (“to”, “1”), (“be”, “1”), (“or”, “1”), …
    34. MAPREDUCE: EXAMPLE - WORD FREQUENCY 3/5
        • MapReduce library gathers together all pairs with the same key
          – shuffle/sort
        • reduce function combines the values for a key
          – In our case, compute the sum
        • Output of reduce is zero, one or more values paired with key and saved
    35. MAPREDUCE: EXAMPLE - WORD FREQUENCY 4/5
        • key = “be”, values = “1”, “1” → “2” → output (“be”, “2”)
        • key = “not”, values = “1” → “1” → output (“not”, “1”)
        • key = “or”, values = “1” → “1” → output (“or”, “1”)
        • key = “to”, values = “1”, “1” → “2” → output (“to”, “2”)
    36. MAPREDUCE: EXAMPLE - WORD FREQUENCY 5/5
        Map(String input_key, String input_value):
          // input_key: document name
          // input_value: document contents
          for each word w in input_value:
            EmitIntermediate(w, "1");

        Reduce(String output_key, Iterator intermediate_values):
          // output_key: a word, same for input and output
          // intermediate_values: a list of counts
          int result = 0;
          for each v in intermediate_values:
            result += ParseInt(v);
          Emit(AsString(result));
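
The pseudocode on slide 36 can be turned into a runnable single-process sketch. This is not the MapReduce library: `run_mapreduce` is a hypothetical in-memory driver written for this example, and it only imitates the map, shuffle/sort, and reduce steps on one machine.

```python
import re
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit (word, "1") once per word in the document."""
    for word in re.findall(r"[a-z']+", contents.lower()):
        yield word, "1"


def reduce_fn(word, counts):
    """Sum the counts emitted for one word."""
    yield str(sum(int(c) for c in counts))


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, intermediate in map_fn(key, value):
            groups[out_key].append(intermediate)  # shuffle: group by key
    # sort keys, then reduce each group independently
    return {k: list(reduce_fn(k, groups[k])) for k in sorted(groups)}


result = run_mapreduce([("document1", "To be or not to be?")], map_fn, reduce_fn)
print(result)  # {'be': ['2'], 'not': ['1'], 'or': ['1'], 'to': ['2']}
```

The output matches the counts shown on slide 35.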
    37. MAPREDUCE: DISTRIBUTED EXECUTION
    38. MAPREDUCE: LOGICAL FLOW
    39. MAPREDUCE: PARALLEL FLOW 1/2
        • map functions run in parallel, creating different intermediate values from different input data sets
        • reduce functions also run in parallel, each working on a different output key
          – All values are processed independently
        • Bottleneck
          – reduce phase can’t start until map phase is completely finished
    40. MAPREDUCE: PARALLEL FLOW 2/2
    41. MAPREDUCE: WIDELY APPLICABLE
        • distributed grep
        • distributed sort
        • document clustering
        • machine learning
        • web access log stats
        • inverted index construction
        • statistical machine translation
        • etc.
    42. MAPREDUCE: EXAMPLE - LANGUAGE MODEL STATISTICS
        • Used in our statistical machine translation system
        • Need to count # of times every 5-word sequence occurs in a large corpus of documents (and keep all those where count >= 4)
        • map:
          – extract 5-word sequences => count from document
        • reduce:
          – summarize counts
          – keep those where count >= 4
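
A single-machine sketch of the map and reduce functions described on slide 42. Splitting on whitespace is a simplifying assumption, the function names are illustrative, and the reduce step here processes all keys at once rather than one key per call:

```python
from collections import Counter

N = 5          # sequence length from the slide
MIN_COUNT = 4  # threshold from the slide


def map_ngrams(document, n=N):
    """Emit (n-word sequence, 1) for every position in the document."""
    words = document.split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n]), 1


def reduce_ngrams(pairs, min_count=MIN_COUNT):
    """Sum counts per sequence and keep those with count >= min_count."""
    totals = Counter()
    for ngram, count in pairs:
        totals[ngram] += count
    return {ngram: c for ngram, c in totals.items() if c >= min_count}


docs = ["a b c d e f"] * 3 + ["a b c d e"]
pairs = [pair for doc in docs for pair in map_ngrams(doc)]
print(reduce_ngrams(pairs))  # {('a', 'b', 'c', 'd', 'e'): 4}
```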
    43. MAPREDUCE: EXAMPLE - JOINING WITH OTHER DATA
        • Generate per-doc summary, but include per-host information (e.g. # of pages on host, important terms on host)
          – per-host information might involve RPC to a set of machines containing data for all sites
        • map:
          – extract host name from URL, lookup per-host info, combine with per-doc data and emit
        • reduce:
          – identity function (just emit input value directly)
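
The join on slide 43 happens entirely in the map function. In the sketch below a local dictionary stands in for the RPC lookup of per-host information; `HOST_INFO`, its fields, and the function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the per-host data that would be fetched via RPC.
HOST_INFO = {"www.cnn.com": {"pages_on_host": 3}}


def map_join(url, doc_summary, host_info=HOST_INFO):
    """Extract the host from the URL and merge per-host info into the summary."""
    host = urlsplit(url).hostname
    info = host_info.get(host, {})
    yield url, {**doc_summary, **info}


def reduce_identity(key, values):
    """Identity reduce: emit every input value unchanged."""
    yield from values


joined = list(map_join("http://www.cnn.com/a.html", {"title": "A"}))
print(joined)
# [('http://www.cnn.com/a.html', {'title': 'A', 'pages_on_host': 3})]
```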
    44. MAPREDUCE: FAULT TOLERANCE
        • Master detects worker failures
          – Re-executes failed map tasks
          – Re-executes reduce tasks
        • Master notices particular input key/values cause crashes in map
          – Skips those values on re-execution
    45. MAPREDUCE: LOCAL OPTIMIZATIONS
        • Master program divides up tasks based on location of data
          – tries to have map tasks on same machine as physical file data, or at least same rack
    46. MAPREDUCE: SLOW MAP TASKS
        • reduce phase cannot start before the map phase completes
          – One slow disk controller can slow down the whole system
        • Master redundantly starts slow-moving map task
          – Uses results of first copy to finish
    47. MAPREDUCE: COMBINE
        • combine is a mini-reduce phase that runs on the same machine as map phase
          – It aggregates the results of local map phases
          – Saves network bandwidth
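
For word counting, a combiner can pre-sum the counts produced on one machine before anything is sent over the network. A minimal sketch, with illustrative function names and a single string standing in for one machine's map input:

```python
from collections import Counter


def map_words(contents):
    """Emit (word, 1) once per word."""
    for word in contents.lower().split():
        yield word, 1


def combine(pairs):
    """Mini-reduce on the map machine: sum the counts per word locally."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


pairs = list(map_words("to be or not to be"))
combined = combine(pairs)
# Six pairs would cross the network without the combiner, four with it.
print(len(pairs), len(combined))  # 6 4
```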
    48. MAPREDUCE: CONCLUSION
        • MapReduce proved to be an extremely useful abstraction
          – It greatly simplifies the processing of huge amounts of data
        • MapReduce is easy to use
          – Programmer can focus on the problem
          – MapReduce takes care of the messy details
    49. MAPREDUCE: OPEN SOURCE ALTERNATIVES
        • Hadoop (Java)
          – http://hadoop.apache.org/
        • Disco (Erlang, Python)
          – http://discoproject.org/
        • etc.
    50. LOCK SERVICE FOR LOOSELY-COUPLED DISTRIBUTED SYSTEMS
        Chubby
    51. CHUBBY: SYNCHRONIZED ACCESS TO SHARED RESOURCES
        • Key element of distributed architecture at Google:
          – Used by GFS, Bigtable and MapReduce
        • Interface similar to distributed file system with advisory locks
          – Access control list
          – No links
        • Every Chubby file can hold a small amount of data
        • Every Chubby file or directory can be used as a read or write lock
          – Locks are advisory, not mandatory
            • Clients must be well-behaved
            • A client that does not hold a lock can still read the content of a Chubby file
    52. CHUBBY: DESIGN
        • Design emphasis not on high performance, but on availability and reliability
        • Reading and writing is atomic
        • Chubby service is composed of 5 active replicas
          – One of them elected as master
          – Requires the majority of replicas to be alive
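
The majority requirement determines how many replica failures the service can survive. A short arithmetic sketch (the function names are illustrative):

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that may be down while a majority is still alive."""
    return replicas - majority(replicas)


def is_available(replicas: int, alive: int) -> bool:
    return alive >= majority(replicas)


# With 5 replicas, 3 must be alive, so 2 may fail.
print(majority(5), tolerated_failures(5))  # 3 2
```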
    53. CHUBBY: EVENTS
        • Client can subscribe for various events:
          – file contents modified
          – child node added, removed, or modified
          – lock acquired
          – conflicting lock request from another client
          – etc.
    54. REFERENCES
        • Bibliography:
          – Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 29-43. ACM Press.
          – Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2006). Bigtable: A distributed storage system for structured data. In Operating Systems Design and Implementation, pages 205-218.
          – Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th symposium on Operating systems design and implementation, pages 137-150.
          – Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. In PACT '06: Proceedings of the 15th international Conference on Parallel Architectures and Compilation Techniques. ACM.
          – Burrows, M. (2006). The Chubby lock service for loosely-coupled distributed systems. In OSDI '06: Proceedings of the 7th symposium on Operating systems design and implementation, pages 335-350.
        • Partially based on:
          – Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 2: MapReduce Theory and Implementation. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-minilecture/lec2-mapred.ppt
          – Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. Retrieved September 6, 2008, from http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf
          – Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 3: Distributed Filesystems. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-minilecture/lec3-dfs.ppt
          – Stokely, M. (2007). Distributed Computing at Google. Retrieved September 6, 2008, from http://www.swinog.ch/meetings/swinog15/SRE-Recruiting-SwiNOG2007.ppt

