[Roblek] Distributed computing in practice
 

    [Roblek] Distributed computing in practice: Presentation Transcript

    • DISTRIBUTED COMPUTING IN PRACTICE: GFS, BIGTABLE, MAPREDUCE, CHUBBY • Dominik Roblek, Software Engineer, Google Inc.
    • GOOGLE TECHNOLOGY LAYERS • Services and Applications – Google™ search – Gmail™ – Ads system – Google Maps™ • Distributed Computing • Computing Platform – Commodity PC Hardware – Linux – Physical Network JavaBlend 2008, http://www.javablend.net/
    • IMPLICATIONS OF GOOGLE ENVIRONMENT • Single process performance does not matter – Total throughput is more important • Stuff breaks – If you have one server, it may stay up three years – If you have 10,000 servers, expect to lose ten a day • “Ultra-reliable” hardware doesn’t really help – At large scales, reliable hardware still fails, albeit less often – Software still needs to be fault-tolerant
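
A back-of-the-envelope check of the "ten a day" figure. This is only a sketch: it assumes each server fails on average once every three years (the single-server uptime quoted on the slide) and that failures are spread evenly over time. The function name is illustrative.

```python
def expected_failures_per_day(servers: int, mean_lifetime_years: float) -> float:
    """Expected number of server failures per day across a fleet.

    Assumption (not from the slides): every server fails on average once
    per `mean_lifetime_years`, and failures are spread evenly over time.
    """
    mean_lifetime_days = mean_lifetime_years * 365
    return servers / mean_lifetime_days


if __name__ == "__main__":
    # 10,000 servers with a three-year mean lifetime: roughly nine per day.
    print(round(expected_failures_per_day(10_000, 3), 1))
```

Under these assumptions the fleet loses about nine machines a day, which is the order of magnitude the slide quotes.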
    • BUILDING BLOCKS OF google.com? • Distributed data – Google File System (GFS) – Bigtable • Job manager • Distributed computation – MapReduce • Distributed lock service – Chubby
    • SCALABLE DISTRIBUTED FILE SYSTEM Google File System (GFS)
    • GFS: REQUIREMENTS • High component failure rates – Inexpensive commodity components fail all the time • Modest number of huge files – Just a few million, most of them multi-GB • Files are write-once, mostly appended to – Perhaps concurrently – Large streaming reads
    • GFS: DESIGN DECISIONS • Files stored as chunks – Fixed size (64 MB) • Reliability through replication – Each chunk replicated 3+ times • Single master to coordinate access, keep metadata – Simple centralized management • No data caching – Little benefit due to large data sets, streaming reads
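
With fixed-size chunks, a client can work out which chunk holds a given byte without asking anyone. The sketch below shows that arithmetic for the 64 MB chunk size from the slide. The function names are hypothetical and do not belong to any GFS client API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed chunk size from the slide


def locate(byte_offset: int) -> tuple[int, int]:
    """Translate a file byte offset into (chunk index, offset inside chunk)."""
    if byte_offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(byte_offset, CHUNK_SIZE)


def chunks_for_range(start: int, length: int) -> range:
    """Chunk indices touched by a read of `length` bytes starting at `start`."""
    if length <= 0:
        return range(0)
    first, _ = locate(start)
    last, _ = locate(start + length - 1)
    return range(first, last + 1)


if __name__ == "__main__":
    print(locate(CHUNK_SIZE + 5))
    print(list(chunks_for_range(CHUNK_SIZE - 1, 2)))
```

A read that straddles a chunk boundary touches two chunks, so a real client would have to issue one request per chunk.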
    • GFS: ARCHITECTURE Where is a potential weakness of this design?
    • GFS: WEAK POINT - SINGLE MASTER • From distributed systems we know this is a – Single point of failure – Scalability bottleneck • GFS solutions – Shadow masters – Minimize master involvement • never move data through it, use only for metadata • large chunk size • master delegates authority to primary replicas in data mutations (chunk leases)
    • GFS: METADATA • Global metadata is stored on the master – File and chunk namespaces – Mapping from files to chunks • Locations of each chunk’s replicas – All in memory (64 bytes / chunk) • Master has an operation log for persistent logging of critical metadata updates – Persistent on local disk – Replicated – Checkpoints for faster recovery
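
The "64 bytes / chunk" figure explains why the master can keep all chunk metadata in memory. The estimate below is a rough sketch: it assumes every chunk is full, counts each chunk once regardless of replica count, and ignores namespace metadata. The function name is illustrative.

```python
CHUNK_SIZE = 64 * 2**20      # 64 MB per chunk (from the design slide)
METADATA_PER_CHUNK = 64      # bytes of master memory per chunk (from this slide)


def master_memory_bytes(stored_bytes: int) -> int:
    """Estimate the master memory needed for chunk metadata.

    Assumptions (not from the slides): all chunks are full and each chunk
    is counted once, regardless of how many replicas exist.
    """
    chunks = -(-stored_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * METADATA_PER_CHUNK


if __name__ == "__main__":
    one_pib = 2**50
    print(master_memory_bytes(one_pib) / 2**30, "GiB")
```

Under these assumptions, 1 PiB of file data corresponds to 2^24 chunks and about 1 GiB of chunk metadata, which fits in the memory of a single machine.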
    • GFS: MUTATIONS • Mutations must be done for all replicas • Master picks one replica as primary; gives it a “lease” for mutations – Primary defines a serial order of mutations • Data flow decoupled from control flow
    • GFS: OPEN SOURCE ALTERNATIVES • Hadoop Distributed File System - HDFS (Java) – http://hadoop.apache.org/core/docs/current/hdfs_design.html
    • DISTRIBUTED STORAGE FOR LARGE STRUCTURED DATA SETS Bigtable
    • BIGTABLE: REQUIREMENTS • Want to store petabytes of structured data across thousands of commodity servers • Want a simple data format that supports dynamic control over data layout and format • Must support very high read/write rates – millions of operations per second • Latency requirements: – backend bulk processing – real-time data serving
    • BIGTABLE: STRUCTURE • Bigtable is a multi-dimensional map: – sparse – persistent – distributed • Key: – Row name – Column name – Timestamp • Value: – array of bytes • (rowName: string, columnName: string, timestamp: long) → byte[]
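
The map signature above can be mimicked with an ordinary dictionary. The class below is a toy, in-memory illustration of the data model only (sparse, keyed by row, column and timestamp). `ToyTable` and its methods are hypothetical names, not a Bigtable API, and nothing here is persistent or distributed.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, int]  # (row name, column name, timestamp)


class ToyTable:
    """In-memory stand-in for the (row, column, timestamp) -> bytes map."""

    def __init__(self) -> None:
        self._cells: Dict[Key, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str, timestamp: int) -> Optional[bytes]:
        # Sparse: cells that were never written take no space and read as None.
        return self._cells.get((row, column, timestamp))


if __name__ == "__main__":
    table = ToyTable()
    table.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
    print(table.get("com.cnn.www", "anchor:cnnsi.com", 9))
    print(table.get("com.cnn.www", "contents:", 1))
```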
    • BIGTABLE: EXAMPLE • A web crawling system might use a Bigtable that stores web pages – Each row key could represent a specific URL – Columns represent page contents, the references to that page, and other metadata – The row range for a table is dynamically partitioned between servers • Rows are clustered together on machines by key – Using URLs with reversed host names as keys minimizes the number of machines where pages from a single domain are stored – Each cell is timestamped so there could be multiple versions of the same data in the table
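
A small sketch of the row-key trick: reversing the host name components puts all pages of one domain next to each other in lexicographic order. The `row_key` helper is hypothetical, and the exact key layout (reversed host followed by the path) is an assumption made for illustration.

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a row key with the host name components reversed.

    Assumed layout: reversed host followed by the URL path.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + parts.path


if __name__ == "__main__":
    urls = [
        "http://www.cnn.com/a",
        "http://maps.google.com/",
        "http://money.cnn.com/b",
    ]
    for key in sorted(row_key(u) for u in urls):
        print(key)
```

After sorting, both cnn.com keys are adjacent even though their original host names start with different letters.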
    • BIGTABLE: EXAMPLE • Row “com.cnn.www” – “contents:” → “<html>…” at t3, “<html>…” at t5, “<html>…” at t6 – “anchor:cnnsi.com” → “CNN” at t9 – “anchor:my.look.ca” → “CNN.com” at t8
    • BIGTABLE: ROWS • Name is an arbitrary string – Access to data in a row is atomic – Row creation is implicit upon storing data • Rows ordered lexicographically – Rows close together lexicographically are usually stored on one or a small number of machines
    • BIGTABLE: TABLETS • Row range for a table is dynamically partitioned into tablets • Tablet holds contiguous range of rows – Reads over short row ranges are efficient – Clients can choose row keys to achieve locality
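
Because each tablet holds a contiguous, sorted range of rows, finding the tablet for a row key is a binary search over the split points. The sketch below illustrates that idea only. `TabletMap`, the split keys and the server names are all made up for the example, and the real system's lookup mechanism is not described on these slides.

```python
import bisect
from typing import List


class TabletMap:
    """Maps row keys to tablets, each covering a contiguous key range.

    Tablet i holds keys k with split_keys[i-1] <= k < split_keys[i];
    the first and last tablets are open-ended.
    """

    def __init__(self, split_keys: List[str], servers: List[str]) -> None:
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per tablet")
        if split_keys != sorted(split_keys):
            raise ValueError("split keys must be sorted")
        self._split_keys = split_keys
        self._servers = servers

    def tablet_index(self, row_key: str) -> int:
        # Binary search over the sorted split points.
        return bisect.bisect_right(self._split_keys, row_key)

    def server_for(self, row_key: str) -> str:
        return self._servers[self.tablet_index(row_key)]


if __name__ == "__main__":
    tablets = TabletMap(["com.d", "com.n"], ["ts-a", "ts-b", "ts-c"])
    for key in ["com.cnn.money/b", "com.cnn.www/a", "com.google.maps/"]:
        print(key, "->", tablets.server_for(key))
```

Keys that sort close together land on the same tablet, which is why a short row-range scan usually touches a single server.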
    • BIGTABLE: COLUMNS • Columns have two-level name structure <column_family>:[<column_qualifier>] • Column family: – Creation must be explicit – Has associated type information and other metadata – Unit of access control • Column qualifier – Unbounded number of columns – Creation of column within a family is implicit at updates • Additional dimensions
    • BIGTABLE: TIMESTAMPS • Used to store different versions of data in a cell – New writes default to current time – Can also be set explicitly by clients • Lookup options – Return all values – Return most recent K values – Return all values in timestamp range • Column families can be marked with attributes – Only retain most recent K values in a cell – Keep values until they are older than K seconds
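
The lookup options and the "retain most recent K values" attribute can be illustrated with a single versioned cell. This is a minimal sketch of the behaviour described on the slide; the `Cell` class and its method names are hypothetical, and the age-based retention policy is left out.

```python
from typing import List, Optional, Tuple

Version = Tuple[int, bytes]  # (timestamp, value)


class Cell:
    """One cell holding several timestamped versions, newest first."""

    def __init__(self, max_versions: Optional[int] = None) -> None:
        self._versions: List[Version] = []
        self._max_versions = max_versions  # "only retain most recent K values"

    def write(self, timestamp: int, value: bytes) -> None:
        self._versions.append((timestamp, value))
        self._versions.sort(key=lambda v: v[0], reverse=True)
        if self._max_versions is not None:
            # Drop everything older than the K newest versions.
            del self._versions[self._max_versions:]

    def most_recent(self, k: int) -> List[Version]:
        return self._versions[:k]

    def in_range(self, start: int, end: int) -> List[Version]:
        """Versions with start <= timestamp <= end, newest first."""
        return [v for v in self._versions if start <= v[0] <= end]


if __name__ == "__main__":
    cell = Cell(max_versions=2)
    for ts, value in [(3, b"v3"), (5, b"v5"), (6, b"v6")]:
        cell.write(ts, value)
    print(cell.most_recent(5))
```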
    • BIGTABLE: AT GOOGLE • Good match for most of our applications: – Google Earth™ – Google Maps™ – Google Talk™ – Google Finance™ – Orkut™
    • BIGTABLE: OPEN SOURCE ALTERNATIVES • HBase (Java) – http://hadoop.apache.org/hbase/ • Hypertable (C++) – http://www.hypertable.org/
    • PROGRAMMING MODEL FOR PROCESSING LARGE DATA SETS MapReduce
    • MAPREDUCE: REQUIREMENTS • Want to process lots of data ( > 1 TB) • Want to run it on thousands of commodity PCs • Must be robust • … And simple to use
    • MAPREDUCE: DESCRIPTION • A simple programming model that applies to many large-scale computing problems – Based on principles of functional languages – Scalable, robust • Hide messy details in MapReduce runtime library: – automatic parallelization – load balancing – network and disk transfer optimization – handling of machine failures – robustness • Improvements to core library benefit all users of library!
    • MAPREDUCE: FUNCTIONAL PROGRAMMING • Functions don’t change data structures – They always create new ones – Input data remain unchanged • Functions don’t have side effects • Data flows are implicit in program design • Order of operations does not matter z := f(g(x), h(x, y), k(y))
    • MAPREDUCE: TYPICAL EXECUTION FLOW • Read a lot of data • Map: extract something you care about from each record • Shuffle and Sort • Reduce: aggregate, summarize, filter, or transform • Write the results • The outline stays the same; map and reduce change to fit the problem
    • MAPREDUCE: PROGRAMMING INTERFACE • User must implement two functions: – Map(input_key, input_value) → (output_key, intermediate_value) – Reduce(output_key, intermediate_value_list) → output_value_list
    • MAPREDUCE: MAP • Records from the data source … – lines out of files – rows of a database – etc. • … are fed into the map function as (key, value) pairs – filename, line – etc. • map produces zero, one or more intermediate values along with an output key from the input
    • MAPREDUCE: REDUCE • After the map phase is over, all the intermediate values for a given output key are combined together into a list • reduce combines those intermediate values into zero, one or more final values for that same output key
    • MAPREDUCE: EXAMPLE - WORD FREQUENCY 1/5 • Input is files with one document per record • Specify a map function that takes a key/value pair – key = document name – value = document contents • Output of map function is zero, one or more key/value pairs – In our case, output (word, “1”) once per word in the document
    • MAPREDUCE: EXAMPLE - WORD FREQUENCY 2/5 • Input: key = “document1”, value = “To be or not to be?” • Map output: (“to”, “1”), (“be”, “1”), (“or”, “1”), …
    • MAPREDUCE: EXAMPLE - WORD FREQUENCY 3/5 • MapReduce library gathers together all pairs with the same key – shuffle/sort • reduce function combines the values for a key – In our case, compute the sum • Output of reduce is zero, one or more values, paired with the key and saved
    • MAPREDUCE: EXAMPLE - WORD FREQUENCY 4/5 • key = “be”, values = “1”, “1” → “2” → output (“be”, “2”) • key = “not”, values = “1” → “1” → output (“not”, “1”) • key = “or”, values = “1” → “1” → output (“or”, “1”) • key = “to”, values = “1”, “1” → “2” → output (“to”, “2”)
    • MAPREDUCE: EXAMPLE - WORD FREQUENCY 5/5

          Map(String input_key, String input_value):
            // input_key: document name
            // input_value: document contents
            for each word w in input_value:
              EmitIntermediate(w, "1");

          Reduce(String output_key, Iterator intermediate_values):
            // output_key: a word, same for input and output
            // intermediate_values: a list of counts
            int result = 0;
            for each v in intermediate_values:
              result += ParseInt(v);
            Emit(AsString(result));
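
The pseudocode above can be tried out on a single machine. The following is a runnable sketch of the same word-frequency job, with a tiny in-process driver standing in for the MapReduce library. The driver `run`, the tokenization rule and all function names are assumptions made for illustration; nothing here is distributed.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_name: str, contents: str) -> Iterator[Tuple[str, str]]:
    """Emit (word, "1") once per word occurrence in the document."""
    # Assumed tokenization: lower-case runs of letters and apostrophes.
    for word in re.findall(r"[a-z']+", contents.lower()):
        yield word, "1"


def reduce_fn(word: str, counts: Iterable[str]) -> Iterator[str]:
    """Sum the counts for one word and emit the total as a string."""
    yield str(sum(int(c) for c in counts))


def run(documents: Dict[str, str]) -> Dict[str, List[str]]:
    """Single-process stand-in for the library: map, shuffle/sort, reduce."""
    intermediate: Dict[str, List[str]] = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)  # shuffle: group values by key
    return {key: list(reduce_fn(key, intermediate[key]))
            for key in sorted(intermediate)}  # sort: visit keys in order


if __name__ == "__main__":
    print(run({"document1": "To be or not to be?"}))
```

For the input from slide 2/5, the result pairs "be" and "to" with "2" and the other two words with "1", matching slide 4/5.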
    • MAPREDUCE: DISTRIBUTED EXECUTION
    • MAPREDUCE: LOGICAL FLOW
    • MAPREDUCE: PARALLEL FLOW 1/2 • map functions run in parallel, creating different intermediate values from different input data sets • reduce functions also run in parallel, each working on a different output key – All values are processed independently • Bottleneck – reduce phase can’t start until map phase is completely finished
    • MAPREDUCE: PARALLEL FLOW 2/2
    • MAPREDUCE: WIDELY APPLICABLE • distributed grep • distributed sort • document clustering • machine learning • web access log stats • inverted index construction • statistical machine translation • etc.
    • MAPREDUCE: EXAMPLE - LANGUAGE MODEL STATISTICS • Used in our statistical machine translation system • Need to count # of times every 5-word sequence occurs in large corpus of documents (and keep all those where count >= 4) • map: – extract 5-word sequences => count from document • reduce: – summarize counts – keep those where count >= 4
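
A single-process sketch of the 5-word-sequence job described above. Whitespace tokenization, the in-memory grouping step and all function names are assumptions for illustration; only the sequence length of 5 and the threshold of 4 come from the slide.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

NGRAM = 5      # sequence length from the slide
MIN_COUNT = 4  # keep sequences that occur at least this often

Ngram = Tuple[str, ...]


def map_ngrams(document: str) -> Iterator[Tuple[Ngram, int]]:
    """Emit (5-word sequence, count within this document)."""
    words = document.split()  # assumed tokenization: split on whitespace
    local = Counter(tuple(words[i:i + NGRAM])
                    for i in range(len(words) - NGRAM + 1))
    yield from local.items()


def reduce_ngrams(ngram: Ngram, counts: Iterable[int]) -> Iterator[Tuple[Ngram, int]]:
    """Sum the per-document counts and keep only frequent sequences."""
    total = sum(counts)
    if total >= MIN_COUNT:
        yield ngram, total


def count_ngrams(documents: Iterable[str]) -> Dict[Ngram, int]:
    """In-memory stand-in for the shuffle step plus the reduce phase."""
    grouped: Dict[Ngram, List[int]] = {}
    for doc in documents:
        for ngram, count in map_ngrams(doc):
            grouped.setdefault(ngram, []).append(count)
    result: Dict[Ngram, int] = {}
    for ngram, counts in grouped.items():
        for key, total in reduce_ngrams(ngram, counts):
            result[key] = total
    return result


if __name__ == "__main__":
    corpus = ["a b c d e"] * 4 + ["a b c d e f"]
    print(count_ngrams(corpus))
```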
    • MAPREDUCE: EXAMPLE - JOINING WITH OTHER DATA • Generate per-doc summary, but include per-host information (e.g. # of pages on host, important terms on host) – per-host information might involve RPC to a set of machines containing data for all sites • map: – extract host name from URL, look up per-host info, combine with per-doc data and emit • reduce: – identity function (just emit input value directly)
    • MAPREDUCE: FAULT TOLERANCE • Master detects worker failures – Re-executes failed map tasks – Re-executes reduce tasks • Master notices particular input key/values cause crashes in map – Skips those values on re-execution
    • MAPREDUCE: LOCAL OPTIMIZATIONS • Master program divides up tasks based on location of data – tries to have map tasks on same machine as physical file data, or at least same rack
    • MAPREDUCE: SLOW MAP TASKS • reduce phase cannot start before the map phase completes – One slow disk controller can slow down the whole system • Master redundantly starts copies of slow-moving map tasks – Uses the results of the first copy to finish
    • MAPREDUCE: COMBINE • combine is a mini-reduce phase that runs on the same machine as map phase – It aggregates the results of local map phases – Saves network bandwidth
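
To see where the bandwidth saving comes from, compare what one mapper would send over the network with and without a combiner. The sketch reuses the word-frequency example; the function names are hypothetical. Pre-aggregating like this is safe here because addition is associative and commutative.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, int]


def map_words(text: str) -> List[Pair]:
    """Emit (word, 1) for every word occurrence."""
    return [(word, 1) for word in text.lower().split()]


def combine(pairs: Iterable[Pair]) -> List[Pair]:
    """Pre-aggregate one mapper's output locally, before it is sent on."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


if __name__ == "__main__":
    raw = map_words("to be or not to be")
    combined = combine(raw)
    print(len(raw), "pairs without combine,", len(combined), "with combine")
```

The reduce phase still sums the values it receives; it simply receives fewer, larger counts per mapper.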
    • MAPREDUCE: CONCLUSION • MapReduce proved to be an extremely useful abstraction – It greatly simplifies the processing of huge amounts of data • MapReduce is easy to use – Programmer can focus on the problem – MapReduce takes care of the messy details
    • MAPREDUCE: OPEN SOURCE ALTERNATIVES • Hadoop (Java) – http://hadoop.apache.org/ • Disco (Erlang, Python) – http://discoproject.org/ • etc.
    • LOCK SERVICE FOR LOOSELY-COUPLED DISTRIBUTED SYSTEMS Chubby
    • CHUBBY: SYNCHRONIZED ACCESS TO SHARED RESOURCES • Key element of distributed architecture at Google: – Used by GFS, Bigtable and MapReduce • Interface similar to distributed file system with advisory locks – Access control list – No links • Every Chubby file can hold a small amount of data • Every Chubby file or directory can be used as read or write lock – Locks are advisory, not mandatory • Clients must be well-behaved • A client that does not hold a lock can still read the content of a Chubby file
    • CHUBBY: DESIGN • Design emphasis not on high performance, but on availability and reliability • Reading and writing is atomic • Chubby service is composed of 5 active replicas – One of them elected as master – Requires the majority of replicas to be alive
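
The majority requirement translates directly into how many replica failures a cell can survive. The helper names below are illustrative; the arithmetic is the usual majority-quorum rule.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas may be down while a majority is still alive."""
    return replicas - majority(replicas)


if __name__ == "__main__":
    print(majority(5), tolerated_failures(5))
```

With the 5 replicas from the slide, 3 must be alive, so the service keeps working with up to 2 replicas down.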
    • CHUBBY: EVENTS • Client can subscribe to various events: – file contents modified – child node added, removed, or modified – lock acquired – conflicting lock request from another client – etc.
    • REFERENCES
      • Bibliography:
        – Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 29-43. ACM Press.
        – Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2006). Bigtable: A distributed storage system for structured data. In Operating Systems Design and Implementation, pages 205-218.
        – Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th symposium on Operating systems design and implementation, pages 137-150.
        – Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. In PACT '06: Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques. ACM.
        – Burrows, M. (2006). The Chubby lock service for loosely-coupled distributed systems. In OSDI '06: Proceedings of the 7th symposium on Operating systems design and implementation, pages 335-350.
      • Partially based on:
        – Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 2: MapReduce Theory and Implementation. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-minilecture/lec2-mapred.ppt
        – Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. Retrieved September 6, 2008, from http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf
        – Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 3: Distributed Filesystems. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-minilecture/lec3-dfs.ppt
        – Stokely, M. (2007). Distributed Computing at Google. Retrieved September 6, 2008, from http://www.swinog.ch/meetings/swinog15/SRE-Recruiting-SwiNOG2007.ppt