Large Scale Data Analysis with Map/Reduce, part I

Transcript

  • 1. Large Scale Data Analysis with Map/Reduce, part I Marin Dimitrov (technology watch #1) Feb 2010
  • 2. Contents • Map/Reduce • Dryad • Sector/Sphere • Open source M/R frameworks & tools – Hadoop (Yahoo/Apache) – Cloud MapReduce (Accenture) – Elastic MapReduce (Hadoop on AWS) – MR.Flow • Some M/R algorithms – Graph algorithms, Text Indexing & retrieval Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #2
  • 3. Contents Part I Distributed computing frameworks Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #3
  • 4. Scalability & Parallelisation • Scalability approaches – Scale up (vertical scaling) • Only one direction of improvement (bigger box) – Scale out (horizontal scaling) • Two directions – add more nodes + scale up each node • Can achieve x4 the performance of a similarly priced scale-up system (ref?) – Hybrid (“scale out in a box”) • Not parallelisable – Algorithms with state – Dependencies from one iteration to another (recurrence, induction) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #4
  • 5. Parallelisation approaches • Parallelization approaches – Task decomposition • Distribute coarse-grained (synchronisation wise) and computationally expensive tasks (otherwise too much coordination/management overhead) • Dependencies - execution order vs. data dependencies • Move the data to the processing (when needed) – Data decomposition • Each parallel task works with a data partition assigned to it (no sharing) • Data has regular structure, i.e. chunks expected to need the same amount of processing time • Two criteria: granularity (size of chunk) and shape (data exchange between chunk neighbours) • Move the processing to the data Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #5
  • 6. Amdahl’s law • Impossible to achieve linear speedup • Maximum speedup is always bounded by the overhead for parallelisation and by the serial processing part • Amdahl’s law – max_speedup = 1 / ((1 - P) + P / N) – P: proportion of the program that can be parallelised (1 - P still remains serial or overhead) – N: number of processors / parallel nodes – Example: P=75% (i.e. 25% serial or overhead) N (parallel nodes) 2 4 8 16 32 1024 64K Max speedup 1.60 2.29 2.91 3.37 3.66 3.99 3.99 Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #6
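
A quick way to sanity-check the table above: the short Python sketch below (illustrative only, not part of the original slides) evaluates max_speedup = 1 / ((1 - P) + P / N) for P = 0.75 and reproduces the numbers row by row.

```python
# Amdahl's law: upper bound on speedup with N parallel nodes when a
# fraction P of the work can be parallelised (the remaining 1 - P stays serial).
def max_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Reproduce the table above (P = 0.75, i.e. 25% serial or overhead).
for n in (2, 4, 8, 16, 32, 1024, 64 * 1024):
    print(f"N = {n:>6}  max speedup = {max_speedup(0.75, n):.2f}")
```
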
  • 7. Map/Reduce • Google (2005), US patent (2010) • General idea - co-locate data with computation nodes – Data decomposition (parallelization) – no data/order dependencies between tasks (except the Map-to-Reduce phase) – Try to utilise data locality (bandwidth is $$$) – Implicit data flow (higher abstraction level than MPI) – Partial failure handling (failed map/reduce tasks are re-scheduled) • Structure – Map - for each input (Ki,Vi) produce zero or more output pairs (Km,Vm) – Combine – optional intermediate aggregation (less M->R data transfer) – Reduce - for input pair (Km, list(V1,V2,…, Vn)) produce zero or more output pairs (Kr,Vr) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #7
  • 8. Map/Reduce (2) (C) Jimmy Lin Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #8
  • 9. Map/Reduce - examples • In other words… – Map = partitioning of the data (compute part of a problem across several servers) – Reduce = processing of the partitions (aggregate the partial results from all servers into a single resultset) – The M/R framework takes care of grouping of partitions by key • Example: word count – Map (1 task per document in the collection) • In: docx • Out: (term1, count1,x), (term2, count2,x), … – Reduce (1 task per term in the collection) • In: (term1, < count1,x, count1,y, … count1,z >) • Out: (term1, SUM(count1,x, count1,y, … count1,z)) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #9
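
Expressed against the toy run_mapreduce driver sketched earlier (the document IDs, tokenisation regex and sample data are illustrative), the word count example looks like this:

```python
import re

def wc_mapper(doc_id, text):
    # Map: one (term, 1) pair per occurrence of a term in the document.
    for term in re.findall(r"\w+", text.lower()):
        yield term, 1

def wc_reducer(term, counts):
    # Reduce: sum the partial counts for a term across the whole collection.
    yield term, sum(counts)

docs = [("doc1", "why did the chicken cross the road"),
        ("doc2", "the chicken and egg problem")]
print(run_mapreduce(docs, wc_mapper, wc_reducer))
# [('and', 1), ('chicken', 2), ('cross', 1), ('did', 1), ('egg', 1), ...]
```
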
  • 10. Map/Reduce examples (2) • Example: Shortest path in graph (naïve) – Map: in (nodein, dist); out (nodeout, dist++) where nodein->nodeout – Reduce: in (noder, <dista,r, distb,r, …, distc,r>); out (noder, MIN(dista,r, distb,r, …, distc,r)) – Multiple M/R iterations required, start with (nodestart,0) • Example: Inverted indexing (full text search) – Map • In: docx • out: (term1, (docx, pos’1,x)), (term1, (docx, pos’’1,x)), (term2, (docx, pos2,x))… – Reduce • in = (term1, < (docx, pos’1,x), (docx, pos’’1,x), (docy, pos1,y), … (docz, pos1,z)>) • out = (term1, < (docx, <pos’1,x, pos’’1,x,…>), (docy, <pos1,y>), … (docz, <pos1,z>)>) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #10
  • 11. Map/Reduce - examples (3) • Inverted index example rundown • input – Doc1: “Why did the chicken cross the road?” – Doc2: “The chicken and egg problem” – Doc3: “Kentucky Fried Chicken” • Map phase (3 parallel tasks) – map1 => (“why”,(doc1,1)), (“did”,(doc1,2)), (“the”,(doc1,3)), (“chicken”,(doc1,4)), (“cross”,(doc1,5)), (“the”,(doc1,6)), (“road”,(doc1,7)) – map2 => (“the”,(doc2,1)), (“chicken”,(doc2,2)), (“and”,(doc2,3)), (“egg”,(doc2,4)), (“problem”, (doc2,5)) – map3 => (“kentucky”,(doc3,1)), (“fried”,(doc3,2)), (“chicken”,(doc3,3)) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #11
  • 12. Map/Reduce - examples (4) • Inverted index example rundown (cont.) • Intermediate shuffle & sort phase – (“why”, <(doc1,1)>), – (“did”, <(doc1,2)>), – (“the”, <(doc1,3), (doc1,6), (doc2,1)>) – (“chicken”, <(doc1,4), (doc2,2), (doc3,3)>) – (“cross”, <(doc1,5)>) – (“road”, <(doc1,7)>) – (“and”, <(doc2,3)>) – (“egg”, <(doc2,4)>) – (“problem”, <(doc2,5)>) – (“kentucky”, <(doc3,1)>) – (“fried”, <(doc3,2)>) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #12
  • 13. Map/Reduce - examples (5) • Inverted index example rundown (cont.) • Reduce phase (11 parallel tasks) – (“why”, <(doc1,<1>)>), – (“did”, <(doc1,<2>)>), – (“the”, <(doc1, <3,6>), (doc2, <1>)>) – (“chicken”, <(doc1,<4>), (doc2,<2>), (doc3,<3>)>) – (“cross”, <(doc1,<5>)>) – (“road”, <(doc1,<7>)>) – (“and”, <(doc2,<3>)>) – (“egg”, <(doc2,<4>)>) – (“problem”, <(doc2,<5>)>) – (“kentucky”, <(doc3,<1>)>) – (“fried”, <(doc3,<2>)>) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #13
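
The whole rundown above can be reproduced with the same toy driver; positions are 1-based as in the slides, and the tokenisation regex is an illustrative choice:

```python
import re

def index_mapper(doc_id, text):
    # Map: emit (term, (doc_id, position)) for every token in the document.
    for pos, term in enumerate(re.findall(r"\w+", text.lower()), start=1):
        yield term, (doc_id, pos)

def index_reducer(term, postings):
    # Reduce: group the positions per document -> (term, [(doc, [pos, ...]), ...]).
    by_doc = {}
    for doc_id, pos in sorted(postings):
        by_doc.setdefault(doc_id, []).append(pos)
    yield term, sorted(by_doc.items())

docs = [("doc1", "Why did the chicken cross the road?"),
        ("doc2", "The chicken and egg problem"),
        ("doc3", "Kentucky Fried Chicken")]
index = dict(run_mapreduce(docs, index_mapper, index_reducer))
print(index["the"])       # [('doc1', [3, 6]), ('doc2', [1])]
print(index["chicken"])   # [('doc1', [4]), ('doc2', [2]), ('doc3', [3])]
```
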
  • 14. Map/Reduce – pros & cons • Good for – Lots of input, intermediate & output data – Little or no synchronisation required – “Read once”, batch oriented datasets (ETL) • Bad for – Fast response time – Large amounts of shared data – Fine-grained synchronisation required – CPU intensive operations (as opposed to data intensive) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #14
  • 15. Dryad • Microsoft Research (2007), http://research.microsoft.com/en-us/projects/dryad/ • General purpose distributed execution engine – Focus on throughput, not latency – Automatic management of scheduling, distribution & fault tolerance • Simple DAG model – Vertices -> processes (processing nodes) – Edges -> communication channels between the processes • DAG model benefits – Generic scheduler – No deadlocks / deterministic – Easier fault tolerance Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #15
  • 16. Dryad DAG jobs (C) Michael Isard Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #16
  • 17. Dryad (3) • The job graph can mutate during execution (?) • Channel types (one way) – Files on a DFS – Temporary file – Shared memory FIFO – TCP pipes • Fault tolerance – Node fails => re-run – Input disappears => re-run upstream node – Node is slow => run a duplicate copy at another node, get first result Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #17
  • 18. Dryad architecture & components (C) Mihai Budiu Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #18
  • 19. Dryad programming • C++ API (incl. Map/Reduce interfaces) • SQL Server Integration Services (SSIS) – Many parallel SQL Server instances (each is a vertex in the DAG) • DryadLINQ – LINQ to Dryad translator • Distributed shell – Generalisation of the Unix shell & pipes – Many inputs/outputs per process! – Pipes span multiple machines Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #19
  • 20. Dryad vs. Map/Reduce (C) Mihai Budiu Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #20
  • 21. Contents Part II Open Source Map/Reduce frameworks Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #21
  • 22. Hadoop • Apache Nutch (2004), Yahoo is currently the major contributor • http://hadoop.apache.org/ • Not only a Map/Reduce implementation! – HDFS – distributed filesystem – HBase – distributed column store – Pig – high level query language (SQL like) – Hive – Hadoop based data warehouse – ZooKeeper, Chukwa, Pipes/Streaming, … • Also available on Amazon EC2 • Largest Hadoop cluster – 25K nodes / 100K cores (Yahoo) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #22
  • 23. Hadoop - Map/Reduce • Components – Job client – Job Tracker • Only one • Scheduling, coordinating, monitoring, failure handling – Task Tracker • Many • Executes tasks assigned by the Job Tracker • Sends “heartbeats” and progress reports back to the Job Tracker – Task Runner • The actual Map or Reduce task started in a separate JVM • Crashes & failures do not affect the Task Tracker on the node! Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #23
  • 24. Hadoop - Map/Reduce (2) (C) Tom White Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #24
  • 25. Hadoop - Map/Reduce (3) • Integrated with HDFS – Map tasks executed on the HDFS node where the data is (data locality => reduce traffic) – Data locality is not possible for Reduce tasks – Intermediate outputs of Map tasks (nodes) are not stored on HDFS, but locally, and then sent to the proper Reduce task (node) • Status updates – Task Runner => Task Tracker, progress updates every 3s – Task Tracker => Job Tracker, heartbeat + progress for all local tasks every 5s – If a task has no progress report for too long, it will be considered failed and re-started Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #25
  • 26. Hadoop - Map/Reduce (4) • Some extras – Counters • Gather stats about a task • Globally aggregated (Task Runner => Task Tracker => Job Tracker) • M/R counters: M/R input records, M/R output records • Filesystem counters: bytes read/written • Job counters: launched M/R tasks, failed M/R tasks, … – Joins • Copy the small set on each node and perform joins locally. Useful when one dataset is very large, the other very small (e.g. “Scalable Distributed Reasoning using MapReduce” from VUA) • Map side join – data is joined before the Map function, very efficient but less flexible (datasets must be partitioned & sorted in a particular way) • Reduce side join – more general but less efficient (Map generates (K,V) pairs using the join key) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #26
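
A minimal sketch of the reduce-side join idea, using the toy driver from earlier (the dataset names and fields below are made up for illustration): the Map re-keys every record by the join key and tags it with its source, so that matching records from both datasets meet in the same Reduce call.

```python
def join_mapper(source, record):
    # Map: tag each record with its source and re-key it by the join key (user_id).
    user_id, payload = record
    yield user_id, (source, payload)

def join_reducer(user_id, tagged_values):
    # Reduce: all records sharing the join key arrive here; pair them up.
    names  = [v for tag, v in tagged_values if tag == "users"]
    orders = [v for tag, v in tagged_values if tag == "orders"]
    for name in names:
        for amount in orders:
            yield user_id, (name, amount)

records = [("users", ("u1", "alice")), ("users", ("u2", "bob")),
           ("orders", ("u1", 30)), ("orders", ("u1", 12))]
print(run_mapreduce(records, join_mapper, join_reducer))
# [('u1', ('alice', 30)), ('u1', ('alice', 12))]
```
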
  • 27. Hadoop - Map/Reduce (5) • Built-in mappers and reducers – Chain – run a chain/pipe of sequential Maps (M+RM*). The last Map output is the Task output – FieldSelection – select a list of fields from the input dataset to be used as MR keys/values – TokenCounterMapper, SumReducer – (remember the “word count” example?) – RegexMapper – matches a regex in the input key/value pairs Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #27
  • 28. Cloud MapReduce • Accenture (2010) • http://code.google.com/p/cloudmapreduce/ • Map/Reduce implementation for AWS (EC2, S3, SimpleDB, SQS) – fast (reported as up to 60 times faster than Hadoop/EC2 in some cases) – scalable & robust (no single point of bottleneck or failure) – simple (3 KLOC) • Features – No need for centralised coordinator (JobTracker), just put job status in the cloud datastore (SimpleDB) – All data transfer & communication is handled by the Cloud – All I/O and storage is handled by the Cloud Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #28
  • 29. Cloud MapReduce (2) (C) Ricky Ho Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #29
  • 30. Cloud MapReduce (3) • Job client workflow 1. Store input data (S3) 2. Create a Map task for each data split & put it into the Mapper Queue (SQS) 3. Create multiple Partition Queues (SQS) 4. Create the Reducer Queue (SQS) & put a Reduce task for each Partition Queue 5. Create the Output Queue (SQS) 6. Create a Job Request (ref to all queues) and put it into SimpleDB 7. Start EC2 instances for Mappers & Reducers 8. Poll SimpleDB for job status 9. When the job completes, download the results from S3 Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #30
  • 31. Cloud MapReduce (4) • Mapper workflow 1. Dequeue a Map task from the Mapper Queue 2. Fetch data from S3 3. Perform the user defined map function, add the output (Km,Vm) pairs to a Partition Queue chosen by hash(Km) => several partition keys may share the same partition queue! 4. When done remove the Map task from the Mapper Queue • Reducer workflow 1. Dequeue a Reduce task from the Reducer Queue 2. Dequeue the (Km,Vm) pairs from the corresponding Partition Queue => several partitions may share the same queue! 3. Perform the user defined reduce function and add the output pairs (Kr,Vr) to the Output Queue 4. When done remove the Reduce task from the Reducer Queue Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #31
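
Purely as an illustration of the mapper workflow above, a worker loop might look roughly like the sketch below. The queue and storage objects are hypothetical stand-ins for SQS and S3; none of this is the actual Cloud MapReduce code or AWS API.

```python
def mapper_worker(map_queue, partition_queues, storage, user_map):
    # Hypothetical interfaces: map_queue / partition_queues behave like SQS
    # queues, storage like S3. This only mirrors the numbered steps above.
    while True:
        task = map_queue.receive()                     # 1. dequeue a Map task
        if task is None:
            break
        data = storage.get(task.input_key)             # 2. fetch the input split from S3
        for km, vm in user_map(task.input_key, data):  # 3. user defined map function
            # route each output pair to a partition queue by hash(Km);
            # several partition keys may share the same queue
            partition_queues[hash(km) % len(partition_queues)].send((km, vm))
        map_queue.delete(task)                         # 4. remove the task only when done
```
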
  • 32. MR.Flow • Web based M/R editor – http://www.mr-flow.com – Reusable M/R modules – Execution & status monitoring (Hadoop clusters) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #32
  • 33. Contents Part III Some Map/Reduce algorithms Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #33
  • 34. General considerations • Map execution order is not deterministic • Map processing time cannot be predicted • Reduce tasks cannot start before all Maps have finished (dataset needs to be fully partitioned) • Not suitable for continuous input streams • There will be a spike in network utilisation after the Map / before the Reduce phase • Number & size of key/value pairs – Object creation & serialisation overhead (Amdahl’s law!) • Aggregate partial results when possible! – Use Combiners Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #34
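
For the word count sketch earlier, a combiner is simply the reducer logic run locally on each map task's output; assuming the wc_mapper / wc_reducer names and the toy driver from that sketch:

```python
def wc_combiner(term, counts):
    # Pre-aggregate counts on the map side so that far fewer (term, count)
    # pairs have to cross the network to the reducers.
    yield term, sum(counts)

# With the toy driver from earlier:
# run_mapreduce(docs, wc_mapper, wc_reducer, combiner=wc_combiner)
```
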
  • 35. Graph algorithms • Very suitable for M/R processing – Data (graph node) locality – “spreading activation” type of processing – Some algorithms with sequential dependencies are not suitable for M/R • Breadth-first search algorithms better than depth-first • General Approach – Graph represented by adjacency lists – Map task – input: node + its adjacency list; perform some analysis over the node link structure; output: target key + analysis result – Reduce task – aggregate values by key – Perform multiple iterations (with a termination criterion) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #35
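
Following the general approach above, here is a sketch of one iteration of the naïve shortest-path algorithm (unweighted edges; the source node starts at distance 0, all others at infinity; the graph representation is an illustrative choice):

```python
INF = float("inf")

def sp_mapper(node, value):
    dist, neighbours = value                 # value = (current distance, adjacency list)
    yield node, ("graph", neighbours)        # pass the graph structure through
    yield node, ("dist", dist)               # keep the node's own current distance
    if dist != INF:
        for nbr in neighbours:
            yield nbr, ("dist", dist + 1)    # tentative distance via this node

def sp_reducer(node, values):
    neighbours, best = [], INF
    for tag, v in values:
        if tag == "graph":
            neighbours = v
        elif v < best:
            best = v
    yield node, (best, neighbours)           # becomes the input of the next iteration

# Iterate run_mapreduce(graph, sp_mapper, sp_reducer) until no distance changes.
```
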
  • 36. Social Network Analysis • Problem: recommend new friends (friend-of-a-friend, FOAF) • Map task – U (target user) is fixed and its friends list copied to all cluster nodes (“copy join”); each cluster node stores part of the social graph – In: (X, <friendsX>), i.e. the local data for the cluster node – Out: • if (U, X) are friends => (U, <friendsX \ friendsU>), i.e. the users who are friends of X but not already friends of U • nil otherwise • Reduce task – In: (U, <<friendsA \ friendsU>, <friendsB \ friendsU>, … >), i.e. the FOAF lists for all users A, B, etc. who are friends with U – Out (U, <(X1, N1), (X2, N2), …>), where each X is a FOAF for U, and N is its total number of occurrences in all FOAF lists (sort/rank the result!) Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #36
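
A sketch of the FOAF recommendation above; the broadcast data (u and friends_u) is shown as explicit arguments for readability, and all names are illustrative:

```python
from collections import Counter

def foaf_mapper(x, friends_x, u, friends_u):
    # u and friends_u are broadcast to every cluster node ("copy join");
    # (x, friends_x) is the locally stored part of the social graph.
    if x in friends_u:
        candidates = set(friends_x) - set(friends_u) - {u}
        if candidates:
            yield u, candidates              # friends of X that U does not have yet

def foaf_reducer(u, candidate_sets):
    # Count in how many FOAF lists each candidate appears and rank by that count.
    counts = Counter(c for cands in candidate_sets for c in cands)
    yield u, counts.most_common()            # [(X1, N1), (X2, N2), ...] ranked by N
```
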
  • 37. PageRank with M/R (C) Jimmy Lin Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #37
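
The original figure is not reproduced here, but one common way to express a single PageRank iteration in M/R terms is sketched below (simplified: dangling nodes are ignored, the total node count n_nodes is assumed known, and all names are illustrative):

```python
def pr_mapper(node, value):
    rank, out_links = value
    yield node, ("links", out_links)             # preserve the graph structure
    if out_links:
        share = rank / len(out_links)
        for target in out_links:
            yield target, ("rank", share)        # pass rank mass along each out-link

def pr_reducer(node, values, damping=0.85, n_nodes=1):
    out_links, incoming = [], 0.0
    for tag, v in values:
        if tag == "links":
            out_links = v
        else:
            incoming += v
    new_rank = (1 - damping) / n_nodes + damping * incoming
    yield node, (new_rank, out_links)            # input for the next iteration
```
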
  • 38. Text Indexing & Retrieval • Indexing is very suitable for M/R – Focus on scalability, not on latency & response time – Batch oriented • Map task – emit (Term, (DocID, position)) • Reduce task – Group pairs by Term and sort by DocID Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #38
  • 39. Text Indexing & Retrieval (2) (C) Jimmy Lin Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #39
  • 40. Text Indexing & Retrieval (3) • Retrieval not suitable for M/R – Focus on response time – Startup of Mappers & Reducers is usually prohibitively expensive • Katta – http://katta.sourceforge.net/ – Distributed Lucene indexing with Hadoop (HDFS) – Multicast querying & ranking Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #40
  • 41. Useful links • "MapReduce: Simplified Data Processing on Large Clusters" • "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks" • "Cloud MapReduce Technical Report" • Data-Intensive Text Processing with MapReduce • Hadoop - The Definitive Guide Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #41
  • 42. Q&A Questions? Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #42