Large Scale Data Analysis with Map/Reduce, part I

Transcript of "Large Scale Data Analysis with Map/Reduce, part I"

  1. Large Scale Data Analysis with Map/Reduce, part I
     Marin Dimitrov (technology watch #1), Feb 2010

  2. Contents
     • Map/Reduce
     • Dryad
     • Sector/Sphere
     • Open source M/R frameworks & tools
       – Hadoop (Yahoo/Apache)
       – Cloud MapReduce (Accenture)
       – Elastic MapReduce (Hadoop on AWS)
       – MR.Flow
     • Some M/R algorithms
       – Graph algorithms, Text Indexing & Retrieval

  3. Contents – Part I: Distributed computing frameworks
  4. Scalability & Parallelisation
     • Scalability approaches
       – Scale up (vertical scaling)
         • Only one direction of improvement (bigger box)
       – Scale out (horizontal scaling)
         • Two directions – add more nodes + scale up each node
         • Can achieve x4 the performance of a similarly priced scale-up system (ref?)
       – Hybrid (“scale out in a box”)
     • Not suitable for parallel algorithms:
       – Algorithms with state
       – Dependencies from one iteration to another (recurrence, induction)
  5. Parallelisation approaches
     • Task decomposition
       – Distribute coarse-grained (synchronisation wise) and computationally expensive
         tasks (otherwise too much coordination/management overhead)
       – Dependencies – execution order vs. data dependencies
       – Move the data to the processing (when needed)
     • Data decomposition
       – Each parallel task works with a data partition assigned to it (no sharing)
       – Data has regular structure, i.e. chunks expected to need the same amount of
         processing time
       – Two criteria: granularity (size of chunk) and shape (data exchange between
         chunk neighbours)
       – Move the processing to the data
  6. Amdahl’s law
     • Impossible to achieve linear speedup
     • Maximum speedup is always bounded by the overhead for parallelisation and by the
       serial processing part
     • Amdahl’s law
       – max_speedup = 1 / ((1 − P) + P / N)
       – P: proportion of the program that can be parallelised (1 − P remains serial or overhead)
       – N: number of processors / parallel nodes
       – Example: P = 75% (i.e. 25% serial or overhead)

         N (parallel nodes) |    2 |    4 |    8 |   16 |   32 | 1024 |  64K
         Max speedup        | 1.60 | 2.29 | 2.91 | 3.37 | 3.66 | 3.99 | 3.99
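To make the numbers above easy to reproduce, here is a minimal Java sketch (not part of the original deck) that evaluates the reconstructed formula 1 / ((1 − P) + P / N) for the P = 75% example:

```java
// Amdahl's law: maximum speedup for a program where a fraction P is parallelisable
// across N nodes. Reproduces the P = 0.75 table from the slide above.
public class Amdahl {

    static double maxSpeedup(double p, long n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        double p = 0.75;                               // 25% stays serial or is overhead
        long[] nodes = {2, 4, 8, 16, 32, 1024, 65536}; // 64K = 65536 nodes
        for (long n : nodes) {
            System.out.printf("N = %6d -> max speedup = %.2f%n", n, maxSpeedup(p, n));
        }
        // Even with 64K nodes the speedup stays below 1 / (1 - P) = 4.0
    }
}
```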
  7. Map/Reduce
     • Google (2005), US patent (2010)
     • General idea – co-locate data with computation nodes
       – Data decomposition (parallelization) – no data/order dependencies between tasks
         (except the Map-to-Reduce phase)
       – Try to utilise data locality (bandwidth is $$$)
       – Implicit data flow (higher abstraction level than MPI)
       – Partial failure handling (failed map/reduce tasks are re-scheduled)
     • Structure
       – Map – for each input (Ki, Vi) produce zero or more output pairs (Km, Vm)
       – Combine – optional intermediate aggregation (less M->R data transfer)
       – Reduce – for input pair (Km, list(V1, V2, …, Vn)) produce zero or more output
         pairs (Kr, Vr)

  8. Map/Reduce (2) – [diagram] (C) Jimmy Lin

  9. Map/Reduce – examples
     • In other words…
       – Map = partitioning of the data (compute part of a problem across several servers)
       – Reduce = processing of the partitions (aggregate the partial results from all
         servers into a single resultset)
       – The M/R framework takes care of grouping of partitions by key
     • Example: word count
       – Map (1 task per document in the collection)
         • In: docx
         • Out: (term1, count1,x), (term2, count2,x), …
       – Reduce (1 task per term in the collection)
         • In: (term1, <count1,x, count1,y, … count1,z>)
         • Out: (term1, SUM(count1,x, count1,y, … count1,z))
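A sketch of the word-count Map and Reduce steps above, written against the Hadoop Java API covered in Part II of the deck; the class names are illustrative and the input is assumed to be plain text read line by line:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each document (line) emit (term, 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            term.set(tokens.nextToken().toLowerCase());
            context.write(term, ONE);                 // partial (term, count) pair
        }
    }
}

// Reduce: for (term, <c1, c2, ...>) emit (term, SUM(c1, c2, ...))
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(term, new IntWritable(sum));    // (term, total count)
    }
}
```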
 10. Map/Reduce – examples (2)
     • Example: shortest path in a graph (naïve) – see the sketch after this slide
       – Map: in (nodein, dist); out (nodeout, dist+1) for each edge nodein -> nodeout
       – Reduce: in (noder, <dista,r, distb,r, …, distc,r>);
         out (noder, MIN(dista,r, distb,r, …, distc,r))
       – Multiple M/R iterations required, start with (nodestart, 0)
     • Example: inverted indexing (full text search)
       – Map
         • In: docx
         • Out: (term1, (docx, pos’1,x)), (term1, (docx, pos’’1,x)), (term2, (docx, pos2,x)), …
       – Reduce
         • In: (term1, <(docx, pos’1,x), (docx, pos’’1,x), (docy, pos1,y), … (docz, pos1,z)>)
         • Out: (term1, <(docx, <pos’1,x, pos’’1,x, …>), (docy, <pos1,y>), … (docz, <pos1,z>)>)
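A minimal in-memory sketch (not from the slides) of one iteration of the naïve shortest-path example: the "map" step emits dist + 1 for every outgoing edge and the "reduce" step keeps the minimum per node. For brevity the adjacency lists stay in memory here instead of being carried through the M/R iterations, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One iteration of the naive shortest-path M/R pattern, simulated in memory:
// map emits (neighbour, dist + 1) for every reached node, reduce keeps the minimum.
public class ShortestPathIteration {

    public static Map<String, Integer> iterate(Map<String, List<String>> adjacency,
                                               Map<String, Integer> distances) {
        // "Map" phase: emit a candidate distance for every outgoing edge
        Map<String, List<Integer>> emitted = new HashMap<>();
        for (Map.Entry<String, Integer> e : distances.entrySet()) {
            for (String neighbour : adjacency.getOrDefault(e.getKey(), List.of())) {
                emitted.computeIfAbsent(neighbour, k -> new ArrayList<>()).add(e.getValue() + 1);
            }
            // keep the node's own current distance as a candidate as well
            emitted.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        // "Reduce" phase: MIN over all candidate distances per node
        Map<String, Integer> next = new HashMap<>();
        emitted.forEach((node, candidates) ->
                next.put(node, candidates.stream().min(Integer::compare).orElseThrow()));
        return next;   // re-run until the distances stop changing
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"), "b", List.of("c"), "c", List.of());
        Map<String, Integer> dist = Map.of("a", 0);   // start with (node_start, 0)
        for (int i = 0; i < 3; i++) {
            dist = iterate(graph, dist);
        }
        System.out.println(dist);                     // {a=0, b=1, c=1}
    }
}
```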
 11. Map/Reduce – examples (3)
     • Inverted index example rundown
     • Input
       – Doc1: “Why did the chicken cross the road?”
       – Doc2: “The chicken and egg problem”
       – Doc3: “Kentucky Fried Chicken”
     • Map phase (3 parallel tasks)
       – map1 => (“why”,(doc1,1)), (“did”,(doc1,2)), (“the”,(doc1,3)), (“chicken”,(doc1,4)),
         (“cross”,(doc1,5)), (“the”,(doc1,6)), (“road”,(doc1,7))
       – map2 => (“the”,(doc2,1)), (“chicken”,(doc2,2)), (“and”,(doc2,3)), (“egg”,(doc2,4)),
         (“problem”,(doc2,5))
       – map3 => (“kentucky”,(doc3,1)), (“fried”,(doc3,2)), (“chicken”,(doc3,3))

 12. Map/Reduce – examples (4)
     • Inverted index example rundown (cont.)
     • Intermediate shuffle & sort phase
       – (“why”, <(doc1,1)>)
       – (“did”, <(doc1,2)>)
       – (“the”, <(doc1,3), (doc1,6), (doc2,1)>)
       – (“chicken”, <(doc1,4), (doc2,2), (doc3,3)>)
       – (“cross”, <(doc1,5)>)
       – (“road”, <(doc1,7)>)
       – (“and”, <(doc2,3)>)
       – (“egg”, <(doc2,4)>)
       – (“problem”, <(doc2,5)>)
       – (“kentucky”, <(doc3,1)>)
       – (“fried”, <(doc3,2)>)

 13. Map/Reduce – examples (5)
     • Inverted index example rundown (cont.)
     • Reduce phase (11 parallel tasks)
       – (“why”, <(doc1,<1>)>)
       – (“did”, <(doc1,<2>)>)
       – (“the”, <(doc1,<3,6>), (doc2,<1>)>)
       – (“chicken”, <(doc1,<4>), (doc2,<2>), (doc3,<3>)>)
       – (“cross”, <(doc1,<5>)>)
       – (“road”, <(doc1,<7>)>)
       – (“and”, <(doc2,<3>)>)
       – (“egg”, <(doc2,<4>)>)
       – (“problem”, <(doc2,<5>)>)
       – (“kentucky”, <(doc3,<1>)>)
       – (“fried”, <(doc3,<2>)>)
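The same inverted-index rundown expressed as a Hadoop mapper/reducer sketch. Class names are illustrative, positions are 1-based token offsets as in the slides, and the input is assumed to arrive as (docId, text) pairs (e.g. via KeyValueTextInputFormat):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for a (docId, text) pair emit (term, "docId,position") for every token
public class InvertedIndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text docId, Text text, Context context)
            throws IOException, InterruptedException {
        String[] tokens = text.toString().toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) continue;
            // 1-based position, e.g. ("chicken", "doc1,4")
            context.write(new Text(tokens[pos]), new Text(docId + "," + (pos + 1)));
        }
    }
}

// Reduce: group the (docId, position) postings of one term into a single posting list
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> postings, Context context)
            throws IOException, InterruptedException {
        StringBuilder list = new StringBuilder();
        for (Text posting : postings) {
            if (list.length() > 0) list.append("; ");
            list.append(posting.toString());          // e.g. "doc1,3; doc1,6; doc2,1"
        }
        context.write(term, new Text(list.toString()));
    }
}
```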
 14. Map/Reduce – pros & cons
     • Good for
       – Lots of input, intermediate & output data
       – Little or no synchronisation required
       – “Read once”, batch oriented datasets (ETL)
     • Bad for
       – Fast response time
       – Large amounts of shared data
       – Fine-grained synchronisation required
       – CPU intensive operations (as opposed to data intensive)
 15. Dryad
     • Microsoft Research (2007), http://research.microsoft.com/en-us/projects/dryad/
     • General purpose distributed execution engine
       – Focus on throughput, not latency
       – Automatic management of scheduling, distribution & fault tolerance
     • Simple DAG model
       – Vertices -> processes (processing nodes)
       – Edges -> communication channels between the processes
     • DAG model benefits
       – Generic scheduler
       – No deadlocks / deterministic
       – Easier fault tolerance
 16. Dryad DAG jobs – [diagram] (C) Michael Isard

 17. Dryad (3)
     • The job graph can mutate during execution (?)
     • Channel types (one way)
       – Files on a DFS
       – Temporary file
       – Shared memory FIFO
       – TCP pipes
     • Fault tolerance
       – Node fails => re-run
       – Input disappears => re-run upstream node
       – Node is slow => run a duplicate copy at another node, get first result

 18. Dryad architecture & components – [diagram] (C) Mihai Budiu

 19. Dryad programming
     • C++ API (incl. Map/Reduce interfaces)
     • SQL Integration Services (SSIS)
       – Many parallel SQL Server instances (each is a vertex in the DAG)
     • DryadLINQ
       – LINQ to Dryad translator
     • Distributed shell
       – Generalisation of the Unix shell & pipes
       – Many inputs/outputs per process!
       – Pipes span multiple machines

 20. Dryad vs. Map/Reduce – [comparison diagram] (C) Mihai Budiu

 21. Contents – Part II: Open Source Map/Reduce frameworks

 22. Hadoop
     • Apache Nutch (2004), Yahoo is currently the major contributor
     • http://hadoop.apache.org/
     • Not only a Map/Reduce implementation!
       – HDFS – distributed filesystem
       – HBase – distributed column store
       – Pig – high level query language (SQL like)
       – Hive – Hadoop based data warehouse
       – ZooKeeper, Chukwa, Pipes/Streaming, …
     • Also available on Amazon EC2
     • Largest Hadoop cluster – 25K nodes / 100K cores (Yahoo)
 23. Hadoop – Map/Reduce
     • Components
       – Job client
       – Job Tracker
         • Only one
         • Scheduling, coordinating, monitoring, failure handling
       – Task Tracker
         • Many
         • Executes tasks received from the Job Tracker
         • Sends “heartbeats” and progress reports back to the Job Tracker
       – Task Runner
         • The actual Map or Reduce task, started in a separate JVM
         • Crashes & failures do not affect the Task Tracker on the node!
 24. Hadoop – Map/Reduce (2) – [diagram] (C) Tom White
 25. Hadoop – Map/Reduce (3)
     • Integrated with HDFS
       – Map tasks are executed on the HDFS node where the data is (data locality =>
         reduced network traffic)
       – Data locality is not possible for Reduce tasks
       – Intermediate outputs of Map tasks (nodes) are not stored on HDFS, but locally,
         and then sent to the proper Reduce task (node)
     • Status updates
       – Task Runner => Task Tracker, progress updates every 3s
       – Task Tracker => Job Tracker, heartbeat + progress for all local tasks every 5s
       – If a task has no progress report for too long, it is considered failed and re-started
 26. Hadoop – Map/Reduce (4)
     • Some extras
       – Counters (see the sketch after this slide)
         • Gather stats about a task
         • Globally aggregated (Task Runner => Task Tracker => Job Tracker)
         • M/R counters: M/R input records, M/R output records
         • Filesystem counters: bytes read/written
         • Job counters: launched M/R tasks, failed M/R tasks, …
       – Joins
         • Copy the small dataset to each node and perform the join locally – useful when
           one dataset is very large and the other very small (e.g. “Scalable Distributed
           Reasoning using MapReduce” from VUA)
         • Map-side join – data is joined before the Map function; very efficient but less
           flexible (datasets must be partitioned & sorted in a particular way)
         • Reduce-side join – more general but less efficient (Map generates (K,V) pairs
           using the join key)
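A small sketch of the counters mentioned above: a mapper bumps a user-defined counter and the framework aggregates it across all tasks and reports it with the job status. The counter enum and the field layout are illustrative, not from the slides:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts how many malformed input records a map task has seen; the framework
// aggregates the counter globally across all map tasks.
public class RecordQualityMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // illustrative user-defined counter
    enum RecordQuality { MALFORMED_RECORDS }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        if (fields.length < 2) {
            context.getCounter(RecordQuality.MALFORMED_RECORDS).increment(1);
            return;                               // skip malformed lines
        }
        context.write(new Text(fields[0]), new LongWritable(1L));
    }
}
```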
 27. Hadoop – Map/Reduce (5)
     • Built-in mappers and reducers
       – Chain – run a chain/pipe of sequential Maps (M+RM*); the last Map output is the
         task output
       – FieldSelection – select a list of fields from the input dataset to be used as M/R
         keys/values
       – TokenCounterMapper, SumReducer – (remember the “word count” example?) – see the
         driver sketch below
       – RegexMapper – matches a regex in the input key/value pairs
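A driver sketch wiring the built-in token-counting pair together. The sum reducer ships as IntSumReducer in the mapreduce lib packages; the job name and paths are illustrative, and the Job.getInstance factory belongs to the newer API (on the 0.20-era API the constructor new Job(conf, name) plays the same role):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Word count built entirely from Hadoop's stock mapper/reducer classes.
public class BuiltInWordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "builtin-word-count");
        job.setJarByClass(BuiltInWordCount.class);

        job.setMapperClass(TokenCounterMapper.class);   // tokenises text, emits (token, 1)
        job.setCombinerClass(IntSumReducer.class);      // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);       // sums the counts per token
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```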
 28. Cloud MapReduce
     • Accenture (2010)
     • http://code.google.com/p/cloudmapreduce/
     • Map/Reduce implementation for AWS (EC2, S3, SimpleDB, SQS)
       – Fast (reported as up to 60 times faster than Hadoop/EC2 in some cases)
       – Scalable & robust (no single point of bottleneck or failure)
       – Simple (3 KLOC)
     • Features
       – No need for a centralised coordinator (JobTracker), just put the job status in the
         cloud datastore (SimpleDB)
       – All data transfer & communication is handled by the cloud
       – All I/O and storage is handled by the cloud

 29. Cloud MapReduce (2) – [architecture diagram] (C) Ricky Ho
 30. Cloud MapReduce (3)
     • Job client workflow
       1. Store input data (S3)
       2. Create a Map task for each data split & put it into the Mapper Queue (SQS)
       3. Create multiple Partition Queues (SQS)
       4. Create the Reducer Queue (SQS) & put a Reduce task for each Partition Queue
       5. Create the Output Queue (SQS)
       6. Create a Job Request (referencing all queues) and put it into SimpleDB
       7. Start EC2 instances for Mappers & Reducers
       8. Poll SimpleDB for job status
       9. When the job completes, download results from S3
 31. Cloud MapReduce (4)
     • Mapper workflow
       1. Dequeue a Map task from the Mapper Queue
       2. Fetch data from S3
       3. Perform the user defined map function, add the output (Km, Vm) pairs to one of
          the Partition Queues (chosen by hash(Km)) => several partition keys may share
          the same partition queue!
       4. When done, remove the Map task from the Mapper Queue
     • Reducer workflow
       1. Dequeue a Reduce task from the Reducer Queue
       2. Dequeue the (Km, Vm) pairs from the corresponding Partition Queue => several
          partitions may share the same queue!
       3. Perform the user defined reduce function and add the output pairs (Kr, Vr) to
          the Output Queue
       4. When done, remove the Reduce task from the Reducer Queue
 32. MR.Flow
     • Web based M/R editor
       – http://www.mr-flow.com
       – Reusable M/R modules
       – Execution & status monitoring (Hadoop clusters)

 33. Contents – Part III: Some Map/Reduce algorithms
 34. General considerations
     • Map execution order is not deterministic
     • Map processing time cannot be predicted
     • Reduce tasks cannot start before all Maps have finished (the dataset needs to be
       fully partitioned)
     • Not suitable for continuous input streams
     • There will be a spike in network utilisation after the Map / before the Reduce phase
     • Number & size of key/value pairs
       – Object creation & serialisation overhead (Amdahl’s law!)
     • Aggregate partial results when possible!
       – Use Combiners
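Besides registering a Combiner (as in the driver sketch under slide 27), partial results can also be aggregated inside the map task itself, the "in-mapper combining" pattern described in the Lin book cited at the end of the deck. A sketch with illustrative names:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Aggregates partial counts inside the map task and emits them once in cleanup(),
// so far fewer (key, value) objects are created, serialised and shuffled.
public class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> partialCounts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String token : value.toString().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                partialCounts.merge(token, 1, Integer::sum);   // aggregate locally
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit one pair per distinct token seen by this map task
        for (Map.Entry<String, Integer> e : partialCounts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}
```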
 35. Graph algorithms
     • Very suitable for M/R processing
       – Data (graph node) locality
       – “Spreading activation” type of processing
       – Some algorithms with sequential dependencies are not suitable for M/R
     • Breadth-first search algorithms work better than depth-first
     • General approach
       – Graph represented by adjacency lists
       – Map task – input: a node + its adjacency list; perform some analysis over the
         node’s link structure; output: target key + analysis result
       – Reduce task – aggregate values by key
       – Perform multiple iterations (with a termination criterion)
 36. Social Network Analysis
     • Problem: recommend new friends (friend-of-a-friend, FOAF)
     • Map task
       – U (target user) is fixed and its friends list is copied to all cluster nodes
         (“copy join”); each cluster node stores part of the social graph
       – In: (X, <friendsX>), i.e. the local data for the cluster node
       – Out:
         • if (U, X) are friends => (U, <friendsX \ friendsU>), i.e. the users who are
           friends of X but not already friends of U
         • nil otherwise
     • Reduce task
       – In: (U, <<friendsA \ friendsU>, <friendsB \ friendsU>, …>), i.e. the FOAF lists
         for all users A, B, etc. who are friends with U
       – Out: (U, <(X1, N1), (X2, N2), …>), where each X is a FOAF for U and N is its total
         number of occurrences across all FOAF lists (sort/rank the result!)
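A small in-memory sketch (not from the slides) of the Reduce step: count how often each candidate appears across the FOAF lists produced by the Map tasks, then rank by count. All names and the sample data are illustrative:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Reduce step of the FOAF recommendation: each input list holds the friends of one
// of U's friends, minus U's existing friends; count and rank the candidates.
public class FoafReducer {

    public static List<Map.Entry<String, Integer>> reduce(List<List<String>> foafLists) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> candidates : foafLists) {
            for (String candidate : candidates) {
                counts.merge(candidate, 1, Integer::sum);   // N = occurrences across lists
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        return ranked;                                      // best recommendations first
    }

    public static void main(String[] args) {
        // FOAF lists produced by the map phase for target user U
        List<List<String>> foaf = List.of(
                List.of("dave", "erin"),          // friends of A not already friends of U
                List.of("erin", "frank"),         // friends of B not already friends of U
                List.of("erin"));                 // friends of C not already friends of U
        System.out.println(reduce(foaf));         // erin=3 first; dave/frank order may vary
    }
}
```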
 37. PageRank with M/R – [diagram] (C) Jimmy Lin

 38. Text Indexing & Retrieval
     • Indexing is very suitable for M/R
       – Focus on scalability, not on latency & response time
       – Batch oriented
     • Map task – emit (Term, (DocID, position))
     • Reduce task – group pairs by Term and sort by DocID

 39. Text Indexing & Retrieval (2) – [diagram] (C) Jimmy Lin

 40. Text Indexing & Retrieval (3)
     • Retrieval is not suitable for M/R
       – Focus on response time
       – Startup of Mappers & Reducers is usually prohibitively expensive
     • Katta
       – http://katta.sourceforge.net/
       – Distributed Lucene indexing with Hadoop (HDFS)
       – Multicast querying & ranking

 41. Useful links
     • “MapReduce: Simplified Data Processing on Large Clusters”
     • “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks”
     • “Cloud MapReduce Technical Report”
     • “Data-Intensive Text Processing with MapReduce”
     • “Hadoop – The Definitive Guide”

 42. Q&A – Questions?