Large Scale Data Analysis with Map/Reduce, part I
Presentation Transcript

• Large Scale Data Analysis with Map/Reduce, part I
  Marin Dimitrov (technology watch #1)
  Feb 2010
• Contents
  • Map/Reduce
  • Dryad
  • Sector/Sphere
  • Open source M/R frameworks & tools
    – Hadoop (Yahoo/Apache)
    – Cloud MapReduce (Accenture)
    – Elastic MapReduce (Hadoop on AWS)
    – MR.Flow
  • Some M/R algorithms
    – Graph algorithms, text indexing & retrieval
• Contents: Part I – Distributed computing frameworks
• Scalability & Parallelisation
  • Scalability approaches
    – Scale up (vertical scaling)
      • Only one direction of improvement (a bigger box)
    – Scale out (horizontal scaling)
      • Two directions of improvement – add more nodes and scale up each node
      • Can achieve ~4x the performance of a similarly priced scale-up system (ref?)
    – Hybrid ("scale out in a box")
  • Not all algorithms parallelise well
    – Algorithms with state
    – Dependencies from one iteration to the next (recurrence, induction)
• Parallelisation approaches
  • Task decomposition
    – Distribute tasks that are coarse-grained (synchronisation-wise) and computationally expensive (otherwise the coordination/management overhead is too high)
    – Dependencies: execution-order vs. data dependencies
    – Move the data to the processing (when needed)
  • Data decomposition
    – Each parallel task works with the data partition assigned to it (no sharing)
    – The data has a regular structure, i.e. chunks are expected to need about the same processing time
    – Two criteria: granularity (chunk size) and shape (data exchange between neighbouring chunks)
    – Move the processing to the data
• Amdahl's law
  • Impossible to achieve linear speedup – the maximum speedup is always bounded by the parallelisation overhead and by the serial part of the program
  • Amdahl's law: max_speedup = 1 / ((1 - P) + P/N)
    – P: proportion of the program that can be parallelised (1 - P remains serial or overhead)
    – N: number of processors / parallel nodes
  • Example: P = 75% (i.e. 25% serial or overhead)

      N (parallel nodes):   2      4      8      16     32     1024   64K
      Max speedup:          1.60   2.29   2.91   3.37   3.66   3.99   3.99
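A quick check of the table above using the reconstructed formula (a minimal Python sketch; the formula and the numbers are from the slide, the code itself is illustrative):

    def max_speedup(p, n):
        # Amdahl's law: the speedup is bounded by the serial fraction (1 - p).
        return 1.0 / ((1.0 - p) + p / n)

    if __name__ == "__main__":
        P = 0.75  # 75% of the program can be parallelised
        for n in (2, 4, 8, 16, 32, 1024, 64 * 1024):
            print(f"N = {n:6d}  max speedup = {max_speedup(P, n):.2f}")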
• Map/Reduce
  • Google (2005), US patent (2010)
  • General idea – co-locate data with the computation nodes
    – Data decomposition (parallelisation) – no data or ordering dependencies between tasks (except the Map-to-Reduce phase)
    – Try to utilise data locality (bandwidth is $$$)
    – Implicit data flow (higher abstraction level than MPI)
    – Partial failure handling (failed Map/Reduce tasks are re-scheduled)
  • Structure
    – Map: for each input pair (Ki, Vi) produce zero or more output pairs (Km, Vm)
    – Combine: optional intermediate aggregation (less Map-to-Reduce data transfer)
    – Reduce: for an input pair (Km, list(V1, V2, …, Vn)) produce zero or more output pairs (Kr, Vr)
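The Map/Combine/Reduce structure above can be sketched as a toy single-process driver (Python, illustrative only; a real framework distributes each phase across machines and persists intermediate data):

    from collections import defaultdict

    def run_mapreduce(records, map_fn, reduce_fn, combine_fn=None):
        """Toy Map/Reduce driver: map, optional map-side combine, shuffle by key, reduce."""
        shuffled = defaultdict(list)
        for key, value in records:                          # one "map task" per input record
            local = defaultdict(list)
            for k, v in map_fn(key, value):                 # Map: emits zero or more (Km, Vm)
                local[k].append(v)
            for k, vs in local.items():                     # Combine: optional local aggregation
                pairs = combine_fn(k, vs) if combine_fn else ((k, v) for v in vs)
                for k2, v2 in pairs:
                    shuffled[k2].append(v2)
        results = []
        for k, vs in shuffled.items():                      # Reduce: one call per key
            results.extend(reduce_fn(k, vs))
        return results

The word-count and inverted-index examples on the following slides are instances of exactly this pattern.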
• Map/Reduce (2) – figure (C) Jimmy Lin
• Map/Reduce – examples
  • In other words…
    – Map = partitioning of the data (compute part of the problem on each of several servers)
    – Reduce = processing of the partitions (aggregate the partial results from all servers into a single result set)
    – The M/R framework takes care of grouping the partitions by key
  • Example: word count
    – Map (1 task per document in the collection)
      • In: doc_x
      • Out: (term1, count_1,x), (term2, count_2,x), …
    – Reduce (1 task per term in the collection)
      • In: (term1, <count_1,x, count_1,y, …, count_1,z>)
      • Out: (term1, SUM(count_1,x, count_1,y, …, count_1,z))
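A minimal, self-contained sketch of the word-count example (Python; the two sample documents are made up, not from the slides):

    from collections import defaultdict

    def map_word_count(doc_id, text):
        # One map task per document: emit (term, 1) for every token.
        for term in text.lower().split():
            yield term, 1

    def reduce_word_count(term, counts):
        # One reduce task per term: sum the partial counts from all map tasks.
        yield term, sum(counts)

    if __name__ == "__main__":
        docs = {"doc1": "the chicken and the egg", "doc2": "the road"}
        shuffled = defaultdict(list)                  # simulated shuffle (group by key)
        for doc_id, text in docs.items():
            for term, count in map_word_count(doc_id, text):
                shuffled[term].append(count)
        for term in sorted(shuffled):
            print(next(reduce_word_count(term, shuffled[term])))   # ('the', 3), ...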
• Map/Reduce – examples (2)
  • Example: shortest path in a graph (naïve)
    – Map: in (node_in, dist); out (node_out, dist + 1) for every edge node_in -> node_out
    – Reduce: in (node_r, <dist_a,r, dist_b,r, …, dist_c,r>); out (node_r, MIN(dist_a,r, dist_b,r, …, dist_c,r))
    – Multiple M/R iterations required; start with (node_start, 0)
  • Example: inverted indexing (full-text search)
    – Map
      • In: doc_x
      • Out: (term1, (doc_x, pos'_1,x)), (term1, (doc_x, pos''_1,x)), (term2, (doc_x, pos_2,x)), …
    – Reduce
      • In: (term1, <(doc_x, pos'_1,x), (doc_x, pos''_1,x), (doc_y, pos_1,y), …, (doc_z, pos_1,z)>)
      • Out: (term1, <(doc_x, <pos'_1,x, pos''_1,x, …>), (doc_y, <pos_1,y>), …, (doc_z, <pos_1,z>)>)
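A hedged sketch of the naïve shortest-path formulation above, run as repeated map/reduce iterations over a tiny made-up graph (Python, single process; an inverted-index sketch follows the rundown slides below):

    from collections import defaultdict

    GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}   # made-up adjacency lists
    INF = float("inf")

    def map_distances(node, dist):
        yield node, dist                        # keep the node's current distance
        if dist != INF:
            for neighbour in GRAPH[node]:       # relax every outgoing edge
                yield neighbour, dist + 1

    def reduce_min(node, dists):
        yield node, min(dists)                  # keep the shortest known distance

    def iterate(distances):
        shuffled = defaultdict(list)
        for node, dist in distances.items():
            for n, d in map_distances(node, dist):
                shuffled[n].append(d)
        return {n: next(reduce_min(n, ds))[1] for n, ds in shuffled.items()}

    if __name__ == "__main__":
        distances = {n: INF for n in GRAPH}
        distances["a"] = 0                      # start node
        for _ in range(len(GRAPH)):             # enough iterations for this toy graph
            distances = iterate(distances)
        print(distances)                        # {'a': 0, 'b': 1, 'c': 1, 'd': 2}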
• Map/Reduce – examples (3)
  • Inverted index example rundown
  • Input
    – Doc1: "Why did the chicken cross the road?"
    – Doc2: "The chicken and egg problem"
    – Doc3: "Kentucky Fried Chicken"
  • Map phase (3 parallel tasks)
    – map1 => ("why", (doc1,1)), ("did", (doc1,2)), ("the", (doc1,3)), ("chicken", (doc1,4)), ("cross", (doc1,5)), ("the", (doc1,6)), ("road", (doc1,7))
    – map2 => ("the", (doc2,1)), ("chicken", (doc2,2)), ("and", (doc2,3)), ("egg", (doc2,4)), ("problem", (doc2,5))
    – map3 => ("kentucky", (doc3,1)), ("fried", (doc3,2)), ("chicken", (doc3,3))
• Map/Reduce – examples (4)
  • Inverted index example rundown (cont.)
  • Intermediate shuffle & sort phase
    – ("why", <(doc1,1)>)
    – ("did", <(doc1,2)>)
    – ("the", <(doc1,3), (doc1,6), (doc2,1)>)
    – ("chicken", <(doc1,4), (doc2,2), (doc3,3)>)
    – ("cross", <(doc1,5)>)
    – ("road", <(doc1,7)>)
    – ("and", <(doc2,3)>)
    – ("egg", <(doc2,4)>)
    – ("problem", <(doc2,5)>)
    – ("kentucky", <(doc3,1)>)
    – ("fried", <(doc3,2)>)
• Map/Reduce – examples (5)
  • Inverted index example rundown (cont.)
  • Reduce phase (11 parallel tasks)
    – ("why", <(doc1,<1>)>)
    – ("did", <(doc1,<2>)>)
    – ("the", <(doc1,<3,6>), (doc2,<1>)>)
    – ("chicken", <(doc1,<4>), (doc2,<2>), (doc3,<3>)>)
    – ("cross", <(doc1,<5>)>)
    – ("road", <(doc1,<7>)>)
    – ("and", <(doc2,<3>)>)
    – ("egg", <(doc2,<4>)>)
    – ("problem", <(doc2,<5>)>)
    – ("kentucky", <(doc3,<1>)>)
    – ("fried", <(doc3,<2>)>)
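The three-document rundown above can be reproduced with a small local sketch (Python; the documents and positions are taken from the slides, the code itself is illustrative):

    from collections import defaultdict

    DOCS = {
        "doc1": "why did the chicken cross the road",
        "doc2": "the chicken and egg problem",
        "doc3": "kentucky fried chicken",
    }

    def map_index(doc_id, text):
        # Emit (term, (doc_id, position)) for every token; positions start at 1.
        for pos, term in enumerate(text.split(), start=1):
            yield term, (doc_id, pos)

    def reduce_index(term, postings):
        # Group the (doc_id, position) pairs into per-document position lists.
        by_doc = defaultdict(list)
        for doc_id, pos in sorted(postings):
            by_doc[doc_id].append(pos)
        yield term, sorted(by_doc.items())

    if __name__ == "__main__":
        shuffled = defaultdict(list)            # simulated shuffle & sort phase
        for doc_id, text in DOCS.items():       # the 3 parallel map tasks from the slides
            for term, posting in map_index(doc_id, text):
                shuffled[term].append(posting)
        for term in sorted(shuffled):           # the 11 parallel reduce tasks
            print(next(reduce_index(term, shuffled[term])))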
• Map/Reduce – pros & cons
  • Good for
    – Lots of input, intermediate & output data
    – Little or no synchronisation required
    – "Read once", batch-oriented datasets (ETL)
  • Bad for
    – Fast response times
    – Large amounts of shared data
    – Fine-grained synchronisation requirements
    – CPU-intensive (as opposed to data-intensive) operations
• Dryad
  • Microsoft Research (2007), http://research.microsoft.com/en-us/projects/dryad/
  • General-purpose distributed execution engine
    – Focus on throughput, not latency
    – Automatic management of scheduling, distribution & fault tolerance
  • Simple DAG model
    – Vertices -> processes (processing nodes)
    – Edges -> communication channels between the processes
  • DAG model benefits
    – Generic scheduler
    – No deadlocks / deterministic execution
    – Easier fault tolerance
• Dryad DAG jobs – figure (C) Michael Isard
• Dryad (3)
  • The job graph can mutate during execution (?)
  • Channel types (one-way)
    – Files on a DFS
    – Temporary files
    – Shared-memory FIFOs
    – TCP pipes
  • Fault tolerance
    – A node fails => re-run it
    – Its input disappears => re-run the upstream node
    – A node is slow => run a duplicate copy on another node and take the first result
• Dryad architecture & components – figure (C) Mihai Budiu
• Dryad programming
  • C++ API (incl. Map/Reduce interfaces)
  • SQL Server Integration Services (SSIS)
    – Many parallel SQL Server instances (each is a vertex in the DAG)
  • DryadLINQ
    – LINQ-to-Dryad translator
  • Distributed shell
    – Generalisation of the Unix shell & pipes
    – Many inputs/outputs per process!
    – Pipes span multiple machines
• Dryad vs. Map/Reduce – figure (C) Mihai Budiu
• Contents: Part II – Open source Map/Reduce frameworks
• Hadoop
  • Apache Nutch (2004); Yahoo is currently the major contributor
  • http://hadoop.apache.org/
  • Not only a Map/Reduce implementation!
    – HDFS – distributed filesystem
    – HBase – distributed column store
    – Pig – high-level query language (SQL-like)
    – Hive – Hadoop-based data warehouse
    – ZooKeeper, Chukwa, Pipes/Streaming, …
  • Also available on Amazon EC2
  • Largest Hadoop cluster: 25K nodes / 100K cores (Yahoo)
• Hadoop – Map/Reduce
  • Components
    – Job Client
    – Job Tracker
      • Only one per cluster
      • Scheduling, coordination, monitoring, failure handling
    – Task Tracker
      • Many
      • Executes tasks received from the Job Tracker
      • Sends "heartbeats" and progress reports back to the Job Tracker
    – Task Runner
      • The actual Map or Reduce task, started in a separate JVM
      • Crashes & failures do not affect the Task Tracker on the node!
• Hadoop – Map/Reduce (2) – figure (C) Tom White
• Hadoop – Map/Reduce (3)
  • Integrated with HDFS
    – Map tasks are executed on the HDFS node where the data resides (data locality => reduced traffic)
    – Data locality is not possible for Reduce tasks
    – Intermediate Map outputs are not stored on HDFS but locally, and are then sent to the appropriate Reduce task (node)
  • Status updates
    – Task Runner => Task Tracker: progress updates every 3 s
    – Task Tracker => Job Tracker: heartbeat + progress for all local tasks every 5 s
    – If a task produces no progress report for too long, it is considered failed and is restarted
• Hadoop – Map/Reduce (4)
  • Some extras
    – Counters
      • Gather statistics about a task
      • Globally aggregated (Task Runner => Task Tracker => Job Tracker)
      • M/R counters: Map/Reduce input records, Map/Reduce output records
      • Filesystem counters: bytes read/written
      • Job counters: launched M/R tasks, failed M/R tasks, …
    – Joins
      • Copy the small dataset to each node and perform the join locally – useful when one dataset is very large and the other very small (e.g. "Scalable Distributed Reasoning using MapReduce" from VUA)
      • Map-side join – the data is joined before the Map function; very efficient but less flexible (the datasets must be partitioned & sorted in a particular way)
      • Reduce-side join – more general but less efficient (Map emits (K, V) pairs keyed by the join key); see the sketch after this slide
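A framework-free sketch of the reduce-side join idea (Python; the two tiny datasets and the field names are made up for illustration, and this is not Hadoop's join API):

    from collections import defaultdict

    USERS  = [("u1", "Alice"), ("u2", "Bob")]                  # (join key, name)
    ORDERS = [("u1", "order-42"), ("u2", "order-43"), ("u1", "order-44")]

    def map_users(user_id, name):
        yield user_id, ("user", name)          # tag each record with its source dataset

    def map_orders(user_id, order_id):
        yield user_id, ("order", order_id)

    def reduce_join(user_id, tagged_values):
        names  = [v for tag, v in tagged_values if tag == "user"]
        orders = [v for tag, v in tagged_values if tag == "order"]
        for name in names:                     # pair up the two sides of the join
            for order_id in orders:
                yield user_id, (name, order_id)

    if __name__ == "__main__":
        shuffled = defaultdict(list)           # both map outputs are shuffled on the join key
        for uid, name in USERS:
            for k, v in map_users(uid, name):
                shuffled[k].append(v)
        for uid, oid in ORDERS:
            for k, v in map_orders(uid, oid):
                shuffled[k].append(v)
        for uid in sorted(shuffled):
            print(list(reduce_join(uid, shuffled[uid])))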
• Hadoop – Map/Reduce (5)
  • Built-in mappers and reducers
    – Chain – run a chain/pipe of sequential Maps (M+RM*); the last Map's output is the task output
    – FieldSelection – select a list of fields from the input dataset to be used as M/R keys/values
    – TokenCounterMapper, SumReducer – (remember the "word count" example?)
    – RegexMapper – matches a regex against the input key/value pairs
• Cloud MapReduce
  • Accenture (2010)
  • http://code.google.com/p/cloudmapreduce/
  • Map/Reduce implementation for AWS (EC2, S3, SimpleDB, SQS)
    – Fast (reported as up to 60 times faster than Hadoop/EC2 in some cases)
    – Scalable & robust (no single point of bottleneck or failure)
    – Simple (3 KLOC)
  • Features
    – No need for a centralised coordinator (JobTracker) – job status is kept in the cloud datastore (SimpleDB)
    – All data transfer & communication is handled by the cloud
    – All I/O and storage is handled by the cloud
• Cloud MapReduce (2) – figure (C) Ricky Ho
• Cloud MapReduce (3)
  • Job client workflow
    1. Store the input data (S3)
    2. Create a Map task for each data split & put it into the Mapper Queue (SQS)
    3. Create multiple Partition Queues (SQS)
    4. Create the Reducer Queue (SQS) & put a Reduce task in it for each Partition Queue
    5. Create the Output Queue (SQS)
    6. Create a Job Request (referencing all queues) and put it into SimpleDB
    7. Start EC2 instances for the Mappers & Reducers
    8. Poll SimpleDB for the job status
    9. When the job completes, download the results from S3
• Cloud MapReduce (4)
  • Mapper workflow
    1. Dequeue a Map task from the Mapper Queue
    2. Fetch the corresponding data from S3
    3. Apply the user-defined map function and add the output (Km, Vm) pairs to a Partition Queue chosen by hash(Km) => several partition keys may share the same partition queue!
    4. When done, remove the Map task from the Mapper Queue
  • Reducer workflow
    1. Dequeue a Reduce task from the Reducer Queue
    2. Dequeue the (Km, Vm) pairs from the corresponding Partition Queue => several partitions may share the same queue!
    3. Apply the user-defined reduce function and add the output (Kr, Vr) pairs to the Output Queue
    4. When done, remove the Reduce task from the Reducer Queue
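A single-process simulation of the queue-based workflow above, with Python's standard queue module standing in for SQS and a plain dict standing in for S3 (illustrative only; the real system uses SQS/S3/SimpleDB and many concurrent workers, and the input splits here are made up):

    import queue
    from collections import defaultdict

    S3 = {"split1": "the chicken and egg", "split2": "the road"}   # made-up input splits
    NUM_PARTITIONS = 2

    mapper_q = queue.Queue()
    partition_qs = [queue.Queue() for _ in range(NUM_PARTITIONS)]
    reducer_q = queue.Queue()
    output_q = queue.Queue()

    # Job client: one Map task per split, one Reduce task per partition queue.
    for split in S3:
        mapper_q.put(split)
    for p in range(NUM_PARTITIONS):
        reducer_q.put(p)

    # Mapper workflow: dequeue a task, fetch its data, emit pairs into partition queues.
    while not mapper_q.empty():
        split = mapper_q.get()
        for term in S3[split].split():
            partition_qs[hash(term) % NUM_PARTITIONS].put((term, 1))

    # Reducer workflow: dequeue a Reduce task, drain its partition queue, aggregate.
    while not reducer_q.empty():
        p = reducer_q.get()
        counts = defaultdict(int)
        while not partition_qs[p].empty():
            term, count = partition_qs[p].get()
            counts[term] += count
        for term, total in counts.items():
            output_q.put((term, total))

    while not output_q.empty():
        print(output_q.get())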
• MR.Flow
  • Web-based M/R editor – http://www.mr-flow.com
    – Reusable M/R modules
    – Execution & status monitoring (Hadoop clusters)
• Contents: Part III – Some Map/Reduce algorithms
• General considerations
  • Map execution order is not deterministic
  • Map processing time cannot be predicted
  • Reduce tasks cannot start before all Maps have finished (the dataset needs to be fully partitioned)
  • Not suitable for continuous input streams
  • Expect a spike in network utilisation after the Map phase / before the Reduce phase
  • Watch the number & size of key/value pairs
    – Object creation & serialisation overhead (Amdahl's law!)
  • Aggregate partial results whenever possible
    – Use Combiners (see the sketch below)
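To make the "use Combiners" point concrete, a tiny calculation of how much intermediate data a combiner saves for word count (Python; the two sample documents are made up):

    from collections import Counter

    DOCS = {"doc1": "to be or not to be", "doc2": "to do or not to do"}

    pairs_without_combiner = 0
    pairs_with_combiner = 0
    for text in DOCS.values():
        terms = text.split()
        pairs_without_combiner += len(terms)            # one (term, 1) pair per token
        pairs_with_combiner += len(Counter(terms))      # one (term, partial sum) per distinct term

    print("intermediate pairs without combiner:", pairs_without_combiner)   # 12
    print("intermediate pairs with combiner:   ", pairs_with_combiner)      # 8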
• Graph algorithms
  • Very suitable for M/R processing
    – Data (graph node) locality
    – "Spreading activation" style of processing
    – Some algorithms with sequential dependencies are not suitable for M/R
  • Breadth-first algorithms fit better than depth-first ones
  • General approach
    – Represent the graph by adjacency lists
    – Map task – input: a node + its adjacency list; perform some analysis over the node's link structure; output: target key + analysis result
    – Reduce task – aggregate the values by key
    – Perform multiple iterations (with a termination criterion)
• Social Network Analysis
  • Problem: recommend new friends (friend-of-a-friend, FOAF)
  • Map task
    – U (the target user) is fixed and its friends list is copied to all cluster nodes ("copy join"); each cluster node stores part of the social graph
    – In: (X, <friendsX>), i.e. the local data on the cluster node
    – Out:
      • if (U, X) are friends => (U, <friendsX \ friendsU>), i.e. the users who are friends of X but not already friends of U
      • nil otherwise
  • Reduce task
    – In: (U, <<friendsA \ friendsU>, <friendsB \ friendsU>, …>), i.e. the FOAF lists for all users A, B, … who are friends with U
    – Out: (U, <(X1, N1), (X2, N2), …>), where each Xi is a FOAF of U and Ni is its total number of occurrences across all FOAF lists (sort/rank the result!)
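A local sketch of the FOAF recommendation above for a fixed target user U (Python; FRIENDS is a made-up social graph, and the copy join / cluster distribution is collapsed into a single process):

    from collections import Counter, defaultdict

    FRIENDS = {
        "u": {"a", "b"},
        "a": {"u", "c", "d"},
        "b": {"u", "c", "e"},
        "c": {"a", "b"},
        "d": {"a"},
        "e": {"b"},
    }
    U = "u"   # target user; its friends list would be copied to every cluster node

    def map_foaf(user, friends):
        # Emit X's friends that U does not have yet, but only if X is a friend of U.
        if user in FRIENDS[U]:
            yield U, friends - FRIENDS[U] - {U}

    def reduce_foaf(user, foaf_lists):
        counts = Counter(f for foafs in foaf_lists for f in foafs)
        yield user, counts.most_common()        # rank candidates by number of shared friends

    if __name__ == "__main__":
        shuffled = defaultdict(list)
        for user, friends in FRIENDS.items():
            for k, v in map_foaf(user, friends):
                shuffled[k].append(v)
        for user, foaf_lists in shuffled.items():
            print(next(reduce_foaf(user, foaf_lists)))   # ('u', [('c', 2), ('d', 1), ('e', 1)])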
• PageRank with M/R – figure (C) Jimmy Lin
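As a concrete companion to the PageRank slide, a hedged sketch of one simplified PageRank iteration in map/reduce style (Python; uniform damping, no dangling-node handling, and the tiny graph is made-up example data rather than anything from the figure):

    from collections import defaultdict

    GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # made-up adjacency lists
    DAMPING = 0.85

    def map_pagerank(node, state):
        rank, out_links = state
        yield node, ("links", out_links)                 # pass the graph structure through
        for target in out_links:                         # spread the rank mass over out-links
            yield target, ("mass", rank / len(out_links))

    def reduce_pagerank(node, values):
        out_links, mass = [], 0.0
        for tag, value in values:
            if tag == "links":
                out_links = value
            else:
                mass += value
        new_rank = (1 - DAMPING) / len(GRAPH) + DAMPING * mass
        yield node, (new_rank, out_links)

    if __name__ == "__main__":
        state = {n: (1.0 / len(GRAPH), links) for n, links in GRAPH.items()}
        for _ in range(20):                              # a fixed number of M/R iterations
            shuffled = defaultdict(list)
            for node, s in state.items():
                for k, v in map_pagerank(node, s):
                    shuffled[k].append(v)
            state = {n: next(reduce_pagerank(n, vs))[1] for n, vs in shuffled.items()}
        print({n: round(rank, 3) for n, (rank, _) in state.items()})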
• Text Indexing & Retrieval
  • Indexing is very suitable for M/R
    – Focus on scalability, not on latency & response time
    – Batch oriented
  • Map task
    – Emit (term, (docID, position))
  • Reduce task
    – Group the pairs by term and sort the postings by docID
• Text Indexing & Retrieval (2) – figure (C) Jimmy Lin
• Text Indexing & Retrieval (3)
  • Retrieval is not suitable for M/R
    – Focus on response time
    – The startup cost of Mappers & Reducers is usually prohibitive
  • Katta – http://katta.sourceforge.net/
    – Distributed Lucene indexing with Hadoop (HDFS)
    – Multicast querying & ranking
• Useful links
  • "MapReduce: Simplified Data Processing on Large Clusters"
  • "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks"
  • "Cloud MapReduce Technical Report"
  • "Data-Intensive Text Processing with MapReduce"
  • "Hadoop – The Definitive Guide"
• Q&A – Questions?