1. Large Scale Data Analysis with Map/Reduce, part I
Marin Dimitrov
(technology watch #1)
Feb 2010
2. Contents
• Map/Reduce
• Dryad
• Sector/Sphere
• Open source M/R frameworks & tools
– Hadoop (Yahoo/Apache)
– Cloud MapReduce (Accenture)
– Elastic MapReduce (Hadoop on AWS)
– MR.Flow
• Some M/R algorithms
– Graph algorithms, Text Indexing & retrieval
3. Contents
Part I
Distributed computing
frameworks
4. Scalability & Parallelisation
• Scalability approaches
– Scale up (vertical scaling)
• Only one direction of improvement (bigger box)
– Scale out (horizontal scaling)
• Two directions – add more nodes + scale up each node
• Can achieve ~4x the performance of a similarly priced scale-up system (ref?)
– Hybrid (“scale out in a box”)
• Not all algorithms parallelise well
– Algorithms with state
– Dependencies from one iteration to the next (recurrence, induction)
5. Parallelisation approaches
– Task decomposition
• Distribute coarse-grained (synchronisation wise) and computationally
expensive tasks (otherwise too much coordination/management
overhead)
• Dependencies - execution order vs. data dependencies
• Move the data to the processing (when needed)
– Data decomposition
• Each parallel task works with a data partition assigned to it (no sharing)
• Data has regular structure, i.e. chunks expected to need the same
amount of processing time
• Two criteria: granularity (size of chunk) and shape (data exchange
between chunk neighbours)
• Move the processing to the data
6. Amdahl’s law
• Impossible to achieve linear speedup
• Maximum speedup is always bounded by the overhead for
parallelisation and by the serial processing part
• Amdahl’s law
– max_speedup = 1 / ((1 - P) + P / N)
– P: proportion of the program that can be parallelised (1 - P still remains serial or overhead)
– N: number of processors / parallel nodes
– Example: P=75% (i.e. 25% serial or overhead)
N (parallel nodes)   2      4      8      16     32     1024   64K
Max speedup          1.60   2.29   2.91   3.37   3.66   3.99   3.99
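The table can be reproduced with a few lines of Python (a minimal sketch of the formula above):

# Amdahl's law: maximum speedup for a programme that is P parallelisable, run on N parallel nodes
def max_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

P = 0.75                                      # 75% parallel, 25% serial or overhead
for n in (2, 4, 8, 16, 32, 1024, 64 * 1024):
    print(f"N = {n:>6}: max speedup = {max_speedup(P, n):.2f}")
# upper bound as N grows: 1 / (1 - P) = 4.0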
7. Map/Reduce
• Google (OSDI 2004 paper), US patent (2010)
• General idea - co-locate data with computation nodes
– Data decomposition (parallelization) – no data/order dependencies
between tasks (except the Map-to-Reduce phase)
– Try to utilise data locality (bandwidth is $$$)
– Implicit data flow (higher abstraction level than MPI)
– Partial failure handling (failed map/reduce tasks are re-scheduled)
• Structure
– Map - for each input (Ki,Vi) produce zero or more output pairs
(Km,Vm)
– Combine – optional intermediate aggregation (less M->R data
transfer)
– Reduce - for input pair (Km, list(V1,V2,…, Vn)) produce zero or more
output pairs (Kr,Vr)
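A minimal single-machine Python sketch of this Map / Combine / Reduce structure (the shuffle/grouping by key is simulated in memory; the helper name and signature are illustrative, not any framework's API):

from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, combine_fn=None):
    """splits: one list of (Ki, Vi) records per simulated Map task."""
    shuffled = defaultdict(list)                 # Km -> [Vm, ...] after grouping by key
    for split in splits:
        local = defaultdict(list)                # intermediate output of one Map task
        for ki, vi in split:
            for km, vm in map_fn(ki, vi):        # Map: zero or more (Km, Vm) per input pair
                local[km].append(vm)
        for km, vms in local.items():
            if combine_fn is not None:           # Combine: local aggregation, less M->R transfer
                vms = list(combine_fn(km, vms))
            shuffled[km].extend(vms)             # simulated shuffle: group values by key
    output = []
    for km, vms in shuffled.items():             # Reduce: (Km, list(V1..Vn)) -> zero or more (Kr, Vr)
        output.extend(reduce_fn(km, vms))
    return output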
8. Map/Reduce (2)
[figure – (C) Jimmy Lin]
9. Map/Reduce - examples
• In other words…
– Map = partitioning of the data (compute part of a problem across
several servers)
– Reduce = processing of the partitions (aggregate the partial results
from all servers into a single resultset)
– The M/R framework takes care of grouping of partitions by key
• Example: word count
– Map (1 task per document in the collection)
• In: doc_x
• Out: (term_1, count_{1,x}), (term_2, count_{2,x}), … (count_{i,x} = occurrences of term_i in doc_x)
– Reduce (1 task per term in the collection)
• In: (term_1, <count_{1,x}, count_{1,y}, … count_{1,z}>)
• Out: (term_1, SUM(count_{1,x}, count_{1,y}, … count_{1,z}))
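A minimal Python sketch of this word-count job, plugged into the run_mapreduce helper from the previous slide's sketch (single-machine simulation, not a Hadoop job):

def wc_map(doc_id, text):
    for term in text.lower().split():
        yield term, 1                            # one partial count per occurrence

def wc_combine(term, counts):
    yield sum(counts)                            # local aggregation inside a Map task

def wc_reduce(term, counts):
    yield term, sum(counts)                      # total count of the term in the collection

splits = [[("doc1", "Why did the chicken cross the road")],
          [("doc2", "The chicken and egg problem")],
          [("doc3", "Kentucky Fried Chicken")]]
print(sorted(run_mapreduce(splits, wc_map, wc_reduce, wc_combine)))
# [('and', 1), ('chicken', 3), ..., ('the', 3), ...]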
10. Map/Reduce - examples (2)
• Example: Shortest path in graph (naïve)
– Map: in (node_in, dist); out (node_out, dist + 1) for each edge node_in -> node_out
– Reduce: in (node_r, <dist_{a,r}, dist_{b,r}, …, dist_{c,r}>); out (node_r, MIN(dist_{a,r}, dist_{b,r}, …, dist_{c,r}))
– Multiple M/R iterations required, start with (node_start, 0) – see the sketch at the end of this slide
• Example: Inverted indexing (full text search)
– Map
• In: doc_x
• Out: (term_1, (doc_x, pos'_{1,x})), (term_1, (doc_x, pos''_{1,x})), (term_2, (doc_x, pos_{2,x})), …
– Reduce
• In: (term_1, <(doc_x, pos'_{1,x}), (doc_x, pos''_{1,x}), (doc_y, pos_{1,y}), …, (doc_z, pos_{1,z})>)
• Out: (term_1, <(doc_x, <pos'_{1,x}, pos''_{1,x}, …>), (doc_y, <pos_{1,y}>), …, (doc_z, <pos_{1,z}>)>)
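A minimal sketch of the shortest-path example above (one M/R pass relaxes distances by one hop; a real job would also pass each node's adjacency list through the key/value pairs, which is omitted here; the graph is illustrative and the run_mapreduce helper comes from slide 7's sketch):

import math

# illustrative graph: node -> adjacency list
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def sp_map(node, dist):
    yield node, dist                         # keep the node's current distance
    if not math.isinf(dist):
        for neighbour in graph[node]:        # node_in -> node_out edges
            yield neighbour, dist + 1        # tentative distance via this node

def sp_reduce(node, dists):
    yield node, min(dists)                   # shortest distance seen so far

dists = [("A", 0)] + [(n, math.inf) for n in graph if n != "A"]
for _ in range(len(graph) - 1):              # one M/R iteration per BFS level (worst case)
    dists = run_mapreduce([dists], sp_map, sp_reduce)
print(sorted(dists))                         # [('A', 0), ('B', 1), ('C', 1), ('D', 2)]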
11. Map/Reduce - examples (3)
• Inverted index example rundown
• input
– Doc1: “Why did the chicken cross the road?”
– Doc2: “The chicken and egg problem”
– Doc3: “Kentucky Fried Chicken”
• Map phase (3 parallel tasks)
– map1 => (“why”,(doc1,1)), (“did”,(doc1,2)), (“the”,(doc1,3)),
(“chicken”,(doc1,4)), (“cross”,(doc1,5)), (“the”,(doc1,6)),
(“road”,(doc1,7))
– map2 => (“the”,(doc2,1)), (“chicken”,(doc2,2)), (“and”,(doc2,3)),
(“egg”,(doc2,4)), (“problem”, (doc2,5))
– map3 => (“kentucky”,(doc3,1)), (“fried”,(doc3,2)), (“chicken”,(doc3,3))
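The Reduce side of this walkthrough is not included in this extract; a minimal Python sketch of the whole job over the three documents above (reusing the run_mapreduce helper from slide 7's sketch; tokenisation is naive whitespace splitting):

from collections import defaultdict

def ii_map(doc_id, text):
    for position, term in enumerate(text.lower().split(), start=1):
        yield term, (doc_id, position)         # (term, (doc, position))

def ii_reduce(term, postings):
    by_doc = defaultdict(list)
    for doc_id, position in postings:
        by_doc[doc_id].append(position)        # group positions per document
    yield term, sorted(by_doc.items())         # (term, [(doc, [positions]), ...])

splits = [[("doc1", "Why did the chicken cross the road")],
          [("doc2", "The chicken and egg problem")],
          [("doc3", "Kentucky Fried Chicken")]]
index = dict(run_mapreduce(splits, ii_map, ii_reduce))
print(index["chicken"])   # [('doc1', [4]), ('doc2', [2]), ('doc3', [3])]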
14. Map/Reduce – pros & cons
• Good for
– Lots of input, intermediate & output data
– Little or no synchronisation required
– “Read once”, batch oriented datasets (ETL)
• Bad for
– Fast response time
– Large amounts of shared data
– Fine-grained synchronisation required
– CPU intensive operations (as opposed to data intensive)
15. Dryad
• Microsoft Research (2007),
http://research.microsoft.com/en-us/projects/dryad/
• General purpose distributed execution engine
– Focus on throughput, not latency
– Automatic management of scheduling, distribution & fault tolerance
• Simple DAG model
– Vertices -> processes (processing nodes)
– Edges -> communication channels between the processes
• DAG model benefits
– Generic scheduler
– No deadlocks / deterministic
– Easier fault tolerance
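Not Dryad code: a toy Python sketch of the DAG model itself – vertices are processing steps, edges are channels, and executing vertices in topological order gives a deterministic, deadlock-free schedule (the vertex names, functions and data below are made up for illustration):

from graphlib import TopologicalSorter

# toy DAG: vertex -> list of upstream vertices (its input channels)
edges = {"grep": [], "sort": ["grep"], "agg": ["sort"], "output": ["agg", "grep"]}
vertex_fn = {
    "grep":   lambda inputs: [x for x in range(20) if x % 3 == 0],
    "sort":   lambda inputs: sorted(inputs[0], reverse=True),
    "agg":    lambda inputs: [sum(inputs[0])],
    "output": lambda inputs: {"total": inputs[0][0], "matches": len(inputs[1])},
}

results = {}
for v in TopologicalSorter(edges).static_order():   # deterministic, deadlock-free order
    results[v] = vertex_fn[v]([results[d] for d in edges[v]])
print(results["output"])                             # {'total': 63, 'matches': 7}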
16. Dryad DAG jobs
[figure – (C) Michael Isard]
17. Dryad (3)
• The job graph can mutate (be dynamically refined) during execution
• Channel types (one way)
– Files on a DFS
– Temporary file
– Shared memory FIFO
– TCP pipes
• Fault tolerance
– Node fails => re-run
– Input disappears => re-run upstream node
– Node is slow => run a duplicate copy at another node, get first result
18. Dryad architecture & components
[figure – (C) Mihai Budiu]
19. Dryad programming
• C++ API (incl. Map/Reduce interfaces)
• SQL Server Integration Services (SSIS)
– Many parallel SQL Server instances (each is a vertex in the DAG)
• DryadLINQ
– LINQ to Dryad translator
• Distributed shell
– Generalisation of the Unix shell & pipes
– Many inputs/outputs per process!
– Pipes span multiple machines
20. Dryad vs. Map/Reduce
[figure – (C) Mihai Budiu]
21. Contents
Part II
Open Source Map/Reduce
frameworks
22. Hadoop
• Apache Nutch (2004), Yahoo is currently the major
contributor
• http://hadoop.apache.org/
• Not only a Map/Reduce implementation!
– HDFS – distributed filesystem
– HBase – distributed column store
– Pig – high-level dataflow language
– Hive – Hadoop-based data warehouse (SQL-like queries)
– ZooKeeper, Chukwa, Pipes/Streaming, …
• Also available on Amazon EC2
• Largest Hadoop cluster – 25K nodes / 100K cores (Yahoo)
23. Hadoop - Map/Reduce
• Components
– Job client
– Job Tracker
• Only one
• Scheduling, coordinating, monitoring, failure handling
– Task Tracker
• Many
• Executes tasks assigned by the Job Tracker
• Sends “heartbeats” and progress reports back to the Job Tracker
– Task Runner
• The actual Map or Reduce task started in a separate JVM
• Crashes & failures do not affect the Task Tracker on the node!
24. Hadoop - Map/Reduce (2)
[figure – (C) Tom White]
25. Hadoop - Map/Reduce (3)
• Integrated with HDFS
– Map tasks executed on the HDFS node where the data is (data
locality => reduce traffic)
– Data locality is not possible for Reduce tasks
– Intermediate outputs of Map tasks (nodes) are not stored on HDFS,
but locally, and then sent to the proper Reduce task (node)
• Status updates
– Task Runner => Task Tracker, progress updates every 3s
– Task Tracker => Job Tracker, heartbeat + progress for all local tasks
every 5s
– If a task has no progress report for too long, it will be considered
failed and re-started
26. Hadoop - Map/Reduce (4)
• Some extras
– Counters
• Gather stats about a task
• Globally aggregated (Task Runner => Task Tracker => Job Tracker)
• M/R counters: M/R input records, M/R output records
• Filesystem counters: bytes read/written
• Job counters: launched M/R tasks, failed M/R tasks, …
– Joins
• Copy the small set on each node and perform joins locally. Useful when
one dataset is very large, the other very small (e.g. “Scalable Distributed
Reasoning using MapReduce” from VUA)
• Map side join – data is joined before the Map function, very efficient but
less flexible (datasets must be partitioned & sorted in a particular way)
• Reduce side join – more general but less efficient (Map generates (K,V)
pairs using the join key)
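A plain-Python sketch of both join flavours above (illustrative datasets; not Hadoop's join API). The replicated "copy" variant keeps the small dataset in memory on every node; the reduce-side variant tags each record with its source and joins records sharing the same key in the reducer:

from collections import defaultdict

users  = [("u1", "alice"), ("u2", "bob")]                  # small dataset (join key, value)
orders = [("u1", "book"), ("u2", "lamp"), ("u1", "pen")]   # large dataset (join key, value)

# Replicated ("copy") join: the small set is copied to every node and kept in
# memory, so each record of the large set is joined locally in the map function.
small = dict(users)
copy_join = [(uid, (small[uid], item)) for uid, item in orders]

# Reduce-side join: the map phase emits (join key, (source tag, value)),
# the reduce phase combines all records that share the same join key.
def join_map(records, tag):
    for key, value in records:
        yield key, (tag, value)

grouped = defaultdict(list)
for key, tagged in list(join_map(users, "user")) + list(join_map(orders, "order")):
    grouped[key].append(tagged)

reduce_join = []
for key, values in grouped.items():
    names = [v for tag, v in values if tag == "user"]
    items = [v for tag, v in values if tag == "order"]
    reduce_join.extend((key, (name, item)) for name in names for item in items)

print(sorted(copy_join) == sorted(reduce_join))            # True: both produce the same join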
27. Hadoop - Map/Reduce (5)
• Built-in mappers and reducers
– Chain – run a chain/pipe of Maps (pattern M+RM*: one or more Maps, a Reduce, then zero or more Maps). The last Map output is the task output
– FieldSelection – select a list of fields from the input dataset to be
used as MR keys/values
– TokenCounterMapper, SumReducer – (remember the “word count”
example?)
– RegexMapper – matches a regex in the input key/value pairs
28. Cloud MapReduce
• Accenture (2010)
• http://code.google.com/p/cloudmapreduce/
• Map/Reduce implementation for AWS (EC2, S3, SimpleDB,
SQS)
– fast (reported as up to 60 times faster than Hadoop/EC2 in some
cases)
– scalable & robust (no single point of bottleneck or failure)
– simple (3 KLOC)
• Features
– No need for centralised coordinator (JobTracker), just put job status
in the cloud datastore (SimpleDB)
– All data transfer & communication is handled by the Cloud
– All I/O and storage is handled by the Cloud
29. Cloud MapReduce (2)
[figure – (C) Ricky Ho]
30. Cloud MapReduce (3)
• Job client workflow
1. Store input data (S3)
2. Create a Map task for each data split & put it into the Mapper
Queue (SQS)
3. Create multiple Partition Queues (SQS)
4. Create Reducer Queue (SQS) & put a Reduce task for each Partition
Queue
5. Create the Output Queue (SQS)
6. Create a Job Request (ref to all queues) and put it into SimpleDB
7. Start EC2 instances for Mappers & Reducers
8. Poll SimpleDB for job status
9. When the job is complete, download results from S3
31. Cloud MapReduce (4)
• Mapper workflow
1. Dequeue a Map task from the Mapper Queue
2. Fetch data from S3
3. Perform the user-defined map function, add the output (Km,Vm) pairs to a Partition Queue chosen by hash(Km) => several partition keys may share the same partition queue!
4. When done, remove the Map task from the Mapper Queue
• Reducer workflow
1. Dequeue a Reduce task from the Reducer Queue
2. Dequeue the (Km,Vm) pairs from the corresponding Partition Queue
=> several partitions may share the same queue!
3. Perform a user defined reduce function and add output pairs (Kr,Vr)
to the Output Queue
4. When done remove the Reduce task from the Reducer Queue
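A rough sketch of the Mapper loop above, with hypothetical in-memory stand-ins for the SQS queues and S3 objects (the real implementation talks to AWS; every name below is illustrative, not Cloud MapReduce's API):

from queue import Queue, Empty

# hypothetical in-memory stand-ins for SQS queues and S3 objects
mapper_queue = Queue()
partition_queues = [Queue() for _ in range(2)]        # several partition keys may share a queue
s3 = {"split-0": "the chicken and egg problem", "split-1": "kentucky fried chicken"}

for split_id in s3:
    mapper_queue.put(split_id)                        # one Map task per data split

def user_map(text):
    for word in text.split():
        yield word, 1

while True:
    try:
        task = mapper_queue.get_nowait()              # 1. dequeue a Map task
    except Empty:
        break
    data = s3[task]                                   # 2. fetch the split from "S3"
    for km, vm in user_map(data):                     # 3. user-defined map function
        partition_queues[hash(km) % len(partition_queues)].put((km, vm))
    mapper_queue.task_done()                          # 4. remove the task when done

print(sum(q.qsize() for q in partition_queues))       # 8 (Km, Vm) pairs emitted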
32. MR.Flow
• Web based M/R editor
– http://www.mr-flow.com
– Reusable M/R modules
– Execution & status monitoring (Hadoop clusters)
33. Contents
Part III
Some Map/Reduce
algorithms
34. General considerations
• Map execution order is not deterministic
• Map processing time cannot be predicted
• Reduce tasks cannot start before all Maps have finished
(dataset needs to be fully partitioned)
• Not suitable for continuous input streams
• There will be a spike in network utilisation after the Map /
before the Reduce phase
• Number & size of key/value pairs
– Object creation & serialisation overhead (Amdahl’s law!)
• Aggregate partial results when possible!
– Use Combiners
35. Graph algorithms
• Very suitable for M/R processing
– Data (graph node) locality
– “spreading activation” type of processing
– Some algorithms with sequential dependency not suitable for M/R
• Breadth-first search algorithms better than depth-first
• General Approach
– Graph represented by adjacency lists
– Map task – input: node + its adjacency list; perform some analysis
over the node link structure; output: target key + analysis result
– Reduce task – aggregate values by key
– Perform multiple iterations (with a termination criterion)
36. Social Network Analysis
• Problem: recommend new friends (friend-of-a-friend, FOAF)
• Map task
– U (target user) is fixed and its friends list copied to all cluster nodes
(“copy join”); each cluster node stores part of the social graph
– In: (X, <friends_X>), i.e. the local data for the cluster node
– Out:
• if (U, X) are friends => (U, <friends_X \ friends_U>), i.e. the users who are friends of X but not already friends of U
• nil otherwise
• Reduce task
– In: (U, <<friends_A \ friends_U>, <friends_B \ friends_U>, …>), i.e. the FOAF lists for all users A, B, etc. who are friends with U
– Out: (U, <(X_1, N_1), (X_2, N_2), …>), where each X is a FOAF for U and N is its total number of occurrences in all FOAF lists (sort/rank the result!)
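A minimal Python sketch of this FOAF recommendation for a fixed target user U, over an illustrative social graph (reusing the run_mapreduce helper from slide 7's sketch):

from collections import Counter

U = "u"
friends = {"u": {"a", "b"}, "a": {"u", "c", "d"}, "b": {"u", "c"}, "c": {"a", "b"}, "d": {"a"}}
friends_U = friends[U]

def foaf_map(x, friends_x):
    if x in friends_U:                        # only friends of U contribute candidates
        yield U, friends_x - friends_U - {U}  # friends of X that U does not have yet

def foaf_reduce(user, candidate_sets):
    counts = Counter()
    for s in candidate_sets:
        counts.update(s)                      # N = occurrences across all FOAF lists
    yield user, counts.most_common()          # ranked (candidate, N) list

print(run_mapreduce([list(friends.items())], foaf_map, foaf_reduce))
# [('u', [('c', 2), ('d', 1)])]  -> recommend c first, then d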
37. PageRank with M/R
[figure – (C) Jimmy Lin]
38. Text Indexing & Retrieval
• Indexing is very suitable for M/R
– Focus on scalability, not on latency & response time
– Batch oriented
• Map task
– emit (Term, (DocID, position))
• Reduce task
– Group pairs by Term and sort by DocID
39. Text Indexing & Retrieval (2)
[figure – (C) Jimmy Lin]
40. Text Indexing & Retrieval (3)
• Retrieval not suitable for M/R
– Focus on response time
– Startup of Mappers & Reducers is usually prohibitively expensive
• Katta
– http://katta.sourceforge.net/
– Distributed Lucene indexing with Hadoop (HDFS)
– Multicast querying & ranking
41. Useful links
• "MapReduce: Simplified Data Processing on Large Clusters"
• “Dryad: Distributed Data-Parallel Programs from Sequential
Building Blocks”
• “Cloud MapReduce Technical Report”
• Data-Intensive Text Processing with MapReduce
• Hadoop - The Definitive Guide
42. Q&A
Questions?