Ch4. MapReduce Algorithm Design

Chapter 4 of Data-Intensive Text Processing with MapReduce introduces two MapReduce algorithm design patterns, pairs and stripes. It shows how to use these two patterns to construct a term co-occurrence matrix and compares their running times. According to the experiments, the stripes algorithm is more efficient than the pairs algorithm.


  1. MapReduce Algorithm Design
     Web Intelligence and Data Mining Laboratory
     Presenter: Allen
     2011/4/26
  2. Outline
     • MapReduce Framework
     • Pairs Approach
     • Stripes Approach
     • Issues
  3. MapReduce Framework
     • Mappers are applied to all input key-value pairs and generate an arbitrary number of intermediate key-value pairs.
     • Combiners can be viewed as "mini-reducers" in the map phase.
     • Partitioners determine which reducer is responsible for a particular key.
     • Reducers are applied to all values associated with the same key.
  4. Managing Dependencies
     Mappers and reducers run in isolation; the programmer has no control over:
     • Where a mapper or reducer runs (i.e. on which node)
     • When a mapper or reducer begins or finishes
     • Which input key-value pairs are processed by a specific mapper
     • Which intermediate key-value pairs are processed by a specific reducer
     Tools for synchronization:
     • The ability to hold state in both mappers and reducers across multiple key-value pairs
     • The sorting function for keys
     • The partitioner
     • Cleverly constructed data structures
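To make the four roles concrete, here is a minimal Hadoop job-driver sketch showing where each component is registered. It is illustrative only: PairsCooccurrence.PairsMapper and PairsCooccurrence.SumReducer refer to the pairs sketch given after slide 8 below, and HashPartitioner is simply Hadoop's default partitioner made explicit.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    // Driver sketch: where mappers, combiners, partitioners, and reducers
    // plug into a Hadoop job (class names refer to the pairs sketch below).
    public class CooccurrenceDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cooccurrence");
        job.setJarByClass(CooccurrenceDriver.class);

        job.setMapperClass(PairsCooccurrence.PairsMapper.class);  // mappers: input pairs -> intermediate pairs
        job.setCombinerClass(PairsCooccurrence.SumReducer.class); // combiners: "mini-reducers" in the map phase
        job.setPartitionerClass(HashPartitioner.class);           // partitioners: key -> responsible reducer
        job.setReducerClass(PairsCooccurrence.SumReducer.class);  // reducers: all values with the same key

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }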
  5. Motivating Example
     Term co-occurrence matrix for a text collection:
     • M = an N × N matrix (N = vocabulary size)
     • Mij: number of times terms i and j co-occur in some context (for concreteness, let's say context = sentence)
     Why?
     • Distributional profiles as a way of measuring semantic distance
     • Semantic distance is useful for many language processing tasks
  6. MapReduce: Large Counting Problems
     The term co-occurrence matrix for a text collection is a specific instance of a large counting problem:
     • A large event space (number of terms)
     • A large number of observations (the collection itself)
     • Goal: keep track of interesting statistics about the events
     Basic idea:
     • Mappers generate partial counts
     • Reducers aggregate partial counts
     How do we aggregate partial counts efficiently?
  7. First Try: "Pairs"
     Each mapper takes a sentence:
     • Generate all co-occurring term pairs
     • For all pairs, emit (a, b) → count
     Reducers sum up the counts associated with these pairs.
     Use combiners!
  8. "Pairs" Algorithm
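The pseudo-code figure on this slide did not survive the transcript. As a substitute, here is a minimal Java sketch of the pairs approach against the Hadoop MapReduce API; the tab-separated Text encoding of the pair key and the whitespace tokenization are simplifying assumptions, not the book's exact code.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Pairs approach: one intermediate key-value pair per co-occurring term pair.
    public class PairsCooccurrence {

      public static class PairsMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text pair = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Assumption: one sentence per input line, whitespace-tokenized.
          String[] terms = value.toString().split("\\s+");
          for (int i = 0; i < terms.length; i++) {
            for (int j = 0; j < terms.length; j++) {
              if (i == j || terms[i].isEmpty() || terms[j].isEmpty()) continue;
              pair.set(terms[i] + "\t" + terms[j]); // key encodes (a, b)
              context.write(pair, ONE);             // emit (a, b) -> 1
            }
          }
        }
      }

      // Sums partial counts; the sum is associative and commutative,
      // so the same class also serves as the combiner ("Use combiners!").
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          result.set(sum);
          context.write(key, result); // total co-occurrence count for (a, b)
        }
      }
    }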
  9. "Pairs" Analysis
     Advantages:
     • Easy to implement, easy to understand
     Disadvantages:
     • Lots of pairs to sort and shuffle around (upper bound?)
  10. Another Try: "Stripes"
      Idea: group pairs together into an associative array:
        (a, b) → 1
        (a, c) → 2
        (a, d) → 5    ⇒    a → {b:1, c:2, d:5, e:3, f:2}
        (a, e) → 3
        (a, f) → 2
      Each mapper takes a sentence:
      • Generate all co-occurring term pairs
      • For each term a, emit a → {b: count_b, c: count_c, d: count_d, …}
      Reducers perform an element-wise sum of the associative arrays:
          a → {b:1,       d:5, e:3}
        + a → {b:1, c:2, d:2,       f:2}
        = a → {b:2, c:2, d:7, e:3, f:2}
  11. "Stripes" Algorithm
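The pseudo-code figure is likewise missing here; below is a matching Java sketch of the stripes approach. Using Hadoop's MapWritable as the associative array is an assumption made for illustration, and the reducer can again double as a combiner because the element-wise sum is associative and commutative.

    import java.io.IOException;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Stripes approach: one associative array (stripe) per term.
    public class StripesCooccurrence {

      public static class StripesMapper
          extends Mapper<LongWritable, Text, Text, MapWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] terms = value.toString().split("\\s+");
          for (int i = 0; i < terms.length; i++) {
            if (terms[i].isEmpty()) continue;
            MapWritable stripe = new MapWritable(); // a -> {b: count_b, ...}
            for (int j = 0; j < terms.length; j++) {
              if (i == j || terms[j].isEmpty()) continue;
              Text b = new Text(terms[j]);
              IntWritable old = (IntWritable) stripe.get(b);
              stripe.put(b, new IntWritable(old == null ? 1 : old.get() + 1));
            }
            context.write(new Text(terms[i]), stripe); // emit a -> stripe
          }
        }
      }

      // Element-wise sum of stripes (Text keys compare by content, so
      // lookups across stripes work); also usable as the combiner.
      public static class StripesReducer
          extends Reducer<Text, MapWritable, Text, MapWritable> {
        @Override
        protected void reduce(Text key, Iterable<MapWritable> values,
            Context context) throws IOException, InterruptedException {
          MapWritable sum = new MapWritable();
          for (MapWritable stripe : values) {
            for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
              IntWritable old = (IntWritable) sum.get(e.getKey());
              int add = ((IntWritable) e.getValue()).get();
              sum.put(e.getKey(),
                  new IntWritable(old == null ? add : old.get() + add));
            }
          }
          context.write(key, sum); // e.g. a -> {b:2, c:2, d:7, e:3, f:2}
        }
      }
    }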
  12. "Stripes" Analysis
      Advantages:
      • Far less sorting and shuffling of key-value pairs
      • Can make better use of combiners
      Disadvantages:
      • More difficult to implement
      • The underlying object is more heavyweight
      • Fundamental limitation in terms of the size of the event space (each stripe must fit in memory)
  13. Running Time of "Pairs" and "Stripes"
      [Graph of running times not included in the transcript; per the summary above, stripes outperformed pairs in the experiments.]
  14. Conditional Probabilities
      How do we estimate conditional probabilities from counts?
          P(B|A) = count(A, B) / count(A) = count(A, B) / Σ_B′ count(A, B′)
      Why do we want to do this?
      How do we do this with MapReduce?
  15. P(B|A): "Stripes"
      a → {b1:3, b2:12, b3:7, b4:1, …}
      Easy!
      • One pass to compute the marginal count (a, *)
      • Another pass to directly compute P(B|A)
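As a sketch of why the stripes side is "easy": once the final stripe for a is in hand, the marginal count(a) is just the sum of the stripe's values, so both passes collapse into straightforward arithmetic. Plain Java for illustration, using the toy counts from the slide:

    import java.util.HashMap;
    import java.util.Map;

    public class ConditionalFromStripe {
      // P(b | a) = count(a, b) / count(a), where count(a) is the stripe's sum.
      static Map<String, Double> conditionals(Map<String, Integer> stripe) {
        long marginal = 0; // count(a) = the (a, *) total
        for (int c : stripe.values()) marginal += c;
        Map<String, Double> p = new HashMap<>();
        for (Map.Entry<String, Integer> e : stripe.entrySet()) {
          p.put(e.getKey(), e.getValue() / (double) marginal);
        }
        return p;
      }

      public static void main(String[] args) {
        // Only the four counts shown on the slide; the "..." is omitted.
        Map<String, Integer> stripe = Map.of("b1", 3, "b2", 12, "b3", 7, "b4", 1);
        System.out.println(conditionals(stripe)); // e.g. b2 -> 12/23 ≈ 0.52
      }
    }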
  16. P(B|A): "Pairs"
      (a, *)  → 32     Reducer holds this value in memory
      (a, b1) → 3      (a, b1) → 3/32
      (a, b2) → 12     (a, b2) → 12/32
      (a, b3) → 7      (a, b3) → 7/32
      (a, b4) → 1      (a, b4) → 1/32
      …                …
      For this to work:
      • Must emit an extra (a, *) for every bn in the mapper
      • Must make sure all a's get sent to the same reducer (use a partitioner)
      • Must make sure (a, *) comes first (define the sort order)
      • Must hold state in the reducer across different key-value pairs
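A sketch of the machinery this slide calls for, under two stated assumptions: pair keys are encoded as the Text "a\tb" with the special key "a\t*", and terms are alphanumeric, so '*' (0x2A) already sorts before every real term under Text's default byte-wise ordering; in general the book's approach defines a custom sort comparator instead.

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;

    // Partition on the left term only, so every (a, .) key -- including the
    // special (a, *) -- is sent to the same reducer.
    public class LeftTermPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        String left = key.toString().split("\t", 2)[0];
        return (left.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

    // Holds count(a) across key-value pairs; correctness relies on (a, *)
    // arriving before every (a, b) within the reducer's sorted input.
    class RelativeFrequencyReducer
        extends Reducer<Text, IntWritable, Text, DoubleWritable> {
      private long marginal = 0; // count(a) from the most recent (a, *) key

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context context) throws IOException, InterruptedException {
        long sum = 0;
        for (IntWritable v : values) sum += v.get();
        if (key.toString().endsWith("\t*")) {
          marginal = sum;  // remember count(a) for the (a, b) keys that follow
        } else {
          // Emit the relative frequency count(a, b) / count(a).
          context.write(key, new DoubleWritable(sum / (double) marginal));
        }
      }
    }

On the mapper side, this corresponds to emitting one extra (a, *) → 1 alongside each (a, b) → 1, as the first bullet requires.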
  17. Synchronization in Hadoop
      Approach 1: turn synchronization into an ordering problem.
      • Sort keys into the correct order of computation
      • Partition the key space so that each reducer gets the appropriate set of partial results
      • Hold state in the reducer across multiple key-value pairs to perform the computation
      Illustrated by the "pairs" approach.
  18. Synchronization in Hadoop
      Approach 2: construct data structures that "bring the pieces together."
      • Each reducer receives all the data it needs to complete the computation
      Illustrated by the "stripes" approach.
  19. Issues
      Number of key-value pairs:
      • Object-creation overhead
      • Time spent sorting and shuffling pairs across the network
      Size of each key-value pair:
      • De/serialization overhead
      Combiners make a big difference!
      • RAM vs. disk vs. network
      • Arrange the data to maximize opportunities to aggregate partial results
  20. Thank you!
