Mr Share 11 Sep 2010

Large-scale data analysis lies at the core of modern enterprises
and scientific research. With the emergence of cloud
computing, the use of an analytical query processing infrastructure
(e.g., Amazon EC2) can be directly mapped
to monetary value. MapReduce has been a popular framework
in the context of cloud computing, designed to serve
long running queries (jobs) which can be processed in batch
mode. Taking into account that different jobs often perform
similar work, there are many opportunities for sharing. In
principle, sharing similar work reduces the overall amount of
work, which can lead to reducing monetary charges incurred
while utilizing the processing infrastructure. In this paper
we propose a sharing framework tailored to MapReduce.
Our framework, MRShare, transforms a batch of queries
into a new batch that will be executed more efficiently, by
merging jobs into groups and evaluating each group as a
single query. Based on our cost model for MapReduce, we
define an optimization problem and we provide a solution
that derives the optimal grouping of queries. Experiments
in our prototype, built on top of Hadoop, demonstrate the
overall effectiveness of our approach and substantial savings.

Published in: Technology
  • Speaker note: discuss the different ways of arranging jobs and the question of which arrangement is optimal.

    1. 1. MRShare: Sharing Across Multiple Queries in MapReduce
        Tomasz Nykiel (University of Toronto)
        Michalis Potamias (Boston University)
        Chaitanya Mishra (University of Toronto, currently Facebook)
        George Kollios (Boston University)
        Nick Koudas (University of Toronto)
    2. 2. Data management landscape
        [Diagram: a chart with axes labeled "flexibility" and "efficiency", carrying the labels "σπ" and "MRShare – sharing framework for MR"]
        • Arbitrary data
        • Large scale setups
        • Time performance
    5. 5. MRShare – a sharing framework for Map Reduce
        MRShare framework:
        • Inspired by sharing primitives from relational domain
        • Introduces a cost model for Map Reduce jobs
        • Searches for the optimal sharing strategies
        • Does not change the Map Reduce computational model
    6. 6. Outline
        • Introduction
        • Map Reduce recap.
        • MRShare – Sharing primitives in Map-Reduce
        • MRShare – Cost based approach to sharing
        • MRShare Evaluation
        • Summary
    8. 8. Map Reduce recap.
        [Diagram: input splits (I) are read from HDFS by Map tasks, sent over the network to Reduce tasks, and the output is written back to HDFS]
    10. 10. Sharing primitives – sharing scans
        SQL:
            SELECT COUNT(*) FROM user GROUP BY hometown
            SELECT AVG(age) FROM user GROUP BY hometown
        [Diagram: each query runs as a separate Map Reduce job over the same input record (id1, student, Toronto). The two maps emit (Toronto, 1) and (Toronto, 17) respectively; each job's reduce then aggregates the values per hometown (Toronto, Montreal, Ottawa).]
    11. 11. MRShare – sharing scans (map).
        [Diagram: the input feeds a single meta-map that wraps Map 1, Map 2, Map 3 and Map 4 and produces one map output]
    12. 12. MRShare – sharing scans (reduce)
        [Diagram: a single meta-reduce wraps Reduce 1, Reduce 2, Reduce 3 and Reduce 4]
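The meta-map and meta-reduce idea from the sharing-scans slides can be sketched in a few lines of Python. This is an illustration only: the row layout, the sample rows, the function names (`meta_map`, `meta_reduce`, `run_shared`) and the tagging scheme are assumptions made for the example, not MRShare's actual Hadoop implementation.

```python
from collections import defaultdict

# Hypothetical rows of the `user` table: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "student", "Toronto", 19),
    ("id3", "engineer", "Ottawa", 23),
]


def map_count(row):
    """Map of job 0: SELECT COUNT(*) FROM user GROUP BY hometown."""
    yield row[2], 1


def map_avg_age(row):
    """Map of job 1: SELECT AVG(age) FROM user GROUP BY hometown."""
    yield row[2], row[3]


def meta_map(row, mappers):
    """Run every job's map on the same row, tagging each pair with its job index."""
    for tag, mapper in enumerate(mappers):
        for key, value in mapper(row):
            yield (tag, key), value


def meta_reduce(pairs, reducers):
    """Group by (tag, key) and apply the reducer that belongs to the tagged job."""
    groups = defaultdict(list)
    for tagged_key, value in pairs:
        groups[tagged_key].append(value)
    return {
        (tag, key): reducers[tag](values)
        for (tag, key), values in groups.items()
    }


def run_shared(rows, mappers, reducers):
    """Scan the input once and evaluate all jobs on it."""
    pairs = [pair for row in rows for pair in meta_map(row, mappers)]
    return meta_reduce(pairs, reducers)


def average(values):
    return sum(values) / len(values)


result = run_shared(USERS, [map_count, map_avg_age], [sum, average])
```

Here `result` maps `(job index, hometown)` to that job's aggregate, so both queries are answered from a single pass over `USERS`. The map output is not reduced by this kind of sharing; only the scan is shared.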
    13. 13. Outline
        • Introduction
        • Map Reduce recap.
        • MRShare - Sharing primitives in Map-Reduce
            • Sharing scans
            • Sharing intermediate data
        • MRShare – Cost based approach to sharing
        • MRShare Evaluation
        • Summary
    14. 14. Sharing primitives - Sharing intermediate data.
        SQL:
            SELECT COUNT(*) FROM user WHERE occupation='student' GROUP BY hometown
            SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown
        [Diagram: each query runs as a separate Map Reduce job over the same input record (id1, student, Toronto). The maps apply the filters "age > 18?" and "occupation = 'student'?" and both emit (Toronto, 1); each job's reduce then sums the counts per hometown (Toronto, Ottawa, Montreal).]
    15. 15. MRShare – sharing intermediate data (map).
        [Diagram: the input feeds a single meta-map that wraps Map 1, Map 2, Map 3 and Map 4 and produces one map output]
    16. 16. MRShare – sharing intermediate data (reduce).
        [Diagram: a single meta-reduce wraps Reduce 1, Reduce 2, Reduce 3 and Reduce 4]
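When two jobs emit identical map output pairs, as the two filtered COUNT(*) queries do, the pair can be emitted once and labeled with the jobs it belongs to. The sketch below shows that idea under stated assumptions: the sample rows, the use of a `frozenset` of job indices as the tag, and the names `shared_map` and `shared_reduce` are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical rows of the `user` table: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "student", "Toronto", 19),
    ("id3", "engineer", "Ottawa", 23),
]

# One predicate per job, matching the WHERE clauses of the two queries.
PREDICATES = [
    lambda row: row[1] == "student",  # job 0: occupation = 'student'
    lambda row: row[3] > 18,          # job 1: age > 18
]


def shared_map(row):
    """Emit (hometown, 1) at most once, tagged with every job whose filter passes."""
    jobs = frozenset(i for i, pred in enumerate(PREDICATES) if pred(row))
    if jobs:
        yield row[2], (jobs, 1)


def shared_reduce(pairs):
    """Credit each shared pair to all the jobs named in its tag."""
    counts = defaultdict(int)
    for hometown, (jobs, value) in pairs:
        for job in jobs:
            counts[(job, hometown)] += value
    return dict(counts)


pairs = [pair for row in USERS for pair in shared_map(row)]
counts = shared_reduce(pairs)
```

With these rows the shared map emits three pairs, whereas running the two jobs separately would emit four, because the row for `id2` satisfies both filters. That reduction in intermediate data is the saving this primitive targets.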
    17. 17. Outline
        • Introduction
        • Map Reduce recap.
        • MRShare – Sharing primitives in Map-Reduce
        • MRShare – Cost based approach to sharing
            • Cost model for finding the optimal sharing strategy
            • SplitJobs – cost based algorithm for sharing scans
            • MultiSplitJobs – an improvement of SplitJobs
            • γ-MultiSplitJobs – the algorithm for sharing intermediate data
        • MRShare Evaluation
        • Summary
    18. 18. Cost model for Map Reduce (single job)
        [Diagram: the phases of a job in order: reading input, sorting intermediate data, copying, writing output]
        • Reading: f(input size)
        • Sorting: f(intermediate data size)
        • Copying: f(intermediate data size)
        • Writing: f(output size)
    19. 19. Cost of executing a group of jobs
        [Diagram: jobs J1, J2 and J3 are each drawn as Read, Sort, Copy, Write; the merged job J1+J2+J3 is drawn with the same four phases, annotated "Potential costs", "Potential savings" and "Savings"]
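A toy version of such a cost model helps show why merging jobs can either save or cost money. The slides only state that each phase cost is a function of a data size; everything else here is an assumption made for illustration: the unit coefficients, the `SORT_BUFFER` and `MERGE_FACTOR` parameters, the pass-counting form of the sort cost, and the rule that a group sharing a scan reads the input once while its intermediate and output sizes add up.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    input_size: float
    intermediate_size: float
    output_size: float


# Hypothetical per-unit coefficients and sort parameters (not from the slides).
C_READ = C_SORT = C_COPY = C_WRITE = 1.0
SORT_BUFFER = 100.0
MERGE_FACTOR = 10.0


def sort_passes(size):
    """Assumed number of sort passes; grows once data exceeds the buffer."""
    if size <= SORT_BUFFER:
        return 1
    return max(1, math.ceil(math.log(size / SORT_BUFFER) / math.log(MERGE_FACTOR)))


def job_cost(job):
    """Read + sort + copy + write, each a function of the relevant data size."""
    return (
        C_READ * job.input_size
        + C_SORT * job.intermediate_size * sort_passes(job.intermediate_size)
        + C_COPY * job.intermediate_size
        + C_WRITE * job.output_size
    )


def no_share_cost(jobs):
    """Every job runs on its own."""
    return sum(job_cost(job) for job in jobs)


def merge_sharing_scan(jobs):
    """Assumed merge rule: one scan of the common input, sizes otherwise add up."""
    return Job(
        jobs[0].input_size,
        sum(job.intermediate_size for job in jobs),
        sum(job.output_size for job in jobs),
    )


def group_cost(jobs):
    """All jobs run as one merged job."""
    return job_cost(merge_sharing_scan(jobs))


big_input = [Job(1000.0, 500.0, 10.0)] * 3
small_input = [Job(100.0, 500.0, 10.0)] * 3
```

Under these assumptions, merging the `big_input` jobs is cheaper than running them separately because two scans are saved, while merging the `small_input` jobs is more expensive because the extra sort pass outweighs the saved scans.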
    20. 20. Finding the optimal sharing strategy
        • An optimization problem
        [Diagram: jobs J1 to J5 arranged into alternative groupings, including ones labeled "NoShare" and "GreedyShare"]
    22. 22. Sharing scans - cost based optimization
        [Diagram: J1, J2 and J3 are each drawn as Read, Sort; the merged job J1+J2+J3 is drawn as Read, Sort, annotated "Potential costs" and "Savings"]
        • Savings come from reduced number of scans
        • The sorting cost might change
        • The costs of copying and writing the output do not change
        • We prove NP-hardness of the problem of finding the optimal sharing strategy

        Sharing scans – approximating the cost of sorting
        [Diagram: jobs J1 to J5 drawn with area proportional to |intermediate data|]
        • Compute the exact sorting cost for (J1+J2+J3)
        • Approximate the sorting cost based on (J1+J2+J3)
    23. 23. SplitJobs – a DP solution for sharing scans.
        • We reduce the problem of grouping to the problem of splitting a sorted list of jobs – by approximating the cost of sorting.
        • Using our cost model and the approximation, we employ a DP algorithm to find the optimal split points.
        [Diagram: SplitJobs takes the sorted list J1 to J6 and splits it into groups G1, G2 and G3]
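Splitting a sorted list into contiguous groups at optimal split points is a standard dynamic program. The following sketch shows that general technique, not the authors' SplitJobs code: the function name `split_jobs`, the representation of a job as its intermediate data size, and the stand-in cost function `toy_group_cost` are assumptions, and the list is taken to be already sorted since the slide does not state the sort key.

```python
def split_jobs(jobs, group_cost):
    """Return (minimum total cost, groups) over all contiguous splits of `jobs`.

    best[end] holds the cheapest way to group the first `end` jobs;
    cut[end] remembers where the last group of that solution starts.
    """
    n = len(jobs)
    best = [0.0] + [float("inf")] * n
    cut = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            candidate = best[start] + group_cost(jobs[start:end])
            if candidate < best[end]:
                best[end], cut[end] = candidate, start

    # Walk the recorded split points backwards to recover the groups.
    groups, end = [], n
    while end > 0:
        groups.append(jobs[cut[end]:end])
        end = cut[end]
    groups.reverse()
    return best[n], groups


def toy_group_cost(sizes):
    """Hypothetical cost: a fixed scan cost plus a sort cost that triples
    once the group's intermediate data no longer fits an assumed buffer of 50."""
    total = sum(sizes)
    return 100 + total * (1 if total <= 50 else 3)


# Jobs represented by hypothetical intermediate data sizes, already sorted.
jobs = [10, 20, 20, 40, 45]
total_cost, groups = split_jobs(jobs, toy_group_cost)
```

For this toy input the three smallest jobs end up sharing one scan while the two largest run alone, since adding either of them to a group would push the group past the assumed buffer. The double loop makes the sketch quadratic in the number of jobs, ignoring the cost of evaluating `group_cost`.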
    25. 25. MultiSplitJobs – an improvement of SplitJobs
        [Diagram: jobs J1 to J8; three applications of SplitJobs produce the groups G1, G2, G3 and G4, and the overall procedure is labeled MultiSplitJobs]
    27. 27. Sharing intermediate data - cost based optimization
        [Diagram: J1, J2 and J3 are each drawn as Read, Sort, Copy; the merged job J1+J2+J3 is drawn as Read, Sort, Copy, annotated "Savings", "Potential savings" and "Potential costs or savings"; the intermediate data of J1, J2 and J3 is drawn as overlapping regions]
        • The sorting and copying costs change – depending on the size of the intermediate data
        • Prohibitive cost of maintaining statistics
        • We need to estimate the size of the intermediate data of all combinations of jobs.
    28. 28. γ-MultiSplitJobs – the solution for sharing intermediate data
        • Approximate the size of the intermediate data
        [Diagram: the intermediate data of the merged jobs J1, J2 and J3 is approximated as one term plus γ times another; which jobs enter each term is not recoverable from the transcript]
        • γ-MultiSplitJobs – applies MultiSplitJobs with modified cost function
        • γ set heuristically
    30. 30. Evaluation setup
        • 40 EC2 small instance virtual machines
        • Modified Hadoop engine
        • 30 GB text dataset consisting of blogs
        • Multiple grep-wordcount queries
            • Counts words matching a regular expression
            • Allows for variable intermediate data sizes
            • Generic aggregation Map Reduce job
    31. 31. Evaluation goals
        • Sharing is not always beneficial.
            • 'GreedyShare' policy
        • How much can we save on sharing scans?
            • MRShare - MultiSplitJobs evaluation
        • How much can we save on sharing intermediate data?
            • MRShare - γ-MultiSplitJobs evaluation
    32. 32. Is sharing always beneficial? - 'GreedyShare' policy
    33. 33. How much we save on sharing scans – MRShare MultiSplitJobs
    34. 34. How much we save on sharing intermediate data - MRShare - γ-MultiSplitJobs
    35. 35. Summary
        • We introduced MRShare – a framework for automatic work sharing in Map Reduce.
        • We identified sharing primitives and demonstrated the implementation thereof in a Map Reduce engine.
        • We established a cost model and solved several work sharing optimization problems.
        • We demonstrated vast savings when using MRShare.
    36. 36. Thank you!!!
        Questions?
    37. 37. Ongoing work – sharing expensive computation
        • Sharing across multiple Map Reduce jobs with expensive predicates.
        [Diagram: the input feeds a single meta-map that wraps Map 1, Map 2, Map 3 and Map 4]
    38. 38. Ongoing work – dynamic sharing
        • Dynamic sharing.
        [Diagram: progress plotted against time, with bars for J1, J2 and the merged J1+J2]
