Large-scale data analysis lies at the core of modern enterprises
and scientific research. With the emergence of cloud
computing, the use of an analytical query processing infrastructure
(e.g., Amazon EC2) can be directly mapped
to monetary value. MapReduce has been a popular framework
in the context of cloud computing, designed to serve
long-running queries (jobs) that can be processed in batch
mode. Taking into account that different jobs often perform
similar work, there are many opportunities for sharing. In
principle, sharing similar work reduces the overall amount of
work, which can lead to reducing monetary charges incurred
while utilizing the processing infrastructure. In this paper
we propose a sharing framework tailored to MapReduce.
Our framework, MRShare, transforms a batch of queries
into a new batch that will be executed more efficiently, by
merging jobs into groups and evaluating each group as a
single query. Based on our cost model for MapReduce, we
define an optimization problem and we provide a solution
that derives the optimal grouping of queries. Experiments
with our prototype, built on top of Hadoop, demonstrate the
overall effectiveness of our approach and substantial savings.
1. MRShare: Sharing Across Multiple Queries in MapReduce. Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)
5. MRShare - a sharing framework for MapReduce. The MRShare framework: is inspired by sharing primitives from the relational domain; introduces a cost model for MapReduce jobs; searches for the optimal sharing strategies; does not change the MapReduce computational model.
6. Outline. Introduction; MapReduce recap; MRShare - sharing primitives in MapReduce; MRShare - cost-based approach to sharing; MRShare evaluation; Summary
10. Sharing primitives - sharing scans. Two queries over the same input table: SELECT COUNT(*) FROM user GROUP BY hometown; and SELECT AVG(age) FROM user GROUP BY hometown. [Figure: map and reduce pipelines for the two queries over the same input record (id1, student, Toronto); the maps emit (Toronto, 1) and (Toronto, 17) respectively, and the reducers aggregate per hometown (Toronto, Montreal, Ottawa).]
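The idea of sharing a scan can be sketched in a few lines of Python. This is only an illustration, not MRShare's Hadoop implementation: the sample rows, the function names (`merged_map`, `run_shared_scan`), and the job-index tagging scheme are assumptions made for the example. Each input record is read once, every job's map function is applied to it, and the output is tagged so that each job's reduce function sees only its own values.

```python
from collections import defaultdict

# Hypothetical rows of the `user` table: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "engineer", "Toronto", 19),
    ("id3", "student", "Ottawa", 23),
    ("id4", "teacher", "Montreal", 20),
]

def map_count(row):
    """Map of job 0: SELECT COUNT(*) FROM user GROUP BY hometown."""
    yield row[2], 1

def map_avg_age(row):
    """Map of job 1: SELECT AVG(age) FROM user GROUP BY hometown."""
    yield row[2], row[3]

def merged_map(row, mappers):
    """Apply every job's map to one record; tag each output with the job index."""
    for job, mapper in enumerate(mappers):
        for key, value in mapper(row):
            yield (job, key), value

def run_shared_scan(rows, mappers, reducers):
    """Simulate one merged job: a single pass over the input serves all jobs."""
    groups = defaultdict(list)
    for row in rows:  # the single shared scan
        for tagged_key, value in merged_map(row, mappers):
            groups[tagged_key].append(value)
    results = [dict() for _ in mappers]
    for (job, key), values in groups.items():
        results[job][key] = reducers[job](values)
    return results

counts, avg_ages = run_shared_scan(
    USERS,
    mappers=[map_count, map_avg_age],
    reducers=[sum, lambda ages: sum(ages) / len(ages)],
)
print(counts)
print(avg_ages)
```

Note that in this sketch only the read is shared: the merged job still emits one intermediate pair per job per record, so the intermediate data is as large as the two jobs' outputs combined.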
13. Outline. Introduction; MapReduce recap; MRShare - sharing primitives in MapReduce (sharing scans, sharing intermediate data); MRShare - cost-based approach to sharing; MRShare evaluation; Summary
14. Sharing primitives - sharing intermediate data. Two queries whose map outputs overlap: SELECT COUNT(*) FROM user WHERE occupation = 'student' GROUP BY hometown; and SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown. [Figure: map and reduce pipelines for the two queries over the same input record (id1, student, Toronto); one map tests age > 18, the other tests occupation = 'student', both emit (Toronto, 1), and the reducers count per hometown (Toronto, Ottawa, Montreal).]
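A minimal Python sketch of sharing intermediate data follows. It is an illustration under stated assumptions, not the MRShare implementation: the sample rows, the `JOB_FILTERS` list, and the use of a set of job indexes as the tag are all hypothetical. Because both queries emit the same pair (hometown, 1), the merged map emits that pair at most once per record and tags it with the jobs whose filter accepted the record; the merged reduce then updates one counter per tagged job.

```python
from collections import defaultdict

# Hypothetical rows of the `user` table: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "engineer", "Toronto", 19),
    ("id3", "student", "Ottawa", 23),
    ("id4", "teacher", "Montreal", 20),
]

# One filter per job (assumed encoding of the two WHERE clauses).
JOB_FILTERS = [
    lambda row: row[1] == "student",  # job 0: occupation = 'student'
    lambda row: row[3] > 18,          # job 1: age > 18
]

def merged_map(row):
    """Emit (hometown, 1) at most once, tagged with the jobs that accept the row."""
    tags = frozenset(j for j, keep in enumerate(JOB_FILTERS) if keep(row))
    if tags:
        yield row[2], (tags, 1)

def merged_reduce(values, n_jobs):
    """Add each tagged value to the counters of the jobs named in its tag."""
    totals = [0] * n_jobs
    for tags, count in values:
        for job in tags:
            totals[job] += count
    return totals

def run_shared_intermediate(rows):
    groups = defaultdict(list)
    emitted = 0  # number of intermediate pairs produced by the merged map
    for row in rows:
        for key, value in merged_map(row):
            groups[key].append(value)
            emitted += 1
    results = [dict() for _ in JOB_FILTERS]
    for key, values in groups.items():
        for job, total in enumerate(merged_reduce(values, len(JOB_FILTERS))):
            if total:  # a job reports only hometowns that passed its filter
                results[job][key] = total
    return results, emitted

per_job, emitted = run_shared_intermediate(USERS)
print(per_job)
print(emitted)
```

On these sample rows the two jobs run separately would emit five intermediate pairs in total (two for the first query, three for the second), while the merged map emits four, because the row that satisfies both filters is emitted once.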
17. Outline. Introduction; MapReduce recap; MRShare - sharing primitives in MapReduce; MRShare - cost-based approach to sharing (cost model for finding the optimal sharing strategy; SplitJobs, a cost-based algorithm for sharing scans; MultiSplitJobs, an improvement of SplitJobs; γ-MultiSplitJobs, the algorithm for sharing intermediate data); MRShare evaluation; Summary
18. Cost model for MapReduce (single job). Four phases: reading the input, sorting the intermediate data, copying, and writing the output. Reading: f(input size). Sorting: f(intermediate data size). Copying: f(intermediate data size). Writing: f(output size).
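Written as a formula, the four phases above add up to a single job cost. The symbols are assumptions introduced here for illustration and do not come from the slides: $|I|$, $|D|$, and $|O|$ stand for the input, intermediate, and output sizes of job $J$, and $f_r$, $f_s$, $f_c$, $f_w$ for the unspecified per-phase cost functions.

```latex
T(J) \;=\;
  \underbrace{f_r(|I|)}_{\text{reading}}
  \;+\; \underbrace{f_s(|D|)}_{\text{sorting}}
  \;+\; \underbrace{f_c(|D|)}_{\text{copying}}
  \;+\; \underbrace{f_w(|O|)}_{\text{writing}}
```

Only the first term depends on the input size, which is why sharing a scan across jobs targets the reading cost, while the sorting and copying terms depend on how much intermediate data the merged job produces.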
19. Cost of executing a group of jobs. [Figure: Read, Sort, Copy, and Write cost bars for jobs J1, J2, and J3 executed separately, compared with the merged job J1+J2+J3; regions are annotated as savings, potential savings, and potential costs.]
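The trade-off between running jobs separately and running them as one group can be sketched numerically. Everything concrete below is an assumption for illustration only: the coefficients, the linear read, copy, and write costs, the n log n sort cost, and the rule that a group sharing only its scan reads the input once and produces the sum of its members' intermediate and output data. None of these numbers or functional forms are taken from the slides.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    input_size: float         # all jobs in a group scan the same input
    intermediate_size: float  # size of the map output
    output_size: float        # size of the reduce output

# Hypothetical per-unit cost coefficients (not measured values).
C_READ, C_SORT, C_COPY, C_WRITE = 1.0, 0.5, 0.8, 1.0

def job_cost(input_size, intermediate_size, output_size):
    """Read + sort + copy + write; the n*log2(n) sort term is an assumption."""
    sort = C_SORT * intermediate_size * math.log2(max(intermediate_size, 2.0))
    return (C_READ * input_size + sort
            + C_COPY * intermediate_size + C_WRITE * output_size)

def cost_separate(jobs):
    """Every job pays for its own scan, sort, copy, and write."""
    return sum(job_cost(j.input_size, j.intermediate_size, j.output_size)
               for j in jobs)

def cost_grouped(jobs):
    """One shared scan; intermediate and output data are summed over the group."""
    return job_cost(
        jobs[0].input_size,
        sum(j.intermediate_size for j in jobs),
        sum(j.output_size for j in jobs),
    )

small_maps = [Job(100.0, 4.0, 1.0)] * 3     # scan-dominated jobs
large_maps = [Job(100.0, 1000.0, 1.0)] * 3  # sort-dominated jobs

for name, jobs in [("small map output", small_maps),
                   ("large map output", large_maps)]:
    saving = cost_separate(jobs) - cost_grouped(jobs)
    print(f"{name}: saving from grouping = {saving:.1f}")
```

Under these assumptions the scan-dominated group benefits from merging, while the sort-dominated group becomes more expensive, because the extra cost of sorting one large intermediate data set outweighs the two scans saved. This is the kind of comparison a cost-based grouping decision has to make.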
27. Sharing intermediate data - cost-based optimization. [Figure: Read, Sort, and Copy cost bars for jobs J1, J2, and J3 executed separately, compared with the merged job J1+J2+J3; regions are annotated as savings, potential savings, and potential costs or savings.] The sorting and copying costs change depending on the size of the intermediate data. We need to estimate the size of the intermediate data of all combinations of jobs, and the cost of maintaining such statistics is prohibitive.
30. Evaluation setup. 40 EC2 small-instance virtual machines; modified Hadoop engine; 30 GB text dataset consisting of blogs; multiple grep-wordcount queries, each of which counts words matching a regular expression, allows for variable intermediate data sizes, and is a generic aggregation MapReduce job.
31. Evaluation goals. Sharing is not always beneficial: the 'GreedyShare' policy. How much can we save by sharing scans? MRShare MultiSplitJobs evaluation. How much can we save by sharing intermediate data? MRShare γ-MultiSplitJobs evaluation.
33. How much we save by sharing scans - MRShare MultiSplitJobs
34. How much we save by sharing intermediate data - MRShare γ-MultiSplitJobs
35. Summary. We introduced MRShare, a framework for automatic work sharing in MapReduce. We identified sharing primitives and demonstrated their implementation in a MapReduce engine. We established a cost model and solved several work-sharing optimization problems. We demonstrated substantial savings when using MRShare.