Mr Share 11 Sep 2010

MRShare: Sharing Across Multiple Queries in MapReduce. Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)

Data management landscape. MRShare – sharing framework for MR. (Figure labels: flexibility; time performance; σ, π; efficiency.)

MRShare – a sharing framework for Map Reduce. The MRShare framework: is inspired by sharing primitives from the relational domain; introduces a cost model for Map Reduce jobs; searches for the optimal sharing strategies; does not change the Map Reduce computational model.

Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce; MRShare – Cost-based approach to sharing; MRShare Evaluation; Summary

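Map Reduce recap. (Figure: input is read from HDFS by the Map tasks, intermediate data travels over the network to the Reduce tasks, and the output is written back to HDFS.)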

Sharing primitives – sharing scans. Two SQL queries over the same input: SELECT COUNT(*) FROM user GROUP BY hometown; SELECT AVG(age) FROM user GROUP BY hometown. (Figure: each query is run as its own Map Reduce job. Both map functions read the same records, e.g. "id1 student Toronto", and emit pairs keyed by hometown: (Toronto, 1) for the count and (Toronto, 17) for the age. The reducers then aggregate per hometown for Toronto, Montreal and Ottawa.)

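MRShare – sharing scans (map). (Figure: a meta-map reads the input once and passes it to Map 1, Map 2, Map 3 and Map 4, which together produce the map output.)

MRShare – sharing scans (reduce). (Figure: a meta-reduce wraps Reduce 1, Reduce 2, Reduce 3 and Reduce 4.)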
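The scan-sharing idea on the slides above can be sketched in plain Python. This is a minimal in-memory illustration, not the modified Hadoop engine: the record layout, the names `meta_map` and `run_shared`, and the choice to tag each key with a job id are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical records: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "student", "Toronto", 19),
    ("id3", "teacher", "Ottawa", 23),
]

def map_count(record):
    """Map of job 1: SELECT COUNT(*) FROM user GROUP BY hometown."""
    _, _, hometown, _ = record
    yield hometown, 1

def map_avg(record):
    """Map of job 2: SELECT AVG(age) FROM user GROUP BY hometown."""
    _, _, hometown, age = record
    yield hometown, age

def reduce_count(key, values):
    return key, sum(values)

def reduce_avg(key, values):
    return key, sum(values) / len(values)

# Job id -> (map function, reduce function).
JOBS = {1: (map_count, reduce_count), 2: (map_avg, reduce_avg)}

def meta_map(record):
    """Run every job's map on one record and tag the output with the job id."""
    for job_id, (map_fn, _) in JOBS.items():
        for key, value in map_fn(record):
            yield (job_id, key), value

def run_shared(records):
    """Scan the input once, group by (job id, key), dispatch to each reducer."""
    groups = defaultdict(list)
    for record in records:  # the single shared scan
        for tagged_key, value in meta_map(record):
            groups[tagged_key].append(value)
    results = defaultdict(dict)
    for (job_id, key), values in groups.items():  # meta-reduce
        _, reduce_fn = JOBS[job_id]
        out_key, out_value = reduce_fn(key, values)
        results[job_id][out_key] = out_value
    return {job_id: dict(out) for job_id, out in results.items()}
```

Calling `run_shared(USERS)` reads each record once and still returns a separate result per job, which is the point of sharing the scan.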

Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce (sharing scans, sharing intermediate data); MRShare – Cost-based approach to sharing; MRShare Evaluation; Summary

Sharing primitives – sharing intermediate data. Two SQL queries over the same input: SELECT COUNT(*) FROM user WHERE occupation = 'student' GROUP BY hometown; SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown. (Figure: each query is run as its own Map Reduce job. Both map functions read the same records, e.g. "id1 student Toronto", apply their own filter (occupation = 'student' or age > 18) and emit the same kind of pair, e.g. (Toronto, 1). The reducers then count per hometown for Toronto, Ottawa and Montreal.)

MRShare – sharing intermediate data (map). (Figure: a meta-map reads the input and passes it to Map 1, Map 2, Map 3 and Map 4, which together produce the map output.)

MRShare – sharing intermediate data (reduce). (Figure: a meta-reduce wraps Reduce 1, Reduce 2, Reduce 3 and Reduce 4.)
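Sharing intermediate data can be sketched the same way. In this illustrative version, a pair that several jobs would emit identically is written once and carries the set of job ids it belongs to. The tagging scheme, the record layout and all function names are assumptions for the example, not the engine's actual encoding.

```python
from collections import defaultdict

# Hypothetical records: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "student", "Toronto", 19),
    ("id3", "teacher", "Ottawa", 23),
    ("id4", "teacher", "Montreal", 16),
]

# Job id -> selection predicate of that query.
PREDICATES = {
    1: lambda occupation, age: occupation == "student",
    2: lambda occupation, age: age > 18,
}

def meta_map(record):
    """Emit (hometown, 1) once, tagged with every job whose filter matches."""
    _, occupation, hometown, age = record
    tags = frozenset(
        job_id for job_id, pred in PREDICATES.items() if pred(occupation, age)
    )
    if tags:
        yield hometown, (tags, 1)

def meta_reduce(key, tagged_values):
    """Add each value to the count of every job listed in its tag."""
    counts = defaultdict(int)
    for tags, value in tagged_values:
        for job_id in tags:
            counts[job_id] += value
    return dict(counts)

def run_shared(records):
    """Return per-job results and the number of intermediate pairs emitted."""
    groups = defaultdict(list)
    emitted = 0
    for record in records:
        for key, tagged_value in meta_map(record):
            groups[key].append(tagged_value)
            emitted += 1
    results = {job_id: {} for job_id in PREDICATES}
    for key, tagged_values in groups.items():
        for job_id, count in meta_reduce(key, tagged_values).items():
            results[job_id][key] = count
    return results, emitted
```

On this toy input the two jobs run separately would emit four pairs in total (two each); the shared version emits three, because the record that satisfies both filters is written only once.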

Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce; MRShare – Cost-based approach to sharing (cost model for finding the optimal sharing strategy; SplitJobs – cost-based algorithm for sharing scans; MultiSplitJobs – an improvement of SplitJobs; γ-MultiSplitJobs – the algorithm for sharing intermediate data); MRShare Evaluation; Summary

Cost model for Map Reduce (single job). The cost of a job has four components: reading the input, sorting the intermediate data, copying the intermediate data, and writing the output. Reading = f(input size); Sorting = f(intermediate data size); Copying = f(intermediate data size); Writing = f(output size).
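A minimal sketch of such a four-component cost model follows. The slide only states which size each component depends on; the linear form of each f, the unit costs and the class names `JobSizes` and `UnitCosts` are assumptions made purely for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JobSizes:
    """Data volumes of one job, in arbitrary but consistent units."""
    input_size: float
    intermediate_size: float
    output_size: float

@dataclass(frozen=True)
class UnitCosts:
    """Hypothetical cost per unit of data for each phase."""
    read: float = 1.0
    sort: float = 1.0
    copy: float = 1.0
    write: float = 1.0

def job_cost(job, unit=UnitCosts()):
    """Cost of a single job, assuming every component is linear in its size."""
    parts = {
        "read": unit.read * job.input_size,          # f(input size)
        "sort": unit.sort * job.intermediate_size,   # f(intermediate data size)
        "copy": unit.copy * job.intermediate_size,   # f(intermediate data size)
        "write": unit.write * job.output_size,       # f(output size)
    }
    parts["total"] = sum(parts.values())
    return parts
```

For example, `job_cost(JobSizes(100, 20, 5))` splits the total cost into its read, sort, copy and write parts, which makes it easy to see which component a sharing strategy would affect.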

Cost of executing a group of jobs. (Figure: jobs J1, J2 and J3 each pay Read, Sort, Copy and Write; the merged job J1+J2+J3 pays Read, Sort, Copy and Write once, with the components marked as savings, potential savings or potential costs.)

Finding the optimal sharing strategy. (Figure: jobs J1 to J5 grouped in different ways, ranging from “NoShare”, where every job runs on its own, to “GreedyShare”, where all jobs are merged into one group.)

Sharing scans – cost-based optimization. Savings come from the reduced number of scans. The sorting cost might change. The costs of copying and writing the output do not change. (Figure: Read and Sort for J1, J2 and J3 compared with Read and Sort for the merged job J1+J2+J3, with savings and potential costs marked.)
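The trade-off can be illustrated numerically. The sketch below assumes a simple external merge sort whose cost is the data size times the number of passes; this sort model, the `buffer_size` and `fan_in` parameters and all sizes are hypothetical and are not the cost formulas used by MRShare. Copying and writing are left out because they do not change.

```python
import math

def sort_passes(size, buffer_size, fan_in):
    """Passes of a hypothetical external merge sort over `size` units."""
    runs = max(1, math.ceil(size / buffer_size))  # initial sorted runs
    passes = 1
    while runs > 1:  # merge `fan_in` runs at a time
        runs = math.ceil(runs / fan_in)
        passes += 1
    return passes

def sort_cost(size, buffer_size=10, fan_in=10):
    """Every pass touches all the data once."""
    return size * sort_passes(size, buffer_size, fan_in)

def separate_cost(input_size, intermediate_sizes):
    """Each job scans the input and sorts its own intermediate data."""
    return sum(input_size + sort_cost(d) for d in intermediate_sizes)

def shared_scan_cost(input_size, intermediate_sizes):
    """One scan, but the merged job sorts all intermediate data together."""
    return input_size + sort_cost(sum(intermediate_sizes))

def savings(input_size, intermediate_sizes):
    """Positive when sharing the scan is cheaper than running separately."""
    return separate_cost(input_size, intermediate_sizes) - shared_scan_cost(
        input_size, intermediate_sizes
    )
```

With a large input and small intermediate data, e.g. `savings(1000, [10, 10, 10])`, the saved scans dominate. With a small input and large intermediate data, e.g. `savings(100, [1000, 1000])`, the extra sort pass of the merged job outweighs the saved scan and the result is negative.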

SplitJobs – a dynamic programming (DP) solution for sharing scans. By approximating the cost of sorting, we reduce the problem of grouping to the problem of splitting a sorted list of jobs. (Figure: SplitJobs splits the sorted list J1 to J6 into groups G1, G2 and G3.)
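Splitting a sorted list into contiguous groups is a classic dynamic program, sketched below in generic form. It is not the authors' exact formulation: the function `split_jobs`, the representation of a job by a single number and the toy cost function are assumptions for this example.

```python
def split_jobs(jobs, group_cost):
    """Split a sorted list into contiguous groups of minimum total cost.

    best[i] is the cheapest cost of splitting the first i jobs;
    cut[i] is where the last group of that optimal split starts.
    """
    n = len(jobs)
    best = [0.0] + [float("inf")] * n
    cut = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            cost = best[start] + group_cost(jobs[start:end])
            if cost < best[end]:
                best[end] = cost
                cut[end] = start
    # Walk the cut points backwards to recover the groups.
    groups = []
    end = n
    while end > 0:
        start = cut[end]
        groups.append(jobs[start:end])
        end = start
    groups.reverse()
    return best[n], groups

def toy_group_cost(group):
    """Hypothetical cost: one scan (50) plus a sort that gets three times
    more expensive once the group's intermediate data exceeds 100 units."""
    total = sum(group)
    return 50 + total * (1 if total <= 100 else 3)
```

With jobs described by their intermediate data sizes `[10, 20, 30, 90]`, `split_jobs` groups the three small jobs together and leaves the large one alone, because merging everything would push the sort over the threshold.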

MultiSplitJobs – an improvement of SplitJobs. (Figure: SplitJobs is applied repeatedly to the jobs J1 to J8, producing the groups G1, G2, G3 and G4.)

Sharing intermediate data – cost-based optimization. The sorting and copying costs change, depending on the size of the intermediate data. We need to estimate the size of the intermediate data of all combinations of jobs, and the cost of maintaining such statistics is prohibitive. (Figure: Read, Sort and Copy for J1, J2 and J3 compared with the merged job J1+J2+J3, with savings, potential savings, and potential costs or savings marked; the intermediate data of J1, J2 and J3 is drawn as overlapping sets.)

γ-MultiSplitJobs – the solution for sharing intermediate data. Approximate the size of the intermediate data. (Figure: the intermediate data of the merged job J1+J2+J3 is approximated from the intermediate data of J1, J2 and J3 using a factor γ.)

γ is set heuristically.

Evaluation setup. 40 EC2 small-instance virtual machines. Modified Hadoop engine. 30 GB text dataset consisting of blogs. Multiple grep-wordcount queries: each counts words matching a regular expression, allows for variable intermediate data sizes, and is a generic aggregation Map Reduce job.
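A grep-wordcount query of this kind can be sketched as follows. This is an in-memory illustration, not the Hadoop job used in the evaluation; the function names and the choice to match whole whitespace-separated words with `fullmatch` are assumptions.

```python
import re
from collections import defaultdict

def grep_wordcount_map(line, pattern):
    """Emit (word, 1) for every word of the line that matches the pattern."""
    for word in line.split():
        if pattern.fullmatch(word):
            yield word, 1

def grep_wordcount_reduce(word, counts):
    """Sum the occurrences of one word."""
    return word, sum(counts)

def run_grep_wordcount(lines, regex):
    """Run map, group by word, then reduce, all in memory."""
    pattern = re.compile(regex)
    groups = defaultdict(list)
    for line in lines:
        for word, one in grep_wordcount_map(line, pattern):
            groups[word].append(one)
    return dict(
        grep_wordcount_reduce(word, counts) for word, counts in groups.items()
    )
```

A more selective expression lets fewer words through the map phase, which is how a query of this shape can vary the size of its intermediate data.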

Evaluation goals. Sharing is not always beneficial: the ‘GreedyShare’ policy. How much can we save on sharing scans? MRShare MultiSplitJobs evaluation. How much can we save on sharing intermediate data? MRShare γ-MultiSplitJobs evaluation.

Is sharing always beneficial? The ‘GreedyShare’ policy

How much do we save on sharing scans? MRShare MultiSplitJobs

How much do we save on sharing intermediate data? MRShare γ-MultiSplitJobs

Summary. We introduced MRShare, a framework for automatic work sharing in Map Reduce. We identified sharing primitives and demonstrated their implementation in a Map Reduce engine. We established a cost model and solved several work sharing optimization problems. We demonstrated vast savings when using MRShare.
