Mr Share 11 Sep 2010

Large-scale data analysis lies at the core of modern enterprises
and scientific research. With the emergence of cloud
computing, the use of an analytical query processing infrastructure
(e.g., Amazon EC2) can be directly mapped
to monetary value. MapReduce has been a popular framework
in the context of cloud computing, designed to serve
long-running queries (jobs) which can be processed in batch
mode. Taking into account that different jobs often perform
similar work, there are many opportunities for sharing. In
principle, sharing similar work reduces the overall amount of
work, which can lead to reducing monetary charges incurred
while utilizing the processing infrastructure. In this paper
we propose a sharing framework tailored to MapReduce.
Our framework, MRShare, transforms a batch of queries
into a new batch that will be executed more efficiently, by
merging jobs into groups and evaluating each group as a
single query. Based on our cost model for MapReduce, we
define an optimization problem and we provide a solution
that derives the optimal grouping of queries. Experiments
in our prototype, built on top of Hadoop, demonstrate the
overall effectiveness of our approach and substantial savings.

  • Talk about different possibilities of arranging jobs, and the question which one is the optimal one.

Transcript

  • 1. MRShare: Sharing Across Multiple Queries in MapReduce
    Tomasz Nykiel (University of Toronto)
    Michalis Potamias (Boston University)
    Chaitanya Mishra (University of Toronto, currently Facebook)
    George Kollios (Boston University)
    Nick Koudas (University of Toronto)
  • 2. Data management landscape
    MRShare – sharing framework for MR
    • Arbitrary data
    • Large scale setups
    • Time performance
    [Figure: data management landscape with axes labelled flexibility and efficiency; relational operators (σ, π) marked.]
  • 5. MRShare – a sharing framework for Map Reduce
    MRShare framework:
    Inspired by sharing primitives from relational domain
    Introduces a cost model for Map Reduce jobs
    Searches for the optimal sharing strategies
    Does not change the Map Reduce computational model
  • 6. Outline
    Introduction
    Map Reduce recap.
    MRShare – Sharing primitives in Map-Reduce
    MRShare – Cost based approach to sharing
    MRShare Evaluation
    Summary
  • 8. Map Reduce recap.
    [Figure: Map Reduce data flow. Inputs (I) are read from HDFS by Map, map output crosses the network to Reduce, and the Output is written to HDFS.]
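As a reminder of the programming model, the sketch below simulates a single job in memory: map, group by key, reduce. It is an illustration only, not Hadoop code; `run_mapreduce`, the sample `users` tuples and the `age` column are assumptions made for the example.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce job in memory: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:                  # map phase: one call per input record
        for key, value in map_fn(record):
            groups[key].append(value)       # shuffle: group values by key
    # reduce phase: one call per distinct key, in sorted key order
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}

# Hypothetical input: (id, occupation, hometown, age)
users = [
    ("id1", "student", "Toronto", 17),
    ("id2", "student", "Toronto", 19),
    ("id3", "engineer", "Ottawa", 23),
]

def count_map(user):
    yield user[2], 1                        # emit (hometown, 1)

def count_reduce(hometown, ones):
    return sum(ones)

counts = run_mapreduce(users, count_map, count_reduce)
print(counts)  # {'Ottawa': 1, 'Toronto': 2}
```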
  • 9. Outline
    Introduction
    Map Reduce recap.
    MRShare - Sharing primitives in Map-Reduce
    MRShare – Cost based approach to sharing
    MRShare Evaluation
    Summary
  • 10. Sharing primitives – sharing scans
    SQL:
    SELECT COUNT(*) FROM user GROUP BY hometown
    SELECT AVG(age) FROM user GROUP BY hometown
    Map Reduce:
    [Figure: both jobs run their Map over the same input tuple (id1, student, Toronto); the first job emits (Toronto, 1) and the second emits (Toronto, 17). Each job's Reduce then aggregates its own (hometown, value) pairs for Toronto, Montreal and Ottawa.]
  • 11. MRShare – sharing scans (map).
    [Figure: a single Meta-map reads the Input and contains Map 1 to Map 4, which together produce the Map output.]
  • 12. MRShare – sharing scans (reduce)
    [Figure: a single Meta-reduce contains Reduce 1 to Reduce 4.]
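One way to picture the meta-map and meta-reduce wrappers is the in-memory sketch below. It reads each record once, runs every job's map function on it, and routes each key group to the reduce function of its job. The tagging scheme (a `(job_id, key)` pair), the function names and the sample data with an `age` column are assumptions for illustration; the slides do not show how MRShare implements this inside Hadoop.

```python
from collections import defaultdict

def meta_map(record, map_fns):
    """Apply every job's map function to one record read from the input."""
    for job_id, map_fn in map_fns.items():
        for key, value in map_fn(record):
            yield (job_id, key), value      # assumed scheme: tag the key with the job

def meta_reduce(tagged_key, values, reduce_fns):
    """Route a tagged key group to the reduce function of its job."""
    job_id, key = tagged_key
    return reduce_fns[job_id](key, values)

def run_shared_scan(records, map_fns, reduce_fns):
    groups = defaultdict(list)
    scans = 0
    for record in records:                  # each record is read exactly once
        scans += 1
        for tagged_key, value in meta_map(record, map_fns):
            groups[tagged_key].append(value)
    results = defaultdict(dict)
    for (job_id, key), values in groups.items():
        results[job_id][key] = meta_reduce((job_id, key), values, reduce_fns)
    return dict(results), scans

# Hypothetical input: (id, occupation, hometown, age)
users = [
    ("id1", "student", "Toronto", 17),
    ("id2", "student", "Toronto", 19),
    ("id3", "engineer", "Ottawa", 23),
]
map_fns = {
    "count": lambda u: [(u[2], 1)],         # SELECT COUNT(*) ... GROUP BY hometown
    "avg_age": lambda u: [(u[2], u[3])],    # SELECT AVG(age) ... GROUP BY hometown
}
reduce_fns = {
    "count": lambda key, vals: sum(vals),
    "avg_age": lambda key, vals: sum(vals) / len(vals),
}

results, scans = run_shared_scan(users, map_fns, reduce_fns)
print(results["count"])    # {'Toronto': 2, 'Ottawa': 1}
print(results["avg_age"])  # {'Toronto': 18.0, 'Ottawa': 23.0}
```

Run as two separate jobs, the same input would be read twice; here `scans` equals the number of records.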
  • 13. Outline
    Introduction
    Map Reduce recap.
    MRShare - Sharing primitives in Map-Reduce
    Sharing scans
    Sharing intermediate data
    MRShare – Cost based approach to sharing
    MRShare Evaluation
    Summary
  • 14. Sharing primitives - Sharing intermediate data.
    SQL:
    SELECT COUNT(*) FROM user WHERE occupation = 'student' GROUP BY hometown
    SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown
    Map Reduce:
    [Figure: both jobs run their Map over the same input tuple (id1, student, Toronto), one testing occupation = 'student' and the other testing age > 18; each emits (Toronto, 1). Each job's Reduce then sums its own (hometown, count) pairs for Toronto, Ottawa and Montreal.]
  • 15. MRShare – sharing intermediate data (map).
    [Figure: a single Meta-map reads the Input and contains Map 1 to Map 4, which together produce the Map output.]
  • 16. MRShare – sharing intermediate data (reduce).
    [Figure: a single Meta-reduce contains Reduce 1 to Reduce 4.]
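When several jobs would emit the same (key, value) pair for a record, the pair can be emitted once and labelled with the jobs it belongs to. The sketch below illustrates that idea for the two COUNT queries. The tag representation (a `frozenset` of job names), the job names and the sample data are assumptions, not the encoding used by MRShare.

```python
from collections import defaultdict

# One predicate per job; both jobs emit (hometown, 1) for a matching user.
predicates = {
    "students": lambda u: u["occupation"] == "student",
    "adults": lambda u: u["age"] > 18,
}

def meta_map(user):
    """Emit a single tagged pair for all jobs whose predicate matches."""
    tags = frozenset(job for job, pred in predicates.items() if pred(user))
    if tags:
        yield user["hometown"], (tags, 1)

def meta_reduce(hometown, tagged_values):
    """Add each value to the count of every job named in its tag."""
    counts = {job: 0 for job in predicates}
    for tags, value in tagged_values:
        for job in tags:
            counts[job] += value
    return counts   # note: unlike SQL GROUP BY, empty groups show up as 0

def run(users):
    groups = defaultdict(list)
    emitted = 0
    for user in users:
        for key, tagged in meta_map(user):
            groups[key].append(tagged)
            emitted += 1
    return {key: meta_reduce(key, vals) for key, vals in groups.items()}, emitted

users = [
    {"id": "id1", "occupation": "student", "hometown": "Toronto", "age": 17},
    {"id": "id2", "occupation": "student", "hometown": "Toronto", "age": 19},
    {"id": "id3", "occupation": "engineer", "hometown": "Ottawa", "age": 23},
    {"id": "id4", "occupation": "engineer", "hometown": "Ottawa", "age": 16},
]

results, emitted = run(users)
print(results["Toronto"])  # {'students': 2, 'adults': 1}
print(emitted)             # 3
```

Run separately, the two jobs would emit four pairs in total (two each); the shared version emits three, because `id2` satisfies both predicates and is written once.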
  • 17. Outline
    Introduction
    Map Reduce recap.
    MRShare – Sharing primitives in Map-Reduce
    MRShare – Cost based approach to sharing
    Cost model for finding the optimal sharing strategy
    SplitJobs – cost based algorithm for sharing scans
    MultiSplitJobs – an improvement of SplitJobs
    γ-MultiSplitJobs– the algorithm for sharing intermediate data
    MRShare Evaluation
    Summary
  • 18. Cost model for Map Reduce (single job)
    Reading input – f(input size)
    Sorting intermediate data – f(intermediate data size)
    Copying – f(intermediate data size)
    Writing output – f(output size)
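The four components can be combined into a single job cost. The sketch below assumes, purely for illustration, that each f is linear in its size argument; the coefficient values and all names are hypothetical, and the slide does not state the actual form of the functions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JobSizes:
    input_size: float          # data read by the map phase
    intermediate_size: float   # map output that is sorted and copied
    output_size: float         # data written by the reduce phase

# Hypothetical per-unit cost coefficients; real values would have to be measured.
READ_COST = 1.0
SORT_COST = 2.0
COPY_COST = 1.5
WRITE_COST = 1.0

def job_cost(job: JobSizes) -> float:
    """Total cost = reading + sorting + copying + writing (all assumed linear)."""
    return (READ_COST * job.input_size
            + SORT_COST * job.intermediate_size
            + COPY_COST * job.intermediate_size
            + WRITE_COST * job.output_size)

cost = job_cost(JobSizes(input_size=1000, intermediate_size=40, output_size=10))
print(cost)  # 1150.0
```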
  • 19. Cost of executing a group of jobs
    [Figure: Read, Sort, Copy and Write costs of jobs J1, J2 and J3 run separately compared with the merged job J1+J2+J3, with regions marked as savings, potential savings and potential costs.]
  • 20. Finding the optimal sharing strategy
    [Figure: alternative groupings of jobs J1 to J5, including “NoShare” and “GreedyShare”.]
    • An optimization problem
  • 21. Outline
    Introduction
    Map Reduce recap.
    MRShare – Sharing primitives in Map-Reduce
    MRShare – Cost based approach to sharing
    Cost model for finding the optimal sharing strategy
    SplitJobs – cost based algorithm for sharing scans
    MultiSplitJobs – an improvement of SplitJobs
    γ-MultiSplitJobs– the algorithm for sharing intermediate data
    MRShare Evaluation
    Summary
  • 22. Sharing scans - cost based optimization
    [Figure: Read and Sort costs of jobs J1, J2 and J3 run separately compared with the merged job J1+J2+J3, with regions marked as savings and potential costs.]
    Savings come from reduced number of scans
    The sorting cost might change
    The costs of copying and writing the output do not change
    • We prove NP-hardness of the problem of finding the optimal sharing strategy
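The trade-off in these bullets can be made concrete with a toy cost function. In the sketch below the input is read once per group, copying and writing cost the same either way, and the sort cost grows with the number of merge passes needed for the combined intermediate data. The buffer size, fan-in, unit costs and the pass-based sort model are all assumptions chosen for illustration, not the cost model of the paper.

```python
BUFFER_SIZE = 100   # hypothetical sort buffer, in arbitrary size units
FAN_IN = 10         # hypothetical number of sorted runs merged per pass

def merge_passes(intermediate_size):
    """Number of merge passes needed once the data is split into sorted runs."""
    runs = -(-intermediate_size // BUFFER_SIZE)   # ceiling division
    passes = 0
    while runs > 1:
        runs = -(-runs // FAN_IN)
        passes += 1
    return passes

def sort_cost(intermediate_size):
    return intermediate_size * (1 + merge_passes(intermediate_size))

def separate_cost(input_size, jobs):
    """jobs is a list of (intermediate_size, output_size); every job scans the input."""
    return sum(input_size + sort_cost(d) + d + o for d, o in jobs)

def shared_scan_cost(input_size, jobs):
    """One scan for the whole group; copy and write costs are unchanged."""
    total_d = sum(d for d, _ in jobs)
    total_o = sum(o for _, o in jobs)
    return input_size + sort_cost(total_d) + total_d + total_o

jobs = [(40, 10)] * 3
large_input = (separate_cost(1000, jobs), shared_scan_cost(1000, jobs))
small_input = (separate_cost(50, jobs), shared_scan_cost(50, jobs))
print(large_input)  # (3270, 1390)
print(small_input)  # (420, 440)
```

With a large input the saved scans dominate. With a small input the extra merge pass on the combined intermediate data outweighs them, so sharing is slightly more expensive.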
  • Sharing scans – approximating the cost of sorting
    [Figure: jobs J1 to J5 drawn with area proportional to the size of their intermediate data (Area ~ |intermediate data|).]
    Compute the exact sorting cost for (J1+J2+J3)
    Approximate the sorting cost based on (J1+J2+J3)
  • 23. SplitJobs – a DP solution for sharing scans.
    • By approximating the cost of sorting, we reduce the problem of grouping to the problem of splitting a sorted list of jobs.
    • Using our cost model and the approximation, we employ a dynamic programming (DP) algorithm to find the optimal split points.
    [Figure: SplitJobs splits the sorted list J1 to J6 into groups G1, G2 and G3.]
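Splitting a sorted list into contiguous groups is a classic dynamic program: the best split of the first k jobs is the best split of some shorter prefix plus one final group. The sketch below shows that recurrence with a pluggable cost function. The function names, the toy `group_cost` (one scan per group plus a pass-based sort cost) and the job sizes are assumptions for illustration; the slides do not give the cost function used by SplitJobs.

```python
def split_jobs(jobs, group_cost):
    """Split a sorted job list into contiguous groups of minimal total cost."""
    n = len(jobs)
    best = [0.0] + [float("inf")] * n   # best[k]: cheapest split of jobs[:k]
    cut = [0] * (n + 1)                 # cut[k]: start index of the last group
    for end in range(1, n + 1):
        for start in range(end):
            cost = best[start] + group_cost(jobs[start:end])
            if cost < best[end]:
                best[end] = cost
                cut[end] = start
    groups = []
    end = n
    while end > 0:                      # walk the split points backwards
        groups.append(jobs[cut[end]:end])
        end = cut[end]
    groups.reverse()
    return groups, best[n]

# Toy cost function (hypothetical): one scan per group plus a sort cost that
# grows when the combined intermediate data no longer fits in one buffer.
SCAN_COST, BUFFER_SIZE, FAN_IN = 100, 100, 10

def group_cost(sizes):
    total = sum(sizes)
    runs = -(-total // BUFFER_SIZE)     # ceiling division
    passes = 0
    while runs > 1:
        runs = -(-runs // FAN_IN)
        passes += 1
    return SCAN_COST + total * (1 + passes)

# Jobs represented by their intermediate data sizes, sorted ascending.
groups, total_cost = split_jobs([10, 20, 30, 90, 95], group_cost)
print(groups)      # [[10, 20, 30], [90], [95]]
print(total_cost)  # 545.0
```

In this toy run the three small jobs share a scan, while the two large jobs stay alone because merging them would trigger an extra merge pass.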
  • 24. Outline
    Introduction
    Map Reduce recap.
    MRShare – Sharing primitives in Map-Reduce
    MRShare – Cost based approach to sharing
    Cost model for finding the optimal sharing strategy
    SplitJobs – cost based algorithm for sharing scans
    MultiSplitJobs – an improvement of SplitJobs
    γ-MultiSplitJobs– the algorithm for sharing intermediate data
    MRShare Evaluation
    Summary
  • 25. MultiSplitJobs – an improvement of SplitJobs
    [Figure: MultiSplitJobs applies SplitJobs repeatedly to jobs J1 to J8, producing groups G1, G2, G3 and G4.]
  • 26. Outline
    Introduction
    Map Reduce recap.
    MRShare – Sharing primitives in Map-Reduce
    MRShare – Cost based approach to sharing
    Cost model for finding the optimal sharing strategy
    SplitJobs – cost based algorithm for sharing scans
    MultiSplitJobs – an improvement of SplitJobs
    γ-MultiSplitJobs– the algorithm for sharing intermediate data
    MRShare Evaluation
    Summary
  • 27. Sharing intermediate data - cost based optimization
    [Figure: Read, Sort and Copy costs of jobs J1, J2 and J3 run separately compared with the merged job J1+J2+J3, with regions marked as savings, potential savings, and potential costs or savings.]
    The sorting and copying costs change – depending on the size of the intermediate data
    Prohibitive cost of maintaining statistics
    We need to estimate the size of the intermediate data of all combinations of jobs.
  • 28. Approximate the size of the intermediate data
    γ-MultiSplitJobs – the solution for sharing intermediate data
    [Figure: the intermediate data of the merged job J1+J2+J3 expressed as a sum in which a factor γ scales part of the jobs' intermediate data.]
    • γ-MultiSplitJobs – applies MultiSplitJobs with modified cost function
    • γ set heuristically
  • 29. Outline
    Introduction
    Map Reduce recap.
    MRShare – Sharing primitives in Map-Reduce
    MRShare – Cost based approach to sharing
    MRShare Evaluation
    Summary
  • 30. Evaluation setup
    40 EC2 small instance virtual machines
    Modified Hadoop engine
    30 GB text dataset consisting of blogs
    Multiple grep-wordcount queries
    Counts words matching a regular expression
    Allows for variable intermediate data sizes
    Generic aggregation Map Reduce job
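A grep-wordcount query of the kind listed above can be sketched in a few lines: it counts only the words that match a regular expression, so a more permissive pattern emits more pairs and produces more intermediate data. The function name, the whole-word matching rule and the sample lines are assumptions; the slides do not show the actual job code.

```python
import re
from collections import Counter

def grep_wordcount(lines, pattern):
    """Count the words that match the regular expression as a whole."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:                  # map side: emit (word, 1) for each match
        for word in line.split():
            if regex.fullmatch(word):
                counts[word] += 1       # reduce side: sum the ones per word
    return counts

blog_lines = [
    "map reduce makes sharing simple",
    "sharing scans saves money",
]

selective = grep_wordcount(blog_lines, r"s.*")    # words starting with "s"
permissive = grep_wordcount(blog_lines, r".*")    # every word

print(dict(selective))            # {'sharing': 2, 'simple': 1, 'scans': 1, 'saves': 1}
print(sum(selective.values()))    # 5
print(sum(permissive.values()))   # 9
```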
  • 31. Evaluation goals
    Sharing is not always beneficial.
    ‘GreedyShare’ policy
    How much can we save on sharing scans?
    MRShare - MultiSplitJobs evaluation
    How much can we save on sharing intermediate data?
    MRShare - γ-MultiSplitJobs evaluation
  • 32. Is sharing always beneficial?- ‘GreedyShare’ policy
  • 33. How much we save on sharing scans – MRShare MultiSplitJobs
  • 34. How much we save on sharing intermediate data - MRShare - γ-MultiSplitJobs
  • 35. Summary
    We introduced MRShare – a framework for automatic work sharing in Map Reduce.
    We identified sharing primitives and demonstrated the implementation thereof in a Map Reduce engine.
    We established a cost model and solved several work sharing optimization problems.
    We demonstrated vast savings when using MRShare.
  • 36. Thank you!!!
    Questions?
  • 37. Ongoing work – sharing expensive computation
    Sharing across multiple Map Reduce jobs with expensive predicates.
    [Figure: Input feeding a Meta-map that contains Map 1 to Map 4.]
  • 38. Ongoing work – dynamic sharing
    Dynamic sharing.
    [Figure: progress of jobs J1 and J2 over time, run separately and as the merged job J1+J2.]