m2r2: A Framework for Results Materialization and Reuse in High-Level Dataflow Systems for Big Data
2nd International Conference on Big Data Science and Engineering (BDSE 2013)
Vasiliki Kalavri, Hui Shang, Vladimir Vlassov
{kalavri, hshang, vladv}@kth.se
4 December 2013, Sydney, Australia
Outline
➔ Motivation
➔ Materialized Views in Relational DBMSs
➔ High-Level Dataflow Systems for Big Data
◆ similarities in design and implementation
➔ m2r2 design
◆ design goals and system components
➔ Prototype Implementation Details
➔ Evaluation Results
➔ Conclusions and Future Work
Motivation
➔ Avoid computational redundancies
◆ filter out bad records, spam e-mail
◆ data representation transformations
➔ Microsoft has found a 30%-60% similarity in queries submitted for execution
➔ A Berkeley MapReduce workload characterization study shows a big need for caching job results
Materialized Views in RDBMSs
➔ A derived relation, stored in the database
◆ Queries are computed using the views instead of the base relations
➔ Challenges
◆ View Design: What to materialize?
◆ View Maintenance: How to update the views?
◆ View Exploitation: How to use the views for query optimization?
● view matching and query rewriting
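View exploitation can be illustrated with a small, self-contained sketch. The table, view, and column names below are hypothetical, and because SQLite has no native materialized views, an ordinary table stands in for one. The point is only that a query over the base relation and its rewritten form over the view return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, status TEXT, total REAL);
    INSERT INTO orders VALUES
        (1, 'alice', 'shipped', 20.0),
        (2, 'bob',   'shipped', 35.0),
        (3, 'alice', 'pending', 10.0),
        (4, 'carol', 'shipped', 50.0);
    -- Stand-in for a materialized view: the derived relation is stored as a table.
    CREATE TABLE mv_shipped AS
        SELECT id, customer, total FROM orders WHERE status = 'shipped';
""")

# Query as submitted, against the base relation.
original = """SELECT customer, SUM(total) FROM orders
              WHERE status = 'shipped' AND total > 25
              GROUP BY customer ORDER BY customer"""

# Rewritten query: the view already applied status = 'shipped',
# so only the remaining predicate is evaluated.
rewritten = """SELECT customer, SUM(total) FROM mv_shipped
               WHERE total > 25
               GROUP BY customer ORDER BY customer"""

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()
print(original_rows == rewritten_rows)
```

Here the rewrite is done by hand; deciding automatically that the view can answer the query is the view matching problem named above.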
High-Level Dataflow Systems (1)
High-Level Dataflow Systems for Big Data (Pig, Hive, Jaql, DryadLINQ, etc.) exhibit wide similarities on multiple design levels:
➔ Language Layer
◆ Declarative, SQL-like language
◆ Statements define transformations on collections of datasets
➔ Data Operators
◆ Encapsulate the logic of the transformations to be performed
◆ Relational, Expressions, Control-flow
High-Level Dataflow Systems (2)
Pig Latin
HiveQL
Jaql
High-Level Dataflow Systems (3)
● The Logical Plan
○ Parser → AST → DAG of operators
● Compilation to an Execution Plan
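The step from script to operator DAG can be sketched as follows. The one-line-per-statement syntax (`alias = KIND inputs : args`) is a toy language invented for this illustration; it is not Pig Latin, HiveQL, or Jaql, and the `Operator` class and `build_plan` function are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    alias: str
    kind: str                      # e.g. LOAD, FILTER, JOIN
    args: str
    inputs: list = field(default_factory=list)

def build_plan(script: str) -> dict:
    """Parse 'alias = KIND in1,in2 : args' lines into a DAG keyed by alias."""
    plan = {}
    for line in script.strip().splitlines():
        alias, rhs = (part.strip() for part in line.split("=", 1))
        head, _, args = rhs.partition(":")
        kind, *sources = head.split()
        # Each input is a reference to an earlier operator, which forms the DAG edges.
        inputs = [plan[s] for s in sources[0].split(",")] if sources else []
        plan[alias] = Operator(alias, kind, args.strip(), inputs)
    return plan

script = """
a = LOAD : lineitem
b = FILTER a : quantity > 10
c = LOAD : orders
d = JOIN b,c : orderkey
"""
plan = build_plan(script)
print(plan["d"].kind, [op.alias for op in plan["d"].inputs])
```

Because every operator holds references to its inputs, the final alias (`d`) is the root of a DAG that a compiler could walk to produce an execution plan.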
m2r2: materialize - match - rewrite - reuse
➔ A language-independent, extensible framework for storing, managing, and using previous job and sub-job results
➔ Operates on the logical plan level, in order to support different languages and backend execution engines
m2r2 Components
➔ Plan Matcher and Rewriter
◆ How to be independent of the high-level language and execution engine?
◆ Shark: Hive on Spark, PonIC: Pig on Stratosphere, etc.? → Match at the Logical Plan level!
➔ Plan Optimizer
➔ Results Cache
➔ Plan Repository
➔ Garbage Collector
Match and Rewrite Algorithm
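The algorithm on this slide was presented as a figure that is not reproduced in the text. The following is a minimal sketch of one way matching and rewriting at the logical plan level could work, and it should not be read as the authors' algorithm. It assumes plans are nested tuples `(kind, args, *inputs)`, that the repository is a plain dictionary from plan fingerprint to a stored result location, and that exact sub-plan equality is the match criterion; the cache path is hypothetical.

```python
import hashlib

def fingerprint(plan):
    """Canonical hash of an operator together with everything upstream of it."""
    kind, args, *inputs = plan
    text = kind + "|" + args + "|" + ",".join(fingerprint(i) for i in inputs)
    return hashlib.sha1(text.encode()).hexdigest()

def rewrite(plan, repository):
    """Replace each largest cached sub-plan with a LOAD of its stored result."""
    cached_path = repository.get(fingerprint(plan))
    if cached_path is not None:
        # Whole sub-plan was materialized before: reuse it instead of recomputing.
        return ("LOAD", cached_path)
    kind, args, *inputs = plan
    return (kind, args, *(rewrite(i, repository) for i in inputs))

filtered = ("FILTER", "quantity > 10", ("LOAD", "lineitem"))
query = ("JOIN", "orderkey", filtered, ("LOAD", "orders"))

# Pretend an earlier job materialized the FILTER sub-plan (hypothetical path).
repository = {fingerprint(filtered): "/cache/filtered_lineitem"}

rewritten = rewrite(query, repository)
print(rewritten)
```

The traversal is top-down, so a match on a whole job takes precedence over matches on its sub-jobs. A production matcher would also need to handle equivalent plans that differ syntactically, which this sketch does not attempt.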
m2r2 Implementation
➔ Built on top of Pig/Hadoop
➔ HDFS as the Results Cache
➔ MySQL Cluster as the Repository
◆ in-memory, highly available and fault-tolerant
➔ Garbage Collection as a separate module
◆ policy based on reuse frequency and last access time
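The slides say only that the garbage collection policy considers reuse frequency and last access time. As a sketch under that description, the function below evicts a cached result when it has been reused fewer than `min_reuse` times and has been idle for longer than `max_idle` seconds. The `CacheEntry` fields, both thresholds, and the choice to combine the two criteria with a logical AND are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    path: str
    reuse_count: int      # how many times the result has been reused
    last_access: float    # seconds since the epoch

def select_for_eviction(entries, now, min_reuse=2, max_idle=7 * 24 * 3600):
    """Return paths of entries that are both rarely reused and long idle."""
    return [
        e.path
        for e in entries
        if e.reuse_count < min_reuse and now - e.last_access > max_idle
    ]

DAY = 24 * 3600
now = 1_000_000.0
entries = [
    CacheEntry("/cache/a", reuse_count=5, last_access=now - 10 * DAY),  # popular
    CacheEntry("/cache/b", reuse_count=0, last_access=now - 10 * DAY),  # stale, unused
    CacheEntry("/cache/c", reuse_count=0, last_access=now - 1 * DAY),   # recent
]
evicted = select_for_eviction(entries, now)
print(evicted)
```

Keeping the policy in a pure function of the repository metadata fits the slide's point that garbage collection runs as a separate module.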
Evaluation Setup
➔ Cluster Setup
◆ Pig 0.11, Hadoop 1.0.4 and MySQL Cluster 7.2.12 deployed on top of OpenStack
◆ 20 Ubuntu 11.10 VMs
➔ Data and Queries
◆ TPC-H Benchmark for Pig
◆ 20 queries, 6 of which offer a reuse opportunity
◆ 107 GB of data generated with the TPC-H DBGEN tool
Speedup using Sub-Jobs
Speedup using Whole Jobs
Conclusions
➔ The logical plan is the proper layer to build a language-independent reuse framework
➔ When a reuse opportunity exists, query execution time can be reduced substantially
◆ 65% on average in our experiments
➔ The materialization overhead is small and I/O-dominated
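To relate the reported reduction to a speedup factor, the arithmetic below assumes a single query whose execution time drops by exactly 65%. The symbols are introduced here for illustration only, and because the 65% figure is an average over queries, the result is indicative rather than a reported number.

```latex
\[
T_{\text{reuse}} = (1 - 0.65)\,T_{\text{base}} = 0.35\,T_{\text{base}}
\quad\Longrightarrow\quad
\frac{T_{\text{base}}}{T_{\text{reuse}}} = \frac{1}{0.35} \approx 2.86
\]
```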
Future Work
➔ Integrate with other high-level systems
➔ Explore the possibility of sharing results among different frameworks
➔ Obtain execution traces and perform a more realistic evaluation
➔ Minimize costs by overlapping materialization with regular query execution
