m2r2: A Framework for Results Materialization and Reuse

This document presents m2r2, a framework for materializing and reusing results in high-level dataflow systems for big data. The framework operates at the logical plan level, which keeps it independent of the query language and the execution engine. Its components are a plan matcher and rewriter, a plan optimizer, a results cache, a plan repository, and a garbage collector. An evaluation using the TPC-H benchmark on Pig showed that, for queries with a reuse opportunity, reusing past results reduced execution time by 65% on average. Future work includes integrating the framework with more systems and reducing materialization costs.

m2r2: A Framework for Results Materialization and Reuse in High-Level Dataflow Systems for Big Data
2nd International Conference on Big Data Science and Engineering (BDSE 2013)
Vasiliki Kalavri, Hui Shang, Vladimir Vlassov
{kalavri, hshang, vladv}@kth.se
4 December 2013, Sydney, Australia

Outline
➔ Motivation
➔ Materialized Views in Relational DBMSs
➔ High-Level Dataflow Systems for Big Data
◆ similarities in design and implementation
➔ m2r2 design
◆ design goals and system components
➔ Prototype Implementation Details
➔ Evaluation Results
➔ Conclusions and Future Work

Motivation
➔ Avoid computational redundancies
◆ filter out bad records, spam e-mail
◆ data representation transformations
➔ Microsoft has found a 30%-60% similarity
in queries submitted for execution
➔ A Berkeley MapReduce workload
characterization study shows a big need
for caching job results

Materialized Views in RDBMSs
➔ A derived relation, stored in the database
◆ Queries are computed using the views instead of
the base relations
➔ Challenges
◆ View Design: What to materialize?
◆ View Maintenance: How to update the views?
◆ View Exploitation: How to use the views for query
optimization?
● view matching and query rewriting
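View exploitation can be illustrated with a minimal sketch using Python's built-in sqlite3 module. SQLite has no native materialized views, so the view is emulated here with an ordinary table. The table and column names (orders, revenue_by_customer) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('alice', 10.0), ('bob', 5.0), ('alice', 2.5);
    -- Emulated materialized view: a derived relation stored as a table.
    CREATE TABLE revenue_by_customer AS
        SELECT customer, SUM(amount) AS revenue FROM orders GROUP BY customer;
""")

# Original query, computed from the base relation.
base_query = (
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
)
# Rewritten query, answered from the stored view instead.
rewritten_query = (
    "SELECT customer, revenue FROM revenue_by_customer ORDER BY customer"
)

base_rows = conn.execute(base_query).fetchall()
view_rows = conn.execute(rewritten_query).fetchall()
```

Both queries return the same rows, but the rewritten one skips the aggregation. The sketch ignores view maintenance: if orders changes, the stored table becomes stale.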

High-Level Dataflow Systems (1)
High-Level Dataflow Systems for Big Data
(Pig, Hive, Jaql, DryadLINQ, etc.) share
many similarities at multiple design levels:
➔ Language Layer
◆ Declarative, SQL-like language
◆ Statements define transformations on collections of datasets
➔ Data Operators
◆ Encapsulate the logic of the transformations to be performed
◆ Relational, Expressions, Control-flow

High-Level Dataflow Systems (2)
Pig Latin
HiveQL
Jaql

High-Level Dataflow Systems (3)
● The Logical Plan
○ Parser → AST → DAG of operators
● Compilation to an Execution Plan
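A logical plan of this kind can be sketched as a small DAG of operator nodes. The Operator class, its fields, and the example operators below are hypothetical and are not taken from Pig, Hive, or Jaql.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    kind: str            # e.g. LOAD, FILTER, GROUP, STORE
    args: str = ""       # operator parameters, kept as plain text here
    inputs: tuple = ()   # upstream operators this one consumes


def topological_order(root):
    """Return operators so that each one appears after all of its inputs."""
    seen, order = set(), []

    def visit(op):
        if id(op) in seen:
            return
        seen.add(id(op))
        for parent in op.inputs:
            visit(parent)
        order.append(op)

    visit(root)
    return order


load = Operator("LOAD", "lineitem")
flt = Operator("FILTER", "quantity > 10", (load,))
grp = Operator("GROUP", "orderkey", (flt,))
store = Operator("STORE", "out", (grp,))

plan_kinds = [op.kind for op in topological_order(store)]
```

A compiler would walk the plan in this order when translating it into an execution plan for the backend engine.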

m2r2: materialize - match - rewrite - reuse
➔ A language-independent, extensible
framework for
◆ storing
◆ managing and
◆ using
previous job and sub-job results
➔ Operates at the logical plan level, in
order to support different languages and
backend execution engines

m2r2 Components
➔ Plan Matcher and Rewriter
◆ How to be independent of the high-level
language and execution engine?
◆ Shark: Hive on Spark, PonIC: Pig on
Stratosphere, etc.? → Match at the Logical Plan
level!
➔ Plan Optimizer
➔ Results Cache
➔ Plan Repository
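➔ Garbage Collector

Match and Rewrite Algorithm
[Figure: match and rewrite algorithm (slide image not available in the text)]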
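Matching at the logical plan level can be sketched as follows. The slides do not spell out the algorithm in text, so this is not the authors' method. It is a minimal illustration that assumes exact structural matching via fingerprints, with a hypothetical Operator class, a plain dict standing in for the Plan Repository, and an invented cache path.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    kind: str
    args: str = ""
    inputs: tuple = ()


def fingerprint(op):
    """Hash an operator together with the fingerprints of its inputs."""
    parts = [op.kind, op.args] + [fingerprint(i) for i in op.inputs]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()


def rewrite(op, repository):
    """Replace the largest matching sub-plans with a LOAD of the cached result."""
    cached_path = repository.get(fingerprint(op))
    if cached_path is not None:
        return Operator("LOAD", cached_path)
    new_inputs = tuple(rewrite(i, repository) for i in op.inputs)
    return Operator(op.kind, op.args, new_inputs)


# An earlier job materialized the result of LOAD -> FILTER.
load = Operator("LOAD", "lineitem")
flt = Operator("FILTER", "quantity > 10", (load,))
repository = {fingerprint(flt): "/cache/filtered_lineitem"}  # hypothetical path

# A new query contains the same sub-plan and is rewritten to reuse it.
query = Operator("GROUP", "orderkey", (flt,))
rewritten = rewrite(query, repository)
```

Because the traversal starts at the root, the largest cached sub-plan wins. A realistic matcher would also have to recognize equivalent plans that differ in operator order or naming, which this sketch does not attempt.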

m2r2 Implementation
➔ Built on top of
Pig/Hadoop
➔ HDFS as the Results Cache
➔ MySQL Cluster as the
Repository
◆ in-memory, highly-available
and fault-tolerant
➔ Garbage Collection as a
separate module
◆ eviction policy based on reuse frequency
and last access time
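A garbage collection policy that combines reuse frequency and last access time might look like the sketch below. The CacheEntry fields, the thresholds, and the rule that an entry must be both stale and rarely reused are assumptions for illustration; the slides do not state the exact policy.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    path: str           # location of the materialized result (hypothetical)
    reuse_count: int    # how many times the result has been reused
    last_access: float  # seconds since the epoch


def select_for_eviction(entries, now, max_idle_seconds, min_reuse_count):
    """Pick entries that are both stale and rarely reused."""
    return [
        e.path
        for e in entries
        if now - e.last_access > max_idle_seconds
        and e.reuse_count < min_reuse_count
    ]


entries = [
    CacheEntry("/cache/a", reuse_count=9, last_access=1_000.0),  # stale but popular
    CacheEntry("/cache/b", reuse_count=0, last_access=1_000.0),  # stale and unused
    CacheEntry("/cache/c", reuse_count=0, last_access=9_500.0),  # unused but recent
]
evicted = select_for_eviction(
    entries, now=10_000.0, max_idle_seconds=3_600, min_reuse_count=2
)
```

Running the collector as a separate module, as the slide describes, keeps this bookkeeping off the query execution path.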

Evaluation Setup
➔ Cluster Setup
◆ Pig 0.11, Hadoop 1.0.4 and MySQL Cluster 7.2.12
deployed on top of OpenStack
◆ 20 Ubuntu 11.10 VMs
➔ Data and Queries
◆ TPC-H Benchmark for Pig
◆ 20 queries, 6 of which offer a reuse
opportunity
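◆ 107 GB of data generated with the TPC-H DBGEN tool

Speedup using Sub-Jobs
[Figure: speedup using sub-jobs (slide image not available in the text)]

Speedup using Whole Jobs
[Figure: speedup using whole jobs (slide image not available in the text)]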

Conclusions
➔ The logical plan is the proper layer to
build a language-independent reuse
framework
➔ When a reuse opportunity exists,
query execution time can be reduced
substantially
◆ 65% on average in our experiments
➔ The materialization overhead is
small and dominated by I/O
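The 65% figure can be read as a speedup factor. The short calculation below assumes the figure is a relative reduction in execution time and applies it to a single query. An average of per-query reductions does not translate exactly into an average speedup, so the result is only indicative.

```python
def speedup_from_reduction(reduction):
    """Convert a relative time reduction (0..1) into a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# A query whose execution time drops by 65% runs in 35% of the original time.
speedup = speedup_from_reduction(0.65)
```

Under this assumption a 65% reduction corresponds to a speedup of roughly 2.9x.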

Future Work
➔ Integrate with other high-level systems
➔ Explore the possibility of sharing results
among different frameworks
➔ Obtain execution traces and perform a
more realistic evaluation
➔ Minimize costs by overlapping
materialization with regular query
execution
