m2r2 Framework for Results Materialization and Reuse


This document presents m2r2, a framework for materializing and reusing results in high-level dataflow systems for big data. The framework operates at the logical plan level so that it stays independent of the query language and the execution engine. Its components are a plan matcher and rewriter, a plan optimizer, a results cache, a plan repository, and a garbage collector. In an evaluation with the TPC-H benchmark on Pig, reusing past results reduced query execution time by 65% on average for queries with a reuse opportunity. Future work includes integrating m2r2 with other high-level systems and reducing materialization cost by overlapping it with regular query execution.


m2r2: A Framework for Results Materialization and Reuse
in High-Level Dataflow Systems for Big Data
2nd International Conference on Big Data Science and Engineering (BDSE 2013)
Vasiliki Kalavri, Hui Shang, Vladimir Vlassov
{kalavri, hshang, vladv}@kth.se
4 December 2013, Sydney, Australia

Outline
➔ Motivation
➔ Materialized Views in Relational DBMSs
➔ High-Level Dataflow Systems for Big Data
◆ similarities in design and implementation
➔ m2r2 design
◆ design goals and system components
➔ Prototype Implementation Details
➔ Evaluation Results
➔ Conclusions and Future Work

Motivation
➔ Avoid computational redundancies
◆ filter out bad records, spam e-mail
◆ data representation transformations
➔ Microsoft has found a 30%-60% similarity
in queries submitted for execution
➔ A Berkeley MapReduce workload
characterization study shows a strong need
for caching job results

Materialized Views in RDBMSs
➔ A derived relation, stored in the database
◆ Queries are computed using the views instead of
the base relations
➔ Challenges
◆ View Design: What to materialize?
◆ View Maintenance: How to update the views?
◆ View Exploitation: How to use the views for query
optimization?
● view matching and query rewriting
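
A minimal sketch of view exploitation, assuming SQLite through Python's standard `sqlite3` module. SQLite has no native materialized views, so the view is emulated here with `CREATE TABLE ... AS`; the table and column names (`orders`, `totals_by_customer`) are hypothetical. The second query reads the stored relation instead of recomputing the aggregate from the base relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, amount INTEGER);
INSERT INTO orders VALUES ('a', 10), ('a', 5), ('b', 7);
-- The "view": a derived relation stored as an ordinary table (hypothetical name).
CREATE TABLE totals_by_customer AS
  SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;
""")

# Original query, computed from the base relation.
from_base = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# Rewritten query, computed from the stored view.
from_view = conn.execute(
    "SELECT customer, total FROM totals_by_customer ORDER BY customer"
).fetchall()
```

Both queries return the same rows; the sketch ignores view maintenance, so the stored table would go stale if `orders` changed.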

High-Level Dataflow Systems (1)
High-Level Dataflow Systems for Big Data
(Pig, Hive, Jaql, DryadLINQ, etc.) exhibit
broad similarities at multiple design levels:
➔ Language Layer
◆ Declarative, SQL-like language
◆ Statements define transformations on collections of datasets
➔ Data Operators
◆ Encapsulate the logic of the transformations to be performed
◆ Relational, Expressions, Control-flow

High-Level Dataflow Systems (2)
(figure slide: panels labeled Pig Latin, HiveQL, and Jaql; their contents are not included in this transcript)

High-Level Dataflow Systems (3)
● The Logical Plan
○ Parser → AST → DAG of operators
● Compilation to an Execution Plan
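
To make the logical plan concrete, here is a small sketch of a DAG of operators, written in Python rather than in any of the systems above. The `Operator` class, its fields, and the example operators are illustrative assumptions, not the internal representation used by Pig, Hive, or Jaql.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """One node of a logical plan (illustrative, not a real system's class)."""
    kind: str            # e.g. "LOAD", "FILTER", "GROUP", "STORE"
    args: tuple = ()     # operator parameters, such as a predicate or a key
    inputs: tuple = ()   # upstream operators, i.e. the edges of the DAG


def topological_order(sink):
    """Return the operators reachable from `sink`, inputs before consumers."""
    seen, order = set(), []

    def visit(op):
        if id(op) in seen:
            return
        seen.add(id(op))
        for parent in op.inputs:
            visit(parent)
        order.append(op)

    visit(sink)
    return order


# A four-operator plan: load a table, filter it, group it, store the result.
load = Operator("LOAD", ("lineitem",))
flt = Operator("FILTER", ("l_quantity > 10",), (load,))
grp = Operator("GROUP", ("l_orderkey",), (flt,))
store = Operator("STORE", ("out",), (grp,))

plan_kinds = [op.kind for op in topological_order(store)]
```

A compiler to an execution plan would walk the operators in this order and map them onto jobs of the backend engine.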

m2r2: materialize - match - rewrite - reuse
➔ A language-independent, extensible
framework for
◆ storing
◆ managing and
◆ using
previous job and sub-job results
➔ Operates on the logical plan level, in
order to support different languages and
backend execution engines
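
The match, rewrite, and reuse steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the talk: plans are nested `(kind, args, inputs)` tuples, matching is exact equality of a hash-based fingerprint, and `repository` is a hypothetical dictionary from fingerprint to the location of a stored result.

```python
import hashlib


def fingerprint(op):
    """Hash an operator together with everything upstream of it."""
    kind, args, inputs = op
    canonical = repr((kind, args, tuple(fingerprint(i) for i in inputs)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def rewrite(op, repository):
    """Replace matched sub-plans with a LOAD of the stored result.

    The traversal starts at the sink, so the largest matching sub-plan
    is replaced first and its inputs are never visited.
    """
    fp = fingerprint(op)
    if fp in repository:
        return ("LOAD", (repository[fp],), ())
    kind, args, inputs = op
    return (kind, args, tuple(rewrite(i, repository) for i in inputs))


# An earlier job materialized the output of this FILTER (hypothetical path).
load = ("LOAD", ("lineitem",), ())
filtered = ("FILTER", ("l_quantity > 10",), (load,))
repository = {fingerprint(filtered): "/cache/filtered_lineitem"}

# A new query shares the FILTER sub-plan and adds a SORT on top of it.
query = ("SORT", ("l_shipdate",), (filtered,))
rewritten = rewrite(query, repository)
```

Because both steps work on plan structure only, nothing in the sketch depends on the language that produced the plan or on the engine that will run it.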

m2r2 Components
➔ Plan Matcher and Rewriter
◆ How to be independent of the high-level
language and execution engine?
◆ Shark: Hive on Spark, PonIC: Pig on
Stratosphere, etc.? → Match at the Logical Plan
level!
➔ Plan Optimizer
➔ Results Cache
➔ Plan Repository
➔ Garbage Collector

Match and Rewrite Algorithm
(figure slide: content not included in this transcript)

m2r2 Implementation
➔ Built on top of
Pig/Hadoop
➔ HDFS as the Results Cache
➔ MySQL Cluster as the
Repository
◆ in-memory, highly-available
and fault-tolerant
➔ Garbage Collection as a
separate module
◆ policy on reuse frequency and
last access time
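
A garbage-collection policy based on reuse frequency and last access time could look like the sketch below. The slides do not say how the two criteria are combined, so the `CacheEntry` fields, the thresholds, and the "evict if either criterion fails" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    path: str            # location of the materialized result
    reuse_count: int     # how many times the result has been reused
    last_access: float   # time of last access, in seconds since the epoch


def select_for_eviction(entries, now, min_reuse=1, max_idle_seconds=7 * 24 * 3600):
    """Pick entries that were reused too rarely or have been idle too long.

    Both thresholds are hypothetical defaults.
    """
    return [
        e for e in entries
        if e.reuse_count < min_reuse or now - e.last_access > max_idle_seconds
    ]


now = 1_000_000_000.0
entries = [
    CacheEntry("a", reuse_count=5, last_access=now - 3600),            # kept
    CacheEntry("b", reuse_count=0, last_access=now - 3600),            # never reused
    CacheEntry("c", reuse_count=3, last_access=now - 30 * 24 * 3600),  # idle for 30 days
]
evicted = [e.path for e in select_for_eviction(entries, now)]
```

Running the collector as a separate module, as on the slide, keeps this scan off the query execution path.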

Evaluation Setup
➔ Cluster Setup
◆ Pig 0.11, Hadoop 1.0.4 and MySQL Cluster 7.2.12
deployed on top of OpenStack
◆ 20 Ubuntu 11.10 VMs
➔ Data and Queries
◆ TPC-H Benchmark for Pig
◆ 20 queries, 6 of which have a reuse
opportunity
◆ 107 GB of data generated with the TPC-H DBGEN tool

Speedup using Sub-Jobs
(figure slide: content not included in this transcript)

Speedup using Whole Jobs
(figure slide: content not included in this transcript)

Conclusions
➔ The logical plan is the proper layer to
build a language-independent reuse
framework
➔ When a reuse opportunity exists,
query execution time can be reduced
substantially
◆ 65% on average in our experiments
➔ The materialization overhead is small
and dominated by I/O

Future Work
➔ Integrate with other high-level systems
➔ Explore the possibility of sharing results
among different frameworks
➔ Obtain execution traces and perform a
more realistic evaluation
➔ Minimize costs by overlapping
materialization with regular query
execution
