Optimization for iterative queries on MapReduce
Makoto Onizuka, Hiroyuki Kato, Soichiro
Hidaka, Keisuke Nakano, Zhenjiang Hu
1
Demand for Big Data Analysis
 Big Data Analysis
 Cyber space: Click log, query log
 Real space: shopping log, sensing data
 Machine learning
 Algorithm: classification, clustering
 Data type: relation, vector, graph, time series
 Distributed computing framework
 Interface: MPI, MapReduce, BSP (bulk
synchronous parallel)
2
Iterative analysis examples
 Clustering
 Partitioning: k-means, EM-algorithm, affinity
propagation
 Hierarchical clustering: Ward's method,
BIRCH
 Matrix factorization
 Graph mining
 PageRank, Random walk with restarts
3
Running example: PageRank
This program is not efficient. Which parts?
4
 The map function shuffles the whole graph structure in every iteration
 Scores are computed even if the nodes have converged
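The program listing from this slide did not survive extraction. As a stand-in, here is a minimal Python sketch of a naive PageRank loop. It is not the authors' MapReduce code; the damping factor 0.85, the fixed iteration count, and the three-node graph are assumptions made for illustration. The comments mark where the two redundancies arise.

```python
from collections import defaultdict

def naive_pagerank(edges, num_iters=20, damping=0.85):
    """Naive PageRank: every iteration re-groups the full edge list."""
    nodes = sorted({n for e in edges for n in e})
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iters):
        # Redundancy 1: the graph structure is re-grouped ("shuffled")
        # in every iteration although the edges never change.
        out_edges = defaultdict(list)
        for src, dest in edges:
            out_edges[src].append(dest)
        # Redundancy 2: every node is recomputed, even if its score
        # has already stopped changing.
        incoming = defaultdict(float)
        for src, dests in out_edges.items():
            for dest in dests:
                incoming[dest] += score[src] / len(dests)
        score = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                 for n in nodes}
    return score

# Hypothetical 3-node cycle: a -> b -> c -> a
scores = naive_pagerank([("a", "b"), ("b", "c"), ("c", "a")])
```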
Issues for iterative analysis
 How to optimize the program?
 Reusing the intermediate (shuffled) data
 Skip computing the scores of converged nodes
 Possible but difficult to manually remove
the above redundant computations
 In practice, Spark, HaLoop, and REX require
programmers to remove them manually
 Our goal: Automatically remove redundant
computations for iterative queries
5
Overview
 OptIQ is a new optimization framework for
iterative queries with convergence property
 Declarative high level language; programmers
are freed from burden of removing redundancy
 OptIQ integrates traditional optimization
techniques from the database and compiler areas
 Two techniques for removing redundancy
 view materialization for invariant views
 incrementalization for variant views
 We implement OptIQ on Hive and Spark
6
Iterative query language
 SQL extended with iteration
 Syntax
 Behavior
 initialize: statements before iteration
 the update table is updated by the step query
repeatedly until convergence
 return: statements after iteration
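The behavior above can be sketched as a small driver loop. This is an illustrative Python analogue, not OptIQ's actual syntax (the syntax slide did not survive extraction). The function name `run_iterative_query`, its parameters, and the `max_iters` safety cap are all hypothetical.

```python
def run_iterative_query(initialize, step, converged, finish, max_iters=1000):
    """Run initialize once, apply step until convergence, then finish."""
    table = initialize()                 # initialize: statements before iteration
    for _ in range(max_iters):
        new_table = step(table)          # step query produces the next update table
        done = converged(table, new_table)
        table = new_table
        if done:
            break
    return finish(table)                 # return: statements after iteration

# Hypothetical usage: halve a value until it changes by less than 1e-6.
result = run_iterative_query(
    initialize=lambda: {"x": 1.0},
    step=lambda t: {"x": t["x"] / 2},
    converged=lambda old, new: abs(old["x"] - new["x"]) < 1e-6,
    finish=lambda t: t["x"],
)
```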
7
Example: PageRank
8
Example: k-means
9
Query Optimization
 Goal: remove redundant computations
 Question: What is redundant computation?
 Operations on unmodified attributes of tuples
 Operations on attributes of unmodified tuples
 OptIQ reuses partial results of step queries
 View materialization reuses operations on
unmodified attributes
 incrementalization reuses operations on unmodified
tuples
10
Query Optimization cont.
11
View materialization
 Purpose is to reuse unmodified attributes
of update table during iterations
 Procedure
1. Decompose update table into variant and
invariant tables by conservative analysis
2. Materialize sub-query in step query that only
accesses invariant table
3. Rewrite step query to use materialized view,
query processing using view
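The procedure above can be illustrated on PageRank. The following Python sketch assumes the invariant part of the step query is the join of each edge with its source's out-degree; that view is built once and the loop only touches the variant scores. The function names and the sample graph are hypothetical, and this is a single-machine analogy rather than OptIQ's generated plan.

```python
from collections import defaultdict

def materialize_invariant_view(edges):
    """Built once: (src, dest, 1/out_degree) depends only on src and dest."""
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1
    return [(src, dest, 1.0 / out_degree[src]) for src, dest in edges]

def pagerank_with_view(edges, num_iters=20, damping=0.85):
    nodes = sorted({n for e in edges for n in e})
    view = materialize_invariant_view(edges)       # step 2: materialize once
    score = {n: 1.0 / len(nodes) for n in nodes}   # variant part only
    for _ in range(num_iters):
        incoming = defaultdict(float)
        for src, dest, weight in view:             # step 3: step query reads the view
            incoming[dest] += score[src] * weight
        score = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                 for n in nodes}
    return score

# Hypothetical graph in which every node has at least one outgoing edge.
scores = pagerank_with_view([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
```

The edge list is grouped once instead of once per iteration, which is the saving that view materialization targets.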
12
Table decomposition
 discriminate modified/unmodified attributes
 unmodified attribute: src, dest in Graph
 modified attribute: score in Graph
 decompose update table
 Graph’ = select VT.src, IT.dest, VT.score
from VT, IT
where VT.src = IT.src
13
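A small Python sketch of the decomposition, assuming (as in PageRank) that score depends only on src. The sample rows are hypothetical. Joining VT and IT on src reproduces the original Graph table.

```python
# Hypothetical Graph rows: (src, dest, score)
graph = [("a", "b", 0.5), ("a", "c", 0.5), ("b", "c", 0.25)]

# Invariant table IT keeps the attributes that never change.
IT = [(src, dest) for src, dest, _ in graph]
# Variant table VT keeps one score per src (assumes score depends on src only).
VT = {src: score for src, _, score in graph}

# Graph' = select VT.src, IT.dest, VT.score from VT, IT where VT.src = IT.src
graph_prime = [(src, dest, VT[src]) for src, dest in IT if src in VT]
```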
Example: PageRank
 Table decomposition
 Remove Graph’ table from query
 discriminate
14
simplification
Subquery lifting
 construct read-only (invariant) views
accessed by step queries
 extract loop-invariant computations by
using unmodified attributes
 Procedure
1. Constant let statement lifting (to initialize
clause)
2. Invariant subquery lifting (to initialize clause)
3. Common subquery elimination with query
rewrite, unnesting, identity query elimination
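Steps 1 and 2 resemble loop-invariant code motion in compilers. The Python sketch below uses a hypothetical invariant subquery, `count_nodes`, and a call counter to show that the lifted version evaluates it once while producing the same values. The PageRank-style teleport term is an assumption chosen for illustration.

```python
def count_nodes(edges, counter):
    """Invariant subquery over the read-only edge table; counter tracks calls."""
    counter[0] += 1
    return len({n for edge in edges for n in edge})

def teleport_terms_unlifted(edges, iters, damping=0.85):
    counter = [0]
    terms = []
    for _ in range(iters):
        n = count_nodes(edges, counter)      # re-evaluated in every iteration
        terms.append((1 - damping) / n)
    return terms, counter[0]

def teleport_terms_lifted(edges, iters, damping=0.85):
    counter = [0]
    n = count_nodes(edges, counter)          # lifted to the initialize clause
    teleport = (1 - damping) / n             # constant let statement, also lifted
    return [teleport] * iters, counter[0]

edges = [("a", "b"), ("b", "c"), ("c", "a")]
unlifted, unlifted_calls = teleport_terms_unlifted(edges, 5)
lifted, lifted_calls = teleport_terms_lifted(edges, 5)
```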
15
Example: PageRank
16
Invariant subquery lifting
Identity query elimination, VT = Score
Example: k-means
17
table decomposition
query elimination (for VT)
simplification
Automatic incrementalization
 Not all records are updated in iterations.
Purpose is to reuse unmodified tuples in
variant views.
 Procedure
1. Detect delta table between iterations before
starting 1st iteration.
2. Derive incremental queries. Both input and
output are delta tables.
3. Execute queries in incremental mode as much
as possible.
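One way to picture incremental mode is a delta-based PageRank, sketched below in Python. This is a common formulation and not necessarily the query OptIQ derives. The threshold `eps` plays the role of an optional filter and is an assumption, as are the function name and the sample graph. Dropping small changes trades a small error for less work.

```python
from collections import defaultdict

def incremental_pagerank(edges, damping=0.85, eps=1e-12, max_iters=1000):
    """Delta-based PageRank: only nodes whose score changed send updates."""
    nodes = sorted({n for e in edges for n in e})
    out_edges = defaultdict(list)
    for src, dest in edges:
        out_edges[src].append(dest)
    base = (1 - damping) / len(nodes)
    score = {n: base for n in nodes}
    delta = {n: base for n in nodes}          # delta table before iteration 1
    for _ in range(max_iters):
        new_delta = defaultdict(float)
        for src, change in delta.items():     # input: delta table
            for dest in out_edges[src]:
                new_delta[dest] += damping * change / len(out_edges[src])
        # Optional filter: drop changes too small to matter.
        delta = {n: c for n, c in new_delta.items() if abs(c) > eps}
        if not delta:                         # nothing changed: converged
            break
        for n, c in delta.items():            # output: delta applied to scores
            score[n] += c
    return score

# Hypothetical 3-node cycle: a -> b -> c -> a
scores = incremental_pagerank([("a", "b"), ("b", "c"), ("c", "a")])
```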
18
Delta table detection
 Delta table is detected easily, since we
have already identified variant views.
 ΔT = T’ – T,
 Update operations for update tables
 insertion
 deletion
 update
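As an illustration, the three kinds of change can be detected by comparing two versions of a keyed table. The Python sketch below models a table as a dict from key to value; `detect_delta` and the sample data are hypothetical.

```python
def detect_delta(old, new):
    """Compare two versions of a keyed table and classify the changes."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserted, deleted, updated

T = {"a": 0.30, "b": 0.30, "c": 0.40}
T_prime = {"a": 0.30, "b": 0.35, "d": 0.35}
inserted, deleted, updated = detect_delta(T, T_prime)
```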
19
Deriving incremental queries
 Extensive literature exists on incremental query
evaluation [9, 13, 19]
 We focus on incremental query
evaluation for update operations, since
they are frequent in iterative queries.
20
Deriving incremental queries
 Query:
where step query q, update table T, delta table ΔT,
terminate condition φ
 Suppose q is distributed:
We obtain incremental query:
where ψ is an optional filter
21
Distribution rules
 Rules for relational operators
 selection
 projection
 join
 group-by
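The rule bodies did not survive extraction. The Python sketch below only checks the well-known property they rely on, for the insertion case: each operator applied to T plus ΔT gives the same result as combining its results on T and on ΔT. Tables are modeled as lists of tuples (bags). The helper names and sample rows are hypothetical.

```python
from collections import Counter

def select(rows, pred):
    return [r for r in rows if pred(r)]

def project(rows, cols):
    return [tuple(r[c] for c in cols) for r in rows]

def join(left, right):
    """Equi-join on the first column of each side."""
    return [l + r[1:] for l in left for r in right if l[0] == r[0]]

def group_sum(rows):
    """Group by the first column and sum the second."""
    totals = Counter()
    for key, value in rows:
        totals[key] += value
    return totals

T = [("a", 1), ("b", 2)]
dT = [("a", 3), ("c", 4)]          # inserted tuples
S = [("a", "x"), ("c", "y")]       # read-only join partner

pred = lambda r: r[1] >= 2
sel_ok = sorted(select(T + dT, pred)) == sorted(select(T, pred) + select(dT, pred))
proj_ok = sorted(project(T + dT, [0])) == sorted(project(T, [0]) + project(dT, [0]))
join_ok = sorted(join(T + dT, S)) == sorted(join(T, S) + join(dT, S))

merged = group_sum(T)
merged.update(group_sum(dT))       # Counter.update adds the per-key sums
group_ok = group_sum(T + dT) == merged
```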
22
Example: PageRank
 Remember the query after lifting
 In algebraic form:
23
Example: PageRank
 This is re-written to:
24
Additional rules for group-by
 insertion/deletion rules for group-by
 sum, count: insertion and deletion
 max, min: only for insertion (not distributive for deletion)
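The difference between the two rule families can be seen in a few lines of Python. The helper names and sample values are hypothetical. A running sum or count can absorb deletions, while a running max cannot recover the next-largest value without rescanning the group.

```python
def apply_sum(total, inserted=(), deleted=()):
    """sum is maintainable under both insertion and deletion."""
    return total + sum(inserted) - sum(deleted)

def apply_count(count, inserted=(), deleted=()):
    """count is maintainable under both insertion and deletion."""
    return count + len(inserted) - len(deleted)

def apply_max_insert(current_max, inserted):
    """max is maintainable under insertion only."""
    return max([current_max, *inserted])

values = [3, 7, 5]
new_sum = apply_sum(sum(values), inserted=[4], deleted=[7])
new_count = apply_count(len(values), inserted=[4], deleted=[7])
new_max = apply_max_insert(max(values), inserted=[9])

# Deleting 7 from [3, 7, 5]: the old max alone cannot tell us the new
# max, so the remaining values have to be rescanned.
max_after_delete = max(v for v in values if v != 7)
```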
25
MapReduce implementation
 We extend Hive for OptIQ
 Iterative query processing
 convergence is tested by joining old and new
update tables
 View materialization
 partition invariant views by group-by/join keys
for efficient group-by/join operations
 Incrementalization
 apply incrementalization as much as possible
 delta table is kept on DFS
 Putting MR design patterns together
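The convergence test mentioned above can be sketched as a key join between the old and new update tables. This Python version models each table as a dict. The `tolerance` parameter and the function name are assumptions; the slides do not state the exact terminate condition.

```python
def has_converged(old_table, new_table, tolerance=1e-6):
    """Join old and new update tables on the key and compare the scores."""
    joined = [(key, old_table[key], new_table[key])
              for key in old_table.keys() & new_table.keys()]
    # Every row must find a join partner on both sides.
    same_keys = len(joined) == len(old_table) == len(new_table)
    return same_keys and all(abs(o - n) <= tolerance for _, o, n in joined)

stable = has_converged({"a": 0.5, "b": 0.5}, {"a": 0.5 + 1e-9, "b": 0.5})
moving = has_converged({"a": 0.5, "b": 0.5}, {"a": 0.6, "b": 0.4})
reshaped = has_converged({"a": 0.5}, {"b": 0.5})
```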
26
Experiments
 Purpose
 How effective is OptIQ for real analyses?
 How much error does incrementalization
introduce?
 Is OptIQ applicable to both MapReduce and Spark?
 Environment: 11 computers
 Workload
 Datasets: graph (Wikipedia, web graph),
multidimensional data (US census, mnist8m)
 Analysis: PageRank, RWR, k-means clustering
27
PageRank: performance
28
PageRank: convergence
29
k-means: performance
30
k-means: convergence
31
Related work
 Iterative MapReduce runtime system
 Twister: iterative MR computation
 Iterative MapReduce programming models
 HaLoop: manual view caching
 iMapReduce:
 Spark: in-memory cluster computing for iterative
applications, manual optimization for map-side join
 Pregel: Bulk synchronous parallel model
 GraphLab: Distributed graph computation model
 PEGASUS: matrix multiplication model on MapReduce
32
Related work cont.
 Declarative MapReduce programming
 HiveQL and Pig: SQL on MapReduce
 HadoopDB: integration of RDBMS and MapReduce
 MRQL: iterative query language, algebraic/MR-level
optimization; map fusion, join/group-by fusion
 Query optimization in MapReduce
 Comet: algebraic-level (shared selection, grouping,
time-spanned views) and MR-level sharing (shared
scan, shared shuffle)
 YSmart: sharing among group-by and joins
 REX: explicit incremental computation
33
Conclusion
 OptIQ is an optimization framework for iterative
queries with convergence property
 Two techniques for removing redundancy
 view materialization for invariant views
 incrementalization for variant views
 We implement OptIQ on Hive and Spark
 OptIQ improves performance by up to a factor
of five
34
Future work
 Apply OptIQ to other analyses: NMF, affinity
propagation, logistic regression
 adaptive and incremental evaluation techniques
for matrix computation, such as PageRank, NMF,
centrality computation
35

 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 

Optimization for iterative queries on MapReduce

  • 1. Makoto Onizuka, Hiroyuki Kato, Soichiro Hidaka, Keisuke Nakano, Zhenjiang Hu
  • 2. Demand for Big Data Analysis
      - Big Data Analysis
          - Cyber space: click log, query log
          - Real space: shopping log, sensing data
      - Machine learning
          - Algorithm: classification, clustering
          - Data type: relation, vector, graph, time series
      - Distributed computing framework
          - Interface: MPI, MapReduce, BSP (bulk synchronous parallel)
  • 3. Iterative analysis examples
      - Clustering
          - Partitioning: k-means, EM algorithm, affinity propagation
          - Hierarchical clustering: Ward's method, BIRCH
      - Matrix factorization
      - Graph mining
          - PageRank, random walk with restarts
  • 4. Running example: PageRank
      - This program is not efficient. Which parts?
          - The map function shuffles the whole graph structure in every iteration
          - Scores are computed even if the nodes have converged
  • 5. Issues for iterative analysis
      - How to optimize the program?
          - Reuse the intermediate (shuffled) data
          - Skip computing the scores of converged nodes
      - It is possible but difficult to remove these redundant computations manually
          - In fact, Spark, HaLoop, and REX force programmers to remove them manually
      - Our goal: automatically remove redundant computations for iterative queries
  • 6. Overview
      - OptIQ is a new optimization framework for iterative queries with a convergence property
          - Declarative high-level language; programmers are freed from the burden of removing redundancy
          - OptIQ integrates traditional optimization techniques from the database and compiler areas
      - Two techniques for removing redundancy
          - View materialization for invariant views
          - Incrementalization for variant views
      - We implement it on Hive and Spark
  • 7. Iterative query language
      - SQL extended with iteration
      - Syntax
      - Behavior
          - initialize: statements before iteration
          - The update table is updated by the step query repeatedly until convergence
          - return: statements after iteration
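
The behavior listed for the iterative query language can be sketched as a small driver loop. This is only an illustration in Python, not OptIQ's syntax or implementation: the function name `run_iterative_query`, its parameters, and the `max_iterations` safety bound are all hypothetical.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def run_iterative_query(
    initialize: Callable[[], T],
    step: Callable[[T], T],
    converged: Callable[[T, T], bool],
    max_iterations: int = 1000,
) -> T:
    """Apply `step` to the update table until `converged` holds."""
    table = initialize()              # "initialize": statements before iteration
    for _ in range(max_iterations):   # safety bound, an assumption of this sketch
        new_table = step(table)       # the step query produces the next update table
        if converged(table, new_table):
            return new_table          # "return": statements after iteration start here
        table = new_table
    return table

# Toy usage: halve every value until no value changes by 1e-6 or more.
result = run_iterative_query(
    initialize=lambda: {"a": 1.0, "b": 8.0},
    step=lambda t: {k: v / 2 for k, v in t.items()},
    converged=lambda old, new: all(abs(old[k] - new[k]) < 1e-6 for k in old),
)
```

The driver knows nothing about what the step query computes, which is what leaves room for an optimizer to rewrite the step query without touching the loop.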
  • 10. Query Optimization
      - Goal: remove redundant computations
      - Question: what is redundant computation?
          - Operations on unmodified attributes of tuples
          - Operations on attributes of unmodified tuples
      - OptIQ reuses partial results of step queries
          - View materialization reuses operations on unmodified attributes
          - Incrementalization reuses operations on unmodified tuples
  • 12. View materialization
      - Purpose is to reuse unmodified attributes of the update table during iterations
      - Procedure
          1. Decompose the update table into variant and invariant tables by conservative analysis
          2. Materialize each sub-query in the step query that only accesses the invariant table
          3. Rewrite the step query to use the materialized view, so that query processing uses the view
  • 13. Table decomposition
      - Discriminate modified/unmodified attributes
          - Unmodified attributes: src, dest in Graph
          - Modified attribute: score in Graph
      - Decompose update table
          - Graph’ = select VT.src, IT.dest, VT.score from VT, IT where VT.src = IT.src
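
The decomposition can be tried on a toy `Graph` table. The sketch below uses Python's standard `sqlite3` module rather than Hive, with made-up rows, and it assumes that `score` is determined by `src` alone (every edge leaving the same node carries the same score). The table names `IT` and `VT` follow the slide; everything else is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Graph (src INTEGER, dest INTEGER, score REAL);
    INSERT INTO Graph VALUES (1, 2, 0.5), (1, 3, 0.5), (2, 3, 0.3), (3, 1, 0.2);

    -- Invariant table: attributes the step query never modifies.
    CREATE TABLE IT AS SELECT src, dest FROM Graph;
    -- Variant table: the attribute that changes between iterations.
    CREATE TABLE VT AS SELECT DISTINCT src, score FROM Graph;
""")

original = conn.execute(
    "SELECT src, dest, score FROM Graph ORDER BY src, dest"
).fetchall()

# Graph' from the slide: joining VT and IT on src rebuilds the update table.
rebuilt = conn.execute("""
    SELECT VT.src, IT.dest, VT.score
    FROM VT, IT
    WHERE VT.src = IT.src
    ORDER BY VT.src, IT.dest
""").fetchall()
```

Here `rebuilt` holds the same rows as `original`, so only the small `VT` table has to be rewritten in each iteration while `IT` stays fixed.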
  • 14. Example: PageRank
      - Table decomposition
      - Remove Graph’ table from query
      - Figure labels: discriminate, simplification
  • 15. Subquery lifting
      - Construct read-only (invariant) views accessed by step queries
          - Extract loop-invariant computations by using unmodified attributes
      - Procedure
          1. Constant let statement lifting (to initialize clause)
          2. Invariant subquery lifting (to initialize clause)
          3. Common subquery elimination with query rewrite, unnesting, identity query elimination
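
The effect of lifting an invariant subquery out of the loop can be illustrated with a hand-written PageRank in Python. This is a sketch of the idea only, not the query OptIQ generates: the damping factor 0.85, the tolerance, and the function name `pagerank` are assumptions, and nodes without outgoing edges are not handled.

```python
from collections import defaultdict

def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    nodes = sorted({n for edge in edges for n in edge})

    # Lifted out of the loop: these views depend only on the invariant
    # attributes (src, dest), so they are computed once.
    out_degree = defaultdict(int)
    incoming = defaultdict(list)
    for src, dest in edges:
        out_degree[src] += 1
        incoming[dest].append(src)

    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Only the variant attribute (score) is recomputed per iteration.
        new_score = {
            n: (1 - damping) / len(nodes)
            + damping * sum(score[s] / out_degree[s] for s in incoming[n])
            for n in nodes
        }
        if max(abs(new_score[n] - score[n]) for n in nodes) < tol:
            return new_score
        score = new_score
    return score

# A three-node cycle: by symmetry every node ends up with the same rank.
ranks = pagerank([(1, 2), (2, 3), (3, 1)])
```

Computing `out_degree` and `incoming` inside the loop would give the same ranks but would repeat the group-by over the edge list in every iteration.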
  • 16. Example: PageRank
      - Figure labels: invariant subquery lifting; identity query elimination, VT = Score
  • 17. Example: k-means
      - Figure labels: table decomposition, query elimination (for VT), simplification
  • 18. Automatic incrementalization
      - Not all records are updated in iterations. Purpose is to reuse unmodified tuples in variant views.
      - Procedure
          1. Detect the delta table between iterations before starting the 1st iteration.
          2. Derive incremental queries. Both input and output are delta tables.
          3. Execute queries in incremental mode as much as possible.
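
To make "both input and output are delta tables" concrete, here is a minimal Python sketch for one aggregate: the total incoming score per destination node. The function names, the `(old, new)` encoding of a delta row, and the toy data are assumptions made for illustration, not OptIQ's representation.

```python
def full_query(score, edges):
    """Non-incremental step: total incoming score per destination."""
    total = {}
    for src, dest in edges:
        total[dest] = total.get(dest, 0.0) + score[src]
    return total

def incremental_query(delta_score, edges_by_src):
    """Incremental step: maps a delta on `score` to a delta on the totals."""
    delta_total = {}
    for src, (old, new) in delta_score.items():
        for dest in edges_by_src.get(src, []):
            delta_total[dest] = delta_total.get(dest, 0.0) + (new - old)
    return delta_total

edges = [(1, 2), (1, 3), (2, 3), (3, 1)]
edges_by_src = {1: [2, 3], 2: [3], 3: [1]}
score = {1: 0.5, 2: 0.25, 3: 0.25}
total = full_query(score, edges)

# Only node 2 changed, so only the totals it feeds into are touched.
delta_score = {2: (0.25, 0.75)}
for dest, change in incremental_query(delta_score, edges_by_src).items():
    total[dest] += change
score[2] = 0.75
```

After applying the delta, `total` equals what `full_query` would return on the updated scores, but the work done is proportional to the number of changed tuples rather than to the size of the graph.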
  • 19. Delta table detection
      - The delta table is detected easily, since we have already identified variant views.
          - ΔT = T’ − T
      - Update operations for update tables
          - Insertion
          - Deletion
          - Update
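
A delta between two versions of a keyed table can be split into the three update kinds listed above. The Python sketch below assumes tables are plain dictionaries from key to value. The helper name `delta` and the sample scores are hypothetical.

```python
def delta(old, new):
    """Split the difference between two keyed tables into the three update kinds."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserted, deleted, updated

old_scores = {"a": 0.25, "b": 0.25, "c": 0.5}
new_scores = {"a": 0.25, "b": 0.5, "d": 0.125}
inserted, deleted, updated = delta(old_scores, new_scores)
```

Row `a` is unchanged and appears in none of the three parts, which is exactly the work an incremental query gets to skip.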
  • 20. Deriving incremental queries
      - There is a large body of literature on incremental query evaluation [9, 13, 19]
      - We focus on incremental query evaluation for update operations, since they are frequent in iterative queries.
  • 21. Deriving incremental queries
      - Query: where q is the step query, T the update table, ΔT the delta table, and φ the termination condition
      - Suppose q is distributed:
      - We obtain the incremental query: where ψ is an optional filter
  • 22. Distribution rules
      - Rules for relational operators
          - Selection
          - Projection
          - Join
          - Group-by
  • 23. Example: PageRank
      - Remember the query after lifting
      - In algebraic form:
  • 24. Example: PageRank
      - This is rewritten to:
  • 25. Additional rules for group-by
      - Insertion/deletion rules for group-by
          - sum, count: insertion and deletion
          - max, min: only for insertion (not distributive for deletion)
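
The asymmetry between sum/count and max/min can be seen in a small Python sketch. The class below is hypothetical and only illustrates the rule. When a deleted value might have been the maximum, it marks the group as stale instead of guessing a new maximum, which is a conservative choice made for this example.

```python
class IncrementalGroupBy:
    """Maintain per-group sum, count and max under insertions and deletions."""

    def __init__(self):
        self.sum = {}
        self.count = {}
        self.max = {}
        self.stale_max = set()   # groups whose max would need a full rescan

    def insert(self, key, value):
        self.sum[key] = self.sum.get(key, 0) + value
        self.count[key] = self.count.get(key, 0) + 1
        # max absorbs an insertion by comparing with the old maximum,
        # but only if the old maximum is still known.
        if key not in self.stale_max:
            self.max[key] = max(self.max.get(key, value), value)

    def delete(self, key, value):
        # sum and count can be corrected from the deleted value alone.
        self.sum[key] -= value
        self.count[key] -= 1
        # max cannot: if the deleted value was the maximum, the new maximum
        # is unknown without looking at the remaining rows of the group.
        if self.max.get(key) == value:
            del self.max[key]
            self.stale_max.add(key)

agg = IncrementalGroupBy()
for key, value in [("x", 3), ("x", 5), ("y", 2)]:
    agg.insert(key, value)
agg.delete("x", 5)
```

After the deletion, the sum and count for group `x` are still exact, while its maximum has to be recomputed from the base data.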
  • 26. MapReduce implementation
      - We extend Hive for OptIQ
      - Iterative query processing
          - Convergence is tested by joining old and new update tables
      - View materialization
          - Partition invariant views by group-by/join keys for efficient group-by/join operations
      - Incrementalization
          - Apply incrementalization as much as possible
          - Delta table is kept on DFS
      - Putting MR design patterns together
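
A convergence test that joins the old and new update tables might look like the following. The sketch uses Python's standard `sqlite3` module instead of Hive. The table names `OldScore` and `NewScore`, the threshold `epsilon`, and the "no row moved by more than epsilon" criterion are all assumptions, since the slides do not state the exact termination condition.

```python
import sqlite3

def has_converged(conn, epsilon=1e-6):
    """True when no joined row differs by more than epsilon."""
    (changed,) = conn.execute(
        """
        SELECT COUNT(*)
        FROM OldScore AS o JOIN NewScore AS n ON o.src = n.src
        WHERE ABS(o.score - n.score) > ?
        """,
        (epsilon,),
    ).fetchone()
    return changed == 0

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OldScore (src INTEGER PRIMARY KEY, score REAL);
    CREATE TABLE NewScore (src INTEGER PRIMARY KEY, score REAL);
    INSERT INTO OldScore VALUES (1, 0.40), (2, 0.60);
    INSERT INTO NewScore VALUES (1, 0.40), (2, 0.55);
""")

first_check = has_converged(conn)    # node 2 still moves by 0.05
conn.execute("UPDATE OldScore SET score = 0.55 WHERE src = 2")
second_check = has_converged(conn)   # old and new tables now agree
```

The same join also yields the rows that changed, so in principle it can double as the source of the delta table for the next iteration.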
  • 27. Experiments
      - Purpose
          - How effective is OptIQ for real analysis?
          - How much error is caused by incrementalization?
          - Is OptIQ applicable to both MapReduce and Spark?
      - Environment: 11 computers
      - Workload
          - Datasets: graph (Wikipedia, web graph), multidimensional data (US census, mnist8m)
          - Analysis: PageRank, RWR, k-means clustering
  • 32. Related work
      - Iterative MapReduce runtime system
          - Twister: iterative MR computation
      - Iterative MapReduce programming models
          - HaLoop: manual view caching
          - iMapReduce
          - Spark: in-memory cluster computing for iterative applications, manual optimization for map-side join
          - Pregel: bulk synchronous parallel model
          - GraphLab: distributed graph computation model
          - PEGASUS: matrix multiplication model on MapReduce
  • 33. Related work cont.
      - Declarative MapReduce programming
          - HiveQL and Pig: SQL on MapReduce
          - HadoopDB: integration of RDBMS and MapReduce
          - MRQL: iterative query language, algebraic/MR-level optimization; map fusion, join/group-by fusion
      - Query optimization in MapReduce
          - Comet: algebraic-level sharing (shared selection, grouping, time-spanned views) and MR-level sharing (shared scan, shared shuffle)
          - YSmart: sharing among group-by and joins
          - REX: explicit incremental computation
  • 34. Conclusion
      - OptIQ is an optimization framework for iterative queries with a convergence property
      - Two techniques for removing redundancy
          - View materialization for invariant views
          - Incrementalization for variant views
      - We implement it on Hive and Spark
      - OptIQ improves performance by up to five times
  • 35. Future work
      - Apply OptIQ to other analyses: NMF, affinity propagation, logistic regression
      - Adaptive and incremental evaluation techniques for matrix computation, such as PageRank, NMF, centrality computation