Presentation given at VLDB 2016: the 42nd International Conference on Very Large Data Bases.
Paper: http://dx.doi.org/10.14778/2977797.2977800
ArXiv: https://arxiv.org/abs/1605.05219
Poster: https://zenodo.org/record/61653 (doi 10.5281/zenodo.61653)
Gumbo Software: https://github.com/JonnyDaenen/Gumbo
Abstract
While services such as Amazon AWS make computing power abundantly available, adding more computing nodes can incur high costs in, for instance, pay-as-you-go plans while not always significantly improving the net running time (aka wall-clock time) of queries. In this work, we provide algorithms for parallel evaluation of SGF queries in MapReduce that optimize total time, while retaining low net time. Not only can SGF queries specify all semi-join reducers, but also more expressive queries involving disjunction and negation. Since SGF queries can be seen as Boolean combinations of (potentially nested) semi-joins, we introduce a novel multi-semi-join (MSJ) MapReduce operator that enables the evaluation of a set of semi-joins in one job. We use this operator to obtain parallel query plans for SGF queries that outvalue sequential plans w.r.t. net time and provide additional optimizations aimed at minimizing total time without severely affecting net time. Even though the latter optimizations are NP-hard, we present effective greedy algorithms. Our experiments, conducted using our own implementation Gumbo on top of Hadoop, confirm the usefulness of parallel query plans, and the effectiveness and scalability of our optimizations, all with a significant improvement over Pig and Hive.
1. Parallel Evaluation of Multi-Semi-Joins
Jonny Daenen | Hasselt University
Frank Neven | Hasselt University
Tony Tan | National Taiwan University
Stijn Vansummeren | Université Libre de Bruxelles
2. Main Contribution
• Parallel evaluation (MapReduce)
• Strictly Guarded Fragment
• 2-round eval: low net time (elapsed)
• 2-tier optimization: total time (cost)
8. SGF Query
SELECT x
FROM R(x,y)                              ← guard atom
WHERE [S(x,y) AND T(x)] OR NOT U(y,x)    ← Boolean combination of conditional atoms
A conditional atom such as S(x,y) tests membership: (x,y) IN S.
• Basic (BSGF): no nesting
• Semi-join algebra
9. Example
Example 2. Let Amaz, BN, and BD be relations containing tuples (title, author, rating) corresponding to the books found at Amazon, Barnes and Noble, and Book Depository, respectively. Let Upcoming contain tuples (newtitle, author) of upcoming books. The following query selects all the upcoming books (newtitle, author) of authors that have not yet received a "bad" rating for the same title at all three book retailers; Z2 is the output relation:
Z1 := SELECT aut FROM Amaz(ttl, aut, "bad")
      WHERE BN(ttl, aut, "bad") AND BD(ttl, aut, "bad");
Z2 := SELECT (new, aut) FROM Upcoming(new, aut)
      WHERE NOT Z1(aut);
Note that this query cannot be written as a basic SGF query, since the atoms in the query computing Z1 must share the ttl variable, which is not present in the guard of the query computing Z2. ✷
12. MapReduce (in Hadoop)
• Map creates key-value pairs
• Reducer combines values per key
• Cost model (Wang et al., 2013)
[Figure: Map → Shuffle → Reduce data flow across four partitions, from map input/output through to reduce input/output]
Let $I_1 \cup \cdots \cup I_k$ denote the partition of the input tuples such that the mapper behaves uniformly on every data item in $I_i$. Let $N_i$ be the size (in MB) of $I_i$, and let $M_i$ be the size (in MB) of the intermediate data output by the mapper on $I_i$. The cost of the map phase on $I_i$ is:
$$cost_{map}(N_i, M_i) = h_r N_i + merge_{map}(M_i) + l_w M_i,$$
where $merge_{map}(M_i)$, denoting the cost of sort and merge in the map stage, is expressed by
$$merge_{map}(M_i) = (l_r + l_w)\, M_i \log_D \frac{M_i/m_i}{buf_{map}}.$$
See Table 1 in the paper for the meaning of the variables $h_r$, $l_r$, $l_w$, $D$, $M_i$, $m_i$, and $buf_{map}$. The total cost incurred in the map phase equals the sum
$$\sum_{i=1}^{k} cost_{map}(N_i, M_i). \tag{2}$$
Note that the cost model in [27, 36] defines the total cost incurred in the map phase as
$$cost_{map}\Big(\sum_{i=1}^{k} N_i,\ \sum_{i=1}^{k} M_i\Big). \tag{3}$$
The latter is not always accurate. Indeed, consider for instance an MR job whose input consists of two relations R and S where the map function outputs many key-value pairs for each tuple in R and at most one key-value pair for each tuple in S, e.g., because of filtering. This difference in map output may lead to a non-proportional contribution of both input relations to the total cost. Hence, as shown by Equation (2), we opt to consider different inputs separately. This cannot be captured by the map cost calculation of Equation (3), as it considers the global average map output size in the calculation of the merge cost. In Section 5, we illustrate this problem by means of an experiment that confirms the effectiveness of the proposed adjustment.
To analyze the cost in the reduce phase, let $M = \sum_{i=1}^{k} M_i$. The reduce stage involves (i) transferring the intermediate data (i.e., the output of the map function) to the correct reducer, (ii) merging the key-value pairs locally for each reducer, (iii) applying the reduce function, and (iv) writing the output to HDFS. Its cost will be
$$cost_{red}(M, K) = t M + merge_{red}(M) + h_w K,$$
where K is the size of the output of the reduce function (in MB). The cost of merging equals
$$merge_{red}(M) = (l_r + l_w)\, M \log_D \frac{M/r}{buf_{red}}.$$
The total cost of an MR job equals the sum
$$cost_h + \sum_{i=1}^{k} cost_{map}(N_i, M_i) + cost_{red}(M, K),$$
where $cost_h$ is the overhead cost of starting an MR job.
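For intuition, here is a minimal Python sketch of this per-input cost model. Parameter names follow the formulas above; treating m_i (map tasks per input) and r (reduce tasks) as plain arguments is a simplification, and none of the constants here are the calibrated values from Table 1 of the paper.

```python
import math

def merge_cost(m_size, num_tasks, buf, l_r, l_w, D):
    """Sort/merge cost of m_size MB spread over num_tasks tasks,
    with sort buffer `buf` MB and merge fan-in D: (l_r+l_w)*M*log_D((M/m)/buf)."""
    per_task = m_size / num_tasks
    passes = max(math.log(per_task / buf, D), 0.0)  # clamp: data fitting in buffer needs no merge pass
    return (l_r + l_w) * m_size * passes

def job_cost(inputs, K, r, *, h_r, h_w, l_r, l_w, t, D,
             buf_map, buf_red, cost_h, m_i):
    """Total MR-job cost: per-input map costs (Equation (2)) + reduce cost.
    `inputs` is a list of (N_i, M_i) pairs in MB; K is reduce output size in MB."""
    map_cost = sum(h_r * N + merge_cost(M, m_i, buf_map, l_r, l_w, D) + l_w * M
                   for N, M in inputs)
    M_total = sum(M for _, M in inputs)
    red_cost = (t * M_total
                + merge_cost(M_total, r, buf_red, l_r, l_w, D)
                + h_w * K)
    return cost_h + map_cost + red_cost
```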
The atoms occurring in C are called the conditional atoms. We interpret Z as the output relation of the query: on a database DB, the BSGF query (1) defines a new relation Z containing all tuples ā for which there is a substitution σ of the variables occurring in t̄ such that σ(x̄) = ā, and C evaluates to true in DB under substitution σ. The evaluation of C in DB under σ is defined inductively on the structure of C. If C is C1 OR C2, C1 AND C2, or NOT C1, the semantics is the usual Boolean one. If C is an atom T(v̄), then C evaluates to true if there exists a T-atom in DB that agrees with σ(t̄) on those positions where R(t̄) and T(v̄) share variables.
For example, the intersection Z1 := R ∩ S and the difference Z1 := R − S between two relations R and S are expressed as follows:
Z1 := SELECT x̄ FROM R(x̄) WHERE S(x̄);
Z1 := SELECT x̄ FROM R(x̄) WHERE NOT S(x̄);
Figure 1: A depiction of the inner workings of Hadoop MR. Map phase: read → map → sort → merge; reduce phase: transfer → merge → reduce → write, taking input through intermediate data to output.
Remark 1. The syntax we use here differs from the traditional syntax of the Guarded Fragment [20], and is actually closer in spirit to join trees for acyclic conjunctive queries [9, 11], although we do allow disjunction and negation in the WHERE clause. In the traditional syntax, a projection in the guarded fragment is only allowed in the form ∃w̄ R(x̄) ∧ ϕ(z̄) where all variables in z̄ must occur in x̄. One can obtain a query in the traditional syntax of the guarded fragment from our syntax by adding extra projections for the atoms in C. For example,
SELECT x FROM R(x, y) WHERE S(x, z1) AND NOT S(y, z2)
18–22. Semi-join in MapReduce
A = SELECT x,y FROM R(x,y) WHERE S(y,z)
R = {(0,1), (1,2), (8,1)}    S = {(1,2), (1,10), (3,8)}
• Map: each R-tuple is emitted under its join key with the message "I REQUEST this S-tuple!"; each S-tuple is emitted under its join key with the message "I ASSERT that an S-tuple exists!"
• Reduce: when a REQUEST meets an ASSERT under the same key, we have a match: output A(0,1).
• Key-value pair layout:
  1 | ASSERT S(y,z)              (join key, atom)
  1 | REQUEST S(y,z) | R(0,1)    (join key, atom, origin)
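To make the REQUEST/ASSERT flow concrete, here is a toy single-process Python simulation of this one-round semi-join job; Gumbo's real implementation runs on Hadoop and adds the optimizations shown later.

```python
from collections import defaultdict

# A = SELECT x,y FROM R(x,y) WHERE S(y,z), on the slide's sample data.
R = [(0, 1), (1, 2), (8, 1)]
S = [(1, 2), (1, 10), (3, 8)]

def map_phase():
    # R-tuples REQUEST a matching S-tuple, keyed on the join key y;
    # S-tuples ASSERT their existence under the same key.
    for x, y in R:
        yield y, ("REQUEST", (x, y))
    for y, z in S:
        yield y, ("ASSERT", None)

def shuffle(pairs):
    # Group all messages by join key, as Hadoop's shuffle would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # A REQUEST meeting at least one ASSERT on its key is a match.
    for values in groups.values():
        if any(tag == "ASSERT" for tag, _ in values):
            for tag, origin in values:
                if tag == "REQUEST":
                    yield origin

print(sorted(reduce_phase(shuffle(map_phase()))))  # [(0, 1), (8, 1)]
```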
26. BSGF Queries
Consider the following SGF query Q:
SELECT x,y FROM R(x,y)
WHERE [S(x,y) OR S(y,x)] AND T(x,z)
Intuitively, this query asks for all pairs (x, y) in R for which there exists some z such that (1) (x, y) or (y, x) occurs in S and (2) (x, z) occurs in T. To evaluate Q it suffices to compute the following semi-joins:
X1 := R(x,y) ⋉ S(x,y), i.e., X1 = SELECT x,y FROM R(x,y) WHERE S(x,y)
X2 := R(x,y) ⋉ S(y,x), i.e., X2 = SELECT x,y FROM R(x,y) WHERE S(y,x)
X3 := R(x,y) ⋉ T(x,z), i.e., X3 = SELECT x,y FROM R(x,y) WHERE T(x,z)
store the results in the binary relations X1, X2, and X3, and subsequently compute the Boolean combination
ϕ := (X1 ∪ X2) ∩ X3, i.e., SELECT x,y FROM R(x,y) WHERE [X1(x,y) OR X2(x,y)] AND X3(x,y)
Our multi-semi-join operator MSJ (defined in Section 4.2 of the paper) takes a number of semi-join equations as input and exploits commonalities between them to optimize evaluation. In our framework, a possible query plan for query Q is of the form:
EVAL(R, ϕ)
MSJ(X1, X2)    MSJ(X3)
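A toy rendering of this two-round plan in Python, with in-memory sets standing in for HDFS relations and made-up sample data; it mirrors the plan's structure (MSJ round, then EVAL round), not Gumbo's actual Hadoop jobs.

```python
# Made-up sample data; sets stand in for HDFS relations.
R = {(0, 1), (1, 2), (8, 1)}
S = {(0, 1), (1, 0), (2, 5)}
T = {(0, 7), (3, 4)}

# Round 1, MSJ(X1, X2) and MSJ(X3): compute the semi-joins.
X1 = {(x, y) for (x, y) in R if (x, y) in S}   # R(x,y) ⋉ S(x,y)
X2 = {(x, y) for (x, y) in R if (y, x) in S}   # R(x,y) ⋉ S(y,x)
T_first = {x for (x, z) in T}
X3 = {(x, y) for (x, y) in R if x in T_first}  # R(x,y) ⋉ T(x,z)

# Round 2, EVAL(R, phi): the Boolean combination phi = (X1 ∪ X2) ∩ X3.
Q = (X1 | X2) & X3
print(sorted(Q))  # [(0, 1)]
```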
31–33. Some Optimizations
• Tuple IDs & Atom IDs
  – refer to a tuple by its address: file id + byte offset
  – only the key contains actual values:
    1 REQUEST S(y,z) R(0,1)   becomes   1 T.1 A.1 #1.043
• Packing: combine multiple atoms into one message
    1 REQUEST S(y,z) R(0,1)
    1 REQUEST T(x,z) R(0,1)   become    1 REQUEST S(y,z) T(x,z) R(0,1)
    1 ASSERT S(z,y)
    1 ASSERT S(y,z)           become    1 ASSERT S(y,z) S(z,y)
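An illustrative Python sketch of the packing idea: messages sharing a join key, message type, and origin are merged into one message carrying all atoms. The tuple layout here is made up for readability; Gumbo's actual encoding also replaces tuples and atoms by IDs as shown above.

```python
from collections import defaultdict

def pack(messages):
    """Merge messages with the same (join key, type, origin) into a
    single message listing all their atoms (illustrative sketch)."""
    packed = defaultdict(list)
    for key, kind, atom, origin in messages:
        packed[(key, kind, origin)].append(atom)
    return [(key, kind, atoms, origin)
            for (key, kind, origin), atoms in packed.items()]

msgs = [(1, "REQUEST", "S(y,z)", "R(0,1)"),
        (1, "REQUEST", "T(x,z)", "R(0,1)"),
        (1, "ASSERT", "S(z,y)", None),
        (1, "ASSERT", "S(y,z)", None)]
for m in pack(msgs):
    print(m)
# (1, 'REQUEST', ['S(y,z)', 'T(x,z)'], 'R(0,1)')
# (1, 'ASSERT', ['S(z,y)', 'S(y,z)'], None)
```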
34. Optimal Plan (BSGF)
Consider the query Z := X1 ∧ (X2 ∨ ¬X3) with
X1 := R(x,y) ⋉ S(x,z)
X2 := R(x,y) ⋉ T(y)
X3 := R(x,y) ⋉ U(x)
Figure 4.5: Some MapReduce query plan alternatives for the query given in Example 4.13:
(a) EVAL over three separate jobs MSJ(X1), MSJ(X2), MSJ(X3)
(b) EVAL over two jobs MSJ(X1, X3) and MSJ(X2)
(c) EVAL over a single job MSJ(X1, X2, X3)
The task is to find a grouping such that its total cost as computed in Equation (4.12) is minimal. The Scan-Shared Optimal Grouping problem, which is known to be NP-hard, is reducible to this problem (see Nykiel et al. [139]). Stating it formally, we have:
Theorem 4.14. The decision variant of BSGF-Opt is NP-complete.
35. Optimal MSJ Grouping
• Nykiel et al., 2010 & Wang et al., 2013
• Cost model
• NP-hard
• Greedy approach (see the sketch after the matrices):
  – build a matrix of pairwise grouping gains (from the cost model)
  – group the current best option
  – repeat
Initial matrix (positive entries = gain of grouping the pair):
            R,S(x)   R,T(x)   G,S(x)   G,T(y)
  R,S(x)      0      +200     +300     -200
  R,T(x)               0      -100     +200
  G,S(x)                        0       -50
  G,T(y)                                  0
After grouping the best pair, {R,S(x), G,S(x)} (+300):
                     {R,S(x),G,S(x)}   R,T(x)   G,T(y)
  {R,S(x),G,S(x)}          0           +100     -200
  R,T(x)                                 0      +100
  G,T(y)                                          0
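A minimal Python sketch of this greedy loop. The `benefit(g1, g2)` function is assumed given (it would estimate, from the cost model, the cost saved by evaluating the semi-joins of both groups in a single MSJ job); the real algorithm recomputes the matrix entries against the newly merged groups, just as this loop does implicitly.

```python
def greedy_grouping(units, benefit):
    """Repeatedly merge the pair of groups with the largest positive
    benefit until no merge improves the total cost."""
    groups = [frozenset([u]) for u in units]
    while True:
        best_gain, best_pair = 0, None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                gain = benefit(groups[i], groups[j])
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_pair is None:  # no positive-gain merge remains
            return groups
        i, j = best_pair
        merged = groups[i] | groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
```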
36. Multiple BSGF
• Set of batch queries
• Same approach applies
• More grouping possibilities
• EVAL remains a single job
38–40. SGF Queries
Subqueries Z ⋉ I, H ⋉ Q, R ⋉ S ∧ T, and G ⋉ S ∧ T are spread over levels 1–3
(1 level = one MSJ+EVAL round; BSGF greedy grouping is applied per level).
Move? Moving G ⋉ S ∧ T to the level of R ⋉ S ∧ T creates a grouping opportunity:
Level 1: R ⋉ S ∧ T, G ⋉ S ∧ T
Level 2: H ⋉ Q
Level 3: Z ⋉ I
41. Optimal Plan (SGF)
• Assign a level to each BSGF subquery
• Dependencies must be respected
• Optimize for greedy BSGF grouping (MSJ)
• NP-hard
• Greedy overlap algorithm
Level 1: R ⋉ S ∧ T, G ⋉ S ∧ T
Level 2: H ⋉ Q
Level 3: Z ⋉ I
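A sketch of the shape of this idea in Python, under stated assumptions: subqueries arrive topologically sorted with respect to their dependencies, and an `overlap(q, group)` score (shared guards/inputs, derived from the cost model) is given. The paper's greedy overlap algorithm is more refined; this only illustrates "place each subquery after its dependencies, at the level where it overlaps most".

```python
def assign_levels(queries, deps, overlap):
    """Greedy level assignment: each subquery must run strictly after
    its dependencies; within that constraint, place it at the level
    where it overlaps most with already-placed queries.
    deps maps a query to the subqueries it reads from."""
    levels = {}   # query -> assigned level
    grouped = {}  # level -> list of queries at that level
    for q in queries:  # assumed topologically sorted w.r.t. deps
        lo = 1 + max((levels[d] for d in deps.get(q, [])), default=0)
        hi = max(max(grouped, default=0), lo)
        best = max(range(lo, hi + 1),
                   key=lambda l: overlap(q, grouped.get(l, [])))
        levels[q] = best
        grouped.setdefault(best, []).append(q)
    return grouped
```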
44. BSGF Queries
QID  Query                                               Type of query
A1   R(x,y,z,w) WHERE S(x) ∧ T(y) ∧ U(z) ∧ V(w)          guard sharing
A2   R(x,y,z,w) WHERE S(x) ∧ S(y) ∧ S(z) ∧ S(w)          guard & conditional name sharing
A3   R(x,y,z,w) WHERE S(x) ∧ T(x) ∧ U(x) ∧ V(x)          guard & conditional key sharing
A4   R(x,y,z,w) WHERE S(x) ∧ T(y) ∧ U(z) ∧ V(w);         no sharing
     G(x,y,z,w) WHERE W(x) ∧ X(y) ∧ Y(z) ∧ Z(w)
A5   R(x,y,z,w) WHERE S(x) ∧ T(y) ∧ U(z) ∧ V(w);         conditional name sharing
     G(x,y,z,w) WHERE S(x) ∧ T(y) ∧ U(z) ∧ V(w)
B1   R(x,y,z,w) WHERE S(x) ∧ T(x) ∧ U(x) ∧ V(x) ∧        large conjunctive query
     S(y) ∧ T(y) ∧ U(y) ∧ V(y) ∧
     S(z) ∧ T(z) ∧ U(z) ∧ V(z) ∧
     S(w) ∧ T(w) ∧ U(w) ∧ V(w)
45. BSGF Strategies
[Figure: three evaluation strategies compared]
• Sequential: a chain of semi-join jobs, e.g., R ⋉ S = R1, then R1 ⋉ T = R2, R2 ⋉ U = R3, R3 ⋉ V = R4
• Parallel: four independent jobs R ⋉ S, R ⋉ T, R ⋉ U, R ⋉ V producing R1, R2, R3, R4
• Greedy Grouping: the same four semi-joins grouped into fewer MSJ jobs
50. SGF Queries
B2   R(x,y,z,w) WHERE                                    uniqueness query
     (S(x) ∧ ¬T(x) ∧ ¬U(x) ∧ ¬V(x)) ∨
     (¬S(x) ∧ T(x) ∧ ¬U(x) ∧ ¬V(x)) ∨
     (¬S(x) ∧ ¬T(x) ∧ U(x) ∧ ¬V(x)) ∨
     (¬S(x) ∧ ¬T(x) ∧ ¬U(x) ∧ V(x))
Table 2: Queries used in the BSGF experiment.
… restricted to only disjunction and negation. The same optimization also works for multiple BSGF queries. We refer to these programs as 1-ROUND below.
60. Future Work
• Skew
• Spark: update cost model
• Study optimization effects
• Rules for subclasses
• Incorporate into existing systems
• Get Gumbo on GitHub!
• More details? poster session tomorrow @2PM!