Processing Theta Joins
using MapReduce
by Minsub Yim
Processing pipeline at a reducer

Goal: We want to minimize job completion time. Since it is a function of both input and output, we need a way to model both the inputs and the outputs of a reducer.

[ Pipeline ] Mapper output → Receive mapper output → Sort input by key → Read input → Run join algorithm → Send join output → Join output

Receiving, sorting, and reading take time = f(input size); producing and sending the join result takes time = f(output size).
Theta Join Model

Dataset S          Dataset T
S_id  Value        T_id  Value
1     5            1     5
2     6            2     5
3     6            3     6
4     8            4     8
5     8            5     8
6     10           6     10

Assuming join condition: S.value = T.value

[ Join Matrix M ] Rows are the S tuples (values 5, 6, 6, 8, 8, 10) and columns are the T tuples (values 5, 5, 6, 8, 8, 10); a cell is marked when the corresponding (S, T) pair satisfies the join condition.
Theta Join Model (Examples)

Using the same S (rows: 5, 6, 6, 8, 8, 10) and T (columns: 5, 5, 6, 8, 8, 10), different join conditions mark different cells of the join matrix:

• S.value <= T.value: all cells on or above the value diagonal are marked.
• abs(S.value - T.value) < 2: a band of cells around the diagonal is marked.
• S.value = T.value: only the cells with matching values are marked.
Goal Revisited

• We want to minimize job completion time
• We need to assign every true cell to exactly one reducer (find a mapping from M to R)
• Goal: Find a mapping from the join matrix M to reducers that minimizes job completion time
Mappings from join matrix to reducers

Join condition: S.value = T.value, with 4 reducers (1)-(4).

Standard equi-join algorithm (each group of equal values goes to one reducer):
[R1] Input: S1, T1, T2            Output: 2 tuples
[R2] Input: S2, S3, T3            Output: 2 tuples
[R3] Input: S4, S5, T4, T5        Output: 4 tuples
[R4] Input: S6, T6                Output: 1 tuple
Max-Reducer-Input (MRI): 4        Max-Reducer-Output (MRO): 4

Random mapping (cells scattered across the 4 reducers):
[R1] Input: S1, S4, S5, T1, T4, T5    Output: 3 tuples
[R2] Input: S2, S4, T3, T5            Output: 2 tuples
[R3] Input: S1, S5, T2, T4            Output: 2 tuples
[R4] Input: S3, S6, T3, T6            Output: 2 tuples
MRI: 6                                MRO: 3
Mappings from join matrix to reducers

Join condition: S.value = T.value, with 3 reducers (1)-(3):
[R1] Input: S1, S2, T1, T2            Output: 2 tuples
[R2] Input: S3, S4, T1, T2, T3        Output: 2 tuples
[R3] Input: S4, S5, S6, T4, T5, T6    Output: 5 tuples
Max-Reducer-Input: 6                  Max-Reducer-Output: 5
Mappings from join matrix to
reducers
• We see there can be many possible mappings from the join matrix to reducers
• We will see, in different cases, which mapping is (close to) optimal, and algorithms to compute such mappings.
Lemma

We will use the following lemma repeatedly to show how close to optimal each mapping is.

[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least 2√c input tuples.

[ Proof ] Consider a reducer r that receives m records from T and n records from S. The cells it covers fit inside an m × n rectangle, so c ≤ mn. By the AM-GM inequality, √(mn) ≤ (m + n)/2, hence m + n ≥ 2√(mn) ≥ 2√c.
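The inequality in the proof can be sanity-checked by brute force: for every small m and n, a reducer receiving m + n tuples covers at most c = mn cells, and m + n ≥ 2√c. A minimal sketch:

```python
import math

# Brute-force sanity check of Lemma 1: a reducer receiving m tuples from T
# and n tuples from S covers at most c = m*n cells, and m + n >= 2*sqrt(c).
for m in range(1, 31):
    for n in range(1, 31):
        c = m * n                          # the most cells this reducer can cover
        assert m + n >= 2 * math.sqrt(c)   # AM-GM: m + n >= 2*sqrt(m*n)

print("Lemma 1 verified for all m, n up to 30")
```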
Cross Product

• We first consider the cross product, where all pairs of tuples from the two datasets satisfy the join condition. The join matrix is completely filled:

[ Join matrix for join condition S × T: every cell is marked ]
Cross Product

• Since all entries of the join matrix are true, the maximum-reducer-output MRO ≥ |S||T|/r. (Otherwise, there would be tuples not mapped to any reducer.)
• Along with Lemma 1, we have a lower bound for the maximum-reducer-input: MRI ≥ 2√(|S||T|/r)

[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least 2√c input tuples.
Cross Product

Properties (lower bounds): MRO ≥ |S||T|/r and MRI ≥ 2√(|S||T|/r). We will revisit these two properties frequently to judge the quality of join mappings.

Case 1: Suppose |S| and |T| are multiples of √(|S||T|/r); namely, |S| = c_S √(|S||T|/r) and |T| = c_T √(|S||T|/r). Then partitioning the join matrix into squares of side √(|S||T|/r) is an optimal mapping.

Proof: This is immediate from the bounds above: each region mapped to a reducer has output size |S||T|/r and input size 2√(|S||T|/r), matching the lower bounds exactly.
Cross Product

Example: Suppose |S| = |T| = 6 and r = 9. Partition the 6 × 6 join matrix into nine 2 × 2 squares, one per reducer. Then:

MRO = 4 = |S||T|/r
MRI = 4 = 2√(|S||T|/r)
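The Case 1 square partitioning can be sketched directly; the helper below (a hypothetical sketch, assuming both matrix sides divide evenly by the square side) assigns each cell block to a reducer and recovers the MRO/MRI values of the example:

```python
import math

def square_partition(s_size, t_size, r):
    """Partition the s_size x t_size join matrix into square regions of side
    sqrt(s_size*t_size/r), one per reducer (assumes both sides divide evenly)."""
    side = math.isqrt(s_size * t_size // r)
    blocks_per_row = t_size // side
    mapping = {}
    for bi in range(s_size // side):
        for bj in range(blocks_per_row):
            rows = range(bi * side, (bi + 1) * side)
            cols = range(bj * side, (bj + 1) * side)
            mapping[bi * blocks_per_row + bj] = (rows, cols)
    return mapping

regions = square_partition(6, 6, 9)                          # nine 2x2 squares
mro = max(len(rs) * len(cs) for rs, cs in regions.values())  # cells covered = output
mri = max(len(rs) + len(cs) for rs, cs in regions.values())  # rows + cols = input
print(len(regions), mro, mri)   # 9 4 4
```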
Case 2: Suppose the cardinality of one dataset is significantly greater than that of the other (WLOG, assume |S| < |T|/r). Then covering the matrix with r rectangles of size |S| × |T|/r is the optimal mapping.

(e.g., |S| = 3, |T| = 20, r = 5)
Case 3: The remaining case, where |T|/r ≤ |S| ≤ |T|.

Let c_T = ⌊|T| / √(|S||T|/r)⌋ and c_S = ⌊|S| / √(|S||T|/r)⌋.

Then covering M with squares of size √(|S||T|/r) × √(|S||T|/r) is a mapping worse than an optimal mapping by a factor of no more than 4.
If |S| and/or |T| is not a multiple of √(|S||T|/r), scale each side by (1 + 1/c_S) and/or (1 + 1/c_T) respectively to cover M. Given |T|/r ≤ |S| ≤ |T|, we see that

(1 + 1/c_S) √(|S||T|/r) ≤ 2√(|S||T|/r)

Hence, MRO ≤ 4 |S||T|/r and MRI ≤ 4√(|S||T|/r).

Comparing these with the lower bounds given above, the MRO and MRI produced by this mapping are at most 4 times (twice, for MRI) the lower bounds.
Implementation

• Now we know how to (nearly) optimally partition the join matrix. So let's run it!
• However, when a reducer is given a record (from either S or T), it does NOT have enough information about where exactly in the dataset (in which row/column) the record belongs.
• We could run another pre-processing pass to get that information, but this can be avoided by using a randomized algorithm!
Mapping & Randomized Algorithm

Algorithm 1: Map (Theta-Join)

Input: input tuple x ∈ S ∪ T
1: if x ∈ S then
2:   matrixRow = random(1, |S|)
3:   for all regionID in lookup.getRegions(matrixRow) do
4:     Output(regionID, (x, "S"))
5: else
6:   matrixCol = random(1, |T|)
7:   for all regionID in lookup.getRegions(matrixCol) do
8:     Output(regionID, (x, "T"))

1. Given a record x ∈ S (WLOG)
2. Pick a row uniformly at random
3. Find all the regions intersecting that row and output (regionID, (x, "S"))
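The map step can be sketched in Python as follows. This is a hedged sketch: `regions` is a hypothetical list of (row-range, column-range) rectangles standing in for the `lookup` table, and the 3-region layout at the bottom is an illustrative assumption, not the paper's exact cover.

```python
import random

def one_bucket_theta_map(x, origin, regions, s_size, t_size, rng=random):
    """1-Bucket-Theta map step: draw a random matrix row (for an S tuple) or
    column (for a T tuple), then emit the tuple to every region that
    intersects that row/column."""
    out = []
    if origin == "S":
        row = rng.randint(1, s_size)
        for region_id, (rows, cols) in enumerate(regions, start=1):
            if row in rows:
                out.append((region_id, (x, "S")))
    else:
        col = rng.randint(1, t_size)
        for region_id, (rows, cols) in enumerate(regions, start=1):
            if col in cols:
                out.append((region_id, (x, "T")))
    return out

# Hypothetical 3-region cover of a 6x6 matrix:
# (1) rows 1-3 x cols 1-3, (2) rows 1-3 x cols 4-6, (3) rows 4-6 x cols 1-6
regions = [(range(1, 4), range(1, 4)),
           (range(1, 4), range(4, 7)),
           (range(4, 7), range(1, 7))]
print(one_bucket_theta_map("S1", "S", regions, 6, 6))
```

Note that a tuple is replicated to every region its random row/column touches, which is why the input can grow while no join result is ever lost.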
Mapping & Randomized Algorithm

Example: join condition S.value = T.value, with
S: S1.A = 5, S2.A = 7, S3.A = 7, S4.A = 8, S5.A = 9, S6.A = 9
T: T1.A = 5, T2.A = 7, T3.A = 7, T4.A = 7, T5.A = 8, T6.A = 9
and three regions (1), (2), (3) covering the join matrix.

Map (each tuple draws a random row/column and is output to every intersecting region):

Input   Random     Output
tuple   row/col
S1      3          (1,S1) (2,S1)
S2      5          (3,S2)
S3      1          (1,S3) (2,S3)
S4      5          (3,S4)
S5      1          (1,S5) (2,S5)
S6      2          (1,S6) (2,S6)
T1      6          (2,T1) (3,T1)
T2      2          (1,T2) (3,T2)
T3      2          (1,T3) (3,T3)
T4      3          (1,T4) (3,T4)
T5      6          (2,T5) (3,T5)
T6      4          (2,T6) (3,T6)

Reduce:
Reducer 1 (key 1 = regID): Input: S1, S3, S5, S6, T2, T3, T4 — Output: (S3,T2) (S3,T3) (S3,T4)
Reducer 2 (key 2 = regID): Input: S1, S3, S5, S6, T1, T5, T6 — Output: (S1,T1) (S5,T6) (S6,T6)
Reducer 3 (key 3 = regID): Input: S2, S4, T1, T2, T3, T4, T5, T6 — Output: (S2,T2) (S2,T3) (S2,T4) (S4,T5)
Cross Product… NOT!

• We have verified that the 1-Bucket-Theta algorithm is close to optimal when the join condition is a cross product.
• How does the 1-Bucket-Theta algorithm perform when the join condition is NOT a cross product?
• We will compare the quality of the 1-Bucket-Theta algorithm to that of any join algorithm.
1BT vs ANY join algorithm

Let 1 ≥ x > 0. Any matrix-to-reducer mapping that has to cover at least x|S||T| of the |S||T| cells of the join matrix assigns some reducer at least x|S||T|/r cells, and hence, by Lemma 1, has MRI ≥ 2√(x|S||T|/r).

[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least 2√c input tuples.

As we have seen, 1BT guarantees that MRI ≤ 4√(|S||T|/r). Hence,

MRI_1BT / MRI_AnyJoinAlg ≤ 4√(|S||T|/r) / 2√(x|S||T|/r) = 2/√x
1BT vs ANY join algorithm

When x = 0.5, the ratio 2/√x = 2√2 < 3.

Hence, compared to ANY join algorithm that assigns more than 50% of its matrix cells to reducers, the MRI for 1BT is at most 3 times the MRI of that algorithm.
M-Bucket-I

• In the previous slides, we saw that instead of covering the entire matrix, mapping smaller regions yields a better MRI.
• Ideally, we only want to map the cells satisfying the join condition, but this cannot be done without knowing input statistics and/or the join condition.
• M-Bucket-I exploits statistics to improve over the 1-Bucket-Theta join algorithm.
M-Bucket-I

[ Step 1 ] Approximate Equi-Depth Histograms

1) With probability n/|S|, sample approximately n records from S
2) Build k-quantiles (k buckets), where k < n
3) Iterate through S and count the number of records in each bucket
4) Do the same for T and build the join matrix accordingly
M-Bucket-I

[ Step 1 ] Approximate Equi-Depth Histograms

Dataset S          Dataset T
S_id  Value        T_id  Value
1     7            1     5
2     2            2     5
3     4            3     6
4     2            4     8
5     1            5     8
6     9            6     10
7     10           7     2
8     2            8     4
9     5            9     1
10    3            10    3

Samples:
Sample S: 7, 2, 2, 9, 2, 3
Sample T: 5, 6, 8, 2, 1, 3

Buckets (boundaries from the samples) and per-bucket counts over the full datasets:
S: boundaries 0, 2, 3, 9 → counts 4, 1, 4, 1
T: boundaries 0, 1, 5, 8 → counts 1, 5, 3, 1
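The boundary-and-count steps above can be sketched as follows. Picking the sample's min, median, and max as the bucket boundaries is one simple choice that reproduces this example's numbers; it is an assumption, not necessarily the exact quantile rule used in the paper.

```python
import bisect

def bucket_boundaries(sample):
    """One simple boundary choice: min, median, and max of the sorted sample."""
    s = sorted(sample)
    return [s[0], s[len(s) // 2], s[-1]]

def bucket_counts(data, boundaries):
    """Count records per bucket: (-inf, b0], (b0, b1], ..., (b_last, inf)."""
    counts = [0] * (len(boundaries) + 1)
    for x in data:
        counts[bisect.bisect_left(boundaries, x)] += 1
    return counts

S = [7, 2, 4, 2, 1, 9, 10, 2, 5, 3]
T = [5, 5, 6, 8, 8, 10, 2, 4, 1, 3]
print(bucket_boundaries([7, 2, 2, 9, 2, 3]))   # [2, 3, 9]
print(bucket_counts(S, [2, 3, 9]))             # [4, 1, 4, 1]
print(bucket_counts(T, [1, 5, 8]))             # [1, 5, 3, 1]
```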
M-Bucket-I

[ Step 1 ] Approximate Equi-Depth Histograms

[ Join matrix over the histogram buckets: rows are the 10 S records (bucket boundaries 2, 3, 9), columns are the 10 T records (bucket boundaries 1, 5, 8) ]
Join condition: S.value = T.value

We now have candidate cells. How do we map these cells to reducers?
M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Algorithm: M-Bucket-I
Input: maxInput, r, M
1: row = 0
2: while row < M.noOfRows do
3:   (row, r) = CoverSubMatrix(row, maxInput, r, M)
4:   if r < 0 then
5:     return false
6: return true

Algorithm: CoverSubMatrix
Input: row_s, maxInput, r, M
1: maxScore = -1, rUsed = 0
2: for i = 1 to maxInput - 1 do
3:   R_i = CoverRows(row_s, row_s + i, maxInput, M)
4:   area = totalCandidateArea(row_s, row_s + i, M)
5:   score = area / R_i.size
6:   if score >= maxScore then
7:     maxScore = score; bestRow = row_s + i; rUsed = R_i.size
8: r = r - rUsed
9: return (bestRow + 1, r)

Algorithm: CoverRows
Input: row_f, row_l, maxInput, M
1: Regions = ∅; r = newRegion()
2: for all c_i in M.getColumns do
3:   if r.cap < c_i.candidateInputCosts then
4:     Regions = Regions ∪ {r}
5:     r = newRegion()
6:   r.Cells = r.Cells ∪ c_i.candidateCells
7: Regions = Regions ∪ {r}
8: return Regions
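The CoverRows step can be sketched as a greedy packing of columns into regions of bounded input cost. This is a simplified sketch: `column_costs` is a hypothetical list giving the candidate input cost each column contributes to the current row band, and per-row costs are ignored.

```python
def cover_rows(column_costs, max_input):
    """Greedily pack consecutive columns into regions whose total candidate
    input cost stays within max_input (simplified CoverRows sketch)."""
    regions, current, remaining = [], [], max_input
    for col, cost in enumerate(column_costs):
        if cost > remaining:           # current region is full: close it, open a new one
            regions.append(current)
            current, remaining = [], max_input
        current.append(col)
        remaining -= cost
    regions.append(current)            # don't forget the last open region
    return regions

print(cover_rows([1, 1, 1, 1], 2))     # [[0, 1], [2, 3]]
print(cover_rows([3, 2, 2], 4))        # [[0], [1, 2]]
```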
M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Run the algorithm with r = 6, maxInput = 5. Scores for candidate row bands starting at row 0:

row 0: score 4
row 1: score 13/3 ≈ 4.33
row 2: score 22/4 = 5.5
row 3: score 31/7 ≈ 4.43

We choose the mapping with the highest score — rows 0 through 2, covered by regions (1) (2) (3) (4).

The algorithm then continues from the next uncovered row (row 3: score 3), and so on, until the whole candidate area is covered.
M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Final mapping: the candidate cells are covered by 13 regions, (1) through (13).

However, we have mapped the candidate cells to more than r reducers. We therefore binary search on maxInput until we reach a mapping that uses at most r reducers.
M-Bucket-I

[ Step 3 ] Binary Search

MaxInput = |S| + |T| = 20  →  Num. reducers = 1
MaxInput = 5               →  Num. reducers = 13
MaxInput = 12              →  Num. reducers = 3
MaxInput = 8               →  Num. reducers = 5

Since 7 reducers are required when MaxInput = 7, we stop the binary search here and output the mapping with MRI = 8.
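Step 3 can be sketched as a binary search over MaxInput. Here `reducers_needed` is a hypothetical stand-in for running the M-Bucket-I heuristic and counting the regions it produces; the toy table below matches the data points on the slides.

```python
def binary_search_max_input(lo, hi, r, reducers_needed):
    """Binary search the smallest maxInput in [lo, hi] whose M-Bucket-I
    mapping needs at most r reducers."""
    best = hi
    while lo <= hi:
        mid = (lo + hi) // 2
        if reducers_needed(mid) <= r:
            best, hi = mid, mid - 1    # feasible: try a smaller maxInput
        else:
            lo = mid + 1               # infeasible: need a larger maxInput
    return best

# Toy stand-in for the heuristic, matching the slides' data points:
table = {20: 1, 12: 3, 8: 5, 7: 7, 5: 13}
needed = lambda m: min((v for k, v in table.items() if k <= m), default=99)
print(binary_search_max_input(5, 20, 6, needed))   # 8
```

With r = 6 the search probes MaxInput = 12, 8, 6, 7 and settles on 8, matching the MRI = 8 conclusion above.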
Performance

Skew Resistance of 1-Bucket-Theta

                          1 Bucket Theta          Standard Equi-Join
Data set    Output size   Output      Runtime     Output      Runtime
            (billion)     Imbalance   (secs)      Imbalance   (secs)
Synth-0     25.00         1.0030      657         1.0124      701
Synth-0.4   24.99         1.0023      650         1.2541      722
Synth-0.6   24.98         1.0033      676         1.7780      923
Synth-0.8   24.95         1.0068      678         3.0103      1482
Synth-1     24.91         1.0089      667         5.3124      2489
Skewed

where Output Imbalance = MRI / Avg.RI
Performance

M-Bucket-I cost details (seconds)

Step        Number of Buckets
            1          10         100       1000    10,000  100,000  1,000,000
Quantiles   0          115        120       117     122     124      122
Histogram   0          140        145       147     157     167      604
Heuristic   74.01      9.21       0.84      1.50    16.67   118.03   111.27
Join        49,384     10,905     1,157     595     548     540      536
Total       49,458.01  11,169.21  1,422.84  860.5   843.67  949.03   1,373.27
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 

Theta join (M-bucket-I algorithm explained)

  • 1. Processing Theta Joins using MapReduce by Minsub Yim
  • 2. Processing pipeline at a reducer. Goal: we want to minimize job completion time. Since it is a function of both input and output, we need a way to model both the inputs and the outputs of a reducer. Mapper Output → Reducer → Join Output, where time = f(input size) on the input side and time = f(output size) on the output side. Reducer pipeline: receive mapper output → sort input by key → read input → run join algorithm → send join output.
  • 4. Theta Join Model. Dataset S (S_id, Value): (1,5) (2,6) (3,6) (4,8) (5,8) (6,10). Dataset T (T_id, Value): (1,5) (2,5) (3,6) (4,8) (5,8) (6,10). Assuming join condition S.value = T.value, the join matrix M has one row per S tuple (values 5 6 6 8 8 10) and one column per T tuple (values 5 5 6 8 8 10); a marked cell denotes a pair of tuples satisfying the join condition.
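The join matrix described above can be sketched in a few lines of Python (an illustration using the slide's sample values, not code from the deck):

```python
# Sketch: build the join matrix M for a theta-join.
# M[i][j] is True iff the i-th tuple of S and the j-th tuple of T
# satisfy the join condition theta.

S = [5, 6, 6, 8, 8, 10]   # values of dataset S from the slide
T = [5, 5, 6, 8, 8, 10]   # values of dataset T from the slide

def join_matrix(S, T, theta):
    return [[theta(s, t) for t in T] for s in S]

equi = join_matrix(S, T, lambda s, t: s == t)
band = join_matrix(S, T, lambda s, t: abs(s - t) < 2)

# The equi-join matrix has a True cell for every matching (S, T) pair.
print(sum(cell for row in equi for cell in row))   # -> 9 joining pairs
```

The 9 true cells match the total output (2 + 2 + 4 + 1 tuples) produced by the reducer mappings on the later slides.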
  • 5. Theta Join Model (Examples). Using the same row labels (S: 5 6 6 8 8 10) and column labels (T: 5 5 6 8 8 10), the slide shows the join matrix for three conditions: S.value <= T.value, abs(S.value - T.value) < 2, and S.value = T.value.
  • 9. Goal Revisited. We want to minimize job completion time. We need to assign every true cell of the join matrix to exactly one reducer, i.e., find a mapping from M to the set of reducers R. Goal: find a mapping from the join matrix M to reducers that minimizes job completion time.
  • 10. Mappings from join matrix to reducers. Join condition: S.value = T.value; S values 5 6 6 8 8 10, T values 5 5 6 8 8 10; regions (1)-(4). Standard equi-join algorithm: [R1] input S1, T1, T2, output 2 tuples; [R2] input S2, S3, T3, output 2 tuples; [R3] input S4, S5, T4, T5, output 4 tuples; [R4] input S6, T6, output 1 tuple. Max-Reducer-Input: 4, Max-Reducer-Output: 4. Random mapping: [R1] input S1, S4, S5, T1, T4, T5, output 3 tuples; [R2] input S2, S4, T3, T5, output 2 tuples; [R3] input S1, S5, T2, T4, output 2 tuples; [R4] input S3, S6, T3, T6, output 2 tuples. MRI: 6, MRO: 3.
  • 14. Mappings from join matrix to reducers. Join condition: S.value = T.value; regions (1)-(3). [R1] input S1, S2, T1, T2, output 2 tuples; [R2] input S3, S4, T1, T2, T3, output 2 tuples; [R3] input S4, S5, S6, T4, T5, T6, output 5 tuples. Max-Reducer-Input: 6, Max-Reducer-Output: 5.
  • 16. Mappings from join matrix to reducers. We have seen that there can be many possible mappings from the join matrix to reducers. We will see, for different cases, which mapping is (close to) optimal, and give algorithms to compute such a mapping.
  • 17. Lemma. We will use the following lemma repeatedly to show how close to optimal each mapping is. [LEMMA 1] A reducer that is assigned c cells of the join matrix M will receive at least 2*sqrt(c) input tuples. [Proof] Consider a reducer r that receives m records from T and n records from S. Then mn >= c, and by the AM-GM inequality m + n >= 2*sqrt(mn) >= 2*sqrt(c).
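A quick numeric check of Lemma 1 (an illustration, not from the slides): an m x n region covers c = mn cells, and m + n is never below 2*sqrt(c).

```python
import math

# Check Lemma 1 over a grid of region shapes: if a reducer covers an
# m-by-n block of the join matrix, it covers c = m*n cells and receives
# m + n input tuples, and m + n >= 2*sqrt(m*n) = 2*sqrt(c) by AM-GM.
for m in range(1, 30):
    for n in range(1, 30):
        c = m * n
        assert m + n >= 2 * math.sqrt(c) - 1e-9   # tolerance for float sqrt
```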
  • 19. Cross Product. We first consider the cross product, where every pair of tuples from the two datasets satisfies the join condition, so the join matrix is completely filled with true cells.
  • 21. Cross Product. Since all entries of the join matrix are true, the maximum-reducer-output (MRO) >= |S||T|/r. (Otherwise, there would be tuples not mapped to any reducer.) Along with Lemma 1, we also have a lower bound for the maximum-reducer-input (MRI): MRI >= 2*sqrt(|S||T|/r). [LEMMA 1] A reducer that is assigned c cells of the join matrix M will receive at least 2*sqrt(c) input tuples.
  • 23. Cross Product. We will revisit these two properties frequently to judge the quality of join mappings: MRO >= |S||T|/r and MRI >= 2*sqrt(|S||T|/r).
  • 24. Cross Product. Properties: MRO >= |S||T|/r and MRI >= 2*sqrt(|S||T|/r). Case 1: Suppose |S| and |T| are multiples of sqrt(|S||T|/r), namely |S| = c_S * sqrt(|S||T|/r) and |T| = c_T * sqrt(|S||T|/r). Then partitioning the join matrix into squares of side sqrt(|S||T|/r) is an optimal mapping. Proof: each region mapped to a reducer has output size |S||T|/r and input size 2*sqrt(|S||T|/r), which match the lower bounds.
  • 26. Cross Product. Properties: MRO >= |S||T|/r and MRI >= 2*sqrt(|S||T|/r). Example: suppose |S| = |T| = 6 and r = 9.
  • 29. Cross Product. With |S| = |T| = 6 and r = 9, the square side is sqrt(36/9) = 2, so each reducer receives a 2 x 2 block: MRO = 4 = |S||T|/r and MRI = 4 = 2*sqrt(|S||T|/r).
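The square tiling of Case 1 can be sketched directly (an illustration under Case 1's assumption that the side length divides both |S| and |T|; function name is mine):

```python
import math

# Tile the |S| x |T| join matrix with squares of side sqrt(|S||T|/r)
# and report the per-reducer input (rows + columns) and output (cells).
def square_partition_stats(S_size, T_size, r):
    side = int(math.isqrt(S_size * T_size // r))   # square side length
    mro = side * side        # output per reducer = |S||T|/r
    mri = 2 * side           # input per reducer = 2*sqrt(|S||T|/r)
    return mri, mro

# Slide example: |S| = |T| = 6, r = 9  ->  side 2, MRI = 4, MRO = 4
print(square_partition_stats(6, 6, 9))   # -> (4, 4)
```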
  • 31. Cross Product. Properties: MRO >= |S||T|/r and MRI >= 2*sqrt(|S||T|/r). Case 2: Suppose the cardinality of one dataset is significantly greater than that of the other (WLOG, assume |S| < |T|/r). Then covering the matrix with r rectangles of size |S| x (|T|/r) is the optimal mapping. (E.g., |S| = 3, |T| = 20, r = 5.)
  • 32. Cross Product. Properties: MRO >= |S||T|/r and MRI >= 2*sqrt(|S||T|/r). Case 3: The remaining case, where |T|/r <= |S| <= |T|. Let c_T = floor(|T| / sqrt(|S||T|/r)) and c_S = floor(|S| / sqrt(|S||T|/r)). Then covering M with squares of side sqrt(|S||T|/r) is a mapping worse than an optimal mapping by a factor no greater than 4.
  • 33. Cross Product. If |S| and/or |T| is not a multiple of sqrt(|S||T|/r), scale the square side by (1 + 1/c_S) and/or (1 + 1/c_T) respectively to cover M. Given |T|/r <= |S| <= |T|, we see that (1 + 1/c_S) * sqrt(|S||T|/r) <= 2*sqrt(|S||T|/r).
  • 34. Cross Product. Hence MRO <= 4|S||T|/r and MRI <= 4*sqrt(|S||T|/r). Comparing these with the lower bounds MRO >= |S||T|/r and MRI >= 2*sqrt(|S||T|/r), the MRO and MRI produced by this mapping are at most 4 times (twice for MRI) the lower bounds.
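A numeric sanity check of the Case 3 bound (an illustration; the test inputs are arbitrary sizes satisfying |T|/r <= |S| <= |T|, not from the slides):

```python
import math

# With side scaled by (1 + 1/c), each region spans at most
# (1 + 1/c_S)*q rows and (1 + 1/c_T)*q columns, q = sqrt(|S||T|/r).
# Since c_S, c_T >= 1 under |T|/r <= |S| <= |T|, input stays <= 4q
# and output stays <= 4*|S||T|/r.
for S, T, r in [(10, 30, 5), (7, 20, 4), (13, 13, 3)]:
    q = math.sqrt(S * T / r)
    cS, cT = int(S // q), int(T // q)
    assert cS >= 1 and cT >= 1                    # holds in Case 3
    mri = (1 + 1 / cS) * q + (1 + 1 / cT) * q     # rows + columns
    mro = (1 + 1 / cS) * q * (1 + 1 / cT) * q     # covered cells
    assert mri <= 4 * q + 1e-9
    assert mro <= 4 * S * T / r + 1e-9
```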
  • 35. Implementation. Now we know how to (nearly) optimally partition the join matrix, so let's run it! However, when a reducer is given a record (from either S or T), it does NOT have enough information about where exactly in the dataset (i.e., in which row or column of the matrix) the record belongs. We could run another preprocessing pass to get that information, but it can be avoided by running a randomized algorithm!
  • 39. Mapping & Randomized Algorithm
    Algorithm 1: Map (Theta-Join)
    Input: input tuple x in S ∪ T
    1: if x in S then
    2:   matrixRow = random(1, |S|)
    3:   for all regionID in lookup.getRegions(matrixRow) do
    4:     Output(regionID, (x, "S"))
    5: else
    6:   matrixCol = random(1, |T|)
    7:   for all regionID in lookup.getRegions(matrixCol) do
    8:     Output(regionID, (x, "T"))
    In words: 1. given a record x (WLOG x in S), 2. pick a matrix row uniformly at random, 3. emit (regionID, (x, S)) for every region intersecting that row.
  • 40. Mapping & Randomized Algorithm — example. Join condition: S.value = T.value; regions (1)-(3).
    Values: S1.A = 5, S2.A = 7, S3.A = 7, S4.A = 8, S5.A = 9, S6.A = 9; T1.A = 5, T2.A = 7, T3.A = 7, T4.A = 7, T5.A = 8, T6.A = 9.
    Map (input tuple → random row/col → output):
    S1 → 3 → (1,S1) (2,S1);  S2 → 5 → (3,S2);  S3 → 1 → (1,S3) (2,S3);  S4 → 5 → (3,S4);  S5 → 1 → (1,S5) (2,S5);  S6 → 2 → (1,S6) (2,S6)
    T1 → 6 → (2,T1) (3,T1);  T2 → 2 → (1,T2) (3,T2);  T3 → 2 → (1,T3) (3,T3);  T4 → 3 → (1,T4) (3,T4);  T5 → 6 → (2,T5) (3,T5);  T6 → 4 → (2,T6) (3,T6)
    Reduce:
    Reducer 1 (key 1 = regID): input S1, S3, S5, S6, T2, T3, T4; output (S3,T2) (S3,T3) (S3,T4)
    Reducer 2 (key 2 = regID): input S1, S3, S5, S6, T1, T5, T6; output (S1,T1) (S5,T6) (S6,T6)
    Reducer 3 (key 3 = regID): input S2, S4, T1, T2, T3, T4, T5, T6; output (S2,T2) (S2,T3) (S2,T4) (S4,T5)
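The randomized mapper can be sketched as follows (an illustration of Algorithm 1, not the deck's code; the `regions` lookup table here is a hypothetical 2 x 2 tiling of a 4 x 4 matrix):

```python
import random

# A region is ((row_lo, row_hi), (col_lo, col_hi)) in the join matrix.
regions = {
    1: ((1, 2), (1, 2)),
    2: ((1, 2), (3, 4)),
    3: ((3, 4), (1, 2)),
    4: ((3, 4), (3, 4)),
}

def map_tuple(x, relation, S_size, T_size):
    """Emit (regionID, (x, relation)) pairs, as in Algorithm 1."""
    out = []
    if relation == "S":
        row = random.randint(1, S_size)              # random matrix row
        for rid, ((lo, hi), _) in regions.items():
            if lo <= row <= hi:                      # regions crossing that row
                out.append((rid, (x, "S")))
    else:
        col = random.randint(1, T_size)              # random matrix column
        for rid, (_, (lo, hi)) in regions.items():
            if lo <= col <= hi:                      # regions crossing that col
                out.append((rid, (x, "T")))
    return out

# In this tiling every matrix row intersects exactly 2 regions,
# so each S tuple is replicated to 2 reducers regardless of the random draw.
print(len(map_tuple("s1", "S", 4, 4)))   # -> 2
```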
  • 41. Cross Product… NOT! We have verified that the 1-Bucket-Theta algorithm is close to optimal when the join condition is a cross product. How does 1-Bucket-Theta perform when the join condition is NOT a cross product? We will compare the quality of 1-Bucket-Theta against any join algorithm.
  • 44. 1BT vs ANY join algorithm. Let x > 0. Any matrix-to-reducer mapping that has to cover at least x|S||T| of the |S||T| cells of the join matrix has, by Lemma 1, MRI >= 2*sqrt(x|S||T|/r). [LEMMA 1] A reducer that is assigned c cells of the join matrix M will receive at least 2*sqrt(c) input tuples. As we have seen, 1BT guarantees that MRI <= 4*sqrt(|S||T|/r). Hence MRI_1BT / MRI_AnyJoinAlg <= 4*sqrt(|S||T|/r) / (2*sqrt(x|S||T|/r)) = 2/sqrt(x).
  • 45. 1BT vs ANY join algorithm
  • 46. 1BT vs ANY join algorithm. When x = 0.5, the ratio 2/sqrt(x) ≈ 2.83 < 3. Hence, compared with ANY join algorithm that assigns more than 50% of its matrix cells to reducers, the MRI for 1BT is at most 3 times the MRI of that algorithm.
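The arithmetic behind the "at most 3 times" claim (an illustrative check):

```python
import math

# 1BT guarantees MRI <= 4*sqrt(|S||T|/r); any algorithm covering a
# fraction x of the cells needs MRI >= 2*sqrt(x|S||T|/r).
# Their ratio is 2/sqrt(x); for x = 0.5 that is 2*sqrt(2) ~ 2.83 < 3.
ratio = 2 / math.sqrt(0.5)
print(round(ratio, 2))   # -> 2.83
assert ratio < 3
```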
  • 48. M-Bucket-I. In the previous slides we saw that, instead of covering the entire matrix, mapping only smaller regions would yield a better MRI. Ideally, we would map only the cells satisfying the join condition, but that cannot be done without knowing input statistics and/or the join condition. M-Bucket-I exploits statistics to improve over the 1-Bucket-Theta join algorithm.
  • 51. M-Bucket-I. [Step 1] Approximate Equi-Depth Histograms. 1) With probability n/|S| per record, sample approximately n records from S. 2) Build k-quantiles (k buckets) over the sample, where k < n. 3) Iterate through S and count the number of records in each bucket. 4) Do the same for T, and build the join matrix accordingly.
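Step 1 can be sketched like this (my simplification, not the deck's code: a fixed-size sample stands in for the per-record Bernoulli sampling, and the function name is illustrative):

```python
import random

def equi_depth_histogram(data, n_sample, k, seed=0):
    """Sample records, take the sample's k-quantiles as bucket boundaries,
    then count how many records of the full dataset fall into each bucket."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, n_sample))
    # bucket boundaries = k-quantiles of the sample
    bounds = [sample[(i * n_sample) // k] for i in range(1, k)]
    counts = [0] * k
    for v in data:                      # one pass over the full dataset
        b = sum(v > t for t in bounds)  # index of the bucket containing v
        counts[b] += 1
    return bounds, counts

S = [7, 2, 4, 2, 1, 9, 10, 2, 5, 3]    # dataset S from the slides
bounds, counts = equi_depth_histogram(S, n_sample=6, k=2)
assert sum(counts) == len(S)           # every record lands in one bucket
```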
  • 56. M-Bucket-I. [Step 1] Approximate Equi-Depth Histograms.
    Dataset S (S_id : value): 1:7, 2:2, 3:4, 4:2, 5:1, 6:9, 7:10, 8:2, 9:5, 10:3
    Dataset T (T_id : value): 1:5, 2:5, 3:6, 4:8, 5:8, 6:10, 7:2, 8:4, 9:1, 10:3
    Samples: sample of S = {7, 2, 2, 9, 2, 3}; sample of T = {5, 6, 8, 2, 1, 3}
    Buckets (table layout lost in extraction): S 0 2 3 9, T 0 1 5 8; remaining figures 1 1 4 1 4 1 1 5 3 1
  • 59. M-Bucket-I. [Step 1] Approximate Equi-Depth Histograms. The bucket boundaries (S side: 2, 3, 9; T side: 1, 5, 8) partition the rows and columns of the join matrix for S.value = T.value; the coarse cells whose bucket ranges can satisfy the join condition are the candidate cells. We now have candidate cells. How do we map these cells to reducers?
  • 60. M-Bucket-I. [Step 2] M-Bucket-I Algorithm.
    Algorithm: M-Bucket-I
    Input: maxInput, r, M
    1: row = 0
    2: while row < M.noOfRows do
    3:   (row, r) = CoverSubMatrix(row, maxInput, r, M)
    4:   if r < 0 then
    5:     return false
    6: return true
  • 61. M-Bucket-I. [Step 2] M-Bucket-I Algorithm.
    Algorithm: CoverSubMatrix
    Input: row_s, maxInput, r, M
    1: maxScore = -1, rUsed = 0
    2: for i = 1 to maxInput - 1 do
    3:   R_i = CoverRows(row_s, row_s + i, maxInput, M)
    4:   area = totalCandidateArea(row_s, row_s + i, M)
    5:   score = area / R_i.size
    6:   if score >= maxScore then
    7:     maxScore = score; bestRow = row_s + i
    8:     rUsed = R_i.size
    9: r = r - rUsed
    10: return (bestRow + 1, r)
  • 62. M-Bucket-I. [Step 2] M-Bucket-I Algorithm.
    Algorithm: CoverRows
    Input: row_f, row_l, maxInput, M
    1: Regions = {}; r = newRegion()
    2: for all c_i in M.getColumns do
    3:   if r.cap < c_i.candidateInputCosts then
    4:     Regions = Regions ∪ r
    5:     r = newRegion()
    6:   r.Cells = r.Cells ∪ c_i.candidateCells
    7: return Regions
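The pseudocode above can be condensed into a compact Python sketch (my simplification, not the deck's exact code: regions are counted rather than materialized, a region's input budget is its rows plus columns, and `cand` stands in for the histogram's candidate-cell matrix):

```python
def cover_rows(cand, r0, r1, max_input):
    """Greedily pack the candidate cells of rows [r0, r1) into regions,
    scanning columns left to right (cf. CoverRows)."""
    height = r1 - r0
    regions, width = 0, 0
    for j in range(len(cand[0])):
        if any(cand[i][j] for i in range(r0, r1)):
            if height + width + 1 > max_input:   # region input budget full
                regions, width = regions + 1, 0
            width += 1
    return regions + (1 if width else 0)

def m_bucket_i(cand, max_input):
    """Cover the matrix row-block by row-block, choosing the block height
    with the best covered-area-per-region score (cf. CoverSubMatrix)."""
    row, total_regions = 0, 0
    while row < len(cand):
        best = None
        for h in range(1, max_input):            # try block heights
            r1 = min(row + h, len(cand))
            regs = cover_rows(cand, row, r1, max_input)
            area = sum(cand[i][j] for i in range(row, r1)
                       for j in range(len(cand[0])))
            score = area / regs if regs else 0
            if best is None or score >= best[0]:
                best = (score, r1, regs)
        _, row, regs = best
        total_regions += regs
    return total_regions

cand = [[i == j for j in range(4)] for i in range(4)]   # diagonal candidates
print(m_bucket_i(cand, max_input=4))
```

On this toy diagonal matrix the heuristic covers the four candidate cells with two 2 x 2-style regions rather than one region per cell.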
  • 68. M-Bucket-I. [Step 2] M-Bucket-I Algorithm. Run the algorithm with r = 6, maxInput = 5. Candidate block heights starting at row 0: row 0, cost 4; row 1, cost 13/3 ≈ 4.3; row 2, cost 22/4 = 5.5; row 3, cost 31/7 ≈ 4.43. We choose the mapping with the highest score! Regions so far: (1) (2) (3) (4).
  • 69. M-Bucket-I. [Step 2] M-Bucket-I Algorithm. Run the algorithm with r = 6, maxInput = 5. Next block: row 3, cost 3. Regions so far: (1) (2) (3) (4). So on and so forth…
  • 70. M-Bucket-I. [Step 2] M-Bucket-I Algorithm. Run the algorithm with r = 6, maxInput = 5: the final mapping uses regions (1)-(13).
  • 71. M-Bucket-I. [Step 2] M-Bucket-I Algorithm. However, we have mapped the candidate cells to 13 regions, i.e., more than r reducers. We binary-search on maxInput until we reach a mapping that uses at most r reducers.
  • 74. M-Bucket-I. [Step 3] Binary Search. MaxInput = |S| + |T| = 20 → 1 reducer; MaxInput = 5 → 13 reducers; MaxInput = 12 → 3 reducers; MaxInput = 8 → 5 reducers. Since 7 reducers are required when MaxInput = 7, we stop the binary search here and output the mapping with MRI = 8.
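Step 3 can be sketched as a standard binary search over maxInput (an illustration; `regions_needed` is a hypothetical stand-in for re-running the heuristic, using a toy monotone model built from the slide's numbers):

```python
def regions_needed(max_input):
    # Toy stand-in for the M-Bucket-I heuristic, from the slide's numbers:
    # 20 -> 1 reducer, 12 -> 3, 8 -> 5, 7 -> 7, smaller -> 13.
    for bound, regs in [(20, 1), (12, 3), (8, 5), (7, 7)]:
        if max_input >= bound:
            return regs
    return 13

def best_max_input(r, lo, hi):
    """Smallest maxInput in [lo, hi] whose cover needs at most r reducers."""
    while lo < hi:
        mid = (lo + hi) // 2
        if regions_needed(mid) <= r:
            hi = mid            # feasible: try a smaller maxInput
        else:
            lo = mid + 1        # infeasible: need a larger maxInput
    return lo

print(best_max_input(r=6, lo=5, hi=20))   # -> 8, matching the slide
```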
  • 75. Performance — Skew Resistance of 1-Bucket-Theta, where Output Imbalance = MRI / Ave.RI.
                                          1 Bucket Theta               Standard Equi-Join
    Data set     Output size (billion)    Imbalance   Runtime (s)      Imbalance   Runtime (s)
    Synth - 0         25.00               1.0030      657              1.0124      701
    Synth - 0.4       24.99               1.0023      650              1.2541      722
    Synth - 0.6       24.98               1.0033      676              1.7780      923
    Synth - 0.8       24.95               1.0068      678              3.0103      1482
    Synth - 1         24.91               1.0089      667              5.3124      2489
  • 77. Performance — M-Bucket-I cost details (seconds), by number of buckets.
    Step         1           10          100       1000    10,000   100,000   1,000,000
    Quantiles    0           115         120       117     122      124       122
    Histogram    0           140         145       147     157      167       604
    Heuristic    74.01       9.21        0.84      1.50    16.67    118.03    111.27
    Join         49384       10905       1157      595     548      540       536
    Total        49,458.01   11,169.21   1,422.84  860.5   843.67   949.03    1,373.27