Processing Theta Joins
using MapReduce
by Minsub Yim

Theta join (M-bucket-I algorithm explained)

These slides explain the theta-join algorithms (in particular M-Bucket-I) proposed in Okcan et al., "Processing Theta-Joins using MapReduce".

Paper link: http://www.ccs.neu.edu/home/mirek/papers/2011-SIGMOD-ParallelJoins.pdf

Published in: Engineering


  1. Processing Theta Joins using MapReduce, by Minsub Yim
  2. Processing pipeline at a reducer
     Goal: We want to minimize job completion time. Since it is a function of both input and output, we need a way to model both the input and the output of a reducer.
     Pipeline: mapper output -> [receive mapper output, sort input by key, read input] -> [run join algorithm, send join output] -> join output.
     The first group of steps takes time = f(input size); the second takes time = f(output size).
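  4. Theta Join Model
     Dataset S          Dataset T
     S_id  Value        T_id  Value
     1     5            1     5
     2     6            2     5
     3     6            3     6
     4     8            4     8
     5     8            5     8
     6     10           6     10
     Assuming join condition: S.value = T.value
     [ Join Matrix M ] (rows: S values, columns: T values; x marks a pair of tuples satisfying the join condition)
              T:  5  5  6  8  8  10
     S:  5        x  x  .  .  .  .
         6        .  .  x  .  .  .
         6        .  .  x  .  .  .
         8        .  .  .  x  x  .
         8        .  .  .  x  x  .
         10       .  .  .  .  .  x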
  5. Theta Join Model (Examples)
     Three join matrices over the same data (S values 5, 6, 6, 8, 8, 10; T values 5, 5, 6, 8, 8, 10), one per join condition:
     • S.value <= T.value
     • abs(S.value - T.value) < 2
     • S.value = T.value
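
The three example conditions can be checked with a small sketch. It assumes, as the slide layout suggests, that rows correspond to S and columns to T; the helper names `join_matrix` and `count_true` are illustrative and do not come from the slides.

```python
# Values of the two example datasets from the slides.
S = [5, 6, 6, 8, 8, 10]   # one row per S tuple (assumed orientation)
T = [5, 5, 6, 8, 8, 10]   # one column per T tuple


def join_matrix(s_vals, t_vals, theta):
    """Boolean matrix M where M[i][j] is True iff theta(S_i, T_j) holds."""
    return [[theta(s, t) for t in t_vals] for s in s_vals]


def count_true(matrix):
    """Number of true cells, i.e. the size of the join output."""
    return sum(cell for row in matrix for cell in row)


equi = join_matrix(S, T, lambda s, t: s == t)
leq = join_matrix(S, T, lambda s, t: s <= t)
band = join_matrix(S, T, lambda s, t: abs(s - t) < 2)

# Print the equi-join matrix, one row per S tuple.
for row in equi:
    print(" ".join("x" if cell else "." for cell in row))
```

Any predicate over a pair of values can be passed as `theta`, which is what makes the join-matrix view independent of the join condition.
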
  9. Goal Revisited
     • We want to minimize job completion time.
     • We need to assign every true cell to exactly one reducer (find a mapping from M to R).
     • Goal: Find a mapping from the join matrix M to reducers that minimizes job completion time.
  10. Mappings from join matrix to reducers (join condition: S.value = T.value)
      Standard equi-join algorithm, regions (1) to (4):
        [R1] Input: S1, T1, T2                 Output: 2 tuples
        [R2] Input: S2, S3, T3                 Output: 2 tuples
        [R3] Input: S4, S5, T4, T5             Output: 4 tuples
        [R4] Input: S6, T6                     Output: 1 tuple
        Max-Reducer-Input (MRI): 4, Max-Reducer-Output (MRO): 4
      Random mapping, regions (1) to (4):
        [R1] Input: S1, S4, S5, T1, T4, T5     Output: 3 tuples
        [R2] Input: S2, S4, T3, T5             Output: 2 tuples
        [R3] Input: S1, S5, T2, T4             Output: 2 tuples
        [R4] Input: S3, S6, T3, T6             Output: 2 tuples
        MRI: 6, MRO: 3
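
As a rough check of the numbers for the standard equi-join mapping, the sketch below recomputes the input and output size of each reducer. It assumes each reducer emits every matching pair among the tuples it receives, which holds for this mapping but not for the random one (there a reducer only covers the cells of its own region). The names `reducers` and `reducer_stats` are illustrative.

```python
# Tuple id -> join attribute value, as in the example datasets.
S = {1: 5, 2: 6, 3: 6, 4: 8, 5: 8, 6: 10}
T = {1: 5, 2: 5, 3: 6, 4: 8, 5: 8, 6: 10}

# Standard equi-join mapping from the slide: (S ids, T ids) per reducer.
reducers = {
    "R1": ([1], [1, 2]),
    "R2": ([2, 3], [3]),
    "R3": ([4, 5], [4, 5]),
    "R4": ([6], [6]),
}


def reducer_stats(s_ids, t_ids):
    """Return (input size, output size) for one reducer."""
    n_in = len(s_ids) + len(t_ids)
    n_out = sum(1 for s in s_ids for t in t_ids if S[s] == T[t])
    return n_in, n_out


stats = {name: reducer_stats(*ids) for name, ids in reducers.items()}
mri = max(n_in for n_in, _ in stats.values())    # max-reducer-input
mro = max(n_out for _, n_out in stats.values())  # max-reducer-output
print(stats, mri, mro)
```
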
  14. Mappings from join matrix to reducers (join condition: S.value = T.value)
      Another mapping, regions (1) to (3):
        [R1] Input: S1, S2, T1, T2             Output: 2 tuples
        [R2] Input: S3, S4, T1, T2, T3         Output: 2 tuples
        [R3] Input: S4, S5, S6, T4, T5, T6     Output: 5 tuples
        Max-Reducer-Input: 6, Max-Reducer-Output: 5
  16. Mappings from join matrix to reducers
      • There can be many possible mappings from the join matrix to reducers.
      • For several different cases, we will see which mapping is (close to) optimal and which algorithms compute such a mapping.
  17. Lemma
      We will use the following lemma repeatedly to show how close to optimal each mapping is.
      [ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.
      [ Proof ] Consider a reducer that receives m records from T and n records from S. It can cover at most mn cells, so $mn \ge c$ and therefore $2\sqrt{mn} \ge 2\sqrt{c}$. Since $m + n \ge 2\sqrt{mn}$, we get $m + n \ge 2\sqrt{c}$.
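
Lemma 1 can be sanity-checked numerically. The sketch below is not a proof; it brute-forces the smallest possible input size m + n over positive integers with mn >= c and compares it with the bound. The function name `min_input_for_cells` is an illustrative choice.

```python
import math


def min_input_for_cells(c):
    """Smallest m + n over positive integers m, n with m * n >= c."""
    best = None
    for m in range(1, c + 1):
        n = -(-c // m)  # integer ceiling of c / m: the smallest n with m * n >= c
        if best is None or m + n < best:
            best = m + n
    return best


# Collect any c for which the minimum input would fall below 2 * sqrt(c).
violations = [
    c for c in range(1, 501)
    if min_input_for_cells(c) < 2 * math.sqrt(c)
]
print(violations)
```

The bound is tight exactly when c is a perfect square and the region is a square, which is why the mappings in the following slides favor square regions.
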
  19. Cross Product
      • We first consider the cross product, where every pair of tuples from the two datasets satisfies the join condition (join condition: S × T). Every cell of the join matrix (S values 5, 6, 6, 8, 8, 10; T values 5, 5, 6, 8, 8, 10) is then true.
  21. Cross Product
      • Since all entries of the join matrix are true, the maximum-reducer-output (MRO) satisfies $\mathrm{MRO} \ge |S||T|/r$. (Otherwise, there would be tuples not mapped to a reducer.)
      • Along with Lemma 1, we have a lower bound for the maximum-reducer-input (MRI): $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$.
  23. Cross Product
      • We will revisit these two properties frequently to judge the quality of join mappings: $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$.
  24. Cross Product, Case 1
      Suppose |S| and |T| are multiples of $\sqrt{|S||T|/r}$, namely $|S| = c_S\sqrt{|S||T|/r}$ and $|T| = c_T\sqrt{|S||T|/r}$.
      Then partitioning the join matrix with squares of side $\sqrt{|S||T|/r}$ is an optimal mapping.
      Proof: trivial. Each region mapped to a reducer has output size $|S||T|/r$ and input size $2\sqrt{|S||T|/r}$, which meet the lower bounds $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$.
  29. Cross Product, Case 1 example
      Suppose |S| = |T| = 6 and r = 9. Partitioning the 6 × 6 join matrix into nine 2 × 2 squares gives $\mathrm{MRO} = |S||T|/r = 4$ and $\mathrm{MRI} = 2\sqrt{|S||T|/r} = 4$, so both lower bounds are met.
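
A minimal sketch of the Case 1 partition, assuming regions are described by half-open row and column index ranges (a representation chosen here for illustration, not taken from the slides):

```python
import math


def square_partition(s_size, t_size, r):
    """Partition an |S| x |T| cross-product matrix into r squares (Case 1 only).

    Returns the square side and a list of regions given as
    (row_start, row_end, col_start, col_end) with exclusive ends.
    """
    side = math.isqrt(round(s_size * t_size / r))
    # Case 1 requires the side to be exact and to divide both dimensions.
    if side == 0 or side * side * r != s_size * t_size or s_size % side or t_size % side:
        raise ValueError("Case 1 does not apply to these sizes")
    regions = [
        (row, row + side, col, col + side)
        for row in range(0, s_size, side)
        for col in range(0, t_size, side)
    ]
    return side, regions


side, regions = square_partition(6, 6, 9)
mro = side * side   # cells (output tuples) per reducer
mri = 2 * side      # S tuples plus T tuples per reducer
print(side, len(regions), mro, mri)
```
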
  31. Cross Product, Case 2
      Suppose the cardinality of one dataset is significantly greater than that of the other (WLOG, assume $|S| < |T|/r$). Then a cover with rectangles of size $|S| \times |T|/r$ is the optimal mapping (e.g., |S| = 3, |T| = 20, r = 5).
  32. Cross Product, Case 3
      The remaining case, where $|T|/r \le |S| \le |T|$.
      Let $c_S = \left\lfloor |S| / \sqrt{|S||T|/r} \right\rfloor$ and $c_T = \left\lfloor |T| / \sqrt{|S||T|/r} \right\rfloor$.
      Then covering M with squares of size $\sqrt{|S||T|/r} \times \sqrt{|S||T|/r}$ is a mapping worse than an optimal mapping by a factor no greater than 4.
  33. Cross Product, Case 3 (continued)
      If |S| and/or |T| is not a multiple of $\sqrt{|S||T|/r}$, scale each side by $\left(1 + \frac{1}{c_S}\right)$ and/or $\left(1 + \frac{1}{c_T}\right)$ respectively to cover M.
      Given $|T|/r \le |S| \le |T|$, we see that $\left(1 + \frac{1}{c_S}\right)\sqrt{|S||T|/r} \le 2\sqrt{|S||T|/r}$.
  34. Cross Product, Case 3 (conclusion)
      Hence, $\mathrm{MRO} \le 4|S||T|/r$ and $\mathrm{MRI} \le 4\sqrt{|S||T|/r}$.
      Comparing these with the lower bounds $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$, we see that the MRO and MRI produced by this mapping are at most 4 times (twice for MRI) the lower bounds.
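
The Case 3 bounds can be illustrated with a small sketch. It treats region sides as real numbers (no rounding to whole tuples) and stretches each side so that c_S by c_T regions exactly cover the matrix; this is one way to read the scaling step, not necessarily the exact construction in the paper. The sizes 7, 10 and r = 4 are made-up illustration values.

```python
import math


def case3_cover(s_size, t_size, r):
    """Region dimensions and bounds for Case 3 (|T|/r <= |S| <= |T|)."""
    if not (t_size / r <= s_size <= t_size):
        raise ValueError("Case 3 needs |T|/r <= |S| <= |T|")
    base = math.sqrt(s_size * t_size / r)      # ideal square side
    c_s = max(1, math.floor(s_size / base))    # regions along S
    c_t = max(1, math.floor(t_size / base))    # regions along T
    height = s_size / c_s                      # stretched side along S
    width = t_size / c_t                       # stretched side along T
    return {
        "regions": c_s * c_t,
        "mri": height + width,
        "mro": height * width,
        "mri_bound": 4 * base,
        "mro_bound": 4 * s_size * t_size / r,
    }


cover = case3_cover(7, 10, 4)  # hypothetical sizes for illustration
print(cover)
```
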
  35. Implementation
      • Now we know how to (nearly) optimally partition the join matrix, so let's run it!
      • However, when a mapper is given a record (from either S or T), it does NOT have enough information to tell where exactly in the dataset (in which row or column of the join matrix) the record belongs.
      • We could run another pre-processing step to get that information, but this can be avoided by using a randomized algorithm.
  39. Mapping & Randomized Algorithm
      Algorithm 1 : Map (Theta-Join)
      Input : input tuple x ∈ S ∪ T
        if x ∈ S then
          matrixRow = random(1, |S|)
          for all regionID in lookup.getRegions(matrixRow) do
            Output (regionID, (x, "S"))
        else
          matrixCol = random(1, |T|)
          for all regionID in lookup.getRegions(matrixCol) do
            Output (regionID, (x, "T"))
      In words:
      1. Take a record x ∈ S (WLOG).
      2. Pick a row uniformly at random.
      3. Get all the regions intersecting that row and output (regID, (x, S)) for each.
  40. Mapping & Randomized Algorithm: worked example (join condition: S.value = T.value; regions (1), (2), (3))
      Values: S1.A = 5, S2.A = 7, S3.A = 7, S4.A = 8, S5.A = 9, S6.A = 9; T1.A = 5, T2.A = 7, T3.A = 7, T4.A = 7, T5.A = 8, T6.A = 9
      Map:
        Input tuple   Random row/col   Output
        S1            3                (1,S1) (2,S1)
        S2            5                (3,S2)
        S3            1                (1,S3) (2,S3)
        S4            5                (3,S4)
        S5            1                (1,S5) (2,S5)
        S6            2                (1,S6) (2,S6)
        T1            6                (2,T1) (3,T1)
        T2            2                (1,T2) (3,T2)
        T3            2                (1,T3) (3,T3)
        T4            3                (1,T4) (3,T4)
        T5            6                (2,T5) (3,T5)
        T6            4                (2,T6) (3,T6)
      Reduce:
        Reducer 1, key 1 (regID): Input: S1, S3, S5, S6, T2, T3, T4             Output: (S3,T2) (S3,T3) (S3,T4)
        Reducer 2, key 2 (regID): Input: S1, S3, S5, S6, T1, T5, T6             Output: (S1,T1) (S5,T6) (S6,T6)
        Reducer 3, key 3 (regID): Input: S2, S4, T1, T2, T3, T4, T5, T6         Output: (S2,T2) (S2,T3) (S2,T4) (S4,T5)
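
The worked example can be replayed with a short sketch. Two things here are assumptions rather than facts from the slides: the region layout (rows 1 to 4 with columns 1 to 3 form region 1, rows 1 to 4 with columns 4 to 6 form region 2, rows 5 to 6 form region 3) is inferred from the map outputs listed in the example, and the random draws are fixed to the values shown there so that the run is reproducible. In a real job each draw would come from something like `random.randint(1, len(S))`.

```python
from collections import defaultdict

S = {"S1": 5, "S2": 7, "S3": 7, "S4": 8, "S5": 9, "S6": 9}
T = {"T1": 5, "T2": 7, "T3": 7, "T4": 7, "T5": 8, "T6": 9}

# Fixed "random" rows/columns, copied from the example for reproducibility.
ROW_DRAWS = {"S1": 3, "S2": 5, "S3": 1, "S4": 5, "S5": 1, "S6": 2}
COL_DRAWS = {"T1": 6, "T2": 2, "T3": 2, "T4": 3, "T5": 6, "T6": 4}


def regions_for_row(row):
    """Regions intersecting a matrix row (layout is an inferred assumption)."""
    return [1, 2] if row <= 4 else [3]


def regions_for_col(col):
    """Regions intersecting a matrix column (layout is an inferred assumption)."""
    return [1, 3] if col <= 3 else [2, 3]


def map_phase():
    """Route every tuple to all regions crossing its randomly chosen row/column."""
    shuffled = defaultdict(lambda: {"S": [], "T": []})
    for name in S:
        for region in regions_for_row(ROW_DRAWS[name]):
            shuffled[region]["S"].append(name)
    for name in T:
        for region in regions_for_col(COL_DRAWS[name]):
            shuffled[region]["T"].append(name)
    return shuffled


def reduce_phase(shuffled):
    """Each reducer joins the S and T tuples it received."""
    return {
        region: [(s, t) for s in side["S"] for t in side["T"] if S[s] == T[t]]
        for region, side in shuffled.items()
    }


results = reduce_phase(map_phase())
for region in sorted(results):
    print(region, results[region])
```

Because every (row, column) cell belongs to exactly one region, each joining pair meets at exactly one reducer, so no result is lost or duplicated regardless of the draws.
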
  41. Cross Product… NOT!
      • We have verified that the 1-Bucket-Theta (1BT) algorithm is close to optimal when the join is a cross product.
      • How does 1-Bucket-Theta perform when the join is NOT a cross product?
      • We will compare the quality of 1-Bucket-Theta to that of any join algorithm.
  44. 1BT vs ANY join algorithm
      Let $1 \ge x > 0$. Any matrix-to-reducer mapping that has to cover at least $x|S||T|$ of the $|S||T|$ cells of the join matrix has, by Lemma 1, $\mathrm{MRI} \ge 2\sqrt{x|S||T|/r}$.
      [ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.
      As we have seen, 1BT guarantees that $\mathrm{MRI} \le 4\sqrt{|S||T|/r}$.
      Hence, $\dfrac{\mathrm{MRI}_{\mathrm{1BT}}}{\mathrm{MRI}_{\mathrm{AnyJoinAlg}}} \le \dfrac{4\sqrt{|S||T|/r}}{2\sqrt{x|S||T|/r}} = \dfrac{2}{\sqrt{x}}$.
  46. 1BT vs ANY join algorithm
      When x = 0.5, the ratio is < 3.
      Hence, compared to ANY join algorithm that assigns more than 50% of the join matrix cells to reducers, the MRI for 1BT is at most 3 times the MRI of that algorithm.
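
The ratio bound 2/sqrt(x) is easy to tabulate. The sketch below only evaluates that formula for a few fractions x; the function name `mri_ratio_bound` is illustrative.

```python
import math


def mri_ratio_bound(x):
    """Upper bound on MRI(1BT) / MRI(any algorithm covering a fraction x of cells)."""
    if not 0 < x <= 1:
        raise ValueError("x must be in (0, 1]")
    return 2 / math.sqrt(x)


# The bound grows as the fraction of cells that must be covered shrinks.
for x in (1.0, 0.5, 0.25, 0.01):
    print(f"x = {x}: ratio <= {mri_ratio_bound(x):.3f}")
```

For small x the bound becomes loose, which is the motivation for using statistics to avoid covering cells that cannot produce output.
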
  48. M-Bucket-I
      • In the previous slide, we saw that mapping smaller regions, instead of covering the entire matrix, would yield a better MRI.
      • Ideally, we only want to map the cells that satisfy the join condition, but this cannot be done without knowing the input statistics and/or the join condition.
      • M-Bucket-I exploits statistics to improve over the 1-Bucket-Theta join algorithm.
  51. M-Bucket-I [ Step 1 ] Approximate Equi-Depth Histograms
      1) Sample each record of S with probability n/|S|, giving approximately n records from S
      2) Build k-quantiles (k buckets) from the sample, where k < n
      3) Iterate through S and count the number of records in each bucket
      4) Do the same for T and build the join matrix accordingly
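
Steps 1 and 2 can be sketched as follows. The quantile rule used here (take the last element of each of k equal slices of the sorted sample) is an assumption; the slides do not state which rule they use. For the sample of S shown in the slides it yields the boundaries 2, 3, 9, but it does not reproduce the boundaries 1, 5, 8 shown for T, so the original evidently uses a somewhat different rule.

```python
import random


def sample_records(values, n, rng):
    """Keep each record independently with probability n / |values|."""
    p = n / len(values)
    return [v for v in values if rng.random() < p]


def quantile_boundaries(sample, k):
    """Upper boundaries of k roughly equal-depth buckets (assumed rule)."""
    if len(sample) < k:
        raise ValueError("sample must contain at least k records")
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // k - 1] for i in range(1, k + 1)]


# Sample of S as listed on the slides.
print(quantile_boundaries([7, 2, 2, 9, 2, 3], 3))

# Bernoulli sampling on synthetic data; the sample size is only approximately n.
rng = random.Random(0)
sample = sample_records(list(range(1000)), 100, rng)
print(len(sample))
```
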
  56. M-Bucket-I [ Step 1 ] Approximate Equi-Depth Histograms
      Dataset S          Dataset T
      S_id  Value        T_id  Value
      1     7            1     5
      2     2            2     5
      3     4            3     6
      4     2            4     8
      5     1            5     8
      6     9            6     10
      7     10           7     2
      8     2            8     4
      9     5            9     1
      10    3            10    3
      Samples: Sample S = 7, 2, 2, 9, 2, 3; Sample T = 5, 6, 8, 2, 1, 3
      Buckets (boundaries, then the number of records per bucket):
        S: 0 | 2 | 3 | 9 | ∞     counts: 4, 1, 4, 1
        T: 0 | 1 | 5 | 8 | ∞     counts: 1, 5, 3, 1
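
Step 3 (counting records per bucket) can be sketched with a binary search over the bucket boundaries. The sketch assumes upper boundaries 2, 3, 9 for S and 1, 5, 8 for T, as read off the slide, and assumes that each bucket includes its upper boundary; that convention is a guess, chosen because it reproduces the counts 4, 1, 4, 1 and 1, 5, 3, 1.

```python
from bisect import bisect_left

S_VALUES = [7, 2, 4, 2, 1, 9, 10, 2, 5, 3]
T_VALUES = [5, 5, 6, 8, 8, 10, 2, 4, 1, 3]


def bucket_counts(values, upper_bounds):
    """Histogram with buckets (b[i-1], b[i]] plus a final open-ended bucket."""
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        # bisect_left finds the first boundary >= v, i.e. the bucket holding v.
        counts[bisect_left(upper_bounds, v)] += 1
    return counts


s_counts = bucket_counts(S_VALUES, [2, 3, 9])
t_counts = bucket_counts(T_VALUES, [1, 5, 8])
print(s_counts, t_counts)
```
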
  59. M-Bucket-I [ Step 1 ] Approximate Equi-Depth Histograms
      [Figure: join matrix over the ten S records and ten T records, with bucket boundaries 2, 3, 9 on the S axis and 1, 5, 8 on the T axis. Join condition: S.value = T.value]
      We now have candidate cells. How do we map these cells to reducers?
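
One way to derive candidate cells at bucket granularity for an equi-join is to mark a pair of buckets as a candidate when their value ranges overlap. The sketch below assumes half-open buckets (lo, hi] with the boundaries read off the slides; both the bucket convention and the bucket-level granularity are assumptions, and the matrix drawn on the slide may be laid out differently.

```python
import math

# Buckets as (lo, hi] intervals; the last bucket is open-ended.
S_BUCKETS = [(0, 2), (2, 3), (3, 9), (9, math.inf)]
T_BUCKETS = [(0, 1), (1, 5), (5, 8), (8, math.inf)]


def may_join_on_equality(s_bucket, t_bucket):
    """True if some value could fall in both (lo, hi] ranges."""
    lo = max(s_bucket[0], t_bucket[0])
    hi = min(s_bucket[1], t_bucket[1])
    return lo < hi


candidates = [
    (i, j)
    for i, s_bucket in enumerate(S_BUCKETS)
    for j, t_bucket in enumerate(T_BUCKETS)
    if may_join_on_equality(s_bucket, t_bucket)
]
print(candidates)
```

Cells that are not candidates can never produce output, so they need not be assigned to any reducer; for other join conditions only the overlap test changes.
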
  60. M-Bucket-I [ Step 2 ] M-Bucket-I Algorithm
      Algorithm : M-Bucket-I
      Input : maxInput, r, M
        row = 0
        while row < M.noOfRows do
          (row, r) = CoverSubMatrix(row, maxInput, r, M)
          if r < 0 then
            return false
        return true
  61. M-Bucket-I [ Step 2 ] M-Bucket-I Algorithm
      Algorithm : CoverSubMatrix
      Input : row_s, maxInput, r, M
        maxScore = -1, rUsed = 0
        for i = 1 to maxInput-1 do
          R_i = CoverRows(row_s, row_s + i, maxInput, M)
          area = totalCandidateArea(row_s, row_s + i, M)
          score = area / R_i.size
          if score >= maxScore then
            maxScore = score
            bestRow = row_s + i
            rUsed = R_i.size
        r = r - rUsed
        return (bestRow + 1, r)
  62. M-Bucket-I [ Step 2 ] M-Bucket-I Algorithm
      Algorithm : CoverRows
      Input : row_f, row_l, maxInput, M
        Regions = ∅; r = newRegion()
        for all c_i in M.getColumns do
          if r.cap < c_i.candidateInputCosts then
            Regions = Regions ∪ r
            r = newRegion()
          r.Cells = r.Cells ∪ c_i.candidateCells
        return Regions
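  68. M-Bucket-I [ Step 2 ] M-Bucket-I Algorithm
      Run the algorithm with r = 6 and maxInput = 5. Starting from row 0, the score (candidate area divided by the number of regions, labelled "cost" on the slide) for each choice of last row is:
        row 0: 4
        row 1: 13/3 ≈ 4.33
        row 2: 22/4 = 5.5
        row 3: 31/7 ≈ 4.43
      We choose the mapping with the highest score, which yields regions (1), (2), (3), (4).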
  69. M-Bucket-I [ Step 2 ] M-Bucket-I Algorithm
      Run the algorithm with r = 6 and maxInput = 5, continuing after regions (1), (2), (3), (4):
        row 3: cost 3
      And so on and so forth…
  70. M-Bucket-I [ Step 2 ] M-Bucket-I Algorithm
      Run the algorithm with r = 6 and maxInput = 5. Final mapping: regions (1) to (13).
  71. M-Bucket-I [ Step 2 ] M-Bucket-I Algorithm
      Run the algorithm with r = 6 and maxInput = 5: the final mapping uses regions (1) to (13).
      However, we have mapped the candidate cells to > r reducers.
      We do a binary search over maxInput until we reach a mapping that uses <= r reducers.
  74. M-Bucket-I [ Step 3 ] Binary Search
        MaxInput = |S|+|T| = 20     Num. Reducers = 1
        MaxInput = 5                Num. Reducers = 13
        MaxInput = 12               Num. Reducers = 3
        MaxInput = 8                Num. Reducers = 5
      Since 7 reducers are required when MaxInput = 7, we stop the binary search here and output the mapping with MRI = 8.
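
The binary search over maxInput can be sketched as below. Running M-Bucket-I itself is replaced by a stand-in, `regions_needed`, which is a step function built from the five (maxInput, reducers) pairs reported in the slides; for values in between it returns the count of the nearest smaller known maxInput. This is an assumption that is safe only if the number of regions never increases when maxInput grows. The search bounds 5 and 20 and r = 6 come from the example.

```python
from bisect import bisect_right

# (maxInput, regions needed) pairs read off the slides' example.
OBSERVED = [(5, 13), (7, 7), (8, 5), (12, 3), (20, 1)]


def regions_needed(max_input):
    """Stand-in for running M-Bucket-I with the given maxInput (assumed step function)."""
    keys = [k for k, _ in OBSERVED]
    idx = bisect_right(keys, max_input) - 1
    if idx < 0:
        raise ValueError("maxInput below the smallest observed value")
    return OBSERVED[idx][1]


def smallest_feasible_max_input(lo, hi, r, needed):
    """Smallest maxInput in [lo, hi] whose mapping uses at most r regions.

    Assumes needed() is non-increasing in maxInput and needed(hi) <= r.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if needed(mid) <= r:
            hi = mid        # feasible: try a smaller input limit
        else:
            lo = mid + 1    # too many regions: allow larger inputs
    return lo


best = smallest_feasible_max_input(5, 20, 6, regions_needed)
print(best)
```

With these inputs the search probes maxInput = 12, 8, 6 and 7 and settles on 8, the same answer as the slide.
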
  75. Performance: Skew Resistance of 1 Bucket Theta
                                          1 Bucket Theta                   Standard Equi Join
      Data set      Output size      Output        Runtime           Output        Runtime
                    (billion)        Imbalance     (secs)            Imbalance     (secs)
      Synth - 0     25.00            1.0030        657               1.0124        701
      Synth - 0.4   24.99            1.0023        650               1.2541        722
      Synth - 0.6   24.98            1.0033        676               1.7780        923
      Synth - 0.8   24.95            1.0068        678               3.0103        1482
      Synth - 1     24.91            1.0089        667               5.3124        2489
      (Data sets become more skewed from top to bottom.)
      Where Output Imbalance = MRI / Ave.RI
  77. Performance: M-Bucket-I cost details (seconds)
                    Number of Buckets
      Step          1            10           100         1000      10,000    100,000   1,000,000
      Quantiles     0            115          120         117       122       124       122
      Histogram     0            140          145         147       157       167       604
      Heuristic     74.01        9.21         0.84        1.50      16.67     118.03    111.27
      Join          49384        10905        1157        595       548       540       536
      Total         49,458.01    11,169.21    1,422.84    860.5     843.67    949.03    1,373.27
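
The totals in the cost table can be recomputed from the per-step costs, which also makes the trade-off visible: more buckets shrink the join time but raise the histogram and heuristic costs. The sketch below only re-adds the numbers from the table; the variable names are illustrative.

```python
BUCKETS = [1, 10, 100, 1000, 10_000, 100_000, 1_000_000]

# Per-step cost in seconds, one entry per bucket count, copied from the table.
COSTS = {
    "Quantiles": [0, 115, 120, 117, 122, 124, 122],
    "Histogram": [0, 140, 145, 147, 157, 167, 604],
    "Heuristic": [74.01, 9.21, 0.84, 1.50, 16.67, 118.03, 111.27],
    "Join": [49384, 10905, 1157, 595, 548, 540, 536],
}

# Sum the four steps for each bucket count.
totals = [
    round(sum(step[i] for step in COSTS.values()), 2)
    for i in range(len(BUCKETS))
]
best_buckets = BUCKETS[totals.index(min(totals))]

for n_buckets, total in zip(BUCKETS, totals):
    print(f"{n_buckets:>9} buckets: {total:>10.2f} s")
print("lowest total at", best_buckets, "buckets")
```
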
