This document discusses performing a three-way join of relations R, S, and T in a single MapReduce job. It presents two algorithms for performing the join: a nested loop join and a sort-based join. It also discusses how to determine the number of reducers to use, giving an example where using a non-square matrix of reducers can lead to data replication or reducer inefficiency. Experimental results show the three-way join took 37 seconds to complete.
Three-way join in one round on Hadoop
1. Three-way join in one round on Hadoop
COMP 6231
GROUP 7
IRAJ HEDAYATISOMARIN, ZAKARIA NASERELDINE, JINYANG DU
2. Problem statement
R ⋈ S ⋈ T
In this part of the second project, we aimed to compute a three-way join in a single MapReduce round.
[Figure: relations R, S, and T combined as R join S join T]
3. Algorithm Overview
First relation: R(a, b)
Second relation: S(b, c)
Third relation: T(c, d)
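Imagined matrix of reducers (4 × 4 example):
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15
Mapper: for the input tuples R(a, b), S(b, c), and T(c, d), compute h(b) = x and h(c) = y.
<KEY, VALUE> = <(x, y), (relation_name, tuple)>, where (x, y) is the coordinate of a reducer in the imagined matrix of reducers.
Reducer: in-memory join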
4. Mapping and Hashing
<KEY, VALUE> = <(x, y), (relation_name, tuple)>
relation_name: fetched from the input file name
tuple: exactly the same as the input tuple
h(b) = x
h(c) = y
Keys emitted for each input tuple:
First relation: R: (h(b), 1), (h(b), 2), …, (h(b), 11)
Second relation: S: (h(b), h(c))
Third relation: T: (1, h(c)), (2, h(c)), …, (11, h(c))
Reducer # = (x − 1) × (# of reducers per row) + y
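The key assignment on this slide can be sketched in Python as follows. This is a minimal sketch, not the group's Hadoop code: the hash function h, the side length K = 11, and the names map_tuple and reducer_number are assumptions made for illustration, and the reducer number is computed with K as the number of reducers per row.

```python
K = 11  # assumed side length of the square reducer matrix (11 x 11 = 121)


def h(value, k=K):
    """Hypothetical hash function: maps a join attribute to 1..k."""
    return hash(value) % k + 1


def map_tuple(relation_name, tup, k=K):
    """Return the ((x, y), (relation_name, tuple)) pairs for one input tuple."""
    if relation_name == "R":
        # R(a, b): b fixes x; the tuple is replicated to every y.
        _, b = tup
        keys = [(h(b, k), y) for y in range(1, k + 1)]
    elif relation_name == "S":
        # S(b, c): both coordinates are known, so exactly one reducer.
        b, c = tup
        keys = [(h(b, k), h(c, k))]
    elif relation_name == "T":
        # T(c, d): c fixes y; the tuple is replicated to every x.
        c, _ = tup
        keys = [(x, h(c, k)) for x in range(1, k + 1)]
    else:
        raise ValueError(f"unknown relation: {relation_name}")
    return [(key, (relation_name, tup)) for key in keys]


def reducer_number(x, y, k=K):
    """Linear reducer index for coordinate (x, y), with x and y in 1..k."""
    return (x - 1) * k + y
```

With this scheme, matching tuples always meet: the single reducer that receives an S tuple (b, c) also receives every R tuple with the same b and every T tuple with the same c.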
5. In-memory join algorithm
NESTED LOOP JOIN
For each tuple in R
    For each tuple in S
        If R.b == S.b then
            For each tuple in T
                If S.c == T.c then
                    Print (R.a, S.b, S.c, T.d)
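The nested loop join above can be written as a short Python function. This is an illustrative sketch: it assumes each relation is held in memory as a list of 2-tuples, and it returns the joined tuples instead of printing them.

```python
def nested_loop_join(R, S, T):
    """Join R(a, b), S(b, c), and T(c, d) with three nested loops."""
    result = []
    for a, b in R:
        for s_b, c in S:
            if b == s_b:                 # R.b == S.b
                for t_c, d in T:
                    if c == t_c:         # S.c == T.c
                        result.append((a, b, c, d))
    return result


R = [(1, 2), (5, 6)]
S = [(2, 3), (6, 7)]
T = [(3, 4)]
print(nested_loop_join(R, S, T))  # [(1, 2, 3, 4)]
```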
SORT-BASED JOIN ALGORITHM
1. Divide the input list into three sorted lists (one per relation) using the binary search algorithm: O(n log n)
2. Execute the in-memory join algorithm:
• WHILE R and S are both non-empty DO
  • IF the first items in both lists are equal THEN
    • join all the tuples that share this value, then remove them from the lists
  • ELSE
    • take the list with the smaller front item and remove items from it until reaching an item equal to or greater than the front item of the other list
COMPLEXITY
Nested loop join: O(n^3)
Sort-based join:
1. Divide list: O(n log n)
2. In-memory join:
   1. R ⋈ S = O(n)
   2. RS ⋈ T = O(n)
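A sort-based join along these lines might look like the Python sketch below. It is an assumption-laden illustration, not the group's implementation: it sorts with Python's built-in sorted() rather than binary-search insertion, it advances indices instead of removing items from the lists, and the names merge_join and sort_based_three_way_join are hypothetical. The merge step is linear in the input size only when few tuples share the same join value, because every matching pair must still be produced.

```python
def merge_join(left, right, left_key, right_key):
    """Merge join of two lists of tuples on the given key functions."""
    left = sorted(left, key=left_key)
    right = sorted(right, key=right_key)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left_key(left[i]), right_key(right[j])
        if lk < rk:
            i += 1                      # skip the smaller front item
        elif lk > rk:
            j += 1
        else:
            # Find the full run of equal keys on both sides.
            i_end = i
            while i_end < len(left) and left_key(left[i_end]) == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right_key(right[j_end]) == lk:
                j_end += 1
            # Join every pair in the two runs, then move past them.
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    out.append(l + r)
            i, j = i_end, j_end
    return out


def sort_based_three_way_join(R, S, T):
    """Compute R(a, b) join S(b, c) join T(c, d) as two merge joins."""
    rs = merge_join(R, S, lambda r: r[1], lambda s: s[0])     # (a, b, b, c)
    rst = merge_join(rs, T, lambda x: x[3], lambda t: t[0])   # (a, b, b, c, c, d)
    return [(a, b, c, d) for a, b, _, c, _, d in rst]
```

For example, sort_based_three_way_join([(1, 2)], [(2, 3)], [(3, 4)]) returns [(1, 2, 3, 4)].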
6. Number of reducers
We decided to use a square matrix of reducers. This choice constrains the number of reducers we can use: in this case, we had 128 reducers available but used only 121 of them (an 11 × 11 matrix).
On the other hand, selecting a different number of reducers in each dimension leads to more data replication and to reducer inefficiency.
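The 121-out-of-128 figure follows from taking the largest square that fits in the available reducers. A small Python sketch of that arithmetic (the function name square_matrix_side is an assumption for illustration):

```python
import math


def square_matrix_side(available_reducers):
    """Largest k such that a k x k matrix fits in the available reducers."""
    return math.isqrt(available_reducers)


side = square_matrix_side(128)
used = side * side
idle = 128 - used
print(side, used, idle)  # 11 121 7
```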
7. Number of reducers (example 1, replication problem)
[Figure: non-square matrix of reducers; columns numbered 1 to 16, row labels 1 to 4 visible]
# of reducers = 128
Assumption: R >> T, and both have a uniform distribution
T(R) = 1,000,000
T(T) = 1,000
For the square matrix (11 × 11):
Replicated data = 1,000,000 × 11 + 1,000 × 11 = 11,011,000
For the matrix above:
Replicated data = 1,000,000 × 16 + 1,000 × 16 = 16,016,000
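The two totals can be reproduced with a few lines of Python. This sketch only restates the slide's arithmetic; the function name replicated_tuples is hypothetical, and the replication factors (11 and 16 for both relations) are taken directly from the slide.

```python
def replicated_tuples(size_r, size_t, copies_r, copies_t):
    """Total tuples shipped to reducers when R and T are replicated."""
    return size_r * copies_r + size_t * copies_t


size_r, size_t = 1_000_000, 1_000

square = replicated_tuples(size_r, size_t, 11, 11)
non_square = replicated_tuples(size_r, size_t, 16, 16)

print(f"{square:,}")      # 11,011,000
print(f"{non_square:,}")  # 16,016,000
```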
8. Number of reducers (example 1, inefficiency problem)
[Figure: non-square matrix of reducers; columns numbered 1 to 16, with reducers labeled IDLE, FULL, IDLE, FULL]
# of reducers = 128
Assumption: T >> R, and T is not uniformly distributed
T(R) = 1,000
T(T) = 1,000,000
When the hash range is reduced, it is more likely that two values hash to the same location.