Joining Large data at Scale

ETL Pipeline and Joining Large Datasets
Harsha Tenneti

Contents
● ETL Pipeline
● Fault Tolerance
● Joins in Dataframe
● Problem statement
● Issues
● Steps to solve issues

ETL Pipeline
[Diagram: Data Manager with the modules Ingestor, Joiner, Wrangler, and Validator]

Fault Tolerance
● All modules are stateless; the Data Manager assigns jobs to every module.
● The Data Manager holds the state of the entire pipeline in MySQL.
● Each job has a timeout, so a job that fails is started again.
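
The timeout rule above can be sketched in plain Scala. This is only a minimal model under stated assumptions: the `Job` fields, the `DataManagerSketch` object, and the in-memory list are hypothetical stand-ins, since the slides do not show the MySQL schema or the Data Manager's code.

```scala
// Hypothetical job record; the real MySQL schema is not given in the slides.
final case class Job(id: String, module: String, startedAtMs: Long, timeoutMs: Long, attempts: Int)

object DataManagerSketch {
  // A job counts as failed once it has run longer than its timeout.
  def isTimedOut(job: Job, nowMs: Long): Boolean =
    nowMs - job.startedAtMs > job.timeoutMs

  // Restart every timed-out job: reset its start time and count the new attempt.
  // Jobs still within their timeout are returned unchanged.
  def restartTimedOut(jobs: List[Job], nowMs: Long): List[Job] =
    jobs.map { job =>
      if (isTimedOut(job, nowMs)) job.copy(startedAtMs = nowMs, attempts = job.attempts + 1)
      else job
    }
}
```

Because the modules are stateless, restarting a job only requires updating the record held by the Data Manager; no module-side state has to be recovered.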

Joins
● A join needs rows with the same key from each dataset to be in the same partition.
● If the two datasets do not have the same partitioner, the data has to be shuffled so
that the same keys across both datasets lie in the same partition.
● Two of the join strategies used for DataFrames are sort-merge joins and broadcast
joins.
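
To illustrate why a broadcast join avoids the shuffle, here is a minimal sketch using plain Scala collections instead of Spark. The object and function names are hypothetical, and it assumes keys are unique on the small side: the small dataset becomes an in-memory lookup table, and each row of the large dataset probes it wherever that row already lives.

```scala
object BroadcastJoinSketch {
  // Left outer join of a large keyed collection with a small one.
  // The small side is turned into an in-memory Map (the "broadcast" copy),
  // so the large side never has to be repartitioned.
  def leftOuterJoin[K, A, B](large: Seq[(K, A)], small: Seq[(K, B)]): Seq[(K, A, Option[B])] = {
    val lookup: Map[K, B] = small.toMap // assumes unique keys on the small side
    large.map { case (k, a) => (k, a, lookup.get(k)) }
  }
}
```

A sort-merge join, by contrast, needs both sides partitioned and sorted by key, which is where the shuffle comes from.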

Problem Statement
● We need a left outer join of a dataset B (around 50 GB) with 12 datasets (A1...A12),
of which 10 are below 10 MB and 2 are between 25 and 30 MB, using approximately
8 cores.
B.join(A1...A12, "left_outer")
● After the join, we need to do a groupBy and then select one row from each group.
● All files are in Parquet format.

Issues
● The datasets (A1...A12) have to be joined to B one at a time, so this is actually 12
joins.
● Doing a groupBy and then working on each group to select a row leads to
out-of-memory exceptions, because each row is very large.

Steps to solve issues
● Divide the large dataset B into chunks of 500 MB, say (B1...Bn). This makes sure
that the joins and the groupBy work on only 500 MB of data at a time.
● Sort each chunk (B1...Bn) by the join keys, which makes sure that each unique key
of the big dataset resides in a single partition.
● Join each 500 MB chunk with the 12 datasets (A1...A12).
val joinedDF = allEventsDF.foldLeft(sortedBaseSourceDF)((x, y) =>
  x.join(y._2, getJoinColumnExpression(x, y._2, joinKeys, y._1), "left_outer"))
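
The fold over the small datasets can be modelled without Spark. The sketch below is an illustration, not the code from the slides: `Row` as a `Map[String, String]`, the single key column, and the assumption of at most one match per key on the right side are all hypothetical simplifications.

```scala
object FoldJoinSketch {
  type Row = Map[String, String] // column name -> value (hypothetical row model)

  // Left outer join on one key column; assumes at most one match per key on the right.
  def leftOuter(left: Seq[Row], right: Seq[Row], key: String): Seq[Row] = {
    val byKey: Map[String, Row] = right.flatMap(r => r.get(key).map(_ -> r)).toMap
    left.map { l =>
      l.get(key).flatMap(byKey.get) match {
        case Some(r) => l ++ r // matched: add the right-hand columns
        case None    => l      // unmatched: keep the left row as it is
      }
    }
  }

  // Fold the small datasets into the base chunk one at a time.
  def joinAll(base: Seq[Row], small: Seq[Seq[Row]], key: String): Seq[Row] =
    small.foldLeft(base)((acc, next) => leftOuter(acc, next, key))
}
```

Each step of the fold takes the result of the previous join as its left side, which is why 12 small datasets mean 12 successive joins per chunk.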

Contd...
● The next task is to do a groupBy on each joined 500 MB chunk.
● Because working on entire rows gave us out-of-memory exceptions, we added a
hash code column to the joined dataset and then selected only the required columns
along with the hash code.
● We do a mapPartitions on the joined dataset and take an iterator of 100 rows at a
time from each partition.
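
A minimal sketch of these two ideas in plain Scala follows. The `WideRow` and `SlimRow` types and the use of the case class `hashCode` are assumptions made for illustration; the slides do not say how the hash column was computed, and a real hash would need to be chosen so that collisions between different rows are unlikely.

```scala
object HashAndChunkSketch {
  final case class WideRow(key: String, score: Int, payload: String) // hypothetical wide row
  final case class SlimRow(key: String, score: Int, hashCol: Int)

  // Keep only the columns needed for the selection, plus a hash identifying the full row.
  def slim(row: WideRow): SlimRow = SlimRow(row.key, row.score, row.hashCode)

  // Walk one partition lazily in batches of `batchSize` rows, as a function
  // passed to mapPartitions could do; only one batch is materialised at a time.
  def batches(partition: Iterator[WideRow], batchSize: Int = 100): Iterator[Seq[SlimRow]] =
    partition.map(slim).grouped(batchSize).map(_.toSeq)
}
```

Working on slim rows in small batches keeps the memory needed per task bounded by the batch size rather than by the size of a whole group of wide rows.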

Contd...
● As we work on only 100 rows at a time, we do an aggregateByKey, which has a
combining stage that combines the same keys across the 100-row chunks and a
merging stage that combines the same keys across the partitions.
val allEventsResponseRDD = reqDF
  .mapPartitions(makingATuple)
  .aggregateByKey(List[(Int, Row)]())((x, y) => (y._1, y._2) :: x, reduceListFunc)
● We join the resultant dataset back to the full joined dataset on hashCol to get all
the other columns.
val allEventsResponseFullDF = rowWithHashDF
  .join(allEventsResponseDF, rowWithHashDF("hashCol") === allEventsResponseDF("hashCol"), "inner")
  .drop(allEventsResponseDF("hashCol"))
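
The two stages of aggregateByKey can be modelled with plain Scala collections. This is a sketch under stated assumptions, not the implementation from the slides: the `(score, hashCol)` value layout, the `seqOp` and `combOp` bodies, and the "highest score wins" selection rule are hypothetical, because the slides do not show `makingATuple`, `reduceListFunc`, or the selection logic.

```scala
object AggregateSketch {
  type Slim = (String, (Int, Int)) // (group key, (score, hashCol)), hypothetical layout

  // seqOp: fold one row into the list kept for its key inside a partition.
  def seqOp(acc: List[(Int, Int)], v: (Int, Int)): List[(Int, Int)] = v :: acc
  // combOp: merge the lists built for the same key in different partitions.
  def combOp(a: List[(Int, Int)], b: List[(Int, Int)]): List[(Int, Int)] = a ::: b

  // Plain-collections model of aggregateByKey over a list of partitions.
  def aggregateByKey(partitions: Seq[Seq[Slim]]): Map[String, List[(Int, Int)]] = {
    val perPartition = partitions.map { p =>
      p.foldLeft(Map.empty[String, List[(Int, Int)]]) { case (m, (k, v)) =>
        m.updated(k, seqOp(m.getOrElse(k, Nil), v))
      }
    }
    perPartition.foldLeft(Map.empty[String, List[(Int, Int)]]) { (merged, m) =>
      m.foldLeft(merged) { case (acc, (k, vs)) =>
        acc.updated(k, acc.get(k).fold(vs)(combOp(_, vs)))
      }
    }
  }

  // Select one row per group (highest score, an assumed rule) and return its hash.
  def selectedHashes(groups: Map[String, List[(Int, Int)]]): Set[Int] =
    groups.values.map(_.maxBy(_._1)._2).toSet
}
```

The hashes returned by `selectedHashes` play the role of the slim result that is joined back on the hash column, so the wide rows are only touched once the selection has been made.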

Contd...
● We now have resultant datasets (c1...cn), one for each chunk (B1...Bn) of B.
● We take the union of all datasets c1...cn to get the final dataset D.
