ETL, Pivoting and Handling Small File Problems in Spark
Extracting, applying several transformations and finally loading the summarized data into Hive
is the most important part of Data Warehousing. We face various types of problems in
Spark when developing basic Data Quality checking, so it is always
recommended to pass the data through custom Data Quality checking steps like the
following (a small sketch of a couple of these checks appears after the list):
1. Null Checking in String Field
2. Null checking in Numeric Field
3. Alphanumeric Characters in Numeric Field
4. Data Type selection on the basis of future requirements
5. Data format conversion (Most Important)
6. Filter Data
7. Address, SSN, Telephone, Email ID validation, etc.
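As a rough sketch of two of the checks above (function names and rules are my own assumptions, not taken from the original), such checks can be plain Scala helpers reused inside the RDD pipeline shown later:
// Hedged sketch: 3. reject alphanumeric garbage in a numeric field
def isCleanNumeric(s: String): Boolean =
  s != null && s.trim.matches("-?\\d+(\\.\\d+)?")

// Hedged sketch: 7. a very rough email-ID validation
def isValidEmail(s: String): Boolean =
  s != null && s.trim.matches("^[\\w.+-]+@[\\w-]+\\.[\\w.]+$")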
In the transformation phase Spark demands many User Defined Functions (UDFs) as our
requirements grow more complex. Typical transformations include the following (a short
sketch of how a few of them map onto Spark follows the list):
1. Aggregation
2. Routing
3. Normalization
4. De-Normalization
5. Intelligent Counter
6. Lookup
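As a hedged sketch of how a few of these map onto the Spark DataFrame API (df is the DataFrame built later in this article, and countryDF with its player_name column is a hypothetical lookup table):
import org.apache.spark.sql.functions.{sum, col, broadcast}

// Aggregation: total runs per player
val totals = df.groupBy("name").agg(sum("run_scored"))

// Routing: split one stream into two by a condition
val centuries = df.filter(col("run_scored") >= 100)
val others    = df.filter(col("run_scored") <  100)

// Lookup: broadcast-join a small reference DataFrame (countryDF is hypothetical)
val enriched = df.join(broadcast(countryDF), df("name") === countryDF("player_name"), "left_outer")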
The load phase means putting your temporary table into Hive, HBase or Cassandra and using a
visualization tool to show the outcome.
Now this article looks into another really important aspect: handling small files in Spark. A
useful rule of thumb to keep in mind is: “Don’t let your partition volume get too high (greater
than 2 GB), and don’t make it too small either, which will cause overhead problems.”
My data source consists of many small files, so do look at this step:
The execution plan itself shows the beauty of this hack and the efficient use of a broadcast
variable in Spark.
This will definitely reduce your I/O overhead and provide a better result
in terms of performance.
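Since the screenshots of that step and its execution plan are not reproduced here, the following is only a rough sketch of one common approach in the same spirit, not necessarily the author's exact code: read the many small files, coalesce them into fewer partitions, and broadcast a small lookup dataset so each executor keeps a local copy. The path, lookup map and column positions are assumptions.
// Hedged sketch: compact many small input files and use a broadcast variable
val rawData   = sc.textFile("/data/cricket/scores/*.csv")      // assumed path with many small files
val compacted = rawData.coalesce(8)                            // fewer, larger partitions

val playerCountry = Map("PlayerA" -> "IND", "PlayerB" -> "AUS") // assumed small lookup
val bcLookup  = sc.broadcast(playerCountry)                     // shipped once per executor
val enriched  = compacted.map { line =>
  val cols = line.split(",")
  (cols(0), bcLookup.value.getOrElse(cols(0), "UNKNOWN"))
}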
So the data source is something like this:
The schema goes like this:
Now this data has different null problems, so we need to create custom functions at the
RDD level and format the data.
Another problem with this data is that the date format was not the same throughout the file:
in some places it is dd/mm/yyyy and in others dd-mm-yyyy. So a serious amount of Data
Quality and conversion checking was required.
val dataRDD = data.map(line => line.split(","))
  .map(line => ScoreRecord(
    checkStrNull(line(0)).trim, checkStrNull(line(1)).trim, checkStrNull(line(2)).trim,
    checkStrNull(line(3)).trim, checkStrNull(line(4)).trim, checkStrNull(line(5)).trim,
    checkNumericNull(line(6)).trim.toInt, checkNumericNull(line(7)).trim.toDouble,
    checkNumericNull(line(8)).trim.toInt, checkNumericNull(line(9)).trim.toDouble,
    checkNumericNull(line(10)).trim.toInt))
This has the required conversion and checking.
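The ScoreRecord case class and the two check functions are not shown in the article. A minimal sketch, assuming eleven columns whose names (apart from name, date_of_match and run_scored, which appear in the queries below) are my own guesses, could look like this:
// Assumed sketch only: field names beyond name, date_of_match and run_scored are guesses
case class ScoreRecord(name: String, team: String, opponent: String, venue: String,
                       match_type: String, date_of_match: String,
                       run_scored: Int, strike_rate: Double, balls_faced: Int,
                       average: Double, hundreds: Int)

// Replace empty/"null" strings with a default so the .trim/.toInt calls above never blow up
def checkStrNull(s: String): String =
  if (s == null || s.trim.isEmpty || s.trim.equalsIgnoreCase("null")) "NA" else s

def checkNumericNull(s: String): String =
  if (s == null || !s.trim.matches("-?\\d+(\\.\\d+)?")) "0" else s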
I then developed Spark SQL UDFs to handle the data conversion problem, so my code goes
like this:
df.registerTempTable("cricket_data")
val result = sqlContext.sql("""
  select name, year,
         case when month in (10,11,12) then 'Q4'
              when month in (7,8,9)    then 'Q3'
              when month in (4,5,6)    then 'Q2'
              when month in (1,2,3)    then 'Q1'
         end Quarter,
         run_scored
  from (select name,
               year(convert(REPLACE(date_of_match,'/','-')))  as year,
               month(convert(REPLACE(date_of_match,'/','-'))) as month,
               run_scored
        from cricket_data) C""")
convert and REPLACE are custom UDFs for this job; a hedged sketch of how they might be
defined follows.
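The article does not show how these two UDFs are implemented. A minimal sketch, assuming convert normalizes a dd-mm-yyyy string into an SQL date (so that year() and month() work on it) and REPLACE is a plain string replacement, could be:
// Assumed sketch only; the real UDF definitions are not shown in the article
import java.text.SimpleDateFormat
import java.sql.Date

sqlContext.udf.register("REPLACE",
  (s: String, from: String, to: String) => s.replace(from, to))

sqlContext.udf.register("convert", (s: String) => {
  val in  = new SimpleDateFormat("dd-MM-yyyy")   // after REPLACE, all dates use '-'
  val out = new SimpleDateFormat("yyyy-MM-dd")
  Date.valueOf(out.format(in.parse(s)))          // yyyy-MM-dd as java.sql.Date
})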
Now this query gives me a result like this:
Now in Data Warehouse terms this is very inefficient data, as the business user
demands summarized data with full visibility across the whole time span.
Here in ETL we use a component called a “De-Normalizer” [in Informatica].
So it requires transformations like:
The Aggregator has a sorter which sorts the data first and then applies the aggregation.
These are costly transformations in ETL terms: with a data volume of 1 billion rows
they suffer badly because of less efficient caching and data mapping.
Spark gives a brilliant solution to pivot the data in a single line:
val result_pivot = result.groupBy("name","year").pivot("Quarter").agg(sum("run_scored"))
This pivots the data and transposes a huge volume of data within a few minutes once an
action is triggered on the result (groupBy/pivot/agg themselves are transformations, not actions).
The data goes like this:
Explain Plan for the Query
Explain Plan for the Pivot
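The plan screenshots are not reproduced here; the same plans can be printed from the Spark shell with explain():
result.explain(true)        // extended logical and physical plan of the quarter query
result_pivot.explain(true)  // plan of the groupBy/pivot/agg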
We load this summarized data into Hive and show it to the end user. This is how my table got
stored in Hive:
Data in Hive
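As a hedged sketch (the table name is an assumption and a HiveContext is required), persisting the pivoted summary into Hive can be as simple as:
// Minimal sketch: save the pivoted result as a Hive table ("cricket_summary" is an assumed name)
result_pivot.write.mode("overwrite").saveAsTable("cricket_summary")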
A very simple way to handle ETL in Spark! 
