SlideShare a Scribd company logo
ETL, pivoting and Handling Small File Problems in Spark
Extracting, Putting several transformation and Finally Loading the summarized data into hive
is the most important part of Data Warehousing. Now we face various types of problems in
spark in terms of developing you basic Data Quality Checking. So it is always
recommendable to pass the through the Data with Custom Data Quality checking steps like:
1. Null Checking in String Field
2. Null checking in Numeric Field
3. Alfa-Numeric Characters in Numeric field
4. Data Type selection on the basis of future requirements
5. Data format conversion(Most Important)
6. Filter Data
7. Address, SSN, Telephone, Email id validation etc.
In the transformation phase Spark demands many User Defined Functions as our
requirement goes more complex
Transformation like:
1. Aggregation
2. Routing
3. Normalization
4. De-Normalization
5. Intelligent Counter
6. Lookup
Load phase is putting your temporary table into Hive or HBase or Cassandra and use any
Visualization tool to show the outcome.
Now this article looks into another aspect of Small files handling in Spark which is really
important. It is said to keep in mind that “Don’t let your partition volume too high (Greater
than 2GB and don’t even make it too small which will cause overhead problem”
Now my data source is consisting of many small files, so do look at this step:
Now this execution plan itself shows the beauty of this hack and efficient use of broadcast
variable in spark.
This will definitely reduce down your I/O overhead problems for and provide a better result
in terms of performance.
So the data source is something like this:
The schema goes like this:
Now this data have different null problems where we need to create custom function in
RDD level and format the data.
Another problem with this data is the date format was not same throughout the file,
somewhere it’s like dd/mm/yyyy and somewhere dd-mm-yyyy. So serious amount of Data
Quality and conversion checking was required.
val dataRDD = data.map(line =>
line.split(",")).map(line=>ScoreRecord(checkStrNull(line(0)).trim,checkStrNull(line(1)).trim
,checkStrNull(line(2)).trim,checkStrNull(line(3)).trim,checkStrNull(line(4)).trim,checkStrNul
l(line(5)).trim,checkNumericNull(line(6)).trim.toInt,checkNumericNull(line(7)).trim.toDoub
le,checkNumericNull(line(8)).trim.toInt,checkNumericNull(line(9)).trim.toDouble,checkNu
mericNull(line(10)).trim.toInt));
This has the required conversion and checking.
Now I developed Spark SQL UDF to handle the data conversion problem, So my code goes
like this
df.registerTempTable("cricket_data");
val result = sqlContext.sql("select name,year,case when month in (10,11,12) then 'Q4'
when month in (7,8,9) then 'Q3' when month in (4,5,6) then 'Q2' when month in(1,2,3)
then 'Q1' end Quarter, run_scored from (select
name,year(convert(REPLACE(date_of_match,'/','-'))) as
year,month(convert(REPLACE(date_of_match,'/','-'))) as month,run_scored from
cricket_data) C");
Convert and REPLACE are custom UDF for this Job
Now this query gives me a result like this:
Now in terms of Data Warehouse this is very inefficient data. As the business user
demands summarized data with full visibility throughout the timestamp.
Here in ETL we use a component called “De-Normalizer” [In Informatica]
So it required transformations like:
Aggregator has a sorter which sorts the data first and then implements the aggregation.
Now these are costly transformations in terms of ETL. If we are having data volume 1 Billion
it suffers a big time due to less efficient cache and data mapping
Spark gives a brilliant solution to pivot ta the data in a single line:
val result_pivot = result.groupBy("name","year").pivot("Quarter").agg(sum("run_scored"))
This is an action which pivots the data and transposes huge volume of data within few
minutes.
The data goes like this:
Explain Plan for the Query
Explain Plan for the Pivot
We Load this summarized data in hive and show to the End user , So this how my table got
stored in hive.
Data in Hive
A very simple way to handle ETL in Spark! 

More Related Content

What's hot

ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300
Mark Kromer
 
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Databricks
 
A High Performance Mutable Engagement Activity Delta Lake
A High Performance Mutable Engagement Activity Delta LakeA High Performance Mutable Engagement Activity Delta Lake
A High Performance Mutable Engagement Activity Delta Lake
Databricks
 
The immutable database datomic
The immutable database   datomicThe immutable database   datomic
The immutable database datomic
Laurence Chen
 
HadoopDB
HadoopDBHadoopDB
HadoopDB
Miguel Pastor
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
Databricks
 
Everyday Probabilistic Data Structures for Humans
Everyday Probabilistic Data Structures for HumansEveryday Probabilistic Data Structures for Humans
Everyday Probabilistic Data Structures for Humans
Databricks
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Databricks
 
Azure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power QueryAzure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power Query
Mark Kromer
 
SCALE - Stream processing and Open Data, a match made in Heaven
SCALE - Stream processing and Open Data, a match made in HeavenSCALE - Stream processing and Open Data, a match made in Heaven
SCALE - Stream processing and Open Data, a match made in Heaven
Nicolas Fränkel
 
Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)
Mark Kromer
 
Write intensive workloads and lsm trees
Write intensive workloads and lsm treesWrite intensive workloads and lsm trees
Write intensive workloads and lsm trees
Tilak Patidar
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
Databricks
 
Hundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario SpacagnaHundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario Spacagna
Spark Summit
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
 
Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005
Mark Kromer
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
Databricks
 

What's hot (20)

ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300
 
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
 
A High Performance Mutable Engagement Activity Delta Lake
A High Performance Mutable Engagement Activity Delta LakeA High Performance Mutable Engagement Activity Delta Lake
A High Performance Mutable Engagement Activity Delta Lake
 
The immutable database datomic
The immutable database   datomicThe immutable database   datomic
The immutable database datomic
 
HadoopDB
HadoopDBHadoopDB
HadoopDB
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
 
Everyday Probabilistic Data Structures for Humans
Everyday Probabilistic Data Structures for HumansEveryday Probabilistic Data Structures for Humans
Everyday Probabilistic Data Structures for Humans
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
 
Azure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power QueryAzure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power Query
 
SCALE - Stream processing and Open Data, a match made in Heaven
SCALE - Stream processing and Open Data, a match made in HeavenSCALE - Stream processing and Open Data, a match made in Heaven
SCALE - Stream processing and Open Data, a match made in Heaven
 
Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)
 
Write intensive workloads and lsm trees
Write intensive workloads and lsm treesWrite intensive workloads and lsm trees
Write intensive workloads and lsm trees
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
 
Hundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario SpacagnaHundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario Spacagna
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
 

Viewers also liked

ELECTRÓNICA+RADIO+TV. Tomo IV: AMPLIFICADORES B.F. ALTAVOCES. VÁLVULAS AMPLIF...
ELECTRÓNICA+RADIO+TV. Tomo IV: AMPLIFICADORES B.F. ALTAVOCES. VÁLVULAS AMPLIF...ELECTRÓNICA+RADIO+TV. Tomo IV: AMPLIFICADORES B.F. ALTAVOCES. VÁLVULAS AMPLIF...
ELECTRÓNICA+RADIO+TV. Tomo IV: AMPLIFICADORES B.F. ALTAVOCES. VÁLVULAS AMPLIF...
Gabriel Araceli
 
SEVEN-ÉFESO
SEVEN-ÉFESOSEVEN-ÉFESO
SEVEN-ÉFESO
Hugo Machado
 
Dynamic width file in Spark
Dynamic width file in SparkDynamic width file in Spark
Dynamic width file in Spark
Subhasish Guha
 
ELECTRÓNICA+RADIO+TV. Tomo III: DETECTORES. OSCILADORES. AMPLIFICADORES. Apén...
ELECTRÓNICA+RADIO+TV. Tomo III: DETECTORES. OSCILADORES. AMPLIFICADORES. Apén...ELECTRÓNICA+RADIO+TV. Tomo III: DETECTORES. OSCILADORES. AMPLIFICADORES. Apén...
ELECTRÓNICA+RADIO+TV. Tomo III: DETECTORES. OSCILADORES. AMPLIFICADORES. Apén...
Gabriel Araceli
 
A Igreja que queremos ser
A Igreja que queremos serA Igreja que queremos ser
A Igreja que queremos ser
Hugo Machado
 
Libertad probatoria en la prueba testimonial del niño, niña en el proceso civ...
Libertad probatoria en la prueba testimonial del niño, niña en el proceso civ...Libertad probatoria en la prueba testimonial del niño, niña en el proceso civ...
Libertad probatoria en la prueba testimonial del niño, niña en el proceso civ...
David Segovia
 
The salesforce nugget volume 4 (link campaign to opportunity)
The salesforce nugget  volume 4 (link campaign to opportunity)The salesforce nugget  volume 4 (link campaign to opportunity)
The salesforce nugget volume 4 (link campaign to opportunity)
Evan Natelson
 
Eosinophilic Gastroenteritis
Eosinophilic GastroenteritisEosinophilic Gastroenteritis
Eosinophilic Gastroenteritis
Krishnkant Rawal
 
Lem resume_pro
Lem  resume_proLem  resume_pro
Lem resume_pro
Lem Hernandez
 
Technical training.pptx (1)
Technical training.pptx (1)Technical training.pptx (1)
Technical training.pptx (1)
RSW Technology Guide
 
Tp3 power1
Tp3 power1Tp3 power1
Tp3 power1
Romina Pontiroli
 
Winning The Race To Value - Vendavo in Aftermarket Spare Parts Industries _ I...
Winning The Race To Value - Vendavo in Aftermarket Spare Parts Industries _ I...Winning The Race To Value - Vendavo in Aftermarket Spare Parts Industries _ I...
Winning The Race To Value - Vendavo in Aftermarket Spare Parts Industries _ I...
Caroline Burns
 

Viewers also liked (13)

ELECTRÓNICA+RADIO+TV. Tomo IV: AMPLIFICADORES B.F. ALTAVOCES. VÁLVULAS AMPLIF...
ELECTRÓNICA+RADIO+TV. Tomo IV: AMPLIFICADORES B.F. ALTAVOCES. VÁLVULAS AMPLIF...ELECTRÓNICA+RADIO+TV. Tomo IV: AMPLIFICADORES B.F. ALTAVOCES. VÁLVULAS AMPLIF...
ELECTRÓNICA+RADIO+TV. Tomo IV: AMPLIFICADORES B.F. ALTAVOCES. VÁLVULAS AMPLIF...
 
SEVEN-ÉFESO
SEVEN-ÉFESOSEVEN-ÉFESO
SEVEN-ÉFESO
 
Dynamic width file in Spark
Dynamic width file in SparkDynamic width file in Spark
Dynamic width file in Spark
 
ELECTRÓNICA+RADIO+TV. Tomo III: DETECTORES. OSCILADORES. AMPLIFICADORES. Apén...
ELECTRÓNICA+RADIO+TV. Tomo III: DETECTORES. OSCILADORES. AMPLIFICADORES. Apén...ELECTRÓNICA+RADIO+TV. Tomo III: DETECTORES. OSCILADORES. AMPLIFICADORES. Apén...
ELECTRÓNICA+RADIO+TV. Tomo III: DETECTORES. OSCILADORES. AMPLIFICADORES. Apén...
 
A Igreja que queremos ser
A Igreja que queremos serA Igreja que queremos ser
A Igreja que queremos ser
 
Libertad probatoria en la prueba testimonial del niño, niña en el proceso civ...
Libertad probatoria en la prueba testimonial del niño, niña en el proceso civ...Libertad probatoria en la prueba testimonial del niño, niña en el proceso civ...
Libertad probatoria en la prueba testimonial del niño, niña en el proceso civ...
 
The salesforce nugget volume 4 (link campaign to opportunity)
The salesforce nugget  volume 4 (link campaign to opportunity)The salesforce nugget  volume 4 (link campaign to opportunity)
The salesforce nugget volume 4 (link campaign to opportunity)
 
Eosinophilic Gastroenteritis
Eosinophilic GastroenteritisEosinophilic Gastroenteritis
Eosinophilic Gastroenteritis
 
Lem resume_pro
Lem  resume_proLem  resume_pro
Lem resume_pro
 
Resume
ResumeResume
Resume
 
Technical training.pptx (1)
Technical training.pptx (1)Technical training.pptx (1)
Technical training.pptx (1)
 
Tp3 power1
Tp3 power1Tp3 power1
Tp3 power1
 
Winning The Race To Value - Vendavo in Aftermarket Spare Parts Industries _ I...
Winning The Race To Value - Vendavo in Aftermarket Spare Parts Industries _ I...Winning The Race To Value - Vendavo in Aftermarket Spare Parts Industries _ I...
Winning The Race To Value - Vendavo in Aftermarket Spare Parts Industries _ I...
 

Similar to ETL and pivoting in spark

Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Databricks
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptx
Knoldus Inc.
 
Big Data Transformations Powered By Spark
Big Data Transformations Powered By SparkBig Data Transformations Powered By Spark
Big Data Transformations Powered By Spark
Knoldus Inc.
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
Saikiran Panjala
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Spark Summit
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
DataWorks Summit
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Grega Kespret
 
SPL_ALL_EN.pptx
SPL_ALL_EN.pptxSPL_ALL_EN.pptx
SPL_ALL_EN.pptx
政宏 张
 
Data ware house architecture
Data ware house architectureData ware house architecture
Data ware house architecture
Deepak Chaurasia
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
Provectus
 
Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)
Lucas Jellema
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Databricks
 
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Movile Internet Movel SA: A Change of Seasons: A big move to Apache CassandraMovile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
DataStax Academy
 
Cassandra Summit 2015 - A Change of Seasons
Cassandra Summit 2015 - A Change of SeasonsCassandra Summit 2015 - A Change of Seasons
Cassandra Summit 2015 - A Change of Seasons
Eiti Kimura
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
What Are the Key Steps in Scraping Product Data from Amazon India.pptx
What Are the Key Steps in Scraping Product Data from Amazon India.pptxWhat Are the Key Steps in Scraping Product Data from Amazon India.pptx
What Are the Key Steps in Scraping Product Data from Amazon India.pptx
Productdata Scrape
 
What Are the Key Steps in Scraping Product Data from Amazon India.pdf
What Are the Key Steps in Scraping Product Data from Amazon India.pdfWhat Are the Key Steps in Scraping Product Data from Amazon India.pdf
What Are the Key Steps in Scraping Product Data from Amazon India.pdf
Productdata Scrape
 

Similar to ETL and pivoting in spark (20)

Ibm redbook
Ibm redbookIbm redbook
Ibm redbook
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
ETL
ETL ETL
ETL
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptx
 
Big Data Transformations Powered By Spark
Big Data Transformations Powered By SparkBig Data Transformations Powered By Spark
Big Data Transformations Powered By Spark
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
SPL_ALL_EN.pptx
SPL_ALL_EN.pptxSPL_ALL_EN.pptx
SPL_ALL_EN.pptx
 
Data ware house architecture
Data ware house architectureData ware house architecture
Data ware house architecture
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 
Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Movile Internet Movel SA: A Change of Seasons: A big move to Apache CassandraMovile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
 
Cassandra Summit 2015 - A Change of Seasons
Cassandra Summit 2015 - A Change of SeasonsCassandra Summit 2015 - A Change of Seasons
Cassandra Summit 2015 - A Change of Seasons
 
My C.V
My C.VMy C.V
My C.V
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
What Are the Key Steps in Scraping Product Data from Amazon India.pptx
What Are the Key Steps in Scraping Product Data from Amazon India.pptxWhat Are the Key Steps in Scraping Product Data from Amazon India.pptx
What Are the Key Steps in Scraping Product Data from Amazon India.pptx
 
What Are the Key Steps in Scraping Product Data from Amazon India.pdf
What Are the Key Steps in Scraping Product Data from Amazon India.pdfWhat Are the Key Steps in Scraping Product Data from Amazon India.pdf
What Are the Key Steps in Scraping Product Data from Amazon India.pdf
 

Recently uploaded

The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
AhmedHussein950959
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
ssuser9bd3ba
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
Kamal Acharya
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSE
TECHNICAL TRAINING MANUAL   GENERAL FAMILIARIZATION COURSETECHNICAL TRAINING MANUAL   GENERAL FAMILIARIZATION COURSE
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSE
DuvanRamosGarzon1
 
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
Intella Parts
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 

Recently uploaded (20)

The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSE
TECHNICAL TRAINING MANUAL   GENERAL FAMILIARIZATION COURSETECHNICAL TRAINING MANUAL   GENERAL FAMILIARIZATION COURSE
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSE
 
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 

ETL and pivoting in spark

  • 1. ETL, pivoting and Handling Small File Problems in Spark Extracting, Putting several transformation and Finally Loading the summarized data into hive is the most important part of Data Warehousing. Now we face various types of problems in spark in terms of developing you basic Data Quality Checking. So it is always recommendable to pass the through the Data with Custom Data Quality checking steps like: 1. Null Checking in String Field 2. Null checking in Numeric Field 3. Alfa-Numeric Characters in Numeric field 4. Data Type selection on the basis of future requirements 5. Data format conversion(Most Important) 6. Filter Data 7. Address, SSN, Telephone, Email id validation etc. In the transformation phase Spark demands many User Defined Functions as our requirement goes more complex Transformation like: 1. Aggregation 2. Routing 3. Normalization 4. De-Normalization 5. Intelligent Counter 6. Lookup Load phase is putting your temporary table into Hive or HBase or Cassandra and use any Visualization tool to show the outcome. Now this article looks into another aspect of Small files handling in Spark which is really important. It is said to keep in mind that “Don’t let your partition volume too high (Greater than 2GB and don’t even make it too small which will cause overhead problem” Now my data source is consisting of many small files, so do look at this step:
  • 2. Now this execution plan itself shows the beauty of this hack and efficient use of broadcast variable in spark. This will definitely reduce down your I/O overhead problems for and provide a better result in terms of performance. So the data source is something like this: The schema goes like this: Now this data have different null problems where we need to create custom function in RDD level and format the data. Another problem with this data is the date format was not same throughout the file, somewhere it’s like dd/mm/yyyy and somewhere dd-mm-yyyy. So serious amount of Data Quality and conversion checking was required. val dataRDD = data.map(line => line.split(",")).map(line=>ScoreRecord(checkStrNull(line(0)).trim,checkStrNull(line(1)).trim ,checkStrNull(line(2)).trim,checkStrNull(line(3)).trim,checkStrNull(line(4)).trim,checkStrNul l(line(5)).trim,checkNumericNull(line(6)).trim.toInt,checkNumericNull(line(7)).trim.toDoub le,checkNumericNull(line(8)).trim.toInt,checkNumericNull(line(9)).trim.toDouble,checkNu mericNull(line(10)).trim.toInt)); This has the required conversion and checking. Now I developed Spark SQL UDF to handle the data conversion problem, So my code goes like this
  • 3. df.registerTempTable("cricket_data"); val result = sqlContext.sql("select name,year,case when month in (10,11,12) then 'Q4' when month in (7,8,9) then 'Q3' when month in (4,5,6) then 'Q2' when month in(1,2,3) then 'Q1' end Quarter, run_scored from (select name,year(convert(REPLACE(date_of_match,'/','-'))) as year,month(convert(REPLACE(date_of_match,'/','-'))) as month,run_scored from cricket_data) C"); Convert and REPLACE are custom UDF for this Job Now this query gives me a result like this: Now in terms of Data Warehouse this is very inefficient data. As the business user demands summarized data with full visibility throughout the timestamp. Here in ETL we use a component called “De-Normalizer” [In Informatica] So it required transformations like:
  • 4. Aggregator has a sorter which sorts the data first and then implements the aggregation. Now these are costly transformations in terms of ETL. If we are having data volume 1 Billion it suffers a big time due to less efficient cache and data mapping Spark gives a brilliant solution to pivot ta the data in a single line: val result_pivot = result.groupBy("name","year").pivot("Quarter").agg(sum("run_scored")) This is an action which pivots the data and transposes huge volume of data within few minutes. The data goes like this: Explain Plan for the Query
  • 5. Explain Plan for the Pivot We Load this summarized data in hive and show to the End user , So this how my table got stored in hive. Data in Hive
  • 6. A very simple way to handle ETL in Spark! 