Data Quality, Correctness and Dynamic Transformations using Spark and Scala
Apache Spark has been one of the leading big data processing engines for a long time. Considering the growth in the volume and variety of data, building a framework on Spark is an inevitable choice. As data sources grow exponentially, it is also important to maintain valid data in your data lake. By definition, a data lake should be able to pull data of any format into the file system and extract the relevant information from it. There are multiple open source technologies today that convert different types of data into text format, but there are very few applications that check the quality and correctness of the source data. Another important point is that the number of data sources is now far higher than in a traditional data warehouse system. A client that earlier processed around a hundred structured source files landing within a defined timeline now processes more than 50 times that number, with no fixed landing schedule and unknown volumes. Semi-structured data, unstructured data, sensor data and IoT data are growing rapidly, so it is really important to be able to distinguish a correct record from an incorrect one. Therefore, the fundamental way to start a data lake is to design a flexible framework that can handle a variety of data, perform sets of data quality checks, implement linear business transformations and lookup services, and provide a data correctness module that recognizes when some manual input has gone wrong and fixes the data at runtime. This white paper gives an insight into how to create generic services to process data within the data lake. The tool used here is Spark, with the services developed in Scala.
Data Quality: Data quality is an important service for a framework and should handle standard data quality checks as well as custom DQ functions. The best way to implement these functions is at the Row RDD level: create a data quality package that consumes the checks from the metadata table and applies them to each row of the RDD.
import org.apache.spark.sql.Row
import scala.util.Try
def DataQuality(row: Row, checkMap: Map[String, String]): Row = {
  var errorColumns = ""
  var errorDescription = ""
  // Loop over the check map (column name -> rule) and run the configured check.
  checkMap.foreach { case (column, rule) =>
    val value = Option(row.getAs[Any](column)).map(_.toString.trim).orNull
    val failed = rule match {
      case "null"    => value == null || value.isEmpty                                     // 1. Null check
      case "integer" => value == null || Try(value.toInt).isFailure                        // 2. Integer check
      case "date"    => value == null || Try(java.time.LocalDate.parse(value)).isFailure   // 3. Date check (ISO format assumed)
      case "double"  => value == null || Try(value.toDouble).isFailure                     // 4. Double check
      case custom    => !customCheck(custom, value)                                        // 5. Custom check (customCheck is a hypothetical expression parser)
    }
    if (failed) {
      errorColumns += column + ";"
      errorDescription += s"$column failed $rule check;"
    }
  }
  // Append the collected error information to the original row.
  Row.merge(row, Row(errorColumns, errorDescription))
}
The custom check parses a user-defined expression and returns the corresponding output from the DataQuality function. The checkMap argument holds the column-wise data quality rules configured for the files being processed; this can also be passed as an object.
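As a minimal usage sketch, the check map can be loaded from the metadata table and the service applied across the Row RDD. The dq_rules table, its column names, the sales_feed file name, sourceDf and the spark session below are assumptions for illustration:

import org.apache.spark.sql.types.StringType
import spark.implicits._

// Build the column -> rule map for one file from the hypothetical dq_rules metadata table.
val checkMap: Map[String, String] = spark.table("dq_rules")
  .filter($"file_name" === "sales_feed")
  .collect()
  .map(r => r.getAs[String]("column_name") -> r.getAs[String]("rule"))
  .toMap

// Run the row-level checks and rebuild a DataFrame with the two appended error columns.
val checkedRdd = sourceDf.rdd.map(row => DataQuality(row, checkMap))
val checkedSchema = sourceDf.schema
  .add("error_columns", StringType)
  .add("error_description", StringType)
val checkedDf = spark.createDataFrame(checkedRdd, checkedSchema)

Records where error_columns is non-empty can then be routed to the invalid DataFrame handled by the data correctness module.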
Data Correctness: Data correctness is a key module that understands the data and converts it into the target format without any input from the user. There are three key data correction rule engines:
1. Data containing special characters that require proper cleaning before casting, and representations that need special handling.
E.g. an amount field containing ",", "$", or a "-" sign placed after the digits.
2. Data that is expected to fall within a set of values but does not match, or where characters are missing from the desired string; this method is based on closest-match processing (a sketch follows this list).
E.g. a description should contain "Vendor 1, Vendor 2, Vendor 3" and the value arrives as "Ven1" or "v-1".
3. Data containing alphanumeric values in place of integer or decimal fields. Based on user input, the method replaces them with proper values that will not cause an issue in the next layer.
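The paper does not spell out the matching algorithm for rule 2; one possible sketch is a simple edit-distance closest match, where the allowed value list comes from user configuration (both helper names below are assumptions):

// Minimal Levenshtein distance; a real engine might use a library or token-based fuzzy matching.
def levenshtein(a: String, b: String): Int = {
  val dist = Array.tabulate(a.length + 1, b.length + 1)((i, j) => if (i == 0) j else if (j == 0) i else 0)
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1).toLower == b(j - 1).toLower) 0 else 1
    dist(i)(j) = math.min(math.min(dist(i - 1)(j) + 1, dist(i)(j - 1) + 1), dist(i - 1)(j - 1) + cost)
  }
  dist(a.length)(b.length)
}

// Pick the configured value closest to the incoming string, e.g. "Ven1" -> "Vendor 1".
def closestMatch(value: String, allowedValues: Seq[String]): String =
  allowedValues.minBy(allowed => levenshtein(value, allowed))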
def DataCorrect(str: String, correctMap: Map[String, String]): String = {
  // -- check the source data type of the column
  // -- check whether the user has supplied a list of allowed values
  // -- check the characters present, the source data type and the user input
  //    (e.g. strip ",", "$" and trailing "-" signs before casting)
  // -- check the string against the lag and lead rows where a contextual fix is needed
  // Simplistic placeholder: apply the user-supplied literal replacements from correctMap.
  correctMap.foldLeft(str) { case (current, (pattern, replacement)) =>
    current.replace(pattern, replacement)
  }
}
This should be a UDF, applied to the invalid DataFrame, with the corrected records then inserted into the valid DataFrame.
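A minimal sketch of that wiring is shown below; invalidDf, validDf, the "amount" column and the correction map contents are assumptions for illustration:

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical correction map for an "amount" column: strip thousands separators and "$".
val amountCorrectMap = Map("," -> "", "$" -> "")
val dataCorrectUdf = udf((value: String) => DataCorrect(value, amountCorrectMap))

// Apply the correction UDF on the invalid DataFrame and push the fixed rows into the valid one.
val correctedDf = invalidDf.withColumn("amount", dataCorrectUdf(col("amount")))
val mergedDf = validDf.union(correctedDf) // schemas are assumed to match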
Dynamic Business Transformation: The importance of row-level dynamic business transformation is immense. Mathematical operations, string operations, and custom conditional transformations can all be performed dynamically inside this method. This is again a Row RDD based operation, using quasiquotes as the interpreter for the transformation expressions.
def LinearTransformation(row: Row, transMap: Map[String, String]): Row = {
  // Iterate over the transformation map (target column -> expression),
  // implement each business transformation against the current row
  // and push the new values to a list.
  // evaluateExpression is a hypothetical helper; a quasiquote-based version is sketched below.
  val newValues = transMap.values.map(expr => evaluateExpression(expr, row)).toList
  // Convert the list to another row and merge it with the original one.
  Row.merge(row, Row.fromSeq(newValues))
}
These three methods would be implemented inside a trait so they can be extended and modified as needed. The whole intention of this white paper is to use more functional programming and to implement small services that deliver optimum performance. Think differently while implementing!! :)