Implementing an enterprise metadata-driven accelerator for data ingestion and linear transformations. This article contains a road map for designing a framework that handles different ingestion scenarios.
1. Data Quality, Correctness and Dynamic Transformations using Spark and Scala
Apache Spark has been one of the leading big data processing engines for a long time. Considering the growth in the volume and variety of data, creating a framework on Spark is an inevitable solution. As data sources grow exponentially, it is also important to maintain valid data in your data lake. By definition, a data lake should be able to pull data of any format into the filesystem and extract the related information from it. Today there are multiple open-source technologies to convert different types of data into text format, but very few applications check the data quality and data correctness of the source data. Another important point is that the number of data sources is now far higher than in a traditional data warehouse system. A client that earlier processed around a hundred structured source files landing within a defined timeline now handles more than 50 times that number, where the file landing timeline is not fixed and the volume is unknown. With semi-structured, unstructured, sensor, and IoT data growing rapidly, it is really important to understand the difference between a correct and an incorrect record. Therefore, the fundamental way to start a data lake is to design a flexible framework that can handle a variety of data, perform sets of data quality checks, implement linear business transformations and lookup services, and include a data correctness module that detects when some manual input has gone wrong and fixes the data at runtime. This white paper gives an insight into how to create generic services to process data within the data lake. The tool we are going to use here is Spark, developed in Scala.
Data Quality: Data Quality is an important service for a framework and should be able to handle standard data quality checks as well as custom DQ functions. The best way is to implement these functions at the row RDD level. We need to create a Data Quality package which consumes checks from the metadata table and applies them to each row.
import org.apache.spark.sql.Row
import scala.util.Try

// Applies the metadata-driven checks to a row and appends the failing
// column names and the error descriptions as two extra fields.
def dataQuality(row: Row, checkMap: Map[String, String]): Row = {
  var errorColumns = ""
  var errorDescription = ""
  // Loop over the check map: column name -> check type
  checkMap.foreach { case (column, check) =>
    val value = Option(row.getAs[Any](column)).map(_.toString.trim)
    val failed = check match {
      case "null"    => value.forall(_.isEmpty)                                    // 1. Null check
      case "integer" => value.exists(v => Try(v.toInt).isFailure)                  // 2. Integer check
      case "date"    => value.exists(v => Try(java.sql.Date.valueOf(v)).isFailure) // 3. Date check
      case "double"  => value.exists(v => Try(v.toDouble).isFailure)               // 4. Double check
      case expr      => value.exists(v => !v.matches(expr))                        // 5. Custom check (defined expression)
    }
    if (failed) { errorColumns += column + ";"; errorDescription += s"$column failed $check check;" }
  }
  Row.merge(row, Row(errorColumns, errorDescription))
}
The custom check parses the defined expression we want to evaluate (here treated as a regular expression the value must match) and returns the corresponding output from the dataQuality function.
checkMap contains the column-wise set of data quality rules for the files being processed. This can also be an object.
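For illustration, the map for one feed might look like the following; the column names and check labels are hypothetical, and sourceDf stands for the DataFrame read from the landing zone:

// Hypothetical check map, as it might be loaded from the metadata table.
val checkMap = Map(
  "customer_id" -> "null",      // mandatory column
  "order_date"  -> "date",      // must parse as yyyy-MM-dd
  "amount"      -> "double",    // must parse as a double
  "country"     -> "US|UK|IN"   // custom check: regex the value must match
)

// Apply at row RDD level; every output row carries the two error fields.
val checkedRdd = sourceDf.rdd.map(row => dataQuality(row, checkMap))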
2. Data Correctness: Data Correctness is a key module that understands the data and converts it into the target format without any input from the user. There are three key data correction rule engines:
1. Data containing special characters that require proper cleaning and then casting. There are also some representations that require special methods.
Eg: An amount field containing ",", "$", or a "-" sign after the digits.
2. Data that is expected to fall within a range of values but does not match, or where some characters are missing from the desired string. This method is based on closest-match processing (see the sketch after this list).
Eg: A description should contain "Vendor 1, Vendor 2, Vendor 3" but the value arrives as "Ven1" or "v-1".
3. Data containing alphanumeric values in place of integer or decimal fields. Based on user input, the method replaces them with proper values that will not cause issues in the next layer.
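The closest-match processing of rule 2 can be sketched with a plain Levenshtein distance. The helper names below are illustrative, and a production version would normalise case and known abbreviations first:

// Edit distance between two strings (dynamic programming).
def levenshtein(a: String, b: String): Int = {
  val dist = Array.tabulate(a.length + 1, b.length + 1)((i, j) => if (i == 0) j else if (j == 0) i else 0)
  for (i <- 1 to a.length; j <- 1 to b.length)
    dist(i)(j) = math.min(math.min(dist(i - 1)(j) + 1, dist(i)(j - 1) + 1),
      dist(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1))
  dist(a.length)(b.length)
}

// Pick the allowed value closest to the incoming one, e.g. "Ven1" -> "Vendor 1".
def closestMatch(value: String, allowed: Seq[String]): String =
  allowed.minBy(candidate => levenshtein(value.toLowerCase, candidate.toLowerCase))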
// Corrects a single value using the metadata-driven rules for its column.
def dataCorrect(str: String, correctMap: Map[String, String]): String = {
  val targetType = correctMap.getOrElse("type", "string")  // source data type
  val userValue  = correctMap.get("replacement")           // listed value given by the user
  // Rule 1: strip special characters and move a trailing "-" sign to the front.
  val cleaned = str.replaceAll("[$,\\s]", "").replaceAll("^(.+)-$", "-$1")
  targetType match {
    case "integer" | "double" if !cleaned.matches("-?\\d+(\\.\\d+)?") =>
      userValue.getOrElse(cleaned)  // Rule 3: alphanumeric in a numeric field
    case _ => cleaned               // Rule 2 (closest match against lag/lead rows) is applied separately
  }
}
This should be a UDF, applied on the invalid DataFrame, with the corrected records inserted back into the valid DataFrame.
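A minimal sketch of that wiring, assuming invalidDf and validDf hold the rejected and accepted records and correctMap comes from the metadata table (all three names, and the "amount" column, are illustrative):

import org.apache.spark.sql.functions.{col, udf}

// Wrap the correction as a UDF; correctMap is captured in the closure.
val correctUdf = udf((value: String) => dataCorrect(value, correctMap))

// Repair the offending column on the invalid frame, then move the
// corrected rows into the valid frame.
val repairedDf = invalidDf.withColumn("amount", correctUdf(col("amount")))
val mergedDf   = validDf.union(repairedDf)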
3. Dynamic Business Transformation: The importance of row-level dynamic business transformation is immense. We can perform mathematical operations, string operations, and custom conditional transformations dynamically in this method. This is again a row RDD based operation, using quasiquotes as an interpreter.
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Row

// Evaluates each metadata-driven expression against the row and merges
// the derived values back in as new columns.
def linearTransformation(row: Row, transMap: Map[String, String]): Row = {
  val newValues = ListBuffer[Any]()
  transMap.foreach { case (column, expression) =>  // iterate the transformation map
    newValues += evaluate(expression, row.getAs[Any](column))  // see the interpreter sketch below
  }
  Row.merge(row, Row.fromSeq(newValues))           // merge the new columns into the row
}
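The pseudocode leaves the expression interpreter open. One way to realise it is the Scala runtime toolbox, which parses and evaluates quasiquote-style expression strings at runtime; it requires scala-compiler on the classpath. The evaluate helper and the sample expression below are illustrative, and a production version would cache the compiled function per expression rather than recompiling per row:

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

val toolbox = currentMirror.mkToolBox()

// Compiles a metadata expression such as "(x: Any) => x.asInstanceOf[Double] * 1.18"
// and applies it to the column value.
def evaluate(expression: String, input: Any): Any = {
  val fn = toolbox.eval(toolbox.parse(expression)).asInstanceOf[Any => Any]
  fn(input)
}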
These three methods would be implemented inside a trait and can be extended and modified as well. The whole intention of this white paper is to use more functional programming and to implement small services to get optimum performance. Think differently while implementing!! :)