1. Business Case For Leveraging Machine Learning (ML) To Validate Data Lake
www.FirstEigen.com contact@firsteigen.com
Leverage ML to improve data quality in a data lake.
2. Business Case For Leveraging Machine Learning (ML) To Validate Data Lake
Without effective and comprehensive validation, a data lake becomes a data swamp and offers no clear link to value creation for the business. Organizations are rapidly adopting the cloud data lake as the data lake of choice; as a result, the need for validating data in real time has become critical. Accurate, consistent, and reliable data fuels algorithms, operational processes, and effective decision-making. Existing data validation approaches rely on a rule-based approach that is resource-intensive, time-consuming, costly, and not scalable to thousands of data assets.
3. Business Impact of Data Quality Issues in a Data Lake
The following examples from Global 2000 organizations demonstrate the need to establish data quality checks on each data asset present in the data lake.
5. ETL Jobs Fail to Identify Files in a Data Lake
New subscribers of an insurance company could not access telehealth services for more than a week. The root cause: the data engineering team was not aware that the insurance company had been onboarded as a new client, so the ETL jobs did not pick up the enrollment files that landed in the Azure data lake.
7. Trading Company Ingests Data Without Validation
Commodity traders at a trading company could not find user-level credit information for a certain group of users on a Monday morning; the report came up blank, disrupting trading activities for two hours. The reason: the credit file received from another application had an empty credit field and was not checked before being loaded into BigQuery.
9. Misinformation Due to Poor Preprocessing
Supply chain executives at a restaurant chain were surprised by a report showing that consumption in the UK had doubled in May. Because of a processing error, the current month's consumption file had been appended to April's consumption file and stored in the AWS data lake.
10. Current Approach and Challenges
The current focus in cloud data lake projects is on data ingestion: the process of moving data from multiple data sources (often of different formats) into a single destination. After ingestion, data moves through the data pipeline, which is where data errors and issues begin to surface. Our research estimates that an average of 30–40% of any analytics project is spent identifying and fixing data issues. In extreme cases, the project is abandoned entirely.
Current data validation approaches are designed to establish data quality rules for one container or bucket at a time; as a result, implementing these solutions across thousands of buckets/containers carries significant cost. This container-by-container focus often leads to an incomplete set of rules, or to no rules being implemented at all.
11. Operational Challenges in Integrating Data Validation Solutions
In general, the data engineering team experiences the following operational challenges while integrating data validation solutions (a sketch of a typical per-container rule set follows the list):
- The time it takes to analyze data and consult subject matter experts to determine which rules need to be implemented.
- Implementing rules specific to each container, so the effort is linearly proportional to the number of containers/buckets/folders in the data lake.
- Existing open-source tools and approaches come with limited audit trail capability.
- Generating an audit trail of rule execution results for compliance requirements often takes time and effort from the data engineering team.
- Maintaining the implemented rules.
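To make the scaling problem concrete, here is a minimal sketch of the kind of hand-coded, container-specific rule set the traditional approach requires; every additional bucket needs its own version of this code, which is why the effort grows linearly. The file name, column names, and rules are hypothetical, for illustration only.

    # Minimal sketch of the traditional, rule-based approach: hand-coded
    # checks for ONE container. Each additional bucket needs its own rule
    # set. File name, columns, and thresholds are hypothetical.
    import pandas as pd

    df = pd.read_csv("enrollment_2024_05.csv")  # stand-in for one bucket's file

    rules = {
        "member_id is never null":   df["member_id"].notna().all(),
        "member_id is unique":       df["member_id"].is_unique,
        "credit_limit is populated": df["credit_limit"].notna().all(),
        "credit_limit is positive":  (df["credit_limit"] > 0).all(),
    }

    for rule, passed in rules.items():
        print(f"{'PASS' if passed else 'FAIL'}: {rule}")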
12. Machine Learning (ML)-Based Approach for Data Quality
Instead of figuring out data quality rules through profiling, analysis, and consultations with subject matter experts, standardized unsupervised machine learning algorithms can be applied at scale to the data lake buckets/containers to determine acceptable data patterns and identify anomalous records. We have had success applying the following algorithms to detect data errors in financial services and Internet of Things (IoT) data, and several open-source ML packages offer them out of the box (an anomaly detection sketch follows the list):
- DBSCAN [1]
- Principal component analysis and eigenvector analysis [2]
- Association mining [3]
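As an illustration of the DBSCAN option, here is a minimal sketch using scikit-learn: records that fall in no dense region get the noise label -1 and are treated as anomalous. The file name and the eps/min_samples settings are assumptions for illustration, not tuned values from a real deployment.

    # Minimal sketch: flag anomalous records in one data lake file with DBSCAN.
    # File name and eps/min_samples values are illustrative assumptions.
    import pandas as pd
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("credit_feed.csv")            # hypothetical extract from a bucket
    numeric = df.select_dtypes("number").dropna()  # DBSCAN needs numeric, non-null input

    # Scale features so no single column dominates the distance metric.
    X = StandardScaler().fit_transform(numeric)

    # Points assigned to no dense cluster receive the label -1 (noise/anomaly).
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    anomalies = numeric[labels == -1]

    print(f"{len(anomalies)} of {len(numeric)} records flagged as anomalous")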
13. Machine Learning (ML)-Based Approach for Data Quality
Leverage the anomalous records to measure a data trust score through the lens of standardized data quality dimensions, as shown below (a scoring sketch follows the list):
1. Freshness: determine whether the data arrived before the next step of the process.
2. Completeness: determine the completeness of contextually important fields. Contextually important fields should be identified using mathematical and/or machine learning techniques.
3. Conformity: determine conformity to the pattern, length, and format of contextually important fields.
4. Uniqueness: determine the uniqueness of individual records.
5. Drift: determine the drift of key categorical and continuous fields from historical data.
6. Anomaly: determine volume and value anomalies in critical columns.
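As a minimal sketch of how these dimensions roll up into a single trust score, the snippet below averages per-dimension pass rates. The equal weighting and the sample pass rates are illustrative assumptions; a real deployment would weight dimensions by business importance.

    # Minimal sketch: roll per-dimension pass rates up into one trust score.
    # Pass rates and equal weighting are illustrative assumptions.

    # Fraction of records that passed each data quality dimension (0.0-1.0).
    dimension_pass_rates = {
        "freshness":    1.00,  # file arrived before the downstream step
        "completeness": 0.97,  # contextually important fields populated
        "conformity":   0.95,  # pattern/length/format checks on key fields
        "uniqueness":   0.99,  # share of non-duplicate records
        "drift":        0.90,  # key fields within historical distribution
        "anomaly":      0.96,  # records not flagged by the ML anomaly step
    }

    def trust_score(pass_rates: dict) -> float:
        """Equal-weight average of dimension pass rates, scaled to 0-100."""
        return 100 * sum(pass_rates.values()) / len(pass_rates)

    print(f"Data trust score: {trust_score(dimension_pass_rates):.1f}/100")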
14. Value Comparison
The benefits of ML-based data quality fit broadly into two categories: quantitative and qualitative. While the quantitative benefits make the most powerful argument in a business case, the value of the qualitative benefits should not be ignored. (A worked-arithmetic sketch follows the comparison below.)
Value Dimension: Cost Reduction
Traditional approach: The effort to establish DQ checks is approximately 8 to 16 resource hours per bucket/container; for a 1,000-bucket/container data lake, the cost is approximately $800K-$1,600K.
ML-based approach: The effort to establish DQ checks is approximately 2-4 compute hours plus 1 resource hour per bucket/container; for a 1,000-bucket/container data lake, the cost is approximately $150K-$200K, including the cost of initial setup.

Value Dimension: Time to Market
Traditional approach: One year; it would take a team of 8 resources to establish DQ checks for approximately 1,000 buckets/containers.
ML-based approach: Three months; it would take a team of 2 resources to establish DQ checks for approximately 1,000 buckets/containers.

Value Dimension: Risk Reduction
Traditional approach: The rule-based approach addresses known risks, often missing newer types of data risks.
ML-based approach: Addresses both known risks (such as completeness and conformity) and difficult-to-anticipate risks such as changes in data density.
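The cost figures above follow from simple arithmetic, reproduced in the sketch below under an assumed blended rate of $100 per resource hour (the rate itself is an assumption that matches the table's totals).

    # Worked arithmetic behind the cost comparison. The $100/resource-hour
    # blended rate is an assumption; the ML approach's compute and setup
    # costs are taken from the table above, not derived here.
    BUCKETS = 1000
    RATE = 100  # USD per resource hour (assumed blended rate)

    # Traditional: 8-16 resource hours per bucket/container.
    trad_low, trad_high = 8 * BUCKETS * RATE, 16 * BUCKETS * RATE
    print(f"Traditional: ${trad_low:,} - ${trad_high:,}")  # $800,000 - $1,600,000

    # ML-based: 1 resource hour per bucket, plus compute and one-time setup.
    ml_labor = 1 * BUCKETS * RATE
    print(f"ML-based labor alone: ${ml_labor:,}")          # $100,000
    # With 2-4 compute hours per bucket and initial setup added, the
    # table's total comes to roughly $150,000 - $200,000.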
15. Conclusion
Data is the most valuable asset for organizations. Current approaches for validating data are full of operational challenges, leading to a trust deficit and to time-consuming, costly methods for fixing data errors. There is an urgent need to adopt a standardized, autonomous approach for validating the cloud data lake to prevent it from becoming a data swamp.
16. References
[1] J. Waller, "Outlier Detection Using DBSCAN" (2020), Data Blog.
[2] S. Serneels et al., "Principal component analysis for data containing outliers and missing elements" (2008), ScienceDirect.
[3] S. B. Hassine et al., "Using Association Rules to Detect Data Quality Issues," MIT Information Quality (MITIQ) Program.