Business Case For Leveraging Machine
Learning (ML) To Validate Data Lake
www.FirstEigen.com contact@firsteigen.com
Leverage ML to improve data quality in a data lake.
Without effective and comprehensive validation, a data lake becomes a data
swamp and offers no clear link to business value creation.
Organizations are rapidly adopting the Cloud Data Lake as the data lake of
choice. As a result, validating data in real time has become critical. Accurate,
consistent, and reliable data fuels algorithms, operational processes, and
effective decision-making. Existing data validation approaches are rule-based,
which makes them resource-intensive, time-consuming, costly, and not scalable
to thousands of data assets.
Business Impact of Data Quality Issues in Data Lake
The following examples from Global 2000 organizations demonstrate the need to
establish data quality checks on each data asset in the data lake.
ETL Jobs Fail to Identify Files in a Data Lake
New subscribers of an insurance company could not access telehealth services
for more than a week. The root cause was that the data engineering team was not
aware that the insurance company had been onboarded as a new client, so the ETL
jobs did not pick up the enrollment files that landed in the Azure data lake.
Trading Company Ingests Data without Validation
Commodity traders at a trading company could not find user-level credit
information for a certain group of users on a Monday morning (a report was
blank), which disrupted trading activities for 2 hours. The credit file
received from another application had an empty credit field and was not
checked before being loaded into BigQuery.
Misinformation due to Poor Preprocessing
Supply chain executives at a restaurant chain were surprised by a report that
consumption in the UK had doubled in May. Because of a processing error, the
current month's consumption file had been appended to the consumption file
from April and stored in the AWS data lake.
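A simple volume check is one way a file of this kind could be flagged before it reaches a report. The sketch below is illustrative only: the row counts, the function name `is_volume_anomaly`, and the threshold of 3.5 are hypothetical and do not come from the case described here. It compares the newest file's row count with the median of earlier months and uses the median absolute deviation (MAD) as a robust measure of spread.

```python
from statistics import median


def is_volume_anomaly(history, current, threshold=3.5):
    """Flag `current` if it is far from the median of `history`.

    Uses the modified z-score (0.6745 * deviation / MAD). The threshold
    of 3.5 is a common rule of thumb, assumed here for illustration.
    """
    center = median(history)
    mad = median(abs(x - center) for x in history)
    if mad == 0:
        # All historical counts are identical: any change is suspicious.
        return current != center
    score = 0.6745 * abs(current - center) / mad
    return score > threshold


# Hypothetical monthly row counts for a consumption file.
monthly_rows = [10_200, 9_900, 10_050, 10_400, 10_100]
normal_may = 10_300
appended_may = 10_100 + 10_300  # April rows left in the May file

print(is_volume_anomaly(monthly_rows, normal_may))    # False
print(is_volume_anomaly(monthly_rows, appended_may))  # True
```

A median-based score is used rather than a mean and standard deviation so that one earlier bad month does not widen the accepted range.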
Current Approach and Challenges
The current focus in Cloud Data Lake projects is on data ingestion, the process
of moving data from multiple data sources (often in different formats) into a
single destination. After ingestion, data moves through the data pipeline,
which is where data errors and issues begin to surface. Our research estimates
that an average of 30–40% of any analytics project is spent identifying and
fixing data issues. In extreme cases, the project is abandoned entirely.
Current data validation approaches are designed to establish data quality rules
for one container/bucket at a time. As a result, implementing them for
thousands of buckets/containers is costly. This container-by-container focus
often leads to an incomplete set of rules, or to no rules being implemented at
all.
Operational Challenges in Integrating Data Validation Solutions
In general, the data engineering team experiences the following operational
challenges while integrating data validation solutions:
- The time it takes to analyze data and consult the subject matter experts to
  determine what rules need to be implemented.
- Implementation of the rules specific to each container, so the effort is
  linearly proportional to the number of containers/buckets/folders in the
  data lake.
- Existing open-source tools/approaches come with limited audit trail
  capability. Generating an audit trail of the rule execution results for
  compliance requirements often takes time and effort from the data
  engineering team.
- Maintaining the implemented rules.
Machine Learning (ML)-Based Approach For Data Quality
Instead of deriving data quality rules through profiling, analysis, and
consultations with subject matter experts, standardized unsupervised
machine learning algorithms can be applied at scale to the data lake
buckets/containers to determine acceptable data patterns and identify
anomalous records. We have had success applying the following algorithms to
detect data errors in financial services and Internet of Things (IoT) data.
Several open-source ML packages offer these algorithms:
- DBSCAN [1]
- Principal component analysis and eigenvector analysis [2]
- Association mining [3]
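To illustrate how density-based clustering can surface anomalous records without hand-written rules, the sketch below implements a minimal DBSCAN in plain Python and treats noise points (label -1) as anomalies. It is a teaching sketch rather than the implementation used in the work described here. The function name `dbscan_labels`, the sample points, and the `eps` and `min_samples` values are all hypothetical, and the features are assumed to be already scaled. In practice a library implementation, such as the `DBSCAN` class in scikit-learn, would normally be used.

```python
from math import dist


def dbscan_labels(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Neighborhoods include the point itself, as in the usual definition.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


# Hypothetical two-feature records, assumed to be already scaled.
points = [(1.0, 1.1), (1.1, 1.0), (0.9, 1.0), (1.0, 0.9), (1.05, 1.05), (5.0, 0.2)]
labels = dbscan_labels(points, eps=0.3, min_samples=3)
anomalies = [p for p, label in zip(points, labels) if label == -1]

print(labels)     # [0, 0, 0, 0, 0, -1]
print(anomalies)  # [(5.0, 0.2)]
```

The pairwise neighbor search here is quadratic in the number of records, which is fine for a small example but not for a full bucket.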
The anomalous records can then be used to measure a data trust score through
the lens of standardized data quality dimensions:
1. Freshness: determine whether the data has arrived before the next step of
the process.
2. Completeness: determine the completeness of contextually important fields.
Contextually important fields should be identified using mathematical and/or
machine learning techniques.
3. Conformity: determine conformity of contextually important fields to a
pattern, length, or format.
4. Uniqueness: determine the uniqueness of individual records.
5. Drift: determine the drift of key categorical and continuous fields from
historical information.
6. Anomaly: determine volume and value anomalies in critical columns.
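One way to turn these dimensions into numbers is to score each one as a fraction between 0 and 1 and then combine the scores. The sketch below covers completeness, conformity, and uniqueness on a few hypothetical records. The field names, the `user_id` pattern, and the unweighted mean used as the trust score are assumptions made for illustration, not the scoring method of any particular product.

```python
import re

# Hypothetical records; None marks a missing value.
records = [
    {"user_id": "U001", "credit": 5000.0},
    {"user_id": "U002", "credit": None},
    {"user_id": "u-3", "credit": 1200.0},
    {"user_id": "U001", "credit": 5000.0},  # exact duplicate of the first row
]

USER_ID_PATTERN = re.compile(r"U\d{3}")  # assumed format for user_id


def completeness(rows, field):
    """Share of rows where `field` is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)


def conformity(rows, field, pattern):
    """Share of rows where `field` fully matches `pattern`."""
    return sum(bool(pattern.fullmatch(str(r[field]))) for r in rows) / len(rows)


def uniqueness(rows):
    """Share of rows that are distinct when all fields are compared."""
    distinct = {tuple(sorted(r.items(), key=lambda kv: kv[0])) for r in rows}
    return len(distinct) / len(rows)


scores = {
    "completeness": completeness(records, "credit"),
    "conformity": conformity(records, "user_id", USER_ID_PATTERN),
    "uniqueness": uniqueness(records),
}
# Unweighted mean as a stand-in trust score (an assumption, see above).
trust_score = sum(scores.values()) / len(scores)

for name, value in scores.items():
    print(f"{name}: {value:.2f}")
print(f"trust score: {trust_score:.2f}")
```

Each of the three scores is 0.75 for this sample, because exactly one of the four rows fails each check. Freshness, drift, and anomaly scores would need arrival timestamps and historical data, which this sketch leaves out.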
Value Comparison
The benefits of ML-based data quality fall broadly into two categories:
quantitative and qualitative.
While the quantitative benefits make the most powerful argument in a business
case, the value of the qualitative benefits should not be ignored.
Value Dimension: Cost Reduction
  Traditional Approach: The effort to establish DQ checks is approximately
  8 to 16 resource hours per bucket/container. For a data lake of 1000
  buckets/containers, the cost is approximately $800K-1,600K.
  ML-Based Approach: The effort to establish DQ checks is approximately 2 to 4
  compute hours plus 1 resource hour per bucket/container. For a data lake of
  1000 buckets/containers, the cost is approximately $150K-200K, including the
  cost of initial setup.
Value Dimension: Time To Market
  Traditional Approach: One year. It would take a team of 8 resources to
  establish DQ checks for approximately 1000 buckets/containers.
  ML-Based Approach: 3 months. It would take a team of 2 resources to
  establish DQ checks for approximately 1000 buckets/containers.
Value Dimension: Risk Reduction
  Traditional Approach: The rule-based approach addresses known risks, often
  missing newer types of data risks.
  ML-Based Approach: The ML-based approach addresses both known risks (such as
  completeness and conformity) and difficult-to-anticipate risks, such as
  changes in data density.
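The figures in this comparison are consistent with simple arithmetic. The sketch below reproduces them under two assumptions that the source does not state: an hourly rate of $100 per resource (implied by 8 to 16 hours per bucket costing $800K to $1,600K for 1000 buckets) and about 2,000 working hours per resource per year.

```python
BUCKETS = 1000
HOURLY_RATE_USD = 100            # assumption: implied by $800K / (8 h x 1000)
HOURS_PER_RESOURCE_YEAR = 2000   # assumption: roughly 50 weeks x 40 hours


def labor_cost(hours_per_bucket, buckets=BUCKETS, rate=HOURLY_RATE_USD):
    """Labor cost in USD when effort grows linearly with bucket count."""
    return hours_per_bucket * buckets * rate


# Traditional approach: 8 to 16 resource hours per bucket.
traditional_low = labor_cost(8)
traditional_high = labor_cost(16)

# ML-based approach: 1 resource hour per bucket. Compute hours and the
# initial setup are not modeled here.
ml_labor = labor_cost(1)

# Cross-check against the time-to-market figures.
traditional_team_hours = 8 * HOURS_PER_RESOURCE_YEAR    # 8 resources, 1 year
ml_team_hours = 2 * HOURS_PER_RESOURCE_YEAR * 3 // 12   # 2 resources, 3 months

print(traditional_low, traditional_high)      # 800000 1600000
print(ml_labor)                               # 100000
print(traditional_team_hours, ml_team_hours)  # 16000 1000
```

Under these assumptions, 8 resources for a year matches the upper bound of 16 hours per bucket, and 2 resources for 3 months matches 1 resource hour per bucket. The ML labor of about $100K would leave roughly $50K to $100K of the quoted $150K-200K for compute and initial setup.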
Conclusion
Data is the most valuable asset for organizations. Current approaches to
validating data are full of operational challenges, which lead to a trust
deficit and to time-consuming, costly methods for fixing data errors.
There is an urgent need to adopt a standardized, autonomous approach to
validating the Cloud Data Lake so that it does not become a data swamp.
[1] J. Waller, Outlier Detection Using DBSCAN (2020), Data Blog
[2] S. Serneels et al., Principal component analysis for data containing
outliers and missing elements (2008), ScienceDirect
[3] S. B. Hassine et al., Using association rules to detect data quality
issues, MIT Information Quality (MITIQ) Program

Business Case for leveraging Machine Learning (ML) to Validate Data Lake.pdf

  • 1. Business Case For Leveraging Machine Learning (ML) To Validate Data Lake www.FirstEigen.com contact@firsteigen.com Leverage ML to improve data quality in a data lake.
  • 2. Without effective and comprehensive validation, a data lake becomes a data swamp and does not offer a clear link to value creation to business. Organizations are rapidly adopting Cloud Data Lake as the data lake of choice. Given so, the need for validating data in real-time has become critical. Accurate, consistent, and reliable data fuels algorithms, operational processes, and effective decision-making. Existing data validation approaches rely on a rule-basedapproach that is resource-intensive, time-consuming, costly, and not scalable for 1000s of data assets. Business Case For Leveraging Machine Learning (ML) To Validate Data Lake
  • 3. Following examples from Global 2000 organizations demonstrate the need to establish data quality checks on each data asset present in the data lake. Business Impact of Data Quality issues in Data Lake Business Impact of Data Quality issues in Data Lake
  • 4. ETL Jobs Fail to Identify Files in a Data Lake
  • 5. ETL Jobs Fail to Identify Files in a Data Lake New subscribers of an insurance company could not avail the telehealth services for more than a week. Here, the root cause was that the Data Engineering team was not aware of onboarding of the insurance company as a new client and ETL jobs did not pick up the enrollment files that landed in their Azure data lake.
  • 6. Trading Company Ingests Data without Validation
  • 7. Commodity traders of a trading company could not find the user level credit information for a certain group of users on a Monday morning — a report was blank — leading to disruptions in trading activities for 2 hours. The reason was that the credit file received from another application had the credit field empty and was not checked before being loaded to the Big Query. Trading Company Ingests Data without Validation
  • 9. Supply chain executives of a restaurant chain company were surprised by the report that consumption in the UK doubled in May. Current month’s consumption file was appended to the consumption file from April because of a processing error and stored in the AWS Data Lake. Misinformation due to Poor Preprocessing
  • 10. Current Approach And Challenges Current Approach and Challenges The current focus in Cloud Data Lake projects is on data ingestion, the process of moving data from multiple data sources (often of different formats) into a single destination. After data ingestion, data is moved through the data pipeline which is where data errors/issues begin to surface. Our research estimates that an average of 30–40% of any analytics project is spent identifying and fixing data issues. In extreme cases, the project can get abandoned entirely. Current data validation approaches are designed to establish data quality rules for one container/bucket at a time — as a result, there are significant cost issues in implementing these solutions for 1000s of buckets/containers. Container-wise focus often leads to an incomplete set of rules or often not implementing any rules at all.
  • 11. Operational Challenges in Integrating Data Validation Solutions: In general, the data engineering team experiences the following operational challenges while integrating data validation solutions: (1) The time it takes to analyze data and consult the subject matter experts to determine which rules need to be implemented. (2) Implementing the rules specific to each container, so the effort is linearly proportional to the number of containers/buckets/folders in the data lake. (3) Existing open-source tools and approaches come with limited audit trail capability; generating an audit trail of the rule execution results for compliance requirements often takes time and effort from the data engineering team. (4) Maintaining the implemented rules.
  • 12. Machine Learning (ML)-Based Approach for Data Quality: Instead of figuring out data quality rules through profiling, analysis, and consultations with the subject matter experts, standardized unsupervised machine learning algorithms can be applied at scale to the data lake buckets/containers to determine acceptable data patterns and identify anomalous records. We have had success in applying the following algorithms to detect data errors in financial services and Internet of Things (IoT) data; several open-source ML packages offer them: (1) DBSCAN [1]; (2) principal component analysis and eigenvector analysis [2]; (3) association mining [3].
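As a rough illustration of how DBSCAN can flag anomalous records, the sketch below implements a minimal version of the algorithm in plain Python and applies it to a handful of made-up two-field records. The data, the `eps` and `min_pts` values, and the function name `dbscan` are assumptions chosen for the example, not taken from the slides; in practice an open-source package would normally be used instead of a hand-written loop.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: one label per point, -1 marks noise (anomalies)."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE              # may later become a border point
            continue
        labels[i] = cluster                # i is a core point: start a cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
        cluster += 1
    return labels

# Hypothetical records as (amount, quantity) pairs; values are illustrative only.
points = [(100, 1), (102, 1), (98, 2), (101, 2), (99, 1), (500, 40)]

# eps and min_pts are assumed values; real data would need scaling and tuning.
labels = dbscan(points, eps=5, min_pts=3)
anomalies = [p for p, label in zip(points, labels) if label == -1]
print(labels)      # [0, 0, 0, 0, 0, -1]
print(anomalies)   # [(500, 40)]
```

Records that do not fall into any dense cluster are labeled as noise, which is what makes the algorithm usable without hand-written rules per bucket or container.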
  • 13. Machine Learning (ML)-Based Approach for Data Quality: Use the anomalous records to measure a data trust score through the lens of standardized data quality dimensions: 1. Freshness: determine whether the data has arrived before the next step of the process. 2. Completeness: determine the completeness of contextually important fields. Contextually important fields should be identified using mathematical and/or machine learning techniques. 3. Conformity: determine conformity of contextually important fields to a pattern, length, or format. 4. Uniqueness: determine the uniqueness of the individual records. 5. Drift: determine the drift of the key categorical and continuous fields from the historical information. 6. Anomaly: determine volume and value anomalies in critical columns.
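A few of these dimensions can be sketched as simple ratios over a batch of records. The example below is only an illustration: the record fields, the credit pattern, the timestamps, and the function names are hypothetical, and the slides do not say how the individual scores are combined into a trust score, so no combined score is computed here.

```python
import re
from datetime import datetime

# Hypothetical records; field names and values are illustrative only.
records = [
    {"user_id": "U001", "credit": "2500.00"},
    {"user_id": "U002", "credit": ""},          # missing credit value
    {"user_id": "U003", "credit": "abc"},       # value in the wrong format
    {"user_id": "U001", "credit": "2500.00"},   # duplicate of the first record
]

def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    return sum(1 for r in rows if r.get(field) not in (None, "")) / len(rows)

def conformity(rows, field, pattern):
    """Share of non-empty values that fully match the expected pattern."""
    values = [r[field] for r in rows if r.get(field) not in (None, "")]
    if not values:
        return 1.0  # nothing to check; assumption: treat as conforming
    return sum(1 for v in values if re.fullmatch(pattern, v)) / len(values)

def uniqueness(rows):
    """Share of rows that are distinct when all fields are compared."""
    distinct = {tuple(sorted(r.items())) for r in rows}
    return len(distinct) / len(rows)

def is_fresh(arrived_at, next_step_at):
    """True if the data arrived before the next process step starts."""
    return arrived_at < next_step_at

credit_completeness = completeness(records, "credit")             # 3 of 4 rows
credit_conformity = conformity(records, "credit", r"\d+\.\d{2}")  # 2 of 3 values
record_uniqueness = uniqueness(records)                           # 3 distinct of 4
fresh = is_fresh(datetime(2024, 1, 1, 5, 30), datetime(2024, 1, 1, 6, 0))

print(credit_completeness, credit_conformity, record_uniqueness, fresh)
```

A check like `completeness` on the credit field is the kind of test that would have caught the empty credit file in the trading example before it was loaded.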
  • 14. Value Comparison: The benefits of ML-based data quality fall broadly into two categories: quantitative and qualitative. While the quantitative benefits make the most powerful argument in a business case, the value of the qualitative benefits should not be ignored. The traditional and ML-based approaches compare as follows across three value dimensions. Cost Reduction. Traditional approach: the effort to establish DQ checks is approximately 8 to 16 resource hours per bucket/container; for a data lake of 1000 buckets/containers, the cost is approximately $800K–1,600K. ML-based approach: the effort to establish DQ checks is approximately 2 to 4 compute hours plus 1 resource hour per bucket/container; for a data lake of 1000 buckets/containers, the cost is approximately $150K–200K, including the cost of initial setup. Time to Market. Traditional approach: one year; it would take a team of 8 resources to establish DQ checks for approximately 1000 buckets/containers. ML-based approach: 3 months; it would take a team of 2 resources to establish DQ checks for approximately 1000 buckets/containers. Risk Reduction. Traditional approach: the rule-based approach addresses known risks, often missing newer types of data risks. ML-based approach: addresses both known risks (such as completeness and conformity) and difficult-to-anticipate risks such as changes in data density.
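The cost and time-to-market figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes roughly 2,000 working hours per person per year, which is not stated in the slides. The hourly rate is derived from the slide's own numbers, and applying that same rate to the ML-based approach is likewise an assumption.

```python
BUCKETS = 1000
HOURS_PER_PERSON_YEAR = 2000   # assumption: not stated in the slides

# Traditional approach: 8 to 16 resource hours per bucket, $800K to $1,600K total.
trad_hours_low = 8 * BUCKETS                 # 8,000 resource hours
trad_hours_high = 16 * BUCKETS               # 16,000 resource hours
implied_rate = 800_000 / trad_hours_low      # dollars per resource hour
# Both ends of the quoted cost range imply the same hourly rate.
assert implied_rate == 1_600_000 / trad_hours_high

# Time to market: a team of 8 for one year matches the upper effort estimate.
trad_team_hours = 8 * HOURS_PER_PERSON_YEAR

# ML-based approach: 1 resource hour per bucket; a team of 2 for 3 months.
ml_resource_hours = 1 * BUCKETS
ml_team_hours = 2 * HOURS_PER_PERSON_YEAR * 3 // 12

# Labor share of the ML cost if the same implied rate applies (an assumption).
ml_labor_cost = ml_resource_hours * implied_rate

print(f"Implied rate: ${implied_rate:.0f} per resource hour")
print(f"Traditional team capacity: {trad_team_hours} hours")
print(f"ML team capacity: {ml_team_hours} hours")
print(f"ML labor cost at implied rate: ${ml_labor_cost:,.0f}")
```

Under these assumptions the staffing figures line up with the per-bucket effort estimates, and the rest of the quoted $150K–200K would correspond to compute hours and initial setup, which the slides do not break down further.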
  • 15. Conclusion: Data is the most valuable asset for organizations. Current approaches for validating data are full of operational challenges, which lead to a lack of trust in the data and to time-consuming, costly methods for fixing data errors. There is an urgent need to adopt a standardized, autonomous approach to validating the Cloud Data Lake so that it does not become a data swamp.
  • 16. [1] J. Waller, Outlier Detection Using DBSCAN (2020), Data Blog. [2] S. Serneels et al., Principal component analysis for data containing outliers and missing elements (2008), Science Direct. [3] S. B. Hassine et al., Using Association rules to detect data quality issues, MIT Information Quality (MITIQ) Program.