Data Cleansing and Beyond: How to Address Data Debt for AI

Scott W. Ambler
Data Methodologist | Author
Ambysoft Inc.
© Ambysoft Inc. 1

Scott Ambler
• scott@scottambler.com
• linkedin.com/in/sambler/
• @scottambler.bsky.social
• Data Methodologist, Board Advisor: Ambysoft.com
• Thought Leader: AgileData.org
• Thought Leader: AgileModeling.com
• Co-Creator: pmi.org/disciplined-Agile

Agenda
• Today’s takeaways
• Defining artificial intelligence (AI)
• Machine Learning (ML) lifecycle
• Data quality (DQ)
• DQ and AI
• How to choose DQ techniques
• Parting thoughts

Today’s Takeaways
• Data quality (DQ) is one of the critical factors for successful AI
• The best place to address DQ issues is at the source
• The DQ challenges that you face will drive your choice of multiple DQ techniques

Defining Artificial Intelligence (AI)
• Artificial Intelligence (AI): Computer systems able to perform tasks normally requiring human intelligence.
• Machine Learning (ML): Computer systems with the ability to learn without being explicitly programmed. Training of ML models often involves significant data curation, sometimes including data labeling.
• Deep Learning (DL): Computer systems with “brain-like” logically structured algorithms called artificial neural networks (ANNs).
• Large Language Models (LLMs)/Generative AI: Computer systems based on very large deep learning models that are pre-trained on vast amounts of data.

The Machine Learning Lifecycle
You need a disciplined hybrid strategy

A Very High-Level View of the ML Lifecycle
• Construction: Train the AI
• Production: Operate the AI/Make Inferences
• Feedback: Learnings

Adding Details to Construction
• Construction: Choose a Training Strategy, Prepare the Data, Train the Model, Validate the Model
• Deploy the AI
• Production: Make Inferences
• Feedback: New data and learnings

Construction is Iterative and Incremental
• Construction: Evolve Requirements, Choose a Training Strategy, Prepare Data, Train Model, Validate the Model
• Deploy the AI
• Production: Make Inferences

A Machine Learning Lifecycle
• Envision: Explore Usage, Explore Data, Explore Viability
• Construction: Evolve Requirements, Choose a Training Strategy, Prepare Data, Train Model, Validate the Model
• Deploy the AI
• Production: Make Inferences

ML Lifecycle – MLOps Rendering
• Explore data, usage, and viability
• Choose training strategy, Prepare data, Evolve requirements, Train model
• Make inferences
• New data & learnings

ML Lifecycle – Iterative “CRISP-DM” Rendering
• Envision
• Prepare Data
• Train the Model
• Validate
• Deploy
• Operate

Data is the New Water
When water is dirty, we can:
1. Filter it just before we drink it
2. Filter it coming into our home
3. Filter it before it is dumped into our water supply
4. Filter it at the source
5. Clean the actual source
6. Fix whatever is polluting the source
Source: AgileData.org/essays/data-quality-metaphor.html

Definition: Data Debt
Technical debt is the accumulation of defects, quality issues (such as difficult-to-read code or low data quality), poor architecture, and poor design in existing solutions.
Data debt (data technical debt) refers to quality challenges associated with legacy data sources, including both mission-critical sources of record and “big data” sources of insight.
Source: AgileData.org/essays/dataTechnicalDebt.html

Causes of Data Debt
• Business prioritizing time to market over quality concerns
• Manual data entry
• Multiple siloed sources
• Lack of input validation
• Inconsistent data collection methods
• Inconsistent business rules across applications
• Ineffective data management
• Weak data literacy

Data Debt and AI

Data Quality is a Leading Challenge in AI
Source: K2View 2024 State of GenAI Data Readiness, k2view.com/genai-adoption-survey/

Data Challenges are Holding Firms Back
Source: Scale Zeitgeist AI Readiness Report 2024, scale.com/ai-readiness-report

Data Quality and ML
• Envision: Explore Usage, Explore Data, Explore Viability
• Construction: Evolve Requirements, Choose a Training Strategy, Prepare Data, Train Model, Validate the Model
• Deploy the AI
• Production: Make Inferences
Source: Ambysoft.com/essays/machine-learning-lifecycle.html

Potential DQ Issues Faced by ML Teams
• Biased data
• Insufficient data
• Semantic differences across sources
• Missing data values
• Inconsistent data values
• Unclear/unknown sources of record
• Ownership limitations
• Privacy/security
• Data poisoning
• Data drift
• and many more…
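
Two of the issues above, missing data values and inconsistent data values, can be surfaced with a small profiling pass before training. The following is a minimal sketch using only the Python standard library; the function name `profile_column`, the sample rows, and the rule that values differing only by case or surrounding whitespace count as "inconsistent" are assumptions made for illustration, not part of the presentation.

```python
def profile_column(rows, column):
    """Count missing values and group spellings that differ only by case."""
    values = [row.get(column) for row in rows]

    def is_missing(value):
        # Treat None and blank strings as missing (an assumption).
        return value is None or str(value).strip() == ""

    missing = sum(1 for value in values if is_missing(value))
    present = [str(value).strip() for value in values if not is_missing(value)]

    # Group values by their case-insensitive form to reveal variants.
    variants = {}
    for value in present:
        variants.setdefault(value.casefold(), set()).add(value)
    inconsistent = {
        key: sorted(spellings)
        for key, spellings in variants.items()
        if len(spellings) > 1
    }
    return {"rows": len(values), "missing": missing, "inconsistent": inconsistent}


# Hypothetical sample data.
sample_rows = [
    {"country": "Canada"},
    {"country": "canada"},
    {"country": ""},
    {"country": None},
    {"country": "USA"},
]
report = profile_column(sample_rows, "country")
```

A report like this only detects symptoms. Issues such as bias, semantic differences across sources, or data drift need other checks.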

Which DQ techniques are right for you in your unique context?

Potential Data Quality Techniques for ML Teams
• Data cleansing
• Data stewards
• Data architecture
• Automated regression testing
• Data labeling
• Data masking
• AI data cleansing (e.g. for MDM)
• Database modeling
• Transformation (as in ETL)
• Reviews (design, architecture)
• Data guidance (standards, guidelines, …)
• Database refactoring
• Data repair
• Executable business rules (in the db)
• Synthetic training data
• Manual regression testing
• and more…

What Data Quality Techniques Should We Apply?
• Lifecycle phases: Envision, Construction, Deploy, Inference/Production
• Data sources: Internal Data, External Data
It depends!

Data Quality Technique Comparison Factors
A data quality (DQ) technique comparison factor provides a qualitative scale along which a strategy is rated for contextualization.
• Timeliness: Reactive to Proactive
• DataOps Automation: None to Continuous
• Effect on Source: None to Direct
• Benefit Realization: Long term to Immediate
• Required Skills: Sophisticated to Straightforward
Source: AgileData.org/essays/dataqualitytechniqueassessment.html

DQ Technique: Data Cleansing (at point of use)
Source data is “cleansed” programmatically to put it in the expected state.
Advantages:
● Data quality problems are addressed for the ML initiative
● Potentially a quick fix for your team
Disadvantages:
● Often requires significant effort
● DQ issues are not being addressed at the source
When to apply it:
● When you cannot, or aren’t allowed to, fix data at the source
Rated on: Timeliness (Reactive to Proactive), Effect on Source (None to Direct), Required Skills, Benefit Realization (Long term to Immediate), DataOps Automation (None to Continuous)
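
Cleansing at the point of use can be sketched as a function that takes a raw record and returns a cleaned copy for the ML pipeline, leaving the source untouched. This is a minimal illustration, not a recommended implementation: the field names, the `COUNTRY_ALIASES` table, and the accepted date formats (including the choice to read `03/02/2024` as day/month/year) are all assumptions.

```python
from datetime import datetime

# Hypothetical lookup of messy country spellings to standard codes.
COUNTRY_ALIASES = {"usa": "US", "u.s.": "US", "united states": "US", "canada": "CA"}

# Assumed input formats; slash dates are read as day/month/year.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def cleanse_record(raw):
    """Return a cleaned copy of one record; the input dict is not modified."""
    # Collapse repeated whitespace and normalize capitalization.
    name = " ".join((raw.get("name") or "").split()).title()

    # Map known spellings to a standard code; unknown values become None.
    country_key = (raw.get("country") or "").strip().casefold()
    country = COUNTRY_ALIASES.get(country_key)

    # Try each accepted date format; leave None if nothing parses.
    signup_date = None
    date_text = (raw.get("signup_date") or "").strip()
    for date_format in DATE_FORMATS:
        try:
            signup_date = datetime.strptime(date_text, date_format).date().isoformat()
            break
        except ValueError:
            continue

    return {"name": name or None, "country": country, "signup_date": signup_date}


raw_record = {"name": "  ada   LOVELACE ", "country": "U.S.", "signup_date": "03/02/2024"}
clean_record = cleanse_record(raw_record)
```

Because the function returns a new record, the raw data at the source keeps its quality problems, which is the disadvantage noted above.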

DQ Technique: Database Refactoring
A database refactoring is a simple change to a database schema or content that improves its quality.
Advantages:
● Safely addresses a DQ issue at the source
● Significant improvements can be made via a collection of small changes
Disadvantages:
● Requires coordination across teams
● Requires technical skills
When to apply it:
● To evolve production data sources
Rated on: Timeliness (Reactive to Proactive), Effect on Source (None to Direct), Required Skills, Benefit Realization (Long term to Immediate), DataOps Automation (None to Continuous)
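
As one possible illustration of a small change to database content, the sketch below standardizes country values in place, inside a transaction, using Python's built-in `sqlite3` module. The `customer` table, its columns, and the mapping of old values to standard codes are hypothetical. A real refactoring against a production source would also involve the cross-team coordination mentioned above.

```python
import sqlite3


def standardize_country_codes(conn, mapping):
    """Rewrite non-standard country values at the source; return rows changed."""
    changed = 0
    # The connection context manager commits on success and rolls back on error.
    with conn:
        for old_value, new_value in mapping.items():
            cursor = conn.execute(
                "UPDATE customer SET country = ? WHERE country = ?",
                (new_value, old_value),
            )
            changed += cursor.rowcount
    return changed


# Hypothetical source table with inconsistent values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO customer (country) VALUES (?)",
    [("USA",), ("U.S.",), ("US",), ("Canada",)],
)
conn.commit()

changed = standardize_country_codes(conn, {"USA": "US", "U.S.": "US", "Canada": "CA"})
codes = [row[0] for row in conn.execute("SELECT country FROM customer ORDER BY id")]
```

Unlike cleansing at the point of use, every consumer of the table benefits from this change, not only the ML initiative.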

Comparison: Benefit Realization vs Effect on Source
Chart axes: Effect on Source (None to Direct) and Benefit Realization (Long term to Immediate)
Techniques plotted: Data cleansing, Database refactoring, Data stewards, Manual regression testing, Automated regression testing, Transformation

The “Real Chart”: Benefit Realization vs Effect on Source
Source: AgileData.org/essays/dataqualitytechniquecomparison.html

Choosing Data Quality Techniques for Machine Learning
• Lifecycle phases: Envision, Construction, Deploy, Inference/Production
• Internal Data: Benefit: Immediate; Effect: Direct
• External Data: Timeliness: Reactive; Benefit: Immediate; Automation: Continuous

Context: Updating an Existing Production Database
• Internal Data: Benefit: Immediate; Effect on source: Direct
But… it isn’t as simple as taking the “best practices” as indicated in the green box. The context of your situation will drive your “best” choices. The technique ratings help you to narrow down the ones to consider.
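
Narrowing down by ratings can be sketched as a simple filter. Everything numeric here is an assumption: the presentation rates techniques on qualitative scales, so the 1 to 5 mapping, the `TechniqueRating` type, the `shortlist` helper, and the rating values below are illustrative placeholders, not the published ratings.

```python
from dataclasses import dataclass


# Assumption: each qualitative scale is mapped onto 1..5, where 5 is the
# right-hand end (Proactive, Continuous, Direct, Immediate, Straightforward).
@dataclass(frozen=True)
class TechniqueRating:
    name: str
    timeliness: int
    dataops_automation: int
    effect_on_source: int
    benefit_realization: int
    required_skills: int


# Placeholder values for illustration only.
RATINGS = [
    TechniqueRating("Data cleansing", 1, 4, 1, 5, 3),
    TechniqueRating("Database refactoring", 2, 3, 5, 4, 2),
    TechniqueRating("Data stewards", 4, 1, 4, 2, 3),
]


def shortlist(ratings, **minimums):
    """Return the names of techniques that meet every minimum rating given."""
    return [
        rating.name
        for rating in ratings
        if all(getattr(rating, factor) >= floor for factor, floor in minimums.items())
    ]


# Context from the slide: immediate benefit and a direct effect on the source.
candidates = shortlist(RATINGS, benefit_realization=4, effect_on_source=4)
```

The result is only a list of techniques to consider. As the slide says, the context of your situation still drives the final choice.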

Today’s Takeaways
• Data quality (DQ) is one of the critical factors for successful AI
• The best place to address DQ issues is at the source
• The DQ challenges that you face will drive your choice of multiple DQ techniques

Thank You!
• scott@scottambler.com
• linkedin.com/in/sambler/
• @scottambler.bsky.social
• Data Methodologist, Board Advisor: Ambysoft.com
• Thought Leader: AgileData.org
• Thought Leader: AgileModeling.com
• Co-Creator: pmi.org/disciplined-Agile

Would you like this presentation
for your chapter, user group, or organization?
Reach out to me at ScottAmbler.com

The Agile Data Mission
To share proven agile and lean strategies
for data initiatives.
Learn more at AgileData.org

The Agile Modeling Mission
To share proven and effective strategies for
modeling/mapping and documentation.
Learn more at AgileModeling.com

Comparison Factor: Timeliness
Timeliness considers when a data quality problem is addressed in practice.
Scale: Reactive to Proactive
Is the quality problem being avoided before it occurs or is it being fixed after it has occurred?

Comparison Factor: DataOps Automation
(Level of) DataOps automation considers the potential for the technique to be automated for your DataOps pipeline.
Scale: None to Continuous
How well is this technique supported by automation in practice?

Comparison Factor: Effect on Source
Effect on source addresses the potential for the technique to impact/address the quality issue at its source.
Scale: None to Direct
Are you solving the actual problem, or merely putting a bandage on it?

Comparison Factor: Benefit Realization
Benefit realization addresses the time frame in which the value from the improvement will be achieved.
Scale: Long term to Immediate
How long will it take to see an actual impact on data quality?

Comparison Factor: Required Skills
Required skills considers the amount of skill or experience required to successfully perform the technique.
Scale: Complex to Straightforward
How much skill or experience does it require to be effective at this technique?
How complex is the supporting process around the technique?
