Data Cleansing and Beyond
How to Address Data Debt for AI
Scott W. Ambler
Data Methodologist | Author
Ambysoft Inc.
© Ambysoft Inc. 1
Scott Ambler
• scott@scottambler.com
• linkedin.com/in/sambler/
• @scottambler.bsky.social
• Data Methodologist, Board Advisor: Ambysoft.com
• Thought Leader: AgileData.org
• Thought Leader: AgileModeling.com
• Co-Creator: pmi.org/disciplined-Agile
Agenda
• Today’s takeaways
• Defining artificial intelligence (AI)
• Machine Learning (ML) lifecycle
• Data quality (DQ)
• DQ and AI
• How to choose DQ techniques
• Parting thoughts
Today’s Takeaways
• Data quality (DQ) is one of the critical factors for successful AI
• The best place to address DQ issues is at the source
• The DQ challenges that you face will drive your choice of multiple DQ techniques
Defining Artificial Intelligence (AI)
• Artificial Intelligence (AI): Computer systems able to perform tasks normally requiring human intelligence.
• Machine Learning (ML): Computer systems with the ability to learn without being explicitly programmed. Training of ML models often involves significant data curation, sometimes including data labeling.
• Deep Learning (DL): Computer systems with “brain-like” logically structured algorithms called artificial neural networks (ANNs).
• Large Language Models (LLMs)/Generative AI: Computer systems based on very large deep learning models that are pre-trained on vast amounts of data.
The Machine Learning Lifecycle
You need a disciplined hybrid strategy
A Very High-Level View of the ML Lifecycle
[Diagram: Train the AI (Construction); Operate the AI/Make Inferences (Production); Learnings]
Adding Details to Construction
[Diagram: Construction (Choose a Training Strategy, Prepare the Data, Train the Model, Validate the Model); Deploy the AI; Production (Make Inferences); New data and learnings]
Construction is Iterative and Incremental
[Diagram: Construction (Choose a Training Strategy, Prepare Data, Evolve Requirements, Train Model, Validate the Model); Deploy the AI; Production (Make Inferences); New data and learnings]
A Machine Learning Lifecycle
[Diagram: Envision (Explore Usage, Explore Data, Explore Viability); Construction (Choose a Training Strategy, Prepare Data, Evolve Requirements, Train Model, Validate the Model); Deploy the AI; Production (Make Inferences); New data and learnings]
ML Lifecycle – MLOps Rendering
[Diagram: Explore data, usage, and viability; Choose training strategy, Prepare data, Evolve requirements, Train model; Make inferences; New data & learnings]
ML Lifecycle – Iterative “CRISP-DM” Rendering
[Diagram: Envision; Prepare Data; Train the Model; Validate; Deploy; Operate]
Data Debt
Data is the New Water
When water is dirty, we can:
1. Filter it just before we drink it
2. Filter it coming into our home
3. Filter it before it is dumped into our water supply
4. Filter it at the source
5. Clean the actual source
6. Fix whatever is polluting the source
Source: AgileData.org/essays/data-quality-metaphor.html
Definition: Data Debt
Technical debt is the accumulation of defects, quality issues (such as difficult-to-read code or low data quality), poor architecture, and poor design in existing solutions.
Data debt (data technical debt) refers to quality challenges associated with legacy data sources, including both mission-critical sources of record and “big data” sources of insight.
Source: AgileData.org/essays/dataTechnicalDebt.html
Causes of Data Debt
• Business prioritizing time to market over quality concerns
• Manual data entry
• Multiple siloed sources
• Lack of input validation
• Inconsistent data collection methods
• Inconsistent business rules across applications
• Ineffective data management
• Weak data literacy
Data Debt and AI
Data Quality is a Leading Challenge in AI
Source: K2View 2024 State of GenAI Data Readiness, k2view.com/genai-adoption-survey/
Data Challenges are Holding Firms Back
Source: Scale Zeitgeist AI Readiness Report 2024, scale.com/ai-readiness-report
Data Quality and ML
[Diagram: Envision (Explore Usage, Explore Data, Explore Viability); Construction (Choose a Training Strategy, Prepare Data, Evolve Requirements, Train Model, Validate the Model); Deploy the AI; Production (Make Inferences); New data and learnings]
Source: Ambysoft.com/essays/machine-learning-lifecycle.html
Potential DQ Issues Faced by ML Teams
• Biased data
• Insufficient data
• Semantic differences across sources
• Missing data values
• Inconsistent data values
• Unclear/unknown sources of record
• Ownership limitations
• Privacy/security
• Data poisoning
• Data drift
• and many more…
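Two of the issues listed above, missing data values and inconsistent data values, can be detected mechanically before any technique is chosen. The following is a minimal sketch in Python, assuming records arrive as dictionaries. The field names, the sample rows, and the `profile_records` helper are hypothetical illustrations and do not come from the presentation.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows; the field names are illustrative assumptions.
RECORDS = [
    {"customer_id": "C1", "country": "Canada", "email": "a@example.com"},
    {"customer_id": "C2", "country": "CANADA ", "email": ""},
    {"customer_id": "C3", "country": "canada", "email": None},
    {"customer_id": "C4", "country": "USA", "email": "d@example.com"},
]


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def profile_records(records):
    """Count missing values per field and report values that differ only in
    letter case or surrounding whitespace (a simple inconsistency signal)."""
    missing = Counter()
    variants = defaultdict(lambda: defaultdict(set))
    for record in records:
        for field, value in record.items():
            if is_missing(value):
                missing[field] += 1
            elif isinstance(value, str):
                variants[field][value.strip().lower()].add(value)
    inconsistent = {}
    for field, groups in variants.items():
        # Keep only normalized values that appear with more than one spelling.
        clashes = {key: sorted(found) for key, found in groups.items() if len(found) > 1}
        if clashes:
            inconsistent[field] = clashes
    return dict(missing), inconsistent


missing_counts, inconsistent_values = profile_records(RECORDS)
print(missing_counts)
print(inconsistent_values)
```

A profile like this only surfaces symptoms. Issues such as biased data, data poisoning, or unclear sources of record need different evidence and are not covered by this sketch.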
Which DQ techniques are right for you in your unique context?
Potential Data Quality Techniques for ML Teams
• Data cleansing
• Data stewards
• Data architecture
• Automated regression testing
• Data labeling
• Data masking
• AI data cleansing (e.g. for MDM)
• Database modeling
• Transformation (as in ETL)
• Reviews (design, architecture)
• Data guidance (standards, guidelines, …)
• Database refactoring
• Data repair
• Executable business rules (in the db)
• Synthetic training data
• Manual regression testing
• and more…
What Data Quality Techniques Should We Apply?
[Diagram: Envision; Construction; Deploy; Inference/Production; Internal Data; External Data; New data and learnings]
It depends!
Data Quality Technique Comparison Factors
A data quality (DQ) technique comparison factor provides a qualitative scale along which a strategy is rated for contextualization.
• Timeliness: Reactive to Proactive
• DataOps Automation: None to Continuous
• Effect on Source: None to Direct
• Benefit Realization: Long term to Immediate
• Required Skills: Sophisticated to Straightforward
Source: AgileData.org/essays/dataqualitytechniqueassessment.html
DQ Technique: Data Cleansing (at point of use)
[Figure: data cleansing rated on Timeliness, Effect on Source, Required Skills, Benefit Realization, and DataOps Automation]
Source data is “cleansed” programmatically to put it in the expected state
Advantages:
● Data quality problems are addressed for
the ML initiative
● Potentially a quick fix for your team
Disadvantages:
● Often requires significant effort
● DQ issues are not being addressed at the
source
When to apply it:
● When you cannot, or aren’t allowed to, fix data at the source
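As an illustration of cleansing at the point of use, here is a minimal sketch in Python. It assumes the ML team receives rows as dictionaries and needs trimmed strings, a single country code, and an integer age. The `COUNTRY_SYNONYMS` mapping, the field names, and the `cleanse_record` helper are hypothetical and are not taken from the presentation.

```python
# Hypothetical mapping from observed spellings to the expected value.
COUNTRY_SYNONYMS = {"usa": "US", "united states": "US", "u.s.": "US", "canada": "CA"}


def cleanse_record(record):
    """Return a cleansed copy of one record; the source record is not modified."""
    cleaned = dict(record)
    # Trim whitespace on every string field.
    for field, value in list(cleaned.items()):
        if isinstance(value, str):
            cleaned[field] = value.strip()
    # Map country spellings to one expected code; unknown values pass through.
    country = cleaned.get("country")
    if isinstance(country, str):
        cleaned["country"] = COUNTRY_SYNONYMS.get(country.lower(), country)
    # Convert numeric text to int; unusable values become None.
    try:
        cleaned["age"] = int(cleaned.get("age"))
    except (TypeError, ValueError):
        cleaned["age"] = None
    return cleaned


source_rows = [
    {"name": " Ana ", "country": "United States", "age": "34"},
    {"name": "Bo", "country": "canada ", "age": "n/a"},
]
training_rows = [cleanse_record(row) for row in source_rows]
print(training_rows)
```

Note that `source_rows` is left exactly as it was. That is the disadvantage named above in concrete form: every other consumer of the same source has to repeat, or reinvent, the same cleansing logic.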
DQ Technique: Database Refactoring
A database refactoring is a simple change to
a database schema or content that improves
its quality.
Advantages:
● Safely addresses a DQ issue at the source
● Significant improvements can be made
via a collection of small changes
Disadvantages:
● Requires coordination across teams
● Requires technical skills
When to apply it:
● To evolve production data sources
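To make "a simple change to a database schema or content" concrete, the following sketch standardizes the codes stored in one column and then blocks new non-standard values. It is an illustration under stated assumptions, not the presenter's procedure: an in-memory SQLite database stands in for a production source, and the `customer` table, the `STANDARD_CODES` mapping, and the trigger name are all hypothetical.

```python
import sqlite3

# Hypothetical mapping of legacy spellings to standard codes.
STANDARD_CODES = {"Canada": "CA", "canada": "CA", "USA": "US", "U.S.": "US"}

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a production source
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO customer (country) VALUES (?)",
    [("Canada",), ("USA",), ("canada",), ("US",)],
)
conn.commit()

# Step 1: fix the existing content inside a single transaction.
with conn:
    for legacy, code in STANDARD_CODES.items():
        conn.execute("UPDATE customer SET country = ? WHERE country = ?", (code, legacy))

# Step 2: stop new non-standard values from entering the table.
conn.execute(
    """
    CREATE TRIGGER customer_country_standard
    BEFORE INSERT ON customer
    FOR EACH ROW WHEN NEW.country NOT IN ('CA', 'US')
    BEGIN
        SELECT RAISE(ABORT, 'non-standard country code');
    END
    """
)

countries = [row[0] for row in conn.execute("SELECT country FROM customer ORDER BY id")]
print(countries)

# An application that still writes the legacy spelling is now rejected.
try:
    conn.execute("INSERT INTO customer (country) VALUES ('Canada')")
    rejected = False
except sqlite3.DatabaseError:
    rejected = True
print(rejected)
```

The rejected insert at the end shows why coordination across teams is listed as a disadvantage: any application that still writes the old values breaks until it is updated.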
[Figure: database refactoring rated on Timeliness, Effect on Source, Required Skills, Benefit Realization, and DataOps Automation]
Comparison: Benefit Realization vs Effect on Source
[Chart: Effect on Source (None to Direct) against Benefit Realization (Long term to Immediate). Techniques plotted: Data cleansing, Database refactoring, Data stewards, Manual regression testing, Automated regression testing, Transformation]
The “Real Chart”: Benefit Realization vs Effect on Source
Source: AgileData.org/essays/dataqualitytechniquecomparison.html
Choosing Data Quality Techniques for Machine Learning
[Diagram: Envision; Construction; Deploy; Inference/Production; Internal Data; External Data; New data and learnings]
Benefit: Immediate
Effect: Direct
Timeliness: Reactive
Benefit: Immediate
Timeliness: Reactive
Automation: Continuous
Timeliness: Reactive
Automation: Continuous
Context: Updating an Existing Production Database
Internal Data
Benefit: Immediate
Effect on source: Direct
But… it isn’t as simple as taking the “best
practices” as indicated in the green box.
The context of your situation will drive your
“best” choices. The technique ratings help you
to narrow down the ones to consider.
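The narrowing step can be sketched in code. The example below is a minimal illustration in Python, assuming each technique is rated from 1 to 5 on the five comparison factors. The numeric scale, the `TechniqueRating` class, the `shortlist` helper, and above all the rating values are hypothetical placeholders. They are not the ratings from the presentation or from AgileData.org.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TechniqueRating:
    """Ratings from 1 to 5 on each comparison factor.

    1 is the left end of each scale (Reactive, None, None, Long term,
    Sophisticated) and 5 is the right end (Proactive, Continuous, Direct,
    Immediate, Straightforward).
    """

    name: str
    timeliness: int
    automation: int
    effect_on_source: int
    benefit_realization: int
    required_skills: int


# Placeholder ratings for illustration only; NOT the ratings from the talk.
RATINGS = [
    TechniqueRating("Data cleansing", 1, 4, 1, 5, 3),
    TechniqueRating("Database refactoring", 2, 4, 5, 4, 2),
    TechniqueRating("Data stewards", 4, 1, 4, 2, 3),
    TechniqueRating("Automated regression testing", 2, 5, 3, 4, 2),
]


def shortlist(ratings, **minimums):
    """Keep techniques whose rating meets every minimum given for a factor."""
    return [
        rating.name
        for rating in ratings
        if all(getattr(rating, factor) >= floor for factor, floor in minimums.items())
    ]


# Context from the slide: immediate benefit and a direct effect on the source.
candidates = shortlist(RATINGS, benefit_realization=4, effect_on_source=4)
print(candidates)
```

The result is a shortlist to discuss, not a decision. As the slide says, the context of your situation still drives the final choice.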
Context: Working with External Data (during Construction)
External Data
Timeliness: Reactive
Automation: Continuous
Parting Advice
Today’s Takeaways
• Data quality (DQ) is one of the critical factors for successful AI
• The best place to address DQ issues is at the source
• The DQ challenges that you face will drive your choice of multiple DQ techniques
Thank You!
Would you like this presentation for your chapter, user group, or organization?
Reach out to me at ScottAmbler.com
The Agile Data Mission
To share proven agile and lean strategies for data initiatives.
Learn more at AgileData.org
The Agile Modeling Mission
To share proven and effective strategies for modeling/mapping and documentation.
Learn more at AgileModeling.com
Comparison Factor: Timeliness
Timeliness considers when a data quality problem is addressed in practice.
Scale: Reactive to Proactive
Is the quality problem being avoided before it occurs or is it being fixed after it has occurred?
Comparison Factor: DataOps Automation
(Level of) DataOps automation considers the potential for the technique to be automated for your DataOps pipeline.
Scale: None to Continuous
How well is this technique supported by automation in practice?
Comparison Factor: Effect on Source
Effect on source addresses the potential for the technique to impact/address the quality issue at its source.
Scale: None to Direct
Are you solving the actual problem, or merely putting a bandage on it?
Comparison Factor: Benefit Realization
Benefit realization addresses the time frame in which the value from the improvement will be achieved.
Scale: Long term to Immediate
How long will it take to see an actual impact on data quality?
Comparison Factor: Required Skills
Required skills considers the amount of skill or experience required to successfully perform the technique.
Scale: Sophisticated to Straightforward
How much skill or experience does it require to be effective at this technique?
How complex is the supporting process around the technique?
