Counteracting Selection
Bias in Machine Learning
Noam Finkelstein
MLConf SF
November 8th 2019
Overview
➣ Data are collected in all kinds of ways
➣ We pretend they are collected “At Random”
➣ This creates poor predictive performance in important
regions of the input space
➣ We can model the collection process to improve
performance
Takeaways
➣ Understand the importance of selection bias in ML
○ Not discussed as much in ML as in statistics
➣ Be able to identify when our data might have this problem.
➣ Learn how to model data collection.
➣ Learn how to use our data to learn about selection bias when
possible.
Data Collection Step 1:
Things happen
Data Collection Step 2:
Some of them get recorded
Bias in Data Collection Step 1
Things happen
➣ Selection bias: Correlation between how likely we are to
see a data point (X, Y), and the outcome Y
➣ Example 1:
○ We are asked to create a tool to help project managers
predict profit of software projects
○ Our data include all software projects previously
undertaken at the company
○ PMs are good at their jobs, so projects that lose money
are rarely in the data. They just don’t happen.
Bias in Data Collection Step 1
Things happen
[Figure: Profit vs. Project Complexity, showing Approved Projects]
Bias in Data Collection Step 1
Some Things Don’t Happen
[Figure: Profit vs. Project Complexity, showing Approved Projects and Rejected Projects]
Bias in Data Collection
➣ No ML model can learn about the
“complexity boundary”, even
though we have access to all the
projects that were undertaken.
Nothing is “missing”.
➣ This is a very bad way to fail!
Our model will do badly specifically
where we want it to protect us from
poor decisions.
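A minimal simulation of this failure mode, as a sketch under stated assumptions. The linear profit curve, the noise level, and the rule that only profitable proposals get undertaken are all hypothetical and chosen for illustration; `fit_line` and `predict` are helper names introduced here.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def true_profit(complexity):
    # Hypothetical ground truth: expected profit falls as complexity rises.
    return 10.0 - 2.0 * complexity

rng = random.Random(0)
proposals = [rng.uniform(0.0, 10.0) for _ in range(5000)]
profits = [true_profit(c) + rng.gauss(0.0, 3.0) for c in proposals]

# Assumption: only proposals that end up profitable are undertaken,
# so only those (complexity, profit) pairs are ever recorded.
approved = [(c, y) for c, y in zip(proposals, profits) if y > 0.0]

intercept, slope = fit_line([c for c, _ in approved], [y for _, y in approved])

def predict(complexity):
    return intercept + slope * complexity

for c in (2.0, 8.0):
    print(f"complexity {c}: true mean profit {true_profit(c):+.2f}, "
          f"predicted {predict(c):+.2f}")
```

In this setup the fitted line is reasonable where approved projects are plentiful and too optimistic at high complexity, which is exactly where a project manager would want a warning.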
Modeling the Data Collection Process
➣ We know proposals that are
unlikely to be profitable are unlikely
to occur in the data.
➣ We can incorporate that
knowledge about the data
collection process into our model
to address this problem.
Bias in Data Collection Step 2
We don’t see everything
[Figure: White Blood Cell Count vs. Weeks]
➣ We want to know how patients are doing when they’re away from the clinic
➣ Patients come in when they’re feeling unwell, i.e. when their WBC is elevated
➣ We’ll generally predict that they’re worse off than they are
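The effect can be sketched with a small simulation. The WBC distribution and the `visit_probability` curve below are made up for illustration; the only thing carried over from the slide is that visits become more likely as WBC rises.

```python
import math
import random

def visit_probability(wbc):
    # Hypothetical: the chance of a clinic visit rises with white blood cell count.
    return 1.0 / (1.0 + math.exp(-(wbc - 9.0)))

rng = random.Random(1)

# Made-up weekly WBC values for a simulated patient population.
true_wbc = [rng.gauss(7.0, 2.0) for _ in range(20000)]

# A value is recorded only if the patient happens to visit that week.
recorded = [w for w in true_wbc if rng.random() < visit_probability(w)]

true_mean = sum(true_wbc) / len(true_wbc)
recorded_mean = sum(recorded) / len(recorded)

print(f"weeks recorded: {len(recorded)} of {len(true_wbc)}")
print(f"true mean WBC: {true_mean:.2f}")
print(f"mean of recorded WBC: {recorded_mean:.2f}")
```

Any model trained only on the recorded values inherits this upward shift, so its predictions for weeks away from the clinic will tend to look worse than the patient really is.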
Prediction in Machine Learning
➣ We generally model E[Y | X] = g(X; θ)
➣ g is our favourite class of functions for regression or
classification, parameterized by θ
➣ “Easy” to do because Y is one dimensional, and
expectations are summary statistics
Modeling Data Collection
➣ Modeling the probability of observing some data point (X, Y) in full
is too hard (w/ finite data)!
➣ X is high dimensional
➣ Densities are complicated
Modeling Data Collection
➣ In many problems we care about, the probability of making
an observation is a function only of the outcome.
➣ Then the probability of making an observation is:
➣ Which, for (X, Y) pairs we don’t see, can be approximated:
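One way to write the two quantities on this slide, as a sketch. It reuses the deck's names p (observation probability) and g (outcome model); the indicator O, equal to 1 when a pair is recorded, and the parameter symbol θ are introduced here for illustration.

```latex
\begin{align*}
\Pr(O = 1 \mid X, Y) &= p(Y) \\
\Pr(O = 1 \mid X) &\approx p\bigl(g(X; \theta)\bigr)
  \quad \text{when } Y \text{ is not seen}
\end{align*}
```

The second line replaces the unseen outcome with the model's prediction for it, so it is only as good as that prediction.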
Incorporating Knowledge on Data Collection
➣ If we’re being frequentists, we can define a loss function
that captures both how well we do on predicting the outcome,
and how well we do on predicting observation:
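One possible form of such a loss, written as a sketch rather than the exact expression from the talk. It assumes a per-example prediction loss ℓ, a trade-off weight λ, a parameter symbol θ for g, and that the inputs x_j of unrecorded cases are known; all of these are introduced here for illustration.

```latex
\[
L(\theta) = \sum_{i \in \text{seen}} \ell\bigl(y_i,\, g(x_i; \theta)\bigr)
  \;-\; \lambda \Bigl[ \sum_{i \in \text{seen}} \log p(y_i)
  \;+\; \sum_{j \in \text{unseen}} \log\bigl(1 - p(g(x_j; \theta))\bigr) \Bigr]
\]
```

The first sum rewards accurate predictions on recorded pairs. The bracket is the log-likelihood of the observation pattern, so a prediction that would have made an unseen case likely to be recorded is penalized.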
Modeling Data Collection
➣ We can now learn from what we don’t see.
➣ We know there are regions of the input space w/ no data
➣ We know we’re less likely to see data w/ low profit
➣ Therefore: profit must be low in those regions
[Figure: Profit vs. Project Complexity, showing Approved Projects]
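The reasoning on this slide can be sketched in code under strong assumptions: the observation probability is known to be a logistic function of profit, the complexity of unrecorded proposals is known, and the noise scale is known. The objective adds a penalty of -log(1 - p(g(x))) for every unseen input to the usual squared error. The data, the function names (`objective`, `fit`) and the Newton solver are all illustrative choices, not the talk's implementation.

```python
import math
import random

SIGMA = 3.0  # noise scale of profit, assumed known for this sketch

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(z):
    # log(1 + exp(z)), which equals -log(1 - sigmoid(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def true_profit(c):
    return 10.0 - 2.0 * c  # hypothetical ground truth

rng = random.Random(0)
seen, unseen_x = [], []
for _ in range(4000):
    c = rng.uniform(0.0, 10.0)
    y = true_profit(c) + rng.gauss(0.0, SIGMA)
    if rng.random() < sigmoid(y):  # assumed known p(y) = sigmoid(y)
        seen.append((c, y))
    else:
        unseen_x.append(c)  # the proposal is known, its profit is not

def objective(a, b, lam):
    """Loss, gradient and Hessian for the line g(x) = a + b * x."""
    loss = ga = gb = haa = hab = hbb = 0.0
    w = 1.0 / SIGMA ** 2
    for x, y in seen:  # prediction term: scaled squared error
        r = a + b * x - y
        loss += 0.5 * w * r * r
        ga += w * r
        gb += w * r * x
        haa += w
        hab += w * x
        hbb += w * x * x
    for x in unseen_x:  # observation term: -log(1 - p(g(x))) = softplus(g(x))
        z = a + b * x
        s = sigmoid(z)
        curv = lam * s * (1.0 - s)
        loss += lam * softplus(z)
        ga += lam * s
        gb += lam * s * x
        haa += curv
        hab += curv * x
        hbb += curv * x * x
    return loss, (ga, gb), (haa, hab, hbb)

def fit(lam, iters=50):
    """Newton's method with step halving; lam = 0 is ordinary least squares."""
    a = b = 0.0
    for _ in range(iters):
        loss, (ga, gb), (haa, hab, hbb) = objective(a, b, lam)
        det = haa * hbb - hab * hab
        da = -(hbb * ga - hab * gb) / det
        db = -(haa * gb - hab * ga) / det
        if abs(da) + abs(db) < 1e-8:
            break
        step = 1.0
        while step > 1e-6 and objective(a + step * da, b + step * db, lam)[0] > loss + 1e-9:
            step *= 0.5
        a, b = a + step * da, b + step * db
    return a, b

naive = fit(0.0)   # ignores the unseen proposals
joint = fit(1.0)   # also uses the fact that they went unrecorded

for name, (a, b) in (("naive", naive), ("joint", joint)):
    print(f"{name}: predicted profit at complexity 8 = {a + 8.0 * b:+.2f} "
          f"(true mean {true_profit(8.0):+.2f})")
```

In this simulated setting the penalized fit predicts lower profit in the sparsely observed high-complexity region than the naive fit does. How well this works elsewhere depends on how accurate the assumed p is and on how much the plug-in approximation p(g(x)) distorts things when the outcome is noisy.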
What if we don’t know the data
collection process?
➣ We can’t learn p entirely from data - that would require us to
know the outcome specifically where we don’t observe it
(in most cases).
➣ If we have beliefs about p and g, we can be Bayesian about
things.
➣ If we have a few data points collected “at random” - i.e. not
according to p - then we can learn p
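A rough sketch of the last point, assuming the observation probability depends only on the outcome. In that case the outcome histogram of the biased sample, divided by the outcome histogram of the at-random sample, estimates p up to a constant factor. The simulated data, the bin edges and the hidden `visit_probability` are hypothetical; a histogram ratio is one simple estimator among many.

```python
import bisect
import math
import random

def visit_probability(wbc):
    # Hidden selection process; hypothetical and used only to simulate data.
    return 1.0 / (1.0 + math.exp(-(wbc - 9.0)))

rng = random.Random(2)

def draw_wbc():
    return rng.gauss(7.0, 2.0)  # made-up population of WBC values

# Large sample collected through the biased process.
biased = []
while len(biased) < 5000:
    w = draw_wbc()
    if rng.random() < visit_probability(w):
        biased.append(w)

# Small sample collected at random, e.g. from scheduled check-ups.
at_random = [draw_wbc() for _ in range(300)]

EDGES = [5.0, 7.0, 9.0]  # bin boundaries for the outcome (an arbitrary choice)

def bin_fractions(values):
    counts = [0] * (len(EDGES) + 1)
    for v in values:
        counts[bisect.bisect_right(EDGES, v)] += 1
    return [c / len(values) for c in counts]

biased_frac = bin_fractions(biased)
random_frac = bin_fractions(at_random)

# The ratio of the two histograms estimates p(y) up to a constant factor.
# (Bins that are empty in the at-random sample would need special handling.)
ratio = [b / r for b, r in zip(biased_frac, random_frac)]
relative_rate = [r / ratio[-1] for r in ratio]  # top bin scaled to 1

labels = ["< 5", "5 to 7", "7 to 9", "> 9"]
for label, rate in zip(labels, relative_rate):
    print(f"WBC {label}: relative observation rate {rate:.3f}")
```

Only the relative rates are identified this way; the absolute probability of observation would need extra information, such as how many cases went unrecorded.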
A Worked Example
➣ We have data collected according to some unknown,
non-random process p
[Figure: White Blood Cell Count vs. Weeks]
A Worked Example
➣ Functions compatible with this data will have different
behavior in unobserved regions
[Figure: White Blood Cell Count vs. Weeks]
A Worked Example
➣ We assume all data are “observed at random”, as usual. Fit
looks good!
➣ Validation data collected by the same process will not help!
[Figure: White Blood Cell Count vs. Weeks]
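A small simulation of why held-out data from the same process is no safeguard. The WBC trajectory, the visit rule and the straight-line model are hypothetical choices made for this sketch; `fit_line` and `rmse` are helper names introduced here.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(pairs, a, b):
    return math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in pairs) / len(pairs))

def true_wbc(week):
    # Hypothetical trajectory: WBC oscillates slowly around 7.
    return 7.0 + 2.0 * math.sin(week / 4.0)

def visit_probability(wbc):
    # Hypothetical: visits become much more likely once WBC passes 8.
    return 1.0 / (1.0 + math.exp(-3.0 * (wbc - 8.0)))

rng = random.Random(3)
population = []
for _ in range(6000):
    week = rng.uniform(0.0, 52.0)
    population.append((week, true_wbc(week) + rng.gauss(0.0, 0.5)))

# Only weeks with a clinic visit are recorded.
recorded = [(t, w) for t, w in population if rng.random() < visit_probability(w)]

# Train and validation sets both come from the recorded (biased) data.
rng.shuffle(recorded)
half = len(recorded) // 2
train, validation = recorded[:half], recorded[half:]

a, b = fit_line([t for t, _ in train], [w for _, w in train])
rmse_validation = rmse(validation, a, b)
rmse_population = rmse(population, a, b)

print(f"RMSE on biased validation set: {rmse_validation:.2f}")
print(f"RMSE over all weeks: {rmse_population:.2f}")
```

The validation error looks fine because the validation points share the training set's blind spot; the error over all weeks is what the patient actually experiences.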
A Worked Example
➣ But it turns out the data were not collected at random -
we’re systematically way off in unobserved regions!
[Figure: White Blood Cell Count vs. Weeks]
A Worked Example
➣ What if we know how much more likely we are to make an
observation when the outcome is high?
[Figure: White Blood Cell Count vs. Weeks]
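One standard way to use that knowledge is inverse probability weighting, shown here as a swapped-in illustration and not necessarily the approach used in the talk. The sketch assumes visits are four times as likely when WBC is above 8; that figure, the threshold and the simulated population are made up.

```python
import random

def p_observe(wbc):
    # Assumed known: visits are four times as likely when WBC is above 8.
    return 0.8 if wbc > 8.0 else 0.2

rng = random.Random(4)
true_wbc = [rng.gauss(7.0, 2.0) for _ in range(20000)]  # made-up population
recorded = [w for w in true_wbc if rng.random() < p_observe(w)]

true_mean = sum(true_wbc) / len(true_wbc)
naive_mean = sum(recorded) / len(recorded)

# Weight each recorded value by 1 / p(y). The weights are self-normalized,
# so only the ratio between the two observation probabilities matters.
weights = [1.0 / p_observe(w) for w in recorded]
weighted_mean = sum(wt * w for wt, w in zip(weights, recorded)) / sum(weights)

print(f"true mean WBC: {true_mean:.2f}")
print(f"naive mean of recorded values: {naive_mean:.2f}")
print(f"inverse-probability-weighted mean: {weighted_mean:.2f}")
```

The same weights can be passed to a weighted regression, as long as every outcome value has a non-zero chance of being recorded.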
A Worked Example
➣ What if we don’t know anything about data collection, but
get a few observations “at random”?
[Figure: White Blood Cell Count vs. Weeks]
Conclusions
➣ Selection bias hurts us in ML in ways we can’t detect
through normal validation procedures
➣ If we know something about the data collection process
we can incorporate it into our model to improve prediction.
➣ If we happen to have some data collected “at random”, we
can use it to learn about selection bias elsewhere in our
data.
Thank you!
Get in touch
noam@jhu.edu
@nsfinkelstein
26

More Related Content

Similar to Noam Finkelstein - The Importance of Modeling Data Collection

Module 1 introduction to machine learning
Module 1  introduction to machine learningModule 1  introduction to machine learning
Module 1 introduction to machine learning
Sara Hooker
 
Getting Started with Big Data and Splunk
Getting Started with Big Data and SplunkGetting Started with Big Data and Splunk
Getting Started with Big Data and Splunk
Tom Chavez
 
Housing price prediction
Housing price predictionHousing price prediction
Housing price prediction
Abhimanyu Dwivedi
 
"What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual..."What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual...
Dataconomy Media
 
Module 1.2 data preparation
Module 1.2  data preparationModule 1.2  data preparation
Module 1.2 data preparation
Sara Hooker
 
housepriceprediction-180915174356.pdf
housepriceprediction-180915174356.pdfhousepriceprediction-180915174356.pdf
housepriceprediction-180915174356.pdf
VinayShekarReddy
 
Advanced sampling part 1 presentation notes
Advanced sampling part 1   presentation notesAdvanced sampling part 1   presentation notes
Advanced sampling part 1 presentation notesAnthony Shingleton
 
U5 a1 stages in the decision making process
U5 a1 stages in the decision making processU5 a1 stages in the decision making process
U5 a1 stages in the decision making process
Peter R Breach
 
Analysing The Data
Analysing The DataAnalysing The Data
Analysing The Data
Angel Evans
 
Unit 2.pptx
Unit 2.pptxUnit 2.pptx
Challenges of Big Data Research
Challenges of Big Data ResearchChallenges of Big Data Research
Challenges of Big Data Research
Regional Science Academy
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learning
Knoldus Inc.
 
Understanding randomness
Understanding randomnessUnderstanding randomness
Understanding randomnesssuncil0071
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Ahmed Elmalla
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
Stephen Withington
 
Narrated Version Dallas MPUG
Narrated Version Dallas MPUGNarrated Version Dallas MPUG
Narrated Version Dallas MPUG
Glen Alleman
 
Module 1.3 data exploratory
Module 1.3  data exploratoryModule 1.3  data exploratory
Module 1.3 data exploratory
Sara Hooker
 
Making sense of community engagement, impacts and outcomes
Making sense of community engagement, impacts and outcomesMaking sense of community engagement, impacts and outcomes
Making sense of community engagement, impacts and outcomesMetroWater
 
Machine Learning for dummies!
Machine Learning for dummies!Machine Learning for dummies!
Machine Learning for dummies!
ZOLLHOF - Tech Incubator
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
Krishna Sankar
 

Similar to Noam Finkelstein - The Importance of Modeling Data Collection (20)

Module 1 introduction to machine learning
Module 1  introduction to machine learningModule 1  introduction to machine learning
Module 1 introduction to machine learning
 
Getting Started with Big Data and Splunk
Getting Started with Big Data and SplunkGetting Started with Big Data and Splunk
Getting Started with Big Data and Splunk
 
Housing price prediction
Housing price predictionHousing price prediction
Housing price prediction
 
"What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual..."What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual...
 
Module 1.2 data preparation
Module 1.2  data preparationModule 1.2  data preparation
Module 1.2 data preparation
 
housepriceprediction-180915174356.pdf
housepriceprediction-180915174356.pdfhousepriceprediction-180915174356.pdf
housepriceprediction-180915174356.pdf
 
Advanced sampling part 1 presentation notes
Advanced sampling part 1   presentation notesAdvanced sampling part 1   presentation notes
Advanced sampling part 1 presentation notes
 
U5 a1 stages in the decision making process
U5 a1 stages in the decision making processU5 a1 stages in the decision making process
U5 a1 stages in the decision making process
 
Analysing The Data
Analysing The DataAnalysing The Data
Analysing The Data
 
Unit 2.pptx
Unit 2.pptxUnit 2.pptx
Unit 2.pptx
 
Challenges of Big Data Research
Challenges of Big Data ResearchChallenges of Big Data Research
Challenges of Big Data Research
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learning
 
Understanding randomness
Understanding randomnessUnderstanding randomness
Understanding randomness
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
Narrated Version Dallas MPUG
Narrated Version Dallas MPUGNarrated Version Dallas MPUG
Narrated Version Dallas MPUG
 
Module 1.3 data exploratory
Module 1.3  data exploratoryModule 1.3  data exploratory
Module 1.3 data exploratory
 
Making sense of community engagement, impacts and outcomes
Making sense of community engagement, impacts and outcomesMaking sense of community engagement, impacts and outcomes
Making sense of community engagement, impacts and outcomes
 
Machine Learning for dummies!
Machine Learning for dummies!Machine Learning for dummies!
Machine Learning for dummies!
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
 

More from MLconf

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
MLconf
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
MLconf
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
MLconf
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
MLconf
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious Experience
MLconf
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
MLconf
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
MLconf
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the Cheap
MLconf
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of ML
MLconf
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
MLconf
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
MLconf
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
MLconf
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
MLconf
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
MLconf
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
MLconf
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
MLconf
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
MLconf
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
MLconf
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
MLconf
 
Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...
Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...
Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...
MLconf
 

More from MLconf (20)

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious Experience
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the Cheap
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of ML
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 
Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...
Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...
Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...
 

Recently uploaded

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 

Recently uploaded (20)

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 

Noam Finkelstein - The Importance of Modeling Data Collection

  • 1. Counteracting Selection Bias in Machine Learning. Noam Finkelstein, MLConf SF, November 8th 2019
  • 2. Overview ➣ Data are collected in all kinds of ways ➣ We pretend they are collected “At Random” ➣ This creates poor predictive performance in important regions of the input space ➣ We can model the collection process to improve performance
  • 3. Takeaways ➣ Understand the importance of selection bias in ML ○ Not discussed as much in ML as in statistics ➣ Be able to identify when our data might have this problem. ➣ Learn how to model data collection. ➣ Learn how to use our data to learn about selection bias when possible.
  • 4. Data Collection Step 1: Things happen
  • 5. Data Collection Step 2: Some of them get recorded
  • 6. Bias in Data Collection Step 1: Things happen ➣ Selection bias: Correlation between how likely we are to see a data point (X, Y), and the outcome Y ➣ Example 1: ○ We are asked to create a tool to help project managers predict profit of software projects ○ Our data include all software projects previously undertaken at the company ○ PMs are good at their jobs, so projects that lose money are rarely in the data. They just don’t happen.
  • 7. Bias in Data Collection Step 1: Things happen [Figure: Profit vs. Project Complexity, showing Approved Projects]
  • 8. Bias in Data Collection Step 1: Some Things Don’t Happen [Figure: Profit vs. Project Complexity, showing Approved Projects and Rejected Projects]
  • 9. Bias in Data Collection ➣ No ML model can learn about the “complexity boundary”, even though we have access to all the projects that were undertaken. Nothing is “missing”. ➣ This is a very bad way to fail! Our model will do badly specifically where we want it to protect us from poor decisions.
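
A minimal simulation of the project example, as a sketch under stated assumptions rather than anything taken from the talk: the profit curve `expected_profit`, the noise level, and the rule "a project is undertaken only if it turns a profit" are all hypothetical. Because this is a simulation, we can also see what the rejected projects would have earned, which is never available in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_profit(complexity):
    """Hypothetical ground truth: expected profit falls as complexity grows."""
    return 4.0 - 0.15 * complexity**2

# Simulated proposals; all numbers are illustrative.
n_proposals = 5000
complexity = rng.uniform(0.0, 10.0, size=n_proposals)
profit = expected_profit(complexity) + rng.normal(0.0, 1.0, size=n_proposals)

# Assumed selection rule: only projects that turn a profit are undertaken.
approved = profit > 0.0

# Fit a quadratic (the correct functional family) to approved projects only.
coefs = np.polyfit(complexity[approved], profit[approved], deg=2)
predicted = np.polyval(coefs, complexity)

# Mean signed error (prediction minus actual profit) within each group.
approved_bias = np.mean(predicted[approved] - profit[approved])
rejected_bias = np.mean(predicted[~approved] - profit[~approved])

print(f"approved projects: {approved.sum()} of {n_proposals}")
print(f"most complex approved project: {complexity[approved].max():.2f}")
print(f"mean error on approved projects: {approved_bias:+.3f}")
print(f"mean error on rejected projects: {rejected_bias:+.3f}")
```

The fit is unbiased on the projects we have and optimistic on the ones that never happened, which is exactly the region where a project manager needs the warning.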
  • 10. Modeling the Data Collection Process ➣ We know proposals that are unlikely to be profitable are unlikely to occur in the data. ➣ We can incorporate that knowledge about the data collection process into our model to address this problem.
  • 11. Bias in Data Collection Step 2: We don’t see everything [Figure: white blood cell count (WBC) vs. weeks] ➣ We want to know how patients are doing when they’re away from the clinic ➣ Patients come in when they’re feeling unwell, i.e. when their WBC is elevated ➣ We’ll generally predict that they’re worse off than they are
  • 12. Prediction in Machine Learning ➣ We generally model [formula shown on slide] ➣ g is our favourite class of functions for regression or classification, parameterized by [symbol shown on slide] ➣ “Easy” to do because Y is one dimensional, and expectations are summary statistics
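
One conventional way to write this setup, assuming the symbol θ for the parameters of g and squared error as the fitting criterion (both are assumptions, not taken from the slide):

```latex
\mathbb{E}[Y \mid X = x] = g(x; \theta),
\qquad
\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \bigl(y_i - g(x_i; \theta)\bigr)^2
```

Since Y is a scalar, a single conditional expectation per input is all that has to be learned.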
  • 13. Modeling Data Collection ➣ Modeling the probability of observing some data is too hard (w/ finite data)! ➣ X is high dimensional ➣ Densities are complicated
  • 14. Modeling Data Collection ➣ In many problems we care about, the probability of making an observation is a function only of the outcome. ➣ Then the probability of making an observation is: [formula shown on slide] ➣ Which, for (X, Y) pairs we don’t see, can be approximated: [formula shown on slide]
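
A sketch of the two expressions described here, under assumed notation: O indicates that the pair (X, Y) is recorded, p is the observation probability, and g(x; θ) is the prediction model, with θ an assumed symbol for its parameters.

```latex
P(O = 1 \mid X = x, Y = y) = p(y)
\qquad\text{and, for unseen pairs,}\qquad
p(y) \approx p\bigl(g(x; \theta)\bigr)
```

The approximation swaps the unknown outcome for the model's prediction, so the chance of having missed a point becomes a function of θ and can be trained on.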
  • 15. Incorporating Knowledge on Data Collection ➣ If we’re being frequentists, we can define a loss function that captures both how well we do on predicting the outcome, and how well we do on predicting observation: [formula shown on slide]
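
One possible form of such a loss, as a hedged sketch rather than the formula from the talk. Everything named here is an assumption made for illustration: the linear `g`, the logistic `p_observe`, the `weight` that trades off the two terms, and `x_unobs`, a set of inputs at which we know no observation was made.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(x, theta):
    """Hypothetical prediction function: a straight line in x."""
    return theta[0] + theta[1] * x

def p_observe(y):
    """Assumed, known observation model: higher outcomes are seen more often."""
    return sigmoid(y)

def joint_loss(theta, x_obs, y_obs, x_unobs, weight=1.0, eps=1e-12):
    """Squared prediction error plus a weighted observation log-loss."""
    pred_obs = g(x_obs, theta)
    pred_unobs = g(x_unobs, theta)
    prediction_loss = np.mean((y_obs - pred_obs) ** 2)
    # Observed inputs should look likely to be observed; unobserved ones should not.
    log_lik = np.sum(np.log(p_observe(pred_obs) + eps)) + np.sum(
        np.log(1.0 - p_observe(pred_unobs) + eps)
    )
    observation_loss = -log_lik / (len(x_obs) + len(x_unobs))
    return prediction_loss + weight * observation_loss

# Toy data: outcomes seen at low x, nothing seen at high x.
x_obs = np.array([0.0, 1.0, 2.0])
y_obs = np.array([3.0, 3.0, 3.0])
x_unobs = np.array([8.0, 9.0, 10.0])

theta_flat = np.array([3.0, 0.0])   # fits the observed points exactly
theta_down = np.array([3.0, -1.0])  # predicts low outcomes where nothing was seen

for weight in (0.0, 10.0):
    flat = joint_loss(theta_flat, x_obs, y_obs, x_unobs, weight=weight)
    down = joint_loss(theta_down, x_obs, y_obs, x_unobs, weight=weight)
    print(f"weight={weight}: flat={flat:.3f}, down={down:.3f}")
```

With the observation term switched off, the flat line wins because it fits the recorded points. Once the observation term carries enough weight, the downward-sloping line is preferred, because a flat line cannot explain why nothing was recorded at high x.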
  • 16. Modeling Data Collection ➣ We can now learn from what we don’t see. ➣ We know there are regions of the input space w/ no data ➣ We know we’re less likely to see data w/ low profit ➣ Therefore: profit must be low in those regions [Figure: Profit vs. Project Complexity, showing Approved Projects]
  • 17. What if we don’t know the data collection process? ➣ We can’t learn p entirely from data - it would require us to know the outcome specifically where we don’t observe it (in most cases). ➣ If we have beliefs about p and g, we can be Bayesian about things. ➣ If we have a few data points collected “at random” - i.e. not according to p - then we can learn p
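
A rough sketch of the last point, using a histogram ratio as a stand-in for whatever estimator the talk used. The idea: outcomes in the biased sample have density proportional to p(y) times the population density, so dividing by the outcome distribution of an at-random sample leaves something proportional to p(y). The helper `relative_selection`, the bin edges, and the simulated `true_p` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_p(y):
    """Selection probability used only to simulate data; unknown in practice."""
    return 1.0 / (1.0 + np.exp(-2.0 * y))

# Population outcomes, and the subset that the biased process records.
population = rng.normal(0.0, 1.0, size=20000)
y_biased = population[rng.uniform(size=population.size) < true_p(population)]

# A small sample collected "at random", i.e. independently of the outcome.
y_random = rng.normal(0.0, 1.0, size=300)

def relative_selection(y_biased, y_random, edges):
    """Per-bin ratio of biased to at-random outcome frequencies.

    The result estimates p(y) only up to a constant factor; bins with no
    at-random observations are returned as NaN.
    """
    biased_counts, _ = np.histogram(y_biased, bins=edges)
    random_counts, _ = np.histogram(y_random, bins=edges)
    biased_frac = biased_counts / biased_counts.sum()
    random_frac = random_counts / random_counts.sum()
    ratio = np.full(len(edges) - 1, np.nan)
    np.divide(biased_frac, random_frac, out=ratio, where=random_frac > 0)
    return ratio

edges = np.linspace(-3.0, 3.0, 5)
ratio = relative_selection(y_biased, y_random, edges)
centers = 0.5 * (edges[:-1] + edges[1:])
for center, value in zip(centers, ratio):
    print(f"outcome near {center:+.2f}: relative selection {value:.2f}")
```

The ratios rise with the outcome, recovering the shape of p from data alone. With only a few at-random points the estimate is noisy, so coarse bins (or a smooth parametric p) are the safer choice.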
  • 18. A Worked Example ➣ We have data collected according to some unknown, non-random process p [Figure: white blood cell count vs. weeks]
  • 19. A Worked Example ➣ Functions compatible with this data will have different behavior in unobserved regions [Figure: white blood cell count vs. weeks]
  • 20. A Worked Example ➣ We assume all data are “observed at random”, as usual. Fit looks good! ➣ Validation data collected by the same process will not help! [Figure: white blood cell count vs. weeks]
  • 21. A Worked Example ➣ But it turns out the data were not collected at random - we’re systematically way off in unobserved regions! [Figure: white blood cell count vs. weeks]
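
A small simulation of why validation on same-process data does not reveal the problem. This is an illustrative sketch, not the example from the talk: the oscillating `true_wbc`, the visit probability, and the sine/cosine regression are all assumed.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_wbc(weeks):
    """Hypothetical ground truth: WBC oscillates with a 12-week period."""
    return 7.0 + 2.0 * np.sin(2.0 * np.pi * weeks / 12.0)

def simulate(n):
    weeks = rng.uniform(0.0, 24.0, size=n)
    wbc = true_wbc(weeks) + rng.normal(0.0, 1.0, size=n)
    return weeks, wbc

def features(weeks):
    angle = 2.0 * np.pi * weeks / 12.0
    return np.column_stack([np.ones_like(weeks), np.sin(angle), np.cos(angle)])

# Biased collection: the assumed chance of a clinic visit rises with WBC.
weeks, wbc = simulate(4000)
visit = rng.uniform(size=weeks.size) < 1.0 / (1.0 + np.exp(-2.0 * (wbc - 8.0)))
weeks_seen, wbc_seen = weeks[visit], wbc[visit]

# Train on half of the recorded visits, validate on the other half.
half = weeks_seen.size // 2
coef, *_ = np.linalg.lstsq(features(weeks_seen[:half]), wbc_seen[:half], rcond=None)

def mean_error(w, y):
    """Mean of prediction minus measurement."""
    return np.mean(features(w) @ coef - y)

validation_bias = mean_error(weeks_seen[half:], wbc_seen[half:])

# Measurements taken at random times, regardless of how the patient feels.
weeks_random, wbc_random = simulate(500)
random_bias = mean_error(weeks_random, wbc_random)

print(f"mean error on same-process validation data: {validation_bias:+.3f}")
print(f"mean error on at-random measurements:       {random_bias:+.3f}")
```

The validation error sits near zero because the validation set shares the training set's bias. Only the at-random measurements expose that the model predicts patients to be worse off than they are.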
  • 22. A Worked Example ➣ What if we know how much more likely we are to make an observation when the outcome is high? [Figure: white blood cell count vs. weeks]
  • 23. A Worked Example ➣ What if we don’t know anything about data collection, but get a few observations “at random”? [Figure: white blood cell count vs. weeks]
  • 24. A Worked Example ➣ What if we don’t know anything about data collection, but get a few observations “at random”? [Figure: white blood cell count vs. weeks]
  • 25. Conclusions ➣ Selection bias hurts us in ML in ways we can’t detect through normal validation procedures ➣ If we know something about the data collection process we can incorporate it into our model to improve prediction. ➣ If we happen to have some data collected “at random”, we can use it to learn about selection bias elsewhere in our data.
  • 26. Thank you! Get in touch: noam@jhu.edu, @nsfinkelstein