Noam Finkelstein - The Importance of Modeling Data Collection

Counteracting Selection Bias in Machine Learning
Noam Finkelstein
MLConf SF
November 8th 2019

Overview
➣ Data are collected in all kinds of ways
➣ We pretend they are collected “at random”
➣ This leads to poor predictive performance in important regions of the input space
➣ We can model the collection process to improve performance

Takeaways
➣ Understand the importance of selection bias in ML
○ Not discussed as much in ML as in statistics
➣ Be able to identify when our data might have this problem
➣ Learn how to model data collection
➣ Learn how to use our data to learn about selection bias when possible

Data Collection Step 1:
Things happen

Data Collection Step 2:
Some of them get recorded

Bias in Data Collection Step 1
Things happen
➣ Selection bias: correlation between how likely we are to see a data point (X, Y) and the outcome Y
➣ Example 1:
○ We are asked to create a tool to help project managers predict the profit of software projects
○ Our data include all software projects previously undertaken at the company
○ PMs are good at their jobs, so projects that lose money rarely appear in the data. They just don’t happen.

Things happen
[Figure: Approved Projects plotted by Project Complexity and Profit]

Some Things Don’t Happen
[Figure: Approved Projects and Rejected Projects plotted by Project Complexity and Profit]

Bias in Data Collection
➣ No ML model can learn about the “complexity boundary”, even though we have access to all the projects that were undertaken. Nothing is “missing”.
➣ This is a very bad way to fail! Our model will do badly specifically where we want it to protect us from poor decisions.
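A small simulation makes this failure concrete. It is a sketch under invented assumptions, not the speaker's data: the profit curve `true_profit`, the noise level, and the approval rule (only proposals that would turn a profit are undertaken) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_profit(complexity):
    # Hypothetical ground truth: profit drops sharply past complexity 6.
    return 10.0 - 0.5 * complexity - 4.0 * np.maximum(complexity - 6.0, 0.0)

n = 5000
complexity = rng.uniform(0.0, 10.0, size=n)
profit = true_profit(complexity) + rng.normal(0.0, 1.0, size=n)

# Assumed selection rule: unprofitable proposals never happen,
# so they never enter the data set.
approved = profit > 0.0

# Fit a straight line to the approved projects only, as if "at random".
slope, intercept = np.polyfit(complexity[approved], profit[approved], deg=1)

# Compare the prediction with the assumed truth for a very complex proposal.
x_new = 9.5
predicted = slope * x_new + intercept
actual = true_profit(x_new)
print(f"predicted profit: {predicted:.2f}, true expected profit: {actual:.2f}")
```

The fitted line only ever sees projects on the profitable side of the boundary, so at high complexity it predicts far more profit than the assumed truth, which is a large loss.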

Modeling the Data Collection Process
➣ We know proposals that are unlikely to be profitable are unlikely to occur in the data.
➣ We can incorporate that knowledge about the data collection process into our model to address this problem.

We don’t see everything
[Figure: WhiteBloodCellCount over Weeks]
➣ We want to know how patients are doing when they’re away from the clinic
➣ Patients come in when they’re feeling unwell, with elevated WBC
➣ So we’ll generally predict that they’re worse off than they are
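The direction of this bias can be checked with a short simulation. This is a sketch with made-up numbers: the WBC trajectory and the logistic visit probability `p_visit` are assumptions chosen only to illustrate visits becoming more likely when WBC is high.

```python
import numpy as np

rng = np.random.default_rng(1)

weeks = np.arange(520)
# Hypothetical weekly WBC values: a baseline, a slow cycle, and noise.
wbc = 7.0 + 2.0 * np.sin(2 * np.pi * weeks / 26.0) + rng.normal(0.0, 1.0, size=weeks.size)

# Assumed data collection process: the patient is more likely to come in
# (and be measured) in weeks when WBC is elevated.
p_visit = 1.0 / (1.0 + np.exp(-(wbc - 9.0)))
visited = rng.random(weeks.size) < p_visit

true_mean = wbc.mean()               # how the patient is doing overall
observed_mean = wbc[visited].mean()  # what the clinic records show
print(f"true mean WBC: {true_mean:.2f}, mean over clinic visits: {observed_mean:.2f}")
```

The average over recorded visits sits above the average over all weeks, which is the sense in which a model trained on visits alone overestimates how unwell the patient is.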

Prediction in Machine Learning
➣ We generally model E[Y | X] = g(X; θ)
➣ g is our favourite class of functions for regression or classification, parameterized by θ
➣ “Easy” to do because Y is one dimensional, and expectations are summary statistics

Modeling Data Collection
➣ Modeling the probability of observing some data (X, Y) directly is too hard (w/ finite data)!
➣ X is high dimensional
➣ Densities are complicated

➣ In many problems we care about, the probability of making
an observation is a function only of the outcome.
➣ Then the probability of making an observation is:
➣ Which, for (X, Y) pairs we don’t see, can be approximated:
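One way to write these two quantities, consistent with the later slides that call the observation model p and the outcome model g. The observation indicator O and the parameter symbol θ are assumed notation, not taken from the slides.

```latex
\begin{align*}
P(O = 1 \mid X, Y) &= p(Y) \\
p(Y) &\approx p\big(g(X; \theta)\big) \quad \text{when } Y \text{ is not observed}
\end{align*}
```

The second line substitutes the model's predicted outcome for the outcome we never got to see.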

Incorporating Knowledge on Data Collection
➣ If we’re being frequentists, we can define a loss function that captures both how well we do on predicting the outcome, and how well we do on predicting observation:
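A minimal sketch of such a loss, under assumptions that are not in the slides: the observation probability `p_obs` is a known logistic function of the outcome, the outcome model `g` is flat with a hinge at a hypothetical knot, and we have a set of inputs `x_unseen` at which nothing was recorded (for example, rejected proposals). Because `p_obs` is taken as known, the observation term for points we did see does not depend on the parameters and is left out.

```python
import numpy as np

def p_obs(y, scale=1.0):
    # Assumed known probability of recording a point whose outcome is y.
    return 1.0 / (1.0 + np.exp(-scale * y))

def g(x, theta, knot=6.0):
    # Hypothetical outcome model: constant level theta[0], with slope
    # theta[1] applied only past the knot.
    return theta[0] + theta[1] * np.maximum(x - knot, 0.0)

def loss(theta, x_seen, y_seen, x_unseen, weight=1.0):
    # Prediction term: squared error on the outcomes we observed.
    pred_term = np.mean((y_seen - g(x_seen, theta)) ** 2)
    # Observation term: where nothing was recorded, the predicted outcome
    # should make "not observed" a likely event.
    p_unseen = np.clip(p_obs(g(x_unseen, theta)), 1e-12, 1.0 - 1e-12)
    obs_term = -np.mean(np.log(1.0 - p_unseen))
    return pred_term + weight * obs_term

rng = np.random.default_rng(2)
x_seen = rng.uniform(0.0, 5.0, size=200)
y_seen = 5.0 + rng.normal(0.0, 0.5, size=200)
x_unseen = rng.uniform(7.0, 10.0, size=50)

flat = np.array([5.0, 0.0])   # profit stays high where we have no data
drop = np.array([5.0, -4.0])  # profit falls where we have no data
loss_flat = loss(flat, x_seen, y_seen, x_unseen)
loss_drop = loss(drop, x_seen, y_seen, x_unseen)
print(f"flat: {loss_flat:.3f}, drop: {loss_drop:.3f}")
```

Both candidates fit the observed points equally well, since the hinge is inactive there. The observation term is what separates them: it penalizes the flat candidate for predicting high outcomes in a region where, under the assumed `p_obs`, such outcomes would almost surely have been recorded.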

➣ We can now learn from what we don’t see.
➣ We know there are regions of the input space w/ no data
➣ We know we’re less likely to see data w/ low profit
➣ Therefore: profit must be low in those regions
[Figure: Approved Projects plotted by Project Complexity and Profit]

What if we don’t know the data collection process?
➣ We can’t learn p entirely from data: that would require us to know the outcome specifically where we don’t observe it (in most cases).
➣ If we have beliefs about p and g, we can be Bayesian about things.
➣ If we have a few data points collected “at random”, i.e. not according to p, then we can learn p.
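One simple way to see why an "at random" sample helps is to compare the distribution of outcomes in the selected data with the distribution in the random sample. This is a sketch, not the method from the talk: it uses a plain histogram ratio, the selection function `p_true` and all sample sizes are invented, and it recovers p only up to a constant factor.

```python
import numpy as np

rng = np.random.default_rng(3)

def p_true(y):
    # Hypothetical selection process, treated as unknown to the analyst.
    return 1.0 / (1.0 + np.exp(-2.0 * y))

# Data recorded through the biased process.
population = rng.normal(0.0, 1.0, size=200_000)
selected = population[rng.random(population.size) < p_true(population)]

# A much smaller sample collected "at random" from the same population.
random_sample = rng.normal(0.0, 1.0, size=1000)

bins = np.linspace(-2.0, 2.0, 9)
sel_counts, _ = np.histogram(selected, bins=bins)
rnd_counts, _ = np.histogram(random_sample, bins=bins)

# Within each outcome bin, the ratio of relative frequencies is
# proportional to the probability of being observed.
ratio = (sel_counts / selected.size) / (rnd_counts / random_sample.size)
relative_p = ratio / ratio.max()

centers = 0.5 * (bins[:-1] + bins[1:])
for c, r in zip(centers, relative_p):
    print(f"outcome ~ {c:+.2f}: relative observation probability {r:.2f}")
```

The estimate rises with the outcome, matching the shape of the assumed `p_true`. With very few random points the per-bin ratios get noisy, so a smoother estimator would be preferable in practice.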

A Worked Example
➣ We have data collected according to some unknown, non-random process p
[Figure: WhiteBloodCellCount over Weeks]

A Worked Example
➣ Functions compatible with this data will have different behavior in unobserved regions
[Figure: WhiteBloodCellCount over Weeks]

A Worked Example
➣ We assume all data are “observed at random”, as usual. Fit looks good!
➣ Validation data collected by the same process will not help!
[Figure: WhiteBloodCellCount over Weeks]

A Worked Example
➣ But it turns out the data were not collected at random, and we’re systematically way off in unobserved regions!
[Figure: WhiteBloodCellCount over Weeks]

A Worked Example
➣ What if we know how much more likely we are to make an observation when the outcome is high?
[Figure: WhiteBloodCellCount over Weeks]

A Worked Example
➣ What if we don’t know anything about data collection, but get a few observations “at random”?
[Figure: WhiteBloodCellCount over Weeks]

Conclusions
➣ Selection bias hurts us in ML in ways we can’t detect through normal validation procedures.
➣ If we know something about the data collection process, we can incorporate it into our model to improve prediction.
➣ If we happen to have some data collected “at random”, we can use it to learn about selection bias elsewhere in our data.

Thank you!
Get in touch
noam@jhu.edu
@nsfinkelstein
