Noam Finkelstein - The Importance of Modeling Data Collection

The Importance of Modeling Data Collection

Data sets used in machine learning are often collected in a systematically biased way - certain data points are more likely to be collected than others. We call this "observation bias". For example, in health care, we are more likely to see lab tests when the patient is feeling unwell than otherwise. Failing to account for observation bias can, of course, result in poor predictions on new data. By contrast, properly accounting for this bias allows us to make better use of the data we do have.
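To make the effect concrete, here is a minimal simulation sketch of observation bias. It is not taken from the talk: the standard normal outcomes, the logistic recording probability, and the names `simulate` and `slope` are all assumptions chosen for illustration.

```python
import math
import random


def simulate(n=20000, slope=2.0, seed=0):
    """Draw outcomes, then record each one with a chance that grows with its value.

    Assumption for illustration only: the recording probability is logistic
    in the outcome, so high outcomes are over-represented in the data.
    """
    rng = random.Random(seed)
    outcomes = [rng.gauss(0.0, 1.0) for _ in range(n)]
    recorded = [
        y for y in outcomes
        if rng.random() < 1.0 / (1.0 + math.exp(-slope * y))
    ]
    return outcomes, recorded


outcomes, recorded = simulate()
true_mean = sum(outcomes) / len(outcomes)    # what we would like to estimate
naive_mean = sum(recorded) / len(recorded)   # what the recorded data suggest

print(f"mean of all outcomes:      {true_mean:.3f}")
print(f"mean of recorded outcomes: {naive_mean:.3f}")
```

In this setup the recorded outcomes average noticeably higher than the full set, which mirrors the lab-test example: estimates built only on what was recorded are pulled toward the values that were most likely to be recorded.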

In this presentation, we discuss practical and theoretical approaches to dealing with observation bias. When the nature of the bias is known, there are simple adjustments we can make to nonparametric function estimation techniques, such as Gaussian Process models. We also discuss the scenario where the data collection model is unknown. In this case, there are steps we can take to estimate it from observed data. Finally, we demonstrate that having a small subset of data points that are known to be collected at random - that is, in an unbiased way - can vastly improve our ability to account for observation bias in the rest of the data set.
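As a small illustration of the known-bias case, the sketch below uses inverse-probability weighting to correct a simple mean. This is a standard stand-in technique, not the Gaussian Process adjustment described in the talk, and the logistic `observation_probability` function is an assumed, hypothetical collection model.

```python
import math
import random


def observation_probability(y, slope=1.0):
    """Assumed-known chance that an outcome y gets recorded (hypothetical model)."""
    return 1.0 / (1.0 + math.exp(-slope * y))


def weighted_mean(values, probabilities):
    """Average the values, weighting each by 1 / (its probability of being recorded)."""
    weights = [1.0 / p for p in probabilities]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


rng = random.Random(1)
outcomes = [rng.gauss(0.0, 1.0) for _ in range(20000)]
# Biased collection: each outcome is kept with a probability that depends on it.
recorded = [y for y in outcomes if rng.random() < observation_probability(y)]

true_mean = sum(outcomes) / len(outcomes)
naive_mean = sum(recorded) / len(recorded)
adjusted_mean = weighted_mean(
    recorded, [observation_probability(y) for y in recorded]
)

print(f"true mean:     {true_mean:.3f}")
print(f"naive mean:    {naive_mean:.3f}")
print(f"adjusted mean: {adjusted_mean:.3f}")
```

The weighting only works because the collection model is known here. When it is unknown, the weights themselves have to be estimated, which is where a small subset of data collected at random becomes useful.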

My hope is that attendees of this presentation will be aware of the perils of observation bias in their own work, and be equipped with tools to address it.

Published in: Technology

  1. Counteracting Selection Bias in Machine Learning
     Noam Finkelstein, MLConf SF, November 8th 2019
  2. Overview
     ➣ Data are collected in all kinds of ways
     ➣ We pretend they are collected “At Random”
     ➣ This creates poor predictive performance in important regions of the input space
     ➣ We can model the collection process to improve performance
  3. Takeaways
     ➣ Understand the importance of selection bias in ML
        ○ Not discussed as much in ML as in statistics
     ➣ Be able to identify when our data might have this problem.
     ➣ Learn how to model data collection.
     ➣ Learn how to use our data to learn about selection bias when possible.
  4. Data Collection, Step 1: Things happen
  5. Data Collection, Step 2: Some of them get recorded
  6. Bias in Data Collection, Step 1: Things happen
     ➣ Selection bias: Correlation between how likely we are to see a data point (X, Y), and the outcome Y
     ➣ Example 1:
        ○ We are asked to create a tool to help project managers predict profit of software projects
        ○ Our data include all software projects previously undertaken at the company
        ○ PMs are good at their jobs, so projects that lose money rarely appear in the data. They just don’t happen.
  7. Bias in Data Collection, Step 1: Things happen
     [Figure: axes Project Complexity and Profit; legend: Approved Projects]
  8. Bias in Data Collection, Step 1: Some Things Don’t Happen
     [Figure: axes Project Complexity and Profit; legend: Approved Projects, Rejected Projects]
  9. Bias in Data Collection
     ➣ No ML model can learn about the “complexity boundary”, even though we have access to all the projects that were undertaken. Nothing is “missing”.
     ➣ This is a very bad way to fail! Our model will do badly specifically where we want it to protect us from poor decisions.
  10. Modeling the Data Collection Process
     ➣ We know proposals that are unlikely to be profitable are unlikely to occur in the data.
     ➣ We can incorporate that knowledge about the data collection process into our model to address this problem.
  11. Bias in Data Collection, Step 2: We don’t see everything
     [Figure: White Blood Cell Count over Weeks]
     ➣ We want to know how patients are doing when they’re away from the clinic
     ➣ Patients come in when they’re feeling unwell, with elevated WBC
     ➣ We’ll generally predict that they’re worse off than they are
  12. Prediction in Machine Learning
     ➣ We generally model: [equation not preserved in the transcript]
     ➣ g is our favourite class of functions for regression or classification, parameterized by [symbol not preserved in the transcript]
     ➣ “Easy” to do because Y is one dimensional, and expectations are summary statistics
  13. Modeling Data Collection
     ➣ Modeling the probability of observing some data is too hard (w/ finite data)!
     ➣ X is high dimensional
     ➣ Densities are complicated
  14. Modeling Data Collection
     ➣ In many problems we care about, the probability of making an observation is a function only of the outcome.
     ➣ Then the probability of making an observation is: [equation not preserved in the transcript]
     ➣ Which, for (X, Y) pairs we don’t see, can be approximated: [equation not preserved in the transcript]
  15. Incorporating Knowledge on Data Collection
     ➣ If we’re being frequentists, we can define a loss function that captures both how well we do on predicting the outcome, and how well we do on predicting observation: [equation not preserved in the transcript]
  16. Modeling Data Collection
     ➣ We can now learn from what we don’t see.
     ➣ We know there are regions of the input space w/ no data
     ➣ We know we’re less likely to see data w/ low profit
     ➣ Therefore: profit must be low in those regions
     [Figure: axes Project Complexity and Profit; legend: Approved Projects]
  17. What if we don’t know the data collection process?
     ➣ We can’t learn p entirely from data - would require us to know the outcome specifically where we don’t observe it (in most cases).
     ➣ If we have beliefs about p and g, we can be Bayesian about things.
     ➣ If we have a few data points collected “at random” - i.e. not according to p - then we can learn p
  18. A Worked Example
     ➣ We have data collected according to some unknown, non-random process p
     [Figure: White Blood Cell Count over Weeks]
  19. A Worked Example
     ➣ Functions compatible with this data will have different behavior in unobserved regions
     [Figure: White Blood Cell Count over Weeks]
  20. A Worked Example
     ➣ We assume all data are “observed at random”, as usual. Fit looks good!
     ➣ Validation data collected by the same process will not help!
     [Figure: White Blood Cell Count over Weeks]
  21. A Worked Example
     ➣ But it turns out the data was not collected at random - we’re systematically way off in unobserved regions!
     [Figure: White Blood Cell Count over Weeks]
  22. A Worked Example
     ➣ What if we know how much more likely we are to make an observation when the outcome is high?
     [Figure: White Blood Cell Count over Weeks]
  23. A Worked Example
     ➣ What if we don’t know anything about data collection, but get a few observations “at random”?
     [Figure: White Blood Cell Count over Weeks]
  24. A Worked Example
     ➣ What if we don’t know anything about data collection, but get a few observations “at random”?
     [Figure: White Blood Cell Count over Weeks]
  25. Conclusions
     ➣ Selection bias hurts us in ML in ways we can’t detect through normal validation procedures
     ➣ If we know something about the data collection process we can incorporate it into our model to improve prediction.
     ➣ If we happen to have some data collected “at random”, we can use it to learn about selection bias elsewhere in our data.
  26. Thank you! Get in touch: noam@jhu.edu, @nsfinkelstein
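
The equations on slides 12 to 15 are not present in this transcript. The block below is one possible formalization that is consistent with the bullet text on those slides; it is a sketch, not a recovery of the original formulas. The names g and p follow slide 17, while the symbols \theta, O (an indicator that a pair is recorded), \ell (a per-point prediction loss), and \lambda (a trade-off weight) are assumptions introduced here.

```latex
% Prediction model (slide 12): regress the outcome on the inputs.
\mathbb{E}[Y \mid X = x] = g(x; \theta)

% Observation model (slide 14): recording depends only on the outcome.
\Pr(O = 1 \mid X = x, Y = y) = p(y)

% For inputs whose outcome is never seen, plug in the model's prediction.
\Pr(O = 1 \mid X = x) \approx p\bigl(g(x; \theta)\bigr)

% Combined loss (slide 15): prediction error on recorded pairs, plus a
% penalty for mispredicting which inputs end up recorded.
L(\theta) = \sum_{i \,:\, O_i = 1} \ell\bigl(y_i,\, g(x_i; \theta)\bigr)
  - \lambda \sum_{i} \Bigl[ O_i \log p\bigl(g(x_i; \theta)\bigr)
  + (1 - O_i) \log\bigl(1 - p\bigl(g(x_i; \theta)\bigr)\bigr) \Bigr]
```

Under this reading, the second term is what lets the model "learn from what we don't see": an input region with no recorded data is only plausible if p(g(x; \theta)) is small there, which pushes the predicted outcome toward values that are unlikely to be observed.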
