
How can algorithms be biased?

In this talk, we will walk through the steps of building an algorithm to predict property prices from a dataset of property listings, focusing predominantly on finding the right features to include in the model.



  1. How Do Algorithms Become Biased? Eva Sasson @evasasson #DataDayMX
  2. @evasasson
  3. What you’ll learn today: 1. How to build a predictive model 2. Where in the building process bias can be introduced 3. What the real-world ramifications are
  4. What are all these “buzzwords”? ● Data science produces insights ● Machine learning produces predictions ● Artificial intelligence produces actions
  5. paulvanderlaken.com
  6. First off, let’s define bias: 1. Sample bias 2. Systematic value distortion 3. Prejudice or stereotype bias
  7. Let’s build our model! Predicting Property Prices
  8. Get data (1) → Clean & Prepare (2) → Train & Test (3) → Improve (4) → Predict (5)
  9. Step 1: Gathering the data
  10. Getting Data ◂ Public datasets ◂ APIs ◂ Existing datasets ◂ Surveys ◂ Web scrapers
  11. Getting Data ◂ Public datasets ◂ APIs ◂ Existing datasets ◂ Surveys ◂ Web scrapers: import.io, Beautiful Soup, Scrapy
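
A minimal Beautiful Soup sketch of the scraping step; the URL and CSS selectors below are placeholders for whatever listings site you target, not a real endpoint:

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/listings"           # placeholder URL
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

listings = []
for card in soup.select("div.listing"):        # hypothetical CSS class
    price = card.select_one("span.price")      # hypothetical CSS class
    address = card.select_one("span.address")  # hypothetical CSS class
    if price and address:
        listings.append({"price": price.get_text(strip=True),
                         "address": address.get_text(strip=True)})
print(f"Scraped {len(listings)} listings")
```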
  12. Our Model: Predicting Housing Prices
  13. $$$ is our chosen prediction variable
  14. The biggest concentration of bias is in the training data itself!
  15. Removing ‘race’ from the dataset doesn’t remove the problem ◂ The variables that remain can proxy race ◂ If race is a useful predictor, then you have a hole in the data ◂ Indirect discrimination
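
One rough way to surface such proxies (not from the talk): check how strongly each remaining feature correlates with the protected attribute. The file and column names here are hypothetical:

```python
import pandas as pd

df = pd.read_csv("listings.csv")    # hypothetical dataset
protected = "pct_minority"          # hypothetical protected-attribute column
numeric = df.select_dtypes("number").drop(columns=[protected])

# Features with high absolute correlation can stand in for the dropped attribute
proxies = numeric.corrwith(df[protected]).abs().sort_values(ascending=False)
print(proxies.head(10))
```

Correlation is only a first pass: a proxy can also be a nonlinear combination of features that no single pairwise correlation reveals.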
  16. Now we know the risks of training data... What do we do now?
  17. Step 2: Explore and Clean Data
  18. Get data (1) → Clean & Prepare (2) → Train & Test (3) → Improve (4) → Predict (5)
  19. 80% of data science is cleaning and preparing data
  20. Examples of data cleaning: 1. Remove duplicates 2. Remove empty columns 3. Remove irrelevant variables 4. Fill empty rows with averages, or mark them as 0 5. Remove rows that are blank for the features most important to you 6. Standardize units
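
A pandas sketch of those six steps; the file and column names are placeholders for whatever your listings dataset uses:

```python
import pandas as pd

df = pd.read_csv("listings.csv")                        # hypothetical file

df = df.drop_duplicates()                               # 1. remove duplicates
df = df.dropna(axis=1, how="all")                       # 2. remove empty columns
df = df.drop(columns=["listing_url"], errors="ignore")  # 3. remove irrelevant variables
df["sqft"] = df["sqft"].fillna(df["sqft"].mean())       # 4. fill gaps with the average...
df["parking"] = df["parking"].fillna(0)                 #    ...or mark them as 0
df = df.dropna(subset=["price", "bedrooms"])            # 5. drop rows blank on key features
df["price_k"] = df["price"] / 1_000                     # 6. standardize units (here, $k)
```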
  21. 9,500 rows to start, 4,140 duplicates and blank rows removed, 5,360 rows of data remaining
  22. Be careful in the cleaning process! Some variables can safely be transformed or dropped; others cannot.
  23. “Collecting and cleaning data is an inherently subjective process.” -Fabliha Ibnat, researcher at the University of Washington
  24. Training data = biased, skewed, incomplete, human-labelled, human-cleaned
  25. Supplement your dataset with new features that might help you!
  26. Adding additional variables by zip code ◂ Yelp count of stars ◂ Yelp average of stars ◂ Average household income ◂ Per capita income ◂ High-income households (% > $200k/yr)
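
A sketch of how those per-zip-code features might be bolted on with a left join; `zip_stats.csv` and its columns are hypothetical stand-ins for the Yelp and census aggregates listed above:

```python
import pandas as pd

df = pd.read_csv("listings_clean.csv")    # cleaned listings, with a "zip" column
zip_stats = pd.read_csv("zip_stats.csv")  # hypothetical: zip, yelp_star_count,
                                          # yelp_avg_stars, avg_household_income,
                                          # pct_income_over_200k
df = df.merge(zip_stats, on="zip", how="left")  # left join keeps every listing
```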
  27. Yelp data seems pretty democratic; that can’t cultivate bias, right?
  28. Not so fast!
  29. “As people talk about authenticity more online, star ratings decrease, independent of food quality.” -Sara Kay, food educator in NYC
  30. Housing prices from Yelp stars: low housing prices where there are many ethnic restaurants, high housing prices where there are few
  31. What happened with Amazon Express?
  32. We have our data, and it’s cleaned and wrangled. Now we’re ready to build our model with it.
  33. Step 3: Train your model & test it
  34. Get data (1) → Clean & Prepare (2) → Train & Test (3) → Improve (4) → Predict (5)
  35. 80:20 split: pick your train:test ratio
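
The 80:20 split in scikit-learn, assuming the prepared dataset from the sketches above with a `price` target column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("listings_features.csv")  # hypothetical prepared dataset
X = df.drop(columns=["price"])
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% train, 20% test
```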
  36. Optimize for: MAPE (mean absolute percentage error)
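
MAPE averages |actual - predicted| / |actual| and expresses it as a percentage, so lower is better. A minimal implementation:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(mape([100, 200], [110, 180]))  # -> 10.0
```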
  37. Results table
  38. Variable Importance for Random Forest
  39. Variable Importance for XGBoost
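
How charts like these are typically produced; the talk’s exact model settings aren’t shown, so the hyperparameters here are illustrative defaults:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)  # from the split sketch above
importance = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importance.sort_values(ascending=False).head(10))

# XGBoost exposes the same attribute through its scikit-learn wrapper:
#   from xgboost import XGBRegressor
#   xgb = XGBRegressor(random_state=42).fit(X_train, y_train)
#   print(xgb.feature_importances_)
```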
  40. 13% of prediction power based on high household income
  41. “Algorithms replicate the status quo.” -Cathy O’Neil, author, speaker, professor
  42. Step 4: Improve (hyperparameter tuning)
  43. Get data (1) → Clean & Prepare (2) → Train & Test (3) → Improve (4) → Predict (5)
  44. Experiment with hyperparameter tuning ◂ Increase or decrease the number of trees ◂ 10-fold cross-validation ◂ Look at depth ◂ Random seed ◂ Where to split the data
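
A grid-search sketch over the first few knobs on that list, with 10-fold cross-validation and a fixed seed; the grid values are illustrative, and the MAPE scorer requires scikit-learn 0.24 or newer:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],         # number of trees
    "max_depth": [None, 10, 20],             # depth
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),  # fixed random seed
    param_grid,
    cv=10,                                   # 10-fold cross-validation
    scoring="neg_mean_absolute_percentage_error",
)
search.fit(X_train, y_train)                 # from the split sketch above
print(search.best_params_)
```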
  45. Removing outliers leads to the lowest MAPE
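
The talk doesn’t spell out its outlier rule, so here is one common choice, a 1.5×IQR filter on price. Note the quote that follows: the “outliers” you drop may be exactly the people the model then serves worst.

```python
# Assumes `df` from the cleaning sketch above
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```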
  46. “Algorithms will do more justice to the people who are easiest to understand, at the expense of those who aren’t.” -Michael Veale, PhD researcher in responsible ML at UCL
  47. Error modeling for XGBoost
  48. Step 5: Predict. Put your model into action to uncover predictive insights
  49. Get data (1) → Clean & Prepare (2) → Train & Test (3) → Improve (4) → Predict (5)
  50. Actual value vs. predicted value
  51. Issues and areas of bias with prediction
  52. Problems of big-data hubris
  53. Algorithms make the same prediction every time.
  54. Most algorithms are secret.
  55. Human bias treated as science. Opinion embedded in math. -Cathy O’Neil
  56. Disadvantages the already disadvantaged.
  57. What we can take from all of this: conclusions, takeaways & solutions
  58. What can we do about it?
  59. “When we look at bias as just a technical issue, we are missing the point.” -Kate Crawford
  60. Awareness: we must check our training data
  61. Python packages: FairML (black-box auditing), AI Fairness 360
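
Those packages automate audits like this hand-rolled group comparison: check whether predictions differ systematically across a protected-group flag. The flag column is hypothetical, and `rf`, `X_test`, `y_test` come from the sketches above:

```python
import pandas as pd

audit = pd.DataFrame({
    "group": X_test["majority_minority_zip"],  # hypothetical protected-group flag
    "predicted": rf.predict(X_test),
    "actual": y_test.to_numpy(),
})
# Mean predicted vs. actual price per group; a gap flags systematic under-valuation
print(audit.groupby("group")[["predicted", "actual"]].mean())
```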
  62. Transparency: What assumptions were made? How were decisions made? Who may be affected? What was the underlying logic? Who is most at risk?
  63. Purpose limitation (just because the data exists doesn’t mean it should be used for new purposes)
  64. Representation! Diversity in ideas, opinions, and perspectives
  65. Any questions? Thanks! @evasasson Eva Sasson
