"Practical Approaches to Training Data Strategy: Bias, Legal and Ethical Considerations," a Presentation from Samasource


For the full video of this presentation, please visit:

For more information about embedded vision, please visit:

Audrey Jill Boguchwal, Senior Product Manager at Samasource, presents the "Practical Approaches to Training Data Strategy: Bias, Legal and Ethical Considerations" tutorial at the May 2019 Embedded Vision Summit.

Recent McKinsey research cites the top five limitations that prevent companies from adopting AI technology. Training data strategy is a common thread. Companies face challenges obtaining enough AI training data, developing strategies for robust data quality and ensuring that bias does not occur.

In this presentation, Boguchwal explores training data strategies that avoid bias in the data and that consider legal and ethical factors. She explains common types of bias, how bias can creep into datasets, the impact of bias, how to avoid bias and how to test your model for bias. She discusses legal and ethical considerations in data sourcing, including real cases where legal and ethical complications can arise, the impact of these complications and best practices for avoiding or mitigating them.

  1. © 2019 Samasource. Practical Approaches to Training Data Strategy: Bias, Legal and Ethical Considerations. Audrey Jill Boguchwal, Samasource, May 2019
  2. Training Data is the Soul of AI
     Training data lays the groundwork for model performance (IBM, Microsoft, MIT CSAIL).
     Computer vision training data may include images, video, lidar, radar and other sensor data.
  3. AI Development and Adoption Challenges
     Training data presents the majority of challenges that can limit AI development.
       • Obtaining data sets
       • Labeling training data
       • Bias in training data, bias in algorithms, and bias in models
       • Explaining why a decision was reached by an algorithm
       • Carrying learnings from one algorithm model to another
     Source: “Notes from the AI frontier: Applications and value of deep learning,” McKinsey
  4. Presentation Outline: Training Data Bias and Sourcing
     Strategies to avoid data bias and obtain data ethically and legally.
       • Common types of bias
       • How unintended bias can creep into datasets
       • Impact of biased training data
       • Strategies to avoid many types of bias
       • How to test for bias
       • Legal and ethical data sourcing considerations, with real-world examples and impact of problems
       • Best practices to avoid and mitigate sourcing issues
  5. Common Types of Unintended Training Data Bias
  6. Sample Bias
     Data is unrepresentative of reality.
     Example: Data set has too few examples of people with darker skin tones.
     (Stock image example, not a real dataset)
  7. Historical Bias
     Data reflects a prejudice or stereotype that we do not want to project into the future.
     Example: Data set has many images of women in kitchens and men in offices, but few of the reverse.
     (Stock image example, not a real dataset)
  8. Measurement Bias
     Systematic value distortion caused by a problem with the device capturing the data.
     Example: Image data came from one camera only, with an overexposure problem.
     (Stock image example, not from a real dataset)
  9. How Unintended Bias Can Creep into Datasets
  10. Dataset Bias
      Datasets used in training have similar images and lack diversity.
      Example: Car images from 5 data sets have similar qualities within each set.
      (From “An Unbiased Look at Dataset Bias,” citation in Resources.)
  11. Selection and Capture Biases
      Selection: Keyword search returns similar images.
      Capture: Objects are photographed in similar ways that do not generalize.
      Example: Google Image results for “sunglasses” are too similar to one another.
      (Google Image search results for “Sunglasses,” all photographed in a similar way)
  12. Class Imbalance
      Too few or too many examples of a class.
      Example: Dataset for a dog classifier has too many German Shepherds and no other dogs.
      (From Stanford Dogs Dataset.)
  13. Negative Set Bias
      Data of “the rest of the world” is not well represented or balanced.
      Example: Features that classify “woman” are not on the person, but in the environment.
      (Stock image examples, not from a real dataset)
  14. Impact of Biased Training Data: Case Studies
  15. Models Trained on Biased Data Can Be Less Accurate
      Models can be overconfident and not discriminative.
      Models will classify based on the wrong features, leading to misclassifications.
      Example: Classifier uses the scene, not the person, to identify the gender of a person.
      (From “Men Also Like Shopping,” citation in Resources.)
  16. Biased Data Has Ethical, Legal, and Safety Implications
        • Inability to detect presence, identity and/or correct gender expression of people with darker skin tones
        • Causes problems for facial recognition used in identification, surveillance, and law enforcement (“Gender Shades”)
        • Lack of visibility as seen by autonomous vehicles, potentially (“Predictive Inequity in Object Detection”)
        • Perpetuating historical, negative stereotypes across race & gender
        • Stereotype: women belong in the kitchen, men in the office (“Men Also Like Shopping”)
        • Google Photos wrongly labeled a black person as a gorilla (as posted on Twitter, discussed in popular press)
  17. Case: AVs More Likely to Hit People with Darker Skin?
      Test data was used to determine whether object detection systems, like those seen in self-driving cars, detect pedestrians of all skin tones equitably, and if not, why.
      Results indicate detection accuracy is 5% higher for lighter skin, but many unaccounted variables remain.
      (Stock image example, not from a real dataset)
  18. Is All Training Data Bias Undesirable?
      Unintended bias in data is undesirable.
        • All datasets are biased because they are not the full visual world
        • If data accurately represents reality and reality has a statistical bias, then the data should share that bias
        • Goals: understand, mitigate and manage bias
  19. Strategies to Avoid Training Data Bias
  20. Strategies to Avoid the Effects of Training Data Bias
      Offset dataset bias and capture bias by preprocessing data.
        • For object classifiers, if images look similar, consider transformations: flip or automatically crop the images to vary them
      Avoid negative set bias by varying data.
        • Collect data that contains background scenes in addition to objects of interest
      Avoid selection bias by varying search terms and data sources.
        • Vary keywords and search engines to retrieve different kinds of images
  21. Ensure Reality is Always Represented in the Data
      Avoid sample bias by sourcing and selecting training data with the end training goal in mind.
        • Ensure many diverse examples of all classes and edge cases
        • Example: When classifying pedestrians, source city street data showing people from all demographics. Highway data with few people isn’t a fit.
      Avoid historical bias and measurement bias with diverse sources.
        • Have multiple, diverse, varied data sources from many devices
        • Example: Use more than one training set, especially if it’s a stock set
        • Refresh data and retrain several times a year as the world changes
        • Example: Refresh data often for a clothing classifier to keep up with fashion
  22. Case: “Gender Shades” on Facial Dataset Diversity
      Joy Buolamwini: real and average faces used to test and train facial recognition.
  23. Tests to Detect Dataset Bias
  24. Dataset Test: Name that Dataset
      If a test classifier can identify the source dataset of an image, there may be dataset bias.
      Example: Given 3 images from 12 popular datasets, can you match the images with their dataset?
      (From “Unbiased Look at Dataset Bias,” citation in Resources)
  25. Model Test: Cross Dataset Generalization
      Test how well a typical object detector trained on one “native” dataset can generalize when tested on other, representative sets.
      Example: Can an object detector trained on LabelMe cars identify other cars? If not, that indicates problems with the LabelMe data.
      (From “An Unbiased Look at Dataset Bias,” citation in Resources.)
  26. Model Test: Negative Bias
      Test that a model is using the right features from the data to define objects, and evaluate whether background data is representative.
      Example: Test a model’s classification of “not car” using “not car” examples from other datasets it hasn’t been trained on.
      (Stock image example, not from a real dataset)
  27. Legal and Ethical Data Sourcing Considerations
  28. Check Local Privacy and Property Laws, Consult Experts
      Governments move more slowly than technology. Laws can change.
      Example: IBM’s “Diversity in Faces” used public Flickr photos without explicit consent. That may not be legal in the future and could discredit the dataset, which would be a shame.
  29. Case: Compliant Facial Data Sourcing in East Africa
      A tech company was legally sourcing diverse facial images from East Africa, complete with consent forms.
      After collecting the data, it realized that Kenyan privacy laws were more rigid.
      It used Uganda-sourced data only, instead of risking legal action in Kenya.
  30. Best Practices: Acquire Data Ethically and Legally
        • Know the legal definition of data consent in the collection location
        • If scraping (legally), consider images of celebrities who are already in the public eye
        • Data from private citizens, even if legal, is more likely to cause controversy
        • Buy data from accredited sources that own and manage image rights and know how to do business, such as Getty
        • It may cost more, but it might save you legal fees and embarrassing headlines
        • Document and credit sources
        • Understand EU’s GDPR & other major laws
  31. Best Practices: Evaluate Methodology for Ethics
      Use Fast.AI’s “Data Checklist” to work to make fewer ethical mistakes:
        • Have we tested our training data to ensure that it is fair and representative?
        • Have we studied and understood the possible sources of bias in our data?
        • Does our team reflect diversity of opinions, background and all kinds of thought [enabling us to see and catch more bias]?
        • What kinds of user consent do we need to collect or use the data?
        • Do we have a mechanism for gathering consent from users?
        • Have we clearly explained what users are consenting to?
  32. Best Practices: Understand What Bias Truly Means
        • Humans are inherently biased; eliminating all forms of bias is impossible
        • Understand cognitive bias, limitations and decision making (your algorithm makes decisions)
        • Challenge and test assumptions: weigh evidence, don’t jump to conclusions
        • Constantly, rigorously examine bias: your own biases, and the biases of those providing data/information
  33. Key Takeaways to Avoid Bias and Source Properly
        • Clearly articulate your end training goal and know what data is needed to get to it
        • Map out ways bias can enter data, and proactively source data to avoid it
        • Ensure data represents reality for your training goal in quantity and diversity; replenish data often
        • Test data before and after training on a wide range of data
        • Be aware of ethics and laws, both current and potential
        • Always get proper consent for data, even for public data
  34. 25% of the Fortune 50 Trust Samasource to Solve Their Training Data Challenges
      Over one billion points annotated in 2018. We’ve helped lift 50,000 people out of poverty.
      Meet Samasource at booth #621
  35. Resources
  36. Resources: General Resources
        • Whose lives matter to self-driving cars? news/whose-lives-matter-to-self-driving-cars-043019.html
        • 16 Things You Can Do to Make Tech More Ethical, part 1: action-1/, Checklist for data projects
        • When it comes to Gorillas, Google Photos Remains Blind: comes-to-gorillas-google-photos-remains-blind/
        • MIT Tech Review: AI Bias: 2876/this-is-how-ai-bias-really-happensand-why-its-so-hard-to-fix/
        • Challenges with AI: insights/artificial-intelligence/notes-from-the-ai-frontier-applications-and-value-of-deep-learning
        • Stanford Dog Dataset: geNetDogs/
  37. Resources: Papers & Studies Referenced
        • Unbiased Look at Dataset Bias: wnload?doi= &type=pdf
        • Predictive Inequity in Object Detection
        • Undoing the Damage of Dataset Bias: s/eccv2012_khosla.pdf
        • Men also like Shopping
        • Impact of Biases in Big Data
        • Gender Shades & Update: mwini18a/buolamwini18a.pdf, content/uploads/2019/01/AIES-19_paper_223.pdf
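
The class imbalance problem described on slide 12 can be checked before training by counting labels. The following is a minimal sketch, not something shown in the talk: the function name `class_balance_report`, the `min_share` threshold of 5% and the example label names are all assumptions made for illustration.

```python
from collections import Counter

def class_balance_report(labels, expected_classes=None, min_share=0.05):
    """Return each class's share of the dataset and the classes below min_share.

    expected_classes (hypothetical parameter) lets the check also catch
    classes that should exist but have zero examples.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cls: n / total for cls, n in counts.items()}
    for cls in expected_classes or []:
        shares.setdefault(cls, 0.0)  # class is entirely missing from the data
    underrepresented = sorted(c for c, share in shares.items() if share < min_share)
    return shares, underrepresented

# Illustrative labels for a dog classifier dominated by one breed.
labels = ["german_shepherd"] * 95 + ["beagle"] * 3 + ["poodle"] * 2
shares, rare = class_balance_report(
    labels, expected_classes=["german_shepherd", "beagle", "poodle", "husky"]
)
print(rare)  # classes that need more examples before training
```

The right threshold depends on the training goal; a fixed 5% share is only a starting point.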
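
Slide 20 suggests flipping or cropping images that look too similar. As a rough sketch of what those two transformations do, the code below treats an image as a plain list of pixel rows. The helper names `horizontal_flip` and `center_crop` are hypothetical, and a real pipeline would more likely rely on an image library.

```python
def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def center_crop(image, height, width):
    """Cut a height x width window out of the middle of the image."""
    top = (len(image) - height) // 2
    left = (len(image[0]) - width) // 2
    return [row[left:left + width] for row in image[top:top + height]]

# A tiny 4x4 "image" whose pixel values make the effect easy to see.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
flipped = horizontal_flip(image)
cropped = center_crop(image, 2, 2)
print(flipped[0])  # [4, 3, 2, 1]
print(cropped)     # [[6, 7], [10, 11]]
```

Both functions return new lists, so the original image stays available alongside its augmented copies.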
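
The "Name that Dataset" test on slide 24 can be approximated by training a classifier to predict which dataset a sample came from: accuracy well above chance suggests each dataset carries its own signature. The sketch below is an illustration under loud assumptions, not the method from the cited paper. It uses synthetic two-dimensional features standing in for image descriptors, a simple nearest-centroid classifier, and hypothetical names such as `centroid_classifier_accuracy` and `synthetic_set`.

```python
import random

def centroid_classifier_accuracy(train, test):
    """Fit a nearest-centroid classifier on (features, dataset_name) pairs
    and return its accuracy at naming the source dataset of each test item."""
    sums, counts = {}, {}
    for features, name in train:
        total = sums.setdefault(name, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[name] = counts.get(name, 0) + 1
    centroids = {name: [v / counts[name] for v in total] for name, total in sums.items()}

    def predict(features):
        # Pick the dataset whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda name: sum((a - b) ** 2 for a, b in zip(features, centroids[name])),
        )

    hits = sum(predict(features) == name for features, name in test)
    return hits / len(test)

def synthetic_set(name, mean, size, rng):
    # Hypothetical 2-D descriptors (for example brightness and contrast).
    return [([rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)], name) for _ in range(size)]

rng = random.Random(0)
# Two sources with different capture signatures: easy to tell apart.
distinct_train = synthetic_set("set_a", 0.0, 200, rng) + synthetic_set("set_b", 4.0, 200, rng)
distinct_test = synthetic_set("set_a", 0.0, 200, rng) + synthetic_set("set_b", 4.0, 200, rng)
# Two sources drawn from the same distribution: no signature to exploit.
similar_train = synthetic_set("set_a", 0.0, 200, rng) + synthetic_set("set_b", 0.0, 200, rng)
similar_test = synthetic_set("set_a", 0.0, 200, rng) + synthetic_set("set_b", 0.0, 200, rng)

chance = 1 / 2  # two candidate datasets
distinct_accuracy = centroid_classifier_accuracy(distinct_train, distinct_test)
similar_accuracy = centroid_classifier_accuracy(similar_train, similar_test)
print(f"distinct sources: {distinct_accuracy:.2f}, similar sources: {similar_accuracy:.2f}, chance: {chance:.2f}")
```

With real data, the features would come from the images themselves, and accuracy far above chance would be a cue to diversify sources as slide 21 recommends.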