
Data Con LA 2022 - AI Ethics

Melinda Thielbar, Data Science Practice Lead and Director of Data Science at Fidelity Investments
From corporations to governments to private individuals, most of the AI community has recognized the growing need to incorporate ethics into the development and maintenance of AI models. Much of the current discussion, though, is meant for leaders and managers. This talk is directed to data scientists, data engineers, ML Ops specialists, and anyone else who is responsible for the hands-on, day-to-day work of building, productionalizing, and maintaining AI models. We'll give a short overview of the business case for why technical AI expertise is critical to developing an AI Ethics strategy. Then we'll discuss the technical problems that cause AI models to behave unethically, how to detect problems at all phases of model development, and the tools and techniques that are available to support technical teams in Ethical AI development.

Data Con LA 2022 - AI Ethics

  1. Building AI That Works for Everyone: AI Ethics for Technical People
  2. About Me
     • Ph.D. Statistician
     • Labor Economist
     • Software Developer
     • Artist
     • Midwest Farm Girl
     • Pronouns: she, her, hers
  3. About This Talk
     Focused on "high-stakes AI"
     • Defined by Sambasivan, Kapania, Highfill, Akrong, Paritosh, and Aroyo (2021)
     • I do recommend these exercises for everyone.
     AI Ethics problems require input from technical people. Many of our biggest issues come from manual verification of automated systems.
     When I say "AI that works for everyone," I mean everyone:
     • People using the model
     • People affected by the model
     • Data labelers
     • Data engineers
     • Machine learning engineers
     • Data scientists
  4. An Actual LinkedIn Poll from an AI Ethics Expert
     "Which model would you rather have? A black-box cancer screening model with 99% accuracy, or an explainable cancer screening model with 90% accuracy?"
     This is the wrong question!
     Confusion matrix (rows = actual, columns = Predicted Cancer | Predicted No Cancer):
     • Has Cancer (1% of patients): TP (True Positive) | FN (False Negative)
     • Does Not Have Cancer (99% of patients): FP (False Positive) | TN (True Negative)
     Outcomes:
     • Has Cancer, Predicted Cancer: more screening. Has Cancer, Predicted No Cancer: has cancer and does not know.
     • No Cancer, Predicted Cancer: more screening. No Cancer, Predicted No Cancer: no extra screening.
     Accuracy = (True Positives + True Negatives) / Total Patients
     Recall = True Positives / (True Positives + False Negatives)
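     The slide's 1%/99% split makes the point concrete: a model that never predicts cancer scores roughly 99% accuracy and 0% recall. A minimal sketch (not from the slides) with sklearn.metrics and synthetic data:

        import numpy as np
        from sklearn.metrics import accuracy_score, recall_score

        rng = np.random.default_rng(0)
        y_true = (rng.random(100_000) < 0.01).astype(int)  # ~1% of patients have cancer
        y_pred = np.zeros_like(y_true)                     # a model that never predicts cancer

        print("accuracy:", accuracy_score(y_true, y_pred))  # ~0.99 -- looks impressive
        print("recall:  ", recall_score(y_true, y_pred))    # 0.0  -- misses every cancer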
  5. Typical AI/ML Pipeline
     Feedback on model performance in production is the cornerstone of an AI Ethics practice.
     [Pipeline diagram: training data and code produce a model; scoring data and the model produce decisions; an operator reviews decisions; people affected by decisions give feedback; the results feed Failure Analysis, Fairness Analysis, and Impact Analysis.]
  6. Typical AI/ML Pipeline
     In practice, anything that isn't model training or scoring is:
     • Ad hoc
     • Manual
     • Prone to data errors
     [Same pipeline diagram as slide 5.]
  7. [Diagram: the pillars of Trustworthy AI] Human Agency and Oversight • Fairness • Accountability • Prevention of Harm • Social and Environmental Well-Being • Technical Robustness and Safety • Privacy and Data Governance
  8. Technical Pillars of Trustworthy AI
     Pillars: Human Agency and Oversight • Fairness • Accountability • Prevention of Harm • Social and Environmental Well-Being • Technical Robustness and Safety • Privacy and Data Governance
     • How often does the model fail, and what is the impact? (Failure Analysis)
     • Are model failures the same for everyone? (Fairness Analysis)
     • How do we know the model is failing? (Failure Monitoring)
     • Does this model work for everyone? (Impact Analysis)
  9. Typical AI/ML Pipeline
     Technical leaders and individual contributors have a role in each of these pillars.
     [Same pipeline diagram as slide 5, with Failure Analysis, Fairness Analysis, and Impact Analysis annotated with the Trustworthy AI pillars at each stage.]
  10. Technical Pillars of Trustworthy AI (section divider; same content as slide 8)
  11. Failure Analysis (Fairness, Prevention of Harm)
      1. Find the cell in the confusion matrix that causes the most harm to the least advantaged group.
      2. Analyze rates and outcomes for that cell.
      Cancer screening outcomes: predicted cancer leads to more screening whether or not cancer is present; a missed cancer means the patient has cancer and does not know.
      Fraud screening outcomes: on predicted fraud, the account is audited (the model makes money on true fraud; an honest customer gets audited on a false positive); on a missed fraud the model loses money; a correct "no fraud" means no audit.
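      One way to put steps 1 and 2 into practice is to tag every scored case with its confusion-matrix cell and the outcome that cell implies. A rough sketch for the fraud example, with hypothetical column names and toy data:

         import pandas as pd

         df = pd.DataFrame({
             "actual_fraud":    [1, 1, 0, 0, 1, 0],
             "predicted_fraud": [1, 0, 1, 0, 0, 0],
         })

         def cell(row):
             if row.actual_fraud and row.predicted_fraud:
                 return "TP: fraud caught, audit, model makes $"
             if row.actual_fraud and not row.predicted_fraud:
                 return "FN: fraud missed, no audit, model loses $"
             if not row.actual_fraud and row.predicted_fraud:
                 return "FP: honest customer audited"
             return "TN: no fraud, no audit"

         df["cell"] = df.apply(cell, axis=1)
         print(df["cell"].value_counts(normalize=True))  # rates and outcomes for each cell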
  12. Aequitas Fairness Tree (Accountability, Technical Robustness and Safety)
      • Is being predicted positive punitive or assistive?
      • Punitive: Can you intervene with most people or just a subset? Which group is harmed most by mistakes?
        - Everyone: # False Positives / Group Size
        - People who get the intervention: False Discovery Rate (FDR)
        - People who do not get the intervention: False Positive Rate
      • Assistive: Which group is harmed most by mistakes?
        - Everyone: # False Negatives / Group Size
        - People with actual need: False Negative Rate, True Positive Rate (Recall)
        - People not assisted: False Omission Rate
      Fairness Tree: Data Science and Public Policy, Carnegie Mellon University
      http://www.datasciencepublicpolicy.org/our-work/tools-guides/aequitas/
  13. Failure Analysis: Pre-Deployment (Accountability, Technical Robustness and Safety)
      • Failure analysis is often ad hoc and depends heavily on the data sources available.
        e.g., We may not know how many cancers human screeners miss.
      • Deployment should include automating failure analysis.
      • Deployment should include plans for the cadence of failure analysis.
  14. Tools for Failure Analysis (Accountability, Technical Robustness and Safety)
      • Every model will produce the statistics listed in the fairness tree (e.g., sklearn.metrics).
      • It is up to the modeling team to decide which statistics are the most important and to display them in a way that communicates impact to stakeholders.
      • Deciding on a set of metrics that should be monitored post-deployment is part of the analysis.
      • Once the analysis is done, it should be automated so it can be re-run at regular intervals. These scripts are usually tailored to the business problem.
      • AWS Clarify has a nice set of tools for calculating and displaying statistics.
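      A minimal sketch of pulling the fairness-tree statistics out of a single confusion matrix with sklearn.metrics; the labels below are made-up placeholders:

         from sklearn.metrics import confusion_matrix

         y_true = [0, 0, 1, 1, 1, 0, 1, 0]
         y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

         tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
         group_size = tn + fp + fn + tp

         metrics = {
             "FP / group size":            fp / group_size,
             "False Discovery Rate (FDR)": fp / (fp + tp),
             "False Positive Rate (FPR)":  fp / (fp + tn),
             "FN / group size":            fn / group_size,
             "False Negative Rate (FNR)":  fn / (fn + tp),
             "False Omission Rate (FOR)":  fn / (fn + tn),
             "Recall (TPR)":               tp / (tp + fn),
         }
         for name, value in metrics.items():
             print(f"{name}: {value:.2f}")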
  15. Failure Analysis Depends on Good Data (Accountability, Technical Robustness and Safety)
      [Same pipeline diagram as slide 5: failure, fairness, and impact analysis all sit downstream of the training, scoring, and feedback data.]
  16. "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Kumar Paritosh, and Lora Mois Aroyo (2021)
  17. Technical Pillars of Trustworthy AI (section divider; same content as slide 8)
  18. Fairness Analysis (Fairness, Prevention of Harm)
      1. Focus on the cell where the most harm occurs.
      2. Compare performance for underrepresented and/or unprivileged groups.
      [The fraud-screening confusion matrix from slide 11 is repeated for Groups A, B, and C.]
  19. Aequitas Fairness Tree (Fairness, Prevention of Harm)
      The same tree as slide 12, with each leaf metric compared across groups:
      • Punitive interventions: FP/GS Parity, FDR Parity, FPR Parity
      • Assistive interventions: FN/GS Parity, FOR Parity, FNR Parity, Recall Parity
      Fairness Tree: Data Science and Public Policy, Carnegie Mellon University
      http://www.datasciencepublicpolicy.org/our-work/tools-guides/aequitas/
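      A small sketch of one parity check: compute the False Negative Rate per group and compare each group to a reference group. The data, column names, and choice of group "A" as the reference are hypothetical:

         import pandas as pd

         df = pd.DataFrame({
             "group":  ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
             "actual": [1, 1, 0, 1, 1, 0, 1, 0, 1],
             "pred":   [1, 0, 0, 0, 0, 0, 1, 0, 1],
         })

         # False Negative Rate per group: share of actual positives the model missed.
         positives = df[df.actual == 1]
         fnr_by_group = positives.groupby("group")["pred"].apply(lambda p: (p == 0).mean())

         # Parity: each group's FNR relative to the reference group; values far from 1.0
         # signal that model failures are not the same for everyone.
         parity = fnr_by_group / fnr_by_group["A"]
         print(pd.DataFrame({"FNR": fnr_by_group, "FNR parity vs A": parity}))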
  20. Technical Pillars of Trustworthy AI (section divider; same content as slide 8)
  21. Failure Monitoring (Human Agency and Oversight, Prevention of Harm)
      [Same pipeline diagram as slide 5, highlighting the feedback from operators and from people affected by decisions, feeding Failure, Fairness, and Impact Analysis.]
  22. How Do We Know the Model Is Failing? (Human Agency and Oversight, Prevention of Harm)
      • What pipelines exist for people to give feedback on model performance?
        - Experts/operators who are using the models
        - People who are affected by the model
      • How do we automate monitoring of the most critical model performance metrics?
      • What outside data is available as a check against our assumptions about the model?
      • There are no great tools for checking failures.
        - Cloud providers do offer some tools if you are using their cloud (e.g., AWS, Azure, and Google).
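      Automating a single critical metric can be as simple as recomputing it on each batch of operator feedback and alerting when it drops below a threshold. A sketch with a hypothetical feedback table and an arbitrary 0.85 recall threshold:

         import pandas as pd

         feedback = pd.DataFrame({
             "week":      ["2022-08-01", "2022-08-01", "2022-08-08", "2022-08-08", "2022-08-08"],
             "actual":    [1, 1, 1, 1, 1],
             "predicted": [1, 1, 1, 0, 0],
         })

         # Weekly recall on cases where operators confirmed the true outcome.
         weekly_recall = feedback[feedback.actual == 1].groupby("week")["predicted"].mean()

         ALERT_THRESHOLD = 0.85  # hypothetical; set during pre-deployment failure analysis
         for week, recall in weekly_recall.items():
             if recall < ALERT_THRESHOLD:
                 print(f"ALERT: recall {recall:.2f} in week {week} is below {ALERT_THRESHOLD}")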
  23. Lowest-Hanging Fruit: Automate All Data Pipelines (Human Agency and Oversight, Prevention of Harm)
      • dbt: runs the code pipeline and data checks; built-in tests plus SQL-based user-defined tests; SQL-based, with open-source dbt Core and a subscription-based cloud option.
      • Soda and SodaCL: data checks only; built-in tests plus SQL-based user-defined tests; SQL-based, with open-source Soda Core and subscription-based Soda Cloud.
      • great-expectations: data checks only; built-in tests for Python; Python-based.
      • deequ: data checks only; built-in tests plus Spark/PySpark-based user-defined tests; Spark/PySpark-based.
  24. Hardening Pipelines: Obvious Tests for Tabular Data (Human Agency and Oversight, Prevention of Harm)
      • Uniqueness: "This column/combination of columns should be unique by row."
      • Correctness: "Only these values are allowed in this column."
      • Missingness: "These columns should be populated for X% of rows."
      • Range: "Nothing bigger/smaller than [a, b] should be in this column."
      • … You get the picture.
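      A sketch of the four checks using great-expectations' classic pandas API (the entry point and expectation names vary by version; the table, column names, and thresholds here are hypothetical):

         import pandas as pd
         import great_expectations as ge

         raw = pd.DataFrame({
             "claim_id": [1, 2, 3],
             "state":    ["CA", "NY", "CA"],
             "amount":   [120.0, 85.5, None],
         })
         df = ge.from_pandas(raw)

         df.expect_column_values_to_be_unique("claim_id")                    # uniqueness
         df.expect_column_values_to_be_in_set("state", ["CA", "NY", "TX"])   # correctness
         df.expect_column_values_to_not_be_null("amount", mostly=0.99)       # missingness (>= 99% populated)
         df.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)  # range

         print(df.validate())  # summary of which expectations passed or failed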
  25. Hardening Pipelines: Less Obvious Tests for Tabular Data (Human Agency and Oversight, Prevention of Harm)
      • Feature drift: Are the distributions of inputs changing?
      • Model drift: Are the model predictions changing?
      • Kolmogorov-Smirnov test: What is the probability of observing the data we see today (or something weirder), compared to what we think the data should look like?
        - A p-value of 0.05 means this test alarms 5% of the time when all is normal. Use the False Discovery Rate to find true errors.
      • KL Divergence (Population Stability Index)
        - Sensitive to the bins you pick.
      • These tests are sensitive to outliers, and outliers happen all the time.
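      A sketch of both drift checks on one feature. The reference and current samples are synthetic, and the bin count and alert thresholds are arbitrary choices:

         import numpy as np
         from scipy.stats import ks_2samp

         rng = np.random.default_rng(42)
         reference = rng.normal(0.0, 1.0, 5_000)  # feature at training time
         current = rng.normal(0.3, 1.0, 5_000)    # feature in today's scoring data (shifted)

         # Kolmogorov-Smirnov: a small p-value suggests today's data no longer matches the reference.
         stat, p_value = ks_2samp(reference, current)
         print(f"KS statistic={stat:.3f}, p-value={p_value:.4f}")

         # Population Stability Index over fixed bins (the result depends on the bins you pick).
         bins = np.histogram_bin_edges(reference, bins=10)
         ref_pct = np.histogram(reference, bins=bins)[0] / len(reference)
         cur_pct = np.histogram(current, bins=bins)[0] / len(current)
         eps = 1e-6  # guard against log(0) in empty bins
         psi = np.sum((cur_pct - ref_pct) * np.log((cur_pct + eps) / (ref_pct + eps)))
         print(f"PSI={psi:.3f}  (values above ~0.2 are commonly treated as meaningful drift)")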
  26. Data Pipelines Hardened? Automate the Workflow (Human Agency and Oversight, Prevention of Harm)
      • Model-card-toolkit: open-source system for creating model cards; Python-based.
      • Metaflow: runs the code pipeline and data checks; developed specifically for data science; Python-based.
      • deepchecks: data checks and performance checks for the full model pipeline; Python-based.
      • Luigi: full-featured; lets you automate all of your scripts for everything; Python-based.
      • Airflow: like Luigi, but automates some of the more tedious parts; Python-based.
      DAG (Directed Acyclic Graph): a collection of tasks and their dependencies. Directed: each task that requires output from previous tasks knows its own dependencies. Acyclic: there is no point where a task depends on output from a task that cannot be performed before the current task.
      Model Card: a simplified explanation of a model's inputs, outputs, and assumptions.
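      For illustration, a minimal Airflow DAG that wires data checks ahead of scoring; the dag_id, task names, and check/score functions are placeholders, and the scheduling arguments follow the Airflow 2.x style (details vary by version):

         from datetime import datetime
         from airflow import DAG
         from airflow.operators.python import PythonOperator

         def run_data_checks():
             print("uniqueness / missingness / range / drift checks go here")

         def score_model():
             print("scoring runs only if the data checks succeed")

         with DAG(
             dag_id="score_with_data_checks",   # hypothetical pipeline name
             start_date=datetime(2022, 1, 1),
             schedule_interval="@daily",
             catchup=False,
         ) as dag:
             checks = PythonOperator(task_id="data_checks", python_callable=run_data_checks)
             scoring = PythonOperator(task_id="score_model", python_callable=score_model)
             checks >> scoring  # directed edge: scoring depends on the checks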
  27. Technical Pillars of Trustworthy AI (section divider; same content as slide 8)
  28. Impact Analysis: AI That Works for Everyone (Social and Environmental Well-Being, Privacy and Data Governance)
      • The least technical part of AI Ethics, yet arguably the part of AI Ethics that most needs technical assistance.
      • Part of the initial project plan.
      • Local impacts: this model's impact on its stakeholders.
      • Social impacts: how does this model contribute to AI's larger issues?
      • Mitigation analysis: what can we do within the scope of this project to mitigate negative impacts?
  29. Local Impact of an AI Model (Social and Environmental Well-Being, Privacy and Data Governance)
      • Does this model improve working conditions for the people who use it?
        e.g., An AI model that requires a lot of data input from nurses and doctors may increase their job responsibilities without compensating or rewarding them for the extra effort.
      • Does this model improve outcomes for people affected by the model?
        e.g., A fraud detection model may speed payment for most individuals.
      • Does this model make things worse for some individuals?
        e.g., A fraud detection model may speed payment for most individuals and slow payment for others to an unacceptable level.
      • Are we collecting only the data we need? Are we keeping that data safe?
        e.g., Does my word game really need my location?
  30. Social Impact of an AI Model (Social and Environmental Well-Being, Privacy and Data Governance)
      • The environmental cost of an AI model is non-negligible: https://openai.com/blog/ai-and-compute/
        - We need efficient computation, and that is a technical problem.
      • Many AI models profit from free or underpaid labor: https://www.wired.com/story/foundations-ai-riddled-errors/
        - Labeling software should be good software.
      • Large-scale adoption of AI models has other effects.
        - Never mind the trolley problem: suppose 10% of the cars on the road are self-driving, and suppose there's a network outage during a heavy traffic period.
  31. AI Ethics and Model Development
      • Pre-Development
        - Impact Analysis: Who will use the model, and how?
        - Failure Analysis: What is the most impactful failure? What is an acceptable level of failure?
        - Fairness Analysis: What are the underrepresented/unprivileged groups?
        - Failure Monitoring: What development is needed for human-to-model feedback?
      • Model Development
        - Design and hardening of data pipelines, including privacy.
        - The model's ability to meet failure thresholds.
      • Deployment
        - Does the model meet the criteria set during pre-development?
        - Are the requirements in place?
  32. Ethical AI is Good AI and Good AI is Ethical AI
      • Ethical AI knows when it fails and the impact of those failures.
      • Ethical AI fails in the same way for everyone.
      • Ethical AI is monitored for failures and has strong feedback loops that surface problems quickly.
      • Ethical AI is designed for positive impact on the communities where it is implemented and for society as a whole.
      Who doesn't want that?
      Ellis-Lee, Mia (2018). "Accessible Design is Good Design & Good Design is Accessible Design." Flywheel hosted blog. https://www.flywheelstrategic.com/thinking/post/flywheel-blog/2018/04/06/accessible-design-is-good-design-good-design-is-accessible-design

Editor's Notes

  • Deloitte's Trustworthy AI Framework: https://www2.deloitte.com/us/en/pages/deloitte-analytics/solutions/ethics-of-ai-framework.html, https://www.technologyreview.com/2020/03/25/950291/trustworthy-ai-is-a-framework-to-help-manage-unique-risk/
    US ai.gov: https://www.ai.gov/strategic-pillars/advancing-trustworthy-ai/
    OECD Publishing (2021) “Trustworthy AI: A Framework to Compare Implementation Tools for Trustworthy AI Systems”. https://www.oecd.org/science/tools-for-trustworthy-ai-008232ec-en.htm

  • Fang, Huanming, Hui Miao (2020) “Introducing the Model Card Toolkit for Easier Model Transparency and Reporting.” Google AI Blog. https://ai.googleblog.com/2020/07/introducing-model-card-toolkit-for.html
    Tagliabue, J., Tuulos, V., Greco, C. and Dave, V., 2021. DAG Card is the new Model Card. arXiv preprint arXiv:2110.13601. https://arxiv.org/pdf/2110.13601.pdf
  • “Where State Farm Sees ‘a Lot of Fraud,’ Black Customers See Discrimination” https://www.nytimes.com/2022/03/18/business/state-farm-fraud-black-customers.html
    “Aiming for truth, fairness, and equity in your company’s use of AI” https://www.ftc.gov/business-guidance/blog/2021/04/aiming-truth-fairness-equity-your-companys-use-ai
    “Weighing Big Tech’s Promise to Black America” https://www.wired.com/story/big-techs-promise-to-black-america/
  • Self-driving cars will make you forget how to drive: Javadi, AH., Emo, B., Howard, L. et al. Hippocampal and prefrontal processing of network topology to simulate the future. Nat Commun 8, 14652 (2017). https://doi.org/10.1038/ncomms14652
