
Analyzing Bias in Data - IRE 2019


My slides for the talk "Unpacking AI’s influence in your community" at the Investigative Reporters and Editors conference, June 2019, Houston



  1. Analyzing bias in data. Jonathan Stray, Columbia Journalism School. IRE 2019
  2. Institute for the Future’s “unintended harms of technology”
  3. Part I: Quantitative Fairness
  4. What does this mean?
  5. What would “fair” mean here? Same number of white/minority drivers ticketed? White/minority drivers ticketed in same ratio as local resident demographics? White/minority drivers ticketed in same ratio as local driver demographics? White/minority drivers ticketed for driving at the same speeds?
  6. Legal concept: “similarly situated.” Similarly situated. Alike in all relevant ways for purposes of a particular decision or issue. This term is often used in discrimination cases, in which the plaintiff may seek to show that he or she was treated differently from others who are similarly situated except for the alleged basis of discrimination. For example, a plaintiff who claims that she was not promoted because she is a woman would seek to show that similarly situated men -- that is, men with similar qualifications, experience, and tenure with the company -- were promoted. Wex’s law dictionary, Legal Information Institute, Cornell
  7. Florida sentencing analysis adjusted for “points.” Bias on the Bench, Michael Braga, Herald-Tribune
  8. Containing 1.4 million entries, the DOC database notes the exact number of points assigned to defendants convicted of felonies. The points are based on the nature and severity of the crime committed, as well as other factors such as past criminal history, use of a weapon and whether anyone got hurt. The more points a defendant gets, the longer the minimum sentence required by law. Florida legislators created the point system to ensure defendants committing the same crime are treated equally by judges. But that is not what happens. … The Herald-Tribune established this by grouping defendants who committed the same crimes according to the points they scored at sentencing. Anyone who scored from 30 to 30.9 would go into one group, while anyone who scored from 31 to 31.9 would go in another, and so on. We then evaluated how judges sentenced black and white defendants within each point range, assigning a weighted average based on the sentencing gap. If a judge wound up with a weighted average of 45 percent, it meant that judge sentenced black defendants to 45 percent more time behind bars than white defendants. Bias on the Bench: How We Did It, Michael Braga, Herald-Tribune
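The bucket-and-weight procedure described above can be sketched in a few lines of Python. The data, field names, and exact gap formula here are invented for illustration; the Herald-Tribune's published methodology is the authority on the details.

```python
from collections import defaultdict
from statistics import mean

def sentencing_gap(cases):
    """cases: list of (points, race, months_sentenced) tuples.
    Returns a weighted average of per-bucket black/white sentencing gaps."""
    # Group defendants into one-point score ranges (30-30.9, 31-31.9, ...).
    buckets = defaultdict(lambda: defaultdict(list))
    for points, race, months in cases:
        buckets[int(points)][race].append(months)
    gaps, weights = [], []
    for sentences in buckets.values():
        # Only buckets containing both groups allow a comparison.
        if sentences["black"] and sentences["white"]:
            white, black = mean(sentences["white"]), mean(sentences["black"])
            gaps.append((black - white) / white)  # relative gap in this bucket
            weights.append(len(sentences["black"]) + len(sentences["white"]))
    return sum(g * w for g, w in zip(gaps, weights)) / sum(weights)

cases = [(30.5, "white", 12), (30.2, "black", 18),
         (31.1, "white", 24), (31.8, "black", 24)]
print(sentencing_gap(cases))  # 0.25: a 50% gap in one bucket, none in the other
```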
  9. For a brief period, Massachusetts recorded “warnings” as well as tickets, allowing us to directly compare who got off easy and who didn’t.
  10. Calibration. The idea: a prediction means the same thing for each group. Same percentage of re-arrest among black and white defendants who were scored as high risk. Same percentage of equally qualified men and women hired. Whether you will get a loan depends only on your probability of repayment. Mathematically: Equal positive predictive value (“precision”) for each group. A classifier with this property: most standard machine learning algorithms. Drawbacks: Disparate impacts may exacerbate existing disparities. Error rates may differ between groups in unfair ways. Legal principle: similarly situated. Moral principle: equality of opportunity.
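A calibration check of this kind is straightforward to express in code. This is a minimal sketch with invented toy data, not any outlet's actual analysis:

```python
# Does a "high risk" label have the same positive predictive value
# in each group? 1 = predicted high risk / actually re-arrested, 0 = not.

def positive_predictive_value(predictions, outcomes):
    """Among cases predicted positive, the share that actually were positive."""
    hits = [o for p, o in zip(predictions, outcomes) if p == 1]
    return sum(hits) / len(hits)

ppv_a = positive_predictive_value([1, 1, 1, 0, 0], [1, 1, 0, 0, 0])  # 2 of 3
ppv_b = positive_predictive_value([1, 1, 0, 0, 0], [1, 1, 0, 1, 0])  # 2 of 2
print(round(ppv_a, 2), round(ppv_b, 2))  # calibrated only if roughly equal
```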
  11. Legal concept: “disparate impact.” D. Adverse impact and the "four-fifths rule." A selection rate for any race, sex, or ethnic group which is less than four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate will generally be regarded by the Federal enforcement agencies as evidence of adverse impact, while a greater than four-fifths rate will generally not be regarded by Federal enforcement agencies as evidence of adverse impact. 29 CFR § 1607.4 Uniform Guidelines on Employee Selection Procedures, Information on impact
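The four-fifths test quoted above reduces to a one-line comparison against the highest group's selection rate. A sketch with hypothetical rates:

```python
# Flag groups whose selection rate falls below 4/5 of the top group's rate.

def four_fifths_flags(selection_rates):
    """selection_rates: {group: selected/applicants}. True = possible adverse impact."""
    top = max(selection_rates.values())
    return {group: (rate / top) < 0.8 for group, rate in selection_rates.items()}

print(four_fifths_flags({"group_x": 0.30, "group_y": 0.20}))
# group_y's ratio is 0.20/0.30 ≈ 0.67, below four-fifths, so it is flagged
```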
  12. Demographic Parity. The idea: the prediction should not depend on the group. Same percentage of black and white defendants scored as high risk. Same percentage of men and women hired. Same percentage of rich and poor students admitted. Mathematically: Equal rate of true/false prediction for all groups. A classifier with this property: choose the 10 best scoring applicants in each group. Drawbacks: Doesn’t measure who we accept, as long as we accept equal numbers in each group. The “perfect” predictor, which always guesses correctly, is considered unfair if the base rates are different. Legal principle: disparate impact. Moral principle: equality of outcome.
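The "choose the best in each group" classifier mentioned on the slide can be sketched as follows (scores invented); it satisfies demographic parity by construction, whatever the score distributions look like:

```python
# Select the top k scorers within each group, guaranteeing equal counts
# per group regardless of how scores compare across groups.

def select_top_k_per_group(candidates, k):
    """candidates: list of (group, score) pairs. Keep the top k scores per group."""
    by_group = {}
    for group, score in candidates:
        by_group.setdefault(group, []).append(score)
    selected = []
    for group, scores in by_group.items():
        for score in sorted(scores, reverse=True)[:k]:
            selected.append((group, score))
    return selected

pool = [("a", 90), ("a", 80), ("a", 70), ("b", 60), ("b", 50), ("b", 40)]
print(select_top_k_per_group(pool, 2))
# [('a', 90), ('a', 80), ('b', 60), ('b', 50)] -- two from each group,
# even though every selected "b" scores below every rejected "a".
```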
  13. ProPublica argument: fairness as error rates
  14. Equal error rates. The idea: Don’t let a classifier make most of its mistakes on one group. Same percentage of black and white defendants who are not re-arrested are scored as high risk. Same percentage of qualified men and women mistakenly turned down. If you would have repaid a loan, you will be turned down at the same rate regardless of your income. Mathematically: Equal false positive rate, true positive rate between groups. A classifier with this property: use different thresholds for each group. Drawbacks: Classifier must use group membership explicitly. Calibration is not possible (the same score will mean different things for different groups). Legal principle: disparate treatment. Moral principle: equality of opportunity.
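The per-group-threshold idea on this slide can be demonstrated with a short sketch. Scores and outcomes are invented:

```python
# One shared threshold yields unequal false positive rates between groups;
# group-specific thresholds can equalize them.

def false_positive_rate(scores, outcomes, threshold):
    """Share of actual negatives (outcome 0) flagged at this threshold."""
    flagged = [s >= threshold for s, o in zip(scores, outcomes) if o == 0]
    return sum(flagged) / len(flagged)

scores_a, outcomes_a = [9, 7, 5, 3], [1, 0, 0, 0]
scores_b, outcomes_b = [8, 6, 4, 2], [1, 1, 0, 0]

# Same threshold for both groups -> different error rates:
print(false_positive_rate(scores_a, outcomes_a, 5))  # 2 of 3 negatives flagged
print(false_positive_rate(scores_b, outcomes_b, 5))  # 0 of 2 negatives flagged
# Raising group A's threshold to 8 brings its rate down to match:
print(false_positive_rate(scores_a, outcomes_a, 8))  # 0.0
```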
  15. Part II: Fairness in the Real World
  16. Image by Craig Froehle
  17. Impossibility theorem. With different base rates, calibration, demographic parity, and error-rate fairness are mutually exclusive. This can be proved with a little arithmetic, but the intuition is: - Can’t have demographic parity and calibration if different groups have different qualifications. - If risk really predicts outcome (calibration), then one group will have higher risk scores, which means more positives and therefore more false positives.
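One way to make the arithmetic concrete (not shown on the slide) is an identity from Chouldechova's 2017 analysis of this impossibility result: FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR), where p is a group's base rate. Holding PPV (calibration) and the miss rate FNR equal across groups, different base rates force different false positive rates. The numbers below are invented:

```python
# Chouldechova's identity: FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR).
# Hold PPV and FNR equal across groups and vary only the base rate p.

def implied_fpr(base_rate, ppv, fnr):
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

ppv, fnr = 0.7, 0.2  # identical "calibration" and miss rate for both groups
print(round(implied_fpr(0.5, ppv, fnr), 3))  # base rate 50% -> FPR 0.343
print(round(implied_fpr(0.2, ppv, fnr), 3))  # base rate 20% -> FPR 0.086
```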
  18. False Positive Rate can be gamed. A second misconception is that the false positive rate is a reasonable proxy of a group’s aggregate well-being, loosely defined. … Suppose, hypothetically, that prosecutors start enforcing low-level drug crimes that disproportionately involve black individuals, a policy that arguably hurts the black community. Further suppose that the newly arrested individuals have low risk of violent recidivism, and thus are released pending trial. … As a result, the false positive rate for blacks would decrease. To see this, recall that the numerator of the false positive rate (the number of detained defendants who do not reoffend) remains unchanged while the denominator (the number of defendants who do not reoffend) increases. Corbett-Davies and Goel, The Measure and Mismeasure of Fairness, 2018
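The hypothetical above is easy to verify in numbers. All figures here are invented:

```python
# New low-risk arrestees who are released and never reoffend enlarge the
# denominator of the false positive rate, so the rate drops even though
# no one's detention decision changed.

detained_not_reoffending = 20   # numerator: detained people who don't reoffend
not_reoffending_total = 100     # denominator: everyone who doesn't reoffend

before = detained_not_reoffending / not_reoffending_total
after = detained_not_reoffending / (not_reoffending_total + 50)  # 50 new arrests
print(before, round(after, 3))  # 0.2 vs 0.133: "fairer" with no real change
```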
  19. Real-world results from Virginia. Megan Stevenson, Assessing Risk Assessment in Action, 2018
  20. Algorithmic output may be ignored anyway. First, it is still unclear whether risk-assessment tools actually have a great impact on the daily proceedings in courtrooms. During my days of observation, I found that risk-assessment tools are often actively resisted in criminal courts. Most judges and prosecutors do not trust the algorithms. They do not know the companies they come from, they do not understand their methods, and they often find them useless. Consequently, risk-assessment tools often go unused: social workers complete the software programs’ questionnaires, print out the score sheets, add them to the defendants’ files… after which the scores seem to disappear and are rarely mentioned during hearings or plea bargaining negotiations. Angèle Christin, The Mistrials of Algorithmic Sentencing
  21. Challenges to determining fairness through data: Groups never differ by just race / gender / class alone. There are several plausible definitions of “fair,” and they are both controversial and mutually exclusive. Every analysis method has potential false negatives and false positives. Causality is a particular problem. Humans may follow or ignore algorithmic recommendations.
  22. Part III: Reframing the Problem
  23. Fairness by Comparison. When considering an algorithmic system, what do you compare it to? Absolute fairness – We don’t have perfect prediction or perfect data, and there may not be agreement over which definition of fairness to use. As fair as possible given the data – It may be possible to achieve this, given a particular definition of fairness, if we understand very well what the limitations of the input data are. An improvement over current processes and human decision-makers – It’s possible to evaluate existing institutions by the same standards as algorithms, and the results do not always favor humans. An improvement over other possible reforms – If the humans are biased and the algorithms are biased, is there some other approach?
  24. Prediction in an Unequal World. Sandra Mayson, Bias In, Bias Out
  25. Proposed algorithmic fairness legislation doesn’t define “fairness.” (2) AUTOMATED DECISION SYSTEM IMPACT ASSESSMENT. The term ‘‘automated decision system impact assessment’’ means a study evaluating an automated decision system and the automated decision system’s development process, including the design and training data of the automated decision system, for impacts on accuracy, fairness, bias, discrimination, privacy, and security. Algorithmic Accountability Act of 2019 (proposed)
  26. Resources: Bias In, Bias Out, Sandra Mayson; Assessing Risk Assessment in Action, Megan Stevenson; Open Policing Project – Findings; Open Policing Project – Workbench Tutorial; 21 Definitions of Fairness and Their Politics, Arvind Narayanan