
Frontiers of Computational Journalism week 6 - Quantitative Fairness

Taught at Columbia Journalism School, Fall 2018
Full syllabus and lecture videos at http://www.compjournalism.com/?p=218


  1. 1. Frontiers of Computational Journalism Columbia Journalism School Week 6: Quantitative Fairness October 17, 2018
  2. 2. This class • Experimental vs. Observational bias measurements • Fairness Criterion in “Machine Bias” • Quantitative Fairness and Impossibility Theorems • How are algorithmic results used? • Data quality • Examples: lending, child maltreatment screening
  3. 3. Experimental and Observational Analysis of Bias
  4. 4. Women in Academic Science: A Changing Landscape, Ceci et al.
  5. 5. Simpson’s paradox Sex Bias in Graduate Admissions: Data from Berkeley Bickel, Hammel and O'Connell, 1975
  6. 6. Florida sentencing analysis adjusted for “points” Bias on the Bench, Michael Braga, Herald Tribune
  7. 7. Investors prefer entrepreneurial ventures pitched by attractive men, Brooks et al., 2014
  8. 8. Swiss judges: a natural experiment 24 judges of the Swiss Federal Administrative Court are randomly assigned to cases. They rule at different rates on migrant deportation cases. Here are their deportation rates broken down by party. Barnaby Skinner and Simone Rau, Tages-Anzeiger. https://github.com/barjacks/swiss-asylum-judges
  9. 9. Containing 1.4 million entries, the DOC database notes the exact number of points assigned to defendants convicted of felonies. The points are based on the nature and severity of the crime committed, as well as other factors such as past criminal history, use of a weapon and whether anyone got hurt. The more points a defendant gets, the longer the minimum sentence required by law. Florida legislators created the point system to ensure defendants committing the same crime are treated equally by judges. But that is not what happens. … The Herald-Tribune established this by grouping defendants who committed the same crimes according to the points they scored at sentencing. Anyone who scored from 30 to 30.9 would go into one group, while anyone who scored from 31 to 31.9 would go in another, and so on. We then evaluated how judges sentenced black and white defendants within each point range, assigning a weighted average based on the sentencing gap. If a judge wound up with a weighted average of 45 percent, it meant that judge sentenced black defendants to 45 percent more time behind bars than white defendants. Bias on the Bench: How We Did It, Michael Braga, Herald Tribune
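A minimal sketch of that grouping-and-weighting step, assuming a hypothetical defendant-level table with columns judge, points, race and sentence_days; the Herald-Tribune's exact weighting choices may differ:

```python
import pandas as pd

def sentencing_gap(defendants: pd.DataFrame) -> float:
    """Weighted-average sentencing gap (percent) for one judge: black vs. white
    defendants compared within one-point sentencing-score buckets."""
    df = defendants.assign(bucket=defendants["points"].astype(int))  # 30.0-30.9 -> 30, etc.
    gaps, weights = [], []
    for _, grp in df.groupby("bucket"):
        black = grp.loc[grp["race"] == "black", "sentence_days"].mean()
        white = grp.loc[grp["race"] == "white", "sentence_days"].mean()
        if pd.notna(black) and pd.notna(white) and white > 0:
            gaps.append((black - white) / white)  # relative gap within this bucket
            weights.append(len(grp))              # weight buckets by defendant count
    return 100 * sum(g * w for g, w in zip(gaps, weights)) / sum(weights)

# A weighted average of 45 for a judge would mean black defendants received about
# 45 percent more time than white defendants with the same point scores:
# gap_by_judge = sentencing.groupby("judge").apply(sentencing_gap)
```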
  10. 10. Unadjusted disciplinary rates The Scourge of Racial Bias in New York State’s Prisons, NY Times
  11. 11. Limited data for adjustment In most prisons, blacks and Latinos were disciplined at higher rates than whites — in some cases twice as often, the analysis found. They were also sent to solitary confinement more frequently and for longer durations. At Clinton, a prison near the Canadian border where only one of the 998 guards is African-American, black inmates were nearly four times as likely to be sent to isolation as whites, and they were held there for an average of 125 days, compared with 90 days for whites. A greater share of black inmates are in prison for violent offenses, and minority inmates are disproportionately younger, factors that could explain why an inmate would be more likely to break prison rules, state officials said. But even after accounting for these elements, the disparities in discipline persisted, The Times found. The disparities were often greatest for infractions that gave discretion to officers, like disobeying a direct order. In these cases, the officer has a high degree of latitude to determine whether a rule is broken and does not need to produce physical evidence. The disparities were often smaller, according to the Times analysis, for violations that required physical evidence, like possession of contraband. The Scourge of Racial Bias in New York State’s Prisons, NY Times
  12. 12. Comparing more subjective offenses The Scourge of Racial Bias in New York State’s Prisons, NY Times
  13. 13. Why algorithmic decisions?
  14. 14. Human Decisions and Machine Predictions, Kleinberg et al., 2017
  15. 15. From The Meta-Analysis of Clinical Judgment Project: Fifty-Six Years of Accumulated Research on Clinical Versus Statistical Prediction, Ægisdóttir et al.
  16. 16. Fairness Criterion in “Machine Bias”
  17. 17. Stephanie Wykstra, personal communication
  18. 18. ProPublica argument False positive rate P(high risk | black, no arrest) = C/(C+A) = 0.45 P(high risk | white, no arrest) = G/(G+E) = 0.23 False negative rate P(low risk | black, arrested) = B/(B+D) = 0.28 P(low risk | white, arrested) = F/(F+H) = 0.48 Northpointe response Positive predictive value P(arrest | black, high risk) = D/(C+D) = 0.63 P(arrest | white, high risk) = H/(G+H) = 0.59
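To make the two sides' numbers concrete, here is a small helper (a sketch, not ProPublica's actual code) that computes all three quantities from per-group confusion-matrix counts. With the cell labels used above, the black group corresponds to tn=A, fp=C, fn=B, tp=D and the white group to tn=E, fp=G, fn=F, tp=H:

```python
def rates(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Error and calibration rates for one group from confusion-matrix counts."""
    return {
        "FPR": fp / (fp + tn),  # P(high risk | not re-arrested) -- ProPublica's focus
        "FNR": fn / (fn + tp),  # P(low risk | re-arrested)      -- ProPublica's focus
        "PPV": tp / (tp + fp),  # P(re-arrested | high risk)     -- Northpointe's focus
    }

# e.g. rates(tn=A, fp=C, fn=B, tp=D) for black defendants,
#      rates(tn=E, fp=G, fn=F, tp=H) for white defendants
```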
  19. 19. P(outcome | score) is fair Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, Chouldechova
  20. 20. How We Analyzed the COMPAS Recidivism Algorithm, ProPublica Or, as ProPublica put it
  21. 21. Equal FPR between groups implies unequal PPV Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, Chouldechova
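One way to see why: within each group, PPV, FPR and FNR are tied together by that group's base rate p = P(arrest), since PPV = (1 − FNR)·p / [(1 − FNR)·p + FPR·(1 − p)]. If two groups share the same FPR and FNR but have different base rates, their PPVs must differ; conversely, holding PPV equal across groups with different base rates forces the error rates apart. This is the relation behind Chouldechova's result.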
  22. 22. When the base rates differ by protected group and when there is not separation, one cannot have both conditional use accuracy equality and equality in the false negative and false positive rates. … The goal of complete race or gender neutrality is unachievable. … Altering a risk algorithm to improve matters can lead to difficult stakeholder choices. If it is essential to have conditional use accuracy equality, the algorithm will produce different false positive and false negative rates across the protected group categories. Conversely, if it is essential to have the same rates of false positives and false negatives across protected group categories, the algorithm cannot produce conditional use accuracy equality. Stakeholders will have to settle for an increase in one for a decrease in the other. Fairness in Criminal Justice Risk Assessments: The State of the Art, Berk et al. Impossibility theorem
  23. 23. Quantitative Fairness and Impossibility
  24. 24. Notation for fairness properties Observable features of each case are a vector X. The class or group membership of each case is A. The model outputs a numeric “score” R = r(X,A) ∊ [0,1]. We turn the score into a binary classification C by thresholding at t: C = (R > t). The true outcome (the quantity being predicted) is the binary variable Y. A perfect predictor would have C = Y
  25. 25. Shira Mitchell and Jackie Shadlin, https://shiraamitchell.github.io/fairness
  26. 26. Fundamental fairness criteria “Independence” or “demographic parity”: The classifier predicts the same number of people in each group. C independent of A. “Sufficiency” or “calibration”: When the classifier predicts true, all groups have the same probability of having a true outcome. Y independent of A conditional on C. “Separation” or “equal error rates”: The classifier has the same FPR / TPR for each group. C independent of A conditional on Y. Barocas and Hardt, NIPS 2017 tutorial
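A quick way to see what each criterion demands is to compute the corresponding per-group conditional rates on held-out data. A sketch, assuming arrays C (binary predictions), Y (true outcomes) and A (group labels) as in the notation slide:

```python
import numpy as np

def fairness_report(C: np.ndarray, Y: np.ndarray, A: np.ndarray) -> dict:
    """Per-group rates behind the three criteria: equality across groups of the
    first is independence, the second is sufficiency, the last two are separation."""
    report = {}
    for a in np.unique(A):
        m = A == a
        report[a] = {
            "P(C=1)":     C[m].mean(),             # independence / demographic parity
            "P(Y=1|C=1)": Y[m & (C == 1)].mean(),  # sufficiency (positive predictive value)
            "P(C=1|Y=1)": C[m & (Y == 1)].mean(),  # separation (true positive rate)
            "P(C=1|Y=0)": C[m & (Y == 0)].mean(),  # separation (false positive rate)
        }
    return report
```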
  27. 27. “Independence” or “demographic parity” The idea: the prediction should not depend on the group. Same percentage of black and white defendants scored as high risk. Same percentage of men and women hired. Same percentage of rich and poor students admitted. Mathematically: C⊥A For all groups a,b we have Pa{C=1} = Pb{C=1} Equal rate of true/false prediction for all groups. A classifier with this property: choose the 10 best scoring applicants in each group. Drawbacks: Doesn’t measure who we accept, as long as we accept equal numbers in each group. The “perfect” predictor, which always guesses correctly, is considered unfair if the base rates are different. Legal principle: disparate impact Moral principle: equality of outcome
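The “10 best scoring applicants in each group” rule mentioned above takes only a few lines; a sketch assuming a hypothetical applicants DataFrame with columns group and score:

```python
import pandas as pd

def top_k_per_group(applicants: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Accept the k highest-scoring applicants within each group, which makes
    the number accepted per group equal by construction."""
    return (applicants.sort_values("score", ascending=False)
                      .groupby("group", group_keys=False)
                      .head(k))
```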
  28. 28. “Sufficiency” or “Calibration” The idea: a prediction means the same thing for each group. Same percentage of re-arrest among black and white defendants who were scored as high risk. Same percentage of equally qualified men and women hired. Whether you will get a loan depends only on your probability of repayment. Mathematically: Y⊥A|R For all groups a,b we have Pa{Y=1|C=1} = Pb{Y=1|C=1} Equal positive predictive value (Precision) for each group. A classifier with this property: most standard machine learning algorithms. Drawbacks: Disparate impacts may exacerbate existing disparities. Error rates may differ between groups in unfair ways. Legal principle: disparate treatment Moral principle: equality of opportunity
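In practice calibration is usually checked on the score R rather than the thresholded prediction: bin the scores and compare observed outcome rates per bin across groups. A sketch, again assuming arrays R, Y, A from the notation slide:

```python
import numpy as np

def calibration_by_group(R: np.ndarray, Y: np.ndarray, A: np.ndarray,
                         bins: int = 10) -> dict:
    """Observed outcome rate P(Y=1 | score bin) for each group; roughly equal
    curves across groups is what sufficiency/calibration asks for."""
    edges = np.linspace(0, 1, bins + 1)
    binned = np.digitize(R, edges[1:-1])  # bin index 0..bins-1 for scores in [0, 1]
    return {
        a: [Y[(A == a) & (binned == b)].mean() for b in range(bins)]  # NaN if a bin is empty
        for a in np.unique(A)
    }
```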
  29. 29. Barocas and Hardt, NIPS 2017 tutorial Why “sufficiency”?
  30. 30. “Separation” or “Equal error rates” The idea: Don’t let a classifier make most of its mistakes on one group. Same percentage of black and white defendants who are not re-arrested are scored as high risk. Same percentage of qualified men and women mistakenly turned down. If you would have repaid a loan, you will be turned down at the same rate regardless of your income. Mathematically: C⊥A|Y For all groups a,b we have Pa{C=1|Y=1} = Pb{C=1|Y=1} Equal FPR, TPR between groups. A classifier with this property: use different thresholds for each group. Drawbacks: Classifier must use group membership explicitly. Calibration is not possible (the same score will mean different things for different groups.) Legal principle: disparate treatment Moral principle: equality of opportunity
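The “different thresholds for each group” construction can be sketched by picking, for each group, the score threshold whose true positive rate hits a common target (equalizing false positive rates as well takes more care); variable names follow the notation slide:

```python
import numpy as np

def group_thresholds(R: np.ndarray, Y: np.ndarray, A: np.ndarray,
                     target_tpr: float = 0.8) -> dict:
    """Per-group thresholds chosen so each group's TPR is roughly target_tpr."""
    thresholds = {}
    for a in np.unique(A):
        pos_scores = R[(A == a) & (Y == 1)]
        # the (1 - target_tpr) quantile of this group's positive-class scores
        thresholds[a] = np.quantile(pos_scores, 1 - target_tpr)
    return thresholds

# classification then uses the group's own cutoff: C = R > thresholds[group]
```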
  31. 31. Barocas and Hardt, NIPS 2017 tutorial Why “separation”?
  32. 32. With different base rates, only one of these criteria at a time is achievable Proof from elementary properties of statistical independence, see Barocas and Hardt, NIPS 2017 tutorial Impossibility theorem
  33. 33. Even if two groups of the population admit simple classifiers, the whole population may not How Big Data is Unfair, Moritz Hardt Less/different training data for minorities
  34. 34. Data Quality
  35. 35. The black/white marijuana arrest gap, in nine charts, Dylan Matthews, Washington Post, 6/4/2013
  36. 36. Using Data to Make Sense of a Racial Disparity in NYC Marijuana Arrests, New York Times, 5/13/2018 A senior police official recently testified to the City Council that there was a simple justification — he said more people call 911 and 311 to complain about marijuana smoke in black and Hispanic neighborhoods ... Robert Gebeloff, a data journalist at The Times, transposed Census Bureau information about race, poverty levels and homeownership onto a precinct map. Then he dropped the police data into four buckets based on the percentage of a precinct’s residents who were black or Hispanic. What we found roughly aligned with the police explanation. In precincts that were more heavily black and Hispanic, the rate at which people called to complain about marijuana was generally higher. … What we discovered was that when two precincts had the same rate of marijuana calls, the one with a higher arrest rate was almost always home to more black people. The police said that had to do with violent crime rates being higher in those precincts, which commanders often react to by deploying more officers.
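The precinct-level comparison described here boils down to a merge and a group-by; a sketch with invented column names and toy numbers, not the Times' actual data:

```python
import pandas as pd

# Toy stand-ins for the two datasets joined by precinct
census = pd.DataFrame({
    "precinct": [1, 2, 3, 4],
    "pct_black_hispanic": [15, 40, 65, 90],
})
police = pd.DataFrame({
    "precinct": [1, 2, 3, 4],
    "marijuana_calls_per_100k": [20, 35, 50, 55],
    "marijuana_arrests_per_100k": [5, 15, 40, 60],
})

precincts = census.merge(police, on="precinct")
# four buckets by the share of residents who are black or Hispanic
precincts["bucket"] = pd.cut(precincts["pct_black_hispanic"], bins=[0, 25, 50, 75, 100])
print(precincts.groupby("bucket")[["marijuana_calls_per_100k",
                                   "marijuana_arrests_per_100k"]].mean())
```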
  37. 37. Risk, Race, and Recidivism: Predictive Bias and Disparate Impact, Jennifer L. Skeem, Christopher T. Lowenkamp, Criminology 54 (4) 2016 The proportion of racial disparities in crime explained by differential participation versus differential selection is hotly debated … In our view, official records of arrest—particularly for violent offenses—are a valid criterion. First, surveys of victimization yield “essentially the same racial differentials as do official statistics. For example, about 60 percent of robbery victims describe their assailants as black, and about 60 percent of those arrested for robbery are black. Victimization data also consistently show that they fit the official arrest data” (Walsh, 2004: 29). Second, self-reported offending data reveal similar race differentials, particularly for serious and violent crimes (see Piquero, 2015).
  38. 38. How are algorithmic results used?
  39. 39. How are “points” used by judges? Bias on the Bench, Michael Braga, Herald Tribune
  40. 40. Predictions put into practice: a quasi-experimental evaluation of Chicago’s predictive policing pilot, Saunders, Hunt, Hollywood, RAND, 2016 There are a number of interventions that can be directed at individual-focused predictions of gun crime because intervening with high-risk individuals is not a new concept. There is research evidence that targeting individuals who are the most criminally active can result in significant reductions in crime … Conversely, some research shows that interventions targeting individuals can sometimes backfire. As an example, some previous proactive interventions, including increased arrest of individuals perceived to be at high risk (selective apprehension) and longer incarceration periods (selective incapacitation), have led to negative social and economic unintended consequences. Auerhahn (1999) found that a selective incapacitation model generated a large number of persons falsely predicted to be high-risk offenders, although it did reasonably well at identifying those who were low risk.
  41. 41. Predictions put into practice: a quasi-experimental evaluation of Chicago’s predictive policing pilot, Saunders, Hunt, Hollywood, RAND, 2016
  42. 42. Predictions put into practice: a quasi-experimental evaluation of Chicago’s predictive policing pilot, Saunders, Hunt, Hollywood, RAND, 2016 Once other demographics, criminal history variables, and social network risk have been controlled for using propensity score weighting and doubly-robust regression modeling, being on the SSL did not significantly reduce the likelihood of being a murder or shooting victim, or being arrested for murder. Results indicate those placed on the SSL were 2.88 times more likely to be arrested for a shooting.
  43. 43. Reverse-engineering the SSL score The contradictions of Chicago Police’s secret list, Kunichoff and Sier, Chicago Magazine 2017
  44. 44. Theo Douglas, Government Technology, 2018 The Chicago Police Department (CPD) is deploying predictive and analytic tools after seeing initial results and delivering on a commitment from Mayor Rahm Emanuel, a bureau chief said recently. Last year, CPD created six Strategic Decision Support Centers (SDSCs) at police stations, essentially local nerve centers for its high-tech approach to fighting crime in areas where incidents are most prevalent. … Connecting features like predictive mapping and policing, gunshot detection, surveillance cameras and citizen tips lets police identify “areas of risk, and ties all these things together into a very consumable, very easy to use, very understandable platform,” said Lewin. “The predictive policing component … the intelligence analyst and that daily intelligence cycle, is really important along with the room itself, which I didn’t talk about,” Lewin said in an interview.
  45. 45. Machine learning in lending
  46. 46. Banking startups adopt new tools for lending, Steve Lohr, New York Times None of the new start-ups are consumer banks in the full-service sense of taking deposits. Instead, they are focused on transforming the economics of underwriting and the experience of consumer borrowing — and hope to make more loans available at lower cost for millions of Americans. … They all envision consumer finance fueled by abundant information and clever software — the tools of data science, or big data — as opposed to the traditional math of creditworthiness, which relies mainly on a person’s credit history. … The data-driven lending start-ups see opportunity. As many as 70 million Americans either have no credit score or a slender paper trail of credit history that depresses their score, according to estimates from the National Consumer Reporting Association, a trade organization. Two groups that typically have thin credit files are immigrants and recent college graduates.
  47. 47. Predictably Unequal? The Effects of Machine Learning on Credit Markets, Fuster et al
  48. 48. Predictably Unequal? The Effects of Machine Learning on Credit Markets, Fuster et al
  49. 49. Machine learning for child abuse call screening
  50. 50. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions, Chouldechova et al. Feedback loops can be a problem
  51. 51. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions, Chouldechova et al. Classifier performance
  52. 52. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions, Chouldechova et al. Designers hope to counter existing human bias
  53. 53. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions, Chouldechova et al. Algorithmic risk scores vs. human scores
