Frameworks for
Algorithmic Bias
Jonathan Stray
Columbia Journalism School
at Code for America, 3 July 2019
Institute for the Future’s “unintended harms of technology”, ethicalos.org
Our narrow setting: allocative fairness
Some reward or punishment is given out to select people,
based on an algorithmic prediction.
How is this outcome allocated between groups?
Fairness Intuitions
Protected Classes
Race (Civil Rights Act of 1964); Color (Civil Rights Act of 1964); Sex (Equal
Pay Act of 1963; Civil Rights Act of 1964); Religion (Civil Rights Act of
1964); National origin (Civil Rights Act of 1964); Citizenship (Immigration
Reform and Control Act); Age (Age Discrimination in Employment Act of
1967); Pregnancy (Pregnancy Discrimination Act); Familial status (Civil
Rights Act of 1968); Disability status (Rehabilitation Act of 1973; Americans
with Disabilities Act of 1990); Veteran status (Vietnam Era Veterans'
Readjustment Assistance Act of 1974; Uniformed Services Employment and
Reemployment Rights Act); Genetic information (Genetic Information
Nondiscrimination Act)
Fairness in Machine Learning, NIPS 2017 Tutorial
Solon Barocas and Moritz Hardt
What does this mean?
What would fair mean here?
Same number of white/minority drivers ticketed?
White/minority drivers ticketed in same ratio as local resident
demographics?
White/minority drivers ticketed in same ratio as local driver
demographics?
White/minority drivers ticketed for driving at the same speeds?
Legal concept: “similarly situated”
Similarly situated. Alike in all relevant ways for purposes of a particular
decision or issue. This term is often used in discrimination cases, in
which the plaintiff may seek to show that he or she was treated
differently from others who are similarly situated except for the alleged
basis of discrimination. For example, a plaintiff who claims that she was
not promoted because she is a woman would seek to show that similarly
situated men -- that is, men with similar qualifications, experience, and
tenure with the company -- were promoted.
Wex legal dictionary, Legal Information Institute, Cornell Law School
Florida sentencing analysis adjusted for “points”
Bias on the Bench, Michael Braga, Sarasota Herald-Tribune
Containing 1.4 million entries, the DOC database notes the exact number of points assigned to defendants
convicted of felonies. The points are based on the nature and severity of the crime committed, as well as
other factors such as past criminal history, use of a weapon and whether anyone got hurt. The more points a
defendant gets, the longer the minimum sentence required by law.
Florida legislators created the point system to ensure defendants committing the same crime are treated
equally by judges. But that is not what happens.
…
The Herald-Tribune established this by grouping defendants who committed the same crimes according to
the points they scored at sentencing. Anyone who scored from 30 to 30.9 would go into one group, while
anyone who scored from 31 to 31.9 would go in another, and so on.
We then evaluated how judges sentenced black and white defendants within each point range, assigning a
weighted average based on the sentencing gap.
If a judge wound up with a weighted average of 45 percent, it meant that judge sentenced black defendants
to 45 percent more time behind bars than white defendants.
Bias on the Bench: How We Did It, Michael Braga, Sarasota Herald-Tribune
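The bucketing-and-weighting procedure described above can be sketched in a few lines of code. This is only my reading of the published methodology, run on a hypothetical table with columns points, race, and sentence_months; the Herald-Tribune's exact weighting may differ.

```python
# Sketch of the Herald-Tribune-style comparison: bucket defendants by sentencing
# points, compare mean black vs. white sentences within each bucket, then take a
# weighted average of the gaps. Column names are hypothetical.
import pandas as pd

def weighted_sentencing_gap(cases: pd.DataFrame) -> float:
    cases = cases.copy()
    cases["bucket"] = cases["points"].astype(int)      # 30-30.9 -> 30, 31-31.9 -> 31, ...
    gaps, weights = [], []
    for _, grp in cases.groupby("bucket"):
        black = grp.loc[grp["race"] == "black", "sentence_months"]
        white = grp.loc[grp["race"] == "white", "sentence_months"]
        if black.empty or white.empty or white.mean() == 0:
            continue                                    # bucket not comparable
        gaps.append(black.mean() / white.mean() - 1)    # 0.45 means 45% more time
        weights.append(len(grp))
    return sum(g * w for g, w in zip(gaps, weights)) / sum(weights)
```

Run per judge, a result of 0.45 would correspond to the "45 percent more time behind bars" figure quoted above.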
Legal concept: “disparate impact”
D. Adverse impact and the "four-fifths rule."
A selection rate for any race, sex, or ethnic group which is less than four-
fifths (4/5) (or eighty percent) of the rate for the group with the highest
rate will generally be regarded by the Federal enforcement agencies as
evidence of adverse impact, while a greater than four-fifths rate will
generally not be regarded by Federal enforcement agencies as evidence
of adverse impact.
29 CFR § 1607.4
Uniform Guidelines on Employee Selection Procedures (“Information on impact”)
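As a worked illustration of the four-fifths screen (the counts below are hypothetical):

```python
# Adverse-impact check per 29 CFR § 1607.4: each group's selection rate divided
# by the highest group's rate; ratios under 0.8 are generally treated as
# evidence of adverse impact. Counts are hypothetical.
def adverse_impact_ratios(selected: dict, applicants: dict) -> dict:
    rates = {g: selected[g] / applicants[g] for g in applicants}
    top = max(rates.values())
    return {g: rate / top for g, rate in rates.items()}

# 48/80 = 60% vs. 12/40 = 30%; 0.30 / 0.60 = 0.5 < 0.8, so this would be flagged.
print(adverse_impact_ratios({"group_a": 48, "group_b": 12},
                            {"group_a": 80, "group_b": 40}))
```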
Image by Craig Froehle
ProPublica argument: fairness as error rates
Quantitative Fairness Formalisms
For a brief period, Massachusetts recorded “warnings” as well as tickets,
allowing more direct comparison of who got off easy and who didn’t.
Add observed decision on each person
Stephanie Wykstra
Add observed outcome for each person
Two confusion matrices
Black White
Binary classification with two groups is the simplest setting.
Can be generalized to continuous scores and more groups.
Notation for fairness properties
Observable features of each case are a vector X
The class or group membership of each case is A
Model outputs a numeric “score” R
R = r(X,A) ∊ [0,1]
We turn the score into a binary classification C by thresholding at t
C = r > t
The true outcome (this is a prediction) is the binary variable Y
A perfect predictor would have
C = Y
Shira Mitchell and Jackie Shadlen, https://shiraamitchell.github.io/fairness
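A minimal sketch of this notation on simulated data (the data-generating process is invented purely for illustration): score R, threshold t, classification C, outcome Y, group A, and one confusion matrix per group.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
A = rng.integers(0, 2, n)                           # group membership
Y = rng.binomial(1, 0.3 + 0.1 * A)                  # true outcome, base rates differ
R = np.clip(0.5 * Y + 0.5 * rng.random(n), 0, 1)    # imperfect score r(X, A) in [0, 1]
t = 0.6
C = (R > t).astype(int)                             # thresholded classification

def confusion(C, Y):
    return {"TP": int(((C == 1) & (Y == 1)).sum()),
            "FP": int(((C == 1) & (Y == 0)).sum()),
            "FN": int(((C == 0) & (Y == 1)).sum()),
            "TN": int(((C == 0) & (Y == 0)).sum())}

for a in (0, 1):                                    # two confusion matrices, as above
    print("group", a, confusion(C[A == a], Y[A == a]))
```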
“Anti-classification”
The classifier is blinded to group membership.
C independent of A conditional on X
“Independence” or “demographic parity”
The classifier predicts the same number of people in each group.
C independent of A
“Calibration” or “sufficiency”
When the classifier predicts true, all groups have the same probability of having a true outcome.
Y independent of A conditional on C
“Equal error rates,” “classification parity,” or “separation”
The classifier has the same FPR / TPR for each group.
C independent of A conditional on Y
Barocas and Hardt, NIPS 2017 tutorial
Corbett-Davies and Goel, The Measure and Mismeasure of Fairness
Fairness criteria in this talk
The idea: the prediction should not use the group as input.
“Algorithm is blinded to race”
Mathematically:
C⊥A|X
For all individuals i, j we have Xi = Xj ⇒ Ci = Cj
A classifier with this property: choose any way you like, as long as the protected attribute is blinded.
Drawbacks: Insufficient on its own, as it allows random assignments of features to decisions.
Legal principle: “presumptively suspect” to base decisions on protected attributes
Moral principle: colorblindness
Anti-classification
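A toy sketch of the property (simulated features; the decision rule is arbitrary, which is exactly the drawback noted above): the classifier never sees A, so identical feature vectors always get identical decisions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(8, 2))       # simulated features
A = rng.integers(0, 2, size=8)            # protected attribute, never used below

def blinded_classifier(X):
    # Any rule is allowed, as long as it is a function of X alone.
    return (X.sum(axis=1) >= 3).astype(int)

C = blinded_classifier(X)
for i in range(len(X)):                   # check X_i = X_j  =>  C_i = C_j
    for j in range(len(X)):
        if np.array_equal(X[i], X[j]):
            assert C[i] == C[j]
```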
Race, gender, orientation from Facebook likes
The idea: the classifier results should match group demographics.
Same percentage of black and white defendants scored as high risk. Same percentage of men and
women hired. Same percentage of rich and poor students admitted.
Mathematically:
C⊥A
For all groups a,b we have Pa{C=1} = Pb{C=1}
Equal rate of positive (and hence negative) prediction for all groups.
A classifier with this property: choose the 10 best scoring applicants in each group.
Drawbacks: Doesn’t care who we accept, as long as we accept at equal rates in each group. The
“perfect” predictor, which always guesses correctly, is considered unfair if the base rates are different.
Legal principle: disparate impact
Moral principle: equality of outcome
Demographic Parity
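The “top 10 in each group” rule above equalizes counts; the sketch below (a hypothetical variant, on simulated data) takes the same top fraction of each group, so the acceptance rate itself is equal across groups by construction.

```python
import numpy as np

def top_fraction_per_group(scores, groups, frac=0.2):
    # Accept the highest-scoring `frac` of each group: P(C=1) is the same for
    # every group regardless of how the score distributions differ.
    accept = np.zeros(len(scores), dtype=int)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        k = max(1, int(round(frac * len(idx))))
        accept[idx[np.argsort(scores[idx])[-k:]]] = 1
    return accept

rng = np.random.default_rng(2)
scores, groups = rng.random(1_000), rng.integers(0, 2, 1_000)
C = top_fraction_per_group(scores, groups)
for g in (0, 1):
    print("group", g, "acceptance rate:", round(float(C[groups == g].mean()), 2))
```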
The idea: a prediction means the same thing for each group.
Same percentage of re-arrest among black and white defendants who were scored as high risk. Same
percentage of equally qualified men and women hired. Whether you will get a loan depends only on your
probability of repayment.
Mathematically:
Y⊥A|R
For all groups a,b we have Pa{Y=1|C=1} = Pb{Y=1|C=1}
Implies equal PPV (positive predictive value or “precision”) for each group.
A classifier with this property: most standard machine learning algorithms.
Drawbacks: Disparate impacts may exacerbate existing disparities. Error rates may differ between
groups in unfair ways.
Legal principle: similarly situated
Moral principle: equality of opportunity
Calibration
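A minimal calibration check on simulated data: here the score is calibrated by construction (Y is drawn as Bernoulli(R)), so within each score bin the observed outcome rate is roughly the same for both groups even though their score distributions, and hence base rates, differ.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
A = rng.integers(0, 2, n)
R = rng.beta(2 + A, 4 - A)          # different score distributions per group...
Y = rng.binomial(1, R)              # ...but Y ~ Bernoulli(R): calibrated by construction

bins = np.linspace(0, 1, 6)
for a in (0, 1):                    # P(Y=1 | score bin) per group, ~ bin midpoint
    rates = [float(Y[(A == a) & (R >= lo) & (R < hi)].mean())
             for lo, hi in zip(bins[:-1], bins[1:])]
    print("group", a, [round(r, 2) for r in rates])
```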
Calibration: P(outcome | score) is balanced
Fair prediction with disparate impact: A study of bias in recidivism prediction instruments,
Chouldechova
The idea: Don’t let a classifier make most of its mistakes on one group.
Same percentage of black and white defendants who are not re-arrested are scored as high risk. Same
percentage of qualified men and women mistakenly turned down. If you would have repaid a loan, you
will be turned down at the same rate regardless of your income.
Mathematically:
C⊥A|Y
For all groups a,b we have Pa{C=1|Y=1} = Pb{C=1|Y=1}
Equal false positive rate, true positive rate between groups.
A classifier with this property: use different thresholds for each group.
Drawbacks: Classifier must use group membership explicitly. Calibration is no longer possible when
base rates differ (the same score will mean different things for different groups).
Legal principle: disparate treatment
Moral principle: equality of opportunity
Equal Error Rates
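A minimal sketch of the “different thresholds for each group” construction on simulated data, equalizing just the false positive rate by searching for a per-group cutoff (equalizing FPR and TPR simultaneously generally takes more than one threshold per group, e.g. randomizing between thresholds).

```python
import numpy as np

def fpr(C, Y):
    return float(C[Y == 0].mean())                 # FP / (FP + TN)

def threshold_for_target_fpr(R, Y, target=0.10):
    grid = np.linspace(0, 1, 101)
    return min(grid, key=lambda t: abs(fpr((R > t).astype(int), Y) - target))

rng = np.random.default_rng(4)
n = 50_000
A = rng.integers(0, 2, n)
Y = rng.binomial(1, 0.2 + 0.2 * A)
noise = np.where(A == 1, 0.6, 0.3) * rng.random(n) # noisier score for group 1
R = np.clip(0.5 * Y + noise, 0, 1)

C = np.zeros(n, dtype=int)
for a in (0, 1):
    t = threshold_for_target_fpr(R[A == a], Y[A == a])
    C[A == a] = (R[A == a] > t).astype(int)
    print("group", a, "threshold:", round(float(t), 2),
          "FPR:", round(fpr(C[A == a], Y[A == a]), 3))
```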
Even if two groups of the population admit simple classifiers, the whole population may not
How Big Data is Unfair, Moritz Hardt
One way differential error rates can happen:
less training data for minorities
Impossibility theorems
Most of these metrics are derivable from the confusion matrix. Thus, there are
algebraic dependencies between the various fairness measures.
Calibration equalizes:
PPV = TP / (TP+FP)
Equal error rates equalize:
FPR = FP / (FP+TN), TPR = TP / (TP+FN)
With different base rates, calibration, demographic parity, and equal error rates are
mutually exclusive.
This can be proved with a little arithmetic, but the intuition is:
- Can’t have demographic parity and calibration if different groups have
different qualifications.
- If risk really predicts outcome (calibration), then one group will have higher
risk scores, which means more positives and therefore more false
positives.
Impossibility theorems
Impossibility of calibration + equal error rates
Fair prediction with disparate impact: A study of bias in recidivism prediction instruments,
Chouldechova
Here, p is the “base rate” for the group, e.g. observed rate of re-arrest.
If p differs between groups, then either FPR or PPV must differ too.
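One way to see why, using the definitions on the previous slide (this is the identity at the heart of the Chouldechova paper cited above): with PPV = TP/(TP+FP), TPR = TP/(TP+FN), FPR = FP/(FP+TN), and base rate p = (TP+FN)/(TP+FP+FN+TN), a little algebra gives

FPR = p/(1−p) · (1−PPV)/PPV · TPR

So if two groups have the same PPV (calibration, in the thresholded sense) and the same TPR, but different base rates p, their false positive rates must differ, except in the degenerate case of a perfect predictor (PPV = 1, FPR = 0).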
Case study: Risk Assessment
The black/white marijuana arrest gap, in nine charts,
Dylan Matthews, Washington Post, 6/4/2013
Outcome training data may be biased…
Risk, Race, and Recidivism: Predictive Bias and Disparate Impact
Jennifer L. Skeem, Christopher T Lowenkamp, Criminology 54 (4) 2016
The proportion of racial disparities in crime explained by differential participation
versus differential selection is hotly debated
…
In our view, official records of arrest—particularly for violent offenses—are a valid
criterion. First, surveys of victimization yield “essentially the same racial
differentials as do official statistics. For example, about 60 percent of robbery
victims describe their assailants as black, and about 60 percent of victimization
data also consistently show that they fit the official arrest data” (Walsh, 2004:
29). Second, self-reported offending data reveal similar race differentials,
particularly for serious and violent crimes (see Piquero, 2015).
…other data may be less biased
Sandra Mayson, Bias In, Bias Out
Blinding to race can be counter-productive
Law, Bias and Algorithms course notebooks, Stanford
Some jurisdictions require not blinding gender
ProPublica argument
Defendant’s perspective
“I’m not going to commit a crime, but what is the probability I
will go to jail?”
False positive rate
P(high risk | black, no arrest) = C/(C+A) = 0.45
P(high risk | white, no arrest) = G/(G+E) = 0.23
Seems unfair!
COMPAS dispute based on different definitions of fairness
Northpointe response
Judge’s perspective
“What is the probability this high-risk person commits a
crime?”
Positive predictive value
P(arrest | black, high risk) = D/(C+D) = 0.63
P(arrest | white, high risk) = H/(G+H) = 0.59
Seems fair!
COMPAS dispute based on different definitions of fairness
False Positive Rate can be gamed
A second misconception is that the false positive rate is a reasonable proxy of a
group’s aggregate well-being, loosely defined.
…
Suppose, hypothetically, that prosecutors start enforcing low-level drug crimes that
disproportionately involve black individuals, a policy that arguably hurts the black
community. Further suppose that the newly arrested individuals have low risk of
violent recidivism, and thus are released pending trial.
…
As a result, the false positive rate for blacks would decrease. To see this, recall
that the numerator of the false positive rate (the number of detained defendants
who do not reoffend) remains unchanged while the denominator (the number of
defendants who do not reoffend) increases.
Corbett-Davies and Goel, The Measure and Mismeasure of Fairness, 2018
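A toy numerical version of the argument (the counts are hypothetical, chosen only to show the direction of the effect):

```python
# Adding newly arrested low-risk people who are released and do not reoffend
# grows the FPR's denominator while leaving its numerator unchanged.
detained_no_reoffend = 300        # FP: detained, would not have reoffended
total_no_reoffend = 1_000         # FP + TN
print("FPR before:", detained_no_reoffend / total_no_reoffend)   # 0.3

newly_arrested_released = 500     # all released and non-reoffending, so all TN
print("FPR after: ",
      detained_no_reoffend / (total_no_reoffend + newly_arrested_released))  # 0.2
```

The group is arguably worse off (more of its members are being arrested), yet its false positive rate has gone down.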
Algorithmic output may be ignored anyway
First, it is still unclear whether risk-assessment tools actually have a great
impact on the daily proceedings in courtrooms. During my days of
observation, I found that risk-assessment tools are often actively resisted in
criminal courts. Most judges and prosecutors do not trust the algorithms.
They do not know the companies they come from, they do not understand
their methods, and they often find them useless. Consequently, risk-assessment
tools often go unused: social workers complete the software
programs’ questionnaires, print out the score sheets, add them to the
defendants’ files… after which the scores seem to disappear and are rarely
mentioned during hearings or plea bargaining negotiations.
Angèle Christin, The Mistrials of Algorithmic Sentencing
Megan Stevenson, Assessing Risk Assessment in Action, 2018
Real-world results from Kentucky
Case Study: Lending Decisions
Banking startups adopt new tools for lending,
Steve Lohr, New York Times
None of the new start-ups are consumer banks in the full-service sense of taking
deposits. Instead, they are focused on transforming the economics of underwriting and
the experience of consumer borrowing — and hope to make more loans available at
lower cost for millions of Americans.
…
They all envision consumer finance fueled by abundant information and clever software
— the tools of data science, or big data — as opposed to the traditional math of
creditworthiness, which relies mainly on a person’s credit history.
…
The data-driven lending start-ups see opportunity. As many as 70 million Americans
either have no credit score or a slender paper trail of credit history that depresses their
score, according to estimates from the National Consumer Reporting Association, a
trade organization. Two groups that typically have thin credit files are immigrants and
recent college graduates.
Predictably Unequal? The Effects of Machine Learning on Credit Markets,
Fuster et al
Predictably Unequal? The Effects of Machine Learning on Credit Markets,
Fuster et al
Case study:
Child protective services call screening
Tool to help decide
whether to “screen in” a
child for follow-up,
based on history.
A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions
Chouldechova et al.
Designers hoped to counter existing human bias
A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions
Chouldechova et al.
Feedback loops can be a problem
A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions
Chouldechova et al.
Classifier performance and per-race error rates
A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions
Chouldechova et al.
Algorithmic risk scores vs. human scores
Are algorithms the problem?
Sandra Mayson, Bias In, Bias Out
Prediction in an Unequal World
“Oracle” test for algorithmic problems
To better separate concerns about predictive fairness properties from concerns about
other possible deficiencies of the model we find it helpful to apply what we call the
Oracle Test. This is a simple thought experiment that proceeds as follows.
Imagine that you are given access to an oracle, which for every individual informs
you with perfect accuracy whether the individual will have an event (e.g., will be
placed in foster care). Do any of the concerns you previously had remain when
handed this oracle?
Often the answer is yes. Even if we had perfect prediction accuracy, many valid and
reasonable concerns might remain.
A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions
Chouldechova et al.
What now?
Challenges to creating a fair algorithm
Groups never differ by just race / gender / etc. alone.
You can’t really blind decisions to group membership.
There are several desirable definitions of “fair,” and they are mutually
exclusive.
Training data has its own bias, which is usually unknown.
Humans may follow or ignore algorithmic recommendations.
Even perfect prediction may not give you a fair system.
Fairness is a property of the system, not the algorithm.
When considering an algorithmic system, what do you compare it to?
Absolute fairness – We don’t have perfect prediction or perfect data, and there
may not be agreement over which definition of fairness to use.
As fair as possible given the data – It may be possible to achieve this, given a
particular definition of fairness, if we understand very well what the limitations of
the input data are.
An improvement over current processes and human decision-makers – It’s
possible to evaluate existing institutions by the same standards as algorithms,
and the results do not always favor humans.
An improvement over other possible reforms – If the humans are biased and
the algorithms are biased, is there some other approach?
Fairness by Comparison
(2) AUTOMATED DECISION SYSTEM IMPACT ASSESSMENT.
The term ‘‘automated decision system impact assessment’’ means a study
evaluating an automated decision system and the automated decision system’s
development process, including the design and training data of the automated
decision system, for impacts on accuracy, fairness, bias, discrimination, privacy,
and security
Algorithmic Accountability Act of 2019 (proposed)
Proposed algorithmic fairness legislation
doesn’t define “fairness”
Bias In, Bias Out, Sandra Mayson
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3257004
Assessing Risk Assessment in Action, Megan Stevenson
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3016088
Open Policing Project – Findings
https://openpolicing.stanford.edu/findings/
Open Policing Project – Workbench Tutorial
https://app.workbenchdata.com/workflows/18232/
21 Fairness Definitions and Their Politics, Arvind Narayanan
https://www.youtube.com/watch?v=jIXIuYdnyyk
Resources
