Measuring Model Fairness
Stephen Hoover
PyData LA
2018 October 23
J. Henry Hinnefeld, Peter Cooman, Nat Mammo, and Rupert Deese,
"Evaluating Fairness Metrics in the Presence of Dataset Bias";
https://arxiv.org/abs/1809.09245
What is fair?
https://pxhere.com/en/photo/199386
Models determine whether you can buy a home...
https://www.flickr.com/photos/cafecredit/26700612773
and how long you spend on parole...
https://www.flickr.com/photos/archivesnz/27160240521
and what advertisements you see.
https://www.flickr.com/photos/dno1967b/8283313605
and what advertisements you see.
https://www.flickr.com/photos/44313045@N08/6290270129
How do you measure if your model is fair?
https://pixabay.com/en/legal-scales-of-justice-judge-450202/
How do you measure if your model is fair?
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
How do you measure if your model is fair?
http://www.northpointeinc.com/files/publications/Criminal-Justice-Behavior-COMPAS.pdf
https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
The bias comes from your data.
https://commons.wikimedia.org/wiki/File:Bias_t.png
Subtlety #1: Different groups can have different ground truth positive rates
https://www.breastcancer.org/symptoms/understand_bc/statistics
Certain fairness metrics make assumptions about the balance of ground
truth positive rates
Disparate Impact is a popular metric which assumes that the ground truth positive
rates for both groups are the same
Certifying and removing disparate impact, Feldman et al. (https://arxiv.org/abs/1412.3756)
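As a hedged sketch (not from the original slides): Disparate Impact is usually computed as the ratio of positive-prediction rates between two groups. The argument names and the "four-fifths" 0.8 rule of thumb mentioned in the comment are illustrative assumptions.

```python
import numpy as np

def disparate_impact(y_pred, group, protected, reference):
    """Ratio of positive-prediction rates:
    P(y_hat = 1 | group = protected) / P(y_hat = 1 | group = reference).
    A value near 1.0 suggests parity; the "four-fifths rule" flags ratios below 0.8.
    """
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rate_protected = y_pred[group == protected].mean()
    rate_reference = y_pred[group == reference].mean()
    return rate_protected / rate_reference

# Hypothetical usage, assuming 0/1 predictions and a race array:
# di = disparate_impact(preds, race, protected="African American", reference="white")
```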
Subtlety #2: Your data are a biased representation of ground truth
Datasets can contain label bias when a protected attribute affects the way
individuals are assigned labels.
In addition, the results indicate that students from African American and Latino families are more likely
than their White peers to receive expulsion or out of school suspension as consequences for the same or
similar problem behavior.
A dataset for predicting “student problem behavior” that used “has been suspended”
for its label could contain label bias.
"Race is not neutral: A national investigation of African American and Latino disproportionality in school discipline," Skiba et al.
Subtlety #2: Your data are a biased representation of ground truth
Datasets can contain sample bias when a protected attribute affects the sampling
process that generated your data
We find that persons of African and Hispanic descent were stopped more frequently than whites, even
after controlling for precinct variability and race-specific estimates of crime participation.
A dataset for predicting contraband possession that used stop-and-frisk data could
contain sample bias.
"An analysis of the NYPD's stop-and-frisk policy in the context of claims of racial bias," Gelman et al.
Certain fairness metrics are based on accuracy with respect to possibly
biased labels
Equal Opportunity is a popular metric which compares the True Positive rates
between protected groups
Equality of Opportunity in Supervised Learning, Hardt et al. (https://arxiv.org/pdf/1610.02413.pdf)
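A minimal sketch of how one might compute the Equal Opportunity gap from binary predictions; the argument names are illustrative assumptions, not from the talk.

```python
import numpy as np

def true_positive_rate(y_true, y_pred, mask):
    """TPR = P(y_hat = 1 | y = 1), computed over the rows selected by `mask`."""
    y_true = np.asarray(y_true)[mask]
    y_pred = np.asarray(y_pred)[mask]
    return y_pred[y_true == 1].mean()

def equal_opportunity_gap(y_true, y_pred, group, protected, reference):
    """Difference in TPR between two groups; 0 would satisfy Equal Opportunity."""
    group = np.asarray(group)
    return (true_positive_rate(y_true, y_pred, group == protected)
            - true_positive_rate(y_true, y_pred, group == reference))
```

Note that the True Positive rate is measured against the observed labels, so any label bias from Subtlety #2 flows straight into this metric.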
Subtlety #3: It matters whether the modeled decision’s consequences are
positive or negative
When a model is punitive you might care more about False Positives.
When a model is assistive you might care more about False Negatives.
The point is you have to think about these questions.
We can’t math our way out of thinking about fairness
You still need a person to think about the ethical implications of your model
Originally people thought ‘Models are just math, so they must be fair’
→ definitely not true
Now there’s a temptation to say ‘Adding this constraint will make my model fair’
→ still not automatically true
Can we detect real bias in real data?
Spoiler: it can be tough!
● Start with real data from Civis's work
○ Features are demographics, outcome is a probability
○ Consider racial bias; white versus African American
Can we detect real bias in real data?
Create artificial datasets with known causal bias; then we'll see if we can detect it.
● Start with real data from Civis's work
○ Features are demographics, outcome is a probability
○ Consider racial bias; white versus African American
● Two datasets:
○ Artificially balanced: select white subset and randomly re-assign race (sketched after this list)
○ Unmodified (imbalanced) dataset
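A rough sketch of that balancing step, assuming the data sit in a pandas DataFrame; the `race` column name and group labels are assumptions for illustration.

```python
import numpy as np

def make_balanced(df, rng=None):
    """Artificially balanced dataset: keep only the white subset, then reassign
    race at random so the protected attribute is independent of everything else."""
    rng = rng or np.random.default_rng(0)
    balanced = df[df["race"] == "white"].copy()
    balanced["race"] = rng.choice(["white", "African American"], size=len(balanced))
    return balanced
```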
Next introduce known sample and label bias
Label bias: you're in the dataset, but protected class affects your label
Use the original dataset but modify the labels
For unbiased data, use
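As an illustrative sketch only (not the paper's exact procedure): one simple way to inject label bias is to flip a fraction of the protected group's recorded labels, so the observed label depends on race even when the underlying outcome does not. The `p_flip` rate and column names are assumptions.

```python
import numpy as np

def inject_label_bias(df, p_flip=0.2, rng=None):
    """Corrupt labels for the protected group only: each protected-group row's
    recorded 0/1 label is flipped with probability `p_flip`."""
    rng = rng or np.random.default_rng(0)
    biased = df.copy()
    flips = (biased["race"] == "African American") & (rng.random(len(biased)) < p_flip)
    biased.loc[flips, "label"] = 1 - biased.loc[flips, "label"]
    return biased
```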
Next introduce known sample and label bias
Sample bias: protected class affects whether you're in the sample at all
Create a modified dataset with labels taken from the original data
For unbiased data, use
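Again as an illustrative sketch rather than the paper's exact procedure: one simple way to inject sample bias is to make the probability of staying in the sample depend on race. The `p_drop` rate and column names are assumptions.

```python
import numpy as np

def inject_sample_bias(df, p_drop=0.5, rng=None):
    """Make sample membership depend on race: each row outside the protected
    group is dropped with probability `p_drop`, leaving the protected group
    over-represented relative to the underlying population."""
    rng = rng or np.random.default_rng(0)
    keep = (df["race"] == "African American") | (rng.random(len(df)) >= p_drop)
    return df[keep].copy()
```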
Two experiments, four datasets each
Train an elastic-net logistic regression classifier on each dataset
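One plausible way to fit such a model with scikit-learn; the `l1_ratio`, `C`, and scaling choices below are placeholder assumptions, not the talk's actual hyperparameters.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_classifier(X, y):
    """Logistic regression with an elastic-net penalty (the saga solver supports it);
    l1_ratio mixes the L1 and L2 terms."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=5000),
    )
    return model.fit(X, y)
```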
Train, predict, then measure
Apply each of the following fairness metrics to the model's probability predictions (sketched after this list):
Disparate Impact
Equal Opportunity
Equal Mis-Opportunity
Difference in Average Residuals
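Hedged sketches of the remaining two metrics, computed from predicted probabilities: Equal Mis-Opportunity taken here as a gap in False Positive rates, and Difference in Average Residuals taken as the gap in mean (label minus predicted probability) between groups. The 0.5 threshold and argument names are assumptions.

```python
import numpy as np

def false_positive_rate(y_true, y_prob, mask, threshold=0.5):
    """FPR = P(y_hat = 1 | y = 0), thresholding predicted probabilities."""
    y_true = np.asarray(y_true)[mask]
    y_hat = np.asarray(y_prob)[mask] >= threshold
    return y_hat[y_true == 0].mean()

def equal_mis_opportunity_gap(y_true, y_prob, group, protected, reference):
    """Difference in False Positive rates between the two groups."""
    group = np.asarray(group)
    return (false_positive_rate(y_true, y_prob, group == protected)
            - false_positive_rate(y_true, y_prob, group == reference))

def average_residual_difference(y_true, y_prob, group, protected, reference):
    """Difference between groups in the mean residual (label minus predicted probability)."""
    y_true, y_prob, group = map(np.asarray, (y_true, y_prob, group))
    resid = y_true - y_prob
    return resid[group == protected].mean() - resid[group == reference].mean()
```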
With balanced ground truth, all metrics detect bias
Good news!
[Charts: each fairness metric evaluated on the four balanced datasets: no bias, sample bias, label bias, both]
With imbalanced ground truth, all metrics still detect bias...
...even when there isn't any bias in the "truth".
[Charts: each fairness metric evaluated on the four imbalanced datasets: no bias, sample bias, label bias, both]
Label bias is particularly hard to distinguish from imbalanced ground truth
Which is fair enough, since in this case reality does contain label bias.
[Charts: each fairness metric evaluated on the four imbalanced datasets: no bias, sample bias, label bias, both]
There's no one-size-fits-all solution
Except for "think hard about your inputs and your outputs"
● These metrics can help
There's no one-size-fits-all solution
Except for "think hard about your inputs and your outputs"
● These metrics can help
● There are methods you can use in training to make your model more fair
○ (assuming you know what "fair" means for you)
https://arxiv.org/pdf/1412.3756.pdf
https://arxiv.org/pdf/1806.08010.pdf
http://proceedings.mlr.press/v28/zemel13.pdf
There's no one-size-fits-all solution
Except for "think hard about your inputs and your outputs"
● These metrics can help
● There are methods you can use in training to make your model more fair
● Use a diverse team to create the models!
https://imgur.com/gallery/hem9m
There's no one-size-fits-all solution
Except for "think hard about your inputs and your outputs"
● These metrics can help
● There are methods you can use in training to make your model more fair
● Use a diverse team to create the models!
● Know your data and check your predictions
https://pixabay.com/en/isolated-thinking-freedom-ape-1052504/
“Big Data processes codify the past. They do not invent the future. Doing
that requires moral imagination, and that’s something only humans can
provide. We have to explicitly embed better values into our algorithms,
creating Big Data models that follow our ethical lead. Sometimes that will
mean putting fairness ahead of profit.”
― Cathy O'Neil, Weapons of Math Destruction: How Big Data Increases
Inequality and Threatens Democracy
J. Henry Hinnefeld, Peter Cooman, Nat Mammo, and Rupert Deese,
"Evaluating Fairness Metrics in the Presence of Dataset Bias";
https://arxiv.org/abs/1809.09245
Stephen Hoover
@StephenActual
