Measures and mismeasures
of algorithmic fairness
Manojit Nandi
Senior Data Scientist, Rocketrip
@mnandi92
About Me (According to Google Cloud Vision API)
● Dancer? Aerial dancer and
circus acrobat.
● Entertaining? Hopefully.
● Fun? Most of the time.
● Girl?!?
What is Algorithmic Fairness?
Algorithmic Fairness
● Algorithmic Fairness is a growing field of research that aims to mitigate the effects of unwarranted bias and discrimination against people affected by machine-learning systems.
● Primarily focused on mathematical formalisms of fairness and on developing solutions that satisfy those formalisms.
● IMPORTANT: Fairness is inherently a
social and ethical concept.
In battle practice, a mage knight data
scientist must be equipped with more
than just magic algorithms, my friend.
Fairness, Accountability, Transparency (FAT*) ML
● Interdisciplinary research area that focuses on creating machine-learning systems that work towards goals such as fairness and justice.
● Many open-source libraries (FairTest, themis-ml, AI Fairness 360) have been developed based on this research.
● FAT* 2019 Conference happening in Atlanta, GA.
Photo credits: Moritz Hardt
Algorithmic Fairness in Popular
Media
Legal Regulations
In the United States, many industries have legal
regulations to prevent disparate impact against
vulnerable populations.
● Education (Education Amendments Act)
● Employment (Civil Rights Act)
● Credit (Equal Credit Opportunity Act)
● Housing (Fair Housing Act)
Types of Algorithmic Biases
Bias in Allocation
● The most commonly researched family of algorithmic fairness problems (and the reason the mathematical definitions were invented).
● Algorithmic Idea: How do models perform in binary classification problems across different groups?
● Fundamental Idea: When allocating finite resources (credit loans, gainful employment), we often favor the privileged class over more vulnerable groups.
Source: Reuters News
Bias in Representation
● Focuses on how harmful labels and representations are propagated.
● Often arises in language and computer vision problems.
● Harder to quantify the error compared to bias-in-allocation problems.
● Concerned with algorithms promoting harmful stereotypes and failing to recognize certain groups.
Snapchat filters (tested this yesterday).
Weaponization of Machine Learning
● As data scientists, we are often not taught to think about how our models could be used inappropriately.
● With the increasing usage of AI in high-stakes situations, we must be careful not to harm or endanger vulnerable populations.
Source: “Why Stanford Researchers Tried to Create a ‘Gaydar’ Machine”; New York Times
Types of Fairness Measures
Sam Corbett-Davies
Stanford University
Sharad Goel
Stanford University
“21 Definitions of Algorithmic Fairness”
● There are more than 30 different
mathematical definitions of fairness in
the academic literature.
● There is no single, true definition of fairness.
● These definitions can be grouped
together into three families:
○ Anti-Classification
○ Classification Parity
○ Calibration
Arvind Narayanan
Anti-Classification
● Heuristic: Algorithmic decisions “ignore” protected attributes.
● In addition to excluding protected attributes, one must also be concerned about the model learning proxy features (a simple proxy check is sketched below).
● Useful for defining the loss functions of fairness-aware models.
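Anti-classification is easy to state but easy to violate through proxies. Below is a minimal sketch of one way to test for leakage, assuming a hypothetical feature table X (protected columns already dropped) and protected-attribute labels z: if a simple model can predict z from X well above chance, proxy features are still present.

```python
# Hedged sketch of a proxy-feature check (illustrative, not a library API).
# X: features with protected attributes already removed; z: protected labels.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def protected_attribute_leakage(X, z):
    """Cross-validated accuracy of predicting the protected attribute from the
    remaining features; accuracy well above the base rate suggests proxies."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, z, cv=5, scoring="accuracy").mean()
```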
Fairness-Aware Algorithms
● Given a set of features X, labels Y, and protected characteristics Z, we want to create a model that learns to predict the labels Y but doesn’t “accidentally” learn to predict the protected characteristics Z.
● This constrained optimization can be viewed as akin to regularization; it is sometimes referred to as the accuracy-fairness trade-off (a minimal sketch follows below).
Source: Towards Fairness in ML with Adversarial Networks (GoDataDriven)
Figure annotations: Is it a good classifier? Is it learning the protected attributes?
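The cited post uses adversarial networks; as a lighter illustration of the regularization view described above, here is a hedged sketch of a logistic model whose loss adds a penalty on the covariance between its decision scores and the protected attribute. The function name, inputs, and penalty form are illustrative assumptions, not the cited method; increasing `lam` trades accuracy for fairness.

```python
import numpy as np

def fit_fair_logreg(X, y, z, lam=1.0, lr=0.1, epochs=500):
    """Illustrative sketch: logistic regression with an added penalty on the
    covariance between decision scores and the protected attribute z.
    X (features), y (0/1 labels), z (protected attribute) are assumed arrays."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    z_c = np.asarray(z, dtype=float) - np.mean(z)     # centered protected attribute
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        scores = X @ w
        preds = 1.0 / (1.0 + np.exp(-scores))          # sigmoid
        grad_loss = X.T @ (preds - y) / n              # log-loss gradient
        cov = z_c @ scores / n                         # "is it learning z?" signal
        grad_fair = 2.0 * lam * cov * (X.T @ z_c) / n  # gradient of the cov^2 penalty
        w -= lr * (grad_loss + grad_fair)
    return w
```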
Dangers of Anti-Classification
Measures
● By “removing” protected features, we
ignore the underlying processes that
affect different demographics.
● Fairness metrics are focused on making
outcomes equal.
● DANGER! Sometimes making outcomes
equal adversely impacts a vulnerable
demographic.
Source: Corbett-Davies, Goel (2019)
Classification Parity
● Given some traditional classification measure (accuracy, false positive rate), is that measure equal across the different protected groups? (An audit sketch follows below.)
● Most commonly used to audit algorithms from a legal perspective.
Source: Gender Shades,
Buolamwini et al. (2018)
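A minimal audit sketch of classification parity under assumed inputs (`y_true`, `y_pred`, and a `groups` array of protected-attribute values are hypothetical names): compute the same metric separately for each group and inspect the gaps.

```python
import numpy as np

def metric_by_group(y_true, y_pred, groups, metric):
    """Compute a chosen classification metric separately for each protected group."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {g: metric(y_true[groups == g], y_pred[groups == g])
            for g in np.unique(groups)}

# Example: accuracy per group; parity would mean the values are roughly equal.
accuracy = lambda yt, yp: float(np.mean(yt == yp))
```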
Demographic Parity
● Demographic Parity looks at the proportion of positive outcomes within each protected attribute group.
● Demographic Parity is used to audit models for disparate impact (the 80% rule; see the sketch below).
● DANGER! Satisfying the immediate constraint may have negative long-term consequences.
Source: Delayed Impact of Fair Machine Learning, Liu et al. (2018)
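A hedged sketch of the 80% rule check, assuming `y_pred` holds 0/1 decisions and `groups` holds the protected attribute (both hypothetical names): the ratio of the lowest to the highest positive-outcome rate should not fall below 0.8.

```python
import numpy as np

def disparate_impact_ratio(y_pred, groups):
    """Ratio of the lowest to the highest positive-outcome rate across groups;
    a value below 0.8 is commonly flagged as disparate impact (80% rule)."""
    y_pred, groups = np.asarray(y_pred), np.asarray(groups)
    rates = [y_pred[groups == g].mean() for g in np.unique(groups)]
    return min(rates) / max(rates)
```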
Parity of False Positive Rates
● As the name suggests, this measure looks at the false positive rate across different protected groups.
● Sometimes called “Equal Opportunity.”
● It’s possible to improve the false positive rate simply by increasing the number of true negatives (see the formula below).
● DANGER! If we don’t take societal factors into consideration, we may end up harming vulnerable populations.
(Slide annotation: ignore the number of false positives, just increase this [the true negatives].)
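The slide annotation above points at the denominator of the false positive rate: because true negatives appear only in the denominator, a group's rate can be driven toward zero by adding true negatives while the number of false positives stays untouched.

```latex
\mathrm{FPR} = \frac{FP}{FP + TN},
\qquad
\lim_{TN \to \infty} \frac{FP}{FP + TN} = 0 \quad \text{even when } FP \text{ is unchanged.}
```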
Calibration
● In the case of risk assessment (recidivism, child protective services), we use a scoring function s(x) to estimate an individual’s true risk.
● We define some threshold t and make a decision when s(x) > t.
● Example: Child Protective Services (CPS) assigns a risk score to a child and intervenes if the perceived risk to the child is high enough.
Statistical Calibration
● Heuristic: Two individuals with the same risk score s have the same likelihood of receiving the outcome.
● A risk score of 10 should mean the same thing for a white individual as it does for a black individual. (A group-wise calibration check is sketched below.)
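A hedged sketch of the group-wise calibration check implied by the heuristic above, under assumed inputs (`scores`, binary `outcomes`, and `groups` are hypothetical names): bin individuals by score and compare the observed outcome rate per bin across groups; rough equality bin by bin indicates calibration within groups.

```python
import numpy as np

def calibration_by_group(scores, outcomes, groups, bins=10):
    """For each protected group, observed outcome rate within each score bin.
    Calibration across groups means these rates line up bin by bin."""
    scores, outcomes, groups = map(np.asarray, (scores, outcomes, groups))
    edges = np.quantile(scores, np.linspace(0, 1, bins + 1))
    table = {}
    for g in np.unique(groups):
        m = groups == g
        idx = np.digitize(scores[m], edges[1:-1])   # bin index 0..bins-1
        table[g] = [float(outcomes[m][idx == b].mean()) if np.any(idx == b) else None
                    for b in range(bins)]
    return table
```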
Debate about COMPAS
● COMPAS is used to assign a recidivism risk score to defendants.
● ProPublica Claim: Black defendants have higher false positive rates.
● Northpointe Defense: The risk scores are well-calibrated across groups.
(A toy example of how both claims can hold at once follows below.)
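Both claims can be true at once when base rates differ. The toy simulation below (hypothetical numbers, not COMPAS data) draws perfectly calibrated scores for two groups with different base rates and thresholds them: calibration holds in both groups, yet the false positive rates diverge.

```python
import numpy as np

rng = np.random.default_rng(0)

def group_fpr(base_rate, n=200_000, threshold=0.5):
    """Toy sketch: scores equal the true risk (perfect calibration), yet
    thresholding yields different FPRs when the base rate differs."""
    # Beta(2m, 2(1-m)) has mean m; treat each draw as a calibrated risk score.
    scores = rng.beta(2 * base_rate, 2 * (1 - base_rate), size=n)
    reoffend = rng.random(n) < scores          # outcomes generated from the scores
    flagged = scores > threshold               # "high risk" decision
    return float(flagged[~reoffend].mean())    # FPR among true non-reoffenders

print(group_fpr(0.3), group_fpr(0.5))          # different FPRs, same calibration
```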
Datasheets and Model Cards
Datasheets for Datasets
● Taking inspiration from safety standards in other industries, such as automobile testing and clinical drug trials, Gebru et al. (2017) propose standards for documenting datasets.
● Documentation questions include (a toy checklist sketch follows below):
○ How was the data collected? Over what time frame?
○ Why was the dataset created? Who funded its creation?
○ Does the data contain any sensitive information?
○ How was the dataset pre-processed/cleaned?
○ If the data relates to people, were they informed about the intended use of the data?
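As a hedged illustration only (the field names below are made up, not the schema from Gebru et al.), the documentation questions above could be enforced as a lightweight check before a dataset enters a training pipeline:

```python
# Hypothetical field names; not an official datasheet schema.
REQUIRED_FIELDS = [
    "motivation", "funding", "collection_process", "time_frame",
    "sensitive_information", "preprocessing", "consent_for_intended_use",
]

def missing_datasheet_fields(datasheet: dict) -> list:
    """Return the documentation questions that are still unanswered."""
    return [f for f in REQUIRED_FIELDS if not datasheet.get(f)]
```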
Model Cards for Model Reporting
● Google researchers propose a
standard for documenting deployed
models.
● Sections include:
○ Intended Use
○ Factors (evaluation amongst
demographic groups)
○ Ethical Concerns
○ Caveats and Recommendations.
Mitchell et al. (2019)
Some Shout-outs!
AI Now Institute
● New York University research
institute that focuses on
understanding the societal
and cultural impact of AI and
machine learning.
● Recently hosted a symposium
on Ethics, Organizing, and
Accountability.
Data 4 Good Exchange (D4GX)
● Free, annual conference in September at Bloomberg’s NYC headquarters that brings together data scientists, NGO leaders, and policy-makers.
● A great combination of theoretical research, applied results, and best practices learned by policy-makers.
Papers Referenced
1. The Measures and Mismeasures of Fairness: A Critical Review of Fair Machine
Learning; https://5harad.com/papers/fair-ml.pdf
2. Delayed Impact of Fair Machine Learning; https://arxiv.org/pdf/1803.04383.pdf
3. Datasheets for Datasets; https://arxiv.org/pdf/1803.09010.pdf
4. Model Cards for Model Reporting; https://arxiv.org/pdf/1810.03993.pdf
5. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender
Classification;
http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf
6. Fairness and Abstraction in Sociotechnical Systems;
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3265913


Editor's Notes

  • #3 Google Vision
  • #5 Talk about the dangers of over-formalism here. It’s about more than just misclassification (cows vs. horses).
  • #6 Google has a course on Fair ML, and there are a bunch of libraries now.
  • #7 Name a more iconic trio, I’ll wait. No one wants to be the person who brings upon the apocalypse.
  • #10 Who gets faster mail deliveries? Where we are now.
  • #11 Assigning one group the “gorilla” label does not prevent us from assigning it to others. Hard to create a loss function that captures societal and cultural history about the severity of this error.
  • #13 At best, we make some naive assumptions about hostile actors. Important to consider anthropological and ethical perspectives
  • #15 No one true definition (counter-intuitive to computer science and mathematics)
  • #16 Recent work looks at identifying causal paths.
  • #18 Fairness is a process property! Equalizing calibration scores can penalize women.
  • #19 Gender Shades is by Joy Buolamwini, who has a spoken-word video called “AI, Ain’t I a Woman?”
  • #20 80% rule from the EEOC (Equal Employment Opportunity Commission). Careful about over-loaning and unequal rates. Say orange and blue!
  • #21 Police arrest more low-risk individuals to increase the True Negatives count. Sounds dumb, but keep it in mind (malicious actors).
  • #22 Each individual has some true risk score r(X) [not known, we estimate with scoring function s(X)]
  • #24 COMPAS uses a questionnaire: Have you ever been arrested before? Have any of your friends been arrested? The picture shows two individuals arrested for drug possession. Trade-offs to fairness.
  • #25 One new solution: Better documentation!
  • #26 Not much discussion about what makes a good training dataset. Also a shout-out to Deon by DrivenData. Compare Material Safety Data Sheets (kitchen ingredients, cleaning chemicals).
  • #29 Looking to hire a post-doctoral researcher.
  • #31 List of papers referenced in this talk.