Economic Data Science
Seminar Series
2 December 2021
Slido#14560
Welcome
Chair – Dr Arthur Turrell
Deputy Director, Data Science Campus
Office for National Statistics
Slido#14560
Agenda
09:00 – 09:05 Welcome and introduction – Arthur Turrell
09:05 – 09:45 Causal inference for machine learning – Amit
Sharma
09:45 – 09:55 Q&A
09:55 – 10:00 Closing remarks – Arthur Turrell
Questions can be submitted via the slido app using code #14560.
You can also access slido via the link in the chat box.
Causal inference for
Machine Learning
Amit Sharma
Principal Researcher
Microsoft Research India
Slido#14560
Causal inference for
machine learning
Amit Sharma
Microsoft Research India
@amt_shrma
www.amitsharma.in
True: 𝑦 = f(𝑥, u) + 𝜖
[Diagram] Causal inference and machine learning inform each other:
• ML for causal inference: better estimation and refutation of causal effects (github.com/microsoft/dowhy)
• Causal inference for ML: better generalization and robustness of ML models; a principled framework for fairness and explanation
Key message: Causal reasoning is essential for
machine learning
• Machine learning faces many fundamental challenges
• Out of distribution generalization, Robustness, Fairness, Explainability, Privacy
• A causal perspective can help
• Better definitions of the challenges
• Theoretically justified algorithms
• “Matching” for out-of-distribution prediction
• “Counterfactuals” for explainable predictions
• “Missing data” for fairness
Correlational machine learning searches for patterns. Often, it finds spurious ones.
Examples of spurious patterns and brittleness:
• Bias in an ML model for hiring decisions: résumés containing "women's" (as in "women's chess club captain") were penalized, while verbs such as "executed" and "captured" were favoured [Reuters 2018, Weblink]
• Accuracy on unseen angles (0, 90): 64% [Piratla et al. ICML 2020]
• Incorrect predictions under changes in data [Alcorn et al. CVPR 2019]
• Fooled by semantically equivalent perturbations [Ribeiro et al. ACL 2018]
1. OOD generalization is a causal
problem.
Domain Generalization using Causal Matching. ICML 2021.
Alleviating Privacy Attacks using Causal Learning. ICML 2020.
Causal Regularization using Domain Priors. arXiv.
True: 𝑦 = f(𝑥, u) + 𝜖

Typical supervised prediction: min E_P[(𝑦 − ŷ)²]. Use cross-validation to select the model. No need to worry about 𝑢.

Out-of-distribution prediction: min E_P*[(𝑦 − ŷ)²]. P* is not observed, so cross-validation is not possible.

[Causal graph with X, Y and unobserved U.
Standard setting: Train: P(X, Y, U), Test: P(X, Y, U).
Out-of-distribution setting: Train: P(X, Y, U), Test: P*(X, Y, U).]
Invariant causal learning: If you learn the causal
function from X->Y, your model will be optimal
across all unseen distributions.
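To make this contrast concrete, here is a minimal simulation sketch (not from the talk; the variable names, coefficients and use of numpy are illustrative assumptions): a model restricted to the causal feature x_c keeps its test error when a spurious correlation flips in a new domain, while a model that also uses the spurious feature x_s degrades.

# Minimal sketch: causal-feature model vs. associational model under distribution shift.
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)

def make_data(n, spurious_strength):
    x_c = rng.normal(size=n)                       # causal feature
    y = 2.0 * x_c + rng.normal(scale=0.5, size=n)  # true mechanism y = f(x_c) + noise
    # x_s is correlated with y in a domain-dependent way (the spurious pattern)
    x_s = spurious_strength * y + rng.normal(scale=0.5, size=n)
    return np.column_stack([x_c, x_s]), y

X_train, y_train = make_data(5000, spurious_strength=1.0)   # training domain
X_test, y_test = make_data(5000, spurious_strength=-1.0)    # test domain: correlation flips

def test_mse(cols):
    w, *_ = lstsq(X_train[:, cols], y_train, rcond=None)    # least-squares fit on chosen columns
    return ((X_test[:, cols] @ w - y_test) ** 2).mean()

print("test MSE using x_c and x_s:", test_mse([0, 1]))  # degrades out of distribution
print("test MSE using x_c only:   ", test_mse([0]))     # stays close to the noise level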
Where's the catch? Learning causal models

[Example causal graph: Blood Pressure and Heart Rate (the parents, 𝑿_parent) cause Disease Severity (𝒀).]

• Structural Approach: create a causal graph based on external knowledge.
• Multiple Domains Approach: find features whose effect stays invariant across many domains; h(𝑥_C) will lead to similar accuracy on both domains.
• Constraints Approach: identify the constraints that any causal model should satisfy, e.g. Blood Pressure => Disease Severity.
I. Learning using causal structure

A dataset of people living with a chronic illness: <Y: disease_severity> <X: age, gender, blood pressure, heart rate>.

Associational ML: min_h Σ_(x,y) Loss(h(x), y)

[Causal graph: Age (𝑿₁) and Weight (𝑿₂) affect Blood Pressure and Heart Rate (𝑿_parent), which cause Disease Severity (𝒀).]

(Ideal) Causal learning:
1. Identify which features directly cause the outcome (parents of Y in the causal graph).
2. Build a predictive model using only those features.

Causal ML: min_h Σ_(x,y) Loss(h(x_C), y), with 𝐗_C = 𝐗_PA = {heart rate, blood pressure}.

Lower train accuracy, but stays consistent in new domains.
Why only parents of Y in the causal graph?

[Causal graph over Y and features 𝑋_PA (parents), 𝑋_CH (children), 𝑋_cp, 𝑋_S0, 𝑋_S1, 𝑋_S2, shown for the training domain and for a new domain under covariate shift.]

Train: 𝑷(𝑿, 𝒀) = 𝑷(𝑿) 𝑷(𝒀|𝑿)
New domain (covariate shift): 𝑷*(𝑿, 𝒀) = 𝑷*(𝑿) 𝑷(𝒀|𝑿)
Why only parents of Y in the causal graph?

[Same causal graph, now shown for a new domain under concept drift.]

Train: 𝑷(𝑿, 𝒀) = 𝑷(𝑿) 𝑷(𝒀|𝑿)
New domain (concept drift): 𝑷*(𝑿, 𝒀) = 𝑷(𝑿) 𝑷*(𝒀|𝑿)
Why only parents of Y in the causal graph?

[Same causal graph, now under any intervention on any feature.]

𝑃(𝑌|𝑋_PA) is invariant across different distributions, unless there is a change in the true data-generating process for Y.
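A compact way to write the argument behind this invariance, as a LaTeX sketch under the assumption that the structural mechanism generating Y from its parents is untouched by the shift:

% Sketch: shifts in the feature distribution leave the mechanism for Y, and hence P(Y | X_PA), unchanged.
\begin{align*}
\text{Structural mechanism:}\quad & Y := f_Y(X_{PA}, \varepsilon_Y)\\
\text{Train:}\quad & P(X, Y) = P(X)\, P(Y \mid X)\\
\text{Covariate shift:}\quad & P^{*}(X, Y) = P^{*}(X)\, P(Y \mid X)\\
\text{Hence:}\quad & P^{*}(Y \mid X_{PA}) = P(Y \mid X_{PA}),
   \text{ since } f_Y \text{ and } \varepsilon_Y \text{ are unchanged.}
\end{align*}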
Any other benefits?
• Result 1: Better out-of-distribution generalization
• Result 2: Stronger differential privacy guarantees

Causal models are more robust to privacy attacks like membership inference.

Theorem: When equivalent Laplace noise is added and models are trained on the same dataset, the causal mechanism M_C provides 𝜖_C-DP and the associational mechanism M_A provides 𝜖_A-DP guarantees such that 𝝐_C ≤ 𝝐_A.
Alleviating Privacy Attacks using Causal Learning. ICML 2020.
Summary: Causal predictive models offer better out-of-distribution accuracy and stronger privacy.
So why isn't everyone using them?
Same problem as for causal inference: Rare to have an outcome
variable where all parents are observed.
Can methods from causal inference also be used to solve it?
Where's the catch? Learning causal models (recap)

[Example causal graph: Blood Pressure and Heart Rate (𝑿_parent) cause Disease Severity (𝒀).]

• Structural Approach: create a causal graph based on external knowledge.
• Multiple Domains Approach: find features whose effect stays invariant across many domains; h(𝑥_C) will lead to similar accuracy on both domains.
• Constraints Approach: identify the constraints that any causal model should satisfy, e.g. Blood Pressure => Disease Severity.
Leveraging data from multiple domains

[Illustration: example TRAIN and TEST dataset pairs.]

Need to ensure that each pair of images exactly matches on shape features but varies on color (i.e., the confounder).

Difference from causal inference: matching on the same causal features, rather than on the same confounders.

Approaches: data augmentation in ML; representation learning.
How does it work?

Goal: learn E[𝑌 | 𝑋_c]. But 𝑋_c is not observed. So match two images having the same 𝑋_c and enforce them to have the same representation.

[Causal graph legend: latent shape features (unobserved); different colors (may or may not be observed); image and label (observed).]

𝑿_C ⫫ 𝑫𝒐𝒎𝒂𝒊𝒏 | 𝑶𝒃𝒋𝒆𝒄𝒕 => match inputs from the different domains that belong to the same object (a minimal sketch of this matching loss follows the aside below).
Aside: Many prior works on domain generalization optimize for the incorrect objective
• “Domain Invariant Representation” proposes 𝑋𝐶 ⫫ 𝐷𝑜𝑚𝑎𝑖𝑛
• “Class-conditional Domain Invariant Representation” proposes 𝑋𝐶 ⫫ 𝐷𝑜𝑚𝑎𝑖𝑛|𝑌
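A minimal sketch of a matching loss that uses this condition when ground-truth matches are available (e.g. from data augmentation). The module names, dimensions and the weight lam are illustrative assumptions, not the paper's code; the sketch simply adds a penalty that pulls together the representations of two domain-specific views of the same object while training the classifier as usual.

# Minimal sketch: classification loss plus a same-object matching penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchedPairModel(nn.Module):
    def __init__(self, in_dim=64, rep_dim=32, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, rep_dim), nn.ReLU())
        self.classifier = nn.Linear(rep_dim, n_classes)

    def forward(self, x):
        z = self.encoder(x)
        return z, self.classifier(z)

def matched_loss(model, x_a, x_b, y, lam=1.0):
    """x_a, x_b: two domains' views of the *same* objects (e.g. data augmentations)."""
    z_a, logits_a = model(x_a)
    z_b, logits_b = model(x_b)
    ce = F.cross_entropy(logits_a, y) + F.cross_entropy(logits_b, y)
    match = ((z_a - z_b) ** 2).sum(dim=1).mean()   # same object => same representation
    return ce + lam * match

# Toy usage with random tensors, just to show the shapes involved.
model = MatchedPairModel()
x_a, x_b = torch.randn(8, 64), torch.randn(8, 64)
y = torch.randint(0, 10, (8,))
loss = matched_loss(model, x_a, x_b, y)
loss.backward()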
Leveraging multiple domains: Can also work if
data augmentations are not available
• If objects are not known, iteratively
learn matched pairs of inputs from
different domains.
• Assumption: same-class inputs are closer in causal features than inputs from different classes.
• Start by matching random inputs from the same class.
• Minimize intra-match distance: find a feature representation that minimizes the distance within matches.
• Estimate matches: update matches based on the new representation, and repeat. (A minimal sketch of this match-update step follows.)
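A minimal sketch of the match-update step only (illustrative, in plain numpy): given current representations, each example is re-paired with its nearest same-class neighbour from a different domain; these pairs then feed the matching loss sketched earlier, and the two steps alternate.

# Minimal sketch: re-pair examples with nearest same-class, cross-domain neighbours.
import numpy as np

def update_matches(reps, labels, domains):
    """reps: (n, d) current representations; labels, domains: (n,) arrays."""
    matches = []
    for i in range(len(reps)):
        # candidates: same class, different domain
        mask = (labels == labels[i]) & (domains != domains[i])
        candidates = np.where(mask)[0]
        if len(candidates) == 0:
            continue
        dists = np.linalg.norm(reps[candidates] - reps[i], axis=1)
        matches.append((i, candidates[np.argmin(dists)]))
    return matches  # pairs fed back into a match-based loss, then repeat

# Tiny example with random data.
rng = np.random.default_rng(0)
reps = rng.normal(size=(20, 8))
labels = rng.integers(0, 2, size=20)
domains = rng.integers(0, 3, size=20)
print(update_matches(reps, labels, domains)[:5])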
IV. Empirical results: Causal models are more
accurate on unseen rotations of MNIST digits
Test Domain: Images rotated by 0 and 90 degrees.
This method also achieves state-of-the-art accuracy on PACS, the most popular domain generalization benchmark.
2. ML Explanation is a causal
problem.
Explaining Machine Learning Classifiers through Counterfactual Examples. FAccT 2020.
Towards Unifying Feature Attribution and Counterfactual Explanations: Different
Means to the Same End. AIES 2021.
Explaining machine learning predictions

Techniques to explain machine predictions: LIME (Ribeiro et al., 2016); Local Rule-based (Guidotti et al., 2018); SHAP (Lundberg et al., 2017); Intelligible Models (Lou et al., 2012); …

Feature importance-based methods are widely used in many practical applications.

[Bar chart: importance scores for Feature 1, Feature 2 and Feature 3.]
In many cases, feature importance is not enough

[Illustration: a bank's loan distribution algorithm grants or denies loans.] Suppose a person does not get the loan. Person: "What should I do to get the loan in the future?"

• Feature importance-based explanations: an importance score for features such as annual income, no. of credit accounts, and credit years.
• Counterfactual explanations (CF), i.e. "what-if" scenarios: "You would have got the loan if your annual income had been 100,000" (Wachter et al., 2017).
Many explanation scenarios are actually asking "what-if" or causal questions:
• "If I change the most important feature according to the explanation, will it change the predicted outcome?"
• "What if we change the second most important feature?"

Statistical summaries are not enough; this requires a different kind of reasoning: causal reasoning, i.e. the individual treatment effect of different features.
Causal reasoning for explaining machine learning

[Same loan example: a bank's loan distribution algorithm grants or denies loans.]

Counterfactual explanations (CF), i.e. "what-if" scenarios: "You would have got the loan if your annual income had been 100,000" (Wachter et al., 2017).

Questions to answer: What feature value caused the prediction? How to provide a feature ordering?
What does it mean to explain an event?
Event = ML model predicts 1.

[Halpern 2016] A feature is an ideal causal explanation iff:
• Necessity: changing the feature changes the model's prediction.
• Sufficiency: if the feature stays the same, changing the other features cannot change the model's prediction.
• Ideal explanations are rare.

Example: f(𝑥₁, 𝑥₂, 𝑥₃) = 𝕀(0.4𝑥₁ + 0.1𝑥₂ + 0.1𝑥₃ >= 0.5). Given f(1, 1, 1) = 1, x₁ is necessary (setting x₁ = 0 drops the score to 0.2 and flips the prediction), but no single feature is sufficient (holding any one feature at 1, the other two can be lowered enough to flip the prediction).
But we can quantify degree of necessity or
sufficiency
(x, f(x))
• Necessity = P (f(x) changes | feature is changed)
• Sufficiency = P( f(x) is unchanged | feature is unchanged)
Where these probabilities are over all plausible values of the features.
In practice, approximate by neighborhood of the point x.
Simple algorithm

Necessity: given (𝑥, 𝑓(𝑥)), find the necessity of feature 𝑥ᵢ.
• Sample points 𝑥′ such that 𝑥ᵢ is changed while keeping every other feature constant.
• Calculate 𝑃(𝑓(𝑥′) ≠ 𝑓(𝑥)) over all such 𝑥′.

Sufficiency: given (𝑥, 𝑓(𝑥)), find the sufficiency of feature 𝑥ᵢ.
• Sample points 𝑥′ such that 𝑥ᵢ is constant while changing all other features.
• Calculate 𝑃(𝑓(𝑥′) = 𝑓(𝑥)) over all such 𝑥′.

(A minimal sketch of this sampling procedure follows.)
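A minimal sketch of this sampling procedure (the sampling width, number of samples and the toy classifier f, taken from the indicator-function example above, are illustrative assumptions; a neighbourhood of x stands in for the set of plausible feature values):

# Minimal sketch: estimate necessity and sufficiency of each feature by sampling.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: int(0.4 * x[0] + 0.1 * x[1] + 0.1 * x[2] >= 0.5)  # example model from the slide above
x = np.array([1.0, 1.0, 1.0])

def necessity(f, x, i, n=1000, width=1.0):
    """P(f changes | only feature i is perturbed)."""
    flips = 0
    for _ in range(n):
        x2 = x.copy()
        x2[i] = x[i] + rng.uniform(-width, width)   # change feature i, keep the rest
        flips += f(x2) != f(x)
    return flips / n

def sufficiency(f, x, i, n=1000, width=1.0):
    """P(f unchanged | feature i is kept, all others perturbed)."""
    same = 0
    for _ in range(n):
        x2 = x + rng.uniform(-width, width, size=x.shape)
        x2[i] = x[i]                                 # keep feature i fixed
        same += f(x2) == f(x)
    return same / n

for i in range(3):
    print(i, round(necessity(f, x, i), 2), round(sufficiency(f, x, i), 2))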
A more efficient approximate algorithm
Necessity: Given (𝑥, 𝑓(𝑥)),
• Find the smallest changes to the input 𝑥 that change the outcome.
• Necessity is proportional to the number of times a feature is changed
to lead to a different outcome.
Sufficiency: Given (𝑥, 𝑓(𝑥)),
• Find the smallest changes to the input 𝑥 that change the outcome,
without changing 𝑥𝑖.
• Sufficiency is inversely proportional to the number of times a valid
change is found.
Diverse counterfactual explanations

More generally, counterfactual explanations involve an optimization:

C(𝑥) = argmin_{𝑐₁,…,𝑐ₖ}  (1/𝑘) Σᵢ₌₁ᵏ yloss(𝑓(𝑐ᵢ), 𝑦)  +  (λ₁/𝑘) Σᵢ₌₁ᵏ dist(𝑐ᵢ, 𝑥)  −  λ₂ · dpp_diversity(𝑐₁, …, 𝑐ₖ)

• The first term is the loss to get the desirable outcome; the second is the loss to ensure proximity to the original input; the third provides diverse explanations.
• 𝑘 is the number of counterfactuals; λ₁ and λ₂ are loss-balancing hyperparameters.
• dpp_diversity = det(K), where K_{i,j} = 1 / (1 + dist(𝑐ᵢ, 𝑐ⱼ)).
Practical considerations

(Same optimization objective as above; a minimal sketch of the optimization follows below.)

• Choice of yloss: hinge loss.
• Separate categorical and continuous distance functions.
• Handle the relative scale of mixed features.
• Incorporate additional feasibility properties: (a) sparsity, (b) user constraints.

Python library: DiCE (Diverse Counterfactual Explanations): https://github.com/interpretml/DiCE
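A minimal sketch of the optimization above for a differentiable scorer and continuous features (this is not the DiCE implementation; the hinge threshold, L1 distance, learning rate and the small ridge added before the determinant are illustrative choices):

# Minimal sketch: gradient-based search for k diverse, proximate counterfactuals.
import torch

def generate_cfs(f, x, k=3, lam1=0.5, lam2=1.0, steps=500, lr=0.05):
    cfs = x.repeat(k, 1) + 0.1 * torch.randn(k, x.shape[-1])  # k candidate counterfactuals
    cfs.requires_grad_(True)
    opt = torch.optim.Adam([cfs], lr=lr)
    for _ in range(steps):
        scores = f(cfs).squeeze(-1)                      # want the desired (opposite) class
        yloss = torch.clamp(0.5 - scores, min=0).mean()  # hinge: push scores above 0.5
        dist = (cfs - x).abs().sum(dim=1).mean()         # proximity to the original input
        pair_d = (cfs.unsqueeze(0) - cfs.unsqueeze(1)).abs().sum(dim=-1)
        K = 1.0 / (1.0 + pair_d)
        diversity = torch.det(K + 1e-4 * torch.eye(k))   # DPP-style diversity term
        loss = yloss + lam1 * dist - lam2 * diversity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return cfs.detach()

# Toy usage: a fixed linear "model" standing in for the trained classifier.
w = torch.tensor([0.4, 0.1, 0.1])
f = lambda c: torch.sigmoid(c @ w - 0.5).unsqueeze(-1)
x = torch.tensor([[0.2, 0.3, 0.1]])
print(generate_cfs(f, x))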
Evaluating fairness is a causal
problem.
Fair Machine Learning
Common approach:
1. Get a big training dataset, different rows containing observed
outcomes for different feature values.
Fair Machine Learning

[Training data illustration: rows of features for different individuals, each with an observed outcome (e.g. loan paid back or not).]
(e.g. loan paid back or not)
Fair Machine Learning
Common approach:
1. Get a big training dataset, different rows containing observed
outcomes for different feature values.
2. Select an appropriate fairness metric (e.g. equal error rates).
3. Apply state-of-the-art algorithm on this dataset to train a classifier
with fairness constraints.
4. Deploy the trained classifier to make future decisions.
Fair Machine Learning

Unfortunately, this common approach, proposed and shown to work on benchmark datasets in fair machine learning papers, does not work in practice. There is no guarantee that the supposedly fair classifier will actually make fair decisions in the real world.

Reason: missingness in the training data.
Missingness in Training Data

[Diagram] A decision-maker in the past (e.g. a bank) saw two loan applicants:
• Loan Applicant 1, with features 𝑿₁: loan approved (𝐷₁ = 1). The true outcome 𝑌₁ was observed, so (𝑋₁, 𝑌₁) became part of the training data.
• Loan Applicant 2, with features 𝑿₂: loan denied (𝐷₂ = 0). The true outcome 𝑌₂ was never observed ("?"), so (𝑋₂, 𝑌₂) is missing from the training data.
Missingness in Training Data
• Training data, even if it contains objective ground truth outcome and
infinitely many samples, is one-sided due to systematic censoring by
past decisions.
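A tiny simulation sketch of this censoring (all distributions and names are illustrative assumptions): the labelled subset of the population, selected by the past decision D, has a systematically different outcome distribution from the population the deployed model will face.

# Minimal sketch: past decisions censor which outcomes are ever labelled.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                                    # applicant feature, e.g. an income score
y = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)    # true outcome: repays loan
d = (x + rng.normal(scale=0.5, size=n) > 0).astype(int)   # past decision: approve if x looks good

labelled = d == 1                                          # Y observed only for approved applicants
print("P(Y=1) in full population:        ", y.mean().round(3))
print("P(Y=1) in the labelled data:      ", y[labelled].mean().round(3))
print("share of population ever labelled:", labelled.mean().round(3))
# A classifier (fair or not) trained on the labelled rows sees a systematically
# different distribution from the one it will be deployed on.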
Empirical Implications

Difference in test and train fairness of fair ML algorithms under training data with missingness:

COMPAS dataset
• (Pleiss et al. 2017) with FPR constraints: Train FPRD −0.00155, Test FPRD 0.061
• (Pleiss et al. 2017) with FNR constraints: Train FNRD 0.0056, Test FNRD 0.099
• (Kamiran et al. 2012) with SP constraints: Train SPD 0.0229, Test SPD 0.2651
• (Kamiran and Calders 2012): Train EOD 0.0111, Test EOD −0.2266

ADULT dataset
• (Pleiss et al. 2017) with FPR constraints: Train FPRD −0.00724, Test FPRD 0.0725
• (Pleiss et al. 2017) with FNR constraints: Train FNRD 0.00295, Test FNRD 0.0377
• (Kamiran et al. 2012) with SP constraints: Train SPD −0.0390, Test SPD −0.1137
• (Kamiran and Calders 2012): Train EOD 0.0293, Test EOD −0.1327

The methods span postprocessing, inprocessing and preprocessing approaches. All implementations and respective hyper-parameter settings are taken from the examples provided in the IBM AI Fairness 360 library.

Kallus and Zhou (2018) made similar observations for Hardt et al. (2016)'s algorithm on the NYPD SQF dataset; similar observations hold for Celis et al. (2019)'s algorithm.
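For reference, a minimal sketch of how the group fairness gaps reported above can be computed from predictions, true outcomes and group membership (the function names and the sign convention "unprivileged minus privileged" are my own; the definitions are the standard ones):

# Minimal sketch: SPD, FPRD, EOD and FNRD as group differences in prediction rates.
import numpy as np

def rate(mask, values):
    return values[mask].mean() if mask.any() else np.nan

def fairness_gaps(y_true, y_pred, group):  # group: 1 = unprivileged, 0 = privileged
    gaps = {}
    conditions = {
        "SPD":  np.ones_like(y_true, dtype=bool),  # P(Yhat=1) difference
        "FPRD": y_true == 0,                       # P(Yhat=1 | Y=0) difference
        "EOD":  y_true == 1,                       # P(Yhat=1 | Y=1) difference
    }
    for name, cond in conditions.items():
        gaps[name] = rate(cond & (group == 1), y_pred) - rate(cond & (group == 0), y_pred)
    fnr = lambda g: 1 - rate((y_true == 1) & (group == g), y_pred)
    gaps["FNRD"] = fnr(1) - fnr(0)
    return gaps

# Toy usage; comparing these gaps on (censored) training data and on fresh test data
# is what the table above reports: large train/test differences mean the fairness
# guarantee did not transfer.
y_true = np.array([1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0])
group = np.array([1, 1, 1, 0, 0, 0])
print(fairness_gaps(y_true, y_pred, group))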
Causal Graphs for Data Missingness
(Karthika Mohan and Judea Pearl, 2019)

𝑅_C: missingness mechanism variable for variable 𝐶.
𝐶* = 𝐶 if 𝑅_C = OFF; 𝐶* = missing if 𝑅_C = ON.

[Three causal graphs illustrate the MCAR, MAR and MNAR mechanisms.]

Example dataset:
# | A  | B  | C  | R_C
1 | A1 | B1 | C1 | OFF
2 | A2 | B2 | C2 | OFF
3 | A3 | B3 |    | ON
4 | A4 | B4 |    | ON
5 | A5 | B5 | C5 | OFF
6 | A5 | B6 |    | ON
7 | A6 | B7 | C7 | OFF
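A minimal sketch of the R_C mechanism and of how MCAR, MAR and MNAR differ in what R_C is allowed to depend on (the logistic forms and probabilities here are illustrative assumptions):

# Minimal sketch: C* equals C when R_C is OFF and is missing when R_C is ON.
import numpy as np

rng = np.random.default_rng(0)
n = 10
A = rng.normal(size=n)                 # fully observed covariate
C = rng.normal(size=n)                 # variable subject to missingness

def apply_missingness(C, R_on):
    C_star = C.copy()
    C_star[R_on] = np.nan              # C* = missing where R_C = ON
    return C_star

R_mcar = rng.random(n) < 0.3                     # MCAR: R_C independent of everything
R_mar  = 1 / (1 + np.exp(-A)) > rng.random(n)    # MAR: R_C depends only on observed A
R_mnar = 1 / (1 + np.exp(-C)) > rng.random(n)    # MNAR: R_C depends on C itself

for name, R in [("MCAR", R_mcar), ("MAR", R_mar), ("MNAR", R_mnar)]:
    print(name, int(np.isnan(apply_missingness(C, R)).sum()), "of", n, "values missing")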
Estimating the Equality of Opportunity fairness constraint with incomplete data

What fairness algorithms actually estimate from incomplete data:
P(Ŷ* | Y*, Z*) = P(Ŷ | Y, Z, D = 1)
≠ P(Ŷ | Y, Z), the quantity we need,
since Ŷ is not d-separated from D given Y, Z (d-separation, Pearl 1988).
More general results

Fairness algorithms (demographic parity, equality of opportunity) require quantities such as P(Y | X), P(Y | X, Z), and/or P(X), P(X, Z).
Missingness mechanisms considered: fully-automated, human, and machine-aided decision making.

• Missingness caused by human (or machine-aided) decision making is more challenging than that caused by fully automated decision making.
• Recovering the joint distribution of features is impossible in almost all cases of missingness caused by past decisions. Conditional distributions (risk scores) may be recoverable in some cases, depending on the causal graph.

Importance of Modeling Data Missingness in Algorithmic Fairness: A Causal Perspective. AAAI 2021
[Closing diagram] Causal inference and machine learning inform each other: better estimation and refutation of causal effects (github.com/microsoft/dowhy); better generalization and robustness of ML models; a principled framework for fairness and explanation.

• Amit Sharma
• Microsoft Research India
• @amt_shrma
• www.amitsharma.in
Q&A
Questions can be submitted via the slido app using code #14560.
You can also access slido via the link in the chat box.
Closing remarks
Dr Arthur Turrell
Deputy Director, Data Science Campus
Office for National Statistics
Slido#14560
Dates for your diary
9 December 2021 – Health Data Science Seminar, Guest speaker Tariq Khokhar (Head of data
science and health, Wellcome Trust) who will be presenting on “Using data
science to solve health challenges: what we've learned and where we're going”
13 December 2021 – ONS Economic Forum, presentations to include:
• Disaggregating annual subnational gross value added (GVA)
to lower levels of geography
• Understanding Towns in England and Wales: an industry analysis
Further details on the above events can be found at: ons.gov.uk/economicevents
Thank you for attending the
Economic Data Science Seminar Series
You can keep up to date on all upcoming events
via ons.gov.uk/economicevents
If you would like to ask a question or provide any feedback, please do
so via economic.engagement@ons.gov.uk
Editor's Notes

  • #7 How causal inference can be useful for ML
  • #8 To put this in perspective, compare prediction and causation. Given data about X and Y, prediction can use all the large-scale data and do well most of the time (e.g. the ice cream example). Causation is dy/dx: without knowing the function f or anything about u, it is not clear how big data helps. Put differently, with many variables X prediction is fine, but finding the effect of a single variable is hard; a correlation between x and y can be used for prediction, but what if x is changed? While we have become good at processing terabytes of data, causal inference methods have not caught up: the most interesting questions are often causal (does X cause Y?), and without knowing the data-generating process or possible confounders, having more data does not help. Make clear: this is observational data.
  • #11 In this
  • #12 The problem with spurious patterns is that they are not robust: tweak the environment and those spurious patterns change. This is why conventional supervised ML must assume that the training data is representative of the final deployed environment.
  • #14 Same note as #8: the prediction-versus-causation comparison applies to this slide as well.
  • #18 It turns out when there are interventions on any of these features (e.g., the relationship between Age and blood pressure changes in a new sample), then the only relationship that stays invariant is between Y and its parents. And this is the key result that causal learning depends on, leading to our first result.
  • #19, #20 Same note as #18.
  • #32 As ML is increasingly used in societally critical domains, there are lots of techniques to explain Machine predictions. Most of these techniques are feature-importance based methods where they provide an importance score for features depending upon their contribution to a prediction.
  • #33 However, these feature importance-based methods do not tell a decision subject what they should do in order to get a desired outcome, which is the most important thing a user would like to know if they get an undesired outcome. Consider a person who applies for a loan and gets rejected by the loan distribution algorithm of a financial company. In that case, feature importance-based methods would show the most important feature that contributed to the rejection, say the annual income, but they do not say what the person could have done with their annual income to get the loan. That is why a different class of explanations, counterfactual (CF) explanations, has been proposed: they provide "what-if" scenarios and tell the person that they would have got the loan if they had different values on certain features. For instance, a CF example would be "You would have got the loan if your …".
  • #40 So we propose a general optimization framework to generate actionable CFs, and the different parts of our objective function take care of different desirable properties. To generate 'k' diverse CFs, we use a determinantal point process (DPP) based method. Further, as we can see, we generate all 'k' counterfactuals at once.
  • #41 In addition to the loss, we have also addressed several practical problems that need to be considered while generating CFs, such as the choice of yloss and the different scales of features. While these considerations might seem trivial from a technical perspective, they are important for supporting user interaction with counterfactuals. Further, we have open-sourced a Python library on GitHub that takes these practical concerns into consideration while generating diverse CFs for any dataset. We call this library DiCE, which stands for…