This document summarizes a discussion between Susan Athey and Guido Imbens on the relationship between machine learning and causal inference. It notes that while machine learning excels at prediction problems using large datasets, it has weaknesses when it comes to causal questions. Econometrics and statistics literature focuses more on formal theories of causality. The document proposes combining the strengths of both fields by developing machine learning methods that can estimate causal effects, accounting for issues like endogeneity and treatment effect heterogeneity. It outlines some open problems and directions for future research at the intersection of these fields.
An introduction to causal graphical models with examples of causality in practice from different fields of science. More focused discussion of causal inference in online ads and recommender systems.
DoWhy: An end-to-end library for causal inference (Amit Sharma)
In addition to efficient statistical estimators of a treatment's effect, successful application of causal inference requires specifying assumptions about the mechanisms underlying observed data and testing whether they are valid, and to what extent. However, most libraries for causal inference focus only on the task of providing powerful statistical estimators. We describe DoWhy, an open-source Python library that treats causal assumptions as first-class citizens, based on the formal framework of causal graphs to specify and test causal assumptions. DoWhy presents an API for the four steps common to any causal analysis---1) modeling the data using a causal graph and structural assumptions, 2) identifying whether the desired effect is estimable under the causal model, 3) estimating the effect using statistical estimators, and finally 4) refuting the obtained estimate through robustness checks and sensitivity analyses. In particular, DoWhy implements a number of robustness checks including placebo tests, bootstrap tests, and tests for unobserved confounding. DoWhy is an extensible library that supports interoperability with other implementations, such as EconML and CausalML for the estimation step.
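The four steps are easiest to see in miniature. DoWhy automates identification and estimation; the sketch below is not DoWhy's API but a stdlib-only illustration of what the estimation step (3) does under a backdoor assumption, on simulated data with a single hypothetical confounder z:

```python
import random

random.seed(0)

# Simulate confounded data: z confounds both treatment t and outcome y.
# True treatment effect is 2.0; z adds 3.0 to y and makes treatment more likely.
n = 50_000
rows = []
for _ in range(n):
    z = random.random() < 0.5
    t = random.random() < (0.8 if z else 0.2)
    y = 2.0 * t + 3.0 * z + random.gauss(0, 1)
    rows.append((z, t, y))

def mean(xs):
    return sum(xs) / len(xs)

# Naive contrast E[y|t=1] - E[y|t=0] is biased by the confounder.
naive = mean([y for z, t, y in rows if t]) - mean([y for z, t, y in rows if not t])

# Backdoor adjustment: contrast within each stratum of z, then average over P(z).
adjusted = 0.0
for zval in (True, False):
    stratum = [(t, y) for z, t, y in rows if z == zval]
    effect = (mean([y for t, y in stratum if t])
              - mean([y for t, y in stratum if not t]))
    adjusted += effect * len(stratum) / n

print(naive, adjusted)  # naive is inflated; adjusted recovers roughly 2.0
```

DoWhy wraps this kind of adjustment (and far more sophisticated estimators) behind its step-3 interface, with steps 1, 2, and 4 making the assumption above explicit and testable.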
Predictive modeling has led to big successes in making inferences from data. Such models are used extensively, including in systems for recommending items, optimizing content, delivering ads, matching applicants to jobs, identifying health risks and so on. However, predictive models are not well-equipped to answer questions about cause and effect, which form the basis of many practical decision-making scenarios. For example, if a recommendation system is changed or removed, what will be the effect on total customer activity? Which strategy leads to a higher engagement with a product? How can we learn generalizable insights about users from biased data (e.g. that of opt-in users)? Through practical examples, I will show the value of counterfactual reasoning and causal inference for such scenarios, by demonstrating that relying on predictive modeling based on correlations can be counterproductive. I will then present an overview of experimental and observational causal inference methods, that can better inform decision-making through data, and also lead to more robust and generalizable prediction models.
Abstract: This PDSG workshop introduces basic concepts of ensemble methods in machine learning. Concepts covered include the Condorcet Jury Theorem, weak learners, decision stumps, bagging, and majority voting.
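The Condorcet Jury Theorem, which motivates majority voting over weak learners, can be checked directly: if each of n independent voters is correct with probability p > 0.5, the accuracy of the majority grows with n. A small sketch (p = 0.6 is an arbitrary illustrative choice):

```python
from math import comb

def majority_accuracy(n, p):
    """Probability that a majority of n independent voters, each correct
    with probability p, reaches the correct decision (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

p = 0.6  # each weak learner is only slightly better than chance
acc_1, acc_11, acc_101 = (majority_accuracy(n, p) for n in (1, 11, 101))
print(acc_1, acc_11, acc_101)  # accuracy grows with ensemble size
```

This is the idealized case (independent, equally skilled voters); bagging exists precisely to make correlated learners behave more like independent ones.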
Level: Fundamental
Requirements: No prior programming or statistics knowledge required.
Presentation at Advanced Intelligent Systems for Sustainable Development (AISSD 2021), 20-22 August 2021, organized by the Scientific Research Group in Egypt in collaboration with the Faculty of Computers and AI, Cairo University, and the Chinese University in Egypt.
Interpreting deep learning and machine learning models is not just another regulatory burden to be overcome. Scientists, physicians, researchers, and analysts who use these technologies for their important work have the right to trust and understand their models and the answers they generate. This talk is an overview of several techniques for interpreting deep learning and machine learning models and telling stories from their results.
Speaker: Patrick Hall is a Data Scientist and Product Engineer at H2O.ai. He’s also an Adjunct Professor in the Department of Decision Sciences at George Washington University. Prior to joining H2O, Patrick spent many years as a Senior Data Scientist at SAS and has worked with many Fortune 500 companies on their data science and machine learning problems. https://www.linkedin.com/in/jpatrickhall
These slides are for a tutorial on how to use the R language for data analysis and machine learning tasks.
The workshop was given at OSCON (Austin, TX), 2017
Unified Approach to Interpret Machine Learning Models: SHAP + LIME (Databricks)
For companies that solve real-world problems and generate revenue from data science products, being able to understand why a model makes a certain prediction can be as crucial as achieving high prediction accuracy in many applications. However, as data scientists pursue higher accuracy with complex algorithms such as ensembles or deep learning models, the algorithm itself becomes a black box, creating a trade-off between the accuracy and interpretability of a model’s output.
To address this problem, a unified framework SHAP (SHapley Additive exPlanations) was developed to help users interpret the predictions of complex models. In this session, we will talk about how to apply SHAP to various modeling approaches (GLM, XGBoost, CNN) to explain how each feature contributes and extract intuitive insights from a particular prediction. This talk is intended to introduce the concept of general purpose model explainer, as well as help practitioners understand SHAP and its applications.
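The idea behind SHAP can be seen without the library itself: Shapley values average each feature's marginal contribution over all feature orderings, and the attributions sum to the difference between the model's output and a baseline (the "additive" in SHapley Additive exPlanations). A brute-force sketch on a made-up three-feature linear model (weights, baseline, and input are all hypothetical):

```python
from itertools import permutations

# A toy model: linear in three features (weights are made up for illustration).
weights = [1.0, -2.0, 0.5]
baseline = [0.0, 0.0, 0.0]   # reference input (e.g. feature means)
x = [3.0, 1.0, 4.0]          # the instance we want to explain

def f(inp):
    return sum(w * v for w, v in zip(weights, inp))

def shapley_values(x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over all orderings; features not yet added stay at the baseline."""
    m = len(x)
    phi = [0.0] * m
    perms = list(permutations(range(m)))
    for order in perms:
        current = list(baseline)
        for i in order:
            before = f(current)
            current[i] = x[i]
            phi[i] += f(current) - before
    return [p / len(perms) for p in phi]

phi = shapley_values(x, baseline)
# Efficiency property: attributions sum to f(x) - f(baseline).
print(phi, sum(phi), f(x) - f(baseline))
```

For a linear model each attribution reduces to w_i * (x_i - baseline_i); the SHAP library's contribution is computing these values tractably for models where brute force over orderings is impossible.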
Multiclass classification of imbalanced data (SaurabhWani6)
PyData talk on classification of imbalanced data.
It is an overview of concepts for better classification in imbalanced datasets.
Resampling techniques are introduced along with bagging and boosting methods.
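As a minimal sketch of one such resampling technique, random oversampling duplicates minority-class samples (with replacement) until the classes are balanced. The dataset below is synthetic:

```python
import random
from collections import Counter

random.seed(42)

# Toy imbalanced dataset: (features, label) pairs with a 90/10 class split.
data = [((i, i + 1), "majority") for i in range(90)] + \
       [((i, i - 1), "minority") for i in range(10)]

def random_oversample(data):
    """Duplicate minority-class samples (with replacement) until every
    class matches the size of the largest class."""
    by_class = {}
    for features, label in data:
        by_class.setdefault(label, []).append((features, label))
    target = max(len(v) for v in by_class.values())
    out = []
    for samples in by_class.values():
        out.extend(samples)
        out.extend(random.choices(samples, k=target - len(samples)))
    return out

balanced = random_oversample(data)
print(Counter(label for _, label in balanced))  # both classes now have 90 samples
```

Oversampling is the simplest option; undersampling, SMOTE-style synthesis, and the bagging/boosting variants the talk covers trade off information loss against overfitting to duplicated points.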
DoWhy, a Python library for causal inference: An end-to-end tool (Amit Sharma)
As computing systems are more frequently and more actively intervening in societally critical domains such as healthcare, education, and governance, it is critical to correctly predict and understand the causal effects of these interventions. Without an A/B test, conventional machine learning methods, built on pattern recognition and correlational analyses, are insufficient for causal reasoning.
Much like machine learning libraries have done for prediction, "DoWhy" is a Python library that aims to spark causal thinking and analysis. DoWhy provides a unified interface for causal inference methods and automatically tests many assumptions, thus making inference accessible to non-experts.
For a quick introduction to causal inference, check out amit-sharma/causal-inference-tutorial. We also gave a more comprehensive tutorial at the ACM Knowledge Discovery and Data Mining (KDD 2018) conference: causalinference.gitlab.io/kdd-tutorial.
Unsupervised Anomaly Detection with Isolation Forest - Elena Sharova (PyData)
PyData London 2018
This talk will focus on the importance of correctly defining an anomaly when conducting anomaly detection using unsupervised machine learning. It will include a review of the Isolation Forest algorithm (Liu et al., 2008) and a demonstration of how this algorithm can be applied to transaction monitoring, specifically to detect money laundering.
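The core intuition of Isolation Forest is that anomalies are isolated by random splits after far fewer partitions than normal points. A deliberately simplified 1-D sketch of that intuition, not the full algorithm from the paper (no subsampling, no score normalization):

```python
import random

random.seed(7)

def isolation_depth(point, sample, depth=0, max_depth=12):
    """Depth at which `point` becomes isolated by random splits (1-D case).
    Anomalies tend to be isolated after fewer splits."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(sample), max(sample)
    if lo == hi:
        return depth
    split = random.uniform(lo, hi)
    # Recurse into the side of the split that contains `point`.
    side = [v for v in sample if (v < split) == (point < split)]
    return isolation_depth(point, side, depth + 1, max_depth)

inliers = [random.random() for _ in range(100)]  # clustered in [0, 1]
outlier = 10.0                                   # far from the cluster
data = inliers + [outlier]

def avg_depth(point, trials=200):
    return sum(isolation_depth(point, data) for _ in range(trials)) / trials

print(avg_depth(outlier), avg_depth(inliers[0]))  # outlier isolates much faster
```

The real algorithm builds an ensemble of such trees on random subsamples and converts the average path length into a normalized anomaly score.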
---
www.pydata.org
PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
Basic time series concepts and the ARIMA family of models. There is an associated video session along with code on GitHub: https://github.com/bhaskatripathi/timeseries-autoregressive-models
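As a sketch of the simplest member of the ARIMA family, an AR(1) model can be fit by ordinary least squares on the lagged series. The data below is simulated with a hypothetical coefficient of 0.7:

```python
import random

random.seed(1)

# Simulate an AR(1) process: x_t = phi * x_{t-1} + eps_t, with phi = 0.7.
phi_true = 0.7
x = [0.0]
for _ in range(2000):
    x.append(phi_true * x[-1] + random.gauss(0, 1))

# Least-squares estimate of phi: regress x_t on x_{t-1} (no intercept).
num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
phi_hat = num / den
print(phi_hat)  # close to the true 0.7
```

Real ARIMA fitting adds moving-average terms and differencing, typically via maximum likelihood in a library such as statsmodels, but the AR part is exactly this regression.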
https://drive.google.com/file/d/1yXffXQlL6i4ufQLSpFFrJgymhHNXL1Mf/view?usp=sharing
In this presentation, we provide a quick intro to Bayesian inference and Gaussian Processes, and then relate them to the latest state-of-the-art research on Bayesian Deep Learning, in order to include uncertainty in deep neural network predictions.
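As a minimal illustration of the Bayesian inference the deck starts from, a conjugate Beta-Bernoulli update gives the posterior in closed form. The counts below are made up:

```python
# Conjugate Bayesian update for a coin's bias: Beta prior + Bernoulli likelihood.
# Hypothetical data: 7 successes and 3 failures under a uniform Beta(1, 1) prior.
prior_a, prior_b = 1.0, 1.0
successes, failures = 7, 3

post_a = prior_a + successes   # Beta(8, 4) posterior
post_b = prior_b + failures
post_mean = post_a / (post_a + post_b)
post_var = (post_a * post_b) / ((post_a + post_b) ** 2 * (post_a + post_b + 1))
print(post_mean, post_var)  # the variance is the uncertainty estimate
```

Gaussian Processes and Bayesian deep learning generalize this same move (a full posterior rather than a point estimate) to function-valued and weight-valued unknowns, where closed forms mostly disappear and approximation takes over.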
Part 1
- Introduction
- Application for Anomaly Detection
- AIOps
- GraphDB
Part 2
- Types of Anomaly Detection
- How to Identify Outliers in your Data
Part 3
- Anomaly Detection Techniques for Time Series
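One simple way to identify outliers in your data, in the spirit of Part 2, is Tukey's IQR rule, sketched here on a synthetic series with a single spike:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

series = [10, 11, 9, 10, 12, 11, 10, 95, 10, 11]  # one spike in a flat series
print(iqr_outliers(series))  # flags only the spike, 95
```

For time series specifically, the same rule is usually applied to residuals after removing trend and seasonality, since a raw global threshold confuses trend with anomaly.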
A presentation about the development of the ideas from the autoencoder to the Stable Diffusion text-to-image model.
Models covered: autoencoder, VAE, VQ-VAE, VQ-GAN, latent diffusion, and stable diffusion.
Slides explaining the distinction between bagging and boosting through the bias-variance trade-off, followed by some lesser-known aspects of supervised learning: how the tree-split metric affects feature importance, how the decision threshold affects classification accuracy, and how to adjust the model threshold for classification.
Note: the limitations of the accuracy metric (baseline accuracy), alternative metrics, and their use cases, advantages, and limitations are briefly discussed.
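The threshold adjustment described above can be sketched with hypothetical classifier scores: lowering the threshold raises recall at the cost of precision, which is exactly why accuracy alone is a poor guide on imbalanced data.

```python
# Toy scores on an imbalanced problem: positives are rare, so the default
# 0.5 threshold misses one; lowering it trades precision for recall.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.60]

def precision_recall_at(threshold):
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)
    return precision, recall

print(precision_recall_at(0.5))   # (1.0, 0.5): precise, but misses a positive
print(precision_recall_at(0.35))  # (0.5, 1.0): catches both, at lower precision
```

The right operating point depends on the relative cost of false positives versus false negatives, which is a business decision, not a modeling one.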
Theory and Practice of Integrating Machine Learning and Conventional Statistics (University of Malaya)
The practice of medical decision making is changing rapidly with the development of innovative computing technologies. The growing interest in data analysis, in line with advances in data science, raises the question of whether machine learning can be integrated with conventional statistics in health research. To help address this knowledge gap, this talk focuses on the conceptual integration between conventional statistics and machine learning, with a direction towards health research. The similarities and differences between the two are compared using mathematical concepts and algorithms. The comparison indicates that conventional statistics are the fundamental basis of machine learning: the black-box algorithms are derived from basic mathematics, but are advanced in terms of automated analysis, handling big data, and providing interactive visualizations. While the nature of these methods is different, they are conceptually similar. The evidence produced here concludes that conventional statistics and machine learning are best integrated to develop automated data analysis tools. Health researchers may explore machine learning as a potential tool to enhance conventional statistics in data analytics, for added reliable validation measures.
Inferential Analysis
Chapter 20
NUR 6812 Nursing Research
Florida National University
Introduction - Inferential Analysis
We will discuss analysis of variance and regression, which are technically part of the same family of statistics known as the general linear model but are used to achieve different analytical goals.
ANALYSIS OF VARIANCE
Analysis of variance (ANOVA) is used so often that Iversen and Norpoth (1987) said they once had a student who thought this was the name of an Italian statistician.
You can think of analysis of variance as a whole family of procedures beginning with the simple and frequently used t-test and becoming quite complicated with the use of multiple dependent variables (MANOVA, to be explained later in this chapter) and covariates.
Although the simpler varieties of these statistics can actually be calculated by hand, it is assumed that you will use a statistical software package for your calculations.
If you want to see how these calculations are done, you could try to compute a correlation, chi-square, t-test, or ANOVA yourself (see Yuker, 1958; Field, 2009), but in general it is too time consuming and too subject to human error to do these by hand.
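For a sense of what the software is doing, the pooled two-sample t statistic can be reproduced by hand on a pair of small made-up samples:

```python
import statistics
from math import sqrt

# Two small illustrative samples (made-up numbers).
a = [1, 2, 3, 4, 5]
b = [3, 4, 5, 6, 7]

# Pooled-variance two-sample t statistic.
na, nb = len(a), len(b)
pooled_var = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
t = (statistics.mean(a) - statistics.mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
print(t)  # -2.0
```

A statistical package then compares this value against the t distribution with na + nb - 2 degrees of freedom to produce the P value; ANOVA generalizes the same comparison-of-means logic to three or more groups via an F statistic.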
IMPORTANT TERMINOLOGY
Several terms are used in these analyses that you need to be familiar with to understand the analyses themselves and the results. Many will already be familiar to you.
Statistical significance: This indicates the probability that the differences found are a result of error, not the treatment. Stated in terms of the P value, the convention is to accept either a 1% (P ≤ 0.01), or 1 out of 100, or 5% (P ≤ 0.05), or 5 out of 100, possibility that any differences seen could have been due to error (Cortina & Dunlap, 2007).
Research hypothesis: A research hypothesis is a declarative statement of the expected relationship between the dependent and independent variable(s).
Null hypothesis: The null hypothesis, based on the research hypothesis, states that the predicted relationships will not be found or that those found could have occurred by chance, meaning the difference will not be statistically significant.
Effect size: This is defined by Cortina and Dunlap as “the amount of variance in one variable accounted for by another in the sample at hand” (2007, p. 231). Effect size estimates are helpful adjuncts to significance testing. An important limitation, however, is that they are heavily influenced by the type of treatment or manipulation that occurred and the measures that are used.
Confidence intervals: Although sometimes suggested as an adjunct or replacement for the significance level, confidence intervals are determined in part by the alpha (significance level) (Cortina & Dunlap, 2007). Likened to a margin of error, the confidence intervals indicate the range within which the true difference between means may lie. A narrow confidence interval implies high precision; we can specify believable values within a narrow range ...
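A confidence interval for a mean can be sketched in a few lines using the normal approximation. The sample below is synthetic, and for small n a t critical value should replace the 1.96:

```python
import statistics
from math import sqrt

# 95% confidence interval for a mean, normal approximation (z = 1.96).
sample = [98, 102, 101, 97, 103, 99, 100, 100, 96, 104]
m = statistics.mean(sample)
se = statistics.stdev(sample) / sqrt(len(sample))  # standard error of the mean
ci = (m - 1.96 * se, m + 1.96 * se)
print(m, ci)
```

The width of the interval shrinks with the square root of the sample size, which is why a narrow interval signals the high precision described above.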
Quantitative Data Analysis: Statistics
Introduction
Understanding the use of basic statistical strategies is part of being a critical consumer of published research literature. Unless they plan to conduct research themselves, it is not as important for counselors to understand the mathematical calculations of the statistical techniques as it is to be able to recognize the names of the common ones and what kind of information they provide. There are several commercially available software packages for analyzing quantitative data, one of which is described in detail in Chapter 14 of Counseling Research: Quantitative, Qualitative, and Mixed Methods.
Descriptive and Inferential Statistics
In quantitative studies, statistical techniques are used for data analysis. The two main categories of statistics are descriptive and inferential. Descriptive statistics are used to summarize the data. Some common descriptive statistics are the measures of central tendency: the mean, median, and mode. They provide information about where the middle is in a distribution of scores. On the normal distribution, the mean, median, and mode are the same. Distributions are said to be skewed when extreme scores draw the mean away from the middle of the distribution. Measures of variability, such as the range, variance, and standard deviation, provide information about how widely a distribution of scores is dispersed (Erford, 2015, p. 250). The standard deviation is a measure of how the scores cluster around the mean. The greater the standard deviation, the greater the spread of scores.
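These descriptive measures can be computed directly; a quick sketch on a small made-up score list:

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # a small illustrative distribution

print(statistics.mean(scores))    # 5.0  (balance point of the distribution)
print(statistics.median(scores))  # 4.5  (middle value)
print(statistics.mode(scores))    # 4    (most frequent value)
print(statistics.pstdev(scores))  # 2.0  (population standard deviation)
```

Note that the mean (5.0) sits above the median (4.5) here because the high score of 9 pulls it upward, a miniature example of the skew described above.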
Inferential statistics are used to make inferences from the sample to the population. All inferential statistical procedures are based on probability theory. They are used to test hypotheses. Three commonly used inferential statistics are chi square, t-test, and analysis of variance (ANOVA). Chi square is used with nominal data to determine if the observed frequency differs significantly from the expected frequency. A t-test is used to determine whether there is a statistically significant difference between the means of two groups. ANOVA is used to determine whether there is a statistically significant difference between the means of three or more groups.
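The chi-square statistic can be computed by hand from observed and expected frequencies; a sketch with hypothetical 2x2 counts:

```python
# Chi-square statistic for a 2x2 contingency table (hypothetical counts).
observed = [[10, 20],
            [20, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count for each cell is (row total * column total) / grand total.
chi_square = sum(
    (observed[i][j] - e) ** 2 / e
    for i in range(2)
    for j in range(2)
    for e in [row_totals[i] * col_totals[j] / total]
)
print(chi_square)  # about 6.67
```

The statistic is then compared against the chi-square distribution with (rows - 1) * (columns - 1) degrees of freedom, here 1, to decide whether the observed-expected gap is statistically significant.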
Statistical Significance
When a quantitative study tests a hypothesis, it is technically the null hypothesis being tested. The null hypothesis says there is no difference between the groups, or relationship between the variables (depending on the research design). If the statistical procedure indicates there is statistical significance, the null hypothesis is rejected, meaning that the probability is high that there really is a group difference or strong relationship between the variables.
Rejecting the null hypothesis is not equivalent to proving the research or alternative hypothesis. Researchers can embrace the research hypothesis as one plausible explanation, but because only .
ABSTRACT: This paper critically examined a broad view of the Structural Equation Model (SEM) with a view to pointing out directions for how researchers can employ this model in future research, with specific focus on several traditional multivariate procedures like factor analysis, discriminant analysis, and path analysis. This study employed a descriptive survey and historical research design. Data was computed via descriptive statistics, correlation coefficients, and reliability measures. The study concluded that novice researchers must take care of the assumptions and concepts of Structural Equation Modeling while building a model to check the proposed hypothesis. SEM is more or less an evolving technique in research, which is expanding to new fields. Moreover, it is providing new insights to researchers for conducting longitudinal investigations.
3. Supervised Machine Learning v. Econometrics/Statistics Lit. on Causality
Supervised ML
Well-developed and widely used nonparametric prediction methods that work well with big data
Used in technology companies, computer science, statistics, genomics, neuroscience, etc.
Rapidly growing in influence
Cross-validation for model selection
Focus on prediction and applications of prediction
Weaknesses
Causality (with notable exceptions, e.g. Pearl, but not much on data analysis)
Econometrics/Soc Sci/Statistics
Formal theory of causality
Potential outcomes method (Rubin) maps onto economic approaches
"Structural models" that predict what happens when the world changes
Used for auctions, anti-trust (e.g. mergers) and business decision-making (e.g. pricing)
Well-developed and widely used tools for estimation and inference of causal effects in experimental and observational studies
Used by social science, policy-makers, development organizations, medicine, business, experimentation
Weaknesses
Non-parametric approaches fail with many covariates
4. Lessons for Economists
Engineering approach
Methods that scale
Asymptotic normality of estimates or predictions for hypothesis testing not an important goal
Lots of incremental improvements in algorithms, judged by performance at prediction
Formal theory and perfect answers not required: "it works"
More systematic in key respects
Cross-validation for model selection
Low-hanging fruit
Model selection/variable selection for exogenous covariates, prediction component of model
Heterogeneity
Heterogeneous treatment effects/elasticities
Personalized recommendations based on estimates
Some specific areas
Recommendation systems
Topic modeling
Text analysis/classifiers
6. A Research Agenda on Causal Inference
Problems
Many problems in the social sciences entail a combination of prediction and causal inference
Existing ML approaches to estimation, model selection and robustness do not directly apply to the problem of estimating causal parameters
Inference more challenging for some ML methods
Proposals
Formally model the distinction between causal and predictive parts of the model and treat them differently for both estimation and inference
Abadie, Athey, Imbens and Wooldridge (2014, under review; also work in progress)
Develop new estimation methods that combine ML approaches for the prediction component of models with causal approaches
Athey-Imbens (2015, work in progress)
Develop new approaches to cross-validation optimized for causal inference and optimal policy estimation
Athey-Imbens (2015, work in progress)
Develop robustness measures for causal parameters inspired by ML
Athey-Imbens (AER P&P 2015; work in progress)
Develop methods for causal inference for network analysis drawing on CS tools for networks
Athey-Eckles-Imbens (2015)
Large scale structural models with latent variables
Athey-Nekipelov (2012, 2015); Athey, Blei, Hofman,
7. Model for Causal Inference
For causal questions, we wish to know what would happen if a policy-maker changes a policy
Potential outcomes notation:
Yi(w) is the outcome unit i would have if assigned treatment w
For a binary treatment, the treatment effect is τi = Yi(1) − Yi(0)
Administer a drug, change minimum wage law, raise a price
Function of interest: mapping from alternative counterfactual policies to outcomes
Holland: Fundamental Problem of Causal Inference
We do not see the same units at the same time under alternative counterfactual policies
Units of study typically have fixed attributes xi
These would not change with alternative policies
E.g. we don't contemplate moving coastal states inland when we change minimum wage policy
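The fundamental problem can be made concrete with a few lines of simulation (a hypothetical sketch: the effect size and all variable names are illustrative, not from the slides). Both potential outcomes exist for every unit, but the data reveal only one of them, so τi is never directly observed; under random assignment, the difference in means still recovers E[τi]:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
tau = 2.0                                  # true (constant) treatment effect

# Both potential outcomes exist for every unit ...
y0 = rng.normal(size=n)
y1 = y0 + tau

# ... but with random assignment we observe only one of them per unit.
w = rng.integers(0, 2, size=n)
y_obs = np.where(w == 1, y1, y0)

# tau_i = Y_i(1) - Y_i(0) is never computable from (y_obs, w) alone,
# yet under randomization the difference in means recovers E[tau_i].
ate_hat = y_obs[w == 1].mean() - y_obs[w == 0].mean()
```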
9. When is Prediction Primary Focus?
Economics: "allocation of scarce resources"
An allocation is a decision.
Generally, optimizing decisions requires knowing the counterfactual payoffs from alternative decisions.
Hence: intense focus on causal inference in applied economics
Examples where prediction plays the dominant role in causal inference
Decision is obvious given an unknown state
Many decisions hinge on a prediction of a future state
Prediction dominant for a component of causal inference
Propensity score estimation
First stage of IV/2SLS
Predicting the baseline in difference-in-differences settings
Predicting the baseline in time series settings
10. Prediction and Decision-Making: Predicting a State Variable
Kleinberg, Ludwig, Mullainathan, and Obermeyer (2015)
Motivating examples:
Will it rain? (Should I take an umbrella?)
Which teacher is best? (Hiring, promotion)
Unemployment spell length? (Savings)
Risk of violation of regulation (Health inspections)
Riskiest youth (Targeting interventions)
Creditworthiness (Granting loans)
Empirical applications:
Will defendant show up for court? (Should we grant bail?)
Will patient die within the year? (Should we replace joints?)
A formal model
Payoff Yi is, for all i, a known function of policy (Wi) and state of the world (S): Yi = π(Wi, S)
State of the world may depend on policy choice
Then, the impact of changing policy is
dYi/dWi = ∂π/∂Wi (Wi, S) + ∂π/∂S (Wi, S) · ∂S/∂Wi
Paper refers to the second term as the "causal component"
Argue that taking an umbrella doesn't affect rain, so the main problem is predicting rain
But in general ∂π/∂Wi is unknown/heterogeneous, as is π; one can also think of that as the causal effect
But the idea still carries over if knowing S tells you the sign of ∂π/∂Wi
11. Application: Joint Replacements
Methods: regularized logistic regression, choosing the penalty parameter for the number of covariates using 10-fold cross-validation
Data: 65K Medicare patients; 3,305 variables and 51 state dummies
Columns (3) and (4) show the results of a simulation exercise: we identify a population of eligibles (using published Medicare guidelines: those who had multiple visits to physicians for osteoarthritis and multiple claims for physical therapy or therapeutic joint injections) who did not receive replacement and assign them a predicted risk. We then substitute the high-risk surgeries in each row with patients from this eligible distribution for replacement, starting at median predicted risk. Column (3) counts the futile procedures averted (i.e., replaced
12. Using ML for Propensity Scores
Propensity score: Pr(Wi = wi | Xi = xi) = p_w(x)
Propensity score weighting and matching is common in the treatment effects literature
"Selection on observables" assumption: Yi(w) ⊥ Wi | Xi
See Imbens and Rubin (2015) for an extensive review
Propensity score estimation is a pure prediction problem
Machine learning literature applies propensity score weighting: e.g. Beygelzimer and Langford (2009), Dudík, Langford and Li (2011)
Properties or tradeoffs in selection among ML approaches
Estimated propensity scores work better than the true propensity score (Hirano, Imbens and Ridder (2003)), so optimizing for out-of-sample prediction is not the best path
Various papers consider tradeoffs, no clear answer, but classification trees and random forests do well
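As a concrete sketch of the workflow above (simulated data; scikit-learn's LogisticRegression as the prediction method; all constants and names are illustrative): fit Pr(W = 1 | X) as a pure prediction problem, then use inverse-propensity weighting to correct the confounded difference in means.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000

# One confounder x drives both treatment assignment and the outcome.
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-x))                  # true propensity score
w = rng.binomial(1, p_true)
y = 1.0 * w + 2.0 * x + rng.normal(size=n)     # true effect = 1.0

# Propensity score estimation is a pure prediction problem: Pr(W=1 | X).
ps_model = LogisticRegression().fit(x.reshape(-1, 1), w)
p_hat = ps_model.predict_proba(x.reshape(-1, 1))[:, 1]

# Inverse-propensity-weighted estimate of the average treatment effect.
ate_ipw = np.mean(w * y / p_hat - (1 - w) * y / (1 - p_hat))

# The naive difference in means is biased by the confounder.
ate_naive = y[w == 1].mean() - y[w == 0].mean()
```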
13. Using ML for Model Specification under Selection on Observables
A heuristic:
If you control richly for covariates, you can estimate the treatment effect ignoring endogeneity
This motivates regressions with a rich specification
Naïve approach motivated by the heuristic, using LASSO
Keep the treatment variable out of the model selection by not penalizing it
Use LASSO to select the rest of the model specification
Problem:
The treatment variable is forced in, and some covariates will have coefficients forced to zero. The treatment effect coefficient will pick up those effects and will thus be biased.
See Belloni, Chernozhukov and Hansen JEP
Better approach:
Do variable selection via LASSO for the selection equation and the outcome equation separately
Use the union of the variables selected
Belloni, Chernozhukov & Hansen (2013) show that this works under some assumptions, including constant treatment effects
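The "better approach" (post-double-selection in the spirit of Belloni, Chernozhukov & Hansen) can be sketched as follows; the final step here fits least squares on the treatment plus the union of selected covariates, and the data-generating process and all names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(2)
n, k = 2_000, 50

X = rng.normal(size=(n, k))
# X[:, 0] confounds both the treatment and the outcome.
w = 0.8 * X[:, 0] + rng.normal(size=n)
y = 1.0 * w + 2.0 * X[:, 0] + rng.normal(size=n)   # true effect = 1.0

# Step 1: lasso of the treatment on the covariates (selection equation).
sel_w = np.flatnonzero(LassoCV(cv=5).fit(X, w).coef_)
# Step 2: lasso of the outcome on the covariates (outcome equation).
sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
# Step 3: regress y on the treatment plus the union of selected covariates.
union = sorted(set(sel_w.tolist()) | set(sel_y.tolist()))
Z = np.column_stack([w, X[:, union]])
tau_hat = LinearRegression().fit(Z, y).coef_[0]
```

Skipping step 1 (the naive approach) would let the treatment coefficient absorb the effect of any confounder the outcome lasso drops, which is exactly the bias the slide describes.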
15. Using ML for the first stage of IV
IV assumptions: Yi(w) ⊥ Zi | Xi
See Imbens and Rubin (2015) for an extensive review
First stage estimation: instrument selection and functional form
E[Wi | Zi, Xi]: this is a prediction problem where interpretability is less important
Variety of methods available
Belloni, Chernozhukov, and Hansen (2010); Belloni, Chen, Chernozhukov and Hansen (2012) proposed LASSO
Under some conditions, second stage inference proceeds as usual
Key: the second stage is immune to misspecification in the first stage
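A minimal sketch of a lasso first stage (simulated data with many candidate instruments; this plugs the fitted values straight into the second stage and ignores the regularity conditions and inference refinements the cited papers establish, so treat it as illustrative only):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(3)
n, k = 5_000, 30

Z = rng.normal(size=(n, k))          # many candidate instruments
u = rng.normal(size=n)               # unobserved confounder
# Only the first two instruments actually move the endogenous regressor.
w = Z[:, 0] + 0.5 * Z[:, 1] + u + rng.normal(size=n)
y = 1.0 * w + 2.0 * u + rng.normal(size=n)     # true effect = 1.0

# First stage is a prediction problem: fit E[W | Z] with a lasso.
w_hat = LassoCV(cv=5).fit(Z, w).predict(Z)

# Second stage: regress y on the fitted (exogenous) part of w.
tau_2sls = LinearRegression().fit(w_hat.reshape(-1, 1), y).coef_[0]
# For comparison, OLS of y on w is biased by the confounder u.
tau_ols = LinearRegression().fit(w.reshape(-1, 1), y).coef_[0]
```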
16. Open Questions and Future Directions
Heterogeneous treatment effects in LASSO
Beyond LASSO
What can be learned from the statistics literature and the treatment effect literature about the best possible methods for the selection on observables and IV cases?
Are there methods that avoid the biases of LASSO, and that preserve interpretability and the ability to do inference?
Only very recently are there any results about normality of random forest estimators (e.g. Wager 2014)
What are the best-performing methods?
More general conditions where standard inference is valid, or corrections to standard inference
17. Machine Learning Methods for Estimating Heterogeneous Causal Effects
Athey and Imbens, 2015
http://arxiv.org/abs/1504.01132
18. Motivation I: Experiments and Data-Mining
Concerns about ex-post "data-mining"
In medicine, scholars are required to pre-specify an analysis plan
In economic field experiments, calls for similar protocols
But how is a researcher to predict all forms of heterogeneity in an environment with many covariates?
Goal:
Allow the researcher to specify a set of potential covariates
Data-driven search for heterogeneity in causal effects, with valid standard errors
19. Motivation II: Treatment Effect Heterogeneity for Policy
An estimate of treatment effect heterogeneity is needed for optimal decision-making
This paper focuses on estimating the treatment effect as a function of attributes directly, not optimized for choosing the optimal policy in a given setting
This "structural" function can be used in future decision-making by policy-makers without the need for customized analysis
20. Preview
Distinguish between causal effects and attributes
Estimate treatment effect heterogeneity:
Introduce estimation approaches that combine ML prediction & causal inference tools
Introduce and analyze new cross-validation approaches for causal inference
Inference on estimated treatment effects in subpopulations
Enabling post-experiment data-mining
21. Regression Trees for Prediction
Data
Outcomes Yi, attributes Xi.
Support of Xi is X.
Have a training sample with independent observations
Want to predict on a new sample
Ex: predict how many clicks a link will receive if placed in the first position on a particular search query
Build a "tree":
Partition X into "leaves" Xj
Predict Y conditional on the realization of X in each region Xj using the sample mean in that region
Go through variables and leaves and decide whether and where to split leaves (creating a finer partition) using an in-sample goodness-of-fit criterion
Select tree complexity using cross-validation based on prediction quality
22. Regression Trees for Prediction: Components
1. Model and Estimation
A. Model type: tree structure
B. Estimator Ŷi: sample mean of Yi within the leaf
C. Set of candidate estimators C: correspond to different specifications of how the tree is split
2. Criterion function (for fixed tuning parameter λ)
A. In-sample goodness-of-fit function: Qis = −MSE (Mean Squared Error) = −(1/N) Σ_{i=1}^{N} (Ŷi − Yi)²
B. Structure and use of criterion
i. Criterion: Qcrit = Qis − λ × (# leaves)
ii. Select the member of the set of candidate estimators that maximizes Qcrit, given λ
3. Cross-validation approach
A. Approach: cross-validation on a grid of tuning parameters. Select the tuning parameter λ with the highest out-of-sample goodness-of-fit Qos.
B. Out-of-sample goodness-of-fit function: Qos = −MSE
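In scikit-learn terms, the complexity penalty λ corresponds roughly to the cost-complexity parameter ccp_alpha, and step 3 is a grid search scored by out-of-sample −MSE (an illustrative sketch on simulated data; the grid values and data-generating process are arbitrary):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
n = 2_000
X = rng.uniform(-1, 1, size=(n, 2))
y = np.where(X[:, 0] > 0, 2.0, 0.0) + 0.5 * rng.normal(size=n)

# Candidate estimators differ in how the tree is split; the complexity
# penalty (lambda x # leaves) maps onto cost-complexity pruning.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": [0.0, 0.001, 0.01, 0.1, 1.0]},
    scoring="neg_mean_squared_error",   # out-of-sample Q_os = -MSE
    cv=5,
)
grid.fit(X, y)
best_alpha = grid.best_params_["ccp_alpha"]
y_hat = grid.best_estimator_.predict(X)
```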
23. Using Trees to Estimate Causal Effects
Model: Yi = Yi(Wi) = Yi(1) if Wi = 1, Yi(0) otherwise.
Suppose random assignment of Wi
Want to predict individual i's treatment effect τi = Yi(1) − Yi(0)
This is not observed for any individual
Not clear how to apply standard machine learning tools
Let μ(w, x) = E[Yi | Wi = w, Xi = x]
τ(x) = μ(1, x) − μ(0, x)
24. Using Trees to Estimate Causal Effects
μ(w, x) = E[Yi | Wi = w, Xi = x]
τ(x) = μ(1, x) − μ(0, x)
Approach 1: Analyze the two groups separately
Estimate μ(1, x) using the dataset where Wi = 1
Estimate μ(0, x) using the dataset where Wi = 0
Use propensity score weighting (PSW) if needed
Do within-group cross-validation to choose tuning parameters
Construct the prediction using μ̂(1, x) − μ̂(0, x)
Approach 2: Estimate μ(w, x) using a single tree including both covariates
Include the PS as an attribute if needed
Choose tuning parameters as usual
Construct the prediction using μ̂(1, x) − μ̂(0, x)
Observations
Estimation and cross-validation not optimized for the goal
Lots of segments in Approach 1: combining two distinct ways to partition the data
Problems with these approaches
1. Approaches not tailored to the goal of estimating treatment effects
2. How do you evaluate goodness of fit for tree splitting and cross-validation?
τi = Yi(1) − Yi(0) is not observed, and thus you don't have ground truth
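Approach 1 ("two trees") in code, on hypothetical simulated data with a randomized binary treatment (the tree depth and all constants are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
n = 4_000
x = rng.uniform(-1, 1, size=(n, 1))
w = rng.integers(0, 2, size=n)                 # randomized assignment
tau_true = np.where(x[:, 0] > 0, 1.0, 0.0)     # heterogeneous effect
y = 0.5 * x[:, 0] + tau_true * w + 0.3 * rng.normal(size=n)

# Approach 1 ("two trees"): fit mu(1, x) and mu(0, x) on the two arms.
mu1 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x[w == 1], y[w == 1])
mu0 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x[w == 0], y[w == 0])

# Predicted treatment effect: difference of the two fitted surfaces.
tau_hat = mu1.predict(x) - mu0.predict(x)
```

Note the problem the slide flags: each tree's splits and tuning are chosen to predict levels of Y, not τ, and there is no observed τi against which to score the resulting partition.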
25. Proposed Approach 3: Transform the Outcome
Suppose we have 50-50 randomization of treatment/control
Let Y*i = 2Yi if Wi = 1, −2Yi if Wi = 0
Then E[Y*i] = 2 · (½ E[Yi(1)] − ½ E[Yi(0)]) = E[τi]
Suppose treatment with probability p
Let Y*i = (Wi − p) / (p(1 − p)) · Yi = Yi/p if Wi = 1, −Yi/(1 − p) if Wi = 0
Then E[Y*i] = p · (1/p) E[Yi(1)] − (1 − p) · (1/(1 − p)) E[Yi(0)] = E[τi]
Selection on observables or stratified experiment
Let Y*i = (Wi − p(Xi)) / (p(Xi)(1 − p(Xi))) · Yi
Estimate p(x) using traditional methods
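A numerical check of the transformed-outcome identity (all quantities simulated; p and the outcome model are arbitrary choices): the sample mean of Y* is unbiased for E[τi], but, as the later slides stress, its variance is much larger than that of the simple difference in means.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000
p = 0.3                                     # treatment probability
x = rng.uniform(-1, 1, size=n)
w = rng.binomial(1, p, size=n)
tau_true = 1.0 + x                          # heterogeneous; E[tau_i] = 1.0
y = x + tau_true * w + rng.normal(size=n)

# Transformed outcome: Y* = (W - p) / (p (1 - p)) * Y.
y_star = (w - p) / (p * (1 - p)) * y

# E[Y*] = E[tau_i], so the sample mean of Y* estimates the ATE ...
ate_hat = y_star.mean()
# ... but with far more variance than the difference in means.
ate_dim = y[w == 1].mean() - y[w == 0].mean()
```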
26. Causal Trees: Approach 3 (Conventional Tree, Transformed Outcome)
1. Model and Estimation
A. Model type: tree structure
B. Estimator τ̂*i: sample mean of Y*i within the leaf
C. Set of candidate estimators C: correspond to different specifications of how the tree is split
2. Criterion function (for fixed tuning parameter λ)
A. In-sample goodness-of-fit function: Qis = −MSE (Mean Squared Error) = −(1/N) Σ_{i=1}^{N} (τ̂*i − Y*i)²
B. Structure and use of criterion
i. Criterion: Qcrit = Qis − λ × (# leaves)
ii. Select the member of the set of candidate estimators that maximizes Qcrit, given λ
3. Cross-validation approach
A. Approach: cross-validation on a grid of tuning parameters. Select the tuning parameter λ with the highest out-of-sample goodness-of-fit Qos.
B. Out-of-sample goodness-of-fit function: Qos = −MSE
27. Critique of Proposed Approach 3: Transform the Outcome
Y*i = (Wi − p) / (p(1 − p)) · Yi = Yi/p if Wi = 1, −Yi/(1 − p) if Wi = 0
Within a leaf, the sample average of Y*i is not the most efficient estimator of the treatment effect
The proportion of treated units within the leaf is not the same as the overall sample proportion
This weights the treatment group mean by p, not by the actual fraction of treated in the leaf
This motivates a modification:
Use the sample average treatment effect in the leaf (average of treated less average of control)
28. Critique of Proposed Approach 3: Transform the Outcome
Use of the transformed outcome creates noise in tree splitting and in out-of-sample cross-validation
In-sample:
Use the variance of the prediction for in-sample goodness of fit
For an estimator guaranteed to be unbiased in sample (such as the sample average treatment effect), the variance of the estimator measures predictive power
Out of sample:
Use a matching estimator τ̂i^{m,os} to construct an estimate of the ground-truth treatment effect out of sample. A single match minimizes bias.
The matching estimator and the transformed outcome are both unbiased in large samples when a perfect match can be found. But the transformed outcome introduces variance due to the weighting factor; the matching estimator controls for the predictable component of variance
Var(Y*i) ≈ (1/p) Var(Yi(1)) + (1/(1 − p)) Var(Yi(0))
Var(τ̂i^m) ≈ Var(Yi(1) | X) + Var(Yi(0) | X)
29. Causal Trees
1. Model and Estimation
A. Model type: tree structure
B. Estimator τ̂i^CT: sample average treatment effect within the leaf
C. Set of candidate estimators C: correspond to different specifications of how the tree is split
2. Criterion function (for fixed tuning parameter λ)
A. In-sample goodness-of-fit function: Qis = (1/N) Σ_{i=1}^{N} (τ̂i^CT)², the variance of the predictions (see slide 28)
B. Structure and use of criterion
i. Criterion: Qcrit = Qis − λ × (# leaves)
ii. Select the member of the set of candidate estimators that maximizes Qcrit, given λ
3. Cross-validation approach
A. Approach: cross-validation on a grid of tuning parameters. Select the tuning parameter λ with the highest out-of-sample goodness-of-fit Qos.
B. Out-of-sample goodness-of-fit function: Qos = −(1/N) Σ_{i=1}^{N} (τ̂i^CT − τ̂i^{m,os})²
30. Comparing "Standard" and Causal Approaches
They will be more similar if treatment effects and levels are highly correlated
Two-tree approach
Will do poorly if there is a lot of heterogeneity in levels that is unrelated to treatment effects; the trees are much too complex without predicting treatment effects
Will do well in certain specific circumstances, e.g.
Control outcomes constant in covariates
Treatment outcomes vary with covariates
Transformed outcome
Will do badly if there is a lot of observable heterogeneity in outcomes, and if treatment probabilities are unbalanced or have high variance
Variance in the criterion functions leads to trees that are too simple, as the criterion erroneously finds a lack of fit
How to compare approaches?
1. Oracle (simulations)
2. Transformed outcome goodness of fit
3. Matching goodness of fit
31. Inference
Conventional wisdom is that trees are bad for inference
The predictions are discontinuous in the tree and not normally distributed.
But we are not interested in inference on the tree structure.
Attractive feature of trees:
Can easily separate tree construction from treatment effect estimation
A tree constructed on the training sample is independent of sampling variation in the test sample
Holding the tree from the training sample fixed, one can use standard methods to conduct inference within each leaf of the tree on the test sample
Can use any valid method for treatment effect estimation, not just the methods used in training
For observational studies, the literature (e.g. Hirano, Imbens and Ridder (2003)) requires additional conditions for inference
E.g. leaf size must grow with the population
Future research: extend the ideas beyond trees
Bias arises in LASSO as well in the absence of strong sparsity conditions.
Expand on the insight: using separate datasets for model selection and for estimating predictions in a given model yields valid inference. See, e.g., Denil, Matheson, and de Freitas (2014) on random forests.
32. Problem: Treatment Effect Heterogeneity in Estimating Position Effects in Search
Queries are highly heterogeneous
Tens of millions of unique search phrases each month
Query mix changes month to month for a variety of reasons
Behavior conditional on a query is fairly stable
Desire for segments.
Want to understand heterogeneity and make decisions based on it
"Tune" algorithms separately by segment
Want to predict outcomes if the query mix changes
For example, bring on a new syndication partner with more queries of a certain type
33. Relevance v. Position
[Chart: "Loss in CTR from Link Demotion (US All Non-Navigational)". For the control (1st position) and the demotion conditions (1,3), (1,5), (1,10), the bars decompose CTR into the original CTR of the position, the gain from increased relevance, and the loss from demotion.]
34. Search Experiment Tree: Effect of Demoting Top Link (Test Sample Effects)
Some data excluded with probability p(x): proportions do not match the population
Highly navigational queries excluded
37. Conclusions
Key to the approach
Distinguish between causal and predictive parts of the model
"Best of Both Worlds"
Combining very well established tools from different literatures
Systematic model selection with many covariates
Optimized for the problem of causal effects
In terms of the tradeoff between granular prediction and overfitting
With valid inference
Easy to communicate the method and interpret results
Output is a partition of the sample, treatment effects and standard errors
Important application
Data-mining for heterogeneous effects in randomized experiments
38. Literature
Approaches in the spirit of single tree/two trees
Beygelzimer and Langford (2009)
Analogous to the "two trees" approach with multiple treatments; construct an optimal policy
Foster, Taylor, Ruberg (2011)
Estimate μ(w, x) using random forests, define τ̂i = μ̂(1, Xi) − μ̂(0, Xi), and build trees on τ̂i.
Imai and Ratkovic (2013)
In the context of a randomized experiment, estimate μ(w, x) using lasso-type methods, and then τ(x) = μ̂(1, x) − μ̂(0, x).
Transformed outcomes or covariates
Tibshirani et al (2014); Weisberg and Pontes (2015) in the regression/LASSO setting
Dudík, Langford, and Li (2011) and Beygelzimer and Langford (2009) for optimal policy
Don't highlight or address the limitations of transformed outcomes in estimation & criteria
Estimating treatment effects directly at the leaves of trees
Su, Tsai, Wang, Nickerson, Li (2009)
Do a regular tree, but split if the t-stat for the treatment effect difference is large, rather than when the change in prediction error is large.
Zeileis, Hothorn, and Hornik (2005)
"Model-based recursive partitioning": estimate a model at the leaves of a tree. In-sample splits based on prediction error; do not focus on out-of-sample cross-validation for tuning.
None of these explore cross-validation based on the treatment effect.
39. Extensions (Work in Progress)
Alternatives for cross-validation criteria
Optimizing the selection on observables case
What is the best way to estimate the propensity score for this application?
Alternatives to propensity score weighting
40. Heterogeneity: Instrumental Variables
Setup
Binary treatment, binary instrument case
Instrument Zi
Δ_Y(S) = E[Yi | Zi = 1, Xi ∈ S] − E[Yi | Zi = 0, Xi ∈ S]
Δ_W(S) = Pr(Wi = 1 | Zi = 1, Xi ∈ S) − Pr(Wi = 1 | Zi = 0, Xi ∈ S)
The LATE estimator for xi ∈ S is: LATE(S) = Δ_Y(S) / Δ_W(S)
LATE heterogeneity issues
Tree model: want the numerator and denominator on the same set S, to get the LATE for units with xi in S.
The set of units shifted by the instrument varies with x
The average of LATE estimators over all regions is NOT equal to the LATE for the population
Proposed Method
Estimation & Inference:
Estimate the numerator and denominator simultaneously with a single tree model
Inference on a distinct sample, can do separately within each leaf
In-sample goodness of fit:
Prediction accuracy for both components separately
Weight the two components
In-sample criterion penalizes complexity, as usual
Cross-validation
Bias paramount: is my tree overfit?
Criterion: for each unit, find the closest neighbors and estimate the LATE (e.g. kernel)
Two parameters instead of one: complexity and the relative weight of the numerator
Can also estimate an approximation for optimal weights
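The subgroup LATE estimator above can be sketched as follows (hypothetical simulated data with a binary instrument; the `late` helper and all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
x = rng.uniform(-1, 1, size=n)
z = rng.integers(0, 2, size=n)                 # binary instrument
# Instrument shifts treatment uptake from 25% to 75% (compliers).
w = (rng.uniform(size=n) < 0.25 + 0.5 * z).astype(int)
y = x + 2.0 * w + rng.normal(size=n)           # effect = 2.0 for compliers

def late(mask):
    """Wald/LATE estimator Delta_Y(S) / Delta_W(S) on the subgroup `mask`."""
    dy = y[mask & (z == 1)].mean() - y[mask & (z == 0)].mean()
    dw = w[mask & (z == 1)].mean() - w[mask & (z == 0)].mean()
    return dy / dw

late_all = late(np.ones(n, dtype=bool))        # population LATE
late_left = late(x < 0)                        # LATE on a covariate region
```

As the slide warns, averaging such region-level LATEs generally does not recover the population LATE, because the complier population differs across regions.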
41. Next Steps
Application to demand elasticities (Amazon, eBay, advertisers)
Can apply the methods with a regression at the bottom of the tree; some modifications needed
More broadly, richer demand models at scale
Lessons for economists
ML methods work better in economic problems when customized for economic goals
Not hard to customize methods and modify them
Not just possible but fairly easy to be systematic about model selection
43. Optimal Decision Policies
Decision Policies v. Treatment Effects
In some applications, the goal is directly to estimate an optimal decision policy
There may be a large number of alternatives
Decisions are made immediately
Examples
Offers or marketing to users
Advertisements
Mailings or emails
Online web page optimization
Customized prices
44. Model
Outcome Yi incorporates both costs and benefits
If the cost is known, e.g. a mailing, define the outcome to include the cost
Treatment Wi is multi-valued
Attributes Xi observed
Maintain the selection on observables assumption: Yi(w) ⊥ Wi | Xi
Propensity score: Pr(Wi = wi | Xi = xi) = p_w(x)
Optimal policy: p*(x) = argmax_w E[Yi(w) | Xi = x]
Examples/interpretation
Marketing/web site design
Outcome is voting, purchase, a click, etc.
Treatment is the offer
Past user behavior used to define attributes
Selection on observables justified by past experimentation (or real-time experimentation)
Personalized medicine
Treatment plan as a function of individual characteristics
45. Learning Policy Functions
ML Literature:
Contextual bandits (e.g., John Langford), associative reinforcement learning, associative bandits, learning with partial feedback, bandits with side information, partial label problem
Cost-sensitive classification
Classifiers (e.g. logit, CART, SVM) = discrete choice models
Weight observations by an observation-specific weight
Objective function: minimize classification error
The policy problem
Minimize regret from a suboptimal policy ("policy regret")
For the 2-choice case:
Procedure with transformed outcome:
Train a classifier as if the observed treatment were optimal: (features, choice, weight) = (Xi, Wi, Yi / p(x)).
The estimated classifier is a possible policy
Result:
The loss from the cost-weighted classifier (misclassification error minimization) is the same in expectation as the policy regret
Intuition
The expected value of the weights conditional on xi, wi is E[Yi(wi) | Xi = xi]
Implication
Use an off-the-shelf classifier to learn optimal policies, e.g. logit, CART, SVM
The literature considers extensions to multi-valued treatments (tree of binary classifiers)
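The cost-sensitive classification trick can be sketched with an off-the-shelf tree (simulated data with positive rewards and p = 0.5; all names and constants are hypothetical, and negative-weight draws are clipped as a practical guard):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(8)
n, p = 20_000, 0.5

x = rng.uniform(-1, 1, size=(n, 1))
w = rng.binomial(1, p, size=n)                  # randomized treatment
tau = np.where(x[:, 0] > 0, 1.0, -1.0)          # treating helps only if x > 0
y = 2.0 + tau * w + 0.2 * rng.normal(size=n)    # reward, positive by design

# Train a classifier as if the observed treatment were optimal, weighting each
# observation by Y_i / p: the expected weight given (x, w) is proportional to
# E[Y(w) | x], so weighted misclassification error tracks policy regret.
weights = np.maximum(y, 0.0) / p                # clip rare negative draws
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(x, w, sample_weight=weights)

policy = clf.predict(x)                         # estimated optimal policy
agreement = np.mean(policy == (x[:, 0] > 0))    # vs. the true optimal rule
```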
46. Weighted Classification Trees: Transformed Outcome Approach
Interpretation
This is the analog of using the transformed outcome approach for heterogeneous treatment effects, but for learning optimal policies
The in-sample and out-of-sample criteria have high variance, and don't adjust for actual sample proportions or predictable variation in outcomes as a function of X
Comparing two policies: Loss A − Loss B, with N_T units
Region S10 (A: Treat, B: No Treat): treated units contribute −(1/N_T) Σ Yi/p; control units contribute (1/N_T) Σ Yi/(1 − p); expected value of the sum: −Pr(Xi ∈ S10) E[τi | Xi ∈ S10]
Region S01 (A: No Treat, B: Treat): treated units contribute (1/N_T) Σ Yi/p; control units contribute −(1/N_T) Σ Yi/(1 − p); expected value of the sum: Pr(Xi ∈ S01) E[τi | Xi ∈ S01]
47. Alternative: Causal Policy Tree (Athey-Imbens 2015-WIP)
Improve by the same logic by which causal trees improved on the transformed outcome
In sample
Estimate the treatment effect τ̂(x) within leaves using the actual proportion treated in each leaf (average treated outcomes less average control outcomes)
Split based on classification costs, using the sample average treatment effect to estimate the cost within a leaf. Equivalent to modifying the transformed outcome to adjust for the correct leaf treatment proportions.
Comparing two policies: Loss A − Loss B, with N_T units
Region S10 (A: Treat, B: No Treat): treated units contribute −(1/N_T) Σ τ̂(S10); control units contribute −(1/N_T) Σ τ̂(S10); expected value of the sum: −Pr(Xi ∈ S10) E[τi | Xi ∈ S10]
Region S01 (A: No Treat, B: Treat): treated units contribute (1/N_T) Σ τ̂(S01); control units contribute (1/N_T) Σ τ̂(S01); expected value of the sum: Pr(Xi ∈ S01) E[τi | Xi ∈ S01]
48. Alternative: Causal Policy Tree (Athey-Imbens 2015-WIP)
Comparing two policies: Loss A − Loss B, with N_T units
Causal policy tree:
Region S10 (A: Treat, B: No Treat): treated units contribute −(1/N_T) Σ τ̂(S10); control units contribute −(1/N_T) Σ τ̂(S10); expected value of the sum: −Pr(Xi ∈ S10) E[τi | Xi ∈ S10]
Region S01 (A: No Treat, B: Treat): treated units contribute (1/N_T) Σ τ̂(S01); control units contribute (1/N_T) Σ τ̂(S01); expected value of the sum: Pr(Xi ∈ S01) E[τi | Xi ∈ S01]
Transformed outcome:
Region S10 (A: Treat, B: No Treat): treated units contribute −(1/N_T) Σ Yi/p; control units contribute (1/N_T) Σ Yi/(1 − p); expected value of the sum: −Pr(Xi ∈ S10) E[τi | Xi ∈ S10]
Region S01 (A: No Treat, B: Treat): treated units contribute (1/N_T) Σ Yi/p; control units contribute −(1/N_T) Σ Yi/(1 − p); expected value of the sum: Pr(Xi ∈ S01) E[τi | Xi ∈ S01]
49. Alternative approach for cross-validation criterion
Use nearest-neighbor matching to estimate the treatment effect for test observations
Categorize as misclassified at the individual unit level
The loss function is the misclassification error for misclassified units
When comparing two policies (classifiers), for a unit where the policies have different recommendations, the difference in the loss function is the estimated treatment effect for that unit
If the sample is large enough that close matches can be found, this criterion may have lower variance than the transformed outcome in small samples, and thus a better fit is obtained
Alternative: Causal Policy Tree (Athey-Imbens 2015-WIP)
50. Inference
Not considered in the contextual bandit literature
As before, split the sample for estimating the classification tree and for conducting inference.
Within each leaf:
The optimal policy is determined by the sign of the estimated treatment effect.
Simply use a conventional test of the one-sided hypothesis that the estimated sign of the treatment effect is wrong.
Alternative: Causal Policy Tree (Athey-Imbens 2015-WIP)
51. Existing ML Literature
With minimal coding, apply a pre-packaged classification tree using the transformed outcome as weights
The resulting tree is an estimate of the optimal policy.
For multiple treatment options
Follow the contextual bandit literature and construct a tree of binary classifiers. Race pairs of alternatives, then race winners against each other on successively smaller subsets of the covariate space.
Also propose further transformations (offsets)
Our observation:
Can do inference on a hold-out sample taking the tree as fixed.
Can improve on the pre-packaged tree with the causal policy tree.
Also relates to a small literature in economics
Estimating Policy Functions: Summary
52. Other Related Topics
Online learning
See e.g. Langford
Explore/exploit
Auer et al '95, more recent work by Langford et al
Doubly robust estimation
Elad Hazan, Satyen Kale, Better Algorithms for Benign Bandits, SODA 2009.
David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, Diane Lambert, Evaluating Online Ad Campaigns in a Pipeline: Causal Models at Scale, KDD 2010
Dudík, Langford, and Li (2011)
54. Inference for Causal Effects v. Attributes:
Abadie, Athey, Imbens & Wooldridge (2014)
Approach
Formally define a population of
interest and how sampling occurs
Define an estimand that answers
the economic question using these
objects (effects versus attributes)
Specify: “What data are missing,
and how is the difference between
your estimator and the estimand
uncertain?”
Given data on 50 states from 2003, we
know with certainty the difference in
average income between coast and
interior
Although we could contemplate using data from 2003 to estimate the 2004 difference, this depends on serial correlation within states, on which there is no direct information in the data
Application to Effects v.
Attributes in Regression
Models
Sampling: sample/population ratio does not go to zero; finite population
Causal effects have missing
data: don’t observe both
treatments for any unit
Huber-White robust standard
errors are conservative but best
feasible estimate for causal
effects
Standard errors on fixed
attributes may be much smaller
if sample is large relative to
population
Conventional approaches take
into account sampling variance
that should not be there
56. Robustness of Causal Estimates
Athey and Imbens (AER P&P, 2015)
General nonlinear models/estimation methods
Causal effect is defined as a function of model
parameters
Simple case with binary treatment: the effect is 𝜏𝑖 = 𝑌𝑖(1) − 𝑌𝑖(0)
Consider other variables/features as “attributes”
Proposed metric for robustness:
Use a series of “tree” models to partition the sample by
attributes
Simple case: take each attribute one by one
Re-estimate model within each partition
For each tree, calculate overall sample average effect as a
weighted average of effects within each partition
This yields a set of sample average effects
Propose the standard deviation of these effects as the robustness measure
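The metric can be sketched as follows, with equal weighting across partitions. This is a simplified stand-in (names are mine): a difference in means replaces the general nonlinear estimator, and a median split on each attribute replaces the tree partition.

```python
import numpy as np

def robustness_metric(y, w, attributes):
    """Simple case: take each attribute one by one, split the sample at
    the attribute's median, re-estimate the effect within each cell, and
    recombine as a weighted average.  Return the standard deviation of
    the resulting set of overall sample average effects.
    """
    def ate(mask):
        # difference in means within a cell (randomized-treatment case)
        return y[mask & (w == 1)].mean() - y[mask & (w == 0)].mean()

    overall_effects = []
    for col in attributes.T:
        left = col <= np.median(col)
        effect = left.mean() * ate(left) + (~left).mean() * ate(~left)
        overall_effects.append(effect)
    return np.std(overall_effects)
```

In a well-behaved randomized experiment the partitioned estimates all land near the full-sample effect, so the metric is small; in less robust observational settings the estimates spread out.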
57. Robustness of Causal Estimates
Athey and Imbens (AER P&P, 2015)
Four Applications:
Randomly assigned training program
Treated individuals with an artificial control group from census data (LaLonde)
Lottery data (Imbens, Rubin & Sacerdote (2001))
Regression of earnings on education from NLSY
Findings
Robustness measure better for randomized experiments,
worse in observational studies
60. Robustness Metrics: Desiderata
Invariant to:
Scaling of explanatory variables
Transformations of vector of explanatory variables
Adding irrelevant variables
Each member model must be somehow distinct to
create variance, yet we want to allow lots of
interactions
Need to add lots of rich but different models
Well-grounded way to weight models
This paper had equal weighting
61. Robustness Metrics: Work In Progress
Std Deviation versus Worst-Case
Desire for set of alternative
models that grows richer
New additions that are similar to previous ones drive the std dev down
Standard dev metric:
Need to weight models to put
more weight on distinct
alternative models
“Worst-case” or “bounds”:
Find the lowest and highest
parameter estimates from a
set of models
OK to add more models that are similar to existing ones.
But worst-case is very sensitive to outliers: how do we guard against them?
Theoretical underpinnings
Subjective versus objective
uncertainty
Subjective uncertainty: correct model
Objective uncertainty: distribution of
model estimates given correct model
What are the preferences of the
“decision-maker” who values
robustness?
“Variational preferences”
“Worst-case” over a set of possible beliefs, allowing for a “cost” that penalizes beliefs that are “less likely” (see Strzalecki, 2011)
Our approach for exog. covariate case:
Convex cost to models that perform
poorly out of sample from a predictive
perspective
62. Conclusions on Robustness
ML inspires us to be both systematic and pragmatic
Big data gives us more choices in model selection and
ability to evaluate alternatives
Maybe we can finally make progress on robustness