Undergraduate Modeling Workshop - Air Quality Working Group Final Presentation, May 25, 2018

The Statistical and Applied Mathematical Sciences Institute

Fine particulate matter (PM2.5) is a mixture of air pollutants that, at high concentrations, has adverse effects on human health. An interesting statistics problem is to estimate these pollutant exposures for the entire US; such estimates can be used to inform policy and decision making. During the workshop, we will work with two major sources of air quality data that the EPA uses to estimate pollutant exposures: monitoring data and the Community Multiscale Air Quality (CMAQ) model. The monitoring stations provide fairly accurate measurements of the pollutants; however, they are sparse in space and take measurements at a coarse time resolution, typically 1-in-3 or 1-in-6 days. On the other hand, the CMAQ model provides daily concentration levels of each component with complete spatial coverage on a grid; these model outputs, however, need to be evaluated and calibrated against the monitoring data. We will explore these air quality data for the summer of 2011 and brainstorm statistical models to estimate air pollutant exposures. Group members: Meixi Chen, Vincent Gonzales, Alan Ji, Chandni Malhotra, Hongyu Mao, Sharon Sung

AIR QUALITY
Alan, Chandni, Vincent, Ceci, Sharon, Mao
May 2018, SAMSI Workshop

What is PM2.5?
➢ Particulate matter
➢ Diameter < 2.5 micrometers
➢ About 3% of the diameter of a human hair
➢ Bypasses the nose and throat and penetrates deep into the lungs and circulatory system

PM2.5 Monitoring Systems in the US
➢ Monitoring stations are sparse
➢ Need predictions for locations without a monitoring station

What is CMAQ?
CMAQ (Community Multiscale Air Quality) is a numerical air quality model used to predict the concentration of air pollutants.

CMAQ Inaccuracies
● Regions with pronounced topography showed the greatest error
● Areas with more monitoring stations had the best predictions

The Big Question/Goal
What is the best statistical model for predicting PM2.5 concentration levels across the entire U.S. using numerical model outputs and other available covariates?

Variable Selection & Transformation
31 Plots: Covariate vs. PM2.5 (Response Variable)
➢ Each covariate related to PM2.5
Residual Plot: Residuals of PM2.5 vs. each covariate
Adjusted R-squared: not enough on its own

Variable Selection & Transformation
Counting Covariates
31 -> 28 (same measurement)
28 -> 25 (correlation plot & adjusted R-squared)
Correlation Plot
Threshold for Decision: 0.8
Correlated pair: …

Variable Selection & Transformation
31 Plots: Covariate vs. PM2.5 (Response Variable)
Residual Plot: Residuals of PM2.5 vs. each covariate
Adjusted R-squared
Used to decide which covariate to exclude when two are highly correlated.
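
The correlation screen described on these slides can be sketched as follows. This is a minimal illustration and not the group's code: the covariate names and values are made up, and only the 0.8 threshold comes from the slides. It flags highly correlated pairs; choosing which member of a pair to drop (by adjusted R-squared, per the slide) is left to the analyst.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged

# Toy data with hypothetical covariate names (not the workshop data).
toy = {
    "temp_2m": [10.0, 12.0, 15.0, 19.0, 24.0],
    "temp_surface": [11.0, 13.5, 15.5, 20.0, 25.5],
    "wind_speed": [3.0, 1.0, 4.0, 1.5, 3.5],
}
pairs = correlated_pairs(toy)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

Only one covariate from each flagged pair would be kept for modeling.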

Variable Selection & Transformation
Residual Plot
➢ Regress PM2.5 on CMAQ (PM2.5 ~ CMAQ)
➢ Plot the residuals against the other covariates
[Figure: residuals vs. boundary layer height]
Finally, 15 covariates are selected
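
The residual check above can be sketched in a few lines. This is an illustrative version under stated assumptions: the numbers are toy values, and the variable names (`cmaq`, `pm25`, `boundary_layer_height`) are placeholders rather than the workshop's column names. A visible trend in the resulting scatter would suggest the covariate explains variation that CMAQ alone misses.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(x, y):
    """Residuals y - fitted from the simple regression of y on x."""
    intercept, slope = fit_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Toy values only (hypothetical, not the workshop data).
cmaq = [5.0, 8.0, 10.0, 14.0, 20.0]
pm25 = [6.0, 7.5, 11.0, 12.0, 18.0]
boundary_layer_height = [300.0, 850.0, 500.0, 1200.0, 700.0]

res = residuals(cmaq, pm25)

# Points for a scatter plot of residuals against one candidate covariate.
points = list(zip(boundary_layer_height, res))
for height, r in points:
    print(f"height = {height:7.1f}  residual = {r:+.3f}")
```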

Some fun math behind the models…
Random Forest
No. of trees: 500
No. of variables tried at each split: 5
Mean of squared residuals (log scale): 0.1075135
% Variance explained: 72.25
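
The layout of this summary resembles the printed output of R's randomForest package, although the slides do not name the software. Assuming that is the source, the reported percentage is a pseudo R-squared based on the out-of-bag mean of squared residuals, which lets a reader back out the variance of the log-scale response. The symbols below are introduced here for illustration only.

```latex
\[
\%\,\text{Var explained}
  = 100 \left( 1 - \frac{\mathrm{MSE}_{\mathrm{OOB}}}{\widehat{\operatorname{Var}}(y)} \right)
\quad\Longrightarrow\quad
\widehat{\operatorname{Var}}(y) \approx \frac{0.1075135}{1 - 0.7225} \approx 0.387
\]
```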

Spatial Model
➢ Covariance matrix
➢ Conditional normality
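
The slide lists only the two ingredients, so the following is a standard textbook statement rather than the group's exact model. Assume the PM2.5 value at an unmonitored location, Y_0, and the vector of monitored values, Y, are jointly Gaussian; the notation (mu, Sigma, c, sigma_0) is introduced here for illustration. The conditional distribution then gives both the prediction and its uncertainty:

```latex
\[
\begin{pmatrix} Y_0 \\ \mathbf{Y} \end{pmatrix}
\sim \mathcal{N}\!\left(
  \begin{pmatrix} \mu_0 \\ \boldsymbol{\mu} \end{pmatrix},
  \begin{pmatrix} \sigma_0^2 & \mathbf{c}^\top \\ \mathbf{c} & \Sigma \end{pmatrix}
\right)
\]
\[
Y_0 \mid \mathbf{Y} = \mathbf{y}
\sim \mathcal{N}\!\left(
  \mu_0 + \mathbf{c}^\top \Sigma^{-1} (\mathbf{y} - \boldsymbol{\mu}),\;
  \sigma_0^2 - \mathbf{c}^\top \Sigma^{-1} \mathbf{c}
\right)
\]
```

Here Sigma is the covariance matrix among monitored sites and c holds the covariances between each monitored site and the prediction location.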

The Kriging Concept
“The basic idea of kriging is to predict the value of a function at a given point by computing a weighted average of the known values of the function in the neighborhood of the point.”
Source: Wikipedia
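
The weighted-average idea in the quote can be illustrated with a small simple-kriging sketch. Everything specific here is an assumption made for illustration: the exponential covariance function, the `sill` and `range_` parameters, the known constant mean, and the toy site coordinates and values. It is not the spatial model fitted in the workshop.

```python
import math

def exp_cov(p, q, sill=1.0, range_=1.0):
    """Exponential covariance between two 2-D points (an assumed model)."""
    return sill * math.exp(-math.dist(p, q) / range_)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def simple_krige(sites, values, target, mean, sill=1.0, range_=1.0):
    """Predict at `target` as mean + weighted sum of (value - mean)."""
    cov = [[exp_cov(p, q, sill, range_) for q in sites] for p in sites]
    c0 = [exp_cov(p, target, sill, range_) for p in sites]
    weights = solve(cov, c0)  # kriging weights
    pred = mean + sum(w * (v - mean) for w, v in zip(weights, values))
    var = sill - sum(w * c for w, c in zip(weights, c0))  # prediction variance
    return pred, var, weights

# Toy monitoring sites and readings (hypothetical values).
sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
values = [8.0, 10.0, 9.0, 14.0]
mean = sum(values) / len(values)

pred, var, weights = simple_krige(sites, values, (0.5, 0.5), mean)
print(f"prediction = {pred:.3f}, variance = {var:.3f}")
```

Predicting at a monitored site reproduces its observed value with zero variance, and far from all sites the prediction falls back to the mean.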

January 1st Measurements

Prediction Maps for Jan 1st, 2011

August 1st Measurements

Prediction Maps for Aug 1st, 2011

5-Fold Cross-Validation
➢ Divide the whole dataset into 5 folds
➢ Train the model on 4 of them and hold out the fifth
➢ Make predictions on the held-out fold and compute the MSE and MAD
➢ Repeat so that each fold is held out once
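
The procedure above can be sketched generically. Several choices here are assumptions, since the slides do not specify them: folds are assigned by a seeded random shuffle, errors are pooled across folds before averaging, and MAD is taken to mean the mean absolute deviation of the prediction errors. The straight-line model and toy data are placeholders for whichever model is being evaluated.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(x, y, fit, predict, k=5, seed=0):
    """Return (MSE, MAD) of held-out predictions pooled over k folds."""
    errors = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        errors += [y[i] - predict(model, x[i]) for i in held_out]
    mse = sum(e * e for e in errors) / len(errors)
    mad = sum(abs(e) for e in errors) / len(errors)  # assumed: mean absolute
    return mse, mad

def fit_line(xs, ys):
    """Least-squares line; stands in for any model with fit/predict."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def predict_line(model, x_new):
    intercept, slope = model
    return intercept + slope * x_new

# Toy data: a line plus a deterministic +/-0.5 wiggle (not the workshop data).
x = [float(i) for i in range(20)]
y = [2.0 * v + 1.0 + (0.5 if i % 2 else -0.5) for i, v in enumerate(x)]

mse, mad = cross_validate(x, y, fit_line, predict_line)
print(f"MSE = {mse:.3f}, MAD = {mad:.3f}")
```

Passing a different `fit`/`predict` pair evaluates another model on the same folds, which keeps the comparison between models like-for-like.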

Model Comparison Based on Cross-Validation

Model              MSE      MAD
CMAQ               51.734   4.681
Simple LR          23.220   3.103
Random forest      13.254   2.177
Spatial analysis    9.734   1.718

Prediction Maps for Jan 1st, 2011
MSE of CMAQ = 51.734, MSE of LR = 23.220, MSE of RF = 13.254, MSE of Spatial Analysis = 9.734

Prediction Maps for Aug 1st, 2011
MSE of CMAQ = 51.734, MSE of LR = 23.220, MSE of RF = 13.254, MSE of Spatial Analysis = 9.734

Summary
➢ Spatial analysis makes the best predictions (lowest cross-validated MSE and MAD)
➢ Potential Improvements:
○ Look at the interactions between covariates
○ Other machine learning methods, such as neural networks
○ Seasonal analysis
○ Midwest?

Special thanks to Yawen, Amanda, Suman, and Doug

Undergraduate Modeling Workshop - Air Quality Working Group Final Presentation, May 25, 2018

What's hot

Bar graphs

kristenu83

Bar graphs

kristenu83

In this project the group members will play with daily rainfall data collected in Gulf coast (535stations in total) from 1949 to 2017. The purposes of this exercise are to: 1) to give students an idea of a typical example of a climate data set (spatio-temporal data) and someassociated scientific questions (e.g. how rainfall extremes vary in space and time and how that mightbe affected by other things like greenhouse gases or temperatures). 2) to get students familiar with data analysis using R including data manipulation, data visualization, and data summary. 3) to introduce some statistical methods (e.g. time series analysis, spatial statistics, extreme value analysis) to analyze this kind of data to "answer" (perform statistical inference) the questions of interest. Group members: Lin Ge, Jianan Jang, Jessica Robinson, Erin Song, Seth Temple, Adam Wu

Undergraduate Modeling Workshop - Southeastern US Rainfall Working Group Fina...

The Statistical and Applied Mathematical Sciences Institute

Spatial presentation of prognosis models in plant protection

CAPIGI

Calculus

East west University

Deep learning for multi year enso forecasts fnl

Rakesh S

2012 CRL Recruiting Memo

A Jorge Garcia

2nd Test - Scatterplots

Brandeis High School

Data science lab project

LuciaRavazzi

VECTOR CALCULUS

MANJULAKAMALANATHAN

Math in the News: 8/29/11

Media4math

How to train your mind to think like the ai machine you are training

Denis Rothman

QCL-14-v3_PARETO DIAGRAM_BANASTHALI UNIVERSITY_TANYA RATHORE

tanya rathore

2.6b scatter plots and lines of best fit

hartcher

Day 6 examples

jchartiersjsd

Some of you have already know how serious is the problem with air pollution in the capital of Bulgaria, Sofia but ... ▶️Do you know how it could be solved? Our community represented by 1800 members all around the world tried to tackle the issue at our previous #GlobalDatathon and our international #DataScience #MonthlyChallenge, part of a university program. Team Kiwi is solving the problem by implementing algorithms and statistical methods for air pollution prediction in the next 24 hours.

Air Pollution in Sofia - Solution through Data Science by Kiwi team

Data Science Society

MPI 794 (week-1 & 2)

Yasser B. A. Farag

Global warming graphs

Sophia Elliott

Mathematical modelling and its application in weather forecasting

Sarwar Azad

Application of differential and integral

Shohan Ahmed

What's hot (20)

Bar graphs

Undergraduate Modeling Workshop - Southeastern US Rainfall Working Group Fina...

Spatial presentation of prognosis models in plant protection

Calculus

Deep learning for multi year enso forecasts fnl

2012 CRL Recruiting Memo

2nd Test - Scatterplots

Data science lab project

VECTOR CALCULUS

Math in the News: 8/29/11

How to train your mind to think like the ai machine you are training

QCL-14-v3_PARETO DIAGRAM_BANASTHALI UNIVERSITY_TANYA RATHORE

2.6b scatter plots and lines of best fit

Day 6 examples

Air Pollution in Sofia - Solution through Data Science by Kiwi team

MPI 794 (week-1 & 2)

Global warming graphs

Mathematical modelling and its application in weather forecasting

Application of differential and integral

Similar to Undergraduate Modeling Workshop - Air Quality Working Group Final Presentation, May 25, 2018

Many applications require a large number of time series to be forecast completely automatically. For example, manufacturing companies often require weekly forecasts of demand for thousands of products at dozens of locations in order to plan distribution and maintain suitable inventory stocks. In these circumstances, it is not feasible for time series models to be developed for each series by an experienced analyst. Instead, an automatic forecasting algorithm is required. In addition to providing automatic forecasts when required, these algorithms also provide high quality benchmarks that can be used when developing more specific and specialized forecasting models. I will describe some algorithms for automatically forecasting univariate time series that have been developed over the last 20 years. The role of forecasting competitions in comparing the forecast accuracy of these algorithms will also be discussed.

Automatic algorithms for time series forecasting

Rob Hyndman

Modern, large-scale simulation models are often based on the "laws of physics". Interpreting the outputs of these models nevertheless introduces a host of new challenges for uncertainty quantification and decision making. Unlike the traditional low-dimensional models in statistics or in applied maths, the state space of these simulation models is often large (10^7 D) and the components of the model state vector do not correspond to real-world observables. Nevertheless, in contexts like weather and climate, for example, such models sometimes provide significantly more information than traditional empirical models. Structural Model Error (SME) is the difference between the mathematical structure of the simulation model and the system that generates the observations (assuming that the system has a nontrivial mathematical description). In the absence of SME, reducing imprecision in parameter values and in the current state of the system is a daunting but tractable task, and forecasting deterministic systems takes on a probabilistic This presentation illustrates how SME, and the (almost certain) lack of topological conjugacy it implies, has significant impacts on what can be expected from simulation modelling. Challenges to various approaches of Uncertainty Quantification currently used in practice ("ensemble forecasting") and statistical methods exploiting a discrepancy function are demonstrated.

MUMS Opening Workshop -On the Impact(s) of Structural Model Error on Simulati...

The Statistical and Applied Mathematical Sciences Institute

Can we predict the quality of spectrum-based fault localization?

Lionel Briand

AnnualAutomobileSalesPredictionusingARIMAModel (2).pdf

Farhad Sagor

A data science observatory based on RAMP - rapid analytics and model prototyping

Akin Osman Kazakci

ThesisDefensePresentation_KyleIngersoll

Kyle Ingersoll

Srikanta Mishra

Society of Petroleum Engineers

Performance is a process of assessment of the algorithm. Speed and security is the performance to be achieved in determining which algorithm is better to use. In determining the optimum route, there are two algorithms that can be used for comparison. The Genetic and Primary algorithms are two very popular algorithms for determining the optimum route on the graph. Prim can minimize circuit to avoid connected loop. Prim will determine the best route based on active vertex. This algorithm is especially useful when applied in a minimum spanning tree case. Genetics works with probability properties. Genetics cannot determine which route has the maximum value. However, genetics can determine the overall optimum route based on appropriate parameters. Each algorithm can be used for the case of the shortest path, minimum spanning tree or traveling salesman problem. The Prim algorithm is superior to the speed of Genetics. The strength of the Genetic algorithm lies in the number of generations and population generated as well as the selection, crossover and mutation processes as the resultant support. The disadvantage of the Genetic algorithm is spending to much time to get the desired result. Overall, the Prim algorithm has better performance than Genetic especially for a large number of vertices.

Prim and Genetic Algorithms Performance in Determining Optimum Route on Graph

Universitas Pembangunan Panca Budi

Owing to the multitude of surrogate modeling techniques, developed in the recent years and the diverse characteristics offered by them, automated adaptive model selection approaches could be helpful in selecting the most suitable surrogate for a given problem. Surrogate selection could be performed at three different levels: (i) model type selection, (ii) basis (or kernel) function selection, and (iii) hyper-parameter selection where hyper- parameters are those kernel parameters that are generally given by the users. Unlike the majority of existing model selection techniques, this paper explores the development of a method that performs selection coherently at all the three levels. In this context, the REES method is used to provide measures of the median and maximum errors of a candidate surrogate model. Two approaches are used for the 3-level selection; (i) A Cascaded approach performs each level in a nested loop in the order going from model-kernel-hyper- parameters; (ii) A more advanced One-Step approach solves a MINLP to simultaneously optimize the model, kernel, and hyper-parameters. In both approaches, multiobjective optimization is performed to yield the best trade-offs between the estimated median and maximum errors. Candidate surrogates that are considered include (i) Kriging, (ii) Radial Basis Function (RBF), and (iii) Support Vector Regression (SVR), and multiple candidate kernels are allowed within these surrogate models. The 3-level REES-based model selection is compared with model selection based on error estimated on a large set of additional test points, for validation purposes. Numerical experiments on a 2-variable, 6-variable, and 18-variable test problems, and wind farm power generation problem, show that the proposed approach provides unique flexibility in model selection and is also reasonably accurate when compared with selection based on errors estimated on additional test points.

COSMOS1_Scitech_2014_Ali

MDO_Lab

Statsci

Vassilios Kelessidis

Brussels airport forecast

Mohammed Awad

International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.

Dj4201737746

IJERA Editor

March 2, 2018 - Machine Learning for Production Forecasting

David Fulford

Accurately estimating evaporation is necessary for calculating and scheduling irrigation water requirements. Current literature points to the use of individual machine learning models for better estimation of evaporation. However, such methods have not been used in the Indian framework. Moreover, given the diversity of climate, it is necessary to develop an ensemble technique incorporating a significant number of machine learning algorithms to have a better estimation of weekly evaporation. The purpose of this paper is to develop an ensemble technique that makes the machine learning models that have a better estimation of weekly evaporation. The results showed that the Bagging Random Forest model has a much better performance in estimating weekly evaporation compared to other fitted ensemble models. R. S. Parmar | G. B. Chaudhari | S. H. Bhojani "Estimating Evaporation using Machine Learning Based Ensemble Technique" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-7 | Issue-4, August 2023, URL: https://www.ijtsrd.com/papers/ijtsrd59847.pdf Paper Url:https://www.ijtsrd.com/engineering/agricultural-engineering/59847/estimating-evaporation-using-machine-learning-based-ensemble-technique/r-s-parmar

Estimating Evaporation using Machine Learning Based Ensemble Technique

ijtsrd

Kernel based swarm optimization for renewable energy application

Aboul Ella Hassanien

The thesis involved the reviewing of various case studies to determine the types of modelling, choice of algorithm, types of analytical approaches and trying to determine the various complexities arising from these cases. From these reviews, procedures have been proposed to improve the efficiency and manage the various types of complexities from using agile methodological perspective. Focus was mostly done on Customer Segmentation and Clustering , with the sole purpose to bridge Big Data and Business Intelligence together using Analytic.

Agile analytics : An exploratory study of technical complexity management

Agnirudra Sikdar

Modelling a CPS Swarm System: A Simple Case Study The CPSwarm workbench is a toolchain that facilitates the entire design process of swarms of CPS including modelling, design, optimization, simulation and deployment. This paper highlights part of the work of the CPSwarm workbench in the context of the CPSwarm H2020 project. In particular, the CPSwarm workbench allows to create a generic swarm library that can be customized by developers to design new swarm environments, new swarm members and new swarm goals. This paper shows an application of the initial CPSwarm workbench by the example of a reference problem called EmergencyExit. In this example a swarm of robots needs to find an exit in an unmapped environment and leave this room through the exit as soon as possible. The example problem is further used to show the integration of Modelio, a UML/SysML modelling tool, and FREVO, an optimization tool in the CPSwarm workbench

Modelsward 2018 Industrial Track - Alessandra Bagnato

Alessandra Bagnato

Optimizing Numerical Weather Prediction Model Performance Using Machine Learning Techniques. Shakas Technologies ( Galaxy of Knowledge) #11/A 2nd East Main Road, Gandhi Nagar, Vellore - 632006. Mobile : +91-9500218218 / 8220150373| land line- 0416- 3552723 Shakas Training & Development | Shakas Sales & Services | Shakas Educational Trust|IEEE projects | Research & Development | Journal Publication | Email : info@shakastech.com | shakastech@gmail.com | website: www.shakastech.com Facebook: https://www.facebook.com/pages/Shakas-Technologies

Optimizing Numerical Weather Prediction Model Performance Using Machine Learn...

Shakas Technologies

Table of Contents - Practical Business Analytics using SAS

Venkata Reddy Konasani

Research Proposal

Komlan Atitey

Similar to Undergraduate Modeling Workshop - Air Quality Working Group Final Presentation, May 25, 2018 (20)

Automatic algorithms for time series forecasting

MUMS Opening Workshop -On the Impact(s) of Structural Model Error on Simulati...

Can we predict the quality of spectrum-based fault localization?

AnnualAutomobileSalesPredictionusingARIMAModel (2).pdf

A data science observatory based on RAMP - rapid analytics and model prototyping

ThesisDefensePresentation_KyleIngersoll

Srikanta Mishra

Prim and Genetic Algorithms Performance in Determining Optimum Route on Graph

COSMOS1_Scitech_2014_Ali

Statsci

Brussels airport forecast

Dj4201737746

March 2, 2018 - Machine Learning for Production Forecasting

Estimating Evaporation using Machine Learning Based Ensemble Technique

Kernel based swarm optimization for renewable energy application

Agile analytics : An exploratory study of technical complexity management

Modelsward 2018 Industrial Track - Alessandra Bagnato

Optimizing Numerical Weather Prediction Model Performance Using Machine Learn...

Table of Contents - Practical Business Analytics using SAS

Research Proposal

More from The Statistical and Applied Mathematical Sciences Institute

Recently, the machine learning community has expressed strong interest in applying latent variable modeling strategies to causal inference problems with unobserved confounding. Here, I discuss one of the big debates that occurred over the past year, and how we can move forward. I will focus specifically on the failure of point identification in this setting, and discuss how this can be used to design flexible sensitivity analyses that cleanly separate identified and unidentified components of the causal model.

Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...

The Statistical and Applied Mathematical Sciences Institute

I will discuss paradigmatic statistical models of inference and learning from high dimensional data, such as sparse PCA and the perceptron neural network, in the sub-linear sparsity regime. In this limit the underlying hidden signal, i.e., the low-rank matrix in PCA or the neural network weights, has a number of non-zero components that scales sub-linearly with the total dimension of the vector. I will provide explicit low-dimensional variational formulas for the asymptotic mutual information between the signal and the data in suitable sparse limits. In the setting of support recovery these formulas imply sharp 0-1 phase transitions for the asymptotic minimum mean-square-error (or generalization error in the neural network setting). A similar phase transition was analyzed recently in the context of sparse high-dimensional linear regression by Reeves et al.

2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...

The Statistical and Applied Mathematical Sciences Institute

Many different measurement techniques are used to record neural activity in the brains of different organisms, including fMRI, EEG, MEG, lightsheet microscopy and direct recordings with electrodes. Each of these measurement modes have their advantages and disadvantages concerning the resolution of the data in space and time, the directness of measurement of the neural activity and which organisms they can be applied to. For some of these modes and for some organisms, significant amounts of data are now available in large standardized open-source datasets. I will report on our efforts to apply causal discovery algorithms to, among others, fMRI data from the Human Connectome Project, and to lightsheet microscopy data from zebrafish larvae. In particular, I will focus on the challenges we have faced both in terms of the nature of the data and the computational features of the discovery algorithms, as well as the modeling of experimental interventions.

Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...

The Statistical and Applied Mathematical Sciences Institute

Bayesian Additive Regression Trees (BART) has been shown to be an effective framework for modeling nonlinear regression functions, with strong predictive performance in a variety of contexts. The BART prior over a regression function is defined by independent prior distributions on tree structure and leaf or end-node parameters. In observational data settings, Bayesian Causal Forests (BCF) has successfully adapted BART for estimating heterogeneous treatment effects, particularly in cases where standard methods yield biased estimates due to strong confounding. We introduce BART with Targeted Smoothing, an extension which induces smoothness over a single covariate by replacing independent Gaussian leaf priors with smooth functions. We then introduce a new version of the Bayesian Causal Forest prior, which incorporates targeted smoothing for modeling heterogeneous treatment effects which vary smoothly over a target covariate. We demonstrate the utility of this approach by applying our model to a timely women's health and policy problem: comparing two dosing regimens for an early medical abortion protocol, where the outcome of interest is the probability of a successful early medical abortion procedure at varying gestational ages, conditional on patient covariates. We discuss the benefits of this approach in other women’s health and obstetrics modeling problems where gestational age is a typical covariate.

Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...

The Statistical and Applied Mathematical Sciences Institute

Difference-in-differences is a widely used evaluation strategy that draws causal inference from observational panel data. Its causal identification relies on the assumption of parallel trends, which is scale-dependent and may be questionable in some applications. A common alternative is a regression model that adjusts for the lagged dependent variable, which rests on the assumption of ignorability conditional on past outcomes. In the context of linear models, Angrist and Pischke (2009) show that the difference-in-differences and lagged-dependent-variable regression estimates have a bracketing relationship. Namely, for a true positive effect, if ignorability is correct, then mistakenly assuming parallel trends will overestimate the effect; in contrast, if the parallel trends assumption is correct, then mistakenly assuming ignorability will underestimate the effect. We show that the same bracketing relationship holds in general nonparametric (model-free) settings. We also extend the result to semiparametric estimation based on inverse probability weighting.

Causal Inference Opening Workshop - A Bracketing Relationship between Differe...

The Statistical and Applied Mathematical Sciences Institute

We develop sensitivity analyses for weak nulls in matched observational studies while allowing unit-level treatment effects to vary. In contrast to randomized experiments and paired observational studies, we show for general matched designs that over a large class of test statistics, any valid sensitivity analysis for the weak null must be unnecessarily conservative if Fisher's sharp null of no treatment effect for any individual also holds. We present a sensitivity analysis valid for the weak null, and illustrate why it is conservative if the sharp null holds through connections to inverse probability weighted estimators. An alternative procedure is presented that is asymptotically sharp if treatment effects are constant, and is valid for the weak null under additional assumptions which may be deemed reasonable by practitioners. The methods may be applied to matched observational studies constructed using any optimal without-replacement matching algorithm, allowing practitioners to assess robustness to hidden bias while allowing for treatment effect heterogeneity.

Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...

The Statistical and Applied Mathematical Sciences Institute

The world of health care is full of policy interventions: a state expands eligibility rules for its Medicaid program, a medical society changes its recommendations for screening frequency, a hospital implements a new care coordination program. After a policy change, we often want to know, “Did it work?” This is a causal question; we want to know whether the policy CAUSED outcomes to change. One popular way of estimating causal effects of policy interventions is a difference-in-differences study. In this controlled pre-post design, we measure the change in outcomes of people who are exposed to the new policy, comparing average outcomes before and after the policy is implemented. We contrast that change to the change over the same time period in people who were not exposed to the new policy. The differential change in the treated group’s outcomes, compared to the change in the comparison group’s outcomes, may be interpreted as the causal effect of the policy. To do so, we must assume that the comparison group’s outcome change is a good proxy for the treated group’s (counterfactual) outcome change in the absence of the policy. This conceptual simplicity and wide applicability in policy settings makes difference-in-differences an appealing study design. However, the apparent simplicity belies a thicket of conceptual, causal, and statistical complexity. In this talk, I will introduce the fundamentals of difference-in-differences studies and discuss recent innovations including key assumptions and ways to assess their plausibility, estimation, inference, and robustness checks.

Causal Inference Opening Workshop - Difference-in-differences: more than meet...

The Statistical and Applied Mathematical Sciences Institute

We present recent advances and statistical developments for evaluating Dynamic Treatment Regimes (DTR), which allow the treatment to be dynamically tailored according to evolving subject-level data. Identification of an optimal DTR is a key component for precision medicine and personalized health care. Specific topics covered in this talk include several recent projects with robust and flexible methods developed for the above research area. We will first introduce a dynamic statistical learning method, adaptive contrast weighted learning (ACWL), which combines doubly robust semiparametric regression estimators with flexible machine learning methods. We will further develop a tree-based reinforcement learning (T-RL) method, which builds an unsupervised decision tree that maintains the nature of batch-mode reinforcement learning. Unlike ACWL, T-RL handles the optimization problem with multiple treatment comparisons directly through a purity measure constructed with augmented inverse probability weighted estimators. T-RL is robust, efficient and easy to interpret for the identification of optimal DTRs. However, ACWL seems more robust against tree-type misspecification than T-RL when the true optimal DTR is non-tree-type. At the end of this talk, we will also present a new Stochastic-Tree Search method called ST-RL for evaluating optimal DTRs.

Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...

The Statistical and Applied Mathematical Sciences Institute

A fundamental feature of evaluating causal health effects of air quality regulations is that air pollution moves through space, rendering health outcomes at a particular population location dependent upon regulatory actions taken at multiple, possibly distant, pollution sources. Motivated by studies of the public-health impacts of power plant regulations in the U.S., this talk introduces the novel setting of bipartite causal inference with interference, which arises when 1) treatments are defined on observational units that are distinct from those at which outcomes are measured and 2) there is interference between units in the sense that outcomes for some units depend on the treatments assigned to many other units. Interference in this setting arises due to complex exposure patterns dictated by physical-chemical atmospheric processes of pollution transport, with intervention effects framed as propagating across a bipartite network of power plants and residential zip codes. New causal estimands are introduced for the bipartite setting, along with an estimation approach based on generalized propensity scores for treatments on a network. The new methods are deployed to estimate how emission-reduction technologies implemented at coal-fired power plants causally affect health outcomes among Medicare beneficiaries in the U.S.

Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...

The Statistical and Applied Mathematical Sciences Institute

Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...

The Statistical and Applied Mathematical Sciences Institute

We provide an overview of some recent developments in machine learning tools for dynamic treatment regime discovery in precision medicine. The first development is a new off-policy reinforcement learning tool for continual learning in mobile health to enable patients with type 1 diabetes to exercise safely. The second development is a new inverse reinforcement learning tool that enables the use of observational data to learn how clinicians balance competing priorities for treating depression and mania in patients with bipolar disorder. Both practical and technical challenges are discussed.

Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...

The Statistical and Applied Mathematical Sciences Institute

The method of differences-in-differences (DID) is widely used to estimate causal effects. The primary advantage of DID is that it can account for time-invariant bias from unobserved confounders. However, the standard DID estimator will be biased if there is an interaction between history in the after period and the groups. That is, bias will be present if an event besides the treatment occurs at the same time and affects the treated group in a differential fashion. We present a method of bounds based on DID that accounts for an unmeasured confounder that has a differential effect in the post-treatment time period. These DID bracketing bounds are simple to implement and only require partitioning the controls into two separate groups. We also develop two key extensions for DID bracketing bounds. First, we develop a new falsification test to probe the key assumption that is necessary for the bounds estimator to provide consistent estimates of the treatment effect. Next, we develop a method of sensitivity analysis that adjusts the bounds for possible bias based on differences between the treated and control units from the pretreatment period. We apply these DID bracketing bounds and the new methods we develop to a study of the effect of voter identification laws on turnout. Specifically, we focus on estimating whether the enactment of voter identification laws in Georgia and Indiana had an effect on voter turnout.
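
The basic arithmetic behind the abstract above can be illustrated with a minimal sketch. It computes the standard two-group, two-period DID estimate and then forms a simple bracket by repeating the calculation with two separate control groups. The function names and toy numbers are hypothetical, and the bracket shown here is only a simplified illustration of the idea of partitioning the controls, not the authors' estimator, falsification test, or sensitivity analysis.

```python
from statistics import mean


def did(treated_pre, treated_post, control_pre, control_post):
    """Standard 2x2 difference-in-differences estimate.

    Change in the treated group's mean outcome minus the change
    in the control group's mean outcome.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


def did_bracket(treated_pre, treated_post, controls_a, controls_b):
    """Simplified bracket: DID computed against two control groups.

    controls_a and controls_b are (pre, post) pairs of outcome lists.
    This is an illustrative assumption, not the published method.
    """
    est_a = did(treated_pre, treated_post, *controls_a)
    est_b = did(treated_pre, treated_post, *controls_b)
    return min(est_a, est_b), max(est_a, est_b)


# Toy (made-up) outcome data for illustration only.
treated_pre, treated_post = [10, 12], [15, 17]
controls_a = ([8, 10], [9, 11])      # control group with a lower baseline
controls_b = ([14, 16], [17, 19])    # control group with a higher baseline

lower, upper = did_bracket(treated_pre, treated_post, controls_a, controls_b)
print(lower, upper)
```

If the two control groups are affected by the unmeasured post-period event in opposite directions relative to the treated group, the two DID estimates would fall on either side of the true effect; whether that holds is exactly the assumption the abstract says must be probed.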

Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...

The Statistical and Applied Mathematical Sciences Institute

Causal Inference Opening Workshop - Assisting the Impact of State Policies: Br...

The Statistical and Applied Mathematical Sciences Institute

We study experimental design in large-scale stochastic systems with substantial uncertainty and structured cross-unit interference. We consider the problem of a platform that seeks to optimize supply-side payments p in a centralized marketplace where different suppliers interact via their effects on the overall supply-demand equilibrium, and propose a class of local experimentation schemes that can be used to optimize these payments without perturbing the overall market equilibrium. We show that, as the system size grows, our scheme can estimate the gradient of the platform's utility with respect to p while perturbing the overall market equilibrium by only a vanishingly small amount. We can then use these gradient estimates to optimize p via any stochastic first-order optimization method. These results stem from the insight that, while the system involves a large number of interacting units, any interference can only be channeled through a small number of key statistics, and this structure allows us to accurately predict feedback effects that arise from global system changes using only information collected while remaining in equilibrium.
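
The general pattern of estimating a gradient from small unit-level perturbations and feeding it to a first-order method can be sketched as follows. This is a toy illustration under strong assumptions: the response function, the balanced plus/minus perturbation design, and all names and parameter values are hypothetical, and the sketch ignores the market equilibrium and interference structure that the talk actually addresses.

```python
def estimate_gradient(response, p, zeta, n_units):
    """Estimate d(response)/dp from small balanced perturbations.

    Each unit i receives payment p + zeta * s_i with s_i in {+1, -1},
    assigned so that the signs balance (n_units should be even).
    The estimate is the average of s_i * y_i, rescaled by zeta.
    """
    signs = [1 if i % 2 == 0 else -1 for i in range(n_units)]
    outcomes = [response(p + zeta * s) for s in signs]
    return sum(s * y for s, y in zip(signs, outcomes)) / (n_units * zeta)


def optimize_payment(response, p0, zeta=0.01, n_units=100, step=0.1, n_iters=200):
    """Plain gradient ascent on p using the perturbation-based estimates."""
    p = p0
    for _ in range(n_iters):
        p += step * estimate_gradient(response, p, zeta, n_units)
    return p


def toy_response(p):
    """Hypothetical per-unit utility, maximized at p = 3."""
    return -(p - 3.0) ** 2


print(estimate_gradient(toy_response, 1.0, 0.1, 4))
print(optimize_payment(toy_response, p0=1.0))
```

Because the perturbations are balanced and of size zeta, the average payment stays at p, which mirrors (in a very loose sense) the goal of learning the gradient while barely moving the overall system.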

Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...

The Statistical and Applied Mathematical Sciences Institute

We discuss a general roadmap for generating causal inference based on observational studies used to generate real-world evidence. We review targeted minimum loss estimation (TMLE), which provides a general template for the construction of asymptotically efficient plug-in estimators of a target estimand for realistic (i.e., infinite-dimensional) statistical models. TMLE is a two-stage procedure that first involves using ensemble machine learning, termed super-learning, to estimate the relevant stochastic relations between the treatment, censoring, covariates and outcome of interest. The super-learner allows one to fully utilize all the advances in machine learning (in addition to more conventional parametric model-based estimators) to build a single most powerful ensemble machine learning algorithm. We present Highly Adaptive Lasso as an important machine learning algorithm to include. In the second step, the TMLE involves maximizing a parametric likelihood along a so-called least favorable parametric model through the super-learner fit of the relevant stochastic relations in the observed data. This second step bridges the state of the art in machine learning to estimators of target estimands for which statistical inference is available (i.e., confidence intervals, p-values, etc.). We also review recent advances in collaborative TMLE in which the fit of the treatment and censoring mechanism is tailored w.r.t. performance of TMLE. We also discuss asymptotically valid bootstrap-based inference. Simulations and data analyses are provided as demonstrations.

Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...

The Statistical and Applied Mathematical Sciences Institute

We describe different approaches for specifying models and prior distributions for estimating heterogeneous treatment effects using Bayesian nonparametric models. We make an affirmative case for direct, informative (or partially informative) prior distributions on heterogeneous treatment effects, especially when treatment effect size and treatment effect variation are small relative to other sources of variability. We also consider how to provide scientifically meaningful summaries of complicated, high-dimensional posterior distributions over heterogeneous treatment effects with appropriate measures of uncertainty.

Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...

The Statistical and Applied Mathematical Sciences Institute

Climate change mitigation has traditionally been analyzed as some version of a public goods game (PGG) in which a group is most successful if everybody contributes, but players are best off individually by not contributing anything (i.e., “free-riding”)—thereby creating a social dilemma. Analysis of climate change using the PGG and its variants has helped explain why global cooperation on GHG reductions is so difficult, as nations have an incentive to free-ride on the reductions of others. Rather than inspire collective action, it seems that the lack of progress in addressing the climate crisis is driving the search for a “quick fix” technological solution that circumvents the need for cooperation.
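
The social dilemma described above can be made concrete with a minimal sketch of a linear public goods game. The payoff rule used here (each player keeps what they do not contribute, and the pooled contributions are multiplied and split equally) is the standard textbook form; the endowment, multiplier, and function name are illustrative assumptions, not values from the lecture.

```python
def pgg_payoffs(contributions, endowment=10.0, multiplier=2.0):
    """Payoffs in a linear public goods game.

    Each player keeps (endowment - contribution) and receives an equal
    share of the multiplied pot. A dilemma arises when
    1 < multiplier < number of players.
    """
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


# Everybody contributes fully: the group does best.
print(pgg_payoffs([10, 10, 10, 10]))

# One player free-rides: that player earns more than the contributors.
print(pgg_payoffs([0, 10, 10, 10]))

# Nobody contributes: everyone is worse off than under full cooperation.
print(pgg_payoffs([0, 0, 0, 0]))
```

With four players and a multiplier of 2, each unit contributed returns only 0.5 to the contributor, so withholding is individually better even though full contribution doubles everyone's payoff.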

2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...

The Statistical and Applied Mathematical Sciences Institute

2019 Fall Series: Professional Development, Writing Academic Papers…What Work...

The Statistical and Applied Mathematical Sciences Institute

Machine learning (including deep and reinforcement learning) and blockchain are two of the most noticeable technologies in recent years. The first one is the foundation of artificial intelligence and big data, and the second one has significantly disrupted the financial industry. Both technologies are data-driven, and thus there is rapidly growing interest in integrating them for more secure and efficient data sharing and analysis. In this paper, we review the research on combining blockchain and machine learning technologies and demonstrate that they can collaborate efficiently and effectively. In the end, we point out some future directions and expect more research on deeper integration of the two promising technologies.

2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...

The Statistical and Applied Mathematical Sciences Institute

2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...

The Statistical and Applied Mathematical Sciences Institute

More from The Statistical and Applied Mathematical Sciences Institute (20)