Evaluation of 12 Runs of Personalized Tasks with Anonymous Peer Review (Mathias Magdowski)
The approach of personalizable tasks with anonymous peer review has already been tried out in twelve runs of the course "Grundlagen der Elektrotechnik" (Fundamentals of Electrical Engineering) at Otto von Guericke University, from the winter semester of 2017/2018 through the summer semester of 2019. This slide deck briefly presents all eight task types developed so far, together with several evaluation charts (scores achieved, deviations between peer-review assessments, and submissions over time).
Personalizable Tasks and Anonymous Peer Review (Mathias Magdowski)
The slide deck discusses personalizable tasks using an example from the course "Grundlagen der Elektrotechnik". Each student receives an individual task by e-mail, solves it, and submits the solution for correction through a learning management system such as Moodle. To reduce the grading workload for instructors, the students then review each other's solutions against a likewise personalized model solution. The entire procedure runs automatically and therefore scales well. Unlike simple multiple-choice or numeric-answer tasks, it also allows the approach and the solution path to be assessed.
Bachelor Thesis Presentation: Modeling and Numerical Simulation of Mole... (Noah Oberweis)
Abstract: This thesis uses Langevin dynamics as an example for introducing various concepts from the mathematical field of differential equations. At the outset, a non-stochastic variant of the Langevin equation is used to apply symplectic integrators and related methods. Stochastic differential equations (SDEs) are then introduced in order to build an understanding of their basic theory and subsequently to prove both existence and uniqueness of the solution of the Langevin equation. These foundations are needed in the final part of the thesis to treat several numerical methods that extract various quantities from the stochastic Langevin equation, such as minimum energy paths and the transition time of the system.
Symmetry Detection in Theory and Practice (Marcus Riemer)
FH Wedel development meets FU Berlin research
Symmetries are aesthetic, impressive - but are they practical? Many everyday products exhibit symmetries. Could this property be exploited for quality assurance, for example? To do so, a machine would have to detect symmetries. But how do you explain to a computer what it is seeing?
Such a detection method is currently being tested at FH Wedel in cooperation with FU Berlin. The two computer science students present the practical challenges of implementing such a method on the one hand, while also giving due attention to the theory developed by the FU Berlin research group on the other.
This project is part of the long-standing cooperation between FH Wedel and the Department of Mathematics and Computer Science at FU Berlin. FU Berlin offers master's graduates excellent conditions for doctoral studies. As part of this exchange, professors from FU Berlin have already given lectures and talks in Wedel several times.
About the speakers:
Marcus Riemer and Timm Hoffmann are both students in the master's program in computer science at FH Wedel. Both also work as research assistants at the university.
All students in our course "Grundlagen der Elektrotechnik" (Fundamentals of Electrical Engineering) receive an individual task by e-mail, can solve it by hand, and can submit a scanned or photographed solution online for correction. To reduce the grading workload for instructors, the students then review each other's solutions against a likewise personalized model solution. The procedure runs automatically via MATLAB and therefore scales well. Unlike simple multiple-choice or value-plus-unit tasks, it also allows the approach and the solution path to be assessed. This article describes how the tasks and model solutions are generated in LaTeX using the PGFPlots and Circuitikz packages.
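The personalization step can be illustrated with a short sketch. This is a hypothetical Python stand-in for the MATLAB pipeline described above; the voltage-divider task, the component value ranges, and the function name are invented for illustration:

```python
import random

def personalized_task(student_id: str) -> tuple[str, float]:
    """Generate a personalized voltage-divider task (Circuitikz LaTeX source)
    and the matching model-solution value. Illustrative only; the actual
    course generates its tasks via MATLAB with PGFPlots/Circuitikz."""
    rng = random.Random(student_id)          # reproducible per student ID
    U = rng.choice([6, 9, 12, 24])           # source voltage in V
    R1 = rng.randrange(100, 1000, 100)       # resistances in Ohm
    R2 = rng.randrange(100, 1000, 100)
    latex = (
        "\\begin{circuitikz}\n"
        f"  \\draw (0,0) to[V, l=${U}\\,\\mathrm{{V}}$] (0,2)\n"
        f"        to[R, l=${R1}\\,\\Omega$] (2,2)\n"
        f"        to[R, l=${R2}\\,\\Omega$] (2,0) -- (0,0);\n"
        "\\end{circuitikz}\n"
    )
    U2 = U * R2 / (R1 + R2)                  # model solution: voltage across R2
    return latex, U2
```

Seeding the generator with the student ID makes both the task and the personalized model solution reproducible, which is what allows the matching solution to be sent to the peer reviewer later.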
Network Calculation with Node Voltage Analysis in MATLAB and LTspice (Mathias Magdowski)
The slide deck motivates the need for systematic network analysis methods such as node voltage analysis using a small example circuit that is otherwise solvable only by means of a star-delta or delta-star transformation. The steps of node voltage analysis are then reviewed, and the resulting system of equations is solved in MATLAB/GNU Octave. Finally, simulation of the network in the circuit simulator LTspice is explained.
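The solving step reduces to a small linear system. The sketch below uses Python/NumPy rather than MATLAB, and the two-node circuit and its component values are a hypothetical example, not the one from the slides:

```python
import numpy as np

# Node voltage analysis reduces the circuit to a linear system G @ v = i,
# where G is the conductance matrix and i collects the source currents.
# Hypothetical two-node example: a 1 A source into node 1, R1 = 2 Ohm from
# node 1 to ground, R2 = 4 Ohm between the nodes, R3 = 4 Ohm from node 2
# to ground.
G1, G2, G3 = 1 / 2, 1 / 4, 1 / 4     # conductances in siemens

# Diagonal entries: sum of conductances attached to the node.
# Off-diagonal entries: negative conductance between the two nodes.
G = np.array([[G1 + G2, -G2],
              [-G2, G2 + G3]])
i = np.array([1.0, 0.0])             # source current injected into each node

v = np.linalg.solve(G, i)            # node voltages: v ≈ [1.6, 0.8] V
```

In MATLAB/GNU Octave the last line would be `v = G\i`.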
Recently, the machine learning community has expressed strong interest in applying latent variable modeling strategies to causal inference problems with unobserved confounding. Here, I discuss one of the big debates that occurred over the past year, and how we can move forward. I will focus specifically on the failure of point identification in this setting, and discuss how this can be used to design flexible sensitivity analyses that cleanly separate identified and unidentified components of the causal model.
I will discuss paradigmatic statistical models of inference and learning from high dimensional data, such as sparse PCA and the perceptron neural network, in the sub-linear sparsity regime. In this limit the underlying hidden signal, i.e., the low-rank matrix in PCA or the neural network weights, has a number of non-zero components that scales sub-linearly with the total dimension of the vector. I will provide explicit low-dimensional variational formulas for the asymptotic mutual information between the signal and the data in suitable sparse limits. In the setting of support recovery these formulas imply sharp 0-1 phase transitions for the asymptotic minimum mean-square-error (or generalization error in the neural network setting). A similar phase transition was analyzed recently in the context of sparse high-dimensional linear regression by Reeves et al.
Many different measurement techniques are used to record neural activity in the brains of different organisms, including fMRI, EEG, MEG, lightsheet microscopy, and direct recordings with electrodes. Each of these measurement modes has its advantages and disadvantages concerning the spatial and temporal resolution of the data, how directly it measures neural activity, and which organisms it can be applied to. For some of these modes and organisms, significant amounts of data are now available in large standardized open-source datasets. I will report on our efforts to apply causal discovery algorithms to, among others, fMRI data from the Human Connectome Project and lightsheet microscopy data from zebrafish larvae. In particular, I will focus on the challenges we have faced in terms of the nature of the data, the computational features of the discovery algorithms, and the modeling of experimental interventions.
1) The document presents a statistical modeling approach called targeted smooth Bayesian causal forests (tsbcf) to smoothly estimate heterogeneous treatment effects over gestational age using observational data from early medical abortion regimens.
2) The tsbcf method extends Bayesian additive regression trees (BART) to estimate treatment effects that evolve smoothly over gestational age, while allowing for heterogeneous effects across patient subgroups.
3) The tsbcf analysis of early medical abortion regimen data found the simultaneous administration to be similarly effective overall to the interval administration, but identified some patient subgroups where effectiveness may vary more over gestational age.
Difference-in-differences is a widely used evaluation strategy that draws causal inference from observational panel data. Its causal identification relies on the assumption of parallel trends, which is scale-dependent and may be questionable in some applications. A common alternative is a regression model that adjusts for the lagged dependent variable, which rests on the assumption of ignorability conditional on past outcomes. In the context of linear models, Angrist and Pischke (2009) show that the difference-in-differences and lagged-dependent-variable regression estimates have a bracketing relationship. Namely, for a true positive effect, if ignorability is correct, then mistakenly assuming parallel trends will overestimate the effect; in contrast, if the parallel trends assumption is correct, then mistakenly assuming ignorability will underestimate the effect. We show that the same bracketing relationship holds in general nonparametric (model-free) settings. We also extend the result to semiparametric estimation based on inverse probability weighting.
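The bracketing relationship can be made concrete on a simulated two-period panel. The data-generating process below is a hypothetical illustration, not the authors' construction: ignorability conditional on the lagged outcome holds by design, so the lagged-dependent-variable regression recovers the true effect while difference-in-differences overestimates it, matching the first branch of the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
tau = 1.0                                     # true treatment effect

# Pre-period outcome; units with LOW y0 are more likely to be treated
# (an Ashenfelter-dip-style selection). Ignorability given y0 holds.
y0 = rng.normal(0, np.sqrt(2), n)
d = (rng.normal(0, 1, n) - y0 > 0).astype(float)
y1 = 0.5 * y0 + tau * d + rng.normal(0, 1, n)  # post-period outcome

# Difference-in-differences: difference of mean changes across groups.
# It mistakenly assumes parallel trends, so it overestimates tau here.
did = (y1[d == 1] - y0[d == 1]).mean() - (y1[d == 0] - y0[d == 0]).mean()

# Lagged-dependent-variable regression: OLS of y1 on intercept, d, and y0.
# Ignorability given y0 is correct here, so this recovers tau.
X = np.column_stack([np.ones(n), d, y0])
ldv = np.linalg.lstsq(X, y1, rcond=None)[0][1]
```

With this sample size, `ldv` lands near the true effect of 1 while `did` comes out substantially larger, so the truth is bracketed from above by the DiD estimate, exactly as the abstract describes for the case where ignorability is correct.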
We develop sensitivity analyses for weak nulls in matched observational studies while allowing unit-level treatment effects to vary. In contrast to randomized experiments and paired observational studies, we show for general matched designs that over a large class of test statistics, any valid sensitivity analysis for the weak null must be unnecessarily conservative if Fisher's sharp null of no treatment effect for any individual also holds. We present a sensitivity analysis valid for the weak null, and illustrate why it is conservative if the sharp null holds through connections to inverse probability weighted estimators. An alternative procedure is presented that is asymptotically sharp if treatment effects are constant, and is valid for the weak null under additional assumptions which may be deemed reasonable by practitioners. The methods may be applied to matched observational studies constructed using any optimal without-replacement matching algorithm, allowing practitioners to assess robustness to hidden bias while allowing for treatment effect heterogeneity.
This document discusses difference-in-differences (DiD) analysis, a quasi-experimental method used to estimate treatment effects. The author notes that while widely applicable, DiD relies on strong assumptions about the counterfactual. She recommends approaches like matching on observed variables between similar populations, thoughtfully specifying regression models to adjust for confounding factors, testing for parallel pre-treatment trends under different assumptions, and considering more complex models that allow for different types of changes over time. The overall message is that DiD requires careful consideration and testing of its underlying assumptions to draw valid causal conclusions.
We present recent advances and statistical developments for evaluating Dynamic Treatment Regimes (DTR), which allow the treatment to be dynamically tailored according to evolving subject-level data. Identification of an optimal DTR is a key component for precision medicine and personalized health care. Specific topics covered in this talk include several recent projects with robust and flexible methods developed for the above research area. We will first introduce a dynamic statistical learning method, adaptive contrast weighted learning (ACWL), which combines doubly robust semiparametric regression estimators with flexible machine learning methods. We will further develop a tree-based reinforcement learning (T-RL) method, which builds an unsupervised decision tree that maintains the nature of batch-mode reinforcement learning. Unlike ACWL, T-RL handles the optimization problem with multiple treatment comparisons directly through a purity measure constructed with augmented inverse probability weighted estimators. T-RL is robust, efficient and easy to interpret for the identification of optimal DTRs. However, ACWL seems more robust against tree-type misspecification than T-RL when the true optimal DTR is non-tree-type. At the end of this talk, we will also present a new Stochastic-Tree Search method called ST-RL for evaluating optimal DTRs.
A fundamental feature of evaluating causal health effects of air quality regulations is that air pollution moves through space, rendering health outcomes at a particular population location dependent upon regulatory actions taken at multiple, possibly distant, pollution sources. Motivated by studies of the public-health impacts of power plant regulations in the U.S., this talk introduces the novel setting of bipartite causal inference with interference, which arises when 1) treatments are defined on observational units that are distinct from those at which outcomes are measured and 2) there is interference between units in the sense that outcomes for some units depend on the treatments assigned to many other units. Interference in this setting arises due to complex exposure patterns dictated by physical-chemical atmospheric processes of pollution transport, with intervention effects framed as propagating across a bipartite network of power plants and residential zip codes. New causal estimands are introduced for the bipartite setting, along with an estimation approach based on generalized propensity scores for treatments on a network. The new methods are deployed to estimate how emission-reduction technologies implemented at coal-fired power plants causally affect health outcomes among Medicare beneficiaries in the U.S.
Laine Thomas presented information about how causal inference is being used to determine the costs and benefits of the two most common surgical treatments for women: hysterectomy and myomectomy.
We provide an overview of some recent developments in machine learning tools for dynamic treatment regime discovery in precision medicine. The first development is a new off-policy reinforcement learning tool for continual learning in mobile health to enable patients with type 1 diabetes to exercise safely. The second development is a new inverse reinforcement learning tool that enables the use of observational data to learn how clinicians balance competing priorities when treating depression and mania in patients with bipolar disorder. Both practical and technical challenges are discussed.
The method of differences-in-differences (DID) is widely used to estimate causal effects. The primary advantage of DID is that it can account for time-invariant bias from unobserved confounders. However, the standard DID estimator will be biased if there is an interaction between history in the after period and the groups. That is, bias will be present if an event besides the treatment occurs at the same time and affects the treated group in a differential fashion. We present a method of bounds based on DID that accounts for an unmeasured confounder that has a differential effect in the post-treatment time period. These DID bracketing bounds are simple to implement and only require partitioning the controls into two separate groups. We also develop two key extensions for DID bracketing bounds. First, we develop a new falsification test to probe the key assumption that is necessary for the bounds estimator to provide consistent estimates of the treatment effect. Next, we develop a method of sensitivity analysis that adjusts the bounds for possible bias based on differences between the treated and control units from the pretreatment period. We apply these DID bracketing bounds and the new methods we develop to an application on the effect of voter identification laws on turnout. Specifically, we focus on estimating whether the enactment of voter identification laws in Georgia and Indiana had an effect on voter turnout.
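The core of the bounds construction can be sketched numerically: partition the controls into two groups, compute a DiD estimate against each, and take the interval the two estimates span. All numbers below are made up for illustration, and the paper's falsification test and sensitivity analysis are not reproduced:

```python
import numpy as np

def did(t_pre, t_post, c_pre, c_post):
    """Standard difference-in-differences estimate against one control group."""
    return (t_post.mean() - t_pre.mean()) - (c_post.mean() - c_pre.mean())

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical turnout-style data: treated turnout rises from 0.50 to 0.56,
# of which +0.03 is the true effect and +0.03 is a secular trend.
t_pre, t_post = rng.normal(0.50, 0.05, n), rng.normal(0.56, 0.05, n)

# Controls partitioned into two groups whose time trends are believed to
# bracket the treated group's counterfactual trend.
c1_pre, c1_post = rng.normal(0.50, 0.05, n), rng.normal(0.51, 0.05, n)  # weak trend
c2_pre, c2_post = rng.normal(0.50, 0.05, n), rng.normal(0.55, 0.05, n)  # strong trend

est1 = did(t_pre, t_post, c1_pre, c1_post)   # about 0.05
est2 = did(t_pre, t_post, c2_pre, c2_post)   # about 0.01
lower, upper = min(est1, est2), max(est1, est2)
```

Under the bracketing assumption the true effect (here 0.03) lies inside `[lower, upper]`; the falsification test and sensitivity analysis in the paper are then used to probe and adjust that assumption.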
This document summarizes a simulation study evaluating causal inference methods for assessing the effects of opioid and gun policies. The study used real US state-level data to simulate the adoption of policies by some states and estimated the effects using different statistical models. It found that with fewer adopting states, type 1 error rates were too high, and most models lacked power. It recommends using cluster-robust standard errors and lagged outcomes to improve model performance. The study aims to help identify best practices for policy evaluation studies.
We study experimental design in large-scale stochastic systems with substantial uncertainty and structured cross-unit interference. We consider the problem of a platform that seeks to optimize supply-side payments p in a centralized marketplace where different suppliers interact via their effects on the overall supply-demand equilibrium, and propose a class of local experimentation schemes that can be used to optimize these payments without perturbing the overall market equilibrium. We show that, as the system size grows, our scheme can estimate the gradient of the platform’s utility with respect to p while perturbing the overall market equilibrium by only a vanishingly small amount. We can then use these gradient estimates to optimize p via any stochastic first-order optimization method. These results stem from the insight that, while the system involves a large number of interacting units, any interference can only be channeled through a small number of key statistics, and this structure allows us to accurately predict feedback effects that arise from global system changes using only information collected while remaining in equilibrium.
We discuss a general roadmap for generating causal inference based on observational studies used to generate real-world evidence. We review targeted minimum loss estimation (TMLE), which provides a general template for the construction of asymptotically efficient plug-in estimators of a target estimand for realistic (i.e., infinite-dimensional) statistical models. TMLE is a two-stage procedure that first involves using ensemble machine learning, termed super-learning, to estimate the relevant stochastic relations between the treatment, censoring, covariates, and outcome of interest. The super-learner allows one to fully utilize all the advances in machine learning (in addition to more conventional parametric-model-based estimators) to build a single most powerful ensemble machine learning algorithm. We present the Highly Adaptive Lasso as an important machine learning algorithm to include.
In the second step, the TMLE involves maximizing a parametric likelihood along a so-called least favorable parametric model through the super-learner fit of the relevant stochastic relations in the observed data. This second step bridges the state of the art in machine learning to estimators of target estimands for which statistical inference is available (i.e., confidence intervals, p-values, etc.). We also review recent advances in collaborative TMLE, in which the fit of the treatment and censoring mechanism is tailored w.r.t. the performance of the TMLE. We also discuss asymptotically valid bootstrap-based inference. Simulations and data analyses are provided as demonstrations.
We describe different approaches for specifying models and prior distributions for estimating heterogeneous treatment effects using Bayesian nonparametric models. We make an affirmative case for direct, informative (or partially informative) prior distributions on heterogeneous treatment effects, especially when treatment effect size and treatment effect variation are small relative to other sources of variability. We also consider how to provide scientifically meaningful summaries of complicated, high-dimensional posterior distributions over heterogeneous treatment effects with appropriate measures of uncertainty.
Climate change mitigation has traditionally been analyzed as some version of a public goods game (PGG) in which a group is most successful if everybody contributes, but players are best off individually by not contributing anything (i.e., “free-riding”)—thereby creating a social dilemma. Analysis of climate change using the PGG and its variants has helped explain why global cooperation on GHG reductions is so difficult, as nations have an incentive to free-ride on the reductions of others. Rather than inspire collective action, it seems that the lack of progress in addressing the climate crisis is driving the search for a “quick fix” technological solution that circumvents the need for cooperation.
This document discusses various types of academic writing and provides tips for effective academic writing. It outlines common academic writing formats such as journal papers, books, and reports. It also lists writing necessities like having a clear purpose, understanding your audience, using proper grammar and being concise. The document cautions against plagiarism and not proofreading. It provides additional dos and don'ts for writing, such as using simple language and avoiding filler words. Overall, the key message is that academic writing requires selling your ideas effectively to the reader.
Machine learning (including deep and reinforcement learning) and blockchain are two of the most noticeable technologies in recent years. The first one is the foundation of artificial intelligence and big data, and the second one has significantly disrupted the financial industry. Both technologies are data-driven, and thus there are rapidly growing interests in integrating them for more secure and efficient data sharing and analysis. In this paper, we review the research on combining blockchain and machine learning technologies and demonstrate that they can collaborate efficiently and effectively. In the end, we point out some future directions and expect more researches on deeper integration of the two promising technologies.
In this talk, we discuss QuTrack, a Blockchain-based approach to track experiment and model changes primarily for AI and ML models. In addition, we discuss how change analytics can be used for process improvement and to enhance the model development and deployment processes.
More from The Statistical and Applied Mathematical Sciences Institute (20)
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2018 MUMS Fall Course - Approximations and Computation - Jim Berger, September 25, 2018
SAMSI MUMS Course Fall, 2018

Lecture 5: Approximations and Computation
Outline

• Marginal likelihood computation via importance sampling
• Adaptive importance sampling and exoplanets
• Laplace approximation and BIC
• Search in large model spaces and the inference challenge
I. Marginal Likelihood Computation via Importance Sampling
Goal: Computation of m(x) = ∫ f(x | θ) π(θ) dθ.

Importance sampling: Choose a proper distribution q(θ) (called the importance function) that is easy to generate from, and such that q(θ) is roughly proportional to f(x | θ)π(θ). Then

• Generate θ^(1), θ^(2), . . . , θ^(L) from q(θ).
• Approximate m(x) by

  m(x) = ∫ [f(x | θ)π(θ) / q(θ)] q(θ) dθ ≈ (1/L) Σ_{i=1}^L f(x | θ^(i)) π(θ^(i)) / q(θ^(i)) ≡ m̂(x).

By the Law of Large Numbers, m̂(x) → m(x) as L → ∞.

Choice of q is crucial: It should
– be easy to generate from;
– be roughly proportional to f(x | θ)π(θ);
– have tails that are somewhat heavier than those of f(x | θ)π(θ).
The reason q(θ) can't have too sharp tails is that the variance of m̂(x) is V/L, where

  V = ∫ [f(x | θ)π(θ)]² / q(θ) dθ − m²(x),

and this may not exist if q(θ) has tails that are too light, which can result in an extremely slowly converging algorithm.

Note: Assuming this variance is finite, it can be estimated by V̂/L, where

  V̂ = (1/L) Σ_{i=1}^L [f(x | θ^(i)) π(θ^(i)) / q(θ^(i))]² − m̂(x)².

• One of the great advantages of importance sampling is the ease with which one can estimate accuracy.
• Some care is needed: monitor V̂ as L increases to make sure it is not increasing.
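A minimal sketch of the estimator m̂(x) and its variance diagnostic V̂/L, on a toy conjugate model where m(x) is known in closed form. The model, data value, and proposal below are assumptions for illustration, not from the slides:

```python
# Toy model (assumption): x ~ N(theta, 1) with prior theta ~ N(0, 1),
# so the marginal is m(x) = N(x | 0, 2) exactly.
import math
import random

random.seed(0)

def norm_pdf(z, mu, var):
    return math.exp(-(z - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x = 1.3
# Importance function q: normal centered at the posterior mean x/2,
# with variance inflated beyond the integrand's (heavier tails).
q_mu, q_var = x / 2, 2.0

L = 50_000
w = []
for _ in range(L):
    theta = random.gauss(q_mu, math.sqrt(q_var))
    w.append(norm_pdf(x, theta, 1.0) * norm_pdf(theta, 0.0, 1.0)
             / norm_pdf(theta, q_mu, q_var))

m_hat = sum(w) / L                                  # estimate of m(x)
V_hat = sum(wi * wi for wi in w) / L - m_hat ** 2   # estimated V
std_err = math.sqrt(V_hat / L)                      # the "easy accuracy estimate"

m_exact = norm_pdf(x, 0.0, 2.0)
print(m_hat, m_exact, std_err)
```

Rerunning with increasing L while watching V̂ implements the monitoring advice above.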
Common choices of the importance function:

• If there is little data, choosing q(θ) = π(θ) is okay, and then m̂(x) = (1/L) Σ_{i=1}^L f(x | θ^(i)).
• If there is a lot of data and π(θ) does not have sharp tails (e.g., is Cauchy), choosing q(θ) ∝ f(x | θ) is okay, if the likelihood is easy to generate from (e.g., is normal).
• A common choice is q(θ) = N(θ | θ̂, Î^{−1}), where θ̂ and Î are the mle and observed Fisher information matrix for θ.
  – But the normal distribution has too sharp tails, so a much better choice is a t-distribution with four degrees of freedom.
  – Better yet is q(θ) = T₄(θ | θ̂, c Î^{−1}); try different c > 1 until convergence is fast.
  – Better yet is q(θ) = T₄(θ | θ̂_π, c Î_π^{−1}), where θ̂_π and Î_π are the maximizer and Hessian of f(x | θ)π(θ).
  – Better yet is q(θ) = T₄(θ | θ̂_π, C Î_π^{−1} C), where C > I is a diagonal matrix, and one varies the cᵢ until good convergence is achieved.
II. Adaptive Importance Sampling and Exoplanets

Long literature on adaptive importance sampling: Oh and Berger (1993); Cappé, Guillin, Marin, and Robert (2004); Ardia, Hoogerheide, and van Dijk (2008); Cappé, Douc, Guillin, Marin, and Robert (2008); Cornebise, Moulines, and Olsson (2008).
The Adaptive Importance Sampler

• The importance function q_{t,i}(θ) (adapting over the two indices t and i) tries to mimic the target f(x | θ)π(θ) with a mixture of T₄ densities,

  q_{t,i}(θ) = Σ_{j=1}^k w_j T₄(θ | μ_j, Σ_j),

  where the nonnegative weights w_j sum to one.
• The algorithm adaptively chooses k, the w_j, μ_j and Σ_j, with two layers of adaptation:
  – The outer layer is an annealing adaptation, wherein one attempts to target [f(x | θ)π(θ)]^{1−t}, with t eventually going to 0.
    ∗ The reason is the hope of finding at least the important modes.
  – The inner layer, over the index i, seeks successively better approximations to [f(x | θ)π(θ)]^{1−t}, with t fixed.
• Details of the inner adaptation, going from q_{t,i}(θ) to q_{t,i+1}(θ):
  – A component is removed if its weight, w_j, becomes too small.
  – Sample θ^(l), l = 1, . . . , N, from q_{t,i}(θ); check for high importance weights

      w_{t,i,l} = [f(x | θ^(l)) π(θ^(l))]^{1−t} / q_{t,i}(θ^(l));

    new components of the mixture (centered at the θ^(l) with the high weights) are tentatively added to the mixture.
  – The w_j, μ_j and Σ_j in q_{t,i+1}(θ) are chosen by an analytic fit to the annealing target, using K-L divergence estimated from the previous draws from q_{t,i}(θ); i.e., they are chosen to minimize

      ∫ [f(x | θ)π(θ)]^{1−t} log{ [f(x | θ)π(θ)]^{1−t} / q_{t,i+1}(θ) } dθ ≈ c Σ_{l=1}^N log{ [f(x | θ^(l))π(θ^(l))]^{1−t} / q_{t,i+1}(θ^(l)) }.

    (Use the weights, means and covariances from q_{t,i} as the starting point. Scale the new initial covariance matrices by the distance to the closest mean.)
  – Stop iterating on i when the estimated variance in using q_{t,i+1} is small.
• Details of the outer adaptation, going from t_k to t_{k+1}:
  – Choose t_{k+1} so that

      [ Σ_{l=1}^N log{ c_k [f(x | θ^(l))π(θ^(l))]^{1−t_k} / q_{t_k,i∗}(θ^(l)) } ] / [ Σ_{l=1}^N log{ c_{k+1} [f(x | θ^(l))π(θ^(l))]^{1−t_{k+1}} / q_{t_k,i∗}(θ^(l)) } ] = 1 − δ.

  – Alternatively, the effective sample size (ESS), given by

      ESS = (Σ_l w_{t,i,l})² / Σ_l w²_{t,i,l},

    can be monitored and t_{k+1} kept small enough so that ESS is reasonable.
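The ESS diagnostic above is a one-line computation; a minimal sketch (toy weight vectors, not the slides' code) shows its behavior at the two extremes:

```python
# Effective sample size of a set of importance weights:
# ESS = (sum of weights)^2 / (sum of squared weights).
def ess(weights):
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2

# Equal weights give ESS = N; one dominant weight collapses ESS toward 1,
# signaling that the annealing step t_{k+1} was too aggressive.
print(ess([1.0] * 100))              # equal weights: ESS = 100
print(ess([100.0] + [0.01] * 99))    # one dominant weight: ESS near 1
```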
An example: Exoplanet Detection

With Tom Loredo and David Chernoff (Cornell Astronomy)
and Bin Liu and Merlise Clyde (Duke Statistics)
Common Detection Method: Use of Radial Velocity
Keplerian Radial Velocity Model

Parameters for a single planet:
• τ = orbital period (days)
• e = orbital eccentricity
• K = velocity amplitude (m/s)
• Argument of pericenter ω
• Mean anomaly at t = 0, M₀
• System center-of-mass velocity v₀
• Stellar jitter σ²_J

Keplerian reflex velocity vs. time: Letting θ = {K, τ, e, M₀, ω},

  v(t) = v₀ + g(t; θ) ≡ v₀ + K (e cos ω + cos[ω + ν(t)]),

where the true anomaly ν(t) (from Kepler's equation for the eccentric anomaly E(t)) satisfies

  E(t) − e sin E(t) = 2πt/τ − M₀;   tan[ν(t)/2] = [(1 + e)/(1 − e)]^{1/2} tan[E(t)/2].
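Kepler's equation above has no closed-form solution for E(t), so evaluating g(t; θ) requires a root find. A small sketch (a hypothetical helper, not the slides' code) solves it by Newton's method and converts to the true anomaly:

```python
# Solve E - e*sin(E) = M for the eccentric anomaly E by Newton's method,
# then convert to the true anomaly via tan(nu/2) = sqrt((1+e)/(1-e)) tan(E/2).
import math

def true_anomaly(t, tau, e, M0, tol=1e-12):
    M = 2 * math.pi * t / tau - M0          # mean anomaly
    E = M                                   # starting guess for Newton iteration
    for _ in range(50):
        dE = (E - e * math.sin(E) - M) / (1 - e * math.cos(E))
        E -= dE
        if abs(dE) < tol:
            break
    return 2 * math.atan(math.sqrt((1 + e) / (1 - e)) * math.tan(E / 2))

# Sanity check: for a circular orbit (e = 0) the true anomaly equals
# the mean anomaly, so t = tau/4 with M0 = 0 gives nu = pi/2.
print(true_anomaly(t=2.5, tau=10.0, e=0.0, M0=0.0))
```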
The Likelihood Function

Keplerian observational velocity model:

Observations: x_i = v₀ + g(t_i; θ) + ε_i + η_i,

where ε_i ∼ N(ε_i | 0, σ²_i) with the σ²_i being known observational variances, and η_i ∼ N(η_i | 0, σ²_J). (Perhaps better is N(ε_i | 0, λσ²_i), with λ unknown.)

Likelihood for data x = (x₁, . . . , x_n) is

  f(x | θ, v₀, σ²_J) = Π_{i=1}^n [2π(σ²_i + σ²_J)]^{−1/2} exp{ −(1/2) [x_i − v₀ − g(t_i; θ)]² / (σ²_i + σ²_J) }
                     ∝ Π_i (σ²_i + σ²_J)^{−1/2} exp{ −(1/2) Σ_i [x_i − v₀ − g(t_i; θ)]² / (σ²_i + σ²_J) }.

This likelihood has extreme multimodality in τ, challenging multimodality in M₀, and is smooth (but often vague) in e.
Bayesian Approach to Exoplanet Detection

• Let M_i denote the model that there are i planets, with
  – the two system parameters v₀ and σ²_J,
  – 5 parameters describing each planet's orbit,
  – so the model parameter vector θ_i has 2 + 5i parameters.
• Determine prior distributions π(θ_i, v₀, σ²_J) for the parameters (semi-standard, as the result of a SAMSI program, except for σ²_J).
• Compute the marginal likelihood of model M_i,

  m_i(x) = ∫ f_i(x | θ_i, v₀, σ²_J) π(θ_i, v₀, σ²_J) dθ_i dv₀ dσ²_J.

  We use adaptive importance sampling to carry out the computation.
• Typically look at Bayes factors (the ratio of marginal likelihoods) to determine the number of planets.
III. The Laplace Approximation to Bayes Factors and BIC
Goal: Analytically approximate the marginal density

  m(x) = ∫ f(x | θ)π(θ) dθ.

Preliminary 'nice' reparameterization: Choose a 'good' transformation to make the Laplace approximation as accurate as possible. In particular, all parameters should lie in (−∞, ∞).
• For variances, transform to ν = log σ² as the parameter.
• For a probability p, transform to, e.g., ν = log[p/(1 − p)].

Definitions: Let L(θ) = log f(x | θ) denote the log-likelihood, maximized at the mle θ̂, and let Î denote the observed Fisher information matrix having (i, j) element

  Î_ij = − [∂²L(θ)/∂θ_i ∂θ_j] evaluated at θ = θ̂.
Expanding L(θ) = log f(x | θ) in a Taylor series about its maximum θ̂ yields*

  L(θ) ≈ L(θ̂) − (1/2)(θ − θ̂)ᵗ Î (θ − θ̂).

If π(θ) is relatively flat near θ̂, where L(θ) is non-negligible**, then

  m(x) = ∫ f(x | θ)π(θ) dθ
       ≈ π(θ̂) ∫ f(x | θ) dθ
       ≈ π(θ̂) f(x | θ̂) ∫ exp{ −(1/2)(θ − θ̂)ᵗ Î (θ − θ̂) } dθ
       = π(θ̂) f(x | θ̂) (2π)^{p/2} |Î|^{−1/2},

where p is the dimension of θ.

* We assume that L has continuous second partial derivatives and that the first partial derivative vanishes at θ̂.
** This will be true if L(θ) is highly peaked in a small neighborhood around θ̂, which is typically the case for large n.
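As a numerical check on the approximation above, a sketch (the toy model, data, and prior variance are assumptions for illustration) compares the Laplace approximation of m(x) with the exact conjugate marginal for normal data with a normal prior:

```python
# Toy model (assumption): x_1..x_n ~ N(theta, 1), prior theta ~ N(0, tau2),
# where both the Laplace approximation and the exact m(x) are computable.
import math

def norm_pdf(z, mu, var):
    return math.exp(-(z - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

data = [0.9, 1.1, 1.4, 0.8, 1.0, 1.2]
n = len(data)
tau2 = 100.0                  # flat-ish prior, so pi is nearly constant near the mle
theta_hat = sum(data) / n     # MLE (sample mean)
I_hat = float(n)              # observed information for N(theta, 1), p = 1

lik_hat = 1.0
for xi in data:
    lik_hat *= norm_pdf(xi, theta_hat, 1.0)      # f(x | theta_hat)

# Laplace: m(x) ~= pi(theta_hat) f(x|theta_hat) (2 pi)^{1/2} |I_hat|^{-1/2}
m_laplace = norm_pdf(theta_hat, 0.0, tau2) * lik_hat * math.sqrt(2 * math.pi / I_hat)

# Exact conjugate marginal for comparison
S = sum((xi - theta_hat) ** 2 for xi in data)
m_exact = ((2 * math.pi) ** (-n / 2) * math.exp(-S / 2)
           * math.sqrt(2 * math.pi / n) * norm_pdf(theta_hat, 0.0, tau2 + 1 / n))

print(m_laplace, m_exact)    # agree to well under 1% here
```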
(First) Laplace approximation to Bayes factors: Apply the Laplace expansion to the numerator and denominator of B₂₁ to get the Laplace approximation to B₂₁:

  B^L_21 = ∫ f₂(x | θ₂)π₂(θ₂) dθ₂ / ∫ f₁(x | θ₁)π₁(θ₁) dθ₁
         ≈ [π₂(θ̂₂) f₂(x | θ̂₂) |Î₂|^{−1/2} (2π)^{p₂/2}] / [π₁(θ̂₁) f₁(x | θ̂₁) |Î₁|^{−1/2} (2π)^{p₁/2}],

where θ̂₁ and θ̂₂ are the m.l.e.'s for θ₁ and θ₂ (which have dimensions p₁ and p₂) and Î₁ and Î₂ are the observed information matrices.

Large n and i.i.d. data: Then Î_i ≈ n I*_i, where I*_i = E[Î_i] is the expected Fisher information for a single observation in M_i, and

  B^L_21 ≈ [f₂(x | θ̂₂) / f₁(x | θ̂₁)] · n^{−(p₂−p₁)/2} · [|I*₂|^{−1/2} (2π)^{p₂/2} π₂(θ̂₂)] / [|I*₁|^{−1/2} (2π)^{p₁/2} π₁(θ̂₁)].   (1)
BIC (Bayes Information Criterion)

The Schwarz (1978) approximation to Bayes factors is based on simply ignoring the last term in (1), resulting in

  B^BIC_21 ≈ [f₂(x | θ̂₂) / f₁(x | θ̂₁)] · n^{−(p₂−p₁)/2}.

• Schwarz justified ignoring the last term because the first factor is typically exponentially growing or decreasing with n, the second factor is a polynomial in n, but the last term does not vary with n, and hence is not as important.
• Raftery (1995), perhaps more convincingly, noted that, if one takes π_i(θ_i) to be N(θ_i | θ̂_i, I*_i^{−1}) (a unit-information prior centered at the mle), then the third term in (1) is exactly 1. So there is a (pseudo) prior yielding BIC.

Refs.: Schwarz (1978); Kass and Wasserman (1995); Dudley and Haughton (1997); Kass and Vaidyanathan (1992); Pauler (1998).
• The BIC criterion is

  BIC_i = −2 log f_i(x | θ̂_i) + p_i log n   (≈ −2 log m_i(x)),

  so that BIC₂ − BIC₁ = −2 log B^BIC_21.
• Schwarz recommended, when dealing with multiple models, to just choose the model with minimal BIC_i.
• The main justification that Schwarz gave for BIC is that it is consistent, i.e., it will select the correct model as n → ∞. (The constant terms that BIC ignores are irrelevant asymptotically.)

If someone just presents BIC_i for all models, one can convert back to approximate Bayes factors via

  B_ji ≈ exp{ −(1/2)[BIC_j − BIC_i] },

and do any desired Bayesian analysis (e.g., model averaging).
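The conversion above can be sketched in a few lines (the model names and BIC values below are hypothetical):

```python
# From reported BICs to approximate Bayes factors B_ji = exp(-(BIC_j - BIC_i)/2)
# and, with equal prior model probabilities, to posterior model probabilities.
import math

bic = {"M1": 1012.4, "M2": 1008.1, "M3": 1015.9}   # hypothetical BIC values

b_min = min(bic.values())
weights = {m: math.exp(-(v - b_min) / 2) for m, v in bic.items()}   # B_{m,best}
total = sum(weights.values())
post = {m: w / total for m, w in weights.items()}   # usable for model averaging

best = min(bic, key=bic.get)        # Schwarz's rule: choose minimal BIC
print(best, post)
```

The minimal-BIC model is, by construction, also the model with the largest pseudo-posterior probability.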
SAMSI Grad Students: A proposed project for you to work on together (with Ralph and myself as advisors) – create a UQ BIC to compare various math models:
• Need to be able to find the θ̂_i (requiring optimization over the math model).
• Need to be able to find the observed information matrix (requiring second derivatives of the math model).
• If emulators are used, it would seem that the marginal distributions can be calculated exactly.
IV. Search in Large Model Spaces and the Inference Challenge
Bayesian Analysis in Large Model Spaces: a Search Strategy and Conceptual Issues (with German Molina)

Notation: Data x, arising from model M having density f(x | θ), with unknown parameter θ that has prior density π(θ).

Search Strategy: Start at any model. Upon visiting model M_l, having unknown parameter θ_l consisting of some subset of the variables θ₁, . . . , θ_p, determine its marginal likelihood m_l(x) = ∫ f_l(x | θ_l)π(θ_l) dθ_l.

At any stage, with {M₁, . . . , M_k} denoting the models previously visited,
• define the current estimated posterior probabilities of models (with prior probabilities P(M_l)) as

  P̂(M_l | x) = P(M_l) m_l(x) / Σ_{j=1}^k P(M_j) m_j(x);

• define the current estimated variable inclusion probabilities (estimated overall posterior probabilities that variables are in the model) as

  q̂_i = P̂(θ_i ∈ model | x) = Σ_{j=1}^k P̂(M_j | x) 1_{θ_i ∈ M_j}.
1. At iteration k, compute the current posterior model and inclusion probability estimates, P̂(M_j | x) and q̂_i.
2. Return to one of the k − 1 distinct models already visited, in proportion to their estimated probabilities P̂(M_j | x).
3. Randomly (probability 0.5) choose to either add or remove a variable.
   • If adding a variable, choose i with probability ∝ (q̂_i + C)/(1 − q̂_i + C).
   • If removing, choose i with probability ∝ (1 − q̂_i + C)/(q̂_i + C).
   C is a tuning constant introduced to keep the selection probabilities away from zero or one; C = 0.01 is a reasonable default value.
4. If the model obtained in Step 3
   • has already been visited, return to Step 2;
   • is new, update P̂(M_j | x) and q̂_i and go to Step 2.
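The four steps above can be sketched as follows. This is a toy illustration, not the authors' code: a hypothetical scoring function stands in for the marginal likelihood m_l(x), prior model probabilities are taken equal, and the "true" variables are {0, 2}:

```python
# Stochastic model search driven by estimated inclusion probabilities.
import math
import random

random.seed(1)
p = 5                       # number of candidate variables
GOOD = {0, 2}               # toy truth: models containing {0, 2} score highly

def marginal(model):        # hypothetical stand-in for m_l(x)
    return math.exp(5 * len(model & GOOD) - 2 * len(model - GOOD))

def search(steps=500, C=0.01):
    visited = {frozenset(): marginal(frozenset())}   # model -> m_l(x)
    for _ in range(steps):
        total = sum(visited.values())
        post = {m: v / total for m, v in visited.items()}          # Step 1
        q = [sum(pr for m, pr in post.items() if i in m) for i in range(p)]
        current = random.choices(list(post), weights=list(post.values()))[0]  # Step 2
        if random.random() < 0.5:                                  # Step 3: add
            cand = [i for i in range(p) if i not in current]
            if not cand:
                continue
            w = [(q[i] + C) / (1 - q[i] + C) for i in cand]
        else:                                                      # Step 3: remove
            cand = list(current)
            if not cand:
                continue
            w = [(1 - q[i] + C) / (q[i] + C) for i in cand]
        new = current ^ {random.choices(cand, weights=w)[0]}
        if new not in visited:                                     # Step 4
            visited[new] = marginal(new)
    total = sum(visited.values())
    return [sum(v for m, v in visited.items() if i in m) / total for i in range(p)]

q = search()
print(q)   # inclusion probabilities; variables 0 and 2 should dominate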
Conventional Priors for Normal Linear Models

The full model for the data Y = (Y₁, . . . , Y_n)′ is (with σ² unknown)

  M_F : Y = Xβ + ε, ε ∼ N_n(0, σ²I),

• β = (β₁, β*) is a p × 1 vector of unknown coefficients, with β₁ being the intercept;
• X = (1 X*) is the n × p (p < n) full-rank design matrix of covariates, with the columns of X* being orthogonal to 1 = (1, . . . , 1)′.

We consider selection from among the submodels M_l, l = 1, . . . , 2^p,

  M_i : Y = X_i β_i + ε, ε ∼ N_n(0, σ²I),

• β_i = (β₁, β*_i) is a d(i)-dimensional subvector of β (note that we only consider submodels that contain the intercept);
• X_i = (1 X*_i) is the corresponding n × d(i) matrix of covariates.
• Let f_i(y | β_i, σ²) denote this density.
Prior density under M_i: standard choices include

• g-priors: π^g_i(β₁, β*_i, σ² | c) = (1/σ²) × N_{d(i)−1}(β*_i | 0, c n σ² (X*′_i X*_i)^{−1}), where c is fixed, and is typically set equal to 1 or estimated in an empirical Bayesian fashion; these are not information consistent.
• Zellner-Siow priors (ZSN): the intercept model has the prior π(β₁, σ²) = 1/σ², while the prior for the other M_i is

  π^ZSN_i(β₁, β*_i, σ²) = π^g_i(β₁, β*_i, σ² | c),  π^ZSN_i(c) ∼ InverseGamma(c | 0.5, 0.5).

Marginal density under M_i: m_i(Y) = ∫ f_i(y | β_i, σ²) π_i(β_i, σ²) dβ_i dσ².

Posterior probability of M_i, when the prior probability is P(M_i) (for multiplicity adjustment, choose P(M_i) = Beta(d(i), p − d(i) + 1)), is

  P(M_i | Y) = P(M_i) m_i(Y) / Σ_k P(M_k) m_k(Y).
Example: An ozone data set was considered in Breiman and Friedman (1985).

• It consists of 178 observations, each having 10 covariates (in addition to the intercept).
• We consider a linear model with all linear main effects along with all quadratic terms and second-order interactions, yielding a total of 65 covariates and 2^65 ≈ 3.6 × 10^19 models.
• g-priors and Zellner-Siow priors were considered; these result in easily computable marginal model probabilities m_i(y).
• The algorithm was implemented at different starting points with essentially no difference in results, and run through 5,000,000 iterations, saving only the top 65,536 models.
• No model had appreciable posterior probability; 0.0025 was the largest.
Figure 1: Retrospective cumulative posterior probability of models visited in the ozone problem (sum of normalized marginals, from 0 to 1, over 1,000,000 iterations).
Comparison of models discovered: 25-node example

Figure 2: Carvalho and Scott (2008): graphical model selection, 25-node example. Histogram of model log-posteriors for models found by Gibbs (SVSS), Metropolis, and Feature INClusion Search (what they called this search algorithm). Equal computation time for searches; probability renormalization across all three model sets.
Example: Posterior Model Probabilities for Variable Selection in Probit Regression
(from CMU Case Studies VII, Viele et al. case study)

Motivating example: Prosopagnosia (face blindness) is a condition (usually developing after brain trauma) under which the individual cannot easily distinguish between faces. A psychological study was conducted by Tarr, Behrmann and Gauthier to address whether this extended to a difficulty of distinguishing other objects, or was particular to faces.

The study considered 30 control subjects (C) and 2 subjects (S) diagnosed as having prosopagnosia, and had them try to differentiate between similar faces, similar 'Greebles' and similar 'objects' at varying levels of difficulty and varying comparison time.
Data: Sample size was n = 20,083, the response and covariates being

C = Subject C
S = Subject S
G = Greebles
O = Object
D = Difficulty
B = Brief time
A = Images match or not
⇒ R = Response (answer correct or not).

All variables are binary. All interactions, consisting of all combinations of C, S, G, O, D, B, A (except {C=S=1} and {G=O=1}), are of interest. There are 3 possible subjects (a control and the two patients), 3 possible stimuli (faces, objects, or greebles), and 2 possibilities for each of D, B and A, yielding 3 × 3 × 2 × 2 × 2 = 72 possible interactions.
Statistical modeling: For a specified interaction vector X_i, let y_i and n_i − y_i be the numbers of successes and failures among the responses with that interaction vector, with probability of success p_i assumed to follow the probit regression model

  p_i = Φ(β₁ + Σ_{j=2}^{72} X_ij β_j),

with Φ being the normal cdf. The full model likelihood (up to a fixed proportionality constant) is then

  f(y | β) = Π_{i=1}^{72} p_i^{y_i} (1 − p_i)^{n_i − y_i}.

Goal: Select from among the 2^72 submodels which have some of the β_j set equal to zero. (Actually, only models with graphical structure were considered, i.e., if an interaction term is in the model, all the lower-order effects must also be there.)
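For illustration, a sketch (toy dimensions and hypothetical counts, not the study's 72-cell data) evaluates the probit log-likelihood corresponding to the model above, with Φ built from math.erf:

```python
# Probit log-likelihood: log f(y|beta) = sum_i [y_i log p_i + (n_i - y_i) log(1 - p_i)],
# where p_i = Phi(sum_j X_ij beta_j) and Phi is the standard normal cdf.
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_loglik(beta, X, y, n):
    ll = 0.0
    for Xi, yi, ni in zip(X, y, n):
        p = Phi(sum(b * xij for b, xij in zip(beta, Xi)))
        ll += yi * math.log(p) + (ni - yi) * math.log(1.0 - p)
    return ll

# Two interaction cells, intercept plus one covariate (toy setup).
X = [[1.0, 0.0], [1.0, 1.0]]
y = [60, 90]          # successes per cell (hypothetical)
n = [100, 100]        # trials per cell (hypothetical)
ll = probit_loglik([0.0, 1.0], X, y, n)
print(ll)
```

Maximizing this function over beta yields the β̂_j that the modified Laplace approximation on the next slide requires.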
Prior Choice: Conditionally inducing submodel priors from a full-model prior

• A standard noninformative prior for p = (p₁, p₂, . . . , p₇₂) is the uniform prior, usable here since it is proper.
• Changing variables from p to β yields π^L(β) = N₇₂(0, (X′X)^{−1}), where X′ = (X′₁, X′₂, . . . , X′₇₂). This is the full-model prior.
• For the model with a nonzero subvector β_(j) of dimension k_j, with the remaining β_i = 0, induce a prior from the full-model prior by conditioning on the remaining β_i = 0:

  π^L(β_(j) | β_(−j) = 0) = N_{k_j}(0, (X′_(j) X_(j))^{−1}),

  where β_(−j) is the subvector of parameters that are zero and X_(j) is the corresponding design matrix. (This is a nice conditional property of normal distributions.)
Computation of the marginal probability of a visited model: Use the modified Laplace approximation (leaving π_j(β_j) in the integrand, since it is a normal distribution and can be integrated out):

  m_j(y) = ∫ f_j(y | β_j) π_j(β_j) dβ_j ≈ f_j(y | β̂_j) |I + V_j^{−1} Î_j|^{−1/2} exp{ −(1/2) β̂′_j (Î_j^{−1} + V_j^{−1})^{−1} β̂_j },

where Î_j is the observed information matrix.
• This allows use of standard probit packages to obtain β̂_j, f_j(y | β̂_j), and Î_j.
• Note that (approximate) posterior means, μ_j, and covariance matrices, Σ_j, are also available in closed form.
Uniform-induced conditional prior: cumulative sum of posterior probabilities, plotted against models ordered by posterior probability and against the number of models visited (0 to 5000).
Utilization of the Posterior Model Probabilities

One usually must decide between
• Model Averaging: perform inferences averaged over all models.
• Model Selection: choose a specific model, e.g.,
  – the maximum posterior probability model;
  – the median probability model (that which includes only those β_i for which P(β_i ∈ model | y) ≥ 0.5).
Posterior inclusion probabilities for the 72 parameters

Var.  prior1 prior2 prior3 prior4    Var.  prior1 prior2 prior3 prior4
Int.  1      1      1      1         GD    0.998  0.358  0.519  0.833
C     1      1      1      1         GB    1      1      1      1
S     1      1      1      1         GA    1      1      1      1
G     1      1      1      1         OD    0.999  0.418  0.700  0.744
O     1      1      1      1         OB    1      1      1      1
D     1      1      1      1         OA    1      1      1      1
B     1      1      1      1         DB    1      0.999  0.999  0.999
A     1      1      1      1         DA    1      1      1      1
CG    0.966  0.082  0.154  0.955     BA    1      1      1      1
CO    0.519  0.033  0.063  0.155     CGD   0.264  0.000  0.000  0.134
CD    0.780  0.035  0.109  0.454     CGB   0.938  0.002  0.020  0.943
CB    1      0.999  0.999  1         CGA   0.936  0.002  0.018  0.944
CA    1      1      1      1         COD   0.074  0.000  0.000  0.003
SG    1      1      1      1         COB   0.417  0.001  0.003  0.080
SO    0.999  0.877  0.969  0.995     COA   0.275  0.003  0.013  0.094
SD    1      0.999  0.999  1         CDB   0.684  0.012  0.063  0.357
SB    1      1      1      1         CDA   0.323  0.001  0.011  0.235
SA    1      1      1      1         CBA   0.999  0.639  0.833  0.999
Non-Bayesian search in model space: The same principle can be applied for any search criterion and for any model structure:
• Choose the model features (e.g., variables, graphical nodes, links in graph structures) you wish to drive the search.
• Choose the criterion for defining a good model (e.g., AIC = −2 log f_i(x | θ̂_i) + 2p_i).
• Convert the criterion to pseudo-probabilities for the models, e.g., P(M_j | x) ∝ e^{−AIC_j/2}.
• Define the feature inclusion probabilities as

  q̂_i = P̂(feature_i ∈ model | x) = Σ_{j=1}^k P(M_j | x) 1_{feature_i ∈ M_j}.

• Apply the search algorithm.
Summary of Search

For (non-orthogonal) large model spaces,
• Effective search can be done by revisiting high-probability models and changing variables (features) according to their estimated posterior inclusion probabilities. The technique is being used to great effect in many fields:
  – Statistics: Berger and Molina, 2004 CMU Case Studies; 2005 Statistica Neerlandica.
  – Economics: Sala-i-Martin, 2004 American Economic Review (different variant).
  – CS: Schmidt, Niculescu-Mizil, and Murphy, 2007 AAAI (different variant).
  – Graphical Models: Carvalho and Scott, 2007 tech report.
• There may be no model with significant posterior probability (in the ozone example the largest model posterior probability was 0.0025) and thousands or millions of models of roughly comparable probability.
• Of most importance are thus overall features of the model space, such as the posterior inclusion probabilities of each variable, and possibly bivariate inclusion probabilities or other structural features (see, e.g., Ley and Steel, 2007 J. Macroeconomics).