We present a survey of computational and applied mathematical techniques that have the potential to contribute to the next generation of high-fidelity, multi-scale climate simulations. Examples of the climate science problems that can be investigated in greater depth with these computational improvements include capturing the remote forcings of localized hydrological extreme events, accurately representing cloud features over a range of spatial and temporal scales, and running large, parallel ensembles of simulations to explore model sensitivities and uncertainties more effectively.
Numerical techniques, such as adaptive mesh refinement, implicit time integration, and separate treatment of fast physical time scales, are enabling improved accuracy and fidelity in simulations of dynamics and allowing more complete representations of climate features at the global scale. At the same time, partnerships with computer science teams have focused on taking advantage of evolving computer architectures such as many-core processors and GPUs. As a result, approaches that were previously considered prohibitively costly have become both more efficient and scalable. In combination, progress in these three critical areas is poised to transform climate modeling in the coming decades.
Numerous studies have found an average increase in extreme precipitation for both the U.S. and Northern Hemisphere mid-latitude land areas, consistent with the expectations arising from the observed increase in greenhouse gas concentrations (now more than 40% above pre-industrial levels). However, there are important regional variations in these trends that are not fully explained. These trend studies are typically based on direct analyses of observational station data. Such analyses confront multiple challenges, such as incomplete data and uneven spatial coverage of stations. Central scientific questions related to this general finding are: Are there changes in weather-system phenomenology that are contributing to this observed increase? What is the contribution of increases in atmospheric water vapor? There are also questions related to the application of potential future changes in planning. Because of the rarity (by definition) of extreme events, trends are mostly found only when aggregating over space. When would we expect to see a signal at the local level? What are the uncertainties surrounding future changes, and how might they be incorporated into future design? Further development of statistical and mathematical methods, or innovative application of existing methods, is desirable to aid scientists in exploring these central scientific questions. This talk will describe characteristics of the observational record and the issues surrounding the above questions.
Physical processes in the earth system are modeled with mathematical representations called parameterizations. This talk will describe some of the conceptual approaches and mathematics used to describe physical parameterizations, focusing on cloud parameterizations. This includes tracing physical laws to discrete representations in coarse-scale models. Clouds illustrate several of the complexities and techniques common to many physical parameterizations, including the problem of different scales and sub-grid-scale variability. Mathematical methods for dealing with the sub-grid scale will be discussed. Inexactness and indeterminacy in both weather and climate will also be addressed, including the problems of indeterminate parameterizations and inexact initial conditions. Different mathematical methods, including stochastic methods, will be described and discussed, with examples from contemporary earth system models.
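As a concrete illustration of the stochastic methods mentioned above, here is a minimal sketch of a stochastically perturbed parameterized tendency, one common device for representing indeterminacy from unresolved sub-grid variability; the "physics" (relaxation toward saturation), time scales, and amplitudes are entirely hypothetical, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_tendency(q):
    """Hypothetical deterministic cloud tendency: relax humidity q toward
    a saturation value with a fixed time scale (illustrative only)."""
    q_sat, tau = 1.0, 6.0
    return (q_sat - q) / tau

def stochastic_tendency(q, sigma=0.3):
    """Multiplicative stochastic perturbation of the parameterized tendency,
    representing sub-grid variability the coarse model cannot resolve."""
    return deterministic_tendency(q) * (1.0 + sigma * rng.standard_normal(q.shape))

# One forward-Euler step for an ensemble of grid columns.
q = np.full(16, 0.4)          # initial grid-box humidity (nondimensional)
dt = 0.5
q_next = q + dt * stochastic_tendency(q)
print(q_next.mean(), q_next.std())
```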
In the first part of the talk, we present a sensitivity analysis of a novel sea ice model. neXtSIM is a continuous Lagrangian numerical model that uses an elasto-brittle rheology to simulate the ice response to external forces. The response of the model is evaluated in terms of simulated ice drift distances from the initial position and from the mean position of the ensemble. The simulated ice drift is decomposed into advective and diffusive parts, which are characterized separately in both space and time and compared with what is obtained from a free-drift model, i.e., one in which the ice rheology plays no role. Overall, the large-scale response of neXtSIM is correlated with the ice thickness and wind velocity fields, while the free-drift model response is mostly correlated with the wind velocity pattern only. The seasonal variability of the model sensitivity shows the role of ice compactness and rheology at both local and Arctic scales. Indeed, the ice drift simulated by neXtSIM in summer is close to that of the free-drift model, while the more compact and solid ice pack shows significantly different mechanical and drift behavior in winter. In contrast to the free-drift model, neXtSIM reproduces the sea ice Lagrangian diffusion regimes found in observed trajectories. The forecast capability of neXtSIM is also evaluated using a large set of real buoy trajectories. We find that neXtSIM performs better in simulating sea ice drift, both in terms of forecast error and as a tool to assist search-and-rescue operations. Adaptive meshes, such as the one used in neXtSIM, are used to model a wide variety of physical phenomena. Some of these models, in particular those of sea ice movement, use a remeshing process to remove and insert mesh points at various points in their evolution. This represents a challenge in developing compatible data assimilation schemes, as the dimension of the state space we wish to estimate can change over time when these remeshings occur.
In the second part of the talk, we highlight the challenges that such a modeling framework poses for the setup of data assimilation. We then describe a remeshing scheme for an adaptive mesh in one dimension. The development of advanced data assimilation methods that are appropriate for such a moving and remeshed grid is presented. Finally, we discuss the extension of these techniques to two-dimensional models, like neXtSIM.
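To make the advective/diffusive decomposition concrete, the following sketch applies it to synthetic ensemble trajectories: the advective part is the displacement of the ensemble-mean position, the diffusive part is the spread about that mean, with an effective diffusivity read off from linear variance growth. The velocities, amplitudes, and random-walk structure are illustrative assumptions, not neXtSIM output.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ensemble of Lagrangian ice trajectories: n members, T steps, 2D.
n, T, dt = 200, 240, 3600.0
u_mean = np.array([0.05, 0.02])              # assumed advective velocity (m/s)
sigma = 0.08                                  # assumed fluctuation amplitude (m/s)
steps = u_mean * dt + sigma * np.sqrt(dt) * rng.standard_normal((n, T, 2))
traj = np.cumsum(steps, axis=1)

# Advective part: drift distance of the ensemble-mean position.
mean_path = traj.mean(axis=0)
advective = np.linalg.norm(mean_path, axis=1)

# Diffusive part: spread of members about the ensemble mean; in a diffusive
# regime the variance grows linearly in time, var(t) ~ 2 K t.
spread2 = ((traj - mean_path) ** 2).sum(axis=2).mean(axis=0)
t = dt * np.arange(1, T + 1)
K_est = np.polyfit(t, spread2, 1)[0] / 2.0   # effective diffusivity estimate
print(f"final advective drift: {advective[-1] / 1e3:.1f} km, K ~ {K_est:.3g} m^2/s")
```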
Multi-Model Ensemble (MME) predictions are a popular ad hoc technique for improving predictions of high-dimensional, multi-scale dynamical systems. The heuristic idea behind the MME framework is simple: given a collection of models, one considers predictions obtained through a convex superposition of the individual probabilistic forecasts in the hope of mitigating model error. However, it is not obvious whether this is a viable strategy, nor which models should be included in the MME forecast in order to achieve the best predictive performance. I will present an information-theoretic approach to this problem which allows for deriving a sufficient condition for improving dynamical predictions within the MME framework; moreover, this formulation gives rise to systematic and practical guidelines for optimising data assimilation techniques which are based on multi-model ensembles. Time permitting, the role and validity of “fluctuation-dissipation” arguments for improving imperfect predictions of externally perturbed non-autonomous systems, with possible applications to climate change considerations, will also be addressed.
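A minimal sketch of the convex-superposition idea, with hypothetical Gaussian member forecasts and relative entropy (KL divergence) as the information-theoretic skill measure; the member densities and weights are illustrative, not the talk's criterion.

```python
import numpy as np

# Hypothetical probabilistic forecasts from three imperfect models, expressed
# as densities on a common grid; "truth" is the density we hope to approach.
x = np.linspace(-6, 6, 601)

def gaussian(mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

truth = gaussian(0.0, 1.0)
models = [gaussian(0.6, 1.1), gaussian(-0.8, 0.9), gaussian(0.2, 1.6)]

def kl(p, q):
    """Relative entropy, a natural information-theoretic lack-of-skill measure."""
    return np.trapz(p * np.log(p / q), x)

def mme(weights):
    """MME forecast: convex superposition of the member forecast densities."""
    w = np.asarray(weights, dtype=float) / np.sum(weights)
    return sum(wi * mi for wi, mi in zip(w, models))

for w in ([1, 0, 0], [1, 1, 1], [2, 3, 1]):
    print(w, f"KL(truth || MME) = {kl(truth, mme(w)):.4f}")
```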
The scanpath comparison framework based on string editing is revisited. The previous method of clustering based on k-means “preevaluation” is replaced by the mean shift algorithm followed by elliptical modeling via Principal Components Analysis. Ellipse intersection determines cluster overlap, with fast nearest-neighbor search provided by the kd-tree. Subsequent construction of Y-matrices and parsing diagrams is fully automated, obviating prior interactive steps. Empirical validation is performed via analysis of eye movements collected during a variant of the Trail Making Test, where participants were asked to visually connect alphanumeric targets (letters and numbers). The observed repetitive position similarity index matches previously published results, providing ongoing support for the scanpath theory (at least in this situation). Task dependence of eye movements may be indicated by the global position index, which differs considerably from past results based on free viewing.
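A sketch of the clustering stage described above, using off-the-shelf mean shift followed by PCA-based elliptical modeling on synthetic fixation data; the ellipse-intersection and kd-tree steps are omitted, and all coordinates and the bandwidth are hypothetical.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Hypothetical fixation coordinates (x, y) from an eye-movement record.
fixations = np.vstack([
    rng.normal([200, 150], 15, size=(40, 2)),
    rng.normal([420, 300], 20, size=(50, 2)),
    rng.normal([600, 120], 12, size=(30, 2)),
])

# Mean shift replaces the k-means "preevaluation": no preset cluster count.
labels = MeanShift(bandwidth=50).fit_predict(fixations)

# Elliptical modeling via PCA: principal axes and component standard
# deviations define an ellipse over each cluster of fixations.
for k in np.unique(labels):
    pts = fixations[labels == k]
    pca = PCA(n_components=2).fit(pts)
    center = pts.mean(axis=0)
    axes = 2.0 * np.sqrt(pca.explained_variance_)   # ~2-sigma semi-axes
    angle = np.degrees(np.arctan2(*pca.components_[0][::-1]))
    print(f"cluster {k}: center={center.round(1)}, "
          f"semi-axes={axes.round(1)}, angle={angle:.0f} deg")
```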
Weights Stagnation in Dynamic Local Search for SAT
Since 1991, attempts have been made to enhance stochastic local search (SLS) techniques. Some researchers turned their focus to studying the structure of propositional satisfiability (SAT) problems to better understand their complexity in order to come up with better algorithms. Other researchers focused on investigating new ways to develop heuristics that alter the search space based on information gathered prior to or during the search process. Thus, many heuristics, enhancements, and developments were introduced to improve the performance of SLS techniques over the last three decades. As a result, a group of heuristics was introduced, namely Dynamic Local Search (DLS), that could outperform systematic search techniques. Interestingly, a common characteristic of DLS heuristics is that they all depend on the use of weights while searching for satisfiable formulas.
In our study we experimentally investigated the behavior and movement of weights during the search for satisfiability using DLS techniques; for simplicity, the DDFW DLS heuristic was chosen. As a result of our studies, we discovered that while solving hard SAT problems, such as blocks-world and graph-coloring problems, weight stagnation occurs in many areas within the search space. We conclude that the occurrence of weight stagnation is highly related to the problem's density, complexity, and connectivity.
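To illustrate how clause weights drive a DLS search, here is a minimal clause-weighting solver sketch. It uses a simple additive reweighting scheme at local minima rather than DDFW's distributed weight transfer, and the instance is a toy example; it is a sketch of the general DLS idea, not the paper's method.

```python
import random

random.seed(0)

def weighted_dls(clauses, n_vars, max_flips=20000):
    """Minimal dynamic local search sketch for SAT. Clauses carry weights;
    the search flips the variable giving the largest weighted gain, and at
    local minima the weights of unsatisfied clauses grow (a simple additive
    scheme; DDFW instead redistributes weight between neighboring clauses)."""
    assign = [random.choice([False, True]) for _ in range(n_vars + 1)]
    weights = [1.0] * len(clauses)

    def sat(i):
        return any(assign[abs(lit)] == (lit > 0) for lit in clauses[i])

    for _ in range(max_flips):
        unsat = {i for i in range(len(clauses)) if not sat(i)}
        if not unsat:
            return assign[1:]
        best_var, best_gain = None, 0.0
        for v in range(1, n_vars + 1):
            assign[v] = not assign[v]          # tentative flip
            newly_sat = sum(weights[i] for i in unsat if sat(i))
            newly_broken = sum(weights[i] for i in range(len(clauses))
                               if i not in unsat and not sat(i))
            assign[v] = not assign[v]          # undo
            if newly_sat - newly_broken > best_gain:
                best_var, best_gain = v, newly_sat - newly_broken
        if best_var is None:
            for i in unsat:                    # local minimum: reweight;
                weights[i] += 1.0              # stagnating weights pile up here
        else:
            assign[best_var] = not assign[best_var]
    return None

# Toy satisfiable instance, DIMACS-style literal lists.
cnf = [[1, 2], [-1, 3], [-2, -3], [2, 3]]
print(weighted_dls(cnf, 3))
```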
The climate and earth sciences have recently undergone a rapid transformation from a data-poor to a data-rich environment. In particular, massive amounts of data about Earth and its environment are now continuously being generated by a large number of Earth-observing satellites as well as physics-based earth system models running on large-scale computational platforms. These massive and information-rich datasets offer huge potential for understanding how the Earth's climate and ecosystem have been changing and how they are being impacted by human actions. This talk will discuss various challenges involved in analyzing these massive data sets, as well as the opportunities they present for advancing both machine learning and the science of climate change, in the context of monitoring the state of tropical forests and surface water on a global scale.
This problem represents an interesting opportunity for scientists and statisticians to collaborate, since the problem is too big for either community alone. The science is not well established, although fairly sophisticated ice flow models exist; they are even becoming relevant for explaining some of the complexity seen in observational data. At the same time, the complex phenomena we see in observations may not be particularly relevant to assessing the risks of significant sea level rise over the near future. The talk will review what we have learned about this problem through the PISCEES SciDAC project. This problem is rich with challenges and opportunities, particularly for realigning how our two communities engage each other. The talk will also review the computational, scientific, and mathematical "reality checks" that might stop any reasonable person from considering this topic further. I then will point out how each of these challenges could be mitigated if these different perspectives were better integrated.
Found this paper really interesting. It delves into the learning behaviors of deep learning ensembles and compares them to Bayesian neural networks, which theoretically do the same thing. This helps answer why deep ensembles outperform.
ABSTRACT: In this paper, we propose a new identification algorithm based on the Kolmogorov–Zurbenko Periodogram (KZP) to separate motions in spatial motion image data. The concept of the directional periodogram is used to sample the wave field and collect information on motion scales and directions. The KZ periodogram enables us to detect precise dominant-frequency information for spatial waves buried in strong background noise. The computation of the directional periodogram filters out most noise effects, and the procedure is robust to missing values and spurious spikes caused by noise and measurement errors. This design is critical for the closure-based clustering method used to find cluster structures of potential parameter solutions in the parameter space. An example based on simulated data demonstrates the four steps of the procedure. Related functions are implemented in our recently published R package {kzfs}.
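The kzfs implementation is in R; the following Python sketch conveys only the underlying idea of a directional periodogram, sampling a 2D periodogram along rays through the origin of frequency space to read off a dominant wave direction. The synthetic wave field, ray discretization, and omission of the KZ smoothing (which gives the method its noise robustness) are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical wave field: two plane waves plus heavy noise on an N x N grid.
N = 128
x, y = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
field = (np.sin(2 * np.pi * (0.10 * x + 0.05 * y)) +
         0.7 * np.sin(2 * np.pi * (-0.02 * x + 0.12 * y)) +
         2.0 * rng.standard_normal((N, N)))

# 2D periodogram; a directional periodogram samples it along rays through
# the origin of the frequency plane.
P = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2 / field.size
fx = fy = np.fft.fftshift(np.fft.fftfreq(N))

def directional_power(theta, n_samples=64):
    """Total power along direction theta (radians), coarse ray sampling."""
    r = np.linspace(0, 0.5, n_samples)
    ix = np.clip(np.searchsorted(fx, r * np.cos(theta)), 0, N - 1)
    iy = np.clip(np.searchsorted(fy, r * np.sin(theta)), 0, N - 1)
    return P[ix, iy].sum()

thetas = np.linspace(0, np.pi, 180, endpoint=False)
best = thetas[np.argmax([directional_power(t) for t in thetas])]
print(f"dominant wave direction ~ {np.degrees(best):.0f} degrees")
```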
Model calibration or data inversion involves using experimental or field data to estimate the unknown parameters in a mathematical model. In the first part of the talk, I will review a few approaches for model calibration or data inversion, with a focus on model discrepancy and measurement bias. Several state-of-the-art methods will be introduced, such as modeling the discrepancy by a Gaussian stochastic process (GaSP) or scaled Gaussian stochastic process (S-GaSP), L2 calibration, least squares (LS) calibration, and orthogonal Gaussian process calibration. The connections and differences between these methods will be discussed. In the second part of the talk, I will discuss our ongoing work on calibrating a geophysical model by integrating different types of field data, such as interferometric synthetic aperture radar (InSAR) interferograms, GPS data, and tilt and lava-lake velocities from the Kilauea Volcano during the eruption in 2018. This task is complicated by the discrepancy between the model and reality, different sample sizes, and possible bias in the field data. We introduce the scaled Gaussian stochastic process (S-GaSP), a new stochastic process for modeling the discrepancy function in calibration that addresses the identifiability issue between the calibrated mathematical model and the discrepancy function. We also compare a few approaches to modeling the measurement bias in the data. A feasible way to fuse field data from multiple sources will then be discussed. The calibration models are implemented in the "RobustCalibration" R package on CRAN. The scientific goal of this work is to use data from May 2018, during the earthquake and eruption of the Kilauea Volcano, to resolve the location, volume, and pressure change in the Halema'uma'u Reservoir, and to relate the results to inferences from past caldera collapses.
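A toy sketch of the setting that motivates discrepancy modeling: calibrating a one-parameter model against data generated with a smooth discrepancy, once by plain least squares and once with a generic GP discrepancy. This is GaSP-style calibration under made-up functions and bounds, not the talk's S-GaSP or RobustCalibration implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(5)

# Hypothetical setting: reality = model(theta*) + smooth discrepancy + noise.
def model(x, theta):
    return np.sin(theta * x)

x = np.linspace(0, 2 * np.pi, 60)
y = model(x, 1.3) + 0.25 * np.cos(2.5 * x) + 0.05 * rng.standard_normal(x.size)

# Least squares (L2-style) calibration: ignore the discrepancy entirely.
ls = minimize_scalar(lambda t: np.sum((y - model(x, t)) ** 2),
                     bounds=(0.5, 2.5), method="bounded")

# GP-discrepancy calibration (plain GaSP; the talk's S-GaSP adds a scaling
# that penalizes large discrepancies to aid identifiability).
def neg_marginal_loglik(t):
    gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(0.01))
    gp.fit(x[:, None], y - model(x, t))
    return -gp.log_marginal_likelihood_value_

gasp = minimize_scalar(neg_marginal_loglik, bounds=(0.5, 2.5), method="bounded")
print(f"true theta = 1.3, LS estimate = {ls.x:.3f}, GaSP estimate = {gasp.x:.3f}")
```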
Spectral clustering with motifs and higher-order structures (David Gleich)
I presented these slides at the #strathna meeting in Glasgow in June 2017. They are an updated and enhanced version of the earlier talks on the subject.
We have two sources of forest variables: direct measurements, which are expensive and sparse in space, and correlated LiDAR data with complete coverage. The Bonanza Creek Experimental Forest (BCEF) is a Long-Term Ecological Research (LTER) site consisting of vegetation and landforms typical of interior Alaska. Three forest variables are of interest: above-ground biomass (AGB), tree density (TD), and basal area (BA). The brightness, greenness, and wetness tasseled cap indices can be used as covariates to explain the forest variables. In the undergraduate workshop project, students can work their way up from the simplest regression models to more sophisticated spatial models and compare the differences in the resulting inferences; a minimal starting-point example follows the group list below.
Group members: Richard Groenwald, Mehmut Hatip, Katrina Lewis, Jennifer Soter, Astride Tchkaoua, Sylvester Wieb
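A minimal version of the workshop's starting point, regressing one forest variable on the three tasseled cap indices; the data here are synthetic stand-ins, not BCEF measurements.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)

# Hypothetical stand-in data: tasseled cap indices at n plots (brightness,
# greenness, wetness) and sparsely measured above-ground biomass (AGB).
n = 120
tc = rng.normal(size=(n, 3))
agb = 80 + tc @ np.array([5.0, 12.0, 7.0]) + 10 * rng.standard_normal(n)

# The simplest regression model: AGB on the three indices.
fit = LinearRegression().fit(tc, agb)
print("R^2 =", round(fit.score(tc, agb), 3), "coefs =", fit.coef_.round(2))
# Spatial models would add a structured residual (e.g., a Gaussian process
# over plot coordinates) to borrow strength between nearby plots.
```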
Over a seven day period in August 2017 Hurricane Harvey brought extreme levels of rainfall to the Houston area, resulting in catastrophic flooding that caused loss of human life and damage to personal property and public infrastructure. In the wake of this event, there is growing interest in understanding the degree to which this event was unusual and estimating the probability of experiencing a similar event in other locations. Additionally, we investigate the degree to which the sea surface temperature in the Gulf of Mexico is associated with extreme precipitation in the US Gulf Coast. This talk addresses these issues through the development of an extreme value model.
We assume that the annual maximum precipitation values at Gulf Coast locations approximately follow the Generalized Extreme Value (GEV) distribution. Because the observed precipitation record in this region is relatively short, we borrow strength across spatial locations to improve GEV parameter estimates. We model the GEV parameters at US Gulf Coast locations using a multivariate spatial hierarchical model; for inference, a two-stage approach is utilized. Spatial interpolation is used to estimate GEV parameters at unobserved locations, allowing us to characterize precipitation extremes throughout the region. Analysis indicates that Harvey was highly unusual as a seven-day event, and that Gulf of Mexico SST seems to be more strongly linked to extreme precipitation in the western part of the region.
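A single-station sketch of the GEV building block, fitting annual maxima by maximum likelihood and converting an event magnitude to an annual exceedance probability; the data, event magnitude, and parameter values are synthetic, and the hierarchical spatial pooling described above is not shown.

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(7)

# Hypothetical annual-maximum 7-day precipitation series at one station (inches).
annmax = genextreme.rvs(c=-0.1, loc=10, scale=3, size=40, random_state=rng)

# Fit the GEV by maximum likelihood (scipy's shape c is the negated xi).
c, loc, scale = genextreme.fit(annmax)

harvey_like = 35.0                                   # assumed event magnitude
p = genextreme.sf(harvey_like, c, loc, scale)        # annual exceedance prob.
print(f"annual exceedance probability = {p:.2e}, return period ~ {1 / p:,.0f} yr")
# The talk's hierarchical model instead ties (mu, sigma, xi) together across
# Gulf Coast stations so that short records borrow strength spatially.
```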
Recently, the machine learning community has expressed strong interest in applying latent variable modeling strategies to causal inference problems with unobserved confounding. Here, I discuss one of the big debates that occurred over the past year, and how we can move forward. I will focus specifically on the failure of point identification in this setting, and discuss how this can be used to design flexible sensitivity analyses that cleanly separate identified and unidentified components of the causal model.
I will discuss paradigmatic statistical models of inference and learning from high dimensional data, such as sparse PCA and the perceptron neural network, in the sub-linear sparsity regime. In this limit the underlying hidden signal, i.e., the low-rank matrix in PCA or the neural network weights, has a number of non-zero components that scales sub-linearly with the total dimension of the vector. I will provide explicit low-dimensional variational formulas for the asymptotic mutual information between the signal and the data in suitable sparse limits. In the setting of support recovery these formulas imply sharp 0-1 phase transitions for the asymptotic minimum mean-square-error (or generalization error in the neural network setting). A similar phase transition was analyzed recently in the context of sparse high-dimensional linear regression by Reeves et al.
Many different measurement techniques are used to record neural activity in the brains of different organisms, including fMRI, EEG, MEG, lightsheet microscopy, and direct recordings with electrodes. Each of these measurement modes has its advantages and disadvantages concerning the resolution of the data in space and time, the directness of the measurement of neural activity, and the organisms to which it can be applied. For some of these modes and for some organisms, significant amounts of data are now available in large standardized open-source datasets. I will report on our efforts to apply causal discovery algorithms to, among others, fMRI data from the Human Connectome Project and lightsheet microscopy data from zebrafish larvae. In particular, I will focus on the challenges we have faced both in terms of the nature of the data and the computational features of the discovery algorithms, as well as the modeling of experimental interventions.
Bayesian Additive Regression Trees (BART) has been shown to be an effective framework for modeling nonlinear regression functions, with strong predictive performance in a variety of contexts. The BART prior over a regression function is defined by independent prior distributions on tree structure and leaf or end-node parameters. In observational data settings, Bayesian Causal Forests (BCF) has successfully adapted BART for estimating heterogeneous treatment effects, particularly in cases where standard methods yield biased estimates due to strong confounding.
We introduce BART with Targeted Smoothing, an extension which induces smoothness over a single covariate by replacing independent Gaussian leaf priors with smooth functions. We then introduce a new version of the Bayesian Causal Forest prior, which incorporates targeted smoothing for modeling heterogeneous treatment effects which vary smoothly over a target covariate. We demonstrate the utility of this approach by applying our model to a timely women's health and policy problem: comparing two dosing regimens for an early medical abortion protocol, where the outcome of interest is the probability of a successful early medical abortion procedure at varying gestational ages, conditional on patient covariates. We discuss the benefits of this approach in other women’s health and obstetrics modeling problems where gestational age is a typical covariate.
Difference-in-differences is a widely used evaluation strategy that draws causal inference from observational panel data. Its causal identification relies on the assumption of parallel trends, which is scale-dependent and may be questionable in some applications. A common alternative is a regression model that adjusts for the lagged dependent variable, which rests on the assumption of ignorability conditional on past outcomes. In the context of linear models, Angrist and Pischke (2009) show that the difference-in-differences and lagged-dependent-variable regression estimates have a bracketing relationship. Namely, for a true positive effect, if ignorability is correct, then mistakenly assuming parallel trends will overestimate the effect; in contrast, if the parallel trends assumption is correct, then mistakenly assuming ignorability will underestimate the effect. We show that the same bracketing relationship holds in general nonparametric (model-free) settings. We also extend the result to semiparametric estimation based on inverse probability weighting.
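A simulation sketch of the bracketing relationship, under a hypothetical data-generating process in which ignorability holds (treatment depends on the lagged outcome, in an Ashenfelter-dip pattern) so that mistakenly assuming parallel trends overestimates a positive effect; all coefficients and sample sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical two-period panel: treatment assigned based on the period-1
# outcome (ignorability-given-lagged-outcome holds, parallel trends fails).
n, tau = 50_000, 1.0                            # tau = true treatment effect
y0 = rng.normal(0, 1, n)                        # pre-period outcome
d = rng.binomial(1, 1 / (1 + np.exp(y0)))       # treated more often if y0 low
y1 = 0.5 * y0 + tau * d + rng.normal(0, 1, n)   # mean reversion + effect

# Difference-in-differences: compare outcome *changes* across groups.
did = (y1 - y0)[d == 1].mean() - (y1 - y0)[d == 0].mean()

# Lagged-dependent-variable regression: y1 on d, controlling for y0 (OLS).
X = np.column_stack([np.ones(n), d, y0])
ldv = np.linalg.lstsq(X, y1, rcond=None)[0][1]

print(f"true = {tau}, DID = {did:.3f}, LDV = {ldv:.3f}")
# Here ignorability is the correct assumption, so LDV ~ tau while DID
# overestimates; the bracketing result says the two estimates bound the truth.
```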
We develop sensitivity analyses for weak nulls in matched observational studies while allowing unit-level treatment effects to vary. In contrast to randomized experiments and paired observational studies, we show for general matched designs that over a large class of test statistics, any valid sensitivity analysis for the weak null must be unnecessarily conservative if Fisher's sharp null of no treatment effect for any individual also holds. We present a sensitivity analysis valid for the weak null, and illustrate why it is conservative if the sharp null holds through connections to inverse probability weighted estimators. An alternative procedure is presented that is asymptotically sharp if treatment effects are constant, and is valid for the weak null under additional assumptions which may be deemed reasonable by practitioners. The methods may be applied to matched observational studies constructed using any optimal without-replacement matching algorithm, allowing practitioners to assess robustness to hidden bias while allowing for treatment effect heterogeneity.
The world of health care is full of policy interventions: a state expands eligibility rules for its Medicaid program, a medical society changes its recommendations for screening frequency, a hospital implements a new care coordination program. After a policy change, we often want to know, “Did it work?” This is a causal question; we want to know whether the policy CAUSED outcomes to change. One popular way of estimating causal effects of policy interventions is a difference-in-differences study. In this controlled pre-post design, we measure the change in outcomes of people who are exposed to the new policy, comparing average outcomes before and after the policy is implemented. We contrast that change to the change over the same time period in people who were not exposed to the new policy. The differential change in the treated group’s outcomes, compared to the change in the comparison group’s outcomes, may be interpreted as the causal effect of the policy. To do so, we must assume that the comparison group’s outcome change is a good proxy for the treated group’s (counterfactual) outcome change in the absence of the policy. This conceptual simplicity and wide applicability in policy settings makes difference-in-differences an appealing study design. However, the apparent simplicity belies a thicket of conceptual, causal, and statistical complexity. In this talk, I will introduce the fundamentals of difference-in-differences studies and discuss recent innovations including key assumptions and ways to assess their plausibility, estimation, inference, and robustness checks.
We present recent advances and statistical developments for evaluating Dynamic Treatment Regimes (DTR), which allow the treatment to be dynamically tailored according to evolving subject-level data. Identification of an optimal DTR is a key component for precision medicine and personalized health care. Specific topics covered in this talk include several recent projects with robust and flexible methods developed for the above research area. We will first introduce a dynamic statistical learning method, adaptive contrast weighted learning (ACWL), which combines doubly robust semiparametric regression estimators with flexible machine learning methods. We will further develop a tree-based reinforcement learning (T-RL) method, which builds an unsupervised decision tree that maintains the nature of batch-mode reinforcement learning. Unlike ACWL, T-RL handles the optimization problem with multiple treatment comparisons directly through a purity measure constructed with augmented inverse probability weighted estimators. T-RL is robust, efficient and easy to interpret for the identification of optimal DTRs. However, ACWL seems more robust against tree-type misspecification than T-RL when the true optimal DTR is non-tree-type. At the end of this talk, we will also present a new Stochastic-Tree Search method called ST-RL for evaluating optimal DTRs.
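For orientation, here is the plain regression-based backward-induction (Q-learning) baseline that methods such as ACWL and T-RL aim to robustify, on a hypothetical two-stage example with linear Q-functions; the data-generating process and feature choices are illustrative assumptions, not the talk's methods.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(9)

# Hypothetical two-stage trial: covariate x_t, randomized binary treatment
# a_t, final outcome y. The optimal rules depend on x-by-a interactions.
n = 5000
x1 = rng.normal(size=n)
a1 = rng.binomial(1, 0.5, n)
x2 = 0.5 * x1 + rng.normal(size=n)
a2 = rng.binomial(1, 0.5, n)
y = x1 + a1 * (1.0 - x1) + a2 * x2 + rng.normal(size=n)

def q_features(x, a):
    return np.column_stack([x, a, x * a])

# Backward induction: fit the stage-2 Q-function, propagate its optimum
# back as the pseudo-outcome for the stage-1 Q-function.
q2 = LinearRegression().fit(q_features(x2, a2), y)
v2 = np.maximum(q2.predict(q_features(x2, np.zeros(n))),
                q2.predict(q_features(x2, np.ones(n))))
q1 = LinearRegression().fit(q_features(x1, a1), v2)

def rule(q, x):
    """Estimated optimal rule: treat where the fitted treatment contrast > 0."""
    return (q.predict(q_features(x, np.ones_like(x)))
            > q.predict(q_features(x, np.zeros_like(x)))).astype(int)

print("stage-2 rule treats when x2 > 0:", rule(q2, np.array([-1.0, 1.0])))
print("stage-1 rule treats when x1 < 1:", rule(q1, np.array([0.0, 2.0])))
```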
A fundamental feature of evaluating causal health effects of air quality regulations is that air pollution moves through space, rendering health outcomes at a particular population location dependent upon regulatory actions taken at multiple, possibly distant, pollution sources. Motivated by studies of the public-health impacts of power plant regulations in the U.S., this talk introduces the novel setting of bipartite causal inference with interference, which arises when 1) treatments are defined on observational units that are distinct from those at which outcomes are measured and 2) there is interference between units in the sense that outcomes for some units depend on the treatments assigned to many other units. Interference in this setting arises due to complex exposure patterns dictated by physical-chemical atmospheric processes of pollution transport, with intervention effects framed as propagating across a bipartite network of power plants and residential zip codes. New causal estimands are introduced for the bipartite setting, along with an estimation approach based on generalized propensity scores for treatments on a network. The new methods are deployed to estimate how emission-reduction technologies implemented at coal-fired power plants causally affect health outcomes among Medicare beneficiaries in the U.S.
Laine Thomas presented information about how causal inference is being used to determine the cost/benefit of the two most common surgical treatments for women - hysterectomy and myomectomy.
We provide an overview of some recent developments in machine learning tools for dynamic treatment regime discovery in precision medicine. The first development is a new off-policy reinforcement learning tool for continual learning in mobile health, enabling patients with type 1 diabetes to exercise safely. The second development is a new inverse reinforcement learning tool which enables the use of observational data to learn how clinicians balance competing priorities when treating depression and mania in patients with bipolar disorder. Both practical and technical challenges are discussed.
The method of differences-in-differences (DID) is widely used to estimate causal effects. The primary advantage of DID is that it can account for time-invariant bias from unobserved confounders. However, the standard DID estimator will be biased if there is an interaction between history in the after period and the groups. That is, bias will be present if an event besides the treatment occurs at the same time and affects the treated group in a differential fashion. We present a method of bounds based on DID that accounts for an unmeasured confounder that has a differential effect in the post-treatment time period. These DID bracketing bounds are simple to implement and only require partitioning the controls into two separate groups. We also develop two key extensions for DID bracketing bounds. First, we develop a new falsification test to probe the key assumption that is necessary for the bounds estimator to provide consistent estimates of the treatment effect. Next, we develop a method of sensitivity analysis that adjusts the bounds for possible bias based on differences between the treated and control units from the pretreatment period. We apply these DID bracketing bounds and the new methods we develop to an application on the effect of voter identification laws on turnout. Specifically, we focus on estimating whether the enactment of voter identification laws in Georgia and Indiana had an effect on voter turnout.
We study experimental design in large-scale stochastic systems with substantial uncertainty and structured cross-unit interference. We consider the problem of a platform that seeks to optimize supply-side payments p in a centralized marketplace where different suppliers interact via their effects on the overall supply-demand equilibrium, and propose a class of local experimentation schemes that can be used to optimize these payments without perturbing the overall market equilibrium. We show that, as the system size grows, our scheme can estimate the gradient of the platform’s utility with respect to p while perturbing the overall market equilibrium by only a vanishingly small amount. We can then use these gradient estimates to optimize p via any stochastic first-order optimization method. These results stem from the insight that, while the system involves a large number of interacting units, any interference can only be channeled through a small number of key statistics, and this structure allows us to accurately predict feedback effects that arise from global system changes using only information collected while remaining in equilibrium.
We discuss a general roadmap for generating causal inference based on observational studies used to generate real-world evidence. We review targeted minimum loss estimation (TMLE), which provides a general template for the construction of asymptotically efficient plug-in estimators of a target estimand for realistic (i.e., infinite-dimensional) statistical models. TMLE is a two-stage procedure that first involves using an ensemble machine learning method termed super-learning to estimate the relevant stochastic relations between the treatment, censoring, covariates, and outcome of interest. The super-learner allows one to fully utilize all the advances in machine learning (in addition to more conventional parametric-model-based estimators) to build a single most powerful ensemble machine learning algorithm. We present the Highly Adaptive Lasso as an important machine learning algorithm to include.
In the second step, the TMLE involves maximizing a parametric likelihood along a so-called least-favorable parametric model through the super-learner fit of the relevant stochastic relations in the observed data. This second step bridges the state of the art in machine learning to estimators of target estimands for which statistical inference is available (i.e., confidence intervals, p-values, etc.). We also review recent advances in collaborative TMLE, in which the fit of the treatment and censoring mechanism is tailored with respect to the performance of the TMLE. We also discuss asymptotically valid bootstrap-based inference. Simulations and data analyses are provided as demonstrations.
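A compact sketch of the two TMLE stages for the average treatment effect on simulated data: initial machine-learning fits, then the one-parameter fluctuation along the least-favorable submodel using the "clever covariate". A single gradient-boosting fit stands in for the super-learner, and the data-generating process is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(10)
expit = lambda z: 1 / (1 + np.exp(-z))
logit = lambda p: np.log(p) - np.log(1 - p)

# Hypothetical confounded data: covariates W drive both treatment A and Y.
n = 20_000
W = rng.normal(size=(n, 3))
A = rng.binomial(1, expit(W[:, 0] - 0.5 * W[:, 1]))
Y = rng.binomial(1, expit(0.8 * A + W[:, 0] + 0.3 * W[:, 2] - 0.5))

# Stage 1: initial fits of the outcome regression Q(A, W) and the
# propensity g(W) (a real TMLE would plug in a super-learner here).
Q = GradientBoostingClassifier().fit(np.column_stack([A, W]), Y)
Q1 = np.clip(Q.predict_proba(np.column_stack([np.ones(n), W]))[:, 1], 1e-6, 1 - 1e-6)
Q0 = np.clip(Q.predict_proba(np.column_stack([np.zeros(n), W]))[:, 1], 1e-6, 1 - 1e-6)
QA = np.where(A == 1, Q1, Q0)
g = np.clip(LogisticRegression().fit(W, A).predict_proba(W)[:, 1], 1e-3, 1 - 1e-3)

# Stage 2 (targeting): one-parameter fluctuation, maximizing the likelihood
# with logit(QA) as offset and the clever covariate H as the direction.
H = A / g - (1 - A) / (1 - g)
nll = lambda e: -np.sum(Y * np.log(expit(logit(QA) + e * H)) +
                        (1 - Y) * np.log(1 - expit(logit(QA) + e * H)))
eps = minimize_scalar(nll, bounds=(-1.0, 1.0), method="bounded").x

ate = np.mean(expit(logit(Q1) + eps / g) - expit(logit(Q0) - eps / (1 - g)))
print(f"targeted (TMLE) ATE estimate = {ate:.3f}")
```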
We describe different approaches for specifying models and prior distributions for estimating heterogeneous treatment effects using Bayesian nonparametric models. We make an affirmative case for direct, informative (or partially informative) prior distributions on heterogeneous treatment effects, especially when treatment effect size and treatment effect variation is small relative to other sources of variability. We also consider how to provide scientifically meaningful summaries of complicated, high-dimensional posterior distributions over heterogeneous treatment effects with appropriate measures of uncertainty.
Climate change mitigation has traditionally been analyzed as some version of a public goods game (PGG) in which a group is most successful if everybody contributes, but players are best off individually by not contributing anything (i.e., “free-riding”)—thereby creating a social dilemma. Analysis of climate change using the PGG and its variants has helped explain why global cooperation on GHG reductions is so difficult, as nations have an incentive to free-ride on the reductions of others. Rather than inspire collective action, it seems that the lack of progress in addressing the climate crisis is driving the search for a “quick fix” technological solution that circumvents the need for cooperation.
This seminar discussed ways in which to produce professional academic writing, from academic papers to research proposals or technical writing in general.
Machine learning (including deep and reinforcement learning) and blockchain are two of the most notable technologies in recent years. The first is the foundation of artificial intelligence and big data, and the second has significantly disrupted the financial industry. Both technologies are data-driven, and thus there is rapidly growing interest in integrating them for more secure and efficient data sharing and analysis. In this paper, we review the research on combining blockchain and machine learning technologies and demonstrate that they can collaborate efficiently and effectively. In the end, we point out some future directions and expect more research on deeper integration of the two promising technologies.
In this talk, we discuss QuTrack, a Blockchain-based approach to track experiment and model changes primarily for AI and ML models. In addition, we discuss how change analytics can be used for process improvement and to enhance the model development and deployment processes.
Read| The latest issue of The Challenger is here! We are thrilled to announce that our school paper has qualified for the NATIONAL SCHOOLS PRESS CONFERENCE (NSPC) 2024. Thank you for your unwavering support and trust. Dive into the stories that made us stand out!
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxEduSkills OECD
Andreas Schleicher presents at the OECD webinar ‘Digital devices in schools: detrimental distraction or secret to success?’ on 27 May 2024. The presentation was based on findings from PISA 2022 results and the webinar helped launch the PISA in Focus ‘Managing screen time: How to protect and equip students against distraction’ https://www.oecd-ilibrary.org/education/managing-screen-time_7c225af4-en and the OECD Education Policy Perspective ‘Students, digital devices and success’ can be found here - https://oe.cd/il/5yV
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
Ethnobotany and Ethnopharmacology:
Ethnobotany in herbal drug evaluation,
Impact of Ethnobotany in traditional medicine,
New development in herbals,
Bio-prospecting tools for drug discovery,
Role of Ethnopharmacology in drug evaluation,
Reverse Pharmacology.
The Roman Empire A Historical Colossus.pdfkaushalkr1407
The Roman Empire, a vast and enduring power, stands as one of history's most remarkable civilizations, leaving an indelible imprint on the world. It emerged from the Roman Republic, transitioning into an imperial powerhouse under the leadership of Augustus Caesar in 27 BCE. This transformation marked the beginning of an era defined by unprecedented territorial expansion, architectural marvels, and profound cultural influence.
The empire's roots lie in the city of Rome, founded, according to legend, by Romulus in 753 BCE. Over centuries, Rome evolved from a small settlement to a formidable republic, characterized by a complex political system with elected officials and checks on power. However, internal strife, class conflicts, and military ambitions paved the way for the end of the Republic. Julius Caesar’s dictatorship and subsequent assassination in 44 BCE created a power vacuum, leading to a civil war. Octavian, later Augustus, emerged victorious, heralding the Roman Empire’s birth.
Under Augustus, the empire experienced the Pax Romana, a 200-year period of relative peace and stability. Augustus reformed the military, established efficient administrative systems, and initiated grand construction projects. The empire's borders expanded, encompassing territories from Britain to Egypt and from Spain to the Euphrates. Roman legions, renowned for their discipline and engineering prowess, secured and maintained these vast territories, building roads, fortifications, and cities that facilitated control and integration.
The Roman Empire’s society was hierarchical, with a rigid class system. At the top were the patricians, wealthy elites who held significant political power. Below them were the plebeians, free citizens with limited political influence, and the vast numbers of slaves who formed the backbone of the economy. The family unit was central, governed by the paterfamilias, the male head who held absolute authority.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
Undergraduate Modeling Workshop - Hierarchical Models for Sparsely Sampled High-Dimensional LiDAR and Forest Variables, Andrew Finley, May 22, 2018
1. Hierarchical models for sparsely sampled high-dimensional LiDAR and forest variables: An interior Alaska FIA case study
Andrew Finley (Michigan State University)
Hans-Erik Andersen (Forest Service, Forest Inventory and Analysis)
Sudipto Banerjee (University of California, Los Angeles)
Bruce Cook (NASA Goddard Space Flight Center)
Doug Morton (NASA Goddard Space Flight Center)
2. Extraordinary opportunities to understand the spatial and temporal complexity of environmental processes at broad scales. Unprecedented investment to collect, develop, and distribute data and tools to further large-scale and long-term science. For example:
National Ecological Observatory Network (NEON): designed to detect and enable forecasting of ecological change at continental scales over time; an NSF $434 million, 30-year project
USDA Forest Service Forest Inventory and Analysis (FIA): designed to monitor status and trends in forest land; since 1998, FIA has measured 510,340 inventory plots across the conterminous US, comprising 5,839,642 trees (now with 2+ repeated measurements)!
National Aeronautics and Space Administration (NASA): Global Ecosystem Dynamics Investigation LiDAR (GEDI); a $95 million, 5-year project
4. Key challenges in spatiotemporal environmental data analysis
Data sets often exhibit:
missingness and misalignment among outcomes
space- and time-varying impact of covariates
complex residual dependence structures
nonstationarity among multiple outcomes across locations
unknown time and perhaps space lags between outcomes and covariates
Interest in modeling frameworks that:
incorporate many sources of space- and time-indexed data
accommodate structured residual dependence
propagate uncertainty through to predictions
scale to effectively exploit information in massive data sets
5. Joint NASA and FIA Forest Service initiative
Project goal: Design and implement an operational forest inventory in Interior AK by extending sparse networks of ground samples with space- and airborne multi-sensor data.
Data products:
1. Complete coverage maps (e.g., 15×15 m resolution) of forest:
from inventory plots: above-ground biomass (AGB; Mg/ha), basal area (BA; m^2/ha), and density (TPH; trees/ha)
from LiDAR: fractional cover (FC; %) and canopy height (P95; m)
2. Pixel-level prediction with uncertainty estimates
3. Biologically consistent relationships among predictions
4. Reporting with uncertainty for user-defined areas
Fun read: www.wired.com/2014/12/alaska-laser-survey-3d-map
6. Tanana Inventory Unit (TIU)
Data:
1. LiDAR transects: 50,000 km of flight lines, 25 TB of data, ~43 million signals {FC, P95}
2. Inventory plots: 1,461 plots of ~7 m radius {AGB, BA, TPH}
3. Complete coverage: % forest canopy (TC) and forest fire (FIRE)
9. [Figure: forest fire within 20 years across the TIU; axes Easting (km) and Northing (km).]
11. Key inferential challenges
incorporate many sources of spatially indexed data
address misalignment (missingness) among responses
accommodate and leverage residual spatial dependence
propagate parameter uncertainty through to predictions
deliver statistically valid probabilistic prediction for arbitrary areas
maintain observed covariance among multivariate predictions
scale to effectively exploit information in massive data sets
Some anticipated extensions:
incorporate time-indexed observations
model nonstationarity among multiple responses across locations
estimate space- and time-varying impact of covariates
12. Hierarchical Gaussian process models
Say we observe q outcomes at a generic location ℓ within domain L. A multivariate spatial regression:
y_k(ℓ) = x_k(ℓ)'β_k + w_k(ℓ) + e_k(ℓ), for k = 1, 2, ..., q
y_k(ℓ) is the kth outcome at location ℓ (e.g., AGB, BA, TPH, FC, P95)
Mean: x_k(ℓ) includes an intercept, TC, and FIRE
Cov: w(ℓ) = (w_1(ℓ), w_2(ℓ), ..., w_q(ℓ))' ~ MVGP(0, Γ_θ(·,·)), where Γ_θ(ℓ, ℓ') = {Cov(w_i(ℓ), w_j(ℓ'))} for i, j = 1, 2, ..., q
Error: e(ℓ) = (e_1(ℓ), e_2(ℓ), ..., e_q(ℓ))' ~ MVN(0, Ψ)
For the TIU we must accommodate spatial misalignment (i.e., the y_k's are only partially observed at some locations); see, e.g., Gelfand et al. 2004 and Finley et al. 2014.
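To make the model concrete, the following minimal R sketch (an illustration, not the authors' code) simulates q = 2 outcomes from this regression at random locations, assuming a separable cross-covariance Γ_θ(ℓ, ℓ') = ρ(ℓ, ℓ'; φ) T with an exponential correlation ρ and a 2 × 2 cross-covariance matrix T; all names and parameter values are made up.

## Sketch: simulate q = 2 outcomes from y_k(l) = x_k(l)'beta_k + w_k(l) + e_k(l),
## assuming a separable cross-covariance Gamma(l, l') = rho(l, l'; phi) * T.
set.seed(1)
n <- 200; q <- 2
coords <- cbind(runif(n), runif(n))              # random locations in [0,1]^2
X <- cbind(1, rnorm(n))                          # intercept plus one covariate
beta <- cbind(c(1, 0.5), c(-1, 2))               # column k holds beta_k
phi <- 3                                         # spatial decay
Tc <- matrix(c(1, 0.7, 0.7, 1.5), 2, 2)          # cross-covariance among outcomes
Psi <- diag(c(0.1, 0.2))                         # non-spatial error covariance

R <- exp(-phi * as.matrix(dist(coords)))         # exponential spatial correlation
Gamma <- kronecker(R, Tc)                        # separable (nq) x (nq) covariance
w <- matrix(t(chol(Gamma)) %*% rnorm(n * q), ncol = q, byrow = TRUE)
e <- matrix(rnorm(n * q), n, q) %*% chol(Psi)    # rows ~ MVN(0, Psi)
Y <- X %*% beta + w + e                          # n x q outcome matrix

## Misalignment as in the TIU: outcome 1 unobserved at some locations
Y[sample(n, 50), 1] <- NA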
13. Hierarchical Gaussian process models
[Diagram: multivariate spatial regression model for the TIU and similar settings, linking the forest variables and the LiDAR variables, each with a trend (Landsat, etc.) and a spatial random effect with its own spatial decay, together with a non-spatial error variance-covariance and a spatial variance-covariance.]
18. Spatiotemporal regression models
Start with a simple univariate regression:
y(ℓ) = x(ℓ)'β + w(ℓ) + e(ℓ)
Potentially very rich: understand the spatially- and/or temporally-varying impact of the intercept or predictors on the outcome
Produce maps of the random effects: {w(ℓ) : ℓ ∈ L}
L is a spatial domain (e.g., L ⊂ ℝ^d) or a spatiotemporal domain (e.g., L ⊂ ℝ^d × ℝ^+)
Model-based predictions: y(ℓ_0) | {y(ℓ_1), y(ℓ_2), ..., y(ℓ_n)}
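To illustrate what "model-based predictions" means here, the sketch below computes the plug-in conditional (kriging-type) mean and variance of y(ℓ_0) given y(ℓ_1), ..., y(ℓ_n), assuming an exponential covariance with known, made-up parameters; in a full Bayesian analysis the same computation is repeated over posterior draws of (β, θ).

## Sketch: plug-in prediction of y(l_0) | y(l_1), ..., y(l_n) for
## y(l) = x(l)'beta + w(l) + e(l) with an exponential spatial covariance.
set.seed(1)
n <- 400
coords <- cbind(runif(n), runif(n))
X <- cbind(1, rnorm(n))
beta <- c(1, 0.5); sigma2 <- 1; phi <- 3; tau2 <- 0.1

Dmat <- as.matrix(dist(coords))
Sigma <- sigma2 * exp(-phi * Dmat) + tau2 * diag(n)   # covariance of observed y
y <- as.numeric(X %*% beta + t(chol(Sigma)) %*% rnorm(n))

## Prediction at a new location l_0 (covariate value x0 is made up)
l0 <- c(0.5, 0.5); x0 <- c(1, 0)
k0 <- sigma2 * exp(-phi * sqrt(colSums((t(coords) - l0)^2)))  # Cov(y(l_0), y)
SigInv.r <- solve(Sigma, y - X %*% beta)                      # Sigma^{-1}(y - X beta)
pred.mean <- sum(x0 * beta) + sum(k0 * SigInv.r)
pred.var  <- sigma2 + tau2 - sum(k0 * solve(Sigma, k0))
c(pred.mean, pred.var)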
19. Gaussian spatiotemporal process
{w(ℓ) : ℓ ∈ L} ~ GP(0, K_θ(·,·)) implies
w = (w(ℓ_1), w(ℓ_2), ..., w(ℓ_n))' ~ MVN(0, K_θ)
for every finite set of points ℓ_1, ℓ_2, ..., ℓ_n.
K_θ = {K_θ(ℓ_i, ℓ_j)} is the n × n spatial variance-covariance matrix, where θ = {σ, φ}
Stationarity: K_θ(ℓ, ℓ') = K_θ(ℓ - ℓ'). Isotropy: K_θ(ℓ, ℓ') = K_θ(‖ℓ - ℓ'‖).
20. Likelihood from (full rank) GP models
Assuming {w(ℓ) : ℓ ∈ L} ~ GP(0, K_θ(·,·)) implies
w = (w(ℓ_1), w(ℓ_2), ..., w(ℓ_n))' ~ MVN(0, K_θ).
Estimating process parameters from the likelihood involves:
log p(w) ∝ -(1/2) log det(K_θ) - (1/2) w'K_θ^{-1}w
Bayesian inference: priors on θ and many Markov chain Monte Carlo (MCMC) iterations.
See, e.g., Finley et al. 2015 and Finley et al. 2017 for some coding tips.
21. Computation issues
Storage: n^2 pairwise distances to compute K_θ
K_θ is dense; we need to solve K_θ x = b and we need det(K_θ)
This is best achieved using the Cholesky factorization chol(K_θ) = L D L'
Complexity: roughly O(n^3) flops
Computationally infeasible for large datasets
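To make these costs and quantities tangible, here is a minimal R sketch (assuming an isotropic exponential covariance and illustrative parameter values) that evaluates log p(w) from a single Cholesky factorization, which supplies both the log-determinant and the quadratic form.

## Sketch: evaluate the GP log-likelihood for an isotropic exponential
## covariance K(l, l') = sigma2 * exp(-phi * ||l - l'||).
gp.loglik <- function(w, coords, sigma2, phi) {
  K <- sigma2 * exp(-phi * as.matrix(dist(coords)))  # dense n x n covariance
  L <- t(chol(K))                                    # K = L L', O(n^3) flops
  z <- forwardsolve(L, w)                            # solves L z = w
  logdet <- 2 * sum(log(diag(L)))                    # log det(K)
  -0.5 * logdet - 0.5 * sum(z^2) - 0.5 * length(w) * log(2 * pi)
}

## Toy use: simulate w ~ MVN(0, K) and evaluate its log-likelihood
set.seed(1)
n <- 500
coords <- cbind(runif(n), runif(n))
K <- 1.0 * exp(-3 * as.matrix(dist(coords)))
w <- as.numeric(t(chol(K)) %*% rnorm(n))
gp.loglik(w, coords, sigma2 = 1.0, phi = 3)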
22. Burgeoning literature on spatial big data
Low-rank models: Wahba 1990; Higdon 2002; Kamman & Wand 2003; Paciorek 2007; Rasmussen & Williams 2006; Stein 2007, 2008; Cressie & Johannesson 2008; Banerjee et al. 2008, 2010; Gramacy & Lee 2008; Finley et al. 2009; Sang et al. 2011, 2012; Lemos et al. 2011; Guhaniyogi et al. 2011, 2013; Salazar et al. 2013; Katzfuss 2016
Spectral approximations and composite likelihoods: Fuentes 2007; Paciorek 2007; Eidsvik et al. 2016
Multi-resolution approaches: Nychka 2014; Johannesson et al. 2007; Matsuo et al. 2010; Tzeng & Huang 2015; Katzfuss 2016
Sparsity (solve Ax = b via (i) sparse A or (ii) sparse A^{-1}):
1. Covariance tapering (Furrer et al. 2006; Du et al. 2009; Kaufman et al. 2009; Shaby & Ruppert 2013)
2. GMRF approximations to GPs: INLA (Rue et al. 2009; Lindgren et al. 2011)
3. Local approximate GPs: laGP (Gramacy et al. 2014; Gramacy & Apley 2015)
4. Nearest-neighbor Gaussian process (NNGP) models (Datta et al. 2015, 2016; Finley et al. 2017)
23. Reduced (low) rank models
K_θ ≈ B_θ K*_θ B_θ' + D_θ
B_θ is an n × r matrix of spatial basis functions, with r << n
K*_θ is an r × r spatial covariance matrix
D_θ is either diagonal or sparse
Examples: kernel projections, splines, predictive process, fixed rank kriging (FRK), spectral basis, ...
Computations exploit the above structure: roughly O(nr^2) << O(n^3) flops
24. Low-rank models: hierarchical approach
N(w* | 0, K*_θ) × N(w | B_θ w*, D)
w is n × 1 and n is large
w* is r × 1, where r << n; so K*_θ is r × r
B_θ is an n × r matrix of "basis" functions
D is n × n, but easy to invert (e.g., diagonal)
Derive var(w) (or var(w* | y)) in alternate ways to obtain
(B_θ K*_θ B_θ' + D)^{-1} = D^{-1} - D^{-1} B_θ (K*^{-1}_θ + B_θ' D^{-1} B_θ)^{-1} B_θ' D^{-1}.
This is the famous Sherman-Woodbury-Morrison formula.
Modeling choices: specifying w* and B_θ.
See Finley et al. 2015 for implementation details in the spBayes R package.
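The practical payoff is that the n × n solve is replaced by diagonal scalings and one r × r solve. The sketch below (with made-up B_θ, K*_θ, and diagonal D) applies the formula and checks it against a brute-force solve.

## Sketch: solve (B K* B' + D) x = b via Sherman-Woodbury-Morrison,
## using only a diagonal D and one r x r dense solve.
set.seed(1)
n <- 2000; r <- 25
B <- matrix(rnorm(n * r), n, r)               # n x r basis matrix B_theta
A <- matrix(rnorm(r * r), r, r)
Kstar <- crossprod(A) + diag(r)               # r x r SPD covariance K*_theta
d <- runif(n, 0.5, 1.5)                       # D = diag(d), trivially invertible
b <- rnorm(n)

## Never form the n x n matrix:
Dinv.b <- b / d                               # D^{-1} b
Dinv.B <- B / d                               # D^{-1} B, n x r
core <- solve(Kstar) + crossprod(B, Dinv.B)   # K*^{-1} + B' D^{-1} B, r x r
x <- Dinv.b - Dinv.B %*% solve(core, crossprod(B, Dinv.b))

## Check against the brute-force O(n^3) solve
x.full <- solve(B %*% Kstar %*% t(B) + diag(d), b)
max(abs(x - x.full))                          # tiny, up to rounding error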
25. Oversmoothing due to reduced-rank models
[Figure: comparison of (a) true w, (b) full GP, and (c) predictive process GP (PPGP) with 64 knots, over 2,500 locations. Panel (c) exhibits the oversmoothing induced by the low-rank predictive process.]
See Stein 2014 for good reasons not to use reduced-rank spatial models.
27. A simple method of introducing sparsity (e.g., graphical models)
Write the joint density exactly as a telescoping product of conditionals,
p(w) = N(w | 0, K_θ) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) · · · p(w_n | w_1, ..., w_{n-1}),
then introduce sparsity by striking variables out of the conditioning sets; in the slide's seven-variable example, p(w_5 | w_1, w_2, w_3, w_4) becomes p(w_5 | w_3, w_4) and p(w_7 | w_1, ..., w_6) becomes p(w_7 | w_1, w_4, w_6).
We then need to solve n - 1 linear systems of size at most m × m, where m is the number of neighbors in the conditioning set.
28. Sparse likelihood approximations (Vecchia 1988; Stein et al. 2004)
With w(ℓ) ~ GP(0, K_θ(·,·)), write the joint density p(w) as:
N(w | 0, K_θ) = ∏_{i=1}^{n} p(w(ℓ_i) | w_{H(ℓ_i)}) ≈ ∏_{i=1}^{n} p(w(ℓ_i) | w_{N(ℓ_i)}) = N(w | 0, K̃_θ),
where H(ℓ_i) = {ℓ_1, ..., ℓ_{i-1}} is the history under some ordering and N(ℓ_i) ⊆ H(ℓ_i).
Shrinkage: choose N(ℓ_i) as the set of m nearest neighbors among H(ℓ_i). Theory: the "screening" effect of kriging.
K̃_θ^{-1} depends on K_θ, but is sparser, with at most nm^2 non-zero entries.
The extension to a well-defined GP (Datta et al., JASA, 2016) is called the Nearest-Neighbor Gaussian Process (NNGP).
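A minimal R sketch of this construction follows (an illustrative reimplementation, not the NNGP software): each w(ℓ_i) is conditioned on at most m nearest neighbors among the previously ordered locations, so only systems of size at most m × m are ever solved.

## Sketch of the Vecchia-type log-likelihood: condition each w_i on at
## most m nearest neighbors among the previously ordered locations.
vecchia.loglik <- function(w, coords, sigma2, phi, m) {
  n <- length(w)
  D <- as.matrix(dist(coords))
  K <- function(i, j) sigma2 * exp(-phi * D[i, j, drop = FALSE])
  ll <- dnorm(w[1], 0, sqrt(sigma2), log = TRUE)
  for (i in 2:n) {
    hist <- 1:(i - 1)                                # history H(l_i)
    N <- hist[order(D[i, hist])][1:min(m, i - 1)]    # m nearest neighbors N(l_i)
    Kinv <- solve(K(N, N))                           # at most m x m solve
    c.iN <- K(i, N)                                  # 1 x |N| cross-covariance
    mu <- drop(c.iN %*% Kinv %*% w[N])               # conditional mean
    v <- drop(sigma2 - c.iN %*% Kinv %*% t(c.iN))    # conditional variance
    ll <- ll + dnorm(w[i], mu, sqrt(v), log = TRUE)
  }
  ll
}

## Toy use; for modest m this is close to the exact MVN log-density
set.seed(1)
n <- 300
coords <- cbind(runif(n), runif(n))
w <- as.numeric(t(chol(exp(-3 * as.matrix(dist(coords))))) %*% rnorm(n))
vecchia.loglik(w, coords, sigma2 = 1, phi = 3, m = 10)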
29. [Figure: (a) true w, (b) full GP, (c) PPGP with 64 knots, (d) NNGP with m = 10, (e) NNGP with m = 20.]
30. [Figure: choice of m in NNGP models. Out-of-sample root mean squared prediction error (RMSPE) and mean width of the 95% posterior predictive credible intervals for m = 1, ..., 25 in the univariate synthetic data analysis, with the full GP RMSPE and mean 95% CI width shown for reference.]
31. [Figure: wall time required for one MCMC iteration by number of locations n, with m = 10 nearest neighbors (both axes on the log scale).]
32. Concluding remarks: storage and computation
Algorithms: Gibbs, random walk Metropolis (RWM), HMC, VB, INLA; NNGP with HMC especially promising
A model-based solution for spatial "BIG DATA"
Never needs to store the n × n distance matrix; instead store n small m × m matrices, where m is the number of nearest neighbors considered and m << n, e.g., m ≈ 15
Total flop count per iteration is O(nm^3), i.e., linear in n
Scalable to massive datasets because m is small; you never need more than a few neighbors
Compare with reduced-rank models: O(nm^3) << O(nr^2)
New R package spNNGP (on CRAN: https://cran.r-project.org/web/packages/spNNGP)
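For orientation, fitting a univariate NNGP with spNNGP might look like the sketch below. The data are simulated and the starting, tuning, and prior values are placeholders; the argument names follow the package's documented examples but should be verified against ?spNNGP for the installed version.

## Hedged sketch of a univariate NNGP fit with the spNNGP package.
library(spNNGP)

set.seed(1)
n <- 2000
coords <- cbind(runif(n), runif(n))
x <- rnorm(n)
w <- as.numeric(t(chol(exp(-3 * as.matrix(dist(coords))))) %*% rnorm(n))
y <- 1 + 0.5 * x + w + rnorm(n, sd = 0.25)

## Illustrative starting values, Metropolis tuning, and priors
starting <- list(phi = 3, sigma.sq = 1, tau.sq = 0.1)
tuning   <- list(phi = 0.2, sigma.sq = 0.2, tau.sq = 0.2)
priors   <- list(phi.Unif = c(3 / 1, 3 / 0.01),
                 sigma.sq.IG = c(2, 1), tau.sq.IG = c(2, 0.1))

fit <- spNNGP(y ~ x, coords = coords, n.neighbors = 15,
              starting = starting, tuning = tuning, priors = priors,
              method = "latent", cov.model = "exponential",
              n.samples = 5000)
summary(fit)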
33. Tanana Valley initial run results
The initial analysis fit the multivariate spatial NNGP model (with misalignment between the inventory plot and LiDAR outcomes).
Model fitting and prediction algorithms were written in C with heavy use of OpenMP for parallelization.
The outcome vector included AGB, BA, TPH, FC, and P95:
AGB, BA, and TPH measured on 1,461 forest inventory plots
FC and P95 measured on 5 million LiDAR pixels
We considered m = 15 neighbors for the NNGPs.
Posterior inference was based on 25,000 post-burn-in MCMC samples.
The full GP covariance matrix K_θ would be 5,001,461 × 5,001,461!
The NNGP run time was ~12 hours (on an 18-core Intel machine); prediction for the TIU takes ~5 days to deliver pixel-level posterior distributions.
37. Prototype for FIA/NASA TIU data products user interface
http://www.globalfiredata.org/temp/tanana.html
39. Thank you!
Datta, A., S. Banerjee, A.O. Finley, and A.E. Gelfand. (2016) Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111:800-812.
Datta, A., S. Banerjee, A.O. Finley, N.A.S. Hamm, and M. Schaap. (2016) Non-separable dynamic nearest neighbor Gaussian process models for large spatio-temporal data with application to particulate matter analysis. Annals of Applied Statistics, 10:1286-1316.
Datta, A., S. Banerjee, A.O. Finley, and A.E. Gelfand. (2016) On nearest-neighbor Gaussian process models for massive spatial data. WIREs Computational Statistics, 8:162-171.
Finley, A.O., S. Banerjee, Y. Zhou, and B.D. Cook. (2017) Joint hierarchical models for sparsely sampled high-dimensional LiDAR and forest variables. Remote Sensing of Environment, 149-161.
Finley, A.O., A. Datta, B.C. Cook, D.C. Morton, H.E. Andersen, and S. Banerjee. (2017) Applying nearest neighbor Gaussian processes to massive spatial data sets: Forest canopy height prediction across Tanana Valley Alaska. https://arxiv.org/abs/1702.00434
Finley, A.O., S. Banerjee, and A.E. Gelfand. (2015) spBayes for large univariate and multivariate point-referenced spatio-temporal data models. Journal of Statistical Software, 63:1-28.
Heaton, M.J., A. Datta, A.O. Finley, R. Furrer, R. Guhaniyogi, F. Gerber, R.B. Gramacy, D. Hammerling, M. Katzfuss, F. Lindgren, D.W. Nychka, F. Sun, and A. Zammit-Mangion. (2017) Methods for analyzing large spatial data: A review and comparison. https://arxiv.org/abs/1710.05013
Other references provided upon request.
40. Concluding remarks: comparisons
Are low-rank spatial models well and truly beaten?
They certainly do not seem to scale as nicely as the NNGP
They have somewhat greater theoretical tractability (e.g., Bayesian asymptotics)
They can be used to flexibly model smoothness
They can be constructed for other processes, e.g., the spatial Dirichlet predictive process
Compare with scalable multi-resolution frameworks (Katzfuss 2016) and highly scalable meta-kriging frameworks (Guhaniyogi 2016)
Future work: high-dimensional multivariate spatial-temporal variable selection