This document describes performing extreme value analysis on daily precipitation data from Fort Collins, Colorado, from 1900 to 1999 using R. It first reads in, plots, and summarizes the data, highlighting seasonal variation. It then applies two extreme value analysis approaches: the block maxima approach, which fits a generalized extreme value (GEV) distribution to summer maximum daily precipitation values, and the peak over threshold approach, which fits a generalized Pareto distribution (GPD) to values exceeding a threshold. In each case it estimates return levels such as the 100-year event and computes confidence intervals.
CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (EVA) Using R - Whitney Huang, Oct 23, 2017
Performing Extreme Value Analysis (EVA) using R
Whitney Huang
October 23, 2017
Data: Fort Collins daily precipitation
Read the data
Plot the data
Summarize the data
EVA: Block Maxima Approach
Step I: Determine the block size and compute maxima for blocks
Step II: Fit a GEV to the maxima and assess the fit
Step III: Perform inference for return levels
EVA: Peak Over Threshold Approach
Step I: Pick a threshold and extract the threshold exceedances
Step II: Fit a GPD to threshold excesses and assess the fit
Step III: Perform inference for return levels
Data: Fort Collins daily precipitation
We will analyze the daily precipitation amounts (inches) in Fort Collins, Colorado from 1900 to 1999. (Source: Colorado Climate Center, Colorado State University, http://ulysses.atmos.colostate.edu.)
Read the data
# Install the packages for this demo
#install.packages(c("extRemes", "scales", "dplyr", "ismev"))
library(extRemes) # Load the extRemes package for performing EVA
data(Fort) # Load the precipitation data (a built-in data set in extRemes)
head(Fort) # Look at the first few observations
## obs tobs month day year Prec
## 1 1 1 1 1 1900 0
## 2 2 2 1 2 1900 0
## 3 3 3 1 3 1900 0
## 4 4 4 1 4 1900 0
## 5 5 5 1 5 1900 0
## 6 6 6 1 6 1900 0
tail(Fort) # Look at the last few observations
## obs tobs month day year Prec
## 36519 36519 360 12 26 1999 0
## 36520 36520 361 12 27 1999 0
## 36521 36521 362 12 28 1999 0
## 36522 36522 363 12 29 1999 0
## 36523 36523 364 12 30 1999 0
## 36524 36524 365 12 31 1999 0
str(Fort) # Look at the structure of the data set
## 'data.frame': 36524 obs. of 6 variables:
## $ obs : num 1 2 3 4 5 6 7 8 9 10 ...
## $ tobs : num 1 2 3 4 5 6 7 8 9 10 ...
## $ month: num 1 1 1 1 1 1 1 1 1 1 ...
## $ day : num 1 2 3 4 5 6 7 8 9 10 ...
## $ year : num 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 ...
## $ Prec : num 0 0 0 0 0 0 0 0 0 0 ...
Plot the data
# Set up the time format
days_year <- table(Fort$year)
time_year <- Fort$tobs / rep.int(days_year, times = days_year)
time <- Fort$year + time_year
# Plot the spatial location of Fort Collins and the daily precip time series
par(mfrow = c(2, 1))
library(maps) # Load the package for map drawing
map("state", col = "gray", mar = rep(0, 4))
# Fort Collins, Lat Long Coordinates: 40.5853° N, 105.0844° W
points(-105.0844, 40.5853, pch = "*", col = "red", cex = 1.5)
par(las = 1, mar = c(5.1, 4.1, 1.1, 2.1))
plot(time, Fort$Prec, type = "h",
xlab = "Time (Year)",
ylab = "Daily precipitation (in)")
# Zoom in to look at seasonal variation
par(mfrow = c(1, 2), mar = c(5.1, 4.1, 4.1, 0.6))
id <- which(Fort$year %in% 1999) # Look at year 1999
par(las = 1)
# Plot the 1999 daily precip time series
plot(1:365, Fort$Prec[id], type = "h",
xlab = " ",
ylab = "Daily precipitation (in)",
xaxt = "n", ylim = c(0, max(Fort$Prec)),
main = "1999")
par(las = 3)
axis(1, at = seq(15, 345, 30), labels = month.abb)
# Indices of one entry per leap year (24 in 1900-1999), dropped below so each year has 365 values
Leaf_Day <- seq(1520, 1520 + 1460 * 23, 1460)
par(las = 1)
library(scales) # Load the scales package to modify color transparency
# Plot the daily precip as a function of the day of the year
plot(rep(1:365, 100), Fort$Prec[-Leaf_Day], pch = 1,
col = alpha("blue", 0.25), cex = 0.5,
xaxt = "n", xlab = " ", ylab = " ", main = "1900 ~ 1999",
ylim = c(0, max(Fort$Prec)))
par(las = 3)
axis(1, at = seq(15, 345, 30), labels = month.abb)
Summarize the data
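The subsets prec_summer and prec_winter used below, and the max_precip value used in later plots, were defined on a slide that is missing from this extraction. A minimal reconstruction, assuming summer means June-August (92 days per year, consistent with the later plots) and winter means December-February:
# Reconstructed (assumed) definitions from the omitted slide
summer <- Fort$month %in% 6:8 # logical index for summer days (June-August)
winter <- Fort$month %in% c(12, 1, 2) # logical index for winter days (December-February)
prec_summer <- Fort$Prec[summer] # summer daily precipitation (100 years x 92 days = 9200 values)
prec_winter <- Fort$Prec[winter] # winter daily precipitation
max_precip <- max(Fort$Prec) # largest daily precipitation on record, used for plot limits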
# Six number summary for summer and winter daily precip
summary(prec_summer)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05289 0.01000 4.63000
summary(prec_winter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01477 0.00000 1.32000
# Variation
sd(prec_summer)
## [1] 0.2004784
sd(prec_winter)
## [1] 0.0641466
IQR(prec_summer)
## [1] 0.01
IQR(prec_winter)
## [1] 0
# Chance of rain
length(prec_summer[prec_summer > 0]) / length(prec_summer)
## [1] 0.2827174
length(prec_winter[prec_winter > 0]) / length(prec_winter)
## [1] 0.1476064
EVA: Block Maxima Approach
We are going to conduct an extreme value analysis using the extRemes package, developed and maintained by Eric Gilleland. In the previous lecture we learned the block maxima and threshold exceedance methods for doing EVA. Let's see how they work in R.
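For reference (not in the original slides), the GEV distribution fitted to the block maxima has distribution function
$$G(z) = \exp\left\{-\left[1 + \xi\left(\frac{z - \mu}{\sigma}\right)\right]_{+}^{-1/\xi}\right\},$$
with location $\mu$, scale $\sigma > 0$, and shape $\xi$; the Gumbel case $\xi = 0$ is taken as the limit.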
Step I: Determine the block size and compute maxima for blocks
library(dplyr)
grouped_summer <- group_by(Fort[summer, ], year)
# Extracting summer maxima and the timings when the maxima occur
summer_max <- summarise(grouped_summer, prec = max(Prec), t = which.max(Prec))
summer_max
## # A tibble: 100 × 3
## year prec t
## <dbl> <dbl> <int>
## 1 1900 0.51 13
## 2 1901 1.46 15
## 3 1902 1.02 28
## 4 1903 0.53 47
## 5 1904 1.10 35
## 6 1905 1.11 66
## 7 1906 0.64 23
## 8 1907 1.21 56
## 9 1908 1.93 60
## 10 1909 1.11 28
## # ... with 90 more rows
Let’s plot the summer maxima
# Setting up the figure configuration
old.par <- par(no.readonly = TRUE)
mar.default <- par("mar")
mar.left <- mar.default
mar.right <- mar.default
mar.left[2] <- 0
mar.right[4] <- 0
# Time series plot
par(fig = c(0.2, 1, 0, 1), mar = mar.left)
plot(1900 + 1:9200 / 92, prec_summer,
xlab = "Year", ylab = "",
main = "Summer Maximum n Fort Collins",
type = "h", pch = 19, cex = 0.5, col = "lightblue",
ylim = c(0, max_precip), yaxt = "n")
par(las = 2)
axis(4, at = 0:5)
par(las = 0)
mtext("Precipitation (in)", side = 2, line = 4)
abline(v = 1900:2000, col = "gray", lty = 2)
points(1900:1999 + summer_max$t / 90, summer_max$prec,
pch = 16, col = "blue", cex = 0.5)
# Histogram
hs <- hist(summer_max$prec,
breaks = seq(0, max_precip, length.out = 40),
plot = FALSE)
par(fig = c(0, 0.2, 0, 1.0), mar = mar.right, new = T)
plot (NA, type = 'n', axes = FALSE, yaxt = 'n',
col = rgb(0, 0, 0.5, alpha = 0.5),
xlab = "Density", ylab = NA, main = NA,
xlim = c(-max(hs$density), 0),
ylim = c(0, max_precip))
axis(1, at = c(-0.8, -0.4, 0), c(0.8, 0.4, 0), las = 2)
arrows(rep(0, length(hs$breaks[-40])), hs$breaks[-40],
-hs$density, hs$breaks[-40], col = "blue",
length = 0, angle = 0, lwd = 1)
arrows(rep(0, length(hs$breaks[-1])), hs$breaks[-1],
-hs$density, hs$breaks[-1], col = "blue",
length = 0, angle = 0, lwd = 1)
arrows(-hs$density, hs$breaks[-40], -hs$density,
hs$breaks[-1], col = "blue", angle = 0,
length = 0)
mle <- fevd(summer_max$prec)$results$par
xg <- seq(0, max_precip, length.out = 100)
library(ismev)
lines(-gev.dens(mle, xg), xg, col = "red")
par(old.par)
Step II: Fit a GEV to the maxima and assess the fit
# Fit a GEV to summer maximum daily precip using MLE
gevfit <- fevd(summer_max$prec)
# Print the results
gevfit
##
## fevd(x = summer_max$prec)
##
## [1] "Estimation Method used: MLE"
##
##
## Negative Log-Likelihood Value: 100.5622
##
##
## Estimated parameters:
## location scale shape
## 0.8262256 0.4974066 0.2158052
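Step II also calls for assessing the fit. The diagnostic plots (which appeared on a slide omitted from this extraction) can be reproduced with the plot method for fevd objects; a minimal sketch:
# Diagnostic plots for the GEV fit (quantile-quantile, density, and return-level plots)
plot(gevfit)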
Step III: Perform inference for return levels
Suppose we are interested in estimating the 100-year return level, i.e., the level that the summer maximum daily precipitation exceeds on average once every 100 years.
RL100 <- return.level(gevfit, return.period = 100) # Estimate of the 100-year event
RL100
## fevd(x = summer_max$prec)
## return.level.fevd.mle(x = gevfit, return.period = 100)
##
## GEV model fitted to summer_max$prec
## Data are assumed to be stationary
## [1] "Return Levels for period units in years"
## 100-year level
## 4.741325
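As a sanity check (not part of the original slides), the 100-year return level is just the 1 - 1/100 quantile of the fitted GEV, which can be computed directly with qevd:
# Compute the 100-year level as the 0.99 quantile of the fitted GEV;
# this should reproduce the return.level() value above
p <- gevfit$results$par
qevd(1 - 1/100, loc = p["location"], scale = p["scale"], shape = p["shape"])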
# Quantify the estimate uncertainty
## Delta method
CI_delta <- ci(gevfit, return.period = 100, verbose = T)
##
## Preparing to calculate 95 % CI for 100-year return level
##
## Model is fixed
##
## Using Normal Approximation Method.
CI_delta
## fevd(x = summer_max$prec)
##
## [1] "Normal Approx."
##
## [1] "100-year return level: 4.741"
##
## [1] "95% Confidence Interval: (2.9471, 6.5356)"
## Profile likelihood method
CI_prof <- ci(gevfit, method = "proflik", xrange = c(2.5, 8),
return.period = 100, verbose = F)
CI_prof
## fevd(x = summer_max$prec)
##
## [1] "Profile Likelihood"
##
## [1] "100-year return level: 4.741"
##
## [1] "95% Confidence Interval: (3.5081, 7.5811)"
hist(summer_max$prec, breaks = seq(0, max_precip, length.out = 35),
col = alpha("lightblue", 0.2), border = "gray",
xlim = c(0, 8), prob = T, ylim = c(0, 1.2),
xlab = "summer max (in)",
main = "95% CI for 100-yr RL")
xg <- seq(0, 8, len = 1000)
mle <- gevfit$results$par
lines(xg, gev.dens(mle, xg), lwd = 1.5)
#for (i in 1:3) abline(v = CI_delta[i], lty = 2, col = "blue")
for (i in c(1, 3)) abline(v = CI_prof[i], lty = 2, col = "red")
abline(v = RL100, lwd = 1.5, lty = 2)
#legend("topleft", legend = c("Delta CI", "Prof CI"),
#col = c("blue", "red"), lty = c(2, 3))
EVA: Peak Over Threshold Approach
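For reference (not in the original slides), excesses over a high threshold $u$ are modeled with the generalized Pareto distribution (GPD), whose distribution function is
$$H(y) = 1 - \left(1 + \frac{\xi y}{\tilde{\sigma}}\right)_{+}^{-1/\xi}, \qquad y = x - u > 0,$$
with scale $\tilde{\sigma} > 0$ and shape $\xi$.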
Step I: Pick a threshold and extract the threshold exceedances
old.par <- par(no.readonly = TRUE)
mar.default <- par('mar')
mar.left <- mar.default
mar.right <- mar.default
mar.left[2] <- 0
mar.right[4] <- 0
# Time series plot
par(fig = c(0.2, 1, 0, 1), mar = mar.left)
plot(1900 + 1:9200 / 92, prec_summer, type = "h", col = "lightblue",
xlab = "Year", ylab = "Daily Precip (inches)", yaxt = "n")
#Threshold exceedances
thres <- 0.4
ex <- prec_summer[prec_summer >= thres]
length(ex)
## [1] 344
#Extract the timing of POT
ex_t <- which(prec_summer >= thres)
abline(h = thres, col = "blue", lty = 2)
points(1900 + ex_t / 92, ex, col = alpha("blue", 0.5), pch = 16,
cex = 0.75)
par(las = 2)
axis(4, at = 0:5)
par(las = 0)
mtext("Precipitation (in)", side = 2, line = 5)
grid()
hs <- hist(ex, seq(thres, max_precip, len = 50), plot = FALSE)
par(fig = c(0, 0.2, 0, 1.0), mar = mar.right, new = T)
plot (NA, type = 'n', axes = FALSE, yaxt = 'n',
col = rgb(0,0,0.5, alpha = 0.5),
xlab = "Density", ylab = NA, main = NA,
xlim = c(-max(hs$density), 0),
ylim = c(0, max_precip))
axis(1, at = c(-3, -2, -1, 0), c(3, 2, 1, 0), las = 2)
#abline(h = 21, col = "red", lty = 5)
arrows(rep(0, length(hs$breaks[-50])), hs$breaks[-50],
-hs$density, hs$breaks[-50], col = "blue",
length = 0, angle = 0, lwd = 1)
arrows(rep(0, length(hs$breaks[-1])), hs$breaks[-1],
-hs$density, hs$breaks[-1], col = "blue",
length = 0, angle = 0, lwd = 1)
arrows(-hs$density, hs$breaks[-50], -hs$density,
hs$breaks[-1], col = "blue", angle = 0,
length = 0)
mle <- fevd(prec_summer, threshold = thres, type = "GP")$results$par
xg <- seq(thres, max_precip, length.out = 100)
lines(-gpd.dens(mle, thres, xg), xg, col = "red")
par(old.par)
How to choose the “right” threshold?
# Mean residual life plot
mrlplot(prec_summer, main = "Mean Residual Life", xlab = "Threshold (in)")
# I choose 0.4 as the threshold but note that the "straightness"
# is difficult to assess
abline(v = 0.4, col = "blue", lty = 2)
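A complementary diagnostic (not shown in the original slides) is the threshold-stability plot, which refits the GPD over a range of candidate thresholds and checks that the parameter estimates are roughly constant above the chosen value. extRemes provides threshrange.plot for this; a sketch, with an assumed candidate range of 0.2 to 1 inch:
# Refit the GPD at 20 thresholds between 0.2 and 1 inch and plot
# the estimated parameters against the threshold
threshrange.plot(prec_summer, r = c(0.2, 1), type = "GP", nint = 20)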
Step II: Fit a GPD to threshold excesses and assess the fit
# Fit a GPD to threshold exceedances using MLE
gpdfit1 <- fevd(prec_summer, threshold = thres, type = "GP")
# Print the results
gpdfit1
##
## fevd(x = prec_summer, threshold = thres, type = "GP")
##
## [1] "Estimation Method used: MLE"
##
##
## Negative Log-Likelihood Value: 46.14484
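As with the GEV fit, the diagnostic plots for assessing the GPD fit (shown on a slide omitted from this extraction) can be reproduced with the plot method for fevd objects; a minimal sketch:
# Diagnostic plots for the GPD fit (quantile-quantile, density, and return-level plots)
plot(gpdfit1)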
Step III: Perform inference for return levels
Again, we are interested in estimating the 100-year return level.
# Adjust the return period: only the 92 summer days of each year are modeled,
# so the 100-year event corresponds to a period of 100 * 92 / 365.25 (about 25.19)
# in the units used by fevd
RL100 <- return.level(gpdfit1, return.period = 100 * 92 / 365.25)
RL100
## fevd(x = prec_summer, threshold = thres, type = "GP")
## return.level.fevd.mle(x = gpdfit1, return.period = 100 * 92/365.25)
##
## GP model fitted to prec_summer
## Data are assumed to be stationary
## [1] "Return Levels for period units in years"
## 25.1882272416153-year level
## 5.656769
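For reference (not in the original slides), the POT return level combines the GPD fit with the exceedance rate: with threshold $u$, exceedance probability $\zeta_u$ (here 344/9200), and $m$ the number of daily observations in the return period (here 100 years of 92 summer days, i.e. $m = 9200$), the return level for $\xi \neq 0$ is
$$x_m = u + \frac{\sigma}{\xi}\left[\left(m\,\zeta_u\right)^{\xi} - 1\right].$$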
CI_delta <- ci(gpdfit1, return.period = 100 * 92 / 365.25,
verbose = F)
CI_delta
## fevd(x = prec_summer, threshold = thres, type = "GP")
##
## [1] "Normal Approx."
##
## [1] "25.1882272416153-year return level: 5.657"
##
## [1] "95% Confidence Interval: (3.2246, 8.089)"
CI_prof <- ci(gpdfit1, method = "proflik", xrange = c(3, 10),
return.period = 100 * 92 / 365.25, verbose = F)
CI_prof
## fevd(x = prec_summer, threshold = thres, type = "GP")
##
## [1] "Profile Likelihood"
##
## [1] "25.1882272416153-year return level: 5.657"
##
## [1] "95% Confidence Interval: (3.9572, 9.4964)"
hist(ex, 40, col = alpha("lightblue", 0.2), border = "gray",
xlim = c(thres, 10), prob = T, ylim = c(0, 4),
xlab = "Threshold excess (in)",
main = "95% CI for 100-yr RL")
xg <- seq(thres, 10, len = 1000)
mle <- gpdfit1$results$par
lines(xg, gpd.dens(mle, thres, xg), lwd = 1.5)
#for (i in c(1, 3)) abline(v = CI_delta[i], lty = 2, col = "blue")
for (i in c(1,3)) abline(v = CI_prof[i], lty = 2, col = "red")
abline(v = RL100, lwd = 1.5, lty = 2)
#legend("topleft", legend = c("Delta CI", "Prof CI"),
#col = c("blue", "red"), lty = c(2, 3))