1. What is R?
2. Where is R available?
3. Will R run on all machines?
4. Why should I switch to R?
5. What can R do?
6. How easy is it to learn R?
7. How large is the R community?
8. Does R have a support system?
9. Is there a beginner's guide to R?
http://www.r-project.org
What is R?
• R is a language and environment for statistical computing and graphics.
• Developed by Robert Gentleman and Ross Ihaka.
• It can be freely downloaded from
http://www.r-project.org
• R can be considered as an implementation of S. There are some important
differences, but much code written for S runs unaltered under R.
• R is a GNU project.
Why I Must Write GNU
I consider that the golden rule requires that if I like a program I must share
it with other people who like it. Software sellers want to divide the users and
conquer them, making each user agree not to share with others. I refuse to
break solidarity with other users in this way. I cannot in good conscience
sign a nondisclosure agreement or a software license agreement. For years I
worked within the Artificial Intelligence Lab to resist such tendencies and
other inhospitalities, but eventually they had gone too far: I could not
remain in an institution where such things are done for me against my will.
So that I can continue to use computers without dishonor, I have decided to
put together a sufficient body of free software so that I will be able to get
along without any software that is not free. I have resigned from the AI lab
to deny MIT any legal excuse to prevent me from giving GNU away.
Richard Stallman, Founder, Free Software Foundation.
What can R do?
R is called a “programming environment”.
R includes
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data
analysis,
• graphical facilities for data analysis and display either on-screen or on
hardcopy, and
• a well-developed, simple and effective programming language which
includes conditionals, loops, user-defined recursive functions and input
and output facilities.
Who currently uses R?
SAS is the most common statistics package in general but R or S is
most popular with researchers in Statistics. A look at common
Statistical journals confirms this popularity. R is also popular for
quantitative applications in Finance.
Why should I switch to R?
1. It is free. (in what sense?)
2. It has an extensive support system in the form of many online
manuals, tutorials, discussion and help forums.
3. In some ways it is better than its closest competitor, S+ (plots,
memory management).
4. In addition to the base packages there is a very exhaustive list of
extension and user contributed packages for specialized tasks.
Getting help in R
• Within R:
– The ? command can be used to get help on a specific command
within R
• ? graphics typed at the R prompt will provide a description of R
graphics.
• demo(graphics) will demonstrate some examples.
• Another useful command is example. Type ? example at the R
prompt for a description.
• Documentation
– Manuals, FAQs, reference cards, tutorials and news about recent
developments are available at http://www.r-project.org/other-docs.html.
– CRAN Task Views for specialized applications.
• Online help
– The main R mailing lists are R-help, R-devel and Bioconductor
– The site for R-help is https://www.stat.math.ethz.ch/pipermail/r-help/
CRAN Task View: Computational Econometrics
Maintainer: Achim Zeileis
http://www.maths.bris.ac.uk/R/src/contrib/Views/Econometrics.html
CRAN Task View: Statistical Genetics
Maintainer: Giovanni Montana
http://www.maths.bris.ac.uk/R/src/contrib/Views/Genetics.html
CRAN Task View: Bayesian Inference
Maintainer: Jong Hee Park
http://cran.r-project.org/src/contrib/Views/Bayesian.html
Type CRAN Task View into a search engine such as Google for more
Posting Guide: How to ask good
questions that prompt useful answers
• R-help is intended to be comprehensible to people who want to use R
to solve problems but who are not necessarily interested in or
knowledgeable about programming.
• R-devel is intended for questions and discussion about code
development in R. Questions likely to prompt discussion unintelligible
to non-programmers should go to R-devel.
• Bioconductor is for announcements about the development of the
Bioconductor package, availability of new code, questions and
answers about problems and solutions using Bioconductor, etc.
On Wed, 6 Dec 2000, Hisaji ONO wrote:
> Hello, the R people.
> I look for robust regression in R. This method is available in S, its name is rreg.
There's better robust regression in the VR bundle of packages.
library(MASS)
help(rlm)
library(lqs)
help(lqs)
-thomas
Thomas Lumley
Assistant Professor, Biostatistics
University of Washington, Seattle
Subject: Re: [R] Is robust regression available in R.
From: Thomas Lumley (thomas@biostat.washington.edu)
Date: Wed 06 Dec 2000 - 04:12:58 EST
Available Bundles and Packages
aaMI          Mutual information for protein sequence alignments
abind         Combine multi-dimensional arrays
accuracy      Tools for testing and improving accuracy of statistical results.
acepack       ace() and avas() for selecting regression transformations
actuar        Actuarial functions
adapt         adapt -- multidimensional numerical integration
ade4          Analysis of Environmental Data : Exploratory and Euclidean method
adehabitat    Analysis of habitat selection by animals
adlift        An adaptive lifting scheme algorithm
agce          analysis of growth curve experiments
akima         Interpolation of irregularly spaced data
AlgDesign     AlgDesign
alr3          Methods and data to accompany Applied Linear Regression 3rd editi
amap          Another Multidimensional Analysis Package
AMORE         A MORE flexible neural network package
AnalyzeFMRI   Functions for analysis of fMRI datasets stored in the ANALYZE for
aod           Analysis of Overdispersed Data
ape           Analyses of Phylogenetics and Evolution
apTreeshape   Analyses of Phylogenetic Treeshape
ArDec         Time series autoregressive decomposition
arules        Mining Association Rules and Frequent Itemsets
Why should I switch to R?
• Parametric inference: t.test, power.t.test, chisq.test, bartlett.test, logLik, extractAIC.
• Design of experiments: aov, TukeyHSD, plot.design, conf.design, AlgDesign.
• Sample surveys: survey, pps.
• Linear models and regression: lm, lme, aov, gls, dfbeta, dwtest, shapiro.test.
• Multivariate analysis: mvtnorm, mnormt, cluster, manova, mvnormtest, prcomp, cancor.
• Statistical genomics: Bioconductor, genetics, Geneland, hapsim, PHYLOGR, qtl, bqtl.
• Bayesian analysis: bayesm, bayesSurv, MCMCpack, BMA, lmm, coda.
• Resampling: ecdf, jackknife, empinf, oneboot, bootci, jack.after.boot.
• Survival analysis: survreg, survdiff, coxph.
• Stochastic processes and time series: tseries, arima, garch, pacf, ts.plot, adf.test.
• Advanced data analysis: glm, lme, nlme, longitudinal, mitools.
• Statistical quality control: linprog, quadprog, lp.transport, qcc.
• Non-parametric inference: wilcox.test, friedman.test, rlm, quantreg.
In addition to graphical and programming tools.
And beyond…
• Exploratory Data Analysis : eda
• Financial data analysis: fPortfolio, financial, fMultivar
• Spatial analysis: spatstat, geoR, SemiPar
• Smoothing: SemiPar, splines, mgcv
• Econometric analysis: gear, Ecdat
• R also has many built-in datasets, a list of which may be
viewed by typing data() at the prompt.
Are there any downsides?
• R is not menu driven. Commands to be executed
must be typed in at the prompt.
• This is not a complete disadvantage because it
prevents the cookbook approach to statistics.
• However, it means that the user will need to invest
some initial time to become familiar with R syntax.
• Sample code is available for each command and is
helpful to familiarize those new to programming
eg try typing ? lm at the prompt and scroll to the
bottom.
Tutorial structure
1. Tutorial 1: R Graphics (and basics)
2. Tutorial 2 : Regression analysis
3. Tutorial 3 : Programming in R
4. Tutorial 4 : R libraries
Tutorial 1 : R Graphics
Getting started
• To open an R session click on the ‘R’ shortcut on the desktop. This will
open a commands window with the R prompt ‘>’. All commands have
to be typed in at the prompt.
• R code in the presentation is indicated in blue. You can cut and paste
this into the commands window.
• Use the arrow keys to recall previous commands.
• You can scroll up the command window to view earlier commands and
output.
Downloading R
• The base R distribution can be downloaded from CRAN (the Comprehensive
R Archive Network), which is linked from www.r-project.org.
• You will need to click on one of the mirror sites.
• Additional packages can either be downloaded from this site or from
within R.
Reading in data
Assignment
• The most straightforward way to store a list of numbers is through an
assignment using the c command.
• As an example, we can create a new variable called newvar which will contain
the numbers 3, 5, 7, and 9:
newvar <- c(3,5,7,9)
• When you enter this command you should not see any output except a new
command line.
• To see what numbers are included in newvar type newvar at the prompt and
press the enter key:
• If you wish to work with one of the numbers you can get access to it
using the variable and then square brackets indicating which number:
eg try typing newvar[2]
newvar[1:2]
newvar[-2]
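A short sketch of what the three indexing expressions above return, using the same newvar vector. The names second, first2 and drop2 are introduced here only for illustration.

```r
newvar <- c(3, 5, 7, 9)

second <- newvar[2]     # a single position: 5
first2 <- newvar[1:2]   # a range of positions: 3 5
drop2  <- newvar[-2]    # a negative index drops that position: 3 7 9

print(second)
print(first2)
print(drop2)
```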
Reading a CSV file
• We shall read a very short data file called simple.csv which has six
rows of data on three variables labeled "trial," "mass," and "velocity."
• The command to read the data file is read.csv.
• The following command will read in the data and assign it to a
variable called newdata
newdata <- read.csv(file="simple.csv", header=TRUE, sep=",")
• To view the data type newdata at the prompt.
• Try typing summary(newdata)
• Try typing summary(newdata[(newdata$trial=="A"),])
• Try typing table(newdata$trial)
• You can now access each individual column using a "$" to
separate the two names eg try typing newdata$mass
• If you are not sure what columns are contained in the variable
type names(newdata) at the prompt.
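If simple.csv is not at hand, the steps on this slide can be tried on a stand-in file. The sketch below writes a temporary CSV with the same three column names; the six rows of values are invented for illustration and are not the tutorial's data.

```r
# Hypothetical stand-in for simple.csv: six rows, three columns.
# All values below are invented for illustration only.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("trial,mass,velocity",
             "A,10,12",
             "A,11,14",
             "B,5,8",
             "B,6,10",
             "A,10.5,13",
             "B,7,11"), csv_path)

newdata <- read.csv(file = csv_path, header = TRUE, sep = ",")

names(newdata)                             # column names
table(newdata$trial)                       # number of rows per trial
summary(newdata[newdata$trial == "A", ])   # summaries for trial A only
mean_mass_A <- mean(newdata$mass[newdata$trial == "A"])
```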
Reading in data
• There are many ways to read data using R. We have only given two
examples: direct assignment and reading csv files.
• Other commands include read.csv2, read.delim, read.fwf and scan.
• To get help on these commands you can type ? read.fwf at the
prompt etc.
• It is also possible to import data of other formats such as SAS,
SPSS etc into R.
Plotting data
• To see some of the possibilities that R offers, enter
demo(graphics)
• Press the Enter key to move to each new graph.
• Note that the code required to produce each plot is being
simultaneously displayed in the command window.
The plot function
• In order to illustrate R's graphical functionalities, let us consider
a simple example of a bivariate graph of 10 pairs of random
variables. These values were generated with:
x <- rnorm(10)
x <- sort(x)
y <- rnorm(10)
• To get a scatter-plot of x against y, type
plot(x, y)
and the graph will be plotted on the active graphical device.
• This plot uses default axis labels, limits, symbols etc.
• We can customize plots by passing options to the plot
commands. Try the following variations:
plot(x,y, type="l")
plot(x,y, type= "l", lty=2)
plot(x,y, type= "l", lwd=3)
plot(x,y, type= "l", col="red")
While this needs some getting accustomed to, it does
make the point that plots are subjective and it is
necessary for the user to make intelligent choices of
plotting parameters. For example try the following
commands
par(mfrow=c(1,2))
plot(x,x^2,type="l",ylim=c(0,1))
plot(x,x^2,type= "l", ylim=c(0,10))
The plot function
• For a fully customized plot try the following
par(mfrow=c(1,1))
plot(x, y, xlab="Ten random values", ylab="Ten other values",
xlim=c(-2, 2),ylim=c(-2, 2), pch=22, col="red",bg="yellow",
bty="l", tcl=0.4,main="How to customize a plot with R", las=1,
cex=1.5)
• What does each of the options do?
• This type of control over parameters is typical of R and is also
found in analysis functions and programming. It is one of the
features which makes R truly scientific and superior to many
other software packages.
More customized plots
• Some plotting options can be passed on as
arguments to the plot function while others will need
modification of the default graphical settings specified
in par.
• You can view the current settings in par by typing par()
at the prompt.
• Let us consider the following modification to par.
• Type the following ( > denotes the R prompt)
opar <- par()
par(mfrow=c(1,1))
par(bg="lightyellow", col.axis="blue", mar=c(4, 4, 2.5, 0.25))
plot(x, y, xlab="Ten random values", ylab="Ten other values",
xlim=c(-2, 2), ylim=c(-2, 2), pch=22, col="red", bg="yellow",
bty="l", tcl=-.25, las=1, cex=1.5)
title("How to customize a plot with R", font.main=3, adj=1)
• However once the user has spent some time setting his favourite
plotting options, it is easy to replicate these for another dataset.
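Settings saved before a customized plot can be put back afterwards with par(opar). A minimal sketch, assuming the settings are saved with par(no.readonly = TRUE), which leaves out the read-only parameters that cannot be set again. The call to pdf(NULL) is only an assumption made so that the sketch runs without opening a window; on screen it can be omitted.

```r
pdf(NULL)                            # off-screen device, nothing is written to disk
opar <- par(no.readonly = TRUE)      # save only the settings that can be restored
mar_before <- par("mar")

par(bg = "lightyellow", col.axis = "blue", mar = c(4, 4, 2.5, 0.25))
plot(rnorm(10), rnorm(10), pch = 22, col = "red", bg = "yellow")

par(opar)                            # back to the saved settings
mar_after <- par("mar")
dev.off()
```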
• We shall now look at a sample data set in R. Consider the data set
florida which has the votes for the various candidates by county
in the state of Florida in the 2000 US presidential election. You
can attach this dataset by typing
attach("usingR.RData")
attach(florida)
Type florida at the prompt to view the data.
Next try plotting the votes for Bush and Buchanan.
Interactive plotting : The identify and locator functions
identify is a useful function which can be used to label selected
points on a plot.
Type plot(BUSH, BUCHANAN, xlab="Bush", ylab="Buchanan")
identify(BUSH, BUCHANAN, County)
Then click near a point to identify the county.
Another interactive function is the locator function.
Type
plot(1:nrow(florida), BUSH, col="red",pch=2,xlab="County no",
ylab="Votes")
points(1:nrow(florida), BUCHANAN, col="green", pch=4)
leg <- c("BUSH", "BUCHANAN")
Then type
legend(locator(1), leg, col=c("red", "green"), pch=c(2,4))
We next discuss histograms, density plots,
boxplots and normal probability plots
Type the following:
attach("usingR.RData")
attach(possum)
Type possum at the prompt to view the data. Next type
hist(totlngth) for the default histogram.
Now suppose we want to specify the bins ourselves. Type
par(mfrow = c(1, 2))
hist(totlngth, freq=F, breaks = 72.5 + (0:5) * 5,
xlab="Total length", main ="A: Breaks at 72.5, 77.5, ...")
hist(totlngth, freq=F, breaks = 75 + (0:5) * 5, xlab="Total length",
main="B: Breaks at 75, 80, ...")
To get a corresponding density estimate (using kernel smoothing)
type
d <- density(totlngth)
points(d) will superimpose the density estimate.
A better (scaled) superimposition can be obtained by
points(d$x, d$y/1.08,type="l", col="blue")
• qqnorm(totlngth) gives a normal probability plot of the variable
totlngth. The points of this plot will lie approximately on a straight line if
the distribution is normal.
• In order to calibrate the eye to recognise plots that indicate nonnormal
variation, it is helpful to do several normal probability plots for random
samples of the relevant size from a normal distribution.
• Type the following
attach(possum)
par(mfrow=c(3,4)) # A 3 by 4 layout of plots
y <- totlngth
qqnorm(y,xlab= " ", ylab="Length", main="Possums" ,col="blue")
for(i in 1:11)
qqnorm(rnorm(43),col="red",xlab="", ylab="Simulated lengths",
main="Simulated")
Now let's explore some more sophisticated plots
Suppose we have normally distributed data and we want to see how the
empirical density estimate compares with the normal density estimate
as we vary the sample size. Type the following commands
library(lattice)
n <- seq(5, 45, 5)
x <- rnorm(sum(n))
y <- factor(rep(n, n), labels=paste("n =", n))
densityplot(~ x | y,
panel = function(x, ...) {
panel.densityplot(x, col="DarkOliveGreen", ...)
panel.mathdensity(dmath=dnorm,
args=list(mean=mean(x), sd=sd(x)),
col="darkblue")
})
• The iris dataset gives measurements on four variables for
several species of the flower. A pairwise scatter of these
variables with separate markers for each species may be
obtained by
data(iris)
splom(
~iris[1:4], groups = Species, data = iris, xlab = "",
panel = panel.superpose,
auto.key = list(columns = 3)
)
Colours in R
• R is particularly good at handling colours. To see the palette in R, type
the following:
demo.pal <-
function(n, border = if (n<32) "light gray" else NA,
main = paste("color palettes; n=",n),
ch.col = c("rainbow(n, start=.7, end=.1)", "heat.colors(n)",
"terrain.colors(n)", "topo.colors(n)", "cm.colors(n)"))
{
nt <- length(ch.col)
i <- 1:n; j <- n / nt; d <- j/6; dy <- 2*d
plot(i,i+d, type="n", yaxt="n", ylab="", main=main)
for (k in 1:nt) {
rect(i-.5, (k-1)*j+ dy, i+.4, k*j,
col = eval(parse(text=ch.col[k])), border = border)
text(2*j, k * j +dy/4, ch.col[k])
}
}
n <- if(.Device == "postscript") 64 else 16
# Since for screen, larger n may give color allocation problem
demo.pal(n)
Colours in R
To use these colour palettes, try the following plot of the contours of a bivariate
normal density.
x <- y <- seq(-3,3,length=100)
norm.density <- matrix(0,100,100)
for (i in 1:100)
for (j in 1:100)
norm.density[i,j] <- dnorm(x[i])*dnorm(y[j])
par(mfrow=c(1,1))
image(x, y, norm.density, col = heat.colors(1000), axes = FALSE)
contour(x, y, norm.density, by = 5, add = TRUE, col = "peru")
box()
title(main = "The bivariate normal density", font.main = 4)
You can change the appearance of the plot by selecting the amount of colour
gradation. For example try
par(mfrow=c(1,2))
image(x, y, norm.density, col = heat.colors(1000), axes = FALSE)
contour(x, y, norm.density, by = 5, add = TRUE, col = "peru")
box()
image(x, y, norm.density, col = heat.colors(3), axes = FALSE)
contour(x, y, norm.density, by = 5, add = TRUE, col = "peru")
box()
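The double loop that fills norm.density can also be written as an outer product, because the density here is the product of two univariate normal densities. A small sketch comparing the two; the names norm.loop, norm.outer and max_diff are introduced here for illustration.

```r
x <- y <- seq(-3, 3, length = 100)

# Loop version, as on the slide
norm.loop <- matrix(0, 100, 100)
for (i in 1:100)
  for (j in 1:100)
    norm.loop[i, j] <- dnorm(x[i]) * dnorm(y[j])

# Vectorised version: outer product of the two marginal densities
norm.outer <- outer(dnorm(x), dnorm(y))

max_diff <- max(abs(norm.loop - norm.outer))   # largest discrepancy between the two
```

Either matrix can be passed to image() and contour() in the same way.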
The R Graph Gallery
• Visit the R Graph Gallery at
http://addictedtor.free.fr/graphiques/allgraph.php
• Click on a graph for a copy of the code used to generate it.
3D graphics and movies in R
• And the R movies gallery at
• http://addictedtor.free.fr/movies/
• for R movies.
• http://rgl.neoscientists.org/Gallery.html demonstrates the rgl package.
• Try the following code
library(rgl)
example(rgl.surface)
for(i in 1:360) {
rgl.viewpoint(i, i*(60/360), interactive=F)
}
Click on the RGL device at the bottom of the R screen to see the
results.
Exercises
1. (a) First attach the possum dataset using attach("usingR.RData")
and then attach(possum). Consider the variable head lengths
given by hdlngth. Plot the following on the same page
a) a histogram
b) a stem and leaf plot
c) a normal probability plot and
d) a density plot
(b) The measurements in the possum dataset have been
collected at various sites given by the variable site. Draw box
plots of hdlngth by site.
Solutions
1. (a) First attach the possum dataset using attach("usingR.RData") and then
attach(possum). Consider the variable head lengths given by
hdlngth. Plot the following on the same page
a) a histogram
b) a stem and leaf plot
c) a normal probability plot and
d) a density plot
First attach the data using
attach("usingR.RData")
attach(possum)
To get all plots on the same page, set the layout parameter mfrow
par(mfrow=c(2,2))
The histogram, normal probability and density plots can be obtained as
hist(hdlngth, xlab="headlength of possums", main="Histogram")
qqnorm(hdlngth, xlab="headlength of possums", main="Normal probability plot")
plot(density(hdlngth), xlab= "headlength of possums", main="Density plot")
Solutions
The stem and leaf is a little trickier. First, to find the corresponding
command in R, we type help.search("stem") at the R prompt. Two likely
candidates seem to be the command stem in the graphics package and stem.leaf in the
aplpack package. The stem command does not work because it only prints
the stem and leaf display in the command window. Note that
names(stem(hdlngth)) does not return anything. The stem.leaf command
also displays the results in the command window but it does store the
information in an object. To see this type
library(aplpack)
sc <- stem.leaf(hdlngth)
sc
We can create an empty plot and use the text command to place this
information on the plot as follows:
plot(1:65,1:65, type="n", xlab=" ", ylab= " ", axes=F)
for(i in 1:16)
text(3,65-4*i,sc$stem[i], adj=c(0,0), cex=0.7)
(b) The measurements in the possum dataset have been collected
at various sites given by the variable site. Draw box plots of
hdlngth by site.
First reset the layout parameter with
par(mfrow=c(1,1))
The basic code is
boxplot(hdlngth~site)
You can also try the following variants:
boxplot(hdlngth ~ site, notch = TRUE, col = "blue")
boxplot(hdlngth ~ site, names=c("A","B","C","D","E","F","G"))
boxplot(hdlngth ~ site, names=c("Site A", "Site B", "Site C", "Site D",
"Site E", "Site F", "Site G"), las=2)
boxplot(hdlngth~site, subset=hdlngth<90)
boxplot(hdlngth ~ site, boxwex = 0.25, at = 2:8,
main = " Headlength of possums", xlab = " Site",
ylab= "Head length", ylim = c(50, 110), yaxs = "i")
Tutorial 2 : Regression analysis
The lm command
• The lm command is used to fit linear regressions in R.
• To fit a regression of y on x1, x2, the basic command is lm(y~x1+x2). Let us
generate some data for the purpose of illustration:
y <- rnorm(10)
x1 <- rnorm(10)
x2 <- rnorm (10)
lm(y~x1+x2)
• This only displays the least squares estimates on the screen. What if we want
to test for significance? To do this we have to define a variable say lmfit which
will save the output. To do this type lmfit <- lm(y~x1+x2)
• lmfit is called an R list. To see the results now type summary(lmfit). To see
what else is contained in lmfit type names(lmfit). To see a particular
component of lmfit, e.g. the residuals, type lmfit$residuals.
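The components mentioned on this slide can be inspected as follows. This is a sketch on simulated data; the seed is arbitrary and the names coef_table and res are introduced here for illustration.

```r
set.seed(1)                 # arbitrary seed so the sketch is repeatable
y  <- rnorm(10)
x1 <- rnorm(10)
x2 <- rnorm(10)

lmfit <- lm(y ~ x1 + x2)

names(lmfit)                          # components stored in the fitted object
coef_table <- coef(summary(lmfit))    # estimates, std. errors, t values, p values
res <- residuals(lmfit)               # extractor for lmfit$residuals
coef_table
```

Extractor functions such as coef(), residuals() and fitted() return the same values as the corresponding list components.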
The gala dataset
• Now let us fit a linear model on some real data. We shall use the gala
dataset in the faraway library.
• This dataset concerns the number of plant species on the various
Galapagos Islands. There are 30 cases (islands) and 7 variables in the
dataset.
• The variables are
– Species The number of plant species found on the island
– Endemics The number of endemic or native species
– Area The area of the island (km²)
– Elevation The highest elevation of the island (m)
– Nearest The distance from the nearest island (km)
– Scruz The distance from Santa Cruz island (km)
– Adjacent The area of the adjacent island (km²)
• http://www.rit.edu/~rhrsbi/GalapagosPages/Darwin.html
The gala dataset
• We start by reading the data into R :
library(faraway)
attach(gala)
Use summary(gala), names(gala) etc to get a
sense of the data.
You can also use pairs(gala) to get a pairwise
scatterplot of the variables.
• Let us fit the regression
gfit <- lm(Species ~ Area + Elevation +
Nearest + Scruz +
Adjacent,data=gala)
• To see the results type summary(gfit)
• In particular, the fitted (or predicted) values and residuals are
gfit$fit
and gfit$res
The anova command
• A convenient way to compare two nested models is to use the anova
command
• Suppose we fit the two models
g1 <- lm(Species ~ Area + Elevation + Nearest +
Scruz +
Adjacent,data=gala)
g2 <- lm(Species ~ Area + Elevation +
Nearest + Scruz ,data=gala)
Then
anova(g2,g1)
will give us the conventional F test comparing these two models.
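The F statistic reported by anova can be reproduced from the residual sums of squares of the two nested fits. A sketch, using the built-in mtcars data purely as a stand-in for gala (so that no extra package is needed); the model formulas and the names big, small, f_manual and p_manual are assumptions made for illustration.

```r
# mtcars is a built-in dataset used here only as a stand-in for gala
big   <- lm(mpg ~ wt + hp + disp, data = mtcars)
small <- lm(mpg ~ wt + hp,        data = mtcars)

tab <- anova(small, big)          # smaller model first, as in anova(g2, g1)

# The same F statistic from the residual sums of squares
rss_small <- sum(residuals(small)^2)
rss_big   <- sum(residuals(big)^2)
q         <- small$df.residual - big$df.residual   # number of restrictions tested
f_manual  <- ((rss_small - rss_big) / q) / (rss_big / big$df.residual)
p_manual  <- pf(f_manual, q, big$df.residual, lower.tail = FALSE)
tab
```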
Testing nested models
• Suppose we want to test whether the
coefficients for the variables Area and
Adjacent are equal. We can type
g1 <- lm(Species ~ Area + Elevation + Nearest +
Scruz +
Adjacent,data=gala)
g2 <- lm(Species ~ I(Area + Adjacent) +
Elevation + Nearest +
Scruz,data=gala)
• Then anova(g2,g1) will perform the appropriate
F test.
• Suppose we want to test whether the coefficient of Area can be set to a
particular value say –0.1. We can then fit
g2 <- lm(Species ~ offset(-0.1*Area) +
Elevation + Nearest + Scruz +
Adjacent,data=gala)
• Then anova(g2,g1) again gives the F test of this hypothesis.
Categorical predictors
• Suppose we want to include the variable Area as a categorical
predictor with 3 categories rather than a continuous one.
• To define the corresponding categorical variable
area.cat <- rep(3,nrow(gala))
area.cat[gala$Area<=5] <- 1
area.cat[(gala$Area>5)&(gala$Area<=1000)] <- 2
• Type cbind(gala$Area,area.cat) to view the results
• This regression can be fitted using the command
g3 <- lm(Species ~ as.factor(area.cat) +
Elevation + Nearest + Scruz +
Adjacent,data=gala)
Type summary(g3) to view the output.
• The factor command is useful for fitting ANOVA models.
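The three-level coding of Area can also be produced in one step with cut(). A sketch on a made-up vector of areas (the values are invented so that the example runs without the faraway package); the name area.fac is introduced here for illustration.

```r
# Hypothetical island areas (km^2), invented for illustration
Area <- c(0.01, 1.2, 5, 5.1, 58.3, 1000, 1000.5, 4669.3)

# Manual coding, as on the slide
area.cat <- rep(3, length(Area))
area.cat[Area <= 5] <- 1
area.cat[(Area > 5) & (Area <= 1000)] <- 2

# cut() builds the factor directly; intervals are closed on the right by default,
# so the breaks below give (-Inf, 5], (5, 1000] and (1000, Inf)
area.fac <- cut(Area, breaks = c(-Inf, 5, 1000, Inf), labels = c("1", "2", "3"))
table(area.fac)
```

Because area.fac is already a factor, it can be used in a model formula without wrapping it in as.factor().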
Categorical predictors
• For example consider the coagulation data set which gives
measurements on blood coagulation corresponding to four diets.
data(coagulation)
coagulation
• A one way ANOVA model can be fitted to the data using
coag.fit <- lm(coag ~ factor(diet),
coagulation)
summary(coag.fit)
Multiple comparisons
• Another feature especially relevant for ANOVA models is to allow for
multiple comparisons while testing for pairwise differences.
• Tukey’s Honest Significant Difference (HSD) is designed for all
pairwise comparisons and depends on the studentized range
distribution. We compute the Tukey HSD bands for the diet data.
TukeyHSD(aov(coag.fit))
• You can compare these to the unadjusted
confidence intervals for the differences B-A,
C-A, D-A given below
B-A 1.813638 8.186362
C-A 3.813638 10.186362
D-A -3.022848 3.022848
• Some other pairwise comparison tests may be found in the stats
library.
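The comparison between Tukey HSD and unadjusted intervals can be tried on a dataset that ships with R. The sketch below uses PlantGrowth (plant weight under a control and two treatments) as a stand-in for the coagulation data; that substitution is an assumption made here so that the example needs no extra package.

```r
# PlantGrowth is a built-in dataset, used here as a stand-in for coagulation
fit <- lm(weight ~ group, data = PlantGrowth)

# Simultaneous 95% intervals for all pairwise differences
hsd <- TukeyHSD(aov(weight ~ group, data = PlantGrowth))
hsd$group                      # columns: diff, lwr, upr, p adj

# Unadjusted 95% intervals for differences from the baseline level
unadj <- confint(fit)[-1, ]

# Widths of the interval for the first treatment versus control
width_hsd   <- hsd$group[1, "upr"] - hsd$group[1, "lwr"]
width_unadj <- unadj[1, 2] - unadj[1, 1]
```

The Tukey interval is wider than the unadjusted one because it allows for all pairwise comparisons being made at once.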
Confidence intervals
• Returning to the Galapagos dataset, to construct individual 95%
confidence intervals for the regression parameters, we first extract the
parameters and the standard errors:
summary(g1)$coefficients gives the coefficients, standard errors, t and
p values as a matrix. We extract the first two columns using
beta <- summary(g1)$coefficients[,1]
se.beta <- summary(g1)$coefficients[,2]
We next compute the critical value of the t-statistic with error d.f.
t95 <- qt(0.975, g1$df.residual)
and the individual confidence intervals as
ci.beta <- cbind(beta-t95*se.beta, beta+t95*se.beta)
ci.beta
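The same intervals are available in a single call through confint(). A sketch that checks the two routes against each other, with the built-in mtcars data standing in for gala (an assumption made so that the example runs without the faraway package).

```r
# mtcars is a built-in dataset used only as a stand-in for gala
g1 <- lm(mpg ~ wt + hp, data = mtcars)

coefs   <- summary(g1)$coefficients
beta    <- coefs[, 1]                     # estimates
se.beta <- coefs[, 2]                     # standard errors
t95     <- qt(0.975, g1$df.residual)      # two-sided 95% critical value
ci.beta <- cbind(beta - t95 * se.beta, beta + t95 * se.beta)

ci.builtin <- confint(g1, level = 0.95)   # same intervals in one call
ci.builtin
```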
Confidence ellipsoids
• Now we construct the joint 95% confidence region for the coefficients
of Area and Elevation. Type
library(ellipse)
plot(ellipse(g1,c(2,3)),type="l")
• Add the origin and the point of the estimates:
points(0,0)
points(g1$coef[2],g1$coef[3],pch=18)
• Now we mark the one way confidence intervals on the plot for
reference:
abline(v=ci.beta[2,],lty=2)
abline(h=ci.beta[3,],lty=2)
Predictions
• Suppose we want to predict the number of species corresponding to
a hypothetical sample point with Area = 0.08, Elevation = 93,
Nearest = 6.0, Scruz = 12 and Adjacent = 0.34.
• This can be done with
predict(g1, data.frame(Area=0.08, Elevation=93, Nearest=6.0, Scruz=12,
Adjacent=0.34), se.fit=TRUE)
• predict(g1) without any additional arguments
will return the predicted values for the sample
data points.
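predict can also return interval estimates directly through its interval argument. A sketch with mtcars standing in for gala; the new observation (wt = 3, hp = 120) is a made-up point chosen for illustration.

```r
# mtcars is a built-in dataset used only as a stand-in for gala
fit <- lm(mpg ~ wt + hp, data = mtcars)
new <- data.frame(wt = 3, hp = 120)       # hypothetical new observation

p_se   <- predict(fit, new, se.fit = TRUE)             # fit plus its standard error
p_conf <- predict(fit, new, interval = "confidence")   # interval for the mean response
p_pred <- predict(fit, new, interval = "prediction")   # interval for a new observation

p_conf
p_pred
```

The prediction interval is wider than the confidence interval because it also accounts for the error variance of a single new observation.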
Generalized least squares
• Until now we have assumed that $\operatorname{var}(\varepsilon) = \sigma^2 I$, but it can happen that the
errors have non-constant variance or are correlated, in which case we
should fit the model by generalized least squares.
• To illustrate this we will use a dataset called Longley’s regression data
where the response is the number of people employed, yearly from
1947 to 1962 and the predictors are GNP implicit price deflator, GNP,
unemployed, armed forces, non-institutionalized population 14 years
of age and over, and year.
• To attach and view the data type
data(longley)
names(longley)
Approach 1
• Assuming that the errors follow an autoregressive series of order one,
we can estimate the serial correlation as
data(longley)
g <- lm(Employed ~ GNP + Population,
data=longley)
cor(g$res[-1],g$res[-16])
• We now construct the $\Sigma$ matrix and compute the GLS estimate of $\beta$
along with its standard errors.
x <- model.matrix(g)
Sigma <- diag(16)
Sigma <- 0.31041^abs(row(Sigma)-col(Sigma))
Sigi <- solve(Sigma)
xtxi <- solve(t(x) %*% Sigi %*% x)
beta <- xtxi %*% t(x) %*% Sigi %*% longley$Employed
beta
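The standard errors of the GLS estimate can be obtained from the same matrices. The sketch below estimates the error variance by the weighted residual sum of squares divided by n - p, which is one common choice and an assumption made here; the correlation rho is computed from the residuals rather than typed in, and the names sigma2 and se.beta are introduced for illustration.

```r
data(longley)
g <- lm(Employed ~ GNP + Population, data = longley)
n <- nrow(longley)
rho <- cor(residuals(g)[-1], residuals(g)[-n])   # lag-one residual correlation

x     <- model.matrix(g)
Sigma <- rho^abs(row(diag(n)) - col(diag(n)))    # AR(1) correlation matrix
Sigi  <- solve(Sigma)
xtxi  <- solve(t(x) %*% Sigi %*% x)
beta  <- xtxi %*% t(x) %*% Sigi %*% longley$Employed

# Weighted residual variance, then standard errors of the coefficients
res     <- longley$Employed - x %*% beta
sigma2  <- as.numeric(t(res) %*% Sigi %*% res) / (n - ncol(x))
se.beta <- sqrt(diag(xtxi) * sigma2)
cbind(estimate = drop(beta), se = se.beta)
```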
Approach 2
• Since we can write $\Sigma = SS^T$, where $S$ is a triangular matrix obtained from the
Choleski decomposition, another approach would be to regress $S^{-1}y$ on $S^{-1}X$,
as demonstrated below:
sm <- chol(Sigma)
smi <- solve(t(sm))
sx <- smi %*% x
sy <- smi %*% longley$Employed
lmsxsy <- lm(sy ~sx-1)
lmsxsy$coef
• Our initial estimate of the AR parameter is 0.31 but once we fit our GLS
model we can re-estimate it as cor(lmsxsy$res[-1],lmsxsy$res[-16])
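A quick check that the Choleski route and the direct GLS formula give the same coefficients. This is a sketch on the built-in longley data; rho is recomputed from the OLS residuals, and the names beta.chol and beta.direct are introduced here for illustration.

```r
data(longley)
g <- lm(Employed ~ GNP + Population, data = longley)
n <- nrow(longley)
rho   <- cor(residuals(g)[-1], residuals(g)[-n])
Sigma <- rho^abs(outer(1:n, 1:n, "-"))    # AR(1) correlation matrix
x <- model.matrix(g)
y <- longley$Employed

# chol() returns an upper-triangular R with t(R) %*% R equal to Sigma, so S = t(R)
S  <- t(chol(Sigma))
Si <- solve(S)
sx <- Si %*% x
sy <- drop(Si %*% y)
beta.chol <- coef(lm(sy ~ sx - 1))        # OLS on the transformed data

# Direct GLS formula for comparison
Sigi <- solve(Sigma)
beta.direct <- drop(solve(t(x) %*% Sigi %*% x, t(x) %*% Sigi %*% y))
rbind(beta.chol, beta.direct)
```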
Approach 3
• The nlme library contains a GLS fitting function. We can use it to fit
this model:
library(nlme)
g <- gls(Employed ~GNP + Population,
correlation=corAR1(form= ~Year),
data=longley)
summary(g)
• We see that the estimated value of $\rho$ obtained using Restricted
Maximum Likelihood estimation is 0.64. You can also specify
method = "ML" in the gls call for Maximum Likelihood
estimation of $\rho$.
Weighted least squares
• Sometimes the errors are uncorrelated, but have unequal variance where the
form of the inequality is known. Weighted least squares (WLS) can be used in
this situation.
• Here is an example from an experiment to study the interaction of certain
kinds of elementary particles on collision with proton targets. The experiment
was designed to test certain theories about the nature of the strong interaction.
The cross-section(crossx) variable is believed to be linearly related to the
inverse of the energy(energy - has already been inverted). At each level of
the momentum, a very large number of observations were taken so that it was
possible to accurately estimate the standard deviation of the response(sd).
• Consider the following code
data(strongx)
strongx
• Define the weights and fit the model:
g <- lm(crossx ~energy, strongx, weights=sd^-2)
summary(g)
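Weighting by 1/sd^2 is equivalent to dividing the response and every column of the design matrix (including the intercept) by sd and then fitting by ordinary least squares. A sketch on simulated data; all numbers below are invented and the names sdv, wls and ols are introduced here for illustration.

```r
set.seed(42)                              # arbitrary seed
x   <- seq(0.05, 0.35, length = 10)       # invented predictor values
sdv <- seq(1, 10, length = 10)            # known standard deviation of each response
y   <- 150 + 500 * x + rnorm(10, sd = sdv)

wls <- lm(y ~ x, weights = sdv^-2)        # weight = 1 / variance

# Equivalent fit: rescale every term by 1/sd and drop the automatic intercept
ols <- lm(I(y / sdv) ~ 0 + I(1 / sdv) + I(x / sdv))

rbind(wls = coef(wls), ols = coef(ols))
```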
Diagnostics : residuals and leverage
• Let's illustrate these diagnostics using an interesting economic dataset on 50
different countries.
• These data are averages over 1960-1970 on
dpi = per-capita disposable income in U.S. dollars;
ddpi = the percent rate of change in per capita disposable
income;
sr = aggregate personal saving divided by disposable income.
pop15 = percentage population under 15
pop75 = percentage population over 75
• The data come from Belsley, Kuh, and Welsch (1980).
• First take a look at the data:
data(savings)
savings
Diagnostics : outlier identification
• Consider the regression
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi,
savings)
And a plot of the residuals :
plot(g$res, ylab="Residuals", main="Index plot of residuals")
countries <- row.names(savings)
To identify outliers use
identify(1:50,g$res,countries)
Diagnostics : Leverage
• Now look at the leverage: We first extract the X-matrix here using
model.matrix() and then compute and plot the leverages or so called
”hat” values:
x <- model.matrix(g)
lev <- hat(x)
par(mfrow=c(1,1))
plot(lev, ylab="Leverages", main="Index plot of Leverages")
abline(h=2*5/50)
• Notice that the sum of the leverages is equal to 5 for this data. Which countries
have large leverage? We have marked a horizontal line at 2p/n.
• Alternatively type
names(lev) <- countries
lev[lev > 0.2]
• The command names() assigns the country names to the elements of the
vector lev making it easier to identify them. Alternatively, we can do it
interactively like this identify(1:50,lev,countries)
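The leverages can also be obtained with hatvalues(), which returns them already named by observation. The sketch below uses LifeCycleSavings, a dataset shipped with R that has the same five variables for 50 countries; treating it as equivalent to the savings data in the faraway package is an assumption made here.

```r
# LifeCycleSavings is built in; it is assumed to match the savings data on the slides
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

lev <- hatvalues(g)            # leverages, named by country
p <- length(coef(g))           # 5 parameters including the intercept
n <- nrow(LifeCycleSavings)    # 50 countries

sum(lev)                       # the leverages sum to p
high <- lev[lev > 2 * p / n]   # countries above the 2p/n rule of thumb
high
```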
Diagnostics : residual plots
• On the previous slide we plotted the raw residuals. Two alternative classes
of residuals are the studentized residuals and the jackknife residuals. To
get a plot of all three types on the same page type
par(mfrow=c(3,1))
plot(g$res, ylab="Residuals", main="Index plot of residuals")
gs <- summary(g)
stud <- g$res/(gs$sigma*sqrt(1-lev))
plot(stud, ylab="Studentized Residuals", main="Studentized Residuals")
jack <- rstudent(g)
plot(jack, ylab="Jackknife Residuals", main="Jackknife Residuals")
Jackknife residuals can be used for outlier
detection. Type jack[abs(jack)==max(abs(jack))] to
identify the largest value. A critical value for
these jackknife residuals with Bonferroni
correction for multiple testing can be computed
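One way to compute such a critical value is sketched below: a two-sided test at the 5% level, Bonferroni-corrected for n observations, using a t distribution with n - p - 1 degrees of freedom. The built-in LifeCycleSavings data is assumed to match the savings data, and the names crit, worst and is_outlier are introduced here for illustration.

```r
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
n <- nrow(LifeCycleSavings)
p <- length(coef(g))

jack <- rstudent(g)                               # jackknife residuals
crit <- qt(1 - 0.05 / (2 * n), df = n - p - 1)    # Bonferroni critical value
worst <- jack[which.max(abs(jack))]               # most extreme observation
is_outlier <- abs(worst) > crit                   # TRUE would flag it as an outlier

crit
worst
```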
Influential Observations
• To identify influential observations consider Cook’s distances :
cook <- cooks.distance(g)
plot(cook,ylab="Cooks distances")
identify(1:50,cook,countries)
Multicollinearity
• The Longley dataset is a good example of collinearity:
• Check the correlation matrix first using round(cor(longley[,-7]),3)
• Now we check the eigendecomposition:
x <- as.matrix(longley[,-7])
e <- eigen(t(x) %*% x)
sqrt(e$val[1]/e$val)
• One option could be to use ridge regression as
implemented by lm.ridge in the MASS library.
Type
library(MASS)
? lm.ridge
for details.
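As a complement to the eigendecomposition, the sketch below also computes variance inflation factors from the inverse of the correlation matrix. It assumes the longley data frame from the datasets package, in which column 7 is the response Employed; cond and vif are illustrative names.

```r
x <- as.matrix(longley[, -7])        # predictors only
e <- eigen(crossprod(x))             # crossprod(x) is t(x) %*% x
cond <- sqrt(e$values[1] / e$values) # condition numbers
round(cond, 1)
# Variance inflation factors: diagonal of the inverse correlation matrix
vif <- diag(solve(cor(x)))
round(vif, 1)
```

Large condition numbers and large variance inflation factors both point to near-linear dependence among the predictors.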
Variable selection
• The step function with the option direction = "forward",
direction="backward" or direction="both" can be used for forward,
backward or stepwise variable selection. It compares models using AIC
rather than p-values.
• The leaps function in the leaps library implements variable selection
based on model selection criteria such as the adjusted R2 and Mallows' Cp;
regsubsets in the same library also reports BIC.
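A minimal sketch of backward selection with step() follows. It uses the built-in LifeCycleSavings data as an assumed stand-in for the savings data used earlier; g.back and g.bic are illustrative names.

```r
# Assumption: LifeCycleSavings (datasets package) stands in for savings
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
# Backward elimination; the penalty k = 2 (the default) corresponds to AIC
g.back <- step(g, direction = "backward", trace = 0)
formula(g.back)
# Setting k = log(n) uses the BIC penalty instead
n <- nrow(LifeCycleSavings)
g.bic <- step(g, direction = "backward", k = log(n), trace = 0)
formula(g.bic)
```

Setting trace = 0 suppresses the step-by-step printout; leave it at the default to see each candidate model.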
Transformations
• Transformations of the response and predictors can improve the fit and
correct violations of model assumptions such as constant error
variance.
• A popular method is to use the Box-Cox transformation of the response.
• Consider the Galapagos Islands dataset analyzed earlier:
data(gala)
g <- lm(Species ~ Area + Elevation + Nearest +
Scruz + Adjacent,gala)
library(MASS)
boxcox(g,plotit=T)
boxcox(g,lambda=seq(0.0,1.0,by=0.05),plotit=T)
Transformations
Alternatively we can also transform the predictor using various functions such
as standard polynomials, orthogonal polynomials, splines etc. The following
code fits a non-linear regression of Species on Scruz using (natural) splines.
data(gala)
library(splines)
g4 <- lm(Species ~ns(Scruz, df=4),gala)
To view the results type
scruz.ord <- order(gala$Scruz)
plot(gala$Scruz, gala$Species, xlab="Scruz", ylab="Species", lwd=2)
points(gala$Scruz[scruz.ord], predict(g4)[scruz.ord],
col="yellow", type="l")
Transformations
You can also look at the effect of increasing the
degrees of freedom
g8 <- lm(Species ~ ns(Scruz, df=8), gala)
points(gala$Scruz[scruz.ord],
predict(g8)[scruz.ord], col="orange", type="l")
g12 <- lm(Species ~ ns(Scruz, df=12), gala)
points(gala$Scruz[scruz.ord],
predict(g12)[scruz.ord], col="red", type="l")
g20 <- lm(Species ~ ns(Scruz, df=20), gala)
points(gala$Scruz[scruz.ord],
predict(g20)[scruz.ord], col="purple", type="l")
leg <- c("4 df", "8 df", "12 df", "20 df")
legend(200, 200, leg, lty=1, col=c("yellow",
"orange", "red", "purple"))
Exercise
• The variable Species in the gala dataset is actually a count and one
should ideally fit a generalized linear model with a Poisson family (log link).
– Use the glm function to fit an appropriate generalized linear model
to the data accounting for possible overdispersion.
– Use predict.glm to plot the predicted values corresponding to the
sample data together with prediction intervals.
– Perform post model fitting diagnostics using some of the following
functions from the car and stats libraries :
• cookd (car),
• dfbeta and dfbetas (stats)
• dffits (stats)
• influence.measures(stats)
• outlier.test (car).
Tutorial 3 : Programming in R
When do we need to programme?
• To define user defined functions.
• To create libraries.
• For simulations.
A sample programme
Here is a sample programme that simulates the distribution of the median
of a Cauchy sample, computes a Monte Carlo estimate of its MSE and
compares the histogram of the simulated values with the asymptotic
distribution. It also displays the time taken to perform the
simulations.
n <- 10
nsim <- 10000
theta.hat <- double(nsim)
start <- proc.time() # Record the starting time
for (i in 1:nsim) {
  x <- rcauchy(n)
  theta.hat[i] <- median(x)
}
mean(theta.hat^2) # Monte Carlo estimate of the MSE
elapsed <- (proc.time() - start)[1] # Time used by the simulation
cat("Calculation took", elapsed, "seconds.\n")
hist(theta.hat, freq = FALSE, breaks = 100)
curve(dnorm(x, sd = sqrt(mean(theta.hat^2))), add = TRUE)
curve(dnorm(x, sd = sqrt(1 / (4 * n * dcauchy(0)^2))), add = TRUE, col = "red")
Programming style
Make the programme as generic as possible.
For example, to generate sample averages for 200 samples of size 20
from the standard normal distribution, one possible code is
samp.mean <- numeric(200) # Create the vector before filling it
for (i in 1:200)
{
  samp.mean[i] <- mean(rnorm(20))
}
A more generic alternative is
sample.size <- 20
simulation.size <- 200
samp.mean <- numeric(simulation.size)
for (i in 1:simulation.size)
{
  samp.mean[i] <- mean(rnorm(sample.size))
}
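The same simulation can also be written without an explicit loop. The sketch below uses replicate(), which is one possible alternative and not part of the original slides.

```r
sample.size <- 20
simulation.size <- 200
# replicate() evaluates the expression simulation.size times
samp.mean <- replicate(simulation.size, mean(rnorm(sample.size)))
length(samp.mean)
sd(samp.mean)   # should be close to 1/sqrt(sample.size)
```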
Programming style
• Indent.
Consider
samp.mean <- matrix(NA, simulation.size, num.times)
for (i in 1:simulation.size)
{
    for (j in 1:num.times)
    {
        samp.mean[i,j] <- mean(rnorm(mean=j, sd=1, n=samp.size))
    }
}
as opposed to
samp.mean <- matrix(NA, simulation.size, num.times)
for (i in 1:simulation.size)
{
for (j in 1:num.times)
{
samp.mean[i,j] <- mean(rnorm(mean=j, sd=1, n=samp.size))
}
}
Programming style
Give meaningful variable names.
The programme
x <- matrix(NA, m, n)
for (i in 1:m)
{
  for (j in 1:n)
  {
    x[i,j] <- mean(rnorm(mean=j, sd=1, n=K))
  }
}
is perfectly valid but not as user friendly.
• When choosing names be careful not to overwrite inbuilt R functions.
For example, naming a variable lm will mask the lm function. If this
happens type remove("lm") to restore the function.
Programming style
Use matrices for fast computation.
The commands executed by the following lines of code
A <- matrix(0,500,500)
for (i in 1:500)
for (j in 1:500)
A[i,j] <- i + j
can be run faster and in a more elegant manner using
I.mat <- matrix(seq(1,500), nrow=500, ncol=500)
A <- I.mat + t(I.mat)
Though R is generally better with loops than S+, such matrix
programming is essential for fast computation.
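Another loop-free way to build the same matrix is outer(). The sketch below checks it against the double loop; A.loop, A.outer and A.rc are illustrative names.

```r
n <- 500
A.loop <- matrix(0, n, n)
for (i in 1:n)
  for (j in 1:n)
    A.loop[i, j] <- i + j
# outer() applies "+" to every pair (i, j)
A.outer <- outer(1:n, 1:n, "+")
all(A.loop == A.outer)
# row() and col() return the row and column index of each cell
A.rc <- row(A.loop) + col(A.loop)
all(A.rc == A.outer)
```

Wrapping each version in system.time() gives a rough comparison of their running times.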
Programming style
Add comments.
The # sign can be used to insert comments. For example :
simulation.size <- 200 # Set simulation size
samp.size <- 20 # Set sample size
num.times <- 10 # Set number of repetitions
# Create an empty matrix to hold the sample means
samp.mean <- matrix(NA, simulation.size, num.times)
for (i in 1:simulation.size)
{
  for (j in 1:num.times)
  {
    # Calculate the sample mean for i,j th observation
    samp.mean[i,j] <- mean(rnorm(mean=j, sd=1, n=samp.size))
  }
}
Variable types in R
• Numeric
– Real / Floating point
• default: double precision—15 significant digits
• single precision—7 significant digits
– Integer
x <- 6
is.double(x)
x <- as.integer(x)
is.double(x)
is.integer(x)
• Logical
x <- c(1,2,3,4,5); y <- (x<3);
• Character String
x <- c("North", "South", "East", "West")
• List : collection of several objects of any type
x1 <- c("North", "South", "East", "West")
x2 <- c(2,3,5,8)
x <- list(x1,x2)
• Complex arithmetic is also supported in R
z <- complex(real = rnorm(100), imag = rnorm(100))
Re(z)
Im(z)
Vectors and matrices in R
• Vectors
x <- c(45, 90, 135 )
x
y <- c("North", "South", "East", "West")
y
x*2
length(x)
sum(x)
• When the values are from a systematic sequence you can save coding
x <- rep(2.1, 30)
y <- rep("North", 5)
x <- 1:10
x <- seq(1,10,2)
Vectors and matrices in R
• Matrices
a <- 1:3
b <- 4:6
c <- 7:9
X <- cbind(a,b,c)
X
dim(X)
Y <- rbind(a,b,c)
Y
dim(Y)
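X + Y # Elementwise sum
X*Y # Elementwise product
X%*%Y # Matrix product
Z <- matrix(c(1,4,6,2,3,7.8), nrow=2, ncol=3, byrow=T) # Fill row by row
Z <- matrix(c(1,4,6,2,3,7.8), nrow=2, ncol=3, byrow=F) # Fill column by column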
Data frames
• Most functions such as lm, glm, survreg, coxph etc. will operate on data
frames.
• If the data is read in using a command such as read.csv or read.table, it
will automatically be saved as a data frame.
• If the data is read in from the keyboard, a data frame can be created as
follows.
length <- c(20, 24, 19, 24, 18, 30)
wt <- c(10, 14, 14, 12, 12.5, 17)
mydata <- data.frame(length, wt)
To see the variable names type names(mydata) . To access the length
variable use mydata$length. Alternatively you can attach the data set
using attach(mydata) in which case you can simply type length.
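As an alternative to attach(), which is not covered in the slides, the sketch below uses with() and the data argument of lm to reach the columns of the data frame; mean.length and fit are illustrative names.

```r
mydata <- data.frame(length = c(20, 24, 19, 24, 18, 30),
                     wt = c(10, 14, 14, 12, 12.5, 17))
names(mydata)
# with() evaluates an expression using the columns of mydata
mean.length <- with(mydata, mean(length))
mean.length
# Model functions accept the data frame through the data argument
fit <- lm(wt ~ length, data = mydata)
coef(fit)
```

This avoids leaving the data frame attached, so a column called length cannot be confused with the length function.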
Programming Loops
• for loop :
for (i in 1:10)
{
  ….. R code …..
}
• while loop :
while (logical condition)
{
  ….. R code …..
}
• if statement :
if (logical condition)
{
  ….. R code …..
}
• if else statement :
if (logical condition)
{
  ….. R code …..
}
else
{
  ….. R code …..
}
• The command break exits from a loop without completing it; stop halts
execution with an error message.
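The constructs above can be combined as in the following minimal sketch, which sums odd numbers until the running total exceeds 20. The next command, not mentioned in the slides, skips to the following iteration.

```r
total <- 0
for (i in 1:100) {
  if (i %% 2 == 0) next   # skip even numbers
  total <- total + i
  if (total > 20) break   # leave the loop early
}
total
# The same calculation with a while loop
total2 <- 0
i <- -1
while (total2 <= 20) {
  i <- i + 2              # next odd number
  total2 <- total2 + i
}
total2
```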
Some useful commands for programming
• Numerical solution of equations : uniroot, polyroot, optimize, nlm
• Alternatives to loops : apply, tapply, outer
• Matrix inversion and solution of linear equations : solve, solve.qr,
chol2inv, backsolve, qr.solve
• General matrix functions : eigen, svd, det
• Sorting : sort, order, rank
• Rounding : ceiling, floor, round, trunc, signif
• Saving : write.matrix, source, sink, postscript, pdf
• Numerical settings : .Machine
• Random number generation : .Random.seed, RNG, RNGkind, set.seed
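A few of these commands are illustrated in the short sketch below. The functions and the small matrices are made up for illustration only.

```r
# Root of x^2 - 2 on the interval [0, 2]
root <- uniroot(function(x) x^2 - 2, interval = c(0, 2))$root
# Minimum of (x - 1)^2 on the interval [-3, 3]
xmin <- optimize(function(x) (x - 1)^2, interval = c(-3, 3))$minimum
# Column means without an explicit loop
m <- matrix(1:6, nrow = 2)
col.means <- apply(m, 2, mean)
# Solve the linear system A b = rhs
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- solve(A, c(3, 5))
# Fixing the seed makes the random draws reproducible
set.seed(1)
u <- runif(3)
```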
User defined functions
• Suppose we have decided on a favourite plotting set up which uses
blue dashed lines on a yellow background, square plotting characters
and prints the variable names parallel to each axis.
• Instead of retyping the options for each plot we can create a function
which uses these settings and also returns the summaries for each
variable.
myplot <- function(x, y, bgd = "lightyellow")
{
  opar <- par(no.readonly = TRUE) # Save the settings that can be restored
  par(bg=bgd)
  plot(x, y, pch=22, col="blue", las=1)
  myplot.out <- summary(data.frame(cbind(x,y)))
  par(opar) # Restore the saved settings
  return(myplot.out)
}
• To use this function, first paste it into R and then use
x1 <- rnorm(20)
x2 <- rnorm(20)
out <- myplot(x1,x2)
You can also use myplot(x1, x2, bgd= "grey")
To see the summary type out
User defined functions
• Here is a slightly more complicated function to calculate the number of
runs of 1’s in a binary sequence
f <- function (x, v=1)
{
  x <- diff(x==v)
  x <- x[x!=0]
  if (x[1]==1) sum(x==1)
  else 1+sum(x==1)
}
Now generate some data
n <- 50
x <- sample(0:1, n, replace=TRUE, prob=c(.2,.8))
x
To see the number of runs in the sequence type f(x,1)
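The function f fails when the sequence is constant, because diff then leaves nothing to test. A possible alternative, sketched here with the base function rle and the illustrative name count.runs, handles that case too.

```r
count.runs <- function(x, v = 1) {
  r <- rle(x == v)   # lengths and values of consecutive runs
  sum(r$values)      # count the runs in which x equals v
}
count.runs(c(0, 1, 1, 0, 1, 0, 0, 1, 1, 1))   # returns 3
count.runs(rep(1, 5))                         # returns 1
count.runs(rep(0, 5))                         # returns 0
```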
Example 1
• Let us write a programme which will compare the power of the two
sample t-test with that of the Wilcoxon and Kolmogorov - Smirnov
tests when the underlying data are normal.
Example 1
• Let us first open an R script, name it example and save the script in the R
home directory.
• It is good practice to add in a descriptive header giving the purpose of the
programme and the date on which it was last modified.
#### R programme for simulating the power of the two sample t test vs various
#### non-parametric alternatives
#### 21/7/06
• We next need to specify the sample size and the number of simulations to
be run with
sim.size <- 200
sample.size <- 10
• We shall set the mean of the first population to zero and run the simulation
for a range of values of the difference in means with
mu1 <- 0
delta <- seq(-2,2, length=50)
• We also need to set the seed so as to be able to reproduce the random
number generation.
set.seed(231)
Example 1
• Our programme will then look like this:
for (j in 1:length(delta))
{
  # Set mean of second population
  for (i in 1:sim.size)
  {
    # Generate ith sample
    # Perform ith set of tests
    # Check if the test rejects the null hypothesis of equality
  }
  # Calculate the simulated power
}
Example 1
So our programme now looks like this:
sim.size <- 200; sample.size <- 10;
set.seed(231)
mu1 <- 0; delta <- seq(-2,2, length=50)
for (j in 1:length(delta))
{
mu2 <- mu1 + delta[j]
for (i in 1:sim.size)
{
}
# Calculate power for jth setting
} # End of j loop
Example 1
• Let us now define variables which will hold the simulated powers
sim.size <- 200; sample.size <- 10;
set.seed(231)
mu1 <- 0; delta <- seq(-2,2, length=50)
pow.ttest <- NULL
pow.wtest <- NULL
pow.kstest <- NULL
for (j in 1:length(delta))
{
  mu2 <- mu1 + delta[j]
  pt.test <- pw.test <- pks.test <- logical(sim.size) # Rejection indicators
  for (i in 1:sim.size)
  {
    # Calculate pt.test[i], pw.test[i], pks.test[i]
  }
  pow.ttest[j] <- sum(pt.test)/sim.size # Calculate powers for jth setting
  pow.wtest[j] <- sum(pw.test)/sim.size
  pow.kstest[j] <- sum(pks.test)/sim.size
} # End of j loop
Example 1
• The inner simulation loop looks like this
for (i in 1:sim.size)
{
  # Generate ith sample
  samp1 <- rnorm(mean=mu1,sample.size)
  samp2 <- rnorm(mean=mu2,sample.size)
  # Perform ith set of tests
  test1 <- t.test(samp1, samp2,alternative = c("two.sided"))
  pt.test[i] <- (test1$p.value < 0.05)
  test2 <- wilcox.test(samp1, samp2,alternative = c("two.sided"),
                       exact = TRUE)
  pw.test[i] <- (test2$p.value < 0.05)
  test3 <- ks.test(samp1, samp2,alternative = c("two.sided"),
                   exact = TRUE)
  pks.test[i] <- (test3$p.value < 0.05)
}
Example 1
• The complete programme has been saved as an R script called
twosamp.r in the R home directory. Open the file using the File menu
in R and run the simulation using source("twosamp.r")
• The code will automatically save plots of the simulated power in a pdf
file called twosamp.pdf also in the R home directory.
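The script twosamp.r itself is not reproduced in these slides. The sketch below is one way the pieces shown above could be assembled; it is an assumption, not the original file. To keep the running time short it uses a grid of 9 values for delta instead of 50 and draws the plot on the current device.

```r
sim.size <- 200; sample.size <- 10
set.seed(231)
mu1 <- 0; delta <- seq(-2, 2, length = 9)  # coarser grid than in the slides
pow.ttest <- pow.wtest <- pow.kstest <- numeric(length(delta))
for (j in 1:length(delta)) {
  mu2 <- mu1 + delta[j]
  pt.test <- pw.test <- pks.test <- logical(sim.size)
  for (i in 1:sim.size) {
    samp1 <- rnorm(sample.size, mean = mu1)
    samp2 <- rnorm(sample.size, mean = mu2)
    pt.test[i] <- t.test(samp1, samp2)$p.value < 0.05
    pw.test[i] <- wilcox.test(samp1, samp2, exact = TRUE)$p.value < 0.05
    pks.test[i] <- ks.test(samp1, samp2, exact = TRUE)$p.value < 0.05
  }
  # Proportion of rejections estimates the power at this delta
  pow.ttest[j] <- mean(pt.test)
  pow.wtest[j] <- mean(pw.test)
  pow.kstest[j] <- mean(pks.test)
}
plot(delta, pow.ttest, type = "l", ylim = c(0, 1),
     xlab = "Difference in means", ylab = "Simulated power")
lines(delta, pow.wtest, lty = 2, col = "blue")
lines(delta, pow.kstest, lty = 3, col = "red")
legend("bottomright", c("t", "Wilcoxon", "KS"), lty = 1:3,
       col = c("black", "blue", "red"))
```

At delta = 0 the simulated power estimates the level of each test, so the curves should dip to roughly 0.05 or below in the middle of the plot.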
Example 2
• Now run simulations which will look at the robustness of the two
sample t-test to the following assumptions:
– Homoscedasticity
• Simulate two normal distributions with different standard deviations and plot
the level as a function of the ratio of sd’s. Make several plots (on the same
graph) corresponding to several choices of the standard deviation of the first
population.
– Normality
• Simulate data from two logistic distributions and find an estimate of the level
of the test. Does the level vary with sample size?
– Independence
• Simulate two correlated normal distributions and plot the level as a function
of the correlation coefficient.
The package mvtnorm will be required to generate bivariate normal data.
Example 3
• Write a function which will calculate the number of runs in a binary
sequence of arbitrary length.
Hint : Use the diff function.
Additional Exercises
1. Generate Bernoulli data with n = 100 and p = .25, p = .05 and p =
.01. Is the data approximately normal in each case?
2. Sketch the distribution of the standardized average for data
generated from the uniform [0, 1] distribution. Compare the
histograms when n is 5, 10, 25 and 100.
3. Write a function which will compute a Monte Carlo estimate of the
ratio of the variances for the mean and the median for the
(a) N(0,1) distribution
(b) t distribution with 2 df.
Use the vioplot function in the vioplot library to create side by side
vioplots of the simulated distributions of the mean and the median
in the two cases.
Additional Exercises
4 (a). Search the stats library in R for a list of parametric and non-
parametric tests.
(b) Generate 200 standard normal variables and perform the one sample
t-test on the data.
(c) Repeat the steps in (b) 1000 times and draw a histogram of the
resulting p values.
(d) On the same graph plot the power of the one sample t-test as a
function of the true mean using
(i) simulation
(ii) the R function power.t.test
(iii ) an analytical expression for power
5. Plot the density of the t distribution for degrees of freedom = 1,2,5,100
and the standard normal density in different colours on the same graph.
Add a legend and title to the plot.
Additional Exercises
6. The sleep dataset in R shows the number of hours of extra sleep after
administration of a sleeping drug.
(a) Perform a two sample t-test on the data.
(b) Perform a two sample Wilcoxon test.
(c) Perform an analysis of variance assuming normality.
(d) Perform a non parametric analysis of variance using a Kruskal Wallis test.
(e) Compare the variances of the two groups using an F test.
7. Plot the density of the chi-squared distribution for 1-10 degrees of freedom in
different colours on the same graph. Add a legend and title to the plot.
Additional Exercises
8. Consider the variable eruptions giving eruption lengths of the Old Faithful
geyser recorded in the data set faithful.
(a) Draw a histogram of the data.
(b) Write a function f which takes as input the means (m1,m2) and sd’s
(s1,s2), the mixing proportion (p) and the data point (x) and returns the
value of the corresponding mixture normal likelihood at the point.
(c) Now write a function fn which uses the function f to calculate the
likelihood for the entire data set.
(d) Use the function optim to get maximum likelihood estimates.
(e) Superimpose the sample and theoretical density on the histogram
and add a legend to the plot.
Additional Exercises
9. Create a data frame called Manitoba.lakes that contains the lake’s elevation
(in meters above sea level) and area (in square kilometers) as listed below.
Assign the names of the lakes using the row.names( ) function.
                 elevation   area
Winnipeg               217  24387
Winnipegosis           254   5374
Manitoba               248   4624
SouthernIndian         254   2247
Cedar                  253   1353
Island                 227   1223
Gods                   178   1151
Cross                  207    755
Playgreen              217    657
(a) Plot log2(area) versus elevation. Add labeling information using the text
function with its labels argument.
(b) Use the R function dotchart( ) to display the areas of the Manitoba lakes
(i) on a linear scale,
and (ii) on a logarithmic scale.
Add, in each case suitable labeling information.
Additional Exercises
10. (a) Use the nlm function to numerically minimise the function
$f(x,y,z) = \sin(x) - \sin(y-4) + z^2 + 2$.
(b) If gradient information is not supplied, nlm will use a
matrix-secant method which numerically approximates the
gradient. To use gradient information, redefine the function so
that its return value additionally carries an attribute called "gradient".
Now perform minimisation using the quasi-Newton method.
(c) Use the integrate function in R to find the constant of
integration, c, for the posterior density function
$c \, \exp\left[-\tfrac{1}{2}\left\{(0.12-x)^2 + (0.07-x)^2 + (0.08-x)^2\right\}\right]$
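The mechanics of supplying a gradient to nlm and of finding a normalising constant with integrate are sketched below on deliberately different, simpler functions, so the exercise itself is left open. The objective fg and the kernel exp(-x^2/2) are illustrative choices.

```r
# Hypothetical objective with its minimum at (1, -2)
fg <- function(p) {
  value <- (p[1] - 1)^2 + (p[2] + 2)^2
  # nlm looks for the analytic gradient in this attribute
  attr(value, "gradient") <- c(2 * (p[1] - 1), 2 * (p[2] + 2))
  value
}
fit <- nlm(fg, p = c(0, 0))
fit$estimate
# Normalising constant of the kernel exp(-x^2 / 2)
area <- integrate(function(x) exp(-x^2 / 2), lower = -Inf, upper = Inf)$value
const <- 1 / area   # should agree with 1 / sqrt(2 * pi)
const
```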
Tutorial 4 : R libraries
What is a library?
• An R library or package is a collection of programmes with a common
objective. To see a list of packages available by default type search( )
at the R prompt.
• Some commonly used packages are base, graphics, stats, mgcv, nlme,
survival, Hmisc etc.
• R also has some very specialised packages. Examples include
– boot (bootstrap / jackknife)
– EbayesThresh (empirical Bayes thresholding),
– mAr (multivariate autoregressive analysis)
– neural (neural networks)
– nlqr (non linear quantile regression)
– portfolio (analysing equity portfolios) etc.
R contributed libraries
• A complete list of contributed packages is available on the R website under the link
Contributed extension packages. The list has also been saved to the file Available
Bundles and Packages.doc on the Desktop.
• The R News newsletter, also available from the website, provides a discussion of new
packages and updates to old packages.
• R also has summaries called CRAN Task Views for the following specialised subjects
– Cluster: Cluster Analysis & Finite Mixture Models
– Econometrics: Computational Econometrics
– Environmetrics: Analysis of ecological and environmental data
– Finance: Empirical Finance
– Genetics: Statistical Genetics
– MachineLearning: Machine Learning & Statistical Learning
– Multivariate: Multivariate Statistics
– SocialSciences: Statistics for the Social Sciences
– Spatial: Analysis of Spatial Data
– gR: gRaphical models in R
• The ctv package can be used to install the functions mentioned in the CRAN Task View.
Downloading libraries
The first option is to go to the CRAN website and choose one of the mirror
sites. A complete listing of libraries is available by following the link for contributed
extension packages. Clicking on the desired library will lead to a download page such
as
bivpois: Bivariate Poisson Models Using The EM Algorithm
Functions for fitting Bivariate Poisson Models using the EM algorithm.
Details can be found in Karlis and Ntzoufras (2003, RSS D & 2004, AUEB Technical Report)
Version: 0.50-2
Depends: R (>= 2.0.1)
Date: 2005-08-25
Author: Dimitris Karlis and Ioannis Ntzoufras
Maintainer: Ioannis Ntzoufras
License: GPL (version 2 or later)
URL: http://www.stat-athens.aueb.gr/~jbn/papers/paper14.htm
Package source: bivpois_0.50-2.tar.gz
Windows binary: bivpois_0.50-2.zip
Reference manual: bivpois.pdf
Downloading libraries
• Download the .zip file. To install the library you can unzip the
file and copy it to the library folder in the R home directory.
• Alternatively, you can open an R session and choose Install from local zip file
from the Package option on the menu.
• This will install the library as well as the corresponding help
documentation. For additional documentation you can visit the cited
URL.
• If the machine has an internet connection a simpler way to install a
package is to choose Set CRAN mirror from the Package menu and
then choose Install package.
• On installation, function files in a library will all be copied to the
library folder in the R home directory. Documentation will be copied
to the Doc subfolder within each library.
The library command
• Consider the following uses of the library command
– library( ) # list all available packages
– library(lib = .Library) # list all packages in the default library
– library(help = stats) # documentation on package 'stats'
– library(faraway) # load package 'faraway'
– require(faraway) # the same
– library(help=faraway) # documentation on package 'faraway'
– search( ) # lists attached packages
• Another useful command available for some packages is demo( ). Try
– demo(package = .packages(all.available = TRUE))
– demo(glm.vr, package="stats")
– demo(persp, package="graphics")
Some R libraries
• In this tutorial we shall consider the following libraries:
– TeachingDemos : Demonstrations for teaching
– Matrix : A Matrix package for R
– MCMCpack : Bayesian inference via Markov chain Monte Carlo
The TeachingDemos library
• As suggested by the name, this library contains functions useful for
interactively demonstrating basic statistical concepts.
• The library has already been loaded onto your machine. Attach the
library using library(TeachingDemos)
• Check if the package has any inbuilt demos using
demo(package="TeachingDemos").
• To see the package capabilities type library(help=TeachingDemos)
The TeachingDemos library
• Let us explore some of these functions. For example type ? faces to get
a description of the faces command.
• Next try running the sample code given in the description
• The first example is faces(rbind(1:3,5:3,3:5,5:7))
• The next is
data(longley)
faces(longley[1:9,])
• Compare the differences between faces and faces2 using
faces(matrix( runif(18*10), nrow=10), main='Random Faces')
and
faces2(matrix( runif(18*10), nrow=10), main='Random Faces')
• Type par(mfrow=c(1,1)) to restore the default plotting layout.
The TeachingDemos library
• Similarly try the examples for some other possibly useful functions
such as
– mle.demo
– power.examp
– put.points.demo
– rotate.cloud
– run.cor.examp
– run.hist.demo
– vis.binom
The Matrix library
• Matrix is a package of classes and methods for numerical linear algebra, with
special relevance for sparse or ill-conditioned matrices.
• We shall first use the library to compare the speed of least squares
fitting methods on an example for which the model matrix is large and
sparse.
• As an example, let's create a model matrix, m, and corresponding
response vector, yo, for a simple linear regression model using the
Formaldehyde data.
data(Formaldehyde)
str(Formaldehyde)
(m <- cbind(1, Formaldehyde$carb))
(yo <- Formaldehyde$optden)
solve(t(m) %*% m) %*% t(m) %*% yo
system.time(solve(t(m) %*% m) %*% t(m) %*% yo)
dput(c(solve(t(m) %*% m) %*% t(m) %*% yo))
dput(unname(lm.fit(m, yo)$coefficients))
The Matrix library
• For a large, ill-conditioned least squares problem this does not perform well. Let
us read in an example of such data using
library(Matrix)
data(KNex, package = "Matrix")
y <- KNex$y
mm <- as(KNex$mm, "matrix")
• Type dim(mm) to get the dimension of mm.
• Now check the system times
system.time(naive.sol <- solve(t(mm) %*% mm) %*% t(mm) %*% y)
• Because the calculation of a “cross-product” matrix is a common operation in
statistics, the crossprod function has been provided to do this efficiently. Check
the system time for the above operation using crossprod:
system.time(cpod.sol <- solve(crossprod(mm), crossprod(mm,y)))
The Matrix library
• The crossprod function applied to a single matrix takes
advantage of symmetry when calculating the product but does
not retain the information that the product is symmetric and
positive semidefinite.
• As a result least squares estimates are calculated using a
general linear system solver based on an LU decomposition
when it would be faster, and more stable numerically, to use a
Cholesky decomposition.
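The difference between the general solver and a Cholesky-based solution can be seen in base R alone. The sketch below uses a small simulated model matrix (not the KNex data) and compares both against the QR-based lm.fit; all object names are illustrative.

```r
set.seed(1)
n <- 200; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # simulated model matrix
y <- drop(X %*% (1:p) + rnorm(n))
xpx <- crossprod(X)       # X'X
xpy <- crossprod(X, y)    # X'y
# General solver, which does not use the symmetry of X'X
beta.lu <- drop(solve(xpx, xpy))
# Cholesky factor: X'X = R'R with R upper triangular
R <- chol(xpx)
# Solve R'z = X'y, then R beta = z
beta.chol <- drop(backsolve(R, forwardsolve(t(R), xpy)))
# QR-based reference solution
beta.qr <- unname(lm.fit(X, y)$coefficients)
cbind(beta.lu, beta.chol, beta.qr)
```

On a well-conditioned problem like this one the three answers agree to rounding error; the differences in speed and stability only matter for large or ill-conditioned matrices.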
The Matrix library
• The Matrix package uses the S4 class system (Chambers, 1998) to
retain information on the structure of matrices from the intermediate
calculations.
mm <- as(KNex$mm, "dgeMatrix")
system.time(Mat.sol <- solve(crossprod(mm), crossprod(mm,y)))
• Furthermore, any method that calculates a decomposition or
factorization stores the resulting factorization with the original object so
that it can be reused without recalculation.
xpx <- crossprod(mm)
xpy <- crossprod(mm, y)
system.time(solve(xpx, xpy))
The Matrix library
• The model matrix mm is sparse; that is, most of the elements of
mm are zero.
• The Matrix package incorporates special methods for sparse
matrices, which produce the fastest results of all.
The MCMCpack library
• This package contains functions to perform Bayesian inference using
posterior simulation for a number of statistical models. All models
return coda mcmc objects that can then be summarized using the coda
package. MCMCpack also contains some useful utility functions,
including some additional density functions and pseudo-random
number generators for statistical distributions, a general purpose
Metropolis sampling algorithm, and tools for visualization.
• You will also need to download the coda library to run MCMCpack.
• Let us type
library(MCMCpack)
library(coda)
library(help=MCMCpack)
to view package capabilities.
The MCMCpack library
• Let us look at the function MCbinomialbeta
• Type the following sample code
posterior <- MCbinomialbeta(3,12,mc=5000)
summary(posterior)
• To plot the prior and posterior on the same graph type
plot(posterior)
grid <- seq(0,1,0.01)
plot(grid, dbeta(grid, 1, 1), type="l", col="red", lwd=3, ylim=c(0,3.6),
xlab="pi", ylab="density")
lines(density(posterior), col="blue", lwd=3)
legend(.75, 3.6, c("prior", "posterior"), lwd=3, col=c("red", "blue"))
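For this model the posterior is available in closed form, which gives a check on the simulated draws. The sketch below assumes the call above means 3 successes in 12 trials with a Beta(1, 1) prior (the prior drawn in the plot), and it needs only base R.

```r
y <- 3; n <- 12         # successes and trials, as assumed above
alpha <- 1; beta <- 1   # assumed Beta(1, 1) prior
# Conjugate update: the posterior is Beta(alpha + y, beta + n - y)
post.a <- alpha + y
post.b <- beta + n - y
post.mean <- post.a / (post.a + post.b)
grid <- seq(0, 1, 0.01)
plot(grid, dbeta(grid, post.a, post.b), type = "l", col = "blue", lwd = 3,
     xlab = "pi", ylab = "density")
lines(grid, dbeta(grid, alpha, beta), col = "red", lwd = 3)
# Direct Monte Carlo draws from the exact posterior
set.seed(1)
draws <- rbeta(5000, post.a, post.b)
mean(draws)
```

The mean of the draws should be close to post.mean, and summary(posterior) from MCbinomialbeta should give a similar value.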
The MCMCpack library
• Similarly consider MCnormalnormal
• Type
y <- c(2.65, 1.80, 2.29, 2.11, 2.27, 2.61, 2.49, 0.96, 1.72, 2.40)
posterior <- MCnormalnormal(y, 1, 0, 1, 5000)
summary(posterior)
and to see a plot
plot(posterior)
grid <- seq(-3,3,0.01)
plot(grid, dnorm(grid, 0, 1), type="l", col="red", lwd=3, ylim=c(0,1.4),
xlab="mu", ylab="density")
lines(density(posterior), col="blue", lwd=3)
legend(-3, 1.4, c("prior", "posterior"), lwd=3, col=c("red", "blue"))
The MCMCpack library
• To see the effect of varying the prior variance consider the following
code
y <- c(2.65, 1.80, 2.29, 2.11, 2.27, 2.61, 2.49, 0.96, 1.72, 2.40)
prior.var <- c(0.01,0.1,0.5,0.75,1,1.5,2,10,100)
par(mfrow=c(3,3))
for (ipv in 1:length(prior.var))
{
  posterior <- MCnormalnormal(y, 1, 0, prior.var[ipv], 5000)
  grid <- seq(-4,4,0.01)
  # dnorm takes a standard deviation, so use the square root of the variance
  plot(grid, dnorm(grid, 0, sqrt(prior.var[ipv])), type="l", col="red", lwd=3,
       ylim=c(0,1.4),
       xlab="mu", ylab="density")
  points(grid, dnorm(grid,mean(y), sd(y)), type ="l")
  lines(density(posterior), col="blue", lwd=3)
  legend(-3, 1.4, c("prior", "posterior", "sample"),
         lty=1, col=c("red", "blue" , "black"), cex=0.5)
  title(paste("Prior variance = ", prior.var[ipv]), cex.main=0.6)
}
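The effect of the prior variance can also be read off the closed-form posterior for a normal mean with known variance. The sketch below assumes, as in the calls above, a known data variance of 1 and a prior mean of 0; sigma2, mu0, post.prec, post.mean and post.var are illustrative names.

```r
y <- c(2.65, 1.80, 2.29, 2.11, 2.27, 2.61, 2.49, 0.96, 1.72, 2.40)
sigma2 <- 1; mu0 <- 0   # assumed known data variance and prior mean
prior.var <- c(0.01, 0.1, 0.5, 0.75, 1, 1.5, 2, 10, 100)
n <- length(y)
# Posterior precision is the sum of prior and data precisions
post.prec <- 1 / prior.var + n / sigma2
# Posterior mean is the precision-weighted average of mu0 and the data
post.mean <- (mu0 / prior.var + sum(y) / sigma2) / post.prec
post.var <- 1 / post.prec
round(cbind(prior.var, post.mean, post.var), 3)
```

As the prior variance grows, the posterior mean moves from the prior mean towards the sample mean.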
The coda library
• The coda library is used for convergence diagnostics of the MCMC
chain. Type library(help=coda) to see a list of the diagnostic tools
implemented.
• For example try
par(mfrow=c(1,1))
geweke.plot(posterior)
raftery.diag(posterior)
traceplot(posterior)
autocorr.plot(posterior)
• A complete list of the Bayesian analysis tools implemented in R is given in the
CRAN Task View on the subject.
Exercise
Frank Harrell's Hmisc library contains many functions useful for data analysis,
high-level graphics, utility operations, functions for computing sample size and
power, importing datasets, imputing missing values, advanced table making,
variable clustering, character string manipulation, conversion of S objects to
LaTeX code, and recoding variables.
(a) Load the Hmisc library and list its capabilities.
(b) Use the function binconf to compute the various possible confidence
intervals for the binomial proportion. Make a plot to study the relationship
between these intervals for varying sample size.
(c) Compare the abilities of the functions fit.mult.impute, aregImpute and
impute for imputing missing data.
Suggestions for next steps

More Related Content

Similar to An introduction to R is a document useful

Fresher's guide to Preparing for a Big Data Interview
Fresher's guide to Preparing for a Big Data InterviewFresher's guide to Preparing for a Big Data Interview
Fresher's guide to Preparing for a Big Data InterviewRock Interview
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
Introduction to basic statistics
Introduction to basic statisticsIntroduction to basic statistics
Introduction to basic statisticsIBM
 
The History and Use of R
The History and Use of RThe History and Use of R
The History and Use of RAnalyticsWeek
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 
R and Rcmdr Statistical Software
R and Rcmdr Statistical SoftwareR and Rcmdr Statistical Software
R and Rcmdr Statistical Softwarearttan2001
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with RGreat Wide Open
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenRevolution Analytics
 
GNU R in Clinical Research and Evidence-Based Medicine
GNU R in Clinical Research and Evidence-Based MedicineGNU R in Clinical Research and Evidence-Based Medicine
GNU R in Clinical Research and Evidence-Based MedicineAdrian Olszewski
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big AnalyticsAjay Ohri
 
DATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxDATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxmyworld93
 
Open source analytics
Open source analyticsOpen source analytics
Open source analyticsAjay Ohri
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopDataWorks Summit
 


An introduction to R is a document useful

  • 1. 1. What is R? 2. Where is R available? 3. Will R run on all machines? 4. Why should I switch to R? 5. What can R do? 6. How easy is it to learn R? 7. How large is the R community? 8. Does R have a support system? 9. Is there a beginner's guide to R? http://www.r-project.org
  • 2. What is R? • R is a language and environment for statistical computing and graphics. • Developed by Robert Gentleman and Ross Ihaka. • It can be freely downloaded from http://www.r-project.org • R can be considered an implementation of S. There are some important differences, but much code written for S runs unaltered under R. • R is a GNU project.
  • 3. Why I Must Write GNU I consider that the golden rule requires that if I like a program I must share it with other people who like it. Software sellers want to divide the users and conquer them, making each user agree not to share with others. I refuse to break solidarity with other users in this way. I cannot in good conscience sign a nondisclosure agreement or a software license agreement. For years I worked within the Artificial Intelligence Lab to resist such tendencies and other inhospitalities, but eventually they had gone too far: I could not remain in an institution where such things are done for me against my will. So that I can continue to use computers without dishonor, I have decided to put together a sufficient body of free software so that I will be able to get along without any software that is not free. I have resigned from the AI lab to deny MIT any legal excuse to prevent me from giving GNU away. Richard Stallman, Founder, Free Software Foundation.
  • 4. What can R do? R is called a “programming environment”. R includes • an effective data handling and storage facility, • a suite of operators for calculations on arrays, in particular matrices, • a large, coherent, integrated collection of intermediate tools for data analysis, • graphical facilities for data analysis and display either on-screen or on hardcopy, and • a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
  • 5. Who currently uses R? SAS is the most common statistics package in general but R or S is most popular with researchers in Statistics. A look at common Statistical journals confirms this popularity. R is also popular for quantitative applications in Finance.
  • 6. Why should I switch to R? 1. It is free. (in what sense?) 2. It has an extensive support system in the form of many online manuals, tutorials, discussion and help forums. 3. In some ways it is better than its closest competitor, S+ (plots, memory management). 4. In addition to the base packages there is an extensive list of extension and user-contributed packages for specialized tasks.
  • 7. Getting help in R • Within R: – The ? command can be used to get help on a specific command within R • ?graphics typed at the R prompt will provide a description of R graphics. • demo(graphics) will demonstrate some examples. • Another useful command is example. Type ?example at the R prompt for a description. • Documentation – Manuals, FAQs, reference cards, tutorials and news about recent developments are available at http://www.r-project.org/other-docs.html. – CRAN Task Views for specialized applications. • Online help – The mailing lists described in the R posting guide are R-help, R-devel and Bioconductor – The archive for R-help is https://www.stat.math.ethz.ch/pipermail/r-help/
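The help commands on this slide can be tried as in the sketch below. It uses only functions that ship with base R; the search pattern given to apropos is an illustrative choice, not something from the slides.

```r
# At an interactive prompt, these open help pages:
#   ?mean            help for one function
#   ??"regression"   keyword search across installed packages

# Run the examples listed at the bottom of the help page for mean()
example(mean)

# List functions on the search path whose names end in ".test"
tests <- apropos("\\.test$")
print(head(tests))
```

apropos is handy when you remember part of a function name but not the whole of it.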
  • 8. CRAN Task View: Computational Econometrics Maintainer: Achim Zeileis http://www.maths.bris.ac.uk/R/src/contrib/Views/Econometrics.html • CRAN Task View: Statistical Genetics Maintainer: Giovanni Montana http://www.maths.bris.ac.uk/R/src/contrib/Views/Genetics.html • CRAN Task View: Bayesian Inference Maintainer: Jong Hee Park http://cran.r-project.org/src/contrib/Views/Bayesian.html • Type CRAN Task View into a search engine such as Google for more
  • 9. Posting Guide: How to ask good questions that prompt useful answers • R-help is intended to be comprehensible to people who want to use R to solve problems but who are not necessarily interested in or knowledgeable about programming. • R-devel is intended for questions and discussion about code development in R. Questions likely to prompt discussion unintelligible to non-programmers should go to R-devel. • Bioconductor is for announcements about the development of the Bioconductor package, availability of new code, questions and answers about problems and solutions using Bioconductor, etc.
  • 10. Subject: Re: [R] Is robust regression available in R. From: Thomas Lumley (thomas@biostat.washington.edu) Date: Wed 06 Dec 2000 - 04:12:58 EST • On Wed, 6 Dec 2000, Hisaji ONO wrote: > Hello, the R people. > I look for robust regression in R. This method is available in S, its name is rreg. • There's better robust regression in the VR bundle of packages. library(MASS) help(rlm) library(lqs) help(lqs) -thomas Thomas Lumley, Assistant Professor, Biostatistics, University of Washington, Seattle
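As a rough sketch of the advice in this post: rlm() from MASS fits a robust linear model. In current versions of R the lqs() function is part of MASS, so a separate library(lqs) call should no longer be needed. The stackloss data set (built into R) is used here purely for illustration and is not part of the original post.

```r
library(MASS)  # recommended package distributed with R

# Ordinary least squares fit, for comparison
ols_fit <- lm(stack.loss ~ ., data = stackloss)

# Robust regression by M-estimation (Huber weights by default)
rob_fit <- rlm(stack.loss ~ ., data = stackloss)

# Compare the two sets of coefficient estimates side by side
print(cbind(OLS = coef(ols_fit), robust = coef(rob_fit)))

# Resistant regression (least trimmed squares by default), also in MASS
lts_fit <- lqs(stack.loss ~ ., data = stackloss)
print(coef(lts_fit))
```

Large differences between the OLS and robust coefficients are a hint that a few observations are pulling the least squares fit.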
  • 11. Available Bundles and Packages • aaMI: Mutual information for protein sequence alignments • abind: Combine multi-dimensional arrays • accuracy: Tools for testing and improving accuracy of statistical results. • acepack: ace() and avas() for selecting regression transformations • actuar: Actuarial functions • adapt: adapt -- multidimensional numerical integration • ade4: Analysis of Environmental Data : Exploratory and Euclidean method • adehabitat: Analysis of habitat selection by animals • adlift: An adaptive lifting scheme algorithm • agce: analysis of growth curve experiments • akima: Interpolation of irregularly spaced data • AlgDesign: AlgDesign • alr3: Methods and data to accompany Applied Linear Regression 3rd editi • amap: Another Multidimensional Analysis Package • AMORE: A MORE flexible neural network package • AnalyzeFMRI: Functions for analysis of fMRI datasets stored in the ANALYZE for • aod: Analysis of Overdispersed Data • ape: Analyses of Phylogenetics and Evolution • apTreeshape: Analyses of Phylogenetic Treeshape • ArDec: Time series autoregressive decomposition • arules: Mining Association Rules and Frequent Itemsets
  • 12. Why should I switch to R? • Parametric Inference: t.test, power.t.test, chisq.test, bartlett.test, logLik, extractAIC. • Design of experiments: aov, TukeyHSD, plot.design, conf.design, AlgDesign. • Sample surveys: survey, pps. • Linear models and regression: lm, lme, aov, gls, dfbeta, dwtest, shapiro.test. • Multivariate analysis: mvtnorm, mnormt, cluster, manova, mvnormtest, prcomp, cancor. • Statistical genomics: Bioconductor, genetics, Geneland, hapsim, PHYLOGR, qtl, bqtl. • Bayesian analysis: bayesm, bayesSurv, MCMCpack, BMA, lmm, coda. • Resampling: ecdf, jackknife, empinf, one.boot, boot.ci, jack.after.boot. • Survival analysis: survreg, survdiff, coxph. • Stochastic processes and time series: tseries, arima, garch, pacf, ts.plot, adf.test. • Advanced data analysis: glm, lme, nlme, longitudinal, mitools. • Statistical quality control: linprog, quadprog, lp.transport, qcc. • Non-parametric inference: wilcox.test, friedman.test, rlm, quantreg. In addition to graphical and programming tools.
  • 13. And beyond… • Exploratory Data Analysis: eda • Financial data analysis: fPortfolio, financial, fMultivar • Spatial analysis: spatstat, geoR, SemiPar • Smoothing: SemiPar, splines, mgcv • Econometric analysis: gear, Ecdat • R also has many built-in datasets, a list of which may be viewed by typing data() at the prompt.
  • 14. Are there any downsides? • R is not menu driven. Commands to be executed must be typed in at the prompt. • This is not a complete disadvantage because it prevents the cookbook approach to statistics. • However, it means that the user will need to invest some initial time to become familiar with R syntax. • Sample code is available for each command and is helpful for familiarizing those new to programming, e.g. try typing ?lm at the prompt and scroll to the bottom.
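To give a feel for how functions from these two slides are called, here is a small sketch using only base R and its built-in data sets. The choice of the sleep data and of these two particular tests is an assumption made for illustration.

```r
# List the data sets shipped in the 'datasets' package
ds <- data(package = "datasets")$results
print(head(ds[, c("Item", "Title")]))

# Parametric inference: Welch two-sample t-test on the built-in 'sleep' data
tt <- t.test(extra ~ group, data = sleep)
print(tt)

# Non-parametric counterpart: Wilcoxon rank-sum test on the same data
# (exact = FALSE avoids a warning about ties)
wt <- wilcox.test(extra ~ group, data = sleep, exact = FALSE)
print(wt$p.value)
```

Both tests return an object of class "htest", so components such as the p-value can be extracted with $ in the same way.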
  • 15. Tutorial structure 1. Tutorial 1: R Graphics (and basics) 2. Tutorial 2 : Regression analysis 3. Tutorial 3 : Programming in R 4. Tutorial 4 : R libraries
  • 16. Tutorial 1 : R Graphics
  • 17. Getting started • To open an R session click on the ‘R’ shortcut on the desktop. This will open a commands window with the R prompt ‘>’. All commands have to be typed in at the prompt. • R code in the presentation is indicated in blue. You can cut and paste this into the commands window. • Use the arrow keys to recall previous commands. • You can scroll up the command window to view earlier commands and output.
  • 18. Downloading R • The base package in R can be downloaded from CRAN (the Comprehensive R Archive Network), which is linked from www.r-project.org. • You will need to click on one of the mirror sites. • Additional packages can either be downloaded from this site or from within R.
  • 19. Reading in data Assignment • The most straightforward way to store a list of numbers is through an assignment using the c command. • As an example, we can create a new variable called newvar which will contain the numbers 3, 5, 7, and 9: newvar <- c(3,5,7,9) • When you enter this command you should not see any output except a new command line. • To see what numbers are included in newvar, type newvar at the prompt and press the Enter key. • If you wish to work with one of the numbers you can get access to it using the variable and then square brackets indicating which number, e.g. try typing newvar[2] newvar[1:2] newvar[-2]
  • 20. Reading in data Reading a CSV file • We shall read a very short data file called simple.csv which has six rows of data on three variables labeled "trial," "mass," and "velocity." • The command to read the data file is read.csv. • The following command will read in the data and assign it to a variable called newdata newdata <- read.csv(file="simple.csv",header=TRUE,sep=",") • To view the data type newdata at the prompt. • Try typing summary(newdata) • Try typing summary(newdata[(newdata$trial=="A"),]) • Try typing table(newdata$trial) • You can now access each individual column using a "$" to separate the two names, e.g. try typing newdata$mass • If you are not sure what columns are contained in the variable, type names(newdata) at the prompt.
  • 21. Reading in data • There are many ways to read data using R. We have only given two examples: direct assignment and reading csv files. • Other commands include read.csv2, read.delim, read.fwf and scan. • To get help on these commands you can type ?read.fwf at the prompt etc. • It is also possible to import data in other formats such as SAS, SPSS etc. into R.
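The indexing commands on this slide behave as sketched below. The last line, logical indexing, is an addition that is not on the slide.

```r
newvar <- c(3, 5, 7, 9)

newvar[2]     # second element: 5
newvar[1:2]   # first two elements: 3 5
newvar[-2]    # everything except the second element: 3 7 9

# Logical indexing (not on the slide) selects elements by condition
newvar[newvar > 4]   # 5 7 9
```

Note that R indexes from 1, and that a negative index drops an element rather than counting from the end.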
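Since simple.csv itself is not included with the slides, the sketch below first writes a small stand-in file to a temporary directory and then reads it back. The data values are made up for illustration; only the column names and the six-row shape follow the slide.

```r
# Write a small stand-in for simple.csv (values are made up for illustration)
csv_path <- file.path(tempdir(), "simple.csv")
writeLines(c(
  "trial,mass,velocity",
  "A,10,12",
  "A,11,14",
  "B,5,8",
  "B,6,10",
  "A,10.5,13",
  "B,7,11"
), csv_path)

# Read it back; header = TRUE takes column names from the first line
newdata <- read.csv(file = csv_path, header = TRUE, sep = ",")

names(newdata)                              # column names
summary(newdata$mass)                       # summary of one column
table(newdata$trial)                        # number of rows per trial
summary(newdata[newdata$trial == "A", ])    # summary for trial A only
```

The result is a data frame, so rows are selected before the comma and columns after it.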
  • 22. Plotting data • To see some of the possibilities that R offers, enter demo(graphics) • Press the Enter key to move to each new graph. • Note that the code required to produce each plot is being simultaneously displayed in the command window.
  • 23. The plot function • In order to illustrate R's graphical functionalities, let us consider a simple example of a bivariate graph of 10 pairs of random variables. These values were generated with: x <- rnorm(10) x <- sort(x) y <- rnorm(10) • To get a scatter-plot of x against y, type plot(x, y) and the graph will be plotted on the active graphical device.
  • 24. The plot function • This plot uses default axis labels, limits, symbols etc. • We can customize plots by passing options to the plot commands. Try the following variations: plot(x,y, type="l") plot(x,y, type= "l", lty=2) plot(x,y, type= "l", lwd=3) plot(x,y, type= "l", col="red") • While this needs some getting accustomed to, it does make the point that plots are subjective and it is necessary for the user to make intelligent choices of plotting parameters. For example try the following commands par(mfrow=c(1,2)) plot(x,x^2,type="l",ylim=c(0,1)) plot(x,x^2,type= "l", ylim=c(0,10))
  • 25. • For a fully customized plot try the following par(mfrow=c(1,1)) plot(x, y, xlab="Ten random values", ylab="Ten other values", xlim=c(-2, 2),ylim=c(-2, 2), pch=22, col="red",bg="yellow", bty="l", tcl=0.4,main="How to customize a plot with R", las=1, cex=1.5) • What does each of the options do? • This type of control over parameters is typical of R and is also found in analysis functions and programming. It is one of the features which makes R truly scientific and superior to many other software packages.
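One way to answer the question of what each option does is to annotate the call. The sketch below repeats the plot from the slide with a comment on every argument; set.seed is an addition so that the picture is reproducible.

```r
set.seed(1)            # fixed seed so the picture is reproducible (an addition)
x <- sort(rnorm(10))
y <- rnorm(10)

plot(x, y,
     xlab = "Ten random values",   # x-axis label
     ylab = "Ten other values",    # y-axis label
     xlim = c(-2, 2),              # x-axis limits
     ylim = c(-2, 2),              # y-axis limits
     pch  = 22,                    # plotting symbol: filled square
     col  = "red",                 # symbol border colour
     bg   = "yellow",              # symbol fill colour (for pch 21 to 25)
     bty  = "l",                   # box type: left and bottom axes only
     tcl  = 0.4,                   # tick length; positive values point inwards
     main = "How to customize a plot with R",  # title
     las  = 1,                     # axis labels always horizontal
     cex  = 1.5)                   # symbol size, relative to the default
```

Typing ?par at the prompt lists these and the other graphical parameters.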
  • 26. More customized plots • Some plotting options can be passed on as arguments to the plot function while others will need modification of the default graphical settings specified in par. • You can view the current settings in par by typing par( ) at the prompt. • Let us consider the following modification to par. • Type the following: opar <- par(no.readonly = TRUE) par(mfrow=c(1,1)) par(bg="lightyellow", col.axis="blue", mar=c(4, 4, 2.5, 0.25)) plot(x, y, xlab="Ten random values", ylab="Ten other values", xlim=c(-2, 2), ylim=c(-2, 2), pch=22, col="red", bg="yellow", bty="l", tcl=-.25, las=1, cex=1.5) title("How to customize a plot with R ", font.main=3, adj=1)
  • 27. • However, once users have spent some time setting their favourite plotting options, it is easy to replicate these for another dataset. • We shall now look at a sample data set in R. Consider the data set florida, which has the votes for the various candidates by county in the state of Florida in the 2000 US presidential election. You can attach this dataset by typing attach("usingR.RData") attach(florida) Type florida at the prompt to view the data. Next try plotting the votes for Bush and Buchanan.
  • 28. Interactive plotting : The identify and locator functions identify is a useful function which can be used to label selected points on a plot. Type plot(BUSH, BUCHANAN, xlab="Bush", ylab="Buchanan") identify(BUSH, BUCHANAN, County) Then click near a point to identify the county. Another interactive function is the locator function. Type plot(1:nrow(florida), BUSH, col="red",pch=2,xlab="County no", ylab="Votes") points(1:nrow(florida), BUCHANAN, col="green", pch=4) leg <- c("BUSH", "BUCHANAN") Then type legend(locator(1), leg, col=c("red", "green"), pch=c(2,4)) and click on the plot where the legend should be placed.
  • 29. We next discuss histograms, density plots, boxplots and normal probability plots. Type the following: attach("usingR.RData") attach(possum) Type possum at the prompt to view the data. Next type hist(totlngth) for the default histogram. Now suppose we want to specify the bins ourselves. Type par(mfrow = c(1, 2)) hist(totlngth, freq=FALSE, breaks = 72.5 + (0:5) * 5, xlab="Total length", main ="A: Breaks at 72.5, 77.5, ...") hist(totlngth, freq=FALSE, breaks = 75 + (0:5) * 5, xlab="Total length", main="B: Breaks at 75, 80, ...")
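A common pattern when changing par settings is to save them first and restore them afterwards. The sketch below assumes that par(no.readonly = TRUE) is used for the saved copy, so that restoring does not trigger warnings about read-only parameters; the plotted values are arbitrary.

```r
# Save only the settings that can be changed later
opar <- par(no.readonly = TRUE)

# Change some global settings, then draw
par(bg = "lightyellow", col.axis = "blue", mar = c(4, 4, 2.5, 0.25))
plot(1:10, (1:10)^2, type = "b", main = "Plot with modified par settings")

# Restore the saved settings for later plots
par(opar)
```

Without the final par(opar), the yellow background and narrow margins would carry over to every later plot on the same device.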
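locator(1) waits for a mouse click, so it only works in an interactive session. As a non-interactive alternative, legend also accepts a keyword position. The sketch below uses made-up counts in place of the florida data, which is not assumed to be available.

```r
set.seed(2)
# Made-up counts standing in for two columns of vote totals
n <- 20
a <- rpois(n, lambda = 50)
b <- rpois(n, lambda = 20)

plot(1:n, a, col = "red", pch = 2, ylim = range(c(a, b)),
     xlab = "Observation no", ylab = "Count")
points(1:n, b, col = "green", pch = 4)

# A keyword position replaces the mouse click that locator(1) waits for
lg <- legend("topright", legend = c("Series A", "Series B"),
             col = c("red", "green"), pch = c(2, 4))
```

Other accepted keywords include "topleft", "bottomright" and "center".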
  • 30. To get a corresponding density estimate (using kernel smoothing) type d <- density(totlngth) lines(d) will superimpose the density estimate as a curve. A better (scaled) superimposition can be obtained by points(d$x, d$y/1.08,type="l", col="blue")
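The histogram and density steps from the last two slides can be combined as in the sketch below. It uses simulated values in place of totlngth, since the possum data are not assumed to be available; the mean and standard deviation are arbitrary choices.

```r
set.seed(3)
# Simulated lengths standing in for totlngth (the possum data are not assumed here)
len <- rnorm(100, mean = 87, sd = 4)

# Bin edges every 5 units, wide enough to cover all the data
breaks <- seq(floor(min(len) / 5) * 5, ceiling(max(len) / 5) * 5, by = 5)

# freq = FALSE puts the histogram on the density scale (bar areas sum to 1)
h <- hist(len, freq = FALSE, breaks = breaks,
          xlab = "Total length", main = "Histogram with density estimate")

# Kernel density estimate drawn as a line on top of the histogram
d <- density(len)
lines(d, col = "blue")
```

Because the histogram is drawn with freq = FALSE, it and the density estimate share the same vertical scale.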
  • 31. • qqnorm(totlngth) gives a normal probability plot of the variable totlngth. The points of this plot will lie approximately on a straight line if the distribution is normal. • In order to calibrate the eye to recognise plots that indicate nonnormal variation, it is helpful to do several normal probability plots for random samples of the relevant size from a normal distribution. • Type the following attach(possum) par(mfrow=c(3,4)) # A 3 by 4 layout of plots y <- totlngth qqnorm(y,xlab= " ", ylab="Length", main="Possums" ,col="blue") for(i in 1:11) qqnorm(rnorm(43),col="red",xlab="", ylab="Simulated lengths", main="Simulated")
  • 32. Now let's explore some very sophisticated plots. Suppose we have normally distributed data and we want to see how the empirical density estimate compares with the fitted normal density as we vary the sample size. Type the following commands library(lattice) n <- seq(5, 45, 5) x <- rnorm(sum(n)) y <- factor(rep(n, n), labels=paste("n =", n)) densityplot(~ x | y, panel = function(x, ...) { panel.densityplot(x, col="DarkOliveGreen", ...) panel.mathdensity(dmath=dnorm, args=list(mean=mean(x), sd=sd(x)), col="darkblue") })
  • 33. • The iris dataset gives measurements on four variables for several species of the flower. A pairwise scatter of these variables with separate markers for each species may be obtained by data(iris) splom( ~iris[1:4], groups = Species, data = iris, xlab = "", panel = panel.superpose, auto.key = list(columns = 3) )
  • 34. Colours in R • R is particularly good at handling colours. To see the palette in R, type the following: demo.pal <- function(n, border = if (n<32) "light gray" else NA, main = paste("color palettes; n=",n), ch.col = c("rainbow(n, start=.7, end=.1)", "heat.colors(n)", "terrain.colors(n)", "topo.colors(n)", "cm.colors(n)")) { nt <- length(ch.col) i <- 1:n; j <- n / nt; d <- j/6; dy <- 2*d plot(i,i+d, type="n", yaxt="n", ylab="", main=main) for (k in 1:nt) { rect(i-.5, (k-1)*j+ dy, i+.4, k*j, col = eval(parse(text=ch.col[k])), border = border) text(2*j, k * j +dy/4, ch.col[k]) } } n <- if(.Device == "postscript") 64 else 16 # Since for screen, larger n may give color allocation problem demo.pal(n)
  • 35. Colours in R To use these colour palettes, try the following plot of the contours of a bivariate normal density. x <- y <- seq(-3,3,length=100) norm.density <- matrix(0,100,100) for (i in 1:100) for (j in 1:100) norm.density[i,j] <- dnorm(x[i])*dnorm(y[j]) par(mfrow=c(1,1)) image(x, y, norm.density, col = heat.colors(1000), axes = FALSE) contour(x, y, norm.density, add = TRUE, col = "peru") box() title(main = "The bivariate normal density", font.main = 4) You can change the appearance of the plot by selecting the amount of colour gradation. For example try par(mfrow=c(1,2)) image(x, y, norm.density, col = heat.colors(1000), axes = FALSE) contour(x, y, norm.density, add = TRUE, col = "peru") box() image(x, y, norm.density, col = heat.colors(3), axes = FALSE) contour(x, y, norm.density, add = TRUE, col = "peru") box()
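The double loop that fills norm.density can also be written with outer, which is a common vectorized idiom in R. The sketch below is an alternative formulation rather than the slide's own code, and it checks that both give the same matrix.

```r
x <- y <- seq(-3, 3, length.out = 100)

# outer() applies the product to every (x[i], y[j]) pair at once
norm.density <- outer(dnorm(x), dnorm(y))

# Same result as the explicit double loop
loop.density <- matrix(0, 100, 100)
for (i in 1:100) for (j in 1:100) loop.density[i, j] <- dnorm(x[i]) * dnorm(y[j])
stopifnot(isTRUE(all.equal(norm.density, loop.density)))

image(x, y, norm.density, col = heat.colors(100), axes = FALSE)
contour(x, y, norm.density, add = TRUE, col = "peru")
box()
title(main = "The bivariate normal density", font.main = 4)
```

The product form works here because the two components are independent standard normals.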
  • 36. The R Graph Gallery • Visit the R Graph Gallery at http://addictedtor.free.fr/graphiques/allgraph.php • Click on a graph for a copy of the code used to generate it.
  • 37. 3D graphics and movies in R • Visit the R movies gallery at http://addictedtor.free.fr/movies/ for R movies. • http://rgl.neoscientists.org/Gallery.html demonstrates the rgl package. • Try the following code library(rgl) example(rgl.surface) for(i in 1:360) { rgl.viewpoint(i, i*(60/360), interactive=F) } Click on the RGL device at the bottom of the R screen to see the results.
  • 38. Exercises 1. (a) First attach the possum dataset using attach("usingR.RData") and then attach(possum). Consider the variable head lengths given by hdlngth. Plot the following on the same page a) a histogram b) a stem and leaf plot c) a normal probability plot and d) a density plot (b) The measurements in the possum dataset have been collected at various sites given by the variable site. Draw box plots of hdlngth by site.
  • 39. Solutions 1. (a) First attach the possum dataset using attach("usingR.RData") and then attach(possum). Consider the variable head lengths given by hdlngth. Plot the following on the same page a) a histogram b) a stem and leaf plot c) a normal probability plot and d) a density plot First attach the data using attach("usingR.RData") attach(possum) To get all plots on the same page, set the layout parameter mfrow par(mfrow=c(2,2)) The histogram, normal probability and density plots can be obtained as hist(hdlngth, xlab="headlength of possums", main="Histogram") qqnorm(hdlngth, xlab="headlength of possums", main="Normal probability plot") plot(density(hdlngth), xlab= "headlength of possums", main="Density plot")
  • 40. Solutions The stem and leaf is a little trickier. First, to find the corresponding command in R, we type help.search("stem") at the R prompt. Two likely candidates seem to be the command stem in the base package and stem.leaf in the aplpack package. The stem command does not work because it only prints the stem and leaf display in the command window. Note names(stem(hdlngth)) does not return anything. The stem.leaf command also displays the results in the commands window but it does store the information in an object. To see this type library(aplpack) sc <- stem.leaf(hdlngth) sc We can create an empty plot and use the text command to place this information on the plot as follows: plot(1:65,1:65, type="n", xlab=" ", ylab= " ", axes=F) for(i in 1:16) text(3,65-4*i,sc$stem[i], adj=c(0,0), cex=0.7)
  • 41. (b) The measurements in the possum dataset have been collected at various sites given by the variable site. Draw box plots of hdlngth by site. First reset the layout parameter with par(mfrow=c(1,1)) The basic code is boxplot(hdlngth~site) You can also try the following variants: boxplot(hdlngth ~ site, notch = TRUE, col = "blue") boxplot(hdlngth ~ site, names=c("A","B","C","D","E","F","G")) boxplot(hdlngth ~ site, names=c("Site A", "Site B", "Site C", "Site D", "Site E", "Site F", "Site G"), las=2) boxplot(hdlngth~site, subset=hdlngth<90) boxplot(hdlngth ~ site, boxwex = 0.25, at = 2:8, main = "Headlength of possums", xlab = "Site", ylab= "Head length", ylim = c(50, 110), yaxs = "i")
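An alternative that needs only base R is to capture the printed output of stem with capture.output and draw it with text. This is a sketch of a different technique from the aplpack approach on the slide, and it uses simulated values in place of hdlngth.

```r
set.seed(4)
# Simulated head lengths standing in for hdlngth
hd <- rnorm(60, mean = 92, sd = 3.5)

# stem() prints to the console; capture.output() collects those lines as text
stem_lines <- capture.output(stem(hd))

# Empty plot with a 0..1 coordinate system, then one text() call for all lines
plot(0:1, 0:1, type = "n", xlab = "", ylab = "", axes = FALSE,
     main = "Stem and leaf plot")
ypos <- seq(0.95, 0.05, length.out = length(stem_lines))
text(0, ypos, stem_lines, adj = c(0, 0.5), family = "mono", cex = 0.7)
```

A monospaced font keeps the leaves aligned, which is what makes the display readable as a plot.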
  • 42. Tutorial 2 : Regression analysis
  • 43. The lm command • The lm command is used to fit linear regressions in R. • To fit a regression of y on x1, x2, the basic command is lm(y~x1+x2). Let us generate some data for the purpose of illustration: y <- rnorm(10) x1 <- rnorm(10) x2 <- rnorm(10) lm(y~x1+x2) • This only displays the least squares estimates on the screen. What if we want to test for significance? To do this we have to define a variable, say lmfit, which will save the output: lmfit <- lm(y~x1+x2) • lmfit is an R list. To see the results now type summary(lmfit). To see what else is contained in lmfit type names(lmfit). To see a particular component of lmfit, e.g. the residuals, type lmfit$residuals.
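The steps on this slide can be sketched as one short script. The seed value is an assumption added here so that the random draws are reproducible; the extractor functions coef() and residuals() are standard alternatives to the $ notation.

```r
set.seed(1)                      # assumed seed, only for reproducibility
y  <- rnorm(10)
x1 <- rnorm(10)
x2 <- rnorm(10)

lmfit <- lm(y ~ x1 + x2)         # store the fit instead of only printing it
names(lmfit)                     # components stored in the fitted object
coef(lmfit)                      # same values as lmfit$coefficients
head(residuals(lmfit))           # same values as lmfit$residuals
summary(lmfit)$coefficients      # estimates, standard errors, t and p values
```

Using the extractor functions rather than lmfit$... tends to be more robust, because they also work for other model classes such as glm.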
  • 44. The gala dataset • Now let us fit a linear model on some real data. We shall use the gala dataset in the faraway library. • This dataset concerns the number of plant species on the various Galapagos Islands. There are 30 cases (islands) and 7 variables in the dataset. • The variables are – Species The number of plant species found on the island – Endemics The number of endemic species – Area The area of the island (km2) – Elevation The highest elevation of the island (m) – Nearest The distance from the nearest island (km) – Scruz The distance from Santa Cruz island (km) – Adjacent The area of the adjacent island (km2) • http://www.rit.edu/~rhrsbi/GalapagosPages/Darwin.html
  • 45. The gala dataset • We start by reading the data into R : library(faraway) attach(gala) Use summary(gala), names(gala) etc to get a sense of the data. You can also use pairs(gala) to get a pairwise scatterplot of the variables. • Let us fit the regression gfit <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,data=gala) • To see the results type summary(gfit) • In particular, the fitted (or predicted) values and residuals are gfit$fit and gfit$res
  • 46. The anova command • A convenient way to compare two nested models is to use the anova command • Suppose we fit the two models g1 <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,data=gala) g2 <- lm(Species ~ Area + Elevation + Nearest + Scruz ,data=gala) Then anova(g2,g1) will give us the conventional F test comparing these two models.
  • 47. Testing nested models • Suppose we want to test whether the coefficients for the variables Area and Adjacent are equal. We can type g1 <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,data=gala) g2 <- lm(Species ~ I(Area + Adjacent) + Elevation + Nearest + Scruz,data=gala) • Then anova(g2,g1) will perform the appropriate F test. • Suppose we want to test whether the coefficient of Area can be set to a particular value say –0.1. We can then fit g2 <- lm(Species ~ offset(-0.1*Area) + Elevation + Nearest + Scruz + Adjacent,data=gala)
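A minimal sketch of the nested model F test, assuming the built-in LifeCycleSavings data as a stand-in for gala (the faraway package may not be installed). When the two models differ by a single term, the F statistic from anova equals the square of that term's t statistic in the larger model, which gives a quick sanity check.

```r
data(LifeCycleSavings)           # built-in data, used here in place of gala

full    <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
reduced <- lm(sr ~ pop15 + pop75 + ddpi,       data = LifeCycleSavings)

tab <- anova(reduced, full)      # smaller model first, then the larger one
tab

# With one dropped term, F equals the squared t statistic of that term
t.dpi <- summary(full)$coefficients["dpi", "t value"]
c(F = tab$F[2], t.squared = t.dpi^2)
```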
  • 48. Categorical predictors • Suppose we want to include the variable Area as a categorical predictor with 3 categories rather than a continuous one. • To define the corresponding categorical variable area.cat <- rep(3,nrow(gala)) area.cat[gala$Area<=5] <- 1 area.cat[(gala$Area>5)&(gala$Area<=1000)] <- 2 • Type cbind(gala$Area,area.cat) to view the results • This regression can be fitted using the command g3 <- lm(Species ~ as.factor(area.cat) + Elevation + Nearest + Scruz + Adjacent,data=gala) Type summary(g3) to view the output. • The factor command is useful for fitting ANOVA models.
  • 49. Categorical predictors • For example consider the coagulation data set which gives measurements on blood coagulation corresponding to four diets. data(coagulation) coagulation • A one way ANOVA model can be fitted to the data using coag.fit <- lm(coag ~ factor(diet), coagulation) summary(coag.fit)
  • 50. Multiple comparisons • Another feature especially relevant for ANOVA models is to allow for multiple comparisons while testing for pairwise differences. • Tukey’s Honest Significant Difference (HSD) is designed for all pairwise comparisons and depends on the studentized range distribution. We compute the Tukey HSD bands for the diet data. TukeyHSD(aov(coag.fit)) • You can compare these to the unadjusted confidence intervals for the differences B-A, C-A, D-A given below B-A 1.813638 8.186362 C-A 3.813638 10.186362 D-A -3.022848 3.022848 • Some other pairwise comparison tests may be found in the stats library.
  • 51. Confidence intervals • Returning to the Galapagos dataset, to construct individual 95% confidence intervals for the regression parameters, we first extract the parameters and the standard errors: summary(g1)$coefficients gives the coefficients, standard errors, t and p values as a matrix. We extract the first two columns using beta <- summary(g1)$coefficients[,1] se.beta <- summary(g1)$coefficients[,2] We next compute the critical value of the t-statistic with error d.f. t95 <- qt(0.975, g1$df.residual) and the individual confidence intervals as ci.beta <- cbind(beta-t95*se.beta, beta+t95*se.beta) ci.beta
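The manual calculation above can be checked against the built-in confint function. This is a sketch on the built-in LifeCycleSavings data (an assumption, chosen so the code runs without the faraway package); the same steps apply to g1.

```r
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

est     <- summary(g)$coefficients
beta    <- est[, 1]                      # estimates
se.beta <- est[, 2]                      # standard errors
t95     <- qt(0.975, g$df.residual)      # critical value with error d.f.

ci.manual  <- cbind(beta - t95 * se.beta, beta + t95 * se.beta)
ci.builtin <- confint(g, level = 0.95)   # same intervals in one call

max(abs(ci.manual - ci.builtin))         # should be numerically zero
```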
  • 52. Confidence ellipsoids • Now we construct the joint 95% confidence region for the coefficients of Area and Elevation. Type library(ellipse) plot(ellipse(g1,c(2,3)),type="l") • Add the origin and the point of the estimates: points(0,0) points(g1$coef[2],g1$coef[3],pch=18) • Now we mark the one way confidence intervals on the plot for reference: abline(v=ci.beta[2,],lty=2) abline(h=ci.beta[3,],lty=2)
  • 53. Predictions • Suppose we want to predict the number of species corresponding to a hypothetical sample point with Area = 0.08, Elevation = 93, Nearest = 6.0, Scruz = 12 and Adjacent = 0.34. • This can be done with predict(g1,data.frame(Area=0.08,Elevation=93,Nearest=6.0,Scruz=12,Adjacent=0.34),se=T) • predict(g1) without any additional arguments will return the predicted values for the sample data points.
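A short sketch of the same idea with interval estimates, again assuming the built-in LifeCycleSavings data in place of gala. The values in new.point are hypothetical and chosen only for illustration.

```r
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

# Hypothetical new observation (illustrative values only)
new.point <- data.frame(pop15 = 30, pop75 = 2, dpi = 1000, ddpi = 3)

p <- predict(g, new.point, se.fit = TRUE)
p$fit                                     # point prediction
p$se.fit                                  # standard error of the fitted mean

ci <- predict(g, new.point, interval = "confidence")  # for the mean response
pi <- predict(g, new.point, interval = "prediction")  # for a new observation
rbind(confidence = ci[1, ], prediction = pi[1, ])
```

The prediction interval is wider than the confidence interval because it also accounts for the error variance of a single new observation.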
  • 54. Generalized least squares • Until now we have assumed that var ε = σ²I, but it can happen that the errors have non-constant variance or are correlated, in which case we should fit a generalized least squares model. • To illustrate this we will use a dataset called Longley’s regression data where the response is the number of people employed, yearly from 1947 to 1962, and the predictors are GNP implicit price deflator, GNP, unemployed, armed forces, non-institutionalized population 14 years of age and over, and year. • To attach and view the data type data(longley) names(longley)
  • 55. Approach 1 • Assuming that the errors follow an autoregressive series of order one, we can estimate the serial correlation as data(longley) g <- lm(Employed ~ GNP + Population, data=longley) cor(g$res[-1],g$res[-16]) • We now construct the Σ matrix and compute the GLS estimate of β along with its standard errors. x <- model.matrix(g) Sigma <- diag(16) Sigma <- 0.31041^abs(row(Sigma)-col(Sigma)) Sigi <- solve(Sigma) xtxi <- solve(t(x) %*% Sigi %*% x) beta <- xtxi %*% t(x) %*% Sigi %*% longley$Empl beta
  • 56. Approach 2 • Since we can write Σ = SSᵀ, where S is a triangular matrix obtained from the Choleski decomposition, another approach would be to regress S⁻¹y on S⁻¹X as demonstrated below: sm <- chol(Sigma) smi <- solve(t(sm)) sx <- smi %*% x sy <- smi %*% longley$Empl lmsxsy <- lm(sy ~sx-1) lmsxsy$coef • Our initial estimate of the AR parameter is 0.31, but once we fit our GLS model we can re-estimate it as cor(lmsxsy$res[-1],lmsxsy$res[-16])
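The two approaches should give the same coefficients, which the following sketch checks on the built-in longley data. It estimates the AR(1) correlation from the residuals rather than typing it in, and uses the fact that chol() returns an upper triangular R with t(R) %*% R = Sigma, so the lower triangular factor is t(R).

```r
data(longley)
g   <- lm(Employed ~ GNP + Population, data = longley)
res <- residuals(g)
rho <- cor(res[-1], res[-16])            # lag-one correlation of residuals

x     <- model.matrix(g)
y     <- longley$Employed
idx   <- diag(16)
Sigma <- rho^abs(row(idx) - col(idx))    # AR(1) correlation structure

# Approach 1: direct GLS formula
Sigi  <- solve(Sigma)
beta1 <- solve(t(x) %*% Sigi %*% x, t(x) %*% Sigi %*% y)

# Approach 2: transform by the inverse Cholesky factor, then ordinary LS
S     <- t(chol(Sigma))                  # lower triangular, S %*% t(S) = Sigma
sx    <- solve(S, x)
sy    <- solve(S, y)
beta2 <- coef(lm(sy ~ sx - 1))

cbind(direct = drop(beta1), cholesky = beta2)
```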
  • 57. Approach 3 • The nlme library contains a GLS fitting function. We can use it to fit this model: library(nlme) g <- gls(Employed ~GNP + Population, correlation=corAR1(form= ~Year), data=longley) summary(g) • We see that the estimated value of ρ obtained using Restricted Maximum Likelihood estimation is 0.64. You can also specify method = "ML" in the call to gls for Maximum Likelihood estimation of ρ.
  • 58. Weighted least squares • Sometimes the errors are uncorrelated, but have unequal variance where the form of the inequality is known. Weighted least squares (WLS) can be used in this situation. • Here is an example from an experiment to study the interaction of certain kinds of elementary particles on collision with proton targets. The experiment was designed to test certain theories about the nature of the strong interaction. The cross-section(crossx) variable is believed to be linearly related to the inverse of the energy(energy - has already been inverted). At each level of the momentum, a very large number of observations were taken so that it was possible to accurately estimate the standard deviation of the response(sd). • Consider the following code data(strongx) strongx • Define the weights and fit the model: g <- lm(crossx ~energy, strongx, weights=sd^-2) summary(g)
  • 59. Diagnostics : residuals and leverage • Let’s illustrate these diagnostics using an interesting economic dataset on 50 different countries. • These data are averages over 1960-1970 on dpi = per-capita disposable income in U.S. dollars; ddpi = the percent rate of change in per capita disposable income; sr = aggregate personal saving divided by disposable income; pop15 = percentage of population under 15; pop75 = percentage of population over 75. • The data come from Belsley, Kuh, and Welsch (1980). • First take a look at the data: data(savings) savings
  • 60. Diagnostics : outlier identification • Consider the regression g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings) And a plot of the residuals : plot(g$res,ylab="Residuals",main="Index plot of residuals") countries <- row.names(savings) To identify outliers use identify(1:50,g$res,countries)
  • 61. Diagnostics : Leverage • Now look at the leverage: We first extract the X-matrix here using model.matrix() and then compute and plot the leverages or so called ”hat” values: x <- model.matrix(g) lev <- hat(x) par(mfrow=c(1,1)) plot(lev,ylab="Leverages",main="Index plot of Leverages") abline(h=2*5/50) • Notice that the sum of the leverages is equal to 5 for this data. Which countries have large leverage? We have marked a horizontal line at 2p/n. • Alternatively type names(lev) <- countries lev[lev > 0.2] • The command names() assigns the country names to the elements of the vector lev making it easier to identify them. Alternatively, we can do it interactively like this identify(1:50,lev,countries)
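A compact sketch of the leverage calculation. It assumes the built-in LifeCycleSavings data, which appears to hold the same Belsley, Kuh and Welsch savings data as the savings dataset in faraway, and uses hatvalues(), which keeps the country names attached.

```r
g   <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
lev <- hatvalues(g)              # leverages, named by country

p <- length(coef(g))             # number of parameters (5)
n <- nrow(LifeCycleSavings)      # number of countries (50)

sum(lev)                         # leverages always sum to p
lev[lev > 2 * p / n]             # countries above the 2p/n rule of thumb

plot(lev, ylab = "Leverages", main = "Index plot of Leverages")
abline(h = 2 * p / n)
```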
  • 62. Diagnostics : residual plots • Earlier we plotted the raw residuals. Two alternative classes of residuals are the studentized residuals and the jackknife residuals. To get a plot of all three types on the same page type par(mfrow=c(3,1)) plot(g$res,ylab="Residuals",main="Index plot of residuals") gs <- summary(g) stud <- g$res/(gs$sig*sqrt(1-lev)) plot(stud,ylab="Studentized Residuals",main="Studentized Residuals") jack <- rstudent(g) plot(jack,ylab="Jackknife Residuals",main="Jackknife Residuals") Jackknife residuals can be used for outlier detection. Type jack[abs(jack)==max(abs(jack))] to identify the largest value. A critical value for these jackknife residuals with Bonferroni correction for multiple testing can be computed
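One common way to obtain such a Bonferroni critical value is sketched below. This is an assumption about what the slide intends, not a reconstruction of its missing code: each jackknife residual is compared with a t distribution on n - p - 1 degrees of freedom, and the two-sided level 0.05 is divided by the number of observations tested. The built-in LifeCycleSavings data is again used as a stand-in.

```r
g    <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
jack <- rstudent(g)                       # jackknife residuals

n <- length(jack)
p <- length(coef(g))

# Two-sided test at overall level 0.05, corrected for n comparisons
crit <- qt(1 - 0.05 / (2 * n), df = n - p - 1)

jack[which.max(abs(jack))]                # largest residual in absolute value
crit
any(abs(jack) > crit)                     # TRUE would flag an outlier
```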
  • 63. Influential Observations • To identify influential observations consider Cook’s distances : cook <- cooks.distance(g) plot(cook,ylab="Cooks distances") identify(1:50,cook,countries)
  • 64. Multicollinearity • The Longley dataset is a good example of collinearity: • Check the correlation matrix first using round(cor(longley[,-7]),3) • Now we check the eigendecomposition: x <- as.matrix(longley[,-7]) e <- eigen(t(x) %*% x) sqrt(e$val[1]/e$val) • One option could be to use ridge regression as implemented by lm.ridge in the MASS library. Type library(MASS) ?lm.ridge for details.
  • 65. Variable selection • The step function with the option direction = "forward", direction = "backward" or direction = "both" can be used for forward, backward or stepwise variable selection. Note that step compares models using AIC rather than p-values. • The leaps library implements best subset variable selection based on model selection criteria such as the adjusted R2 and Mallows’ Cp (leaps function) and BIC (regsubsets function).
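A minimal sketch of backward and forward selection with step, assuming the built-in LifeCycleSavings data. trace = 0 suppresses the step-by-step printout; the scope formula tells the forward search which terms it may add.

```r
full <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
null <- lm(sr ~ 1, data = LifeCycleSavings)

# Backward elimination starting from the full model
back <- step(full, direction = "backward", trace = 0)
formula(back)

# Forward selection starting from the intercept-only model
fwd <- step(null, scope = ~ pop15 + pop75 + dpi + ddpi,
            direction = "forward", trace = 0)
formula(fwd)

c(full = AIC(full), backward = AIC(back), forward = AIC(fwd))
```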
  • 66. Transformations • Transformations of the response and predictors can improve the fit and correct violations of model assumptions such as constant error variance. • A popular method is to use the Box Cox transformation. • Consider the Galapagos Islands dataset analyzed earlier: data(gala) g <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,gala) library(MASS) boxcox(g,plotit=T) boxcox(g,lambda=seq(0.0,1.0,by=0.05),plotit=T)
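The Box-Cox profile can also be used numerically rather than read off the plot. The sketch below assumes the built-in trees data (the response must be positive) and the MASS package, which ships with standard R installations. The interval uses the usual chi-squared cut-off on the profile log-likelihood.

```r
library(MASS)

g  <- lm(Volume ~ Girth + Height, data = trees)
bc <- boxcox(g, lambda = seq(-1, 1, by = 0.05), plotit = FALSE)

# bc$x holds the lambda grid, bc$y the profile log-likelihood
lambda.hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1)/2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])

lambda.hat
ci
```

If the interval contains 0, a log transformation of the response is a natural candidate; if it contains 1, no transformation may be needed.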
  • 67. Transformations Alternatively we can also transform the predictor using various functions such as standard polynomials, orthogonal polynomials, splines etc. The following code fits a non-linear regression of Species on Scruz using (natural) splines. data(gala) library(splines) g4 <- lm(Species ~ns(Scruz, df=4),gala) To view the results type scruz.ord <- order(gala$Scruz) plot(gala$Scruz, gala$Species, xlab="Scruz", ylab="Species", lwd=2) points(gala$Scruz[scruz.ord], predict(g4)[scruz.ord], col="yellow", type="l")
  • 68. Transformations You can also look at the effect of increasing the degrees of freedom g8 <- lm(Species ~ns(Scruz, df=8),gala) points(gala$Scruz[scruz.ord], predict(g8)[scruz.ord], col="orange", type="l") g12 <- lm(Species ~ns(Scruz, df=12),gala) points(gala$Scruz[scruz.ord], predict(g12)[scruz.ord], col="red", type="l") g20 <- lm(Species ~ns(Scruz, df=20),gala) points(gala$Scruz[scruz.ord], predict(g20)[scruz.ord], col="purple", type="l") leg <- c("4 df", "8 df", "12 df", "20 df") legend(200, 200, leg, lty=1, col=c("yellow", "orange", "red", "purple"))
  • 69. Exercise • The variable Species in the gala dataset is actually a count and one should ideally fit a generalized linear model with a Poisson family and log link. – Use the glm function to fit an appropriate generalized linear model to the data accounting for possible overdispersion. – Use predict.glm to plot the predicted values corresponding to the sample data together with prediction intervals. – Perform post model fitting diagnostics using some of the following functions from the car and stats libraries : • cooks.distance (stats; formerly cookd in car), • dfbeta and dfbetas (stats) • dffits (stats) • influence.measures (stats) • outlierTest (car; formerly outlier.test).
  • 70. Tutorial 3 : Programming in R
  • 71. When do we need to programme? • To define user defined functions. • To create libraries. • For simulations.
  • 72. A sample programme Here is a sample programme which simulates the distribution of the sample median of Cauchy data, computes a Monte Carlo estimate of its MSE and compares the histogram of the simulated values to its asymptotic distribution. It also displays the time taken to perform the simulations. n <- 10 nsim <- 10000 theta.hat <- double(nsim) start <- proc.time() for (i in 1:nsim) { x <- rcauchy(n) theta.hat[i] <- median(x) } mean(theta.hat^2) cat("Calculation took", (proc.time() - start)[1], "seconds.\n") hist(theta.hat, freq = FALSE, breaks = 100) curve(dnorm(x, sd = sqrt(mean(theta.hat^2))), add = TRUE) curve(dnorm(x, sd = sqrt(1 / (4 * n * dcauchy(0)^2))), add = TRUE, col = "red")
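The same simulation can be written more compactly with replicate and timed with system.time. This is a sketch of an alternative style, not the original programme; the seed is an assumption added for reproducibility. Since dcauchy(0) = 1/pi, the asymptotic variance of the median is pi^2/(4n).

```r
set.seed(42)                     # assumed seed, for reproducibility only
n    <- 10
nsim <- 10000

timing <- system.time(
  theta.hat <- replicate(nsim, median(rcauchy(n)))   # one median per sample
)

mse            <- mean(theta.hat^2)                  # true median is 0
asymptotic.var <- 1 / (4 * n * dcauchy(0)^2)         # equals pi^2 / (4 * n)

c(mse = mse, asymptotic = asymptotic.var)
timing["elapsed"]                                    # seconds for the simulation
```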
  • 73. Programming style Make the programme as generic as possible. For example to generate sample averages for 200 samples of size 20 from the standard normal distribution, one possible code is samp.mean <- numeric(200) for (i in 1:200) { samp.mean[i] <- mean(rnorm(20)) } A more generic alternative is sample.size <- 20 simulation.size <- 200 samp.mean <- numeric(simulation.size) for (i in 1:simulation.size) { samp.mean[i] <- mean(rnorm(sample.size)) }
  • 74. Programming style • Indent. Consider
      samp.mean <- matrix(0, simulation.size, num.times)
      for (i in 1:simulation.size) {
        for (j in 1:num.times) {
          samp.mean[i,j] <- mean(rnorm(mean=j, sd=1, n=samp.size))
        }
      }
    as opposed to
      samp.mean <- matrix(0, simulation.size, num.times)
      for (i in 1:simulation.size) {
      for (j in 1:num.times) {
      samp.mean[i,j] <- mean(rnorm(mean=j, sd=1, n=samp.size))
      }
      }
  • 75. Programming style Give meaningful variable names. The programme x <- matrix(0, m, n) for (i in 1:m) { for (j in 1:n) { x[i,j] <- mean(rnorm(mean=j,sd=1, n=K)) } } is perfectly valid but not as user friendly. • When choosing names be careful not to overwrite inbuilt R functions. For example, naming a variable lm will mask the lm function. If this happens type remove("lm") to restore the function.
  • 76. Programming style Use matrices for fast computation. The commands executed by the following lines of code A <- matrix(0,500,500) for (i in 1:500) for (j in 1:500) A[i,j] <- i + j can be run faster and in a more elegant manner using I.mat <- matrix(seq(1,500), nrow=500, ncol=500) A <- I.mat + t(I.mat) Though R is generally better with loops than S+, such matrix programming is essential for fast computation.
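The claim can be checked directly by timing both versions and confirming that they produce the same matrix. The sketch also shows outer(), a third way to build the same result; timings will vary by machine, so none are quoted here.

```r
n <- 500

# Double loop version
A.loop <- matrix(0, n, n)
t.loop <- system.time(
  for (i in 1:n) for (j in 1:n) A.loop[i, j] <- i + j
)

# Matrix version: I.mat[i, j] = i, so I.mat + t(I.mat) has entries i + j
I.mat <- matrix(seq_len(n), nrow = n, ncol = n)
t.mat <- system.time(A.mat <- I.mat + t(I.mat))

# outer() applies "+" to every pair (i, j)
A.outer <- outer(seq_len(n), seq_len(n), "+")

all.equal(A.loop, A.mat)         # same values
all.equal(A.mat, A.outer)
rbind(loop = t.loop, matrix = t.mat)[, "elapsed"]
```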
  • 77. Programming style Add comments. The # sign can be used to insert comments. For example : simulation.size <- 200 # Set simulation size samp.size <- 20 # Set sample size num.times <- 10 # Set number of repetitions samp.mean <- matrix(0, simulation.size, num.times) # Storage for the sample means for (i in 1:simulation.size) { for (j in 1: num.times) { # Calculate the sample mean for i,j th observation samp.mean[i,j] <- mean(rnorm(mean=j,sd=1, n=samp.size)) } }
  • 78. Variable types in R • Numeric – Real / Floating point • default: double precision, about 15 significant digits • single precision, about 7 significant digits – Integer x <- 6 is.double(x) x <- as.integer(x) is.double(x) is.integer(x) • Logical x <- c(1,2,3,4,5); y <- (x<3); • Character String x <- c("North", "South", "East", "West") • List : collection of several objects of any type x1 <- c("North", "South", "East", "West") x2 <- c(2,3,5,8) x <- list(x1,x2) • Complex arithmetic is also supported in R z <- complex(real = rnorm(100), imag = rnorm(100)) Re(z) Im(z)
  • 79. Vectors and matrices in R • Vectors x <- c(45, 90, 135 ) x y <- c(" North " , " South " , " East " , " West " ) y x*2 length(x) sum(x) · When the values are from a systematic sequence you can save coding x <- rep(2.1, 30) y <- rep(" North " ,5) x <- 1:10 x <- seq(1,10,2)
  • 80. Vectors and matrices in R • Matrices a <- 1:3 b <- 4:6 c <- 7:9 X <- cbind(a,b,c) X dim(X) Y <- rbind(a,b,c) Y dim(Y) X+ Y X*Y X%*%Y Z <- matrix(c(1,4,6,2,3,7.8), nrow=2, ncol=3, byrow=T) Z <- matrix(c(1,4,6,2,3,7.8), nrow=2, ncol=3, byrow=F)
  • 81. Data frames • Most functions such as lm, glm, survreg, coxph etc will operate on data frames. • If the data is read in using a command such as read.csv, read.table etc, it will automatically be saved as a data frame. • If the data is read in from the keyboard, a data frame can be created as follows. length <- c(20, 24, 19, 24, 18, 30) wt <- c(10, 14, 14, 12, 12.5, 17) mydata <- data.frame(length, wt) To see the variable names type names(mydata) . To access the length variable use mydata$length. Alternatively you can attach the data set using attach(mydata) in which case you can simply type length.
  • 82. Programming Loops • for loop : for (i in 1:10) { ….. R code ….. } • while loop while (logical condition) { ….. R code ….. } • if statement if (logical condition) { ….. R code ….. } • if else statement if (logical condition) { ….. R code ….. } else { ….. R code ….. } • The command break will exit from a loop without completion and next skips to the next iteration. The command stop halts execution with an error message.
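A small runnable sketch of these constructs. The numbers are arbitrary and chosen so the results are easy to check by hand; tryCatch is used only to show the message produced by stop without halting the script.

```r
# for loop with next and break: add the odd numbers up to 7
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next          # skip even numbers
  if (i > 7) break               # leave the loop at i = 9
  total <- total + i             # adds 1, 3, 5 and 7
}
total

# while loop: smallest k with 2^k >= 100
k <- 0
while (2^k < 100) {
  k <- k + 1
}
k

# if / else
size <- if (total > 10) "large" else "small"
size

# stop raises an error; tryCatch captures its message
msg <- tryCatch(stop("bad input"), error = function(e) conditionMessage(e))
msg
```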
  • 83. Some useful commands for programming • Numerical solution of equations : uniroot, polyroot, optimize, nlm • Alternatives to loops : apply, tapply, outer • Matrix inversion and solution of linear equations : solve, solve.qr, chol2inv, backsolve, qr.solve • General matrix functions : eigen, svd, det • Sorting : sort, order, rank • Rounding : ceiling, floor, round, trunc, signif • Saving : write.matrix, source, sink, postscript, pdf • Numerical settings : .Machine • Random number generation : .Random.seed, RNG, RNGkind, set.seed
  • 84. User defined functions • Suppose we have decided on a favourite plotting set up which uses blue dashed lines on a yellow background, square plotting characters and prints the variable names parallel to each axis. • Instead of retyping the options for each plot we can create a function which uses these settings and also returns the summaries for each variable. myplot <- function(x,y, bgd = "lightyellow") { opar <- par(no.readonly = TRUE) par(bg=bgd) plot(x, y, pch=22, col="blue", las=1) myplot.out <- summary(data.frame(cbind(x,y))) par(opar) return(myplot.out) } • To use this function, first paste this into R and then use x1 <- rnorm(20) x2 <- rnorm(20) out <- myplot(x1,x2) You can also use myplot(x1, x2, bgd= "grey") To see the summary type out
  • 85. User defined functions • Here is a slightly more complicated function to calculate the number of runs of 1’s in a binary sequence f <- function (x, v=1) { x <- diff(x==v) x <- x[x!=0] if (x[1]==1) sum(x==1) else 1+sum(x==1) } Now generate some data n <- 50 x <- sample(0:1, n, replace=T, p=c(.2,.8)) x To see the number of runs in the sequence type f(x,1)
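The function above fails when the sequence never changes value, because x[1] is then missing. A sketch of an alternative based on rle (run length encoding) handles that case as well; count.runs is a hypothetical name introduced here for illustration.

```r
# Count the runs of value v in x using run length encoding.
# rle(x == v) returns the lengths and values (TRUE/FALSE) of consecutive runs;
# each TRUE entry in $values is one run of v.
count.runs <- function(x, v = 1) {
  r <- rle(x == v)
  sum(r$values)
}

count.runs(c(0, 1, 1, 0, 1, 0, 0, 1, 1, 1))   # runs: 11, 1, 111
count.runs(c(1, 1, 1))                        # constant sequence of 1's
count.runs(c(0, 0))                           # no 1's at all
```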
  • 86. Example 1 • Let us write a programme which will compare the power of the two sample t-test with that of the Wilcoxon and Kolmogorov - Smirnov tests when the underlying data are normal.
  • 87. Example 1 • Let us first open an R script, name it example and save the script in the R home directory. • It is good practice to add in a descriptive header giving the purpose of the programme and the date on which it was last modified. #### R programme for simulating the power of the two sample t test vs various #### non-parametric alternatives #### 21/7/06 • We next need to specify the sample size and the number of simulations to be run with sim.size <- 200 sample.size <- 10 • We shall set the mean of the first population to zero and run the simulation for a range of values of the difference in means with mu1 <- 0 delta <- seq(-2,2, length=50) • We also need to set the seed so as to be able to reproduce the random number generation. set.seed(231)
  • 88. Example 1 • Our programme will then look like this: for (j in 1:length(delta)) { # Set mean of second population for (i in 1:sim.size) { # Generate ith sample # Perform ith set of tests # Check if the test rejects the null hypothesis of equality } # Calculate the simulated power }
  • 89. Example 1 So our programme now looks like this: sim.size <- 200; sample.size <- 10; set.seed(231) mu1 <- 0; delta <- seq(-2,2, length=50) for (j in 1:length(delta)) { mu2 <- mu1 + delta[j] for (i in 1:sim.size) { } # Calculate power for jth setting } # End of j loop
  • 90. Example 1 • Let us now define variables which will hold the simulated powers sim.size <- 200; sample.size <- 10; set.seed(231) mu1 <- 0; delta <- seq(-2,2, length=50) pow.ttest <- NULL pow.wtest <- NULL pow.kstest <- NULL pt.test <- pw.test <- pks.test <- logical(sim.size) for (j in 1:length(delta)) { mu2 <- mu1 + delta[j] for (i in 1:sim.size) { # Calculate pt.test[i], pw.test[i], pks.test[i] } pow.ttest[j] <- sum(pt.test)/sim.size # Calculate powers for jth setting pow.wtest[j] <- sum(pw.test)/sim.size pow.kstest[j] <- sum(pks.test)/sim.size } # End of j loop
  • 91. Example 1 • The inner simulation loop looks like this for (i in 1:sim.size) { # Generate ith sample samp1 <- rnorm(mean=mu1,sample.size) samp2 <- rnorm(mean=mu2,sample.size) # Perform ith set of tests test1 <- t.test(samp1, samp2,alternative = c("two.sided")) pt.test[i] <- (test1$p.value < 0.05) test2 <- wilcox.test(samp1, samp2,alternative = c("two.sided"), exact = TRUE) pw.test[i] <- (test2$p.value < 0.05) test3 <- ks.test(samp1, samp2,alternative = c("two.sided"), exact = TRUE) pks.test[i] <- (test3$p.value < 0.05) }
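Putting the pieces from the previous slides together gives a complete script along the following lines. This is a sketch and not the contents of twosamp.r: it assumes a coarser grid of 9 values of delta to keep the run short, stores the rejections in a matrix, and draws the power curves on screen instead of writing a pdf.

```r
set.seed(231)
sim.size    <- 200
sample.size <- 10
mu1   <- 0
delta <- seq(-2, 2, length = 9)           # coarser grid than on the slides

tests <- c("t", "wilcoxon", "ks")
pow   <- matrix(NA_real_, length(delta), 3, dimnames = list(NULL, tests))

for (j in seq_along(delta)) {
  mu2      <- mu1 + delta[j]              # mean of the second population
  rejected <- matrix(FALSE, sim.size, 3)  # one row per simulated data set
  for (i in 1:sim.size) {
    samp1 <- rnorm(sample.size, mean = mu1)
    samp2 <- rnorm(sample.size, mean = mu2)
    rejected[i, 1] <- t.test(samp1, samp2)$p.value < 0.05
    rejected[i, 2] <- wilcox.test(samp1, samp2, exact = TRUE)$p.value < 0.05
    rejected[i, 3] <- ks.test(samp1, samp2, exact = TRUE)$p.value < 0.05
  }
  pow[j, ] <- colMeans(rejected)          # simulated power for this delta
}

matplot(delta, pow, type = "l", lty = 1, col = c("black", "blue", "red"),
        xlab = "Difference in means", ylab = "Simulated power")
legend("bottomright", legend = tests, lty = 1, col = c("black", "blue", "red"))
```

At delta = 0 the entries of pow estimate the level of each test, so they should be close to or below 0.05.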
  • 92. Example 1 • The complete programme has been saved as an R script called twosamp.r in the R home directory. Open the file using the File menu in R and run the simulation using source("twosamp.r") • The code will automatically save plots of the simulated power in a pdf file called twosamp.pdf also in the R home directory.
  • 93. Example 2 • Now run simulations which will look at the robustness of the two sample t-test to the following assumptions: – Homoscedasticity • Simulate two normal distributions with different standard deviations and plot the level as a function of the ratio of sd’s. Make several plots (on the same graph) corresponding to several choices of the standard deviation of the first population. – Normality • Simulate data from two logistic distributions and find an estimate of the level of the test. Does the level vary with sample size? – Independence • Simulate two correlated normal distributions and plot the level as a function of the correlation coefficient. The package mvtnorm will be required to generate bivariate normal data.
  • 94. Example 3 • Write a function which will calculate the number of runs in a binary sequence of arbitrary length. Hint : Use the diff function.
  • 95. Additional Exercises 1. Generate Bernoulli data with n = 100 and p = .25, p = .05 and p = .01. Is the data approximately normal in each case? 2. Sketch the distribution of the standardized average for data generated from the uniform [0; 1] distribution. Compare the histograms when n is 5, 10, 25 and 100. 3. Write a function which will compute a Monte Carlo estimate of the ratio of the variances for the mean and the median for the (a) N(0,1) distribution (b) t distribution with 2 df. Use the vioplot function in the vioplot library to create side by side vioplots of the simulated distributions of the mean and the median in the two cases.
  • 96. Additional Exercises 4 (a). Search the stats library in R for a list of parametric and non-parametric tests. (b) Generate 200 standard normal variables and perform the one sample t-test on the data. (c) Repeat the steps in (b) 1000 times and draw a histogram of the resulting p values. (d) On the same graph plot the power of the one sample t-test as a function of the true mean using (i) simulation (ii) the R function power.t.test (iii) an analytical expression for power 5. Plot the density of the t distribution for degrees of freedom = 1,2,5,100 and the standard normal density in different colours on the same graph. Add a legend and title to the plot.
  • 97. Additional Exercises 6. The sleep dataset in R shows the number of hours of extra sleep after administration of a sleeping drug. (a) Perform a two sample t-test on the data. (b) Perform a two sample Wilcoxon test. (c) Perform an analysis of variance assuming normality. (d) Perform a non parametric analysis of variance using a Kruskal Wallis test. (e) Compare the variances of the two groups using an F test. 7. Plot the density of the chi-squared distribution for 1-10 degrees of freedom in different colours on the same graph. Add a legend and title to the plot.
  • 98. Additional Exercises 8. Consider the variable eruptions giving eruption lengths of the Old Faithful geyser recorded in the data set faithful. (a) Draw a histogram of the data. (b) Write a function f which takes as input the means (m1,m2) and sd’s (s1,s2), the mixing proportion (p) and the data point (x) and returns the value of the corresponding mixture normal likelihood at the point. (c) Now write a function fn which uses the function f to calculate the likelihood for the entire data set. (d) Use the function optim to get maximum likelihood estimates. (e) Superimpose the sample and theoretical density on the histogram and add a legend to the plot.
  • 99. Additional Exercises 9. Create a data frame called Manitoba.lakes that contains the lake’s elevation (in meters above sea level) and area (in square kilometers) as listed below. Assign the names of the lakes using the row.names( ) function. elevation area Winnipeg 217 24387 Winnipegosis 254 5374 Manitoba 248 4624 SouthernIndian 254 2247 Cedar 253 1353 Island 227 1223 Gods 178 1151 Cross 207 755 Playgreen 217 657 (a) Plot log2(area) versus elevation. Add labeling information using the text command with the label option. (b) Use the R function dotchart( ) to display the areas of the Manitoba lakes (i) on a linear scale, and (ii) on a logarithmic scale. Add, in each case suitable labeling information.
  • 100. Additional Exercises 10. (a) Use the nlm function to numerically minimise the function f(x,y,z) = sin(x) - sin(y-4) + z^2 + 2. (b) If gradient information is not supplied, nlm will use a matrix-secant method which numerically approximates the gradient. To use gradient information, redefine the function so as to additionally contain an attribute called the gradient. Now perform minimisation using the quasi-Newton method. (c) Use the integrate function in R to find the constant of integration, c, for the posterior density function c * exp[-1/2 {(0.12-x)^2 + (0.07-x)^2 + (0.08-x)^2}]
  • 101. Tutorial 4 : R libraries
  • 102. What is a library? • An R library or package is a collection of programmes with a common objective. To see a list of packages available by default type search( ) at the R prompt. • Some commonly used packages are base, graphics, stats, mgcv, nlme, survival, Hmisc etc. • R also has some very specialised packages. Examples include – boot (bootstrap / jackknife) – EbayesThresh (empirical Bayes thresholding), – mAr (multivariate autoregressive analysis) – neural (neural networks) – nlqr (non linear quantile regression) – portfolio (analysing equity portfolios) etc.
  • 103. R contributed libraries • A complete list of contributed packages is available on the R website under the link Contributed extension packages. The list has also been saved to the file Available Bundles and Packages.doc on the Desktop. • The R News site also available from the website also provides a discussion of new packages and updates to old packages. • R also has summaries called CRAN Task Views for the following specialised subjects – Cluster Cluster Analysis & Finite Mixture Models – Econometrics Computational Econometrics – Environmetrics Analysis of ecological and environmental data – Finance Empirical Finance – Genetics Statistical Genetics – MachineLearning Machine Learning & Statistical Learning – Multivariate Multivariate Statistics – SocialSciences Statistics for the Social Sciences – Spatial Analysis of Spatial Data – gR gRaphical models in R • The ctv package can be used to install the functions mentioned in the CRAN Task View.
  • 104. Downloading libraries The first option is to go to the CRAN website and choose one of the mirror sites. A complete listing of libraries is available by following the link for contributed extension packages. Clicking on the desired library will lead to a download page such as bivpois: Bivariate Poisson Models Using The EM Algorithm Functions for fitting Bivariate Poisson Models using the EM algorithm. Details can be found in Karlis and Ntzoufras (2003, RSS D & 2004, AUEB Technical Report) Version: 0.50-2 Depends: R (>= 2.0.1) Date: 2005-08-25 Author: Dimitris Karlis and Ioannis Ntzoufras Maintainer: Ioannis Ntzoufras License: GPL (version 2 or later) URL: http://www.stat-athens.aueb.gr/~jbn/papers/paper14.htm Package source: bivpois_0.50-2.tar.gz Windows binary: bivpois_0.50-2.zip Reference manual: bivpois.pdf
  • 105. Downloading libraries • Download the .zip file. To install the library you can either unzip the file and copy it to the library folder in the R home directory, • or you can open an R session and choose Install from local zip file from the Package option on the menu. • This will install the library as well as the corresponding help documentation. For additional documentation you can visit the cited URL. • If the machine has an internet connection a simpler way to install a package is to choose Set CRAN mirror from the Package menu and then choose Install package. • On installation, function files in a library will all be copied to the library folder in the R home directory. Documentation will be copied to the Doc subfolder within each library.
  • 106. The library command • Consider the following uses of the library command – library() # list all available packages – library(lib.loc = .Library) # list all packages in the default library – library(help = stats) # documentation on package 'stats' – library(faraway) # load package 'faraway' – require(faraway) # the same – library(help = faraway) # documentation on package 'faraway' – search() # lists attached packages • Another useful command available for some packages is demo(). Try – demo(package = .packages(all.available = TRUE)) – demo(glm.vr, package = "stats") – demo(persp, package = "graphics")
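Although library() and require() both attach a package, they behave differently when the package is not installed. A minimal sketch, in which the package name aPackageThatDoesNotExist is deliberately made up:

```r
pkg <- "faraway"

# require() returns TRUE or FALSE (with a warning), so it can sit in a condition
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
if (!loaded) {
  message("Package '", pkg, "' is not installed; try install.packages(\"", pkg, "\")")
}

# library() stops with an error when the package cannot be found
res <- tryCatch(
  library("aPackageThatDoesNotExist", character.only = TRUE),
  error = function(e) "error"
)
res
```

The argument character.only = TRUE is needed because the package name is passed as a character string rather than typed directly.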
  • 107. Some R libraries • In this tutorial we shall consider the following libraries: – TeachingDemos : Demonstrations for teaching – Matrix : A Matrix package for R – MCMCpack : Bayesian inference via Markov chain Monte Carlo
  • 108. The TeachingDemos library • As suggested by the name, this library contains functions useful for interactively demonstrating basic statistical concepts. • The library has already been installed on your machine. Attach the library using library(TeachingDemos) • Check if the package has any inbuilt demos using demo(package="TeachingDemos"). • To see the package capabilities type library(help=TeachingDemos)
  • 109. The TeachingDemos library • Let us explore some of these functions. For example type ?faces to get a description of the faces command. • Next try running the sample code given in the description. • The first example is faces(rbind(1:3,5:3,3:5,5:7)) • The next is data(longley); faces(longley[1:9,]) • Compare the differences between faces and faces2 using faces(matrix(runif(18*10), nrow=10), main='Random Faces') and faces2(matrix(runif(18*10), nrow=10), main='Random Faces') • Type par(mfrow=c(1,1)) to restore the default plotting layout.
  • 110. The TeachingDemos library • Similarly try the examples for some other possibly useful functions such as – mle.demo – power.examp – put.points.demo – rotate.cloud – run.cor.examp – run.hist.demo – vis.binom
  • 111. The Matrix library • The Matrix package provides classes and methods for numerical linear algebra, with special relevance for sparse and ill-conditioned matrices. • We shall use the library to compare the speed of least squares fitting methods on an example for which the model matrix is large and sparse. • First, as a small example, let's create a model matrix, m, and corresponding response vector, yo, for a simple linear regression model using the Formaldehyde data. data(Formaldehyde); str(Formaldehyde); (m <- cbind(1, Formaldehyde$carb)); (yo <- Formaldehyde$optden); solve(t(m) %*% m) %*% t(m) %*% yo; system.time(solve(t(m) %*% m) %*% t(m) %*% yo); dput(c(solve(t(m) %*% m) %*% t(m) %*% yo)); dput(unname(lm.fit(m, yo)$coefficients))
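The two dput() calls on this slide print the coefficients from the normal equations and from lm.fit. As a sketch using only base R, the agreement can also be checked directly, with a QR-based least squares solve added for comparison (the names beta.ne, beta.qr and beta.lm are chosen here for illustration):

```r
data(Formaldehyde)
m  <- cbind(1, Formaldehyde$carb)   # model matrix: intercept and carb
yo <- Formaldehyde$optden           # response vector

# Normal equations: explicit inverse of t(m) %*% m
beta.ne <- c(solve(t(m) %*% m) %*% t(m) %*% yo)

# Least squares via a QR decomposition of m
beta.qr <- qr.solve(m, yo)

# Reference fit used on the slide
beta.lm <- unname(lm.fit(m, yo)$coefficients)

all.equal(beta.ne, beta.qr)
all.equal(beta.qr, beta.lm)
```

On a small, well-conditioned problem like this one the three approaches agree to numerical precision; the differences in speed and stability only show up on larger problems.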
  • 112. The Matrix library • For a large, ill-conditioned least squares problem the naive calculation solve(t(m) %*% m) %*% t(m) %*% yo does not perform well. Let us read in an example of such data using library(Matrix); data(KNex, package = "Matrix"); y <- KNex$y; mm <- as(KNex$mm, "matrix") • Type dim(mm) to get the dimension of mm. • Now check the system time: system.time(naive.sol <- solve(t(mm) %*% mm) %*% t(mm) %*% y) • Because the calculation of a “cross-product” matrix is a common operation in statistics, the crossprod function has been provided to do this efficiently. Check the system time for the above operation using crossprod: system.time(cpod.sol <- solve(crossprod(mm), crossprod(mm,y)))
  • 113. The Matrix library • The crossprod function applied to a single matrix takes advantage of symmetry when calculating the product but does not retain the information that the product is symmetric and positive semidefinite. • As a result least squares estimates are calculated using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition.
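To make the Cholesky route concrete, here is a minimal base R sketch on the small Formaldehyde problem. It is an illustration of the idea only, not the code the Matrix package runs internally.

```r
data(Formaldehyde)
m  <- cbind(1, Formaldehyde$carb)
yo <- Formaldehyde$optden

xtx <- crossprod(m)        # t(m) %*% m, symmetric positive definite here
xty <- crossprod(m, yo)    # t(m) %*% yo

# Cholesky factor: upper triangular R with xtx = t(R) %*% R
R <- chol(xtx)

# Solve t(R) z = xty by forward substitution, then R beta = z by back substitution
z <- forwardsolve(t(R), xty)
beta.chol <- backsolve(R, z)

# General solver (LU based) for comparison
beta.lu <- solve(xtx, xty)

all.equal(c(beta.chol), c(beta.lu))
```

Both triangular solves are cheap, which is where the speed advantage over a general LU solve comes from when the symmetry information is kept.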
  • 114. The Matrix library • The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices from the intermediate calculations. mm <- as(KNex$mm, "dgeMatrix") system.time(Mat.sol <- solve(crossprod(mm), crossprod(mm,y))) • Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation. xpx <- crossprod(mm) xpy <- crossprod(mm, y) system.time(solve(xpx, xpy))
  • 115. The Matrix library • The model matrix mm is sparse; that is, most of the elements of mm are zero. • The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.
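A sketch of the sparse calculation follows. It assumes that KNex$mm is stored in the Matrix package as a sparse matrix, so that crossprod and solve dispatch to the sparse methods; timings vary by machine, so none are quoted here.

```r
library(Matrix)
data(KNex, package = "Matrix")
y <- KNex$y

mm.sparse <- KNex$mm                     # kept in its sparse storage class
mm.dense  <- as(mm.sparse, "matrix")     # ordinary dense copy for comparison

class(mm.sparse)
mean(mm.dense == 0)                      # proportion of entries that are zero

# Dense solve versus sparse solve of the same normal equations
system.time(dense.sol  <- solve(crossprod(mm.dense),  crossprod(mm.dense, y)))
system.time(sparse.sol <- solve(crossprod(mm.sparse), crossprod(mm.sparse, y)))

# The two sets of estimates should agree up to rounding error
all.equal(c(dense.sol), c(as.matrix(sparse.sol)))
```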
  • 116. The MCMCpack library • This package contains functions to perform Bayesian inference using posterior simulation for a number of statistical models. All models return coda mcmc objects that can then be summarized using the coda package. MCMCpack also contains some useful utility functions, including some additional density functions and pseudo-random number generators for statistical distributions, a general purpose Metropolis sampling algorithm, and tools for visualization. • You will also need to download the coda library to run MCMCpack. • Let us type library(MCMCpack) library(coda) library(help=MCMCpack) to view package capabilities.
  • 117. The MCMCpack library • Let us look at the function MCbinomialbeta • Type the following sample code: posterior <- MCbinomialbeta(3,12,mc=5000); summary(posterior) • To plot the prior and posterior on the same graph type plot(posterior); grid <- seq(0,1,0.01); plot(grid, dbeta(grid, 1, 1), type="l", col="red", lwd=3, ylim=c(0,3.6), xlab="pi", ylab="density"); lines(density(posterior), col="blue", lwd=3); legend(.75, 3.6, c("prior", "posterior"), lwd=3, col=c("red", "blue"))
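The simulated summary can be checked against the exact conjugate result. The sketch below uses base R only and assumes that the call above means 3 successes in 12 trials with a uniform Beta(1, 1) prior, which matches the dbeta(grid, 1, 1) prior curve in the plot. Under that assumption the posterior is Beta(1 + 3, 1 + 12 - 3).

```r
y <- 3; n <- 12             # successes and trials (assumed meaning of the call)
alpha <- 1; beta <- 1       # uniform Beta(1, 1) prior

# Conjugate update: posterior is Beta(alpha + y, beta + n - y)
post.a <- alpha + y
post.b <- beta + n - y
post.mean <- post.a / (post.a + post.b)

# Direct simulation from the exact posterior, for comparison with summary(posterior)
set.seed(1)
draws <- rbeta(5000, post.a, post.b)
c(exact = post.mean, simulated = mean(draws))

# Exact 95% equal-tailed credible interval
qbeta(c(0.025, 0.975), post.a, post.b)
```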
  • 118. The MCMCpack library • Similarly consider MCnormalnormal • Type y <- c(2.65, 1.80, 2.29, 2.11, 2.27, 2.61, 2.49, 0.96, 1.72, 2.40) posterior <- MCnormalnormal(y, 1, 0, 1, 5000) summary(posterior) and to see a plot plot(posterior) grid <- seq(-3,3,0.01) plot(grid, dnorm(grid, 0, 1), type="l", col="red", lwd=3, ylim=c(0,1.4), xlab="mu", ylab="density") lines(density(posterior), col="blue", lwd=3) legend(-3, 1.4, c("prior", "posterior"), lwd=3, col=c("red", "blue"))
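The posterior here also has a closed form. The sketch below assumes the positional arguments in the call above are, in order, the known data variance sigma2 = 1, the prior mean mu0 = 0 and the prior variance tau20 = 1; these names are used for illustration and should be checked against ?MCnormalnormal.

```r
y <- c(2.65, 1.80, 2.29, 2.11, 2.27, 2.61, 2.49, 0.96, 1.72, 2.40)
sigma2 <- 1    # known data variance (assumed meaning of the 2nd argument)
mu0    <- 0    # prior mean
tau20  <- 1    # prior variance
n <- length(y)

# Precisions add: posterior precision = prior precision + data precision
post.prec <- 1 / tau20 + n / sigma2
post.var  <- 1 / post.prec

# Posterior mean is the precision-weighted average of prior mean and data
post.mean <- (mu0 / tau20 + sum(y) / sigma2) / post.prec

c(mean = post.mean, var = post.var)
```

These values can be compared with the mean and standard deviation reported by summary(posterior).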
  • 119. The MCMCpack library • To see the effect of varying the prior variance consider the following code (note that dnorm takes a standard deviation, so the prior variance is passed through sqrt): y <- c(2.65, 1.80, 2.29, 2.11, 2.27, 2.61, 2.49, 0.96, 1.72, 2.40); prior.var <- c(0.01,0.1,0.5,0.75,1,1.5,2,10,100); par(mfrow=c(3,3)); for (ipv in seq_along(prior.var)) { posterior <- MCnormalnormal(y, 1, 0, prior.var[ipv], 5000); grid <- seq(-4,4,0.01); plot(grid, dnorm(grid, 0, sqrt(prior.var[ipv])), type="l", col="red", lwd=3, ylim=c(0,1.4), xlab="mu", ylab="density"); points(grid, dnorm(grid, mean(y), sd(y)), type="l"); lines(density(posterior), col="blue", lwd=3); legend(-3, 1.4, c("prior", "posterior", "sample"), col=c("red", "blue", "black"), lty=1, cex=0.5); title(paste("Prior variance = ", prior.var[ipv]), cex=0.6) }
  • 120. The coda library • The coda library is used for convergence diagnostics of the MCMC chain. Type library(help=coda) to see a list of the diagnostic tools implemented. • For example try par(mfrow=c(1,1)); geweke.plot(posterior); raftery.diag(posterior); traceplot(posterior); autocorr.plot(posterior) • A complete list of the Bayesian analysis tools implemented in R is given in the CRAN Task View on the subject.
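The diagnostics can also be tried without MCMCpack. In the sketch below the chain is a hypothetical stand-in made of independent normal draws, not real sampler output, and the code is skipped if coda is not installed.

```r
if (requireNamespace("coda", quietly = TRUE)) {
  set.seed(42)
  # Hypothetical stand-in for sampler output: 5000 independent N(0, 1) draws
  chain <- coda::mcmc(rnorm(5000))

  # Effective sample size: close to 5000 when the draws are nearly uncorrelated
  print(coda::effectiveSize(chain))

  # Geweke diagnostic: z-score comparing the early and late parts of the chain
  print(coda::geweke.diag(chain))
} else {
  message("coda is not installed; try install.packages(\"coda\")")
}
```

For a real chain, strong autocorrelation shows up as an effective sample size well below the number of iterations.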
  • 121. Exercise • Frank Harrell’s Hmisc library contains many functions useful for data analysis, high-level graphics, utility operations, computing sample size and power, importing datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of S objects to LaTeX code, and recoding variables. (a) Load the Hmisc library and list its capabilities. (b) Use the function binconf to compute the various possible confidence intervals for the binomial proportion. Make a plot to study the relationship between these intervals for varying sample size. (c) Compare the abilities of the functions fit.mult.impute, aregImpute and impute for imputing missing data.