1. 1. What is R?
2. Where is R available?
3. Will R run on all machines?
4. Why should I switch to R?
5. What can R do?
6. How easy is it to learn R?
7. How large is the R community?
8. Does R have a support system?
9. Is there a beginners guide to R?
http://www.r-project.org
2. What is R?
• R is a language and environment for statistical computing and graphics.
• Developed by Robert Gentleman and Ross Ihaka.
• It can be freely downloaded from
http://www.r-project.org
• R can be considered as an implementation of S. There are some important
differences, but much code written for S runs unaltered under R.
• R is a GNU project.
3. Why I Must Write GNU
I consider that the golden rule requires that if I like a program I must share
it with other people who like it. Software sellers want to divide the users and
conquer them, making each user agree not to share with others. I refuse to
break solidarity with other users in this way. I cannot in good conscience
sign a nondisclosure agreement or a software license agreement. For years I
worked within the Artificial Intelligence Lab to resist such tendencies and
other inhospitalities, but eventually they had gone too far: I could not
remain in an institution where such things are done for me against my will.
So that I can continue to use computers without dishonor, I have decided to
put together a sufficient body of free software so that I will be able to get
along without any software that is not free. I have resigned from the AI lab
to deny MIT any legal excuse to prevent me from giving GNU away.
Richard Stallman, Founder, Free Software Foundation.
4. What can R do?
R is called a “programming environment”.
R includes
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data
analysis,
• graphical facilities for data analysis and display either on-screen or on
hardcopy, and
• a well-developed, simple and effective programming language which
includes conditionals, loops, user-defined recursive functions and input
and output facilities.
5. Who currently uses R?
SAS is the most common statistics package in general but R or S is
most popular with researchers in Statistics. A look at common
Statistical journals confirms this popularity. R is also popular for
quantitative applications in Finance.
6. Why should I switch to R?
1. It is free. (in what sense?)
2. It has an extensive support system in the form of many online
manuals, tutorials, discussion and help forums.
3. In some ways it is better than its closest competitor, S+ (plots,
memory management).
4. In addition to the base packages there is a very exhaustive list of
extension and user contributed packages for specialized tasks.
7. Getting help in R
• Within R:
– The ? Command can be used to get help on a specific command
within R
• ? graphics typed at the R prompt will provide a description of R
graphics.
• demo(graphics) will demonstrate some examples.
• Another useful command is example. Type ? example at the R
prompt for a description.
• Documentation
– Manuals, FAQs, reference cards, tutorials and news about recent
developments are available at http://www.r-project.org/other-docs.html.
– CRAN Task Views for specialized applications.
• Online help
– The main R mailing lists are R-help, R-devel and Bioconductor.
– The site for R-help is https://www.stat.math.ethz.ch/pipermail/r-help/
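The help facilities described above can be tried out with a short script. This is only a sketch: apropos and args are added here as two further ways of discovering functions, and the help calls are wrapped in interactive() because help pages are mainly useful at the prompt.

```r
# Discovering functions and their arguments from the prompt.
found <- apropos("^read\\.")   # names of visible objects starting with "read."
print(found)
print(args(read.csv))          # argument list of read.csv

# Help pages and worked examples are most useful interactively,
# so these calls are skipped when the script is run in batch mode.
if (interactive()) {
  help("lm")                   # same as typing ?lm
  example("lm")                # runs the sample code from the help page
  help.search("robust")        # searches the installed documentation
}
```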
8. CRAN Task View: Computational Econometrics
Maintainer: Achim Zeileis
http://www.maths.bris.ac.uk/R/src/contrib/Views/Econometrics.html
CRAN Task View: Statistical Genetics
Maintainer: Giovanni Montana
http://www.maths.bris.ac.uk/R/src/contrib/Views/Genetics.html
CRAN Task View: Bayesian Inference
Maintainer: Jong Hee Park
http://cran.r-project.org/src/contrib/Views/Bayesian.html
Type CRAN Task View into a search engine such as
Google for more.
9. Posting Guide: How to ask good
questions that prompt useful answers
• R-help is intended to be comprehensible to people who want to use R
to solve problems but who are not necessarily interested in or
knowledgeable about programming.
• R-devel is intended for questions and discussion about code
development in R. Questions likely to prompt discussion unintelligible
to non-programmers should go to R-devel.
• Bioconductor is for announcements about the development of the
Bioconductor project, availability of new code, questions and
answers about problems and solutions using Bioconductor, etc.
10. On Wed, 6 Dec 2000, Hisaji ONO wrote:
> Hello, the R people.
> I look for robust regression in R. This method is available in S, its name is rreg.
There's better robust regression in the VR bundle of packages.
library(MASS)
help(rlm)
library(lqs)
help(lqs)
-thomas
Thomas Lumley
Assistant Professor, Biostatistics
University of Washington, Seattle
Subject: Re: [R] Is robust regression available in R.
From: Thomas Lumley (thomas@biostat.washington.edu)
Date: Wed 06 Dec 2000 - 04:12:58 EST
11. Available Bundles and Packages
aaMI: Mutual information for protein sequence alignments
abind: Combine multi-dimensional arrays
accuracy: Tools for testing and improving accuracy of statistical results.
acepack: ace() and avas() for selecting regression transformations
actuar: Actuarial functions
adapt: adapt -- multidimensional numerical integration
ade4: Analysis of Environmental Data : Exploratory and Euclidean method
adehabitat: Analysis of habitat selection by animals
adlift: An adaptive lifting scheme algorithm
agce: analysis of growth curve experiments
akima: Interpolation of irregularly spaced data
AlgDesign: AlgDesign
alr3: Methods and data to accompany Applied Linear Regression 3rd editi
amap: Another Multidimensional Analysis Package
AMORE: A MORE flexible neural network package
AnalyzeFMRI: Functions for analysis of fMRI datasets stored in the ANALYZE for
aod: Analysis of Overdispersed Data
ape: Analyses of Phylogenetics and Evolution
apTreeshape: Analyses of Phylogenetic Treeshape
ArDec: Time series autoregressive decomposition
arules: Mining Association Rules and Frequent Itemsets
13. And beyond…
• Exploratory Data Analysis : eda
• Financial data analysis: fPortfolio, financial, fMultivar
• Spatial analysis: spatstat, geoR, SemiPar
• Smoothing: SemiPar, splines, mgcv
• Econometric analysis: gear, Ecdat
• R also has many built-in datasets, a list of which may be
viewed by typing data() at the prompt.
14. Are there any downsides?
• R is not menu driven. Commands to be executed
must be typed in at the prompt.
• This is not a complete disadvantage because it
prevents the cookbook approach to statistics.
• However it means that the user will need to invest
some initial time to become familiar with R syntax.
• Sample code is available for each command and is
helpful for familiarizing those new to programming,
e.g. try typing ? lm at the prompt and scroll to the
bottom.
15. Tutorial structure
1. Tutorial 1: R Graphics (and basics)
2. Tutorial 2 : Regression analysis
3. Tutorial 3 : Programming in R
4. Tutorial 4 : R libraries
17. Getting started
• To open an R session click on the ‘R’ shortcut on the desktop. This will
open a commands window with the R prompt ‘>’. All commands have
to be typed in at the prompt.
• R code in the presentation is indicated in blue. You can cut and paste
this into the commands window.
• Use the arrow keys to recall previous commands.
• You can scroll up the command window to view earlier commands and
output.
18. Downloading R
• The base package in R can be downloaded from CRAN (the Comprehensive R
Archive Network), which is linked from www.r-project.org.
• You will need to click on one of the mirror sites.
• Additional packages can either be downloaded from this site or from
within R.
19. Reading in data
Assignment
• The most straightforward way to store a list of numbers is through an
assignment using the c command.
• As an example, we can create a new variable called newvar which will contain
the numbers 3, 5, 7, and 9:
newvar <- c(3,5,7,9)
• When you enter this command you should not see any output except a new
command line.
• To see what numbers are included in newvar type newvar at the prompt and
press the enter key:
• If you wish to work with one of the numbers you can get access to it
using the variable and then square brackets indicating which number:
eg try typing newvar[2]
newvar[1:2]
newvar[-2]
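The indexing commands above can be collected into one small script. The logical index in the last line is an extra illustration that is not on the slide.

```r
newvar <- c(3, 5, 7, 9)          # combine four numbers into a vector

second         <- newvar[2]      # a single element, counting from 1
first_two      <- newvar[1:2]    # a range of positions
without_second <- newvar[-2]     # a negative index drops that position
above_four     <- newvar[newvar > 4]  # logical index: keep values > 4

print(second)
print(first_two)
print(without_second)
print(above_four)
```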
20. Reading a CSV file
• We shall read a very short data file called simple.csv which has six
rows of data on three variables labeled "trial," "mass," and "velocity."
• The command to read the data file is read.csv.
• The following command will read in the data and assign it to a
variable called newdata
newdata <- read.csv(file="simple.csv",header=TRUE,sep=",")
• To view the data type newdata at the prompt.
• Try typing summary(newdata)
• Try typing summary(newdata[(newdata$trial=="A"),])
• Try typing table(newdata$trial)
• You can now access each individual column using a "$" to
separate the two names eg try typing newdata$mass
• If you are not sure what columns are contained in the variable
type names(newdata) at the prompt.
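If simple.csv is not to hand, the same steps can be tried on a stand-in file. In the sketch below the six rows are made up for illustration (only the column names come from the slide), and the file is written to a temporary location.

```r
# Write a hypothetical six-row file with the three columns named on the slide.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("trial,mass,velocity",
             "A,10,12",
             "A,11,14",
             "B,5,8",
             "B,6,10",
             "A,10.5,13",
             "B,7,11"), csv_path)

newdata <- read.csv(file = csv_path, header = TRUE, sep = ",")

print(names(newdata))                        # column names
print(summary(newdata[newdata$trial == "A", ]))  # rows for trial A only
print(table(newdata$trial))                  # number of rows per trial

# Mean mass within each trial, using the $ accessor for columns.
mean_mass <- tapply(newdata$mass, newdata$trial, mean)
print(mean_mass)
```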
21. Reading in data
• There are many ways to read data using R. We have given only two
examples: direct assignment and reading CSV files.
• Other commands include read.csv2, read.delim, read.fwf and scan.
• To get help on these commands you can type ? read.fwf at the
prompt etc.
• It is also possible to import data of other formats such as SAS,
SPSS etc into R.
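A brief sketch of the other readers mentioned above. To keep it self-contained the input is supplied as inline text rather than files, and all of the data values are invented for illustration.

```r
# Tab-delimited text
tab <- read.delim(text = "id\tscore\n1\t2.5\n2\t3.5")

# Semicolon-separated text with a decimal comma, as read.csv2 expects
semi <- read.csv2(text = "id;score\n1;2,5\n2;3,5")

# Fixed-width fields: 1 character for the group, 3 for the value
fw <- read.fwf(textConnection(c("A 12", "B 34")),
               widths = c(1, 3), col.names = c("grp", "val"))

# scan reads a plain stream of numbers into a vector
nums <- scan(text = "3 5 7 9")

print(tab); print(semi); print(fw); print(nums)
```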
22. Plotting data
• To see some of the possibilities that R offers, enter
demo(graphics)
• Press the Enter key to move to each new graph.
• Note that the code required to produce each plot is being
simultaneously displayed in the command window.
23. The plot function
• In order to illustrate R's graphical functionalities, let us consider
a simple example of a bivariate graph of 10 pairs of random
variables. These values were generated with:
x <- rnorm(10)
x <- sort(x)
y <- rnorm(10)
• To get a scatter-plot of x against y, type
plot(x, y)
and the graph will be plotted on the active graphical device.
24. The plot function
• This plot uses default axis labels, limits, symbols etc.
• We can customize plots by passing options to the plot
commands. Try the following variations:
plot(x,y, type="l")
plot(x,y, type= "l", lty=2)
plot(x,y, type= "l", lwd=3)
plot(x,y, type= "l", col="red")
While this needs some getting accustomed to, it does
make the point that plots are subjective and it is
necessary for the user to make intelligent choices of
plotting parameters. For example try the following
commands
par(mfrow=c(1,2))
plot(x,x^2,type="l",ylim=c(0,1))
plot(x,x^2,type= "l", ylim=c(0,10))
25. • For a fully customized plot try the following
par(mfrow=c(1,1))
plot(x, y, xlab="Ten random values", ylab="Ten other values",
xlim=c(-2, 2),ylim=c(-2, 2), pch=22, col="red",bg="yellow",
bty="l", tcl=0.4,main="How to customize a plot with R", las=1,
cex=1.5)
• What does each of the options do?
• This type of control over parameters is typical of R and is also
found in analysis functions and programming. It is one of the
features which makes R truly scientific and superior to many
other software packages.
26. More customized plots
• Some plotting options can be passed on as
arguments to the plot function while others will need
modification of the default graphical settings specified
in par.
• You can view the current settings in par by typing par() at the prompt.
• Let us consider the following modification to par.
• Type the following at the R prompt:
opar <- par(no.readonly = TRUE)
par(mfrow=c(1,1))
par(bg="lightyellow", col.axis="blue", mar=c(4, 4, 2.5, 0.25))
plot(x, y, xlab="Ten random values", ylab="Ten other values",
xlim=c(-2, 2), ylim=c(-2, 2), pch=22, col="red", bg="yellow",
bty="l", tcl=-.25, las=1, cex=1.5)
title("How to customize a plot with R", font.main=3, adj=1)
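The settings saved before modifying par can be put back once the plot is done. A minimal sketch, assuming freshly simulated x and y and a null graphics device when the script is not run interactively:

```r
if (!interactive()) pdf(NULL)    # batch runs: null device, no file is written
set.seed(1)                      # illustrative data in the style of the slides
x <- sort(rnorm(10))
y <- rnorm(10)

opar <- par(no.readonly = TRUE)  # save only the settings that can be restored
par(bg = "lightyellow", col.axis = "blue", mar = c(4, 4, 2.5, 0.25))
plot(x, y, pch = 22, col = "red", bg = "yellow")
changed_mar <- par("mar")        # margins while the custom settings are active

par(opar)                        # put the saved settings back
restored_mar <- par("mar")
```

Saving with no.readonly = TRUE avoids warnings about read-only parameters when the settings are restored.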
27. • Once you have spent some time setting your favourite
plotting options, it is easy to replicate these for another dataset.
• We shall now look at a sample data set in R. Consider the data set
florida which has the votes for the various candidates by county
in the state of Florida in the 2000 US presidential election. You
can attach this dataset by typing
attach("usingR.RData")
attach(florida)
Type florida at the prompt to view the data.
Next try plotting the votes for Bush and Buchanan.
28. Interactive plotting : The identify and locator functions
identify is a useful function which can be used to label selected
points on a plot.
Type plot(BUSH, BUCHANAN, xlab="Bush", ylab="Buchanan")
identify(BUSH, BUCHANAN, County)
Then click near a point to identify the county.
Another interactive function is the locator function.
Type
plot(1:nrow(florida), BUSH, col="red",pch=2,xlab="County no",
ylab="Votes")
points(1:nrow(florida), BUCHANAN, col="green", pch=4)
leg <- c("BUSH", "BUCHANAN")
Then type
legend(locator(1), leg, col=c("red", "green"), pch=c(2,4))
29. We next discuss histograms, density plots,
boxplots and normal probability plots
Type the following:
attach("usingR.RData")
attach(possum)
Type possum at the prompt to view the data. Next type
hist(totlngth) for the default histogram.
Now suppose we want to specify the bins ourselves. Type
par(mfrow = c(1, 2))
hist(totlngth, freq=F, breaks = 72.5 + (0:5) * 5,
xlab="Total length", main ="A: Breaks at 72.5, 77.5, ...")
hist(totlngth, freq=F, breaks = 75 + (0:5) * 5, xlab="Total length",
main="B: Breaks at 75, 80, ...")
30. To get a corresponding density estimate (using kernel smoothing)
type
d <- density(totlngth)
points(d) will superimpose the density estimate.
A better (scaled) superimposition can be obtained by
points(d$x, d$y/1.08,type="l", col="blue")
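As an alternative sketch, when the histogram is drawn on the density scale (freq = FALSE) the kernel estimate can be overlaid with lines() without any rescaling. The data here are simulated stand-ins, not the possum measurements.

```r
if (!interactive()) pdf(NULL)          # batch runs: null device, no file written
set.seed(2)
len <- rnorm(43, mean = 87, sd = 4)    # hypothetical lengths, for illustration

hist(len, freq = FALSE, xlab = "Total length",
     main = "Histogram with density estimate")
d <- density(len)                      # kernel density estimate
lines(d, col = "blue")                 # overlay as a curve

# The estimate integrates to roughly 1 (simple Riemann sum on its grid).
area <- sum(d$y) * diff(d$x[1:2])
print(area)
```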
31. • qqnorm(totlngth) gives a normal probability plot of the variable
totlngth. The points of this plot will lie approximately on a straight line if
the distribution is normal.
• In order to calibrate the eye to recognise plots that indicate nonnormal
variation, it is helpful to do several normal probability plots for random
samples of the relevant size from a normal distribution.
• Type the following
attach(possum)
par(mfrow=c(3,4)) # A 3 by 4 layout of plots
y <- totlngth
qqnorm(y,xlab= " ", ylab="Length", main="Possums" ,col="blue")
for(i in 1:11)
qqnorm(rnorm(43),col="red",xlab="", ylab="Simulated lengths",
main="Simulated")
32. Now let's explore some very sophisticated plots
Suppose we have normally distributed data and we want to see how the
empirical density estimate compares with the normal density estimate
as we vary the sample size. Type the following commands
library(lattice)
n <- seq(5, 45, 5)
x <- rnorm(sum(n))
y <- factor(rep(n, n), labels=paste("n =", n))
densityplot(~ x | y,
panel = function(x, ...) {
panel.densityplot(x, col="DarkOliveGreen", ...)
panel.mathdensity(dmath=dnorm,
args=list(mean=mean(x), sd=sd(x)),
col="darkblue")
})
33. • The iris dataset gives measurements on four variables for
several species of the flower. A pairwise scatter of these
variables with separate markers for each species may be
obtained by
data(iris)
splom(
~iris[1:4], groups = Species, data = iris, xlab = "",
panel = panel.superpose,
auto.key = list(columns = 3)
)
34. Colours in R
• R is particularly good at handling colours. To see the palette in R, type
the following:
demo.pal <-
function(n, border = if (n<32) "light gray" else NA,
main = paste("color palettes; n=",n),
ch.col = c("rainbow(n, start=.7, end=.1)", "heat.colors(n)",
"terrain.colors(n)", "topo.colors(n)", "cm.colors(n)"))
{
nt <- length(ch.col)
i <- 1:n; j <- n / nt; d <- j/6; dy <- 2*d
plot(i,i+d, type="n", yaxt="n", ylab="", main=main)
for (k in 1:nt) {
rect(i-.5, (k-1)*j+ dy, i+.4, k*j,
col = eval(parse(text=ch.col[k])), border = border)
text(2*j, k * j +dy/4, ch.col[k])
}
}
n <- if(.Device == "postscript") 64 else 16
# Since for screen, larger n may give color allocation problem
demo.pal(n)
35. Colours in R
To use these colour palettes, try the following plot of the contours of a bivariate
normal density.
x <- y <- seq(-3,3,length=100)
norm.density <- matrix(0,100,100)
for (i in 1:100)
for (j in 1:100)
norm.density[i,j] <- dnorm(x[i])*dnorm(y[j])
par(mfrow=c(1,1))
image(x, y, norm.density, col = heat.colors(1000), axes = FALSE)
contour(x, y, norm.density, by = 5, add = TRUE, col = "peru")
box()
title(main = "The bivariate normal density", font.main = 4)
You can change the appearance of the plot by selecting the amount of colour
gradation. For example try
par(mfrow=c(1,2))
image(x, y, norm.density, col = heat.colors(1000), axes = FALSE)
contour(x, y, norm.density, by = 5, add = TRUE, col = "peru")
box()
image(x, y, norm.density, col = heat.colors(3), axes = FALSE)
contour(x, y, norm.density, by = 5, add = TRUE, col = "peru")
box()
36. The R Graph Gallery
• Visit the R Graph Gallery at
http://addictedtor.free.fr/graphiques/allgraph.php
• Click on a graph for a copy of the code used to generate it.
37. 3D graphics and movies in R
• Visit the R movies gallery at http://addictedtor.free.fr/movies/ for R movies.
• http://rgl.neoscientists.org/Gallery.html demonstrates the rgl package.
• Try the following code
library(rgl)
example(rgl.surface)
for(i in 1:360) {
rgl.viewpoint(i, i*(60/360), interactive=F)
}
Click on the RGL device at the bottom of the R screen to see the
results.
38. Exercises
1. (a) First attach the possum dataset using attach("usingR.RData")
and then attach(possum). Consider the variable head lengths
given by hdlngth. Plot the following on the same page
a) a histogram
b) a stem and leaf plot
c) a normal probability plot and
d) a density plot
(b) The measurements in the possum dataset have been
collected at various sites given by the variable site. Draw box
plots of hdlngth by site.
39. Solutions
1. (a) First attach the possum dataset using attach("usingR.RData") and then
attach(possum). Consider the variable head lengths given by
hdlngth. Plot the following on the same page
a) a histogram
b) a stem and leaf plot
c) a normal probability plot and
d) a density plot
First attach the data using
attach("usingR.RData")
attach(possum)
To get all plots on the same page, set the layout parameter mfrow
par(mfrow=c(2,2))
The histogram, normal probability and density plots can be obtained as
hist(hdlngth, xlab="headlength of possums", main="Histogram")
qqnorm(hdlngth, xlab="headlength of possums", main="Normal probability plot")
plot(density(hdlngth), xlab= "headlength of possums", main="Density plot")
40. Solutions
The stem and leaf is a little trickier. First to find the corresponding
command in R, we type help.search("stem") at the R prompt. Two likely
candidates seem to be the command stem in the base package and stem.leaf in the
aplpack package. The stem command does not work because it only returns
the stem and leaf display in the command window. Note
names(stem(hdlngth)) does not return anything. The stem.leaf command
also displays the results in the commands window but it does store the
information in an object. To see this type
library(aplpack)
sc <- stem.leaf(hdlngth)
sc
We can create an empty plot and use the text command to place this
information on the plot as follows:
plot(1:65,1:65, type="n", xlab=" ", ylab= " ", axes=F)
for(i in 1:16)
text(3,65-4*i,sc$stem[i], adj=c(0,0), cex=0.7)
41. (b) The measurements in the possum dataset have been collected
at various sites given by the variable site. Draw box plots of
hdlngth by site.
First reset the layout parameter with
par(mfrow=c(1,1))
The basic code is
boxplot(hdlngth~site)
You can also try the following variants:
boxplot(hdlngth ~ site, notch = TRUE, col = "blue")
boxplot(hdlngth ~ site, names=c("A","B","C","D","E","F","G"))
boxplot(hdlngth ~ site, names=c("Site A", "Site B", "Site C", "Site D",
"Site E", "Site F", "Site G"), las=2)
boxplot(hdlngth~site, subset=hdlngth<90)
boxplot(hdlngth ~ site, boxwex = 0.25, at = 2:8,
main = " Headlength of possums", xlab = " Site",
ylab= "Head length", ylim = c(50, 110), yaxs = "i")
43. The lm command
• The lm command is used to fit linear regressions in R.
• To fit a regression of y on x1, x2, the basic command is lm(y~x1+x2). Let us
generate some data for the purpose of illustration:
y <- rnorm(10)
x1 <- rnorm(10)
x2 <- rnorm (10)
lm(y~x1+x2)
• This only displays the least squares estimates on the screen. What if we want
to test for significance? To do this we have to define a variable, say lmfit, which
will save the output. To do this type lmfit <- lm(y~x1+x2)
• lmfit is an R list (an object of class "lm"). To see the results now type
summary(lmfit). To see what else is contained in lmfit type names(lmfit).
To see a particular component of lmfit, e.g. the residuals, type lmfit$residuals.
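A short sketch of pulling numbers out of a fitted model rather than reading them off the screen. The seed is an addition so that the simulated data are reproducible.

```r
set.seed(1)                       # added for reproducibility
y  <- rnorm(10)
x1 <- rnorm(10)
x2 <- rnorm(10)

lmfit <- lm(y ~ x1 + x2)
print(names(lmfit))               # components stored in the fitted object

coef_table <- coef(summary(lmfit))       # estimates, std. errors, t and p values
p_values   <- coef_table[, "Pr(>|t|)"]   # one column of that matrix
res        <- residuals(lmfit)           # same values as lmfit$residuals

print(coef_table)
```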
44. The gala dataset
• Now let us fit a linear model on some real data. We shall use the gala
dataset in the faraway library.
• This dataset concerns the number of plant species on the various
Galapagos Islands. There are 30 cases (islands) and 7 variables in the
dataset.
• The variables are
– Species The number of plant species found on the island
– Endemics The number of endemic or native species
– Area The area of the island (km^2)
– Elevation The highest elevation of the island (m)
– Nearest The distance from the nearest island (km)
– Scruz The distance from Santa Cruz island (km)
– Adjacent The area of the adjacent island (km^2)
• http://www.rit.edu/~rhrsbi/GalapagosPages/Darwin.html
45. The gala dataset
• We start by reading the data into R :
library(faraway)
attach(gala)
Use summary(gala), names(gala) etc to get a
sense of the data.
You can also use pairs(gala) to get a pairwise
scatterplot of the variables.
• Let us fit the regression
gfit <- lm(Species ~ Area + Elevation +
Nearest + Scruz +
Adjacent,data=gala)
• To see the results type summary(gfit)
• In particular, the fitted (or predicted) values and residuals are
gfit$fit
and gfit$res
46. The anova command
• A convenient way to compare two nested models is to use the anova
command
• Suppose we fit the two models
g1 <- lm(Species ~ Area + Elevation + Nearest +
Scruz +
Adjacent,data=gala)
g2 <- lm(Species ~ Area + Elevation +
Nearest + Scruz ,data=gala)
Then
anova(g2,g1)
will give us the conventional F test comparing these two models.
47. Testing nested models
• Suppose we want to test whether the
coefficients for the variables Area and
Adjacent are equal. We can type
g1 <- lm(Species ~ Area + Elevation + Nearest +
Scruz +
Adjacent,data=gala)
g2 <- lm(Species ~ I(Area + Adjacent) +
Elevation + Nearest +
Scruz,data=gala)
• Then anova(g2,g1) will perform the appropriate
F test.
• Suppose we want to test whether the coefficient of Area can be set to a
particular value, say -0.1. We can then fit
g2 <- lm(Species ~ offset(-0.1*Area) +
Elevation + Nearest + Scruz +
Adjacent,data=gala)
• As before, anova(g2,g1) then performs the F test.
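The same two constructions can be tried without installing faraway. This sketch uses the built-in mtcars data as a stand-in, and the hypothesised value -3 for the wt coefficient is chosen arbitrarily for illustration.

```r
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# H0: the coefficients of wt and hp are equal (one shared slope for wt + hp)
equal_coef <- lm(mpg ~ I(wt + hp) + qsec, data = mtcars)
a1 <- anova(equal_coef, full)

# H0: the coefficient of wt equals -3 (the offset term fixes it at that value)
fixed_coef <- lm(mpg ~ offset(-3 * wt) + hp + qsec, data = mtcars)
a2 <- anova(fixed_coef, full)

print(a1)
print(a2)

# With a single restriction, the F statistic is the square of the t statistic
# for (estimate - hypothesised value) in the full model.
t_stat <- unname((coef(full)["wt"] + 3) /
                 coef(summary(full))["wt", "Std. Error"])
```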
48. Categorical predictors
• Suppose we want to include the variable Area as a categorical
predictor with 3 categories rather than a continuous one.
• To define the corresponding categorical variable
area.cat <- rep(3,nrow(gala))
area.cat[gala$Area<=5] <- 1
area.cat[(gala$Area>5)&(gala$Area<=1000)] <- 2
• Type cbind(gala$Area,area.cat) to view the results
• This regression can be fitted using the command
g3 <- lm(Species ~ as.factor(area.cat) +
Elevation + Nearest + Scruz +
Adjacent,data=gala)
Type summary(g3) to view the output.
• The factor command is useful for fitting ANOVA models.
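The three-step recoding above can also be written with cut(), which returns a factor directly. A sketch on a made-up vector of areas (not the gala values):

```r
area <- c(0.01, 0.5, 5, 5.1, 25, 1000, 1000.1, 4669)  # hypothetical areas

# Recoding as on the slide
area.cat <- rep(3, length(area))
area.cat[area <= 5] <- 1
area.cat[(area > 5) & (area <= 1000)] <- 2

# Equivalent one-liner: intervals (-Inf, 5], (5, 1000], (1000, Inf)
area.cut <- cut(area, breaks = c(-Inf, 5, 1000, Inf),
                labels = c("1", "2", "3"))

print(cbind(area, area.cat, cut = as.integer(area.cut)))
```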
49. Categorical predictors
• For example consider the coagulation data set which gives
measurements on blood coagulation corresponding to four diets.
data(coagulation)
coagulation
• A one way ANOVA model can be fitted to the data using
coag.fit <- lm(coag ~ factor(diet),
coagulation)
summary(coag.fit)
50. Multiple comparisons
• Another feature especially relevant for ANOVA models is to allow for
multiple comparisons while testing for pairwise differences.
• Tukey’s Honest Significant Difference (HSD) is designed for all
pairwise comparisons and depends on the studentized range
distribution. We compute the Tukey HSD bands for the diet data.
TukeyHSD(aov(coag.fit))
• You can compare these to the unadjusted
confidence intervals for the differences B-A,
C-A, D-A given below
B-A 1.813638 8.186362
C-A 3.813638 10.186362
D-A -3.022848 3.022848
• Some other pairwise comparison tests may be found in the stats
library.
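One of the other pairwise procedures in stats is pairwise.t.test, sketched here next to TukeyHSD. The built-in PlantGrowth data are used as a stand-in for the coagulation data, which need the faraway package.

```r
fit <- aov(weight ~ group, data = PlantGrowth)

tuk <- TukeyHSD(fit)                   # Tukey HSD intervals for all pairs
print(tuk)

# Pairwise t tests, without and with a Bonferroni adjustment
raw  <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                        p.adjust.method = "none")
bonf <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                        p.adjust.method = "bonferroni")
print(raw$p.value)
print(bonf$p.value)
```

The adjusted p-values are never smaller than the unadjusted ones, which is the price paid for controlling the overall error rate.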
51. Confidence intervals
• Returning to the Galapagos dataset, to construct individual 95%
confidence intervals for the regression parameters, we first extract the
parameters and the standard errors:
summary(g1)$coefficients gives the coefficients, standard errors, t and
p values as a matrix. We extract the first two columns using
beta <- summary(g1)$coefficients[,1]
se.beta <- summary(g1)$coefficients[,2]
We next compute the critical value of the t-statistic with error d.f.
t95 <- qt(0.975, g1$df.residual)
and the individual confidence intervals as
ci.beta <- cbind(beta-t95*se.beta, beta+t95*se.beta)
ci.beta
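The hand computation above should agree with the built-in confint function. A sketch that checks this on the built-in mtcars data (a stand-in for gala):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

beta    <- coef(summary(fit))[, 1]          # estimates
se.beta <- coef(summary(fit))[, 2]          # standard errors
t95     <- qt(0.975, fit$df.residual)       # two-sided 95% critical value

manual  <- cbind(beta - t95 * se.beta, beta + t95 * se.beta)
builtin <- confint(fit, level = 0.95)

print(manual)
print(builtin)
```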
52. Confidence ellipsoids
• Now we construct the joint 95% confidence region for the coefficients
of Area and Elevation. Type
library(ellipse)
plot(ellipse(g1,c(2,3)),type="l")
• Add the origin and the point of the estimates:
points(0,0)
points(g1$coef[2],g1$coef[3],pch=18)
• Now we mark the one way confidence intervals on the plot for
reference:
abline(v=ci.beta[2,],lty=2)
abline(h=ci.beta[3,],lty=2)
53. Predictions
• Suppose we want to predict the number of species corresponding to
a hypothetical sample point with Area = 0.08, Elevation = 93,
Nearest = 6.0, Scruz = 12 and Adjacent = 0.34.
• This can be done with
predict(g1, data.frame(Area=0.08, Elevation=93, Nearest=6.0,
Scruz=12, Adjacent=0.34), se.fit=TRUE)
• predict(g1) without any additional arguments
will return the predicted values for the sample
data points.
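predict can also return interval estimates directly. The sketch below uses mtcars as a stand-in and an arbitrary new data point; it contrasts the interval for the mean response with the wider interval for a single new observation.

```r
fit     <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)     # hypothetical new point

point    <- predict(fit, new_car, se.fit = TRUE)
conf_int <- predict(fit, new_car, interval = "confidence", level = 0.95)
pred_int <- predict(fit, new_car, interval = "prediction", level = 0.95)

print(conf_int)   # interval for the mean response at the new point
print(pred_int)   # interval for one new observation

# The confidence interval is fit +/- t * se.fit
t95        <- qt(0.975, fit$df.residual)
manual_lwr <- unname(point$fit - t95 * point$se.fit)
```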
54. Generalized least squares
• Until now we have assumed that $\mathrm{var}(\varepsilon) = \sigma^2 I$, but it can happen that the
errors have non-constant variance or are correlated, in which case we
should fit a generalized least squares model.
• To illustrate this we will use a dataset called Longley’s regression data
where the response is the number of people employed, yearly from
1947 to 1962 and the predictors are GNP implicit price deflator, GNP,
unemployed, armed forces, non-institutionalized population 14 years
of age and over, and year.
• To attach and view the data type
data(longley)
names(longley)
55. Approach 1
• Assuming that the errors follow an autoregressive series of order one,
we can estimate the serial correlation as
data(longley)
g <- lm(Employed ~ GNP + Population,
data=longley)
cor(g$res[-1],g$res[-16])
• We now construct the $\Sigma$ matrix and compute the GLS estimate of $\beta$
along with its standard errors.
x <- model.matrix(g)
Sigma <- diag(16)
Sigma <- 0.31041^abs(row(Sigma)-col(Sigma))
Sigi <- solve(Sigma)
xtxi <- solve(t(x) %*% Sigi %*% x)
beta <- xtxi %*% t(x) %*% Sigi %*%
longley$Empl
beta
56. Approach 2
• Since we can write $\Sigma = SS^T$, where $S$ is a triangular matrix, using the
Cholesky decomposition, another approach would be to regress $S^{-1}y$
on $S^{-1}X$ as demonstrated below:
sm <- chol(Sigma)
smi <- solve(t(sm))
sx <- smi %*% x
sy <- smi %*% longley$Empl
lmsxsy <- lm(sy ~sx-1)
lmsxsy$coef
• Our initial estimate of the AR parameter is 0.31 but once we fit our GLS
model we can re-estimate it as
cor(lmsxsy$res[-1],lmsxsy$res[-16])
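Approaches 1 and 2 should give the same answer, and the standard errors can be obtained either way. The sketch below checks this on the built-in longley data. The variance estimate uses the weighted residual sum of squares, which is one reasonable choice and not necessarily the one the original slides had in mind.

```r
g   <- lm(Employed ~ GNP + Population, data = longley)
rho <- cor(residuals(g)[-1], residuals(g)[-16])   # lag-one residual correlation

x <- model.matrix(g)
y <- longley$Employed
Sigma <- rho^abs(row(diag(16)) - col(diag(16)))   # AR(1) correlation matrix
Sigi  <- solve(Sigma)

# Approach 1: direct formula
xtxi <- solve(t(x) %*% Sigi %*% x)
beta <- xtxi %*% t(x) %*% Sigi %*% y
res  <- y - x %*% beta
sigma2    <- as.numeric(t(res) %*% Sigi %*% res) / (nrow(x) - ncol(x))
se_direct <- sqrt(diag(xtxi) * sigma2)

# Approach 2: ordinary least squares on the transformed data
S_inv <- solve(t(chol(Sigma)))    # chol() returns the upper factor, so transpose
sx <- S_inv %*% x
sy <- S_inv %*% y
whitened <- lm(sy ~ sx - 1)
se_white <- coef(summary(whitened))[, "Std. Error"]

print(cbind(estimate = as.numeric(beta), se = se_direct))
```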
57. Approach 3
• The nlme library contains a GLS fitting function. We can use it to fit
this model:
library(nlme)
g <- gls(Employed ~GNP + Population,
correlation=corAR1(form= ~Year),
data=longley)
summary(g)
• We see that the estimated value of $\rho$ obtained using Restricted
Maximum Likelihood estimation is 0.64. You can also specify
method = "ML" in the call to gls for Maximum Likelihood
estimation of $\rho$.
58. Weighted least squares
• Sometimes the errors are uncorrelated, but have unequal variance where the
form of the inequality is known. Weighted least squares (WLS) can be used in
this situation.
• Here is an example from an experiment to study the interaction of certain
kinds of elementary particles on collision with proton targets. The experiment
was designed to test certain theories about the nature of the strong interaction.
The cross-section (crossx) variable is believed to be linearly related to the
inverse of the energy (energy, which has already been inverted). At each level of
the momentum, a very large number of observations were taken so that it was
possible to accurately estimate the standard deviation of the response (sd).
• Consider the following code
data(strongx)
strongx
• Define the weights and fit the model:
g <- lm(crossx ~energy, strongx, weights=sd^-2)
summary(g)
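To see what the weights do, note that WLS with weights 1/sd^2 is the same as ordinary least squares after dividing every term by sd. A sketch on simulated data (the variable names echo strongx, but the values are invented, and sdev is used in place of sd to avoid masking the sd function):

```r
set.seed(42)
energy <- seq(0.05, 0.35, length.out = 10)   # hypothetical predictor values
sdev   <- seq(2, 10, length.out = 10)        # known error standard deviations
crossx <- 150 + 500 * energy + rnorm(10, sd = sdev)

# Weighted least squares with weights proportional to 1 / variance
wls <- lm(crossx ~ energy, weights = sdev^-2)

# Same fit by OLS on the rescaled response, intercept column and predictor
scaled <- lm(I(crossx / sdev) ~ 0 + I(1 / sdev) + I(energy / sdev))

print(coef(wls))
print(coef(scaled))
```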
59. Diagnostics : residuals and leverage
• Let's illustrate these diagnostics using an interesting economic dataset on 50
different countries.
• These data are averages over 1960-1970 on
dpi = per-capita disposable income in U.S. dollars;
ddpi = the percent rate of change in per capita disposable
income;
sr = aggregate personal saving divided by disposable income.
pop15 = percentage population under 15
pop75 = percentage population over 75
• The data come from Belsley, Kuh, and Welsch (1980).
• First take a look at the data:
data(savings)
savings
60. Diagnostics : outlier identification
• Consider the regression
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi,
savings)
And a plot of the residuals :
plot(g$res,ylab="Residuals",main="Index plot of residuals")
countries <- row.names(savings)
To identify outliers use
identify(1:50,g$res,countries)
61. Diagnostics : Leverage
• Now look at the leverage: We first extract the X-matrix here using
model.matrix() and then compute and plot the leverages or so called
"hat" values:
x <- model.matrix(g)
lev <- hat(x)
par(mfrow=c(1,1))
plot(lev,ylab="Leverages",main="Index plot of Leverages")
abline(h=2*5/50)
• Notice that the sum of the leverages is equal to 5 for this data. Which countries
have large leverage? We have marked a horizontal line at 2p/n.
• Alternatively type
names(lev) <- countries
lev[lev > 0.2]
• The command names() assigns the country names to the elements of the
vector lev making it easier to identify them. Alternatively, we can do it
interactively like this identify(1:50,lev,countries)
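The leverages can also be obtained with hatvalues(), which keeps the row names. This sketch uses the built-in LifeCycleSavings data, which appears to hold the same five variables as the savings data (that correspondence is an assumption here).

```r
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

lev <- hatvalues(g)            # leverages, named by country
p   <- length(coef(g))         # number of regression parameters
n   <- nrow(LifeCycleSavings)  # number of countries

print(sum(lev))                # leverages sum to p
high <- lev[lev > 2 * p / n]   # the 2p/n rule of thumb
print(sort(high, decreasing = TRUE))
```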
62. Diagnostics : residual plots
• On the previous slide we plotted the raw residuals. Two alternative classes
of residuals are the studentized residuals and the jackknife residuals. To
get a plot of all three types on the same page type
par(mfrow=c(3,1))
plot(g$res,ylab="Residuals",main="Index plot of residuals")
gs <- summary(g)
stud <- g$res/(gs$sig*sqrt(1-lev))
plot(stud,ylab="Studentized Residuals",main="Studentized Residuals")
jack <- rstudent(g)
plot(jack,ylab="Jackknife Residuals",main="Jackknife Residuals")
Jackknife residuals can be used for outlier
detection. Type jack[abs(jack)==max(abs(jack))] to
identify the largest value. A critical value for
these jackknife residuals with Bonferroni
correction for multiple testing can be computed
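One common way to compute such a critical value is sketched below: a two-sided test at level 0.05, Bonferroni-corrected for n tests, with n - p - 1 degrees of freedom for the jackknife residuals. The built-in LifeCycleSavings data are again assumed to match the savings data.

```r
g    <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
jack <- rstudent(g)                      # jackknife residuals

n <- nrow(LifeCycleSavings)
p <- length(coef(g))
alpha <- 0.05

# Bonferroni: split alpha over n two-sided tests
crit <- qt(1 - alpha / (2 * n), df = n - p - 1)
print(crit)

largest <- jack[abs(jack) == max(abs(jack))]
print(largest)

flagged <- jack[abs(jack) > crit]        # residuals beyond the critical value
print(flagged)
```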
64. Multicollinearity
• The Longley dataset is a good example of collinearity:
• Check the correlation matrix first using round(cor(longley[,-7]),3)
• Now we check the eigendecomposition:
x <- as.matrix(longley[,-7])
e <- eigen(t(x) %*% x)
sqrt(e$val[1]/e$val)
• One option could be to use ridge regression as
implemented by lm.ridge in the MASS library.
Type
library(MASS)
? lm.ridge
for details.
65. Variable selection
• The step function with the option direction = "forward",
direction="backward" or direction="both" can be used for forward,
backward or stepwise variable selection. It compares models using AIC
rather than p-values.
• The leaps function in the leaps library implements all-subsets variable
selection based on criteria such as the adjusted R^2 and Mallows' Cp.
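A minimal sketch of backward and forward selection with step. The built-in swiss data are used as a stand-in, and trace = 0 only suppresses the step-by-step printout.

```r
full <- lm(Fertility ~ Agriculture + Examination + Education +
             Catholic + Infant.Mortality, data = swiss)
null <- lm(Fertility ~ 1, data = swiss)

# Backward elimination starting from the full model
back <- step(full, direction = "backward", trace = 0)

# Forward selection starting from the intercept-only model
fwd <- step(null, direction = "forward", trace = 0,
            scope = list(lower = ~ 1,
                         upper = ~ Agriculture + Examination + Education +
                                   Catholic + Infant.Mortality))

print(formula(back))
print(formula(fwd))
```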
66. Transformations
• Transformations of the response and predictors can improve the fit and
correct violations of model assumptions such as constant error
variance.
• A popular method is to use the Box-Cox transformation.
• Consider the Galapagos Islands dataset analyzed earlier:
library(faraway) # provides the gala data
data(gala)
g <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, gala)
library(MASS)
boxcox(g, plotit=TRUE)
boxcox(g, lambda=seq(0.0,1.0,by=0.05), plotit=TRUE)
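The plot can also be turned into a number. The following sketch assumes the built-in trees data (so that it runs without the faraway library) and extracts the value of lambda that maximises the profile log-likelihood; bc.transform is a hypothetical helper written for this illustration.

```r
library(MASS)                                    # provides boxcox

fit <- lm(Volume ~ Height + Girth, data = trees)
bc <- boxcox(fit, lambda = seq(-1, 1, by = 0.05), plotit = FALSE)
lambda.hat <- bc$x[which.max(bc$y)]              # lambda with the highest profile log-likelihood

# Hypothetical helper: Box-Cox transform, with the log as the limit at lambda = 0
bc.transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Refit the model on the transformed response
fit.bc <- lm(bc.transform(Volume, lambda.hat) ~ Height + Girth, data = trees)
lambda.hat
```

In practice a convenient nearby value (such as a square root or a log) is often preferred to the exact maximiser, provided it lies inside the confidence interval shown on the plot.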
67. Transformations
Alternatively we can transform the predictors using various functions such
as standard polynomials, orthogonal polynomials, splines etc. The following
code fits a non-linear regression of Species on Scruz using natural splines.
data(gala)
library(splines)
g4 <- lm(Species ~ ns(Scruz, df=4), gala)
To view the results type
scruz.ord <- order(gala$Scruz)
plot(gala$Scruz, gala$Species, xlab="Scruz", ylab="Species", lwd=2)
points(gala$Scruz[scruz.ord], predict(g4)[scruz.ord], col="yellow", type="l")
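The same idea can be tried without the faraway library. This sketch assumes the built-in cars data and an arbitrary set of degrees of freedom; it fits one natural spline per setting and evaluates each fitted curve on a grid.

```r
library(splines)                               # provides ns

dfs <- c(2, 4, 8)                              # degrees of freedom to compare (assumed values)
fits <- lapply(dfs, function(d) lm(dist ~ ns(speed, df = d), data = cars))

# Residual sum of squares for each fit
rss <- sapply(fits, function(f) sum(resid(f)^2))

# Evaluate each fitted curve on a regular grid of speeds
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 100))
curves <- sapply(fits, function(f) predict(f, newdata = grid))

data.frame(df = dfs, rss = rss)
```

The columns of curves can be drawn over a scatterplot with matlines(grid$speed, curves) to see how the fit becomes more wiggly as the degrees of freedom increase.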
68. Transformations
You can also look at the effect of increasing the
degrees of freedom
g8 <- lm(Species ~ ns(Scruz, df=8), gala)
points(gala$Scruz[scruz.ord], predict(g8)[scruz.ord], col="orange", type="l")
g12 <- lm(Species ~ ns(Scruz, df=12), gala)
points(gala$Scruz[scruz.ord], predict(g12)[scruz.ord], col="red", type="l")
g20 <- lm(Species ~ ns(Scruz, df=20), gala)
points(gala$Scruz[scruz.ord], predict(g20)[scruz.ord], col="purple", type="l")
leg <- c("4 df", "8 df", "12 df", "20 df")
legend(200, 200, leg, lty=1, col=c("yellow", "orange", "red", "purple"))
69. Exercise
• The variable Species in the gala dataset is actually a count and one
should ideally fit a generalized linear model from the Poisson family
(log link).
– Use the glm function to fit an appropriate generalized linear model
to the data accounting for possible overdispersion.
– Use predict.glm to plot the predicted values corresponding to the
sample data together with prediction intervals.
– Perform post model fitting diagnostics using some of the following
functions from the car and stats libraries :
• cooks.distance (stats; called cookd in older versions of car),
• dfbeta and dfbetas (stats)
• dffits (stats)
• influence.measures(stats)
• outlierTest (car; called outlier.test in older versions).
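As a hint for the kind of code involved, here is a sketch on a different dataset. It assumes the built-in warpbreaks data and uses the quasipoisson family as one possible way of allowing for overdispersion. The intervals computed are approximate confidence intervals for the mean, not prediction intervals.

```r
# Assumption: warpbreaks stands in for the gala data of the exercise
fit <- glm(breaks ~ wool + tension, family = quasipoisson, data = warpbreaks)
disp <- summary(fit)$dispersion          # estimates well above 1 suggest overdispersion

# Predictions and standard errors on the link (log) scale
pr <- predict(fit, type = "link", se.fit = TRUE)

# Approximate 95% confidence limits for the mean, back-transformed to the count scale
lower <- exp(pr$fit - 1.96 * pr$se.fit)
upper <- exp(pr$fit + 1.96 * pr$se.fit)

# Influence diagnostics from the stats library
infl <- influence.measures(fit)
disp
```

Building the interval on the link scale and then exponentiating keeps the limits positive, which an interval built directly on the count scale would not guarantee.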
71. When do we need to programme?
• To define user defined functions.
• To create libraries.
• For simulations.
72. A sample programme
Here is a sample programme which simulates the distribution of the median
of a sample from a Cauchy distribution, computes a Monte Carlo estimate of
its MSE and compares the histogram of the simulated values with its
asymptotic distribution. It also displays the time taken to perform the
simulations.
n <- 10
nsim <- 10000
theta.hat <- double(nsim)
start <- proc.time() # Start the clock
for (i in 1:nsim) {
  x <- rcauchy(n)
  theta.hat[i] <- median(x)
}
mean(theta.hat^2)
cat("Calculation took", (proc.time() - start)[1], "seconds.\n")
hist(theta.hat, freq = FALSE, breaks = 100)
curve(dnorm(x, sd = sqrt(mean(theta.hat^2))), add = TRUE)
curve(dnorm(x, sd = sqrt(1 / (4 * n * dcauchy(0)^2))), add = TRUE, col = "red")
73. Programming style
Make the programme as generic as possible.
For example, to generate sample averages for 200 samples of size 20
from the standard normal distribution, one possibility is
samp.mean <- numeric(200)
for (i in 1:200)
{
  samp.mean[i] <- mean(rnorm(20))
}
A more generic alternative is
sample.size <- 20
simulation.size <- 200
samp.mean <- numeric(simulation.size)
for (i in 1:simulation.size)
{
  samp.mean[i] <- mean(rnorm(sample.size))
}
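Going one step further, the settings can become arguments of a function. The function sim.means below is a hypothetical illustration, not part of the original slides; it also lets the caller choose the distribution.

```r
# Hypothetical helper: means of simulation.size samples, each of size sample.size,
# drawn with the random number generator rdist (extra arguments are passed on to it)
sim.means <- function(simulation.size = 200, sample.size = 20, rdist = rnorm, ...) {
  vapply(seq_len(simulation.size),
         function(i) mean(rdist(sample.size, ...)),
         numeric(1))
}

set.seed(1)
normal.means <- sim.means()                       # 200 means of 20 standard normals
unif.means <- sim.means(100, 5, rdist = runif)    # 100 means of 5 uniforms
shifted.means <- sim.means(50, 10, mean = 3)      # mean = 3 is passed on to rnorm
```

Changing the experiment now means changing an argument rather than editing the body of a loop.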
74. Programming style
• Indent.
Consider
samp.mean <- matrix(NA, simulation.size, num.times)
for (i in 1:simulation.size)
{
    for (j in 1:num.times)
    {
        samp.mean[i,j] <- mean(rnorm(mean=j, sd=1, n=samp.size))
    }
}
as opposed to
samp.mean <- matrix(NA, simulation.size, num.times)
for (i in 1:simulation.size)
{
for (j in 1:num.times)
{
samp.mean[i,j] <- mean(rnorm(mean=j, sd=1, n=samp.size))
}
}
75. Programming style
Give meaningful variable names.
The programme
x <- matrix(NA, m, n)
for (i in 1:m)
{
  for (j in 1:n)
  {
    x[i,j] <- mean(rnorm(mean=j, sd=1, n=K))
  }
}
is perfectly valid but not as user friendly.
• When choosing names be careful not to overwrite inbuilt R functions.
For example, defining your own function called lm will mask the built-in
lm function. If this happens type rm(lm) to restore it. A data object
called lm does not break calls to lm( ), but it is still confusing.
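The masking behaviour can be checked directly. This small sketch uses mean rather than lm so that it runs without a model; the object names are illustrative.

```r
mean <- function(x) "masked"     # a user function now hides the built-in mean
masked.result <- mean(1:3)       # calls the user function

rm(mean)                         # remove the user copy; the built-in mean is visible again
restored.result <- mean(1:3)

c <- 7:9                         # a data object named c does not hide the function c()
still.works <- c(1, 2)           # R skips non-functions when looking up a function call
rm(c)
```

Only a function can mask a function in a call, but reusing the name of a common function for data still makes code harder to read.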
76. Programming style
Use matrices for fast computation.
The commands executed by the following lines of code
A <- matrix(0,500,500)
for (i in 1:500)
for (j in 1:500)
A[i,j] <- i + j
can be run faster and in a more elegant manner using
I.mat <- matrix(seq(1,500), nrow=500, ncol=500)
A <- I.mat + t(I.mat)
Though R is generally better with loops than S-PLUS, such vectorised
matrix programming is essential for fast computation.
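The claim can be checked with system.time. This sketch uses a smaller matrix than the slide so that it runs quickly, and adds outer as a third way of building the same matrix; timings will differ between machines.

```r
n <- 200

# Double loop, as on the slide
loop.time <- system.time({
  A.loop <- matrix(0, n, n)
  for (i in 1:n)
    for (j in 1:n)
      A.loop[i, j] <- i + j
})

# Matrix version from the slide
mat.time <- system.time({
  I.mat <- matrix(seq(1, n), nrow = n, ncol = n)
  A.mat <- I.mat + t(I.mat)
})

# outer applies "+" to every pair (i, j)
A.outer <- outer(1:n, 1:n, "+")

all.equal(A.loop, A.mat)
all.equal(A.loop, A.outer)
rbind(loop = loop.time, matrix = mat.time)
```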
77. Programming style
Add comments.
The # sign can be used to insert comments. For example :
simulation.size <- 200 # Set simulation size
samp.size <- 20 # Set sample size
num.times <- 10 # Set number of repetitions
samp.mean <- matrix(NA, simulation.size, num.times) # Storage for the results
for (i in 1:simulation.size)
{
  for (j in 1:num.times)
  {
    # Calculate the sample mean for the (i,j)th sample
    samp.mean[i,j] <- mean(rnorm(mean=j, sd=1, n=samp.size))
  }
}
78. Variable types in R
• Numeric
– Real / Floating point
• stored in double precision: about 15 significant digits
• single precision (about 7 significant digits) is not a separate R type;
as.single exists only for passing data to C or Fortran code
– Integer
x <- 6
is.double(x)
x <- as.integer(x)
is.double(x)
is.integer(x)
• Logical
x <- c(1,2,3,4,5); y <- (x<3)
• Character String
x <- c("North", "South", "East", "West")
• List : collection of several objects of any type
x1 <- c("North", "South", "East", "West")
x2 <- c(2,3,5,8)
x <- list(x1,x2)
• Complex arithmetic is also supported in R
z <- complex(real = rnorm(100), imag = rnorm(100))
Re(z)
Im(z)
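The types listed above can be inspected with typeof. A short sketch with illustrative object names:

```r
x <- 6
typeof(x)                      # "double": numbers typed at the prompt are stored as doubles
xi <- as.integer(x)
typeof(xi)                     # "integer"

flags <- c(1, 2, 3, 4, 5) < 3  # logical vector
directions <- c("North", "South", "East", "West")

# A list can hold components of different types; naming them makes access easier
info <- list(direction = directions, value = c(2, 3, 5, 8))
info$value[2]                  # 3
info[["direction"]][4]         # "West"

z <- complex(real = 1, imaginary = -2)
Mod(z)                         # modulus, sqrt(5)
```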
79. Vectors and matrices in R
• Vectors
x <- c(45, 90, 135 )
x
y <- c("North", "South", "East", "West")
y
x*2
length(x)
sum(x)
• When the values follow a systematic pattern you can save typing
x <- rep(2.1, 30)
y <- rep("North", 5)
x <- 1:10
x <- seq(1,10,2)
80. Vectors and matrices in R
• Matrices
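a <- 1:3
b <- 4:6
c <- 7:9 # a vector called c; the function c( ) still works
X <- cbind(a,b,c) # a, b and c as columns
X
dim(X)
Y <- rbind(a,b,c) # a, b and c as rows
Y
dim(Y)
X + Y # elementwise sum
X*Y # elementwise product
X%*%Y # matrix product
Z <- matrix(c(1,4,6,2,3,7.8), nrow=2, ncol=3, byrow=TRUE) # fill row by row
Z <- matrix(c(1,4,6,2,3,7.8), nrow=2, ncol=3, byrow=FALSE) # fill column by column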
81. Data frames
• Most functions such as lm, glm, survreg, coxph etc will operate on data
frames.
• If the data is read in using a command such as read.csv or read.table, it
will automatically be stored as a data frame.
• If the data is read in from the keyboard, a data frame can be created as
follows.
length <- c(20, 24, 19, 24, 18, 30)
wt <- c(10, 14, 14, 12, 12.5, 17)
mydata <- data.frame(length, wt)
To see the variable names type names(mydata) . To access the length
variable use mydata$length. Alternatively you can attach the data set
using attach(mydata) in which case you can simply type length.
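A short sketch of working with this data frame without attaching it. The use of with and of the data argument is a suggestion rather than something the slides prescribe.

```r
mydata <- data.frame(length = c(20, 24, 19, 24, 18, 30),
                     wt = c(10, 14, 14, 12, 12.5, 17))
str(mydata)                                # structure: 6 observations of 2 variables

mean.length <- with(mydata, mean(length))  # evaluate an expression inside the data frame
heavy <- mydata[mydata$wt > 12, ]          # rows with weight above 12
fit <- lm(wt ~ length, data = mydata)      # modelling functions take a data argument
coef(fit)
```

Using with or the data argument avoids the name clashes that attach can cause, for example between the variable length and the function length.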
82. Programming Loops
• for loop :
for (i in 1:10)
{
  ….. R code …..
}
• while loop :
while (logical condition)
{
  ….. R code …..
}
• if statement :
if (logical condition)
{
  ….. R code …..
}
• if else statement :
if (logical condition)
{
  ….. R code …..
} else
{
  ….. R code …..
}
• The command break exits from a loop before completion and next skips to
the following iteration; stop aborts execution altogether with an error
message.
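The templates above can be filled in as follows. The tasks are arbitrary illustrations.

```r
# for loop: squares of 1 to 5
squares <- numeric(5)
for (i in 1:5) {
  squares[i] <- i^2
}

# while loop: halve x until it drops below 1, counting the steps
x <- 100
steps <- 0
while (x >= 1) {
  x <- x / 2
  steps <- steps + 1
}

# if, next and break: sum the odd numbers up to 10
odd.sum <- 0
for (i in 1:100) {
  if (i > 10) break           # leave the loop early
  if (i %% 2 == 0) next       # skip even numbers
  odd.sum <- odd.sum + i
}

# if else: note that else must follow the closing brace on the same line
size <- if (odd.sum > 20) "large" else "small"
```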
83. Some useful commands for programming
• Numerical solution of equations : uniroot, polyroot, optimize, nlm
• Alternatives to loops : apply, tapply, outer
• Matrix inversion and solution of linear equations : solve, solve.qr,
chol2inv, backsolve, qr.solve
• General matrix functions : eigen, svd, det
• Sorting : sort, order, rank
• Rounding : ceiling, floor, round, trunc, signif
• Saving output and running scripts : write.matrix (MASS), sink, postscript,
pdf, source
• Numerical settings : .Machine
• Random number generation : .Random.seed, RNGkind, set.seed
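A few of these commands in action. The functions and matrices used are small made-up examples.

```r
# Numerical solution of equations: root of x^2 - 2 on [0, 2]
root <- uniroot(function(x) x^2 - 2, interval = c(0, 2), tol = 1e-10)$root

# One-dimensional minimisation
opt <- optimize(function(x) (x - 3)^2, interval = c(0, 10))$minimum

# Alternatives to loops
M <- matrix(1:6, nrow = 2)
row.sums <- apply(M, 1, sum)                            # sum over each row
grp.means <- tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), mean)

# Solution of the linear system A %*% sol = b
A <- matrix(c(2, 1, 1, 3), 2, 2)
b <- c(1, 2)
sol <- solve(A, b)

# Setting the seed makes random numbers reproducible
set.seed(42); u1 <- runif(3)
set.seed(42); u2 <- runif(3)
identical(u1, u2)
```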
84. User defined functions
• Suppose we have decided on a favourite plotting set up which uses
blue dashed lines on a yellow background, square plotting characters
and prints the variable names parallel to each axis.
• Instead of retyping the options for each plot we can create a function
which uses these settings and also returns the summaries for each
variable.
myplot <- function(x, y, bgd = "lightyellow")
{
  opar <- par(bg=bgd) # Set the background and remember the old setting
  on.exit(par(opar)) # Restore the old setting when the function exits
  plot(x, y, pch=22, col="blue", las=1)
  myplot.out <- summary(data.frame(cbind(x,y)))
  return(myplot.out)
}
• To use this function, first paste it into R and then type
x1 <- rnorm(20)
x2 <- rnorm(20)
out <- myplot(x1,x2)
You can also use myplot(x1, x2, bgd="grey")
To see the summary type out
85. User defined functions
• Here is a slightly more complicated function to calculate the number of
runs of 1’s in a binary sequence
f <- function (x, v=1)
{
x <- diff(x==v)
x <- x[x!=0]
if (x[1]==1) sum(x==1)
else 1+sum(x==1)
}
Now generate some data
n <- 50
x <- sample(0:1, n, replace=TRUE, prob=c(.2,.8))
x
To see the number of runs in the sequence type f(x,1)
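An alternative sketch uses the base function rle (run length encoding), which also copes with sequences that never change value. The name count.runs is illustrative and not part of the slides.

```r
# Count the runs of the value v in the sequence x
count.runs <- function(x, v = 1) {
  r <- rle(x)            # values and lengths of consecutive runs
  sum(r$values == v)     # number of runs whose value equals v
}

count.runs(c(1, 1, 0, 1, 0, 0, 1))   # 3
count.runs(c(0, 0, 0))               # 0
count.runs(c(1, 1, 1))               # 1
```

The diff based function f assumes that the sequence changes value at least once; for a constant sequence x[1] is NA after the subsetting step and the if statement fails.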
86. Example 1
• Let us write a programme which will compare the power of the two
sample t-test with that of the Wilcoxon and Kolmogorov - Smirnov
tests when the underlying data are normal.
87. Example 1
• Let us first open an R script, name it example and save the script in the R
home directory.
• It is good practice to add in a descriptive header giving the purpose of the
programme and the date on which it was last modified.
#### R programme for simulating the power of the two sample t test vs various
#### non-parametric alternatives
#### 21/7/06
• We next need to specify the sample size and the number of simulations to
be run with
sim.size <- 200
sample.size <- 10
• We shall set the mean of the first population to zero and run the simulation
for a range of values of the difference in means with
mu1 <- 0
delta <- seq(-2,2, length=50)
• We also need to set the seed so as to be able to reproduce the random
number generation.
set.seed(231)
88. Example 1
• Our programme will then look like this:
for (j in 1:length(delta))
{
# Set mean of second population
for (i in 1:sim.size)
{
# Generate ith sample
# Perform ith set of tests
# Check if the test rejects the null hypothesis of equality
}
# Calculate the simulated power
}
89. Example 1
So our programme now looks like this:
sim.size <- 200; sample.size <- 10;
set.seed(231)
mu1 <- 0; delta <- seq(-2,2, length=50)
for (j in 1:length(delta))
{
mu2 <- mu1 + delta[j]
for (i in 1:sim.size)
{
}
# Calculate power for jth setting
} # End of j loop
90. Example 1
• Let us now define variables which will hold the simulated powers
sim.size <- 200; sample.size <- 10;
set.seed(231)
mu1 <- 0; delta <- seq(-2,2, length=50)
pow.ttest <- NULL
pow.wtest <- NULL
pow.kstest <- NULL
for (j in 1:length(delta))
{
mu2 <- mu1 + delta[j]
pt.test <- pw.test <- pks.test <- logical(sim.size) # Rejection indicators
for (i in 1:sim.size)
{
# Calculate pt.test[i], pw.test[i], pks.test[i]
}
pow.ttest[j] <- sum(pt.test)/sim.size # Calculate powers for jth setting
pow.wtest[j] <- sum(pw.test)/sim.size
pow.kstest[j] <- sum(pks.test)/sim.size
} # End of j loop
91. Example 1
• The inner simulation loop looks like this
for (i in 1:sim.size)
{
# Generate ith sample
samp1 <- rnorm(mean=mu1,sample.size)
samp2 <- rnorm(mean=mu2,sample.size)
# Perform ith set of tests
test1 <- t.test(samp1, samp2,alternative = c("two.sided"))
pt.test[i] <- (test1$p.value < 0.05)
test2 <- wilcox.test(samp1, samp2,alternative = c("two.sided"),
exact = TRUE)
pw.test[i] <- (test2$p.value < 0.05)
test3 <- ks.test(samp1, samp2,alternative = c("two.sided"),
exact = TRUE)
pks.test[i] <- (test3$p.value < 0.05)
}
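Putting the pieces together, the whole simulation might look like the sketch below. It is an assembled illustration and not the twosamp.r script itself: it stores the rejection indicators in a matrix, and it uses a coarser grid of delta values than the slides so that it runs quickly.

```r
sim.size <- 200; sample.size <- 10
set.seed(231)
mu1 <- 0
delta <- seq(-2, 2, length = 9)          # coarser grid than on the slides (assumption)

power <- matrix(NA, length(delta), 3,
                dimnames = list(NULL, c("t", "wilcoxon", "ks")))

for (j in seq_along(delta)) {
  mu2 <- mu1 + delta[j]                  # mean of the second population
  reject <- matrix(FALSE, sim.size, 3)   # rejection indicators for the jth setting
  for (i in 1:sim.size) {
    samp1 <- rnorm(sample.size, mean = mu1)
    samp2 <- rnorm(sample.size, mean = mu2)
    reject[i, 1] <- t.test(samp1, samp2)$p.value < 0.05
    reject[i, 2] <- wilcox.test(samp1, samp2, exact = TRUE)$p.value < 0.05
    reject[i, 3] <- ks.test(samp1, samp2, exact = TRUE)$p.value < 0.05
  }
  power[j, ] <- colMeans(reject)         # simulated power for the jth setting
}

round(cbind(delta, power), 2)
```

A call such as matplot(delta, power, type = "l", lty = 1) would draw the three power curves on one graph.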
92. Example 1
• The complete programme has been saved as an R script called
twosamp.r in the R home directory. Open the file using the File menu
in R and run the simulation using source("twosamp.r")
• The code will automatically save plots of the simulated power in a pdf
file called twosamp.pdf also in the R home directory.
93. Example 2
• Now run simulations which will look at the robustness of the two
sample t-test to the following assumptions:
– Homoscedasticity
• Simulate two normal distributions with different standard deviations and plot
the level as a function of the ratio of sd’s. Make several plots (on the same
graph) corresponding to several choices of the standard deviation of the first
population.
– Normality
• Simulate data from two logistic distributions and find an estimate of the level
of the test. Does the level vary with sample size?
– Independence
• Simulate two correlated normal distributions and plot the level as a function
of the correlation coefficient.
The package mvtnorm will be required to generate bivariate normal data.
94. Example 3
• Write a function which will calculate the number of runs in a binary
sequence of arbitrary length.
Hint : Use the diff function.
95. Additional Exercises
1. Generate Bernoulli data with n = 100 and p = .25, p = .05 and p =
.01. Is the data approximately normal in each case?
2. Sketch the distribution of the standardized average for data
generated from the uniform [0; 1] distribution. Compare the
histograms when n is 5, 10, 25 and 100.
3. Write a function which will compute a Monte Carlo estimate of the
ratio of the variances for the mean and the median for the
(a) N(0,1) distribution
(b) t distribution with 2 df.
Use the vioplot function in the vioplot library to create side by side
vioplots of the simulated distributions of the mean and the median
in the two cases.
96. Additional Exercises
4 (a). Search the stats library in R for a list of parametric and non-
parametric tests.
(b) Generate 200 standard normal variables and perform the one sample
t-test on the data.
(c) Repeat the steps in (b) 1000 times and draw a histogram of the
resulting p-values.
(d) On the same graph plot the power of the one sample t-test as a
function of the true mean using
(i) simulation
(ii) the R function power.t.test
(iii ) an analytical expression for power
5. Plot the density of the t distribution for degrees of freedom = 1,2,5,100
and the standard normal density in different colours on the same graph.
Add a legend and title to the plot.
97. Additional Exercises
6. The sleep dataset in R shows the number of hours of extra sleep after
administration of a sleeping drug.
(a) Perform a two sample t-test on the data.
(b) Perform a two sample Wilcoxon test.
(c) Perform an analysis of variance assuming normality.
(d) Perform a non parametric analysis of variance using a Kruskal Wallis test.
(e) Compare the variances of the two groups using an F test.
7. Plot the density of the chi-squared distribution for 1-10 degrees of freedom in
different colours on the same graph. Add a legend and title to the plot.
98. Additional Exercises
8. Consider the variable eruptions giving eruption lengths of the Old Faithful
geyser recorded in the data set faithful.
(a) Draw a histogram of the data.
(b) Write a function f which takes as input the means (m1,m2) and sd’s
(s1,s2), the mixing proportion (p) and the data point (x) and returns the
value of the corresponding mixture normal likelihood at the point.
(c) Now write a function fn which uses the function f to calculate the
likelihood for the entire data set.
(d) Use the function optim to get maximum likelihood estimates.
(e) Superimpose the sample and theoretical density on the histogram
and add a legend to the plot.
99. Additional Exercises
9. Create a data frame called Manitoba.lakes that contains the lake’s elevation
(in meters above sea level) and area (in square kilometers) as listed below.
Assign the names of the lakes using the row.names( ) function.
                 elevation   area
Winnipeg               217  24387
Winnipegosis           254   5374
Manitoba               248   4624
SouthernIndian         254   2247
Cedar                  253   1353
Island                 227   1223
Gods                   178   1151
Cross                  207    755
Playgreen              217    657
(a) Plot log2(area) versus elevation. Add labeling information using the text
command with the labels option.
(b) Use the R function dotchart( ) to display the areas of the Manitoba lakes
(i) on a linear scale,
and (ii) on a logarithmic scale.
Add, in each case suitable labeling information.
100. Additional Exercises
10. (a) Use the nlm function to numerically minimise the function
f(x,y,z) = sin(x) - sin(y-4) + z^2 + 2.
(b) If gradient information is not supplied, nlm will use a
matrix-secant method which numerically approximates the
gradient. To use gradient information, redefine the function so
as to additionally contain an attribute called the gradient.
Now perform minimisation using the quasi-Newton method.
(c) Use the integrate function in R to find the constant of
integration, c, for the posterior density function
c * exp[-1/2 {(0.12-x)^2 + (0.07-x)^2 + (0.08-x)^2}]
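The mechanics of parts (b) and (c) can be sketched on simpler functions, so as not to give the exercise away. The objective h and the kernel below are made-up examples.

```r
# A function for nlm whose returned value carries a "gradient" attribute.
# Hypothetical objective with its minimum at (1, -2).
h <- function(p) {
  value <- (p[1] - 1)^2 + (p[2] + 2)^2
  attr(value, "gradient") <- c(2 * (p[1] - 1), 2 * (p[2] + 2))
  value
}
fit <- nlm(h, p = c(0, 0))       # nlm uses the attribute instead of numerical differences
fit$estimate

# Normalising constant of an unnormalised density: 1 / integral of the kernel.
# For this kernel the answer is known to be 1 / sqrt(2 * pi).
kernel <- function(x) exp(-x^2 / 2)
const <- 1 / integrate(kernel, -Inf, Inf)$value
const
```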
102. What is a library?
• An R library or package is a collection of programmes with a common
objective. To see a list of packages available by default type search( )
at the R prompt.
• Some commonly used packages are base, graphics, stats, mgcv, nlme,
survival, Hmisc etc.
• R also has some very specialised packages. Examples include
– boot (bootstrap / jackknife)
– EbayesThresh (empirical Bayes thresholding),
– mAr (multivariate autoregressive analysis)
– neural (neural networks)
– nlqr (non linear quantile regression)
– portfolio (analysing equity portfolios) etc.
103. R contributed libraries
• A complete list of contributed packages is available on the R website under the link
Contributed extension packages. The list has also been saved to the file Available
Bundles and Packages.doc on the Desktop.
• The R News site, also available from the website, provides a discussion of new
packages and updates to old packages.
• R also has summaries called CRAN Task Views for the following specialised subjects
– Cluster Cluster Analysis & Finite Mixture Models
– Econometrics Computational Econometrics
– Environmetrics Analysis of ecological and environmental data
– Finance Empirical Finance
– Genetics Statistical Genetics
– MachineLearning Machine Learning & Statistical Learning
– Multivariate Multivariate Statistics
– SocialSciences Statistics for the Social Sciences
– Spatial Analysis of Spatial Data
– gR gRaphical models in R
• The ctv package can be used to install the packages listed in a CRAN Task View.
104. Downloading libraries
The first option is to go to one of the CRAN mirror sites. A complete listing of
libraries is available by following the link for contributed extension packages.
Clicking on the desired library will lead to a download page such as
bivpois: Bivariate Poisson Models Using The EM Algorithm
Functions for fitting Bivariate Poisson Models using the EM algorithm.
Details can be found in Karlis and Ntzoufras (2003, RSS D & 2004,AUEB Technical Report)
Version: 0.50-2
Depends: R (>= 2.0.1)
Date: 2005-08-25
Author: Dimitris Karlis and Ioannis Ntzoufras
Maintainer: Ioannis Ntzoufras
License: GPL (version 2 or later)
URL: http://www.stat-athens.aueb.gr/~jbn/papers/paper14.htm
Package source: bivpois_0.50-2.tar.gz
Windows binary: bivpois_0.50-2.zip
Reference manual: bivpois.pdf
105. Downloading libraries
• Download the .zip file. To install the library you can either unzip the
file and copy it to the library folder in the R home directory,
• or open an R session and choose Install from local zip file
from the Package option on the menu.
• This will install the library as well as the corresponding help
documentation. For additional documentation you can visit the cited
URL.
• If the machine has an internet connection a simpler way to install a
package is to choose Set CRAN mirror from the Package menu and
then choose Install package.
• On installation function files in a library will all be copied to the
library folder in the R home directory. Documentation will be copied
to the Doc subfolder within each library.
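Installation can also be done from the R prompt. The sketch below installs a package only if it is missing and then loads it; the package name and the mirror address are illustrative choices, and the install step needs an internet connection.

```r
pkg <- "MASS"                                    # package to load (example choice)

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg, repos = "https://cloud.r-project.org")
}

library(pkg, character.only = TRUE)              # character.only lets us pass the name as a string
pkg.version <- as.character(packageVersion(pkg))
pkg.version
```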
106. The library command
• Consider the following uses of the library command
– library( ) # list all available packages
– library(lib.loc = .Library) # list all packages in the default library
– library(help = stats) # documentation on package 'stats'
– library(faraway) # load package 'faraway'
– require(faraway) # the same
– library(help=faraway) # documentation on package 'faraway'
– search( ) # lists loaded packages
• Another useful command available for some packages is demo( ). Try
– demo(package = .packages(all.available = TRUE))
– demo(glm.vr, package="stats")
– demo(persp, package="graphics")
107. Some R libraries
• In this tutorial we shall consider the following libraries:
– TeachingDemos : Demonstrations for teaching
– Matrix : A Matrix package for R
– MCMCpack : Bayesian inference via Markov chain Monte Carlo
108. The TeachingDemos library
• As suggested by the name, this library contains functions useful for
interactively demonstrating basic statistical concepts.
• The library has already been loaded onto your machine. Attach the
library using library(TeachingDemos)
• Check if the package has any inbuilt demos using
demo(package="TeachingDemos").
• To see the package capabilities type library(help=TeachingDemos)
109. The TeachingDemos library
• Let us explore some of these functions. For example type ? faces to get
a description of the faces command.
• Next try running the sample code given in the description
• The first example is faces(rbind(1:3,5:3,3:5,5:7))
• The next is data(longley)
faces(longley[1:9,])
• Compare the differences between faces and faces2 using
faces(matrix( runif(18*10), nrow=10), main='Random Faces')
and
faces2(matrix( runif(18*10), nrow=10), main='Random Faces')
• Type par(mfrow=c(1,1)) to restore the default plotting layout.
110. The TeachingDemos library
• Similarly try the examples for some other possibly useful functions
such as
– mle.demo
– power.examp
– put.points.demo
– rotate.cloud
– run.cor.examp
– run.hist.demo
– vis.binom
111. The Matrix library
• Matrix is a package of classes and methods for numerical linear algebra
with special relevance for sparse and ill-conditioned matrices.
• We shall first use the library to compare the speed of least squares
fitting methods on an example for which the model matrix is large and
sparse.
• As an example, let’s create a model matrix, m, and corresponding
response vector, yo, for a simple linear regression model using the
Formaldehyde data.
data(Formaldehyde)
str(Formaldehyde)
(m <- cbind(1, Formaldehyde$carb))
(yo <- Formaldehyde$optden)
solve(t(m) %*% m) %*% t(m) %*% yo
system.time(solve(t(m) %*% m) %*% t(m) %*% yo)
dput(c(solve(t(m) %*% m) %*% t(m) %*% yo))
dput(unname(lm.fit(m, yo)$coefficients))
112. The Matrix library
• For a large, ill-conditioned least squares problem this does not perform well. Let
us read in an example of such data using
library(Matrix)
data(KNex, package = "Matrix")
y <- KNex$y
mm <- as(KNex$mm, "matrix")
• Type dim(mm) to get the dimension of mm.
• Now check the system times
system.time(naive.sol <- solve(t(mm) %*% mm) %*% t(mm) %*% y)
• Because the calculation of a “cross-product” matrix is a common operation in
statistics, the crossprod function has been provided to do this efficiently. Check
the system time for the above operation using crossprod:
system.time(cpod.sol <- solve(crossprod(mm), crossprod(mm,y)))
113. The Matrix library
• The crossprod function applied to a single matrix takes
advantage of symmetry when calculating the product but does
not retain the information that the product is symmetric and
positive semidefinite.
• As a result least squares estimates are calculated using a
general linear system solver based on an LU decomposition
when it would be faster, and more stable numerically, to use a
Cholesky decomposition.
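The Cholesky route can be written out in base R. This sketch uses simulated data, so the matrix X and the coefficients are assumptions made for illustration.

```r
set.seed(1)
n <- 200; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # simulated model matrix
y <- X %*% c(1, 2, -1, 0.5, 0) + rnorm(n)             # simulated response

xpx <- crossprod(X)                 # X'X, symmetric and positive definite here
xpy <- crossprod(X, y)              # X'y

# Cholesky factor: xpx = t(R) %*% R with R upper triangular
R <- chol(xpx)

# Solve t(R) z = X'y, then R beta = z, using two triangular solves
z <- backsolve(R, xpy, transpose = TRUE)
beta.chol <- backsolve(R, z)

# General solver for comparison
beta.lu <- solve(xpx, xpy)
cbind(beta.chol, beta.lu)
```

Both routes give the same estimates on a well-conditioned problem like this one; the difference lies in speed and numerical stability on large or ill-conditioned problems.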
114. The Matrix library
• The Matrix package uses the S4 class system (Chambers, 1998) to
retain information on the structure of matrices from the intermediate
calculations.
mm <- as(KNex$mm, "dgeMatrix")
system.time(Mat.sol <- solve(crossprod(mm), crossprod(mm,y)))
• Furthermore, any method that calculates a decomposition or
factorization stores the resulting factorization with the original object so
that it can be reused without recalculation.
xpx <- crossprod(mm)
xpy <- crossprod(mm, y)
system.time(solve(xpx, xpy))
115. The Matrix library
• The model matrix mm is sparse; that is, most of the elements of
mm are zero.
• The Matrix package incorporates special methods for sparse
matrices, which produce the fastest results of all.
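A small sketch of the sparse classes, assuming a simulated one-way layout rather than the KNex data: each row of the model matrix has a single nonzero entry marking the group of that observation, so the least squares estimates are simply the group means.

```r
library(Matrix)

set.seed(1)
n <- 1000; p <- 50
g <- rep(1:p, length.out = n)                 # group label of each observation

# Sparse indicator matrix: one nonzero entry per row
X <- sparseMatrix(i = 1:n, j = g, x = rep(1, n), dims = c(n, p))
y <- rnorm(n, mean = g / 10)

# Least squares via the normal equations, using sparse methods throughout
beta.sparse <- solve(crossprod(X), crossprod(X, y))
beta.hat <- drop(as.matrix(beta.sparse))

# For this design the estimates should equal the group means
group.means <- as.vector(tapply(y, g, mean))
all.equal(beta.hat, group.means, check.attributes = FALSE)
```

Only the n nonzero entries are stored, instead of the n * p entries a dense matrix would need.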
116. The MCMCpack library
• This package contains functions to perform Bayesian inference using
posterior simulation for a number of statistical models. All models
return coda mcmc objects that can then be summarized using the coda
package. MCMCpack also contains some useful utility functions,
including some additional density functions and pseudo-random
number generators for statistical distributions, a general purpose
Metropolis sampling algorithm, and tools for visualization.
• You will also need to download the coda library to run MCMCpack.
• Let us type
library(MCMCpack)
library(coda)
library(help=MCMCpack)
to view package capabilities.
117. The MCMCpack library
• Let us look at the function MCbinomialbeta
• Type the following sample code
posterior <- MCbinomialbeta(3,12,mc=5000)
summary(posterior)
• To plot the prior and posterior on the same graph type
plot(posterior)
grid <- seq(0,1,0.01)
plot(grid, dbeta(grid, 1, 1), type="l", col="red", lwd=3, ylim=c(0,3.6),
xlab="pi", ylab="density")
lines(density(posterior), col="blue", lwd=3)
legend(.75, 3.6, c("prior", "posterior"), lwd=3, col=c("red", "blue"))
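For this model the posterior is available in closed form, which gives a way to check the simulation. The sketch below uses only base R and assumes a Beta(1, 1) prior, matching the prior density drawn by dbeta(grid, 1, 1) above.

```r
set.seed(1)
y <- 3; n <- 12              # successes and trials, as in the call above
alpha <- 1; beta <- 1        # assumed Beta(1, 1) prior

# With a Beta prior and binomial data the posterior is Beta(alpha + y, beta + n - y)
draws <- rbeta(5000, alpha + y, beta + n - y)

post.mean <- (alpha + y) / (alpha + beta + n)    # exact posterior mean
c(simulated = mean(draws), exact = post.mean)
quantile(draws, c(0.025, 0.975))                 # simulated 95% credible interval
```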
118. The MCMCpack library
• Similarly consider MCnormalnormal
• Type
y <- c(2.65, 1.80, 2.29, 2.11, 2.27, 2.61, 2.49, 0.96, 1.72, 2.40)
posterior <- MCnormalnormal(y, 1, 0, 1, 5000)
summary(posterior)
and to see a plot
plot(posterior)
grid <- seq(-3,3,0.01)
plot(grid, dnorm(grid, 0, 1), type="l", col="red", lwd=3, ylim=c(0,1.4),
xlab="mu", ylab="density")
lines(density(posterior), col="blue", lwd=3)
legend(-3, 1.4, c("prior", "posterior"), lwd=3, col=c("red", "blue"))
119. The MCMCpack library
• To see the effect of varying the prior variance consider the following
code
y <- c(2.65, 1.80, 2.29, 2.11, 2.27, 2.61, 2.49, 0.96, 1.72, 2.40)
prior.var <- c(0.01,0.1,0.5,0.75,1,1.5,2,10,100)
par(mfrow=c(3,3))
for (ipv in 1:length(prior.var))
{
posterior <- MCnormalnormal(y, 1, 0, prior.var[ipv], 5000)
grid <- seq(-4,4,0.01)
plot(grid, dnorm(grid, 0, sqrt(prior.var[ipv])), type="l", col="red", lwd=3,
ylim=c(0,1.4),
xlab="mu", ylab="density")
points(grid, dnorm(grid,mean(y), sd(y)), type ="l")
lines(density(posterior), col="blue", lwd=3)
legend(-3, 1.4, c("prior", "posterior", "sample"), lty=1,
col=c("red", "blue" , "black"), cex=0.5)
title(paste("Prior variance = ", prior.var[ipv]), cex.main=0.6)
}
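The effect of the prior variance can also be seen from the closed form posterior for a normal mean with known variance. This base R sketch uses the same data, a data variance of 1 and a prior mean of 0, as in the calls above; the shorter list of prior variances is an illustrative choice.

```r
y <- c(2.65, 1.80, 2.29, 2.11, 2.27, 2.61, 2.49, 0.96, 1.72, 2.40)
sigma2 <- 1                                   # known data variance
mu0 <- 0                                      # prior mean
prior.var <- c(0.01, 0.1, 1, 10, 100)         # prior variances to compare
n <- length(y)

# Posterior precision is the sum of the prior precision and the data precision
post.var <- 1 / (1 / prior.var + n / sigma2)

# Posterior mean is a precision-weighted average of the prior mean and the data
post.mean <- post.var * (mu0 / prior.var + sum(y) / sigma2)

cbind(prior.var, post.mean, post.var)
```

A small prior variance pulls the posterior mean towards the prior mean of 0, while a large one leaves it close to the sample mean.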
120. The coda library
• The coda library is used for convergence diagnostics of the MCMC
chain. Type library(help=coda) to see a list of the diagnostic tools
implemented.
• For example try
par(mfrow=c(1,1))
geweke.plot(posterior)
raftery.diag(posterior)
traceplot(posterior)
autocorr.plot(posterior)
• The Bayesian analyses implemented in R are listed in the CRAN Task View
on the subject.
121. Exercise
Frank Harrell’s Hmisc library contains many functions useful for data analysis,
high-level graphics, utility operations, functions for computing sample size and
power, importing datasets, imputing missing values, advanced table making,
variable clustering, character string manipulation, conversion of S objects to
LaTeX code, and recoding variables.
(a) Load the Hmisc library and list its capabilities.
(b) Use the function binconf to compute the various possible confidence
intervals for the binomial proportion. Make a plot to study the relationship
between these intervals for varying sample size.
(c) Compare the abilities of the functions fit.mult.impute, aregImpute and
impute for imputing missing data.