1. BCBB Workshop
An introduction to R
Kui Shen, PhD
Bioinformatics and Computational Biosciences Branch (BCBB)
National Institute of Allergy and Infectious Diseases (NIAID)
2. OCICB Bioinformatics and Computational Biosciences Branch (BCBB)
§ Part of NIAID
§ Group of ~40
§ Software developers
§ Computational biologists
§ Project management & analysis professionals
§ Biostatistics, phylogenetics, genomics, structural biology, programming
3. What will we learn today?
§ R basics
• Arithmetic
• Matrix algebra
• Graphics
• Statistical tests
• Importing and exporting data
• Installing packages
§ Basic statistics with R
• Descriptive statistics
• General linear regression
4. What’s R?
• R is a language and environment for statistical computing and graphics. It is an open-source and free software package.
5. What can R do? -- Importing and manipulating data
§ Importing data.
• Basic text files (read.table).
• Excel files (read.xls in the gdata package).
§ Manipulating data.
• Indexing and subsetting.
• Merging and Reshaping.
An example: from wide format to long format (help('reshape'))
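For instance, a minimal sketch of both steps (the file name "mydata.txt" and the wide-format data set here are hypothetical):

# Importing: read a delimited text file with a header row
# ("mydata.txt" is a hypothetical file name)
dat <- read.table("mydata.txt", header = TRUE)

# Reshaping: a hypothetical wide data set, one row per subject,
# one column per time point
wide <- data.frame(id = 1:3, t1 = c(5, 6, 7), t2 = c(8, 9, 10))

# Convert to long format: one row per subject/time combination
long <- reshape(wide, direction = "long",
                varying = c("t1", "t2"), v.names = "value",
                timevar = "time", idvar = "id")
long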
6. What can R do? -- Data analysis
§ Statistical data analysis
• Descriptive statistics
• One- and two-sample tests
• Regression and correlation
• Analysis of variance
• Tabular data
• Power and sample size estimation
• Mixed model
§ Genetic and genomic data analysis (Bioconductor)
• Microarray
• Next generation sequencing
8. What can R do? -- Visualization
xyplot(Counts~Status|sample,data=dat)
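xyplot() comes from the lattice package, so it must be loaded first. A minimal self-contained sketch (the data values below are made up; only the variable names come from the call above):

library(lattice)

# Toy stand-in for 'dat': counts by status, within each sample
dat <- data.frame(
  Counts = rpois(40, lambda = 20),
  Status = rep(c("control", "treated"), 20),
  sample = rep(c("s1", "s2", "s3", "s4"), each = 10)
)

xyplot(Counts ~ Status | sample, data = dat)  # one panel per sample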
9. What can R do? – Generating reports 1
An example of Sweave code:
\documentclass[a4paper]{article}
\title{Sweave = R + LaTeX}
\author{Kui Shen}
\begin{document}
\maketitle
<<LM, eval=TRUE, echo=TRUE, fig=TRUE, include=TRUE>>=
library(MASS)
data(cats)
mylm <- lm(Hwt ~ Bwt, data = cats)
plot(Hwt ~ Bwt, data = cats, main = 'Linear regression')
abline(mylm)
@
This is the embedded figure.
\end{document}
10. What can R do? – Generating reports 2
An example of Sweave code:
\documentclass[a4paper]{article}
\title{A table generated by xtable}
\begin{document}
\maketitle
<<LM>>=
library(MASS)
library(xtable)
data(cats)
mylm <- lm(Hwt ~ Bwt, data = cats)
@
<<echo=FALSE, results=tex>>=
print(xtable(mylm, caption = "Linear regression results"))
@
\end{document}
12. Basic statistic analysis: linear regression
§ The linear model is one of the most fundamental tools for a data scientist.
§ Before you reach for any fancy machine learning methods, the linear model should be your go-to method.
14. Linear regression
§ In correlation, the two variables are treated as equals.
§ In regression, one variable is treated as the independent (predictor) variable X and the other as the dependent (outcome, response) variable Y.
15. Regression equation
Regression equation: yᵢ = α + β·xᵢ + εᵢ, where α and β are fixed values and the random error εᵢ follows a normal distribution.
What is "linear"?
The mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables.
16. Assumptions for a linear regression:
§ The relationship between X and Y is linear.
§ Y is distributed normally at each value of X.
§ The variance of Y at every value of X is the
same (homogeneity of variances).
§ The observations are independent.
18. Significance test
Distribution of the estimated slope: β̂ follows a T(n−2) distribution with mean β and standard error s.e.(β̂)
H0: β = 0 (no linear relationship)
H1: β ≠ 0 (a linear relationship does exist)
Test statistic: T(n−2) = (β̂ − 0) / s.e.(β̂)
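In R, summary() computes this t test for you; a sketch that reproduces it by hand, using the cats example from the later slides:

library(MASS)
data(cats)
mylm <- lm(Hwt ~ Bwt, data = cats)
est <- summary(mylm)$coefficients
t_stat <- est["Bwt", "Estimate"] / est["Bwt", "Std. Error"]  # (beta-hat - 0) / s.e.(beta-hat)
t_stat                                      # matches the t value reported by summary()
2 * pt(-abs(t_stat), df = nrow(cats) - 2)   # two-sided p-value with n-2 degrees of freedom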
19. Residual analysis for linearity
[Figure: scatter plots of Y vs. x and residuals vs. x; the non-linear fit shows a curved residual pattern, while the linear fit (✓) shows residuals scattered randomly around zero]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
20. Residual analysis for homoscedasticity
[Figure: scatter plots of Y vs. x and residuals vs. x; with non-constant variance the residual spread changes with x, while with constant variance (✓) the spread is uniform]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
21. Residual analysis for independence
[Figure: plots of residuals vs. X; non-independent residuals show a systematic pattern, while independent residuals (✓) scatter randomly]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
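In R, these residual checks come almost for free: calling plot() on a fitted lm object produces the standard diagnostic plots. A sketch using the cats model from the later slides:

library(MASS)
data(cats)
mylm <- lm(Hwt ~ Bwt, data = cats)
plot(mylm, which = 1)  # residuals vs. fitted values: checks linearity and constant variance
plot(mylm, which = 2)  # normal Q-Q plot: checks normality of the residuals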
22. Multivariate regression pitfalls - overfitting
§ In multivariate linear modeling, you can get significant but meaningless results if you use too many predictors in the model (it has no predictive ability for new samples).
§ But what about big data? For example, next generation sequencing data have thousands of genes but few samples.
§ When the number of features p is as large as, or larger than, the number of observations n, least squares cannot be performed (the "large p, small n" problem).
23. Linear regression for big data
§ Shrinkage Methods
• LASSO (Least Absolute Shrinkage and Selection Operator)
§ Dimension Reduction Methods
• Principal Components Regression
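A minimal LASSO sketch using the glmnet package on simulated data (the data and all parameter values here are assumptions, chosen to mimic the large p, small n setting):

library(glmnet)

set.seed(1)
n <- 50; p <- 200                    # more features than observations
x <- matrix(rnorm(n * p), nrow = n)
y <- 2 * x[, 1] + rnorm(n)           # only the first feature truly matters

cvfit <- cv.glmnet(x, y, alpha = 1)  # alpha = 1 gives the LASSO penalty
coef(cvfit, s = "lambda.min")        # sparse coefficient vector at the best lambda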
27. R programming: variables and assignment
§ Variable names: alphanumeric characters, plus "." and "_", are allowed.
§ Variables are case sensitive, so "A" and "a" are two different variables.
§ Assignment: <-
• A<-3; a<-4; X.1<-9
• During the development of S at Bell Labs, the key corresponding to "_" was printed as a left arrow (←) by the time-shared terminal (an Execuport), which is where the <- notation comes from.
§ In 2001, the "=" assignment operator was introduced for compatibility with other languages.
28. Arithmetic operators
§ Addition: +
§ Subtraction: -
§ Division: /
§ Multiplication: *
§ Exponentiation: ^
§ Using R as a calculator:
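For example, at the R prompt:

> (3 + 4) * 2
[1] 14
> 2^10
[1] 1024
> 7 / 2
[1] 3.5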
29. Function calls
§ Syntax: functionName(arg1, arg2)
§ Arguments can be specified by position or by name; a named argument overrides positional matching.
> log(x=16, base=2)
[1] 4
> log(16, 2)
[1] 4
> log(2, 16)
[1] 0.25
> log(base=2, x=16)
[1] 4
30. Data type
§ Three major data types: numeric, character, and
logical.
• # numeric variables
• X<-3.1
• # character variables
• X<-"Male"
• # Logical variables
• X<-TRUE; Y<-FALSE
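You can check which type a value has with class():

> class(3.1)
[1] "numeric"
> class("Male")
[1] "character"
> class(TRUE)
[1] "logical"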
31. Data structure: vector, matrix and array
§ Vector: an ordered collection of data.
• x<-c(3.1,4.2,5.6); y<-c("F","M","F"); z<-c(T,F,T,F,T)
• c(...) is a function that concatenates its arguments end to end.
§ Matrix: two-dimensional generalizations of vectors
§ Array: multi-dimensional generalizations of vectors
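Matrices and arrays are built from vectors; a short sketch:

m <- matrix(1:6, nrow = 2, ncol = 3)  # a 2 x 3 matrix
a <- array(1:24, dim = c(2, 3, 4))    # a 2 x 3 x 4 array
dim(m)  # [1] 2 3
dim(a)  # [1] 2 3 4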
32. Data structure: data frame
§ Data frames are matrix-like structures.
§ The columns in a data frame can be of different data types.
§ Generate a new data frame
§ Convert a matrix to a data frame
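A sketch of both operations (the column names and values here are made up):

# Generate a new data frame: columns may have different types
df <- data.frame(id = 1:3, sex = c("F", "M", "F"), weight = c(3.1, 4.2, 5.6))

# Convert a matrix to a data frame
m <- matrix(1:6, nrow = 2)
as.data.frame(m)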
33. Indexing and selecting
§ Indexing: append an index vector in square brackets to the name of a vector, matrix, or data frame.
§ The index vector can be a numeric vector or a character vector.
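A few examples (the named vector and data frame below are hypothetical):

x <- c(a = 1, b = 2, c = 3)
x[2]            # numeric index: selects the second element
x["b"]          # character index: selects by name
df <- data.frame(id = 1:3, sex = c("F", "M", "F"))
df[1, "sex"]    # row 1, column "sex"
df$sex          # a whole column by name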
50. Data visualization
plot(Hwt ~ Bwt, data=cats)
The tilde (~) operator specifies that Hwt is described by Bwt.
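To run this line yourself, the cats data set must first be loaded from the MASS package:

library(MASS)  # provides the cats data set
data(cats)
plot(Hwt ~ Bwt, data = cats)  # heart weight (g) against body weight (kg)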
51. Linear model
§ Express the relationship as a formula: Hwt ~ Bwt
• It reads: Hwt predicted by Bwt, or Hwt as a function of Bwt.
• The variable on the left side is the dependent variable; the variable on the right side is the independent variable.
§ Hwt is not totally predicted by Bwt, so we add a random error term to the linear model: Hwt ~ Bwt + ε. This model divides the factors related to Hwt into two parts: factors that we can control (the systematic part) and random errors (the random or probabilistic part of the model).
§ Implement it in R:
• mylm <- lm(Hwt ~ Bwt, data=cats)
• summary(mylm)
53. Linear regression results
§ The output looks complicated, so let's walk through it. The first part restates the fitted model (the call).
§ The second part summarizes the residuals.
54. Residuals
§ Ideally, the residuals are independent, identically distributed (iid) random
variables.
56. Outliers
§ Sample 144 looks like an outlier/influential point. How can we check?
§ plot(mylm, which=4)
§ Remove the outlier and re-run the analysis (see the sketch below).
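A sketch of the full check-and-refit workflow (assuming, per the slide, that observation 144 is the suspect point):

plot(mylm, which = 4)  # Cook's distance: flags influential observations
cats2 <- cats[-144, ]  # drop the suspect observation
mylm2 <- lm(Hwt ~ Bwt, data = cats2)
summary(mylm2)         # compare with the original fit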
57. Results: coefficient table
The p-value for Bwt is the same as the p-value for the model, since there is only one independent variable.
plot(Hwt ~ Bwt,data=cats)
abline(mylm)
58. Results: R-squared and p-value
§ The last part:
§ R-squared: a measure of variance explained. In this case, the model explains 64.66% of the variance in the data.
§ Adjusted R-squared: adjusts for the number of independent variables in the model relative to the number of data points.
§ The F-statistic tests whether one or more of the non-constant coefficients in the regression equation are non-zero.
§ Finally, we get the p-value. But what is a p-value?
59. P-value
Statistical hypotheses:
H0: All non-constant coefficients in the regression model are zero.
Ha: At least one of the non-constant coefficients in the regression model is non-zero.
A small p-value means the observed data would be unlikely if the null hypothesis were true.
60. Report the results
• Linear regression analysis was conducted, and the results indicated that body weight significantly predicted heart weight (β = 4.03, p < 0.01). Body weight also explained a significant proportion of the variance in heart weight, R² = 0.65, F(1, 142) = 259.8, p < 0.01.