1. BCBB Workshop
An introduction to R
Kui Shen, PhD
Bioinformatics and Computational Biosciences Branch (BCBB)
National Institute of Allergy and Infectious Diseases (NIAID)
2. OCICB Bioinformatics and Computational Biosciences Branch (BCBB)
§ Part of NIAID
§ Group of ~40
§ Software developers
§ Computational biologists
§ Project management & analysis professionals
§ Biostatistics, phylogenetics, genomics, structural biology, programming
3. What will we learn today?
§ R basics
• Arithmetic
• Matrix algebra
• Graphics
• Statistical tests
• Importing and exporting data
• Installing packages
§ Basic statistics with R
• Descriptive statistics
• General linear regression
4. What’s R?
• R is a language and environment for statistical computing and graphics. It is an open-source and free software package.
5. What can R do? -- Importing and manipulating data
§ Importing data.
• Basic text files (read.table).
• Excel files (read.xls in the gdata package).
§ Manipulating data.
• Indexing and subsetting.
• Merging and Reshaping.
An example: from wide format to long format (help('reshape'))
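For instance, a minimal sketch of both steps (the file name "mydata.txt" and the wide-format data set here are hypothetical):

# Importing: read a delimited text file with a header row
# ("mydata.txt" is a hypothetical file name)
dat <- read.table("mydata.txt", header = TRUE)

# Reshaping: a hypothetical wide data set, one row per subject,
# one column per time point
wide <- data.frame(id = 1:3, t1 = c(5, 6, 7), t2 = c(8, 9, 10))

# Convert to long format: one row per subject/time combination
long <- reshape(wide, direction = "long",
                varying = c("t1", "t2"), v.names = "value",
                timevar = "time", idvar = "id")
long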
6. What can R do? -- Data analysis
§ Statistical data analysis
• Descriptive statistics
• One- and two-sample tests
• Regression and correlation
• Analysis of variance
• Tabular data
• Power and sample size estimation
• Mixed model
§ Genetic and genomic data analysis (Bioconductor)
• Microarray
• Next generation sequencing
8. What can R do? -- Visualization
xyplot(Counts~Status|sample,data=dat)
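xyplot() comes from the lattice package, so it must be loaded first. A minimal self-contained sketch (the data values below are made up; only the variable names come from the call above):

library(lattice)

# Toy stand-in for 'dat': counts by status, within each sample
dat <- data.frame(
  Counts = rpois(40, lambda = 20),
  Status = rep(c("control", "treated"), 20),
  sample = rep(c("s1", "s2", "s3", "s4"), each = 10)
)

xyplot(Counts ~ Status | sample, data = dat)  # one panel per sample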
9. What can R do? – Generating reports 1
An example of Sweave code:
\documentclass[a4paper]{article}
\title{Sweave = R + LaTeX}
\author{Kui Shen}
\begin{document}
\maketitle
<<LM, eval=TRUE, echo=TRUE, fig=TRUE, include=TRUE>>=
library(MASS)
data(cats)
mylm <- lm(Hwt ~ Bwt, data = cats)
plot(Hwt ~ Bwt, data = cats, main = 'Linear regression')
abline(mylm)
@
This is the embedded figure.
\end{document}
10. What can R do? – Generating reports 2
An example of Sweave code:
\documentclass[a4paper]{article}
\title{A table generated by xtable}
\begin{document}
\maketitle
<<LM>>=
library(MASS)
library(xtable)
data(cats)
mylm <- lm(Hwt ~ Bwt, data = cats)
@
<<echo=FALSE, results=tex>>=
print(xtable(mylm, caption = "Linear regression results"))
@
\end{document}
12. Basic statistic analysis: linear regression
§ The linear model is one of the most fundamental tools for a data scientist.
§ Before you reach for any fancy machine learning methods, the linear model should be your go-to method.
14. Linear regression
§ In correlation, the two variables are treated as equals.
§ In regression, one variable is treated as the independent (predictor) variable X and the other as the dependent (outcome, response) variable Y.
15. Regression equation
Regression equation: yᵢ = α + β·xᵢ + εᵢ, where α and β are fixed values and the random error εᵢ follows a normal distribution.
What is "linear"?
The mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables.
16. Assumptions for a linear regression:
§ The relationship between X and Y is linear.
§ Y is distributed normally at each value of X.
§ The variance of Y at every value of X is the
same (homogeneity of variances).
§ The observations are independent.
18. Significance test
Distribution of the estimated slope: β̂ follows a T(n−2) distribution with mean β and standard error s.e.(β̂)
H0: β = 0 (no linear relationship)
H1: β ≠ 0 (a linear relationship does exist)
Test statistic: T(n−2) = (β̂ − 0) / s.e.(β̂)
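In R, summary() computes this t test for you; a sketch that reproduces it by hand, using the cats example from the later slides:

library(MASS)
data(cats)
mylm <- lm(Hwt ~ Bwt, data = cats)
est <- summary(mylm)$coefficients
t_stat <- est["Bwt", "Estimate"] / est["Bwt", "Std. Error"]  # (beta-hat - 0) / s.e.(beta-hat)
t_stat                                      # matches the t value reported by summary()
2 * pt(-abs(t_stat), df = nrow(cats) - 2)   # two-sided p-value with n-2 degrees of freedom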
19. Residual analysis for linearity
[Figure: scatter plots of Y vs. x and residuals vs. x; the non-linear fit shows a curved residual pattern, while the linear fit (✓) shows residuals scattered randomly around zero]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
20. Residual analysis for homoscedasticity
[Figure: scatter plots of Y vs. x and residuals vs. x; with non-constant variance the residual spread changes with x, while with constant variance (✓) the spread is uniform]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
21. Residual analysis for independence
[Figure: plots of residuals vs. X; non-independent residuals show a systematic pattern, while independent residuals (✓) scatter randomly]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
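In R, these residual checks come almost for free: calling plot() on a fitted lm object produces the standard diagnostic plots. A sketch using the cats model from the later slides:

library(MASS)
data(cats)
mylm <- lm(Hwt ~ Bwt, data = cats)
plot(mylm, which = 1)  # residuals vs. fitted values: checks linearity and constant variance
plot(mylm, which = 2)  # normal Q-Q plot: checks normality of the residuals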
22. Multivariate regression pitfalls - overfitting
§ In multivariate linear modeling, you can get significant but meaningless results if you use too many predictors in the model (it has no predictive ability for new samples).
§ But what about big data? For example, next generation sequencing data have thousands of genes but few samples.
§ When the number of features p is as large as, or larger than, the number of observations n, least squares cannot be performed (the "large p, small n" problem).
23. Linear regression for big data
§ Shrinkage Methods
• LASSO (Least Absolute Shrinkage and Selection Operator)
§ Dimension Reduction Methods
• Principal Components Regression
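A minimal LASSO sketch using the glmnet package on simulated data (the data and all parameter values here are assumptions, chosen to mimic the large p, small n setting):

library(glmnet)

set.seed(1)
n <- 50; p <- 200                    # more features than observations
x <- matrix(rnorm(n * p), nrow = n)
y <- 2 * x[, 1] + rnorm(n)           # only the first feature truly matters

cvfit <- cv.glmnet(x, y, alpha = 1)  # alpha = 1 gives the LASSO penalty
coef(cvfit, s = "lambda.min")        # sparse coefficient vector at the best lambda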
27. R programming: variables and assignment
§ Variable names: alphanumeric characters, plus "." and "_", are allowed.
§ Variables are case sensitive, so "A" and "a" are two different variables.
§ Assignment: <-
• A<-3; a<-4; X.1<-9
• During the development of S at Bell Labs, the key corresponding to "_" was printed as a left arrow (←) by the time-shared terminal (an Execuport), which is where the <- notation comes from.
§ In 2001, the "=" assignment operator was introduced for compatibility with other languages.
28. Arithmetic operators
§ Addition: +
§ Subtraction: -
§ Division: /
§ Multiplication: *
§ Exponentiation: ^
§ Using R as a calculator:
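For example, at the R prompt:

> (3 + 4) * 2
[1] 14
> 2^10
[1] 1024
> 7 / 2
[1] 3.5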
29. Function calls
§ Syntax: functionName(arg1, arg2)
§ Arguments can be specified by position or by name; a named argument overrides positional matching.
> log(x=16, base=2)
[1] 4
> log(16, 2)
[1] 4
> log(2, 16)
[1] 0.25
> log(base=2, x=16)
[1] 4
30. Data type
§ Three major data types: numeric, character, and
logical.
• # numeric variables
• X<-3.1
• # character variables
• X<-"Male"
• # Logical variables
• X<-TRUE; Y<-FALSE
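You can check which type a value has with class():

> class(3.1)
[1] "numeric"
> class("Male")
[1] "character"
> class(TRUE)
[1] "logical"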
31. Data structure: vector, matrix and array
§ Vector: an ordered collection of data.
• x<-c(3.1,4.2,5.6); y<-c("F","M","F"); z<-c(T,F,T,F,T)
• c(...) is a function that concatenates its arguments end to end.
§ Matrix: two-dimensional generalizations of vectors
§ Array: multi-dimensional generalizations of vectors
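Matrices and arrays are built from vectors; a short sketch:

m <- matrix(1:6, nrow = 2, ncol = 3)  # a 2 x 3 matrix
a <- array(1:24, dim = c(2, 3, 4))    # a 2 x 3 x 4 array
dim(m)  # [1] 2 3
dim(a)  # [1] 2 3 4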
32. Data structure: data frame
§ Data frames are matrix-like structures.
§ The columns in a data frame can be of different data types.
§ Generate a new data frame
§ Convert a matrix to a data frame
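A sketch of both operations (the column names and values here are made up):

# Generate a new data frame: columns may have different types
df <- data.frame(id = 1:3, sex = c("F", "M", "F"), weight = c(3.1, 4.2, 5.6))

# Convert a matrix to a data frame
m <- matrix(1:6, nrow = 2)
as.data.frame(m)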
33. Indexing and selecting
§ Indexing: append an index vector in square brackets to the name of a vector, matrix, or data frame.
§ The index vector can be a numeric vector or a character vector.
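A few examples (the named vector and data frame below are hypothetical):

x <- c(a = 1, b = 2, c = 3)
x[2]            # numeric index: selects the second element
x["b"]          # character index: selects by name
df <- data.frame(id = 1:3, sex = c("F", "M", "F"))
df[1, "sex"]    # row 1, column "sex"
df$sex          # a whole column by name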
50. Data visualization
plot(Hwt ~ Bwt, data=cats)
The tilde (~) operator specifies that Hwt is described by Bwt.
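To run this line yourself, the cats data set must first be loaded from the MASS package:

library(MASS)  # provides the cats data set
data(cats)
plot(Hwt ~ Bwt, data = cats)  # heart weight (g) against body weight (kg)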
51. Linear model
§ Express the relationship as a formula: Hwt ~ Bwt
• It reads: Hwt predicted by Bwt, or Hwt as a function of Bwt.
• The variable on the left side is the dependent variable; the variable on the right side is the independent variable.
§ Hwt is not totally predicted by Bwt, so we add a random error term to the linear model: Hwt ~ Bwt + ε. This model divides the factors related to Hwt into two parts: factors that we can control (the systematic part) and random errors (the random or probabilistic part of the model).
§ Implement it in R:
• mylm <- lm(Hwt ~ Bwt, data=cats)
• summary(mylm)
53. Linear regression results
§ The output looks complicated, so let's walk through it. The first part restates the fitted model (the call).
§ The second part summarizes the residuals.
54. Residuals
§ Ideally, the residuals are independent, identically distributed (iid) random
variables.
56. Outliers
§ Sample 144 looks like an outlier/influential point. How can we check?
§ plot(mylm, which=4)
§ Remove the outlier and re-run the analysis (see the sketch below).
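A sketch of the full check-and-refit workflow (assuming, per the slide, that observation 144 is the suspect point):

plot(mylm, which = 4)  # Cook's distance: flags influential observations
cats2 <- cats[-144, ]  # drop the suspect observation
mylm2 <- lm(Hwt ~ Bwt, data = cats2)
summary(mylm2)         # compare with the original fit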
57. Results: coefficient table
The p-value for Bwt is the same as the p-value for the model, since there is only one independent variable.
plot(Hwt ~ Bwt,data=cats)
abline(mylm)
58. Results: R-squared and p-value
§ The last part:
§ R-squared: a measure of variance explained. In this case, the model explains 64.66% of the variance in the data.
§ Adjusted R-squared: adjusts for the number of independent variables in the model relative to the number of data points.
§ The F-statistic tests whether one or more of the non-constant coefficients in the regression equation are non-zero.
§ Finally, we get the p-value. But what is a p-value?
59. P-value
Statistical hypotheses:
H0: All non-constant coefficients in the regression model are zero.
Ha: At least one of the non-constant coefficients in the regression model is non-zero.
A small p-value means the observed data would be unlikely if the null hypothesis were true.
60. Report the results
• Linear regression analysis was conducted, and the results indicated that body weight significantly predicted heart weight (β = 4.03, p < 0.01). Body weight also explained a significant proportion of the variance in heart weight, R² = 0.65, F(1, 142) = 259.8, p < 0.01.