R lecture oga

Handling quantitative data
using
statistical software R
Osamu Ogasawara
2015.01.19

Contents
1. What is R?
2. An Introductory Example
3. Types and Data Structures (in C and R)
4. Functional Programming (apply() function)
5. R Graphics
6. Bioinformatics (RNA-seq)

Computer Language Popularity
The TIOBE index is a mean of normalized hit counts of the following form:
(hits(PL,SE1)/hits(SE1) + ... + hits(PL,SEn)/hits(SEn))/n
where PL is a search query of the following pattern:
+"<language> programming"
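As a rough illustration of the formula above, the following R sketch computes such a mean from hit counts. The function name tiobe_like_index and all hit counts are hypothetical and made up for illustration; they are not real search-engine data.

```r
# Hypothetical helper: mean over search engines of hits(PL, SE) / hits(SE).
tiobe_like_index <- function(hits_pl, hits_total) {
  stopifnot(length(hits_pl) == length(hits_total), all(hits_total > 0))
  mean(hits_pl / hits_total)
}

# Made-up hit counts for one language on three search engines (illustration only).
hits_pl    <- c(120, 300, 50)      # hits(PL, SE) for the query pattern
hits_total <- c(1000, 2000, 500)   # hits(SE), the normalizing count per engine
index <- tiobe_like_index(hits_pl, hits_total)
print(index)
```

Dividing by hits(SE) before averaging keeps one large search engine from dominating the result.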

Computer Language
Popularity
C language
and its derivatives
(General purpose)
Script languages
Domain specific language

Computer Language
Popularity
Domain Specific
Languages
Script language The others

Classification of Computer Languages
by abstraction levels
Assembly Languages
High Level Languages
C, C++, Java, …
Very High Level Languages (VHLL)
Scripting languages: Perl, Python, Ruby, …
Domain Specific Language
R : statistics
Matlab, …
A higher-level language is closer to natural language.

Simple Example (1)
histogram
> x<-rnorm(100000000)
> head(x)
[1] 0.4667083 0.8907642 0.8147121
0.4839252 0.5811472 0.4941122
> hist(x)
> system.time(x<-rnorm(100000000))
user system elapsed
8.771 0.249 9.020

Simple Example (2) t-test
> group1 <- c(0.7,-1.6,-0.2,-1.2,-0.1,3.4,3.7,0.8,0.0,2.0)
> group2 <- c(1.9, 0.8, 1.1, 0.1,-0.1,4.4,5.5,1.6,4.6,3.4)
> group1
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0
> group2
[1] 1.9 0.8 1.1 0.1 -0.1 4.4 5.5 1.6 4.6 3.4
> boxplot(group1, group2)
> t.test(group1, group2, var.equal=T)
Two Sample t-test
data: group1 and group2
t = -1.8608, df = 18, p-value = 0.07919
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.363874 0.203874
sample estimates:
mean of x mean of y
0.75 2.33
http://cse.naro.affrc.go.jp/takezawa/r-tips/r/65.html

Getting Help in R
Display the R manual page of a function (if you know the name of the function):
> ?rnorm
> help("rnorm")
Search functions by keywords:
> ??"normal distribution"
> help.search("normal distribution")
Search functions by (partial) matching of function names:
> find("rnorm")
> apropos("rnorm")

Probability Distributions
dnorm() : Probability density function
pnorm() : Cumulative distribution function
qnorm() : Quantile function (inverse of pnorm())
rnorm() : Random number generation
"Quick-R" site
http://www.statmethods.net/advgraphs/probability.html
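The four functions are related to each other, which a short sketch can make concrete. The values 1.96 and 1, the step size h, and the seed are arbitrary choices for illustration.

```r
# pnorm() and qnorm() are inverses of each other.
p <- pnorm(1.96)          # P(X <= 1.96) for a standard normal X
q <- qnorm(p)             # recovers 1.96

# dnorm() is the derivative of pnorm(); check it with a finite difference.
h <- 1e-6
slope <- (pnorm(1 + h) - pnorm(1 - h)) / (2 * h)

# rnorm() draws samples whose mean is close to the distribution mean (0).
set.seed(1)               # arbitrary seed, only for reproducibility
m <- mean(rnorm(10000))

cat(p, q, slope, dnorm(1), m, "\n")
```

The same d/p/q/r naming scheme applies to the other distributions in R, for example dbinom(), pbinom(), qbinom() and rbinom().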

Plotting the density
function (1/2)
> x<-seq(-4,4,length=100)
> x
[1] -4.00000000 -3.91919192 -3.83838384 -3.75757576 -3.67676768 -3.59595960
[7] -3.51515152 -3.43434343 -3.35353535 -3.27272727 -3.19191919 -3.11111111
[13] -3.03030303 -2.94949495 -2.86868687 -2.78787879 -2.70707071 -2.62626263
… omitted
> dx<-dnorm(x)

Plotting the density
function (2/2)
> plot(x,dx,type="l",xlab="x",ylab="y",main="The normal distribution")

Plotting the probability
distribution function
> x<-seq(-4,4,length=100)
> px<-pnorm(x)
> plot(x,px,type="l",xlab="x",ylab="y",main="The normal distribution")

Quantile (1/5)
plot(x,dnorm(x), type="n", ylim=c(0,1))
http://cse.niaes.affrc.go.jp/minaka/R/R-normal.html
Copyright (c) 2004 by MINAKA Nobuhiro. All rights reserved.

Quantile (2/5)
curve(dnorm(x), type="l", add=T)

Quantile (3/5)
curve(pnorm(x), type="l", lty=3, add=T)

Quantile (4/5)
abline(h=0.05)
abline(h=0.95)

Quantile (5/5)
x<-seq(-4,4,length=100)
abline(h=0.05)
abline(h=0.95)
lower.alpha5<-qnorm(0.05)
upper.alpha5<-qnorm(0.95)
abline(v=lower.alpha5)
abline(v=upper.alpha5)
points(lower.alpha5, 0.05, cex=3.0, pch="*")
points(upper.alpha5, 0.95, cex=3.0, pch="*")

Calculation of the p-value
of a numeric vector x.
http://d.hatena.ne.jp/hoxo_m/20130213/p1
# Two-sided p-value for H0: mean = 0, using a normal approximation
norm.dist.p <- function(x) {
  n <- length(x)
  m <- mean(x)
  se <- sd(x) / sqrt(n)   # standard error of the mean
  p <- pnorm(-abs(m), mean=0, sd=se) * 2
  p
}
x <- rnorm(10, mean=0)
p <- norm.dist.p(x)
cat("p =", p, "\n")

Bias in small samples
alpha <- 0.05
# Repeat the test 10000 times on samples for which H0 is true
ps <- sapply(1:10000, function(i) {
  x <- rnorm(10)
  norm.dist.p(x)
})
fp <- sum(ps < alpha) / length(ps)   # observed false-positive rate
cat("alpha error rate =", fp, "\n")
alpha error rate = 0.0812
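The observed rate is above alpha. One plausible explanation is that the standard deviation is estimated from only 10 points while the p-value is taken from the normal distribution. The sketch below compares that approach with a variant based on Student's t distribution with n - 1 degrees of freedom. The name t.dist.p and the seed are assumptions for illustration, not part of the original slides.

```r
# z-based p-value, as in norm.dist.p above (repeated so the block is self-contained).
norm.dist.p <- function(x) {
  se <- sd(x) / sqrt(length(x))
  pnorm(-abs(mean(x)), mean = 0, sd = se) * 2
}

# Hypothetical t-based variant: same statistic, Student's t with n - 1 df.
t.dist.p <- function(x) {
  n <- length(x)
  t_stat <- mean(x) / (sd(x) / sqrt(n))
  pt(-abs(t_stat), df = n - 1) * 2
}

set.seed(42)   # arbitrary seed, only for reproducibility
alpha <- 0.05
samples <- replicate(10000, rnorm(10), simplify = FALSE)
fp_norm <- mean(sapply(samples, norm.dist.p) < alpha)
fp_t    <- mean(sapply(samples, t.dist.p) < alpha)
cat("normal:", fp_norm, " t:", fp_t, "\n")
```

For a single sample, t.dist.p(x) should agree with t.test(x)$p.value, which is the usual way to run this test in R.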

Types in C (partial)
Integer Types
Floating-Point Types

Memory Layout of C
Programs
1. Text segment (Code segment)
2. Initialized data segment
(initialized global variables
and static variables)
3. Uninitialized data segment
4. Stack (automatic variables)
5. Heap (for dynamic memory
allocation by malloc(), free(),
…)
http://www.geeksforgeeks.org/memory-layout-of-c-program/

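Stack frame and
function call
int a(void);
void b(void) {}   /* empty placeholder so the example links */
void c(void) {}   /* empty placeholder so the example links */
int main(void) {
    int x = 0;    /* local to main()'s stack frame */
    a();
    return 0;
}
int a(void) {
    int x = 1;    /* local to a()'s stack frame, distinct from main()'s x */
    b();
    c();
    return 0;
}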
http://www.tenouk.com/ModuleZ.html

Recursion in C
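#include <stdio.h>
/* Each call pushes a new stack frame until the base case is reached. */
int Fact(int f) {
    if (f <= 1) return 1;          /* base case ends the recursion */
    return f * Fact(f - 1);        /* one recursive call per invocation */
}
int main(void) {
    int fact;
    fact = Fact(5);
    printf("Factorial is %d\n", fact);
    return 0;
}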
http://www.programmingspark.com/2013/03/Working-of-Recursion-in-detail-using-Stack.html


C pointers
int b = 17;
int* a = &b;   /* a holds the address of b */
int x = *a;    /* dereference: x = 17 */

Adding an element to the
containers
Linked List
C Array (R vector)
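An R vector is stored contiguously, like a C array, so appending one element at a time may require allocating a longer vector and copying the old contents. The sketch below contrasts that with preallocating the storage. The function names grow and prealloc are hypothetical, and the exact copying behavior is an implementation detail that can differ between R versions.

```r
n <- 10000

# Growing: each append may allocate a longer vector and copy the old elements.
grow <- function(n) {
  v <- numeric(0)
  for (i in seq_len(n)) v[length(v) + 1] <- i^2
  v
}

# Preallocating: the storage is reserved once, then filled in place.
prealloc <- function(n) {
  v <- numeric(n)
  for (i in seq_len(n)) v[i] <- i^2
  v
}

a <- grow(n)
b <- prealloc(n)
print(identical(a, b))   # both strategies build the same vector
```

Wrapping each call in system.time() gives a rough idea of the cost difference on a given machine.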

Types in R
Logical : TRUE, T, FALSE, F
Numeric (double): 1, 1.0, 1.4e+3
Complex: 3.5+4i
Character : "abc"
> typeof(TRUE)
[1] "logical"
> typeof(1)
[1] "double"
> typeof(1.0)
[1] "double"
> typeof(3.5+4i)
[1] "complex"
> typeof("abc")
[1] "character"
> is.vector(TRUE)
[1] TRUE
> is.vector(1)
[1] TRUE
> is.vector(3.5+4i)
[1] TRUE
> is.vector("abc")
[1] TRUE
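Besides double, R also has an integer type, which the examples above do not show. A minimal sketch of how the two arise:

```r
print(typeof(1))         # a plain number literal is a double
print(typeof(1L))        # the L suffix requests an integer
print(typeof(1:3))       # the colon operator yields integers here
print(typeof(c(1, 2)))   # c() of plain literals stays double
int_max <- .Machine$integer.max   # largest representable integer (2^31 - 1)
print(int_max)
```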

Creation of R vectors
> c(1,2,3,4,5)
[1] 1 2 3 4 5
> 1:5
[1] 1 2 3 4 5
> 5.1:-1.2
[1] 5.1 4.1 3.1 2.1 1.1 0.1 -0.9
> seq(1,3,0.5)
[1] 1.0 1.5 2.0 2.5 3.0
> rep(
> numeric(10)
[1] 0 0 0 0 0 0 0 0 0 0
> logical(10)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> character(10)
[1] "" "" "" "" "" "" "" "" "" ""
> complex(10)
[1] 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i

Operation on vectors
> 1:10*2
[1] 2 4 6 8 10 12 14 16 18 20
> 2*(3^(0:4))
[1] 2 6 18 54 162
> v1<-1:10
> v2<-10:1
> v1+v2
[1] 11 11 11 11 11 11 11 11 11 11

> v1<-c(1,2,3)
> v1
[1] 1 2 3
> v1[1]
[1] 1
> v1[4]
[1] NA
> v1[5]<-10
> v1
[1] 1 2 3 NA 10
> v1[6]<-"a"
> v1
[1] "1" "2" "3" NA "10" "a"
> v2<-runif(10, 1,10)
> v2
[1] 4.851027 7.618278 5.371393 3.940181 1.002870 9.511409 2.364836 5.246343
[9] 3.361870 9.435904
> v2<5
[1] TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
> v2[v2<5]
[1] 4.851027 3.940181 1.002870 2.364836 3.361870
> v2[1:3]
[1] 4.851027 7.618278 5.371393
> v2[1:3*2]
[1] 7.618278 3.940181 9.511409

Creation of R Lists
> w1<-list("a", 10, TRUE)
> w1
[[1]]
[1] "a"
[[2]]
[1] 10
[[3]]
[1] TRUE
> w2 <- as.list(c(1,2,3))
> w2
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3

Data structure of R
objects
Type information pointers data (vector)

R List
> w1<-list(1:3,"ab",TRUE)
> w1
[[1]]
[1] 1 2 3
[[2]]
[1] "ab"
[[3]]
[1] TRUE

w1[1] returns a sublist (a list of length 1).
w1[[1]] returns the content of the first element
of the list.
> typeof(w1)
[1] "list"
> typeof(w1[1])
[1] "list"
> typeof(w1[[1]])
[1] "integer"
> w1[1]
[[1]]
[1] 1 2 3
> w1[[1]]
[1] 1 2 3
> w1[[1]][1]
[1] 1

> w2<-w1[c(1,2)]
> remove(w1)
> w1
Error: object 'w1' not found
> w2
[[1]]
[1] 1 2 3
[[2]]
[1] "ab"
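Removing a name does not destroy data that another name still refers to. The following sketch repeats the steps with a fresh list so that it can be run on its own:

```r
w1 <- list(1:3, "ab", TRUE)
w2 <- w1[c(1, 2)]     # new list with the first two elements
rm(w1)                # removes the name w1, not the data w2 refers to
print(exists("w1"))   # the name is gone
print(w2[[1]])        # the elements are still reachable through w2
```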

R List and “names”
> w3<-list(a=1:3, b="abc", NA)
> w3
$a
[1] 1 2 3
$b
[1] "abc"
[[3]]
[1] NA
> w3[[1]]
[1] 1 2 3
> w3$a
[1] 1 2 3
> w3[1]
$a
[1] 1 2 3

Attributes of an R
object
> w3<-list(a=1:3,b="ab",TRUE)
> attributes(w3)
$names
[1] "a" "b" ""
> attr(w3,"names")<-NULL
> w3
[[1]]
[1] 1 2 3
[[2]]
[1] "ab"
[[3]]
[1] TRUE
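Attributes can be read, removed and set again, and some of them change how R treats the object. A minimal sketch with a list w that mirrors w3; the replacement names "x", "y", "z" and the vector m are made up for illustration.

```r
w <- list(a = 1:3, b = "ab", TRUE)
print(names(w))               # same as attr(w, "names")
attr(w, "names") <- NULL      # drop the names attribute
print(is.null(attributes(w))) # no attributes are left
names(w) <- c("x", "y", "z")  # names<- sets the same attribute again
print(w$x)

m <- 1:6
attr(m, "dim") <- c(2, 3)     # a dim attribute turns the vector into a matrix
print(is.matrix(m))
```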

data.frame : List of vectors
> phenotype<-read.table("bodymap_phenodata.txt", header=T,
row.names=1, sep=" ", quote="")
> phenotype
num.tech.reps tissue.type gender age race
ERS025098 2 adipose F 73 caucasian
ERS025092 2 adrenal M 60 caucasian
ERS025085 2 brain F 77 caucasian
ERS025088 2 breast F 29 caucasian
ERS025089 2 colon F 68 caucasian
ERS025082 2 heart M 77 caucasian
ERS025081 2 kidney F 60 caucasian
ERS025096 2 liver M 37 caucasian
ERS025099 2 lung M 65 caucasian
ERS025086 2 lymphnode F 86 caucasian
ERS025084 6 mixture <NA> NA caucasian
ERS025083 2 ovary F 47 african_american
ERS025095 2 prostate M 73 caucasian
… omitted

RNA-seq
http://www.bgisequence.com/jp/services/sequencing-services/rna-sequencing/rna-seq/

http://bowtie-bio.sourceforge.net/recount/

bodymap_count_table.txt
• Tab-delimited format
• The first line shows a list of sample identifiers (19 human organs).
• The first column is a list of gene identifiers (Ensembl genes).

Read a data table to a data frame
> phenotype<-read.table("bodymap_phenodata.txt", header=T,
row.names=1, sep=" ", quote="")
> phenotype
num.tech.reps tissue.type gender age race
ERS025098 2 adipose F 73 caucasian
ERS025092 2 adrenal M 60 caucasian
ERS025085 2 brain F 77 caucasian
ERS025088 2 breast F 29 caucasian
ERS025089 2 colon F 68 caucasian
ERS025082 2 heart M 77 caucasian
ERS025081 2 kidney F 60 caucasian
ERS025096 2 liver M 37 caucasian
ERS025099 2 lung M 65 caucasian
ERS025086 2 lymphnode F 86 caucasian
ERS025083 2 ovary F 47 african_american
ERS025095 2 prostate M 73 caucasian
… omitted

Inspect the type and
attribute of the data frame
> typeof(phenotype)
[1] "list"
> attributes(phenotype)
$names
[1] "num.tech.reps" "tissue.type" "gender" "age"
[5] "race"
$class
[1] "data.frame"
$row.names
[1] "ERS025098" "ERS025092" "ERS025085" "ERS025088" "ERS025089" "ERS025082"
[19] "ERS025091"
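The output above suggests that a data frame is a list of equal-length column vectors with names, class and row.names attributes. The sketch below checks this on a made-up miniature table; the table is for illustration only and is not the bodymap data.

```r
# Made-up miniature table (illustration only, not the bodymap data).
df <- data.frame(
  tissue = c("brain", "heart", "liver"),
  age    = c(77, 77, 37),
  row.names = c("s1", "s2", "s3"),
  stringsAsFactors = FALSE
)
print(typeof(df))              # a data frame is stored as a list
print(class(df))
print(names(attributes(df)))
print(df[["age"]])             # list-style access to one column
print(sapply(df, length))      # every column has the same length
```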

Read the count table
> data <- read.table("bodymap_count_table.txt", header=T, row.names=1, sep="\t",
quote="")
> head(data)
ERS025098 ERS025092 ERS025085 ERS025088 ERS025089 ERS025082
ENSG00000000003 1354 216 215 924 725 125
ENSG00000000005 712 134 4 1495 119 20
ENSG00000000419 450 547 516 529 808 680
ENSG00000000457 188 368 196 386 156 259
ENSG00000000460 66 29 1 26 11 9
ENSG00000000938 104 79 7 29 0 3
… omitted

Replace the column
names: from the IDs to the
tissue type descriptions
> colnames(data)
[19] "ERS025091"
> colnames(data)<-phenotype$tissue.type
> colnames(data)
[1] "adipose" "adrenal" "brain" "breast"
[5] "colon" "heart" "kidney" "liver"
[9] "lung" "lymphnode" "mixture" "mixture"
[13] "mixture" "ovary" "prostate" "skeletal_muscle"
[17] "testes" "thyroid" "white_blood_cell"
> head(data)
adipose adrenal brain breast colon heart kidney liver lung
ENSG00000000003 1354 216 215 924 725 125 796 1954 815
ENSG00000000005 712 134 4 1495 119 20 7 0 0
ENSG00000000419 450 547 516 529 808 680 744 369 636
ENSG00000000457 188 368 196 386 156 259 436 288 187
ENSG00000000460 66 29 1 26 11 9 25 42 12
ENSG00000000938 104 79 7 29 0 3 1 20 243

Looking into the data
frame
> head(data$adipose, 100)
[1] 1354 712 450 188 66 104 0 1323 0 858 0 0
[13] 13 6346 0 0 0 0 0 3 0 485 0 0
[25] 36 0 0 0 0 1002 1360 0 4179 12 424 0
[37] 97 0 0 0 0 0 0 0 2577 0 0 0
[49] 0 0 5 2241 0 0 115 3678 0 14104 18 1662
[61] 0 0 0 0 6 0 0 7839 0 2 1313 1997
[73] 40 5390 0 0 0 208 180 1277 1460 0 0 1002
[85] 30 177 84 441 0 2986 1598 0 13925 94 5565 0
[97] 0 0 0 0
> length(data$adipose)
[1] 52580
> length(data$adipose[data$adipose>0])
[1] 9992

Distribution of the data
> hist(data$adipose)
> hist(log10(data$adipose))
> summary(log10(data$adipose))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-Inf -Inf -Inf -Inf -Inf 6
> summary(log10(data$adipose[data$adipose>0]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.462 2.382 2.287 3.109 6.200
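The -Inf values come from the zero counts, since log10(0) is -Inf and a single -Inf is enough to make the mean -Inf. A small sketch with made-up counts:

```r
counts <- c(0, 1, 10, 100, 0, 1000)   # made-up counts, illustration only
print(log10(counts))                  # zeros become -Inf
print(mean(log10(counts)))            # so the mean is -Inf as well
positive <- counts[counts > 0]        # filter before taking the log
print(mean(log10(positive)))
```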

attach() and detach(): making the
column names of a data frame visible
in the "environment" (search path)
> attach(data)
> head(adipose, 100)
[1] 1354 712 450 188 66 104 0 1323 0 858 0 0
[13] 13 6346 0 0 0 0 0 3 0 485 0 0
[25] 36 0 0 0 0 1002 1360 0 4179 12 424 0
[37] 97 0 0 0 0 0 0 0 2577 0 0 0
[49] 0 0 5 2241 0 0 115 3678 0 14104 18 1662
[61] 0 0 0 0 6 0 0 7839 0 2 1313 1997
[73] 40 5390 0 0 0 208 180 1277 1460 0 0 1002
[85] 30 177 84 441 0 2986 1598 0 13925 94 5565 0
[97] 0 0 0 0
> length(adipose)
[1] 52580
> detach(data)
> length(adipose)
Error: object 'adipose' not found
> length(data$adipose)
[1] 52580

Environment (1/2)
Environment basics : http://adv-r.had.co.nz/Environments.html
The job of an environment is to associate, or bind, a set of
names to a set of values.
You can think of an environment as a bag of names:
• If an object has no names pointing to it, it gets
automatically deleted by the garbage collector.
• Every object in an environment has a unique name.
• The objects in an environment are not ordered (i.e., it
doesn’t make sense to ask what the first object in an
environment is).
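The "bag of names" idea can be tried out directly, because environments are ordinary R objects. A minimal sketch; the names e, a and b are arbitrary.

```r
e <- new.env()                 # an empty "bag of names"
assign("a", 1:3, envir = e)    # bind the name a to a value
e$b <- "text"                  # $ also creates a binding
print(ls(e))                   # ls() lists names; there is no "first" object
print(get("a", envir = e))
rm("a", envir = e)             # remove the binding for a
print(exists("a", envir = e, inherits = FALSE))
```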

Environment (2/2)
Most environments are created as a consequence of using functions.
An environment has a parent environment.
http://adv-r.had.co.nz/Environments.html
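Both statements can be observed with a function that returns another function. The sketch below is a common closure idiom, not taken from the slides; the names make_counter and counter are hypothetical.

```r
make_counter <- function() {
  count <- 0                 # lives in the environment created by this call
  function() {
    count <<- count + 1      # <<- modifies count in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()
print(counter())             # the state survives between calls

env <- environment(counter)  # the environment created by calling make_counter()
print(ls(env))
# Its parent is the environment in which make_counter() was defined.
print(identical(parent.env(env), environment(make_counter)))
```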

The apply() function
> s <- apply(data, 2, sum)
> s
adipose adrenal brain breast
23957600 18987359 20995462 23426900
colon heart kidney liver
23397325 26762377 22630393 29314904
lung lymphnode mixture mixture
23426381 19489508 31135063 57697453
mixture ovary prostate skeletal_muscle
52460922 22857384 25215879 28400943
testes thyroid white_blood_cell
27261469 24465463 27871222
> png(filename="bar001.png")
> par(mai=c(1,2,1,1))
> barplot(s,horiz=T,las=1)
> dev.off()

Customizing (Traditional) Graphics
> s=apply(data, 2, sum)
> s
23957600 18987359 20995462 23426900
23397325 26762377 22630393 29314904
23426381 19489508 31135063 57697453
52460922 22857384 25215879 28400943
27261469 24465463 27871222
> barplot(s)

Customizing
(Traditional) Graphics
barplot(s, horiz=TRUE)

Customizing
(Traditional) Graphics
> par(mai=c(1,2,1,1))
> barplot(s,horiz=T,las=1)

Customizing
Traditional Graphics
with par() function
Paul Murrell
R Graphics, 2nd ed.
(2011)

How many plot types
are there?

Winston Chang
R Graphics Cookbook
O’Reilly (2013)
ggplot2 and traditional graphics

Functional programming with
the apply() function
> apply(log10(data), 2, mean)
-Inf -Inf -Inf -Inf
-Inf -Inf -Inf -Inf
-Inf -Inf -Inf -Inf
-Inf -Inf -Inf -Inf
-Inf -Inf -Inf
> mean2<-function(x) { mean(x[x>0]) }
> apply(log10(data), 2, mean2)
2.335220 2.344531 2.278299 2.346041
2.380096 2.226729 2.415721 2.236490
2.484701 2.502548 2.531860 2.776740
2.670258 2.402131 2.503051 2.464915
2.486507 2.439520 2.597849
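apply() calls a function once per row (MARGIN = 1) or per column (MARGIN = 2). The sketch below shows this on a made-up count matrix. It also points at one detail that may be worth checking: mean2 is applied after log10(), so x > 0 excludes counts equal to 1 (whose log is 0) as well as the zeros. The helper mean_log_pos is hypothetical and filters on the raw counts instead.

```r
# Made-up count matrix (illustration only): rows are genes, columns are samples.
m <- matrix(c(0, 1, 10, 100,
              0, 0, 1, 1000),
            nrow = 4,
            dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

col_sums <- apply(m, 2, sum)   # MARGIN = 2: one value per column
row_sums <- apply(m, 1, sum)   # MARGIN = 1: one value per row

# Hypothetical helper: mean of log10 over counts that are strictly positive.
mean_log_pos <- function(x) mean(log10(x[x > 0]))
by_count <- apply(m, 2, mean_log_pos)

# mean2 as defined above filters after the log, so counts equal to 1 are dropped too.
mean2 <- function(x) { mean(x[x > 0]) }
by_log <- apply(log10(m), 2, mean2)

print(by_count)
print(by_log)
```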
>

Quick-R
http://www.statmethods.net/management/userfunctions.html

Quick-R
http://www.statmethods.net/management/controlstructures.html
