[1062BPY12001] Data analysis with R / week 3

R Language and Data Analysis
objects & categorical variables

Vectorized Operations
Many operations in R are vectorized: they act on whole vectors at once, which makes code more efficient, concise, and easier to read.
> x <- 1:4; y <- 6:9
> x + y
[1] 7 9 11 13
> x > 2
[1] FALSE FALSE TRUE TRUE
> x >= 2
[1] FALSE TRUE TRUE TRUE
> y == 8
[1] FALSE FALSE TRUE FALSE
> x * y
[1] 6 14 24 36
> x / y
[1] 0.1666667 0.2857143 0.3750000 0.4444444
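
To make the "concise" claim concrete, here is a minimal sketch that computes the same sum with an explicit loop and with the vectorized `+`. The names `z_loop` and `z_vec` are illustrative and not part of the slides.

```r
x <- 1:4; y <- 6:9

# Explicit loop: add the two vectors one element at a time
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# Vectorized form: a single expression gives the same values
z_vec <- x + y

all(z_loop == z_vec)   # both hold 7 9 11 13
```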

Vectorized Matrix Operations
> x <- matrix(1:4, 2, 2); y <- matrix(rep(10, 4), 2, 2)
> x * y ## element-wise multiplication
     [,1] [,2]
[1,]   10   30
[2,]   20   40
> x / y ## element-wise division
     [,1] [,2]
[1,]  0.1  0.3
[2,]  0.2  0.4
> x %*% y ## true matrix multiplication
     [,1] [,2]
[1,]   40   40
[2,]   60   60
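
The difference between `*` and `%*%` can be checked by hand. The sketch below recomputes one entry of the matrix product as "row of `x` times column of `y`, summed"; `entry_11` and `prod_xy` are illustrative names.

```r
x <- matrix(1:4, 2, 2); y <- matrix(rep(10, 4), 2, 2)

# Entry [1, 1] of x %*% y: row 1 of x times column 1 of y, then summed
entry_11 <- sum(x[1, ] * y[, 1])   # 1*10 + 3*10

prod_xy <- x %*% y
prod_xy[1, 1] == entry_11          # TRUE
```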

Matrix: `colnames()` & `rownames()`
> temp <- matrix(c(rep(0,9), rep(c(0,1,0), c(3,3,3))), nrow = 2) ## filled column by column (the default)
> temp
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    0    0    0    0    0    0    1    1    0
[2,]    0    0    0    0    0    0    1    0    0
> temp <- matrix(c(rep(0,9), rep(c(0,1,0), c(3,3,3))), nrow = 2, byrow = TRUE) ## filled row by row
> temp
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    0    0    0    0    0    0    0    0    0
[2,]    0    0    0    1    1    1    0    0    0
> baseball.score = temp
> rownames(baseball.score)
NULL
> colnames(baseball.score)
NULL
> rownames(baseball.score) <- c('guest', 'home')
> baseball.score
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
guest    0    0    0    0    0    0    0    0    0
home     0    0    0    1    1    1    0    0    0
> colnames(baseball.score) <- c('1st', '2nd', '3rd', '4th', '5th', '6th', '7th',
+ '8th', '9th')
> baseball.score
      1st 2nd 3rd 4th 5th 6th 7th 8th 9th
guest   0   0   0   0   0   0   0   0   0
home    0   0   0   1   1   1   0   0   0
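
Once a matrix has row and column names, those names can be used for indexing and are carried through summaries. A short sketch, rebuilding `baseball.score` so it runs on its own; `home_4th` and `totals` are illustrative names.

```r
baseball.score <- matrix(c(rep(0, 9), rep(c(0, 1, 0), c(3, 3, 3))),
                         nrow = 2, byrow = TRUE)
rownames(baseball.score) <- c("guest", "home")
colnames(baseball.score) <- c("1st", "2nd", "3rd", "4th", "5th",
                              "6th", "7th", "8th", "9th")

# Index by name instead of by position
home_4th <- baseball.score["home", "4th"]

# Total runs per team; the result keeps the row names
totals <- rowSums(baseball.score)
```

Here `home_4th` is 1 and `totals` is 0 for `guest` and 3 for `home`.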

Character vectors
Generating random samples of characters
> colors <- c("red", "orange", "yellow", "green", "blue", "indigo", "violet")
> length(colors) ### length of the vector
[1] 7
> colors[7] ### return the 7th element
[1] "violet"
> colors[7] <- "purple" ### replace the 7th element with "purple"
> colors <- c("red", "orange", "yellow", "green", "blue", "indigo", "violet")
> sample(colors)
[1] "violet" "yellow" "orange" "blue" "green" "indigo" "red"
> sample(colors, size=5)
[1] "indigo" "red" "blue" "green" "violet"
> sample(colors, size=5, replace=TRUE)
[1] "violet" "yellow" "indigo" "violet" "violet"
sample(x) # randomly permutes the entire vector x
sample(x, 4) # randomly picks four elements of x, without replacement
sample(x, replace=TRUE) # draws length(x) elements with replacement
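
Because `sample()` is random, the outputs above will differ from run to run. A common way to make a draw repeatable is to call `set.seed()` first; the seed value 123 below is arbitrary, and `first` and `second` are illustrative names.

```r
colors <- c("red", "orange", "yellow", "green", "blue", "indigo", "violet")

set.seed(123)                      # arbitrary seed, chosen for illustration
first <- sample(colors, size = 5)

set.seed(123)                      # same seed again
second <- sample(colors, size = 5)

identical(first, second)           # TRUE: same seed, same draw
```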

Character matrix
> gender <- sample(c("Male", "Female"), 20, replace=TRUE)
> gender
[1] "Male" "Male" "Female" "Male" "Female" "Female" "Female" "Female"
[9] "Female" "Female" "Female" "Male" "Male" "Female" "Male" "Male"
[17] "Female" "Female" "Female" "Male"
> blood.type <- sample(c("A", "B", "AB", "O"), 20, replace=TRUE)
> blood.type
[1] "B" "O" "A" "O" "O" "O" "A" "B" "O" "A" "AB" "O" "B" "B"
[15] "A" "O" "A" "AB" "O" "O"
> gender.blood.type <- cbind(gender, blood.type)
> gender.blood.type
      gender   blood.type
 [1,] "Male"   "B"
 [2,] "Male"   "O"
 [3,] "Female" "A"
 [4,] "Male"   "O"
 [5,] "Female" "O"
 [6,] "Female" "O"
 [7,] "Female" "A"
 [8,] "Female" "B"
 [9,] "Female" "O"
[10,] "Female" "A"
[11,] "Female" "AB"
[12,] "Male"   "O"
[13,] "Male"   "B"
[14,] "Female" "B"
[15,] "Male"   "A"
[16,] "Male"   "O"
[17,] "Female" "A"
[18,] "Female" "AB"
[19,] "Female" "O"
[20,] "Male"   "O"
> xtabs( ~ gender + blood.type, data = gender.blood.type)
        blood.type
gender   A AB B O
  Female 4  2 2 4
  Male   1  0 2 5
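
The same cross-tabulation can also be produced with `table()`, and `addmargins()` appends row and column totals. This sketch types in the 20 sampled values printed above so that it is reproducible; `tab` is an illustrative name.

```r
# The 20 values shown in the console output above, entered by hand
gender <- c("Male", "Male", "Female", "Male", "Female", "Female", "Female",
            "Female", "Female", "Female", "Female", "Male", "Male", "Female",
            "Male", "Male", "Female", "Female", "Female", "Male")
blood.type <- c("B", "O", "A", "O", "O", "O", "A", "B", "O", "A", "AB", "O",
                "B", "B", "A", "O", "A", "AB", "O", "O")

tab <- table(gender, blood.type)   # counts for each gender / blood type pair
addmargins(tab)                    # same table with a "Sum" row and column

tab["Female", "O"]                 # cells can be looked up by name
```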

Factors
Factors are used to represent categorical data.
Factors can be unordered or ordered.
A factor can be thought of as an integer vector where each integer has a label.
Factors are treated specially by modelling functions like lm() and glm().
Using factors with labels is better than using integers because factors are self-describing;
e.g., having a variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x
[1] yes yes no yes no
Levels: no yes
> table(x)
x
 no yes
  2   3
> unclass(x)
[1] 2 2 1 2 1
attr(,"levels")
[1] "no" "yes"
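
The "integer vector with labels" view can be made explicit: `as.integer()` returns the codes, `levels()` returns the labels, and indexing the labels by the codes recovers the original strings. The names `codes`, `labs`, and `rebuilt` are illustrative.

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))

codes <- as.integer(x)     # integer codes: 2 2 1 2 1
labs  <- levels(x)         # labels, sorted alphabetically by default: "no" "yes"

# Looking up each code in the label vector gives back the original values
rebuilt <- labs[codes]
```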

Factors
The order of the levels can be set using the levels argument to factor(). This can be important in
linear modelling because the first level is used as the baseline level.
> x <- factor(c("yes", "yes", "no", "yes", "no"),
+ levels = c("yes", "no"))
> x
[1] yes yes no yes no
Levels: yes no
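
If a factor already exists, the baseline level can also be changed afterwards with `relevel()`, which moves the chosen level to the front without altering the data. A small sketch; `x2` is an illustrative name.

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))
levels(x)                        # "no" "yes": alphabetical default

# Make "yes" the first (baseline) level; the values themselves are unchanged
x2 <- relevel(x, ref = "yes")
levels(x2)                       # "yes" "no"
```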
> colors <- c("red", "orange", "yellow", "green", "blue", "indigo", "violet")
# Randomly draw 50 elements from colors, with replacement.
# Count how many times each name appears.
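
One possible way to approach this exercise is sketched below; it is not necessarily the intended solution. Converting the draws to a factor with `levels = colors` is an added choice so that every color is counted, even one that happens never to be drawn. The seed value and the names `draws` and `counts` are illustrative.

```r
colors <- c("red", "orange", "yellow", "green", "blue", "indigo", "violet")

set.seed(1)                                    # arbitrary seed, for repeatability
draws <- sample(colors, size = 50, replace = TRUE)

# Tabulate as a factor so all seven colors appear, in the original order
counts <- table(factor(draws, levels = colors))
counts
```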

Factors
As an example of an ordered factor, consider data consisting of the names of months:
> mons = c("March","April","January","November","January",
+ "September","October","September","November","August",
+ "January","November","November","February","May","August",
+ "July","December","August","August","September","November",
+ "February","April")
> mons
[1] "March" "April" "January" "November" "January" "September" "October" "September"
[9] "November" "August" "January" "November" "November" "February" "May" "August"
[17] "July" "December" "August" "August" "September" "November" "February" "April"
> summary(mons)
Length Class Mode
24 character character
> mons = factor(mons)
> table(mons)
mons
    April    August  December  February   January      July     March       May  November   October
        2         4         1         2         3         1         1         1         5         1
September
        3

Factors (cont’d)
The output of the `table` function lists the months alphabetically, not in calendar order. Creating an ordered factor solves
this problem:
> mons = factor(mons,levels=c("January","February","March",
+ "April","May","June","July","August","September",
+ "October","November","December"),
+ ordered=TRUE)
> table(mons)
mons
  January  February     March     April       May      June      July    August September   October
        3         2         1         2         1         0         1         4         3         1
 November  December
        5         1
> head(mons)
[1] March April January November January September
12 Levels: January < February < March < April < May < June < July < August < ... < December
> mons[1] < mons[2]
[1] TRUE
> mons[1] < mons[3]
[1] FALSE
> mons[3] == mons[5]
[1] TRUE
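
Beyond pairwise comparisons, an ordered factor also supports sorting, `min()`/`max()`, and filtering against a level name. The sketch below uses a shortened month vector and R's built-in `month.name` constant for the levels; both are substitutions made for brevity, and `sorted`, `earliest`, and `later` are illustrative names.

```r
# month.name is a built-in character vector: "January", ..., "December"
mons <- factor(c("March", "April", "January", "November"),
               levels = month.name, ordered = TRUE)

sorted   <- sort(mons)             # calendar order: January March April November
earliest <- min(mons)              # January
later    <- mons[mons > "March"]   # months after March: April November
```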
