R 語⾔言與資料分析
objects (物件) & categorical variables
Vectorized Operations
Many operations in R are vectorized making code more efficient, concise, and easier to read.
> x <- 1:4; y <- 6:9
> x + y
[1] 7 9 11 13
> x > 2
[1] FALSE FALSE TRUE TRUE
> x >= 2
[1] FALSE TRUE TRUE TRUE
> y == 8
[1] FALSE FALSE TRUE FALSE
> x * y
[1] 6 14 24 36
> x / y
[1] 0.1666667 0.2857143 0.3750000 0.4444444
Vectorized MatrixOperations
> x <- matrix(1:4, 2, 2); y <- matrix(rep(10, 4), 2, 2)
> x * y ## element-wise multiplication
[,1] [,2]
[1,] 10 30
[2,] 20 40
> x / y
[,1] [,2]
[1,] 0.1 0.3
[2,] 0.2 0.4
> x %*% y
[,1] [,2]
[1,] 40 40
[2,] 60 60
Matrix:`colnames()`&`rownames()`
> temp <- matrix(c(rep(0,9), rep(c(0,1,0), c(3,3,3))), nrow = 2)
> temp
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0 0 0 0 0 0 1 1 0
[2,] 0 0 0 0 0 0 1 0 0
> temp <- matrix(c(rep(0,9), rep(c(0,1,0), c(3,3,3))), nrow = 2, byrow = T)
> temp
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0 0 0 0 0 0 0 0 0
[2,] 0 0 0 1 1 1 0 0 0
> baseball.score = temp
> rownames(baseball.score)
NULL
> colnames(baseball.score)
NULL
> rownames(baseball.score) <- c('guest', 'home')
> baseball.score
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
guest 0 0 0 0 0 0 0 0 0
home 0 0 0 1 1 1 0 0 0
> colnames(baseball.score) <- c('1st', '2nd', '3rd', '4th', '5th', '6th', '7th',
'8th', '9th')
> baseball.score
1st 2nd 3rd 4th 5th 6th 7th 8th 9th
guest 0 0 0 0 0 0 0 0 0
home 0 0 0 1 1 1 0 0 0
charactervectors
generate random samples of characters
> colors <- c("red", "orange", "yellow", "green", "blue", "indigo", "violet")
> length(colors) ### 計算 vector ⻑⾧長度
[1] 7
> colors[7] ### 回傳第七項
[1] "violet"
> colors[7] <- "purple" ### 把第七項換成 "purple"
> colors <- c("red", "orange", "yellow", "green", "blue", "indigo", "violet")
> sample(colors)
[1] "violet" "yellow" "orange" "blue" "green" "indigo" "red"
> sample(colors, size=5)
[1] "indigo" "red" "blue" "green" "violet"
> sample(colors, size=5, replace=TRUE)
[1] "violet" "yellow" "indigo" "violet" "violet"
sample(x) # randomly permute the entire vector of state names
sample(x, 4) # randomly picks four states
sample(x, replace=TRUE) # selection with replacement
charactermatrix
> gender <- sample(c("Male", "Female"), 20, rep=TRUE)
> gender
[1] "Male" "Male" "Female" "Male" "Female" "Female" "Female" "Female"
[9] "Female" "Female" "Female" "Male" "Male" "Female" "Male" "Male"
[17] "Female" "Female" "Female" "Male"
> blood.type <-sample(c("A", "B", "AB", "O"), 20, rep=TRUE)
> blood.type
[1] "B" "O" "A" "O" "O" "O" "A" "B" "O" "A" "AB" "O" "B" "B"
[15] "A" "O" "A" "AB" "O" "O"
> gender.blood.type <- cbind(gender, blood.type)
> gender.blood.type
gender blood.type
[1,] "Male" "B"
[2,] "Male" "O"
[3,] "Female" "A"
[4,] "Male" "O"
[5,] "Female" "O"
[6,] "Female" "O"
[7,] "Female" "A"
[8,] "Female" "B"
[9,] "Female" "O"
[10,] "Female" "A"
[11,] "Female" "AB"
[12,] "Male" "O"
[13,] "Male" "B"
[14,] "Female" "B"
[15,] "Male" "A"
[16,] "Male" "O"
[17,] "Female" "A"
[18,] "Female" "AB"
[19,] "Female" "O"
[20,] "Male" "O"
> xtabs( ~ gender + blood.type, data = gender.blood.type)
blood.type
gender A AB B O
Female 4 2 2 4
Male 1 0 2 5
charactermatrix(cont’d)
Factors
Factors are used to represent categorical data.
Factors can be unordered or ordered
A factor as an integer vector where each integer has a label.
Factors are treated specially by modelling functions like lm() and glm()
Using factors with labels is better than using integers because factors are self-describing
e.g., having a variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x
[1] yes yes no yes no
Levels: no yes
> table(x)
x
no yes
2 3
> unclass(x)
[1] 2 2 1 2 1
attr(,"levels")
[1] "no" "yes"
8
Factors
The order of the levels can be set using the levels argument to factor(). This can be important in
linear modelling because the first level is used as the baseline level.
> x <- factor(c("yes", "yes", "no", "yes", "no"),
levels = c("yes", "no"))
> x
[1] yes yes no yes no
Levels: yes no
> colors <- c("red", "orange", “yellow", "green", "blue", "indigo", "violet")
# 隨機抽出 colors 裡⾯面的單位 50 次,可重複。
# 計算各種名稱出現之次數
Factors
As an example of an ordered factor, consider data consisting of the names of months:
> mons = c("March","April","January","November","January",
+ "September","October","September","November","August",
+ "January","November","November","February","May","August",
+ "July","December","August","August","September","November",
+ "February","April")
> mons
[1] "March" "April" "January" "November" "January" "September" "October" "September"
[9] "November" "August" "January" "November" "November" "February" "May" "August"
[17] "July" "December" "August" "August" "September" "November" "February" "April"
> summary(mons)
Length Class Mode
24 character character
> mons = factor(mons)
> table(mons)
mons
April August December February January July March May November October
2 4 1 2 3 1 1 1 5 1
September
3
Factors (cont’d)
The order of months is not reflected in the output of the `table` function. Creating an order factor solves
this problem:
> mons = factor(mons,levels=c("January","February","March",
+ "April","May","June","July","August","September",
+ "October","November","December"),
+ ordered=TRUE)
> table(mons)
mons
January February March April May June July August September October
3 2 1 2 1 0 1 4 3 1
November December
5 1
> head(mons)
[1] March April January November January September
12 Levels: January < February < March < April < May < June < July < August < ... < December
> mons[1] < mons[2]
[1] TRUE
> mons[1] < mons[3]
[1] FALSE
> mons[3] == mons[5]
[1] TRUE

[1062BPY12001] Data analysis with R / week 3

  • 1.
  • 2.
    Vectorized Operations Many operationsin R are vectorized making code more efficient, concise, and easier to read. > x <- 1:4; y <- 6:9 > x + y [1] 7 9 11 13 > x > 2 [1] FALSE FALSE TRUE TRUE > x >= 2 [1] FALSE TRUE TRUE TRUE > y == 8 [1] FALSE FALSE TRUE FALSE > x * y [1] 6 14 24 36 > x / y [1] 0.1666667 0.2857143 0.3750000 0.4444444
  • 3.
    Vectorized MatrixOperations > x<- matrix(1:4, 2, 2); y <- matrix(rep(10, 4), 2, 2) > x * y ## element-wise multiplication [,1] [,2] [1,] 10 30 [2,] 20 40 > x / y [,1] [,2] [1,] 0.1 0.3 [2,] 0.2 0.4 > x %*% y [,1] [,2] [1,] 40 40 [2,] 60 60
  • 4.
    Matrix:`colnames()`&`rownames()` > temp <-matrix(c(rep(0,9), rep(c(0,1,0), c(3,3,3))), nrow = 2) > temp [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [1,] 0 0 0 0 0 0 1 1 0 [2,] 0 0 0 0 0 0 1 0 0 > temp <- matrix(c(rep(0,9), rep(c(0,1,0), c(3,3,3))), nrow = 2, byrow = T) > temp [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [1,] 0 0 0 0 0 0 0 0 0 [2,] 0 0 0 1 1 1 0 0 0 > baseball.score = temp > rownames(baseball.score) NULL > colnames(baseball.score) NULL > rownames(baseball.score) <- c('guest', 'home') > baseball.score [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] guest 0 0 0 0 0 0 0 0 0 home 0 0 0 1 1 1 0 0 0 > colnames(baseball.score) <- c('1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th') > baseball.score 1st 2nd 3rd 4th 5th 6th 7th 8th 9th guest 0 0 0 0 0 0 0 0 0 home 0 0 0 1 1 1 0 0 0
  • 5.
    charactervectors generate random samplesof characters > colors <- c("red", "orange", "yellow", "green", "blue", "indigo", "violet") > length(colors) ### 計算 vector ⻑⾧長度 [1] 7 > colors[7] ### 回傳第七項 [1] "violet" > colors[7] <- "purple" ### 把第七項換成 "purple" > colors <- c("red", "orange", "yellow", "green", "blue", "indigo", "violet") > sample(colors) [1] "violet" "yellow" "orange" "blue" "green" "indigo" "red" > sample(colors, size=5) [1] "indigo" "red" "blue" "green" "violet" > sample(colors, size=5, replace=TRUE) [1] "violet" "yellow" "indigo" "violet" "violet" sample(x) # randomly permute the entire vector of state names sample(x, 4) # randomly picks four states sample(x, replace=TRUE) # selection with replacement
  • 6.
    charactermatrix > gender <-sample(c("Male", "Female"), 20, rep=TRUE) > gender [1] "Male" "Male" "Female" "Male" "Female" "Female" "Female" "Female" [9] "Female" "Female" "Female" "Male" "Male" "Female" "Male" "Male" [17] "Female" "Female" "Female" "Male" > blood.type <-sample(c("A", "B", "AB", "O"), 20, rep=TRUE) > blood.type [1] "B" "O" "A" "O" "O" "O" "A" "B" "O" "A" "AB" "O" "B" "B" [15] "A" "O" "A" "AB" "O" "O" > gender.blood.type <- cbind(gender, blood.type) > gender.blood.type gender blood.type [1,] "Male" "B" [2,] "Male" "O" [3,] "Female" "A" [4,] "Male" "O" [5,] "Female" "O" [6,] "Female" "O" [7,] "Female" "A" [8,] "Female" "B" [9,] "Female" "O" [10,] "Female" "A" [11,] "Female" "AB" [12,] "Male" "O" [13,] "Male" "B" [14,] "Female" "B" [15,] "Male" "A" [16,] "Male" "O" [17,] "Female" "A" [18,] "Female" "AB" [19,] "Female" "O" [20,] "Male" "O" > xtabs( ~ gender + blood.type, data = gender.blood.type) blood.type gender A AB B O Female 4 2 2 4 Male 1 0 2 5 charactermatrix(cont’d)
  • 7.
    Factors Factors are usedto represent categorical data. Factors can be unordered or ordered A factor as an integer vector where each integer has a label. Factors are treated specially by modelling functions like lm() and glm() Using factors with labels is better than using integers because factors are self-describing e.g., having a variable that has values “Male” and “Female” is better than a variable that has values 1 and 2. > x <- factor(c("yes", "yes", "no", "yes", "no")) > x [1] yes yes no yes no Levels: no yes > table(x) x no yes 2 3 > unclass(x) [1] 2 2 1 2 1 attr(,"levels") [1] "no" "yes"
  • 8.
    8 Factors The order ofthe levels can be set using the levels argument to factor(). This can be important in linear modelling because the first level is used as the baseline level. > x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no")) > x [1] yes yes no yes no Levels: yes no > colors <- c("red", "orange", “yellow", "green", "blue", "indigo", "violet") # 隨機抽出 colors 裡⾯面的單位 50 次,可重複。 # 計算各種名稱出現之次數
  • 9.
    Factors As an exampleof an ordered factor, consider data consisting of the names of months: > mons = c("March","April","January","November","January", + "September","October","September","November","August", + "January","November","November","February","May","August", + "July","December","August","August","September","November", + "February","April") > mons [1] "March" "April" "January" "November" "January" "September" "October" "September" [9] "November" "August" "January" "November" "November" "February" "May" "August" [17] "July" "December" "August" "August" "September" "November" "February" "April" > summary(mons) Length Class Mode 24 character character > mons = factor(mons) > table(mons) mons April August December February January July March May November October 2 4 1 2 3 1 1 1 5 1 September 3
  • 10.
    Factors (cont’d) The orderof months is not reflected in the output of the `table` function. Creating an order factor solves this problem: > mons = factor(mons,levels=c("January","February","March", + "April","May","June","July","August","September", + "October","November","December"), + ordered=TRUE) > table(mons) mons January February March April May June July August September October 3 2 1 2 1 0 1 4 3 1 November December 5 1 > head(mons) [1] March April January November January September 12 Levels: January < February < March < April < May < June < July < August < ... < December > mons[1] < mons[2] [1] TRUE > mons[1] < mons[3] [1] FALSE > mons[3] == mons[5] [1] TRUE