Introduction to R for Data Science
Lecturers
dipl. ing Branko Kovač
Data Analyst at CUBE/Data Science Mentor at Springboard
Institut za savremene nauke
Data Science zajednica Srbije
branko.kovac@gmail.com
dr Goran S. Milovanović
Data Scientist at DiploFoundation
Data Science zajednica Srbije
goran.s.milovanovic@gmail.com
goranm@diplomacy.edu
Vectors in R
• There are no scalars in R: a <- 5 creates a vector of length one, so length(a) == 1 is TRUE.
• Vectorizing your code is a priority in vector programming languages such as R (more
on vectorization later in this course…).
• !!! - An excellent read: http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
(a little bit advanced at this point, yet highly recommended)
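The two claims above (no scalars, and vectorization as the preferred style) can be illustrated with a minimal sketch. The vector x and its values are hypothetical, chosen only for this example; the loop is shown purely for comparison.

```r
# A "single number" is just a numeric vector of length 1.
a <- 5
is.vector(a)      # TRUE
length(a) == 1    # TRUE

# Vectorized arithmetic: one expression operates on every element.
x <- c(1, 2, 3, 4)          # hypothetical example values
squares_vec <- x^2

# The equivalent explicit loop, for comparison only.
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

identical(squares_vec, squares_loop)  # TRUE
```

Both versions give the same result; the vectorized form is shorter and is usually the faster of the two in R.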
Intro to R for Data Science
Session 2: Vectors, Matrices & Data Frames
# Introduction to R for Data Science
# SESSION 2 :: 5 May, 2016
char_list <- character(length = 0) # empty character vector
> char_list
character(0)
num_list <- numeric(length = 10)
# length can be nonzero (the default length is 0); every element starts as 0
> num_list
[1] 0 0 0 0 0 0 0 0 0 0
log_list <- logical(length = 3) #default value is FALSE
> log_list
[1] FALSE FALSE FALSE
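One common reason to create a vector of a known length up front is to fill it in later by index. The sketch below assumes a hypothetical length n and a made-up filling rule; it also shows the default fill value for each basic type.

```r
# Preallocate a numeric vector, then fill it by index.
n <- 5                              # hypothetical length
cumulative <- numeric(length = n)   # starts as 0 0 0 0 0
for (i in seq_len(n)) {
  cumulative[i] <- i * 10           # made-up rule, for illustration only
}
cumulative                          # 10 20 30 40 50

# Default fill values differ by type.
character(2)   # "" ""
logical(2)     # FALSE FALSE
length(numeric())  # 0: with no length given, the vector is empty
```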
Vectors in R: c(), subsetting
log_list_2 <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE) # some Ts and Fs
> log_list_2
[1] TRUE FALSE FALSE TRUE TRUE TRUE
# Subsetting is a routine operation in R
char_list_2[5] # a single element can be selected
log_list_2[2:4] # or a range of positions
num_list_2[3:length(num_list_2)] # or use length() to reach the last element
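The slide uses char_list_2 and num_list_2 without showing how they were created. The sketch below fills them with hypothetical values (an assumption, not the lecturers' data) so that the subsetting calls can be run end to end, and adds two further index styles for comparison.

```r
# Hypothetical vectors standing in for the ones used on the slide.
char_list_2 <- c("a", "b", "c", "d", "e", "f")
num_list_2  <- c(10, 20, 30, 40, 50)
log_list_2  <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)

char_list_2[5]                     # "e" (indexing starts at 1)
log_list_2[2:4]                    # FALSE FALSE TRUE
num_list_2[3:length(num_list_2)]   # 30 40 50

# Two more index styles:
num_list_2[-1]                     # negative index drops an element: 20 30 40 50
num_list_2[num_list_2 > 25]        # logical index keeps matches: 30 40 50
```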
Vectors in R: ordering, coercing while concatenating
# Vector ordering
sort(test, decreasing = TRUE) # using the sort() function
test[order(test, decreasing = TRUE)] # or with the order() function, which returns positions
# Concatenation
new_num_vect <- c(num_list, num_list_2) #using 2 vectors to create new one
> new_num_vect #?
new_combo_vect <- c(num_list_2, log_list) #combination of num and log vector
new_combo_vect # all numbers? FALSE became zero? coercion in action
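A self-contained sketch of the ordering and coercion behaviour described above. The vector test is not defined on the slide, so the values here are hypothetical.

```r
# Hypothetical vector; the slide does not show how test was defined.
test <- c(3, 11, 7, 1)

sorted_desc <- sort(test, decreasing = TRUE)   # 11 7 3 1
idx <- order(test, decreasing = TRUE)          # positions: 2 3 1 4
identical(test[idx], sorted_desc)              # TRUE

# Coercion while concatenating: c() converts everything to the most
# flexible type present (logical < numeric < character).
mixed_num_log <- c(1.5, TRUE, FALSE)           # 1.5 1.0 0.0
typeof(mixed_num_log)                          # "double"
mixed_chr <- c(1, "a", TRUE)                   # "1" "a" "TRUE"
typeof(mixed_chr)                              # "character"
```

sort() returns the sorted values, while order() returns the positions that would sort them, which is why order() is the one to use when reordering another object by the same key.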
Matrices in R: there are matrices in R, indeed
# Matrices are available in R
matr <- matrix(data = c(1,3,5,7,NA,11), nrow = 2, ncol = 3) #2x3 matrix
class(matr) # yes, it's a matrix (R >= 4.0.0 reports "matrix" "array")
typeof(matr) # double as expected
# Again: R objects (like matrices) have classes, R data (like integers)
# have types; that is the difference between class() and typeof().
• There are a great many (think 1e06) things that you can do with matrices in R. Only a few of them will
be discussed in the second (applied statistical modeling) part of the course.
• Matrices and vectors are fast - as fast as R (not quite a Roadrunner, beep-beep…) can
get. On the deepest implementation level, *everything in R is a vector*, in spite of the
widespread opinion that “everything in R is a list/an object”…
• Again !!! - An excellent read: http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
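The claim that a matrix is a vector underneath can be checked directly. The sketch below reuses the matr definition from the slide; the helper vector v is introduced only for this illustration.

```r
matr <- matrix(data = c(1, 3, 5, 7, NA, 11), nrow = 2, ncol = 3)

dim(matr)       # 2 3
matr[2, 1]      # 3: matrix() fills column by column
matr[1, 2]      # 5
typeof(matr)    # "double"
is.matrix(matr) # TRUE

# A matrix is a plain vector plus a dim attribute.
v <- c(1, 3, 5, 7, NA, 11)   # illustrative helper, not from the slides
dim(v) <- c(2, 3)
identical(v, matr)           # TRUE

# So linear (vector-style) indexing and vector functions still work.
matr[5]                      # NA: fifth element in column-major order
sum(matr, na.rm = TRUE)      # 27
```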
data.frame in R: mastering the Force
# Think of data frame columns as vectors! Because they are!
mean(cars_data$mpg) #mean of cars_data mpg (miles per gallon) column
median(cars_data$cyl) #median of cars_data cyl (cylinders) column
is.list(cars_data[1, ]) # TRUE: a single row is a one-row data frame, hence a list
> is.list(mtcars)
[1] TRUE
> length(mtcars)
[1] 11
> length(colnames(mtcars))
[1] 11
• A data.frame is…
• a list…
• whose components are its columns…
• which are, in turn, vectors.
• Consistency, as in any database:
• a column “is about” something –
but only about that one thing.
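The "list of column vectors" description can be verified in a few lines. This sketch assumes cars_data is simply a copy of the built-in mtcars data set; the slides do not show its definition, although the column names used are consistent with it.

```r
cars_data <- mtcars   # assumption: cars_data is a copy of mtcars

is.list(cars_data)                             # TRUE
length(cars_data) == ncol(cars_data)           # TRUE: list length = number of columns

# $ and [[ ]] extract a column as a plain vector...
identical(cars_data$mpg, cars_data[["mpg"]])   # TRUE
is.vector(cars_data$mpg)                       # TRUE

# ...while single brackets return a smaller data frame (still a list).
class(cars_data["mpg"])                        # "data.frame"
class(cars_data[1, ])                          # "data.frame"

# Each column holds one type only.
col_types <- sapply(cars_data, typeof)         # all "double" for mtcars
```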
data.frame in R: subsetting data.frames
cars_data[c(1,3)] #keeping 1st and 3rd column only
cars_data[-c(1,3)] #removing 1st and 3rd column
cars_data[ ,-c(1,3)] #same as the previous line of code
cars_data[!duplicated(cars_data$mpg), ] # drop cars whose mpg repeats an earlier row;
# remember that duplicated() keeps only the first occurrence!
subset(cars_data, mpg < 19) # this is one way (and it can be slow!)
cars_data[cars_data$mpg < 19, ] # this is another one (faster)
cars_data[which(cars_data$mpg < 19), ] # and another one (often faster still; which() also drops NAs)
cars_data[cars_data$mpg > 20 & cars_data$am == 1, ] # multiple conditions
cars_data[grep("Merc", row.names(cars_data), value = TRUE), ] # filtering rows by pattern match on row names
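Apart from speed, the three filtering styles above differ in how they treat missing values. The sketch below assumes cars_data is a copy of mtcars and plants one NA in mpg (a hypothetical modification, made only to expose the difference). It also shows grepl() as an alternative to grep(..., value = TRUE) for filtering by row name.

```r
cars_data <- mtcars        # assumption: cars_data is a copy of mtcars
cars_data$mpg[1] <- NA     # hypothetical missing value, for illustration

by_logical <- cars_data[cars_data$mpg < 19, ]         # NA in the condition yields an all-NA row
by_which   <- cars_data[which(cars_data$mpg < 19), ]  # which() silently drops the NA
by_subset  <- subset(cars_data, mpg < 19)             # subset() also treats NA as FALSE

nrow(by_logical) - nrow(by_which)                     # 1: the extra all-NA row
identical(rownames(by_which), rownames(by_subset))    # TRUE

# Pattern filter with a logical mask instead of matched names.
merc_rows <- cars_data[grepl("Merc", rownames(cars_data)), ]
```

When the filtered column may contain NA, which() or subset() is usually the safer choice than a bare logical index.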
data.frame in R: separation, joining, names(), rownames(), and colnames()
# Separation and joining of data frames
low_mpg <- cars_data[cars_data$mpg < 15, ] #new data frame with mpg < 15
high_mpg <- cars_data[cars_data$mpg >= 15, ] #new data frame with mpg >= 15
mpg_join <- rbind(low_mpg, high_mpg) # we can combine 2 data frames like this
car_condition <- data.frame(sample(c("old","new"), replace = TRUE, size = 32)) # creating a random
# data frame with "old" and "new" values
names(car_condition) <- "condition" # for all kinds of objects
colnames(car_condition) <- "condition" # for "matrix-like" objects, but same effect here
rownames(car_condition) <- rownames(cars_data) # use row names of one data frame as row
# names of another
# or combine data frames like this (cbind() pairs rows by position, not by row name):
mpg_join <- cbind(mpg_join, car_condition)
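One detail worth checking when combining data frames this way: rbind() of the two halves changes the row order relative to cars_data, and cbind() pairs rows by position. The sketch below assumes cars_data is a copy of mtcars and uses an arbitrary seed; it reorders car_condition by row name before binding so that each car keeps its own value.

```r
cars_data <- mtcars   # assumption: cars_data is a copy of mtcars

low_mpg  <- cars_data[cars_data$mpg < 15, ]
high_mpg <- cars_data[cars_data$mpg >= 15, ]
mpg_join <- rbind(low_mpg, high_mpg)

nrow(mpg_join) == nrow(cars_data)                    # TRUE: no rows lost
identical(rownames(mpg_join), rownames(cars_data))   # FALSE: row order changed

set.seed(42)          # arbitrary seed, only for reproducibility
car_condition <- data.frame(
  condition = sample(c("old", "new"), size = nrow(cars_data), replace = TRUE),
  row.names = rownames(cars_data)
)

# Index car_condition by the row names of mpg_join so both frames
# are in the same order before cbind() pairs them by position.
aligned <- cbind(mpg_join, car_condition[rownames(mpg_join), , drop = FALSE])
```

With random values the misalignment is harmless, but with real per-car data the reordering step (or a join on row names) matters.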