Introduction to Data Analysis and Graphics in R
Hellen Gakuruh
2017-03-31
Session three
Data Entry, Management and Manipulation in R
Outline
• Creating a dataset
• Understanding datasets
• Data input
• Useful functions for working with datasets
• Creating new variables
• Recoding and renaming variables
• Missing and date values
• Type conversions
• Sorting data
• Merging datasets
• Subsetting datasets
• Using SQL statements to subset dataframes
Creating a dataset
• Data sets can be created with any of R’s data structures, i.e. a dimensionless
vector, a 1 dim array, a matrix, an array, a data frame or a list
• There are two ways to create a data set:
– Using a spreadsheet-like data editor
– By coding them in
Invoking spreadsheet-like data editor in R
• R has four functions to invoke a spreadsheet-like data editor, these are:
– edit()
– fix()
– data.entry(), and
– dataentry()
Note on using spreadsheet-like data editors in R
• Using these functions goes against R’s core strength: programming/coding
• Not a recommended way, as it loses documentation/reproducibility
Coding in data
• To code in data, function scan() can be quite handy in addition to calling
functions for any of the data structures; c() for vector, matrix(), array(),
data.frame(), and list()
• scan() is also not a good data entry process, as it loses reproducibility
when data is entered interactively (console)
Understanding datasets
• It can be a single variable or multiple variables
• In R, a single variable can be a dimensionless vector (created with “c()”) or
a 1 dim array
• For multiple variables, if they are all of the same type (especially if numeric),
then a matrix is a better data structure; otherwise, for multiple types of the
same length, a data frame is ideal
Understanding datasets
• If data is of different length and type, generic lists are appropriate.
• Lists can also be used to store different data sets for a particular project
as well as accompanying source code/function
Data input
We will look at:
• Spreadsheet data entry using "data.entry()"
• Using "scan()"
• Coding in data using data structure functions c(), matrix(), array(),
data.frame(), and list()
Spreadsheet data entry
• First, there must be a variable or list of variables to hold the data
• Then call data.entry() on them
• From pop-up data editor, click on individual cell and enter data
• Variable names can be changed from data entry
Spreadsheet data entry demonstration
Data entry using “scan” function
• Can be used to input 1 dim atomic vectors
• Values are entered interactively (on console) if a file is not given
• For each entry, type a value and press Enter; after the last value press Enter
again on an empty line and entry mode will be exited
• Important to assign to a variable name and to specify the type if it is not “double”:
dataset2 <- scan(what = "character")
Demonstration on data input using function “scan”
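A reproducible sketch of the same idea: scan() also accepts its input through the text argument, so the values can live in the script instead of being typed at the console. The object names and values below are illustrative only.

```r
# Numeric input (the default type read by scan() is double)
dataset1 <- scan(text = "12 54 98", quiet = TRUE)

# Character input: the type must be stated through 'what'
dataset2 <- scan(text = "a b c", what = "character", quiet = TRUE)

dataset1
dataset2
```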
Data entry using data structure functions
• Recommended way to generate data in R (ideally small data)
• Data structure functions include:
– c() for atomic vectors
– matrix() for matrices
– array() for arrays of 1 or more dimensions
– data.frame() for data frames
– list() for lists
Data entry using c()
• Used to create individual variables of any type as long as all elements are
of the same type e.g all logical or all character
# A numeric (double) vector
num <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # same values as 1:10, which is stored as integer
# A logical vector
logi <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
Data entry using function c()
# A character vector
R_authours <- c("Douglas Bates", "John Chambers", "Peter Dalgaard", "Seth Falcon", "Robert G
Data entry using “matrix()”
• 2 dimensional vectors (store data as rows and columns)
• Primarily created with function matrix() but rbind(), cbind() and
as.matrix() can be used to convert other vectors to a matrix
• Function matrix() can be called without any input thus creating an empty
matrix
• Argument “dimnames” can only be NULL (nothing) or a list
Data entry using “matrix()”
mat1 <- matrix(data = 1:9, nrow = 3, dimnames = list(NULL, c("a", "b", "c")))
mat1
a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
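The other routes to a matrix mentioned above, rbind(), cbind() and as.matrix(), can be sketched as follows. The vectors and the small data frame are made up for illustration.

```r
v1 <- 1:3
v2 <- 4:6

# Stack vectors as rows or as columns
by_row <- rbind(v1, v2)   # 2 x 3 matrix with row names "v1" and "v2"
by_col <- cbind(v1, v2)   # 3 x 2 matrix with column names "v1" and "v2"

# Convert a data frame whose columns share one type into a matrix
from_df <- as.matrix(data.frame(a = 1:2, b = 3:4))

dim(by_row)
dim(by_col)
is.matrix(from_df)
```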
Data entry using “array()”
• Multi-dimensional structures (1 or more dims), but often used for 3 dim structures
• Can only hold one data type
• Matrices are a special form of these data structures (they have 2 dims)
• Primarily created with function array()
“array()” (cont)
dims <- list(1:3, c("a", "b", "c"), c("Yes", "No"))
arry <- array(data = seq(1, 9*2), dim = c(3, 3, 2), dimnames = dims)
arry
, , Yes
a b c
1 1 4 7
2 2 5 8
3 3 6 9
, , No
a b c
1 10 13 16
2 11 14 17
3 12 15 18
Data entry using ‘data.frame()‘
• Similar to matrices except they can contain different types of data as long
as they have the same length (number of elements)
• Though they resemble matrices, they are actually lists of vectors
• Columns contain measurements on one variable and rows contain cases
• Primarily created by data.frame()
data.frame()
# Example of weight loss data set
dataset3 <- data.frame(ID = 1:5, Exercise = c(TRUE, TRUE, FALSE, TRUE, FALSE), Height = c(5.
dataset3
ID Exercise Height Weight
1 1 TRUE 5.2 69
2 2 TRUE 4.9 72
3 3 FALSE 5.1 75
4 4 TRUE 5.2 67
5 5 FALSE 5.4 77
Data entry using “list()”
• A bit unique, as not many statistical programs have a similar data structure
• A sort of “carry-all” data structure
• Can also contain sub-lists, hence referred to as recursive
• Primarily created by list()
“list()” (cont)
lst <- list(vect = 5:9, Matrix = mat1, Array = arry, Dataframe = dataset3, List = list("a",
str(lst)
List of 5
$ vect : int [1:5] 5 6 7 8 9
$ Matrix : int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:3] "a" "b" "c"
$ Array : int [1:3, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "dimnames")=List of 3
.. ..$ : chr [1:3] "1" "2" "3"
.. ..$ : chr [1:3] "a" "b" "c"
.. ..$ : chr [1:2] "Yes" "No"
$ Dataframe:'data.frame': 5 obs. of 4 variables:
..$ ID : int [1:5] 1 2 3 4 5
..$ Exercise: logi [1:5] TRUE TRUE FALSE TRUE FALSE
..$ Height : num [1:5] 5.2 4.9 5.1 5.2 5.4
..$ Weight : num [1:5] 69 72 75 67 77
$ List :List of 2
..$ : chr "a"
..$ : int [1:2] 2 3
R’s objects and properties
• Everything in R is referred to as an object, from data structures to functions,
and all objects have two intrinsic attributes:
– Mode and
– Length
• Mode is the basic type of an object’s core constituents
• Length is the extent or number of elements in an object
• Functions mode() and length() are used to establish the mode and length of
an object
Establishing basic composition of objects
mode(num)
[1] "numeric"
mode(mat1)
[1] "numeric"
mode(arry)
[1] "numeric"
mode(dataset3)
[1] "list"
mode(lst)
[1] "list"
Establishing length of an object
# Atomic vector
length(num)
[1] 10
# Matrix
length(mat1)
[1] 9
Establishing length of an object
• Length is not the best attribute for assessing a matrix or an array, “dim”
is more appropriate
mat1; dim(mat1)
a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
[1] 3 3
Establishing length of R objects (cont)
# Data frames
length(dataset3) # This shows number of variables not cases/rows
[1] 4
# Lists
length(lst)
[1] 5
Difference between typeof(), mode() and storage.mode()
• There are 3 functions for checking the basic constituents of an object, these
are:
– mode() which is an S compatible function for checking type
– storage.mode() which is used for compatibility when calling functions
written in other languages (ensures data is of expected type)
– typeof() which reports R’s internal type of an object (mode() and
storage.mode() both rely on it)
Selecting between typeof(), mode(), and storage.mode()
• Which function should be used? Depends on why,
– If it’s just a general query, then typeof() is adequate
– If working with other S objects, use mode()
– If calling functions written in other languages, use storage.mode()
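To see where the three functions agree and where they differ, a small comparison can help. This is only an illustrative sketch; the objects compared (an integer vector, a double vector and the built-in function sum) are arbitrary choices.

```r
x_int <- 1:3
x_dbl <- c(1.5, 2.5)

# Integers and doubles share mode "numeric" but differ in type
types <- data.frame(
  object       = c("1:3", "c(1.5, 2.5)", "sum"),
  typeof       = c(typeof(x_int), typeof(x_dbl), typeof(sum)),
  mode         = c(mode(x_int), mode(x_dbl), mode(sum)),
  storage.mode = c(storage.mode(x_int), storage.mode(x_dbl), storage.mode(sum))
)
types
```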
Other Attributes
• Attributes are basically meta data about an object in R
• All objects (except NULL) can have one or more attributes
• Attributes are stored as a pairlist i.e. name=value
• List of all attributes for an object are given by attributes()
• Individual attributes are given by attr()
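A minimal sketch of querying and setting attributes with attributes() and attr(). The attribute called "note" is a made-up, user-defined attribute used only for illustration.

```r
x <- 1:6
attributes(x)                  # NULL: a plain vector carries no attributes

attr(x, "dim") <- c(2, 3)      # same effect as dim(x) <- c(2, 3)
attr(x, "note") <- "toy data"  # hypothetical user-defined attribute

names(attributes(x))           # both attributes are now listed
attr(x, "note")
is.matrix(x)                   # TRUE: the dim attribute makes it a matrix
```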
Other attributes (cont)
• Other than mode and length other often used attributes are:
– Names
– Dimensions (dim)
– Dimnames
– Classes, and
– Time series
Names Attribute
• Used to name individual elements of a data object
• They are not mandatory, but quite handy when indexing elements
• Accessed with names() and set with names(object) <-
• colnames() is used for the columns of matrix-like objects
Querying and setting element’s names
# Creating an unnamed vector
vect1 <- c(12, 54, 98)
names(vect1)
NULL
# Naming vector elements
names(vect1) <- c("a", "b", "c")
names(vect1)
[1] "a" "b" "c"
Naming elements (cont)
# An unnamed matrix
mat3 <- matrix(1:9, 3)
mat3
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
# Naming a matrix
colnames(mat3) <- c("a", "b", "c")
mat3
a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Dimensions Attribute
• There are at least two data structures with a dimension attribute, these are
arrays (including matrices) and data frames
• Function dim() is used to query an object’s dimensions and dim() <- is used to
set the dimensions of an object
• There is a difference between an atomic vector and a 1 dim array; the latter
has a dim attribute while the former does not
• Giving a vector dimensions changes its data structure from a vector to an
array
Dim attribute (cont)
# An atomic vector (dimensionless)
vect2 <- 1
vect2
[1] 1
dim(vect2)
NULL
# Converting to 1 dimension array
dim(vect2) <- 1
vect2
[1] 1
dim(vect2)
[1] 1
Dimnames Attribute
• Gives names to dimensions
• Like “dim” attribute, “dimnames” attributes are given to vectors with dim
attribute like matrices, array and data frames
• Dimnames are given as a list of names (same length as “dim(x)”)
Querying and setting dimnames
# Matrix with no dimnames
vect3 <- matrix(1:9, 3)
vect3
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
# Adding dimnames
dimnames(vect3) <- list(1:3, c("a", "b", "c"))
vect3
a b c
1 1 4 7
2 2 5 8
3 3 6 9
Classes Attribute
• Class attribute is a special type of information used by functions called
“methods”
• Used to determine how an object should be handled/acted upon
• All objects have an intrinsic (implicit) class, which is basically their data
type, but other classes can be added to an object
Class attribute (cont)
• Classes are character vectors accessed and added with function class()
and class() <- respectively, or with attr(obj, "class")
• When a class is added to an object, that object is called an S3 object. This
makes it part of R’s Object Oriented Programming (OOP)
Querying and adding class attribute
# Intrinsic class attribute
vect <- 1:5; class(vect)
[1] "integer"
# (Assigned) Class attribute
attr(vect, "class")
NULL
# Add class with either
attr(vect, which = "class") <- "myclass"
# OR class(vect) <- "myclass"
# Query class attribute
attr(vect, "class")
[1] "myclass"
Useful note on adding class attribute
• Adding classes has implications as far as “method dispatch” (selection
of a suitable function) is concerned
• For example, changing from the intrinsic class “numeric” to “myclass” means
functions/methods for “myclass”, if found, will be applied first
• Basically, when a generic function such as “plot()” or “mean()” is called,
it looks for a method suitable for the first listed class in the class vector; it
is not until all classes have been checked that a method for the intrinsic class
(or the default method) is dispatched
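Method dispatch can be sketched with a toy generic. The generic describe() and the classes "myclass" and "otherclass" are hypothetical names invented for this illustration; they are not part of base R.

```r
# A hypothetical generic with a default method and a "myclass" method
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "default method"
describe.myclass <- function(x, ...) "myclass method"

vect <- 1:5
before <- describe(vect)   # no class attribute yet: falls through to the default

# Classes are tried in the order listed; "otherclass" has no method,
# so dispatch moves on to "myclass"
class(vect) <- c("otherclass", "myclass")
after <- describe(vect)

before
after
```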
Time series Attribute
• Used for data with a time dimension like hourly, daily, weekly, monthly,
quarterly or annual data
• Created by adding a “tsp” attribute
• It ensures time series parameters such as “start”, “end”, and “frequency”
are kept
• It also provides compatibility with S version 2
Tsp Attribute
# Random annual data
set.seed(28)
tms <- round(rnorm(12, 56))
tms
[1] 54 56 55 54 56 57 56 56 56 58 55 58
attributes(tms)
NULL
# Adding attribute `tsp`
tsp(tms) <- c(start = 1, end = 12, frequency = 1)
attributes(tms)
$tsp
[1] 1 12 1
R’s data sets
• R has a number of data sets
• Full documentation: help(package = datasets)
• Currently there are 104 data sets
• Out of these:
Data Structure                   Number
Array                            1
Character (1 dim vector)         2
Data frame                       46
Dist (Distance Matrix)           2
Factor (1 dim integer vector)    2
List                             4
Matrix                           8
Numeric (1 dim vector)           6
Table (Atomic vectors)           51
ts                               28
Conditional Statements
• Used to check whether certain conditions are met by some data, like observations
above a certain value
• They are also called control structures. They include:
– if-else
– ifelse
– for
Conditional Statements (cont)
• Others are
– while
– repeat
– break
– next, and
– switch
• We discuss the frequently used control structures, that is if-else, ifelse and
for
if-else()
• Used to check if a condition evaluates to TRUE, and if so an action is
performed
• It can be extended with alternative action(s) using “else if” or “else”
• When an “else” statement is given, it must be on the same line as the closing
brace of the if statement
• Example: check whether a vector has intrinsic type “character”; if it does, we
convert it to a factor, else we leave it as it is
• Note: if-else can only be performed if the condition evaluates to one logical
value, either TRUE or FALSE
if-else() example
x <- c("a", "b", "c")
class(x)
[1] "character"
if (is.character(x)) { # always returns a single TRUE or FALSE
  x <- as.factor(x)
} else {
  x
}
class(x)
[1] "factor"
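The “else if” extension mentioned earlier can be sketched as a chain of conditions, each of which must evaluate to a single TRUE or FALSE. The function name type_label() is hypothetical.

```r
# Hypothetical helper: label an object by its basic type
type_label <- function(x) {
  if (is.character(x)) {
    "character"
  } else if (is.numeric(x)) {
    "numeric"
  } else {
    "something else"
  }
}

type_label("a")
type_label(1:3)
type_label(TRUE)
```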
ifelse() control structure
• Used when a condition evaluates to a logical vector of length > 1
• Excellent for recoding variables, hence an example is done under “recoding
variables”
for() control structure
• for() is a looping structure used to perform repetitive tasks
• Though in most programming languages this is a frequently used construct,
in R there are often more concise alternatives, like the apply group of functions
• for iterates through a sequence of values, performing the action
defined in its body (the body of any function, including a for loop, is what is in
between {})
• As a simple example, let’s say Hello five times
for() loop example
for (i in 1:5) { # variable "i" is a counter (counting from 1 to 5)
  cat("Hello \n") # function "cat" is used to print to console
}
Hello
Hello
Hello
Hello
Hello
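As a rough comparison between a for loop and the apply group of functions, the sketch below computes the same squares both ways. vapply() is used here as one member of that group; the variable names are illustrative.

```r
# Squares computed with a for loop
squares_loop <- numeric(5)
for (i in 1:5) {
  squares_loop[i] <- i^2
}

# Same result with vapply(), which also checks that each result is one number
squares_apply <- vapply(1:5, function(i) i^2, numeric(1))

identical(squares_loop, squares_apply)
```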
Recoding variables
• Recoding a variable means changing its values
• It is often recommended to create a new variable instead of overwriting
original variable
• Example:
– Create a dichotomous recoded variable of “feed” variable from
“chickwts” data set
– This variable will have values “casein” and “others” (this is something
often done during analysis)
Recoding variables (cont)
str(chickwts)
'data.frame': 71 obs. of 2 variables:
$ weight: num 179 160 136 227 217 168 108 124 143 140 ...
$ feed : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
# Current categories of variable of interest (feed)
levels(chickwts$feed)
[1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"
Recoding variables (cont)
# Recoding with function "ifelse"
chickwts$feed2 <- ifelse(chickwts$feed == "casein", yes = "casein", no = "other")
# Converting to a factor vector
chickwts$feed2 <- factor(chickwts$feed2)
# New levels
levels(chickwts$feed2)
[1] "casein" "other"
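When more than two categories are needed, ifelse() calls can be nested; cut() is a common alternative for numeric variables. The weights and cut-off values below are made up for illustration.

```r
weights <- c(108, 179, 227, 340)   # illustrative values

# Nested ifelse(): the second call handles everything not "light"
size <- ifelse(weights < 150, "light",
               ifelse(weights < 250, "medium", "heavy"))
size <- factor(size, levels = c("light", "medium", "heavy"))

# The same recoding with cut(); right = FALSE gives intervals [lower, upper)
size_cut <- cut(weights, breaks = c(-Inf, 150, 250, Inf),
                labels = c("light", "medium", "heavy"), right = FALSE)

table(size)
```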
Renaming variables
• Using base R, variables in a data frame are renamed with the names() <- function;
either all variable names are supplied, or the name to change is selected by indexing
• For example, to rename “feed2” from previous slide:
# Current name
names(chickwts)
[1] "weight" "feed" "feed2"
Renaming variables (cont)
# Renaming variables (here all names are provided)
names(chickwts) <- c("weight", "feed", "casein")
names(chickwts)
[1] "weight" "feed" "casein"
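A single variable can also be renamed by indexing the names vector, leaving the other names untouched. The small data frame below is a stand-in built for illustration.

```r
# Stand-in data frame with the same column names as before
df <- data.frame(weight = c(179, 160),
                 feed   = c("horsebean", "casein"),
                 feed2  = c("other", "casein"))

# Rename only the column currently called "feed2"
names(df)[names(df) == "feed2"] <- "casein"
names(df)
```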
Missing values
• In R, denoted with logical value “NA”
• Many operations cannot be performed when there are missing values
• is.na() is used to check for missing values
• If negated with “!” in front, it flags the available (non-missing) values
• For matrices and data frames, complete.cases() might be more appropriate
Missing values (cont)
# Vector with a missing value
vect1 <- c(letters[1:5], NA); vect1
[1] "a" "b" "c" "d" "e" NA
# A logical vector checking for missing values
is.na(vect1)
[1] FALSE FALSE FALSE FALSE FALSE TRUE
Missing values: complete case for matrices
vect2 <- letters[1:6]
mat3 <- rbind(vect1, vect2)
mat3
[,1] [,2] [,3] [,4] [,5] [,6]
vect1 "a" "b" "c" "d" "e" NA
vect2 "a" "b" "c" "d" "e" "f"
complete.cases(mat3)
[1] FALSE TRUE
• Output indicates the first row/case is not complete but the second is
complete
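A short sketch of how missing values affect a computation and how complete cases can be kept. The vector and data frame are illustrative.

```r
scores <- c(12, NA, 30)

mean(scores)                 # NA: the missing value propagates
mean(scores, na.rm = TRUE)   # 21: missing value dropped before averaging

# Keep only the rows of a data frame that have no missing values
df_scores <- data.frame(id = 1:3, score = scores)
complete  <- df_scores[complete.cases(df_scores), ]
complete
```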
Date values
• Initially imported or created as numeric or character vectors
• Conversion (to a date or date/time class: Date, POSIXlt or POSIXct) depends on
whether they are character or numeric
• One way to convert a character vector to a date object is by using function
as.Date(), specifying argument format as detailed by ?strftime
• as.Date() can also be used to convert a numeric vector to a date object,
by specifying argument origin; the origin used by R is “1970-01-01”
Date values (cont)
# Converting a character vector
date1char <- c("3/6/2017", "3/7/2017", "4/7/2017")
class(date1char)
[1] "character"
date1 <- as.Date(date1char, format = "%m/%d/%Y")
Date Values (cont)
date1
[1] "2017-03-06" "2017-03-07" "2017-04-07"
class(date1)
[1] "Date"
Date values (cont)
# Converting a numeric vector
date1num <- c(17231, 17232, 17263)
class(date1num)
[1] "numeric"
date2 <- as.Date(date1num, origin = "1970-01-01")
Date values (cont)
date2
[1] "2017-03-06" "2017-03-07" "2017-04-07"
class(date2)
[1] "Date"
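Once values are stored as Date objects, they can be reformatted and used in arithmetic. A minimal sketch, reusing two of the dates from the slides:

```r
d <- as.Date(c("2017-03-06", "2017-04-07"))

format(d, "%d/%m/%Y")       # "06/03/2017" "07/04/2017"
as.numeric(d[2] - d[1])     # 32 (days between the two dates)

# A weekly sequence starting from the first date
weekly <- seq(d[1], by = "week", length.out = 3)
weekly
```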
Conversion between data types
• To convert from one data type to another, use as.data_type
like as.logical(), as.integer(), as.double(), as.character(),
as.raw(), and as.complex()
• But it must be convertible e.g.
– Can convert from logical to character, but converting a character to logical
results in NA unless it is a recognised form such as “TRUE/FALSE” or “true/false”
– Converting character to integer or double only works for strings that
represent numbers; anything else results in NA (with a warning)
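A few conversions, including ones that fail and return NA, can be sketched as follows. suppressWarnings() is used only to keep the coercion warning out of the output.

```r
to_chr <- as.character(TRUE)                        # "TRUE"
to_int <- as.integer("12")                          # 12
to_lgl <- as.logical(c("TRUE", "true", "T", "yes")) # TRUE TRUE TRUE NA
to_num <- suppressWarnings(as.numeric("abc"))       # NA: not a number
trunc3 <- as.integer(3.9)                           # 3: decimals are dropped

to_lgl
```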
Sorting data
• Sorting an atomic vector is done with sort()
• Sorting a data frame is done with order()
• Matrices are actually atomic vectors with dimensions, hence sorted with
looping function apply
• By default sorting is done in increasing order; this can be reversed by setting
argument “decreasing” to TRUE
• Logical values ordered according to their integer form, i.e. TRUE = 1,
FALSE = 0 (TRUE > FALSE)
Sorting vectors
# Unsorted random numbers
set.seed(58)
tosort <- round(rnorm(10, 87, 10))
tosort
[1] 83 91 97 80 81 68 84 92 106 96
# Sorted vector (increasing)
sort(tosort)
[1] 68 80 81 83 84 91 92 96 97 106
# Sorted vector (decreasing)
sort(tosort, TRUE)
[1] 106 97 96 92 91 84 83 81 80 68
Sorting Matrices
mat2sort <- matrix(tosort[-1], 3, dimnames = list(1:3, c("a", "b", "c")))
mat2sort
a b c
1 91 81 92
2 97 68 106
3 80 84 96
# Sort by columns of a matrix
apply(mat2sort, 2, sort)
a b c
[1,] 80 68 92
[2,] 91 81 96
[3,] 97 84 106
Sorting Data frames
set.seed(3)
v1 <- round(rnorm(9, 50, 10))
set.seed(3)
v2 <- round(rnorm(9, 90))
set.seed(3)
logi <- sample(c(TRUE, FALSE), 9, TRUE, c(0.7, 0.3))
df1 <- data.frame(Logi = logi, V1 = v1, V2 = v2)
Sorting Data frames (cont)
# Sorted by first variable "logi"
df1[order(df1$Logi, decreasing = TRUE),]
Logi V1 V2
1 TRUE 40 89
3 TRUE 53 90
4 TRUE 38 89
5 TRUE 52 90
6 TRUE 50 90
7 TRUE 51 90
8 TRUE 61 91
9 TRUE 38 89
2 FALSE 47 90
Sorting data frames by more than one variable
• Sorting by more than one variable is first done on first listed variable then
the second and so on.
• Example:
– Sort variable Logi in a decreasing manner (TRUE first)
– Then sort variable “V1” in a decreasing manner
Sorting data frames example
df1[order(df1$Logi, df1$V1, decreasing = TRUE),]
Logi V1 V2
8 TRUE 61 91
3 TRUE 53 90
5 TRUE 52 90
7 TRUE 51 90
6 TRUE 50 90
1 TRUE 40 89
4 TRUE 38 89
9 TRUE 38 89
2 FALSE 47 90
Sorting by both decreasing and ascending order
# Negative sign used to indicate decreasing
df1[order(-df1$Logi, df1$V1), ]
Logi V1 V2
4 TRUE 38 89
9 TRUE 38 89
1 TRUE 40 89
6 TRUE 50 90
7 TRUE 51 90
5 TRUE 52 90
3 TRUE 53 90
8 TRUE 61 91
2 FALSE 47 90
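The negative sign only works for numeric (or logical) variables. As an alternative sketch, order() accepts one decreasing flag per sorting variable when method = "radix" is used. The data frame below is made up for illustration.

```r
df_sort <- data.frame(group = c("b", "a", "b", "a"),
                      score = c(3, 9, 7, 1))

# group ascending, score descending, using the negative sign
idx_minus <- order(df_sort$group, -df_sort$score)

# Same ordering with one decreasing flag per variable
idx_radix <- order(df_sort$group, df_sort$score,
                   decreasing = c(FALSE, TRUE), method = "radix")

df_sort[idx_radix, ]
```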
Merging data sets
• Done by similar (intersecting) columns
• Can use database semantics
• Core considerations for merging
– Default merging done by intersect(names(x), names(y))
– Otherwise specific columns in each can be given especially if they do
not have same name or capitalization
Merging data frames
# Additional data set
dataset4 <- data.frame(ID = 6:10, Exercise = c(TRUE, FALSE, TRUE, TRUE, FALSE), Height = c(5
# Similar columns to be used for merging
intersect(names(dataset3), names(dataset4))
[1] "ID" "Exercise" "Height" "Weight"
Merging data frames
# Merging (adding cases)
merge(dataset3, dataset4, all = TRUE)
ID Exercise Height Weight
1 1 TRUE 5.2 69
2 2 TRUE 4.9 72
3 3 FALSE 5.1 75
4 4 TRUE 5.2 67
5 5 FALSE 5.4 77
6 6 TRUE 5.4 77
7 7 FALSE 5.4 74
8 8 TRUE 5.2 75
9 9 TRUE 5.6 79
10 10 FALSE 5.4 82
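When the key columns differ in name or capitalization, they can be named explicitly with by.x and by.y. The two small data frames below are hypothetical; all.x = TRUE keeps every row of the first data frame.

```r
people <- data.frame(ID = 1:3, Height = c(5.2, 4.9, 5.1))
visits <- data.frame(id = c(1, 3, 4), Weight = c(69, 75, 80))

# Only IDs present in both data frames (1 and 3)
inner <- merge(people, visits, by.x = "ID", by.y = "id")

# All rows of 'people'; Weight is NA where there is no match (ID 2)
left <- merge(people, visits, by.x = "ID", by.y = "id", all.x = TRUE)

inner
left
```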
Subsetting data sets
Look at:
• Indexing
• Subsetting/extracting operators
• Subsetting different data objects
Indexing
• Indexing vectors are used to access elements from different data objects,
they include:
– Logical vector
– Positive integers
– Negative integers and
– Character vectors
• Note: It’s possible to leave the index empty, as in x[] (empty indexing)
Indexing (cont)
• Logical vectors select elements which evaluate to TRUE
• Positive integers select elements at given positions
• Negative integers exclude values at given positions
• Character indices are only appropriate for named elements
• An empty index selects all values; it is used to replace all entries while
keeping the object’s attributes
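Each kind of index vector can be tried on a small named vector. The vector below is illustrative.

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

by_logical  <- x[x > 15]        # elements where the condition is TRUE
by_position <- x[c(1, 3)]       # elements at positions 1 and 3
by_negative <- x[-2]            # everything except position 2
by_name     <- x[c("a", "d")]   # elements named "a" and "d"

# Empty index: replace every value while keeping the names attribute
x[] <- 0
x
```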
Subsetting/Extracting operators
• There are three extracting operators and one extracting function
– [
– [[
– $, and
– getElement()
Subsetting/Extracting operators
• "[" can select more than one element and keeps their names if present
while "[[" and "$" can only select one element without their names
• "$" is only applicable for recursive objects (generic/list data structures),
basically data frames and lists
• "getElement()" function is similar to extracting with "[["
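The differences between the operators can be sketched on a small list. The list and its element names are made up; note that "$" allows partial matching of names while "[[" by default does not.

```r
lst_demo <- list(alpha = 1:3, beta = "b")

as_list  <- lst_demo["alpha"]                # still a list (of length 1)
as_elem  <- lst_demo[["alpha"]]              # the integer vector itself
partial  <- lst_demo$al                      # "$" partially matches "alpha"
no_match <- lst_demo[["al"]]                 # NULL: "[[" needs the exact name
relaxed  <- lst_demo[["al", exact = FALSE]]  # partial matching switched on
via_fun  <- getElement(lst_demo, "alpha")    # same as lst_demo[["alpha"]]

class(as_list)
class(as_elem)
```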
Subsetting Atomic Vectors
• Subsetting operator is [, although [[ can also be used to select a single
element without its names attribute
• Index vector is put between subsetting operators.
vect1
[1] "a" "b" "c" "d" "e" NA
# Index vector: Elements that are not NA
!is.na(vect1)
[1] TRUE TRUE TRUE TRUE TRUE FALSE
Subsetting vectors (cont)
# Subset non-na values
vect1[!is.na(vect1)]
[1] "a" "b" "c" "d" "e"
# Subsetting with an empty index
tms[]
[1] 54 56 55 54 56 57 56 56 56 58 55 58
# Empty index useful for replacement while keeping attributes
set.seed(3)
tms[] <- sample(1:100, 12)
tms
[1] 17 80 38 32 58 96 12 28 54 95 47 45
attr(,"tsp")
[1] 1 12 1
Subsetting atomic elements (cont)
• Subsetting with “[[” returns without a names attribute
# Some of my favourite fruits
fruits <- c(Mangoes = 50, Apples = 35, Pineapples = 20)
fruits["Mangoes"]
Mangoes
50
fruits[["Mangoes"]]
[1] 50
Subsetting Matrices and Arrays
• Essentially atomic vectors with dimensions hence can be subset with [ and
[[
• Output is value occurring at given indices when all values are concatenated
• However, the best way to index these structures is by their dimension e.g.
[r, c] for 2 dim matrices and [r, c, l] meaning row, column, and layer for 3
dim arrays
• Example data set: R’s USPersonalExpenditure
Example data set
# One of R's data set
USPersonalExpenditure
1940 1945 1950 1955 1960
Food and Tobacco 22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health 3.530 5.760 9.71 14.0 21.10
Personal Care 1.040 1.980 2.45 3.4 5.40
Private Education 0.341 0.974 1.80 2.6 3.64
# Subsetting with an empty index
USPersonalExpenditure[]
1940 1945 1950 1955 1960
Food and Tobacco 22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health 3.530 5.760 9.71 14.0 21.10
Personal Care 1.040 1.980 2.45 3.4 5.40
Private Education 0.341 0.974 1.80 2.6 3.64
# Subsetting with one index
USPersonalExpenditure[5]
[1] 0.341
# Subsetting with dimensions
USPersonalExpenditure[1, ] # Subset 1st row, all columns
1940 1945 1950 1955 1960
22.2 44.5 59.6 73.2 86.8
USPersonalExpenditure[1, 1] # Subset 1st row, first column
[1] 22.2
USPersonalExpenditure[3, "1950"] # Subset 3rd row, column 3 "1950"
[1] 9.71
USPersonalExpenditure[, "1960"] # Subset an entire column, drops dimension
Food and Tobacco Household Operation Medical and Health
86.80 46.20 21.10
Personal Care Private Education
5.40 3.64
# Maintaining dimension
USPersonalExpenditure[, "1960", drop = FALSE]
1960
Food and Tobacco 86.80
Household Operation 46.20
Medical and Health 21.10
Personal Care 5.40
Private Education 3.64
dim(USPersonalExpenditure[, "1960", drop = FALSE])
[1] 5 1
Subsetting Data frames
• All subsetting operators ([, [[ and $) can be used
• As before, [ can select more than one element
• Both [[ and $ select one item; the difference is that $ cannot be used
with computed values like “i + 1” (index + 1)
• x$name is equivalent to x[["name", exact = FALSE]]
• Other than these operators, a convenient way to subset data
frames is with function subset()
Example data set: USArrests
# Viewing first 6 rows
head(USArrests)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
# Computing the average (median) of murder, assault and rape using "$"
avg_murder <- median(USArrests$Murder)
avg_assault <- median(USArrests$Assault)
avg_rape <- median(USArrests$Rape)
# Using "[" subset states with above average assault, murder and rape
high_crime <- USArrests[USArrests$Murder > avg_murder & USArrests$Assault > avg_assault & US
# Sort (by decreasing order for Murder) and output names of states
high_crime <- high_crime[order(high_crime$Murder, decreasing = TRUE),]
row.names(high_crime)
[1] "Georgia" "Florida" "Louisiana" "South Carolina"
[5] "Alabama" "Tennessee" "Texas" "Nevada"
[9] "Michigan" "New Mexico" "Maryland" "New York"
[13] "Illinois" "Alaska" "California" "Missouri"
[17] "Arizona" "Colorado"
# Subset a column without name attribute
high_crime[[1]]
[1] 17.4 15.4 15.4 14.4 13.2 13.2 12.7 12.2 12.1 11.4 11.3 11.1 10.4 10.0
[15] 9.0 9.0 8.1 7.9
# Or
USArrests[["Assault"]]
[1] 236 263 294 190 276 204 110 238 335 211 46 120 249 113 56 115 109
[18] 249 83 300 149 255 72 259 178 109 102 252 57 159 285 254 337 45
[35] 120 151 159 106 174 279 86 188 201 120 48 156 145 81 53 161
Subsetting with function “subset()”
• Function subset can be used to subset any vector, but most suitable for
data frames
• Here we will use it to subset high_crime states as we did before
• We use function with() to access variables without making reference to
data frame name
high_crime2 <- with(USArrests, subset(USArrests, Murder > avg_murder & Assault > avg_assault
high_crime2 <- high_crime2[order(high_crime2$Murder, decreasing = TRUE), ]
# Check both data sets are identical
identical(high_crime, high_crime2)
[1] TRUE
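subset() can also pick columns through its select argument. A small sketch on USArrests; the threshold of 15 is an arbitrary value chosen for illustration.

```r
# Rows with Murder above 15, keeping only two columns
violent <- subset(USArrests, Murder > 15, select = c(Murder, Assault))

# The same selection written with the "[" operator
violent2 <- USArrests[USArrests$Murder > 15, c("Murder", "Assault")]

all.equal(violent, violent2)
```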
Subsetting Lists
• List can be subset with all three subsetting operators
• Rule of thumb: subsetting with [ returns a list, while subsetting with [[
or $ outputs the same type as the element being subset, i.e. if a list has a data
frame, subsetting with [[ or $ will output a data frame
• Example data set: first 10 values of R’s “state.center” data set
Subsetting lists (cont)
# Example data
state.center; class(state.center)
$x
[1] -86.7509 -127.2500 -111.6250 -92.2992 -119.7730 -105.5130 -72.3573
[8] -74.9841 -81.6850 -83.3736
$y
[1] 32.5901 49.2500 34.2192 34.7336 36.5341 38.6777 41.5928 38.6777
[9] 27.8744 32.3329
[1] "list"
Subsetting lists (cont)
# Using `[` outputs a list
state.center[1]
$x
[1] -86.7509 -127.2500 -111.6250 -92.2992 -119.7730 -105.5130 -72.3573
[8] -74.9841 -81.6850 -83.3736
class(state.center[1])
[1] "list"
Subsetting lists (cont)
# Using `[[` outputs elements type
state.center[[1]]
[1] -86.7509 -127.2500 -111.6250 -92.2992 -119.7730 -105.5130 -72.3573
[8] -74.9841 -81.6850 -83.3736
class(state.center[[1]])
[1] "numeric"
Subsetting lists
# Using "$" outputs elements type
state.center$x
[1] -86.7509 -127.2500 -111.6250 -92.2992 -119.7730 -105.5130 -72.3573
[8] -74.9841 -81.6850 -83.3736
class(state.center$x)
[1] "numeric"
Using SQL statements to subset data frames
• Database semantics can sometimes be quite handy in subsetting, e.g. when a subset
has to meet certain conditions
• Core database statements are:
– SELECT
– FROM
– WHERE
– ORDER BY
Using SQL statements to subset data frames
• If interested, read a small introduction to SQL statements in R’s Data
Import/Export manual (4.2) or go online and learn from “www.sqlcourse.com”
• Discussing this here might take us out of scope, but it’s good to know it’s
possible in R using contributed packages like “sqldf” and “dplyr”.
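To relate the SQL clauses to base R, the sketch below writes one query as a comment and reproduces it step by step on USArrests. The query and the threshold of 15 are illustrative; the commented sqldf lines assume that the contributed sqldf package is installed and are not run here.

```r
# SQL being mimicked:
#   SELECT Murder, Assault FROM USArrests
#   WHERE Murder > 15 ORDER BY Murder DESC

keep <- USArrests$Murder > 15                        # WHERE
res  <- USArrests[keep, c("Murder", "Assault")]      # SELECT ... FROM
res  <- res[order(res$Murder, decreasing = TRUE), ]  # ORDER BY ... DESC
res

# Assuming the sqldf package is installed, the query string could be run as:
# library(sqldf)
# sqldf("SELECT Murder, Assault FROM USArrests WHERE Murder > 15 ORDER BY Murder DESC")
```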
Other functions useful for data sets
Function Description
str Compactly displays the internal structure of a data frame
head Prints first part, default is first 6 rows
tail Prints last part, default is last 6 rows
attach Put data frame on R’s search path hence variables are accessible without reference to data frame n
detach Remove data frame from R’s search path. Recommended after completion of task
Other useful functions (cont)
Function Description
with Recommended alternative to attach, makes it possible to run expressions/function on a data fram
which Locates indices of logical value TRUE. Used for indexing data frame elements
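A brief sketch of several of these functions applied to USArrests; the threshold of 15 is arbitrary.

```r
str(USArrests)                    # structure: 50 obs. of 4 variables

first6 <- head(USArrests)         # first 6 rows by default
last3  <- tail(USArrests, 3)      # last 3 rows

# with() evaluates an expression using the data frame's variables
avg_murder_with <- with(USArrests, mean(Murder))

# which() returns the row positions where the condition is TRUE
high_rows <- which(USArrests$Murder > 15)
row.names(USArrests)[high_rows]
```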
C (PPS)Programming for problem solving.pptxrohinitalekar1
 
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureRai University
 
Bca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structureBca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structureRai University
 
II B.Sc IT DATA STRUCTURES.pptx
II B.Sc IT DATA STRUCTURES.pptxII B.Sc IT DATA STRUCTURES.pptx
II B.Sc IT DATA STRUCTURES.pptxsabithabanu83
 
Mca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureMca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureRai University
 

Similar to R training3 (20)

R Programming.pptx
R Programming.pptxR Programming.pptx
R Programming.pptx
 
Data Exploration in R.pptx
Data Exploration in R.pptxData Exploration in R.pptx
Data Exploration in R.pptx
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
Data Types of R.pptx
Data Types of R.pptxData Types of R.pptx
Data Types of R.pptx
 
Basic of array and data structure, data structure basics, array, address calc...
Basic of array and data structure, data structure basics, array, address calc...Basic of array and data structure, data structure basics, array, address calc...
Basic of array and data structure, data structure basics, array, address calc...
 
arrays
arraysarrays
arrays
 
R_CheatSheet.pdf
R_CheatSheet.pdfR_CheatSheet.pdf
R_CheatSheet.pdf
 
Arrays 06.ppt
Arrays 06.pptArrays 06.ppt
Arrays 06.ppt
 
Array
ArrayArray
Array
 
Array i imp
Array  i impArray  i imp
Array i imp
 
Data types in r
Data types in rData types in r
Data types in r
 
Ggplot2 v3
Ggplot2 v3Ggplot2 v3
Ggplot2 v3
 
C (PPS)Programming for problem solving.pptx
C (PPS)Programming for problem solving.pptxC (PPS)Programming for problem solving.pptx
C (PPS)Programming for problem solving.pptx
 
Unit4 Slides
Unit4 SlidesUnit4 Slides
Unit4 Slides
 
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structure
 
Bca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structureBca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structure
 
DataStructures.pptx
DataStructures.pptxDataStructures.pptx
DataStructures.pptx
 
II B.Sc IT DATA STRUCTURES.pptx
II B.Sc IT DATA STRUCTURES.pptxII B.Sc IT DATA STRUCTURES.pptx
II B.Sc IT DATA STRUCTURES.pptx
 
Session 4
Session 4Session 4
Session 4
 
Mca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureMca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structure
 

More from Hellen Gakuruh

Prelude to level_three
Prelude to level_threePrelude to level_three
Prelude to level_threeHellen Gakuruh
 
SessionThree_IntroductionToVersionControlSystems
SessionThree_IntroductionToVersionControlSystemsSessionThree_IntroductionToVersionControlSystems
SessionThree_IntroductionToVersionControlSystemsHellen Gakuruh
 
Introduction_to_Regular_Expressions_in_R
Introduction_to_Regular_Expressions_in_RIntroduction_to_Regular_Expressions_in_R
Introduction_to_Regular_Expressions_in_RHellen Gakuruh
 
SessionTen_CaseStudies
SessionTen_CaseStudiesSessionTen_CaseStudies
SessionTen_CaseStudiesHellen Gakuruh
 
SessionNine_HowandWheretoGetHelp
SessionNine_HowandWheretoGetHelpSessionNine_HowandWheretoGetHelp
SessionNine_HowandWheretoGetHelpHellen Gakuruh
 
SessionEight_PlottingInBaseR
SessionEight_PlottingInBaseRSessionEight_PlottingInBaseR
SessionEight_PlottingInBaseRHellen Gakuruh
 
SessionSeven_WorkingWithDatesandTime
SessionSeven_WorkingWithDatesandTimeSessionSeven_WorkingWithDatesandTime
SessionSeven_WorkingWithDatesandTimeHellen Gakuruh
 
SessionSix_TransformingManipulatingDataObjects
SessionSix_TransformingManipulatingDataObjectsSessionSix_TransformingManipulatingDataObjects
SessionSix_TransformingManipulatingDataObjectsHellen Gakuruh
 
SessionFive_ImportingandExportingData
SessionFive_ImportingandExportingDataSessionFive_ImportingandExportingData
SessionFive_ImportingandExportingDataHellen Gakuruh
 
SessionFour_DataTypesandObjects
SessionFour_DataTypesandObjectsSessionFour_DataTypesandObjects
SessionFour_DataTypesandObjectsHellen Gakuruh
 
SessionTwo_MakingFunctionCalls
SessionTwo_MakingFunctionCallsSessionTwo_MakingFunctionCalls
SessionTwo_MakingFunctionCallsHellen Gakuruh
 

More from Hellen Gakuruh (20)

R training6
R training6R training6
R training6
 
R training5
R training5R training5
R training5
 
R training4
R training4R training4
R training4
 
R training
R trainingR training
R training
 
Prelude to level_three
Prelude to level_threePrelude to level_three
Prelude to level_three
 
Prelude to level_two
Prelude to level_twoPrelude to level_two
Prelude to level_two
 
SessionThree_IntroductionToVersionControlSystems
SessionThree_IntroductionToVersionControlSystemsSessionThree_IntroductionToVersionControlSystems
SessionThree_IntroductionToVersionControlSystems
 
Day 2
Day 2Day 2
Day 2
 
Day 1
Day 1Day 1
Day 1
 
Introduction_to_Regular_Expressions_in_R
Introduction_to_Regular_Expressions_in_RIntroduction_to_Regular_Expressions_in_R
Introduction_to_Regular_Expressions_in_R
 
SessionTen_CaseStudies
SessionTen_CaseStudiesSessionTen_CaseStudies
SessionTen_CaseStudies
 
webScrapingFunctions
webScrapingFunctionswebScrapingFunctions
webScrapingFunctions
 
SessionNine_HowandWheretoGetHelp
SessionNine_HowandWheretoGetHelpSessionNine_HowandWheretoGetHelp
SessionNine_HowandWheretoGetHelp
 
SessionEight_PlottingInBaseR
SessionEight_PlottingInBaseRSessionEight_PlottingInBaseR
SessionEight_PlottingInBaseR
 
SessionSeven_WorkingWithDatesandTime
SessionSeven_WorkingWithDatesandTimeSessionSeven_WorkingWithDatesandTime
SessionSeven_WorkingWithDatesandTime
 
SessionSix_TransformingManipulatingDataObjects
SessionSix_TransformingManipulatingDataObjectsSessionSix_TransformingManipulatingDataObjects
SessionSix_TransformingManipulatingDataObjects
 
Files
FilesFiles
Files
 
SessionFive_ImportingandExportingData
SessionFive_ImportingandExportingDataSessionFive_ImportingandExportingData
SessionFive_ImportingandExportingData
 
SessionFour_DataTypesandObjects
SessionFour_DataTypesandObjectsSessionFour_DataTypesandObjects
SessionFour_DataTypesandObjects
 
SessionTwo_MakingFunctionCalls
SessionTwo_MakingFunctionCallsSessionTwo_MakingFunctionCalls
SessionTwo_MakingFunctionCalls
 

R training3

  • 1. Introduction to Data Analysis and Graphics in R Introduction to Data Analysis and Graphics in R Hellen Gakuruh 2017-03-31 Session three Data Entry, Management and Manipulation in R Outline n • Creating a dataset • Understanding datasets • Data input • Useful functions for working with datasets • Creating new variables • Recording and renaming variables n • Missing and date values • Type conversions • Sorting data • Merging datasets • Subsetting datasets • Using SQL statements to subset dataframes Creating a dataset • Data sets can be created for any of R’s data structure i.e. dimensionless vector, 1 dim vector, matrix, array, data frame or list • There are two way to create a data set: 1
  • 2. – Using spreadsheet like data editor – By coding then in Invoking spreadsheet-like data editor in R • R has four functions to invoke a spreadsheet-like data editor, these are: – edit() – fix() – data.entry(), and – dataentry() Note on using spreadsheet-like data editors in R • Using these function’s goes against R’s core functionality; program- ming/coding • Not a recommended way as it looses on documentation/reproducibility Coding in data • To code in data, function scan() can be quite handy in addition to calling functions for any of the data structures; c() for vector, matrix(), array(), data.frame(), and list() • scan() is also not a good data entry process as it looses on reproducibility as data is entered interactively (console) Understanding datasets • It can be a single variable or multiple variables • In R, a single variable can be a dimensional vector (created with “c()”) or a 1 dim array • For multiple variables, if they are all of the same type (especially if numeric), then matrix is a better data structure other wise for multiple types with same length data frame is ideal Understanding datasets • If data is of different length and type, generic lists are appropriate. • Lists can also be used to store different data sets for a particular project as well as accompanying source code/function 2
  • 3. Data input We will look at: • Spreadsheet data entry using "data.entry()" • Using "scan()" • Coding in data using data structure functions c(), matrix(), array(), data.frame(), and list() Spreadsheet data entry • First, need to have variables or list of variables for data entry • Then Call data entry • From pop-up data editor, click on individual cell and enter data • Variable names can be changed from data entry Spreadsheet data entry demonstration Data entry using “scan” function • Can be used to input 1 dim atomic vectors • Values entered interactively (on console) if file is not give • For each entry, type value and click enter, after last value click enter and entry mode will be exited • Important to assign to variable name and specify type if it not “double”; dataset2 <- scan(what = "character") Demonstration on data input using function “scan” Data entry using data structure functions • Recommended way to generate data in R (ideally small data) • Data structure function include: – c() for atomic vectors – matrix() for matrices – array() for 1 or more dimension arrays – data.frame for data frames – list() for lists Data entry using c() • Used to create individual variables of any type as long as all elements are of the same type e.g all logical or all character 3
  • 4. # An integer vector num <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # same as 1:10 # A logical vector logi <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE) Data entry using function c() # A character vector R_authours <- c("Douglas Bates", "John Chambers", "Peter Dalgaard", "Seth Falcon", "Robert G Data entry using “matrix()” • 2 dimensional vectors (store data as rows and columns) • Primarily created with function matrix() but rbind(), cbind() and as.matrix() can be used to convert other vectors to a matrix • Function matrix() can be called without any input thus creating an empty matrix • Argument “dimnames” can only be NULL (nothing) or a list Data entry using “matrix()” mat1 <- matrix(data = 1:9, nrow = 3, dimnames = list(NULL, c("a", "b", "c"))) mat1 a b c [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 Data entry using “array()” • Multi-dimensional structures (1 > dims), but often used for 3 dim structures • Can only be used with one data type • Matrices are special form of these data structures (have 2 dims) • Primarily created with function array() “array()” (cont) n dims <- list(1:3, c("a", "b", "c"), c("Yes", "No")) arry <- array(data = seq(1, 9*2), dim = c(3, 3, 2), dimnames = dims) 4
  • 5. arry , , Yes a b c 1 1 4 7 2 2 5 8 3 3 6 9 , , No a b c 1 10 13 16 2 11 14 17 3 12 15 18 Data entry using ‘data.frame()‘ • Similar to matrices except they can contain different types of data as long as they have the same length (number of elements) • Though resemble matrices, they are actually list of vectors • Columns contain measurements on one variable and rows contain cases • Primarily created by data.frame() data.frame() # Example of weight loss data set dataset3 <- data.frame(ID = 1:5, Exercise = c(TRUE, TRUE, FALSE, TRUE, FALSE), Height = c(5. dataset3 ID Exercise Height Weight 1 1 TRUE 5.2 69 2 2 TRUE 4.9 72 3 3 FALSE 5.1 75 4 4 TRUE 5.2 67 5 5 FALSE 5.4 77 Data entry using “list()” • A bit unique as not many statistical programs have similar data structure • A sort of “carry-all” data structure • Can also contain sub-list thus referred to as recursive • Primarily created by list() 5
  • 6. “lists()” lst <- list(vect = 5:9, Matrix = mat1, Array = arry, Dataframe = dataset3, List = list("a", str(lst) List of 5 $ vect : int [1:5] 5 6 7 8 9 $ Matrix : int [1:3, 1:3] 1 2 3 4 5 6 7 8 9 ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL .. ..$ : chr [1:3] "a" "b" "c" $ Array : int [1:3, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ... ..- attr(*, "dimnames")=List of 3 .. ..$ : chr [1:3] "1" "2" "3" .. ..$ : chr [1:3] "a" "b" "c" .. ..$ : chr [1:2] "Yes" "No" $ Dataframe:'data.frame': 5 obs. of 4 variables: ..$ ID : int [1:5] 1 2 3 4 5 ..$ Exercise: logi [1:5] TRUE TRUE FALSE TRUE FALSE ..$ Height : num [1:5] 5.2 4.9 5.1 5.2 5.4 ..$ Weight : num [1:5] 69 72 75 67 77 $ List :List of 2 ..$ : chr "a" ..$ : int [1:2] 2 3 R’s objects and properties • Everything in R is referred to as an object from data structures to functions and all objects have two types of attributes: – Mode and – Length • Mode is the basic type of an object’s core constituent • Length is extent or number of elements in an object • Function mode() and length() are used to establish mode and length of an object Establishing basic composition of objects n mode(num) [1] "numeric" mode(mat1) 6
  • 7. [1] "numeric" mode(arry) [1] "numeric" n mode(dataset3) [1] "list" mode(lst) [1] "list" Establishing length of an object # Atomic vector length(num) [1] 10 # Matrix length(mat1) [1] 9 Establishing length of an object • Length is not the best attribute for assessing a matrix or an array, “dim” is more appropriate mat1; dim(mat1) a b c [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 [1] 3 3 Establishing length of R objects (cont) # Data frames length(dataset3) # This shows number of variables not cases/rows [1] 4 7
  • 8. # Lists length(lst) [1] 5 Difference between typeof(), mode() and storage.mode() • There are 3 functions for checking basic constituents of an object, these are: – mode() which is an S compatible function for checking type – storage.mode() which is used for compatability when calling func- tions written in other languages (ensures data is of expected type) – typeof() which is basically an R’s implementation of S’s mode() Selecting between typeof(), mode(), and storage.mode() • Which function should be used? Depends on why, – If it’s just a general query, then typeof() is adequate – If working with other S objects, use mode() – If calling functions written in other languages, use storage.mode() Other Attributes • Attributes are basically meta data about an object in R • All objects (except NULL) can have at least two or more attributes • Attributes are stored as a pairlist i.e. name=value • List of all attributes for an object are given by attributes() • Individual attributes are given by attr() Other attributes (cont) • Other than mode and length other often used attributes are: – Names – Dimensions (dim) – Dimnames – Classes, and – Time series Names Attribute • Used to name individual elements of a data object 8
  • 9. • They are not mandatory, but quite handy when indexing element • Accessed with name() and set with name(object) <- • colnames() is used for matrix-like objects Querying and setting element’s names # Creating an unnamed vector vect1 <- c(12, 54, 98) names(vect1) NULL # Naming vector elements names(vect1) <- c("a", "b", "c") names(vect1) [1] "a" "b" "c" Naming elements (cont) n # An unnamed matrix mat3 <- matrix(1:9, 3) mat3 [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 n # Naming a matrix colnames(mat3) <- c("a", "b", "c") mat3 a b c [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 Dimensions Attribute • There at least two data structures with dimension attribute, these are arrays (including matrices) and data frames 9
  • 10. • Function dim() is used to query an objects dimension and dim <- used to set dimension to an object • There is a difference between an atomic vector and a 1 dim array; latter has a dim attribute while former does not • Giving a vector dimensions changes it’s data structure from a vector to an array Dim attribute (cont) n # An atomic vector (dimensionless) vect2 <- 1 vect2 [1] 1 dim(vect2) NULL n # Converting to 1 dimension array dim(vect2) <- 1 vect2 [1] 1 dim(vect2) [1] 1 Dimnames Attribute • Gives names to dimensions • Like “dim” attribute, “dimnames” attributes are given to vectors with dim attribute like matrices, array and data frames • Dimnames are given as a list of names (same lenth as “dim(x)”) Quering and setting dimnames n # Matrix with no dimnames vect3 <- matrix(1:9, 3) vect3 10
  • 11. [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 n # Adding dimnames dimnames(vect3) <- list(1:3, c("a", "b", "c")) vect3 a b c 1 1 4 7 2 2 5 8 3 3 6 9 Classes Attribute • Class attribute is a special type of information used functions called “methods” • Used to determine how an object should be handled/acted upon • All objects have an intrinsic class attribute which is basically it’s data type, but other classes can be added to an object Class attribute (cont) • Classes are character vectors accessed and added with function class() and class <- respectively or attr(obj, class) • When a class is added to an object, that object is called an s3 object. This makes it part of R’s Object Oriented Programming (OOP) Quering and adding class attribute n # Intrinsic class attribute vect <- 1:5; class(vect) [1] "integer" # (Assigned) Class attribute attr(vect, "class") NULL n 11
  • 12. # Add class with either attr(vect, which = "class") <- "myclass" # OR class(vect) <- "myclass" # Query class attribute attr(vect, "class") [1] "myclass" Useful note on adding class attribute • Adding classes has it’s implications as far as “method dispatch” (selection of suitable function) is concerned • For example, changing from intrinsic class “numeric”to “myclass” means function/methods for “myclass” if found, will be applied first • Basically when a generic function such as “plot()” or “mean()” are called, they will look for functions suitable for first listed class in a class vector, it is not until all classes are are checked that a method for it’s intrisic class is dispatched Time series Attribute • Used for data with time dimensionality like timely, daily, weekly, monthly, quarterly or annual data • Created by adding a “tsp” attribute • It ensures time series parameters such as “start”, “end”, and “frequency” are kept and • For compartability with S version 2 Tsp Attribute n # Random annual data set.seed(28) tms <- round(rnorm(12, 56)) tms [1] 54 56 55 54 56 57 56 56 56 58 55 58 attributes(tms) NULL n 12
  • 13.
# Adding attribute `tsp`
tsp(tms) <- c(start = 1, end = 12, frequency = 1)
attributes(tms)
$tsp
[1]  1 12  1
R’s data sets
• R has a number of built-in data sets
• Full documentation: help(package = datasets)
• Currently there are 104 data sets
• Out of these:
Data Structure                  Number
Array                                1
Character (1 dim vector)             2
Data frame                          46
Dist (Distance Matrix)               2
Factor (1 dim integer vector)        2
List                                 4
Matrix                               8
Numeric (1 dim vector)               6
Table (Atomic vectors)               5
ts                                  28
Conditional Statements
• Used to check whether certain conditions are met by some data, like observations above a certain value
• They are also called control structures. They include:
– if-else
– ifelse
– for
Conditional Statements (cont)
• Others are
– while
  • 14.
– repeat
– break
– next, and
– switch
• We discuss the frequently used control structures, that is if-else, ifelse and for
if-else()
• Used to check whether a condition evaluates to TRUE, and if so an action is performed
• It can be extended with alternative action(s) using “else if” or “else”
• When an “else” statement is given, it must be on the same line as the closing brace of the if block
• Example: check whether a vector has intrinsic type “character”; if it does, convert it to a factor, else leave it as it is
• Note: if-else requires the condition to evaluate to a single logical value, either TRUE or FALSE
if-else() example
x <- c("a", "b", "c")
class(x)
[1] "character"
if (is.character(x)) {
  x <- as.factor(x)
} else {
  x
}
class(x)
[1] "factor"
ifelse() control structure
• Used when the condition evaluates to a logical vector of length > 1
• Excellent for recoding variables, hence an example is given under “Recoding variables”
for() control structure
• for() is a looping structure used to perform repetitive tasks
  • 15.
• Though this is a frequently used construct in most programming languages, in R there are often more concise alternatives, such as the apply family of functions
• for iterates over a sequence, performing the action defined in its body (the body of any function, including a for loop, is what is in between {})
• As a simple example, let’s say Hello five times
for() loop example
for (i in 1:5) {     # variable "i" is a counter (counting from 1 to 5)
  cat("Hello\n")     # function "cat" is used to print to console
}
Hello
Hello
Hello
Hello
Hello
Recoding variables
• Recoding a variable means changing its values
• It is often recommended to create a new variable instead of overwriting the original variable
• Example:
– Create a dichotomous recoded variable from the “feed” variable of the “chickwts” data set
– This variable will have values “casein” and “other” (this is something often done during analysis)
Recoding variables (cont)
# Structure of the data set
str(chickwts)
'data.frame': 71 obs. of 2 variables:
 $ weight: num 179 160 136 227 217 168 108 124 143 140 ...
 $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
# Current categories of variable of interest (feed)
levels(chickwts$feed)
[1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"
  • 16.
Recoding variables (cont)
# Recoding with function "ifelse"
chickwts$feed2 <- ifelse(chickwts$feed == "casein", yes = "casein", no = "other")
# Converting to a factor vector
chickwts$feed2 <- factor(chickwts$feed2)
# New levels
levels(chickwts$feed2)
[1] "casein" "other"
Renaming variables
• Using base R, variables in a data frame are renamed with the names()<- replacement function, either by supplying all variable names at once or by indexing names() to change a single one
• For example, to rename “feed2” from the previous slide:
# Current names
names(chickwts)
[1] "weight" "feed" "feed2"
Renaming variables (cont)
# Renaming variables (here all names are supplied)
names(chickwts) <- c("weight", "feed", "casein")
names(chickwts)
[1] "weight" "feed" "casein"
Missing values
• In R, denoted with the logical value “NA”
• Many operations cannot be performed when there are missing values
• is.na() is used to check for missing values
• If negated with “!” in front, it flags the non-missing values instead
• For matrices and data frames, complete.cases() might be more appropriate
Missing values (cont)
  • 17.
# Vector with a missing value
vect1 <- c(letters[1:5], NA); vect1
[1] "a" "b" "c" "d" "e" NA
# A logical vector checking for missing values
is.na(vect1)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
Missing values: complete cases for matrices
vect2 <- letters[1:6]
mat3 <- rbind(vect1, vect2)
mat3
      [,1] [,2] [,3] [,4] [,5] [,6]
vect1 "a"  "b"  "c"  "d"  "e"  NA
vect2 "a"  "b"  "c"  "d"  "e"  "f"
complete.cases(mat3)
[1] FALSE  TRUE
• Output indicates the first row/case is not complete but the second is complete
Date values
• Initially imported or created as numeric or character vectors
• Conversion to a date class (“Date”) or a date/time class (POSIXlt/POSIXct) depends on whether they are character or numeric
• One way to convert a character vector to a date object is with function as.Date(), specifying argument format as detailed by ?strftime
• as.Date() can also be used to convert a numeric vector to a date object by specifying argument origin; R counts days from “1970-01-01”
Date values (cont)
# Converting a character vector
date1char <- c("3/6/2017", "3/7/2017", "4/7/2017")
class(date1char)
  • 18.
[1] "character"
date1 <- as.Date(date1char, format = "%m/%d/%Y")
Date values (cont)
date1
[1] "2017-03-06" "2017-03-07" "2017-04-07"
class(date1)
[1] "Date"
Date values (cont)
# Converting a numeric vector (days since the origin)
date1num <- c(17231, 17232, 17263)
class(date1num)
[1] "numeric"
date2 <- as.Date(date1num, origin = "1970-01-01")
Date values (cont)
date2
[1] "2017-03-06" "2017-03-07" "2017-04-07"
class(date2)
[1] "Date"
Conversion between data types
• To convert from one data type to another, use the as.<data_type>() functions: as.logical(), as.integer(), as.double(), as.character(), as.raw(), and as.complex()
• But the value must be convertible, e.g.
– Logical can always be converted to character, but converting character to logical results in NA unless the string is one of “TRUE”/“FALSE”, “true”/“false”, “True”/“False” or “T”/“F”
– Character strings that do not represent numbers cannot be converted to integer or double; they become NA with a warning
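A short sketch of the conversion rules listed above. The input values are made up for illustration; suppressWarnings() is used only to keep the expected "NAs introduced by coercion" warning out of the console.

```r
# Logical <-> character
as.character(c(TRUE, FALSE))            # "TRUE" "FALSE"
as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA ("yes" is not recognised)

# Character -> number works only when the string represents a number
nums <- suppressWarnings(as.integer(c("12", "7", "twelve")))
nums                                    # 12 7 NA
as.double("3.14")                       # 3.14

# Double -> integer drops the fractional part (truncates towards zero)
as.integer(c(2.9, -2.9))                # 2 -2
```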
  • 19.
Sorting data
• Sorting an atomic vector is done with sort()
• Sorting a data frame is done with order()
• Matrices are actually atomic vectors with dimensions, hence their rows or columns are sorted with the looping function apply()
• By default sorting is done in increasing order; this can be reversed by setting argument “decreasing” to TRUE
• Logical values are ordered according to their integer form, i.e. TRUE = 1, FALSE = 0 (TRUE > FALSE)
Sorting vectors
# Unsorted random numbers
set.seed(58)
tosort <- round(rnorm(10, 87, 10))
tosort
[1] 83 91 97 80 81 68 84 92 106 96
# Sorted vector (increasing)
sort(tosort)
[1] 68 80 81 83 84 91 92 96 97 106
# Sorted vector (decreasing)
sort(tosort, decreasing = TRUE)
[1] 106 97 96 92 91 84 83 81 80 68
Sorting Matrices
mat2sort <- matrix(tosort[-1], 3, dimnames = list(1:3, c("a", "b", "c")))
mat2sort
   a  b   c
1 91 81  92
2 97 68 106
3 80 84  96
  • 20.
# Sort by columns of a matrix
apply(mat2sort, 2, sort)
      a  b   c
[1,] 80 68  92
[2,] 91 81  96
[3,] 97 84 106
Sorting Data frames
set.seed(3)
v1 <- round(rnorm(9, 50, 10))
set.seed(3)
v2 <- round(rnorm(9, 90))
set.seed(3)
logi <- sample(c(TRUE, FALSE), 9, TRUE, c(0.7, 0.3))
df1 <- data.frame(Logi = logi, V1 = v1, V2 = v2)
Sorting Data frames (cont)
# Sorted by first variable "Logi"
df1[order(df1$Logi, decreasing = TRUE), ]
   Logi V1 V2
1  TRUE 40 89
3  TRUE 53 90
4  TRUE 38 89
5  TRUE 52 90
6  TRUE 50 90
7  TRUE 51 90
8  TRUE 61 91
9  TRUE 38 89
2 FALSE 47 90
Sorting data frames by more than one variable
• Sorting by more than one variable is done first on the first listed variable, then on the second, and so on.
• Example:
– Sort variable “Logi” in a decreasing manner (TRUE first)
– Then sort variable “V1” in a decreasing manner
  • 21.
Sorting data frames example
df1[order(df1$Logi, df1$V1, decreasing = TRUE), ]
   Logi V1 V2
8  TRUE 61 91
3  TRUE 53 90
5  TRUE 52 90
7  TRUE 51 90
6  TRUE 50 90
1  TRUE 40 89
4  TRUE 38 89
9  TRUE 38 89
2 FALSE 47 90
Sorting by both decreasing and ascending order
# Negative sign used to indicate decreasing order (Logi decreasing, V1 increasing)
df1[order(-df1$Logi, df1$V1), ]
   Logi V1 V2
4  TRUE 38 89
9  TRUE 38 89
1  TRUE 40 89
6  TRUE 50 90
7  TRUE 51 90
5  TRUE 52 90
3  TRUE 53 90
8  TRUE 61 91
2 FALSE 47 90
Merging data sets
• Merging is done on common (intersecting) columns
• Database join semantics can be used
• Core considerations for merging
– By default, merging is done on intersect(names(x), names(y))
– Otherwise specific columns in each data set can be given, especially if they do not have the same name or capitalization
Merging data frames
# Additional data set
dataset4 <- data.frame(ID = 6:10, Exercise = c(TRUE, FALSE, TRUE, TRUE, FALSE), Height = c(5
  • 22.
# Similar columns to be used for merging
intersect(names(dataset3), names(dataset4))
[1] "ID" "Exercise" "Height" "Weight"
Merging data frames
# Merging (adding cases)
merge(dataset3, dataset4, all = TRUE)
   ID Exercise Height Weight
1   1     TRUE    5.2     69
2   2     TRUE    4.9     72
3   3    FALSE    5.1     75
4   4     TRUE    5.2     67
5   5    FALSE    5.4     77
6   6     TRUE    5.4     77
7   7    FALSE    5.4     74
8   8     TRUE    5.2     75
9   9     TRUE    5.6     79
10 10    FALSE    5.4     82
Subsetting data sets
Look at:
• Indexing
• Subsetting/extracting operators
• Subsetting different data objects
Indexing
• Index vectors are used to access elements of different data objects; they include:
– Logical vectors
– Positive integers
– Negative integers, and
– Character vectors
• Note: It’s also possible to use an empty index (nothing between the brackets)
Indexing (cont)
• Logical vectors select elements which evaluate to TRUE
  • 23.
• Positive integers select elements at given positions
• Negative integers exclude elements at given positions
• Character indices are only appropriate for named elements
• An empty index selects all values; it is used to replace all entries while keeping the object’s attributes
Subsetting/Extracting operators
• There are three extracting operators and one extracting function
– [
– [[
– $, and
– getElement()
Subsetting/Extracting operators (cont)
• "[" can select more than one element and keeps their names if present, while "[[" and "$" can only select one element, without its name
• "$" is only applicable to recursive objects (generic/list data structures), basically data frames and lists
• The "getElement()" function is similar to extracting with "[["
Subsetting Atomic Vectors
• The subsetting operator is [, although [[ can also be used to select a single element without its names attribute
• The index vector is put between the subsetting brackets.
vect1
[1] "a" "b" "c" "d" "e" NA
# Index vector: elements that are not NA
!is.na(vect1)
[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
Subsetting vectors (cont)
# Subset non-NA values
vect1[!is.na(vect1)]
  • 24.
[1] "a" "b" "c" "d" "e"
# Subsetting with an empty index
tms[]
[1] 54 56 55 54 56 57 56 56 56 58 55 58
# Empty index useful for replacement while keeping attributes
set.seed(3)
tms[] <- sample(1:100, 12)
tms
[1] 17 80 38 32 58 96 12 28 54 95 47 45
attr(,"tsp")
[1]  1 12  1
Subsetting atomic elements (cont)
• Subsetting with “[[” returns the element without its names attribute
# Some of my favourite fruits
fruits <- c(Mangoes = 50, Apples = 35, Pineapples = 20)
fruits["Mangoes"]
Mangoes
     50
fruits[["Mangoes"]]
[1] 50
Subsetting Matrices and Arrays
• These are essentially atomic vectors with dimensions, hence they can be subset with [ and [[
• With a single index, the output is the value occurring at that position when all values are concatenated (column by column)
• However, the best way to index these structures is by their dimensions, e.g. [r, c] for 2 dim matrices and [r, c, l], meaning row, column, and layer, for 3 dim arrays
• Example data set: R’s USPersonalExpenditure
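The difference between a single index and a dimension index can be sketched on a small matrix. The objects m and a below are hypothetical toy data created for this illustration only.

```r
# Hypothetical 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2, dimnames = list(c("r1", "r2"), c("a", "b", "c")))

m[4]           # single index: 4th value of the underlying vector
m[2, 2]        # the same cell addressed by row and column
m["r2", "b"]   # the same cell addressed by its dimnames

# Position of cell [r, c] in the underlying vector
r <- 2; col_idx <- 2
(col_idx - 1) * nrow(m) + r

# 3 dim array: [row, column, layer]
a <- array(1:24, dim = c(2, 3, 4))
a[1, 2, 3]
```

All three ways of addressing the cell in m return 4, and the arithmetic on r and col_idx reproduces the single index used in the first call.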
  • 25.
Example data set
# One of R's data sets
USPersonalExpenditure
                      1940   1945  1950 1955  1960
Food and Tobacco    22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health   3.530  5.760  9.71 14.0 21.10
Personal Care        1.040  1.980  2.45  3.4  5.40
Private Education    0.341  0.974  1.80  2.6  3.64
# Subsetting with an empty index
USPersonalExpenditure[]
                      1940   1945  1950 1955  1960
Food and Tobacco    22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health   3.530  5.760  9.71 14.0 21.10
Personal Care        1.040  1.980  2.45  3.4  5.40
Private Education    0.341  0.974  1.80  2.6  3.64
# Subsetting with one index
USPersonalExpenditure[5]
[1] 0.341
# Subsetting with dimensions
USPersonalExpenditure[1, ] # Subset 1st row, all columns
1940 1945 1950 1955 1960
22.2 44.5 59.6 73.2 86.8
USPersonalExpenditure[1, 1] # Subset 1st row, first column
[1] 22.2
USPersonalExpenditure[3, "1950"] # Subset 3rd row, column 3 ("1950")
[1] 9.71
USPersonalExpenditure[, "1960"] # Subset an entire column, drops dimension
   Food and Tobacco Household Operation  Medical and Health
              86.80               46.20               21.10
      Personal Care   Private Education
               5.40                3.64
# Maintaining dimension
USPersonalExpenditure[, "1960", drop = FALSE]
                     1960
Food and Tobacco    86.80
  • 26.
Household Operation 46.20
Medical and Health  21.10
Personal Care        5.40
Private Education    3.64
dim(USPersonalExpenditure[, "1960", drop = FALSE])
[1] 5 1
Subsetting Data frames
• All subsetting operators ([, [[ and $) can be used
• As before, [ can select more than one element
• Both [[ and $ select one item; the difference is that $ cannot be used with computed values like “i + 1” (index + 1)
• x$name is equivalent to x[["name", exact = FALSE]]
• Other than these operators, a more convenient way to subset data frames is with function subset()
Example data set: USArrests
# Viewing first 6 rows
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
# Computing the median (used here as the average) of murder, assault and rape using "$"
avg_murder <- median(USArrests$Murder)
avg_assault <- median(USArrests$Assault)
avg_rape <- median(USArrests$Rape)
# Using "[" subset states with above average assault, murder and rape
high_crime <- USArrests[USArrests$Murder > avg_murder & USArrests$Assault > avg_assault & US
# Sort (by decreasing order for Murder) and output names of states
high_crime <- high_crime[order(high_crime$Murder, decreasing = TRUE), ]
row.names(high_crime)
 [1] "Georgia"  "Florida"    "Louisiana"  "South Carolina"
 [5] "Alabama"  "Tennessee"  "Texas"      "Nevada"
 [9] "Michigan" "New Mexico" "Maryland"   "New York"
[13] "Illinois" "Alaska"     "California" "Missouri"
[17] "Arizona"  "Colorado"
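The differences between [[ and $ listed above can be checked directly on USArrests. This is a small sketch; the helper variables i and col are introduced only for this illustration.

```r
# Uses R's built-in USArrests data frame
i <- 1
identical(USArrests[[i + 1]], USArrests$Assault)   # [[ accepts a computed index

col <- "Rape"
identical(USArrests[[col]], USArrests$Rape)        # [[ accepts a name held in a variable
is.null(USArrests$col)                             # $ looks for a column literally called "col"

# $ allows partial matching of names; [[ is exact unless told otherwise
identical(USArrests$Mur, USArrests$Murder)
is.null(USArrests[["Mur"]])
identical(USArrests[["Mur", exact = FALSE]], USArrests$Murder)
```

Every line evaluates to TRUE, which is why [[ is usually the safer choice inside functions and loops, where the column is chosen programmatically.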
  • 27.
# Subset a column without its names attribute
high_crime[[1]]
 [1] 17.4 15.4 15.4 14.4 13.2 13.2 12.7 12.2 12.1 11.4 11.3 11.1 10.4 10.0
[15] 9.0 9.0 8.1 7.9
# Or
USArrests[["Assault"]]
 [1] 236 263 294 190 276 204 110 238 335 211 46 120 249 113 56 115 109
[18] 249 83 300 149 255 72 259 178 109 102 252 57 159 285 254 337 45
[35] 120 151 159 106 174 279 86 188 201 120 48 156 145 81 53 161
Subsetting with function “subset()”
• Function subset() can be used to subset any vector, but it is most suitable for data frames
• Here we will use it to subset high_crime states as we did before
• We use function with() to access variables without making reference to the data frame name
high_crime2 <- with(USArrests, subset(USArrests, Murder > avg_murder & Assault > avg_assault
high_crime2 <- high_crime2[order(high_crime2$Murder, decreasing = TRUE), ]
# Check both data sets are identical
identical(high_crime, high_crime2)
[1] TRUE
Subsetting Lists
• Lists can be subset with all three subsetting operators
• Rule of thumb: subsetting with [ returns a list, while subsetting with [[ or $ outputs the same type as the element being subset, i.e. if a list holds a data frame, subsetting with [[ or $ will output a data frame
• Example data set: first 10 values of R’s “state.center” data set
Subsetting lists (cont)
# Example data
state.center; class(state.center)
$x
[1] -86.7509 -127.2500 -111.6250 -92.2992 -119.7730 -105.5130 -72.3573
[8] -74.9841 -81.6850 -83.3736
  • 28.
$y
[1] 32.5901 49.2500 34.2192 34.7336 36.5341 38.6777 41.5928 38.6777
[9] 27.8744 32.3329
[1] "list"
Subsetting lists (cont)
# Using `[` outputs a list
state.center[1]
$x
[1] -86.7509 -127.2500 -111.6250 -92.2992 -119.7730 -105.5130 -72.3573
[8] -74.9841 -81.6850 -83.3736
class(state.center[1])
[1] "list"
Subsetting lists (cont)
# Using `[[` outputs the element's type
state.center[[1]]
[1] -86.7509 -127.2500 -111.6250 -92.2992 -119.7730 -105.5130 -72.3573
[8] -74.9841 -81.6850 -83.3736
class(state.center[[1]])
[1] "numeric"
Subsetting lists (cont)
# Using "$" outputs the element's type
state.center$x
[1] -86.7509 -127.2500 -111.6250 -92.2992 -119.7730 -105.5130 -72.3573
[8] -74.9841 -81.6850 -83.3736
class(state.center$x)
[1] "numeric"
Using SQL statements to subset data frames
• Database semantics can sometimes be quite handy in subsetting, e.g. when a subset has to meet certain conditions
  • 29.
• Core database statements are:
– SELECT
– FROM
– WHERE
– ORDER BY
Using SQL statements to subset data frames (cont)
• If interested, read a small introduction to SQL statements in R’s Data Import/Export manual (4.2) or go online and learn from “www.sqlcourse.com”
• Discussing this here might take us out of scope, but it’s good to know it’s possible in R using contributed packages like “sqldf” and “dplyr”.
Other functions useful for data sets
Function   Description
str        A compact display of the internal structure of a data frame
head       Prints the first part; default is the first 6 rows
tail       Prints the last part; default is the last 6 rows
attach     Puts a data frame on R’s search path, hence variables are accessible without reference to the data frame
detach     Removes a data frame from R’s search path. Recommended after completion of a task
Other useful functions (cont)
Function   Description
with       Recommended alternative to attach; makes it possible to run expressions/functions on a data frame
which      Locates indices of logical value TRUE. Used for indexing data frame elements
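As a rough illustration of how the four SQL statements map onto what was covered in this session, the sketch below mimics a query using base R only. The query, the threshold of 13 and the object name res are assumptions made up for this example; the contributed packages mentioned above are not used here.

```r
# SQL being mimicked (written out for comparison, not executed):
#   SELECT Murder, Assault FROM USArrests
#   WHERE Murder > 13 ORDER BY Murder DESC

# FROM + WHERE + SELECT with subset()
res <- subset(USArrests, subset = Murder > 13, select = c(Murder, Assault))

# ORDER BY Murder DESC with order()
res <- res[order(res$Murder, decreasing = TRUE), ]
res

# which() returns the row positions that satisfy the WHERE condition;
# with() lets us refer to Murder without the data frame name
with(USArrests, which(Murder > 13))
```

The same functions from the tables above (with, which) appear here, so the SQL view is mostly another way of reading subset() and order().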