Data may be organized in many different ways; the logical or mathematical model of a particular organization of data is called a "data structure". The choice of a particular data model depends on two considerations:
It must be rich enough in structure to reflect the actual relationships of the data in the real world.
The structure should be simple enough that one can effectively process the data when necessary.
Data Structure Operations
The particular data structure that one chooses for a given situation depends largely on the nature of specific operations to be performed.
The following are the four major operations associated with any data structure:
i. Traversing: Accessing each record exactly once so that certain items in the record may be processed.
ii. Searching: Finding the location of the record with a given key value, or finding the locations of all records which satisfy one or more conditions.
iii. Inserting: Adding a new record to the structure.
iv. Deleting: Removing a record from the structure.
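To make the four operations concrete, here is a minimal sketch in R (the language used later in these notes); the named vector `records` and the cutoff 80 are invented for illustration:

```r
# A toy "structure": a named numeric vector standing in for keyed records
records <- c(alice = 70, bob = 85, carol = 90)

# Traversing: visit each record exactly once
for (nm in names(records)) print(records[[nm]])

# Searching: find the locations of all records satisfying a condition
which(records > 80)    # positions 2 (bob) and 3 (carol)

# Inserting: add a new record
records <- c(records, dave = 60)

# Deleting: remove a record by its key
records <- records[names(records) != "bob"]
names(records)         # "alice" "carol" "dave"
```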
Primitive and Composite Data Types
Primitive data types are the basic data types of a language. In most computers they are native to the machine's hardware.
Some Primitive data types are:
Integer
Data processing is an integral part of most modern software development. Understanding of Abstract Algebra and Category theory will be beneficial for addressing data processing concerns.
This is a presentation on Arrays, one of the most important topics on Data Structures and algorithms. Anyone who is new to DSA or wants to have a theoretical understanding of the same can refer to it :D
R is a programming language and environment commonly used in statistical computing, data analytics and scientific research.
It is one of the most popular languages used by statisticians, data analysts, researchers and marketers to retrieve, clean, analyze, visualize and present data.
Due to its expressive syntax and easy-to-use interface, it has grown in popularity in recent years.
A high level introduction to R statistical programming language that was presented at the Chicago Data Visualization Group's Graphing in R and ggplot2 workshop on October 8, 2012.
Abstract: This PDSG workshop introduces the basics of Python libraries used in machine learning. Libraries covered are NumPy, Pandas and Matplotlib.
Level: Fundamental
Requirements: One should have some knowledge of programming and some statistics.
Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and in interactive environments across platforms.
This 10-hour class is intended to give students the basis to empirically solve statistical problems. Talk 1 serves as an introduction to the statistical software R, and presents how to calculate basic measures such as the mean, variance, correlation and Gini index. Talk 2 shows how the central limit theorem and the law of large numbers work empirically. Talk 3 presents the point estimate, the confidence interval and the hypothesis test for the most important parameters. Talk 4 introduces the linear regression model and Talk 5 the bootstrap world. Talk 5 also presents a simple example of a Markov chain.
All the talks are supported by scripts written in the R language.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: Short Report (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Introduction to Data Analysis and Graphics
Hellen Gakuruh
2017-03-07
Session Two
Vector and Assignment, Data Objects and Data Importation
Outline
By the end of this session we will have knowledge on:
• Vectors and Assignment
• Data types
• Data structures and
• Importing data into R
Vector and Assignment
• The simplest data structure in R is a vector. From a data point of view, a vector is a collection of elements. These elements can be numeric values, alphabetical characters, logical values, or dates and times.
• Vectors are created with the function "c", which means "concatenate", e.g. a numerical vector c(1, 5, 6, 8)
• These vectors can be named using the assignment operator "<-" or the function "assign()", e.g. to assign the vector c(1, 5, 6, 8) to the name "num": num <- c(1, 5, 6, 8) or assign("num", c(1, 5, 6, 8)). We usually use "<-" for assignment; the "assign" function is mostly used when developing functions
• A vector can be of any length, beginning from 1 up to about 2.1474836 × 10^9
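A runnable sketch of the two assignment styles described above (the names `num` and `num2` are invented for illustration):

```r
# Create a numeric vector and bind it to a name with "<-"
num <- c(1, 5, 6, 8)

# Equivalent binding with assign()
assign("num2", c(1, 5, 6, 8))

length(num)          # 4 elements
identical(num, num2) # TRUE: both styles produce the same vector
```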
Data types
R recognises seven data types, these are:
• Logical
• Integer
• Real/Double
• String/Character
• Factor
• Complex
• Raw
• The R manuals specify six types: logical, integer, double, character, complex and raw. Factor, however, is a data type that does not fall into any of the six listed types.
• In this sub-section we introduce these data types
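A quick way to see these types in a session is to ask R itself; `typeof()` reports the internal storage type (a minimal sketch, values invented):

```r
typeof(TRUE)        # "logical"
typeof(1L)          # "integer"
typeof(1.5)         # "double"
typeof("a")         # "character"
typeof(1 + 2i)      # "complex"
typeof(as.raw(1))   # "raw"
class(factor("a"))  # "factor" (typeof() reports "integer" underneath)
```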
Data types: Logical
• These are vectors with only TRUE and FALSE values, like c(TRUE, TRUE, FALSE, TRUE, FALSE)
• Can be considered binary vectors in analysis
• Other than categorical variables with these values, such vectors are often created by comparison and logical operators like "<", ">", "<=", ">=", "==", "!=", "|", "||", "&", and "&&"
• During analysis, these vectors can be coerced to numeric values, in which case TRUE becomes 1 and FALSE becomes 0
• These vectors may include the value "NA", which in R means "Not Available", a placeholder for missing values
• Any operation done on a vector containing NA is bound to result in NA, since NA is unknown
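The coercion and NA behaviour described above can be checked directly (a small sketch; the vectors are invented for illustration):

```r
lgl <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

sum(lgl)    # TRUE coerces to 1, FALSE to 0, so the sum is 3
mean(lgl)   # proportion of TRUE values: 0.6

sum(c(1, 2, NA))                # NA propagates, so the result is NA
sum(c(1, 2, NA), na.rm = TRUE)  # 3 after dropping the missing value
```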
Data types: Integer
• These are basically positive and negative whole numbers, without fractions: {. . . , -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, . . . }
• In R, integers are denoted with the letter L, e.g. c(-3L, 0L, 2L, 5L, 6L). You can confirm it is an integer vector with the function is.integer(c(-3L, 0L, 2L, 5L, 6L))
• An example of a variable which can naturally be considered integer-valued is "number of people" (you can't have a fraction of a person)
• Mathematically denoted by \( \mathbb{Z} \)
Data types: Real/Double
• A real number is any number along the (infinitely long) number line
• They include fractions
• Denoted mathematically by \( \mathbb{R} \)
• Any numeric vector whose values are not followed by the letter "L" is considered double, e.g. c(-3, 0, 2, 5, 6). You can confirm a vector is a real or double vector with the function "is.double", e.g. is.double(c(-3, 0, 2, 5, 6))
Data types: String/Character
• Composed of alphabetical letters and words/text
• Denoted by single or double quotation marks
• R has a special built-in vector of the alphabetical letters: letters
• Examples: c("a", "b", "c"), letters, c('cats', 'and', 'dogs')
• Can check whether a vector is a character vector with the function is.character, e.g. is.character(letters)
Data type: Factors
• In R a factor vector is a categorical variable with a discrete classification (grouping)
• Example
cat <- factor(c(rep("Y", 28), rep("N", 10)))
is.factor(cat)
[1] TRUE
levels(cat)
[1] "N" "Y"
Data type: Complex
• These are vectors with real and imaginary parts. Imaginary numbers are denoted by the letter "i"
• Mathematically, complex numbers make it possible to take the square root of negative values
# Example, complex vector
3+2i
[1] 3+2i
# Confirm it's complex
is.complex(3+2i)
[1] TRUE
Data type: Raw
• These are vectors containing raw computer bytes, i.e. information as it is held in data storage units
• More computer language (0's and 1's) than human-readable language
• Integers and doubles are jointly referred to as numeric
• The most commonly used data types are logical, numeric and character. Complex and raw data types are rarely used
int <- c(-3L, -2L, -1L, 0L, 1L, 2L, 3L)
is.integer(int)
[1] TRUE
is.numeric(int)
[1] TRUE
doub <- c(-3, -2, -1, 0, 1, 2, 3)
is.double(doub)
[1] TRUE
is.numeric(doub)
[1] TRUE
Data structures
• There are two broad types of data structures in R
– Atomic vectors
– Generic (list) vectors
• These structures have three properties
– Type
– Length and
– Attributes
• The function "typeof" is used to establish a vector's type, the function "length" is used to determine its length, and the function "attributes" is used to get additional information about a vector
• Atomic vectors and lists differ in their type: atomic vectors can only contain one data type, while lists can contain any number of data types.
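A short sketch of the three properties on an invented named vector:

```r
num <- c(a = 1, b = 5, c = 6)

typeof(num)       # "double": the vector's type
length(num)       # 3: the number of elements
attributes(num)   # additional information; here a $names attribute "a" "b" "c"
```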
Atomic Vectors
• Contain only one data type; they include 1-dimensional atomic vectors, 2-dimensional atomic vectors called "matrices", and multi-dimensional atomic vectors called "arrays".
• Dimensionality can be thought of as the number of indices required to address any element of a vector, e.g. the vector "cat" requires one index to address any value; for example, index "4" means the fourth value, which is Y
• Single variables are all atomic vectors of one dimension
• To check whether a vector is atomic or a list, use is.atomic() or is.list(). Note there is also an is.vector(), but this checks that an object has no attributes other than names
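A small sketch contrasting the two checks (the vectors are invented for illustration):

```r
atom <- c(1, 5, 6, 8)        # one data type only
lst  <- list(1, "a", TRUE)   # mixed data types

is.atomic(atom)   # TRUE
is.list(atom)     # FALSE
is.atomic(lst)    # FALSE
is.list(lst)      # TRUE
```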
Atomic vectors: Matrices
• Two-dimensional atomic vectors; they contain data of the same type
• Any atomic vector can be converted to a matrix by adding a dim attribute
cat <- c(rep("Y", 28), rep("N", 10))
typeof(cat)
[1] "character"
dim(cat)
NULL
is.matrix(cat)
[1] FALSE
dim(cat) <- c(19, 2)
typeof(cat)
[1] "character"
dim(cat)
[1] 19 2
is.matrix(cat)
[1] TRUE
• Other than using "dim()" to convert a one-dimensional to a multi-dimensional atomic vector, matrices can be created with "matrix()", or by coercing another data object with "as.matrix()"
typeof(airmiles)
[1] "double"
airmiles2 <- matrix(airmiles, nrow = 8, ncol = 3)
is.matrix(airmiles2)
[1] TRUE
airmiles3 <- as.matrix(airmiles, nrow = 8, ncol = 3)
is.matrix(airmiles3)
[1] TRUE
(Note: as.matrix() ignores the nrow and ncol arguments here, so airmiles3 is a one-column matrix.)
rm(airmiles2, airmiles3)
Special 1 & 2 dimension atomic vectors
Time series objects
• These are vectors used to store observations collected at given time points (intervals) over a period of time, e.g. observations collected every three months for five years.
• The distinguishing feature of this data is time; the interval is usually constant, like three months (regular), but in other cases it might not be (irregular)
• In R, time series data are numeric vectors with the attribute class equal to "ts", meaning time series
• Time series vectors can be either a 1-dimensional atomic vector, like the "AirPassengers" data set in R, or a 2-dimensional matrix, like "EuStockMarkets"
typeof(AirPassengers)
[1] "double"
attr(AirPassengers, "class")
[1] "ts"
typeof(EuStockMarkets)
[1] "double"
attr(EuStockMarkets, "class")
[1] "mts" "ts" "matrix"
Atomic vectors: Arrays
• Arrays are multi-dimensional atomic vectors.
• Matrices are two-dimensional arrays.
• They are rarely used, but it's good to know they exist
• Created like matrices: with "dim()", e.g. dim(a) <- c(6, 2, 2), or with array() or as.array()
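A minimal sketch of creating a 3-dimensional array with `dim()` (the values are invented):

```r
a <- 1:24
dim(a) <- c(6, 2, 2)   # now a 6 x 2 x 2 array; array(1:24, dim = c(6, 2, 2)) is equivalent

is.array(a)   # TRUE
a[4, 1, 2]    # three indices address one element: 16
```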
Data structures: Generic vectors
• Lists are data structures which can contain more than one data type.
• There are two types of lists: two-dimensional lists called "data frames", and plain "lists"
Data frames
• The most recognizable data structure
• A core data structure in R
• Presents data in rows and columns like matrices, but in this case columns can have different data types
# Example
head(faithful)
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
Generic vectors: Lists
• These are a unique data structure
• Can contain any number and type of objects, not just data. Can contain sub-lists, hence also called recursive
• Created with the function "list()". Can also coerce other structures to a list with the function "as.list()"
• We will create this structure in our next session
Importing and Exporting Data in R
• Data importation is also referred to as "reading in" data
• Reading data depends on the type and location of the file
• In this sub-session we are interested in reading in local R, text, Excel, database and other statistical program files
• We also discuss web scraping
Reading in .RData
• Data created in R can be stored in an .RData file
• This could be any data structure, or a collection of data saved from an active working directory (workspace)
• The function "save.image()" is used to store the workspace; the function "load" is used to read in any ".RData" file (history is read back with "loadhistory()")
# See current objects
ls()
[1] "cat" "doub" "int"
# Store in an external .RData file
save.image()
# Remove all object from workspace/global environment
rm(list = ls())
ls()
character(0)
# Read in .RData
load(".RData")
# Check we have them back
ls()
[1] "cat" "doub" "int"
R’s core importing function “read.table()”
• read.table is R's core importing function
• Almost all other reading functions, including those in contributed packages, depend on this function
• It reads a file and creates a data frame from it
• It has a number of wrapper functions (functions which provide a convenience interface to another function, e.g. by supplying pre-defined/default values; this makes function calls more efficient)
8
• Wrapper functions include read.csv(), read.csv2(), read.delim() and read.delim2()
• CSV files are comma-separated files
• "Delim" files are text files; "delim" means delimited, which implies how the data are separated, e.g. with tabs
• Both csv and delim files are relatively easy to read into R as long as the separators/delimiters are known
• If the separator or delimiter is not known and the file cannot be opened, it is best to read in a few lines with the readLines() function
Live demo (reading in a CSV file)
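A self-contained sketch of the workflow described above, using a temporary file so it runs anywhere (the file contents are invented for illustration):

```r
# Write a small CSV to a temporary file, then read it back
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,score", "alice,70", "bob,85"), tmp)

# read.csv() is a wrapper around read.table() with sep = "," and header = TRUE
dat <- read.csv(tmp)
dat           # a data frame with columns "name" and "score"

# Peek at a file of unknown format before committing to a reader
readLines(tmp, n = 2)
```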
Reading in Excel files
• Base R does not have a function to read in Excel-based files
• But many contributed packages have functions to read them in
• The core reference for importing this type of file is one of the R-project manuals, R Data Import/Export, specifically chapter 9.
• The recommendation made there is to try to convert the Excel file into a ".csv" (comma-separated) or "delim" (tab-separated) file.
Live demo (reading an Excel file)
Reading in Databases data
• A word of caution: database data tend to be large, and R is not too good when it comes to large data, so read in part of the data or look for ways to increase the memory available to R processes, for example by using the cloud.
• Most Relational Database Management Systems (RDBMS) hold data similar to R's data frames, where columns are called "fields" and rows are called "records".
• Extracting part of a relational database requires use of database querying semantics, the core of which is the SELECT statement.
• In general, a SELECT query uses:
– FROM to select the table
– WHERE to specify a condition for inclusion and
– ORDER BY to sort results (this is important, as an RDBMS does not order its rows like R's data frames)
• There are a number of contributed packages on CRAN for reading RDBMS data; these include RMySQL, DBI, ROracle, RPostgreSQL and RSQLite.
Live demo (reading in RDBMS and web data)
From other statistical software
• Other statistical packages whose data files are often read into R are SPSS, SAS, Stata and EpiInfo
• As with Excel and database data, a package must be used to read in these files
• The recommended package is "foreign"; other packages include "readstata13" and haven.
Live demo (reading SPSS and Stata data files)