1. R-Programming–Basics
R Programming
Ground Up!
Syed Awase Khirni
Syed Awase earned his PhD from University of Zurich in GIS, supported by EU V Framework Scholarship from SPIRIT
Project (www.geo-spirit.org). He currently provides consulting services through his startup www.territorialprescience.com
and www.sycliq.com
Copyright 2008-2016 Syed Awase Khirni TPRI
2. R-Programming–Basics
R Project
• R – Free Software
environment for
statistical computing
and graphics.
• https://www.r-project.org
• https://cran.r-project.org/mirrors.html
3. R-Programming–Basics
S
• S Language – Developed by
John Chambers et al. at Bell
Labs
• 1976 -> internal statistical
analysis environment –
originally implemented as
Fortran Libraries
• 1988-> Rewritten in C –
statistical models in S by
Chambers and Hastie
• 1998-> S v.4.0
• 1991-> R created in New
Zealand by Ross Ihaka and
Robert Gentleman.
• 1993 -> public release of R
• 1995 -> Martin Mächler
convinced Ross and Robert to
use the GNU General Public License (GPL)
• 1996, 1997 -> R Core Group
formed (including people
associated with S-PLUS)
• 2000 -> R version 1.0.0 released
• 2015 -> R version 3.1.3 released on March
9, 2015.
4. R-Programming–Basics
Design of the R System
• R – Statistical programming
language based on the S language
developed at Bell Labs.
• Divided into 2 conceptual parts
– Base
– Add-on Packages
• Base – R System contains
– The base package which is required
to run R and contains the most
fundamental functions.
– Other packages contained in the
base system include utils, stats,
datasets, graphics, grDevices, grid,
methods, tools, parallel, compiler,
splines, tcltk, stats4
• Add-on packages are packages
published either by the R
Core group or by third
parties
• Syntax similar to S, making it easy
for S-PLUS users to switch over
• Semantics are superficially similar
to S, but in reality are quite
different
• Runs on almost any standard
computing platform/OS
5. R-Programming–Basics
R?
• R is an integrated suite of
software facilities for data
manipulation, calculation
and graphical display
• R has
– Effective data handling and
storage facility
– A suite of operators for
calculations on arrays and
matrices
– A large, coherent,
integrated collection of
tools for data analysis
– Graphical facilities for data
analysis and display
– A well developed, simple
and effective programming
language
6. R-Programming–Basics
R- Drawbacks
• Little built-in support
for dynamic or 3-D
graphics
• Functionality is based
on consumer demand
and user contributions
• Web support provided
through third party
software.
9. R-Programming–Basics
Objects in R
• R has five basic or atomic classes of objects
– Character
– Numeric (real number)
– Integer
– Complex
– Logical (true/false)
• The most basic object is a vector
– A vector can only contain objects of the same class
– The one exception is a list, which is represented as a
vector but can contain objects of different classes
– Empty vectors can be created with the vector() function
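A minimal sketch of the atomic classes and of vector() listed above; the variable names are illustrative only.

```r
# One value of each atomic class
x_chr <- "hello"   # character
x_num <- 3.14      # numeric (double)
x_int <- 42L       # integer (note the L suffix)
x_cpl <- 1 + 2i    # complex
x_lgl <- TRUE      # logical

classes <- c(class(x_chr), class(x_num), class(x_int),
             class(x_cpl), class(x_lgl))
print(classes)

# vector() creates an empty vector of a given mode and length
empty <- vector("numeric", length = 3)
print(empty)

# A list may hold elements of different classes
mixed <- list(1, "a", TRUE)
print(sapply(mixed, class))
```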
11. R-Programming–Basics
install.packages()
• To install additional
third-party packages
into your R installation,
we use
• install.packages("XLConnect")
– To install the XLConnect
package
– To activate an already
installed package we use
• library("packagename")
Check whether the package is already installed
or not.
"<name of your package>" %in%
rownames(installed.packages())
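The check above can be wrapped in a small helper. This is only a sketch: is_installed() is a hypothetical name, and the install step is left as a comment so that nothing is downloaded when the code runs.

```r
# Hypothetical helper: TRUE if a package with exactly this name is installed
is_installed <- function(pkg) {
  pkg %in% rownames(installed.packages())
}

print(is_installed("stats"))   # stats ships with base R

if (!is_installed("XLConnect")) {
  message("XLConnect is not installed; run install.packages(\"XLConnect\")")
} else {
  message("XLConnect is available; load it with library(\"XLConnect\")")
}
```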
12. R-Programming–Basics
Numbers in R
• Treated as numeric
objects (i.e. double
precision real numbers)
• Suffix L => integer
• Example : 1 => numeric
object
– 1L => explicitly gives an
integer
• 1/0 => Inf (infinity)
• NaN => "not a number", an undefined
value such as 0/0
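A short sketch of how R treats these numbers; the variable names are illustrative.

```r
a <- 1      # numeric (double precision)
b <- 1L     # the L suffix gives an integer
print(class(a))
print(class(b))

inf_val <- 1 / 0   # Inf
nan_val <- 0 / 0   # NaN, an undefined result
print(is.infinite(inf_val))
print(is.nan(nan_val))
```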
13. R-Programming–Basics
Attributes
• R objects can have
attributes
– Names, dimnames
– Dimensions (e.g. matrices,
arrays)
– Class
– Length
– Other user-defined
attributes/metadata
• Attributes of an object
can be accessed using the
attributes() function.
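As an illustration, the sketch below inspects the attributes of a small matrix and attaches a user-defined one. The attribute name "source" is an arbitrary choice for this example.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

attrs <- attributes(m)   # a list holding dim and dimnames
print(names(attrs))
print(attrs$dim)

# User-defined metadata can be attached with attr()
attr(m, "source") <- "example"
print(attr(m, "source"))
```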
14. R-Programming–Basics
Assignment Operator (<-)
• Values are assigned to
symbols using the <- assignment
operator.
• The grammar of the
language determines
whether an expression is
complete or not
• The # character indicates a
comment. Anything to the
right of the # (including the
# itself) is ignored
• [1] indicates that x is a
vector and 123781213412
is the first element
• Auto-printing: entering an object's name at the prompt prints its value
• Ctrl+L clears the console
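A brief sketch of assignment, comments and auto-printing, using the value mentioned on the slide.

```r
x <- 123781213412   # assignment: nothing is printed
x                   # auto-printing at the console: the output line starts with [1]
print(x)            # explicit printing gives the same output
msg <- "hello"      # everything to the right of # is ignored
```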
17. R-Programming–Basics
Mixing Objects
• When different objects are mixed in a vector, coercion
occurs so that every element in the vector is of the
same class.
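A small sketch of implicit coercion when classes are mixed; the variable names are illustrative.

```r
a <- c(1.7, "a")    # numeric + character -> character
b <- c(TRUE, 2)     # logical + numeric   -> numeric (TRUE becomes 1)
d <- c("a", TRUE)   # character + logical -> character

print(class(a))
print(class(b))
print(class(d))
print(b)
```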
19. R-Programming–Basics
Matrices
• Vectors with a dimension
attribute are called matrices.
The dimension attribute is
itself an integer vector of
length 2 (nrow, ncol)
• Matrices are constructed
column-wise, so entries can be
thought of starting from the
upper left corner and running
down the columns.
• Matrices can also be created
directly from vectors by
adding a dimension attribute.
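The points above can be sketched as follows: the same matrix is built once with matrix() and once by adding a dimension attribute to a plain vector.

```r
# Filled column-wise: 1 and 2 go down the first column, and so on
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)
print(dim(m))

# The same matrix, created by adding a dimension attribute to a vector
v <- 1:6
dim(v) <- c(2, 3)
print(identical(m, v))
```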
22. R-Programming–Basics
Lists in R
• Lists are a special type
of vector that can
contain elements of
different classes.
• Lists are a very
important data type in
R
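A minimal sketch of a list holding elements of different classes.

```r
x <- list(1, "a", TRUE, 1 + 4i)
element_classes <- sapply(x, class)
print(element_classes)
print(length(x))
```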
23. R-Programming–Basics
Factors
• Used to represent
categorical data. Factors can
be unordered or ordered.
• Factors are treated
specially by modelling
functions like lm() and
glm()
• Using factors with labels is
better than using integers
because factors are self-describing:
the variable carries
its value labels with it.
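A short sketch of unordered and ordered factors; the labels are made up for illustration.

```r
# Unordered factor with an explicit level order
answer <- factor(c("yes", "no", "yes", "yes"), levels = c("yes", "no"))
print(table(answer))
print(levels(answer))
print(as.integer(answer))   # the underlying integer codes

# Ordered factor: comparisons between levels are meaningful
size <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"), ordered = TRUE)
print(size[1] < size[2])    # is "low" below "high"?
```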
24. R-Programming–Basics
Missing Values
• Many existing, industrial
and research datasets
contain Missing values.
• These can occur due to
various reasons such as
manual data entry
procedures, equipment
errors and incorrect
measurements.
• Missing values can appear
in the form of outliers or
even wrong data (i.e. values out
of bounds)
• Missing values are denoted by NA,
or by NaN for undefined
mathematical operations
– is.na() is used to test whether
objects are NA
– is.nan() is used to test for
NaN
– NA values have a class also,
so there are integer NA,
character NA, etc.
– A NaN value is also NA but
the converse is not true.
25. R-Programming–Basics
Missing Values
• Three types of problems
are usually associated
with missing values
– Loss of efficiency
– Complications in
handling and
analyzing the data
– Bias resulting from
differences between
missing and complete
data.
Identifying NA values using is.na() and is.nan()
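A minimal sketch of the two tests on made-up vectors. Note that NaN counts as NA, but NA does not count as NaN.

```r
x <- c(1, 2, NA, 10, 3)
x_na  <- is.na(x)    # TRUE only at the NA
x_nan <- is.nan(x)   # no NaN in x

y <- c(1, 2, NaN, NA, 4)
y_na  <- is.na(y)    # TRUE for both NaN and NA
y_nan <- is.nan(y)   # TRUE only for NaN

print(y_na)
print(y_nan)
print(sum(is.na(y)))  # count of missing values
```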
26. R-Programming–Basics
Data Frames
• Used to store tabular data
(table of values)
– They are represented as a
special type of list, where
every element of the list has
to have the same length.
– Each element of the list can
be thought of as a column
and the length of each
element of the list is the
number of rows
• Data frames can store
different classes of objects
in each column, while
matrices must have every
element of the same class
• Data frames also have a
special attribute called
row.names.
• Data frames are usually
created by calling
read.table() or read.csv()
• Can be converted to a
matrix by calling the
data.matrix() function
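A small sketch of a data frame built by hand with data.frame() rather than read from a file; the column names and values are invented for illustration.

```r
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 score = c(9.5, 7.0, 8.2),
                 stringsAsFactors = FALSE)

print(nrow(df))            # number of rows
print(ncol(df))            # number of columns
print(sapply(df, class))   # each column keeps its own class
print(row.names(df))

# Convert the numeric columns to a matrix
m <- data.matrix(df[c("id", "score")])
print(m)
```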
29. R-Programming–Basics
Names in R
• R Objects can also have
names, which is very
useful for writing
readable code and self-
describing objects
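A brief sketch of names on a vector, a list and a matrix; all names are illustrative.

```r
x <- 1:3
names(x) <- c("foo", "bar", "baz")
print(x)
print(x[["bar"]])

lst <- list(a = 1, b = 2, c = 3)
print(names(lst))

m <- matrix(1:4, nrow = 2)
dimnames(m) <- list(c("r1", "r2"), c("c1", "c2"))
print(m["r2", "c1"])
```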
30. R-Programming–Basics
Subsetting
• Extracting subsets from
an existing dataset is
called subsetting
– [ ] always returns an
object of the same class
as the original
– [[ ]] is used to extract
elements of a list or a
data frame.
– $ is used to extract
an element of a list or data
frame by name;
semantics are similar to
those of [[ ]].
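The three operators can be compared on a small example; the object names are made up.

```r
x <- c("a", "b", "c", "d")
print(x[1:2])       # [ ] on a vector returns a vector

lst <- list(foo = 1:4, bar = 0.6)
sub_list <- lst[1]       # [ ] on a list returns a list
first_el <- lst[[1]]     # [[ ]] returns the element itself
by_name  <- lst$bar      # $ extracts by name

print(class(sub_list))
print(first_el)
print(by_name)
```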
38. R-Programming–Basics
Reading Data
• R provides some useful functions to read data
– read.table, read.csv, for reading tabular data
– readLines, for reading lines of a text file
– source, for reading in R code files (inverse of
dump)
– dget, for reading in R code files (inverse of dput)
– load, for reading in saved workspaces
– unserialize, for reading single R objects in binary
form.
39. R-Programming–Basics
Writing Data
• R provides a set of functions to write data into
files
– write.table: to write data in table format
– writeLines: to write lines
– dump
– dput
– save
– serialize
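A sketch of three write/read pairs from the lists above, using temporary files so that nothing is left behind. The object names and file extensions are arbitrary choices for this example.

```r
# writeLines / readLines
txt_file <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_file)
lines_back <- readLines(txt_file)
print(lines_back)

# save / load: load() restores the object under its original name
rda_file <- tempfile(fileext = ".RData")
scores <- c(a = 1, b = 2)
save(scores, file = rda_file)
rm(scores)
load(rda_file)
print(scores)

# serialize / unserialize: a single object in binary form
raw_bytes <- serialize(scores, connection = NULL)
scores_copy <- unserialize(raw_bytes)
print(identical(scores, scores_copy))
```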
40. R-Programming–Basics
Reading data files with read.table
• For small to moderately
sized datasets, we can
just call read.table
without specifying any
other arguments.
• data <-
read.table("sampledata.txt")
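Since sampledata.txt is not available here, the sketch below first writes a tiny whitespace-separated file to a temporary location and then reads it back. The file contents and the header = TRUE argument are assumptions made for this example.

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("id score",
             "1 9.5",
             "2 7.0"), tmp)

# header = TRUE because the first line holds the column names
d <- read.table(tmp, header = TRUE)
str(d)
```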
47. R-Programming–Basics
write.csv()
• One of the easiest ways to save an R data
frame is to write it to a csv file or tsv file or
text file.
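A minimal sketch that writes a made-up data frame to a CSV file and to a tab-separated file, then reads the CSV back. Temporary files are used; the column names are illustrative.

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)

# Tab-separated output uses write.table() with sep = "\t"
tsv_file <- tempfile(fileext = ".tsv")
write.table(df, tsv_file, sep = "\t", row.names = FALSE, quote = FALSE)

back <- read.csv(csv_file, stringsAsFactors = FALSE)
print(back)
```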
48. R-Programming–Basics
dput()
• Writes an ASCII text representation of an R
object to a file or connection, or uses one to
recreate the object
49. R-Programming–Basics
Head and Tail of a Data Set
• head() and tail() return the first or the
last part of an object,
e.g. a vector, matrix, table,
data frame or function.
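A short sketch on a vector and on the built-in mtcars data set; the choice of three elements is arbitrary.

```r
v <- 1:10
first3 <- head(v, 3)
last3  <- tail(v, 3)
print(first3)
print(last3)

# On a data frame, head() and tail() return rows
top_rows <- head(mtcars, 3)
print(top_rows)
print(nrow(tail(mtcars, 2)))
```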
50. R-Programming–Basics
Loading “foreign” data
• Sometimes, we would
like to import data from
other statistical
packages like SAS, SPSS
and Stata
• Reading Stata (.dta)
files with the foreign package
• Writing data files from R
into Stata is also very
straightforward.
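A round-trip sketch with write.dta() and read.dta(), assuming the foreign package is installed (it normally ships with R as a recommended package). The data frame is invented and a temporary file stands in for a real .dta file.

```r
library(foreign)

df <- data.frame(id = 1:3, score = c(2.5, 3.5, 4.5))

dta_file <- tempfile(fileext = ".dta")
write.dta(df, dta_file)       # write a Stata file from R

back <- read.dta(dta_file)    # read it back into a data frame
print(back)
```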
51. R-Programming–Basics
Library "foreign" data
• SPSS Data
– For data files in SPSS
format, it can be opened
with the function
read.spss from “foreign”
package.
– “to.data.frame” option
set to TRUE to return a
data frame.
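A sketch of read.spss() with to.data.frame = TRUE. It assumes the foreign package is installed and relies on the small example file electric.sav that the package itself uses in its documentation; with your own data, replace the path accordingly.

```r
library(foreign)

# Example .sav file bundled with the foreign package (assumed to be present)
sav_file <- system.file("files", "electric.sav", package = "foreign")

# to.data.frame = TRUE returns a data frame instead of a list
spss_data <- read.spss(sav_file, to.data.frame = TRUE)
print(class(spss_data))
print(dim(spss_data))
```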
55. R-Programming–Basics
Computing Memory Requirements
• A numeric (double-precision) value takes 8 bytes.
• Imagine you have a data frame with 100,000
rows and 100 columns of numeric data.
• 100,000 x 100 x 8 bytes/numeric = 80,000,000 bytes
– 2^20 bytes/MB
– So roughly 76.3 MB of memory is
required.
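The arithmetic can be checked in R itself. This is a rough estimate that ignores the small per-object overhead; the variable names are illustrative.

```r
rows <- 100000
cols <- 100
bytes_per_numeric <- 8

total_bytes <- rows * cols * bytes_per_numeric
total_mb <- total_bytes / 2^20
print(total_bytes)
print(round(total_mb, 1))

# Sanity check on the 8-byte figure: size per element of a long numeric vector
per_value <- as.numeric(object.size(numeric(1e6))) / 1e6
print(per_value)
```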
56. R-Programming–Basics
Text Formats
• dump and dput are useful because the resulting textual
format is editable and in the case of corruption, potentially
recoverable
• Unlike writing out a table or CSV file, dump and
dput preserve the metadata (sacrificing some readability),
so that another user doesn’t have to specify it all over
again.
• Textual formats can work much better with version control
programs like Git and SVN, which can then track changes
meaningfully
• Text formats have a longer life and adhere to the “Unix
philosophy”
• However, the format is not very space-efficient.
57. R-Programming–Basics
dump() function
• Creates a file in a format
that can be read with the
source() function or pasted
in with the copy/paste edit
functions of the windowing
system.
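A minimal sketch of dump() followed by source(). The object names are made up, and local = TRUE is used so that the objects are restored in the environment where the code runs.

```r
x <- "foo"
y <- data.frame(a = 1L, b = "a", stringsAsFactors = FALSE)

dump_file <- tempfile(fileext = ".R")
dump(c("x", "y"), file = dump_file)   # several objects, by name

rm(x, y)                              # remove them from the workspace
source(dump_file, local = TRUE)       # recreate them from the text file

print(x)
print(y)
```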
58. R-Programming–Basics
dput() function
• The dput() function saves data as
an R expression, which
means that the resulting file
can actually be copied and
pasted into the R console.
• Creates and uses an ASCII
file representing the object
• Writes an ASCII version of
the object onto the file.
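A small sketch of dput() and its inverse dget(); the data frame is invented and a temporary file is used.

```r
y <- data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)

dput(y)                       # prints the R expression that recreates y

dput_file <- tempfile(fileext = ".R")
dput(y, file = dput_file)     # write the same expression to a file
y2 <- dget(dput_file)         # read it back as an object

print(all.equal(y, y2))
```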
59. R-Programming–Basics
Functions in R
• Functions are a
fundamental building
block of R
– Functions can be
assigned to variables
– Functions can be stored
in lists,
– Functions can be passed
as arguments to other
functions
– Functions can have
nested functions.
• Anonymous functions are
functions that have no
name.
• We use functions to
incorporate sets of
instructions that we want to
use repeatedly or that
because of their complexity,
are better self-contained in
a sub-program and called
when needed.
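The properties listed above can be sketched as follows; every function name here is hypothetical.

```r
# A function assigned to a variable
square <- function(x) x^2

# Functions stored in a list
ops <- list(sq = square, cube = function(x) x^3)
cube_of_2 <- ops$cube(2)

# A function passed as an argument to another function
apply_twice <- function(f, x) f(f(x))
twice <- apply_twice(square, 3)

# A nested function: add() is defined inside make_adder()
make_adder <- function(n) {
  add <- function(x) x + n
  add
}
add5 <- make_adder(5)

# An anonymous function, used without ever being named
tens <- sapply(1:3, function(x) x * 10)

print(c(cube_of_2, twice, add5(1)))
print(tens)
```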
60. R-Programming–Basics
User Defined Functions in R
• User-defined functions (UDFs) are
written to accomplish a particular
task, often when the author is
not aware that a dedicated
function or library already
exists.
63. R-Programming–Basics
Infix Operators in R
• They are special
functions
that perform basic data
operations or
transformations.
• "Infix" refers to the
placement of the
operator
between its operands.
• The types of infix
operators used in R
include functions for
data extraction,
arithmetic sequences,
comparison, logical
testings, variable
assignments and
custom data functions
64. R-Programming–Basics
Infix Operator in R
• Infix operators are used
between operands; behind the
scenes, each one performs a function call.
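A brief sketch of the function call behind an infix operator, plus a user-defined operator. The operator %+% is a hypothetical name chosen for this example.

```r
# The infix form and the equivalent function call
infix_sum  <- 5 + 3
prefix_sum <- `+`(5, 3)
print(c(infix_sum, prefix_sum))

# A custom infix operator: any function named %something%
`%+%` <- function(a, b) paste(a, b)
joined <- "hello" %+% "world"
print(joined)
```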
65. R-Programming–Basics
Predefined infix Operators in R
Operator  Rank  Description
%%        6     Remainder operator
%/%       6     Integer division
%*%       6     Matrix multiplication
%o%       6     Outer product
%x%       6     Kronecker product
%in%      9     Matching operator
::        1     Extract a function from a package namespace
:::       1     Extract a hidden (non-exported) function from a namespace
$         2     Extract a list element by name
@         2     Extract a slot of an S4 object
[[ ]]     3     Extract data by index
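A short sketch exercising several operators from the table; the values are arbitrary.

```r
r <- 7 %% 3                 # remainder
q <- 7 %/% 3                # integer division

A <- matrix(1:4, nrow = 2)
prod_mat  <- A %*% diag(2)  # matrix product with the identity
outer_tab <- 1:3 %o% 1:2    # outer product, 3 x 2
kron      <- diag(2) %x% A  # Kronecker product, 4 x 4

found  <- 2 %in% c(1, 2, 3) # matching
sd_fun <- stats::sd         # function taken from a package namespace

lst <- list(a = 1, b = "two")
print(c(r, q))
print(lst$b)
print(lst[[1]])
```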
66. R-Programming–Basics
Predefined infix operators in R
Operator  Rank  Description
^         4     Exponentiation
:         5     Generate a sequence of numbers
!         8     Not/negation operator
xor       10    Logical exclusive OR (called as a function, xor(x, y))
&         10    Logical AND, element-wise
&&        10    Logical AND for control flow (single values)
~         11    Tilde, used in formulas and model building
<<-       12    Assignment in an enclosing (parent) environment
<-        13    Left assignment
->        13    Right assignment
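A minimal sketch of the operators in this table; the names counter and bump are hypothetical.

```r
p <- 2^3                              # exponentiation
s <- 1:5                              # sequence
n <- !TRUE                            # negation
x1 <- xor(TRUE, FALSE)                # exclusive OR, called as a function
e  <- c(TRUE, FALSE) & c(TRUE, TRUE)  # element-wise AND
c1 <- TRUE && FALSE                   # single-value AND for control flow
f  <- y ~ x                           # a formula; y and x are not evaluated

# <<- assigns to a variable found in an enclosing environment
counter <- 0
bump <- function() counter <<- counter + 1
bump()

10 -> z                               # right assignment
print(c(p, counter, z))
```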
71. R-Programming–Basics
ifelse()
• Vectors form the basic
building block of R
programming.
• Most functions in R take
vector as input and output a
resultant vector
• Vectorization of code will be
much faster than applying
the same function to each
element of the vector
individually.
• ifelse() is a vector
equivalent of the if..else
statement
• test_expression must be a
logical vector (or an object
that can be coerced to
logical)
• Return value is a vector
with the same length as
test_expression
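A short sketch of ifelse() on a made-up vector; the labels "high" and "low" are arbitrary.

```r
x <- c(5, 12, 7, 20)

# The test x > 10 is a logical vector; the result has the same length
labels <- ifelse(x > 10, "high", "low")
print(labels)
print(length(labels) == length(x))
```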
75. R-Programming–Basics
Repeat Loop
• A repeat loop is used to
iterate over a block of
code multiple
times
• There is no condition
check in repeat loop to
exit the loop
• We must put a condition
explicitly inside the body
of the loop and use the
break statement to exit
the loop
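A minimal sketch of a repeat loop with an explicit exit condition; the limit of 5 is arbitrary.

```r
i <- 0
repeat {
  i <- i + 1
  cat("iteration", i, "\n")
  if (i >= 5) {
    break   # the only way out of a repeat loop
  }
}
print(i)
```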
77. R-Programming–Basics
OOP in R
• An object is a data structure having some
attributes and methods which act on the
attributes
• A class is a blueprint for the object.
• R has three(3) class systems
– S3 Class System
– S4 Class System
– Reference Class System
78. R-Programming–Basics
S3 Class System
• Primitive in nature
• Lacks a formal definition and
object of this class can be
simply created by adding a
class attribute.
• Objects are created by setting
the class attribute
• Attributes are accessed using $
• Methods belong to generic
function
• Follows copy-on-modify
semantics
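A sketch of the S3 points above. The class name "student" and its fields are made up for illustration.

```r
# An S3 object is an ordinary list with a class attribute set on it
s <- list(name = "John", age = 21)
class(s) <- "student"

# A method for the print() generic, named generic.class
print.student <- function(x, ...) {
  cat("Student:", x$name, "aged", x$age, "\n")
  invisible(x)
}
print(s)

# Copy-on-modify: changing the copy leaves the original untouched
s2 <- s
s2$age <- 22
print(s$age)
```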
S4 Class System
• A formally defined structure
which helps in making objects
of the same class look more or
less similar.
• Class components are properly
defined using the setClass()
function and objects are
created using the new()
function.
• Attributes are accessed using
@
• Methods belong to generic
function
• Follows copy-on-modify
semantics
79. R-Programming–Basics
Reference Class System
• Similar to the object
oriented programming we
are used to in C# and Java.
• Basically an extension of S4
class system with an
environment added to it.
• Reference Class System
– Class defined using
setRefClass()
– Objects are created
using generator
functions
– Attributes are accessed
using $
– Methods belong to the
class
– Does not follow copy-
on-modify semantics
85. R-Programming–Basics
S4 Class System in R
• S4 class is defined using the setClass() function
• Member variables are called slots
• When defining a class, we need to set the name and
the slots (along with class of the slot)
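A minimal sketch of an S4 class definition. The class name "Student" and its slots are invented for this example.

```r
library(methods)

# Name of the class plus its slots and the class of each slot
setClass("Student", slots = c(name = "character", age = "numeric"))

# Objects are created with new()
st <- new("Student", name = "John", age = 21)
print(isVirtualClass("Student"))
print(is(st, "Student"))
print(st)
```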
86. R-Programming–Basics
S4 Class System in R
Accessing Slots
• Slots of an object are
accessed using @
Modifying Slots
• A slot can be modified
through reassignment
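A short sketch of reading and reassigning slots. The class "Person" and its slots are hypothetical.

```r
library(methods)

setClass("Person", slots = c(name = "character", age = "numeric"))
p <- new("Person", name = "John", age = 21)

# Accessing slots with @ or with slot()
print(p@name)
print(slot(p, "age"))

# Modifying slots through reassignment
p@age <- 22
slot(p, "name") <- "Paul"
print(p@name)
print(p@age)
```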
88. R-Programming–Basics
R Reference Class System
• Reference classes in R are similar
to the object-oriented
programming we are used to
seeing in C++, Java and Python.
• Unlike S3 and S4 classes,
methods belong to the class rather
than to generic functions.
• Reference classes are internally
implemented as S4 classes
with an environment added to
them.
• setRefClass() returns a
generator function which is
used to create objects of that
class
89. R-Programming–Basics
Reference Class in R
Accessing Fields in R
• Fields of the object can be
accessed using the $
operator
Modifying Fields in R
• Fields can be modified by
reassignment
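A sketch of a reference class with field access, field modification and reference semantics. The class "Account", its fields and the deposit() method are invented for this example.

```r
library(methods)

# setRefClass() returns a generator used to create objects
Account <- setRefClass("Account",
  fields = list(owner = "character", balance = "numeric"),
  methods = list(
    deposit = function(amount) {
      balance <<- balance + amount   # modifies the object in place
      invisible(.self)
    }
  )
)

a <- Account$new(owner = "Ann", balance = 100)
print(a$balance)       # access a field with $

a$balance <- 150       # modify a field by reassignment

# No copy-on-modify: b refers to the same object as a
b <- a
b$deposit(50)
print(a$balance)

# copy() gives an independent object
d <- a$copy()
d$deposit(1)
print(c(a$balance, d$balance))
```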