Hands-on R Programming
Mrs. Nimrita Koul
Assistant Professor
School of Computing & IT
REVA University
Contents
 About R
 R and RStudio Download and Installation
 Features of R
R is:
 A programming environment for statistical computing, data analysis and graphics. It is
a GNU project, similar to the S language that was developed at Bell Laboratories by John Chambers and colleagues.
 Graphical facilities for data analysis and visualization
 A well developed, simple and effective programming language.
 Includes conditionals, loops, user defined recursive functions and input and output
facilities.
 A fully planned and coherent system
 R is being updated with newer functionalities like deep learning libraries continuously.
 It has developed rapidly, and has been extended by a large collection of packages.
 R provides a wide variety of statistical tools (linear and nonlinear modelling,
classical statistical tests, time-series analysis, classification, clustering, …)
and graphical techniques, and is highly extensible
 One of R's strengths is visualization: high-quality plots can be
produced with little effort
 R is available as Free Software under the terms of the Free Software
Foundation’s GNU General Public License in source code form.
 R documentation and manuals are available at
 https://cran.r-project.org/manuals.html
Downloading R for Windows
 Go to https://cran.r-project.org/bin/windows/base/
 Click on Download R 3.4.2 for Windows
 Click on the downloaded exe file
 Select language as English
 Click on Next in screens
 Finish
RStudio
 https://www.rstudio.com/products/rstudio/download/
 Go to the column RStudio Desktop Open Source License
 Click on the Free version of RStudio
Installing RStudio
 Once the exe downloads to your PC, Click on it, follow the default settings
 Keep clicking on Next and finally Finish.
 You have installed RStudio.
 To find it:
 Go to Start
 In Programs you will find RStudio; click on it to start it
 R is case sensitive. So A and a are different symbols and would refer to different
variables
 Commands are separated either by a semi-colon (‘;’), or by a newline
 If a command is not complete at the end of a line, R will give a different prompt,
by default + on second and subsequent lines and continue to read input until the
command is syntactically complete.
 The vertical arrow keys on the keyboard can be used to scroll forward and
backward through a command history
R Features
Objects and Workspace
 The entities that R creates and manipulates are known as objects.
 These may be variables, arrays of numbers, character strings, functions, or
more general structures built from such components.
 During an R session, objects are created and stored by name.
 The R command > objects() (alternatively, ls()) can be used to display the
names of the objects which are currently stored within R.
 The collection of objects currently stored is called the workspace
 To remove objects the function rm is available: > rm(x, y, z, ink, junk, temp,
foo, bar)
 All saved objects are written to a file called .RData in the current directory,
and the command lines used in the session are saved to a file called .Rhistory.
 When R is started at later time from the same directory it reloads the
workspace from this file. At the same time the associated commands history
is reloaded.
LET US ALL BE ON THE RStudio CONSOLE
Built-in help system
 At the program's command prompt you can use any of the following:
 help.start() # general help
 apropos("solve") # list all functions whose names contain the string "solve"
 example(solve) # show an example of the function solve
 > help(solve)
 An alternative is
 > ?solve
 > help("[[")
 Help is available in HTML format by running
 > help.start()
 > help.search("solve")
Standard commands for managing
your workspace.
 ls() # list the objects in the current workspace
 setwd(mydirectory) # change to mydirectory
setwd("c:/docs/mydir") # note / instead of \ on Windows
setwd("/usr/rob/mydir") # on linux
 # view and set options for the session
help(options) # learn about available options
options() # view current option settings
options(digits=3) # number of digits to print on output
 # work with your previous commands
history() # display last 25 commands
history(max.show=Inf) # display all previous commands
 # save your command history
savehistory(file="myfile") # default is ".Rhistory"
# recall your command history
loadhistory(file="myfile") # default is ".Rhistory"
 # save the workspace to the file .RData in the cwd
save.image()
# save specific objects to a file
# if you don't specify the path, the cwd is assumed
save(x, y, file="myfile.RData") # x and y stand for the objects to save
 # load a workspace into the current session
# if you don't specify the path, the cwd is assumed
load("myfile.RData")
 q() # quit R. You will be prompted to save the workspace.
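 As a minimal sketch of the save() and load() commands above, the following writes two objects to a file, removes them from the workspace, and restores them. The object names x and y and the use of tempfile() are assumptions made for this illustration.

```r
# Hypothetical objects used only for this illustration
x <- c(1, 2, 3)
y <- "a note"

# Save the two objects to a temporary .RData file
ws_file <- tempfile(fileext = ".RData")
save(x, y, file = ws_file)

# Remove them from the workspace, then restore them from the file
rm(x, y)
load(ws_file)

print(x)
print(y)
```

 load() restores objects under their original names, so it silently overwrites any objects of the same name already in the workspace.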
Input / Output
 By default, launching R starts an interactive session with input from the keyboard and
output to the screen. However, you can have input come from a script file (a file
containing R commands) and direct output to a variety of destinations.
Input - The source( ) function runs a script in the current session. If the filename does
not include a path, the file is taken from the current working directory.
 # input a script
source("myfile")
 Output-The sink( ) function defines the direction of the output.
 # direct output to a file
 sink("myfile", append=FALSE, split=FALSE)
 # return output to the terminal
 sink()
 The append option controls whether output overwrites or adds to a file. The split option
determines if output is also sent to the screen as well as the output file.
 Here are some examples of the sink() function.
 # output directed to output.txt in current working directory. output overwrites existing
file. no output to terminal.
 sink("output.txt")
 # output directed to myfile.txt in cwd. output is appended to existing file. output also
sent to terminal.
 sink("myfile.txt", append=TRUE, split=TRUE)
 When redirecting output, use the cat( ) function to annotate the output.
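 The sink() and cat() calls described above can be combined as in the following sketch. A temporary file stands in for "output.txt"; that choice, and the variable names, are assumptions made for this illustration.

```r
# Temporary file standing in for "output.txt" (an assumption for this sketch)
out_file <- tempfile(fileext = ".txt")

sink(out_file)                            # start redirecting console output
cat("Summary of the numbers 1 to 10\n")   # annotation line written with cat()
print(summary(1:10))                      # regular output also goes to the file
sink()                                    # return output to the console

# Read the file back to confirm what was written
written <- readLines(out_file)
print(written)
```

 Always pair each sink(file) with a closing sink(); otherwise later output keeps going to the file.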
Graphs
 sink( ) will not redirect graphic output. To redirect graphic output use one of
the following functions. Use dev.off( ) to return output to the terminal.
Function                       Output to
pdf("mygraph.pdf")             pdf file
win.metafile("mygraph.wmf")    windows metafile
png("mygraph.png")             png file
jpeg("mygraph.jpg")            jpeg file
bmp("mygraph.bmp")             bmp file
postscript("mygraph.ps")       postscript file
 # example - output graph to jpeg file
jpeg("c:/mygraphs/myplot.jpg")
plot(x)
dev.off()
Packages
 Packages are collections of R functions, data, and compiled code in a well-
defined format.
 The directory where packages are stored is called the library.
 R comes with a standard set of packages. Others are available for download
and installation.
 The install.packages("packagename") command installs a package.
 Once installed, they have to be loaded into the session to be used.
 .libPaths() # get library location
library() # see all packages installed
search() # see packages currently loaded
Download and Install a Package
 We need to download and install only once.
 To use the package, invoke the library(package) command to load it into the
current session. (You need to do this once in each session, unless
you customize your environment to automatically load it each time.)
 On MS Windows:
 The command install.packages("packagename") installs a package from the
default mirror. Note the lowercase spelling: R is case sensitive.
 Then use the library(packagename) function to load it for use. (e.g.
library(boot))
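 A common pattern, sketched below, is to install a package only when it is missing and then load it. The package name "tools" is used here only because it ships with R; substitute the name of the package you need. The variable names are assumptions made for this illustration.

```r
# "tools" ships with every R installation; replace it with any CRAN package name
pkg <- "tools"

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load it for this session; character.only = TRUE lets us pass the name as a string
library(pkg, character.only = TRUE)

# Confirm that it now appears on the search path
loaded <- paste0("package:", pkg) %in% search()
print(loaded)
```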
Customizing Startup
 You can customize the R environment through a site initialization file or a
directory initialization file. R will always source the Rprofile.site file first. On
Windows, the file is in the C:\Program Files\R\R-n.n.n\etc directory. You can
also place a .Rprofile file in any directory that you are going to run R from or
in the user home directory.
 At startup, R will source the Rprofile.site file. It will then look for a
.Rprofile file to source in the current working directory. If it doesn't find it, it
will look for one in the user's home directory.
 There are two special functions you can place in these files. .First( ) will be
run at the start of the R session and .Last( ) will be run at the end of the
session.
 # Sample Rprofile.site file
# Things you might want to change
# options(papersize="a4")
# options(editor="notepad")
# options(pager="internal")
# R interactive prompt
# options(prompt="> ")
# options(continue="+ ")
# to prefer Compiled HTML help
# options(chmhelp=TRUE)
# to prefer HTML help
# options(htmlhelp=TRUE)
# General options
options(tab.width = 2)
options(width = 130)
options(graphics.record=TRUE)
.First <- function(){
library(Hmisc)
library(R2HTML)
cat("\nWelcome at", date(), "\n")
}
.Last <- function(){
cat("\nGoodbye at ", date(), "\n")
}
Basic Data Types
 Numeric
 Integer
 Complex
 Logical
 Character
Data Types
 R has a wide variety of data types including scalars, vectors (numerical,
character, logical), matrices, data frames, and lists.
 VECTORS
 a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one", "two", "three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
 Refer to elements of a vector using subscripts
 a[c(2,4)] # 2nd and 4th elements of vector
 > x = 10.5 # assign a decimal value
> x # print the value of x
[1] 10.5
> class(x) # print the class name of x
[1] "numeric"
 Furthermore, even if we assign an integer to a variable k, it is still being
saved as a numeric value.
 > k = 1
> k # print the value of k
[1] 1
> class(k) # print the class name of k
[1] "numeric"
 The fact that k is not an integer can be confirmed with the is.integer function.
 > is.integer(k) # is k an integer?
[1] FALSE
 Integer
 In order to create an integer variable in R, we invoke the as.integer function.
We can be assured that y is indeed an integer by applying
the is.integer function.
 > y = as.integer(3)
> y # print the value of y
[1] 3
> class(y) # print the class name of y
[1] "integer"
> is.integer(y) # is y an integer?
[1] TRUE
 Incidentally, we can coerce a numeric value into an integer with the
same as.integer function.
 > as.integer(3.14) # coerce a numeric value
[1] 3
 And we can parse a string for decimal values in much the same way.
 > as.integer("5.27") # coerce a decimal string
[1] 5
 On the other hand, trying to parse a non-decimal string yields NA with a warning.
 > as.integer("Joe") # coerce a non-decimal string
[1] NA
Warning message:
NAs introduced by coercion
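 Besides as.integer, R also accepts an integer literal written with an L suffix. The short sketch below shows this, and shows that division of integers still yields a numeric value.

```r
y <- 3L                        # the L suffix creates an integer directly
print(class(y))                # "integer"
print(is.integer(y))           # TRUE

print(class(3L * 2L))          # "integer": integer multiplication stays integer
print(class(3L / 2L))          # "numeric": division always returns a numeric value
```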
 Often, it is useful to perform arithmetic on logical values. Like the C
language, TRUE has the value 1, while FALSE has value 0.
 > as.integer(TRUE) # the numeric value of TRUE
[1] 1
> as.integer(FALSE) # the numeric value of FALSE
[1] 0
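 Because TRUE counts as 1 and FALSE as 0, sum() and mean() applied to a logical vector give a count and a proportion. The scores below are made-up values for illustration.

```r
scores <- c(55, 72, 90, 48, 66)   # hypothetical exam scores
passed <- scores >= 60            # FALSE TRUE TRUE FALSE TRUE

n_passed <- sum(passed)           # number of TRUE values: 3
share_passed <- mean(passed)      # proportion of TRUE values: 0.6

print(n_passed)
print(share_passed)
```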
Complex
 A complex value in R is defined via the pure imaginary value i.
 > z = 1 + 2i # create a complex number
> z # print the value of z
[1] 1+2i
> class(z) # print the class name of z
[1] "complex"
 The following returns NaN with a warning, because -1 is not a complex value.
 > sqrt(-1) # square root of -1
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
 Instead, we have to use the complex value -1 + 0i.
 > sqrt(-1+0i) # square root of -1+0i
[1] 0+1i
 An alternative is to coerce -1 into a complex value.
 > sqrt(as.complex(-1))
[1] 0+1i
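 The parts of a complex value can be pulled out with base R functions, as in this short sketch (the value 3 + 4i is chosen only for illustration).

```r
z <- 3 + 4i

print(Re(z))     # real part: 3
print(Im(z))     # imaginary part: 4
print(Mod(z))    # modulus sqrt(3^2 + 4^2): 5
print(Conj(z))   # complex conjugate: 3-4i
```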
Logical
 A logical value is often created via comparison between variables.
 > x = 1; y = 2 # sample values
> z = x > y # is x larger than y?
> z # print the logical value
[1] FALSE
> class(z) # print the class name of z
[1] "logical"
 Standard logical operations are "&" (and), "|" (or), and "!" (negation).
 > u = TRUE; v = FALSE
> u & v # u AND v
[1] FALSE
> u | v # u OR v
[1] TRUE
> !u # negation of u
[1] FALSE
Character
 A character object is used to represent string values in R. We convert objects
into character values with the as.character() function:
 > x = as.character(3.14)
> x # print the character string
[1] "3.14"
> class(x) # print the class name of x
[1] "character"
 Two character values can be concatenated with the paste function.
 > fname = "Joe"; lname ="Smith"
> paste(fname, lname)
[1] "Joe Smith"
 However, it is often more convenient to create a readable string with
the sprintf function, which has a C language syntax.
 > sprintf("%s has %d dollars", "Sam", 100)
[1] "Sam has 100 dollars"
 To extract a substring, we apply the substr function. Here is an example
showing how to extract the substring between the third and twelfth positions
in a string.
 > substr("Mary has a little lamb.", start=3, stop=12)
[1] "ry has a l"
 And to replace the first occurrence of the word "little" by another word "big"
in the string, we apply the sub function.
 > sub("little", "big", "Mary has a little lamb.")
[1] "Mary has a big lamb."
 More functions for string manipulation can be found in the R documentation.
 > help("sub")
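 A few more base R string functions, sketched on the same sentence. Note that sub replaces only the first match, whereas gsub replaces every match. The variable names are assumptions made for this illustration.

```r
s <- "Mary has a little lamb."

n_chars <- nchar(s)                # number of characters: 23
upper   <- toupper(s)              # "MARY HAS A LITTLE LAMB."
all_a   <- gsub("a", "A", s)       # every "a" replaced: "MAry hAs A little lAmb."
words   <- strsplit(s, " ")[[1]]   # "Mary" "has" "a" "little" "lamb."

print(n_chars)
print(upper)
print(all_a)
print(words)
```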
Vector
 A vector is a sequence of data elements of the same basic type. Members in a
vector are officially called components.
 Here is a vector containing three numeric values 2, 3 and 5.
 > c(2, 3, 5)
[1] 2 3 5
 And here is a vector of logical values.
 > c(TRUE, FALSE, TRUE, FALSE, FALSE)
[1] TRUE FALSE TRUE FALSE FALSE
 A vector can contain character strings.
 > c("aa", "bb", "cc", "dd", "ee")
[1] "aa" "bb" "cc" "dd" "ee"
 The number of members in a vector is given by the length function.
 > length(c("aa", "bb", "cc", "dd", "ee"))
[1] 5
Combining Vectors
 Vectors can be combined via the function c. For example, the following two
vectors n and s are combined into a new vector containing elements from both
vectors.
 > n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> c(n, s)
[1] "2" "3" "5" "aa" "bb" "cc" "dd" "ee"
 Value Coercion
 In the code snippet above, notice how the numeric values are being coerced
into character strings when the two vectors are combined. This is necessary
so as to maintain the same primitive data type for members in the same
vector.
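 Coercion follows a fixed order: logical values become numeric when mixed with numbers, and everything becomes character when a string is present. A short sketch, with illustrative variable names:

```r
mixed1 <- c(TRUE, 2)         # logical + numeric
print(class(mixed1))         # "numeric"
print(mixed1)                # 1 2

mixed2 <- c(TRUE, 2, "a")    # logical + numeric + character
print(class(mixed2))         # "character"
print(mixed2)                # "TRUE" "2" "a"
```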
Vector Arithmetic
 Arithmetic operations on vectors are performed member by member, i.e.,
memberwise.
 For example, suppose we have two vectors a and b.
 > a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)
 Then, if we multiply a by 5, we would get a vector with each of its members
multiplied by 5.
 > 5 * a
[1] 5 15 25 35
 And if we add a and b together, the sum would be a vector whose members are the
sum of the corresponding members from a and b.
 > a + b
[1] 2 5 9 15
 Similarly for subtraction, multiplication and division, we get new vectors via
memberwise operations.
 > a - b
[1] 0 1 1 -1
> a * b
[1] 1 6 20 56
> a / b
[1] 1.000 1.500 1.250 0.875
 Recycling Rule
 If two vectors are of unequal length, the shorter one will be recycled in order to
match the longer vector. For example, the following vectors u and v have different
lengths, and their sum is computed by recycling values of the shorter vector u.
 > u = c(10, 20, 30)
> v = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
> u + v
[1] 11 22 33 14 25 36 17 28 39
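 When the longer length is not a multiple of the shorter one, R still recycles but issues a warning. The sketch below captures that warning and then shows the result; the vector w is an assumption made for this illustration.

```r
u <- c(10, 20, 30)
w <- c(1, 2, 3, 4)    # length 4 is not a multiple of length 3

# Capture the warning message instead of letting it print
msg <- tryCatch(u + w, warning = function(cond) conditionMessage(cond))
print(msg)

# The sum is still computed: u is recycled as 10, 20, 30, 10
result <- suppressWarnings(u + w)
print(result)         # 11 22 33 14
```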
Vector Index
 We retrieve values in a vector by giving an index inside the single square
bracket "[]" operator.
 For example, the following shows how to retrieve a vector member. Since the
vector index is 1-based, we use the index position 3 for retrieving the third
member.
 > s = c("aa", "bb", "cc", "dd", "ee")
> s[3]
[1] "cc"
 Unlike other programming languages, the square bracket operator returns
more than just individual members. In fact, the result of the square bracket
operator is another vector, and s[3] is a vector slice containing a single
member "cc".
 Negative Index
 If the index is negative, it would strip the member whose position has the
same absolute value as the negative index. For example, the following
creates a vector slice with the third member removed.
 > s[-3]
[1] "aa" "bb" "dd" "ee"
 Out-of-Range Index
 If an index is out-of-range, a missing value will be reported via the
symbol NA.
 > s[10]
[1] NA
Numeric Index Vector
 A new vector can be sliced from a given vector with a numeric index vector, which consists
of member positions of the original vector to be retrieved.
 Here is how to retrieve a vector slice containing the second and third members of a
given vector s.
 > s = c("aa", "bb", "cc", "dd", "ee")
> s[c(2, 3)]
[1] "bb" "cc"
 Duplicate Indexes
 The index vector allows duplicate values. Hence the following retrieves a member twice in
one operation.
 > s[c(2, 3, 3)]
[1] "bb" "cc" "cc"
Out-of-Order Indexes
 The index vector can even be out-of-order. Here is a vector slice with the order of first
and second members reversed.
 > s[c(2, 1, 3)]
[1] "bb" "aa" "cc"
 Range Index
 To produce a vector slice between two indexes, we can use the colon operator ":". This
can be convenient for situations involving large vectors.
 > s[2:4]
[1] "bb" "cc" "dd"
 More information for the colon operator is available in the R documentation.
 > help(":")
 Logical Index Vector
 A new vector can be sliced from a given vector with a logical index vector,
which has the same length as the original vector. Its members are TRUE if the
corresponding members in the original vector are to be included in the slice,
and FALSE if otherwise.
 For example, consider the following vector s of length 5.
 > s = c("aa", "bb", "cc", "dd", "ee")
 To retrieve the second and fourth members of s, we define a logical
vector L of the same length, and have its second and fourth members set
as TRUE.
 > L = c(FALSE, TRUE, FALSE, TRUE, FALSE)
> s[L]
[1] "bb" "dd"
 The code can be abbreviated into a single line.
 > s[c(FALSE, TRUE, FALSE, TRUE, FALSE)]
[1] "bb" "dd"
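 In practice the logical index vector is usually produced by a comparison rather than typed by hand. A minimal sketch with made-up numbers:

```r
v <- c(3, 8, 1, 9, 4)

keep <- v > 3                 # FALSE TRUE FALSE TRUE TRUE
print(v[keep])                # members greater than 3: 8 9 4
print(v[v > 3 & v < 9])       # two conditions combined with &: 8 4
print(which(v > 3))           # positions of the TRUE values: 2 4 5
```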
 Named Vector Members
 We can assign names to vector members. For example, the following variable v is
a character string vector with two members.
 > v = c("Mary", "Sue")
> v
[1] "Mary" "Sue"
 We now name the first member as First, and the second as Last.
 > names(v) = c("First", "Last")
> v
First Last
"Mary" "Sue"
 Then we can retrieve the first member by its name.
 > v["First"]
[1] "Mary"
 Furthermore, we can reverse the order with a character string index vector.
 > v[c("Last", "First")]
Last First
"Sue" "Mary"
 MATRIX
 A matrix is a collection of data elements arranged in a two-dimensional rectangular layout.
The following is an example of a matrix with 2 rows and 3 columns.
 A = [ 2 4 3
       1 5 7 ]
 We reproduce a memory representation of the matrix in R with the matrix function. The data
elements must be of the same basic type.
 > A = matrix(
 + c(2, 4, 3, 1, 5, 7), # the data elements
 + nrow=2, # number of rows
 + ncol=3, # number of columns
 + byrow = TRUE) # fill matrix by rows
 > A # print the matrix
 [,1] [,2] [,3]
 [1,] 2 4 3
 [2,] 1 5 7
 An element at the mth row, nth column of A can be accessed by the
expression A[m, n].
 > A[2, 3] # element at 2nd row, 3rd column
 [1] 7
 The entire mth row of A can be extracted as A[m, ].
 > A[2, ] # the 2nd row
 [1] 1 5 7
 Similarly, the entire nth column of A can be extracted as A[ ,n].
 > A[ ,3] # the 3rd column
 [1] 3 7
 We can also extract more than one row or column at a time.
 > A[ ,c(1,3)] # the 1st and 3rd columns
 [,1] [,2]
 [1,] 2 3
 [2,] 1 7
 If we assign names to the rows and columns of the matrix, then we can access the
elements by name.
 > dimnames(A) = list(
 + c("row1", "row2"), # row names
 + c("col1", "col2", "col3")) # column names
 > A # print A
 col1 col2 col3
 row1 2 4 3
 row2 1 5 7
 > A["row2", "col3"] # element at 2nd row, 3rd column
 [1] 7
 Matrix Construction
 There are various ways to construct a matrix. When we construct a matrix
directly with data elements, the matrix content is filled along the column
orientation by default. For example, in the following code snippet, the
content of B is filled along the columns consecutively.
 > B = matrix(
 + c(2, 4, 3, 1, 5, 7),
 + nrow=3,
 + ncol=2)
 > B # B has 3 rows and 2 columns
 [,1] [,2]
 [1,] 2 1
 [2,] 4 5
 [3,] 3 7
 [3,] 3 7 2

 Then we can combine the columns of B and C with cbind.
 > cbind(B, C)
 [,1] [,2] [,3]
 [1,] 2 1 7
 [2,] 4 5 4
 Similarly, we can combine the rows of two matrices if they have the same number
of columns with the rbind function.
 > D = matrix(
 + c(6, 2),
 + nrow=1,
 + ncol=2)

 > D # D has 2 columns
 [,1] [,2]
 [1,] 6 2
Deconstruction
 We can deconstruct a matrix by applying the c function, which combines all
column vectors into one.
 > c(B)
 [1] 2 4 3 1 5 7
Transpose
 We construct the transpose of a matrix by interchanging its columns and rows
with the function t .
 > t(B) # transpose of B
 [,1] [,2] [,3]
 [1,] 2 4 3
 [2,] 1 5 7
Lists
 A list is a generic vector containing other objects.
 For example, the following variable x is a list containing copies of three vectors n, s,
b, and a numeric value 3.
 > n = c(2, 3, 5)
 > s = c("aa", "bb", "cc", "dd", "ee")
 > b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
 > x = list(n, s, b, 3) # x contains copies of n, s, b
 List Slicing
 We retrieve a list slice with the single square bracket "[]" operator. The following is a
slice containing the second member of x, which is a copy of s.
 > x[2]
 [[1]]
 [1] "aa" "bb" "cc" "dd" "ee"
 With an index vector, we can retrieve a slice with multiple members. Here is a
slice containing the second and fourth members of x.
 > x[c(2, 4)]
 [[1]]
 [1] "aa" "bb" "cc" "dd" "ee"
 [[2]]
 [1] 3
List Member Reference
 In order to reference a list member directly, we have to use the double
square bracket "[[]]" operator. The following object x[[2]] is the second
member of x. In other words, x[[2]] is a copy of s, but is not a slice containing
s or its copy.
 > x[[2]]
 [1] "aa" "bb" "cc" "dd" "ee"
 We can modify its content directly.
 > x[[2]][1] = "ta"
 > x[[2]]
 [1] "ta" "bb" "cc" "dd" "ee"
 > s
 [1] "aa" "bb" "cc" "dd" "ee" # s is unaffected
Named List Members
 We can assign names to list members, and reference them by names instead
of numeric indexes.
 For example, in the following, v is a list of two members,
named "bob" and "john".
 > v = list(bob=c(2, 3, 5), john=c("aa", "bb"))
> v
$bob
[1] 2 3 5
$john
[1] "aa" "bb"
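 Named members can then be referenced by name, either with the double square bracket operator or with the "$" operator. A short sketch using the same list v:

```r
v <- list(bob = c(2, 3, 5), john = c("aa", "bb"))

print(v[["bob"]])   # the member itself: 2 3 5
print(v$john)       # same idea with the $ operator: "aa" "bb"
print(v["bob"])     # single brackets return a list slice, not the member
print(names(v))     # "bob" "john"
```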
Data Frame
 A data frame is used for storing data tables. It is a list of vectors of equal
length. For example, the following variable df is a data frame containing
three vectors n, s, b.
 > n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b) # df is a data frame
 Built-in Data Frame
 We use built-in data frames in R for our tutorials. For example, here is a
built-in data frame in R, called mtcars.
 > mtcars
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
............
 The top line of the table, called the header, contains the column names.
Each horizontal line afterward denotes a data row, which begins with the
name of the row, and then followed by the actual data. Each data member of
a row is called a cell.
 To retrieve data in a cell, we would enter its row and column coordinates in
the single square bracket "[]" operator. The two coordinates are separated by
a comma. In other words, the coordinates begin with the row position, followed
by a comma, and end with the column position. The order is
important.
 Here is the cell value from the first row, second column of mtcars.
 > mtcars[1, 2]
[1] 6
 Moreover, we can use the row and column names instead of the numeric
coordinates.
 > mtcars["Mazda RX4", "cyl"]
[1] 6
 Lastly, the number of data rows in the data frame is given by
the nrow function.
 > nrow(mtcars) # number of data rows
[1] 32
 And the number of columns of a data frame is given by the ncol function.
 > ncol(mtcars) # number of columns
[1] 11
 Further details of the mtcars data set are available in the R documentation.
 > help(mtcars)
 Data Frame Column Vector
 We reference a data frame column with the double square bracket "[[]]" operator.
 For example, to retrieve the ninth column vector of the built-in data set mtcars,
we write mtcars[[9]].
 > mtcars[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
 We can retrieve the same column vector by its name.
 > mtcars[["am"]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
 We can also retrieve with the "$" operator in lieu of the double square bracket
operator.
 > mtcars$am
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
 Yet another way to retrieve the same column vector is to use the single
square bracket "[]" operator. We prepend the column name with a comma
character, which signals a wildcard match for the row position.
 > mtcars[,"am"]
[1] 1 1 1 0 0 0 0 0 0 0 0 ..
 Data Frame Column Slice
 We retrieve a data frame column slice with the single square bracket "[]" operator.
 Numeric Indexing
 The following is a slice containing the first column of the built-in data set mtcars.
 > mtcars[1]
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
............
 Name Indexing
 We can retrieve the same column slice by its name.
 > mtcars["mpg"]
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
............
 To retrieve a data frame slice with the two columns mpg and hp, we pack the column names
in an index vector inside the single square bracket operator.
 > mtcars[c("mpg", "hp")]
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
............
 Data Frame Row Slice
 We retrieve rows from a data frame with the single square bracket operator, just like what we
did with columns. However, in addition to an index vector of row positions, we append an
extra comma character. This is important, as the extra comma signals a wildcard match for
the second coordinate for column positions.
 Numeric Indexing
 For example, the following retrieves a row record of the built-in data set mtcars. Please
notice the extra comma in the square bracket operator, and it is not a typo. It states that the
1974 Camaro Z28 has a gas mileage of 13.3 miles per gallon, and an eight cylinder 245 horse
power engine, ..., etc.
 > mtcars[24,]
mpg cyl disp hp drat wt ...
Camaro Z28 13.3 8 350 245 3.73 3.84 ...
 To retrieve more than one row, we use a numeric index vector.
 > mtcars[c(3, 24),]
mpg cyl disp hp drat wt ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
Camaro Z28 13.3 8 350 245 3.73 3.84 ...
 Name Indexing
 We can retrieve a row by its name.
 > mtcars["Camaro Z28",]
mpg cyl disp hp drat wt ...
Camaro Z28 13.3 8 350 245 3.73 3.84 ...
 Logical Indexing
 Lastly, we can retrieve rows with a logical index vector. In the following vector L, the
member value is TRUE if the car has automatic transmission, and FALSE if otherwise.
 > L = mtcars$am == 0
> L
[1] FALSE FALSE FALSE TRUE ...
 Here is the list of vehicles with automatic transmission.
 > mtcars[L,]
mpg cyl disp hp drat wt ...
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 ...
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 ...
............
 And here is the gas mileage data for automatic transmission.
 > mtcars[L,]$mpg
[1] 21.4 18.7 18.1 14.3 24.4 ...
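 Row and column selection can be combined in one step. The sketch below picks the automatic-transmission cars with eight cylinders and keeps only two columns; the object names are assumptions made for this illustration, and subset() is shown as an equivalent alternative.

```r
# Rows: automatic transmission (am == 0) AND eight cylinders; columns: mpg and hp
auto_v8 <- mtcars[mtcars$am == 0 & mtcars$cyl == 8, c("mpg", "hp")]
print(nrow(auto_v8))
print(head(auto_v8, n = 3))

# The same selection written with subset()
same_rows <- subset(mtcars, am == 0 & cyl == 8, select = c(mpg, hp))
```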
Useful Functions
 length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names
c(object, object,...) # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows
object # prints the object
ls() # list current objects
rm(object) # delete an object
newobject <- edit(object) # edit copy and save as newobject
fix(object) # edit in place
Sources of Data to R
Importing Data
 # first row contains variable names, comma is the separator
# the variable id is used for the row names
 mydata <- read.table("c:/mydata.csv", header=TRUE,
sep=",", row.names="id")
 # read in the first worksheet from the workbook myexcel.xlsx
# first row contains variable names
 library(xlsx)
mydata <- read.xlsx("c:/myexcel.xlsx", 1)
# read in the worksheet named mysheet
mydata <- read.xlsx("c:/myexcel.xlsx", sheetName = "mysheet")
Keyboard Input
 # create a data frame from scratch
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
weight <- c(160, 110, 220)
mydata <- data.frame(age,gender,weight)
You can also use R's built in spreadsheet to enter the data
interactively, as in the following example
 # enter data using editor
mydata <- data.frame(age=numeric(0), gender=character(0),
weight=numeric(0))
mydata <- edit(mydata)
# note that without the assignment in the line above, the edits are not
saved!
Importing from Database
 # RODBC Example
# import 2 tables (Crime and Punishment) from a DBMS
# into R data frames (and call them crimedat and pundat)
library(RODBC)
myconn <-odbcConnect("mydsn", uid="Rob", pwd="aardvark")
crimedat <- sqlFetch(myconn, "Crime")
pundat <- sqlQuery(myconn, "select * from Punishment")
close(myconn)
Exporting Data
 To A Tab Delimited Text File
 write.table(mydata, "c:/mydata.txt", sep="\t")
 To an Excel Spreadsheet
 library(xlsx)
write.xlsx(mydata, "c:/mydata.xlsx")
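 A quick way to check an export is to read the file straight back in. The sketch below does this with a comma-separated file in a temporary location; the data values, the id column and the file name are assumptions made for this illustration.

```r
# Small hypothetical data set with an id column
mydata <- data.frame(id = c("a", "b", "c"),
                     age = c(25, 30, 56),
                     weight = c(160, 110, 220))

# Export to a temporary CSV file, without R's automatic row numbers
csv_file <- tempfile(fileext = ".csv")
write.csv(mydata, csv_file, row.names = FALSE)

# Import it again, using the id column as row names
back <- read.table(csv_file, header = TRUE, sep = ",", row.names = "id")
print(back)
```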
Viewing Data
 # list objects in the working environment
ls()
 # list the variables in mydata
names(mydata)
 # list the structure of mydata
str(mydata)
 # list levels of factor v1 in mydata
levels(mydata$v1)
 # dimensions of an object
dim(object)
 # class of an object (numeric, matrix, data frame, etc)
class(object)
 # print mydata
mydata
 # print first 10 rows of mydata
head(mydata, n=10)
 # print last 5 rows of mydata
tail(mydata, n=5)
Value Labels
 You can use the factor function to create your own value labels
 # variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue, 3=green
mydata$v1 <- factor(mydata$v1,
levels = c(1,2,3),
labels = c("red", "blue", "green"))
 # variable y is coded 1, 3 or 5
# we want to attach value labels 1=Low, 3=Medium, 5=High
mydata$y <- ordered(mydata$y,
levels = c(1,3, 5),
labels = c("Low", "Medium", "High"))
 Use the factor() function for nominal data and the ordered() function
for ordinal data. R statistical and graphic functions will then treat the data
appropriately.
 Note: factor and ordered are used the same way, with the same arguments.
The former creates factors and the latter creates ordered factors.
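 The difference between the two can be seen in a small sketch. The coded values below are made up for illustration: comparisons such as "<" are meaningful for an ordered factor but not for a plain factor.

```r
# Nominal data: codes 1, 2, 3 labelled as colours
v1 <- factor(c(1, 3, 2, 1),
             levels = c(1, 2, 3),
             labels = c("red", "blue", "green"))
print(v1)           # red green blue red
print(table(v1))    # counts per label

# Ordinal data: codes 1, 3, 5 labelled Low < Medium < High
y <- ordered(c(1, 5, 3),
             levels = c(1, 3, 5),
             labels = c("Low", "Medium", "High"))
print(y[1] < y[2])  # Low < High is TRUE
```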
Missing Data
 In R, missing values are represented by the symbol NA (not available).
Impossible values (e.g., dividing zero by zero) are represented by the
symbol NaN (not a number). Dividing a nonzero number by zero gives Inf or -Inf.
 Testing for Missing Values
 is.na(x) # returns TRUE if x is missing
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)
Recoding Values to Missing
 # recode 99 to missing for variable v1
# select rows where v1 is 99 and recode column v1
mydata$v1[mydata$v1==99] <- NA
 Excluding Missing Values from Analyses
 Arithmetic functions on missing values yield missing values.
 x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2
 The function complete.cases() returns a logical vector indicating which cases
are complete.
 # list rows of data that have missing values
mydata[!complete.cases(mydata),]
 The function na.omit() returns the object with listwise deletion of missing
values
 # create new dataset without missing data
newdata <- na.omit(mydata)
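 The following sketch pulls these pieces together on a tiny made-up data frame. The object names are assumptions made for this illustration.

```r
print(0 / 0)         # NaN: an impossible value
print(1 / 0)         # Inf: not missing, just infinite
print(is.na(NaN))    # TRUE: NaN also counts as missing
print(is.nan(NA))    # FALSE: NA is not NaN

# Hypothetical data with one missing value in each of rows 2 and 3
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

complete <- complete.cases(df)   # TRUE FALSE FALSE
cleaned  <- na.omit(df)          # keeps only the complete rows
print(complete)
print(cleaned)
```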
Date Values
 Dates are represented as the number of days since 1970-01-01, with negative
values for earlier dates
# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]
 Sys.Date( ) returns today's date.
date() returns the current date and time
 # print today's date
today <- Sys.Date()
format(today, format="%B %d %Y") # e.g. "June 20 2007"
Date Conversion
 Character to Date
 You can use the as.Date( ) function to convert character data to dates. The format
is as.Date(x, "format"), where x is the character data and format gives the appropriate
format.
 # convert date info in format 'mm/dd/yyyy'
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")
 The default format is yyyy-mm-dd
 mydates <- as.Date(c("2007-06-22", "2004-02-13"))
 Date to Character
 You can convert dates to character data using the as.character( ) function
 # convert dates to character data
strDates <- as.character(dates)
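 The date handling described above can be sketched in one short script. The dates are the ones used in the slides; the extra date 1970-01-10 is only there to show the day count that R stores internally.

```r
# Convert strings in the default yyyy-mm-dd format to Date objects
mydates <- as.Date(c("2007-06-22", "2004-02-13"))

# Dates are stored as day counts, so subtraction gives a time difference
days <- mydates[1] - mydates[2]
as.numeric(days)                     # the difference as a plain number of days

# The underlying number of days since 1970-01-01
as.numeric(as.Date("1970-01-10"))

# Parse a non-default format, then write the dates out in another format
dates <- as.Date(c("01/05/1965", "08/16/1975"), format = "%m/%d/%Y")
as.numeric(dates[1])                 # negative, because the date is before 1970
format(dates, "%Y-%m-%d")
```

Numeric format codes such as %Y, %m and %d do not depend on the locale, whereas month names produced by %B do.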
Statistics with R
Data Processing Steps
Commands to calculate descriptive statistics
Statistic                      Command
 Mean                         mean(variable.name)
 Median                       median(variable.name)
 Range (minimum and maximum)  range(variable.name)
 Standard deviation           sd(variable.name)
 No. of observations          length(variable.name)
 Variance                     var(variable.name)
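 As a quick illustration, the commands can be applied to a small vector. The vector scores is a made-up example and not part of the workshop data.

```r
# Hypothetical measurements, for illustration only
scores <- c(2, 4, 4, 5, 7, 8)

mean(scores)     # arithmetic mean
median(scores)   # middle value (average of the two middle values here)
range(scores)    # smallest and largest value
sd(scores)       # sample standard deviation
length(scores)   # number of observations
var(scores)      # sample variance, the square of sd()
```

Note that sd() and var() use the sample formulas, which divide by n - 1 rather than n.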
Comparing two groups of measurements
Identifying the type of test
 One-sample test - Used when the mean of a single sample is compared with a
specific hypothesized value, for example when testing whether average human
height is 1.77 m.
 Two independent sample test - Used when measurements on two samples from two
different populations are compared. Examples include comparisons of males
and females.
 Paired-sample test - Used when two different measurements were taken on
the SAME experimental units. Examples are before-and-after studies on the
effect of medical treatments.
Example of t-test
 Null hypothesis: Average human height is 1.77 m
 Alternative hypothesis: Average human height is different from 1.77 m
 height <- c(1.43,1.75,1.85,1.74,1.65,1.83,1.91,1.52,1.92,1.83)
 t.test(height, mu = 1.77)
 One Sample t-test
 data: height
 t = -0.5205, df = 9, p-value = 0.6153
 alternative hypothesis: true mean is not equal to 1.77
 95 percent confidence interval:
 1.625646 1.860354
 sample estimates:
 mean of x
 1.743
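 Because the p-value (0.6153) is larger than 0.05, the null hypothesis cannot be rejected: the data give no evidence that average height differs from 1.77 m.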
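 The decision can also be read from the object that t.test() returns instead of from the printed output. This is a minimal sketch; the significance level alpha = 0.05 is a conventional choice assumed here, not something fixed by R.

```r
height <- c(1.43,1.75,1.85,1.74,1.65,1.83,1.91,1.52,1.92,1.83)

# Store the test result; it is a list with named components
res <- t.test(height, mu = 1.77)

res$statistic   # the t value
res$p.value     # the p-value
res$conf.int    # the 95 percent confidence interval for the mean

# Decision rule with an assumed significance level of 0.05
alpha <- 0.05
if (res$p.value < alpha) {
  decision <- "reject the null hypothesis"
} else {
  decision <- "do not reject the null hypothesis"
}
decision
```

A p-value above alpha means only that the data are compatible with a mean of 1.77 m, not that the null hypothesis has been shown to be true.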
Two independent sample t-test
 We use our example of dispersal distance in male and female butterflies. This
is your data:
 distance <- c(3,5,5,4,5,3,1,2,2,3)
 sex <- c("male","male","male","male","male","female","female","female","female","female")
 Before running the test it is important to consider your alternative
hypothesis, whether you want to run a one-tailed or two-tailed test.
 If no alternative hypothesis is specified, the command will assume a two-
tailed test.
 The two-sample t-test has a second assumption in addition to the normality of
the data:
 equal variance in the two samples.
 If the variances are assumed to be equal, this must be specified using the
argument var.equal = TRUE; otherwise t.test() uses Welch's t-test, which does
not assume equal variances, by default.
 Here we assume equal variances and perform a two-tailed test.
 t.test(distance ~ sex, var.equal = TRUE)
 Two Sample t-test
 data: distance by sex
 t = -4.0166, df = 8, p-value = 0.003859
 alternative hypothesis: true difference in means is not equal to 0
 95 percent confidence interval:
 -3.4630505 -0.9369495
 sample estimates:
 mean in group female mean in group male
 2.2 4.4
 Thus, the dispersal distance of male butterflies differs significantly from
that of female butterflies.
 We can also specify a one-sided alternative hypothesis by adding the
argument alternative = "less" or alternative = "greater", depending on which
tail is to be tested:
 t.test(distance ~ sex, var.equal = TRUE, alternative="greater")
 Two Sample t-test
 data: distance by sex
 t = -4.0166, df = 8, p-value = 0.9981
 alternative hypothesis: true difference in means is greater than 0
 95 percent confidence interval:
 -3.218516 Inf
 sample estimates:
 mean in group female mean in group male
 2.2 4.4
 The result of this test indicates that female dispersal distance is not
significantly greater than male dispersal distance. The groups are compared
in alphabetical order, so the difference tested is female minus male.
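 The effect of the arguments alternative and var.equal can be compared side by side. This sketch puts the butterfly data into a data frame (the name butterflies is chosen here for illustration) and runs the two-sided test, the opposite one-sided test and the default Welch test.

```r
# The butterfly data from the slides, collected in a data frame
butterflies <- data.frame(
  distance = c(3,5,5,4,5,3,1,2,2,3),
  sex = factor(c(rep("male", 5), rep("female", 5)))
)

# Two-sided test assuming equal variances
two_sided <- t.test(distance ~ sex, data = butterflies, var.equal = TRUE)

# One-sided test: is the female mean LESS than the male mean?
less <- t.test(distance ~ sex, data = butterflies, var.equal = TRUE,
               alternative = "less")

# Default behaviour without var.equal: Welch's t-test
welch <- t.test(distance ~ sex, data = butterflies)

two_sided$p.value
less$p.value       # half of the two-sided p-value, because t is negative
welch$parameter    # Welch degrees of freedom, at most the pooled value of 8
```

Because the factor levels are sorted alphabetically, the difference is female minus male, so "less" is the tail that matches the observed direction.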
Paired sample t-test
 As an example of a paired test, you investigate whether the sleep of students
is affected by an exam. You ask 6 students how long they sleep the night
before an exam and the night after an exam. These are the answers you get:
 sleep.before <- c(4,2,7,4,3,2)
 sleep.after <- c(5,1,3,6,2,1)
 Here you simply add the argument paired=TRUE to the command from the
two-sample test above.
 t.test(sleep.before, sleep.after, paired=TRUE)
Output:
 Paired t-test
 data: sleep.before and sleep.after
 t = 0.7906, df = 5, p-value = 0.465
 alternative hypothesis: true difference in means is not equal to 0
 95 percent confidence interval:
 -1.501038 2.834372
 sample estimates:
 mean of the differences
 0.6666667
 Well, this test is NOT significant, and thus the data do not support an effect
of exams on students' sleeping time. But maybe you forgot about the party
after the exam?
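 A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The following sketch checks this with the sleep data from the slides; the object names paired, diffs and one_sample are chosen here for illustration.

```r
sleep.before <- c(4,2,7,4,3,2)
sleep.after  <- c(5,1,3,6,2,1)

# Paired test as in the slides
paired <- t.test(sleep.before, sleep.after, paired = TRUE)

# The same test written as a one-sample test on the differences
diffs <- sleep.before - sleep.after
one_sample <- t.test(diffs, mu = 0)

mean(diffs)            # mean of the differences
paired$statistic
one_sample$statistic   # identical t value
paired$p.value
```

This is why the pairing matters: the test uses only the variation of the differences, not the variation between students.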
Correlation analysis
 Pearson Correlation - This test measures the strength of the linear
relationship between two variables using a coefficient that runs from -1
(perfect negative correlation) to 1 (perfect positive correlation). A value of
zero indicates no linear correlation.
 cor.test (iris$Sepal.Length,iris$Petal.Length)
 Pearson's product-moment correlation
 data: iris$Sepal.Length and iris$Petal.Length
 t = 21.646, df = 148, p-value < 2.2e-16
 alternative hypothesis: true correlation is not equal to 0
 95 percent confidence interval:
 0.8270363 0.9055080
 sample estimates:
 cor
 0.8717538
 This data shows a highly-significant (P-value < 2.2e-16) and strongly positive
(0.87) correlation between these two variables.
 In this case, the P-value is used to reject the null hypothesis that the true
correlation is equal to zero.
Spearman Correlation
 A non-parametric alternative to Pearson’s r is Spearman's rank correlation
coefficient, or Spearman’s rho. Like Pearson’s r, Spearman’s rho determines
the level of correlation of two variables ranging from -1 to 1. The difference
between the two measures is that Spearman uses the rank-order of the data
rather than the raw values.
 cor.test (iris$Sepal.Length,iris$Petal.Length, method="spearman")
 Spearman's rank correlation rho
 data: iris$Sepal.Length and iris$Petal.Length
 S = 66429.35, p-value < 2.2e-16
 alternative hypothesis: true rho is not equal to 0
 sample estimates:
 rho
 0.8818981
 Note that this test produced similar, but not identical, results compared with
Pearson's r.
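 The relationship between the two coefficients can be checked directly: Spearman's rho is Pearson's r computed on the ranks of the data. A minimal sketch using the same iris columns (the names x, y, pearson, spearman and by_ranks are chosen here):

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length

pearson  <- cor(x, y)                       # Pearson's r on the raw values
spearman <- cor(x, y, method = "spearman")  # Spearman's rho
by_ranks <- cor(rank(x), rank(y))           # Pearson's r on the ranks

pearson
spearman
all.equal(spearman, by_ranks)               # the two agree
```

cor() returns only the coefficient, while cor.test() additionally reports the test statistic and the p-value.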
Cross-tabulation and the χ2 test
 Basic contingency tables would have two categorical variables. In many cases
we may wish to test whether the two grouping variables are independent.
One of the most common ways to analyze contingency tables is with the χ2
test (chi-square test). The χ2 test works by first calculating, for every cell,
the squared difference between the observed (O) and expected (E) values divided
by the expected value, and summing over all cells: χ2 = Σ (O - E)^2 / E
 The result of this calculation, the so-called χ2 score, is then compared to
the χ2 distribution to calculate a p-value to determine if the observed values
differ significantly from the expected values.
 We will use an example of eye color counts in two different groups of flies. The dataset can
be found in 3.9_flies_eyes_color.csv.
 We begin by loading the data and creating a contingency table.
 flyeyes <- read.csv("3.9_flies_eyes_color.csv", header = TRUE)
 tab <- table(flyeyes$Eyecolor, flyeyes$Group)
 tab
 A B
 Red 34 41
 White 16 9
 In this sample, the ratio between red and white eyes differs between group A and group B.
 We will use a chi-squared test to determine whether the data are more compatible with the
null hypothesis that the variables of eye color and group are independent of each other
or with the alternative hypothesis that eye color and group are not independent.
 chisq.test(tab)
 Pearson's Chi-squared test with Yates' continuity correction
 data: tab
 X-squared = 1.92, df = 1, p-value = 0.1659
 In this case, based on a chi-squared value of 1.92 and 1 degree of freedom,
we calculated a P-value of 0.1659.
 Because this P-value is larger than 0.05, we cannot reject the null hypothesis: the data provide no evidence that eye color and group are dependent.
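 The reported statistic can be reproduced by hand from the four counts. This sketch enters the table directly as a matrix, so the CSV file is not needed, and applies Yates' continuity correction, which chisq.test() uses by default for 2 x 2 tables. The object names are chosen here for illustration.

```r
# The contingency table from the slides, entered column by column
tab <- matrix(c(34, 16, 41, 9), nrow = 2,
              dimnames = list(Eyecolor = c("Red", "White"),
                              Group = c("A", "B")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Chi-squared score without and with Yates' continuity correction
chi_plain <- sum((tab - expected)^2 / expected)
chi_yates <- sum((abs(tab - expected) - 0.5)^2 / expected)

# p-value from the chi-squared distribution with 1 degree of freedom
p_yates <- pchisq(chi_yates, df = 1, lower.tail = FALSE)

chi_yates
p_yates
chisq.test(tab)                   # same statistic as chi_yates
chisq.test(tab, correct = FALSE)  # same statistic as chi_plain
```

The degrees of freedom are (rows - 1) * (columns - 1), which is 1 for a 2 x 2 table.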
Linear Models
 Linear models are a large family of statistical analyses that relate a
continuous response to one or several explanatory variables.
 Explanatory variables can be grouping factors or continuous or a combination
of both
One-way analysis of variance (ANOVA)
 Tests whether means of more than two groups are the same, for example
whether fruit production differs among five populations of a plant species.
 If there are only two groups, a t-test is the way to proceed.
 ANOVA relates variance within groups to variance between groups.
 The analysis does not, however, tell you which groups are significantly
different from each other. For this purpose a Tukey test can be applied.
Two-way ANOVA
 This analysis assesses the influence of two grouping factors on group
means, for example, whether irrigation and fertilization have an effect on
plant growth.
 Importantly, two-way ANOVA can also analyze whether the two factors
interact, in the example, whether the effect of irrigation depends on the
fertilizer level (or the other way around). This is called a statistical
interaction.
 The same methods can also be applied to studies with more than two
grouping factors (multi-way ANOVA).
Linear regression
 Linear regression analyzes to what extent changes in a continuous
explanatory variable result in changes in the response variable, for example
whether larger females cause longer male courtship behavior. If a causal
relationship cannot be assumed a correlation analysis should be used. This
type of analysis can also be conducted with more than one continuous
explanatory variable (multiple regression).
Analysis of covariance (ANCOVA)
 ANCOVA allows more complicated analyses that involve effects of grouping
factors, explanatory factors and their interactions.
 An example is an analysis of whether the response to different doses of a
medication differs between male and female patients. Such more
complicated linear models can also include more than two explanatory
factors.
Defining the model
 First you must define the linear model that you want to use, using the lm()
function. Within this function a so-called formula statement defines the
relationship of the variables to each other. The response variable is always on
the left side of the tilde symbol (~) and the explanatory variable(s) are on the
right side, as in lm(response ~ explanatory, …). For instance, if we use R's
built-in airquality dataset and want to make a model to predict Ozone content in the
atmosphere using wind speed, the model definition would be as follows:
 My.model <- lm(Ozone ~ Wind, data = airquality)
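 If the explanatory variable is a factor, the formula statement will yield an ANOVA:
 airquality$Month <- as.factor(airquality$Month)
 #turns Month into a factor
 lm(Ozone ~ Month, data = airquality)
 While the following formula statement, with a continuous explanatory variable, will yield a regression analysis: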
 lm(Ozone ~ Temp, data = airquality)
 Formula statements are further used to combine explanatory variables and to
define interactions. If variables should be considered only by themselves
(additive effects), for example in a multiple regression without interaction
you connect the variables by a plus sign as in:
 lm(Ozone ~ Temp + Wind, data = airquality)
 On the other hand, if you want to consider interactions in addition to the
additive effects use an asterisk (*) between the explanatory variables, as in:
 lm(Ozone ~ Temp * Wind, data = airquality)
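 The difference between the plus sign and the asterisk shows up in the fitted coefficients. A minimal sketch with the built-in airquality data (the object names additive, interact and explicit are chosen here):

```r
# Additive effects only
additive <- lm(Ozone ~ Temp + Wind, data = airquality)

# Additive effects plus their interaction
interact <- lm(Ozone ~ Temp * Wind, data = airquality)

# The asterisk is shorthand for both main effects and the interaction term
explicit <- lm(Ozone ~ Temp + Wind + Temp:Wind, data = airquality)

names(coef(additive))   # no interaction coefficient
names(coef(interact))   # includes the coefficient "Temp:Wind"
all.equal(coef(interact), coef(explicit))

# Compare the two nested models with an F test
anova(additive, interact)
```

The colon denotes the interaction term alone, so Temp * Wind and Temp + Wind + Temp:Wind describe the same model.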
Checking assumptions with diagnostic plots
 All linear models including regression, one-way ANOVA and ANCOVA have the
following assumptions:
 1. The experimental units are independent and sampled at random. The
independence assumption depends heavily on the experimental design.
 2. The residuals have constant variance across values of explanatory
variables.
 3. The residuals, i.e. the differences between the observed values of a
response variable and the values fitted by the model, are normally distributed
with a mean of zero.
 My.model <- lm(Ozone ~ Wind, data = airquality)
 par(mfrow=c(1,2))
 plot(My.model, which = c(1,2))
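 The residuals shown in the diagnostic plots can also be inspected numerically. This is a small supplementary sketch: the Shapiro-Wilk test is an addition that is not part of the slides, and it complements the plots rather than replacing them.

```r
My.model <- lm(Ozone ~ Wind, data = airquality)

# Residuals: observed values minus the values fitted by the model
res <- residuals(My.model)

length(res)   # rows with a missing Ozone value were dropped by lm()
mean(res)     # practically zero for a least-squares fit with an intercept

# Fitted values, the x-axis of the residuals-versus-fitted plot
head(fitted(My.model))

# Optional formal check of normality (an addition, not from the slides)
shapiro.test(res)
```

In the first plot, a funnel shape suggests non-constant variance; in the second (normal Q-Q) plot, points far from the line suggest non-normal residuals.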
Analyzing and interpreting the model
 An ANOVA table shows how much variation in the response is explained by the
explanatory factors. To get the ANOVA table, use the command
anova(My.model), where My.model is the object that stores the defined model
 airquality$Month <- as.factor(airquality$Month)
 #turns Month into a factor
 My.Model <- lm(Ozone ~ Month, data = airquality)
 anova(My.Model)
 Analysis of Variance Table
 Response: Ozone
 Df Sum Sq Mean Sq F value Pr(>F)
 Month 4 29438 7359.5 8.5356 4.827e-06 ***
 Residuals 111 95705 862.2
 ---
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 In the output, Df stands for degrees of freedom, which is the number of
values in the final calculation of a statistic that are free to vary. The F value
is the test statistic, calculated as the ratio between the explained and
the unexplained variance (the mean squares). Pr(>F) is the corresponding p-value:
the probability of obtaining an F statistic at least this large if the null
hypothesis (no effect of the explanatory variable on the response variable)
were true.
 The command summary(model) will show the parameters estimated by the
model, for example the slope of the regression for regression analyses or the
difference between group means for ANOVA.
 My.model <- lm(Ozone ~ Wind, data = airquality)
 summary(My.model)
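 Individual numbers can be pulled out of the summary object for further use. A minimal sketch; the object names s, coefs and slope, and the wind speed of 10 used for the prediction, are chosen here for illustration.

```r
My.model <- lm(Ozone ~ Wind, data = airquality)
s <- summary(My.model)

# Coefficient table: estimate, standard error, t value and p-value
coefs <- coef(s)
coefs

# Slope of the regression line (change in Ozone per unit of Wind)
slope <- coefs["Wind", "Estimate"]
slope

# Proportion of the variation in Ozone explained by the model
s$r.squared

# Predicted Ozone for a hypothetical wind speed of 10
predict(My.model, newdata = data.frame(Wind = 10))
```

Working with coef(summary(model)) avoids copying numbers by hand from the printed output.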
Examples
 One-way ANOVA
 To test whether fruit production differs between populations of Lythrum,
fruits were counted on 10 individuals on each of 3 populations.
 fruits <- data.frame(fruits = c(24, 19, 21, 20, 23, 19, 17, 20, 23, 20, 11, 15,
11, 9, 10, 14, 12, 12, 15, 13, 13, 11, 19, 12, 15, 15, 13, 18, 17, 13), pop =
c(rep(c(1), 10), rep(c(2), 10), rep(c(3), 10)))
 fruits$pop <- as.factor(fruits$pop)
 plot(fruits ~ pop, data = fruits)
 model<-lm(fruits~pop,data=fruits)
 par(mfrow=c(1,2));plot(model, which=c(1,2))
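 The analysis can be completed with the ANOVA table and, as mentioned earlier, a Tukey test to see which populations differ. This sketch rebuilds the data so that it runs on its own; TukeyHSD() needs a model fitted with aov(), which is an assumption about how one might proceed rather than a step shown in the slides.

```r
fruits <- data.frame(
  fruits = c(24, 19, 21, 20, 23, 19, 17, 20, 23, 20,
             11, 15, 11, 9, 10, 14, 12, 12, 15, 13,
             13, 11, 19, 12, 15, 15, 13, 18, 17, 13),
  pop = factor(rep(1:3, each = 10))
)

# ANOVA table for the linear model
model <- lm(fruits ~ pop, data = fruits)
tab <- anova(model)
tab

# Group means, for reference
tapply(fruits$fruits, fruits$pop, mean)

# Tukey test: all pairwise comparisons between populations
tukey <- TukeyHSD(aov(fruits ~ pop, data = fruits))
tukey
```

Each row of the Tukey output gives the difference between two population means, its confidence interval and an adjusted p-value.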
Two-way ANOVA
 In a study on pea cultivation methods, pea production was assessed in two treatments of
irrigation (normal irrigation and drought) and in three treatments of radiation (low, medium
and high). 10 plants in each of the six combinations were considered.
 plants <- data.frame(seeds = c(39,39,39,40,40,39,41,42,40,40,39,38,41,41,40,41,40,40,41,40,
38,40,40,39,42,40,39,41,39,40,39,40,41,40,41,39,40,41,40,39,
42,40,39,39,42,40,39,39,39,39,41,38,40,39,41,42,40,40,40,41),
irrigation = c(rep(c(1),30), rep(c(2),30)), radiation = c(rep(c(1,2,3),20)))
 plants$irrigation<-as.factor(plants$irrigation)
 plants$radiation<-as.factor(plants$radiation)
 par(mfrow=c(1,2));plot(seeds~irrigation*radiation,data=plants)
 model<-lm(seeds~irrigation*radiation,data=plants)
 par(mfrow=c(1,2));plot(model, which=c(1,2))
 anova(model)
Linear Regression
 In an experiment testing whether the duration of male courtship behavior depends
on female size, 16 pairs of earwigs were observed.
 sex<-data.frame(pair = 1:16, fem_size = c(58.84, 60.37, 57, 59.86, 61.42, 60.34,
60.1, 59.63, 58.06, 61, 58.61, 60.94, 60.83, 57.7, 60, 59.09), male_court_hrs =
c(8.37, 9.88, 10.12, 8.39, 9.93, 9.69, 8.68, 11.74, 11.07, 8.69, 10.53, 10.38,
10.12, 11.14, 8.6, 11.26))
 plot(male_court_hrs~ fem_size,data=sex)
 model<-lm(male_court_hrs~ fem_size,data=sex)
 par(mfrow=c(1,2));plot(model, which=c(1,2))
 summary(model)
 This analysis suggests that larger females do not cause longer male courtship
behavior.
Basic graphs with R
 Bar-plots –
 We are going to use the internal dataset ToothGrowth (available with R
installation), which contains measurements of tooth length in guinea pigs that
received three levels of vitamin C and two supplement types
 We want to produce a barplot of the mean tooth length for all six
combinations of the two factors (supplement type: 2 levels, dose: 3 levels).
We first need to calculate the mean tooth length for each of the combinations.
For this, we use the command tapply().
 tapply() can return a table with mean tooth lengths for all six combinations
and this table will be the input for the barplot. Importantly, tapply() will
create a matrix with two rows and three columns corresponding to the factor
levels in the dataset. This structure is needed to produce a grouped barplot.
 mean.tg <- tapply(ToothGrowth$len, list(ToothGrowth$supp,
ToothGrowth$dose), FUN = mean)
 mean.tg
 barplot(mean.tg, beside = T)
 barplot(mean.tg, beside=T, xlab = "Dose (mg)", ylab = "Tooth length (cm)",
names = c("0.5", "1.0", "2.0"), cex.lab = 1.3, cex.names = 1.2,col=c(0, 8), ylim
= c(0, max(ToothGrowth$len)), las = 1)
 The labels of the axes can be specified with the arguments xlab and ylab and
the labels below each group of bars are controlled with the argument names.
The font size of these labels can be changed with cex.lab and cex.names.
These arguments are set to 1 by default and changes are relative to this
default. For example, cex.axis = 2 will double the font size of the axis
annotation. The limits of the y-axis are specified with ylim. Here we use 0 and
the maximum in the dataset. The orientation of the axis labels can be altered
with the argument las that has four options (0,1,2,3). Here, las = 1 produces
horizontal axis labels. The colors of the bars are determined by col, in our
example by a vector with a length of two for the two groups, specifying 0 (the
background color, usually white) and 8 (grey). The color can be specified
either with numbers (1 to 8) or with the color name.
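 It can help to look at the structure of the tapply() result before plotting, and to add a legend so that the two supplement types can be told apart. A minimal sketch; the legend arguments and the color names are additions chosen here, not part of the original plot.

```r
mean.tg <- tapply(ToothGrowth$len,
                  list(ToothGrowth$supp, ToothGrowth$dose),
                  FUN = mean)

dim(mean.tg)        # 2 rows (supplement type) and 3 columns (dose)
dimnames(mean.tg)   # row and column labels taken from the factor levels

# Grouped barplot with colors given by name and a legend for the rows
barplot(mean.tg, beside = TRUE,
        col = c("white", "grey"),
        legend.text = rownames(mean.tg),
        args.legend = list(x = "topleft"),
        xlab = "Dose (mg)", ylab = "Tooth length",
        ylim = c(0, max(ToothGrowth$len)), las = 1)
```

With beside = TRUE each column of the matrix becomes one group of bars and each row becomes one bar within the group.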
Grouped scatter plot with regression
lines
 To produce a scatterplot, we will use the plot() command. plot() is a
high-level plotting command, which means that it will create a new graph.
 We are going to use part of the internal dataset iris (available with the R
installation) as an example (Figure 5-2). iris contains flower measurements of
three different Iris species. You can explore the dataset with ?iris, summary(iris)
and str(iris). To reduce the dataset to two species and to plot all the
data points use:
 iris.short <- iris[1:100, ]
 plot(iris.short$Sepal.Length, iris.short$Sepal.Width)
 We can now assign two different plotting symbols for the species by creating a
new column in the data frame iris.short, named iris.short$pch, that contains
the number of the plotting symbol to be used. There are 26 different plotting
symbols, ranging from 0 to 25. Here we use symbol 1 for Iris setosa and
symbol 16 for Iris versicolor. You can use the same procedure to assign
different colors to the two species (see above). We can then set the axis
labels, range and orientation as well as font size using xlab, ylab, xlim, ylim,
las, cex.axis and cex.lab as explained above.
 iris.short$pch[iris.short$Species == "setosa"] <- 1
 iris.short$pch[iris.short$Species == "versicolor"] <- 16
 plot(iris.short$Sepal.Length, iris.short$Sepal.Width, xlab = "Sepal length (cm)",
ylab = "Sepal width (cm)", xlim = c(4, 7.5), las = 1, cex.axis = 1.2,
cex.lab = 1.3, pch = iris.short$pch)
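 One way to add the regression lines named in the heading is to fit a separate linear model for each species and draw it with abline(). This is a sketch under that assumption; the line types, the legend and the object names one.species, fit and slopes are choices made here for illustration.

```r
iris.short <- iris[1:100, ]

# Plotting symbol per species: 1 for setosa, 16 for versicolor
iris.short$pch <- ifelse(iris.short$Species == "setosa", 1, 16)

plot(iris.short$Sepal.Length, iris.short$Sepal.Width,
     pch = iris.short$pch, las = 1,
     xlab = "Sepal length (cm)", ylab = "Sepal width (cm)")

# Fit and draw one regression line per species
slopes <- c()
for (sp in c("setosa", "versicolor")) {
  one.species <- iris.short[iris.short$Species == sp, ]
  fit <- lm(Sepal.Width ~ Sepal.Length, data = one.species)
  abline(fit, lty = if (sp == "setosa") 2 else 1)
  slopes[sp] <- coef(fit)["Sepal.Length"]
}
slopes

legend("topright", legend = c("Iris setosa", "Iris versicolor"),
       pch = c(1, 16), lty = c(2, 1))
```

abline() is a low-level plotting command: it adds to the existing graph instead of starting a new one.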
Logistic Regression
 Logistic regression models are used in situations where we want to know
how a binary response variable is affected by one or more continuous
variables. Common biological examples of this include assessing probability of
survival, probability of reproducing, or probability of an individual possessing
a certain allele. On a natural scale, logistic regression is non-linear and
cannot be analyzed using linear models. However, this problem is
circumvented by using the logit transformation to linearize the model.
 Logit = log(p / (1 - p))
 Creating and analyzing model
 In R, logistic regression models are created using the generalized linear model
function glm(). This takes the general form of:
 Model <- glm(probability_data ~ continuous_predictor, family = "binomial")
 The argument family = "binomial" tells the function to create a binomial
logistic regression model. As with the lm() function, we can use summary() to
obtain summary data of the model.
 LModel <- glm(survival ~ height, family = binomial, data = Hypericum)
 summary(LModel)
 # likelihood-ratio (G-squared) test of the model against the null model
 G_sq <- LModel$null.deviance - LModel$deviance
 pchisq(G_sq, 1, lower.tail=F)
 # predictions on the logit (linear) scale
 Sequence<-seq(0,4,.1)
 PLMlogit<-predict(LModel, list(height=Sequence))
 plot(Sequence,PLMlogit, type="l", xlab="Log(height)", ylab="Logit Survival")
 # predictions on the probability scale (logistic curve)
 PLMcurve<-predict(LModel,list(height=Sequence),type="response")
 plot(Sequence,PLMcurve,type="l", xlab="Log(height)", ylab="Survival")
 Logistic regressions are used when you have a binary response variable,
modelled as a probability, and a continuous predictor variable.
 Logistic curves are analyzed as generalized linear models with glm() through the
use of the logit transformation.
 Logistic regressions can be plotted either as a logistic curve or as a linear
function.
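 Because the Hypericum data are not included here, the workflow can be tried on simulated data. Everything in this sketch is an assumption made for illustration: the sample size, the true coefficients (-2 and 1.5), the random seed and the data frame name sim.

```r
set.seed(1)   # fixed seed so that the simulation is reproducible

# Simulated predictor and binary response (hypothetical data)
n <- 200
height <- runif(n, 0, 4)
p_true <- plogis(-2 + 1.5 * height)        # inverse logit of a linear predictor
survival <- rbinom(n, size = 1, prob = p_true)
sim <- data.frame(height, survival)

# Fit the logistic regression
LModel <- glm(survival ~ height, family = binomial, data = sim)
coef(LModel)

# Likelihood-ratio (G-squared) test against the null model
G_sq <- LModel$null.deviance - LModel$deviance
p_value <- pchisq(G_sq, df = 1, lower.tail = FALSE)

# Predictions on the logit scale and on the probability scale
logit_pred <- predict(LModel, data.frame(height = 2))
prob_pred  <- predict(LModel, data.frame(height = 2), type = "response")

# The probability is the inverse logit of the linear prediction
all.equal(unname(prob_pred), unname(plogis(logit_pred)))
```

plogis() is the inverse of the logit transformation, which is why the two kinds of prediction describe the same fitted model.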
R Flow Control
 for (VAR in SEQ) {EXPR}
 while (COND) {EXPR}
 repeat {EXPR}
 The first one, for(), iterates through each component VAR of the sequence SEQ.
For example, in the first iteration VAR = SEQ[1], in the second iteration VAR =
SEQ[2], and so on.
 VAR is the abbreviation of variable.
 SEQ is the abbreviation of sequence, which is equivalent to a vector (including
list) in R.
 COND is the abbreviation of conditional, which can evaluate to TRUE or FALSE.
 EXPR is the abbreviation of expression in a formal sense.
for
 for ( i in 1:5 ) {
 print( paste('square of', i, '=', i^2) )
 }
while
 i_w <- 1
 while ( i_w <=10 ) {
 i_w <- i_w + 5
 }
 i_w
repeat
 i_r <-1
 repeat {
 i_r <- i_r + 1
 if (i_r > 10) {break}
 }
 i_r
If – else if - else
 x <- 3
 if ( ! is.numeric(x) ) {
 stop( paste(x, 'is not numeric') )
 } else if ( x%%2 == 1) {
 print ( paste(x, 'is odd') )
 } else if ( x == round(x) ) {
 print ( paste(x, 'is an integer') )
 } else {
 print ( paste(x, 'is a number') )
 }
Functions
 R provides a convenient way to define custom functions and make good use of
them. All functions read and parse input, referred to as arguments,
and then return output. An R function is a first-class object.
It can be created by using the keyword function, which is followed by a
comma-separated list of formal arguments enclosed by a pair of parentheses,
and then the expression that forms the body of the function.
 If the body consists of only one expression, it can be entered directly;
when there are multiple expressions, they have to be enclosed in braces {}.
The value returned by an R function is either the value passed to the built-in
function return() or simply the value of the last evaluated expression.
Example of a function
 expon <- function(x,n) {
 if ( x%%1 != 0 ) {
 stop('x must be an integer!')
 } else if ( n==0 ) {
 return(1)
 } else {
 prod <- x
 while( n>1 ) {
 prod <- x*prod
 n <- n-1
 }
 return(prod)
 } # end of else
 } # end of the function
 expon(3,4)
 The formal arguments and the body of the function expon() can later be accessed via
the R functions formals() and body(), as follows:
 formals(expon)
 $x
 $n
 body(expon)
 {
 if (x%%1 != 0) {
 stop("x must be an integer!")
 } (…)
Thank You for Your Patience!!

Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 

Recently uploaded (20)

Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 

Workshop presentation hands on r programming

  • 1. Hands-on R Programming Mrs. Nimrita Koul Assistant Professor School of Computing & IT REVA University
  • 2. Contents  About R  R and RStudio Download and Installation  Features of R
  • 3. R is:  A programming environment for statistical computing, data analysis and graphics. It is a GNU project, similar to the S language and environment developed at Bell Laboratories by John Chambers and colleagues.  Graphical facilities for data analysis and visualization  A well-developed, simple and effective programming language.  Includes conditionals, loops, user-defined recursive functions and input and output facilities.  A fully planned and coherent system  R is continuously updated with newer functionality, such as deep learning libraries.  It has developed rapidly, and has been extended by a large collection of packages.
  • 4.  R provides a wide variety of statistical tools (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.  One of R’s strengths is the ease with which high-quality plots can be produced.  R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form.  R documentation and manuals are available at https://cran.r-project.org/manuals.html
  • 5.
  • 6. Downloading R for Windows  Go to https://cran.r-project.org/bin/windows/base/  Click on Download R 3.4.2 for Windows  Click on the downloaded exe file  Select English as the language  Click Next on each screen  Click Finish
  • 7.
  • 8.
  • 9. RStudio  https://www.rstudio.com/products/rstudio/download/  Go to the column "RStudio Desktop Open Source License"  Click on the Free version of RStudio
  • 10.
  • 11.
  • 12. Installing RStudio  Once the exe downloads to your PC, click on it and follow the default settings.  Keep clicking Next and finally Finish.  You have installed RStudio.  To find it:  Go to Start  In Programs you will find RStudio; click on it to start it.
  • 13.
  • 14. R Features  R is case sensitive, so A and a are different symbols and would refer to different variables.  Commands are separated either by a semicolon (‘;’) or by a newline.  If a command is not complete at the end of a line, R shows a different prompt, by default +, on the second and subsequent lines and continues to read input until the command is syntactically complete.  The vertical arrow keys on the keyboard can be used to scroll forward and backward through the command history.
  • 15. Objects and Workspace  The entities that R creates and manipulates are known as objects.  These may be variables, arrays of numbers, character strings, functions, or more general structures built from such components.  During an R session, objects are created and stored by name.  The R command > objects() (alternatively, ls()) can be used to display the names of the objects which are currently stored within R.  The collection of objects currently stored is called the workspace  To remove objects the function rm is available: > rm(x, y, z, ink, junk, temp, foo, bar)
  • 16.  If you choose to save the workspace when quitting, all objects are written to a file called .RData in the current directory, and the command lines used in the session are saved to a file called .Rhistory.  When R is started at a later time from the same directory, it reloads the workspace from this file. At the same time the associated command history is reloaded.
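The workspace commands from the two slides above can be tried with a short sketch like the following. The object names x and msg are made up for illustration, and the code assumes it is run at the top level of a session.

```r
# Create two objects, list the workspace, then remove one of them.
x   <- 1:5
msg <- "hello"

objs_before <- ls()   # names of the objects currently stored
rm(x)                 # remove x from the workspace
objs_after  <- ls()

# The only name that disappeared is "x"
removed <- setdiff(objs_before, objs_after)
print(removed)
```

objects() could be used in place of ls(); the two are equivalent.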
  • 17. LET US ALL BE ON THE RStudio CONSOLE
  • 18. Built-in help system
     At the program's command prompt you can use any of the following:
      help.start()           # general help, in HTML format
      help(solve)            # documentation for the function solve
      ?solve                 # an alternative to help(solve)
      help("[[")             # quote operators and special characters
      help.search("solve")   # search the help system for the string "solve"
      apropos("solve")       # list all functions whose names contain the string "solve"
      example(solve)         # show an example of the function solve
  • 19. Standard commands for managing your workspace.
      ls()                      # list the objects in the current workspace
      setwd(mydirectory)        # change to mydirectory
      setwd("c:/docs/mydir")    # note / instead of \ in windows
      setwd("/usr/rob/mydir")   # on linux

      # view and set options for the session
      help(options)             # learn about available options
      options()                 # view current option settings
      options(digits=3)         # number of digits to print on output

      # work with your previous commands
      history()                 # display last 25 commands
      history(max.show=Inf)     # display all previous commands
  • 20.
      # save your command history
      savehistory(file="myfile")   # default is ".Rhistory"
      # recall your command history
      loadhistory(file="myfile")   # default is ".Rhistory"

      # save the workspace to the file .RData in the cwd
      save.image()
      # save specific objects to a file
      # if you don't specify the path, the cwd is assumed
      save(x, y, file="myfile.RData")   # x and y stand for the objects to save

      # load a workspace into the current session
      # if you don't specify the path, the cwd is assumed
      load("myfile.RData")

      q()   # quit R. You will be prompted to save the workspace.
  • 21. Input / Output
     By default, launching R starts an interactive session with input from the keyboard and output to the screen. However, you can have input come from a script file (a file containing R commands) and direct output to a variety of destinations.
     Input: The source( ) function runs a script in the current session. If the filename does not include a path, the file is taken from the current working directory.
      # input a script
      source("myfile")
     Output: The sink( ) function defines the direction of the output.
      # direct output to a file
      sink("myfile", append=FALSE, split=FALSE)
      # return output to the terminal
      sink()
  • 22.
     The append option controls whether output overwrites or adds to a file. The split option determines whether output is also sent to the screen as well as the output file.
     Here are some examples of the sink() function.
      # output directed to output.txt in the current working directory.
      # output overwrites the existing file. no output to terminal.
      sink("output.txt")
      # output directed to myfile.txt in the cwd. output is appended
      # to the existing file. output is also sent to the terminal.
      sink("myfile.txt", append=TRUE, split=TRUE)
     When redirecting output, use the cat( ) function to annotate the output.
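As a minimal sketch of redirecting output with sink() and annotating it with cat(), the following writes to a temporary file so that nothing in the working directory is touched. The file location and the data summarised are illustrative assumptions.

```r
# Redirect console output to a temporary file (assumed location).
out_file <- tempfile(fileext = ".txt")

sink(out_file)                 # start sending output to the file
cat("Summary of 1:10\n")       # annotation line written with cat()
print(summary(1:10))           # regular printed output also goes to the file
sink()                         # return output to the terminal

# Read the file back to confirm what was written
lines_written <- readLines(out_file)
print(lines_written[1])
```

Calling sink() with no arguments is what restores normal console output; forgetting it is a common reason output seems to vanish.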
  • 23. Graphs
     sink( ) will not redirect graphic output. To redirect graphic output use one of the following functions. Use dev.off( ) to return output to the terminal.
      Function                        Output to
      pdf("mygraph.pdf")              pdf file
      win.metafile("mygraph.wmf")     windows metafile
      png("mygraph.png")              png file
      jpeg("mygraph.jpg")             jpeg file
      bmp("mygraph.bmp")              bmp file
      postscript("mygraph.ps")        postscript file
  • 24.
      # example - output graph to jpeg file
      jpeg("c:/mygraphs/myplot.jpg")
      plot(x)
      dev.off()
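A self-contained variant of the example above, sketched with the pdf() device and a temporary directory so it does not depend on a c:/mygraphs folder or on an existing x. The file name and the plotted data are illustrative assumptions.

```r
# Write a plot to a PDF file in the session's temporary directory.
plot_file <- file.path(tempdir(), "myplot.pdf")

pdf(plot_file)                              # open the pdf graphics device
plot(1:10, (1:10)^2, main = "Squares")      # draw into the file
dev.off()                                   # close the device and finish the file

print(file.exists(plot_file))
```

The file is only complete after dev.off() is called.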
  • 25. Packages
     Packages are collections of R functions, data, and compiled code in a well-defined format.
     The directory where packages are stored is called the library.
     R comes with a standard set of packages. Others are available for download and installation.
     The install.packages("packagename") command installs a package.
     Once installed, packages have to be loaded into the session to be used.
      .libPaths()   # get library location
      library()     # see all packages installed
      search()      # see packages currently loaded
  • 26. Download and Install a Package  We need to download and install a package only once.  To use the package, invoke the library(package) command to load it into the current session. (You need to do this once in each session, unless you customize your environment to automatically load it each time.)  On MS Windows:  The command install.packages("packagename") installs a package from the default mirror. Note the lowercase spelling, since R is case sensitive.  Then use the library(packagename) function to load it for use (e.g. library(boot)).
  • 27. Customizing Startup  You can customize the R environment through a site initialization file or a directory initialization file. R will always source the Rprofile.site file first. On Windows, the file is in the C:\Program Files\R\R-n.n.n\etc directory. You can also place a .Rprofile file in any directory that you are going to run R from or in the user home directory.  At startup, R will source the Rprofile.site file. It will then look for a .Rprofile file to source in the current working directory. If it doesn't find it, it will look for one in the user's home directory.  There are two special functions you can place in these files. .First( ) will be run at the start of the R session and .Last( ) will be run at the end of the session.
  • 28.
      # Sample Rprofile.site file

      # Things you might want to change
      # options(papersize="a4")
      # options(editor="notepad")
      # options(pager="internal")

      # R interactive prompt
      # options(prompt="> ")
      # options(continue="+ ")

      # to prefer Compiled HTML help
      options(chmhelp=TRUE)
      # to prefer HTML help
      # options(htmlhelp=TRUE)

      # General options
      options(tab.width = 2)
      options(width = 130)
      options(graphics.record=TRUE)

      .First <- function(){
        library(Hmisc)
        library(R2HTML)
        cat("\nWelcome at", date(), "\n")
      }

      .Last <- function(){
        cat("\nGoodbye at ", date(), "\n")
      }
  • 29. Basic Data Types  Numeric  Integer  Complex  Logical  Character
  • 30. Data Types
     R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.
     VECTORS
      a <- c(1,2,5.3,6,-2,4)                      # numeric vector
      b <- c("one", "two", "three")               # character vector
      c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)     # logical vector
     Refer to elements of a vector using subscripts.
      a[c(2,4)]   # 2nd and 4th elements of vector
  • 31.
      > x = 10.5       # assign a decimal value
      > x              # print the value of x
      [1] 10.5
      > class(x)       # print the class name of x
      [1] "numeric"
     Furthermore, even if we assign an integer to a variable k, it is still saved as a numeric value.
      > k = 1
      > k              # print the value of k
      [1] 1
      > class(k)       # print the class name of k
      [1] "numeric"
     The fact that k is not an integer can be confirmed with the is.integer function.
      > is.integer(k)  # is k an integer?
      [1] FALSE
  • 32. Integer
     In order to create an integer variable in R, we invoke the as.integer function. We can be assured that y is indeed an integer by applying the is.integer function.
      > y = as.integer(3)
      > y              # print the value of y
      [1] 3
      > class(y)       # print the class name of y
      [1] "integer"
      > is.integer(y)  # is y an integer?
      [1] TRUE
     Incidentally, we can coerce a numeric value into an integer with the same as.integer function.
      > as.integer(3.14)    # coerce a numeric value
      [1] 3
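Besides as.integer, R also accepts an L suffix to write an integer literal directly. The sketch below is an addition to the slides and shows how the class of a result depends on the operands.

```r
# Integer literals use the L suffix.
k <- 1L
k_class <- class(k)               # "integer"

# Arithmetic keeps integers only when every operand is an integer
sum_class  <- class(1L + 1L)      # "integer"
mix_class  <- class(1L + 1)       # "numeric": the plain 1 is a double
div_class  <- class(5L / 2L)      # "numeric": / always returns a double
idiv_class <- class(5L %/% 2L)    # "integer": integer division

print(c(k_class, sum_class, mix_class, div_class, idiv_class))
```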
  • 33.
     And we can parse a string for decimal values in much the same way.
      > as.integer("5.27")   # coerce a decimal string
      [1] 5
     On the other hand, trying to parse a non-decimal string yields NA with a warning.
      > as.integer("Joe")    # coerce a non-decimal string
      [1] NA
      Warning message:
      NAs introduced by coercion
     Often, it is useful to perform arithmetic on logical values. As in the C language, TRUE has the value 1, while FALSE has the value 0.
      > as.integer(TRUE)     # the numeric value of TRUE
      [1] 1
      > as.integer(FALSE)    # the numeric value of FALSE
      [1] 0
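One common use of arithmetic on logical values is counting how many elements satisfy a condition. A small sketch, where the scores vector is made-up sample data:

```r
scores <- c(4, 9, 7, 2, 8)     # hypothetical sample data

above <- scores > 5            # logical vector: FALSE TRUE TRUE FALSE TRUE
n_above     <- sum(above)      # TRUE counts as 1, so this counts matches: 3
share_above <- mean(above)     # proportion of matches: 0.6

print(n_above)
print(share_above)
```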
  • 34. Complex
     A complex value in R is defined via the pure imaginary value i.
      > z = 1 + 2i     # create a complex number
      > z              # print the value of z
      [1] 1+2i
      > class(z)       # print the class name of z
      [1] "complex"
     The following returns NaN with a warning, as -1 is not a complex value.
      > sqrt(-1)       # square root of -1
      [1] NaN
      Warning message:
      In sqrt(-1) : NaNs produced
  • 35.
     Instead, we have to use the complex value -1 + 0i.
      > sqrt(-1+0i)    # square root of -1+0i
      [1] 0+1i
     An alternative is to coerce -1 into a complex value.
      > sqrt(as.complex(-1))
      [1] 0+1i
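As a brief addition to the slides, base R also provides functions for taking a complex value apart. The values used here are arbitrary.

```r
z <- 1 + 2i

re_part <- Re(z)                   # real part: 1
im_part <- Im(z)                   # imaginary part: 2
modulus <- Mod(3 + 4i)             # modulus of 3+4i: 5
conj_z  <- Conj(z)                 # complex conjugate: 1-2i
root    <- sqrt(as.complex(-4))    # 0+2i

print(c(re_part, im_part, modulus))
print(conj_z)
print(root)
```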
  • 36. Logical
     A logical value is often created via comparison between variables.
      > x = 1; y = 2   # sample values
      > z = x > y      # is x larger than y?
      > z              # print the logical value
      [1] FALSE
      > class(z)       # print the class name of z
      [1] "logical"
  • 37.
     Standard logical operations are "&" (and), "|" (or), and "!" (negation).
      > u = TRUE; v = FALSE
      > u & v          # u AND v
      [1] FALSE
      > u | v          # u OR v
      [1] TRUE
      > !u             # negation of u
      [1] FALSE
  • 38. Character
     A character object is used to represent string values in R. We convert objects into character values with the as.character() function:
      > x = as.character(3.14)
      > x              # print the character string
      [1] "3.14"
      > class(x)       # print the class name of x
      [1] "character"
     Two character values can be concatenated with the paste function.
      > fname = "Joe"; lname = "Smith"
      > paste(fname, lname)
      [1] "Joe Smith"
  • 39.
     However, it is often more convenient to create a readable string with the sprintf function, which has a C language syntax.
      > sprintf("%s has %d dollars", "Sam", 100)
      [1] "Sam has 100 dollars"
     To extract a substring, we apply the substr function. Here is an example showing how to extract the substring between the third and twelfth positions in a string.
      > substr("Mary has a little lamb.", start=3, stop=12)
      [1] "ry has a l"
     And to replace the first occurrence of the word "little" by another word "big" in the string, we apply the sub function.
      > sub("little", "big", "Mary has a little lamb.")
      [1] "Mary has a big lamb."
     More functions for string manipulation can be found in the R documentation.
      > help("sub")
  • 40. Vector
     A vector is a sequence of data elements of the same basic type. Members in a vector are officially called components.
     Here is a vector containing three numeric values 2, 3 and 5.
      > c(2, 3, 5)
      [1] 2 3 5
     And here is a vector of logical values.
      > c(TRUE, FALSE, TRUE, FALSE, FALSE)
      [1]  TRUE FALSE  TRUE FALSE FALSE
     A vector can contain character strings.
      > c("aa", "bb", "cc", "dd", "ee")
      [1] "aa" "bb" "cc" "dd" "ee"
     The number of members in a vector is given by the length function.
      > length(c("aa", "bb", "cc", "dd", "ee"))
      [1] 5
  • 41. Combining Vectors
     Vectors can be combined via the function c. For example, the following two vectors n and s are combined into a new vector containing elements from both vectors.
      > n = c(2, 3, 5)
      > s = c("aa", "bb", "cc", "dd", "ee")
      > c(n, s)
      [1] "2"  "3"  "5"  "aa" "bb" "cc" "dd" "ee"
     Value Coercion
     In the code snippet above, notice how the numeric values are coerced into character strings when the two vectors are combined. This is necessary to maintain the same primitive data type for all members of the vector.
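The coercion described above follows a fixed order: logical values become integers, integers become doubles, and everything becomes character once a string is present. A minimal sketch with arbitrary values:

```r
# Each combination is coerced to the most general type present.
lgl_int <- class(c(TRUE, 2L))      # "integer"
int_dbl <- class(c(2L, 3.5))       # "numeric"
dbl_chr <- class(c(3.5, "aa"))     # "character"

mixed <- c(TRUE, 2L, 3.5, "aa")    # every member ends up as a string
print(mixed)
```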
  • 42. Vector Arithmetic
     Arithmetic operations on vectors are performed member by member, i.e., memberwise.
     For example, suppose we have two vectors a and b.
      > a = c(1, 3, 5, 7)
      > b = c(1, 2, 4, 8)
     Then, if we multiply a by 5, we get a vector with each of its members multiplied by 5.
      > 5 * a
      [1]  5 15 25 35
     And if we add a and b together, the sum is a vector whose members are the sums of the corresponding members of a and b.
      > a + b
      [1]  2  5  9 15
  • 43.
     Similarly for subtraction, multiplication and division, we get new vectors via memberwise operations.
      > a - b
      [1]  0  1  1 -1
      > a * b
      [1]  1  6 20 56
      > a / b
      [1] 1.000 1.500 1.250 0.875
     Recycling Rule
     If two vectors are of unequal length, the shorter one is recycled in order to match the longer vector. For example, the following vectors u and v have different lengths, and their sum is computed by recycling values of the shorter vector u.
      > u = c(10, 20, 30)
      > v = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
      > u + v
      [1] 11 22 33 14 25 36 17 28 39
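In the recycling example above, the longer length is a multiple of the shorter one. When it is not, R still recycles but also issues a warning. The sketch below uses made-up vectors p and q and captures the warning so that it can be inspected.

```r
p <- c(10, 20)       # length 2
q <- c(1, 2, 3)      # length 3, not a multiple of 2

# The result is still computed: 10+1, 20+2, 10+3
res <- suppressWarnings(p + q)

# Detect that a warning is raised for the mismatched lengths
warned <- tryCatch({ p + q; FALSE }, warning = function(w) TRUE)

print(res)
print(warned)
```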
  • 44. Vector Index
     We retrieve values in a vector by declaring an index inside a single square bracket "[]" operator.
     For example, the following shows how to retrieve a vector member. Since the vector index is 1-based, we use the index position 3 for retrieving the third member.
      > s = c("aa", "bb", "cc", "dd", "ee")
      > s[3]
      [1] "cc"
     Unlike other programming languages, the square bracket operator returns more than just individual members. In fact, the result of the square bracket operator is another vector, and s[3] is a vector slice containing a single member "cc".
  • 45.
     Negative Index
     If the index is negative, it strips the member whose position has the same absolute value as the negative index. For example, the following creates a vector slice with the third member removed.
      > s[-3]
      [1] "aa" "bb" "dd" "ee"
     Out-of-Range Index
     If an index is out of range, a missing value is reported via the symbol NA.
      > s[10]
      [1] NA
  • 46. Numeric Index Vector
     A new vector can be sliced from a given vector with a numeric index vector, which consists of member positions of the original vector to be retrieved.
     Here is how to retrieve a vector slice containing the second and third members of a given vector s.
      > s = c("aa", "bb", "cc", "dd", "ee")
      > s[c(2, 3)]
      [1] "bb" "cc"
     Duplicate Indexes
     The index vector allows duplicate values. Hence the following retrieves a member twice in one operation.
      > s[c(2, 3, 3)]
      [1] "bb" "cc" "cc"
  • 47. Out-of-Order Indexes
     The index vector can even be out of order. Here is a vector slice with the order of the first and second members reversed.
      > s[c(2, 1, 3)]
      [1] "bb" "aa" "cc"
     Range Index
     To produce a vector slice between two indexes, we can use the colon operator ":". This can be convenient for situations involving large vectors.
      > s[2:4]
      [1] "bb" "cc" "dd"
     More information on the colon operator is available in the R documentation.
      > help(":")
  • 48. Logical Index Vector
     A new vector can be sliced from a given vector with a logical index vector, which has the same length as the original vector. Its members are TRUE if the corresponding members in the original vector are to be included in the slice, and FALSE otherwise.
     For example, consider the following vector s of length 5.
      > s = c("aa", "bb", "cc", "dd", "ee")
     To retrieve the second and fourth members of s, we define a logical vector L of the same length, and have its second and fourth members set as TRUE.
      > L = c(FALSE, TRUE, FALSE, TRUE, FALSE)
      > s[L]
      [1] "bb" "dd"
     The code can be abbreviated into a single line.
      > s[c(FALSE, TRUE, FALSE, TRUE, FALSE)]
      [1] "bb" "dd"
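In practice the logical index vector is usually produced by a comparison rather than typed out by hand. A short sketch with made-up numbers:

```r
x <- c(3, 8, 1, 9, 4)      # hypothetical data

keep <- x > 3              # FALSE TRUE FALSE TRUE TRUE
selected  <- x[keep]       # members greater than 3: 8 9 4
positions <- which(keep)   # their positions: 2 4 5

print(selected)
print(positions)
```

The two steps are often written as one expression, x[x > 3].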
  • 49. Named Vector Members
     We can assign names to vector members. For example, the following variable v is a character string vector with two members.
      > v = c("Mary", "Sue")
      > v
      [1] "Mary" "Sue"
     We now name the first member as First, and the second as Last.
      > names(v) = c("First", "Last")
      > v
       First   Last
      "Mary"  "Sue"
     Then we can retrieve the first member by its name.
      > v["First"]
      [1] "Mary"
     Furthermore, we can reverse the order with a character string index vector.
      > v[c("Last", "First")]
        Last  First
       "Sue" "Mary"
  • 50. MATRIX
     A matrix is a collection of data elements arranged in a two-dimensional rectangular layout. The following is an example of a matrix with 2 rows and 3 columns.
      A = [ 2 4 3
            1 5 7 ]
     We reproduce a memory representation of the matrix in R with the matrix function. The data elements must be of the same basic type.
      > A = matrix(
      +   c(2, 4, 3, 1, 5, 7),   # the data elements
      +   nrow=2,                # number of rows
      +   ncol=3,                # number of columns
      +   byrow = TRUE)          # fill matrix by rows
      > A                        # print the matrix
           [,1] [,2] [,3]
      [1,]    2    4    3
      [2,]    1    5    7
  • 51.
     An element at the mth row, nth column of A can be accessed by the expression A[m, n].
      > A[2, 3]      # element at 2nd row, 3rd column
      [1] 7
     The entire mth row of A can be extracted as A[m, ].
      > A[2, ]       # the 2nd row
      [1] 1 5 7
     Similarly, the entire nth column of A can be extracted as A[ ,n].
      > A[ ,3]       # the 3rd column
      [1] 3 7
  • 52.
     We can also extract more than one row or column at a time.
      > A[ ,c(1,3)]  # the 1st and 3rd columns
           [,1] [,2]
      [1,]    2    3
      [2,]    1    7
     If we assign names to the rows and columns of the matrix, then we can access the elements by name.
      > dimnames(A) = list(
      +   c("row1", "row2"),           # row names
      +   c("col1", "col2", "col3"))   # column names
      > A                              # print A
           col1 col2 col3
      row1    2    4    3
      row2    1    5    7
      > A["row2", "col3"]              # element at 2nd row, 3rd column
      [1] 7
  • 53. Matrix Construction
     There are various ways to construct a matrix. When we construct a matrix directly with data elements, the matrix content is filled along the column orientation by default. For example, in the following code snippet, the content of B is filled along the columns consecutively.
      > B = matrix(
      +   c(2, 4, 3, 1, 5, 7),
      +   nrow=3,
      +   ncol=2)
      > B             # B has 3 rows and 2 columns
           [,1] [,2]
      [1,]    2    1
      [2,]    4    5
      [3,]    3    7
  • 54.
     Then we can combine the columns of B and C with cbind, where C is a matrix with the same number of rows as B (here a single column holding 7, 4, 2).
      > cbind(B, C)
           [,1] [,2] [,3]
      [1,]    2    1    7
      [2,]    4    5    4
      [3,]    3    7    2
     Similarly, we can combine the rows of two matrices with the rbind function if they have the same number of columns.
      > D = matrix(
      +   c(6, 2),
      +   nrow=1,
      +   ncol=2)
      > D             # D has 2 columns
           [,1] [,2]
      [1,]    6    2
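The two combinations can be run end to end with the sketch below. B and D are taken from the slides; the definition of C is an assumption, chosen as a single column 7, 4, 2 so that it has the same number of rows as B.

```r
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 3, ncol = 2)
C <- matrix(c(7, 4, 2), nrow = 3, ncol = 1)   # assumed definition of C
D <- matrix(c(6, 2), nrow = 1, ncol = 2)

BC <- cbind(B, C)   # same number of rows: result is 3 x 3
BD <- rbind(B, D)   # same number of columns: result is 4 x 2

print(BC)
print(BD)
```

cbind requires matching row counts and rbind requires matching column counts; otherwise R reports an error.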
  • 55. Deconstruction
     We can deconstruct a matrix by applying the c function, which combines all column vectors into one.
      > c(B)
      [1] 2 4 3 1 5 7
  • 56. Transpose
     We construct the transpose of a matrix by interchanging its columns and rows with the function t.
      > t(B)          # transpose of B
           [,1] [,2] [,3]
      [1,]    2    4    3
      [2,]    1    5    7
  • 57. Lists
     A list is a generic vector containing other objects.
     For example, the following variable x is a list containing copies of three vectors n, s, b, and a numeric value 3.
      > n = c(2, 3, 5)
      > s = c("aa", "bb", "cc", "dd", "ee")
      > b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
      > x = list(n, s, b, 3)   # x contains copies of n, s, b
     List Slicing
     We retrieve a list slice with the single square bracket "[]" operator. The following is a slice containing the second member of x, which is a copy of s.
      > x[2]
      [[1]]
      [1] "aa" "bb" "cc" "dd" "ee"
  • 58.
     With an index vector, we can retrieve a slice with multiple members. Here is a slice containing the second and fourth members of x.
      > x[c(2, 4)]
      [[1]]
      [1] "aa" "bb" "cc" "dd" "ee"

      [[2]]
      [1] 3
  • 59. List Member Reference
     In order to reference a list member directly, we have to use the double square bracket "[[]]" operator. The following object x[[2]] is the second member of x. In other words, x[[2]] is a copy of s, but is not a slice containing s or its copy.
      > x[[2]]
      [1] "aa" "bb" "cc" "dd" "ee"
     We can modify its content directly.
      > x[[2]][1] = "ta"
      > x[[2]]
      [1] "ta" "bb" "cc" "dd" "ee"
      > s
      [1] "aa" "bb" "cc" "dd" "ee"   # s is unaffected
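The difference between a list slice and a list member can be checked by looking at the class of each result. A minimal sketch with a small made-up list:

```r
x <- list(c(2, 3, 5), c("aa", "bb"), TRUE)

slice  <- x[2]      # single brackets: a list of length 1
member <- x[[2]]    # double brackets: the character vector itself

print(class(slice))    # "list"
print(class(member))   # "character"
print(length(member))  # 2
```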
  • 60. Named List Members
     We can assign names to list members, and reference them by name instead of by numeric index.
     For example, in the following, v is a list of two members, named "bob" and "john".
      > v = list(bob=c(2, 3, 5), john=c("aa", "bb"))
      > v
      $bob
      [1] 2 3 5

      $john
      [1] "aa" "bb"
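Once members are named, they can be referenced in several equivalent ways. The sketch below reuses the list v from the slide; the variable names holding the results are illustrative.

```r
v <- list(bob = c(2, 3, 5), john = c("aa", "bb"))

by_bracket <- v[["bob"]]               # member by name with [[ ]]
by_dollar  <- v$bob                    # same member with the $ operator
reordered  <- v[c("john", "bob")]      # a list slice selected by names

print(by_bracket)
print(names(reordered))
```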
  • 61. Data Frame  A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b.  > n = c(2, 3, 5) > s = c("aa", "bb", "cc") > b = c(TRUE, FALSE, TRUE) > df = data.frame(n, s, b) # df is a data frame
  • 62.  Built-in Data Frame  We use built-in data frames in R for our tutorials. For example, here is a built-in data frame in R, called mtcars.  > mtcars mpg cyl disp hp drat wt ... Mazda RX4 21.0 6 160 110 3.90 2.62 ... Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 ... Datsun 710 22.8 4 108 93 3.85 2.32 ... ............  The top line of the table, called the header, contains the column names. Each line after it is a data row, which begins with the name of the row, followed by the actual data. Each data member of a row is called a cell.
  • 63.  To retrieve data in a cell, we enter its row and column coordinates in the single square bracket "[]" operator. The two coordinates are separated by a comma. In other words, the coordinates begin with the row position, followed by a comma, and end with the column position. The order is important.  Here is the cell value from the first row, second column of mtcars.  > mtcars[1, 2] [1] 6  Moreover, we can use the row and column names instead of the numeric coordinates.  > mtcars["Mazda RX4", "cyl"] [1] 6
  • 64.  Lastly, the number of data rows in the data frame is given by the nrow function.  > nrow(mtcars) # number of data rows [1] 32  And the number of columns of a data frame is given by the ncol function.  > ncol(mtcars) # number of columns [1] 11  Further details of the mtcars data set are available in the R documentation.  > help(mtcars)
  • 65.  Data Frame Column Vector  We reference a data frame column with the double square bracket "[[]]" operator.  For example, to retrieve the ninth column vector of the built-in data set mtcars, we write mtcars[[9]].  > mtcars[[9]] [1] 1 1 1 0 0 0 0 0 0 0 0 ...  We can retrieve the same column vector by its name.  > mtcars[["am"]] [1] 1 1 1 0 0 0 0 0 0 0 0 ...  We can also retrieve with the "$" operator in lieu of the double square bracket operator.  > mtcars$am [1] 1 1 1 0 0 0 0 0 0 0 0 ...  Yet another way to retrieve the same column vector is to use the single square bracket "[]"operator. We prepend the column name with a comma character, which signals a wildcard match for the row position.  > mtcars[,"am"] [1] 1 1 1 0 0 0 0 0 0 0 0 ..
  • 66.  Data Frame Column Slice  We retrieve a data frame column slice with the single square bracket "[]" operator.  Numeric Indexing  The following is a slice containing the first column of the built-in data set mtcars.  > mtcars[1] mpg Mazda RX4 21.0 Mazda RX4 Wag 21.0 Datsun 710 22.8 ............  Name Indexing  We can retrieve the same column slice by its name.  > mtcars["mpg"] mpg Mazda RX4 21.0 Mazda RX4 Wag 21.0 Datsun 710 22.8 ............  To retrieve a data frame slice with the two columns mpg and hp, we pack the column names in an index vector inside the single square bracket operator.  > mtcars[c("mpg", "hp")] mpg hp Mazda RX4 21.0 110 Mazda RX4 Wag 21.0 110 Datsun 710 22.8 93 ............
  • 67.  Data Frame Row Slice  We retrieve rows from a data frame with the single square bracket operator, just like what we did with columns. However, in addition to an index vector of row positions, we append an extra comma character. This is important, as the extra comma signals a wildcard match for the second coordinate for column positions.  Numeric Indexing  For example, the following retrieves a row record of the built-in data set mtcars. Please notice the extra comma in the square bracket operator; it is not a typo. The record states that the 1974 Camaro Z28 has a gas mileage of 13.3 miles per gallon, and an eight cylinder 245 horse power engine, ..., etc.  > mtcars[24,] mpg cyl disp hp drat wt ... Camaro Z28 13.3 8 350 245 3.73 3.84 ...  To retrieve more than one row, we use a numeric index vector.  > mtcars[c(3, 24),] mpg cyl disp hp drat wt ... Datsun 710 22.8 4 108 93 3.85 2.32 ... Camaro Z28 13.3 8 350 245 3.73 3.84 ...  Name Indexing  We can retrieve a row by its name.  > mtcars["Camaro Z28",] mpg cyl disp hp drat wt ... Camaro Z28 13.3 8 350 245 3.73 3.84 ...
  • 68.  Logical Indexing  Lastly, we can retrieve rows with a logical index vector. In the following vector L, the member value is TRUE if the car has automatic transmission, and FALSE if otherwise.  > L = mtcars$am == 0 > L [1] FALSE FALSE FALSE TRUE ...  Here is the list of vehicles with automatic transmission.  > mtcars[L,] mpg cyl disp hp drat wt ... Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 ... Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 ... ............  And here is the gas mileage data for automatic transmission.  > mtcars[L,]$mpg [1] 21.4 18.7 18.1 14.3 24.4 ...
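Logical index vectors can be combined with "&" and "|" to select rows on several conditions at once. The sketch below uses a threshold of 20 miles per gallon purely as an illustration; the names keep, efficient_auto and same are not from the slides.

```r
# Automatic transmission (am == 0) AND more than 20 miles per gallon
keep <- mtcars$am == 0 & mtcars$mpg > 20
efficient_auto <- mtcars[keep, c("mpg", "cyl", "am")]
efficient_auto

# subset() expresses the same selection without repeating "mtcars$"
same <- subset(mtcars, am == 0 & mpg > 20, select = c(mpg, cyl, am))
all.equal(efficient_auto, same)
```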
  • 69. Useful Functions  length(object) # number of elements or components str(object) # structure of an object class(object) # class or type of an object names(object) # names c(object, object,...) # combine objects into a vector cbind(object, object, ...) # combine objects as columns rbind(object, object, ...) # combine objects as rows object # prints the object ls() # list current objects rm(object) # delete an object newobject <- edit(object) # edit copy and save as newobject fix(object) # edit in place
  • 71. Importing Data  # first row contains variable names, comma is the separator # assign the variable id to row names  mydata <- read.table("c:/mydata.csv", header=TRUE, sep=",", row.names="id")  # read in the first worksheet from the workbook myexcel.xlsx # first row contains variable names  library(xlsx) mydata <- read.xlsx("c:/myexcel.xlsx", 1) # read in the worksheet named mysheet mydata <- read.xlsx("c:/myexcel.xlsx", sheetName = "mysheet")
  • 72. Keyboard Input  # create a data frame from scratch age <- c(25, 30, 56) gender <- c("male", "female", "male") weight <- c(160, 110, 220) mydata <- data.frame(age,gender,weight)
  • 73. You can also use R's built in spreadsheet to enter the data interactively, as in the following example  # enter data using editor mydata <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0)) mydata <- edit(mydata) # note that without the assignment in the line above, the edits are not saved!
  • 74. Importing from Database  # RODBC Example # import 2 tables (Crime and Punishment) from a DBMS # into R data frames (and call them crimedat and pundat) library(RODBC) myconn <-odbcConnect("mydsn", uid="Rob", pwd="aardvark") crimedat <- sqlFetch(myconn, "Crime") pundat <- sqlQuery(myconn, "select * from Punishment") close(myconn)
  • 75. Exporting Data  To A Tab Delimited Text File  write.table(mydata, "c:/mydata.txt", sep="\t")  To an Excel Spreadsheet  library(xlsx) write.xlsx(mydata, "c:/mydata.xlsx")
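A quick way to check an export is to read the file back in. The following sketch writes to a temporary file rather than to c:/ (an assumption made so that it runs on any system) and uses the small data frame from the Keyboard Input slide.

```r
mydata <- data.frame(age    = c(25, 30, 56),
                     gender = c("male", "female", "male"),
                     weight = c(160, 110, 220))

path <- tempfile(fileext = ".txt")   # temporary file, removed at the end

# Tab-delimited export without row names or quotes
write.table(mydata, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the file back; the first row holds the variable names
back <- read.table(path, header = TRUE, sep = "\t")
back

unlink(path)
```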
  • 76. Viewing Data  # list objects in the working environment ls()  # list the variables in mydata names(mydata)  # list the structure of mydata str(mydata)  # list levels of factor v1 in mydata levels(mydata$v1)  # dimensions of an object dim(object)
  • 77.  # class of an object (numeric, matrix, data frame, etc) class(object)  # print mydata mydata  # print first 10 rows of mydata head(mydata, n=10)  # print last 5 rows of mydata tail(mydata, n=5)
  • 78. Value Labels  You can use the factor function to create your own value labels  # variable v1 is coded 1, 2 or 3 # we want to attach value labels 1=red, 2=blue, 3=green mydata$v1 <- factor(mydata$v1, levels = c(1,2,3), labels = c("red", "blue", "green"))
  • 79.  # variable y is coded 1, 3 or 5 # we want to attach value labels 1=Low, 3=Medium, 5=High mydata$y <- ordered(mydata$y, levels = c(1,3, 5), labels = c("Low", "Medium", "High"))  Use the factor() function for nominal data and the ordered() function for ordinal data. R statistical and graphic functions will then treat the data appropriately.  Note: factor and ordered are used the same way, with the same arguments. The former creates factors and the latter creates ordered factors.
  • 80. Missing Data  In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing zero by zero) are represented by the symbol NaN (not a number).  Testing for Missing Values  is.na(x) # returns TRUE if x is missing y <- c(1,2,3,NA) is.na(y) # returns a vector (F F F T)
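The practical difference between factor() and ordered() shows up when levels are compared. The sketch below uses small made-up vectors (the codes in y and the name colour are hypothetical) to illustrate it.

```r
# Hypothetical coded responses: 1 = Low, 3 = Medium, 5 = High
y <- c(1, 3, 5, 3, 1)
rating <- ordered(y, levels = c(1, 3, 5),
                  labels = c("Low", "Medium", "High"))
rating

# Order comparisons are meaningful for ordered factors
rating < "High"

# A nominal factor has labels but no ordering
colour <- factor(c(1, 2, 3, 2), levels = c(1, 2, 3),
                 labels = c("red", "blue", "green"))
is.ordered(colour)   # FALSE
is.ordered(rating)   # TRUE
table(rating)        # counts per level, in level order
```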
  • 81. Recoding Values to Missing  # recode 99 to missing for variable v1 # select rows where v1 is 99 and recode column v1 mydata$v1[mydata$v1==99] <- NA  Excluding Missing Values from Analyses  Arithmetic functions on missing values yield missing values.  x <- c(1,2,NA,3) mean(x) # returns NA mean(x, na.rm=TRUE) # returns 2
  • 82.  The function complete.cases() returns a logical vector indicating which cases are complete.  # list rows of data that have missing values mydata[!complete.cases(mydata),]  The function na.omit() returns the object with listwise deletion of missing values  # create new dataset without missing data newdata <- na.omit(mydata)
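The missing-data functions above can be tried on a small data frame. This is a sketch with made-up values; the column names id, v1 and v2 and the code 99 for "missing" are assumptions for illustration.

```r
mydata <- data.frame(id = 1:4,
                     v1 = c(10, 99, 30, 40),
                     v2 = c("a", "b", NA, "d"))

# Recode 99 to missing in v1
mydata$v1[mydata$v1 == 99] <- NA

# Number of missing values per column
colSums(is.na(mydata))

# Rows with at least one missing value
incomplete <- mydata[!complete.cases(mydata), ]
incomplete

# Listwise deletion: keep only complete rows
newdata <- na.omit(mydata)
newdata

# Mean of v1, ignoring the missing value
mean(mydata$v1, na.rm = TRUE)
```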
  • 83. Date Values  Dates are represented as the number of days since 1970-01-01, with negative values for earlier dates. # use as.Date( ) to convert strings to dates mydates <- as.Date(c("2007-06-22", "2004-02-13")) # number of days between 6/22/07 and 2/13/04 days <- mydates[1] - mydates[2]  Sys.Date( ) returns today's date. date() returns the current date and time  # print today's date today <- Sys.Date() format(today, format="%B %d %Y") "June 20 2007"
  • 84. Date Conversion  Character to Date  You can use the as.Date( ) function to convert character data to dates. The format is as.Date(x, "format"), where x is the character data and format gives the appropriate format.  # convert date info in format 'mm/dd/yyyy' strDates <- c("01/05/1965", "08/16/1975") dates <- as.Date(strDates, "%m/%d/%Y")  The default format is yyyy-mm-dd  mydates <- as.Date(c("2007-06-22", "2004-02-13"))  Date to Character  You can convert dates to character data using the as.character( ) function  # convert dates to character data strDates <- as.character(dates)
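The "number of days since 1970-01-01" representation can be inspected with as.numeric(). The sketch below reuses the dates from the slides and only uses numeric format codes, so the output does not depend on the system language.

```r
# Internal representation: days since 1970-01-01
as.numeric(as.Date("1970-01-01"))   # 0
as.numeric(as.Date("1969-12-31"))   # -1

# Difference between two dates, as a plain number of days
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
days <- mydates[1] - mydates[2]
as.numeric(days)                    # 1225

# Character -> Date -> character, with explicit formats
dates <- as.Date(c("01/05/1965", "08/16/1975"), "%m/%d/%Y")
as.character(dates)                 # "1965-01-05" "1975-08-16"
format(dates, "%d.%m.%Y")           # "05.01.1965" "16.08.1975"
```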
  • 87. Commands to calculate descriptive statistics Statistic Command  Mean mean(variable.name)  Median median(variable.name)  Range range(variable.name)  Standard deviation sd(variable.name)  No. observations length(variable.name)  Variance var(variable.name)
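As a brief illustration, the commands in the table can be applied to one column of the built-in mtcars data frame. The choice of the mpg column is an assumption made only for this sketch.

```r
mpg <- mtcars$mpg

mean(mpg)     # arithmetic mean
median(mpg)   # median
range(mpg)    # smallest and largest value
sd(mpg)       # standard deviation
var(mpg)      # variance, equal to sd(mpg)^2
length(mpg)   # number of observations

# summary() reports several of these at once
summary(mpg)
```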
  • 88. Comparing two groups of measurements Identifying the type of test  One-sample test -Used when a single sample with a specific hypothesized value for the mean is to be considered. Examples of this include fixed value comparisons such as whether average human height is 1.77m.  Two independent sample test - measurements on two samples from two different populations are compared. Examples include comparisons of males and females.  Paired-sample test - Used when two different measurements were taken on the SAME experimental units. Examples are before and after studies on the effect of medical treatments
  • 89. Example of t-test  Hypothesis : Average human height is 1.77m  Alternative Hypothesis: Average human height is significantly different from 1.77m  height <- c(1.43,1.75,1.85,1.74,1.65,1.83,1.91,1.52,1.92,1.83)  t.test(height, mu = 1.77)  One Sample t-test  data: height  t = -0.5205, df = 9, p-value = 0.6153  alternative hypothesis: true mean is not equal to 1.77  95 percent confidence interval:  1.625646 1.860354  sample estimates:  mean of x  1.743  The p-value (0.6153) is well above 0.05, so we fail to reject the null hypothesis: the data are consistent with an average height of 1.77m.
  • 90. Two independent sample t-test  We use our example of dispersal distance in male and female butterflies. This is your data:  distance <- c(3,5,5,4,5,3,1,2,2,3)  sex <- c("male","male","male","male","male","female","female","female","female","female")
  • 91.  Before running the test it is important to consider your alternative hypothesis, whether you want to run a one-tailed or two-tailed test.  If no alternative hypothesis is specified, the command will assume a two-tailed test.  The two-sample t-test has a second assumption in addition to the normality of the data:  equal variance in the two samples.  If the variances are assumed to be equal, this must be specified using the argument var.equal = TRUE; otherwise t.test() uses Welch's t-test, which does not assume equal variances, by default.  Here we assume equal variances and perform a two-tailed test.
  • 92.  t.test(distance ~ sex, var.equal = TRUE)  Two Sample t-test  data: distance by sex  t = -4.0166, df = 8, p-value = 0.003859  alternative hypothesis: true difference in means is not equal to 0  95 percent confidence interval: -3.4630505 -0.9369495  sample estimates:  mean in group female mean in group male 2.2 4.4  Thus, the dispersal distance of male butterflies is significantly different from that of female butterflies.  We can also specify a one-sided alternative hypothesis by adding the argument alternative = "less" or alternative = "greater" depending on which tail is to be tested:  t.test(distance ~ sex, var.equal = TRUE, alternative="greater")
  • 93.  Two Sample t-test  data: distance by sex  t = -4.0166, df = 8, p-value = 0.9981  alternative hypothesis: true difference in means is greater than 0  95 percent confidence interval:  -3.218516 Inf  sample estimates:  mean in group female mean in group male  2.2 4.4  The results of these tests state that female dispersal distance is not significantly greater than male dispersal distance.
  • 94. Paired sample t-test  As an example of a paired test, you investigate whether the sleep of students is affected by an exam. You ask 6 students how long they sleep the night before an exam and the night after an exam. These are the answers you get:  sleep.before <- c(4,2,7,4,3,2)  sleep.after <- c(5,1,3,6,2,1)  Here you simply pass the two vectors of measurements to t.test() and add the argument paired=TRUE.  t.test(sleep.before, sleep.after, paired=TRUE)
  • 95. Output:  Paired t-test  data: sleep.before and sleep.after t = 0.7906, df = 5, p-value = 0.465 alternative hypothesis: true difference in means is not equal to 0  95 percent confidence interval: -1.501038 2.834372  sample estimates: mean of the differences 0.6666667  Well, this test is NOT significant, and thus the data do not support an effect of exams on students' sleeping time. But maybe you forgot about the party after the exam?
  • 96. Correlation analysis  Pearson Correlation - This test seeks to determine the level of relatedness between two variables using a score that runs from -1 (perfect negative correlation) to 1 (perfect positive correlation). A value of zero indicates no correlation.  cor.test(iris$Sepal.Length,iris$Petal.Length)  Pearson's product-moment correlation  data: iris$Sepal.Length and iris$Petal.Length  t = 21.646, df = 148, p-value < 2.2e-16  alternative hypothesis: true correlation is not equal to 0  95 percent confidence interval:  0.8270363 0.9055080  sample estimates:  cor  0.8717538
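A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below checks this with the sleep data from the slides; the object names paired_res and onesample_res are introduced here for illustration.

```r
sleep.before <- c(4, 2, 7, 4, 3, 2)
sleep.after  <- c(5, 1, 3, 6, 2, 1)

# Paired test on the two vectors
paired_res <- t.test(sleep.before, sleep.after, paired = TRUE)

# One-sample test on the differences, against a mean of zero
onesample_res <- t.test(sleep.before - sleep.after, mu = 0)

# Both give the same t statistic and p-value
paired_res$statistic
onesample_res$statistic
paired_res$p.value
onesample_res$p.value
```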
  • 97.  This data shows a highly-significant (P-value < 2.2e-16) and strongly positive (0.87) correlation between these two variables.  In this case, the P-value is used to reject the null hypothesis that the true correlation is equal to zero.
  • 98. Spearman Correlation  A non-parametric alternative to Pearson’s r is Spearman's rank correlation coefficient, or Spearman’s rho. Like Pearson’s r, Spearman’s rho determines the level of correlation of two variables ranging from -1 to 1. The difference between the two measures is that Spearman uses the rank-order of the data rather than the raw values.  cor.test (iris$Sepal.Length,iris$Petal.Length, method="spearman")
  • 99.  Spearman's rank correlation rho  data: iris$Sepal.Length and iris$Petal.Length  S = 66429.35, p-value < 2.2e-16  alternative hypothesis: true rho is not equal to 0  sample estimates:  rho  0.8818981  Note that this test produced similar, but not identical, results compared with Pearson’s r.
  • 100. Cross-tabulation and the χ2 test  Basic contingency tables have two categorical variables. In many cases we may wish to test whether the two grouping variables are independent. One of the most common ways to analyze contingency tables is with the χ2 test (Chi-square test). The χ2 test works by first calculating the difference between expected and observed values:  χ2 = Σ (O − E)^2 / E, where O is the observed count and E is the expected count in each cell of the table.  The result of this calculation, the so-called χ2 score, is then compared to the χ2 distribution to calculate a p-value to determine if the observed values differ significantly from the expected values.
  • 101.  We will use an example of eye color counts in two different groups of flies. The dataset can be found in 3.9_flies_eyes_color.csv.  We begin by loading the data and creating a contingency table.  flyeyes <- read.csv("3.9_flies_eyes_color.csv", header = TRUE)  tab <- table(flyeyes$Eyecolor, flyeyes$Group)  A B  Red 34 41  White 16 9  The ratio between red and white eyes differs between group A and group B.  We will use the chi-squared test to determine whether the data are more compatible with the  null hypothesis that the variables of eye color and group are independent of each other  or with the alternative hypothesis that eye color and group are not independent
  • 102.  chisq.test(tab)  Pearson's Chi-squared test with Yates' continuity correction  data: tab  X-squared = 1.92, df = 1, p-value = 0.1659  In this case, based on a chi-squared value of 1.92 and 1 degree of freedom, we calculated a P-value of 0.1659.  Thus, we cannot reject the null hypothesis: the data give no evidence that eye color and group are not independent.
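If the CSV file is not at hand, the same contingency table can be typed in directly from the counts shown on the slide. This is a sketch under that assumption; the dimension names Eyecolor and Group follow the slide, while the object name res is introduced here.

```r
# Counts from the slide, entered column by column (group A, then group B)
tab <- matrix(c(34, 16, 41, 9), nrow = 2,
              dimnames = list(Eyecolor = c("Red", "White"),
                              Group    = c("A", "B")))
tab

res <- chisq.test(tab)   # Yates' continuity correction is the default for 2x2 tables
res$expected             # expected counts under independence
res$statistic            # X-squared = 1.92
res$p.value

# The uncorrected statistic, for comparison
chisq.test(tab, correct = FALSE)$statistic
```

The expected counts make the test easier to read: with 50 flies per group and 75 red-eyed flies overall, independence predicts 37.5 red and 12.5 white in each group.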
  • 103. Linear Models  Linear models are a large family of statistical analyses that relate a continuous response to one or several explanatory variables.  Explanatory variables can be grouping factors or continuous or a combination of both
  • 104. One-way analysis of variance (ANOVA)  Tests whether means of more than two groups are the same, for example whether fruit production differs among five populations of a plant species.  If there are only two groups, a t-test is the way to proceed.  ANOVA relates variance within groups to variance between groups.  The analysis does not, however, tell you which groups are significantly different from each other. For this purpose a Tukey test can be applied.
  • 105. Two-way ANOVA  This analysis assesses the influence of two grouping factors on group means, for example, whether irrigation and fertilization have an effect on plant growth.  Importantly, two-way ANOVA can also analyze whether the two factors interact, in the example, whether the effect of irrigation depends on the fertilizer level (or the other way around). This is called a statistical interaction.  The same methods can also be applied to studies with more than two grouping factors (multi-way ANOVA).
  • 106. Linear regression  Linear regression analyzes to what extent changes in a continuous explanatory variable result in changes in the response variable, for example whether larger females cause longer male courtship behavior. If a causal relationship cannot be assumed a correlation analysis should be used. This type of analysis can also be conducted with more than one continuous explanatory variable (multiple regression).
  • 107. Analysis of covariance (ANCOVA)  ANCOVA allows more complicated analyses that involve effects of grouping factors, explanatory factors and their interactions.  An example is an analysis of whether the response to different doses of a medication differs between male and female patients. Such more complicated linear models can also include more than two explanatory factors.
  • 108. Defining the model  First you must define the linear model that you want to use, using the lm() function. Within this function a so-called formula statement defines the relationship of the variables to each other. The response variable is always on the left side of the tilde symbol (~) and the explanatory variable(s) are on the right side, as in lm(response ~ explanatory, …). For instance, if we use R's built-in airquality dataset and want to build a model to predict ozone content in the atmosphere using wind speed, the model definition would be as follows:  My.model <- lm(Ozone ~ Wind, data = airquality)
  • 109.  If the explanatory variable is a factor, the formula statement will yield an ANOVA:  airquality$Month <- as.factor(airquality$Month)  #turns Month into a factor  lm(Ozone ~ Month, data = airquality)  While the following formula statement, with a continuous explanatory variable, will yield a regression analysis:  lm(Ozone ~ Temp, data = airquality)
  • 110.  Formula statements are further used to combine explanatory variables and to define interactions. If variables should be considered only by themselves (additive effects), for example in a multiple regression without interaction you connect the variables by a plus sign as in:  lm(Ozone ~ Temp + Wind, data = airquality)  On the other hand, if you want to consider interactions in addition to the additive effects use an asterisk (*) between the explanatory variables, as in:  lm(Ozone ~ Temp * Wind, data = airquality)
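The effect of "+" versus "*" in a formula can be seen in the coefficients that lm() estimates. The sketch below fits both models from the slide; the object names additive and with_inter are introduced here for illustration.

```r
# Additive effects only
additive <- lm(Ozone ~ Temp + Wind, data = airquality)

# Additive effects plus the Temp-by-Wind interaction
with_inter <- lm(Ozone ~ Temp * Wind, data = airquality)

names(coef(additive))     # "(Intercept)" "Temp" "Wind"
names(coef(with_inter))   # the same three plus "Temp:Wind"

# F test of whether the interaction term improves the fit
anova(additive, with_inter)
```

Writing Temp * Wind is shorthand for Temp + Wind + Temp:Wind, which is why the second model contains one extra coefficient.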
  • 111. Checking assumptions with diagnostic plots  All linear models including regression, one-way ANOVA and ANCOVA have the following assumptions:  1. The experimental units are independent and sampled at random. The independence assumption depends heavily on the experimental design.  2. The residuals have constant variance across values of explanatory variables.  3. The residuals, i.e. the differences between the observed values of a response variable and the values fitted by the model, are normally distributed with a mean of zero.
  • 112.  My.model <- lm(Ozone ~ Wind, data = airquality)  par(mfrow=c(1,2))  plot(My.model, which = c(1,2))
  • 113. Analyzing and interpreting the model  An ANOVA table shows how much variation in the response is explained by the explanatory factors. To get the ANOVA table, use the command anova(My.model), where My.model is the object that stores the defined model  airquality$Month <- as.factor(airquality$Month)  #turns Month into a factor  My.model <- lm(Ozone ~ Month, data = airquality)  anova(My.model)
  • 114.  Analysis of Variance Table  Response: Ozone  Df Sum Sq Mean Sq F value Pr(>F)  Month 4 29438 7359.5 8.5356 4.827e-06 ***  Residuals 111 95705 862.2  ---  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  In the output, Df stands for degrees of freedom, which is the number of values in the final calculation of a statistic that are free to vary; the F value is the test statistic, calculated as the ratio between the explained and the unexplained variance; and Pr(>F) is the corresponding p-value, which is the probability of obtaining an F statistic at least this large if the null hypothesis (no effect of the explanatory variable on the response variable) is true.
  • 115.  The command summary(model) will show the parameters estimated by the model, for example the slope of the regression for regression analyses or the difference between group means for ANOVA.  My.model <- lm(Ozone ~ Wind, data = airquality)  summary(My.model)
  • 116. Examples  One-way ANOVA  To test whether fruit production differs between populations of Lythrum, fruits were counted on 10 individuals on each of 3 populations.  fruits <- data.frame(fruits = c(24, 19, 21, 20, 23, 19, 17, 20, 23, 20, 11, 15, 11, 9, 10, 14, 12, 12, 15, 13, 13, 11, 19, 12, 15, 15, 13, 18, 17, 13), pop = c(rep(c(1), 10), rep(c(2), 10), rep(c(3), 10)))  fruits$pop <- as.factor(fruits$pop)  plot(fruits ~ pop, data = fruits)  model<-lm(fruits~pop,data=fruits)  par(mfrow=c(1,2));plot(model, which=c(1,2))
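The slide stops at the diagnostic plots. A possible continuation, sketched here with the same data, is to print the ANOVA table and then run a Tukey test to see which populations differ. TukeyHSD() works on aov objects, so the model is refitted with aov(); the names group_means and tukey are introduced for illustration.

```r
fruits <- data.frame(
  fruits = c(24, 19, 21, 20, 23, 19, 17, 20, 23, 20,
             11, 15, 11,  9, 10, 14, 12, 12, 15, 13,
             13, 11, 19, 12, 15, 15, 13, 18, 17, 13),
  pop = factor(rep(1:3, each = 10)))

# Mean fruit count per population
group_means <- tapply(fruits$fruits, fruits$pop, mean)
group_means

# ANOVA table for the linear model
model <- lm(fruits ~ pop, data = fruits)
anova(model)

# Pairwise comparisons between populations
tukey <- TukeyHSD(aov(fruits ~ pop, data = fruits))
tukey
```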
  • 117. Two-way ANOVA  In a study on pea cultivation methods, pea production was assessed in two treatments of irrigation (normal irrigation and drought) and in three treatments of radiation (low, medium and high). 10 plants in each of the six combinations were considered.  plants<- data.frame(seeds=c(39,39,39,40,40,39,41,42,40,40,39,38,41,41,40,41,40,40,41,40,38,40,40,39,42,40,39,41,39,40,39,40,41,40,41,39,40,41,40,39,42,40,39,39,42,40,39,39,39,39,41,38,40,39,41,42,40,40,40,41),irrigation=c(rep(c(1),30),rep(c(2),30)),radiation=c(rep(c(1,2,3),20)))  plants$irrigation<-as.factor(plants$irrigation)  plants$radiation<-as.factor(plants$radiation)  par(mfrow=c(1,2));plot(seeds~irrigation*radiation,data=plants)  model<-lm(seeds~irrigation*radiation,data=plants)  par(mfrow=c(1,2));plot(model, which=c(1,2))  anova(model)
  • 118. Linear Regression  In an experiment testing whether the duration of male courtship behavior depends on female size, 16 pairs of earwigs were observed.  sex<-data.frame(pair = 1:16, fem_size = c(58.84, 60.37, 57, 59.86, 61.42, 60.34, 60.1, 59.63, 58.06, 61, 58.61, 60.94, 60.83, 57.7, 60, 59.09), male_court_hrs = c(8.37, 9.88, 10.12, 8.39, 9.93, 9.69, 8.68, 11.74, 11.07, 8.69, 10.53, 10.38, 10.12, 11.14, 8.6, 11.26))  plot(male_court_hrs~ fem_size,data=sex)  model<-lm(male_court_hrs~ fem_size,data=sex)  par(mfrow=c(1,2));plot(model, which=c(1,2))  summary(model)  This analysis suggests that larger females do not cause longer male courtship behavior.
  • 119. Basic graphs with R  Bar-plots –  We are going to use the internal dataset ToothGrowth (available with the R installation), which contains measurements of tooth length in guinea pigs that received three levels of vitamin C and two supplement types.  We want to produce a barplot of the mean tooth length for all six combinations of the two factors (supplement type: 2 levels, dose: 3 levels). We first need to calculate the mean tooth length for each of the combinations. For this, we use the command tapply().  tapply() can return a table with mean tooth lengths for all six combinations and this table will be the input for the barplot. Importantly, tapply() will create a matrix with two rows and three columns corresponding to the factor levels in the dataset. This structure is needed to produce a grouped barplot.
  • 120.  mean.tg <- tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), FUN = mean)  mean.tg  barplot(mean.tg, beside = T)  barplot(mean.tg, beside=T, xlab = "Dose (mg)", ylab = "Tooth length (cm)", names = c("0.5", "1.0", "2.0"), cex.lab = 1.3, cex.names = 1.2,col=c(0, 8), ylim = c(0, max(ToothGrowth$len)), las = 1)
  • 121.  The labels of the axes can be specified with the arguments xlab and ylab and the labels below each group of bars are controlled with the argument names. The font size of these labels can be changed with cex.lab and cex.names. These arguments are set to 1 by default and changes are relative to this default. For example, cex.axis = 2 will double the font size of the axis annotation. The limits of the y-axis are specified with ylim. Here we use zero and the maximum in the dataset. The orientation of the axis labels can be altered with the argument las that has four options (0,1,2,3). Here, las = 1 produces horizontal axis labels. The colors of the bars are determined by col, in our example by a vector with a length of two for the two groups, specifying 0 (the background color, white by default) and 8 (grey). The color can be specified either with numbers or with the color name.
  • 122. Grouped scatter plot with regression lines  To produce a scatterplot, we will use the plot() command. plot() is a higher-level plotting command, meaning that it will create a new graph.  We are going to use part of the internal dataset iris (available with the R installation) as an example (Figure 5-2). iris contains flower measurements of three different Iris species. You can explore the dataset by ?iris, summary(iris) and str(iris). To reduce the dataset to two species and to plot all the datapoints use:  iris.short <- iris[1:100, ]  plot(iris.short$Sepal.Length, iris.short$Sepal.Width)
  • 123.  We can now assign two different plotting symbols for the species by creating a new column in the data frame iris.short, named iris.short$pch, that contains the number of the plotting symbol to be used. There are 26 different plotting symbols, ranging from 0 to 25. Here we use symbol 1 for Iris setosa and symbol 16 for Iris versicolor. You can use the same procedure to assign different colors to the two species (see above). We can then set the axis labels, range and orientation as well as font size using xlab, ylab, xlim, ylim, las, cex.axis and cex.lab as explained above.  iris.short$pch[iris.short$Species == "setosa"] <- 1  iris.short$pch[iris.short$Species == "versicolor"] <- 16  plot(iris.short$Sepal.Length, iris.short$Sepal.Width, xlab = "Sepal length (cm)", ylab = "Sepal width (cm)", xlim = c(4, 7.5), las = 1, cex.axis = 1.2, cex.lab = 1.3, pch = iris.short$pch)
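The slide title mentions regression lines, but the code shown stops at the scatter plot. One possible way to add a line per species, sketched here as an assumption about the intended next step, is to fit a separate lm() for each species and draw it with abline(). The names pch_vec, fits and one_species are introduced for illustration.

```r
iris.short <- iris[1:100, ]   # setosa and versicolor only

# Plotting symbol per species: 1 for setosa, 16 for versicolor
pch_vec <- ifelse(iris.short$Species == "setosa", 1, 16)

plot(iris.short$Sepal.Length, iris.short$Sepal.Width, pch = pch_vec,
     xlab = "Sepal length (cm)", ylab = "Sepal width (cm)", las = 1)

# Fit and draw one regression line per species
fits <- list()
for (sp in c("setosa", "versicolor")) {
  one_species <- iris.short[iris.short$Species == sp, ]
  fits[[sp]] <- lm(Sepal.Width ~ Sepal.Length, data = one_species)
  abline(fits[[sp]], lty = if (sp == "setosa") 1 else 2)
}

# Intercept and slope for each species
sapply(fits, coef)
```

abline() is a lower-level plotting command: it adds to the existing graph instead of creating a new one, which is why it must come after plot().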
  • 124. Logistic Regression  Logistic regression models are used in situations where we want to know how a binary response variable is affected by one or more continuous variables. Common biological examples of this include assessing probability of survival, probability of reproducing, or probability of an individual possessing a certain allele. On a natural scale, logistic regression is non-linear and cannot be analyzed using linear models. However, this problem is circumvented by using the logit transformation to linearize the model.  Logit = log(p / (1 - p))
  • 125.  Creating and analyzing model  In R, logistic regression models are created using the generalized linear model function glm(). This takes the general form of:  Model <- glm(probability_data ~ continuous_predictor, family = "binomial")  The argument family = "binomial" tells the function to create a binomial logistic regression model. As with the lm() function, we can use summary() to obtain summary data of the model.  LModel <- glm(survival ~ height, family = binomial, data = Hypericum)  summary(LModel)  G_sq <- LModel$null.deviance - LModel$deviance  pchisq(G_sq, 1, lower.tail=F)
  • 126.  Sequence<-seq(0,4,.1)  PLMlogit<-predict(LModel, list(height=Sequence))  plot(Sequence,PLMlogit, type="l", xlab="Log(height)", ylab="Logit Survival")  PLMcurve<-predict(LModel,list(height=Sequence),type="response")  plot(Sequence,PLMcurve,type="l", xlab="Log(height)", ylab="Survival")
  • 127.  Logistic regressions are used when you have a probability as a response variable and a continuous predictor variable  Logistic curves are analyzed as generalized linear models with glm() through the use of the logit transformation.  Logistic regressions can be plotted either as a logistic curve or as a linear function.
  • 128. R Flow Control  for (VAR in SEQ) {EXPR}  while (COND) {EXPR}  repeat {EXPR}  The first one, for(), iterates through each component VAR of the sequence SEQ: for example, in the first iteration VAR = SEQ[1], in the second iteration VAR = SEQ[2], and so on.  VAR is the abbreviation of variable.  SEQ is the abbreviation of sequence, which is equivalent to a vector (including list) in R.  COND is the abbreviation of condition, which can evaluate to TRUE or FALSE.  EXPR is the abbreviation of expression in a formal sense.
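The logit transformation and its inverse are available in base R as qlogis() and plogis(). The sketch below checks them against the formula on the slide for a few example probabilities chosen for illustration.

```r
p <- c(0.1, 0.5, 0.9)

# Logit by hand: log of the odds
logit <- log(p / (1 - p))
logit

# Same result from the built-in quantile function of the logistic distribution
all.equal(logit, qlogis(p))

# Inverse logit maps the linear scale back to probabilities
plogis(logit)
```

A probability of 0.5 corresponds to a logit of 0, and probabilities that are symmetric around 0.5 give logits of equal size and opposite sign.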
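The Hypericum data used on the slides are not part of R, so the following sketch simulates a comparable data set in order to run the same steps end to end. The simulated coefficients (-2 and 1.5), the sample size and the data frame name sim are all assumptions made for illustration, not values from the original study.

```r
set.seed(1)

# Simulated data: survival probability increases with height
height   <- runif(200, min = 0, max = 4)
survival <- rbinom(200, size = 1, prob = plogis(-2 + 1.5 * height))
sim      <- data.frame(height, survival)

# Logistic regression, as on the slides
LModel <- glm(survival ~ height, family = binomial, data = sim)
summary(LModel)

# Likelihood-ratio (G-squared) test against the null model
G_sq <- LModel$null.deviance - LModel$deviance
pchisq(G_sq, 1, lower.tail = FALSE)

# Predictions on the logit scale and on the probability scale
Sequence <- seq(0, 4, 0.1)
PLMlogit <- predict(LModel, list(height = Sequence))
PLMcurve <- predict(LModel, list(height = Sequence), type = "response")
```

PLMlogit is a straight line in height, while PLMcurve is the S-shaped curve obtained by applying the inverse logit to it.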
  • 129. for  for ( i in 1:5 ) {  print( paste('square of', i, '=', i^2) )  }
  • 130. while  i_w <- 1  while ( i_w <=10 ) {  i_w <- i_w + 5  }  i_w
  • 131. repeat  i_r <-1  repeat {  i_r <- i_r + 1  if (i_r > 10) {break}  }  i_r
  • 132. If – else if - else  x <- 3  if ( ! is.numeric(x) ) {  stop( paste(x, 'is not numeric') )  } else if ( x%%2 == 1) {  print ( paste(x, 'is an odd number') )  } else if ( x == round(x) ) {  print ( paste(x, 'is an integer') )  } else {  print ( paste(x, 'is a number') )  } 
  • 133. Functions  R provides a convenient way to define custom functions and make good use of them. All functions read and parse inputs, which are referred to as arguments, and then return output. An R function is a first-class object. It can be created with the keyword function, followed by a comma-separated list of formal arguments enclosed in a pair of parentheses, and then the expression that forms the body of the function.  If the body consists of a single expression, it can be entered directly; when there are multiple expressions, they have to be enclosed in braces {}. The value returned by an R function is either the value passed to the built-in function return() or simply the value of the last evaluated expression.
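The if / else if / else chain tests a single value. For a whole vector, the same kind of classification can be written with the vectorized ifelse() function. The sketch below is one possible variant; the input values and the short labels are chosen for illustration.

```r
x <- c(3, 4, 2.5)

# Nested ifelse(): the first matching condition decides the label
kind <- ifelse(x %% 2 == 1, "odd",
        ifelse(x == round(x), "integer", "number"))
kind   # "odd" "integer" "number"
```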
  • 134. Example of a function  expon <- function(x,n) {  if ( x%%1 != 0 ) {  stop('x must be an integer!')  } else if ( n==0 ) {  return(1)  } else {  prod <- x  while( n>1 ) {  prod <- x*prod  n <- n-1  }  return(prod)  } # end of else  } # end of the function  expon(3,4)
  • 135.  The formal arguments and the body of the function expon() can later be accessed via the R functions formals() and body(), as follows:  formals(expon)  $x  $n  body(expon)  {  if (x%%1 != 0) {  stop("x must be an integer!")  } (…)
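As a further sketch of the points above, the following function has a single-expression body, returns the value of its last expression, and gives one argument a default value. The name power_of is hypothetical and not part of the slides.

```r
# n defaults to 2 when it is not supplied
power_of <- function(x, n = 2) {
  x^n   # the value of the last expression is returned
}

power_of(3)             # 9, uses the default n = 2
power_of(3, 4)          # 81, same result as expon(3, 4) on the slides
power_of(n = 3, x = 2)  # 8, arguments matched by name

names(formals(power_of))   # "x" "n"
formals(power_of)$n        # 2, the default value
```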
  • 136. Thank You for Your Patience!!