This document provides an overview of the R programming language, an environment for statistical computing and graphics. It covers conditionals, loops, user-defined functions, and input/output facilities; how to download and install R and RStudio; and key R features such as objects, classes, vectors, matrices, lists, functions, packages, and graphics.
The presentation is a brief case study of the R programming language. In it, we discuss the scope of R, its uses, and the advantages and disadvantages of the language.
This presentation educates you about R decision trees, with examples of their use, basic syntax, and input and output data with charts.
For more topics stay tuned with Learnbay.
This presentation is an introduction to the R programming language. We will talk about the usage, history, data structures, and features of R.
The R language is a project designed to create a free, open-source language that can be used as a replacement for S-PLUS. S-PLUS was originally developed as the S language at AT&T Bell Labs, and was later marketed by Insightful Corporation of Seattle, Washington. R is an open-source implementation of S, and differs from S-PLUS largely in its command-line-only interface.
Topics Covered:
1. Introduction to R
2. Installing R
3. Why Learn R
4. The R Console
5. Basic Arithmetic and Objects
6. Program Example
7. Programming with Big Data in R
8. Big Data Strategies in R
9. Applications of R Programming
10. Companies Using R
11. What R is not so good at
12. Conclusion
Exploratory data analysis and data visualization:
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
Maximize insight into a data set.
Uncover underlying structure.
Extract important variables.
Detect outliers and anomalies.
Test underlying assumptions.
Develop parsimonious models.
Determine optimal factor settings.
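The EDA goals above can be sketched with a few base-R calls. This is a minimal illustration using the built-in mtcars data set; the choice of data set and functions is ours, not from the slides.

```r
# A minimal EDA sketch on the built-in mtcars data set
data(mtcars)
summary(mtcars$mpg)        # numeric summary: maximize insight into the data
str(mtcars)                # uncover the underlying structure of the data set
cor(mtcars$mpg, mtcars$wt) # examine the relationship between two variables
boxplot(mtcars$mpg)        # detect outliers graphically
hist(mtcars$mpg)           # check distributional assumptions
```

Each call targets one of the bullets above: summaries and correlations for insight and important variables, boxplots for outliers, histograms for distributional assumptions.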
This introduction to the popular ggplot2 R graphics package will show you how to create a wide variety of graphical displays in R. Data sets and additional workshop materials available at http://projects.iq.harvard.edu/rtc/event/r-graphics
The goal of this workshop is to introduce the fundamental capabilities of R as a tool for performing data analysis. Here, we learn about R, the most comprehensive statistical analysis language, to get a basic idea of how to analyze real-world data, extract patterns from data, and find causality.
Business Analytics with R at Edureka will prepare you to perform analytics and build models for real-world data science problems. R is the world's most powerful programming language for statistical computing and graphics, making it a must-know language for aspiring data scientists. R wins strongly on statistical capability, graphical capability, cost, and its rich set of packages.
This presentation discusses the following topics:
Basic features of R
Exploring R GUI
Data Frames & Lists
Handling Data in R Workspace
Reading Data Sets & Exporting Data from R
Manipulating & Processing Data in R
The normal forms (NF) of relational database theory provide criteria for determining a table’s degree of vulnerability to logical inconsistencies and anomalies.
3. R is:
A programming environment for statistical computing, data analysis and graphics. It is
a GNU project similar to the S language, which was developed at Bell Laboratories by
John Chambers.
Graphical facilities for data analysis and visualization.
A well-developed, simple and effective programming language that includes
conditionals, loops, user-defined recursive functions, and input and output facilities.
A fully planned and coherent system.
R is continuously being updated with newer functionality such as deep learning libraries.
It has developed rapidly, and has been extended by a large collection of packages.
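The language features named above (conditionals, loops, user-defined recursive functions) can be sketched in a few lines. The factorial function here is our own illustrative example, not one from the slides.

```r
# A user-defined recursive function using a conditional
fact <- function(n) {
  if (n <= 1) {      # conditional
    return(1)
  }
  n * fact(n - 1)    # recursive call
}

# A simple loop accumulating a running total
total <- 0
for (i in 1:5) {
  total <- total + i
}

fact(5)   # factorial of 5
total     # sum of 1 through 5
</ignore>
```

fact(5) evaluates to 120 and total to 15, exercising all three features in one short script.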
4. R provides a wide variety of statistical tools (linear and nonlinear modelling,
classical statistical tests, time-series analysis, classification, clustering, …)
and graphical techniques, and is highly extensible
One of R’s strengths is visualization: high-quality plots can be produced with it.
R is available as Free Software under the terms of the Free Software
Foundation’s GNU General Public License in source code form.
R documentation and manuals are available at
https://cran.r-project.org/manuals.html
6. Downloading R for Windows
Go to https://cran.r-project.org/bin/windows/base/
Click on "Download R 3.4.2 for Windows"
Run the downloaded .exe file
Select English as the language
Click Next through the screens, then Finish
12. Installing RStudio
Once the .exe downloads to your PC, run it and follow the default settings,
clicking Next and finally Finish.
You have now installed RStudio.
To find it:
Go to Start
Under Programs you will find RStudio; click on it to start it
14. R is case sensitive, so A and a are different symbols and refer to different
variables.
Commands are separated either by a semicolon (';') or by a newline.
If a command is not complete at the end of a line, R gives a different prompt,
by default '+', on second and subsequent lines, and continues to read input until
the command is syntactically complete.
The vertical arrow keys on the keyboard can be used to scroll forward and
backward through the command history.
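A short sketch of these console behaviors; the variable names are illustrative.

```r
# Case sensitivity: A and a are distinct objects,
# and ';' separates two commands on one line
A <- 10; a <- 2
A / a            # uses both distinct variables

# An incomplete expression triggers the continuation prompt:
#   > x <- (1 +
#   + 2)
# R keeps prompting with '+' until the parentheses are balanced
x <- (1 +
  2)
x
```

Here A / a evaluates to 5 and x to 3; the parenthesized assignment mimics what the '+' continuation prompt handles interactively.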
R Features
15. Objects and Workspace
The entities that R creates and manipulates are known as objects.
These may be variables, arrays of numbers, character strings, functions, or
more general structures built from such components.
During an R session, objects are created and stored by name.
The R command > objects() (alternatively, ls()) can be used to display the
names of the objects which are currently stored within R.
The collection of objects currently stored is called the workspace
To remove objects the function rm is available: > rm(x, y, z, ink, junk, temp,
foo, bar)
16. All saved objects are written to a file called .RData in the current directory,
and the command lines used in the session are saved to a file called .Rhistory.
When R is started at a later time from the same directory, it reloads the
workspace from this file, and the associated command history is reloaded
as well.
18. Built-in help system
At the program's command prompt you can use any of the following:
help.start() # start general help in HTML format
apropos("solve") # list all functions whose names contain the string "solve"
example(solve) # run an example of the solve function
> help(solve)
An alternative is
> ?solve
> help("[[")
> help.search("solve") # search the help system for the string "solve"
19. Standard commands for managing
your workspace.
ls() # list the objects in the current workspace
setwd(mydirectory) # change to mydirectory
setwd("c:/docs/mydir") # note / instead of \ on Windows
setwd("/usr/rob/mydir") # on linux
# view and set options for the session
help(options) # learn about available options
options() # view current option settings
options(digits=3) # number of digits to print on output
# work with your previous commands
history() # display last 25 commands
history(max.show=Inf) # display all previous commands
20. # save your command history
savehistory(file="myfile") # default is ".Rhistory"
# recall your command history
loadhistory(file="myfile") # default is ".Rhistory"
# save the workspace to the file .RData in the cwd
save.image()
# save specific objects to a file
# if you don't specify the path, the cwd is assumed
save(a, b, file="myfile.RData") # where a and b are the objects to save
# load a workspace into the current session
# if you don't specify the path, the cwd is assumed
load("myfile.RData")
q() # quit R. You will be prompted to save the workspace.
21. Input / Output
By default, launching R starts an interactive session with input from the keyboard and
output to the screen. However, you can have input come from a script file (a file
containing R commands) and direct output to a variety of destinations.
Input - The source( ) function runs a script in the current session. If the filename does
not include a path, the file is taken from the current working directory.
# input a script
source("myfile")
Output-The sink( ) function defines the direction of the output.
# direct output to a file
sink("myfile", append=FALSE, split=FALSE)
# return output to the terminal
sink()
22. The append option controls whether output overwrites or adds to a file. The split option
determines whether output is also sent to the screen in addition to the output file.
Here are some examples of the sink() function.
# output directed to output.txt in current working directory. output overwrites existing
file. no output to terminal.
sink("output.txt")
# output directed to myfile.txt in cwd. output is appended to existing file. output also
send to terminal.
sink("myfile.txt", append=TRUE, split=TRUE)
When redirecting output, use the cat( ) function to annotate the output.
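A minimal sketch of sink() with cat() annotation; the file name "results.txt" is illustrative.

```r
# Redirect output to a file and annotate it with cat()
sink("results.txt")
cat("Mean of 1:10:\n")   # annotation line written to the file
print(mean(1:10))        # result written to the same file
sink()                   # return output to the terminal

readLines("results.txt") # the file now holds the annotated output
```

Without the cat() call, the file would contain only a bare "[1] 5.5", with nothing to say what the number means.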
23. Graphs
sink( ) will not redirect graphic output. To redirect graphic output use one of
the following functions. Use dev.off( ) to return output to the terminal.
Function                      Output to
pdf("mygraph.pdf")            PDF file
win.metafile("mygraph.wmf")   Windows metafile
png("mygraph.png")            PNG file
jpeg("mygraph.jpg")           JPEG file
bmp("mygraph.bmp")            BMP file
postscript("mygraph.ps")      PostScript file
24. # example - output graph to jpeg file
jpeg("c:/mygraphs/myplot.jpg")
plot(x)
dev.off()
25. Packages
Packages are collections of R functions, data, and compiled code in a well-
defined format.
The directory where packages are stored is called the library.
R comes with a standard set of packages. Others are available for download
and installation.
The install.packages("packagename") command installs a package.
Once installed, they have to be loaded into the session to be used.
.libPaths() # get library location
library() # see all packages installed
search() # see packages currently loaded
26. Download and Install a Package
We need to download and install only once.
To use the package, invoke the library(package) command to load it into the
current session. (You need to do this once in each session, unless
you customize your environment to automatically load it each time.)
On MS Windows:
The command install.packages("packagename") installs a package from the
default mirror.
Then use the library(packagename) function to load it for use (e.g.
library(boot)).
27. Customizing Startup
You can customize the R environment through a site initialization file or a
directory initialization file. R will always source the Rprofile.site file first. On
Windows, the file is in the C:\Program Files\R\R-n.n.n\etc directory. You can
also place a .Rprofile file in any directory that you are going to run R from or
in the user home directory.
At startup, R will source the Rprofile.site file. It will then look for a
.Rprofile file to source in the current working directory. If it doesn't find it, it
will look for one in the user's home directory.
There are two special functions you can place in these files. .First( ) will be
run at the start of the R session and .Last( ) will be run at the end of the
session.
28. # Sample Rprofile.site file
# Things you might want to change
# options(papersize="a4")
# options(editor="notepad")
# options(pager="internal")
# R interactive prompt
# options(prompt="> ")
# options(continue="+ ")
# to prefer Compiled HTML help
# options(chmhelp=TRUE)
# to prefer HTML help
# options(htmlhelp=TRUE)
# General options
options(tab.width = 2)
options(width = 130)
options(graphics.record=TRUE)
.First <- function(){
library(Hmisc)
library(R2HTML)
cat("\nWelcome at", date(), "\n")
}
.Last <- function(){
cat("\nGoodbye at ", date(), "\n")
}
30. Data Types
R has a wide variety of data types including scalars, vectors (numerical,
character, logical), matrices, data frames, and lists.
VECTORS
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one", "two", "three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
Refer to elements of a vector using subscripts
a[c(2,4)] # 2nd and 4th elements of vector
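The other data types named above (matrices, data frames, lists) are not shown elsewhere in the slides, so here is a brief sketch with illustrative values.

```r
# Matrix: a 2 x 3 numeric matrix, filled column by column
m <- matrix(1:6, nrow = 2)
m[2, 3]                             # element in row 2, column 3

# Data frame: named columns of equal length, possibly of mixed types
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"))
df$name                             # access a column by name

# List: a collection of components of arbitrary, mixed types
lst <- list(nums = 1:3, flag = TRUE)
lst$flag                            # access a component by name
```

Matrices hold a single type; data frames and lists can mix types, which is why data frames are the usual container for real data sets.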
31. > x = 10.5 # assign a decimal value
> x # print the value of x
[1] 10.5
> class(x) # print the class name of x
[1] "numeric"
Furthermore, even if we assign an integer to a variable k, it is still being
saved as a numeric value.
> k = 1
> k # print the value of k
[1] 1
> class(k) # print the class name of k
[1] "numeric"
The fact that k is not an integer can be confirmed with the is.integer function.
> is.integer(k) # is k an integer?
[1] FALSE
32. Integer
In order to create an integer variable in R, we invoke the as.integer function.
We can be assured that y is indeed an integer by applying
the is.integer function.
> y = as.integer(3)
> y # print the value of y
[1] 3
> class(y) # print the class name of y
[1] "integer"
> is.integer(y) # is y an integer?
[1] TRUE
Incidentally, we can coerce a numeric value into an integer with the
same as.integer function.
> as.integer(3.14) # coerce a numeric value
[1] 3
33. And we can parse a string for decimal values in much the same way.
> as.integer("5.27") # coerce a decimal string
[1] 5
On the other hand, trying to parse a non-decimal string is erroneous.
> as.integer("Joe") # coerce a non-decimal string
[1] NA
Warning message:
NAs introduced by coercion
Often, it is useful to perform arithmetic on logical values. Like the C
language, TRUE has the value 1, while FALSE has value 0.
> as.integer(TRUE) # the numeric value of TRUE
[1] 1
> as.integer(FALSE) # the numeric value of FALSE
[1] 0
34. Complex
A complex value in R is defined via the pure imaginary value i.
> z = 1 + 2i # create a complex number
> z # print the value of z
[1] 1+2i
> class(z) # print the class name of z
[1] "complex"
The following produces NaN with a warning, as -1 is not treated as a complex value.
> sqrt(-1) # square root of -1
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
35. Instead, we have to use the complex value -1+0i.
> sqrt(-1+0i) # square root of -1+0i
[1] 0+1i
An alternative is to coerce -1 into a complex value.
> sqrt(as.complex(-1))
[1] 0+1i
36. Logical
A logical value is often created via comparison between variables.
> x = 1; y = 2 # sample values
> z = x > y # is x larger than y?
> z # print the logical value
[1] FALSE
> class(z) # print the class name of z
[1] "logical"
37. Standard logical operations are "&" (and), "|" (or), and "!" (negation).
> u = TRUE; v = FALSE
> u & v # u AND v
[1] FALSE
> u | v # u OR v
[1] TRUE
> !u # negation of u
[1] FALSE
38. Character
A character object is used to represent string values in R. We convert objects
into character values with the as.character() function:
> x = as.character(3.14)
> x # print the character string
[1] "3.14"
> class(x) # print the class name of x
[1] "character"
Two character values can be concatenated with the paste function.
> fname = "Joe"; lname ="Smith"
> paste(fname, lname)
[1] "Joe Smith"
39. However, it is often more convenient to create a readable string with
the sprintf function, which has a C language syntax.
> sprintf("%s has %d dollars", "Sam", 100)
[1] "Sam has 100 dollars"
To extract a substring, we apply the substr function. Here is an example
showing how to extract the substring between the third and twelfth positions
in a string.
> substr("Mary has a little lamb.", start=3, stop=12)
[1] "ry has a l"
And to replace the first occurrence of the word "little" with the word "big"
in the string, we apply the sub function.
> sub("little", "big", "Mary has a little lamb.")
[1] "Mary has a big lamb."
More functions for string manipulation can be found in the R documentation.
> help("sub")
40. Vector
A vector is a sequence of data elements of the same basic type. Members in a
vector are officially called components.
Here is a vector containing three numeric values 2, 3 and 5.
> c(2, 3, 5)
[1] 2 3 5
And here is a vector of logical values.
> c(TRUE, FALSE, TRUE, FALSE, FALSE)
[1] TRUE FALSE TRUE FALSE FALSE
A vector can contain character strings.
> c("aa", "bb", "cc", "dd", "ee")
[1] "aa" "bb" "cc" "dd" "ee"
The number of members in a vector is given by the length function.
> length(c("aa", "bb", "cc", "dd", "ee"))
[1] 5
41. Combining Vectors
Vectors can be combined via the function c. For example, the following two
vectors n and s are combined into a new vector containing elements from both
vectors.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> c(n, s)
[1] "2" "3" "5" "aa" "bb" "cc" "dd" "ee"
Value Coercion
In the code snippet above, notice how the numeric values are being coerced
into character strings when the two vectors are combined. This is necessary
so as to maintain the same primitive data type for members in the same
vector.
42. Vector Arithmetic
Arithmetic operations of vectors are performed member-by-member, i.e.,
memberwise.
For example, suppose we have two vectors a and b.
> a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)
Then, if we multiply a by 5, we would get a vector with each of its members
multiplied by 5.
> 5 * a
[1] 5 15 25 35
And if we add a and b together, the sum would be a vector whose members are the
sum of the corresponding members from a and b.
> a + b
[1] 2 5 9 15
43. Similarly for subtraction, multiplication and division, we get new vectors via
memberwise operations.
> a - b
[1] 0 1 1 -1
> a * b
[1] 1 6 20 56
> a / b
[1] 1.000 1.500 1.250 0.875
Recycling Rule
If two vectors are of unequal length, the shorter one will be recycled in order to
match the longer vector. For example, the following vectors u and v have different
lengths, and their sum is computed by recycling values of the shorter vector u.
> u = c(10, 20, 30)
> v = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
> u + v
[1] 11 22 33 14 25 36 17 28 39
44. Vector Index
We retrieve values in a vector by declaring an index inside the single square
bracket "[]" operator.
For example, the following shows how to retrieve a vector member. Since the
vector index is 1-based, we use the index position 3 for retrieving the third
member.
> s = c("aa", "bb", "cc", "dd", "ee")
> s[3]
[1] "cc"
Unlike other programming languages, the square bracket operator returns
more than just individual members. In fact, the result of the square bracket
operator is another vector, and s[3] is a vector slice containing a single
member "cc".
45. Negative Index
If the index is negative, it would strip the member whose position has the
same absolute value as the negative index. For example, the following
creates a vector slice with the third member removed.
> s[-3]
[1] "aa" "bb" "dd" "ee"
Out-of-Range Index
If an index is out-of-range, a missing value will be reported via the
symbol NA.
> s[10]
[1] NA
46. Numeric Index Vector
A new vector can be sliced from a given vector with a numeric index vector, which consists
of member positions of the original vector to be retrieved.
Here is how to retrieve a vector slice containing the second and third members of a
given vector s.
> s = c("aa", "bb", "cc", "dd", "ee")
> s[c(2, 3)]
[1] "bb" "cc"
Duplicate Indexes
The index vector allows duplicate values. Hence the following retrieves a member twice in
one operation.
> s[c(2, 3, 3)]
[1] "bb" "cc" "cc"
47. Out-of-Order Indexes
The index vector can even be out-of-order. Here is a vector slice with the order of the
first and second members reversed.
> s[c(2, 1, 3)]
[1] "bb" "aa" "cc"
Range Index
To produce a vector slice between two indexes, we can use the colon operator ":". This
can be convenient for situations involving large vectors.
> s[2:4]
[1] "bb" "cc" "dd"
More information for the colon operator is available in the R documentation.
> help(":")
48. Logical Index Vector
A new vector can be sliced from a given vector with a logical index vector,
which has the same length as the original vector. Its members are TRUE if the
corresponding members in the original vector are to be included in the slice,
and FALSE otherwise.
For example, consider the following vector s of length 5.
> s = c("aa", "bb", "cc", "dd", "ee")
To retrieve the second and fourth members of s, we define a logical
vector L of the same length, with its second and fourth members set
to TRUE.
> L = c(FALSE, TRUE, FALSE, TRUE, FALSE)
> s[L]
[1] "bb" "dd"
The code can be abbreviated into a single line.
> s[c(FALSE, TRUE, FALSE, TRUE, FALSE)]
[1] "bb" "dd"
49. Named Vector Members
We can assign names to vector members. For example, the following variable v is
a character string vector with two members.
> v = c("Mary", "Sue")
> v
[1] "Mary" "Sue"
We now name the first member as First, and the second as Last.
> names(v) = c("First", "Last")
> v
First Last
"Mary" "Sue"
Then we can retrieve the first member by its name.
> v["First"]
[1] "Mary"
Furthermore, we can reverse the order with a character string index vector.
> v[c("Last", "First")]
Last First
"Sue" "Mary"
50. Matrix
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout.
The following is an example of a matrix with 2 rows and 3 columns.
A = [ 2 4 3
      1 5 7 ]
We reproduce a memory representation of the matrix in R with the matrix function. The data
elements must be of the same basic type.
> A = matrix(
+ c(2, 4, 3, 1, 5, 7), # the data elements
+ nrow=2, # number of rows
+ ncol=3, # number of columns
+ byrow = TRUE) # fill matrix by rows
> A # print the matrix
[,1] [,2] [,3]
[1,] 2 4 3
[2,] 1 5 7
51. An element at the mth row, nth column of A can be accessed by the
expression A[m, n].
> A[2, 3] # element at 2nd row, 3rd column
[1] 7
The entire mth row of A can be extracted as A[m, ].
> A[2, ] # the 2nd row
[1] 1 5 7
Similarly, the entire nth column of A can be extracted as A[ ,n].
> A[ ,3] # the 3rd column
[1] 3 7
52. We can also extract more than one row or column at a time.
> A[ ,c(1,3)] # the 1st and 3rd columns
[,1] [,2]
[1,] 2 3
[2,] 1 7
If we assign names to the rows and columns of the matrix, then we can access the
elements by name.
> dimnames(A) = list(
+ c("row1", "row2"), # row names
+ c("col1", "col2", "col3")) # column names
> A # print A
col1 col2 col3
row1 2 4 3
row2 1 5 7
> A["row2", "col3"] # element at 2nd row, 3rd column
[1] 7
53. Matrix Construction
There are various ways to construct a matrix. When we construct a matrix
directly with data elements, the matrix content is filled along the column
orientation by default. For example, in the following code snippet, the
content of B is filled along the columns consecutively.
> B = matrix(
+ c(2, 4, 3, 1, 5, 7),
+ nrow=3,
+ ncol=2)
> B # B has 3 rows and 2 columns
[,1] [,2]
[1,] 2 1
[2,] 4 5
[3,] 3 7
> C = matrix(
+ c(7, 4, 2),
+ nrow=3,
+ ncol=1)
> C # C has 3 rows and 1 column
[,1]
[1,] 7
[2,] 4
[3,] 2
54. Then we can combine the columns of B and C with cbind, provided they have the
same number of rows.
> cbind(B, C)
[,1] [,2] [,3]
[1,] 2 1 7
[2,] 4 5 4
[3,] 3 7 2
Similarly, we can combine the rows of two matrices if they have the same number
of columns with the rbind function.
> D = matrix(
+ c(6, 2),
+ nrow=1,
+ ncol=2)
> D # D has 2 columns
[,1] [,2]
[1,] 6 2
> rbind(B, D) # append D as a new row of B
[,1] [,2]
[1,] 2 1
[2,] 4 5
[3,] 3 7
[4,] 6 2
55. Deconstruction
We can deconstruct a matrix by applying the c function, which combines all
column vectors into one.
> c(B)
[1] 2 4 3 1 5 7
56. Transpose
We construct the transpose of a matrix by interchanging its rows and columns
with the function t.
> t(B) # transpose of B
[,1] [,2] [,3]
[1,] 2 4 3
[2,] 1 5 7
57. Lists
A list is a generic vector containing other objects.
For example, the following variable x is a list containing copies of three vectors n, s,
b, and a numeric value 3.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3) # x contains copies of n, s, b
List Slicing
We retrieve a list slice with the single square bracket "[]" operator. The following is a
slice containing the second member of x, which is a copy of s.
> x[2]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
58. With an index vector, we can retrieve a slice with multiple members. Here is a
slice containing the second and fourth members of x.
> x[c(2, 4)]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
[[2]]
[1] 3
59. List Member Reference
In order to reference a list member directly, we have to use the double
square bracket "[[]]" operator. The following object x[[2]] is the second
member of x. In other words, x[[2]] is a copy of s itself, not a slice
containing s.
> x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
We can modify its content directly.
> x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"
> s
[1] "aa" "bb" "cc" "dd" "ee" # s is unaffected
60. Named List Members
We can assign names to list members, and reference them by names instead
of numeric indexes.
For example, in the following, v is a list of two members,
named "bob" and "john".
> v = list(bob=c(2, 3, 5), john=c("aa", "bb"))
> v
$bob
[1] 2 3 5
$john
[1] "aa" "bb"
61. Data Frame
A data frame is used for storing data tables. It is a list of vectors of equal
length. For example, the following variable df is a data frame containing
three vectors n, s, b.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b) # df is a data frame
62. Built-in Data Frame
We use built-in data frames in R for our tutorials. For example, here is a
built-in data frame in R, called mtcars.
> mtcars
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
............
The top line of the table, called the header, contains the column names.
Each horizontal line afterward denotes a data row, which begins with the
name of the row, followed by the actual data. Each data member of a row is
called a cell.
63. To retrieve data in a cell, we would enter its row and column coordinates in
the single square bracket "[]" operator. The two coordinates are separated by
a comma. In other words, the coordinates begin with the row position,
followed by a comma, and end with the column position. The order is
important.
Here is the cell value from the first row, second column of mtcars.
> mtcars[1, 2]
[1] 6
Moreover, we can use the row and column names instead of the numeric
coordinates.
> mtcars["Mazda RX4", "cyl"]
[1] 6
64. Lastly, the number of data rows in the data frame is given by
the nrow function.
> nrow(mtcars) # number of data rows
[1] 32
And the number of columns of a data frame is given by the ncol function.
> ncol(mtcars) # number of columns
[1] 11
Further details of the mtcars data set are available in the R documentation.
> help(mtcars)
65. Data Frame Column Vector
We reference a data frame column with the double square bracket "[[]]" operator.
For example, to retrieve the ninth column vector of the built-in data set mtcars,
we write mtcars[[9]].
> mtcars[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
We can retrieve the same column vector by its name.
> mtcars[["am"]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
We can also retrieve with the "$" operator in lieu of the double square bracket
operator.
> mtcars$am
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
Yet another way to retrieve the same column vector is to use the single
square bracket "[]" operator. We prepend the column name with a comma
character, which signals a wildcard match for the row position.
> mtcars[,"am"]
[1] 1 1 1 0 0 0 0 0 0 0 0 ..
66. Data Frame Column Slice
We retrieve a data frame column slice with the single square bracket "[]" operator.
Numeric Indexing
The following is a slice containing the first column of the built-in data set mtcars.
> mtcars[1]
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
............
Name Indexing
We can retrieve the same column slice by its name.
> mtcars["mpg"]
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
............
To retrieve a data frame slice with the two columns mpg and hp, we pack the column names
in an index vector inside the single square bracket operator.
> mtcars[c("mpg", "hp")]
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
............
67. Data Frame Row Slice
We retrieve rows from a data frame with the single square bracket operator, just like what we
did with columns. However, in addition to an index vector of row positions, we append an
extra comma character. This is important, as the extra comma signals a wildcard match for
the second coordinate for column positions.
Numeric Indexing
For example, the following retrieves a row record of the built-in data set mtcars. Please
notice the extra comma in the square bracket operator; it is not a typo. The row states
that the 1974 Camaro Z28 has a gas mileage of 13.3 miles per gallon and an eight-cylinder,
245-horsepower engine, ..., etc.
> mtcars[24,]
mpg cyl disp hp drat wt ...
Camaro Z28 13.3 8 350 245 3.73 3.84 ...
To retrieve more than one row, we use a numeric index vector.
> mtcars[c(3, 24),]
mpg cyl disp hp drat wt ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
Camaro Z28 13.3 8 350 245 3.73 3.84 ...
Name Indexing
We can retrieve a row by its name.
> mtcars["Camaro Z28",]
mpg cyl disp hp drat wt ...
Camaro Z28 13.3 8 350 245 3.73 3.84 ...
68. Logical Indexing
Lastly, we can retrieve rows with a logical index vector. In the following vector L, the
member value is TRUE if the car has an automatic transmission, and FALSE otherwise.
> L = mtcars$am == 0
> L
[1] FALSE FALSE FALSE TRUE ...
Here is the list of vehicles with automatic transmission.
> mtcars[L,]
mpg cyl disp hp drat wt ...
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 ...
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 ...
............
And here is the gas mileage data for automatic transmission.
> mtcars[L,]$mpg
[1] 21.4 18.7 18.1 14.3 24.4 ...
69. Useful Functions
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names
c(object, object,...) # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows
object # prints the object
ls() # list current objects
rm(object) # delete an object
newobject <- edit(object) # edit copy and save as newobject
fix(object) # edit in place
71. Importing Data
# read data from a comma-delimited file
# first row contains variable names
mydata <- read.table("c:/mydata.csv", header=TRUE,
sep=",", row.names="id")
# read in the first worksheet from the workbook myexcel.xlsx
library(xlsx)
mydata <- read.xlsx("c:/myexcel.xlsx", 1)
# read in the worksheet named mysheet
mydata <- read.xlsx("c:/myexcel.xlsx", sheetName = "mysheet")
72. Keyboard Input
# create a data frame from scratch
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
weight <- c(160, 110, 220)
mydata <- data.frame(age,gender,weight)
73. You can also use R's built-in spreadsheet editor to enter the data
interactively, as in the following example.
# enter data using editor
mydata <- data.frame(age=numeric(0), gender=character(0),
weight=numeric(0))
mydata <- edit(mydata)
# note that without the assignment in the line above, the edits are not
# saved!
74. Importing from Database
# RODBC Example
# import 2 tables (Crime and Punishment) from a DBMS
# into R data frames (and call them crimedat and pundat)
library(RODBC)
myconn <-odbcConnect("mydsn", uid="Rob", pwd="aardvark")
crimedat <- sqlFetch(myconn, "Crime")
pundat <- sqlQuery(myconn, "select * from Punishment")
close(myconn)
75. Exporting Data
To A Tab Delimited Text File
write.table(mydata, "c:/mydata.txt", sep="\t")
To an Excel Spreadsheet
library(xlsx)
write.xlsx(mydata, "c:/mydata.xlsx")
76. Viewing Data
# list objects in the working environment
ls()
# list the variables in mydata
names(mydata)
# list the structure of mydata
str(mydata)
# list levels of factor v1 in mydata
levels(mydata$v1)
# dimensions of an object
dim(object)
77. # class of an object (numeric, matrix, data frame, etc)
class(object)
# print mydata
mydata
# print first 10 rows of mydata
head(mydata, n=10)
# print last 5 rows of mydata
tail(mydata, n=5)
78. Value Labels
You can use the factor function to create your own value labels
# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue, 3=green
mydata$v1 <- factor(mydata$v1,
levels = c(1,2,3),
labels = c("red", "blue", "green"))
79. # variable y is coded 1, 3 or 5
# we want to attach value labels 1=Low, 3=Medium, 5=High
mydata$y <- ordered(mydata$y,
levels = c(1,3, 5),
labels = c("Low", "Medium", "High"))
Use the factor() function for nominal data and the ordered() function
for ordinal data. R statistical and graphic functions will then treat the data
appropriately.
Note: factor() and ordered() are used the same way, with the same arguments.
The former creates factors and the latter creates ordered factors.
80. Missing Data
In R, missing values are represented by the symbol NA (not available).
Impossible values (e.g., dividing by zero) are represented by the
symbol NaN (not a number).
Testing for Missing Values
is.na(x) # returns TRUE if x is missing
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)
81. Recoding Values to Missing
# recode 99 to missing for variable v1
# select rows where v1 is 99 and recode column v1
mydata$v1[mydata$v1==99] <- NA
Excluding Missing Values from Analyses
Arithmetic functions on missing values yield missing values.
x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2
82. The function complete.cases() returns a logical vector indicating which cases
are complete.
# list rows of data that have missing values
mydata[!complete.cases(mydata),]
The function na.omit() returns the object with listwise deletion of missing
values
# create new dataset without missing data
newdata <- na.omit(mydata)
83. Date Values
Dates are represented as the number of days since 1970-01-01, with negative
values for earlier dates
# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]
Sys.Date() returns today's date.
date() returns the current date and time.
# print today's date
today <- Sys.Date()
format(today, format="%B %d %Y")
[1] "June 20 2007"
84. Date Conversion
Character to Date
You can use the as.Date( ) function to convert character data to dates. The format
is as.Date(x, "format"), where x is the character data and format gives the appropriate
format.
# convert date info in format 'mm/dd/yyyy'
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")
The default format is yyyy-mm-dd
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
Date to Character
You can convert dates to character data using the as.character() function.
# convert dates to character data
strDates <- as.character(dates)
87. Commands to calculate descriptive statistics
Statistic           Command
Mean                mean(variable.name)
Median              median(variable.name)
Range               range(variable.name)
Standard deviation  sd(variable.name)
No. observations    length(variable.name)
Variance            var(variable.name)
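As a quick sketch (not from the slides), the commands above can be applied to the built-in mtcars data set used earlier:

```r
# Apply the descriptive-statistics commands above to mtcars gas mileage.
mean(mtcars$mpg)     # mean
median(mtcars$mpg)   # median
range(mtcars$mpg)    # minimum and maximum
sd(mtcars$mpg)       # standard deviation
length(mtcars$mpg)   # number of observations
var(mtcars$mpg)      # variance
```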
88. Comparing two groups of measurements
Identifying the type of test
One-sample test - Used when a single sample with a specific hypothesized
value for the mean is to be considered. Examples of this include fixed value
comparisons such as whether average human height is 1.77m.
Two independent sample test - measurements on two samples from two
different populations are compared. Examples include comparisons of males
and females.
Paired-sample test - Used when two different measurements were taken on
the SAME experimental units. Examples are before and after studies on the
effect of medical treatments
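The three test types above map onto three shapes of the t.test() call. A minimal sketch, using made-up values purely for illustration:

```r
# One-sample test: compare a single sample against a hypothesized mean.
x <- c(1.70, 1.82, 1.75, 1.68, 1.79)
t.test(x, mu = 1.77)

# Two independent samples: compare measurements from two groups.
y   <- c(3, 5, 4, 2, 1, 3)
grp <- factor(c("a", "a", "a", "b", "b", "b"))
t.test(y ~ grp)

# Paired samples: two measurements on the same experimental units.
before <- c(4, 2, 7, 4)
after  <- c(5, 1, 3, 6)
t.test(before, after, paired = TRUE)
```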
89. Example of t-test
Null hypothesis: Average human height is 1.77m
Alternative Hypothesis: Average human height is significantly different from 1.77m
height <- c(1.43,1.75,1.85,1.74,1.65,1.83,1.91,1.52,1.92,1.83)
t.test(height, mu = 1.77)
One Sample t-test
data: height
t = -0.5205, df = 9, p-value = 0.6153
alternative hypothesis: true mean is not equal to 1.77
95 percent confidence interval:
1.625646 1.860354
sample estimates:
mean of x
1.743
Since the p-value (0.6153) is greater than 0.05, we fail to reject the null
hypothesis: the data are consistent with a mean height of 1.77m.
90. Two independent sample t-test
We use our example of dispersal distance in male and female butterflies. This
is your data:
distance <- c(3,5,5,4,5,3,1,2,2,3)
sex <- c("male","male","male","male","male",
"female","female","female","female","female")
91. Before running the test it is important to consider your alternative
hypothesis, whether you want to run a one-tailed or two-tailed test.
If no alternative hypothesis is specified, the command will assume a
two-tailed test.
The two-sample t-test has a second assumption in addition to the normality of
the data:
equal variance in the two samples.
If the variances are assumed to be equal, this must be specified using the
argument var.equal = TRUE; otherwise Welch's t-test, which does not assume
equal variances, is used automatically.
Here we assume equal variances and perform a two-tailed test.
92. t.test(distance ~ sex, var.equal = TRUE)
Two Sample t-test
data: distance by sex
t = -4.0166, df = 8, p-value = 0.003859
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -3.4630505 -0.9369495
sample estimates:
mean in group female mean in group male
2.2 4.4
Thus, male butterfly dispersal is significantly different from female
butterfly dispersal.
We can also specify a one-sided alternative hypothesis by adding the
argument alternative = "less" or alternative = "greater" depending on which
tail is to be tested:
t.test(distance ~ sex, var.equal = TRUE, alternative="greater")
93. Two Sample t-test
data: distance by sex
t = -4.0166, df = 8, p-value = 0.9981
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-3.218516 Inf
sample estimates:
mean in group female mean in group male
2.2 4.4
The result of this test states that female dispersal distance is not
significantly greater than male dispersal distance.
94. Paired sample t-test
As an example of a paired test, you investigate whether the sleep of students
is affected by an exam. You ask 6 students how long they sleep the night
before an exam and the night after an exam. These are the answers you get:
sleep.before <- c(4,2,7,4,3,2)
sleep.after <- c(5,1,3,6,2,1)
Here you simply add the argument paired=TRUE to the command from the
two-sample test above.
t.test(sleep.before, sleep.after, paired=TRUE)
95. Output:
Paired t-test
data: sleep.before and sleep.after
t = 0.7906, df = 5, p-value = 0.465
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.501038 2.834372
sample estimates:
mean of the differences
0.6666667
Well, this test is NOT significant, and thus the data do not support an effect
of exams on students' sleeping time. But maybe you forgot about the party
after the exam?
96. Correlation analysis
Pearson Correlation - This test seeks to determine the level of relatedness
between two variables using a score that runs from -1 (perfect negative
correlation) to 1 (perfect positive correlation). A value of zero indicates no
correlation.
cor.test (iris$Sepal.Length,iris$Petal.Length)
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8270363 0.9055080
sample estimates:
cor
0.8717538
97. This data shows a highly-significant (P-value < 2.2e-16) and strongly positive
(0.87) correlation between these two variables.
In this case, the P-value is used to reject the null hypothesis that the true
correlation is equal to zero.
98. Spearman Correlation
A non-parametric alternative to Pearson’s r is Spearman's rank correlation
coefficient, or Spearman’s rho. Like Pearson’s r, Spearman’s rho determines
the level of correlation of two variables ranging from -1 to 1. The difference
between the two measures is that Spearman uses the rank-order of the data
rather than the raw values.
cor.test (iris$Sepal.Length,iris$Petal.Length, method="spearman")
99. Spearman's rank correlation rho
data: iris$Sepal.Length and iris$Petal.Length
S = 66429.35, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8818981
Note that this test produced similar, but not identical, results compared with
Pearson's r.
100. Cross-tabulation and the χ2 test
Basic contingency tables would have two categorical variables. In many cases
we may wish to test whether the two grouping variables are independent.
One of the most common ways to analyze contingency tables is with the χ2
test (Chi-square test). The χ2 test works by first calculating, for each cell,
the difference between the expected and observed counts:
X2 = sum over all cells of (observed − expected)^2 / expected
The result of this calculation, the so-called X2 score, is then compared to
the χ2 distribution to calculate a p-value to determine if the observed values
differ significantly from the expected values.
101. We will use an example of eye color counts in two different groups of flies. The dataset can
be found in 3.9_flies_eyes_color.csv.
We begin by loading the data and creating a contingency table.
flyeyes <- read.csv("3.9_flies_eyes_color.csv", header = TRUE)
tab <- table(flyeyes$Eyecolor, flyeyes$Group)
tab
A B
Red 34 41
White 16 9
The ratio between red and white eyes differs between group A and group B.
We will use chi-squared test to determine whether the data is more compatible with the
null hypothesis that the variables of eye color and group are independent of each other
or with the alternative hypothesis that eye color and group are not independent
102. chisq.test(tab)
Pearson's Chi-squared test with Yates' continuity correction
data: tab
X-squared = 1.92, df = 1, p-value = 0.1659
In this case, based on a chi-squared value of 1.92 and 1 degree of freedom,
we calculated a P-value of 0.1659.
Thus, the data are consistent with these variables being independent.
103. Linear Models
Linear models are a large family of statistical analyses that relate a
continuous response to one or several explanatory variables.
Explanatory variables can be grouping factors or continuous or a combination
of both
104. One-way analysis of variance (ANOVA)
Tests whether means of more than two groups are the same, for example
whether fruit production differs among five populations of a plant species.
If there are only two groups, a t-test is the way to proceed.
ANOVA relates variance within groups to variance between groups.
The analysis does not, however, tell you which groups are significantly
different from each other. For this purpose a Tukey test can be applied.
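The Tukey test mentioned above is not demonstrated in these slides; a minimal sketch, using simulated data (the variables growth and group are invented for illustration) and the base functions aov() and TukeyHSD(), might look like this:

```r
# Simulated example: three groups with different means (hypothetical data).
set.seed(1)
growth <- c(rnorm(10, mean = 5), rnorm(10, mean = 6), rnorm(10, mean = 8))
group  <- factor(rep(c("A", "B", "C"), each = 10))

model <- aov(growth ~ group)   # one-way ANOVA
summary(model)                 # overall test: do any group means differ?
TukeyHSD(model)                # pairwise comparisons: which groups differ?
```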
105. Two-way ANOVA
This analysis assesses the influence of two grouping factors on group
means, for example, whether irrigation and fertilization have an effect on
plant growth.
Importantly, two-way ANOVA can also analyze whether the two factors
interact, in the example, whether the effect of irrigation depends on the
fertilizer level (or the other way around). This is called a statistical
interaction.
The same methods can also be applied to studies with more than two
grouping factors (multi-way ANOVA).
106. Linear regression
Linear regression analyzes to what extent changes in a continuous
explanatory variable result in changes in the response variable, for example
whether larger females cause longer male courtship behavior. If a causal
relationship cannot be assumed a correlation analysis should be used. This
type of analysis can also be conducted with more than one continuous
explanatory variable (multiple regression).
107. Analysis of covariance (ANCOVA)
ANCOVA allows more complicated analyses that involve effects of grouping
factors, explanatory factors and their interactions.
An example is an analysis of whether the response to different doses of a
medication differs between male and female patients. Such more
complicated linear models can also include more than two explanatory
factors.
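The slides do not show code for ANCOVA; a minimal sketch of the medication example, with made-up variable names (response, dose, sex) and simulated data, could be:

```r
# Hypothetical ANCOVA sketch: does the response to dose differ between sexes?
set.seed(2)
dose     <- rep(c(10, 20, 40), times = 20)
sex      <- factor(rep(c("male", "female"), each = 30))
response <- 0.5 * dose + ifelse(sex == "male", 5, 0) + rnorm(60)

model <- lm(response ~ dose * sex)  # '*' includes the dose:sex interaction
anova(model)  # the dose:sex term tests whether the slope differs by sex
```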
108. Defining the model
First you must define the linear model that you want to use, using the lm()
function. Within this function a so-called formula statement defines the
relationship of the variables to each other. The response variable is always on
the left side of the tilde-symbol (~) and the explanatory variable(s) are on the
right side, as in lm(response ~ explanatory, data = ...). For instance, if we use the
built-in airquality dataset and want to make a model to predict Ozone content in the
atmosphere using wind speed, the model definition would be as follows:
My.model <- lm(Ozone ~ Wind, data = airquality)
109. A formula statement with a factor as the explanatory variable will yield an ANOVA:
airquality$Month <- as.factor(airquality$Month)
#turns Month into a factor
lm(Ozone ~ Month, data = airquality)
While the following formula statement will yield a regression analysis:
lm(Ozone ~ Temp, data = airquality)
110. Formula statements are further used to combine explanatory variables and to
define interactions. If variables should be considered only by themselves
(additive effects), for example in a multiple regression without interaction
you connect the variables by a plus sign as in:
lm(Ozone ~ Temp + Wind, data = airquality)
On the other hand, if you want to consider interactions in addition to the
additive effects use an asterisk (*) between the explanatory variables, as in:
lm(Ozone ~ Temp * Wind, data = airquality)
111. Checking assumptions with diagnostic plots
All linear models including regression, one-way ANOVA and ANCOVA have the
following assumptions:
1. The experimental units are independent and sampled at random. The
independence assumption depends heavily on the experimental design.
2. The residuals have constant variance across values of explanatory
variables.
3. The residuals, i.e. the differences between the observed values of a
response variable and the values fitted by the model, are normally distributed
with a mean of zero.
112. My.model <- lm(Ozone ~ Wind, data = airquality)
par(mfrow=c(1,2))
plot(My.model, which = c(1,2))
113. Analyzing and interpreting the model
An ANOVA table shows how much variation in the response is explained by the
explanatory factors. To get the ANOVA table, use the command
anova(My.model), where My.model is the object that stores the defined model
airquality$Month <- as.factor(airquality$Month)
#turns Month into a factor
My.Model <- lm(Ozone ~ Month, data = airquality)
anova(My.Model)
114. Analysis of Variance Table
Response: Ozone
Df Sum Sq Mean Sq F value Pr(>F)
Month 4 29438 7359.5 8.5356 4.827e-06 ***
Residuals 111 95705 862.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In the output, Df stands for degrees of freedom, which is the number of
values in the final calculation of a statistic that are free to vary. The F value
is the test statistic, calculated as the ratio between the explained and the
unexplained variance. The corresponding p-value for this F statistic is the
probability of obtaining an F value at least as large as the observed one if
the null hypothesis (no effect of the explanatory variable on the response
variable) is true.
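These quantities can also be extracted from the table programmatically; the sketch below re-fits the model from the slide and recovers the degrees of freedom (4 and 111) and F value (about 8.54) printed above:

```r
airquality$Month <- as.factor(airquality$Month)  # turns Month into a factor
tab <- anova(lm(Ozone ~ Month, data = airquality))
tab$Df           # 4 (Month) and 111 (Residuals), as in the table above
tab$`F value`[1] # about 8.54
tab$`Pr(>F)`[1]  # about 4.8e-06, i.e. highly significant
```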
115. The command summary(model) will show the parameters estimated by the
model, for example the slope of the regression for regression analyses or the
difference between group means for ANOVA.
My.model <- lm(Ozone ~ Wind, data = airquality)
summary(My.model)
116. Examples
One-way ANOVA
To test whether fruit production differs between populations of Lythrum,
fruits were counted on 10 individuals on each of 3 populations.
fruits <- data.frame(
  fruits = c(24, 19, 21, 20, 23, 19, 17, 20, 23, 20,
             11, 15, 11, 9, 10, 14, 12, 12, 15, 13,
             13, 11, 19, 12, 15, 15, 13, 18, 17, 13),
  pop = c(rep(1, 10), rep(2, 10), rep(3, 10)))
fruits$pop <- as.factor(fruits$pop)
plot(fruits ~ pop, data = fruits)
model <- lm(fruits ~ pop, data = fruits)
par(mfrow = c(1, 2)); plot(model, which = c(1, 2))
117. Two-way ANOVA
In a study on pea cultivation methods, pea production was assessed in two treatments of
irrigation (normal irrigation and drought) and in three treatments of radiation (low, medium
and high). 10 plants in each of the six combinations were considered.
plants <- data.frame(
  seeds = c(39, 39, 39, 40, 40, 39, 41, 42, 40, 40, 39, 38, 41, 41, 40, 41, 40, 40, 41, 40,
            38, 40, 40, 39, 42, 40, 39, 41, 39, 40, 39, 40, 41, 40, 41, 39, 40, 41, 40, 39,
            42, 40, 39, 39, 42, 40, 39, 39, 39, 39, 41, 38, 40, 39, 41, 42, 40, 40, 40, 41),
  irrigation = c(rep(1, 30), rep(2, 30)),
  radiation = rep(c(1, 2, 3), 20))
plants$irrigation <- as.factor(plants$irrigation)
plants$radiation <- as.factor(plants$radiation)
par(mfrow = c(1, 2)); plot(seeds ~ irrigation * radiation, data = plants)
model <- lm(seeds ~ irrigation * radiation, data = plants)
par(mfrow = c(1, 2)); plot(model, which = c(1, 2))
anova(model)
118. Linear Regression
In an experiment testing whether the duration of male courtship behavior depends
on female size, 16 pairs of earwigs were observed.
sex <- data.frame(
  pair = 1:16,
  fem_size = c(58.84, 60.37, 57, 59.86, 61.42, 60.34, 60.1, 59.63,
               58.06, 61, 58.61, 60.94, 60.83, 57.7, 60, 59.09),
  male_court_hrs = c(8.37, 9.88, 10.12, 8.39, 9.93, 9.69, 8.68, 11.74,
                     11.07, 8.69, 10.53, 10.38, 10.12, 11.14, 8.6, 11.26))
plot(male_court_hrs ~ fem_size, data = sex)
model <- lm(male_court_hrs ~ fem_size, data = sex)
par(mfrow = c(1, 2)); plot(model, which = c(1, 2))
summary(model)
This analysis suggests that male courtship duration is not related to female size.
119. Basic graphs with R
Bar-plots –
We are going to use the internal dataset ToothGrowth (available with R
installation), which contains measurements of tooth length in guinea pigs that
received three levels of vitamin C and two supplement types
We want to produce a barplot of the mean tooth length for all six
combinations of the two factors (supplement type: 2 levels, dose: 3 levels).
We first need to calculate the mean tooth length for each of the combinations.
For this, we use the command tapply().
tapply() can return a table with mean tooth lengths for all six combinations,
and this table will be the input for the barplot. Importantly, tapply() will
create a matrix with two rows and three columns corresponding to the factor
levels in the dataset. This structure is needed to produce a grouped barplot.
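As a sketch, calling tapply() with a two-factor index list returns exactly that 2 × 3 matrix of means:

```r
# Mean tooth length for every supplement x dose combination in ToothGrowth
means <- tapply(ToothGrowth$len,
                list(ToothGrowth$supp, ToothGrowth$dose),
                mean)
dim(means)  # 2 rows (supplement types) x 3 columns (doses)
```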
121. The labels of the axes can be specified with the arguments xlab and ylab, and
the labels below each group of bars are controlled with the argument names.arg.
The font size of these labels can be changed with cex.lab and cex.names.
These arguments are set to 1 by default, and changes are relative to this
default; for example, cex.lab = 2 will double the font size. The limit of the
y-axis is specified with ylim. Here we use the maximum and the minimum in
the dataset. The orientation of the axis labels can be altered with the
argument las, which has four options (0, 1, 2, 3). Here, las = 1 produces
horizontal axis labels. The colors of the bars are determined by col, in our
example by a vector with a length of two for the two groups, specifying 1
(black) and 8 (grey). The color can be specified either with numbers (1 to 8)
or with the color name.
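Putting these arguments together, a minimal grouped barplot of the ToothGrowth means might look like this; the axis labels and legend position are our own choices for illustration, not fixed by the dataset:

```r
means <- tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean)
mids <- barplot(means, beside = TRUE,
                xlab = "Dose (mg/day)", ylab = "Mean tooth length",
                cex.lab = 1.2, cex.names = 1.2, las = 1,
                col = c(1, 8),               # 1 = black, 8 = grey
                ylim = c(0, max(means) * 1.2))
legend("topleft", legend = rownames(means), fill = c(1, 8))
```

With beside = TRUE, barplot() draws the bars side by side within each dose group and invisibly returns the bar midpoints, which is handy for adding further annotation.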
122. Grouped scatter plot with regression lines
To produce a scatterplot, we will use the plot() command. plot() is a higher-
level plotting command, meaning that it creates a new graph.
We are going to use part of the internal dataset iris (available with the R
installation) as an example (Figure 5-2). iris contains flower measurements of
three different Iris species. You can explore the dataset with ?iris,
summary(iris) and str(iris). To reduce the dataset to two species and to plot
all the data points use:
iris.short <- iris[1:100, ]
plot(iris.short$Sepal.Length, iris.short$Sepal.Width)
123. We can now assign two different plotting symbols for the species by creating a
new column in the data frame iris.short, named iris.short$pch, that contains
the number of the plotting symbol to be used. There are 26 different plotting
symbols, ranging from 0 to 25. Here we use symbol 1 for Iris setosa and
symbol 16 for Iris versicolor. You can use the same procedure to assign
different colors to the two species (see above). We can then set the axis
labels, range and orientation as well as font size using xlab, ylab, xlim, ylim,
las, cex.axis and cex.lab as explained above.
iris.short$pch[iris.short$Species == "setosa"] <- 1
iris.short$pch[iris.short$Species == "versicolor"] <- 16
plot(iris.short$Sepal.Length, iris.short$Sepal.Width,
     xlab = "Sepal length (mm)", ylab = "Sepal width (mm)",
     xlim = c(4, 7.5), las = 1, cex.axis = 1.2, cex.lab = 1.3,
     pch = iris.short$pch)
124. Logistic Regression
Logistic regression models are used in situations where we want to know
how a binary response variable is affected by one or more continuous
variables. Common biological examples of this include assessing the
probability of survival, the probability of reproducing, or the probability of
an individual possessing a certain allele. On the natural scale, logistic
regression is non-linear and cannot be analyzed using linear models. However,
this problem is circumvented by using the logit transformation to linearize
the model:
logit(p) = log(p / (1 - p))
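A quick numeric check of the transformation (plogis() is R's built-in inverse logit):

```r
logit <- function(p) log(p / (1 - p))
logit(0.5)          # 0: a probability of 0.5 maps to a logit of 0
plogis(logit(0.7))  # 0.7: plogis() undoes the logit transformation
```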
125. Creating and analyzing model
In R, logistic regression models are created using the generalized linear model
function glm(). This takes the general form of:
Model <- glm(probability_data ~ continuous_predictor, family = "binomial")
The argument family = "binomial" tells the function to create a binomial
logistic regression model. As with the lm() function, we can use summary() to
obtain summary data of the model.
Lmodel <- glm(survival ~ height, family = binomial, data = Hypericum)
summary(Lmodel)
G_sq <- Lmodel$null.deviance - Lmodel$deviance
pchisq(G_sq, 1, lower.tail = FALSE)
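The Hypericum data used above is not shipped with R, so here is a self-contained sketch on simulated data; the predictor name and effect size are invented purely for illustration. G_sq is the likelihood-ratio (deviance) statistic, which pchisq() compares against a chi-squared distribution with 1 degree of freedom:

```r
set.seed(1)
height <- runif(100, 10, 60)          # simulated plant heights (invented)
p_true <- plogis(-4 + 0.15 * height)  # survival probability rises with height
survival <- rbinom(100, 1, p_true)    # binary response

Lmodel <- glm(survival ~ height, family = binomial)
G_sq  <- Lmodel$null.deviance - Lmodel$deviance
p_val <- pchisq(G_sq, 1, lower.tail = FALSE)
p_val  # very small: height clearly affects survival in this simulation
```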
127. Logistic regressions are used when you have a binary (probability)
response variable and a continuous predictor variable.
Logistic curves are analyzed as generalized linear models with glm() through
the use of the logit transformation.
Logistic regressions can be plotted either as a logistic curve or as a linear
function.
128. R Flow Control
for (VAR in SEQ) {EXPR}
while (COND) {EXPR}
repeat {EXPR}
The first one, for(), iterates through each component VAR of the sequence SEQ:
in the first iteration VAR = SEQ[1], in the second iteration VAR = SEQ[2],
and so on.
VAR is the abbreviation of variable.
SEQ is the abbreviation of sequence, which is equivalent to a vector (including
a list) in R.
COND is the abbreviation of condition, which evaluates to TRUE or FALSE.
EXPR is the abbreviation of expression.
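The while() and repeat constructs can be sketched in the same spirit; the loop bounds here are arbitrary examples:

```r
# while: repeat the body as long as the condition holds
i <- 1
total <- 0
while (i <= 5) {
  total <- total + i  # sum 1..5
  i <- i + 1
}
total  # 15

# repeat: loop until an explicit break
n <- 1
repeat {
  n <- n * 2
  if (n > 100) break  # stop once n exceeds 100
}
n  # 128
```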
129. for
for ( i in 1:5 ) {
print( paste('square of', i, '=', i^2) )
}
132. If – else if - else
x <- 3
if ( ! is.numeric(x) ) {
stop( paste(x, 'is not numeric') )
} else if ( x%%2 == 1) {
print( paste(x, 'is odd') )
} else if ( x == round(x) ) {
print ( paste(x, 'is an integer') )
} else {
print ( paste(x, 'is a number') )
}
133. Functions
R provides a convenient way to define custom functions and make good use of
them. All functions read and parse input, referred to as arguments, and then
return output. An R function is actually a first-class object defined in R.
It can be created with the command function(), which is followed by a
comma-separated list of formal arguments enclosed in a pair of parentheses,
and then the expressions that form the body of the function.
If the body includes only one statement, it can be entered directly; when
there are multiple expressions, they have to be enclosed in braces {}.
The value returned by an R function can be either yielded by the built-in
function return() or simply the value of the last evaluated expression.
134. Example of a function
expon <- function(x,n) {
if ( x%%1 != 0 ) {
stop('x must be an integer!')
} else if ( n==0 ) {
return(1)
} else {
prod <- x
while( n>1 ) {
prod <- x*prod
n <- n-1
}
return(prod)
} # end of else
} # end of the function
expon(3, 4) # 81
135. The formal arguments and body of the function expon() can later be
accessed via the R functions formals() and body(), as follows:
formals(expon)
$x
$n
body(expon)
{
if (x%%1 != 0) {
stop("x must be an integer!")
} (…)