BCBB Workshop
An introduction to R
Kui Shen, PhD
Bioinformatics and Computational Biosciences Branch (BCBB)
National Institute of Allergy and Infectious Diseases (NIAID)
OCICB Bioinformatics and Computational
Biosciences Branch (BCBB)
§  Part of NIAID
§  Group of ~40
§  Software developers
§  Computational biologists
§  Project management &
analysis professionals
§  Biostatistics, phylogenetics,
genomics, structural
biology, programming
What will we learn today?
§ R basics
•  Arithmetic
•  Matrix algebra
•  Graphics
•  Statistical tests
•  Importing and exporting data
•  Installing packages
§ Basic statistics with R
•  Descriptive statistics
•  General linear regression
What’s R?
•  R is a language and environment for statistical computing and graphics. It is free, open-source software.
What can R do? -- Importing and
manipulating data
§  Importing data.
•  Basic text files. (read.table).
•  Excel files. ( read.xls in the gdata package).
§  Manipulating data.
•  Indexing and subsetting.
•  Merging and Reshaping.
An example: from wide format to long format (help('reshape'))
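A minimal sketch of the wide-to-long conversion with base R's reshape(). The data frame `wide` and its column names are made up for illustration.

```r
# Hypothetical wide-format data: one row per subject, one column per visit.
wide <- data.frame(id = 1:3,
                   score.1 = c(5.1, 4.8, 6.0),
                   score.2 = c(5.6, 5.0, 6.3))

# Convert to long format: one row per subject and visit.
# reshape() splits "score.1" at the "." into the name "score" and the time 1.
long <- reshape(wide,
                direction = "long",
                varying   = c("score.1", "score.2"),
                idvar     = "id",
                timevar   = "visit")
long
```

The result has one `score` column and a `visit` column that records which of the original columns each value came from.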
What can R do? -- Data analysis
§  Statistical data analysis
•  Descriptive statistics
•  One- and two-sample tests
•  Regression and correlation
•  Analysis of variance
•  Tabular data
•  Power and sample size estimation
•  Mixed model
§  Genetic and genomic data analysis (Bioconductor)
•  Microarray
•  Next generation sequencing
What can R do? -- Visualization
xyplot(Counts~Status|sample,data=dat)
What can R do? – Generating reports 1
An example of Sweave code:
\documentclass[a4paper]{article}
\title{Sweave=R+LaTeX}
\author{Kui Shen}
\begin{document}
\maketitle
<<LM, eval=TRUE, echo=TRUE, fig=TRUE, include=TRUE>>=
library(MASS)
data(cats)
mylm = lm(Hwt ~ Bwt, data=cats)
plot(Hwt ~ Bwt, data=cats, main='Linear regression')
abline(mylm)
@
This is the embedded figure.
\end{document}
What can R do? – Generating reports 2
An example of Sweave code:
\documentclass[a4paper]{article}
\title{A table generated by xtable}
\begin{document}
\maketitle
<<LM>>=
library(MASS)
library(xtable)
data(cats)
mylm = lm(Hwt ~ Bwt, data=cats)
@
<<echo=FALSE,results=tex>>=
print(xtable(mylm, caption="Linear regression results"))
@
\end{document}
What can R do? – Web tools
Basic statistic analysis: linear regression
§  The linear model is the most fundamental tool for a data scientist.
§  Before you try any fancy machine learning method, your go-to method should be the linear model.
Linear correlation
[Figure: scatter plots of Y against X, showing linear relationships and curvilinear relationships.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear regression
§  In correlation, the two variables are treated as equals.
§  In regression, one variable is treated as the independent (predictor) variable X and the other as the dependent (outcome, response) variable Y.
Regression equation
$y_i = \alpha + \beta x_i + \varepsilon_i$
§  $\alpha + \beta x_i$: fixed values
§  $\varepsilon_i$: random error, follows a normal distribution
What is “linear”?
The mean of the response variable is a linear combination of the
parameters (regression coefficients) and the predictor variables.
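A small simulation may make the two parts of the regression equation concrete. This is only a sketch: the intercept, slope, error standard deviation and sample size below are assumed values chosen for illustration.

```r
set.seed(1)                          # for reproducibility
n     <- 100
alpha <- 2                           # assumed intercept
beta  <- 0.5                         # assumed slope
x     <- runif(n, 0, 10)             # predictor values
eps   <- rnorm(n, mean = 0, sd = 1)  # random error, normally distributed
y     <- alpha + beta * x + eps      # response = fixed part + random part

fit <- lm(y ~ x)
coef(fit)                            # estimates should be close to alpha and beta
```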
Assumptions for a linear regression:
§ The relationship between X and Y is linear.
§ Y is distributed normally at each value of X.
§ The variance of Y at every value of X is the
same (homogeneity of variances).
§ The observations are independent.
Least squares estimation (LSE)
§ Estimating the intercept and slope: least
squares estimation
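For a single predictor, the least squares estimates have a closed form: the slope is the sum of cross-products of the centered x and y divided by the sum of squares of the centered x, and the intercept follows from the means. The sketch below computes both by hand on the cats data from the MASS package and compares them with lm(); the names `beta_hat` and `alpha_hat` are chosen here for illustration.

```r
library(MASS)   # provides the cats data set
data(cats)

x <- cats$Bwt
y <- cats$Hwt

# Closed-form least squares estimates for simple linear regression
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

c(intercept = alpha_hat, slope = beta_hat)
coef(lm(Hwt ~ Bwt, data = cats))   # same values from lm()
```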
Significance test
Distribution of slope: $\hat{\beta} \sim T_{n-2}\big(\beta,\ \mathrm{s.e.}(\hat{\beta})\big)$
H0: β = 0 (no linear relationship)
H1: β ≠ 0 (linear relationship does exist)
$T_{n-2} = \dfrac{\hat{\beta} - 0}{\mathrm{s.e.}(\hat{\beta})}$
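The test statistic can be reproduced by hand from the estimate and its standard error. A minimal sketch, assuming the cats data from the MASS package; the variable names are illustrative.

```r
library(MASS)
data(cats)
fit <- lm(Hwt ~ Bwt, data = cats)

est <- coef(summary(fit))["Bwt", "Estimate"]
se  <- coef(summary(fit))["Bwt", "Std. Error"]
n   <- nrow(cats)

t_stat  <- (est - 0) / se    # test statistic under H0: beta = 0
# two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(t = t_stat, p = p_value)
```

Both values should agree with the "t value" and "Pr(>|t|)" columns that summary(fit) prints for Bwt.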
Residual analysis for linearity
[Figure: Y against x and residuals against x for two cases, "Not Linear" and "Linear" (✓).]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Residual analysis for homoscedasticity
[Figure: Y against x and residuals against x for two cases, "Non-constant variance" and "Constant variance" (✓).]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Residual analysis for independence
[Figure: residuals against X for three cases, two "Not Independent" and one "Independent" (✓).]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Multivariate regression pitfalls - overfitting
§  In a linear model with several predictors, you can get significant but meaningless results if you use too many predictors (the model has no predictive ability for new samples).
§  But how about big data? For example, in next generation sequencing data there are thousands of genes but few samples.
§  When the number of features p is as large as, or larger than, the number of observations n, ordinary least squares breaks down: the fit to the observed samples is perfect, or the coefficients have no unique solution (the "large p, small n" problem).
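The large p, small n problem can be seen directly with lm(). In this sketch the predictors and the response are pure noise (simulated, with sizes chosen for illustration), yet the fit reproduces the response exactly.

```r
set.seed(1)
n <- 10                                # few samples
p <- 15                                # more predictors than samples
X <- matrix(rnorm(n * p), nrow = n)    # pure noise predictors
y <- rnorm(n)                          # response unrelated to X

fit <- lm(y ~ X)
sum(is.na(coef(fit)))                  # number of coefficients that cannot be estimated
max(abs(resid(fit)))                   # essentially zero: a "perfect" but meaningless fit
```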
Linear regression for big data
§ Shrinkage Methods
•  LASSO (Least Absolute Shrinkage and Selection
Operator)
§  Dimension Reduction Methods
•  Principal Components Regression
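Principal components regression can be sketched with base R alone: reduce the predictors with prcomp(), then regress the response on the first few component scores. The data below are simulated and the choice of k = 3 components is an arbitrary assumption; in practice k is usually chosen by cross-validation. (LASSO is not part of base R; it is available in add-on packages such as glmnet.)

```r
set.seed(1)
n <- 20
p <- 50
X <- matrix(rnorm(n * p), nrow = n)          # hypothetical predictors (p > n)
y <- X[, 1] - X[, 2] + rnorm(n, sd = 0.5)    # hypothetical response

pca    <- prcomp(X, center = TRUE, scale. = TRUE)  # principal components of X
k      <- 3                                        # number of components kept (an assumption)
scores <- pca$x[, 1:k]                             # n x k matrix of component scores

pcr_fit <- lm(y ~ scores)                          # ordinary regression on k components
coef(pcr_fit)
```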
Questions?
R programming: preliminaries
§  Install: http://www.r-project.org/
§  Manual: https://cran.r-project.org/manuals.html
§  Start and quit: R; q();
§  R and statistics
•  Around 25 built-in statistical packages
•  Many more are available at:
•  http://CRAN.R-project.org
•  http://www.bioconductor.org
§  Getting help
•  help('plot')
•  ?plot
§  Useful website: http://stackoverflow.com/
R and RStudio
RStudio (http://www.rstudio.com/)
R programming: variables and assignment
§  Variable names: alphanumeric symbols, plus “.” and “_”, are allowed.
§  Variable names are case sensitive, so “A” and “a” are two different variables.
§  Assignment: <-
•  A<-3; a<-4; X.1<-9;
•  During the development of S at Bell Labs, the key corresponding to “_” was printed as a left arrow (←) by the time-shared Execuport terminal.
§  In 2001, the “=” assignment was introduced for compatibility with other languages.
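A short sketch of both assignment forms and of case sensitivity; `my_value` is a name made up for illustration.

```r
A <- 3        # assignment with the arrow
a = 4         # assignment with "=", equivalent at the top level
X.1 <- 9      # "." is allowed in names
my_value <- A + a + X.1   # "_" is allowed in names

A == a        # FALSE: A and a are different variables
my_value      # 16
```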
Arithmetic operators
§  Addition: +
§  Subtraction: -
§  Division: /
§  Multiplication: *
§  Exponentiation: ^
§  Using R as a calculator:
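A few lines typed at the prompt illustrate the operators and their precedence (the numbers are arbitrary):

```r
1 + 2 * 3      # 7: multiplication binds tighter than addition
(1 + 2) * 3    # 9
7 / 2          # 3.5
2^10           # 1024
-2^2           # -4: exponentiation binds tighter than unary minus
```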
Function calls
§  Syntax: functionName(arg1, arg2)
§  Arguments can be specified by position or by name, while
name overrides position.
> log(x=16,base=2)
[1] 4
> log(16,2)
[1] 4
> log(2,16)
[1] 0.25
> log(base=2,x=16)
[1] 4
Data type
§  Three major data types: numeric, character, and
logical.
•  # numeric variables
•  X<-3.1
•  # character variables
•  X<-"Male"
•  # Logical variables
•  X<-TRUE; Y<-FALSE
Data structure: vector, matrix and array
§  Vector: an ordered collection of data.
•  x<-c(3.1,4.2,5.6); y<-c("F","M","F"); z<-c(T,F,T,F,T)
•  c(…) is a function that concatenates its arguments end to end.
§  Matrix: two-dimensional generalizations of vectors
§  Array: multi-dimensional generalizations of vectors
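A minimal sketch of the three structures; the values are arbitrary.

```r
x <- c(3.1, 4.2, 5.6)                 # numeric vector
m <- matrix(1:6, nrow = 2, ncol = 3)  # 2 x 3 matrix, filled column by column
a <- array(1:24, dim = c(2, 3, 4))    # 2 x 3 x 4 array

length(x)   # 3
dim(m)      # 2 3
dim(a)      # 2 3 4
m[2, 3]     # 6: row 2, column 3
t(m) %*% m  # matrix multiplication: a 3 x 3 result
```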
Data structure: data frame
§  Data frames are matrix-like structures.
§  The columns in data frame can be of different data types.
§  Generate a new data frame
§  Convert a matrix to a data frame
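Both steps can be sketched as follows; the column names and values are made up for illustration.

```r
# Generate a new data frame with columns of different types
df <- data.frame(id     = 1:3,
                 sex    = c("F", "M", "F"),
                 weight = c(3.1, 4.2, 5.6))
str(df)

# Convert a matrix to a data frame
m   <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))
df2 <- as.data.frame(m)
df2
```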
Indexing and selecting
§  Indexing: appending an index
vector in square brackets to
the name of a vector, matrix or
data frame.
§  The index vector can be a
numeric vector or a character
vector.
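A short sketch of indexing by position, by name and, in addition to the two index types listed above, by logical condition. The objects are hypothetical.

```r
x <- c(a = 3.1, b = 4.2, c = 5.6)   # named numeric vector
x[2]                # second element, by position
x[c(1, 3)]          # first and third elements
x["b"]              # by name
x[x > 4]            # by logical condition

df <- data.frame(id = 1:3, sex = c("F", "M", "F"), weight = c(3.1, 4.2, 5.6))
df[2, "weight"]               # row 2, column "weight"
df[df$sex == "F", ]           # all rows where sex is "F"
df[, c("id", "weight")]       # selected columns
```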
Deleting
Deleting: using negative subscripts
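Negative subscripts return a copy without the listed positions; the original object is unchanged unless the result is assigned back. A sketch with arbitrary values:

```r
x <- c(3.1, 4.2, 5.6, 7.8)
x[-1]            # drop the first element
x[-c(1, 2)]      # drop the first two elements

m <- matrix(1:6, nrow = 2)
m[, -2]          # drop the second column

df <- data.frame(id = 1:3, weight = c(3.1, 4.2, 5.6))
df[-2, ]         # drop the second row
```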
Loops
Syntax:
for(variable in sequence) {
statements
}
Loops
Avoid ‘for’ looping and use “apply” instead if possible.
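The same computation written both ways, as a sketch; the variable names are illustrative.

```r
x <- 1:10

# for loop: fill a pre-allocated result vector
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# the same computation with sapply()
squares_apply <- sapply(x, function(v) v^2)

identical(squares_loop, squares_apply)   # TRUE
```

For simple arithmetic like this, the vectorized expression x^2 is shorter still.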
The “apply” family of functions
??apply
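A minimal sketch of four commonly used members of the family, with arbitrary example data:

```r
m <- matrix(1:6, nrow = 2)          # 2 x 3 matrix

apply(m, 1, sum)                    # row sums: 9 12
apply(m, 2, sum)                    # column sums: 3 7 11

lst <- list(a = 1:3, b = 4:6)
lapply(lst, mean)                   # returns a list
sapply(lst, mean)                   # simplifies to a named vector: a = 2, b = 5

weight <- c(3.1, 4.2, 5.6, 2.9)
sex    <- c("F", "M", "M", "F")
tapply(weight, sex, mean)           # group means: F = 3.0, M = 4.9
```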
Conditional execution
Comparison Operators
equal: ==
not equal: !=
greater than: >
less than: <
greater than or equal: >=
less than or equal: <=
Logical Operators
and: &
or: |
not: !
Conditional execution
Syntax:
if (condition) {
statement
} else {
alternative
}
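A sketch that combines comparison and logical operators with if/else; the values and labels are made up. Inside if(), `&&` is the form of "and" meant for single logical values, while `&` works element by element on vectors.

```r
x <- 7

if (x > 5 && x %% 2 == 1) {
  result <- "large and odd"
} else if (x > 5) {
  result <- "large"
} else {
  result <- "small"
}
result

# ifelse() is the vectorized form for whole vectors
v <- c(2, 8, 5)
ifelse(v >= 5, "high", "low")
```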
User-defined functions
Syntax to define a function:
myfunction <- function(arg1, arg2, ...) {
function_body
}
Syntax to call a function:
myfunction(arg1=..., arg2=...)
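As an illustration, here is a hypothetical function, `std_error`, with one required argument and one argument that has a default value.

```r
# A hypothetical function: standard error of the mean
std_error <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]          # drop missing values when requested
  }
  sd(x) / sqrt(length(x))      # the last expression is the return value
}

std_error(c(2, 4, 4, 4, 5, 5, 7, 9))          # argument given by position
std_error(x = c(1, 2, NA, 4), na.rm = TRUE)   # arguments given by name
```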
Package installation
§  Install packages from CRAN
•  install.packages("MASS")
§  Install packages from Bioconductor
•  install.packages("BiocManager")
•  BiocManager::install("biomaRt")
Reading and writing
?read.table
?write.table
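A round trip through a text file, as a sketch. The data frame is made up, and the file is written to a temporary path so that nothing in the working directory is touched.

```r
df <- data.frame(id = 1:3, sex = c("F", "M", "F"), weight = c(3.1, 4.2, 5.6))

path <- tempfile(fileext = ".txt")        # temporary file, for illustration only

# Write a tab-delimited text file without row names
write.table(df, file = path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back; header = TRUE uses the first line as column names
df_in <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
df_in
```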
Basic Statistical Analysis
Descriptive statistics
Visualization: Parallel boxplots
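A sketch of descriptive statistics and parallel boxplots, assuming the cats data from the MASS package (body weight Bwt, heart weight Hwt, and Sex).

```r
library(MASS)
data(cats)

summary(cats)                          # min, quartiles, mean, max per column
mean(cats$Hwt); sd(cats$Hwt)           # mean and standard deviation of heart weight
table(cats$Sex)                        # counts per sex
tapply(cats$Hwt, cats$Sex, median)     # median heart weight by sex

# Parallel boxplots: one box per level of Sex
boxplot(Hwt ~ Sex, data = cats,
        xlab = "Sex", ylab = "Heart weight",
        main = "Heart weight by sex")
```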
T-test
?t.test
A t-test (and ANOVA) is a linear regression
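The equivalence can be checked numerically. This sketch, assuming the cats data from the MASS package, compares a two-sample t-test with equal variances to a regression of heart weight on Sex.

```r
library(MASS)
data(cats)

tt  <- t.test(Hwt ~ Sex, data = cats, var.equal = TRUE)  # classical two-sample t-test
fit <- lm(Hwt ~ Sex, data = cats)                         # regression on a two-level factor

tt$p.value
coef(summary(fit))["SexM", "Pr(>|t|)"]       # same p-value as the t-test
abs(tt$statistic)
abs(coef(summary(fit))["SexM", "t value"])   # same t statistic, up to sign
```

Note that var.equal = TRUE is needed for the match; the default Welch test does not assume equal variances and gives slightly different numbers.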
Basic statistics: Linear regression
§  library(MASS)
§  data(cats)
§  ?cats
Data summary
Data visualization
plot(Hwt ~ Bwt, data=cats)
The tilde (~) operator is used to specify that Hwt is described by
Bwt.
Linear model
§  Express the relationship as formula: Hwt ~ Bwt
•  It reads: Hwt predicted by Bwt, or Hwt as a function of
Bwt.
•  Variable on the left side: dependent variable. Variable on
the right side: independent variable.
§  Hwt is not totally predicted by Bwt, so we add a random term to the linear model: Hwt ~ Bwt + ε. This model divides the factors related to Hwt into two parts: factors that we can control (the systematic part) and random errors (the random or probabilistic part of the model).
§  Implement in R:
•  mylm <- lm(Hwt ~ Bwt, data=cats)
•  summary(mylm)
Linear regression results
§  The output looks complicated, so let’s go through it. The first part repeats the model call.
§  The second part summarizes the residuals.
Residuals
§  Ideally, the residuals are independent, identically distributed (iid) random
variables.
Outliers
§  Sample 144 looks like an outlier or influential point. How can we check it?
§  plot(mylm,which=4) plots Cook's distance for each sample.
§  Remove outliers and repeat the analysis.
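The check and the re-analysis can be sketched numerically with cooks.distance(), assuming the cats data from the MASS package. Dropping only the single most influential sample is a choice made here for illustration, not a general rule.

```r
library(MASS)
data(cats)
mylm <- lm(Hwt ~ Bwt, data = cats)

cd <- cooks.distance(mylm)            # influence of each observation on the fit
head(sort(cd, decreasing = TRUE), 3)  # the three most influential samples

# Refit without the most influential sample and compare slopes
worst <- which.max(cd)
mylm2 <- lm(Hwt ~ Bwt, data = cats[-worst, ])
c(all = coef(mylm)[["Bwt"]], reduced = coef(mylm2)[["Bwt"]])
```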
Results: coefficient table
The p-value for Bwt is the same as the p-value for the model because there is only one independent variable.
plot(Hwt ~ Bwt,data=cats)
abline(mylm)
Results: R-squared and p-value
§  The last part:
§  R-squared: a measure of variance explained. In this case, 64.66% of the variance in heart weight is explained by the model.
§  Adjusted R-squared: adjusts for the number of independent variables in a model relative to the number of data points.
§  The F-statistic tests whether at least one of the non-constant coefficients in the regression equation is non-zero.
§  Finally, we get the p-value. But what is a p-value?
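These quantities can be pulled out of the summary object directly. A sketch, assuming the cats data from the MASS package; `p_model` is a name chosen here for illustration.

```r
library(MASS)
data(cats)
mylm <- lm(Hwt ~ Bwt, data = cats)
s    <- summary(mylm)

s$r.squared        # proportion of variance explained
s$adj.r.squared    # adjusted for the number of predictors
f <- s$fstatistic  # F value with numerator and denominator degrees of freedom

# p-value of the overall F test
p_model <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
p_model
```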
P-value
Statistical hypotheses
H0: All non-constant coefficients in the regression model are zero.
Ha: At least one of the non-constant coefficients in the regression model is non-zero.
A small p-value means that data at least as extreme as those observed would be unlikely if the null hypothesis were true.
Report the results
•  Linear regression analysis was conducted, and the results indicated that body weight significantly predicted heart weight (β = 4.03, p < 0.01). Body weight also explained a significant proportion of the variance in heart weight, R² = 0.65, F(1,142) = 259.8, p < 0.01.
Questions?

An introduction to R

  • 1. BCBB Workshop An introduction to R Kui Shen, PhD Bioinformatics and Computational Biosciences Branch (BCBB) National Institute of Allergy and Infectious Diseases (NIAID)
  • 2. OCICB Bioinformatics and Computational Biosciences Branch (BCBB) §  Part of NIAID §  Group of ~40 §  Software developers §  Computational biologists §  Project management & analysis professionals §  Biostatistics, phylogenetics, genomics, structural biology, programming
  • 3. What will we learn today? § R basics •  Arithmetic •  Matrix algebra •  Graphics •  Statistical tests •  Importing and exporting data •  Installing packages § Basic statistics with R •  Descriptive statistics •  General linear regression
  • 4. What’s R? •  R is a language and environment for statistical computing and •  graphics. It is an open-source and free software package.
  • 5. What can R do? -- Importing and manipulating data §  Importing data. •  Basic text files. (read.table). •  Excel files. ( read.xls in the gdata package). §  Manipulating data. •  Indexing and subsetting. •  Merging and Reshaping. An example: from wide format to long format (help('reshape'))
  • 6. What can R do? -- Data analysis §  Statistical data analysis •  Descriptive statistics •  One- and two-sample tests •  Regression and correlation •  Analysis of variance •  Tabular data •  Power and sample size estimation •  Mixed model §  Genetic and genomic data analysis (Bioconductor) •  Microarray •  Next generation sequencing
  • 7. What can R do? -- Visualization
  • 8. What can R do? -- Visualization xyplot(Counts~Status|sample,data=dat)
  • 9. What can R do? – Generating reports 1 An example of Sweave code: documentclass[a4paper]{article} title{Sweave=R+LaTeX} author{Kui Shen} begin{document} maketitle <<LM, eval=TRUE, echo=T,fig=TRUE, include=TRUE>>= library(MASS) data(cats) mylm = lm(Hwt ~ Bwt,data=cats) plot(Hwt ~ Bwt,data=cats, main='Linear regression') abline(mylm) @ This is the embedded figure. end{document}
  • 10. What can R do? – Generating reports 2 An example of Sweave code: documentclass[a4paper]{article} title{A table generated by xtable} begin{document} maketitle <<LM>>= library(MASS) library(xtable) data(cats) mylm = lm(Hwt ~ Bwt,data=cats) @ <<echo=FALSE,results=tex>>= print(xtable(mylm,caption= "Linear regreesin results")) @ end{document}
  • 11. What can R do? – Web tools
  • 12. Basic statistical analysis: linear regression §  The linear model is one of the most fundamental topics for a data scientist. §  Before you try any fancy machine learning method, the linear model should be your go-to method.
  • 13. Linear correlation [Figure: scatter plots of Y against X illustrating linear relationships and curvilinear relationships.] Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
  • 14. Linear regression §  In correlation, the two variables are treated as equals. §  In regression, one variable is considered the independent (predictor) variable X and the other the dependent (outcome, response) variable Y.
  • 15. Regression equation Regression equation: $y_i = \alpha + \beta x_i + \varepsilon_i$, where $\alpha + \beta x_i$ consists of fixed values and the random error $\varepsilon_i$ follows a normal distribution. What is “linear”? The mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables.
  • 16. Assumptions for a linear regression: § The relationship between X and Y is linear. § Y is distributed normally at each value of X. § The variance of Y at every value of X is the same (homogeneity of variances). § The observations are independent.
  • 17. Least squares estimation (LSE) § Estimating the intercept and slope: least squares estimation
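The formulas on this slide did not survive extraction. As a sketch, the standard closed-form least squares estimates for simple regression are $\hat{\beta} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$. The code below computes them by hand on simulated data (the true values alpha = 2 and beta = 0.5 are arbitrary assumptions) and compares them with lm().

```r
# Simulated data from y = alpha + beta * x + error (values chosen arbitrarily)
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)

# Closed-form least squares estimates
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

# The same estimates from R's built-in linear model function
fit <- lm(y ~ x)

print(c(intercept = alpha_hat, slope = beta_hat))
print(coef(fit))
```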
  • 18. Significance test Distribution of the slope estimate: $\hat{\beta} \sim T_{n-2}(\beta, \mathrm{s.e.}(\hat{\beta}))$. $H_0: \beta = 0$ (no linear relationship); $H_1: \beta \neq 0$ (a linear relationship does exist). Test statistic: $T_{n-2} = \frac{\hat{\beta} - 0}{\mathrm{s.e.}(\hat{\beta})}$
  • 19. Residual analysis for linearity [Figure: plots of Y against x with the corresponding residual plots, for a relationship that is not linear and for one that is linear (✓).] Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
  • 20. Residual analysis for homoscedasticity [Figure: plots of Y against x with the corresponding residual plots, for non-constant variance and for constant variance (✓).] Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
  • 21. Residual analysis for independence [Figure: residuals plotted against X, for residuals that are not independent and for residuals that are independent (✓).] Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
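A sketch of the slope test computed by hand, assuming the usual standard error $\mathrm{s.e.}(\hat{\beta}) = \sqrt{s^2 / \sum_i (x_i - \bar{x})^2}$ with $s^2$ the residual sum of squares divided by n - 2. The simulated data and variable names are illustrative only.

```r
# Simulated data (arbitrary true intercept 2 and slope 0.5)
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)

# Least squares estimates
sxx       <- sum((x - mean(x))^2)
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sxx
alpha_hat <- mean(y) - beta_hat * mean(x)

# Residual variance on n - 2 degrees of freedom, then s.e. of the slope
res     <- y - (alpha_hat + beta_hat * x)
s2      <- sum(res^2) / (n - 2)
se_beta <- sqrt(s2 / sxx)

# t statistic for H0: beta = 0 and its two-sided p-value
t_stat  <- (beta_hat - 0) / se_beta
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# The same numbers as reported by summary(lm())
lm_row <- summary(lm(y ~ x))$coefficients["x", ]
print(c(t = t_stat, p = p_value))
print(lm_row)
```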
  • 22. Multivariate regression pitfalls - overfitting §  In multivariate linear modeling, you can get significant but meaningless results if you use too many predictors in the model (no predictive ability for new samples). §  But how about big data? For example, for next generation sequencing data, there are thousands of genes but few samples. §  When the number of features p is as large as, or larger than, the number of observations n, least squares either fits the training data perfectly or has no unique solution, so it should not be used (the large p, small n problem).
  • 23. Linear regression for big data § Shrinkage Methods •  LASSO (Least Absolute Shrinkage and Selection Operator) §  Dimension Reduction Methods •  Principal Components Regression
  • 25. R programming: preliminaries §  Install: http://www.r-project.org/ §  Manual: https://cran.r-project.org/manuals.html §  Start and quit: R; q(); §  R and statistics •  Around 25 built-in statistical packages •  Many more available at: •  http://CRAN.R-project.org •  http://www.bioconductor.org §  Getting help •  help('plot') •  ?plot §  Useful website: http://stackoverflow.com/
  • 26. R and RStudio RStudio (http://www.rstudio.com/)
  • 27. R programming: variables and assignment §  Variable names: alphanumeric symbols, plus “.” and “_”, are allowed. §  Variable names are case sensitive, so “A” and “a” are two different variables. §  Assignment: <- •  A<-3; a<-4; X.1<-9; •  During the development of S at Bell Labs, the key corresponding to “_” was printed as a left arrow (←) by the Execuport time-sharing terminal. §  In 2001, the “=” assignment was introduced for compatibility with other languages.
  • 28. Arithmetic operators §  Addition: + §  Subtraction: - §  Division: / §  Multiplication: * §  Exponentiation: ^ §  Using R as a calculator.
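The calculator screenshot from this slide is not in the transcript. A small sketch of the operators listed above follows; the numbers are arbitrary, and the results are stored in a named vector only so they can be printed together.

```r
# Each arithmetic operator from the slide, applied to arbitrary numbers
results <- c(
  addition       = 3 + 4,
  subtraction    = 10 - 2.5,
  division       = 7 / 2,
  multiplication = 6 * 7,
  exponentiation = 2 ^ 10,
  # ^ binds more tightly than *, and parentheses are evaluated first
  combined       = (1 + 2) * 3 ^ 2
)
print(results)
```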
  • 29. Function calls §  Syntax: functionName(arg1, arg2) §  Arguments can be specified by position or by name; named arguments are matched first, so names override position. §  > log(x=16,base=2) §  [1] 4 §  > log(16,2) §  [1] 4 §  > log(2,16) §  [1] 0.25 §  > log(base=2,x=16) §  [1] 4
  • 30. Data type §  Three major data types: numeric, character, and logical. •  # numeric variables •  X<-3.1 •  # character variables •  X<-"Male" •  # Logical variables •  X<-TRUE; Y<-FALSE
  • 31. Data structure: vector, matrix and array §  Vector: an ordered collection of data. •  x<-c(3.1,4.2,5.6); y<-c("F","M","F"); z<-c(T,F,T,F,T) •  c(…) is a function that concatenates its arguments end to end. §  Matrix: a two-dimensional generalization of a vector §  Array: a multi-dimensional generalization of a vector
  • 32. Data structure: data frame §  Data frames are matrix-like structures. §  The columns in a data frame can be of different data types. §  Generate a new data frame §  Convert a matrix to a data frame
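A short sketch of the three structures using base R constructors. The vector x is the one from the slide; the contents of the matrix and array are arbitrary.

```r
# Vector: ordered collection of values of one type
x <- c(3.1, 4.2, 5.6)

# Matrix: two dimensions; values are filled column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)

# Array: any number of dimensions (here 2 x 3 x 4)
a <- array(1:24, dim = c(2, 3, 4))

print(length(x))
print(dim(m))
print(dim(a))
print(m[2, 3])     # element in row 2, column 3
print(a[1, 2, 1])  # element at position (1, 2) in the first 2 x 3 slice
```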
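The code for the last two bullets is missing from the transcript. A minimal sketch follows, with hypothetical column names (sex, weight, treated, a, b).

```r
# Generate a new data frame: each column may have its own data type
df <- data.frame(sex     = c("F", "M", "F"),
                 weight  = c(3.1, 4.2, 5.6),
                 treated = c(TRUE, FALSE, TRUE))
str(df)

# Convert a matrix to a data frame; column names are carried over
m   <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))
df2 <- as.data.frame(m)
print(df2)
```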
  • 33. Indexing and selecting §  Indexing: appending an index vector in square brackets to the name of a vector, matrix or data frame. §  The index vector can be a numeric vector or a character vector.
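A sketch of numeric and character indexing on a vector, a matrix and a data frame. All object names and values here are made up for illustration.

```r
# Named vector: index by position or by name
x <- c(a = 3.1, b = 4.2, c = 5.6)
second   <- x[2]           # numeric index
first_3rd <- x[c(1, 3)]    # numeric index vector
by_name  <- x["b"]         # character index

# Matrix: one index per dimension, separated by a comma
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
row1 <- m[1, ]             # whole first row
cell <- m["r2", "c3"]      # single cell by row and column name

# Data frame: indexed like a matrix, or by column name
df <- data.frame(sex = c("F", "M", "F"), weight = c(3.1, 4.2, 5.6))
w2      <- df[2, "weight"] # row 2 of the weight column
weights <- df[["weight"]]  # the whole column as a vector

print(list(second, first_3rd, by_name, row1, cell, w2, weights))
```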
  • 36. Loops Avoid ‘for’ looping and use “apply” instead if possible.
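To make the advice concrete, here is a sketch that computes the same result three ways: with a for loop, with sapply(), and with a plain vectorized call. The input values are arbitrary.

```r
values <- c(1, 4, 9, 16)

# Explicit for loop: pre-allocate the result, then fill it element by element
roots_loop <- numeric(length(values))
for (i in seq_along(values)) {
  roots_loop[i] <- sqrt(values[i])
}

# The same computation with sapply(): no index bookkeeping needed
roots_apply <- sapply(values, sqrt)

# Many R functions are already vectorized, which is simpler still
roots_vec <- sqrt(values)

print(roots_loop)
print(roots_apply)
print(roots_vec)
```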
  • 37. The “apply” family of functions
  • 38. The “apply” family of functions ??apply
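The tables on these two slides were images and are not in the transcript. As a sketch, four commonly used members of the family are shown below on small made-up inputs.

```r
m <- matrix(1:6, nrow = 2)

# apply(): over the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix
row_sums  <- apply(m, 1, sum)
col_means <- apply(m, 2, mean)

# lapply(): over the elements of a list, always returns a list
lens <- lapply(list(a = 1:3, b = letters), length)

# sapply(): like lapply(), but simplifies the result to a vector if it can
squares <- sapply(1:3, function(i) i^2)

# tapply(): applies a function within groups defined by a second vector
group_means <- tapply(c(3.1, 4.2, 5.6), c("F", "M", "F"), mean)

print(row_sums)
print(col_means)
print(lens)
print(squares)
print(group_means)
```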
  • 39. Conditional execution Comparison operators: equal: ==; not equal: !=; greater than: >; less than: <; greater than or equal: >=; less than or equal: <=. Logical operators: and: &; or: |; not: !
  • 40. Conditional execution Syntax: if (condition) { statement } else { alternative }
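A sketch of the if/else syntax using a comparison and a logical operator from the previous slide. The variable names and the 0.05 cutoff are illustrative assumptions; ifelse() is base R's vectorized counterpart and is not mentioned on the slide.

```r
p_value     <- 0.03   # hypothetical test result
effect_size <- 1.2    # hypothetical estimate

# if / else evaluates a single TRUE or FALSE condition
if (p_value < 0.05 & effect_size > 0) {
  verdict <- "significant positive effect"
} else {
  verdict <- "no evidence of a positive effect"
}
print(verdict)

# ifelse() applies a condition to every element of a vector
labels <- ifelse(c(0.20, 0.01) < 0.05, "significant", "not significant")
print(labels)
```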
  • 41. User-defined functions Syntax to define a function: myfunction <- function(arg1, arg2, ...) { function_body } Syntax to call a function: myfunction(arg1=..., arg2=...)
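A sketch of a user-defined function following this syntax. The function name standard_error and its na.rm argument are made up for the example.

```r
# Standard error of the mean; na.rm has a default value, so it is optional
standard_error <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]   # drop missing values before computing
  }
  sd(x) / sqrt(length(x))
}

# Call by position, then by name with the optional argument
se1 <- standard_error(c(2, 4, 6))
se2 <- standard_error(x = c(2, NA, 4, 6), na.rm = TRUE)
print(c(se1, se2))
```

The value of the last evaluated expression is returned, so an explicit return() is not required here.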
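  • 42. Package installation §  Install packages from CRAN •  install.packages("MASS") §  Install packages from Bioconductor •  source("http://bioconductor.org/biocLite.R") •  biocLite("biomaRt") •  Bioconductor 3.8 and later replace biocLite with BiocManager: install.packages("BiocManager"); BiocManager::install("biomaRt")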
  • 47. A t-test (and ANOVA) is a linear regression
  • 48. Basic statistics: Linear regression §  library(MASS) §  data(cats) §  ?cats
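The equivalence can be checked numerically. The sketch below assumes the equal-variance two-sample t-test and uses the cats data from the MASS package (introduced on the following slides), comparing heart weight between the two sexes; the choice of variables is an illustration, not taken from the slide.

```r
library(MASS)   # provides the cats data set
data(cats)

# Two-sample t-test with equal variances assumed
tt <- t.test(Hwt ~ Sex, data = cats, var.equal = TRUE)

# Linear regression of heart weight on a two-level factor
fit    <- lm(Hwt ~ Sex, data = cats)
lm_row <- summary(fit)$coefficients["SexM", ]

# Same t statistic (up to sign) and the same p-value
t_ttest <- unname(abs(tt$statistic))
t_lm    <- unname(abs(lm_row["t value"]))
p_ttest <- tt$p.value
p_lm    <- unname(lm_row["Pr(>|t|)"])

print(c(t_ttest = t_ttest, t_lm = t_lm))
print(c(p_ttest = p_ttest, p_lm = p_lm))
```

The sign of the statistic can differ because t.test() and lm() may subtract the group means in opposite orders.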
  • 50. Data visualization plot(Hwt ~ Bwt, data=cats) The tilde (~) operator is used to specify that Hwt is described by Bwt.
  • 51. Linear model §  Express the relationship as a formula: Hwt ~ Bwt •  It reads: Hwt predicted by Bwt, or Hwt as a function of Bwt. •  Variable on the left side: dependent variable. Variable on the right side: independent variable. §  Hwt is not totally predicted by Bwt, so we add a random term to the linear model: Hwt ~ Bwt + ε. This model divides the factors related to Hwt into two parts: factors that we can control (the systematic part) and random errors (the random or probabilistic part of the model). §  Implement in R: •  mylm <- lm(Hwt ~ Bwt, data=cats) •  summary(mylm)
  • 53. Linear regression results §  The output looks complicated, so let’s go through it. The first part restates the linear model that was fitted. §  The second part summarizes the residuals.
  • 54. Residuals §  Ideally, the residuals are independent, identically distributed (iid) random variables.
  • 56. Outliers §  Sample 144 looks like an outlier or influential point. How can we test it? §  plot(mylm,which=4) §  Remove outliers and repeat the analysis
  • 57. Results: coefficient table The p-value for Bwt is the same as the p-value for the model since there is only one independent variable. plot(Hwt ~ Bwt,data=cats) abline(mylm)
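plot(mylm, which=4) draws Cook's distance for each observation. A sketch of the same check done numerically follows, with the model refitted after dropping the single most influential observation. Dropping exactly one row is an assumption made for this example; the slide does not say how many points were removed.

```r
library(MASS)   # provides the cats data set
data(cats)
mylm <- lm(Hwt ~ Bwt, data = cats)

# Cook's distance: how much the fit changes when each observation is left out
cd <- cooks.distance(mylm)
print(head(sort(cd, decreasing = TRUE), 3))

# Refit without the observation that has the largest Cook's distance
worst        <- which.max(cd)
mylm_trimmed <- lm(Hwt ~ Bwt, data = cats[-worst, ])

# Compare the coefficients before and after
print(rbind(all_data = coef(mylm), trimmed = coef(mylm_trimmed)))
```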
  • 58. Results: R-squared and p-value §  The last part: §  R-squared: a measure of variance explained. In this case, 64.66% of the variance in heart weight is explained by the model. §  Adjusted R-squared: adjusts for the number of independent variables in the model relative to the number of data points. §  The F-statistic tests whether one or more of the non-constant coefficients in the regression equation are non-zero. §  Finally, we get the p-value. But what is a p-value?
  • 59. P-value Statistical hypotheses H0 : All non-constant coefficients in the regression model are zero. Ha : At least one of the non-constant coefficients in the regression model is non-zero. A small p-value means that data at least as extreme as those observed would be unlikely if the null hypothesis were true.
  • 60. Report the results •  Linear regression analysis was conducted and the results indicated that body weight significantly predicted heart weight (β = 4.03, p<0.01). Body weight also explained a significant proportion of variance in heart weight, R² = 0.65, F(1,142) = 259.8, p<0.01.
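The quantities in the last part of the summary can also be extracted programmatically. The sketch below uses the components of the object returned by summary() for an lm fit and recomputes the model p-value from the F-statistic with pf().

```r
library(MASS)   # provides the cats data set
data(cats)
mylm <- lm(Hwt ~ Bwt, data = cats)
s    <- summary(mylm)

r2     <- s$r.squared       # proportion of variance explained
adj_r2 <- s$adj.r.squared   # adjusted for the number of predictors
f      <- s$fstatistic      # named vector: value, numdf, dendf

# Model p-value: upper tail of the F distribution
p_model <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# With a single predictor this matches the p-value of the slope
p_slope <- s$coefficients["Bwt", "Pr(>|t|)"]

print(c(r2 = r2, adj_r2 = adj_r2))
print(f)
print(c(p_model = p_model, p_slope = p_slope))
```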