R introduction v2
Upcoming SlideShare
Loading in...5
×
 

R introduction v2

on

  • 4,245 views

A slightly different introduction to #Rstats. Data frames, linear models and plots with ggplot2.

A slightly different introduction to #Rstats. Data frames, linear models and plots with ggplot2.

Statistics

Views

Total Views
4,245
Views on SlideShare
935
Embed Views
3,310

Actions

Likes
2
Downloads
22
Comments
0

21 Embeds 3,310

http://martinsbioblogg.wordpress.com 1957
http://www.r-bloggers.com 556
http://cloud.feedly.com 377
https://martinsbioblogg.wordpress.com 306
http://digg.com 20
http://newsblur.com 20
http://www.newsblur.com 16
http://reader.aol.com 12
http://feedly.com 10
http://feeds.feedburner.com 9
http://feedproxy.google.com 6
http://www.inoreader.com 6
https://twitter.com 5
http://www.feedspot.com 3
http://127.0.0.1 1
https://www.google.com 1
http://www.hanrss.com 1
http://twitterrific.com 1
http://flavors.me 1
http://inoreader.com 1
http://www.google.co.in 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

R introduction v2 R introduction v2 Presentation Transcript

  • THE WRIGHT LAB COMPUTATION LUNCHES An introduction to R Gene expression from DEVA to differential expression Handling massively parallel sequencing data
  • R: data frames, plots and linear models Martin Johnsson
  • statistical environment View slide
  • free open source View slide
  • scripting language
  • great packages (limma for microarrays, R/qtl for QTL mapping etc)
  • You need to write code!
  • ( Why scripting? harder the first time easier the next 20 times … necessity for moderately large data )
  • R-project.org Rstudio.com
  • Interface is immaterial ssh into a server, RStudio, alternatives very few platform-dependent elements that the end user needs to worry about
  • Scripting write your code down in a .R file run it with source ( ) ## comments
  • if it runs without intervention from start to finish, you’re actually programming
  • Help! within R: ? and tab your favourite search engine ask (Stack Exchange, R-help mailing list, package mailing lists)
  • First task: make a new script that imports a data set and takes a subset of the data
  • Reading in data Excel sheet to data.frame one sheet at a time clear formatting short succinct column names export text read.table
  • Subsetting logical operators ==, !=, >, <, ! subset(data, column1==1) subset(data, column1==1 & column2>2)
  • Indexing with [ ] first three rows: data[c(1,2,3),] two columns: data[,c(2,4)] ranges: data[,1:3]
  • Variables columns in expressions data$column1 + data$column2 log10(data$column2) assignment arrow data$new.column <- log10(data$column2) new.data <- subset(data, column1==10)
  • Exercise: Start a new script and save it as unicorn_analysis.R. Import unicorn_data.csv. Take a subset that only includes green unicorns.
  • RStudio: File > New > R Script data <- read.csv("unicorn_data.csv") green.unicorns <- subset(data, colour=="green")
  • Anatomy of a function call function.name(parameters) mean(data$column) mean(x=data$column) mean(x=data$column, na.rm=T) ?mean
  • mean(exp(log(x)))
  • programming in R == stringing together functions and writing new ones
  • Using a package (ggplot2) to make statistical graphics.
  • install.packages("ggplot2") library(ggplot2)
  • qplot(x=x.var, y=y.var, data=data) only x: histogram x and y numeric: scatterplot or set geometry (geom) yourself
  • geoms: point line boxplot jitter – scattered points tile – heatmap and many more
  • Exercise: Make a scatterplot of weight and horn length in green unicorns. Write all code in the unicorn_analysis.R script. Save the plots as variables so you can refer back to them.
  • 33 horn.length 30 27 24 250 300 350 weight 400 450 green.scatterplot <- qplot(x=weight, y=horn.length, data=green.unicorns)
  • Exercise: Make a boxplot of horn length versus diet.
  • 40 horn.length 35 30 25 candy diet flowers qplot(x=diet, y=weight, data=unicorn.data, geom="boxplot")
  • Small multiples split the plot into multiple subplots useful for looking at patterns qplot(x=x.var, y=y.var, data=data, facets=~variable) facets=variable1~variable2
  • Exercise: Again, make a boxplot of diet and horn length, but separated into small multiples by colour.
  • green 40 pink horn.length 35 30 25 candy flowers diet candy flowers qplot(x=diet, y=horn.length, data=data, geom="boxplot", facets=~colour)
  • Comparing means with linear models
  • Wilkinson–Rogers notation one predictor: y ~ x additive model: y ~ x1 + x2 interactions: y ~ x1 + x2 + x1:y2 or y ~ x1 * x2 factors: y ~ factor(x)
  • Student’s t-test
  • t.test(variable ~ grouping, data=data) alternative: two sided, less, greater var.equal paired geoms: boxplot, jitter
  • Carl Friedrich Gauss least squares estimation
  • 40 horn.length 35 30 25 250 300 350 weight 400 450
  • linear model y = a + b x + e, e ~ N(0, sigma) lm(y ~ x, data=some.data) formula data summary( ) function
  • Exercise: Make a regression of horn length and body weight.
  • model <- lm(horn.length ~ weight, data=data) summary(model) Call: lm(formula = horn.length ~ weight, data = data) Residuals: Min 1Q Median -6.5280 -2.0230 -0.1902 3Q 2.5459 Max 7.3620 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 20.41236 3.85774 5.291 1.76e-05 *** weight 0.03153 0.01093 2.886 0.00793 ** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.447 on 25 degrees of freedom (3 observations deleted due to missingness) Multiple R-squared: 0.2499, Adjusted R-squared: 0.2199 F-statistic: 8.327 on 1 and 25 DF, p-value: 0.007932
  • What have we actually fitted? model.matrix(horn.length ~ weight, data) What were the results? coef(model) How uncertain? confint(model)
  • Plotting the model regression equation y = a + x b a is the intercept b is the slope of the line pull out coefficients with coef( ) a plot with two layers: scatterplot with added geom_abline( )
  • 40 horn.length 35 30 25 250 300 350 weight 400 450 scatterplot <- qplot(x=weight, y=horn.length, data=data) a <- coef(model)[1] b <- coef(model)[2] scatterplot + geom_abline(intercept=a, slope=b)
  • A regression diagnostic the linear model needs several assumptions, particularly linearity and equal error variance the residuals vs fitted plot can help spot gross deviations
  • 8 residuals(model) 4 0 -4 28 30 32 fitted(model) 34 qplot(x=fitted(model), y=residuals(model))
  • Photo: Peter (anemoneprojectors), CC:BY-SA-2.0 http://www.flickr.com/people/anemoneprojectors/
  • analysis of variance aov(formula, data=some.data) drop1(aov.object, test="F") F-tests (Type II SS) post-hoc tests pairwise.t.test TukeyHSD
  • Exercise: Perform analysis of variance on weight and the effects of diet while controlling for colour.
  • We want a two-way anova with an F-test for diet. model.int <- aov(weight ~ diet * colour, data=data) drop1(model.int, test="F") model.add <- aov(weight ~ diet + colour, data=data) drop1(model.add, test="F") Single term deletions Model: weight ~ diet + colour Df Sum of Sq RSS AIC F value Pr(>F) <none> 85781 223.72 diet 1 471.1 86252 221.87 0.1318 0.71975 colour 1 13479.7 99260 225.66 3.7714 0.06396 . --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  • Black magic none of the is really R’s fault, but things that come up along the way
  • Black magic none of the is really R’s fault, but things that come up along the way missing values: na.rm=T, na.exclude( )
  • Black magic none of the is really R’s fault, but things that come up along the way missing values: na.rm=T, na.exclude( ) type I, II and III sums of squares in Anova
  • Black magic none of the is really R’s fault, but things that come up along the way missing values: na.rm=T, na.exclude( ) type I, II and III sums of squares in anova floating-point arithmetic, e.g. sin(pi)
  • Reading Daalgard, Introductory statistics with R, electronic resource at the library Faraway, The linear model in R Gelman & Hill, Data analysis using regression and multilevel/hierarchical models Wickham, ggplot2 book tons of tutorials online, for instance http://martinsbioblogg.wordpress.com/a-slightlydifferent-introduction-to-r/
  • Exercise More (and some of the same) analysis of the unicorn data set. Use the R documentation and google. I will post solutions.