R introduction v2

A slightly different introduction to #Rstats. Data frames, linear models and plots with ggplot2.


1. THE WRIGHT LAB COMPUTATION LUNCHES. An introduction to R. Gene expression from DEVA to differential expression. Handling massively parallel sequencing data.
2. R: data frames, plots and linear models. Martin Johnsson.
3. A statistical environment.
4. Free and open source.
5. A scripting language.
6. Great packages (limma for microarrays, R/qtl for QTL mapping, etc.).
7. You need to write code!
8. Why scripting? Harder the first time, easier the next 20 times … a necessity for moderately large data.
9. R-project.org, Rstudio.com
10. The interface is immaterial: ssh into a server, RStudio, or alternatives. There are very few platform-dependent elements that the end user needs to worry about.
11. Scripting: write your code down in a .R file, run it with source(), and mark comments with #.
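As a minimal sketch of what such a script could look like (the file name and data here are made up for illustration):

    # my_analysis.R -- a hypothetical example script
    # Comments start with #; everything else runs from top to bottom.
    x <- c(1, 2, 3, 4, 5)   # a small vector
    x.mean <- mean(x)       # compute its mean
    print(x.mean)           # print the result

    # Then, from the R console:
    # source("my_analysis.R")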
12. If it runs without intervention from start to finish, you're actually programming.
13. Help! Within R: ? and tab completion. Your favourite search engine. Ask (Stack Exchange, the R-help mailing list, package mailing lists).
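For example (the search terms are chosen just for illustration):

    ?mean                        # help page for a function you know the name of
    ??"analysis of variance"     # search the help pages by topic
    apropos("mean")              # list objects whose names contain "mean"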
14. First task: make a new script that imports a data set and takes a subset of the data.
15. Reading in data: from an Excel sheet to a data.frame, one sheet at a time. Clear the formatting, use short succinct column names, export as text, and read it with read.table().
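A sketch of what that last step might look like, assuming a tab-separated text file exported from Excel (the file name and options are illustrative):

    # Read a tab-separated text file with a header row of column names
    data <- read.table("my_exported_sheet.txt", header = TRUE, sep = "\t")

    # Quick checks that the import worked
    head(data)   # first few rows
    str(data)    # column names and types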
16. Subsetting with logical operators: ==, !=, >, <, !. subset(data, column1 == 1); subset(data, column1 == 1 & column2 > 2)
17. Indexing with [ ]. First three rows: data[c(1, 2, 3), ]. Two columns: data[, c(2, 4)]. Ranges: data[, 1:3]
18. Variables: columns in expressions, data$column1 + data$column2, log10(data$column2). The assignment arrow: data$new.column <- log10(data$column2); new.data <- subset(data, column1 == 10)
19. Exercise: Start a new script and save it as unicorn_analysis.R. Import unicorn_data.csv. Take a subset that only includes green unicorns.
20. RStudio: File > New > R Script. data <- read.csv("unicorn_data.csv"); green.unicorns <- subset(data, colour == "green")
21. Anatomy of a function call: function.name(parameters). mean(data$column), mean(x = data$column), mean(x = data$column, na.rm = T), ?mean
22. mean(exp(log(x)))
23. Programming in R == stringing together functions and writing new ones.
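As a sketch of "writing new ones", here is a tiny hypothetical function that strings existing functions together (the name and contents are made up):

    # A small user-defined function: the geometric mean of a vector
    geometric.mean <- function(x, na.rm = FALSE) {
      exp(mean(log(x), na.rm = na.rm))
    }

    geometric.mean(c(1, 10, 100))   # returns 10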
24. Using a package (ggplot2) to make statistical graphics.
25. install.packages("ggplot2"); library(ggplot2)
26. qplot(x = x.var, y = y.var, data = data). Only x: histogram. x and y numeric: scatterplot. Or set the geometry (geom) yourself.
27. Geoms: point, line, boxplot, jitter (scattered points), tile (heatmap), and many more.
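A few hedged examples of how the geom choice plays out, using the unicorn columns from the exercises below (the first two rely on the default geoms qplot picks when none is given):

    qplot(x = weight, data = data)                                   # only x: histogram
    qplot(x = weight, y = horn.length, data = data)                  # x and y numeric: scatterplot
    qplot(x = diet, y = horn.length, data = data, geom = "jitter")   # set the geom yourself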
28. Exercise: Make a scatterplot of weight and horn length in green unicorns. Write all code in the unicorn_analysis.R script. Save the plots as variables so you can refer back to them.
29. [Scatterplot of horn.length against weight in green unicorns.] green.scatterplot <- qplot(x = weight, y = horn.length, data = green.unicorns)
30. Exercise: Make a boxplot of horn length versus diet.
31. [Boxplot of horn.length by diet.] qplot(x = diet, y = horn.length, data = unicorn.data, geom = "boxplot")
32. Small multiples: split the plot into multiple subplots, useful for looking at patterns. qplot(x = x.var, y = y.var, data = data, facets = ~ variable); facets = variable1 ~ variable2
33. Exercise: Again, make a boxplot of diet and horn length, but separated into small multiples by colour.
34. [Boxplots of horn.length by diet, in green and pink panels.] qplot(x = diet, y = horn.length, data = data, geom = "boxplot", facets = ~ colour)
35. Comparing means with linear models.
36. Wilkinson–Rogers notation. One predictor: y ~ x. Additive model: y ~ x1 + x2. Interactions: y ~ x1 + x2 + x1:x2, or y ~ x1 * x2. Factors: y ~ factor(x)
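To make the notation concrete, a sketch with the unicorn variables (the particular models are illustrative, not from the original slides):

    lm(horn.length ~ weight, data = data)            # one predictor
    lm(horn.length ~ weight + diet, data = data)     # additive model
    lm(horn.length ~ weight * diet, data = data)     # main effects plus their interaction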
37. Student's t-test.
38. t.test(variable ~ grouping, data = data). alternative: two.sided, less, greater. var.equal, paired. Geoms: boxplot, jitter.
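A sketch of what this could look like on the unicorn data, assuming diet has exactly two levels (candy and flowers, as in the plots above):

    # Two-sample t-test of horn length between the two diets
    t.test(horn.length ~ diet, data = data)

    # The same test assuming equal variances in the two groups
    t.test(horn.length ~ diet, data = data, var.equal = TRUE)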
39. Carl Friedrich Gauss: least squares estimation.
40. [Scatterplot of horn.length against weight.]
41. Linear model: y = a + b x + e, e ~ N(0, sigma). lm(y ~ x, data = some.data): a formula and a data frame. The summary() function.
42. Exercise: Make a regression of horn length on body weight.
43. model <- lm(horn.length ~ weight, data = data); summary(model)

    Call:
    lm(formula = horn.length ~ weight, data = data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -6.5280 -2.0230 -0.1902  2.5459  7.3620

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 20.41236    3.85774   5.291 1.76e-05 ***
    weight       0.03153    0.01093   2.886  0.00793 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.447 on 25 degrees of freedom
      (3 observations deleted due to missingness)
    Multiple R-squared: 0.2499, Adjusted R-squared: 0.2199
    F-statistic: 8.327 on 1 and 25 DF, p-value: 0.007932
44. What have we actually fitted? model.matrix(horn.length ~ weight, data). What were the results? coef(model). How uncertain? confint(model)
45. Plotting the model. The regression equation is y = a + b x, where a is the intercept and b is the slope of the line. Pull out the coefficients with coef(). A plot with two layers: a scatterplot with an added geom_abline().
46. [Scatterplot of horn.length against weight with the fitted line.] scatterplot <- qplot(x = weight, y = horn.length, data = data); a <- coef(model)[1]; b <- coef(model)[2]; scatterplot + geom_abline(intercept = a, slope = b)
47. A regression diagnostic: the linear model rests on several assumptions, particularly linearity and equal error variance. The residuals vs fitted plot can help spot gross deviations.
48. [Plot of residuals(model) against fitted(model).] qplot(x = fitted(model), y = residuals(model))
49. Photo: Peter (anemoneprojectors), CC:BY-SA-2.0, http://www.flickr.com/people/anemoneprojectors/
50. Analysis of variance: aov(formula, data = some.data); drop1(aov.object, test = "F") for F-tests (Type II SS). Post-hoc tests: pairwise.t.test, TukeyHSD.
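A hedged sketch of the post-hoc tests on the unicorn data (the model and variables are chosen just for illustration):

    model.diet <- aov(weight ~ diet, data = data)

    TukeyHSD(model.diet)                       # confidence intervals for all pairwise differences
    pairwise.t.test(data$weight, data$diet)    # pairwise t-tests with adjusted p-values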
51. Exercise: Perform an analysis of variance on weight and the effects of diet, while controlling for colour.
52. We want a two-way anova with an F-test for diet.

    model.int <- aov(weight ~ diet * colour, data = data)
    drop1(model.int, test = "F")
    model.add <- aov(weight ~ diet + colour, data = data)
    drop1(model.add, test = "F")

    Single term deletions

    Model:
    weight ~ diet + colour
           Df Sum of Sq   RSS    AIC F value  Pr(>F)
    <none>               85781 223.72
    diet    1     471.1  86252 221.87  0.1318 0.71975
    colour  1   13479.7  99260 225.66  3.7714 0.06396 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
53. Black magic: none of this is really R's fault, but things that come up along the way.
54. Black magic: none of this is really R's fault, but things that come up along the way. Missing values: na.rm = T, na.exclude()
55. Black magic: none of this is really R's fault, but things that come up along the way. Missing values: na.rm = T, na.exclude(). Type I, II and III sums of squares in anova.
56. Black magic: none of this is really R's fault, but things that come up along the way. Missing values: na.rm = T, na.exclude(). Type I, II and III sums of squares in anova. Floating-point arithmetic, e.g. sin(pi).
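Two tiny examples, one for missing values and one for floating-point arithmetic (the values in the comments are what R prints with default settings):

    # Missing values propagate unless you ask for them to be removed
    mean(c(1, 2, NA))                # NA
    mean(c(1, 2, NA), na.rm = TRUE)  # 1.5

    # Floating-point arithmetic: sin(pi) is not exactly zero
    sin(pi)                          # 1.224647e-16
    all.equal(sin(pi), 0)            # TRUE, within numerical tolerance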
57. Reading: Dalgaard, Introductory Statistics with R (electronic resource at the library). Faraway, Linear Models with R. Gelman & Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models. Wickham, the ggplot2 book. Tons of tutorials online, for instance http://martinsbioblogg.wordpress.com/a-slightly-different-introduction-to-r/
58. Exercise: More (and some of the same) analysis of the unicorn data set. Use the R documentation and Google. I will post solutions.
