R is a language and environment for data analysis. This workshop introduces you to this free program with some potential uses and basic features of the language.
Intro to R: free software for data analysis and graphics
1. An Introduction to R: free software for data analysis and graphics Allan C. Just PhD Student Environmental Health Sciences Mailman School of Public Health Columbia University December 01, 2010 Wickham 2008 A list of useful links can be found at the end Attribution-NonCommercial-ShareAlike 3.0 United States License. http://creativecommons.org/licenses/by-nc-sa/3.0/us/
2. Intro to R Objectives After this workshop participants will be able to: Describe some capabilities and uses of R Search for help within R and use good coding practices for reproducible research in R Read in, summarize, and transform a simple dataset in R Create and save reproducible graphs from R (scatterplots, histograms, boxplots, bar charts)
4. R has many uses Work with data: subset, merge, and transform datasets with a powerful syntax Analysis: use existing statistical functions like regression or write your own Graphics: graphs can be made quickly during analysis and polished for publication quality displays
5. Why learn a whole language to look at data versus Excel? Recreate/redo your exact analysis Automate repetitive tasks Access to statistical methods not available in Excel Graphs are more elegant
6. Why R versus SAS, SPSS, or Stata? It's free! It runs on Mac, Windows, and Linux It has state-of-the-art graphics capabilities It contains advanced statistical routines not yet available in other packages – a de facto standard in statistics Can program new statistical methods or automate data manipulation/analysis adapted from statmethods.net
14. Choropleth map Shapefile: CIESIN, Columbia University Asthma data: http://nyc.gov/html/doh/downloads/pdf/asthma/asthma-hospital.pdf
15. Intro to R: recap Objectives After this workshop participants will be able to: Describe some capabilities and uses of R Statistical data analysis Automation (scripting) of functions to work with data Elegant graphics to facilitate data visualization Search for help within R and use good coding practices for reproducible research in R Read in, summarize, and transform a simple dataset in R Create and save reproducible graphs from R (scatterplots, histograms, boxplots, bar charts)
16. Learning a new language is difficult flickr.com/photos/dnorman/3732851541/
17. What makes R difficult to learn R is designed to be flexible and powerful rather than simple but limited. R is a fully featured language mainly used from the command line. Learning the commands and the structure of the code takes time and practice. If I made a a typo you would know what I meant... The solution: be careful, build code in simple pieces and test as you go (learn to debug). Reuse code that works. Use helpful resources. Consider an alternative GUI for R.
18. Getting help in R You can call for help on a function with a leading question mark and leaving off the () ?functionname Search online statmethods.net An Introduction to R in Windows found under Help – Manuals (in PDF)
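As a brief illustration of the help calls described on this slide, the following sketch uses only functions that ship with base R. The choice of median and the search phrase "linear model" are arbitrary examples, not taken from the workshop.

```r
# Open the help page for a function: leading question mark, no parentheses
?median
help("median")               # the same request written as a function call

# Search the installed help pages when you do not know the function name
help.search("linear model")
??"linear model"             # shorthand for help.search()

# Run the examples listed at the bottom of a help page
example(median)
```

When the exact function name is unknown, the keyword search is usually the better starting point.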
19. Suggestions for an R workflow Leave yourself informative comments # everything to the right of the pound sign # is unevaluated Using spaces and indents can help readability Use meaningful names for objects Save the bits of your code that work in a text editor - building a script of clean code that works from start-to-finish. With clean code instead of transformed data files it is easier to redo analyses if your data is updated or you want to change an earlier step Reproducible research!
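A minimal sketch of what a short, clean script following these suggestions might look like. It uses the airquality data set that ships with R so that it runs from start to finish; the object names are illustrative choices, not part of the workshop script.

```r
# Goal: summarize ozone by month in the built-in airquality data
# (everything to the right of a pound sign is a comment and is not evaluated)

# Keep only the rows with a recorded ozone value
ozone_complete <- airquality[!is.na(airquality$Ozone), ]

# Mean ozone for each month in the data (months are coded 5 through 9)
mean_ozone_by_month <- tapply(ozone_complete$Ozone,
                              ozone_complete$Month,
                              mean)

# Print the result rounded to one decimal place
round(mean_ozone_by_month, 1)
```

Because every step is in the script, rerunning it after the data change reproduces the whole analysis rather than a saved intermediate file.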
20. Intro to R: recap Objectives After this workshop participants will be able to: Describe some capabilities and uses of R Search for help within R and use good coding practices for reproducible research in R ?t.test will bring up R help Free manuals online: Introduction to R Also: statmethods.net #use comments; save the code that works to reproduce your results Read in, summarize, and transform a simple dataset in R Create and save reproducible graphs from R (scatterplots, histograms, boxplots, bar charts)
21. Learning the language Many important features Arithmetic and logical operators: +, <, … Data types: numeric, logical, … Data structures: vectors, matrices, … Functions – always end with (): median(x)
22. Using R as a calculator Mathematical operators + - / * ^ log() abs()
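A few calculator-style lines using the operators listed on this slide; the numbers are arbitrary. Each line can be typed at the console and the result is printed immediately.

```r
2 + 3 * 4            # multiplication happens before addition: 14
(2 + 3) * 4          # parentheses change the order: 20
7 / 2                # division: 3.5
2^10                 # exponentiation: 1024
log(100)             # natural logarithm by default
log(100, base = 10)  # logarithm with a chosen base: 2
abs(-3.5)            # absolute value: 3.5
```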
23. R can evaluate logical expressions == equal != not equal & and | or (vertical pipe) 10 < 20 [1] TRUE pi > 3 & 2^2 == 4 [1] TRUE "This" != "That" [1] TRUE
24. Creating new objects Assignment operator is <- (looks like an arrow) x <- 10 “Set x to take the value 10” The symbols in this operator must be adjacent. x < - 10 What does this do? You can overwrite old values x <- x^2 “Set x to take the value x^2”
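The slide's question about the stray space can be checked directly. In this sketch the name is_less is an illustrative addition; everything else follows the slide.

```r
x <- 10               # assignment: x now holds 10

# With a space between < and -, R reads a comparison: x < (-10)
is_less <- x < - 10
is_less               # FALSE, because 10 is not less than -10
x                     # still 10: nothing was assigned

x <- x^2              # overwrite x with its square
x                     # 100
```

No error is raised in the spaced version, which is why this typo is easy to miss.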
25. R operations are vectorized A vector is an ordered set of data of the same type (numeric, logical, dates, etc.) Concatenate function is c() x <- c(1, 2, 3) x^2 [1] 1 4 9 For integer sequences we can make vectors quickly with n:m 4:9 [1] 4 5 6 7 8 9 Vector recycling c(0,10) * c(1, 1, 1, 1) [1] 0 10 0 10
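A short sketch extending the recycling example: when the longer vector's length is not a multiple of the shorter one's, R still recycles but issues a warning. The values are arbitrary.

```r
x <- c(1, 2, 3)

x + 10                # the single value 10 is recycled: 11 12 13
x * c(2, 4, 6)        # element-by-element product: 2 8 18
sum(x^2)              # vectorized square, then a sum: 14

# Lengths 2 and 3 do not divide evenly: the result is 0 10 0,
# and R warns that the longer length is not a multiple of the shorter
c(0, 10) * c(1, 1, 1)
```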
26. Other object classes Matrices: lots of matrix operations you would always use the lm() function to fit a linear model but you could do it manually beta <- solve(t(X)%*%X)%*%t(X)%*%y Arrays: n-dimensional Lists: sets of objects (can be different classes)
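The matrix formula on this slide can be compared with lm() on simulated data. This is a sketch under assumptions: the slide does not define X or y, so the sample size, seed, and true coefficients below are made up for illustration.

```r
set.seed(1)                      # make the simulated data reproducible
n  <- 50                         # hypothetical sample size
x1 <- rnorm(n)                   # one simulated predictor
y  <- 2 + 3 * x1 + rnorm(n)      # assumed true intercept 2 and slope 3

# Design matrix: a column of ones for the intercept, then the predictor
X <- cbind(1, x1)

# Least-squares estimate computed manually, as written on the slide
beta <- solve(t(X) %*% X) %*% t(X) %*% y

# The same model fitted with lm()
fit <- lm(y ~ x1)

# The two sets of estimates agree up to numerical tolerance
all.equal(as.vector(beta), unname(coef(fit)))
```

In practice lm() is the better choice: it also returns standard errors and diagnostics, and it avoids explicitly inverting a matrix.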
27. Indexing and subsetting Refer to components of objects by a position index which goes between square brackets x <- c(10, 20, 30) x [1] 10 20 30 x[2] return the second position in x [1] 20 x[c(1, 2)] return the first and second position in x [1] 10 20 x[-3] return all except the third position in x [1] 10 20 What would x[c(3, 2)] return?
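The slide's closing question can be answered by running it. The sketch below also shows indexing with a logical condition, which is an addition beyond the slide; the names reordered and above_15 are illustrative.

```r
x <- c(10, 20, 30)

# Positions are returned in the order they are requested
reordered <- x[c(3, 2)]
reordered             # 30 20

# A logical condition also works as an index
above_15 <- x[x > 15]
above_15              # 20 30

# Indexing on the left of <- replaces a single element
x[2] <- 99
x                     # 10 99 30
```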
28. Data frames A data frame is a rectangular collection of data Rows: observations Columns: variables diamonds <- data.frame(carat, cut, price) carat cut price 1 0.23 Ideal 326 2 0.21 Premium 326 3 0.23 Good 327 4 0.29 Premium 334 5 0.31 Good 335 6 0.24 Very Good 336
29. Data frames You can extract the variables as vectors with a $ diamonds$cut You can also index by position (or name) with square brackets, in the order [row, column] diamonds[2, 3] returns the single value in row 2, column 3 An empty index is treated like a wildcard and corresponds to all rows or columns depending on position diamonds[, "cut"] (same result as diamonds$cut) How would you return the first three rows and all columns?
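A self-contained sketch of the data frame slides. The three vectors are typed in from the six rows printed on the earlier slide, since the slides assume carat, cut, and price already exist.

```r
# The six rows shown on the slide, entered as three vectors
carat <- c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24)
cut   <- c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good")
price <- c(326, 326, 327, 334, 335, 336)

diamonds <- data.frame(carat, cut, price)

diamonds$cut          # one variable, extracted as a vector
diamonds[2, 3]        # row 2, column 3: the price 326
diamonds[, "cut"]     # empty row index: all rows of the cut column
diamonds[1:3, ]       # empty column index: first three rows, all columns

str(diamonds)         # compact overview of the columns and their types
```

The last indexing line is one answer to the slide's question: give the row positions and leave the column index empty.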
30. R functions Thousands of functions are built-in: median() lm() linear model t.test() chisq.test() or make your own: inch.to.cm <- function(x){x * 2.54} inch.to.cm(74) [1] 187.96
31. Missing values These take a value of NA Can be in a data object of any type (logical, numeric, character) By default operations on NA will return NA NA == NA [1] NA Can check for NA with is.na() y <- c(2, 10, NA, 12) is.na(y) [1] FALSE FALSE TRUE FALSE Can often pass na.rm = T option to remove NA values in operations mean(y) [1] NA mean(y, na.rm = T) [1] 8
32. R has several thousand additional packages time series survival spatial machine learning bioinformatics Interfaces to LaTeX, Excel, SQL databases, Twitter, google maps…
33. Installing a package Open up R Click in to the console window and type: install.packages() Select a mirror (anywhere in the US) Find and select "ggplot2" and choose OK. This will download ggplot2 and the other packages which it requires.
38. Recap on GUIs R Default Windows GUI: lacks additional features to make learning or programming easier Rcmdr: Drop-down menus for basic stats. Uses base graphics. Doesn't make programming easier. JGR: Makes programming easier with syntax highlighting and command argument suggestions. No menus for stats. Looks the same across platforms (Java based) Deducer: Adds menus for basic stats to JGR. Fewer menu driven stats than RCMDR but more flexible graphics options (building with ggplot2).
41. ggplot2 philosophy Written by Hadley Wickham (Rice Univ.) Extends The Grammar of Graphics (Wilkinson, 2005) All graphs can be constructed by combining specifications with data (Wilkinson, 2005). A specification is a structured way to describe how to build the graph from geometric objects (points, lines, etc.) projected on to scales (x, y, color, size, etc.)
42. ggplot2 philosophy When you can describe the content of the graph with the grammar, you don’t need to know the name of a particular type of plot… Dot plot, forest plot, Manhattan plot are just special cases of this formal grammar. …a plotting system with good defaults for a large set of components that can be combined in flexible and creative ways…
43. Building a plot in ggplot2 data to visualize (a data frame) map variables to aesthetic attributes geometric objects – what you see (points, bars, etc) scales map values from data to aesthetic space faceting subsets the data to show multiple plots statistical transformations – summarize data coordinate systems put data on plane of graphic Wickham 2009
44. A basic ggplot2 graph ggplot(airquality) + geom_point(aes(x = Temp, y = Ozone)) Aesthetics map variables to scales Data Geometric objects to display
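Because one workshop objective is to save reproducible graphs, here is a sketch that stores the slide's plot in an object and writes it to a file with ggsave(). It assumes ggplot2 is already installed; the object name, file name, temporary folder, and dimensions are arbitrary choices for illustration.

```r
library(ggplot2)

# The plot from the slide, stored in an object instead of printed directly
ozone_plot <- ggplot(airquality) +
  geom_point(aes(x = Temp, y = Ozone))

# Save to a PNG file; width and height are in inches by default
# (rows with a missing Ozone value are dropped with a warning)
out_file <- file.path(tempdir(), "ozone_vs_temp.png")
ggsave(out_file, plot = ozone_plot, width = 6, height = 4)
```

Keeping the ggsave() call in the script means the figure can be regenerated exactly whenever the data or the plot code change.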
45. Help with learning ggplot2 Website: had.co.nz/ggplot2/ Book: ggplot2: Elegant Graphics for Data Analysis Hadley Wickham, 2009
46. Intro to R: recap Objectives After this workshop participants will be able to: Describe some capabilities and uses of R Search for help within R and use good coding practices for reproducible research in R Read in, summarize, and transform a simple dataset in R With your neighbor, work through a script file analyzing data from the WHO - Global School Health Survey. Create and save reproducible graphs from R (scatterplots, histograms, boxplots, bar charts) The script contains code which you can run, modify, and take home!
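For readers without the workshop script, the read, summarize, and transform steps can be sketched with base R alone. This is not the workshop script: it writes the built-in airquality data to a temporary CSV file only so that there is a file to read back, and the column name TempC and the 80-degree cutoff are illustrative.

```r
# Create a small CSV file to stand in for a data file on disk
csv_path <- file.path(tempdir(), "airquality.csv")
write.csv(airquality, csv_path, row.names = FALSE)

# Read in
air <- read.csv(csv_path)

# Summarize
str(air)              # column names and types
summary(air$Ozone)    # five-number summary, mean, and count of NAs

# Transform: add a Celsius column (Temp is recorded in Fahrenheit)
air$TempC <- (air$Temp - 32) * 5 / 9

# Subset: keep only the days warmer than 80 degrees Fahrenheit
warm_days <- subset(air, Temp > 80)

head(air)             # first six rows, including the new column
```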
47. Setting up our R session in 5 steps then a quick break (laptop users only, file is already on desktops) 1. Download and save the script file intro_to_r_script.txt available from: library.cumc.columbia.edu/rworkshop 2. Open R 3. Open script 4. Switch to Files of type: "All files (*.*)" intro_to_r_script.txt is on Desktop or wherever you saved it Success! 5. Help your neighbor, ask for help (if stuck), take a quick break
48. A few helpful R links Download R: http://cran.r-project.org/ available for Windows, Mac OS X, and Linux Advice – A clearly stated question with a reproducible example is far more likely to get help. You will often find your own solution by restating where you are getting stuck in a clear and concise way. Reproducible examples: https://gist.github.com/270442 General R links http://statmethods.net/ Quick-R for SAS/SPSS/Stata Users - An all around excellent reference site http://www.ats.ucla.edu/stat/R/ Resources for learning R from UCLA with lots of examples http://www.r-bloggers.com/learning-r-for-researchers-in-psychology/ This is a nice listing of R resources http://stackoverflow.com/questions/tagged/r Q&A forum for R programming questions - lots of good help! see also: http://crossvalidated.com for general stats & R ggplot2 links http://had.co.nz/ggplot2/ ggplot2 help & reference – lots of examples http://groups.google.com/group/ggplot2 ggplot2 user group – great for posting questions https://github.com/hadley/ggplot2/wiki ggplot2 wiki: answers many FAQs, tips & tricks http://www.slideshare.net/hadley/presentations Over 100 presentations by Hadley Wickham, author of ggplot2. A four-part video of a ½ day workshop by him starts here: http://had.blip.tv/file/3362248/ Setting up JGR in Windows JGR requires a JDK – speak to your IT person if this seems daunting (http://www.oracle.com/technetwork/java/javase/downloads/index.html) On Windows, JGR needs to be started from a launcher.
For R version 2.12.0 on Windows with a 32bit R you will likely want to get the file jgr-1_62.exe as a launcher from here: http://www.rforge.net/JGR/ A discussion of the features of JGR can be found in this article (starting on page 9): http://stat-computing.org/newsletter/issues/scgn-16-2.pdf Deducer - an R package which works best in a working instance of JGR http://www.deducer.org/pmwiki/pmwiki.php?n=Main.DeducerManual There are great videos linked here introducing the Deducer package (although the volume is quite low) This slide last updated 12/4/2010
Editor's Notes
Therefore, when we interpret the regression coefficient in the single models for MEHP, MECPP, and MEHHP, we should be cautious that the regression coefficient cannot be interpreted as the effect of a single phthalate only. It may represent the overall effect due to a set of highly correlated phthalates. The Bayesian models estimate and test the significance of the association due to each separate phthalate, while the single models in this setting informally show that, overall, these several phthalate metabolites have a strong negative association with gestational age. Like a group test with measurement error.
Daily NYC air quality measures from 1973 (Chambers 1983)