Basic introduction into R

Three half-day lectures and practicals on introductory facets of R, including: installing R and RStudio, reading in and writing out data, data cleaning, descriptive statistics, and data visualization (including visual analysis). Courtesy of the European Historical Population Samples Network and the Babeş-Bolyai University (Cluj-Napoca, Romania).

Published in: Education

  1. Introduction into R, Part 1A. Richard L. Zijdeman, 2016-06-15.
  2. Outline: 1 Quantitative research methods; 2 Data analysis workflow; 3 Statistical software; 4 Installing R and RStudio; 5 Getting help
  3. Quantitative research methods
  4. Why: to answer descriptive and explanatory questions on populations
  5. Workflow (PTE): problem (research question); theory (hypothesis); empirical test; with loops between T-E and P-T-E
  6. Research questions: descriptive (to what extent ...); comparative (comparing two entities); trend (comparison over time); explanatory (focus on the mechanism at hand)
  7. Theory: deductive reasoning; explanans: general mechanism, condition; explanandum (hypothesis)
  8. Empirical test: sample vs. population; random vs. stratified samples; testing technique, e.g. t-test, correlation, regression; software required for faster analysis
  9. Data analysis workflow
  10. Empirical testing has its own workflow (figure: Grolemund & Wickham, 2016, Creative Commons Attribution-NonCommercial-NoDerivs 4.0)
  11. Statistical software
  12. The dangers of analysing with spreadsheets (e.g. MS Excel): tempting to input, clean and analyse data in the same sheet; difficult to track cleaning rules; defaults mess up your data (e.g. 01200 -> 1200)
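The leading-zero problem mentioned for spreadsheets can also occur when importing data into R, because column types are guessed by default. A minimal sketch, assuming a made-up two-row CSV (the column name `municipality_code` and its values are invented for illustration):

```r
# Hypothetical CSV content; 'municipality_code' is an invented example column.
csv_text <- "municipality_code,name\n01200,Alpha\n00345,Beta\n"

# Default import: the code column is guessed to be numeric, so leading zeros vanish.
guessed <- read.csv(text = csv_text)

# Explicit import: force the code column to be read as text.
as_text <- read.csv(text = csv_text,
                    colClasses = c(municipality_code = "character"))

print(guessed$municipality_code)
print(as_text$municipality_code)
```

Because the rule (`colClasses`) lives in the script, the cleaning decision stays visible and repeatable.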
  13. Why use syntax (scripting): efficiency (really); quality (error checking); replicability; communication
  14. R is open source, which is good and bad. Good: anybody can contribute (check, improve, create code); free of charge. But: R depends on collective action; you cannot ‘demand’ support; sprawl of packages
  15. RStudio: a browser for R that provides easy access to scripts, data, plots and the manual
  16. Installing R and RStudio
  17. Download R: instructions via http://www.r-project.org; choose a CRAN mirror (http://cran.r-project.org/mirrors.html) that is close, but active too! Romania does not have one (yet!); click on ‘Download R for Windows’; follow the usual installation procedure; double click on R; you should now have a working session! Close the session, do not save the workspace image
  18. Packages and libraries: base R (core product); additional packages; CRAN repository: spread through ‘mirrors’, choose a local but active mirror; GitHub: packages not on CRAN, development versions of CRAN libraries
  19. RStudio: found on http://www.rstudio.com; download the version for your OS (e.g. Windows) from http://www.rstudio.com/products/rstudio/download/; install by double clicking on the downloaded file; start RStudio by double clicking on the icon; you do not need to start R before starting RStudio
  20. Getting help
  21. Built-in help: “?”: ?[function] / ?[package], e.g. “?plot” or “?graphics”; check the index for user guides and vignettes
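The built-in help can be sketched in a few lines. This is only an illustration of base R functions that ship with every installation; the search pattern passed to `apropos()` is an arbitrary example.

```r
# Equivalent ways to open a help page (a viewer opens in interactive sessions).
?plot
help("plot")

# Help index for a whole package, including links to user guides.
help(package = "graphics")

# Find functions by (part of) their name when you do not know the exact spelling.
matches <- apropos("^read\\.csv")
print(matches)

# Full-text search of installed help pages, and a list of available vignettes:
# help.search("boxplot")
# vignette()
```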
  22. CRAN website: manuals; R FAQ; R Journal
  23. Online communities: Stack Overflow (instance of Stack Exchange, reputation-based Q&A); specific lists for packages, e.g. ggplot2, R-sig-mixed-models
  24. Asking a question, getting an answer: search the web, others must have had this problem too. If you raise a question: be polite; be concise; give a short background; include a reproducible example; describe your efforts so far
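One possible way to build the reproducible example mentioned above is to share a small piece of data as R code with `dput()`. The sketch below uses the built-in `mtcars` data as a stand-in for your own data.

```r
# Take a small slice of a built-in dataset so that others can run the example as-is.
small <- head(mtcars[, c("mpg", "hp")], 3)

# dput() prints R code that recreates the object; paste that output into your question.
dput(small)

# Round trip: evaluate the printed code again and compare with the original.
rebuilt <- eval(parse(text = capture.output(dput(small))))
print(all.equal(small, rebuilt))

# Report your R version and loaded packages alongside the question.
sessionInfo()
```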
  25. Introduction into R, Part 1B. Richard L. Zijdeman, 2016-06-15.
  26. Outline: 1 Introducing RStudio and R; 2 Introducing base R; 3 Data visualization using ggplot2
  27. Introducing RStudio and R
  28. RStudio: sort of a ‘viewer’ on R that helps to organize input and output: editor (upper left); console (lower left); environment (upper right); output (lower right)
  29. R script: a series of commands to manipulate data; always save your script, NEVER change your data; original data + script = reproducible research
  30. Packages: build your R system using packages; ‘base R’ is basic, add packages for your specific needs; packages are found on servers called ‘mirrors’; make sure to select a mirror first (https://cran.r-project.org/mirrors.html)
      ## To set the mirror, type (this holds for the current session;
      ## put the line in your .Rprofile to make it permanent):
      options(repos = structure(c(CRAN = "http://cran.xl-mirror.nl")))
      ## replace http://... with your favorite mirror
  31. Packages for book (see 1.4.2)
      pkgs <- c(
        "broom", "dplyr", "ggplot2", "jpeg", "jsonlite",
        "knitr", "Lahman", "microbenchmark", "png", "pryr",
        "purrr", "rcorpora", "readr", "stringr", "tibble", "tidyr"
      )
      install.packages(pkgs)
  32. R session: contains scripts, data, functions; can be saved as a ‘workspace image’; prefer not to: sessions are usually cluttered, and saving is only useful if running the script takes time. Suggested tweak in Options: uncheck “Restore .RData into workspace at startup”; for “Save workspace to .RData on exit”, select ‘Never’
  33. Introducing base R
  34. base R: assignment and print(): ‘attach’ values to an object (e.g. a variable)
      x <- 5
      y <- 4
      z <- x * y
      print(z)
      ## [1] 20
  35. base R: assignment and print() (II): try and imagine the potential of assignment
      x <- c(4, 3, 2, 1, 0, 27, 34, 35) # c for concatenate values
      y <- -1
      z <- x * y
      print(z)
      ## [1]  -4  -3  -2  -1   0 -27 -34 -35
  36. base R: data.frame: basically a table; contains columns (variables); contains rows (cases); “flat table” in Kees’ terminology
      my.df <- data.frame(x, z)
      str(my.df) # show STRucture
      ## 'data.frame': 8 obs. of 2 variables:
      ##  $ x: num 4 3 2 1 0 27 34 35
      ##  $ z: num -4 -3 -2 -1 0 -27 -34 -35
      There’s much more, but let’s keep that for tomorrow
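As a small follow-up to the data.frame slide, here is a sketch of a few common ways to look inside such a table. It rebuilds `x`, `z` and `my.df` from the slides; the threshold of 20 is an arbitrary choice for illustration.

```r
# Rebuild the two vectors and the data.frame from the slides.
x <- c(4, 3, 2, 1, 0, 27, 34, 35)
z <- x * -1
my.df <- data.frame(x, z)

# Dimensions: number of rows (cases) and columns (variables).
print(dim(my.df))

# One column, selected by name with $.
print(my.df$z)

# Rows that satisfy a condition, keeping all columns.
large <- my.df[my.df$x > 20, ]
print(large)

# Per-column summary statistics.
summary(my.df)
```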
  37. Data visualization using ggplot2
  38. Visualizing your data: not just for analyses! Also for data quality: representativeness, missing data
  39. plot() in base R
      library(help = "datasets") # all datasets in R
      ?mtcars # show help on mtcars dataset
      df <- mtcars # mtcars is a data.frame, not a function
      str(mtcars) # display STRucture of an object
      plot(mtcars$hp, mtcars$mpg)
      plot(df)
  40. plot() is like LaTeX: forge it in any way you want; heterogeneous approach though; takes quite some time to get it right
  41. ggplot() as alternative: ggplot is but one of many graph packages; ggplot is nice because of: a similar approach to various types of graphs; easy build-up for basic graphs; can get quite complex too (but cannot do it all)
  42. ggplot() and the canvas metaphor: ggplot() consists of two elements: a canvas; (multiple) layers of paint
  43. Mapping and geom layers: ggplot() consists of two elements: canvas: data, mapping (aesthetic); (multiple) layers of paint: geom layers
      ggplot(data = <DATASET>,
             mapping = aes(x = <X-VAR>, y = <Y-VAR>)) +
        geom_<TYPE>
  44. Our first ggplot
      install.packages("ggplot2")
      library(ggplot2)
      df <- mtcars
      ggplot(data = df, aes(x = hp, y = mpg)) +
        geom_point()
  45. geom_ features
      ?geom_point
      install.packages("ggplot2")
      library(ggplot2)
      df <- mtcars
      ggplot(data = df, aes(x = hp, y = mpg)) +
        geom_point(fill = "white", colour = "blue", shape = 21, size = 4)
  46. Adding characteristics to your plot: add variables to explain a pattern
      ggplot(data = df, aes(x = hp, y = mpg)) +
        geom_point(aes(colour = wt), size = 4)
      NB: notice the difference?
      ggplot(data = df, aes(x = hp, y = mpg)) +
        geom_point(aes(colour = wt, size = 4))
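The difference between the two calls is whether `size` is set as a fixed property or mapped as if it were data. The sketch below makes that visible by inspecting the layer objects. It assumes that ggplot2 layers expose `aes_params` and `mapping`, which are internals and may change between versions.

```r
library(ggplot2)

# Size set OUTSIDE aes(): a fixed property of the layer, no legend entry.
p_set <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(aes(colour = wt), size = 4)

# Size placed INSIDE aes(): treated as data (a constant 4), so it is mapped
# through the size scale and gets its own legend entry.
p_mapped <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(aes(colour = wt, size = 4))

# Inspect where 'size' ended up in each layer (layer internals, see note above).
print(names(p_set$layers[[1]]$aes_params))   # fixed parameters
print(names(p_set$layers[[1]]$mapping))      # mapped aesthetics
print(names(p_mapped$layers[[1]]$mapping))
```

As a rule of thumb: variables go inside `aes()`, constants go outside.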
  47. Multiple geoms: add variables to explain a pattern
      ggplot(data = df, aes(x = hp, y = mpg)) +
        geom_point(aes(colour = as.factor(am)), size = 6) + # increase size because of overlap
        geom_point(aes(shape = as.factor(vs)), size = 3)
      # vs: whether the engine is V-shaped (0) or straight (European) (1)
  48. Adding facets: facets help reduce complexity
      ggplot(data = df, aes(x = hp, y = mpg)) +
        geom_point(aes(colour = as.factor(am)), size = 4) +
        facet_wrap(~ vs)
  49. Things to consider with geom(_point): fill only works where the shape can actually be filled; consider the order of geoms; mind overlap: decrease size, use alpha, use ‘open’ shapes, use geom_jitter
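Two of the remedies for overlap listed above can be sketched as follows. The choice of `cyl` on the x axis and the values for `alpha`, `size` and `width` are arbitrary; they were picked only because `cyl` has few distinct values and therefore many overlapping points.

```r
library(ggplot2)

# Semi-transparent, open circles: overlapping points stay visible.
p_alpha <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point(shape = 1, size = 3, alpha = 0.6)

# Jitter: add a little random horizontal noise so identical x values spread out.
p_jitter <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_jitter(width = 0.2, height = 0, size = 3, alpha = 0.6)

print(p_alpha)
print(p_jitter)
```

Setting `height = 0` keeps the y values exact, so only the horizontal position is perturbed.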
  50. ggplot and titles: various ways to add titles to axes and stuff; can get quite complex; here are the basics
      ggplot(data = df, aes(x = hp, y = mpg)) +
        geom_point() +
        labs(title = "Nice graph",
             x = "Horse Power",
             y = "Miles per Gallon")
  51. Themes and size
      ggplot(data = df, aes(x = hp, y = mpg)) +
        geom_point() +
        labs(title = "Nice graph",
             x = "Horse Power",
             y = "Miles per Gallon") +
        theme_bw(base_size = 16)
  52. Much more to learn, not just about ggplot() (axes; legends (guides); geoms) but also about dataviz in general: general do’s and don’ts; which problem fits which graph; it’s a science! (Graph theory)
  53. Introduction into R, Part 2A, 2B. Richard L. Zijdeman, 2016-06-16.
  54. Outline: 1 Data wrangling; 2 bit about NA
  55. Data wrangling
  56. Figure: Grolemund & Wickham, 2016, Creative Commons Attribution-NonCommercial-NoDerivs 4.0.
  57. dplyr package
      # install.packages("dplyr") # 1 time only
      library(dplyr)
      install.packages("nycflights13")
      library(nycflights13)
      print(flights)
  58. tibble or data_frame vs data.frame
      str(mtcars)
      class(mtcars)
      mtcars_tbl <- as_data_frame(mtcars) # as_tibble() in current versions of tibble
      str(mtcars_tbl) # inspect the converted object, not the original
      class(mtcars_tbl)
  59. filter
      filter(mtcars, am == 1, vs == 0)
      some.cars <- filter(mtcars, am == 1, vs == 0)
      some.cars
      (some.cars2 <- filter(mtcars, am == 1, vs == 0)) # outer parentheses: assign and print
  60. filter and using or
      filter(mtcars, gear == 3 | gear == 4)
      # !! not like this:
      filter(mtcars, gear == 3 | 4)
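Why the second form goes wrong can be checked by counting rows. The sketch below also shows `%in%` as a shorter alternative, which is not on the slide but is a standard base R operator.

```r
library(dplyr)

# Intended: cars with 3 or 4 forward gears.
explicit <- filter(mtcars, gear == 3 | gear == 4)

# Equivalent and shorter when there are several values to match.
with_in <- filter(mtcars, gear %in% c(3, 4))

# The pitfall: 'gear == 3 | 4' is read as '(gear == 3) | 4'. The number 4 is
# coerced to TRUE, so the condition holds for every row.
pitfall <- filter(mtcars, gear == 3 | 4)

print(nrow(explicit))
print(nrow(with_in))
print(nrow(pitfall))
print(nrow(mtcars))
```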
  61. bit about NA
  62. Arrange
      arrange(flights, dep_time)
      arrange(flights, year, month, day) # ascending order
      arrange(flights, desc(day))
      # NB: missing values come at end
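A few basics on how R treats `NA` may help here. This is a base R sketch on a made-up three-element vector; `sort()` with `na.last = TRUE` is used as a base R analogue of the way `arrange()` places missing values at the end.

```r
x <- c(3, NA, 1)

# Comparisons with NA yield NA, so test for missingness with is.na().
print(x == NA)
print(is.na(x))

# Most summary functions return NA unless missing values are removed explicitly.
print(mean(x))
print(mean(x, na.rm = TRUE))

# sort() drops NA by default; na.last = TRUE keeps it at the end,
# in ascending as well as descending order.
ascending <- sort(x, na.last = TRUE)
descending <- sort(x, decreasing = TRUE, na.last = TRUE)
print(ascending)
print(descending)
```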
  63. Select
      df <- select(flights, year, month, day)
      names(flights)
      df <- select(flights, tailnum:dest)
      df <- select(flights, -(tailnum:dest))
      df
      df <- select(flights, starts_with("arr_"))
      df <- select(flights, ends_with("e"))
      df <- select(flights, contains("a"))
  64. rename
      df <- rename(flights, Y_ear = year)
      df <- mutate(flights, year1 = year + 1)
      select(df, year, year1)
      df <- mutate(flights, year1 = year + 1, year2 = year1 + 1)
      select(df, contains("year"))
      df <- transmute(flights, year1 = year + 1, year2 = year1 + 1)
      # only maintains the newly created variables
  65. group_by
      by_day <- group_by(flights, year, month, day)
      summarise(by_day)
      cars <- mtcars
      cars <- as_data_frame(mtcars)
      summarise(cars, mean_hp = mean(hp, na.rm = TRUE))
      mean(cars$hp, na.rm = TRUE)
  66. the pipe: %>%
      cars_grp <- group_by(cars, carb)
      class(cars)
      class(cars_grp)
      summarise(cars_grp, mmpg = mean(mpg, na.rm = TRUE))
      cars_grp_sum <- summarise(cars_grp,
                                mmpg = mean(mpg, na.rm = TRUE),
                                count = n())
      cars_grp_sum
      plot <- ggplot(cars_grp_sum, aes(x = carb, y = mmpg, label = carb)) +
        geom_point(aes(size = count)) +
        geom_text(colour = "cyan")
      plot
  67. more pipe, adding a filter
      cars_grp_sum3 <- cars %>%
        group_by(carb) %>%
        summarise(mmpg = mean(mpg, na.rm = TRUE), count = n()) %>%
        filter(count > 3)
      ggplot(cars_grp_sum3, aes(x = carb, y = mmpg, label = carb)) + # closing ")" and "+" were missing
        geom_point(aes(size = count)) +
        geom_text(colour = "cyan") +
        labs(title = "figure with %>% and count > 3")
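To see what the pipe does, it can help to write the same steps once with intermediate objects and once as a pipeline. The sketch uses `mtcars` directly; the names `grouped`, `summed`, `stepwise` and `piped` are introduced here only for illustration.

```r
library(dplyr)

# Step by step, storing every intermediate result.
grouped <- group_by(mtcars, carb)
summed <- summarise(grouped, mmpg = mean(mpg, na.rm = TRUE), count = n())
stepwise <- filter(summed, count > 3)

# The same three steps as one pipeline: x %>% f(y) is evaluated as f(x, y).
piped <- mtcars %>%
  group_by(carb) %>%
  summarise(mmpg = mean(mpg, na.rm = TRUE), count = n()) %>%
  filter(count > 3)

print(piped)
print(all.equal(stepwise, piped))
```

The pipeline reads top to bottom in the order the operations happen, and leaves no intermediate objects behind in the environment.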
  68. Introduction into R, Part 3A. Richard L. Zijdeman, 2016-06-17.
  69. Outline: 1 Session management; 2 Basic data manipulation
  70. Session management
  71. Maintaining your workspace (figure: Grolemund & Wickham, 2016, Creative Commons Attribution-NonCommercial-NoDerivs 4.0)
  72. Setting up a session: clear your Environment; check sessionInfo() for loaded packages; detach obsolete packages under ‘other attached packages’; set your directory (“\\” on Windows and “/” for Linux/Mac); load libraries (install new ones); load your data
  73. Example session setup
      rm(list = ls())
      sessionInfo() # check for other attached packages
      detach("package:nycflights13", unload = TRUE)
      setwd("/Users/RichardZ/Dropbox/Summer school 2016/Richard Zijdeman/")
      getwd() # to see whether you're in the right directory
      dir() # shows what's in your directory
  74. Loading your data: read.table() (generic function); read.csv(); library(foreign) for e.g. SPSS and Stata; library(readxl), a fast Excel package
  75. Reading in data: different functions for different files. Base R: read.table() (read.csv()); foreign package: read.spss(), read.dta(), read.dbf(); readxl alternatives: xlsx package (Java required), gdata (Perl-based), openxlsx package: read.xlsx()
  76. read.csv(): file: your file, including directory; header: variable names or not?; sep: separator (read.csv default: “,”; read.csv2 default: “;”); skip: number of rows to skip; nrows: total number of rows to read; stringsAsFactors; encoding (e.g. “latin1” or “UTF-8”)
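The arguments listed above can be tried out without a file on disk by passing the content through the `text` argument. The content below (names, ages, note lines) is entirely made up for illustration.

```r
# Hypothetical file content with two note lines above the header row.
csv_text <- paste(
  "note: made-up example data",
  "note: second line to skip",
  "name,age",
  "Anna,31",
  "Bert,45",
  "Cato,28",
  sep = "\n"
)

people <- read.csv(text = csv_text,
                   header = TRUE,            # first line after 'skip' holds variable names
                   sep = ",",                # the read.csv default, spelled out
                   skip = 2,                 # ignore the two note lines
                   nrows = 2,                # read only the first two data rows
                   stringsAsFactors = FALSE) # keep text as character
str(people)

# Semicolon-separated data with decimal commas: the read.csv2 defaults.
people2 <- read.csv2(text = "name;height\nAnna;1,72\n")
str(people2)
```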
  77. read_excel() from the readxl package: path: your file, including directory; sheet: name or number of sheet; col_names: column names in 1st row?; col_types: specify type; na: what’s the sign for missing values; skip: how many rows to skip before data starts
  78. Example session loading your csv data
      # setwd() to set your working directory
      hmar100 <- read.csv("./Datafiles_HSN/HSN_marriages.csv",
                          stringsAsFactors = FALSE,
                          encoding = "latin1",
                          header = TRUE,
                          nrows = 100) # just first 100 rows
  79. Example session loading your excel data
      # setwd() to set your working directory
      install.packages("readxl")
      library("readxl")
      hmar <- read_excel("./Datafiles_HSN/HSN_marriages_awful.xls
                         col_names = TRUE,
                         skip = 3) # empty lines not counted!!!
  80. Basic data manipulation
  81. Change case of text: tolower(), toupper()
      tolower("CaN we pleASe jUSt have LOWER cases?")
      names(hmar) <- tolower(names(hmar))
  82. length(): used to count how many instances there are
      length(names(hmar)) # shows number of variables in hmar
  83. Introduction into R, Part 3B. Richard L. Zijdeman, 2016-06-17.
  84. Outline: 1 Basic statistical techniques
  85. Basic statistical techniques
  86. Box and whisker plot: shows the distribution of data; median: 50% of the cases above and below; box: 1st and 3rd quartile; interquartile range (IQR): Q3 - Q1; outliers (Tukey, 1977): x < Q1 - 1.5*IQR or x > Q3 + 1.5*IQR
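The outlier rule can be computed by hand, which shows what the box plot draws. The sketch uses `mtcars$hp` as stand-in data and `quantile()` with its default settings; note that `boxplot()` itself uses hinges, which can differ slightly from these quartiles.

```r
x <- mtcars$hp

# Quartiles and interquartile range.
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Tukey fences: values beyond them are flagged as outliers.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
outliers <- x[x < lower | x > upper]

print(c(q1 = q1, q3 = q3, iqr = iqr, lower = lower, upper = upper))
print(outliers)
```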
  87. p <- ggplot(hmar, aes(sign_groom, age_groom))
      p + geom_boxplot()
  88. hmar <- mutate(hmar, sign_groomD = (sign_groom == "h" & !(i
      p <- ggplot(hmar, aes(sign_groomD, age_groom))
      p + geom_boxplot()
  89. hmar <- mutate(hmar, sign_groomD = (sign_groom == "h" & !(i
      p <- ggplot(hmar, aes(sign_groomD, age_groom))
      p + geom_boxplot() +
        geom_jitter(shape = 24, width = 0.2)
  90. library(stats)
      var.test(age_groom ~ sign_groomD, data = hmar)
      t.test(age_groom ~ sign_groomD, data = hmar)
      # NB: always check for variances
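One way to act on the variance check is to let its result decide which t-test variant to run. This is a sketch on the built-in `mtcars` data (fuel consumption by transmission type), used instead of the HSN marriage file; the 0.05 threshold is an assumed, conventional choice.

```r
# Stand-in data: mpg by transmission type (am) from the built-in mtcars.
vt <- var.test(mpg ~ am, data = mtcars)
print(vt$p.value)

# Use the pooled-variance t-test only if equal variances are not rejected;
# otherwise keep Welch's test (the t.test() default, var.equal = FALSE).
equal_var <- vt$p.value >= 0.05   # assumed threshold
tt <- t.test(mpg ~ am, data = mtcars, var.equal = equal_var)
print(tt)
```

Welch's test does not assume equal variances, so leaving `var.equal = FALSE` is the cautious choice when in doubt.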
  91. A small PTE project: look at the variables in the HSN files; think of a research question; provide a general mechanism and hypothesis; plot your results