Introduction into R for historians (part 4: data manipulation)

Introduction to R for the European Historical Population Sample summer school, Cluj-Napoca, Romania, 2015. Aimed at an audience of historians with limited quantitative skills.

Published in: Data & Analytics

  1. Data manipulation in R. Richard L. Zijdeman, May 29, 2015. Sections: Recap, Data manipulation, data.table package, Basic statistical techniques.
  2. Outline: 1 Recap; 2 Data manipulation; 3 data.table package; 4 Basic statistical techniques
  3. Recap
  4. What we’ve seen so far
     • functions to read in data: read.csv(), read.xlsx()
     • objects: assignment with <-; characteristics, e.g. str(), summary(), head(), tail()
     • calculations: mean(), min(), max()
     • plotting: plot(); ggplot(), which paints by ‘layer’
  5. Before we go on...
     • Structure your R script: start with Filename, Date, Purpose, Author, Last change
     • Use comments to tell what you are doing: reading in data, changing variables (and why you did it)
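A script header along these lines could look as follows. This is only a minimal sketch: the file name, the placeholder fields and the toy vector `ages` are assumptions for illustration, not part of the course material.

```r
# Filename:    01_clean_ages.R        (hypothetical name)
# Date:        <date created>
# Purpose:     recode impossible ages to missing
# Author:      <your name>
# Last change: <date and short note>

# Read in data (here: a toy vector instead of a file)
ages <- c(24, -2, 31)

# Change variables: a negative age cannot be real, so treat it as missing
ages[ages < 0] <- NA
```

The comment on the last step records why the variable was changed, which is the part that is hardest to reconstruct later.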
  6. Create a working directory, with subdirectories:
        + documents
        + data
          - source
          - derived
        + analysis
        + figures
  7. Set a working directory
     • setwd(), getwd()
     • use relative paths to save things: "./" = current directory, "./../" = one folder up
     • Read J. Scott Long’s “Workflow”
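The directory layout and the relative paths can be tried out safely in a temporary folder. The sketch below assumes a hypothetical project called `my_project` and a toy file `toy.csv`; neither comes from the slides.

```r
# Hypothetical project root inside R's temporary directory
root    <- file.path(tempdir(), "my_project")
subdirs <- c("documents", "data/source", "data/derived", "analysis", "figures")

# Create every subdirectory (recursive = TRUE also creates the parents)
for (d in subdirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}

# Work from the analysis folder; setwd() returns the previous directory
old_wd <- setwd(file.path(root, "analysis"))

# "./../" means one folder up, so this path points at my_project/data/derived
write.csv(data.frame(id = 1:3), "./../data/derived/toy.csv", row.names = FALSE)
toy <- read.csv("./../data/derived/toy.csv")

# Restore the original working directory
setwd(old_wd)
```

Because the paths are relative, the same script keeps working when the whole project folder is moved or shared.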
  8. Data manipulation
  9. Assignment and Indexing
     First, we’ll read in the HSN marriages again:
        hmar <- read.csv("./../data/derived/HSN_marriages.csv",
                         stringsAsFactors = FALSE,
                         encoding = "latin1",
                         header = TRUE,
                         nrows = 10000)
  10. Change case of text: tolower(), toupper()
        tolower("CaN we pleASe jUSt have LOWER cases?")
        ## [1] "can we please just have lower cases?"
        names(hmar) <- tolower(names(hmar))
        names(hmar)
        ## [1] "id_marriage" "idnr" "m_loc" "m_
        ## [5] "sex_hsnrp" "age_groom" "occ_groom" "ci
        ## [9] "sign_groom" "b_loc_groom" "l_loc_groom" "ag
        ## [13] "occ_bride" "civilst_bride" "sign_bride" "b_
        ## [17] "l_loc_bride" "a_f_groom" "occ_f_groom" "si
      (the output is cut off at the edge of the slide)
  11. Indexing
      There were way too many names to print on a slide... How many names are there actually?
  12. Use the length() command to find out:
        length(names(hmar))
        ## [1] 29
      So let’s print just the first two:
        names(hmar)[1:2]
        ## [1] "id_marriage" "idnr"
      The technique using square brackets is called indexing.
  13. Any idea how we would show the last two names?
  14. Store the number of names, then index the last two positions:
        x <- length(names(hmar))   # not length(names): that is the length of the function itself, 1
        names(hmar)[(x-1):x]       # the last two names
        # (with x = 1 this would index 0:1 and return only "id_marriage")
      Using concatenate, c(), we could also extract various names:
        names(hmar)[c(1, 3, 5)]
        ## [1] "id_marriage" "m_loc" "sex_hsnrp"
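The same indexing steps can be checked on a small vector. The vector `nms` below is a hypothetical stand-in for names(hmar), so the values are not the real column names.

```r
# Hypothetical stand-in for names(hmar)
nms <- c("alpha", "beta", "gamma", "delta", "epsilon")

first_two <- nms[1:2]            # positions 1 and 2

x <- length(nms)                 # number of elements, here 5
last_two <- nms[(x - 1):x]       # positions 4 and 5

picked <- nms[c(1, 3, 5)]        # any set of positions via c()

# Pitfall: the length of a function is 1, so this is not the number of names
wrong_x <- length(names)
```

Base R's tail(nms, 2) returns the same two elements as `last_two`, which is a handy cross-check.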
  15. We can also apply indexing to a data.frame:
        hmar[1:2, 1:3]
        ##   id_marriage idnr              m_loc
        ## 1           1 1001 Abcoude-Baambrugge
        ## 2           2 1005              Baarn
        # shows the first 2 rows and first 3 columns
        # so, in general: data.frame[rows, columns]
  16. head() and tail()
      So actually, you should now be able to replace head() and tail() with indexing. How?
  17. head() and tail() with indexing:
        # head()
        hmar[1:6, ]
        # tail()
        y <- nrow(hmar)
        hmar[(y-5):y, ]   # the last 6 rows, like tail(); (y-6):y would give 7 rows
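Whether the indexing really matches head() and tail() can be verified on a toy data.frame. The data.frame `df` is made up for this sketch.

```r
# Hypothetical data.frame with 10 rows
df <- data.frame(id = 1:10, m_year = 1831:1840)

y <- nrow(df)

first_six <- df[1:6, ]           # same rows as head(df)
last_six  <- df[(y - 5):y, ]     # same rows as tail(df)

# all.equal() returns TRUE when the two objects match
same_head <- isTRUE(all.equal(first_six, head(df)))
same_tail <- isTRUE(all.equal(last_six, tail(df)))
```

Note the off-by-one risk: a range from y - 5 to y contains six rows, because both end points are included.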
  18. data.table package
  19. data.table
      • Developed by Matt Dowle
      • Website: https://github.com/Rdatatable/data.table/wiki
      Why data.table?
      • fast subsetting on large files
      • more consistent ‘grammar’
      • less typing
  20. Install and load the package:
        install.packages("data.table")
        library(data.table)
  21. Class: data.table
      For data.table functions to work, we need to convert the data.frame to class data.table:
        is.data.table(hmar)
        ## [1] FALSE
        hmar.dt <- data.table(hmar)
        is.data.table(hmar.dt)
        ## [1] TRUE
        is.data.frame(hmar.dt)
        ## [1] TRUE
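The class change can be seen on a small example. This sketch assumes the data.table package is installed; the data.frame `toy` is hypothetical and stands in for hmar.

```r
library(data.table)

# Hypothetical toy data, not the HSN marriages
toy <- data.frame(id = 1:3, m_year = c(1849, 1851, 1831))

# as.data.table() returns a converted copy; data.table(toy) does the same here
toy.dt <- as.data.table(toy)

before <- is.data.table(toy)       # the original is still a plain data.frame
after  <- is.data.table(toy.dt)    # the copy is a data.table ...
still  <- is.data.frame(toy.dt)    # ... and remains a data.frame as well

classes <- class(toy.dt)           # both classes are listed
```

The package also offers setDT(), which converts an existing data.frame in place instead of making a copy.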
  22. Friends with benefits
      Data.frame and data.table are like ‘friends with benefits’:
        all.equal(hmar, hmar.dt)
        ## [1] "Attributes: < Names: 2 string mismatches >"
        ## [2] "Attributes: < Length mismatch: comparison on first
        ## [3] "Attributes: < Component 1: Modes: character, extern
        ## [4] "Attributes: < Component 1: target is character, cur
        ## [5] "Attributes: < Component 2: Modes: numeric, characte
        ## [6] "Attributes: < Component 2: Lengths: 10000, 2 >"
        ## [7] "Attributes: < Component 2: target is numeric, curre
        # so we have all the benefits of a data.frame
        # ... and additional benefits of data.table
      (the output is cut off at the edge of the slide)
      NB: the next series of commands will only work for data.tables.
  23. Sort with setkey
      Often we want to sort our data. We can do so with setkey(), or with setkeyv(), which takes the column name as a string:
        hmar.dt[1:6, m_year]
        ## [1] 1849 1851 1864 1840 1843 1858
        # note: for the data.frame hmar it would be:
        # hmar[1:6, "m_year"]
        setkeyv(hmar.dt, "m_year")
        hmar.dt[1:6, m_year]
        ## [1] 1831 1831 1833 1833 1834 1834
        identical(hmar.dt, hmar)
        ## [1] FALSE
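The difference between setkey() and setkeyv() is easiest to see side by side. This is a sketch on a hypothetical data.table `toy.dt`, assuming the data.table package is installed.

```r
library(data.table)

# Hypothetical toy data, not the HSN marriages
toy.dt <- data.table(id     = 1:5,
                     m_year = c(1849, 1851, 1831, 1840, 1833))

# setkey() takes the column name unquoted and sorts the table by reference
setkey(toy.dt, m_year)
years_sorted <- toy.dt$m_year
first_key    <- key(toy.dt)

# setkeyv() takes a character vector, which is handy when names are stored in a variable
cols <- c("m_year", "id")
setkeyv(toy.dt, cols)
second_key <- key(toy.dt)
```

Both functions change `toy.dt` itself, so there is no need to assign the result back to a name.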
  24. Multiple keys
      It is also possible to sort on multiple keys:
        setkeyv(hmar.dt, c("id_marriage", "idnr"))
  25. Subsetting
        groom.sig <- hmar.dt[age_groom > 30, ]
        dim(groom.sig)
        ## [1] 2493 29
        groom.sig <- hmar.dt[sign_groom == "h", ]
        dim(groom.sig)
        ## [1] 9590 29
  26. Subsetting on several conditions, and selecting columns:
        groom.sig <- hmar.dt[sign_groom == "h" & age_groom > 30, ]
        dim(groom.sig)
        ## [1] 2358 29
        groom.sig <- hmar.dt[m_year != 1840, list(id_marriage, idnr)]
        dim(groom.sig)
        ## [1] 9985 2
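The subsetting pattern is data.table[rows, columns], with column names used directly inside the brackets. Below is a sketch on hypothetical data, assuming the data.table package is installed; in particular the code "x" for a groom who did not sign is an assumption for illustration only.

```r
library(data.table)

# Hypothetical toy data; "x" as the code for 'did not sign' is an assumption
toy.dt <- data.table(
  id_marriage = 1:6,
  age_groom   = c(24, 35, 28, 41, 31, 22),
  sign_groom  = c("h", "h", "x", "h", "x", "h"),
  m_year      = c(1840, 1851, 1840, 1862, 1849, 1833)
)

# Rows only: grooms older than 30
older <- toy.dt[age_groom > 30]

# Two conditions combined with & (logical AND)
older_signed <- toy.dt[sign_groom == "h" & age_groom > 30]

# Rows and columns: drop marriages from 1840, keep two columns
not_1840 <- toy.dt[m_year != 1840, list(id_marriage, m_year)]
```

Calling dim() on each result, as on the slides, is a quick way to see how many rows and columns survived a subset.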
  27. Creating new variables
      Let’s create a variable for the mean age at marriage of grooms:
        hmar.dt[, mean.gage := mean(age_groom)]
        summary(hmar.dt$age_groom)
        ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        ##   -2.00   24.00   26.00   28.38   30.00   79.00
        summary(hmar.dt$mean.gage)
        ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        ##   28.38   28.38   28.38   28.38   28.38   28.38
  28. Another example (from yesterday)
      Dummy variable for equal municipality of birth:
        hmar.dt[, eq_b_loc := (b_loc_groom == b_loc_bride)]
        summary(hmar.dt$eq_b_loc)
        ##    Mode   FALSE    TRUE    NA's
        ## logical    6957    3043       0
  29. Creating variables by group
      As we saw, a variable holding the overall mean age wasn’t really interesting. More useful: the average age of grooms at marriage by civil status:
        hmar.dt[, gage.mean.civ := mean(age_groom), by = civilst_groom]
        table(hmar.dt$civilst_groom, hmar.dt$gage.mean.civ)
        ##
        ##     27.2427939112599 40.8829787234043 42.9548286604361
        ##   1             9263                0                0
        ##   2                0                0              642
        ##   3                0               94                0
        ##   6                0                0                0
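How := combines with by can be followed on a handful of rows. The sketch below uses hypothetical values and assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy data: civil status codes and ages are made up
toy.dt <- data.table(civilst_groom = c(1, 1, 2, 2, 3),
                     age_groom     = c(24, 26, 40, 44, 38))

# := adds a column by reference; by = computes the mean within each group
toy.dt[, gage.mean.civ := mean(age_groom), by = civilst_groom]

# One row per group is often easier to read than a cross table
group_means <- unique(toy.dt[, list(civilst_groom, gage.mean.civ)])
```

Every row keeps its own age and additionally carries the mean of its group, so the number of rows in `toy.dt` does not change.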
  30. Summary subsets of the data
      • So far, we added variables to the original data.frame; this can be redundant, though
      • Think of context, say municipalities: archival material on characteristics, e.g. population, steam power
      • You can also make context characteristics by aggregation
  31. Mean age of grooms by municipality of birth:
        mc <- hmar.dt[, mean(age_groom), by = b_loc_groom]
        summary(mc)
        ##  b_loc_groom              V1
        ##  Length:1184        Min.   :-2.00
        ##  Class :character   1st Qu.:26.00
        ##  Mode  :character   Median :28.17
        ##                     Mean   :29.36
        ##                     3rd Qu.:31.00
        ##                     Max.   :69.00
  32. We can improve by naming the variable directly, and adding more variables:
        mc2 <- hmar.dt[, list(mean_gage = mean(age_groom),
                              mean_bage = mean(age_bride)),
                       by = b_loc_groom]
        summary(mc2)
        ##  b_loc_groom          mean_gage       mean_bage
        ##  Length:1184        Min.   :-2.00   Min.   :-2.00
        ##  Class :character   1st Qu.:26.00   1st Qu.:23.80
        ##  Mode  :character   Median :28.17   Median :25.88
        ##                     Mean   :29.36   Mean   :26.53
        ##                     3rd Qu.:31.00   3rd Qu.:28.00
        ##                     Max.   :69.00   Max.   :64.00
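Unlike :=, which adds a column to every row, a list() in the column position returns a new, smaller table with one row per group. A sketch on hypothetical data, assuming the data.table package is installed:

```r
library(data.table)

# Hypothetical toy data: municipalities "A" and "B" are made up
toy.dt <- data.table(b_loc_groom = c("A", "A", "B"),
                     age_groom   = c(24, 28, 30),
                     age_bride   = c(22, 26, 25))

# Aggregate: one row per municipality, with named summary columns
mc2 <- toy.dt[, list(mean_gage = mean(age_groom),
                     mean_bage = mean(age_bride)),
              by = b_loc_groom]
```

Here `mc2` has two rows (one for each municipality), while `toy.dt` itself is left unchanged.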
  33. One more... counts
      Yesterday, we talked about the problem of overlapping points. We used geom_jitter to solve it. Now let’s do it properly:
        mc3 <- hmar.dt[, list(frequency = .N),
                       by = list(m_year, age_bride)]
        # notice the .N ... N is often used for nr. of obs
        library(ggplot2)
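What .N counts becomes clear on a few rows. The values below are hypothetical, and the sketch assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy data: two brides aged 22 married in 1840
toy.dt <- data.table(m_year    = c(1840, 1840, 1840, 1841),
                     age_bride = c(22, 22, 25, 22))

# .N is the number of rows in each combination of m_year and age_bride
mc3 <- toy.dt[, list(frequency = .N), by = list(m_year, age_bride)]
```

Each overlapping point now appears once, with `frequency` telling how many marriages it represents; the frequencies add up to the number of rows in `toy.dt`.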
  34. Using colour
        ggplot(mc3, aes(x = m_year, y = age_bride)) +
          geom_point(aes(colour = frequency), size = 10, shape = 18) +
          theme_bw()
      [Figure: age_bride (y-axis) against m_year, points coloured by frequency]
  35. Using size
        ggplot(mc3, aes(x = m_year, y = age_bride)) +
          geom_point(aes(size = frequency), colour = "blue", shape = 18) +
          theme_bw()
      [Figure: age_bride (y-axis) against m_year, point size scaled by frequency]
  36. Basic statistical techniques
  37. Box and whisker plot
      • Shows the distribution of data
      • Median: 50% of the cases above and below
      • Box: 1st and 3rd quartile
      • Interquartile range (IQR): Q3 - Q1
      • Outliers (Tukey, 1977): x < Q1 - 1.5*IQR or x > Q3 + 1.5*IQR
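The outlier rule can be worked through by hand in base R. The ages below are hypothetical, chosen so that one value lies far from the rest.

```r
# Hypothetical ages, not from the HSN data
age <- c(18, 22, 23, 24, 25, 26, 27, 29, 31, 64)

q1  <- quantile(age, 0.25, names = FALSE)   # 1st quartile
q3  <- quantile(age, 0.75, names = FALSE)   # 3rd quartile
iqr <- q3 - q1                              # same value as IQR(age)

# Tukey's fences: anything outside them counts as an outlier
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
outliers <- age[age < lower | age > upper]
```

Note that boxplot() computes its box from 'hinges', which can differ slightly from the quartiles returned by quantile(), so the fences in a plot need not match these numbers exactly.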
  38. Box plot of the brides’ ages:
        boxplot(hmar.dt$age_bride, ylab = "Age")
      [Figure: box plot of age_bride, y-axis labelled "Age"]
  39. Recoding before plotting:
        hmar.dt[, sign.bride.cln := sign_bride == "h"]
        hmar.dt[age_bride < 14, age_bride := NA]
        # NB: no missing values here, but mind this when recoding!
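Why missing values matter when recoding can be shown with three values. This base R sketch uses a hypothetical vector; the code "x" for a bride who did not sign is an assumption.

```r
# Hypothetical values: signed, not signed (assumed code "x"), and unknown
sign_bride <- c("h", "x", NA)

# == passes missing values on: the unknown case stays NA
signed <- sign_bride == "h"

# %in% never returns NA: the unknown case becomes FALSE ('not signed')
signed_no_na <- sign_bride %in% "h"
```

Which behaviour is appropriate depends on the research question; the point is to choose deliberately rather than let an unknown silently turn into 'not signed'.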
  40. Box plots of the brides’ ages by signature:
        boxplot(hmar.dt$age_bride ~ hmar.dt$sign.bride.cln,
                names = c("not signed", "signed"),
                col = c("red", "green"))
      [Figure: box plots of age_bride for "not signed" and "signed"]