13 case-study

  1. Stat405: ddply case study. Hadley Wickham. Tuesday, 5 October 2010
  2. Homework; Project; Case study: gender trends (focus on smaller subset, develop summary statistic, classify names)
  3. Homework
     Explain your code! Comments should explain why, not what.
     Check your indenting: if it’s not indented correctly, it’s very hard to read.
  4. # Really bad:
     # Set x equal to ten.
     x <- 10

     # Bad:
     # Figure out if all windows are bars
     allbars <- all(windows %in% c("B", "BB", "BBB"))

     # Better:
     # all() / any() combination used to prevent errors in the
     # case of three DDs.

     # Better:
     # Check to see if DD will create a triple
     # if (length(unique(windows)) == 2)
  5. # Best (but still not perfect):
     ## DD wild 4 cases and subcases
     #### 1c) 3 DD's
     #### 2c) 2 DD's
     #### 2c) 2 DD's
     #### the prize is quadrupled
     #### 3c) 1 DD
     #### prize doubled
     ## 3c.1) 1 DD and 2 of a kind
     ## 3c.2) 1 DD for any bars
     ## 3c.3) 1 DD for Cherries
     #### 4c) NO DD's
     ## 4c.1) Just any bar
     ## 4c.2) Just cherries
  6. Project
  7. Tips from last year
     Proofread: far too many projects had obvious mistakes.
     Include a section on the data, giving a quick plain-English run-down of what you did to the data. Only the appendix should contain technical details.
     Presentation matters: you should be proud of your work, so take a little time to put it in a nice wrapper.
  8. Easy ways to lose points
     Overplotting
     Code style violations
     Forgetting about the denominator of a ratio
  9. Team assessment
     Your individual grades will be weighted by effort.
     Each team member should turn in a (confidential) team evaluation sheet. Don’t forget to assess yourself.
  10. Case study
  11. Questions
      For names that are used for both boys and girls, how has usage changed?
      Can we use names that clearly have the incorrect sex to estimate error rates over time?
  12. Getting started
      options(stringsAsFactors = FALSE)
      library(plyr)
      library(ggplot2)
      bnames <- read.csv("baby-names2.csv.bz2")
  14. First task
      Too many names (~7000): we need to identify a smaller subset (~100) likely to be interesting. Outside of class, we would look at more, but starting with a subset for easier exploration is a good idea.
      For this task, what attributes of a name are likely to be useful?
  15. Your turn
      For each name, calculate:
      the total proportion of boys,
      the total proportion of girls,
      the number of years the name was in the top 1000 as a girl's name,
      the number of years the name was in the top 1000 as a boy's name.
      Hint: Start with a single name and figure out how to solve the problem.
      Hint: Use summarise.
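The "start with a single name" hint can be sketched on made-up data. The `toy` data frame, the names in it, and the helper `summarise_name()` are all hypothetical stand-ins for illustration, not part of the course data or of plyr; the sketch uses base R only so that it runs without the real file.

```r
# Made-up data in the same long layout as bnames: one row per year/name/sex
toy <- data.frame(
  year = c(1990, 1990, 1991, 1991, 1991),
  name = c("Casey", "Casey", "Casey", "Casey", "Dana"),
  sex  = c("boy", "girl", "boy", "girl", "girl"),
  prop = c(0.002, 0.001, 0.002, 0.0015, 0.003)
)

# Hypothetical helper: summarise the rows belonging to ONE name
summarise_name <- function(df) {
  data.frame(
    boys    = sum(df$prop[df$sex == "boy"]),   # total proportion as a boy's name
    boys_n  = sum(df$sex == "boy"),            # years listed as a boy's name
    girls   = sum(df$prop[df$sex == "girl"]),  # total proportion as a girl's name
    girls_n = sum(df$sex == "girl")            # years listed as a girl's name
  )
}

# Step 1: get it working for a single name
casey <- summarise_name(subset(toy, name == "Casey"))

# Step 2: apply the same function to every name and stack the results
pieces <- lapply(split(toy, toy$name), summarise_name)
times  <- cbind(name = names(pieces), do.call(rbind, pieces))
print(times)
```

Once the single-name function is right, the split/apply/combine step is mechanical. With plyr loaded, `ddply(toy, "name", summarise_name)` should produce an equivalent table.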
  16. times <- ddply(bnames, "name", summarise,
        boys = sum(prop[sex == "boy"]),
        boys_n = sum(sex == "boy"),
        girls = sum(prop[sex == "girl"]),
        girls_n = sum(sex == "girl"),
        .progress = "text"  # Useful for slow operations
      )
      # But this is rather painful
  17. # For this task, the data are much easier to work with
      # if we put sex in columns instead of rows. We'll learn
      # more about reshaping in a couple of weeks.
      # install.packages("reshape2")
      library(reshape2)
      # Current reshape2 spells this argument value.var;
      # the original slide used the older spelling, value_var.
      bnames2 <- dcast(bnames, year + name ~ sex, value.var = "prop")

      # No information unless we have both boys and
      # girls for that name in that year
      both <- subset(bnames2, !is.na(boy) & !is.na(girl))
      dim(both)
      head(both)
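To see what moving sex from rows to columns does, here is a minimal base-R sketch on made-up data. It uses `merge()` rather than the `dcast()` call from the slide, and the `toy` data frame and its names are hypothetical, but the resulting shape is the same: one row per year and name, with a `boy` and a `girl` column.

```r
# Made-up long-format data: one row per year/name/sex
toy <- data.frame(
  year = c(1990, 1990, 1991, 1991, 1991),
  name = c("Casey", "Casey", "Casey", "Casey", "Dana"),
  sex  = c("boy", "girl", "boy", "girl", "girl"),
  prop = c(0.002, 0.001, 0.002, 0.0015, 0.003)
)

# Split by sex and rename prop so each sex gets its own column
boys  <- subset(toy, sex == "boy",  select = c(year, name, prop))
girls <- subset(toy, sex == "girl", select = c(year, name, prop))
names(boys)[3]  <- "boy"
names(girls)[3] <- "girl"

# all = TRUE keeps year/name pairs seen for only one sex, filled with NA
wide <- merge(boys, girls, by = c("year", "name"), all = TRUE)

# Keep only the rows where the name was listed for both sexes that year
both <- subset(wide, !is.na(boy) & !is.na(girl))
print(wide)
print(both)
```

In this toy data the name listed only for girls gets `NA` in the `boy` column, which is exactly the kind of row that the `!is.na(boy) & !is.na(girl)` filter drops.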
  18. Your turn
      Summarise each name with the number of years it has made the list for both boys and girls, and the average proportion of babies given that name.
      Which names would you include for further investigation?
  19. both_sum <- ddply(both, "name", summarise,
        years = length(name),
        avg_usage = mean(boy + girl) / 2
      )
      # No point looking at names that only appear once
      both_sum <- subset(both_sum, years > 1)
      qplot(years, avg_usage, data = both_sum)
  20. # Now save our selections
      selected_names <- subset(both_sum, years > 20 & avg_usage > 0.005)$name
      selected <- subset(both, name %in% selected_names)
      nrow(selected) / nrow(both)
  21. Your turn
      Explore how the gender assignment of these names has changed over time.
      What is a good summary to use to compare boy popularity to girl popularity?
  22. qplot(year, boy - girl, data = selected, geom = "line", group = name)
      qplot(year, abs(boy - girl), data = selected, geom = "line",
        group = name, colour = sign(boy - girl))
      qplot(year, boy / girl, data = selected, geom = "line", group = name)
      qplot(year, log10(boy / girl), data = selected, geom = "line", group = name)

      selected$lratio <- with(selected, log10(boy / girl))
      qplot(lratio, name, data = selected)
      qplot(lratio, reorder(name, lratio), data = selected)
      qplot(abs(lratio), reorder(name, lratio), data = selected)
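One likely reason to prefer `log10(boy / girl)` over the raw ratio is symmetry: a name four times as common for boys and a name four times as common for girls sit at equal distances on either side of zero. A small sketch with made-up proportions:

```r
# Made-up proportions: mostly boys, mostly girls, evenly split
boy  <- c(0.004, 0.001, 0.002)
girl <- c(0.001, 0.004, 0.002)

ratio  <- boy / girl     # raw ratio: squeezed into (0, 1) on the girl side
lratio <- log10(ratio)   # log ratio: same magnitude, opposite sign

print(data.frame(boy, girl, ratio, lratio))
```

On the raw scale, girl-dominated names are compressed between 0 and 1 while boy-dominated names stretch out above 1. On the log scale, an evenly split name is at 0 and each unit is a factor of ten in either direction, so a cutoff applied to `abs(lratio)` treats both directions equally.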
  23. Your turn
      Compute the mean and range of lratio for each name.
      Plot and come up with cutoffs that you think separate the two groups.
  24. rng <- ddply(selected, "name", summarise,
        diff = diff(range(lratio, na.rm = TRUE)),
        mean = mean(lratio, na.rm = TRUE)
      )
      qplot(diff, abs(mean), data = rng)
      qplot(diff, abs(mean), data = rng, geom = "text", label = name)

      rng$dual <- abs(rng$mean) < 2
      arrange(rng, mean, dual)
      selected <- join(selected, rng[c("name", "dual")], by = "name")
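The classification step can be traced on a tiny made-up example. The names and `lratio` values below are invented for illustration, and the sketch uses base R (`tapply()` and `merge()`) instead of `ddply()` and `join()`; the cutoff `abs(mean) < 2` is the one from the slide.

```r
# Made-up log ratios for two hypothetical names over three years
toy <- data.frame(
  name   = rep(c("Jessie", "John"), each = 3),
  year   = rep(1990:1992, times = 2),
  lratio = c(0.3, -0.1, -0.5, 2.9, 3.0, 3.1)
)

# Per-name mean and range (max minus min) of lratio
means  <- tapply(toy$lratio, toy$name, mean)
ranges <- tapply(toy$lratio, toy$name, function(x) diff(range(x)))
rng <- data.frame(
  name = names(means),
  mean = as.vector(means),
  diff = as.vector(ranges)
)

# A mean log ratio within two orders of magnitude of zero counts as dual use
rng$dual <- abs(rng$mean) < 2

# Attach the label back onto the yearly data
toy <- merge(toy, rng[c("name", "dual")], by = "name")
print(rng)
```

Here the name whose log ratio hovers around zero is flagged as dual use, while the one sitting near 3 (about a thousand times more common for one sex) is not. On the real data the cutoff is a judgment call made by looking at the plot.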
  25. qplot(year, lratio, data = selected, geom = "line", group = name) +
        facet_wrap(~ dual)
      qplot(year, lratio, data = subset(selected, dual), geom = "line") +
        facet_wrap(~ name)
      qplot(year, boy / (boy + girl), data = subset(selected, dual), geom = "line") +
        facet_wrap(~ name)
  26. Next time
      Now that we’ve separated the two groups, we’ll explore each in more detail.
