Stat405           ddply case study


                           Hadley Wickham
Tuesday, 5 October 2010
1. Homework
2. Project
3. Case study: gender trends
     1. Focus on smaller subset
     2. Develop summary statistic
     3. Classify names


Homework

                    Explain your code!
                    Comments should explain why, not what
                    Check your indenting - if it’s not indented
                    correctly, it’s very hard to read




# Really bad:
     # Set x equal to ten.
     x <- 10

     # Bad:
     # Figure out if all windows are bars
     allbars <- all(windows %in% c("B", "BB", "BBB"))

     # Better:
     # all() / any() combination used to prevent errors in the
     # case of three DDs.

     # Better:
     # Check to see if DD will create a triple
     # if (length(unique(windows)) == 2)


# Best (but still not perfect):

              ## DD wild 4 cases and subcases
              #### 1c) 3 DD's
              #### 2c) 2 DD's
              #### 2c) 2 DD's
              ####     the prize is quadrupled
              #### 3c) 1 DD
              ####     prize doubled
              ##       3c.1) 1 DD and 2 of a kind
              ##       3c.2) 1 DD for any bars
              ##       3c.3) 1 DD for Cherries
              #### 4c) NO DD's
              ##       4c.1) Just any bar
              ##       4c.2) Just cherries


Project

Tips from last year
                    Proofread - far too many projects have
                    obvious mistakes.
                    Include a section on the data, giving a quick
                    English run-down of what you did to the
                    data. Only the appendix should contain
                    technical details.
                    Presentation matters - you should be proud
                    of your work, so take a little time to put it in a
                    nice wrapper.


Easy ways to lose points

                    Overplotting
                    Code style violations
                    Forgetting about the denominator of a
                    ratio




Team Assessment

                    Your individual grades will be weighted by
                    effort.
                    Each team member should turn in a
                    (confidential) team evaluation sheet.
                    Don’t forget to assess yourself.




Case study

Questions

                  For names that are used for both boys
                  and girls, how has usage changed?
                  Can we use names that clearly have the
                  incorrect sex to estimate error rates over
                  time?




Getting started

                options(stringsAsFactors = FALSE)
                library(plyr)
                library(ggplot2)

                bnames <- read.csv("baby-names2.csv.bz2")




First task
                    Too many names (~7000): need to identify a
                    smaller subset (~100) likely to be
                    interesting.
                    Outside of class, we would look at more, but
                    starting with a subset for easier
                    exploration is a good idea.




                      For this task, what attributes of a name are
                      likely to be useful?

Your turn
                    For each name, calculate the total proportion
                    of boys, the total proportion of girls, the
                    number of years the name was in the top
                    1000 as a girl's name, and the number of
                    years the name was in the top 1000 as a
                    boy's name.
                    Hint: Start with a single name and figure out
                    how to solve the problem.
                    Hint: Use summarise.
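The first hint can be sketched in base R on a small made-up data set. This is only an illustration: `toy`, `summarise_name`, and the values are hypothetical stand-ins, not the real `bnames` data, and `split()` plus `lapply()` is used here in place of `ddply()` so the sketch runs without extra packages.

```r
# Made-up data with the same columns as bnames (assumption: year, name, sex, prop)
toy <- data.frame(
  year = c(2000, 2000, 2001, 2001, 2002),
  name = c("Alex", "Alex", "Alex", "Mary", "Mary"),
  sex  = c("boy", "girl", "boy", "girl", "girl"),
  prop = c(0.004, 0.001, 0.005, 0.010, 0.009),
  stringsAsFactors = FALSE
)

# Step 1: solve the problem for the rows of a single name
summarise_name <- function(df) {
  data.frame(
    boys    = sum(df$prop[df$sex == "boy"]),   # total proportion as a boy's name
    boys_n  = sum(df$sex == "boy"),            # years on the boys' list
    girls   = sum(df$prop[df$sex == "girl"]),  # total proportion as a girl's name
    girls_n = sum(df$sex == "girl")            # years on the girls' list
  )
}
alex_summary <- summarise_name(subset(toy, name == "Alex"))

# Step 2: apply the same function to every name and stack the results
all_names <- do.call(rbind, lapply(split(toy, toy$name), summarise_name))
print(all_names)
```

Once the single-name version gives sensible numbers, moving to every name is mostly a matter of handing the same logic to a split-apply-combine tool.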


times <- ddply(bnames, "name", summarise,
  boys = sum(prop[sex == "boy"]),
  boys_n = sum(sex == "boy"),
  girls = sum(prop[sex == "girl"]),
  girls_n = sum(sex == "girl"),
  .progress = "text"  # Progress bar: useful for slow operations
)

# But this is rather painful




# For this task, the data is much easier to work with
# if we put sex in columns instead of rows. We'll learn
# more about reshaping in a couple of weeks.
# install.packages("reshape2")
library(reshape2)
# Current reshape2 releases spell this argument value.var
# (the slide originally used value_var)
bnames2 <- dcast(bnames, year + name ~ sex,
  value.var = "prop")

# No information unless we have both boys and
# girls for that name in that year
both <- subset(bnames2, !is.na(boy) & !is.na(girl))
dim(both)
head(both)
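To see what moving sex from rows to columns does, here is a minimal base R sketch on made-up data. It uses `merge()` rather than `dcast()` so that it runs without reshape2; `toy`, `wide`, and `both_toy` are hypothetical names, and the result has the same shape as `bnames2` and `both` above.

```r
# Made-up long-format data: one row per year, name and sex
toy <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001),
  name = c("Alex", "Alex", "Mary", "Alex", "Mary"),
  sex  = c("boy", "girl", "girl", "boy", "girl"),
  prop = c(0.004, 0.001, 0.010, 0.005, 0.009),
  stringsAsFactors = FALSE
)

# Split by sex and rename prop so each sex gets its own column
boys  <- subset(toy, sex == "boy",  select = c(year, name, prop))
girls <- subset(toy, sex == "girl", select = c(year, name, prop))
names(boys)[3]  <- "boy"
names(girls)[3] <- "girl"

# A full outer join gives one row per year and name, with NA where a sex is missing
wide <- merge(boys, girls, by = c("year", "name"), all = TRUE)

# Keep only the years where the name appears for both sexes
both_toy <- subset(wide, !is.na(boy) & !is.na(girl))
print(wide)
print(both_toy)
```

The `NA` cells in `wide` are exactly the name-year pairs that the `!is.na()` filter drops.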

Your turn

                    Summarise each name with the number
                    of years it has made the list for both boys
                    and girls, and the average proportion of
                    babies given that name.
                    Which names would you include for
                    further investigation?



both_sum <- ddply(both, "name", summarise,
       years = length(name),
       avg_usage = mean(boy + girl) / 2
     )

     # No point in looking at names that only appear once
     both_sum <- subset(both_sum, years > 1)

     qplot(years, avg_usage, data = both_sum)




# Now save our selections

     selected_names <- subset(both_sum,
       years > 20 & avg_usage > 0.005)$name

     selected <- subset(both, name %in% selected_names)

     nrow(selected) / nrow(both)




Your turn

                    Explore how the gender assignment of
                    these names has changed over time.
                    What is a good summary to use to
                    compare boy popularity to girl popularity?




qplot(year, boy - girl, data = selected,
       geom = "line", group = name)
     qplot(year, abs(boy - girl), data = selected,
       geom = "line", group = name,
       colour = sign(boy - girl))

     qplot(year, boy / girl, data = selected,
       geom = "line", group = name)
     qplot(year, log10(boy / girl), data = selected,
       geom = "line", group = name)

     selected$lratio <- with(selected, log10(boy / girl))
     qplot(lratio, name, data = selected)
     qplot(lratio, reorder(name, lratio), data = selected)
     qplot(abs(lratio), reorder(name, lratio),
       data = selected)
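One reason the log ratio is a convenient summary is that it treats the two sexes symmetrically: a name ten times more common for boys and a name ten times more common for girls end up the same distance from zero, on opposite sides. A small sketch with made-up proportions (the vectors below are illustrative, not taken from the data):

```r
# Made-up proportions for three hypothetical names
boy  <- c(0.010, 0.001, 0.005)
girl <- c(0.001, 0.010, 0.005)

diff_bg <- boy - girl          # depends on how popular the name is overall
ratio   <- boy / girl          # asymmetric: 10 versus 0.1
lratio  <- log10(ratio)        # symmetric around 0: 1 versus -1
share   <- boy / (boy + girl)  # bounded alternative, symmetric around 0.5

print(data.frame(boy, girl, diff_bg, ratio, lratio, share))
```

The raw difference mixes overall popularity with gender balance, while `lratio` and `share` depend only on the balance.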

Your turn

                Compute the mean and range of lratio for
                each name.
                Plot and come up with cutoffs that you
                think separate the two groups.




rng <- ddply(selected, "name", summarise,
  diff = diff(range(lratio, na.rm = TRUE)),
  mean = mean(lratio, na.rm = TRUE)
)

qplot(diff, abs(mean), data = rng)
qplot(diff, abs(mean), data = rng, geom = "text",
  label = name)

rng$dual <- abs(rng$mean) < 2
arrange(rng, mean, dual)

# Attach the dual flag to every row of selected
selected <- join(selected, rng[c("name", "dual")], by = "name")
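The summarise, classify, and join steps can be traced by hand on a tiny made-up example. This sketch uses base R (`tapply()` and `merge()`) instead of `ddply()` and `join()`, so it runs without plyr; `toy`, `rng_toy`, and `labelled` are hypothetical names, and the cutoff of 2 is simply copied from the code above.

```r
# Made-up log ratios for two hypothetical names over three years
toy <- data.frame(
  name   = c("Alex", "Alex", "Alex", "Mary", "Mary", "Mary"),
  year   = c(2000, 2001, 2002, 2000, 2001, 2002),
  lratio = c(-0.2, 0.1, 0.4, -3.1, -2.9, -3.0),
  stringsAsFactors = FALSE
)

# Per-name mean, and width of the range (max minus min)
means  <- tapply(toy$lratio, toy$name, mean)
widths <- tapply(toy$lratio, toy$name, function(x) diff(range(x)))

rng_toy <- data.frame(
  name = names(means),
  diff = as.vector(widths),
  mean = as.vector(means),
  stringsAsFactors = FALSE
)

# A name counts as dual-use when its mean log ratio is close to zero
rng_toy$dual <- abs(rng_toy$mean) < 2

# Copy the flag back onto every yearly row
labelled <- merge(toy, rng_toy[c("name", "dual")], by = "name")
print(rng_toy)
print(labelled)
```

Here the hypothetical "Alex" hovers around zero and is flagged as dual-use, while "Mary" sits near -3 (about a thousand times more common for girls) and is not.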


qplot(year, lratio, data = selected, geom = "line",
       group = name) + facet_wrap(~ dual)

     qplot(year, lratio, data = subset(selected, dual),
       geom = "line") + facet_wrap(~ name)

     qplot(year, boy / (boy + girl),
       data = subset(selected, dual), geom = "line") +
       facet_wrap(~ name)




Next time


                    Now that we’ve separated the two
                    groups, we’ll explore each in more detail.



