Stat405 ddply case study (2)
Hadley Wickham
Thursday, 7 October 2010
1. Recap
   1. Focus on smaller subset
2. More ddply
   1. Develop summary statistic
   2. Classify names
   3. Apply to full data
Questions
For names that are used for both boys
and girls, how has usage changed?
Can we use names that were clearly recorded
under the wrong sex to estimate error rates
over time?
Getting started
options(stringsAsFactors = FALSE)
library(plyr)
library(ggplot2)
both <- read.csv("both.csv")
Interesting subset
# One row per name: how many rows it has, and its average usage
both_sum <- ddply(both, "name", summarise,
  years = length(name),
  avg_usage = mean(boy + girl) / 2
)
# Keep names with more than one row
both_sum <- subset(both_sum, years > 1)
qplot(years, avg_usage, data = both_sum)

# Long-running, reasonably common names
selected_names <- subset(both_sum,
  years > 50 & avg_usage > 0.0005)$name
selected <- subset(both, name %in% selected_names)
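
To see what the summary computes, here is a minimal sketch on made-up data. The toy data frame and its column layout (year, name, boy, girl) are assumptions standing in for both.csv, not values from it.

```r
library(plyr)

# Hypothetical toy data; the real both.csv may be laid out differently
toy <- data.frame(
  year = c(2000, 2001, 2002, 2000, 2001),
  name = c("Alex", "Alex", "Alex", "Jo", "Jo"),
  boy  = c(0.004, 0.006, 0.008, 0.001, 0.003),
  girl = c(0.002, 0.002, 0.002, 0.001, 0.001),
  stringsAsFactors = FALSE
)

# Same summary as on the slide: one output row per name
toy_sum <- ddply(toy, "name", summarise,
  years = length(name),            # number of rows for this name
  avg_usage = mean(boy + girl) / 2 # mean of (boy + girl), halved
)
print(toy_sum)
```

Here Alex has 3 rows and Jo has 2, so a filter such as years > 2 would keep only Alex.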
Patterns
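# Log ratio of boy to girl usage:
# 0 = equal, positive = more boys, negative = more girls
selected$lratio <- with(selected,
  log10(boy / girl))

# Names in default order
qplot(lratio, name, data = selected)
# Names ordered by their mean lratio
qplot(lratio, reorder(name, lratio),
  data = selected)
# Absolute value: how one-sided, regardless of direction
qplot(abs(lratio), reorder(name, lratio),
  data = selected)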
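
A small sketch of why the log ratio is a convenient scale, using made-up proportions (not values from both.csv): a name ten times more common for boys and one ten times more common for girls land at equal distances on either side of zero.

```r
# Hypothetical proportions for three names
boy  <- c(0.010, 0.001, 0.005)
girl <- c(0.001, 0.010, 0.005)

# 10x more boys, 10x more girls, equal usage
lratio <- log10(boy / girl)
print(lratio)       # about 1, -1 and 0, up to floating-point rounding

# abs() folds both directions together: distance from equal usage
print(abs(lratio))
```

On the raw ratio scale the same three names would sit at 10, 0.1 and 1, which is not symmetric around equal usage.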