13 case-study

Stat405 ddply case study

Hadley Wickham
Tuesday, 5 October 2010

1. Homework
2. Project
3. Case study: gender trends
   1. Focus on smaller subset
   2. Develop summary statistic
   3. Classify names


Homework

Explain your code!
Comments should explain why, not what
Check your indenting - if it’s not indented
correctly, it’s very hard to read


# Really bad:
# Set x equal to ten.
x <- 10

# Bad:
# Figure out if all windows are bars
allbars <- all(windows %in% c("B", "BB", "BBB"))

# Better:
# all() / any() combination used to prevent errors in the
# case of three DDs.

# Better:
# Check to see if DD will create a triple
# if (length(unique(windows)) == 2)


# Best (but still not perfect):

## DD wild 4 cases and subcases
#### 1c) 3 DD's
#### 2c) 2 DD's
#### the prize is quadrupled
#### 3c) 1 DD
#### prize doubled
## 3c.1) 1 DD and 2 of a kind
## 3c.2) 1 DD for any bars
## 3c.3) 1 DD for Cherries
#### 4c) NO DD's
## 4c.1) Just any bar
## 4c.2) Just cherries


Project


Tips from last year
Proofread - far too many projects with
obvious mistakes.
Include a section on the data, giving a quick
English run-down of what you did to the
data. Only the appendix should contain technical details.
Presentation matters - you should be proud
of your work, so take a little time to put it in a
nice wrapper.


Easy ways to lose points

Overplotting
Code style violations
Forgetting about the denominator of a
ratio


Team Assessment

Your individual grades will be weighted by
effort.
Each team member should turn in a
(confidential) team evaluation sheet.
Don’t forget to assess yourself.


Case study


Questions

For names that are used for both boys
and girls, how has usage changed?
Can we use names that clearly have the
incorrect sex to estimate error rates over
time?


Getting started

options(stringsAsFactors = FALSE)
library(plyr)
library(ggplot2)

bnames <- read.csv("baby-names2.csv.bz2")


First task
Too many names (~7000): need to identify
smaller subset (~100) likely to be
interesting.
Outside of class, would look at more, but
starting with a subset for easier
exploration is a good idea.


For this task, what attributes of a name are
likely to be useful?


Your turn
For each name, calculate the total proportion
of boys, the total proportion of girls, the
number of years the name was in the top
1000 as a girl's name, and the number of years
the name was in the top 1000 as a boy's
name.
Hint: Start with a single name and figure out
how to solve the problem. Hint: Use
summarise.
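
The "start with a single name" hint can be sketched on a tiny made-up data frame. The `toy` data and its values are hypothetical; only the column names (`year`, `name`, `sex`, `prop`) mirror `bnames`. This is one possible way to work up to the full answer, not the only one.

```r
library(plyr)

# Hypothetical toy data with the same columns as bnames
toy <- data.frame(
  year = c(2000, 2001, 2000, 2001, 2001),
  name = c("Alex", "Alex", "Alex", "Alex", "Sam"),
  sex  = c("boy", "boy", "girl", "girl", "boy"),
  prop = c(0.004, 0.003, 0.001, 0.002, 0.005),
  stringsAsFactors = FALSE
)

# Step 1: solve the problem for one name
alex <- subset(toy, name == "Alex")
one <- summarise(alex,
  boys    = sum(prop[sex == "boy"]),
  boys_n  = sum(sex == "boy"),
  girls   = sum(prop[sex == "girl"]),
  girls_n = sum(sex == "girl")
)
one

# Step 2: apply the same expressions to every name
by_name <- ddply(toy, "name", summarise,
  boys    = sum(prop[sex == "boy"]),
  boys_n  = sum(sex == "boy"),
  girls   = sum(prop[sex == "girl"]),
  girls_n = sum(sex == "girl")
)
by_name
```

A name that never appears for one sex (here "Sam" for girls) gets a total of 0 and a count of 0, because the sum over an empty selection is 0.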


times <- ddply(bnames, "name", summarise,
  boys = sum(prop[sex == "boy"]),
  boys_n = sum(sex == "boy"),
  girls = sum(prop[sex == "girl"]),
  girls_n = sum(sex == "girl"),
  .progress = "text"  # Useful for slow operations
)

# But this is rather painful


# For this task, the data are much easier to work with
# if we put sex in columns instead of rows. We'll learn
# more about reshaping in a couple of weeks.
# install.packages("reshape2")
library(reshape2)
bnames2 <- dcast(bnames, year + name ~ sex,
  value.var = "prop")

# No information unless we have both boys and
# girls for that name in that year
both <- subset(bnames2, !is.na(boy) & !is.na(girl))
dim(both)
head(both)
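
To see what the `!is.na()` filter does, here is a minimal sketch in base R on hypothetical wide data. The `wide` data frame is made up by hand to resemble the shape produced by the reshaping step (one row per year and name, one column per sex); it is not real output.

```r
# Hypothetical wide data: NA marks a year in which the name
# did not make the top 1000 for that sex
wide <- data.frame(
  year = c(2000, 2000, 2001, 2001),
  name = c("Alex", "Sam", "Alex", "Sam"),
  boy  = c(0.004, 0.005, 0.003, 0.006),
  girl = c(0.001, NA, 0.002, NA),
  stringsAsFactors = FALSE
)

# Keep only rows where both proportions are present
both_toy <- subset(wide, !is.na(boy) & !is.na(girl))
both_toy

# Without the filter, any comparison of the two columns is NA
# wherever one side is missing
wide$boy / wide$girl
```

Only the "Alex" rows survive, since "Sam" never has a girl proportion in this toy data.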


Your turn

Summarise each name with the number
of years it has made the list for both boys
and girls, and the average proportion of
babies given that name.
Which names would you include for
further investigation?


both_sum <- ddply(both, "name", summarise,
  years = length(name),
  avg_usage = mean(boy + girl) / 2
)

# No point in looking at names that only appear once
both_sum <- subset(both_sum, years > 1)

qplot(years, avg_usage, data = both_sum)


# Now save our selections

selected_names <- subset(both_sum,
  years > 20 & avg_usage > 0.005)$name

selected <- subset(both, name %in% selected_names)

nrow(selected) / nrow(both)


Your turn

Explore how the gender assignment of
these names has changed over time.
What is a good summary to use to
compare boy popularity to girl popularity?


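qplot(year, boy - girl, data = selected,
  geom = "line", group = name)
qplot(year, abs(boy - girl), data = selected,
  geom = "line", group = name,
  colour = sign(boy - girl))

# The last two calls were cut off after `data = selected,`;
# the remaining arguments are assumed to match the calls above.
qplot(year, boy / girl, data = selected,
  geom = "line", group = name)
qplot(year, log10(boy / girl), data = selected,
  geom = "line", group = name)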

selected$lratio <- with(selected, log10(boy / girl))
qplot(lratio, name, data = selected)
qplot(lratio, reorder(name, lratio), data = selected)
qplot(abs(lratio), reorder(name, lratio),
  data = selected)
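
One reason the log ratio is a convenient summary can be sketched with a few hypothetical proportions (the numbers below are made up for illustration). The raw ratio is asymmetric around 1, while the log ratio is symmetric around 0, so "four times more boys" and "four times more girls" sit the same distance from the middle.

```r
# Hypothetical proportions for a name used for both sexes
boy  <- c(0.004, 0.002, 0.001)
girl <- c(0.001, 0.002, 0.004)

# Difference: its size depends on how common the name is overall
boy - girl

# Ratio: roughly 4, 1 and 0.25, which is lopsided around 1
ratio <- boy / girl
ratio

# Log ratio: roughly 0.6, 0 and -0.6, symmetric around 0
lratio <- log10(ratio)
lratio

# Swapping the sexes only flips the sign of the log ratio
flipped <- log10(girl / boy)
all.equal(flipped, -lratio)
```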


Your turn

Compute the mean and range of lratio for
each name.
Plot and come up with cutoffs that you
think separate the two groups.


rng <- ddply(selected, "name", summarise,
  diff = diff(range(lratio, na.rm = TRUE)),
  mean = mean(lratio, na.rm = TRUE)
)

qplot(diff, abs(mean), data = rng)
qplot(diff, abs(mean), data = rng, geom = "text",
  label = name)

rng$dual <- abs(rng$mean) < 2
arrange(rng, mean, dual)

selected <- join(selected, rng[c("name", "dual")])
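
To make the cutoff concrete: because `lratio` is on the log10 scale, a mean below 2 in absolute value corresponds to an average boy-to-girl ratio between 1:100 and 100:1. A small sketch in base R, using hypothetical per-name means (`rng_toy` and its values are made up):

```r
# Cutoff on the log10 scale and the raw ratios it corresponds to
cutoff <- 2
10^cutoff     # 100: about 100 boys for every girl
10^-cutoff    # 0.01: about 100 girls for every boy

# Hypothetical per-name means of lratio
rng_toy <- data.frame(
  name = c("A", "B", "C"),
  mean = c(2.5, 0.3, -1.2),
  stringsAsFactors = FALSE
)

# Same rule as above: TRUE when neither sex dominates by 100 to 1 or more
rng_toy$dual <- abs(rng_toy$mean) < cutoff
rng_toy
```

Under this rule name "A" is classified as not dual, while "B" and "C" are treated as names in genuine use for both sexes.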


qplot(year, lratio, data = selected, geom = "line",
  group = name) + facet_wrap(~ dual)

qplot(year, lratio, data = subset(selected, dual),
  geom = "line") + facet_wrap(~ name)

qplot(year, boy / (boy + girl),
  data = subset(selected, dual), geom = "line") +
  facet_wrap(~ name)
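
The last plot switches from the log ratio to the share of boys, boy / (boy + girl). The two summaries carry the same information, as this small base R sketch with hypothetical values shows: the share is ratio / (1 + ratio), where ratio = 10^lratio.

```r
# Hypothetical log ratios, from strongly girl to strongly boy
lratio <- c(-2, -1, 0, 1, 2)

ratio <- 10^lratio             # boy / girl
share <- ratio / (1 + ratio)   # boy / (boy + girl)
round(share, 3)

# Check against proportions computed directly (made-up values)
boy  <- 0.004
girl <- 0.001
direct    <- boy / (boy + girl)
via_ratio <- (boy / girl) / (1 + boy / girl)
all.equal(direct, via_ratio)
```

The share is bounded between 0 and 1 and equals 0.5 when the name is used equally for both sexes, which can make it easier to read than the log ratio.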


Next time

Now that we’ve separated the two
groups, we’ll explore each in more detail.

