Upcoming SlideShare
×

# 02 Ddply

22,400 views
22,275 views

Published on

14 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• @GeorgeSoilis As I understood it from the previous slide, this is charting the percentage of baby names, for each gender, that fall in the top 100. It seems that, over time, a lower percentage of the names are being drawn from those 100 most popular names. To me, that suggests a greater diversity of names now, as compared to perhaps 100 years ago.

Are you sure you want to  Yes  No
• The last slide (plot) i think suggests amongst other things a drop in births???is this trend of both sexes downwards showing this or is it my idea?

Are you sure you want to  Yes  No
• Slide 8. It looks like nchar and vowels aren’t doing what you’d expect them to.

For example subset( bnames, name == ’Allyn’) says that the length == 3 and vowels == 0.

vowels should be defined with gsub('[^AEIOUaeiou]', '', x)

The 'Allyn' example behaves properly if length = nchar( tolower( name )). Not sure why that's necessary, since nchar('Allyn') == 5.

Are you sure you want to  Yes  No
• vowel.final <- function(x) { grepl( ’[aeiouy]\$’, x) }

43 214 boys’ names are vowel-final / 129 000
103 117 girls’ names are vowel-final / 129 000

Are you sure you want to  Yes  No
• To get the data set:

Are you sure you want to  Yes  No
Views
Total views
22,400
On SlideShare
0
From Embeds
0
Number of Embeds
74
Actions
Shares
0
225
5
Likes
14
Embeds 0
No embeds

No notes for slide

### 02 Ddply

1. 1. plyr US baby names Hadley Wickham Tuesday, 7 July 2009
2. 2. 1. Introduction to the data 2. Transformations and summaries 3. Group-wise transformation and summary 4. Variable selection syntax 5. Challenge Tuesday, 7 July 2009
3. 3. Baby names Top 1000 male and female baby names in the US, from 1880 to 2008. 258,000 records (1000 * 2 * 129) But only four variables: year, name, sex and percent. CC BY http://www.ﬂickr.com/photos/the_light_show/2586781132 Tuesday, 7 July 2009
4. 4. > head(bnames, 15) > tail(bnames, 15) year name percent sex year name percent sex 1 1880 John 0.081541 boy 257986 2008 Neveah 0.000130 girl 2 1880 William 0.080511 boy 257987 2008 Amaris 0.000129 girl 3 1880 James 0.050057 boy 257988 2008 Hadassah 0.000129 girl 4 1880 Charles 0.045167 boy 257989 2008 Dania 0.000129 girl 5 1880 George 0.043292 boy 257990 2008 Hailie 0.000129 girl 6 1880 Frank 0.027380 boy 257991 2008 Jamiya 0.000129 girl 7 1880 Joseph 0.022229 boy 257992 2008 Kathy 0.000129 girl 8 1880 Thomas 0.021401 boy 257993 2008 Laylah 0.000129 girl 9 1880 Henry 0.020641 boy 257994 2008 Riya 0.000129 girl 10 1880 Robert 0.020404 boy 257995 2008 Diya 0.000128 girl 11 1880 Edward 0.019965 boy 257996 2008 Carleigh 0.000128 girl 12 1880 Harry 0.018175 boy 257997 2008 Iyana 0.000128 girl 13 1880 Walter 0.014822 boy 257998 2008 Kenley 0.000127 girl 14 1880 Arthur 0.013504 boy 257999 2008 Sloane 0.000127 girl 15 1880 Fred 0.013251 boy 258000 2008 Elianna 0.000127 girl Tuesday, 7 July 2009
5. 5. Brainstorm What variables and summaries might you want to generate from this data? What questions would you like to be able to answer about the data? With your partner, you have 2 minutes to come up with as many as possible. Tuesday, 7 July 2009
6. 6. Some of my ideas • First/last letter • Rank • Length • Ecdf (how many babies have a • Number/percent name in the top of vowels 2, 3, 5, 100 etc) • Biblical names? Tuesday, 7 July 2009
7. 7. Transform & summarise transform(df, var1 = expr1, ...) summarise(df, var1 = expr1, ...) Transform modiﬁes an existing data frame. Summarise creates a new data frame. Tuesday, 7 July 2009
8. 8. letter <- function(x, n = 1) { Many interesting if (n < 0) { nc <- nchar(x) transformations and } n <- nc + n + 1 summaries can be tolower(substr(x, n, n)) calculated for the } vowels <- function(x) { whole dataset nchar(gsub("[^aeiou]", "", x)) } bnames <- transform(bnames, first = letter(name, 1), last = letter(name, -1), length = nchar(name), vowels = vowels(name) ) summarise(bnames, max_perc = max(percent), min_perc = min(percent)) Tuesday, 7 July 2009
9. 9. Group-wise What about group-wise transformations or summaries? e.g. what if we want to compute the rank of a name within a sex and year? This task is easy if we have a single year & sex, but hard otherwise. Tuesday, 7 July 2009
10. 10. one <- subset(bnames, sex == "boy" & year == 2008) one\$rank <- rank(-one\$percent, ties.method = "first") # or one <- transform(one, rank = rank(-percent, ties.method = "first")) head(one) What if we want to transform every sex and year? Tuesday, 7 July 2009
11. 11. # Split pieces <- split(bnames, list(bnames\$sex, bnames\$year)) # Apply results <- vector("list", length(pieces)) for(i in seq_along(pieces)) { piece <- pieces[[i]] piece <- transform(piece, rank = rank(-percent, ties.method = "first")) results[[i]] <- piece } # Combine result <- do.call("rbind", results) Tuesday, 7 July 2009
12. 12. # Or equivalently bnames <- ddply(bnames, c("sex", "year"), transform, rank = rank(-percent, ties.method = "first")) Tuesday, 7 July 2009
13. 13. Way to split Function to apply to Input data up input each piece # Or equivalently bnames <- ddply(bnames, c("sex", "year"), transform, rank = rank(-percent, ties.method = "first")) 2nd argument to transform() Tuesday, 7 July 2009
14. 14. ddply • .data: data frame to process • .variables: combination of variables to split by • .fun: function to call on each piece • ...: extra arguments passed to .fun Tuesday, 7 July 2009
15. 15. Variable speciﬁcation syntax • Character: c("sex", "year") • Numeric: 1:3 • Formula: ~ sex + year • Special: • .(sex, year) • .(first = letter(name, 1)) Tuesday, 7 July 2009
16. 16. Match function with use randomisation/permutation scale(x) tests scale to [0, 1] within each rank(x) group scale to mean 0, sd 1 x - min(x) / diff(range(x)) within each group compute per-group x / x[1] rankings sample(x) index a time series Tuesday, 7 July 2009
17. 17. Summaries In a similar way, we can use ddply() for group-wise summaries. There are many base R functions for special cases. Where available, these are often much faster; but you have to know they exist, and have to remember how to use them. Tuesday, 7 July 2009
18. 18. ddply(bnames, c("name"), summarise, tot = sum(percent)) ddply(bnames, c("length"), summarise, tot = sum(percent)) ddply(bnames, c("year", "sex"), summarise, tot = sum(percent)) fl <- ddply(bnames, c("year", "sex", "first"), summarise, tot = sum(percent)) library(ggplot2) qplot(year, tot, data = fl, geom = "line", colour = sex, facets = ~ first) Tuesday, 7 July 2009
19. 19. Challenge Create a plot that shows (by year) the proportion of US children who have a name in the top 100. Extra challenge: break it down by sex. What does this suggest about baby naming trends in the US? Tuesday, 7 July 2009
20. 20. 1.0 0.8 0.6 sex boy tot girl 0.4 0.2 0.0 1880 1900 1920 1940 1960 1980 2000 year Tuesday, 7 July 2009