02 Ddply

Hadley Wickham, Rice University

plyr
US baby names

Hadley Wickham
Tuesday, 7 July 2009

1. Introduction to the data
2. Transformations and summaries
3. Group-wise transformation and summary
4. Variable selection syntax
5. Challenge

Baby names

Top 1000 male and female baby names in the US, from 1880 to 2008.
258,000 records (1000 names * 2 sexes * 129 years), but only four variables: year, name, sex and percent.

CC BY http://www.flickr.com/photos/the_light_show/2586781132

> head(bnames, 15)
   year    name  percent sex
1  1880    John 0.081541 boy
2  1880 William 0.080511 boy
3  1880   James 0.050057 boy
4  1880 Charles 0.045167 boy
5  1880  George 0.043292 boy
6  1880   Frank 0.027380 boy
7  1880  Joseph 0.022229 boy
8  1880  Thomas 0.021401 boy
9  1880   Henry 0.020641 boy
10 1880  Robert 0.020404 boy
11 1880  Edward 0.019965 boy
12 1880   Harry 0.018175 boy
13 1880  Walter 0.014822 boy
14 1880  Arthur 0.013504 boy
15 1880    Fred 0.013251 boy

> tail(bnames, 15)
       year     name  percent  sex
257986 2008   Neveah 0.000130 girl
257987 2008   Amaris 0.000129 girl
257988 2008 Hadassah 0.000129 girl
257989 2008    Dania 0.000129 girl
257990 2008   Hailie 0.000129 girl
257991 2008   Jamiya 0.000129 girl
257992 2008    Kathy 0.000129 girl
257993 2008   Laylah 0.000129 girl
257994 2008     Riya 0.000129 girl
257995 2008     Diya 0.000128 girl
257996 2008 Carleigh 0.000128 girl
257997 2008    Iyana 0.000128 girl
257998 2008   Kenley 0.000127 girl
257999 2008   Sloane 0.000127 girl
258000 2008  Elianna 0.000127 girl

Brainstorm

What variables and summaries might you want to generate from this data? What questions would you like to be able to answer about the data?

With your partner, you have 2 minutes to come up with as many as possible.

Some of my ideas

• First/last letter
• Length
• Number/percent of vowels
• Biblical names?
• Rank
• Ecdf (how many babies have a name in the top 2, 3, 5, 100 etc)

Transform & summarise

    transform(df, var1 = expr1, ...)
    summarise(df, var1 = expr1, ...)

Transform returns the existing data frame with columns added or modified. Summarise creates a new data frame that holds only the computed summaries.

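As a rough illustration of the difference, the sketch below applies both functions to a small made-up data frame. The `toy` data and its values are assumptions for illustration, not part of the baby names data, and the block assumes the plyr package is installed, since summarise() comes from plyr while transform() is base R.

```r
library(plyr)  # provides summarise(); transform() is base R

# Hypothetical three-row data frame, invented for illustration only
toy <- data.frame(
  name = c("John", "Mary", "Anna"),
  percent = c(0.08, 0.07, 0.03),
  stringsAsFactors = FALSE
)

# transform() keeps every row and returns the data frame with an extra column
toy_t <- transform(toy, length = nchar(name))

# summarise() returns a new data frame containing only the summary columns
toy_s <- summarise(toy, max_perc = max(percent), min_perc = min(percent))

dim(toy_t)  # 3 rows, 3 columns
dim(toy_s)  # 1 row, 2 columns
```

The row count is the quickest way to tell the two apart: transform() preserves it, summarise() collapses it.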
Many interesting transformations and summaries can be calculated for the whole dataset

    # Extract the n-th letter of each name; a negative n counts from the end
    letter <- function(x, n = 1) {
      if (n < 0) {
        nc <- nchar(x)
        n <- nc + n + 1
      }
      tolower(substr(x, n, n))
    }

    # Count the vowels in each name; tolower() ensures that a capitalised
    # leading vowel (as in "Anna") is counted too
    vowels <- function(x) {
      nchar(gsub("[^aeiou]", "", tolower(x)))
    }

    bnames <- transform(bnames,
      first = letter(name, 1),
      last = letter(name, -1),
      length = nchar(name),
      vowels = vowels(name)
    )

    summarise(bnames,
      max_perc = max(percent),
      min_perc = min(percent))

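To check what the two helpers return, the following sketch calls them on three names. The helper definitions are restated so the block runs on its own; the `tolower()` call inside `vowels()` is an assumption added here so that capitalised vowels are counted, and the sample vector `nm` is made up.

```r
# Helper functions restated from the slide so this block is self-contained
letter <- function(x, n = 1) {
  if (n < 0) {
    nc <- nchar(x)
    n <- nc + n + 1   # negative n counts back from the end of each name
  }
  tolower(substr(x, n, n))
}
vowels <- function(x) {
  # tolower() is an addition here so that a leading capital vowel is counted
  nchar(gsub("[^aeiou]", "", tolower(x)))
}

nm <- c("John", "William", "Anna")  # hypothetical sample of names

first_letters <- letter(nm, 1)    # "j" "w" "a"
last_letters  <- letter(nm, -1)   # "n" "m" "a"
vowel_counts  <- vowels(nm)       # 1 3 2
```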
Group-wise

What about group-wise transformations or summaries? e.g. what if we want to compute the rank of a name within a sex and year?

This task is easy if we have a single year & sex, but hard otherwise.

    one <- subset(bnames, sex == "boy" & year == 2008)
    one$rank <- rank(-one$percent, ties.method = "first")

    # or
    one <- transform(one, rank = rank(-percent, ties.method = "first"))
    head(one)

What if we want to transform every sex and year?

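The ranking call relies on two details: negating `percent` so that the most popular name gets rank 1, and `ties.method = "first"` so that tied values still receive distinct ranks. A minimal sketch on made-up numbers (the `percent` vector is hypothetical):

```r
percent <- c(0.5, 0.2, 0.5, 0.1)   # hypothetical values, with one tie

# Negating puts the largest percent at rank 1; ties.method = "first"
# breaks the tie in favour of whichever value appears earlier
r <- rank(-percent, ties.method = "first")
r  # 1 3 2 4
```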
    # Split
    pieces <- split(bnames, list(bnames$sex, bnames$year))

    # Apply
    results <- vector("list", length(pieces))
    for(i in seq_along(pieces)) {
      piece <- pieces[[i]]
      piece <- transform(piece, rank = rank(-percent, ties.method = "first"))
      results[[i]] <- piece
    }

    # Combine
    result <- do.call("rbind", results)

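The same three steps can be tried end to end on a small stand-in for `bnames`. The sketch below uses only base R and swaps the explicit loop for `lapply()`, which is a stylistic substitution rather than the approach shown on the slide; the `toy` data frame is made up.

```r
# Hypothetical miniature of bnames: two names per sex and year
toy <- data.frame(
  year = rep(c(1880, 1881), each = 4),
  sex = rep(c("boy", "boy", "girl", "girl"), times = 2),
  name = c("John", "James", "Mary", "Anna", "James", "John", "Anna", "Mary"),
  percent = c(0.08, 0.05, 0.07, 0.03, 0.06, 0.09, 0.04, 0.06),
  stringsAsFactors = FALSE
)

# Split: one data frame per sex/year combination
pieces <- split(toy, list(toy$sex, toy$year))

# Apply: rank names within each piece, most popular first
ranked <- lapply(pieces, function(piece) {
  transform(piece, rank = rank(-percent, ties.method = "first"))
})

# Combine: stack the pieces back into a single data frame
result <- do.call("rbind", ranked)
rownames(result) <- NULL
```

Each of the four sex/year groups ends up with ranks 1 and 2, so `result` has the same eight rows as `toy` plus a `rank` column.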
    # Or equivalently
    library(plyr)  # ddply() comes from the plyr package
    bnames <- ddply(bnames, c("sex", "year"), transform,
      rank = rank(-percent, ties.method = "first"))

    bnames <- ddply(
      bnames,              # Input data
      c("sex", "year"),    # Way to split up input
      transform,           # Function to apply to each piece
      rank = rank(-percent, ties.method = "first"))  # 2nd argument to transform()

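For a version that can be run without the full dataset, the sketch below makes the same ddply() call on a small made-up data frame. The `toy` data is an assumption for illustration, and the plyr package is assumed to be installed.

```r
library(plyr)

# Hypothetical miniature of bnames: two names per sex and year
toy <- data.frame(
  year = rep(c(1880, 1881), each = 4),
  sex = rep(c("boy", "boy", "girl", "girl"), times = 2),
  name = c("John", "James", "Mary", "Anna", "James", "John", "Anna", "Mary"),
  percent = c(0.08, 0.05, 0.07, 0.03, 0.06, 0.09, 0.04, 0.06),
  stringsAsFactors = FALSE
)

# Split by sex and year, apply transform() to each piece, combine the results
ranked <- ddply(toy, c("sex", "year"), transform,
  rank = rank(-percent, ties.method = "first"))
```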
ddply

• .data: data frame to process
• .variables: combination of variables to split by
• .fun: function to call on each piece
• ...: extra arguments passed to .fun

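The four arguments can be seen working together in the sketch below, which passes a user-written function as `.fun` and forwards an extra argument through `...`. The helper `top_k()` and the `toy` data are hypothetical names introduced only for this example.

```r
library(plyr)

# Hypothetical data: two names per sex
toy <- data.frame(
  sex = c("boy", "boy", "girl", "girl"),
  name = c("John", "James", "Mary", "Anna"),
  percent = c(0.08, 0.05, 0.07, 0.03),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the k most popular rows of one piece
top_k <- function(piece, k) {
  head(piece[order(-piece$percent), ], k)
}

# .data = toy, .variables = "sex", .fun = top_k; k = 1 travels through ...
top <- ddply(toy, "sex", top_k, k = 1)
```

Here `top` holds one row per sex: the most popular name in each group.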
Variable specification syntax

• Character: c("sex", "year")
• Numeric: 1:3
• Formula: ~ sex + year
• Special:
  • .(sex, year)
  • .(first = letter(name, 1))

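A short sketch of how the character, formula and special forms can be used interchangeably, run on a made-up data frame (the `toy` data is an assumption, and the numeric form is left out). The last call shows the special form computing a grouping variable on the fly; it uses `substr()` directly rather than the `letter()` helper so that the block stays self-contained.

```r
library(plyr)

# Hypothetical miniature of bnames
toy <- data.frame(
  year = rep(c(1880, 1881), each = 4),
  sex = rep(c("boy", "boy", "girl", "girl"), times = 2),
  name = c("John", "James", "Mary", "Anna", "James", "John", "Anna", "Mary"),
  percent = c(0.08, 0.05, 0.07, 0.03, 0.06, 0.09, 0.04, 0.06),
  stringsAsFactors = FALSE
)

# Three spellings of the same split
by_chr <- ddply(toy, c("sex", "year"), summarise, tot = sum(percent))
by_fml <- ddply(toy, ~ sex + year, summarise, tot = sum(percent))
by_dot <- ddply(toy, .(sex, year), summarise, tot = sum(percent))

# The special form also accepts named expressions evaluated within the data
by_first <- ddply(toy, .(first = tolower(substr(name, 1, 1))), summarise,
  tot = sum(percent))
```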
Match function with use

Functions:
    scale(x)
    rank(x)
    (x - min(x)) / diff(range(x))
    x / x[1]
    sample(x)

Uses:
• randomisation/permutation tests
• scale to [0, 1] within each group
• scale to mean 0, sd 1 within each group
• compute per-group rankings
• index a time series

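As one worked instance, the range-based expression can be dropped into ddply() with transform() to rescale a variable within each group. The parentheses around the subtraction matter, because division binds more tightly than subtraction in R. The `toy` data below is made up, and the plyr package is assumed to be installed.

```r
library(plyr)

# Hypothetical data: three values per sex
toy <- data.frame(
  sex = c("boy", "boy", "boy", "girl", "girl", "girl"),
  percent = c(0.02, 0.05, 0.08, 0.01, 0.03, 0.07)
)

# Within each sex, subtract the group minimum and divide by the group range
scaled <- ddply(toy, "sex", transform,
  scaled = (percent - min(percent)) / diff(range(percent)))
```

Within each group the smallest value maps to 0 and the largest to 1.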
Summaries

In a similar way, we can use ddply() for group-wise summaries.

There are many base R functions for special cases. Where available, these are often much faster; but you have to know they exist, and have to remember how to use them.

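To make the comparison concrete, the sketch below computes the same per-group total twice: once with ddply() and once with `tapply()`, chosen here as one example of a base R special-case function. The `toy` data is made up, and no timing claim is tested.

```r
library(plyr)

# Hypothetical data: two names per sex
toy <- data.frame(
  sex = c("boy", "boy", "girl", "girl"),
  percent = c(0.08, 0.05, 0.07, 0.03)
)

# General-purpose route: returns a data frame with one row per sex
tot_ddply <- ddply(toy, "sex", summarise, tot = sum(percent))

# Base R special case: returns a named vector with one entry per sex
tot_base <- tapply(toy$percent, toy$sex, sum)
```

The two results carry the same numbers in different containers, which is part of what has to be remembered when using the base R alternatives.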
    ddply(bnames, c("name"), summarise, tot = sum(percent))
    ddply(bnames, c("length"), summarise, tot = sum(percent))
    ddply(bnames, c("year", "sex"), summarise, tot = sum(percent))

    fl <- ddply(bnames, c("year", "sex", "first"), summarise,
      tot = sum(percent))
    library(ggplot2)
    qplot(year, tot, data = fl, geom = "line", colour = sex,
      facets = ~ first)

Challenge

Create a plot that shows (by year) the proportion of US children who have a name in the top 100.

Extra challenge: break it down by sex. What does this suggest about baby naming trends in the US?

[Figure: tot (y-axis, 0.0 to 1.0) against year (x-axis, 1880 to 2000), with a legend for sex (boy, girl).]
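One possible way to compute the data behind a figure like this is sketched below. It is an assumption about how the pieces of the deck fit together, not a solution taken from the slides. The `toy` data is made up and uses a cutoff of 2 instead of 100 (the variable `top_n` is hypothetical) so that the filter has a visible effect on such a small example.

```r
library(plyr)

# Hypothetical miniature of bnames: three names per sex and year
toy <- data.frame(
  year = rep(c(1880, 2008), each = 6),
  sex = rep(rep(c("boy", "girl"), each = 3), times = 2),
  percent = c(0.08, 0.05, 0.02, 0.07, 0.03, 0.01,
              0.02, 0.01, 0.01, 0.02, 0.01, 0.005)
)

top_n <- 2  # the challenge uses 100; 2 keeps the toy example meaningful

# Rank within each sex and year, most popular first
ranked <- ddply(toy, c("sex", "year"), transform,
  rank = rank(-percent, ties.method = "first"))

# Keep the top names, then total their share per year and sex
top <- ddply(subset(ranked, rank <= top_n), c("year", "sex"), summarise,
  tot = sum(percent))
```

With ggplot2 loaded, a call following the pattern used earlier in the deck, such as `qplot(year, tot, data = top, geom = "line", colour = sex)`, would then draw one line per sex.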