Upcoming SlideShare
×

# LA R meetup - Nov 2013 - Eric Klusman

797 views

Published on

2 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

Views
Total views
797
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
10
0
Likes
2
Embeds 0
No embeds

No notes for slide

### LA R meetup - Nov 2013 - Eric Klusman

1. 1. plyer  for  Split-­‐Apply-­‐Combine     Automa4ng  one  pa6ern  of  data  munging  and  analysis     Eric  Klusman   2013-­‐11-­‐14
2. 2. What  is  plyr?   •  A  library  of  func4ons  for  R  for  doing  analysis  in   a  split-­‐apply-­‐combine  pa6ern   –  Split  the  data  into  subgroups   –  Apply  some  func4on  to  summarize,  model,  or  plot  each   subgroup   –  Combine  the  results  of  the  subgroups  back  together   •  Automate  the  loops  and  avoid  the   bookkeeping  code   •  Assump4on:  data  can  be  processed  piecewise     2
3. 3. Example:    Baby  Names   •  •  •  •  From  Hadley  Wickham,  h6p://plyr.had.co.nz/09-­‐user/     Top  1000  U.S.  boy  and  girl  baby  names  from  1880  to  2008   Derived  from  Social  Security  Administra4on  dataset   1000  *  2  *  129  =  258000  obs  on  4  vars   >  head(bnames)   >  tail(bnames)    year        name    percent  sex   1  1880        John  0.081541  boy   2  1880  William  0.080511  boy   3  1880      James  0.050057  boy   4  1880  Charles  0.045167  boy   5  1880    George  0.043292  boy   6  1880      Frank  0.027380  boy      year          name    percent    sex   257995  2008          Diya  0.000128  girl   257996  2008  Carleigh  0.000128  girl   257997  2008        Iyana  0.000128  girl   257998  2008      Kenley  0.000127  girl   257999  2008      Sloane  0.000127  girl   258000  2008    Elianna  0.000127  girl   3
4. 4. Groupwise  summaries   •  What  if  we  want  to  compute  the  rank  of  a   name  within  a  sex  and  year?   •  Easy  for  a  single  year  and  sex;  hard  in  general.   #  Split   pieces  <-­‐  split(bnames,  list(bnames\$sex,  bnames\$year))     #  Apply   results  <=  vector(“list”,  length(pieces))   for(i  in  seq_along(pieces))  {          piece  <-­‐  pieces[[i]]          piece  <-­‐  transform(piece,  rank  =  rank(-­‐percent,  ties.method=“first”))          results[[i]]  <-­‐  piece   }     #  Combine   result  <-­‐  do.call(“rbind”,  results)   4
5. 5. Using  plyr   bnames  <-­‐  ddply(bnames,  c(“sex”,  “year”),  transform,                                  rank  =  rank(-­‐percent,  ties.method=“first”))   5
6. 6. One-­‐line  summaries     ddply(bnames,  c(“name”),  summarize,  tot  =  sum(percent))     ddply(bnames,  c(“length”),  summarize,  tot  =  sum(percent))     ddply(bnames,  c(“year”,  “sex”),  summarize,  tot  =  sum(percent))     fl  <-­‐  ddply(bnames,  c(“year”,  “sex”,  “first”),  summarize,                                  tot  =  sum(percent))     library(ggplot2)     qplot(year,  tot,  data  =  fl,  geom  =  “line”,  color  =  “sex”,                    facets  =  ~  first)   6
7. 7. 7
8. 8. plyr  func4ons  are  named  by  their   input  and  output  types   ioply  where  i  is  the  input  type  and  o  is  the  output  type   Func%on   Input  data  type   Output  data  type   ddply   Data  frame   Data  frame   aaply   Array   Array   daply   Dataframe   Array   d_ply   Dataframe   None;  used  for  plo`ng  or   prin4ng   ldply   List   Dataframe   alply   Array   List   8
9. 9. Base  R  vs.  plyr   Base   func-on   aggregate   d   d   ddply  +  colwise   apply   a   a/l   aaply  /  alply   by   l   l   dlply   lapply   l   l   llply   mapply   a   a/l   maply  /  mlply   replicate   r   a/l   raply  /  rlply   sapply   l   a   laply   sweep   a   a   -­‐   tapply   •        Input   Output   plyr  func-on   a   a   -­‐   Input  and  output  types  are  indicated  by  ﬁrst  le6er:    array,  data  frame,  list,  replica4on   9