LA R meetup - Nov 2013 - Eric Klusman

Transcript of "LA R meetup - Nov 2013 - Eric Klusman"

  1. plyr for Split-Apply-Combine: Automating one pattern of data munging and analysis. Eric Klusman, 2013-11-14
  2. What is plyr?
     • A library of functions for R for doing analysis in a split-apply-combine pattern
       – Split the data into subgroups
       – Apply some function to summarize, model, or plot each subgroup
       – Combine the results of the subgroups back together
     • Automate the loops and avoid the bookkeeping code
     • Assumption: data can be processed piecewise
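The three steps above can be sketched in base R alone. This is a minimal illustration on a made-up data frame (`scores`, `group`, and `value` are hypothetical names, not part of the talk), showing the bookkeeping that plyr is meant to automate.

```r
# Toy data (hypothetical), used only to illustrate the three steps.
scores <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# Split: one data frame per group
pieces <- split(scores, scores$group)

# Apply: summarize each piece independently
means <- lapply(pieces, function(piece) {
  data.frame(group = piece$group[1], mean_value = mean(piece$value))
})

# Combine: stack the per-group results into one data frame
result <- do.call(rbind, means)
print(result)
```

Each piece is processed without reference to the others, which is the "piecewise" assumption the slide mentions.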
  3. Example: Baby Names
     • From Hadley Wickham, http://plyr.had.co.nz/09-user/
     • Top 1000 U.S. boy and girl baby names from 1880 to 2008
     • Derived from Social Security Administration dataset
     • 1000 * 2 * 129 = 258000 obs on 4 vars

         > head(bnames)
           year    name  percent sex
         1 1880    John 0.081541 boy
         2 1880 William 0.080511 boy
         3 1880   James 0.050057 boy
         4 1880 Charles 0.045167 boy
         5 1880  George 0.043292 boy
         6 1880   Frank 0.027380 boy

         > tail(bnames)
                year     name  percent  sex
         257995 2008     Diya 0.000128 girl
         257996 2008 Carleigh 0.000128 girl
         257997 2008    Iyana 0.000128 girl
         257998 2008   Kenley 0.000127 girl
         257999 2008   Sloane 0.000127 girl
         258000 2008  Elianna 0.000127 girl
  4. Groupwise summaries
     • What if we want to compute the rank of a name within a sex and year?
     • Easy for a single year and sex; hard in general.

         # Split
         pieces <- split(bnames, list(bnames$sex, bnames$year))

         # Apply
         results <- vector("list", length(pieces))
         for (i in seq_along(pieces)) {
           piece <- pieces[[i]]
           piece <- transform(piece, rank = rank(-percent, ties.method = "first"))
           results[[i]] <- piece
         }

         # Combine
         result <- do.call("rbind", results)
  5. Using plyr

         bnames <- ddply(bnames, c("sex", "year"), transform,
                         rank = rank(-percent, ties.method = "first"))
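Since the `bnames` dataset is not included here, the same call can be tried on a small stand-in. The sketch below assumes the plyr package is installed; the data frame `toy` and its values are made up for illustration and only mimic the column layout shown on the Baby Names slide.

```r
library(plyr)

# Toy stand-in for bnames (made-up values, not the real dataset)
toy <- data.frame(
  year    = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex     = "boy",
  name    = c("John", "William", "James", "John", "William", "James"),
  percent = c(0.08, 0.07, 0.05, 0.06, 0.09, 0.04)
)

# Rank names by descending percent within each sex/year group.
# transform keeps every input row and adds the new column.
ranked <- ddply(toy, c("sex", "year"), transform,
                rank = rank(-percent, ties.method = "first"))
print(ranked)
```

Because the function applied is `transform`, the result has one row per input row, with the rank computed separately inside each group.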
  6. One-line summaries

         ddply(bnames, c("name"), summarize, tot = sum(percent))

         ddply(bnames, c("length"), summarize, tot = sum(percent))

         ddply(bnames, c("year", "sex"), summarize, tot = sum(percent))

         fl <- ddply(bnames, c("year", "sex", "first"), summarize,
                     tot = sum(percent))

         library(ggplot2)

         # sex is left unquoted so that it is mapped to the column, not a constant
         qplot(year, tot, data = fl, geom = "line", color = sex,
               facets = ~ first)
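A small sketch of what `summarize` does inside `ddply`, assuming plyr is installed. The data frame `toy` is hypothetical and its values are invented; the point is only that each group collapses to a single row, in contrast to `transform`.

```r
library(plyr)

# Made-up data with the same column names as bnames
toy <- data.frame(
  year    = c(2000, 2000, 2000, 2001, 2001),
  sex     = c("boy", "girl", "girl", "boy", "girl"),
  percent = c(0.10, 0.20, 0.05, 0.30, 0.15)
)

# One output row per year/sex combination, holding the group total
totals <- ddply(toy, c("year", "sex"), summarize, tot = sum(percent))
print(totals)
```

The grouping columns are carried into the output automatically, so no bookkeeping is needed to label the totals.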
  7. (Figure only; this slide has no transcript text.)
  8. plyr functions are named by their input and output types
     ioply, where i is the input type and o is the output type

         Function   Input data type   Output data type
         ddply      Data frame        Data frame
         aaply      Array             Array
         daply      Data frame        Array
         d_ply      Data frame        None; used for plotting or printing
         ldply      List              Data frame
         alply      Array             List
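The naming rule can be seen by feeding the same list to three functions whose names differ only in the output letter. This is a sketch assuming plyr is installed; `samples` is a made-up input, and `laply` and `llply` are plyr functions that follow the same convention although they are not listed in the table above.

```r
library(plyr)

# Made-up list input: three numeric vectors
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# l -> d: list in, data frame out (one row per list element)
as_df <- ldply(samples, function(x) data.frame(n = length(x), total = sum(x)))

# l -> a: list in, array out (a vector in this one-dimensional case)
as_array <- laply(samples, sum)

# l -> l: list in, list out
as_list <- llply(samples, sum)

print(as_df)
print(as_array)
print(as_list)
```

Only the second letter changes between the three calls, and only the shape of the result changes with it.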
  9. Base R vs. plyr

         Base function   Input   Output   plyr function
         aggregate       d       d        ddply + colwise
         apply           a       a/l      aaply / alply
         by              d       l        dlply
         lapply          l       l        llply
         mapply          a       a/l      maply / mlply
         replicate       r       a/l      raply / rlply
         sapply          l       a        laply
         sweep           a       a        -
         tapply          a       a        -

     • Input and output types are indicated by first letter: array, data frame, list, replication
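Two rows of the comparison can be checked side by side on made-up inputs. The sketch assumes plyr is installed; `values` and `df` are hypothetical, and `numcolwise` (the numeric-columns variant of `colwise`) is used here as an assumption about how the "ddply + colwise" pairing might be written.

```r
library(plyr)

# lapply vs. llply: list in, list out
values <- list(a = 1:3, b = 4:6)
base_list <- lapply(values, sum)
plyr_list <- llply(values, sum)

# aggregate vs. ddply + colwise: data frame in, data frame out
df <- data.frame(g = c("x", "x", "y"), v = c(1, 2, 10))
base_df <- aggregate(v ~ g, data = df, FUN = sum)
plyr_df <- ddply(df, "g", numcolwise(sum))

print(base_df)
print(plyr_df)
```

In both pairs the plyr call returns the same values as the base R call; the difference is the uniform naming and argument order across input and output types.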