LA R meetup - Nov 2013 - Eric Klusman

Presentation Transcript

  • plyr for Split-Apply-Combine
    Automating one pattern of data munging and analysis
    Eric Klusman
    2013-11-14
  • What is plyr?
      • A library of functions for R for doing analysis in a split-apply-combine pattern
          – Split the data into subgroups
          – Apply some function to summarize, model, or plot each subgroup
          – Combine the results of the subgroups back together
      • Automate the loops and avoid the bookkeeping code
      • Assumption: data can be processed piecewise
  • Example: Baby Names
      • From Hadley Wickham, http://plyr.had.co.nz/09-user/
      • Top 1000 U.S. boy and girl baby names from 1880 to 2008
      • Derived from Social Security Administration dataset
      • 1000 * 2 * 129 = 258000 obs on 4 vars

        > head(bnames)
          year    name  percent sex
        1 1880    John 0.081541 boy
        2 1880 William 0.080511 boy
        3 1880   James 0.050057 boy
        4 1880 Charles 0.045167 boy
        5 1880  George 0.043292 boy
        6 1880   Frank 0.027380 boy

        > tail(bnames)
               year     name  percent  sex
        257995 2008     Diya 0.000128 girl
        257996 2008 Carleigh 0.000128 girl
        257997 2008    Iyana 0.000128 girl
        257998 2008   Kenley 0.000127 girl
        257999 2008   Sloane 0.000127 girl
        258000 2008  Elianna 0.000127 girl
  • Groupwise summaries
      • What if we want to compute the rank of a name within a sex and year?
      • Easy for a single year and sex; hard in general.

        # Split: one data frame per sex/year combination
        pieces <- split(bnames, list(bnames$sex, bnames$year))

        # Apply: rank names within each piece (rank 1 = most popular)
        results <- vector("list", length(pieces))
        for (i in seq_along(pieces)) {
          piece <- pieces[[i]]
          piece <- transform(piece, rank = rank(-percent, ties.method = "first"))
          results[[i]] <- piece
        }

        # Combine: stack the pieces back into one data frame
        result <- do.call("rbind", results)
  • Using plyr

        library(plyr)

        # Same split, apply, and combine steps in a single call
        bnames <- ddply(bnames, c("sex", "year"), transform,
                        rank = rank(-percent, ties.method = "first"))
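To see what the `ddply` call produces without the full dataset, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up for illustration and are not taken from the Social Security Administration data; only the `ddply` call mirrors the slide.

```r
library(plyr)

# Toy stand-in for bnames (made-up values, not the real data)
toy <- data.frame(
  year    = c(1880, 1880, 1880, 1880, 1881, 1881),
  name    = c("John", "William", "Mary", "Anna", "John", "William"),
  percent = c(0.08, 0.09, 0.07, 0.03, 0.10, 0.05),
  sex     = c("boy", "boy", "girl", "girl", "boy", "boy"),
  stringsAsFactors = FALSE
)

# Rank names within each sex/year group; rank 1 is the largest percent
ranked <- ddply(toy, c("sex", "year"), transform,
                rank = rank(-percent, ties.method = "first"))

print(ranked)
```

Each sex/year group gets its own ranking, so the ranks restart at 1 in every group rather than running across the whole data frame.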
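  • One-line summaries

        # Note: `length` and `first` are not among the four columns shown for
        # bnames; they are presumably derived columns added beforehand.
        ddply(bnames, c("name"), summarize, tot = sum(percent))
        ddply(bnames, c("length"), summarize, tot = sum(percent))
        ddply(bnames, c("year", "sex"), summarize, tot = sum(percent))

        fl <- ddply(bnames, c("year", "sex", "first"), summarize,
                    tot = sum(percent))

        library(ggplot2)
        # color = sex (unquoted) maps the column; a quoted "sex" would be a constant
        qplot(year, tot, data = fl, geom = "line", color = sex,
              facets = ~ first)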
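The summaries above group by `length` and `first`, which the slides never define. The sketch below shows one plausible way to derive them, assuming `first` is the lower-cased first letter of the name and `length` is its character count. Both definitions, and the toy data, are assumptions for illustration rather than the presenter's code.

```r
library(plyr)

# Toy stand-in for bnames (made-up values)
toy <- data.frame(
  year    = c(1880, 1880, 1880, 1881, 1881, 1881),
  name    = c("John", "James", "Mary", "John", "James", "Mary"),
  percent = c(0.08, 0.05, 0.07, 0.09, 0.04, 0.06),
  sex     = c("boy", "boy", "girl", "boy", "boy", "girl"),
  stringsAsFactors = FALSE
)

# Assumed definitions of the derived columns (not shown in the slides)
toy <- transform(toy,
                 first  = tolower(substr(name, 1, 1)),
                 length = nchar(name))

# Total share of names by year, sex, and first letter
fl <- ddply(toy, c("year", "sex", "first"), summarize, tot = sum(percent))

print(fl)
```

With the derived columns in place, each one-line summary is just a different choice of grouping variables passed to `ddply`.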
  • plyr functions are named by their input and output types
    ioply, where i is the input type and o is the output type

        Function   Input data type   Output data type
        ddply      Data frame        Data frame
        aaply      Array             Array
        daply      Data frame        Array
        d_ply      Data frame        None; used for plotting or printing
        ldply      List              Data frame
        alply      Array             List
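The naming scheme can be tried out on a small example. The sketch below applies the same group mean through several variants so that only the output type changes; the data frame `scores` and its values are hypothetical.

```r
library(plyr)

# Hypothetical data: two groups with a numeric value
scores <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# d -> d: data frame in, data frame out
means_df <- ddply(scores, "group", summarize, mean_value = mean(value))

# d -> l: data frame in, list out (one element per group)
means_list <- dlply(scores, "group", function(piece) mean(piece$value))

# d -> a: data frame in, array out
means_arr <- daply(scores, "group", function(piece) mean(piece$value))

# l -> d: list in, data frame out
back_df <- ldply(list(a = 2, b = 4), function(x) data.frame(mean_value = x))
```

The first letter tells you what to pass in and the second tells you what comes back, so switching output types only requires changing one letter in the function name.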
  • Base R vs. plyr

        Base function   Input   Output   plyr function
        aggregate       d       d        ddply + colwise
        apply           a       a/l      aaply / alply
        by              d       l        dlply
        lapply          l       l        llply
        mapply          a       a/l      maply / mlply
        replicate       r       a/l      raply / rlply
        sapply          l       a        laply
        sweep           a       a        -
        tapply          a       a        -

      • Input and output types are indicated by their first letter: a = array, d = data frame, l = list, r = replication
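Two rows of the table can be checked side by side. This is a small sketch with a made-up list `xs`, comparing `lapply` with `llply` and `sapply` with `laply`.

```r
library(plyr)

# Made-up input: a named list of integer vectors
xs <- list(a = 1:3, b = 4:6)

# List in, list out
base_list <- lapply(xs, sum)
plyr_list <- llply(xs, sum)

# List in, vector/array out
base_vec <- sapply(xs, sum)
plyr_vec <- laply(xs, sum)

# Compare the values only; attributes such as names may differ
same_list <- isTRUE(all.equal(unlist(base_list, use.names = FALSE),
                              unlist(plyr_list, use.names = FALSE)))
same_vec  <- isTRUE(all.equal(as.numeric(base_vec), as.numeric(plyr_vec)))
```

The values agree in both cases. The comparison deliberately ignores attributes, since the base and plyr versions are not guaranteed to label their results identically.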