plyr: One data-analytic strategy
Hadley Wickham, Rice University
Friday, 29 May 2009

Plyr, one data analytic strategy

  1. plyr: One data-analytic strategy. Hadley Wickham, Rice University. Friday, 29 May 2009
  2. Outline
       1. Motivation: deseasonalising ozone measurements
       2. Outline of strategy: split-apply-combine
       3. Specifics: input vs. output
       4. Fiddly details
       5. Thoughts on data analysis
  3. 24 x 24 x 72 = 41,472 ozone measurements (24 x 24 locations, 72 values per location). [Figure: plot of the measurement grid; axes span −20 to 30 and −110 to −60.]
  5. [Figure: points plotted on axes that both run from −1.0 to 1.0.]
  6. [Figure: two panels; the first has axes from −1.0 to 1.0, the second from 0.0 to 1.0.]
  7. [Figure: value (0.3 to 1.0) against time (0.0 to 1.0).]
  8. [Figure: resid(deseas1) + mean(one$value), on a scale from 0.3 to 1.0, against time (0.0 to 1.0).]
  10. How can we do this for all 24 x 24 locations? (Assume the ozone levels are stored in a 24 x 24 x 72 array.)
  11. With a for loop:

          models <- as.list(rep(NA, 24 * 24))
          dim(models) <- c(24, 24)

          deseas <- array(NA, c(24, 24, 72))
          dimnames(deseas) <- dimnames(ozone)

          for (i in seq_len(24)) {
            for (j in seq_len(24)) {
              mod <- deseasf(ozone[i, j, ])
              models[[i, j]] <- mod
              deseas[i, j, ] <- resid(mod)
            }
          }
  13. With apply:

          models <- apply(ozone, 1:2, deseasf)
          resids <- unlist(lapply(models, resid))
          dim(resids) <- c(72, 24, 24)
          deseas <- aperm(resids, c(2, 3, 1))
          dimnames(deseas) <- dimnames(ozone)
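
      The reshaping steps in the apply version are needed because base `apply` places whatever the function returns in the first dimension of its result. The sketch below shows this on a toy array; `toy` and `centre` are hypothetical stand-ins for `ozone` and the residual step, and `centre` returns numbers directly rather than a model.

      ```r
      # Toy stand-in for the ozone array: a 2 x 3 grid with 4 values per cell.
      toy <- array(seq_len(2 * 3 * 4), c(2, 3, 4))

      # Hypothetical per-location function that returns one value per time point.
      centre <- function(x) x - mean(x)

      # apply() puts the function's output in the FIRST dimension: 4 x 2 x 3.
      raw <- apply(toy, 1:2, centre)
      dim(raw)

      # aperm() moves that dimension to the end so the result lines up with the input.
      fixed <- aperm(raw, c(2, 3, 1))
      dim(fixed)
      ```
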
  15. With plyr:

          models <- aaply(ozone, 1:2, deseasf)
          deseas <- aaply(models, 1:2, resid)

      Succinct, but you need to know what aaply does (cf. onomatopoeia, schadenfreude, soliloquy).
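
      To see what `aaply` does without the ozone data, here is a minimal sketch on a toy array. It assumes the plyr package is installed; `toy` and `remove_month_means` are hypothetical and only stand in for `ozone` and the talk's `deseasf`, whose definition is not shown in the slides.

      ```r
      library(plyr)

      # Toy stand-in for the ozone array: a 2 x 3 grid with 24 monthly values per cell.
      set.seed(1)
      toy <- array(rnorm(2 * 3 * 24), c(2, 3, 24))

      # Hypothetical deseasonaliser (not the talk's deseasf): subtract the mean of
      # each calendar month from the values observed in that month.
      remove_month_means <- function(x) {
        month <- rep(1:12, length.out = length(x))
        x - ave(x, month)
      }

      # Split by the first two dimensions, apply, and combine back into an array.
      deseas <- aaply(toy, 1:2, remove_month_means)
      dim(deseas)  # split dimensions come first, then the dimension of each result

      # The same computation with explicit loops, for comparison.
      check <- array(NA_real_, dim(toy))
      for (i in 1:2) for (j in 1:3) check[i, j, ] <- remove_month_means(toy[i, j, ])
      all.equal(check, deseas, check.attributes = FALSE)
      ```
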
  16. [Figure: plot coloured by avg (legend from 250 to 310); axes span −20 to 30 and −110 to −60.]
  17. [Figure: plot with axes spanning −20 to 30 and −110 to −60.]
  18. Many problems involve splitting up a large data structure, operating on each piece, and joining the results back together: split-apply-combine.
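
      The three steps can be written out by hand in base R, which makes it clear what plyr automates. This is only an illustrative sketch; the `readings` data frame and its columns are made up.

      ```r
      # Hypothetical measurements for two sites.
      readings <- data.frame(
        site  = c("a", "a", "b", "b", "b"),
        value = c(1, 3, 10, 20, 30)
      )

      # Split: one data frame per site.
      pieces <- split(readings, readings$site)

      # Apply: summarise each piece on its own.
      summaries <- lapply(pieces, function(piece) {
        data.frame(site = piece$site[1], mean_value = mean(piece$value))
      })

      # Combine: stack the per-site summaries into one data frame.
      result <- do.call(rbind, summaries)
      result
      ```
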
  19. How you split up depends on the type of input: arrays, data frames, lists. How you combine depends on the type of output: arrays, data frames, lists, nothing.
  20. plyr functions by input type (rows) and output type (columns):

                        array    data frame   list     nothing
          array         aaply    adply        alply    a_ply
          data frame    daply    ddply        dlply    d_ply
          list          laply    ldply        llply    l_ply
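
      The naming scheme can be tried out on one small input. The sketch below, which assumes plyr is installed, feeds the same hypothetical list `pieces` to the four list-input functions and changes only the output type.

      ```r
      library(plyr)

      pieces <- list(a = 1:3, b = 4:6)

      # list in, list out
      as_list <- llply(pieces, sum)

      # list in, array (here a plain vector) out
      as_array <- laply(pieces, sum)

      # list in, data frame out
      as_frame <- ldply(pieces, function(x) data.frame(total = sum(x)))

      # list in, nothing out: called only for its side effect
      l_ply(pieces, function(x) cat(sum(x), "\n"))
      ```
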
  21. The same table with base R equivalents where they exist:

                        array    data frame   list     nothing
          array         apply    adply        alply    a_ply
          data frame    daply    aggregate    by       d_ply
          list          sapply   ldply        lapply   l_ply
  22. Split: array, data frame, list. [Diagram: a 2-d array split by dimension 1, by dimension 2, or by both (1,2).]
  23. Split: array, data frame, list. [Diagram: a 3-d array split by dimension 1, 2, or 3; by 1,2, 1,3, or 2,3; or by 1,2,3.]
  24. Take a 3-d array and split it up by the first two dimensions.

          models <- aaply(ozone, 1:2, deseasf)
          deseas <- aaply(models, 1:2, resid)

      Splitting up ozone gives 576 vectors of length 72. Splitting up models gives 576 rlm models. How are they combined?
  25. Combine: array, data frame, list. [Diagram annotated "4D!"]
  26. Combine: array, data frame, list.
  27. Split: array, data frame, list. A data frame is split by the values of one or more variables.

      Original data frame:

          name      age   sex
          John      13    Male
          Mary      15    Female
          Alice     14    Female
          Peter     13    Male
          Roger     14    Male
          Phyllis   13    Female

      Split by .(sex):

          name      age   sex
          John      13    Male
          Peter     13    Male
          Roger     14    Male

          name      age   sex
          Mary      15    Female
          Alice     14    Female
          Phyllis   13    Female

      Split by .(age):

          name      age   sex
          John      13    Male
          Peter     13    Male
          Phyllis   13    Female

          name      age   sex
          Alice     14    Female
          Roger     14    Male

          name      age   sex
          Mary      15    Female
  28. Combine: array, data frame, list. Applying nrow to each piece:

      .(sex):

          sex      value
          Male     3
          Female   3

      .(age):

          age   value
          13    3
          14    2
          15    1

      .(sex, age):

          sex      age   value
          Male     13    2
          Male     14    1
          Female   13    1
          Female   14    1
          Female   15    1
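
      These counts can be reproduced from the six-person data frame. The sketch below assumes plyr is installed; the count column in the result may be named differently from the `value` shown on the slide, so it is best read by position.

      ```r
      library(plyr)

      # The six people listed on the previous slide.
      people <- data.frame(
        name = c("John", "Mary", "Alice", "Peter", "Roger", "Phyllis"),
        age  = c(13, 15, 14, 13, 14, 13),
        sex  = c("Male", "Female", "Female", "Male", "Male", "Female")
      )

      # Split by the named variables, apply nrow to each piece,
      # and combine the counts into a data frame.
      by_sex     <- ddply(people, .(sex), nrow)
      by_age     <- ddply(people, .(age), nrow)
      by_sex_age <- ddply(people, .(sex, age), nrow)
      ```
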
  29. Case study: Baseball
  30. The data: 21,699 records, 1228 players, 15-31 years for each player. An extract:

          id         year   team   g     ab    r     h
          ruthba01   1914   BOS    5     10    1     2
          ruthba01   1915   BOS    42    92    16    29
          ruthba01   1916   BOS    67    136   18    37
          ruthba01   1917   BOS    52    123   14    40
          ruthba01   1918   BOS    95    317   50    95
          ruthba01   1919   BOS    130   432   103   139
          ruthba01   1920   NYA    142   457   158   172
          ruthba01   1921   NYA    152   540   177   204
          ruthba01   1922   NYA    110   406   94    128
          ruthba01   1923   NYA    152   522   151   205
          ruthba01   1924   NYA    153   529   143   200
          ruthba01   1925   NYA    98    359   61    104
          ruthba01   1926   NYA    152   495   139   184
          ruthba01   1927   NYA    151   540   158   192
          ruthba01   1928   NYA    154   536   163   173
          ruthba01   1929   NYA    135   499   121   172
  31. How does performance (rbi/ab) change over the course of a career? First we need to add a column that gives the "career year". This is easy for a single player:

          baberuth <- subset(baseball, id == "ruthba01")
          baberuth <- transform(baberuth, cyear = year - min(year) + 1)

      For many players, use ddply + transform:

          baseball <- ddply(baseball, "id", transform, cyear = year - min(year) + 1)
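
      The ddply + transform step can be checked on a tiny example. The sketch below assumes plyr is installed and uses a made-up two-player data frame `careers`; the base R lines are only a cross-check, not part of the talk.

      ```r
      library(plyr)

      # Hypothetical two-player extract with the columns the slide's code uses.
      careers <- data.frame(
        id   = c("a", "a", "a", "b", "b"),
        year = c(1914, 1915, 1916, 1920, 1922)
      )

      # plyr: transform each player's rows separately, then recombine.
      with_plyr <- ddply(careers, "id", transform, cyear = year - min(year) + 1)

      # Base R cross-check: ave() returns each player's first year, row by row.
      first_year <- ave(careers$year, careers$id, FUN = min)
      with_base  <- careers$year - first_year + 1
      ```
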
  32. Draw time series for all 1228 players:

          baseball <- subset(baseball, ab >= 25)
          xlim <- range(baseball$cyear, na.rm = TRUE)
          ylim <- range(baseball$rbi / baseball$ab, na.rm = TRUE)
          plotpattern <- function(df) {
            qplot(cyear, rbi / ab, data = df, geom = "line", xlim = xlim, ylim = ylim)
          }

          pdf("paths.pdf", width = 8, height = 4)
          d_ply(baseball, .(reorder(id, rbi / ab)), failwith(NA, plotpattern), .print = TRUE)
          dev.off()
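
      The call to `failwith` keeps one bad piece from stopping the whole run. A minimal sketch of that behaviour, assuming plyr is installed; `risky` and `pieces` are hypothetical, and `quiet = TRUE` is used here only to suppress the error message.

      ```r
      library(plyr)

      # Hypothetical function that fails on some pieces.
      risky <- function(x) {
        if (any(x < 0)) stop("negative input")
        sum(sqrt(x))
      }

      # failwith() wraps a function so that an error yields the default value instead.
      safe <- failwith(NA, risky, quiet = TRUE)

      pieces  <- list(ok = c(1, 4), bad = c(-1, 4))
      results <- llply(pieces, safe)
      results
      ```
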
  33. [Figure: count (0 to 200) against rsquare (0.0 to 1.0).]
  34. [Figure: two panels of intercept against slope, each with an rsquare legend from 0.00 to 1.00.]
  35. Fiddly details
       - Labelling
       - Progress bars
       - Consistent argument names
       - Missing values / Nulls
  36. Data analysis
       - What other patterns of data analysis are waiting to be discovered?
       - How can we identify these strategies and then develop software to support them?
       - Does teaching these patterns make it easier for novices to become experts?
  37. http://had.co.nz/plyr