Plyr, one data analytic strategy

Published in: Technology, Education

Transcript

  • 1. plyr: One data-analytic strategy. Hadley Wickham, Rice University. Friday, 29 May 2009
  • 2.
      1. Motivation: deseasonalising ozone measurements
      2. Outline of strategy: split-apply-combine
      3. Specifics: input vs. output
      4. Fiddly details
      5. Thoughts on data analysis
  • 3. 24 x 24 x 72 = 41,472 [Figure: values plotted over a grid; horizontal axis from −110 to −60, vertical axis from −20 to 30]
  • 5. [Figure: plot with both axes running from −1.0 to 1.0]
  • 6. [Figure: two panels; the first with both axes from −1.0 to 1.0, the second with both axes from 0.0 to 1.0]
  • 7. [Figure: value (0.3 to 1.0) plotted against time (0.0 to 1.0)]
  • 8. [Figure: resid(deseas1) + mean(one$value) (0.3 to 1.0) plotted against time (0.0 to 1.0)]
  • 10. How can we do this for all 24 x 24 locations? (Assume the ozone levels are stored in a 24 x 24 x 72 array.)
  • 11. With a for loop:
        models <- as.list(rep(NA, 24 * 24))
        dim(models) <- c(24, 24)

        deseas <- array(NA, c(24, 24, 72))
        dimnames(deseas) <- dimnames(ozone)

        for (i in seq_len(24)) {
          for (j in seq_len(24)) {
            mod <- deseasf(ozone[i, j, ])
            models[[i, j]] <- mod
            deseas[i, j, ] <- resid(mod)
          }
        }
  • 13. With apply:
        models <- apply(ozone, 1:2, deseasf)

        resids <- unlist(lapply(models, resid))
        dim(resids) <- c(72, 24, 24)
        deseas <- aperm(resids, c(2, 3, 1))
        dimnames(deseas) <- dimnames(ozone)
  • 15. With plyr:
        models <- aaply(ozone, 1:2, deseasf)
        deseas <- aaply(models, 1:2, resid)
      Succinct, but you need to know what aaply does (cf. onomatopoeia, schadenfreude, soliloquy).
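The for-loop and apply versions on the preceding slides can be checked against each other on a small synthetic array. This is only a sketch in base R: the 4 x 4 x 72 array is made up, and `deseasf` is a hypothetical stand-in (a per-month mean fitted with `lm()`), since the slides never show its definition.

```r
# Synthetic stand-in for the ozone array: 4 x 4 locations, 72 time points.
set.seed(1)
ozone <- array(rnorm(4 * 4 * 72), c(4, 4, 72),
               dimnames = list(lat = paste0("lat", 1:4),
                               long = paste0("long", 1:4),
                               time = paste0("t", 1:72)))

# Hypothetical deseasf (an assumption, not the talk's definition):
# fit one mean per calendar month, so the residuals are deseasonalised values.
month <- factor(rep(1:12, length.out = 72))
deseasf <- function(value) lm(value ~ month - 1)

# For-loop version: fill the result one location at a time.
deseas_loop <- array(NA_real_, dim(ozone), dimnames = dimnames(ozone))
for (i in seq_len(4)) {
  for (j in seq_len(4)) {
    deseas_loop[i, j, ] <- resid(deseasf(ozone[i, j, ]))
  }
}

# apply() version: models is a 4 x 4 list-matrix of fitted models.
models <- apply(ozone, 1:2, deseasf)
resids <- unlist(lapply(models, resid))
dim(resids) <- c(72, 4, 4)                 # time varies fastest after unlist()
deseas_apply <- aperm(resids, c(2, 3, 1))  # move time back to the third dimension
dimnames(deseas_apply) <- dimnames(ozone)

# Both routes should give the same deseasonalised array.
same <- isTRUE(all.equal(deseas_loop, deseas_apply))
```

The reshaping steps (`dim<-`, `aperm`) are the bookkeeping that the plyr call on slide 15 is meant to hide.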
  • 16. [Figure: values plotted over a grid (−110 to −60 by −20 to 30), with a legend "avg" running from 250 to 310]
  • 17. [Figure: values plotted over a grid (−110 to −60 by −20 to 30)]
  • 18. Many problems involve splitting up a large data structure, operating on each piece, and joining the results back together: split-apply-combine.
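As a rough illustration of the three steps in base R (not plyr itself), the sketch below splits a made-up data frame by group, summarises each piece, and binds the results back together. The data and column names are invented for the example.

```r
# Toy data, made up for illustration.
df <- data.frame(group = c("a", "b", "a", "c", "b", "a"),
                 value = c(1, 2, 3, 4, 5, 6))

# Split: one data frame per group.
pieces <- split(df, df$group)

# Apply: summarise each piece independently.
results <- lapply(pieces, function(piece) {
  data.frame(group = piece$group[1], mean = mean(piece$value))
})

# Combine: stack the per-piece results into one data frame.
combined <- do.call(rbind, results)
```

Each step is a separate line here; the plyr functions in the talk bundle all three into a single call.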
  • 19. How you split up depends on the type of input: arrays, data frames, lists. How you combine depends on the type of output: arrays, data frames, lists, nothing.
  • 20. plyr functions by input type (rows) and output type (columns):
        input \ output   array   data frame   list    nothing
        array            aaply   adply        alply   a_ply
        data frame       daply   ddply        dlply   d_ply
        list             laply   ldply        llply   l_ply
  • 21. The same table with base R functions substituted where they exist:
        input \ output   array    data frame   list     nothing
        array            apply    adply        alply    a_ply
        data frame       daply    aggregate    by       d_ply
        list             sapply   ldply        lapply   l_ply
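The base R functions named in the second table can be tried side by side on toy inputs. This is a small sketch with made-up data; the input/output labels in the comments follow the table's rows and columns.

```r
# Toy inputs, made up for illustration.
m  <- matrix(1:6, nrow = 2)                               # array input
x  <- list(a = 1:3, b = 4:6)                              # list input
df <- data.frame(g = c("u", "u", "v"), y = c(1, 3, 10))   # data frame input

row_sums  <- apply(m, 1, sum)                             # array -> array
sums_vec  <- sapply(x, sum)                               # list -> array (here a vector)
sums_list <- lapply(x, sum)                               # list -> list
means_df  <- aggregate(y ~ g, data = df, FUN = mean)      # data frame -> data frame
rows_by_g <- by(df, df$g, nrow)                           # data frame -> list-like "by" object
```

Note how each base function has its own argument order and return conventions, which is the inconsistency the plyr naming scheme is meant to remove.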
  • 22. Split: array, data frame, list [Figure: a 2D array split along dimension 1, dimension 2, or both (1,2)]
  • 23. Split: array, data frame, list [Figure: a 3D array split along dimensions 1, 2, 3, 1,2, 1,3, 2,3, or 1,2,3]
  • 24. Take a 3D array and split it up by the first two dimensions.
        models <- aaply(ozone, 1:2, deseasf)
        deseas <- aaply(models, 1:2, resid)
      Splitting up ozone gives 576 vectors of length 72. Splitting up models gives 576 rlm models. How are they combined?
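The piece counts quoted on this slide can be confirmed with a placeholder array of the stated shape. A minimal sketch, using base `apply()` rather than plyr and zeros in place of real ozone values:

```r
# Placeholder array with the shape given in the talk; only the shape matters here.
ozone_shape <- c(24, 24, 72)
ozone <- array(0, ozone_shape)

# Splitting by the first two dimensions hands each function call one vector
# along the third dimension; record the length of every such piece.
piece_lengths <- apply(ozone, 1:2, length)

n_pieces <- length(piece_lengths)   # number of pieces (24 * 24)
n_values <- prod(ozone_shape)       # total number of measurements
```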
  • 25. Combine: array, data frame, list 4D!
  • 26. Combine: array, data frame, list
  • 27. Split: array, data frame, list
      Original data frame:
        name     age  sex
        John     13   Male
        Mary     15   Female
        Alice    14   Female
        Peter    13   Male
        Roger    14   Male
        Phyllis  13   Female
      Split by .(sex):
        name     age  sex
        John     13   Male
        Peter    13   Male
        Roger    14   Male

        name     age  sex
        Mary     15   Female
        Alice    14   Female
        Phyllis  13   Female
      Split by .(age):
        name     age  sex
        John     13   Male
        Peter    13   Male
        Phyllis  13   Female

        name     age  sex
        Alice    14   Female
        Roger    14   Male

        name     age  sex
        Mary     15   Female
  • 28. Combine: array, data frame, list. Applying nrow to each piece:
        .(sex)              .(age)           .(sex, age)
        sex     value       age  value       sex     age  value
        Male    3           13   3           Male    13   2
        Female  3           14   2           Male    14   1
                            15   1           Female  13   1
                                             Female  14   1
                                             Female  15   1
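The per-piece row counts can be reproduced from the six rows shown on slide 27. The sketch below uses base R `split()` and `sapply()` instead of plyr's `.()` syntax; `count_by` is a hypothetical helper named here only for illustration.

```r
# The six rows from slide 27.
people <- data.frame(
  name = c("John", "Mary", "Alice", "Peter", "Roger", "Phyllis"),
  age  = c(13, 15, 14, 13, 14, 13),
  sex  = c("Male", "Female", "Female", "Male", "Male", "Female")
)

# Hypothetical helper: split by one or more columns, then apply nrow to each piece.
# drop = TRUE discards combinations with no rows (for example Male at age 15).
count_by <- function(df, vars) {
  sapply(split(df, df[vars], drop = TRUE), nrow)
}

by_sex     <- count_by(people, "sex")
by_age     <- count_by(people, "age")            # 13: 3, 14: 2, 15: 1
by_sex_age <- count_by(people, c("sex", "age"))  # names look like "Male.13"
```

Every split accounts for all six rows, so each set of counts sums to 6.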
  • 29. Case study: Baseball
  • 30. 21,699 records; 1228 players; 15-31 years for each player.
        id        year  team  g    ab   r    h
        ruthba01  1914  BOS   5    10   1    2
        ruthba01  1915  BOS   42   92   16   29
        ruthba01  1916  BOS   67   136  18   37
        ruthba01  1917  BOS   52   123  14   40
        ruthba01  1918  BOS   95   317  50   95
        ruthba01  1919  BOS   130  432  103  139
        ruthba01  1920  NYA   142  457  158  172
        ruthba01  1921  NYA   152  540  177  204
        ruthba01  1922  NYA   110  406  94   128
        ruthba01  1923  NYA   152  522  151  205
        ruthba01  1924  NYA   153  529  143  200
        ruthba01  1925  NYA   98   359  61   104
        ruthba01  1926  NYA   152  495  139  184
        ruthba01  1927  NYA   151  540  158  192
        ruthba01  1928  NYA   154  536  163  173
        ruthba01  1929  NYA   135  499  121  172
  • 31. How does performance (rbi/ab) change over the course of a career? First we need to add a column that gives the "career year". Easy for a single player:
        baberuth <- subset(baseball, id == "ruthba01")
        baberuth <- transform(baberuth, cyear = year - min(year) + 1)
      For many players, use ddply + transform:
        baseball <- ddply(baseball, "id", transform, cyear = year - min(year) + 1)
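What the ddply + transform call does can be spelled out step by step in base R. This is a sketch on a made-up two-player data frame (`baseball_toy` and its ids are invented), not the plyr implementation.

```r
# Toy stand-in for the baseball data: two invented players.
baseball_toy <- data.frame(
  id   = c("a01", "a01", "a01", "b02", "b02"),
  year = c(1914, 1915, 1916, 1920, 1921)
)

# Split by player.
pieces <- split(baseball_toy, baseball_toy$id)

# Apply: within each player's rows, min(year) is that player's first season.
pieces <- lapply(pieces, function(piece) {
  transform(piece, cyear = year - min(year) + 1)
})

# Combine back into a single data frame.
result <- do.call(rbind, pieces)
```

Because `min(year)` is evaluated inside each piece, every player's career year starts at 1 regardless of the calendar year they debuted.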
  • 32. Draw time series for all 1228 players:
        baseball <- subset(baseball, ab >= 25)
        xlim <- range(baseball$cyear, na.rm = TRUE)
        ylim <- range(baseball$rbi / baseball$ab, na.rm = TRUE)
        plotpattern <- function(df) {
          qplot(cyear, rbi / ab, data = df, geom = "line", xlim = xlim, ylim = ylim)
        }

        pdf("paths.pdf", width = 8, height = 4)
        d_ply(baseball, .(reorder(id, rbi / ab)), failwith(NA, plotpattern), .print = TRUE)
        dev.off()
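The `failwith(NA, plotpattern)` call wraps the plotting function so that one failing piece does not abort the whole run. The idea can be sketched in base R with `tryCatch()`; `fail_with` and `risky` below are hypothetical names for illustration and this is not plyr's implementation.

```r
# Hypothetical helper illustrating the idea behind failwith():
# return a function that calls f, but yields `default` if f throws an error.
fail_with <- function(default, f) {
  function(...) tryCatch(f(...), error = function(e) default)
}

# A made-up function that fails on some inputs.
risky <- function(x) {
  if (x < 0) stop("negative input")
  sqrt(x)
}

safe <- fail_with(NA, risky)

# The failing middle element becomes NA instead of stopping the loop.
results <- sapply(c(4, -1, 9), safe)
```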
  • 33. [Figure: count (0 to 200) plotted against rsquare (0.0 to 1.0)]
  • 34. [Figure: two panels plotting intercept against slope, each with an rsquare legend (0.00 to 1.00)]
  • 35. Fiddly details
      Labelling
      Progress bars
      Consistent argument names
      Missing values / Nulls
  • 36. Data analysis: What other patterns of data analysis are waiting to be discovered? How can we identify these strategies and then develop software to support them? Does teaching these patterns make it easier for novices to become experts?
  • 37. http://had.co.nz/plyr
