
# Plyr, one data analytic strategy

Published in: Technology, Education

• A less 'expensive' plot than the final d_ply example is just qplot( x = cyear, y = rbi / ab, geom = 'path', data = baseball).

• I’m used to being able to freely interchange = and <-. NB: one CANNOT swap these inside a transform() call.

• Slide 31: Is there a preferred syntax between ddply(baseball, 'id', ...) and ddply(baseball, .(id), ... ) ? The help page ?ddply uses the latter.


### Plyr, one data analytic strategy

1. plyr: One data-analytic strategy. Hadley Wickham, Rice University. Friday, 29 May 2009
2. 1) Motivation: Deseasonalising ozone measurements; 2) Outline of strategy: split-apply-combine; 3) Specifics: input vs. output; 4) Fiddly details; 5) Thoughts on data analysis
3. 24 x 24 x 72 = 41,472. [Figure: map with axis ticks from −110 to −60 and from −20 to 30.]
4. 24 x 24 x 72 = 41,472. [Figure: same map as slide 3.]
5. [Figure: plot with both axes running from −1.0 to 1.0.]
6. [Figure: the plot from slide 5 beside a second panel with both axes running from 0.0 to 1.0.]
7. [Figure: value (0.3 to 1.0) plotted against time (0.0 to 1.0).]
8. [Figure: resid(deseas1) + mean(one\$value) (0.3 to 1.0) plotted against time (0.0 to 1.0).]
9. [Figure: same plot as slide 8.]
10. How can we do this for all 24 x 24 locations? (Assume ozone levels are stored in a 24 x 24 x 72 array.)
11. With a for loop:

        models <- as.list(rep(NA, 24 * 24))
        dim(models) <- c(24, 24)

        deseas <- array(NA, c(24, 24, 72))
        dimnames(deseas) <- dimnames(ozone)

        for (i in seq_len(24)) {
          for (j in seq_len(24)) {
            mod <- deseasf(ozone[i, j, ])
            models[[i, j]] <- mod
            deseas[i, j, ] <- resid(mod)
          }
        }

12. With a for loop: same code as slide 11.
13. With apply:

        models <- apply(ozone, 1:2, deseasf)
        resids <- unlist(lapply(models, resid))
        dim(resids) <- c(72, 24, 24)
        deseas <- aperm(resids, c(2, 3, 1))
        dimnames(deseas) <- dimnames(ozone)

14. With apply: same code as slide 13.
15. With plyr:

        models <- aaply(ozone, 1:2, deseasf)
        deseas <- aaply(models, 1:2, resid)

    Succinct, but you need to know what aaply does (cf. onomatopoeia, schadenfreude, soliloquy).
16. [Figure: map with a legend for avg (250 to 310); axis ticks from −110 to −60 and from −20 to 30.]
17. [Figure: map with axis ticks from −110 to −60 and from −20 to 30.]
18. Many problems involve splitting up a large data structure, operating on each piece and joining the results back together: split-apply-combine.
19. How you split up depends on the type of input: arrays, data frames, lists. How you combine depends on the type of output: arrays, data frames, lists, nothing.
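20. Functions by input type (rows) and output type (columns):

    | Input | array | data frame | list | nothing |
    |---|---|---|---|---|
    | array | aaply | adply | alply | a_ply |
    | data frame | daply | ddply | dlply | d_ply |
    | list | laply | ldply | llply | l_ply |

21. The same grid, with some cells replaced by base R functions:

    | Input | array | data frame | list | nothing |
    |---|---|---|---|---|
    | array | apply | adply | alply | a_ply |
    | data frame | daply | aggregate | by | d_ply |
    | list | sapply | ldply | lapply | l_ply |
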
22. Split: array, data frame, list. [Figure: splits labelled 1, 2 and 1,2.]
23. Split: array, data frame, list. [Figure: splits labelled 1, 2, 3, 1,2, 1,3, 2,3 and 1,2,3.]
24. Take a 3d array and split it up by the first two dimensions:

        models <- aaply(ozone, 1:2, deseasf)
        deseas <- aaply(models, 1:2, resid)

    Splitting up ozone gives 576 vectors of length 72. Splitting up models gives 576 rlm models. How are they combined?
25. Combine: array, data frame, list. 4D!
26. Combine: array, data frame, list.
27. Split: array, data frame, list. Example data frame:

    | name | age | sex |
    |---|---|---|
    | John | 13 | Male |
    | Mary | 15 | Female |
    | Alice | 14 | Female |
    | Peter | 13 | Male |
    | Roger | 14 | Male |
    | Phyllis | 13 | Female |

    Split by .(sex): Male = John, Peter, Roger; Female = Mary, Alice, Phyllis. Split by .(age): 13 = John, Peter, Phyllis; 14 = Alice, Roger; 15 = Mary.
28. Combine: array, data frame, list. Applying nrow to each piece gives, for .(sex): Male 3, Female 3; for .(age): 13 gives 3, 14 gives 2, 15 gives 1; for .(sex, age): Male 13 gives 2, Male 14 gives 1, Female 13 gives 1, Female 14 gives 1, Female 15 gives 1.
29. Case study: Baseball
30. 21,699 records; 1228 players; 15-31 years for each player. First rows of the data:

    | id | year | team | g | ab | r | h |
    |---|---|---|---|---|---|---|
    | ruthba01 | 1914 | BOS | 5 | 10 | 1 | 2 |
    | ruthba01 | 1915 | BOS | 42 | 92 | 16 | 29 |
    | ruthba01 | 1916 | BOS | 67 | 136 | 18 | 37 |
    | ruthba01 | 1917 | BOS | 52 | 123 | 14 | 40 |
    | ruthba01 | 1918 | BOS | 95 | 317 | 50 | 95 |
    | ruthba01 | 1919 | BOS | 130 | 432 | 103 | 139 |
    | ruthba01 | 1920 | NYA | 142 | 457 | 158 | 172 |
    | ruthba01 | 1921 | NYA | 152 | 540 | 177 | 204 |
    | ruthba01 | 1922 | NYA | 110 | 406 | 94 | 128 |
    | ruthba01 | 1923 | NYA | 152 | 522 | 151 | 205 |
    | ruthba01 | 1924 | NYA | 153 | 529 | 143 | 200 |
    | ruthba01 | 1925 | NYA | 98 | 359 | 61 | 104 |
    | ruthba01 | 1926 | NYA | 152 | 495 | 139 | 184 |
    | ruthba01 | 1927 | NYA | 151 | 540 | 158 | 192 |
    | ruthba01 | 1928 | NYA | 154 | 536 | 163 | 173 |
    | ruthba01 | 1929 | NYA | 135 | 499 | 121 | 172 |

31. How does performance (rbi/ab) change over the course of a career? First we need to add a column that gives the “career year”. Easy for a single player:

        baberuth <- subset(baseball, id == "ruthba01")
        baberuth <- transform(baberuth, cyear = year - min(year) + 1)

    For many players, use ddply + transform:

        baseball <- ddply(baseball, "id", transform,
          cyear = year - min(year) + 1)

32. Draw time series for all 1228 players:

        baseball <- subset(baseball, ab >= 25)
        xlim <- range(baseball$cyear, na.rm = TRUE)
        ylim <- range(baseball$rbi / baseball$ab, na.rm = TRUE)

        plotpattern <- function(df) {
          qplot(cyear, rbi / ab, data = df, geom = "line",
            xlim = xlim, ylim = ylim)
        }

        pdf("paths.pdf", width = 8, height = 4)
        d_ply(baseball, .(reorder(id, rbi / ab)),
          failwith(NA, plotpattern), .print = TRUE)
        dev.off()

33. [Figure: histogram of rsquare (0.0 to 1.0) with count (0 to 200) on the other axis.]
34. [Figure: two panels with axes slope and intercept and an rsquare legend from 0.00 to 1.00.]
35. Fiddly details: labelling; progress bars; consistent argument names; missing values / nulls.
36. Data analysis. What other patterns of data analysis are waiting to be discovered? How can we identify these strategies and then develop software to support them? Does teaching these patterns make it easier for novices to become experts?
37. http://had.co.nz/plyr
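
The split-apply-combine pattern from slides 18 to 20 can be tried on the six-row data frame from slide 27. The following is a minimal sketch, assuming the plyr package is installed; the object names (`people`, `by_sex`, `pieces`, `counts` and so on) are made up for illustration and do not come from the slides. The last lines spell out the three steps in base R for comparison.

```r
library(plyr)

# The six-row data frame shown on slide 27.
people <- data.frame(
  name = c("John", "Mary", "Alice", "Peter", "Roger", "Phyllis"),
  age  = c(13, 15, 14, 13, 14, 13),
  sex  = c("Male", "Female", "Female", "Male", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Data frame in, data frame out: split by sex, apply nrow to each piece,
# combine the counts into one data frame.
by_sex <- ddply(people, .(sex), nrow)

# Splitting by two variables gives one row per observed combination.
by_sex_age <- ddply(people, .(sex, age), nrow)

# Same split, different output type: the middle letter of the function
# name changes (slide 20).
as_list  <- dlply(people, .(sex), nrow)  # data frame in, list out
as_array <- daply(people, .(sex), nrow)  # data frame in, array out

# Base R sketch of the same three steps, written out by hand.
pieces <- split(people, people$sex)           # split
counts <- vapply(pieces, nrow, integer(1))    # apply
print(counts)                                 # combined as a named vector
```

Changing only the function name (ddply, dlply, daply) changes how the results are combined, while the split specification stays the same.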
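
The "career year" step from slide 31 can be checked on a small example. This is a sketch under stated assumptions: plyr is installed, and the toy data frame `raw` with its player ids is hypothetical, not taken from the baseball data. It also runs both spellings of the grouping variable, `"id"` and `.(id)`, and compares the result with a base R computation using `ave()`, which is a substitute technique and not what the slides use.

```r
library(plyr)

# Hypothetical toy records: two players with a few seasons each.
raw <- data.frame(
  id   = c("a01", "a01", "a01", "b02", "b02"),
  year = c(1914, 1915, 1916, 1920, 1922),
  stringsAsFactors = FALSE
)

# For each player, career year = years since that player's first
# recorded season, plus one. Grouping variable given as a string.
records <- ddply(raw, "id", transform, cyear = year - min(year) + 1)

# The same call with the grouping variable given as .(id).
quoted <- ddply(raw, .(id), transform, cyear = year - min(year) + 1)

# Base R comparison: ave() applies the function within each id group
# and returns the results in the original row order.
records$cyear_base <- ave(records$year, records$id,
                          FUN = function(y) y - min(y) + 1)

print(records)
```

In this sketch the two spellings of the grouping variable produce the same career years, and the `ave()` column matches them.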