Plyr, one data analytic strategy

Comments (3)

  • A less 'expensive' plot than the final d_ply example is just qplot(x = cyear, y = rbi / ab, geom = 'path', data = baseball).
  • I’m used to being able to freely interchange = and <-. NB: one CANNOT swap these inside a transform command.
  • Slide 31: Is there a preferred syntax between ddply(baseball, 'id', ...) and ddply(baseball, .(id), ...)? The help page ?ddply uses the latter.
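
The comment above about = and <- inside transform can be checked with a small sketch. The data frame `df` and the column name `z` are assumptions made up for illustration. The point is only that transform() creates columns from named arguments, and `z <- x + 1` is an assignment expression rather than a named argument.

```r
df <- data.frame(x = 1:3)

# Named argument: `=` creates a new column called z.
with_equals <- transform(df, z = x + 1)
print(names(with_equals))

# `<-` is evaluated as an ordinary assignment expression, so transform()
# never sees an argument named z and no column of that name is created.
with_arrow <- transform(df, z <- x + 1)
print(names(with_arrow))
```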

Plyr, one data analytic strategy: Presentation Transcript

  • plyr: One data-analytic strategy. Hadley Wickham, Rice University. Friday, 29 May 2009
  • Outline:
      1. Motivation: Deseasonalising ozone measurements
      2. Outline of strategy: split-apply-combine
      3. Specifics: input vs. output
      4. Fiddly details
      5. Thoughts on data analysis
  • 24 x 24 x 72 = 41,472 [Figure: map; horizontal axis from −110 to −60, vertical axis from −20 to 30]
  • [Figure: value plotted against time]
  • [Figure: resid(deseas1) + mean(one$value) plotted against time]
  • How can we do this for all 24 x 24 locations? (Assume the ozone levels are stored in a 24 x 24 x 72 array.)
  • With a for loop:

        models <- as.list(rep(NA, 24 * 24))
        dim(models) <- c(24, 24)
        deseas <- array(NA, c(24, 24, 72))
        dimnames(deseas) <- dimnames(ozone)
        for (i in seq_len(24)) {
          for (j in seq_len(24)) {
            mod <- deseasf(ozone[i, j, ])
            models[[i, j]] <- mod
            deseas[i, j, ] <- resid(mod)
          }
        }
  • With apply:

        models <- apply(ozone, 1:2, deseasf)
        resids <- unlist(lapply(models, resid))
        dim(resids) <- c(72, 24, 24)
        deseas <- aperm(resids, c(2, 3, 1))
        dimnames(deseas) <- dimnames(ozone)
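
The reshaping in the apply version is needed because base apply() places each function result along the first dimension of its output. A minimal sketch on a toy 2 x 3 x 4 array follows. The array `toy` and the function `centre` are assumptions for illustration; they stand in for the ozone data and for fitting a model and taking residuals.

```r
# Toy stand-in for the ozone array: 2 x 3 locations, 4 time points.
toy <- array(seq_len(2 * 3 * 4), dim = c(2, 3, 4))

# Stand-in for "fit a model and take residuals": centre each series.
centre <- function(x) x - mean(x)

# apply() returns the length-4 results along the FIRST dimension: 4 x 2 x 3.
res <- apply(toy, 1:2, centre)
print(dim(res))

# aperm() moves the time dimension back to the end: 2 x 3 x 4.
fixed <- aperm(res, c(2, 3, 1))
print(dim(fixed))
```

This is the same permutation, c(2, 3, 1), that the slide uses to turn the 72 x 24 x 24 residual array back into 24 x 24 x 72.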
  • With plyr:

        models <- aaply(ozone, 1:2, deseasf)
        deseas <- aaply(models, 1:2, resid)

    Succinct, but you need to know what aaply does (cf. onomatopoeia, schadenfreude, soliloquy).
  • [Figures: two maps on the same axes; the first has a legend "avg" ranging from 250 to 310]
  • Many problems involve splitting up a large data structure, operating on each piece, and joining the results back together: split-apply-combine.
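
The three steps can be spelled out in base R without plyr. This is a minimal sketch on made-up data: the data frame `df` and its columns `site` and `value` are assumptions for illustration only.

```r
# Toy data: a hypothetical measurement per site.
df <- data.frame(
  site  = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# Split: one data frame per site.
pieces <- split(df, df$site)

# Apply: summarise each piece independently of the others.
summaries <- lapply(pieces, function(piece) {
  data.frame(site = piece$site[1], mean_value = mean(piece$value))
})

# Combine: stack the per-piece results back into one data frame.
result <- do.call(rbind, summaries)
print(result)
```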
  • How you split up depends on the type of input: arrays, data frames, lists. How you combine depends on the type of output: arrays, data frames, lists, nothing.
  • plyr functions, by input (rows) and output (columns):

        input \ output   array   data frame   list    nothing
        array            aaply   adply        alply   a_ply
        data frame       daply   ddply        dlply   d_ply
        list             laply   ldply        llply   l_ply
  • The same grid with base R equivalents filled in where they exist:

        input \ output   array    data frame   list     nothing
        array            apply    adply        alply    a_ply
        data frame       daply    aggregate    by       d_ply
        list             sapply   ldply        lapply   l_ply
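
The base R functions named in the grid can be tried on small inputs to see the input and output types. This is a rough sketch, and the objects `m`, `l` and `df` are made up for illustration.

```r
# Array in, array (vector) out: apply() over the rows of a matrix.
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)

# List in, list out: lapply().
l <- list(a = 1:3, b = 4:6)
as_list <- lapply(l, mean)

# List in, array (vector) out: sapply().
as_vector <- sapply(l, mean)

# Data frame in, data frame out: aggregate().
df <- data.frame(g = c("x", "x", "y"), v = c(1, 3, 5))
as_df <- aggregate(v ~ g, data = df, FUN = mean)

# Data frame in, list-like object of class "by" out: by().
as_by <- by(df, df$g, function(piece) mean(piece$v))

print(row_sums)
print(as_vector)
print(as_df)
```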
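  • Split: array, data frame, list. [Diagram: a 2D array can be split by dimension 1, by dimension 2, or by both (1,2)]
  • Split: array, data frame, list. [Diagram: a 3D array can be split by a single dimension (1, 2 or 3), by a pair (1,2; 1,3; 2,3), or by all three (1,2,3)]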
  • Take a 3D array and split it up by the first two dimensions:

        models <- aaply(ozone, 1:2, deseasf)
        deseas <- aaply(models, 1:2, resid)

    Splitting up ozone gives 576 vectors of length 72. Splitting up models gives 576 rlm models. How are they combined?
  • Combine: array, data frame, list. [Diagrams; one is annotated "4D!"]
  • Split: array, data frame, list

        Original data frame:

        name     age  sex
        John     13   Male
        Mary     15   Female
        Alice    14   Female
        Peter    13   Male
        Roger    14   Male
        Phyllis  13   Female

        Split by .(sex):

        name     age  sex
        John     13   Male
        Peter    13   Male
        Roger    14   Male

        name     age  sex
        Mary     15   Female
        Alice    14   Female
        Phyllis  13   Female

        Split by .(age):

        name     age  sex
        John     13   Male
        Peter    13   Male
        Phyllis  13   Female

        name     age  sex
        Alice    14   Female
        Roger    14   Male

        name     age  sex
        Mary     15   Female
  • Combine: array, data frame, list. Applying nrow to each piece:

        .(sex)            .(age)          .(sex, age)

        sex     value     age  value      sex     age  value
        Male    3         13   3          Male    13   2
        Female  3         14   2          Male    14   1
                          15   1          Female  13   1
                                          Female  14   1
                                          Female  15   1
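
The counts can be reproduced with base R as a cross-check. This sketch rebuilds the six-row data frame from the split slide under the assumed name `people`. The plyr call at the end runs only if the package happens to be installed.

```r
people <- data.frame(
  name = c("John", "Mary", "Alice", "Peter", "Roger", "Phyllis"),
  age  = c(13, 15, 14, 13, 14, 13),
  sex  = c("Male", "Female", "Female", "Male", "Male", "Female")
)

# Split by one variable, apply nrow to each piece, combine into a vector.
by_sex <- sapply(split(people, people$sex), nrow)
by_age <- sapply(split(people, people$age), nrow)

# Split by two variables; drop = TRUE omits combinations with no rows.
by_both <- sapply(
  split(people, list(people$sex, people$age), drop = TRUE),
  nrow
)

print(by_sex)
print(by_age)
print(by_both)

# With plyr installed, the data-frame-in, data-frame-out version would be:
if (requireNamespace("plyr", quietly = TRUE)) {
  print(plyr::ddply(people, c("sex", "age"), nrow))
}
```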
  • Case study: Baseball
  • 21,699 records; 1228 players; 15-31 years for each player.

        id        year  team  g    ab   r    h
        ruthba01  1914  BOS   5    10   1    2
        ruthba01  1915  BOS   42   92   16   29
        ruthba01  1916  BOS   67   136  18   37
        ruthba01  1917  BOS   52   123  14   40
        ruthba01  1918  BOS   95   317  50   95
        ruthba01  1919  BOS   130  432  103  139
        ruthba01  1920  NYA   142  457  158  172
        ruthba01  1921  NYA   152  540  177  204
        ruthba01  1922  NYA   110  406  94   128
        ruthba01  1923  NYA   152  522  151  205
        ruthba01  1924  NYA   153  529  143  200
        ruthba01  1925  NYA   98   359  61   104
        ruthba01  1926  NYA   152  495  139  184
        ruthba01  1927  NYA   151  540  158  192
        ruthba01  1928  NYA   154  536  163  173
        ruthba01  1929  NYA   135  499  121  172
  • How does performance (rbi / ab) change over the course of a career? First we need to add a column that gives the “career year”. This is easy for a single player:

        baberuth <- subset(baseball, id == "ruthba01")
        baberuth <- transform(baberuth, cyear = year - min(year) + 1)

    For many players, use ddply + transform:

        baseball <- ddply(baseball, "id", transform, cyear = year - min(year) + 1)
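
What ddply + transform does here can be written out by hand in base R. The sketch below uses a made-up miniature data set (`toy`, with hypothetical player ids p1 and p2) rather than the real baseball data. The plyr call at the end runs only if the package is installed.

```r
# Hypothetical miniature of the baseball data: two players, a few seasons each.
toy <- data.frame(
  id   = c("p1", "p1", "p1", "p2", "p2"),
  year = c(1914, 1915, 1916, 1920, 1922)
)

# Split by player, add cyear within each piece, then stack the pieces.
pieces <- split(toy, toy$id)
pieces <- lapply(pieces, function(piece) {
  transform(piece, cyear = year - min(year) + 1)
})
with_cyear <- do.call(rbind, pieces)
rownames(with_cyear) <- NULL
print(with_cyear)

# With plyr installed, the one-line version from the slide:
if (requireNamespace("plyr", quietly = TRUE)) {
  print(plyr::ddply(toy, "id", transform, cyear = year - min(year) + 1))
}
```

Because min(year) is computed inside each piece, every player's first season gets cyear = 1 regardless of the calendar year.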
  • Draw time series for all 1228 players:

        baseball <- subset(baseball, ab >= 25)
        xlim <- range(baseball$cyear, na.rm = TRUE)
        ylim <- range(baseball$rbi / baseball$ab, na.rm = TRUE)
        plotpattern <- function(df) {
          qplot(cyear, rbi / ab, data = df, geom = "line", xlim = xlim, ylim = ylim)
        }
        pdf("paths.pdf", width = 8, height = 4)
        d_ply(baseball, .(reorder(id, rbi / ab)), failwith(NA, plotpattern), .print = TRUE)
        dev.off()
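
The failwith(NA, plotpattern) wrapper in the code above keeps one failing piece from aborting the whole run. The idea can be sketched in base R with tryCatch. The names `fail_with` and `risky` are hypothetical helpers written for this illustration, not plyr's own implementation.

```r
# Hypothetical helper mirroring the idea: return `default` if `f` errors.
fail_with <- function(default, f) {
  function(...) {
    tryCatch(f(...), error = function(e) default)
  }
}

# A function that fails on some inputs.
risky <- function(x) {
  if (x < 0) stop("negative input")
  sqrt(x)
}

# The wrapped version returns NA instead of stopping on the bad piece.
safe <- fail_with(NA, risky)
results <- lapply(list(4, -1, 9), safe)
print(results)
```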
  • [Figure: histogram of rsquare (count against rsquare, from 0.0 to 1.0)]
  • [Figure: two panels plotting intercept against slope, each with an rsquare legend from 0.00 to 1.00]
  • Fiddly details: labelling, progress bars, consistent argument names, missing values / nulls
  • Data analysis: What other patterns of data analysis are waiting to be discovered? How can we identify these strategies and then develop software to support them? Does teaching these patterns make it easier for novices to become experts?
  • http://had.co.nz/plyr