Intro to plyr for Davis R Users' Group, by Steve Culman
1. Split-Apply-Combine Strategy with the Package plyr
Hadley Wickham. The split-apply-combine strategy for data
analysis. Journal of Statistical Software, 40(1):1–29, 2011.
URL http://www.jstatsoft.org/v40/i01/.
http://plyr.had.co.nz/
2. A common pattern in statistical analyses:
• Split up a dataset into pieces
• Apply a function to these pieces
• Combine output and examine the results
Examples:
• Calculate the mean response by treatment
• Run ANOVA on numerous response variables
• Calculate total precipitation from weather data by
month or year
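As a minimal sketch of the first example, the three steps can be written out by hand in base R and then collapsed into a single plyr call. The data frame `dat` and its columns `treatment` and `response` are hypothetical and exist only for illustration.

```r
library(plyr)

# Hypothetical toy data: a response measured under two treatments
dat <- data.frame(
  treatment = c("control", "control", "fertilized", "fertilized"),
  response  = c(2, 4, 6, 10)
)

# Base R: split, apply, and combine as three explicit steps
pieces  <- split(dat$response, dat$treatment)        # split by treatment
means   <- sapply(pieces, mean)                      # apply mean() to each piece
by_hand <- data.frame(treatment     = names(means),  # combine into a data frame
                      mean_response = unname(means))

# plyr: the same three steps in one call
by_plyr <- ddply(dat, .(treatment), summarise,
                 mean_response = mean(response))
```

Both `by_hand` and `by_plyr` hold one row per treatment with its mean response.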
3. plyr offers:
• a streamlined, simplified, and unified framework that serves as an
alternative to the base R ‘apply’ functions (apply, sapply, tapply,
lapply, mapply, etc.)
• a streamlined and simplified approach to writing for loops
When not to use plyr:
• each iteration requires overlapping data (e.g., running mean)
• each iteration depends on the previous iteration (e.g.,
dynamic simulation)
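The two cases above can be illustrated in base R. This is only a sketch: the vector `x`, the window width of 3, and the logistic-growth parameters `r` and `K` are assumptions chosen for the example.

```r
# Running mean: neighbouring windows share data points, so the input
# cannot be split into independent pieces
x <- c(1, 2, 3, 4, 5)
running_mean <- stats::filter(x, rep(1 / 3, 3), sides = 2)  # centred 3-point window

# Dynamic simulation: each step needs the result of the previous step,
# so an explicit loop is the natural tool
n_steps <- 10
r <- 0.5    # assumed growth rate
K <- 100    # assumed carrying capacity
pop <- numeric(n_steps)
pop[1] <- 10
for (t in 2:n_steps) {
  pop[t] <- pop[t - 1] + r * pop[t - 1] * (1 - pop[t - 1] / K)
}
```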
4. The package plyr uses a consistent command naming style: **ply
• where the 1st ‘*’ designates the input type
• 2nd ‘*’ designates the output type.
Each ‘*’ is one of ‘a’, ‘d’, ‘l’, or ‘_’
• a = array
• d = data frame
• l = list
• _ = the output is discarded
• Examples:
– ddply = input is a data frame, output is a data frame
– alply = input is an array, output is a list
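A short sketch of the naming scheme in use. The objects `dat` and `m` are hypothetical, and `l_ply` (list in, output discarded) is included as an assumed illustration of the ‘_’ output type.

```r
library(plyr)

# ddply: data frame in, data frame out
dat <- data.frame(group = c("a", "a", "b"), value = c(1, 3, 5))
group_means <- ddply(dat, .(group), summarise, mean_value = mean(value))

# alply: array in, list out (.margins = 1 splits the matrix by row)
m <- matrix(1:6, nrow = 2)
row_sums <- alply(m, 1, sum)

# l_ply: list in, output discarded; useful when only side effects matter
l_ply(list(1, 2, 3), function(x) message("processing ", x))
```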
5. Full function: ddply(.data, .variables, .fun, ...)
• .data = R object that will be split up, processed and
recombined
• .variables (.margins for arrays) = describes how to split up the
input into pieces
• .fun = the processing function, applied to each piece in
turn.
• All further arguments are passed on to the processing
function.
• If you omit .fun, the individual pieces will not be modified, but
the entire data structure will be converted from one type to
another.
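The arguments above can be sketched as follows. The data frame `dat` and the helper `mean_yield` are hypothetical names introduced for this example; `dlply` is used for the omitted-.fun case because it converts a data frame into a list.

```r
library(plyr)

# Hypothetical data with one missing value
dat <- data.frame(
  treatment = c("control", "control", "fertilized", "fertilized"),
  yield     = c(2, NA, 6, 10)
)

# .fun receives one piece (a data frame) at a time
mean_yield <- function(piece, na.rm = FALSE) {
  data.frame(mean_yield = mean(piece$yield, na.rm = na.rm))
}

# na.rm = TRUE is not a ddply argument; it is passed on to mean_yield
result <- ddply(dat, .(treatment), mean_yield, na.rm = TRUE)

# Omitting .fun: the pieces are left unmodified, and the data frame is
# simply converted into a list with one data frame per treatment
pieces <- dlply(dat, .(treatment))
```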