© RAGINIJAIN CC SA 4.0
Ragini Jain
MSc CA 1st Year (2015 - 2017)
plyr
'for-loop' is the traditional approach
● Walks a large data set using a counter and a loop condition
● Accesses each element with [ ]
● Useful for overlapping data
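A minimal sketch of the loop pattern described on this slide, using a running mean as the overlapping-data case. The vector `x` and the `window` width are illustrative values, not taken from the slides.

```r
# Running mean of width 3 with an explicit for-loop (base R only).
# `x` and `window` are made-up illustrative values.
x <- c(2, 4, 6, 8, 10)
window <- 3
n_out <- length(x) - window + 1
running_mean <- numeric(n_out)

for (i in seq_len(n_out)) {          # counter and loop condition
  chunk <- x[i:(i + window - 1)]     # element access with [ ]; chunks overlap
  running_mean[i] <- mean(chunk)
}

print(running_mean)
```

Neighbouring chunks share elements, so the pieces are not independent of each other, and a plain loop is a reasonable fit here.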
What is plyr
plyr is an R package that makes it easy to work with large data sets.
large data set → split → apply function ( … ) to each chunk of data in parallel → combine
plyr
● Plyr supports functionality to
– Break up a big problem into manageable pieces (split)
– Operate on each piece independently (apply)
– Put all the pieces together to get the final result (combine)
● Split-apply-combine (SAC) is similar to map-reduce (MR)
● Designed for parallel operation
● Similar to the SQL GROUP BY operator
● Programmatically in R, 'plyr'
– Provides replacement for loops
– Abstracts away details of underlying data structure
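The three steps can be spelled out by hand in base R, which may help show what plyr wraps into a single call. This is only a sketch: the `scores` data frame and its column names are invented for illustration.

```r
# Split-apply-combine by hand in base R.
# The data frame and its column names are hypothetical.
scores <- data.frame(group = c("a", "b", "a", "b", "c"),
                     value = c(1, 2, 3, 4, 5))

pieces  <- split(scores$value, scores$group)   # split: one vector per group
results <- lapply(pieces, mean)                # apply: each piece handled on its own

# combine: one row per group
combined <- data.frame(group = names(results),
                       mean_value = unlist(results, use.names = FALSE))
print(combined)
```

Each piece is processed without reference to the others, which is the property that makes the pattern comparable to a grouped aggregate.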
plyr
● Use it when
– Each piece of data will be processed only once
– Each piece is processed independently of all other pieces
● Don't use it when
– Each iteration requires overlapping data (e.g. running mean)
– Each iteration depends on the previous iteration (e.g. dynamic simulation)
plyr in R environment
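> installed.packages()       # list the packages that are already installed
> install.packages("plyr")   # install plyr from CRAN if it is missing
> library('plyr')            # load plyr into the current session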
plyr 12 key functions
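Function names have the form <input><output>ply:
input \ output   array   data frame   list    discarded
array            aaply   adply        alply   a_ply
data frame       daply   ddply        dlply   d_ply
list             laply   ldply        llply   l_ply
a = array
d = data frame
l = list
_ means the output is discarded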
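As a rough illustration of how the letters map to output types, the sketch below feeds the same list to four list-input functions. The list `pieces` is a made-up example, and the block assumes plyr is installed.

```r
library(plyr)

# Same input list, four different output types.
# `pieces` is an illustrative list, not from the slides.
pieces <- list(a = 1:3, b = 4:6)

as_array <- laply(pieces, sum)    # list in, array out (here a plain vector)
as_list  <- llply(pieces, sum)    # list in, list out
as_frame <- ldply(pieces, function(p) data.frame(total = sum(p)))  # list in, data frame out

# list in, output discarded: useful only for side effects such as printing
l_ply(pieces, function(p) cat("sum:", sum(p), "\n"))
```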
plyr 12 key functions
a*ply( .data, .margins, .fun, …, .progress="none")
d*ply( .data, .variables, .fun, …, .progress="none")
l*ply( .data, .fun, …, .progress="none")
● .data: data which will be split up, processed and combined
● .margins, .variables: describe how to split up the input into pieces
● .fun: processing function which is applied to each piece
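A small sketch of the array form, with the arguments named explicitly. The matrix `m` is an invented example, and the block assumes plyr is installed.

```r
library(plyr)

# A 2 x 3 matrix: rows are (1, 3, 5) and (2, 4, 6).
m <- matrix(1:6, nrow = 2)

# .margins = 1 splits by row, .margins = 2 splits by column.
row_sums <- aaply(.data = m, .margins = 1, .fun = sum)
col_sums <- aaply(.data = m, .margins = 2, .fun = sum)

print(row_sums)
print(col_sums)
```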
Rules for splitting input data
● a*ply
– Arrays are sliced by dimension into lower-dimensional pieces
– Must respond to dim() and accept multi-dimensional indexing
● d*ply
– Data frames are subset by combinations of variables
– Must work with split() and be coercible to a list
● l*ply
– Each element in a list is a piece
– Must work with length() and [[
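The requirements listed above can be checked with base R alone. The sketch below uses invented objects (`arr`, `people`, `lst`) to show what each rule relies on; it is an illustration, not plyr's internal code.

```r
# Array input: responds to dim() and accepts multi-dimensional indexing.
arr <- array(1:24, dim = c(2, 3, 4))
print(dim(arr))
slice <- arr[1, , ]            # fixing one index yields a lower-dimensional piece
print(dim(slice))

# Data frame input: works with split() and is a list of columns.
people <- data.frame(g = c("x", "y", "x"), v = c(10, 20, 30))
by_group <- split(people, people$g)   # one data frame per value of g
print(names(by_group))

# List input: works with length() and [[ ; each element is one piece.
lst <- list(first = 1:2, second = letters[1:3])
print(length(lst))
print(lst[[2]])
```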
Data frame structure visualization
A data frame stores a data table.
It is a list of vectors of equal length, one vector per column.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)
> df
  n  s     b
1 2 aa  TRUE
2 3 bb FALSE
3 5 cc  TRUE
Data frame and ddply()
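> # 72 rows: three random numeric columns plus two grouping columns
> df <- data.frame(matrix(rnorm(216), 72, 3),
                   c(rep("A", 24),
                     rep("B", 24),
                     rep("C", 24)),
                   c(rep("J", 36),
                     rep("K", 36)))
> colnames(df) <- c("v1", "v2", "v3", "dim1", "dim2")
> # mean of v1 for each combination of dim1 and dim2
> ddply(df,
        c("dim1", "dim2"),
        function(piece) mean(piece$v1))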
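For a result that can be checked by hand, here is a variant on a small deterministic data frame. It uses plyr's `summarise()` helper in place of an anonymous function, which is one common way to write the same grouped mean; the data values and the output column names `mean_v1` and `n` are chosen for illustration.

```r
library(plyr)

# Eight rows, two grouping columns, known values in v1.
# The data and the output column names are illustrative.
small <- data.frame(dim1 = rep(c("A", "B"), each = 4),
                    dim2 = rep(c("J", "K"), times = 4),
                    v1   = 1:8)

# One output row per (dim1, dim2) combination.
res <- ddply(small, c("dim1", "dim2"), summarise,
             mean_v1 = mean(v1),
             n       = length(v1))
print(res)
```

Group A/J holds the values 1 and 3, A/K holds 2 and 4, B/J holds 5 and 7, and B/K holds 6 and 8, so the four means can be verified directly.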
References
● Hadley Wickham, "The Split-Apply-Combine Strategy for Data Analysis", Journal of Statistical Software, 40(1), 2011
http://www.jstatsoft.org/article/view/v040i01
● Github 'plyr' repo
https://github.com/hadley/plyr
● CRAN
https://cran.r-project.org/web/packages/plyr/index.html
Thank you.
● Questions
● Clarifications
● Suggestions
● Feedback
Ragini Jain
15030142023@sicsr.ac.in
