Computational Techniques for the Statistical
Analysis of Big Data in R
A Case Study of the rlme Package
Herb Susmann, Yusuf Bilgic
April 12, 2014
Workflow
Identify
Rewrite
Benchmark
Test
Case Study: rlme
Identify
Wilcoxon Tau Estimator
Pairup
Covariance Estimator
Summary
Keeping Ahead
Motivation
Case study: rlme package
Rank-based regression and estimation for two- and three-level
nested effects models.
Goals: faster, less memory, more data
Before: 5,000 rows of data
After: 50,000 rows of data
Section 1
Workflow
Workflow
Identify
Rewrite
Benchmark
Test
Identify
Know your big O! (O(n²) memory usage? Probably not so
good for big data)
Look for error messages
Profiling with Rprof
Rewrite
High level design
Algorithm design
Statistical techniques: bootstrapping
Rewrite
Microbenchmarking
Know what R is good at
Avoid loops in favor of vectorization
Preallocation
Arguments are by value, not by reference
Embrace C++
Be careful!
Vectorizing
## Bad
vec <- 1:100
for (i in seq_along(vec)) {
  vec[i] <- vec[i]^2
}
## Better
sapply(vec, function(x) x^2)
## Best
vec^2
Preallocation
## Bad
vec <- c()
for (i in 1:100) {
  vec <- c(vec, i)   # grows and copies vec on every iteration
}
## Better
vec <- numeric(100)  # allocate once
for (i in 1:100) {
  vec[i] <- i
}
Pass by value
square <- function(x) {
  x <- x^2       # modifies a local copy only
  return(x)
}
x <- 1:100
square(x)        # returns the squares; x itself is unchanged
Benchmark
Write several versions of a slow function
Test them against each other
Package: microbenchmark
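A minimal sketch of the benchmarking step, assuming the CRAN package microbenchmark is installed; `square_loop` and `square_vec` are hypothetical competing versions of the same computation, not rlme code:

```r
library(microbenchmark)

# Two versions of the same computation to race against each other
square_loop <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) out[i] <- v[i]^2
  out
}
square_vec <- function(v) v^2

v <- rnorm(10000)
# Run each version 100 times and compare the timing distributions
microbenchmark(loop = square_loop(v), vectorized = square_vec(v), times = 100)
```

microbenchmark reports the full distribution of run times (min, median, max), which is more robust than a single `system.time` call.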
Test
Regressions
Unit Testing
Package: testthat
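A small sketch of a regression test with testthat; the one-line `pairup` here is a hypothetical stand-in for the function under test, not rlme's implementation:

```r
library(testthat)

# Hypothetical pairup used only to illustrate the test structure
pairup <- function(x) t(combn(x, 2))

test_that("pairup generates all choose(n, 2) pairs", {
  p <- pairup(1:5)
  expect_equal(nrow(p), choose(5, 2))
  expect_true(all(p[, 1] < p[, 2]))
})
```

Running tests like this after every rewrite catches regressions a benchmark alone would miss.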
Section 2
Case Study: rlme
Identify
Over to R!
Rprof("profile")
fit.rlme = rlme(...)
Rprof(NULL)
summaryRprof("profile")
Wilcoxon Tau Estimator
Rank based scale estimator of residuals
Uses pairup (so already O(n²))
Wilcoxon Tau Estimator
Original:
dresd <- sort(abs(temp[, 1] - temp[, 2]))
dresd = dresd[(p + 1):choose(n, 2)]
What’s wrong? Bad algorithm (the sort is at least O(n log n)),
and the variable gets copied multiple times
Updated with C++
dresd = remove.k.smallest(dresd)
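The idea behind the rewrite can be sketched in plain R (the rlme version is in C++ and may differ in details): drop the k smallest values without fully sorting the vector, using R's partial sorting. `remove_k_smallest` is an illustrative name, not the package function:

```r
# Sketch of the remove.k.smallest idea: a partial sort with
# partial = k guarantees positions 1..k hold the k smallest values
# (in arbitrary order), so dropping them avoids a full O(n log n) sort.
remove_k_smallest <- function(x, k) {
  x <- sort(x, partial = k)
  x[-(1:k)]
}
```

The returned values are not guaranteed to be sorted, which is fine when only the set of remaining residuals matters.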
Wilcoxon Tau Estimator
Test with 2,000 residuals: better!
Wilcoxon Tau
But what about really huge inputs?
Bootstrapping: when there are over 5,000 rows, repeat the
estimate on 1,000 sampled points 100 times
Not about speed, but about memory
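A hedged sketch of that bootstrap scheme; `tau_hat` below is a placeholder for the real Wilcoxon tau estimator, not rlme's code, and the sizes come from the slide:

```r
# Placeholder estimator standing in for the Wilcoxon tau
tau_hat <- function(r) median(abs(r))

# Estimate on 100 subsamples of 1,000 points and average:
# memory use is bounded by `size`, not by length(resid)
boot_tau <- function(resid, reps = 100, size = 1000) {
  ests <- replicate(reps, tau_hat(sample(resid, size, replace = TRUE)))
  mean(ests)
}

set.seed(1)
boot_tau(rnorm(50000))
</imports-placeholder>
```

Each subsample needs only O(size²) pairwise work, so the quadratic cost never touches the full data set.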
Pairup
Pairup function: generates every possible pair from the input
vector
Some rank-based estimators require pairwise operations
O(n²) complexity
Pairup
Original version: vectorized (14 LOC)
Loop version (12 LOC)
"combn" version (core R function, 1 LOC)
C++ version (12 LOC)
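The combn version really is a one-liner around a core R function; `pairup_combn` is an illustrative name:

```r
# One pair per row; there are choose(n, 2) rows in total
pairup_combn <- function(x) t(combn(x, 2))

pairup_combn(1:4)  # a choose(4, 2) = 6 by 2 matrix of pairs
```

combn returns one pair per column, so the transpose puts pairs in rows, matching the layout the other versions produce.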
Over to R!
Covariance Estimator
n × n covariance matrix
Changed to use preallocation
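A sketch of what the preallocation change looks like for an n × n matrix; the entry formula here is illustrative only, not rlme's covariance computation:

```r
# Allocate the n x n matrix once, up front, then fill it in place
# instead of growing it row by row.
n <- 200
x <- rnorm(n)
V <- matrix(0, nrow = n, ncol = n)  # one allocation
for (i in 1:n) {
  for (j in 1:n) {
    V[i, j] <- x[i] * x[j]          # illustrative entry formula
  }
}
```

Growing the matrix inside the loop would copy the whole object on every append, turning an O(n²) fill into something far worse.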
Summary
Identify
Rewrite
Benchmark
Test
Keeping Ahead
Parallelism
Cluster: Rmpi, snow
GPU: rpud
Probably not Hadoop, maybe Apache Spark?
Julia Language
Hadley Wickham (plyr, ggplot2, testthat, ...)
“Advanced R Programming”
Questions?
