Computational Techniques for the Statistical
Analysis of Big Data in R
A Case Study of the rlme Package
Herb Susmann, Yusuf Bilgic
April 12, 2014
Workflow
Identify
Rewrite
Benchmark
Test
Case Study: rlme
Identify
Wilcoxon Tau Estimator
Pairup
Covariance Estimator
Summary
Keeping Ahead
Motivation
Case study: rlme package
Rank-based regression and estimation for two- and three-level
nested effects models.
Goals: faster, less memory, more data
Before: 5,000 rows of data
After: 50,000 rows of data
Section 1
Workflow
Workflow
Identify
Rewrite
Benchmark
Test
Identify
Know your big O! (O(n²) memory usage? Probably not so
good for big data)
Look for error messages
Profiling with Rprof
Rewrite
High level design
Algorithm design
Statistical techniques: bootstrapping
Rewrite
Microbenchmarking
Know what R is good at
Avoid loops in favor of vectorization
Preallocation
Arguments are by value, not by reference
Embrace C++
Be careful!
Vectorizing
## Bad
vec <- 1:100
for (i in seq_along(vec)) {
  vec[i] <- vec[i]^2
}
## Better
sapply(vec, function(x) x^2)
## Best
vec^2
Preallocation
## Bad
vec <- c()
for (i in 1:100) {
  vec <- c(vec, i)   # grows and copies vec on every iteration
}
## Better
vec <- numeric(100)  # allocate once
for (i in 1:100) {
  vec[i] <- i
}
Pass by value
square <- function(x) {
  x <- x^2       # modifies a local copy only
  return(x)
}
x <- 1:100
square(x)        # returns the squares; x itself is unchanged
Benchmark
Write several versions of a slow function
Test them against each other
Package: microbenchmark
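A minimal sketch of the benchmarking step, assuming the CRAN package microbenchmark is installed; `square_loop` and `square_vec` are hypothetical competing versions of the same computation, not rlme code:

```r
library(microbenchmark)

# Two versions of the same computation to race against each other
square_loop <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) out[i] <- v[i]^2
  out
}
square_vec <- function(v) v^2

v <- rnorm(10000)
# Run each version 100 times and compare the timing distributions
microbenchmark(loop = square_loop(v), vectorized = square_vec(v), times = 100)
```

microbenchmark reports the full distribution of run times (min, median, max), which is more robust than a single `system.time` call.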
Test
Regressions
Unit Testing
Package: testthat
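A small sketch of a regression test with testthat; the one-line `pairup` here is a hypothetical stand-in for the function under test, not rlme's implementation:

```r
library(testthat)

# Hypothetical pairup used only to illustrate the test structure
pairup <- function(x) t(combn(x, 2))

test_that("pairup generates all choose(n, 2) pairs", {
  p <- pairup(1:5)
  expect_equal(nrow(p), choose(5, 2))
  expect_true(all(p[, 1] < p[, 2]))
})
```

Running tests like this after every rewrite catches regressions a benchmark alone would miss.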
Section 2
Case Study: rlme
Identify
Over to R!
Rprof("profile")
fit.rlme = rlme(...)
Rprof(NULL)
summaryRprof("profile")
Wilcoxon Tau Estimator
Rank based scale estimator of residuals
Uses pairup (so already O(n²))
Wilcoxon Tau Estimator
Original:
dresd <- sort(abs(temp[, 1] - temp[, 2]))
dresd = dresd[(p + 1):choose(n, 2)]
What’s wrong? Bad algorithm (the sort is at least O(n log n)),
and the variable gets copied multiple times
Updated with C++
dresd = remove.k.smallest(dresd)
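The idea behind the rewrite can be sketched in plain R (the rlme version is in C++ and may differ in details): drop the k smallest values without fully sorting the vector, using R's partial sorting. `remove_k_smallest` is an illustrative name, not the package function:

```r
# Sketch of the remove.k.smallest idea: a partial sort with
# partial = k guarantees positions 1..k hold the k smallest values
# (in arbitrary order), so dropping them avoids a full O(n log n) sort.
remove_k_smallest <- function(x, k) {
  x <- sort(x, partial = k)
  x[-(1:k)]
}
```

The returned values are not guaranteed to be sorted, which is fine when only the set of remaining residuals matters.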
Wilcoxon Tau Estimator
Test with 2,000 residuals: better!
Wilcoxon Tau
But what about really huge inputs?
Bootstrapping: when there are over 5,000 rows, repeat the
estimate on 1,000 sampled points 100 times
Not about speed, but about memory
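A hedged sketch of that bootstrap scheme; `tau_hat` below is a placeholder for the real Wilcoxon tau estimator, not rlme's code, and the sizes come from the slide:

```r
# Placeholder estimator standing in for the Wilcoxon tau
tau_hat <- function(r) median(abs(r))

# Estimate on 100 subsamples of 1,000 points and average:
# memory use is bounded by `size`, not by length(resid)
boot_tau <- function(resid, reps = 100, size = 1000) {
  ests <- replicate(reps, tau_hat(sample(resid, size, replace = TRUE)))
  mean(ests)
}

set.seed(1)
boot_tau(rnorm(50000))
</imports-placeholder>
```

Each subsample needs only O(size²) pairwise work, so the quadratic cost never touches the full data set.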
Pairup
Pairup function: generates every possible pair from the input
vector
Some rank-based estimators require pairwise operations
O(n²) complexity
Pairup
Original version: vectorized (14 LOC)
Loop version (12 LOC)
"combn" version (core R function, 1 LOC)
C++ version (12 LOC)
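The combn version really is a one-liner around a core R function; `pairup_combn` is an illustrative name:

```r
# One pair per row; there are choose(n, 2) rows in total
pairup_combn <- function(x) t(combn(x, 2))

pairup_combn(1:4)  # a choose(4, 2) = 6 by 2 matrix of pairs
```

combn returns one pair per column, so the transpose puts pairs in rows, matching the layout the other versions produce.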
Over to R!
Covariance Estimator
n × n covariance matrix
Changed to use preallocation
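A sketch of what the preallocation change looks like for an n × n matrix; the entry formula here is illustrative only, not rlme's covariance computation:

```r
# Allocate the n x n matrix once, up front, then fill it in place
# instead of growing it row by row.
n <- 200
x <- rnorm(n)
V <- matrix(0, nrow = n, ncol = n)  # one allocation
for (i in 1:n) {
  for (j in 1:n) {
    V[i, j] <- x[i] * x[j]          # illustrative entry formula
  }
}
```

Growing the matrix inside the loop would copy the whole object on every append, turning an O(n²) fill into something far worse.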
Summary
Identify
Rewrite
Benchmark
Test
Keeping Ahead
Parallelism
Cluster: Rmpi, snow
GPU: rpud
Probably not Hadoop, maybe Apache Spark?
Julia Language
Hadley Wickham (plyr, ggplot2, testthat, ...)
“Advanced R Programming”
Questions?
