R user-group-2011-09

The bad old days (i.e. now)
- Hadoop is a silo
- HDFS isn’t a normal file system
- Hadoop doesn’t really like C++
- R is limited
- One machine, one memory space
- Isn’t there any way we can just get along?

The white knight
- MapR changes things
- Lots of new stuff like snapshots, NFS
- All you need to know, you already know
- NFS provides cluster-wide file access
- Everything works the way you expect
- Performance high enough to use as a message bus

Example, out-of-core SVD
- SVD provides compressed matrix form
- Based on sum of rank-1 matrices
[Figure: a matrix ≈ rank-1 matrix + rank-1 matrix + ...]
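To make the "sum of rank-1 matrices" idea concrete, here is a minimal sketch in base R. The matrix, its size, and the helper name rank1.sum are made up for illustration and do not come from the talk.

```r
# Rebuild a small matrix from the rank-1 pieces of its SVD.
set.seed(1)
a <- matrix(rnorm(6 * 4), nrow = 6)
s <- svd(a)

# Hypothetical helper: add up the first k terms d[i] * u[, i] %o% v[, i]
rank1.sum <- function(s, k) {
  total <- matrix(0, nrow = nrow(s$u), ncol = nrow(s$v))
  for (i in seq_len(k)) {
    total <- total + s$d[i] * (s$u[, i] %o% s$v[, i])
  }
  total
}

full <- rank1.sum(s, length(s$d))   # all terms: reproduces a up to rounding
approx2 <- rank1.sum(s, 2)          # first two terms: a rank-2 approximation
max(abs(full - a))
```

Keeping only k terms means storing k * (n + m + 1) numbers instead of n * m, which is where the compression comes from.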

More on SVD
- SVD provides a very nice basis

And a nifty approximation property
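The slide's formula is not in the extracted text. Assuming it refers to the standard result that truncating the SVD after k terms gives the best rank-k approximation, the following sketch checks the two error identities that go with that result. The test matrix is arbitrary.

```r
set.seed(2)
a <- matrix(rnorm(8 * 5), nrow = 8)
s <- svd(a)
k <- 2

# Rank-k truncation of the SVD
a.k <- s$u[, 1:k] %*% diag(s$d[1:k], k) %*% t(s$v[, 1:k])

# Frobenius error equals the norm of the discarded singular values
frob.err <- sqrt(sum((a - a.k)^2))
tail.norm <- sqrt(sum(s$d[(k + 1):length(s$d)]^2))

# Spectral (2-norm) error equals the first discarded singular value
spec.err <- norm(a - a.k, type = "2")

c(frob.err, tail.norm)
c(spec.err, s$d[k + 1])
```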

Also known as …
- Latent Semantic Indexing
- PCA
- Eigenvectors

An application, approximate translation
- Translation distributes over concatenation
- But counting turns concatenation into addition
- This means that translation is linear! (ish)
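The counting step can be checked directly. This sketch uses a made-up five-word vocabulary and two toy documents; it only illustrates that word counts of a concatenation equal the sum of the word counts, which is the half of the argument that makes a linear model plausible.

```r
# Hypothetical toy vocabulary; any fixed vocabulary works the same way.
vocab <- c("the", "cat", "sat", "dog", "ran")

# Count occurrences of each vocabulary word in a vector of tokens
count.words <- function(words) {
  as.vector(table(factor(words, levels = vocab)))
}

doc1 <- c("the", "cat", "sat")
doc2 <- c("the", "dog", "ran", "the")
both <- c(doc1, doc2)                    # concatenation of the documents

count.words(both)                        # 3 1 1 1 1
count.words(doc1) + count.words(doc2)    # same vector
```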

Traditional computation
- Products of A are dominated by large singular values and corresponding vectors
- Subtracting these dominant singular values allows the next ones to appear
- Lanczos method; more generally, Krylov subspace methods
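The first two bullets can be sketched with plain power iteration plus deflation. This is the simplest relative of the Krylov methods named above, not the Lanczos method itself. The function name top.singular and the test matrix with singular values 10, 5, 2, 1 are assumptions made for the example.

```r
# Power iteration on t(a) %*% a for the leading singular triple.
top.singular <- function(a, iters = 200) {
  v <- rnorm(ncol(a))
  v <- v / sqrt(sum(v^2))
  for (i in seq_len(iters)) {
    w <- t(a) %*% (a %*% v)   # repeated products amplify the dominant direction
    v <- w / sqrt(sum(w^2))
  }
  av <- a %*% v
  d <- sqrt(sum(av^2))
  list(d = d, u = av / d, v = v)
}

# Test matrix built to have singular values 10, 5, 2, 1
set.seed(3)
u0 <- qr.Q(qr(matrix(rnorm(20 * 4), nrow = 20)))
v0 <- qr.Q(qr(matrix(rnorm(6 * 4), nrow = 6)))
a <- u0 %*% diag(c(10, 5, 2, 1)) %*% t(v0)

first <- top.singular(a)
# Deflation: subtract the dominant rank-1 term so the next one can appear
deflated <- a - first$d * (first$u %*% t(first$v))
second <- top.singular(deflated)

c(first$d, second$d)   # close to 10 and 5
```

Each singular value found this way needs its own run of many sequential passes over the matrix, which is the cost the following slides react to.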

The gotcha
- Iteration in Hadoop is death
- Huge process invocation costs
- Lose all memory residency of data
- Total lost cause

Randomness to the rescue
- To save the day, run all iterations at the same time
[Figure: matrix equations involving A]

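In R

    # Randomized SVD: k = singular triples to keep, p = extra (oversampling) columns
    lsa = function(a, k, p) {
      n = dim(a)[1]
      m = dim(a)[2]
      # project a onto k+p random directions
      y = a %*% matrix(rnorm(m*(k+p)), nrow=m)
      y.qr = qr(y)
      # b is the small (k+p) x m matrix t(Q) %*% a
      b = t(qr.Q(y.qr)) %*% a
      b.qr = qr(t(b))
      # SVD of the (k+p) x (k+p) triangular factor; named s to avoid masking svd()
      s = svd(t(qr.R(b.qr)))
      list(u=qr.Q(y.qr) %*% s$u[,1:k],
           d=s$d[1:k],
           v=qr.Q(b.qr) %*% s$v[,1:k])
    }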
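A quick way to try the function is to compare it against base R's svd() on a small matrix. The sketch below repeats the slide's function so that the block runs on its own; the test matrix (exactly rank 15, so that k + p = 15 random directions capture its whole range) and the sizes are assumptions chosen to make the comparison clean.

```r
# lsa as on the slide, repeated here so the example is self-contained
lsa = function(a, k, p) {
  m = dim(a)[2]
  y = a %*% matrix(rnorm(m*(k+p)), nrow=m)
  y.qr = qr(y)
  b = t(qr.Q(y.qr)) %*% a
  b.qr = qr(t(b))
  s = svd(t(qr.R(b.qr)))
  list(u=qr.Q(y.qr) %*% s$u[,1:k], d=s$d[1:k], v=qr.Q(b.qr) %*% s$v[,1:k])
}

set.seed(4)
n <- 200; m <- 50; r <- 15
a <- matrix(rnorm(n * r), nrow = n) %*% matrix(rnorm(r * m), nrow = r)  # rank 15

fit <- lsa(a, k = 5, p = 10)
exact <- svd(a)

max(abs(fit$d - exact$d[1:5]))           # difference in the top 5 singular values
max(abs(crossprod(fit$u) - diag(5)))     # columns of u are orthonormal
```

On matrices whose rank exceeds k + p the result is an approximation rather than an exact match, and its quality depends on how fast the singular values decay.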

Not good enough yet
- Limited to memory size
- After memory limits, feature extraction dominates

Hybrid architecture
[Diagram labels: Input; Map-reduce; Feature extraction and down sampling; Data join; Side-data; Via NFS; Sequential SVD]

Hybrid architecture
[Diagram labels: Input; Map-reduce; Feature extraction and down sampling; Data join; Side-data; Via NFS; Sequential SVD; R; Visualization]

Randomness to the rescue
- To save the day again, use blocks
[Figure: block-wise matrix equations]
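The slide's block diagram is not in the extracted text, so the following is only one way the idea could be sketched. It assumes the matrix is split into row blocks and that the orthogonalization is done through a Cholesky factor of the summed Gram matrix; that choice is a stand-in so that every step becomes a sum over blocks, and is not confirmed as the talk's method. The name block.lsa and all sizes are hypothetical.

```r
# Randomized SVD that touches the matrix one row block at a time.
block.lsa <- function(blocks, k, p) {
  m <- ncol(blocks[[1]])
  omega <- matrix(rnorm(m * (k + p)), nrow = m)   # one random projection shared by all blocks

  # Pass 1: project each block, then sum the small Gram matrices
  ys <- lapply(blocks, function(a.i) a.i %*% omega)
  gram <- Reduce(`+`, lapply(ys, crossprod))
  r <- chol(gram)                                  # gram = t(r) %*% r, so Q = Y r^{-1}

  # Pass 2: B = t(Q) %*% A = r^{-T} %*% sum_i t(Y_i) %*% A_i
  yta <- Reduce(`+`, Map(function(y.i, a.i) crossprod(y.i, a.i), ys, blocks))
  b <- backsolve(r, yta, transpose = TRUE)         # solves t(r) %*% b = yta

  s <- svd(b)
  # Left singular vectors, block by block: U_i = Y_i r^{-1} U_b
  rinv.u <- backsolve(r, s$u[, 1:k, drop = FALSE])
  u <- lapply(ys, function(y.i) y.i %*% rinv.u)
  list(u = do.call(rbind, u), d = s$d[1:k], v = s$v[, 1:k, drop = FALSE])
}

set.seed(5)
n <- 200; m <- 50
a <- matrix(rnorm(n * 15), nrow = n) %*% matrix(rnorm(15 * m), nrow = 15)
row.groups <- split(seq_len(n), rep(1:4, each = 50))
blocks <- lapply(row.groups, function(idx) a[idx, , drop = FALSE])

fit <- block.lsa(blocks, k = 5, p = 10)
max(abs(fit$d - svd(a)$d[1:5]))
```

Each lapply or Map over blocks is independent work followed by a sum, which is the shape a map-reduce job needs. The Gram-matrix route squares the condition number of the projected matrix, so a production version would likely want a more careful orthogonalization.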

Hybrid architecture
[Diagram labels: Map-reduce; Feature extraction and down sampling; Via NFS; Map-reduce; Block-wise parallel SVD; R; Visualization]

Conclusions
- Interoperability allows massive scalability
- Prototyping in R is not wasted
- Map-reduce iteration is not needed for SVD
- Feasible scale: ~10^9 non-zeros or more
