RHadoop and MapR
The bad old days (i.e. now)
- Hadoop is a silo
  - HDFS isn’t a normal file system
  - Hadoop doesn’t really like C++
- R is limited
  - One machine, one memory space
- Isn’t there any way we can just get along?
The white knight
- MapR changes things
  - Lots of new stuff like snapshots, NFS
- All you need to know, you already know
- NFS provides cluster-wide file access
  - Everything works the way you expect
  - Performance high enough to use as a message bus
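A minimal sketch of what "everything works the way you expect" could look like from R: once the cluster is mounted over NFS, ordinary file calls should be all that is needed. The name `cluster_root` is hypothetical, and a temporary directory stands in for the real mount point so that the block runs anywhere.

```r
# Hypothetical mount point; a temporary directory stands in for the
# NFS-mounted cluster path so the sketch runs on any machine.
cluster_root <- tempfile("cluster-mount-")
dir.create(cluster_root)

# Write and read with the same base-R calls used for any local file.
path <- file.path(cluster_root, "features.csv")
write.csv(data.frame(id = 1:3, score = c(0.5, 1.5, 2.5)), path,
          row.names = FALSE)
features <- read.csv(path)
total_score <- sum(features$score)
```

On a real cluster, `cluster_root` would point at wherever the cluster file system is mounted; nothing else in the code would change.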
Example, out-of-core SVD
- SVD provides compressed matrix form
- Based on sum of rank-1 matrices
[Figure: a matrix approximated by a sum of rank-1 matrices]
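The sum of rank-1 matrices can be sketched with base R's `svd()`. The matrix sizes and the helper name `rank_k` are illustrative assumptions, not taken from the slides.

```r
set.seed(1)
a <- matrix(rnorm(50 * 20), nrow = 50)
s <- svd(a)

# Sum the first k rank-1 terms d[i] * u[, i] %*% t(v[, i]).
rank_k <- function(s, k) {
  approx <- matrix(0, nrow = nrow(s$u), ncol = nrow(s$v))
  for (i in seq_len(k)) {
    approx <- approx + s$d[i] * tcrossprod(s$u[, i], s$v[, i])
  }
  approx
}

a_full <- rank_k(s, length(s$d))  # all 20 terms recover a
a_5 <- rank_k(s, 5)               # compressed form with 5 terms

# Numbers stored: the full matrix versus 5 triples (u, d, v).
stored_full <- length(a)
stored_5 <- 5 * (nrow(a) + ncol(a) + 1)
```

A random matrix like this one compresses poorly; the storage saving only pays off when the singular values decay quickly, as they tend to for real feature matrices.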
More on SVD
- SVD provides a very nice basis
- And a nifty approximation property
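Both points can be checked numerically. The sketch below assumes the approximation property meant here is the standard one: truncating the SVD after k terms leaves an error, in the Frobenius norm, equal to the root sum of squares of the dropped singular values. The sizes and the choice k = 4 are arbitrary.

```r
set.seed(2)
a <- matrix(rnorm(40 * 15), nrow = 40)
s <- svd(a)
k <- 4

# Basis: the columns of u and of v are orthonormal.
basis_err <- max(abs(crossprod(s$u) - diag(15)),
                 abs(crossprod(s$v) - diag(15)))

# Approximation: error of the rank-k truncation in the Frobenius norm
# versus the root sum of squares of the dropped singular values.
a_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
trunc_err <- norm(a - a_k, type = "F")
dropped <- sqrt(sum(s$d[-(1:k)]^2))
```

`trunc_err` and `dropped` agree to rounding error, and no other rank-k matrix gets closer to `a` in this norm.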
Also known as …
- Latent Semantic Indexing
- PCA
- Eigenvectors
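A small check of two of these aliases, using only base R and the `stats` package. The data are random and purely illustrative. Latent Semantic Indexing is the same decomposition applied to a term-document matrix, so it is not shown separately.

```r
set.seed(3)
x <- matrix(rnorm(100 * 6), nrow = 100)
xc <- scale(x, center = TRUE, scale = FALSE)  # column-centered copy
s <- svd(xc)

# PCA: prcomp() reports singular values rescaled to standard deviations.
pca <- prcomp(x)
pca_sdev_from_svd <- s$d / sqrt(nrow(x) - 1)

# Eigenvectors: t(xc) %*% xc has eigenvalues d^2 and eigenvectors v.
eig <- eigen(crossprod(xc), symmetric = TRUE)
```

The columns of `pca$rotation`, `eig$vectors`, and `s$v` match up to sign.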
An application, approximate translation
- Translation distributes over concatenation
- But counting turns concatenation into addition
- This means that translation is linear! (ish)
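The counting argument can be made concrete with a toy example. Everything below (the vocabulary, the two documents, and the word-for-word dictionary matrix) is made up for illustration; real translation is only approximately linear, hence the "ish".

```r
# Hypothetical five-word vocabulary and two toy documents.
vocab <- c("the", "cat", "sat", "dog", "ran")
doc1 <- c("the", "cat", "sat")
doc2 <- c("the", "dog", "ran", "the")

# Count occurrences of each vocabulary word, in vocabulary order.
count_words <- function(words) {
  as.vector(table(factor(words, levels = vocab)))
}

counts_concat <- count_words(c(doc1, doc2))          # count the concatenation
counts_sum <- count_words(doc1) + count_words(doc2)  # add the separate counts

# Made-up dictionary as a 0/1 matrix: target words in rows, source words
# in columns. Any such matrix acts linearly on count vectors.
target <- c("le", "chat", "assis", "chien", "couru")
dictionary <- diag(length(vocab))
dimnames(dictionary) <- list(target, vocab)
translated_concat <- dictionary %*% counts_concat
translated_sum <- dictionary %*% count_words(doc1) +
  dictionary %*% count_words(doc2)
```

Counting the concatenated document gives the same vector as adding the two count vectors, so a matrix applied to counts can be applied to each piece separately.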
Traditional computation
- Products of A are dominated by large singular values and corresponding vectors
- Subtracting these dominant singular values allows the next ones to appear
- Lanczos method, generally Krylov subspace
- But …
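The idea behind the first two bullets can be sketched with plain power iteration plus deflation. This is a simpler stand-in for the Lanczos method named on the slide, not an implementation of it; the test matrix, with singular values 10, 5, and 1, and the helper name `top_singular` are assumptions made for the example.

```r
set.seed(4)
# Build a 30 x 20 matrix with known singular values 10, 5, 1.
u0 <- qr.Q(qr(matrix(rnorm(30 * 3), nrow = 30)))
v0 <- qr.Q(qr(matrix(rnorm(20 * 3), nrow = 20)))
a <- u0 %*% diag(c(10, 5, 1)) %*% t(v0)

# Repeated products with t(a) %*% a pull a random vector toward the
# dominant right singular vector.
top_singular <- function(a, iters = 200) {
  v <- rnorm(ncol(a))
  for (i in seq_len(iters)) {
    v <- crossprod(a, a %*% v)  # one full pass over a
    v <- v / sqrt(sum(v^2))
  }
  av <- a %*% v
  d <- sqrt(sum(av^2))
  list(d = d, u = av / d, v = v)
}

first <- top_singular(a)
# Subtract the dominant rank-1 term so the next one can appear.
deflated <- a - first$d * first$u %*% t(first$v)
second <- top_singular(deflated)
```

Every trip through the loop is a complete pass over `a`, and the passes must happen one after another.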
The gotcha
- Iteration in Hadoop is death
- Huge process invocation costs
- Lose all memory residency of data
- Total lost cause
Randomness to the rescue
- To save the day, run all iterations at the same time
[Figure: matrix equations involving A]
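In R

    lsa = function(a, k, p) {
      n = dim(a)[1]
      m = dim(a)[2]
      # project a onto k+p random directions
      y = a %*% matrix(rnorm(m*(k+p)), nrow=m)
      # orthonormal basis for the range of y
      y.qr = qr(y)
      # small (k+p) x m matrix carrying the dominant structure of a
      b = t(qr.Q(y.qr)) %*% a
      # reduce b to a (k+p) x (k+p) factor and take its exact SVD
      b.qr = qr(t(b))
      svd = svd(t(qr.R(b.qr)))
      # map the small factors back and keep the first k components
      list(u=qr.Q(y.qr) %*% svd$u[,1:k],
           d=svd$d[1:k],
           v=qr.Q(b.qr) %*% svd$v[,1:k])
    }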
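A usage sketch comparing `lsa` with base R's exact `svd()`. The function is repeated from the slide so that the block runs on its own. The test matrix is an assumption chosen for the example: five large singular values (10 down to 6) followed by a small, slowly decaying tail, with k = 5 and oversampling p = 5.

```r
# Copy of the slide's function, repeated so the block is self-contained.
lsa = function(a, k, p) {
  n = dim(a)[1]
  m = dim(a)[2]
  y = a %*% matrix(rnorm(m*(k+p)), nrow=m)
  y.qr = qr(y)
  b = t(qr.Q(y.qr)) %*% a
  b.qr = qr(t(b))
  svd = svd(t(qr.R(b.qr)))
  list(u=qr.Q(y.qr) %*% svd$u[,1:k],
       d=svd$d[1:k],
       v=qr.Q(b.qr) %*% svd$v[,1:k])
}

set.seed(5)
# 60 x 40 test matrix with a known spectrum.
u0 <- qr.Q(qr(matrix(rnorm(60 * 40), nrow = 60)))
v0 <- qr.Q(qr(matrix(rnorm(40 * 40), nrow = 40)))
d0 <- c(10, 9, 8, 7, 6, 0.01 * 0.9^(0:34))
a <- u0 %*% diag(d0) %*% t(v0)

approx <- lsa(a, k = 5, p = 5)
exact <- svd(a)

# Gap between approximate and exact leading singular values.
d_gap <- max(abs(approx$d - exact$d[1:5]))
# Spectral-norm error of the rank-5 reconstruction.
recon_err <- norm(a - approx$u %*% diag(approx$d) %*% t(approx$v),
                  type = "2")
```

The result is random, so the agreement is approximate rather than exact; it tightens as the gap between the kept and the dropped singular values grows, or as `p` is increased.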
Not good enough yet
- Limited to memory size
- After memory limits, feature extraction dominates
Hybrid architecture
[Diagram. Components: input, map-reduce (feature extraction and down sampling, join with side-data), data passed via NFS, sequential SVD, R visualization]
Randomness to the rescue
- To save the day again, use blocks
[Figure: block-wise matrix equations]
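The slide's equations did not survive extraction, so the following is only one plausible reading of "use blocks": split the rows of the matrix into blocks, compute the random projection per block, and accumulate the small matrix b as a sum over blocks. The block count, sizes, and variable names are assumptions, and `lapply` stands in for independent map tasks.

```r
set.seed(42)
a <- matrix(rnorm(200 * 30), nrow = 200)
omega <- matrix(rnorm(30 * 8), nrow = 30)  # shared random matrix

# Four row blocks of 50 rows each.
blocks <- split(seq_len(nrow(a)), rep(1:4, each = 50))

# Each block of y needs only its own rows of a.
y_blocks <- lapply(blocks, function(rows) a[rows, , drop = FALSE] %*% omega)
y <- do.call(rbind, y_blocks)
q <- qr.Q(qr(y))

# b = t(q) %*% a is a sum of per-block products, so it can be
# accumulated block by block.
b_parts <- lapply(blocks, function(rows) {
  crossprod(q[rows, , drop = FALSE], a[rows, , drop = FALSE])
})
b <- Reduce(`+`, b_parts)
```

In this sketch the QR step still sees all of `y` at once; `y` has only 8 columns here, so it is far smaller than `a`.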
Hybrid architecture
[Diagram. Components: map-reduce (feature extraction and down sampling), data passed via NFS, map-reduce (block-wise parallel SVD), R visualization]
Conclusions
- Inter-operability allows massive scalability
- Prototyping in R not wasted
- Map-reduce iteration not needed for SVD
- Feasible scale ~10^9 non-zeros or more

R user-group-2011-09