R user-group-2011-09


Talk given on September 21 to the Bay Area R User Group. The talk walks a stochastic projection SVD algorithm through the steps from an initial implementation in R to a proposed implementation using map-reduce that integrates cleanly with R via NFS export of the distributed file system. Not surprisingly, this algorithm is essentially the same as the one used by Mahout.



  1. RHadoop and MapR
  2. The bad old days (i.e. now)
     - Hadoop is a silo
     - HDFS isn’t a normal file system
     - Hadoop doesn’t really like C++
     - R is limited: one machine, one memory space
     - Isn’t there any way we can just get along?
  3. The white knight
     - MapR changes things: lots of new stuff like snapshots, NFS
     - All you need to know, you already know
     - NFS provides cluster-wide file access
     - Everything works the way you expect
     - Performance high enough to use as a message bus
  4. Example: out-of-core SVD
     - SVD provides a compressed matrix form
     - Based on a sum of rank-1 matrices
     [diagram: a matrix approximated as a sum of rank-1 (outer-product) terms]
  5. More on SVD
     - SVD provides a very nice basis
  6. And a nifty approximation property
  7. Also known as …
     - Latent Semantic Indexing
     - PCA
     - Eigenvectors
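The approximation property on the slide above is the Eckart–Young theorem: truncating the SVD at rank k gives the best rank-k approximation, with spectral-norm error equal to the (k+1)-th singular value. A quick base-R illustration (the test matrix here is made up for the demo):

```r
# Rank-k approximation via the SVD.
set.seed(1)
a <- matrix(rnorm(200), nrow = 20)   # a made-up 20 x 10 matrix
s <- svd(a)
k <- 3
# Keep only the k largest singular values and their vectors.
a.k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
# Eckart-Young: the spectral-norm error of the truncated SVD
# equals the (k+1)-th singular value.
err <- svd(a - a.k)$d[1]
```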
  8. An application: approximate translation
     - Translation distributes over concatenation
     - But counting turns concatenation into addition
     - This means that translation is linear! (ish)
  9. ish
  10. Traditional computation
      - Products with A are dominated by the large singular values and corresponding vectors
      - Subtracting these dominant singular values allows the next ones to appear
      - Lanczos method; more generally, Krylov sub-space methods
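As a stand-in sketch for the Krylov methods named above (a simplification, not the talk's actual method), plain power iteration on AᵀA shows how repeated products with A are dominated by the leading singular pair:

```r
# Power iteration for the leading singular value of A: repeated
# multiplication by t(A) %*% A converges to the top right singular
# vector, because that direction dominates every product.
set.seed(2)
a <- matrix(rnorm(300), nrow = 30)     # a made-up 30 x 10 matrix
v <- rnorm(ncol(a))
v <- v / sqrt(sum(v^2))                # random unit starting vector
for (i in 1:200) {
  w <- t(a) %*% (a %*% v)              # one multiply by A^T A
  v <- w / sqrt(sum(w^2))              # renormalize
}
sigma1 <- sqrt(sum((a %*% v)^2))       # estimate of the largest singular value
```

Deflation (subtracting sigma1 times the outer product of the converged vectors) then exposes the next singular pair, which is the iteration the later slides are trying to avoid on Hadoop.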
  11. But …
  12. The gotcha
      - Iteration in Hadoop is death
      - Huge process-invocation costs
      - Lose all memory residency of data
      - Total lost cause
  13. Randomness to the rescue
      - To save the day, run all iterations at the same time
      [diagram: products of A with a block of random vectors, computed in one pass]
  14. In R

      lsa = function(a, k, p) {
        n = dim(a)[1]
        m = dim(a)[2]
        y = a %*% matrix(rnorm(m * (k + p)), nrow = m)
        y.qr = qr(y)
        b = t(qr.Q(y.qr)) %*% a
        b.qr = qr(t(b))
        svd = svd(t(qr.R(b.qr)))
        list(u = qr.Q(y.qr) %*% svd$u[, 1:k],
             d = svd$d[1:k],
             v = qr.Q(b.qr) %*% svd$v[, 1:k])
      }
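The slide's lsa() can be sanity-checked against base R's svd(): with k equal to the full rank and p = 0, the random range basis captures A exactly, so the factorization should reproduce A to machine precision (the test matrix and sizes below are made up for the demo; the unused n assignment from the slide is dropped):

```r
# Randomized SVD as on the slide, checked against svd().
lsa <- function(a, k, p) {
  m <- dim(a)[2]
  y <- a %*% matrix(rnorm(m * (k + p)), nrow = m)  # random projection Y = A * Omega
  y.qr <- qr(y)                                    # Q is an orthonormal basis for range(Y)
  b <- t(qr.Q(y.qr)) %*% a                         # small matrix B = Q' A
  b.qr <- qr(t(b))                                 # QR of B' gives B = R' Q_b'
  svd <- svd(t(qr.R(b.qr)))                        # tiny (k+p) x (k+p) SVD
  list(u = qr.Q(y.qr) %*% svd$u[, 1:k],
       d = svd$d[1:k],
       v = qr.Q(b.qr) %*% svd$v[, 1:k])
}

set.seed(3)
a <- matrix(rnorm(500), nrow = 50)   # made-up 50 x 10 matrix, rank 10
r <- lsa(a, k = 10, p = 0)           # full rank, so the result should be exact
a.hat <- r$u %*% diag(r$d) %*% t(r$v)
```

For a genuine low-rank use (k much smaller than the rank), a small oversampling p (say 5–10) keeps the approximation accurate.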
  15. Not good enough yet
      - Limited to memory size
      - After memory limits, feature extraction dominates
  16. Hybrid architecture
      [diagram: Input → feature extraction and down-sampling (map-reduce, with a data join against side-data via NFS) → sequential SVD]
  17. Hybrid architecture
      [diagram: Input → feature extraction and down-sampling (map-reduce, with a data join against side-data via NFS) → sequential SVD → R visualization]
  18. Randomness to the rescue
      - To save the day again, use blocks
      [diagram: the same factorization carried out block by block]
  19. Hybrid architecture
      [diagram: Input → feature extraction and down-sampling (map-reduce) → block-wise parallel SVD (map-reduce) → R visualization via NFS]
  20. Conclusions
      - Inter-operability allows massive scalability
      - Prototyping in R is not wasted
      - Map-reduce iteration is not needed for SVD
      - Feasible scale: ~10^9 non-zeros or more