R user-group-2011-09

Talk given on September 21 to the Bay Area R User Group. The talk walks a stochastic projection SVD algorithm through the steps from initial implementation in R to a proposed implementation using map-reduce that integrates cleanly with R via NFS export of the distributed file system. Not surprisingly, this algorithm is essentially the same as the one used by Mahout.

Published in: Technology, Business
Transcript

  • 1. RHadoop and MapR
  • 2. The bad old days (i.e. now)
      Hadoop is a silo
      HDFS isn’t a normal file system
      Hadoop doesn’t really like C++
      R is limited
      One machine, one memory space
      Isn’t there any way we can just get along?
  • 3. The white knight
      MapR changes things
      Lots of new stuff like snapshots, NFS
      All you need to know, you already know
      NFS provides cluster wide file access
      Everything works the way you expect
      Performance high enough to use as a message bus
  • 4. Example, out-of-core SVD
      SVD provides compressed matrix form
      Based on sum of rank-1 matrices
      [Figure: a matrix approximated (≈) by a sum of rank-1 matrices]
  • 5. More on SVD
      SVD provides a very nice basis
  • 6. And a nifty approximation property
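The formulas on slides 4 to 6 did not survive in the transcript. As a reminder, these are the standard facts they most likely refer to. The notation is an assumption, not taken from the slides: A is an n×m matrix of rank r with singular values σ_i and singular vectors u_i, v_i, and A_k is the truncation to the first k terms.

```latex
A = U \Sigma V^{T} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{T},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T},
\qquad \| A - A_k \|_2 = \sigma_{k+1},
\qquad \| A - A_k \|_F = \sqrt{\textstyle\sum_{i>k} \sigma_i^{2}}
```

By the Eckart-Young theorem, no other matrix of rank at most k comes closer to A in either norm, which is presumably the "nifty approximation property" of slide 6.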
  • 7. Also known as …
      Latent Semantic Indexing
      PCA
      Eigenvectors
  • 8. An application, approximate translation
      Translation distributes over concatenation
      But counting turns concatenation into addition
      This means that translation is linear!
      ish
  • 9. ish
  • 10. Traditional computation
      Products of A are dominated by large singular values and corresponding vectors
      Subtracting these dominant singular values allows the next ones to appear
      Lanczos method, generally Krylov sub-space
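The idea on slide 10 can be illustrated with plain power iteration followed by deflation. This is a minimal sketch, not the Lanczos method the slide names: `top.singular` is a hypothetical helper, and the test matrix with singular values 10, 5, 2, 1 is made up for the illustration.

```r
# Hypothetical helper: power iteration for the leading singular triple of a.
top.singular = function(a, iters = 100) {
  v = rnorm(ncol(a))
  v = v / sqrt(sum(v^2))
  for (i in seq_len(iters)) {
    # repeated products with t(a) %*% a are dominated by the largest singular value
    v = crossprod(a, a %*% v)
    v = v / sqrt(sum(v^2))
  }
  av = a %*% v
  d = sqrt(sum(av^2))
  list(d = d, u = av / d, v = v)
}

set.seed(1)
# Made-up test matrix with known singular values 10, 5, 2, 1
u0 = qr.Q(qr(matrix(rnorm(40 * 4), nrow = 40)))
v0 = qr.Q(qr(matrix(rnorm(10 * 4), nrow = 10)))
a = u0 %*% diag(c(10, 5, 2, 1)) %*% t(v0)

first = top.singular(a)
# subtract the dominant rank-1 term so that the next singular value can appear
deflated = a - first$d * (first$u %*% t(first$v))
second = top.singular(deflated)
```

Each singular value needs its own run of many sequential passes over the matrix, which is the cost that the following slides object to.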
  • 11. But …
  • 12. The gotcha
      Iteration in Hadoop is death
      Huge process invocation costs
      Lose all memory residency of data
      Total lost cause
  • 13. Randomness to the rescue
      To save the day, run all iterations at the same time
      [Diagram: matrix equations involving A]
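The equations on slide 13 were lost in the transcript. The steps below are read off the R code on slide 14, so the symbols are an assumption rather than the slide's own notation: Ω is an m×(k+p) Gaussian random matrix, with k taken to be the target rank and p a small number of extra (oversampling) columns.

```latex
Y = A\,\Omega, \qquad Y = Q R, \qquad B = Q^{T} A, \qquad A \approx Q Q^{T} A = Q B

B^{T} = Q_B R_B, \qquad R_B^{T} = \tilde{U} D \tilde{V}^{T}
\;\Longrightarrow\;
A \approx (Q \tilde{U}) \, D \, (Q_B \tilde{V})^{T}
```

Multiplying by all k+p random vectors at once replaces the one-vector-at-a-time iteration of the traditional methods, and only the first k columns of the resulting factors are kept.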
  • 14. In R
      lsa = function(a, k, p) {
        n = dim(a)[1]
        m = dim(a)[2]
        # random projection of a onto k+p dimensions
        y = a %*% matrix(rnorm(m*(k+p)), nrow=m)
        y.qr = qr(y)
        # project a onto the orthonormal basis of y
        b = t(qr.Q(y.qr)) %*% a
        b.qr = qr(t(b))
        # SVD of the small (k+p) x (k+p) factor
        svd = svd(t(qr.R(b.qr)))
        list(u=qr.Q(y.qr) %*% svd$u[,1:k],
             d=svd$d[1:k],
             v=qr.Q(b.qr) %*% svd$v[,1:k])
      }
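A quick way to try the function from slide 14 is to compare it with R's built-in `svd()` on a small matrix. This is a sketch under stated assumptions: the test matrix (rank-5 signal plus a little noise), the sizes, and the names `fit`, `rel.err` and `d.err` are made up for the illustration, and `lsa()` is repeated only so that the block runs on its own.

```r
# lsa() as on slide 14, repeated so that this sketch is self-contained
lsa = function(a, k, p) {
  n = dim(a)[1]
  m = dim(a)[2]
  y = a %*% matrix(rnorm(m*(k+p)), nrow=m)
  y.qr = qr(y)
  b = t(qr.Q(y.qr)) %*% a
  b.qr = qr(t(b))
  svd = svd(t(qr.R(b.qr)))
  list(u=qr.Q(y.qr) %*% svd$u[,1:k],
       d=svd$d[1:k],
       v=qr.Q(b.qr) %*% svd$v[,1:k])
}

set.seed(42)
n = 200; m = 50; k = 5
# made-up data: a rank-5 signal plus small noise, so the matrix has full rank
signal = matrix(rnorm(n*k), nrow=n) %*% matrix(rnorm(k*m), nrow=k)
a = signal + 0.01 * matrix(rnorm(n*m), nrow=n)

fit = lsa(a, k=k, p=5)
exact = svd(a)

# relative Frobenius error of the rank-k reconstruction
a.hat = fit$u %*% diag(fit$d) %*% t(fit$v)
rel.err = sqrt(sum((a - a.hat)^2)) / sqrt(sum(a^2))
# largest deviation of the approximate singular values, relative to the top one
d.err = max(abs(fit$d - exact$d[1:k])) / exact$d[1]
```

The noise term keeps every intermediate matrix at full rank. R's `qr()` may pivot columns when its input is rank deficient, and the code on the slide does not account for that case.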
  • 15. Not good enough yet
      Limited to memory size
      After memory limits, feature extraction dominates
  • 16. Hybrid architecture
      [Diagram labels: Map-reduce; Side-data; Via NFS; Feature extraction and down sampling; Input; Data join; Sequential SVD]
  • 17. Hybrid architecture
      [Diagram labels: Map-reduce; Side-data; Via NFS; Feature extraction and down sampling; Input; Data join; R; Visualization; Sequential SVD]
  • 18. Randomness to the rescue
      To save the day again, use blocks
      [Diagram: block-wise matrix equations]
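One way to read "use blocks" is that both matrix products in the algorithm split cleanly over row blocks of A, so each block can be handled by a separate map task. The sketch below only checks that algebra in a single R session. The block size, the variable names, and the choice to run the QR step on the assembled thin matrix are assumptions for illustration, not the design from the talk.

```r
set.seed(1)
n = 120; m = 30; k = 4; p = 4
a = matrix(rnorm(n * m), nrow = n)
# the same random matrix must be shared by all blocks
omega = matrix(rnorm(m * (k + p)), nrow = m)

# three row blocks of 40 rows each (block size chosen arbitrarily)
blocks = unname(split(seq_len(n), ceiling(seq_len(n) / 40)))

# "map" step 1: each block computes its own slice of y = a %*% omega
y.blocks = lapply(blocks, function(rows) a[rows, , drop = FALSE] %*% omega)
y = do.call(rbind, y.blocks)

# QR of the thin n x (k+p) matrix, done in one piece here for simplicity
q = qr.Q(qr(y))

# "map" step 2: each block contributes t(q_i) %*% a_i, and the sum is b = t(q) %*% a
b.parts = lapply(blocks, function(rows)
  crossprod(q[rows, , drop = FALSE], a[rows, , drop = FALSE]))
b = Reduce(`+`, b.parts)
```

Only the small (k+p)×m matrix `b` and the thin matrix `y` ever need to be gathered in one place, which is what makes a block-wise parallel version plausible.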
  • 19. Hybrid architecture
      [Diagram labels: Map-reduce; Feature extraction and down sampling; Via NFS; Map-reduce; R; Visualization; Block-wise parallel SVD]
  • 20. Conclusions
      Inter-operability allows massive scalability
      Prototyping in R not wasted
      Map-reduce iteration not needed for SVD
      Feasible scale ~10^9 non-zeros or more
