R user-group-2011-09

Talk given on September 21 to the Bay Area R User Group. The talk walks a stochastic projection SVD algorithm through the steps from an initial implementation in R to a proposed map-reduce implementation that integrates cleanly with R via NFS export of the distributed file system. Not surprisingly, this algorithm is essentially the same as the one used by Mahout.

Published in: Technology, Business

Transcript

  • 1. RHadoop and MapR
  • 2. The bad old days (i.e. now)
    Hadoop is a silo
    HDFS isn’t a normal file system
    Hadoop doesn’t really like C++
    R is limited
    One machine, one memory space
    Isn’t there any way we can just get along?
  • 3. The white knight
    MapR changes things
    Lots of new stuff like snapshots, NFS
    All you need to know, you already know
    NFS provides cluster wide file access
    Everything works the way you expect
    Performance high enough to use as a message bus
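If the cluster file system is visible through NFS, ordinary R file calls should be enough to exchange data with it. A minimal sketch under that assumption; the directory below is a hypothetical stand-in for the mount point (a temporary directory is used so the snippet runs anywhere):

```r
# Hypothetical stand-in for an NFS mount of the cluster file system.
# On a real system this would be the directory where the cluster is mounted.
cluster.dir = file.path(tempdir(), "cluster-mount")
dir.create(cluster.dir, showWarnings=FALSE, recursive=TRUE)

# Write a small matrix with plain R I/O, as a job on the cluster side might
a = matrix(1:6, nrow=2)
out.file = file.path(cluster.dir, "a.tsv")
write.table(a, out.file, sep="\t", row.names=FALSE, col.names=FALSE)

# Read it back with plain R I/O; no Hadoop-specific client is involved
a.back = as.matrix(read.table(out.file, sep="\t"))
dimnames(a.back) = NULL
```

Nothing here is specific to Hadoop or MapR, which is the point of the slide: the same calls work on a local disk and on a mounted cluster.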
  • 4. Example, out-of-core SVD
    SVD provides compressed matrix form
    Based on sum of rank-1 matrices

    [Figure: a matrix as a sum of rank-1 matrices]
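The "sum of rank-1 matrices" view can be checked directly in R. The sketch below uses a small random matrix (an arbitrary example, not data from the talk) and rebuilds it one rank-1 term at a time from the output of `svd`:

```r
set.seed(42)
a = matrix(rnorm(12), nrow=4)      # small 4 x 3 example matrix
s = svd(a)

# Rebuild a as d[1]*u1 v1' + d[2]*u2 v2' + d[3]*u3 v3'
a.sum = matrix(0, nrow(a), ncol(a))
for (i in seq_along(s$d)) {
  a.sum = a.sum + s$d[i] * (s$u[, i] %o% s$v[, i])
}
max.err = max(abs(a - a.sum))      # should be at rounding-error level
```

Keeping only the first few terms of the loop is what gives the compressed form.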
  • 5. More on SVD
    SVD provides a very nice basis
  • 6. And a nifty approximation property
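The slide body is not in the transcript. Presumably the property meant is the standard one: keeping the k largest singular values gives the best rank-k approximation, and the spectral-norm error equals the first dropped singular value. A small check on an arbitrary random matrix:

```r
set.seed(42)
a = matrix(rnorm(200), nrow=20)    # 20 x 10 example matrix
s = svd(a)
k = 3

# Keep only the k largest singular values and their vectors
a.k = s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# Spectral-norm error of the truncation vs. the first dropped singular value
err = norm(a - a.k, "2")
next.d = s$d[k + 1]
```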
  • 7. Also known as …
    Latent Semantic Indexing
    PCA
    Eigenvectors
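The link between these names can be seen numerically. A sketch on arbitrary random data, assuming the usual definitions: PCA is the SVD of the column-centered data, and the eigenvalues of the cross-product matrix are the squared singular values.

```r
set.seed(42)
x = matrix(rnorm(300), nrow=50)            # 50 observations, 6 variables
xc = scale(x, center=TRUE, scale=FALSE)    # column-centered data

s = svd(xc)
e = eigen(crossprod(xc), symmetric=TRUE)   # eigen-decomposition of xc' xc
pc = prcomp(x)                             # PCA; centers columns by default

# Eigenvalues of xc' xc are the squared singular values
eig.err = max(abs(e$values - s$d^2))
# PCA loadings equal the right singular vectors, up to sign
rot.err = max(abs(abs(pc$rotation) - abs(s$v)))
# PCA standard deviations are d / sqrt(n - 1)
sd.err = max(abs(pc$sdev - s$d / sqrt(nrow(x) - 1)))
```

Latent Semantic Indexing applies the same decomposition to a term-document count matrix.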
  • 8. An application, approximate translation
    Translation distributes over concatenation
    But counting turns concatenation into addition
    This means that translation is linear!
    ish
  • 9. ish
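The counting step can be illustrated with a toy vocabulary (the words and documents below are made up for the example). Counting words in a concatenation gives the sum of the separate counts, so any map that distributes over concatenation acts additively on count vectors:

```r
# Count words over a fixed vocabulary (a bag-of-words vector)
vocab = c("the", "cat", "sat", "on", "mat")
count.words = function(words) {
  as.vector(table(factor(words, levels=vocab)))
}

doc1 = c("the", "cat", "sat")
doc2 = c("on", "the", "mat")

# Counting the concatenation gives the sum of the separate counts
lhs = count.words(c(doc1, doc2))
rhs = count.words(doc1) + count.words(doc2)
```

Word order is lost in the counts, which is one reason the linearity is only approximate.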
  • 10. Traditional computation
    Products of A are dominated by large singular values and corresponding vectors
    Subtracting these dominant singular values allows the next ones to appear
    Lanczos method, generally Krylov sub-space
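A minimal sketch of the idea, using plain power iteration with deflation rather than Lanczos (a simpler stand-in, not the method named on the slide). The function name `power.svd1` and the test matrix are made up for the example:

```r
# Leading singular triple of a by power iteration on a'a
power.svd1 = function(a, iters=500) {
  v = rnorm(ncol(a))
  v = v / sqrt(sum(v^2))
  for (i in 1:iters) {
    v = as.vector(crossprod(a, a %*% v))   # v <- a' a v, one pass over a
    v = v / sqrt(sum(v^2))
  }
  u = as.vector(a %*% v)
  d = sqrt(sum(u^2))
  list(u=u / d, d=d, v=v)
}

set.seed(42)
# Example matrix with known, well-separated singular values 5, 3, 1
u0 = qr.Q(qr(matrix(rnorm(40), nrow=10)))[, 1:3]
v0 = qr.Q(qr(matrix(rnorm(24), nrow=8)))[, 1:3]
a = u0 %*% diag(c(5, 3, 1)) %*% t(v0)

first = power.svd1(a)
# Deflate: subtract the dominant rank-1 term so the next one can appear
a2 = a - first$d * (first$u %o% first$v)
second = power.svd1(a2)
```

Every loop iteration is a full pass over the matrix, which is why this style of algorithm maps poorly onto a system where each pass is expensive to launch.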
  • 11. But …
  • 12. The gotcha
    Iteration in Hadoop is death
    Huge process invocation costs
    Lose all memory residency of data
    Total lost cause
  • 13. Randomness to the rescue
    To save the day, run all iterations at the same time
    [Figure: matrix equations involving A]
  • 14. In R
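    # Stochastic projection SVD: rank-k approximation of a,
    # with p extra random directions for oversampling
    lsa = function(a, k, p) {
      m = dim(a)[2]
      # Project a onto k+p random directions to sample its range
      y = a %*% matrix(rnorm(m*(k+p)), nrow=m)
      y.qr = qr(y)
      # b = Q' a is small: k+p rows, m columns
      b = t(qr.Q(y.qr)) %*% a
      b.qr = qr(t(b))
      # SVD of the small (k+p) x (k+p) triangular factor
      s = svd(t(qr.R(b.qr)))
      list(u=qr.Q(y.qr) %*% s$u[,1:k],
           d=s$d[1:k],
           v=qr.Q(b.qr) %*% s$v[,1:k])
    }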
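A usage sketch for the function on this slide, comparing it against the exact `svd` on a made-up test matrix (rank-5 signal plus a little noise). The definition is repeated, with the result variable renamed to `s`, so the snippet runs on its own; the sizes and the choice `p=10` are arbitrary:

```r
# Definition repeated from the slide so this snippet runs on its own
lsa = function(a, k, p) {
  m = dim(a)[2]
  y = a %*% matrix(rnorm(m*(k+p)), nrow=m)
  y.qr = qr(y)
  b = t(qr.Q(y.qr)) %*% a
  b.qr = qr(t(b))
  s = svd(t(qr.R(b.qr)))
  list(u=qr.Q(y.qr) %*% s$u[,1:k],
       d=s$d[1:k],
       v=qr.Q(b.qr) %*% s$v[,1:k])
}

set.seed(42)
# Hypothetical test matrix: rank-5 signal plus a little noise
n = 300; m = 40; k = 5
a = matrix(rnorm(n*k), n, k) %*% matrix(rnorm(k*m), k, m) +
    0.01 * matrix(rnorm(n*m), n, m)

res = lsa(a, k=k, p=10)
exact = svd(a)

# Relative error of the leading singular values
d.err = max(abs(res$d - exact$d[1:k]) / exact$d[1:k])
# Relative error of the rank-k reconstruction, in Frobenius norm
a.k = res$u %*% diag(res$d) %*% t(res$v)
rec.err = norm(a - a.k, "F") / norm(a, "F")
```

The only passes over `a` are the two matrix products; everything else works on matrices with k+p rows or columns.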
  • 15. Not good enough yet
    Limited to memory size
    After memory limits, feature extraction dominates
  • 16. Hybrid architecture
    Map-reduce
    Side-data
    Via NFS
    Feature extraction and down sampling
    Input
    Data join
    Sequential SVD
  • 17. Hybrid architecture
    Map-reduce
    Side-data
    Via NFS
    Feature extraction and down sampling
    Input
    Data join
    R
    Visualization
    Sequential SVD
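The R side of this architecture might look like the sketch below. It assumes the map-reduce step writes tab-separated (row, column, value) triples; that format, the file name, and the tiny data set are all hypothetical, and a temporary file stands in for a path on the NFS mount:

```r
# Hypothetical stand-in for a map-reduce output file visible through NFS.
# A temporary file is used here so the snippet runs anywhere.
part.file = file.path(tempdir(), "part-00000")
writeLines(c("1\t1\t2.0", "1\t3\t1.0", "2\t2\t3.0", "3\t1\t1.0", "3\t3\t4.0"),
           part.file)

# Read (row, column, value) triples with ordinary R I/O
triples = read.table(part.file, sep="\t", col.names=c("i", "j", "x"))

# Assemble the down-sampled matrix in memory and run the sequential SVD
a = matrix(0, nrow=max(triples$i), ncol=max(triples$j))
a[cbind(triples$i, triples$j)] = triples$x
s = svd(a)
```

The result `s` is an ordinary R list, so plotting and further analysis proceed as usual.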
  • 18. Randomness to the rescue
    To save the day again, use blocks
    [Figure: block-wise matrix equations]
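The slide's equations are not in the transcript, so the following is only one plausible block-wise scheme, not necessarily the one shown in the talk. It processes the matrix in row blocks and, as an assumption of this sketch, replaces the QR step with a Cholesky factor of Y'Y, which can be accumulated block by block. The block size and test matrix are arbitrary:

```r
set.seed(1)
n = 200; m = 30; k = 5; p = 5
# Hypothetical test matrix: rank-5 signal plus noise
a = matrix(rnorm(n*k), n, k) %*% matrix(rnorm(k*m), k, m) +
    0.1 * matrix(rnorm(n*m), n, m)
omega = matrix(rnorm(m*(k+p)), nrow=m)
blocks = split(1:n, ceiling((1:n)/50))    # row indices, 50 rows per block

# Pass 1: accumulate the small matrix Y'Y, where Y = a omega
yty = matrix(0, k+p, k+p)
for (idx in blocks) {
  y.i = a[idx, , drop=FALSE] %*% omega
  yty = yty + crossprod(y.i)
}
r = chol(yty)                             # upper triangular, r'r = Y'Y

# Pass 2: accumulate b = Q'a with Q_i = y_i r^{-1}; Y is never stored
b = matrix(0, k+p, m)
for (idx in blocks) {
  a.i = a[idx, , drop=FALSE]
  q.i = (a.i %*% omega) %*% solve(r)
  b = b + crossprod(q.i, a.i)
}
d.block = svd(b)$d

# Reference: the same projection done in one piece with QR
q = qr.Q(qr(a %*% omega))
d.ref = svd(t(q) %*% a)$d
```

Each loop body touches one block only and the accumulated matrices are small, so the two passes are natural candidates for map-reduce jobs with a sum as the reduce step.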
  • 19. Hybrid architecture
    Map-reduce
    Feature extraction and down sampling
    Via NFS
    Map-reduce
    R
    Visualization
    Block-wise parallel SVD
  • 20. Conclusions
    Interoperability allows massive scalability
    Prototyping in R not wasted
    Map-reduce iteration not needed for SVD
    Feasible scale ~10^9 non-zeros or more
