R user-group-2011-09
Talk given on September 21 to the Bay Area R User Group. The talk walks a stochastic projection SVD algorithm through the steps from an initial implementation in R to a proposed map-reduce implementation that integrates cleanly with R via NFS export of the distributed file system. Not surprisingly, this algorithm is essentially the same as the one used by Mahout.

R user-group-2011-09 Presentation Transcript

  • 1. RHadoop and MapR
  • 2. The bad old days (i.e. now)
    Hadoop is a silo
    HDFS isn’t a normal file system
    Hadoop doesn’t really like C++
    R is limited
    One machine, one memory space
    Isn’t there any way we can just get along?
  • 3. The white knight
    MapR changes things
    Lots of new stuff like snapshots, NFS
    All you need to know, you already know
    NFS provides cluster wide file access
    Everything works the way you expect
    Performance high enough to use as a message bus
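
As a rough illustration of "everything works the way you expect", the sketch below reads and writes a file with ordinary R file functions. The directory is a local temporary directory standing in for an NFS-mounted cluster path, and the file name and its contents are made up for the example; none of them come from the talk.

```r
# Stand-in for a cluster directory exported over NFS (hypothetical);
# on a real cluster this would be whatever mount point is configured there
cluster.dir = tempdir()

# Ordinary file functions are all that is needed, no special client library
path = file.path(cluster.dir, "counts.csv")
write.csv(data.frame(word = c("cat", "dog"), n = c(3, 5)), path, row.names = FALSE)

counts = read.csv(path)
sum(counts$n)
```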
  • 4. Example, out-of-core SVD
    SVD provides compressed matrix form
    Based on sum of rank-1 matrices

    [Figure: a matrix written as a sum of rank-1 matrices]
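
The "sum of rank-1 matrices" view can be checked directly in R. This is a small sketch on an arbitrary random matrix (the sizes and seed are chosen only for illustration): each term is one singular value times the outer product of its left and right singular vectors.

```r
set.seed(1)
a = matrix(rnorm(6 * 4), nrow = 6)
s = svd(a)

# Add up the rank-1 terms d[i] * outer(u[,i], v[,i]) one at a time
approx = matrix(0, nrow = 6, ncol = 4)
for (i in seq_along(s$d)) {
  approx = approx + s$d[i] * outer(s$u[, i], s$v[, i])
}

# With all terms included, the sum reproduces a up to rounding error
max(abs(a - approx))
```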
  • 5. More on SVD
    SVD provides a very nice basis
  • 6. And a nifty approximation property
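
The transcript does not say which approximation property the slide showed. Assuming it is the usual one (keeping the k largest singular values gives the best rank-k approximation, with a Frobenius-norm error equal to the root sum of squares of the dropped singular values), a quick numerical check might look like this:

```r
set.seed(2)
a = matrix(rnorm(8 * 5), nrow = 8)
s = svd(a)
k = 2

# Rank-k approximation built from the k largest singular values
a.k = s$u[, 1:k] %*% diag(s$d[1:k], k) %*% t(s$v[, 1:k])

# Frobenius-norm error versus the root sum of squares of the dropped values
err = sqrt(sum((a - a.k)^2))
expected = sqrt(sum(s$d[(k + 1):length(s$d)]^2))
c(err, expected)
```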
  • 7. Also known as …
    Latent Semantic Indexing
    PCA
    Eigenvectors
  • 8. An application, approximate translation
    Translation distributes over concatenation
    But counting turns concatenation into addition
    This means that translation is linear!
    ish
  • 9. ish
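
The step "counting turns concatenation into addition" can be seen with a toy bag-of-words count. The vocabulary and sentences below are invented for the example; the point is only that the count vector of two concatenated texts is the sum of their count vectors.

```r
# Count words against a fixed vocabulary (toy example, not from the talk)
vocab = c("the", "cat", "sat", "dog", "ran")
count = function(words) as.vector(table(factor(words, levels = vocab)))

s1 = c("the", "cat", "sat")
s2 = c("the", "dog", "ran")

# Counts of the concatenation equal the sum of the individual counts
both = count(c(s1, s2))
both
```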
  • 10. Traditional computation
    Products of A are dominated by large singular values and corresponding vectors
    Subtracting these dominant singular values allows the next ones to appear
    Lanczos method, generally Krylov subspace
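
A minimal sketch of the iterative idea, using plain power iteration with deflation rather than Lanczos itself (a deliberately simpler stand-in). The helper name `top`, the matrix sizes and the iteration count are assumptions made for the example.

```r
set.seed(3)
a = matrix(rnorm(20 * 10), nrow = 20)

# Power iteration on t(a) %*% a: repeated products are dominated by the
# largest singular value and its vectors
top = function(a, iters = 1000) {
  v = rnorm(ncol(a))
  for (i in 1:iters) {
    v = t(a) %*% (a %*% v)
    v = v / sqrt(sum(v^2))
  }
  u = a %*% v
  d = sqrt(sum(u^2))
  list(d = d, u = u / d, v = v)
}

first = top(a)
# Deflate: subtract the dominant rank-1 term so the next one can appear
second = top(a - first$d * first$u %*% t(first$v))
c(first$d, second$d)
```

Every pass of the loop needs the whole matrix again, which is the cost the following slides object to in a map-reduce setting.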
  • 11. But …
  • 12. The gotcha
    Iteration in Hadoop is death
    Huge process invocation costs
    Lose all memory residency of data
    Total lost cause
  • 13. Randomness to the rescue
    To save the day, run all iterations at the same time
    [Figure: matrix equation diagrams involving A]
  • 14. In R
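    lsa = function(a, k, p) {
      # a: n x m matrix, k: rank to keep, p: extra (oversampling) columns
      m = dim(a)[2]
      # project a onto k+p random directions
      y = a %*% matrix(rnorm(m*(k+p)), nrow=m)
      # orthonormal basis Q for the range of y
      y.qr = qr(y)
      # b = t(Q) %*% a is small: (k+p) x m
      b = t(qr.Q(y.qr)) %*% a
      # QR of t(b) reduces the problem to a (k+p) x (k+p) SVD
      b.qr = qr(t(b))
      s = svd(t(qr.R(b.qr)))
      list(u=qr.Q(y.qr) %*% s$u[,1:k],
           d=s$d[1:k],
           v=qr.Q(b.qr) %*% s$v[,1:k])
    }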
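
A quick way to try the function is to compare it with R's built-in `svd` on a matrix that is nearly low rank. The block repeats the slide's function so that it runs on its own; the test matrix (rank 3 plus small noise), the seed and the choice k = 3, p = 5 are assumptions made for this sketch.

```r
lsa = function(a, k, p) {
  m = dim(a)[2]
  y = a %*% matrix(rnorm(m * (k + p)), nrow = m)
  y.qr = qr(y)
  b = t(qr.Q(y.qr)) %*% a
  b.qr = qr(t(b))
  s = svd(t(qr.R(b.qr)))
  list(u = qr.Q(y.qr) %*% s$u[, 1:k],
       d = s$d[1:k],
       v = qr.Q(b.qr) %*% s$v[, 1:k])
}

set.seed(4)
# Rank-3 signal plus a little noise, so the top 3 singular values stand out
a = matrix(rnorm(50 * 3), nrow = 50) %*% matrix(rnorm(3 * 30), nrow = 3) +
  1e-3 * matrix(rnorm(50 * 30), nrow = 50)

r = lsa(a, k = 3, p = 5)

# Compare the approximate singular values with the exact ones
rbind(approx = r$d, exact = svd(a)$d[1:3])
```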
  • 15. Not good enough yet
    Limited to memory size
    After memory limits, feature extraction dominates
  • 16. Hybrid architecture
    Map-reduce
    Side-data
    Via NFS
    Feature extraction and down sampling
    Input
    Data join
    Sequential SVD
  • 17. Hybrid architecture
    Map-reduce
    Side-data
    Via NFS
    Feature extraction and down sampling
    Input
    Data join
    R
    Visualization
    Sequential SVD
  • 18. Randomness to the rescue
    To save the day again, use blocks
    [Figure: matrix equation diagrams]
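
The transcript does not preserve the block equations, so the following is a sketch under the assumption that "use blocks" means splitting the rows of A into blocks that can be processed independently. It shows that the random projection and the product with t(Q) can both be assembled from per-block pieces. The QR step is still done in memory here; a fully distributed version would need a block-wise QR as well, which this sketch does not attempt. Sizes, seed and variable names are illustrative.

```r
set.seed(5)
a = matrix(rnorm(12 * 6), nrow = 12)
omega = matrix(rnorm(6 * 4), nrow = 6)

# Split the row indices of a into blocks, as separate map tasks might see them
blocks = unname(split(1:12, rep(1:3, each = 4)))

# Each block contributes its own rows of y = a %*% omega
y = do.call(rbind, lapply(blocks, function(i) a[i, ] %*% omega))

# b = t(q) %*% a is a sum of per-block products, which a reducer could add up
q = qr.Q(qr(y))
b = Reduce(`+`, lapply(blocks, function(i) t(q[i, ]) %*% a[i, ]))

# Both block-wise results match the all-at-once computation
c(max(abs(y - a %*% omega)), max(abs(b - t(q) %*% a)))
```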
  • 19. Hybrid architecture
    Map-reduce
    Feature extraction and down sampling
    Via NFS
    Map-reduce
    R
    Visualization
    Block-wise parallel SVD
  • 20. Conclusions
    Interoperability allows massive scalability
    Prototyping in R not wasted
    Map-reduce iteration not needed for SVD
    Feasible scale ~10^9 non-zeros or more