R user group 2011 09

Talk given on September 2011 to the Bay Area R User Group. The talk walks a stochastic project SVD algorithm through the steps from initial implementation in R to a proposed implementation using map-reduce that integrates cleanly with R via NFS export of the distributed file system. Not surprisingly, this algorithm is essentially the same as the one used by Mahout.


  1. R, Hadoop and MapR (8/9/2013, © MapR Confidential)
  2. The bad old days (i.e. now)
     • Hadoop is a silo
     • HDFS isn’t a normal file system
     • Hadoop doesn’t really like C++
     • R is limited
       • One machine, one memory space
     • Isn’t there any way we can just get along?
  3. The white knight
     • MapR changes things
     • Lots of new stuff like snapshots, NFS
     • All you need to know, you already know
       • NFS provides cluster-wide file access
       • Everything works the way you expect
       • Performance high enough to use as a message bus
  4. Example: out-of-core SVD
     • SVD provides a compressed matrix form
     • Based on a sum of rank-1 matrices:
       $A = \sigma_1 u_1 v_1' + \sigma_2 u_2 v_2' + e$
       [figure: a matrix approximated as a sum of rank-1 outer products]
  5. More on SVD
     • SVD provides a very nice basis:
       $Ax = A\left[\sum_i a_i v_i\right] = \left[\sum_j \sigma_j u_j v_j'\right]\left[\sum_i a_i v_i\right] = \sum_i a_i \sigma_i u_i$
  6. • And a nifty approximation property:
       $Ax = \sigma_1 a_1 u_1 + \sigma_2 a_2 u_2 + \sum_{i>2} \sigma_i a_i u_i$
       $\|e\|^2 \le \sum_{i>2} \sigma_i^2$
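The approximation bound on this slide is easy to check numerically. A minimal R sketch (not from the talk; the matrix sizes and seed are arbitrary choices for illustration):

```r
# Check the truncation bound: for a unit vector x, dropping all but the
# first two terms of the SVD gives an error whose squared norm is at most
# the sum of the squared tail singular values.
set.seed(42)
A <- matrix(rnorm(50 * 30), nrow = 50)
s <- svd(A)
x <- rnorm(30)
x <- x / sqrt(sum(x^2))                                 # unit vector
A2 <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])   # rank-2 truncation
err2 <- sum((A %*% x - A2 %*% x)^2)                     # ||e||^2
bound <- sum(s$d[-(1:2)]^2)                             # sum_{i>2} sigma_i^2
err2 <= bound                                           # TRUE
```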
  7. Also known as …
     • Latent Semantic Indexing
     • PCA
     • Eigenvectors
  8. An application: approximate translation
     • Translation distributes over concatenation
     • But counting turns concatenation into addition
     • This means that translation is linear!
       $T(s_1 | s_2) = T(s_1) | T(s_2)$
       $k(s_1 | s_2) = k(s_1) + k(s_2)$
       $k(T(s_1 | s_2)) = k(T(s_1)) + k(T(s_2))$
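The "counting turns concatenation into addition" step can be made concrete with a toy word-count function; the vocabulary and sentences below are invented for illustration:

```r
# k() maps a sentence to a vector of word counts over a fixed vocabulary,
# so concatenating sentences adds their count vectors: k(s1 | s2) = k(s1) + k(s2).
vocab <- c("the", "cat", "sat", "mat", "dog")
k <- function(s) {
  words <- strsplit(s, " ")[[1]]
  sapply(vocab, function(w) sum(words == w))
}
s1 <- "the cat sat"
s2 <- "the dog"
all(k(paste(s1, s2)) == k(s1) + k(s2))   # TRUE
```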
  9. …ish
  10. Traditional computation
     • Products of A are dominated by the large singular values and corresponding vectors
     • Subtracting these dominant singular values allows the next ones to appear
     • Lanczos method, generally Krylov sub-space:
       $A\,(A'A)^n = U S^{2n+1} V'$
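The identity on this slide is why repeated products sharpen the spectrum: each multiplication by A'A raises every singular value to a higher odd power, widening the gaps. A quick numerical check (sizes and seed arbitrary):

```r
# The singular values of A (A'A)^n are sigma_i^(2n+1), so the leading
# singular directions dominate after a few multiplications.
set.seed(1)
A <- matrix(rnorm(20 * 10), nrow = 20)
s <- svd(A)$d
AtA <- t(A) %*% A
P <- A %*% AtA %*% AtA              # A (A'A)^2, i.e. n = 2
sp <- svd(P)$d
isTRUE(all.equal(sp, s^5))          # TRUE, up to floating-point error
```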
  11. But …
  12. The gotcha
     • Iteration in Hadoop is death
     • Huge process invocation costs
     • Lose all memory residency of data
     • Total lost cause
  13. Randomness to the rescue
     • To save the day, run all iterations at the same time:
       $Y = AW$
       $QR = Y$
       $B = Q'A$
       $USV' = B$
       $(QU)\,S\,V' \approx A$
  14. In R
     lsa = function(a, k, p) {
       n = dim(a)[1]
       m = dim(a)[2]
       y = a %*% matrix(rnorm(m*(k+p)), nrow=m)
       y.qr = qr(y)
       b = t(qr.Q(y.qr)) %*% a
       b.qr = qr(t(b))
       svd = svd(t(qr.R(b.qr)))
       list(u=qr.Q(y.qr) %*% svd$u[,1:k],
            d=svd$d[1:k],
            v=qr.Q(b.qr) %*% svd$v[,1:k])
     }
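The slide's function can be exercised against R's exact svd(). The snippet below restates lsa() so it runs standalone, then applies it to an exactly rank-5 test matrix (sizes and seed are arbitrary), where the randomized sketch recovers the leading singular values essentially exactly:

```r
# Randomized SVD as on the slide: sketch Y = AW, orthonormalize via QR,
# project down to B = Q'A, then take an exact SVD of the small matrix.
lsa = function(a, k, p) {
  m = dim(a)[2]
  y = a %*% matrix(rnorm(m * (k + p)), nrow = m)   # random sketch Y = AW
  y.qr = qr(y)
  b = t(qr.Q(y.qr)) %*% a                          # B = Q'A
  b.qr = qr(t(b))
  s = svd(t(qr.R(b.qr)))
  list(u = qr.Q(y.qr) %*% s$u[, 1:k],
       d = s$d[1:k],
       v = qr.Q(b.qr) %*% s$v[, 1:k])
}

set.seed(3)
# exactly rank-5 matrix: the sketch captures its whole column space
a <- matrix(rnorm(200 * 5), nrow = 200) %*% matrix(rnorm(5 * 50), nrow = 5)
fit <- lsa(a, k = 5, p = 10)
exact <- svd(a)
max(abs(fit$d - exact$d[1:5]))   # tiny: the rank-5 spectrum is recovered
```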
  15. Not good enough yet
     • Limited to memory size
     • After memory limits, feature extraction dominates
  16. Hybrid architecture
     [diagram: input → feature extraction and down-sampling (map-reduce) → data join with side-data → sequential SVD, connected via NFS]
  17. Hybrid architecture
     [diagram: input → feature extraction and down-sampling (map-reduce) → data join with side-data → sequential SVD → R visualization, connected via NFS]
  18. Randomness to the rescue
     • To save the day again, use blocks:
       $Y_i = A_i W$
       $R'R = Y'Y = \sum_i Y_i' Y_i$
       $B_j = \sum_i (A_i W R^{-1})' A_{ij}$
       $LL' = BB'$
       $USV' = L$
       $(AWR^{-1}U)\,S\,(V'L^{-1}B) \approx A$
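A possible reading of the block formulas in plain R (my sketch, not the talk's map-reduce code; the sizes, seed, and four-way row split are arbitrary): each block A_i only ever touches its own rows, and the only cross-block state is the small accumulated matrices Y'Y and B.

```r
set.seed(4)
kp <- 15                                       # k + p sketch columns
A  <- matrix(rnorm(200 * 50), nrow = 200)
W  <- matrix(rnorm(50 * kp), nrow = 50)
blocks <- split(1:200, rep(1:4, each = 50))    # four row blocks A_i

# pass 1: Y_i = A_i W per block; accumulate Y'Y, then R'R = Y'Y via Cholesky
YtY <- matrix(0, kp, kp)
for (i in blocks) {
  Yi  <- A[i, , drop = FALSE] %*% W
  YtY <- YtY + t(Yi) %*% Yi
}
R <- chol(YtY)                                 # upper triangular, R'R = Y'Y

# pass 2: B = sum_i (A_i W R^-1)' A_i, i.e. Q'A computed block by block
B <- matrix(0, kp, 50)
for (i in blocks) {
  Qi <- A[i, , drop = FALSE] %*% W %*% solve(R)
  B  <- B + t(Qi) %*% A[i, , drop = FALSE]
}

# small in-memory finish: LL' = BB', and the SVD of L gives the
# approximate leading singular values of A
L <- t(chol(B %*% t(B)))                       # lower triangular
d <- svd(L)$d
d[1] / svd(A)$d[1]                             # close to 1
```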
  19. Hybrid architecture
     [diagram: feature extraction and down-sampling (map-reduce) → block-wise parallel SVD (map-reduce) → R visualization, connected via NFS]
  20. Conclusions
     • Inter-operability allows massive scalability
     • Prototyping in R is not wasted
     • Map-reduce iteration is not needed for SVD
     • Feasible scale: ~10^9 non-zeros or more
