Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Matrix Decomposition at Scale: Matrix decomposition is an incredibly common task in machine learning, appearing everywhere including recommendation algorithms (SVD++), dimensionality reduction (PCA), and natural language processing (Latent Semantic Analysis). Many well-known existing libraries can compute matrix decompositions when matrices fit in memory on a single machine. When the matrix no longer fits in memory and distributed computation is required, the computation becomes more complex and the details of the implementation become much more important. In this talk I will focus on the three major open source implementations of distributed eigen/singular value decomposition: LanczosSolver and StochasticSVD in Mahout, and the SVD implementation in Spark MLlib. I will discuss the tradeoffs of these implementations from the perspective of real-world performance (beyond big-O notation for flops) and accuracy. I will conclude with some guidelines for choosing which implementation to use based on accuracy, performance, and scale requirements.
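
For reference, the decomposition named in the abstract can be written as below. This is standard textbook notation rather than anything taken from the slides; the symbols m, n, r, k, U, Σ, and V are choices made for this note.

```latex
M = U \Sigma V^{*}, \qquad
M \in \mathbb{R}^{m \times n}, \quad
U^{*} U = I, \quad V^{*} V = I, \quad
\Sigma = \operatorname{diag}(\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0)

M_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{*}, \qquad k \le r
```

The truncated sum M_k is the rank-k approximation that PCA and Latent Semantic Analysis rely on, which is why the distributed solvers discussed in the talk only need the top k singular triplets rather than the full factorization.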

Published in: Technology

  1. © Cloudera, Inc. All rights reserved. Juliet Hougland, Data Scientist, @j_houg: Matrix Decomposition at Scale
  2. The Singular Value Decomposition
  3. SVD is applied everywhere
     • Dimensionality Reduction/PCA
     • Feature dimension reduction
     • Visualization of gene expression data
     • Latent Semantic Indexing
     • Low Rank Approximations
     • Digital Signal Processing
     Image: A Global Map of Human Gene Expression. Lukk et al. [1]
  4. Define SVD
  5. Totally awesome LANL video
  6. This doesn’t work on distributed, commodity setups. Good Cluster / Bad Cluster
  7. 3 Distributed OSS SVD Implementations
     • Mahout: Lanczos
     • Mahout: Stochastic
     • Spark: Lanczos
  8. Lanczos’ Method
  9. Lanczos’ Method
     • Iterative, with the dominant cost a matrix-vector multiply
     • Requires at least k iterations to get k singular vectors
  10. Stochastic SVD
      • Randomly project original matrix to lower dimensional space
      • Factorize the projected matrix.
      • Unproject
      $M \approx Q Q^{*} M$
      Finding Structure in Randomness. Halko et al. http://bit.ly/19VVRXp
  11. Frameworks
      Mahout:
      • What I test is written on MapReduce
      • Driver programs launch the series of required MapReduce jobs
      • Lots of writing intermediate data to disk
      Spark:
      • Using the MLlib component
      • Relies on Spark core
      • => tries to pin data in memory
  12. Note! Mahout Scala & Spark Bindings are integrated in Mahout. Version 0.10 release next month will move these methods. The Scala DSL for linear algebra:
      val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
  13. Performance Comparisons
  14. [3]
  15. MapReduce [4]
  16. Go Bananas tuning! [5]
  17. My Cluster [6]
      6 Nodes running CDH 5.3*
      Per Node:
      2 physical cores
      24, with hyper threading => 144 total available cores
      64 GB Memory
      100 TB free in HDFS!
      *Running Spark 1.3
  18. What am I factorizing? [7]
  19. What am I timing? [8]
  20. Think of the polar bears [9]
  21. Varying Columns
  22. Varying Rows
  23. Varying Sparsity
  24. Progress in Numerical Computation [10]
  25. Thanks for the images!
      1. Genome PCA: http://bit.ly/1OxXMRy
      2. SVD at LANL: http://bit.ly/193IIdY
      3. Apples and Oranges: http://bit.ly/1xd1Q4d
      4. Sound Board: http://bit.ly/19okavV
      5. Bananas: http://bit.ly/1EGxh4p
      6. Eniac: http://bit.ly/1F0GOWC
      7. Big data pix tumblr: http://bigdatapix.tumblr.com/
      8. Watch: http://bit.ly/1FZtIKX
      9. Polar Bears: http://bit.ly/1G0gXQw
      10. Progress in numerical computing: http://bit.ly/1ID8WR5
  26. Thanks! juliet@cloudera.com @j_houg https://github.com/jhlch/svd-benchmark
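
The two bullets on the Lanczos slide (slide 9) can be made concrete with a small single-machine sketch. This is plain Python, not Mahout's LanczosSolver or Spark's implementation; the function names (`lanczos_gram`, `matvec`, `rmatvec`) and the use of full reorthogonalization are assumptions made for illustration. It runs the Lanczos recurrence on the Gram operator MᵀM without ever forming it, so each step costs one multiply by M and one by Mᵀ, and k steps yield a k-dimensional basis.

```python
import math
import random


def matvec(M, x):
    """Compute M @ x for a dense list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def rmatvec(M, y):
    """Compute M^T @ y without materializing the transpose."""
    out = [0.0] * len(M[0])
    for row, yi in zip(M, y):
        for j, a in enumerate(row):
            out[j] += a * yi
    return out


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def lanczos_gram(M, k, seed=0):
    """Run up to k Lanczos steps on the Gram operator M^T M.

    Returns (alphas, betas, basis): the diagonal and off-diagonal of the
    tridiagonal matrix T, and the orthonormal basis vectors. Fewer than
    k vectors are returned if the recurrence breaks down early.
    """
    rng = random.Random(seed)
    n = len(M[0])
    q = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(dot(q, q))
    q = [x / norm for x in q]

    basis, alphas, betas = [], [], []
    for step in range(k):
        basis.append(q)
        # The dominant cost: one multiply by M and one by M^T per step.
        w = rmatvec(M, matvec(M, q))
        alphas.append(dot(w, q))
        # Full reorthogonalization against every basis vector so far
        # (simple and stable, but not what a tuned solver would do).
        for _ in range(2):
            for v in basis:
                c = dot(w, v)
                w = [wi - c * vi for wi, vi in zip(w, v)]
        beta = math.sqrt(dot(w, w))
        if step == k - 1 or beta < 1e-12:
            break
        betas.append(beta)
        q = [wi / beta for wi in w]
    return alphas, betas, basis
```

Estimates of the singular values would come from the square roots of the eigenvalues of the small tridiagonal matrix built from `alphas` and `betas`; that final step is omitted here and could be handed to any in-memory eigensolver.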

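The three bullets on the Stochastic SVD slide (slide 10), together with the relation M ≈ QQ*M, can be sketched on one machine as follows. This is a plain-Python illustration of the randomized range finder described by Halko et al., not Mahout's StochasticSVD job; `random_range_basis`, `project`, `unproject`, and the `oversample` parameter are names chosen for this sketch. The "factorize" step, an SVD of the small projected matrix B, is left out and could be done with any in-memory routine.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def random_range_basis(M, k, oversample=2, seed=0):
    """Return orthonormal vectors (the columns of Q) spanning M @ Omega.

    Omega is an n x (k + oversample) Gaussian test matrix, so the result
    approximately captures the dominant column space of M.
    """
    rng = random.Random(seed)
    n = len(M[0])
    basis = []
    for _ in range(k + oversample):
        omega = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [dot(row, omega) for row in M]  # one column of Y = M @ Omega
        norm0 = math.sqrt(dot(y, y))
        # Gram-Schmidt (two passes) against the vectors kept so far.
        for _ in range(2):
            for q in basis:
                c = dot(y, q)
                y = [yi - c * qi for yi, qi in zip(y, q)]
        norm = math.sqrt(dot(y, y))
        # Keep the direction only if it is not already covered.
        if norm0 > 0.0 and norm > 1e-10 * norm0:
            basis.append([yi / norm for yi in y])
    return basis


def project(M, basis):
    """B = Q^T M: the small (len(basis) x n) matrix to factorize."""
    cols = list(zip(*M))
    return [[dot(q, col) for col in cols] for q in basis]


def unproject(basis, B):
    """Q @ B: map the small matrix back to the original row dimension."""
    m, n = len(basis[0]), len(B[0])
    return [[sum(q[i] * b[j] for q, b in zip(basis, B)) for j in range(n)]
            for i in range(m)]
```

For a matrix whose rank is at most k, `unproject(Q, project(M, Q))` reproduces M up to rounding. For a general matrix the error depends on how quickly the singular values decay, which is why the sketch draws a few extra random vectors beyond k.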