Apache Mahout
How it's good, how it's awesome, and where it falls short
©MapR Technologies 2013

This is a summary of what I think is good and bad about Mahout. Presented on the eve of the 2013 Hadoop Summit.
What is Mahout?
 “Scalable machine learning”
– not just Hadoop-oriented machine learning
– not entirely, that is. Just mostly.
 Components
– math library
– clustering
– classification
– decompositions
– recommendations
What is Right and Wrong with Mahout?
 Components
– recommendations
– math library
– clustering
– classification
– decompositions
– other stuff
All the stuff that isn’t there
Mahout Math
 Goals are
– basic linear algebra,
– and statistical sampling,
– and good clustering,
– decent speed,
– extensibility,
– especially for sparse data
 But not
– totally badass speed
– comprehensive set of algorithms
– optimization, root finders, quadrature
Matrices and Vectors
 At the core:
– DenseVector, RandomAccessSparseVector
– DenseMatrix, SparseRowMatrix
 Highly composable API
 Important ideas:
– view*, assign and aggregate
– iteration
m.viewDiagonal().assign(v)
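To make the composable API concrete, here is a minimal sketch using the core types named above and the view/assign/aggregate idiom; the class name and data are illustrative, not from the slides.

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class CoreTypesSketch {
  public static void main(String[] args) {
    // Dense and sparse vectors share the same Vector interface.
    Vector dense = new DenseVector(new double[] {1, 2, 3, 4, 5});
    Vector sparse = new RandomAccessSparseVector(1000000);
    sparse.set(42, 3.14);           // only non-zeros are stored

    // A small dense matrix; assign fills it, views expose slices in place.
    Matrix m = new DenseMatrix(5, 5);
    m.assign(1.0);                  // every element becomes 1.0
    m.viewDiagonal().assign(dense); // copy the vector onto the diagonal

    // Aggregate: zSum() adds up all elements of a vector view.
    double rowTotal = m.viewRow(0).zSum();
    System.out.println("row 0 sum = " + rowTotal);
  }
}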
Assign
 Matrices
 Vectors
Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);
Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);
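A short usage sketch of the assign variants above; it assumes the elementwise function constants (ABS, PLUS, MULT) from org.apache.mahout.math.function.Functions.

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.Functions;

public class AssignSketch {
  public static void main(String[] args) {
    Vector a = new DenseVector(new double[] {1, -2, 3});
    Vector b = new DenseVector(new double[] {10, 20, 30});

    a.assign(Functions.ABS);        // elementwise: a[i] = |a[i]|
    a.assign(b, Functions.PLUS);    // elementwise: a[i] = a[i] + b[i]
    a.assign(Functions.MULT, 0.5);  // elementwise: a[i] = a[i] * 0.5

    System.out.println(a);          // now 5.5, 11.0, 16.5
  }
}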
Views
 Matrices
 Vectors
Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();
Vector viewPart(int offset, int length);
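Views are live slices rather than copies, so assigning through a view mutates the underlying matrix. A minimal sketch with illustrative values:

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

public class ViewSketch {
  public static void main(String[] args) {
    Matrix m = new DenseMatrix(4, 4);

    m.viewDiagonal().assign(1.0);            // turn m into the identity
    m.viewRow(2).assign(7.0);                // overwrite an entire row in place
    m.viewColumn(3).assign(m.viewColumn(0)); // copy column 0 into column 3

    // A 2x2 window starting at row 1, column 1; still backed by m.
    Matrix window = m.viewPart(1, 2, 1, 2);
    window.assign(0.0);

    System.out.println(m);
  }
}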
Examples
 The trace of a matrix:
m.viewDiagonal().zSum()
 Random projection:
m.times(new DenseMatrix(1000, 3).assign(new Normal()))
 Low rank random matrix (see the sketch below)
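The slides show code for the first two examples only; the following is one plausible sketch of a low rank random matrix, built as the product of two thin Gaussian factors using the same calls that appear above (DenseMatrix, assign(new Normal()), times, transpose).

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.jet.random.Normal;

public class LowRankRandomMatrix {
  public static void main(String[] args) {
    int n = 1000, m = 200, rank = 5;

    // Two thin random factors ...
    Matrix u = new DenseMatrix(n, rank).assign(new Normal());
    Matrix v = new DenseMatrix(m, rank).assign(new Normal());

    // ... whose product is an n x m matrix of rank at most 5.
    Matrix lowRank = u.times(v.transpose());
    System.out.println(lowRank.rowSize() + " x " + lowRank.columnSize());
  }
}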
Recommenders
Examples of Recommendations
 Customers buying books (Linden et al)
 Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix)
 Internet radio listeners not skipping songs (Musicmatch)
 Internet video watchers watching >30 s (Veoh)
 Visibility in a map UI (new Google maps)
Recommendation Basics
 History:
User  Thing
  1     3
  2     4
  3     4
  2     3
  3     2
  1     1
  2     1
Recommendation Basics
 History as matrix:
 (t1, t3) cooccur 2 times,
 (t1, t4) once,
 (t2, t4) once,
 (t3, t4) once

     t1  t2  t3  t4
u1    1   0   1   0
u2    1   0   1   1
u3    0   1   0   1
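The cooccurrence counts above are just AᵀA computed from the history matrix; a minimal sketch with the matrix values copied from the slide (class and variable names are mine):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

public class Cooccurrence {
  public static void main(String[] args) {
    // Rows are users u1..u3, columns are things t1..t4.
    Matrix a = new DenseMatrix(new double[][] {
        {1, 0, 1, 0},
        {1, 0, 1, 1},
        {0, 1, 0, 1}
    });

    // A-transpose times A counts how often two things were done by the same user.
    Matrix cooccurrence = a.transpose().times(a);
    System.out.println(cooccurrence.get(0, 2)); // (t1, t3) cooccur 2 times
  }
}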
A Quick Simplification
 Users who do h
 Also do r
r = Aᵀ(Ah) = (AᵀA)h
– where A is the user × thing history matrix and h is one user's history
– Aᵀ(Ah): user-centric recommendations
– (AᵀA)h: item-centric recommendations
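The same identity in code, reusing the history matrix above; h is a hypothetical user history over the four things, and the sketch assumes Matrix.times(Vector) from Mahout math:

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

public class QuickSimplification {
  public static void main(String[] args) {
    Matrix a = new DenseMatrix(new double[][] {
        {1, 0, 1, 0},
        {1, 0, 1, 1},
        {0, 1, 0, 1}
    });
    // A hypothetical user who has done t1 only.
    Vector h = new DenseVector(new double[] {1, 0, 0, 0});

    // User-centric: find users like h, then what they did.
    Vector userCentric = a.transpose().times(a.times(h));

    // Item-centric: apply the precomputed cooccurrence matrix to h.
    Vector itemCentric = a.transpose().times(a).times(h);

    // Same result either way; only the grouping differs.
    System.out.println(userCentric);
    System.out.println(itemCentric);
  }
}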
Clustering
An Example
Diagonalized Cluster Proximity
Parallel Speedup?
[Figure: time per point (μs) vs. number of threads, comparing the threaded and non-threaded versions against perfect scaling]
Lots of Clusters Are Fine
Decompositions
Low Rank Matrix
 Or should we see it differently?
 Are these scaled up versions of all the same column?

 1   2    5
 2   4   10
10  20   50
20  40  100
Low Rank Matrix
 Matrix multiplication is designed to make this easy
 We can see weighted column patterns, or weighted row patterns
 All the same mathematically:

[1 2 10 20]ᵀ × [1 2 5]
(column pattern, or weights) × (weights, or row pattern)
Low Rank Matrix
 What about here?
 This is like before, but there is one exceptional value

 1    2    5
 2    4   10
10  100   50
20   40  100
Low Rank Matrix
 OK … add in a simple fixer upper
[1 2 10 20]ᵀ × [1 2 5]  +  [0 0 10 0]ᵀ × [0 8 0]
(the rank-1 part)          (which row) × (exception pattern)
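A hedged sketch that rebuilds the matrix from the rank-1 part plus the exception term; the outer-product helper is mine, written with plain loops so it relies only on basic Matrix/Vector calls.

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

public class RankOnePlusException {
  // Outer product u vᵀ built with plain loops (helper name is mine).
  static Matrix outer(Vector u, Vector v) {
    Matrix m = new DenseMatrix(u.size(), v.size());
    for (int i = 0; i < u.size(); i++) {
      for (int j = 0; j < v.size(); j++) {
        m.set(i, j, u.get(i) * v.get(j));
      }
    }
    return m;
  }

  public static void main(String[] args) {
    Vector rowWeights = new DenseVector(new double[] {1, 2, 10, 20});
    Vector colPattern = new DenseVector(new double[] {1, 2, 5});
    Vector whichRow   = new DenseVector(new double[] {0, 0, 10, 0});
    Vector exception  = new DenseVector(new double[] {0, 8, 0});

    // Rank-1 structure plus a single "fixer upper" term.
    Matrix reconstructed = outer(rowWeights, colPattern).plus(outer(whichRow, exception));
    System.out.println(reconstructed); // third row becomes 10, 100, 50
  }
}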
Random Projection
SVD Projection
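The slide itself is a figure; as a companion, here is a hedged sketch of an SVD projection that assumes mahout-math's SingularValueDecomposition with its Colt-style getV() accessor. It projects 20-dimensional data onto the top 3 right singular vectors.

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.SingularValueDecomposition;
import org.apache.mahout.math.jet.random.Normal;

public class SvdProjection {
  public static void main(String[] args) {
    // Some data: 1000 points in 20 dimensions.
    Matrix a = new DenseMatrix(1000, 20).assign(new Normal());

    // Assumes the JAMA-style decomposition shipped with mahout-math.
    SingularValueDecomposition svd = new SingularValueDecomposition(a);

    // Keep the top 3 right singular vectors and project the data onto them.
    int k = 3;
    Matrix vk = svd.getV().viewPart(0, 20, 0, k);
    Matrix projected = a.times(vk); // 1000 x 3

    System.out.println(projected.rowSize() + " x " + projected.columnSize());
  }
}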
Classifiers
Mahout Classifiers
 Naïve Bayes
– high quality implementation
– uses idiosyncratic input format
– … but it is naïve
 SGD
– sequential, not parallel
– auto-tuning has foibles
– learning rate annealing has issues
– definitely not state of the art compared to Vowpal Wabbit
 Random forest
– scaling limits due to decomposition strategy
– yet another input format
– no deployment strategy
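For reference, a minimal sketch of the SGD path, assuming the OnlineLogisticRegression API from org.apache.mahout.classifier.sgd (constructor taking category count, feature count, and a prior; train(); classifyScalar()); the toy data and names are mine.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import java.util.Random;

public class SgdSketch {
  public static void main(String[] args) {
    // Assumed API: binary logistic regression on 3 features with an L1 prior.
    OnlineLogisticRegression learner = new OnlineLogisticRegression(2, 3, new L1());

    Random rand = new Random(42);
    for (int i = 0; i < 10000; i++) {
      // Toy data: label is 1 when the first feature exceeds the second.
      double x0 = rand.nextGaussian();
      double x1 = rand.nextGaussian();
      double x2 = rand.nextGaussian();
      int label = x0 > x1 ? 1 : 0;
      learner.train(label, new DenseVector(new double[] {x0, x1, x2})); // sequential, one example at a time
    }

    Vector query = new DenseVector(new double[] {2.0, -1.0, 0.0});
    System.out.println("p(label=1) = " + learner.classifyScalar(query));
  }
}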
The stuff that isn’t there
What Mahout Isn’t
 Mahout isn’t R, isn’t SAS
 It doesn’t aim to do everything
 It aims to scale some few problems of practical interest
 The stuff that isn’t there is a feature, not a defect
 Contact:
– tdunning@maprtech.com
– @ted_dunning
– @apachemahout
– user-subscribe@mahout.apache.org
 Slides and such: http://www.slideshare.net/tdunning
 Hash tags: #mapr #apachemahout