What's Right and Wrong with Apache Mahout
 

This is a summary of what I think is good and bad about Mahout, presented on the eve of the 2013 Hadoop Summit.

    Presentation Transcript

    • 1. © MapR Technologies 2013 - Confidential. Apache Mahout: How it's good, how it's awesome, and where it falls short
    • 2. What is Mahout?
        • "Scalable machine learning": not just Hadoop-oriented machine learning (not entirely, that is; just mostly)
        • Components: math library, clustering, classification, decompositions, recommendations
    • 3. What is Right and Wrong with Mahout?
        • Components: recommendations, math library, clustering, classification, decompositions, other stuff
    • 4. What is Right and Wrong with Mahout?
        • Components: recommendations, math library, clustering, classification, decompositions, other stuff
    • 5. What is Right and Wrong with Mahout?
        • Components: recommendations, math library, clustering, classification, decompositions, other stuff
        • All the stuff that isn't there
    • 6. Mahout Math
    • 7. Mahout Math
        • Goals are: basic linear algebra, statistical sampling, good clustering, decent speed, and extensibility, especially for sparse data
        • But not: totally badass speed; a comprehensive set of algorithms; optimization, root finders, quadrature
    • 8. Matrices and Vectors
        • At the core: DenseVector, RandomAccessSparseVector; DenseMatrix, SparseRowMatrix
        • Highly composable API
        • Important ideas: view*, assign and aggregate; iteration
        • Example: m.viewDiagonal().assign(v)
    • 9. Assign
        • Matrices: Matrix assign(double value); Matrix assign(double[][] values); Matrix assign(Matrix other); Matrix assign(DoubleFunction f); Matrix assign(Matrix other, DoubleDoubleFunction f);
        • Vectors: Vector assign(double value); Vector assign(double[] values); Vector assign(Vector other); Vector assign(DoubleFunction f); Vector assign(Vector other, DoubleDoubleFunction f); Vector assign(DoubleDoubleFunction f, double y);
    • 10. Views
        • Matrices: Matrix viewPart(int[] offset, int[] size); Matrix viewPart(int row, int rlen, int col, int clen); Vector viewRow(int row); Vector viewColumn(int column); Vector viewDiagonal();
        • Vectors: Vector viewPart(int offset, int length);
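    The view, assign and aggregate ideas on slides 8 to 10 can be illustrated without Mahout on the classpath. The sketch below is plain Java over a double[][] and is not the Mahout API: the class DiagonalView and its methods are hypothetical names chosen to mirror viewDiagonal(), assign() and zSum(). The point it tries to show is that a view shares storage with the matrix, so assigning through the view changes the matrix itself.

```java
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical illustration only; this is NOT the Mahout API. */
final class DiagonalView {
    private final double[][] m;   // backing matrix, shared rather than copied

    DiagonalView(double[][] m) {
        this.m = m;
    }

    int size() {
        return m.length == 0 ? 0 : Math.min(m.length, m[0].length);
    }

    /** Writes v onto the diagonal of the backing matrix. */
    DiagonalView assign(double[] v) {
        if (v.length != size()) {
            throw new IllegalArgumentException("length mismatch");
        }
        for (int i = 0; i < size(); i++) {
            m[i][i] = v[i];
        }
        return this;
    }

    /** Replaces each diagonal element x with f(x). */
    DiagonalView assign(DoubleUnaryOperator f) {
        for (int i = 0; i < size(); i++) {
            m[i][i] = f.applyAsDouble(m[i][i]);
        }
        return this;
    }

    /** Maps each diagonal element, then folds the mapped values with combiner. */
    double aggregate(DoubleBinaryOperator combiner, DoubleUnaryOperator mapper) {
        if (size() == 0) {
            return 0.0;
        }
        double acc = mapper.applyAsDouble(m[0][0]);
        for (int i = 1; i < size(); i++) {
            acc = combiner.applyAsDouble(acc, mapper.applyAsDouble(m[i][i]));
        }
        return acc;
    }

    /** Sum of the viewed elements, expressed as an aggregate. */
    double zSum() {
        return aggregate(Double::sum, x -> x);
    }

    public static void main(String[] args) {
        double[][] m = new double[3][3];
        DiagonalView d = new DiagonalView(m);
        d.assign(new double[] {1, 2, 3}).assign(x -> 2 * x);
        System.out.println(m[1][1]);   // the write went through to m
        System.out.println(d.zSum());  // sum of the doubled diagonal
    }
}
```

    Returning the view from each assign call is what makes chained expressions possible in this sketch; whether Mahout's own classes are built the same way internally is not something the slides state.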
    • 11. Examples
        • The trace of a matrix
        • Random projection
        • Low rank random matrix
    • 12. Examples
        • The trace of a matrix: m.viewDiagonal().zSum()
        • Random projection
        • Low rank random matrix
    • 13. Examples
        • The trace of a matrix: m.viewDiagonal().zSum()
        • Random projection: m.times(new DenseMatrix(1000, 3).assign(new Normal()))
        • Low rank random matrix
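    For readers who want to see what the two one-liners on slide 13 compute, here is a minimal sketch in plain Java arrays rather than Mahout types. The class and method names (MatrixExamples, trace, gaussian, times, randomProjection) are illustrative assumptions and not part of any library. The trace is the sum of the diagonal, and the random projection multiplies an n x 1000 matrix by a 1000 x 3 matrix of standard normal samples, leaving n x 3.

```java
import java.util.Random;

/** Plain-array illustration of the slide's examples; not Mahout code. */
final class MatrixExamples {

    /** Trace: the sum of the diagonal elements. */
    static double trace(double[][] m) {
        double sum = 0.0;
        int n = Math.min(m.length, m[0].length);
        for (int i = 0; i < n; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    /** A rows x cols matrix of independent standard normal samples. */
    static double[][] gaussian(int rows, int cols, Random rng) {
        double[][] g = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                g[i][j] = rng.nextGaussian();
            }
        }
        return g;
    }

    /** Ordinary dense matrix product a * b. */
    static double[][] times(double[][] a, double[][] b) {
        if (a[0].length != b.length) {
            throw new IllegalArgumentException("inner dimensions differ");
        }
        double[][] out = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                for (int l = 0; l < b[0].length; l++) {
                    out[i][l] += a[i][j] * b[j][l];
                }
            }
        }
        return out;
    }

    /** Projects each row of m from m[0].length dimensions down to k. */
    static double[][] randomProjection(double[][] m, int k, Random rng) {
        return times(m, gaussian(m[0].length, k, rng));
    }

    public static void main(String[] args) {
        System.out.println(trace(new double[][] {{1, 2}, {3, 4}}));
        double[][] m = gaussian(5, 1000, new Random(42));
        double[][] p = randomProjection(m, 3, new Random(7));
        System.out.println(p.length + " x " + p[0].length);
    }
}
```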
    • 14. Recommenders
    • 15. Examples of Recommendations
        • Customers buying books (Linden et al.)
        • Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.), (Netflix)
        • Internet radio listeners not skipping songs (Musicmatch)
        • Internet video watchers watching >30 s (Veoh)
        • Visibility in a map UI (new Google maps)
    • 16. Recommendation Basics
        • History, as (user, thing) pairs: (1, 3), (2, 4), (3, 4), (2, 3), (3, 2), (1, 1), (2, 1)
    • 17. Recommendation Basics
        • History as matrix:
                 t1  t2  t3  t4
            u1    1   0   1   0
            u2    1   0   1   1
            u3    0   1   0   1
        • (t1, t3) cooccur 2 times, (t1, t4) once, (t2, t4) once, (t3, t4) once
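    The cooccurrence counts on slide 17 can be checked with a short sketch. It builds the binary user-by-thing matrix A from the (user, thing) history pairs of slide 16 and then forms A^T A, whose off-diagonal entry (i, j) counts the users who did both thing i and thing j. This is plain Java with hypothetical names (Cooccurrence, historyMatrix, cooccurrence), not Mahout's recommender code.

```java
/** Plain-array illustration of cooccurrence counting; not Mahout code. */
final class Cooccurrence {

    /** Builds the binary user-by-thing matrix from 1-based (user, thing) pairs. */
    static int[][] historyMatrix(int[][] pairs, int users, int things) {
        int[][] a = new int[users][things];
        for (int[] p : pairs) {
            a[p[0] - 1][p[1] - 1] = 1;
        }
        return a;
    }

    /** Computes A^T A by accumulating each user's row against itself. */
    static int[][] cooccurrence(int[][] a) {
        int things = a[0].length;
        int[][] c = new int[things][things];
        for (int[] row : a) {
            for (int i = 0; i < things; i++) {
                for (int j = 0; j < things; j++) {
                    c[i][j] += row[i] * row[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // History pairs from slide 16.
        int[][] pairs = {{1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}};
        int[][] c = cooccurrence(historyMatrix(pairs, 3, 4));
        System.out.println("(t1, t3): " + c[0][2]);
        System.out.println("(t1, t4): " + c[0][3]);
        System.out.println("(t2, t4): " + c[1][3]);
        System.out.println("(t3, t4): " + c[2][3]);
    }
}
```

    The diagonal of A^T A holds the number of users per thing, which is usually ignored when the matrix is used for recommendation.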
    • 18. A Quick Simplification
        • Users who do h: $A h$
        • Also do r: $A^T (A h)$ (user-centric recommendations) $= (A^T A)\, h$ (item-centric recommendations)
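    The simplification on slide 18 is just associativity of matrix multiplication, and a small sketch makes that concrete. Using the history matrix A from slide 17 and an assumed query history h (a user who has done only t1, chosen here purely for illustration), the user-centric grouping A^T (A h) and the item-centric grouping (A^T A) h give the same scores. All names below are hypothetical plain-Java helpers, not Mahout calls.

```java
import java.util.Arrays;

/** Plain-array illustration of user-centric vs item-centric grouping. */
final class RecommendationOrder {

    static int[] matVec(int[][] m, int[] v) {
        int[] out = new int[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < v.length; j++) {
                out[i] += m[i][j] * v[j];
            }
        }
        return out;
    }

    static int[][] transpose(int[][] m) {
        int[][] t = new int[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    static int[][] matMat(int[][] a, int[][] b) {
        int[][] out = new int[a.length][b[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                for (int l = 0; l < b[0].length; l++) {
                    out[i][l] += a[i][j] * b[j][l];
                }
            }
        }
        return out;
    }

    /** A^T (A h): find users who share history with h, then sum what they did. */
    static int[] userCentric(int[][] a, int[] h) {
        return matVec(transpose(a), matVec(a, h));
    }

    /** (A^T A) h: precompute item cooccurrence, then apply it to h. */
    static int[] itemCentric(int[][] a, int[] h) {
        return matVec(matMat(transpose(a), a), h);
    }

    public static void main(String[] args) {
        int[][] a = {{1, 0, 1, 0}, {1, 0, 1, 1}, {0, 1, 0, 1}};  // slide 17
        int[] h = {1, 0, 0, 0};  // assumed query: a user who did only t1
        System.out.println(Arrays.toString(userCentric(a, h)));
        System.out.println(Arrays.toString(itemCentric(a, h)));
    }
}
```

    The practical difference is where the work happens: the item-centric form lets A^T A be computed once offline, so each query only needs a product with h.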
    • 19. Clustering
    • 20. An Example
    • 21. An Example
    • 22. Diagonalized Cluster Proximity
    • 23. Parallel Speedup? [Figure: time per point (μs) versus number of threads, with curves for the threaded version, the non-threaded version, and perfect scaling]
    • 24. Lots of Clusters Are Fine
    • 25. Decompositions
    • 26. Low Rank Matrix
        • Or should we see it differently?
        • Are these all scaled-up versions of the same column?
        • $\begin{bmatrix} 1 & 2 & 5 \\ 2 & 4 & 10 \\ 10 & 20 & 50 \\ 20 & 40 & 100 \end{bmatrix}$
    • 27. Low Rank Matrix
        • Matrix multiplication is designed to make this easy
        • We can see weighted column patterns, or weighted row patterns
        • All the same mathematically
        • $\begin{bmatrix} 1 \\ 2 \\ 10 \\ 20 \end{bmatrix} \times \begin{bmatrix} 1 & 2 & 5 \end{bmatrix}$, that is, column pattern (or weights) times weights (or row pattern)
    • 28. Low Rank Matrix
        • What about here?
        • This is like before, but there is one exceptional value
        • $\begin{bmatrix} 1 & 2 & 5 \\ 2 & 4 & 10 \\ 10 & 100 & 50 \\ 20 & 40 & 100 \end{bmatrix}$
    • 29. Low Rank Matrix
        • OK … add in a simple fixer upper
        • $\begin{bmatrix} 1 \\ 2 \\ 10 \\ 20 \end{bmatrix} \times \begin{bmatrix} 1 & 2 & 5 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 10 \\ 0 \end{bmatrix} \times \begin{bmatrix} 0 & 8 & 0 \end{bmatrix}$, where the second term is (which row) times (exception pattern)
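    The arithmetic on slides 26 to 29 can be verified with a few lines of plain Java. The sketch forms the rank-one matrix as an outer product of the column pattern and the row weights, then adds a second outer product that touches only the exceptional cell. The class and method names (LowRankPlusException, outer, plus) are illustrative assumptions.

```java
import java.util.Arrays;

/** Plain-array illustration of a rank-one matrix plus a rank-one correction. */
final class LowRankPlusException {

    /** Outer product: out[i][j] = col[i] * row[j]. */
    static double[][] outer(double[] col, double[] row) {
        double[][] out = new double[col.length][row.length];
        for (int i = 0; i < col.length; i++) {
            for (int j = 0; j < row.length; j++) {
                out[i][j] = col[i] * row[j];
            }
        }
        return out;
    }

    /** Element-wise sum of two matrices of the same shape. */
    static double[][] plus(double[][] a, double[][] b) {
        double[][] out = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                out[i][j] = a[i][j] + b[i][j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Slide 27: column pattern times row weights.
        double[][] base = outer(new double[] {1, 2, 10, 20}, new double[] {1, 2, 5});
        // Slide 29: (which row) times (exception pattern) adds 80 to one cell.
        double[][] fix = outer(new double[] {0, 0, 10, 0}, new double[] {0, 8, 0});
        for (double[] row : plus(base, fix)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

    Storing the two pairs of vectors takes 14 numbers instead of 12 for this tiny matrix, so the saving only shows up at scale; the example is about the structure, not the size.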
    • 30. Random Projection
    • 31. SVD Projection
    • 32. Classifiers
    • 33. Mahout Classifiers
        • Naïve Bayes: high quality implementation; uses idiosyncratic input format; … but it is naïve
        • SGD: sequential, not parallel; auto-tuning has foibles; learning rate annealing has issues; definitely not state of the art compared to Vowpal Wabbit
        • Random forest: scaling limits due to decomposition strategy; yet another input format; no deployment strategy
    • 34. The stuff that isn't there
    • 35. What Mahout Isn't
        • Mahout isn't R, isn't SAS
        • It doesn't aim to do everything
        • It aims to scale a few problems of practical interest
        • The stuff that isn't there is a feature, not a defect
    • 36. Contact
        • tdunning@maprtech.com
        • @ted_dunning
        • @apachemahout
        • user-subscribe@mahout.apache.org
        • Slides and such: http://www.slideshare.net/tdunning
        • Hash tags: #mapr #apachemahout