What's Right and Wrong with Apache Mahout


This is a summary of what I think is good and bad about Mahout, presented on the eve of the 2013 Hadoop Summit.



  1. Apache Mahout: how it's good, how it's awesome, and where it falls short (©MapR Technologies 2013 - Confidential)
  2. What is Mahout? "Scalable machine learning"
     – not just Hadoop-oriented machine learning
     – not entirely, that is; just mostly
     Components:
     – math library
     – clustering
     – classification
     – decompositions
     – recommendations
  3. What is Right and Wrong with Mahout? Components:
     – recommendations
     – math library
     – clustering
     – classification
     – decompositions
     – other stuff
  5. What is Right and Wrong with Mahout? The same components... plus all the stuff that isn't there
  6. Mahout Math
  7. Mahout Math. Goals are:
     – basic linear algebra
     – statistical sampling
     – good clustering
     – decent speed
     – extensibility, especially for sparse data
     But not:
     – totally badass speed
     – a comprehensive set of algorithms
     – optimization, root finders, quadrature
  8. Matrices and Vectors. At the core:
     – DenseVector, RandomAccessSparseVector
     – DenseMatrix, SparseRowMatrix
     A highly composable API. Important ideas:
     – view*, assign, and aggregate
     – iteration
     For example: m.viewDiagonal().assign(v)
  9. Assign. On matrices:
     Matrix assign(double value);
     Matrix assign(double[][] values);
     Matrix assign(Matrix other);
     Matrix assign(DoubleFunction f);
     Matrix assign(Matrix other, DoubleDoubleFunction f);
     On vectors:
     Vector assign(double value);
     Vector assign(double[] values);
     Vector assign(Vector other);
     Vector assign(DoubleFunction f);
     Vector assign(Vector other, DoubleDoubleFunction f);
     Vector assign(DoubleDoubleFunction f, double y);
  10. Views. On matrices:
      Matrix viewPart(int[] offset, int[] size);
      Matrix viewPart(int row, int rlen, int col, int clen);
      Vector viewRow(int row);
      Vector viewColumn(int column);
      Vector viewDiagonal();
      On vectors:
      Vector viewPart(int offset, int length);
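The point of views is that reads and writes go through to shared backing storage, which is what makes a composition like m.viewDiagonal().assign(v) work. A stdlib-only sketch of the idea, with no Mahout dependency (the class and most names here are mine; only viewDiagonal, assign, and zSum mirror names from the slides):

```java
import java.util.function.DoubleUnaryOperator;

public class ViewSketch {
    final double[][] data;                 // backing storage, shared by all views

    ViewSketch(double[][] data) { this.data = data; }

    // A "view" exposes part of the matrix without copying it.
    interface VectorView {
        int size();
        double get(int i);
        void set(int i, double v);

        // assign(DoubleFunction f): apply f to every element, in place
        default void assign(DoubleUnaryOperator f) {
            for (int i = 0; i < size(); i++) set(i, f.applyAsDouble(get(i)));
        }
        // assign(other): copy other's elements into the viewed storage
        default void assign(double[] other) {
            for (int i = 0; i < size(); i++) set(i, other[i]);
        }
        // zSum(): sum of elements, as in Mahout
        default double zSum() {
            double s = 0;
            for (int i = 0; i < size(); i++) s += get(i);
            return s;
        }
    }

    VectorView viewDiagonal() {
        return new VectorView() {
            public int size() { return Math.min(data.length, data[0].length); }
            public double get(int i) { return data[i][i]; }
            public void set(int i, double v) { data[i][i] = v; }
        };
    }

    public static void main(String[] args) {
        ViewSketch m = new ViewSketch(new double[][] {{1, 2}, {3, 4}});
        m.viewDiagonal().assign(new double[] {9, 9});   // writes through the view
        System.out.println(m.data[0][0] + " " + m.data[1][1]);  // 9.0 9.0
    }
}
```

Because the view holds no copy, assigning through it mutates the original matrix, and composed expressions stay allocation-light.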
  11. Examples
      – the trace of a matrix
      – random projection
      – a low-rank random matrix
  12. Examples. The trace of a matrix:
      m.viewDiagonal().zSum()
  13. Examples. Random projection:
      m.times(new DenseMatrix(1000, 3).assign(new Normal()))
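The one-liners above can be spelled out without Mahout. A stdlib-only sketch (class and method names are mine) that computes a trace the way m.viewDiagonal().zSum() does, and a Gaussian random projection the way m.times(new DenseMatrix(d, k).assign(new Normal())) does:

```java
import java.util.Random;

public class ProjectionSketch {
    // Trace: sum of the diagonal, i.e. Mahout's m.viewDiagonal().zSum().
    static double trace(double[][] m) {
        double s = 0;
        for (int i = 0; i < Math.min(m.length, m[0].length); i++) s += m[i][i];
        return s;
    }

    // Random projection: multiply an n x d matrix by a d x k matrix of
    // standard normal samples, shrinking each d-dimensional row to k numbers.
    static double[][] randomProjection(double[][] m, int k, Random rnd) {
        int n = m.length, d = m[0].length;
        double[][] p = new double[d][k];
        for (double[] row : p)
            for (int j = 0; j < k; j++) row[j] = rnd.nextGaussian();
        double[][] out = new double[n][k];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < k; j++)
                for (int l = 0; l < d; l++) out[i][j] += m[i][l] * p[l][j];
        return out;
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2, 3}, {4, 5, 6}, {7, 8, 10}};
        System.out.println(trace(m));                            // 16.0
        double[][] low = randomProjection(m, 2, new Random(42));
        System.out.println(low.length + " x " + low[0].length);  // 3 x 2
    }
}
```

The Mahout versions do the same thing, but over its Matrix abstraction, so the same code works for sparse and dense storage.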
  14. Recommenders
  15. Examples of Recommendations
      – customers buying books (Linden et al.)
      – web visitors rating music (Shardanand and Maes) or movies (Riedl et al.), (Netflix)
      – internet radio listeners not skipping songs (Musicmatch)
      – internet video watchers watching more than 30 s (Veoh)
      – visibility in a map UI (new Google Maps)
  16. Recommendation Basics. History:
      User  Thing
      1     3
      2     4
      3     4
      2     3
      3     2
      1     1
      2     1
  17. Recommendation Basics. History as a matrix: (t1, t3) cooccur 2 times; (t1, t4) once; (t2, t4) once; (t3, t4) once.
          t1  t2  t3  t4
      u1   1   0   1   0
      u2   1   0   1   1
      u3   0   1   0   1
  18. A Quick Simplification. Users who do h also do r = AᵀA h, where A is the user-by-item history matrix.
      – evaluated as Aᵀ(A h): user-centric recommendations
      – evaluated as (AᵀA) h: item-centric recommendations
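In the item-centric grouping, AᵀA is the item-item cooccurrence matrix, and multiplying it by a history vector h scores every item for that history. A stdlib-only sketch using the user-by-item matrix from slide 17 (class and method names are mine, not Mahout's):

```java
public class CooccurrenceSketch {
    // r = (A' A) h : the item-centric recommendation from the slides.
    static double[] recommend(int[][] a, double[] h) {
        int items = a[0].length;
        double[][] cooc = new double[items][items];   // A' A, item cooccurrence
        for (int[] row : a)
            for (int i = 0; i < items; i++)
                for (int j = 0; j < items; j++)
                    cooc[i][j] += row[i] * row[j];
        double[] r = new double[items];               // score = cooc . h
        for (int i = 0; i < items; i++)
            for (int j = 0; j < items; j++)
                r[i] += cooc[i][j] * h[j];
        return r;
    }

    public static void main(String[] args) {
        // The user-by-item history matrix from slide 17 (u1..u3 by t1..t4).
        int[][] a = {{1, 0, 1, 0}, {1, 0, 1, 1}, {0, 1, 0, 1}};
        // A history containing only t1: t3 scores highest (cooccurs twice).
        double[] r = recommend(a, new double[] {1, 0, 0, 0});
        System.out.println(java.util.Arrays.toString(r));  // [2.0, 0.0, 2.0, 1.0]
    }
}
```

The scores reproduce the counts stated on slide 17: t3 cooccurs with t1 twice, t4 once, t2 never.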
  19. Clustering
  20. An Example
  21. An Example
  22. Diagonalized Cluster Proximity
  23. Parallel Speedup? [chart: time per point (μs) versus number of threads, comparing the threaded and non-threaded versions against a perfect-scaling reference]
  24. Lots of Clusters Are Fine
  25. Decompositions
  26. Low Rank Matrix. Or should we see it differently? Are these all scaled-up versions of the same column?
       1   2   5
       2   4  10
      10  20  50
      20  40 100
  27. Low Rank Matrix. Matrix multiplication is designed to make this easy. We can see weighted column patterns, or weighted row patterns; it is all the same mathematically:
      [1; 2; 10; 20] x [1 2 5]
      column pattern (or weights) times weights (or row pattern)
  28. Low Rank Matrix. What about here? This is like before, but there is one exceptional value:
       1    2    5
       2    4   10
      10  100   50
      20   40  100
  29. Low Rank Matrix. OK... add in a simple fixer-upper:
      [1; 2; 10; 20] x [1 2 5]  +  [0; 0; 10; 0] x [0 8 0]
      where [0; 0; 10; 0] picks which row and [0 8 0] is the exception pattern
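The decomposition on slides 26-29 can be checked numerically: the base matrix is the outer product of the column pattern (1, 2, 10, 20) with the row pattern (1, 2, 5), and the exception is a second rank-one term, (0, 0, 10, 0) times (0, 8, 0), which adds 10 x 8 = 80 at row 3, column 2, turning the 20 there into the exceptional 100. A stdlib-only sketch (names are mine):

```java
public class LowRankSketch {
    // Outer product u v' : a rank-one matrix.
    static double[][] outer(double[] u, double[] v) {
        double[][] m = new double[u.length][v.length];
        for (int i = 0; i < u.length; i++)
            for (int j = 0; j < v.length; j++) m[i][j] = u[i] * v[j];
        return m;
    }

    public static void main(String[] args) {
        // Base pattern: every row is a multiple of (1, 2, 5).
        double[][] base = outer(new double[] {1, 2, 10, 20}, new double[] {1, 2, 5});
        // Fixer-upper: a rank-one term that only touches row 3, column 2.
        double[][] fix = outer(new double[] {0, 0, 10, 0}, new double[] {0, 8, 0});
        double[][] sum = new double[4][3];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 3; j++) sum[i][j] = base[i][j] + fix[i][j];
        // Reproduces slide 28's matrix, exceptional 100 and all:
        System.out.println(java.util.Arrays.deepToString(sum));
        // [[1.0, 2.0, 5.0], [2.0, 4.0, 10.0], [10.0, 100.0, 50.0], [20.0, 40.0, 100.0]]
    }
}
```

This is the intuition behind the decompositions that follow: a low-rank part captures the broad pattern, and a few extra rank-one terms absorb the exceptions.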
  30. Random Projection
  31. SVD Projection
  32. Classifiers
  33. Mahout Classifiers
      Naïve Bayes:
      – high-quality implementation
      – uses an idiosyncratic input format
      – ... but it is naïve
      SGD:
      – sequential, not parallel
      – auto-tuning has foibles
      – learning-rate annealing has issues
      – definitely not state of the art compared to Vowpal Wabbit
      Random forest:
      – scaling limits due to the decomposition strategy
      – yet another input format
      – no deployment strategy
  34. The stuff that isn't there
  35. What Mahout Isn't
      – Mahout isn't R, isn't SAS
      – it doesn't aim to do everything
      – it aims to scale a few problems of practical interest
      – the stuff that isn't there is a feature, not a defect
  36. Contact:
      – tdunning@maprtech.com
      – @ted_dunning
      – @apachemahout
      – user-subscribe@mahout.apache.org
      Slides and such: http://www.slideshare.net/tdunning
      Hash tags: #mapr #apachemahout
