
What's Right and Wrong with Apache Mahout

This is a summary of what I think is good and bad about Mahout. Presented on the eve of the 2013 Hadoop Summit.



  1. Apache Mahout: how it's good, how it's awesome, and where it falls short (©MapR Technologies 2013 - Confidential)
  2. What is Mahout? "Scalable machine learning"
     - not just Hadoop-oriented machine learning
     - not entirely, that is. Just mostly.
     Components:
     - math library
     - clustering
     - classification
     - decompositions
     - recommendations
  3. What is Right and Wrong with Mahout? Components:
     - recommendations
     - math library
     - clustering
     - classification
     - decompositions
     - other stuff
  5. What is Right and Wrong with Mahout? Components:
     - recommendations
     - math library
     - clustering
     - classification
     - decompositions
     - other stuff (all the stuff that isn't there)
  6. Mahout Math
  7. Mahout Math. Goals are:
     - basic linear algebra,
     - and statistical sampling,
     - and good clustering,
     - decent speed,
     - extensibility,
     - especially for sparse data
     But not:
     - totally badass speed
     - a comprehensive set of algorithms
     - optimization, root finders, quadrature
  8. Matrices and Vectors. At the core:
     - DenseVector, RandomAccessSparseVector
     - DenseMatrix, SparseRowMatrix
     Highly composable API. Important ideas:
     - view*, assign and aggregate
     - iteration
     Example: m.viewDiagonal().assign(v)
  9. Assign. Matrices:
     Matrix assign(double value);
     Matrix assign(double[][] values);
     Matrix assign(Matrix other);
     Matrix assign(DoubleFunction f);
     Matrix assign(Matrix other, DoubleDoubleFunction f);
     Vectors:
     Vector assign(double value);
     Vector assign(double[] values);
     Vector assign(Vector other);
     Vector assign(DoubleFunction f);
     Vector assign(Vector other, DoubleDoubleFunction f);
     Vector assign(DoubleDoubleFunction f, double y);
  10. Views. Matrices:
      Matrix viewPart(int[] offset, int[] size);
      Matrix viewPart(int row, int rlen, int col, int clen);
      Vectors:
      Vector viewRow(int row);
      Vector viewColumn(int column);
      Vector viewDiagonal();
      Vector viewPart(int offset, int length);
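The key property of these views is that they share storage with the underlying matrix, so assigning through a view mutates the original. A minimal Python sketch of that idea (illustrative only, not Mahout's implementation):

```python
class DiagonalView:
    """A live view of a square matrix's diagonal: writes go through
    to the underlying matrix, mirroring Mahout's viewDiagonal()."""
    def __init__(self, m):
        self.m = m  # shared storage, not a copy
    def assign(self, values):
        for i, v in enumerate(values):
            self.m[i][i] = v
        return self
    def zsum(self):
        return sum(self.m[i][i] for i in range(len(self.m)))

m = [[0, 1], [2, 0]]
DiagonalView(m).assign([5, 7])
print(m)  # [[5, 1], [2, 7]]: the underlying matrix changed
```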
  11. Examples: the trace of a matrix; random projection; a low-rank random matrix.
  12. Examples: the trace of a matrix.
      m.viewDiagonal().zSum()
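The same computation sketched in plain Python rather than Mahout's Java API: the trace is just the sum of the diagonal.

```python
def trace(m):
    """Sum of the diagonal entries of a square matrix (list of rows)."""
    return sum(m[i][i] for i in range(len(m)))

m = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(trace(m))  # 1 + 5 + 9 = 15
```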
  13. Examples: random projection.
      m.times(new DenseMatrix(1000, 3).assign(new Normal()))
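A plain-Python sketch of the same idea (not Mahout's API): multiply the data matrix by a tall matrix of standard Gaussian entries to project each row down to a few dimensions. The function name and dimensions here are illustrative.

```python
import random

def random_projection(m, k, seed=0):
    """Project rows of m (n x d) to k dimensions by multiplying with a
    d x k matrix of standard Gaussian entries, as on the slide."""
    rng = random.Random(seed)
    d = len(m[0])
    r = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
    return [[sum(row[j] * r[j][c] for j in range(d)) for c in range(k)]
            for row in m]

m = [[float(i + j) for j in range(10)] for i in range(4)]  # 4 x 10 data
p = random_projection(m, 3)
print(len(p), len(p[0]))  # 4 3: four rows, now in three dimensions
```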
  14. Recommenders
  15. Examples of Recommendations
      - Customers buying books (Linden et al.)
      - Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.), (Netflix)
      - Internet radio listeners not skipping songs (Musicmatch)
      - Internet video watchers watching more than 30 s (Veoh)
      - Visibility in a map UI (new Google Maps)
  16. Recommendation Basics. History:
      User  Thing
      1     3
      2     4
      3     4
      2     3
      3     2
      1     1
      2     1
  17. Recommendation Basics. History as a matrix: (t1, t3) cooccur 2 times; (t1, t4) once; (t2, t4) once; (t3, t4) once.
          t1  t2  t3  t4
      u1   1   0   1   0
      u2   1   0   1   1
      u3   0   1   0   1
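The cooccurrence counts on this slide are exactly the off-diagonal entries of A^T A, where A is the user-by-item history matrix. A small sketch in plain Python (illustrative, not Mahout code):

```python
# Cooccurrence counts from the user-item history matrix above.
# Entry (i, j) of A^T A counts the users who did both item i and item j.
A = [[1, 0, 1, 0],   # u1: t1, t3
     [1, 0, 1, 1],   # u2: t1, t3, t4
     [0, 1, 0, 1]]   # u3: t2, t4

n_items = len(A[0])
cooccur = [[sum(row[i] * row[j] for row in A) for j in range(n_items)]
           for i in range(n_items)]

print(cooccur[0][2])  # (t1, t3): 2
print(cooccur[0][3])  # (t1, t4): 1
```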
  18. A Quick Simplification. Users who do h also do r:
      r = A^T (A h)   user-centric recommendations
      r = (A^T A) h   item-centric recommendations
      The two groupings are the same product; the parenthesization picks the viewpoint.
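The user-centric grouping can be sketched directly: h is one user's history vector over items, A h scores every user by overlap with that history, and A^T maps those user scores back onto items. A plain-Python illustration using the history matrix from the previous slide:

```python
# Recommendation as r = A^T (A h): score users by overlap with the
# history h, then map the user scores back to items.
A = [[1, 0, 1, 0],
     [1, 0, 1, 1],
     [0, 1, 0, 1]]

def mat_vec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

h = [1, 0, 0, 0]   # a user whose history is only item t1
r = mat_vec(transpose(A), mat_vec(A, h))
print(r)  # [2, 0, 2, 1]: t3 cooccurs twice with t1, t4 once
```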
  19. Clustering
  20. An Example (figure)
  21. An Example (figure)
  22. Diagonalized Cluster Proximity
  23. Parallel Speedup? (plot: time per point in μs vs. number of threads, comparing the threaded and non-threaded versions against perfect scaling)
  24. Lots of Clusters Are Fine
  25. Decompositions
  26. Low Rank Matrix. Or should we see it differently? Are these scaled-up versions of all the same column?
       1   2   5
       2   4  10
      10  20  50
      20  40 100
  27. Low Rank Matrix. Matrix multiplication is designed to make this easy. We can see weighted column patterns, or weighted row patterns. All the same mathematically:
      [ 1]
      [ 2]  x  [1 2 5]
      [10]
      [20]
      Column pattern (or weights) times weights (or row pattern).
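This factorization is just an outer product; a short Python sketch rebuilds the matrix from slide 26 out of the column pattern and row pattern:

```python
# A rank-1 matrix is an outer product: a column pattern times a row pattern.
col = [1, 2, 10, 20]   # column pattern (or per-row weights)
row = [1, 2, 5]        # row pattern (or per-column weights)

m = [[c * r for r in row] for c in col]
print(m)  # [[1, 2, 5], [2, 4, 10], [10, 20, 50], [20, 40, 100]]
```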
  28. Low Rank Matrix. What about here? This is like before, but there is one exceptional value:
       1    2    5
       2    4   10
      10  100   50
      20   40  100
  29. Low Rank Matrix. OK … add in a simple fixer-upper:
      [ 1]                 [ 0]
      [ 2]  x  [1 2 5]  +  [ 0]  x  [0 8 0]
      [10]                 [10]
      [20]                 [ 0]
      The second column vector picks which row; the second row vector is the exception pattern.
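The fixer-upper is a second rank-1 term: the base outer product plus a correction whose column vector selects the row and whose row vector holds the exception pattern. A plain-Python sketch reproducing the matrix on slide 28:

```python
# The exceptional matrix as a rank-1 base plus a rank-1 correction.
def outer(col, row):
    return [[c * r for r in row] for c in col]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

base = outer([1, 2, 10, 20], [1, 2, 5])
fix  = outer([0, 0, 10, 0], [0, 8, 0])   # which row x exception pattern

m = add(base, fix)
print(m[2])  # [10, 100, 50]: the exceptional 100 is recovered
```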
  30. Random Projection
  31. SVD Projection
  32. Classifiers
  33. Mahout Classifiers
      Naïve Bayes:
      - high-quality implementation
      - uses an idiosyncratic input format
      - … but it is naïve
      SGD:
      - sequential, not parallel
      - auto-tuning has foibles
      - learning-rate annealing has issues
      - definitely not state of the art compared to Vowpal Wabbit
      Random forest:
      - scaling limits due to decomposition strategy
      - yet another input format
      - no deployment strategy
  34. The stuff that isn't there
  35. What Mahout Isn't
      - Mahout isn't R, isn't SAS
      - It doesn't aim to do everything
      - It aims to scale a few problems of practical interest
      - The stuff that isn't there is a feature, not a defect
  36. Contact:
      - tdunning@maprtech.com
      - @ted_dunning
      - @apachemahout
      - user-subscribe@mahout.apache.org
      Slides and such: http://www.slideshare.net/tdunning
      Hash tags: #mapr #apachemahout
