What's Right and Wrong with Apache Mahout
This is a summary of what I think is good and bad about Mahout. Presented on the eve of the 2013 Hadoop Summit

Transcript

  • 1. Apache Mahout: how it's good, how it's awesome, and where it falls short (©MapR Technologies 2013, Confidential)
  • 2. What is Mahout? "Scalable machine learning":
    – not just Hadoop-oriented machine learning
    – not entirely, that is. Just mostly.
    Components:
    – math library
    – clustering
    – classification
    – decompositions
    – recommendations
  • 3. What is Right and Wrong with Mahout? Components:
    – recommendations
    – math library
    – clustering
    – classification
    – decompositions
    – other stuff
  • 4. What is Right and Wrong with Mahout? (same components list)
  • 5. What is Right and Wrong with Mahout? (same components list, plus the callout: "All the stuff that isn't there")
  • 6. Mahout Math
  • 7. Mahout Math. Goals:
    – basic linear algebra
    – statistical sampling
    – good clustering
    – decent speed
    – extensibility, especially for sparse data
    But not:
    – totally badass speed
    – a comprehensive set of algorithms
    – optimization, root finders, quadrature
  • 8. Matrices and Vectors. At the core:
    – DenseVector, RandomAccessSparseVector
    – DenseMatrix, SparseRowMatrix
    Highly composable API. Important ideas:
    – view*, assign and aggregate
    – iteration
    Example: m.viewDiagonal().assign(v)
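Slide 8's m.viewDiagonal().assign(v) works because Mahout views share storage with the matrix they come from. Below is a minimal plain-Java sketch of that write-through idea; the class and method names are illustrative, not the actual Mahout implementation.

```java
// Plain-Java sketch of Mahout's view/assign idea -- not the real Mahout API.
// A "view" shares storage with the matrix it came from, so assigning through
// the view mutates the original, as in m.viewDiagonal().assign(v).
public class DiagViewSketch {
    final double[][] data;

    DiagViewSketch(double[][] data) { this.data = data; }

    // A diagonal "view": writes go straight to the backing array, no copy.
    void assignDiagonal(double[] v) {
        int n = Math.min(data.length, data[0].length);
        for (int i = 0; i < n; i++) {
            data[i][i] = v[i];
        }
    }

    // Aggregate over the diagonal (cf. viewDiagonal().zSum()).
    double diagonalSum() {
        double sum = 0;
        int n = Math.min(data.length, data[0].length);
        for (int i = 0; i < n; i++) sum += data[i][i];
        return sum;
    }

    public static void main(String[] args) {
        DiagViewSketch m = new DiagViewSketch(new double[3][3]);
        m.assignDiagonal(new double[] {1, 2, 3});
        System.out.println(m.diagonalSum());   // 6.0
    }
}
```

The point of the composable API is that views, assigns, and aggregates chain without intermediate copies.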
  • 9. Assign. Matrices:
        Matrix assign(double value);
        Matrix assign(double[][] values);
        Matrix assign(Matrix other);
        Matrix assign(DoubleFunction f);
        Matrix assign(Matrix other, DoubleDoubleFunction f);
    Vectors:
        Vector assign(double value);
        Vector assign(double[] values);
        Vector assign(Vector other);
        Vector assign(DoubleFunction f);
        Vector assign(Vector other, DoubleDoubleFunction f);
        Vector assign(DoubleDoubleFunction f, double y);
  • 10. Views. Matrices:
        Matrix viewPart(int[] offset, int[] size);
        Matrix viewPart(int row, int rlen, int col, int clen);
    Vectors:
        Vector viewRow(int row);
        Vector viewColumn(int column);
        Vector viewDiagonal();
        Vector viewPart(int offset, int length);
  • 11. Examples: the trace of a matrix, random projection, a low-rank random matrix
  • 12. The trace of a matrix: m.viewDiagonal().zSum()
  • 13. Random projection: m.times(new DenseMatrix(1000, 3).assign(new Normal()))
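The one-liners on slides 12 and 13 can be unpacked into plain Java without the Mahout classes; the helper names below are illustrative, not Mahout's.

```java
import java.util.Random;

// Plain-Java equivalents of the Mahout one-liners on slides 12-13.
public class MathExamples {
    // trace(m) is what m.viewDiagonal().zSum() computes: the diagonal sum.
    static double trace(double[][] m) {
        double sum = 0;
        for (int i = 0; i < Math.min(m.length, m[0].length); i++) sum += m[i][i];
        return sum;
    }

    // Random projection, as in m.times(new DenseMatrix(d, k).assign(new Normal())):
    // multiply an n x d matrix by a d x k matrix of N(0,1) draws, reducing
    // d columns to k while roughly preserving geometry.
    static double[][] randomProjection(double[][] m, int k, long seed) {
        Random rnd = new Random(seed);
        int n = m.length, d = m[0].length;
        double[][] r = new double[d][k];
        for (int i = 0; i < d; i++)
            for (int j = 0; j < k; j++) r[i][j] = rnd.nextGaussian();
        double[][] out = new double[n][k];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < k; j++)
                for (int l = 0; l < d; l++) out[i][j] += m[i][l] * r[l][j];
        return out;
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}};
        System.out.println(trace(m));                        // 5.0
        double[][] p = randomProjection(m, 1, 42);
        System.out.println(p.length + " x " + p[0].length);  // 2 x 1
    }
}
```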
  • 14. Recommenders
  • 15. Examples of Recommendations:
    – customers buying books (Linden et al.)
    – web visitors rating music (Shardanand and Maes) or movies (Riedl et al.; Netflix)
    – Internet radio listeners not skipping songs (Musicmatch)
    – Internet video watchers watching >30 s (Veoh)
    – visibility in a map UI (new Google Maps)
  • 16. Recommendation Basics. History:
        User  Thing
          1     3
          2     4
          3     4
          2     3
          3     2
          1     1
          2     1
  • 17. Recommendation Basics. History as a matrix: (t1, t3) cooccur 2 times, (t1, t4) once, (t2, t4) once, (t3, t4) once
             t1  t2  t3  t4
        u1    1   0   1   0
        u2    1   0   1   1
        u3    0   1   0   1
  • 18. A Quick Simplification. Users who do h also do r:
        r = Aᵀ(Ah)    user-centric recommendations
        r = (AᵀA)h    item-centric recommendations
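The slide-18 algebra can be checked directly against the slide-17 history matrix. A self-contained sketch in plain Java (not Mahout's recommender API): AᵀA counts item cooccurrences, and (AᵀA)h scores items against a user's history vector h.

```java
// Cooccurrence recommendation from slides 17-18: with binary history matrix
// A (users x items), A'A counts item cooccurrences, and r = (A'A)h scores
// items against a user's history vector h.
public class CooccurrenceSketch {
    // Compute A'A: entry (i, j) counts users who did both item i and item j.
    static double[][] ata(double[][] a) {
        int items = a[0].length;
        double[][] c = new double[items][items];
        for (double[] row : a)
            for (int i = 0; i < items; i++)
                for (int j = 0; j < items; j++) c[i][j] += row[i] * row[j];
        return c;
    }

    // Item-centric recommendation: r = (A'A) h.
    static double[] recommend(double[][] a, double[] h) {
        double[][] c = ata(a);
        double[] r = new double[c.length];
        for (int i = 0; i < c.length; i++)
            for (int j = 0; j < c.length; j++) r[i] += c[i][j] * h[j];
        return r;
    }

    public static void main(String[] args) {
        // The history matrix from slide 17.
        double[][] a = {
            {1, 0, 1, 0},   // u1
            {1, 0, 1, 1},   // u2
            {0, 1, 0, 1},   // u3
        };
        double[][] c = ata(a);
        System.out.println(c[0][2]);  // (t1, t3) cooccur 2 times -> 2.0
        System.out.println(c[0][3]);  // (t1, t4) cooccur once -> 1.0
        // Scores for a user whose history is just t1: t3 scores highest.
        double[] r = recommend(a, new double[] {1, 0, 0, 0});
        System.out.println(r[2]);     // 2.0
    }
}
```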
  • 19. Clustering
  • 20. An Example
  • 21. An Example
  • 22. Diagonalized Cluster Proximity
  • 23. Parallel Speedup? (chart: time per point in μs vs. number of threads, comparing the threaded version and the non-threaded version against perfect scaling)
  • 24. Lots of Clusters Are Fine
  • 25. Decompositions
  • 26. Low Rank Matrix. Or should we see it differently? Are these scaled-up versions of all the same column?
         1   2   5
         2   4  10
        10  20  50
        20  40 100
  • 27. Low Rank Matrix. Matrix multiplication is designed to make this easy. We can see weighted column patterns, or weighted row patterns; all the same mathematically:
        [ 1]
        [ 2]  x  [1 2 5]
        [10]
        [20]
        (column pattern, or weights)  x  (weights, or row pattern)
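The decomposition on slide 27 is just an outer product; a quick plain-Java check (illustrative helper names) that it reproduces the slide-26 matrix:

```java
// Slide 27's point: the slide-26 matrix is rank one -- an outer product of
// a column pattern [1 2 10 20]' and a row pattern [1 2 5].
public class RankOneSketch {
    // Outer product: m[i][j] = col[i] * row[j].
    static double[][] outer(double[] col, double[] row) {
        double[][] m = new double[col.length][row.length];
        for (int i = 0; i < col.length; i++)
            for (int j = 0; j < row.length; j++) m[i][j] = col[i] * row[j];
        return m;
    }

    public static void main(String[] args) {
        double[][] m = outer(new double[] {1, 2, 10, 20}, new double[] {1, 2, 5});
        // Reproduces the slide-26 matrix row by row.
        System.out.println(java.util.Arrays.deepToString(m));
    }
}
```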
  • 28. Low Rank Matrix. What about here? This is like before, but there is one exceptional value:
         1   2   5
         2   4  10
        10 100  50
        20  40 100
  • 29. Low Rank Matrix. OK... add in a simple fixer-upper:
        [ 1]              [0]
        [ 2]  x [1 2 5] + [0] x [0 80 0]
        [10]              [1]
        [20]              [0]
        (the second column vector picks which row; the row vector is the exception pattern)
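The fixer-upper on slide 29 adds one more outer product on top of the rank-one part. A sketch, assuming the exception pattern is [0 80 0] (inferred so that the base entry 20 plus 80 matches the exceptional 100 on slide 28):

```java
// Slide 29's "fixer upper": the rank-one matrix from slide 27 plus one
// exception term.  The correction value 80 is an inference: base entry 20
// plus 80 gives the exceptional 100 seen on slide 28.
public class RankTwoSketch {
    // Outer product: column pattern times row pattern.
    static double[][] outer(double[] col, double[] row) {
        double[][] m = new double[col.length][row.length];
        for (int i = 0; i < col.length; i++)
            for (int j = 0; j < row.length; j++) m[i][j] = col[i] * row[j];
        return m;
    }

    static double[][] add(double[][] a, double[][] b) {
        double[][] c = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++) c[i][j] = a[i][j] + b[i][j];
        return c;
    }

    public static void main(String[] args) {
        double[][] base = outer(new double[] {1, 2, 10, 20}, new double[] {1, 2, 5});
        // [0 0 1 0]' picks the exceptional row; [0 80 0] is the exception pattern.
        double[][] fix = outer(new double[] {0, 0, 1, 0}, new double[] {0, 80, 0});
        double[][] m = add(base, fix);
        System.out.println(m[2][1]);  // 100.0, the exceptional value
    }
}
```

The sum is exactly a rank-two matrix, which is the bridge to the decompositions (random projection, SVD) on the following slides.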
  • 30. Random Projection
  • 31. SVD Projection
  • 32. Classifiers
  • 33. Mahout Classifiers.
    Naïve Bayes:
    – high-quality implementation
    – uses an idiosyncratic input format
    – ... but it is naïve
    SGD:
    – sequential, not parallel
    – auto-tuning has foibles
    – learning rate annealing has issues
    – definitely not state of the art compared to Vowpal Wabbit
    Random forest:
    – scaling limits due to the decomposition strategy
    – yet another input format
    – no deployment strategy
  • 34. The stuff that isn't there
  • 35. What Mahout Isn't:
    – Mahout isn't R, isn't SAS
    – it doesn't aim to do everything
    – it aims to scale some few problems of practical interest
    – the stuff that isn't there is a feature, not a defect
  • 36. Contact:
    – tdunning@maprtech.com
    – @ted_dunning
    – @apachemahout
    – user-subscribe@mahout.apache.org
    Slides and such: http://www.slideshare.net/tdunning
    Hash tags: #mapr #apachemahout