Intro to Mahout

A short introduction to Mahout during SCISR meetup

http://bit.ly/scisr

Published in: Technology, Education

  • The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers (2008)
  • Apache Lucene(TM) is a high-performance, full-featured text search engine library (2005)

    1. Ofer Vugman, May 2012
    2. Agenda and such…
        • What is ML (Machine Learning)
        • ML Common Use Cases
        • Mahout Overview
        • Algorithms in Mahout
        • Mahout Commercial Use
        • Mahout Summary
    3. What is ML
        “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Introduction to Machine Learning, E. Alpaydin)
    4. ML Common Use Cases: Recommendation
    5. ML Common Use Cases: Classification
    6. ML Common Use Cases: Clustering
    7. ML Common Libraries
    8. Mahout Overview – What? A mahout is a person who keeps and drives an elephant
    9. Mahout Overview – What? A scalable machine learning library
    10. Mahout Overview – What?
        • Began life in 2008 as a subproject of Apache’s Lucene project
        • In 2010 Mahout became a top-level Apache project in its own right
        • Implemented in Java
        • Built upon Apache’s Hadoop (Look! An elephant!)
    11. Mahout Overview – Why?
        Many open source ML libraries either:
        • Lack community
        • Lack documentation and examples
        • Lack scalability
        • Lack the Apache license
        • Are research oriented
        • Are not well tested
        • Are not built over existing production-quality libraries
    12. Mahout Overview – Why? Scalability
        • Scalable to reasonably large datasets (core algorithms implemented in Map/Reduce, runnable on Hadoop)
        • Scalable to support your business case (Apache License)
        • Scalable community
    13. Mahout Overview – Why? Built over existing production-quality libraries
    14. Mahout Overview – Use Cases
        Mahout currently supports four main use cases:
        1. Recommendation
        2. Clustering
        3. Classification
        4. Frequent Itemset Mining
    15. Mahout Overview – Technical System Requirements
        • Linux (or Cygwin on Windows)
        • Java 1.6.x or greater
        • Maven 2.0.11 or greater to build the source code
        • Hadoop 0.2 or greater*
        * Not all algorithms are implemented to work on Hadoop clusters
    16. Algorithms in Mahout
        We’ll focus on one example:
        • Collaborative Filtering (Recommenders)
        Yet there are many (many!) more; you can find them all at https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
    17. Algorithms Examples – Recommendation
        Help users find items they might like based on historical preferences
        Based on an example by Sebastian Schelter in “Distributed Itembased Collaborative Filtering with Apache Mahout”
    18. Algorithms Examples – Recommendation
                    The Matrix   Alien   Inception
        Alice           5          1         4
        Bob             ?          2         5
        Peter           4          3         2
    19. Algorithms Examples – Recommendation
        Algorithm
        • Neighborhood-based approach
        • Works by finding similarly rated items in the user-item matrix (e.g. cosine, Pearson correlation, Tanimoto coefficient)
        • Estimates a user's preference towards an item by looking at his/her preferences towards similar items
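The three similarity measures named on this slide can be sketched in a few lines of plain Python. This is an illustration only, not Mahout's implementation: the function names are hypothetical, and the sample vectors are the Alien and Inception ratings from the example matrix (Alice, Bob, Peter).

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equally long rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) coefficient on the sets of users who rated each item."""
    a, b = set(users_a), set(users_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Ratings given by Alice, Bob and Peter to the two items.
alien = [1, 2, 3]
inception = [4, 5, 2]

cos_sim = cosine(alien, inception)
pearson_sim = pearson(alien, inception)
tanimoto_sim = tanimoto({"Alice", "Peter"}, {"Alice", "Bob", "Peter"})  # Matrix vs. Alien raters
```

Cosine and Pearson use the rating values, while Tanimoto only looks at who rated what, which is why it suits data without explicit ratings.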
    20. Algorithms Examples – Recommendation
        Prediction: estimate Bob's preference towards “The Matrix”
        1. Look at all items that
            a) are similar to “The Matrix”
            b) have been rated by Bob
           => “Alien”, “Inception”
        2. Estimate the unknown preference with a weighted sum
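A minimal sketch of the weighted-sum step, assuming the sum is normalized by the absolute values of the similarities (the slide does not state the normalization; this choice is an assumption, though it does reproduce the 1.5 shown for Bob later in the deck). The similarity values are the ones the deck reports for “The Matrix”; the function name `predict` is hypothetical.

```python
def predict(similarities, user_ratings):
    """Estimate a rating as the similarity-weighted sum of the user's known ratings.

    similarities: item -> similarity to the target item
    user_ratings: item -> rating given by the user
    """
    common = [item for item in user_ratings if item in similarities]
    numerator = sum(similarities[item] * user_ratings[item] for item in common)
    # Assumption: normalize by the sum of absolute similarities.
    denominator = sum(abs(similarities[item]) for item in common)
    if denominator == 0:
        return None  # nothing comparable has been rated
    return numerator / denominator

# Similarities of the other items to "The Matrix", as reported in the deck.
sim_to_matrix = {"Alien": -0.47, "Inception": 0.47}
bob = {"Alien": 2, "Inception": 5}

estimate = predict(sim_to_matrix, bob)  # (-0.47*2 + 0.47*5) / (0.47 + 0.47)
```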
    21. Algorithms Examples – Recommendation
        MapReduce phase 1
        • Map – Make user the key
            (Alice, Matrix, 5) → Alice (Matrix, 5)
            (Alice, Alien, 1) → Alice (Alien, 1)
            (Alice, Inception, 4) → Alice (Inception, 4)
            (Bob, Alien, 2) → Bob (Alien, 2)
            (Bob, Inception, 5) → Bob (Inception, 5)
            (Peter, Matrix, 4) → Peter (Matrix, 4)
            (Peter, Alien, 3) → Peter (Alien, 3)
            (Peter, Inception, 2) → Peter (Inception, 2)
    22. Algorithms Examples – Recommendation
        MapReduce phase 1
        • Reduce – Create inverted index
            Alice (Matrix, 5), Alice (Alien, 1), Alice (Inception, 4) → Alice (Matrix, 5) (Alien, 1) (Inception, 4)
            Bob (Alien, 2), Bob (Inception, 5) → Bob (Alien, 2) (Inception, 5)
            Peter (Matrix, 4), Peter (Alien, 3), Peter (Inception, 2) → Peter (Matrix, 4) (Alien, 3) (Inception, 2)
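Phase 1 can be mimicked in memory with plain Python. This is a sketch of the idea only: it uses no Hadoop or Mahout API, the function names are hypothetical, and a dictionary stands in for the shuffle step that groups mapper output by key.

```python
from collections import defaultdict

# (user, item, rating) triples from the example.
ratings = [
    ("Alice", "Matrix", 5), ("Alice", "Alien", 1), ("Alice", "Inception", 4),
    ("Bob", "Alien", 2), ("Bob", "Inception", 5),
    ("Peter", "Matrix", 4), ("Peter", "Alien", 3), ("Peter", "Inception", 2),
]

def map_phase1(record):
    """Map: make the user the key."""
    user, item, rating = record
    yield user, (item, rating)

def reduce_phase1(user, values):
    """Reduce: collect all (item, rating) pairs of one user."""
    return user, list(values)

def run_phase1(records):
    grouped = defaultdict(list)  # stands in for the shuffle/sort between map and reduce
    for record in records:
        for key, value in map_phase1(record):
            grouped[key].append(value)
    return dict(reduce_phase1(user, values) for user, values in grouped.items())

user_vectors = run_phase1(ratings)
```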
    23. Algorithms Examples – Recommendation
        MapReduce phase 2
        • Map – Isolate all co-occurred ratings (all cases where a user rated both items)
            Alice (Matrix, 5) (Alien, 1) (Inception, 4) → Matrix, Alien (5,1); Alien, Inception (1,4); Matrix, Inception (5,4)
            Bob (Alien, 2) (Inception, 5) → Alien, Inception (2,5)
            Peter (Matrix, 4) (Alien, 3) (Inception, 2) → Matrix, Alien (4,3); Alien, Inception (3,2); Matrix, Inception (4,2)
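The phase 2 map step amounts to emitting every pair of items that one user rated. A minimal in-memory sketch follows; the function name and the fixed `ITEM_ORDER` list are assumptions made for illustration. Some canonical item order is needed so that the same two items always produce the same key.

```python
from itertools import combinations

# Output of phase 1: one list of (item, rating) pairs per user.
user_vectors = {
    "Alice": [("Matrix", 5), ("Alien", 1), ("Inception", 4)],
    "Bob": [("Alien", 2), ("Inception", 5)],
    "Peter": [("Matrix", 4), ("Alien", 3), ("Inception", 2)],
}

# Assumed canonical order, chosen to match the pair names used on the slide.
ITEM_ORDER = ["Matrix", "Alien", "Inception"]

def map_phase2(item_ratings):
    """Map: emit ((item_a, item_b), (rating_a, rating_b)) for each co-rated pair."""
    ordered = sorted(item_ratings, key=lambda pair: ITEM_ORDER.index(pair[0]))
    for (item_a, rating_a), (item_b, rating_b) in combinations(ordered, 2):
        yield (item_a, item_b), (rating_a, rating_b)

co_ratings = [
    emitted
    for item_ratings in user_vectors.values()
    for emitted in map_phase2(item_ratings)
]
```

A user with n ratings emits n(n-1)/2 pairs, so users with very long rating histories dominate the cost of this step.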
    24. Algorithms Examples – Recommendation
        MapReduce phase 2
        • Reduce – Compute similarities
            Matrix, Alien (5,1) (4,3) → Matrix, Alien (-0.47)
            Matrix, Inception (4,2) (5,4) → Matrix, Inception (0.47)
            Alien, Inception (1,4) (2,5) (3,2) → Alien, Inception (-0.63)
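The reduce step groups the co-occurred ratings by item pair and turns each group into one similarity value. The sketch below uses a plain Pearson correlation computed only over the co-ratings of each pair. That is an assumption: the slide does not spell out how the ratings are centered or normalized, so the numbers produced here are not expected to match the values on the slide exactly (the signs agree). All names are hypothetical.

```python
import math
from collections import defaultdict

# Output of the phase 2 map step: (item pair, rating pair).
co_ratings = [
    (("Matrix", "Alien"), (5, 1)), (("Matrix", "Alien"), (4, 3)),
    (("Matrix", "Inception"), (5, 4)), (("Matrix", "Inception"), (4, 2)),
    (("Alien", "Inception"), (1, 4)), (("Alien", "Inception"), (2, 5)),
    (("Alien", "Inception"), (3, 2)),
]

def pearson(rating_pairs):
    """Pearson correlation over the co-ratings of one item pair."""
    xs = [x for x, _ in rating_pairs]
    ys = [y for _, y in rating_pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dot = sum((x - mean_x) * (y - mean_y) for x, y in rating_pairs)
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: constant ratings carry no signal
    return dot / (norm_x * norm_y)

def reduce_phase2(item_pair, rating_pairs):
    """Reduce: one similarity value per item pair."""
    return item_pair, pearson(rating_pairs)

grouped = defaultdict(list)  # stands in for the shuffle between map and reduce
for item_pair, rating_pair in co_ratings:
    grouped[item_pair].append(rating_pair)

similarities = dict(reduce_phase2(pair, values) for pair, values in grouped.items())
```

With only two co-ratings a plain Pearson correlation is always +1 or -1 (or undefined), which is one reason practical implementations adjust the measure or require a minimum number of co-ratings.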
    25. Algorithms Examples – Recommendation
                    The Matrix   Alien   Inception
        Alice           5          1         4
        Bob            1.5         2         5
        Peter           4          3         2
    26. Mahout Commercial Use
    27. Mahout Resources
        • Mahout website – http://mahout.apache.org/
        • Introducing Apache Mahout – http://www.ibm.com/developerworks/java/library/j-mahout/
        • “Mahout in Action” by Sean Owen and Robin Anil
    28. Mahout Summary
        • ML is all over the web today
        • Mahout is about scalable machine learning
        • Mahout has functionality for many of today’s common machine learning tasks
        • MapReduce magic in action
    29. Thank you and good night