Introducing Apache Mahout

 Scalable Machine Learning for All!
          Grant Ingersoll
Agenda
• What is Machine Learning?
  – Definitions
  – Types
  – Applications
• Mahout
  –   What?
  –   Why?
  –   How?
  –   Who?
What is Machine Learning?




                NOT!
   [Image: HAL 9000]       Or?       [Image: Terminator]

http://en.wikipedia.org/wiki/Image:Hal-9000.jpg
http://upload.wikimedia.org/wikipedia/en/4/49/Terminator.jpg
How about?



             Google News
Or?




      Amazon.com
Definition
• “Machine Learning is programming
  computers to optimize a performance
  criterion using example data or past
  experience”
  – Introduction to Machine Learning by E.
    Alpaydin
• Subset of Artificial Intelligence
  – Many other fields: comp sci., biology,
    math, psychology, etc.
Characterizations
• Lots of Data

• Identifiable Features in that Data

• Too big/costly for people to handle
  – People still can help
Types
• Supervised
  – Using labeled training data, create
    function that predicts output of unseen
    inputs
• Unsupervised
  – Using unlabeled data, create function
    that predicts output (e.g. a grouping)
    without example answers to learn from
• Semi-Supervised
  – Uses labeled and unlabeled data
Classification/Categorization
•   Spam Filtering
•   Named Entity Recognition
•   Phrase Identification
•   Sentiment Analysis
•   Classification into a Taxonomy
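
To make the spam filtering case concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. It is not Mahout code: the function names, the whitespace tokenizer, the add-one smoothing, and the four training messages are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Collect the counts a multinomial Naive Bayes model needs.

    docs is a list of (label, text) pairs.
    """
    label_counts = Counter()            # documents per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocab = set()
    for label, text in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(model, text):
    """Return the label with the highest log posterior for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihoods, with add-one (Laplace) smoothing
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # skip words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for this example
TRAINING = [
    ("spam", "win cash now"),
    ("spam", "cheap pills win big"),
    ("ham", "meeting agenda for monday"),
    ("ham", "lunch on monday"),
]
model = train(TRAINING)
```

With this toy data, `classify(model, "win cheap cash")` returns `"spam"` and `classify(model, "monday meeting")` returns `"ham"`. Working in log space avoids underflow when many small probabilities are multiplied.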
Clustering
• Find Natural Groupings
  – Documents
  – Search Results
  – People
  – Genetic traits in groups
  – Many, many more uses
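
One common way to find natural groupings is k-means. The sketch below is a single-machine version in plain Python, assuming 2-D points and caller-supplied starting centroids; the function name and the sample points are made up for illustration and say nothing about how a distributed implementation is organized.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(c)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Invented sample data: two visually obvious groups
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(POINTS, [(0.0, 0.0), (10.0, 10.0)])
```

Here the first three points end up in one cluster and the last three in the other. The result depends on the starting centroids, which is one reason a cheaper pass is often used to pick them.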
Collaborative Filtering
• Recommend people and products
  – User-User
    • User likes X, you might too
  – Item-Item
    • People who bought X also bought Y
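
The item-item idea ("people who bought X also bought Y") can be sketched with simple co-occurrence counting. This is an illustrative plain-Python version, not a library API; the function names and the purchase baskets are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

def cooccurrence(baskets):
    """For every item, count how often each other item shares a basket with it."""
    counts = defaultdict(Counter)
    for basket in baskets:
        # set() so an item repeated within one basket is counted once
        for a, b in permutations(set(basket), 2):
            counts[a][b] += 1
    return counts

def also_bought(counts, item, n=2):
    """Return up to n items most often bought together with the given item."""
    return [other for other, _ in counts[item].most_common(n)]

# Invented purchase histories
BASKETS = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["book", "pen"],
    ["lamp", "desk"],
    ["book", "lamp"],
]
counts = cooccurrence(BASKETS)
```

For this data, `also_bought(counts, "book")` returns `["lamp", "pen"]`, since lamps share three baskets with books and pens share two. Raw counts favor popular items, so real systems usually normalize them in some way.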
Info. Retrieval
• Learning Ranking Functions

• Learning Spelling Corrections

• User Click Analysis and Tracking
Other
• Image Analysis
• Robotics
• Games
• Higher level natural language
  processing
• Many, many others
What is Apache Mahout?
• A Mahout is an elephant
  trainer/driver/keeper, hence…
             [image]
                  + (and other distributed techniques)
           Machine Learning
                  =
What?
• Hadoop brings:
  – Map/Reduce API
  – HDFS
  – In other words, scalability and fault tolerance
• Thus, Mahout’s Goal is:
  – Scalable Machine Learning with Apache
    License
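
For readers new to the Map/Reduce model, the sketch below simulates its three stages (map, shuffle, reduce) in memory with a word count. It deliberately does not use the Hadoop API: every function name here is an assumption made for illustration, and nothing is distributed or fault-tolerant.

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, the job a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: combine all values that share one key."""
    return key, sum(values)

def word_count(documents):
    """Run the three stages in sequence on an in-memory list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

result = word_count(["big data big", "data"])
```

`result` is `{"big": 2, "data": 2}`. Because each mapper call sees one record and each reducer call sees one key, the work can be split across machines, which is where the scalability comes from.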
Why Mahout?
• Many Open Source ML libraries either:
  –   Lack Community
  –   Lack Documentation and Examples
  –   Lack Scalability
  –   Lack the Apache License ;-)
  –   Or are research-oriented
• Personal: Learn more ML
• Intelligent Apps are the Present and Future
  – See the Hadoop talks tomorrow and Friday!
• Goal: Overcome gaps the Apache Way!
Current Status
• Close to Initial release
   – Focused on examples, docs, bug fixes
• What’s in it:
   – Simple Matrix/Vector library
   – Taste Collaborative Filtering
   – Clustering
      • Canopy/K-Means/Fuzzy K-Means/Mean-shift
   – Classifiers
      • Naïve Bayes
      • Complementary NB
   – Evolutionary
      • Integration with Watchmaker for fitness function
How?
• Examples
  – Taste
  – Clustering
  – Classification
  – Evolutionary
Taste: Movie Recommendations
• Given ratings by users of movies,
  recommend other movies

• http://lucene.apache.org/mahout/taste.html#demo
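
The movie recommendation task can be sketched as a user-based recommender: find users whose ratings correlate with yours, then rank the movies they rated that you have not. The code below is plain Python, not the Taste API. Using Pearson correlation as the similarity measure is an assumption made here, and the users, movies, and ratings are invented.

```python
import math

# Invented toy ratings on a 1-5 scale
RATINGS = {
    "alice": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob":   {"Alien": 5, "Brazil": 5, "Casablanca": 1, "Dune": 5},
    "carol": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1, "Eraserhead": 4},
}

def pearson(a, b):
    """Pearson correlation over the movies both users have rated."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        return 0.0
    mean_a = sum(a[m] for m in shared) / len(shared)
    mean_b = sum(b[m] for m in shared) / len(shared)
    dev_a = [a[m] - mean_a for m in shared]
    dev_b = [b[m] - mean_b for m in shared]
    norm = math.sqrt(sum(x * x for x in dev_a)) * math.sqrt(sum(y * y for y in dev_b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(dev_a, dev_b)) / norm

def recommend(ratings, user):
    """Rank unseen movies by the similarity-weighted ratings of like-minded users."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no or opposite taste
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                totals[movie] = totals.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    scores = {movie: totals[movie] / weights[movie] for movie in totals}
    return sorted(scores, key=scores.get, reverse=True)

suggestions = recommend(RATINGS, "alice")
```

With this data, `suggestions` is `["Dune"]`: bob's ratings track alice's closely, while carol's run opposite to them, so carol's "Eraserhead" rating is ignored.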
Clustering: Synthetic Control Data
• http://archive.ics.uci.edu/ml/datasets/Synthetic+


• Each clustering impl. has an example
  Job for running in
  <MAHOUT_HOME>/examples
  – o.a.mahout.clustering.syntheticcontrol.*
• Outputs clusters…
Classification: NB and CNB Examples
• 20 Newsgroups
  – http://cwiki.apache.org/confluence/display/MA


• Wikipedia
  – http://cwiki.apache.org/confluence/display/MA
Evolutionary
• Traveling Salesman
  – http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman


• Class Discovery
  – http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
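
As a rough idea of what an evolutionary approach to the Traveling Salesman problem looks like, here is a small single-machine sketch in plain Python. It does not use Watchmaker or Mahout. The population size, the slice-reversal mutation, the keep-the-best-half selection, and the city coordinates are all assumptions picked for illustration. The fitness function is the tour length, which is the part a distributed setup would evaluate in parallel.

```python
import math
import random

def tour_length(tour, cities):
    """Fitness: total length of the closed tour (lower is better)."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def mutate(tour, rng):
    """Reverse a random slice of the tour, leaving it a valid permutation."""
    i, j = sorted(rng.sample(range(len(tour)), 2))
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def evolve(cities, population_size=20, generations=200, seed=0):
    """Evolve tours by repeated selection and mutation; return the best one."""
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    base = list(range(len(cities)))
    population = [rng.sample(base, len(base)) for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the shorter half of the population
        population.sort(key=lambda t: tour_length(t, cities))
        survivors = population[:population_size // 2]
        # Variation: refill with mutated copies of random survivors
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return min(population, key=lambda t: tour_length(t, cities))

# Invented city coordinates
CITIES = [(0, 0), (3, 1), (1, 4), (5, 5), (6, 2), (2, 2), (4, 3), (0, 5)]
best = evolve(CITIES)
```

Because the best tours always survive into the next generation, the result is never longer than the best tour in the starting population, though nothing guarantees it is optimal.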
What’s Next?
•   Release 0.1!
•   Shared Amazon Images (others?)
•   More Examples
•   Winnow/Perceptron (MAHOUT-85)
•   HBase and HAMA support
•   Normalize I/O format for data
•   Solr Integration (SOLR-769)
•   Other Algorithms: SVM, Linear Regression,
    etc.
When, Where, Who
• When? Now!
  – Mahout is growing
• Who? You!
  – We want Java programmers who:
     • Are comfortable with math
     • Like to work on large, hard problems
• Where?
  – http://lucene.apache.org/mahout
  – http://cwiki.apache.org/MAHOUT
  – mahout-{user|dev}@lucene.apache.org
Resources
• “Programming Collective Intelligence”
  by Toby Segaran
• “Data Mining - Practical Machine
  Learning Tools and Techniques” by
  Ian H. Witten and Eibe Frank
• Hadoop - http://hadoop.apache.org
• http://mloss.org/software/
