Transcript

  • 1. Introducing Apache Mahout Scalable Machine Learning for All! Grant Ingersoll
  • 2. Agenda • What is Machine Learning? – Definitions – Types – Applications • Mahout – What? – Why? – How? – Who?
  • 3. What is Machine Learning? NOT! [image: HAL 9000, http://en.wikipedia.org/wiki/Image:Hal-9000.jpg] Or? [image: Terminator, http://upload.wikimedia.org/wikipedia/en/4/49/Terminator.jpg]
  • 4. How about? Google News
  • 5. Or? Amazon.com
  • 6. Definition • “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” – “Introduction to Machine Learning” by E. Alpaydin • Subset of Artificial Intelligence – Draws on many other fields: comp. sci., biology, math, psychology, etc.
  • 7. Characterizations • Lots of Data • Identifiable Features in that Data • Too big/costly for people to handle – People still can help
  • 8. Types • Supervised – Using labeled training data, create a function that predicts the output for unseen inputs • Unsupervised – Using unlabeled data, create a function that predicts output (e.g. a grouping) without any labeled examples • Semi-Supervised – Uses both labeled and unlabeled data
  • 9. Classification/Categorization • Spam Filtering • Named Entity Recognition • Phrase Identification • Sentiment Analysis • Classification into a Taxonomy
  • 10. Clustering • Find Natural Groupings – Documents – Search Results – People – Genetic traits in groups – Many, many more uses
  • 11. Collaborative Filtering • Recommend people and products – User-User • User likes X, you might too – Item-Item • People who bought X also bought Y
  • 12. Info. Retrieval • Learning Ranking Functions • Learning Spelling Corrections • User Click Analysis and Tracking
  • 13. Other • Image Analysis • Robotics • Games • Higher level natural language processing • Many, many others
  • 14. What is Apache Mahout? • A Mahout is an elephant trainer/driver/keeper, hence… Hadoop [image] (and other distributed techniques) + Machine Learning = Mahout
  • 15. What? • Hadoop brings: – Map/Reduce API – HDFS – In other words, scalability and fault-tolerance • Thus, Mahout’s Goal is: – Scalable Machine Learning with Apache License
  • 16. Why Mahout? • Many Open Source ML libraries either: – Lack Community – Lack Documentation and Examples – Lack Scalability – Lack the Apache License ;-) – Or are research-oriented • Personal: Learn more ML • Intelligent Apps are the Present and Future – See the Hadoop talks tomorrow and Friday! • Goal: Overcome gaps the Apache Way!
  • 17. Current Status • Close to Initial release – Focused on examples, docs, bug fixes • What’s in it: – Simple Matrix/Vector library – Taste Collaborative Filtering – Clustering • Canopy/K-Means/Fuzzy K-Means/Mean-shift – Classifiers • Naïve Bayes • Complementary NB – Evolutionary • Integration with Watchmaker for fitness function
  • 18. How? • Examples – Taste – Clustering – Classification – Evolutionary
  • 19. Taste: Movie Recommendations • Given ratings by users of movies, recommend other movies • http://lucene.apache.org/mahout/taste.html#demo
  • 20. Clustering: Synthetic Control Data • http://archive.ics.uci.edu/ml/datasets/Synthetic+ • Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples – o.a.mahout.clustering.syntheticcontrol.* • Outputs clusters…
  • 21. Classification: NB and CNB Examples • 20 Newsgroups – http://cwiki.apache.org/confluence/display/MA • Wikipedia – http://cwiki.apache.org/confluence/display/MA
  • 22. Evolutionary • Traveling Salesman – http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman • Class Discovery – http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
  • 23. What’s Next? • Release 0.1! • Shared Amazon Images (others?) • More Examples • Winnow/Perceptron (MAHOUT-85) • HBase and HAMA support • Normalize I/O format for data • Solr Integration (SOLR-769) • Other Algorithms: SVM, Linear Regression, etc.
  • 24. When, Where, Who • When? Now! – Mahout is growing • Who? You! – We want Java programmers who: • Are comfortable with math • Like to work on large, hard problems • Where? – http://lucene.apache.org/mahout – http://cwiki.apache.org/MAHOUT – mahout-{user|dev}@lucene.apache.org
  • 25. Resources • “Programming Collective Intelligence” by Toby Segaran • “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank • Hadoop - http://hadoop.apache.org • http://mloss.org/software/
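Slides 10 and 17 mention clustering and K-Means without showing the mechanics. The following is a minimal single-machine sketch of the K-Means loop in plain Python, not Mahout's Map/Reduce implementation. The function names (`assign`, `kmeans`) and the choice to pass the initial centroids in explicitly are assumptions made for illustration; a seeding step such as Canopy could supply those centroids.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Group each point with its nearest centroid (ties go to the lower index)."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, initial_centroids, iterations=20):
    """Alternate assignment and centroid update until nothing moves.

    initial_centroids is supplied by the caller (hypothetical design choice);
    an empty cluster keeps its previous centroid.
    """
    centroids = list(initial_centroids)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        updated = [mean(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, groups = kmeans(data, [(0, 0), (10, 10)])
    for center, group in zip(centers, groups):
        print(center, group)
```

In a distributed setting the assignment step maps naturally onto a map phase and the centroid update onto a reduce phase, which is one reason K-Means is a common fit for Map/Reduce.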
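Slide 11 summarizes item-item collaborative filtering as "People who bought X also bought Y". One simple way to read that is co-occurrence counting, sketched below in plain Python. This is not the Taste API: the names `cooccurrence` and `also_bought` and the sample baskets are hypothetical, and a ratings-based recommender would weigh items by similarity rather than raw counts.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(baskets):
    """Count, for every ordered item pair (a, b), how many users have both.

    baskets maps a user id to the list of items that user bought.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        # set() so that a repeated purchase by one user counts once
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def also_bought(counts, item, top_n=3):
    """Items most often bought together with `item`, best first.

    Ties are broken alphabetically so the result is deterministic.
    """
    related = counts.get(item, {})
    ranked = sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


if __name__ == "__main__":
    # Hypothetical purchase data, for illustration only
    baskets = {
        "ann": ["book", "lamp"],
        "bob": ["book", "lamp", "pen"],
        "cat": ["book", "pen"],
        "dan": ["lamp", "mug"],
    }
    counts = cooccurrence(baskets)
    print(also_bought(counts, "lamp"))
```

The pair counting is independent per user, so it can be split across machines and the partial counts summed afterwards.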
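Slides 9 and 17 name spam filtering and Naïve Bayes. A minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing follows, in plain Python. It is an illustration under stated assumptions, not Mahout's classifier and not the Complementary NB variant: the `train` and `classify` names, whitespace tokenization, and the four training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Build per-label document and word counts from (label, text) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in examples:
        words = text.lower().split()
        doc_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log posterior for `text`.

    Words never seen in training are skipped; seen words use add-one
    smoothing so a zero count does not zero out the whole score.
    """
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical training data, for illustration only
    model = train([
        ("spam", "win cash now"),
        ("spam", "cheap cash offer"),
        ("ham", "meeting agenda attached"),
        ("ham", "lunch meeting tomorrow"),
    ])
    print(classify(model, "cash offer now"))
    print(classify(model, "agenda for meeting"))
```

Training only accumulates counts, which is why this family of classifiers parallelizes well: counts from separate data splits can simply be added together.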