Xebia Knowledge Exchange (March 2011) - Machine Learning with Apache Mahout

Published in: Technology

  1. Machine Learning with Apache Mahout: Classification, Clustering and Recommendation. 3/3/2011, Michaël Figuière
  2. Machine Learning
  3. Machine Learning. Machine Learning is a subset of Artificial Intelligence. [Diagram: Machine Learning nested inside Artificial Intelligence]
  4. NoSQL, Search and Machine Learning. NoSQL, Search and Machine Learning complement each other well. [Diagram: NoSQL, Search, Machine Learning]
  5. Machine Learning algorithms
     • Recommendations: suggest relevant items to users
     • Classification: automatically classify documents based on a given set of examples
     • Clustering: automatically discover groups within a set of documents
     • Pattern mining, evolutionary algorithms, ...
  6. Recommendation - User based. Amazon suggests articles bought by similar customers.
  7. Recommendation - Item based. On the article page, Amazon leverages item-based recommendation.
  8. Similarities between users. Here we observe that users 1 and 2 have similar tastes. [Diagram: users 1 and 2 and items A, B, C, D, E, F]
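
One common way to put a number on "similar tastes" is the Pearson correlation between two users' ratings, the same measure the Mahout example later in this deck relies on. The sketch below is plain Java rather than Mahout code, and the ratings are invented for illustration; it assumes both users rated the same items in the same order.

```java
// Sketch only: plain Java, not Mahout. Ratings below are invented for illustration.
class PearsonSketch {

    /** Pearson correlation of two users' ratings over the same items (same order). */
    static double pearson(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("rating arrays must be non-empty and of equal length");
        }
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) {
            meanA += a[i];
            meanB += b[i];
        }
        meanA /= n;
        meanB /= n;

        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            double da = a[i] - meanA;
            double db = b[i] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        // The correlation is undefined when one user gave every item the same rating.
        if (varA == 0 || varB == 0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        double[] user1 = {5, 4, 1, 2};
        double[] user2 = {4, 3, 0, 1}; // same tastes as user1, rated one point lower
        double[] user3 = {1, 2, 5, 4}; // opposite tastes
        System.out.println("user1 vs user2: " + pearson(user1, user2));
        System.out.println("user1 vs user3: " + pearson(user1, user3));
    }
}
```

A value near 1 means the two users rank items the same way even if one rates systematically lower, while a value near -1 means opposite tastes.
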
  9. Recommendation use cases
     • Suggest items to users on e-commerce websites, and increase revenue
     • Suggest features a user may be interested in within a Web application, as most features usually go undiscovered
     • Filter and adapt the scoring of search engine results, based on similar users' clicks, ...
  10. Classification. Emails classified as spam by Gmail.
  11. Classification use cases
     • Automatically attach tags to documents, based on existing manual tagging, Wikipedia, ...
     • Extract suspicious documents: spam, corrupted documents, ...
  12. Clustering. Trending topics discovered by Google News.
  13. Clustering with K-Means. [Diagram: data points A, B, C, D, E, F]
  14. Clustering with K-Means. Cluster centers start at random initial positions.
  15. Clustering with K-Means. Data points are attached to the nearest cluster center.
  16. Clustering with K-Means. Cluster centers are moved in order to minimize the sum of distances.
  17. Clustering with K-Means. The data point C is then attached to the first center, as it has become the nearest.
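
The K-Means steps walked through on the slides above can be sketched in plain Java as follows. This is a minimal illustration, not Mahout's implementation: the coordinates for points A to F are invented, the initial centers are fixed instead of random so the run is reproducible, and each center moves to the mean of its points, which is the standard K-Means update (it minimizes the sum of squared distances).

```java
// Sketch only: plain Java, not Mahout. Point coordinates are invented for illustration.
class KMeansSketch {

    /** Index of the center nearest to point p (squared Euclidean distance, 2-D). */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dx = p[0] - centers[c][0];
            double dy = p[1] - centers[c][1];
            double dist = dx * dx + dy * dy;
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs K-Means on 2-D points. The centers array is updated in place;
     * the returned array gives the cluster index of each point.
     */
    static int[] cluster(double[][] points, double[][] centers, int maxIterations) {
        int[] assignment = new int[points.length];
        java.util.Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Step 1: attach each point to its nearest center.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int c = nearest(points[i], centers);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // assignments are stable, so the algorithm has converged
            }
            // Step 2: move each center to the mean of the points attached to it.
            for (int c = 0; c < centers.length; c++) {
                double sumX = 0, sumY = 0;
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) {
                        sumX += points[i][0];
                        sumY += points[i][1];
                        count++;
                    }
                }
                if (count > 0) { // an empty cluster keeps its previous center
                    centers[c][0] = sumX / count;
                    centers[c][1] = sumY / count;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}}; // A..F
        double[][] centers = {{0, 0}, {1, 0}}; // fixed here; the slides use random positions
        int[] assignment = cluster(points, centers, 100);
        System.out.println(java.util.Arrays.toString(assignment));
    }
}
```

With these made-up points, C starts out attached to the second center and switches to the first one once the centers have moved, mirroring the situation described on slide 17.
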
  18. Clustering use cases
     • Find key topics in a set of documents: news feeds, business documents, ...
     • Find typical behaviors within a set of users: visit frequency, buying habits, ...
  19. Apache Mahout
  20. In a few words
     • Implementation of machine learning algorithms in Java: a continuously growing collection of algorithms
     • Most of them come in a MapReduce implementation for Hadoop: scalable to huge datasets
     • Still quite young but growing fast: started in early 2009
     • Intended to be for Machine Learning what Lucene is for Information Retrieval
  21. Documentation
  22. Recommendation example. The code for a basic recommendation is pretty straightforward:
      // Classes come from Mahout's Taste packages (org.apache.mahout.cf.taste.*)
      DataModel model = new FileDataModel(new File("data.csv"));
      UserSimilarity simil = new PearsonCorrelationSimilarity(model);
      // Neighborhood made of the 2 users most similar to the target user
      UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, simil, model);
      Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, simil);
      // Top 1 recommended item for user 1
      List<RecommendedItem> recommendations = recommender.recommend(1, 1);
  23. Classification with Mahout. [Diagram: Training examples -> Training algorithm -> Model; a copy of the Model then turns New data into a Decision]
  24. Clustering with Mahout. [Diagram: Documents -> Clustering algorithm -> List of clusters]
  25. Relevance evaluation. [Diagram: the entire dataset is split in two parts: data used for training, and data used to evaluate the relevance of an algorithm and its settings]
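
The split described on the relevance evaluation slide can be sketched as a seeded shuffle followed by a cut. This is plain Java for illustration only; the 70/30 ratio and the seed used in `main` are assumptions, not values from the talk.

```java
// Sketch only: the training fraction and seed are illustrative assumptions.
class HoldoutSplitSketch {

    /**
     * Shuffles the indices 0..n-1 with a fixed seed and splits them in two:
     * result[0] holds the training indices, result[1] the evaluation indices.
     */
    static int[][] split(int n, double trainingFraction, long seed) {
        if (n < 0 || trainingFraction < 0 || trainingFraction > 1) {
            throw new IllegalArgumentException("n must be >= 0 and trainingFraction within [0, 1]");
        }
        int[] indices = new int[n];
        for (int i = 0; i < n; i++) {
            indices[i] = i;
        }
        // Fisher-Yates shuffle, seeded so that the split is reproducible.
        java.util.Random random = new java.util.Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = random.nextInt(i + 1);
            int tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
        }
        int trainSize = (int) Math.round(n * trainingFraction);
        int[] training = java.util.Arrays.copyOfRange(indices, 0, trainSize);
        int[] evaluation = java.util.Arrays.copyOfRange(indices, trainSize, n);
        return new int[][] {training, evaluation};
    }

    public static void main(String[] args) {
        int[][] parts = split(10, 0.7, 42L);
        System.out.println("training:   " + java.util.Arrays.toString(parts[0]));
        System.out.println("evaluation: " + java.util.Arrays.toString(parts[1]));
    }
}
```

The model is trained on the first part only, then its output is compared against the held-out part; repeating this with different algorithms or settings gives a basis for comparing them.
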
  26. A search engine use case
  27. A Search Engine. [Screenshot: search box]
  28. A Search Engine. Search: MyCustomer
  29. A Search Engine. Search: MyCustomer. Results:
     • Document: Non Disclosure Agreement, 12 days ago. "... MyCustomer agrees not to disclose any part of ..."
     • Document: 2010 Sales Report, 1 month ago. "... MyCustomer: 12 M€ with 3 deals ..."
     • Phone Call, 2 days ago. Customer: MyCustomer. Time: 9:55am. Duration: 13min. Description: Invoice not received for order #2354E
  30. Indexing Pipeline. [Diagram: PDF -> Text Extractor (Tika) -> Analyzer (Lucene) -> Search Index; Phone Call -> Analyzer (Lucene) -> Search Index]
  31. A more complex Search Engine. Search: MyCustomer. Categories: Sales, Legal, Accounting. Results:
     • Document: 2010 Sales Report, 1 month ago. "... MyCustomer: 12 M€ with 3 deals ..."
     • Phone Call, 2 days ago. Customer: MyCustomer. Time: 9:55am. Duration: 13min. Description: Invoice not received for order #2354E
  32. Indexing Pipeline with Mahout. [Diagram: PDF -> Text Extractor (Tika) -> Classifier (Mahout) -> Analyzer (Lucene) -> Search Index; Phone Call -> Classifier (Mahout) -> Analyzer (Lucene) -> Search Index]
  33. Query pipeline. [Diagram: Query -> Analyzer (Lucene) -> Search Index -> Results]
  34. Query pipeline with Mahout. [Diagram: Query -> Analyzer (Lucene) -> Search Index -> Custom Scoring, using Mahout recommendations -> Results]
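
The slides do not say how the custom scoring step combines search relevance with recommendations, so the following is only one possible sketch. It assumes each search result comes with a relevance score and an estimated preference for the current user (in Mahout such a value could come from `Recommender.estimatePreference`). The blending formula, the `weight` parameter and all values are hypothetical.

```java
// Sketch only: the blending formula and the weight are hypothetical, not from the talk.
class RescoringSketch {

    /** Boosts a search score by the user's estimated preference for that result. */
    static double blended(double searchScore, double preference, double weight) {
        return searchScore * (1 + weight * preference);
    }

    /**
     * Returns the result indices ordered by decreasing blended score.
     * searchScores[i] and preferences[i] describe the same search result.
     */
    static Integer[] rerank(double[] searchScores, double[] preferences, double weight) {
        if (searchScores.length != preferences.length) {
            throw new IllegalArgumentException("one preference is expected per search result");
        }
        Integer[] order = new Integer[searchScores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Stable sort: results with equal blended scores keep their original order.
        java.util.Arrays.sort(order, (x, y) -> Double.compare(
                blended(searchScores[y], preferences[y], weight),
                blended(searchScores[x], preferences[x], weight)));
        return order;
    }

    public static void main(String[] args) {
        double[] searchScores = {1.0, 0.9, 0.5}; // made-up relevance scores
        double[] preferences = {0.0, 1.0, 0.0};  // made-up estimated preferences in [0, 1]
        System.out.println(java.util.Arrays.toString(rerank(searchScores, preferences, 0.5)));
    }
}
```

Setting `weight` to 0 leaves the original search ranking untouched, which makes it easy to compare results with and without the recommendation boost.
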
  35. Conclusion
     • Machine learning brings many valuable features to enterprises: increased revenue, better productivity, user adoption, ...
     • Mahout is growing fast and is becoming a great choice for Java apps, with easy integration into business applications
     • Business people are not used to this kind of use case: collaboration with technical folks is mandatory
  36. Questions / Answers? blog.xebia.fr @mfiguiere
