
Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout

Transcript of "Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout"

  1. 1. Machine Learning with Apache Mahout: Classification, Clustering and Recommendation. 3/3/2011, Michaël Figuière
  2. 2. Machine Learning
  3. 3. Machine Learning: Machine Learning is a subset of Artificial Intelligence. (Diagram labels: Artificial Intelligence, Machine Learning.)
  4. 4. NoSQL, Search and Machine Learning: NoSQL, Search and Machine Learning complement each other very well! (Diagram labels: Machine Learning, NoSQL, Search.)
  5. 5. Machine Learning algorithms
      • Recommendations: advise users with recommended items
      • Classification: automatically classify documents based on a given set of examples
      • Clustering: automatically discover groups within a set of documents
      • Pattern mining, evolutionary algorithms, ...
  6. 6. Recommendation - User based: Amazon suggests articles bought by similar customers
  7. 7. Recommendation - Item based: on the article page, Amazon leverages item-based recommendation
  8. 8. Similarities between users: here we observe that users 1 and 2 have similar tastes. (Diagram labels: users 1 and 2; items A, B, C, D, E, F.)
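One common way to quantify "similar tastes" is the Pearson correlation between two users' ratings, the measure named later in the deck's recommendation example. The following is a minimal plain-Java sketch, not Mahout's implementation; the class name `UserSimilaritySketch`, the item keys and the rating values are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java sketch of Pearson correlation between two users' ratings (not Mahout code). */
final class UserSimilaritySketch {

    /**
     * Pearson correlation computed over the items both users have rated.
     * Returns Double.NaN when fewer than two items are shared or when
     * one user's shared ratings have zero variance.
     */
    static double pearson(Map<String, Double> a, Map<String, Double> b) {
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        int n = 0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other == null) {
                continue; // item rated by only one of the two users
            }
            double x = entry.getValue();
            double y = other;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
            n++;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        if (varianceA <= 0 || varianceB <= 0) {
            return Double.NaN;
        }
        return covariance / Math.sqrt(varianceA * varianceB);
    }

    public static void main(String[] args) {
        // Hypothetical ratings: both users rank items A, B and C in the same order.
        Map<String, Double> user1 = new HashMap<>();
        user1.put("A", 5.0);
        user1.put("B", 3.0);
        user1.put("C", 4.0);
        Map<String, Double> user2 = new HashMap<>();
        user2.put("A", 4.0);
        user2.put("B", 2.0);
        user2.put("C", 3.0);
        user2.put("D", 5.0);
        System.out.println("similarity = " + pearson(user1, user2));
    }
}
```

A value close to 1 means the two users rate their shared items in the same way, and a value close to -1 means opposite tastes.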
  9. 9. Recommendation use cases
      • Recommend items to users on e-commerce websites, and increase revenue
      • Suggest features a user may be interested in within a Web application, as most features are usually unknown to users
      • Filter and adjust the scoring of search engine results, based on similar users' clicks, ...
  10. 10. Classification: mails classified as spam by GMail
  11. 11. Classification use cases
      • Automatically attach tags to documents, based on existing manual tagging, Wikipedia, ...
      • Extract suspicious documents: spam, corrupted documents, ...
  12. 12. Clustering: trending topics discovered by Google News
  13. 13. Clustering with K-Means. (Diagram: data points A, B, C, D, E, F.)
  14. 14. Clustering with K-Means: cluster centers are placed at random initial positions
  15. 15. Clustering with K-Means: data points are attached to the nearest cluster center
  16. 16. Clustering with K-Means: cluster centers are moved in order to minimize the sum of distances
  17. 17. Clustering with K-Means: the data point C is then attached to the first center, as it has become the nearest
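The steps shown on slides 13 to 17 can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not Mahout's implementation: the class name `KMeansSketch` and the sample coordinates are made up, the initial centers are passed in rather than drawn at random so the run is repeatable, and each center is moved to the mean of its points, which is the standard K-Means update and minimizes the sum of squared distances.

```java
import java.util.Arrays;

/** Plain-Java sketch of the K-Means loop described on the slides (not Mahout code). */
final class KMeansSketch {

    /** Index of the center nearest to point p (squared Euclidean distance). */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double distance = 0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centers[c][i];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs K-Means from the given initial centers (modified in place)
     * and returns the cluster index assigned to each point.
     */
    static int[] cluster(double[][] points, double[][] centers, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Step 1: attach each point to its nearest center.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centers);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // assignments are stable, so the centers would not move either
            }
            // Step 2: move each center to the mean of the points attached to it.
            for (int c = 0; c < centers.length; c++) {
                double[] sum = new double[centers[c].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] != c) {
                        continue;
                    }
                    for (int i = 0; i < sum.length; i++) {
                        sum[i] += points[p][i];
                    }
                    count++;
                }
                if (count == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int i = 0; i < sum.length; i++) {
                    centers[c][i] = sum[i] / count;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical 2D points standing in for A..F on the slides.
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}};
        double[][] centers = {{0, 0}, {1, 0}};
        System.out.println(Arrays.toString(cluster(points, centers, 100)));
    }
}
```

With these sample values the third point first lands in the second cluster and switches to the first one after the centers move, which mirrors what happens to point C on slide 17.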
  18. 18. Clustering use cases
      • Find key topics in a set of documents: news feeds, business documents, ...
      • Find typical behaviors within a set of users: visit frequency, buying habits, ...
  19. 19. Apache Mahout
  20. 20. In a few words
      • Implementation of machine learning algorithms in Java: a continuously growing collection of algorithms
      • Most of them come in a MapReduce implementation for Hadoop: scalable to huge datasets
      • Still quite young but growing fast: started in early 2009
      • Intended to be for Machine Learning what Lucene is for Information Retrieval
  21. 21. Documentation
  22. 22. Recommendation example
      DataModel model = new FileDataModel(new File("data.csv"));
      UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
      UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
      Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
      // Recommend 1 item for the user with ID 1
      List<RecommendedItem> recommendations = recommender.recommend(1, 1);
      The code for a basic recommendation is pretty straightforward!
  23. 23. Classification with Mahout: training examples -> training algorithm -> model; the model is then copied, and new data -> model -> decision
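The train-then-decide flow of slide 23 can be illustrated with a tiny multinomial Naive Bayes text classifier. The slide does not say which algorithm is used, so Naive Bayes is an assumption chosen for brevity, and this plain-Java sketch is not Mahout's classifier API; the class name `NaiveBayesSketch`, the labels and the sample texts are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Plain-Java sketch of train-then-classify with multinomial Naive Bayes (not Mahout code). */
final class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // label -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // label -> number of words
    private final Map<String, Integer> docCounts = new HashMap<>();               // label -> number of examples
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Training: add one labeled example to the model. */
    void train(String label, String text) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
    }

    /** Decision: return the label with the highest log-probability, or null if nothing was trained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = Math.log((double) docCounts.get(label) / totalDocs); // class prior
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String word : tokenize(text)) {
                int count = counts.getOrDefault(word, 0);
                // Laplace (add-one) smoothing so an unseen word does not zero out the score.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\W+");
    }

    public static void main(String[] args) {
        NaiveBayesSketch model = new NaiveBayesSketch();
        model.train("spam", "cheap pills buy now");
        model.train("ham", "meeting agenda for monday");
        System.out.println(model.classify("buy cheap pills"));
    }
}
```

The trained object plays the role of the "model" box on the slide: once training is done it can be handed to the part of the application that makes decisions on new data.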
  24. 24. Clustering with Mahout: documents -> clustering algorithm -> list of clusters
  25. 25. Relevance evaluation: the entire dataset is split into data used for training and data used to evaluate the relevance of an algorithm and its settings
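A hold-out split like the one on slide 25 might look like the plain-Java sketch below. Everything here is illustrative rather than Mahout API: the class name `HoldOutEvaluation`, the fixed random seed and the use of accuracy as the relevance measure are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

/** Plain-Java sketch of a hold-out evaluation (not Mahout code). */
final class HoldOutEvaluation {

    /** Shuffles a copy of the data and returns two lists: [training part, evaluation part]. */
    static <T> List<List<T>> split(List<T> data, double trainingFraction, long seed) {
        if (trainingFraction < 0 || trainingFraction > 1) {
            throw new IllegalArgumentException("trainingFraction must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainingFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    /** Fraction of evaluation examples for which the model returns the expected label. */
    static <X, Y> double accuracy(List<X> inputs, List<Y> expected, Function<X, Y> model) {
        if (inputs.isEmpty()) {
            return Double.NaN;
        }
        int correct = 0;
        for (int i = 0; i < inputs.size(); i++) {
            if (model.apply(inputs.get(i)).equals(expected.get(i))) {
                correct++;
            }
        }
        return (double) correct / inputs.size();
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<List<Integer>> parts = split(data, 0.8, 42L);
        System.out.println("training: " + parts.get(0) + ", evaluation: " + parts.get(1));
    }
}
```

Keeping the evaluation part out of training is what makes the measured score meaningful when comparing algorithms or settings.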
  26. 26. A search engine use case
  27. 27. A Search Engine. (Screen: a search box with a Search button.)
  28. 28. A Search Engine. (Screen: the query "MyCustomer" typed in the search box.)
  29. 29. A Search Engine. (Screen: results for the query "MyCustomer".)
      • Document: Non Disclosure Agreement, 12 days ago. "... MyCustomer agrees not to disclose any part of ..."
      • Document: 2010 Sales Report, 1 month ago. "... MyCustomer: 12 M€ with 3 deals ..."
      • Phone Call, 2 days ago. Customer: MyCustomer. Time: 9:55am. Duration: 13min. Description: Invoice not received for order #2354E
  30. 30. Indexing Pipeline: PDF -> Tika text extractor -> Lucene analyzer -> search index; phone call -> Lucene analyzer -> search index
  31. 31. A more complex Search Engine. (Screen: results for the query "MyCustomer", with the categories Sales, Legal and Accounting.)
      • Document: 2010 Sales Report, 1 month ago. "... MyCustomer: 12 M€ with 3 deals ..."
      • Phone Call, 2 days ago. Customer: MyCustomer. Time: 9:55am. Duration: 13min. Description: Invoice not received for order #2354E
  32. 32. Indexing Pipeline with Mahout: PDF -> Tika text extractor -> Mahout classifier -> Lucene analyzer -> search index; phone call -> Mahout classifier -> Lucene analyzer -> search index
  33. 33. Query pipeline: query -> Lucene analyzer -> search index -> results
  34. 34. Query pipeline with Mahout: query -> Lucene analyzer -> search index -> custom scoring using Mahout recommendations -> results
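The slides do not detail how the custom scoring step works. One possible approach, shown here purely as an assumption, is to blend each hit's text relevance score with a per-user preference estimate coming from a recommender. The class `RescoringSketch`, its `Hit` type, the `weight` parameter and the sample scores are all hypothetical; this is neither Lucene nor Mahout API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of re-scoring search hits with recommendation scores (hypothetical design). */
final class RescoringSketch {

    /** A search hit: a document identifier and its relevance score. */
    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Blends each hit's search score with the user's estimated preference for that
     * document (0.0 when unknown) and returns the hits sorted by blended score, best first.
     * weight = 0 keeps the pure search score, weight = 1 uses only the preference.
     */
    static List<Hit> rescore(List<Hit> hits, Map<String, Double> preferences, double weight) {
        List<Hit> result = new ArrayList<>();
        for (Hit hit : hits) {
            double preference = preferences.getOrDefault(hit.docId, 0.0);
            result.add(new Hit(hit.docId, (1 - weight) * hit.score + weight * preference));
        }
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(new Hit("nda", 0.9), new Hit("sales-report", 0.8), new Hit("phone-call", 0.5));
        Map<String, Double> preferences = Collections.singletonMap("phone-call", 1.0);
        for (Hit hit : rescore(hits, preferences, 0.5)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```

Both scores are assumed to be on a comparable scale (for example normalized to the range 0 to 1); otherwise one of them would dominate the blend.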
  35. 35. Conclusion
      • Machine learning brings a lot of valuable features to enterprises: increased revenue, better productivity, user adoption, ...
      • Mahout is growing fast and is becoming a great choice for Java apps, with easy integration into business applications
      • Business people are not used to these kinds of use cases: collaboration with technical folks is mandatory
  36. 36. Questions / Answers ? blog.xebia.fr @mfiguiere