Machine Learning with Apache Mahout: Classification, Clustering and Recommendation 3/3/2011 Michaël Figuière
Machine Learning: Machine Learning is a subset of Artificial Intelligence.
NoSQL, Search and Machine Learning: NoSQL, Search and Machine Learning complement each other very well!
Machine Learning algorithms
• Recommendation: suggest relevant items to users
• Classification: automatically classify documents based on a given set of examples
• Clustering: automatically discover groups within a set of documents
• Pattern mining, evolutionary algorithms, ...
Recommendation - User based: Amazon suggests articles bought by similar customers.
Recommendation - Item based: on the article page, Amazon leverages item-based recommendation.
Similarities between users: here we observe that users 1 and 2 have similar tastes. (Diagram: users 1 and 2 linked to items A to F.)
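The idea of "similar tastes" can be turned into a number. The following is a minimal sketch, not Mahout's implementation: it computes the Pearson correlation between two users' ratings over the items both have rated. The class name `UserSimilaritySketch`, the item ids and the rating values are assumptions made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class UserSimilaritySketch {

    /** Pearson correlation over co-rated items; NaN when it is undefined. */
    static double pearson(Map<String, Double> a, Map<String, Double> b) {
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        int n = 0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other == null) {
                continue; // only items rated by both users count
            }
            double ra = entry.getValue();
            double rb = other;
            sumA += ra;
            sumB += rb;
            sumAA += ra * ra;
            sumBB += rb * rb;
            sumAB += ra * rb;
            n++;
        }
        if (n < 2) {
            return Double.NaN; // not enough overlap to correlate
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        if (varianceA <= 0 || varianceB <= 0) {
            return Double.NaN; // a user who rates everything the same
        }
        return covariance / Math.sqrt(varianceA * varianceB);
    }

    public static void main(String[] args) {
        // Hypothetical ratings on a 1 to 5 scale.
        Map<String, Double> user1 = new HashMap<>();
        user1.put("A", 5.0);
        user1.put("B", 3.0);
        user1.put("C", 1.0);
        Map<String, Double> user2 = new HashMap<>();
        user2.put("A", 4.0);
        user2.put("B", 3.0);
        user2.put("C", 2.0);
        user2.put("D", 5.0); // not rated by user 1, so ignored
        System.out.println("similarity(1, 2) = " + pearson(user1, user2));
    }
}
```

A value close to 1 means the two users rank their shared items the same way, a value close to -1 means opposite tastes.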
Recommendation use cases
• Recommend items to users on e-commerce websites, and increase revenue
• Suggest features a user may be interested in within a Web application, as most features are usually unknown to users
• Filter and adapt the scoring of search engine results, based on similar users' clicks, ...
Classification: emails classified as spam by Gmail.
Classification use cases
• Automatically attach tags to documents, based on existing manual tagging, Wikipedia, ...
• Extract suspicious documents: spam, corrupted documents, ...
Clustering: trending topics discovered by Google News.
Clustering with K-Means (diagram: data points A to F)
Clustering with K-Means: cluster centers start at random initial positions.
Clustering with K-Means: each data point is attached to the nearest cluster center.
Clustering with K-Means: cluster centers are moved in order to minimize the sum of squared distances to their attached data points.
Clustering with K-Means: data point C is then attached to the first center, as that center has become the nearest.
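The steps shown on the K-Means slides can be sketched in plain Java. This is a minimal illustration and not Mahout's implementation: the class name `KMeansSketch`, the sample points and the fixed random seed are assumptions. It alternates between attaching each point to its nearest center and moving each center to the mean of its points, and stops when no point changes center.

```java
import java.util.Arrays;
import java.util.Random;

final class KMeansSketch {

    /** Runs k-means, updating centers in place; returns each point's center index. */
    static int[] cluster(double[][] points, double[][] centers, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Step 1: attach each point to its nearest center.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = nearest(points[i], centers);
                if (best != assignment[i]) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // no point switched center: converged
            }
            // Step 2: move each center to the mean of its attached points.
            for (int c = 0; c < centers.length; c++) {
                double[] sum = new double[centers[c].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] != c) {
                        continue;
                    }
                    for (int d = 0; d < sum.length; d++) {
                        sum[d] += points[i][d];
                    }
                    count++;
                }
                if (count == 0) {
                    continue; // an empty cluster keeps its current center
                }
                for (int d = 0; d < sum.length; d++) {
                    centers[c][d] = sum[d] / count;
                }
            }
        }
        return assignment;
    }

    /** Index of the center with the smallest squared Euclidean distance to p. */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double distance = 0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centers[c][d];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical 2D data points.
        double[][] points = { {1, 1}, {1, 2}, {2, 1}, {8, 8}, {8, 9}, {9, 8} };
        // Random initial positions, seeded so that runs are repeatable.
        Random random = new Random(42);
        double[][] centers = new double[2][];
        for (int c = 0; c < centers.length; c++) {
            centers[c] = new double[] { random.nextDouble() * 10, random.nextDouble() * 10 };
        }
        int[] assignment = cluster(points, centers, 100);
        System.out.println("assignment = " + Arrays.toString(assignment));
        System.out.println("centers    = " + Arrays.deepToString(centers));
    }
}
```

The result depends on the initial positions, which is why real implementations often run several random starts and keep the best one.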
Clustering use cases
• Find key topics in a set of documents: news feeds, business documents, ...
• Find typical behaviors within a set of users: visit frequency, buying habits, ...
Mahout in a few words
• Implementation of machine learning algorithms in Java: a continuously growing collection of algorithms
• Most of them come in a MapReduce implementation for Hadoop: scalable to huge datasets
• Still quite young but growing fast: started in early 2009
• Intended to be for Machine Learning what Lucene is for Information Retrieval
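Recommendation example
// Load user,item,preference lines from a CSV file
DataModel model = new FileDataModel(new File("data.csv"));
// Compare users with the Pearson correlation of their preferences
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// Use the 2 most similar users as the neighborhood
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
// Ask for 1 recommended item for user 1
List<RecommendedItem> recommendations = recommender.recommend(1, 1);
The code for a basic recommendation is pretty straightforward!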
Classification with Mahout: training examples are fed to a training algorithm, which produces a model. A copy of that model then takes new data and outputs a decision.
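The train-then-decide flow can be illustrated with a tiny text classifier. This is a minimal sketch of a multinomial naive Bayes model with Laplace smoothing, written in plain Java; it is not Mahout's API, and the class name `NaiveBayesSketch`, the labels and the sample texts are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class NaiveBayesSketch {

    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Integer> wordsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Training: update the model with one labeled example. */
    void train(String label, String text) {
        docsPerLabel.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
            wordsPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** Decision: the label with the highest log-probability, or null if untrained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // Prior: share of training documents carrying this label.
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = wordsPerLabel.getOrDefault(label, 0);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) {
                    continue;
                }
                int count = counts.getOrDefault(word, 0);
                // Laplace smoothing keeps unseen words from zeroing the score.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }

    public static void main(String[] args) {
        NaiveBayesSketch model = new NaiveBayesSketch();
        // Hypothetical training examples.
        model.train("spam", "win money now");
        model.train("spam", "cheap money offer");
        model.train("ham", "meeting agenda for monday");
        model.train("ham", "project report attached");
        System.out.println(model.classify("cheap money"));
        System.out.println(model.classify("project meeting"));
    }
}
```

Once trained, the model object is all that is needed to classify new data, which is what makes it possible to copy the model to wherever decisions are made.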
Clustering with Mahout: documents are fed to a clustering algorithm, which outputs a list of clusters.
Relevance evaluation: the entire dataset is split into data used for training and data used to evaluate the relevance of an algorithm and its settings.
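Such a split can be sketched in a few lines of plain Java. This is a minimal illustration under stated assumptions, not Mahout's evaluator classes: the class name `HoldoutEvaluationSketch`, the 80/20 ratio, the toy dataset and the threshold "model" are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

final class HoldoutEvaluationSketch {

    /** A labeled example: an input value and its expected label. */
    static final class Example {
        final double input;
        final String label;

        Example(double input, String label) {
            this.input = input;
            this.label = label;
        }
    }

    /** Shuffles a copy of the data and splits it: index 0 is training, index 1 is evaluation. */
    static <T> List<List<T>> split(List<T> data, double trainingFraction, long seed) {
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainingFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    /** Fraction of evaluation examples for which the model predicts the expected label. */
    static double accuracy(List<Example> evaluation, Function<Double, String> model) {
        if (evaluation.isEmpty()) {
            return Double.NaN;
        }
        int correct = 0;
        for (Example example : evaluation) {
            if (model.apply(example.input).equals(example.label)) {
                correct++;
            }
        }
        return correct / (double) evaluation.size();
    }

    public static void main(String[] args) {
        // Hypothetical dataset: inputs 0..19, labeled "high" from 10 upward.
        List<Example> dataset = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            dataset.add(new Example(i, i >= 10 ? "high" : "low"));
        }
        List<List<Example>> parts = split(dataset, 0.8, 42L);
        List<Example> training = parts.get(0);
        List<Example> evaluation = parts.get(1);

        // Toy "training": use the mean training input as a decision threshold.
        double sum = 0;
        for (Example example : training) {
            sum += example.input;
        }
        final double threshold = sum / training.size();
        Function<Double, String> model = x -> x > threshold ? "high" : "low";

        System.out.println("accuracy on held-out data = " + accuracy(evaluation, model));
    }
}
```

Because the evaluation data is never seen during training, the measured accuracy is a fairer estimate of how the algorithm and its settings behave on new data.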
A search engine use case
A Search Engine (mockup: an empty search box with a Search button)
A Search Engine (mockup: "MyCustomer" typed into the search box)
A Search Engine: searching for "MyCustomer" returns results of several types.
• Document, Non Disclosure Agreement, 12 days ago: "... MyCustomer agrees not to disclose any part of ..."
• Document, 2010 Sales Report, 1 month ago: "... MyCustomer: 12 M€ with 3 deals ..."
• Phone Call, 2 days ago: Customer: MyCustomer, Time: 9:55am, Duration: 13min, Description: Invoice not received for order #2354E
Indexing Pipeline: PDF → Tika text extractor → Lucene analyzer → search index; phone call → Lucene analyzer → search index.
A more complex Search Engine: results for "MyCustomer" are organized by category (Sales, Legal, Accounting).
• Document, 2010 Sales Report, 1 month ago: "... MyCustomer: 12 M€ with 3 deals ..."
• Phone Call, 2 days ago: Customer: MyCustomer, Time: 9:55am, Duration: 13min, Description: Invoice not received for order #2354E
Indexing Pipeline with Mahout: PDF → Tika text extractor → Mahout classifier → Lucene analyzer → search index; phone call → Mahout classifier → Lucene analyzer → search index.
Query pipeline: query → Lucene analyzer → search index → results.
Query pipeline with Mahout: query → Lucene analyzer → search index → custom scoring using Mahout recommendations → results.
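One possible shape for the custom scoring step is to blend each hit's search score with a recommendation score for the current user. The sketch below is hypothetical and uses neither the Lucene nor the Mahout API: the class name `RescoringSketch`, the `Hit` type, the `weight` parameter and the assumption that both scores are already normalized to the 0 to 1 range are all illustrative choices.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RescoringSketch {

    /** A search hit: a document id and its score (assumed normalized to [0, 1]). */
    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Blends the search score with a per-document recommendation score
     * (0 when the document has none) and sorts hits by the blended score.
     */
    static List<Hit> rescore(List<Hit> hits, Map<String, Double> recommendationScores, double weight) {
        List<Hit> rescored = new ArrayList<>();
        for (Hit hit : hits) {
            double recommendation = recommendationScores.getOrDefault(hit.docId, 0.0);
            double blended = (1 - weight) * hit.score + weight * recommendation;
            rescored.add(new Hit(hit.docId, blended));
        }
        rescored.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return rescored;
    }

    public static void main(String[] args) {
        // Hypothetical hits as they might come back from the search index.
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("nda", 0.9));
        hits.add(new Hit("sales-report", 0.8));
        hits.add(new Hit("phone-call", 0.5));

        // Hypothetical recommendation scores for the current user.
        Map<String, Double> recommendations = new HashMap<>();
        recommendations.put("phone-call", 1.0);

        for (Hit hit : rescore(hits, recommendations, 0.5)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```

A `weight` of 0 leaves the original ranking untouched, while a `weight` of 1 ranks purely by recommendation, so the parameter controls how strongly similar users' behavior influences the results.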
Conclusion
• Machine learning brings a lot of valuable features to enterprises: increased revenue, better productivity, user adoption, ...
• Mahout is growing fast and is becoming a great choice for Java apps, with easy integration into business applications
• Business people are not used to these kinds of use cases: collaboration with technical folks is mandatory