Xebia Knowledge Exchange (March 2011) - Machine Learning with Apache Mahout
 

    Presentation Transcript

    • Machine Learning with Apache Mahout: Classification, Clustering and Recommendation. 3/3/2011, Michaël Figuière
    • Machine Learning
    • Machine Learning: Machine Learning is a subset of Artificial Intelligence. (Diagram labels: Artificial Intelligence, Machine Learning)
    • NoSQL, Search and Machine Learning: NoSQL, Search and Machine Learning complement each other well. (Diagram labels: NoSQL, Search, Machine Learning)
    • Machine Learning algorithms
        • Recommendations: advise users with recommended items
        • Classification: automatically classify documents based on a given set of examples
        • Clustering: automatically discover groups within a set of documents
        • Pattern mining, evolutionary algorithms, ...
    • Recommendation - User based: Amazon suggests articles bought by similar customers
    • Recommendation - Item based: on the article page, Amazon leverages item-based recommendation
    • Similarities between users: here we observe that users 1 and 2 have similar tastes. (Diagram labels: users 1 and 2; items A, B, C, D, E, F)
    • Recommendation use cases
        • Advise users on items on e-commerce websites, and increase revenue
        • Suggest features a user may be interested in within a web application, as most features are usually unknown to users
        • Filter and adapt the scoring of search engine results, based on similar users' clicks, ...
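The idea of "users with similar tastes" can be made concrete with a correlation measure. The following is a minimal, library-free sketch of Pearson correlation over the items two users have both rated. It does not use Mahout, and the class name, method name and sample ratings are illustrative assumptions rather than anything from the talk.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: measures how similar two users' tastes are.
final class UserSimilaritySketch {

    // Pearson correlation over the items rated by both users.
    // Returns 0.0 when fewer than two items are shared, or when
    // one user's shared ratings have no variance.
    static double pearson(Map<String, Double> a, Map<String, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by the second user
            }
            double x = e.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n < 2) {
            return 0.0;
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        if (varA <= 0 || varB <= 0) {
            return 0.0;
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Made-up ratings for items A to D.
        Map<String, Double> user1 = new HashMap<>();
        user1.put("A", 5.0);
        user1.put("B", 3.0);
        user1.put("C", 1.0);
        Map<String, Double> user2 = new HashMap<>();
        user2.put("A", 4.0);
        user2.put("B", 2.0);
        user2.put("C", 0.0);
        user2.put("D", 5.0);
        System.out.println("similarity(user1, user2) = " + pearson(user1, user2));
    }
}
```

A user-based recommender would then suggest to user 1 the items that highly correlated users rated well and that user 1 has not seen yet, such as item D in this made-up data.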
    • Classification: emails classified as spam by Gmail
    • Classification use cases
        • Automatically attach tags to documents, based on existing manual tagging, Wikipedia, ...
        • Extract suspicious documents: spam, corrupted documents, ...
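As an illustration of classifying documents from a set of labeled examples, here is a small sketch of a multinomial naive Bayes text classifier with add-one smoothing. Naive Bayes is chosen here as an assumption for the sake of the example; the slides do not say which algorithm is used. The class, its methods and the training sentences are all hypothetical, and no Mahout API is involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, library-free naive Bayes text classifier.
final class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // label -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // label -> number of tokens
    private final Map<String, Integer> docCounts = new HashMap<>();               // label -> number of examples
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Adds one labeled training example to the model.
    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Returns the label with the highest log-probability, or null if nothing was trained.
    String classify(String text) {
        List<String> words = tokenize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = Math.log(docCounts.get(label) / (double) totalDocs); // class prior
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String word : words) {
                int count = counts.getOrDefault(word, 0);
                // Add-one smoothing so unseen words do not zero out the probability.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    // Lower-cases the text and splits it on non-word characters, dropping empty tokens.
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        NaiveBayesSketch model = new NaiveBayesSketch();
        model.train("spam", "win money now");
        model.train("spam", "cheap money offer");
        model.train("ham", "meeting agenda for monday");
        model.train("ham", "project report attached");
        System.out.println(model.classify("win cheap money"));
    }
}
```

The two phases mirror the usual classification workflow: `train` builds the model from examples, and `classify` applies that model to new data to reach a decision.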
    • Clustering: trending topics discovered by Google News
    • Clustering with K-Means: start from the data points A, B, C, D, E and F
    • Clustering with K-Means: cluster centers are placed at random initial positions
    • Clustering with K-Means: each data point is attached to the nearest cluster center
    • Clustering with K-Means: cluster centers are moved in order to minimize the sum of squared distances to their attached points
    • Clustering with K-Means: the data point C is then attached to the first center, as it has become the nearest
    • Clustering use cases
        • Find key topics in a set of documents: news feeds, business documents, ...
        • Find typical behaviors within a set of users: visit frequency, buying habits, ...
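The K-Means steps described in the slides (attach each point to its nearest center, move the centers, repeat) can be sketched in a few lines of plain Java. This is an illustrative sketch, not Mahout's implementation: the class and method names are hypothetical, and the initial centers are passed in explicitly rather than drawn at random so that the run is reproducible. Moving a center to the mean of its points is what minimizes the sum of squared distances.

```java
import java.util.Arrays;

// Hypothetical, library-free K-Means sketch (Lloyd's algorithm).
final class KMeansSketch {

    // Index of the center closest to the point, by squared Euclidean distance.
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double distance = 0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centers[c][i];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    // Alternates assignment and update steps until no point changes cluster
    // or maxIterations is reached. The centers array is updated in place;
    // the returned array holds the cluster index of each point.
    static int[] cluster(double[][] points, double[][] centers, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Assignment step: attach each point to its nearest center.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centers);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
            // Update step: move each center to the mean of its attached points.
            for (int c = 0; c < centers.length; c++) {
                double[] sum = new double[centers[c].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] != c) {
                        continue;
                    }
                    for (int i = 0; i < sum.length; i++) {
                        sum[i] += points[p][i];
                    }
                    count++;
                }
                if (count == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int i = 0; i < sum.length; i++) {
                    centers[c][i] = sum[i] / count;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Made-up 2D points forming two visible groups.
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}};
        double[][] centers = {{0, 0}, {1, 0}}; // deliberately poor starting positions
        int[] assignment = cluster(points, centers, 10);
        System.out.println(Arrays.toString(assignment));
        System.out.println(Arrays.deepToString(centers));
    }
}
```

With these made-up points, the third point starts out attached to the second center and switches to the first one once the centers have moved, which is the same effect the slides illustrate with data point C.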
    • Apache Mahout
    • In a few words
        • Implementation of machine learning algorithms in Java: a continuously growing collection of algorithms
        • Most of them come in a MapReduce implementation for Hadoop: scalable to huge datasets
        • Still quite young but growing fast: started in early 2009
        • Intended to be for Machine Learning what Lucene is for Information Retrieval
    • Documentation
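    • Recommendation example
        // Classes come from Mahout's Taste packages (org.apache.mahout.cf.taste.*), plus java.io.File and java.util.List.
        DataModel model = new FileDataModel(new File("data.csv"));                       // userID,itemID,preference rows
        UserSimilarity simil = new PearsonCorrelationSimilarity(model);                  // how alike two users are
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, simil, model);   // the 2 most similar users
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, simil);
        List<RecommendedItem> recommendations = recommender.recommend(1, 1);             // 1 recommendation for user 1
      The code for a basic recommendation is pretty straightforward!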
    • Classification with Mahout: Training examples → Training algorithm → Model. The model is then copied and applied: New data → Model → Decision
    • Clustering with Mahout: Documents → Clustering algorithm → List of clusters
    • Relevance evaluation: the entire dataset is split into data used for training and data used to evaluate the relevance of an algorithm and its settings
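The split described for relevance evaluation can be sketched as a simple holdout: shuffle the dataset, keep one part for training, and score the algorithm on the rest. The helper below is a hypothetical, library-free illustration. The 80/20 ratio, the seed and the accuracy metric are assumptions chosen for the example, not values from the talk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical holdout-evaluation helpers.
final class HoldoutSplitSketch {

    // Shuffles a copy of the data and splits it in two:
    // index 0 holds the training part, index 1 the evaluation part.
    static <T> List<List<T>> split(List<T> data, double trainingFraction, long seed) {
        if (trainingFraction < 0.0 || trainingFraction > 1.0) {
            throw new IllegalArgumentException("trainingFraction must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split reproducible
        int cut = (int) Math.round(shuffled.size() * trainingFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    // Fraction of evaluation examples whose predicted label equals the expected one.
    static double accuracy(List<String> expected, List<String> predicted) {
        if (expected.size() != predicted.size()) {
            throw new IllegalArgumentException("both lists must have the same size");
        }
        if (expected.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < expected.size(); i++) {
            if (expected.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return correct / (double) expected.size();
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            data.add(i);
        }
        List<List<Integer>> parts = split(data, 0.8, 42L);
        System.out.println("training size:   " + parts.get(0).size());
        System.out.println("evaluation size: " + parts.get(1).size());
    }
}
```

Repeating the evaluation with different algorithm settings on the same split gives a way to compare them, since the evaluation data was never seen during training.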
    • A search engine use case
    • A Search Engine: (screenshot: a search box with a "Search" button)
    • A Search Engine: (screenshot: the query "MyCustomer" in the search box)
    • A Search Engine: results for the query "MyCustomer"
        • Document: Non Disclosure Agreement, 12 days ago. "... MyCustomer agrees not to disclose any part of ..."
        • Document: 2010 Sales Report, 1 month ago. "... MyCustomer: 12 M€ with 3 deals ..."
        • Phone Call: 2 days ago. Customer: MyCustomer, Time: 9:55am, Duration: 13min, Description: Invoice not received for order #2354E
    • Indexing Pipeline (diagram, reconstructed from its labels): PDF → Text Extractor (Tika) → Analyzer (Lucene) → Search Index; Phone Call → Analyzer (Lucene) → Search Index
    • A more complex Search Engine: results for the query "MyCustomer", with the categories Sales, Legal and Accounting
        • Document: 2010 Sales Report, 1 month ago. "... MyCustomer: 12 M€ with 3 deals ..."
        • Phone Call: 2 days ago. Customer: MyCustomer, Time: 9:55am, Duration: 13min, Description: Invoice not received for order #2354E
    • Indexing Pipeline with Mahout (diagram, reconstructed from its labels): PDF → Text Extractor (Tika) → Classifier (Mahout) → Analyzer (Lucene) → Search Index; Phone Call → Classifier (Mahout) → Analyzer (Lucene) → Search Index
    • Query pipeline (diagram, reconstructed from its labels): Query → Analyzer (Lucene) → Search Index → Results
    • Query pipeline with Mahout (diagram, reconstructed from its labels): Query → Analyzer (Lucene) → Search Index → Custom Scoring, using Mahout recommendations → Results
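One way to picture custom scoring driven by recommendations is to blend each hit's text-relevance score with a per-user preference estimate and then re-sort the hits. The sketch below is only an assumption about how such a step could work. It uses neither the Lucene nor the Mahout API, the `Hit` type and `rescore` method are hypothetical, and both scores are assumed to be normalized to the range 0 to 1.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical re-scoring step applied to search results.
final class RescoringSketch {

    // A search hit: a document id and its score.
    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Blends the search score with the user's estimated preference for each document.
    // weight = 0 keeps the original ranking; weight = 1 ranks by preference only.
    // Documents without a preference estimate are treated as preference 0.
    static List<Hit> rescore(List<Hit> hits, Map<String, Double> preferences, double weight) {
        List<Hit> rescored = new ArrayList<>();
        for (Hit hit : hits) {
            double preference = preferences.getOrDefault(hit.docId, 0.0);
            rescored.add(new Hit(hit.docId, (1 - weight) * hit.score + weight * preference));
        }
        // Highest blended score first.
        rescored.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return rescored;
    }

    public static void main(String[] args) {
        // Made-up hits and preference estimates.
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("nda", 0.9));
        hits.add(new Hit("sales-report", 0.8));
        hits.add(new Hit("phone-call", 0.5));
        Map<String, Double> preferences = new HashMap<>();
        preferences.put("phone-call", 1.0);
        preferences.put("sales-report", 0.2);
        for (Hit hit : rescore(hits, preferences, 0.5)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```

In a real pipeline the preference estimates would come from a recommender trained on user behavior such as clicks, and the blending weight would need tuning against evaluation data.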
    • Conclusion
        • Machine learning brings a lot of valuable features to enterprises: increased revenue, better productivity, user adoption, ...
        • Mahout is growing fast and is becoming a great choice for Java apps, with easy integration into business applications
        • Business people are not used to these kinds of use cases: collaboration with technical folks is mandatory
    • Questions / Answers? blog.xebia.fr, @mfiguiere