- 1. Apache Mahout: Scalable Machine Learning Library Anastasiia Kornilova
- 2. What is Machine Learning? “Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.”
- 3. Typical Use Cases ● Recommend products/friends … ● Classify content into predefined groups ● Computer vision ● Sentiment analysis/opinion mining ● Find patterns in user behavior/actions ● Identify key topics/summarize text ● Detect anomalies/fraud ● Ranking search results ● Speech and handwriting recognition ● Natural language processing
- 4. ML Algorithms (subset): ● Supervised learning: linear regression, logistic regression, Support Vector Machines, Random Forests ● Unsupervised learning: clustering, blind signal separation, Hidden Markov models ● Semi-supervised learning
- 5. Many ML libraries, frameworks and tools: ● Weka ● Python scikit-learn ● Pylearn/Pylearn2 ● Theano ● Orange ● SSBrain :) ● More can be found here: http://mloss.org/software/
- 6. Typical Workflow ● Get data ● Prepare data ● Choose algorithm(s) ● Run your algorithm(s) ● Validate results
- 7. Every ML algorithm deals with: 1. Data 2. Computation over this data
- 8. Scalability strategies: ● “Bigger” computer ● More cores ● GPU computing ● Parallel computing, MapReduce
- 9. What is Mahout? ● Scalable ML library built on Hadoop, written in Java ● Driven by Ng et al.'s paper “Map-Reduce for Machine Learning on Multicore” ● Started as a Lucene sub-project; became an Apache top-level project (TLP) in April 2010 ● 25 July 2013: Apache Mahout 0.8 released ● The Taste recommender framework by Sean Owen was added in 2008
- 10. Who uses Mahout?
- 11. When do you need Mahout? (table from Varad Meru)

  | Data size | Task | Tools |
  | --- | --- | --- |
  | Lines, sample data | Analysis and visualization | Whiteboard, bash, ... |
  | KBs to low MBs, prototype data | Analysis and visualization | Octave, R, bash, ... |
  | MBs to low GBs, online data | Storage | Databases (MySQL, PostgreSQL), ... |
  | | Analysis | NumPy, SciPy, BLAS, Weka |
  | | Visualization | Protovis, D3, ... |
  | GBs to TBs to PBs, big data | Storage | HDFS, HBase, Cassandra, ... |
  | | Analysis | Mahout, Hive, Pig, ... |
- 12. Advantages ● Community ● Documentation and examples ● Scalability ● Apache license ● Well tested ● Built over existing production quality libraries
- 13. Requirements ● Java 1.6.x or greater ● Maven 3.x to build the source code ● Hadoop 0.20.0 or greater
- 14. Core themes ● Recommender engines (collaborative filtering) ● Clustering ● Classification
- 16. Algorithms ● User and Item based recommenders ● Matrix factorization based recommenders ● K-Means, Fuzzy K-Means clustering ● Latent Dirichlet Allocation ● Singular value decomposition ● Logistic regression based classifier ● Complementary Naive Bayes classifier ● Random forest decision tree based classifier
- 17. Recommender engine
- 18. Personalization level ● Generic / Non-Personalized: everyone receives same recommendations ● Demographic: matches a target group ● Ephemeral: matches current activity ● Persistent: matches long-term interests
- 19. Content based ● User Ratings x Item Attributes => Model ● Model applied to new items via attributes ● Alternative: knowledge-based (item attributes form a model of the item space) ● Example: personalized news feeds
- 20. Table of ratings
- 21. Ratings ● Explicit (Rating, Review, Vote, Like) ● Implicit (Click, Purchase, Follow)
- 22. Item-item ● For every item I ● Select the N most similar items ● Recommend these N items to users who interact with item I
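
The item-item steps can be sketched in plain Python. This is only an illustration under stated assumptions, not Mahout code: the names `tanimoto`, `similar_items`, `recommend_for_item_users` and the toy `item_users` data are hypothetical, and Tanimoto similarity over sets of users is one possible choice of item similarity.

```python
def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) similarity between two sets of user ids."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0


def similar_items(item_users, item, n=2):
    """Return up to n items most similar to `item`, skipping zero similarity."""
    scored = [
        (tanimoto(item_users[item], users), other)
        for other, users in item_users.items()
        if other != item
    ]
    scored = [(score, other) for score, other in scored if score > 0]
    # Highest similarity first; ties broken by item id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:n]]


def recommend_for_item_users(item_users, item, n=2):
    """For each user of `item`, list the similar items that user has not seen."""
    candidates = similar_items(item_users, item, n)
    return {
        user: [c for c in candidates if user not in item_users[c]]
        for user in sorted(item_users[item])
    }


# Hypothetical implicit-feedback data: item id -> users who interacted with it.
item_users = {
    "A": {"u1", "u2", "u3"},
    "B": {"u1", "u2"},
    "C": {"u3"},
    "D": {"u4"},
}
recommendations = recommend_for_item_users(item_users, "A")
```

Because the similar-item list depends only on the item, it can be precomputed once and reused for every user who touches that item.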
- 23. User-user ● For every user ● Find the N most similar users ● Aggregate the preferences of those similar users for this user ● Generate recommended items
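
A minimal sketch of the user-user steps, in plain Python rather than Mahout's API. The function names, the toy `ratings` dictionary, the use of cosine similarity, and the similarity-weighted average used to aggregate neighbor preferences are all assumptions made for illustration.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity of two sparse rating dicts (missing items count as 0)."""
    common = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_for_user(ratings, user, n_neighbors=2, top_k=3):
    """Recommend unseen items using the n most similar users."""
    # Step 1: find the n most similar users.
    scored = [
        (cosine_similarity(ratings[user], ratings[other]), other)
        for other in ratings
        if other != user
    ]
    neighbors = sorted(scored, reverse=True)[:n_neighbors]

    # Step 2: aggregate neighbor preferences as a similarity-weighted average.
    totals, weights = {}, {}
    for sim, other in neighbors:
        if sim <= 0:
            continue
        for item, rating in ratings[other].items():
            if item in ratings[user]:
                continue  # only score items the user has not rated yet
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim

    # Step 3: rank items by predicted rating.
    predicted = {item: totals[item] / weights[item] for item in totals}
    return sorted(predicted, key=lambda item: (-predicted[item], item))[:top_k]


# Hypothetical explicit ratings: user -> {item: rating}.
ratings = {
    "alice": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4},
    "carol": {"A": 1, "B": 5, "D": 5},
}
```

Calling `recommend_for_user(ratings, "alice", n_neighbors=1)` uses only the closest neighbor, while a larger neighborhood lets less similar users contribute candidate items as well.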
- 24. Similarity metrics ● Pearson Correlation ● Tanimoto ● Cosine similarity ● Euclidean distance
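
The four metrics can be written out in a few lines of plain Python. This is an illustrative sketch using the textbook definitions, not Mahout's implementations; it assumes equal-length rating vectors for Pearson, cosine and Euclidean, and sets of item ids for Tanimoto.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length rating vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # undefined for constant vectors; 0.0 is a convention here
    return cov / math.sqrt(var_a * var_b)


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def tanimoto(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

Tanimoto ignores rating values entirely, which makes it a natural fit for implicit feedback such as clicks or purchases, while Pearson and cosine need numeric preferences.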
- 25. Sparse matrix
- 26. Parameters ● DataModel: FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel ● UserSimilarity: Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity ● ItemSimilarity: Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity ● UserNeighborhood: Nearest N-User Neighborhood, Threshold User Neighborhood
- 27. Code example
- 28. Evaluation ● Average absolute difference ● RMSE ● Precision and recall ● Precision is the proportion of top results that are relevant, for some definition of relevant. ● Recall is the proportion of all relevant results included in the top results.
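
These evaluation measures are short enough to sketch directly. The Python below is an illustration of the standard definitions, not Mahout's evaluator classes; the function names and the choice to return `0.0` for empty inputs in `precision_recall_at_k` are assumptions.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; penalizes large misses more than MAE."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommended items."""
    top = recommended[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, with top results `["A", "B"]` and relevant items `{"A", "C", "E"}`, one of two top results is relevant (precision 0.5) and one of three relevant items was returned (recall 1/3).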
- 29. Clustering
- 30. Mahout Clustering Algorithms ● K-Means: runs on Hadoop ● Fuzzy K-Means: runs on Hadoop ● Latent Dirichlet Allocation: runs on Hadoop ● Canopy clustering: runs on Hadoop ● Minhash clustering: runs on Hadoop ● kMeans++ streaming clustering: documentation missing
- 31. Classification
- 32. Mahout Classification Algorithms ● Logistic regression (SGD) - model parameter selection can be done in Hadoop ● Naive Bayes - training runs on Hadoop ● Random Forests - training is done in Hadoop ● Hidden Markov Models - training is done in Map-Reduce
- 33. Resources ● Mahout in Action ● Apache Mahout Cookbook ● Introduction to Apache Mahout ● http://mahout.apache.org/
- 34. Q&A
