Mahout
Published in: Technology, Education
Transcript

  1. Apache Mahout: Scalable Machine Learning Library Anastasiia Kornilova
  2. What is Machine Learning? “Machine learning - branch of artificial intelligence, concerns the construction and study of systems that can learn from data”
  3. Typical Use Cases ● Recommend products/friends … ● Classify content into predefined groups ● Computer vision ● Sentiment analysis/opinion mining ● Find patterns in user behavior/actions ● Identify key topics/summarize text ● Detect anomalies/fraud ● Rank search results ● Speech and handwriting recognition ● Natural language processing
  4. ML Algorithms (subset): ● Supervised learning – Linear regression – Logistic regression – Support Vector Machines – Random Forests ● Unsupervised learning – Clustering – Blind signal separation – Hidden Markov models ● Semi-supervised learning
  5. Many ML libraries, frameworks and tools: ● Weka ● Python Scikit ● Pylearn/Pylearn2 ● Theano ● Orange ● SSBrain :) ● More can be found here: http://mloss.org/software/
  6. Typical Workflow ● Get data ● Prepare data ● Choose algorithm(s) ● Run your algorithm(s) ● Validate results
  7. Every ML algorithm deals with: 1. Data 2. Computation over this data
  8. Scalability strategies: ● “Bigger” computer ● More cores ● GPU computing ● Parallel computing, MapReduce
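
The MapReduce strategy listed on this slide can be sketched in plain Python. This is only an illustration of the map, shuffle and reduce steps on one machine, assuming a made-up list of (user, item, rating) records; the function names are hypothetical and none of this is Hadoop's or Mahout's API.

```python
from collections import defaultdict

# Toy (user, item, rating) records; the data is made up for illustration.
RATINGS = [
    ("u1", "itemA", 4.0),
    ("u2", "itemA", 2.0),
    ("u1", "itemB", 5.0),
    ("u3", "itemB", 3.0),
    ("u3", "itemC", 1.0),
]


def map_phase(record):
    """Emit one (key, value) pair per input record: item -> rating."""
    _user, item, rating = record
    yield item, rating


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key: here, the mean rating of an item."""
    return key, sum(values) / len(values)


def mean_rating_per_item(records):
    """Run the three steps in sequence on a single machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call looks at one record and each reduce call looks at one key, both steps could in principle be spread across many machines, which is the property the slide relies on.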
  9. What is Mahout? ● Scalable ML library built on Hadoop, written in Java ● Driven by Ng et al.'s paper “MapReduce for Machine Learning on Multicore” ● Started as a Lucene sub-project; became an Apache TLP in April 2010 ● 25 July 2013 - Apache Mahout 0.8 released ● Taste recommender framework by Sean Owen was added in 2008
  10. Who uses Mahout?
  11. When do you need Mahout? (table from Varad Meru)
      Data size | Task | Tools
      Lines, sample data | Analysis and visualization | Whiteboard, bash, ...
      KBs – low MBs, prototype data | Analysis and visualization | Octave, R, bash, ...
      MBs – low GBs, online data | Storage | Databases (MySQL, PostgreSQL), ...
      MBs – low GBs, online data | Analysis | NumPy, SciPy, BLAS, Weka
      MBs – low GBs, online data | Visualization | Protovis, D3, ...
      GBs – TBs – PBs, big data | Storage | HDFS, HBase, Cassandra, ...
      GBs – TBs – PBs, big data | Analysis | Mahout, Hive, Pig, ...
  12. Advantages ● Community ● Documentation and examples ● Scalability ● Apache license ● Well tested ● Built over existing production quality libraries
  13. Requirements ● Java 1.6.x or greater ● Maven 3.x to build the source code ● Hadoop 0.20.0 or greater
  14. Core themes ● Recommender engines (collaborative filtering) ● Clustering ● Classification
  15. Core themes ● Recommender engines (collaborative filtering) ● Clustering ● Classification
  16. Algorithms ● User and Item based recommenders ● Matrix factorization based recommenders ● K-Means, Fuzzy K-Means clustering ● Latent Dirichlet Allocation ● Singular value decomposition ● Logistic regression based classifier ● Complementary Naive Bayes classifier ● Random forest decision tree based classifier
  17. Recommender engine
  18. Personalization level ● Generic / Non-Personalized: everyone receives same recommendations ● Demographic: matches a target group ● Ephemeral: matches current activity ● Persistent: matches long-term interests
  19. Content based ● User Ratings x Item Attributes => Model ● Model applied to new items via attributes ● Alternative: knowledge-based (Item attributes form model of item space) ● Example: Personalized news feeds
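
One way to read "User Ratings x Item Attributes => Model" is as a per-user attribute profile that can score items the user has not seen yet. The sketch below assumes made-up news articles with two topic attributes and a mean-centered weighting scheme; the names and the weighting are illustrative choices, not something the slides specify.

```python
# Hypothetical item attribute vectors (news topics) and one user's ratings.
ITEM_ATTRIBUTES = {
    "article1": {"sports": 1.0, "politics": 0.0},
    "article2": {"sports": 0.0, "politics": 1.0},
}
USER_RATINGS = {"article1": 5.0, "article2": 1.0}


def build_profile(ratings, attributes):
    """Combine ratings with item attributes into a per-attribute preference model."""
    mean = sum(ratings.values()) / len(ratings)
    profile = {}
    for item, rating in ratings.items():
        for attr, value in attributes[item].items():
            # Ratings above the user's mean push an attribute up, ratings below push it down.
            profile[attr] = profile.get(attr, 0.0) + (rating - mean) * value
    return profile


def score(profile, item_attrs):
    """Apply the model to any item, including a new one, through its attributes."""
    return sum(profile.get(attr, 0.0) * value for attr, value in item_attrs.items())
```

A new article tagged only with "sports" would score higher for this user than one tagged only with "politics", even though nobody has rated either of them yet.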
  20. Table of ratings
  21. Ratings ● Explicit (Rating, Review, Vote, Like) ● Implicit (Click, Purchase, Follow)
  22. Item-item ● For every item I ● Select the N most similar items ● Recommend these N items to users who interacted with item I
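
The item-item steps can be sketched as follows, assuming implicit feedback stored as a set of items per user and Tanimoto overlap of user sets as the item similarity. The data and function names are hypothetical, and this is not Mahout's implementation.

```python
# Hypothetical implicit feedback: user -> set of items the user interacted with.
INTERACTIONS = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"A", "C"},
    "u4": {"D"},
}


def users_by_item(interactions):
    """Invert the data: item -> set of users who interacted with it."""
    index = {}
    for user, items in interactions.items():
        for item in items:
            index.setdefault(item, set()).add(user)
    return index


def similar_items(item, interactions, n=2):
    """Select the n items whose user sets overlap most with those of `item`."""
    index = users_by_item(interactions)
    target_users = index[item]
    scored = []
    for other, users in index.items():
        if other == item:
            continue
        overlap = len(target_users & users)
        if overlap:
            scored.append((overlap / len(target_users | users), other))
    # Highest similarity first; ties are broken by item name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _sim, other in scored[:n]]


def recommend_for_item(item, interactions, n=2):
    """Recommend the similar items to every user of `item` who lacks them."""
    similar = similar_items(item, interactions, n)
    return {
        user: [s for s in similar if s not in items]
        for user, items in interactions.items()
        if item in items
    }
```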
  23. User-user ● For every user ● Find the n most similar users ● Aggregate their preferences for this user ● Generate recommended items
  24. Similarity metrics ● Pearson Correlation ● Tanimoto ● Cosine similarity ● Euclidean distance
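
A minimal sketch of these user-user steps, assuming explicit ratings held in a dictionary and an inverse-distance similarity over co-rated items. The ratings, names and the similarity choice are illustrative assumptions rather than Mahout's code.

```python
import math

# Hypothetical ratings table: user -> {item: rating}.
RATINGS = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 5.0, "B": 3.0, "C": 4.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0, "E": 4.0},
}


def similarity(r1, r2):
    """Inverse-distance similarity over co-rated items (1.0 means identical)."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dist = math.sqrt(sum((r1[i] - r2[i]) ** 2 for i in common))
    return 1.0 / (1.0 + dist)


def recommend(user, ratings, n_neighbors=2, top_k=3):
    target = ratings[user]
    # Step 1: find the n most similar users.
    neighbors = sorted(
        ((similarity(target, r), other) for other, r in ratings.items() if other != user),
        reverse=True,
    )[:n_neighbors]
    # Step 2: aggregate neighbor ratings for unseen items, weighted by similarity.
    totals, weights = {}, {}
    for sim, other in neighbors:
        if sim <= 0:
            continue
        for item, rating in ratings[other].items():
            if item in target:
                continue
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    # Step 3: rank the unseen items by estimated preference.
    scored = [(totals[i] / weights[i], i) for i in totals]
    return [item for _score, item in sorted(scored, reverse=True)[:top_k]]
```

With this toy data, `recommend("alice", RATINGS)` ranks item D ahead of item E, because bob, who rates exactly like alice, gave D a high rating.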
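
For reference, the four metrics on this slide can be written in a few lines of plain Python. This sketch assumes dense, equal-length rating vectors for Pearson, cosine and Euclidean, and item sets for Tanimoto; returning 0.0 for degenerate inputs is a convention chosen here, not a rule from the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)


def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def euclidean_distance(x, y):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def tanimoto(items_a, items_b):
    """Size of the intersection divided by size of the union of two item sets."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Tanimoto ignores rating values and only looks at which items were touched, so it suits implicit feedback such as clicks or purchases.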
  25. Sparse matrix
  26. Parameters ● DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel ● UserSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity ● ItemSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity ● UserNeighborhood – Nearest N-User Neighborhood, Threshold User Neighborhood
  27. Code example
  28. Evaluation ● Average absolute difference ● RMSE ● Precision and recall ● Precision is the proportion of top results that are relevant, for some definition of relevant. ● Recall is the proportion of all relevant results included in the top results.
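
The evaluation measures above can be computed as in the sketch below. It assumes parallel lists of actual and predicted ratings, a ranked recommendation list, and a set of relevant items; the function names are illustrative and do not mirror Mahout's evaluator classes.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than small ones."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations against a relevant set."""
    top = recommended[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if the top two results are A and B and the relevant items are A, C and E, one of the two results is relevant (precision 0.5) and one of the three relevant items was found (recall 1/3).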
  29. Clustering
  30. Mahout Clustering Algorithms ● K-Means - runs on Hadoop ● Fuzzy K-means - runs on Hadoop ● Latent Dirichlet Allocation - runs on Hadoop ● Canopy clustering - runs on Hadoop ● Minhash clustering - runs on Hadoop ● kMeans++ streaming clustering - documentation missing
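
As a rough picture of what K-Means does, here is a single-machine sketch of Lloyd's algorithm. It assumes points are tuples of floats and that the caller supplies the initial centroids; it is not Mahout's Hadoop implementation, which distributes the same assignment and update steps.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on tuples of floats, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(
                range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
            )
            for point in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [pt for pt, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment
```

The result depends on the starting centroids, which is one reason seeding schemes such as the kMeans++ variant mentioned on the slide exist.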
  31. Classification
  32. Mahout Classification Algorithms ● Logistic regression (SGD) - model parameter selection can be done in Hadoop ● Naive Bayes - training runs on Hadoop ● Random Forests - training is done in Hadoop ● Hidden Markov Models - training is done in Map-Reduce
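
To illustrate the first entry, logistic regression trained with stochastic gradient descent (SGD) can be sketched in plain Python. The learning rate, epoch count, seed and function names below are assumptions made for this example; Mahout's own SGD classifier is a Java implementation with more features and is not reproduced here.

```python
import math
import random


def sigmoid(z):
    """Logistic function, with the input clamped to avoid overflow in exp."""
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(samples, labels, epochs=200, learning_rate=0.1, seed=0):
    """Fit weights and bias by updating on one training example at a time."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order each epoch
        for i in order:
            x, y = samples[i], labels[i]
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss with respect to the linear score
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5 else 0
```

Because each update needs only one example, the model can be trained on a stream of data without holding the whole data set in memory.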
  33. Resources ● Mahout in Action ● Apache Mahout Cookbook ● Introduction to Apache Mahout ● http://mahout.apache.org/
  34. Q&A
