Mahout

Mahout Presentation Transcript

  • Apache Mahout: Scalable Machine Learning Library, by Anastasiia Kornilova
  • What is Machine Learning? “Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data”
  • Typical Use Cases ● Recommend products/friends … ● Classify content into predefined groups ● Computer vision ● Sentiment analysis/opinion mining ● Find patterns in users' behavior/actions ● Identify key topics/summarize text ● Detect anomalies/fraud ● Rank search results ● Speech and handwriting recognition ● Natural language processing
  • ML Algorithms (subset): ● Supervised learning – Linear regression – Logistic regression – Support Vector Machines – Random Forests ● Unsupervised learning – Clustering – Blind signal separation – Hidden Markov models ● Semi-supervised
  • Many ML libraries, frameworks and tools: ● Weka ● Python scikit-learn ● Pylearn/Pylearn2 ● Theano ● Orange ● SSBrain :) ● More can be found here: http://mloss.org/software/
  • Typical Workflow ● Get data ● Prepare data ● Choose algorithm(s) ● Run your algorithm(s) ● Validate results
  • Every ML algorithm deals with: 1. Data 2. Computation over this data
  • Scalability strategies: ● “Bigger” computer ● More cores ● GPU computing ● Parallel computing, MapReduce
  • What is Mahout? ● Scalable ML library built on Hadoop, written in Java ● Driven by the paper “MapReduce for Machine Learning on Multicore” by Ng et al. ● Started as a Lucene sub-project; became an Apache TLP in April 2010 ● 25 July 2013: Apache Mahout 0.8 released ● Taste recommender framework by Sean Owen was added in 2008
  • Who uses Mahout?
  • When do you need Mahout? (data size: task and tools; table from Varad Meru) ● Lines, sample data: analysis and visualization with whiteboard, bash, ... ● KBs – low MBs, prototype data: analysis and visualization with Octave, R, bash, ... ● MBs – low GBs, online data: storage in databases (MySQL, PostgreSQL), ...; analysis with NumPy, SciPy, BLAS, Weka; visualization with Protovis, D3, ... ● GBs – TBs – PBs, big data: storage in HDFS, HBase, Cassandra, ...; analysis with Mahout, Hive, Pig, ...
  • Advantages ● Community ● Documentation and examples ● Scalability ● Apache license ● Well tested ● Built on existing production-quality libraries
  • Requirements ● Java 1.6.x or greater ● Maven 3.x to build the source code ● Hadoop 0.20.0 or greater
  • Core themes ● Recommender engines (collaborative filtering) ● Clustering ● Classification
  • Algorithms ● User and Item based recommenders ● Matrix factorization based recommenders ● K-Means, Fuzzy K-Means clustering ● Latent Dirichlet Allocation ● Singular value decomposition ● Logistic regression based classifier ● Complementary Naive Bayes classifier ● Random forest decision tree based classifier
  • Recommender engine
  • Personalization level ● Generic / Non-Personalized: everyone receives same recommendations ● Demographic: matches a target group ● Ephemeral: matches current activity ● Persistent: matches long-term interests
  • Content based ● User Ratings x Item Attributes => Model ● Model applied to new items via attributes ● Alternative: knowledge-based (item attributes form model of item space) ● Example: personalized news feeds
  • Table of ratings
  • Ratings ● Explicit (Rating, Review, Vote, Like) ● Implicit (Click, Purchase, Follow)
  • Item-item ● For every item I ● Select the N most similar items ● Recommend these N items to users who work with item I
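The item-item steps above can be sketched in plain Java. This is a minimal single-machine illustration, not Mahout's implementation: the class name `ItemItemSketch`, the toy data, and the choice of Tanimoto similarity over sets of user ids are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration; not part of Mahout.
final class ItemItemSketch {
    // Tanimoto (Jaccard) similarity of the sets of users who interacted with two items.
    static double tanimoto(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
    }

    // Returns the n items most similar to `item`, best first (ties broken by item id).
    static List<String> mostSimilar(Map<String, Set<Integer>> usersByItem, String item, int n) {
        Set<Integer> target = usersByItem.get(item);
        List<String> others = new ArrayList<>(usersByItem.keySet());
        others.remove(item);
        others.sort((x, y) -> {
            int bySimilarity = Double.compare(
                    tanimoto(target, usersByItem.get(y)), tanimoto(target, usersByItem.get(x)));
            return bySimilarity != 0 ? bySimilarity : x.compareTo(y);
        });
        return new ArrayList<>(others.subList(0, Math.min(n, others.size())));
    }

    public static void main(String[] args) {
        // Made-up interactions: item id -> ids of users who worked with it.
        Map<String, Set<Integer>> usersByItem = new HashMap<>();
        usersByItem.put("A", new HashSet<>(Arrays.asList(1, 2, 3)));
        usersByItem.put("B", new HashSet<>(Arrays.asList(1, 2)));
        usersByItem.put("C", new HashSet<>(Arrays.asList(3)));
        usersByItem.put("D", new HashSet<>(Arrays.asList(4)));
        // Users who work with item A would be shown these items.
        System.out.println(mostSimilar(usersByItem, "A", 2));
    }
}
```

Because the similar-item list depends only on the item, it can be precomputed once and reused for every user who touches that item.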
  • User-user ● For every user ● Find the n most similar users ● Aggregate the preferences of these similar users ● Generate recommended items
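A rough plain-Java sketch of the user-user steps follows. It assumes cosine similarity over sparse rating maps and a similarity-weighted average as the aggregation rule; both are illustrative choices, and the class `UserUserSketch` and its toy ratings are hypothetical rather than taken from the slides or from Mahout.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of Mahout.
final class UserUserSketch {
    // Cosine similarity of two sparse rating vectors (a missing rating counts as 0).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Keys with the k highest scores, best first (ties broken by key).
    static List<String> topByScore(Map<String, Double> scores, int k) {
        List<String> keys = new ArrayList<>(scores.keySet());
        keys.sort((x, y) -> {
            int byScore = Double.compare(scores.get(y), scores.get(x));
            return byScore != 0 ? byScore : x.compareTo(y);
        });
        return new ArrayList<>(keys.subList(0, Math.min(k, keys.size())));
    }

    static List<String> recommend(Map<String, Map<String, Double>> ratings,
                                  String user, int n, int howMany) {
        Map<String, Double> mine = ratings.get(user);
        // Step 1: similarity of every other user to this user.
        Map<String, Double> similarity = new HashMap<>();
        for (String other : ratings.keySet()) {
            if (!other.equals(user)) similarity.put(other, cosine(mine, ratings.get(other)));
        }
        // Step 2: aggregate the n nearest neighbours' ratings for unseen items.
        Map<String, Double> weightedSum = new HashMap<>();
        Map<String, Double> weightTotal = new HashMap<>();
        for (String neighbour : topByScore(similarity, n)) {
            double sim = similarity.get(neighbour);
            if (sim <= 0) continue;
            for (Map.Entry<String, Double> e : ratings.get(neighbour).entrySet()) {
                if (mine.containsKey(e.getKey())) continue;
                weightedSum.merge(e.getKey(), sim * e.getValue(), Double::sum);
                weightTotal.merge(e.getKey(), sim, Double::sum);
            }
        }
        // Step 3: rank unseen items by estimated rating.
        Map<String, Double> estimate = new HashMap<>();
        for (String item : weightedSum.keySet()) {
            estimate.put(item, weightedSum.get(item) / weightTotal.get(item));
        }
        return topByScore(estimate, howMany);
    }

    static void rate(Map<String, Map<String, Double>> r, String user, String item, double value) {
        r.computeIfAbsent(user, k -> new HashMap<>()).put(item, value);
    }

    // Made-up ratings used only to exercise the sketch.
    static Map<String, Map<String, Double>> toyRatings() {
        Map<String, Map<String, Double>> r = new HashMap<>();
        rate(r, "alice", "A", 5); rate(r, "alice", "B", 3);
        rate(r, "bob", "A", 5); rate(r, "bob", "B", 3); rate(r, "bob", "C", 4); rate(r, "bob", "D", 1);
        rate(r, "carol", "A", 1); rate(r, "carol", "B", 5); rate(r, "carol", "D", 5);
        rate(r, "dave", "E", 5);
        return r;
    }

    public static void main(String[] args) {
        System.out.println(recommend(toyRatings(), "alice", 2, 2));
    }
}
```

Unlike the item-item approach, the neighbourhood here has to be recomputed per user, which is one reason the neighbourhood size n is exposed as a tuning parameter.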
  • Similarity metrics ● Pearson Correlation ● Tanimoto ● Cosine similarity ● Euclidean distance
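For reference, the four metrics listed above can be written out over dense vectors as below. This is a textbook-style sketch with a hypothetical class name (`SimilaritySketch`); library implementations typically work on sparse data and may differ in details such as how missing ratings are handled.

```java
// Hypothetical helper class for illustration; not part of Mahout.
final class SimilaritySketch {
    // Pearson correlation in [-1, 1]; NaN if either vector is constant.
    static double pearson(double[] x, double[] y) {
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= x.length;
        meanY /= y.length;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    // Tanimoto coefficient: |both| / |either| over binary preference vectors.
    static double tanimoto(boolean[] x, boolean[] y) {
        int both = 0, either = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] && y[i]) both++;
            if (x[i] || y[i]) either++;
        }
        return either == 0 ? 0.0 : (double) both / either;
    }

    // Cosine of the angle between the two vectors; 0 if either is all zeros.
    static double cosine(double[] x, double[] y) {
        double dot = 0, normX = 0, normY = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return (normX == 0 || normY == 0) ? 0.0 : dot / Math.sqrt(normX * normY);
    }

    // Euclidean distance: smaller means more similar.
    static double euclideanDistance(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] u = {5, 3, 4};
        double[] v = {4, 2, 5};
        System.out.println("pearson   = " + pearson(u, v));
        System.out.println("cosine    = " + cosine(u, v));
        System.out.println("euclidean = " + euclideanDistance(u, v));
    }
}
```

Euclidean distance is a distance rather than a similarity; one common way to turn it into a similarity is `1 / (1 + d)`.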
  • Sparse matrix
  • Parameters ● DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel ● UserSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity ● ItemSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity ● UserNeighborhood – Nearest N-User Neighborhood, Threshold User Neighborhood
  • Code example
  • Evaluation ● Average absolute difference ● RMSE ● Precision and recall ● Precision is the proportion of top results that are relevant, for some definition of relevant ● Recall is the proportion of all relevant results included in the top results
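The evaluation measures above reduce to a few lines of arithmetic. The sketch below uses a hypothetical class `EvaluationSketch` and made-up numbers; it shows the definitions only and does not reproduce how any particular library samples its test data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class for illustration; not part of Mahout.
final class EvaluationSketch {
    // Average absolute difference between predicted and actual ratings.
    static double meanAbsoluteError(double[] predicted, double[] actual) {
        double sum = 0;
        for (int i = 0; i < predicted.length; i++) sum += Math.abs(predicted[i] - actual[i]);
        return sum / predicted.length;
    }

    // Root mean squared error; penalizes large misses more than the average absolute difference.
    static double rmse(double[] predicted, double[] actual) {
        double sum = 0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / predicted.length);
    }

    static int hits(List<String> topResults, Set<String> relevant) {
        int count = 0;
        for (String item : topResults) if (relevant.contains(item)) count++;
        return count;
    }

    // Proportion of the top results that are relevant.
    static double precision(List<String> topResults, Set<String> relevant) {
        return topResults.isEmpty() ? 0.0 : (double) hits(topResults, relevant) / topResults.size();
    }

    // Proportion of all relevant items that appear in the top results.
    static double recall(List<String> topResults, Set<String> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) hits(topResults, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        double[] predicted = {3, 4, 5};
        double[] actual = {4, 4, 3};
        System.out.println("MAE  = " + meanAbsoluteError(predicted, actual));
        System.out.println("RMSE = " + rmse(predicted, actual));

        List<String> top = Arrays.asList("A", "B", "C", "D");
        Set<String> relevant = new HashSet<>(Arrays.asList("A", "C", "E"));
        System.out.println("precision = " + precision(top, relevant));
        System.out.println("recall    = " + recall(top, relevant));
    }
}
```

The first two measures need explicit ratings to compare against, while precision and recall only need a notion of which items count as relevant, so they also apply to implicit feedback.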
  • Clustering
  • Mahout Clustering Algorithms ● K-Means - runs on Hadoop ● Fuzzy K-Means - runs on Hadoop ● Latent Dirichlet Allocation - runs on Hadoop ● Canopy clustering - runs on Hadoop ● Minhash clustering - runs on Hadoop ● kMeans++ streaming clustering - documentation missing
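To make the first algorithm in the list concrete, here is a single-machine sketch of the K-Means loop (Lloyd's algorithm). It is not Mahout's Hadoop job: the class `KMeansSketch`, the caller-supplied starting centroids, and the toy points are assumptions for illustration.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of Mahout.
final class KMeansSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return sum;
    }

    // Index of the centroid closest to the point (lowest index wins ties).
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) best = c;
        }
        return best;
    }

    // Returns the cluster index of every point; `centroids` is updated in place.
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int c = nearest(points[i], centroids);
                if (c != assignment[i]) { assignment[i] = c; changed = true; }
            }
            if (!changed) break; // converged
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[centroids.length][centroids[0].length];
            int[] counts = new int[centroids.length];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < points[i].length; d++) sums[assignment[i]][d] += points[i][d];
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) continue; // leave an empty cluster's centroid where it is
                for (int d = 0; d < centroids[c].length; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {9, 9}, {10, 9}};
        double[][] centroids = {{0, 0}, {10, 10}}; // made-up starting centroids
        System.out.println(Arrays.toString(cluster(points, centroids, 10)));
        System.out.println(Arrays.deepToString(centroids));
    }
}
```

The two steps split naturally for MapReduce: the assignment step is independent per point, and the update step is a per-cluster sum and count.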
  • Classification
  • Mahout Classification Algorithms ● Logistic regression (SGD) - model parameter selection can be done in Hadoop ● Naive Bayes - training runs on Hadoop ● Random Forests - training is done in Hadoop ● Hidden Markov Models - training is done in MapReduce
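As an illustration of the first entry, the sketch below trains a logistic regression classifier with plain stochastic gradient descent. It is a bare-bones version under stated assumptions (fixed learning rate, no regularization, a hypothetical class `LogisticSgdSketch`, made-up one-feature data) and is not Mahout's SGD implementation.

```java
// Hypothetical helper class for illustration; not part of Mahout.
final class LogisticSgdSketch {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Probability that x belongs to class 1; weights[0] is the bias term.
    static double predict(double[] weights, double[] x) {
        double z = weights[0];
        for (int j = 0; j < x.length; j++) z += weights[j + 1] * x[j];
        return sigmoid(z);
    }

    // One weight update per training example, repeated for the given number of epochs.
    static double[] train(double[][] xs, int[] ys, double learningRate, int epochs) {
        double[] weights = new double[xs[0].length + 1];
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < xs.length; i++) {
                double error = ys[i] - predict(weights, xs[i]);
                weights[0] += learningRate * error;
                for (int j = 0; j < xs[i].length; j++) {
                    weights[j + 1] += learningRate * error * xs[i][j];
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up data: small feature values are class 0, large ones class 1.
        double[][] xs = {{0}, {1}, {2}, {3}};
        int[] ys = {0, 0, 1, 1};
        double[] weights = train(xs, ys, 0.1, 1000);
        System.out.println("P(class 1 | x=0) = " + predict(weights, new double[]{0}));
        System.out.println("P(class 1 | x=3) = " + predict(weights, new double[]{3}));
    }
}
```

Since each update touches a single example, the training loop itself is sequential and cheap in memory, which fits the slide's note that it is the model parameter selection that can be done in Hadoop.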
  • Resources ● Mahout in Action ● Apache Mahout Cookbook ● Introduction to Apache Mahout ● http://mahout.apache.org/
  • Q&A