Mahout
 




Usage Rights

© All Rights Reserved

    Mahout Presentation Transcript

    • Apache Mahout: Scalable Machine Learning Library Anastasiia Kornilova
    • What is Machine Learning? “Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.”
    • Typical Use Cases ● Recommend products/friends … ● Classify content into predefined groups ● Computer vision ● Sentiment analysis/opinion mining ● Find patterns in user behavior/actions ● Identify key topics/summarize text ● Detect anomalies/fraud ● Rank search results ● Speech and handwriting recognition ● Natural language processing
    • ML Algorithms (subset): ● Supervised learning – Linear regression – Logistic regression – Support Vector Machines – Random Forests ● Unsupervised learning – Clustering – Blind signal separation – Hidden Markov models ● Semi-supervised learning
    • Many ML libraries, frameworks and tools: ● Weka ● scikit-learn (Python) ● Pylearn/Pylearn2 ● Theano ● Orange ● SSBrain :) ● More can be found here: http://mloss.org/software/
    • Typical Workflow ● Get data ● Prepare data ● Choose algorithm(s) ● Run your algorithm(s) ● Validate results
    • Every ML algorithm deals with: 1. Data 2. Computation over this data
    • Scalability strategies: ● “Bigger” computer ● More cores ● GPU computing ● Parallel computing, MapReduce
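The MapReduce strategy can be illustrated without a cluster. The following is a minimal sketch in plain Python, assuming the data is already split into chunks; the helper names `map_chunk`, `reduce_partials` and `distributed_mean` are hypothetical and are not part of Hadoop or Mahout. Each chunk is summarized independently (map), and the small partial results are then merged (reduce).

```python
from functools import reduce

def map_chunk(chunk):
    """Map step: summarize one chunk of numbers as (sum, count)."""
    return (sum(chunk), len(chunk))

def reduce_partials(a, b):
    """Reduce step: merge two partial (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])

def distributed_mean(chunks):
    # Each map_chunk call is independent, so in a real system the calls
    # could run on different machines; here they run sequentially.
    partials = map(map_chunk, chunks)
    total, count = reduce(reduce_partials, partials, (0.0, 0))
    return total / count

chunks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
print(distributed_mean(chunks))  # 3.5
```

Only the (sum, count) pairs travel between the two steps, which is why this pattern scales to data that does not fit on one machine.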
    • What is Mahout? ● Scalable ML library built on Hadoop, written in Java ● Driven by Ng et al.'s paper “MapReduce for Machine Learning on Multicore” ● Started as a Lucene sub-project; became an Apache top-level project (TLP) in April 2010 ● 25 July 2013: Apache Mahout 0.8 released ● The Taste recommender framework by Sean Owen was added in 2008
    • Who uses Mahout?
    • When do you need Mahout? (table from Varad Meru)
      Data size | Task | Tools
      Lines, sample data | Analysis and visualization | Whiteboard, bash, ...
      KBs – low MBs, prototype data | Analysis and visualization | Octave, R, bash, ...
      MBs – low GBs, online data | Storage | Databases (MySQL, PostgreSQL), ...
      MBs – low GBs, online data | Analysis | NumPy, SciPy, BLAS, Weka
      MBs – low GBs, online data | Visualization | Protovis, D3, ...
      GBs – TBs – PBs, big data | Storage | HDFS, HBase, Cassandra, ...
      GBs – TBs – PBs, big data | Analysis | Mahout, Hive, Pig, ...
    • Advantages ● Community ● Documentation and examples ● Scalability ● Apache license ● Well tested ● Built over existing production quality libraries
    • Requirements ● Java 1.6.x or greater ● Maven 3.x to build the source code ● Hadoop 0.20.0 or greater
    • Core themes ● Recommender engines (collaborative filtering) ● Clustering ● Classification
    • Algorithms ● User and Item based recommenders ● Matrix factorization based recommenders ● K-Means, Fuzzy K-Means clustering ● Latent Dirichlet Allocation ● Singular value decomposition ● Logistic regression based classifier ● Complementary Naive Bayes classifier ● Random forest decision tree based classifier
    • Recommender engine
    • Personalization level ● Generic / Non-Personalized: everyone receives same recommendations ● Demographic: matches a target group ● Ephemeral: matches current activity ● Persistent: matches long-term interests
    • Content based ● User Ratings x Item Attributes => Model ● Model applied to new items via attributes ● Alternative: knowledge-based (item attributes form model of item space) ● Example: personalized news feeds
    • Table of ratings
    • Ratings ● Explicit (Rating, Review, Vote, Like) ● Implicit (Click, Purchase, Follow)
    • Item-item ● For every item I ● Select the N most similar items ● Recommend these N items to users who work with item I
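The item-based steps can be sketched in plain Python. This is an illustration under stated assumptions, not Mahout's implementation: the `purchases` data is invented, the helper names are hypothetical, and item similarity is taken to be the Tanimoto coefficient between the sets of users who interacted with each item.

```python
from collections import defaultdict

# Hypothetical toy data: user -> set of items the user interacted with.
purchases = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C"},
    "u4": {"A", "D"},
}

def item_users(purchases):
    """Invert the data: item -> set of users who interacted with it."""
    index = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            index[item].add(user)
    return dict(index)

def similar_items(item, index, n=2):
    """Return up to n items most similar to `item` (Tanimoto on user sets)."""
    mine = index[item]
    scored = []
    for other, users in index.items():
        if other == item:
            continue
        score = len(mine & users) / len(mine | users)
        scored.append((score, other))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scored[:n] if score > 0]

def recommend(user, purchases, n=2):
    """Recommend items similar to the user's items, excluding owned ones."""
    index = item_users(purchases)
    owned = purchases[user]
    candidates = set()
    for item in owned:
        candidates.update(similar_items(item, index, n))
    return sorted(candidates - owned)

print(recommend("u2", purchases))
```

Item similarities depend only on the item columns, so they can be precomputed once and reused for every user.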
    • User-user ● For every user ● Find the n most similar users ● Aggregate their preferences for this user ● Generate recommended items
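A minimal sketch of the user-based steps in plain Python follows. It is not Mahout code: the ratings are invented, the function names are hypothetical, similarity is assumed to be cosine similarity over co-rated items, and neighbor preferences are aggregated as a similarity-weighted average.

```python
import math

# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 3.0, "C": 5.0, "D": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}

def cosine_on_shared(u, v):
    """Cosine similarity restricted to items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, n_neighbors=2, top_k=3):
    target = ratings[user]
    # Step 1: find the n most similar users.
    neighbors = sorted(
        ((cosine_on_shared(target, other_ratings), other)
         for other, other_ratings in ratings.items() if other != user),
        reverse=True,
    )[:n_neighbors]
    # Step 2: aggregate neighbor ratings for items the user has not rated.
    weighted, weights = {}, {}
    for sim, other in neighbors:
        for item, rating in ratings[other].items():
            if item in target:
                continue
            weighted[item] = weighted.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {i: weighted[i] / weights[i] for i in weighted if weights[i] > 0}
    # Step 3: return the highest predicted ratings.
    return sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

print(recommend("alice", ratings))
```

In this toy data the only item alice has not rated is "D", and its predicted rating lands closer to bob's rating than to carol's because bob is the more similar neighbor.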
    • Similarity metrics ● Pearson Correlation ● Tanimoto ● Cosine similarity ● Euclidean distance
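The four metrics have short textbook definitions. The sketch below writes them out in plain Python for equal-length rating vectors (and, for Tanimoto, for sets of items); the function names are hypothetical and this is not how Mahout implements them.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors, in [-1, 1].
    Undefined (division by zero) if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def tanimoto(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: |intersection| / |union| of two sets."""
    a, b = set(items_a), set(items_b)
    return len(a & b) / len(a | b)

print(euclidean_distance([0, 0], [3, 4]))    # 5.0
print(tanimoto({1, 2, 3}, {2, 3, 4}))        # 0.5
```

Tanimoto ignores rating values and only looks at which items overlap, which makes it a natural fit for implicit data such as clicks or purchases. Euclidean distance is a distance rather than a similarity, so it is usually converted before use, for example with 1 / (1 + d).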
    • Sparse matrix
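A user-item rating matrix is mostly empty, because each user rates only a small fraction of the items. One common way to store it is to keep only the observed entries. The sketch below uses nested dictionaries; the `SparseRatings` class is hypothetical and only illustrates the idea.

```python
class SparseRatings:
    """User x item matrix that stores only the ratings that exist."""

    def __init__(self):
        self._rows = {}  # user -> {item: rating}

    def set(self, user, item, rating):
        self._rows.setdefault(user, {})[item] = rating

    def get(self, user, item, default=None):
        # Missing entries mean "not rated", not "rated zero".
        return self._rows.get(user, {}).get(item, default)

    def stored(self):
        """Number of ratings actually held in memory."""
        return sum(len(row) for row in self._rows.values())

    def density(self, n_users, n_items):
        """Fraction of the full n_users x n_items matrix that is filled."""
        return self.stored() / (n_users * n_items)

matrix = SparseRatings()
matrix.set("u1", "A", 5.0)
matrix.set("u1", "B", 3.0)
matrix.set("u2", "A", 4.0)
print(matrix.stored())               # 3
print(matrix.get("u2", "B"))         # None
```

With 1,000 users and 1,000 items a dense array would hold a million cells, while this structure holds only the three ratings that were set.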
    • Parameters ● DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel ● UserSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity ● ItemSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity ● UserNeighborhood – Nearest N-User Neighborhood, Threshold User Neighborhood
    • Code example
    • Evaluation ● Average absolute difference ● RMSE ● Precision and recall ● Precision is the proportion of top results that are relevant, for some definition of relevant. ● Recall is the proportion of all relevant results included in the top results.
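These evaluation measures are simple to compute by hand. The sketch below is plain Python with hypothetical function names, assuming predicted and actual ratings are given as equal-length lists and that "relevant" is supplied as a set of item IDs.

```python
import math

def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall over the top-k entries of a ranked list."""
    top = recommended[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(mean_absolute_error([3, 4, 5], [4, 4, 3]))                      # 1.0
print(precision_recall_at_k(["A", "B", "C", "D"], {"A", "C", "E"}, 2))
```

In the last call the top two results are "A" and "B", of which only "A" is relevant, so precision is 1/2 and recall is 1/3.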
    • Clustering
    • Mahout Clustering Algorithms ● K-Means - runs on Hadoop ● Fuzzy K-Means - runs on Hadoop ● Latent Dirichlet Allocation - runs on Hadoop ● Canopy clustering - runs on Hadoop ● Minhash clustering - runs on Hadoop ● kMeans++ streaming clustering - documentation missing
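To show what K-Means computes, here is a single-machine sketch of the standard assign-and-update loop in plain Python. It is not Mahout's Hadoop implementation; the function names, the toy points and the hand-picked starting centroids are all assumptions made for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid_of(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, centroids, iterations=10):
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points.
        # An empty cluster keeps its previous centroid.
        centroids = [
            centroid_of(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The assignment step is independent per point and the update step is a sum per cluster, which is the shape of computation that maps well onto MapReduce.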
    • Classification
    • Mahout Classification Algorithms ● Logistic regression (SGD) - model parameter selection can be done in Hadoop ● Naive Bayes - training runs on Hadoop ● Random Forests - training is done in Hadoop ● Hidden Markov Models - training is done in Map-Reduce
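Logistic regression trained with stochastic gradient descent (SGD) updates the model after every single example, so it never needs the whole data set in memory. The sketch below is a plain-Python illustration with hypothetical names and invented one-feature data; it does not use or mirror Mahout's classes.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=200, learning_rate=0.1, seed=0):
    """Fit weights and bias with one gradient step per training example."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in random order each epoch
        for i in order:
            x, y = samples[i], labels[i]
            p = sigmoid(bias + sum(w * xj for w, xj in zip(weights, x)))
            error = p - y  # gradient of the log loss w.r.t. the linear score
            weights = [w - learning_rate * error * xj
                       for w, xj in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

def predict(weights, bias, x):
    score = bias + sum(w * xj for w, xj in zip(weights, x))
    return 1 if sigmoid(score) >= 0.5 else 0

# Invented data: negative feature values are class 0, positive are class 1.
samples = [[-2.0], [-1.0], [1.0], [2.0]]
labels = [0, 0, 1, 1]
weights, bias = train(samples, labels)
print([predict(weights, bias, x) for x in samples])
```

The learning rate and number of epochs here are arbitrary choices for the toy data; on real data they are the kind of model parameters that have to be selected by validation.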
    • Resources ● Mahout in Action ● Apache Mahout Cookbook ● Introduction to Apache Mahout ● http://mahout.apache.org/
    • Q&A