Mahout

Apache Mahout:
Scalable Machine Learning Library

Anastasiia Kornilova

What is Machine Learning?

“Machine learning, a branch of artificial
intelligence, concerns the construction
and study of systems that can learn from
data.”

Typical Use Cases
● Recommend products/friends …
● Classify content into predefined groups
● Computer vision
● Sentiment analysis/opinion mining
● Find patterns in user behavior/actions
● Identify key topics/summarize text
● Detect anomalies/fraud
● Ranking search results
● Speech and handwriting recognition
● Natural language processing

ML Algorithms (subset):
● Supervised learning
– Linear regression
– Logistic regression
– Support Vector Machines
– Random Forests
● Unsupervised learning
– Clustering
– Blind signal separation
– Hidden Markov models
● Semi-supervised

Many ML libraries, frameworks and tools:
● Weka
● Scikit-learn (Python)
● Pylearn/Pylearn2
● Theano
● Orange
● SSBrain :)
● More can be found here: http://mloss.org/software/

Typical Workflow
● Get data
● Prepare data
● Choose algorithm(s)
● Run your algorithm(s)
● Validate results

Every ML algorithm deals with:
1. Data
2. Computation over this data

Scalability strategies:
● “Bigger” computer
● More cores
● GPU computing
● Parallel computing, MapReduce
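
To make the MapReduce bullet more concrete, here is a minimal single-process sketch of the map / shuffle / reduce pattern, computing the mean rating per item. It does not use Hadoop; the function names and the sample records are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (item, rating) pair per (user, item, rating) record."""
    for _user, item, rating in records:
        yield item, rating


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the values of each key into one result (here, the mean)."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Hypothetical sample data: (user, item, rating)
records = [("u1", "a", 4.0), ("u2", "a", 2.0), ("u1", "b", 5.0)]
mean_ratings = reduce_phase(shuffle(map_phase(records)))
```

In a real cluster the map and reduce calls would run on different machines, each over its own slice of the data; the sketch only shows how the computation is split.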

What is Mahout?
● Scalable ML library built on Hadoop, written in Java
● Driven by Ng et al.'s paper “MapReduce for Machine Learning on Multicore”
● Started as a Lucene sub-project. Became an Apache TLP in April 2010
● 25 July 2013 - Apache Mahout 0.8 released
● Taste recommender framework by Sean Owen was added in 2008

When do you need Mahout?

Data Size                     | Task                       | Tools
------------------------------+----------------------------+------------------------------------
Lines, Sample Data            | Analysis and visualization | Whiteboard, bash, ...
KBs – low MBs, Prototype Data | Analysis and visualization | Octave, R, bash, ...
MBs – low GBs, Online Data    | Storage                    | Databases (MySQL, PostgreSQL), ...
                              | Analysis                   | NumPy, SciPy, BLAS, Weka
                              | Visualization              | Protovis, D3, ...
GBs – TBs – PBs, Big Data     | Storage                    | HDFS, HBase, Cassandra, ...
                              | Analysis                   | Mahout, Hive, Pig, ...

Table from Varad Meru

Advantages
● Community
● Documentation and examples
● Scalability
● Apache license
● Well tested
● Built on existing production-quality libraries

Requirements
● Java 1.6.x or greater
● Maven 3.x to build the source code
● Hadoop 0.20.0 or greater

Core themes
● Recommender engines (collaborative filtering)
● Clustering
● Classification

Algorithms
● User and Item based recommenders
● Matrix factorization based recommenders
● K-Means, Fuzzy K-Means clustering
● Latent Dirichlet Allocation
● Singular value decomposition
● Logistic regression based classifier
● Complementary Naive Bayes classifier
● Random forest decision tree based classifier

Personalization level
● Generic / Non-Personalized: everyone receives same recommendations
● Demographic: matches a target group
● Ephemeral: matches current activity
● Persistent: matches long-term interests

Content based
● User Ratings x Item Attributes => Model
● Model applied to new items via attributes
● Alternative: knowledge-based (Item attributes form model of item space)
● Example: Personalized news feeds
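
The "User Ratings x Item Attributes => Model" idea above can be sketched as follows. This is one simple way to do it, assuming the model is a rating-weighted sum of attribute values; the function names, articles, and attributes are hypothetical and are not part of Mahout.

```python
def build_profile(ratings, item_attributes):
    """Combine a user's ratings with item attributes into per-attribute weights."""
    profile = {}
    for item, rating in ratings.items():
        for attribute, value in item_attributes[item].items():
            profile[attribute] = profile.get(attribute, 0.0) + rating * value
    return profile


def score(profile, attributes):
    """Apply the model to an item, possibly unseen, using only its attributes."""
    return sum(profile.get(attribute, 0.0) * value
               for attribute, value in attributes.items())


# Hypothetical news articles described by topic attributes
item_attributes = {
    "article1": {"sports": 1.0},
    "article2": {"politics": 1.0},
    "article3": {"sports": 1.0, "local": 1.0},
}
ratings = {"article1": 5.0, "article2": 1.0}  # the user never rated article3

profile = build_profile(ratings, item_attributes)
new_item_score = score(profile, item_attributes["article3"])
```

Because scoring needs only attributes, a brand-new article can be ranked before anyone has rated it, which is why this approach suits news feeds.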

Ratings
● Explicit (Rating, Review, Vote, Like)
● Implicit (Click, Purchase, Follow)

Item-Item
● For every item I
● Select the N most similar items
● Recommend these N items to users who interact with item I
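
The item-item steps above can be sketched in plain Python. This is a minimal illustration, assuming implicit data (which users touched which item) and Tanimoto similarity; the function names and sample data are hypothetical, and this is not Mahout's implementation.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of users."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_items(item, item_users, n):
    """Return up to n items most similar to `item`, skipping zero-similarity ones."""
    scored = [(tanimoto(item_users[item], users), other)
              for other, users in item_users.items() if other != item]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for similarity, other in scored[:n] if similarity > 0]


def recommend(user, item_users, n):
    """For each item the user has, suggest its similar items the user lacks."""
    owned = {item for item, users in item_users.items() if user in users}
    recommended = []
    for item in sorted(owned):
        for other in similar_items(item, item_users, n):
            if other not in owned and other not in recommended:
                recommended.append(other)
    return recommended


# Hypothetical data: item -> set of users who interacted with it
item_users = {
    "book": {"ann", "bob", "cat"},
    "lamp": {"ann", "bob"},
    "desk": {"cat"},
    "sofa": {"dan"},
}
```

For example, `recommend("cat", item_users, 2)` suggests "lamp", because "lamp" shares users with "book", which cat already has.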

User-User
● For every user
● Find the n most similar users
● Aggregate the preferences of these similar users
● Generate recommended items
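
The user-user steps above can be sketched as follows. This is a minimal illustration, assuming explicit ratings, a similarity of 1 / (1 + Euclidean distance) over co-rated items, and a similarity-weighted average as the aggregation; all names and data are hypothetical, and Mahout's own classes may differ.

```python
import math


def similarity(a, b):
    """Similarity in (0, 1] from Euclidean distance over co-rated items; 0 if none."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    distance = math.sqrt(sum((a[i] - b[i]) ** 2 for i in shared))
    return 1.0 / (1.0 + distance)


def recommend_for_user(user, ratings, n, how_many):
    """Find the n nearest users, aggregate their ratings, return top unseen items."""
    neighbors = sorted(
        ((similarity(ratings[user], prefs), other)
         for other, prefs in ratings.items() if other != user),
        key=lambda pair: (-pair[0], pair[1]),
    )[:n]
    totals, weights = {}, {}
    for weight, other in neighbors:
        if weight == 0.0:
            continue  # a neighbor with no overlap carries no information
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + weight * rating
                weights[item] = weights.get(item, 0.0) + weight
    estimates = {item: totals[item] / weights[item] for item in totals}
    ranked = sorted(estimates, key=lambda item: (-estimates[item], item))
    return ranked[:how_many]


# Hypothetical data: user -> {item: rating}
ratings = {
    "ann": {"a": 5.0, "b": 3.0},
    "bob": {"a": 5.0, "b": 3.0, "c": 4.0, "d": 1.0},
    "cat": {"a": 1.0, "b": 5.0, "d": 5.0},
}
```

Here bob rates like ann, so his opinions dominate the estimates for the items ann has not seen.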

Similarity metrics
● Pearson Correlation
● Tanimoto
● Cosine similarity
● Euclidean distance
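
For reference, the four metrics above can be written with the textbook formulas as below. This is a sketch using only the Python standard library; the function names are illustrative, and Mahout's implementations may normalize or weight differently.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denominator if denominator else 0.0


def tanimoto(a, b):
    """Tanimoto coefficient of two sets: size of intersection over size of union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(x, y):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / norm if norm else 0.0


def euclidean_distance(x, y):
    """Straight-line distance between two vectors (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Note that Euclidean distance is a distance, not a similarity; one common convention, assumed here and not taken from the slides, is to convert it with 1 / (1 + d).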

Parameters
● DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel
● UserSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity
● ItemSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity
● UserNeighborhood – Nearest N-User Neighborhood, Threshold User Neighborhood

Evaluation
● Average absolute difference
● RMSE
● Precision and recall
● Precision is the proportion of top results that are relevant, for some definition of relevant.
● Recall is the proportion of all relevant results included in the top results.
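
The evaluation measures above can be computed as in this sketch. It assumes predicted and actual ratings are given as parallel lists, and that "relevant" is supplied as a set of item ids; the function names are illustrative and are not Mahout's evaluator API.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; penalizes large misses more than small ones."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def precision_recall(top_results, relevant):
    """Precision: share of top results that are relevant.
    Recall: share of all relevant items that appear in the top results."""
    hits = len(set(top_results) & set(relevant))
    precision = hits / len(top_results) if top_results else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, with top results a, b, c, d and relevant items a, c, e, two of the four results are relevant (precision 0.5) and two of the three relevant items were found (recall 2/3).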

Mahout Clustering Algorithms
● K-Means - runs on Hadoop
● Fuzzy K-means - runs on Hadoop
● Latent Dirichlet Allocation - runs on Hadoop
● Canopy clustering - runs on Hadoop
● Minhash clustering - runs on Hadoop
● kMeans++ streaming clustering - documentation missing
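
As a reminder of what K-Means, the first algorithm in the list above, computes, here is a minimal in-memory sketch of the standard iteration (often called Lloyd's algorithm). It is not Mahout's Hadoop implementation; the fixed starting centroids, the iteration count, and the sample points are assumptions chosen to keep the example deterministic.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                         for centroid in centroids]
            clusters[distances.index(min(distances))].append(point)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The two steps fit the map/reduce split naturally: assignment can run independently per point, and the averaging is a per-cluster aggregation.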

Mahout Classification Algorithms
● Logistic regression (SGD) - model parameter selection can be done in Hadoop
● Naive Bayes - training runs on Hadoop
● Random Forests - training is done in Hadoop
● Hidden Markov Models - training is done in Map-Reduce

Resources
● Mahout in Action
● Apache Mahout Cookbook
● Introduction to Apache Mahout
● http://mahout.apache.org/
