Apache Mahout:
Scalable Machine Learning Library

Anastasiia Kornilova
What is Machine Learning?

“Machine learning, a branch of artificial
intelligence, concerns the construction
and study of systems that can learn from
data”
Typical Use Cases
● Recommend products/friends …
● Classify content into predefined groups
● Computer vision
● Sentiment analysis/opinion mining
● Find patterns in user behavior/actions
● Identify key topics/summarize text
● Detect anomalies/fraud
● Rank search results
● Speech and handwriting recognition
● Natural language processing
ML Algorithms (subset):
● Supervised learning
  – Linear regression
  – Logistic regression
  – Support Vector Machines
  – Random Forests
● Unsupervised learning
  – Clustering
  – Blind signal separation
  – Hidden Markov models
● Semi-supervised
Many ML libraries, frameworks and tools:
● Weka
● scikit-learn (Python)
● Pylearn/Pylearn2
● Theano
● Orange
● SSBrain :)
● More can be found here: http://mloss.org/software/
Typical Workflow
● Get data
● Prepare data
● Choose algorithm(s)
● Run your algorithm(s)
● Validate results
Every ML algorithm deals with:
1. Data
2. Computation over this data
Scalability strategies:
● “Bigger” computer
● More cores
● GPU computing
● Parallel computing, MapReduce
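
To make the MapReduce strategy concrete, here is a minimal single-process Python sketch of its three steps (map, shuffle, reduce). It is not Hadoop or Mahout code; the function names and the toy rating records are assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Emit (key, value) pairs for one record: here (item, (rating, 1))."""
    _user, item, rating = record
    yield item, (rating, 1)


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all (sum, count) values of one key into (key, mean rating)."""
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)
    return key, total / count


def mean_rating_per_item(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical (user, item, rating) records.
records = [("u1", "A", 4.0), ("u2", "A", 2.0), ("u1", "B", 5.0)]
means = mean_rating_per_item(records)
```

Because each map call sees one record and each reduce call sees one key, both phases can in principle be spread over many machines, which is the property Hadoop exploits.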
What is Mahout?
● Scalable ML library built on Hadoop, written in Java
● Driven by Ng et al.'s paper “Map-Reduce for Machine Learning on Multicore”
● Started as a Lucene sub-project. Became an Apache top-level project (TLP) in April 2010
● 25 July 2013 - Apache Mahout 0.8 released
● Taste recommender framework by Sean Owen was added in 2008
Who uses Mahout?
When do you need Mahout?
Data Size                        Task                         Tools
Lines, Sample Data               Analysis and visualization   Whiteboard, bash, ...
KBs – low MBs, Prototype Data    Analysis and visualization   Octave, R, bash, ...
MBs – low GBs, Online Data       Storage                      Databases (MySQL, PostgreSQL), ...
                                 Analysis                     NumPy, SciPy, BLAS, Weka
                                 Visualization                Protovis, D3, ...
GBs – TBs – PBs, Big Data        Storage                      HDFS, HBase, Cassandra, ...
                                 Analysis                     Mahout, Hive, Pig, ...

Table from Varad Meru
Advantages
● Community
● Documentation and examples
● Scalability
● Apache license
● Well tested
● Built on existing production-quality libraries
Requirements
● Java 1.6.x or greater
● Maven 3.x to build the source code
● Hadoop 0.20.0 or greater
Core themes
● Recommender engines (collaborative filtering)
● Clustering
● Classification
Algorithms
● User and Item based recommenders
● Matrix factorization based recommenders
● K-Means, Fuzzy K-Means clustering
● Latent Dirichlet Allocation
● Singular value decomposition
● Logistic regression based classifier
● Complementary Naive Bayes classifier
● Random forest decision tree based classifier
Recommender engine
Personalization level
● Generic / Non-Personalized: everyone receives the same recommendations
● Demographic: matches a target group
● Ephemeral: matches current activity
● Persistent: matches long-term interests
Content based
● User Ratings x Item Attributes => Model
● Model applied to new items via attributes
● Alternative: knowledge-based (item attributes form a model of the item space)
● Example: personalized news feeds
Table of ratings
Ratings
● Explicit (Rating, Review, Vote, Like)
● Implicit (Click, Purchase, Follow)
Item-item
● For every item I
● Select the N most similar items
● Recommend these N items to users who interacted with item I
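
The item-item steps above can be sketched in plain Python. This is an illustration under assumptions, not Mahout's implementation: similarity is Tanimoto overlap between the sets of users who touched each item, and the `purchases` data and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical implicit feedback: user -> set of items they interacted with.
purchases = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"A", "C"},
    "u4": {"B", "D"},
}


def users_by_item(purchases):
    """Invert the data: item -> set of users who interacted with it."""
    index = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            index[item].add(user)
    return index


def similar_items(item, index, n=2):
    """Return up to n items most similar to `item` (Tanimoto on user sets)."""
    def tanimoto(a, b):
        return len(a & b) / len(a | b)

    scored = [(tanimoto(index[item], users), other)
              for other, users in index.items() if other != item]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by name
    return [other for score, other in scored[:n] if score > 0]


def recommend_for_item(item, purchases, n=2):
    """For each user of `item`, suggest its similar items they do not have yet."""
    index = users_by_item(purchases)
    neighbors = similar_items(item, index, n)
    return {user: [i for i in neighbors if i not in purchases[user]]
            for user in index[item]}


recs = recommend_for_item("A", purchases)
```

Item similarities change more slowly than user tastes, so they can usually be precomputed offline and reused across requests.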
User-user
● For every user
● Find the n most similar users
● Aggregate their preferences for this user
● Generate recommended items
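
A minimal Python sketch of the user-user steps above. It assumes cosine similarity over co-rated items and a similarity-weighted average as the aggregation; both choices, the toy `ratings`, and the function names are illustrative assumptions rather than what Mahout does internally.

```python
import math
from collections import defaultdict

# Hypothetical explicit ratings: user -> {item: rating}.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 5.0, "D": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def cosine_on_shared(u, v):
    """Cosine similarity restricted to the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, n_neighbors=2, top_k=3):
    target = ratings[user]
    # Step 1: find the n most similar users.
    neighbors = sorted(
        ((cosine_on_shared(target, other), name)
         for name, other in ratings.items() if name != user),
        reverse=True,
    )[:n_neighbors]
    # Step 2: aggregate neighbor ratings for items the user has not rated.
    scores, weights = defaultdict(float), defaultdict(float)
    for sim, name in neighbors:
        if sim <= 0:
            continue
        for item, rating in ratings[name].items():
            if item not in target:
                scores[item] += sim * rating
                weights[item] += sim
    # Step 3: rank items by predicted rating.
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


recs = recommend("alice", ratings)
```

Here the only item alice has not rated is D, and its predicted rating leans toward bob's score because bob is the closer neighbor.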
Similarity metrics
● Pearson correlation
● Tanimoto
● Cosine similarity
● Euclidean distance
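
The four metrics can be written down in a few lines of Python. This is a textbook sketch, not Mahout's code; in particular, turning Euclidean distance d into a similarity as 1 / (1 + d) is one common convention assumed here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def cosine(x, y):
    """Cosine of the angle between two rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def euclidean_similarity(x, y):
    """Map Euclidean distance d to a similarity 1 / (1 + d) in (0, 1]."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + d)


def tanimoto(items_a, items_b):
    """Tanimoto (Jaccard) coefficient of two item sets; ignores rating values."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Tanimoto suits implicit data (click, purchase) because it needs only set membership, while the other three use the rating values themselves.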
Sparse matrix
Parameters
● DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel
● UserSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity
● ItemSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity
● UserNeighborhood – Nearest N-User Neighborhood, Threshold User Neighborhood
Code example
Evaluation
● Average absolute difference
● Root mean squared error (RMSE)
● Precision and recall
● Precision is the proportion of top results that are relevant, for some definition of relevant.
● Recall is the proportion of all relevant results included in the top results.
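
The evaluation measures above, as a short Python sketch. The function names and the cutoff parameter `k` (how many top results are counted) are assumptions for illustration.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE does."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall over the top-k recommended items."""
    top = recommended[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, recommending ["A", "B"] when the relevant items are {"A", "C", "E"} gives a precision of 1/2 (one of two top results is relevant) and a recall of 1/3 (one of three relevant items was found).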
Clustering
Mahout Clustering Algorithms
● K-Means - runs on Hadoop
● Fuzzy K-Means - runs on Hadoop
● Latent Dirichlet Allocation - runs on Hadoop
● Canopy clustering - runs on Hadoop
● Minhash clustering - runs on Hadoop
● kMeans++ streaming clustering - documentation missing
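
To show what the first algorithm in this list computes, here is a minimal in-memory k-means (Lloyd's algorithm) in Python. It is a sketch under assumptions, not Mahout's Hadoop job: the first k points serve as initial centroids and the iteration count is fixed.

```python
import math


def kmeans(points, k, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [list(p) for p in points[:k]]  # assumption: naive initialization
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical 2D points forming two obvious groups.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
```

The assignment step is independent per point and the update step is a per-cluster average, which is why k-means maps naturally onto a map phase and a reduce phase.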
Classification
Mahout Classification Algorithms
● Logistic regression (SGD) - model parameter selection can be done in Hadoop
● Naive Bayes - training runs on Hadoop
● Random Forests - training is done in Hadoop
● Hidden Markov Models - training is done in MapReduce
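
As an illustration of the first entry, here is a minimal logistic regression trained with stochastic gradient descent in plain Python. It is a generic sketch, not Mahout's SGD classifier; the learning rate, epoch count, seed, and toy data are assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(samples, labels, epochs=500, lr=0.1, seed=0):
    """Update weights after every single example (stochastic gradient descent)."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit examples in a new random order each epoch
        for i in order:
            x, y = samples[i], labels[i]
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 when the predicted probability is at least 0.5."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical one-feature data: small values are class 0, large values class 1.
samples = [[0.0], [1.0], [3.0], [4.0]]
labels = [0, 0, 1, 1]
weights, bias = train_logistic_sgd(samples, labels)
```

Because each update touches one example at a time, SGD keeps memory use constant regardless of how many examples are streamed through it.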
Resources
● Mahout in Action
● Apache Mahout Cookbook
● Introduction to Apache Mahout
● http://mahout.apache.org/
Q&A
