Machine Learning
With Mahout
Rami Mukhtar
Big Data Group
National ICT Australia
February 2012
Mahout: Brief History
•  Started in 2008 as a subproject of Apache
   Lucene:
        –  Text mining, clustering, some classification.
•  Sean Owen started Taste in 2005:
        –  A recommender engine for a business that never took off;
        –  Mahout community asked to merge in Taste code
•  Became a top level Apache Project in April 2010
•  Lineage resulted in a fragmented framework.




NICTA Copyright 2011        From imagination to impact       2
Mahout: What is it?
•  Collection of machine learning algorithm implementations:
        –  Many (not all) implemented on Hadoop map-reduce;
        –  Java library with handy command line interface to run common tasks.
•  Currently serves 3 key areas:
        –  Recommendation engines
        –  Clustering
        –  Classification
•  Focus of today’s talk is on functionality accessible from command
   line interface:
        –  Most accessible for Hadoop beginners.




Recommenders
•  Supports user based and item based
   collaborative filtering:
        –  User based: similarity between users;
        –  Item based: similarity between items.

[Figure: a bipartite graph of users 1–6 and items A–F; user 3
likes item E, and shared preferences suggest other items user 1
may like.]
Implementations
•  Non-distributed (no Hadoop requirement)
        –  The ‘Taste’ code; supports item- and user-based filtering;
        –  Good for up to 100 million user-item associations;
        –  Faster than the distributed version.
•  Distributed (Hadoop MapReduce)
        –  Item based, using a configurable similarity measure
           between items.
        –  Latent factor based:
               •  Estimates ‘genres’ of items from user preferences;
               •  Similar to the entry that won the Netflix Prize.
        –  Both have command line interfaces.


Distributed Item Recommender
[Figure: an m × n user-item ratings matrix (an entry R wherever a
user rated an item) feeds a similarity calculation that produces
an n × n item-similarity matrix, e.g. s(1,2) = 0.2, s(1,n) = 0.8,
with 1s on the diagonal.]
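The similarity calculation in the figure can be done with any of several measures. As an illustrative single-machine sketch (not Mahout code), here is cosine similarity between two item columns of the ratings matrix, with unrated entries represented as 0:

```python
import math

def cosine_similarity(col_a, col_b):
    """Cosine similarity between two item rating columns of the
    user-item matrix (unrated entries represented as 0)."""
    dot = sum(a * b for a, b in zip(col_a, col_b))
    norm_a = math.sqrt(sum(a * a for a in col_a))
    norm_b = math.sqrt(sum(b * b for b in col_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two items rated identically by the same users are maximally similar.
```

Mahout's itemsimilarity job computes an analogous measure for every pair of items, in parallel across the cluster.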
Distributed Item Recommendation
Input: csv file: user, item, rating (one row per rating).
Output: csv file: item, item, similarity.

mahout itemsimilarity –i <input_file> -o <output_path> …

[Figure: in the user-item ratings matrix, user 3's unknown rating
for item 3 (R3?) is estimated from their known ratings R2 and R5,
weighted by the item similarities s2,3 and s3,5:]

    R3? = (s2,3 * R2 + s3,5 * R5) / (s2,3 + s3,5)
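The prediction formula is a similarity-weighted average of the user's known ratings on items similar to the target item. A minimal illustrative sketch (not Mahout code):

```python
def predict_rating(neighbors):
    """Similarity-weighted average of a user's known ratings on
    items similar to the target item, as in the formula above.
    neighbors: list of (similarity, rating) pairs."""
    numerator = sum(sim * rating for sim, rating in neighbors)
    denominator = sum(sim for sim, _ in neighbors)
    return numerator / denominator

# e.g. R2 = 4.0 with s2,3 = 0.8, and R5 = 2.0 with s3,5 = 0.4
estimate = predict_rating([(0.8, 4.0), (0.4, 2.0)])
```

The more similar an already-rated item is to the target, the more its rating pulls the estimate towards itself.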
Distributed Item Recommendation
 •  Can perform item similarity and
    recommendation generation in a single
    call:

mahout recommenditembased –i <input_path>
-o <output_path>
-u <users_file>
--numRecommendations …

    Input: csv file (tab separated): user, item, rating.
    <users_file>: one user ID per line.
    --numRecommendations: number of recommendations to return per user.
    Output: csv file (tab separated): user,item,score,item,score,…
Clustering




[Figure: raw data is first turned into vectors (Vectorization),
then grouped (Clustering).]
Clustering
•  You don’t know the structure of the data, but want to sensibly
   group things together.
•  A number of distributed algorithms supported:
        –  Canopy Clustering (MAHOUT-3 – integrated)
        –  K-Means Clustering (MAHOUT-5 – integrated)
        –  Fuzzy K-Means (MAHOUT-74 – integrated)
        –  Expectation Maximization (EM) (MAHOUT-28)
        –  Mean Shift Clustering (MAHOUT-15 – integrated)
        –  Hierarchical Clustering (MAHOUT-19)
        –  Dirichlet Process Clustering (MAHOUT-30 – integrated)
        –  Latent Dirichlet Allocation (MAHOUT-123 – integrated)
        –  Spectral Clustering (MAHOUT-363 – integrated)
        –  Minhash Clustering (MAHOUT-344 - integrated)
•  Some have command line interface support.
Vectorization
•  Data specific.
        –  In the majority of cases you need to write a map-reduce job to
           generate vectorized input;
        –  Input formats are still not uniform across Mahout.
        –  Most clustering implementations expect:
               •  SequenceFile(WritableComparable, VectorWritable)!
               •  Note: key is ignored.
•  Mahout has some support for clustering text documents:
        –  Can generate n-gram Term Frequency-Inverse Document
           Frequency (TF-IDF) from a directory of text documents;
        –  Enables text documents to be clustered using command line
           interface.




Text Document Vectorization

Document → term frequency:
    The | conduct | as  | run | doctor | with | a   | Patel
    47  | 3       | 7   | 5   | 8      | 12   | 54  | 6

Corpus → document frequency:
    The  | conduct | as  | run | doctor | with | a   | Patel
    1000 | 198     | 999 | 567 | 48     | 998  | 100 | 3

    TFIDF_i = TF_i * log(N / DF_i)

TF-IDF increases the weight of less common words / n-grams
within the corpus.

Unigram:  (crude)
Bi-gram:  (crude, oil)
Tri-gram: (crude, oil, prices)
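The weighting formula can be sketched directly. This illustrative version uses the natural logarithm; Mahout's seq2sparse supports additional normalization options, so the exact weights it emits may differ:

```python
import math

def tfidf(tf, df, n_docs):
    """TF-IDF weight: term frequency scaled by the log inverse
    document frequency, per the formula above."""
    return tf * math.log(n_docs / df)

# 'Patel' is rare (DF = 3 of 1000 docs) so it is strongly up-weighted;
# 'The' appears in every document (DF = 1000) so its weight is zero.
rare = tfidf(6, 3, 1000)
common = tfidf(47, 1000, 1000)
```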
Text Document Vectorization

[Figure: a directory of plain text documents is converted by
mahout seqdirectory into a sequence file of <name, text body>
pairs; mahout seq2sparse then runs n-gram generation, term
frequency and inverse document frequency counting, and TF-IDF
vector generation, also emitting a dictionary file.]

mahout seqdirectory –i <input_path> -o <seq_output_path>
mahout seq2sparse -i <seq_input_path> -o <output_path>
K-means clustering
Run k-means clustering:

mahout kmeans
-i <input vectors directory>
-c <input clusters directory>
-o <output working directory>
-k <# clusters sampled from input>
-dm <DistanceMeasure>
-x <maximum number of iterations>
-xm <execution method: seq/mapreduce>
…

Distance measures (org.apache.mahout.common.distance):
    CosineDistanceMeasure
    EuclideanDistanceMeasure
    ManhattanDistanceMeasure
    SquaredEuclideanDistanceMeasure
    …
Inspect the result:

mahout clusterdump
-dt sequencefile
-d <dictionary_file>
-s <input_seq_file>

Cluster 1 top terms: oil => 6.20, barrel => 5.15, crude => 5.06,
prices => 4.50, opec => 3.23, price => 2.77, dlrs => 2.76,
said => 2.70, bpd => 2.45, petroleum => 1.99

Cluster 2 top terms: Coresponsibility => 13.97, cereals => 13.51,
penalise => 13.25, farmers => 11.99, levies => 11.60,
ceilings => 11.52, ec => 11.07, ministers => 10.55,
output => 9.57, 09.73 => 9.18
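The map-reduce job iterates the same two steps as classic single-machine k-means. A minimal illustrative sketch (not Mahout code), using squared Euclidean distance and explicit initial centroids in the role of the -c input clusters:

```python
def squared_euclidean(p, q):
    # analogous to Mahout's SquaredEuclideanDistanceMeasure
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, iterations=10):
    """Classic Lloyd's k-means: assign each point to its nearest
    centroid, recompute centroids as cluster means, repeat."""
    k = len(centroids)
    for _ in range(iterations):  # like -x, the iteration cap
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

# two obvious blobs; the initial centroids play the role of -c
centroids = kmeans(
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
    [(0.0, 0.0), (10.0, 10.0)],
)
```

In the distributed version, the assignment step is the map phase and the centroid recomputation is the reduce phase of each iteration.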



Classification
•  Train the machine to provide discrete answers to a
   specific question.

[Figure: data with known answers (e.g. 100100: A, 010011: A,
010110: B, 100101: A, 010101: B) feeds a model training
algorithm; the resulting trained model then labels data without
answers, producing data with estimated answers (e.g. 101100: A,
000110: B, 011100: A).]

Mahout supports the following algorithms:
-  Logistic Regression
-  Naïve Bayes
-  Random Forests
Others are in development.
Classification Workflow
[Figure: the training set is labelled and vectorized, then split
into a ~90% sample used for (1) model training and a ~10% sample
used for (2) model testing; (3) the input set is vectorized and
fed to the trained model, which outputs a label approximation
(A, B, A, A, B, …).]
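The ~90/10 split in the workflow can be sketched in a few lines; this is an illustrative helper (not Mahout code), with the holdout fraction and seed as hypothetical parameters:

```python
import random

def split(rows, holdout=0.1, seed=0):
    """Shuffle labelled rows and hold out ~10% for model testing,
    keeping ~90% for model training (steps 1 and 2 above)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * holdout)
    return rows[cut:], rows[:cut]  # (training sample, testing sample)
```

Shuffling before splitting matters: if the input is sorted by label, an unshuffled holdout would contain only one class and tell you little about model quality.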
Feature Extraction
•  Good feature extraction is critical to
   trained model performance:
        – Need domain understanding to ‘measure’ the
          right things.
        – If you measure the wrong things, even the best
          model will perform badly.
        – Caution needed to avoid ‘label leaks’.
•  Will typically require hand written map-
   reduce code:
        – If text based, can use text mining tools in
          HIVE or Mahout.
Naïve Bayes Classifier
Classify a feature vector (f1, …, fn) to the most probable label:

    classify(f1, f2, …, fn) = argmax_l  p(L = l) * prod_{i=1..n} p(Fi = fi | L = l)

p(Fi = fi | L = l) is the probability of feature i having value fi
given label l; e.g. assuming a Gaussian pdf:

    P(f = v | l) = 1 / sqrt(2π σ_l²) * exp( −(v − µ_l)² / (2 σ_l²) )

Note: model training boils down to estimating the conditional mean
and variance of the feature vector elements. This can be trivially
parallelized and implemented in map-reduce.
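The two formulas above fit in a short sketch. This is an illustrative single-machine Gaussian naïve Bayes following the slide's assumption, not Mahout's implementation; the small variance floor is an assumption added to avoid division by zero:

```python
import math
from collections import defaultdict

def train(examples):
    """Estimate, per label, the prior and the per-feature
    (mean, variance) pairs -- the 'conditional mean and variance'
    that training boils down to."""
    by_label = defaultdict(list)
    for features, label in examples:
        by_label[label].append(features)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        stats = []
        for column in zip(*rows):
            mu = sum(column) / n
            var = sum((v - mu) ** 2 for v in column) / n + 1e-9  # floor to avoid zero variance
            stats.append((mu, var))
        model[label] = (n / len(examples), stats)
    return model

def classify(model, features):
    """argmax over labels of log p(l) + sum_i log p(f_i | l),
    with Gaussian p(f_i | l) as in the formula above."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for v, (mu, var) in zip(features, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)
```

Working in log space replaces the product over features with a sum, avoiding numeric underflow when n is large.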
Naïve Bayes in Mahout
 Command line specific to text classification (e.g. SPAM detection, document
 classification, etc.)

Input: plain text file, format:
    label \t word word word …

Output: generated model, a set of files in sequence file
format (variances).

mahout trainclassifier –i <input_path> -o <output_path>
--gramSize <n_gram_size> -minDf <minimum_DF> -minSupport <min_TF> …

    --gramSize: n-gram size, default = 1.
    -minDf: discard n-grams that occur in fewer than this number of documents.
    -minSupport: discard n-grams that occur fewer than this number of times in a document.
Naïve Bayes in Mahout
 •  Need to write your own classifier to be
    practical.

[Figure: a document plus the trained model are fed to a
Classifier, which outputs a label.]
Look at class:
org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm!
• Classify document;
• Return top n predicted labels;
• Return classification certainty;
• …

Classification vs. Recommendation
•  Can use a classifier to recommend:
        – Interested in item or not interested?
•  Classifier is based on features of the
   specific item and the customer
•  Recommendation based on past behavior
   of customers
•  Classification: single decisions
•  Recommendation: ranking


