Machine Learning
With Mahout
Rami Mukhtar
Big Data Group
National ICT Australia
February 2012
Mahout: Brief History
•  Started in 2008 as a subproject of Apache
   Lucene:
        –  Text mining, clustering, some classification.
•  Sean Owen started Taste in 2005:
        –  A recommender engine for a business that never took off;
        –  Mahout community asked to merge in Taste code
•  Became a top level Apache Project in April 2010
•  Lineage resulted in a fragmented framework.




NICTA Copyright 2011        From imagination to impact       2
Mahout: What is it?
•  Collection of machine learning algorithm implementations:
        –  Many (not all) implemented on Hadoop map-reduce;
        –  Java library with handy command line interface to run common tasks.
•  Currently serves 3 key areas:
        –  Recommendation engines
        –  Clustering
        –  Classification
•  Focus of today’s talk is on functionality accessible from command
   line interface:
        –  Most accessible for Hadoop beginners.




Recommenders
•  Supports user based and item based
   collaborative filtering:
        –  User based: similarity between users;
        –  Item based: similarity between items.

[Figure: a bipartite graph of users 1–6 and items A–F; user 3
likes item E, and shared preferences suggest other items user 1
may like.]
Implementations
•  Non-distributed (no Hadoop requirement)
        –  The ‘Taste’ code; supports item- and user-based filtering;
        –  Good for up to 100 million user-item associations;
        –  Faster than the distributed version.
•  Distributed (Hadoop MapReduce)
        –  Item based, using a configurable similarity measure
           between items.
        –  Latent factor based:
               •  Estimates ‘genres’ of items from user preferences;
               •  Similar to the entry that won the Netflix Prize.
        –  Both have command line interfaces.


Distributed Item Recommender
[Figure: an m × n user-item ratings matrix (an entry R wherever a
user rated an item) feeds a similarity calculation that produces
an n × n item-similarity matrix, e.g. s(1,2) = 0.2, s(1,n) = 0.8,
with 1s on the diagonal.]
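The similarity calculation in the figure can be done with any of several measures. As an illustrative single-machine sketch (not Mahout code), here is cosine similarity between two item columns of the ratings matrix, with unrated entries represented as 0:

```python
import math

def cosine_similarity(col_a, col_b):
    """Cosine similarity between two item rating columns of the
    user-item matrix (unrated entries represented as 0)."""
    dot = sum(a * b for a, b in zip(col_a, col_b))
    norm_a = math.sqrt(sum(a * a for a in col_a))
    norm_b = math.sqrt(sum(b * b for b in col_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two items rated identically by the same users are maximally similar.
```

Mahout's itemsimilarity job computes an analogous measure for every pair of items, in parallel across the cluster.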
Distributed Item Recommendation
Input: csv file: user, item, rating (one row per rating).
Output: csv file: item, item, similarity.

mahout itemsimilarity –i <input_file> -o <output_path> …

[Figure: in the user-item ratings matrix, user 3's unknown rating
for item 3 (R3?) is estimated from their known ratings R2 and R5,
weighted by the item similarities s2,3 and s3,5:]

    R3? = (s2,3 * R2 + s3,5 * R5) / (s2,3 + s3,5)
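The prediction formula is a similarity-weighted average of the user's known ratings on items similar to the target item. A minimal illustrative sketch (not Mahout code):

```python
def predict_rating(neighbors):
    """Similarity-weighted average of a user's known ratings on
    items similar to the target item, as in the formula above.
    neighbors: list of (similarity, rating) pairs."""
    numerator = sum(sim * rating for sim, rating in neighbors)
    denominator = sum(sim for sim, _ in neighbors)
    return numerator / denominator

# e.g. R2 = 4.0 with s2,3 = 0.8, and R5 = 2.0 with s3,5 = 0.4
estimate = predict_rating([(0.8, 4.0), (0.4, 2.0)])
```

The more similar an already-rated item is to the target, the more its rating pulls the estimate towards itself.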
Distributed Item Recommendation
 •  Can perform item similarity and
    recommendation generation in a single
    call:

mahout recommenditembased –i <input_path>
-o <output_path>
-u <users_file>
--numRecommendations …

    Input: csv file (tab separated): user, item, rating.
    <users_file>: one user ID per line.
    --numRecommendations: number of recommendations to return per user.
    Output: csv file (tab separated): user,item,score,item,score,…
Clustering




[Figure: raw data is first turned into vectors (Vectorization),
then grouped (Clustering).]
Clustering
•  You don’t know the structure of the data, but want to sensibly
   group things together.
•  A number of distributed algorithms supported:
        –  Canopy Clustering (MAHOUT-3 – integrated)
        –  K-Means Clustering (MAHOUT-5 – integrated)
        –  Fuzzy K-Means (MAHOUT-74 – integrated)
        –  Expectation Maximization (EM) (MAHOUT-28)
        –  Mean Shift Clustering (MAHOUT-15 – integrated)
        –  Hierarchical Clustering (MAHOUT-19)
        –  Dirichlet Process Clustering (MAHOUT-30 – integrated)
        –  Latent Dirichlet Allocation (MAHOUT-123 – integrated)
        –  Spectral Clustering (MAHOUT-363 – integrated)
        –  Minhash Clustering (MAHOUT-344 - integrated)
•  Some have command line interface support.
Vectorization
•  Data specific.
        –  In the majority of cases you need to write a map-reduce job to
           generate vectorized input;
        –  Input formats are still not uniform across Mahout.
        –  Most clustering implementations expect:
               •  SequenceFile(WritableComparable, VectorWritable)!
               •  Note: key is ignored.
•  Mahout has some support for clustering text documents:
        –  Can generate n-gram Term Frequency-Inverse Document
           Frequency (TF-IDF) from a directory of text documents;
        –  Enables text documents to be clustered using command line
           interface.




Text Document Vectorization

Document → term frequency:
    The | conduct | as  | run | doctor | with | a   | Patel
    47  | 3       | 7   | 5   | 8      | 12   | 54  | 6

Corpus → document frequency:
    The  | conduct | as  | run | doctor | with | a   | Patel
    1000 | 198     | 999 | 567 | 48     | 998  | 100 | 3

    TFIDF_i = TF_i * log(N / DF_i)

TF-IDF increases the weight of less common words / n-grams
within the corpus.

Unigram:  (crude)
Bi-gram:  (crude, oil)
Tri-gram: (crude, oil, prices)
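The weighting formula can be sketched directly. This illustrative version uses the natural logarithm; Mahout's seq2sparse supports additional normalization options, so the exact weights it emits may differ:

```python
import math

def tfidf(tf, df, n_docs):
    """TF-IDF weight: term frequency scaled by the log inverse
    document frequency, per the formula above."""
    return tf * math.log(n_docs / df)

# 'Patel' is rare (DF = 3 of 1000 docs) so it is strongly up-weighted;
# 'The' appears in every document (DF = 1000) so its weight is zero.
rare = tfidf(6, 3, 1000)
common = tfidf(47, 1000, 1000)
```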
Text Document Vectorization

[Figure: a directory of plain text documents is converted by
mahout seqdirectory into a sequence file of <name, text body>
pairs; mahout seq2sparse then runs n-gram generation, term
frequency and inverse document frequency counting, and TF-IDF
vector generation, also emitting a dictionary file.]

mahout seqdirectory –i <input_path> -o <seq_output_path>
mahout seq2sparse -i <seq_input_path> -o <output_path>
K-means clustering
Run k-means clustering:

mahout kmeans
-i <input vectors directory>
-c <input clusters directory>
-o <output working directory>
-k <# clusters sampled from input>
-dm <DistanceMeasure>
-x <maximum number of iterations>
-xm <execution method: seq/mapreduce>
…

Distance measures (org.apache.mahout.common.distance):
    CosineDistanceMeasure
    EuclideanDistanceMeasure
    ManhattanDistanceMeasure
    SquaredEuclideanDistanceMeasure
    …
Inspect the result:

mahout clusterdump
-dt sequencefile
-d <dictionary_file>
-s <input_seq_file>

Cluster 1 top terms: oil => 6.20, barrel => 5.15, crude => 5.06,
prices => 4.50, opec => 3.23, price => 2.77, dlrs => 2.76,
said => 2.70, bpd => 2.45, petroleum => 1.99

Cluster 2 top terms: Coresponsibility => 13.97, cereals => 13.51,
penalise => 13.25, farmers => 11.99, levies => 11.60,
ceilings => 11.52, ec => 11.07, ministers => 10.55,
output => 9.57, 09.73 => 9.18
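The map-reduce job iterates the same two steps as classic single-machine k-means. A minimal illustrative sketch (not Mahout code), using squared Euclidean distance and explicit initial centroids in the role of the -c input clusters:

```python
def squared_euclidean(p, q):
    # analogous to Mahout's SquaredEuclideanDistanceMeasure
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, iterations=10):
    """Classic Lloyd's k-means: assign each point to its nearest
    centroid, recompute centroids as cluster means, repeat."""
    k = len(centroids)
    for _ in range(iterations):  # like -x, the iteration cap
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

# two obvious blobs; the initial centroids play the role of -c
centroids = kmeans(
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
    [(0.0, 0.0), (10.0, 10.0)],
)
```

In the distributed version, the assignment step is the map phase and the centroid recomputation is the reduce phase of each iteration.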



Classification
•  Train the machine to provide discrete answers to a
   specific question.

[Figure: data with known answers (e.g. 100100: A, 010011: A,
010110: B, 100101: A, 010101: B) feeds a model training
algorithm; the resulting trained model then labels data without
answers, producing data with estimated answers (e.g. 101100: A,
000110: B, 011100: A).]

Mahout supports the following algorithms:
-  Logistic Regression
-  Naïve Bayes
-  Random Forests
Others are in development.
Classification Workflow
[Figure: the training set is labelled and vectorized, then split
into a ~90% sample used for (1) model training and a ~10% sample
used for (2) model testing; (3) the input set is vectorized and
fed to the trained model, which outputs a label approximation
(A, B, A, A, B, …).]
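The ~90/10 split in the workflow can be sketched in a few lines; this is an illustrative helper (not Mahout code), with the holdout fraction and seed as hypothetical parameters:

```python
import random

def split(rows, holdout=0.1, seed=0):
    """Shuffle labelled rows and hold out ~10% for model testing,
    keeping ~90% for model training (steps 1 and 2 above)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * holdout)
    return rows[cut:], rows[:cut]  # (training sample, testing sample)
```

Shuffling before splitting matters: if the input is sorted by label, an unshuffled holdout would contain only one class and tell you little about model quality.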
Feature Extraction
•  Good feature extraction is critical to
   trained model performance:
        – Need domain understanding to ‘measure’ the
          right things.
        – If you measure the wrong things, even the best
          model will perform badly.
        – Caution needed to avoid ‘label leaks’.
•  Will typically require hand written map-
   reduce code:
        – If text based, can use text mining tools in
          HIVE or Mahout.
Naïve Bayes Classifier
Classify a feature vector (f1, …, fn) to the most probable label:

    classify(f1, f2, …, fn) = argmax_l  p(L = l) * prod_{i=1..n} p(Fi = fi | L = l)

p(Fi = fi | L = l) is the probability of feature i having value fi
given label l; e.g. assuming a Gaussian pdf:

    P(f = v | l) = 1 / sqrt(2π σ_l²) * exp( −(v − µ_l)² / (2 σ_l²) )

Note: model training boils down to estimating the conditional mean
and variance of the feature vector elements. This can be trivially
parallelized and implemented in map-reduce.
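The two formulas above fit in a short sketch. This is an illustrative single-machine Gaussian naïve Bayes following the slide's assumption, not Mahout's implementation; the small variance floor is an assumption added to avoid division by zero:

```python
import math
from collections import defaultdict

def train(examples):
    """Estimate, per label, the prior and the per-feature
    (mean, variance) pairs -- the 'conditional mean and variance'
    that training boils down to."""
    by_label = defaultdict(list)
    for features, label in examples:
        by_label[label].append(features)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        stats = []
        for column in zip(*rows):
            mu = sum(column) / n
            var = sum((v - mu) ** 2 for v in column) / n + 1e-9  # floor to avoid zero variance
            stats.append((mu, var))
        model[label] = (n / len(examples), stats)
    return model

def classify(model, features):
    """argmax over labels of log p(l) + sum_i log p(f_i | l),
    with Gaussian p(f_i | l) as in the formula above."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for v, (mu, var) in zip(features, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)
```

Working in log space replaces the product over features with a sum, avoiding numeric underflow when n is large.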
Naïve Bayes in Mahout
 Command line specific to text classification (e.g. SPAM detection, document
 classification, etc.)

Input: plain text file, format:
    label \t word word word …

Output: generated model, a set of files in sequence file
format (variances).

mahout trainclassifier –i <input_path> -o <output_path>
--gramSize <n_gram_size> -minDf <minimum_DF> -minSupport <min_TF> …

    --gramSize: n-gram size, default = 1.
    -minDf: discard n-grams that occur in fewer than this number of documents.
    -minSupport: discard n-grams that occur fewer than this number of times in a document.
Naïve Bayes in Mahout
 •  Need to write your own classifier to be
    practical.

[Figure: a document plus the trained model are fed to a
Classifier, which outputs a label.]
Look at class:
org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm!
• Classify document;
• Return top n predicted labels;
• Return classification certainty;
• …

Classification vs. Recommendation
•  Can use a classifier to recommend:
        – Interested in item or not interested?
•  Classifier is based on features of the
   specific item and the customer
•  Recommendation based on past behavior
   of customers
•  Classification: single decisions
•  Recommendation: ranking


