Machine Learning with Mahout

Rami Mukhtar, NICTA
Meetup #1, 23 Feb 2012 - http://sydney.bigdataaustralia.com.au/events/49103992/

1. Machine Learning with Mahout
   Rami Mukhtar
   Big Data Group, National ICT Australia
   February 2012
2. Mahout: Brief History
   • Started in 2008 as a subproject of Apache Lucene:
     – Text mining, clustering, some classification.
   • Sean Owen started Taste in 2005:
     – A recommender engine for business that never took off;
     – The Mahout community asked him to merge in the Taste code.
   • Became a top-level Apache project in April 2010.
   • This lineage resulted in a fragmented framework.
3. Mahout: What is it?
   • A collection of machine learning algorithm implementations:
     – Many (but not all) implemented on Hadoop map-reduce;
     – A Java library with a handy command line interface for running common tasks.
   • Currently serves three key areas:
     – Recommendation engines
     – Clustering
     – Classification
   • The focus of today's talk is the functionality accessible from the command
     line interface: the most accessible entry point for Hadoop beginners.
4. Recommenders
   • Supports user-based and item-based collaborative filtering:
     – User-based: similarity between users;
     – Item-based: similarity between items.
   [Diagram: users 1–6 linked to items A–F; user 3 likes item E, so user 1 may
   like item A.]
5. Implementations
   • Non-distributed (no Hadoop requirement):
     – The 'Taste' code; supports item- and user-based recommenders (see the
       sketch below);
     – Good for up to 100 million user-item associations;
     – Faster than the distributed version.
   • Distributed (Hadoop MapReduce):
     – Item-based, using a configurable similarity measure between items;
     – Latent-factor based:
       • Estimates 'genres' of items from user preferences;
       • Similar to the entry that won the Netflix Prize.
     – Both have command line interfaces.
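To make slide 5 concrete, here is a minimal sketch of driving the non-distributed
Taste code from Java. It assumes a Mahout 0.5/0.6-era classpath and a hypothetical
ratings.csv of userID,itemID,rating lines; the neighbourhood size (10), user ID (1)
and recommendation count (3) are illustrative choices, not recommendations from
the deck.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class TasteExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv (hypothetical): one "userID,itemID,rating" triple per line
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based CF: similarity between users, neighbourhood of the 10 nearest
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }

Swapping in an item-based recommender is a matter of replacing the similarity and
recommender classes; the DataModel and the recommend() call stay the same.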
6. Distributed Item Recommender
   [Diagram: an m × n user-item ratings matrix (users 1..m as rows, items 1..n as
   columns, with known ratings R scattered through it) is fed through a similarity
   calculation to produce an n × n item similarity matrix with ones on the
   diagonal, e.g. s(1,2) = 0.2, s(1,n) = 0.8, s(2,n) = 0.5.]
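The similarity calculation in slide 6 uses whatever measure is configured. As one
concrete illustration (one of several measures Mahout offers, not the only choice),
cosine similarity between the rating columns of items j and k, summed over the
users u who rated both, is:

    % Cosine similarity between two item columns of the ratings matrix
    % (an illustrative, configurable choice of measure)
    s_{j,k} = \frac{\sum_{u} R_{u,j}\,R_{u,k}}
                   {\sqrt{\sum_{u} R_{u,j}^{2}}\;\sqrt{\sum_{u} R_{u,k}^{2}}}

Identical rating columns give s = 1, which matches the unit diagonal in the
similarity matrix sketched above.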
7. Distributed Item Recommendation

     mahout itemsimilarity -i <input_file> -o <output_path> …

   • Input: a CSV file of user, item, rating triples.
   • Output: a CSV file of item, item, similarity triples.
   • A user's missing rating for an item can then be predicted as the
     similarity-weighted average of that user's known ratings; for example, if
     user m has rated items 2 and 5 but not item 3:

     R̂3 = (s(2,3)·R2 + s(3,5)·R5) / (s(2,3) + s(3,5))
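An illustrative invocation, assuming a working Hadoop plus Mahout ~0.6 install.
The file names and ratings below are made up, and the slide's trailing "…" stands
for further optional flags that are left out here:

    # ratings.csv (hypothetical): one userID,itemID,rating triple per line
    #   1,101,5.0
    #   1,102,3.0
    #   2,101,2.0
    hadoop fs -put ratings.csv input/ratings.csv
    mahout itemsimilarity -i input/ratings.csv -o output/similarities
    # output: itemID,itemID,similarity triples, one item pair per line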
8. Distributed Item Recommendation
   • Can perform item similarity and recommendation generation in a single call:

     mahout recommenditembased -i <input_path> -o <output_path>
       -u <users_file> --numRecommendations <n> …

   • Input: a tab-separated CSV file of user, item, rating triples.
   • <users_file>: a tab-separated CSV file listing the users to generate
     recommendations for.
   • --numRecommendations: the number of recommendations to return per user.
   • Output: a tab-separated CSV file per user of user, item, score, item,
     score, …
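Continuing the same illustrative data, a single call that produces per-user
recommendations (paths and the count of 5 are examples, not from the slides):

    mahout recommenditembased \
      -i input/ratings.csv \
      -o output/recommendations \
      -u input/users.txt \
      --numRecommendations 5
    # users.txt lists one userID per line; each output line then holds that
    # user followed by up to 5 recommended item,score pairs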
9. Clustering
   [Diagram: raw data → Vectorization → Clustering.]
10. Clustering
    • Don't know the structure of the data; want to sensibly group things together.
    • A number of distributed algorithms are supported:
      – Canopy Clustering (MAHOUT-3 – integrated)
      – K-Means Clustering (MAHOUT-5 – integrated)
      – Fuzzy K-Means (MAHOUT-74 – integrated)
      – Expectation Maximization (EM) (MAHOUT-28)
      – Mean Shift Clustering (MAHOUT-15 – integrated)
      – Hierarchical Clustering (MAHOUT-19)
      – Dirichlet Process Clustering (MAHOUT-30 – integrated)
      – Latent Dirichlet Allocation (MAHOUT-123 – integrated)
      – Spectral Clustering (MAHOUT-363 – integrated)
      – Minhash Clustering (MAHOUT-344 – integrated)
    • Some have command line interface support.
11. Vectorization
    • Data specific:
      – In the majority of cases you need to write a map-reduce job to generate
        vectorized input;
      – Input formats are still not uniform across Mahout;
      – Most clustering implementations expect:
        • SequenceFile<WritableComparable, VectorWritable> (see the sketch below)
        • Note: the key is ignored.
    • Mahout has some support for clustering text documents:
      – Can generate n-gram Term Frequency-Inverse Document Frequency (TF-IDF)
        vectors from a directory of text documents;
      – This enables text documents to be clustered using the command line
        interface.
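A minimal sketch of writing such a SequenceFile from Java, assuming Hadoop and
Mahout's math module on the classpath; the output path and the toy feature
values are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.VectorWritable;

    public class WriteVectors {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("vectors/part-00000"); // illustrative output path

        // Key type is arbitrary (the clustering jobs ignore it);
        // the value must be a VectorWritable
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, path, Text.class, VectorWritable.class);
        try {
          double[][] points = { {1.0, 2.0}, {1.5, 1.8}, {8.0, 8.0} }; // toy data
          for (int i = 0; i < points.length; i++) {
            writer.append(new Text("point-" + i),
                          new VectorWritable(new DenseVector(points[i])));
          }
        } finally {
          writer.close();
        }
      }
    }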
12. Text Document Vectorization
    • Term frequency (counts within one document):

      The | conduct | as | run | doctor | with | a  | Patel
      47  | 3       | 7  | 5   | 8      | 12   | 54 | 6

    • Document frequency (number of corpus documents containing the term):

      The  | conduct | as  | run | doctor | with | a   | Patel
      1000 | 198     | 999 | 567 | 48     | 998  | 100 | 3

    • With N documents in the corpus:

      TFIDF_i = TF_i * log(N / DF_i)

      This increases the weight of less common words/n-grams within the corpus.

    • N-grams: unigram (crude); bi-gram (crude, oil); tri-gram (crude, oil, prices).
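Plugging the tables above into the formula with N = 1000 (assuming the natural
logarithm, one common convention; the slide does not specify the log base Mahout
uses):

    \mathrm{TFIDF}_{\text{The}}   = 47 \cdot \log\frac{1000}{1000} = 47 \cdot 0 = 0
    \qquad
    \mathrm{TFIDF}_{\text{Patel}} = 6 \cdot \log\frac{1000}{3} \approx 6 \times 5.81 \approx 34.9

The ubiquitous "The" is weighted to zero while the rare "Patel" dominates, which
is exactly the reweighting the slide describes.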
13. Text Document Vectorization
    • Step 1: convert a directory of plain text documents into a sequence file of
      <name, document body> pairs:

      mahout seqdirectory -i <input_path> -o <seq_output_path>

    • Step 2: generate n-gram TF-IDF vectors (along with a dictionary file, term
      frequency vectors and inverse document frequencies):

      mahout seq2sparse -i <seq_input_path> -o <output_path>
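An illustrative end-to-end run of the two steps. The directory names are
examples, and the Reuters news corpus is a common demo choice (consistent with
the "oil/opec" and "farmers/cereals" clusters shown on the next slide), not a
requirement:

    # Copy raw .txt files to HDFS, then vectorize (illustrative paths)
    hadoop fs -put reuters-txt reuters-txt
    mahout seqdirectory -i reuters-txt -o reuters-seq
    mahout seq2sparse   -i reuters-seq -o reuters-vectors
    # reuters-vectors/ now holds tfidf-vectors/, tf-vectors/, dictionary.file-0, ...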
14. K-means Clustering
    • Run k-means clustering (see the example run below):

      mahout kmeans
        -i  <input vectors directory>
        -c  <input clusters directory>
        -o  <output working directory>
        -k  <# clusters sampled from input>
        -dm <DistanceMeasure>
        -x  <maximum number of iterations>
        -xm <execution method: seq/mapreduce>
        …

    • Available distance measures include (in org.apache.mahout.common.distance):
      CosineDistanceMeasure, EuclideanDistanceMeasure, ManhattanDistanceMeasure,
      SquaredEuclideanDistanceMeasure, …

    • Inspect the result:

      mahout clusterdump -dt sequencefile -d <dictionary_file> -s <input_seq_file>

      Cluster 1 top terms:            Cluster 2 top terms:
      oil => 6.20                     Coresponsibility => 13.97
      barrel => 5.15                  cereals => 13.51
      crude => 5.06                   penalise => 13.25
      prices => 4.50                  farmers => 11.99
      opec => 3.23                    levies => 11.60
      price => 2.77                   ceilings => 11.52
      dlrs => 2.76                    ec => 11.07
      said => 2.70                    ministers => 10.55
      bpd => 2.45                     output => 9.57
      petroleum => 1.99               09.73 => 9.18
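Continuing the illustrative Reuters run, using only flags shown on the slide
(the values 20 and 10 are examples; the name of the final clusters directory
depends on the iteration at which k-means converges, so clusters-10 below is an
assumption):

    # Cluster the TF-IDF vectors into 20 clusters, cosine distance, 10 iterations max
    mahout kmeans \
      -i  reuters-vectors/tfidf-vectors \
      -c  reuters-initial-clusters \
      -o  reuters-kmeans \
      -k  20 \
      -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
      -x  10

    # Dump the clusters with human-readable terms from the dictionary
    mahout clusterdump \
      -dt sequencefile \
      -d  reuters-vectors/dictionary.file-0 \
      -s  reuters-kmeans/clusters-10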
15. Classification
    • Train the machine to provide discrete answers to a specific question.
    • Mahout supports the following algorithms:
      – Logistic Regression
      – Naïve Bayes
      – Random Forests
      – Others in development.
    [Diagram: data with known answers (100100: A, 010011: A, 010110: B, …) feeds
    a model training algorithm to produce a trained model; data without answers
    (101100: ?, 010010: ?, …) is then passed through the trained model to produce
    estimated answers (101100: A, 000110: B, …).]
16. Classification Workflow
    [Diagram: labelled samples are vectorized, then split into roughly 90% training
    and 10% test sets. Step 1: model training on the training set. Step 2: model
    testing on the held-out set. Step 3: the trained model is applied to new
    vectorized input to produce a label approximation (A, B, A, …).]
17. Feature Extraction
    • Good feature extraction is critical to trained model performance:
      – Need domain understanding to 'measure' the right things;
      – If you measure the wrong things, even the best model will perform badly;
      – Caution is needed to avoid 'label leaks'.
    • Will typically require hand-written map-reduce code:
      – If text based, you can use the text mining tools in Hive or Mahout.
18. Naïve Bayes Classifier
    • Given a feature vector (f1, f2, …, fn), choose the label l that maximizes
      the product of the class prior and the per-feature conditional likelihoods:

      classify(f1, f2, …, fn) = argmax_l  p(L = l) · ∏_{i=1..n} p(Fi = fi | L = l)

    • The probability of feature i having value v given label l can be modelled
      with, for example, a Gaussian pdf:

      P(f = v | l) = 1 / sqrt(2π σl²) · exp( −(v − μl)² / (2 σl²) )

    • Note: model training boils down to estimating the conditional mean and
      variance of the feature vector elements. This can be trivially parallelized
      and implemented in map-reduce.
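One step the slide leaves implicit: multiplying many small probabilities
underflows floating point, so implementations normally maximize the logarithm
instead. This standard reformulation (not specific to Mahout) leaves the argmax
unchanged because the logarithm is monotonic:

    % Log-space form of the decision rule above
    \operatorname{classify}(f_1,\dots,f_n)
      = \arg\max_{l} \Big[ \log p(L = l)
          + \sum_{i=1}^{n} \log p(F_i = f_i \mid L = l) \Big]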
19. Naïve Bayes in Mahout
    • The command line interface is specific to text classification (e.g. spam
      detection, document classification, etc.):

      mahout trainclassifier -i <input_path> -o <output_path>
        --gramSize <n_gram_size> -minDf <minimum_DF> -minSupport <min_TF> …

    • Input: plain text files, one document per line, in the format
      label<TAB>word word word …
    • Output: the generated model, as a set of files in sequence file format.
    • --gramSize: n-gram size, default = 1.
    • -minSupport: discard n-grams that occur less than this number of times in
      a document.
    • -minDf: discard n-grams that occur in less than this number of documents.
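An illustrative training run (file names and flag values are examples; preparing
the input into the label<TAB>words format is up to you, typically with your own
map-reduce job as slide 17 notes):

    # train.txt (hypothetical), one labelled document per line:
    #   spam<TAB>cheap pills buy now ...
    #   ham<TAB>meeting moved to thursday ...
    hadoop fs -put train.txt bayes-input/train.txt
    mahout trainclassifier \
      -i bayes-input \
      -o bayes-model \
      --gramSize 1 \
      -minDf 2 \
      -minSupport 2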
20. Naïve Bayes in Mahout
    • You need to write your own classifier around the trained model to be
      practical (document → classifier + trained model → label); see the sketch
      below.
    • Look at the class:
      org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm
    • It can:
      – Classify a document;
      – Return the top n predicted labels;
      – Return the classification certainty;
      – …
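A sketch of that custom classifier. The BayesAlgorithm class is the one named on
the slide; the surrounding wiring (ClassifierContext, InMemoryBayesDatastore,
BayesParameters) is my reconstruction of the Mahout 0.4/0.5-era API and may need
adjusting for your version, and the model path and tokens are illustrative:

    import org.apache.mahout.classifier.ClassifierResult;
    import org.apache.mahout.classifier.bayes.ClassifierContext;
    import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
    import org.apache.mahout.classifier.bayes.common.BayesParameters;
    import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
    import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
    import org.apache.mahout.classifier.bayes.interfaces.Datastore;

    public class ClassifyDocument {
      public static void main(String[] args) throws Exception {
        // Point at the model produced by 'mahout trainclassifier'
        // (path illustrative; gram size must match training)
        BayesParameters params = new BayesParameters(1);
        params.set("basePath", "bayes-model");

        // Wire the slide's algorithm class to an in-memory view of the model
        Algorithm algorithm = new BayesAlgorithm();
        Datastore datastore = new InMemoryBayesDatastore(params);
        ClassifierContext classifier = new ClassifierContext(algorithm, datastore);
        classifier.initialize();

        // Classify a pre-tokenized document; "unknown" is the fallback label
        String[] tokens = { "cheap", "pills", "buy", "now" };
        ClassifierResult result = classifier.classifyDocument(tokens, "unknown");
        System.out.println(result.getLabel() + " : " + result.getScore());
      }
    }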
21. Classification vs. Recommendation
    • You can use a classifier to recommend:
      – Interested in the item, or not interested?
    • A classifier is based on features of the specific item and the customer.
    • Recommendation is based on the past behaviour of customers.
    • Classification produces single decisions; recommendation produces a ranking.
