Machine Learning with Mahout
Rami Mukhtar, NICTA

Meetup #1, 23 Feb 2012 - http://sydney.bigdataaustralia.com.au/events/49103992/

Comment: How Do I Start Learning Mahout?
I found a very good link which explains Big Data, Mahout fundamentals, and MapReduce in a very simple manner. Hope this helps everyone: http://www.youtube.com/watch?v=DNUliYXrSZo

Presentation Transcript

    • Machine Learning with Mahout
      Rami Mukhtar, Big Data Group, National ICT Australia. February 2012.
    • Mahout: Brief History
      – Started in 2008 as a subproject of Apache Lucene: text mining, clustering, some classification.
      – Sean Owen started Taste in 2005, a recommender engine for business that never took off; the Mahout community asked to merge in the Taste code.
      – Became a top-level Apache project in April 2010.
      – This lineage resulted in a fragmented framework.
    • Mahout: What Is It?
      – A collection of machine learning algorithm implementations:
        • Many (but not all) are implemented on Hadoop MapReduce;
        • A Java library with a handy command line interface for running common tasks.
      – Currently serves three key areas: recommendation engines, clustering, and classification.
      – The focus of today's talk is the functionality accessible from the command line interface, which is the most accessible entry point for Hadoop beginners.
    • Recommenders
      – Supports user-based and item-based collaborative filtering:
        • User-based: similarity between users;
        • Item-based: similarity between items.
      [Figure: bipartite graph linking users 1-6 to items A-F; annotations note that user 3 likes item E and that user 1 may like item A.]
    • Implementations
      – Non-distributed (no Hadoop requirement):
        • The 'Taste' code; supports item- and user-based recommenders (a minimal sketch follows this slide);
        • Good for up to 100 million user-item associations;
        • Faster than the distributed version.
      – Distributed (Hadoop MapReduce):
        • Item-based, using a configurable similarity measure between items;
        • Latent-factor based: estimates 'genres' of items from user preferences, similar to the entry that won the Netflix Prize;
        • Both have command line interfaces.
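For reference, a minimal sketch of the non-distributed 'Taste' API of this era: a user-based recommender over a ratings file. The file name, the neighbourhood size of 10, and the choice of Pearson similarity are illustrative assumptions, not from the slides:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv: userID,itemID,rating -- one user-item association per line
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // User-based CF: compute similarity between users, then keep the 10 nearest
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 5 recommendations for user 1
    List<RecommendedItem> recs = recommender.recommend(1, 5);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " : " + rec.getValue());
    }
  }
}
```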
    • Distributed Item Recommender
      [Figure: an m-user by n-item ratings matrix (entries R where a user rated an item) passes through a similarity calculation to produce an n-by-n item similarity matrix, with entries such as 0.2, 0.5, 0.6, and 0.8.]
    • Distributed Item Recommendation
      – Compute pairwise item similarities from a ratings file:

        mahout itemsimilarity -i <input_file> -o <output_path> ...

        Input: csv file with lines of user, item, rating. Output: csv file with lines of item, item, similarity.
      – A user's unknown rating for an item is then predicted as the similarity-weighted average of that user's known ratings; e.g. if the user rated items 2 and 5 (ratings \(R_2\), \(R_5\)) and the similarities to item 3 are \(s_{2,3}\) and \(s_{3,5}\):

        \( \hat{R}_3 = \frac{s_{2,3} R_2 + s_{3,5} R_5}{s_{2,3} + s_{3,5}} \)

      (A small sketch of this weighted average follows this slide.)
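A minimal sketch of that weighted average; the method name and arrays are illustrative, not a Mahout API:

```java
/**
 * Predict a user's rating for a target item as the similarity-weighted
 * average of the ratings the user has already given. Mirrors the formula
 * on the slide; illustrative only, not a Mahout class.
 */
static double predictRating(double[] knownRatings, double[] similaritiesToTarget) {
  double num = 0.0, den = 0.0;
  for (int i = 0; i < knownRatings.length; i++) {
    num += similaritiesToTarget[i] * knownRatings[i];
    den += similaritiesToTarget[i];
  }
  return den == 0.0 ? 0.0 : num / den;
}

// e.g. predictRating(new double[]{r2, r5}, new double[]{s23, s35})
```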
    • Distributed Item Recommendation
      – Can perform item similarity and recommendation generation in a single call:

        mahout recommenditembased -i <input_path> -o <output_path> -u <users_file> --numRecommendations <n> ...

      – Input: csv file (tab separated) with lines of user, item, rating; the users file (tab separated) lists the users to generate recommendations for.
      – Output: csv file (tab separated), one line per user: the user, then item:score pairs.
      – --numRecommendations is the number of recommendations to return per user.
      (A sketch of parsing the output follows this slide.)
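A rough sketch of parsing those per-user output lines. The exact delimiters, and whether the item:score list is wrapped in brackets, vary by Mahout version, so treat the format handling here as an assumption based on the slide:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Parse one output line: "userID<TAB>item:score,item:score,..." (format per the slide). */
static Map<Long, Double> parseRecommendations(String line) {
  String[] parts = line.split("\t");
  // Strip any surrounding brackets some versions emit around the list
  String list = parts[1].replace("[", "").replace("]", "");
  Map<Long, Double> itemToScore = new LinkedHashMap<>();
  for (String pair : list.split(",")) {
    String[] kv = pair.split(":");
    itemToScore.put(Long.parseLong(kv[0].trim()), Double.parseDouble(kv[1].trim()));
  }
  return itemToScore;
}
```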
    • Clustering
      [Figure: raw data flows through a vectorization step, then a clustering step.]
    • Clustering
      – We don't know the structure of the data, but want to sensibly group things together.
      – A number of distributed algorithms are supported:
        • Canopy Clustering (MAHOUT-3, integrated)
        • K-Means Clustering (MAHOUT-5, integrated)
        • Fuzzy K-Means (MAHOUT-74, integrated)
        • Expectation Maximization (EM) (MAHOUT-28)
        • Mean Shift Clustering (MAHOUT-15, integrated)
        • Hierarchical Clustering (MAHOUT-19)
        • Dirichlet Process Clustering (MAHOUT-30, integrated)
        • Latent Dirichlet Allocation (MAHOUT-123, integrated)
        • Spectral Clustering (MAHOUT-363, integrated)
        • Minhash Clustering (MAHOUT-344, integrated)
      – Some have command line interface support. (A toy k-means sketch follows this slide.)
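To ground the idea, here is a toy k-means in plain Java on 1-D points. It illustrates the assign/update loop that Mahout's K-Means runs as MapReduce jobs over vector SequenceFiles; it is not Mahout's actual code, and the data and initial centroids are made up:

```java
import java.util.Arrays;

public class ToyKMeans {
  public static void main(String[] args) {
    double[] points = {1.0, 1.2, 0.8, 8.0, 8.3, 7.9};
    double[] centroids = {0.0, 5.0};            // k = 2 initial centroids
    int[] assignment = new int[points.length];

    for (int iter = 0; iter < 10; iter++) {
      // Assignment step: attach each point to its nearest centroid
      for (int p = 0; p < points.length; p++) {
        assignment[p] = Math.abs(points[p] - centroids[0])
                      < Math.abs(points[p] - centroids[1]) ? 0 : 1;
      }
      // Update step: move each centroid to the mean of its assigned points
      for (int c = 0; c < centroids.length; c++) {
        double sum = 0;
        int count = 0;
        for (int p = 0; p < points.length; p++) {
          if (assignment[p] == c) { sum += points[p]; count++; }
        }
        if (count > 0) centroids[c] = sum / count;
      }
    }
    System.out.println("centroids: " + Arrays.toString(centroids));
  }
}
```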
    • Vectorization
      – Data specific:
        • In the majority of cases you need to write a MapReduce job to generate the vectorized input;
        • Input formats are still not uniform across Mahout;
        • Most clustering implementations expect a SequenceFile<WritableComparable, VectorWritable> (note: the key is ignored; a writing sketch follows this slide).
      – Mahout has some support for clustering text documents:
        • It can generate n-gram Term Frequency-Inverse Document Frequency (TF-IDF) vectors from a directory of text documents;
        • This enables text documents to be clustered from the command line interface.
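As an illustration of that expected input, a minimal sketch that writes toy vectors to a SequenceFile using the Hadoop and Mahout APIs of the era; the output path and data are made-up assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class WriteVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("vectors/part-00000");   // illustrative path

    // SequenceFile<WritableComparable, VectorWritable>; the clustering jobs
    // ignore the key, so a running counter is fine.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, LongWritable.class, VectorWritable.class);
    try {
      double[][] data = {{1.0, 2.0}, {1.1, 1.9}, {8.0, 8.2}};  // toy feature vectors
      for (int i = 0; i < data.length; i++) {
        writer.append(new LongWritable(i), new VectorWritable(new DenseVector(data[i])));
      }
    } finally {
      writer.close();
    }
  }
}
```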
    • Text Document Vectorization
      – Term frequency (per document) vs. document frequency (across the corpus):

        Term:                the | conduct |  as | run | doctor | with |   a | Patel
        Term frequency:       47 |       3 |   7 |   5 |      8 |   12 |  54 |     6
        Document frequency: 1000 |     198 | 999 | 567 |     48 |  998 | 100 |     3

      – TF-IDF increases the weight of less common words/n-grams within the corpus (N is the number of documents in the corpus):

        \( \mathrm{TFIDF}_i = TF_i \cdot \log\frac{N}{DF_i} \)

      – N-gram examples: unigram (crude); bi-gram (crude, oil); tri-gram (crude, oil, prices).
      (A worked computation follows this slide.)
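To make the formula concrete, a tiny sketch using the slide's counts for 'doctor' and 'the'; the corpus size N = 1000 and the use of the natural logarithm are assumptions:

```java
/** TF-IDF as on the slide: tfidf_i = tf_i * log(N / df_i). */
static double tfidf(double tf, double df, double numDocs) {
  return tf * Math.log(numDocs / df);
}

// With the slide's counts and an assumed N = 1000 documents:
// tfidf(8, 48, 1000)    ~= 8 * ln(20.83) ~= 24.3   ("doctor": rare, so upweighted)
// tfidf(47, 1000, 1000)  = 47 * ln(1.0)   = 0.0    ("the": in every document)
```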
    • Text Document Vectorization
      – Step 1: convert a directory of plain text documents into a sequence file of <name, text body> pairs:

        mahout seqdirectory -i <input_path> -o <seq_output_path>

      – Step 2: generate n-gram TF-IDF vectors, along with the dictionary, term frequency, and inverse document frequency files:

        mahout seq2sparse -i <seq_input_path> -o <output_path>
    • K-Means Clustering
      – Run k-means clustering:

        mahout kmeans
          -i <input vectors directory>
          -c <input clusters directory>
          -o <output working directory>
          -k <# clusters sampled from input>
          -dm <DistanceMeasure>
          -x <maximum number of iterations>
          -xm <execution method: seq/mapreduce>
          ...

      – Available distance measures (org.apache.mahout.common.distance.*) include CosineDistanceMeasure, EuclideanDistanceMeasure, ManhattanDistanceMeasure, SquaredEuclideanDistanceMeasure, ...
      – Inspect the result:

        mahout clusterdump -dt sequencefile -d <dictionary_file> -s <input_seq_file>

      – Example top terms per cluster:
        Cluster 1: oil (6.20), barrel (5.15), crude (5.06), prices (4.50), opec (3.23), price (2.77), dlrs (2.76), said (2.70), bpd (2.45), petroleum (1.99)
        Cluster 2: Coresponsibility (13.97), cereals (13.51), penalise (13.25), farmers (11.99), levies (11.60), ceilings (11.52), ec (11.07), ministers (10.55), output (9.57), 09.73 (9.18)
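To show what the -dm option selects, a small sketch using one of the listed measures directly; the vectors are illustrative:

```java
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DistanceDemo {
  public static void main(String[] args) {
    DistanceMeasure dm = new CosineDistanceMeasure();
    Vector a = new DenseVector(new double[]{1.0, 0.0, 1.0});
    Vector b = new DenseVector(new double[]{1.0, 1.0, 0.0});
    // Cosine distance = 1 - cosine similarity; here 1 - 0.5 = 0.5
    System.out.println(dm.distance(a, b));
  }
}
```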
    • Classification
      – Train the machine to provide discrete answers to a specific question.
      – Mahout supports the following algorithms (others are in development):
        • Logistic Regression
        • Naïve Bayes
        • Random Forests
      [Figure: data with known answers (e.g. 100100: A, 010011: A, 010110: B, ...) feeds a model training algorithm; the trained model then maps data without answers (101100: ?, 010010: ?, ...) to estimated answers (101100: A, 000110: B, ...).]
    • Classification Workflow
      [Figure: labelled samples are vectorized; ~90% form the training set used for model training (1), ~10% form the test set used for model testing (2); the trained model is then applied to new vectorized input (3) to produce label approximations (A, B, A, ...).]
    • Feature Extraction
      – Good feature extraction is critical to trained model performance:
        • You need domain understanding to 'measure' the right things;
        • If you measure the wrong things, even the best model will perform badly;
        • Caution is needed to avoid 'label leaks'.
      – It will typically require hand-written MapReduce code:
        • If text based, you can use the text mining tools in Hive or Mahout.
    • Naïve Bayes Classifier
      – Given a feature vector \((f_1, f_2, \ldots, f_n)\), choose the label that maximizes the posterior:

        \( \mathrm{classify}(f_1, f_2, \ldots, f_n) = \arg\max_{l} \; p(L = l) \prod_{i=1}^{n} p(F_i = f_i \mid L = l) \)

      – \(p(F_i = f_i \mid L = l)\) is the probability of feature i having value \(f_i\) given label l; e.g. assuming a Gaussian pdf:

        \( P(f = v \mid l) = \frac{1}{\sqrt{2\pi\sigma_l^2}} \, e^{-\frac{(v-\mu_l)^2}{2\sigma_l^2}} \)

      – Note: model training boils down to estimating the conditional mean and variance of the feature vector elements. This can be trivially parallelized and implemented in MapReduce. (A compact sketch follows this slide.)
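A compact, self-contained Gaussian Naïve Bayes in Java mirroring the slide's equations: training estimates the per-label priors, means, and variances; classification maximizes the (log) posterior. This illustrates the technique, not Mahout's implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class GaussianNaiveBayes {
  private final Map<String, double[]> means = new HashMap<>();
  private final Map<String, double[]> vars = new HashMap<>();
  private final Map<String, Double> priors = new HashMap<>();

  /** Estimate priors and per-feature conditional means and variances. */
  public void train(double[][] x, String[] labels) {
    int d = x[0].length;
    Map<String, Integer> counts = new HashMap<>();
    for (String l : labels) counts.merge(l, 1, Integer::sum);
    for (String l : counts.keySet()) {
      int n = counts.get(l);
      double[] mu = new double[d], s2 = new double[d];
      for (int i = 0; i < x.length; i++)
        if (labels[i].equals(l))
          for (int j = 0; j < d; j++) mu[j] += x[i][j] / n;
      for (int i = 0; i < x.length; i++)
        if (labels[i].equals(l))
          for (int j = 0; j < d; j++) s2[j] += (x[i][j] - mu[j]) * (x[i][j] - mu[j]) / n;
      means.put(l, mu);
      vars.put(l, s2);
      priors.put(l, (double) n / labels.length);
    }
  }

  /** argmax_l log p(l) + sum_i log N(f_i; mu_{l,i}, sigma^2_{l,i}), in log space for stability. */
  public String classify(double[] f) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (String l : priors.keySet()) {
      double score = Math.log(priors.get(l));
      double[] mu = means.get(l), s2 = vars.get(l);
      for (int j = 0; j < f.length; j++) {
        double v = s2[j] + 1e-9;   // smooth zero variances
        score += -0.5 * Math.log(2 * Math.PI * v)
               - (f[j] - mu[j]) * (f[j] - mu[j]) / (2 * v);
      }
      if (score > bestScore) { bestScore = score; best = l; }
    }
    return best;
  }
}
```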
    • Naïve Bayes in Mahout
      – The command line interface is specific to text classification (e.g. spam detection, document classification, etc.).
      – Input: plain text files, one document per line, in the format: label <tab> word word word ...
      – Output: the generated model, as a set of files in sequence file format (the variances).

        mahout trainclassifier -i <input_path> -o <output_path> --gramSize <n_gram_size> -minDf <minimum_DF> -minSupport <min_TF> ...

      – --gramSize: n-gram size, default = 1.
      – -minSupport: discard n-grams that occur less than this number of times in a document.
      – -minDf: discard n-grams that occur in less than this number of documents.
    • Naïve Bayes in Mahout
      – You need to write your own classifier code for this to be practical: document -> classifier (with trained model) -> label.
      – Look at the class org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm, which can:
        • Classify a document;
        • Return the top n predicted labels;
        • Return the classification certainty;
        • ... (a rough usage sketch follows this slide.)
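A rough usage sketch based on the 0.4-era Bayes example code. Package locations and constructors changed between Mahout releases, so treat the imports, the BayesParameters setup, and the model path as assumptions to be checked against your version:

```java
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.interfaces.Datastore;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;

public class ClassifyDoc {
  public static void main(String[] args) throws Exception {
    // Point at the model produced by `mahout trainclassifier`
    BayesParameters params = new BayesParameters(1);   // gram size 1
    params.set("basePath", "<model_output_path>");     // placeholder path

    Algorithm algorithm = new BayesAlgorithm();
    Datastore datastore = new InMemoryBayesDatastore(params);
    ClassifierContext classifier = new ClassifierContext(algorithm, datastore);
    classifier.initialize();

    // Tokenize a document and classify it, with "unknown" as the fallback label
    String[] tokens = "crude oil prices rose sharply".split("\\s+");
    ClassifierResult result = classifier.classifyDocument(tokens, "unknown");
    System.out.println(result.getLabel() + " (score " + result.getScore() + ")");
  }
}
```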
    • Classification vs. Recommendation
      – You can use a classifier to recommend: is the customer interested in the item or not?
      – A classifier is based on features of the specific item and the customer;
      – Recommendation is based on the past behavior of customers.
      – Classification produces single decisions; recommendation produces a ranking.