Machine Learning with Mahout

Rami Mukhtar, NICTA
Meetup #1, 23 Feb 2012 - http://sydney.bigdataaustralia.com.au/events/49103992/

Published in: Technology, Education
1 Comment

  • How Do I Start Learning Mahout?

    I found a very good link which explains Big Data, Mahout fundamentals and MapReduce in a very simple manner. Hope this helps everyone: http://www.youtube.com/watch?v=DNUliYXrSZo
Transcript

  • 1. Machine Learning with Mahout. Rami Mukhtar, Big Data Group, National ICT Australia, February 2012.
  • 2. Mahout: Brief History
    – Started in 2008 as a subproject of Apache Lucene: text mining, clustering, some classification.
    – Sean Owen started Taste in 2005: a recommender engine for business that never took off; the Mahout community asked to merge in the Taste code.
    – Became a top-level Apache project in April 2010.
    – This lineage resulted in a fragmented framework.
  • 3. Mahout: What is it?
    – A collection of machine learning algorithm implementations: many (not all) implemented on Hadoop MapReduce; a Java library with a handy command line interface to run common tasks.
    – Currently serves three key areas: recommendation engines, clustering, classification.
    – The focus of today's talk is the functionality accessible from the command line interface: the most accessible for Hadoop beginners.
  • 4. Recommenders
    – Supports user-based and item-based collaborative filtering:
      – User-based: similarity between users;
      – Item-based: similarity between items.
    [Diagram: users 1-6 and items A-F; user 3 likes item E, so user 1 may like item A.]
  • 5. Implementations
    – Non-distributed (no Hadoop requirement): the 'Taste' code; supports item- and user-based; good for up to 100 million user-item associations; faster than the distributed version (a minimal sketch follows below).
    – Distributed (Hadoop MapReduce): item-based, using a configurable similarity measure between items; latent-factor based, which estimates 'genres' of items from user preferences (similar to the entry that won the Netflix prize).
    – Both have command line interfaces.
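    For the non-distributed path, a recommender is assembled directly from the Taste classes. A minimal sketch of a user-based recommender, assuming a ratings.csv file of user,item,rating lines (the file name, similarity choice and neighbourhood size are illustrative, not from the slides):

      import java.io.File;
      import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
      import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
      import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
      import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
      import org.apache.mahout.cf.taste.model.DataModel;
      import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
      import org.apache.mahout.cf.taste.recommender.RecommendedItem;
      import org.apache.mahout.cf.taste.recommender.Recommender;
      import org.apache.mahout.cf.taste.similarity.UserSimilarity;

      public class TasteSketch {
        public static void main(String[] args) throws Exception {
          // Load user,item,rating triples from a CSV file.
          DataModel model = new FileDataModel(new File("ratings.csv"));
          // User-based: similarity between users.
          UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
          // Neighbourhood of the 10 most similar users.
          UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
          Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
          // Top 3 recommendations for user 1.
          for (RecommendedItem item : recommender.recommend(1, 3)) {
            System.out.println(item.getItemID() + " : " + item.getValue());
          }
        }
      }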
  • 6. Distributed Item Recommender
    [Diagram: an m-user by n-item ratings matrix, with an entry R wherever a user has rated an item, feeds a similarity calculation that produces an n-by-n item similarity matrix (e.g. similarity 0.2 between items 1 and 2, 0.8 between items 1 and n).]
  • 7. Distributed Item Recommendation
    – Input: csv file of user, item, rating. Output: csv file of item, item, similarity.
      mahout itemsimilarity -i <input_file> -o <output_path> …
    – A user's unknown rating for an item is predicted as the similarity-weighted average of their known ratings; e.g. predicting item 3 from rated items 2 and 5:
      R3 = (s2,3 * R2 + s3,5 * R5) / (s2,3 + s3,5)
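    To make the weighted average concrete, a tiny worked example with made-up similarities and ratings (illustrative only):

      public class WeightedPrediction {
        public static void main(String[] args) {
          double s23 = 0.4, s35 = 0.6; // hypothetical item-item similarities
          double r2 = 4.0, r5 = 2.0;   // the user's known ratings for items 2 and 5
          // Similarity-weighted average, as in the formula above.
          double r3 = (s23 * r2 + s35 * r5) / (s23 + s35);
          System.out.println(r3);      // (0.4*4.0 + 0.6*2.0) / 1.0 = 2.8
        }
      }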
  • 8. Distributed Item Recommendation
    – Can perform item similarity and recommendation generation in a single call:
      mahout recommenditembased -i <input_path> -o <output_path> -u <users_file> --numRecommendations …
    – Input: csv file (tab separated) of user, item, rating; the optional users file lists the users to generate recommendations for.
    – Output: csv file (tab separated) of user, item, score, item, score, …
    – --numRecommendations: the number of recommendations to return per user.
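    As a concrete illustration of the input shape, a few made-up lines following the format named on the slide (tabs shown as spaces; values are illustrative only):

      1   101   5.0
      1   102   3.0
      2   101   2.0
      2   103   4.5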
  • 9. Clustering
    [Diagram: pipeline from raw data through vectorization to clustering.]
  • 10. Clustering
    – Don't know the structure of the data, but want to sensibly group things together.
    – A number of distributed algorithms are supported:
      – Canopy Clustering (MAHOUT-3, integrated)
      – K-Means Clustering (MAHOUT-5, integrated)
      – Fuzzy K-Means (MAHOUT-74, integrated)
      – Expectation Maximization (EM) (MAHOUT-28)
      – Mean Shift Clustering (MAHOUT-15, integrated)
      – Hierarchical Clustering (MAHOUT-19)
      – Dirichlet Process Clustering (MAHOUT-30, integrated)
      – Latent Dirichlet Allocation (MAHOUT-123, integrated)
      – Spectral Clustering (MAHOUT-363, integrated)
      – Minhash Clustering (MAHOUT-344, integrated)
    – Some have command line interface support.
  • 11. Vectorization
    – Data specific: in the majority of cases you need to write a MapReduce job to generate the vectorized input (see the sketch below), and input formats are still not uniform across Mahout.
    – Most clustering implementations expect: SequenceFile(WritableComparable, VectorWritable); note: the key is ignored.
    – Mahout has some support for clustering text documents: it can generate n-gram Term Frequency-Inverse Document Frequency (TF-IDF) vectors from a directory of text documents, which enables text documents to be clustered from the command line interface.
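    If you do produce the vectorized input yourself, a minimal sketch of writing the expected SequenceFile, assuming Hadoop and the Mahout math classes are on the classpath (the output path and vector values are illustrative):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;
      import org.apache.mahout.math.DenseVector;
      import org.apache.mahout.math.VectorWritable;

      public class WriteVectors {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          Path path = new Path("vectors/part-00000");
          // SequenceFile(WritableComparable, VectorWritable); the key is ignored by the clustering jobs.
          SequenceFile.Writer writer =
              SequenceFile.createWriter(fs, conf, path, Text.class, VectorWritable.class);
          try {
            writer.append(new Text("point-1"), new VectorWritable(new DenseVector(new double[] {1.0, 2.0})));
            writer.append(new Text("point-2"), new VectorWritable(new DenseVector(new double[] {8.5, 9.0})));
          } finally {
            writer.close();
          }
        }
      }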
  • 12. Text Document Vectorization
    – Term frequency within one document, e.g.: The=47, conduct=3, as=7, run=5, doctor=8, with=12, a=54, Patel=6.
    – Document frequency across the corpus, e.g.: The=1000, conduct=198, as=999, run=567, doctor=48, with=998, a=100, Patel=3.
    – TFIDF_i = TF_i * log(N / DF_i), where N is the corpus size: increases the weight of less common words/n-grams within the corpus.
    – Unigram: (crude); bi-gram: (crude, oil); tri-gram: (crude, oil, prices).
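    Plugging the slide's numbers into the formula (natural log assumed here; the weighting Mahout actually applies may differ in log base and smoothing):

      public class TfIdfWorked {
        public static void main(String[] args) {
          double n = 1000.0;                             // corpus size
          // 'doctor': frequent in this document (TF = 8), rare in the corpus (DF = 48).
          System.out.println(8 * Math.log(n / 48.0));    // ~24.3: high weight
          // 'The': frequent everywhere (TF = 47, DF = 1000), so its weight vanishes.
          System.out.println(47 * Math.log(n / 1000.0)); // 0.0
        }
      }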
  • 13. Text Document Vectorization
    – Step 1: convert a directory of plain text documents into a sequence file of <name, text body> pairs:
      mahout seqdirectory -i <input_path> -o <seq_output_path>
    – Step 2: n-gram TF-IDF vector generation, which also produces the dictionary file, term frequencies and inverse document frequencies:
      mahout seq2sparse -i <seq_input_path> -o <output_path>
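    For example, over a hypothetical directory ./docs of plain text files (directory names are illustrative; treat the extra flags as version-dependent, -ng for the maximum n-gram size and -wt for the weighting scheme):

      mahout seqdirectory -i ./docs -o docs-seq
      mahout seq2sparse -i docs-seq -o docs-vectors -ng 2 -wt tfidf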
  • 14. K-means Clustering
    – Run k-means clustering:
      mahout kmeans
      -i <input vectors directory>
      -c <input clusters directory>
      -o <output working directory>
      -k <# clusters sampled from input>
      -dm <DistanceMeasure>
      -x <maximum number of iterations>
      -xm <execution method: seq/mapreduce>
      …
    – Distance measures include org.apache.mahout.common.distance.CosineDistanceMeasure, EuclideanDistanceMeasure, ManhattanDistanceMeasure, SquaredEuclideanDistanceMeasure, …
    – Inspect the result:
      mahout clusterdump -dt sequencefile -d <dictionary_file> -s <input_seq_file>
    – Example output, top terms of two clusters:
      Cluster 1: oil (6.20), barrel (5.15), crude (5.06), prices (4.50), opec (3.23), price (2.77), dlrs (2.76), said (2.70), bpd (2.45), petroleum (1.99).
      Cluster 2: coresponsibility (13.97), cereals (13.51), penalise (13.25), farmers (11.99), levies (11.60), ceilings (11.52), ec (11.07), ministers (10.55), output (9.57), 09.73 (9.18).
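    A concrete run over the vectors from the previous step might look like this (paths and the choice of k are illustrative; seq2sparse writes the TF-IDF vectors into a tfidf-vectors subdirectory):

      mahout kmeans -i docs-vectors/tfidf-vectors -c docs-initial-clusters -o docs-kmeans -k 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -xm mapreduce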
  • 15. Classification
    – Train the machine to provide discrete answers to a specific question. Mahout supports the following algorithms:
      – Logistic Regression
      – Naïve Bayes
      – Random Forests
      – Others in development.
    [Diagram: feature vectors with known answers (e.g. 100100: A, 010110: B) feed a model training algorithm; the trained model then assigns estimated answers to vectors without answers.]
  • 16. Classification Workflow
    [Diagram: 1. label and vectorize the input set, then train the model on a ~90% sample (the training set); 2. test the model on the remaining ~10% sample; 3. vectorize new, unlabelled input and run it through the trained model to produce a label approximation (A, B, A, …).]
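    The 90/10 split itself is not Mahout-specific; a minimal sketch of holding out test data from a set of labelled examples (plain Java, illustrative only):

      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.List;
      import java.util.Random;

      public class HoldoutSplit {
        public static void main(String[] args) {
          List<String> labelled = new ArrayList<String>();
          for (int i = 0; i < 100; i++) {
            labelled.add("example-" + i);
          }
          Collections.shuffle(labelled, new Random(42)); // fixed seed for repeatability
          int cut = (int) (labelled.size() * 0.9);
          List<String> training = labelled.subList(0, cut);              // ~90% trains the model
          List<String> testing = labelled.subList(cut, labelled.size()); // ~10% tests it
          System.out.println(training.size() + " training / " + testing.size() + " testing");
        }
      }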
  • 17. Feature Extraction
    – Good feature extraction is critical to trained model performance:
      – You need domain understanding to 'measure' the right things;
      – If you measure the wrong things, even the best model will perform badly;
      – Caution is needed to avoid 'label leaks'.
    – Will typically require hand-written MapReduce code: if the data is text based, you can use the text mining tools in Hive or Mahout.
  • 18. Naïve Bayes Classifier
    – Given a feature vector (f1, f2, …, fn), choose the label that maximises the prior times the product of the per-feature conditional probabilities:
      classify(f1, f2, …, fn) = argmax_l p(L = l) * Π(i = 1..n) p(Fi = fi | L = l)
    – Probability of feature i having value fi given label l, e.g. assuming a Gaussian pdf:
      P(f = v | l) = (1 / sqrt(2π σ_l²)) * exp(-(v − µ_l)² / (2σ_l²))
    – Note: model training boils down to estimating the conditional mean and variance of the feature vector elements. This can be trivially parallelized and implemented in MapReduce.
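    A toy sketch of this decision rule, working in log space to avoid numerical underflow (plain Java, not the Mahout implementation; the priors, means and variances are assumed to have been estimated already):

      public class GaussianNaiveBayes {
        // Log of the Gaussian pdf with mean mu and variance var, evaluated at v.
        static double logGaussian(double v, double mu, double var) {
          return -0.5 * Math.log(2 * Math.PI * var) - (v - mu) * (v - mu) / (2 * var);
        }

        // Returns the label index maximising log p(l) + sum_i log p(f_i | l).
        static int classify(double[] f, double[] priors, double[][] mu, double[][] var) {
          int best = -1;
          double bestScore = Double.NEGATIVE_INFINITY;
          for (int l = 0; l < priors.length; l++) {
            double score = Math.log(priors[l]);
            for (int i = 0; i < f.length; i++) {
              score += logGaussian(f[i], mu[l][i], var[l][i]);
            }
            if (score > bestScore) {
              bestScore = score;
              best = l;
            }
          }
          return best;
        }

        public static void main(String[] args) {
          // Two labels, one feature: label 0 centred at 0.0, label 1 at 5.0 (made-up numbers).
          double[] priors = {0.5, 0.5};
          double[][] mu = {{0.0}, {5.0}};
          double[][] var = {{1.0}, {1.0}};
          System.out.println(classify(new double[] {4.2}, priors, mu, var)); // prints 1
        }
      }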
  • 19. Naïve Bayes in Mahout
    – The command line interface is specific to text classification (e.g. spam detection, document classification, etc.).
    – Input: plain text file, one document per line, format: label <tab> word word word …
    – Output: the generated model, a set of files in sequence file format (the variances).
      mahout trainclassifier -i <input_path> -o <output_path> --gramSize <n_gram_size> -minDf <minimum_DF> -minSupport <min_TF> …
    – --gramSize: n-gram size, default = 1. -minSupport: discard n-grams that occur fewer than this number of times within a document. -minDf: discard n-grams that occur in fewer than this number of documents.
  • 20. Naïve Bayes in Mahout
    – You need to write your own classifier around the trained model to be practical: document → classifier (with trained model) → label.
    – Look at the class org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm, which can:
      – classify a document;
      – return the top n predicted labels;
      – return the classification certainty;
      – …
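    The Bayes classifier API moved around between Mahout releases, so rather than quote signatures from memory, here is only the shape such a wrapper typically takes; every type and method below is hypothetical, not the actual BayesAlgorithm API (consult the class named on the slide for the real entry points):

      import java.util.List;

      // Hypothetical wrapper shape around a trained model; not the real Mahout API.
      public interface DocumentClassifier {
        // Single best label for a tokenised document.
        String classify(String[] tokens);

        // Top-n labels, best first, each with a certainty score.
        List<ScoredLabel> classifyTopN(String[] tokens, int n);

        class ScoredLabel {
          public final String label;
          public final double certainty;
          public ScoredLabel(String label, double certainty) {
            this.label = label;
            this.certainty = certainty;
          }
        }
      }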
  • 21. Classification vs. Recommendation
    – A classifier can be used to recommend: is the user interested in the item or not?
    – The classifier is based on features of the specific item and the customer; recommendation is based on the past behaviour of customers.
    – Classification produces single decisions; recommendation produces a ranking.
