Machine Learning
With Mahout
Rami Mukhtar
Big Data Group
National ICT Australia
February 2012
Mahout: Brief History
•  Started in 2008 as a subproject of Apache Lucene:
        –  Text mining, clustering, some classification.
•  Sean Owen started Taste in 2005:
        –  Recommender engine for business that never took off
        –  Mahout community asked to merge in Taste code
•  Became a top level Apache Project in April 2010
•  Lineage resulted in a fragmented framework.




NICTA Copyright 2011        From imagination to impact
Mahout: What is it?
•  Collection of machine learning algorithm implementations:
        –  Many (not all) implemented on Hadoop map-reduce;
        –  Java library with handy command line interface to run common tasks.
•  Currently serves 3 key areas:
        –  Recommendation engines
        –  Clustering
        –  Classification
•  Focus of today’s talk is on functionality accessible from command
   line interface:
        –  Most accessible for Hadoop beginners.




Recommenders
•  Supports user based and item based
   collaborative filtering:
        –  User based: similarity between users;
        –  Item based: similarity between items.

[Figure: users 1-6 linked to items A-F. Annotations: "user 3 likes item E"; "other items user 1 may like".]
Implementations
•  Non-distributed (no Hadoop requirement)
        –  The ‘Taste’ code, supports item and user based;
        –  Good for up to 100 million user-item associations;
        –  Faster than distributed version.
•  Distributed (Hadoop MapReduce)
        –  Item based using similarity measure (configurable)
           between items.
        –  Latent factor based:
               •  Estimates ‘genres’ of items from user preferences
               •  Similar to the entry that won the Netflix Prize.
        –  Both have command line interfaces.


Distributed Item Recommender
[Figure: the m × n user-item ratings matrix (rows User 1 … User m, columns Item 1 … Item n, R marks a known rating) feeds a similarity calculation that produces the n × n item similarity matrix, for example:]

        1     2     …     n
  1     1    .2     …    .8
  2           1     …    .5
  …                 …    .6
  n                       1
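The similarity calculation on this slide can be sketched in a few lines of plain Python. This is only an illustration, not Mahout's implementation: the ratings are made up, the helper names are hypothetical, and cosine similarity is assumed as the (configurable) similarity measure.

```python
import math

# Hypothetical ratings: user -> {item: rating}. Values are illustrative only.
ratings = {
    "user1": {"item1": 5.0, "item3": 3.0},
    "user2": {"item1": 4.0, "item2": 2.0, "item3": 3.0},
    "user3": {"item2": 5.0, "item3": 1.0},
}

def item_vectors(ratings):
    """Transpose user -> item ratings into item -> {user: rating}."""
    vectors = {}
    for user, prefs in ratings.items():
        for item, value in prefs.items():
            vectors.setdefault(item, {})[user] = value
    return vectors

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def item_similarity_matrix(ratings):
    """Return {(item_i, item_j): similarity} for every pair of items."""
    vectors = item_vectors(ratings)
    items = sorted(vectors)
    return {(i, j): cosine_similarity(vectors[i], vectors[j])
            for i in items for j in items}

similarities = item_similarity_matrix(ratings)
for (i, j), s in sorted(similarities.items()):
    print(f"{i},{j},{s:.3f}")  # item, item, similarity
```

The diagonal is 1 and the matrix is symmetric, which is why the slide shows only the upper triangle.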
Distributed Item Recommendation
Input csv file: user, item, rating
Output csv file: item, item, similarity

mahout itemsimilarity -i <input_file> -o <output_path> …

[Figure: user-item ratings matrix in which a user's unknown rating R3? for item 3 is estimated from that user's known ratings R2 and R5, weighted by the item similarities s2,3 and s3,5.]

$$R_3 = \frac{s_{2,3} R_2 + s_{3,5} R_5}{s_{2,3} + s_{3,5}}$$
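The estimate of the unknown rating R3 on this slide is a similarity-weighted average of the user's known ratings. A minimal sketch, assuming made-up similarity and rating values and a hypothetical helper name:

```python
def predict_rating(known_ratings, similarities):
    """Similarity-weighted average of a user's known ratings.

    known_ratings: {item: rating} for items the user has rated.
    similarities:  {item: similarity between that item and the target item}.
    Returns None when no rated item has a non-zero total similarity.
    """
    common = [item for item in known_ratings if item in similarities]
    weight_sum = sum(similarities[item] for item in common)
    if weight_sum == 0:
        return None
    weighted = sum(similarities[item] * known_ratings[item] for item in common)
    return weighted / weight_sum

# Illustrative values only: R2 = 4, R5 = 2, s2,3 = 0.75, s3,5 = 0.25.
known = {"item2": 4.0, "item5": 2.0}
sims_to_item3 = {"item2": 0.75, "item5": 0.25}
r3 = predict_rating(known, sims_to_item3)
print(r3)  # (0.75 * 4 + 0.25 * 2) / (0.75 + 0.25) = 3.5
```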
Distributed Item Recommendation
 •  Can perform item similarity and recommendation generation in a single call:

mahout recommenditembased -i <input_path>
-o <output_path>
-u <users_file>
--numRecommendations …

        –  <input_path>: csv file (tab separated): user, item, rating
        –  <output_path>: csv file (tab separated): user,item,score,item,score,…
        –  <users_file>: csv file (tab separated): one user per line
        –  --numRecommendations: number of recommendations to return per user
Clustering




[Figure: two-stage pipeline. Data is first vectorized, then clustered.]
Clustering
•  Don’t know the structure of data, want to sensibly group
   things together.
•  A number of distributed algorithms supported:
        –  Canopy Clustering (MAHOUT-3 – integrated)
        –  K-Means Clustering (MAHOUT-5 – integrated)
        –  Fuzzy K-Means (MAHOUT-74 – integrated)
        –  Expectation Maximization (EM) (MAHOUT-28)
        –  Mean Shift Clustering (MAHOUT-15 – integrated)
        –  Hierarchical Clustering (MAHOUT-19)
        –  Dirichlet Process Clustering (MAHOUT-30 – integrated)
        –  Latent Dirichlet Allocation (MAHOUT-123 – integrated)
        –  Spectral Clustering (MAHOUT-363 – integrated)
        –  Minhash Clustering (MAHOUT-344 - integrated)
•  Some have command line interface support.
Vectorization
•  Data specific.
        –  Majority of cases need to write a map-reduce job to generate
           vectorized input
        –  Input formats are still not uniform across Mahout.
        –  Most clustering implementations expect:
               •  SequenceFile(WritableComparable, VectorWritable)
               •  Note: key is ignored.
•  Mahout has some support for clustering text documents:
        –  Can generate n-gram Term Frequency-Inverse Document
           Frequency (TF-IDF) from a directory of text documents;
        –  Enables text documents to be clustered using command line
           interface.




Text Document Vectorization

Term frequency (TF), counted within one document:

Term | The  | conduct | as  | run | doctor | with | a   | Patel
TF   | 47   | 3       | 7   | 5   | 8      | 12   | 54  | 6

Document frequency (DF), counted across the corpus:

Term | The  | conduct | as  | run | doctor | with | a   | Patel
DF   | 1000 | 198     | 999 | 567 | 48     | 998  | 100 | 3

$$\mathrm{TFIDF}_i = \mathrm{TF}_i \cdot \log\frac{N}{\mathrm{DF}_i}$$

where N is the number of documents in the corpus. This increases the weight of less common words/n-grams within the corpus.

N-gram examples:
        –  Unigram: (crude)
        –  Bi-gram: (crude, oil)
        –  Tri-gram: (crude, oil, prices)
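The TF-IDF weighting can be checked against the slide's own TF and DF counts. The slide does not give the corpus size, so N = 1000 (the largest DF shown) and the natural logarithm are assumptions made purely for illustration.

```python
import math

# TF and DF values copied from the slide.
terms = ["The", "conduct", "as", "run", "doctor", "with", "a", "Patel"]
tf = [47, 3, 7, 5, 8, 12, 54, 6]
df = [1000, 198, 999, 567, 48, 998, 100, 3]

N = 1000  # assumed corpus size; not stated on the slide

def tfidf(tf_i, df_i, n_docs):
    """TFIDF_i = TF_i * log(N / DF_i), using the natural logarithm."""
    return tf_i * math.log(n_docs / df_i)

weights = {t: tfidf(f, d, N) for t, f, d in zip(terms, tf, df)}

# Print terms from highest to lowest weight.
for term, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:8s} {weight:7.2f}")
```

Under these assumptions a term that appears in every document ("The") gets weight 0, however often it occurs in the document.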
Text Document Vectorization

[Figure: a directory of plain text documents is converted into a sequence file of <name, text body> pairs.]

mahout seqdirectory -i <input_path> -o <seq_output_path>

[Figure: stages shown are n-gram generation, term frequency, inverse document frequency and TF-IDF vector generation, together with a dictionary file.]

mahout seq2sparse -i <seq_input_path> -o <output_path>
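The n-gram generation stage can be illustrated with a short sketch. This is not what `seq2sparse` runs internally; the whitespace tokenisation and the `ngrams` helper are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, max_n):
    """Return all n-grams of length 1..max_n as tuples, shortest first."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[start:start + n]))
    return grams

tokens = "crude oil prices".split()
grams = ngrams(tokens, 3)
print(grams)

# Term frequency for one document is then a count over its n-grams.
term_frequency = Counter(grams)
print(term_frequency[("crude", "oil")])
```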
K-means clustering
Run k-means clustering:

mahout kmeans
-i <input vectors directory>
-c <input clusters directory>
-o <output working directory>
-k <# clusters sampled from input>
-dm <DistanceMeasure>
-x <maximum number of iterations>
-xm <execution method: seq/mapreduce>
…

Distance measures (package org.apache.mahout.common.distance):
        –  CosineDistanceMeasure
        –  EuclideanDistanceMeasure
        –  ManhattanDistanceMeasure
        –  SquaredEuclideanDistanceMeasure
        –  …

Inspect the result:

mahout clusterdump
-dt sequencefile
-d <dictionary_file>
-s <input_seq_file>

Top terms per cluster:

Cluster 1                   Cluster 2
oil        =>  6.20         Coresponsibility => 13.97
barrel     =>  5.15         cereals          => 13.51
crude      =>  5.06         penalise         => 13.25
prices     =>  4.50         farmers          => 11.99
opec       =>  3.23         levies           => 11.60
price      =>  2.77         ceilings         => 11.52
dlrs       =>  2.76         ec               => 11.07
said       =>  2.70         ministers        => 10.55
bpd        =>  2.45         output           =>  9.57
petroleum  =>  1.99         09.73            =>  9.18
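To show what the distance measure and iteration limit control, here is a minimal single-machine k-means sketch. It is not Mahout's code: the function names are hypothetical, the points are made up, and the initial centroids are passed in explicitly (loosely analogous to supplying an input clusters directory).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def kmeans(points, centroids, distance, max_iterations=10):
    """Alternate assignment and centroid update until stable or the limit is hit."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
centroids, assignment = kmeans(points, [(0.0, 0.0), (10.0, 10.0)], euclidean)
print(centroids)
print(assignment)
```

Swapping `euclidean` for `cosine_distance` changes only how "nearest" is judged, which is the role the distance measure option plays on the command line.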
Classification
•  Train the machine to provide discrete answers to a
   specific question.
Mahout supports the following algorithms:
        –  Logistic Regression
        –  Naïve Bayes
        –  Random Forests
Others in development.

[Figure: data with known answers (100100: A, 010011: A, 010110: B, 100101: A, 010101: B) feeds a model training algorithm, which produces a trained model. The trained model takes data without answers (101100: ?, 010010: ?, 001000: ?, 100000: ?, 001001: ?) and returns data with estimated answers (101100: A, 000110: B, 011100: A, 101101: A, 010111: B).]
Classification Workflow
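[Figure: classification workflow. The training set is labelled and vectorized, then sampled into ~90% and ~10% subsets. (1) Model training uses the ~90% sample. (2) Model testing uses the ~10% sample. (3) The trained model is applied to the vectorized input set to produce a label approximation (A, B, A, A, B, …).]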
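The ~90% / ~10% sampling step of this workflow can be sketched as follows. The labelled examples reuse the bit strings from the earlier classification slide, repeated to give 20 rows; the majority-label "model" is a deliberately trivial stand-in, and all function names are hypothetical.

```python
import random
from collections import Counter

def split_labelled(examples, train_fraction=0.9, seed=0):
    """Shuffle labelled examples and split them into training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def train_majority_model(train_set):
    """Placeholder 'training': always predict the most common training label."""
    most_common = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda features: most_common

def accuracy(model, test_set):
    """Fraction of held-out examples whose predicted label matches the known one."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

examples = [("100100", "A"), ("010011", "A"), ("010110", "B"),
            ("100101", "A"), ("010101", "B")] * 4  # 20 labelled rows

train, test = split_labelled(examples)
model = train_majority_model(train)
acc = accuracy(model, test)
print(len(train), len(test), acc)
```

Keeping the test sample out of training is what makes the measured accuracy a fair estimate of how the model will behave on the unlabelled input set.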
Feature Extraction
•  Good feature extraction is critical to
   trained model performance:
        – Need domain understanding to ‘measure’ the
          right things.
        – Measure wrong things, even the best model
          will perform badly.
        – Caution needed to avoid ‘label leaks’.
•  Will typically require hand-written map-reduce code:
        – If text based, can use text mining tools in
          Hive or Mahout.
Naïve Bayes Classifier
Given a feature vector $(f_1, f_2, \ldots, f_n)$, the classifier returns the most probable label $l$:

$$\operatorname{classify}(f_1, f_2, \ldots, f_n) = \operatorname*{argmax}_{l} \; p(L = l) \prod_{i=1}^{n} p(F_i = f_i \mid L = l)$$

Here $p(F_i = f_i \mid L = l)$ is the probability of feature $i$ having value $f_i$ given $l$, e.g. assume a Gaussian pdf:

$$P(f = v \mid l) = \frac{1}{\sqrt{2\pi\sigma_l^2}} \, e^{-\frac{(v - \mu_l)^2}{2\sigma_l^2}}$$

 Note: Model training boils down to estimating the conditional mean and variance of the
 feature vector elements. This can be trivially parallelized and implemented in
 map reduce.
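A minimal single-machine sketch of the Gaussian naïve Bayes rule, assuming made-up two-feature training data and hypothetical function names. Scores are summed in log space rather than multiplied, which gives the same argmax while avoiding underflow; the small variance floor is also an assumption added for numerical safety.

```python
import math
from collections import defaultdict

def train(examples):
    """examples: list of (feature_vector, label).
    Returns {label: (prior, per-feature means, per-feature variances)}."""
    by_label = defaultdict(list)
    for features, label in examples:
        by_label[label].append(features)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # floor avoids division by zero
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(examples), means, variances)
    return model

def log_gaussian(v, mean, var):
    """Log of the Gaussian pdf with the given mean and variance, evaluated at v."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)

def classify(model, features):
    """argmax over labels of log prior plus summed log likelihoods."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(features, means, variances))
    return max(model, key=score)

# Illustrative training data: two features per example, labels A and B.
examples = [
    ((1.0, 2.1), "A"), ((1.2, 1.9), "A"), ((0.8, 2.0), "A"),
    ((3.0, 0.5), "B"), ((3.2, 0.4), "B"), ((2.9, 0.6), "B"),
]
model = train(examples)
print(classify(model, (1.1, 2.0)))
print(classify(model, (3.1, 0.5)))
```

Training only needs per-label counts, sums and sums of squares, which is why it maps naturally onto map-reduce.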
Naïve Bayes in Mahout
 Command line specific to text classification (e.g. SPAM detection, document
 classification, etc.)

mahout trainclassifier -i <input_path> -o <output_path>
--gramSize <n_gram_size> -minDf <minimum_DF> -minSupport <min_TF> …

        –  <input_path>: plain text file, format: label \t word word word …
        –  <output_path>: generated model, a set of files in sequence file format (variances).
        –  --gramSize: n-gram size, default = 1.
        –  -minDf: discard n-grams that occur in fewer than this number of documents.
        –  -minSupport: discard n-grams that occur fewer than this number of times in a document.
Naïve Bayes in Mahout
 •  Need to write your own classifier to be
    practical.

[Figure: a document goes into the classifier, which uses the trained model to output a label.]

Look at class:
org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm
• Classify document;
• Return top n predicted labels;
• Return classification certainty;
• …

Classification vs. Recommendation
•  Can use a classifier to recommend:
        – Interested in item or not interested?
•  Classifier is based on features of the
   specific item and the customer
•  Recommendation based on past behavior
   of customers
•  Classification: single decisions
•  Recommendation: ranking


NICTA Copyright 2011     From imagination to impact   21

More Related Content

What's hot

Scalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduceScalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduce
Pietro Michiardi
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
Gabriela Agustini
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
Muralidharan Deenathayalan
 
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Sparksscdotopen
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
PyData
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReduce
Pietro Michiardi
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Deanna Kosaraju
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
Ed Kohlwey
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
rhatr
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
Subhas Kumar Ghosh
 
Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formats
Vigen Sahakyan
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
DB Tsai
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
Jen Aman
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Vijay Srinivas Agneeswaran, Ph.D
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
Stefanie Zhao
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
David Gleich
 

What's hot (20)

Scalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduceScalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduce
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Spark
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReduce
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formats
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
 
Spark at-hackthon8jan2014
Spark at-hackthon8jan2014Spark at-hackthon8jan2014
Spark at-hackthon8jan2014
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 

Viewers also liked

Principio de la prueba
Principio de la pruebaPrincipio de la prueba
Principio de la prueba
ANDREINA hernandez
 
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTSHOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTSSunil Kakade
 
Modul kelompok viii_-_copy_-_copy[1]
Modul kelompok viii_-_copy_-_copy[1]Modul kelompok viii_-_copy_-_copy[1]
Modul kelompok viii_-_copy_-_copy[1]
rillafebrila
 
UGI Auburn Line Extension Map & Details
UGI Auburn Line Extension Map & DetailsUGI Auburn Line Extension Map & Details
UGI Auburn Line Extension Map & Details
Marcellus Drilling News
 
Business model presentation
Business model presentationBusiness model presentation
Business model presentation
Maria Kyamulabye
 
TorchFi platform
TorchFi platformTorchFi platform
TorchFi platform
anupbalagopal
 
Digital Citizenship & Internet Maturity for Schools
Digital Citizenship & Internet Maturity for SchoolsDigital Citizenship & Internet Maturity for Schools
Digital Citizenship & Internet Maturity for Schools
Raghu Pandey
 
Marketing plan for android app
Marketing plan for android appMarketing plan for android app
Marketing plan for android app
Sai Sachin
 
Machine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and futureMachine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and future
Cloudera, Inc.
 
UX Designer Skills
UX Designer SkillsUX Designer Skills
UX Designer Skills
Phowr Quang
 

Viewers also liked (12)

Principio de la prueba
Principio de la pruebaPrincipio de la prueba
Principio de la prueba
 
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTSHOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
 
Modul kelompok viii_-_copy_-_copy[1]
Modul kelompok viii_-_copy_-_copy[1]Modul kelompok viii_-_copy_-_copy[1]
Modul kelompok viii_-_copy_-_copy[1]
 
UGI Auburn Line Extension Map & Details
UGI Auburn Line Extension Map & DetailsUGI Auburn Line Extension Map & Details
UGI Auburn Line Extension Map & Details
 
Business model presentation
Business model presentationBusiness model presentation
Business model presentation
 
Resume-santosh
Resume-santoshResume-santosh
Resume-santosh
 
RBC
RBCRBC
RBC
 
TorchFi platform
TorchFi platformTorchFi platform
TorchFi platform
 
Digital Citizenship & Internet Maturity for Schools
Digital Citizenship & Internet Maturity for SchoolsDigital Citizenship & Internet Maturity for Schools
Digital Citizenship & Internet Maturity for Schools
 
Marketing plan for android app
Marketing plan for android appMarketing plan for android app
Marketing plan for android app
 
Machine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and futureMachine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and future
 
UX Designer Skills
UX Designer SkillsUX Designer Skills
UX Designer Skills
 

Similar to Machine Learning with Mahout

Introduction to R software, by Leire ibaibarriaga
Introduction to R software, by Leire ibaibarriaga Introduction to R software, by Leire ibaibarriaga
Introduction to R software, by Leire ibaibarriaga
DTU - Technical University of Denmark
 
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialBuilding Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Xavier Amatriain
 
Linked Open Data to support content based Recommender Systems
Linked Open Data to support content based Recommender SystemsLinked Open Data to support content based Recommender Systems
Linked Open Data to support content based Recommender SystemsVito Ostuni
 
Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…
Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…
Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…
Roku
 
Hanna bosc2010
Hanna bosc2010Hanna bosc2010
Hanna bosc2010BOSC 2010
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Herman Wu
 
Recommendations play @flipkart (3)
Recommendations play @flipkart (3)Recommendations play @flipkart (3)
Recommendations play @flipkart (3)hava101
 
Big data analytics with R tool.pptx
Big data analytics with R tool.pptxBig data analytics with R tool.pptx
Big data analytics with R tool.pptx
salutiontechnology
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R Programming
IRJET Journal
 
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Alessandro Suglia
 
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Claudio Greco
 
Dynamic Synthesis of Mediators to Support Interoperability in Autonomic Systems
Dynamic Synthesis of Mediators to Support Interoperability in Autonomic SystemsDynamic Synthesis of Mediators to Support Interoperability in Autonomic Systems
Dynamic Synthesis of Mediators to Support Interoperability in Autonomic SystemsAmel Bennaceur
 
Using R with Hadoop
Using R with HadoopUsing R with Hadoop
Using R with Hadoop
Revolution Analytics
 
Resume
ResumeResume
Resume
muddanas
 
Srinivas Muddana Resume
Srinivas Muddana ResumeSrinivas Muddana Resume
Srinivas Muddana Resume
muddanas
 
Srinivas Muddana Resume
Srinivas Muddana ResumeSrinivas Muddana Resume
Srinivas Muddana Resume
muddanas
 
Srinivas Muddana Resume
Srinivas Muddana ResumeSrinivas Muddana Resume
Srinivas Muddana Resume
muddanas
 
Intro to data science module 1 r
Intro to data science module 1 rIntro to data science module 1 r
Intro to data science module 1 r
amuletc
 
Scala Days NYC 2016
Scala Days NYC 2016Scala Days NYC 2016
Scala Days NYC 2016
Martin Odersky
 


Machine Learning with Mahout

  • 1. Machine Learning With Mahout. Rami Mukhtar, Big Data Group, National ICT Australia, February 2012
  • 2. Mahout: Brief History
      •  Started in 2008 as a subproject of Apache Lucene:
          –  Text mining, clustering, some classification.
      •  Sean Owen started Taste in 2005:
          –  Recommender engine for a business that never took off;
          –  Mahout community asked to merge in the Taste code.
      •  Became a top-level Apache project in April 2010.
      •  This lineage resulted in a fragmented framework.
  • 3. Mahout: What is it?
      •  Collection of machine learning algorithm implementations:
          –  Many (not all) implemented on Hadoop MapReduce;
          –  Java library with a handy command line interface to run common tasks.
      •  Currently serves 3 key areas:
          –  Recommendation engines
          –  Clustering
          –  Classification
      •  Focus of today's talk is on functionality accessible from the command line interface:
          –  Most accessible for Hadoop beginners.
  • 4. Recommenders
      •  Supports user-based and item-based collaborative filtering:
          –  User based: similarity between users;
          –  Item based: similarity between items.
      [Figure: users 1 to 6 linked to items A to F; annotations: "user 3 likes item E" and "other items user 1 may like".]
  • 5. Implementations
      •  Non-distributed (no Hadoop requirement):
          –  The 'Taste' code; supports item-based and user-based recommenders;
          –  Good for up to 100 million user-item associations;
          –  Faster than the distributed version.
      •  Distributed (Hadoop MapReduce):
          –  Item based, using a configurable similarity measure between items;
          –  Latent factor based:
              •  Estimates 'genres' of items from user preferences;
              •  Similar to the entry that won the Netflix Prize.
          –  Both have command line interfaces.
  • 6. Distributed Item Recommender
      [Figure: a user-item ratings matrix (users 1 to m by items 1 to n, entries R where a rating exists) feeds a similarity calculation that produces an n-by-n item similarity matrix (1 on the diagonal; example off-diagonal entries .2, .8, .5, .6).]
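The similarity calculation on this slide can be sketched in a few lines. This is a minimal single-machine illustration, not Mahout's implementation: it assumes cosine similarity as the measure (the deck only says the measure is configurable), treats 0.0 as "unrated", and uses made-up ratings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length rating columns (0.0 means unrated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an item nobody rated is treated as similar to nothing
    return dot / (norm_a * norm_b)


def item_similarity_matrix(ratings):
    """ratings[u][i] is user u's rating of item i; returns an n x n similarity matrix."""
    n_items = len(ratings[0])
    # Turn the user rows into one column of ratings per item.
    columns = [[row[i] for row in ratings] for i in range(n_items)]
    return [
        [cosine_similarity(columns[i], columns[j]) for j in range(n_items)]
        for i in range(n_items)
    ]


# Hypothetical 3-user x 3-item ratings matrix.
ratings = [
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 4.0],
    [0.0, 2.0, 5.0],
]
similarity = item_similarity_matrix(ratings)
```

Each item is compared with every other item using only the column of ratings it received, which is why the result is symmetric with ones on the diagonal.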
  • 7. Distributed Item Recommendation
      Input: csv file with rows of user, item, rating.
      Output: csv file with rows of item, item, similarity.
          mahout itemsimilarity -i <input_file> -o <output_path> …
      [Figure: user-item ratings matrix in which one user has rated items 2 and 5 (R2, R5) but not item 3 (R3?).]
      The missing rating is estimated as a similarity-weighted average of that user's known ratings:
          $R_3 = \frac{s_{2,3} R_2 + s_{3,5} R_5}{s_{2,3} + s_{3,5}}$
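The weighted-average formula on this slide can be written as a short function. The sketch below is illustrative only: the function name, the dictionary layout and the numeric values are assumptions, not part of Mahout.

```python
def predict_rating(known_ratings, similarities):
    """Similarity-weighted average of a user's known ratings.

    known_ratings: {item: rating} for items the user has already rated.
    similarities:  {item: similarity between that item and the target item}.
    """
    shared = [item for item in known_ratings if item in similarities]
    weight_sum = sum(similarities[item] for item in shared)
    if weight_sum == 0:
        return None  # no similar rated items, so no estimate can be made
    weighted = sum(similarities[item] * known_ratings[item] for item in shared)
    return weighted / weight_sum


# Hypothetical values: the user rated items 2 and 5; item 3 is the unknown.
# s(2,3) = 0.8 and s(3,5) = 0.2, so item 2's rating dominates the estimate.
estimate = predict_rating({2: 4.0, 5: 2.0}, {2: 0.8, 5: 0.2})
```

Dividing by the sum of the similarities keeps the estimate on the same scale as the ratings themselves.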
  • 8. Distributed Item Recommendation
      •  Can perform item similarity and recommendation generation in a single call:
          mahout recommenditembased -i <input_path> -o <output_path> -u <users_file> --numRecommendations …
          –  -i: input csv file (tab separated): user, item, rating …
          –  -o: output csv file (tab separated): user,item,score,item,score,…
          –  -u: csv file (tab separated) of users: user user user
          –  --numRecommendations: number of recommendations to return per user.
  • 9. Clustering
      [Figure: pipeline stages labelled Vectorization and Clustering.]
  • 10. Clustering
      •  Don't know the structure of the data; want to sensibly group things together.
      •  A number of distributed algorithms are supported:
          –  Canopy Clustering (MAHOUT-3 – integrated)
          –  K-Means Clustering (MAHOUT-5 – integrated)
          –  Fuzzy K-Means (MAHOUT-74 – integrated)
          –  Expectation Maximization (EM) (MAHOUT-28)
          –  Mean Shift Clustering (MAHOUT-15 – integrated)
          –  Hierarchical Clustering (MAHOUT-19)
          –  Dirichlet Process Clustering (MAHOUT-30 – integrated)
          –  Latent Dirichlet Allocation (MAHOUT-123 – integrated)
          –  Spectral Clustering (MAHOUT-363 – integrated)
          –  Minhash Clustering (MAHOUT-344 – integrated)
      •  Some have command line interface support.
  • 11. Vectorization
      •  Data specific:
          –  In the majority of cases you need to write a MapReduce job to generate vectorized input;
          –  Input formats are still not uniform across Mahout;
          –  Most clustering implementations expect:
              •  SequenceFile(WritableComparable, VectorWritable)
              •  Note: the key is ignored.
      •  Mahout has some support for clustering text documents:
          –  Can generate n-gram Term Frequency-Inverse Document Frequency (TF-IDF) vectors from a directory of text documents;
          –  Enables text documents to be clustered using the command line interface.
  • 12. Text Document Vectorization
      Term frequency (counts within one document):
          Term:  The  | conduct | as  | run | doctor | with | a   | Patel
          TF:    47   | 3       | 7   | 5   | 8      | 12   | 54  | 6
      Document frequency (counts across the corpus):
          Term:  The  | conduct | as  | run | doctor | with | a   | Patel
          DF:    1000 | 198     | 999 | 567 | 48     | 998  | 100 | 3
      TF-IDF weight of term i, where N is the number of documents in the corpus:
          $\mathrm{TFIDF}_i = \mathrm{TF}_i \cdot \log \frac{N}{\mathrm{DF}_i}$
      This increases the weight of less common words/n-grams within the corpus.
      N-gram examples:
          Unigram: (crude)
          Bi-gram: (crude, oil)
          Tri-gram: (crude, oil, prices)
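The weighting on this slide can be reproduced with the slide's own counts. This is a sketch under stated assumptions: the corpus size N is not given in the deck, so 1000 documents is assumed here (the largest document frequency shown), the natural logarithm is used, and the helper names are invented for illustration.

```python
import math


def tf_idf(tf, df, n_docs):
    """TF-IDF weight: term frequency scaled by log(N / document frequency)."""
    return tf * math.log(n_docs / df)


def ngrams(tokens, n):
    """All contiguous n-token windows, e.g. n=2 gives bi-grams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


terms = ["The", "conduct", "as", "run", "doctor", "with", "a", "Patel"]
term_freq = [47, 3, 7, 5, 8, 12, 54, 6]
doc_freq = [1000, 198, 999, 567, 48, 998, 100, 3]
N_DOCS = 1000  # assumption: the slide does not state the corpus size

weights = {
    term: tf_idf(tf, df, N_DOCS)
    for term, tf, df in zip(terms, term_freq, doc_freq)
}

bigrams = ngrams(["crude", "oil", "prices"], 2)
trigrams = ngrams(["crude", "oil", "prices"], 3)
```

Under these assumptions a word that appears in every document ("The") gets weight zero, while a rare word ("Patel") outweighs a more frequent but more widespread one ("doctor").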
  • 13. Text Document Vectorization
      Step 1: convert a directory of plain text documents into a sequence file of <name, text body> pairs:
          mahout seqdirectory -i <input_path> -o <seq_output_path>
      Step 2: generate sparse vectors from the sequence file:
          mahout seq2sparse -i <seq_input_path> -o <output_path>
      [Figure: seq2sparse stages labelled Ngram generation, Dictionary file, Term Frequency Vec. Gen., Inverse Document Freq., and TF-IDF.]
  • 14. K-means clustering
      Run k-means clustering:
          mahout kmeans \
              -i <input vectors directory> \
              -c <input clusters directory> \
              -o <output working directory> \
              -k <# clusters sampled from input> \
              -dm <DistanceMeasure> \
              -x <maximum number of iterations> \
              -xm <execution method: seq/mapreduce> \
              …
      Distance measures (in org.apache.mahout.common.distance): CosineDistanceMeasure, EuclideanDistanceMeasure, ManhattanDistanceMeasure, SquaredEuclideanDistanceMeasure, …
      Inspect the result:
          mahout clusterdump \
              -dt sequencefile \
              -d <dictionary_file> \
              -s <input_seq_file>
      Example output (top terms per cluster):
          Cluster 1              Cluster 2
          oil => 6.20            Coresponsibility => 13.97
          barrel => 5.15         cereals => 13.51
          crude => 5.06          penalise => 13.25
          prices => 4.50         farmers => 11.99
          opec => 3.23           levies => 11.60
          price => 2.77          ceilings => 11.52
          dlrs => 2.76           ec => 11.07
          said => 2.70           ministers => 10.55
          bpd => 2.45            output => 9.57
          petroleum => 1.99      09.73 => 9.18
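The distance measures named on this slide have simple textbook definitions, sketched below together with the assignment step of k-means. These are the standard formulas written from scratch, not Mahout's classes; in particular, cosine distance is taken here as 1 minus cosine similarity, and the vectors are invented.

```python
import math


def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(squared_euclidean(a, b))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction only, not vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms if norms else 1.0


def nearest_centroid(point, centroids, distance):
    """Index of the closest centroid: the assignment step of k-means."""
    return min(range(len(centroids)), key=lambda k: distance(point, centroids[k]))


# Hypothetical 2-D vectors: the point is assigned to the first centroid.
cluster = nearest_centroid([1.0, 1.0], [[0.0, 0.0], [5.0, 5.0]], euclidean)
```

Cosine distance is a common choice for TF-IDF document vectors because it ignores document length, whereas the Euclidean and Manhattan measures do not.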
  • 15. Classification
      •  Train the machine to provide discrete answers to a specific question.
      •  Mahout supports the following algorithms:
          –  Logistic Regression
          –  Naïve Bayes
          –  Random Forests
          –  Others in development
      [Figure: data with known answers (100100: A, 010011: A, 010110: B, 100101: A, 010101: B) is fed to a model training algorithm, producing a trained model. Data without answers (101100: ?, 010010: ?, 001000: ?, 100000: ?, 001001: ?) is fed to the trained model, producing data with estimated answers (101100: A, 000110: B, 011100: A, 101101: A, 010111: B).]
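  • 16. Classification Workflow
      [Figure: three-step workflow. (1) Model training: labelled input is vectorized and a ~90% sample forms the training set. (2) Model testing: the remaining ~10% sample is used to test the model. (3) A new input set is vectorized and passed to the trained model, which outputs a label approximation for each item (A, B, A, A, B, …).]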
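The ~90% / ~10% split in this workflow can be sketched as follows. The function names, the fixed random seed, the toy examples and the stand-in model are all assumptions made for illustration; any trained classifier could take the model's place.

```python
import random


def split_labelled(examples, test_fraction=0.1, seed=0):
    """Shuffle labelled examples and hold out roughly test_fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def accuracy(model, test_set):
    """Fraction of held-out examples whose predicted label matches the known one."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)


# Hypothetical labelled data in the style of the deck: bit strings labelled A or B.
examples = [(format(i, "06b"), "A" if i % 2 == 0 else "B") for i in range(20)]
training_set, test_set = split_labelled(examples)


def last_bit_model(features):
    """Stand-in for a trained model: predicts from the last bit only."""
    return "A" if features.endswith("0") else "B"


score = accuracy(last_bit_model, test_set)
```

Keeping the test examples out of training is what makes the measured accuracy an honest estimate of how the model behaves on unlabelled input.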
  • 17. Feature Extraction
      •  Good feature extraction is critical to trained model performance:
          –  Need domain understanding to 'measure' the right things;
          –  Measure the wrong things and even the best model will perform badly;
          –  Caution is needed to avoid 'label leaks'.
      •  Will typically require hand-written MapReduce code:
          –  If text based, can use the text mining tools in Hive or Mahout.
  • 18. Naïve Bayes Classifier
      Given a feature vector $(f_1, f_2, \ldots, f_n)$, the predicted label is
          $\mathrm{classify}(f_1, f_2, \ldots, f_n) = \arg\max_{l} \; p(L = l) \prod_{i=1}^{n} p(F_i = f_i \mid L = l)$
      where $p(F_i = f_i \mid L = l)$ is the probability of feature $i$ having value $f_i$ given label $l$; e.g. assume a Gaussian pdf:
          $P(f = v \mid l) = \frac{1}{\sqrt{2\pi\sigma_l^2}} \, e^{-\frac{(v - \mu_l)^2}{2\sigma_l^2}}$
      Note: model training boils down to estimating the conditional mean and variance of the feature vector elements. This can be trivially parallelized and implemented in MapReduce.
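The classifier on this slide can be sketched end to end for the Gaussian case. This is a minimal single-machine illustration with invented data and function names, not Mahout's implementation; it compares log-probabilities rather than the raw product, which picks the same label while avoiding numerical underflow.

```python
import math
from collections import defaultdict


def train(examples):
    """Estimate, per label, the prior and each feature's mean and variance."""
    by_label = defaultdict(list)
    for features, label in examples:
        by_label[label].append(features)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        # Small constant keeps the variance non-zero for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(examples), means, variances)
    return model


def log_gaussian(v, mean, var):
    """Log of the Gaussian pdf with the given mean and variance, evaluated at v."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)


def classify(model, features):
    """argmax over labels of log p(L=l) + sum_i log p(F_i=f_i | L=l)."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(features, means, variances)
        )
    return max(model, key=score)


# Hypothetical two-feature training data with labels A and B.
examples = [
    ([1.0, 1.2], "A"), ([0.8, 1.0], "A"), ([1.2, 0.9], "A"),
    ([5.0, 5.1], "B"), ([4.8, 5.3], "B"), ([5.2, 4.9], "B"),
]
model = train(examples)
predicted = classify(model, [1.1, 1.0])
```

Training only needs per-label counts, sums and sums of squares, which is why it maps naturally onto a MapReduce job: mappers emit partial sums keyed by label and reducers combine them.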
  • 19. Naïve Bayes in Mahout
      The command line is specific to text classification (e.g. spam detection, document classification, etc.).
      Input: plain text file, format: label \t word word word …
      Output: generated model, a set of files in sequence file format (variances).
          mahout trainclassifier -i <input_path> -o <output_path> --gramSize <n_gram_size> -minDf <minimum_DF> -minSupport <min_TF> …
          –  --gramSize: n-gram size, default = 1.
          –  -minDf: discard n-grams that occur in fewer than this number of documents.
          –  -minSupport: discard n-grams that occur fewer than this number of times in a document.
  • 20. Naïve Bayes in Mahout
      •  Need to write your own classifier to be practical.
      [Figure: document → Classifier (using the Trained Model) → label.]
      •  Look at class org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm:
          –  Classify document;
          –  Return top n predicted labels;
          –  Return classification certainty;
          –  …
  • 21. Classification vs. Recommendation
      •  Can use a classifier to recommend:
          –  Interested in the item, or not interested?
      •  A classifier is based on features of the specific item and the customer.
      •  Recommendation is based on the past behavior of customers.
      •  Classification: single decisions.
      •  Recommendation: ranking.