Slide 1: Classification with Naïve Bayes: A Deep Dive into Apache Mahout
Slide 2: Today’s speaker: Josh Patterson
- josh@cloudera.com / Twitter: @jpatanooga
- Master’s thesis: self-organizing mesh networks
- Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”
- Conceived, built, and led Hadoop integration for the openPDC project at TVA (smart grid work)
- Led a small team that designed classification techniques for time series and MapReduce
- Open source work at http://openpdc.codeplex.com
- Now: Solutions Architect at Cloudera
Slide 3: What is Classification?
- Supervised learning
- We give the system a set of instances to learn from
- The system builds knowledge of some structure; it learns “concepts”
- The system can then classify new instances
Slide 4: Supervised vs. Unsupervised Learning
- Supervised
  - Give the system examples/instances of multiple concepts
  - The system learns the “concepts”
  - More “hands on”
  - Examples: naïve Bayes, neural nets
- Unsupervised
  - Uses unlabeled data
  - Builds a joint density model
  - Example: k-means clustering
Slide 5: Naïve Bayes
- Called naïve Bayes because it is based on Bayes’ rule and “naively” assumes independence given the label
  - It is only valid to multiply probabilities when the events are independent
  - A simplistic assumption in real life
- Despite the name, naïve Bayes works well on actual datasets
Slide 6: Naïve Bayes Classifier
- A simple probabilistic classifier based on applying Bayes’ theorem (from Bayesian statistics) with strong (naïve) independence assumptions
- A more descriptive term for the underlying probability model would be “independent feature model”
Slide 7: Naïve Bayes Classifier (2)
- Assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature
- Example: a fruit may be considered an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or on the remaining features, a naïve Bayes classifier treats all of them as contributing independently to the probability that the fruit is an apple (a sketch follows below)
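To make that independence assumption concrete, here is a minimal Java sketch; the class name, feature strings, and probability values are hypothetical illustrations, not Mahout API. The score for a class is its prior multiplied by each per-feature conditional probability, as if the features really were independent:

    import java.util.Map;

    // Minimal sketch of the naive independence assumption: the score for a
    // class is its prior times p(feature | class) for each observed feature.
    public class NaiveScore {
        static double score(double classPrior,
                            Map<String, Double> featureGivenClass,
                            String[] observedFeatures) {
            double p = classPrior;
            for (String f : observedFeatures) {
                // Small fallback so one unseen feature does not zero the product.
                p *= featureGivenClass.getOrDefault(f, 1e-6);
            }
            return p;
        }

        public static void main(String[] args) {
            // Illustrative (made-up) estimates for the "apple" class.
            Map<String, Double> apple = Map.of("red", 0.7, "round", 0.8, "4in", 0.6);
            double s = score(0.3, apple, new String[] {"red", "round", "4in"});
            System.out.println("score(apple) = " + s); // unnormalized posterior
        }
    }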
Slide 8: A Little Bit o’ Theory
Slide 9: Condensing Meaning
- To train our system we need:
  - The total number of input training instances (count)
  - Counts of tuples {attribute_n, outcome_o, value_m}
  - Total counts of each outcome_o ({outcome-count})
- To calculate each Pr[E_n | H] (see the sketch below):
  - Pr[E_n | H] = count{attribute_n, outcome_o, value_m} / {outcome-count}
- …from the vapor of that last big equation
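A hedged sketch of that counting in plain Java (again, not Mahout code; the string-keyed maps are just one way to tally {attribute, outcome, value} triples and outcome totals):

    import java.util.HashMap;
    import java.util.Map;

    // Count-based training: Pr[E_n | H] is estimated as
    // count(attribute_n = value_m, outcome_o) / count(outcome_o).
    public class CountTrainer {
        Map<String, Integer> tripleCounts = new HashMap<>();  // "attr=value|outcome" -> count
        Map<String, Integer> outcomeCounts = new HashMap<>(); // outcome -> count

        void observe(String attr, String value, String outcome) {
            tripleCounts.merge(attr + "=" + value + "|" + outcome, 1, Integer::sum);
            outcomeCounts.merge(outcome, 1, Integer::sum);
        }

        double prEGivenH(String attr, String value, String outcome) {
            int joint = tripleCounts.getOrDefault(attr + "=" + value + "|" + outcome, 0);
            int total = outcomeCounts.getOrDefault(outcome, 0);
            return total == 0 ? 0.0 : (double) joint / total;
        }
    }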
Slide 10: A Real Example from Witten et al.
Slide 11: Enter Apache Mahout
- What is it? Apache Mahout is a scalable machine learning library that supports large data sets
- What are the major algorithm types?
  - Classification
  - Recommendation
  - Clustering
- http://mahout.apache.org/
Slide 12: Mahout Algorithms
Slide 13: Naïve Bayes and Text
- Naïve Bayes does not model text well; see “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”: http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
- Mahout makes some modifications based around TF-IDF scoring (next slide)
- It includes two other pre-processing steps, common for information retrieval but not for naïve Bayes classification
Slide 14: High-Level Algorithm
- For each feature (word) in each doc, calculate the “weight-normalized TF-IDF”: for a given feature in a label, this is the TF-IDF computed using the standard IDF multiplied by the weight-normalized TF
- We calculate the sum of the weight-normalized TF-IDF over all the features in a label, called Sigma_k, and set alpha_i = 1.0
- Weight = log[ (weight-normalized TF-IDF + alpha_i) / (Sigma_k + N) ] (sketched in code below)
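A sketch of that weight as code, under stated assumptions: wnTfIdf, sigmaK, and alphaI come straight from the slide, while N is taken here to be the vocabulary size, which the slide does not define. This restates the formula; it is not Mahout’s implementation:

    // Sketch of the weight formula from the slide:
    //   weight = log( (wnTfIdf + alpha_i) / (Sigma_k + N) )
    public class FeatureWeight {
        // wnTfIdf: weight-normalized TF-IDF of the feature in the label
        // sigmaK:  sum of wnTfIdf over all features in the label
        // alphaI:  smoothing term, 1.0 per the slide
        // n:       assumed here to be the vocabulary size (the slide leaves N undefined)
        static double weight(double wnTfIdf, double sigmaK, double alphaI, double n) {
            return Math.log((wnTfIdf + alphaI) / (sigmaK + n));
        }
    }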
Slide 15: BayesDriver Training Workflow
- The naïve Bayes training MapReduce workflow in Mahout
Slide 16: Logical Classification Process
- Gather, clean, and examine the training data: really get to know your data!
- Train the classifier, allowing the system to “learn” the “concepts”, but not to overfit this specific training data set
- Classify new, unseen instances: with naïve Bayes we calculate the probability of each class with respect to the instance
Slide 17: How Is Classification Done?
- Sequentially or via MapReduce
- TestClassifier.java (sketched below):
  - Creates a ClassifierContext
  - For each file in the directory, for each line:
    - Break the line into a map of tokens
    - Feed the array of words to the classifier engine for a new classification/label
    - Collect the classifications as output
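A rough sketch of that loop in plain Java; classifyTokens() is a hypothetical stand-in for the ClassifierContext call, and the whitespace tokenizer is a simplification:

    import java.io.IOException;
    import java.nio.file.*;

    // Sketch of the per-file, per-line classification loop described above.
    public class ClassifyDir {
        // Hypothetical stand-in for the real classifier engine.
        static String classifyTokens(String[] tokens) { return "unknown"; }

        public static void main(String[] args) throws IOException {
            try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get(args[0]))) {
                for (Path file : dir) {                            // for each file in dir
                    for (String line : Files.readAllLines(file)) { // for each line
                        String[] tokens = line.toLowerCase().split("\\s+"); // tokenize
                        String label = classifyTokens(tokens);     // new classification/label
                        System.out.println(file.getFileName() + "\t" + label);
                    }
                }
            }
        }
    }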
Slide 18: A Quick Note About Training Data…
- Your classifier can only be as good as the training data lets it be
- If you don’t do good data prep, everything will perform poorly
- Data collection and pre-processing takes the bulk of the time
Slide 19: Enough Math, Run the Code
- Download and install Mahout: http://mahout.apache.org/
- Run the 20 Newsgroups example: https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
  - Uses naïve Bayes classification
  - Download and extract 20news-bydate.tar.gz from the 20 Newsgroups dataset
Slide 20: Generate Test and Train Datasets

Training dataset:

    mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
      -p examples/bin/work/20news-bydate/20news-bydate-train \
      -o examples/bin/work/20news-bydate/bayes-train-input \
      -a org.apache.mahout.vectorizer.DefaultAnalyzer \
      -c UTF-8

Test dataset:

    mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
      -p examples/bin/work/20news-bydate/20news-bydate-test \
      -o examples/bin/work/20news-bydate/bayes-test-input \
      -a org.apache.mahout.vectorizer.DefaultAnalyzer \
      -c UTF-8
Slide 21: Train and Test the Classifier

Train:

    $MAHOUT_HOME/bin/mahout trainclassifier \
      -i 20news-input/bayes-train-input \
      -o newsmodel \
      -type bayes \
      -ng 3 \
      -source hdfs

Test:

    $MAHOUT_HOME/bin/mahout testclassifier \
      -m newsmodel \
      -d 20news-input \
      -type bayes \
      -ng 3 \
      -source hdfs \
      -method mapreduce
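Per the editor’s note for this slide, testing can also be run sequentially rather than via MapReduce. In this generation of Mahout that presumably means changing the -method argument from mapreduce; the alternative value is not shown in the deck, so check your Mahout version’s usage output rather than taking a value on faith.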
Slide 22: Other Use Cases
- Predictive analytics: you’ll hear this term a lot in the field, especially in the context of SAS
- General supervised learning classification: we can recognize a lot of things with practice (and lots of tuning!)
- Document classification
- Sentiment analysis
Slide 23: Questions?
- We’re hiring!
- Cloudera’s Distro of Apache Hadoop: http://www.cloudera.com
- Resources: “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf


Editor's Notes

  • #2 https://cwiki.apache.org/MAHOUT/books-tutorials-and-talks.html
  • #6 Contrasts with the “1Rule” method (1Rule uses a single attribute). NB allows all attributes to make contributions that are equally important and independent of one another.
  • #7 This classifier produces a probability estimate for each class rather than a single prediction. Considered “supervised learning”.
  • #8 A comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as boosted trees or random forests. An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification.
  • #9 Pr[E|H] -> all evidence for instances with H = “yes”. Pr[H] -> percentage of instances with this outcome. Pr[E] -> sum of the values over all outcomes.
  • #10 Book reference: Snow Crash. For each attribute “a” there are multiple values, and given these combinations we need to look at how many times the instances were actually classified into each class. In training we use the term “outcome”; in classification we use the term “class”. Example: say we have 2 attributes to an instance.
  • #11 We don’t take into account some of the other things like “missing values” here
  • #13 Now that we’ve established the case for naïve Bayes + text, show how it fits in with other classification algos.
  • #14 *** Need to sell the case for using another feature-calculating mechanic *** When one class has more training examples than another, naive Bayes selects poor weights for the decision boundary. To balance the number of training examples used per estimate, they introduced a “complement class” formulation of naive Bayes. A document is treated as a sequence of words, and it is assumed that each word position is generated independently of every other word.
  • #15 Term frequency = number of occurrences of the considered term t_i in document d_j / number of words in document d_j, normalized to protect against bias in larger docs. IDF = log(total number of documents / number of documents containing the term). The normalized frequency for a term (feature) in a document is calculated by dividing the term frequency by the root mean square of the term frequencies in that document. The weight-normalized TF for a given feature in a given label = the sum of the normalized frequencies of the feature across all the documents in the label.
  • #16 Need to get a better handle on Sigma_k and Sigma_Wij: https://cwiki.apache.org/MAHOUT/bayesian.html
  • #20 https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
  • #22 Can also test sequentially