Classification with Naive Bayes

Classification with Naïve Bayes: A Deep Dive into Apache Mahout

Today’s speaker: Josh Patterson (josh@cloudera.com / twitter: @jpatanooga). Master’s thesis: self-organizing mesh networks. Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”. Conceived, built, and led Hadoop integration for the openPDC project at TVA (Smartgrid stuff). Led a small team which designed classification techniques for time series and MapReduce. Open source work at http://openpdc.codeplex.com. Now: Solutions Architect at Cloudera.

What is Classification? Supervised learning: we give the system a set of instances to learn from. The system builds knowledge of some structure and learns “concepts”. The system can then classify new instances.

Supervised vs Unsupervised Learning. Supervised: give the system examples/instances of multiple concepts; the system learns “concepts”; more “hands on”. Examples: Naïve Bayes, neural nets. Unsupervised: uses unlabeled data; builds a joint density model. Example: k-means clustering.

Naïve Bayes. Called Naïve Bayes because it’s based on “Bayes’ Rule” and “naively” assumes independence given the label. It is only valid to multiply probabilities when the events are independent, which is a simplistic assumption in real life. Despite the name, Naïve Bayes works well on actual datasets.

Naïve Bayes Classifier. A simple probabilistic classifier based on applying Bayes’ theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be “independent feature model”.

Naïve Bayes Classifier (2) Assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Example: a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
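The independence assumption can be written out explicitly. As a sketch, using H for the class hypothesis and E_1 … E_n for the observed feature values (the symbol names are assumed here, chosen to match the Pr[E_n|H] notation used when counting training data), Bayes’ rule with the naive factorization reads:

```latex
\Pr[H \mid E] \;=\; \frac{\Pr[E_1 \mid H]\,\Pr[E_2 \mid H]\cdots\Pr[E_n \mid H]\;\Pr[H]}{\Pr[E]}
```

For the fruit example, H would be “this fruit is an apple” and E_1, E_2, E_3 would be “red”, “round”, and “about 4" in diameter”. Each factor is estimated separately, which is exactly where the “naive” assumption enters.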

Condensing Meaning… From the Vapor of That Last Big Equation. To train our system we need: the total number of input training instances (count); counts of tuples {attribute_n, outcome_o, value_m}; and total counts of each outcome_o {outcome-count}. To calculate each Pr[E_n|H]: {attribute_n, outcome_o, value_m} / {outcome-count}.
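The counting described on this slide can be sketched in a few lines of Python. This is a minimal illustration and not Mahout’s implementation: the function names `train_counts` and `likelihood` and the toy fruit data are invented for the example.

```python
from collections import Counter


def train_counts(instances):
    """Collect the three kinds of counts the slide lists.

    instances: iterable of (attributes, outcome) pairs, where attributes
    is a dict mapping attribute name -> value.
    """
    total = 0                    # total number of training instances
    outcome_counts = Counter()   # {outcome-count}
    tuple_counts = Counter()     # {attribute, outcome, value} tuples
    for attributes, outcome in instances:
        total += 1
        outcome_counts[outcome] += 1
        for attribute, value in attributes.items():
            tuple_counts[(attribute, outcome, value)] += 1
    return total, outcome_counts, tuple_counts


def likelihood(tuple_counts, outcome_counts, attribute, outcome, value):
    """Pr[E_n | H] = tuple count / outcome count (no smoothing)."""
    return tuple_counts[(attribute, outcome, value)] / outcome_counts[outcome]


# Toy data, made up purely for illustration.
toy = [
    ({"color": "red", "shape": "round"}, "apple"),
    ({"color": "red", "shape": "round"}, "apple"),
    ({"color": "green", "shape": "round"}, "apple"),
    ({"color": "yellow", "shape": "long"}, "banana"),
]
total, outcome_counts, tuple_counts = train_counts(toy)
print(likelihood(tuple_counts, outcome_counts, "color", "apple", "red"))
```

Note that an unseen {attribute, outcome, value} tuple gets probability zero here; a real classifier would normally smooth these counts.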

A Real Example From Witten, et al

Enter Apache Mahout. What is it? Apache Mahout is a scalable machine learning library that supports large data sets. What are the major algorithm types? Classification, recommendation, and clustering. http://mahout.apache.org/

Naïve Bayes and Text. Naive Bayes does not model text well: see “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf. Mahout makes some modifications based around TF-IDF scoring (next slide). It includes two other pre-processing steps, common for information retrieval but not for Naive Bayes classification.

High Level Algorithm. For each feature (word) in each doc, calculate the “Weight Normalized Tf-Idf”: for a given feature in a label, this is the Tf-Idf calculated using standard Idf multiplied by the Weight Normalized Tf. We calculate the sum of W-N-Tf-Idf for all the features in a label, called Sigma_k, with alpha_i = 1.0. Then: Weight = log[ (W-N-Tf-Idf + alpha_i) / (Sigma_k + N) ]
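The weight formula on this slide can be evaluated directly. The sketch below is a plain Python transcription, not Mahout code. The slide does not define N, so it is left as a caller-supplied number; treating it as the vocabulary size in the toy example is an assumption. The function names and the sample scores are also made up for illustration.

```python
import math


def feature_weight(wn_tfidf, sigma_k, n, alpha_i=1.0):
    """Weight = log[(W-N-Tf-Idf + alpha_i) / (Sigma_k + N)]."""
    return math.log((wn_tfidf + alpha_i) / (sigma_k + n))


def label_weights(wn_tfidf_by_feature, n, alpha_i=1.0):
    """Compute the weight of every feature within one label.

    wn_tfidf_by_feature: dict feature -> W-N-Tf-Idf score in this label.
    n: the N term of the formula (assumed here to be the vocabulary size).
    """
    # Sigma_k is the sum of W-N-Tf-Idf over all features in the label.
    sigma_k = sum(wn_tfidf_by_feature.values())
    return {
        feature: feature_weight(score, sigma_k, n, alpha_i)
        for feature, score in wn_tfidf_by_feature.items()
    }


# Hypothetical scores for one label with a two-word vocabulary.
weights = label_weights({"hockey": 2.0, "puck": 1.0}, n=2)
print(weights)
```

With alpha_i = 1.0 the numerator is never zero, so the logarithm stays defined even for a feature whose score in the label is 0.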

BayesDriver Training Workflow: Naïve Bayes training MapReduce workflow in Mahout.

Logical Classification Process. Gather, clean, and examine the training data: really get to know your data! Train the classifier, allowing the system to “learn” the “concepts” but not “overfit” to this specific training data set. Classify new, unseen instances: with Naïve Bayes we’ll calculate the probabilities of each class with respect to this instance.
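The last step, calculating the probability of each class for one instance, might look like the following. This is a rough sketch under stated assumptions: the priors and likelihoods are already estimated and non-zero, the function name `classify` is hypothetical, and the numbers are invented. Log-probabilities are summed instead of multiplying raw probabilities, which is a common way to avoid underflow.

```python
import math


def classify(priors, likelihoods, instance):
    """Return the normalized probability of each label for one instance.

    priors: dict label -> Pr[H]
    likelihoods: dict (attribute, label, value) -> Pr[E | H], all non-zero
    instance: dict attribute -> value
    """
    log_scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for attribute, value in instance.items():
            # Naive step: add one independent log-likelihood per feature.
            score += math.log(likelihoods[(attribute, label, value)])
        log_scores[label] = score

    # Normalize so the scores sum to 1 (this stands in for dividing by Pr[E]).
    top = max(log_scores.values())
    unnormalized = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {label: u / z for label, u in unnormalized.items()}


# Invented numbers, for illustration only.
priors = {"apple": 0.75, "banana": 0.25}
likelihoods = {
    ("color", "apple", "red"): 0.6,
    ("shape", "apple", "round"): 0.9,
    ("color", "banana", "red"): 0.1,
    ("shape", "banana", "round"): 0.2,
}
posterior = classify(priors, likelihoods, {"color": "red", "shape": "round"})
print(max(posterior, key=posterior.get), posterior)
```

The predicted class is simply the label with the highest probability.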

How Is Classification Done? Sequentially or via MapReduce. TestClassifier.java creates a ClassifierContext; then, for each file in the directory and for each line, it breaks the line into a map of tokens, feeds the array of words to the classifier engine for a new classification/label, and collects the classifications as output.
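The sequential path has a simple shape, sketched below in Python. This is not a port of TestClassifier.java: `classify_directory` and `classify_tokens` are hypothetical names, whitespace tokenization is an assumption, and the classifier is passed in as a plain callable to stand in for the classifier engine.

```python
from pathlib import Path


def classify_directory(directory, classify_tokens):
    """Classify every non-empty line of every file in a directory.

    classify_tokens: callable taking a list of word tokens and returning
    a label (a stand-in for the real classifier engine).
    Returns a list of (file name, line number, label) tuples.
    """
    results = []
    for path in sorted(Path(directory).iterdir()):   # for each file in dir
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line_number, line in enumerate(handle, start=1):  # for each line
                tokens = line.split()                # break line into tokens
                if not tokens:
                    continue                         # skip blank lines
                label = classify_tokens(tokens)      # feed words to classifier
                results.append((path.name, line_number, label))  # collect output
    return results
```

Calling `classify_directory("some-test-dir", my_classifier)` would return one labelled tuple per non-empty line; the MapReduce variant distributes the same per-line work across mappers.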

A Quick Note About Training Data… Your classifier can only be as good as the training data lets it be… If you don’t do good data prep, everything will perform poorly. Data collection and pre-processing take the bulk of the time.

Enough Math, Run the Code. Download and install Mahout: http://www.apache.org. Run the 20Newsgroups example, which uses Naïve Bayes classification: https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups. Download and extract 20news-bydate.tar.gz from the 20newsgroups dataset.

Generate Test and Train Dataset
Training Dataset:
    mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
      -p examples/bin/work/20news-bydate/20news-bydate-train \
      -o examples/bin/work/20news-bydate/bayes-train-input \
      -a org.apache.mahout.vectorizer.DefaultAnalyzer \
      -c UTF-8
Test Dataset:
    mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
      -p examples/bin/work/20news-bydate/20news-bydate-test \
      -o examples/bin/work/20news-bydate/bayes-test-input \
      -a org.apache.mahout.vectorizer.DefaultAnalyzer \
      -c UTF-8

Train and Test Classifier
Train:
    $MAHOUT_HOME/bin/mahout trainclassifier \
      -i 20news-input/bayes-train-input \
      -o newsmodel \
      -type bayes \
      -ng 3 \
      -source hdfs
Test:
    $MAHOUT_HOME/bin/mahout testclassifier \
      -m newsmodel \
      -d 20news-input \
      -type bayes \
      -ng 3 \
      -source hdfs \
      -method mapreduce

Other Use Cases. Predictive analytics: you’ll hear this term a lot in the field, especially in the context of SAS. General supervised learning classification: we can recognize a lot of things with practice, and lots of tuning! Document classification. Sentiment analysis.

Questions? We’re hiring! Cloudera’s Distro of Apache Hadoop: http://www.cloudera.com. Resources: “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
