Classification with Naive Bayes

A Deep Dive into Classification with Naive Bayes. Along the way we take a look at some basics from Ian Witten's Data Mining book and dig into the algorithm.

Presented on Wed Apr 27 2011 at SeaHUG in Seattle, WA.


Speaker notes
  • https://cwiki.apache.org/MAHOUT/books-tutorials-and-talks.html
  • Contrasts with the “1Rule” method (1Rule uses a single attribute). NB allows all attributes to make contributions that are equally important and independent of one another.
  • This classifier produces a probability estimate for each class rather than a single prediction. Considered “supervised learning”.
  • A comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as boosted trees or random forests. An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification.
  • Pr[E|H] -> all the evidence for instances with H -> “yes”. Pr[H] -> percent of instances with this outcome. Pr[E] -> sum of the values for all outcomes.
  • Book reference: Snow Crash. For each attribute “a” there are multiple values, and given these combinations we need to look at how many times the instances were actually classified as each class. In training we use the term “outcome”; in classification we use the term “class”. Example: say we have 2 attributes to an instance.
  • We don’t take into account some of the other things like “missing values” here
  • Now that we’ve established the case for Naïve Bayes + text, show how it fits in with other classification algos.
  • *** Need to sell the case for using another feature-calculating mechanic *** When one class has more training examples than another, Naive Bayes selects poor weights for the decision boundary. To balance the number of training examples used per estimate, they introduced a “complement class” formulation of Naive Bayes. A document is treated as a sequence of words, and it is assumed that each word position is generated independently of every other word.
  • Term frequency = number of occurrences of the considered term t_i in document d_j / number of words in document d_j. Normalized to protect against bias in larger docs. IDF = log( total number of documents / number of documents containing the term ). Normalized frequency for a term (feature) in a document is calculated by dividing the term frequency by the root mean square of term frequencies in that document. Weight-normalized Tf for a given feature in a given label = sum of the normalized frequencies of the feature across all the documents in the label. (Written out as formulas after these notes.)
  • Need to get a better handle on Sigma_k and Sigma_Wij: https://cwiki.apache.org/MAHOUT/bayesian.html
  • https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
  • Can also test sequentially
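  The term-weighting note above, written out as formulas (the symbols f_ij, RMS_j, and ℓ are mine, not the deck’s):

    \mathrm{tf}_{ij} = \frac{f_{ij}}{|d_j|},
    \qquad
    \hat{f}_{ij} = \frac{f_{ij}}{\mathrm{RMS}_j},
    \qquad
    \mathrm{WNTf}_{i\ell} = \sum_{d_j \in \ell} \hat{f}_{ij}

  where f_{ij} is the count of term t_i in document d_j, |d_j| is the number of words in d_j, RMS_j is the root mean square of the term frequencies in d_j, and ℓ is a label.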
  • Transcript

    • 1. Classification with Naïve Bayes
      A Deep Dive into Apache Mahout
    • 2. Today’s speaker – Josh Patterson
      josh@cloudera.com / twitter: @jpatanooga
      Master’s Thesis: self-organizing mesh networks
      Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
      Conceived, built, and led Hadoop integration for the openPDC project at TVA (Smartgrid stuff)
      Led small team which designed classification techniques for time series and Map Reduce
      Open source work at http://openpdc.codeplex.com
      Now: Solutions Architect at Cloudera
    • 3. What is Classification?
      Supervised Learning
      We give the system a set of instances to learn from
      System builds knowledge of some structure
      Learns “concepts”
      System can then classify new instances
    • 4. Supervised vs Unsupervised Learning
      Supervised
      Give system examples/instances of multiple concepts
      System learns “concepts”
      More “hands on”
      Example: Naïve Bayes, Neural Nets
      Unsupervised
      Uses unlabeled data
      Builds joint density model
      Example: k-means clustering
    • 5. Naïve Bayes
      Called Naïve Bayes because it’s based on “Bayes’ Rule” and “naively” assumes independence given the label
      It is only valid to multiply probabilities when the events are independent
      Simplistic assumption in real life
      Despite the name, Naïve Bayes works well on actual datasets
    • 6. Naïve Bayes Classifier
      Simple probabilistic classifier based on
      applying Bayes’ theorem (from Bayesian statistics)
      strong (naive) independence assumptions.
      A more descriptive term for the underlying probability model would be “independent feature model".
    • 7. Naïve Bayes Classifier (2)
      Assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.
      Example:
      a fruit may be considered to be an apple if it is red, round, and about 4" in diameter.
      Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
    • 8. A Little Bit o’ Theory
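      The equation image from this slide does not survive in the transcript; the formula the deck builds on, matching the Pr[E|H], Pr[H], Pr[E] speaker note above, is Bayes’ rule plus the naïve independence assumption:

        \Pr[H \mid E] = \frac{\Pr[E \mid H]\,\Pr[H]}{\Pr[E]},
        \qquad
        \Pr[E \mid H] = \prod_{n} \Pr[E_n \mid H]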
    • 9. Condensing Meaning
      To train our system we need
      Total number of input training instances (count)
      Count tuples:
      {attribute_n, outcome_o, value_m}
      Total counts of each outcome_o
      {outcome-count}
      To calculate each Pr[E_n|H]:
      ( {attribute_n, outcome_o, value_m} / {outcome-count} )
      …From the Vapor of That Last Big Equation
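      A minimal, self-contained sketch of that counting scheme (the toy data and all class/variable names are mine; this is not Mahout code, and it skips smoothing for zero counts):

        import java.util.HashMap;
        import java.util.Map;

        public class TinyNaiveBayes {
            // {attribute_n, outcome_o, value_m} tuple counts, keyed "n=value|outcome"
            private final Map<String, Integer> tupleCounts = new HashMap<>();
            // {outcome-count}: total count of each outcome
            private final Map<String, Integer> outcomeCounts = new HashMap<>();

            // Training: one instance = its attribute values plus the known outcome
            void train(String[] values, String outcome) {
                outcomeCounts.merge(outcome, 1, Integer::sum);
                for (int n = 0; n < values.length; n++) {
                    tupleCounts.merge(n + "=" + values[n] + "|" + outcome, 1, Integer::sum);
                }
            }

            // Classification: score(H) = Pr[H] * product over attributes of Pr[E_n|H].
            // Pr[E] only normalizes, so it can be skipped when taking the argmax.
            String classify(String[] values) {
                int total = outcomeCounts.values().stream().mapToInt(Integer::intValue).sum();
                String best = null;
                double bestScore = -1.0;
                for (Map.Entry<String, Integer> h : outcomeCounts.entrySet()) {
                    double score = (double) h.getValue() / total;          // Pr[H]
                    for (int n = 0; n < values.length; n++) {
                        int c = tupleCounts.getOrDefault(n + "=" + values[n] + "|" + h.getKey(), 0);
                        score *= (double) c / h.getValue();                // Pr[E_n|H]
                    }
                    if (score > bestScore) { bestScore = score; best = h.getKey(); }
                }
                return best;
            }

            public static void main(String[] args) {
                TinyNaiveBayes nb = new TinyNaiveBayes();
                // two attributes per instance, as in the speaker notes
                nb.train(new String[] {"sunny", "hot"}, "no");
                nb.train(new String[] {"rainy", "cool"}, "yes");
                nb.train(new String[] {"sunny", "cool"}, "yes");
                System.out.println(nb.classify(new String[] {"sunny", "cool"}));  // prints: yes
            }
        }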
    • 10. A Real Example From Witten, et al
    • 11. Enter Apache Mahout
      What is it?
      Apache Mahout is a scalable machine learning library that supports large data sets
      What Are the Major Algorithm Types?
      Classification
      Recommendation
      Clustering
      http://mahout.apache.org/
    • 12. Mahout Algorithms
    • 13. Naïve Bayes and Text
      Naive Bayes does not model text well.
      “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”
      http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
      Mahout does some modifications based around TF-IDF scoring (Next Slide)
      Includes two other pre-processing steps, common for information retrieval but not for Naive Bayes classification
    • 14. High Level Algorithm
      For Each Feature (word) in each Doc:
      Calc: “Weight-Normalized Tf-Idf”
      for a given feature in a label, this is the Tf-Idf calculated using the standard Idf multiplied by the Weight-Normalized Tf
      We calculate the sum of W-N-Tf-Idf for all the features in a label, called Sigma_k, and set alpha_i == 1.0
      Weight = Log [ ( W-N-Tf-Idf + alpha_i ) / ( Sigma_k + N ) ]
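      Written out (symbols follow the slide; N is used as on the slide, without a definition):

        W_{i\ell} = \log\frac{\mathrm{WNTfIdf}_{i\ell} + \alpha_i}{\Sigma_k + N},
        \qquad
        \alpha_i = 1,
        \qquad
        \Sigma_k = \sum_{i} \mathrm{WNTfIdf}_{i\ell}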
    • 15. BayesDriver Training Workflow
      Naïve Bayes Training MapReduce Workflow in Mahout
    • 16. Logical Classification Process
      Gather, Clean, and Examine the Training Data
      Really get to know your data!
      Train the Classifier, allowing the system to “Learn” the “Concepts”
      But not “overfit” to this specific training data set
      Classify New Unseen Instances
      With Naïve Bayes we’ll calculate the probabilities of each class wrt this instance
    • 17. How Is Classification Done?
      Sequentially or via Map Reduce
      TestClassifier.java
      Creates ClassifierContext
      For Each File in Dir
      For Each Line
      Break line into map of tokens
      Feed array of words to Classifier engine for new classification/label
      Collect classifications as output
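      A rough sketch of that sequential loop (the file handling and the tokenizer are mine; classify() is a hypothetical stand-in for Mahout’s classifier engine, not its real API):

        import java.io.IOException;
        import java.nio.file.DirectoryStream;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;

        public class SequentialClassifySketch {
            public static void main(String[] args) throws IOException {
                Path dir = Paths.get(args[0]);                            // directory of test docs
                try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
                    for (Path file : files) {                             // for each file in dir
                        for (String line : Files.readAllLines(file)) {    // for each line
                            String[] tokens = line.toLowerCase().split("\\W+");   // line -> tokens
                            String label = classify(tokens);              // feed words to the engine
                            System.out.println(file.getFileName() + "\t" + label);  // collect output
                        }
                    }
                }
            }

            // Hypothetical stand-in for the trained classifier engine
            static String classify(String[] tokens) {
                return "unknown";
            }
        }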
    • 18. A Quick Note About Training Data…
      Your classifier can only be as good as the training data lets it be…
      If you don’t do good data prep, everything will perform poorly
      Data collection and pre-processing takes the bulk of the time
    • 19. Enough Math, Run the Code
      Download and install Mahout
      http://www.apache.org
      Run 20Newsgroups Example
      https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
      Uses Naïve Bayes Classification
      Download and extract 20news-bydate.tar.gz from the 20newsgroups dataset
    • 20. Generate Test and Train Dataset
      Training Dataset:
      mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups
      -p examples/bin/work/20news-bydate/20news-bydate-train
      -o examples/bin/work/20news-bydate/bayes-train-input
      -a org.apache.mahout.vectorizer.DefaultAnalyzer
      -c UTF-8
      Test Dataset:
      mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups
      -p examples/bin/work/20news-bydate/20news-bydate-test
      -o examples/bin/work/20news-bydate/bayes-test-input
      -a org.apache.mahout.vectorizer.DefaultAnalyzer
      -c UTF-8
    • 21. Train and Test Classifier
      Train:
      $MAHOUT_HOME/bin/mahout trainclassifier
      -i 20news-input/bayes-train-input
      -o newsmodel
      -type bayes
      -ng 3
      -source hdfs
      Test:
      $MAHOUT_HOME/bin/mahout testclassifier
      -m newsmodel
      -d 20news-input
      -type bayes
      -ng 3
      -source hdfs
      -method mapreduce
    • 22. Other Use Cases
      Predictive Analytics
      You’ll hear this term a lot in the field, especially in the context of SAS
      General Supervised Learning Classification
      We can recognize a lot of things with practice
      And lots of tuning!
      Document Classification
      Sentiment Analysis
    • 23. Questions?
      We’re Hiring!
      Cloudera’s Distro of Apache Hadoop:
      http://www.cloudera.com
      Resources
      “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”
      http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf