Naive Bayes and Sentiment Classification
The Task of Text Classification
Is this spam?
Who wrote which Federalist papers?
1787-88: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
Authorship of 12 of the essays is in dispute.
1963: solved by Mosteller and Wallace using Bayesian methods.
(Portraits: James Madison and Alexander Hamilton)
What is the subject of this medical article?
Antagonists and Inhibitors
Blood Supply
Chemistry
Drug Therapy
Embryology
Epidemiology
…
(Figure: a MEDLINE article is assigned a category from the MeSH Subject Category Hierarchy.)
Positive or negative movie review?
+ ...zany characters and richly applied satire, and some great plot twists
− It was pathetic. The worst part about it was the boxing scenes...
+ ...awesome caramel sauce and sweet toasty almonds. I love this place!
− ...awful pizza and ridiculously overpriced...
Why sentiment analysis?
Movie: is this review positive or negative?
Products: what do people think about the new iPhone?
Public sentiment: how is consumer confidence?
Politics: what do people think about this candidate or issue?
Prediction: predict election outcomes or market trends from
sentiment
Scherer Typology of Affective States
Emotion: brief organically synchronized … evaluation of a major event
◦ angry, sad, joyful, fearful, ashamed, proud, elated
Mood: diffuse non-caused low-intensity long-duration change in subjective feeling
◦ cheerful, gloomy, irritable, listless, depressed, buoyant
Interpersonal stances: affective stance toward another person in a specific interaction
◦ friendly, flirtatious, distant, cold, warm, supportive, contemptuous
Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons
◦ liking, loving, hating, valuing, desiring
Personality traits: stable personality dispositions and typical behavior tendencies
◦ nervous, anxious, reckless, morose, hostile, jealous
Basic Sentiment Classification
Sentiment analysis is the detection of
attitudes
Simple task we focus on in this chapter
◦ Is the attitude of this text positive or negative?
We return to affect classification in later
chapters
Summary: Text Classification
Sentiment analysis
Spam detection
Authorship identification
Language Identification
Assigning subject categories, topics, or genres
…
Text Classification: definition
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}
Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
Rules based on combinations of words or other features
◦ spam: black-list-address OR (“dollars” AND “you have been
selected”)
Accuracy can be high
◦ If rules carefully refined by expert
But building and maintaining these rules is expensive
Classification Methods:
Supervised Machine Learning
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}
◦ A training set of m hand-labeled documents
(d1,c1),....,(dm,cm)
Output:
◦ a learned classifier γ:d → c
Classification Methods:
Supervised Machine Learning
Any kind of classifier
◦ Naïve Bayes
◦ Logistic regression
◦ Neural networks
◦ k-Nearest Neighbors
◦ …
Naive Bayes (I)
Naive Bayes Intuition
Simple (“naive”) classification method based on
Bayes rule
Relies on very simple representation of document
◦ Bag of words
The Bag of Words Representation
The bag of words representation: the classifier γ maps a document, represented only by its word counts, to a class, γ(document) = c. For example:
seen 2
sweet 1
whimsical 1
recommend 1
happy 1
... ...
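As a quick sketch (not from the slides), a bag of words can be built with Python's Counter; the review snippet below is a made-up stand-in.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count word occurrences,
    ignoring word order entirely."""
    return Counter(text.lower().split())

# Hypothetical snippet of a review, just to illustrate the representation.
review = "I love this movie it is sweet and I recommend it seen it twice"
print(bag_of_words(review))   # e.g. Counter({'it': 3, 'i': 2, ...})
```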
Formalizing the Naive Bayes Classifier
Bayes' Rule Applied to Documents and Classes
For a document d and a class c:
P(c | d) = P(d | c) P(c) / P(d)
Naive Bayes Classifier (I)
c_MAP = argmax_{c∈C} P(c | d)                  (MAP is "maximum a posteriori" = most likely class)
      = argmax_{c∈C} P(d | c) P(c) / P(d)      (Bayes rule)
      = argmax_{c∈C} P(d | c) P(c)             (dropping the denominator)
Naive Bayes Classifier (II)
c_MAP = argmax_{c∈C} P(d | c) P(c)             (document d represented as features x1..xn)
      = argmax_{c∈C} P(x1, x2, …, xn | c) P(c) ("likelihood" × "prior")
Naïve Bayes Classifier (IV)
c_MAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)
P(c): how often does this class occur? We can just count the relative frequencies in a corpus.
P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters, which could only be estimated if a very, very large number of training examples was available.
Multinomial Naive Bayes Independence Assumptions
P(x1, x2, …, xn | c)
Bag of Words assumption: assume position doesn't matter.
Conditional Independence: assume the feature probabilities P(xi | cj) are independent given the class c:
P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)
Multinomial Naive Bayes Classifier
c_MAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)
c_NB = argmax_{c∈C} P(c) ∏_{x∈X} P(x | c)
Applying Multinomial Naive Bayes Classifiers to Text Classification
c_NB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi | cj)
positions ← all word positions in the test document
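Translating this decision rule directly into code (a minimal sketch; the dict layout for priors and likelihoods is my own assumption, not from the slides):

```python
import math

def classify_nb(doc_words, priors, likelihoods):
    """Return the class maximizing P(c) * prod_i P(w_i | c).
    priors: {class: P(c)}; likelihoods: {class: {word: P(w|c)}}."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = prior
        for w in doc_words:
            score *= likelihoods[c].get(w, 0.0)   # unseen word -> probability 0
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```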
Example
Let's walk through a Multinomial Naïve Bayes classifier for filtering out spam messages.
Initially, we have eight normal messages and four spam messages.
Histogram of all the words that occur in the normal messages from family and friends.
The probability of each word, given that it appears in a normal message:
P(Dear | Normal) = 8/17 = 0.47
P(Friend | Normal) = 5/17 = 0.29
P(Lunch | Normal) = 3/17 = 0.18
P(Money | Normal) = 1/17 = 0.06
Histogram of all the words that occur in the spam messages.
The probability of each word, given that it appears in a spam message:
P(Dear | Spam) = 2/7 = 0.29
P(Friend | Spam) = 1/7 = 0.14
P(Lunch | Spam) = 0/7 = 0.00
P(Money | Spam) = 4/7 = 0.57
What is the probability that "Dear Friend" is a normal message?
What is the probability that "Dear Friend" is a spam message?
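Plugging the counts above into the Naive Bayes rule answers both questions; the sketch below shows the arithmetic (unnormalized scores, since the shared denominator P(d) is dropped):

```python
# Word counts from the example: Normal has 17 word tokens, Spam has 7.
normal = {"dear": 8, "friend": 5, "lunch": 3, "money": 1}   # 17 tokens
spam   = {"dear": 2, "friend": 1, "lunch": 0, "money": 4}   # 7 tokens

p_normal, p_spam = 8 / 12, 4 / 12        # priors: 8 normal docs, 4 spam docs

def score(words, counts, total, prior):
    """Unnormalized P(class | words) = P(class) * prod P(word | class)."""
    p = prior
    for w in words:
        p *= counts[w] / total
    return p

msg = ["dear", "friend"]
print(score(msg, normal, 17, p_normal))   # (8/17)*(5/17)*(8/12) ≈ 0.092
print(score(msg, spam, 7, p_spam))        # (2/7)*(1/7)*(4/12)  ≈ 0.014
```

So "Dear Friend" is scored as more likely to be a normal message than spam.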
Problems with multiplying lots of probs
There's a problem with this:
c_NB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi | cj)
Multiplying lots of probabilities can result in floating-point underflow!
Luckily, log(ab) = log(a) + log(b)
Let's sum logs of probabilities instead of multiplying probabilities!
We actually do everything in log space
Instead of this: c_NB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi | cj)
This: c_NB = argmax_{cj∈C} [ log P(cj) + Σ_{i∈positions} log P(xi | cj) ]
This is OK since log doesn't change the ranking of the classes (the class with the highest probability still has the highest log probability).
The model is now just a max of a sum of weights: a linear function of the inputs.
So naive Bayes is a linear classifier.
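The same scorer rewritten in log space (a sketch; unseen words with probability zero are mapped to a log score of minus infinity here):

```python
import math

def log_score(words, counts, total, prior):
    """log P(class) + sum_i log P(word_i | class); the class ranking is unchanged."""
    s = math.log(prior)
    for w in words:
        p = counts.get(w, 0) / total
        s += math.log(p) if p > 0 else -math.inf   # zero probability -> -inf
    return s
```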
Naive Bayes: Learning
Learning the Multinomial Naive Bayes Model
First attempt: maximum likelihood estimates, i.e. simply use the frequencies in the data (Sec. 13.3):
P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
P̂(cj) = doccount(C = cj) / N_doc
For the spam example: compute the prior for each class c ∈ {Normal, Spam}, e.g. P(Normal) = 8/12, and compute the probability of each word given a class.
Parameter estimation
P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
This is the fraction of times word wi appears among all words in documents of topic cj.
Create a mega-document for topic j by concatenating all docs in this topic, and use the frequency of w in the mega-document.
Documents: 12 total (Normal = 8, Spam = 4)
P(Normal) = 8/12
P(Spam) = 4/12
Word counts:
Normal (17 tokens): Dear 8, Friend 5, Lunch 3, Money 1
Spam (7 tokens): Dear 2, Friend 1, Lunch 0, Money 4
Which class does "Dear Friend" belong to?
P(Normal | "Dear Friend") ∝ P(Dear | Normal) · P(Friend | Normal) · P(Normal) = (8/17) · (5/17) · (8/12)
P(Spam | "Dear Friend") ∝ (2/7) · (1/7) · (4/12)
Which class does "Lunch Money" belong to?
P(Normal | "Lunch Money") ∝ (3/17) · (1/17) · (8/12)
P(Spam | "Lunch Money") ∝ (0/7) · (4/7) · (4/12) = 0
Problem with Maximum Likelihood
What if we have seen no training documents with the word fantastic classified in the topic positive (thumbs-up)?
P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0
Zero probabilities cannot be conditioned away, no matter the other evidence!
c_MAP = argmax_c P̂(c) ∏_i P̂(xi | c)
(Sec. 13.3)
Laplace (add-1) smoothing for Naïve Bayes
Maximum likelihood estimate:
P̂(wi | c) = count(wi, c) / Σ_{w∈V} count(w, c)
Add-1 estimate:
P̂(wi | c) = (count(wi, c) + 1) / Σ_{w∈V} (count(w, c) + 1)
          = (count(wi, c) + 1) / ( (Σ_{w∈V} count(w, c)) + |V| )
With add-1 smoothing (|V| = 4 unique words; 17 word tokens in Normal, 7 in Spam):
P(Normal | "Lunch Money") ∝ ((3+1)/(17+4)) · ((1+1)/(17+4)) · (8/12) ≈ 0.012
P(Spam | "Lunch Money") ∝ ((0+1)/(7+4)) · ((4+1)/(7+4)) · (4/12) ≈ 0.014
So with smoothing, "Lunch Money" is (narrowly) classified as spam rather than receiving a zero probability.
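The same computation as a short sketch (the dict layout is mine; the counts and vocabulary size are from the example above):

```python
V = 4                                   # vocabulary: dear, friend, lunch, money

normal = {"dear": 8, "friend": 5, "lunch": 3, "money": 1}   # 17 tokens
spam   = {"dear": 2, "friend": 1, "lunch": 0, "money": 4}   # 7 tokens

def smoothed(word, counts, total):
    """Add-1 (Laplace) estimate of P(word | class)."""
    return (counts[word] + 1) / (total + V)

p_n = smoothed("lunch", normal, 17) * smoothed("money", normal, 17) * 8 / 12
p_s = smoothed("lunch", spam, 7)    * smoothed("money", spam, 7)    * 4 / 12
print(p_n, p_s)   # ≈ 0.0121 vs ≈ 0.0138: spam now wins instead of being ruled out
```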
Unknown words
What about unknown words
◦ that appear in our test data
◦ but not in our training data or vocab
We ignore them
◦ Remove them from the test document!
◦ Pretend they weren't there!
◦ Don't include any probability for them at all.
Why don't we build an unknown word model?
◦ It doesn't help: knowing which class has more unknown words is
not generally a useful thing to know!
Stop words
Some systems ignore another class of words:
Stop words: very frequent words like the and a.
◦ Sort the whole vocabulary by frequency in the training set, and call the top 10 or 50 words the stopword list.
◦ Now we remove all stop words from the training and test sets as if they were never there.
But in most text classification applications, removing stop words doesn't help, so it's more common to not use stopword lists and to use all the words in naive Bayes.
Multinomial Naïve Bayes: Learning
• From the training corpus, extract Vocabulary
• Calculate the P(cj) terms
◦ For each cj in C do
  docsj ← all docs with class = cj
  P(cj) ← |docsj| / |total # documents|
• Calculate the P(wk | cj) terms
◦ Textj ← single doc containing all docsj
◦ For each word wk in Vocabulary
  nk ← # of occurrences of wk in Textj
  P(wk | cj) ← (nk + α) / (n + α·|Vocabulary|)
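A compact Python sketch of this training loop (the input layout, a list of (word list, class) pairs, is my assumption):

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs, alpha=1.0):
    """labeled_docs: list of (list_of_words, class) pairs.
    Returns log priors, log likelihoods, and the vocabulary (add-alpha smoothing)."""
    vocab = {w for doc, _ in labeled_docs for w in doc}
    docs_per_class = Counter(c for _, c in labeled_docs)
    word_counts = defaultdict(Counter)            # class -> Counter of word tokens
    for doc, c in labeled_docs:
        word_counts[c].update(doc)
    log_prior, log_like = {}, {}
    for c in docs_per_class:
        log_prior[c] = math.log(docs_per_class[c] / len(labeled_docs))
        n = sum(word_counts[c].values())          # tokens in the class "mega-document"
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) /
                                   (n + alpha * len(vocab))) for w in vocab}
    return log_prior, log_like, vocab
```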
Sentiment and Binary Naive Bayes
Let's do a worked sentiment example!
A worked sentiment example
(Table of training and test documents from the original slide; only a fragmentary word list survived extraction: Just, Plain, Boar(ing), Entire(ly), Predict(able), And, Lack(s), Energy, No, Surprise(s), Very, Few, laughs.)
A worked sentiment example
Prior from training:
P(−) = 3/5
P(+) = 2/5
Likelihoods from training (drop "with", which is unknown):
P(Predict|+) = ?   P(Predict|−) = ?
P(No|+) = ?        P(No|−) = ?
P(Fun|+) = ?       P(Fun|−) = ?
Scoring the test set:
P(+) · P("Predict No Fun" | +)  vs.  P(−) · P("Predict No Fun" | −)
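For concreteness, here is a sketch of this computation. The five training sentences are taken from the textbook version of this example (an assumption on my part, since the table did not survive extraction): negative = "just plain boring", "entirely predictable and lacks energy", "no surprises and very few laughs"; positive = "very powerful", "the most fun film of the summer"; test = "predictable with no fun".

```python
import math
from collections import Counter

# Assumed training data (textbook version of this worked example).
neg_docs = ["just plain boring",
            "entirely predictable and lacks energy",
            "no surprises and very few laughs"]
pos_docs = ["very powerful", "the most fun film of the summer"]

neg = Counter(w for d in neg_docs for w in d.split())   # 14 tokens
pos = Counter(w for d in pos_docs for w in d.split())   # 9 tokens
vocab = set(neg) | set(pos)                             # |V| = 20

def log_posterior(words, counts, prior):
    """log prior + sum of add-1 smoothed log likelihoods, skipping unknown words."""
    n = sum(counts.values())
    s = math.log(prior)
    for w in words:
        if w in vocab:                                  # "with" is unknown: dropped
            s += math.log((counts[w] + 1) / (n + len(vocab)))
    return s

test = "predictable with no fun".split()
print(log_posterior(test, neg, 3/5))   # log(3/5 * 2/34 * 2/34 * 1/34) ≈ -9.7
print(log_posterior(test, pos, 2/5))   # log(2/5 * 1/29 * 1/29 * 2/29) ≈ -10.3
```

Under these assumptions the negative class gets the higher score, so "predictable with no fun" is labeled negative.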
Optimizing for sentiment analysis
For tasks like sentiment, word occurrence is more
important than word frequency.
◦ The occurrence of the word fantastic tells us a lot
◦ The fact that it occurs 5 times may not tell us much more.
Binary multinomial naive Bayes, or binary NB
◦ Clip our word counts at 1
◦ Note: this is different from Bernoulli naive Bayes; see the textbook at the end of the chapter.
Binary Multinomial Naïve Bayes: Learning
• From the training corpus, extract Vocabulary
• Remove duplicates in each doc:
◦ For each word type w in docj
  Retain only a single instance of w
• Calculate the P(cj) terms
◦ For each cj in C do
  docsj ← all docs with class = cj
  P(cj) ← |docsj| / |total # documents|
• Calculate the P(wk | cj) terms
◦ Textj ← single doc containing all docsj
◦ For each word wk in Vocabulary
  nk ← # of occurrences of wk in Textj
  P(wk | cj) ← (nk + α) / (n + α·|Vocabulary|)
Binary Multinomial Naive Bayes on a test document d
First remove all duplicate words from d
Then compute NB using the same equation:
c_NB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(wi | cj)
Binary multinomial naive Bayes
Counts can still be 2! Binarization is within-doc!
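A sketch of the within-document binarization step (my own helper, not from the slides):

```python
def binarize(doc_words):
    """Keep at most one instance of each word type within a single document;
    counts across documents can still exceed 1."""
    seen, out = set(), []
    for w in doc_words:
        if w not in seen:
            seen.add(w)
            out.append(w)
    return out

docs = [["great", "great", "acting"], ["great", "plot"]]
flat = [w for d in docs for w in binarize(d)]
print(flat.count("great"))   # 2: one per document, even though the first doc had it twice
```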
Naïve Bayes: Relationship to Language Modeling
Generative Model for Multinomial Naïve Bayes
(Figure: the class c = + generates the word sequence X1 = I, X2 = love, X3 = this, X4 = fun, X5 = film.)
Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of feature
◦ URL, email address, dictionaries, network features
But if, as in the previous slides,
◦ we use only word features
◦ we use all of the words in the text (not a subset)
Then
◦ naive Bayes has an important similarity to language modeling.
Each class = a unigram language model
Assigning each word: P(word | c)
Assigning each sentence: P(s | c) = ∏ P(word | c)
Class pos: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
s = "I love this fun film"
P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
(Sec. 13.2.1)
Naïve Bayes as a Language Model
Which class assigns the higher probability to s?
Model pos: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
Model neg: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1
s = "I love this fun film"
P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1
P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1
P(s | pos) > P(s | neg)
(Sec. 13.2.1)
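A sketch of scoring the sentence under each class's unigram model, using the probabilities from the slide:

```python
from functools import reduce

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

def sentence_prob(words, model):
    """P(s | c) under a unigram model: the product of per-word probabilities."""
    return reduce(lambda p, w: p * model[w], words, 1.0)

s = "I love this fun film".split()
print(sentence_prob(s, pos))   # ≈ 5e-07
print(sentence_prob(s, neg))   # ≈ 1e-09 -> the positive model assigns higher probability
```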
Precision, Recall, and F measure
Evaluation
Let's consider just binary text classification tasks
Imagine you're the CEO of Delicious Pie Company
You want to know what people are saying about
your pies
So you build a "Delicious Pie" tweet detector
◦ Positive class: tweets about Delicious Pie Co
◦ Negative class: all other tweets
The 2-by-2 confusion matrix (example counts):
                   gold positive   gold negative
system positive    TP = 10         FP = 2
system negative    FN = 3          TN = 34
Evaluation: Accuracy
Why don't we use accuracy as our metric?
Imagine we saw 1 million tweets
◦ 100 of them talked about Delicious Pie Co.
◦ 999,900 talked about something else
We could build a dumb classifier that just labels every
tweet "not about pie"
◦ It would get 99.99% accuracy!!! Wow!!!!
◦ But useless! Doesn't return the comments we are looking for!
◦ That's why we use precision and recall instead
Evaluation: Precision
% of items the system detected (i.e., items the system labeled as positive) that are in fact positive (according to the human gold labels):
Precision = TP / (TP + FP)
Evaluation: Recall
% of items actually present in the input that were correctly identified by the system:
Recall = TP / (TP + FN)
Why precision and recall?
Our dumb pie-classifier
◦ just labels nothing as "about pie"
Accuracy = 99.99%
but
Recall = 0
◦ (it doesn't get any of the 100 pie tweets)
Precision and recall, unlike accuracy, emphasize true
positives:
◦ finding the things that we are supposed to be looking for.
A combined measure: F
F measure: a single number that combines P and R:
F_β = (β² + 1) P R / (β² P + R)
We almost always use balanced F1 (i.e., β = 1):
F1 = 2PR / (P + R)
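Computing these metrics from the example confusion matrix above (TP = 10, FP = 2, FN = 3, TN = 34), as a quick sketch:

```python
TP, FP, FN, TN = 10, 2, 3, 34

precision = TP / (TP + FP)                                  # 10/12 ≈ 0.83
recall    = TP / (TP + FN)                                  # 10/13 ≈ 0.77
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.80
accuracy  = (TP + TN) / (TP + FP + FN + TN)                 # 44/49 ≈ 0.90

print(precision, recall, f1, accuracy)
```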