The document discusses sentiment analysis using a Naïve Bayes classifier, detailing its objectives, challenges, and methodologies for processing textual data to determine positivity or negativity. It highlights the increasing importance of automated techniques in understanding opinions from electronic media and outlines the classification approaches employed including probabilistic analysis. Additionally, it compares different implementations of Naïve Bayes, such as multinomial and binarized variants, noting their respective accuracies and applications.
Introduction to sentiment analysis, which identifies sentiment in e-text as positive or negative.
The motivation behind sentiment analysis includes the increased use of electronic media and its application in opinion mining.
Review of previous work encompassing techniques such as Naïve Bayes, Maximum Entropy, and Support Vector Machine for sentiment analysis.
Identifies problems faced when implementing sentiment analyzers, including searching, tokenization, and content identification.
Overview of different approaches for sentiment analysis, notably Naïve Bayes Classifier, Max Entropy, and Support Vector Machine.
Details on Naïve Bayes Classifier, its application, and a basic explanation of the probabilistic analysis using Bayes' theorem.
Discussion on Multinomial Naïve Bayes and Binarized Multinomial Naïve Bayes, focusing on their functionality and applications.
Provides accuracy rates around 75% for Multinomial Naïve Bayes and discusses key algorithms including dictionary and feature set generation.
Describes the training process and the testing methodology to find sentiment in test data, including decision-making based on probabilities.
Shows an example calculation for Multinomial Naïve Bayes, detailing class selection based on conditional probabilities.
Overview of Binarized Naïve Bayes, emphasizing its distinguishing factor of counting token occurrences once per document and achieving 79-82% accuracy.
SENTIMENT ANALYSIS
USING NAÏVE BAYES CLASSIFIER
CREATED BY:
DEV KUMAR, ANKUR TYAGI, SAURABH TYAGI
(Indian Institute of Information Technology, Allahabad)
10/2/2014 [Project Name]
1
Introduction
• Objective
Sentiment analysis is the task of identifying whether
an e-text (text in the form of electronic data such
as comments, reviews, or messages) is positive or
negative.
MOTIVATION
• Sentiment analysis is a hot topic of research.
• Use of electronic media is increasing day by day.
• Time is money, or even more valuable than money;
instead of spending time reading text and figuring
out its positivity or negativity, we can use
automated techniques for sentiment analysis.
• Sentiment analysis is used in opinion mining.
– Example – Analyzing a product based on its reviews
and comments.
PREVIOUS WORK
• Many techniques have emerged from ongoing
research work, such as:
• Naïve Bayes.
• Maximum Entropy.
• Support Vector Machine.
• Semantic Orientation.
Problem Description
When we implement a sentiment analyzer we may face
the following problems:
1. Searching problem.
2. Tokenization and classification.
3. Reliable content identification.
Continue….
Problems faced
– Searching problem
• We have to find a particular word in about 2,500
files.
– All words are weighted the same; for example, good
and best belong to the same category.
– The sequence in which words appear in the test data
is neglected.
Other issues –
– The efficiency of this implementation is only
40-50%.
Continue…
• Naïve Bayes Classifier
– Simple classification of words based on Bayes'
theorem.
– It is a 'bag of words' approach (text represented as
the collection of its words, discarding grammar and
word order but keeping multiplicity) for subjective
analysis of content.
– Applications: sentiment detection, email spam
detection, document categorization, etc.
– Superior in terms of CPU and memory utilization, as
shown by Huang, J. (2003).
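The bag-of-words representation described above can be sketched in a few lines of Python (a minimal illustration; splitting on whitespace is a simplifying assumption about tokenization):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; grammar and word order are
    # discarded, but multiplicity is kept (a "bag", not a set).
    return Counter(text.lower().split())

bag = bag_of_words("Good movie good plot")
# bag["good"] == 2 : multiplicity is preserved
```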
Continue…
• Probabilistic Analysis of Naïve Bayes
For a document d and class c, by Bayes' theorem:

    P(c | d) = P(d | c) P(c) / P(d)

The Naïve Bayes classifier chooses the most probable class:

    c* = argmax_c P(c | d)
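The argmax rule above can be sketched as follows (a minimal illustration; the numeric values are hypothetical, and P(d) is dropped because it is constant across classes):

```python
import math

def classify(log_prior, log_likelihood):
    # c* = argmax_c P(c|d) = argmax_c P(d|c) P(c); the evidence P(d)
    # is the same for every class, so it can be omitted.
    # log_prior: {class: log P(c)}, log_likelihood: {class: log P(d|c)}
    return max(log_prior, key=lambda c: log_prior[c] + log_likelihood[c])

# Hypothetical numbers for illustration:
best = classify({"pos": math.log(0.5), "neg": math.log(0.5)},
                {"pos": math.log(0.02), "neg": math.log(0.01)})
# best == "pos"
```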
Continue…
Multinomial Naïve Bayes Classifier
Accuracy – around 75%
Algorithm:
Dictionary Generation
Count the occurrences of all words in the whole data set
and make a dictionary of the most frequent words.
Feature Set Generation
- Each document is represented as a feature vector over
the space of dictionary words.
- For each document, keep track of the dictionary words
along with their number of occurrences in that document.
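The dictionary and feature-set generation steps can be sketched as follows (a minimal illustration with hypothetical function names; whitespace tokenization is a simplifying assumption):

```python
from collections import Counter

def build_dictionary(documents, size=5000):
    # Count occurrences of every word across the whole data set and
    # keep the `size` most frequent words as the dictionary.
    counts = Counter(w for doc in documents for w in doc.split())
    return [w for w, _ in counts.most_common(size)]

def feature_vector(document, dictionary):
    # Represent the document over the space of dictionary words:
    # for each dictionary word, its occurrence count in this document.
    counts = Counter(document.split())
    return [counts[w] for w in dictionary]

docs = ["good good movie", "bad plot"]
dictionary = build_dictionary(docs, size=4)
vec = feature_vector("good movie movie", dictionary)
```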
Continue…
Formula used for the algorithm:

    P(x_j = k | label = y)
        = ( Σ_{i=1..m} Σ_{j=1..n_i} 1{ x_j^(i) = k and label^(i) = y } + 1 )
          / ( Σ_{i=1..m} 1{ label^(i) = y } · n_i + |V| )

P(x_j = k | label = y) = probability that a particular word in a
document of label (neg/pos) = y will be the kth word in the
dictionary (the +1 and |V| terms are Laplace smoothing over the
dictionary of size |V|).
n_i = number of words in the ith document.
m = total number of documents.
Continue…
Calculate the probability of occurrence of each label; here the
labels are negative and positive:

    P(label = y) = ( Σ_{i=1..m} 1{ label^(i) = y } ) / m

All these formulas are used for training.
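The training formulas above can be sketched in Python (a minimal illustration with hypothetical names; documents are assumed to be pre-tokenized lists of words):

```python
from collections import Counter

def train(docs, labels):
    # docs: list of token lists; labels: parallel list of "pos"/"neg".
    # Returns priors P(label = y) and smoothed word probabilities
    # phi[y][k] = (count of word k in docs of label y + 1)
    #             / (total words in docs of label y + |V|)
    vocab = {w for d in docs for w in d}
    m = len(docs)
    priors, phi = {}, {}
    for y in set(labels):
        in_class = [d for d, l in zip(docs, labels) if l == y]
        priors[y] = len(in_class) / m
        counts = Counter(w for d in in_class for w in d)
        total = sum(len(d) for d in in_class)
        phi[y] = {k: (counts[k] + 1) / (total + len(vocab)) for k in vocab}
    return priors, phi

docs = [["good", "good"], ["bad"]]
priors, phi = train(docs, ["pos", "neg"])
```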
Continue…
Training
In this phase we generate the training data (words with their
probability of occurrence in the positive/negative training files).
Calculate P(label = y) for each label.
Calculate P(x_j = k | label = y) for each dictionary word and store
the result (here the labels are negative and positive).
Now we have each word and its corresponding probability for each
of the defined labels.
Continue…
Testing
Goal – finding the sentiment of a given test data file.
• Generate the feature set (x) for the test data file.
• For each document in the test set find
    Decision1 = log P(x | label = pos) + log P(label = pos)
• Similarly calculate
    Decision2 = log P(x | label = neg) + log P(label = neg)
• Compare Decision1 and Decision2 to decide whether the document
has negative or positive sentiment.
Note – We take the log of probabilities to avoid numerical underflow
when multiplying many small (Laplace-smoothed) probabilities.
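The testing step can be sketched as follows (a minimal illustration; the parameter values are hypothetical, standing in for trained values):

```python
import math

def decide(features, priors, phi):
    # features: {word: count} for the test document.
    # score_y = log P(label = y) + sum over words of n * log phi[y][w],
    # i.e. the log of prior times likelihood; logs avoid underflow.
    scores = {}
    for y in priors:
        scores[y] = math.log(priors[y]) + sum(
            n * math.log(phi[y][w])
            for w, n in features.items() if w in phi[y])
    return max(scores, key=scores.get)

# Hypothetical trained parameters for illustration:
priors = {"pos": 0.5, "neg": 0.5}
phi = {"pos": {"good": 0.75, "bad": 0.25},
       "neg": {"good": 1/3, "bad": 2/3}}
label = decide({"good": 2, "bad": 1}, priors, phi)
# label == "pos"
```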
An Example of Multinomial Naïve Bayes

Estimates:
    P̂(c) = N_c / N
    P̂(w | c) = ( count(w, c) + 1 ) / ( count(c) + |V| )

Type     | Doc | Words                               | Class
Training | 1   | Chinese Beijing Chinese             | c
Training | 2   | Chinese Chinese Shanghai            | c
Training | 3   | Chinese Macao                       | c
Training | 4   | Tokyo Japan Chinese                 | j
Test     | 5   | Chinese Chinese Chinese Tokyo Japan | ?

Priors:
    P(c) = 3/4
    P(j) = 1/4

Conditional probabilities:
    P( Chinese | c ) = (5+1) / (8+6) = 6/14 = 3/7
    P( Tokyo   | c ) = (0+1) / (8+6) = 1/14
    P( Japan   | c ) = (0+1) / (8+6) = 1/14
    P( Chinese | j ) = (1+1) / (3+6) = 2/9
    P( Tokyo   | j ) = (1+1) / (3+6) = 2/9
    P( Japan   | j ) = (1+1) / (3+6) = 2/9

Choosing a class:
    P(c | d5) ∝ 3/4 × (3/7)^3 × 1/14 × 1/14 ≈ 0.0003
    P(j | d5) ∝ 1/4 × (2/9)^3 × 2/9 × 2/9 ≈ 0.0001
so document 5 is assigned class c.
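The arithmetic of this worked example can be checked directly (reproducing the given numbers, not introducing new data):

```python
# Test document d5 = "Chinese Chinese Chinese Tokyo Japan".
# Unnormalized posteriors, using the priors and smoothed
# conditionals computed from the four training documents:
p_c = 3/4 * (3/7)**3 * (1/14) * (1/14)   # ≈ 0.0003
p_j = 1/4 * (2/9)**3 * (2/9) * (2/9)     # ≈ 0.0001
# p_c > p_j, so class c is chosen for d5.
```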
Continue…
Binarized Naïve Bayes
Identical to Multinomial Naïve Bayes; the only difference
is that instead of counting every occurrence of a token in
a document, we count it at most once per document.
Reason: the occurrence of a word matters more than its
frequency, and weighting its multiplicity does not improve
the accuracy.
Accuracy – 79-82%
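The one-line difference from the multinomial variant can be illustrated as follows (a sketch; whitespace tokenization is a simplifying assumption):

```python
from collections import Counter

def binarized_features(document):
    # Binarized Naïve Bayes: each token is counted at most once per
    # document, since presence matters more than frequency here.
    return Counter(set(document.lower().split()))

multi = Counter("good good movie".split())
binary = binarized_features("good good movie")
# multi["good"] == 2, but binary["good"] == 1
```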