1. 1
A SENTIMENT ANALYSIS AND CLASSIFICATION ALGORITHM
UTILIZING AN INDEPENDENT TERM MATCHING SCHEME
SENSITIVE TO WORD COUNT PATTERNS
Authors:
Asoka Korale, Ph.D., C.Eng., MIET
Chanuka Perera, Dip., ABE(UK)
Eranda Adikari, B.Sc., C.Eng., MIESL
Nadeesha Ekanayake, B.Sc.,
2. 2
Business Drivers of “Sentiment Analysis” & Classification
Devise a Customer focused Corporate Strategy
Help Determine Areas of Future Investments
Analysis of Customer Feedback for Decision making
Insights on Corporate Image, Service Level and Performance
Business Process Improvement …
3. 3
Objective of the Modeling
Prioritize Comments by Sentiment (Severity of Feedback)
Classify Comments to Pre Defined Categories
Rate Sentiment contained in Feedback
Analyze Feedback Comments, Prioritize and Classify for Timely Action
Direct each Class to Appropriate Authority in Priority Order for Timely action
4. 4
“Sentiment” a Definition
Concise “Comments” give insight to “Emotional” content of message
Emotional Dimensions of Words
Valence (Happiness), Activation (Arousal), Dominance
An Opinion, View held or Expressed
Only “Select” words convey “Emotion”
Dictionaries of rated Words across each Emotional Dimension
Account separately for “Negations”
Words rated for “Sentiment” by Human agents via large Surveys
Introduce Local Language Support
5. 5
Feedback Comment Classification Process
Supervised Methods employ “Training Sequences”
Technique uses word Combinations, Patterns, Frequencies
Grouping comments on a “Theme” or Criteria into “Classes”
Requires Pre-Classified Comments
Suitable for classifying large texts
6. 6
Sentiment Analysis via Independent Term Matching
Assumptions -
Twitter, FB & Customer comments
Each term in a comment independent of others
Valence, Activation and Dominance components of each word drawn from a
Normal Distribution with specified Mean and Standard Deviation
Combined overall sentiment rating of matched words occurs at
maximum of the sum of the individual Normal Densities
Overall Sentiment in a comment represented by the combined effect of
the sentiment of individual words in the comment
Suitable for small text data
Ref: http://www.csc.ncsu.edu/faculty/healey/tweet_viz/
7. 7
Algorithm – Sentiment Score for each Comment
I. Comments in Series: Each Analyzed Separately
II. Select a Comment, Convert words to Lower case and Remove Punctuation
III. Find match in Dictionary for each word in selected comment and get corresponding mean and standard deviation
IV. Extract Mean and Standard Deviation of “Valence” and “Activation” attributes of each matched word from Dictionary
V. Compute a Normal Density Function with Mean and Standard Deviation corresponding to each Attribute of each matched word by scaling a Standard Normal Random Variable
VI. Compute the sum of the Density functions corresponding to each attribute of all matched words in the comment
VII. Determine Maximum point “max-GMM” of the sum of the Density functions to arrive at an average score for the effect of that attribute across all words in the comment
$$\mu = (\mu_1, \mu_2, \ldots, \mu_n)^{T}, \qquad \sigma = (\sigma_1, \sigma_2, \ldots, \sigma_n)^{T}$$
Comment Words (Dictionary Values)

Word             Valence Mean   Valence Std Dev   Activation Mean   Activation Std Dev
'service'        6.83           1.54              2.95              2.09
'good'           7.89           1.24              3.66              2.72
'late'           3.32           1.17              5.57              2.56
Simple Average   6.01           1.32              4.06              2.46

Measure          Valence Rating   Activation Rating
max-GMM          7.5              3.7
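Steps II to IV of the algorithm above can be sketched in Python as follows. This is a minimal illustration, not the authors' Matlab implementation: the three-word `DICTIONARY` holds only the entries shown in the table, the function names `match_words` and `attribute_stats` are hypothetical, and replacing punctuation with spaces is one assumed way of "removing" it.

```python
import string

# Dictionary entries copied from the table above: word -> (mean, std dev) per attribute.
DICTIONARY = {
    "service": {"valence": (6.83, 1.54), "activation": (2.95, 2.09)},
    "good":    {"valence": (7.89, 1.24), "activation": (3.66, 2.72)},
    "late":    {"valence": (3.32, 1.17), "activation": (5.57, 2.56)},
}

# Assumption: punctuation is replaced by spaces so that "good,but" still splits into two words.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def match_words(comment, dictionary=DICTIONARY):
    """Steps II-III: lower-case, strip punctuation, keep the words found in the dictionary."""
    cleaned = comment.lower().translate(_PUNCT_TO_SPACE)
    return [word for word in cleaned.split() if word in dictionary]


def attribute_stats(words, attribute, dictionary=DICTIONARY):
    """Step IV: collect the means and std devs of one attribute for the matched words."""
    means = [dictionary[word][attribute][0] for word in words]
    stds = [dictionary[word][attribute][1] for word in words]
    return means, stds


matched = match_words("The service was good, but was late!")
means, stds = attribute_stats(matched, "valence")
simple_average = sum(means) / len(means)
print(matched, round(simple_average, 2))
```

The simple average computed here corresponds to the "Simple Average" row of the table; the max-GMM score needs the density sum described in steps V to VII.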
8. 8
Gaussian Mixtures in Rating “Total Sentiment”
Consider the cumulative effect of all matched sentiment bearing words via the sum of the individual probability densities:

$$f(x; m, \sigma) = \sum_{k=1}^{N} p_k \, g(x; m_k, \sigma_k), \qquad p_k = \frac{1}{N}$$

$$g(x; m_k, \sigma_k) = \frac{1}{\sigma_k \sqrt{2\pi}} \, e^{-\frac{(x - m_k)^2}{2 \sigma_k^2}}$$

where x represents the sentiment score, N the number of matched words in a comment, and $m_k, \sigma_k$ the mean and standard deviation of the Normal Distribution of the ratings of each matched word.

The overall sentiment $x_{comment}$ of a comment in a particular dimension is then determined as

$$x_{comment} = \arg\max_{x} f(x; m, \sigma)$$

which is the point at which the probability density of the mixture of distributions is a maximum, and so is the most likely value for the overall sentiment of a comment composed of several words.
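The maximum of the mixture can be located numerically. The sketch below uses a plain grid search over an assumed rating range of 1 to 9 (the range, the grid step and the function names `normal_pdf` and `max_gmm` are assumptions made for illustration; the slides do not say how the maximum is found). The means and standard deviations are those of 'service', 'good' and 'late'.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a Normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def max_gmm(means, stds, lo=1.0, hi=9.0, step=0.01):
    """Return the grid point at which the equally weighted sum of densities is largest."""
    n = len(means)
    best_x, best_f = lo, -1.0
    for i in range(int(round((hi - lo) / step)) + 1):
        x = lo + i * step
        f = sum(normal_pdf(x, m, s) for m, s in zip(means, stds)) / n  # p_k = 1/N
        if f > best_f:
            best_x, best_f = x, f
    return best_x


valence = max_gmm([6.83, 7.89, 3.32], [1.54, 1.24, 1.17])
activation = max_gmm([2.95, 3.66, 5.57], [2.09, 2.72, 2.56])
print(round(valence, 2), round(activation, 2))
```

Under these assumptions both maxima land close to the max-GMM ratings quoted on the slides (7.5 for Valence and 3.7 for Activation). The Valence score is pulled toward the two nearby positive words, whereas the simple average of 6.01 is dragged down by 'late'.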
9. 9
Overall Valence (Happiness) and Activation (Arousal) of a comment

Comment Words (Dictionary Values)

Word             Valence Mean   Valence Std Dev   Activation Mean   Activation Std Dev
'service'        6.83           1.54              2.95              2.09
'good'           7.89           1.24              3.66              2.72
'late'           3.32           1.17              5.57              2.56
Simple Average   6.01           1.32              4.06              2.46

Measure          Valence Rating   Activation Rating
max-GMM          7.5              3.7

Figure 1: Gaussian Mixtures of matched words in the Valence Dimension
Figure 2: Gaussian Mixtures of matched words in the Activation Dimension
10. 10
IMPACT OF “NEGATIONS” ON TOTAL RATING
Comment: “the service was not good and late”

Word             Valence Mean   Valence Std Dev   Activation Mean   Activation Std Dev
'service'        6.83           1.54              2.95              2.09
Not 'good'       6.65           1.24              6.38              2.72
'late'           3.32           1.17              5.57              2.56
Simple Average   5.6            1.32              4.97              2.46

Measure          Valence Rating   Activation Rating
max-GMM          6.7              4.5

Comment: “the service was good but was late”

Word             Valence Mean   Valence Std Dev   Activation Mean   Activation Std Dev
'service'        6.83           1.54              2.95              2.09
'good'           7.89           1.24              3.66              2.72
'late'           3.32           1.17              5.57              2.56
Simple Average   6.01           1.32              4.06              2.46

Measure          Valence Rating   Activation Rating
max-GMM          7.5              3.7
Account for Negations by adjusting the sentiment score of the word immediately following the negation in a direction opposite in polarity to its matched dictionary sentiment value.
The magnitude of the adjustment corresponds to the standard deviation of the particular rating value being adjusted. For 'good', Valence moves from 7.89 to 7.89 - 1.24 = 6.65 and Activation from 3.66 to 3.66 + 2.72 = 6.38.
The magnitude of the adjustment can also be user defined.
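A minimal sketch of the negation rule follows. Two details are assumptions rather than statements from the slides: "opposite in polarity" is read as "toward the other side of a neutral midpoint of 5.0", which reproduces both adjusted values for 'good', and the negation word list is hypothetical. The `scale` parameter stands in for the user-definable magnitude.

```python
NEUTRAL = 5.0  # assumed midpoint of the rating scale; not stated in the slides
NEGATIONS = ("not", "no", "never")  # hypothetical negation list


def negate(mean, std, scale=1.0, neutral=NEUTRAL):
    """Shift a rating by scale * std dev toward the opposite side of the neutral point."""
    if mean >= neutral:
        return mean - scale * std
    return mean + scale * std


def apply_negations(words, ratings, scale=1.0):
    """Return (word, mean, std) for matched words, adjusting any word that follows a negation.

    `ratings` maps a word to its (mean, std dev) for one attribute.
    """
    adjusted = []
    for i, word in enumerate(words):
        if word not in ratings:
            continue
        mean, std = ratings[word]
        if i > 0 and words[i - 1] in NEGATIONS:
            mean = negate(mean, std, scale)
        adjusted.append((word, round(mean, 2), std))
    return adjusted


valence = {"service": (6.83, 1.54), "good": (7.89, 1.24), "late": (3.32, 1.17)}
activation = {"service": (2.95, 2.09), "good": (3.66, 2.72), "late": (5.57, 2.56)}
words = "the service was not good and late".split()
print(apply_negations(words, valence))
print(apply_negations(words, activation))
```

Only the mean is shifted; the standard deviation of the negated word is left unchanged, as in the table for “the service was not good and late”.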
11. 11
Variance in Max GMM and Simple Average Measure
For the Valence attribute, the max-GMM and Simple Average scores differ by no more than +/- 0.5 for 90% of the samples.
The CDF of the difference in the Activation attribute is tightly centered on the origin, indicating hardly any variance between the two measures.
This is also an indication that most comments convey sentiments of a single polarity and only a few comments (less than 10%) have words with conflicting emotional content.
Figure 1: Variance between GMM and Simple Average
measures for estimating overall comment sentiment
A measure of the degree of disparate emotions in the comments
12. 12
Sample Comments for Rating and Classification
1.HOTLINE ISSUES - DELAY IN ANSWERING - CX SERVICE ASSISTANCE
Today morning CX has called to the 444 HL for Movie Ticket & he has waited
for more than 10 mins in the line, regarding this now CX was very
disappointed on our service. So pls be kind enough to chk on ths & give the
call back to the CX ASAP. * Note: - Regarding this issue CX need the call
back from one of our manager & CX has requested not to charge a single
rupee from his no for this issue.
2.Yes,man magea prshnaya kiyapu gaman eyaa magea prshnea wisaduwaa
he's a good
3.Yes kad pin nambar signal
4.Wenath ayathana wala mema pahasukam nomati nisa
5.very good service
6.uparimaya
7.Uparima
8.think so
9.thanks
10.Super
11.Solved
12.She resolved my problem.
13.Service nallam
14.Sambanda weemata boho welawak giya nisa
15.recharge
16.Prashnayata pilithura hodin pahadili kara dima
17. Payak athulatha gataluwa nirakaranaya karanwa kiuwa. Thawamath
gataluwa nirakaranaya kara natha.
18.oba ayathanaya sewawan sadaha ihala mudalak ayakarana nisa
19.no mms setting laba dunnada save kala nohaka
20.nam apahu e tika ewanna
21.Mata awashshaya u pilithurau pahadili lesa laba ganemata hakiuna.
22.mage parshnata pilithuru dunna.
23.lotari SMS stop
24.Its professional
25.ing tone sewawa ain kirima
26.I submitted Xtv reg form on 27th oct at yr crescat arcade. They told to call
me on 28th wed to give the AC No
27.Hot line eka answer karapu girlge voice eka and care eka good
28.Hi kohomada? Mama mea dawas wala plan karagena yanawa mage next
music video eka karanna. Song eka "Mata Rawana" :-)
29.harima pehediliwa mage getaluwa nirakaranaya kala thanks
30.Good service but shortcomings due to some arrogant customer care
officers
31.good men
32.Good
33.getaluwa hadunagenimata noheki wiya..
34.First of all its great to be treated as a privilege customer. Reason is simple.
I'm using X mobile connection and XTV, because dialog has the better
35.durakathanayata pilithuru denda epai eke hoda naraka kiyanna.
36.Cx need to add the CHU CHU TV which is a kids channel to the channel
list.Since this channel is available on another TV connection.Cx need this
channel to activate for XTV aswell.Please check on this and do the needfull.
Thank you
37.Customer service personal have to be trained better cause they can't think
out of the box.
38.bashawa wenaskaranna
13. 13
Sentiment Aggregates on Sample Comments
Fig 1: Heat Map of Sentiment rated sample comments Fig 2: Sentiment Dimensions of sample comments
14. 14
A Novel Association Rule Mining Algorithm
• Initialize (at level L1) by determining set of all Items {I} that meet minimum support criteria
• Determine support for all pairs of items {Ii,Ij} (i ≠ j) in {I}
• Determine rules for all pairs of items of the form Ii->Ij
• At each subsequent level (Lp), p > 1
  • Determine item combinations that meet minimum support criteria
  • Items at subsequent stages selected from rules of previous stage that met min support criteria
  • Antecedent at subsequent level (Lp+1) is formed by merging the antecedent and consequent terms of the rules that meet the minimum support criteria at level Lp
• Stop when combined terms no longer meet min support criteria
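Deriving likely word combinations (Keyword Selection)
• Selection Measures

$$Support(A \rightarrow B) = \frac{N(A \cap B)}{N}$$

$$Confidence(A \rightarrow B) = \frac{Support(A \rightarrow B)}{Support(A)} = \frac{P(E_A \,\&\, E_B)}{P(E_A)} = P(E_B \mid E_A)$$

where $N(A \cap B)$ is the number of comments containing both A and B, and N is the total number of comments.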
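The two selection measures, and the first level of the rule search, can be sketched as below. The sketch covers only level L1 and the pair rules Ii->Ij; the merging of antecedent and consequent at later levels is not implemented. The function names and the toy comments are invented for illustration, and each comment is represented as a set of words.

```python
from itertools import permutations


def support(comments, items):
    """Fraction of comments that contain every word in `items`."""
    items = set(items)
    return sum(items <= comment for comment in comments) / len(comments)


def confidence(comments, antecedent, consequent):
    """Support(A -> B) / Support(A), an estimate of P(E_B | E_A)."""
    both = set(antecedent) | set(consequent)
    return support(comments, both) / support(comments, antecedent)


def pair_rules(comments, min_support):
    """Level L1 items, then every rule Ii -> Ij whose pair meets the minimum support."""
    vocabulary = set().union(*comments)
    frequent = sorted(w for w in vocabulary if support(comments, [w]) >= min_support)
    rules = {}
    for a, b in permutations(frequent, 2):
        if support(comments, [a, b]) >= min_support:
            rules[(a, b)] = confidence(comments, [a], [b])
    return rules


# Toy comments, invented for illustration only.
comments = [
    {"hotline", "delay", "answer"},
    {"hotline", "delay"},
    {"hotline", "good", "service"},
    {"good", "service"},
]
rules = pair_rules(comments, min_support=0.5)
for (a, b), conf in sorted(rules.items()):
    print(f"{a} -> {b}: confidence {conf:.2f}")
```

In this toy data the rule delay -> hotline has higher confidence than hotline -> delay, which shows why the direction of a rule matters when choosing key word combinations.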
15. 15
Simplifying Assumptions of the Naïve Bayes Technique
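Probability of a sequence of words {Xi} in a comment given class Cj:

$$P(X_1, X_2, \ldots, X_N \mid C_j) = \frac{P(X_1, X_2, \ldots, X_N, C_j)}{P(C_j)}$$

$$= \frac{P(X_1 \mid X_2, \ldots, X_N, C_j) \, P(X_2, X_3, \ldots, X_N, C_j)}{P(C_j)}$$

$$= \frac{P(X_1 \mid X_2, \ldots, X_N, C_j) \cdots P(X_N \mid C_j) \, P(C_j)}{P(C_j)}$$

Under the assumption of conditional independence of word Xi given class Cj:

$$P(X_i \mid X_{i+1}, \ldots, X_N, C_j) = P(X_i \mid C_j)$$

$$P(X_1, X_2, \ldots, X_N \mid C_j) = P(X_1 \mid C_j) \, P(X_2 \mid C_j) \cdots P(X_N \mid C_j)$$

The predicted class maximizes the probability of class C given a set of words X = {X1, X2, …, XN}:

$$\hat{C} = \arg\max_{C_j} P(C_j \mid X) = \arg\max_{C_j} \{ P(X \mid C_j) \, P(C_j) \}$$

$$= \arg\max_{C_j} \{ P(X_1 \mid C_j) \, P(X_2 \mid C_j) \cdots P(X_N \mid C_j) \, P(C_j) \}$$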
16. 16
Classification via Naïve Bayes
Assumptions -
The order of the words {Xi} in a comment carries no information given the class {Cj}
A class is determined solely on the specific words in a comment and
their frequency of occurrence in that comment
Conditional Independence of the words in a comment given the class of
the comment
a “bag of words model”
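A bag-of-words Naïve Bayes classifier along these lines can be sketched as follows. The class labels and training comments are invented for illustration (the slides do not name the six categories), the function names are hypothetical, and the add-one (Laplace) smoothing is an assumption added so that a word unseen in a class does not force the product to zero. Logarithms are summed instead of multiplying probabilities, which selects the same maximizing class.

```python
import math
from collections import Counter, defaultdict


def train(labelled_comments):
    """Count class frequencies and per-class word frequencies from (words, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in labelled_comments:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(words, model):
    """Return the class Cj maximizing log P(Cj) + sum over i of log P(Xi | Cj)."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        n_words = sum(word_counts[label].values())
        score = math.log(count / total)
        for word in words:
            # Add-one smoothing (an assumption, not taken from the slides).
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical pre-classified training comments.
training = [
    (["hotline", "delay", "answer"], "hotline"),
    (["hotline", "waiting", "delay"], "hotline"),
    (["recharge", "card", "pin"], "billing"),
    (["charge", "rupee", "recharge"], "billing"),
]
model = train(training)
print(classify(["delay", "hotline"], model))
print(classify(["recharge"], model))
```

Because only word counts enter the score, reordering the words of a comment leaves the predicted class unchanged, which is the bag-of-words property listed above.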
17. 17
Performance of the Classification Algorithm
Accuracy greater than 75% on predicted classes
Accuracy greater than 90% on training samples
Performance will further increase with preprocessing and filtering
single word comments don’t convey meaningful category information
Use misclassified comments to “Retrain” algorithm
Key Words for classification via Association Rules
18. 18
Algorithm Implementation & Results
• Algorithm designed and built from first principles using Matlab programming language
• Local Language Support by updating Dictionary with Sinhala and Tamil words conveying emotion
• 59,000 comments analyzed and Rated for Sentiment and Classified / Binned into six categories
• Improved Classification by word relationships (key words) derived from Association Rule Mining
• 3000 Training comments used with six classes for Training Model
• Fast implementation processing all comments in a few hours
• A Word vs. Frequency Analysis used to determine which new words to add to the Dictionary
• The Sentiment rating is a means to “prioritize” the handling of the sorted and binned comments
• Performance improvement by “re-classifying” misclassified comments and reusing them in Training
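The word versus frequency analysis mentioned above can be sketched as a count of the words that fail to match the dictionary, ranked by how often they occur. This is an illustrative assumption about how such an analysis might be done; the function name is hypothetical, the two-word dictionary is a stand-in, and the comments are taken from the sample comments shown earlier.

```python
import string
from collections import Counter

# Assumption: punctuation is replaced by spaces before splitting into words.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def unmatched_word_frequencies(comments, dictionary_words):
    """Count words that are absent from the dictionary, most frequent first."""
    counts = Counter()
    for comment in comments:
        for word in comment.lower().translate(_PUNCT_TO_SPACE).split():
            if word not in dictionary_words:
                counts[word] += 1
    return counts.most_common()


sample_comments = [
    "thanks",
    "harima pehediliwa mage getaluwa nirakaranaya kala thanks",
    "getaluwa hadunagenimata noheki wiya..",
    "very good service",
]
candidates = unmatched_word_frequencies(sample_comments, {"good", "service"})
print(candidates[:2])
```

Frequent unmatched words are the natural candidates for rating and adding to the dictionary, including local language terms and common misspellings.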
19. 19
Conclusion
• Pre-Processing – improved performance by retaining only the words and word combinations relevant to the classification and to the business purpose of the analysis
• Spelling mistakes will cause problems as words will not match those in dictionary
• Update Dictionary with new words and misspelled words
• Introduce limits on the minimum number of words that should be matched for a comment to
be analyzed – for increased reliability
• Independent Term Matching – doesn’t necessarily capture “meaning” of comment
• short comments can be analyzed to assess overall sentiment
• Rate the emotional content in a comment
• Algorithm can provide other segmentations by matching words specific to the purpose of routing
• Naïve Bayes gave good classification accuracy
• The severity of sentiment in the classified comment used to prioritize comment handling
• Simple averaging of the attribute values to arrive at the combined effect of all matched words in a comment can also be considered, and may give results close to those obtained under the assumption of Normality