Sentiment analysis and opinion mining

Introduction
 Sentiment analysis or opinion mining is the computational study of people's
opinions, appraisals, attitudes, and emotions toward entities, individuals,
issues, events, topics and their attributes.
 The task is technically challenging and practically very useful.
– Businesses want to find public or consumer opinions about their products and services.
– Potential customers want to know the opinions of existing users before they use a
service or purchase a product.
 With the growth of user-generated content on social media (e.g., reviews, forum discussions,
blogs and social networks) on the Web, individuals and organizations are
increasingly using public opinions for their decision making.

Need for Automated Sentiment Analysis
 Finding and monitoring opinion sites on the Web and distilling the information in
them remains a formidable task because of the proliferation of diverse sites.
 Each site typically contains a huge volume of opinionated text that is not always
easily deciphered in long forum postings and blogs.
 The average human reader will have difficulty identifying relevant sites and
accurately summarizing the information and opinions contained in them.
 Human analysis of text information is subject to considerable biases, e.g., people
often pay greater attention to opinions consistent with their own preferences. People
also have difficulty producing consistent results when the amount of information to
be processed is large.
 Automated opinion mining and summarization systems are needed, as subjective
biases and mental limitations can be overcome with an objective sentiment analysis
system.

Levels of Analysis
 Sentiment analysis is carried out at three levels:
 Document level: The task at this level is to classify whether a whole
opinion document expresses a positive or negative sentiment
– Given a product review, the system determines whether the review expresses an
overall positive or negative opinion about the product. This task is commonly known
as document-level sentiment classification.
 This level of analysis assumes that each document expresses opinions on a
single entity (e.g., a single product).
 It is not applicable to documents which evaluate or compare multiple
entities.

Levels of Analysis
 Sentence level: The analysis goes to the sentences and determines
whether each sentence expresses a positive, negative, or neutral opinion.
 Neutral usually means no opinion. This level of analysis is closely related
to subjectivity classification, which distinguishes sentences that express
factual information (called objective sentences) from sentences that
express subjective views and opinions (called subjective sentences).
 Subjectivity is not equivalent to sentiment as many objective sentences can
imply opinions, e.g., “We bought the car last month and the windshield
wiper has fallen off.”
 Analysis can also be done at the clause level, but even the clause level is not enough, e.g.,
“Apple is doing very well in this lousy economy” is positive about Apple but negative about the economy.

Levels of Analysis
 Entity or Aspect Level (also called Feature Level): Document-level and
sentence-level analyses do not discover what exactly people liked and did not like.
 Instead of looking at language constructs (documents, paragraphs, sentences,
clauses or phrases), the aspect level looks directly at the opinion itself. The idea is
that an opinion consists of a sentiment (positive or negative) and a target.
 An opinion without its target being identified is of limited use. Realizing the
importance of opinion targets also helps to understand the sentiment analysis
problem better.
– Example: “Although the service is not that great, I still love this restaurant.”
– The sentence has a positive tone, but we cannot say it is entirely positive. In fact, it is positive
about the restaurant (emphasized), but negative about its service (not emphasized).

Sentiment Lexicon
 The most important indicators of sentiments are sentiment words, or
opinion words. These are words that are commonly used to express positive
or negative sentiments.
 For example, good, wonderful, and amazing are positive sentiment words,
and bad, poor, and terrible are negative sentiment words.
 There are also phrases and idioms, e.g., cost someone an arm and a leg.
 Sentiment words and phrases are instrumental to sentiment analysis.
 A list of such words and phrases is called a sentiment lexicon (or opinion
lexicon).
 Over the years, researchers have designed numerous algorithms to compile
such lexicons

Issues
 A positive or negative sentiment word may have opposite
orientations in different application domains.
 For example, “suck” usually indicates negative sentiment, e.g.,
“This battery sucks,” but it can also imply positive sentiment, e.g.,
“This vacuum cleaner really sucks (dirt).”
 Sarcastic sentences with or without sentiment words are hard to
deal with, e.g., “What a great car! It stopped working in two days.”
 Sarcasm is not very common in consumer reviews about
products and services, but is very common in other places, e.g.,
political discussions.

Issues
 A sentence containing sentiment words may not express any
sentiment.
 This happens in question (interrogative) sentences and conditional
sentences, e.g., “Can you tell me which Sony camera is good?” and “If I can
find a good camera in the shop, I will buy it.”
 These sentences contain the sentiment word “good”, but do not
express a positive or negative opinion on any specific camera.
 However, not all conditional or interrogative sentences are free of
sentiment, e.g., “Does anyone know how to repair this terrible
printer?”

Issues
 Many sentences without sentiment words can also imply opinions.
 Many of these are objective sentences that are used to express
some factual information.
 “This washer uses a lot of water” implies a negative sentiment
about the washer since it uses a lot of resource (water).
 “After sleeping on the mattress for two days, a valley has formed
in the middle” expresses a negative opinion about the mattress.
 These sentences are objective, as they state facts. They contain no
sentiment words.

Issues
 “(1) I bought an iPhone a few days ago. (2) It was such a nice phone. (3) The touch
screen was really cool. (4) The voice quality was clear too. (5) However, my mother was
mad with me as I did not tell her before I bought it. (6) She also thought the phone was
too expensive, and wanted me to return it to the shop …”
 Sentences (2), (3) and (4) express positive opinions.
 Sentences (5) and (6) express negative opinions or emotions.
 Targets
– The target of the opinion in sentence (2) is the iPhone as a whole, and the targets of the opinions in
sentences (3) and (4) are “touch screen” and “voice quality”.
– The target of the opinion in sentence (6) is the price of the iPhone.
– The target of the opinion/emotion in sentence (5) is “me”, not the iPhone.
 The holder of the opinions in sentences (2), (3) and (4) is the author of the review, but in
sentences (5) and (6) it is “my mother”.

Entity: Definition
 An entity e is a product, service, person, event, organization, or
topic. It is associated with a pair, e : (T;W), where T is a hierarchy
of components (or parts), sub-components, and so on, and W is a
set of attributes of e. Each component or sub-component also has
its own set of attributes.
 Samsung Galaxy is an entity.
– It has a set of components, e.g., battery and screen.
– It has a set of attributes, e.g., voice quality, size, and weight.
– The battery has its own set of attributes, e.g., battery life and battery size.

Entity and Attributes
 Entity is represented as a tree or hierarchy. The root of the tree is
the name of the entity.
 Each non-root node is a component or sub-component of the
entity.
 Each link is a part-of relation.
 Each node is associated with a set of attributes.
 An opinion can be expressed on any node and any attribute of the
node.
 Components and attributes are collectively called
“aspects”.

Opinion
 An opinion is a positive or negative sentiment, attitude, emotion or
appraisal about an entity or an aspect of the entity from an opinion holder.
Positive, negative and neutral are called opinion orientations (also called
sentiment orientations, semantic orientations, or polarities).
 An opinion is a quintuple $(e_i, a_{ij}, oo_{ijkl}, h_k, t_l)$, where $e_i$ is the entity, $a_{ij}$ is
aspect $j$ of $e_i$, $oo_{ijkl}$ is the orientation of the opinion about aspect $a_{ij}$, $h_k$ is the
opinion holder, and $t_l$ is the time when the opinion is expressed by $h_k$.
– $oo_{ijkl}$ can be positive, negative or neutral, with different strength/intensity
levels.
– The quintuple can be regarded as a schema of a database table for analysis.

Opinion mining
 Objective: Given a collection of opinion documents D, discover all
opinion quintuples $(e_i, a_{ij}, oo_{ijkl}, h_k, t_l)$ in D.
1. Extract all entity expressions in D, and group synonymous entity expressions
into entity clusters. Each entity expression cluster indicates a unique entity $e_i$.
2. Extract all aspect expressions of the entities, and group aspect expressions into
clusters. Each aspect expression cluster of entity $e_i$ indicates a unique aspect $a_{ij}$.
3. Extract opinion holder and time information from the text or unstructured data.
4. Determine whether each opinion on an aspect is positive, negative or neutral.
5. Produce all opinion quintuples $(e_i, a_{ij}, oo_{ijkl}, h_k, t_l)$ expressed in D based on the
results of the above tasks.

Example of Extraction
 bigXyz on Nov-4-2010:(1) I bought a Motorola phone and my girlfriend
bought a Nokia phone yesterday. (2) We called each other when we got
home. (3) The voice of my Moto phone was unclear, but the camera was
good. (4) My girlfriend was quite happy with her phone, and its sound
quality. (5) I want a phone with good voice quality. (6) So I probably will
not keep it.
 Quintuples extracted:
 (Motorola, voice quality, negative, bigXyz, Nov-4-2010)
 (Motorola, camera, positive, bigXyz, Nov-4-2010)
 (Nokia, GENERAL, positive, bigXyz's girlfriend, Nov-4-2010)
 (Nokia, voice quality, positive, bigXyz's girlfriend, Nov-4-2010)
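Since a quintuple can be regarded as a row of a database table, the four quintuples above can be stored and queried directly. The following is only a minimal sketch: the table name `opinions`, the column names and both helper functions are assumptions made for illustration, not part of the original definition.

```python
import sqlite3

# Quintuples (entity, aspect, orientation, holder, time) from the example above.
QUINTUPLES = [
    ("Motorola", "voice quality", "negative", "bigXyz", "Nov-4-2010"),
    ("Motorola", "camera", "positive", "bigXyz", "Nov-4-2010"),
    ("Nokia", "GENERAL", "positive", "bigXyz's girlfriend", "Nov-4-2010"),
    ("Nokia", "voice quality", "positive", "bigXyz's girlfriend", "Nov-4-2010"),
]

def build_opinion_table(rows):
    """Load quintuples into an in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE opinions ("
        "entity TEXT, aspect TEXT, orientation TEXT, "
        "holder TEXT, opinion_time TEXT)"
    )
    conn.executemany("INSERT INTO opinions VALUES (?, ?, ?, ?, ?)", rows)
    return conn

def orientation_counts(conn, entity):
    """Count how many opinions of each orientation were expressed on one entity."""
    cursor = conn.execute(
        "SELECT orientation, COUNT(*) FROM opinions "
        "WHERE entity = ? GROUP BY orientation",
        (entity,),
    )
    return dict(cursor.fetchall())
```

Once opinions are in this form, summaries such as "how many positive and negative opinions per entity or per aspect" reduce to ordinary database queries.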

Two more Definitions
 An objective sentence (e.g., sentences 1 and 2 of the example above) presents some factual
information about the world, while a subjective sentence expresses
some personal feelings, views or beliefs.
– Subjective expressions come in many forms, e.g., opinions, allegations,
desires, beliefs, suspicions, and speculations.
– A subjective sentence may not contain an opinion (e.g., sentence 5).
– An objective sentence may still imply an opinion: “The earphone broke in
two days” is an objective sentence, but it implies a negative sentiment.

Two more Definitions
 Emotions are our subjective feelings and thoughts
– There are 6 primary emotions, i.e., love, joy, surprise, anger, sadness,
and fear, which can be sub-divided into many secondary and tertiary
emotions. Each emotion can also have different intensities.
– The concepts of emotions and opinions are not equivalent.
– Many opinion sentences express no emotion (e.g., “the voice of this
phone is clear”), which are called rational evaluation sentences
– Many emotion sentences give no opinion, (e.g., “I am so surprised to see
you”)

Document Sentiment Classification
 Given an opinionated document d evaluating an entity e, determine
the opinion orientation oo on e, i.e., determine oo on aspect
GENERAL in the quintuple (e, GENERAL, oo, h, t). Here e, h, and t are
assumed known or irrelevant.
– This task is also known as document-level sentiment classification.
– Sentiment classification assumes that the opinion document d (e.g., a product
review) expresses opinions on a single entity e and the opinions are from a
single opinion holder h.
– This assumption holds for customer reviews of products and services
because each such review usually focuses on a single product and is written
by a single reviewer.

Classification based on Supervised Learning
 Three classes, positive, negative and neutral.
 Since each review already has a reviewer-assigned rating (e.g., 1-5
stars), training and testing data are readily available.
– A review with 4 or 5 stars is a positive review, a review with 1 or 2 stars
is a negative review and a review with 3 stars is a neutral review.
– Commonly used classifiers are naive Bayesian classification and support vector machines (SVM).
– It was shown that using unigrams (a bag of individual words) as features
in classification performed well with either naive Bayesian or SVM.
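The star-to-label mapping and a unigram naive Bayesian classifier can be sketched as follows. This is a toy illustration, not the setup used in the reported experiments: whitespace tokenization, add-one smoothing and the class name `UnigramNaiveBayes` are all assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict

def label_from_stars(stars):
    """Map a 1-5 star rating to a class label, as described above."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"

class UnigramNaiveBayes:
    """Multinomial naive Bayes over unigram counts (add-one smoothing assumed)."""

    def fit(self, reviews):
        # reviews: iterable of (text, stars) pairs
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()
        for text, stars in reviews:
            label = label_from_stars(stars)
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods of each unigram
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In practice the reviewer-assigned ratings provide the labels for free, so the only manual effort is collecting the reviews.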

Feature set for Classification
 Terms and their frequency: individual words or word n-grams and their
frequency counts.
– word positions may also be important.
– TF-IDF weighting scheme.
 Opinion words and phrases: Used to express positive or negative sentiments.
– beautiful, wonderful, good, and amazing are positive opinion words, and bad, poor,
and terrible are negative
– Many opinion words are adjectives and adverbs. Nouns (rubbish, junk, and crap) and
verbs (hate and like) can also indicate opinions.
– There are also opinion phrases and idioms, e.g., “cost someone an arm and a leg”. Opinion
words and phrases are instrumental to sentiment analysis.

 Part of speech: adjectives are important indicators of opinions and treated
as special features.
 Negations: Negation words are important because their appearances often
change the opinion orientation.
– “I don't like this camera” is negative.
– Negation words must be handled with care because not all occurrences of such words
mean negation.
– “not” in “not only but also” does not change the orientation direction
 Syntactic dependency: Word dependency based features generated from
parsing or dependency trees

 Manually labeling training data can be time-consuming and labor-intensive.
 Opinion words can be utilized in the training process.
 Tan et al. used opinion words to label a portion of informative examples
and then learned a new supervised classifier based on the labeled ones.
 Opinion words can be utilized to increase the sentiment classification
accuracy.
 Regression can be used for predicting rating scores (e.g., 1-5 stars)
– because the rating scores are ordinal.
 Domain Specificity: A classifier trained using one domain often performs
poorly when it is applied or tested on another domain.

Classification – Unsupervised Learning
 Three Step Process
 Step 1:
 Phrases containing adjectives or adverbs are extracted, as
adjectives and adverbs are good indicators of opinions.
– Context is important: “unpredictable” braking distance of a car vs.
“unpredictable” ending of a mystery movie.
 The algorithm extracts two consecutive words, where one
member of the pair is an adjective or adverb, and the other
is a context word.

 Step 2: Estimate the semantic orientation of the extracted phrases using the point-wise
mutual information (PMI) measure
$$\mathrm{PMI}(t_1, t_2) = \log_2 \frac{P(t_1 \cap t_2)}{P(t_1)\,P(t_2)}$$
 The ratio between $P(t_1 \cap t_2)$ and $P(t_1)\,P(t_2)$ is a measure of the degree of statistical
dependence between $t_1$ and $t_2$, and the log of this ratio is the amount of information that we
acquire about the presence of one of the words when we observe the other.
 The semantic/opinion orientation (SO) of a phrase is computed based on its association
with the positive reference word “excellent” and its association with the negative
reference word “poor”
$$\mathrm{SO}(\text{phrase}) = \mathrm{PMI}(\text{phrase}, \text{“excellent”}) - \mathrm{PMI}(\text{phrase}, \text{“poor”})$$
 The probabilities are calculated by issuing queries and collecting the number of hits.
Searching the two terms together and separately, we can estimate the probabilities
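As a rough sketch, the PMI and SO definitions above can be computed from hit counts as shown below. The `hits` dictionary, the `total` document count and the `smoothing` constant are assumptions introduced for illustration; the source only states that probabilities are estimated from the number of hits.

```python
import math

def pmi(hits_both, hits_1, hits_2, total):
    """PMI(t1, t2) = log2(P(t1 and t2) / (P(t1) * P(t2))), with P = hits / total."""
    p_both = hits_both / total
    return math.log2(p_both / ((hits_1 / total) * (hits_2 / total)))

def semantic_orientation(hits, phrase, total, smoothing=0.01):
    """SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor").

    hits:      maps a term, or a (phrase, reference word) pair, to a hit count
    total:     assumed size of the searched collection
    smoothing: assumed small constant that avoids log(0) when a pair
               never co-occurs
    """
    pos = pmi(hits[(phrase, "excellent")] + smoothing,
              hits[phrase], hits["excellent"], total)
    neg = pmi(hits[(phrase, "poor")] + smoothing,
              hits[phrase], hits["poor"], total)
    return pos - neg
```

Note that in the difference of the two PMI values both the phrase's own hit count and `total` cancel out, so SO depends only on the two co-occurrence counts and the hit counts of the two reference words.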

 Step 3: The algorithm computes the average SO of all phrases in a review,
and classifies the review as recommended if the average SO is positive and as not recommended otherwise.
– Final classification accuracies on reviews from various domains range from 84% for
automobile reviews to 66% for movie reviews.
 Advantage of document level sentiment classification: it provides a
prevailing opinion on an entity, topic or event.
 Shortcomings:
– It does not give details on what people liked and/or disliked and
– It is not easily applicable to non-reviews, e.g., forum and blog postings, because many
such postings evaluate multiple entities and compare them.

Sentence-level Sentiment Classification
 Document-level sentiment classification techniques can also be applied to
individual sentences.
 The task of classifying a sentence as subjective or objective is often called
subjectivity classification.
 The resulting subjective sentences are then classified as expressing positive or
negative opinions.
 Given a sentence s, two sub-tasks are performed:
 1. Subjectivity classification: Determine whether s is a subjective sentence or
an objective sentence.
 2. Sentence-level sentiment classification: If s is subjective, determine whether
it expresses a positive, negative or neutral opinion.

Assumption
 The sentence expresses a single opinion from a single opinion
holder.
 This assumption is only appropriate for simple sentences with a
single opinion,
– “The picture quality of this camera is amazing.”
 In compound and complex sentences, a single sentence may express
more than one opinion.
– “The picture quality of this camera is amazing and so is the battery life,
but the view finder is too small for such a great camera.”

Opinion Lexicon Expansion
 Opinion words: also known as opinion-bearing words or sentiment words.
 Positive opinion words are used to express some desired states while
negative opinion words are used to express some undesired states.
– beautiful, wonderful, good, and amazing.
– bad, poor, and terrible.
 There are also opinion phrases and idioms: “Cost someone an arm and a leg”.
 Collectively, they are called the opinion lexicon. Used for opinion mining.
 Three Approaches: Manual, Dictionary-based, and Corpus-based.
– The manual approach is time-consuming and is not usually used alone; it is combined
with automated approaches as a final check, because automated methods make mistakes.

Dictionary based approach
 Bootstrapping using a small set of seed opinion words and an online
dictionary, e.g., WordNet.
 The strategy is to first collect a small set of opinion words manually with
known orientations, and then to grow this set by searching for their
synonyms and antonyms.
 The newly found words are added to the seed list and the next iteration
starts. The iterative process stops when no more new words are found.
 After the process completes, manual inspection can be carried out to remove
and/or correct errors.
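The bootstrapping loop can be sketched as below. To keep the example self-contained, the online dictionary is replaced by two plain Python dictionaries (`synonyms` and `antonyms`), which are stand-ins assumed for illustration; a real system would query WordNet or a similar resource. The sketch also assumes that synonyms keep the orientation of the known word and antonyms flip it.

```python
def expand_lexicon(seeds, synonyms, antonyms):
    """Grow a {word: orientation} lexicon from seeds until no new word is found.

    seeds:    dict mapping seed words to +1 (positive) or -1 (negative)
    synonyms: dict mapping a word to a list of its synonyms (toy stand-in)
    antonyms: dict mapping a word to a list of its antonyms (toy stand-in)
    """
    lexicon = dict(seeds)
    frontier = list(seeds)
    while frontier:  # each pass corresponds to one iteration of the bootstrap
        next_frontier = []
        for word in frontier:
            orientation = lexicon[word]
            # Synonyms keep the orientation; antonyms flip it.
            candidates = [(s, orientation) for s in synonyms.get(word, [])]
            candidates += [(a, -orientation) for a in antonyms.get(word, [])]
            for new_word, new_orientation in candidates:
                if new_word not in lexicon:
                    lexicon[new_word] = new_orientation
                    next_frontier.append(new_word)
        frontier = next_frontier  # empty when no new words were found
    return lexicon
```

The returned lexicon would still go through the manual inspection step, since a thesaurus link does not always preserve sentiment.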

Dictionary based approach
 Shortcoming: The approach is unable to find opinion words
with domain and context specific orientations, which is
quite common.
– For example, for a speaker phone, if it is quiet, it is usually
negative. However, for a car, if it is quiet, it is positive.
 The corpus-based approach can help deal with this problem.

Corpus-based approach
 The methods rely on syntactic or co-occurrence patterns and also a seed list of
opinion words to find other opinion words in a large corpus
 The technique starts with a list of seed opinion adjectives, and uses them and a
set of linguistic constraints or conventions on connectives to identify
additional adjective opinion words and their orientations.
 Conjunction “AND”: conjoined adjectives usually have the same orientation.
– “This car is beautiful and spacious”: if “beautiful” is known to be positive, “spacious” can be inferred to be positive too.
– “This car is beautiful and difficult to drive” sounds unnatural, because AND is not usually used to join adjectives of opposite orientations.
 Rules or constraints are also designed for other connectives: OR, BUT,
EITHER-OR, and NEITHER-NOR.
 This idea is called sentiment consistency.
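A deterministic simplification of the sentiment consistency idea is sketched below for the two connectives AND (same orientation) and BUT (opposite orientation). It assumes the set of adjectives is already known, for example from a POS tagger, and it applies the constraints as hard rules rather than learning them from a corpus; the function name is hypothetical.

```python
def propagate_by_conjunction(sentences, seeds, adjectives):
    """Infer orientations of unknown adjectives conjoined with known ones.

    sentences:  list of token lists (already lower-cased)
    seeds:      dict {adjective: +1 or -1} with known orientations
    adjectives: set of words treated as adjectives (assumed given)
    """
    lexicon = dict(seeds)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for tokens in sentences:
            for left, conj, right in zip(tokens, tokens[1:], tokens[2:]):
                if conj not in ("and", "but"):
                    continue
                if left not in adjectives or right not in adjectives:
                    continue
                sign = 1 if conj == "and" else -1  # AND: same, BUT: opposite
                if left in lexicon and right not in lexicon:
                    lexicon[right] = sign * lexicon[left]
                    changed = True
                elif right in lexicon and left not in lexicon:
                    lexicon[left] = sign * lexicon[right]
                    changed = True
    return lexicon
```

Because real text contains exceptions to these conventions, treating them as evidence to be weighed is safer than treating them as hard rules.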

 Learning is applied to a large corpus to determine if two conjoined
adjectives are of the same or different orientations.
 Same and different-orientation links between adjectives are formed
 Clustering is performed on these to produce two sets of words: positive and
negative.
 Inter-sentential consistency applies the same idea to neighboring sentences.
 The same opinion orientation (positive or negative) is usually expressed in a
few consecutive sentences.
 Opinion changes are indicated by adversative expressions such as “but” and
“however”.

 The same word can have different orientations in different contexts, even in the same domain.
– Digital camera: “The battery life is long” (+);
– “The time taken to auto-focus is long” (-).
 Consider both possible opinion words and aspects together, and use
the pair (aspect, opinion word) as the opinion context, e.g., (“battery life”, “long”).
 This determines opinion words and their orientations together with the
aspects that they modify.
 Can be used to analyze comparative sentences.
 Many contexts can be more complex, consuming a large amount of
resources.

Aspect-Based Sentiment Analysis
 In a typical opinionated document, the author writes both positive and
negative aspects of the entity, although the general sentiment on the entity
may be positive or negative. Document and sentence sentiment
classification does not provide such information.
 Aspect-based sentiment analysis needs to be used.
 At the aspect level, the mining objective is to discover every quintuple
$(e_i, a_{ij}, oo_{ijkl}, h_k, t_l)$ in a given document d.
 To achieve the objective, five tasks need to be performed.

Aspect extraction
 Extract aspects that have been evaluated.
– In “The picture quality of this camera is amazing,” the aspect is “picture
quality” of the entity represented by “this camera”. The evaluation is not
about the camera as a whole, but about its picture quality.
– The sentence “I love this camera” evaluates the camera as a whole, i.e.,
the GENERAL aspect of the entity represented by “this camera”.
 Whenever we talk about an aspect, we must know which entity it
belongs to.
 Aspect extraction is a two-step process.

Aspect extraction
 1. Find frequent nouns and noun phrases.
 Nouns and noun phrases (or groups) are identified by a POS tagger; the
frequencies are counted; and only the frequent ones are kept.
 A frequency threshold can be decided experimentally.
 When people comment on different aspects of a product, the vocabulary
that they use usually converges. The nouns that are frequently talked about
are usually genuine and important aspects.
 Irrelevant contents in reviews are often diverse, i.e., they are quite different
in different reviews. These are infrequent nouns
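The first step can be sketched as follows, assuming the reviews have already been POS-tagged. Several choices here are assumptions of the sketch: Penn Treebank style tags (anything starting with "NN" is a noun), counting each noun once per review, handling single nouns only rather than noun phrases, and the illustrative `min_support` value.

```python
from collections import Counter

def frequent_aspects(tagged_reviews, min_support=0.5):
    """Return nouns that occur in at least `min_support` of the reviews.

    tagged_reviews: list of reviews, each a list of (word, POS tag) pairs
    min_support:    fraction of reviews; illustrative value that would be
                    tuned experimentally
    """
    doc_freq = Counter()
    for review in tagged_reviews:
        nouns = {word.lower() for word, tag in review if tag.startswith("NN")}
        doc_freq.update(nouns)  # count each noun once per review
    threshold = min_support * len(tagged_reviews)
    return {noun for noun, count in doc_freq.items() if count >= threshold}
```

Nouns from off-topic remarks tend to appear in only one or two reviews, so they fall below the threshold.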

Aspect extraction
 2. Find infrequent aspects by exploiting the relationships between
aspects and opinion words.
– The previous step can miss many genuine aspect expressions which are infrequent.
This step tries to find some of them.
 The same opinion word can be used to describe or modify different
aspects. Opinion words that modify frequent aspects can also
modify infrequent aspects, and thus can be used to extract
infrequent aspects.
– For example, “picture” has been found to be a frequent aspect, and we have the
sentence, “The pictures are absolutely amazing.”
– If we know that “amazing” is an opinion word, then “software” can also be extracted as an
aspect from the sentence “The software is amazing.”

Aspect extraction
 Point-wise mutual information (PMI) score between the phrase and some
meronymy discriminators associated with the product class can be used.
 The meronymy discriminators for the “scanner” class are “of scanner”,
“scanner has”, “scanner comes with”, etc., which are used to find
components or parts of scanners by searching the Web.
 For a candidate aspect a and a discriminator d:
$$\mathrm{PMI}(a, d) = \frac{\mathrm{hits}(a \cap d)}{\mathrm{hits}(a) \cdot \mathrm{hits}(d)}$$
 If the PMI value of a candidate aspect is too low, it may not be a
component of the product because a and d do not co-occur frequently.

Aspect sentiment classification
 Determine whether the opinions on different aspects are positive,
negative or neutral. In the first example below, the opinion on the
“picture quality" aspect is positive, and in the second example, the
opinion on the GENERAL aspect is also positive.
 In “The picture quality of this camera is amazing,” the aspect is “picture
quality” of the entity represented by “this camera”. The sentence does not indicate the
GENERAL aspect because the evaluation is not about the camera as a
whole, but about its picture quality.
 “I love this camera” evaluates the camera as a whole, i.e., the GENERAL
aspect of the entity represented by “this camera”.

Lexicon-based Approach
 Uses an opinion lexicon (a list of opinion words and phrases) and a
set of rules to determine the orientations of opinions in a sentence.
 It also considers opinion shifters and “but-clauses”.
 Four steps:
 1. Mark opinion words and phrases: Given a sentence that contains
one or more aspects, this step marks all opinion words and phrases in
the sentence.
– Each positive word is assigned the opinion score of +1, each negative word is
assigned the opinion score of -1

 2. Handle opinion shifters: Opinion shifters are words and phrases that can
shift or change opinion orientations.
– Negation words like not, never, none, nobody, nowhere, neither and cannot are the
most common type.
 Sarcasm changes orientation
– “What a great car, it failed to start the first day.”
 Spotting them and handling them correctly in actual sentences by an
automated system is not easy.
 Not every appearance of an opinion shifter changes the opinion orientation
– “not only … but also”

 3. Handle but-clauses:
 In English, “but” signals a contrast.
 A sentence containing “but” is handled by applying the following rule:
– The opinion orientations before “but” and after “but” are opposite to each other if
the opinion on one side cannot be determined.
 “not only … but also” needs to be handled separately.
 There are contrary words and phrases that do not always indicate an
opinion change
– “Audi is great, but Mercedes is better".
 Such cases need to be identified and dealt with separately.

 4. Aggregating opinions: This step applies an opinion aggregation function to
the resulting opinion scores to determine the final orientation of the opinion
on each aspect in the sentence.
 Consider a sentence S, which contains a set of aspects {a_1, …, a_m} and a set of
opinion words or phrases {ow_1, …, ow_n} with their opinion scores. The
opinion orientation for each aspect a_i in S is
$$\mathrm{Score}(a_i, S) = \sum_{ow_j \in S} \frac{ow_j.oo}{\mathrm{dist}(ow_j, a_i)}$$
– where ow_j is an opinion word/phrase in S, and dist(ow_j, a_i) is the distance between aspect
a_i and opinion word ow_j in S.
– ow_j.oo is the opinion score of ow_j. Dividing by the distance gives lower weights to opinion
words that are far away from aspect a_i.
– If the final score is positive, the opinion on aspect a_i is positive; if it is negative, the opinion is negative.
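The aggregation function can be sketched as below. Measuring distance as the difference in token positions and treating each aspect as a single token are assumptions of this sketch; the source does not fix how the distance is computed.

```python
def aspect_score(tokens, aspect_index, opinion_scores):
    """Sum of opinion scores in the sentence, each divided by its distance
    to the aspect.

    tokens:         the sentence S as a list of words
    aspect_index:   position of the aspect a_i in `tokens`
    opinion_scores: dict {opinion word: +1 or -1}, assumed to already
                    reflect opinion shifters and but-clauses
    """
    score = 0.0
    for j, word in enumerate(tokens):
        if j != aspect_index and word in opinion_scores:
            # Nearby opinion words contribute more than distant ones.
            score += opinion_scores[word] / abs(j - aspect_index)
    return score
```

For the sentence "the screen is great but the battery is terrible", the word "great" dominates the score for "screen" while "terrible" dominates the score for "battery", so the two aspects receive opposite orientations.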

Simultaneous Opinion Lexicon Expansion
and Aspect Extraction
 Needs an initial set of opinion word seeds as the input (no seed aspects)
 Opinions almost always have targets and there are natural relations
connecting opinion words and targets in a sentence
– Opinion words have relations among themselves and so do targets among themselves.
 The opinion targets are aspects. Opinion words can be recognized by
identified aspects, and aspects can be identified by known opinion words.
– The extracted opinion words and aspects are utilized to identify new opinion words
and new aspects, which are used again to extract more opinion words and aspects.
– Propagation stops when no more opinion words or aspects can be found.

Dependency grammar
 Dependency grammar was adopted to describe the relations. The Algorithm
uses only direct dependencies to model the relations.
– A direct dependency indicates that one word depends on the other word without any
additional words in their dependency path or they both depend on a third word
directly.
 Some constraints are also imposed. Opinion words are considered to be
adjectives and aspects nouns or noun phrases.
– In “Canon G3 produces great pictures”, the adjective “great” is parsed as directly
depending on the noun “pictures”. If “great” is known to be an opinion word, then given the rule ‘a
noun on which an opinion word directly depends is taken as an aspect’, we can extract
“pictures” as an aspect. Similarly, if “pictures” is known to be an aspect, “great” can be extracted
as an opinion word using a similar rule.

Mining Comparative Opinions
 A comparative sentence expresses a relation based on similarities or
differences of more than one entity.
– The comparison is usually conveyed using the comparative or superlative form of an
adjective or adverb.
 A comparative sentence typically states that one entity has more or less of a
certain attribute than another entity.
– A superlative sentence states that one entity has the most or least of a certain attribute
among a set of similar entities.
 A comparison can be between two or more entities, groups of entities, and
one entity and the rest of the entities. It can also be between versions.

Types of Comparatives and Superlatives
 Comparatives are usually formed by adding the suffix “-er” and superlatives
are formed by adding the suffix “-est” to their base adjectives and adverbs.
– “longer” in “The battery life of Camera-x is longer than that of Camera-y” and “longest” in
“The battery life of this camera is the longest”.
– This type of comparatives and superlatives is called Type 1.
 Some adjectives and adverbs form comparatives or superlatives by using
words like more, most, less and least before them (e.g., more beautiful).
– These are called Type 2. Types 1 and 2 are called regular comparatives and superlatives.
 Irregular comparatives and superlatives, e.g., more, less, least, better, best:
– Grouped under Type 1 because they behave similarly.
 Words like “superior” and “preferred” are also grouped under Type 1.

Types of comparative relations
 Four types
 1. Non-equal gradable comparisons: Relations of the type greater or less than that express
an ordering of some entities with regard to some of their shared aspects.
– “The Intel chip is faster than that of AMD.” “I prefer Intel to AMD.”
 2. Equative comparisons: Relations of the type equal to that state that two or more entities are
equal with regard to some of their shared aspects.
– “The performance of Samsung is about the same as that of LG.”
 3. Superlative comparisons: Relations of the type greater or less than all others that rank one
entity over all others.
– “The Intel chip is the fastest.”

 Comparative words used in non-equal gradable comparisons are categorized
into two groups according to whether they express increased or decreased
quantities, which are useful in opinion analysis.
– Increasing comparatives: Such a comparative expresses an increased quantity, e.g.,
more and longer.
– Decreasing comparatives: Such a comparative expresses a decreased quantity, e.g.,
less and fewer.

 4. Non-gradable comparisons: Relations that compare aspects of two or more
entities, but do not grade them.
 There are three main sub-types:
 Entity A is similar to or different from entity B with regard to some of their
shared aspects, “Coke tastes differently from Pepsi.”
 Entity A has aspect a1, and entity B has aspect a2 (They are usually
substitutable), “Desktop PCs use external speakers but laptops use internal
speakers.”
 Entity A has aspect a, but entity B does not have, e.g., “Phone-x has an
earphone, but Phone-y does not have.”

Objective of mining comparative opinions
 Given a collection of opinionated documents D, discover in D all comparative
opinion sextuples of the form (E1, E2, A, PE, h, t), where
– E1 and E2 are the entity sets being compared based on their shared aspects A
(entities in E1 appear before entities in E2 in the sentence),
– PE (∈ {E1, E2}) is the preferred entity set of the opinion holder h,
– t is the time when the comparative opinion is expressed.
 These sextuples can be mined

Example
 “Ipad's display is better than those of Galaxy and Surface.” written by Vish
in Feb 2016.
 The extracted comparative opinion is:
– ({Ipad}, {Galaxy, Surface}, {display}, preferred: {Ipad}, Vish, Feb 2016)
 The entity set E1 is {Ipad}, and the entity set E2 is {Galaxy, Surface}.
 Their shared aspect set A being compared is {display}.
 The preferred entity set is {Ipad}.
 The opinion holder h is Vish.
 The time t when this comparative opinion was written is Feb 2016.

Case: Sentiment Analysis-Hybrid
Approach
 Combined rule-based classification, supervised learning and
machine learning to form a hybrid method.
 Tested on movie reviews, product reviews and MySpace comments.
 Hybrid classification can improve the classification effectiveness in
terms of micro- and macro-averaged F1.
 F1 is a measure that takes both the precision and recall of a
classifier’s effectiveness into account

Evaluation Metrics
 Precision: $P = \frac{tp}{tp + fp}$; Recall: $R = \frac{tp}{tp + fn}$
 Accuracy: $A = \frac{tp + tn}{tp + tn + fp + fn}$; $F_1 = \frac{2 \cdot P \cdot R}{P + R}$

                  Machine says yes    Machine says no
Human says yes    tp                  fn
Human says no     fp                  tn
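
The four measures can be checked with a short sketch. The counts below are made-up numbers chosen only to exercise the formulas, and returning 0.0 when a denominator is zero is a convention assumed here, not something the slides specify.

```python
def precision(tp, fp):
    # Share of "machine says yes" decisions that the human agrees with.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Share of "human says yes" documents that the machine found.
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical confusion-table counts.
tp, fn, fp, tn = 40, 10, 20, 30
p = precision(tp, fp)            # 40 / 60
r = recall(tp, fn)               # 40 / 50
a = accuracy(tp, tn, fp, fn)     # 70 / 100
f = f1(p, r)                     # 80 / 110
```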

Evaluation Metrics
 1. Micro averaging.
– Given a set of confusion tables, a new two-by-two contingency table is generated.
– Each cell in the new table is the sum of the document counts in the corresponding
cell across the set of tables.
– Precision and recall are then computed from this pooled table to give the average
performance of an automatic classifier.
 2. Macro averaging.
– Given a set of confusion tables, precision and recall are computed separately for each table.
– Each value is the precision or recall of an automatic classifier on one table.
– These values are then averaged to give the average performance of an automatic
classifier.
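
The difference between the two averaging schemes shows up when the tables differ in size. The sketch below uses two made-up confusion tables (one large, one small); the dictionary layout and function names are assumptions for illustration.

```python
def micro_average(tables):
    # Pool the cells of all tables into one table, then compute P and R once.
    tp = sum(t["tp"] for t in tables)
    fp = sum(t["fp"] for t in tables)
    fn = sum(t["fn"] for t in tables)
    return tp / (tp + fp), tp / (tp + fn)


def macro_average(tables):
    # Compute P and R per table, then take the unweighted mean of each.
    precisions = [t["tp"] / (t["tp"] + t["fp"]) for t in tables]
    recalls = [t["tp"] / (t["tp"] + t["fn"]) for t in tables]
    return sum(precisions) / len(tables), sum(recalls) / len(tables)


# Hypothetical counts: the first table dominates the pooled (micro) result.
tables = [
    {"tp": 90, "fp": 10, "fn": 10},
    {"tp": 1, "fp": 9, "fn": 4},
]
micro_p, micro_r = micro_average(tables)   # 91/110 and 91/105
macro_p, macro_r = macro_average(tables)   # (0.9 + 0.1)/2 and (0.9 + 0.2)/2
```

With these numbers the micro average is driven by the large table, while the macro average gives both tables equal weight, so the poorly handled small table pulls it down.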

Rule Based Classification
 A rule consists of an antecedent and its associated consequent that have an
‘if-then’ relation: antecedent ⇒ consequent
– An antecedent is a condition: one or more tokens concatenated by the ^ operator.
– A token can be a word, ‘?’ representing a proper noun, or ‘#’ representing a target term.
– A target term is a term that represents the context in which a set of documents occurs,
such as the name of a politician, a policy recommendation, a company name, a brand of
a product or a movie title.
 A consequent represents a sentiment that is either positive or negative, and is
the result of meeting the condition defined by the antecedent.
– {token_1 ^ token_2 ^ ... ^ token_n} ⇒ {+ | −}
where + is positive sentiment and − is negative sentiment.

Comparative Statements
 1. Laptop-A is more expensive than Laptop-B.
 2. Laptop-A is more expensive than Laptop-C.
 When the target word of these sentences is Laptop-A, the rule derived is:
– {# ^ more ^ expensive ^ than ^ ?} ⇒ {−}
– The target word, Laptop-A, is less favorable than the other two laptops because of its
price. The focus is on the price attribute of Laptop-A.
 When the target words are Laptop-B and Laptop-C, the rule derived is:
– {? ^ more ^ expensive ^ than ^ #} ⇒ {+}
– The two target words, Laptop-B and Laptop-C, are more favorable than Laptop-A
because of their price. The focus is on the price attribute of Laptop-B and Laptop-C.
 The target word is a crucial factor in determining the sentiment of an antecedent.
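
A minimal sketch of how such rules might be applied is given below. Several things here are assumptions made for illustration and are not stated in the slides: the antecedent is matched as an ordered subsequence of the sentence (so the word "is" may sit between tokens), a fixed set of names stands in for a proper-noun recogniser, and the first matching rule wins.

```python
# Hypothetical stand-in for a proper-noun tagger.
PROPER_NOUNS = {"Laptop-A", "Laptop-B", "Laptop-C"}

# Each rule is (antecedent tokens, consequent); '#' is the target term
# and '?' is any other proper noun.
RULES = [
    (["#", "more", "expensive", "than", "?"], "-"),
    (["?", "more", "expensive", "than", "#"], "+"),
]


def token_matches(pattern, word, target):
    if pattern == "#":
        return word == target
    if pattern == "?":
        return word in PROPER_NOUNS and word != target
    return word.lower() == pattern


def antecedent_matches(antecedent, words, target):
    # Ordered-subsequence match: each token must be found after the previous one.
    remaining = iter(words)
    return all(
        any(token_matches(pattern, word, target) for word in remaining)
        for pattern in antecedent
    )


def classify(sentence, target):
    words = sentence.rstrip(".").split()
    for antecedent, consequent in RULES:
        if antecedent_matches(antecedent, words, target):
            return consequent
    return None  # no rule fired


sentence = "Laptop-A is more expensive than Laptop-B."
as_laptop_a = classify(sentence, "Laptop-A")
as_laptop_b = classify(sentence, "Laptop-B")
```

The same sentence yields opposite consequents depending on which entity is treated as the target, which is the point made in the last bullet above.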

General Inquirer Based Classifier (GIBC)
 The first, simplest rule set was based on 3672 pre-classified words
found in the General Inquirer Lexicon (Stone et al. 1966),
 1598 of which were pre-classified as positive and 2074 of which
were pre-classified as negative.
 Here, each rule depends solely on one sentiment bearing word
representing an antecedent.
 A General Inquirer Based Classifier (GIBC) was implemented which
applied the rule set to classify document collections.

Calculation of “Closeness”
 1. Select 120 positive words, such as amazing, awesome, beautiful, and 120 negative
words, such as absurd, angry, anguish, from the General Inquirer Lexicon.
 2. Compose 240 search engine queries per antecedent; each query combines an antecedent
and a sentiment bearing word.
 3. Collect the hit counts of all queries using the Google and Yahoo search engines. Two
search engines were used to check whether the hit counts were influenced by the
coverage and accuracy of a single search engine. For each query, a search engine
returns the number of Web pages that contain both the antecedent and the
sentiment-bearing word, so proximity between antecedent and word is measured at the page level.
 Better precision might be obtained if proximity could be checked at
the sentence level.
 This would raise an ethical issue, however, because each page would have to be downloaded and
stored locally for further analysis.

Calculation of “Closeness”
 4. Collect the hit counts of each sentiment-bearing word and each antecedent.
 5. Use 4 closeness measures to measure the closeness between each antecedent
and 120 positive words (S+) and between each antecedent and 120 negative
words (S−) based on all the hit counts collected.
 $S^{+} = \sum_{i=1}^{120} \mathit{closeness}(\mathit{antecedent}, \mathit{word}_i^{+})$
 $S^{-} = \sum_{i=1}^{120} \mathit{closeness}(\mathit{antecedent}, \mathit{word}_i^{-})$
 If the antecedent co-occurs more frequently with the 120 positive words (S+ >
S−), then this would mean that the antecedent has a positive consequent and
vice versa.
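
The comparison of S+ and S− can be sketched as follows. To keep the example short it uses three words per polarity instead of 120, the hit counts are invented, and the raw hit count is used as the closeness measure; returning `None` on a tie is also an assumption, since the slides do not say what happens when S+ equals S−.

```python
# Hypothetical page-level hit counts for (antecedent, word) queries.
HITS = {
    ("more expensive than", "amazing"): 120,
    ("more expensive than", "awesome"): 90,
    ("more expensive than", "beautiful"): 60,
    ("more expensive than", "absurd"): 200,
    ("more expensive than", "angry"): 150,
    ("more expensive than", "anguish"): 40,
}

POSITIVE_WORDS = ["amazing", "awesome", "beautiful"]
NEGATIVE_WORDS = ["absurd", "angry", "anguish"]


def hit_count_closeness(antecedent, word):
    # Closeness taken as the raw hit count of the combined query.
    return HITS.get((antecedent, word), 0)


def infer_consequent(antecedent, closeness):
    s_pos = sum(closeness(antecedent, w) for w in POSITIVE_WORDS)
    s_neg = sum(closeness(antecedent, w) for w in NEGATIVE_WORDS)
    if s_pos > s_neg:
        return "+"
    if s_neg > s_pos:
        return "-"
    return None  # tie: left undecided in this sketch


result = infer_consequent("more expensive than", hit_count_closeness)
```

Passing the closeness function as a parameter lets the same comparison be reused with any of the other closeness measures.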

Measures of Closeness
 Document Frequency (DF) counts the number of Web pages containing both
an antecedent and a sentiment-bearing word, i.e., the hit count returned by a
search engine. The larger the DF value, the greater the association strength
between antecedent and word.
 The other measures of closeness are
 Mutual Information: $\mathrm{MI} = \log_2 \frac{P(\mathit{word}, \mathit{antecedent})}{P(\mathit{word}) \cdot P(\mathit{antecedent})}$
 Chi-Square
 Log Likelihood Ratio
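
As a rough illustration of the Mutual Information measure, the probabilities can be estimated from hit counts by dividing each count by the total number of pages. The total `n_pages` is an assumed quantity (the slides do not say how it is obtained), all counts below are invented, and returning negative infinity for a zero joint count is a choice made for this sketch.

```python
import math


def mutual_information(hits_both, hits_word, hits_antecedent, n_pages):
    """MI = log2( P(word, antecedent) / (P(word) * P(antecedent)) )."""
    if hits_both == 0:
        return float("-inf")  # the pair never co-occurs
    p_joint = hits_both / n_pages
    p_word = hits_word / n_pages
    p_antecedent = hits_antecedent / n_pages
    return math.log2(p_joint / (p_word * p_antecedent))


# Hypothetical counts: the pair co-occurs 8 times more often than
# independence would predict, so MI is close to log2(8) = 3.
mi = mutual_information(
    hits_both=80, hits_word=10_000, hits_antecedent=1_000, n_pages=1_000_000
)
```

A positive MI means the word and the antecedent appear together more often than chance, a value near zero means they look independent, and a negative value means they co-occur less often than chance.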

Classifiers Used
 General Inquirer Based Classifier (GIBC)
 Rule-Based Classifier (RBC)
 Statistics Based Classifier (SBC)
 Mutual Information (MI).
 Chi-square (χ2)
 Induction Rule Based Classifier (IRBC)
 Support Vector Machines

The evaluation results of the closeness measures – Based on Search Engine
