Data Mining And Business Intelligence
Seminar held at Indian Institute of Management (IIM Raipur)
The market is flooded with advertisements, and it is difficult for any consumer to verify their authenticity.
The research paper discusses a technique of Opinion Mining and Classification that helps consumers choose a product based on genuine reviews available on the internet.
Presented by:
Rajat Katiyar
Amit Singh Chauhan
Komal Billu Kujur
Opinion Mining and Classification Technique to help make better choices before buying a product
1. Data Mining and Business Intelligence
PGP 2012-14
Group no 1
Amit Singh Chauhan (60)
Komal Billu (21)
2. The consumer market is flooded with products of the most varied sorts, each advertised as better, cheaper, and more resistant.
Is the advertisement really true?
INDIAN INSTITUTE OF MANAGEMENT RAIPUR
3. A good solution is to rely on “Word of Mouth” on the web.
The ideal situation is one in which the reader can go through all the available reviews and form an opinion, but:
• The time spent reading reviews would be huge
• Product reviews are written in different languages
4. How can we extract the features of a given product that could be commented upon in a customer review?
5. Significance of the problem
• Mining the web for customer opinions on different products is both a useful and a challenging task.
• This research gives the customer a clear polarity, which is binary in nature (positive or negative).
• Eventually it helps the customer form a firm opinion about the product for which the opinion mining is done.
6. What are the expected results of the project?
The project develops ways to evaluate a system implementing the method presented, and shows the evaluation results obtained when applying the system to a set of manually annotated texts containing customer reviews in English and Spanish.
7. The approach to the problem has been divided into two major phases:
• Preprocessing
• Main Processing
These are followed by:
• Assigning polarity to feature attributes
• Summarization of feature polarity
• Discussion and Evaluation
9. Once the user enters a query about the product, a series of documents is downloaded in different languages.
A second operation is performed to determine the category of the product.
After the category is determined, the product-specific features are extracted using WordNet and ConceptNet.
Product-independent features, which apply to all products, are also extracted.
10. Once we are done with WordNet, we search ConceptNet for further attributes and features.
In the next step we look for features of the product that have not yet been discovered. For a camera, for example, these would be battery life, picture resolution and auto mode.
These features are extracted using bigrams, built from a corpus of target words and the other words used with them in the customer reviews.
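The bigram step can be illustrated with a minimal Python sketch. It assumes that a "bigram" here means a pair of adjacent words in which at least one word is a target word; the review texts, the target set and the function names are made up for illustration and are not taken from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bigrams_with_targets(reviews, targets):
    """Count adjacent word pairs in which at least one word is a target word."""
    counts = Counter()
    for review in reviews:
        tokens = tokenize(review)
        for left, right in zip(tokens, tokens[1:]):
            if left in targets or right in targets:
                counts[(left, right)] += 1
    return counts


# Hypothetical review snippets and target words, for illustration only.
reviews = [
    "The battery life is great and the picture resolution is sharp.",
    "Battery life could be longer, but auto mode works well.",
]
targets = {"battery", "picture", "auto"}

counts = bigrams_with_targets(reviews, targets)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Frequent pairs such as ("battery", "life") are then candidates for product features that the lexical resources did not list.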
12. The main processing phase starts with anaphora resolution, in which we replace anaphoric references with their corresponding referents.
For example, "I bought this camera about a week ago, and so far have found it very simple to use" becomes, after anaphora resolution, "I bought this camera about a week ago, and so far have found <this camera> very simple to use".
Sentence chunking is then done to split the modified text into sentences, and after that sentence extraction is done to remove text of no importance.
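The three steps on this slide can be sketched in Python as a toy pipeline. This is not how a real anaphora resolver works: the referent is passed in by hand, only the lowercase pronoun "it" is replaced, and sentences are kept by a simple keyword test. All function names and the second sentence of the sample review are hypothetical.

```python
import re


def resolve_it(text, referent):
    """Toy anaphora step: replace the standalone pronoun 'it' with '<referent>'."""
    return re.sub(r"\bit\b", lambda match: f"<{referent}>", text)


def split_sentences(text):
    """Toy sentence chunking: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def keep_relevant(sentences, keywords):
    """Toy sentence extraction: keep sentences that mention any keyword."""
    return [s for s in sentences if any(k in s.lower() for k in keywords)]


review = (
    "I bought this camera about a week ago, and so far have found it "
    "very simple to use. My cousin visited last weekend."
)

resolved = resolve_it(review, "this camera")
sentences = split_sentences(resolved)
kept = keep_relevant(sentences, ["camera"])
print(kept)
```

The word boundary in the pattern keeps words that merely contain "it", such as "visited", untouched.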
13. Sentence parsing is done to obtain the sentence structure and component dependencies.
In the next step the features and their values, i.e. attributes, are extracted.
We also assign a modifier to each feature attribute to determine whether the attribute is positive or negative.
The result is a set of triplets of the form (feature, attributeFeature, valueOfModifier).
14. ConceptNet methodology:
• OUT relations: PropertyOf and CapableOf
• IN relations: PartOf and UsedFor
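A small Python sketch can show how the OUT and IN relations might be read off a concept graph. It assumes that an OUT relation is an edge starting at the product concept and an IN relation is an edge ending at it; the edge list is a hand-made stand-in, not real ConceptNet data, and no ConceptNet API is used.

```python
OUT_RELATIONS = {"PropertyOf", "CapableOf"}
IN_RELATIONS = {"PartOf", "UsedFor"}

# Hypothetical (start, relation, end) edges, for illustration only.
EDGES = [
    ("camera", "PropertyOf", "compact"),
    ("camera", "CapableOf", "take pictures"),
    ("lens", "PartOf", "camera"),
    ("battery", "PartOf", "camera"),
    ("tripod", "UsedFor", "camera"),
    ("camera", "IsA", "device"),
]


def candidate_features(concept, edges):
    """Return (out_hits, in_hits) for the concept, as (relation, other_node) pairs."""
    out_hits = [
        (relation, end)
        for start, relation, end in edges
        if start == concept and relation in OUT_RELATIONS
    ]
    in_hits = [
        (relation, start)
        for start, relation, end in edges
        if end == concept and relation in IN_RELATIONS
    ]
    return out_hits, in_hits


out_hits, in_hits = candidate_features("camera", EDGES)
print(out_hits)
print(in_hits)
```

Edges with other relation types, such as the IsA edge above, are ignored by this lookup.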
Feature value extraction:
• feature, attributeFeature, valueOfModifier
Assigning polarity to feature attributes, using a Support Vector Machine (SVM) trained with Sequential Minimal Optimization (SMO):
• The set of anchors contains the terms {featureName, happy, unsatisfied, nice, small, buy}
• 6-dimensional training vectors v(j,i) = NGD(w_i, a_j), where a_j, with j ranging from 1 to 6, are the anchors and w_i, with i from 1 to 30, are the words from the positive and negative categories.
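The construction of the 6-dimensional vectors can be sketched as follows. The distance function is passed in as a parameter; the `toy_distance` used here is a placeholder based on shared letters and is NOT the Normalized Google Distance, which needs search-engine hit counts. The word list and the use of "camera" in place of featureName are assumptions for illustration, and the SVM training itself is not shown.

```python
def build_vectors(words, anchors, distance):
    """Map each word w_i to the vector [distance(w_i, a_1), ..., distance(w_i, a_6)]."""
    return {word: [distance(word, anchor) for anchor in anchors] for word in words}


def toy_distance(x, y):
    """Placeholder distance (Jaccard distance on letter sets); a stand-in for NGD."""
    letters_x, letters_y = set(x), set(y)
    return 1.0 - len(letters_x & letters_y) / len(letters_x | letters_y)


# "camera" stands in for featureName; the words are hypothetical examples
# from the positive and negative categories.
anchors = ["camera", "happy", "unsatisfied", "nice", "small", "buy"]
words = ["good", "bad", "excellent", "poor"]

vectors = build_vectors(words, anchors, toy_distance)
for word, vector in vectors.items():
    print(word, [round(component, 2) for component in vector])
```

Each resulting vector, paired with its positive or negative label, would be one training example for the classifier.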
15. Summarization of feature polarity:
The formulas can be summarized as:
• Fpos(i) = #pos_feature_attributes(i) / #feature_attributes(i)
• Fneg(i) = #neg_feature_attributes(i) / #feature_attributes(i)
• The results shown are triplets of the form (feature, % Positive Opinions, % Negative Opinions)
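The two formulas can be turned into a short Python sketch. It assumes the input is a list of (feature, attribute, polarity) triplets in which the polarity is already the binary label "pos" or "neg"; the sample triplets and the function name are hypothetical.

```python
from collections import defaultdict


def summarize(triplets):
    """Return (feature, % positive, % negative) for each feature, in first-seen order."""
    positive = defaultdict(int)
    total = defaultdict(int)
    for feature, _attribute, polarity in triplets:
        total[feature] += 1
        if polarity == "pos":
            positive[feature] += 1
    return [
        (
            feature,
            100.0 * positive[feature] / total[feature],
            100.0 * (total[feature] - positive[feature]) / total[feature],
        )
        for feature in total
    ]


# Hypothetical classified triplets, for illustration only.
triplets = [
    ("battery", "long", "pos"),
    ("battery", "heavy", "neg"),
    ("battery", "reliable", "pos"),
    ("battery", "weak", "neg"),
    ("lens", "sharp", "pos"),
]

summary = summarize(triplets)
print(summary)
```

Because the polarity is binary, the two percentages for a feature always add up to 100.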
Discussion and Evaluation:
Three formulas for computing the system performance:
• System Accuracy (SA)
• Feature Identification Precision (FIP)
• Feature Identification Recall (FIR)
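The slides name these measures without giving their formulas, so the sketch below uses the usual textbook definitions as an assumption: precision and recall over the set of identified features compared with the manually annotated ones, and accuracy as the share of annotated feature polarities that the system reproduces. The paper's exact definitions may differ, and all data here is made up.

```python
def feature_identification_precision(found, annotated):
    """Assumed FIP: correctly identified features / all identified features."""
    found, annotated = set(found), set(annotated)
    return len(found & annotated) / len(found) if found else 0.0


def feature_identification_recall(found, annotated):
    """Assumed FIR: correctly identified features / all annotated features."""
    found, annotated = set(found), set(annotated)
    return len(found & annotated) / len(annotated) if annotated else 0.0


def system_accuracy(predicted, annotated):
    """Assumed SA: share of annotated feature polarities that the system matches."""
    if not annotated:
        return 0.0
    correct = sum(1 for f, polarity in annotated.items() if predicted.get(f) == polarity)
    return correct / len(annotated)


# Hypothetical system output and manual annotation.
found = {"battery", "lens", "price"}
annotated_features = {"battery", "lens", "zoom", "flash"}

fip = feature_identification_precision(found, annotated_features)
fir = feature_identification_recall(found, annotated_features)
sa = system_accuracy({"battery": "pos", "lens": "neg"}, {"battery": "pos", "lens": "pos"})
print(fip, fir, sa)
```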
16. The Normalized Google Distance (NGD) is a semantic similarity measure
derived from the number of hits returned by the Google search
engine for a given set of keywords. Keywords with the same or similar
meanings in a natural language sense tend to be "close" in units
of Normalized Google Distance, while words with dissimilar meanings
tend to be farther apart.
NGD(x, y) = [max{log f(x), log f(y)} - log f(x, y)] / [log N - min{log f(x), log f(y)}]
Where:
• N is the total number of web pages searched by Google multiplied by the average number of singleton search terms occurring on pages
• f(x) and f(y) are the number of hits for search terms x and y, respectively
• f(x, y) is the number of web pages on which both x and y occur.
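The NGD formula translates directly into a few lines of Python. The sketch assumes the hit counts f(x), f(y), f(x, y) and the value N are already known and passed in as numbers; no search engine is queried, and the counts in the example are invented.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts f(x), f(y), f(x, y) and N."""
    if min(fx, fy, fxy) <= 0:
        raise ValueError("hit counts must be positive to take logarithms")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    if denominator == 0:
        raise ValueError("N must be larger than the smaller hit count")
    return (max(log_fx, log_fy) - log_fxy) / denominator


# Hypothetical hit counts, for illustration only.
same = ngd(1000, 1000, 1000, 10**6)   # terms that always occur together
apart = ngd(1000, 100, 10, 10**6)     # terms that rarely occur together
print(same, apart)
```

With these made-up counts the first pair has distance 0 and the second a distance of about 0.5, which matches the reading that related terms are "close" and unrelated terms are farther apart.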
17. Once the product category is determined, the product-specific features and feature attributes are extracted using:
• WordNet for English
• EuroWordNet for Spanish
The process of determining the specific product features is completed using ConceptNet.
18. Specialised tools for anaphora resolution:
• JavaRAP for English
• SUPAR (Slot Unification Parser for Anaphora Resolution) for Spanish
A Named Entity Recognizer is used to spot names of products, brands and shops.
LingPipe is used to split the text into sentences and to identify the named entities being referred to.
19. Sentence parsing tools:
• Minipar (English)
• FreeLing (Spanish)
To assign polarity to each of the identified attributes of the product, the following are used sequentially:
• Sequential Minimal Optimization (SMO) Support Vector Machine (SVM)
• Normalized Google Distance (NGD)
20. SVM and NGD scores use a set of anchors that must be established beforehand, and choosing them remains largely a subjective matter.
The informal language style that customers use when writing their reviews sometimes makes the identification of words and dependencies in phrases impossible.
21. Currently it is possible to review consumer comments in two languages; the system can be further extended to include other languages.
It can also be extended to extract information from images and photos posted by other users.
It can also be used for suggestive selling, i.e. the user provides his criteria for buying the product, as well as how important each factor is to him, and the system then gives suggestions accordingly.
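One possible reading of the suggestive-selling idea is a weighted ranking, sketched below in Python. It assumes each product already has a percentage of positive opinions per feature and that the user supplies a numeric importance weight per feature; the products, figures and function name are hypothetical and do not come from the paper.

```python
def rank_products(products, weights):
    """Rank products by the weighted average of their % positive opinions."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    scores = {}
    for name, features in products.items():
        # A feature without any opinions contributes 0 to the score.
        weighted = sum(w * features.get(f, 0.0) for f, w in weights.items())
        scores[name] = weighted / total_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical % positive opinions per feature and user-supplied weights.
products = {
    "Camera A": {"battery": 80.0, "lens": 60.0},
    "Camera B": {"battery": 50.0, "lens": 90.0},
}
weights = {"battery": 3, "lens": 1}

ranking = rank_products(products, weights)
print(ranking)
```

A user who cares three times as much about the battery as about the lens gets Camera A suggested first in this example.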
22. A Feature Dependent Method for Opinion Mining and Classification
• By Alexandra BALAHUR (DLSI, Univ. Alicante, Alicante, Spain) and Andrés MONTOYO (DLSI, Univ. Alicante, Alicante, Spain)
http://en.wikipedia.org/wiki/Sequential_minimal_optimization
http://en.wikipedia.org/wiki/Normalized_Google_distance
http://research.microsoft.com/en-us/groups/nlp/
http://en.wikipedia.org/wiki/Natural_language_processing
http://wordnet.princeton.edu/
http://conceptnet5.media.mit.edu/
http://web.media.mit.edu/~hugo/publications/papers/BTTJ-ConceptNet.pdf
http://www.acronymfinder.com/Slot-Unification-Parser-for-Anaphora-Resolution(computer-science)-(SUPAR).html
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8911&rep=rep1&type=pdf