2. Proposed System Architecture
The proposed framework has a modular architecture. It uses an
unsupervised method and a lexical resource to extract opinions
from user reviews posted on the TripAdvisor website.
On the website, users can add reviews of travel-related
content.
The system consists of two modules:
• A content acquisition module
• An analysis module
3. Content Acquisition Module
The acquisition module consists of a web crawler that visits the
tourism website starting from a given URL.
The crawler collects all the links found in the visited pages and
registers the pages it has already visited.
The content of visited pages that contain reviews is sent to the
content extraction module, which parses the HTML source of each page and
extracts the reviews.
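
The crawl loop described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' crawler. The callbacks `fetch` and `is_review_page`, the `max_pages` limit and the same-host restriction are assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, is_review_page, max_pages=100):
    """Breadth-first crawl that stays on the host of start_url.

    fetch(url) -> HTML string and is_review_page(url, html) -> bool are
    hypothetical callbacks supplied by the caller (assumptions).
    Returns (visited_urls, review_pages).
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()
    review_pages = {}
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)  # register the visited page
        html = fetch(url)
        if is_review_page(url, html):
            review_pages[url] = html  # would be passed on to content extraction
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if urlparse(link).netloc == host and link not in visited:
                queue.append(link)
    return visited, review_pages
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client and lets it be tried on an in-memory dictionary of pages.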
Analysis Module
The analysis module processes the reviews from the repository and
implements the opinion mining process.
It includes the processing module, the opinion mining module and
the SentiWordNet lexical database.
Opinion mining is performed using an unsupervised approach at
multiple levels: word level, sentence level and document level.
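
One way to picture the three levels is sketched below. The slides do not say how scores are combined across levels, so taking the word polarity as PosScore minus NegScore and averaging upwards is an assumption made purely for illustration.

```python
from statistics import mean


def word_polarity(pos, neg):
    """Word level: signed polarity in [-1, 1] from a (PosScore, NegScore) pair."""
    return pos - neg


def sentence_polarity(word_scores):
    """Sentence level: mean polarity of the scored words (0.0 if there are none)."""
    if not word_scores:
        return 0.0
    return mean(word_polarity(p, n) for p, n in word_scores)


def document_polarity(sentences):
    """Document level: mean of the sentence polarities (0.0 if there are none)."""
    if not sentences:
        return 0.0
    return mean(sentence_polarity(s) for s in sentences)
```

Here a document is a list of sentences and each sentence is a list of (PosScore, NegScore) pairs, one pair per scored word.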
5. SentiWordNet
• SentiWordNet is a lexical resource derived from WordNet that
assigns numerical values to each synset, representing its scores for
positivity, negativity and objectivity.
ObjScore = 1 − (PosScore + NegScore)
• The web interface allows the user to search for any synset
belonging to WordNet with its associated SentiWordNet scores.
• The advantage of using synsets instead of terms is that each sense
of a word gets its own sentiment scores, because the connotations of a
word can differ depending on the sense.
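
The relation ObjScore = 1 − (PosScore + NegScore) and the idea of one score set per sense can be sketched as a small data structure. The class name `SynsetScore`, the table `SENSES` and all numbers in it are made up for illustration and are not taken from SentiWordNet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetScore:
    gloss: str
    pos: float
    neg: float

    def __post_init__(self):
        # PosScore and NegScore must leave a non-negative objectivity score.
        if self.pos < 0.0 or self.neg < 0.0 or self.pos + self.neg > 1.0:
            raise ValueError("pos and neg must be non-negative and sum to at most 1")

    @property
    def obj(self):
        # ObjScore = 1 - (PosScore + NegScore)
        return 1.0 - (self.pos + self.neg)


# Made-up scores for two hypothetical senses of the same word (assumption).
SENSES = {
    "cold": [
        SynsetScore("having a low temperature", pos=0.0, neg=0.25),
        SynsetScore("lacking warmth of feeling", pos=0.0, neg=0.75),
    ]
}
```

Looking up a term then returns a list of senses, and the sentiment depends on which sense is chosen.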
6. SentiWordNet cannot handle multi-word
queries, so we suggest preprocessing
them in the following way:
1. Tokenization
2. POS tagging
3. Reduction of the text to nouns, adjectives, verbs and
adverbs (optionally filtering out named
entities)
4. Normalization: stemming and/or
lemmatization
“If SentiWordNet does not find any suiting
synset, the sentiment scores for this word
simply are all zero”
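
The four preprocessing steps and the all-zero fallback can be sketched as a small pipeline. To keep the example self-contained, the tagger, the lemma table and the score table are tiny hypothetical lookup tables (`TOY_TAGS`, `TOY_LEMMAS`, `TOY_SCORES`) with made-up values. A real system would use a trained POS tagger, a lemmatizer and SentiWordNet itself.

```python
import string

CONTENT_TAGS = {"n", "a", "v", "r"}  # nouns, adjectives, verbs, adverbs

# Hypothetical toy resources with made-up entries (assumptions).
TOY_TAGS = {"grandpa": "n", "bought": "v", "a": "det",
            "very": "r", "cute": "a", "toy": "n"}
TOY_LEMMAS = {"bought": "buy"}
TOY_SCORES = {("cute", "a"): (0.5, 0.0)}  # (lemma, tag) -> (PosScore, NegScore)


def tokenize(text):
    """Step 1: split at whitespace, strip surrounding punctuation, lowercase."""
    tokens = (t.strip(string.punctuation).lower() for t in text.split())
    return [t for t in tokens if t]


def pos_tag(tokens):
    """Step 2: look each token up in the toy lexicon; unknown words get None."""
    return [(t, TOY_TAGS.get(t)) for t in tokens]


def keep_content_words(tagged):
    """Step 3: keep only nouns, adjectives, verbs and adverbs."""
    return [(t, tag) for t, tag in tagged if tag in CONTENT_TAGS]


def normalize(tagged):
    """Step 4: replace each token by its lemma where the toy table has one."""
    return [(TOY_LEMMAS.get(t, t), tag) for t, tag in tagged]


def preprocess(text):
    return normalize(keep_content_words(pos_tag(tokenize(text))))


def score(tagged):
    """Words without a matching entry get all-zero scores."""
    return [(t, tag, TOY_SCORES.get((t, tag), (0.0, 0.0))) for t, tag in tagged]
```

For example, `score(preprocess("Grandpa bought a very cute toy!"))` drops the determiner, maps "bought" to "buy", and assigns zero scores to every word except "cute".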
7. Example analysis
We take a product review from amazon.com which was rated with five stars.
“This cute little set is not only sturdy and realistic, it was also a wonderful
introduction to preparing food for our 3 year old daughter. Ever since her
Grandpa bought this for her, she's made everything from a cheese sandwich to
a triple decker salami club! She had so much fun playing with this toy, that
she started to become interested in how I prepared meals. She now is very
eager to spread peanut butter and jam, layer turkey and cheese and help mix
cake batter.[...] A very cute gift to give a girl or boy.”
We assume that tokenization is simply performed by splitting the text at
whitespace. POS tags are marked with different colours: nouns, adjectives, verbs,
and adverbs.
The preprocessing pipeline turns words like “was” into “is”, “preparing” into
“prepare”, “bought” into “buy” and so on.
Word sense disambiguation for this example is performed manually, e.g. “spread”
is assigned to the synset “cover by spreading something over; ‘spread the bread
with cheese’”.
8. If we take the average and stay on the positivity/objectivity/negativity scale, the
text is 84% objective, 12% positive and around 4% negative (b). This does not agree
with the five-star rating or with our intuition that reviews are always rather subjective
texts.
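
The averaging step can be sketched as below. The function name `average_scores` and the input numbers are made up for illustration and do not reproduce the 84%/12%/4% result. They only show how a few polar words among mostly objective ones yield a predominantly objective average.

```python
def average_scores(triples):
    """Average (PosScore, NegScore, ObjScore) triples component-wise."""
    if not triples:
        return (0.0, 0.0, 0.0)
    n = len(triples)
    return tuple(sum(t[i] for t in triples) / n for i in range(3))


# Made-up word scores (assumption): one positive word, one negative word
# and two fully objective words.
words = [
    (0.5, 0.0, 0.5),
    (0.0, 0.25, 0.75),
    (0.0, 0.0, 1.0),
    (0.0, 0.0, 1.0),
]
pos, neg, obj = average_scores(words)
print(f"positive {pos:.1%}, negative {neg:.1%}, objective {obj:.1%}")
```

Because every word contributes equally, the many near-objective words dominate the mean even when the few polar words are clearly positive.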
9. References
Julia Kreutzer & Neele Witte. Opinion Mining Using SentiWordNet. Semantic Analysis, HT 2013/14, Uppsala University.
Research paper in opinion mining techniques in Tourism.
Handbook of Natural Language Processing, Second Edition. Chapman & Hall/CRC Machine Learning & Pattern Recognition, 2010 (book).
Sentiment analysis: Measuring opinions. International Conference on Advanced Computing Technologies and Applications (ICACTA-2015).
Research paper: Identifying Customer Preferences about Tourism
Produc
Editor's Notes
The review sentences are evaluated by identifying parts of speech with a POS tagging algorithm.
An opinion mining analysis is performed for each sentence. Through tokenization, each sentence is split into its component words, and the polarity of each word is evaluated using SentiWordNet.