This document summarizes research on grouping customer reviews using unsupervised machine learning. The research analyzed why some positively or negatively labeled reviews were incorrectly clustered. It found that the importance of words in incorrectly clustered reviews was often more similar to words in the opposite cluster than the correct one. Specifically, words with low frequencies but high importance similarities across clusters misled the clustering algorithm. Improving the handling of insignificant but misleading words could enhance clustering accuracy.
1. + Grouping Customer Opinions Written in Natural Language Using Unsupervised Machine Learning
František Dařena, Jan Žižka, Karel Burda
Department of Informatics, Faculty of Business and Economics
Mendel University in Brno, Czech Republic
2. +
Introduction
Many companies collect opinions expressed by their customers
These opinions can hide valuable knowledge
Discovering such knowledge manually can be a very demanding task because:
the opinion database can be very large,
the customers can use different languages,
people can handle the opinions subjectively,
sometimes additional resources (like lists of positive and negative words) might be needed.
3. +
Introduction
Our previous research focused on analyzing what was significant for including a certain opinion in one of the categories, such as satisfied or dissatisfied customers
However, this requires the reviews to be separated into classes sharing a common opinion/sentiment
4. +
Introduction
Clustering, as the most common form of unsupervised learning, enables automatic grouping of unlabeled documents into subsets called clusters
In the previous research, we analyzed how well a computer can separate the classes expressing a certain opinion, and sought a clustering algorithm with its best set of parameters: similarity measure, clustering-criterion function, word representation, and the role of stemming for the given specific data
5. +
Objective
The clustering process is naturally not error-free, and some reviews labeled as positive appear in a cluster containing mostly negative reviews, and vice versa
The objective was to analyse why certain reviews were assigned "wrongly" to a group containing mostly reviews from a different class, in order to improve the results of classification and prediction
6. +
Data description
The processed data included reviews of hotel clients collected from publicly available sources
The reviews were labeled as positive or negative
Review characteristics:
more than 5,000,000 reviews
written in more than 25 natural languages
written only by real customers, based on their experience
written relatively carefully, but still containing errors that are typical for natural languages
7. +
Properties of data used for experiments
The subset (marked as written in English) used in our experiments contained almost two million opinions

Review category          Positive       Negative
Number of reviews        1,190,949      741,092
Maximal review length    391 words      396 words
Average review length    21.67 words    25.73 words
Variance                 403.34         618.47
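The length statistics in the table above can be reproduced for any review set with a short script; the reviews below are made-up stand-ins, not the actual data.

```python
# Toy sketch: compute the kind of length statistics reported in the
# table for a list of reviews (the reviews here are invented examples).
reviews = [
    "The rooms are new and the breakfast is great",
    "Good location and very quiet",
    "The air conditioning was not working",
]

lengths = [len(r.split()) for r in reviews]              # review length in words
n = len(lengths)
maximal = max(lengths)
average = sum(lengths) / n
variance = sum((x - average) ** 2 for x in lengths) / n  # population variance

print(n, maximal, round(average, 2), round(variance, 2))
```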
8. +
Review examples
Positive
"The breakfast and the very clean rooms stood out as the best features of this hotel."
"Clean and moden, the great loation near station. Friendly reception!"
"The rooms are new. The breakfast is also great. We had a really nice stay."
"Good location - very quiet and good breakfast."
Negative
"High price charged for internet access which actual cost now is extreamly low."
"water in the shower did not flow away"
"The room was noisy and the room temperature was higher than normal."
"The air conditioning wasn't working"
9. +
Data preparation
Data collection, cleaning (removing tags and non-letter characters), converting to upper-case
Removing stopwords and words shorter than 3 characters
Spell checking, diacritics removal, etc. were not carried out
Creating three smaller subsets containing positive and negative reviews with the following proportions:
about 1,000 positive and 1,000 negative (small)
about 50,000 positive and 50,000 negative (medium)
about 250,000 positive and 250,000 negative (large)
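The preparation steps above can be sketched as a small cleaning function. This is a minimal illustration, not the study's actual pipeline; the stopword list here is a tiny stand-in for a full one.

```python
import re

# Tiny illustrative stopword list -- the real experiments would use a full one.
STOPWORDS = {"THE", "AND", "WAS", "VERY", "A", "IS"}

def prepare(review: str) -> list[str]:
    """Clean one review as the slides describe: strip tags and
    non-letter characters, convert to upper-case, then drop stopwords
    and words shorter than 3 characters."""
    text = re.sub(r"<[^>]+>", " ", review)       # remove tags
    text = re.sub(r"[^A-Za-z]+", " ", text)      # remove non-letter characters
    words = text.upper().split()
    return [w for w in words if len(w) >= 3 and w not in STOPWORDS]

print(prepare("The room was <b>noisy</b> and the temperature was higher than normal."))
```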
10. +
Experimental steps
Transformation of the data into the vector representation (bag-of-words model, tf-idf weighting schema)
Clustering with Cluto* with the following parameters:
similarity function – cosine similarity,
clustering method – k-means (Cluto's variation),
criterion function optimized during the clustering process – H2
Weighted entropy of the results varied from about 0.58 to 0.60 (e.g., for the small set of reviews, the entropy was 0.587 and the accuracy 0.859)
* Free software providing different clustering methods working with several clustering criterion functions and similarity measures, suitable for operating on very large datasets.
11. +
Graphical representation of the results of clustering
False Positive (FP)    False Negative (FN)
True Positive (TP)     True Negative (TN)
Clustered Positive (CP)    Clustered Negative (CN)
12. +
Analysis of incorrectly clustered reviews
When a review rPi, originally labeled as positive, is "wrongly" assigned to a cluster with mostly negative reviews (CN), we can assume that the properties of this review are more "similar" to the properties of the other reviews in CN, i.e., the words of rPi and their combinations are more similar to the words contained in the dictionary of CN
The similarity was related to the frequency of the words of rPi in the subsets of the clustering solution (FN is compared to TN, TP, and CP; FP is compared to TP, TN, and CN)
13. +
Analysis of incorrectly clustered reviews
We introduce the importance i_X(w_i) of a word w_i in a given set X:

i_X(w_i) = N_X(w_i) / N_X

where N_X(w_i) is the frequency of word w_i in set X and N_X is the number of dictionary words in X
14. +
Analysis of incorrectly clustered reviews
The importance of a word in one set should be similar to the importance of the same word in the most similar set, i.e., the importance of words in FN and TN should be more similar than, e.g., the importance of words in FN and TP
The lowest value among |i_FP(w_i) - i_TP(w_i)|, |i_FP(w_i) - i_TN(w_i)|, and |i_FP(w_i) - i_CN(w_i)| corresponds to the highest importance similarity with TP, TN, or CN
The same comparisons between FN and TN, TP, and CP were carried out
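The importance measure i_X(w) = N_X(w) / N_X and the minimum-difference comparison can be sketched in a few lines. The counts for the word excellent below are taken from the small-data-set example in the slides; the function names are illustrative.

```python
# Importance of a word in a set: its frequency divided by the total
# number of dictionary words in that set.
def importance(word_count: int, total_words: int) -> float:
    return word_count / total_words

def most_similar_set(i_source: float, candidates: dict) -> str:
    """Return the name of the candidate set whose importance for the
    word differs least from its importance in the source set."""
    return min(candidates, key=lambda name: abs(i_source - candidates[name]))

# "excellent" in FN: 3 occurrences among 3,678 dictionary words
i_fn = importance(3, 3678)
print(round(i_fn, 4))

# Importances of "excellent" in the other sets, from the slides
others = {"TN": 0.0007, "TP": 0.007, "CP": 0.006}
print(most_similar_set(i_fn, others))
```

With these numbers the importance in FN is about 0.0008, closest to TN rather than TP or CP, matching Example 1 on the next slides.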
15. + Importance of words from dictionary of False
Positive set compared to the other sets
16. + Importance of words from dictionary of False
Negative set compared to the other sets
17. +
Results of the analysis
The words with higher frequencies included mostly words that could be considered positive (e.g., location, excellent, or friendly) or negative (e.g., small, noise, or poor) in terms of their contribution to the assignment of reviews to a "correct" category
These words, important from the correct-classification viewpoint, often have their most similar importance in a different set than one would expect; e.g., some words in reviews from FN bearing a strong positive sentiment had their importance most similar to their importance in TN, and not in TP or CP
18. +
Example 1 – small data set
The strongly positive word excellent was used 3 times in FN (290 positive reviews, 3,678 words): iFN = 0.0008
This importance was most similar to the importance of the same word in TN (iTN = 0.0007), and not in TP (iTP = 0.007) or CP (iCP = 0.006)
The review "Excellent bed making. Very good restaurant but an English language menu would be advantageous to non-german speaking visitors.", containing the strongly positive word excellent, was categorized incorrectly
19. +
Example 2 – small data set
The positive word good (with smaller positivity than excellent) had the importance iFN = 0.0114
This importance was most similar to the importance of the same word in CP (iCP = 0.0146), and not in TP (iTP = 0.016) or TN (iTN = 0.0021)
Nevertheless, some reviews containing this positive word were assigned to a group with mostly negative reviews.
20. +
Results of the analysis
Both examples demonstrate that other document properties, i.e., the presence of the other words together with their importance, are significant. This is demonstrated in the table with importance similarities of the words of an obviously positive review, containing the strongly positive word "good" twice, which was assigned incorrectly to CN.
21. +
Results of the analysis – importance vs. frequency
The analysis of the importance of words from the dictionary of FN showed that about 60% of the words had their importance most similar to their importance in TN
However, the frequency of each of these words (the number of occurrences in all reviews) was relatively low (many of them appeared just once)
These words with highly similar importance also often did not bear any sentiment, e.g., the words discounted, happening, or attitude
22. +
Conclusions
The study aimed at finding the actual reason for assigning some documents to a "wrong" class
The critical information is provided by certain significant words included in individual reviews
Words that the previous research found significant for opinion polarity did not act as misleading information, unlike words that were much less significant or quite insignificant
Such specific words (or their combinations) can be filtered out as noise, improving cluster generation