1. + Clustering Very Large Textual Unstructured Customers' Reviews in a Natural Language
Jan Žižka, Karel Burda, František Dařena
Department of Informatics, Faculty of Business and Economics, Mendel University in Brno, Czech Republic
2. +
Introduction
Many companies collect opinions expressed by
their customers.
These opinions can hide valuable knowledge.
Discovering such knowledge manually can
be a very demanding task because
the opinion database can be very large,
the customers can use different languages,
the people can handle the opinions subjectively,
sometimes additional resources (like lists of positive
and negative words) might be needed.
3. +
Introduction
Our previous research focused on analyzing what was
significant for assigning a certain opinion to one of
the categories, such as satisfied or dissatisfied customers.
However, this requires the reviews to be separated
into classes sharing a common opinion/sentiment.
Clustering, as the most common form of unsupervised
learning, enables automatic grouping of unlabeled
documents into subsets called clusters.
4. +
Objective
The objective is to find out how well a
computer can separate the classes
expressing a certain opinion, without
prior knowledge of the nature of such
classes, and to find a clustering
algorithm with the set of its best
parameters: similarity and clustering-criterion
functions, word representation,
and the role of stemming for the given
specific data.
5. +
Data description
Processed data included reviews of hotel clients
collected from publicly available sources
The reviews were labeled as positive and
negative
Review characteristics:
more than 5,000,000 reviews
written in more than 25 natural languages
written only by real customers, based on a real
experience
written relatively carefully but still containing errors that
are typical for natural languages
6. +
Properties of data used for
experiments
The subset used in our experiments contained
almost two million opinions marked as written in
English.
Review category         Positive        Negative
Number of reviews       1,190,949       741,092
Maximal review length   391 words       396 words
Average review length   21.67 words     25.73 words
Variance                403.34 words    618.47 words
7. +
Review examples
Positive
The breakfast and the very clean rooms stood out as the best features of
this hotel.
Clean and moden, the great loation near station. Friendly reception!
The rooms are new. The breakfast is also great. We had a really nice stay.
Negative
Nothing, the hotel is very noisy, no sound insulation whatsoever. Room
very small. Shower not nice with a curtain. This is a 2/3 star max.
High price charged for internet access which actual cost now is extreamly
low.
water in the shower did not flow away
The room was noisy and the room temperature was higher than normal.
The train almost running through your room every 10 minutes, the old man
at the restaurant was ironic beyond friendly, the food was ok but very
German.
8. +
Data preparation
Data collection, cleaning (removing tags, non-letter
characters), converting to upper-case
Removing words shorter than 3 characters
Porter’s Stemming
Stopword removal, spell checking, diacritics removal, etc.
were not carried out
Creating 14 smaller subsets containing positive and negative
reviews with the following proportions: 131:144, 229:211,
987:1029, 1031:1085, 2096:2211, 4932:4757, 4832:4757,
7432:7399, 10023:8946, 10251:9352, 15469:14784,
24153:23956, 52146:49986, and 365921:313752
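The cleaning steps above can be sketched in a few lines of Python; this is a minimal illustration on a made-up example review, not the code used in the experiments, and the Porter stemming step is omitted because it would require an external stemmer:

```python
import re

def preprocess(review):
    # Remove tags, then non-letter characters, and convert to upper case
    text = re.sub(r"<[^>]+>", " ", review)
    text = re.sub(r"[^A-Za-z]", " ", text)
    words = text.upper().split()
    # Drop words shorter than 3 characters
    words = [w for w in words if len(w) >= 3]
    # Porter stemming would follow here; omitted in this sketch
    return words

print(preprocess("The rooms are <b>new</b>. We had a really nice stay."))
# → ['THE', 'ROOMS', 'ARE', 'NEW', 'HAD', 'REALLY', 'NICE', 'STAY']
```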
9. +
Experimental steps
Random selection of the desired number of reviews
Transformation of the data into the vector representation
Loading the data in Cluto* and performing clustering
Evaluating the results
* Free software providing different clustering methods working with
several clustering criterion functions and similarity measures, suitable
for operating on very large datasets.
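For illustration, the vector data can be exported to CLUTO's sparse-matrix input format: a header line giving the numbers of rows, columns, and non-zero entries, followed by one line per document listing 1-based column/value pairs. The helper below is a minimal sketch on toy vectors, not the export code used in the experiments:

```python
def write_cluto_mat(vectors, path):
    # Count non-zero entries for the CLUTO sparse-format header
    nnz = sum(1 for v in vectors for x in v if x != 0)
    with open(path, "w") as f:
        # Header: number of rows, columns, and non-zero entries
        f.write(f"{len(vectors)} {len(vectors[0])} {nnz}\n")
        for v in vectors:
            # One line per document: 1-based "column value" pairs
            pairs = [f"{j + 1} {x}" for j, x in enumerate(v) if x != 0]
            f.write(" ".join(pairs) + "\n")

# Two toy documents over a 3-term vocabulary
write_cluto_mat([[1, 0, 2], [0, 3, 0]], "reviews.mat")
```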
10. +
Clustering algorithm parameters
Clustering algorithm – describes how objects to be
clustered are assigned to individual groups
Available algorithms
Cluto's k-means variation – algorithm iteratively adapts the initial
randomly generated k cluster centroids' positions
Repeated bisection – a sequence of cluster bisections
Graph-based – partitioning a graph representing objects to be
clustered
11. +
Clustering algorithm parameters
Similarity – an important measure affecting the results of
clustering because the objects within one cluster need to be
similar while objects from different clusters should be dissimilar
Available similarity/distance measures
Cosine similarity – measures the cosine of the angle between
couples of vectors representing the documents
Pearson's correlation coefficient – measures linear correlation
between values of two vectors
Euclidean distance – computes the distance between points
representing documents in the abstract space
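The three measures can be stated concretely; the snippet below is an illustrative sketch on toy term-frequency vectors (the data and names are ours, chosen only for the example):

```python
import math

def cosine(a, b):
    # Cosine of the angle between the two document vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean(a, b):
    # Straight-line distance between the points in the vector space
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    # Linear correlation between the values of the two vectors
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

d1 = [2, 0, 1, 1]  # toy term-frequency vectors (made up)
d2 = [1, 1, 0, 1]
```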
12. +
Clustering algorithm parameters
Criterion functions – a particular clustering criterion function,
defined over the entire clustering solution, is optimized
Internal functions are defined over the documents that are part of each
cluster and do not take into account the documents assigned to different
clusters
External criterion functions derive the clustering solution from
the differences among individual clusters
Internal and external functions can be combined to define a set of
hybrid criterion functions that simultaneously optimize individual criterion
functions
Available criterion functions
Internal – I1, I2
External – E1, E2
Hybrid – H1, H2
Graph based – G1
13. +
Clustering algorithm parameters
Document representation – documents are represented using the
vector-space model
Vector dimensions – document properties (terms, in our
experiments words)
Vector values
Term Presence (TP)
Term Frequency (TF)
Term Frequency × Inverse Document Frequency (TF-IDF)
Term Presence × Inverse Document Frequency (TP-IDF)
idf(t_i) = log(N / n(t_i)),
where N is the total number of documents and n(t_i) is the number of
documents containing term t_i
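The idf weighting can be illustrated on a toy corpus; the three mini-documents below are made up for the example:

```python
import math

# Toy corpus of three preprocessed mini-reviews (made up)
docs = [["CLEAN", "ROOM", "GREAT"],
        ["NOISY", "ROOM", "SMALL"],
        ["GREAT", "BREAKFAST"]]
N = len(docs)

def idf(term):
    # idf(t_i) = log(N / n(t_i)); n(t_i) = number of documents with t_i
    n_t = sum(1 for d in docs if term in d)
    return math.log(N / n_t)

def tf_idf_vector(doc, vocab):
    # TF-IDF: term frequency in the document times the term's idf
    return [doc.count(t) * idf(t) for t in vocab]

vocab = sorted({t for d in docs for t in d})
# "ROOM" occurs in 2 of the 3 documents, so its idf is log(3/2);
# terms occurring in every document would get idf = 0
```

Replacing `doc.count(t)` with `min(doc.count(t), 1)` yields the TP-IDF variant from the slide above.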
14. +
Evaluation of cluster quality
Purity-based measures – measure the extent to which each
cluster contains documents from primarily one class
Purity of cluster S_r of size n_r:
P(S_r) = (1/n_r) · max_i n_ir,
where n_ir is the number of documents of the ith class assigned to the
rth cluster
Purity of the entire solution with k clusters:
Purity = Σ_{r=1..k} (n_r / n) · P(S_r)
A perfect clustering solution – clusters contain documents from
only a single class: Purity = 1
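A minimal sketch of the purity computation on made-up cluster assignments (the labels POS/NEG stand for the two review classes):

```python
def purity(clusters):
    # clusters: list of lists of true class labels assigned to each cluster
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        # P(S_r) = (1/n_r) * max_i n_ir; weighting by n_r / n makes the
        # n_r terms cancel, leaving max-count / n per cluster
        counts = {label: c.count(label) for label in set(c)}
        total += max(counts.values()) / n
    return total

# Two clusters of labeled reviews (toy data)
clusters = [["POS", "POS", "POS", "NEG"], ["NEG", "NEG", "POS", "NEG"]]
print(purity(clusters))  # → 0.75, i.e. (3 + 3) / 8
```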
15. +
Evaluation of cluster quality
Entropy-based measures – measure how the various classes of documents
are distributed within each cluster
Entropy of cluster S_r of size n_r:
E(S_r) = −(1 / log q) · Σ_{i=1..q} (n_ir / n_r) · log(n_ir / n_r),
where q is the number of classes and n_ir is the number of documents
in the ith class that were assigned to the rth cluster
Entropy of the entire solution with k clusters:
Entropy = Σ_{r=1..k} (n_r / n) · E(S_r)
A perfect clustering solution – clusters contain documents from
only a single class: Entropy = 0
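Correspondingly, a sketch of the normalized entropy computation on made-up clusters:

```python
import math

def entropy(clusters, q):
    # clusters: list of lists of class labels; q = number of classes
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        n_r = len(c)
        e = 0.0
        for label in set(c):
            p = c.count(label) / n_r          # n_ir / n_r
            e -= p * math.log(p)
        e /= math.log(q)                      # normalize by log q
        total += (n_r / n) * e                # weight by cluster size
    return total

# Two clusters, each dominated by one class (9:1) — toy data
clusters = [["POS"] * 9 + ["NEG"], ["NEG"] * 9 + ["POS"]]
print(round(entropy(clusters, 2), 3))  # → 0.469
```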
16. +
Results
Best results were achieved by k-means, repeated bisection,
and cosine similarity, as demonstrated in the following tables
A boundary (around 10,000 documents) was found beyond which
the entropy value only oscillates and does not change much
with an increasing number of documents
IDF weighting had a considerable positive impact on clustering
results in comparison with simple TP/TF
TF-IDF document representation provided almost the same
results as TP-IDF
17. +
Results
Cosine similarity provided the best results, outperforming
Euclidean distance and Pearson's correlation coefficient.
For example, for the set of documents containing 4,932 positive and
4,745 negative reviews, the entropy was 0.594 for cosine similarity,
while Euclidean distance provided entropy 0.740, and Pearson's
coefficient 0.838
The H2 and I2 criterion functions provided the best results.
For the I1 criterion function, the entropy of one cluster was very
low (less than 0.2). On the other hand, the second cluster's
entropy was extremely high.
Stemming applied during the preprocessing phase had no
impact on the entropy at all.
20. +
Weighted entropy for different data set sizes
21. +
Conclusions
The goal was to automatically build clusters
representing positive and negative opinions and
to find a clustering algorithm with the set of its best
parameters: similarity measure, clustering-criterion
function, word representation, and the role of
stemming.
The main focus was on clustering large real-world
data in a reasonable time, without applying any
sophisticated methods that would increase the
computational complexity.
22. +
Conclusions
The best results were obtained with:
k-means
performed better than the other algorithms
proved to be the faster algorithm
binary vector representation (term presence)
idf weighting
cosine similarity
H2 criterion function
stemming did not improve the results
23. +
Future work
Clustering of reviews in other languages
Analysis of “incorrectly” categorized reviews
Clustering smaller units of reviews (e.g., sentences)