1. + Clustering Very Large Textual Unstructured Customers' Reviews in a Natural Language
Jan Žižka, Karel Burda, František Dařena
Department of Informatics, Faculty of Business and Economics, Mendel University in Brno, Czech Republic
2. +
Introduction
Many companies collect opinions expressed by
their customers.
These opinions can hide valuable knowledge.
Discovering such knowledge manually can
be a very demanding task because
the opinion database can be very large,
the customers can use different languages,
the people can handle the opinions subjectively,
sometimes additional resources (like lists of positive
and negative words) might be needed.
3. +
Introduction
Our previous research focused on analyzing what was
significant for assigning a certain opinion to one of
the categories, such as satisfied or dissatisfied customers.
However, this requires the reviews to be separated
into classes sharing a common opinion/sentiment.
Clustering, as the most common form of unsupervised
learning, enables automatic grouping of unlabeled
documents into subsets called clusters.
4. +
Objective
The objective is to find out how well a
computer can separate the classes
expressing a certain opinion, without
prior knowledge of the nature of such
classes, and to find a clustering
algorithm with the set of its best
parameters: similarity and clustering-criterion
functions, word representation,
and the role of stemming for the given
specific data.
5. +
Data description
Processed data included reviews of hotel clients
collected from publicly available sources
The reviews were labeled as positive and
negative
Review characteristics:
more than 5,000,000 reviews
written in more than 25 natural languages
written only by real customers, based on a real
experience
written relatively carefully but still containing errors that
are typical for natural languages
6. +
Properties of data used for
experiments
The subset used in our experiments contained
almost two million opinions marked as written in
English.
Review category         Positive        Negative
Number of reviews       1,190,949       741,092
Maximal review length   391 words       396 words
Average review length   21.67 words     25.73 words
Variance                403.34 words    618.47 words
7. +
Review examples
Positive
The breakfast and the very clean rooms stood out as the best features of
this hotel.
Clean and moden, the great loation near station. Friendly reception!
The rooms are new. The breakfast is also great. We had a really nice stay.
Negative
Nothing, the hotel is very noisy, no sound insulation whatsoever. Room
very small. Shower not nice with a curtain. This is a 2/3 star max.
High price charged for internet access which actual cost now is extreamly
low.
water in the shower did not flow away
The room was noisy and the room temperature was higher than normal.
The train almost running through your room every 10 minutes, the old man
at the restaurant was ironic beyond friendly, the food was ok but very
German.
8. +
Data preparation
Data collection, cleaning (removing tags, non-letter
characters), converting to upper-case
Removing words shorter than 3 characters
Porter’s Stemming
Stopword removal, spell checking, diacritics removal, etc.
were not carried out
Creating 14 smaller subsets containing positive and negative
reviews with the following proportions: 131:144, 229:211,
987:1029, 1031:1085, 2096:2211, 4932:4757, 4832:4757,
7432:7399, 10023:8946, 10251:9352, 15469:14784,
24153:23956, 52146:49986, and 365921:313752
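The cleaning steps above can be sketched in a few lines of Python; this is a minimal illustration on a made-up example review, not the code used in the experiments, and the Porter stemming step is omitted because it would require an external stemmer:

```python
import re

def preprocess(review):
    # Remove tags, then non-letter characters, and convert to upper case
    text = re.sub(r"<[^>]+>", " ", review)
    text = re.sub(r"[^A-Za-z]", " ", text)
    words = text.upper().split()
    # Drop words shorter than 3 characters
    words = [w for w in words if len(w) >= 3]
    # Porter stemming would follow here; omitted in this sketch
    return words

print(preprocess("The rooms are <b>new</b>. We had a really nice stay."))
# → ['THE', 'ROOMS', 'ARE', 'NEW', 'HAD', 'REALLY', 'NICE', 'STAY']
```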
9. +
Experimental steps
Random selection of the desired number of reviews
Transformation of the data into the vector representation
Loading the data in Cluto* and performing clustering
Evaluating the results
* Free software providing different clustering methods working with
several clustering criterion functions and similarity measures, suitable
for operating on very large datasets.
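For illustration, the vector data can be exported to CLUTO's sparse-matrix input format: a header line giving the numbers of rows, columns, and non-zero entries, followed by one line per document listing 1-based column/value pairs. The helper below is a minimal sketch on toy vectors, not the export code used in the experiments:

```python
def write_cluto_mat(vectors, path):
    # Count non-zero entries for the CLUTO sparse-format header
    nnz = sum(1 for v in vectors for x in v if x != 0)
    with open(path, "w") as f:
        # Header: number of rows, columns, and non-zero entries
        f.write(f"{len(vectors)} {len(vectors[0])} {nnz}\n")
        for v in vectors:
            # One line per document: 1-based "column value" pairs
            pairs = [f"{j + 1} {x}" for j, x in enumerate(v) if x != 0]
            f.write(" ".join(pairs) + "\n")

# Two toy documents over a 3-term vocabulary
write_cluto_mat([[1, 0, 2], [0, 3, 0]], "reviews.mat")
```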
10. +
Clustering algorithm parameters
Clustering algorithm – describes how objects to be
clustered are assigned to individual groups
Available algorithms
Cluto's k-means variation – algorithm iteratively adapts the initial
randomly generated k cluster centroids' positions
Repeated bisection – a sequence of cluster bisections
Graph-based – partitioning a graph representing objects to be
clustered
11. +
Clustering algorithm parameters
Similarity – an important measure affecting the results of
clustering because the objects within one cluster need to be
similar while objects from different clusters should be dissimilar
Available similarity/distance measures
Cosine similarity – measures the cosine of the angle between
couples of vectors representing the documents
Pearson's correlation coefficient – measures linear correlation
between values of two vectors
Euclidean distance – computes the distance between points
representing documents in the abstract space
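The three measures can be stated concretely; the snippet below is an illustrative sketch on toy term-frequency vectors (the data and names are ours, chosen only for the example):

```python
import math

def cosine(a, b):
    # Cosine of the angle between the two document vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean(a, b):
    # Straight-line distance between the points in the vector space
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    # Linear correlation between the values of the two vectors
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

d1 = [2, 0, 1, 1]  # toy term-frequency vectors (made up)
d2 = [1, 1, 0, 1]
```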
12. +
Clustering algorithm parameters
Criterion functions – a particular clustering criterion function,
defined over the entire clustering solution, is optimized
Internal functions are defined over the documents that are part of each
cluster and do not take into account the documents assigned to different
clusters
External criterion functions derive the clustering solution from
the differences among individual clusters
Internal and external functions can be combined to define a set of
hybrid criterion functions that simultaneously optimize individual criterion
functions
Available criterion functions
Internal – I1, I2
External – E1, E2
Hybrid – H1, H2
Graph based – G1
13. +
Clustering algorithm parameters
Document representation – documents are represented using the
vector-space model
Vector dimensions – document properties (terms, in our
experiments words)
Vector values
Term Presence (TP)
Term Frequency (TF)
Term Frequency × Inverse Document Frequency (TF-IDF)
Term Presence × Inverse Document Frequency (TP-IDF)
idf(t_i) = log(N / n(t_i)),
where N is the total number of documents and n(t_i) is the number of
documents containing term t_i
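The idf weighting can be illustrated on a toy corpus; the three mini-documents below are made up for the example:

```python
import math

# Toy corpus of three preprocessed mini-reviews (made up)
docs = [["CLEAN", "ROOM", "GREAT"],
        ["NOISY", "ROOM", "SMALL"],
        ["GREAT", "BREAKFAST"]]
N = len(docs)

def idf(term):
    # idf(t_i) = log(N / n(t_i)); n(t_i) = number of documents with t_i
    n_t = sum(1 for d in docs if term in d)
    return math.log(N / n_t)

def tf_idf_vector(doc, vocab):
    # TF-IDF: term frequency in the document times the term's idf
    return [doc.count(t) * idf(t) for t in vocab]

vocab = sorted({t for d in docs for t in d})
# "ROOM" occurs in 2 of the 3 documents, so its idf is log(3/2);
# terms occurring in every document would get idf = 0
```

Replacing `doc.count(t)` with `min(doc.count(t), 1)` yields the TP-IDF variant from the slide above.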
14. +
Evaluation of cluster quality
Purity-based measures – measure the extent to which each
cluster contains documents from primarily one class
Purity of cluster S_r of size n_r:
P(S_r) = (1/n_r) · max_i n_ir,
where n_ir is the number of documents of the ith class assigned to the
rth cluster
Purity of the entire solution with k clusters:
Purity = Σ_{r=1..k} (n_r / n) · P(S_r)
A perfect clustering solution – clusters contain documents from
only a single class: Purity = 1
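A minimal sketch of the purity computation on made-up cluster assignments (the labels POS/NEG stand for the two review classes):

```python
def purity(clusters):
    # clusters: list of lists of true class labels assigned to each cluster
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        # P(S_r) = (1/n_r) * max_i n_ir; weighting by n_r / n makes the
        # n_r terms cancel, leaving max-count / n per cluster
        counts = {label: c.count(label) for label in set(c)}
        total += max(counts.values()) / n
    return total

# Two clusters of labeled reviews (toy data)
clusters = [["POS", "POS", "POS", "NEG"], ["NEG", "NEG", "POS", "NEG"]]
print(purity(clusters))  # → 0.75, i.e. (3 + 3) / 8
```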
15. +
Evaluation of cluster quality
Entropy-based measures – measure how the various classes of documents
are distributed within each cluster
Entropy of cluster S_r of size n_r:
E(S_r) = −(1 / log q) · Σ_{i=1..q} (n_ir / n_r) · log(n_ir / n_r),
where q is the number of classes and n_ir is the number of documents
in the ith class that were assigned to the rth cluster
Entropy of the entire solution with k clusters:
Entropy = Σ_{r=1..k} (n_r / n) · E(S_r)
A perfect clustering solution – clusters contain documents from
only a single class: Entropy = 0
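Correspondingly, a sketch of the normalized entropy computation on made-up clusters:

```python
import math

def entropy(clusters, q):
    # clusters: list of lists of class labels; q = number of classes
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        n_r = len(c)
        e = 0.0
        for label in set(c):
            p = c.count(label) / n_r          # n_ir / n_r
            e -= p * math.log(p)
        e /= math.log(q)                      # normalize by log q
        total += (n_r / n) * e                # weight by cluster size
    return total

# Two clusters, each dominated by one class (9:1) — toy data
clusters = [["POS"] * 9 + ["NEG"], ["NEG"] * 9 + ["POS"]]
print(round(entropy(clusters, 2), 3))  # → 0.469
```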
16. +
Results
Best results were achieved by k-means, repeated bisection,
and cosine similarity, as demonstrated in the following tables
A boundary (around 10,000 documents) was found beyond which
the entropy value only oscillates and does not change much
with an increasing number of documents
IDF weighting had a considerable positive impact on clustering
results in comparison with simple TP/TF
TF-IDF document representation provided almost the same
results as TP-IDF
17. +
Results
Cosine similarity provided the best results, outperforming
Euclidean distance and Pearson's correlation coefficient.
For example, for the set of documents containing 4,932 positive and
4,745 negative reviews, the entropy was 0.594 for cosine similarity,
while Euclidean distance provided entropy 0.740, and Pearson's
coefficient 0.838
The H2 and I2 criterion functions provided the best results.
For the I1 criterion function, the entropy of one cluster was very
low (less than 0.2). On the other hand, the second cluster's
entropy was extremely high.
Stemming applied during the preprocessing phase had no
impact on the entropy at all.
20. +
Weighted entropy for different data set sizes
21. +
Conclusions
The goal was to automatically build clusters
representing positive and negative opinions and
to find a clustering algorithm with the set of its best
parameters: similarity measure, clustering-criterion
function, word representation, and the role of
stemming.
The main focus was on clustering large real-world
data in a reasonable time, without applying any
sophisticated methods that would increase the
computational complexity.
22. +
Conclusions
The best results were obtained with:
k-means
performed better than the other algorithms
proved to be the faster algorithm
binary vector representation (term presence)
idf weighting
cosine similarity
H2 criterion function
stemming did not improve the results
23. +
Future work
Clustering of reviews in other languages
Analysis of “incorrectly” categorized reviews
Clustering smaller units of reviews (e.g., sentences)