+
       Clustering Very Large Textual
    Unstructured Customers' Reviews in
            a Natural Language

     Jan Žižka, Karel Burda, František Dařena

     Department of Informatics
     Faculty of Business and Economics
     Mendel University in Brno, Czech Republic
+
    Introduction


     Many companies collect opinions expressed by
      their customers.
     These opinions can hide valuable knowledge.
     Discovering such knowledge manually can be a
      very demanding task because
      the opinion database can be very large,
      the customers can use different languages,
      people can judge the opinions subjectively,
      sometimes additional resources (like lists of positive
       and negative words) might be needed.
+
    Introduction

     Our previous research focused on analyzing what was
      significant for assigning a certain opinion to one of the
      categories, such as satisfied or dissatisfied customers

     However, this requires having the reviews separated
      into classes sharing a common opinion/sentiment

     Clustering, as the most common form of unsupervised
      learning, enables automatic grouping of unlabeled
      documents into subsets called clusters
+
    Objective


    The objective is to find out how well a
     computer can separate the classes
     expressing a certain opinion, without
     prior knowledge of the nature of such
     classes, and to find a clustering
     algorithm with its best parameters:
     similarity and clustering-criterion
     functions, word representation, and the
     role of stemming for the given specific
     data.
+
    Data description

     Processed data included reviews of hotel clients
      collected from publicly available sources
     The reviews were labeled as positive and
      negative
     Review characteristics:
      more than 5,000,000 reviews
      written in more than 25 natural languages
      written only by real customers, based on real
       experience
      written relatively carefully but still containing errors that
       are typical for natural languages
+
    Properties of data used for
    experiments
     The subset used in our experiments contained
      almost two million opinions marked as written in
      English.

            Review category         Positive       Negative
            Number of reviews       1,190,949      741,092
            Maximal review length   391 words      396 words
            Average review length   21.67 words    25.73 words
            Variance                403.34 words   618.47 words
+
    Review examples

       Positive
           The breakfast and the very clean rooms stood out as the best features of
            this hotel.
           Clean and moden, the great loation near station. Friendly reception!
           The rooms are new. The breakfast is also great. We had a really nice stay.
           Nothing, the hotel is very noisy, no sound insulation whatsoever. Room
            very small. Shower not nice with a curtain. This is a 2/3 star max.

       Negative
           High price charged for internet access which actual cost now is extreamly
            low.
           water in the shower did not flow away
           The room was noisy and the room temperature was higher than normal.
           The train almost running through your room every 10 minutes, the old man
            at the restaurant was ironic beyond friendly, the food was ok but very
            German.
+
    Data preparation

        Data collection, cleaning (removing tags, non-letter
         characters), converting to upper-case
      Removing words shorter than 3 characters
      Porter's stemming
      Stopword removal, spell checking, diacritics removal, etc.
       were not carried out
        Creating 14 smaller subsets containing positive and negative
         reviews with the following proportions: 131:144, 229:211,
         987:1029, 1031:1085, 2096:2211, 4932:4757, 4832:4757,
         7432:7399, 10023:8946, 10251:9352, 15469:14784,
         24153:23956, 52146:49986, and 365921:313752
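As an illustration only (not the authors' actual tooling), the cleaning steps above can be sketched in Python; the Porter stemming step is omitted here and would in practice follow, e.g. via a stemming library:

```python
import re

def preprocess(review: str) -> list[str]:
    """Sketch of the slide's cleaning pipeline: remove tags and
    non-letter characters, upper-case, drop words shorter than 3
    characters. Porter stemming would follow in the real pipeline."""
    text = re.sub(r"<[^>]+>", " ", review)    # remove tags
    text = re.sub(r"[^A-Za-z]+", " ", text)   # keep letters only
    words = text.upper().split()              # convert to upper-case
    return [w for w in words if len(w) >= 3]  # remove short words

print(preprocess("The <b>rooms</b> are new! We had a really nice stay."))
# ['THE', 'ROOMS', 'ARE', 'NEW', 'HAD', 'REALLY', 'NICE', 'STAY']
```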
+
    Experimental steps

        Random selection of the desired number of reviews

       Transformation of the data into the vector representation

       Loading the data in Cluto* and performing clustering

       Evaluating the results



    *   Free software providing different clustering methods working with
        several clustering criterion functions and similarity measures, suitable
        for operating on very large datasets.
+
    Clustering algorithm parameters

        Clustering algorithm – describes how the objects to be
         clustered are assigned to individual groups

       Available algorithms
           Cluto's k-means variation – algorithm iteratively adapts the initial
            randomly generated k cluster centroids' positions
           Repeated bisection – a sequence of cluster bisections
           Graph-based – partitioning a graph representing objects to be
            clustered
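For intuition, a minimal generic k-means loop (a toy sketch, not Cluto's optimized variant) might look like this; the points, k, and squared-Euclidean assignment are illustrative choices:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: start from k randomly chosen points as centroids,
    then alternate point assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid (squared Euclidean)
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(xs) / len(xs)
                                     for xs in zip(*members))
    return clusters

two_blobs = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(sorted(len(c) for c in kmeans(two_blobs, k=2)))  # [2, 2]
```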
+
    Clustering algorithm parameters

       Similarity – an important measure affecting the results of
        clustering because the objects within one cluster need to be
        similar while objects from different clusters should be dissimilar

       Available similarity/distance measures
            Cosine similarity – measures the cosine of the angle between
             pairs of vectors representing the documents
           Pearson's correlation coefficient – measures linear correlation
            between values of two vectors
           Euclidean distance – computes the distance between points
            representing documents in the abstract space
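The three measures can be written down directly; a small sketch for dense vectors (the real system of course works on large sparse document vectors):

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Linear correlation between the values of two vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) *
                           sum((y - mb) ** 2 for y in b))

print(cosine((1, 2), (2, 4)))     # parallel vectors, cosine ~ 1
print(euclidean((0, 0), (3, 4)))  # 5.0
```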
+
    Clustering algorithm parameters

        Criterion functions – a particular clustering criterion function
         defined over the entire clustering solution is optimized
            Internal functions are defined over the documents that are part of each
             cluster and do not take into account the documents assigned to different
             clusters
            External criterion functions derive the clustering solution from the
             differences among individual clusters
            Internal and external functions can be combined to define a set of
             hybrid criterion functions that simultaneously optimize individual criterion
             functions

       Available criterion functions
           Internal – I1, I2
           External – E1, E2
           Hybrid – H1, H2
           Graph based – G1
+
    Clustering algorithm parameters

       Document representation – documents are represented using the
        vector-space model

        Vector dimensions – document properties (terms; in our
         experiments, words)

       Vector values
           Term Presence (TP)
           Term Frequency (TF)
           Term Frequency × Inverse Document Frequency (TF-IDF)
           Term Presence × Inverse Document Frequency (TP-IDF)

                          idf(t_i) = log( N / n(t_i) )
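The idf weight above can be computed from tokenized documents; a brief sketch (the tiny corpus is made up for illustration):

```python
import math
from collections import Counter

def idf_weights(docs):
    """idf(t_i) = log(N / n(t_i)), where N is the number of documents
    and n(t_i) the number of documents containing term t_i."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    return {t: math.log(N / n) for t, n in df.items()}

docs = [["CLEAN", "ROOM"], ["NOISY", "ROOM"], ["CLEAN", "STAFF"]]
w = idf_weights(docs)
# "ROOM" occurs in 2 of 3 documents, "NOISY" in 1 of 3,
# so w["NOISY"] > w["ROOM"]: rarer terms get higher weight
```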
+
    Evaluation of cluster quality

        Purity-based measures – measure the extent to which each
         cluster contains documents from primarily one class
            Purity of cluster S_r of size n_r:

                      P(S_r) = (1/n_r) · max_i n_r^i

            Purity of the entire solution with k clusters:

                      Purity = Σ_{r=1..k} (n_r / n) · P(S_r)

        A perfect clustering solution – clusters contain documents from
         only a single class → Purity = 1
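A sketch of the purity computation, with each cluster given as the list of true class labels of its documents (labels are illustrative):

```python
from collections import Counter

def purity(clusters):
    """Weighted purity: sum over clusters of (n_r/n) * (1/n_r) * max_i n_r^i,
    which simplifies to (sum of per-cluster majority counts) / n."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / n

print(purity([["P", "P", "N"], ["N", "N"]]))  # (2 + 2) / 5 = 0.8
```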
+
    Evaluation of cluster quality

        Entropy-based measures – how the various classes of documents
         are distributed within each cluster
            Entropy of cluster S_r of size n_r:

                      E(S_r) = -(1/log q) · Σ_{i=1..q} (n_ir / n_r) · log(n_ir / n_r),

             where q is the number of classes and n_ir is the number of documents
             of the i-th class that were assigned to the r-th cluster
            Entropy of the entire solution with k clusters:

                      Entropy = Σ_{r=1..k} (n_r / n) · E(S_r)

        A perfect clustering solution – clusters contain documents from
         only a single class → Entropy = 0
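Likewise the entropy measures, in the same labelled-cluster form (a sketch; the natural logarithm is used, which cancels in the 1/log q normalization):

```python
import math
from collections import Counter

def cluster_entropy(cluster, q):
    """E(S_r) = -(1/log q) * sum_i (n_ir/n_r) * log(n_ir/n_r)."""
    nr = len(cluster)
    return -sum((c / nr) * math.log(c / nr)
                for c in Counter(cluster).values()) / math.log(q)

def weighted_entropy(clusters, q):
    """Entropy of the whole solution: sum_r (n_r/n) * E(S_r)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c, q) for c in clusters)

print(cluster_entropy(["P", "P"], q=2))  # a pure cluster has entropy 0
print(cluster_entropy(["P", "N"], q=2))  # an evenly mixed cluster has entropy 1
```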
+
    Results

        The best results were achieved by k-means, repeated bisection,
         and cosine similarity, as demonstrated in the following tables

        Beyond a certain data size – around 10,000 documents – the
         entropy only oscillated and did not change much with an
         increasing number of documents

       IDF weighting had a considerable positive impact on clustering
        results in comparison with simple TP/TF

       TF-IDF document representation provided almost the same
        results as TP-IDF
+
    Results

        Cosine similarity provided the best results, outperforming
         Euclidean distance and Pearson's correlation coefficient.
           For example, for the set of documents containing 4,932 positive and
            4,745 negative reviews, the entropy was 0.594 for cosine similarity,
            while Euclidean distance provided entropy 0.740, and Pearson's
            coefficient 0.838

       The H2 and I2 criterion functions provided the best results.

       For the I1 criterion function, the entropy of one cluster was very
        low (less than 0.2). On the other hand, the second cluster's
        entropy was extremely high.

       Stemming applied during the preprocessing phase had no
        impact on the entropy at all.
+
    Weighted entropy


                               K-means                         Repeated bisection
      Ratio P:N          TF-IDF            TP-IDF            TF-IDF           TP-IDF
                      I2       H2       I2       H2       I2       H2      I2       H2
       131:144      0.792     0.785   0.793     0.741   0.726     0.767  0.774     0.774
        229:211     0.694     0.632   0.695 0.627       0.648     0.643   0.650    0.647
       987:1029     0.624     0.610   0.618     0.605   0.624     0.609  0.618     0.611
      4832:4757     0.601     0.581   0.599     0.579   0.600     0.584  0.598     0.580
      7432:7399     0.605     0.596   0.599     0.587   0.605     0.595  0.594     0.586
     15469:14784    0.604     0.583   0.598     0.579   0.604     0.582  0.598     0.579
     24153:23956    0.597     0.580   0.589     0.572   0.597     0.580  0.589     0.572
     52164:49986    0.596     0.582   0.600     0.573   0.604     0.582  0.598     0.574
    201346:204716   0.599     0.583   0.592     0.575   0.597     0.583  0.593     0.576
    365921:313752   0.602     0.586   0.598     0.584   0.599     0.581  0.598     0.580
+
    Percentage ratios of documents in
    the clusters
                                     K-means                            Repeated bisection
                              I2                   H2                   I2                   H2
      Ratio P:N
                    cluster 0 cluster 1 cluster 0 cluster 1 cluster 0 cluster 1 cluster 0 cluster 1
                      (P:N)     (P:N)     (P:N)     (P:N)     (P:N)     (P:N)     (P:N)     (P:N)
       131:144       76:24         24:74   78:24        22:74   75:22        25:76   78:19        22:78
       229:211       84:21         16:79   86:18        14:82   84:20        16:80   84:18        16:82
      987:1029       80:12         19:87   85:16        14:83   79:11        20:88   85:15        15:84
      4832:4757      83:13         17:87   87:15        13:85   83:12        17:87   86:14        14:86
      7432:7399      82:12         17:87   85:14        14:85   82:12        17:86   86:14        14:85
     15469:14784      80:11        19:89   85:13        15:86   81:10        19:89   85:13        15:87
     24153:23956      81:11        19:89   85:13        14:86   81:10        18:89   86:13        14:87
     52164:49986     18:89         81:11   15:87        85:13   19:89        80:10   15:87        85:12
    201346:204716     82:11        18:88   85:13        15:86   82:11        18:89   15:87        85:12
    365921:313752    19:89         80:10   16:88        83:12   80:10        20:90   16:87        84:12
+
    Weighted entropy for different data
    set sizes
+
    Conclusions

     The goal was to automatically build clusters
     representing positive and negative opinions and
     to find a clustering algorithm with its best
     parameters: similarity measure, clustering-criterion
     function, word representation, and the role of
     stemming.

     The main focus was on clustering large real-world
     data in a reasonable time, without applying any
     sophisticated methods that could increase the
     computational complexity.
+
    Conclusions


     The best results were obtained with
      k-means
        performed better than the other
         algorithms
        also proved to be the faster algorithm
      binary vector representation
      idf weighting
      cosine similarity
      H2 criterion function
      stemming did not improve the results
+
    Future work

     Clustering of reviews in other languages

     Analysis of “incorrectly” categorized reviews

     Clustering smaller units of reviews (e.g., sentences)

More Related Content

What's hot

Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernelsDev Nath
 
Neural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learningNeural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learningTapas Majumdar
 
Text classification-php-v4
Text classification-php-v4Text classification-php-v4
Text classification-php-v4Glenn De Backer
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...csandit
 
Evaluation of subjective answers using glsa enhanced with contextual synonymy
Evaluation of subjective answers using glsa enhanced with contextual synonymyEvaluation of subjective answers using glsa enhanced with contextual synonymy
Evaluation of subjective answers using glsa enhanced with contextual synonymyijnlc
 
Interactive Latent Dirichlet Allocation
Interactive Latent Dirichlet AllocationInteractive Latent Dirichlet Allocation
Interactive Latent Dirichlet AllocationQuentin Pleplé
 
Deep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementDeep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementNAVER Engineering
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 

What's hot (16)

Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernels
 
Neural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learningNeural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learning
 
Text classification-php-v4
Text classification-php-v4Text classification-php-v4
Text classification-php-v4
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
 
Siguccs20101026
Siguccs20101026Siguccs20101026
Siguccs20101026
 
Topic Models
Topic ModelsTopic Models
Topic Models
 
Evaluation of subjective answers using glsa enhanced with contextual synonymy
Evaluation of subjective answers using glsa enhanced with contextual synonymyEvaluation of subjective answers using glsa enhanced with contextual synonymy
Evaluation of subjective answers using glsa enhanced with contextual synonymy
 
Interactive Latent Dirichlet Allocation
Interactive Latent Dirichlet AllocationInteractive Latent Dirichlet Allocation
Interactive Latent Dirichlet Allocation
 
Deep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementDeep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech Enhancement
 
Lesson 35
Lesson 35Lesson 35
Lesson 35
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
speech enhancement
speech enhancementspeech enhancement
speech enhancement
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 

Similar to Zizka aimsa 2012

Current Approaches in Search Result Diversification
Current Approaches in Search Result DiversificationCurrent Approaches in Search Result Diversification
Current Approaches in Search Result DiversificationMario Sangiorgio
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingNimrita Koul
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsRebecca Bilbro
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
Machine learning and Neural Networks
Machine learning and Neural NetworksMachine learning and Neural Networks
Machine learning and Neural Networksbutest
 
Improving search with neural ranking methods
Improving search with neural ranking methodsImproving search with neural ranking methods
Improving search with neural ranking methodsvoginip
 

Similar to Zizka aimsa 2012 (20)

Zizka synasc 2012
Zizka synasc 2012Zizka synasc 2012
Zizka synasc 2012
 
Clique and sting
Clique and stingClique and sting
Clique and sting
 
Additional1
Additional1Additional1
Additional1
 
Ir models
Ir modelsIr models
Ir models
 
Current Approaches in Search Result Diversification
Current Approaches in Search Result DiversificationCurrent Approaches in Search Result Diversification
Current Approaches in Search Result Diversification
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and Distributions
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
Clustering
ClusteringClustering
Clustering
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
nnml.ppt
nnml.pptnnml.ppt
nnml.ppt
 
Text Mining Analytics 101
Text Mining Analytics 101Text Mining Analytics 101
Text Mining Analytics 101
 
Classification
ClassificationClassification
Classification
 
Classification
ClassificationClassification
Classification
 
Words in space
Words in spaceWords in space
Words in space
 
Machine learning and Neural Networks
Machine learning and Neural NetworksMachine learning and Neural Networks
Machine learning and Neural Networks
 
Improving search with neural ranking methods
Improving search with neural ranking methodsImproving search with neural ranking methods
Improving search with neural ranking methods
 
Lect4
Lect4Lect4
Lect4
 
TEXT CLUSTERING.doc
TEXT CLUSTERING.docTEXT CLUSTERING.doc
TEXT CLUSTERING.doc
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 

More from Natalia Ostapuk

Nlp seminar.kolomiyets.dec.2013
Nlp seminar.kolomiyets.dec.2013Nlp seminar.kolomiyets.dec.2013
Nlp seminar.kolomiyets.dec.2013Natalia Ostapuk
 
Mt engine on nlp semniar
Mt engine on nlp semniarMt engine on nlp semniar
Mt engine on nlp semniarNatalia Ostapuk
 
Клышинский 8.12
Клышинский 8.12Клышинский 8.12
Клышинский 8.12Natalia Ostapuk
 
место онтологий в современной инженерии на примере Iso 15926 v1
место онтологий в современной инженерии на примере Iso 15926 v1место онтологий в современной инженерии на примере Iso 15926 v1
место онтологий в современной инженерии на примере Iso 15926 v1Natalia Ostapuk
 
2011 04 troussov_graph_basedmethods-weakknowledge
2011 04 troussov_graph_basedmethods-weakknowledge2011 04 troussov_graph_basedmethods-weakknowledge
2011 04 troussov_graph_basedmethods-weakknowledgeNatalia Ostapuk
 
2011 04 troussov_graph_basedmethods-weakknowledge
2011 04 troussov_graph_basedmethods-weakknowledge2011 04 troussov_graph_basedmethods-weakknowledge
2011 04 troussov_graph_basedmethods-weakknowledgeNatalia Ostapuk
 

More from Natalia Ostapuk (20)

Gromov
GromovGromov
Gromov
 
Aist academic writing
Aist academic writingAist academic writing
Aist academic writing
 
Aist academic writing
Aist academic writingAist academic writing
Aist academic writing
 
Ponomareva
PonomarevaPonomareva
Ponomareva
 
Nlp seminar.kolomiyets.dec.2013
Nlp seminar.kolomiyets.dec.2013Nlp seminar.kolomiyets.dec.2013
Nlp seminar.kolomiyets.dec.2013
 
Tomita одесса
Tomita одессаTomita одесса
Tomita одесса
 
Mt engine on nlp semniar
Mt engine on nlp semniarMt engine on nlp semniar
Mt engine on nlp semniar
 
Tomita 4марта
Tomita 4мартаTomita 4марта
Tomita 4марта
 
Konyushkova
KonyushkovaKonyushkova
Konyushkova
 
Braslavsky 13.12.12
Braslavsky 13.12.12Braslavsky 13.12.12
Braslavsky 13.12.12
 
Клышинский 8.12
Клышинский 8.12Клышинский 8.12
Клышинский 8.12
 
Zizka immm 2012
Zizka immm 2012Zizka immm 2012
Zizka immm 2012
 
Analysis by-variants
Analysis by-variantsAnalysis by-variants
Analysis by-variants
 
место онтологий в современной инженерии на примере Iso 15926 v1
место онтологий в современной инженерии на примере Iso 15926 v1место онтологий в современной инженерии на примере Iso 15926 v1
место онтологий в современной инженерии на примере Iso 15926 v1
 
Text mining
Text miningText mining
Text mining
 
Additional2
Additional2Additional2
Additional2
 
Seminar1
Seminar1Seminar1
Seminar1
 
2011 04 troussov_graph_basedmethods-weakknowledge
2011 04 troussov_graph_basedmethods-weakknowledge2011 04 troussov_graph_basedmethods-weakknowledge
2011 04 troussov_graph_basedmethods-weakknowledge
 
2011 04 troussov_graph_basedmethods-weakknowledge
2011 04 troussov_graph_basedmethods-weakknowledge2011 04 troussov_graph_basedmethods-weakknowledge
2011 04 troussov_graph_basedmethods-weakknowledge
 
Angelii rus
Angelii rusAngelii rus
Angelii rus
 

Zizka aimsa 2012

  • 1. + Jan Žižka Karel Burda Department of Faculty of Business František Dařena Informatics and Economics Mendel Czech University Republic in Brno Clustering Very Large Textual Unstructured Customers' Reviews in a Natural Language
  • 2. + Introduction  Many companies collect opinions expressed by their customers.  These opinions can hide valuable knowledge.  Discovering such the knowledge by people can be a very demanding task because  the opinion database can be very large,  the customers can use different languages,  the people can handle the opinions subjectively,  sometimes additional resources (like lists of positive and negative words) might be needed.
  • 3. + Introduction  Our previous research focused on analysis what was significant for including a certain opinion into one of categories like satisfied or dissatisfied customers  However, this requires to have the reviews separated into classes sharing a common opinion/sentiment  Clusteringas the most common form of unsupervised learning enables automatic grouping of unlabeled documents into subsets called clusters
  • 4. + Objective The objective is to find out how well a computer can separate the classes expressing a certain opinion, without prior knowledge of the nature of such the classes, and to find a clustering algorithm with a set of its best parameters, similarity and clustering- criterion functions, word representation, and the role of stemming for the given specific data.
  • 5. + Data description  Processed data included reviews of hotel clients collected from publicly available sources  The reviews were labeled as positive and negative  Reviews characteristics:  more than 5,000,000 reviews  written in more than 25 natural languages  written only by real customers, based on a real experience  written relatively carefully but still containing errors that are typical for natural languages
  • 6. + Properties of data used for experiments  Thesubset used in our experiments contained almost two million opinions marked as written in English. Review category Positive Negative Number of reviews 1,190,949 741,092 Maximal review length 391 words 396 words Average review length 21.67 words 25.73 words Variance 403.34 words 618.47 words
  • 7. + Review examples  Positive  The breakfast and the very clean rooms stood out as the best features of this hotel.  Clean and moden, the great loation near station. Friendly reception!  The rooms are new. The breakfast is also great. We had a really nice stay.  Nothing, the hotel is very noisy, no sound insulation whatsoever. Room very small. Shower not nice with a curtain. This is a 2/3 star max.  Negative  High price charged for internet access which actual cost now is extreamly low.  water in the shower did not flow away  The room was noisy and the room temperature was higher than normal.  The train almost running through your room every 10 minutes, the old man at the restaurant was ironic beyond friendly, the food was ok but very German.
  • 8. + Data preparation  Data collection, cleaning (removing tags, non-letter characters), converting to upper-case  Removing words shorter than 3 characters  Porter’s Stemming  Stopwords removing, spell checking, diacritics removal etc. were not carried out  Creating 14 smaller subsets containing positive and negative reviews with following proportions: 131:144, 229:211, 987:1029, 1031:1085, 2096:2211, 4932:4757, 4832:4757, 7432:7399, 10023:8946, 10251:9352, 15469:14784, 24153:23956, 52146:49986, and 365921:313752
  • 9. + Experimental steps  Random selection of desired amount of reviews  Transformation of the data into the vector representation  Loading the data in Cluto* and performing clustering  Evaluating the results * Free software providing different clustering methods working with several clustering criterion functions and similarity measures, suitable for operating on very large datasets.
  • 10. + Clustering algorithm parameters  Clustering algorithm – describes the way how objects to be clustered are assigned into individual groups  Available algorithms  Cluto's k-means variation – algorithm iteratively adapts the initial randomly generated k cluster centroids' positions  Repeated bisection – a sequence of cluster bisections  Graph-based – partitioning a graph representing objects to be clustered
  • 11. + Clustering algorithm parameters  Similarity – an important measure affecting the results of clustering because the objects within one cluster need to be similar while objects from different clusters should be dissimilar  Available similarity/distance measures  Cosine similarity – measures the cosine of the angle between couples of vectors representing the documents  Pearson's correlation coefficient – measures linear correlation between values of two vectors  Euclidean distance – computes the distance between points representing documents in the abstract space
  • 12. + Clustering algorithm parameters  Criterion functions – particular clustering criterion function defined over the entire clustering solution is optimized  Internal functions are defined over the documents that are part of each cluster and do not take into account the documents assigned to different clusters  External criterion functions derive the clustering solution the difference among individual clusters  Internal and external functions can be combined to define a set of hybrid criterion functions that simultaneously optimize individual criterion functions  Available criterion functions  Internal – I1, I2  External – E1, E2  Hybrid – H1, H2  Graph based – G1
  • 13. + Clustering algorithm parameters  Document representation – documents are represented using the vector-space model  Vector dimensions – document properties (terms, in our experiments words)  Vector values  Term Presence (TP)  Term Frequency (TF)  Term Frequency × Inverse Document Frequency (TF-IDF)  Term Presence × Inverse Document Frequency (TP-IDF) 𝑁 𝑖𝑑𝑓 𝑡 𝑖 = log 𝑛(𝑡𝑖)
  • 14. + Evaluation of cluster quality  Purity based measures – measure the extend to which each cluster contained documents from primarily one class  Purity of cluster Sr of size nr: 1 P(𝑆𝑟) = max ni nr i r  Purity of the entire solution with k clusters: 𝑘 𝑛𝑟 𝑃𝑢𝑟𝑖𝑡𝑦 = P(𝑆𝑟) 𝑛 𝑟=1  A perfect clustering solution – clusters contain documents from only a single class  Purity = 1
+
    Evaluation of cluster quality

     Entropy-based measures – how the various classes of documents are
      distributed within each cluster
     Entropy of cluster Sr of size nr:

          E(Sr) = −(1/log q) · Σ_{i=1..q} (n_ir/nr) · log(n_ir/nr),

      where q is the number of classes and n_ir is the number of
      documents of the ith class that were assigned to the rth cluster
     Entropy of the entire solution with k clusters:

          Entropy = Σ_{r=1..k} (nr/n) · E(Sr)

     A perfect clustering solution – clusters contain documents from
      only a single class → Entropy = 0
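The entropy measure can be sketched the same way (classes absent from a cluster contribute nothing, since the sum runs only over classes actually present):

```python
import math
from collections import Counter

def cluster_entropy(cluster_labels, q):
    """E(Sr) = -(1/log q) * sum_i (n_ir/nr) * log(n_ir/nr).
    q: total number of classes in the data set."""
    nr = len(cluster_labels)
    counts = Counter(cluster_labels)
    return -1 / math.log(q) * sum(
        c / nr * math.log(c / nr) for c in counts.values()
    )

def overall_entropy(clusters, q):
    """Entropy = sum over clusters of (nr/n) * E(Sr)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c, q) for c in clusters)
```

The 1/log q factor normalizes the value to [0, 1], so a pure cluster scores 0 and a cluster with a uniform class mixture scores 1.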
+
    Results

     The best results were achieved by k-means and repeated bisection
      with cosine similarity, as demonstrated in the following tables
     A boundary was found – at around 10,000 documents – beyond which the
      entropy value oscillates and does not change much with an
      increasing number of documents
     IDF weighting had a considerable positive impact on the clustering
      results in comparison with simple TP/TF
     The TF-IDF document representation provided almost the same results
      as TP-IDF
+
    Results

     Cosine similarity provided the best results, outperforming Euclidean
      distance and Pearson's correlation coefficient
      For example, for the set of documents containing 4,932 positive
       and 4,745 negative reviews, the entropy was 0.594 for cosine
       similarity, while Euclidean distance gave 0.740 and Pearson's
       coefficient 0.838
     The H2 and I2 criterion functions provided the best results
      For the I1 criterion function, the entropy of one cluster was very
       low (less than 0.2); on the other hand, the second cluster's
       entropy was extremely high
     Stemming applied during the preprocessing phase had no impact on
      the entropy at all
+
    Weighted entropy

                              K-means                        Repeated bisection
                      TF-IDF         TP-IDF          TF-IDF         TP-IDF
    Ratio P:N         I2     H2      I2     H2      I2     H2      I2     H2
    131:144           0.792  0.785   0.793  0.741   0.726  0.767   0.774  0.774
    229:211           0.694  0.632   0.695  0.627   0.648  0.643   0.650  0.647
    987:1029          0.624  0.610   0.618  0.605   0.624  0.609   0.618  0.611
    4832:4757         0.601  0.581   0.599  0.579   0.600  0.584   0.598  0.580
    7432:7399         0.605  0.596   0.599  0.587   0.605  0.595   0.594  0.586
    15469:14784       0.604  0.583   0.598  0.579   0.604  0.582   0.598  0.579
    24153:23956       0.597  0.580   0.589  0.572   0.597  0.580   0.589  0.572
    52164:49986       0.596  0.582   0.600  0.573   0.604  0.582   0.598  0.574
    201346:204716     0.599  0.583   0.592  0.575   0.597  0.583   0.593  0.576
    365921:313752     0.602  0.586   0.598  0.584   0.599  0.581   0.598  0.580
+
    Percentage ratios of documents in the clusters

                               K-means                         Repeated bisection
                        I2               H2               I2               H2
    Ratio P:N        cl. 0   cl. 1    cl. 0   cl. 1    cl. 0   cl. 1    cl. 0   cl. 1
                     (P:N)   (P:N)    (P:N)   (P:N)    (P:N)   (P:N)    (P:N)   (P:N)
    131:144          76:24   24:74    78:24   22:74    75:22   25:76    78:19   22:78
    229:211          84:21   16:79    86:18   14:82    84:20   16:80    84:18   16:82
    987:1029         80:12   19:87    85:16   14:83    79:11   20:88    85:15   15:84
    4832:4757        83:13   17:87    87:15   13:85    83:12   17:87    86:14   14:86
    7432:7399        82:12   17:87    85:14   14:85    82:12   17:86    86:14   14:85
    15469:14784      80:11   19:89    85:13   15:86    81:10   19:89    85:13   15:87
    24153:23956      81:11   19:89    85:13   14:86    81:10   18:89    86:13   14:87
    52164:49986      18:89   81:11    15:87   85:13    19:89   80:10    15:87   85:12
    201346:204716    82:11   18:88    85:13   15:86    82:11   18:89    15:87   85:12
    365921:313752    19:89   80:10    16:88   83:12    80:10   20:90    16:87   84:12
+
    Weighted entropy for different data set sizes

    [Figure: weighted entropy plotted against the number of documents]
+
    Conclusions

     The goal was to automatically build clusters representing positive
      and negative opinions and to find a clustering algorithm with the
      best set of parameters: similarity measure, clustering-criterion
      function, word representation, and the role of stemming
     The main focus was on clustering large real-world data in a
      reasonable time, without applying sophisticated methods that would
      increase the computational complexity
+
    Conclusions

     The best results were obtained with
      k-means
       performed better than the other algorithms
       proved itself to be the faster algorithm
      binary vector representation
      idf weighting
      cosine similarity
      H2 criterion function
     Stemming did not improve the results
+
    Future work

     Clustering of reviews in other languages
     Analysis of “incorrectly” categorized reviews
     Clustering smaller units of reviews (e.g., sentences)