+              Jan Žižka
       František Dařena
                           Department
                           of
                                         Faculty of
                                         Business
                           Informatics   and
                                         Economics




                           Mendel        Czech
                           University    Republic
                           in Brno




    MINING SIGNIFICANT WORDS FROM
    CUSTOMER OPINIONS WRITTEN IN
    DIFFERENT NATURAL LANGUAGES
+
    Introduction


     Many  companies collect opinions expressed
      by their customers.
     These opinions can hide valuable knowledge.
     Discovering the knowledge by people can be
      sometimes a very demanding task because
      the opinion database can be very large,
      the customers can use different languages,
      the people can handle the opinions subjectively,
      sometimes additional resources (like lists of positive
       and negative words) might be needed.
+
    Objective


    For answering the question “What is
     significant for including a certain
     opinion into one of categories like
     satisfied or dissatisfied customers?”
     automatically extract words significant
     for positive and negative customers'
     opinions and to form not too large
     dictionaries of these words.
+
    Data description

     Processed  data included reviews of hotel clients
      collected from publicly available sources.
     The reviews were labeled as positive and
      negative.
     Reviews characteristics:
      more than 5,000,000 reviews,
      written in more than 25 natural languages,
      written only by real customers, based on a real
       experience,
      written relatively carefully but still containing errors that
       are typical for natural languages.
+
    Review examples

       Positive
           The breakfast and the very clean rooms stood out as the best
            features of this hotel.
           Clean and moden, the great loation near station. Friendly
            reception!
           The rooms are new. The breakfast is also great. We had a really
            nice stay.
           Good location - very quiet and good breakfast.

       Negative
           High price charged for internet access which actual cost now
            is extreamly low.
           water in the shower did not flow away
           The room was noisy and the room temperature was higher
            than normal.
           The air conditioning wasn't working
+
    Data preparation


     Data  collection, cleaning (removing tags, non-
      letter characters), converting to upper-case.
     Transforming into the bag-of-words
      representation, term frequencies (TF) used as
      attribute values.
     Removing the words with global frequency < 2.
     Stemming, stopwords removing, spell
      checking, diacritics removal etc. were not
      carried out.
+
                     Data characteristics

                    1200000



                    1000000



                     800000
number of reviews




                                                                                                        positive
                     600000
                                                                                                        negative

                     400000



                     200000



                          0
                              English   French   Spanish   German   Italian   Russian   Japan   Czech
+
                          Data characteristics

                         250000




                         200000
number of unique words




                         150000

                                                                                                            MinTF=1
                                                                                                            MinTF=2
                         100000




                          50000




                              0
                                  English   German   Japan   French   Spanish   Italian   Russian   Czech
+
    Finding the significant words

     Thanksto having a large collection of labeled
     examples a classifier that separates positive and
     negative reviews could be created.
     To reveal significant attributes (words) a decision
      tree was built using the tree-generating algorithm
      c5 (by R. Quinlan) based on entropy minimization.
     The goal was not to achieve the best classification
      accuracy but to find relevant attributes that
      contribute to assigning a text to a given class.
     The significant words appeared in the nodes of the
      decision tree.
+
    An example of a decision tree

    LOCATION > 0:
    :...POOR > 0:
    :   :...GOOD > 0: _P (13)
    :   :   GOOD <= 0:
    :   :   :...EXCELLENT > 0: _P (3)
    :   :       EXCELLENT <= 0:
    :   :       :...GREAT > 0: _P (3)
    :   :           GREAT <= 0:
    :   :           :...CLEAN <= 0: _N (48/4)
    :   :               CLEAN > 0: _P (4/1)
    :   POOR <= 0:
    :   :...DIFFICULT > 0:
    :       :...GOOD > 0: _P (6)
    :       :   GOOD <= 0:
    :       :   :...HELPFUL <= 0: _N (34/7)
    :       :       HELPFUL > 0: _P (5)
    ...
    ...
+
    Finding the significant words

     The classification accuracy which is proportional to
     the relevancy of words was between 83 – 93%.
     Thedecision tree mostly asked if the frequency
     was > 0 or = 0 (binary representation).
     Thedecision tree provides a list of about 200-300
     words significant for classification from the
     sentiment perspective together with the
     significance (i.e. the frequency of using the words
     during classification) of the words.
     Only15 words for each language is presented
     together with their significance (column %).
+
    Handling large collections

     For
        languages with large amount of reviews the
     datasets were randomly split into subsets
     consisting of 50,000 reviews because of memory
     requirements and a decision tree was created for
     each such subset.

     Each
         of the 50,000-sample subsets gave almost the
     same list of words.

     The   relevancies of extracted words were averaged.
+
    Results
+
    Results
+
    Results
+
    Results
+
    Conclusions

    A   procedure how to apply computers, machine
      learning, and natural language processing areas to
      automatically find significant words was presented.
     From the total number of words (80,000–200,000) only
      about 200–300 were identified as significant.
     The simple, unified procedure worked well for many
      languages.
     Following research focuses on determining the
      strength of sentiment and on generating typical short
      phrases instead of only creating individual words.
     The procedure might be used during the marketing
      research or marketing intelligence, for filtering
      reviews, generating lists of key-words etc.
Thank you for your attention
Vielen Dank für Ihre Aufmerksamkeit
    Gracias por vuestra atención
      Merci de votre attention
   Grazie per la vostra attenzione
      Спасибо за ваше внимание
   ご静聴ありがとうございました
      Děkuji za vaši pozornost

Additional2

  • 1.
    + Jan Žižka František Dařena Department of Faculty of Business Informatics and Economics Mendel Czech University Republic in Brno MINING SIGNIFICANT WORDS FROM CUSTOMER OPINIONS WRITTEN IN DIFFERENT NATURAL LANGUAGES
  • 2.
    + Introduction  Many companies collect opinions expressed by their customers.  These opinions can hide valuable knowledge.  Discovering the knowledge by people can be sometimes a very demanding task because  the opinion database can be very large,  the customers can use different languages,  the people can handle the opinions subjectively,  sometimes additional resources (like lists of positive and negative words) might be needed.
  • 3.
    + Objective For answering the question “What is significant for including a certain opinion into one of categories like satisfied or dissatisfied customers?” automatically extract words significant for positive and negative customers' opinions and to form not too large dictionaries of these words.
  • 4.
    + Data description  Processed data included reviews of hotel clients collected from publicly available sources.  The reviews were labeled as positive and negative.  Reviews characteristics:  more than 5,000,000 reviews,  written in more than 25 natural languages,  written only by real customers, based on a real experience,  written relatively carefully but still containing errors that are typical for natural languages.
  • 5.
    + Review examples  Positive  The breakfast and the very clean rooms stood out as the best features of this hotel.  Clean and moden, the great loation near station. Friendly reception!  The rooms are new. The breakfast is also great. We had a really nice stay.  Good location - very quiet and good breakfast.  Negative  High price charged for internet access which actual cost now is extreamly low.  water in the shower did not flow away  The room was noisy and the room temperature was higher than normal.  The air conditioning wasn't working
  • 6.
    + Data preparation  Data collection, cleaning (removing tags, non- letter characters), converting to upper-case.  Transforming into the bag-of-words representation, term frequencies (TF) used as attribute values.  Removing the words with global frequency < 2.  Stemming, stopwords removing, spell checking, diacritics removal etc. were not carried out.
  • 7.
    + Data characteristics 1200000 1000000 800000 number of reviews positive 600000 negative 400000 200000 0 English French Spanish German Italian Russian Japan Czech
  • 8.
    + Data characteristics 250000 200000 number of unique words 150000 MinTF=1 MinTF=2 100000 50000 0 English German Japan French Spanish Italian Russian Czech
  • 9.
    + Finding the significant words  Thanksto having a large collection of labeled examples a classifier that separates positive and negative reviews could be created.  To reveal significant attributes (words) a decision tree was built using the tree-generating algorithm c5 (by R. Quinlan) based on entropy minimization.  The goal was not to achieve the best classification accuracy but to find relevant attributes that contribute to assigning a text to a given class.  The significant words appeared in the nodes of the decision tree.
  • 10.
    + An example of a decision tree LOCATION > 0: :...POOR > 0: : :...GOOD > 0: _P (13) : : GOOD <= 0: : : :...EXCELLENT > 0: _P (3) : : EXCELLENT <= 0: : : :...GREAT > 0: _P (3) : : GREAT <= 0: : : :...CLEAN <= 0: _N (48/4) : : CLEAN > 0: _P (4/1) : POOR <= 0: : :...DIFFICULT > 0: : :...GOOD > 0: _P (6) : : GOOD <= 0: : : :...HELPFUL <= 0: _N (34/7) : : HELPFUL > 0: _P (5) ... ...
  • 11.
    + Finding the significant words  The classification accuracy which is proportional to the relevancy of words was between 83 – 93%.  Thedecision tree mostly asked if the frequency was > 0 or = 0 (binary representation).  Thedecision tree provides a list of about 200-300 words significant for classification from the sentiment perspective together with the significance (i.e. the frequency of using the words during classification) of the words.  Only15 words for each language is presented together with their significance (column %).
  • 12.
    + Handling large collections  For languages with large amount of reviews the datasets were randomly split into subsets consisting of 50,000 reviews because of memory requirements and a decision tree was created for each such subset.  Each of the 50,000-sample subsets gave almost the same list of words.  The relevancies of extracted words were averaged.
  • 13.
    + Results
  • 14.
    + Results
  • 15.
    + Results
  • 16.
    + Results
  • 17.
    + Conclusions A procedure how to apply computers, machine learning, and natural language processing areas to automatically find significant words was presented.  From the total number of words (80,000–200,000) only about 200–300 were identified as significant.  The simple, unified procedure worked well for many languages.  Following research focuses on determining the strength of sentiment and on generating typical short phrases instead of only creating individual words.  The procedure might be used during the marketing research or marketing intelligence, for filtering reviews, generating lists of key-words etc.
  • 18.
    Thank you foryour attention Vielen Dank für Ihre Aufmerksamkeit Gracias por vuestra atención Merci de votre attention Grazie per la vostra attenzione Спасибо за ваше внимание ご静聴ありがとうございました Děkuji za vaši pozornost