Seminar1

+ Jan Žižka
František Dařena
Department of Informatics
Faculty of Business and Economics
Mendel University in Brno
Czech Republic

Mining Textual Significant
Expressions Reflecting Opinions in
Natural Languages

+
Introduction

 Many companies collect opinions expressed
by their customers.
 These opinions can hide valuable knowledge.
 Discovering this knowledge manually can be
a very demanding task because
 the opinion database can be very large,
 the customers can use different languages,
 people can handle the opinions subjectively,
 additional resources (such as lists of positive
and negative words) might sometimes be needed.

+
Introduction

 Text mining can reveal units of the texts
(words, phrases, sentences, etc.) that can
represent the meaning or sentiment
 Individual words usually do not carry
enough information
 Phrases can provide more information, but
their extraction, based on linguistic
analysis, requires additional knowledge
that is specific to each language

+
Objective

The objective is to find a way for a
computer to reveal phrases that
express a certain opinion without the
demanding and time-consuming linguistic
analysis, which differs for each
natural language.

+
Data description

 The processed data consisted of reviews written by hotel
clients, collected from publicly available sources
 The reviews were labeled as positive or negative
 Review characteristics:
 more than 5,000,000 reviews
 written in more than 25 natural languages
 written only by real customers, based on a real
experience
 written relatively carefully but still containing errors that
are typical for natural languages

+
Review examples

 Positive
 The breakfast and the very clean rooms stood out as the best
features of this hotel.
 Clean and moden, the great loation near station. Friendly
reception!
 The rooms are new. The breakfast is also great. We had a really
nice stay.
 Good location - very quiet and good breakfast.

 Negative
 High price charged for internet access which actual cost now
is extreamly low.
 water in the shower did not flow away
 The room was noisy and the room temperature was higher
than normal.
 The air conditioning wasn't working

+
Data preparation

 Data collection, cleaning (removing tags and
non-letter characters), converting to upper case
 Transforming into the bag-of-words
representation, with term frequencies (TF) used as
attribute values
 Removing the words with global frequency < 2
 Stemming, stopword removal, spell
checking, diacritics removal, etc. were not
carried out
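
The preparation steps above could be sketched roughly as follows. This is only an illustration, not the authors' code: the function names `clean` and `bag_of_words` are hypothetical, "global frequency" is assumed to mean the total number of occurrences across all reviews, and the tag-stripping pattern is a simplification.

```python
import re
from collections import Counter

def clean(review):
    """Strip tags and non-letter characters, convert to upper case, split into words."""
    no_tags = re.sub(r"<[^>]+>", " ", review)
    # Replace runs of non-letters (punctuation, digits, underscores) with a space.
    # Letters with diacritics are kept, since diacritics removal was not carried out.
    letters_only = re.sub(r"[\W\d_]+", " ", no_tags)
    return letters_only.upper().split()

def bag_of_words(reviews, min_global_freq=2):
    """Return one term-frequency Counter per review, dropping rare words."""
    tokenized = [clean(r) for r in reviews]
    # Assumption: global frequency = total occurrences over the whole collection.
    global_freq = Counter(w for words in tokenized for w in words)
    vocabulary = {w for w, c in global_freq.items() if c >= min_global_freq}
    return [Counter(w for w in words if w in vocabulary) for words in tokenized]

reviews = [
    "Good location - very quiet and good breakfast.",
    "The breakfast is also great.<br>",
]
bags = bag_of_words(reviews)
```

No stemming, stopword list, or spell checking is applied, which keeps the sketch language independent in the same spirit as the procedure described on the slide.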

+
Data characteristics – number of
reviews

[Bar chart: number of reviews, positive and negative, per language; y-axis from 0 to 1,200,000; languages: English, French, Spanish, German, Italian, Czech]

+
Data characteristics – dictionary
sizes

[Bar chart: number of unique words per language for MinTF=1 and MinTF=2; y-axis from 0 to 250,000; languages: English, German, French, Spanish, Italian, Czech]

+
Finding significant words

 Thanks to a large collection of labeled
examples, a classifier that separates positive and
negative reviews could be created
 To reveal significant attributes (words), a decision
tree was built using the tree-generating algorithm
C5, which is based on entropy minimization
 The goal was not to achieve the best classification
accuracy but to find relevant attributes that
contribute to assigning a text to a given class
 The significant words appeared in the nodes of the
decision tree
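
To illustrate how words end up in tree nodes, here is a minimal pure-Python sketch of an entropy-based tree. It is not C5: it is a simplified stand-in that splits on the presence or absence of a word rather than on term-frequency thresholds, and the names `best_split`, `significant_words`, and `max_depth` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(docs, labels):
    """Return (information gain, word) for the word whose presence best separates the classes."""
    base = entropy(labels)
    best = (0.0, None)
    for word in sorted(set().union(*docs)):
        with_w = [l for d, l in zip(docs, labels) if word in d]
        without = [l for d, l in zip(docs, labels) if word not in d]
        if not with_w or not without:
            continue  # the word does not split this node
        remainder = (len(with_w) * entropy(with_w)
                     + len(without) * entropy(without)) / len(labels)
        gain = base - remainder
        if gain > best[0]:
            best = (gain, word)
    return best

def significant_words(docs, labels, max_depth=5):
    """Collect the words tested in the nodes of a greedily grown tree."""
    if max_depth == 0 or len(set(labels)) < 2:
        return set()  # depth limit reached or node is already pure
    gain, word = best_split(docs, labels)
    if word is None:
        return set()
    found = {word}
    for present in (True, False):
        subset = [(d, l) for d, l in zip(docs, labels) if (word in d) == present]
        found |= significant_words([d for d, _ in subset],
                                   [l for _, l in subset], max_depth - 1)
    return found

docs = [{"GOOD", "LOCATION"}, {"GREAT", "BREAKFAST"},
        {"NOISY", "ROOM"}, {"NOISY", "STREET"}]
labels = ["pos", "pos", "neg", "neg"]
words = significant_words(docs, labels)
```

In this toy collection a single word, NOISY, separates the two classes, so it is the only word that appears in a node.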

+
Finding the significant words

 The classification accuracy, which is proportional to
the relevancy of the words, was between 89.5% and 92.5%

 The decision tree provided a list of about 200–300
words significant for classification from the
sentiment perspective

 These words are used as the basis for extracting
significant expressions, which avoids
considering all possible combinations of words

+
Extracting significant expressions

 Extraction of significant expressions starts from
the list of significant words; the reviews are
searched in the proximity of these words
 Parameters of the significant-expression extraction
algorithm:
 D – the distance from a significant word within
which the search is carried out
 N – the number of words forming a significant
expression
 M – the minimal number of occurrences of a
specific group of words
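
One possible reading of the parameters D, N, and M is sketched below. The slides do not give the exact algorithm, so several details are assumptions: an expression is treated as an unordered group of N words that contains the significant word, the other N - 1 words are taken from up to D positions on either side, and each group is counted at most once per review. The function name `extract_expressions` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def extract_expressions(reviews, significant, D=3, N=2, M=2):
    """Count word groups found near significant words; keep those seen in at least M reviews.

    reviews     -- list of reviews, each a list of words
    significant -- set of significant words (for example, from the decision tree)
    """
    counts = Counter()
    for words in reviews:
        found_here = set()  # assumption: count each group once per review
        for i, word in enumerate(words):
            if word not in significant:
                continue
            lo, hi = max(0, i - D), min(len(words), i + D + 1)
            neighbours = words[lo:i] + words[i + 1:hi]
            for group in combinations(neighbours, N - 1):
                # Sorting makes the group independent of word order in the text.
                found_here.add(tuple(sorted((word,) + group)))
        counts.update(found_here)
    return {expr: c for expr, c in counts.items() if c >= M}

reviews = [
    ["GOOD", "LOCATION"],
    ["GOOD", "SERVICE", "CONVENIENT", "LOCATION"],
    ["QUIET", "LOCATION", "FOR", "A", "GOOD", "NIGHTS", "SLEEP"],
]
result = extract_expressions(reviews, {"GOOD"}, D=3, N=2, M=3)
```

With these settings only the pair GOOD and LOCATION reaches the threshold M = 3, even though the two words stand at different distances, and in different orders, in the three reviews.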

+
An example

 Searching for significant expressions in a review
with the algorithm parameters D = 3, N = 3.

+
Results

 Lists of significant expressions extracted from the
original text reviews were obtained.

 The expressions need to be evaluated by people.

+
Significant expressions for English

+
Significant expressions for German

+
Significant expressions for Spanish

+
Discussion

 Some of the significant expressions were very similar

 The significant expressions were mostly quite
meaningful and potentially useful for the target
audience

 Some of the expressions were naturally not useful at all

 It is necessary to find a trade-off between the size of
expressions, the length of the texts where the search is
carried out, and the informative value of expressions

+
Discussion

 Examples of different distances of words forming the same
significant expression "good location"

+
Discussion

 However, the same expression can be formed from words
that come from different contexts:

“... Breakfast was really good. The location is a
little out of the center ...”
or
“Good service. Convenient location”
or
“It is a quiet location for a good nights sleep”

+
Handling large collections

 For languages with a large number of reviews, the
datasets were randomly split into subsets of
50,000 reviews because of memory requirements,
and a decision tree was created for each subset

 Each of the 50,000-review subsets gave almost the
same list of words

 The relevancies of the extracted words were averaged
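
The splitting and averaging could look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, the per-subset relevancies are assumed to arrive as word-to-score dictionaries (for example, from one decision tree per subset), and a word missing from a subset is assumed to contribute a relevancy of 0 there.

```python
import random
from collections import defaultdict

def split_into_subsets(items, subset_size=50_000, seed=0):
    """Shuffle the items and cut them into consecutive chunks of subset_size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + subset_size] for i in range(0, len(shuffled), subset_size)]

def average_relevance(per_subset_scores):
    """Average word relevancies over all subsets (missing word = 0 in that subset)."""
    totals = defaultdict(float)
    for scores in per_subset_scores:
        for word, score in scores.items():
            totals[word] += score
    n = len(per_subset_scores)
    return {word: total / n for word, total in totals.items()}

subsets = split_into_subsets(range(10), subset_size=4)
averaged = average_relevance([{"GOOD": 0.5, "NOISY": 0.25}, {"GOOD": 0.25}])
```

The last chunk may be smaller than the others; whether such a remainder was kept or discarded in the original experiments is not stated on the slide.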

+
Conclusions

 A procedure for applying machine learning and
natural language processing to automatically find
significant expressions was presented
 From the total number of words (80,000–200,000), only about
200–300 were identified as significant and used as the basis
for extracting expressions
 The simple, unified procedure worked well for many
languages
 Follow-up research focuses on the preprocessing phase (e.g.
eliminating meaningless words)
 The procedure might be used in marketing
research or marketing intelligence, for filtering reviews,
generating lists of keywords, etc.

Thank you for your attention
Vielen Dank für Ihre Aufmerksamkeit
Gracias por vuestra atención
Merci de votre attention
Grazie per la vostra attenzione
Děkuji za vaši pozornost
