IMMM-2012, October 21-26, 2012, Venice, Italy




Parallel Processing of Very Many Textual Customers’ Reviews
Freely Written Down in Natural Languages

Jan Žižka and František Dařena

Department of Informatics
FBE, Mendel University in Brno
Brno, Czech Republic

zizka@mendelu.cz, darena@mendelu.cz




One typical kind of contemporary data is text written in
various natural languages.

Among other things, textual data very often also expresses
subjective opinions, meanings, sentiments, attitudes, views,
and ideas of its authors – and we can mine this knowledge
from the textual data: text-mining.

The following slides deal with customer opinions
evaluating hotel services.

Such data can be seen, for example, on web sites such as
booking.com, or elsewhere.




Discovering knowledge that is hidden in collected very large
real-world textual data in various natural languages:




Text-mining of very many documents written in natural
languages is limited by computational complexity (time and
memory) and by computer performance.

Most common users have only ordinary personal computers
(PCs) – no supercomputers.

The usual solution: with just standard processors (even
multicore ones) and a few gigabytes of RAM, the whole data
set has to be divided into smaller subsets that can be
processed in parallel.

Are the results different or not? If so, by how much?
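
A rough illustration (a minimal sketch, not the authors' actual pipeline)
of this divide-and-process idea in Python: the whole review collection is
cut into fixed-size subsets and each subset is mined by a separate worker
process on an ordinary multicore PC. The mine_subset() step is a
hypothetical placeholder that only counts word frequencies.

from multiprocessing import Pool
from collections import Counter

def mine_subset(reviews):
    """Placeholder mining step: count word frequencies in one subset."""
    counts = Counter()
    for review in reviews:
        counts.update(review.lower().split())
    return counts

def split_into_subsets(reviews, subset_size):
    """Cut the whole review list into consecutive subsets of a given size."""
    return [reviews[i:i + subset_size]
            for i in range(0, len(reviews), subset_size)]

if __name__ == "__main__":
    reviews = ["good location", "bad food", "very noisy environment"] * 1000
    subsets = split_into_subsets(reviews, subset_size=500)
    with Pool(processes=4) as pool:           # e.g. 4 workers on a 4-core PC
        partial_results = pool.map(mine_subset, subsets)
    merged = sum(partial_results, Counter())  # merge the per-subset results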




The original text-mining research aimed at an automatic
search for significant words and phrases, which could then
be used for a deeper examination of positive and negative
reviews; that is, looking for typical praises or complaints.

For example, “good location”, “bad food”, “very noisy
environment”, “not too much friendly personnel”, “nice
clean room”, and so on.

To obtain this kind of understandable knowledge,
decision trees (rules) were generated.
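
A minimal sketch of the idea (using scikit-learn's entropy-based CART
trees as a stand-in for the C5 trees used in the research, on a toy data
set): reviews are turned into a bag-of-words matrix, a decision tree is
grown whose split conditions are word frequencies, and the tree is
printed as readable rules.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy labelled reviews; the real data contained ca 2,000,000 of them.
reviews = ["good location", "nice clean room", "bad food", "very noisy environment"]
labels  = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)            # m reviews x n dictionary words
tree = DecisionTreeClassifier(criterion="entropy").fit(X, labels)

# Print the tree as human-readable rules over word frequencies.
print(export_text(tree, feature_names=list(vectorizer.get_feature_names_out())))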



Some original English examples, out of ca. 1,200,000 positive
and 800,000 negative reviews (no grammar corrections):

– breakfast and the closeness to the railway station were the only
things that werent bad

– did not spend enogh time in hotel to assess

– it was somewhere to sleep

– very little !!!!!!!!!

– breakfast, supermarket in the same building, kitchen in
the apartment (basic but better than none)

– no complaints on the hotel



The upper-bound computational complexity of the entropy-
based decision tree (C5) is O(m·n²) for m reviews and
n unique words in the generated dictionary; some values of n:
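
A small sketch (assuming simple whitespace tokenization, on an
illustrative toy review list rather than the real data) of how the
dictionary size n can be measured so that the O(m·n²) bound can be
estimated before a run:

def dictionary_size(reviews):
    """Number of unique lower-cased tokens over all reviews."""
    vocabulary = set()
    for review in reviews:
        vocabulary.update(review.lower().split())
    return len(vocabulary)

reviews = ["good location", "bad food", "good breakfast"]
m, n = len(reviews), dictionary_size(reviews)
print(f"m = {m}, n = {n}, upper bound ~ m*n^2 = {m * n * n} operations")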



The minimum review length was 1 word (for example,
“Excellent!!!”), the maximum was 167 words.

The average length of a review was 19 words.

The vector sparsity was typically around 0.01%; that is,
on average, a review contained only 0.01% of all the
words in the dictionary created from the reviews.

An overwhelming majority of words were insignificant;
only some 100–300 terms (depending on the specific
language) played a significant role from the
classification point of view.
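
A minimal sketch (a toy bag-of-words matrix, not the original data) of
how such vector sparsity can be computed: the share of the dictionary
that an average review actually uses.

from sklearn.feature_extraction.text import CountVectorizer

reviews = ["Excellent!!!", "good location near the station", "very noisy room"]
X = CountVectorizer().fit_transform(reviews)       # m x n document-term matrix
m, n = X.shape
avg_words_per_review = X.getnnz() / m              # distinct dictionary words per review
sparsity = avg_words_per_review / n                # fraction of the dictionary used
print(f"an average review uses {sparsity:.2%} of the {n}-word dictionary")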




Intuitively, the larger the data, the better the discovered
knowledge.

However, there is always an insurmountable problem:

How to process very large textual data?



The experiments were aimed at finding the optimal
subset size for the main data set division. The
optimum was defined as obtaining the same results
from the whole data set and from the individual
subsets.

Ideally, each subset should provide the same
significant words that would have the same
significance for classification.

The significance of a word was defined as the number of
times a decision tree asked about that word's frequency
in a review.
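
A hedged sketch of this significance measure (again using a scikit-learn
tree on toy data as a stand-in for the original C5 trees): count, for a
fitted tree, how many internal nodes test the frequency of each
dictionary word.

from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

reviews = ["good location", "nice clean room", "bad food", "noisy room at night"]
labels  = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
tree = DecisionTreeClassifier(criterion="entropy").fit(X, labels)

words = vectorizer.get_feature_names_out()
significance = Counter(
    words[f] for f in tree.tree_.feature if f >= 0   # negative values mark leaves
)
print(significance.most_common())   # words ordered by how often the tree tests them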



If a certain word from the whole data set is in the root,
most of the subsets (ideally all) should have the same
word in the roots of their trees.

The same rule can similarly be applied to the other words
included in the trees, on the levels approaching the leaves.
Then we could say that each subset represents the
original set perfectly.

In reality, the decision trees generated for the individual
review subsets differ from each other to a greater or lesser
degree, because they are created from different reviews.
A tree generated from a subset may also contain at least
one word that is not in the tree generated from the whole set.




Part of a tree for a subset: do all the subsets of the
customer reviews have the same word, “location”, in
their roots?

And what about the other words? Are they in the same
positions in all the subset trees?



The question is:

How many subsets should the whole review set be
divided into so that the combined results from all the
subsets give (almost) the same result as the whole set?

It is not easy to find a general solution because the
result depends on the particular data.

The research used the data described above because it
corresponds to many similar situations: a lot of
short reviews concerning just one topic.



The original set (2,000,000 reviews) was too big to be
processed as a whole.

For the given PC (8 GB RAM, a 4-core processor, a 64-bit
machine), the experiments worked with 200,000 randomly
selected reviews as the “whole” set (for more reviews, the
computation took more than 24 hours or crashed with an
insufficient-memory error).

Then, the task was to find an optimal division of the
200,000 reviews into smaller subsets.



The results of the experiments are demonstrated in the
following graphs for different sizes of the whole set
and its subsets.

On the horizontal x axis, there are the most significant
words generated by the trees.

The vertical y axis shows the correspondence between
the percentage of the significant words in the whole set
and the average percentage in the relevant subsets.

The whole set contains all the significant words, so its
y value is always 1.0 (that is, 100%).



The correspondence between the results provided by the
whole set and by its subsets is given by the agreement
between the percentage weight of a word wj in the tree
generated for the whole set and the average percentage
weight of wj in the trees generated for all the
corresponding subsets.

If the percentage weight of a word wj in the whole set is
(on average) the same as for all the subsets, then the
agreement is perfect; otherwise it is imperfect, and the
degree of imperfection is given by the difference.
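
A minimal sketch of this agreement measure under a simple assumption:
the percentage weight of a word is taken here as its share of all
split-node tests in a tree (the significance counts from the earlier
sketch, normalised), and the agreement for a word wj is 1 minus the
absolute difference between the whole-set weight and the mean weight
over the subset trees. The toy counts below are illustrative, not from
the real data.

def percentage_weights(significance):
    """Normalise raw significance counts to shares that sum to 1."""
    total = sum(significance.values())
    return {word: count / total for word, count in significance.items()}

def agreement(word, whole_weights, subset_weights_list):
    """1.0 = perfect agreement between the whole set and the subset average."""
    avg_subset = (sum(w.get(word, 0.0) for w in subset_weights_list)
                  / len(subset_weights_list))
    return 1.0 - abs(whole_weights.get(word, 0.0) - avg_subset)

# Toy counts of how often each word is tested in the whole-set tree and in two subset trees.
whole = percentage_weights({"location": 6, "noisy": 3, "breakfast": 1})
parts = [percentage_weights({"location": 3, "noisy": 2}),
         percentage_weights({"location": 2, "breakfast": 1, "clean": 1})]
for word in ("location", "noisy", "breakfast"):
    print(word, round(agreement(word, whole, parts), 3))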




The whole set contains 200,000 reviews




The whole set contains 100,000 reviews




The whole set contains 50,000 reviews



Conclusions:

It is probably no surprise that the subsets should be
as large as possible in order to obtain reliable knowledge.

However, the question is: how big should “small” be for
the inevitable subsets?

For given data, this can be found experimentally, and the
result is then applicable to the same type of data in the
future, as the experiments demonstrated.

This research report suggests a method for how to proceed.
