The document discusses parallel processing of very large numbers of customer reviews in order to discover hidden knowledge. It finds that while more data yields better knowledge discovery, very large datasets cannot be processed as a whole by typical computers. The document reports on experiments that divide large review datasets into subsets of varying sizes to determine the optimal subset size, that is, the size that produces results nearly identical to processing the full dataset as a single set. Smaller subsets produce results less consistent with the full dataset, while larger subsets yield knowledge discovery more consistent with the full original dataset.
1. IMMM-2012, October 21-26, 2012, Venice, Italy
Parallel Processing of Very Many Textual
Customers’ Reviews Freely Written Down
in Natural Languages
Jan Žižka and František Dařena
Department of Informatics
FBE, Mendel University in Brno
Brno, Czech Republic
zizka@mendelu.cz, darena@mendelu.cz
2. IMMM-2012, October 21-26, 2012, Venice, Italy
One typical kind of contemporary data is text written in
various natural languages.
Among other things, textual data very often also represents
subjective opinions, meanings, sentiments, attitudes,
views, and ideas of the text authors – and we can mine them
from the textual data to obtain knowledge: text-mining.
The following slides deal with customer opinions
evaluating hotel services.
We can see such data, for example, on web sites such as
booking.com, or elsewhere.
3. IMMM-2012, October 21-26, 2012, Venice, Italy
Discovering knowledge that is hidden in collected very large
real-world textual data in various natural languages:
4. IMMM-2012, October 21-26, 2012, Venice, Italy
Text-mining of very many documents written in natural
languages is limited by computational complexity
(time and memory) and computer performance.
Most common users have only ordinary personal
computers (PCs), not supercomputers.
A regular solution: having just standard processors
(even if multicore ones) and some gigabytes of RAM,
the whole data set has to be divided into smaller
subsets that can be processed in parallel.
Are the results different or not? If yes, how much?
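As an illustration of this division (not from the original slides), the following Python sketch splits a list of reviews into subsets of a chosen size and processes them in parallel with the standard multiprocessing module; the subset size and the analyse_subset placeholder are assumptions, not the authors' actual implementation.

    import multiprocessing as mp

    def analyse_subset(reviews):
        # Placeholder for the per-subset analysis (e.g., generating a decision tree);
        # here it only counts word occurrences as a stand-in.
        counts = {}
        for review in reviews:
            for word in review.lower().split():
                counts[word] = counts.get(word, 0) + 1
        return counts

    def split_into_subsets(reviews, subset_size):
        # Divide the whole review set into consecutive subsets of the chosen size.
        return [reviews[i:i + subset_size] for i in range(0, len(reviews), subset_size)]

    if __name__ == "__main__":
        reviews = ["good location", "bad food", "very noisy environment"] * 1000
        subsets = split_into_subsets(reviews, subset_size=500)   # assumed subset size
        with mp.Pool() as pool:                                  # one worker per CPU core
            results = pool.map(analyse_subset, subsets)
        print(len(subsets), "subsets processed in parallel")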
5. IMMM-2012, October 21-26, 2012, Venice, Italy
The original text-mining research aimed at the automatic
search for significant words and phrases, which could
then be used for a deeper examination of positive and
negative reviews; that is, looking for typical praises or
complaints.
For example, “good location”, “bad food”, “very noisy
environment”, “not too much friendly personnel”, “nice
clean room”, and the like.
To obtain such an understandable kind of knowledge,
decision trees (rules) were generated.
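A minimal sketch of this step, assuming scikit-learn rather than the tool actually used by the authors: a decision tree is trained on word-frequency vectors of positive and negative reviews, and the words tested in the tree's nodes serve as candidate significant terms.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.tree import DecisionTreeClassifier

    # Tiny illustrative data; the real experiments used hundreds of thousands of reviews.
    reviews = ["good location", "nice clean room", "bad food", "very noisy environment"]
    labels  = ["positive", "positive", "negative", "negative"]

    vectorizer = CountVectorizer()                # builds the dictionary of unique words
    X = vectorizer.fit_transform(reviews)         # word-frequency vector for each review

    tree = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits, as in C5
    tree.fit(X, labels)

    # Words actually tested in the tree's decision nodes are candidate significant terms.
    words = vectorizer.get_feature_names_out()
    significant = {words[i] for i in tree.tree_.feature if i >= 0}
    print(significant)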
6. IMMM-2012, October 21-26, 2012, Venice, Italy
Some original English examples, from ca. 1,200,000 positive
and 800,000 negative reviews (no grammar corrections):
– breakfast and the closeness to the railway station were the only
things that werent bad
– did not spend enogh time in hotel to assess
– it was somewhere to sleep
– very little !!!!!!!!!
– breakfast, supermarket in the same building, kitchen in
the apartment (basic but better than none)
– no complaints on the hotel
7. IMMM-2012, October 21-26, 2012, Venice, Italy
The upper bound on the computational complexity of the
entropy-based decision tree (C5) is O(m∙n²) for m reviews
and n unique words in the generated dictionary; some n's:
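To give a sense of scale, a worked example (m = 2,000,000 is the size of the original review set mentioned later; the dictionary size n = 50,000 is an illustrative assumption, not one of the slide's n's):

\[
m \cdot n^{2} \;=\; 2\times10^{6} \cdot \left(5\times10^{4}\right)^{2} \;=\; 5\times10^{15},
\]

which illustrates why the original set could not be processed as a whole on an ordinary PC.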
8. IMMM-2012, October 21-26, 2012, Venice, Italy
The minimum review length was 1 word (for example,
“Excellent!!!”), the maximum was 167 words.
The average length of a review was 19 words.
The vector sparsity was typically around 0.01%, that is,
on average, a review contained only 0.01% of all the
words in the dictionary created from the reviews.
An overwhelming majority of words were insignificant;
only some 100-300 terms (depending on the specific
language) played a significant role from the
classification point of view.
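A small sketch of how such sparsity can be measured, again assuming scikit-learn and toy data (the 0.01% figure comes from the slide, not from this code):

    from sklearn.feature_extraction.text import CountVectorizer

    reviews = ["good location", "nice clean room", "bad food", "very noisy environment"]

    X = CountVectorizer().fit_transform(reviews)   # sparse review-by-word matrix
    density = X.nnz / (X.shape[0] * X.shape[1])    # fraction of non-zero entries
    print(f"a review uses on average {density:.2%} of the dictionary")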
9. IMMM-2012, October 21-26, 2012, Venice, Italy
Intuitively, the larger the data, the better the discovered
knowledge.
However, there is always an insurmountable problem:
How to process very large textual data?
10. IMMM-2012, October 21-26, 2012, Venice, Italy
The experiments were aimed at finding the optimal
subset size for the main data set division. The
optimum was defined as obtaining the same results
from the whole data set and from the individual
subsets.
Ideally, each subset should provide the same
significant words that would have the same
significance for classification.
The significance of a word was defined as the number of
times a decision tree asked about the frequency of that
word in a review.
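One possible way to compute such a significance count, continuing the scikit-learn sketch above (the counting rule is taken from this slide; the code itself is an assumption):

    import numpy as np

    def word_significance(tree, words):
        # Count how many decision nodes of a fitted tree test each word's frequency;
        # negative feature indices mark leaves and are skipped.
        features = tree.tree_.feature
        counts = np.bincount(features[features >= 0], minlength=len(words))
        return dict(zip(words, counts.tolist()))

    # Example, using 'tree' and 'words' from the earlier sketch:
    # print(word_significance(tree, words))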
11. IMMM-2012, October 21-26, 2012, Venice, Italy
If a certain word is in the root of the tree built from the
whole data set, most of the subsets (ideally all) should have
the same word in their roots.
Similarly, the same rule can be applied to other words
included in the trees on levels approaching the leaves.
Then we could say that each subset represents the
original set perfectly.
In reality, the decision trees generated for the individual review
subsets differ more or less from one another because they are
created from different reviews. A tree generated from
a subset may also contain at least one word that is not
in the tree generated from the whole set.
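A brief sketch of this root comparison (hypothetical helpers, assuming scikit-learn trees that were all fitted on the same dictionary):

    def root_word(tree, words):
        # Node 0 is the root of a scikit-learn tree; its feature index names the tested word.
        return words[tree.tree_.feature[0]]

    def root_agreement(whole_tree, subset_trees, words):
        # Fraction of subset trees whose root tests the same word as the whole-set tree.
        target = root_word(whole_tree, words)
        same = sum(1 for t in subset_trees if root_word(t, words) == target)
        return same / len(subset_trees)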
12. IMMM-2012, October 21-26, 2012, Venice, Italy
Part of a tree for a subset: Do all the subsets of the customer
reviews have the same word location in their roots?
And what about other words? Are they in the same positions
in all subset trees?
13. IMMM-2012, October 21-26, 2012, Venice, Italy
The question is:
How many subsets should the whole review set be
divided into so that the combined results from all the
subsets provide (almost) the same result as the
whole set?
It is not easy to find a general solution because the
result depends on particular data.
The research used the data described above because
it corresponded to many similar situations: a lot of
short reviews concerning just one topic.
14. IMMM-2012, October 21-26, 2012, Venice, Italy
The original set (2,000,000 reviews) was too big to be
processed as a whole.
For a given PC model (8 GB RAM, 4-core processor,
64-bit machine), the experiments worked with 200,000
randomly selected reviews as the whole set (with more, the
computation took over 24 hours or crashed with an
insufficient-memory error).
Then, the task was to find an optimal division of the
200,000 reviews into smaller subsets.
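A sketch of this sampling-and-division step in Python (the 200,000 figure is from the slide; the number of subsets is exactly the quantity the experiments vary, so it is only a parameter here):

    import random

    def sample_and_split(all_reviews, sample_size=200_000, n_subsets=10, seed=0):
        # Randomly select the working set, then cut it into equally sized subsets.
        rng = random.Random(seed)
        sample = rng.sample(all_reviews, sample_size)
        step = sample_size // n_subsets
        return [sample[i:i + step] for i in range(0, sample_size, step)]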
15. IMMM-2012, October 21-26, 2012, Venice, Italy
The results of the experiments are demonstrated in the
following graphs for different sizes of the whole set
and its subsets.
On the horizontal x axis, there are the most significant
words generated by the trees.
The vertical y axis shows the correspondence between
the percentage weight of a significant word in the whole set
and its average percentage weight in the relevant subsets.
The whole set contains all the significant words:
the y value is always 1.0 (that is, 100%).
16. IMMM-2012, October 21-26, 2012, Venice, Italy
The agreement between the results provided by the whole set
and by its subsets is given by the match between the percentage
weight of a word wj in the tree generated for the whole
set and the average percentage weight of wj in the
trees generated for all corresponding subsets.
If the percentage weight of a word wj in the whole set were
(on average) the same as for all subsets, the
agreement would be perfect; otherwise it is imperfect, and the
imperfection is given by the difference.
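In formula form (a reconstruction from the verbal description above, not an equation taken from the slides), with p(wj) the percentage weight of word wj in the whole-set tree and pi(wj) its weight in the tree of the i-th of k subsets, the plotted agreement for wj can be written as

\[
y_j \;=\; \frac{\tfrac{1}{k}\sum_{i=1}^{k} p_i(w_j)}{p(w_j)},
\]

so that yj = 1 (100%) means perfect agreement for word wj, which is why the whole set itself is always plotted at 1.0.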
20. IMMM-2012, October 21-26, 2012, Venice, Italy
Conclusions:
Probably, it is no surprise that the subsets should be
as large as possible to obtain reliable knowledge.
However, the question is: how large do the inevitably
“small” subsets actually have to be?
For given data, this can be found experimentally, and
the result is then applicable to the same type of data in
the future, as the experiments demonstrated.
This research report suggests a method for how to proceed.