Text mining

Jan Žižka, Department of Informatics, Faculty of Business and Economics (PEF), Mendel University in Brno

Natural Language Processing
by Advanced Artificial
Intelligence Methods

Jan Žižka
Department of Informatics
Faculty of Business and Economics
Mendel University in Brno, Czech Republic

zizka.jan@gmail.com, zizka@mendelu.cz

(Text Mining)


● Data, information, knowledge
● Electronic text data
● Inductive machine learning (ML)
● Pre-processing of data and its representation
● Methods of searching, similarity, pattern recognition
● Algorithms (just some examples)
● Application areas



● Data, information, knowledge
- data here means all (text) values obtained in some way
(relevant or irrelevant, with or without noise, exact or
inexact, approximate, and so on)

- information is the part of the data that is interesting from the
viewpoint of the specific problem being solved

- knowledge is generalized information

- metaknowledge is “knowledge about knowledge” (for
example, knowing which knowledge is applicable to
a specific problem)



● Electronic text data
Text in electronic form (ASCII/ANSI, Unicode, etc.).
Typical text data can be found, e.g., on the Internet.
Electronic text is used in many areas.

Electronic text data are written in any common natural
language (not only in the prevailing English).
Processing such “human-like” data by machines is
extraordinarily complicated and often depends on
the specific language.



● Inductive machine learning
- learning from a limited set
of examples;

- the examples generally cover
only a part of reality;

- sufficient characteristics of the
data are missing (for example,
their distribution);

- a mathematical model for reliable
prediction or classification
cannot be created;

- knowledge is obtained by
generalizing information.



● What is the color of a crow?

Black? And why?

Has any of you seen
a crow that was not black?

Has anyone seen absolutely all the crows that have ever existed
anywhere on Earth? (Surely not.)

To what degree is the generalization “a crow is black”
correct and acceptable? Can you say?



● How many specific crows do we need
to see to generalize “a crow is black”?



● The hooded crow:
[image: a hooded crow, which is grey and black rather than entirely black]



● Generalizing from the specific examples available is one
possible learning method.

Unlike human beings, machines (computers) usually need
a significantly larger number of specific examples to
generalize and thereby obtain knowledge.

A method for determining the degree of similarity plays
a big role, for example, when assigning an unknown
example to a certain group of known samples.



● Machine learning algorithms set their relevant
parameters automatically during the training phase. The
quality of the training is verified during testing. If the
testing results are acceptable, the trained algorithm can
be used for the given application.

The training phase requires suitable learning examples
because an algorithm’s properties (parameters) are
ultimately determined by the training data applied.

The testing phase uses examples that were not
used by the algorithm during its training phase.
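
The separation of training and testing examples can be sketched as follows. This is only a minimal sketch, assuming a simple random hold-out split; the function name `split_examples`, the 30% test share, and the toy data are illustrative assumptions, not part of the original slides.

```python
import random

def split_examples(examples, test_share=0.3, seed=0):
    """Randomly split labeled examples into disjoint training and testing sets."""
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_share))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled examples: (text, label)
examples = [(f"document {i}", "+" if i % 2 else "-") for i in range(10)]
train, test = split_examples(examples)
```

No example ends up in both sets, so the testing phase measures how well the trained parameters generalize to unseen data.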



● Pre-processing of data and its representation
The typical way to obtain knowledge from unstructured
electronic texts consists of the following steps:

- source → a necessary volume of (generally noisy) data
- removing noise → clean data
- the part of the data that is interesting from the application viewpoint →
information
- generalizing the information → knowledge



●
Representing text documents: bag of words (BOW).

Methods of machine learning mostly see text documents
as files containing symbolic values (terms, words) without
analyzing their meaning (at most, only shallowly) or mutual
dependence.

Therefore, the word order in a document is considered as
being “meaningless” – naturally, it eliminates a certain
information contents. However, it significantly simplifies
processing of natural languages from, for example, the
classification point of view.
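
The loss of word order can be illustrated with a minimal sketch. It assumes whitespace tokenization and lower-casing; the function name `bag_of_words` is an illustrative choice.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a text; the positions of words are discarded."""
    return Counter(text.lower().split())

a = bag_of_words("machine learning")
b = bag_of_words("learning machine")
print(a == b)  # True: the two phrases become indistinguishable
```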



● Pre-processing significantly affects the quality of the results:

- excluding common words that have no specific
meaning from the application viewpoint (prepositions,
abbreviations, definite/indefinite articles, etc.);
- excluding words with a very low or very high frequency across all
processed documents;
- excluding punctuation, spaces, and so on;
- converting alphabetic characters to lower case;
- eliminating insignificant characters and words, which reduces
the dimensionality of the problem (e.g., from 10^4 to 10^3)
because each unique word is one dimension.
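
The pre-processing steps listed above might be combined roughly as follows. This is a sketch under stated assumptions: the stop-word list is hypothetical, and "very low or very high frequency" is approximated by the number of documents containing a word (the thresholds `min_docs` and `max_doc_share` are illustrative parameters).

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one is language- and application-specific.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "and"}

def tokenize(text):
    """Lower-case the text and keep only runs of letters (drops punctuation, digits, spaces)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def build_dictionary(documents, min_docs=1, max_doc_share=1.0):
    """Return the sorted dictionary of words kept after filtering."""
    doc_freq = Counter()
    for doc in documents:
        # count each word once per document, ignoring stop words
        doc_freq.update(set(tokenize(doc)) - STOP_WORDS)
    limit = max_doc_share * len(documents)
    return sorted(w for w, df in doc_freq.items() if min_docs <= df <= limit)

docs = ["The weather is nice.", "Nice weather, cold wind!", "The wind is cold in Brno."]
full = build_dictionary(docs)
reduced = build_dictionary(docs, min_docs=2)  # drop words seen in only one document
```

Each entry of the resulting dictionary is one dimension, so `reduced` describes the documents in a lower-dimensional space than `full`.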



● An example of text representation in which we ignore
punctuation, text layout (new lines, paragraphs,
chapters, etc.), upper and lower case, the mixing of two languages
(English terms in a Czech sentence), and word order, even though
word order can be very significant (for example, “machine learning”
versus “learning machine”), and in which we exclude general words
(“stop words”). We get a dictionary (a list of symbols) that is used
to train a chosen algorithm:



● Input text (Czech, with embedded English terms):

Příklad representace textu, kde se ignoruje interpunkce, členění
textu do řádků, velká a malá písmena, dvojjazyčnost (anglické
termíny v české větě), pořadí slov, které může mít velký
význam (např. machine learning – strojové učení a learning
machine – učící stroj má zcela odlišný význam), a vynechají se
obecná slova.

Resulting dictionary (sorted unique words, general words excluded):

anglické české členění dvojjazyčnost ignoruje interpunkce
learning machine má malá metody mít může obecná odlišný
písmena pomocí pořadí příklad representace řádků slov stroj
strojové termíny textu učení učící velká velký větě vynechají
význam words zcela



● Further dimensionality reduction can be obtained, for
example, by converting words to their stems. In the previous
example, we could reduce the generated dictionary (infinitive,
grammatical case, singular, voice, and so on), so that the
dimensionality decreases from 8 to 4:

before: mít má stroj strojové učení učící velká velký
after:  mít stroj učit velký

(mít/má = to have/has; stroj/strojové = machine; učení/učící = learning;
velká/velký = big)

Stemming, of course, depends on the language. For English,
there is a simple procedure, the Porter stemmer, in which the
machine simply cuts off word endings. This is far from
perfect, but in practice it is very effective.
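
The idea of cutting off word endings can be sketched as below. Note that this is not the Porter algorithm, which applies a much richer set of ordered rules; the suffix list and the minimum stem length are illustrative assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative English endings, checked in this order

def naive_stem(word, min_stem=3):
    """Cut off the first matching ending, provided enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["learn", "learning", "learned", "learns", "word", "words"]
stems = sorted({naive_stem(w) for w in words})
print(stems)  # six dictionary entries collapse to two stems
```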



● Word occurrence can be represented in several ways:

- binary: 1/0 means the word is/isn’t in the document (the word
weight is 1 or 0);

- frequency: the word weight is given by its frequency in
the document;

- tf-idf (term frequency-inverse document frequency):
the frequency of a word in a document (how strongly the word
represents the document) related to the number of documents containing
that word (the more documents contain the word, the lower
the word’s discrimination value).
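
The three weighting schemes can be compared in a short sketch. The slides do not give an exact tf-idf formula, so the code assumes one common variant, tf × log(N / df), where N is the number of documents and df the number of documents containing the word; the function name `term_weights` and the toy documents are illustrative.

```python
import math
from collections import Counter

def term_weights(documents):
    """Return (dictionary, binary, frequency, tfidf) vectors for tokenized documents."""
    dictionary = sorted({w for doc in documents for w in doc})
    n_docs = len(documents)
    # document frequency: in how many documents each word occurs
    doc_freq = Counter(w for doc in documents for w in set(doc))
    binary, frequency, tfidf = [], [], []
    for doc in documents:
        counts = Counter(doc)
        binary.append([1 if counts[w] else 0 for w in dictionary])
        frequency.append([counts[w] for w in dictionary])
        tfidf.append([counts[w] * math.log(n_docs / doc_freq[w]) for w in dictionary])
    return dictionary, binary, frequency, tfidf

docs = [["cold", "cold", "wind"], ["nice", "wind"], ["nice", "weather", "wind"]]
dictionary, binary, frequency, tfidf = term_weights(docs)
```

In this toy data the word "wind" occurs in every document, so its tf-idf weight is zero everywhere: it cannot help to tell the documents apart.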



● Methods of searching, similarity
The general task is to find the similarity between an unlabeled
document and a labeled one. This can be used, for example, for
classification: interesting/uninteresting, and so on.

Unsupervised learning (clustering): learning without a teacher.

Supervised learning: learning with a teacher.

Semi-supervised learning: a small number of labeled samples
significantly improves clustering.
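
One possible way to measure the similarity between an unlabeled and a labeled document is sketched below. The slides do not name a specific measure; cosine similarity over word-frequency vectors is assumed here, and the example texts and labels are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-frequency vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical labeled documents and one unlabeled document
labeled = {"cheap pills buy now": "uninteresting", "meeting agenda for monday": "interesting"}
unlabeled = "agenda for the monday meeting"
best = max(labeled, key=lambda text: cosine_similarity(unlabeled, text))
print(labeled[best])  # interesting
```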



● Methods of searching, similarity
Supervised learning:

- k-NN (k-nearest neighbors);
- decision tree induction;
- disjunctive normal form (rule generation);
- support vector machines;
- naïve Bayes classifier (using conditional probability);
- etc. (there are many other possibilities).
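
As an illustration of the first method in the list, a k-NN classifier for short texts might look as follows. This is a sketch only: the Jaccard word-overlap similarity, the value k = 3, and the English training texts are assumptions made for the example.

```python
from collections import Counter

def jaccard(a, b):
    """Share of common words among all words of two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def knn_classify(text, training, k=3):
    """Label a text by majority vote among its k most similar training texts."""
    words = set(text.lower().split())
    ranked = sorted(training, key=lambda ex: jaccard(words, set(ex[0].lower().split())), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled training texts
training = [
    ("nice sunny weather", "+"),
    ("sunny and warm", "+"),
    ("nice warm day", "+"),
    ("cold rainy weather", "-"),
    ("cold and windy", "-"),
    ("rainy windy day", "-"),
]
print(knn_classify("warm sunny day", training))  # +
```

k-NN has no separate training step: the labeled examples themselves serve as the model, and all the work happens when an unknown text is compared with them.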


Training texts:

w1       w2       w3       cj
je       pěkné    počasí   +
je       chladno           -
není     velmi    chladno  +
není     pěkné             -
velmi    chladno           -
chladno                    -
...      ...      ...      ...

(je = is, není = is not, pěkné = nice, počasí = weather, chladno = cold, velmi = very)

+ texts: 6 words in total
- texts: 7 words in total
number of unique words: 6

The document to classify: “to není pěkné chladno”: + or - ?
(The word “to” is not in the dictionary and is ignored.)

After creating the dictionary from the unique words (here 6),
computing the prior probabilities (2 “+” texts and 4 “-” texts out of 6
texts), computing the conditional probabilities of the words given + and -,
and normalizing, we can state the result:
                       w1       w2    w3    w4     w5      w6
sorted dictionary:     chladno  je    není  pěkné  počasí  velmi
frequency of wi in +   1        1     1     1      1       1
frequency of wi in -   3        1     1     1      0       1
p(wi | +)              1/6      1/6   1/6   1/6    1/6     1/6
p(wi | -)              3/7      1/7   1/7   1/7    0/7     1/7

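\[
p_{\mathrm{NBK}} = p(\text{'není'}, \text{'pěkné'}, \text{'chladno'} \mid \pm)
= p(\text{'není'} \mid \pm) \times p(\text{'pěkné'} \mid \pm) \times p(\text{'chladno'} \mid \pm)
\]

“w3 w4 w1” = “není pěkné chladno”

\[
P_{+} = p(+)\, p(w_3 = \text{'není'} \mid +)\, p(w_4 = \text{'pěkné'} \mid +)\, p(w_1 = \text{'chladno'} \mid +)
= \frac{2}{6} \times \frac{1}{6} \times \frac{1}{6} \times \frac{1}{6} \approx 0.00154
\]

\[
P_{-} = p(-)\, p(w_3 = \text{'není'} \mid -)\, p(w_4 = \text{'pěkné'} \mid -)\, p(w_1 = \text{'chladno'} \mid -)
= \frac{4}{6} \times \frac{1}{7} \times \frac{1}{7} \times \frac{3}{7} \approx 0.00583
\]

\[
P_n^{+} = \frac{0.00154}{0.00154 + 0.00583} \approx 0.21,
\qquad
P_n^{-} = \frac{0.00583}{0.00154 + 0.00583} \approx 0.79
\]

\[
P_n^{-} > P_n^{+} \;\Rightarrow\; \text{negative}
\]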
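
The calculation above can be reproduced with a short script. It is a minimal sketch that follows the slide's arithmetic literally: whitespace tokenization, no smoothing, and words outside the dictionary are skipped. The function names `train` and `classify` are illustrative.

```python
from collections import Counter

# Training texts and labels from the example above.
training = [
    ("je pěkné počasí", "+"),
    ("je chladno", "-"),
    ("není velmi chladno", "+"),
    ("není pěkné", "-"),
    ("velmi chladno", "-"),
    ("chladno", "-"),
]

def train(training):
    """Count texts per class and word occurrences per class."""
    class_counts, word_counts = Counter(), {}
    for text, label in training:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    return class_counts, word_counts

def classify(text, class_counts, word_counts):
    """Return normalized class scores; words outside the dictionary are ignored."""
    dictionary = set().union(*word_counts.values())
    words = [w for w in text.split() if w in dictionary]
    scores = {}
    for label, n_texts in class_counts.items():
        total = sum(word_counts[label].values())      # all words seen in this class
        score = n_texts / sum(class_counts.values())  # prior p(c)
        for w in words:
            score *= word_counts[label][w] / total    # p(w | c), no smoothing
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

class_counts, word_counts = train(training)
result = classify("to není pěkné chladno", class_counts, word_counts)
print({label: round(p, 2) for label, p in result.items()})  # {'+': 0.21, '-': 0.79}
```

Because no smoothing is applied, a word with zero frequency in a class (such as “počasí” in the - texts) would drive that class's whole product to zero.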


● Application areas
There are many applications in the various areas where massive
amounts of electronic text data exist. Typical examples are browsing
the Internet and filtering email spam. Contemporary
application areas include, for example:

- grouping similar blog posts;
- determining subjectivity in text;
- detecting opinions/feelings/moods/attitudes/meanings in text;
- detecting plagiarism in text;
- analyzing opinions;
- business intelligence (legal commercial “espionage”);
and so on.

END
