Jan Žižka, Department of Informatics, PEF (Faculty of Business and Economics), Mendel University in Brno




Natural Language Processing
   by Advanced Artificial
    Intelligence Methods

                  Jan Žižka
         Department of Informatics
    Faculty of Business and Economics
  Mendel University in Brno, Czech Republic

  zizka.jan@gmail.com, zizka@mendelu.cz


                      (Text Mining)


●   Data, information, knowledge
●   Electronic text data
●   Inductive machine learning (ML)
●   Pre-processing of data and its representation
●   Methods of searching, similarity, pattern recognition
●   Algorithms (just some examples)
●   Application areas






●   Data, information, knowledge
    - data here means all (text) values obtained in any way
      (relevant or irrelevant, with or without noise, exact,
      inexact, approximate, and so on)

    - information is the part of the data that is relevant to
      the specific problem being solved

    - knowledge is generalized information

    - metaknowledge is “knowledge about knowledge” (for
      example, knowing which knowledge is applicable to
      a specific problem)




●   Electronic text data
    Text in electronic form (ASCII/ANSI, Unicode, etc.).
    Typical text data can be found, e.g., on the Internet.
    Electronic text is used in many areas.


    Electronic text data are written in any common natural
    language (not only in the prevailing English).
    Processing such “human-like” data by machines is
    extraordinarily complicated and often depends on
    the specific language.




●   Inductive machine learning
    - learning from a limited set of examples;

    - the examples generally cover only a part of reality;

    - values that would sufficiently describe the data (for
      example, their distribution) are missing;

    - a mathematical model for reliable prediction or
      classification cannot be built;

    - knowledge is obtained by generalizing information.






●   Inductive machine learning

    What color is a crow?

    Black? And why?

    Has any of you ever seen
    a crow that was not black?

    Has anyone seen absolutely all the crows that have ever
    existed anywhere on Earth? (No, surely nobody has.)

    To what degree is the generalization “a crow is black”
    correct and acceptable? Can you say?



●   Inductive machine learning
    How many specific crows do we need
    to see before generalizing “a crow is black”?






●   Inductive machine learning
    The hooded crow:
    [Figure: a hooded crow]






●   Inductive machine learning
    Generalizing from the specific examples available is one
    of the possible learning methods.

    Unlike human beings, machines (computers) usually need
    a much larger number of specific examples in order to
    generalize, and therefore to gain knowledge.

    A method for determining the degree of similarity plays
    a big role here, for example when assigning an unknown
    example to a certain group of known samples.






●   Inductive machine learning
    Machine learning algorithms set their relevant
    parameters automatically during the training phase. The
    quality of the training is verified during testing. If the
    test results are acceptable, the trained algorithm can
    be used for the given application.

    The training phase requires suitable learning examples
    because the algorithm’s properties (parameters) are
    ultimately determined by the training data applied.

    The testing phase uses examples that were not used
    by the algorithm during its training phase.
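
    The separation of training and testing examples can be
    sketched as follows. This is a minimal illustration, not
    a procedure taken from the slides: the function names, the
    25 % test fraction, and the toy labeled examples are all
    assumptions made for the sketch.

```python
import random


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and split them into disjoint training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # The first n_test shuffled items are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(classify, test_set):
    """Fraction of held-out examples whose predicted label equals the true label."""
    correct = sum(1 for text, label in test_set if classify(text) == label)
    return correct / len(test_set)


# Hypothetical toy data: (text, label) pairs.
examples = [(f"document {i}", "+" if i % 2 else "-") for i in range(8)]
train, test = train_test_split(examples)
```

    Any trained classifier can then be passed to `accuracy`
    together with `test`; the result is meaningful only because
    no test example was seen during training.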




●   Pre-processing of data, their representation
    The typical way to get knowledge from electronic
    unstructured texts consists of the following steps:


    - source → a necessary volume of (generally noisy) data
    - removing noise → clean data
    - interesting part of data from the application viewpoint →
      information
    - information generalization → knowledge





●   Pre-processing of data, their representation
    Representing text documents: bag of words (BOW).

    Machine learning methods mostly treat text documents
    as files containing symbolic values (terms, words) without
    analyzing their meaning (at most, only shallowly) or their
    mutual dependence.

    Therefore, the word order in a document is treated as
    “meaningless”. Naturally, this discards some of the
    information content. However, it significantly simplifies
    the processing of natural languages, for example from
    the classification point of view.
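
    A bag of words can be sketched in a few lines. The function
    name and the whitespace tokenization below are assumptions
    chosen for brevity; the point is only that two texts with
    the same words in a different order receive the same
    representation.

```python
from collections import Counter


def bag_of_words(text):
    """Map a document to term counts; word order is deliberately discarded."""
    return Counter(text.lower().split())


a = bag_of_words("machine learning")
b = bag_of_words("learning machine")
print(a == b)  # True: both phrases get the same representation
```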




●   Pre-processing of data, their representation
    Pre-processing significantly affects the quality of results:

    - excluding common words that have no specific
      meaning from the application viewpoint (prepositions,
      abbreviations, definite/indefinite articles, etc.);
    - excluding words with a very low or very high frequency
      across all processed documents;
    - excluding punctuation, spaces, and so on;
    - converting alphabetic characters to lower case;
    - eliminating insignificant characters and words reduces
      the problem dimensionality (e.g., from 10^4 to 10^3)
      because each unique word is one dimension.
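
    The pre-processing steps listed above can be sketched as
    follows. Everything named here is hypothetical: the tiny
    stop-word list, the `min_df` and `max_df_ratio` thresholds,
    and the choice to measure frequency as the number of
    documents containing a term.

```python
import re
from collections import Counter

# Illustrative stop-word list only; real lists are much longer and language-specific.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "is"}


def tokenize(text):
    """Lower-case the text and keep runs of word characters other than
    digits and underscore (essentially letters); punctuation and spaces drop out."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_dictionary(documents, min_df=1, max_df_ratio=1.0):
    """Return the sorted terms left after stop-word and document-frequency filtering."""
    df = Counter()
    for doc in documents:
        # Count each term once per document (document frequency).
        df.update(set(tokenize(doc)) - STOP_WORDS)
    n = len(documents)
    return sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio)
```

    Each term that survives the filtering is one dimension of
    the document vectors, so stricter thresholds directly
    shrink the dimensionality.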




●   Pre-processing of data, their representation
    An example of text representation in which we ignore
    punctuation, the layout of the text (new lines, paragraphs,
    chapters, etc.), upper and lower case, the mix of two
    languages (English terms in a Czech sentence), and word
    order, even though word order can be very significant (for
    example, machine learning versus learning machine). We also
    exclude general words (“stop words”). The result is
    a dictionary (a list of symbols) used to train a chosen
    algorithm:






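●   Pre-processing of data, their representation
    Source text (Czech, with embedded English terms):

    Příklad representace textu, kde se ignoruje interpunkce, členění
    textu do řádků, velká a malá písmena, dvojjazyčnost (anglické
    termíny v české větě), pořadí slov, které může mít velký
    význam (např. machine learning – strojové učení a learning
    machine – učící stroj má zcela odlišný význam), a vynechají se
    obecná slova.

    English translation: An example of text representation in
    which punctuation, the division of text into lines, upper and
    lower case, bilingualism (English terms in a Czech sentence),
    and word order are ignored, although word order can be very
    significant (e.g., machine learning and learning machine have
    completely different meanings), and general words are left out.

    Resulting dictionary (sorted unique terms):

    anglické české členění dvojjazyčnost ignoruje interpunkce
    learning machine má malá metody mít může obecná odlišný
    písmena pomocí pořadí příklad representace řádků slov stroj
    strojové termíny textu učení učící velká velký větě vynechají
    význam words zcela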





●   Pre-processing of data, their representation
    A further dimensionality reduction can be obtained, for
    example, by reducing words to their stems. In the previous
    example, stemming merges forms that differ only in
    infinitive, grammatical case, number, voice, and so on, so
    8 dictionary entries shrink to 4:

             mít má stroj strojové učení učící velká velký
                       → mít stroj učit velký

    Stemming, of course, depends on the language. For English,
    there is the simple Porter stemming algorithm, in which the
    machine essentially strips word endings by rule. It is far
    from perfect, but in practice it is very effective.
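
    The idea of cutting off word endings can be illustrated
    with a deliberately crude sketch. This is NOT the Porter
    algorithm: the suffix list and the minimum stem length are
    invented for the example, and Porter's rules are far more
    elaborate.

```python
# Hypothetical suffix list, far smaller than Porter's rule set.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["learning", "learned", "learns", "learn", "machines"]
stems = sorted({crude_stem(w) for w in words})
print(stems)  # ['learn', 'machin']
```

    Five dictionary entries collapse into two. The sketch also
    shows why such stemming is imperfect: “machines” becomes
    “machin”, while “machine” is left unchanged, so the two
    forms are not merged.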




●   Pre-processing of data, their representation
    Word incidence can be represented in several ways:

    - binary: 1/0 means a word is/isn’t in a document (the word
      weight is 1 or 0);

    - frequency: the word weight is given by its frequency in
      the document;

    - tf-idf (term frequency-inverse document frequency):
      a word’s frequency in a document (how strongly the word
      represents the document) is weighed against the number of
      documents containing that word (the more documents contain
      the word, the lower its discrimination value).
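
    The three weightings listed above can be computed side by
    side. The sketch assumes one common tf-idf variant, raw
    term count multiplied by log(N / df); the slides do not
    give a formula, and other variants exist. The function name
    and the toy documents are hypothetical.

```python
import math
from collections import Counter


def weight_vectors(documents, dictionary):
    """Return binary, frequency and tf-idf vectors for each tokenized document."""
    n = len(documents)
    counts = [Counter(doc) for doc in documents]
    # df[t]: number of documents that contain term t.
    df = {t: sum(1 for c in counts if c[t] > 0) for t in dictionary}
    # Assumed idf variant: log(N / df); terms that occur nowhere get weight 0.
    idf = {t: math.log(n / df[t]) if df[t] else 0.0 for t in dictionary}
    binary = [[1 if c[t] else 0 for t in dictionary] for c in counts]
    frequency = [[c[t] for t in dictionary] for c in counts]
    tfidf = [[c[t] * idf[t] for t in dictionary] for c in counts]
    return binary, frequency, tfidf


docs = [["cold", "cold", "wind"], ["cold", "sun"]]
dictionary = ["cold", "sun", "wind"]
binary, frequency, tfidf = weight_vectors(docs, dictionary)
```

    In this toy case “cold” occurs in every document, so its
    tf-idf weight is zero in both vectors even though it is the
    most frequent word: it does not help to tell the documents
    apart.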




●   Methods of searching, similarity
    The general task is to find the similarity between an
    unlabeled document and a labeled one. This can be used, for
    example, for classification: interesting/uninteresting, and
    so on.

    Unsupervised learning (clustering): learning without a teacher.

    Supervised learning: learning with a teacher.

    Semi-supervised learning: a small number of labeled samples
    can significantly improve clustering.
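
    Comparing an unlabeled document with labeled ones can be
    sketched with cosine similarity over word counts. The
    slides do not name a particular similarity measure, so
    cosine similarity is an assumption here, as are the
    function names and the toy documents.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_label(unlabeled, labeled):
    """Return the label of the labeled document most similar to the unlabeled one."""
    query = Counter(unlabeled.split())
    best = max(labeled, key=lambda pair: cosine_similarity(query, Counter(pair[0].split())))
    return best[1]


# Hypothetical labeled documents.
labeled = [
    ("cheap pills buy now", "uninteresting"),
    ("meeting agenda for monday", "interesting"),
]
label = nearest_label("agenda for the monday meeting", labeled)
print(label)  # interesting
```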






●   Methods of searching, similarity
    Supervised learning:

    - k-NN (k-nearest neighbors);
    - generation of decision trees;
    - disjunctive normal form (generating rules);
    - support vector machines;
    - naïve Bayes classifier (using conditional probability);
    - etc. (there are really many possibilities).




Training texts:

    w1        w2        w3        cj (class)
    je        pěkné     počasí    +
    je        chladno             -
    není      velmi     chladno   +
    není      pěkné               -
    velmi     chladno             -
    chladno                       -
    ...       ...       ...       ...

    (Czech words: je = is, není = is not, pěkné = nice,
    počasí = weather, chladno = cold, velmi = very)

    + texts: 6 words in total
    - texts: 7 words in total
    number of unique words: 6
Document to classify: “to není pěkné chladno”: + or - ?
(The word “to” is not in the dictionary and is ignored.)
After creating the dictionary from the unique words (here 6),
we compute the prior probabilities of the classes (2 + texts
and 4 – texts out of 6 texts) and the conditional probabilities
of the words given + and –. Multiplying them and normalizing
gives the result:
                         w1        w2     w3     w4      w5       w6
 sorted dictionary:      chladno   je     není   pěkné   počasí   velmi
 frequency of wi in +    1         1      1      1       1        1
 frequency of wi in -    3         1      1      1       0        1
 p(wi | +)               1/6       1/6    1/6    1/6     1/6      1/6
 p(wi | -)               3/7       1/7    1/7    1/7     0/7      1/7

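“w3 w4 w1” = “není pěkné chladno”

\[
p_{\mathrm{NBK}} = p(\text{'není'}, \text{'pěkné'}, \text{'chladno'} \mid \pm)
  = p(\text{'není'} \mid \pm) \times p(\text{'pěkné'} \mid \pm) \times p(\text{'chladno'} \mid \pm)
\]

\[
P^{+} = p(+)\, p(w_3 = \text{'není'} \mid +)\, p(w_4 = \text{'pěkné'} \mid +)\, p(w_1 = \text{'chladno'} \mid +)
      = \frac{2}{6} \times \frac{1}{6} \times \frac{1}{6} \times \frac{1}{6} \approx 0.00154
\]

\[
P^{-} = p(-)\, p(w_3 = \text{'není'} \mid -)\, p(w_4 = \text{'pěkné'} \mid -)\, p(w_1 = \text{'chladno'} \mid -)
      = \frac{4}{6} \times \frac{1}{7} \times \frac{1}{7} \times \frac{3}{7} \approx 0.00583
\]

\[
P_n^{+} = \frac{0.00154}{0.00154 + 0.00583} \approx 0.21,
\qquad
P_n^{-} = \frac{0.00583}{0.00154 + 0.00583} \approx 0.79
\]

\[
P_n^{-} > P_n^{+} \;\Rightarrow\; \text{negative}
\]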
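
The worked example can be reproduced with a short sketch. It
follows the estimates used on the slides (relative frequencies
without smoothing, words outside the dictionary skipped); the
function names and the use of exact fractions are choices made
for this illustration only.

```python
from collections import Counter
from fractions import Fraction

# The six training texts and their classes from the example.
training = [
    ("je pěkné počasí", "+"),
    ("je chladno", "-"),
    ("není velmi chladno", "+"),
    ("není pěkné", "-"),
    ("velmi chladno", "-"),
    ("chladno", "-"),
]


def train(examples):
    """Collect class counts and per-class word counts from labeled texts."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    return class_counts, word_counts


def classify(text, class_counts, word_counts):
    """Return normalized class scores; words outside the dictionary are skipped."""
    vocabulary = set().union(*word_counts.values())
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = Fraction(n_docs, total_docs)  # prior p(class)
        for word in text.split():
            if word in vocabulary:
                # Unsmoothed estimate p(word | class), as in the table above.
                score *= Fraction(word_counts[label][word], total_words)
        scores[label] = score
    norm = sum(scores.values())
    return {label: float(score / norm) for label, score in scores.items()}


class_counts, word_counts = train(training)
result = classify("to není pěkné chladno", class_counts, word_counts)
print({label: round(p, 2) for label, p in result.items()})  # {'+': 0.21, '-': 0.79}
```

Because no smoothing is applied, a word that never occurs in
a class (such as “počasí” in the – texts, with probability 0/7)
would drive that class's score to zero.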


●   Application areas
    Many applications exist in the various areas where massive
    electronic text data are available. Typical examples are
    browsing the Internet or filtering email spam. Contemporary
    application areas include, for example:

      - grouping similar blog posts;
      - determining subjectivity in text;
      - opinions/feelings/moods/attitudes/meanings in text;
      - detecting plagiarism in text;
      - analyzing opinions;
      - business intelligence (legal commercial “espionage”);
    and so on.

                                    END
