Text mining
Published in: Technology, Education

Transcript

  • 1. Natural Language Processing by Advanced Artificial Intelligence Methods (Text Mining)
      Jan Žižka, Department of Informatics (Ústav informatiky), Faculty of Business and Economics (PEF), Mendel University in Brno (Mendelova universita, Brno), Czech Republic
      zizka.jan@gmail.com, zizka@mendelu.cz
  • 2. ● Data, information, knowledge
      ● Electronic text data
      ● Inductive machine learning (ML)
      ● Pre-processing of data and its representation
      ● Methods of searching, similarity, pattern recognition
      ● Algorithms (just some examples)
      ● Application areas
  • 3. Data, information, knowledge
      - Data here means all (text) values obtained in some way (relevant or irrelevant, with or without noise, exact, inexact, approximate, and so on).
      - Information is the part of the data that is interesting from the viewpoint of the specific problem being solved.
      - Knowledge is generalized information.
      - Metaknowledge is “knowledge about knowledge” (for example, knowing which knowledge is applicable to a specific problem).
  • 4. Electronic text data
      Text in an electronic form (ASCII/ANSI, Unicode, etc.). Typical text data can be found, for example, on the Internet. Electronic text is used in many areas. Electronic text data are created in any common natural language (not only in the prevailing English). Processing such “human-like” data by machines is extraordinarily complicated and often depends on the specific language.
  • 5. Inductive machine learning
      - learning from a limited set of examples;
      - the examples generally cover only a part of reality;
      - values that would sufficiently describe the data (for example, their distribution) are missing;
      - a mathematical model for reliable prediction or classification cannot be created;
      - knowledge is obtained by generalizing information.
  • 6. Inductive machine learning
      What is the color of a crow? Black? And why? Has any of you seen a crow that was not black? Has anyone seen absolutely all the crows that have ever existed anywhere on Earth? (No, surely not.) To what degree is the generalization “a crow is black” correct and acceptable? Can you say?
  • 7. Inductive machine learning
      How many specific crows do we need to see in order to generalize that “a crow is black”?
  • 8. Inductive machine learning
      The hooded crow (pictured on the original slide) is grey and black, a counterexample to the generalization “a crow is black”.
  • 9. Inductive machine learning
      Generalizing from the specific examples available is one possible learning method. Unlike human beings, machines (computers) usually need a much larger number of specific examples to generalize and thus to obtain knowledge. A method for determining the degree of similarity plays a big role, for example when assigning an unknown example to a certain group of known samples.
  • 10. Inductive machine learning
      Machine learning algorithms set their relevant parameters automatically during the training phase. The quality of the training is verified during testing. If the test results are acceptable, the trained algorithm can be used for the given application. The training phase requires suitable learning examples because the algorithm’s properties (parameters) are ultimately determined by the training data used. The testing phase uses examples that the algorithm did not see during its training phase.
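The separation of training and testing examples described on this slide can be sketched as a simple hold-out split. This is a minimal illustration, not the author's procedure: the function names `train_test_split` and `accuracy`, the 25% test fraction, and the toy examples are all assumptions made for the sketch.

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and split them into disjoint training and test sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(classifier, test_set):
    """Fraction of test examples for which the classifier returns the known label."""
    correct = sum(1 for text, label in test_set if classifier(text) == label)
    return correct / len(test_set)

# Hypothetical labeled examples: (text, class) pairs.
examples = [(f"doc {i}", "+" if i % 2 else "-") for i in range(8)]
train, test = train_test_split(examples)
```

A classifier would be fitted on `train` only, and `accuracy` would then be evaluated on `test`, so that no test example has influenced the parameters.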
  • 11. Pre-processing of data, their representation
      The typical way to obtain knowledge from unstructured electronic texts consists of the following steps:
      - source → a necessary volume of (generally noisy) data
      - removing noise → clean data
      - the part of the data that is interesting from the application viewpoint → information
      - generalization of information → knowledge
  • 12. Pre-processing of data, their representation
      Representing text documents: bag of words (BOW). Machine learning methods mostly treat text documents as collections of symbolic values (terms, words) without analyzing their meaning (or only shallowly) or their mutual dependence. Word order in a document is therefore treated as “meaningless”, which naturally discards some of the information content. However, it significantly simplifies natural language processing for tasks such as classification.
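A minimal sketch of the bag-of-words idea, assuming whitespace tokenization and lower-casing (the helper name `bag_of_words` is hypothetical). It shows that two phrases differing only in word order receive the same representation.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts; the order of the words is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("machine learning")
b = bag_of_words("learning machine")
print(a == b)  # both phrases map to the same bag of words
```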
  • 13. Pre-processing of data, their representation
      Pre-processing significantly affects the quality of the results:
      - excluding common words that have no specific meaning from the application viewpoint (prepositions, abbreviations, definite/indefinite articles, etc.);
      - excluding words with a very low or very high frequency across all processed documents;
      - excluding punctuation, spaces, and so on;
      - converting alphabetic characters to lower case;
      - eliminating insignificant characters and words reduces the dimensionality of the problem (e.g., from $10^4$ to $10^3$) because each unique word is one dimension.
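The pre-processing steps listed on this slide could be sketched as follows. The stop-word list, the thresholds `min_df` and `max_df_ratio`, and the use of document frequency as the frequency measure are assumptions chosen for illustration; the slide does not prescribe specific values.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}

def tokenize(text):
    """Lower-case the text and keep only runs of letters (drops punctuation, digits, spaces)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def build_dictionary(documents, min_df=1, max_df_ratio=1.0):
    """Return the sorted unique words kept after stop-word and frequency filtering."""
    df = Counter()                      # number of documents containing each word
    for doc in documents:
        df.update(set(tokenize(doc)) - STOP_WORDS)
    n = len(documents)
    return sorted(w for w, c in df.items() if c >= min_df and c / n <= max_df_ratio)

docs = ["The weather is nice.", "Nice weather, cold wind!", "Cold AND rainy"]
dictionary = build_dictionary(docs, min_df=2)
```

Each word left in `dictionary` is one dimension of the document vectors, so stricter thresholds directly reduce the dimensionality.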
  • 14. Pre-processing of data, their representation
      An example of text representation in which we ignore punctuation, spatial zoning (new lines, paragraphs, chapters, etc.), upper- and lower-case letters, the mixing of two languages (English terms in a Czech sentence), and word order, and in which we exclude general words (“stop words”). Note that word order can be very significant (compare, for example, machine learning and learning machine). The result is a dictionary (a list of symbols) used for training a chosen algorithm:
  • 15. Pre-processing of data, their representation
      The Czech example text: „Příklad representace textu, kde se ignoruje interpunkce, členění textu do řádků, velká a malá písmena, dvojjazyčnost (anglické termíny v české větě), pořadí slov, které může mít velký význam (např. machine learning – strojové učení a learning machine – učící stroj má zcela odlišný význam), a vynechají se obecná slova.“
      In English: An example of text representation in which punctuation, the division of text into lines, upper- and lower-case letters, bilingualism (English terms in a Czech sentence), and word order are ignored, although word order can be very significant (e.g., machine learning, “strojové učení”, and learning machine, “učící stroj”, have completely different meanings), and general words are left out.
      The resulting dictionary: anglické, české, členění, dvojjazyčnost, ignoruje, interpunkce, learning, machine, má, malá, metody, mít, může, obecná, odlišný, písmena, pomocí, pořadí, příklad, representace, řádků, slov, stroj, strojové, termíny, textu, učení, učící, velká, velký, větě, vynechají, význam, words, zcela
  • 16. Pre-processing of data, their representation
      Further dimensionality reduction can be obtained, for example, by reducing words to their stems. In the previous example, we could reduce the generated dictionary (by unifying infinitive, grammatical case, singular, voice, and so on), so that the dimensionality decreases from 8 to 4:
      - mít, má → mít
      - stroj, strojové → stroj
      - učení, učící → učit
      - velká, velký → velký
      Stemming, of course, depends on the language. For English, there is a simplified system, Porter stemming, in which the machine simply cuts off word endings. This is far from perfect, but in practice it is very effective.
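The idea of cutting off word endings can be illustrated with a deliberately crude suffix stripper. This is not the Porter algorithm, which applies a much richer, multi-step rule set; the suffix list and the minimum stem length below are assumptions made only for this sketch.

```python
# Hypothetical, deliberately tiny suffix list (NOT the Porter rule set).
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix if a stem of reasonable length remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["learn", "learning", "learned", "learns", "machine", "machines"]
stems = sorted({crude_stem(w) for w in words})
print(len(words), "words ->", len(stems), "stems:", stems)
```

Six dictionary entries collapse into two dimensions here. The same rules would mishandle many other English words, which is the sense in which plain suffix cutting is far from perfect.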
  • 17. Pre-processing of data, their representation
      Word occurrence can be represented in several ways:
      - binary: 1/0 means that a word is/is not in a document (the word weight is 1 or 0);
      - frequency: the word weight is given by its frequency in a document;
      - tf-idf (term frequency-inverse document frequency): the frequency of a word in a document (how strongly the word represents the document) relative to the number of documents containing that word (the more documents contain the word, the lower its discrimination value).
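The three weighting schemes can be compared side by side in a short sketch. The slide does not give an exact tf-idf formula, so the code assumes one common variant, tf × log(N / df), where N is the number of documents and df the number of documents containing the word; the function name `weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def weights(documents):
    """Return the sorted vocabulary and, per document, binary, frequency and tf-idf vectors."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({w for doc in tokenized for w in doc})
    df = Counter(w for doc in tokenized for w in set(doc))   # document frequency
    n = len(tokenized)
    rows = []
    for doc in tokenized:
        tf = Counter(doc)                                    # term frequency
        rows.append({
            "binary": [1 if tf[w] else 0 for w in vocab],
            "frequency": [tf[w] for w in vocab],
            # Assumed variant: tf * log(N / df); other variants exist.
            "tfidf": [tf[w] * math.log(n / df[w]) for w in vocab],
        })
    return vocab, rows

vocab, rows = weights(["cold cold wind", "cold rain", "nice weather"])
```

In the first document, "cold" occurs twice but also appears in two of the three documents, so under this variant its tf-idf weight, 2 × log(3/2), ends up lower than that of "wind", log(3), which occurs once but only in this document.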
  • 18. Methods of searching, similarity
      The general task is to find the similarity between an unlabeled document and a labeled one. This can be used, for example, for classification (interesting/uninteresting, and so on).
      - Unsupervised learning (clustering): learning without a teacher.
      - Supervised learning: learning with a teacher.
      - Semi-supervised learning: a small number of labeled samples can significantly improve clustering.
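One way to measure the similarity between two documents is the cosine of the angle between their word-frequency vectors. The slides do not name a specific measure, so cosine similarity is an assumption here, as are the function name and the whitespace tokenization.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts represented as word-frequency vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                      # an empty text is similar to nothing
    return dot / (norm_a * norm_b)

same_words = cosine_similarity("machine learning", "learning machine")
no_overlap = cosine_similarity("cold wind", "nice weather")
```

Texts with the same words in any order get a similarity of 1 (up to rounding), and texts with no words in common get 0.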
  • 19. Methods of searching, similarity
      Supervised learning:
      - k-NN (k-nearest neighbors);
      - generation of decision trees;
      - disjunctive normal form (generating rules);
      - support vector machines;
      - naïve Bayes classifier (using conditional probability);
      - etc. (there are really many possibilities).
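As an illustration of the first method in the list, here is a minimal k-NN sketch. It assumes Jaccard overlap of word sets as the similarity measure and a simple majority vote; the labeled examples are invented for the illustration and do not come from the slides.

```python
from collections import Counter

def jaccard(text_a, text_b):
    """Share of common words among all words of the two texts."""
    sa, sb = set(text_a.split()), set(text_b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def knn_classify(text, labeled, k=3):
    """Assign the majority label of the k labeled texts most similar to `text`."""
    neighbors = sorted(labeled, key=lambda ex: jaccard(text, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training texts.
labeled = [
    ("nice weather today", "+"),
    ("nice warm weather", "+"),
    ("sunny and nice", "+"),
    ("cold wind today", "-"),
    ("very cold wind", "-"),
    ("cold and rainy", "-"),
]
print(knn_classify("nice sunny weather", labeled))
print(knn_classify("very cold rainy wind", labeled))
```

k-NN has no real training phase: the labeled examples themselves are the model, and all the work happens when an unknown document is compared with them.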
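  • 20. Training texts (words w1, w2, w3 and class cj):
        w1       w2       w3       cj
        je       pěkné    počasí   +
        je       chladno           -
        není     velmi    chladno  +
        není     pěkné             -
        velmi    chladno           -
        chladno                    -
      (Czech words: je = is, není = is not, pěkné = nice, počasí = weather, chladno = cold, velmi = very.)
      + texts: 6 words in total. - texts: 7 words in total. Number of unique words: 6.
      The document to be classified: “to není pěkné chladno” (roughly “it is not nice cold”): + or - ?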
  • 21. After creating the dictionary from the unique words (here 6), computing the a priori probabilities (2 texts + and 4 texts - out of 6 texts), computing the conditional probabilities of the words given + and -, and the subsequent normalization, we can determine the result:
                                w1       w2    w3    w4     w5      w6
        sorted dictionary:      chladno  je    není  pěkné  počasí  velmi
        frequency of wi in +    1        1     1     1      1       1
        frequency of wi in -    3        1     1     1      0       1
        p(wi | +)               1/6      1/6   1/6   1/6    1/6     1/6
        p(wi | -)               3/7      1/7   1/7   1/7    0/7     1/7
      $p_{NBK} = p(\text{není}, \text{pěkné}, \text{chladno} \mid +/-) = p(\text{není} \mid +/-) \times p(\text{pěkné} \mid +/-) \times p(\text{chladno} \mid +/-)$
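  • 22. “w3 w4 w1” = “není pěkné chladno”
      $P^{+} = p(+)\, p(w_3 = \text{není} \mid +)\, p(w_4 = \text{pěkné} \mid +)\, p(w_1 = \text{chladno} \mid +) = \frac{2}{6} \times \frac{1}{6} \times \frac{1}{6} \times \frac{1}{6} \approx 0.00154$
      $P^{-} = p(-)\, p(w_3 = \text{není} \mid -)\, p(w_4 = \text{pěkné} \mid -)\, p(w_1 = \text{chladno} \mid -) = \frac{4}{6} \times \frac{1}{7} \times \frac{1}{7} \times \frac{3}{7} \approx 0.00583$
      $P_n^{+} = \frac{0.00154}{0.00154 + 0.00583} \approx 0.21$
      $P_n^{-} = \frac{0.00583}{0.00154 + 0.00583} \approx 0.79$
      $P_n^{-} > P_n^{+} \Rightarrow$ negative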
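The worked naïve Bayes example from slides 20 to 22 can be reproduced in a few lines. This sketch follows the slides' arithmetic (relative frequencies without smoothing, and the word “to”, which is not in the dictionary, is skipped); the variable and function names are assumptions, and exact fractions are used only to make the intermediate values easy to check.

```python
from collections import Counter
from fractions import Fraction

# Training texts and classes from slide 20.
training = [
    ("je pěkné počasí", "+"),
    ("je chladno", "-"),
    ("není velmi chladno", "+"),
    ("není pěkné", "-"),
    ("velmi chladno", "-"),
    ("chladno", "-"),
]

word_counts = {"+": Counter(), "-": Counter()}
class_counts = Counter()
for text, label in training:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = set(word_counts["+"]) | set(word_counts["-"])

def score(text, label):
    """Unnormalized p(label) times the product of p(word | label); unknown words are skipped."""
    total_words = sum(word_counts[label].values())
    result = Fraction(class_counts[label], len(training))   # a priori probability
    for word in text.split():
        if word in vocabulary:
            result *= Fraction(word_counts[label][word], total_words)
    return result

doc = "to není pěkné chladno"
p_pos, p_neg = score(doc, "+"), score(doc, "-")
pn_pos = p_pos / (p_pos + p_neg)     # normalized scores sum to 1
pn_neg = p_neg / (p_pos + p_neg)
print(round(float(p_pos), 5), round(float(p_neg), 5))
print(round(float(pn_pos), 2), round(float(pn_neg), 2))
```

Because no smoothing is applied, a document containing “počasí” would get a score of exactly 0 for the negative class, since p(počasí | -) = 0/7; practical implementations usually add a smoothing term to avoid this.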
  • 23. Application areas
      Many applications exist in the various areas where massive amounts of electronic text data are available. Typical examples are browsing the Internet and filtering email spam. Contemporary application areas include, for example:
      - grouping similar blog posts;
      - determining subjectivity in text;
      - opinions/feelings/moods/attitudes/meanings in text;
      - revealing text plagiarism;
      - analyzing opinions;
      - business intelligence (legal commercial “espionage”);
      and so on.
      END
