Text mining by examples, By Hadi Mohammadzadeh


  1. Seminar on Text Mining by Examples
     By: Hadi Mohammadzadeh
     Institute of Applied Information Processing, University of Ulm, 27 Jan. 2010
     Hadi Mohammadzadeh Text Mining by Examples Pages 1
  2. Seminar on Text Mining by Examples: Outline
     – New Terminologies
     – WordNet: A Large Lexical Database of English
     – Reuters-21578 … as a Text Collection
     – CMU Text Learning Group Data Archives
     – Text Mine Software: Web-based algorithms
     – Text Mine Software: Command-based algorithms
     – Useful Web Sites
  3. Seminar on Text Mining by Examples, Part One: New Terminologies (Word and Meaning Relationships)
  4. Understanding Text: Hyponym and Hypernym
     • In linguistics, a hyponym is a word or phrase whose semantic range is included within that of another word, its hypernym. For example, scarlet and crimson are both hyponyms of red (their hypernym), which is, in turn, a hyponym of colour.
  5. Understanding Text: Meronym
     • Meronymy is a semantic relation used in linguistics. A meronym denotes a constituent part of, or a member of, something. That is,
       – X is a meronym of Y if Xs are parts of Y(s), or
       – X is a meronym of Y if Xs are members of Y(s).
     • For example, finger is a meronym of hand because a finger is part of a hand. Similarly, wheel is a meronym of automobile.
  6. Understanding Text: Holonym
     • Holonymy defines the relationship between a term denoting the whole and a term denoting a part of the whole. That is,
       – X is a holonym of Y if Ys are parts of Xs, or
       – X is a holonym of Y if Ys are members of Xs.
     • For example, tree is a holonym of bark, of trunk, and of limb.
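The three relations above can be sketched with two small lookup tables. This is a minimal illustration that uses only the word pairs from the slides; the table names and helper functions are hypothetical and are not part of WordNet or of the seminar software.

```python
# Toy "part of" relation built from the slide examples (not WordNet data).
PART_OF = {
    "finger": "hand",
    "wheel": "automobile",
    "bark": "tree",
    "trunk": "tree",
    "limb": "tree",
}

# Toy "kind of" relation (hyponym -> hypernym) built from the slide examples.
KIND_OF = {"scarlet": "red", "crimson": "red", "red": "colour"}


def is_meronym(x, y):
    """True if x is recorded as a part of y."""
    return PART_OF.get(x) == y


def is_holonym(x, y):
    """Holonymy is the inverse of meronymy: x is the whole, y the part."""
    return is_meronym(y, x)


def hypernym_chain(word):
    """Follow hyponym -> hypernym links upward until no parent is recorded."""
    chain = []
    while word in KIND_OF:
        word = KIND_OF[word]
        chain.append(word)
    return chain


def is_hyponym(x, y):
    """True if y appears anywhere above x in the toy hierarchy."""
    return y in hypernym_chain(x)
```

Because hyponymy is followed transitively here, scarlet counts as a hyponym of colour as well as of red, matching the "in turn" wording of the hyponym slide.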
  7. Seminar on Text Mining by Examples, Part Two: WordNet, A Large Lexical Database of English
  8. WordNet
     • WordNet® is a large lexical database of English, developed under the direction of George A. Miller.
     • Development of WordNet began in 1985, and it is widely used in text-management tools.
     • WordNet is more than just a dictionary and thesaurus; it includes all kinds of relationships between words. WordNet version 2.0 contains roughly 150,000 content words.
  9. WordNet (cont.)
     • Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
     • WordNet is freely and publicly available for download.
     • WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
  10. Understanding Text, Polysemy: Number of Senses in WordNet
     • A word can have more than one meaning, and the intended meaning is not always obvious from the sentence.
     • In WordNet a word has an average of 1.4 senses.

     Average number of senses by word class
     Word class   Average number of senses
     Verb         2.1
     Adjective    1.45
     Adverb       1.25
     Noun         1.24
  11. Understanding Text, Polysemy: Number of Senses in WordNet

     Words with the highest number of senses in WordNet
     Word    Number of senses
     Break   74
     Cut     73
     Run     57
     Play    52
     Make    51
  12. Understanding Text, Polysemy: Number of POS in WordNet
     • Some words also have more than one part of speech (POS). For example, "still" has five different parts of speech.

     Word    Number of POS
     Out     5
     Round   5
     Still   5
     Down    5
     Over    4
  13. Word Classifications in WordNet
     • Words can be classified into word classes, or parts of speech (POS).
     • We refer to nouns, verbs, adjectives, and adverbs as content words.
     • Conjunctions, determiners, pronouns, and prepositions are called function words.

     Frequencies of word classes in WordNet
     Type          Number    Share
     Noun          114,400   75%
     Adjective     21,438    14%
     Verb          11,341    7.4%
     Adverb        4,662     3%
     Preposition   133       0.08%
     Pronoun       118       0.07%
     Conjunction   89        0.05%
     Determiner    14        0.009%
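The percentages in the table can be checked with a few lines of arithmetic. The sketch below assumes that each share is the class count divided by the sum of all eight counts (152,195); the slide does not state the denominator, so that is an inference from the numbers. The names `COUNTS` and `share` are illustrative.

```python
# Word-class counts copied from the slide's table.
COUNTS = {
    "Noun": 114_400,
    "Adjective": 21_438,
    "Verb": 11_341,
    "Adverb": 4_662,
    "Preposition": 133,
    "Pronoun": 118,
    "Conjunction": 89,
    "Determiner": 14,
}

# Assumption: percentages are relative to the sum of all listed classes.
TOTAL = sum(COUNTS.values())

CONTENT_CLASSES = ("Noun", "Adjective", "Verb", "Adverb")


def share(word_class):
    """Percentage of all listed words that belong to word_class."""
    return 100 * COUNTS[word_class] / TOTAL


# Content words versus function words, as defined on the slide.
content_total = sum(COUNTS[c] for c in CONTENT_CLASSES)
function_total = TOTAL - content_total

if __name__ == "__main__":
    for name in COUNTS:
        print(f"{name:<12}{COUNTS[name]:>8}  {share(name):.3f}%")
    print(f"content words: {content_total}, function words: {function_total}")
```

Under this assumption the recomputed shares agree with the slide to the precision it gives, and function words make up only 354 of the listed entries.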
  14. WordNet Website and Developed Program
     • WordNet Website
     • WordNet Developed Program
  15. Seminar on Text Mining by Examples, Part Three: Reuters-21578 as a Text Collection
  16. Reuters-21578: History
     • The documents in the Reuters-21578 collection appeared on the Reuters newswire in 1987.
     • Reuters-21578 is a test collection for evaluating automatic text categorization techniques. It is a classic benchmark for text categorization algorithms.
     • The Reuters-21578 collection is distributed in 22 files. Each of the first 21 files contains 1000 documents, while the last contains 578 documents.
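The file layout described above accounts for the number in the collection's name: 21 × 1000 + 578 = 21,578. A minimal sketch of checking one file's document count follows. It assumes that each document in a distribution file starts with an opening `<REUTERS ...>` tag; the slides do not describe the markup, so treat the tag name as an assumption and adjust it to the files you actually have. `count_documents` is a hypothetical helper.

```python
import re

# Layout from the slide: 21 files of 1000 documents plus one file of 578.
DOCS_PER_FILE = [1000] * 21 + [578]

# Assumption: every document begins with an opening <REUTERS ...> tag.
# Closing tags start with "</" and are therefore not matched.
DOC_START = re.compile(r"<REUTERS\b")


def count_documents(sgml_text):
    """Count documents in the text of one distribution file."""
    return len(DOC_START.findall(sgml_text))


if __name__ == "__main__":
    sample = '<REUTERS NEWID="1"><BODY>a</BODY></REUTERS>\n<REUTERS NEWID="2"></REUTERS>'
    print(count_documents(sample), sum(DOCS_PER_FILE))
```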
  17. Reuters-21578
     • Distribution 1.0, 26 September 1997, by David D. Lewis, AT&T Labs - Research.
     • The data was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system.
  18. Seminar on Text Mining by Examples, Part Four: CMU Text Learning Group Data Archives as a Text Collection
  19. CMU Text Learning Group Data Archives
     • This data set is a collection of 20,000 messages collected from 20 different netnews newsgroups. One thousand messages from each of the twenty newsgroups were chosen at random and partitioned by newsgroup name.
     • Link
     • Sample Message
     • Experiment Results
     • Prof. Cho, Sam Houston State University
  20. CMU Text Learning Group Data Archives: the 20 newsgroups
     1. alt.atheism
     2. talk.politics.guns
     3. talk.politics.mideast
     4. talk.politics.misc
     5. talk.religion.misc
     6. soc.religion.christian
     7. comp.sys.ibm.pc.hardware
     8. comp.graphics
     9. comp.os.ms-windows.misc
     10. comp.sys.mac.hardware
     11. comp.windows.x
     12. rec.autos
     13. rec.motorcycles
     14. rec.sport.baseball
     15. rec.sport.hockey
     16. sci.crypt
     17. sci.electronics
     18. sci.space
     19. sci.med
     20. misc.forsale
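The construction described for this data set (a fixed-size random sample from each newsgroup, partitioned by newsgroup name) can be sketched as follows. This is not the procedure the archive's authors ran; the function name, the `(newsgroup, text)` input shape, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_per_group(messages, per_group, seed=0):
    """Pick per_group messages at random from each newsgroup.

    messages: iterable of (newsgroup_name, message_text) pairs.
    Returns a dict mapping each newsgroup name to its sampled messages.
    Raises ValueError if a newsgroup has fewer than per_group messages.
    """
    by_group = defaultdict(list)
    for group, text in messages:
        by_group[group].append(text)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        group: rng.sample(texts, per_group)
        for group, texts in sorted(by_group.items())
    }
```

With 20 newsgroups and `per_group=1000`, the result would hold the 20,000 messages mentioned on the slide, already partitioned by newsgroup name.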
  21. Seminar on Text Mining by Examples, Part Five: Text Mine Software, Web-based algorithms
  22. Text Mine Application
     • The three scripts in the first row handle:
       1. The creation of text statistics
          • Number of word types
          • Letter frequencies
          • Word frequencies
       2. Entity extraction
       3. Finding the POS tags for words
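The three text statistics listed above can be computed in a few lines. This is a minimal sketch, not the Text Mine script itself; it assumes English text, treats a word as a run of the letters a to z, and ignores case. The function name and the keys of the returned dictionary are illustrative.

```python
import re
from collections import Counter


def text_statistics(text):
    """Return word-type count, word frequencies and letter frequencies."""
    lowered = text.lower()
    # Assumption: a word is a maximal run of ASCII letters.
    words = re.findall(r"[a-z]+", lowered)
    word_frequencies = Counter(words)
    letter_frequencies = Counter(ch for ch in lowered if "a" <= ch <= "z")
    return {
        "word_types": len(word_frequencies),  # number of distinct words
        "word_frequencies": word_frequencies,
        "letter_frequencies": letter_frequencies,
    }


if __name__ == "__main__":
    stats = text_statistics("The cat saw the other cat.")
    print(stats["word_types"])
    print(stats["word_frequencies"].most_common(2))
    print(stats["letter_frequencies"].most_common(3))
```

Note the difference between word tokens and word types: the sample sentence has six tokens but only four types, because "the" and "cat" each occur twice.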
  23. Text Mine Application
     • As input, use a text file such as the Help file, or type text into the text box.
  24. Seminar on Text Mining by Examples, Part Six: Text Mine Software, Command-based algorithms
  25. Zeroth Program: Tokens
     • Name of program: tokens.pl
     • Input: sample.
     • Output: running this program generates a text file named tokens.txt
     • Aim: generating tokens
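To make "generating tokens" concrete, here is a minimal tokenizer sketch. It is not a port of tokens.pl, whose rules are not shown in the slides; the pattern below is an assumption that keeps words with internal apostrophes or hyphens together and emits each punctuation mark as its own token.

```python
import re

# Assumed token rule: a word (optionally with internal ' or - parts),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens, in order."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    for token in tokenize("Don't panic, it's text-mining!"):
        print(token)
```

Writing one token per line, as the `__main__` block does, would produce a file in the spirit of the tokens.txt output named on the slide, although the real file's layout may differ.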
  26. First Program: Part-of-Speech Tagger
     • Name of program: pos-test.pl
     • Input: inside the Perl file.
     • Output: running this program generates a text file named pos_test_results.txt
     • Aim: part-of-speech tagging
  27. Second Program: Entity Extraction
     • To generate named entities with associated types, we need dictionaries for categories such as
       – person, place, organization, number, currency, dimension, time, technical term, or miscellaneous.
       – For example, co_abbrev.dat contains a list of about 900 abbreviations, and the co_places table is a list of about 3000 of the world's larger cities.
  28. Second Program: Entity Extraction
     • Name of program: test-ent.pl
     • Input: inside the Perl file.
     • Output: running this program generates a text file named test_ent_results.txt
     • Aim: entity extraction
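Dictionary-based entity extraction of the kind described on the two slides above can be sketched as a lookup of known names in the text. The tiny `GAZETTEER` below is invented for illustration and stands in for dictionary files such as co_abbrev.dat and co_places, whose actual contents and format are not shown in the slides; test-ent.pl itself may work quite differently.

```python
import re

# Hypothetical mini-dictionaries: entity type -> known names (lower case).
GAZETTEER = {
    "place": {"ulm", "london", "new york"},
    "organization": {"reuters", "at&t labs"},
}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return (surface form, entity type) pairs in order of appearance."""
    found = []
    for entity_type, names in gazetteer.items():
        for name in names:
            # Match the name only when it is not part of a longer word.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), match.group(0), entity_type))
    return [(surface, etype) for _, surface, etype in sorted(found)]


if __name__ == "__main__":
    print(extract_entities("Reuters reported from Ulm and New York."))
```

A plain lookup like this cannot resolve names that belong to more than one category or that are missing from the dictionaries, which is why dictionary size (about 900 abbreviations, about 3000 cities) matters.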
  29. Third Program: Disambiguate words with multiple
     • Name of program: sense.pl
     • Input: inside the Perl file.
     • Output: running this program generates a text file named sense.txt
  30. Fourth Program: Random Text Generator
     • Name of program: tgen.pl
     • Input: inside the Perl file.
     • Output: running this program generates a text file named tgen.txt
  31. Fifth Program: Splitting Text into Sentences
     • Name of program: tsplit.pl
     • Input: inside the Perl file.
     • Output: running this program generates a text file named tsplit.txt
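Sentence splitting can be sketched with a single regular expression. This is a deliberately naive rule and not the method used by tsplit.pl, which the slides do not describe: it assumes a sentence ends at ".", "!" or "?" followed by whitespace and a capital letter or opening quote.

```python
import re

# Assumed boundary: end punctuation, then whitespace, then a capital
# letter or a quotation mark that opens the next sentence.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'])")


def split_sentences(text):
    """Split text into sentences at the assumed boundaries."""
    parts = SENTENCE_BOUNDARY.split(text)
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("Text mining is fun. Is it hard? Not really!"):
        print(sentence)
```

The rule breaks on abbreviations followed by a capitalized word ("Dr. Smith" is split in two), which is one reason practical splitters consult an abbreviation list.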
  32. Sixth Program: Clustering
     • Name of program: cluster.pl
     • Input data: a collection of 55 Reuters documents from three topics (the input is included in cluster.pl)
       – Cocoa, 15 documents
       – Sugar, 22 documents
       – Coffee, 18 documents
     • Input parameters: a similarity threshold, a linking parameter, and an indexing parameter.
     • Output: a list of clusters and a similarity matrix, written to Cluster.txt
     • Method: this program is based on a genetic algorithm.
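The two outputs named above, a similarity matrix and a list of clusters, can be illustrated with a much simpler technique than the one cluster.pl uses. The sketch below is not a genetic algorithm: it computes cosine similarity between word-count vectors and then groups documents by single-link thresholding (two documents share a cluster if a chain of pairs with similarity at or above the threshold connects them). All function names are illustrative, and only the similarity threshold from the slide's parameter list is modeled.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_matrix(docs):
    """Pairwise cosine similarities of whitespace-tokenized documents."""
    vectors = [Counter(doc.lower().split()) for doc in docs]
    return [[cosine(u, v) for v in vectors] for u in vectors]


def threshold_clusters(sim, threshold):
    """Single-link grouping: merge documents i, j when sim[i][j] >= threshold."""
    parent = list(range(len(sim)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(sim)):
        for j in range(i + 1, len(sim)):
            if sim[i][j] >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(sim)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Raising the threshold yields more, smaller clusters; lowering it merges topics, so the value has to be tuned against a collection such as the 55 cocoa, sugar and coffee documents.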
  33. Seminar on Text Mining by Examples, Part Seven: Useful Web Sites
  34. Talk to Ditto
     • http://www.convo.co.uk/x02/?
  38. How Does It Work?
     • Bayesian classification is used to teach Ditto the donkey the basics of the English language.
     • When Ditto receives a message, he evaluates it for niceness or nastiness, then responds emotionally on a scale of –100 to +100.
     • Ditto was trained using 5525 examples.
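The idea behind a nice-versus-nasty Bayesian classifier can be sketched with a small naive Bayes model. This is not Ditto's implementation, which the slides do not show. The class name, the add-one smoothing, and the mapping from the probability of "nice" to the –100 to +100 scale (`200 * p - 100`) are assumptions, and the model must be trained on at least one example of each label before use.

```python
import math
from collections import Counter


class NiceNastyClassifier:
    """Minimal naive Bayes sketch with two labels: "nice" and "nasty"."""

    def __init__(self):
        self.word_counts = {"nice": Counter(), "nasty": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Add one labeled example (label is "nice" or "nasty")."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def _log_score(self, words, label):
        # log P(label) + sum of log P(word | label), with add-one smoothing.
        counts = self.word_counts[label]
        vocab = set(self.word_counts["nice"]) | set(self.word_counts["nasty"])
        total = sum(counts.values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in words:
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        return score

    def mood(self, text):
        """Return an integer from -100 (nasty) to +100 (nice)."""
        words = text.lower().split()
        diff = self._log_score(words, "nice") - self._log_score(words, "nasty")
        diff = max(-50.0, min(50.0, diff))  # keep math.exp within range
        p_nice = 1.0 / (1.0 + math.exp(-diff))
        return round(200 * p_nice - 100)  # assumed mapping to the scale
```

After training on a handful of labeled messages, `mood("lovely friend")` comes out positive and `mood("stupid hate")` negative; with 5525 training examples, as on the slide, the word statistics would be far more reliable.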
  39. Dragon Toolkit
     • Dragon Toolkit
  40. Disp
     • http://www.ltg.ed.ac.uk/disp/resources/
  41. References
     • Books
       – Introduction to Information Retrieval (2008)
       – Managing Gigabytes (1999)
       – The Text Mining Handbook
       – Text Mining Application Programming
       – Web Data Mining
