This document outlines a seminar on text mining by examples presented by Hadi Mohammadzadeh. The seminar covers new terminologies related to text mining, WordNet as a lexical database, the Reuters-21578 text collection, CMU text learning group data archives, text mine software algorithms, and useful websites. The seminar is divided into seven parts covering these topics in detail with examples.
Text mining by examples, By Hadi Mohammadzadeh
1. .
Seminar on
Text Mining
by Examples
By : Hadi Mohammadzadeh
Institute of Applied Information Processing
University of Ulm – 27 Jan. 2010
Hadi Mohammadzadeh Text Mining by Examples Pages 1
2. .
Seminar on Text Mining by Examples
Outline
– New Terminologies
– WordNet - A Large Lexical Database of English
– Reuters-21578 … as a Text Collection
– CMU Text Learning Group Data Archives
– Text Mine Software - Web based algorithms
– Text Mine Software - Command based algorithms
– Useful Websites
3. .
Seminar on Text Mining by Examples
Part One
New Terminologies
Word and Meaning Relationships
4. .
Understanding Text
Hyponym and Hypernym
• In linguistics, a hyponym is a word or phrase whose
semantic range is included within that of another word, its
hypernym. For example, scarlet and crimson are both
hyponyms of red (their hypernym), which is, in turn, a
hyponym of colour.
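The hyponym/hypernym relation can be sketched as a small lookup table. This is only an illustration built from the example on this slide; the names HYPERNYM, hypernym_chain and is_hyponym_of are made up for the sketch, and a real application would query WordNet instead of a hand-written dictionary.

```python
# Toy hypernym table: each word points to its immediate hypernym.
# Entries come from the slide's example only (assumption: one hypernym per word).
HYPERNYM = {
    "scarlet": "red",
    "crimson": "red",
    "red": "colour",
}

def hypernym_chain(word):
    """Return the word followed by its successively broader hypernyms."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def is_hyponym_of(word, candidate):
    """True if candidate appears above word in the hypernym chain."""
    return candidate in hypernym_chain(word)[1:]

print(hypernym_chain("scarlet"))            # ['scarlet', 'red', 'colour']
print(is_hyponym_of("crimson", "colour"))   # True
```

Because the relation is transitive, scarlet is a hyponym of colour as well as of red.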
5. .
Understanding Text
Meronym
• Meronymy is a semantic relation used in linguistics.
A meronym denotes a constituent part of, or a
member of something. That is,
– X is a meronym of Y if Xs are parts of Y(s), or
– X is a meronym of Y if Xs are members of Y(s).
• For example, 'finger' is a meronym of 'hand' because
a finger is part of a hand. Similarly 'wheel' is a
meronym of 'automobile'.
6. .
Understanding Text
Holonym
• Holonymy defines the relationship between a term denoting
the whole and a term denoting a part of the whole. That is,
– 'X' is a holonym of 'Y' if Ys are parts of Xs, or
– 'X' is a holonym of 'Y' if Ys are members of Xs.
• For example, 'tree' is a holonym of 'bark',
of 'trunk',
and of 'limb'.
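Meronymy and holonymy are inverse views of the same part-whole pairs. The following sketch, using only the examples from these slides, builds both lookups from one list; the variable names are illustrative assumptions, not part of WordNet or of the Text Mine software.

```python
from collections import defaultdict

# (part, whole) pairs taken from the examples on the slides.
PART_OF = [
    ("finger", "hand"),
    ("wheel", "automobile"),
    ("bark", "tree"),
    ("trunk", "tree"),
    ("limb", "tree"),
]

meronyms = defaultdict(set)   # whole -> its parts
holonyms = defaultdict(set)   # part  -> the wholes it belongs to
for part, whole in PART_OF:
    meronyms[whole].add(part)
    holonyms[part].add(whole)

print(sorted(meronyms["tree"]))     # ['bark', 'limb', 'trunk']
print(sorted(holonyms["finger"]))   # ['hand']
```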
7. .
Seminar on Text Mining by Examples
Part Two
WordNet
A Large Lexical Database of English
8. .
WordNet
• WordNet® is a large lexical database of English, developed
under the direction of George A. Miller.
• Development of WordNet began in 1985 and its use is
widespread in tools to manage text.
• WordNet is more than just a dictionary and thesaurus; it includes
all kinds of relationships between words. WordNet version 2.0
contains roughly 150,000 content words.
9. .
WordNet cont.
• Nouns, verbs, adjectives and adverbs are grouped into
sets of cognitive synonyms (synsets), each expressing a
distinct concept.
• WordNet is also freely and publicly available for
download.
• WordNet's structure makes it a useful tool for
computational linguistics and natural language
processing.
10. .
Understanding Text – Polysemy
Number of Senses in WordNet
• A word can have more than one meaning, and the intended
meaning is not always obvious from the sentence.
• In WordNet, a word has an average of 1.4 senses.
Average Number of Senses by Word Class
Word Class    Average Number of Senses
Verb          2.1
Adjective     1.45
Adverb        1.25
Noun          1.24
11. .
Understanding Text – Polysemy
Number of Senses in WordNet
Words with the Highest Number of Senses in WordNet
Word     Number of Senses
Break    74
Cut      73
Run      57
Play     52
Make     51
12. .
Understanding Text – Polysemy
Number of POS in WordNet
• Some words also have more than one part of speech (POS). For
example, 'still' has five different parts of speech.
Word     Number of POS
Out      5
Round    5
Still    5
Down     5
Over     4
13. .
Word Classifications in WordNet
• Words can be classified into word classes or POS.
• We refer to nouns, verbs, adjectives, and adverbs as content words.
• Conjunctions, determiners, pronouns, and prepositions are called
function words.
Frequencies of Word Classes from WordNet
Type           Number     Share
Noun           114,400    75%
Adjective      21,438     14%
Verb           11,341     7.4%
Adverb         4,662      3%
Preposition    133        0.08%
Pronoun        118        0.07%
Conjunction    89         0.05%
Determiner     14         0.009%
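The percentages in the table can be recomputed from the counts. The sketch below simply divides each count by the total of all eight word classes; small differences from the slide's figures come from rounding.

```python
# Counts copied from the table above.
counts = {
    "Noun": 114400, "Adjective": 21438, "Verb": 11341, "Adverb": 4662,
    "Preposition": 133, "Pronoun": 118, "Conjunction": 89, "Determiner": 14,
}
total = sum(counts.values())

# Print each class with its count and its share of all listed words.
for word_class, n in counts.items():
    print(f"{word_class:<12} {n:>8,} {100 * n / total:8.3f}%")

# Content words (noun, adjective, verb, adverb) versus function words.
content = sum(counts[c] for c in ("Noun", "Adjective", "Verb", "Adverb"))
print(f"content words: {content:,} of {total:,}")
```

The four content-word classes together account for 151,841 of the 152,195 listed words, which matches the figure of roughly 150,000 content words quoted earlier.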
14. .
WordNet
Website and Developed Program
• WordNet Website
• WordNet Developed Program
15. .
Seminar on Text Mining by Examples
Part Three
Reuters-21578
as a Text Collection
16. .
Reuters-21578
History
• The documents in the Reuters-21578 collection
appeared on the Reuters newswire in 1987.
• Reuters-21578 is a test collection for evaluating
automatic text categorization techniques. It is a
classic benchmark for text categorization algorithms.
• The Reuters-21578 collection is distributed in 22 files.
Each of the first 21 files contains 1,000 documents,
while the last contains 578 documents.
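The file layout explains the collection's name: 21 x 1,000 + 578 = 21,578 documents. As a small sketch of that arithmetic (the function names are illustrative, and the code assumes documents are stored in order, 1,000 per file):

```python
DOCS_PER_FILE = 1000
NUM_FILES = 22
TOTAL_DOCS = 21578

def file_sizes():
    """Number of documents in each of the 22 files, in order."""
    full = [DOCS_PER_FILE] * (NUM_FILES - 1)
    return full + [TOTAL_DOCS - sum(full)]

def file_index(doc_position):
    """Zero-based file index for a 1-based document position (assumed ordering)."""
    if not 1 <= doc_position <= TOTAL_DOCS:
        raise ValueError("position out of range")
    return (doc_position - 1) // DOCS_PER_FILE

print(file_sizes()[-1])     # 578
print(sum(file_sizes()))    # 21578
print(file_index(21578))    # 21
```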
17. .
Reuters-21578
• Distribution 1.0 was released on 26 September 1997 by
David D. Lewis, AT&T Labs - Research.
• The data was originally collected and labeled
by Carnegie Group, Inc. and Reuters, Ltd. in
the course of developing the CONSTRUE text
categorization system.
18. .
Seminar on Text Mining by Examples
Part Four
CMU Text Learning Group
Data Archives
as a Text Collection
19. .
CMU Text Learning Group
Data Archives
• This data set is a collection of 20,000 messages, collected
from 20 different netnews newsgroups. One thousand
messages from each of the twenty newsgroups were chosen at
random and partitioned by newsgroup name.
• Link
• Sample Message
• Experiment Results
• Prof. Cho, Sam Houston State University
20. .
CMU Text Learning Group
Data Archives
1. alt.atheism
2. talk.politics.guns
3. talk.politics.mideast
4. talk.politics.misc
5. talk.religion.misc
6. soc.religion.christian
7. comp.sys.ibm.pc.hardware
8. comp.graphics
9. comp.os.ms-windows.misc
10. comp.sys.mac.hardware
11. comp.windows.x
12. rec.autos
13. rec.motorcycles
14. rec.sport.baseball
15. rec.sport.hockey
16. sci.crypt
17. sci.electronics
18. sci.space
19. sci.med
20. misc.forsale
21. .
Seminar on Text Mining by Examples
Part Five
Text Mine Software
Web based algorithms
22. .
Text Mine Application
• The three scripts in the first row handle:
1. Creation of text statistics
• Number of word types
• Letter frequencies
• Word frequencies
2. Entity Extraction
3. Finding the POS tags for words
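The text statistics listed in item 1 can be approximated in a few lines. This is a minimal sketch, not the Text Mine script itself: it assumes that words are runs of the letters a-z after lowercasing, and the function name text_statistics is made up for the example.

```python
import re
from collections import Counter

def text_statistics(text):
    """Count word types, word frequencies and letter frequencies."""
    lowered = text.lower()
    words = re.findall(r"[a-z]+", lowered)          # assumption: ASCII words only
    letters = [ch for ch in lowered if ch.isalpha()]
    return {
        "word_types": len(set(words)),              # number of distinct words
        "word_frequencies": Counter(words),
        "letter_frequencies": Counter(letters),
    }

stats = text_statistics("The cat sat on the mat.")
print(stats["word_types"])                        # 5
print(stats["word_frequencies"].most_common(1))   # [('the', 2)]
print(stats["letter_frequencies"]["t"])           # 5
```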
23. .
Text Mine Application
• As input, use a text file such as the Help file, or
type text into the text box.
24. .
Seminar on Text Mining by Examples
Part Six
Text Mine Software
Command based algorithms
25. .
Zeroth Program
Tokens
• Name of Program: tokens.pl
• Input : sample.
• Output : Running this program generates a text file with the
following name:
tokens.txt
• Aim : Generating Tokens
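To illustrate what generating tokens means, here is a minimal regular-expression tokenizer. It is a sketch in Python, not a port of tokens.pl; the pattern and the choice to keep apostrophe contractions together are assumptions made for this example.

```python
import re

# A token is either a word (optionally with an apostrophe part, as in "don't")
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Don't panic, it's only text."))
# ["Don't", 'panic', ',', "it's", 'only', 'text', '.']
```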
26. .
First Program
Part of Speech Tagger
• Name of Program: pos-test.pl
• Input : Inside Perl File.
• Output : Running this program generates
a text file with the following name:
pos_test_results.txt
• Aim : Part of Speech Tagger
27. .
Second Program
Entity Extraction
• To generate named entities with associated
types, we need some dictionaries for categories
such as
– Person, place, organization, number, currency, dimension, time,
technical time, or miscellaneous.
– For example, co_abbrev.dat contains a list of about 900
abbreviations, and the co_places table is a list of about 3000 of the
world’s larger cities.
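Dictionary-based entity extraction can be sketched as a longest-match lookup against such tables. The tiny GAZETTEER below is a hypothetical stand-in for files like co_abbrev.dat or co_places; its entries, and the function extract_entities, are assumptions for illustration and do not reproduce the Text Mine implementation.

```python
# Hypothetical stand-in dictionary: lowercased token tuples -> entity type.
GAZETTEER = {
    ("new", "york"): "place",
    ("ulm",): "place",
    ("reuters",): "organization",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def extract_entities(text):
    """Scan left to right, preferring the longest dictionary match."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    entities, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1   # no entry starts at this token
    return entities

print(extract_entities("Reuters reported from New York and Ulm."))
# [('Reuters', 'organization'), ('New York', 'place'), ('Ulm', 'place')]
```

Trying the longest candidate first keeps multi-word names such as 'New York' from being split into separate tokens.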
28. .
Second Program
Entity Extraction
• Name of Program: test-ent.pl
• Input : Inside Perl File.
• Output : Running this program generates a
text file with the following name:
test_ent_results.txt
• Aim : Entity Extraction
29. .
Third Program
Disambiguating Words with Multiple Senses
• Name of Program: sense.pl
• Input : Inside Perl File.
• Output : Running this program generates
a text file with the following name:
sense.txt
30. .
Fourth Program
Random Text Generator
• Name of Program: tgen.pl
• Input : Inside Perl File.
• Output : Running this program generates
a text file with the following name:
tgen.txt
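The slide does not say how tgen.pl generates text. One common approach, shown here purely as an assumed illustration, is a first-order Markov chain over words: each next word is drawn from the words that followed the current word in a sample text. The function names and the sample sentence are made up for this sketch.

```python
import random
from collections import defaultdict

def build_model(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def generate(model, start, length, rng=random):
    """Random walk through the model, stopping at a word with no followers."""
    out = [start]
    while len(out) < length and model.get(out[-1]):
        out.append(rng.choice(model[out[-1]]))
    return " ".join(out)

sample = "the cat sat on the mat and the dog sat on the rug"
model = build_model(sample)
print(generate(model, "the", 8))
```

Every adjacent word pair in the output also occurs in the sample text, which is why the result looks locally plausible even though it is random.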
31. .
Fifth Program
Splitting of text into sentences
• Name of Program: tsplit.pl
• Input : Inside Perl File.
• Output : Running this program generates
a text file with the following name:
tsplit.txt
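As a rough illustration of sentence splitting, the sketch below breaks text wherever sentence-ending punctuation is followed by whitespace and a capital letter. This rule is an assumption chosen for brevity, not the logic of tsplit.pl, and it will wrongly split after abbreviations such as 'Dr.' that are followed by a capitalized name.

```python
import re

# Split at whitespace that follows . ! or ? and precedes an uppercase letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Return the sentences found in text, keeping their final punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]

print(split_sentences("Text mining is fun. Is it hard? Not always!"))
# ['Text mining is fun.', 'Is it hard?', 'Not always!']
```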
32. .
Sixth Program
Clustering
• Name of Program: cluster.pl
• Input Data: a collection of 55 Reuters documents from three topics
– Cocoa , 15 documents
– Sugar , 22 documents
– Coffee , 18 documents
Input file included in cluster.pl.
• Input Parameters : A similarity threshold, a linking parameter, and
an indexing parameter.
• Output : A list of clusters and a similarity matrix, written to Cluster.txt.
• Method : This program is based on a genetic algorithm.
33. .
Seminar on Text Mining by Examples
Part Seven
Useful Websites
34. .
Talk to Ditto
• http://www.convo.co.uk/x02/?
38. .
How Does It Work?
• Bayesian Classification is used to teach Ditto
the donkey the basics of the English language
• When Ditto receives a message, he evaluates it
for niceness or nastiness, then responds
emotionally on a scale of –100 to +100
• Ditto was trained using 5525 examples
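A naive Bayes classifier is one way to realize the Bayesian classification described above. The sketch below is an assumption-laden illustration, not Ditto's actual code: the class name, the four training messages, the add-one smoothing and the mapping of the 'nice' probability onto the -100 to +100 scale are all choices made for this example.

```python
import math
from collections import Counter

class NiceNastyClassifier:
    """Naive Bayes over words; both labels must be trained before scoring."""

    def __init__(self):
        self.word_counts = {"nice": Counter(), "nasty": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def _log_score(self, words, label):
        vocab = set(self.word_counts["nice"]) | set(self.word_counts["nasty"])
        total = sum(self.word_counts[label].values())
        logp = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for w in words:
            # Add-one smoothing so unseen words do not zero out the product.
            logp += math.log((self.word_counts[label][w] + 1) / (total + len(vocab)))
        return logp

    def score(self, text):
        """Map P(nice | text) from [0, 1] onto the scale -100 .. +100."""
        words = text.lower().split()
        nice = self._log_score(words, "nice")
        nasty = self._log_score(words, "nasty")
        p_nice = 1 / (1 + math.exp(nasty - nice))
        return round(200 * p_nice - 100)

clf = NiceNastyClassifier()
clf.train("you are lovely and kind", "nice")
clf.train("thank you so much", "nice")
clf.train("you are horrible and stupid", "nasty")
clf.train("i hate you", "nasty")
print(clf.score("lovely kind"))        # positive score
print(clf.score("horrible stupid"))    # negative score
```

With only four training messages the scores are crude; the slide reports that Ditto was trained on 5525 examples.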
39. .
Dragon Toolkit
• Dragon Toolkit
40. .
Disp
• http://www.ltg.ed.ac.uk/disp/resources/
41. .
References
• Books
– Introduction to Information Retrieval-2008
– Managing Gigabytes-1999
– The Text Mining Handbook
– Text Mining Application Programming
– Web Data Mining