TEXT MINING
seminar submitted by:
Ali Abdul_Zahraa
Msc,MathcompUOK
ali.abdulzahraa@gmail.com
Outline
Introduction
Data Mining vs Text Mining
Text Mining Process
Text Mining Applications
Challenges in Text Mining
Conclusion
Introduction
• What is Text Mining?
– Text mining is the analysis of data contained in
natural language text
Introduction
• Why Text Mining?
– A massive amount of new information is being
created: the world’s data doubles every 18 months
(Jacques Vallee, Ph.D.)
– 80-90% of all data is held in various
unstructured formats
– Useful information can be derived from this
unstructured data
Unstructured Data Examples “Ore”
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
• Customer complaint letters
• Contracts
• Transcripts of phone calls with customers
• Technical documents
Reasons for Text Mining
[Bar chart: percentage of data held as collections of text vs. structured data; axis: Percentage, 0 to 90]
How Text Mining Differs from Data Mining
Data Mining
• Identify data sets
• Select features
• Prepare data
• Analyze distribution
Text Mining
• Identify documents
• Extract features
• Select features by algorithm
• Prepare data
• Analyze distribution
Mining
• Filtering: remove punctuation and special characters.
• Segmentation: split the document into words.
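The filtering and segmentation steps above can be sketched as follows. This is a minimal illustration, not the method used in the seminar: the function names `filter_text` and `segment` are hypothetical, and lowercasing the words is an added assumption that the slide does not mention.

```python
import re


def filter_text(text: str) -> str:
    """Filtering: replace punctuation and special characters with spaces."""
    # Anything that is not a letter, digit, or whitespace becomes a space.
    # The underscore is handled separately because \w treats it as a word character.
    return re.sub(r"[^\w\s]|_", " ", text)


def segment(text: str) -> list[str]:
    """Segmentation: split the filtered text into words."""
    # Lowercasing is an assumption added here so that "Text" and "text" match.
    return filter_text(text).lower().split()


print(segment("Hello, world! Text-mining: 100% fun."))
```

Replacing punctuation with a space rather than deleting it keeps hyphenated terms such as "Text-mining" from collapsing into a single token.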
Stemming: techniques used to find the root/stem of a word
– E.g.,
– user, users, used, using
– engineering, engineered, engineer
• Stem (root): use, engineer
Usefulness
• improving effectiveness of retrieval and text mining
– matching similar words
• reducing index size
– combining words with the same root may reduce index size by as much
as 40-50%.
Mining
Basic stemming methods
• remove ending
– if a word ends with a consonant other than s,
followed by an s, then delete the s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists only
of one letter or of th.
– if a word ends with ed, preceded by a consonant, delete the ed unless this
leaves only a single letter.
– …...
• transform words
– if a word ends with “ies” but not “eies” or “aies”, then replace “ies” with “y”
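The suffix rules listed above can be sketched in a few lines. This is only an illustration of those specific rules, assuming a particular order of application (transform first, then the ending rules) and assuming that "consonant" means any letter other than a, e, i, o, u. The function name `simple_stem` is hypothetical.

```python
VOWELS = set("aeiou")


def is_consonant(ch: str) -> bool:
    """Assumption: a consonant is any alphabetic character that is not a, e, i, o, u."""
    return ch.isalpha() and ch not in VOWELS


def simple_stem(word: str) -> str:
    """Apply the first matching suffix rule from the slide; at most one rule fires."""
    w = word.lower()
    # Transform rule: "ies" (but not "eies" or "aies") becomes "y".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Ends in "es": drop the "s".
    if w.endswith("es"):
        return w[:-1]
    # Ends with a consonant other than "s", followed by "s": delete the "s".
    if len(w) >= 2 and w.endswith("s") and is_consonant(w[-2]) and w[-2] != "s":
        return w[:-1]
    # Ends in "ing": delete it unless only one letter or "th" would remain.
    if w.endswith("ing"):
        rest = w[:-3]
        return rest if len(rest) > 1 and rest != "th" else w
    # Ends in "ed" preceded by a consonant: delete it unless a single letter would remain.
    if w.endswith("ed") and len(w) >= 3 and is_consonant(w[-3]):
        rest = w[:-2]
        return rest if len(rest) > 1 else w
    return w


for word in ["users", "engineering", "engineered", "studies", "thing"]:
    print(word, "->", simple_stem(word))
```

Rules this simple do not always reach the true root of a word, which is why practical stemmers add further rules and exception lists.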
Mining
Eliminate excessive words (stop words): words that
carry no meaning on their own, such as prepositions,
conjunctions, and conditional particles.
This is done by comparing each word with a list
of these words.
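The list comparison described above might look like the sketch below. The word list `STOP_WORDS` is a tiny hypothetical sample chosen for illustration; real lists are much longer and language-specific.

```python
# Hypothetical sample list: a few prepositions, conjunctions, articles,
# and conditional particles. Not an authoritative stop-word list.
STOP_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to",
    "and", "or", "but", "if", "unless",
}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that do not appear in the stop-word list."""
    return [w for w in words if w.lower() not in stop_words]


print(remove_stop_words(["the", "analysis", "of", "data", "in", "text"]))
```

Storing the list as a set keeps each lookup at constant time, which matters when the comparison runs over a large document collection.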
Canonical Names
President Bush
Mr. Bush
George Bush
Canonical Name:
George Bush
• The canonical name is the most explicit, least
ambiguous name constructed from the different
variants found in the document
• Reduces ambiguity of variants
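One simple way to pick a canonical name from a set of variants is sketched below. The heuristic is an assumption made for illustration, not the method described in the seminar: strip titles, then choose the variant with the most remaining name words. The `TITLES` set and the function names are hypothetical, and the sketch assumes the variants have already been identified as referring to the same person.

```python
# Hypothetical sample of titles to ignore when comparing name variants.
TITLES = {"president", "mr", "mrs", "ms", "dr"}


def name_tokens(variant: str) -> list[str]:
    """Split a name variant into words and drop titles such as 'Mr.'."""
    tokens = [t.strip(".") for t in variant.split()]
    return [t for t in tokens if t.lower() not in TITLES]


def canonical_name(variants: list[str]) -> str:
    """Return the most explicit variant: the one with the most non-title words."""
    # max() returns the first variant in case of a tie.
    best = max(variants, key=lambda v: len(name_tokens(v)))
    return " ".join(name_tokens(best))


print(canonical_name(["President Bush", "Mr. Bush", "George Bush"]))
```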
Mining
Clipping: eliminate words that appear with very high
or very low frequency.
o Low-frequency words form small
clusters that are not useful, and high-frequency
words appear almost everywhere, so they are also not
useful.
o There are many ways to calculate a word’s
frequency in the document(s).
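As one of the many possible frequency measures, the sketch below uses document frequency (the number of documents a word appears in) to clip the vocabulary. The thresholds `min_df` and `max_df_ratio` and the function name are hypothetical choices for illustration.

```python
from collections import Counter


def clip_vocabulary(docs, min_df=2, max_df_ratio=0.8):
    """Return the words that are neither too rare nor too common.

    docs is a list of documents, each a list of words.
    A word is kept if it occurs in at least min_df documents and in
    at most max_df_ratio of all documents.
    """
    n = len(docs)
    df = Counter()
    for words in docs:
        df.update(set(words))  # count each word once per document
    return {w for w, c in df.items() if c >= min_df and c / n <= max_df_ratio}


docs = [
    ["text", "mining", "the"],
    ["data", "mining", "the"],
    ["text", "analysis", "the"],
    ["web", "the", "data"],
]
print(sorted(clip_vocabulary(docs)))
```

In this small example "the" is dropped because it appears in every document, while "analysis" and "web" are dropped because each appears in only one.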
Mining
Clustering: group interrelated
documents based on their topics.
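A very small sketch of topic-based grouping is shown below. The seminar does not name a clustering algorithm, so this uses an assumed one: a greedy single-pass scheme that compares word-count vectors with cosine similarity. The `threshold` value and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_documents(docs, threshold=0.3):
    """Greedy single-pass clustering.

    Each document joins the first cluster whose seed (first member) is
    at least `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of document indices.
    """
    vectors = [Counter(words) for words in docs]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    ["gene", "disease", "clinical"],
    ["spam", "email", "filter"],
    ["gene", "clinical", "study"],
    ["email", "spam", "inbox"],
]
print(cluster_documents(docs))
```

The result depends on document order and on the threshold, so this is useful only as an illustration of the idea.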
Text Mining: Analysis
• Which words occur most often?
• Which words are most interesting?
• Which words help define the document?
• What are the interesting text phrases?
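The first and third questions above can be approached with simple counts. The sketch below uses raw term frequency for "most present" and TF-IDF weighting for "help define the document". TF-IDF is an assumption introduced here as one common choice, not something the seminar specifies, and the function names are hypothetical.

```python
import math
from collections import Counter


def most_frequent(words, k=3):
    """Which words occur most often: the k most common words with their counts."""
    return Counter(words).most_common(k)


def tf_idf(docs):
    """Score each word in each document by term frequency times inverse document frequency.

    docs is a list of documents, each a list of words.
    Returns one {word: score} dictionary per document.
    """
    n = len(docs)
    df = Counter()
    for words in docs:
        df.update(set(words))  # number of documents containing each word
    scores = []
    for words in docs:
        tf = Counter(words)
        scores.append(
            {w: (c / len(words)) * math.log(n / df[w]) for w, c in tf.items()}
        )
    return scores


docs = [["text", "mining", "text"], ["data", "mining"]]
scores = tf_idf(docs)
print(most_frequent(docs[0], k=1))
print(max(scores[0], key=scores[0].get))
```

A word that appears in every document, such as "mining" here, gets a score of zero, which matches the intuition that it does not help distinguish one document from another.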
Text mining applications
• Call center software.
• Anti-spam.
• Market intelligence.
• Web mining.
Actual examples
• A clinical center in the USA was able to
identify a gene responsible for
a harmful disease by processing more than
150,000 newspapers.
• Text mining of the Holy Quran.
• Etc.
Challenges in Text Mining
• Information is in unstructured textual form and
in natural language (NL).
• It is not readily accessible to computers.
• Huge collections of documents must be handled.
• A skilled person is required to choose which documents
to process and to analyze the output.
• More time is required.
• Cost: $50,000 for the software alone.
More information
• The Central Intelligence Agency (CIA) is the strongest
supporter of text mining.
- The events of September 11.
- Mining of e-mail, chat rooms, and social
networks.
- It therefore supports many companies such as
Attensity, Inxight, and Intelliseek.
More information
• SPSS company statistics: text mining software
has very few users compared with data mining
software.
Conclusion
• Finally, most sources indicate that the field of text
mining is still in the research phase
• and its applications are still in limited operation at
the present time,
• but the possibilities it offers, helping to
make sense of huge amounts of text and to
extract the important and useful information,
hold promise in many areas.