Text mining

TEXT MINING
seminar submitted by:
Ali Abdul_Zahraa
Msc,MathcompUOK
ali.abdulzahraa@gmail.com

Outline
Introduction
Data Mining vs Text Mining
Text Mining Process
Text Mining Applications
Challenges in Text Mining
Conclusion

Introduction
• What is Text Mining?
– Text mining is the analysis of data contained in
natural language text

Introduction
• Why Text Mining?
– Massive amount of new information being
created World’s data doubles every 18 months
(Jacques Vallee Ph.D)
– 80-90% of all data is held in various
unstructured formats
– Useful information can be derived from this
unstructured data

Unstructured Data Examples “Ore”
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
• Customer
complaint letters
• Contracts
• Transcripts of
phone calls with
customers
• Technical
documents

Reasons for Text Mining
0
10
20
30
40
50
60
70
80
90
Percentage
Collections of
Text
Structured Data

How Text Mining Differs from Data
Mining
Data Mining
• Identify data sets
• Select features
• Prepare data
• Analyze
distribution
Text Mining
• Identify documents
• Extract features
• Select features by
algorithm
• Prepare data
• Analyze
distribution

Mining
 Filtering : remove punctuation, special
characters .
Segmentation: segment document to
words.

Stemming : Techniques used to
find out the root/stem of a word:
– E.g.,
– user engineering
– users engineered
– used engineer
– using
• Stem (root) : use engineer
Usefulness
• improving effectiveness of retrieval and text mining
– matching similar words
• reducing indexing size
– combing words with same roots may reduce indexing size as much
as 40-50%.
Mining

 Basic stemming methods
• remove ending
– if a word ends with a consonant other than s,
followed by an s, then delete s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists only
of one letter or of th.
– If a word ends with ed, preceded by a consonant, delete the ed unless this
leaves only a single letter.
– …...
• transform words
– if a word ends with “ies” but not “eies” or “aies” then “ies ”
Mining

Mining
eliminate⁯ excessive words : words that not
give meaning by itself such as preposition
, conjunction , conditional particle.
That is performed by comparison with a list
of these words.

Canonical Names
President Bush
Mr. Bush
George Bush
Canonical Name:
George Bush
• The canonical name is the most explicit, least
ambiguous name constructed from the different
variants found in the document
• Reduces ambiguity of variants

Mining
Clipping : eliminate words that appear in high
or low frequency.
o The low frequency’s words will forms small
clusters that not useful , and high frequency’s
words that is always appear and it’s also not
useful.
o There is many ways to calculate word’s
frequency in document(s)

Mining
Clustering : Clustering interrelated
documents, based on documents topics.

Text Mining: Analysis
• Which words are most present.
• Which words are most interesting .
• Which words help define the document.
• What are the interesting text phrases?

Text mining applications
• Call Center Software.
• Anti-Spam.
• Market Intelligence.
• Mining in web .

Actual examples
• One of clinical center in USA be capable of
determine one of genes that responsible for
one of harmful diseases by treat greater than
150,000 news paper.
• Text mining in holy Quran.
• Etc….

Challenges in Text Mining
• Information is in unstructured textual form and it’s
in Natural Language (NL).
• Not readily accessible to be used by computers.
• Dealing with huge collections of documents.
• Require Skillful person to choose which documents
that will treat , and analysis the output .
• Require more time.
• Cost , 50,000$ just to software.

More information
• Central Intelligence Agency (CIA) the most
supportive to text mining .
- 11/ September events.
- mining in E-mail , chat rooms, and social
networks .
-So its support many companies such as
Attensity ،Inxight , Intelliseek.

More information
• SPSS company statistic’s : text mining software
user’s so little comparing with data mining
software user’s.

conclusion
• Finally, most refer to that the field of text
mining are still in the research phase
• and still its applications limited operation at
the present time
• but the possibilities that can be provided,
which helps to understand the huge amounts
of text and extract the core of which
information is important and useful prospects
in many areas .

Text mining

In this document

More Related Content

What's hot

Viewers also liked

Similar to Text mining

More from Ali A Jalil

Recently uploaded

Text mining