Text Mining

ADDIS ABABA UNIVERSITY
SCHOOL OF GRADUATE STUDIES
SCHOOL OF INFORMATION SCIENCE
February 2011

Presentation Outline
• Definition
• Related Research Areas
• Architecture
• TM Process
• Techniques
• Applications
• Pros and Cons
– Advantages
– Challenges/ Limitations
• Conclusion
• Recommendations /Future of Text Mining/

Introduction and Definitions
• Mining is the process of searching for patterns
within structured or unstructured data.
• Text Mining is the discovery by computer of new,
previously unknown information, by
automatically extracting useful information from
different written resources.
• Text mining, also known as document mining, is
an emerging technology for analyzing large
collections of unstructured documents in order
to extract interesting and non-trivial
(important) patterns or knowledge.

Related Fields of Study
Field            Database Type   Search Mode     Atomic Entity
Data Retrieval   Structured      Goal-driven     Data record
Info. Retrieval  Unstructured    Goal-driven     Document
Data Mining      Structured      Opportunistic   Numbers and dimensions
Text Mining      Unstructured    Opportunistic   Language feature or concept
Table 1: Summary of differences among text mining and related fields
Figure 1: The relation and difference of text mining with other fields

General Architecture of Text Mining Systems
(Feldman and Sanger, 2007)
• A text mining system (TMS) has four main architectural areas:
1. Preprocessing tasks: convert the information from each
original data source into a canonical (recognized or official)
format.
2. Core Mining Operations: “the heart of a TMS” and include
pattern discovery, trend analysis, and incremental
knowledge discovery algorithms.
3. Presentation Layer Components: include GUI and pattern
browsing functionality as well as access to the query
language. Visualization tools and user-facing query editors
and optimizers also fall under this architectural category.
4. Refinement Techniques (post-processing): include methods
that filter redundant information and cluster closely related
data
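A minimal sketch of how the four areas could fit together, assuming a toy in-memory document list. The function names (preprocess, mine_patterns, refine, present) and the use of document-frequency counting as the "core mining operation" are illustrative assumptions, not taken from Feldman and Sanger.

```python
from collections import Counter


def preprocess(raw_docs):
    """Preprocessing: convert each source document to one canonical form
    (here, a list of lowercase word tokens)."""
    return [doc.lower().split() for doc in raw_docs]


def mine_patterns(docs):
    """Core mining operation (illustrative): count in how many documents
    each term appears."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))
    return counts


def refine(patterns, min_docs=2):
    """Refinement (post-processing): filter out patterns seen in fewer
    than min_docs documents."""
    return {term: n for term, n in patterns.items() if n >= min_docs}


def present(patterns):
    """Presentation layer: return patterns sorted for browsing,
    most frequent first, then alphabetically."""
    return sorted(patterns.items(), key=lambda item: (-item[1], item[0]))


raw = ["Text mining finds patterns", "Data mining finds trends", "Text retrieval"]
result = present(refine(mine_patterns(preprocess(raw))))
print(result)
```

In this sketch refinement runs before presentation, so the user only browses patterns that survived filtering; a real system would expose far richer mining operations and a query language.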

Figure 2: System architecture for generic text mining system
Figure 3: System architecture for an advanced or domain-oriented text mining system
Figure 4: System architecture for an advanced text mining system with background knowledge base

TM Process (Vidhya and Aghila, 2010)
Figure 5: Text Mining Process
Labels recoverable from the figure:
• Document Collection
• Retrieve and Pre-process Document: 1. Tokenize, 2. Remove stop words, 3. Stem
• Feature Generation
• Feature Selection
• TM Techniques: Classification, Clustering, Information Retrieval,
Information Extraction, Summarization, Topic Discovery
• Management Information Systems
• Knowledge
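The three pre-processing steps named in Figure 5 (tokenize, remove stop words, stem) can be sketched as follows. This is a toy illustration using only the standard library: the stop-word list is a tiny assumed sample, and the suffix stripper is a crude stand-in for a real stemming algorithm such as Porter's, not an implementation of it.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}

# Crude suffix list standing in for a real stemmer such as Porter's.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Step 1: split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Step 2: drop tokens that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Step 3: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the three steps in order and return the stemmed tokens."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The miners are mining documents for patterns"))
```

Because the stemmer only strips suffixes, related words such as "miners" and "mining" do not necessarily reduce to the same stem; a proper stemmer or lemmatizer handles such cases better.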

Text Mining Techniques
The major TM techniques:
• Categorization
• Clustering
• Summarization
• Question answering: deals with how to find the best answer to a given
question
• Concept linkage: connects related documents by identifying their commonly
shared concepts
• Information extraction: identifies key phrases and relationships within text
• Topic tracking: a topic tracking system keeps user profiles and, based on
the documents the user views, predicts other documents of interest to the
user
• Association detection: studies the relationships and implications among
topics, or descriptive concepts, that are used to characterize a set of
related texts
• Information visualization: puts large textual sources in a visual hierarchy or
map and provides browsing capabilities. The user can interact with the
document map by zooming, scaling, and creating sub-maps
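As one illustration of the techniques listed above, topic tracking can be sketched as ranking unseen documents by their similarity to a profile built from what the user has already viewed. The bag-of-words profile, the cosine similarity measure, and the function names below are assumptions chosen for brevity, not a description of any particular product.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(viewed, candidates):
    """Rank candidates by similarity to the user's profile, where the
    profile is the summed term counts of the viewed documents."""
    profile = Counter()
    for doc in viewed:
        profile.update(vectorize(doc))
    return sorted(
        candidates, key=lambda d: cosine(profile, vectorize(d)), reverse=True
    )


viewed = ["stock market prices fall", "market trends and stock indexes"]
candidates = ["football match results", "stock market report", "weather forecast today"]
ranked = recommend(viewed, candidates)
print(ranked[0])
```

A practical system would usually weight terms (for example with TF-IDF) and apply the pre-processing steps first, so that common words do not dominate the profile.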

Text mining Applications
Text Mining: General Applications
• Relationship Analysis
– If A is related to B, and B is related to C, there is potentially a relationship between A
and C.
• Trend analysis
– Occurrences of A peak in October.
• Mixed applications
– Co-occurrences of A with B peak in November.
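The trend and co-occurrence examples above can be sketched by counting, per month, the documents that mention a term or a pair of terms. The dated documents below are invented purely for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical dated documents: (month, text). Invented for illustration.
docs = [
    ("Sep", "A announces plan"),
    ("Oct", "A expands"),
    ("Oct", "A reports growth"),
    ("Oct", "A hires staff"),
    ("Nov", "A and B sign agreement"),
    ("Nov", "B partners with A"),
]


def monthly_counts(docs, *terms):
    """Count, per month, the documents that mention every given term."""
    counts = Counter()
    for month, text in docs:
        tokens = set(text.split())
        if all(term in tokens for term in terms):
            counts[month] += 1
    return counts


def peak_month(counts):
    """Month with the highest document count."""
    return max(counts, key=counts.get)


print(peak_month(monthly_counts(docs, "A")))       # trend analysis for A
print(peak_month(monthly_counts(docs, "A", "B")))  # co-occurrence of A and B
```

With this toy data, A alone peaks in October while A together with B peaks in November, mirroring the two bullet examples.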
Text Mining: Business Applications
• Example 1: Decision Support in CRM
– What are customers’ typical complaints?
• Example 2: Personalization in eCommerce
– Suggest products that fit a user’s interest profile
Major Advantage
Text mining provides a competitive edge for a company to process
and take advantage of a large quantity of textual information.

Other Application Areas of TM
• Security applications
• Biomedical applications
• Software and applications
• Online media applications
• Marketing applications
• Movie analysis
• Academic applications
• Internet search engine
• Call center specialists
• Lawyers, insurers and venture capitalists
• Researching
• Intelligent Email Routing
Commercial applications
• AeroText
• Clarabridge
• Endeca Technologies
• Expert System S.p.A.
• Fair Isaac
• SAS
• IBM SPSS
• StatSoft
Free open-source applications
• Carrot2
• GATE
• OpenNLP
• Natural Language Toolkit (NLTK)
• RapidMiner
• tm: Text Mining Package

Challenges of Text Mining
Analytical Challenges
• Soft matching: the same entity can appear under different surface forms
Example:
Spelling variants: Wal-mart, Walmart
Company names in short form: ClearForest instead of ClearForest Corporation
Use of abbreviations: EDS instead of Electronic Data Systems Corporation
• Summarization: may create erroneous and senseless output
• Temporal resolution: most business documents are time-dependent and may
expire after a certain period of time
• Uniqueness resolution: when processing a large number of documents, it is
possible to identify many features and events that resemble one another
Example: when the same name appears in different documents
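A minimal sketch of soft matching for the three example cases above (spelling variants, short company names, abbreviations). The suffix list, the normalization rules, and the function names are illustrative assumptions; real systems typically add edit-distance measures and curated alias dictionaries.

```python
import re

# Illustrative list of corporate suffixes to ignore (an assumption).
SUFFIXES = {"corporation", "corp", "inc", "ltd"}


def normalize(name):
    """Lowercase, drop punctuation and corporate suffixes, return word list."""
    words = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return [w for w in words if w not in SUFFIXES]


def acronym(words):
    """Initial letters of a word list, used for abbreviation matching."""
    return "".join(w[0] for w in words)


def soft_match(a, b):
    """True if two names agree after normalization, or if one is the
    acronym of the other."""
    wa, wb = normalize(a), normalize(b)
    ja, jb = "".join(wa), "".join(wb)
    return ja == jb or ja == acronym(wb) or jb == acronym(wa)


print(soft_match("Wal-mart", "Walmart"))
print(soft_match("ClearForest", "ClearForest corporation"))
print(soft_match("EDS", "Electronic Data Systems Corporation"))
```

The acronym rule is deliberately loose: any single-letter name would match every name starting with that letter, so a production matcher would need additional constraints.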
Linguistic Challenges
• Anaphora resolution: the ability to resolve co-references
Example: resolving pronouns such as “he”, “she”, “we”, etc.
• Full parsing vs. shallow parsing

Conclusion
• TM, also known as Text Data Mining or Knowledge Discovery
in Text (KDT), refers generally to the process of extracting
interesting and non-trivial information and knowledge from
unstructured text.
• Text mining is an interdisciplinary field which
draws on information retrieval, data mining,
machine learning, statistics and computational
linguistics
• The motivation for TM is that most of the world's
information (over 90%) is stored as text
• TM has many applications in different sectors
• There are different TM techniques, but each technique
faces a number of implementation challenges

Recommendations
• Personalized autonomous mining: Current text mining
products and applications are still tools designed for
trained knowledge specialists
• Multilingual text refining: It is essential to develop text
refining algorithms that process multilingual text
documents and produce language-independent intermediate
forms
• Stronger integration and bigger overlap between text
mining, information retrieval, natural language processing
and software engineering
• Domain knowledge integration: Domain knowledge is not yet
used by current text mining tools
