2. INTRODUCTION
Text mining is a Discovery
Also referred as Text Data
Mining (TDM) and
Knowledge Discovery in
Textual Database (KDT)
To extract relevant
information or knowledge or
pattern from different
sources that are in
unstructured or semi-
structured form.
3. DATA MINING VS. TEXT MINING
Data Mining Text Mining
Process directly Linguistic processing or natural
language processing (NLP)
Identify causal relationship Discover heretofore unknown
information
Structured Data Semi-structured & unstructured
Data (Text)
Structured numeric transection
data residing in rational data
warehouse
Applications deal with much more
diverse and eclectic collections of
systems & formats
5. STEPS FOR TEXT MINING
Pre processing the text
Applying text mining techniques
Summarization
Classification
Clustering
Visualization
Information extraction
o Analyzing The Text
6. TEXT DATABASES & INFORMATION
RETIEVAL
Text databases ( document databases)
Large collections of documents from various sources
: news articles, research papers, books, digital libraries, e-mail
messages and web pages, library database etc.
Data stored is usually semi-structured
Information retrieval
A field developed in parallel with database systems
Information is organized into a large number of documents
Information retrieval problem: locating relevant documents
based on user input, such as keywords or example
documents
7. TYPICAL INFPRMATION RETRIEVAL
PROBLEM
To locate relevant documents in a document collection
based on a user’s query
Some keywords describing an information need
For ad hoc information need user takes the initiative to
“pull” the relevant information out from the collection
For long-term information need, a retrieval system may
also take the initiative to push relevant to user’s need
Such an information access process is called
information filtering
Corresponding systems are called filtering systems or
recommender systems
8. INFORMATION RETRIEVAL
Typical IR systems
Online library catalogs
Online document management systems
Information retrieval vs. database
systems
some DB problems are not present in IR, e.g.,
update, transections management, complex objects
Some IR problems are not addressed well in
DBMS, e.g., unstructured documents, approximate
search using keywords and relevance
9. BASIC MEASURE FOR TEXT RETRIEVAL
Suppose that a text retrieval system has just retrieved a
number of documents based on query
Let the set of documents relevant to a query be denoted as
{relevant}
The set of documents retrieved be donated as {retrieved}
The set of documents that are both relevant and retrieved is
denoted as {relevant} n {retrieved}
10. Precision: percentage of retrieved documents that are
in fact relevant to the query(i.e. “correct” response)
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
| 𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 ∩ 𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 |
|{𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑
Recall: percentage of documents that are relevant to the
query and were in fact, retrieved.
𝑟𝑒𝑐𝑎𝑙𝑙 =
| 𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 ∩ 𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 |
| 𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 |
An information retrieval system often needs to trade off
recall for precision or vice versa.
F-score, is harmonic mean of recall and precision
𝐹_𝑠𝑐𝑜𝑟𝑒 =
𝑟𝑒𝑐𝑎𝑙𝑙×𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛
(𝑟𝑒𝑐𝑎𝑙𝑙+𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛)/2
11. INFORMATION RETRIEVAL CONCEPT
Basic concept
A document can be described by a set of representative
keywords called index terms
Different index terms varying relevance when used to
describe document content
This effect is captured through the assignment of numerical
weights to each index term of a document
DBMS Analogy
Index terms → Attributes
Weights → Attributes value
12. TEXT RETRIEVAL METHODS
Document selection method
• Knowledge base retrieval
• The query is specifying constraints for selecting relevant
documents
• Boolean retrieval model- a document is represented by a set
of keywords
• User provides a Boolean expression of keywords such as “car
and repair shops”, “tea or coffee”
• Return documents that satisfy the boolean expression
• Difficulty in presenting a user’s information need exctly with a
boolean query
• Used when the user knows a lot about the document collection
and can formulate a good query