Text Mining.pptx

TEXT MINING
By
Yashvi Babariya

INTRODUCTION
 Text mining is a Discovery
 Also referred as Text Data
Mining (TDM) and
Knowledge Discovery in
Textual Database (KDT)
 To extract relevant
information or knowledge or
pattern from different
sources that are in
unstructured or semi-
structured form.

DATA MINING VS. TEXT MINING
Data Mining Text Mining
Process directly Linguistic processing or natural
language processing (NLP)
Identify causal relationship Discover heretofore unknown
information
Structured Data Semi-structured & unstructured
Data (Text)
Structured numeric transection
data residing in rational data
warehouse
Applications deal with much more
diverse and eclectic collections of
systems & formats

INPUT OUTPUT MODEL FOR TEXT MINING

STEPS FOR TEXT MINING
 Pre processing the text
 Applying text mining techniques
 Summarization
 Classification
 Clustering
 Visualization
 Information extraction
o Analyzing The Text

TEXT DATABASES & INFORMATION
RETIEVAL
 Text databases ( document databases)
 Large collections of documents from various sources
: news articles, research papers, books, digital libraries, e-mail
messages and web pages, library database etc.
 Data stored is usually semi-structured
 Information retrieval
 A field developed in parallel with database systems
 Information is organized into a large number of documents
 Information retrieval problem: locating relevant documents
based on user input, such as keywords or example
documents

TYPICAL INFPRMATION RETRIEVAL
PROBLEM
 To locate relevant documents in a document collection
based on a user’s query
 Some keywords describing an information need
 For ad hoc information need user takes the initiative to
“pull” the relevant information out from the collection
 For long-term information need, a retrieval system may
also take the initiative to push relevant to user’s need
 Such an information access process is called
information filtering
 Corresponding systems are called filtering systems or
recommender systems

INFORMATION RETRIEVAL
 Typical IR systems
 Online library catalogs
 Online document management systems
 Information retrieval vs. database
systems
 some DB problems are not present in IR, e.g.,
update, transections management, complex objects
 Some IR problems are not addressed well in
DBMS, e.g., unstructured documents, approximate
search using keywords and relevance

BASIC MEASURE FOR TEXT RETRIEVAL
 Suppose that a text retrieval system has just retrieved a
number of documents based on query
 Let the set of documents relevant to a query be denoted as
{relevant}
 The set of documents retrieved be donated as {retrieved}
 The set of documents that are both relevant and retrieved is
denoted as {relevant} n {retrieved}

 Precision: percentage of retrieved documents that are
in fact relevant to the query(i.e. “correct” response)
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
| 𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 ∩ 𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 |
|{𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑
 Recall: percentage of documents that are relevant to the
query and were in fact, retrieved.
𝑟𝑒𝑐𝑎𝑙𝑙 =
| 𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 ∩ 𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 |
| 𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 |
 An information retrieval system often needs to trade off
recall for precision or vice versa.
 F-score, is harmonic mean of recall and precision
𝐹_𝑠𝑐𝑜𝑟𝑒 =
𝑟𝑒𝑐𝑎𝑙𝑙×𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛
(𝑟𝑒𝑐𝑎𝑙𝑙+𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛)/2

INFORMATION RETRIEVAL CONCEPT
 Basic concept
 A document can be described by a set of representative
keywords called index terms
 Different index terms varying relevance when used to
describe document content
 This effect is captured through the assignment of numerical
weights to each index term of a document
 DBMS Analogy
 Index terms → Attributes
 Weights → Attributes value

TEXT RETRIEVAL METHODS
 Document selection method
• Knowledge base retrieval
• The query is specifying constraints for selecting relevant
documents
• Boolean retrieval model- a document is represented by a set
of keywords
• User provides a Boolean expression of keywords such as “car
and repair shops”, “tea or coffee”
• Return documents that satisfy the boolean expression
• Difficulty in presenting a user’s information need exctly with a
boolean query
• Used when the user knows a lot about the document collection
and can formulate a good query

Text Mining.pptx

Recommended

Recommended

More Related Content

Similar to Text Mining.pptx

Similar to Text Mining.pptx (20)

Recently uploaded

Recently uploaded (20)

Text Mining.pptx