Text Mining
Presenter: Gokul K S
Text mining is also known as Text Data Mining (TDM)
and Knowledge Discovery in Textual Databases (KDT).
It is the process of identifying novel information
from a collection of text.
What are text databases?
Comparison
Data Mining
• Processes data directly
• Identifies causal relationships
• Works on structured numeric transaction data
residing in a relational data warehouse
Text Mining
• Relies on linguistic processing or natural language
processing (NLP)
• Discovers heretofore unknown information
Data Mining / Knowledge Discovery
Types of data: Structured Data, Multimedia, Free Text, Hypertext
Structured Data:
HomeLoan (
Loanee: Frank Rizzo
Lender: MWF
Agency: Lake View
Amount: $200,000
Term: 15 years
)
Multimedia:
Loans($200K,[map],...)
Free Text:
Frank Rizzo bought his home from Lake View Real Estate in 1992.
He paid $200,000 under a 15-year loan from MW Financial.
Hypertext:
<a href>Frank Rizzo</a> bought <a href>this home</a>
from <a href>Lake View Real Estate</a> in <b>1992</b>.
<p>...
Information Retrieval
• The science of searching for:
◦ Information in documents
◦ Documents themselves
◦ Metadata that describes documents
◦ Text, sound, images, or data within databases,
whether relational stand-alone databases
or hypertext networked databases such as
the Internet or intranets
Information retrieval (cont.)
• A field developed in parallel with database
systems
• Information is organized into a large
number of documents
• The information retrieval problem: locating
relevant documents based on user input,
such as keywords or example documents
Basic Measures for
Text Retrieval
Precision
Precision: the percentage of retrieved documents that
are in fact relevant to the query (i.e., “correct”
responses).
$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$
[Figure: Venn diagram of All Documents showing the Relevant set, the Retrieved set, and their overlap, Relevant & Retrieved]
Recall
Recall: the percentage of documents relevant
to the query that were, in fact, retrieved.
$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$
Trade-off
○ F-score: a measure of the trade-off between the two, defined as the
harmonic mean of recall and precision:
$$\text{F\_score} = \frac{\text{recall} \times \text{precision}}{(\text{recall} + \text{precision})/2}$$
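The three measures above can be sketched in a few lines of Python. This is a minimal illustration, assuming documents are identified by hypothetical string IDs and that the relevant and retrieved sets are already known; the function name `precision_recall_f` is not from the slides.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|

    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    if precision + recall == 0:
        f_score = 0.0
    else:
        # Harmonic mean, written the same way as the slide's formula.
        f_score = (recall * precision) / ((recall + precision) / 2)
    return precision, recall, f_score


# Hypothetical example: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
p, r, f = precision_recall_f(relevant, retrieved)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Returning 0.0 for empty sets is a convention chosen here for convenience; other treatments of the undefined case are equally reasonable.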
Text Retrieval Methods
• Document Selection
• Boolean Model
A typical method of this category is the Boolean retrieval model, in which a
document is represented by a set of keywords and a user provides a
Boolean expression of keywords, such as “car and repair shops,” “tea or
coffee,” or “database systems but not Oracle.”
The Boolean model predicts that each document is either relevant or
non-relevant based on the match of the document to the query.
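The Boolean model can be illustrated with Python sets. This is only a sketch under simple assumptions: each document is a hypothetical set of lowercase keywords, and each query is written as a Python predicate rather than parsed from a query string.

```python
# Hypothetical document collection: name -> set of keywords.
docs = {
    "d1": {"car", "repair", "shops"},
    "d2": {"tea", "biscuits"},
    "d3": {"database", "systems", "oracle"},
    "d4": {"database", "systems", "coffee"},
}


def select(predicate):
    """Return the names of all documents whose keyword set satisfies the predicate."""
    return sorted(name for name, keywords in docs.items() if predicate(keywords))


# "car and repair shops"
car_and_repair = select(lambda kw: "car" in kw and "repair" in kw)
# "tea or coffee"
tea_or_coffee = select(lambda kw: "tea" in kw or "coffee" in kw)
# "database systems but not Oracle"
db_not_oracle = select(lambda kw: {"database", "systems"} <= kw and "oracle" not in kw)

print(car_and_repair, tea_or_coffee, db_not_oracle)
```

Each document is either selected or not; there is no notion of partial relevance, which is what distinguishes document selection from the ranking methods that follow.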
Document ranking
Document ranking methods use the query to
rank all documents in order of relevance.
Document ranking
Basic techniques
◦ Stop list
A set of words that are deemed “irrelevant,” even though they may
appear frequently
◦ E.g., a, the, of, for, to, with, etc.
◦ Stop lists may vary when the document set varies
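Stop-list filtering can be sketched as follows. The stop list here contains only the example words from the slide, and splitting on whitespace is a simplifying assumption; a real tokenizer would also handle punctuation.

```python
# Stop list taken from the slide's examples; real lists are longer and corpus-specific.
STOP_LIST = {"a", "the", "of", "for", "to", "with"}


def remove_stop_words(text, stop_list=STOP_LIST):
    """Lowercase the text, split on whitespace, and drop words found in the stop list."""
    return [word for word in text.lower().split() if word not in stop_list]


tokens = remove_stop_words("The science of searching for information with a query")
print(tokens)  # ['science', 'searching', 'information', 'query']
```

Passing the stop list as a parameter reflects the point that stop lists may vary from one document set to another.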
Document ranking
◦ Word stem
Several words are small syntactic variants of each other since they share a
common word stem
E.g., drug, drugs, drugged
◦ A term frequency table
Each entry frequent_table(i, j) = number of occurrences of the word t_i in
document d_j
◦ Usually, the ratio instead of the absolute number of occurrences is used
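A term frequency table built on word stems might look like the sketch below. The `crude_stem` function is a toy written for this example and handles only simple cases such as drug, drugs, drugged; it is not a real stemming algorithm. The table is stored per document (document name, then term) and holds ratios rather than raw counts.

```python
from collections import Counter


def crude_stem(word):
    """Toy suffix stripping for illustration only; not a real stemming algorithm."""
    if word.endswith("ed") and len(word) > 4:
        word = word[:-2]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]  # drugged -> drugg -> drug
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]  # drugs -> drug
    return word


def frequency_table(documents):
    """Map each document name to {stem: occurrences / total words in the document}."""
    table = {}
    for name, text in documents.items():
        stems = [crude_stem(word) for word in text.lower().split()]
        counts = Counter(stems)
        total = len(stems)
        table[name] = {term: n / total for term, n in counts.items()}
    return table


# Hypothetical two-document collection.
documents = {
    "d1": "drug drugs drugged patient",
    "d2": "patient records",
}
table = frequency_table(documents)
print(table["d1"])  # {'drug': 0.75, 'patient': 0.25}
```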
Document ranking
◦ Term Frequency (TF)
Let the term frequency be the number of occurrences of term t in the
document d, that is, freq(d, t). The (weighted) term-frequency
matrix TF(d, t) measures the association of a term t with respect to
the given document d: it is generally defined as 0 if the document
does not contain the term, and nonzero otherwise.
$$TF(d,t) = \begin{cases} 0 & \text{if } freq(d,t) = 0 \\ 1 + \log(1 + \log(freq(d,t))) & \text{otherwise} \end{cases}$$
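The TF weighting above translates directly into code. The sketch below assumes natural logarithms, since the slide does not state the base, and takes the raw count freq(d, t) as its input.

```python
import math


def tf(freq):
    """Weighted term frequency: 0 if the term is absent, 1 + log(1 + log(freq)) otherwise.

    Assumes natural logarithms; the slide does not specify a base.
    """
    if freq == 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(freq))


for count in (0, 1, 2, 10, 100):
    print(count, round(tf(count), 4))
```

The double logarithm dampens the effect of very frequent terms: a term occurring 100 times is weighted only moderately more than one occurring 10 times.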
Document ranking
Inverse document frequency (IDF)
◦ IDF represents the scaling factor, or the importance, of a term t.
○ If a term t occurs in many documents, its importance will be
scaled down due to its reduced discriminative power.
$$IDF(t) = \log \frac{1 + |d|}{|d_t|}$$
where d is the document collection and d_t is the set of documents
containing term t. If |d_t| << |d|, the term t will have a large IDF
scaling factor, and vice versa.
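The IDF formula can be sketched as below, assuming each document is represented as a set of terms and natural logarithms are used. The collection is hypothetical, and raising an error for a term that appears in no document is a choice made for this example, since the formula is undefined when |d_t| = 0.

```python
import math


def idf(term, documents):
    """IDF(t) = log((1 + |d|) / |d_t|), with natural log as an assumption."""
    containing = sum(1 for words in documents if term in words)  # |d_t|
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log((1 + len(documents)) / containing)


# Hypothetical collection of four documents, each a set of terms.
documents = [
    {"text", "mining"},
    {"text", "retrieval"},
    {"text", "database"},
    {"mining", "data"},
]

for term in ("text", "mining", "retrieval"):
    print(term, round(idf(term, documents), 4))
```

Here "text" occurs in three of four documents and gets the smallest factor, while "retrieval" occurs in only one and gets the largest.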
Document ranking
○ In a complete vector-space model, TF and IDF are combined,
which forms the TF-IDF measure:
TF-IDF(d, t) = TF(d, t) * IDF(t)
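Combining the two factors gives a short, self-contained sketch of TF-IDF. It assumes natural logarithms, documents stored as lists of terms, and a document identified by its index in a hypothetical collection; none of these details come from the slides.

```python
import math


def tf(freq):
    """Weighted term frequency: 0 if absent, 1 + log(1 + log(freq)) otherwise."""
    return 0.0 if freq == 0 else 1.0 + math.log(1.0 + math.log(freq))


def tf_idf(d, t, docs):
    """TF-IDF(d, t) = TF(d, t) * IDF(t), where d is an index into docs."""
    freq = docs[d].count(t)
    if freq == 0:
        return 0.0  # TF is 0, so the product is 0 regardless of IDF
    containing = sum(1 for doc in docs if t in doc)  # >= 1 because docs[d] contains t
    idf = math.log((1 + len(docs)) / containing)
    return tf(freq) * idf


# Hypothetical collection: each document is a list of terms.
docs = [
    ["text", "mining", "text"],
    ["text", "retrieval"],
    ["data", "mining"],
]

print(round(tf_idf(0, "text", docs), 4))
print(round(tf_idf(1, "retrieval", docs), 4))
print(tf_idf(2, "text", docs))
```

In document 1, "retrieval" outweighs "text" even though both occur once, because "retrieval" appears in fewer documents of the collection.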
Document ranking
Similarity based
◦ Finds similar documents based on a set of common keywords
◦ The answer should be based on the degree of relevance, judged by the
nearness of the keywords, the relative frequency of the keywords, etc.
◦ Cosine measure: measures the closeness of a document to a query (a set
of keywords), where v1 and v2 are two term vectors (for example, a
document and a query)
$$sim(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$
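The similarity measure above can be sketched in plain Python. The vectors are assumed to be equal-length lists of term weights (for instance TF-IDF values) over a shared vocabulary; returning 0.0 for a zero vector is a convention chosen here, since the formula is undefined in that case.

```python
import math


def cosine_similarity(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|) for two equal-length weight vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention for this sketch: no similarity with an empty vector
    return dot / (norm1 * norm2)


# Hypothetical weights over a three-term vocabulary.
document = [1.0, 2.0, 0.0]
query = [1.0, 0.0, 0.0]
print(round(cosine_similarity(document, query), 4))
```

Ranking then amounts to computing this score between the query vector and every document vector and sorting the documents by descending score.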
Thanks!