A Document Similarity Measurement without Dictionaries
Chung-Chen Chen
Institute of Information Science, Academia Sinica,
Taipei, Taiwan, R.O.C.
E-mail : johnson@iis.sinica.edu.tw
Wen-Lian Hsu
Institute of Information Science, Academia Sinica,
Taipei, Taiwan, R.O.C.
E-mail : hsu@iis.sinica.edu.tw
Abstract
Document similarity measurement is an important topic in information retrieval and document classification systems. However, similarity measures designed for English cannot be applied directly to Chinese. The percentage of English vocabulary covered by an English dictionary is very high, but the coverage is much lower in the case of Chinese. In this paper we propose a measure called common keyword similarity (CKS) that does not rely on dictionaries. The CKS measure is based on the common substrings of two Chinese documents. Each substring is weighted by its distribution over the document database. Experiments show that the weighting function has a great influence on the recall/precision evaluation, and that the CKS measure, which uses no dictionary, outperforms a system that uses dictionaries.
1. Introduction
The measurement of document similarity is an important topic in interactive information retrieval systems and automatic classification systems. One of the most popular measures is the Cosine Coefficient (CC) based on the Vector Space Model (VSM) [1][2][3]. However, the vector space model needs a dictionary to build vectors, and this approach is not appropriate for Mandarin Chinese: many words, such as terminology, people's names, and place names, cannot be found in ordinary Chinese dictionaries. The problem is further aggravated by the fact that there are no delimiters between words in Chinese text. To overcome this problem, we propose an alternative measure that does not require any dictionary.
In this paper, we propose a document similarity measure based on the common substrings of two documents. Our approach finds all maximal common substrings using a PAT-tree [4][5][6], a tree that keeps all substring-frequency information. The importance of each substring (or keyword) is then evaluated by its discriminating effect, which indicates how well the keyword fits a pre-defined classification. The sum of keyword importance after normalization is taken as the similarity between the two documents.
2. Extracting common keywords from two documents by
a PAT-tree
Our approach to measure document similarity is based on
maximal common substrings. A data structure called a PAT-tree
can be used to find such substrings. A PAT-tree is a binary
trie. A trie is a tree in which each internal node splits at
the first byte that differs among its children. Figure 1 shows
a compressed trie that illustrates the indexing of the Chinese
sentence "張先生比李先生高".
Figure 1 : The compressed trie of the Chinese sentence
"張先生比李先生高"
Because the substring "先生" occurs twice in the sentence of
Figure 1, the trie splits at the second child of the root. The
number "2" in that node indicates that the two substrings have
their first two characters in common.
A compressed binary trie has at most two children at each
internal node, and it keeps the indices of a text in linear space. A
PAT-tree is a compressed binary trie that stores the leaf-node
pointer and the split position in the same node, saving the space
of separate leaf nodes. Figure 2 shows a compressed binary trie
and its PAT-tree after scanning the first six bits of the binary
sequence "01100100010111".
Figure 2a : Compressed Binary Trie
Figure 2b : PAT-tree
The following algorithm illustrates how to get the maximal
common substrings from two documents using a PAT-tree.
Algorithm get-common-keywords
Parameters doc1, doc2 : document

  PAT-tree1 = build-PAT-tree(doc1)
  keyword-set = empty
  for pos = 1 to length(doc2)
      tail2 = tail(doc2, pos)
      node1 = search(PAT-tree1, tail2)
      tail1 = tail(doc1, node1.pos)
      keyword = longest-common-head(tail1, tail2)
      if keyword is longer than two Chinese characters
          put keyword into keyword-set
  return keyword-set
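The extraction above can be sketched in Python. This is only an illustrative substitute for the paper's PAT-tree: a sorted list of doc1's suffixes plays the role of the tree, and a binary search finds, for each suffix of doc2, the neighbor with the longest common head. The minimum length of two characters is an assumption made here for illustration.

```python
from bisect import bisect_left

def longest_common_head(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def common_keywords(doc1, doc2, min_len=2):
    """Collect common substrings of doc1 and doc2 of at least min_len chars.

    A sorted list of doc1's suffixes stands in for the PAT-tree: for each
    suffix of doc2 we binary-search its position in the sorted list; the
    longest common prefix with either neighbor is the maximal common
    substring starting at that position."""
    suffixes = sorted(doc1[i:] for i in range(len(doc1)))
    keywords = set()
    for pos in range(len(doc2)):
        tail2 = doc2[pos:]
        i = bisect_left(suffixes, tail2)
        best = 0
        for j in (i - 1, i):  # only the two sorted neighbors can maximize the LCP
            if 0 <= j < len(suffixes):
                best = max(best, longest_common_head(suffixes[j], tail2))
        if best >= min_len:
            keywords.add(tail2[:best])
    return keywords

# the maximal common substrings of the two sentences
print(sorted(common_keywords("張先生比李先生高", "李先生很高")))
```

Sorting all suffixes costs O(n^2 log n) in the worst case for a document of length n, whereas a PAT-tree is built in linear space; the sketch trades efficiency for brevity.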
3. Measuring similarity between two documents
The measure of document similarity is normally based on
the keyword weighting function. For an information retrieval
system, some keywords are important while some are not. The
situation also varies with different contexts. Stop words are
good examples of keywords with no importance. A keyword
may be important for some queries but not for others.
Similarly, a keyword may be important in one classification but
not in another.
In order to illustrate the idea of our keyword importance
measure, we shall introduce two classification systems
used in the news database of the Central News Agency (CNA) in Taiwan. Each of them has four categories. Classification A is "Politics / Economics / Society / Education"; classification B is "Taiwan / Mainland China / Overseas / Local". These two classification systems are somewhat orthogonal. Under our measure, "Stock" is an important keyword in classification A, because the stock market plays an important role in the economy, but it is not important in classification B. On the other hand, the keyword "Russia" is important in classification B but not in A, because Russia is a foreign country from Taiwan's perspective.
In order to measure the importance of a keyword, we propose two measures (CE and KCE) based on the distribution of documents over a classification. The ratio of these two measures reflects the importance of the keyword. Of the following five definitions, the first four are ours; the last, CC, is a popular document similarity measure used as a contrast to our CKS measure.
Definition 1 : Classification Entropy -- CE(CS)
Let D be a set of documents. The classification system CS
partitions D into D1, D2, ..., Dn. Define |Di| as the number of
documents in Di. The classification entropy CE(CS) is
defined as

  CE(CS) = - Σ_{i=1..n} (|Di| / |D|) * log2(|Di| / |D|)

The entropy CE(CS) is higher for a uniform distribution and
lower for a biased one.
Definition 2 : Keyword Classification Entropy -- KCE(k, CS)
Let Dk be the subset of D that contains the keyword k. The
classification system CS partitions Dk into D1,k, D2,k, ..., Dn,k.
The keyword classification entropy KCE(k, CS) is defined as

  KCE(k, CS) = - Σ_{i=1..n} (|Di,k| / |Dk|) * log2(|Di,k| / |Dk|)

KCE(k, CS) is the entropy, over the classification CS, of the
documents that contain keyword k.
Definition 3 : Keyword Discrimination Effect -- KDE(k, CS)
The measure KDE(k, CS) reflects the importance of the
keyword k over classification CS:

  KDE(k, CS) = 1 - KCE(k, CS) / CE(CS)

A keyword is important for a classification if the documents
containing the keyword are not evenly distributed within the
classification system.
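The three entropy measures can be computed directly from a labeled collection. The sketch below assumes documents are given as (category, text) pairs and that the discrimination effect takes the form KDE = 1 - KCE/CE, i.e. one minus the ratio of the two entropies; both are illustrative assumptions, not the paper's code.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a list of nonnegative counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def CE(docs):
    """Classification entropy: docs is a list of (category, text) pairs."""
    cats = {}
    for cat, _ in docs:
        cats[cat] = cats.get(cat, 0) + 1
    return entropy(list(cats.values()))

def KCE(keyword, docs):
    """Entropy, over the classification, of documents containing keyword."""
    cats = {}
    for cat, text in docs:
        if keyword in text:
            cats[cat] = cats.get(cat, 0) + 1
    return entropy(list(cats.values()))

def KDE(keyword, docs):
    """Discrimination effect: high when the keyword's documents concentrate
    in few categories (assumed form: 1 - KCE/CE)."""
    return 1 - KCE(keyword, docs) / CE(docs)

docs = [("economics", "stock up"), ("economics", "stock down"),
        ("politics", "election"), ("politics", "stock debate")]
print(KDE("stock", docs))     # spread over both categories -> low KDE
print(KDE("election", docs))  # concentrated in one category -> KDE = 1
```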
Definition 4 : Common Keyword Similarity -- CKS(di, dj)
Let len(di) be the length of document di and wk the weight of
keyword k. The common keyword similarity of di and dj is
defined as

  CKS(di, dj) = ( Σ_{k ∈ maximal common keywords of di, dj} wk ) / sqrt(len(di) * len(dj))

The measure CKS is based on the common substrings of two
documents rather than a fixed dictionary. The weight wk of
each keyword can be given by any keyword weighting function,
such as wk = 1 or wk = KDE(k, A).
Definition 5 : Cosine Coefficient -- CC(di, dj) [1][2]
Let wik be the weight of keyword k in document di, with L
keywords in total. The Cosine Coefficient is defined as

  CC(di, dj) = ( Σ_{k=1..L} wik * wjk ) / sqrt( (Σ_{k=1..L} wik^2) * (Σ_{k=1..L} wjk^2) )

The CC is a very popular document similarity measure; it is
used as a contrast to our CKS measure in the experiments.
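The Cosine Coefficient of Definition 5 can be sketched with sparse weight vectors held as dicts; this dict representation is a common convenience, whereas the paper indexes a fixed vocabulary of L keywords.

```python
from math import sqrt

def cosine_coefficient(wi, wj):
    """Cosine of two weight vectors given as dicts keyword -> weight."""
    keys = set(wi) | set(wj)
    dot = sum(wi.get(k, 0) * wj.get(k, 0) for k in keys)
    ni = sqrt(sum(v * v for v in wi.values()))
    nj = sqrt(sum(v * v for v in wj.values()))
    return dot / (ni * nj)

# dot = 1*2 = 2, both norms are sqrt(5), so CC = 2 / 5 = 0.4
print(cosine_coefficient({"stock": 1, "market": 2}, {"stock": 2, "election": 1}))
```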
4. Experimental results
The combination of recall and precision is a popular
evaluation for information retrieval systems. It is defined as
follows.
Definition 6 : Let D be a set of documents and d a target
document. Let X be the human-defined subset of D that is
similar to d, and let Y be the subset of D that is judged
similar to d by program P. The recall and precision of
P on document set D are defined as

  recall(P, d, D) = |X ∩ Y| / |X|
  precision(P, d, D) = |X ∩ Y| / |Y|

Recall and precision are used to evaluate the quality of the
document similarity measures in the experiments. The subject
corpus of our experiment contains 10056 news articles from
CNA, March 1997.
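Definition 6 amounts to two set ratios; a minimal sketch:

```python
def recall_precision(X, Y):
    """X = human-judged similar set, Y = program-judged similar set."""
    hits = len(X & Y)
    return hits / len(X), hits / len(Y)

# 2 of the 4 relevant documents retrieved, 2 of the 3 retrieved are relevant
r, p = recall_precision({1, 2, 3, 4}, {2, 3, 5})
print(r, p)
```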
We selected a total of 116 documents containing the keyword
"陸委會" (Mainland Affairs Council) as the documents to be
reviewed in our experiment. The first document, titled "交通部
將與陸委會商討九七台港航運", was selected as the target
document. The experiments measured the similarity between the
target document and the collection of selected documents.
The recall/precision evaluation of the document similarity
measure CKS is determined by several factors. The first factor
is the keyword weighting function. In the weighting function
KDE, the classification plays an important role. Figure 3
compares classification A, "Politics / Economics / Society /
Education", with classification B, "Taiwan / Mainland China /
Overseas / Local". The recall/precision evaluation shows that
classification A performs much better than classification B in
the experiment, which indicates that classification B is not as
informative as classification A for judging document similarity
in a news corpus.
Figure 3. The effect of different classifications on the
keyword weighting function KDE (recall-precision curves
for w = KDE A and w = KDE B)
The second factor that influences the document similarity
measure is the selection of keywords. Whether or not
dictionaries are used has a great influence on this measure.
Figure 4 compares CKS with and without dictionaries.
Figure 4. The effect of the similarity measure with or
without a dictionary (recall-precision curves for CKS
without dictionary with w = KDE A, CKS with dictionary
with w = KDE A, and CC with dictionary)
The measure CC in Definition 5 was used as a contrast
measure in Figure 4. We used a dictionary collected from the
Modern and Classical Chinese Corpora at Academia Sinica
Text Databases, which contains 78410 Chinese words [7]. The
experiment shows that a system using common substrings
outperforms one using a dictionary, because many words in
the news corpus cannot be found in an ordinary dictionary.
In these experiments, KDE exhibits very good performance
except at very high recall. This is mainly because the target
document has a low similarity to itself: a long common
substring usually shows up only once in the corpus, which
lowers the weights of long keywords in the weighting
function KDE.
5. Conclusion
A Chinese document similarity measure CKS based on
common substrings and a keyword weighting function KDE are
proposed in this paper. There are two observations. First of all,
the document similarity measurement CKS without dictionaries
is better than the CC measurement that uses a dictionary.
Secondly, the classification system plays an important role in
the keyword weighting function KDE for document similarity
measurement.
References
[1] Salton, G., and Buckley, C. "Term-Weighting Approaches
in Automatic Text Retrieval," Information Processing &
Management, 24, 513-23, 1988.
[2] Salton, G. "Automatic Text Processing," Reading, Mass.:
Addison-Wesley, 1989.
[3] Rasmussen, E. "Clustering Algorithms," Information
Retrieval: Data Structures & Algorithms, Prentice Hall,
edited by W. B. Frakes and Ricardo Baeza-Yates, 419-442,
1992.
[4] Morrison, D., "PATRICIA : Practical Algorithm to
Retrieve Information Coded in Alphanumeric," JACM,
514-534, 1968.
[5] Gonnet, G. H., Baeza-Yates, R. A., Snider, T. "New Indices
for Text: PAT Trees and PAT Arrays," Information Retrieval:
Data Structures & Algorithms, Prentice Hall, edited by
W. B. Frakes and Ricardo Baeza-Yates, 66-82, 1992.
[6] Chien, L.F., "PAT-Tree-Based Keyword Extraction for
Chinese Information Retrieval," The ACM SIGIR
Conference, 50-58, 1997.
[7] Huang, C.R., Chen, K.J., "Modern and Classical Chinese
Corpora at Academia Sinica Text Databases for Natural
Language Processing and Linguistic Computing," The
Sixth CODATA Task Group Meeting on the Survey of
Data Sources in Asian-Oceanic Countries, 9-12, 1994.
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxnelietumpap1
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 

Recently uploaded (20)

HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 

A Document Similarity Measurement without Dictionaries

Chung-Chen Chen
Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.
E-mail: johnson@iis.sinica.edu.tw

Wen-Lian Hsu
Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.
E-mail: hsu@iis.sinica.edu.tw

Abstract

Document similarity measurement is an important topic in information retrieval and document classification systems. However, similarity measures designed for English cannot be applied directly to Chinese: the percentage of English vocabulary covered by an English dictionary is very high, but the corresponding coverage is much lower for Chinese. In this paper we propose a measure called common keyword similarity (CKS) that does not rely on dictionaries. CKS is based on the common substrings of two Chinese documents; each substring is weighted by its distribution over the document database. Experiments show that the weighting function has a great influence on the recall/precision evaluation, and that CKS, which uses no dictionary, outperforms a system that uses one.

1. Introduction

The measurement of document similarity is an important topic in interactive information retrieval systems and automatic classification systems. One of the most popular measures is the Cosine Coefficient (CC) based on the Vector Space Model (VSM) [1][2][3]. However, the vector space model needs a dictionary to build vectors, which is not appropriate for Mandarin Chinese: many words, such as terminologies, people's names, and place names, cannot be found in ordinary Chinese dictionaries, and the problem is further aggravated by the fact that there are no delimiters between words in Chinese. To overcome this problem, we propose an alternative measure that does not require any dictionary.

In this paper, we propose a document similarity measure based on the common substrings of two documents.
Our approach looks for all maximal common substrings using a PAT-tree [4][5][6], a tree that keeps all substring-frequency information. The importance of each substring (or keyword) is then evaluated by its discriminating effect, which indicates how well the keyword fits the pre-defined
classification. The summation of keyword importance, after normalization, is taken as the similarity between the two documents.

2. Extracting common keywords from two documents by a PAT-tree

Our approach to measuring document similarity is based on maximal common substrings, and a data structure called a PAT-tree can be used to find them. A PAT-tree is a binary trie; a trie is a tree in which each internal node splits at the first byte that differs among its siblings. Figure 1 shows a compressed trie encoding the Chinese sentence "張先生比李先生高".

Figure 1 : The compressed trie of the Chinese sentence "張先生比李先生高"

Because the substring "先生" occurs twice in the sentence of Figure 1, the trie splits at the second child of the root; the number "2" in that node indicates that the two substrings share their first two characters. A compressed binary trie has at most two children at each internal node and keeps the indices of a text in linear space. A PAT-tree is a compressed binary trie that stores the leaf-node pointer and the split position in the same node, saving the space of the leaf nodes. Figure 2 shows a compressed binary trie and its PAT-tree after scanning the first six bits of the binary sequence "01100100010111".

Figure 2a : Compressed binary trie.  Figure 2b : PAT-tree.

The following algorithm extracts the maximal common substrings of two documents using a PAT-tree.

Algorithm get-common-keyword
Parameter doc1, doc2 : document
  PAT-tree1 = build-PAT-tree(doc1)
  pos = 1
  keyword-set = empty
  while pos <= length(doc2)
    tail2 = tail(doc2, pos)
    node1 = Search(tail2)
    tail1 = tail(doc1, node1.pos)
    keyword = longest-common-head(tail1, tail2)
    if keyword is longer than 2 Chinese characters
      put keyword into keyword-set
    pos = pos + 1
  return keyword-set

3. Measuring similarity between two documents

The measurement of document similarity is normally based on a keyword weighting function.
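As a concrete illustration, the get-common-keyword loop above can be sketched in Python. This sketch substitutes a plain list of doc1's suffixes for the PAT-tree (a PAT-tree answers the same longest-common-head lookup in time proportional to the tail length); the function and variable names here are ours, not the paper's.

```python
def get_common_keywords(doc1, doc2, min_len=2):
    """Collect maximal common substrings of doc1 and doc2.

    For every tail of doc2 we look for the longest common head it
    shares with some tail of doc1, keeping it if it is longer than
    min_len characters (a brute-force stand-in for the PAT-tree search).
    """
    suffixes = [doc1[i:] for i in range(len(doc1))]
    keywords = set()
    pos = 0
    while pos < len(doc2):
        tail2 = doc2[pos:]
        best = ""
        for tail1 in suffixes:
            n = 0
            while n < min(len(tail1), len(tail2)) and tail1[n] == tail2[n]:
                n += 1
            if n > len(best):
                best = tail2[:n]
        if len(best) > min_len:      # keep keywords longer than 2 characters
            keywords.add(best)
        pos += 1
    return keywords
```

On the sentence of Figure 1 against "李先生很高", the only substring longer than two characters they share is "李先生".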
For an information retrieval system, some keywords are important while others are not, and the situation varies with context. Stop words are good examples of keywords with no importance. A keyword may be important for some queries but not for others; similarly, a keyword may be important in one classification but not in another. To illustrate the idea behind our keyword importance measurement, we shall introduce two classification systems
used in the news database of the Central News Agency (CNA) in Taiwan. Each has four categories. Classification A is "Politics / Economics / Society / Education"; classification B is "Taiwan / Mainland China / Overseas / Local". These two classification systems are somewhat orthogonal. In our measure, "Stock" is an important keyword in classification A, because stock plays an important role in the economic market, but it is not important in classification B. On the other hand, the keyword "Russia" is important in classification B but not in A, because Russia is a foreign country from Taiwan's perspective.

To measure the importance of a keyword, we propose two measures (CE and KCE) based on the distribution of documents over a classification; the ratio of these two measures reflects the importance of the keyword. Of the following five definitions, the first four are ours. The last, CC, is a popular document similarity measure used as a contrast to our CKS measure.

Definition 1 : Classification Entropy -- CE(CS)

Let D be a set of documents. The classification system CS partitions D into D1, D2, ..., Dn. Let |Di| be the number of documents in Di. The classification entropy is

  CE(CS) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}

The value of CE(CS) is high for a uniform distribution and low for a biased one.

Definition 2 : Keyword Classification Entropy -- KCE(k, CS)

Let Dk be the subset of D whose documents contain the keyword k. The classification system CS partitions Dk into D1,k, D2,k, ..., Dn,k. The keyword classification entropy is

  KCE(k, CS) = -\sum_{i=1}^{n} \frac{|D_{i,k}|}{|D_k|} \log_2 \frac{|D_{i,k}|}{|D_k|}

KCE(k, CS) is the entropy, over classification CS, of the documents that contain keyword k.

Definition 3 : Keyword Discrimination Effect -- KDE(k, CS)

The measure KDE(k, CS) reflects the importance of the keyword k for classification CS. A keyword is important for a classification if the documents containing it are not evenly distributed within the classification system:

  KDE(k, CS) = \frac{CE(CS)}{KCE(k, CS) + 1}
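To make Definitions 1-3 concrete, here is a minimal sketch over per-category document counts. The exact shape of the KDE ratio is our reconstruction: the "+ 1" in the denominator is an assumption, one that also keeps the ratio finite when a keyword occurs in a single category (KCE = 0).

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a document-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def ce(category_sizes):
    """Classification Entropy CE(CS): entropy of all documents
    over the categories of classification system CS."""
    return entropy(category_sizes)

def kce(keyword_category_sizes):
    """Keyword Classification Entropy KCE(k, CS): entropy of the
    documents containing keyword k over the same categories."""
    return entropy(keyword_category_sizes)

def kde(category_sizes, keyword_category_sizes):
    """Keyword Discrimination Effect (reconstructed as
    CE / (KCE + 1)): high when the keyword's documents are
    concentrated in few categories, i.e. when KCE is low."""
    return ce(category_sizes) / (kce(keyword_category_sizes) + 1)
```

For example, with four equal categories of 25 documents, a keyword appearing only in the first category gets KDE = 2.0, while a keyword spread evenly over all four gets KDE = 2/3.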
Definition 4 : Common Keyword Similarity -- CKS(di, dj)

Let len(di) be the length of document di and wk the weight of keyword k. The common keyword similarity of di and dj is

  CKS(d_i, d_j) = \frac{\sum_{k \in \text{maximal common keywords of } d_i, d_j} w_k^2}{\sqrt{len(d_i) \times len(d_j)}}

The measure CKS is based on the common substrings of two documents rather than on a fixed dictionary. The weight wk of each keyword can be given by any keyword weighting function, such as wk = 1 or wk = KDE(k, A).

Definition 5 : Cosine Coefficient -- CC(di, dj) [1][2]

Let wik be the weight of keyword k in document di. The cosine coefficient is

  CC(d_i, d_j) = \frac{\sum_{k=1}^{L} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{L} w_{ik}^2} \sqrt{\sum_{k=1}^{L} w_{jk}^2}}

CC is very popular for document similarity measurement and is used for comparison with our CKS measure in the experiments.

4. Experimental results

The combination of recall and precision is a popular evaluation for information retrieval systems; it is given in Definition 6 below.
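A sketch of the two similarity measures of Definitions 4 and 5. Keyword extraction and weighting are supplied by the caller, and the square-root length normalization in cks is our reading of the extracted formula, so treat it as an assumption.

```python
import math

def cks(doc_i, doc_j, common_keywords, weight):
    """Common Keyword Similarity (Definition 4): squared keyword
    weights summed over the maximal common keywords, normalized by
    the geometric mean of the two document lengths."""
    numerator = sum(weight(k) ** 2 for k in common_keywords)
    return numerator / math.sqrt(len(doc_i) * len(doc_j))

def cc(w_i, w_j):
    """Cosine Coefficient (Definition 5) over two aligned
    keyword-weight vectors."""
    numerator = sum(a * b for a, b in zip(w_i, w_j))
    denominator = (math.sqrt(sum(a * a for a in w_i))
                   * math.sqrt(sum(b * b for b in w_j)))
    return numerator / denominator
```

Unlike cc, which needs a fixed keyword vocabulary to build the two vectors, cks only needs the common substrings of the two documents, which is what lets it run without a dictionary.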
Definition 6 :

Let D be a set of documents and d a target document. Let X be the human-selected subset of D judged similar to d, and let Y be the subset of D determined similar to d by program P. The recall and precision of P on document set D are

  recall(P, d, D) = \frac{|X \cap Y|}{|X|}, \quad precision(P, d, D) = \frac{|X \cap Y|}{|Y|}

Recall and precision are used to evaluate the quality of the document similarity measures in our experiments. The subject corpus contains 10,056 news articles from CNA, March 1997. We selected the 116 documents that contain the keyword "陸委會" (the Mainland Affairs Council) as the documents to be reviewed. The first document, titled "交通部將與陸委會商討九七台港航運" (the Ministry of Transportation and Communications will discuss post-1997 Taiwan-Hong Kong shipping with the Mainland Affairs Council), was selected as the target document. The experiments measured the similarity between the target document and the collection of selected documents.

The recall/precision evaluation of the similarity measure CKS is determined by several factors. The first is the keyword weighting function. In the weighting function KDE, the classification plays an important role. Figure 3 compares classification A ("Politics / Economics / Society / Education") with classification B ("Taiwan / Mainland China / Overseas / Local"). The evaluation shows that classification A is much better than classification B in this experiment; that is, classification B is less useful than classification A for judging document similarity in the news corpus.

Figure 3. The effect of different classifications on the keyword weighting KDE

The second factor is the selection of keywords, where the use or non-use of dictionaries has a great influence. Figure 4 compares CKS with and without dictionaries.

Figure 4. The effect of the similarity measure with or without a dictionary

The measure CC of Definition 5 was used as a contrast measure in Figure 4.
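Definition 6 amounts to a few lines over sets of document identifiers:

```python
def recall_precision(relevant, retrieved):
    """Recall and precision of Definition 6: relevant is the
    human-judged set X, retrieved is the set Y produced by
    program P; both are sets of document identifiers."""
    hit = len(relevant & retrieved)
    return hit / len(relevant), hit / len(retrieved)
```

For instance, retrieving {3, 4, 5} against a relevant set {1, 2, 3, 4} gives recall 2/4 and precision 2/3.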
We used the dictionary collected from the Modern and Classical Chinese Corpora at Academia Sinica Text Databases, which contains 78,410 Chinese words [7]. The experiment shows that a system using common substrings is better than one using a dictionary, because many words in the news corpus cannot be found in an ordinary dictionary. In these experiments, KDE exhibits very good performance except at very high recall, mainly because the target document has a low similarity to itself: a long common substring usually shows up only once in the corpus, which lowers the weights of long keywords under the weighting function KDE.

5. Conclusion

A Chinese document similarity measure, CKS, based on common substrings, and a keyword weighting function, KDE, are proposed in this paper. There are two observations. First, the similarity measure CKS without dictionaries outperforms the CC measure that uses a dictionary. Second, the classification system plays an important role in the keyword weighting function KDE for document similarity measurement.

References

[1] Salton, G., and Buckley, C., "Term-Weighting Approaches in Automatic Text Retrieval," Information Processing & Management, 24, 513-523, 1988.
[2] Salton, G., "Automatic Text Processing," Reading, Mass.: Addison-Wesley, 1989.
[3] Rasmussen, E., "Clustering Algorithms," in Information Retrieval: Data Structures & Algorithms, Prentice Hall, edited by W. B. Frakes and R. Baeza-Yates, 419-442, 1992.
[4] Morrison, D., "PATRICIA: Practical Algorithm to Retrieve Information Coded in Alphanumeric," JACM, 514-534, 1968.
[5] Gonnet, G. H., Baeza-Yates, R. A., and Snider, T., "New Indices for Text: PAT Trees and PAT Arrays," in Information Retrieval: Data Structures & Algorithms, Prentice Hall, edited by W. B. Frakes and R. Baeza-Yates, 66-82, 1992.
[6] Chien, L.F., "PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval," The ACM SIGIR Conference, 50-58, 1997.
[7] Huang, C.R., and Chen, K.J., "Modern and Classical Chinese Corpora at Academia Sinica Text Databases for Natural Language Processing and Linguistic Computing," The Sixth CODATA Task Group Meeting on the Survey of Data Sources in Asian-Oceanic Countries, 9-12, 1994.