IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 2, Ver. IV (Mar – Apr. 2015), PP 32-41
www.iosrjournals.org
DOI: 10.9790/0661-17243241
An Enhanced Suffix Tree Approach to Measure Semantic
Similarity between Multiple Documents
A. Kavitha¹, Dr. N. Rajkumar², Dr. S. P. Victor³
¹(Research Scholar, Manonmaniam Sundaranor University, Tirunelveli, India)
²(Dept. of M.E. S/w Engg., Professor & Head, Sri Ramakrishna Engineering College, India)
³(Dept. of MCA, Professor & Head, St. Xavier College, Palayamkottai, Tirunelveli, India)
Abstract: Semantic similarity is a concept whereby a set of documents is measured to find the likeness of their meaning content. Document similarity is the process of computing the semantic similarity between multiple documents using similarity measures. In this paper, document similarity is computed as the pairwise similarity of documents based on the Suffix Tree Document (STD) model. Documents are first pre-processed, which improves the efficiency of the similarity computation. The pre-processed phrases are then inserted into a suffix tree. A suffix tree is a data structure that presents the suffixes of a given string in a way that allows a particularly fast implementation of many important string operations. The suffix substrings are selected as the phrases that label the edges of the suffix tree. Internal nodes represent phrases that are shared by multiple documents, and the more internal nodes two documents share, the more similar they are considered to be. A suffix tree can be used to solve the exact matching problem in linear time. The phrase-based document similarity naturally inherits the tf-idf (term frequency and inverse document frequency) weighting scheme. The tf-idf method is used to calculate the weight of the internal nodes of the suffix tree, where internal nodes are the nodes shared by multiple documents. The Cosine, Dice and Hellinger measures are applied to find the pairwise similarity based on the weight of each internal node of the suffix tree.
Keywords: Semantic similarity, Similarity measures, Document similarity, Suffix tree and Tf-idf scheme.
I. Introduction
Semantic similarity is a field in which a set of documents or terms is assigned a metric based on the likeness of their meaning content. Document similarity plays a vital role in information retrieval using clustering techniques [11][7]. The main goal of the system is to compute the semantic similarity between multiple documents. The system takes several documents as input from the user and finds the similarity between them using different similarity measures. Document preprocessing consists of stop-word removal, case conversion and special-character removal. Phrases are extracted from the documents to construct the suffix tree and are used to label the edges of the suffix tree [1][10]. A suffix tree is a data structure that presents the suffixes of a given string in a way that allows a particularly fast implementation of many important string operations [14][9]. The tf-idf method is used to calculate the weight of the internal nodes of the suffix tree, where internal nodes are the nodes shared by multiple documents. The Cosine similarity, Dice coefficient and Hellinger measures are used to find the pairwise similarity based on the weight of each internal node of the suffix tree [5][7]. Document similarity is reported as a value between 0 and 1, where 1 implies absolute similarity and 0 implies that the two documents are not similar.
1.1 Semantic Similarity
Semantic similarity measures can be classified into pairwise and groupwise similarity measures. A pairwise measure computes the functional similarity between two instances by combining the semantic similarities of the concepts they represent. A groupwise measure calculates the similarity directly, without combining the semantic similarities of the concepts they represent.
Semantic similarity is a widely used approach and is associated with several applications [15]. Similarity measures are used in conjunction with a corpus to retrieve all kinds of information, and they also help to retrieve information on the web [3][4][8].
1.2 Data Preprocessing
Data pre-processing in the existing approach consists of three phases, namely special-character removal, stop-word removal and case conversion. Pre-processing helps to minimize the document size and the comparison time. In the first phase, a list of 32 special characters is removed from all the documents [1]. A few of these special characters are shown in fig. 1.
!, @, #, $, %, ^, &, *, (, ), -, =, +, _, [, ], ;, :, |, <, >, ?, /, `, ~
Figure 1. Special characters list
The second phase is the removal of stop words: a list of 256 stop words is eliminated from all the input documents. Part of the stop-word list is presented in fig. 2.
a, an, the, is, are, there, who, what, when, how, much, this, that, etc.
Figure 2. Stop Words List
The third phase is case conversion, which converts the entire document from uppercase to lowercase.
Example
The preprocessing steps are illustrated on the following document in fig. 3 and fig. 4.
Computer science or Computing science (abbreviated as CS or CompSci) is the scientific and practical approach to
computation and its applications. A computer scientist specializes in the theory of computation and the design of
computational systems.
Figure 3. Document1
Computer science Computing science abbreviated CS or CompSci scientific practical approach computation
applications computer scientist specialize theory computation design computational systems
Figure 4. Preprocessed documents
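The three preprocessing phases can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the name `preprocess` is hypothetical, the stop-word set is the short excerpt from fig. 2 rather than the full 256-word list, and every non-alphanumeric character is treated as a special character, which is an assumption.

```python
import re

# Illustrative subset only (the excerpt shown in fig. 2), not the full 256-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "there", "who", "what",
              "when", "how", "much", "this", "that"}

def preprocess(text, stop_words=STOP_WORDS):
    """Apply special-character removal, case conversion and stop-word removal."""
    # Special characters: replace anything that is not a letter, digit or whitespace.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Case conversion is done before stop-word removal here so that
    # stop-word matching is case-insensitive (an ordering assumption).
    tokens = cleaned.lower().split()
    # Stop-word removal.
    return [t for t in tokens if t not in stop_words]

tokens = preprocess("A computer scientist specializes in the theory of computation.")
print(tokens)
```

With the short stop-word excerpt used here, words such as "in" and "of" survive; a fuller list would remove them as in fig. 4.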
II. Related Work
Hung Chim and Xiaotie Deng (2008) proposed a method to compute document similarity. The main objective of their work was a phrase-based document similarity that computes the pairwise similarities of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of the STD model to a unique feature term in the Vector Space Document (VSD) model, the phrase-based document similarity naturally inherits the tf-idf weighting scheme in computing the document similarity with phrases [1].
Elias Iosif and Alexandros Potamianos presented Web-based metrics that compute the semantic similarity between words or terms and compared them with the state of the art. Starting from the fundamental assumption that similarity of context implies similarity of meaning, relevant Web documents are downloaded via a Web search engine and the contextual information of the words of interest is compared (context-based similarity metrics). In addition, the proposed unsupervised context-based similarity computation algorithms are shown to be competitive with state-of-the-art supervised semantic similarity algorithms based on language-specific knowledge resources [2].
Chen et al. studied story link detection systems, which determine whether two stories are about the same event or are linked, and which are usually based on the cosine similarity between the two stories. Their work presents a method for increasing the performance of a link detection system by using a variety of similarity measures and source-pair specific information. Similarity measures such as cosine, Hellinger, Tanimoto and clarity were used, both alone and in combination [5]. Kondola et al. presented two methods to learn semantic similarity between documents: one based on document similarity and the other based on co-occurrence information [13].
Takale and Nandgaonkar presented a method to compute the similarity between words through web documents. Semantic similarity measures play an important role in the extraction of semantic relations. Their method uses web-based metrics to compute the semantic similarity between words or terms and compares them with the state of the art. The similarity measures proposed in their work are based on five different association measures from information retrieval: normal matching, Dice, Jaccard, Overlap and Cosine coefficient. The performance of these methods was evaluated using the Miller and Charles benchmark dataset [6].
Anna Huang analyzed the effectiveness of similarity measures in partitional clustering for text document datasets. The approach used the standard K-means algorithm and reported results on several text document datasets for five distance/similarity measures that are most commonly used in text clustering [7]. Huang and Kuo presented work on cross-language retrieval using semantic similarity measures. They applied fuzzy models to represent documents and used similarity approaches to retrieve information [12].
III. Proposed Work
The proposed system includes four major methods to compute an efficient similarity between
document work namely Data Preprocessing, Suffix tree, Node Weight calculation and Similarity Measures. The
proposed work includes the stop nodes removal that is removal of symbols, Stop words and Case Conversion.
Phrases can be extracted from the pre-processed data. Each internal node has at least two children and each edge
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multiple Documents
DOI: 10.9790/0661-17243241 www.iosrjournals.org 34 | Page
is labeled with a nonempty sub-string of a document known as a sentence. Every leaf node in the suffix tree
designates a suffix sub-string of a document; each internal node shows a phrase shared by at least two suffix
sub-strings. The similarity of two documents is defined as the more internal nodes shared by the two documents,
the more related the documents be likely and includes different similarity measures to show the different
between the range of the similarity and the flow of proposed the similarity measures includes three different
measures such as Hellinger, Jacard and Dice coefficient. The proposed work is shown in the Fig. 5.
Figure 5. Proposed system Architecture
3.1 Suffix Tree
A suffix tree is a tree-like data structure for solving problems involving strings; it allows all sub-strings of a given string S to be stored in linear space. Each internal node, except the root node, has at least two children and every edge is labeled with a nonempty sub-string of S. The suffix tree is one of the best-known full-text index data structures. It has been studied for decades and is used in many algorithmic solutions and practical applications. To construct the suffix tree, phrases are extracted from the preprocessed document and each edge is labeled with a nonempty sub-string of a document, called a phrase. There are three kinds of nodes in the suffix tree: the root node, internal nodes and leaf nodes. Every internal node represents a common phrase shared by at least two suffix sub-strings. The more internal nodes two documents share, the more similar the documents are considered to be. Leaf nodes are also called terminal nodes. Each node in the suffix tree other than the root, whether an internal node or a leaf node, represents a nonempty phrase that appears in at least one document in the data set. The same phrase may occur on different edges of the suffix tree. The suffix tree of a document set is a compact trie containing all suffix sub-strings of the documents in the data set. During suffix tree construction, the root node is the initial node and the ancestor of all other nodes. All other nodes are created and stored in hierarchical order, following their longest common prefix (LCP) nodes. In our contribution, all the child nodes of the root node are defined as first-level nodes of the suffix tree, the child nodes of the first-level nodes as second-level nodes, and so on.
To build a suffix tree, the naive and straightforward method compares each suffix sub-string of the document with all suffix sub-strings that already exist in the tree and finds a position to insert it. The time complexity of building the suffix tree in this way for a document of m words is O(m²).
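The naive construction can be illustrated with a small Python sketch. It makes several simplifying assumptions: it builds an uncompressed word-level suffix trie (one word per edge) rather than a compact suffix tree, and the names `Node`, `build_suffix_trie` and `shared_phrases` are hypothetical. Each node records which documents pass through it, so phrases shared by at least two documents can be read off directly.

```python
class Node:
    def __init__(self):
        self.children = {}   # word -> Node
        self.docs = set()    # ids of documents whose suffixes pass through this node

def build_suffix_trie(documents):
    """Naively insert every word-level suffix of every document (quadratic in words)."""
    root = Node()
    for doc_id, words in enumerate(documents):
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
                node.docs.add(doc_id)
    return root

def shared_phrases(root, min_docs=2):
    """Return {phrase: set of doc ids} for nodes reached by at least min_docs documents."""
    result = {}
    stack = [(root, ())]
    while stack:
        node, prefix = stack.pop()
        for word, child in node.children.items():
            if len(child.docs) >= min_docs:
                phrase = prefix + (word,)
                result[" ".join(phrase)] = set(child.docs)
                # Only a shared phrase can be extended to a longer shared phrase.
                stack.append((child, phrase))
    return result

doc1 = "computer science computing science".split()
doc2 = "computer science appears 1959 article".split()
trie = build_suffix_trie([doc1, doc2])
print(sorted(shared_phrases(trie)))  # ['computer', 'computer science', 'science']
```

A compact suffix tree would merge chains of single-child nodes into one edge; the sketch keeps them separate for readability.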
Example
Consider the two documents:
Document 1: Computer science computing science abbreviated cs or compsci scientific practical approach computation article computer scientist specializes theory computation design computational systems
Document 2: Computer science appears 1959 article communication Human interaction considers challenges making computers computations useful usable universally accessible humans
Figure 6. Suffix tree
Nodes shared in the above Suffix tree are A, B and E.
3.2 Weight Calculation
The weight of a node is calculated using the tf-idf weighting scheme, where tf refers to term frequency and idf refers to inverse document frequency. Tf-idf is a numerical statistic that reflects how important a word is to a document in a collection. It is frequently used as a weighting factor in information retrieval and text mining.
The term frequency tf(t,d) is the number of times that term t occurs in document d.
The inverse document frequency (idf) is a measure of whether the term is common or rare across all documents. It is derived from the ratio of the total number of documents to the number of documents containing the term.
A document is represented by the vector of its node weights, as in equation (1).
$$ d = \{ w(1,d), w(2,d), \ldots, w(m,d) \} \qquad (1) $$
where w denotes a weight and m is the number of terms. The weight of a term is calculated using equation (2).
$$ w(i,d) = (1 + \log tf(i,d)) \cdot \log(1 + N/df(i)) \qquad (2) $$
where tf(i,d) is the frequency of the i-th term in document d, df(i) is the number of documents containing the i-th term, and N is the total number of documents.
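Equation (2) translates directly into code. The following is a minimal sketch, assuming the natural logarithm (which matches the value log 2 = 0.693 used in the worked example); the function name `node_weight` and the convention that an absent term gets weight 0 are assumptions made for illustration.

```python
import math

def node_weight(tf, df, n_docs):
    """w(i,d) = (1 + log tf(i,d)) * log(1 + N / df(i)), using the natural logarithm."""
    if tf <= 0:
        return 0.0  # assumption: a term that does not occur in the document has weight 0
    return (1 + math.log(tf)) * math.log(1 + n_docs / df)

# A term occurring once in a document and present in both of N = 2 documents.
print(round(node_weight(tf=1, df=2, n_docs=2), 3))  # 0.693
```

A term that occurs twice under the same conditions gets a weight of about 1.17, because only the factor 1 + log tf changes.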
Example: calculating the weight of the internal nodes shared by multiple documents.
The internal nodes shared by multiple documents in fig. 6 are nodes A, B and E. The logarithm is the natural logarithm and N = 2.
Node A (computer) in Document 1, with tf(computer, doc1) = 1 and df(computer) = 2:
$$ w(A, \text{doc1}) = (1 + \log 1) \cdot \log(1 + 2/2) = (1 + 0) \cdot \log(1 + 1) = 1 \cdot 0.693 = 0.693 $$
Node B (science) in Document 1, with tf(science, doc1) = 1 and df(science) = 2:
$$ w(B, \text{doc1}) = (1 + \log 1) \cdot \log(1 + 2/2) = (1 + 0) \cdot \log(1 + 1) = 0.693 $$
Hence w(computer, doc1) = 0.693 and w(science, doc1) = 0.693.
The weights of the remaining nodes with respect to Document 1 and Document 2 are calculated similarly.
The node weight table constructed from the above calculation is shown in table 1.
Table 1. Node Weight Table
NODE    DOC 1    DOC 2
A       0.693    0.693
B       0.693    1.173
E       0.693    0.693
3.3 Similarity Measures
A similarity measure computes the semantic similarity of documents and expresses it as a similarity value; it can be applied to multiple documents. The measure reflects the degree of closeness or likeness of two documents. All similarity measures should map to the range [-1, 1] or [0, 1], where -1 or 0 indicates minimum similarity and 1 indicates maximum similarity. The proposed approach applies three different similarity measures: Cosine similarity, Dice coefficient and Hellinger measure. A large number of similarity measures have been proposed in the literature, since no single best similarity measure exists.
3.3.1 Cosine Similarity
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The resulting similarity ranges from -1 to 1, with 0 usually indicating independence and values in between indicating intermediate similarity or dissimilarity. For document similarity, the cosine similarity of two documents ranges from 0 to 1, because the term frequencies cannot be negative.
$$ \text{Cosine Similarity} = \frac{d_x \cdot d_y}{|d_x| \, |d_y|} = \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2} \, \sqrt{\sum_{i=1}^{m} y_i^2}} \qquad (3) $$
where d_x and d_y are the documents, d_x = {x_1, x_2, ..., x_m} and d_y = {y_1, y_2, ..., y_m}, x_i and y_i are the weights of the corresponding nodes, and m is the number of internal nodes.
Doc 1 = {A, B, E}
Doc 2 = {A, B, E}
Let x, y and z denote the weights of nodes A, B and E, with subscripts 1 and 2 denoting Document 1 and Document 2.
$$ \text{Cosine} = \frac{x_1 x_2 + y_1 y_2 + z_1 z_2}{(x_1^2 + y_1^2 + z_1^2)^{1/2} \, (x_2^2 + y_2^2 + z_2^2)^{1/2}} $$
$$ = \frac{(0.693 \cdot 0.693) + (0.693 \cdot 1.173) + (0.693 \cdot 0.693)}{(0.693^2 + 0.693^2 + 0.693^2)^{1/2} \, (0.693^2 + 1.173^2 + 0.693^2)^{1/2}} $$
$$ = \frac{1.7732}{(1.4406)^{1/2} \, (2.257)^{1/2}} = \frac{1.7732}{(1.200)(1.502)} = 0.98 $$
The cosine similarity for Document 1 and Document 2 is 0.98.
3.3.2 Dice Coefficient
The Dice coefficient determines how similar two sets are. It can be applied to measure how similar two documents are in terms of the number of common bi-grams. Here it is used as a statistic for comparing the similarity of two documents.
$$ \text{Dice coefficient} = \frac{2 \, A \cdot B}{|A|^2 + |B|^2} = \frac{2 \sum_{i=1}^{m} x_i y_i}{\sum_{i=1}^{m} x_i^2 + \sum_{i=1}^{m} y_i^2} \qquad (4) $$
where A and B in equation (4) denote the weight vectors of the two documents.
Doc 1 = {A, B, E}
Doc 2 = {A, B, E}
$$ \text{Dice coefficient} = \frac{2 \, ((0.693 \cdot 0.693) + (0.693 \cdot 1.173) + (0.693 \cdot 0.693))}{(0.693^2 + 0.693^2 + 0.693^2) + (0.693^2 + 1.173^2 + 0.693^2)} $$
$$ = \frac{2 \, (1.7732)}{1.4406 + 2.257} = \frac{3.5464}{3.6976} = 0.956 $$
The Dice coefficient for Document 1 and Document 2 is 0.956.
3.3.3 Hellinger Distance
The Hellinger distance measures the distance between two probability distributions. It is closely related to the total variation distance: both distances define the same topology on the space of probability measures, but the Hellinger distance has several technical advantages derived from properties of inner products.
$$ \text{Hellinger} = \frac{\sum_{i=1}^{m} x_i y_i}{\left( \sum_{i=1}^{m} x_i^2 + \sum_{i=1}^{m} y_i^2 \right) - \sum_{i=1}^{m} x_i y_i} \qquad (6) $$
$$ = \frac{(0.693 \cdot 0.693) + (0.693 \cdot 1.173) + (0.693 \cdot 0.693)}{(0.693^2 + 0.693^2 + 0.693^2) + (0.693^2 + 1.173^2 + 0.693^2) - ((0.693 \cdot 0.693) + (0.693 \cdot 1.173) + (0.693 \cdot 0.693))} $$
$$ = \frac{1.7732}{(1.4406 + 2.1334) - 1.7732} = \frac{1.7732}{3.574 - 1.7732} = 0.984 $$
The Hellinger measure for Document 1 and Document 2 is 0.984.
The comparison of the two documents using the Cosine, Dice and Hellinger measures is shown in table 2.
Table 2. Comparison of the similarity measures
Measure      Similarity value
Cosine       0.98
Dice         0.956
Hellinger    0.984
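The three measures can be evaluated on the Table 1 node weights with a short Python sketch. The function names are hypothetical, and the third function implements the expression that the worked example in this section evaluates under the name Hellinger; in the wider literature the same expression is commonly known as the Tanimoto (extended Jaccard) coefficient. The vectors are assumed to be non-zero and of equal length. Evaluating the formulas directly in floating point need not reproduce the rounded intermediate figures printed in the worked examples, so the printed values should be compared with the text with care.

```python
import math

def dot(x, y):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """x.y / (|x| |y|)"""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def dice(x, y):
    """2 x.y / (|x|^2 + |y|^2)"""
    return 2 * dot(x, y) / (dot(x, x) + dot(y, y))

def hellinger_as_in_text(x, y):
    """x.y / (|x|^2 + |y|^2 - x.y), the expression evaluated in the worked example."""
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))

# Node weights for nodes A, B and E, taken from Table 1.
doc1 = [0.693, 0.693, 0.693]
doc2 = [0.693, 1.173, 0.693]

for name, measure in [("Cosine", cosine), ("Dice", dice),
                      ("Hellinger (as in text)", hellinger_as_in_text)]:
    print(f"{name}: {measure(doc1, doc2):.3f}")
```

All three functions return 1 for identical vectors and 0 for vectors with no shared non-zero component, which matches the [0, 1] range stated for non-negative weights.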
IV. Performance Evaluation
To evaluate the performance of the proposed system, it was implemented using NetBeans IDE version 7.2 for the user interface and the computations, with Microsoft Access as the database. A set of standard data from www.Wikipedia.com and some datasets from www.uc.dataset.org were collected and used in the evaluation of the system.
The system gives document similarity values between 0 and 1. Any number of documents can be compared to obtain the similarity values using the Cosine, Dice and Hellinger measures. The preprocessing method reduces the complexity of the suffix tree and increases the accuracy of the similarity measures by eliminating irrelevant terms and symbols as nodes. String matching and term weights can be easily calculated using the suffix tree. Fig. 7 shows that the size of the suffix tree grows linearly with the size of the documents. The line shows the number of internal nodes in the suffix tree against the number of nodes in every document.
Figure 7. The size of suffix tree scales linearly to the size of document
Figure 8. Time cost for Similarity and suffix tree construction
Fig. 8 shows the time required to construct the suffix tree and to calculate the similarity. The time gradually increases with the number of documents in the system. The comparison of the similarity results from the Hellinger, Cosine and Dice measures is presented in fig. 9.
[Figure 9 chart. Vertical axis: Similarity value (0 to 1.2). Horizontal axis: Documents (DOC 1-2, DOC 1-3, DOC 1-4, DOC 2-3, DOC 2-4, DOC 3-4). Series: Hellinger, Cosine, Dice.]
Figure 9. Comparison of different similarity measures
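The document pairs on the horizontal axis of Figure 9 (DOC 1-2 through DOC 3-4) are all unordered pairs of four documents. A comparison of this kind could be sketched as follows. The helper name `pairwise_similarity` is hypothetical, only cosine similarity is shown, and the weight vectors for DOC 3 and DOC 4 are made-up values for illustration, not data from the experiments.

```python
from itertools import combinations
import math

def cosine(x, y):
    """Cosine similarity of two equal-length, non-zero weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def pairwise_similarity(vectors, measure):
    """Return {(name_i, name_j): similarity} for every unordered pair of documents."""
    return {(a, b): measure(vectors[a], vectors[b])
            for a, b in combinations(sorted(vectors), 2)}

# DOC 1 and DOC 2 use the Table 1 weights; DOC 3 and DOC 4 are hypothetical values.
vectors = {
    "DOC 1": [0.693, 0.693, 0.693],
    "DOC 2": [0.693, 1.173, 0.693],
    "DOC 3": [1.173, 0.693, 0.0],
    "DOC 4": [0.0, 0.693, 1.173],
}

scores = pairwise_similarity(vectors, cosine)
for (a, b), value in scores.items():
    print(f"{a} vs {b}: {value:.3f}")
```

Passing a different measure function with the same two-vector signature would produce the corresponding series for the other measures.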
V. Conclusion And Future Work
The paper computes the similarity of multiple documents and reports it as similarity values. The concepts of the suffix tree and the new document similarity are quite simple, but the implementation of these approaches is somewhat complicated. To improve the performance of the document similarity, we investigated the STD model in both the theoretical data structure analysis and the clustering algorithm optimization. The efficiency of the new document similarity approach has been demonstrated in our experiments on a large document dataset. The tf-idf weights of phrases were used in computing document similarities and proved to be very effective. Our work reports a successful approach to extending the usage of the tf-idf weighting scheme: the scheme is suitable for evaluating the importance of not only keywords but also phrases in document clustering. Replacing the suffix tree with enhanced suffix arrays improves space efficiency; enhanced suffix arrays can support the algorithms of the suffix tree while reducing its space and time costs. Future work will focus on accepting all types of documents when determining similarity.
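An enhanced suffix array, as described in [10], combines a suffix array with additional tables such as the longest common prefix (LCP) table. The sketch below builds only these two basic tables at word level, using a naive sort rather than an efficient construction algorithm; the function names are hypothetical and the code is meant purely to illustrate the data structure.

```python
def suffix_array(words):
    """Start indices of all word-level suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(words)), key=lambda i: words[i:])

def lcp_array(words, sa):
    """lcp[k] = number of leading words shared by suffixes sa[k-1] and sa[k]; lcp[0] = 0."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = words[sa[k - 1]:], words[sa[k]:]
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp[k] = n
    return lcp

words = "computer science computing science".split()
sa = suffix_array(words)
print(sa)                    # [0, 2, 3, 1]
print(lcp_array(words, sa))  # [0, 0, 0, 1]
```

A non-zero LCP entry marks a phrase shared by two adjacent suffixes, which is the information an internal node of the suffix tree carries, stored here in two flat integer arrays.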
References
[1]. Hung Chim and Xiaotie Deng, “Efficient Phrase-Based Document Similarity for Clustering” IEEE Transactions On Knowledge
And Data Engineering, Vol. 20, No. 9, pp. 1217-1229, 2008.
[2]. Elias Iosif, Alexandros Potamianos, “Unsupervised Semantic Similarity Computation between Terms Using Web Documents”
IEEE Transactions On Knowledge And Data Engineering, Vol. 22, No. 11, pp: 1637-1647, 2010.
[3]. Angelos Hliaoutakis, et al. , “Information Retrieval by Semantic Similarity” , in. International Journal on Semantic Web &
Information Systems, Vol.2, No.3, pp.55-73, 2006.
[4]. Giannis Varelas et al., “Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web”,
Proc. of the 7th ACM International workshop on Web information and Data Management , pp. 10 -16, 2005.
[5]. Francine Chen, Ayman Farahat, Thorsten Brants, “Multiple Similarity Measures and Source-Pair Information in Story Link
Detection”, Proc. of Human Language Technology Conference, pp. 313-320, Chicago, 2004.
[6]. Sheetal A. Takale, Sushma S. Nandgaonkar , “Measuring Semantic Similarity between Words Using Web Documents”,
International Journal of Advanced Computer Science and Applications, Vol. 1, No.4, pp.78-85, 2010.
[7]. Anna Huang, “Similarity Measures for Text Document Clustering”, Computer Science Research Student Conference, pp.49-56,
New Zealand, 2008.
[8]. Danushka Bollegala, Yutaka Matsuo and Mitsuru Ishizuka, “A Web Search Engine-Based Approach to Measure Semantic
Similarity between Words”, Knowledge And Data Engineering, Vol. 23, No. 7, pp.977-990, 2011.
[9]. D.S. Sven Meyer zu Eissen and M. Potthast, “The Suffix Tree Document Model Revisited,” Proc. Fifth Int’l Conf. Knowledge
Management (I-Know ’05), pp. 596-603, 2005.
[10]. Mohamed Ibrahim Abouelhoda, Stefan Kurtz and Enno Ohlebusch, “Replacing suffix trees with enhanced suffix arrays”, Journal of
Discrete Algorithms Vol.2, No.1, pp.53–86, 2003.
[11]. Behnam Hajian and Tony White, “Measuring Semantic Similarity using a Multi-Tree Model”, Proc. of 9th Workshop on
Intelligent Techniques for Web Personalization and Recommender Systems, pp. 7–14, Spain, 2011.
[12]. Hsun-Hui Huang and Yau-Hwang Kuo, “Cross-Lingual Document Representation and Semantic Similarity Measure: A Fuzzy Set
and Rough Set Based Approach”, IEEE Transactions on fuzzy systems, vol.18, no.6, pp.1098-1111, 2010.
[13]. Jaz Kondola, John Shawe-Taylor, Nello Cristianini, “Learning Semantic Similarity”, Proc. of Neural Information Processing Systems,
vol.15, pp.657-664, Canada, 2003.
[14]. E. Ukkonen, “On-Line Construction of Suffix Trees,” Algorithmica, vol. 14, no. 3, pp. 249-260, 1995.
[15]. Dekang Lin, “An Information-Theoretic Definition of Similarity”, Proc. of 15th International Conference on Machine Learning, pp. 296-304, Wisconsin, USA, 1998.

A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
AIRCC Publishing Corporation
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
IJMER
 
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting TechniqueTesting Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Technique
kevig
 
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting TechniqueTesting Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Technique
kevig
 
Correlation Coefficient Based Average Textual Similarity Model for Informatio...
Correlation Coefficient Based Average Textual Similarity Model for Informatio...Correlation Coefficient Based Average Textual Similarity Model for Informatio...
Correlation Coefficient Based Average Textual Similarity Model for Informatio...
IOSR Journals
 
C017161925
C017161925C017161925
C017161925
IOSR Journals
 
Context Sensitive Relatedness Measure of Word Pairs
Context Sensitive Relatedness Measure of Word PairsContext Sensitive Relatedness Measure of Word Pairs
Context Sensitive Relatedness Measure of Word Pairs
IJCSIS Research Publications
 
J017145559
J017145559J017145559
J017145559
IOSR Journals
 
Challenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document ClusteringChallenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document Clustering
IOSR Journals
 

Similar to An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multiple Documents (20)

International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
L0261075078
L0261075078L0261075078
L0261075078
 
L017158389
L017158389L017158389
L017158389
 
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
 
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
 
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAIDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
 
Identifying the semantic relations on
Identifying the semantic relations onIdentifying the semantic relations on
Identifying the semantic relations on
 
A03730108
A03730108A03730108
A03730108
 
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
 
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusA Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
 
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusA Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
 
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
 
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting TechniqueTesting Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Technique
 
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting TechniqueTesting Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Technique
 
Correlation Coefficient Based Average Textual Similarity Model for Informatio...
Correlation Coefficient Based Average Textual Similarity Model for Informatio...Correlation Coefficient Based Average Textual Similarity Model for Informatio...
Correlation Coefficient Based Average Textual Similarity Model for Informatio...
 
C017161925
C017161925C017161925
C017161925
 
Context Sensitive Relatedness Measure of Word Pairs
Context Sensitive Relatedness Measure of Word PairsContext Sensitive Relatedness Measure of Word Pairs
Context Sensitive Relatedness Measure of Word Pairs
 
J017145559
J017145559J017145559
J017145559
 
Challenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document ClusteringChallenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document Clustering
 

More from iosrjce

An Examination of Effectuation Dimension as Financing Practice of Small and M...
An Examination of Effectuation Dimension as Financing Practice of Small and M...An Examination of Effectuation Dimension as Financing Practice of Small and M...
An Examination of Effectuation Dimension as Financing Practice of Small and M...
iosrjce
 
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
Does Goods and Services Tax (GST) Leads to Indian Economic Development?Does Goods and Services Tax (GST) Leads to Indian Economic Development?
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
iosrjce
 
Childhood Factors that influence success in later life
Childhood Factors that influence success in later lifeChildhood Factors that influence success in later life
Childhood Factors that influence success in later life
iosrjce
 
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
iosrjce
 
Customer’s Acceptance of Internet Banking in Dubai
Customer’s Acceptance of Internet Banking in DubaiCustomer’s Acceptance of Internet Banking in Dubai
Customer’s Acceptance of Internet Banking in Dubai
iosrjce
 
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
iosrjce
 
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
Consumer Perspectives on Brand Preference: A Choice Based Model ApproachConsumer Perspectives on Brand Preference: A Choice Based Model Approach
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
iosrjce
 
Student`S Approach towards Social Network Sites
Student`S Approach towards Social Network SitesStudent`S Approach towards Social Network Sites
Student`S Approach towards Social Network Sites
iosrjce
 
Broadcast Management in Nigeria: The systems approach as an imperative
Broadcast Management in Nigeria: The systems approach as an imperativeBroadcast Management in Nigeria: The systems approach as an imperative
Broadcast Management in Nigeria: The systems approach as an imperative
iosrjce
 
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
A Study on Retailer’s Perception on Soya Products with Special Reference to T...A Study on Retailer’s Perception on Soya Products with Special Reference to T...
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
iosrjce
 
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
iosrjce
 
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
Consumers’ Behaviour on Sony Xperia: A Case Study on BangladeshConsumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
iosrjce
 
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
iosrjce
 
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
iosrjce
 
Media Innovations and its Impact on Brand awareness & Consideration
Media Innovations and its Impact on Brand awareness & ConsiderationMedia Innovations and its Impact on Brand awareness & Consideration
Media Innovations and its Impact on Brand awareness & Consideration
iosrjce
 
Customer experience in supermarkets and hypermarkets – A comparative study
Customer experience in supermarkets and hypermarkets – A comparative studyCustomer experience in supermarkets and hypermarkets – A comparative study
Customer experience in supermarkets and hypermarkets – A comparative study
iosrjce
 
Social Media and Small Businesses: A Combinational Strategic Approach under t...
Social Media and Small Businesses: A Combinational Strategic Approach under t...Social Media and Small Businesses: A Combinational Strategic Approach under t...
Social Media and Small Businesses: A Combinational Strategic Approach under t...
iosrjce
 
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
iosrjce
 
Implementation of Quality Management principles at Zimbabwe Open University (...
Implementation of Quality Management principles at Zimbabwe Open University (...Implementation of Quality Management principles at Zimbabwe Open University (...
Implementation of Quality Management principles at Zimbabwe Open University (...
iosrjce
 
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
iosrjce
 

More from iosrjce (20)

An Examination of Effectuation Dimension as Financing Practice of Small and M...
An Examination of Effectuation Dimension as Financing Practice of Small and M...An Examination of Effectuation Dimension as Financing Practice of Small and M...
An Examination of Effectuation Dimension as Financing Practice of Small and M...
 
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
Does Goods and Services Tax (GST) Leads to Indian Economic Development?Does Goods and Services Tax (GST) Leads to Indian Economic Development?
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
 
Childhood Factors that influence success in later life
Childhood Factors that influence success in later lifeChildhood Factors that influence success in later life
Childhood Factors that influence success in later life
 
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
 
Customer’s Acceptance of Internet Banking in Dubai
Customer’s Acceptance of Internet Banking in DubaiCustomer’s Acceptance of Internet Banking in Dubai
Customer’s Acceptance of Internet Banking in Dubai
 
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
 
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
Consumer Perspectives on Brand Preference: A Choice Based Model ApproachConsumer Perspectives on Brand Preference: A Choice Based Model Approach
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
 
Student`S Approach towards Social Network Sites
Student`S Approach towards Social Network SitesStudent`S Approach towards Social Network Sites
Student`S Approach towards Social Network Sites
 
Broadcast Management in Nigeria: The systems approach as an imperative
Broadcast Management in Nigeria: The systems approach as an imperativeBroadcast Management in Nigeria: The systems approach as an imperative
Broadcast Management in Nigeria: The systems approach as an imperative
 
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
A Study on Retailer’s Perception on Soya Products with Special Reference to T...A Study on Retailer’s Perception on Soya Products with Special Reference to T...
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
 
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
 
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
Consumers’ Behaviour on Sony Xperia: A Case Study on BangladeshConsumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
 
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
 
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
 
Media Innovations and its Impact on Brand awareness & Consideration
Media Innovations and its Impact on Brand awareness & ConsiderationMedia Innovations and its Impact on Brand awareness & Consideration
Media Innovations and its Impact on Brand awareness & Consideration
 
Customer experience in supermarkets and hypermarkets – A comparative study
Customer experience in supermarkets and hypermarkets – A comparative studyCustomer experience in supermarkets and hypermarkets – A comparative study
Customer experience in supermarkets and hypermarkets – A comparative study
 
Social Media and Small Businesses: A Combinational Strategic Approach under t...
Social Media and Small Businesses: A Combinational Strategic Approach under t...Social Media and Small Businesses: A Combinational Strategic Approach under t...
Social Media and Small Businesses: A Combinational Strategic Approach under t...
 
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
 
Implementation of Quality Management principles at Zimbabwe Open University (...
Implementation of Quality Management principles at Zimbabwe Open University (...Implementation of Quality Management principles at Zimbabwe Open University (...
Implementation of Quality Management principles at Zimbabwe Open University (...
 
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
 

Recently uploaded

English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
BrazilAccount1
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
SupreethSP4
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
zwunae
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
Vijay Dialani, PhD
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
Jayaprasanna4
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 

Recently uploaded (20)

English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 

An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multiple Documents

IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 17, Issue 2, Ver. IV (Mar – Apr. 2015), PP 32-41, www.iosrjournals.org, DOI: 10.9790/0661-17243241

A. Kavitha (1), Dr. N. Rajkumar (2), Dr. S. P. Victor (3)
(1) Research Scholar, Manonmaniam Sundaranor University, Tirunelveli, India
(2) Dept. of M.E. S/w Engg., Professor & Head, Sri Ramakrishna Engineering College, India
(3) Dept. of MCA, Professor & Head, St. Xavier College, Palayamkottai, Tirunelveli, India

Abstract: Semantic similarity measures how alike the meaning content of a set of documents is. Document similarity is the process of computing the semantic similarity between multiple documents using similarity measures. In this paper, document similarity is applied to compute the pairwise similarities of documents based on the Suffix Tree Document (STD) model. Documents are pre-processed first; data preprocessing is done to improve the efficiency of computing the similarity values. The pre-processed phrases are then inserted into a suffix tree. A suffix tree is a data structure that presents the suffixes of a given string in a way that allows a particularly fast implementation of many important string operations. The suffix substrings are selected as the phrases that label the edges of the suffix tree. Internal nodes represent phrases that are shared by multiple documents. The more internal nodes two documents share, the more similar the documents are. A suffix tree can be used to solve the exact matching problem in linear time. Document similarity naturally inherits the tf-idf (term frequency and inverse document frequency) weighting scheme when computing the document similarity with phrases.
The tf-idf method is used to calculate the weight of the internal nodes of the suffix tree, where internal nodes are the nodes that are shared by multiple documents. The Cosine, Dice and Hellinger measures are applied to find the pairwise similarity based on the weight of each internal node of the suffix tree.

Keywords: Semantic similarity, Similarity measures, Document similarity, Suffix tree and Tf-idf scheme.

I. Introduction

Semantic similarity assigns a metric to a set of documents based on the likeness of their meaning content. Document similarity plays a vital role in information retrieval using clustering techniques [11][7]. The main goal of the system is to compute the semantic similarity between multiple documents. The system takes several documents as input from the user and finds the similarity between them using different similarity measures. Document preprocessing consists of stop word removal, case conversion and special character removal. Phrases are extracted from the documents to construct the suffix tree and are used to label the edges of the tree [1][10]. A suffix tree is a data structure that presents the suffixes of a given string in a way that allows a particularly fast implementation of many important string operations [14][9]. The tf-idf method is used to calculate the weight of the internal nodes of the suffix tree, where internal nodes are the nodes that are shared by multiple documents. The Cosine similarity measure, the Dice coefficient and the Hellinger measure are used to find the pairwise similarity based on the weight of each internal node of the suffix tree [5][7]. Document similarity is reported as a value between 0 and 1. The value 1 implies absolute similarity and 0 implies that the two documents are not similar.
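The weighting and comparison steps described in the introduction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each document is assumed to be already reduced to a vector of non-negative weights, one entry per suffix-tree node; the tf-idf variant in `tfidf_weight` is one common formulation chosen here for illustration; and the Hellinger measure is expressed as one minus the Hellinger distance between the sum-normalised vectors so that, like the other two, it yields 1 for identical vectors and 0 for vectors with no shared node. All function names are hypothetical.

```python
import math


def tfidf_weight(tf, df, n_docs):
    """Weight of one node in one document.

    Assumed variant: (1 + log tf) * log(1 + N / df), where tf is the node's
    frequency in the document, df the number of documents containing the
    node, and N the total number of documents.
    """
    if tf <= 0 or df <= 0:
        return 0.0
    return (1.0 + math.log(tf)) * math.log(1.0 + n_docs / df)


def cosine(a, b):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dice(a, b):
    """Dice coefficient generalised to real-valued weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b)
    return 2.0 * dot / denom if denom else 0.0


def hellinger_similarity(a, b):
    """1 - Hellinger distance of the sum-normalised vectors (assumed form)."""
    total_a, total_b = sum(a), sum(b)
    if not total_a or not total_b:
        return 0.0
    # Bhattacharyya coefficient of the two normalised weight distributions.
    bc = sum(math.sqrt((x / total_a) * (y / total_b)) for x, y in zip(a, b))
    # Clamp to guard against tiny negative values from rounding.
    return 1.0 - math.sqrt(max(0.0, 1.0 - bc))


if __name__ == "__main__":
    # Hypothetical node statistics for two documents in a 3-document corpus.
    d1 = [tfidf_weight(2, 2, 3), tfidf_weight(1, 2, 3), 0.0]
    d2 = [tfidf_weight(1, 2, 3), 0.0, tfidf_weight(3, 1, 3)]
    print(cosine(d1, d2), dice(d1, d2), hellinger_similarity(d1, d2))
```

Keeping the three measures as separate functions over the same weight vectors makes it easy to compare how each one scores a given document pair.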
1.1 Semantic Similarity
Semantic similarity measures can be classified into pairwise and groupwise measures. A pairwise measure computes the functional similarity between two instances by combining the semantic similarities of the concepts they represent. A groupwise measure calculates the similarity directly, without combining the semantic similarities of the individual concepts. Semantic similarity is a widely used approach and is applied in several applications to determine similarity [15]. Similarity measures are used together with a corpus to retrieve many kinds of information, including information on the web [3][4][8].
1.2 Data Preprocessing
Data preprocessing in the existing approach consists of three phases: special-character removal, stop-word removal and case conversion. Preprocessing reduces the document size and the comparison time. In the first phase, a list of 32 special characters is removed from all the documents [1]. A few of these special characters are shown in fig. 1.
!, @, #, $, %, ^, &, *, (, ), -, =, +, _, [, ], ;, :, |, <, >, ?, /, `, ~
Figure 1. Special characters list
The second phase is stop-word removal, which eliminates all 256 words of the stop-word list from the input documents. Part of the stop-word list is presented in fig. 2.
a, an, the, is, are, there, who, what, when, how, much, this, that, etc.
Figure 2. Stop Words List
The third phase is case conversion, which converts the entire document from upper case to lower case.
Example
The preprocessing of a sample document is illustrated in fig. 3 and fig. 4.
Computer science or Computing science (abbreviated as CS or CompSci) is the scientific and practical approach to computation and its applications. A computer scientist specializes in the theory of computation and the design of computational systems.
Figure 3. Document1
Computer science Computing science abbreviated CS or CompSci scientific practical approach computation applications computer scientist specialize theory computation design computational systems
Figure 4. Preprocessed documents
II. Related Work
Hung Chim and Xiaotie Deng (2008) proposed a method to compute document similarity. The main objective of their work was a phrase-based document similarity that computes the pairwise similarities of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of the STD model to a unique feature term in the Vector Space Document (VSD) model, the phrase-based document similarity naturally inherits the tf-idf weighting scheme when computing document similarity with phrases [1]. Elias Iosif and Alexandros Potamianos presented Web-based metrics that compute the semantic similarity between words or terms and compared them with the state of the art.
Starting from the fundamental assumption that similarity of context implies similarity of meaning, relevant Web documents are downloaded via a Web search engine and the contextual information of the words of interest is compared (context-based similarity metrics). In addition, the proposed unsupervised context-based similarity computation algorithms appear to be competitive with state-of-the-art supervised semantic similarity algorithms based on language-specific knowledge resources [2]. Chen et al. proposed a story link detection system, which determines whether two stories are about the same event (that is, whether they are linked), usually on the basis of the cosine similarity between the two stories. Their work presents a method for increasing the performance of a link detection system by using a variety of similarity measures together with source-pair specific information. Similarity measures such as cosine, Hellinger, Tanimoto and clarity were used, both alone and in combination [5]. Jaz et al. presented two methods to learn semantic similarity between documents: one based on document similarity and the other on co-occurrence information [13]. Sheetal A. et al. presented a method to compute similarity between words through web documents. Semantic similarity measures play an important role in the extraction of semantic relations. Their method uses web-based metrics to compute semantic similarity between words or terms and compares them with the state of the art. The similarity measures proposed in that work are based on five association measures from information retrieval: normal matching, Dice, Jaccard, Overlap and Cosine coefficients. The performance of these methods was evaluated using the Miller and Charles benchmark dataset [6]. Anna Huang analyzed the effectiveness of similarity measures in partitional clustering of text document datasets.
This approach used the standard K-means algorithm and reported results on several text document datasets for five distance/similarity measures that are most commonly used in text clustering [7]. Hsun and Yau presented work on cross-language retrieval using semantic similarity measures. They applied fuzzy models to represent documents and used similarity approaches to retrieve information [12].
III. Proposed Work
The proposed system comprises four major steps for computing document similarity efficiently: data preprocessing, suffix tree construction, node weight calculation and similarity measures. Preprocessing covers the removal of symbols, the removal of stop words and case conversion. Phrases are extracted from the pre-processed data. Each internal node has at least two children and each edge is labeled with a nonempty substring of a document, known as a phrase. Every leaf node in the suffix tree designates a suffix substring of a document; each internal node represents a phrase shared by at least two suffix substrings. The more internal nodes two documents share, the more similar the documents are likely to be. The system applies several similarity measures to show how the similarity values differ between them; the proposed flow includes three measures: Cosine, Dice and Hellinger. The proposed work is shown in fig. 5.
Figure 5. Proposed system Architecture
3.1 Suffix Tree
A suffix tree is a tree-like data structure for solving string problems; it stores all substrings of a given string in linear space. Each internal node other than the root has at least two children, and every edge is labeled with a nonempty substring of the string S. The suffix tree is one of the best-known full-text index data structures. It has been studied for decades and is used in many algorithmic solutions and practical applications. To construct the suffix tree, phrases are extracted from the preprocessed document, and each edge is labeled with a nonempty substring of a document, called a phrase. There are three kinds of nodes in the suffix tree: the root node, internal nodes and leaf nodes. Every internal node represents a common phrase shared by at least two suffix substrings. The more internal nodes two documents share, the more similar the documents are. Leaf nodes are also called terminal nodes.
Each node in the suffix tree other than the root node, whether an internal node or a leaf node, represents a nonempty phrase that appears in at least one document in the data set. The same phrase may appear on several edges of the suffix tree. The suffix tree of a document set is a compact trie containing all suffix substrings of the documents in the data set. During suffix tree construction, the root node is the initial node and the ancestor of all other nodes. All other nodes are created and stored in hierarchical order below their LCP (longest common prefix) nodes. In our contribution, all child nodes of the root node are defined as first-level nodes of the suffix tree, the child nodes of the first-level nodes as second-level nodes, and so on. To build a suffix tree, the naive and straightforward method compares each suffix substring of the document with the suffix substrings that already exist in the tree and finds the position at which to insert it. The time complexity of building the suffix tree for a document of m words in this way is O(m^2).
Example
Consider the two documents:
Document 1: Computer science computing science abbreviated cs or compsci scientific practical approach computation article computer scientist specializes theory computation design computational systems
Document 2: Computer science appears 1959 article communication Human interaction considers challenges making computers computations useful usable universally accessible humans
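The naive construction described in Section 3.1 can be sketched on word sequences as follows. This is an illustrative simplification under stated assumptions, not the authors' implementation: it builds an uncompressed word-level suffix trie (one node per word, without merging single-child chains into one edge as a compact suffix tree does), and the names `Node`, `build_suffix_trie` and `shared_phrases` are hypothetical. Each node records which documents pass through it, so the nodes reached by two or more documents play the role of the shared internal nodes.

```python
class Node:
    """One trie node: the phrase spelled by the path from the root."""

    def __init__(self):
        self.children = {}  # next word -> child Node
        self.docs = set()   # ids of documents containing this phrase


def build_suffix_trie(documents):
    """Insert every word-level suffix of every document, one word at a time."""
    root = Node()
    for doc_id, words in enumerate(documents):
        for start in range(len(words)):      # one suffix per start position
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
                node.docs.add(doc_id)
    return root


def shared_phrases(root, min_docs=2):
    """Return, sorted, the phrases that occur in at least min_docs documents."""
    found = []
    stack = [(root, [])]
    while stack:
        node, prefix = stack.pop()
        for word, child in node.children.items():
            if len(child.docs) >= min_docs:
                phrase = prefix + [word]
                found.append(" ".join(phrase))
                # A longer shared phrase must extend a shared phrase,
                # so only shared nodes need to be explored further.
                stack.append((child, phrase))
    return sorted(found)


if __name__ == "__main__":
    doc_a = "computer science theory".split()
    doc_b = "computer science article".split()
    print(shared_phrases(build_suffix_trie([doc_a, doc_b])))
```

Inserting every suffix word by word takes on the order of m^2 steps for a document of m words, which matches the naive bound quoted in the text; Ukkonen's on-line algorithm [14] builds a compact suffix tree in linear time instead.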
Figure 6. Suffix tree
The nodes shared in the above suffix tree are A, B and E.
3.2 Weight Calculation
The weight of a node is calculated using the tf-idf weighting scheme, where tf refers to term frequency and idf to inverse document frequency. Tf-idf is a numerical statistic that reflects how important a word is to a document in a collection. It is frequently used as a weighting factor in information retrieval and text mining. tf(t,d) is the number of times that term t occurs in document d. The inverse document frequency (idf) measures whether the term is common or rare across all documents; it is obtained by dividing the total number of documents by the number of documents containing the term. The node weights of a document are written as in equation (1).
$$d = \{w(1,d),\, w(2,d),\, \ldots,\, w(m,d)\} \qquad (1)$$
where w is a weight and m is the number of terms. The weight of a term is calculated using equation (2).
$$w(i,d) = \bigl(1 + \log tf(i,d)\bigr) \cdot \log\!\left(1 + \frac{N}{df(i)}\right) \qquad (2)$$
where tf(i,d) is the frequency of the i-th term in document d, df(i) is the number of documents containing the i-th term and N is the number of documents. The logarithm is the natural logarithm, so that log 2 = 0.693.
Example: calculating the weight of the internal nodes shared by multiple documents. The internal nodes shared by multiple documents in fig. 6 are nodes A, B and E.
Calculating the weights
For node A, with tf(computer, doc1) = 1, df(computer) = 2 and N = 2:
$$w(A,\text{doc1}) = w(\text{computer},\text{doc1}) = \bigl(1 + \log tf(\text{computer},\text{doc1})\bigr) \cdot \log\!\left(1 + \frac{N}{df(\text{computer})}\right)$$
$$= (1 + \log 1) \cdot \log(1 + 2/2) = (1 + 0) \cdot \log 2 = 0.693$$
For node B, with tf(science, doc1) = 1 and df(science) = 2:
$$w(B,\text{doc1}) = w(\text{science},\text{doc1}) = (1 + \log 1) \cdot \log(1 + 2/2) = (1 + 0) \cdot \log 2 = 0.693$$
The weights of the remaining nodes with respect to Document 1 and Document 2 are calculated in the same way. The node weight table constructed from these calculations is shown in table 1.
Table 1. Node Weight Table
NODE    DOC 1    DOC 2
A       0.693    0.693
B       0.693    1.173
E       0.693    0.693
3.3 Similarity Measures
A similarity measure computes the semantic similarity of documents as a value that reflects the degree of closeness or likeness of two documents. Similarity measures map to the range [-1, 1] or [0, 1], where -1 or 0 indicates minimum similarity and 1 indicates maximum similarity. The proposed approach applies three similarity measures: cosine similarity, the Dice coefficient and the Hellinger measure. A large number of similarity measures have been proposed in the literature, since no single measure is best in all cases.
3.3.1 Cosine Similarity
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The resulting similarity ranges from −1 to 1, with 0 usually indicating independence and values in between indicating intermediate similarity or dissimilarity. For document similarity, the cosine similarity of two documents ranges from 0 to 1, because term frequencies cannot be negative.
$$\text{Cosine Similarity} = \frac{d_x \cdot d_y}{|d_x|\,|d_y|} = \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2}\;\sqrt{\sum_{i=1}^{m} y_i^2}} \qquad (3)$$
where d_x = {x_1, x_2, ..., x_m} and d_y = {y_1, y_2, ..., y_m} are the documents, x_i and y_i are the weights of the corresponding nodes and m is the number of internal nodes.
Doc 1 = {A, B, E} and Doc 2 = {A, B, E}. In the worked example, x, y and z denote the weights of nodes A, B and E, and the subscripts 1 and 2 denote Document 1 and Document 2.
$$\text{Cosine} = \frac{x_1 x_2 + y_1 y_2 + z_1 z_2}{(x_1^2 + y_1^2 + z_1^2)^{1/2}\,(x_2^2 + y_2^2 + z_2^2)^{1/2}}$$
$$= \frac{(0.693)(0.693) + (0.693)(1.173) + (0.693)(0.693)}{\bigl((0.693)^2 + (0.693)^2 + (0.693)^2\bigr)^{1/2}\,\bigl((0.693)^2 + (1.173)^2 + (0.693)^2\bigr)^{1/2}}$$
$$= \frac{1.7732}{(1.4406)^{1/2}\,(2.257)^{1/2}} = \frac{1.7732}{(1.200)(1.502)} = 0.98$$
Cosine similarity for Document 1 and Document 2 is 0.98.
3.3.2 Dice Coefficient
The Dice coefficient measures how similar two sets are. It can be applied to measure how similar two documents are in terms of the number of common bi-grams. It is mainly used to compare two documents, treating them as two samples and computing a statistic of their similarity.
$$\text{Dice coefficient} = \frac{2\,A \cdot B}{|A|^2 + |B|^2} = \frac{2\sum_{i=1}^{m} x_i y_i}{\sum_{i=1}^{m} x_i^2 + \sum_{i=1}^{m} y_i^2} \qquad (4)$$
where A and B in equation (4) are the two documents, each represented by the weights of its nodes: Doc 1 = {A, B, E} and Doc 2 = {A, B, E}.
$$\text{Dice coefficient} = \frac{2\bigl((0.693)(0.693) + (0.693)(1.173) + (0.693)(0.693)\bigr)}{\bigl((0.693)^2 + (0.693)^2 + (0.693)^2\bigr) + \bigl((0.693)^2 + (1.173)^2 + (0.693)^2\bigr)}$$
$$= \frac{2(1.7732)}{1.4406 + 2.257} = \frac{3.5464}{3.6976} = 0.956$$
Dice coefficient for Document 1 and Document 2 is 0.956.
3.3.3 Hellinger Distance
The Hellinger distance is a distance between probability distributions. It is closely related to the total variation distance: both distances define the same topology on the space of probability measures, but the Hellinger distance has several technical advantages derived from the properties of inner products.
$$\text{Hellinger} = \frac{\sum_{i=1}^{m} x_i y_i}{\left(\sum_{i=1}^{m} x_i^2 + \sum_{i=1}^{m} y_i^2\right) - \sum_{i=1}^{m} x_i y_i} \qquad (6)$$
$$= \frac{(0.693)(0.693) + (0.693)(1.173) + (0.693)(0.693)}{\bigl((0.693)^2 + (0.693)^2 + (0.693)^2\bigr) + \bigl((0.693)^2 + (1.173)^2 + (0.693)^2\bigr) - \bigl((0.693)(0.693) + (0.693)(1.173) + (0.693)(0.693)\bigr)}$$
$$= \frac{1.7732}{(1.4406 + 2.1334) - 1.7732} = \frac{1.7732}{3.574 - 1.7732} = 0.984$$
Hellinger distance for Document 1 and Document 2 is 0.984. The comparison of the two documents using the Cosine, Dice and Hellinger measures is shown in table 2.
Table 2. Comparison table of the different similarity measures
Measure      Similarity value
Cosine       0.99
Dice         0.956
Hellinger    0.984
IV. Performance Evaluation
To evaluate its performance, the proposed system was developed using NetBeans IDE version 7.2 for the user interface and the computation, and Microsoft Access for the database. Standard data collected from www.Wikipedia.com and further data sets from www.uc.dataset.org were used for the evaluation. The system produces document similarity values between 0 and 1. Any number of documents can be compared to obtain similarity values using the Cosine, Dice and Hellinger measures. Preprocessing reduces the complexity of the suffix tree and increases the accuracy of the similarity measures by keeping irrelevant terms and symbols from becoming nodes. String matching and term weights are easily computed with the suffix tree. Fig. 7 shows that the size of the suffix tree grows linearly with the size of the documents; the line plots the number of internal nodes in the suffix tree against the number of nodes in each document.
Figure 7. The size of suffix tree scales linearly to the size of document
Figure 8. Time cost for Similarity and suffix tree construction
Fig. 8 shows the time required for suffix tree construction and similarity calculation. The time increases gradually with the number of documents in the system. The similarity results from the Hellinger, Cosine and Dice measures are compared in fig. 9.
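The node weighting of equation (2) and the three measures of Section 3.3 can be checked numerically with a short script. This is a sketch under stated assumptions, not the authors' NetBeans implementation: the function names are hypothetical, `log` is taken as the natural logarithm (which reproduces log 2 = 0.693 from the worked example), and `hellinger_as_in_text` implements the ratio used in the worked Hellinger computation, that is, the dot product divided by the sum of the squared norms minus the dot product. That ratio has the form of the Tanimoto coefficient rather than the probabilistic Hellinger distance. Evaluating the measures in floating point directly on the Table 1 weights need not reproduce the rounded, hand-computed intermediate values quoted in the text.

```python
import math
from itertools import combinations


def node_weight(tf, df, n_docs):
    """Equation (2): (1 + ln tf) * ln(1 + N / df); 0 if the term is absent."""
    if tf <= 0 or df <= 0:
        return 0.0
    return (1.0 + math.log(tf)) * math.log(1.0 + n_docs / df)


def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Equation (3): dot product over the product of the vector lengths."""
    denom = math.sqrt(_dot(x, x)) * math.sqrt(_dot(y, y))
    return _dot(x, y) / denom if denom else 0.0


def dice(x, y):
    """Equation (4): twice the dot product over the sum of squared lengths."""
    denom = _dot(x, x) + _dot(y, y)
    return 2.0 * _dot(x, y) / denom if denom else 0.0


def hellinger_as_in_text(x, y):
    """Ratio used in the worked example for equation (6) (Tanimoto form)."""
    denom = _dot(x, x) + _dot(y, y) - _dot(x, y)
    return _dot(x, y) / denom if denom else 0.0


def pairwise(vectors, measure):
    """Apply one measure to every document pair: (1, 2), (1, 3), ..."""
    return {(i + 1, j + 1): measure(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}


# Weights of nodes A, B and E as listed in Table 1.
doc1 = [0.693, 0.693, 0.693]
doc2 = [0.693, 1.173, 0.693]

if __name__ == "__main__":
    for name, measure in [("cosine", cosine), ("dice", dice),
                          ("hellinger (as in text)", hellinger_as_in_text)]:
        print(name, round(pairwise([doc1, doc2], measure)[(1, 2)], 3))
```

Keeping each measure as a plain function of two weight vectors lets `pairwise` produce the document-pair comparisons (DOC 1-2, DOC 1-3 and so on) for any number of documents with the same code.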
Figure 9. Comparison of different similarity measures (vertical axis: similarity value, 0 to 1.2; horizontal axis: document pairs DOC 1-2, DOC 1-3, DOC 1-4, DOC 2-3, DOC 2-4, DOC 3-4; series: Hellinger, Cosine, Dice)
V. Conclusion And Future Work
The system computes the similarity of multiple documents and reports it as numerical values. The concepts of the suffix tree and the new document similarity are quite simple, but their implementation is somewhat complicated. To improve the performance of the document similarity, we investigated the STD model in terms of both theoretical data structure analysis and clustering algorithm optimization. As a result, the efficiency of the new document similarity approach was demonstrated in our experiments on a large document dataset. Phrase tf-idf weights were used in computing document similarities and proved to be very effective. Our work reports a successful approach to extending the use of the tf-idf weighting scheme: the scheme is suitable for evaluating the importance not only of keywords but also of phrases in document clustering. Replacing the suffix tree with enhanced suffix arrays improves space efficiency; enhanced suffix arrays support the algorithms of the suffix tree while reducing the space and time costs. Future work will focus on accepting all types of documents for determining similarity.
References
[1]. Hung Chim and Xiaotie Deng, “Efficient Phrase-Based Document Similarity for Clustering”, IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 9, pp. 1217-1229, 2008.
[2]. Elias Iosif and Alexandros Potamianos, “Unsupervised Semantic Similarity Computation between Terms Using Web Documents”, IEEE Transactions on Knowledge and Data Engineering, Vol. 22, No. 11, pp. 1637-1647, 2010.
[3]. Angelos Hliaoutakis et al., “Information Retrieval by Semantic Similarity”, International Journal on Semantic Web & Information Systems, Vol. 2, No. 3, pp. 55-73, 2006.
[4]. Giannis Varelas et al., “Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web”, Proc. of the 7th ACM International Workshop on Web Information and Data Management, pp. 10-16, 2005.
[5]. Francine Chen, Ayman Farahat and Thorsten Brants, “Multiple Similarity Measures and Source-Pair Information in Story Link Detection”, Proc. of Human Language Technology Conference, pp. 313-320, Chicago, 2004.
[6]. Sheetal A. Takale and Sushma S. Nandgaonkar, “Measuring Semantic Similarity between Words Using Web Documents”, International Journal of Advanced Computer Science and Applications, Vol. 1, No. 4, pp. 78-85, 2010.
[7]. Anna Huang, “Similarity Measures for Text Document Clustering”, Computer Science Research Student Conference, pp. 49-56, New Zealand, 2008.
[8]. Danushka Bollegala, Yutaka Matsuo and Mitsuru Ishizuka, “A Web Search Engine-Based Approach to Measure Semantic Similarity between Words”, IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 7, pp. 977-990, 2011.
[9]. D.S. Sven Meyer zu Eissen and M. Potthast, “The Suffix Tree Document Model Revisited”, Proc. Fifth Int’l Conf. Knowledge Management (I-Know ’05), pp. 596-603, 2005.
[10]. Mohamed Ibrahim Abouelhoda, Stefan Kurtz and Enno Ohlebusch, “Replacing suffix trees with enhanced suffix arrays”, Journal of Discrete Algorithms, Vol. 2, No. 1, pp. 53-86, 2003.
[11]. Behnam Hajian and Tony White, “Measuring Semantic Similarity using a Multi-Tree Model”, Proc. of 9th Workshop on Intelligent Techniques for Web Personalization and Recommender Systems, pp. 7-14, Spain, 2011.
[12]. Hsun-Hui Huang and Yau-Hwang Kuo, “Cross-Lingual Document Representation and Semantic Similarity Measure: A Fuzzy Set and Rough Set Based Approach”, IEEE Transactions on Fuzzy Systems, Vol. 18, No. 6, pp. 1098-1111, 2010.
[13]. Jaz Kondola, John Shawe-Taylor and Nello Cristianini, “Learning Semantic Similarity”, Proc. of Neural Information Processing Systems, Vol. 15, pp. 657-664, Canada, 2003.
[14]. E. Ukkonen, “On-Line Construction of Suffix Trees”, Algorithmica, Vol. 14, No. 3, pp. 249-260, 1995.
[15]. Dekang Lin, “An Information-Theoretic Definition of Similarity”, Proc. of 15th International Conference on Machine Learning, pp. 296-304, Wisconsin, USA, 1998.