1) The document presents an approach to measuring semantic similarity between multiple documents using an enhanced suffix tree. It involves preprocessing the documents, constructing a suffix tree whose edges are the documents' phrases, calculating weights for shared nodes using TF-IDF, and applying cosine, Dice, and Hellinger similarity measures to determine pairwise document similarities.
2) The approach first preprocesses documents by removing stop words and special characters and converting text to lowercase. A suffix tree is then constructed with the documents' phrases as edges; shared nodes in the tree represent phrases common to multiple documents.
3) Node weights are calculated using TF-IDF, with higher weights given to rarer phrases. Several similarity measures (cosine, Dice, Hellinger) are then applied to obtain pairwise similarity scores.
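The weighting-and-comparison step described above can be sketched in a few lines. This is an illustrative toy, not the document's implementation: the phrase lists stand in for suffix-tree edges, the `+1` idf smoothing is an assumption, and the Hellinger score is taken as one minus the Hellinger distance over normalized weights.

```python
import math

# Toy corpus: each document is a list of phrases (stand-ins for suffix-tree edges).
docs = {
    "d1": ["machine learning", "neural network", "deep learning"],
    "d2": ["machine learning", "neural network", "suffix tree"],
    "d3": ["suffix tree", "string matching"],
}

def tfidf_vectors(docs):
    """Weight each phrase by TF-IDF; rarer phrases get higher weights."""
    n = len(docs)
    vocab = sorted({p for ps in docs.values() for p in ps})
    df = {p: sum(p in ps for ps in docs.values()) for p in vocab}
    vecs = {}
    for name, phrases in docs.items():
        vec = []
        for p in vocab:
            tf = phrases.count(p) / len(phrases)
            idf = math.log(n / df[p]) + 1.0   # +1 keeps shared phrases non-zero
            vec.append(tf * idf)
        vecs[name] = vec
    return vecs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dice(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b)
    return 2 * dot / denom if denom else 0.0

def hellinger(a, b):
    # Similarity form: 1 - Hellinger distance over normalized weights.
    sa, sb = sum(a) or 1.0, sum(b) or 1.0
    d = math.sqrt(0.5 * sum((math.sqrt(x / sa) - math.sqrt(y / sb)) ** 2
                            for x, y in zip(a, b)))
    return 1.0 - d

vecs = tfidf_vectors(docs)
for m in (cosine, dice, hellinger):
    print(m.__name__, round(m(vecs["d1"], vecs["d2"]), 3))
```

Documents sharing more (and rarer) phrases score higher under all three measures; d1 and d3 share no phrases, so their cosine is zero.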
A CLUSTERING TECHNIQUE FOR EMAIL CONTENT MINING (ijcsit)
In today's world of the internet, with a whole lot of e-documents such as HTML pages and digital libraries occupying a considerable amount of cyberspace, organizing these documents has become a practical need. Clustering is an important technique that organizes a large number of objects into smaller coherent groups. This helps in the efficient and effective use of these documents for information retrieval and other NLP tasks. Email is one of the most frequently used e-documents by individuals and organizations. Email categorization is one of the major tasks of email mining; categorizing emails into different groups helps in easy retrieval and maintenance. Like other e-documents, emails can also be classified using clustering algorithms. In this paper, a similarity measure called Similarity Measure for Text Processing is suggested for email clustering. The suggested similarity measure takes into account three situations: a feature appears in both emails, a feature appears in only one email, and a feature appears in neither email. The potency of the suggested similarity measure is analyzed on the Enron email data set to categorize emails. The outcome indicates that the efficiency achieved by the suggested similarity measure is better than that achieved by other measures.
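The three-situation idea can be illustrated with a small sketch. To be clear, this is not the paper's actual formula: the `penalty` and `absent` parameters are invented for illustration, and the emails are reduced to plain feature sets.

```python
# Illustrative three-case similarity (NOT the paper's exact measure):
# per feature, +1 if it appears in both emails, a penalty if it appears in
# exactly one, and a small constant if it appears in neither; the raw score
# is then rescaled into [0, 1].

def three_case_similarity(a, b, vocab, penalty=0.5, absent=0.1):
    score = 0.0
    for f in vocab:
        in_a, in_b = f in a, f in b
        if in_a and in_b:
            score += 1.0                 # feature appears in both emails
        elif in_a or in_b:
            score -= penalty             # feature appears in only one email
        else:
            score += absent              # absent from both: weak agreement
    # shift/scale the raw score into [0, 1]
    lo, hi = -penalty * len(vocab), 1.0 * len(vocab)
    return (score - lo) / (hi - lo)

vocab = {"meeting", "invoice", "budget", "party"}
e1 = {"meeting", "budget"}
e2 = {"meeting", "invoice"}
e3 = {"party"}
print(round(three_case_similarity(e1, e2, vocab), 3))
```

Emails that share features (e1, e2) score higher than emails with disjoint features (e1, e3), which is the behaviour the measure is designed to capture.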
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE (ijdms)
Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents. Moreover, the features that represent those documents are also too large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim behind our proposed implementation is to solve the problem of clustering data-intensive document collections. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our experimental results show that using lexical categories for nouns only enhances internal evaluation measures of document clustering and decreases the document features from thousands to tens of features. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
An Improved Similarity Matching based Clustering Framework for Short and Sent... (IJECEIAES)
Text clustering plays a key role in the navigation and browsing process. For efficient text clustering, a large amount of information is grouped into meaningful clusters. Many text clustering techniques do not address issues such as high time and space complexity, inability to understand the relational and contextual attributes of words, low robustness, and risks related to privacy exposure. To address these issues, an efficient text-based clustering framework is proposed. The Reuters dataset is chosen as the input dataset. Once the input dataset is preprocessed, the similarity between the words is computed using cosine similarity. The similarities between the components are compared and the vector data is created. From the vector data the clustering particle is computed. To optimize the clustering results, mutation is applied to the vector data. The performance of the proposed text-based clustering framework is analyzed using metrics such as Mean Square Error (MSE), Peak Signal to Noise Ratio (PSNR), and processing time. From the experimental results, it is found that the proposed text-based clustering framework produced optimal MSE, PSNR, and processing time when compared to the existing Fuzzy C-Means (FCM) and Pairwise Random Swap (PRS) methods.
International Journal of Engineering and Science Invention (IJESI) (inventionjournals)
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field of Engineering, Science and Technology, new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers for publication are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS (ijseajournal)
ABSTRACT
In this paper we propose a novel method to cluster categorical data while retaining their context. Typically, clustering is performed on numerical data; however, it is often useful to cluster categorical data as well, especially when dealing with data in real-world contexts. Several methods exist which can cluster categorical data, but our approach is unique in that we use recent text-processing and machine learning advancements like GloVe and t-SNE to develop a context-aware clustering approach (using pre-trained word embeddings). We encode words or categorical data into numerical, context-aware vectors that we use to cluster the data points using common clustering algorithms like K-means.
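The embed-then-cluster pipeline described above can be sketched as follows. This is a toy under assumptions: the tiny 2-D `embeddings` dict stands in for pre-trained GloVe vectors (which are 50-300 dimensional), the t-SNE step is omitted, and the K-means loop is a plain textbook Lloyd's iteration rather than the paper's setup.

```python
import random

# Toy stand-in for pre-trained GloVe vectors (real ones are 50-300 dims).
embeddings = {
    "cat":   [0.9, 0.1], "dog":  [0.8, 0.2], "horse": [0.85, 0.15],
    "red":   [0.1, 0.9], "blue": [0.2, 0.8], "green": [0.15, 0.85],
}

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

words = sorted(embeddings)
labels = kmeans([embeddings[w] for w in words], k=2)
print(dict(zip(words, labels)))
```

Because the embedding space puts semantically similar words near each other, the animal words and the colour words fall into separate clusters, which is the "context-aware" effect the paper relies on.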
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development is an international premier peer reviewed open access engineering and technology journal promoting the discovery, innovation, advancement and dissemination of basic and transitional knowledge in engineering, technology and related disciplines.
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu... (IJORCS)
The growing nature of the World Wide Web has posed great challenges for researchers in improving search efficiency over the internet. Nowadays, web document clustering has become an important research topic for providing the most relevant documents among the huge volumes of results returned in response to a simple query. In this paper, we first propose a novel approach to precisely define clusters based on maximal frequent item sets (MFI) found by the Apriori algorithm, and then utilize the same MFI-based similarity measure for hierarchical document clustering. By considering maximal frequent item sets, the dimensionality of the document set is decreased. Secondly, privacy preservation of open web documents is provided by avoiding duplicate documents; thereby we can protect the individual copyrights of documents. This can be achieved using an equivalence relation.
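The MFI idea can be illustrated in miniature. This sketch makes assumptions the paper does not specify here: the itemset miner is a brute-force stand-in for Apriori (fine for a toy vocabulary, not for real data), and the document similarity is taken as the Jaccard overlap of the maximal frequent itemsets each document contains.

```python
from itertools import combinations

# Toy transactions: each document reduced to a set of terms.
docs = {
    "d1": {"web", "search", "cluster", "query"},
    "d2": {"web", "search", "cluster", "rank"},
    "d3": {"genome", "protein", "cell"},
}

def frequent_itemsets(transactions, min_support=2):
    """Brute-force stand-in for Apriori on a tiny vocabulary."""
    items = sorted(set().union(*transactions))
    frequent = []
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            support = sum(set(combo) <= t for t in transactions)
            if support >= min_support:
                frequent.append(frozenset(combo))
    return frequent

def maximal(itemsets):
    """Keep only itemsets not contained in a larger frequent itemset."""
    return [s for s in itemsets if not any(s < t for t in itemsets)]

def mfi_similarity(a, b, mfis):
    """Jaccard overlap of the MFIs each document contains."""
    ma = {m for m in mfis if m <= a}
    mb = {m for m in mfis if m <= b}
    union = ma | mb
    return len(ma & mb) / len(union) if union else 0.0

mfis = maximal(frequent_itemsets(list(docs.values())))
print(mfis)
print(mfi_similarity(docs["d1"], docs["d2"], mfis))
```

The dimensionality reduction the abstract mentions is visible even here: three documents with eight distinct terms collapse to a single maximal frequent itemset, {web, search, cluster}.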
International Journal of Engineering Research and Applications (IJERA) is a team of researchers, not a publication service or private publication running journals for monetary benefit; we are an association of scientists and academia who focus only on supporting authors who want to publish their work. The articles published in our journal can be accessed online, and all articles are archived for real-time access.
Our journal system primarily aims to bring out the research talent and the work done by scientists, academia, engineers, practitioners, scholars, and postgraduate students of engineering and science. This journal aims to cover scientific research in a broader sense rather than publishing in a niche area, facilitating researchers from various verticals to publish their papers. It also aims to provide a platform for researchers to publish in a shorter time, enabling them to continue further research. All articles published are freely available to scientific researchers in government agencies, educators, and the general public. We are taking serious efforts to promote our journal across the globe in various ways, and we are sure that our journal will act as a scientific platform for all researchers to publish their works online.
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF... (ijcsitcejournal)
This paper proposes a semi-structured information retrieval model based on a new method for calculating similarity. We have developed the CASISS (Calculation of Similarity of Semi-Structured documents) method to quantify how similar two given texts are. This new method identifies elements of semi-structured documents using element descriptors. Each semi-structured document is pre-processed before the extraction of a set of descriptors for each element, which characterize the contents of the elements. It can be used to increase the accuracy of the information retrieval process by taking into account not only the presence of query terms in the given document but also the topology (position continuity) of these terms.
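The "presence plus position continuity" idea can be sketched as a toy scoring function. This is an illustration only, not the CASISS method: the `bonus` weight and the longest-run definition of continuity are invented for the example.

```python
# Illustrative sketch: score a document against a query by term presence
# plus a bonus when the matched terms appear contiguously (position continuity).

def continuity_score(query_terms, doc_tokens, bonus=0.5):
    qset = set(query_terms)
    hits = [t in qset for t in doc_tokens]
    # fraction of query terms present anywhere in the document
    presence = len(qset & set(doc_tokens)) / len(qset)
    # longest run of consecutive query-term hits in the document
    longest = run = 0
    for h in hits:
        run = run + 1 if h else 0
        longest = max(longest, run)
    continuity = (longest - 1) / max(len(qset) - 1, 1) if longest > 1 else 0.0
    return presence + bonus * continuity

q = ["information", "retrieval"]
d1 = "semi structured information retrieval model".split()
d2 = "retrieval of structured information units".split()
print(continuity_score(q, d1), continuity_score(q, d2))
```

Both documents contain both query terms, but d1 has them adjacent, so it outranks d2; that is the effect that term presence alone cannot capture.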
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Elevating forensic investigation system for file clustering (eSAT Journals)
Abstract: In computer forensic investigation, thousands of files are usually examined. Much of the data in those files consists of unstructured text, whose analysis by computer examiners is very difficult to accomplish. Clustering is the unsupervised organization of patterns (data items, observations, or feature vectors) into groups (clusters). Finding a good solution for this automated method of analysis is of great interest. In particular, algorithms such as K-means, K-medoids, Single Link, Complete Link and Average Link can simplify the discovery of new and valuable information from the documents under investigation. This paper presents an approach that applies text clustering algorithms to the forensic examination of computers seized in police investigations, using a multithreading technique for data clustering. Keywords: clustering, forensic computing, text mining, multithreading.
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ... (IJECEIAES)
Agglomerative hierarchical clustering is a bottom-up clustering method in which the distances between documents can be obtained by extracting feature values using a topic-based latent Dirichlet allocation method. To reduce the number of features, term selection can be done using Luhn's Idea. These methods can be used to build better clusters for documents, but little research discusses this. Therefore, in this research, the term weighting calculation uses Luhn's Idea to select terms by defining upper and lower cut-offs, and then extracts term features using Gibbs-sampling latent Dirichlet allocation combined with term frequency and the fuzzy Sugeno method. The feature values are used as the distances between documents, which are clustered with single, complete, and average link algorithms. The evaluations show that feature extraction with and without the lower cut-off differs little, but topic determination for each term based on term frequency and the fuzzy Sugeno method is better than the Tsukamoto method at finding more relevant documents. The use of the lower cut-off and fuzzy Sugeno Gibbs latent Dirichlet allocation for complete agglomerative hierarchical clustering yields consistent metric values. This clustering method is suggested as a better method for clustering documents that is more relevant to the gold standard.
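The single/complete/average-link step can be sketched with a naive bottom-up merger. The toy distance matrix below is an assumption standing in for the LDA-derived document distances the paper uses; the O(n^3) pairwise search is fine for an illustration but not for real corpora.

```python
# Naive agglomerative clustering with selectable linkage; the distances are
# any document-to-document values (the paper derives them from LDA features).

def agglomerate(dist, n, target_k, linkage="single"):
    """dist: dict mapping frozenset({i, j}) -> distance between items i, j."""
    clusters = [{i} for i in range(n)]
    combine = {"single": min, "complete": max,
               "average": lambda ds: sum(ds) / len(ds)}[linkage]

    def cluster_dist(a, b):
        return combine([dist[frozenset({i, j})] for i in a for j in b])

    # repeatedly merge the closest pair of clusters until target_k remain
    while len(clusters) > target_k:
        pairs = [(cluster_dist(a, b), x, y)
                 for x, a in enumerate(clusters)
                 for y, b in enumerate(clusters) if x < y]
        _, x, y = min(pairs)
        clusters[x] |= clusters[y]
        del clusters[y]
    return clusters

# Toy distance matrix: documents 0,1 close; 2,3 close; the groups far apart.
d = {frozenset({0, 1}): 1.0, frozenset({2, 3}): 1.2,
     frozenset({0, 2}): 9.0, frozenset({0, 3}): 9.5,
     frozenset({1, 2}): 8.5, frozenset({1, 3}): 9.2}
print(agglomerate(d, 4, 2, linkage="average"))
```

On well-separated data all three linkages recover the same two groups; they differ on chained or overlapping clusters, which is why the paper compares them.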
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design, etc.
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA (ijistjournal)
Ontologies have been applied to many applications in recent years, especially the Semantic Web, information retrieval, information extraction, and question answering. The purpose of a domain-specific ontology is to get rid of conceptual and terminological confusion. It accomplishes this by specifying a set of generic concepts that characterizes the domain, as well as their definitions and interrelationships. This paper describes some algorithms for identifying semantic relations and constructing an Information Technology ontology while extracting concepts and objects from different sources. The ontology is constructed from three main resources: ACM, Wikipedia, and unstructured files from the ACM Digital Library. Our algorithms combine Natural Language Processing and Machine Learning. We use Natural Language Processing tools such as OpenNLP and the Stanford Lexical Dependency Parser in order to explore sentences. We then extract these sentences based on English patterns in order to build a training set. We use a random sample from among the 245 categories of ACM to evaluate our results. The results generated show that our system yields superior performance.
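A minimal flavour of pattern-based relation extraction can be sketched with a single lexical pattern. This is a hedged stand-in for the pipeline in the abstract above: one "X is a Y" regular expression instead of OpenNLP and a dependency parser, and the sample sentences are invented.

```python
import re

# One "X is a Y" lexical pattern standing in for a richer
# OpenNLP / dependency-parser pipeline.
PATTERN = re.compile(r"^(?P<x>[A-Z][\w ]*?) is an? (?P<y>[\w ]+?)\.?$")

def extract_is_a(sentences):
    """Return (subject, 'is-a', object) triples for matching sentences."""
    relations = []
    for s in sentences:
        m = PATTERN.match(s.strip())
        if m:
            relations.append((m.group("x"), "is-a", m.group("y")))
    return relations

sents = [
    "Python is a programming language.",
    "OpenNLP is a toolkit.",
    "We sampled many ACM categories.",
]
print(extract_is_a(sents))
```

Real systems generalize this with many patterns and parse-tree features; the point here is only the shape of the output: typed relations between extracted concepts.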
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering & Science are selected through rigorous peer reviews to ensure originality, timeliness, relevance, and readability.
Theoretical work submitted to the Journal should be original in its motivation or modeling structure. Empirical analysis should be based on a theoretical framework and should be capable of replication. It is expected that all materials required for replication (including computer programs and data sets) should be available upon request to the authors.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...ijdmtaiir
-In this study a comprehensive evaluation of two
supervised feature selection methods for dimensionality
reduction is performed - Latent Semantic Indexing (LSI) and
Principal Component Analysis (PCA). This is gauged against
unsupervised techniques like fuzzy feature clustering using
hard fuzzy C-means (FCM) . The main objective of the study is
to estimate the relative efficiency of two supervised techniques
against unsupervised fuzzy techniques while reducing the
feature space. It is found that clustering using FCM leads to
better accuracy in classifying documents in the face of
evolutionary algorithms like LSI and PCA. Results show that
the clustering of features improves the accuracy of document
classification
The World Wide Web holds a large size of different information. Sometimes while searching the World Wide Web, users always do not gain the type of information they expect. In the subject of information extraction, extracting semantic relationships between terms from documents become a challenge. This paper proposes a system helps in retrieving documents based on the query expansion and tackles the extracting of semantic relationships from biological documents. This system retrieved documents that are relevant to the input terms then it extracts the existence of a relationship. In this system, we use Boolean model and the pattern recognition which helps in determining the relevant documents and determining the place of the relationship in the biological document. The system constructs a term-relation table that accelerates the relation extracting part. The proposed method offers another usage of the system so the researchers can use it to figure out the relationship between two biological terms through the available information in the biological documents. Also for the retrieved documents, the system measures the percentage of the precision and recall.
A Semantic Retrieval System for Extracting Relationships from Biological Corpusijcsit
The World Wide Web holds a large size of different information. Sometimes while searching the World Wide Web, users always do not gain the type of information they expect. In the subject of information extraction, extracting semantic relationships between terms from documents become a challenge. This
paper proposes a system helps in retrieving documents based on the query expansion and tackles the extracting of semantic relationships from biological documents. This system retrieved documents that are relevant to the input terms then it extracts the existence of a relationship. In this system, we use Boolean
model and the pattern recognition which helps in determining the relevant documents and determining the place of the relationship in the biological document. The system constructs a term-relation table that accelerates the relation extracting part. The proposed method offers another usage of the system so the
researchers can use it to figure out the relationship between two biological terms through the available information in the biological documents. Also for the retrieved documents, the system measures the percentage of the precision and recall.
The World Wide Web holds a large size of different information. Sometimes while searching the World Wide Web, users always do not gain the type of information they expect. In the subject of information extraction, extracting semantic relationships between terms from documents become a challenge. This paper proposes a system helps in retrieving documents based on the query expansion and tackles the extracting of semantic relationships from biological documents. This system retrieved documents that are relevant to the input terms then it extracts the existence of a relationship. In this system, we use Boolean model and the pattern recognition which helps in determining the relevant documents and determining the place of the relationship in the biological document. The system constructs a term-relation table that accelerates the relation extracting part. The proposed method offers another usage of the system so the researchers can use it to figure out the relationship between two biological terms through the available information in the biological documents. Also for the retrieved documents, the system measures the percentage of the precision and recall.
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
Testing Different Log Bases for Vector Model Weighting Techniquekevig
Information retrieval systems retrieves relevant documents based on a query submitted by the user. The documents are initially indexed and the words in the documents are assigned weights using a weighting technique called TFIDF which is the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF represents the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents. It is computed by dividing the total number of documents in the system by the number of documents containing the term and then computing the logarithm of the quotient. By default, we use base 10 to calculate the logarithm. In this paper, we are going to test this weighting technique by using a range of log bases from 0.1 to 100.0 to calculate the IDF. Testing different log bases for vector model weighting technique is to highlight the importance of understanding the performance of the system at different weighting values. We use the documents of MED, CRAN, NPL, LISA, and CISI test collections that scientists assembled explicitly for experiments in data information retrieval systems.
Testing Different Log Bases for Vector Model Weighting Techniquekevig
Information retrieval systems retrieves relevant documents based on a query submitted by the user. The documents are initially indexed and the words in the documents are assigned weights using a weighting technique called TFIDF which is the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF represents the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents. It is computed by dividing the total number of documents in the system by the number of documents containing the term and then computing the logarithm of the quotient. By default, we use base 10 to calculate the logarithm. In this paper, we are going to test this weighting technique by using a range of log bases from 0.1 to 100.0 to calculate the IDF. Testing different log bases for vector model weighting technique is to highlight the importance of understanding the performance of the system at different weighting values. We use the documents of MED, CRAN, NPL, LISA, and CISI test collections that scientists assembled explicitly for experiments in data information retrieval systems.
Semantic similarity and semantic relatedness
measure in particular is very important in the current scenario
due to the huge demand for natural language processing based
applications such as chatbots and information retrieval systems
such as knowledge base based FAQ systems. Current approaches
generally use similarity measures which does not use the context
sensitive relationships between the words. This leads to erroneous
similarity predictions and is not of much use in real life
applications. This work proposes a novel approach that gives an
accurate relatedness measure of any two words in a sentence by
taking their context into consideration. This context correction
results in a more accurate similarity prediction which results in
higher accuracy of information retrieval systems.
An Examination of Effectuation Dimension as Financing Practice of Small and M...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Does Goods and Services Tax (GST) Leads to Indian Economic Development?iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Childhood Factors that influence success in later lifeiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Customer’s Acceptance of Internet Banking in Dubaiiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Consumer Perspectives on Brand Preference: A Choice Based Model Approachiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Student`S Approach towards Social Network Sitesiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Broadcast Management in Nigeria: The systems approach as an imperativeiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
A Study on Retailer’s Perception on Soya Products with Special Reference to T...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladeshiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Media Innovations and its Impact on Brand awareness & Considerationiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Customer experience in supermarkets and hypermarkets – A comparative studyiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Social Media and Small Businesses: A Combinational Strategic Approach under t...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Implementation of Quality Management principles at Zimbabwe Open University (...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL). A Govt. owned Company of Bangladesh Chemical Industries Corporation under Ministry of Industries.
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxR&R Consult
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue with reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. 
Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multiple Documents
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 17, Issue 2, Ver. IV (Mar – Apr. 2015), PP 32-41
www.iosrjournals.org
DOI: 10.9790/0661-17243241
An Enhanced Suffix Tree Approach to Measure Semantic
Similarity between Multiple Documents
A. Kavitha¹, Dr. N. Rajkumar², Dr. S. P. Victor³
¹ Research Scholar, Manonmaniam Sundaranar University, Tirunelveli, India
² Professor & Head, Dept. of M.E. Software Engineering, Sri Ramakrishna Engineering College, India
³ Professor & Head, Dept. of MCA, St. Xavier's College, Palayamkottai, Tirunelveli, India
Abstract: Semantic similarity is a measure of how closely a set of documents match in meaning. Document similarity is the process of computing the semantic similarity between multiple documents using similarity measures. In this paper, document similarity is applied to compute the pairwise similarities of documents based on the Suffix Tree Document (STD) model. Documents are pre-processed first; data preprocessing improves the reliability of the similarity values. The pre-processed phrases are then inserted into a suffix tree. A suffix tree is a data structure that presents the suffixes of a given string in a way that allows a particularly fast implementation of many important string operations. The suffix substrings are selected as the phrases that label the edges of the suffix tree, and internal nodes represent phrases shared by multiple documents. The more internal nodes two documents share, the more similar they are. A suffix tree can also be used to solve the exact matching problem in linear time. Phrase-based document similarity naturally inherits the tf-idf (term frequency–inverse document frequency) weighting scheme in computing document similarity with phrases: the tf-idf method is used to calculate the weight of the internal nodes of the suffix tree, i.e. the nodes shared by multiple documents. The cosine, Dice and Hellinger measures are then applied to find the pairwise similarity based on the weight of each internal node.
Keywords: Semantic similarity, Similarity measures, Document similarity, Suffix tree, Tf-idf scheme.
I. Introduction
Semantic similarity is a metric by which a set of documents is assigned a score based on the likeness of their meaning content. Document similarity plays a vital role in information retrieval using clustering techniques [11][7]. The main goal of the system is to compute the semantic similarity between multiple documents: the user supplies several documents as input, and the system finds the similarity between them using different similarity measures. Document preprocessing consists of stop-word removal, case conversion and special-character removal. Phrases are extracted from the documents to construct the suffix tree and label its edges [1][10]. A suffix tree is a data structure that presents the suffixes of a given string in a way that allows a particularly fast implementation of many important string operations [14][9]. The tf-idf method is used to calculate the weight of the internal nodes of the suffix tree, where internal nodes are the nodes shared by multiple documents. The cosine similarity measure, the Dice coefficient and the Hellinger measure are used to find the pairwise similarity based on the weight of each internal node [5][7]. Document similarity is reported as a value between 0 and 1, where 1 implies absolute similarity and 0 implies that the documents are not similar at all.
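The three pairwise measures named above can be sketched over two node-weight vectors as follows. This is a minimal illustration, assuming the node weights have already been computed and aligned into equal-length vectors; for the Hellinger measure it uses the Bhattacharyya-coefficient form over L1-normalised weights, which is one common choice for a Hellinger-based similarity.

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def dice(u, v):
    # Extended Dice coefficient for weighted vectors.
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v)
    return 2 * dot / denom if denom else 0.0

def hellinger(u, v):
    # Hellinger affinity (Bhattacharyya coefficient) over L1-normalised weights.
    su, sv = sum(u), sum(v)
    if not su or not sv:
        return 0.0
    return sum(math.sqrt((a / su) * (b / sv)) for a, b in zip(u, v))

# Hypothetical internal-node weight vectors for two documents.
u = [0.2, 0.0, 0.5, 0.3]
v = [0.1, 0.4, 0.5, 0.0]
print(round(cosine(u, v), 3), round(dice(u, v), 3), round(hellinger(u, v), 3))
```

All three measures return 1 for identical weight vectors and 0 for documents that share no weighted nodes, matching the 0-to-1 range described above.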
1.1 Semantic Similarity
Semantic similarity measures can be classified into pairwise and groupwise measures. A pairwise measure estimates the functional similarity between two instances by combining the semantic similarities of the concepts they represent. A groupwise measure calculates the similarity directly, without combining the semantic similarities of the individual concepts.
Semantic similarity is a widely used approach and is associated with several applications that determine similarity [15]. Similarity measures are used in conjunction with corpus-based systems to retrieve all kinds of information, and they also help retrieve information on the web [3][4][8].
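Before any pairwise measure can be applied, every shared internal node needs a weight. The sketch below illustrates the tf-idf idea on a toy corpus; the documents, phrase lists and the helper name `tfidf_weights` are hypothetical, and a real implementation would take the phrases from the internal nodes of the suffix tree rather than from fixed lists.

```python
import math

# Hypothetical mini-corpus: each document is represented by the phrases
# that would label its shared suffix-tree nodes.
docs = {
    "d1": ["computer science", "theory of computation", "computer science"],
    "d2": ["computer science", "information retrieval"],
    "d3": ["information retrieval", "suffix tree"],
}

def tfidf_weights(docs):
    """Weight each (document, phrase) pair with tf * idf."""
    n = len(docs)
    # Document frequency: number of documents containing each phrase.
    df = {}
    for phrases in docs.values():
        for p in set(phrases):
            df[p] = df.get(p, 0) + 1
    weights = {}
    for name, phrases in docs.items():
        weights[name] = {}
        for p in set(phrases):
            tf = phrases.count(p)          # occurrences within this document
            idf = math.log10(n / df[p])    # rarer phrases get a larger idf
            weights[name][p] = tf * idf
    return weights

w = tfidf_weights(docs)
print(round(w["d1"]["computer science"], 3))
```

Note that a phrase occurring in every document gets weight 0, while "suffix tree", which occurs in only one document, receives the highest per-occurrence weight, consistent with rarer shared phrases being more informative.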
1.2 Data Preprocessing
Data pre-processing consists of three phases, namely special-character removal, stop-word removal and case conversion. It helps to minimize the document size and the comparison time. In the first phase, a list of 32 special characters is removed from all the documents [1]. A few of these special characters are shown in fig. 1.
!, @, #, $, %, ^, &, *, (, ), -, =, +, _, [, ], ;, :, |, <, >, ?, /, `, ~, ., ,
Figure 1. Special characters list
The second phase is stop-word removal: a list of 256 stop words is eliminated from all the input documents. Some of the stop words are presented in fig. 2.
a, an, the, is, are, there, who, what, when, how, much, this, that, ... etc.
Figure 2. Stop Words List
The third phase is case conversion: the entire document is converted from uppercase to lowercase.
Example
The data preprocessing process is illustrated on the following document, as shown in fig. 3 and fig. 4.
Computer science or Computing science (abbreviated as CS or CompSci) is the scientific and practical approach to
computation and its applications. A computer scientist specializes in the theory of computation and the design of
computational systems.
Figure 3. Document1
Computer science Computing science abbreviated CS or CompSci scientific practical approach computation
applications computer scientist specialize theory computation design computational systems
Figure 4. Preprocessed documents
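The three preprocessing phases above can be sketched as follows. The stop-word set here is a tiny hypothetical subset of the 256-word list used in the paper, and the regular-expression character class only approximates the 32-character list of fig. 1, so the output differs slightly from fig. 4 (for example, "or" is removed here).

```python
import re

# Hypothetical subset of the paper's 256-word stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "there", "who", "what",
              "when", "how", "much", "this", "that", "or", "and", "its",
              "in", "of", "to", "as"}

def preprocess(text):
    # Phase 1: special-character removal; Phase 3: case conversion.
    text = re.sub(r"[!@#$%^&*()\-=+_\[\];:|<>?/`~.,]", " ", text).lower()
    # Phase 2: stop-word removal.
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

doc1 = ("Computer science or Computing science (abbreviated as CS or CompSci) "
        "is the scientific and practical approach to computation and its applications.")
print(preprocess(doc1))
```

Running this on the document of fig. 3 yields a lowercase, punctuation-free, stop-word-free phrase string ready for insertion into the suffix tree.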
II. Related Work
Hung Chim and Xiaotie Deng (2008) proposed a method to compute document similarity. The main objective of their work was a phrase-based document similarity that computes the pairwise similarities of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of the STD model to a unique feature term in the Vector Space Document (VSD) model, the phrase-based document similarity naturally inherits the term tf-idf weighting scheme in computing document similarity with phrases [1].
Elias Iosif and Alexandros Potamianos presented web-based metrics that compute the semantic similarity between words or terms and compared them with the state of the art. Starting from the fundamental assumption that similarity of context implies similarity of meaning, relevant web documents were downloaded via a web search engine and the contextual information of the words of interest was compared (context-based similarity metrics). The proposed unsupervised context-based similarity computation algorithms proved competitive with state-of-the-art supervised semantic similarity algorithms based on language-specific knowledge resources [2].
Chen et al. proposed story link detection systems that determine whether two stories are about the same event; such links are usually based on the cosine similarity between the two stories. Their work presents a method for increasing the performance of a link detection system by using a variety of similarity measures together with source-pair specific collective information. Similarity measures such as cosine, Hellinger, Tanimoto and clarity were used both alone and in combination [5]. Jaz et al. presented two methods to learn semantic similarity between documents: one based on document similarity and the other based on co-occurrence information [13].
Sheetal A et al. presented a method to compute similarity between words through web documents.
Semantic similarity measures play an important role in the extraction of semantic relations. It uses the web
based metrics to compute semantic similarity between words or terms and also compares with the state-of-the-
art. Similarity measures proposed in this work based on the five different association measures in retrieval of
information that is normal matching, Dice, Jaccard, Overlap, and Cosine coefficient. The performance of these
methods has been evaluated using Miller and Charle’s benchmark dataset [6].
Anna Huang implemented a method to analyze the effectiveness of similarity measures in partitional
clustering for text document datasets. This proposed approach utilized the standard K-means algorithm and
report the results on several text document datasets and five distance/similarity measures that have been most
commonly used in text clustering [7]. Hsun and yau presented the work of cross language retrieval using
semantic similarity measures. They applied fuzzy models to represent the document and used similarity
approaches to retrieve information [12].
III. Proposed Work
The proposed system includes four major methods to compute an efficient similarity between
document work namely Data Preprocessing, Suffix tree, Node Weight calculation and Similarity Measures. The
proposed work includes the stop nodes removal that is removal of symbols, Stop words and Case Conversion.
Phrases can be extracted from the pre-processed data. Each internal node has at least two children and each edge
3. An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multiple Documents
DOI: 10.9790/0661-17243241 www.iosrjournals.org 34 | Page
is labeled with a nonempty sub-string of a document known as a sentence. Every leaf node in the suffix tree
designates a suffix sub-string of a document; each internal node shows a phrase shared by at least two suffix
sub-strings. The similarity of two documents is defined as the more internal nodes shared by the two documents,
the more related the documents be likely and includes different similarity measures to show the different
between the range of the similarity and the flow of proposed the similarity measures includes three different
measures such as Hellinger, Jacard and Dice coefficient. The proposed work is shown in the Fig. 5.
Figure 5. Proposed system Architecture
3.1 Suffix Tree
A tree-like data structure for solving problems contains strings which allow the storage of all sub-
strings of a given string in linear space. Each internal node, except root node, contains minimum two children
and every edge is labeled with a nonempty sub-string of S. Suffix tree is considered to be one of the well-known
full text index data structures. It has been studied for decades and is used in many algorithmic solutions and
practical applications. The necessary steps to be followed to construct suffix tree consists of extracting the
phrases form the preprocessed document and each edge is labeled with a nonempty sub-string of a document
called a phrase. There are three kinds of nodes in the suffix tree: the leaf nodes, root node and internal nodes.
Every internal node represents a common phrase shared by at least two suffix sub strings. The similarity of two
documents is defined as the more internal nodes shared by the two documents, the more exact documents it
should be. The leaf nodes can be called as terminal nodes. Each node in the suffix tree, except terminal nodes
and the root node, either an internal node or a leaf node represents a nonempty phrase that appears in at least one
document in the data set. The similar phrase may exist in various edges of the suffix tree. The suffix tree of a
document set is a compact trie containing all suffix sub-strings of the documents in the data set. During the
suffix tree construction, the root node is the initial node and the parent of all other nodes. All other nodes are
created and stored in a hierarchical order to follow their LCP nodes, respectively. In our contribution, all the
child nodes of the root node are defined as first-level nodes of the suffix tree, the child nodes of the first-level
nodes as second-level nodes and so on.
To build a suffix tree, the naive and straightforward method searches each suffix sub-string of the
document to all suffix sub-strings which already exist in the tree and finds a position to insert it. The time
complexity of building the suffix tree for a document of m words is O (m2
).
Example
Consider the two Documents:
4. An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multiple Documents
DOI: 10.9790/0661-17243241 www.iosrjournals.org 35 | Page
Document 1
Document 2
Cont..
Computer science computing science abbreviated cs or compsci scientific practical
approach computation article computer scientist specializes theory computation
design computational systems
Computer science appears 1959 article communication Human interaction considers
challenges making computers computations useful usable universally accessible
humans
5. An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multiple Documents
DOI: 10.9790/0661-17243241 www.iosrjournals.org 36 | Page
Figure 6. Suffix tree
Nodes shared in the above Suffix tree are A, B and E.
3.2 Weight Calculation
Weight of the node can be calculated using TF-IDF weighting scheme, where tf- refers term frequency
and df- refers inverse document frequency , is a numerical statistic which reflects how important a word to
a document in a set. It is frequently used as a weighting feature in information retrieval and text mining.
The tf(t,d) represents the number of times that term t occurs in document d.
The inverse document frequency (idf) is a measure of whether the term is common or rare across all
documents. The Idf is obtained by dividing the total number of documents by the number of documents
containing the term.
The node weights in the documents to be calculated using equation (1).
d={w(1,d),w(2,d),…….w(m,d)} (1)
Where w=weight and m=number of terms. The weight of the term can be calculated using equation
(2).
w(i,d)=(1+log tf(I,d).log(1+N/df(i)) (2)
Where, tf(i,d),is the frequency of the ith
term in the document, and df(i) ,is the number of Documents containing
the ith
term and N refers number of Documents.
Example: Calculating the weight of the internal nodes shared by multiple Documents.
Internal nodes Shared by Multiple Documents in fig. 6 are Node A,B and E.
Calculating the Weights
w(a,1)=w(computer,doc1)=(1+log tf(computer,doc1)).log(1+N/df(computer))
tf(computer,doc1) = 1
df(computer) = 2
(1+log 1).log(1+2/2)
(1+0).log(1+1)
(1).(0.693)
0.693
w(B,doc1)=w(science,doc1)
(1+log tf(science,doc1).log(1+N/df(science))
tf(science,doc1)=1
df(science)=2
(1+log 1).log(1+2/2)
W(Computer,doc1)=0.693
6. An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multiple Documents
DOI: 10.9790/0661-17243241 www.iosrjournals.org 37 | Page
(1+0).log(1+1)
0.693
Similarly, calculate the value of node B and E with respect to Document 1 and Document 2.
Node weight table is constructed from the above calculation as shown in table 1.
Table 1. Node Weight Table
3.3 Similarity Measures
Similarity Measure is a measure which computes the semantic similarity of the documents using
similarity values and the similarity method can represents the similarity between multiple documents. The
measure reflects the degree of closeness or likeliness of two documents. All similarity measures should map to
the range [-1, 1] or [0, 1] , 0 or -1 minimum similarity and 1 shows maximum similarity. The proposed approach
has been applied three different similarity measures: Cosine similarity, Dice Coefficient and Hellinger Measure.
There is a large number of similarity measures proposed in the survey, since the finest similarity measure is not
exist.
3.3.1 Cosine Similarity
Cosine similarity is a measure of similarity between two vectors of an inner product space that
measures the cosine of the angle between them. The resulting similarity ranges from −1 to 1 and 0 usually
representing autonomy, and values in between represents intermediary similarity or dissimilarity. In the case of
similarity measure, the cosine similarity of two documents may be series as of 0 to 1, because the term
frequencies may not be negative.
Cosine Similarity = dx.dy ∑ i
m
=1 xi.yi
(3)
|dx|.|dy| √ ∑i=1
m
xi
2
yi
2
Where dx and dy are the Documents
dx={x1,x2,x3……xn} and dy={y1,y2,y3…..yn}, xi and yi is the weight of corresponding nodes and m
and n are the number of internal nodes.
Doc 1 ={A,B,E}
Doc 2 ={A,B,E}
where, x is A, y is B and z is E.
(x1*x2)+(y1*y2)+(z1*z2)
(x12
+y12
+z12
)1/2
(x22
+y22
+z22
)1/2
= (0.693 *0.693)+(0.693*1.173)+(0.693 *0.693)
((0.6993)2
+(0.693)2
+(0.693)2
)1/2
.((0.693)2+
(1.173)2
+(0.693)2)1/2
1.7732
= (1.4406)1/2
.(2.257)1/2
1.7732
=
(1.200)(1.502)
= 0.98
Cosine Similarity for the Document 1 and Document 2 is 0.98.
3.3.2 Dice Coefficient
Dice coefficient determines how similar a set and another set are. It can be applied to measure how
similar two Documents are in terms of number of common bi-grams. Dice coefficient is mainly used for
comparing the similarity of two Documents and it uses statistic to compute the similarity of two samples.
NODE DOC 1 DOC 2
A 0.693 0.693
B 0.693 1.173
E 0.693 0.693
Cosine =
W(Science,doc1)=0.693
7. An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multiple Documents
DOI: 10.9790/0661-17243241 www.iosrjournals.org 38 | Page
Where A and B are the Documents
Doc 1 ={A,B,E}
Doc 2 ={A,B,E}
3.3.3 Hellinger Distance
Distance between probability distributions is called as Hellinger distance. The Hellinger distance is
closely associated to the total variation distance. For example, both distances define the same topology of the
space of probability measures, but it has several technical advantages derived from properties of inner products.
Hellinger Distance for Document 1 and Document 2 is 0.984.
The comparison of two documents using Cosine, Dice and Hellinger distance has shown in table 2.
Table 2. Comparison table of two different Similarity measures
Measures Similarity Values
COSINE 0.99
DICE 0.956
2 A.B 2(∑xi yi) (4)
=
|A|+|B| ∑i=1
m
xi
2
+∑ i=1
m
yi
2
Dice coefficient =
2 ((0.693 *0.693)+(0.693*1.173)+(0.693 *0.693))
((0.6993)2
+(0.693)2
+(0.693)2
)+((0.693)2+
(1.173)2
+(0.693)2)
2(1.7732)
(1.4406)+(2.257)
3.5464
3.6976
0.956
Dice coefficient for Document 1 and Document 2 is 0.956
∑xi yi
Hellinger = (6)
(∑i=1m xi2 +∑ i=1m yi2 ) -∑i=1n (xi – yi )
((0.693 *0.693)+(0.693*1.173)+(0.693 *0.693))
((0.6993)2 +(0.693)2 +(0.693)2 )+((0.693)2+ (1.173)2 +(0.693)2 )
- ((0.693*0.693)+(0.693*1.173)+(0.693 *0.693))
(1.7732)
((1.4406)+(2.1334)) -1.7732
1.7732
3.574-1.7732
0.984
8. An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multiple Documents
DOI: 10.9790/0661-17243241 www.iosrjournals.org 39 | Page
Hellinger 0.984
IV. Performance Evaluation
In order to evaluate the performance of the proposed system, it has been developed using NetBeans
IDE version 7.2 for UI and computing the values and Microsoft Access for database. The set of standard data
from www.Wikipedia.com source and also some dataset from www.uc.dataset.org has been collected and
employed to the evaluation of the system.
This system gives the document similarity values between 0 and 1. Multiple documents that are any
number of documents can be compared to get the similarity values using Cosine, Dice and Hellinger measures.
The preprocessing method reduces the complexity of the suffix tree and increases the accuracy of the Similarity
measures by eliminating irrelevant terms and symbols as node. The String matching and term weight can be
easily calculated using Suffix Tree procedure. Fig, 7 describes the size of suffix tree growth linearly to the size
of documents. The line shows the number of internal nodes in suffix tree against the number of nodes exist in
every document.
Figure 7. The size of suffix tree scales linearly to the size of document
Figure 8. Time cost for Similarity and suffix tree construction
The fig. 8 shows the time required to construct the suffix tree and similarity calculation. The time
gradually increases with the number of documents in the system. The comparison of similarity result from the
Hellinger, Cosine and Dice is presented in fig. 9.
9. An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multiple Documents
DOI: 10.9790/0661-17243241 www.iosrjournals.org 40 | Page
0
0.2
0.4
0.6
0.8
1
1.2
DOC 1-
2
DOC 1-
3
DOC 1-
4
DOC 2-
3
DOC 2-
4
DOC 3-
4
S
i
m
i
l
a
r
i
t
y
v
a
l
u
e
Documents
Hellinger
Cosine
Dice
Figure 9. Comparison of different similarity measures
V. Conclusion And Future Work
The paper successfully computes the similarity of multiple documents and gives the similarity in
values. The concept of the suffix tree and the new document similarity are quite simple, but the implementation
of these approaches is little bit complicated. To improve the performance of the document similarity, we
investigated the STD model in both the theoretical data structure analysis and the clustering algorithmic
optimization. As a result, the efficiency of the new document similarity approach has been proven in our
experiments on large document dataset. The phrases tf-idf weights has been used in computing document
similarities and proven to be very effective in documents similarity. Our work has reported a successful
approach to extend the usage of tf-idf weighting scheme. The term tf-idf weighting scheme is suitable for
evaluating the importance of not only the keywords but also the phrase in document clustering. The replacement
of Suffix tree with Enhanced suffix Arrays improves the space efficiency. Enhanced suffix arrays satisfy the
algorithm of the suffix tree to overcome the space and time complexities. The future scope of the system will
focus on accepting all types of documents to determine the similarity.
References
[1]. Hung Chim and Xiaotie Deng, “Efficient Phrase-Based Document Similarity for Clustering” IEEE Transactions On Knowledge
And Data Engineering, Vol. 20, No. 9, pp. 1217-1229, 2008.
[2]. Elias Iosif, Alexandros Potamianos, “Unsupervised Semantic Similarity Computation between Terms Using Web Documents”
IEEE Transactions On Knowledge And Data Engineering, Vol. 22, No. 11, pp: 1637-1647, 2010.
[3]. Angelos Hliaoutakis, et al. , “Information Retrieval by Semantic Similarity” , in. International Journal on Semantic Web &
Information Systems, Vol.2, No.3, pp.55-73, 2006.
[4]. Giannis Varelas et al., “Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web”,
Proc. of the 7th ACM International workshop on Web information and Data Management , pp. 10 -16, 2005.
[5]. Francine Chen, Ayman Farahat, Thorsten Brants, “Multiple Similarity Measures and Source-Pair Information in Story Link
Detection”, Proc. of Human Language Technology Conference, pp. 313-320, Chicago, 2004.
[6]. Sheetal A. Takale, Sushma S. Nandgaonkar , “Measuring Semantic Similarity between Words Using Web Documents”,
International Journal of Advanced Computer Science and Applications, Vol. 1, No.4, pp.78-85, 2010.
[7]. Anna Huang, “Similarity Measures for Text Document Clustering”, Computer Science Research Student Conference, pp.49-56,
New Zealand, 2008.
[8]. Danushka Bollegala, Yutaka Matsuo and Mitsuru Ishizuka, “A Web Search Engine-Based Approach to Measure Semantic
Similarity between Words”, Knowledge And Data Engineering, Vol. 23, No. 7, pp.977-990, 2011.
[9]. D.S. Sven Meyer zu Eissen and M. Potthast, “The Suffix Tree Document Model Revisited,” Proc. Fifth Int’l Conf. Knowledge
Management (I-Know ’05), pp. 596-603, 2005.
[10]. Mohamed Ibrahim Abouelhoda, Stefan Kurtz and Enno Ohlebusch, “Replacing suffix trees with enhanced suffix arrays”, Journal of
Discrete Algorithms Vol.2, No.1, pp.53–86, 2003.
[11]. Behnam Hajian and Tony White, “Measuring Semantic Similarity using a Multi-Tree Model”, Proc. of 9th Workshop on
Intelligent Techniques for Web Personalization and Recommender Systems, pp. 7–14, Spain, 2011.
10. An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multiple Documents
DOI: 10.9790/0661-17243241 www.iosrjournals.org 41 | Page
[12]. Hsun-Hui Huang and Yau-Hwang Kuo, “Cross-Lingual Document Representation and Semantic Similarity Measure: A Fuzzy Set
and Rough Set Based Approach”, IEEE Transactions on fuzzy systems, vol.18, no.6, pp.1098-1111, 2010.
[13]. Jaz Kondola,John Shawe-Taylor,Nello Cristianini, “Learning Semantic Simlarity”, Proc. of Neural Information Processing Systems,
vol.15, pp.657-664, Canada, 2003.
[14]. E. Ukkonen, “On-Line Construction of Suffix Trees,” Algorithmica, vol. 14, no. 3, pp. 249-260, 1995.
[15]. Dekang Lin, “An Information-Theoretic Definition of Similarity”, Proc. of 15th
International Conference Conference on Machine
Learning, pp. 296-304, Wisconsin, USA, 1998.