The document proposes a privacy-preserving approach for hierarchical document clustering using maximal frequent item sets (MFI). First, MFI are identified from document collections using the Apriori algorithm to define clusters precisely. Then, the same MFI-based similarity measure is used to construct a hierarchy of clusters. This approach decreases dimensionality and avoids duplicate documents, thereby protecting individual copyrights. The methodology and algorithm are described in detail.
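The mining step summarized above can be illustrated with a minimal sketch. It assumes each document is reduced to a set of terms and that support is counted as the number of documents containing an itemset; the function names, the toy documents, and the `min_support` value are hypothetical and are not taken from the paper.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search; returns {frozenset: support_count}."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-sets into k-set candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k:
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in current
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Count support of the surviving candidates.
        current = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                current[cand] = support
        frequent.update(current)
        k += 1
    return frequent

def maximal_frequent_itemsets(frequent):
    """Keep only frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

# Hypothetical toy collection: each document is a set of terms.
docs = [
    {"cluster", "document", "privacy"},
    {"cluster", "document", "hierarchy"},
    {"cluster", "document", "privacy"},
    {"hierarchy", "tree"},
]
mfi = maximal_frequent_itemsets(frequent_itemsets(docs, min_support=2))
```

In this toy run each maximal frequent itemset could serve as a cluster label; how the paper maps documents to MFI-based clusters is not shown here.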
International Journal of Engineering Research and Applications (IJERA) is run by a team of researchers, not by a publication service or private publisher operating journals for monetary benefit. We are an association of scientists and academics who focus only on supporting authors who want to publish their work. The articles published in our journal can be accessed online, and all articles are archived for real-time access.
Our journal primarily aims to bring out the research talent and the work of scientists, academics, engineers, practitioners, scholars, and postgraduate students of engineering and science. The journal aims to cover scientific research in a broad sense rather than a niche area, so that researchers from various verticals can publish their papers. It also aims to provide a platform for researchers to publish in a shorter time, enabling them to continue further. All published articles are freely available to scientific researchers in government agencies, educators, and the general public. We are making serious efforts to promote our journal across the globe in various ways, and we are sure that it will act as a scientific platform for all researchers to publish their work online.
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Abstract: Traditional approaches to document classification need labelled data to construct reliable and accurate classifiers. Unfortunately, labelled data are rarely available and often too costly to obtain. For a given learning task for which training data are unavailable, abundant labelled data may exist for a different but related domain. One would like to use the related labelled data as auxiliary information to accomplish the classification task in the target domain. Recently, the paradigm of transfer learning has been introduced to enable effective learning strategies when auxiliary data obey a different probability distribution. A co-clustering based classification algorithm has been previously proposed to tackle cross-domain text classification. In this work, we extend the idea underlying this approach by making the latent semantic relationship between the two domains explicit. This goal is achieved with the use of Wikipedia. As a result, the pathway that allows propagating labels between the two domains captures not only common words but also semantic concepts based on the content of documents. We empirically demonstrate the efficacy of our semantic-based approach to cross-domain classification using a variety of real data.
Keywords: Classification, Clustering, Cross-domain Text Classification, Co-clustering, Labelled data, Traditional Approaches.
Title: Co-Clustering For Cross-Domain Text Classification
Author: Rayala Venkat, Mahanthi Kasaragadda
ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Paper Publications
An Improved Similarity Matching based Clustering Framework for Short and Sent...IJECEIAES
Text clustering plays a key role in the navigation and browsing process. For efficient text clustering, a large amount of information is grouped into meaningful clusters. Many text clustering techniques do not address issues such as high time and space complexity, inability to understand the relational and contextual attributes of words, low robustness, and risks related to privacy exposure. To address these issues, an efficient text-based clustering framework is proposed. The Reuters dataset is chosen as the input dataset. Once the input dataset is preprocessed, the similarity between words is computed using cosine similarity. The similarities between the components are compared and the vector data is created. From the vector data the clustering particle is computed. To optimize the clustering results, mutation is applied to the vector data. The performance of the proposed framework is analyzed using metrics such as Mean Square Error (MSE), Peak Signal to Noise Ratio (PSNR), and processing time. The experimental results show that the proposed framework produced optimal MSE, PSNR, and processing time when compared to the existing Fuzzy C-Means (FCM) and Pairwise Random Swap (PRS) methods.
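As a rough illustration of the cosine similarity step mentioned above, the following sketch compares two short texts as term-count vectors. The whitespace tokenizer and the sample sentences are assumptions made for the example; the abstract does not say how the framework builds its vectors.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, whitespace-split term counts (a simplifying assumption)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

# Hypothetical short texts: two of three terms are shared.
first = bag_of_words("oil prices rise")
second = bag_of_words("oil prices fall")
similarity = cosine_similarity(first, second)
```

The two texts share two of their three equally weighted terms, so the similarity works out to 2/3.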
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGijcsa
Text Document Clustering is one of the fastest growing research areas because of availability of huge amount of information in an electronic form. There are several number of techniques launched for clustering documents in such a way that documents within a cluster have high intra-similarity and low inter-similarity to other clusters. Many document clustering algorithms provide localized search in effectively navigating, summarizing, and organizing information. A global optimal solution can be obtained by applying high-speed and high-quality optimization algorithms. The optimization technique performs a globalized search in the entire solution space. In this paper, a brief survey on optimization approaches to text document clustering is turned out.
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSijseajournal
ABSTRACT
In this paper we propose a novel method to cluster categorical data while retaining their context. Typically, clustering is performed on numerical data. However, it is often useful to cluster categorical data as well, especially when dealing with data in real-world contexts. Several methods exist which can cluster categorical data, but our approach is unique in that we use recent text-processing and machine learning advancements like GloVe and t-SNE to develop a context-aware clustering approach (using pre-trained word embeddings). We encode words or categorical data into numerical, context-aware vectors that we use to cluster the data points using common clustering algorithms like K-means.
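The encode-then-cluster idea can be sketched as follows. The two-dimensional vectors below are hand-made stand-ins for pre-trained GloVe embeddings (real ones have far more dimensions), and the K-means routine is a plain Lloyd iteration seeded with the first `k` points; both are assumptions for illustration, not the authors' setup.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded deterministically with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical toy "embeddings": colours sit near each other, animals near each other.
toy_embeddings = {
    "red": (0.9, 0.1),
    "dog": (0.1, 0.9),
    "blue": (0.8, 0.2),
    "cat": (0.2, 0.8),
}
words = list(toy_embeddings)
labels, centroids = kmeans([toy_embeddings[w] for w in words], k=2)
clusters = {w: label for w, label in zip(words, labels)}
```

With real embeddings the lookup table would be loaded from a pre-trained file, and a library K-means with randomized seeding would normally replace this hand-rolled loop.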
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...csandit
The goal of document clustering algorithms is to create clusters that are internally coherent but
clearly different from each other. The useful expressions in documents are often accompanied
by a large amount of noise caused by the use of unnecessary words, so it is indispensable
to eliminate the noise and keep just the useful information.
Keyphrase extraction systems for Arabic are a new phenomenon. A number of text mining
applications can use them to improve their results. Keyphrases are defined as phrases that
capture the main topics discussed in a document; they offer a brief and precise summary of
document content. Therefore, they can be a good way to remove the existing noise from
documents.
In this paper, we propose a new method to solve the problem cited above, especially for
documents in Arabic, which is one of the most complex languages, by using a new keyphrase
extraction algorithm based on the Suffix Tree data structure (KpST). To evaluate our approach,
we conduct an experimental study on Arabic document clustering using the most popular
hierarchical approach: the agglomerative hierarchical algorithm with seven linkage
techniques and a variety of distance functions and similarity measures. The obtained results
show that our approach for extracting keyphrases improves the clustering results.
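To make the role of the linkage technique concrete, here is a small agglomerative clustering sketch supporting two linkages, single and complete. The abstract mentions seven linkage techniques without listing them, so the choice of these two, the Euclidean distance, and the one-dimensional toy points are all assumptions.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters until only k clusters remain."""
    if linkage not in ("single", "complete"):
        raise ValueError("this sketch only supports single and complete linkage")
    # Single linkage uses the closest pair of members, complete the farthest.
    pick = min if linkage == "single" else max
    clusters = [[p] for p in points]

    def cluster_distance(a, b):
        return pick(euclidean(p, q) for p in a for q in b)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        _, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 1-D points standing in for document vectors.
points = [(0,), (1,), (10,), (11,), (25,)]
single = agglomerative(points, k=3, linkage="single")
complete = agglomerative(points, k=3, linkage="complete")
```

On well-separated data like this both linkages agree; they tend to differ when clusters are elongated or unevenly sized.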
Data mining is used to manage the huge amounts of information stored in data warehouses and databases and to discover the required information. Numerous data mining techniques have been proposed, for example association rules, decision trees, neural networks, and clustering. Data mining has been a focus of attention for many years. A well-known data mining strategy is clustering of the dataset, which is presented as the most effective data mining method. It groups the dataset into a number of clusters based on certain predefined guidelines, and it is a dependable way to discover the connections between distinct characteristics of the data.
In the k-means clustering algorithm, the function is selected on the basis of its relevance for predicting the data, and the Euclidean distance between the centroid of each cluster and the data objects outside the cluster is computed to cluster the data points. In this work, the author enhanced the Euclidean distance formula to increase cluster quality.
The problem of accuracy and of redundant dissimilar points in the clusters remains in the improved k-means, so a new enhanced approach is proposed that uses a similarity function to check the similarity level of a point before including it in a cluster.
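The idea of checking a point before admitting it to a cluster could look like the sketch below. The text does not define the similarity function, so this example assumes the check is a simple cap on the Euclidean distance to the nearest centroid; `max_distance` and the label `-1` for rejected points are hypothetical choices.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_with_threshold(points, centroids, max_distance):
    """Assign each point to its nearest centroid, or -1 if it is too far away."""
    labels = []
    for p in points:
        distances = [euclidean(p, c) for c in centroids]
        nearest = min(range(len(centroids)), key=distances.__getitem__)
        # Assumed similarity check: admit the point only if it is close enough.
        labels.append(nearest if distances[nearest] <= max_distance else -1)
    return labels

# Hypothetical data: the last point is far from both centroids.
points = [(0, 0), (1, 0), (10, 10), (5, 5)]
centroids = [(0.5, 0), (10, 10)]
labels = assign_with_threshold(points, centroids, max_distance=2.0)
```

Points labelled `-1` are left out of the centroid update, which is one way to keep dissimilar points from distorting a cluster.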
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE ijdms
Nowadays, document clustering is considered as a data intensive task due to the dramatic, fast increase in the number of available documents. Nevertheless, the features that represent those documents are also too large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations between words. In this paper we introduce a distributed implementation for the bisecting k-means using MapReduce programming model. The aim behind our proposed implementation is to solve the problem of clustering intensive data documents. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our presented experimental results show that using lexical categories for nouns only enhances internal evaluation measures of document clustering; and decreases the documents features from thousands to tens features. Our experiments were conducted using Amazon ElasticMapReduce to deploy the Bisecting k-means algorithm.
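For readers unfamiliar with bisecting k-means, a sequential sketch of the core loop follows. It does not reproduce the MapReduce distribution or the WordNet integration described above. Splitting the cluster with the largest sum of squared errors and seeding each split with a point and its farthest neighbour are assumptions made to keep the example deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    centre = mean(points)
    return sum(sq_dist(p, centre) for p in points)

def two_means(points, iterations=10):
    """Split one cluster in two with a small 2-means run."""
    # Deterministic seeding: the first point and the point farthest from it.
    a = points[0]
    b = max(points, key=lambda p: sq_dist(p, a))
    left, right = list(points), []
    for _ in range(iterations):
        left = [p for p in points if sq_dist(p, a) <= sq_dist(p, b)]
        right = [p for p in points if sq_dist(p, a) > sq_dist(p, b)]
        if not left or not right:
            break
        a, b = mean(left), mean(right)
    return left, right

def bisecting_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        left, right = two_means(clusters.pop(worst))
        if not left or not right:
            # The chosen cluster cannot be split further (e.g. identical points).
            clusters.append(left or right)
            break
        clusters.extend([left, right])
    return clusters

# Hypothetical 2-D points forming three well-separated pairs.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
clusters = bisecting_kmeans(points, k=3)
```

In a distributed setting, the assignment and mean computations inside `two_means` are the parts that would typically be expressed as map and reduce steps.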
Text mining attempts to discover new, previously unknown or hidden information by automatically
extracting it from various written resources. Applying knowledge discovery methods to
unstructured text is known as Knowledge Discovery in Text, Text Data Mining, or Text Mining.
Most of the techniques used in text mining are founded on the statistical study of a term, either a word or
a phrase. Different text mining algorithms were used in previous methods. For example, the
Single-Link Algorithm and Self-Organizing Mapping (SOM) introduce an approach for visualizing
high-dimensional data and are very useful tools for processing textual data based on the projection method.
Genetic and sequential algorithms provide the capability for multiscale representation of datasets and
are fast to compute, with less CPU time, based on the Isolet reduced subsets in unsupervised feature
selection. We propose a Vector Space Model and concept-based analysis algorithm that is intended to
improve text clustering quality and achieve a better text clustering result. We expect the
proposed algorithm to behave well in terms of robustness and stability with respect to the formation of
the neural network.
Recent Trends in Incremental Clustering: A ReviewIOSRjournaljce
This paper presents a review of recent trends in incremental clustering algorithms. It focuses on both clustering based on a similarity measure and clustering not based on a similarity measure. In this context, the paper is devoted to various typical incremental clustering algorithms. It mainly covers the optimization, genetic, and fuzzy approaches of these algorithms. The paper is original in one respect: it provides a complete overview that is fully devoted to evolutionary algorithms for incremental clustering. A number of references are provided that describe applications of evolutionary algorithms for incremental clustering in different domains, such as human activity detection, online fault detection, information security, and tracking an object consistently throughout a network (solving the boundary problem).
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGYcscpconf
A digital library is a type of information retrieval (IR) system. Existing information retrieval
methodologies generally have problems with keyword searching. We propose a model to solve
the problem by using a concept-based approach (ontology) and a metadata case base. This model
consists of identifying domain concepts in the user's query and applying expansion to them. The
system aims at improving the relevance of results retrieved from digital libraries
by proposing a conceptual query expansion for intelligent concept-based retrieval. We need to
import the concept of ontology, making use of its advantages of abundant semantics and
standard concepts. A domain-specific ontology can be used to raise information retrieval from
the traditional level based on keywords to the level based on knowledge (or concepts) and to change the
process of retrieval from traditional keyword matching to semantic matching. One approach is
query expansion using a domain ontology; the other is a case-based
similarity measure for metadata information retrieval using the Case Based Reasoning
(CBR) approach. Results show improvements over the classic method, query expansion using a
general-purpose ontology, and a number of other approaches.
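Concept-based query expansion of the kind described above can be sketched in a few lines. The ontology here is a hypothetical hand-written dictionary mapping a term to related concepts; a real domain ontology would be far richer, and the abstract does not specify its structure or the expansion rules.

```python
# Hypothetical toy ontology: term -> related domain concepts.
TOY_ONTOLOGY = {
    "clustering": ["grouping", "cluster analysis"],
    "retrieval": ["search"],
}

def expand_query(query, ontology):
    """Append related concepts for every query term found in the ontology."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for related in ontology.get(term, []):
            if related not in expanded:  # avoid duplicate expansion terms
                expanded.append(related)
    return expanded

expanded_terms = expand_query("document clustering", TOY_ONTOLOGY)
```

The expanded term list would then be matched against document metadata instead of the original keywords alone.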
Semantics-based clustering approach for similar research area detectionTELKOMNIKA JOURNAL
The manual process of searching out individuals in an already existing
research field is cumbersome and time-consuming. Prominent and rookie
researchers alike are predisposed to seek existing research publications in
a research field of interest before coming up with a thesis. From
extant literature, automated similar research area detection systems have
been developed to solve this problem. However, most of them use
keyword-matching techniques, which do not sufficiently capture the implicit
semantics of keywords thereby leaving out some research articles. In this
study, we propose the use of ontology-based pre-processing, Latent Semantic
Indexing and K-Means Clustering to develop a prototype similar research area
detection system, that can be used to determine similar research domain
publications. Our proposed system solves the challenge of high dimensionality
and data sparsity faced by the traditional document clustering technique. Our
system is evaluated with randomly selected publications from faculties
in Nigerian universities and results show that the integration of ontologies
in preprocessing provides more accurate clustering results.
Data mining, or knowledge discovery, is the process
of analyzing data from different perspectives and summarizing it
into useful information - information that can be used to increase
revenue, cut costs, or both. Data mining software is one of a
number of analytical tools for analyzing data. It allows users to
analyze data from many different dimensions or angles, categorize
it, and summarize the relationships identified. Technically, data
mining is the process of finding correlations or patterns among
dozens of fields in large relational databases. The goal of
clustering is to determine the intrinsic grouping in a set of
unlabeled data. But how to decide what constitutes a good
clustering? It can be shown that there is no absolute “best”
criterion which would be independent of the final aim of the
clustering. Consequently, it is the user who must supply this
criterion, in such a way that the result of the clustering will suit
their needs.
For instance, we could be interested in finding
representatives for homogeneous groups (data reduction), in
finding “natural clusters” and describing their unknown properties
(“natural” data types), in finding useful and suitable groupings
(“useful” data classes) or in finding unusual data objects (outlier
detection). Of late, clustering techniques have been applied in the
areas which involve browsing the gathered data or in categorizing
the outcome provided by the search engines for the reply to the
query raised by the users. In this paper, we are providing a
comprehensive survey over the document clustering.
Hierarchal clustering and similarity measures along with multi representationeSAT Journals
Abstract: All clustering methods have to assume some cluster relationship among the data objects that they are applied to. Graph-based document clustering works with frequent senses rather than the frequent keywords used in traditional text mining techniques. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we analyze an existing multi-viewpoint based similarity measure and two related clustering methods. The main difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many viewpoints, which are objects assumed not to be in the same cluster as the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical analysis and an empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to confirm the advantages of our proposal. Keywords: Multiview Cluster, Document id, ClusterDistance
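The contrast between one viewpoint and many can be made concrete with a short sketch. It assumes one common formulation of multi-viewpoint similarity, the average over viewpoints of the dot product of the two difference vectors; the abstract gives no formula, so this form, the function names, and the sample vectors are assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def multi_viewpoint_similarity(di, dj, viewpoints):
    """Average of (di - dh) . (dj - dh) over every viewpoint dh."""
    total = 0.0
    for dh in viewpoints:
        from_i = [a - h for a, h in zip(di, dh)]
        from_j = [b - h for b, h in zip(dj, dh)]
        total += dot(from_i, from_j)
    return total / len(viewpoints)

di, dj = (1, 0), (1, 1)
# With the origin as the only viewpoint this reduces to the plain dot product.
origin_only = multi_viewpoint_similarity(di, dj, [(0, 0)])
# Adding a second viewpoint, assumed to lie outside the cluster of di and dj.
two_views = multi_viewpoint_similarity(di, dj, [(0, 0), (2, 2)])
```

In practice the viewpoints would be the documents outside the cluster that contains both `di` and `dj`.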
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACHijcsit
The huge volume of text documents available on the internet has made it difficult to find valuable
information for specific users. The need for efficient applications that extract knowledge of interest
from textual documents is therefore vitally important. This paper addresses the problem of responding to user
queries by fetching the most relevant documents from a clustered set of documents. For this purpose, a
cluster-based information retrieval framework is proposed, in order to design and develop
a system for analysing and extracting useful patterns from text documents. In this approach, a preprocessing
step is first performed to find frequent and high-utility patterns in the data set. Then a Vector
Space Model (VSM) is used to represent the dataset. The system was implemented in two main
phases. In phase 1, the clustering analysis process is designed and implemented to group documents into
several clusters, while in phase 2, an information retrieval process is implemented to rank clusters
according to the user queries in order to retrieve the relevant documents from the specific clusters deemed
relevant to the query. The results are then evaluated using Recall and Precision
(P@5, P@10) of the retrieved results. P@5 was 0.660 and P@10 was 0.655.
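The P@5 and P@10 figures reported above are instances of precision at k, the fraction of the top k retrieved documents that are relevant. A minimal sketch follows; the document identifiers and relevance judgments are hypothetical and unrelated to the paper's data.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalizes rankings shorter than k.
    return hits / k

# Hypothetical ranking returned for one query, and its relevance judgments.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
p_at_3 = precision_at_k(ranked, relevant, 3)
p_at_5 = precision_at_k(ranked, relevant, 5)
```

Reported scores such as those in the abstract are usually averages of this value over all test queries.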
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGijcsa
Text Document Clustering is one of the fastest growing research areas because of availability of huge amount of information in an electronic form. There are several number of techniques launched for clustering documents in such a way that documents within a cluster have high intra-similarity and low inter-similarity to other clusters. Many document clustering algorithms provide localized search in effectively navigating, summarizing, and organizing information. A global optimal solution can be obtained by applying high-speed and high-quality optimization algorithms. The optimization technique performs a globalized search in the entire solution space. In this paper, a brief survey on optimization approaches to text document clustering is turned out.
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSijseajournal
ABSTRACT
In this paper we propose a novel method to cluster categorical data while retaining their context. Typically, clustering is performed on numerical data. However it is often useful to cluster categorical data as well, especially when dealing with data in real-world contexts. Several methods exist which can cluster categorical data, but our approach is unique in that we use recent text-processing and machine learning advancements like GloVe and t- SNE to develop a a context-aware clustering approach (using pre-trained
word embeddings). We encode words or categorical data into numerical, context-aware, vectors that we use to cluster the data points using common clustering algorithms like K-means.
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...csandit
Document Clustering algorithms goal is to create clusters that are coherent internally, but
clearly different from each other. The useful expressions in the documents is often accompanied
by a large amount of noise that is caused by the use of unnecessary words, so it is indispensable
to eliminate it and keeping just the useful information.
Keyphrases extraction systems in Arabic are new phenomena. A number of Text Mining
applications can use it to improve her results. The Keyphrases are defined as phrases that
capture the main topics discussed in document; they offer a brief and precise summary of
document content. Therefore, it can be a good solution to get rid of the existent noise from
documents.
In this paper, we propose a new method to solve the problem cited above especially for Arabic
language documents, which is one of the most complex languages, by using a new Keyphrases
extraction algorithm based on the Suffix Tree data structure (KpST). To evaluate our approach,
we conduct an experimental study on Arabic Documents Clustering using the most popular
approach of Hierarchical algorithms: Agglomerative Hierarchical algorithm with seven linkage
techniques and a variety of distance functions and similarity measures to perform Arabic
Document Clustering task. The obtained results show that our approach for extracting
Keyphrases improves the clustering results.
Data mining is utilized to manage huge measure of information which are put in the data ware houses and databases, to discover required information and data. Numerous data mining systems have been proposed, for example, association rules, decision trees, neural systems, clustering, and so on. It has turned into the purpose of consideration from numerous years. A re-known amongst the available data mining strategies is clustering of the dataset. It is the most effective data mining method. It groups the dataset in number of clusters based on certain guidelines that are predefined. It is dependable to discover the connection between the distinctive characteristics of data.
In k-mean clustering algorithm, the function is being selected on the basis of the relevancy of the function for predicting the data and also the Euclidian distance between the centroid of any cluster and the data objects outside the cluster is being computed for the clustering the data points. In this work, author enhanced the Euclidian distance formula to increase the cluster quality.
The problem of accuracy and redundancy of dissimilar points in the clusters remains in the improved k-means, so a new enhanced approach is proposed that uses a similarity function to check the similarity level of a point before including it in the cluster.
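To make the k-means description above concrete, here is a minimal sketch of the standard algorithm with plain Euclidean distance. It does not reproduce the enhanced distance formula or the similarity check mentioned in the abstract, since neither is given there; the function names and the sample points are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
        for p in points
    ]


def update(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(old)
        else:
            new_centroids.append(
                tuple(sum(col) / len(members) for col in zip(*members))
            )
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return labels, centroids


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
labels, centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

A similarity check of the kind the abstract describes could be added inside the assignment step, but its exact form would have to come from the paper itself.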
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE ijdms
Nowadays, document clustering is considered as a data intensive task due to the dramatic, fast increase in the number of available documents. Nevertheless, the features that represent those documents are also too large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations between words. In this paper we introduce a distributed implementation for the bisecting k-means using MapReduce programming model. The aim behind our proposed implementation is to solve the problem of clustering intensive data documents. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our presented experimental results show that using lexical categories for nouns only enhances internal evaluation measures of document clustering; and decreases the documents features from thousands to tens features. Our experiments were conducted using Amazon ElasticMapReduce to deploy the Bisecting k-means algorithm.
Text mining seeks to uncover new, previously unknown or hidden information by automatically extracting
it from various written resources. Applying knowledge discovery methods to unstructured text is known as
Knowledge Discovery in Text, Text Data Mining, or simply Text Mining.
Most of the techniques used in Text Mining are founded on the statistical study of a term, either a word or
a phrase. Different Text Mining algorithms have been used in previous methods. For example, the
Single-Link Algorithm and Self-Organizing Mapping (SOM) introduce an approach for visualizing
high-dimensional data and are a very useful tool for processing textual data based on a projection method.
Genetic and Sequential algorithms provide the capability for multiscale representation of datasets and
are fast to compute, with less CPU time, based on the Isolet reduced subsets in Unsupervised Feature
Selection. We propose the Vector Space Model and a concept-based analysis algorithm, which may
improve text clustering quality and achieve a better text clustering result. We believe the proposed
algorithm behaves well in terms of robustness and stability with respect to the formation of the
Neural Network.
Recent Trends in Incremental Clustering: A ReviewIOSRjournaljce
This paper presents a review of recent trends in incremental clustering algorithms. It covers both clustering based on a similarity measure and clustering not based on a similarity measure. In this context, the paper is devoted to various typical incremental clustering algorithms, mainly covering their optimization, genetic and fuzzy approaches. The paper is original in one respect: it provides a complete overview fully devoted to evolutionary algorithms for incremental clustering. A number of references are provided that describe applications of evolutionary algorithms for incremental clustering in different domains, such as human activity detection, online fault detection, information security, and consistently tracking an object throughout a network while solving the boundary problem.
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGYcscpconf
A digital library is a type of information retrieval (IR) system. Existing information retrieval
methodologies generally have problems with keyword searching. We propose a model to solve
the problem by using a concept-based approach (ontology) and a metadata case base. This model
consists of identifying domain concepts in the user’s query and applying expansion to them. The
system aims to improve the relevance of results retrieved from digital libraries
by proposing a conceptual query expansion for intelligent concept-based retrieval. We need to
import the concept of ontology, making use of its advantages of abundant semantics and
standard concepts. Domain-specific ontology can be used to move information retrieval from
the traditional keyword-based level to the knowledge-based (or concept-based) level, and to change the
process of retrieval from traditional keyword matching to semantic matching. One approach is
query expansion using domain ontology; the other is to introduce a case-based
similarity measure for metadata information retrieval using the Case Based Reasoning
(CBR) approach. Results show improvements over the classic method, query expansion using
a general-purpose ontology, and a number of other approaches.
Semantics-based clustering approach for similar research area detectionTELKOMNIKA JOURNAL
The manual process of searching out individuals in an already existing
research field is cumbersome and time-consuming. Prominent and rookie
researchers alike are predisposed to seek existing research publications in
a research field of interest before coming up with a thesis. From
extant literature, automated similar research area detection systems have
been developed to solve this problem. However, most of them use
keyword-matching techniques, which do not sufficiently capture the implicit
semantics of keywords thereby leaving out some research articles. In this
study, we propose the use of ontology-based pre-processing, Latent Semantic
Indexing and K-Means Clustering to develop a prototype similar research area
detection system, that can be used to determine similar research domain
publications. Our proposed system solves the challenge of high dimensionality
and data sparsity faced by the traditional document clustering technique. Our
system is evaluated with randomly selected publications from faculties
in Nigerian universities and results show that the integration of ontologies
in preprocessing provides more accurate clustering results.
Data mining, or knowledge discovery, is the process
of analyzing data from different perspectives and summarizing it
into useful information - information that can be used to increase
revenue, cut costs, or both. Data mining software is one of a
number of analytical tools for analyzing data. It allows users to
analyze data from many different dimensions or angles, categorize
it, and summarize the relationships identified. Technically, data
mining is the process of finding correlations or patterns among
dozens of fields in large relational databases. The goal of
clustering is to determine the intrinsic grouping in a set of
unlabeled data. But how to decide what constitutes a good
clustering? It can be shown that there is no absolute “best”
criterion which would be independent of the final aim of the
clustering. Consequently, it is the user who must supply this
criterion, in such a way that the result of the clustering will suit
their needs.
For instance, we could be interested in finding
representatives for homogeneous groups (data reduction), in
finding “natural clusters” and describing their unknown properties
(“natural” data types), in finding useful and suitable groupings
(“useful” data classes) or in finding unusual data objects (outlier
detection). Of late, clustering techniques have been applied in
areas that involve browsing gathered data or categorizing
the results returned by search engines in reply to
queries raised by users. In this paper, we provide a
comprehensive survey of document clustering.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Hierarchal clustering and similarity measures along with multi representationeSAT Journals
Abstract All clustering methods have to assume some cluster relationship among the data objects that they are applied to. Graph-Based Document Clustering works with frequent senses rather than the frequent keywords used in traditional text mining techniques. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we analyze an existing multi-viewpoint based similarity measure and two related clustering methods. The main difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many viewpoints, which are objects assumed not to be in the same cluster as the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical analysis and an empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections, confirming the advantages of our proposal. Keywords – Multiview Cluster, Document id, ClusterDistance
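The multi-viewpoint idea in the abstract above can be illustrated with a short sketch. It assumes one common formulation: the similarity of two documents is the average dot product of their difference vectors, taken relative to each viewpoint document outside their cluster. The abstract does not state the formula, so the function names, this exact definition, and the sample vectors are assumptions.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    """Element-wise difference a - b."""
    return tuple(x - y for x, y in zip(a, b))


def multi_viewpoint_similarity(d_i, d_j, outside_docs):
    """Average similarity of d_i and d_j as seen from each outside viewpoint.

    outside_docs are documents assumed NOT to share a cluster with d_i and d_j.
    """
    if not outside_docs:
        raise ValueError("at least one viewpoint document is required")
    total = sum(dot(sub(d_i, d_h), sub(d_j, d_h)) for d_h in outside_docs)
    return total / len(outside_docs)


# Hypothetical 2-D document vectors.
d_i, d_j = (1.0, 0.0), (0.0, 1.0)

# With the origin as the only viewpoint, the measure reduces to the plain dot product.
single_view = multi_viewpoint_similarity(d_i, d_j, [(0.0, 0.0)])

# Adding a second viewpoint changes the assessment.
two_views = multi_viewpoint_similarity(d_i, d_j, [(0.0, 0.0), (-1.0, -1.0)])
```

The single-viewpoint case shows why the abstract calls the traditional measure a special case that uses the origin as its only viewpoint.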
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACHijcsit
The huge volume of text documents available on the internet has made it difficult to find valuable
information for specific users. The need for efficient applications that extract knowledge of interest
from textual documents is therefore vitally important. This paper addresses the problem of responding to user
queries by fetching the most relevant documents from a clustered set of documents. For this purpose, a
cluster-based information retrieval framework is proposed in order to design and develop
a system for analysing and extracting useful patterns from text documents. In this approach, a
pre-processing step is first performed to find frequent and high-utility patterns in the data set. Then a Vector
Space Model (VSM) is used to represent the dataset. The system was implemented in two main
phases. In phase 1, the clustering analysis process is designed and implemented to group documents into
several clusters, while in phase 2, an information retrieval process is implemented to rank clusters
according to the user queries in order to retrieve the relevant documents from the specific clusters deemed
relevant to the query. The results are then evaluated using recall and precision
(P@5, P@10) of the retrieved results: P@5 was 0.660 and P@10 was 0.655.
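For readers unfamiliar with the P@5 and P@10 figures reported above, the sketch below shows how precision at k is usually computed for a single query: the fraction of the top k retrieved documents that are relevant. The document identifiers and relevance judgments are made up for illustration and are not taken from the paper.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalizes result lists shorter than k.
    return hits / k


# Hypothetical ranking returned for one query, best match first.
ranked = ["d3", "d1", "d7", "d2", "d9"]
# Hypothetical set of documents judged relevant to that query.
relevant = {"d1", "d2", "d5"}

p_at_2 = precision_at_k(ranked, relevant, 2)
p_at_5 = precision_at_k(ranked, relevant, 5)
```

A reported value such as P@5 would typically be this quantity averaged over all test queries.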
International Journal of Engineering and Science Invention (IJESI)inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Text document clustering and similarity detection is a major part of document management, where every document should be identified by its key terms and domain knowledge. Based on their similarity, the documents are grouped into clusters. Several approaches to document similarity calculation have been proposed, but existing systems are either term based or pattern based, and they suffer from several problems. To address this, the proposed system presents a model for document similarity that applies the back propagation time stamp algorithm (BPTT). It discovers patterns in text documents as higher-level features and creates a network for fast grouping. It also detects the most appropriate patterns based on their weight, and BPTT performs the document similarity measurement. Using this approach, documents can be categorized easily, and training process problems are reduced. The framework is named BPTT. It has been implemented and evaluated on the .NET platform with different datasets.
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
Help the Genetic Algorithm to Minimize the Urban Traffic on IntersectionsIJORCS
Controlling traffic lights at intersections is one of the main issues in achieving optimal traffic flow. Intersections are used to regulate the flow of vehicles and eliminate conflicting traffic flows. Modeling and simulation of traffic are widely used in industry; an industrial system is modeled and simulated before it is built, to check whether it is economically affordable. The aim of this article is a smart way to control traffic. The first stage of the project collects statistical data (the cycle time of the lights at each intersection and the time vehicles wait at a red light); in the next step, optimal values are found from the collected data. The parameters are then optimized by a genetic algorithm. The GA begins with a coding step using binary variables (the range is obtained from the initial data set), starts with an initial population, and then produces a new generation using the genetic operators mutation and crossover. Finally, the members with optimal fitness values are selected as the solution set. The optimal output has been modeled and implemented with Petri nets in the CPN Tools software. The results indicate improved performance of intersection traffic control systems. Data collected at other intersections can likewise be processed with evolutionary methods such as genetic algorithms to reduce the waiting time behind red lights and to determine the appropriate cycle.
Welcoming the research scholars, scientists around the globe in the Open Access Dimension, IJORCS is now accepting manuscripts for its next issue (Volume 4, Issue 4). Authors are encouraged to contribute to the research community by submitting to IJORCS, articles that clarify new research results, projects, surveying works and industrial experiences that describe significant advances in field of computer science.
All paper submissions (http://www.ijorcs.org/submit-paper) are received and managed electronically by IJORCS Team. Detailed instructions about the submission procedure are available on IJORCS website (http://www.ijorcs.org/author-guidelines)
License plate recognition is one of the core technologies in intelligent traffic control. In this paper, a new and tunable algorithm that can detect multiple license plates in high-resolution applications is proposed. The algorithm aims to identify the new Iranian plate and the plates of some European countries, which are characterized both by a blue area and by their geometric shape. The suggested algorithm runs at a suitable speed because it does not use heavy pre-processing operations such as image-enhancement filters, edge detection, or noise removal in the early stages. The method relies on model adaptation, i.e., on the blue section of the plate, so it can successfully detect the plates even when several are included in the image. We evaluated our method on two Persian single-vehicle license plate data sets and obtained correct recognition rates of 99.33% and 99%, respectively. We further tested our algorithm on a Persian multiple-vehicle license plate data set and achieved a 98% accuracy rate. We also obtained approximately 99% accuracy in the character recognition stage.
FPGA Implementation of FIR Filter using Various Algorithms: A RetrospectiveIJORCS
This paper is a review of low-cost, high-performance FPGA implementations of Finite Impulse Response (FIR) filters. Its key contribution is an elaborate analysis of hardware implementations of FIR filters using different algorithms, i.e., Distributed Arithmetic (DA), DA-Offset Binary Coding (DA-OBC), Common Sub-expression Elimination (CSE) and sum-of-power-of-two (SOPOT), with fewer resources and without affecting the performance of the original FIR filter.
Using Virtualization Technique to Increase Security and Reduce Energy Consump...IJORCS
This paper presents an approach for creating a secure environment on an Internet-based virtual computing platform while also reducing energy consumption in green cloud computing. The proposed approach constantly checks the accuracy of stored data by means of a central control service inside the network environment, and also checks system security by isolating single virtual machines using a common virtual environment. The approach has been simulated on two types of Virtual Machine Manager (VMM), Quick EMUlator (Qemu) and HVM (Hardware Virtual Machine) Xen, and the simulation outputs in VMInsight show that when the service is used singly, its performance overhead increases. As a secure system, the proposed approach is able to recognize malicious behaviors and assure service security by means of operational integrity measurement. Moreover, system efficiency has been evaluated according to the amount of energy consumed by five applications (Defragmentation, Compression, Linux Boot Decompression and Kernel Boot). The results indicate that, to secure a multi-tenant environment, managers and supervisors would otherwise have to install a security monitoring system independently for each Virtual Machine (VM), which imposes a heavy management workload, whereas the proposed approach can respond to all VMs with just one virtual machine acting as a supervisor.
Algebraic Fault Attack on the SHA-256 Compression FunctionIJORCS
The cryptographic hash function SHA-256 is one member of the SHA-2 hash family, which was proposed in 2000 and was standardized by NIST in 2002 as a successor of SHA-1. Although a differential fault attack on the SHA-1 compression function has been proposed, it seems hard to adapt it directly to SHA-256. In this paper, an efficient algebraic fault attack on the SHA-256 compression function is proposed under the word-oriented random fault model. During the attack, the automatic tool STP is exploited, which constructs binary expressions for the word-based operations in the SHA-256 compression function and then invokes a SAT solver to solve the equations. The simulation of the new attack needs about 65 fault injections to recover the chaining value and the input message block, taking about 200 seconds on average. Moreover, based on the attack on the SHA-256 compression function, an almost universal forgery attack on HMAC-SHA-256 is presented. Our algebraic fault analysis is generic and automatic, and can be applied to other ARX-based primitives.
Enhancement of DES Algorithm with Multi State LogicIJORCS
The principal goal in designing any encryption algorithm must be security against unauthorized access or attacks. The Data Encryption Standard (DES) algorithm is a symmetric key algorithm used to secure data. Enhanced DES algorithms work by increasing the key length, using a more complex S-BOX design, increasing the number of states in which the information is represented, or a combination of these criteria. Increasing the key length increases the number of possible keys, which makes a brute force attack harder for an intruder. As the S-BOX design becomes more complex, the avalanche effect improves. As the number of states in which the information is represented increases, it becomes harder for the intruder to crack the actual information. The proposed algorithm replaces the predefined XOR operation applied during the 16 rounds of the standard algorithm with a new operation called “Hash function” that depends on two keys. One key is used in the “F” function, and the other key consists of a combination of 16 states (0,1,2…13,14,15) instead of the ordinary 2-state key (0, 1). This replacement adds a new level of protection strength and more robustness against breaking methods.
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...IJORCS
This paper presents a new algorithm for solving large scale global optimization problems based on a hybridization of simulated annealing and the Nelder-Mead algorithm. The new algorithm is called simulated Nelder-Mead algorithm with random variables updating (SNMRVU). SNMRVU starts with an initial solution, which is generated randomly, and then the solution is divided into partitions. The neighborhood zone is generated, a random number of partitions are selected, and the variables updating process starts in order to generate trial neighbor solutions. This process helps the SNMRVU algorithm to explore the region around the current iterate solution. The Nelder-Mead algorithm is used in the final stage to improve the best solution found so far and to accelerate convergence. The performance of the SNMRVU algorithm is evaluated using 27 scalable benchmark functions and compared with four algorithms. The results show that the SNMRVU algorithm is promising and produces high quality solutions with low computational costs.
Voice Recognition System using Template MatchingIJORCS
It is easy for humans to recognize a familiar voice, but using computer programs to identify a voice among others is a herculean task. This is due to the problems encountered when developing an algorithm to recognize the human voice: it is impossible to say a word in exactly the same way on two different occasions, and computer analysis of human speech gives different interpretations depending on the speed of speech delivery. This research paper gives a detailed description of the process behind the implementation of an effective voice recognition algorithm. The algorithm uses the discrete Fourier transform to compare the frequency spectra of two voice samples, because the spectrum remains unchanged when the speech is slightly varied. The Chebyshev inequality is then used to determine whether the two voices came from the same person. The algorithm is implemented and tested using MATLAB.
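The spectrum comparison described above can be sketched in a few lines. The example below is written in Python, not the MATLAB used by the authors. It computes DFT magnitude spectra directly from the definition and measures the distance between them. It illustrates only one kind of invariance (a circular time shift leaves the magnitude spectrum unchanged), uses synthetic tones in place of voice samples, and omits the Chebyshev inequality decision step, whose details the abstract does not give. All names are assumptions.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each DFT bin, computed directly from the definition (O(n^2))."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def spectral_distance(a, b):
    """Euclidean distance between the magnitude spectra of two equal-length signals."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    mag_a, mag_b = dft_magnitudes(a), dft_magnitudes(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(mag_a, mag_b)))


n = 32
# Synthetic stand-ins for voice samples: pure tones at 3 and 7 cycles per frame.
tone = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
other_tone = [math.sin(2 * math.pi * 7 * t / n) for t in range(n)]
# The same tone, circularly shifted in time by five samples.
shifted_tone = tone[5:] + tone[:5]

same_source = spectral_distance(tone, shifted_tone)
different_source = spectral_distance(tone, other_tone)
```

For real recordings an FFT routine would replace the quadratic-time loop, and a threshold on the distance would have to be chosen empirically or derived as in the paper.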
Channel Aware Mac Protocol for Maximizing Throughput and FairnessIJORCS
The proper channel utilization and the queue length aware routing protocol is a challenging task in MANET. To overcome this drawback we are extending the previous work by improving the MAC protocol to maximize the Throughput and Fairness. In this work we are estimating the channel condition and Contention for a channel aware packet scheduling and the queue length is also calculated for the routing protocol which is aware of the queue length. The channel is scheduled based on the channel condition and the routing is carried out by considering the queue length. This queue length will provide a measurement of traffic load at the mobile node itself. Depending upon this load the node with the lesser load will be selected for the routing; this will effectively balance the load and improve the throughput of the ad hoc network.
A Review and Analysis on Mobile Application Development Processes using Agile...IJORCS
Over the last decade, the mobile telecommunication industry has seen rapid growth and has proved to be a highly competitive, uncertain and dynamic environment. Alongside its advancement, it has raised a number of questions and attracted attention in both industry and research. The development process for mobile applications differs from that of traditional software, as users expect features similar to those of their desktop computer applications along with additional mobile-specific functionality. Advanced mobile applications require integration with existing enterprise computing systems such as databases, legacy applications and Web services. In addition, the lifecycle of a mobile application moves much faster than that of a traditional Web application, and the associated lifecycle management must be adjusted accordingly. Security and application testing are more challenging in mobile applications than in Web applications, since the technology in mobile devices progresses rapidly and developers must stay in touch with the latest developments, news and trends in their area of work. With rising competition in the software market, researchers are seeking more flexible methods that can adjust to dynamic situations where software system requirements change over time, producing valuable software in a short time and within a low budget. The intrinsic uncertainty and complexity of any software project therefore requires an iterative development plan to cope with uncertainty and a large number of unknown variables. Agile methodologies were thus introduced to meet the new requirements of software development companies. They aim to facilitate software development processes where changes are acceptable at any stage, and provide a structure for highly collaborative software development.
Therefore, the present paper aims at reviewing and analysing the prevalent methodologies that use agile techniques and are currently applied to the development of mobile applications. This paper provides a detailed review and analysis of the use of agile methodologies in the proposed processes associated with mobile application development, and highlights their benefits and constraints. In addition, based on this analysis, future research needs are identified and discussed.
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...IJORCS
In general, nodes in Wireless Sensor Networks (WSNs) are equipped with limited battery and computation capabilities, and congestion consumes additional energy and computation power through the retransmission of data packets. Thus, congestion should be regulated to improve network performance. In this paper, we propose a congestion prediction and adaptive rate adjustment technique for Wireless Sensor Networks. This technique predicts the congestion level using a fuzzy logic system. Node degree, data arrival rate and queue length are taken as inputs to the fuzzy system, and the congestion level is obtained as the output. When the congestion level is between the moderate and maximum ranges, the adaptive rate adjustment technique is triggered. Our technique prevents congestion by controlling the data sending rate and also avoids unsolicited packet losses. By simulation, we demonstrate the proficiency of our technique. It increases system throughput and network performance significantly.
A Study of Routing Techniques in Intermittently Connected MANETsIJORCS
A Mobile Ad hoc Network (MANET) is a self-configuring, infrastructure-less network of mobile devices connected by wireless links. MANETs are a kind of wireless ad hoc network that usually has a routable networking environment on top of a link layer ad hoc network. Routing approaches in MANETs fall mainly into three categories, viz., Reactive Protocols, Proactive Protocols and Hybrid Protocols. These traditional routing schemes are not applicable to the so-called Intermittently Connected Mobile Ad hoc Network (ICMANET). An ICMANET is a form of Delay Tolerant Network, where there never exists a complete end-to-end path between two nodes wishing to communicate. Intermittent connectivity arises when the network is sparse or highly mobile. Routing in such a spasmodic environment is arduous. In this paper, we present an overview of the prevailing routing approaches for ICMANET with their benefits and drawbacks.
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...IJORCS
In the field of speech signal processing, Spectral subtraction method (SSM) has been successfully implemented to suppress the noise that is added acoustically. SSM does reduce the noise at satisfactory level but musical noise is a major drawback of this method. To implement spectral subtraction method, transformation of speech signal from time domain to frequency domain is required. On the other hand, Wavelet transform displays another aspect of speech signal. In this paper we have applied a new approach in which SSM is cascaded with wavelet thresholding technique (WTT) for improving the quality of speech signal by removing the problem of musical noise to a great extent. Results of this proposed system have been simulated on MATLAB.
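The core step of spectral subtraction mentioned above is small enough to sketch, here in Python in place of the MATLAB the authors used. The example assumes the magnitude spectrum of a noisy frame and an estimate of the noise magnitude spectrum are already available, subtracts one from the other bin by bin, and clamps negative results to a floor. The function name, the floor parameter and the numbers are assumptions; the wavelet thresholding stage is not shown.

```python
def spectral_subtract(noisy_mag, noise_mag, floor=0.0):
    """Subtract an estimated noise magnitude from each frequency bin.

    Results below `floor` are clamped, since a magnitude cannot be negative.
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [max(y - n, floor) for y, n in zip(noisy_mag, noise_mag)]


# Hypothetical magnitude spectrum of one noisy frame and a noise estimate.
noisy_frame = [5.0, 1.0, 3.0]
noise_estimate = [1.0, 2.0, 3.0]

cleaned = spectral_subtract(noisy_frame, noise_estimate)
```

Because the noise estimate never matches the actual noise in every frame, isolated bins survive the subtraction at random positions from frame to frame, which is commonly given as the origin of the musical noise that the paper's cascaded wavelet stage is meant to reduce.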
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed SystemIJORCS
Due to the restriction of designing faster and faster computers, one has to find the ways to maximize the performance of the available hardware. A distributed system consists of several autonomous nodes, where some nodes are busy with processing, while some nodes are idle without any processing. To make better utilization of the hardware, the tasks or load of the overloaded node will be sent to the under loaded node that has less processing weight to minimize the response time of the tasks. Load balancing is a tool used effectively for balancing the load among the systems. Dynamic load balancing takes into account of the current system state for migration of the tasks from heavily loaded nodes to the lightly loaded nodes. In this paper, we devised an adaptive load-sharing algorithm to balance the load by taking into consideration of connectivity among the nodes, processing capacity of each node and link capacity.
The Design of Cognitive Social Simulation Framework using Statistical Methodo...IJORCS
Modeling the behavior of the cognitive architecture in the context of social simulation using statistical methodologies is currently a growing research area. Normally, a cognitive architecture for an intelligent agent involves artificial computational process which exemplifies theories of cognition in computer algorithms under the consideration of state space. More specifically, for such cognitive system with large state space the problem like large tables and data sparsity are faced. Hence in this paper, we have proposed a method using a value iterative approach based on Q-learning algorithm, with function approximation technique to handle the cognitive systems with large state space. From the experimental results in the application domain of academic science it has been verified that the proposed approach has better performance compared to its existing approaches.
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...IJORCS
To efficiently process continuous spatio-temporal queries, we need to efficiently and effectively handle large number of moving objects and continuous updates on these queries. In this paper, we propose a framework that employs a new indexing algorithm that is built on top of SQL Server 2008 and avoid the overhead related to R-Tree indexing. To answer range queries, we utilize dynamic materialized view concept to efficiently handle update queries. We propose an adaptive safe region to reduce communication costs between the client and the server and to minimize position update load. Caching of results was utilized to enhance the overall performance of the framework. To handle concurrent spatio-temporal queries, we utilize publish/subscribe paradigm to group similar queries and efficiently process these requests. Experiments show that the overall proposed framework performance was able to outperform R-Tree index and produce promising and satisfactory results.
A PSO-Based Subtractive Data Clustering AlgorithmIJORCS
There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast and high-quality clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize the information. Recent studies have shown that partitional clustering algorithms such as the k-means algorithm are the most popular algorithms for clustering large datasets. The major problem with partitional clustering algorithms is that they are sensitive to the selection of the initial partitions and are prone to premature converge to local optima. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and cluster centers for any given set of data. The cluster estimates can be used to initialize iterative optimization-based clustering methods and model identification methods. In this paper, we present a hybrid Particle Swarm Optimization, Subtractive + (PSO) clustering algorithm that performs fast clustering. For comparison purpose, we applied the Subtractive + (PSO) clustering algorithm, PSO, and the Subtractive clustering algorithms on three different datasets. The results illustrate that the Subtractive + (PSO) clustering algorithm can generate the most compact clustering results as compared to other algorithms.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
2. 8 P. Rajesh, G. Narasimha, N.Saisumanth
collection of clusters that is not favorable to interpretation [5, 6]. To minimize the overlapping of documents, Beil and Ester [7] proposed HFTC (Hierarchical Frequent Text Clustering), another frequent item set based approach that successively chooses the next frequent item sets. However, the clustering result depends on the order in which the next frequent item sets are chosen. The resulting hierarchy in HFTC usually contains many clusters at the first level. As a result, documents in the same class are distributed into different branches of the hierarchy, which decreases the overall clustering accuracy.

C. M. Fung [8] introduced the FIHC (Frequent Item set based Hierarchical Clustering) method for document clustering, in which a cluster topic tree is constructed based on the similarity among clusters. FIHC uses efficient child pruning when the number of clusters is large and applies the more elaborate sibling merging only when the number of clusters is small. The experimental results show that FIHC outperforms the other algorithms (bisecting k-means, UPGMA) in accuracy for most numbers of clusters.

The Apriori algorithm [9] is a well-known method for computing frequent item sets in a transaction database. Documents under the same topic share more common frequent item sets (terms) than documents of different topics. The main advantage of using frequent item sets is that they can identify the relation among more than two documents at a time in a document collection, unlike a similarity measure between two documents [10, 11]. By means of maximal frequent item sets, the dimensionality of the document set is reduced. Moreover, maximal frequent item sets capture the most related document sets. On the other hand, hierarchical clustering is most relevant for browsing and maps the most specific documents to generalized documents in the whole collection.

A conventional hierarchical clustering method constructs the hierarchy by subdividing a parent cluster or merging similar child clusters. It usually suffers from its inability to perform tuning once a merge or split decision has been made. This rigidity may lower the clustering accuracy. Furthermore, because a parent cluster in the hierarchy always contains all objects of its children, this kind of hierarchy is not suitable for browsing. The user may have difficulty locating the intended object in such a large cluster.

Our hierarchical clustering method is completely different. The aim of this paper is first to form all the clusters by assigning documents to the most similar cluster using maximal frequent item sets found by the Apriori algorithm, and then to construct the hierarchical document clustering based on their inter-cluster similarities via the same maximal frequent item set (MFI) based similarity measure. The clusters in the resulting hierarchy are non-overlapping. The parent cluster contains only the general documents.

III. ALGORITHM DESCRIPTION

In this section, we describe the proposed algorithm, including the common preprocessing steps and the pseudo code of the algorithm. It also covers how clusters are precisely defined based on maximal frequent item sets (MFI) found by the Apriori algorithm. First, we discuss some common preprocessing steps for representing each document by item sets (terms). Second, we introduce the vector space model by assigning weights to terms in all document sets. Finally, we explain the process of initializing cluster seeds using MFI to perform hierarchical clustering. Let $D_s$ represent the set of all documents in the collection:

$$D_s = \{d_1, d_2, d_3, \ldots, d_M\}, \quad d_i \in D_s,\ 1 \le i \le M$$

A. Pre-Processing

The document set $D_s$ is converted from an unstructured format into a common representation using text preprocessing techniques, in which words or terms are extracted (tokenization). The input documents in $D_s$ are preprocessed by first removing HTML tags and then applying a stop word list and a stemming algorithm.

a) HTML tags: parse and remove the HTML tags.
b) Stop words: remove the words in the stop word list, such as conjunctions, connectives and prepositions.
c) Stemming algorithm: we use the Porter 2 stemmer in our approach.

B. Vector representation of document

The vector space model is the most commonly used document representation model in text mining, web mining and information retrieval. In this model each document is represented as an n-dimensional term vector. The value of each term in the n-dimensional vector reflects the importance of that term in the corresponding document. Let $n$ be the total number of terms and $M$ the number of documents. Each document can be denoted as

$$D_i = (term_{i1}, term_{i2}, \ldots, term_{in}), \quad 1 \le i \le M, \quad \text{where } df(term_{ij}) < \text{threshold}.$$

Only terms whose document frequency is less than the threshold value are considered, because the more often a term appears throughout all documents in the whole collection, the more poorly it discriminates between documents [12]. The term frequency tf is the number of times a term appears in a document. The document frequency df of a term is the number of documents that contain the term. We then construct the weighted document vectors

$$D_i = (w_{i1}, w_{i2}, w_{i3}, \ldots, w_{in}), \quad \text{where } w_{ij} = tf_{ij} \cdot IDF(j) \ \text{ and } \ IDF(j) = \log\left(\frac{M}{df_j}\right), \quad 1 \le j \le n,$$

where IDF is the inverse document frequency.
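The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `tfidf_vectors`, the optional `df_threshold` parameter, the toy documents, and the choice of the natural logarithm are all assumptions (the paper only writes log), and the documents are assumed to be already tokenized, stop-word filtered and stemmed.

```python
import math
from collections import Counter


def tfidf_vectors(docs, df_threshold=None):
    """Build weighted vectors w_ij = tf_ij * log(M / df_j).

    docs: list of token lists (hypothetical, already preprocessed).
    df_threshold: if given, keep only terms with df < df_threshold.
    """
    m = len(docs)                                   # M, number of documents
    tf = [Counter(doc) for doc in docs]             # tf_ij per document
    df = Counter(t for counts in tf for t in counts)  # df_j: documents containing term j
    vocab = sorted(t for t in df
                   if df_threshold is None or df[t] < df_threshold)
    idf = {t: math.log(m / df[t]) for t in vocab}   # natural log is an assumption
    vectors = [[counts[t] * idf[t] for t in vocab] for counts in tf]
    return vocab, vectors


# Toy, made-up documents for illustration only.
docs = [["java", "beans", "java"],
        ["java", "servlets"],
        ["servlets", "beans", "jsp"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)       # ['beans', 'java', 'jsp', 'servlets']
print(vectors[0])  # weights of the first document, in vocab order
```

Note that a term occurring in every document gets weight zero under this formula, which is consistent with the remark that such terms discriminate poorly.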
Table 1: Table Representation of Transactional Database of Documents

Terms      Doc 1   Doc 2   Doc 3   .....   Doc 4
Java       1       1       0       .....   1
Beans      0       1       0       .....   0
.....      .....   .....   .....   .....   .....
Servlets   1       0       1       .....   1

By representing documents in vector form, we can easily identify which documents contain the same features. The more features documents have in common, the more related they are. Thus, it is realistic to find well related documents. Assume that each document is an item in the transactional database and each term corresponds to a transaction. Our aim is to search for highly related documents "appearing" together with the same features (the documents whose MFI features are close). Similarly, maximal frequent item set discovery in the transaction database serves the purpose of finding items (documents) appearing together in many transactions, i.e., document sets which have a large number of features in common.

C. Apriori for maximal frequent item sets

Mining frequent item sets is a core topic of data mining that focuses on finding the relations among different items in a large database. Mining frequent patterns is a crucial problem in many data mining applications, such as the discovery of association rules, correlations, multidimensional patterns, and numerous other important patterns inferred from consumer market basket analysis, web access logs, etc. The association mining problem is formulated as follows: given a large database of item transactions, find all frequent item sets, where a frequent item set is one that occurs in at least a user-specified fraction of the database. Many of the proposed item set mining algorithms are variants of Apriori, which employs a bottom-up, breadth-first search that enumerates every single frequent item set. Apriori is a conventional algorithm that was first introduced for mining association rules. Association mining can be viewed as a two-step process:

(1) Identifying all frequent item sets
(2) Generating strong association rules from the frequent item sets

At first, candidate item sets are generated, and afterwards frequent item sets are mined with the help of these candidate item sets. In the proposed approach, we use only the frequent item sets for further processing, so we perform only the first step (generation of maximal frequent item sets) of the Apriori algorithm.

A frequent item set is a set of words which occur frequently together; such sets are good candidates for clusters and are denoted by FI. An item set $X$ is closed if there does not exist an item set $X_1$ such that $X \subset X_1$ and $t(X) = t(X_1)$, where $t(X)$ is defined as the set of transactions that contain item set $X$; frequent closed item sets are denoted by FCI. If $X$ is frequent and no superset of $X$ is frequent among the set of items $I$ in the transactional database, then we say that $X$ is a maximal frequent item set, denoted by MFI. Then $MFI \subset FCI \subset FI$. Whenever very long patterns are present in the data, it is often impractical to generate the entire set of frequent item sets or closed item sets [16]. In that case, maximal frequent item sets are adequate for such applications. We employed the maximal frequent item set algorithm from [17] using Apriori. These maximal frequent item sets are the initial seeds for hierarchical document clustering.

D. Pseudo code Algorithm

For MFI Based Similarity Measure for Hierarchical Document Clustering

Input: Document set $D_s$.

Definition: MFI: maximal frequent item set; tf: term frequency; df: document frequency.

Step 1. For each document in $D_s$, remove the HTML tags and apply the stop word list and stemming.

Step 2. Calculate the term frequency (tf) and document frequency (df):
$D_i = (term_{i1}, term_{i2}, \ldots, term_{in})$, $1 \le i \le M$, where $df(term_{ij}) <$ threshold value.

Step 3. Construct the weighted document vectors for all the documents:
$D_i = (w_{i1}, w_{i2}, w_{i3}, \ldots, w_{in})$, where $w_{ij} = tf_{ij} \cdot IDF(j)$ and $IDF(j) = \log\left(\frac{M}{df_j}\right)$, $1 \le j \le n$.

Step 4. Represent each document by the keywords whose tf > support. Calculate the maximal frequent item sets (MFI) of terms using the Apriori algorithm:
$MFI = \{F_1, F_2, F_3, \ldots, F_n\}$, where each $F_i = \{d_1, d_2, d_3, \ldots, d_k\}$.

Step 5. If a document $d_i$ is in more than one maximal frequent item set, then choose $I_d$ as the set consisting of the maximal frequent item sets containing document $d_i$. Then assign $I_x = I_{d_0}$. For each of the maximal frequent item sets $I_{d_i}$ containing the document $d_i$:
if $\text{jaccards}(\text{center}(I_x), d_i) > \text{jaccards}(\text{center}(I_{d_i}), d_i)$
then assign $I_x = I_{d_i}$. Assign the document $d_i$ to $I_x$ and discard $d_i$ from the other maximal frequent item sets. Repeat this process for all documents that occur in more than one maximal frequent item set.

Step 6. Apply hierarchical document clustering to make these maximal frequent item sets $F_i$ into clusters, combine the documents in $F_i$ into a single new document, and represent it by the centers of the maximal frequent item sets. These are obtained by combining the features of the maximal frequent item sets of terms that group the documents.

Step 7. Repeat the same process of hierarchical document clustering based on maximal frequent item sets for all levels in the hierarchy; stop if the total number of documents equals one, else go to step 4.

IV. HIERARCHICAL CLUSTERS BASED ON MAXIMAL FREQUENT ITEM SETS

After finding the maximal frequent item sets (MFI) using the Apriori algorithm, we turn to describing the creation of the hierarchical document clustering using the same MFI-based similarity measure. A simple example is also provided to demonstrate the entire process. The set of maximal frequent item sets found by the Apriori algorithm in the whole collection of documents $D_s$ is $MFI = \{F_1, F_2, F_3, \ldots, F_n\}$, where each MFI consists of a set of documents represented by $F_i = \{d_1, d_2, d_3, \ldots, d_k\}$. Consider the documents which occur in the maximal frequent item sets in MFI as follows.

$$MFI = \{d_1, d_2, d_3, d_4, d_5, d_6, d_7, d_8, d_9, d_{10}, d_{11}, d_{12}, d_{13}, d_{14}, d_{15}\}$$

$F_1 = \{d_2, d_4, d_6\}$
$F_2 = \{d_3, d_4, d_8\}$
$F_3 = \{d_1, d_5, d_7\}$
$F_4 = \{d_4, d_2, d_{14}\}$
$F_5 = \{d_{10}, d_{12}, d_{15}\}$
$F_6 = \{d_9, d_{11}, d_{13}\}$

The clusters in the resulting hierarchy are non-overlapping. This can be achieved through the following cases.

Case 1: If $F_i$, $F_j$ are the same, then choose one at random to form a cluster.

Case 2: If $F_i$, $F_j$ are different, then form clusters of the documents contained in $F_i$, $F_j$ independently. In our example, the maximal frequent item sets of documents $F_3$, $F_5$ and $F_6$ are different. So we form clusters according to the documents contained in each $F_i$, e.g. $F_3 = \{d_1, d_5, d_7\}$ as one cluster in the hierarchy, and represent it by its center (as in step 6).

Case 3: $F_i$, $F_j$ contain some of the same documents among the document lists obtained from MFI. Consider document $d_2$, which is repeated in more than one maximal frequent item set, $\{F_1, F_4\}$. Similarly, $d_4$ is repeated in $\{F_1, F_2, F_4\}$. Then choose $I_d = \{F_1, F_2, F_4\} = \{I_{d_0}, I_{d_1}, I_{d_2}\}$ for document $d_4$. Assign $I_x = I_{d_0} = F_1$. For each of the maximal frequent item sets in $I_d$ containing $d_4$, from $I_{d_0}$ to $I_{d_2}$, calculate the measure:

if $\text{jaccards}(\text{center}(I_x), d_4) > \text{jaccards}(\text{center}(I_{d_i}), d_4)$ then assign $I_x = I_{d_i}$.

By using this jaccards measure, we can identify which maximal frequent item set, among those containing the document $d_4$, is closest to $d_4$. Suppose that $d_4$ is closest to the maximal frequent item set $F_4$. Assign the document $d_4$ to $I_x = I_{d_i} = F_4$ and discard $d_4$ from the other maximal frequent item sets. After this step, each document belongs to exactly one cluster. Similarly, $d_2$ belongs to $F_1$. Repeat this process for all documents that occur in more than one maximal frequent item set. Since the documents $d_2$, $d_4$ are repeated in $F_1$, $F_4$, the clusters formed at the first level of the hierarchy by applying step 5 and step 6 are as follows.

$F_1 = \{d_2, d_6\}$
$F_2 = \{d_3, d_8\}$
$F_3 = \{d_1, d_5, d_7\}$
$F_4 = \{d_4, d_{14}\}$
$F_5 = \{d_{10}, d_{12}, d_{15}\}$
$F_6 = \{d_9, d_{11}, d_{13}\}$

The hierarchical diagram for the above maximal frequent item set clusters can be represented as follows. Repeat the same process of hierarchical document clustering based on maximal frequent item sets for all levels in the hierarchy; stop if the total number of documents equals one, else go to step 4.

Figure 1: Hierarchical document clustering using MFI
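The overlap-resolution step of the example (step 5, Case 3) can be sketched as follows. This is only an illustration under stated assumptions: `resolve_overlaps`, `similarity` and the `scores` table are hypothetical names and made-up values, chosen so that $d_4$ ends up in $F_4$ and $d_2$ in $F_1$ as the text supposes. The sketch treats the Jaccard value as a similarity, so a document goes to the set with the largest score; if `jaccards` in the pseudo code is meant as a distance, the comparison would be reversed.

```python
def resolve_overlaps(mfi_docs, similarity):
    """Assign every document to exactly one maximal frequent item set.

    mfi_docs: dict mapping an MFI name to its list of document ids.
    similarity(name, doc): hypothetical score between the center of the
    MFI `name` and document `doc` (higher means closer).
    """
    # I_d for every document: the MFI sets that contain it.
    candidates = {}
    for name, docs in mfi_docs.items():
        for doc in docs:
            candidates.setdefault(doc, []).append(name)

    clusters = {name: [] for name in mfi_docs}
    for doc, names in candidates.items():
        best = names[0]                      # I_x = I_d0
        for name in names[1:]:
            if similarity(name, doc) > similarity(best, doc):
                best = name                  # I_x = I_di
        clusters[best].append(doc)           # doc is dropped from the other sets
    return clusters


mfi = {
    "F1": ["d2", "d4", "d6"],
    "F2": ["d3", "d4", "d8"],
    "F3": ["d1", "d5", "d7"],
    "F4": ["d4", "d2", "d14"],
    "F5": ["d10", "d12", "d15"],
    "F6": ["d9", "d11", "d13"],
}
# Made-up similarity scores, for illustration only.
scores = {("F1", "d2"): 0.7, ("F4", "d2"): 0.3,
          ("F1", "d4"): 0.2, ("F2", "d4"): 0.4, ("F4", "d4"): 0.6}
clusters = resolve_overlaps(mfi, lambda name, doc: scores.get((name, doc), 0.0))
print(clusters["F1"], clusters["F2"], clusters["F4"])
# ['d2', 'd6'] ['d3', 'd8'] ['d4', 'd14']
```

Passing the similarity as a function keeps the sketch independent of how cluster centers and document vectors are actually represented.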
5. Represent each new document �𝐿 𝑖𝑗 � in hierarchy by
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering 11
itself. When we are classifying the documents into
maximal frequent item set of terms as centers (as in equivalence classes, we are not considering these ones
step 6).These maximal frequent item sets are obtained and put zeros. Jaccard similarity coefficient matrix for
by combining the features of maximal frequent item four documents can be represented as follows.
set of terms that grouping the documents. Each new
d1 d2 d3 d4
�𝐿 𝑖𝑗 � represents that jth document in the level of
document also consisting of corresponding updated
weights of maximal frequent item set of terms. Where d 1 1 0.4 0.8 0.5
hierarchy𝐿 𝑖 . In the figure { 𝐿12 = 𝐿21 }means that the
d 2 0.4 1 0.8 0.4
Rα =
level 𝐿1 are not matched with other documents MFI set
d 3 0.8 0.8 1 0.9
maximal frequent item set of terms in 2nd document of
d 4 0.5 0.4 0.9 1
in same level𝐿1 .So it is repeated same for the next
level and it is also same for the document { 𝐿13 = Ds = {d1 , d2 , d3 , d4 }as the collectionof document pairs
𝐿22 }. The documents{ 𝐿11 , 𝐿15 } and{ 𝐿14 , 𝐿16 } in first
Where alpha is threshold. Let define a relation R on
value. i.e 𝑅 = {(𝑑 𝑖 , 𝑑 𝑗 )/ 𝐽 (𝑑 𝑖 , 𝑑 𝑗 ) ≥ 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 }
whose similarity measure is above some threshold
level as 𝐿23 , 𝐿24 .
level are combined using MFI based hierarchical
1. R is reflexive on Ds iff 𝑅 (𝑑 𝑖 , 𝑑 𝑖 ) = 1. i.e Every
clustering and represent these documents in the second
2. R is symmetric on Ds iff𝑅 �𝑑 𝑖 , 𝑑 𝑗 � = 𝑅 �𝑑 𝑗 , 𝑑 𝑖 �i.e
document is mostly related to itself.
if the document 𝑑 𝑖 is similar to 𝑑 𝑗 then the
V. PRIVACY PRESERVING OF WEB
document 𝑑 𝑗 is also similar to𝑑 𝑖 .
DOCUMENTS USING EQUIVALENCE
RELATION
Most internet web documents are publicly available
𝑅 (𝑑 𝑖 , 𝑑 𝑘 ) ≥ 𝑚𝑎𝑥 𝑗 { min{𝑅 �𝑑 𝑖 , 𝑑 𝑗 �, 𝑅 �𝑑 𝑗 , 𝑑 𝑖 �}}.
for providing services required by the user. In such 3. R is transitive on Ds iff
documents there is no confidential or sensitive data
(open to all). Then how can we provide privacy of
such documents. Now a days, same information will Then R is transitive by the definition.
be exists in more than one document in duplicate
Then R is an equivalence relation on Ds, which
forms. The way of providing privacy preserving of
partitions the input document set Ds into set of
documents is by avoiding duplicate documents. There
equivalence classes. Equivalence relation seems a
by we can protect the privacy of individual copy rights
natural technique for duplicate document
of documents. Many duplicate document detection
categorization. Any two documents in same
techniques are available such as syntactic, URL based,
equivalence class are related and are different if they
semantic approaches. In each technique, a processing
are coming from two equivalence classes. The set of
overhead of maintaining shingling’s, signatures,
all equivalence classes induces the document set Ds.
fingerprints [13, 14, 15, 18]. In this paper, we propose a new technique for avoiding duplicate documents using an equivalence relation. Let Ds be the input set of duplicate documents, a subset of the web document collection. First, compute the Jaccard similarity measure for every pair of documents in Ds, using the weighted feature representation of maximal frequent item sets discussed in steps 2 and 3 of the algorithm. If the similarity measure of two documents equals 1, the two documents are maximally similar; if the measure is 0, they are not duplicates. The Jaccard index, or Jaccard similarity coefficient, is a statistical measure of similarity between sample sets. For two sets, it is defined as the cardinality of their intersection divided by the cardinality of their union. Mathematically,

$$J(d_1, d_2) = \frac{|d_1 \cap d_2|}{|d_1 \cup d_2|}$$

Calculate the Jaccard measure for every pair of documents d1, d2. All the diagonal elements of the resulting matrix are ones, because every document is most closely related to itself. Pairs of documents with high syntactic similarity, excluding the diagonal elements, are typically referred to as duplicates or near duplicates. Using the equivalence relation, we can easily identify the duplicate documents or perform clustering on them. Apart from representing the document feature vector by MFI, it also helps to consider who the author of a document is, when it was created, and where it is available; this information makes finding duplicate documents more effective. Each document in the input Ds must belong to a unique equivalence class. If R is an equivalence relation on Ds = {d1, d2, d3, d4, ..., dn}, then the number of ordered pairs in R always lies in the range n ≤ |R| ≤ n², i.e., the time complexity of calculating the equivalence relation on Ds is O(n²). Choose the threshold α in the equivalence relation as 0.8, i.e., J(di, dj) ≥ 0.8. Since the matrix is symmetric, the document pairs {(d3, d1), (d3, d2), (d4, d3)} are the most closely related. Hence these documents are near duplicates, and they are grouped into clusters, thereby protecting the individual copyrights of the documents.
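The Jaccard measure and the thresholded relation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`jaccard`, `relation_matrix`, `duplicate_groups`) and the toy item sets in `docs` are hypothetical, and each document is assumed to be already reduced to a set of items. Because a thresholded similarity is reflexive and symmetric but not necessarily transitive, the sketch also assumes that equivalence classes are obtained by taking the transitive closure of the relation, here with a small union-find.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def relation_matrix(docs, alpha=0.8):
    """0/1 matrix with R[i][j] = 1 when i != j and J(d_i, d_j) >= alpha."""
    n = len(docs)
    rel = [[0] * n for _ in range(n)]
    # Only n(n-1)/2 pairs are visited because the matrix is symmetric.
    for i, j in combinations(range(n), 2):
        if jaccard(docs[i], docs[j]) >= alpha:
            rel[i][j] = rel[j][i] = 1
    return rel


def duplicate_groups(rel):
    """Group document indices by the transitive closure of rel (assumption)."""
    parent = list(range(len(rel)))

    def find(x):
        # Follow parent links to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, row in enumerate(rel):
        for j, related in enumerate(row):
            if related:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(rel)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical item sets, invented for illustration only.
docs = [
    frozenset("abcde"),
    frozenset("abcdef"),
    frozenset("xyz"),
    frozenset("abcd"),
]
rel = relation_matrix(docs, alpha=0.8)
groups = duplicate_groups(rel)
print(rel)
print(groups)
```

In this toy input the second and fourth documents are not directly related (their Jaccard measure is 4/6), yet both are related to the first, so the closure places all three in one group. Whether such chained grouping is desirable depends on how strictly near duplicates should be defined.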
$$R_{0.8} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

VI. CONCLUSION AND FUTURE SCOPE

Cluster analysis can be used as a powerful, stand-alone data mining technique that extracts knowledge from huge unstructured databases. Most conventional clustering methods do not satisfy the requirements of document clustering, such as handling high dimensionality and huge volumes and providing easy access to meaningful cluster labels. In this paper, we presented a novel approach, a Maximal Frequent Item set (MFI) based similarity measure for hierarchical document clustering, to address these issues. Dimensionality reduction is achieved through MFI. By using the same MFI similarity measure in hierarchical document clustering, the number of levels in the hierarchy is decreased, which makes browsing easier. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning; applying MFI-based techniques to clusters in these areas can yield high-quality clusters. Moreover, by means of maximal frequent item sets, we can predict the most influential objects of clusters in the entire dataset in applications such as business, marketing, the World Wide Web, and social network analysis.

VII. REFERENCES

[1] Rui Xu, Donald Wunsch, “A Survey of Clustering Algorithms”. IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
[2] Jain, A.K., Murty, M.N., Flynn, P.J., “Data Clustering: A Review”. ACM Computing Surveys, Vol. 31, No. 3, 1999, pp. 264-323.
[3] Kleinberg, J.M., “Authoritative Sources in a Hyperlinked Environment”. Journal of the ACM, Vol. 46, No. 5, 1999, pp. 604-632.
[4] Ling Zhuang, Honghua Dai. (2004). “A Maximal Frequent Item Set Approach for Web Document Clustering”. In Proceedings of the IEEE Fourth International Conference on Computer and Information Technology 2004 (CIT-2004).
[5] Michael W. Trosset. (2008). “Representing Clusters: k-Means Clustering, Self-Organizing Maps and Multidimensional Scaling”. Technical Report, Department of Statistics, Indiana University, Bloomington, 2008.
[6] Michael Steinbach, George Karypis, and Vipin Kumar. (2000). “A Comparison of Document Clustering Techniques”. In Proceedings of the Workshop on Text Mining, 2000 (KDD-2000), Boston, pp. 109-111.
[7] Beil, F., Ester, M., Xu, X. (2002). “Frequent Term-Based Text Clustering”. In Proceedings of 8th International Conference on Knowledge Discovery and Data Mining 2002 (KDD-2002), Edmonton, Alberta, Canada.
[8] Fung, Benjamin C.M., Wang, Ke, Ester, Martin. (2003). “Hierarchical Document Clustering using Frequent Item Sets”. In Proceedings of SIAM International Conference on Data Mining 2003 (SIAM DM-2003), pp. 59-70.
[9] Agrawal, R., Srikant, R. (1994). “Fast Algorithms for Mining Association Rules”. In Proceedings of 20th International Conference on Very Large Data Bases, 1994, Santiago, Chile, pp. 487-499.
[10] Liu, W.L., and Zeng, X.S. (2005). “Document Clustering Based on Frequent Term Sets”. Proceedings of Intelligent Systems and Control, 2005.
[11] Zamir, O., Etzioni, O. (1998). “Web Document Clustering: A Feasibility Demonstration”. In Proceedings of ACM, 1998 (SIGIR-98), pp. 46-54.
[12] Kjersti, (1997). “A Survey on Personalized Information Filtering Systems for the World Wide Web”. Technical Report 922, Norwegian Computing Center, 1997.
[13] Prasannakumar, J., Govindarajulu, P., “Duplicate and Near Duplicate Documents Detection: A Review”. European Journal of Scientific Research, ISSN 1450-216X, Vol. 32, No. 4, 2009, pp. 514-527.
[14] Syed Mudhasir, Y., Deepika, J., “Near Duplicate Detection and Elimination Based on Web Provenance for Efficient Web Search”. International Journal on Internet and Distributed Computing Systems, Vol. 1, No. 1, 2011.
[15] Alsulami, B.S., Abulkhair, F., Essa, E., “Near Duplicate Document Detection Survey”. International Journal of Computer Science and Communications Networks, Vol. 2, No. 2, pp. 147-151.
[16] Doug Burdick, Manuel Calimlim, Johannes Gehrke. (2001). “A Maximal Frequent Itemset Algorithm for Transactional Databases”. In Proceedings of ICDE, 17th International Conference on Data Engineering 2001 (ICDE-2001).
[17] Murali Krishna, S., Durga Bhavani, S., “An Efficient Approach for Text Clustering Based On Frequent Item Sets”. European Journal of Scientific Research, ISSN 1450-216X, Vol. 42, No. 3, 2010, pp. 399-410.
[18] Lopresti, D.P. (1999). “Models and Algorithms for Duplicate Document Detection”. In Proceedings of Fifth International Conference on Document Analysis and Recognition 1999 (ICDAR-1999), 20th-22nd Sep, pp. 297-300.