www.ijmer.com

International Journal of Modern Engineering Research (IJMER)
Vol. 3, Issue. 5, Sep - Oct. 2013 pp-2823-2826
ISSN: 2249-6645

A Novel Clustering Method for Similarity Measuring in
Text Documents
Preethi Priyanka Thella1, G. Sridevi2
1 M.Tech, Nimra College of Engineering & Technology, Vijayawada, A.P., India.
2 Assoc. Professor, Dept. of CSE, Nimra College of Engineering & Technology, Vijayawada, A.P., India.

ABSTRACT: Clustering is the process of grouping data into subsets such that similar instances are collected together,
while dissimilar instances belong to different groups. The instances are thereby organized into an efficient
representation that characterizes the population being sampled. A common approach to clustering is to treat it as an
optimization process: the best partition is found by optimizing a particular function of similarity, or distance, among
the data. There is an implicit assumption that the true intrinsic structure of the data can be correctly described by the
similarity formula defined and fixed in the clustering criterion. In this paper, we introduce clustering with
multi-viewpoint based similarity measures. The multi-viewpoint approach to learning is one in which we have ‘views’ of
the data (sometimes in a rather abstract sense) and the goal is to use the relationship between these views to alleviate the
difficulty of a learning problem of interest.

Keywords: Clustering, Text mining, Similarity measure, View point.
I. INTRODUCTION

Clustering [1], or cluster analysis, is the task of grouping a set of objects in such a way that objects in the same group
(called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a
main task of exploratory data mining, and a common technique for statistical data analysis used in many fields,
including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis
itself is not one specific algorithm or procedure, but the general task to be solved. It can be achieved by various
algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular
notions of clusters include groups with low distances among the cluster members, dense areas of the data space, and
intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization
problem.
The appropriate clustering algorithm and parameter settings, including values such as the distance function to use, a
density threshold or the number of expected clusters, depend on the individual data set and the intended use of the results.
Clustering as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective
optimization that involves trial and error. It will often be necessary to modify parameters and preprocessing until the result
achieves the desired properties. Cluster analysis can be considered the most important unsupervised learning problem; like
every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of
clustering could be “the process of organizing objects into groups whose members are similar in some way”. A
cluster is therefore a collection of objects or items which are “similar” to one another and “dissimilar” to the objects
belonging to other clusters. Figure 1 shows the clustering process.

Figure 1: Clustering Process
In this case we easily identify the four clusters into which the data can be divided; the similarity criterion is
distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case
geometrical distance). This is called distance-based clustering. Another kind of clustering is conceptual clustering:
two or more objects belong to the same cluster if the cluster defines a concept common to all those objects. In other words,
objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures. The
multi-viewpoint approach to learning is one in which we have ‘views’ of the data (sometimes in a rather abstract sense) and the
goal is to use the relationship between these views to alleviate the difficulty of a learning problem of interest.

II. RELATED WORK

Text clustering is required in real-world applications such as web search engines. It falls under the text mining
process and is meant for grouping text documents into clusters, which are then used by applications such as search
engines. A text document is treated as an object, and a word in the document is referred to as a term.
A vector is built to represent each text document. The total number of terms in the text document is represented by m. A
weighting scheme such as Term Frequency – Inverse Document Frequency (TF-IDF) is used to form the document
vectors. There are many approaches to text document clustering, including probabilistic methods [2], nonnegative
matrix factorization [3] and information-theoretic co-clustering [4]. These approaches do not use a particular measure for
finding similarity among text documents. In this paper, we make use of a multi-viewpoint similarity measure for finding the
similarity. As found in the literature, a measure widely used in text document clustering is Euclidean distance (ED).
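The document-vector construction described above can be sketched as follows. This is only an illustration under stated assumptions: whitespace tokenization, raw term counts, and idf = log(n / df) are choices made here, not ones fixed by the text, and `tfidf_vectors` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one unit-length TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw term frequency times log(n / df).
        vec = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec))
        # Normalize to unit length so dot products equal cosine similarities.
        vectors.append([w / norm for w in vec] if norm > 0 else vec)
    return vocab, vectors

docs = ["apple banana apple", "banana cherry", "cherry cherry apple"]
vocab, vectors = tfidf_vectors(docs)
```

Here m is `len(vocab)`, and each entry of `vectors` is the m-dimensional representation of one document.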
K-Means is the most widely used clustering algorithm due to its simplicity and ease of use. Euclidean
distance is the measure K-Means uses to determine how far apart objects are when forming them into clusters. In
this case the cluster centroid is computed as follows:
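A minimal sketch, assuming the standard k-means definition in which the centroid of a cluster is the mean of its document vectors and each document is assigned to the centroid at the smallest Euclidean distance. The function names and the toy data are illustrative only.

```python
import math

def centroid(cluster):
    """Mean vector of a non-empty list of equal-length document vectors."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_centroid(doc, centroids):
    """Index of the centroid closest to doc in Euclidean distance."""
    return min(range(len(centroids)), key=lambda r: euclidean(doc, centroids[r]))

cluster_a = [[1.0, 0.0], [3.0, 0.0]]
cluster_b = [[0.0, 4.0], [0.0, 6.0]]
centroids = [centroid(cluster_a), centroid(cluster_b)]
```

K-Means alternates these two operations: reassign every document with `nearest_centroid`, then recompute each `centroid`, until assignments stop changing.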

Another similarity measure used for text document mining is cosine similarity. It is most useful for
high-dimensional documents [5]. This measure is also used in Spherical K-Means, which is a variant of the K-Means algorithm.
The difference between the two flavors of K-Means that use the cosine similarity measure and the ED measure,
respectively, is that the former focuses on vector directions while the latter focuses on vector magnitudes. Graph partitioning
is yet another very popular approach. It treats the text document corpus as a graph and uses the min-max cut
algorithm, which represents the centroid as follows:

There is a software package called CLUTO [6] which is meant for document clustering. It makes use of the graph
partitioning approach: text documents are clustered based on the nearest-neighbor graph it builds. It is based on the
Jaccard coefficient, which is computed as follows:
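A minimal sketch, assuming the extended (Tanimoto) form of the Jaccard coefficient for real-valued vectors, u·v / (‖u‖² + ‖v‖² − u·v). Cosine similarity is included for comparison; the function names and vectors are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors; ignores their lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def extended_jaccard(u, v):
    """u.v / (|u|^2 + |v|^2 - u.v) for non-zero vectors; sensitive to length and direction."""
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)

u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]  # same direction as u, twice the length
cos_uv = cosine_similarity(u, v)
jac_uv = extended_jaccard(u, v)
```

For these two vectors the cosine similarity is at its maximum because the directions coincide, while the extended Jaccard value is lower because the lengths differ.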

The Jaccard coefficient uses both magnitude and direction, which is not the case with Euclidean distance and cosine
similarity. However, it is similar to cosine similarity when the documents are represented as unit vectors. In [7] there is a
comparison between two techniques, namely Jaccard and Pearson correlation, which concludes that both of them are
well suited to clustering web documents. For text document clustering, other approaches can be used which are
phrase based and concept based. A phrase-based approach has been proposed, while in [8] a tree-similarity-based approach is found. The
common procedure used by both of them is “Hierarchical Agglomerative Clustering”. The drawback of these approaches is
that their computational cost is too high. There are also measures for clustering XML documents. One such measure is
called “Structural Similarity”, which differs from the measures used in text document clustering. This paper focuses on a new
multi-viewpoint based similarity measure for text clustering.

III. PROPOSED WORK

In the proposed work, our approach to finding the similarity between documents or objects while performing clustering is
multi-viewpoint based similarity. It makes use of more than one point of reference, as opposed to existing algorithms used for text
document clustering. In our approach, the similarity between two documents is calculated as follows:

$$\underset{d_i,\, d_j \in S_r}{\mathrm{sim}(d_i, d_j)} = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{sim}(d_i - d_h,\; d_j - d_h)$$

Consider two points “di” and “dj” in the cluster Sr. The similarity between those two points is viewed from a point
“dh” which is outside the cluster. This similarity is equal to the product of the cosine of the angle between the two points,
as seen from “dh”, and the Euclidean distances from “dh” to each of the two points. The definition rests on the assumption
that “dh” is not in the same cluster as “di” and “dj”. When the distances are very small, the chances are higher that “dh” is
in fact in the same cluster. Although using many viewpoints increases the accuracy of the similarity measure, some of them
may give a negative result. However, this drawback can be ignored provided there are plenty of documents to be
clustered.
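The definition above can be sketched directly. The sketch assumes that the relative similarity sim(di − dh, dj − dh) is the dot product of the two difference vectors, which equals the cosine of the angle at “dh” multiplied by the two Euclidean distances. The function name `mvs` and the sample vectors are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def mvs(di, dj, outside):
    """Average of (di - dh).(dj - dh) over all viewpoints dh outside the cluster."""
    if not outside:
        raise ValueError("need at least one viewpoint outside the cluster")
    total = sum(dot(sub(di, dh), sub(dj, dh)) for dh in outside)
    # len(outside) plays the role of n - n_r in the formula.
    return total / len(outside)

di, dj = [1.0, 0.0], [0.8, 0.6]          # two documents in the same cluster
outside = [[0.0, 1.0], [-1.0, 0.0]]      # viewpoints from other clusters
score = mvs(di, dj, outside)
```

Each viewpoint contributes one term to the sum, so a single unhelpful viewpoint is averaged out when many documents lie outside the cluster.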
Next, we carry out a validity test for cosine similarity (CS) and multi-viewpoint based similarity (MVS) as follows. For each
type of similarity measure, a similarity matrix $A = \{a_{ij}\}_{n \times n}$ is created. For CS, this is very simple, as $a_{ij} = d_i^t d_j$.
The algorithm for building the MVS matrix is described in Algorithm 1.

ALGORITHM 1: BUILDMVSMATRIX(A)
Step 1: for r ← 1 : c do
Step 2:     $D_{S \setminus S_r}$ ← $\sum_{d_i \notin S_r} d_i$
Step 3:     $n_{S \setminus S_r}$ ← $|S \setminus S_r|$
Step 4: end for
Step 5: for i ← 1 : n do
Step 6:     r ← class of $d_i$
Step 7:     for j ← 1 : n do
Step 8:         if $d_j \in S_r$ then
Step 9:
Step 10:        else
Step 11:
Step 12:        end if
Step 13:    end for
Step 14: end for
Step 15: return $A = \{a_{ij}\}_{n \times n}$
First, the outer composite with respect to each class is determined. Then, for each row ai of A, i = 1, . . . , n, if the pair of
text documents di and dj, j = 1, . . . , n, are in the same class, aij is calculated as in Step 9. Otherwise, dj is assumed to be in
di's class, and aij is calculated as shown in Step 11.
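The procedure can be sketched as below. The closed-form entries are not taken from Steps 9 and 11; they are derived here from the similarity definition under two assumptions: documents are unit-length vectors, and the relative similarity is a dot product. Under those assumptions the average over viewpoints expands to di·dj − di·D/m − dj·D/m + 1, where D is the outer composite and m the number of viewpoints. When dj belongs to another class, it is removed from the viewpoints, which requires at least two documents outside every class.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_mvs_matrix(docs, labels):
    """docs: unit-length vectors; labels: class index of each document."""
    n, dim = len(docs), len(docs[0])
    # Outer composite (sum of documents outside class r) and its size, per class.
    outer, n_outer = {}, {}
    for r in set(labels):
        outside = [d for d, lab in zip(docs, labels) if lab != r]
        outer[r] = [sum(d[t] for d in outside) for t in range(dim)]
        n_outer[r] = len(outside)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r = labels[i]
        for j in range(n):
            if labels[j] == r:
                D, m = outer[r], n_outer[r]
            else:
                # Treat dj as if it were in di's class: drop it from the viewpoints.
                D = [a - b for a, b in zip(outer[r], docs[j])]
                m = n_outer[r] - 1
            A[i][j] = dot(docs[i], docs[j]) - (dot(docs[i], D) + dot(docs[j], D)) / m + 1.0
    return A

docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0], [0.6, 0.8]]
labels = [0, 0, 1, 1, 1]
A = build_mvs_matrix(docs, labels)
```

Precomputing the composites means each entry costs a few dot products rather than a full pass over all viewpoints. Note that the resulting matrix need not be symmetric, because row i always uses the viewpoints of di's class.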
After matrix “A” is formed, the code in Algorithm 2 is used to get its validity score:
ALGORITHM 2: GETVALIDITY(validity, A, percentage)
Step 1: for r ← 1 : c do
Step 2:     $q_r$ ← floor(percentage × $n_r$)
Step 3:     if $q_r$ = 0 then
Step 4:         $q_r$ ← 1
Step 5:     end if
Step 6: end for
Step 7: for i ← 1 : n do
Step 8:     $\{a_{iv[1]}, \ldots, a_{iv[n]}\}$ ← Sort $\{a_{i1}, \ldots, a_{in}\}$
Step 9:         s.t. $a_{iv[1]} \ge a_{iv[2]} \ge \ldots \ge a_{iv[n]}$,
                $\{v[1], \ldots, v[n]\}$ ← permute $\{1, \ldots, n\}$
Step 10:    r ← class of $d_i$
Step 11:    validity($d_i$) ← $|\{d_{v[1]}, \ldots, d_{v[q_r]}\} \cap S_r| \,/\, q_r$
Step 12: end for
Step 13: validity ← $\sum_{i=1}^{n} \mathrm{validity}(d_i) \,/\, n$
Step 14: return validity
For each document “di” corresponding to row “ai” of matrix A, we select the “qr” documents closest to “di”. The
value of “qr” is chosen as a percentage of the size of the class r that contains “di”, where percentage ∈ (0, 1].
Then, the validity with respect to “di” is the fraction of these “qr” documents having the same class label as
“di”, as shown in Step 11. The final validity is determined by averaging over all the rows of matrix A, as shown in Step
13. It is clear that the validity score is bounded between 0 and 1. The higher the validity score of a similarity measure, the
more suitable it should be for the clustering process.
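The validity computation described above can be sketched as follows. The sketch follows the description literally, so each row is sorted in full, including the document's similarity to itself. The function name, argument order, and sample matrix are illustrative.

```python
import math

def get_validity(A, labels, percentage):
    """Average fraction of each row's q_r most similar documents that share the row's class."""
    n = len(A)
    sizes = {r: labels.count(r) for r in set(labels)}
    # q_r = floor(percentage * n_r), but never less than 1.
    q = {r: max(1, math.floor(percentage * size)) for r, size in sizes.items()}
    total = 0.0
    for i in range(n):
        # Column indices ordered by decreasing similarity to document i.
        order = sorted(range(n), key=lambda j: A[i][j], reverse=True)
        r = labels[i]
        top = order[:q[r]]
        total += sum(1 for j in top if labels[j] == r) / q[r]
    return total / n

A = [[1.0, 0.2, 0.9],
     [0.2, 1.0, 0.3],
     [0.9, 0.3, 1.0]]
labels = [0, 0, 1]
validity = get_validity(A, labels, percentage=1.0)
```

In this toy matrix the two class-0 documents are each more similar to the class-1 document than to one another, so the score falls below 1.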

IV. INCREMENTAL CLUSTERING ALGORITHM

The main goal of this algorithm is to perform text document clustering by optimizing $I_R$ and $I_V$ as shown below:

With this general form, the incremental optimization algorithm, which has two major steps, Initialization and Refinement, is
shown in Algorithm 3 and Algorithm 4.
ALGORITHM 3: INITIALIZATION
Step 1: Select k seeds s1, . . . , sk randomly
Step 2:
Step 3:
Step 4: end
ALGORITHM 4: REFINEMENT
Step 1: repeat
Step 2: {v[1 : n]} ← random permutation of {1, . . ., n}
Step 3: for j ← 1 : n do
Step 4: i ← v[j]
Step 5: p ← cluster[di]
Step 6:
Step 7:

Step 8:
Step 9: if
then
Step 10: Move di to cluster q: cluster[di] ← q
Step 11: Update Dp, np,Dq, nq
Step 12: end if
Step 13: end for
Step 14: until No move for all n documents
Step 15: end
At Initialization, “k” arbitrary documents are selected to be the seeds from which the initial partitions are formed.
Refinement is a process that consists of a number of iterations. During each iteration, the “n” text documents are visited one
by one in random order. Each text document is checked to see whether moving it to another cluster improves the
objective function. If so, the text document is moved to the cluster that leads to the highest improvement. If no cluster
is better than the current one, the text document is not moved. The clustering process terminates when an iteration
completes without any text document being moved to a new cluster.
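A minimal sketch of the initialization and refinement loop described above. Two assumptions are made here and are not taken from the paper: the objective is a stand-in (the sum of the norms of the cluster composite vectors, as in spherical k-means) rather than $I_R$ or $I_V$, and at initialization each document joins the seed with which it has the largest dot product.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def incremental_cluster(docs, k, seed=0):
    """Return a cluster index per document. Objective (assumed): sum of ||D_r||."""
    rng = random.Random(seed)
    n, dim = len(docs), len(docs[0])
    # Initialization: k random seed documents; each document joins its most similar seed.
    seeds = rng.sample(docs, k)
    cluster = [max(range(k), key=lambda r: sum(a * b for a, b in zip(seeds[r], d)))
               for d in docs]
    D = [[0.0] * dim for _ in range(k)]  # composite vector of each cluster
    for d, r in zip(docs, cluster):
        D[r] = add(D[r], d)
    # Refinement: repeat passes in random order until a pass moves nothing.
    moved = True
    while moved:
        moved = False
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            p, d = cluster[i], docs[i]
            loss_p = norm(sub(D[p], d)) - norm(D[p])  # objective change if d leaves p
            best_q, best_gain = p, 1e-12              # require a real improvement
            for q in range(k):
                if q != p:
                    gain = loss_p + norm(add(D[q], d)) - norm(D[q])
                    if gain > best_gain:
                        best_q, best_gain = q, gain
            if best_q != p:
                D[p], D[best_q] = sub(D[p], d), add(D[best_q], d)
                cluster[i] = best_q
                moved = True
    return cluster

docs = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.28, 0.96]]
clusters = incremental_cluster(docs, k=2)
```

Because a document moves only when the objective strictly increases, each pass either improves the partition or ends the loop. The stand-in objective does not need the cluster sizes, so only the composite vectors are updated after a move.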

V. CONCLUSION

From the viewpoint of data engineering, a cluster is a group of objects of a similar nature, and the grouping mechanism is
called the clustering process. Similar text documents are grouped together in a cluster if their cosine similarity
exceeds a specified threshold. In this paper we focus mainly on viewpoints and introduce a novel multi-viewpoint
based similarity measure for text mining. The nature of the similarity measure plays a very important role in the success or
failure of a clustering method. From the proposed similarity measure, we then formulate new clustering criterion functions
and introduce their respective clustering algorithms, which are fast and scalable like the k-means algorithm, but are also capable
of providing high-quality and consistent performance.

REFERENCES
[1] I. Guyon, U. von Luxburg, and R. C. Williamson, “Clustering: Science or Art?” NIPS'09 Workshop on Clustering Theory, 2009.
[2] Leo Wanner (2004). “Introduction to Clustering Techniques”. Available online at: http://www.iula.upf.edu/materials/040701wanner.pdf [viewed: 16 August 2012]
[3] D. Ienco, R. G. Pensa, and R. Meo, “Context-based distance learning for categorical data clustering,” in Proc. of the 8th Int. Symp. IDA, 2009, pp. 83–94.
[4] I. Guyon, U. von Luxburg, and R. C. Williamson, “Clustering: Science or Art?” NIPS'09 Workshop on Clustering Theory, 2009.
[5] C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge University Press, 2009.
[6] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, “Top 10 algorithms in data mining,” Knowl. Inf. Syst., vol. 14, no. 1, pp. 1–37, 2007.
[7] W. Xu, X. Liu, and Y. Gong, “Document clustering based on nonnegative matrix factorization,” in SIGIR, 2003, pp. 267–273.
[8] S. Zhong, “Efficient online spherical K-means clustering,” in IEEE IJCNN, 2005, pp. 3180–3185.

www.ijmer.com

2826 | Page

More Related Content

What's hot

Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
Editor IJMTER
 
A0310112
A0310112A0310112
A0310112
iosrjournals
 
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSSCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
ijdkp
 
Ijartes v1-i2-006
Ijartes v1-i2-006Ijartes v1-i2-006
Ijartes v1-i2-006
IJARTES
 
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGA SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
ijcsa
 
Du35687693
Du35687693Du35687693
Du35687693
IJERA Editor
 
Expandable bayesian
Expandable bayesianExpandable bayesian
Expandable bayesian
Ahmad Amri
 
B colouring
B colouringB colouring
B colouringxs76250
 
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
IJECEIAES
 
A Combined Approach for Feature Subset Selection and Size Reduction for High ...
A Combined Approach for Feature Subset Selection and Size Reduction for High ...A Combined Approach for Feature Subset Selection and Size Reduction for High ...
A Combined Approach for Feature Subset Selection and Size Reduction for High ...
IJERA Editor
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrieval
Basma Gamal
 
Semi-Supervised Discriminant Analysis Based On Data Structure
Semi-Supervised Discriminant Analysis Based On Data StructureSemi-Supervised Discriminant Analysis Based On Data Structure
Semi-Supervised Discriminant Analysis Based On Data Structure
iosrjce
 
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
IJCSIS Research Publications
 
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
ijtsrd
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
IRJET Journal
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
IOSR Journals
 

What's hot (17)

Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
 
A0310112
A0310112A0310112
A0310112
 
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSSCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
 
call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...
 
Ijartes v1-i2-006
Ijartes v1-i2-006Ijartes v1-i2-006
Ijartes v1-i2-006
 
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGA SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
 
Du35687693
Du35687693Du35687693
Du35687693
 
Expandable bayesian
Expandable bayesianExpandable bayesian
Expandable bayesian
 
B colouring
B colouringB colouring
B colouring
 
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
 
A Combined Approach for Feature Subset Selection and Size Reduction for High ...
A Combined Approach for Feature Subset Selection and Size Reduction for High ...A Combined Approach for Feature Subset Selection and Size Reduction for High ...
A Combined Approach for Feature Subset Selection and Size Reduction for High ...
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrieval
 
Semi-Supervised Discriminant Analysis Based On Data Structure
Semi-Supervised Discriminant Analysis Based On Data StructureSemi-Supervised Discriminant Analysis Based On Data Structure
Semi-Supervised Discriminant Analysis Based On Data Structure
 
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
 
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 

Viewers also liked

Digital Citizenship Webquest
Digital Citizenship WebquestDigital Citizenship Webquest
Digital Citizenship Webquest
mdonel
 
Significance Assessment of Architectural Heritage Monuments in Old-Goa
Significance Assessment of Architectural Heritage Monuments in Old-GoaSignificance Assessment of Architectural Heritage Monuments in Old-Goa
Significance Assessment of Architectural Heritage Monuments in Old-Goa
IJMER
 
Virtualization Technology using Virtual Machines for Cloud Computing
Virtualization Technology using Virtual Machines for Cloud ComputingVirtualization Technology using Virtual Machines for Cloud Computing
Virtualization Technology using Virtual Machines for Cloud Computing
IJMER
 
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
IJMER
 
Finite Element Analysis of Human RIB Cage
Finite Element Analysis of Human RIB CageFinite Element Analysis of Human RIB Cage
Finite Element Analysis of Human RIB Cage
IJMER
 
E04011 03 3339
E04011 03 3339E04011 03 3339
E04011 03 3339IJMER
 
Ρωμιοσύνη
ΡωμιοσύνηΡωμιοσύνη
Ρωμιοσύνη
Popi Kaza
 
Education set for collecting and visualizing data using sensor system based ...
Education set for collecting and visualizing data using sensor  system based ...Education set for collecting and visualizing data using sensor  system based ...
Education set for collecting and visualizing data using sensor system based ...
IJMER
 
The creative arts –
The creative arts –The creative arts –
The creative arts –
Ylenia Vella
 
Monitor the Unmeasurable
Monitor the UnmeasurableMonitor the Unmeasurable
Monitor the Unmeasurable
Jennifer Davis
 
Bu32888890
Bu32888890Bu32888890
Bu32888890IJMER
 
I0502 01 4856
I0502 01 4856I0502 01 4856
I0502 01 4856IJMER
 
Cb31324330
Cb31324330Cb31324330
Cb31324330IJMER
 
Bw31297301
Bw31297301Bw31297301
Bw31297301IJMER
 
Bs32869883
Bs32869883Bs32869883
Bs32869883IJMER
 
Tracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSG
Tracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSGTracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSG
Tracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSG
IJMER
 
On Characterizations of NANO RGB-Closed Sets in NANO Topological Spaces
On Characterizations of NANO RGB-Closed Sets in NANO Topological SpacesOn Characterizations of NANO RGB-Closed Sets in NANO Topological Spaces
On Characterizations of NANO RGB-Closed Sets in NANO Topological Spaces
IJMER
 
Fisheries Enumerator - Ojas.guj.nic.in
Fisheries Enumerator - Ojas.guj.nic.inFisheries Enumerator - Ojas.guj.nic.in
Fisheries Enumerator - Ojas.guj.nic.in
ojasgujnicin
 
Www youtube com_watch_v_p_n_huv50kbfe
Www youtube com_watch_v_p_n_huv50kbfeWww youtube com_watch_v_p_n_huv50kbfe
Www youtube com_watch_v_p_n_huv50kbfe
Rancyna James
 

Viewers also liked (20)

Digital Citizenship Webquest
Digital Citizenship WebquestDigital Citizenship Webquest
Digital Citizenship Webquest
 
Significance Assessment of Architectural Heritage Monuments in Old-Goa
Significance Assessment of Architectural Heritage Monuments in Old-GoaSignificance Assessment of Architectural Heritage Monuments in Old-Goa
Significance Assessment of Architectural Heritage Monuments in Old-Goa
 
Virtualization Technology using Virtual Machines for Cloud Computing
Virtualization Technology using Virtual Machines for Cloud ComputingVirtualization Technology using Virtual Machines for Cloud Computing
Virtualization Technology using Virtual Machines for Cloud Computing
 
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
 
Finite Element Analysis of Human RIB Cage
Finite Element Analysis of Human RIB CageFinite Element Analysis of Human RIB Cage
Finite Element Analysis of Human RIB Cage
 
E04011 03 3339
E04011 03 3339E04011 03 3339
E04011 03 3339
 
Ρωμιοσύνη
ΡωμιοσύνηΡωμιοσύνη
Ρωμιοσύνη
 
Education set for collecting and visualizing data using sensor system based ...
Education set for collecting and visualizing data using sensor  system based ...Education set for collecting and visualizing data using sensor  system based ...
Education set for collecting and visualizing data using sensor system based ...
 
The creative arts –
The creative arts –The creative arts –
The creative arts –
 
Monitor the Unmeasurable
Monitor the UnmeasurableMonitor the Unmeasurable
Monitor the Unmeasurable
 
Bu32888890
Bu32888890Bu32888890
Bu32888890
 
I0502 01 4856
I0502 01 4856I0502 01 4856
I0502 01 4856
 
Reno x tkja
Reno x tkjaReno x tkja
Reno x tkja
 
Cb31324330
Cb31324330Cb31324330
Cb31324330
 
Bw31297301
Bw31297301Bw31297301
Bw31297301
 
Bs32869883
Bs32869883Bs32869883
Bs32869883
 
Tracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSG
Tracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSGTracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSG
Tracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSG
 
On Characterizations of NANO RGB-Closed Sets in NANO Topological Spaces
On Characterizations of NANO RGB-Closed Sets in NANO Topological SpacesOn Characterizations of NANO RGB-Closed Sets in NANO Topological Spaces
On Characterizations of NANO RGB-Closed Sets in NANO Topological Spaces
 
Fisheries Enumerator - Ojas.guj.nic.in
Fisheries Enumerator - Ojas.guj.nic.inFisheries Enumerator - Ojas.guj.nic.in
Fisheries Enumerator - Ojas.guj.nic.in
 
Www youtube com_watch_v_p_n_huv50kbfe
Www youtube com_watch_v_p_n_huv50kbfeWww youtube com_watch_v_p_n_huv50kbfe
Www youtube com_watch_v_p_n_huv50kbfe
 

Similar to A Novel Clustering Method for Similarity Measuring in Text Documents

IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Editor IJARCET
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
inventionjournals
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
Bs31267274
Bs31267274Bs31267274
Bs31267274IJMER
 
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
A Novel Clustering Method for Similarity Measuring in Text Documents

  • 1. www.ijmer.com International Journal of Modern Engineering Research (IJMER) Vol. 3, Issue. 5, Sep - Oct. 2013 pp-2823-2826 ISSN: 2249-6645
A Novel Clustering Method for Similarity Measuring in Text Documents
Preethi Priyanka Thella1, G. Sridevi2
1 M.Tech, Nimra College of Engineering & Technology, Vijayawada, A.P., India.
2 Assoc. Professor, Dept. of CSE, Nimra College of Engineering & Technology, Vijayawada, A.P., India.
ABSTRACT: Clustering is the process of grouping data into subsets in such a manner that similar instances are collected together, while different instances belong to different groups. The instances are thereby arranged into an efficient representation that characterizes the population being sampled. A common approach to clustering is to treat it as an optimization process: the best partition is found by optimizing a particular function of similarity, or distance, among the data. There is a hidden assumption that the true inherent structure of the data can be correctly described by the similarity formula defined and fixed in the clustering criterion. In this paper, we introduce clustering with multi-viewpoints based on different similarity measures. The multi-viewpoint approach to learning is one in which we have 'views' of the data (sometimes in a rather abstract sense) and the goal is to use the relationship between these views to alleviate the difficulty of a learning problem of interest.
Keywords: Clustering, Text mining, Similarity measure, View point.
I. INTRODUCTION
Clustering [1], or cluster analysis, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
It is a main task of exploratory data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis itself is not one specific algorithm or procedure, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with low distances among the cluster members, intervals or particular statistical distributions, and dense areas of the data space. Clustering can therefore be formulated as a multi-objective optimization process. The appropriate clustering algorithm and parameter settings, including values such as the distance function to use, a density threshold or the number of expected clusters, depend on the individual data set and the intended use of the results. Clustering as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It will often be necessary to modify parameters and preprocessing until the result achieves the desired properties.
Cluster analysis can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A loose definition of the clustering process could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects or items which are “similar” to one another and “dissimilar” to the objects belonging to other clusters. Figure 1 shows the clustering process.
Figure 1: Clustering Process
In this case we easily identify the four clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case geometrical distance). This is called distance-based clustering. Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures. The multi-viewpoint approach to learning is one in which we have 'views' of the data (sometimes in a rather abstract sense) and the goal is to use the relationship between these views to alleviate the difficulty of a learning problem of interest.
  • 2. II. RELATED WORK
Text clustering, a part of the text mining process, groups text documents into clusters. These clusters are used by real-world applications such as web search engines. A text document is treated as an object, and a word in the document is referred to as a term. A vector is built to represent each text document. The total number of terms in the text document is represented by m. A weighting scheme such as Term Frequency – Inverse Document Frequency (TF-IDF) is used to build the document vectors. There are many approaches to text document clustering. They include probabilistic methods [2], nonnegative matrix factorization [3] and information-theoretic co-clustering [4]. These approaches do not use a specific measure for finding similarity among text documents. In this paper, we make use of a multi-viewpoint similarity measure for finding the similarity.
As found in the literature, a measure widely used in text document clustering is Euclidean distance (ED). K-Means is the most widely used clustering algorithm because of its simplicity and ease of use. Euclidean distance is the measure K-Means uses to measure the distance between objects when forming clusters. In this case the cluster centroid is computed as follows:
[Equation not reproduced in the transcript]
Another similarity measure used for text document mining is the cosine similarity measure. It is most useful for high-dimensional documents [5]. This measure is also used in Spherical K-Means, which is a variant of the K-Means algorithm. The difference between the two flavors of K-Means that use the cosine similarity measure and the ED measure respectively is that the former focuses on vector directions while the latter focuses on vector magnitudes.
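The contrast between the two measures can be illustrated with a minimal sketch. The three term-weight vectors below are hypothetical toy data, not taken from the paper, and the helper names are chosen only for this illustration. Two documents that point in the same direction but differ in length are far apart under Euclidean distance while their cosine similarity is at its maximum.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def euclidean_distance(u, v):
    """Distance used by standard K-Means; sensitive to vector magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Similarity used by Spherical K-Means; depends only on direction."""
    return dot(u, v) / (norm(u) * norm(v))

# Hypothetical term-weight vectors over a three-term vocabulary.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, twice the length
doc_c = [0.0, 1.0, 2.0]  # different term distribution

print("ED(a, b)  =", euclidean_distance(doc_a, doc_b))  # non-zero
print("cos(a, b) =", cosine_similarity(doc_a, doc_b))   # close to 1
print("cos(a, c) =", cosine_similarity(doc_a, doc_c))   # clearly below 1
```

When the document vectors are normalized to unit length, the two measures rank document pairs consistently, which is why the choice matters mainly for unnormalized vectors.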
Graph partitioning is another popular approach. It treats the text document corpus as a graph and uses the min-max cut algorithm, which is formulated in terms of cluster centroids. CLUTO [6] is a software package for document clustering that uses the graph partitioning approach. The text documents are clustered based on the nearest-neighbor graph it builds. The graph is based on the Jaccard coefficient, which for two document vectors $u_i$ and $u_j$ is commonly defined in its extended form as:

$$\mathrm{sim}_{\mathrm{eJacc}}(u_i, u_j) = \frac{u_i^t u_j}{\|u_i\|^2 + \|u_j\|^2 - u_i^t u_j}$$

The Jaccard coefficient uses both magnitude and direction, which is not the case with Euclidean distance and cosine similarity. However, it is similar to cosine similarity when the documents are represented as unit vectors. In [7] there is a comparison between two measures, namely the Jaccard coefficient and Pearson correlation, which concludes that both are well suited to clustering web documents.

For text document clustering, other approaches can be used, which are phrase-based and concept-based. A phrase-based approach has been proposed, while in [8] a tree-similarity-based approach is found. Both use hierarchical agglomerative clustering as their common procedure. The drawback of these approaches is that their computational cost is too high.

There are also measures for clustering XML documents. One such measure is called “structural similarity”, which differs from the measures used in text document clustering. This paper focuses on a new multi-viewpoint based similarity measure for text clustering.

III. PROPOSED WORK

In the proposed work, our approach to finding the similarity between documents or objects while clustering is multi-viewpoint based similarity. It uses more than one point of reference, as opposed to existing algorithms for text document clustering. In our approach, the similarity between two documents is calculated as follows:

$$\mathrm{sim}(d_i, d_j)\Big|_{d_i, d_j \in S_r} = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{sim}(d_i - d_h,\ d_j - d_h)$$

Consider two points $d_i$ and $d_j$ in the cluster $S_r$.
The similarity between those two points is viewed from a point $d_h$ which is outside the cluster. This similarity is equal to the cosine of the angle between the two points as seen from $d_h$, multiplied by the Euclidean distances from $d_h$ to each of the points. The definition rests on the assumption that $d_h$ is not in the same cluster as $d_i$ and $d_j$. When the distances are very small, the chances are higher that $d_h$ is in fact in the same cluster. Although multiple viewpoints help to increase the accuracy of the similarity measure, some of them may give a negative result. However, this drawback can be ignored provided there are plenty of documents to be clustered.

Now we carry out the validity test for the cosine similarity (CS) and the multi-view based similarity as follows. For each type of similarity measure, a similarity matrix $A = \{a_{ij}\}_{n \times n}$ is created. For CS, this is very simple, as $a_{ij} = d_i^t d_j$. The algorithm for building the Multi-View Similarity (MVS) matrix is described in Algorithm 1.
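The multi-view similarity of a single pair of documents can be sketched directly from its definition. The sketch below is only an illustration under stated assumptions: documents are plain Python lists, the function names are hypothetical, and the relative similarity is computed as the dot product of the difference vectors, which equals the cosine of the angle at $d_h$ multiplied by the two Euclidean distances.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def unit(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def relative_similarity(d_i, d_j, d_h):
    """Similarity of d_i and d_j as seen from the viewpoint d_h.

    The dot product of the difference vectors is the same quantity as
    cos(angle at d_h) * ||d_i - d_h|| * ||d_j - d_h||.
    """
    return dot(sub(d_i, d_h), sub(d_j, d_h))

def mvs(d_i, d_j, viewpoints):
    """Average the relative similarity over all viewpoints outside the cluster."""
    if not viewpoints:
        raise ValueError("at least one viewpoint outside the cluster is required")
    total = sum(relative_similarity(d_i, d_j, d_h) for d_h in viewpoints)
    return total / len(viewpoints)

# Hypothetical example: a single viewpoint lying between the two documents.
print(mvs([2.0, 0.0], [0.0, 2.0], [[1.0, 1.0]]))  # -2.0
```

When all vectors have unit length, expanding the dot product gives $d_i^t d_j - d_i^t D / m - d_j^t D / m + 1$, where $D$ is the sum of the $m$ viewpoint vectors, so the average can be computed without looping over every viewpoint.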
ALGORITHM 1: BUILDMVSMATRIX(A)
Step 1: for r ← 1 : c do
Step 2: $D_{S \setminus S_r} \leftarrow \sum_{d_i \notin S_r} d_i$
Step 3: $n_{S \setminus S_r} \leftarrow |S \setminus S_r|$
Step 4: end for
Step 5: for i ← 1 : n do
Step 6: r ← class of $d_i$
Step 7: for j ← 1 : n do
Step 8: if $d_j \in S_r$ then
Step 9: $a_{ij}$ ← multi-view similarity of $d_i$ and $d_j$, computed with $D_{S \setminus S_r}$ and $n_{S \setminus S_r}$
Step 10: else
Step 11: $a_{ij}$ ← multi-view similarity of $d_i$ and $d_j$, with $d_j$ treated as a member of $S_r$
Step 12: end if
Step 13: end for
Step 14: end for
Step 15: return $A = \{a_{ij}\}_{n \times n}$

First, the outer composite with respect to each class is determined. Then, for each row $a_i$ of A, i = 1, . . . , n, if the pair of text documents $d_i$ and $d_j$, j = 1, . . . , n, are in the same class, $a_{ij}$ is calculated as in line 9. Otherwise, $d_j$ is assumed to be in $d_i$'s class, and $a_{ij}$ is calculated as shown in line 11. After matrix A is formed, the code in Algorithm 2 is used to get its validity score:

ALGORITHM 2: GETVALIDITY(validity, A, percentage)
Step 1: for r ← 1 : c do
Step 2: $q_r$ ← floor(percentage × $n_r$)
Step 3: if $q_r$ = 0 then
Step 4: $q_r$ ← 1
Step 5: end if
Step 6: end for
Step 7: for i ← 1 : n do
Step 8: $\{a_{iv[1]}, \ldots, a_{iv[n]}\}$ ← Sort $\{a_{i1}, \ldots, a_{in}\}$
Step 9: s.t. $a_{iv[1]} \geq a_{iv[2]} \geq \ldots \geq a_{iv[n]}$, {v[1], . . . , v[n]} ← permute {1, . . . , n}
Step 10: r ← class of $d_i$
Step 11: validity($d_i$) ← fraction of the documents $d_{v[1]}, \ldots, d_{v[q_r]}$ that belong to class r
Step 12: end for
Step 13: validity ← average of validity($d_i$) over i = 1, . . . , n
Step 14: return validity

For each document $d_i$ corresponding to row $a_i$ of matrix A, we select the $q_r$ documents closest to $d_i$. The value of $q_r$ is chosen relatively as a percentage of the size of the class r that contains $d_i$, where percentage ∈ (0, 1]. Then, the validity with respect to $d_i$ is calculated as the fraction of these $q_r$ documents having the same class label as $d_i$, as shown in line 11. The final validity is determined by averaging over all the rows of matrix A, as shown in line 13. It is clear that the validity score is bounded within values 0 and 1.
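The validity computation of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration, assuming the similarity matrix is given as a list of rows and the class labels as a parallel list; the function name, argument names and toy matrices are hypothetical and are not data from this paper.

```python
import math

def get_validity(A, labels, percentage):
    """Average, over all documents, of the fraction of each document's
    q_r most similar documents that share its class label."""
    n = len(A)

    # n_r: number of documents in each class
    class_sizes = {}
    for label in labels:
        class_sizes[label] = class_sizes.get(label, 0) + 1

    # q_r = floor(percentage * n_r), raised to 1 when it would be 0
    q = {r: max(1, math.floor(percentage * n_r))
         for r, n_r in class_sizes.items()}

    total = 0.0
    for i in range(n):
        # Column indices of row i, sorted by decreasing similarity
        order = sorted(range(n), key=lambda j: A[i][j], reverse=True)
        q_r = q[labels[i]]
        same_class = sum(1 for j in order[:q_r] if labels[j] == labels[i])
        total += same_class / q_r
    return total / n

labels = [0, 0, 1, 1]

# Toy matrix in which same-class documents are the most similar ones.
A_good = [[1.0, 0.9, 0.1, 0.2],
          [0.9, 1.0, 0.2, 0.1],
          [0.1, 0.2, 1.0, 0.8],
          [0.2, 0.1, 0.8, 1.0]]

# Toy matrix in which each document's nearest neighbour is in the other class.
A_bad = [[1.0, 0.1, 0.9, 0.2],
         [0.1, 1.0, 0.2, 0.9],
         [0.9, 0.2, 1.0, 0.1],
         [0.2, 0.9, 0.1, 1.0]]

print(get_validity(A_good, labels, 1.0))  # 1.0
print(get_validity(A_bad, labels, 1.0))   # 0.5
```

As in the pseudocode, each row is sorted in full, so a document's similarity to itself is included among its top-ranked entries.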
The higher the validity score of a similarity measure, the more suitable it should be for the clustering process.

IV. INCREMENTAL CLUSTERING ALGORITHM

The main goal of this algorithm is to perform text document clustering by optimizing the criterion functions $I_R$ and $I_V$.
The incremental optimization algorithm, which has two major steps, Initialization and Refinement, is shown in Algorithm 3 and Algorithm 4.

ALGORITHM 3: INITIALIZATION
Step 1: Select k seeds $s_1, \ldots, s_k$ randomly
Step 2: Form the initial partitions by assigning each document to a seed
Step 3: Compute $D_r$ and $n_r$ for each initial cluster r = 1, . . . , k
Step 4: end

ALGORITHM 4: REFINEMENT
Step 1: repeat
Step 2: {v[1 : n]} ← random permutation of {1, . . . , n}
Step 3: for j ← 1 : n do
Step 4: i ← v[j]
Step 5: p ← cluster[$d_i$]
Step 6: $\Delta I_p$ ← change in the objective function when $d_i$ is removed from cluster p
Step 7: q ← the cluster other than p whose objective improves most when $d_i$ is added to it
Step 8: $\Delta I_q$ ← change in the objective function when $d_i$ is added to cluster q
Step 9: if $\Delta I_p + \Delta I_q > 0$ then
Step 10: Move $d_i$ to cluster q: cluster[$d_i$] ← q
Step 11: Update $D_p$, $n_p$, $D_q$, $n_q$
Step 12: end if
Step 13: end for
Step 14: until no move for all n documents
Step 15: end

At Initialization, k arbitrary documents are selected to be the seeds from which the initial partitions are formed. Refinement is a process that consists of a number of iterations. During each iteration, the n text documents are visited one by one in a totally random order. Each text document is checked to see whether moving it to another cluster improves the objective function. If so, the text document is moved to the cluster that leads to the highest improvement. If no cluster is better than the current one, the text document is not moved. The clustering process terminates when an iteration completes without any text document being moved to a new cluster.

V. CONCLUSION

From the viewpoint of data engineering, a cluster is a group of objects of a similar nature. The grouping mechanism is called the clustering process. Similar text documents are grouped together in a cluster if their cosine similarity is greater than a specified threshold. In this paper we mainly focus on viewpoints and introduce a novel multi-viewpoint based similarity measure for text mining. The nature of the similarity measure plays a very important role in the success or failure of the clustering method.
From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable like the k-means algorithm, but are also capable of providing high-quality and consistent performance.

REFERENCES
[1] I. Guyon, U. von Luxburg, and R. C. Williamson, “Clustering: Science or Art?” NIPS'09 Workshop on Clustering Theory, 2009.
[2] L. Wanner, “Introduction to Clustering Techniques,” 2004. Available online at: http://www.iula.upf.edu/materials/040701wanner.pdf [viewed: 16 August 2012]
[3] D. Ienco, R. G. Pensa, and R. Meo, “Context-based distance learning for categorical data clustering,” in Proc. of the 8th Int. Symp. IDA, 2009, pp. 83–94.
[4] I. Guyon, U. von Luxburg, and R. C. Williamson, “Clustering: Science or Art?” NIPS'09 Workshop on Clustering Theory, 2009.
[5] C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge University Press, 2009.
[6] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, “Top 10 algorithms in data mining,” Knowl. Inf. Syst., vol. 14, no. 1, pp. 1–37, 2007.
[7] W. Xu, X. Liu, and Y. Gong, “Document clustering based on nonnegative matrix factorization,” in SIGIR, 2003, pp. 267–273.
[8] S. Zhong, “Efficient online spherical K-means clustering,” in IEEE IJCNN, 2005, pp. 3180–3185.