M. K. Kond Reddy et al Int. Journal of Engineering Research and Application
ISSN : 2248-9622, Vol. 3, Issue 5, Sep-Oct 2013, pp.2032-2036

RESEARCH ARTICLE

www.ijera.com

OPEN ACCESS

Data Mining Tool using Clustering Technique on Exploration Engine Dataset
Mahesh Kumar KondReddy, Sujeeth .T
Dept. of Computer Science & Engineering, Sri Venkateswara University, Tirupati, Andhra Pradesh, India

ABSTRACT
Retrieving good websites from large collections of websites is a major problem. As the number of
available Web pages grows, it becomes more difficult for users to find documents relevant to their
interests. Clustering is the partitioning of a data set into subsets (clusters) so that the data in
each subset share some common trait, often proximity according to a defined distance measure.
Clustering groups similar websites together, which helps improve the quality of the websites
retrieved. This paper applies the data mining tool Weka, using k-means clustering, to find clusters
in large data sets and to identify the attributes that govern search engine optimization. Unlabeled
document collections are becoming increasingly common, and mining such databases is a major challenge.
Keywords: Websites; Data mining; Weka; k-means; Dataset.

I. INTRODUCTION

The Web has experienced continuous growth since its creation. As of March 2002, the largest
search engine contained approximately 968 million indexed pages in its database. Finding the
right information in such a large collection is extremely difficult [3]. Information extraction
plays a vital role in today's life, and extracting relevant documents from the World Wide Web
efficiently and effectively is a challenging issue. Because today's search engines do just string
matching, the documents retrieved may not be very relevant to the user's query. By clustering
websites, those whose attribute values fall in a particular range are grouped together [6]. Data
such as title length, number of keywords in the title, URL length, and number of backlinks is
collected from the source code of various websites, and conclusions are derived from it. A
popular clustering technique is K-means, in which the data is partitioned into K clusters. In
this method, the groups are identified by a set of points called the cluster centers.

II. DOCUMENT CLUSTERING

Document clustering analysis plays an important role in document mining research. A widely
adopted definition of optimal clustering is a partitioning that minimizes distances within a
cluster and maximizes distances between clusters. In this approach, the clusters and, to a
limited degree, the relationships between clusters are derived automatically from the documents
to be clustered, and the documents are subsequently assigned to those clusters [1]. Users are
known to have difficulty dealing with information retrieval search outputs, especially when the
outputs exceed a certain size. Clustering can help them find the relevant documents more easily
and understand the different facets of the query results provided for their inspection. This
project aimed to investigate whether the websites ranked in the top 5 fall into one cluster and
the other sites into a second cluster. Clustering based on k-means is closely related to a number
of other clustering and location problems. These include the Euclidean k-medians problem, in
which the objective is to minimize the sum of distances to the nearest center, and the geometric
k-center problem, in which the objective is to minimize the maximum distance from every point
to its closest center.
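The difference between these objectives can be made concrete with a small sketch. The k-medians and k-center costs follow the definitions above; the k-means cost is taken here to be the usual sum of squared distances to the nearest center, which the text does not spell out. The function names and the sample points are illustrative assumptions, not part of the study.

```python
import math

def nearest_distance(point, centers):
    """Euclidean distance from a point to its closest center."""
    return min(math.dist(point, c) for c in centers)

def k_means_cost(points, centers):
    # Sum of squared distances to the nearest center (standard k-means objective).
    return sum(nearest_distance(p, centers) ** 2 for p in points)

def k_medians_cost(points, centers):
    # Sum of distances to the nearest center.
    return sum(nearest_distance(p, centers) for p in points)

def k_center_cost(points, centers):
    # Maximum distance from any point to its closest center.
    return max(nearest_distance(p, centers) for p in points)

# Made-up points and centers, for illustration only.
points = [(0.0, 0.0), (0.0, 3.0), (4.0, 0.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]

costs = (
    k_means_cost(points, centers),
    k_medians_cost(points, centers),
    k_center_cost(points, centers),
)
```

The same set of centers is scored differently by each objective, so the centers that are best for one problem need not be best for the others.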
A. K-means Algorithm
K-means clustering is a very popular algorithm for finding clusters in a dataset by iterative
computation. It is simple to implement and finds at least a locally optimal clustering. The
algorithm [2],[9] is composed of the following steps:

1. Initialize k cluster centers as seed points. (These centers can be produced randomly or
   generated in other ways.)
2. For each sample, find the nearest cluster center, put the sample in that cluster, and
   recompute the center of the altered cluster (repeat n times).
3. Examine all samples again and put each one in the cluster identified with the nearest center
   (do not recompute any cluster centers). If the members of each cluster have not changed,
   stop. If they have changed, go to step 2.
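The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in Weka: the seed points are drawn at random from the samples, distances are Euclidean, an empty cluster keeps its previous center, and the names `k_means`, `nearest`, `mean`, and `max_rounds` are introduced here only for the example.

```python
import math
import random

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def k_means(samples, k, seed=0, max_rounds=100):
    rng = random.Random(seed)
    # Step 1: pick k distinct samples as the initial seed centers (assumption).
    centers = list(rng.sample(samples, k))
    labels = [None] * len(samples)

    def recompute(j):
        members = [s for s, lab in zip(samples, labels) if lab == j]
        if members:  # an empty cluster keeps its old center
            centers[j] = mean(members)

    for _ in range(max_rounds):
        # Step 2: assign each sample in turn and update the altered clusters.
        for i, s in enumerate(samples):
            old = labels[i]
            labels[i] = nearest(s, centers)
            recompute(labels[i])
            if old is not None and old != labels[i]:
                recompute(old)
        # Step 3: re-examine all samples without recomputing any center.
        final = [nearest(s, centers) for s in samples]
        if final == labels:
            break  # memberships unchanged: stop
        labels = final  # memberships changed: go back to step 2
    return centers, labels

# Made-up two-dimensional samples forming two well-separated groups.
samples = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centers, labels = k_means(samples, k=2)
```

The `max_rounds` cap is a safety limit added for the sketch; the steps as described simply loop until the memberships stop changing.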

III. THE METHOD FOR FINDING MIXTURE STRUCTURE IN HETEROGENEOUS MULTIVARIATE DATA SET

A. Partly And Completely Overlapped Group Structures In Heterogeneous Multivariate Data Set
Partly and completely overlapped group structures make it difficult to determine the number
and structure of clusters in a multivariate data set. The cases of partly and completely
overlapped group structures in heterogeneous data are shown in Figure 01, which also shows the
mixture structure in the data.
The first case is partial group overlap: the group means are different but very close to each
other, and the group variances are the same, as shown in Figure 01(a). The second case is
complete group overlap: the group means are the same, and the group variances differ from each
other, as shown in Figure 01(b). In Figure 01, the horizontal axis denotes the values of the
independent variable, and the vertical axis denotes the corresponding values of the dependent
variable, which has a heterogeneous and therefore mixture structure. There is only one
independent variable, with two subgroups, in Figure 01.

B. Algorithm For Diagnosing Mixture Structure In Multivariate Heterogeneous Data Set And Refining Groups
Model selection methods based on information criteria are applied to diagnose the mixture
structure in a heterogeneous multivariate data set. The algorithm given in this section is a
new algorithm. The steps of the algorithm proposed for refining groups in data using dynamic
model-based clustering are given below:
1. Apply model-based clustering to the multivariate data set, assuming that the data have a
   mixture structure, that is, that the data come from a mixture of multivariate normal
   densities. This step shows whether the multivariate data are homogeneous (no mixture
   structure) or have a group structure (a mixture structure).
2. If the multivariate data have a mixture structure, find the groups in the data. Then test
   each group found for homogeneity; in other words, check each group for further mixture
   structure or subgroups.
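A simplified sketch of step 1 is given below. It makes several assumptions that go beyond the text: the data are univariate rather than multivariate, only one-component and two-component normal models are compared, the information criterion is BIC, and the mixture is fitted with a plain EM loop. All function names and the synthetic data are illustrative.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def bic(log_likelihood, n_params, n):
    """Bayesian information criterion; the model with the lower value is preferred."""
    return -2 * log_likelihood + n_params * math.log(n)

def bic_single(xs):
    """BIC of one normal density (2 parameters: mean and variance)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    loglik = sum(math.log(normal_pdf(x, mu, var)) for x in xs)
    return bic(loglik, 2, n)

def bic_mixture(xs, iters=100):
    """BIC of a two-component normal mixture fitted by EM, plus the two groups."""
    n = len(xs)
    ordered = sorted(xs)
    mu = [ordered[n // 4], ordered[3 * n // 4]]  # start from the quartiles
    overall = sum(xs) / n
    spread = sum((x - overall) ** 2 for x in xs) / n
    var, weight = [spread, spread], [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in xs:
            dens = [weight[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weight[j] = nj / n
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / nj, 1e-6)
    loglik = sum(
        math.log(sum(weight[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)))
        for x in xs
    )
    groups = [[], []]
    for x, r in zip(xs, resp):
        groups[0 if r[0] >= r[1] else 1].append(x)
    return bic(loglik, 5, n), groups  # 5 parameters: 2 means, 2 variances, 1 weight

# Synthetic data: two well-separated normal subgroups (illustration only).
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(10, 1) for _ in range(100)]

mixture_bic, groups = bic_mixture(data)
has_mixture = mixture_bic < bic_single(data)
```

For step 2, each list in `groups` could be passed through the same comparison again to check it for further subgroups.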

IV. ORGANIZATION OF DATA

It is important for a search engine to maintain high-quality websites, as this improves
optimization. We built a database with the following attributes: length of title, keywords in
title, domain length, number of backlinks, and top-rank website.
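As an illustration of how such a database could be laid out for Weka, the sketch below writes records to ARFF text. The attribute names, the nominal encoding of the top-rank attribute as {yes,no}, and the two data rows are assumptions made for this example; they are not the study's actual file or values.

```python
# Hypothetical attribute names and types for the five attributes listed above.
ATTRIBUTES = [
    ("title_length", "numeric"),
    ("keywords_in_title", "numeric"),
    ("domain_length", "numeric"),
    ("backlinks", "numeric"),
    ("top_rank", "{yes,no}"),  # nominal attribute (assumed encoding)
]

def to_arff(relation, attributes, rows):
    """Render rows as ARFF text: a header of attribute declarations, then the data."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} {kind}" for name, kind in attributes]
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"

# Invented placeholder rows, for illustration only.
rows = [
    (42, 4, 18, 1200, "yes"),
    (65, 2, 31, 150, "no"),
]

arff_text = to_arff("websites", ATTRIBUTES, rows)
```

The resulting text can be saved with an .arff extension and opened from the Preprocess tab.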
A. Working with Weka on Dataset
Open Weka and click the Explorer option on the right side. Then, under the Preprocess tab, open
the data file, which is in CSV or ARFF format [4],[5]. Choosing the Explorer option brings up
the screen shown in Fig. 1, which clearly indicates the open file option. We click the open
file option and choose the data set. Although Weka provides filters for preprocessing tasks,
they are not necessary for clustering in Weka, because the Weka SimpleKMeans algorithm
automatically handles a mixture of categorical and numerical attributes. The algorithm also
automatically normalizes numerical attributes when doing distance computations [7].
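The idea of a distance that handles mixed attributes and normalizes numerical ones can be sketched as follows. This is an illustration of the general technique under assumed conventions (min-max scaling of numerical attributes to the observed range, and a 0/1 mismatch for categorical attributes), not a copy of Weka's code; the rows are invented.

```python
import math

def attribute_ranges(rows):
    """Map each numerical column index to its (min, max) over the data."""
    ranges = {}
    for col in range(len(rows[0])):
        values = [row[col] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            ranges[col] = (min(values), max(values))
    return ranges

def mixed_distance(a, b, ranges):
    """Euclidean distance with range-normalized numerical attributes
    and a 0/1 mismatch for categorical attributes."""
    total = 0.0
    for col, (x, y) in enumerate(zip(a, b)):
        if col in ranges:
            low, high = ranges[col]
            span = high - low
            diff = (x - y) / span if span else 0.0
        else:
            diff = 0.0 if x == y else 1.0
        total += diff * diff
    return math.sqrt(total)

# Invented rows: (title length, backlinks, top rank).
rows = [(59, 6638, "no"), (36, 19163, "yes"), (40, 12000, "yes")]
ranges = attribute_ranges(rows)

far = mixed_distance(rows[0], rows[1], ranges)
near = mixed_distance(rows[1], rows[2], ranges)
```

Without the normalization, the backlinks column (in the thousands) would dominate the title-length column (in the tens) in every distance computation.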

Figure 01. (a) The case of partly overlapping: group means are different but very close to each
other. Group variances are the same. (b) The case of completely overlapping: group means are the
same. Group variances are different from each other.

This lists all the attributes present in the dataset. We can select any of the attributes we
want to include, or select all of them.


Fig. 3: Choose parameters

Fig. 1: Opening page
After this, click the Cluster tab, click the Choose button on the left side, and select the
clustering algorithm to apply. We select SimpleKMeans; the resulting screen is shown in
Fig. 2 [8].

Note that, in general, K-means is quite sensitive to how clusters are initially assigned. Thus,
it is often necessary to try different seed values and evaluate the results [10].
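One way to evaluate the results of several initializations is to run the clustering once per seed and keep the run with the lowest within-cluster sum of squared errors. The sketch below assumes the standard batch form of K-means (assign all points, then recompute all centers), Euclidean distance, and made-up points; the names `lloyd` and `best_of_seeds` are illustrative.

```python
import math
import random

def lloyd(points, k, seed, iters=50):
    """Batch K-means from a seeded random start; returns (SSE, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # Recompute each center as the mean of its members (keep it if empty).
        centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
    sse = sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
    return sse, centers

def best_of_seeds(points, k, seeds):
    """Run K-means once per seed and report the seed with the lowest SSE."""
    results = {seed: lloyd(points, k, seed) for seed in seeds}
    best_seed = min(results, key=lambda s: results[s][0])
    return best_seed, results

# Made-up points forming three groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
best_seed, results = best_of_seeds(points, k=3, seeds=range(5))
best_sse = results[best_seed][0]
```

Comparing the SSE values in `results` shows directly whether the outcome depends on the seed for a given data set.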
Once the options have been specified, we can run
the clustering algorithm. Here we make sure that in
the "Cluster Mode" panel, the "Use training set"
option is selected, and we click "Start". We can
right click the result set in the "Result list" panel
and view the results of clustering in a separate
window.
The result window shows the centroid of each
cluster as well as statistics on the number and
percentage of instances assigned to different
clusters. Cluster centroids are the mean vectors for
each cluster (so, each dimension value in the
centroid represents the mean value for that
dimension in the cluster). Thus, centroids can be
used to characterize the clusters.
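The statistics reported in the result window can be reproduced from a list of cluster assignments. The following sketch computes, for each cluster, the number and percentage of instances and the centroid as the per-dimension mean; the rows and labels are invented for illustration.

```python
from collections import defaultdict

def summarize(rows, labels):
    """Per-cluster instance count, percentage of all instances, and centroid."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    summary = {}
    for label, members in sorted(groups.items()):
        summary[label] = {
            "count": len(members),
            "percent": 100 * len(members) / len(rows),
            # Each centroid value is the mean of that dimension in the cluster.
            "centroid": tuple(sum(col) / len(members) for col in zip(*members)),
        }
    return summary

# Invented rows: (title length, keywords in title), with assumed cluster labels.
rows = [(60, 5), (58, 5), (36, 3), (35, 3), (37, 3)]
labels = [0, 0, 1, 1, 1]
summary = summarize(rows, labels)
```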

Fig. 2: Select algorithm
Next, click on the text box to the right of the
"Choose" button to get the pop-up window shown in
Fig 3, for editing the clustering parameter. In the
pop-up window we enter 2 as the number of clusters
and we leave the value of "seed" as is. The seed
value is used in generating a random number which
is, in turn, used for making the initial assignment of
instances to clusters.

The result shows that cluster 0 contains 13 websites, with centroid values of 59 characters for
length of title, 5 keywords in title, 22 characters for URL length, and 6638 backlinks. Cluster
1 contains 16 websites, with centroid values of 36 characters for length of title, 3 keywords
in title, 29 characters for URL length, and 19163 backlinks, as shown in Fig. 4.
Another way of understanding the characteristics of
each cluster is through visualization. We can do this
by right-clicking the result set on the left "Result
list" panel and selecting "Visualize cluster
assignments". This pops up the visualization
window as shown in Fig 5.
In this, we choose the cluster number and any of the
other attributes for each of the three different
dimensions available (x-axis, y-axis, and color).
Different combinations of choices will result in a
visual rendering of different relationships within
each cluster.


Fig. 4: Result of clustering

Fig. 5: Visual result


In the above example, we have chosen the cluster
number as the x-axis, the instance number (assigned
by Weka) as the y-axis, and the "length of title"
attribute as the color dimension. This will result in a
visualization of the distribution of length of title in
two clusters.
As more data is collected from websites, more detail can be obtained and further attributes can
be identified. With this method, we find that backlinks > 19000, length of title < 40, keywords
in title > 3, and domain length < 30 are good for search engine optimization.
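The thresholds stated above can be written as a simple check. The dictionary keys and the two example sites are hypothetical and serve only to show how the rule would be applied.

```python
def meets_thresholds(site):
    """True when a site satisfies all four thresholds reported in the text."""
    return (
        site["backlinks"] > 19000
        and site["title_length"] < 40
        and site["keywords_in_title"] > 3
        and site["domain_length"] < 30
    )

# Hypothetical sites, for illustration only.
site_a = {"backlinks": 25000, "title_length": 35, "keywords_in_title": 4, "domain_length": 20}
site_b = {"backlinks": 5000, "title_length": 60, "keywords_in_title": 5, "domain_length": 22}

results = [meets_thresholds(site_a), meets_thresholds(site_b)]
```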

REFERENCES
[1] Hongwei Yang, A Document Clustering Algorithm for Web Search Engine Retrieval System,
    School of Software, Yunnan University, Kunming 650021, China.
[2] S. Kantabutra, Efficient Representation of Cluster Structure in Large Data Sets, Ph.D.
    Thesis, Tufts University, Medford MA, September 2001.
[3] Wang Jun, OuYang Zheng-Zheng, “The Research of KMeans Clustering Algorithm Based on
    Association Rules”.
[4] http://maya.cs.depaul.edu/classes/ect584/weka/k-means.html
[5] http://www.cs.ccsu.edu/~markov/wekatutorial.pdf
[6] http://thesai.org/Downloads/Volume3No4/Paper_20 Knowledge_Discovery_in_Health_Care_Datasets Using_Data_Mining_Tools.pdf
[7] www.gtbit.org/downloads/dwdmsem6/dwdmsem6lman.pdf
[8] http://www.iasri.res.in/ebook/win_school_aa/notes/WEKA.pdf
[9] R. Kannan, S. Vempala, and Adrian Vetta, “On Clusterings: Good, Bad, and Spectral”,
    Proc. of the 41st Foundations of Computer Science, Redondo Beach, 2000.
[10] http://www.bvicam.ac.in/news/INDIACom%202010%20Proceedingspapers/Group3/INDIACom10_388_Paper.pdf

www.ijera.com

2036 | P a g e

More Related Content

What's hot

G0354451
G0354451G0354451
G0354451
iosrjournals
 
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATACOMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA
cscpconf
 
Combined mining approach to generate patterns for complex data
Combined mining approach to generate patterns for complex dataCombined mining approach to generate patterns for complex data
Combined mining approach to generate patterns for complex data
csandit
 
A03202001005
A03202001005A03202001005
A03202001005
theijes
 
GCUBE INDEXING
GCUBE INDEXINGGCUBE INDEXING
GCUBE INDEXING
IJDKP
 
data generalization and summarization
data generalization and summarization data generalization and summarization
data generalization and summarization
janani thirupathi
 
B colouring
B colouringB colouring
B colouring
xs76250
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for proper
IJDKP
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environment
IJDKP
 
Assessment of Cluster Tree Analysis based on Data Linkages
Assessment of Cluster Tree Analysis based on Data LinkagesAssessment of Cluster Tree Analysis based on Data Linkages
Assessment of Cluster Tree Analysis based on Data Linkages
journal ijrtem
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
IRJET Journal
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
s v
 
F04463437
F04463437F04463437
F04463437
IOSR-JEN
 
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITYSOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
IJDKP
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
IOSR Journals
 
Classification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithmClassification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithm
eSAT Publishing House
 
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MININGPATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
IJDKP
 
Improved Slicing Algorithm For Greater Utility In Privacy Preserving Data Pub...
Improved Slicing Algorithm For Greater Utility In Privacy Preserving Data Pub...Improved Slicing Algorithm For Greater Utility In Privacy Preserving Data Pub...
Improved Slicing Algorithm For Greater Utility In Privacy Preserving Data Pub...
Waqas Tariq
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
DataminingTools Inc
 
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
IOSR Journals
 

What's hot (20)

G0354451
G0354451G0354451
G0354451
 
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATACOMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA
 
Combined mining approach to generate patterns for complex data
Combined mining approach to generate patterns for complex dataCombined mining approach to generate patterns for complex data
Combined mining approach to generate patterns for complex data
 
A03202001005
A03202001005A03202001005
A03202001005
 
GCUBE INDEXING
GCUBE INDEXINGGCUBE INDEXING
GCUBE INDEXING
 
data generalization and summarization
data generalization and summarization data generalization and summarization
data generalization and summarization
 
B colouring
B colouringB colouring
B colouring
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for proper
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environment
 
Assessment of Cluster Tree Analysis based on Data Linkages
Assessment of Cluster Tree Analysis based on Data LinkagesAssessment of Cluster Tree Analysis based on Data Linkages
Assessment of Cluster Tree Analysis based on Data Linkages
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
F04463437
F04463437F04463437
F04463437
 
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITYSOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
 
Classification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithmClassification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithm
 
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MININGPATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
 
Improved Slicing Algorithm For Greater Utility In Privacy Preserving Data Pub...
Improved Slicing Algorithm For Greater Utility In Privacy Preserving Data Pub...Improved Slicing Algorithm For Greater Utility In Privacy Preserving Data Pub...
Improved Slicing Algorithm For Greater Utility In Privacy Preserving Data Pub...
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
 

Viewers also liked

Marketing Strategy cert
Marketing Strategy certMarketing Strategy cert
Marketing Strategy certTwila Arias
 
Tipos de software
Tipos de softwareTipos de software
Tipos de software
Hafid Oviedo Alvarez
 
April 2012 - Made in Brazil
April 2012 - Made in BrazilApril 2012 - Made in Brazil
April 2012 - Made in Brazil
FGV Brazil
 
Colciencias y los recursos de inversión power point
Colciencias y los recursos de inversión power pointColciencias y los recursos de inversión power point
Colciencias y los recursos de inversión power point
joel150211
 
Portafolios digitales y gestión del conocimiento (resumen del curso)
Portafolios digitales y gestión del conocimiento (resumen del curso)Portafolios digitales y gestión del conocimiento (resumen del curso)
Portafolios digitales y gestión del conocimiento (resumen del curso)
Gabriela Morales
 
Computer Networking meets Social Psychology
Computer Networking meets Social PsychologyComputer Networking meets Social Psychology
Computer Networking meets Social Psychology
Waldir Moreira
 
The pit and the pendulum (1)
The pit and the pendulum (1)The pit and the pendulum (1)
The pit and the pendulum (1)
Amelia Payne
 
“暗算の達人” で数字で話せる人を目指す
“暗算の達人”で数字で話せる人を目指す“暗算の達人”で数字で話せる人を目指す
“暗算の達人” で数字で話せる人を目指す
bijikin
 
PA.3/4 แบบข้อเสนอในการพัฒนางานตามหน้าที่และความรับผิดชอบ (สายงานนิเทศการศึกษา)
PA.3/4 แบบข้อเสนอในการพัฒนางานตามหน้าที่และความรับผิดชอบ (สายงานนิเทศการศึกษา)PA.3/4 แบบข้อเสนอในการพัฒนางานตามหน้าที่และความรับผิดชอบ (สายงานนิเทศการศึกษา)
PA.3/4 แบบข้อเสนอในการพัฒนางานตามหน้าที่และความรับผิดชอบ (สายงานนิเทศการศึกษา)
Teacher Sophonnawit
 
Andalucia Nuestra Tierra 1
Andalucia Nuestra Tierra 1Andalucia Nuestra Tierra 1
Andalucia Nuestra Tierra 1
guest6bd45fe
 
TEANO - PREVISIÓ DE PROBLEMES
TEANO - PREVISIÓ DE PROBLEMESTEANO - PREVISIÓ DE PROBLEMES
TEANO - PREVISIÓ DE PROBLEMESmceide
 
Presentación1
Presentación1Presentación1
Presentación1
Fredy003
 
Whpmt1 me0059 d2
Whpmt1 me0059 d2Whpmt1 me0059 d2
Whpmt1 me0059 d2Tuan Pd
 
Cópia de f 07.78 u 06 perfil usitep[1]
Cópia de f 07.78 u 06 perfil usitep[1]Cópia de f 07.78 u 06 perfil usitep[1]
Cópia de f 07.78 u 06 perfil usitep[1]
viny_pt
 
Valores
ValoresValores
Valores
guest9f428c
 
A que cheiram as cores
A que cheiram as coresA que cheiram as cores
A que cheiram as cores
ano1ar
 
#ebc #negocios #FinaciacionDeVentas
#ebc #negocios #FinaciacionDeVentas#ebc #negocios #FinaciacionDeVentas
#ebc #negocios #FinaciacionDeVentas
Alma Karime
 
Cadastro maebraz ver set 2010 - comercial
Cadastro maebraz ver set 2010 - comercialCadastro maebraz ver set 2010 - comercial
Cadastro maebraz ver set 2010 - comercial
viny_pt
 
Power Point
Power PointPower Point
Power Point
mottorsait
 

Viewers also liked (20)

Тренаж по орфоэпии
Тренаж по орфоэпииТренаж по орфоэпии
Тренаж по орфоэпии
 
Marketing Strategy cert
Marketing Strategy certMarketing Strategy cert
Marketing Strategy cert
 
Tipos de software
Tipos de softwareTipos de software
Tipos de software
 
April 2012 - Made in Brazil
April 2012 - Made in BrazilApril 2012 - Made in Brazil
April 2012 - Made in Brazil
 
Colciencias y los recursos de inversión power point
Colciencias y los recursos de inversión power pointColciencias y los recursos de inversión power point
Colciencias y los recursos de inversión power point
 
Portafolios digitales y gestión del conocimiento (resumen del curso)
Portafolios digitales y gestión del conocimiento (resumen del curso)Portafolios digitales y gestión del conocimiento (resumen del curso)
Portafolios digitales y gestión del conocimiento (resumen del curso)
 
Computer Networking meets Social Psychology
Computer Networking meets Social PsychologyComputer Networking meets Social Psychology
Computer Networking meets Social Psychology
 
The pit and the pendulum (1)
The pit and the pendulum (1)The pit and the pendulum (1)
The pit and the pendulum (1)
 
“暗算の達人” で数字で話せる人を目指す
“暗算の達人”で数字で話せる人を目指す“暗算の達人”で数字で話せる人を目指す
“暗算の達人” で数字で話せる人を目指す
 
PA.3/4 แบบข้อเสนอในการพัฒนางานตามหน้าที่และความรับผิดชอบ (สายงานนิเทศการศึกษา)
PA.3/4 แบบข้อเสนอในการพัฒนางานตามหน้าที่และความรับผิดชอบ (สายงานนิเทศการศึกษา)PA.3/4 แบบข้อเสนอในการพัฒนางานตามหน้าที่และความรับผิดชอบ (สายงานนิเทศการศึกษา)
PA.3/4 แบบข้อเสนอในการพัฒนางานตามหน้าที่และความรับผิดชอบ (สายงานนิเทศการศึกษา)
 
Andalucia Nuestra Tierra 1
Andalucia Nuestra Tierra 1Andalucia Nuestra Tierra 1
Andalucia Nuestra Tierra 1
 
TEANO - PREVISIÓ DE PROBLEMES
TEANO - PREVISIÓ DE PROBLEMESTEANO - PREVISIÓ DE PROBLEMES
TEANO - PREVISIÓ DE PROBLEMES
 
Presentación1
Presentación1Presentación1
Presentación1
 
Whpmt1 me0059 d2
Whpmt1 me0059 d2Whpmt1 me0059 d2
Whpmt1 me0059 d2
 
Cópia de f 07.78 u 06 perfil usitep[1]
Cópia de f 07.78 u 06 perfil usitep[1]Cópia de f 07.78 u 06 perfil usitep[1]
Cópia de f 07.78 u 06 perfil usitep[1]
 
Valores
ValoresValores
Valores
 
A que cheiram as cores
A que cheiram as coresA que cheiram as cores
A que cheiram as cores
 
#ebc #negocios #FinaciacionDeVentas
#ebc #negocios #FinaciacionDeVentas#ebc #negocios #FinaciacionDeVentas
#ebc #negocios #FinaciacionDeVentas
 
Cadastro maebraz ver set 2010 - comercial
Cadastro maebraz ver set 2010 - comercialCadastro maebraz ver set 2010 - comercial
Cadastro maebraz ver set 2010 - comercial
 
Power Point
Power PointPower Point
Power Point
 

Similar to Lx3520322036

Web Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering AnalysisWeb Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering Analysis
inventy
 
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...
IJECEIAES
 
Introduction to Multi-Objective Clustering Ensemble
Introduction to Multi-Objective Clustering EnsembleIntroduction to Multi-Objective Clustering Ensemble
Introduction to Multi-Objective Clustering Ensemble
IJSRD
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
PRAWEEN KUMAR
 
Ijmet 10 01_141
Ijmet 10 01_141Ijmet 10 01_141
Ijmet 10 01_141
IAEME Publication
 
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
IJECEIAES
 
A Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means AlgorithmA Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means Algorithm
IRJET Journal
 
Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
DrGnaneswariG
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET Journal
 
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
IJCSIS Research Publications
 
Cg33504508
Cg33504508Cg33504508
Cg33504508
IJERA Editor
 
A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce
IJECEIAES
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
IJDKP
 
Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2
IAEME Publication
 
Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...
IRJET Journal
 
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace DataMPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
IRJET Journal
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET Journal
 
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data Mining
Natasha Grant
 
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
ijtsrd
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
IOSR Journals
 

Lx3520322036

M. K. Kond Reddy et al, Int. Journal of Engineering Research and Application, ISSN: 2248-9622, Vol. 3, Issue 5, Sep-Oct 2013, pp. 2032-2036, www.ijera.com
RESEARCH ARTICLE, OPEN ACCESS

Data Mining Tool using Clustering Technique on Exploration Engine Dataset
Mahesh Kumar KondReddy, Sujeeth .T
Dept. of Computer Science & Engineering, Sri Venkateswara University, Tirupati, Andhra Pradesh, India

ABSTRACT
Retrieving good websites from large collections of websites is a major issue. As the number of available Web pages grows, it becomes more difficult for users to find documents relevant to their interests. Clustering is the classification of a data set into subsets (clusters), so that the data in each subset share some common trait, often proximity according to some defined distance measure. By clustering, we aim to improve the quality of retrieved websites by grouping similar websites together. This paper addresses the application of the data mining tool Weka, applying k-means clustering to find clusters in huge data sets and to find the attributes that govern search engine optimization. Unlabeled document collections are becoming increasingly common, and mining such databases is becoming a major challenge.
Keywords—Websites; Data mining; Weka; k-means; Dataset.

I. INTRODUCTION
The Web has experienced continuous growth since its creation. As of March 2002, the largest search engine contained approximately 968 million indexed pages in its database. Finding the right information in such a large collection is extremely difficult [3]. Information extraction plays a vital role in today's life. How efficiently and effectively relevant documents can be extracted from the World Wide Web is a challenging issue. Because today's search engines perform only string matching, the documents retrieved may not be relevant to the user's query. Clustering groups together websites whose attribute values fall within a particular range [6].
Data is collected from the source code of various websites, including title length, number of keywords in the title, URL length, and number of backlinks, and conclusions are derived from these attributes. A popular technique for clustering is based on K-means, in which the data is partitioned into K clusters. In this method, the groups are identified by a set of points called the cluster centers.

II. DOCUMENT CLUSTERING
Document clustering analysis plays an important role in document mining research. A widely adopted definition of optimal clustering is a partitioning that minimizes distances within a cluster and maximizes distances between clusters. In this approach the clusters and, to a limited degree, the relationships between clusters are derived automatically from the documents to be clustered, and the documents are subsequently assigned to those clusters [1]. Users are known to have difficulties in dealing with information retrieval search outputs, especially if the outputs are above a certain size. Clustering can enable them to find the relevant documents more easily and also help them form an understanding of the different facets of the query that have been provided for their inspection. This project investigates a grouping in which the websites ranked in the top 5 form one cluster and the remaining sites form a second cluster. Clustering based on k-means is closely related to a number of other clustering and location problems. These include the Euclidean k-medians problem, in which the objective is to minimize the sum of distances to the nearest center, and the geometric k-center problem, in which the objective is to minimize the maximum distance from every point to its closest center.

A. K-means Algorithm
K-means clustering is a very popular algorithm for finding a clustering in a dataset by iterative computation. Its advantages are that it is simple to implement and that it finds at least a locally optimal clustering. In this work, the K-means algorithm is employed to find the clustering in the dataset.
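For reference, the k-means objective and the two related objectives named in Section II can be written side by side. The notation is an assumption introduced here for illustration and does not appear in the paper: $x_1, \dots, x_n$ are the data points, $c_1, \dots, c_k$ are the cluster centers, and $\lVert \cdot \rVert$ is the Euclidean norm. The k-means form is the standard sum-of-squared-distances objective.

```latex
\begin{align*}
\text{k-means:}   \quad & \min_{c_1,\dots,c_k} \; \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^{2} \\
\text{k-medians:} \quad & \min_{c_1,\dots,c_k} \; \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - c_j \rVert \\
\text{k-center:}  \quad & \min_{c_1,\dots,c_k} \; \max_{1 \le i \le n} \, \min_{1 \le j \le k} \lVert x_i - c_j \rVert
\end{align*}
```

The three differ only in how the per-point distance to the nearest center is aggregated: a sum of squares, a plain sum, or a maximum.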
The algorithm [2],[9] is composed of the following steps:
1. Initialize k cluster centers to be seed points. (These centers can be produced randomly or generated in other ways.)
2. For each sample, find the nearest cluster center, put the sample in this cluster, and recompute the center of the altered cluster (repeat n times).
3. Examine all samples again and put each one in the cluster identified with the nearest center (do not recompute any cluster centers). If the members of each cluster have not changed, stop. If they have changed, go to step 2.

III. THE METHOD FOR FINDING MIXTURE STRUCTURE IN HETEROGENEOUS MULTIVARIATE DATA SET
A. Partly and Completely Overlapped Group Structures in Heterogeneous Multivariate Data Set
Partly and completely overlapped group structures make it difficult to determine the number and structure of clusters in a multivariate data set. The cases of partly and completely overlapped group structures in heterogeneous data are shown in Figure 01, which also shows the mixture structure in the data. The first case is partial group overlap: the group means are different but very close to each other, and the group variances are the same, as shown in Figure 01(a). The second case is complete group overlap: the group means are the same, and the group variances are different from each other, as shown in Figure 01(b). In Figure 01, the horizontal axis denotes the values of the independent variable, and the vertical axis denotes the values of the dependent variable with respect to the values of the independent variable, which has a heterogeneous and therefore mixture structure. There is only one independent variable, with two subgroups, in Figure 01.

Figure 01. (a) The case of partly overlapping: group means are different but very close to each other. Group variances are the same. (b) The case of completely overlapping: group means are the same. Group variances are different from each other.

B. Algorithm for Diagnosing Mixture Structure in Multivariate Heterogeneous Data Set and Refining Groups
Model selection methods based on information criteria are applied for diagnosing the mixture structure in a heterogeneous multivariate data set. The algorithm given in this section is a new algorithm. The steps of the algorithm proposed for refining groups in data using dynamic model-based clustering are given below:
1. Apply model-based clustering to the multivariate data set, assuming that the data have a mixture structure, that is, that the data come from a mixture of multivariate normal densities. This step shows either that the multivariate data are homogeneous, so there is no mixture structure, or that the data have a group structure, so there is a mixture structure.
2. If the multivariate data have a mixture structure, that is, if they contain groups, then find these groups. Test each group found for homogeneity; in other words, check each group found for further mixture structure or subgroups.

IV. ORGANIZATION OF DATA
It is important for a search engine to maintain high-quality websites, as this improves optimization. We built a database with the following attributes: length of title, keywords in title, domain length, number of backlinks, and top-rank website.

A. Working with Weka on Dataset
Open Weka, click the Explorer option on the right side, and then open the data file, which is in CSV or ARFF format, under the Preprocess option [4],[5]. Once the Explorer option is chosen, the screen appears as shown in Fig. 1, which clearly indicates the Open file option. We then click Open file and choose the data set. Weka provides filters to accomplish preprocessing tasks, but they are not necessary for clustering in Weka, because the Weka SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes. The algorithm also automatically normalizes numerical attributes when doing distance computations [7]. Opening the file lists all attributes that are present in the dataset. We can select any that we want to include, or select all.
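To illustrate why normalization matters for the attributes used here, the sketch below computes a Euclidean distance after rescaling each attribute to the range [0, 1]. It is only a minimal illustration of min-max normalization under stated assumptions, not Weka's actual code: the function names `min_max_ranges` and `normalized_distance` and the three example rows are hypothetical and do not come from the paper's dataset.

```python
from math import sqrt


def min_max_ranges(rows):
    """Return the (min, max) pair of every attribute (column)."""
    return [(min(col), max(col)) for col in zip(*rows)]


def normalized_distance(a, b, ranges):
    """Euclidean distance after min-max scaling each attribute to [0, 1]."""
    total = 0.0
    for x, y, (lo, hi) in zip(a, b, ranges):
        span = hi - lo
        if span == 0:
            continue  # a constant attribute cannot separate instances
        total += ((x - lo) / span - (y - lo) / span) ** 2
    return sqrt(total)


# Hypothetical rows, NOT the paper's data:
# (length of title, keywords in title, domain length, number of backlinks)
sites = [
    (60, 5, 20, 5000),
    (35, 3, 30, 20000),
    (40, 4, 25, 12500),
]

ranges = min_max_ranges(sites)
d01 = normalized_distance(sites[0], sites[1], ranges)
d02 = normalized_distance(sites[0], sites[2], ranges)
print(d01, d02)
```

Without the rescaling, the backlink counts (in the thousands) would dominate the distance and the title and domain attributes would have almost no influence on cluster membership.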
Fig. 1: Opening page

After this, click on the Cluster tab, click the "Choose" button on the left side, and select the clustering algorithm to apply. We select SimpleKMeans; the resulting screen is shown in Fig. 2 [8].

Fig. 2: Select algorithm

Next, click on the text box to the right of the "Choose" button to get the pop-up window shown in Fig. 3 for editing the clustering parameters. In the pop-up window we enter 2 as the number of clusters and leave the value of "seed" as is. The seed value is used in generating a random number which is, in turn, used for making the initial assignment of instances to clusters. Note that, in general, K-means is quite sensitive to how clusters are initially assigned. Thus, it is often necessary to try different values and evaluate the results [10].

Fig. 3: Choose parameters

Once the options have been specified, we can run the clustering algorithm. Here we make sure that in the "Cluster Mode" panel the "Use training set" option is selected, and we click "Start". We can right-click the result set in the "Result list" panel and view the results of clustering in a separate window. The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to different clusters. Cluster centroids are the mean vectors for each cluster (so each dimension value in the centroid represents the mean value for that dimension in the cluster). Thus, centroids can be used to characterize the clusters.
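The roles of the number of clusters, the seed, and the centroids can be made concrete with a short sketch. This is the standard batch (Lloyd-style) k-means iteration written in plain Python, offered as an illustration only; it is not Weka's SimpleKMeans implementation. The function name `k_means`, the seed value, and the two-dimensional sample points are all hypothetical.

```python
import random
from math import dist


def k_means(points, k, seed=10, max_iter=100):
    """Batch k-means: returns (centers, assignment)."""
    rng = random.Random(seed)        # the seed fixes the initial choice
    centers = rng.sample(points, k)  # k distinct points as initial centers
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Assign every point to the index of its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no membership changed, so the procedure has converged
        assignment = new_assignment
        # Recompute each center as the mean vector of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


# Hypothetical 2-D points, NOT the paper's data.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = k_means(points, k=2, seed=10)
for j, center in enumerate(centers):
    print(j, center, assignment.count(j))
```

Running the function with several seed values and comparing the resulting centers is one simple way to check how sensitive a result is to the initial assignment.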
The result shows that cluster 0 contains 13 websites with a title length of 59 characters, 5 keywords in the title, a URL length of 22 characters, and 6638 backlinks, while cluster 1 contains 16 websites with a title length of 36 characters, 3 keywords in the title, a URL length of 29 characters, and 19163 backlinks, as shown in Fig. 4. Another way of understanding the characteristics of each cluster is through visualization. We can do this by right-clicking the result set in the "Result list" panel on the left and selecting "Visualize cluster assignments". This pops up the visualization window shown in Fig. 5. In this window, we choose the cluster number and any of the other attributes for each of the three available dimensions (x-axis, y-axis, and color). Different combinations of choices result in a visual rendering of different relationships within each cluster.
Fig. 4: Result of clustering
Fig. 5: Visual result
In the above example, we have chosen the cluster number as the x-axis, the instance number (assigned by Weka) as the y-axis, and the "length of title" attribute as the color dimension. This results in a visualization of the distribution of the length of title in the two clusters. As more data is collected from websites, we can obtain more detail and identify further attributes. With this method, we find that backlinks > 19000, length of title < 40, keywords in title > 3, and domain length < 30 are good for search engine optimization.

REFERENCES
[1] Hongwei Yang, A Document Clustering Algorithm for Web Search Engine Retrieval System, School of Software, Yunnan University, Kunming 650021, China.
[2] S. Kantabutra, Efficient Representation of Cluster Structure in Large Data Sets, Ph.D. Thesis, Tufts University, Medford MA, September 2001.
[3] Wang Jun, OuYang Zheng-Zheng, "The Research of K-Means Clustering Algorithm Based on Association Rules".
[4] http://maya.cs.depaul.edu/classes/ect584/weka/k-means.html
[5] http://www.cs.ccsu.edu/~markov/wekatutorial.pdf
[6] http://thesai.org/Downloads/Volume3No4/Paper_20 Knowledge_Discovery_in_Health_Care_Datasets Using_Data_Mining_Tools.pdf
[7] www.gtbit.org/downloads/dwdmsem6/dwdmsem6lman.pdf
[8] http://www.iasri.res.in/ebook/win_school_aa/notes/WEKA.pdf
[9] R. Kannan, S. Vempala, and Adrian Vetta, "On Clusterings: Good, Bad, and Spectral", Proc. of the 41st Foundations of Computer Science, Redondo Beach, 2000.
[10] http://www.bvicam.ac.in/news/INDIACom%202010%20Proceedingspapers/Group3/INDIACom10_388_Paper.pdf