2. Introduction To Clustering
Unsupervised learning is used when historical data with class
labels is not available, e.g., when introducing a new
product.
Group/cluster existing customers based on time
series of payment history, such that similar
customers fall in the same cluster.
Key requirement: a good measure of similarity
between instances.
Identify micro-markets and develop policies for
each
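A common choice for the similarity measure on numeric payment-history vectors is Euclidean distance (smaller distance means more similar customers). The sketch below is illustrative only; the class and data are assumptions, not the tool's actual code:

```java
// Illustrative sketch: Euclidean distance between two numeric
// feature vectors (e.g., monthly payment histories). A smaller
// distance means more similar customers.
public class Distance {
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] c1 = {100, 120, 110};   // customer 1 payments
        double[] c2 = {100, 125, 110};   // customer 2 payments
        double[] c3 = {10, 15, 5};       // customer 3 payments
        System.out.println(euclidean(c1, c2)); // close customers
        System.out.println(euclidean(c1, c3)); // distant customers
    }
}
```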
3. About The Project
The aim of this project is to devise a new clustering
algorithm for data mining.
The main functionalities implemented in the system
are preprocessing and clustering.
In the preprocessing step, an input .xls file is chosen. Any null
values present in the input file are removed in order to avoid
faulty results in the output data sets, and redundant or
duplicate data sets of the attributes are removed as well.
In the clustering step, the data is partitioned into groups so that
the degree of association is strong between members of the
same cluster and weak between members of different
clusters.
4. Present Tool: Weka
Weka (Waikato Environment for Knowledge Analysis) is a popular
suite of machine learning software written in Java, developed at
the University of Waikato, New Zealand.
The Explorer interface features several panels providing access to the
main components of the workbench:
The Preprocess panel has facilities for importing data from a database,
a CSV file, etc., and for preprocessing this data using a so-called
filtering algorithm. These filters can be used to transform the data (e.g.,
turning numeric attributes into discrete ones) and make it possible to
delete instances and attributes according to specific criteria.
The Cluster panel gives access to the clustering techniques in Weka,
e.g., the simple k-means algorithm. There is also an implementation of
the expectation maximization algorithm for learning a mixture
of normal distributions.
5. Our Tool
In the data preprocessing phase, an MS Excel file is taken as input
rather than a CSV or ARFF file. This was done because Excel files are
well known and comfortably handled by non-technical people as well,
whereas CSV and ARFF files require the user to be well versed in
those formats. Excel support was added by importing a new library,
the ‘jxl.jar’ library, into the project.
The input file(s) for data mining are first cleaned by removing null
data sets, i.e., data sets that contain no information or fewer values
of the required attributes than a threshold. The number of null data
sets removed is reported to the user of the system. Next, redundant
or duplicate data sets, i.e., data sets whose attribute values are all
identical to those of some other data set, are removed and excluded
from the further data mining process. The number of these redundant
or duplicate data sets is also reported to the user.
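In the tool, rows are read from the .xls file via jxl.jar; the cleaning itself can be sketched on plain string rows as below. The row layout, threshold, and class name are illustrative assumptions, not the tool's actual code:

```java
import java.util.*;

// Illustrative sketch of the cleaning step: drop "null" rows that have
// fewer non-empty attribute values than a threshold, then drop exact
// duplicates, reporting both counts as the tool does.
public class Cleaner {
    public static List<String[]> clean(List<String[]> rows, int minValues) {
        int nullCount = 0, dupCount = 0;
        Set<String> seen = new LinkedHashSet<>();
        List<String[]> result = new ArrayList<>();
        for (String[] row : rows) {
            int filled = 0;
            for (String cell : row)
                if (cell != null && !cell.trim().isEmpty()) filled++;
            if (filled < minValues) { nullCount++; continue; }  // null data set
            String key = String.join("\u0001", row);
            if (!seen.add(key)) { dupCount++; continue; }       // duplicate
            result.add(row);
        }
        // Both counts are reported to the user, as in the tool.
        System.out.println("null data sets removed: " + nullCount);
        System.out.println("duplicates removed: " + dupCount);
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"sunny", "85", "no"},
            new String[]{"sunny", "85", "no"},   // duplicate
            new String[]{"", "", ""},            // null data set
            new String[]{"rainy", "70", "yes"});
        System.out.println("kept: " + clean(rows, 2).size()); // kept: 2
    }
}
```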
6. KD Trees
K Dimensional Trees
Space Partitioning Data Structure
Splitting planes perpendicular to
Coordinate Axes
Reduces the expected cost of a nearest-neighbour
query to O(log n)
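The idea can be sketched for 2-D points: build by median split on alternating axes, then answer nearest-neighbour queries by descending the tree and backtracking only when the splitting plane could hide a closer point. A minimal sketch, with illustrative names and data, not the tool's actual code:

```java
import java.util.*;

// Minimal 2-D kd-tree sketch: axis-aligned median splits, plus a
// nearest-neighbour search that prunes the far subtree whenever the
// splitting plane cannot hide a closer point. Expected query cost is
// O(log n) for well-distributed data.
public class KdTree {
    static class Node { double[] p; Node left, right; int axis; }

    Node root;

    KdTree(double[][] points) { root = build(points, 0, points.length, 0); }

    private Node build(double[][] pts, int lo, int hi, int axis) {
        if (lo >= hi) return null;
        Arrays.sort(pts, lo, hi, Comparator.comparingDouble((double[] p) -> p[axis]));
        int mid = (lo + hi) / 2;                // median along current axis
        Node n = new Node();
        n.p = pts[mid]; n.axis = axis;
        n.left = build(pts, lo, mid, 1 - axis); // alternate axes (2-D)
        n.right = build(pts, mid + 1, hi, 1 - axis);
        return n;
    }

    double[] nearest(double[] q) { return nearest(root, q, null); }

    private double[] nearest(Node n, double[] q, double[] best) {
        if (n == null) return best;
        if (best == null || dist2(q, n.p) < dist2(q, best)) best = n.p;
        double diff = q[n.axis] - n.p[n.axis];
        Node near = diff < 0 ? n.left : n.right;
        Node far  = diff < 0 ? n.right : n.left;
        best = nearest(near, q, best);
        if (diff * diff < dist2(q, best))   // plane may hide a closer point
            best = nearest(far, q, best);
        return best;
    }

    static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        double[][] pts = {{2,3},{5,4},{9,6},{4,7},{8,1},{7,2}};
        KdTree t = new KdTree(pts);
        double[] nn = t.nearest(new double[]{9, 2});
        System.out.println(nn[0] + "," + nn[1]); // prints 8.0,1.0
    }
}
```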
7. Clustering
Our clustering algorithm uses a KD-tree extensively to
improve its time complexity.
Our algorithm differs from the existing approach in how
the nearest centers are computed.
Efficiency is achieved because the data points do not
vary throughout the computation and, hence, this data
structure does not need to be recomputed at each
stage.
8. K-means Clustering
Complexity is O(n * K * I * d)
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
9. K means
K-Means is a commonly used clustering technique. In
this analysis the user starts with a collection of samples and attempts to
group them into ‘k’ clusters based on a specific
distance measure. The prominent steps of the K-Means
clustering algorithm are given below.
1. The algorithm is initiated by creating ‘k’ different clusters. The given
sample set is first randomly distributed among these ‘k’ different
clusters.
2. Next, the distance from each sample within a given
cluster to its respective cluster centroid is
calculated.
3. Each sample is then moved to the cluster (k′) whose
centroid is at the shortest distance from that sample.
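The three steps above can be sketched on 1-D data as follows. For reproducibility this sketch distributes samples round-robin instead of randomly in step 1; the class name and data are illustrative assumptions, not the tool's actual code:

```java
import java.util.*;

// Sketch of the K-Means steps above on 1-D data: (1) distribute the
// samples over k clusters (round-robin here, for a reproducible run;
// the slides use a random distribution), (2) compute each cluster's
// centroid, (3) move every sample to the cluster with the nearest
// centroid; repeat steps 2-3 until no sample moves.
public class KMeans {
    public static int[] cluster(double[] x, int k) {
        int[] assign = new int[x.length];
        for (int i = 0; i < x.length; i++) assign[i] = i % k;   // step 1
        boolean moved = true;
        while (moved) {
            double[] sum = new double[k]; int[] cnt = new int[k];
            for (int i = 0; i < x.length; i++) { sum[assign[i]] += x[i]; cnt[assign[i]]++; }
            double[] centroid = new double[k];                  // step 2
            for (int c = 0; c < k; c++)
                centroid[c] = cnt[c] == 0 ? Double.POSITIVE_INFINITY : sum[c] / cnt[c];
            moved = false;
            for (int i = 0; i < x.length; i++) {                // step 3
                int best = assign[i];
                for (int c = 0; c < k; c++)
                    if (Math.abs(x[i] - centroid[c]) < Math.abs(x[i] - centroid[best])) best = c;
                if (best != assign[i]) { assign[i] = best; moved = true; }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 1.2, 0.8, 10.0, 10.3, 9.7};
        System.out.println(Arrays.toString(cluster(x, 2))); // prints [0, 0, 0, 1, 1, 1]
    }
}
```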
10. As a first step of the cluster analysis, the user decides
on the number of clusters ‘k’. This parameter takes
integer values with a lower bound of 1
(in practice, 2 is the smallest relevant number of
clusters) and an upper bound equal to the total
number of samples.
The K-Means algorithm is repeated a number of times,
each time starting with a random set of initial clusters,
to obtain a good clustering solution.
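Across such restarts, the run kept is typically the one with the lowest within-cluster sum of squared errors (SSE). A minimal sketch of that scoring, with illustrative names and data (this is an assumption about the selection criterion, which the slides do not spell out):

```java
// Sketch: score candidate clusterings by within-cluster sum of squared
// errors (SSE); across random restarts, the clustering with the lowest
// SSE is kept.
public class Restarts {
    static double sse(double[] x, int[] assign, int k) {
        double[] sum = new double[k]; int[] cnt = new int[k];
        for (int i = 0; i < x.length; i++) { sum[assign[i]] += x[i]; cnt[assign[i]]++; }
        double total = 0;
        for (int i = 0; i < x.length; i++) {
            double c = sum[assign[i]] / cnt[assign[i]];  // cluster centroid
            total += (x[i] - c) * (x[i] - c);
        }
        return total;
    }

    public static void main(String[] args) {
        double[] x = {1, 1, 10, 10};
        int[] good = {0, 0, 1, 1};   // separates the two groups
        int[] bad  = {0, 1, 0, 1};   // mixes them
        System.out.println(sse(x, good, 2) < sse(x, bad, 2)); // prints true
    }
}
```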
11. COMPARISON OF OUR TOOL WITH WEKA
A data set with the following statistics was run on
both WEKA and our tool:
Relation = weather
No. of attributes = 3
No. of instances (including redundant/duplicate and
null instances) = 17
12.-17. (Result screenshots comparing WEKA and our tool; images not preserved in this transcript.)
18. Limitations
This tool does not provide protection from:
Shared storage failures.
Network service failures.
Operational errors.
Site disasters (unless a geographically dispersed
clustering solution has been implemented).
19. In the near future…
Market analysis
Marketing strategies
Advertisement
Risk analysis and management
Finance and financial investments
Manufacturing and production
Fraud detection and detection of unusual patterns
(outliers)
Telecommunication
Financial transactions
Anti-terrorism (!!!)
20. CONCLUSION
We devised a new algorithm for clustering with the following variations:-
MS Excel file(s) are successfully read, handled and processed by the system with the help
of the ‘jxl.jar’ library. Using this library exposed new features and functionality for
working with Excel documents.
Null data sets were removed comfortably, along with redundant and duplicate data
sets.
The algorithm chooses better starting clusters, i.e., it chooses the initial values (or “seeds”)
for the clustering algorithm; standard K-Means does not specify how the initial
centers are to be selected.
A filtering algorithm is included which uses KD-trees to speed up each K-Means
step.
An inappropriate choice of the number of clusters can yield poor results. That is why the
number of clusters is determined properly for the data set.
21. References
An Efficient k-Means Clustering Algorithm: Analysis and
Implementation - Tapas Kanungo, Nathan S. Netanyahu,
Christine D. Piatko, Ruth Silverman, Angela Y. Wu
Introduction to Clustering Techniques - Leo Wanner
A Comprehensive Overview of Basic Clustering Algorithms -
Glenn Fung
Introduction to Data Mining - Tan, Steinbach, Kumar