A New Clustering Tool for Data Mining: RAPID MINER
Introduction to Clustering
Clustering is a form of unsupervised learning, used when historical data with class labels is not available, e.g. when introducing a new product.
Example: group (cluster) existing customers based on the time series of their payment history, so that similar customers fall in the same cluster.
Key requirement: a good measure of similarity between instances.
Typical use: identify micro-markets and develop policies for each.
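As an illustration of a similarity measure, the sketch below compares payment histories with the Euclidean distance. The slides do not say which measure the tool uses, so this choice is an assumption, and the customer names and payment amounts are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical monthly payment amounts for three customers.
alice = [100, 100, 100, 100]
bob = [100, 90, 100, 110]
carol = [0, 0, 250, 0]

print(euclidean_distance(alice, bob))    # small: similar payment behaviour
print(euclidean_distance(alice, carol))  # large: very different behaviour
```

A smaller distance means the two customers are more similar, so they are more likely to end up in the same cluster.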
About the Project
The aim of this project is to devise a new clustering algorithm for data mining. The main functions implemented in the system are preprocessing and clustering.
In preprocessing, an input file in .xls format is chosen. Null values, if any, are removed from the input file to avoid faulty results in the output data sets. Redundant (duplicate) entries in the data are also removed.
In clustering, the data is divided into groups so that the degree of association is strong between members of the same cluster and weak between members of different clusters.
Present Tool: Weka
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. The Explorer interface features several panels providing access to the main components of the workbench.
The Preprocess panel has facilities for importing data from a database, a CSV file, etc., and for preprocessing this data using a so-called filtering algorithm. These filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria.
The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-means algorithm. There is also an implementation of the expectation maximization algorithm for learning a mixture of normal distributions.
Our Tool
In the data preprocessing phase, the tool takes an MS Excel file as input; CSV and ARFF files are not required. Excel was chosen because it is widely known and comfortable for non-technical users, whereas CSV and ARFF files require some familiarity with the format. Excel support was added by importing the ‘jxl.jar’ library into the project.
The input file is first cleaned by removing null data sets (records). A null data set is one that contains no information, or fewer values than a threshold (the minimum number of values for the required attributes). The number of null data sets removed is reported to the user.
Next, redundant (duplicate) data sets are removed. A data set is a duplicate if all of its attribute values are the same as those of some other data set. Duplicates are excluded from the rest of the data mining process, and their number is also reported to the user.
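The two cleaning passes can be sketched as follows. This is a minimal illustration, not the tool's code: it assumes the rows have already been read from the Excel file into lists of cell values (reading with jxl is not shown), and the function name `preprocess`, the parameter `min_values` and the sample rows are hypothetical.

```python
def preprocess(rows, min_values=1):
    """Drop null rows, then duplicate rows.

    rows       -- list of rows, each a list of cell values (None or "" is empty)
    min_values -- assumed threshold: rows with fewer non-empty cells count as null
    Returns (clean_rows, number_of_null_rows, number_of_duplicate_rows).
    """
    def filled(row):
        return sum(1 for cell in row if cell is not None and str(cell).strip() != "")

    non_null = [row for row in rows if filled(row) >= min_values]
    n_null = len(rows) - len(non_null)

    seen = set()
    clean = []
    for row in non_null:
        key = tuple(row)
        if key not in seen:  # keep only the first occurrence of each row
            seen.add(key)
            clean.append(row)
    n_duplicate = len(non_null) - len(clean)
    return clean, n_null, n_duplicate


# Hypothetical sample data.
rows = [
    ["sunny", 85, "no"],
    ["rainy", 70, "yes"],
    ["sunny", 85, "no"],        # duplicate of the first row
    [None, "", None],           # completely empty
    ["overcast", None, None],   # only one value present
]
clean, n_null, n_duplicate = preprocess(rows, min_values=2)
print(clean)
print("null rows removed:", n_null)
print("duplicate rows removed:", n_duplicate)
```

Null rows are removed before duplicates so that several empty rows are counted as null rather than as duplicates of each other.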
KD Trees
K-dimensional trees.
A space-partitioning data structure.
Splitting planes are perpendicular to the coordinate axes.
Reduces the time of a nearest-neighbor query from O(n) for a linear scan to O(log n) on average for low-dimensional data.
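A minimal sketch of a KD tree with a nearest-neighbor query is given below. It is an illustration only: the names `Node`, `build` and `nearest` and the sample points are hypothetical, and median splitting with cycling axes is one common construction, assumed here.

```python
import math


class Node:
    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build(points, depth=0):
    """Build a KD tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))


def nearest(node, target, best=None):
    """Return the stored point closest to target (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(node.point, target) < math.dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the other side only if the splitting plane is closer than the best so far.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(nearest(tree, (9, 2)))  # (8, 1)
```

The saving comes from the last test in `nearest`: whole subtrees are skipped when the splitting plane is already farther away than the best point found.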
Clustering
Our clustering algorithm uses a KD tree extensively to reduce its running time. It differs from the existing approach in how the nearest centers are computed. Efficiency is achieved because the data points do not change during the computation, so the tree is built once and does not need to be recomputed at each stage.
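One way to use the tree for the nearest-center computation is the filtering algorithm of Kanungo et al., listed in the references. The sketch below illustrates that idea and is not the tool's actual code. Each tree cell stores its bounding box, point count and coordinate sums; a center is dropped from a cell's candidate list when no point of the cell can be closer to it than to the best candidate. The names `Cell`, `filter_cell` and `kmeans_step`, the sample points and the starting centers are hypothetical.

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class Cell:
    """KD-tree node storing its bounding box, point count and coordinate sums."""
    def __init__(self, points, depth=0):
        dims = len(points[0])
        self.lo = [min(p[i] for p in points) for i in range(dims)]
        self.hi = [max(p[i] for p in points) for i in range(dims)]
        self.count = len(points)
        self.total = [sum(p[i] for p in points) for i in range(dims)]
        self.left = self.right = None
        if len(points) > 1:
            ordered = sorted(points, key=lambda p: p[depth % dims])
            mid = len(ordered) // 2
            self.left = Cell(ordered[:mid], depth + 1)
            self.right = Cell(ordered[mid:], depth + 1)


def is_farther(z, z_star, cell):
    """True if no point in the cell's box is closer to z than to z_star."""
    # Corner of the box lying furthest in the direction from z_star towards z.
    corner = [hi if zc > sc else lo
              for zc, sc, lo, hi in zip(z, z_star, cell.lo, cell.hi)]
    return sqdist(z, corner) >= sqdist(z_star, corner)


def filter_cell(cell, candidates, centers, sums, counts):
    """Add the cell's points to the running sum and count of their nearest center."""
    if cell.left is None:  # leaf: the box is a single point
        best = min(candidates, key=lambda j: sqdist(centers[j], cell.lo))
    else:
        mid = [(lo + hi) / 2 for lo, hi in zip(cell.lo, cell.hi)]
        best = min(candidates, key=lambda j: sqdist(centers[j], mid))
        candidates = [j for j in candidates
                      if j == best or not is_farther(centers[j], centers[best], cell)]
        if len(candidates) > 1:  # still ambiguous: decide in the children
            filter_cell(cell.left, candidates, centers, sums, counts)
            filter_cell(cell.right, candidates, centers, sums, counts)
            return
    # One candidate left: the whole cell goes to it in a single update.
    counts[best] += cell.count
    for i, t in enumerate(cell.total):
        sums[best][i] += t


def kmeans_step(tree, centers):
    """One k-means iteration: assign by filtering, then move centers to centroids."""
    sums = [[0.0] * len(centers[0]) for _ in centers]
    counts = [0] * len(centers)
    filter_cell(tree, list(range(len(centers))), centers, sums, counts)
    return [tuple(s / c for s in total) if c else center
            for total, c, center in zip(sums, counts, centers)]


points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
tree = Cell(points)                   # built once, reused in every iteration
centers = [(1.0, 1.0), (5.0, 6.0)]    # hypothetical starting centers
for _ in range(5):
    centers = kmeans_step(tree, centers)
print(centers)
```

Only the candidate lists change between iterations; the tree itself depends on the data points alone, which is why it never has to be rebuilt.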
K-means Clustering
Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.
K-Means
The K-Means methodology is a commonly used clustering technique. The user starts with a collection of samples and attempts to group them into ‘k’ clusters based on a specific distance measure. The main steps of the K-Means clustering algorithm are given below.
1. The algorithm is initiated by creating ‘k’ different clusters. The given sample set is first randomly distributed between these ‘k’ clusters.
2. Next, the distance from each sample within a given cluster to its cluster centroid is calculated.
3. Each sample is then moved to the cluster (k′) whose centroid is at the shortest distance from that sample.
As a first step in the cluster analysis, the user decides on the number of clusters ‘k’. This parameter takes integer values with a lower bound of 1 (in practice, 2 is the smallest relevant number of clusters) and an upper bound equal to the total number of samples. The K-Means algorithm is repeated a number of times, each time starting with a random set of initial clusters, and the best clustering solution found is kept.
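The K-Means steps and the repeated random starts can be sketched as below. Several details are assumptions that the slides do not specify: a run stops when no sample changes cluster, an empty cluster is re-seeded with a random sample, and the "best" run is taken to be the one with the lowest sum of squared distances to the centroids. The function names and sample data are hypothetical.

```python
import math
import random


def kmeans(samples, k, rng, max_iter=100):
    """One K-Means run, started from a random partition of the samples."""
    # Step 1: distribute the samples randomly between k clusters.
    labels = [rng.randrange(k) for _ in samples]
    centroids = []
    for _ in range(max_iter):
        # Step 2: compute each cluster's centroid.
        centroids = []
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:
                centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                centroids.append(rng.choice(samples))  # assumed fix for empty clusters
        # Step 3: move each sample to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(s, centroids[j]))
                      for s in samples]
        if new_labels == labels:  # no sample moved: stop
            break
        labels = new_labels
    sse = sum(math.dist(s, centroids[lab]) ** 2 for s, lab in zip(samples, labels))
    return labels, centroids, sse


def best_of(samples, k, restarts=10, seed=0):
    """Repeat K-Means from random starts and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    return min((kmeans(samples, k, rng) for _ in range(restarts)),
               key=lambda run: run[2])


# Hypothetical samples forming two well-separated groups.
samples = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
           (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids, sse = best_of(samples, k=2)
print(labels)
print(centroids)
```

Each run costs on the order of n * K * d distance operations per iteration, so the restarts multiply the total cost by the number of runs.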
COMPARISON OF OUR TOOL WITH WEKA
A data set with the following statistics was run on both WEKA and our tool:
Relation = weather
No. of attributes = 3
No. of instances (including redundant/duplicate and null instances) = 17
Limitations
This tool does not provide protection from:
Shared storage failures.
Network service failures.
Operational errors.
Site disasters (unless a geographically dispersed clustering solution has been implemented).
In the near future…
Market analysis
Marketing strategies
Advertisement
Risk analysis and management
Finance and finance investments
Manufacturing and production
Fraud detection and detection of unusual patterns (outliers)
Telecommunication
Financial transactions
Anti-terrorism (!!!)
CONCLUSION
We devised a new algorithm for clustering with the following variations:
MS Excel files are successfully read, handled and processed by the system with the help of the ‘jxl.jar’ library. Working with this library also revealed new features and functionality for handling Excel documents.
Null data sets were removed, along with redundant and duplicate data sets.
The algorithm chooses better starting clusters, i.e. the initial values (or “seeds”) for the clustering algorithm. The initial centers are chosen by the algorithm itself; standard K-Means does not specify how they are to be selected.
A filtering algorithm is included, which uses KD trees to speed up each K-Means step.
An inappropriate choice of the number of clusters can yield poor results, which is why the number of clusters is determined properly for the data set.
References
Tapas Kanungo, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation.
Leo Wanner. Introduction to Clustering Techniques.
Glenn Fung. A Comprehensive Overview of Basic Clustering Algorithms.
Tan, Steinbach, Kumar. Introduction to Data Mining.