Project 0th Review

A Combined Approach for Clustering
based on the GSA-KM and Genetic
Algorithms

Divakar Raj.M (0901016)
Dilip.M (0901015)
Kishore Kumar.C (0901036)
IV CSE - A

Under the guidance of
Mr.P.Perumal
Associate Professor
Department of Computer Science and Engineering (UG)

Data Mining / Clustering 1/33

Introduction to Data Mining

• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) information or patterns from data in large databases

• Potential Applications
– Market analysis and management
– Risk analysis and management
– Fraud detection and management
– Text mining (news group, email, documents) and Web analysis
– Intelligent query answering


Data Mining: A KDD Process

– Data mining: the core of the knowledge discovery process.

[Figure: the KDD process. Stages, in order: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Architecture of a Typical Data
Mining System
[Figure: system components, listed from top to bottom]
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Knowledge base
• Database or data warehouse server
• Data cleaning, data integration, and filtering
• Databases and data warehouse

Data Mining Functionalities
• Concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry
vs. wet regions

• Association (correlation and causality)
– Multi-dimensional vs. single-dimensional association
– age(X, "20..29") ^ income(X, "20..29K") → buys(X, "PC")
– contains(T, "computer") → contains(T, "software")
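
Association rules such as the two above are commonly judged by their support and confidence. The slides do not give these values, so the following is only a minimal sketch: the function name `support_confidence` and the transaction list are hypothetical, invented for illustration.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n = len(transactions)
    with_a = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_a if consequent in t]
    # Support: fraction of all transactions that contain both items.
    support = len(with_both) / n
    # Confidence: fraction of antecedent transactions that also contain the consequent.
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence

# Hypothetical transactions, invented only for this illustration.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]
support, confidence = support_confidence(transactions, "computer", "software")
```

Here two of the four transactions contain both items, and two of the three transactions containing "computer" also contain "software".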


• Classification and Prediction
– Finding models (functions) that describe and distinguish classes or
concepts for future prediction
– E.g., classify countries based on climate, or classify cars based on gas
mileage
– Presentation: decision-tree, classification rule, neural network
– Prediction: Predict some unknown or missing numerical values
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
– Clustering based on the principle: maximizing the intra-class similarity
and minimizing the interclass similarity


• Outlier analysis
– Outlier: a data object that does not comply with the general behavior
of the data
– It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis

• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis


Issues in Data mining
• Individual Privacy
• Data Integrity
• Relational database structure vs. multidimensional structure
• Issue of Cost
• Mining methodology and user interaction issues
• Performance issues
• Issues relating to the diversity of database types


Applications
• Database analysis and decision support

– Market analysis and management
• Target Marketing, Customer Relation Management, Market
Basket Analysis, Cross Selling, Market Segmentation

– Risk analysis and management
• Forecasting, Customer Retention, Improved Underwriting,
Quality Control, Competitive Analysis


Applications

• Text mining (news group, email, documents) and Web analysis
• Intelligent query answering
• Sports
• Astronomy
• Internet Web Surf-Aid


Clustering

• Clustering is a data mining (machine learning)
technique that places data elements into related
groups without advance knowledge of the group
definitions

• The resulting meaningful subclasses are called clusters


Cluster Analysis
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters

• Cluster analysis
– Grouping a set of data objects into clusters

• Clustering is unsupervised classification: no predefined classes

• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms


What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity

• The quality of a clustering result depends on both the similarity
measure used by the method and its implementation.

• The quality of a clustering method is also measured by its ability to
discover some or all of the hidden patterns
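
One possible way to put numbers on "high intra-class similarity, low inter-class similarity" is to compare the average distance of objects to their own centroid with the average distance between centroids. This is a rough sketch using Euclidean distance, not a measure taken from the slides; the function name `cluster_quality` and the sample points are assumptions.

```python
import numpy as np

def cluster_quality(data, labels):
    """Return (mean intra-cluster distance, mean distance between centroids).

    Needs at least two clusters. Smaller intra and larger inter is better.
    """
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)  # sorted cluster ids
    centroids = np.array([data[labels == c].mean(axis=0) for c in ids])
    # Intra-cluster: average distance of each object to its own centroid.
    own = centroids[np.searchsorted(ids, labels)]
    intra = np.linalg.norm(data - own, axis=1).mean()
    # Inter-cluster: average distance over distinct centroid pairs.
    pair = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    inter = pair[np.triu_indices(len(ids), k=1)].mean()
    return intra, inter

# Hypothetical points: two tight groups far apart.
points = [[0, 0], [0, 1], [5, 5], [5, 6]]
good = cluster_quality(points, [0, 0, 1, 1])  # groups nearby points together
bad = cluster_quality(points, [0, 1, 0, 1])   # mixes the two groups
```

The grouping that keeps nearby points together gives a smaller intra-cluster distance and a larger inter-cluster distance than the mixed grouping.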


Requirements of Clustering in Data Mining

• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input
parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and Usability


Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then
evaluate them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the
set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters
and the idea is to find the best fit of the data to that model


Issues of Clustering
• Assessment of results

• Choice of appropriate number of clusters

• Data preparation

• Proximity measures

• Handling outliers


General Applications of Clustering

• Pattern Recognition

• Image Processing

• Economic Science (especially market research)

• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns


Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults

Literature Survey
[1] An Architecture for Component-Based Design of Representative-
Based Clustering Algorithms
Boris Delibašić, Milan Vukićević, Miloš Jovanović, Kathrin Kirchner,
Johannes Ruhland, Milija Suknović (2012)

[2] The Research of Imbalanced Data Set of Sample Sampling Method
Based on K-Means Cluster and Genetic Algorithm
Yang Yong (2012)

[3] A Combined Approach for Clustering based on K-means and
Gravitational Search Algorithms
Abdolreza Hatamlou, Salwani Abdullah, Hossein Nezamabadi-pour (2012)

An Architecture for Component-Based Design of
Representative-Based Clustering Algorithms

• Based on reusable components

• Components derived from K-Means like algorithms and their extensions

• New algorithms are built by exchanging components between the original
algorithms and their improvements

• The representative-based clustering framework makes it possible to compare
and evaluate the resulting algorithms


The Research of Imbalanced Data Set of Sample
Sampling Method
Based on K-Means Cluster and Genetic Algorithm

• K-Means clusters the data; within each cluster, a GA carries out
validation and produces a new sample

• Improves classification performance on imbalanced datasets

• Generates new samples for the minority class of the imbalanced data set

• Focuses on the classification accuracy of minority classes


A Combined Approach for Clustering based on K-means and
Gravitational Search Algorithms
• A hybrid data clustering algorithm based on GSA and k-means
(GSA-KM) is presented
• It uses the advantages of both algorithms
• Comparison of the performance of GSA-KM with other well-known
algorithms
– K-means
– Genetic Algorithm(GA)
– Simulated Annealing(SA)
– Ant Colony Optimization(ACO)
– Honey Bee Mating Optimization(HBMO)
– Particle Swarm Optimization(PSO)
– Gravitational Search Algorithm(GSA)
• Comparison based on real and standard datasets from the UCI
repository

Existing System
K-Means
• One of the most efficient and famous clustering algorithms
• Starts with some random or heuristic-based centroids for the desired clusters
• Assigns every data object to the closest centroid
• Iteratively refines the current centroids to reach the (near) optimal ones by
calculating the mean value of data objects within their respective clusters
• The algorithm will terminate when any one of the specified termination
criteria is met (i.e., a predetermined maximum number of iterations is
reached, a (near) optimal solution is found or the maximum search time is
reached)
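
The steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `kmeans`, its parameters, and the sample data are assumptions, and only two of the termination criteria are shown (an iteration cap and centroids that stop moving).

```python
import numpy as np

def kmeans(data, k, init=None, max_iter=100, seed=0):
    """Lloyd-style k-means; returns (centroids, labels)."""
    data = np.asarray(data, dtype=float)
    if init is not None:
        # Heuristic start: the caller supplies the initial centroids.
        centroids = np.asarray(init, dtype=float)
    else:
        # Random start: k distinct data objects chosen at random.
        rng = np.random.default_rng(seed)
        centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign every data object to its closest centroid.
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Refine each centroid as the mean of its members (keep it if empty).
        new = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):  # centroids stopped moving
            break
        centroids = new
    return centroids, labels

# Hypothetical data: two well-separated groups of three points.
data = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
centroids, labels = kmeans(data, k=2, init=[data[0], data[3]])
```

With one starting centroid taken from each group, the first three points end up in cluster 0 and the last three in cluster 1.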


Existing System
Gravitational Search Algorithm

• Inspired by the physical phenomenon of Gravity
• Based on the interaction of masses in the universe via Newtonian
gravity law
• Attraction depends on the amount of masses and the distance
between them

• $F = G \frac{M_1 M_2}{R^2}$
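
As a small worked example of the law above, the sketch below evaluates the force for two masses and shows the inverse-square effect of distance. The function name `gravitational_force` and the sample values are assumptions; in GSA the masses stand for candidate solutions, with better solutions being heavier, but the slide gives no further detail.

```python
def gravitational_force(m1, m2, r, g=6.674e-11):
    """Newtonian attraction between masses m1 and m2 at distance r.

    The default g is the physical gravitational constant in SI units;
    the examples below pass g=1.0 to keep the arithmetic readable.
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    return g * m1 * m2 / r ** 2

near = gravitational_force(5.0, 3.0, r=1.0, g=1.0)  # 1 * 5 * 3 / 1
far = gravitational_force(5.0, 3.0, r=2.0, g=1.0)   # 1 * 5 * 3 / 4
```

Doubling the distance divides the force by four, so heavier and closer masses dominate the attraction.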


Drawbacks of Existing System
K-Means

• Performance is highly dependent on the initial state of
centroids

• May converge to a local optimum rather than the global optimum

• The number of clusters is needed as input to the algorithm, i.e.
the number of clusters is assumed known
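
The first two drawbacks can be seen on a tiny example. The sketch below runs the k-means refinement from two different starting centroid sets on the four corners of a 4 x 1 rectangle and compares the resulting sum of squared errors (SSE). The function name `lloyd` and the data are assumptions made for illustration.

```python
import numpy as np

def lloyd(data, centroids, max_iter=100):
    """Run k-means refinement from the given centroids; return (centroids, SSE)."""
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # Squared distance of every point to every centroid.
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.min(axis=1).sum()

corners = [[0, 0], [0, 1], [4, 0], [4, 1]]
# Start between the rows: converges to a split of bottom row vs. top row.
_, sse_bad = lloyd(corners, [[2, 0], [2, 1]])
# Start between the columns: converges to a split of left pair vs. right pair.
_, sse_good = lloyd(corners, [[0, 0.5], [4, 0.5]])
```

Both runs stop immediately because each start is already a fixed point, yet the first has an SSE of 16 and the second an SSE of 1: the first is a local optimum that k-means alone cannot leave.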


GSA-KM
• Built on three main steps

1. GSA-KM applies the k-means algorithm to the selected dataset
to produce near-optimal centroids for the desired
clusters
2. The proposed approach then produces an initial population
of solutions
3. The GSA is applied to this population
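
The third step can be sketched roughly as below. This is a simplified reading of the gravitational search algorithm, not the authors' code: every candidate is a set of k centroids, fitness is the sum of squared errors (lower is better), and all agents attract each other (the usual restriction to the best few agents is omitted). The names `sse` and `gsa_refine` and the parameter values `g0` and `alpha` are assumptions.

```python
import numpy as np

def sse(centroids, data):
    """Sum of squared distances from each point to its nearest centroid."""
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def gsa_refine(population, data, iterations=50, g0=1.0, alpha=20.0, seed=0):
    """Move (n, k, d) centroid candidates under mutual attraction.

    Returns the best candidate seen and its SSE.
    """
    rng = np.random.default_rng(seed)
    n, k, d = population.shape
    pos = population.reshape(n, -1).astype(float)  # each agent as a flat vector
    vel = np.zeros_like(pos)
    best_pos, best_fit = None, np.inf
    for t in range(iterations):
        fit = np.array([sse(p.reshape(k, d), data) for p in pos])
        i = fit.argmin()
        if fit[i] < best_fit:
            best_fit, best_pos = fit[i], pos[i].copy()
        # Better (lower SSE) agents get larger normalized masses.
        worst, best = fit.max(), fit.min()
        m = (worst - fit) / (worst - best) if worst > best else np.ones_like(fit)
        mass = m / m.sum()
        g = g0 * np.exp(-alpha * t / iterations)   # gravity decays over time
        diff = pos[None, :, :] - pos[:, None, :]   # diff[i, j] = x_j - x_i
        dist = np.linalg.norm(diff, axis=2)
        weight = rng.random(dist.shape) * g * mass[None, :] / (dist + 1e-12)
        acc = (weight[:, :, None] * diff).sum(axis=1)
        vel = rng.random(pos.shape) * vel + acc
        pos = pos + vel
    return best_pos.reshape(k, d), best_fit

# Hypothetical data and a purely random starting population.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
population = rng.uniform(data.min(), data.max(), size=(10, 2, 2))
initial_best = min(sse(p, data) for p in population)
centroids, fitness = gsa_refine(population, data)
```

Because the best candidate seen so far is remembered, the returned SSE is never worse than the best SSE in the starting population.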


GSA-KM
Producing the initial population

• One candidate solution is the output of the k-means algorithm
obtained in the previous step

• Three candidates are created from the dataset itself, and the
remaining solutions are produced randomly

• GSA will be employed for determining an optimal solution for the
clustering problem
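
A possible way to assemble such a population is sketched below. The slides do not say how the three dataset-based candidates are derived, so this sketch simply assumes that each one is k randomly chosen data objects; the function name `initial_population` and the population size are also assumptions.

```python
import numpy as np

def initial_population(data, kmeans_centroids, size=10, seed=0):
    """Build `size` candidate solutions, each a (k, d) array of centroids."""
    data = np.asarray(data, dtype=float)
    kmeans_centroids = np.asarray(kmeans_centroids, dtype=float)
    if size < 4:
        raise ValueError("size must be at least 4")
    k, d = kmeans_centroids.shape
    rng = np.random.default_rng(seed)
    # One candidate: the output of k-means.
    population = [kmeans_centroids.copy()]
    # Three candidates from the dataset itself.
    # ASSUMPTION: k randomly chosen data objects per candidate.
    for _ in range(3):
        population.append(data[rng.choice(len(data), size=k, replace=False)])
    # Remaining candidates: uniform random within the bounds of the data.
    low, high = data.min(axis=0), data.max(axis=0)
    for _ in range(size - 4):
        population.append(rng.uniform(low, high, size=(k, d)))
    return np.stack(population)

# Hypothetical dataset and k-means output.
data = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
population = initial_population(data, [[0.0, 0.5], [4.0, 0.5]], size=8)
```

The result is an array of shape (size, k, d) whose first entry is the k-means solution, which gives the GSA at least one good starting point.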


Reasons for Efficiency
• Decreases the number of iterations and function evaluations to
find a near global optimum compared to the original GSA
alone

• With a good candidate solution already in the initial
population, GSA can search for near global optima in a
promising search space and, therefore, find a high quality
solution in comparison with the original GSA alone


Proposed System

• Along with the given GSA-KM, we intend to implement a
Genetic Algorithm to further increase the efficiency and speed
of the clustering

• The proposed system is expected to combine the advantages of
both and to be faster and more efficient than the traditional
clustering algorithms and GSA-KM alone
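
The slides do not specify how the Genetic Algorithm is combined with GSA-KM. Purely as a hypothetical illustration, the sketch below shows one GA generation over a population of centroid candidates, using tournament selection, uniform crossover, Gaussian mutation, and elitism. All names (`sse`, `ga_generation`) and parameter values are assumptions.

```python
import numpy as np

def sse(centroids, data):
    """Sum of squared distances from each point to its nearest centroid."""
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def tournament(population, fitness, rng):
    """Pick two distinct candidates at random and return the fitter one."""
    a, b = rng.choice(len(population), size=2, replace=False)
    return population[a] if fitness[a] < fitness[b] else population[b]

def ga_generation(population, data, mutation_scale=0.1, seed=0):
    """One GA generation over (n, k, d) centroid candidates (n >= 2)."""
    rng = np.random.default_rng(seed)
    n = len(population)
    fitness = np.array([sse(p, data) for p in population])
    # Elitism: the best candidate survives unchanged.
    children = [population[fitness.argmin()].copy()]
    while len(children) < n:
        p1 = tournament(population, fitness, rng)
        p2 = tournament(population, fitness, rng)
        # Uniform crossover: each centroid comes from one parent or the other.
        mask = rng.random(p1.shape[0]) < 0.5
        child = np.where(mask[:, None], p1, p2)
        # Gaussian mutation keeps some diversity in the population.
        child = child + rng.normal(0.0, mutation_scale, size=child.shape)
        children.append(child)
    return np.stack(children)

# Hypothetical data and starting population.
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
population = rng.uniform(data.min(), data.max(), size=(6, 2, 2))
next_population = ga_generation(population, data)
```

Because of elitism, the best SSE in the new population is never worse than the best SSE in the old one; whether this actually speeds up GSA-KM would have to be measured.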


Implementation Details

• Programming language : C#
• Database : MS Access

• The given repository is clustered using K-Means and GSA,
together called GSA-KM, and a Genetic Algorithm is used to
enhance the performance
• The performance is calculated and compared with other
clustering algorithms




Thank You !!!

