Data clustering and optimization techniques

Information Science & Informatics
Informatics and Neuroinformatics
Spyros Ktenas

Spyros Ktenas - http://open-works.org/profiles/spyros-ktenas
WHAT IS CLUSTERING
 Clustering is the process of separating a set of objects
into a set of logical groups. Objects placed in the same group
are similar to one another, and objects belonging to
different groups are dissimilar.
How similarity is measured depends on the specific
problem and on the form of the "objects". In the
literature, clustering is also referred to as grouping and as
unsupervised learning. Objects may also be called by
other names (patterns, vectors).
Clustering

CLUSTERING APPLICATIONS
 Health, Biology, Construction, Insurance, Marketing, Technology,
Academia, Networks
 Classification of species
 Housing planning
 Classification of customers
 Search engines group items as clusters
 Drug Activity Prediction

CLUSTERING ALGORITHMS
 Connectivity-based clustering (hierarchical clustering)
Items are more closely related to nearby items than to items farther away.
 Centroid-based clustering (K-means)
The algorithm starts by separating the items into k initial sets, either
randomly or using a heuristic based on the data. It then calculates the centroid
of each set and builds a new partition by assigning each point to its nearest
centroid. The centroids are then recalculated for the new groups, and the
algorithm repeats these two steps until no item changes group (the centroids
remain unchanged).
 Distribution-based clustering
 Density-based clustering (DBSCAN)
Images from Wikipedia
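
The K-means steps listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, its parameters, the use of Euclidean distance, and the round-robin way of forming the k initial sets are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 0: split the items into k random, non-empty initial sets
    # (assumes len(points) >= k).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 1: compute the centroid (coordinate-wise mean) of each set.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an emptied set keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        # Step 2: reassign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when no item changes group.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

On this toy input the two visibly separate groups of three points end up in different clusters; which group receives label 0 depends on the random initial split.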

CLUSTERING ALGORITHMS OPTIMIZATION PAPERS
 Swarm Intelligence Algorithms for Data Clustering
Ajith Abraham, Swagatam Das, and Sandip Roy
Swarm Intelligence (SI), a family of bio-inspired algorithms, has been applied
successfully to a number of real-world clustering problems. This chapter explores
the role of SI in clustering different kinds of datasets.
According to the authors, the proposed algorithm can automatically compute the
optimal number of clusters in a dataset and thus requires minimal user intervention.
A comparison with a state-of-the-art clustering strategy based on genetic algorithms
(GA) is reported to show the superiority of the algorithm in both accuracy and speed.
 Initializing Partition-Optimization Algorithms
Ranjan Maitra
Partition-optimization approaches, such as k-means or
expectation-maximization (EM) algorithms, are sub-optimal: they find solutions
in the vicinity of their initialization. This paper proposes a staged approach to
specifying initial values by finding a large number of local modes and then
obtaining representatives from the most separated ones.
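
Why initialization matters can be shown on a four-point example. The sketch below is not the staged method of the paper; it only runs k-means from several given starting centroids and keeps the run with the lowest within-cluster sum of squared errors (SSE). The names `lloyd` and `best_of_starts` are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, centroids, max_iter=100):
    """Run k-means from the given initial centroids; return (labels, centroids, sse)."""
    centroids = [tuple(c) for c in centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (ties go to the lower index).
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


def best_of_starts(points, starts):
    """Try every initialization in `starts` and keep the lowest-SSE result."""
    return min((lloyd(points, s) for s in starts), key=lambda r: r[2])


# A 4 x 1 rectangle: the natural split is left/right.
rect = [(0, 0), (0, 1), (4, 0), (4, 1)]
bad_start = [(2, 0), (2, 1)]      # converges to a bottom/top split
good_start = [(0, 0.5), (4, 0.5)]  # converges to the left/right split
labels, centroids, sse = best_of_starts(rect, [bad_start, good_start])
```

Starting from `bad_start`, the algorithm stops immediately at a bottom/top split with SSE 16; from `good_start` it finds the left/right split with SSE 1. Both are stable solutions, which is the local-optimum behaviour the paper sets out to mitigate.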

 Clusterpath: An Algorithm for Clustering using Convex Fusion Penalties
Toby Dylan Hocking, Armand Joulin, Francis Bach, Jean-Philippe Vert
The paper presents a convex relaxation of hierarchical clustering, which results
in a family of objective functions with a natural geometric interpretation. In the
authors' experiments the method gives state-of-the-art results similar to spectral
clustering for non-convex clusters, and has the added benefit of learning a tree
structure from the data.
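
For orientation, a convex fusion-penalty objective of this kind is commonly written as below. The notation is an assumption made for this note rather than something taken from the slides: x_i are the n data points, α_i is the fitted representative of point i, w_ij ≥ 0 are pairwise weights, λ ≥ 0 is the regularization strength, and q selects the norm (for example 1, 2 or ∞).

```latex
\min_{\alpha_1,\dots,\alpha_n}\;
  \frac{1}{2}\sum_{i=1}^{n} \lVert \alpha_i - x_i \rVert_2^2
  \;+\; \lambda \sum_{i<j} w_{ij}\, \lVert \alpha_i - \alpha_j \rVert_q
```

At λ = 0 every α_i equals its own x_i, so each point is its own cluster. As λ grows, the penalty pulls representatives together until they coincide, and following the solution as a function of λ traces out the path of merges.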

 An Optimized Version of the K-Means Clustering Algorithm
Marian Poteras, Marian Cristian Mihaescu, Mihai Mocanu
The paper describes an optimized version of the K-Means algorithm. The
optimization targets the running time. The implementation proposed in the
paper distinguishes data elements that won’t change their cluster during the
next iteration from those that might change it, significantly reducing the
workload, especially for large datasets. The prototype showed a reduction in
running time of up to 70%.
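
The idea of skipping elements that cannot change cluster can be illustrated with a simple distance bound. This is a sketch under assumptions, not the procedure of the paper: for every point it stores a margin (distance to the second-nearest centroid minus distance to the nearest one) and, after each centroid update, shrinks that margin by twice the largest centroid movement. While the margin stays positive the point provably keeps its cluster and is skipped. The names `assign_full` and `kmeans_skip_stable` are hypothetical, and at least two centroids are assumed.

```python
import math


def assign_full(p, centroids):
    """Return (index of nearest centroid, margin = d_second - d_nearest)."""
    dists = [math.dist(p, c) for c in centroids]
    order = sorted(range(len(centroids)), key=dists.__getitem__)
    return order[0], dists[order[1]] - dists[order[0]]


def kmeans_skip_stable(points, centroids, max_iter=100):
    """k-means that re-evaluates only points whose cluster might change.

    Returns (labels, centroids, evaluated), where `evaluated` counts the
    full point-versus-all-centroids evaluations actually performed.
    """
    centroids = [tuple(c) for c in centroids]
    k = len(centroids)
    labels = [-1] * len(points)    # -1 means "not assigned yet"
    margins = [0.0] * len(points)  # margin <= 0 forces a full evaluation
    evaluated = 0
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            if margins[i] > 0:
                continue  # still strictly closest to its current centroid
            new_label, margins[i] = assign_full(p, centroids)
            evaluated += 1
            if new_label != labels[i]:
                labels[i] = new_label
                changed = True
        if not changed:
            break
        # Update the centroids and record the largest movement.
        max_shift = 0.0
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_c = tuple(sum(c) / len(members) for c in zip(*members))
                max_shift = max(max_shift, math.dist(new_c, centroids[j]))
                centroids[j] = new_c
        # Each distance can change by at most max_shift, so the gap between
        # two distances can shrink by at most 2 * max_shift.
        margins = [m - 2 * max_shift for m in margins]
    return labels, centroids, evaluated


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centroids, evaluated = kmeans_skip_stable(points, [(0, 0), (1, 1)])
```

On this input the final assignment pass is skipped entirely, because every point's margin is still positive after the last centroid update. The saving grows with the number of iterations in which most points are far from any cluster boundary.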

K-MEANS PHP IMPLEMENTATION
 The first steps of a PHP k-means implementation have been completed:
-UI
-Input Validation
-Initializations
-Clustering for first Iteration
 Future Work
-Implementation of the iterations
-Presentation of results

CONCLUSIONS
 Although data clustering is an old problem, it remains an active
field of scientific research. No algorithm has been found that can
group all real-world data efficiently and without error. To judge
the quality of a clustering, we need a specially designed
statistical-mathematical function called a clustering validity index.
However, a review of the literature shows that most of these validity
indices are empirically designed, and no single index works well
in all cases.
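
As one concrete example of a validity index, the sketch below computes the mean silhouette coefficient. The slides do not name a specific index, so this choice is an assumption made for illustration; the function name `silhouette` and the use of Euclidean distance are also assumptions.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient of a labelling (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # left pair vs right pair
poor = silhouette(points, [0, 1, 0, 1])  # bottom pair vs top pair
```

Values close to 1 indicate compact, well-separated clusters and negative values indicate points that sit closer to another cluster. Here `good` is about 0.90 and `poor` is negative, but like other indices the silhouette favours particular cluster shapes and should not be read as a universal measure.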
THANK YOU
