Cluster Analysis Introduction

CLUSTER ANALYSIS
PRASIDDHA SARMA
17352220, MCA -III

Introduction
Cluster analysis is a class of techniques used to
classify objects or cases into relatively homogeneous groups called clusters.
Clustering as a data mining tool has its roots in many
application areas, such as biology, security, business
intelligence, and Web search.
Why Clustering?

Requirements for Cluster Analysis
Scalability − We need highly scalable clustering algorithms
to deal with large databases.
Ability to deal with different kinds of attributes −
Algorithms should be applicable to any kind of
data, such as interval-based (numerical), categorical, and
binary data.

Requirements for Cluster Analysis(2)
Discovery of clusters with arbitrary shape − The clustering
algorithm should be capable of detecting clusters of arbitrary
shape. It should not be limited to distance
measures that tend to find only small spherical clusters.
High dimensionality − The clustering algorithm should be
able to handle not only low-dimensional data but also
high-dimensional data.

Requirements for Cluster Analysis(3)
Ability to deal with noisy data − Databases often contain noisy,
missing, or erroneous data. Some algorithms are sensitive to
such data and may produce poor-quality clusters.
Interpretability − The clustering results should be
interpretable, comprehensible, and usable.

Applications
Cluster analysis is widely used in applications
such as market research, pattern recognition, data analysis,
and image processing.
Clustering can also help marketers discover distinct groups
in their customer base and characterize those
groups based on their purchasing patterns.

Applications(2)
In the field of biology, it can be used to derive plant and
animal taxonomies, categorize genes with similar
functionality, and gain insight into structures inherent in
populations.
Clustering also helps in identification of areas of similar
land use in an earth observation database. It also helps in the
identification of groups of houses in a city according to
house type, value, and geographic location.

Applications(3)
Clustering also helps in classifying documents on the web
for information discovery.
Clustering is also used in outlier detection applications
such as detection of credit card fraud.

Applications (4)
As a data mining function, cluster analysis serves as a tool
to gain insight into the distribution of data and to observe
the characteristics of each cluster.

Overview of Basic Clustering Methods
Clustering methods can be classified into the following
categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method

Partitioning Methods
K-Means Algorithm
◦It is a centroid-based technique.
Suppose a data set, D, contains n objects in Euclidean
space. Partitioning methods distribute the objects in D into k
clusters, C1,...,Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for
1 ≤ i, j ≤ k with i ≠ j.
An objective function is used to assess the partitioning
quality so that objects within a cluster are similar to one another
but dissimilar to objects in other clusters.
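
The slide does not name a specific objective function. A common choice for k-means is the within-cluster sum of squared errors (SSE): the sum of squared Euclidean distances from each object to the mean of its cluster. The sketch below assumes that choice; the function name within_cluster_sse and the sample points are hypothetical, for illustration only.

```python
def within_cluster_sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean.

    `clusters` is a list of clusters; each cluster is a non-empty list of
    equal-length numeric tuples. (Assumed objective; the slides do not
    specify one.)
    """
    total = 0.0
    for points in clusters:
        # The cluster center is the coordinate-wise mean of its points.
        centroid = tuple(sum(coords) / len(points) for coords in zip(*points))
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, centroid))
    return total


# Illustrative data: two tight groups versus a partition that mixes them.
good = [[(1, 1), (1, 3)], [(8, 8), (10, 8)]]
bad = [[(1, 1), (8, 8)], [(1, 3), (10, 8)]]
good_sse = within_cluster_sse(good)
bad_sse = within_cluster_sse(bad)
```

A lower value means the objects sit closer to their cluster centers, so the first partition scores better than the second.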

◦The k-means algorithm partitions the data so that each cluster’s
center is represented by the mean value of the objects in the
cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.

Method
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
4. update the cluster means, that is, calculate the mean value of the
objects for each cluster;
5. until no change;
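
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation. It assumes squared Euclidean distance as the similarity measure, adds a max_iter safety cap, and leaves the center of an empty cluster unchanged; none of these details are stated on the slide. The names k_means, seed, and max_iter are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(data, k, seed=0, max_iter=100):
    """Return (labels, centers) for `data`, a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: arbitrarily choose k objects as the initial cluster centers.
    centers = rng.sample(data, k)
    labels = None
    # Steps 2-5: repeat until the assignment no longer changes.
    for _ in range(max_iter):
        # Step 3: (re)assign each object to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in data
        ]
        if new_labels == labels:
            break  # Step 5: no change, so stop.
        labels = new_labels
        # Step 4: update each cluster mean from its current members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean_point(members)
    return labels, centers


# Illustrative data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
        (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = k_means(data, k=2)
```

Because the initial centers are chosen at random, different seeds can give different results on less clearly separated data; running the algorithm several times and keeping the best partition is a common precaution.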

Hierarchical Clustering
Agglomerative vs Divisive
