2. Outlines:
• Importance of Cluster Analysis for Data
Mining
• What is Cluster Analysis
• Purposes of Cluster Analysis
• Main Types of Clustering
• Common Types of Clustering
• Cluster Algorithms
• Applications of clustering
• Conclusion
• References
3. Importance of Cluster Analysis for
Data Mining
• Used for automatic identification of
natural groupings of things
• Part of the machine-learning family
• Employ unsupervised learning
• Learns the clusters of things from past
data, then assigns new instances
• There is not an output variable
• Also known as segmentation
5. Notion of a Cluster can be Ambiguous
QUESTION ??
How many clusters?
6. What is Cluster Analysis?
• Finding groups of objects such that the
objects in a group will be similar (or related)
to one another and different from (or
unrelated to) the objects in other groups
7. Puprses of Cluster Analysis
Understanding
Group related documents for
browsing, group genes and
proteins that have similar
functionality, or group stocks
with similar price fluctuations
Summarization
Reduce the size of large data
sets
Clustering precipitation in Australia
8. What is not Cluster Analysis?
Supervised classification
Have class label information
Simple segmentation
Dividing students into different registration groups
alphabetically, by last name
Results of a query
Groupings are a result of an external specification
Graph partitioning
Some mutual relevance and synergy, but areas are not
identical
9. Common Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or Conceptual
• Described by an Objective Function
10. Types of Clusters: Well-Separated
• Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer
(or more similar) to every other point in the cluster than to any
point not in the cluster.
3 well-separated clusters
11. Types of Clusters: Center-Based
• Center-based (prototype-based)
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
– The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative”
point of a cluster
4 center-based clusters
12. Types of Clusters: Contiguity-Based
• Contiguous Cluster (Graph-based)
– A cluster is a set of points such that a point in a cluster is closer (or
more similar) to one or more other points in the cluster than to any
point not in the cluster. Cluster can be defined as a connected
component i.e. a group of objects that are connected to one
another, but that have no connection to objects outside the group.
8 contiguous clusters
13. Types of Clusters : DENSITY-BASED
Density-based
A cluster is a dense region of points, which is separated by low-density
regions, from other regions of high density.
Used when the clusters are irregular or intertwined, and when noise
and outliers are present.
6 density-based clusters
14. Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
– Finds clusters that share some common property or represent a
particular concept.
.
2 Overlapping Circles
15. Types of Clusters: Objective Function
• Clusters Defined by an Objective Function
– Finds clusters that minimize or maximize an objective function.
– Enumerate all possible ways of dividing the points into clusters and
evaluate the `goodness' of each potential set of clusters by using the
given objective function.
– Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
– A variation of the global objective function approach is to fit the data to a
parameterized model.
• Parameters for the model are determined from the data.
• Mixture models assume that the data is a ‘mixture' of a number of statistical
distributions.
16. CONT.
-Map the clustering problem to a different domain
and solve a related problem in that domain
-Proximity matrix defines a weighted graph, where
the nodes are the points being clustered, and the
weighted edges represent the proximities between
points
Clustering is equivalent to breaking the graph into
connected components, one for each cluster.
Want to minimize the edge weight between clusters
and maximize the edge weight within clusters
17. • A clustering is a set of clusters
• Important distinction between
hierarchical and partitional sets of
clusters
What is Clustering?
18. Main type of Cluster Method
•Hierarchical clustering methods: can be either agglomerative or divisive.
• An agglomerative method starts with each point as a separate cluster,
and successively performs merging until a stopping criterion is met.
• A divisive method begins with all points in a single cluster and
performs splitting until a stopping criterion is met.
• The result of a hierarchical clustering method is a tree of clusters called
a dendogram.
• A tree like diagram that records the sequences of merges or splits
1 3 2 5 4 6
0
0.05
0.1
0.15
0.2
1
2
3
4
5
6
1
2
3 4
5
27. Hierarchical clustering
An example of hierarchical clustering methods is BIRCH (Balanced
Iterative Reducing and Clustering using Hierarchies)
Advantage: typically find a good clustering with a single scan of the data,
and improve the quality further with a few additional scans.
Disadvantages: it can hold only a limited number of entries due to its size,
it not always correspond to what a user may consider a natural cluster.
28. Type of Cluster Method CONT.
•Partitional clustering methods:
• determine a partition of the points into clusters, such that the
points in a cluster are more similar to each other than to points in
different clusters.
•They start with some arbitrary initial clusters and iteratively until
criterion is met.
Examples of partitional clustering algorithms include k-means, PAM,
CLARA, CLARANS and EM.
Disadvantages:
• it assumes that the points to be clustered are all stored in main
memory.
• the run time of the algorithm is prohibitive on large datasets.
30. Type of Cluster Method CONT.
•Density-based clustering methods :try to find clusters based on the density of
points in regions. Dense regions that are reachable from each other are merged to
form clusters.
Examples of density-based clustering methods include DBSCAN , DBRS and DBRS+
.
Advantages: It gives extremely good results and is efficient in many datasets.
Disadvantages:
• if a dataset has clusters of widely varying densities, it can't able to handle with
it efficiently.
• it does not consider non-spatial attributes in the dataset.
• it not suitable for finding approximate clusters in very large datasets.
31. Type of Cluster Method CONT.
Grid-based clustering methods quantize the clustering space into a finite number of
cells and then perform the required operations on the quantized space. Cells
containing more than a certain number of points are considered to be dense.
Contiguous dense cells are connected to form clusters.
Examples of grid-based clustering methods include CLIQUE and STING .
33. K-Means
• Initial set of clusters randomly chosen.
• Iteratively, items are moved among sets of
clusters until the desired set is reached.
• High degree of similarity among elements
in a cluster is obtained.
• Given a cluster Ki={ti1,ti2,…,tim}, the cluster
mean is mi = (1/m)(ti1 + … + tim)
34. K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
1- Randomly assign means: m1=3,m2=4
2- K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5,m2=16
3-K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18
4-K1={2,3,4,10},K2={12,20,30,11,25}, m1=4.75,m2=19.6
5-K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25
• Stop as the clusters with these means are the same.
38. Some applications of clustering
• Clustering in EBMT (Example-Based Machine
Translation)
• Clustering in Online shopping mall
• Clustering in Spatial database
39. Clustering in EBMT
• EBMT is a method of performing machine translation by imitating
translation examples of similar sentences .
• a large amount of bi/multi-lingual translation examples (Greek-
English) has been stored in a textual database .
• and input expressions are rendered in the target language by
retrieving from the database that example which is most similar to
the input.
But , how to make retrieval of the example that best matches the
input more efficient?
By using of
cluster
40. Clustering in EBMT
• It applied k-means algorithm .
• Its enable the application to limit the search space
and locate the best available match in a database.
41. Clustering in Online shopping mall
• Its combine the characteristics of Web 2.0 into Decision
Support System for online shopping mall.
• To facilitate customers to share profiles and comments.
• The goal is to show the customers the product which they
may have interest in, according to the information provided
by the them.
• the framework contains of the search engine as an external
layer and the recommendation system as an internal layer.
• It used K-means to clustering the products in internal layer.
44. Clustering in Spatial database
• Spatial database is a large amounts of data are obtained from
satellite for applications such as geo-marketing, traffic control,
and environmental studies.
• There exist many physical obstacles (rivers, lakes , highways)
and their presence may affect the result of clustering in spatial
databases.
45. Clustering in Spatial database
Fig4: map of western population of Canada with Highways and Rivers
46. Clustering in Spatial database
• However, most of clustering algorithms cannot deal with obstacles like
DBRS, DBRS+( Density-Based Clustering with Random Sampling) .
• Because, it can't handle with widely varying densities clusters in
dataset efficiently.
• so new clustering algorithm MMO (mathematics morphology based
algorithm of obstacles clustering) is proposed for the problem of
clustering in the presence of obstacles.
47. Clustering in Spatial database
• The main contributions are new operators are more
accurate (Connector and Obstor)than the ordinary
operators: (open and close).
48. Clustering in Spatial database
• Result: The performance tests show that: MMO is
suitable and effective for large data set more than
other algorithms .
Fig5: runtime of MMO with DBRS+
49. Conclusion
The main purpose of Data mining is to provide decision support to
managers and business professionals through knowledge
discovery
-Analyzes vast store of historical business data
-Tries to discover patterns, trends, and correlations hidden in the
data that can help a company improve its business performance
-Use regression, decision tree, neural network, cluster analysis, or
market basket analysis
-Cluster Analysis is the common method of Data mining tools
-Cluster methods and It’s algorithms were the effective methods
used by many application these day , and good Research area For
DSS fields was Done in Cluster methods .
50. References
1. CRANIAS, L., PAPAGEORGIOU, H., & PIPERIDIS, S. (1994). CLUSTERING : A
TECHNIQUE FOR SEARCH SPACE REDUCTION IN EXAMPLEBASED MACHINE
TRANSLATION. IEEE.
2. Pattabiraman, V. (2012). Optimization of spatial join using constraints based-
clustering techniques. Engineering and Computer Innovations.
3. Wang, X., Rostoker, C., & Hamilton , H. J. (2004). Density-Based Spatial Clustering
in the Presence of Obstacles and Facilitators.
4. Wang, X., & Hamilton, H. (n.d.). Using Clustering Methods in Geospatial
Information Systems.
5. Wenxing, H., Weng, Y., Xie, L., & Maoqing, L. (2009). Design and Implementation
of Web-based DSS for Online Shopping Mall. IEEE.
6. Zhang, Q. (2008). A Mathematics Morphology Based Algorithm of Obstacles
Clustering. International Conference on Computer Science and Software
Engineering.