Clustering on DSS
By:
Nora Alharbi
Enaam Alotaibi
Outline:
• Importance of Cluster Analysis for Data
Mining
• What is Cluster Analysis
• Purposes of Cluster Analysis
• Main Types of Clustering
• Common Types of Clustering
• Cluster Algorithms
• Applications of clustering
• Conclusion
• References
Importance of Cluster Analysis for
Data Mining
• Used for automatic identification of
natural groupings of things
• Part of the machine-learning family
• Employs unsupervised learning
• Learns the clusters of things from past
data, then assigns new instances
• There is no output variable
• Also known as segmentation
What is Cluster Analysis?
The Notion of a Cluster Can Be Ambiguous
Question: how many clusters?
What is Cluster Analysis?
• Finding groups of objects such that the
objects in a group will be similar (or related)
to one another and different from (or
unrelated to) the objects in other groups
Purposes of Cluster Analysis
• Understanding: group related documents for browsing, group genes and
proteins that have similar functionality, or group stocks with similar
price fluctuations
• Summarization: reduce the size of large data sets (e.g., clustering
precipitation in Australia)
What is not Cluster Analysis?
• Supervised classification: has class label information
• Simple segmentation: dividing students into registration groups
alphabetically, by last name
• Results of a query: groupings are the result of an external specification
• Graph partitioning: some mutual relevance and synergy, but the areas are
not identical
Common Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or Conceptual
• Described by an Objective Function
Types of Clusters: Well-Separated
• Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer
(or more similar) to every other point in the cluster than to any
point not in the cluster.
3 well-separated clusters
Types of Clusters: Center-Based
• Center-based (prototype-based)
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of its cluster than to the
center of any other cluster
– The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative”
point of a cluster
4 center-based clusters
Types of Clusters: Contiguity-Based
• Contiguous Cluster (Graph-based)
– A cluster is a set of points such that a point in a cluster is closer (or
more similar) to one or more other points in the cluster than to any
point not in the cluster. A cluster can also be defined as a connected
component, i.e., a group of objects that are connected to one another
but have no connection to objects outside the group.
8 contiguous clusters
Types of Clusters: Density-Based
• Density-based
– A cluster is a dense region of points, separated by low-density regions
from other regions of high density.
– Used when the clusters are irregular or intertwined, and when noise
and outliers are present.
6 density-based clusters
Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
– Finds clusters that share some common property or represent a
particular concept.
2 Overlapping Circles
Types of Clusters: Objective Function
• Clusters Defined by an Objective Function
– Finds clusters that minimize or maximize an objective function.
– Enumerate all possible ways of dividing the points into clusters and
evaluate the ‘goodness’ of each potential set of clusters using the
given objective function.
– Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
– A variation of the global objective function approach is to fit the data to a
parameterized model.
• Parameters for the model are determined from the data.
• Mixture models assume that the data is a ‘mixture’ of a number of statistical
distributions.
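As an illustration of the parameterized-model approach mentioned above, a Gaussian mixture model can be fit with scikit-learn. This is a minimal sketch; the synthetic data and parameter values are illustrative only.

```python
# Sketch: fitting a mixture model as a parameterized clustering objective.
# The data set and parameter values are invented for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic 2-D blobs standing in for real data.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(3, 3), scale=0.5, size=(100, 2)),
])

# Each component is one statistical distribution in the 'mixture'.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)   # cluster assignment for every point
print(gmm.means_)             # fitted component centers (the model parameters)
```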
Types of Clusters: Objective Function (cont.)
– Map the clustering problem to a different domain and solve a related
problem in that domain.
– The proximity matrix defines a weighted graph, where the nodes are the
points being clustered and the weighted edges represent the proximities
between points.
– Clustering is then equivalent to breaking the graph into connected
components, one for each cluster.
– The goal is to minimize the edge weight between clusters and maximize
the edge weight within clusters.
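A small sketch of this graph view, assuming a made-up proximity (similarity) matrix: edges below a similarity threshold are dropped, and the connected components of the remaining graph are the clusters.

```python
# Sketch: clustering as connected components of a thresholded proximity graph.
# The proximity matrix and threshold below are invented for illustration.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Similarity between 5 points (1.0 = identical); points 0-2 and 3-4 form two groups.
proximity = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 1.0],
])

threshold = 0.5
adjacency = csr_matrix(proximity >= threshold)   # keep only strong edges
n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)   # 2 clusters, labels [0 0 0 1 1]
```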
What is Clustering?
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of
clusters
Main Types of Clustering Methods
•Hierarchical clustering methods: can be either agglomerative or divisive.
• An agglomerative method starts with each point as a separate cluster,
and successively performs merging until a stopping criterion is met.
• A divisive method begins with all points in a single cluster and
performs splitting until a stopping criterion is met.
• The result of a hierarchical clustering method is a tree of clusters called
a dendrogram, a tree-like diagram that records the sequence of merges
or splits.
[Figure: dendrogram of points 1–6 with merge heights from 0 to 0.2, alongside the corresponding nested cluster diagram]
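For reference, a dendrogram like the one above can be produced with SciPy's agglomerative routines. A minimal sketch on six made-up 2-D points:

```python
# Sketch: agglomerative clustering and its dendrogram with SciPy.
# The six 2-D points are invented for illustration.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [1.0, 1.0], [1.1, 0.9], [2.0, 2.0]])

Z = linkage(X, method="single")                  # record of successive merges
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)

dendrogram(Z)                                    # tree-like diagram of the merges
plt.show()
```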
Hierarchical clustering
• Agglomerative (Bottom up)
[Figure sequence: iterations 1–5 merge the two closest clusters at each step, until finally k clusters are left]
Hierarchical clustering
• Divisive (Top-down)
Slide credit: Min Zhang
Hierarchical clustering
An example of hierarchical clustering methods is BIRCH (Balanced
Iterative Reducing and Clustering using Hierarchies)
Advantage: typically finds a good clustering with a single scan of the data,
and can improve the quality further with a few additional scans.
Disadvantages: it can hold only a limited number of entries due to its size,
and its clusters do not always correspond to what a user would consider a
natural cluster.
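scikit-learn includes a BIRCH implementation; a hedged sketch of its use follows (synthetic data, untuned parameter values).

```python
# Sketch: BIRCH clustering with scikit-learn; parameter values are illustrative.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Synthetic stand-in data: three loose 2-D blobs.
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (3, 0), (0, 3)]])

birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)   # builds the CF-tree in a single pass, then clusters
print(np.bincount(labels))      # points per cluster
```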
Types of Clustering Methods (cont.)
•Partitional clustering methods:
• determine a partition of the points into clusters, such that the
points in a cluster are more similar to each other than to points in
different clusters.
• They start with arbitrary initial clusters and refine them iteratively
until a stopping criterion is met.
Examples of partitional clustering algorithms include k-means, PAM,
CLARA, CLARANS and EM.
Disadvantages:
• it assumes that the points to be clustered are all stored in main
memory.
• the run time of the algorithm is prohibitive on large datasets.
Partitional Clustering
[Figure: the original points (left) and a partitional clustering of them (right)]
Types of Clustering Methods (cont.)
• Density-based clustering methods: try to find clusters based on the density of
points in regions. Dense regions that are reachable from each other are merged to
form clusters.
Examples of density-based clustering methods include DBSCAN, DBRS and DBRS+.
Advantages: it generally gives good results and is efficient on many datasets.
Disadvantages:
• if a dataset has clusters of widely varying densities, it cannot handle them
efficiently.
• it does not consider non-spatial attributes in the dataset.
• it is not suitable for finding approximate clusters in very large datasets.
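A minimal DBSCAN sketch with scikit-learn; the eps and min_samples values are illustrative, and points labeled -1 are treated as noise.

```python
# Sketch: density-based clustering with DBSCAN; parameter values are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense 2-D blobs plus a few scattered outliers.
X = np.vstack([
    rng.normal((0, 0), 0.2, size=(80, 2)),
    rng.normal((2, 2), 0.2, size=(80, 2)),
    rng.uniform(-1, 3, size=(10, 2)),
])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(sorted(set(labels)))   # cluster ids, with -1 marking noise/outliers
```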
Types of Clustering Methods (cont.)
Grid-based clustering methods quantize the clustering space into a finite number of
cells and then perform the required operations on the quantized space. Cells
containing more than a certain number of points are considered to be dense.
Contiguous dense cells are connected to form clusters.
Examples of grid-based clustering methods include CLIQUE and STING.
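A tiny sketch of the grid-based idea in NumPy/SciPy (not CLIQUE or STING themselves): points are quantized into cells, cells above a density threshold are kept, and contiguous dense cells are joined into clusters. The cell size and threshold are illustrative choices.

```python
# Sketch of grid-based clustering: quantize, keep dense cells, join contiguous ones.
# Cell size and density threshold are illustrative, not CLIQUE/STING defaults.
import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.3, size=(200, 2)),
               rng.normal((3, 3), 0.3, size=(200, 2))])

cell_size = 0.5
cells = np.floor((X - X.min(axis=0)) / cell_size).astype(int)   # cell index per point

grid_shape = tuple(cells.max(axis=0) + 1)
counts = np.zeros(grid_shape, dtype=int)
np.add.at(counts, (cells[:, 0], cells[:, 1]), 1)                # points per cell

dense = counts >= 10                      # cells with "enough" points are dense
cell_labels, n_clusters = label(dense)    # join contiguous dense cells into clusters
point_labels = cell_labels[cells[:, 0], cells[:, 1]]            # 0 = not in any cluster
print(n_clusters, np.bincount(point_labels))
```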
Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• Density-based clustering
K-Means
• Initial set of clusters randomly chosen.
• Iteratively, items are moved among sets of
clusters until the desired set is reached.
• High degree of similarity among elements
in a cluster is obtained.
• Given a cluster K_i = {t_i1, t_i2, …, t_im}, the cluster
mean is m_i = (1/m)(t_i1 + … + t_im)
K-Means Example
• Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
1. Randomly assign means: m1 = 3, m2 = 4
2. K1 = {2,3}, K2 = {4,10,12,20,30,11,25}; m1 = 2.5, m2 = 16
3. K1 = {2,3,4}, K2 = {10,12,20,30,11,25}; m1 = 3, m2 = 18
4. K1 = {2,3,4,10}, K2 = {12,20,30,11,25}; m1 = 4.75, m2 = 19.6
5. K1 = {2,3,4,10,11,12}, K2 = {20,30,25}; m1 = 7, m2 = 25
• Stop, as the clusters produced by these means no longer change.
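A short standalone script that replays the trace above, using the same data and the same initial means (m1 = 3, m2 = 4); it ends with K1 = {2, 3, 4, 10, 11, 12} and K2 = {20, 25, 30}, means 7 and 25.

```python
# Replays the 1-D k-means example above; initial means taken from the slide.
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
means = [3.0, 4.0]

while True:
    # Assign every value to the nearest mean.
    clusters = [[], []]
    for x in data:
        clusters[0 if abs(x - means[0]) <= abs(x - means[1]) else 1].append(x)
    # Recompute the means; stop once they no longer change.
    new_means = [sum(c) / len(c) for c in clusters]
    if new_means == means:
        break
    means = new_means

print(clusters, means)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]] [7.0, 25.0]
```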
K-Means Algorithm
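A minimal NumPy sketch of the general k-means loop described on the previous slides (function and variable names are illustrative; empty clusters are not handled):

```python
# Sketch of the k-means loop; names are illustrative, empty clusters not handled.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial means chosen at random
    for _ in range(n_iter):
        # Assign each item to the cluster whose mean is closest.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster mean m_i as the average of its members.
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):   # stop when the means no longer move
            break
        centers = new_centers
    return labels, centers
```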
Two Different K-Means Clusterings
[Figure: the original points, an optimal clustering, and a sub-optimal clustering of the same data]
What are applications that
use Clustering?
Some applications of clustering
• Clustering in EBMT (Example-Based Machine
Translation)
• Clustering in Online shopping mall
• Clustering in Spatial database
Clustering in EBMT
• EBMT is a method of performing machine translation by imitating
translation examples of similar sentences.
• A large amount of bilingual/multilingual translation examples
(Greek-English) is stored in a textual database.
• Input expressions are rendered in the target language by retrieving
from the database the example that is most similar to the input.
But how can retrieval of the example that best matches the input be made
more efficient? By using clustering.
Clustering in EBMT
• The k-means algorithm was applied.
• It enables the application to limit the search space
and locate the best available match in the database.
Clustering in Online shopping mall
• It combines the characteristics of Web 2.0 with a Decision
Support System for an online shopping mall.
• It facilitates customers sharing profiles and comments.
• The goal is to show customers the products they may be
interested in, according to the information they provide.
• The framework consists of a search engine as the external
layer and a recommendation system as the internal layer.
• K-means is used to cluster the products in the internal layer.
Clustering in Online shopping mall
Fig2: DSS framework for online shopping mall
Clustering in Online shopping mall
Fig3: Products clustering
Clustering in Spatial database
• A spatial database holds the large amounts of data obtained from
satellites for applications such as geo-marketing, traffic control,
and environmental studies.
• There exist many physical obstacles (rivers, lakes, highways),
and their presence may affect the result of clustering in spatial
databases.
Clustering in Spatial database
Fig. 4: Map of the population of western Canada, with highways and rivers
Clustering in Spatial database
• However, most clustering algorithms, such as DBRS and DBRS+
(Density-Based clustering with Random Sampling), cannot deal with obstacles,
• because they cannot handle clusters of widely varying densities in a
dataset efficiently.
• So a new clustering algorithm, MMO (mathematics-morphology-based
algorithm of obstacle clustering), was proposed for the problem of
clustering in the presence of obstacles.
Clustering in Spatial database
• The main contribution is two new operators, Connector and
Obstor, which are more accurate than the ordinary
operators (open and close).
Clustering in Spatial database
• Result: the performance tests show that MMO is
more suitable and effective for large data sets than
the other algorithms.
Fig. 5: Runtime of MMO compared with DBRS+
Conclusion
The main purpose of Data mining is to provide decision support to
managers and business professionals through knowledge
discovery
- Analyzes vast stores of historical business data
- Tries to discover patterns, trends, and correlations hidden in the
data that can help a company improve its business performance
- Uses regression, decision trees, neural networks, cluster analysis, or
market basket analysis
- Cluster analysis is a common method in data mining tools
- Clustering methods and their algorithms are effective techniques used
by many applications today, and they remain a good research area for
the DSS field.
References
1. Cranias, L., Papageorgiou, H., & Piperidis, S. (1994). Clustering: A
technique for search space reduction in example-based machine
translation. IEEE.
2. Pattabiraman, V. (2012). Optimization of spatial join using constraints-based
clustering techniques. Engineering and Computer Innovations.
3. Wang, X., Rostoker, C., & Hamilton, H. J. (2004). Density-based spatial clustering
in the presence of obstacles and facilitators.
4. Wang, X., & Hamilton, H. (n.d.). Using clustering methods in geospatial
information systems.
5. Wenxing, H., Weng, Y., Xie, L., & Maoqing, L. (2009). Design and implementation
of Web-based DSS for online shopping mall. IEEE.
6. Zhang, Q. (2008). A mathematics morphology based algorithm of obstacles
clustering. International Conference on Computer Science and Software
Engineering.
