Cluster analysis

 Clustering is a statistical classification approach for
unsupervised learning. Cluster analysis or clustering is
the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more
similar to each other than to those in other groups
(clusters).
 It is a main task of exploratory data mining, and a
common technique for statistical data analysis, used in
many fields including machine learning, pattern
recognition, image analysis and data compression.
 Clustering can be achieved by various algorithms that
differ significantly in their understanding of what
constitutes a cluster and how to efficiently find them.
Popular notions of clusters include groups with small
distances between cluster members, dense areas of the
data space, intervals or particular statistical
distributions.
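The "small distances between cluster members" notion can be checked directly. The sketch below is only an illustration: the function names and the two toy groups are assumptions, not part of the slides. It compares the mean distance inside one group with the mean distance between two groups.

```python
from itertools import combinations, product
from math import dist
from statistics import mean


def mean_intra_distance(cluster):
    """Mean pairwise Euclidean distance inside one cluster (needs >= 2 points)."""
    return mean(dist(a, b) for a, b in combinations(cluster, 2))


def mean_inter_distance(cluster_a, cluster_b):
    """Mean Euclidean distance between points of two different clusters."""
    return mean(dist(a, b) for a, b in product(cluster_a, cluster_b))


# Hypothetical toy data: two well-separated groups of 2-D points.
group_a = [(0, 0), (0, 1), (1, 0)]
group_b = [(10, 10), (10, 11), (11, 10)]

intra = mean_intra_distance(group_a)
inter = mean_inter_distance(group_a, group_b)
print(f"intra={intra:.2f} inter={inter:.2f}")
```

For a grouping that matches this notion of a cluster, the intra-cluster value should be clearly smaller than the inter-cluster value.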
Applications of unsupervised
machine learning
 Some applications of unsupervised machine learning
techniques are:
 Clustering automatically splits the dataset into groups
based on their similarities.
 Anomaly detection can discover unusual data points in
your dataset. It is useful for finding fraudulent
transactions.
 Association mining identifies sets of items that often
occur together in your dataset.
 Latent variable models are widely used for data
preprocessing, such as reducing the number of features in
a dataset or decomposing the dataset into multiple
components.
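As an illustration of the anomaly detection item above, here is a minimal sketch that flags values far from the mean. The z-score rule, the threshold of 2.0, the function name, and the transaction amounts are all assumptions chosen for the example. The slides do not prescribe a particular method.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts with one unusually large value.
amounts = [10, 12, 11, 13, 12, 11, 95]
suspicious = find_anomalies(amounts)
print(suspicious)
```

No labels are needed: the unusual point is identified only from how it differs from the rest of the data.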
Disadvantages of unsupervised
learning
 You cannot get precise information about how the data
is sorted or about the output, because the data used in
unsupervised learning is unlabeled and not known.
 The results are less accurate because the input data is
not known and not labeled by people in advance. This
means that the machine has to do this itself.
 The spectral classes do not always correspond to
informational classes.
 The user needs to spend time interpreting and labeling
the classes that result from the classification.
Cluster Models
 Connectivity models (hierarchical clustering):
hierarchical clustering builds models based on distance
connectivity.
 Centroid models: for example, the k-means algorithm
represents each cluster by a single mean vector.
 Distribution models: clusters are modeled using
statistical distributions, such as multivariate normal
distributions.
 Density models: DBSCAN and OPTICS define clusters
as connected dense regions in the data space.
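The centroid model above can be sketched in a few lines. This is a simplified version of the standard k-means iteration (Lloyd's algorithm), not a production implementation. As an assumption made to keep the example deterministic, it starts from the first k points instead of a random initialization, and the data is made up.

```python
from math import dist


def kmeans(points, k, iterations=100):
    """Simplified k-means: returns (centroids, labels)."""
    # Assumption: the first k points serve as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean vector of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
    return centroids, labels


# Hypothetical data: two well-separated groups of 2-D points.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(data, k=2)
print(labels)
```

Each cluster is summarized by a single mean vector, which is what distinguishes centroid models from the connectivity, distribution, and density models in the list.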
 Graph-based models: a clique, that is, a subset of nodes
in a graph such that every two nodes in the subset are
connected by an edge, can be considered a
prototypical form of cluster. Relaxations of the
complete connectivity requirement (a fraction of the
edges can be missing) are known as quasi-cliques.
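 Signed graph models: every path in a signed graph has
a sign given by the product of the signs on its edges.
Under the assumptions of balance theory, the graph
splits into two groups. The weaker “clusterability
axiom” (no cycle has exactly one negative edge) yields
results with more than two clusters, or subgraphs with
only positive edges.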
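The clique and quasi-clique definitions above can be tested on a small undirected graph. In this sketch the function names, the `gamma` parameter, and the example edges are assumptions. Quasi-cliques are defined in several ways in the literature, and the edge-density version used here is only one of them.

```python
from itertools import combinations


def edge_density(nodes, edges):
    """Fraction of node pairs in `nodes` that are joined by an edge."""
    edge_set = {frozenset(e) for e in edges}  # undirected edges
    pairs = list(combinations(nodes, 2))
    if not pairs:
        return 1.0  # zero or one node: trivially complete
    present = sum(1 for pair in pairs if frozenset(pair) in edge_set)
    return present / len(pairs)


def is_clique(nodes, edges):
    """True if every two nodes in the subset are connected by an edge."""
    return edge_density(nodes, edges) == 1.0


def is_quasi_clique(nodes, edges, gamma=0.8):
    """True if at least a fraction `gamma` of the possible edges is present."""
    return edge_density(nodes, edges) >= gamma


# Hypothetical graph: a triangle a-b-c plus a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(is_clique(["a", "b", "c"], edges))
print(is_clique(["a", "b", "c", "d"], edges))
print(is_quasi_clique(["a", "b", "c", "d"], edges, gamma=0.6))
```

Lowering `gamma` relaxes the complete connectivity requirement, so larger but less tightly connected subsets qualify as clusters.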
 Neural models: the best-known unsupervised neural
network is the self-organizing map. These models can
usually be characterized as similar to one or more of the
above models, including subspace models when neural
networks implement a form of principal component
analysis or independent component analysis.
 Subspace models: clusters are modeled with both cluster
members and relevant attributes.
 Group models: some algorithms do not provide a refined
model for their results and just provide the grouping
information.
