CLUSTERING TECHNIQUES: OVERVIEW AND APPLICATIONS
INDEX
Introduction
Clustering
Clustering Techniques
Pros and Cons of Clustering Techniques
Applications of Clustering Techniques
Conclusion
Future Work
INTRODUCTION
• Clustering saw early use in biology in the 1960s, where it was employed to classify species.
• In this data-driven era, effective data organization and analysis methods play a
major role in gaining insights from data.
• From marketing to social network analysis, clustering has kept evolving and is now
an essential tool for sorting and categorizing data, supporting pattern detection, data
analysis, and interpretation.
• Clustering is an unsupervised data analysis technique that groups a set of objects
such that objects in the same group (cluster) are more similar to each other than to
those in other groups.
Example: Clustering Grocery Items (eggs, bananas, milk, bread)
TECHNIQUES
Partitional Clustering – K-Means
Hierarchical-Based Clustering – BIRCH
Density-Based Clustering – DBSCAN
Grid-Based Clustering – STING
Model-Based Clustering – Gaussian Mixture Model
Partitional Clustering – K-Means
• Partitional clustering divides a dataset into non-overlapping partitions or clusters, where
each data point belongs to exactly one cluster.
• K-means clustering groups an unlabelled dataset into a predefined number of clusters (k), so that
similar data points are grouped together to reveal underlying patterns.
Phases:
• Initialization: choose k initial centroids
• Categorize and update: assign each point to its nearest centroid, then recompute the centroids
• Repeat until the assignments stop changing
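The three phases above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `seed` and `max_iters` parameters, and the choice of random data points as initial centroids are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as starting centroids (assumed strategy).
    centroids = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(max_iters):
        # Categorize: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Repeat until the assignments stop changing.
        if new_labels == labels:
            break
        labels = new_labels
        # Update: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

# Hypothetical toy data: two visibly separate groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

On return, every point carries the label of its nearest centroid. The result can depend on the initial centroids, which is why practical implementations often run several initializations.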
Hierarchical-Based Clustering – BIRCH (Unsupervised)
Hierarchical clustering organizes elements in a hierarchical, tree-like structure.
BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.
BIRCH clusters a large data set with a single scan and improves the quality of the clustering
with a few additional scans.
BIRCH consists of the following stages:
• Building the CF (Clustering Feature) tree
• Global clustering
• Cluster refinement for accuracy
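The CF tree works because each node stores a compact summary of its points rather than the points themselves. A clustering feature is commonly described as the triple (N, LS, SS): the point count, the per-dimension linear sum, and the sum of squared norms. The sketch below shows only that summary and its additivity, not the tree itself; the class and method names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class ClusteringFeature:
    n: int       # number of points summarized
    ls: tuple    # linear sum of the points, per dimension
    ss: float    # sum of squared norms of the points

    @classmethod
    def from_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        # CF additivity: two summaries combine by component-wise addition.
        return ClusteringFeature(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # Root-mean-square distance of the summarized points from the centroid.
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

# Summarize three points one at a time, as a single scan would.
cf = ClusteringFeature.from_point((1.0, 2.0))
for p in [(3.0, 2.0), (2.0, 5.0)]:
    cf = cf.merge(ClusteringFeature.from_point(p))
```

Because centroid and radius come from the triple alone, a subcluster can absorb new points or merge with another subcluster without revisiting the raw data.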
Density-Based Clustering – DBSCAN
• Density-based clustering methods form clusters from regions of the feature space where
data points are densely packed.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) builds clusters around
core points: points that have at least a minimum number of neighbors within a specified
radius.
• Steps in the DBSCAN algorithm
• Classify each point as core, border, or noise, and set the noise points aside.
• Assign a cluster to each core point.
• Give all density-connected core points the same cluster, and assign each border point to the cluster of a core point within its radius.
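The steps above can be sketched as a short pure-Python routine. This is an illustrative version under stated assumptions: the parameter names `eps` (radius) and `min_pts` (minimum neighbor count, counting the point itself) follow common usage rather than anything in the slides, and the brute-force neighbor search is chosen for clarity, not speed.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned; do not expand again
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(reach)
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Here the two dense groups receive cluster ids 0 and 1, and the isolated point at (50, 50) is labeled as noise.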
Grid-Based Clustering – STING
• Grid-based clustering partitions the dataset into a grid structure,
organizing data points into cells for efficient clustering based on spatial
proximity.
• STING (Statistical Information Grid) is an approach that
partitions the data into a hierarchical grid and examines the clusters at
different levels of detail.
• Phases of STING:
• Grid Construction & Cell Assignment
• Density Calculation & Cluster Identification
• Border Point Assignment & Noise Identification
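To make the phases concrete, here is a deliberately simplified, single-level grid clustering sketch. It is not STING itself, which keeps statistics for cells at several resolutions; it only shows cell assignment, a density threshold, and merging of adjacent dense cells. The function name and the `cell_size` and `min_count` parameters are assumptions for illustration, and border-point handling is omitted.

```python
from collections import defaultdict

def grid_cluster(points, cell_size, min_count):
    """Label each 2-D point with a cluster id, or -1 if its cell is sparse."""
    # Grid construction and cell assignment.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // cell_size), int(y // cell_size))].append(idx)

    # Density calculation: a cell is dense if it holds enough points.
    dense = {c for c, members in cells.items() if len(members) >= min_count}

    # Cluster identification: flood-fill over edge-adjacent dense cells.
    cell_label = {}
    next_id = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        stack = [start]
        cell_label[start] = next_id
        while stack:
            cx, cy = stack.pop()
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in cell_label:
                    cell_label[nb] = next_id
                    stack.append(nb)
        next_id += 1

    # Noise identification: points in sparse cells keep the label -1.
    labels = [-1] * len(points)
    for c, members in cells.items():
        for idx in members:
            labels[idx] = cell_label.get(c, -1)
    return labels

# Hypothetical toy data on a unit grid.
points = [(0.1, 0.2), (0.4, 0.7), (0.8, 0.3), (1.2, 0.5), (1.6, 0.4),
          (5.1, 5.2), (5.5, 5.5), (9.0, 1.0)]
labels = grid_cluster(points, cell_size=1.0, min_count=2)
```

The clustering work happens on cells rather than on individual points, which is where grid-based methods get their efficiency.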
Model-Based Clustering – Gaussian Mixture
• Model-based clustering assigns data points to clusters based on
probabilistic models representing the data distribution.
• "Gaussian Mixture is a statistical model that identifies subgroups within
a population using a combination of Gaussian distributions."
• It iteratively optimizes its parameters with the expectation-maximization (EM)
algorithm, which estimates the cluster means, covariances, and mixture
weights.
• Steps of the Gaussian mixture algorithm:
• Initialization
• Expectation Step (E-Step) & Maximization Step (M-Step)
• Convergence Check
• Iteration
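The EM loop can be sketched for the one-dimensional case, where each covariance reduces to a single variance. This is a minimal illustration under assumptions: the function name, the unit starting variances, the equal starting weights, the log-likelihood tolerance, and the variance floor are all choices made for the example.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, means, n_iters=200, tol=1e-8):
    """Fit a 1-D Gaussian mixture by EM, starting from the given initial means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k      # initialization (assumed starting values)
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    for _ in range(n_iters):
        # E-step: responsibility of each component for each data point.
        resp = []
        ll = 0.0
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # Convergence check on the log-likelihood.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
    return weights, means, variances

# Hypothetical toy data: two groups centered near 1 and 5.
data = [0.9, 1.0, 1.1, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
weights, means, variances = fit_gmm_1d(data, means=[0.0, 6.0])
```

Unlike K-means, each point receives a soft responsibility for every component, so cluster membership is probabilistic rather than exclusive.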
PROS AND CONS OF CLUSTERING TECHNIQUES
Cons:
• Parameter subjectivity
• Difficulty with high-dimensional data
• Evaluation difficulty
• Shape Assumptions
• Noise handling
Pros:
• Pattern finding
• Exploration
• Feature Discovery
• Data compression
• Scalability
Applications of Clustering Techniques
• Customer Segmentation: Grouping customers into distinct segments based on attitudes
and behavior for targeted marketing strategies.
• Anomaly Detection: Identifying unusual patterns or outliers in datasets that deviate
significantly from normal behavior.
• Image Segmentation: Partitioning an image into regions with similar attributes, for object
recognition and image analysis tasks.
• Recommendation Systems: Grouping users or items into clusters based on preferences
or similarities to provide personalized recommendations in e-commerce or content
platforms.
• Document Clustering: Automatically grouping similar documents for efficient
information retrieval, text summarization, and content-based recommendation systems.
CONCLUSION
Clustering techniques offer a flexible approach to unsupervised learning, applicable
across diverse datasets and domains. By grouping similar data points, clustering
facilitates exploration and recognition of underlying patterns, leading to valuable
insights.
Clustering algorithms automate data grouping tasks, saving time and enabling
efficient analysis of large datasets. Clustering finds use in marketing, healthcare,
finance, and more, for tasks like customer segmentation and anomaly detection.
FUTURE RESEARCH
• Adaptability to diverse data types, including text, image, and graph data.
• Improving visualization of clustering results.
• Integration with machine learning for predictive modeling.
• Addressing privacy concerns with privacy-preserving techniques.
• Tailoring clustering methods for domain-specific applications.

Clustering: Grouping all Data for Insights
