Cluster Analysis: Basic Concepts and Algorithms
What is Cluster Analysis?
  • Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
Applications of Cluster Analysis
  • Understanding: group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
  • Summarization: reduce the size of large data sets
Types of Clustering
  • A clustering is a set of clusters
  • There is an important distinction between hierarchical and partitional sets of clusters
  • Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
  • Hierarchical clustering: a set of nested clusters organized as a hierarchical tree
Clustering Algorithms
  • K-means
  • Hierarchical clustering
  • Graph-based clustering
K-means Clustering
  • Partitional clustering approach
  • Each cluster is associated with a centroid (center point)
  • Each point is assigned to the cluster with the closest centroid
  • The number of clusters, K, must be specified
  • The basic algorithm is very simple
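The slide calls the basic algorithm very simple but does not spell it out. The following is a minimal sketch of the usual formulation, assuming squared Euclidean distance, random initial centroids, and a stop when no point changes cluster. The function name `kmeans` and its parameters are illustrative, not taken from the slides.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Select K of the points as the initial centroids (chosen at random here).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each point to the cluster with the closest centroid
        # (squared Euclidean distance is an assumption of this sketch).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Stop when no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels
```

Calling `kmeans(points, 2)` returns the final centroids and one cluster label per point. Changing `seed` changes the initial centroids, so different runs can end in different clusterings.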
K-means Clustering: Details
  • Initial centroids are often chosen randomly
  • Clusters produced vary from one run to another
  • The centroid is (typically) the mean of the points in the cluster
  • ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
  • K-means will converge for the common similarity measures mentioned above
K-means Clustering: Details
  • Most of the convergence happens in the first few iterations
  • Often the stopping condition is changed to ‘Until relatively few points change clusters’
  • Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
Two different K-means Clusterings
  [Figure panels: Sub-optimal Clustering; Optimal Clustering]
Problems with Selecting Initial Points
  • If there are K ‘real’ clusters, then the chance of selecting one initial centroid from each cluster is small
  • The chance is relatively small when K is large
  • If the clusters are all the same size, n, then P = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids) = $\frac{K!\,n^K}{(Kn)^K} = \frac{K!}{K^K}$
  • For example, if K = 10, then the probability is 10!/10^10 = 0.00036
  • Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t
  • Consider an example of five pairs of clusters
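The K = 10 figure can be checked with a few lines of arithmetic. This sketch assumes equal-size clusters, so the probability reduces to K!/K^K. The function name is illustrative.

```python
from math import factorial

def prob_one_centroid_per_cluster(k):
    """Chance that K random initial centroids fall one in each of K equal-size clusters."""
    return factorial(k) / k ** k

# The probability shrinks quickly as K grows; K = 10 gives roughly 0.00036.
for k in (2, 5, 10):
    print(k, prob_one_centroid_per_cluster(k))
```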
Solutions to Initial Centroids Problem
  • Multiple runs: helps, but probability is not on your side
  • Sample the data and use hierarchical clustering to determine initial centroids
  • Select more than K initial centroids and then select among these initial centroids: pick the most widely separated
  • Bisecting K-means: not as susceptible to initialization issues
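One way to read "select more than K initial centroids and pick the most widely separated" is a greedy farthest-first pass over a random set of candidates. The slides do not specify a procedure, so this is a sketch under that assumption, and `widely_separated_centroids` and `n_candidates` are hypothetical names.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def widely_separated_centroids(points, k, n_candidates, seed=0):
    """Draw n_candidates >= k random points, then greedily keep k that are far apart."""
    rng = random.Random(seed)
    candidates = rng.sample(points, n_candidates)
    chosen = [candidates[0]]
    while len(chosen) < k:
        # Pick the candidate whose nearest already-chosen centroid is farthest away.
        nxt = max(candidates, key=lambda c: min(sq_dist(c, s) for s in chosen))
        chosen.append(nxt)
    return chosen
```

The result can be used as the starting centroids of a K-means run in place of a purely random choice. Note that this heuristic tends to pick outliers, which is one reason it is only one of several options on the slide.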
Evaluating K-means Clusters
  • The most common measure is the Sum of Squared Error (SSE)
  • For each point, the error is the distance to the nearest cluster centroid
  • To get the SSE, we square these errors and sum them: $\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$
  • x is a data point in cluster C_i and m_i is the representative point for cluster C_i
  • One can show that m_i corresponds to the center (mean) of the cluster
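As a small illustration of the SSE definition, the sketch below sums the squared Euclidean distance from each point to the centroid of its assigned cluster. The argument layout (a list of points, a parallel list of cluster labels, and a list of centroids) is an assumption made for the example.

```python
def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
        for p, l in zip(points, labels)
    )

points = [(0.0,), (2.0,), (10.0,), (12.0,)]
# Two clusters with centroids at their means: each point contributes 1.
print(sse(points, [0, 0, 1, 1], [(1.0,), (11.0,)]))  # 4.0
# One cluster with centroid at the overall mean: a much larger error.
print(sse(points, [0, 0, 0, 0], [(6.0,)]))           # 104.0
```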
Evaluating K-means Clusters
  • Given two clusterings, we can choose the one with the smaller error
  • One easy way to reduce SSE is to increase K, the number of clusters
  • A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
Limitations of K-means
  • K-means has problems when clusters are of differing sizes or densities, or have non-globular shapes
  • K-means has problems when the data contains outliers
  • The number of clusters (K) is difficult to determine
Hierarchical Clustering
  • Produces a set of nested clusters organized as a hierarchical tree
  • Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits
Strengths of Hierarchical Clustering
  • Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
  • The clusters may correspond to meaningful taxonomies, for example in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Clustering
  • Two main types of hierarchical clustering
  • Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  • Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters)
Agglomerative Clustering Algorithm
  • The more popular hierarchical clustering technique
  • The basic algorithm is straightforward:
    1. Compute the proximity matrix
    2. Let each data point be a cluster
    3. Repeat
    4.   Merge the two closest clusters
    5.   Update the proximity matrix
    6. Until only a single cluster remains
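The steps above can be sketched in a few lines. The slide leaves the definition of cluster proximity open, so this sketch assumes single link (MIN) with Euclidean distance and stores the proximity matrix as a dictionary. The function name and the optional `k` stopping parameter are illustrative.

```python
from itertools import combinations

def agglomerative(points, k=1):
    """Merge the two closest clusters until k remain. Returns (clusters, merges)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    # Let each data point be its own cluster (clusters hold point indices).
    clusters = {i: [i] for i in range(len(points))}
    # Compute the proximity matrix, stored as a dict keyed by (smaller id, larger id).
    prox = {(i, j): dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}
    merges = []  # (cluster id, cluster id, distance): the dendrogram's merge record
    next_id = len(points)
    while len(clusters) > k:
        # Merge the two closest clusters.
        (a, b), d = min(prox.items(), key=lambda kv: kv[1])
        merged = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        # Update the proximity matrix. Single link (an assumption of this sketch):
        # the new cluster is as close to c as the closer of a and b was.
        new_prox = {}
        for c in clusters:
            d_ac = prox[(min(a, c), max(a, c))]
            d_bc = prox[(min(b, c), max(b, c))]
            new_prox[(c, next_id)] = min(d_ac, d_bc)
        prox = {pair: v for pair, v in prox.items()
                if a not in pair and b not in pair}
        prox.update(new_prox)
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values()), merges
```

With the default `k=1` the loop runs until a single cluster remains and `merges` holds the full merge sequence. Passing a larger `k` stops early, which corresponds to cutting the dendrogram at the level that leaves k clusters.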
Hierarchical Clustering: Group Average
  • Compromise between single link and complete link
  • Strengths: less susceptible to noise and outliers
  • Limitations: biased towards globular clusters
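To make the "compromise" concrete, the sketch below computes the three inter-cluster proximities for two small clusters, using the standard definitions: single link (MIN) is the smallest pairwise distance, complete link (MAX) is the largest, and group average is the mean over all pairs. Euclidean distance and the function names are assumptions of this example.

```python
from itertools import product

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(A, B):
    """MIN: distance between the two closest points, one from each cluster."""
    return min(euclidean(p, q) for p, q in product(A, B))

def complete_link(A, B):
    """MAX: distance between the two farthest points, one from each cluster."""
    return max(euclidean(p, q) for p, q in product(A, B))

def group_average(A, B):
    """Average of all pairwise distances between the two clusters."""
    return sum(euclidean(p, q) for p, q in product(A, B)) / (len(A) * len(B))

A = [(0.0,), (1.0,)]
B = [(3.0,), (5.0,)]
print(single_link(A, B), group_average(A, B), complete_link(A, B))  # 2.0 3.5 5.0
```

The group average always lies between the single-link and complete-link values, which is the sense in which it is a compromise between the two.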
Hierarchical Clustering: Time and Space Requirements
  • O(N^2) space, since it uses the proximity matrix (N is the number of points)
  • O(N^3) time in many cases: there are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched
  • Complexity can be reduced to O(N^2 log N) time for some approaches
Hierarchical Clustering: Problems and Limitations
  • Once a decision is made to combine two clusters, it cannot be undone
  • No objective function is directly minimized
  • Different schemes have problems with one or more of the following:
    • Sensitivity to noise and outliers (MIN)
    • Difficulty handling different-sized clusters and non-convex shapes (Group average, MAX)
    • Breaking large clusters (MAX)
Conclusion
  • The purpose of clustering in data mining and its types are discussed
  • The K-means and hierarchical algorithms are explained in detail, and their pros and cons are analyzed
Visit more self-help tutorials
  • Pick a tutorial of your choice and browse through it at your own pace
  • The tutorials section is free and self-guiding, and does not involve any additional support
  • Visit us at www.dataminingtools.net

Cluster Analysis

  • 1. Cluster Analysis: Basic Concepts and Algorithms
  • 2. What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
  • 3. Applications of Cluster Analysis: Understanding: group genes and proteins that have similar functionality, or group stocks with similar price fluctuations. Summarization: reduce the size of large data sets.
  • 4. Types of Clustering: A clustering is a set of clusters. There is an important distinction between hierarchical and partitional sets of clusters. Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
  • 5. Clustering Algorithms: K-means; hierarchical clustering; graph-based clustering.
  • 6. K-means Clustering: A partitional clustering approach. Each cluster is associated with a centroid (center point). Each point is assigned to the cluster with the closest centroid. The number of clusters, K, must be specified. The basic algorithm is very simple.
  • 7. K-means Clustering – Details: Initial centroids are often chosen randomly. Clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above.
  • 8. K-means Clustering – Details: Most of the convergence happens in the first few iterations. Often the stopping condition is changed to ‘until relatively few points change clusters’. Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
  • 9. Two different K-means Clusterings (figure panels: Sub-optimal Clustering; Optimal Clustering)
  • 10. Problems with Selecting Initial Points: If there are K ‘real’ clusters, then the chance of selecting one centroid from each cluster is small. The chance is relatively small when K is large. If the clusters are all the same size, n, then P = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids) = K! n^K / (Kn)^K = K!/K^K. For example, if K = 10, then probability = 10!/10^10 = 0.00036. Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t. Consider an example of five pairs of clusters.
  • 11. Solutions to Initial Centroids Problem: Multiple runs (helps, but probability is not on your side). Sample and use hierarchical clustering to determine initial centroids. Select more than K initial centroids and then select among these initial centroids, choosing the most widely separated. Bisecting K-means (not as susceptible to initialization issues).
  • 12. Evaluating K-means Clusters: The most common measure is Sum of Squared Error (SSE). For each point, the error is the distance to the nearest cluster centroid. To get SSE, we square these errors and sum them: SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist(m_i, x)^2, where x is a data point in cluster C_i and m_i is the representative point for cluster C_i. One can show that m_i corresponds to the center (mean) of the cluster.
  • 13. Evaluating K-means Clusters: Given two clusterings, we can choose the one with the smaller error. One easy way to reduce SSE is to increase K, the number of clusters. Even so, a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
  • 14. Limitations of K-means: K-means has problems when clusters differ in size or density, or have non-globular shapes. K-means has problems when the data contains outliers. The number of clusters (K) is difficult to determine.
  • 15. Hierarchical Clustering: Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram, a tree-like diagram that records the sequences of merges or splits.
  • 16. Strengths of Hierarchical Clustering: Do not have to assume any particular number of clusters; any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level. The clusters may correspond to meaningful taxonomies, for example in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …).
  • 17. Hierarchical Clustering: Two main types of hierarchical clustering. Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left. Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).
  • 18. Agglomerative Clustering Algorithm: The more popular hierarchical clustering technique. The basic algorithm is straightforward: (1) compute the proximity matrix; (2) let each data point be a cluster; (3) repeat: merge the two closest clusters and update the proximity matrix; (4) until only a single cluster remains.
  • 19. Hierarchical Clustering: Group Average: A compromise between single link and complete link. Strengths: less susceptible to noise and outliers. Limitations: biased towards globular clusters.
  • 20. Hierarchical Clustering: Time and Space Requirements: O(N^2) space, since it uses the proximity matrix, where N is the number of points. O(N^3) time in many cases: there are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched. Complexity can be reduced to O(N^2 log N) time for some approaches.
  • 21. Hierarchical Clustering: Problems and Limitations: Once a decision is made to combine two clusters, it cannot be undone. No objective function is directly minimized. Different schemes have problems with one or more of the following: sensitivity to noise and outliers (MIN); difficulty handling different-sized clusters and non-convex shapes (group average, MAX); breaking large clusters (MAX).
  • 22. Conclusion: The purpose of clustering in data mining and its types are discussed. The K-means and hierarchical algorithms are explained in detail, and their pros and cons are analyzed.
  • 23. Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net
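
The basic K-means loop from slides 6 to 8 and the SSE measure from slide 12 can be sketched roughly as follows. This is a minimal illustration, not the deck's own code: it assumes Euclidean distance, picks K data points at random as initial centroids, and stops when no point changes cluster. The names `kmeans` and `sse` and the `seed` and `max_iter` parameters are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means with Euclidean distance; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are K data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels


def sse(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
error = sse(points, centroids, labels)
```

Running `kmeans` with several seeds and keeping the result with the lowest `sse` corresponds to the "multiple runs" remedy on slide 11.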
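
The arithmetic on slide 10 is easy to check. The sketch below evaluates K!/K^K, the probability of drawing exactly one initial centroid from each of K equal-sized clusters; the function name is hypothetical.

```python
import math


def prob_one_centroid_per_cluster(k):
    """K!/K^K: chance that K random centroids land in K distinct equal-sized clusters."""
    return math.factorial(k) / k ** k


p10 = prob_one_centroid_per_cluster(10)  # slide 10's example, K = 10
p2 = prob_one_centroid_per_cluster(2)
```

For K = 10 this gives about 0.00036, matching the slide, and the value keeps shrinking as K grows.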
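
Slide 11 suggests selecting more than K candidate centroids and keeping the most widely separated ones. The slide does not say how to pick them, so the sketch below swaps in one plausible reading: a greedy farthest-first heuristic over a random sample of candidates. The function name and the `n_candidates` and `seed` parameters are assumptions.

```python
import math
import random


def widely_separated_centroids(points, k, n_candidates, seed=0):
    """Pick k of n_candidates randomly sampled points, greedily maximizing separation."""
    rng = random.Random(seed)
    candidates = rng.sample(points, n_candidates)  # more than k candidates
    chosen = [candidates[0]]
    while len(chosen) < k:
        # Add the candidate whose distance to its nearest chosen centroid is largest.
        best = max(candidates, key=lambda c: min(math.dist(c, s) for s in chosen))
        chosen.append(best)
    return chosen


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
chosen = widely_separated_centroids(points, k=2, n_candidates=6)
```

A heuristic like this tends to pick outliers as centroids, so it is usually applied to a sample rather than to the full data set.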
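
The agglomerative procedure on slide 18, combined with the group-average proximity from slide 19, might look like the following. It is a simplified sketch: for brevity it recomputes cluster proximities at every step instead of maintaining and updating a proximity matrix, so it is slower than the costs quoted on slide 20. The function names and the `k` and `linkage` parameters are hypothetical.

```python
import math
from itertools import combinations


def group_average(a, b):
    """Average pairwise Euclidean distance between the points of clusters a and b."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerative(points, k=1, linkage=group_average):
    """Merge the two closest clusters until k remain; returns (clusters, merges)."""
    clusters = [[p] for p in points]  # start with each point as its own cluster
    merges = []  # merge history, in order (the information a dendrogram records)
    while len(clusters) > k:
        # Find the pair of clusters with the smallest proximity.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, merges = agglomerative(points, k=2)
```

Passing a different `linkage` function, such as the minimum or maximum pairwise distance, would give the MIN (single link) and MAX (complete link) schemes named on slide 21.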