CLUSTER ANALYSIS
ALGORITHMS
By
SHWETAPADMA BABU
Regd no: 2105040010
M.Tech., Department of Computer Sc. & Engineering
OUTLINE
• What is clustering?
• Types of clustering
• Types of clustering algorithms
• K-Means clustering
• CLARA
• The K-Medoid clustering method
• BIRCH
• Density-based clustering
• DBSCAN
• Grid-based clustering
• Hierarchical clustering
• Dendrogram
• Conclusion
• References
What is Clustering?
• Clustering is the task of dividing a population or set of data points into groups such that points in the same group are more similar to one another than to points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them to clusters.
• The method of identifying similar groups of data in a data set is called clustering. Entities in each group are comparatively more similar to the other entities of that group than to those of the other groups.
Types of Clustering
• Hard Clustering: each data point either belongs to a cluster completely or not at all, so every observation belongs to exactly one cluster.
• Soft Clustering: instead of assigning each data point to exactly one cluster, each data point can belong to more than one cluster to a certain degree (e.g. a likelihood or probability of belonging to each cluster).
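The difference is easy to see in code. Below is a minimal sketch, assuming scikit-learn and NumPy are available and using an illustrative toy data set: K-Means gives each point exactly one label, while a Gaussian mixture returns a degree of membership for every cluster.

```python
# Hard vs. soft clustering on a toy 2-D data set.
# Minimal sketch; assumes scikit-learn and NumPy are installed.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[185, 72], [170, 56], [168, 60], [179, 68]], dtype=float)

# Hard clustering: every point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("hard labels:", hard_labels)

# Soft clustering: every point gets a degree of membership in each cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("soft memberships:")
print(gmm.predict_proba(X))   # each row sums to 1
```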
Types of Clustering Algorithms
• Connectivity models: these models are based on the notion that data points closer together in data space exhibit more similarity to each other than data points lying farther away. They are built either bottom-up (agglomerative approach) or top-down (divisive approach).
Types of Clustering Algorithms (Continue…)
• Distribution models: these clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, a normal/Gaussian distribution).
• Distance-based/Centroid models: these are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of the clusters. The K-Means clustering algorithm and CLARA are popular algorithms that fall into this category.
• Density models:
• These models search the data space for regions of varying density of data points. They isolate the different density regions and assign the data points within each dense region to the same cluster.
• This method is based on the notion of density. The basic idea is to keep growing a given cluster as long as the density in its neighborhood exceeds some threshold, i.e. for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
K-Means Clustering
• K-Means is an iterative clustering algorithm that converges to a locally optimal clustering. The algorithm works in these 5 steps:
1. Specify the desired number of clusters K: let us choose k = 2 for these 5 data points in 2-D space.
2. Randomly assign each data point to a cluster.
3. Compute the cluster centroids: the centroid of the data points in the red cluster is shown with a red cross, and that of the grey cluster with a grey cross.
4. Re-assign each point to the closest cluster centroid.
5. Re-compute the cluster centroids.
Repeat steps 4 and 5 until no further improvement is possible.
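The five steps above translate almost directly into code. The following is a minimal NumPy sketch of the loop; the helper name kmeans, its parameters, and the empty-cluster guard are my own illustrative choices, not part of the slides.

```python
# Minimal NumPy sketch of the five K-means steps above (illustrative only).
import numpy as np

def kmeans(X, k=2, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and randomly assign every point to a cluster.
    labels = rng.integers(0, k, size=len(X))
    centroids = None
    for _ in range(max_iter):
        # Steps 3 and 5: compute the centroid of each cluster
        # (re-seed an empty cluster with a random point to stay well-defined).
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Step 4: re-assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no further improvement -> stop
            break
        labels = new_labels
    return labels, centroids

X = np.array([[185, 72], [170, 56], [168, 60], [179, 68]], dtype=float)
print(kmeans(X, k=2))
```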
EXAMPLE
Height (x0)   Weight (y0)
185           72
170           56
168           60
179           68
Let k = 2 for this problem.
Step 1: Randomly take two centroids for k = 2, here (185,72) and (170,56), so K1 = (185,72) and K2 = (170,56).
Step 2: Calculate the distance of each point from the K centroids. Here we use the Euclidean distance d = sqrt((x0 − xc)² + (y0 − yc)²). To check which cluster the point (168,60) goes to, first find the distances:
(168,60) => K1: sqrt((168 − 185)² + (60 − 72)²) ≈ 20.81
            K2: sqrt((168 − 170)² + (60 − 56)²) ≈ 4.47
Step 3: As 4.47 is smaller, (168,60) goes to the K2 cluster. As the cluster is updated, we recompute K2: K2 = ((170 + 168)/2, (56 + 60)/2) = (169, 58). This is now the updated K2 value.
Step 4: Repeat steps 2 and 3 for the remaining data points.
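The same arithmetic can be checked with a few lines of NumPy (a minimal sketch; assumes NumPy is installed).

```python
# Reproducing Step 2 and Step 3 of the worked K-means example.
import numpy as np

k1 = np.array([185.0, 72.0])   # initial centroid K1
k2 = np.array([170.0, 56.0])   # initial centroid K2
p  = np.array([168.0, 60.0])   # the point being assigned

d1 = np.linalg.norm(p - k1)    # Euclidean distance to K1 -> ~20.81
d2 = np.linalg.norm(p - k2)    # Euclidean distance to K2 -> ~4.47
print(round(d1, 2), round(d2, 2))

# The point is closer to K2, so K2 becomes the mean of its two members.
k2 = (k2 + p) / 2              # -> [169., 58.]
print(k2)
```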
PROBLEM OF K-MEANS METHOD
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
Weaknesses:
 Applicable only when the mean is defined; then what about categorical data?
 Need to specify k, the number of clusters, in advance.
 Unable to handle noisy data and outliers.
CLARA (Clustering Large Applications)
• CLARA draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output.
Strength:
• Deals with larger data sets than PAM.
Weakness:
• Efficiency depends on the sample size.
• A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
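A rough sketch of CLARA's sampling idea is shown below. It assumes NumPy; the function name clara, its parameters, and the brute-force medoid search on each sample are illustrative stand-ins for PAM, not the original algorithm.

```python
# A rough sketch of CLARA's sampling idea (assumes NumPy). The brute-force
# medoid search on each sample stands in for PAM and only works for tiny
# samples; function and parameter names here are illustrative, not standard.
import numpy as np
from itertools import combinations

def total_cost(X, medoid_points):
    # Sum of distances from every point to its nearest medoid.
    d = np.linalg.norm(X[:, None, :] - medoid_points[None, :, :], axis=2)
    return d.min(axis=1).sum()

def clara(X, k=2, n_samples=5, sample_size=20, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
        # "PAM on the sample": here a brute-force search over medoid pairs.
        m = min(combinations(range(len(sample)), k),
                key=lambda c: total_cost(sample, sample[list(c)]))
        medoids = sample[list(m)]
        cost = total_cost(X, medoids)        # judge the result on the whole data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

X = np.random.default_rng(1).normal(size=(200, 2))
print(clara(X, k=2))
```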
THE K-MEDOID CLUSTERING METHOD
 K-Medoids Clustering: Find representative objects (medoids) in clusters
 PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987).
 Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-
medoids if it improves the total distance of the resulting clustering.
 PAM works effectively for small data sets, but does not scale well for large data sets (due to the
computational complexity).
 Efficiency improvement on PAM
 CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
 CLARANS (Ng & Han, 1994): Randomized re-sampling
EXAMPLE
I    X   Y
X1   2   6
X2   3   4
X3   3   8
X4   8   5
X5   7   4
Let k = 2 for this problem.
Step 1: Select two random medoids, here C1 = X2 = (3,4) and C2 = X5 = (7,4).
Step 2: Calculate the (Manhattan) distance of every remaining point to each medoid:
I    X(a)  Y(b)  C1(3,4): |a-3|+|b-4|   C2(7,4): |a-7|+|b-4|
X1   2     6     3                      7
X3   3     8     4                      8
X4   8     5     6                      2
Step 3: Compare the cost to C1 and C2 for every point i and select the minimum. So, from the table above, C1 = {X2, X1, X3} and C2 = {X5, X4}.
Step 4: Now calculate the total cost = 3 + 4 + 2 = 9.
Step 5: Now select one of the non-medoids as a new medoid and repeat steps 2, 3 and 4.
Step 6: After recalculating the medoids, calculate the cost again. If the old cost is smaller than the new cost, we undo the swap and terminate; otherwise we keep the swap and repeat all the steps until no swap reduces the cost.
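The cost calculation in Steps 2-4 can be reproduced with a short NumPy sketch (assumes NumPy is installed).

```python
# Manhattan distances to the two medoids and the assignment cost (Steps 2-4).
import numpy as np

points = {"X1": (2, 6), "X2": (3, 4), "X3": (3, 8), "X4": (8, 5), "X5": (7, 4)}
c1, c2 = np.array(points["X2"]), np.array(points["X5"])   # medoids (3,4) and (7,4)

total_cost = 0
for name, p in points.items():
    if name in ("X2", "X5"):          # the medoids themselves contribute no cost
        continue
    p = np.array(p)
    d1 = np.abs(p - c1).sum()         # Manhattan distance to C1
    d2 = np.abs(p - c2).sum()         # Manhattan distance to C2
    total_cost += min(d1, d2)
    print(name, int(d1), int(d2), "-> C1" if d1 <= d2 else "-> C2")

print("total cost:", int(total_cost))  # 3 + 4 + 2 = 9
```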
BIRCH (BALANCED ITERATIVE REDUCING AND
CLUSTERING USING HIERARCHIES)
 It is a clustering algorithm that can cluster large datasets by first generating a small and compact summary of the large dataset that retains as much information as possible. This smaller summary is then clustered instead of the larger dataset.
 Phase 1: scan the database to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data).
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree.
 BIRCH has one major drawback: it can only process metric attributes (integers or reals). A metric attribute is any attribute whose values can be represented in Euclidean space, i.e. no categorical attributes should be present.
 Two important terms: Clustering Feature (CF) and CF tree.
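For reference, scikit-learn ships a Birch estimator whose parameters map onto the two phases above. A minimal sketch, assuming scikit-learn is installed and using illustrative parameter values:

```python
# Minimal BIRCH sketch using scikit-learn's Birch estimator.
# Assumes scikit-learn; the threshold/branching_factor values are illustrative.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),
               rng.normal(5, 0.5, size=(100, 2))])

# Phase 1 builds the in-memory CF tree; phase 2 clusters its leaf entries.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(np.bincount(labels))   # roughly 100 points per cluster
```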
DENSITY BASED CLUSTERING METHODS
 These methods consider clusters to be dense regions, whose points are similar to each other and different from the points in the lower-density regions of the space. These methods have good accuracy and the ability to merge two clusters.
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 Several interesting studies:
 Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN
 Some definitions first:
 Epsilon: this is also called eps. This is the distance within which we look for the neighbouring points.
 Min_points: the minimum number of points, specified by the user, that must lie inside the eps radius for a point to be a core point.
 Core Points: If the number of points inside the eps
radius of a point is greater than or equal to the
min_points then it's called a core point.
 Border Points: If the number of points inside the
eps radius of a point is less than the min_points
and it lies within the eps radius region of a core
point, it's called a border point.
 Noise: A point which is neither a core nor a border
point is a noise point.
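These definitions can be applied directly with a few lines of NumPy. The sketch below labels a toy set of points as core, border or noise; the data, eps and min_points values are illustrative, and each point is counted among its own neighbours.

```python
# Labelling toy points as core, border or noise using the definitions above.
# Pure NumPy sketch; the data, eps and min_points values are illustrative.
import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.5, 1.4],
              [1.9, 1.9], [5.0, 5.0], [9.0, 9.0]])
eps, min_points = 1.0, 3

dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
neighbour_counts = (dist <= eps).sum(axis=1)      # includes the point itself
is_core = neighbour_counts >= min_points
# A border point is not core but lies within eps of at least one core point.
is_border = ~is_core & ((dist <= eps) & is_core[None, :]).any(axis=1)

for point, c, b in zip(X, is_core, is_border):
    print(point, "core" if c else "border" if b else "noise")
```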
ALGORITHM STEPS FOR DBSCAN
1. First assign the min_points and eps values. Then pick a random point that has not yet been assigned to a cluster or designated as an outlier.
2. Determine whether the point has at least min_points points inside its eps radius. If so, this point becomes a core point; else label the point as an outlier.
3. Once a core point has been found, add all points directly reachable from it to its cluster. Then make the neighbour jumps to each reachable point and add them to the cluster as well. If a point previously labelled as an outlier is added, relabel it as a border point.
4. Repeat the steps above until all points are classified into clusters or noise.
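In practice this procedure is available off the shelf. A minimal sketch with scikit-learn's DBSCAN on the same toy points (parameter values illustrative):

```python
# Running DBSCAN with scikit-learn on the same toy points.
# Assumes scikit-learn; eps and min_samples values are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.5, 1.4],
              [1.9, 1.9], [5.0, 5.0], [9.0, 9.0]])

db = DBSCAN(eps=1.0, min_samples=3).fit(X)
print("labels:", db.labels_)                       # -1 marks noise points
print("core point indices:", db.core_sample_indices_)
```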
GRID-BASED CLUSTERING METHOD
 In this method the data space is formulated into a finite number of cells that form a grid-like structure. All the clustering operations done on these grids are fast and independent of the number of data objects.
STING (Statistical Information Grid approach):
 The area is divided into rectangular cells at different levels of resolution, and these cells form a tree-like structure.
 Each cell at a high level contains a number of smaller cells at the next lower level.
 Statistical information about each cell is calculated and stored beforehand and is used to answer queries.
 Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells:
 count, mean, standard deviation, min, max
 type of distribution (normal, uniform, etc.)
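As a toy illustration of the grid idea, the sketch below bins 2-D points into a 4 × 4 grid and precomputes per-cell statistics that could later answer queries. It is pure NumPy; the grid resolution and the random data are illustrative.

```python
# A toy illustration of the grid idea: bin 2-D points into a 4 x 4 grid and
# precompute per-cell statistics (count, mean, std) for later queries.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 2))

n_cells = 4
cell_ids = np.floor(X / (10 / n_cells)).astype(int).clip(0, n_cells - 1)

for i in range(n_cells):
    for j in range(n_cells):
        members = X[(cell_ids[:, 0] == i) & (cell_ids[:, 1] == j)]
        if len(members):
            print(f"cell ({i},{j}): count={len(members)}, "
                  f"mean={members.mean(axis=0).round(2)}, "
                  f"std={members.std(axis=0).round(2)}")
```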
Hierarchical clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. The algorithm starts with every data point assigned to a cluster of its own. Then the two nearest clusters are merged into the same cluster. In the end, the algorithm terminates when only a single cluster is left.
Hierarchical methods:
Agglomerative approach (bottom-up)
Divisive approach (top-down)
Agglomerative Approach:
• Bottom-up approach
• We start with each object forming a separate group. The algorithm keeps merging the objects or groups that are close to one another. It keeps doing so until all of the groups are merged into one or until the termination condition holds.
Divisive Approach:
• Top-down approach
• We start with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters. This is done until each object is in its own cluster or the termination condition holds.
Disadvantage: this method is rigid, i.e. once a merge or split is done, it can never be undone.
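A minimal agglomerative (bottom-up) example with scikit-learn, reusing the height/weight points from the K-means example (the n_clusters and linkage choices are illustrative):

```python
# Minimal agglomerative (bottom-up) clustering with scikit-learn.
# Assumes scikit-learn; n_clusters and linkage are illustrative choices.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[185, 72], [170, 56], [168, 60], [179, 68]], dtype=float)

# Each point starts in its own cluster; the two nearest clusters are merged
# repeatedly until only n_clusters groups remain.
agg = AgglomerativeClustering(n_clusters=2, linkage="average")
print(agg.fit_predict(X))
```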
Hierarchical clustering (Continue…)
Dendrogram: a dendrogram is the tree diagram produced by hierarchical clustering. It records the sequence of merges (or splits) and the distance at which each one occurred, so cutting the tree at a chosen height yields a particular clustering.
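A dendrogram can be produced with SciPy's hierarchy module. A minimal sketch, assuming SciPy and matplotlib are installed; the ward linkage and point labels are illustrative:

```python
# Building a dendrogram with SciPy's hierarchy module.
# Assumes SciPy and matplotlib; the ward linkage and labels are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array([[185, 72], [170, 56], [168, 60], [179, 68]], dtype=float)

Z = linkage(X, method="ward")   # records every merge and its distance
dendrogram(Z, labels=["P1", "P2", "P3", "P4"])
plt.ylabel("merge distance")
plt.show()
```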
CONCLUSION
 Cluster analysis groups objects based on their similarity and has wide applications. A measure of similarity can be computed for various types of data.
 Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, and grid-based methods.
 K-means and K-medoids algorithms are popular partitioning-based clustering algorithms.
 BIRCH and Chameleon are interesting hierarchical clustering algorithms.
 DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms.
 STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm.
 The quality of clustering results can be evaluated in various ways, such as determining the number of clusters in a data set and measuring clustering quality.
REFERENCES
1. www.google.com
2. www.youtube.com
3. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications.
4. M. R. Anderberg. Cluster Analysis for Applications. Academic Press.
5. M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure.
6. F. Beil, M. Ester, and X. Xu. "Frequent Term-Based Text Clustering".
7. M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-Based Local Outliers.
THANK YOU