Clustering in Machine Learning
In this article, we will study an unsupervised learning-based technique known
as clustering in machine learning. Here, we will discuss what clustering really
is and what its types are, and we will look at the algorithms involved in the
clustering technique.
Clustering in Machine Learning
Let’s try to understand what clustering exactly is. Examples make the job a lot
easier.
So, as we know, there are two broad modes of learning: active and passive.
Passive means that the model follows a certain pre-written path and works
under supervision. This doesn’t mean that unsupervised learning therefore sits
on the active side; it is just that the human intervention in unsupervised
learning is quite minimal compared to supervised learning.
Also, we have unlabeled data in unsupervised learning. So, here, the algorithm
has to analyze the data entirely on its own, find patterns, and cluster the
data points that share similar features.
For example, suppose we provide a dataset consisting of images of two different
objects. The model will scan the images for certain features, and if some
images have matching features, it will group them into a cluster.
Note: Active learning is a different concept. It applies to semi-supervised
and reinforcement learning techniques.
Examples of Clustering in Machine Learning
A real-life example would be trying to solve a hard position in chess. The
possibilities for checkmating the king are endless; there is no predefined or
pre-set solution in chess. You have to analyze the positions, your pieces, and
the opponent’s pieces, and find a solution.
The traits you use to find the solution are the dataset; you are the model that
has to analyze them and find the answer. This is what unsupervised learning is.
Now let’s understand clustering. In clustering, we group data points into
clusters based on similar features rather than labels. The labelling part in
clustering, if needed at all, comes at the end, once clustering is over.
Also, we should add plenty of data to the dataset to increase the accuracy of
the results. The algorithm will learn various patterns in the dataset: it will
look for certain traits and features and compare the similarities between the
data points.
There is a difference here between classification and clustering. The labeling
that classification requires involves a lot of human intervention, which proves
both costly and time-consuming. Unsupervised learning avoids most of that
labelling effort, but the model has to do more processing because it must
analyze the data itself.
Let’s take an example. Suppose we have a dataset consisting of images of
tigers, zebras, leopards, and cheetahs.
The clustering algorithm would analyze this dataset and then divide the data
based on some specific characteristics. The characteristics would include fur
color, patterns (spots, stripes), face shape, etc.
The model would remember the pattern in which it classified the data. This
knowledge will come in handy for future unknown data.
We also have other applications of clustering like fake-news detection, fraud
detection, spam mail segregation, etc. Now that we have seen and understood
clustering, let’s dive a bit deeper.
There are various types of clustering that we should know about. These will
help us to further classify and understand the various algorithms that
unsupervised learning has.
These include:
● Centroid-based clustering
● Density-based clustering
● Distribution-based clustering
● Hierarchical clustering
● Grid clustering
Now let’s understand these one-by-one.
Types of Clustering in Machine Learning
1. Centroid-Based Clustering in Machine Learning
In centroid-based clustering, we form clusters around several points that act
as the centroids. The k-means clustering algorithm is the perfect example of
the Centroid-based clustering method. Here, we form k number of clusters
that have k number of centroids.
The algorithm assigns each datapoint to a cluster centre (centroid) based on
its proximity to the centroids.
The algorithm measures the Euclidean distance between the datapoint and all k
centroids; the datapoint gets clustered with the nearest centroid.
Also, k-means is the most widely used centroid-based clustering algorithm. The
primary aim of the algorithm is to partition an N-dimensional dataset into K
smaller clusters.
K-means Clustering in Machine Learning
Let’s try to understand more about k-means clustering. It is an iterative
clustering algorithm: it compares each datapoint’s proximity with the
centroids one by one in an iterative fashion. Each iteration improves the
clustering, and the algorithm converges to the best value it can reach from
its starting point (a local optimum).
Let’s understand the working of the algorithm with the help of some images:
Let’s say we have two different types of datapoints: red and blue.
Now, for the next step, let’s assign the value of k. Let’s say k=2, as we have
two types of data. k=2 means we will have two centroids, placed far from each
other.
Note: The initial centroids should be placed distant from each other to avoid
confusion and error.
The two-colored crosses are the centroids.
Now the algorithm will compare the distance of each point with the centroids.
Each point will join the cluster whose centroid is nearest to it; the
algorithm uses a distance metric for this purpose. The points are now assigned
to their respective clusters.
After this, the centroids are recomputed for the new clusters and re-adjust
their positions.
The last two steps keep repeating until the algorithm converges, or you could
say, until the centroids stop moving and everything becomes stable.
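Before turning to scikit-learn, here is a minimal NumPy sketch of those two
alternating steps. It is an illustration under simple assumptions (the data
sits in a 2-D array X, and no cluster ever ends up empty), not scikit-learn’s
actual implementation:
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k distinct datapoints as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        # (this toy version assumes no cluster ever becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged: the centroids stopped moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids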
So, let’s see the coding part of it:
Step 1: First, we will import all the necessary libraries. We have the basic
import libraries:
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot
from numpy import unique
from numpy import where
Here, make_classification generates the dataset, and KMeans imports the model
for the k-means algorithm.
If we need the distinct elements of a NumPy array, we can extract them with
the unique function.
As for the where function, you can think of it as a conditional operator in
mathematics: you give it a condition, and it returns the indices at which the
condition holds. Remember that both of these functions belong to NumPy.
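A quick illustration of these two functions (the array below is made up purely
for demonstration):
import numpy as np

B = np.array([0, 2, 1, 2, 0, 1, 1])
print(np.unique(B))      # distinct values: [0 1 2]
print(np.where(B == 1))  # indices where the condition holds: (array([2, 5, 6]),)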
Here, we have used sklearn.datasets so that we can generate a dataset with
precisely controlled parameters.
Step-2: Let’s try to define the dataset. We will use the make_classification
function to define our dataset and to include certain parameters with it.
A, _ = make_classification(n_samples=1000, n_features=4,
n_informative=2, n_redundant=0, n_repeated=0,
n_clusters_per_class=1, random_state=4)
Remember that whichever parameter values you choose, the output will look
different in each case.
Also, these parameters have specific meanings.
n_samples is the number of samples.
n_features is the number of features that you would want to have in the
dataset.
Now, these next three are important to know, as they are responsible for how
your data will look.
n_informative is the number of informative features, the ones that actually
carry useful signal in your data.
n_redundant is the number of redundant features, generated as random
combinations of the informative features.
n_repeated is the number of duplicated features, drawn randomly from the
informative and redundant features.
The features are stored as columns of the NumPy array, and random_state seeds
the random number generator so that the same dataset is produced on every run.
Step-3: Now, we will bring in our model. When constructing the model (KMeans),
we pass the number of clusters we need:
model = KMeans(n_clusters=3)
Step-4: Now, here we run the model. We fit it to the data that we defined
above.
model.fit(A)
After execution, it may also show a certain output message (the exact message
depends on the environment that you use).
Step-5: The predict() function assigns each datapoint in the dataset to a
cluster, giving us our cluster labels.
B = model.predict(A)
Step-6: Here the unique function comes into play. It gives us the distinct
cluster labels as a NumPy array.
clust = unique(B)
Step-7: Now, we will run a loop over the cluster labels to plot each cluster
in turn. Here, the clust variable already holds the distinct labels for the
clusters.
We will use each label in the where function to create a condition that
selects that cluster’s points.
In the end, we will see a graph plotted as a result (because pyplot and
matplotlib come into play here).
for cluster in clust:
    C = where(B == cluster)
    pyplot.scatter(A[C, 0], A[C, 1])
pyplot.show()
2. Density-Based Clustering in Machine Learning
In this type of clustering, the clusters don’t form around centroids or
central points; instead, a cluster forms wherever the density of points is
higher. Also, an advantage is that, unlike centroid-based clustering, it
doesn’t force every point into a cluster.
Forcing in all the points can add unnecessary noise to the clusters, and that
is where density-based clustering has an edge over centroid-based clustering.
The sparse/noise datapoints serve only to define the borders between clusters;
they always stay outside the main clusters.
Also, one drawback is that in density-based clustering, the cluster borders
are defined by decreasing density. So, the borders can take varied shapes,
which makes it difficult to draw a perfect border for clusters. There are
methods out there to mitigate these problems, but we will look at them some
other time.
We have three main algorithms in Density-based clustering – DBSCAN,
HDBSCAN, and OPTICS.
DBSCAN uses a fixed distance to separate the dense clusters from the noise
datapoints. It is generally the fastest of these three algorithms.
HDBSCAN uses a range of distances to separate clusters from the noise, so it
requires the least amount of user input of the three.
OPTICS measures the distances between neighboring points and draws a
reachability plot to separate the clusters from the noise datapoints.
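As a rough sketch of how these look in code: OPTICS ships with scikit-learn
and follows the same estimator pattern as DBSCAN below (HDBSCAN has been
included in scikit-learn since version 1.3; in earlier versions it lived in
the separate hdbscan package). Here A is assumed to be the dataset defined in
the k-means section:
from sklearn.cluster import OPTICS

# min_samples plays the same neighborhood-size role as in DBSCAN below.
model = OPTICS(min_samples=5)
labels = model.fit_predict(A)  # one cluster label per point, -1 for noise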
Now, we will use the same code but some different functions to understand
density-based clustering. So, we will only mention the different functions here
as the rest is the same.
So, let’s have a look at DBSCAN: We just need to import DBSCAN:
from sklearn.cluster import DBSCAN
This will import the DBSCAN model into your program. Now to define your
model and to provide it with the arguments:
model = DBSCAN(eps=0.20, min_samples=5)
Here, eps is the maximum distance between two datapoints for one to be
considered a neighbor of the other. This is the prime parameter for DBSCAN, as
the value we enter for eps changes the plot every time.
The reason is that the cluster density would look different for different
values.
Also, min_samples sets the minimum number of samples we want within a point’s
neighborhood for it to count as part of a dense region. So, the output for the
given values would be:
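For completeness, here is a minimal end-to-end sketch that would produce such
a plot, reusing the synthetic dataset from the k-means section. Note that
DBSCAN in scikit-learn has no separate predict() method, so we use
fit_predict(), and points labelled -1 are noise:
from numpy import unique, where
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_classification
from matplotlib import pyplot

# Same synthetic dataset as in the k-means section.
A, _ = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, n_repeated=0,
                           n_clusters_per_class=1, random_state=4)

model = DBSCAN(eps=0.20, min_samples=5)
# fit_predict returns a label per point; -1 marks noise outside any cluster.
B = model.fit_predict(A)

for cluster in unique(B):
    C = where(B == cluster)
    pyplot.scatter(A[C, 0], A[C, 1])
pyplot.show()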
3. Distribution-Based Clustering in Machine
Learning
In distribution-based clustering, as the distance between a point and the
centre of a distribution increases, the probability of the point belonging to
that distribution decreases.
For problems like these, we use the Gaussian distribution model. The model
works with a fixed number of Gaussian distributions, which are often drawn as
concentric figures whose colour intensity decreases from the inside to the
outside.
The central part of a distribution tends to be denser, and the density
decreases as we move outward into the bigger rings. So, even though two
distributions might contain the same number of points, their densities may
still differ because of the sizes of the distributions. Overfitting can be a
bit of a problem for this type of clustering.
Keeping sensible constraints on the distributions helps us avoid overfitting.
The Gaussian mixture model is generally trained with the
expectation-maximization algorithm. An important point to remember is that
this technique is not a substitute for density-based clustering: arbitrarily
shaped, density-defined clusters would not fit the distributions properly.
Expectation-maximization is the algorithm that works well for fitting Gaussian
mixture models.
So let’s see how the Gaussian distribution looks in general:
The concentric figures may not be this perfect, but it’s a similar depiction of
what the real thing looks like.
Now, let’s understand the code involved in this:
The code is largely similar to the previous case, so we will only mention the
main components involved, as the rest stays pretty much the same.
These are all simple examples, just to show how the clustering works. The
coding has enormous depth, with numerous parameters for the functions that we
use, but here we restrict ourselves to very few functions to keep the
explanation simple.
from sklearn.mixture import GaussianMixture
Firstly, we will import the model.
Note: The models and algorithms that are part of sklearn can be imported
directly from the appropriate sklearn submodule (GaussianMixture lives in
sklearn.mixture rather than sklearn.cluster), and we just use their
constructors to define the models.
model = GaussianMixture(n_components=4)
Here, n_components is the number of mixture components (the number of
differently coloured components that we will see in the plot).
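Putting it together, a minimal end-to-end sketch, again reusing the synthetic
dataset from the k-means section (GaussianMixture supports fit and predict
like the other estimators):
from numpy import unique, where
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_classification
from matplotlib import pyplot

# Same synthetic dataset as in the k-means section.
A, _ = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, n_repeated=0,
                           n_clusters_per_class=1, random_state=4)

# n_components is the number of Gaussian distributions in the mixture.
model = GaussianMixture(n_components=4)
model.fit(A)
B = model.predict(A)

for cluster in unique(B):
    C = where(B == cluster)
    pyplot.scatter(A[C, 0], A[C, 1])
pyplot.show()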
So, the result would look like:
4. Hierarchical Clustering in Machine Learning
Well, in hierarchical clustering we deal with either merging of clusters or
division of a big cluster.
So, we should know that hierarchical clustering has two types: Agglomerative
hierarchical clustering and divisive hierarchical clustering.
In agglomerative clustering, we merge smaller clusters into bigger clusters.
This follows a process: if the two clusters that we compare have similar
features and are near to each other, then we merge them.
In the case of divisive hierarchical clustering, we divide one big cluster
into n smaller clusters. A cluster is divided when some datapoints are not
similar to the larger cluster; we separate them and make an individual cluster
for them.
For solving the hierarchical clustering problem, we use the proximity
(distance) matrix in which we store the distances of each individual cluster
among each other.
For every merge or divide, the matrix would change because at every next step
we get a new cluster.
For example, if we have 5 clusters and merge two of them, the matrix will then
cover 4 clusters, and the distances stored in the matrix will change
accordingly (the same holds for divisive hierarchical clustering).
Also, for visualizing hierarchical clustering, we use a dendrogram, which is a
type of graph. In hierarchical clustering, the most important job is to calculate
the similarity between clusters.
It’s the most important part as it helps us to understand whether to merge or
divide the cluster. We have several methods like:
Single Linkage Algorithm (MIN)
● In this, we take two points from two individual clusters, and these
points have to be the closest to each other.
● We then calculate the distance and similarity between them.
● It works well for separating non-elliptical clusters.
● Noise in the data can create problems.
Complete Linkage Algorithm (MAX)
● This is the exact opposite of the MIN algorithm.
● We take the distance between the two farthest points in two clusters
and measure the distance and similarities.
● It works well even in the presence of noise.
● However, it can accidentally break bigger clusters as well.
Group Average
● Here, we take all pairs of points across the two clusters and average
the distances between them.
● It can accurately separate clusters even if there remains noise in
between clusters.
Distance Between the Centroids
● We can measure the distance between the centroids of both the
clusters.
● Although it’s not used that much, because you have to recompute the
centroids after every merge or divide.
Ward’s Method
● This one is the last method to calculate the similarity.
● Here, instead of taking the average, we take the sum of squares of
distances.
Note: The space complexity of hierarchical clustering is O(n²), as we require
a lot of space to hold the proximity matrix. The time complexity is O(n³); the
iterations we perform get expensive for a large dataset.
We have some visual representations of how the clustering happens in
hierarchical clustering:
There is also the proximity matrix and the dendrogram:
This is the proximity matrix. Now the dendrogram:
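As an illustrative sketch (one reasonable setup, not the only one):
scikit-learn’s AgglomerativeClustering performs the agglomerative merging,
while SciPy’s linkage and dendrogram functions build and draw the dendrogram.
Ward’s linkage is used here as one of the criteria listed above:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_classification
from matplotlib import pyplot

# A small synthetic dataset keeps the dendrogram readable.
A, _ = make_classification(n_samples=50, n_features=4, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=4)

# Agglomerative clustering with Ward's linkage: repeatedly merge the pair
# of clusters giving the smallest increase in summed squared distances.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
B = model.fit_predict(A)

# linkage() computes the full merge history from the pairwise distances;
# dendrogram() draws it as the tree shown above.
Z = linkage(A, method="ward")
dendrogram(Z)
pyplot.show()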
5. Grid Clustering in Machine Learning
This technique is very useful for multidimensional spaces. Here, we divide the
entire data space into a finite number of small cells. This helps a lot in
reducing the complexity of the problem, and it is what separates grid
clustering from conventional clustering methods.
We can recognize denser areas as places that have clusters. When we divide
the plane into cells, we can then calculate the density of each cell and then sort
them.
At first, we select one cell and calculate its density. If the density is
above a threshold, the cell becomes part of a cluster. We apply the same
process to the cell’s neighbors until no dense neighbors are left, at which
point all the neighboring dense cells have been grouped into one cluster. The
process continues until all the cells are traversed.
Here, we focus on the cells rather than the data. This helps in reducing
complexity. The algorithms that fall under the grid-based clustering are the
STING and CLIQUE algorithms.
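Neither STING nor CLIQUE ships with scikit-learn, but the core idea described
above is easy to sketch by hand for 2-D data: bin the points into a grid of
cells, keep the cells whose density crosses a threshold, and grow clusters by
flood-filling over dense neighboring cells. The grid size and threshold below
are arbitrary choices, purely for illustration:
import numpy as np

def grid_cluster(X, n_cells=10, density_threshold=5):
    """Toy grid-based clustering for 2-D data."""
    # Bin the points into an n_cells x n_cells grid and count per cell.
    counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=n_cells)
    dense = counts >= density_threshold
    cell_label = -np.ones_like(counts, dtype=int)
    current = 0
    # Flood-fill: dense cells that touch each other form one cluster.
    for i in range(n_cells):
        for j in range(n_cells):
            if dense[i, j] and cell_label[i, j] == -1:
                stack = [(i, j)]
                while stack:
                    a, b = stack.pop()
                    if 0 <= a < n_cells and 0 <= b < n_cells \
                            and dense[a, b] and cell_label[a, b] == -1:
                        cell_label[a, b] = current
                        stack.extend([(a+1, b), (a-1, b), (a, b+1), (a, b-1)])
                current += 1
    # Map each point to the label of its cell (-1 = sparse / noise).
    xi = np.clip(np.digitize(X[:, 0], xedges) - 1, 0, n_cells - 1)
    yi = np.clip(np.digitize(X[:, 1], yedges) - 1, 0, n_cells - 1)
    return cell_label[xi, yi]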
Applications of Clustering in Machine
Learning
So, finally, let’s have a look at the specific areas where this concept is applied.
Here are the top applications of the clustering concept:
● It’s useful in various image recognition platforms and also various
image segregation tasks. One such example is the biological field,
where it can help researchers to categorize and classify unknown
species.
● Clustering can come in handy for a city’s crime branch for
classifying different parts of the city based on crime rate.
● Based on the frequency and number of cases, the algorithm can tell
which part of the city sees more criminal activity.
● It’s quite useful in data-mining as well because clustering can help
in understanding the data.
● It can be useful for banks as a fraud-detection algorithm; the
banking sector faces the problem of credit card scams and frauds.
● Using clustering, we can uncover hidden patterns that may tell us
which scheme is fraudulent.
● We can understand various seismic zones based on the data and the
clustering algorithm can create clusters of various seismic activities
and active regions.
● In the health industry as well, we can use this algorithm to detect
various ailments, especially tumors and cancers.
● The search engine is a prime example of the clustering technique.
● It categorizes search results on the basis of billions of searches
and has the results ready on the go for every query.
● It can be of great use in analyzing data on soil for agriculture.
● With the data of the soil, it can classify whether the soil has the
necessary nutrient content for growing crops.
● It is also useful for advertising products based on customer
interest and feedback.
Conclusion
So, in this article, we have studied all the essential aspects of clustering
in machine learning that a student or anyone who wants to pursue ML should
know about.
We covered everything from what clustering is to what its types are. We even
walked through a few coding examples; in ML, knowing which functions and
libraries to use matters even more than understanding every line of code, and
you can follow the full code if you know the Python language.
We also saw the algorithms that fall under the various categories, and several
diagrams were used for better understanding.