Clustering in Machine Learning
In this article, we will study an unsupervised learning-based technique known
as clustering in machine learning. Here, we will discuss what clustering really
is and what its types are, and we will look at the algorithms involved in the
clustering technique.
Clustering in Machine Learning
Let’s try to understand what clustering exactly is. Examples make the job a lot
easier.
So, as we know, there are two broad modes of learning: active and passive.
Passive means that the model follows a certain pre-written path and works
under supervision. This doesn’t mean that unsupervised learning therefore sits
on the active side; it is just that the human intervention in unsupervised
learning is quite minimal compared to supervised learning.
Also, we have unlabeled data in unsupervised learning. So, here, the algorithm
has to analyze the data entirely on its own, find patterns, and cluster the
data points that share similar features.
For example, suppose we provide a dataset consisting of images of two different
objects. The model will scan the images for certain features, and if some
images have matching features, it will group them into a cluster.
Note: Active learning is a different concept. It applies to semi-supervised
and reinforcement learning techniques.
Examples of Clustering in Machine Learning
A real-life example would be trying to solve a hard position in chess. The
possibilities for checkmating the king are endless; there is no predefined or
pre-set solution in chess. You have to analyze the positions, your pieces, and
the opponent’s pieces, and find a solution.
The traits you use to find the solution are the dataset; you are the model that
has to analyze them and find the answer. This is what unsupervised learning is.
Now let’s understand clustering. In clustering, we group data points into
clusters based on similar features rather than labels. The labelling part in
clustering, if needed at all, comes at the end, once clustering is over.
Also, we should add plenty of data to the dataset to increase the accuracy of
the results. The algorithm will learn various patterns in the dataset: it will
look for certain traits and features and compare the similarities between the
data points.
There is a difference here between classification and clustering. The labeling
that classification requires involves a lot of human intervention, which proves
both costly and time-consuming. Unsupervised learning avoids most of that
labelling effort, but the model has to do more processing because it must
analyze the data itself.
Let’s take an example. Suppose we have a dataset consisting of images of
tigers, zebras, leopards, and cheetahs.
The clustering algorithm would analyze this dataset and then divide the data
based on some specific characteristics. The characteristics would include fur
color, patterns (spots, stripes), face shape, etc.
The model would remember the pattern in which it classified the data. This
knowledge will come in handy for future unknown data.
We also have other applications of clustering like fake-news detection, fraud
detection, spam mail segregation, etc. Now that we have seen and understood
clustering, let’s dive a bit deeper.
There are various types of clustering that we should know about. These will
help us to further classify and understand the various algorithms that
unsupervised learning has.
These include:
● Centroid-based clustering
● Density-based clustering
● Distribution-based clustering
● Hierarchical clustering
● Grid clustering
Now let’s understand these one-by-one.
Types of Clustering in Machine Learning
1. Centroid-Based Clustering in Machine Learning
In centroid-based clustering, we form clusters around several points that act
as the centroids. The k-means clustering algorithm is the perfect example of
the Centroid-based clustering method. Here, we form k number of clusters
that have k number of centroids.
The algorithm assigns each datapoint to a cluster centre (centroid) based on
its proximity to the centroids.
The algorithm measures the Euclidean distance between the datapoint and all k
centroids; the datapoint gets clustered with the nearest centroid.
Also, k-means is the most widely used centroid-based clustering algorithm. The
primary aim of the algorithm is to partition an N-dimensional dataset into K
smaller clusters.
K-means Clustering in Machine Learning
Let’s try to understand more about k-means clustering. It is an iterative
clustering algorithm: it compares each datapoint’s proximity with the
centroids one by one in an iterative fashion. Each iteration improves the
clustering, and the algorithm converges to the best value it can reach from
its starting point (a local optimum).
Let’s understand the working of the algorithm with the help of some images:
Let’s say we have two different types of datapoints: red and blue.
Now, for the next step, let’s assign the value of k. Let’s say k=2, as we have
two types of data. k=2 means we will have two centroids, placed far from each
other.
Note: The initial centroids should be placed distant from each other to avoid
confusion and error.
The two-colored crosses are the centroids.
Now the algorithm will compare the distance of each point with the centroids.
Each point will join the cluster whose centroid is nearest to it; the
algorithm uses a distance metric for this purpose. The points are now assigned
to their respective clusters.
After this, the centroids are recomputed for the new clusters and re-adjust
their positions.
The last two steps keep repeating until the algorithm converges, or you could
say, until the centroids stop moving and everything becomes stable.
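Before turning to scikit-learn, here is a minimal NumPy sketch of those two
alternating steps. It is an illustration under simple assumptions (the data
sits in a 2-D array X, and no cluster ever ends up empty), not scikit-learn’s
actual implementation:
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k distinct datapoints as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        # (this toy version assumes no cluster ever becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged: the centroids stopped moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids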
So, let’s see the coding part of it:
Step 1: First, we will import all the necessary libraries. We have the basic
import libraries:
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot
from numpy import unique
from numpy import where
Here, make_classification generates the dataset, and KMeans imports the model
for the k-means algorithm.
If we need the distinct elements of a NumPy array, we can extract them with
the unique function.
As for the where function, you can think of it as a conditional operator in
mathematics: you give it a condition, and it returns the indices at which the
condition holds. Remember that both of these functions belong to NumPy.
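A quick illustration of these two functions (the array below is made up purely
for demonstration):
import numpy as np

B = np.array([0, 2, 1, 2, 0, 1, 1])
print(np.unique(B))      # distinct values: [0 1 2]
print(np.where(B == 1))  # indices where the condition holds: (array([2, 5, 6]),)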
Here, we have used sklearn.datasets so that we can generate a dataset with
precisely controlled parameters.
Step-2: Let’s try to define the dataset. We will use the make_classification
function to define our dataset and to include certain parameters with it.
A, _ = make_classification(n_samples=1000, n_features=4,
n_informative=2, n_redundant=0, n_repeated=0,
n_clusters_per_class=1, random_state=4)
Remember that whichever parameter values you choose, the output will look
different in each case.
Also, these parameters have specific meanings.
n_samples is the number of samples.
n_features is the number of features that you would want to have in the
dataset.
Now, these next three are important to know, as they are responsible for how
your data will look.
n_informative is the number of informative features, the ones that actually
carry useful signal in your data.
n_redundant is the number of redundant features, generated as random
combinations of the informative features.
n_repeated is the number of duplicated features, drawn randomly from the
informative and redundant features.
The features are stored as columns of the NumPy array, and random_state seeds
the random number generator so that the same dataset is produced on every run.
Step-3: Now, we will bring in our model. When constructing the model (KMeans),
we pass the number of clusters we need:
model = KMeans(n_clusters=3)
Step-4: Now, here we run the model. We fit it to the data that we defined
above.
model.fit(A)
After execution, it may also show a certain output message (the exact message
depends on the environment that you use).
Step-5: The predict() function assigns each datapoint in the dataset to a
cluster, giving us our cluster labels.
B = model.predict(A)
Step-6: Here the unique function comes into play. It gives us the distinct
cluster labels as a NumPy array.
clust = unique(B)
Step-7: Now, we will run a loop over the cluster labels to plot each cluster
in turn. Here, the clust variable already holds the distinct labels for the
clusters.
We will use each label in the where function to create a condition that
selects that cluster’s points.
In the end, we will see a graph plotted as a result (because pyplot and
matplotlib come into play here).
for cluster in clust:
    C = where(B == cluster)
    pyplot.scatter(A[C, 0], A[C, 1])
pyplot.show()
2. Density-Based Clustering in Machine Learning
In this type of clustering, the clusters don’t form around centroids or
central points; instead, a cluster forms wherever the density of points is
higher. Also, an advantage is that, unlike centroid-based clustering, it
doesn’t force every point into a cluster.
Forcing in all the points can add unnecessary noise to the clusters, and that
is where density-based clustering has an edge over centroid-based clustering.
The sparse/noise datapoints serve only to define the borders between clusters;
they always stay outside the main clusters.
Also, one drawback is that in density-based clustering, the cluster borders
are defined by decreasing density. So, the borders can take varied shapes,
which makes it difficult to draw a perfect border for clusters. There are
methods out there to mitigate these problems, but we will look at them some
other time.
We have three main algorithms in Density-based clustering – DBSCAN,
HDBSCAN, and OPTICS.
DBSCAN uses a fixed distance to separate the dense clusters from the noise
datapoints. It is generally the fastest of these three algorithms.
HDBSCAN uses a range of distances to separate clusters from the noise, so it
requires the least amount of user input of the three.
OPTICS measures the distances between neighboring points and draws a
reachability plot to separate the clusters from the noise datapoints.
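As a rough sketch of how these look in code: OPTICS ships with scikit-learn
and follows the same estimator pattern as DBSCAN below (HDBSCAN has been
included in scikit-learn since version 1.3; in earlier versions it lived in
the separate hdbscan package). Here A is assumed to be the dataset defined in
the k-means section:
from sklearn.cluster import OPTICS

# min_samples plays the same neighborhood-size role as in DBSCAN below.
model = OPTICS(min_samples=5)
labels = model.fit_predict(A)  # one cluster label per point, -1 for noise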
Now, we will use the same code but some different functions to understand
density-based clustering. So, we will only mention the different functions here
as the rest is the same.
So, let’s have a look at DBSCAN: We just need to import DBSCAN:
from sklearn.cluster import DBSCAN
This will import the DBSCAN model into your program. Now to define your
model and to provide it with the arguments:
model = DBSCAN(eps=0.20, min_samples=5)
Here, eps is the maximum distance between two datapoints for one to be
considered a neighbor of the other. This is the prime parameter for DBSCAN, as
the value we enter for eps changes the plot every time.
The reason is that the cluster density would look different for different
values.
Also, min_samples sets the minimum number of samples we want within a point’s
neighborhood for it to count as part of a dense region. So, the output for the
given values would be:
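For completeness, here is a minimal end-to-end sketch that would produce such
a plot, reusing the synthetic dataset from the k-means section. Note that
DBSCAN in scikit-learn has no separate predict() method, so we use
fit_predict(), and points labelled -1 are noise:
from numpy import unique, where
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_classification
from matplotlib import pyplot

# Same synthetic dataset as in the k-means section.
A, _ = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, n_repeated=0,
                           n_clusters_per_class=1, random_state=4)

model = DBSCAN(eps=0.20, min_samples=5)
# fit_predict returns a label per point; -1 marks noise outside any cluster.
B = model.fit_predict(A)

for cluster in unique(B):
    C = where(B == cluster)
    pyplot.scatter(A[C, 0], A[C, 1])
pyplot.show()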
3. Distribution-Based Clustering in Machine
Learning
In distribution-based clustering, as the distance between a point and the
centre of a distribution increases, the probability of the point belonging to
that distribution decreases.
For problems like these, we use the Gaussian distribution model. The model
works with a fixed number of Gaussian distributions, which are often drawn as
concentric figures whose colour intensity decreases from the inside to the
outside.
The central part of a distribution tends to be denser, and the density
decreases as we move outward into the bigger rings. So, even though two
distributions might contain the same number of points, their densities may
still differ because of the sizes of the distributions. Overfitting can be a
bit of a problem for this type of clustering.
Keeping sensible constraints on the distributions helps us avoid overfitting.
The Gaussian mixture model is generally trained with the
expectation-maximization algorithm. An important point to remember is that
this technique is not a substitute for density-based clustering: arbitrarily
shaped, density-defined clusters would not fit the distributions properly.
Expectation-maximization is the algorithm that works well for fitting Gaussian
mixture models.
So let’s see how the Gaussian distribution looks in general:
The concentric figures may not be this perfect, but it’s a similar depiction of
what the real thing looks like.
Now, let’s understand the code involved in this:
The code is largely similar to the previous case, so we will only mention the
main components involved, as the rest stays pretty much the same.
These are all simple examples, just to show how the clustering works. The
coding has enormous depth, with numerous parameters for the functions that we
use, but here we restrict ourselves to very few functions to keep the
explanation simple.
from sklearn.mixture import GaussianMixture
Firstly, we will import the model.
Note: The models and algorithms that are part of sklearn can be imported
directly from the appropriate sklearn submodule (GaussianMixture lives in
sklearn.mixture rather than sklearn.cluster), and we just use their
constructors to define the models.
model = GaussianMixture(n_components=4)
Here, n_components is the number of mixture components (the number of
differently coloured components that we will see in the plot).
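Putting it together, a minimal end-to-end sketch, again reusing the synthetic
dataset from the k-means section (GaussianMixture supports fit and predict
like the other estimators):
from numpy import unique, where
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_classification
from matplotlib import pyplot

# Same synthetic dataset as in the k-means section.
A, _ = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, n_repeated=0,
                           n_clusters_per_class=1, random_state=4)

# n_components is the number of Gaussian distributions in the mixture.
model = GaussianMixture(n_components=4)
model.fit(A)
B = model.predict(A)

for cluster in unique(B):
    C = where(B == cluster)
    pyplot.scatter(A[C, 0], A[C, 1])
pyplot.show()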
So, the result would look like:
4. Hierarchical Clustering in Machine Learning
Well, in hierarchical clustering we deal with either merging of clusters or
division of a big cluster.
So, we should know that hierarchical clustering has two types: Agglomerative
hierarchical clustering and divisive hierarchical clustering.
In agglomerative clustering, we merge smaller clusters into bigger clusters.
This follows a process: if the two clusters that we compare have similar
features and are near to each other, then we merge them.
In the case of divisive hierarchical clustering, we divide one big cluster
into n smaller clusters. A cluster is divided when some datapoints are not
similar to the larger cluster; we separate them and make an individual cluster
for them.
For solving the hierarchical clustering problem, we use the proximity
(distance) matrix in which we store the distances of each individual cluster
among each other.
For every merge or divide, the matrix would change because at every next step
we get a new cluster.
For example, if we have 5 clusters and merge two of them, the matrix will then
cover 4 clusters, and the distances stored in the matrix will change
accordingly (the same holds for divisive hierarchical clustering).
Also, for visualizing hierarchical clustering, we use a dendrogram, which is a
type of graph. In hierarchical clustering, the most important job is to calculate
the similarity between clusters.
It’s the most important part as it helps us to understand whether to merge or
divide the cluster. We have several methods like:
Single Linkage Algorithm (MIN)
● In this, we take two points from two individual clusters, and these
points have to be the closest to each other.
● We then calculate the distance and similarity between them.
● It works well for separating non-elliptical clusters.
● Noise in the data can create problems.
Complete Linkage Algorithm (MAX)
● This is the exact opposite of the MIN algorithm.
● We take the distance between the two farthest points in two clusters
and measure the distance and similarities.
● It works well even in the presence of noise.
● However, it can accidentally break bigger clusters as well.
Group Average
● Here, we take all pairs of points across the two clusters and average
the distances between them.
● It can accurately separate clusters even if there remains noise in
between clusters.
Distance Between the Centroids
● We can measure the distance between the centroids of both the
clusters.
● Although it’s not used that much, because you have to recompute the
centroids after every merge or divide.
Ward’s Method
● This one is the last method to calculate the similarity.
● Here, instead of taking the average, we take the sum of squares of
distances.
Note: The space complexity of hierarchical clustering is O(n²), as we require
a lot of space to hold the proximity matrix. The time complexity is O(n³); the
iterations we perform get expensive for a large dataset.
We have some visual representations of how the clustering happens in
hierarchical clustering:
There is also the proximity matrix and the dendrogram:
This is the proximity matrix. Now the dendrogram:
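As an illustrative sketch (one reasonable setup, not the only one):
scikit-learn’s AgglomerativeClustering performs the agglomerative merging,
while SciPy’s linkage and dendrogram functions build and draw the dendrogram.
Ward’s linkage is used here as one of the criteria listed above:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_classification
from matplotlib import pyplot

# A small synthetic dataset keeps the dendrogram readable.
A, _ = make_classification(n_samples=50, n_features=4, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=4)

# Agglomerative clustering with Ward's linkage: repeatedly merge the pair
# of clusters giving the smallest increase in summed squared distances.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
B = model.fit_predict(A)

# linkage() computes the full merge history from the pairwise distances;
# dendrogram() draws it as the tree shown above.
Z = linkage(A, method="ward")
dendrogram(Z)
pyplot.show()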
5. Grid Clustering in Machine Learning
This technique is very useful for multidimensional spaces. Here, we divide the
entire data space into a finite number of small cells. This helps a lot in
reducing the complexity of the problem, and it is what separates grid
clustering from conventional clustering methods.
We can recognize denser areas as places that have clusters. When we divide
the plane into cells, we can then calculate the density of each cell and then sort
them.
At first, we select one cell and calculate its density. If the density is
above a threshold, the cell becomes part of a cluster. We apply the same
process to the cell’s neighbors until no dense neighbors are left, at which
point all the neighboring dense cells have been grouped into one cluster. The
process continues until all the cells are traversed.
Here, we focus on the cells rather than the data. This helps in reducing
complexity. The algorithms that fall under the grid-based clustering are the
STING and CLIQUE algorithms.
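Neither STING nor CLIQUE ships with scikit-learn, but the core idea described
above is easy to sketch by hand for 2-D data: bin the points into a grid of
cells, keep the cells whose density crosses a threshold, and grow clusters by
flood-filling over dense neighboring cells. The grid size and threshold below
are arbitrary choices, purely for illustration:
import numpy as np

def grid_cluster(X, n_cells=10, density_threshold=5):
    """Toy grid-based clustering for 2-D data."""
    # Bin the points into an n_cells x n_cells grid and count per cell.
    counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=n_cells)
    dense = counts >= density_threshold
    cell_label = -np.ones_like(counts, dtype=int)
    current = 0
    # Flood-fill: dense cells that touch each other form one cluster.
    for i in range(n_cells):
        for j in range(n_cells):
            if dense[i, j] and cell_label[i, j] == -1:
                stack = [(i, j)]
                while stack:
                    a, b = stack.pop()
                    if 0 <= a < n_cells and 0 <= b < n_cells \
                            and dense[a, b] and cell_label[a, b] == -1:
                        cell_label[a, b] = current
                        stack.extend([(a+1, b), (a-1, b), (a, b+1), (a, b-1)])
                current += 1
    # Map each point to the label of its cell (-1 = sparse / noise).
    xi = np.clip(np.digitize(X[:, 0], xedges) - 1, 0, n_cells - 1)
    yi = np.clip(np.digitize(X[:, 1], yedges) - 1, 0, n_cells - 1)
    return cell_label[xi, yi]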
Applications of Clustering in Machine
Learning
So, finally, let’s have a look at the specific areas where this concept is applied.
Here are the top applications of the clustering concept:
● It’s useful in various image recognition platforms and also various
image segregation tasks. One such example is the biological field,
where it can help researchers to categorize and classify unknown
species.
● Clustering can come in handy for a city’s crime branch for
classifying different parts of the city based on crime rate.
● Based on the frequency and number of cases, the algorithm can tell
which part of the city sees more criminal activity.
● It’s quite useful in data-mining as well because clustering can help
in understanding the data.
● It can be useful for banks as a fraud-detection algorithm; the
banking sector faces the problem of credit card scams and frauds.
● Using clustering, we can uncover hidden patterns that may tell us
which scheme is fraudulent.
● We can understand various seismic zones based on the data and the
clustering algorithm can create clusters of various seismic activities
and active regions.
● In the health industry as well, we can use this algorithm to detect
various ailments, especially tumors and cancers.
● The search engine is a prime example of the clustering technique.
● It categorizes search results on the basis of billions of searches
and has the results ready on the go for every query.
● It can be of great use in analyzing data on soil for agriculture.
● With the data of the soil, it can classify whether the soil has the
necessary nutrient content for growing crops.
● It is also useful for advertising products based on customer
interest and feedback.
Conclusion
So, in this article, we have studied all the essential aspects of clustering
in machine learning that a student or anyone who wants to pursue ML should
know about.
We covered everything from what clustering is to what its types are. We even
walked through a few coding examples; in ML, knowing which functions and
libraries to use matters even more than understanding every line of code, and
you can follow the full code if you know the Python language.
We also saw the algorithms that fall under the various categories, and several
diagrams were used for better understanding.