Clustering
Sherin Rappai
Assistant Professor
Department of Computer Science
Introduction
• Unsupervised learning is a machine learning approach in which unlabelled and
unclassified data is analysed to discover hidden knowledge.
• The algorithms work on the data without any prior training, but they are
constructed in such a way that they can identify patterns, groupings, sorting
orders, and other interesting knowledge in the data.
• Data can broadly be divided into the following two types:
• 1. Qualitative data
• 2. Quantitative data
• Qualitative data provides information about the quality of an object, or
information that cannot be measured.
• For example, if we consider the quality of performance of students in terms of
‘Good’, ‘Average’, and ‘Poor’, it falls under the category of qualitative data.
• Similarly, the name or roll number of a student is information that cannot be
measured using a scale of measurement, so it also falls under qualitative data.
• Qualitative data is also called categorical data.
• Qualitative data can be further subdivided into two types as follows:
• 1. Nominal data
• 2. Ordinal data
• Nominal data is one which has no numeric value, but a named value. It is used
for assigning named values to attributes. Nominal values cannot be quantified.
Examples of nominal data are:
• 1. Blood group: A, B, O, AB, etc.
• 2. Nationality: Indian, American, British, etc.
• 3. Gender: Male, Female, Other.
• Obviously, mathematical operations such as addition, subtraction, and
multiplication cannot be performed on nominal data. For that reason,
statistical functions such as mean and variance also cannot be applied to
nominal data. However, a basic count is possible, so the mode, i.e. the most
frequently occurring value, can be identified for nominal data.
• Ordinal data, in addition to possessing the properties of nominal data, can also be
naturally ordered. This means ordinal data also assigns named values to attributes
but unlike nominal data, they can be arranged in a sequence of increasing or
decreasing value so that we can say whether a value is better than or greater than
another value. Examples of ordinal data are:
• 1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
• 2. Grades: A, B, C, etc.
• 3. Hardness of Metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.
• Like nominal data, basic counting is possible for ordinal data, so the mode
can be identified. Since ordering is possible for ordinal data, the median and
quartiles can be identified in addition. The mean still cannot be calculated.
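These rules can be illustrated with Python's standard library; the blood-group and satisfaction values below are illustrative examples, not data from the text.

```python
from statistics import mode, median

# Nominal data: only counting is valid, so the mode is the lone
# measure of central tendency.
blood_groups = ["A", "B", "O", "O", "AB", "O", "A"]
print(mode(blood_groups))          # "O", the most frequent category

# Ordinal data: the categories can be ranked, so the median is also
# meaningful. Map the ordered categories to ranks, take the median
# rank, and map it back. (The mean of the ranks is still meaningless.)
order = {"Unhappy": 0, "Happy": 1, "Very Happy": 2}
satisfaction = ["Unhappy", "Happy", "Very Happy", "Happy", "Happy"]
median_rank = median(order[s] for s in satisfaction)
print({v: k for k, v in order.items()}[median_rank])   # "Happy"
```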
• Quantitative data relates to information about the quantity of an object – hence it
can be measured.
• For example, if we consider the attribute ‘marks’, it can be measured using a scale
of measurement.
• Quantitative data is also termed numeric data.
• There are two types of quantitative data:
• 1. Interval data
• 2. Ratio data
• Interval data is numeric data for which not only the order is known, but the
exact difference between values is also known.
• An ideal example of interval data is Celsius temperature. Differences between
values are meaningful in Celsius temperature: the difference between 12°C and
18°C is measurable and is 6°C, just as is the difference between 15.5°C and
21.5°C. Other examples include dates, times, etc.
• For interval data, mathematical operations such as addition and subtraction are
possible. For that reason, for interval data, the central tendency can be measured
by mean, median, or mode. Standard deviation can also be calculated.
• However, interval data does not have a ‘true zero’ value. For example, 0°C
does not mean ‘no temperature’. Hence, only addition and subtraction apply to
interval data.
• Ratios cannot be applied. This means we can say a temperature of 40°C is equal
to a temperature of 20°C plus a temperature of 20°C; however, we cannot say
that a temperature of 40°C is twice as hot as a temperature of 20°C.
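A quick arithmetic check makes this concrete: converting to Kelvin, a ratio scale with a true zero, shows that 40°C is nowhere near twice as hot as 20°C.

```python
# Celsius is an interval scale: its zero is arbitrary, so ratios of
# Celsius values are meaningless. Kelvin has a true zero, so ratios
# of Kelvin values are physically meaningful.
c_hot, c_cold = 40.0, 20.0
k_hot, k_cold = c_hot + 273.15, c_cold + 273.15

print(c_hot / c_cold)   # 2.0 -- a meaningless ratio of interval values
print(k_hot / k_cold)   # about 1.07 -- the meaningful ratio: barely hotter
```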
• Ratio data represents numeric data for which exact values can be measured.
An absolute zero is available for ratio data, which represents the complete
absence of the measured attribute. These variables can be added, subtracted,
multiplied, or divided. The central tendency can be measured by mean, median,
or mode, and measures of dispersion such as standard deviation can be
calculated. Examples of ratio data include height, weight, age, salary, etc.
• Binary variables
• A binary variable is a variable that can take only two values.
• For example, a gender variable is often recorded with just the two values
male and female.
CLUSTERING
• Clustering refers to a broad set of techniques for finding subgroups, or clusters, in
a data set on the basis of the characteristics of the objects within that data set in
such a manner that the objects within the group are similar (or related to each
other) but are different from (or unrelated to) the objects from the other groups.
• The primary driver of clustering is knowledge discovery rather than
prediction, because we may not even know what we are looking for before
starting the clustering analysis.
• The effectiveness of clustering depends on how similar or related the objects
within a group are or how different or unrelated the objects in different groups are
from each other.
• Defining what it means for two objects to be similar or dissimilar is often
domain-specific, and doing so is an important aspect of the unsupervised machine
learning task.
• As an example, suppose we want to run some advertisements of a new movie for a countrywide
promotional activity. We have data for the age, location, financial condition, and political stability
of the people in different parts of the country.
• We may want to run a different type of campaign for the different parts grouped according to the
data we have. Any logical grouping obtained by analysing the characteristics of the people will
help us in driving the campaigns in a more targeted way.
• Clustering analysis can help in this activity by analysing different ways to group the set of people
and arriving at different types of clusters.
• There are many different fields where cluster analysis is used effectively, such as:
• Text data mining: this includes tasks such as text categorization, text clustering,
document summarization, concept extraction, sentiment analysis, and entity
relation modelling.
• Customer segmentation: creating clusters of customers on the basis of parameters
such as demographics, financial conditions, buying habits, etc., which can be used
by retailers and advertisers to promote their products in the correct segment.
• Anomaly checking: checking for anomalous behaviours such as fraudulent bank
transactions, unauthorized computer intrusions, suspicious movements on a radar
scanner, etc.
• Data mining: simplifying the data mining task by grouping a large number of features
from an extremely large data set to make the analysis manageable.
• Let us take an example. You have been invited to take a session on Machine Learning
at a reputed university, to induct its professors into the subject. Before you create the
material for the session, you want to know the professors' level of acquaintance with
the subject so that the session is successful.
• But you do not want to ask the inviting university; rather, you do some analysis on your
own on the basis of freely available data. As Machine Learning lies at the intersection
of Statistics and Computer Science, you focus on identifying the professors from these
two areas.
• So, you searched the list of research publications of these professors on the internet,
and using a machine learning algorithm, you now want to group the papers and thus
infer the expertise of the professors into three buckets: Statistics, Computer Science,
and Machine Learning.
After plotting the number of publications of these professors in the two core areas,
namely Statistics and Computer Science, you obtain a scatter plot as shown in
Figure
• Some inferences can be derived from a pattern analysis of the data: there
seem to be three groups or clusters emerging. The pure statisticians have very
few Computer Science-related papers, whereas the pure Computer Science
professors have fewer Statistics-related papers than Computer Science-related
papers.
• There is a third cluster of professors who have published papers on both these
areas and thus can be assumed to be the persons knowledgeable in machine
learning concepts, as shown in Figure.
• Thus, in the above problem, we used visual indication of logical grouping of data
to identify a pattern or cluster and labelled the data in three different clusters. The
main driver for our clustering was the closeness of the points to each other to form
a group.
• The clustering algorithm uses a very similar approach to measure how closely the
data points are related and decides whether they can be labelled as a homogeneous
group.
• Different types of clustering techniques
• The major clustering techniques are:
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods
• Constraint-based methods
• Their approaches to creating the clusters, their ways of measuring the quality
of the clusters, and their applicability differ. The table summarizes the main
characteristics of each method for reference.
Partitioning Method
• Suppose we are given a database of n objects; a partitioning method
constructs k partitions of the data. Each partition represents a
cluster, and k ≤ n. That is, it classifies the data into k groups such that:
– Each group contains at least one object.
– Each object belongs to exactly one group.
• For a given number of partitions (say k), the
partitioning method creates an initial partitioning.
• It then uses an iterative relocation technique to
improve the partitioning by moving objects from one
group to another.
Partitioning methods
• Two of the most important algorithms for partitioning-based clustering are k-
means and k-medoids. In the k-means algorithm, the prototype of a cluster is
its centroid, which is normally the mean of the group of points.
• Similarly, the k-medoids algorithm identifies the medoid, which is the most
representative point of a group of points. Note that in most cases the
centroid does not correspond to an actual data point, whereas the medoid is
always an actual data point.
• K-means - A centroid-based technique
• This is one of the oldest and most popular algorithms for clustering.
• The principle of the k-means algorithm is to assign each of the n data points to
one of K clusters, where K is a user-defined parameter specifying the number of
clusters desired.
• The objective is to maximize the homogeneity within the clusters and also to
maximize the differences between the clusters.
• The homogeneity and differences are measured in terms of the distance between
the objects or points in the data set.
• Algorithm shows the simple algorithm of K-means:
• Step 1: Select K points in the data space and mark them as initial centroids
• loop
• Step 2: Assign each point in the data space to the nearest centroid to form K
clusters
• Step 3: Measure the distance of each point in the cluster from the centroid
• Step 4: Calculate the Sum of Squared Errors (SSE) to measure the quality of the
clusters
• Step 5: Identify the new centroid of each cluster on the basis of distance between
points
• Step 6: Repeat Steps 2 to 5 to refine until centroids do not change
• end loop
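The steps above can be sketched in plain Python. This is a minimal 2-D version under one simplifying assumption: Step 1 takes the first K points as initial centroids instead of selecting them randomly.

```python
import math

def kmeans(points, k, max_iters=100):
    """Minimal 2-D k-means following the steps above."""
    # Step 1: select K points as the initial centroids
    # (simplified: real implementations usually pick them randomly).
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 5: identify the new centroid of each cluster (its mean).
        new_centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 6: repeat until the centroids do not change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Steps 3-4: SSE, the summed squared point-to-centroid distances,
    # measures cluster quality (lower means tighter clusters).
    sse = sum(math.dist(p, centroids[i]) ** 2
              for i, cl in enumerate(clusters) for p in cl)
    return centroids, clusters, sse

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, clusters, sse = kmeans(points, k=2)
```

On this toy data the two tight groups are recovered regardless of how many iterations the loop is allowed, because the centroids stop moving after a few passes.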
• Let us understand this algorithm with an example. In the figure below, we have
a certain set of data points, and we will apply the k-means algorithm to find the
clusters generated from this data set.
• Let us fix K = 4, implying that we want to create four clusters out of this data set.
As the first step, we assign four random points from the data set as the centroids,
as represented by the * signs, and we assign the data points to the nearest centroid
to create four clusters.
• In the second step, on the basis of the distance of the points from the
corresponding centroids, the centroids are updated and points are reassigned to the
updated centroids.
• After three iterations, we find that the centroids are no longer moving, as
there is no scope for refinement, and thus the k-means algorithm terminates.
• This provides us with the most logical four groupings or clusters of the data
set, where the homogeneity within the groups is highest and the difference
between the groups is maximum.
• Choosing appropriate number of clusters
• One of the most important success factors in arriving at a correct clustering
is to start with a correct assumption about the number of clusters. Different
numbers of starting clusters lead to completely different splits of the data.
• It will always help if we have some prior knowledge about the number of clusters
and we start our k-means algorithm with that prior knowledge. For example, if we
are clustering the data of the students of a university, it is always better to start
with the number of departments in that university.
• Sometimes, the business needs or resource limitations drive the number of
required clusters.
• For example, if a movie maker wants to cluster the movies on the basis of
combination of two parameters – budget of the movie: high or low, and casting of
the movie: star or non-star, then there are 4 possible combinations, and thus, there
can be four clusters to split the data.
• For a small data set, a rule of thumb that is sometimes followed is
K ≈ √(n/2), which means that K is set to the square root of n/2 for a data set
of n examples. Unfortunately, this thumb rule does not work well for large data
sets. There are several statistical methods to arrive at a suitable number of
clusters.
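The thumb rule is a one-liner; the example sizes below are illustrative.

```python
import math

def rule_of_thumb_k(n):
    """Thumb rule for small data sets: K is the square root of n/2."""
    return round(math.sqrt(n / 2))

print(rule_of_thumb_k(50))    # 5 clusters for 50 examples
print(rule_of_thumb_k(200))   # 10 clusters for 200 examples
```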
Hierarchical clustering
• But there are situations when the data needs to be partitioned into groups at
different levels such as in a hierarchy. The hierarchical clustering methods are
used to group the data into hierarchy or tree-like structure.
• For example, in a machine learning problem of organizing employees of a
university in different departments, first the employees are grouped under the
different departments in the university, and then within each department, the
employees can be grouped according to their roles such as professors, assistant
professors, supervisors, lab assistants, etc.
• This creates a hierarchical structure of the employee data and eases visualization
and analysis. Similarly, a data set may have an underlying hierarchical
structure that we want to discover, and we can use hierarchical clustering
methods to achieve that.
• There are two main hierarchical clustering methods: agglomerative clustering and
divisive clustering.
• Agglomerative clustering is a bottom-up technique which starts with individual
objects as clusters and then iteratively merges them to form larger clusters. On the
other hand, the divisive method starts with one cluster with all given objects and
then splits it iteratively to form smaller clusters.
• The agglomerative hierarchical clustering method uses the bottom-up strategy.
It starts with each object forming its own cluster and then iteratively merges the
clusters according to their similarity to form larger clusters. It terminates either
when a certain clustering condition imposed by the user is achieved or all the
clusters merge into a single cluster.
• The divisive hierarchical clustering method uses a top-down strategy. The
starting point is the largest cluster with all the objects in it, and then, it is split
recursively to form smaller and smaller clusters, thus forming the hierarchy. The
end of iterations is achieved when the objects in the final clusters are sufficiently
homogeneous to each other or the final clusters contain only one object or the
user-defined clustering condition is achieved.
• In both these cases, it is important to select the split and merger points carefully,
because the subsequent splits or mergers will use the result of the previous ones
and there is no option to perform any object swapping between the clusters or
rectify the decisions made in previous steps, which may result in poor clustering
quality at the end.
• A dendrogram is a commonly used tree structure representation of step-by-step
creation of hierarchical clustering. It shows how the clusters are merged iteratively
(in the case of agglomerative clustering) or split iteratively (in the case of divisive
clustering) to arrive at the optimal clustering solution. The figure shows a
dendrogram with four levels and how the objects are merged or split at each
level to arrive at the hierarchical clustering.
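The agglomerative (bottom-up) strategy can be sketched in a few lines. This version uses single linkage, the distance between the closest members of two clusters, which is one common choice of merge criterion; the recorded merge order is exactly what a dendrogram draws, level by level.

```python
import math

def agglomerative(points, num_clusters):
    """Bottom-up clustering sketch with single linkage: every point
    starts as its own cluster, and the two closest clusters are merged
    repeatedly until num_clusters remain."""
    clusters = [[p] for p in points]
    merge_order = []                  # the dendrogram, level by level
    while len(clusters) > num_clusters:
        # find the pair of clusters with the smallest single-link
        # distance (the distance between their closest members)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_order.append((tuple(clusters[i]), tuple(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merge_order

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters, merge_order = agglomerative(points, num_clusters=2)
```

Note how each merge is final: as the text says, there is no later swapping of objects between clusters, so a poor early merge cannot be undone.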
Density-based methods - DBSCAN
• When the partitioning and hierarchical clustering methods are used, the
resulting clusters are spherical or nearly spherical in nature. For clusters of
other shapes, such as S-shaped or unevenly shaped clusters, these two types of
method do not provide accurate results.
• The density-based clustering approach provides a solution for identifying
clusters of arbitrary shapes. The principle is to identify the dense and sparse
areas within the data set and form clusters accordingly. DBSCAN is one of the
most popular density-based algorithms; it creates clusters from connected
regions of high density.
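A minimal DBSCAN sketch, written from the algorithm's standard description (the `eps` radius and `min_pts` density threshold are the usual DBSCAN parameters; the data is illustrative):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: clusters grow outward from core points
    (points with at least min_pts neighbours within eps); points not
    reachable from any core point are labelled -1 (noise)."""
    labels = {}                                   # index -> cluster id
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]
    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                        # tentatively noise
            continue
        labels[i] = cluster_id                    # i is a core point
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cluster_id            # border point reclaimed
            if j in labels:
                continue
            labels[j] = cluster_id
            if len(neighbours(j)) >= min_pts:     # j is also a core point
                queue.extend(neighbours(j))
        cluster_id += 1
    return [labels[i] for i in range(len(points))]

points = [(0, 0), (0.5, 0), (1, 0), (10, 10), (10.5, 10), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.0, min_pts=2)
print(labels)   # two dense groups become clusters; (50, 50) is noise
```

Because clusters are grown by chaining neighbouring dense points rather than by distance to a centre, DBSCAN can trace out S-shaped and other arbitrary cluster shapes.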
Grid Based Clustering
• Grid-based clustering is a data clustering technique that partitions a dataset into cells or grids and
then groups data points based on their proximity within each grid cell. It's a simple and efficient
method for clustering large datasets, and it can help identify clusters with varying shapes and
densities. The primary advantage of grid-based clustering is that it reduces the need to compare
each data point to every other data point, which can be computationally expensive for large
datasets.
• Grid-based clustering has several advantages, such as its scalability to large datasets, ability to
handle varying cluster shapes and densities, and reduced computational complexity compared to
some other clustering methods.
• However, it may not work well with data of high dimensionality or data with complex, non-grid-
like structures. Additionally, the choice of grid cell size and other parameters can affect the results,
so parameter tuning is necessary for effective clustering.
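The grid-based idea can be sketched as follows: each point is hashed into a cell once, dense cells are kept, and touching dense cells are joined, so no point-to-point comparisons are needed. The `cell_size` and `min_density` parameters are the tunable values the paragraph above refers to.

```python
from collections import defaultdict

def grid_clusters(points, cell_size, min_density):
    """Grid-based clustering sketch: hash each 2-D point into a square
    cell, keep the cells containing at least min_density points, and
    join touching dense cells into one cluster with a flood fill."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    dense = {c for c, members in cells.items() if len(members) >= min_density}

    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        group, stack = [], [start]
        while stack:                  # flood fill over the 8 neighbours
            c = stack.pop()
            if c in seen or c not in dense:
                continue
            seen.add(c)
            group.extend(cells[c])
            cx, cy = c
            stack.extend((cx + dx, cy + dy)
                         for dx in (-1, 0, 1) for dy in (-1, 0, 1))
        clusters.append(group)
    return clusters

points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2),
          (5.1, 5.1), (5.2, 5.3), (5.3, 5.2)]
clusters = grid_clusters(points, cell_size=1.0, min_density=2)
```

Each point is touched once during binning, which is where the reduced computational cost over pairwise methods comes from.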
Model based clustering
• Model-based clustering is a statistical and machine learning approach used to
cluster data into groups or clusters based on probabilistic models. Unlike some
other clustering methods, model-based clustering explicitly assumes that the data
is generated from a mixture of underlying probability distributions. Each cluster is
associated with one of these distributions, and the goal is to estimate the
parameters of these distributions and assign data points to the clusters.
• Model-based clustering is a data clustering approach that assumes that the data is generated from a statistical
model. In this approach, clusters are defined by the parameters of the model, and data points are assigned to
clusters based on the likelihood that they were generated by a particular cluster's model. Model-based
clustering algorithms aim to find the best-fitting model that explains the data distribution, allowing for the
identification of clusters in a more probabilistic and generative way.
• One of the most commonly used model-based clustering algorithms is the Gaussian Mixture Model (GMM).
• Key characteristics of model-based clustering (GMM) include:
• It is based on probabilistic models, making it a generative clustering approach.
• It can capture clusters of different shapes, as the Gaussian components can have various means and
variances.
• It provides uncertainty estimates for cluster assignments.
• Model-based clustering is effective when the data is assumed to be generated from a mixture of underlying
distributions, and it is capable of modeling complex data structures. However, it relies on strong assumptions
about the data distribution (e.g., Gaussian components in the case of GMM), and the performance can be
sensitive to the choice of the model and its parameters.
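The GMM idea can be sketched with a tiny one-dimensional, two-component mixture fitted by Expectation-Maximization, the standard estimation procedure for GMMs. This is a bare-bones illustration under simplifying assumptions (1-D data, exactly two components, deterministic min/max initialisation), not a production implementation.

```python
import math

def gmm_em_1d(data, n_iters=50):
    """1-D, two-component Gaussian mixture fitted with EM: estimate each
    component's mean, variance, and weight, then assign each point to
    the component most likely to have generated it."""
    mu = [min(data), max(data)]       # crude but deterministic start
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def pdf(x, m, v):                 # Gaussian density
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(n_iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
            w[k] = nk / len(data)

    labels = [max(range(2), key=lambda k: w[k] * pdf(x, mu[k], var[k]))
              for x in data]
    return mu, var, w, labels

data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.2, 8.8, 9.1]
mu, var, w, labels = gmm_em_1d(data)
```

The responsibilities computed in the E-step are the "uncertainty estimates for cluster assignments" listed above: each point gets a probability of belonging to each component, not just a hard label.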
Constraint based clustering
• Constraint-based cluster analysis, also known as constrained clustering, is a type of clustering technique that
incorporates constraints or prior knowledge into the clustering process. These constraints can help guide the
clustering algorithm to produce more meaningful and interpretable clusters. Constraint-based clustering can
be particularly useful in scenarios where you have domain-specific knowledge that you want to enforce
during the clustering process.
Clustering high dimensional data
• Clustering high-dimensional data can be challenging due to the curse of dimensionality, where the distance
between data points tends to increase as the number of dimensions grows. This can result in sparse data and
make traditional clustering methods less effective.
• Clustering is the task of dividing the population or data points into a number of groups such that data points
in the same groups are more similar to other data points in the same group and dissimilar to the data points in
other groups.
Challenges of Clustering High-Dimensional Data:
• Clustering high-dimensional data returns groups of objects as clusters. To perform
cluster analysis of high-dimensional data, similar types of objects must be grouped
together; but the high-dimensional data space is huge and contains complex data types
and attributes. A major challenge is to find the set of attributes relevant to each
cluster, because a cluster is defined and characterized by the attributes present in it.
When clustering high-dimensional data, we need to search both for the clusters and for
the subspaces in which those clusters exist.
• High-dimensional data is often reduced to low-dimensional data to make clustering and
the search for clusters simpler. Some applications need appropriate models of clusters,
especially for high-dimensional data. Clusters in high-dimensional data are
significantly small, and conventional distance measures can be ineffective. Instead, to
find the hidden clusters in high-dimensional data, we need to apply sophisticated
techniques that can model correlations among the objects in subspaces.
Outlier analysis in data mining
• Outlier analysis in data mining is the process of identifying and examining data points that
significantly differ from the rest of the dataset. An outlier can be defined as a data point that
deviates significantly from the normal pattern or behavior of the data.
• Various factors, such as measurement errors, unexpected events, data processing errors, etc., can
cause these outliers.
• For example, outliers are represented as red dots in the figure below, and you can see that they
deviate significantly from the rest of the data points. Outliers are also often referred to as
anomalies, aberrations, or irregularities.
• Outlier analysis in data mining can provide several benefits, as mentioned below:
• Improved accuracy of data analysis - Outliers can skew the results of statistical analyses or predictive
models, leading to inaccurate or misleading conclusions. Detecting and removing outliers can improve the
accuracy and reliability of data analysis.
• Identification of data quality issues - Outliers can be caused by data collection, processing, or measurement
errors, which can indicate data quality issues. Outlier analysis in data mining can help identify and correct
these issues to improve data quality.
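One simple statistical outlier test can be sketched with the standard library: flag values lying more than a chosen number of standard deviations from the mean. The transaction amounts and the two-sigma threshold below are illustrative assumptions, not values from the text.

```python
from statistics import mean, stdev

def zscore_outliers(data, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations
    from the mean -- a simple statistical outlier test."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) / s > threshold]

daily_transactions = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(zscore_outliers(daily_transactions))   # [95]
```

Removing the flagged value before further analysis is how outlier detection improves the accuracy of the statistics computed downstream, as described above.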
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 

Recently uploaded (20)

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 

Clustering.pptx

  • 2. Introduction
• Unsupervised learning is a machine learning concept where unlabelled and unclassified information is analysed to discover hidden knowledge.
• The algorithms work on the data without any prior training, but they are constructed in such a way that they can identify patterns, groupings, sorting orders, and other interesting knowledge from the data.
  • 3.
• Data can broadly be divided into the following two types:
• 1. Qualitative data
• 2. Quantitative data
• Qualitative data provides information about the quality of an object, or information which cannot be measured.
• For example, if we consider the quality of performance of students in terms of ‘Good’, ‘Average’, and ‘Poor’, it falls under the category of qualitative data.
• Also, the name or roll number of a student is information that cannot be measured using some scale of measurement, so it would fall under qualitative data.
• Qualitative data is also called categorical data.
• Qualitative data can be further subdivided into two types as follows:
• 1. Nominal data
• 2. Ordinal data
  • 4.
• Nominal data is data which has no numeric value, but a named value. It is used for assigning named values to attributes. Nominal values cannot be quantified. Examples of nominal data are:
• 1. Blood group: A, B, O, AB, etc.
• 2. Nationality: Indian, American, British, etc.
• 3. Gender: Male, Female, Other.
• Obviously, mathematical operations such as addition, subtraction, multiplication, etc. cannot be performed on nominal data. For that reason, statistical functions such as mean, variance, etc. also cannot be applied to nominal data. However, a basic count is possible, so the mode, i.e. the most frequently occurring value, can be identified for nominal data.
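Since only counting is meaningful for nominal data, the mode can be computed with a simple frequency count. A minimal stdlib-Python sketch (the blood-group values are invented for illustration):

```python
from collections import Counter

# Nominal data: only counting is valid, so the mode is the one
# summary statistic we can compute.
blood_groups = ["A", "B", "O", "AB", "O", "A", "O", "B", "O"]

counts = Counter(blood_groups)
mode, frequency = counts.most_common(1)[0]
print(mode, frequency)  # "O" occurs 4 times
```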
  • 5.
• Ordinal data, in addition to possessing the properties of nominal data, can also be naturally ordered. This means ordinal data also assigns named values to attributes, but unlike nominal data, they can be arranged in a sequence of increasing or decreasing value so that we can say whether a value is better than or greater than another value. Examples of ordinal data are:
• 1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
• 2. Grades: A, B, C, etc.
• 3. Hardness of metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.
• Like nominal data, basic counting is possible for ordinal data, so the mode can be identified. Since ordering is possible for ordinal data, the median and quartiles can be identified in addition. The mean still cannot be calculated.
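Because ordinal values can be ranked, the median is computable even though the mean is not. A small sketch, assuming a hypothetical customer-satisfaction survey mapped onto explicit ranks:

```python
# Ordinal data: values have a natural order, so the median can be
# found by ranking, but the mean is undefined for the labels.
order = {"Unhappy": 0, "Happy": 1, "Very Happy": 2}
responses = ["Happy", "Very Happy", "Unhappy", "Happy", "Very Happy"]

ranks = sorted(order[r] for r in responses)
median_rank = ranks[len(ranks) // 2]      # middle of the 5 sorted ranks
labels = {v: k for k, v in order.items()}
print(labels[median_rank])  # Happy
```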
  • 6.
• Quantitative data relates to information about the quantity of an object – hence it can be measured.
• For example, if we consider the attribute ‘marks’, it can be measured using a scale of measurement.
• Quantitative data is also termed numeric data.
• There are two types of quantitative data:
• 1. Interval data
• 2. Ratio data
• Interval data is numeric data for which not only the order is known, but the exact difference between values is also known.
• An ideal example of interval data is Celsius temperature. The difference between values always means the same thing on the Celsius scale: the difference between 12°C and 18°C is a measurable 6°C, just as in the case of the difference between 15.5°C and 21.5°C. Other examples include date, time, etc.
  • 7.
• For interval data, mathematical operations such as addition and subtraction are possible. For that reason, the central tendency of interval data can be measured by mean, median, or mode. Standard deviation can also be calculated.
• However, interval data does not have a ‘true zero’ value. For example, there is no such thing as ‘0 temperature’ or ‘no temperature’. Hence, only addition and subtraction apply to interval data; ratios cannot be applied.
• This means we can say a temperature of 40°C is equal to a temperature of 20°C plus a temperature of 20°C. However, we cannot say that a temperature of 40°C is twice as hot as a temperature of 20°C.
  • 8.
• Ratio data represents numeric data for which an exact value can be measured. An absolute zero is available for ratio data, which represents the complete absence of the measured attribute. Also, these variables can be added, subtracted, multiplied, or divided. The central tendency can be measured by mean, median, or mode, along with measures of dispersion such as standard deviation. Examples of ratio data include height, weight, age, salary, etc.
  • 9.
• Binary variables
• A binary variable is a variable that can take only two values.
• For example, a gender variable is often treated as taking the two values male and female.
  • 10. CLUSTERING
• Clustering refers to a broad set of techniques for finding subgroups, or clusters, in a data set on the basis of the characteristics of the objects within that data set, in such a manner that the objects within a group are similar (or related) to each other but different from (or unrelated to) the objects in the other groups.
• The primary driver of clustering is knowledge discovery rather than prediction, because we may not even know what we are looking for before starting the clustering analysis.
• The effectiveness of clustering depends on how similar or related the objects within a group are, and how different or unrelated the objects in different groups are from each other.
  • 11.
• Defining what it means for two objects to be similar or dissimilar is often domain specific, and is thus an important aspect of the unsupervised machine learning task.
• As an example, suppose we want to run some advertisements for a new movie as a countrywide promotional activity. We have data on the age, location, financial condition, and political stability of the people in different parts of the country.
• We may want to run a different type of campaign for the different parts, grouped according to the data we have. Any logical grouping obtained by analysing the characteristics of the people will help us in driving the campaigns in a more targeted way.
• Clustering analysis can help in this activity by analysing different ways to group the set of people and arriving at different types of clusters.
  • 12.
• There are many different fields where cluster analysis is used effectively, such as:
• Text data mining: this includes tasks such as text categorization, text clustering, document summarization, concept extraction, sentiment analysis, and entity relation modelling.
• Customer segmentation: creating clusters of customers on the basis of parameters such as demographics, financial conditions, buying habits, etc., which can be used by retailers and advertisers to promote their products in the correct segment.
• Anomaly checking: checking for anomalous behaviour such as fraudulent bank transactions, unauthorized computer intrusion, suspicious movements on a radar scanner, etc.
• Data mining: simplifying the data mining task by grouping a large number of features from an extremely large data set to make the analysis manageable.
  • 14.
• Let us take an example. You were invited to take a session on Machine Learning at a reputed university for the induction of their professors into the subject. Before you create the material for the session, you want to know the professors’ level of acquaintance with the subject so that the session is successful.
• But you do not want to ask the inviting university; rather, you do some analysis on your own on the basis of freely available data. As Machine Learning is the intersection of Statistics and Computer Science, you focus on identifying the professors from these two areas as well.
• So, you searched the list of research publications of these professors on the internet, and by using a machine learning algorithm, you now want to group the papers and thus infer the expertise of the professors into three buckets – Statistics, Computer Science, and Machine Learning.
  • 15. After plotting the number of publications of these professors in the two core areas, namely Statistics and Computer Science, you obtain a scatter plot as shown in the figure.
  • 16.
• One inference that can be derived from the pattern analysis of the data is that there seem to be three groups or clusters emerging from the data. The pure statisticians have very few Computer Science-related papers, whereas the pure Computer Science professors have fewer Statistics-related papers than Computer Science-related papers.
• There is a third cluster of professors who have published papers in both these areas and can thus be assumed to be the persons knowledgeable in machine learning concepts, as shown in the figure.
  • 17.
• Thus, in the above problem, we used a visual indication of logical grouping of data to identify a pattern or cluster and labelled the data in three different clusters. The main driver for our clustering was the closeness of the points to each other to form a group.
• A clustering algorithm uses a very similar approach to measure how closely the data points are related and decides whether they can be labelled as a homogeneous group.
• Different types of clustering techniques
• The major clustering techniques are:
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based clustering
• Model-based clustering
• Constraint-based clustering
  • 18.
• Their approaches towards creating the clusters, their ways of measuring the quality of the clusters, and their applicability are different. The table summarizes the main characteristics of each method for reference.
  • 19. Partitioning Method
• Suppose we are given a database of n objects; the partitioning method constructs k partitions of the data. Each partition represents a cluster, and k ≤ n. This means it classifies the data into k groups, where:
– Each group contains at least one object.
– Each object belongs to exactly one group.
• For a given number of partitions (say k), the partitioning method will create an initial partitioning.
• Then it uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
  • 20. Partitioning methods
• Two of the most important algorithms for partitioning-based clustering are k-means and k-medoids. In the k-means algorithm, the centroid of the prototype is identified for clustering, which is normally the mean of a group of points.
• Similarly, the k-medoids algorithm identifies the medoid, which is the most representative point for a group of points. We can also infer that in most cases the centroid does not correspond to an actual data point, whereas the medoid is always an actual data point.
• K-means – a centroid-based technique
• This is one of the oldest and most popular algorithms for clustering.
• The principle of the k-means algorithm is to assign each of the ‘n’ data points to one of the K clusters, where ‘K’ is a user-defined parameter specifying the number of clusters desired.
• The objective is to maximize the homogeneity within the clusters and also to maximize the differences between the clusters.
• The homogeneity and differences are measured in terms of the distance between the objects or points in the data set.
  • 21.
• The simple k-means algorithm is as follows:
• Step 1: Select K points in the data space and mark them as initial centroids
• loop
• Step 2: Assign each point in the data space to the nearest centroid to form K clusters
• Step 3: Measure the distance of each point in the cluster from the centroid
• Step 4: Calculate the Sum of Squared Error (SSE) to measure the quality of the clusters
• Step 5: Identify the new centroid of each cluster on the basis of the distance between points
• Step 6: Repeat Steps 2 to 5 to refine until the centroids do not change
• end loop
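The steps above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: the initial centroids, the data points, and K = 2 are all invented for the example.

```python
import math

def kmeans(points, centroids, iters=10):
    """Plain k-means sketch: assign each point to its nearest centroid,
    recompute centroids as cluster means, and repeat."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the points assigned to the cluster
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
                     else centroids[i] for i, cl in enumerate(clusters)]
    # SSE: sum of squared distances of points from their own centroid
    sse = sum(math.dist(p, centroids[i]) ** 2
              for i, cl in enumerate(clusters) for p in cl)
    return centroids, clusters, sse

# Illustrative data: two obvious groups, clustered with K = 2
pts = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
cents, cls, sse = kmeans(pts, centroids=[(0, 0), (10, 10)])
print(sorted(cents))
```

The two returned centroids land near the middles of the two point groups, and the SSE gives a single quality number that a full implementation would track between iterations to detect convergence.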
  • 22.
• Let us understand this algorithm with an example. In the figure below, we have a certain set of data points, and we will apply the k-means algorithm to find out the clusters generated from this data set.
• Let us fix K = 4, implying that we want to create four clusters out of this data set. As the first step, we assign four random points from the data set as the centroids, as represented by the * signs, and we assign the data points to the nearest centroid to create four clusters.
• In the second step, on the basis of the distance of the points from the corresponding centroids, the centroids are updated and points are reassigned to the updated centroids.
• After three iterations, we find that the centroids are no longer moving, as there is no scope for refinement, and thus the k-means algorithm terminates.
• This provides us the most logical four groupings or clusters of the data set, where the homogeneity within the groups is highest and the difference between the groups is maximum.
  • 24.
• Choosing an appropriate number of clusters
• One of the most important success factors in arriving at correct clustering is to start with the correct number of clusters. Different numbers of starting clusters lead to completely different types of data split.
• It always helps if we have some prior knowledge about the number of clusters and start our k-means algorithm with that prior knowledge. For example, if we are clustering the data of the students of a university, it is always better to start with the number of departments in that university.
• Sometimes, business needs or resource limitations drive the number of required clusters.
• For example, if a movie maker wants to cluster movies on the basis of a combination of two parameters – budget of the movie (high or low) and casting of the movie (star or non-star) – then there are 4 possible combinations, and thus there can be four clusters to split the data.
  • 25.
• For a small data set, a rule of thumb that is sometimes followed is K ≈ √(n/2), which means that K is set as the square root of n/2 for a data set of n examples. Unfortunately, this rule of thumb does not work well for large data sets. There are several statistical methods to arrive at a suitable number of clusters.
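The rule of thumb is a one-line calculation; the example sizes below are arbitrary:

```python
import math

def rule_of_thumb_k(n):
    """K ~ sqrt(n/2): a quick starting guess for small data sets."""
    return round(math.sqrt(n / 2))

print(rule_of_thumb_k(50))   # sqrt(25) = 5
print(rule_of_thumb_k(200))  # sqrt(100) = 10
```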
  • 26. Hierarchical clustering
• There are situations when the data needs to be partitioned into groups at different levels, as in a hierarchy. Hierarchical clustering methods are used to group the data into a hierarchy or tree-like structure.
• For example, in a machine learning problem of organizing the employees of a university into different departments, first the employees are grouped under the different departments of the university, and then within each department the employees can be grouped according to their roles, such as professors, assistant professors, supervisors, lab assistants, etc.
• This creates a hierarchical structure of the employee data and eases visualization and analysis. Similarly, there may be a data set with an underlying hierarchical structure that we want to discover, and we can use hierarchical clustering methods to achieve that.
  • 27.
• There are two main hierarchical clustering methods: agglomerative clustering and divisive clustering.
• Agglomerative clustering is a bottom-up technique which starts with individual objects as clusters and then iteratively merges them to form larger clusters. On the other hand, the divisive method starts with one cluster containing all the given objects and then splits it iteratively to form smaller clusters.
  • 28.
• The agglomerative hierarchical clustering method uses the bottom-up strategy. It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters. It terminates either when a certain clustering condition imposed by the user is achieved or when all the clusters merge into a single cluster.
• The divisive hierarchical clustering method uses a top-down strategy. The starting point is the largest cluster with all the objects in it, which is then split recursively to form smaller and smaller clusters, thus forming the hierarchy. The iterations end when the objects in the final clusters are sufficiently homogeneous, when the final clusters contain only one object, or when the user-defined clustering condition is achieved.
• In both these cases, it is important to select the split and merge points carefully, because subsequent splits or merges will use the results of the previous ones, and there is no option to swap objects between clusters or rectify decisions made in previous steps, which may result in poor clustering quality at the end.
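The agglomerative (bottom-up) merging described above can be sketched in plain Python with single linkage, i.e. merging the two clusters whose closest members are nearest. This is a toy illustration with invented points, stopping when k clusters remain rather than when a user condition is met:

```python
import math

def agglomerative(points, k):
    """Bottom-up clustering with single linkage: start with one
    cluster per point and repeatedly merge the two closest clusters
    until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-linkage distance: closest pair across clusters
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(agglomerative(pts, 2))
```

Recording the merge order and distances of this loop is exactly the information a dendrogram visualizes.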
  • 29.
• A dendrogram is a commonly used tree-structure representation of the step-by-step creation of a hierarchical clustering. It shows how the clusters are merged iteratively (in the case of agglomerative clustering) or split iteratively (in the case of divisive clustering) to arrive at the optimal clustering solution. The figure shows a dendrogram with four levels and how the objects are merged or split at each level to arrive at the hierarchical clustering.
  • 30. Density-based methods – DBSCAN
• When using partitioning and hierarchical clustering methods, the resulting clusters are spherical or nearly spherical in nature. In the case of other cluster shapes, such as S-shaped or unevenly shaped clusters, these two types of methods do not provide accurate results.
• The density-based clustering approach provides a solution for identifying clusters of arbitrary shapes. The principle is to identify the dense areas and sparse areas within the data set and then run the clustering algorithm. DBSCAN is a popular density-based algorithm which creates clusters from connected regions of high density.
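A compact sketch of the DBSCAN idea in plain Python — not the optimized library version. Clusters grow from "core" points that have at least `min_pts` neighbours within radius `eps`; unreachable points are labelled noise (-1). The `eps`, `min_pts`, and sample points are illustrative:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: expand clusters from core points;
    points reachable from no core point are noise (-1)."""
    labels = {p: None for p in points}
    cluster_id = 0
    for p in points:
        if labels[p] is not None:
            continue
        neighbours = [q for q in points if math.dist(p, q) <= eps]
        if len(neighbours) < min_pts:
            labels[p] = -1               # noise (may be claimed later)
            continue
        labels[p] = cluster_id           # p is a core point
        queue = [q for q in neighbours if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id   # former noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neigh = [r for r in points if math.dist(q, r) <= eps]
            if len(q_neigh) >= min_pts:  # q is a core point too: keep growing
                queue.extend(r for r in q_neigh if labels[r] is None)
        cluster_id += 1
    return labels

pts = [(0, 0), (0.5, 0), (0, 0.5), (5, 5), (5.5, 5), (5, 5.5), (20, 20)]
labels = dbscan(pts, eps=1.0, min_pts=3)
print(labels[(20, 20)])  # -1: an isolated noise point
```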
  • 31. Grid-Based Clustering
• Grid-based clustering is a data clustering technique that partitions a dataset into cells or grids and then groups data points based on their proximity within each grid cell. It is a simple and efficient method for clustering large datasets, and it can help identify clusters with varying shapes and densities. The primary advantage of grid-based clustering is that it reduces the need to compare each data point to every other data point, which can be computationally expensive for large datasets.
• Grid-based clustering has several advantages, such as its scalability to large datasets, its ability to handle varying cluster shapes and densities, and its reduced computational complexity compared to some other clustering methods.
• However, it may not work well with high-dimensional data or data with complex, non-grid-like structures. Additionally, the choice of grid cell size and other parameters can affect the results, so parameter tuning is necessary for effective clustering.
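One way the grid idea can be sketched: bin points into square cells, keep the "dense" cells, and join neighbouring dense cells into one cluster with a flood fill over the grid. The cell size, density threshold, and data are all invented for illustration:

```python
from collections import defaultdict

def grid_cluster(points, cell_size, density_threshold):
    """Grid-based clustering sketch: bin points into cells, keep dense
    cells, and merge adjacent dense cells into one cluster."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    dense = {c for c, ps in cells.items() if len(ps) >= density_threshold}

    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        group, stack = [], [cell]
        while stack:                     # flood fill over adjacent dense cells
            c = stack.pop()
            if c in seen:
                continue
            seen.add(c)
            group.extend(cells[c])
            cx, cy = c
            stack.extend((cx + dx, cy + dy)
                         for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                         if (cx + dx, cy + dy) in dense)
        clusters.append(group)
    return clusters

pts = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.8), (5.1, 5.2), (5.3, 5.1)]
print(len(grid_cluster(pts, cell_size=1.0, density_threshold=2)))  # 2 clusters
```

Note that each point is compared only against a handful of grid cells, which is the computational saving the slide describes.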
  • 32. Model-based clustering
• Model-based clustering is a statistical and machine learning approach used to cluster data into groups based on probabilistic models. Unlike some other clustering methods, model-based clustering explicitly assumes that the data is generated from a mixture of underlying probability distributions. Each cluster is associated with one of these distributions, and the goal is to estimate the parameters of these distributions and assign data points to the clusters.
  • 33.
• Model-based clustering is a data clustering approach that assumes the data is generated from a statistical model. In this approach, clusters are defined by the parameters of the model, and data points are assigned to clusters based on the likelihood that they were generated by a particular cluster’s model. Model-based clustering algorithms aim to find the best-fitting model that explains the data distribution, allowing for the identification of clusters in a more probabilistic and generative way.
• One of the most commonly used model-based clustering algorithms is the Gaussian Mixture Model (GMM).
• Key characteristics of model-based clustering (GMM) include:
• It is based on probabilistic models, making it a generative clustering approach.
• It can capture clusters of different shapes, as the Gaussian components can have various means and variances.
• It provides uncertainty estimates for cluster assignments.
• Model-based clustering is effective when the data is assumed to be generated from a mixture of underlying distributions, and it is capable of modelling complex data structures. However, it relies on strong assumptions about the data distribution (e.g., Gaussian components in the case of GMM), and its performance can be sensitive to the choice of the model and its parameters.
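A GMM is usually fitted with the EM (expectation–maximization) algorithm. Below is a toy 1-D, two-component sketch of that fitting loop, with invented data and initial means; real use would rely on a library implementation. The E-step computes each point's "responsibility" for each component; the M-step re-estimates the means, variances, and mixing weights from those responsibilities:

```python
import math

def gmm_em_1d(data, mu, iters=30):
    """Tiny EM sketch for a two-component 1-D Gaussian mixture."""
    var = [1.0, 1.0]          # component variances
    pi = [0.5, 0.5]           # mixing weights
    for _ in range(iters):
        # E-step: soft (probabilistic) assignment of each point
        resp = []
        for x in data:
            dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in (0, 1)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the responsibilities
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
            pi[k] = nk / len(data)
    return mu, var, pi

data = [0.2, -0.1, 0.4, 0.0, 9.8, 10.1, 9.9, 10.3]
mu, var, pi = gmm_em_1d(data, mu=[1.0, 9.0])
print(sorted(round(m, 1) for m in mu))  # means near 0 and 10
```

The responsibilities computed in the E-step are exactly the "uncertainty estimates for cluster assignments" mentioned above: each point gets a probability of belonging to each component rather than a hard label.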
  • 34. Constraint-based clustering
• Constraint-based cluster analysis, also known as constrained clustering, is a type of clustering technique that incorporates constraints or prior knowledge into the clustering process. These constraints can help guide the clustering algorithm to produce more meaningful and interpretable clusters. Constraint-based clustering can be particularly useful in scenarios where you have domain-specific knowledge that you want to enforce during the clustering process.
  • 35. Clustering high-dimensional data
• Clustering high-dimensional data can be challenging due to the curse of dimensionality, where the distance between data points tends to increase as the number of dimensions grows. This can result in sparse data and make traditional clustering methods less effective.
• Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to the data points in other groups.
  • 36. Challenges of clustering high-dimensional data:
• Clustering high-dimensional data returns groups of objects as clusters. To perform cluster analysis of high-dimensional data, similar types of objects need to be grouped together, but the high-dimensional data space is huge and has complex data types and attributes. A major challenge is that we need to find the set of attributes present in each cluster, since a cluster is defined and characterized by the attributes present in it. When clustering high-dimensional data, we need to search for clusters and find the subspace in which each cluster exists.
• High-dimensional data is often reduced to low-dimensional data to make the clustering and the search for clusters simpler. Some applications need appropriate models of clusters, especially for high-dimensional data, where clusters can be significantly small. Conventional distance measures can be ineffective; instead, to find the hidden clusters in high-dimensional data, we need to apply sophisticated techniques that can model correlations among the objects in subspaces.
  • 37. Outlier analysis in data mining
• Outlier analysis in data mining is the process of identifying and examining data points that significantly differ from the rest of the dataset. An outlier can be defined as a data point that deviates significantly from the normal pattern or behaviour of the data.
• Various factors, such as measurement errors, unexpected events, data processing errors, etc., can cause these outliers.
• For example, outliers are represented as red dots in the figure below, and you can see that they deviate significantly from the rest of the data points. Outliers are also often referred to as anomalies, aberrations, or irregularities.
  • 38.
• Outlier analysis in data mining can provide several benefits, as mentioned below:
• Improved accuracy of data analysis – Outliers can skew the results of statistical analyses or predictive models, leading to inaccurate or misleading conclusions. Detecting and removing outliers can improve the accuracy and reliability of data analysis.
• Identification of data quality issues – Outliers can be caused by errors in data collection, processing, or measurement, which can indicate data quality issues. Outlier analysis in data mining can help identify and correct these issues to improve data quality.
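As an illustration, one common way to flag outliers is the interquartile-range (IQR) rule: any point more than 1.5 × IQR outside the first or third quartile is considered an outlier. A stdlib-Python sketch with invented data:

```python
import statistics

def iqr_outliers(values):
    """Flag values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 10, 98]   # 98 is an obvious anomaly
print(iqr_outliers(data))  # [98]
```

Removing or investigating the flagged values before further analysis is how the accuracy and data-quality benefits above are realized in practice.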

Editor's Notes

  1. Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature.
The measurement unit used can affect the clustering analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to a very different clustering structure. In general, expressing a variable in smaller units will lead to a larger range for that variable, and thus a larger effect on the resulting clustering structure.
To help avoid dependence on the choice of measurement units, the data should be standardized. Standardizing measurements attempts to give all variables an equal weight. This is especially useful when given no prior knowledge of the data. However, in some applications, users may intentionally want to give more weight to a certain set of variables than to others. For example, when clustering basketball player candidates, we may prefer to give more weight to the variable height.