The document discusses cluster analysis, which groups data objects into clusters so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. It describes key characteristics of clustering, including that it is unsupervised learning and that clusters are determined algorithmically rather than by humans. Various clustering algorithms are covered, including partitioning, hierarchical, density-based, and grid-based methods. Applications discussed include business intelligence, image recognition, web search, outlier detection, and biology. Requirements for effective clustering in data mining are also outlined.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
Classification of common clustering algorithms and techniques, e.g., hierarchical clustering, distance measures, K-means, squared error, SOFM, clustering large databases.
UNIT - 4: Data Warehousing and Data Mining (Nandakumar P)
UNIT-IV
Cluster Analysis: Types of Data in Cluster Analysis – A Categorization of Major Clustering Methods – Partitioning Methods – Hierarchical Methods – Density-Based Methods – Grid-Based Methods – Model-Based Clustering Methods – Clustering High-Dimensional Data – Constraint-Based Cluster Analysis – Outlier Analysis.
Clustering: Grouping Data for Insights
Clustering is a fundamental method in data analysis and machine learning that focuses on the task of dividing a set of data points into groups or clusters. The primary goal is to ensure that data points within the same cluster are more similar to each other than to those in other clusters. This technique is invaluable for discovering structure and patterns within complex data sets, making it an essential tool in fields ranging from marketing and finance to bioinformatics and social network analysis.
Key Concepts and Algorithms
K-Means Clustering: One of the most popular clustering algorithms, K-Means aims to partition data into K distinct clusters. Each cluster is defined by its centroid, which is the mean of the data points in that cluster. The algorithm iteratively assigns data points to the nearest centroid and recalculates the centroids until convergence. It is efficient and simple but requires specifying the number of clusters in advance.
Hierarchical Clustering: This method builds a tree-like structure (dendrogram) to represent data points' nested groupings. It can be agglomerative (bottom-up) or divisive (top-down). Agglomerative clustering starts with each data point as a single cluster and merges the closest pairs iteratively, while divisive clustering starts with all data points in one cluster and splits them iteratively. It doesn’t require specifying the number of clusters beforehand but can be computationally intensive for large datasets.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN identifies clusters based on the density of data points. It groups together points that are closely packed and marks points that lie alone in low-density regions as outliers. This algorithm can discover clusters of arbitrary shapes and is robust to noise but requires careful tuning of its parameters, such as the neighborhood radius and the minimum number of points.
Gaussian Mixture Models (GMM): GMM assumes that the data points are generated from a mixture of several Gaussian distributions with unknown parameters. It uses the Expectation-Maximization (EM) algorithm to estimate the parameters iteratively. GMM is more flexible than K-Means, as it allows clusters to take on various shapes, but it can be more complex and computationally expensive.
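To make the EM idea behind GMMs concrete, here is a minimal pure-Python sketch for a two-component, one-dimensional mixture with fixed unit variances; the data values and starting means are made up for the example, and a real implementation would also update variances and guard against degenerate components.

```python
import math

def em_gmm_1d(xs, mu, sigma=1.0, iters=10):
    """EM for a two-component 1-D Gaussian mixture with fixed unit
    variances and equal starting weights -- a minimal sketch of the
    idea, not a production implementation."""
    w = [0.5, 0.5]
    mu = list(mu)
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            dens = [w[k] * math.exp(-((x - mu[k]) ** 2) / (2 * sigma ** 2))
                    for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate means and mixing weights from responsibilities
        for k in range(2):
            rk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / rk
            w[k] = rk / len(xs)
    return mu, w

data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]   # two obvious groups around 1 and 5
means, weights = em_gmm_1d(data, mu=[0.0, 6.0])
```

With this toy data the means converge close to 1 and 5, and each component ends up with roughly half the mixing weight.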
Applications of Clustering
Market Segmentation: Businesses use clustering to segment customers into distinct groups based on purchasing behavior, demographics, or other attributes. This helps in tailoring marketing strategies, improving customer satisfaction, and optimizing product offerings.
Image Segmentation: In image analysis, clustering is used to partition an image into meaningful regions, facilitating object recognition, medical imaging, and automated driving applications.
Social Network Analysis: Clustering can identify communities within social networks, helping to understand social structures, spread of information, and influence.
Clustering is a data mining technique that places data elements into related groups: it partitions the data (or objects) into classes such that objects in one class are more similar to each other than to objects in other clusters.
2. 7/2/2019 Compiled by : Kamal Acharya 2
Cluster Analysis (Clustering / automatic classification / data segmentation)
• Clustering is the process of grouping a set of data objects into multiple
groups or clusters so that objects within a cluster have high similarity,
but are very dissimilar to objects in other clusters.
• Dissimilarities and similarities are assessed based on the attribute
values describing the objects and often involve distance measures.
Contd..
• Clustering is known as unsupervised learning because the class
label information is not present. For this reason, clustering is a
form of learning by observation, rather than learning by
examples.
• Different clustering methods may generate different clusterings
on the same data set.
• The partitioning is not performed by humans, but by the
clustering algorithm.
Contd..
• Hence, Clustering is used:
– As a stand-alone tool to get insight into data distribution
• Visualization of clusters may unveil important information
– As a preprocessing step for other algorithms
• Efficient indexing or compression often relies on clustering
Some Applications of Clustering
• Cluster analysis has been widely used in numerous applications
such as:
– In business intelligence
– In image recognition
– In web search
– In Outlier detection
– In biology
Contd..
• In Business intelligence:
– clustering can help marketers discover distinct groups in their
customer bases and characterize customer groups based on
purchasing patterns so that, for example, advertising can be
appropriately targeted.
Contd..
• In image recognition:
– In image recognition, clustering can be used to discover clusters or
“subclasses” in handwritten character recognition systems.
– For example: We can use clustering to determine subclasses for “1,” each
of which represents a variation on the way in which 1 can be written.
Contd..
• In web search
– document grouping: Clustering can be used to organize the
search results into groups and present the results in a concise
and easily accessible way.
– cluster Weblog data to discover groups of similar access
patterns.
Contd..
• In Outlier detection
– Clustering can also be used for outlier detection, where
outliers (values that are “far away” from any cluster) may be
more interesting than common cases.
– Applications of outlier detection include the detection of
credit card fraud and the monitoring of criminal activities in
electronic commerce.
Contd..
• In biology:
– In biology, it can be used to derive plant and animal
taxonomies, categorize genes with similar functionality, and
gain insight into structures inherent in populations.
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity
measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability
to discover some or all of the hidden patterns.
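The intra-class/inter-class criterion above can be made concrete with a small sketch. The two point sets below are hypothetical; the quantities shown are average pairwise distances, one of the simplest ways to quantify "high intra-class similarity, low inter-class similarity".

```python
import math
from itertools import combinations

def mean_intra_distance(cluster):
    """Average pairwise distance inside one cluster (lower = tighter)."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(c1, c2):
    """Average distance between points of two clusters (higher = better separated)."""
    return sum(math.dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

a = [(1, 1), (1, 2), (2, 1)]   # hypothetical tight cluster
b = [(8, 8), (8, 9), (9, 8)]   # second cluster, far away

intra_a = mean_intra_distance(a)
inter_ab = mean_inter_distance(a, b)
```

A good clustering of these points keeps `intra_a` small relative to `inter_ab`; established measures such as the silhouette coefficient combine both notions into a single score.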
Requirements for clustering as a data mining tool
• The following are typical requirements of clustering in data
mining.
– Scalability
– Ability to deal with different types of attributes
– Discovery of clusters with arbitrary shape
– Minimal requirements for domain knowledge to determine input parameters
– Ability to deal with noisy data
– Incremental clustering and insensitivity to input order
– Capability of clustering high-dimensionality data
– Constraint-based clustering
– Interpretability and usability
Contd..
• Scalability:
– Many clustering algorithms work well on small data sets
containing fewer than several hundred data objects; however,
a large database may contain millions of objects.
– Clustering on a sample of a given large data set may lead to
biased results.
– Highly scalable clustering algorithms are needed.
Contd..
• Ability to deal with different types of attributes:
– Many algorithms are designed to cluster interval-based
(numerical) data.
– However, applications may require clustering other types of
data, such as binary, categorical (nominal), and ordinal data,
or mixtures of these data types.
Contd..
• Discovery of clusters with arbitrary shape:
– Many clustering algorithms determine clusters based on
Euclidean distance measures.
– Algorithms based on such distance measures tend to find
spherical clusters with similar size and density.
– However, a cluster could be of any shape.
– It is important to develop algorithms that can detect clusters
of arbitrary shape.
Contd..
• Minimal requirements for domain knowledge to
determine input parameters:
– Many clustering algorithms require users to input certain
parameters in cluster analysis (such as the number of desired
clusters).
– The clustering results can be quite sensitive to input parameters.
– Parameters are often difficult to determine, especially for data sets
containing high-dimensional objects.
– This not only burdens users, but it also makes the quality of
clustering difficult to control.
Contd..
• Ability to deal with noisy data:
– Most real-world databases contain outliers or missing,
unknown, or erroneous data.
– Some clustering algorithms are sensitive to such data and may
lead to clusters of poor quality.
Contd..
• Incremental clustering and insensitivity to the order of input
records:
– Some clustering algorithms cannot incorporate newly inserted
data (i.e., database updates) into existing clustering structures
and, instead, must determine a new clustering from scratch.
– Some clustering algorithms are sensitive to the order of input
data. That is, given a set of data objects, such an algorithm
may return dramatically different clusterings depending on
the order of presentation of the input objects.
– It is important to develop incremental clustering algorithms
and algorithms that are insensitive to the order of input.
Contd..
• High dimensionality:
– A database or a data warehouse can contain several
dimensions or attributes.
– Many clustering algorithms are good at handling low-
dimensional data, involving only two to three dimensions.
– Human eyes are good at judging the quality of clustering for
up to three dimensions.
– Finding clusters of data objects in high dimensional space is
challenging, especially considering that such data can be
sparse and highly skewed.
Contd..
• Constraint-based clustering:
– Real-world applications may need to perform clustering under
various kinds of constraints.
– Suppose that your job is to choose the locations for a given
number of new automatic banking machines (ATMs) in a city.
– To decide upon this, you may cluster households while
considering constraints such as the city’s rivers and highway
networks, and the type and number of customers per cluster.
– A challenging task is to find groups of data with good
clustering behavior that satisfy specified constraints.
Contd..
• Interpretability and usability:
– Users expect clustering results to be interpretable,
comprehensible, and usable.
– That is, clustering may need to be tied to specific semantic
interpretations and applications.
– It is important to study how an application goal may
influence the selection of clustering features and methods.
Aspects of clustering
• A clustering algorithm/method
– Partitional clustering
– Hierarchical clustering
– …
• A distance (similarity, or dissimilarity) function
• Clustering quality
– Inter-clusters distance maximized
– Intra-clusters distance minimized
• The quality of a clustering result depends on the
algorithm, the distance function, and the application.
Major Clustering Methods:
• In general, the major fundamental clustering methods can be
classified into the following categories:
– Partitioning Methods
– Hierarchical Methods
– Density-Based Methods
– Grid-Based Methods
Contd..
• Partitioning Methods:
– A partitioning method constructs k partitions of the data, where each
partition represents a cluster and k <= n. That is, it classifies the data
into k groups, which together satisfy the following requirements:
• Each group must contain at least one object, and
• Each object must belong to exactly one group.
– A partitioning method creates an initial partitioning. It then uses an
iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another.
– The general criterion of a good partitioning is that objects in the
same cluster are close or related to each other, whereas objects of
different clusters are far apart or very different.
Contd..
• Hierarchical Methods:
– A hierarchical method creates a hierarchical decomposition of
the given set of data objects.
– A hierarchical method can be classified as being either
agglomerative or divisive, based on how the hierarchical
decomposition is formed.
Contd..
• The agglomerative approach, also called the bottom-up approach, starts
with each object forming a separate group. It successively merges the
objects or groups that are close to one another, until all of the groups
are merged into one or until a termination condition holds.
• The divisive approach, also called the top-down approach, starts with
all of the objects in the same cluster. In each successive iteration, a
cluster is split up into smaller clusters, until eventually each object is in
one cluster, or until a termination condition holds.
Contd..
• Density-based methods:
– General idea is to continue growing the given cluster as long
as the density in the neighborhood exceeds some threshold;
that is, for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a
minimum number of points.
– Such a method can be used to filter out noise (outliers) and
discover clusters of arbitrary shape.
– E.g., DBSCAN
Contd..
• Grid-based methods:
– Grid-based methods quantize the object space into a finite
number of cells that form a grid structure.
– All the clustering operations are performed on the grid
structure.
– E.g., STING
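STING itself keeps statistical summaries per cell at several resolutions, but the core step of every grid-based method, quantizing the object space into cells and working with cell densities, can be sketched as follows (toy points, cell size, and density threshold chosen for illustration):

```python
from collections import Counter

def grid_cells(points, cell_size):
    """Quantize 2-D points into grid cells and count points per cell."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts

def dense_cells(counts, min_points):
    """Keep only cells whose density reaches the threshold."""
    return {cell for cell, n in counts.items() if n >= min_points}

pts = [(0.2, 0.3), (0.7, 0.6), (0.4, 0.9),   # three points in cell (0, 0)
       (5.1, 5.2)]                            # a lone point in cell (5, 5)
cells = grid_cells(pts, cell_size=1.0)
```

All subsequent clustering operations then run on the (usually far smaller) set of dense cells rather than on the individual objects, which is why grid-based methods scale well.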
Partitioning Methods
• Given a data set, D, of n objects, and k, the number of clusters to
form, a partitioning algorithm organizes the objects into k
partitions (k<=n), where each partition represents a cluster.
k-Means: A Centroid-Based Technique
• A centroid-based partitioning technique uses the centroid of a cluster,
Ci , to represent that cluster.
• The centroid of a cluster is its center point such as the mean or medoid
of the objects (or points) assigned to the cluster.
• The difference between an object p and ci, the representative of the cluster, is measured by dist(p, ci), where dist(p, ci) is the Euclidean distance between the two points.
Contd..
• The k-means algorithm defines the centroid of a cluster as the
mean value of the points within the cluster. It proceeds as
follows:
– First, it randomly selects k of the objects in D, each of which initially
represents a cluster mean or center.
– For each of the remaining objects, an object is assigned to the cluster to
which it is the most similar, based on the Euclidean distance between the
object and the cluster mean.
– The k-means algorithm then iteratively improves the within-cluster
variation. For each cluster, it computes the new mean using the objects
assigned to the cluster in the previous iteration. All the objects are then
reassigned using the updated means as the new cluster centers.
– The iterations continue until the assignment is stable, that is, the clusters
formed in the current round are the same as those formed in the previous
round.
Contd..
• Algorithm:
– The k-means algorithm for partitioning, where each cluster’s center is
represented by the mean value of the objects in the cluster.
The K-Means Clustering Method
• Example:
[Figure: the k-means process with K = 2. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; repeat until the assignment is stable.]
Contd..
• Example 1: Cluster the following instances of given data (2-dimensional form) with the help of the k-means algorithm (take K = 2):

Instance   X     Y
1          1     1.5
2          1     4.5
3          2     1.5
4          2     3.5
5          3     2.5
6          3     4
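A minimal sketch of the k-means procedure applied to Example 1, assuming (since the slide leaves the choice open) that instances 1 and 2 are taken as the initial cluster centers:

```python
import math

def k_means(points, centers, max_iters=100):
    """Lloyd's k-means: assign each point to its nearest center, then
    recompute each center as the mean of its assigned points, until
    the assignment no longer changes."""
    assignment = None
    for _ in range(max_iters):
        new_assignment = [min(range(len(centers)),
                              key=lambda k: math.dist(p, centers[k]))
                          for p in points]
        if new_assignment == assignment:
            break                         # assignment stable: converged
        assignment = new_assignment
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return assignment, centers

data = [(1, 1.5), (1, 4.5), (2, 1.5), (2, 3.5), (3, 2.5), (3, 4)]
labels, centers = k_means(data, centers=[(1, 1.5), (1, 4.5)])
```

With these starting centers the algorithm converges after one mean update: instances 1, 3, 5 form one cluster with mean (2, 1.83) and instances 2, 4, 6 the other with mean (2, 4). A different initialization could give a different final clustering, which is the known sensitivity of k-means to its starting centers.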
Contd..
• Example 2: Cluster the following instances of given data (2-dimensional form) with the help of the k-means algorithm (take K = 2):

Instance   X     Y
1          1     2.5
2          1     4.5
3          2.5   3
4          2     1.5
5          4.5   1.5
6          4     5
Hierarchical clustering
• A hierarchical clustering method works by grouping data objects
into a hierarchy or “tree” of clusters.
• Representing data objects in the form of a hierarchy is useful for
data summarization and visualization.
Contd..
• Depending on whether the hierarchical decomposition is formed
in a bottom-up (merging) or top-down (splitting) fashion, a
hierarchical clustering method can be classified into two
categories:
– Agglomerative Hierarchical Clustering and
– Divisive Hierarchical Clustering
Contd..
• Agglomerative Hierarchical Clustering:
– uses a bottom-up strategy.
– starts by letting each object form its own cluster and
iteratively merges clusters into larger and larger clusters, until
all the objects are in a single cluster or certain termination
conditions (e.g., a desired number of clusters) are satisfied.
– For the merging step, it finds the two clusters that are closest
to each other (according to some similarity measure), and
combines the two to form one cluster.
Contd..
• Example: a data set of five objects, {a, b, c, d, e}. Initially, AGNES
(AGglomerative NESting), the agglomerative method, places each object into
a cluster of its own. The clusters are then merged step-by-step according to
some criterion (e.g., minimum Euclidean distance).
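The slides give no coordinates for a–e, so the sketch below assumes hypothetical 1-D positions and merges clusters AGNES-style by single-link (minimum) distance, recording the merge order that a dendrogram would display:

```python
def single_link(c1, c2, pos):
    """Single-link distance: closest pair of members across two clusters."""
    return min(abs(pos[a] - pos[b]) for a in c1 for b in c2)

def agnes(pos):
    """Merge the two closest clusters until one remains,
    returning the merge history (a flattened dendrogram)."""
    clusters = [frozenset([name]) for name in pos]
    history = []
    while len(clusters) > 1:
        # find the closest pair of clusters under single-link distance
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]],
                                              clusters[ij[1]], pos))
        merged = clusters[i] | clusters[j]
        history.append(sorted(merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

# hypothetical 1-D positions for the five objects
positions = {"a": 1.0, "b": 1.5, "c": 5.0, "d": 5.3, "e": 9.0}
merges = agnes(positions)
```

With these positions, c and d merge first (distance 0.3), then a and b, then {a, b} with {c, d}, and finally e joins, so `merges` reads the dendrogram from the leaves up. The quadratic pair search is what makes naive agglomerative clustering expensive on large data sets.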
Contd..
• Divisive hierarchical clustering :
– A divisive hierarchical clustering method employs a top-down
strategy.
– It starts by placing all objects in one cluster, which is the
hierarchy’s root.
– It then divides the root cluster into several smaller sub-clusters, and
recursively partitions those clusters into smaller ones.
– The partitioning process continues until each cluster at the lowest
level either contains only one object or the objects within a
cluster are sufficiently similar to each other.
Contd..
• Example: DIANA (DIvisive ANAlysis), a divisive hierarchical clustering
method:
– a data set of five objects, {a, b, c, d, e}. All the objects are used to form
one initial cluster. The cluster is split according to some principle such as
the maximum Euclidean distance between the closest neighboring objects
in the cluster. The cluster-splitting process repeats until, eventually, each
new cluster contains only a single object.
Contd..
• agglomerative versus divisive hierarchical clustering:
– Organize objects into a hierarchy using a bottom-up or top-
down strategy, respectively.
– Agglomerative methods start with individual objects as
clusters, which are iteratively merged to form larger clusters.
– Conversely, divisive methods initially let all the given objects
form one cluster, which they iteratively split into smaller
clusters.
Contd..
• Hierarchical clustering methods can encounter difficulties regarding
the selection of merge or split points.
– Such a decision is critical, because once a group of objects is
merged or split, the process at the next step will operate on the
newly generated clusters. It will neither undo what was done
previously, nor perform object swapping between clusters.
– Thus, merge or split decisions, if not well chosen, may lead to low-
quality clusters.
• Moreover, the methods do not scale well because each decision of
merge or split needs to examine and evaluate many objects or clusters.
Density Based Methods
• Partitioning methods and hierarchical clustering are suitable for finding
spherical-shaped clusters.
• Moreover, they are severely affected by the presence of noise and
outliers in the data.
• Unfortunately, real-life data contain:
– Clusters of arbitrary shape, such as oval, linear, and s-shaped clusters
– A lot of noise
• Solution: density-based methods
49. Contd..
• Basic Idea behind Density based methods:
– Model clusters as dense regions in the data space, separated by sparse
regions.
• Major features:
– Discover clusters of arbitrary shape(e.g., oval, s-shaped, etc)
– Handle noise
– Need density parameters as termination condition
• E.g., DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
50. Density-Based Clustering: Background
• Neighborhood of a point p = all points within distance Eps of p:
– NEps(p) = {q | dist(p, q) <= Eps}
• Two parameters:
– Eps: maximum radius of the neighborhood
– MinPts: minimum number of points in an Eps-neighborhood of that point
• If the number of points in the Eps-neighborhood of p is at least
MinPts, then p is called a core object.
[Figure: point p and its Eps-neighborhood; MinPts = 5, Eps = 1 cm]
51. Contd..
• Directly density-reachable: A point p is directly density-
reachable from a point q wrt. Eps, MinPts if
– 1) p belongs to NEps(q)
– 2) core point condition: |NEps(q)| >= MinPts
[Figure: p directly density-reachable from core point q; MinPts = 5, Eps = 1 cm]
52. Contd..
• Density-reachable:
– A point p is density-reachable from a point q wrt. Eps, MinPts if there is
a chain of points p1, ..., pn with p1 = q and pn = p, such that pi+1 is
directly density-reachable from pi
[Figure: p density-reachable from q through an intermediate point p1]
53. Contd..
• Density-connected:
– A point p is density-connected to a point q wrt. Eps, MinPts if there is a
point o such that both p and q are density-reachable from o wrt. Eps and
MinPts.
[Figure: p and q density-connected through a common point o]
54. Contd..
• Density = number of points within a specified radius (Eps).
• A point is a core point if it has at least a specified number of
points (MinPts) within Eps.
– These are points in the interior of a cluster.
– The neighborhood count includes the point itself.
• A border point is not a core point, but lies in the neighborhood of a
core point.
• A noise point is any point that is neither a core point nor a border
point.
[Figure: core, border, and noise points for MinPts = 7]
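The three point types can be computed directly from their definitions. A minimal sketch in Python; the example points, Eps, and MinPts are made-up illustrative values (the neighborhood count includes the point itself, as on the slide):

```python
import math

def classify(points, eps, min_pts):
    """Label every point as core, border, or noise.
    A point's neighborhood count includes the point itself."""
    def neighbours(p):
        return [q for q in points if math.dist(p, q) <= eps]
    cores = {p for p in points if len(neighbours(p)) >= min_pts}
    labels = {}
    for p in points:
        if p in cores:
            labels[p] = "core"    # interior of a cluster
        elif any(math.dist(p, c) <= eps for c in cores):
            labels[p] = "border"  # near a core point, but not core itself
        else:
            labels[p] = "noise"   # neither core nor border
    return labels

# Illustrative data: a tight group, one fringe point, one far-away point
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]
print(classify(pts, eps=1.5, min_pts=4))
```

With these values, the four grouped points come out as core, the fringe point (2, 1) as border, and (5, 5) as noise.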
55. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• DBSCAN grows each cluster from a core object by collecting all of the
objects that are density-reachable from it.
• To find the next cluster, DBSCAN randomly selects an unvisited object
from the remaining ones. The clustering process continues until all
objects are visited.
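This process can be sketched in Python. The following is an illustrative O(n²) toy implementation, not the slides' code: start from an unvisited object, and if it is a core object, grow a cluster by absorbing the neighborhoods of every core member reached.

```python
import math

def dbscan(points, eps, min_pts):
    """Toy DBSCAN: returns (clusters, noise) as sets of point indices.
    min_pts counts the point itself."""
    def region(i):
        # indices of all points within eps of points[i]
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]
    visited, clusters, noise = set(), [], set()
    for i in range(len(points)):
        if i in visited:
            continue
        visited.add(i)
        seeds = region(i)
        if len(seeds) < min_pts:      # not a core object: tentatively noise
            noise.add(i)
            continue
        cluster = {i}
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if j in noise:            # former noise becomes a border point
                noise.discard(j)
                cluster.add(j)
            if j in visited:
                continue
            visited.add(j)
            cluster.add(j)
            nbrs = region(j)
            if len(nbrs) >= min_pts:  # j is a core object: expand through it
                queue.extend(nbrs)
        clusters.append(cluster)
    return clusters, noise
```

For example, with pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (5, 5)], Eps = 1.5 and MinPts = 2, this yields two clusters {0, 1, 2} and {3, 4}, with point 5 left as noise.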
57. Contd..
• Example:
– If Eps is 2 and MinPts is 2, what are the clusters that DBSCAN would
discover with the following 8 examples: A1=(2,10), A2=(2,5), A3=(8,4),
A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9)?
• Solution:
– d(a,b) denotes the Euclidean distance between a and b. It is obtained
directly from the distance matrix, calculated as follows:
– d(a,b) = sqrt((xb - xa)^2 + (yb - ya)^2)
59. Contd..
• N2(A1)={};
• N2(A2)={};
• N2(A3)={A5, A6};
• N2(A4)={A8};
• N2(A5)={A3, A6};
• N2(A6)={A3, A5};
• N2(A7)={};
• N2(A8)={A4};
• So A1, A2, and A7 are outliers, while we have two clusters C1={A4,
A8} and C2={A3, A5, A6}
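The N2 sets above can be checked mechanically. A quick sketch (excluding the point itself from each neighborhood, as the solution's N2 sets do):

```python
import math

# The eight example points from the exercise
pts = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
       "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}

def n_eps(name, eps=2):
    """Eps-neighborhood of a named point, excluding the point itself."""
    p = pts[name]
    return {q for q, v in pts.items()
            if q != name and math.dist(p, v) <= eps}

for name in sorted(pts):
    print(name, sorted(n_eps(name)))
# A1, A2, and A7 have empty neighborhoods (outliers);
# {A3, A5, A6} and {A4, A8} form the two clusters.
```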
61. Advantages and Disadvantages of the DBSCAN algorithm:
• Advantages:
– DBSCAN does not require one to specify the number of clusters in the
data a priori, as opposed to k-means.
– DBSCAN can find arbitrarily shaped clusters
– DBSCAN is robust to outliers.
– DBSCAN is mostly insensitive to the ordering of the points in the
database.
– The parameters minPts and ε can be set by a domain expert, if the data is
well understood.
62. Contd..
• Disadvantages:
– DBSCAN is not entirely deterministic: border points that are reachable
from more than one cluster can end up in either cluster, depending on the
order in which the data is processed. Fortunately, this situation does not
arise often and has little impact on the clustering result; for both core
points and noise points, DBSCAN is deterministic.
– DBSCAN cannot cluster data sets well with large differences in densities,
since the minPts-ε combination cannot then be chosen appropriately for all
clusters.
– If the data and scale are not well understood, choosing a meaningful
distance threshold ε can be difficult.
63. Homework
• Explain the aims of cluster analysis.
• What is clustering? How is it different from supervised classification?
In what situations can clustering be useful?
• List and explain the desired features of cluster analysis.
• Explain the different types of cluster analysis methods and discuss their
features.
• Describe the k-means algorithm and state its strengths and
weaknesses.
• Describe the features of hierarchical clustering methods. In what
situations are these methods useful?