2. UNSUPERVISED LEARNING-CLUSTERING
ALGORITHMS
• These methods find similarities, shared features, and relationship
patterns among data samples, and then cluster those samples into groups
based on feature similarity, i.e. how are they grouped?
• Clustering is important because it determines the intrinsic grouping among the
unlabeled data at hand.
• These methods make assumptions about the data points to define their similarity.
Each assumption leads to different, but equally valid, clusters.
4. CLUSTERING ALGORITHMS-HIERARCHY
BASED
• The clusters form a tree-like structure based
on the hierarchy, where new clusters are formed
using previously formed ones.
• It is divided into two categories:
Agglomerative (bottom-up approach)
Divisive (top-down approach)
• Examples: CURE (Clustering Using
Representatives), BIRCH (Balanced Iterative
Reducing and Clustering using
Hierarchies), etc.
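Of the examples above, BIRCH has a ready-made implementation in scikit-learn; a minimal sketch, assuming scikit-learn is installed (the sample points are made up for illustration):

```python
# BIRCH in scikit-learn: build a CF-tree, then group its subclusters
# into the requested number of final clusters. (CURE has no sklearn
# implementation.)
import numpy as np
from sklearn.cluster import Birch

# Two tight, well-separated groups of 2-D points (illustrative data).
X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])

model = Birch(n_clusters=2)  # final global clustering into 2 groups
labels = model.fit_predict(X)
print(labels)
```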
5. CATEGORIES OF HIERARCHICAL
ALGORITHMS
• Agglomerative hierarchical algorithms − each data point is initially
treated as a single cluster, and pairs of clusters are then successively
merged or agglomerated (bottom-up approach). The
hierarchy of the clusters is represented as a dendrogram or tree
structure.
• Divisive hierarchical algorithms − all the data points are treated
as one big cluster, and clustering proceeds by
dividing (top-down approach) the one big cluster into various
smaller clusters.
6. STEPS TO PERFORM AGGLOMERATIVE
HIERARCHICAL CLUSTERING
• The steps to perform it are as follows −
• Step 1 − Treat each data point as a single cluster. Hence, we start
with, say, K clusters; the number of data points is also
K at the start.
• Step 2
• Step 2.1 − Form a bigger cluster by joining the two closest data points. This results
in a total of K−1 clusters.
• Step 2.2 − To form more clusters, join the two closest clusters.
This results in a total of K−2 clusters.
• Step 2.3 − Repeat the above steps until K becomes 1, i.e. one big
cluster remains and there are no more clusters left to join.
• Step 3 − After making one single big cluster, dendrograms are
used to divide it into multiple clusters depending upon the problem.
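The steps above can be sketched in pure Python, assuming 1-D points and single-linkage (the distance between the closest members of two clusters); the function name and sample values are illustrative:

```python
# A minimal sketch of agglomerative clustering on 1-D points.
def agglomerate(points, target_clusters=1):
    # Step 1: every point starts as its own cluster, so K = len(points).
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Step 2: find the pair of clusters whose closest members are
        # nearest to each other (single-linkage).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters; K shrinks by one per iteration.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Stopping at 2 clusters instead of 1 splits the data at its largest gap.
print(agglomerate([1, 2, 3, 10, 11, 12], target_clusters=2))
```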
7. [Figure: ten data points, labeled 1–10, merged step by step]
At each step, identify the two clusters that are closest together (Euclidean
distance) and merge the two most similar clusters.
The main output of hierarchical clustering is a
dendrogram.
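The dendrogram structure itself can be computed with SciPy without drawing it (a sketch, assuming SciPy is installed; matplotlib is only needed to actually plot the tree):

```python
# Build the merge hierarchy (linkage matrix) and extract the dendrogram
# layout without plotting.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Four 1-D points forming two obvious pairs (illustrative data).
X = np.array([[1.0], [2.0], [9.0], [10.0]])

Z = linkage(X, method="single")      # each row of Z records one merge
d = dendrogram(Z, no_plot=True)      # layout only, no figure created
print(d["ivl"])  # leaf labels in the left-to-right order of the tree
```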
8. METRICS BETWEEN CLUSTERS
• Measures of distance (similarity): the distance between two clusters,
• computed as the length of the straight line drawn from one cluster to another.
• This is commonly referred to as the Euclidean distance. Many other distance
metrics have been developed.
• Linkage criteria: determine where the distance is computed from.
• single-linkage: computed between the two most similar (closest) members of the two clusters
• complete-linkage: computed between the two least similar (farthest) members of the two clusters
• mean or average-linkage: computed between the centers of the clusters
• some other criterion.
• Where there is no clear theoretical justification for the choice of linkage
criterion, Ward’s method is a sensible default. This method works out
which observations to group by minimizing the sum of squared
distances of each observation from the average observation in its cluster.
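The linkage criteria above map directly onto SciPy's `linkage` function; a sketch with made-up points, assuming SciPy is installed:

```python
# Compare linkage criteria on the same data: single-linkage uses the
# closest cross-cluster pair, complete-linkage the farthest, so the
# final merge distance differs by method.
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two well-separated pairs of 2-D points (illustrative data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    # Each row of Z records one merge: the two cluster ids joined, the
    # distance between them, and the size of the new cluster.
    print(method, Z[-1, 2])  # distance at the final merge
```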
9. AGGLOMERATIVE VERSUS DIVISIVE
ALGORITHMS
• Hierarchical clustering typically works by sequentially merging similar
clusters, as shown above. This is known as agglomerative hierarchical
clustering (bottom-up).
• Alternatively, all the observations can initially be grouped into one cluster,
which is then successively split (top-down). This is known as divisive
hierarchical clustering. Divisive clustering is rarely done in practice.
10. WHAT ARE THE STRENGTHS AND
WEAKNESSES OF HIERARCHICAL
CLUSTERING?
• The strengths of hierarchical clustering are that it is easy to understand and
easy to do. There are four types of clustering algorithms in widespread
use: hierarchical clustering, k-means cluster analysis, latent class
analysis, and self-organizing maps. The math of hierarchical clustering is the
easiest to understand.
• The weaknesses are that it rarely provides the best solution, it involves lots of
arbitrary decisions, it does not work with missing data, it works poorly with
mixed data types, it does not work well on very large data sets, and its main
output, the dendrogram, is commonly misinterpreted.
11. ROLE OF DENDROGRAMS IN
AGGLOMERATIVE HIERARCHICAL
CLUSTERING
[Figure: original data point distribution (left) and the dendrogram of these data points (right)]
12. ROLE OF DENDROGRAMS IN
AGGLOMERATIVE HIERARCHICAL
CLUSTERING
Once the big cluster is formed, the longest vertical
distance that no horizontal line crosses is selected,
and a horizontal line is drawn through it.
As this horizontal line crosses the blue line at two
points, the number of clusters is two.
The diagram above shows the two
clusters formed from our data points.
13. DISCUSSION
Basically, the horizontal line is a
threshold, which defines the minimum
distance required to be a separate
cluster. If we draw the line further down,
the distance threshold required to form a new
cluster decreases and more
clusters are formed, as seen in the
image on the right.
In the plot, the horizontal line passes
through four vertical lines, resulting in
four clusters: a cluster of points 6, 7, 8, 10;
a cluster of points 3, 2, 4, 1; and points 9
and 5, each treated as a single-point cluster.
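The threshold idea can be sketched with SciPy's `fcluster`: cutting the same tree at a lower distance yields more clusters (the sample 1-D points are illustrative):

```python
# Cut a dendrogram at two different distance thresholds and compare
# how many flat clusters each cut produces.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three groups: a tight trio, a tight pair, and one outlier.
pts = np.array([[0.0], [0.2], [0.3], [5.0], [5.1], [9.0]])
Z = linkage(pts, method="single")

# A high cut (threshold 2.0) keeps only widely separated groups ...
high = fcluster(Z, t=2.0, criterion="distance")
# ... while a lower cut (threshold 0.15) splits them further.
low = fcluster(Z, t=0.15, criterion="distance")
print(len(set(high)), len(set(low)))
```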
14. EX. CLUSTERS OF THE DATA POINTS IN PIMA
INDIAN DIABETES DATASET
Pima Indian Diabetes Dataset Prediction by Hierarchy-based algorithm
15. EX. CLUSTERS OF THE DATA POINTS IN
SHOPPING TRENDS DATASET
Drawing a horizontal line through the longest
vertical distance that no horizontal line crosses, we get
5 clusters.
Original data set.
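The "cut for 5 clusters" idea can be sketched on synthetic salary/spending data (the real Shopping Trends dataset is not reproduced here; SciPy is assumed to be installed):

```python
# Generate five synthetic customer groups in (salary index, spending
# index) space, then ask the hierarchy for exactly 5 flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Five well-separated cluster centers (illustrative values).
centers = np.array([[20, 20], [20, 80], [50, 50], [80, 20], [80, 80]])
X = np.vstack([c + rng.normal(scale=3, size=(30, 2)) for c in centers])

Z = linkage(X, method="ward")
# criterion="maxclust" cuts the tree so that at most t clusters remain.
labels = fcluster(Z, t=5, criterion="maxclust")
print(len(set(labels)))
```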
16. DISCUSSION
The data points in the bottom right belong
to the customers with high salaries but low
spending. These are the customers that
spend their money carefully.
The customers at the top right (green data
points) are the customers with high
salaries and high spending. These are the
type of customers that companies target.
The customers in the middle (blue data
points) are the ones with average income
and average spending. The highest number
of customers belongs to this category.
[Scatter plot axes: salary index and spending index]