Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. It is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other. Let’s try to understand how exactly Hierarchical clustering works.
2. Agenda
● What is KSAI?
● Why KSAI?
● What is Hierarchical Clustering?
● How does Hierarchical Clustering work?
● Simple Example
● Quick Demo
3. What is KSAI
● KSAI is a machine learning library which contains various algorithms
for classification, regression, clustering and many other tasks.
● It is an attempt to build machine learning algorithms in Scala.
● The Breeze library, which is also built in Scala, is used for the
mathematical functionality.
5. Power of KSAI
● KSAI mainly uses Scala's built-in case classes, Futures and some of
the other cool features.
● It has also used Akka in some places to do things in an
asynchronous fashion.
6. Right now KSAI might not be that easy to use, as the library has
limited documentation; however, the committers will update it in the
near future.
8. Hierarchical Clustering
● Hierarchical clustering (also called hierarchical cluster analysis or HCA) is
a method of cluster analysis which seeks to build a hierarchy of clusters.
● Hierarchical clustering is an algorithm that groups similar objects into
groups called clusters.
● The endpoint is a set of clusters, where each cluster is distinct from each
other cluster, and the objects within each cluster are broadly similar to
each other.
9. Strategies
● Agglomerative: This is a "bottom-up" approach: each observation
starts in its own cluster, and pairs of clusters are merged as one
moves up the hierarchy.
● Divisive: This is a "top-down" approach: all observations start in one
cluster, and splits are performed recursively as one moves down the
hierarchy.
10. Required Data
Hierarchical clustering can be performed with either a distance matrix or raw
data.
When raw data is provided, the software will automatically compute a
distance matrix in the background.
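To make this concrete, here is a minimal sketch in Scala of how a library might compute such a matrix from raw data behind the scenes. This is not KSAI's actual API; the names (DistanceMatrix, build) and the Euclidean default are assumptions for illustration:

object DistanceMatrix {

  // A metric maps a pair of raw observations to a distance.
  type Metric = (Array[Double], Array[Double]) => Double

  // Euclidean distance, assumed here as the default metric.
  val euclidean: Metric = (x, y) =>
    math.sqrt(x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum)

  // Builds the full n x n distance matrix. It is symmetric with a zero
  // diagonal, so only the lower triangle actually needs computing.
  def build(data: Array[Array[Double]],
            metric: Metric = euclidean): Array[Array[Double]] = {
    val n = data.length
    val d = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- 0 until i) {
      d(i)(j) = metric(data(i), data(j))
      d(j)(i) = d(i)(j) // mirror into the upper triangle
    }
    d
  }
}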
11. Distance Matrix
● A distance matrix is a matrix of distances between objects.
● It will be symmetric (because the distance between x and y is the same
as the distance between y and x) and will have zeroes on the diagonal
(because every item is distance zero from itself).
● The table below is an example of a distance matrix. Only the lower
triangle is shown, because the upper triangle can be filled in by
reflection.
12. How to build Distance Matrix
● The choice of distance metric should be made based on theoretical
concerns from the domain of study.
● That is, a distance metric needs to define similarity in a way that is
sensible for the field of study.
● For example, if clustering crime sites in a city, city block distance may
be appropriate.
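Continuing the sketch above, a city block (Manhattan) metric is just another function plugged into the hypothetical DistanceMatrix.build; the coordinates below are made up for illustration:

// City block distance: the sum of absolute coordinate differences.
val cityBlock: DistanceMatrix.Metric = (x, y) =>
  x.zip(y).map { case (a, b) => math.abs(a - b) }.sum

// Toy crime-site coordinates (illustrative only).
val sites = Array(Array(1.0, 2.0), Array(4.0, 6.0), Array(5.0, 1.0))
val d = DistanceMatrix.build(sites, cityBlock)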
14. Simple Working
Hierarchical clustering starts by treating each observation as a separate
cluster. Then, it repeatedly executes the following two steps:
(1) identify the two clusters that are closest together, and
(2) merge the two most similar clusters. This continues until all the clusters
are merged together.
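This loop can be sketched in a few lines of Scala. The version below is an illustrative, unoptimized take, not KSAI's implementation; it assumes a precomputed distance matrix and a linkage function (defined on slides 18-19) that scores the dissimilarity of two clusters, and it returns the merge history, which is exactly the information a dendrogram plots:

def agglomerate(
    d: Array[Array[Double]],
    linkage: (Set[Int], Set[Int]) => Double
): List[(Set[Int], Set[Int], Double)] = {
  // Each observation starts in its own cluster.
  var clusters: List[Set[Int]] = d.indices.map(Set(_)).toList
  val merges =
    scala.collection.mutable.ListBuffer.empty[(Set[Int], Set[Int], Double)]
  while (clusters.size > 1) {
    // (1) Identify the two clusters that are closest together.
    val pair = clusters.combinations(2).minBy(p => linkage(p(0), p(1)))
    val (a, b) = (pair(0), pair(1))
    // (2) Merge them; the linkage value is the height at which a
    //     dendrogram would draw this merge.
    merges += ((a, b, linkage(a, b)))
    clusters = (a ++ b) :: clusters.filterNot(c => c == a || c == b)
  }
  merges.toList
}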
15. Result(Dendrogram)
The main output of Hierarchical Clustering is a dendrogram, which shows
the hierarchical relationship between the clusters.
Where:
● A dendrogram is a diagram that shows the hierarchical relationship
between objects.
● The main use of a dendrogram is to work out the best way to allocate
objects to clusters.
16. Cluster dissimilarity
● In order to decide which clusters should be combined (for
agglomerative clustering), a measure of dissimilarity between sets of
observations is required.
● In most methods of hierarchical clustering, this is achieved by use of
an appropriate metric (a measure of distance between pairs of
observations), and a linkage criterion which specifies the dissimilarity
of sets as a function of the pairwise distances of observations in the
sets.
17. Linkage Criteria
After selecting a distance metric, it is necessary to determine from where
distance is computed and how the merging of clusters will take place.
For that we have various linkage criteria.
18. Single Linkage
In single-link clustering (also called the connectedness or minimum
method), we consider the distance between one cluster and another cluster
to be equal to the shortest distance from any member of one cluster to any
member of the other cluster.
19. Complete Linkage
In complete-link clustering (also called the diameter or maximum method),
we consider the distance between one cluster and another cluster to be
equal to the longest distance from any member of one cluster to any
member of the other cluster.
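Both criteria can be written as one-liners over the pairwise distances, in the same hypothetical setup as the earlier sketches (d is the distance matrix; clusters are sets of row indices):

// All pairwise distances between members of two clusters.
def pairwise(a: Set[Int], b: Set[Int], d: Array[Array[Double]]): Seq[Double] =
  for (i <- a.toSeq; j <- b.toSeq) yield d(i)(j)

// Single linkage: the shortest such distance.
def singleLink(d: Array[Array[Double]])(a: Set[Int], b: Set[Int]): Double =
  pairwise(a, b, d).min

// Complete linkage: the longest such distance.
def completeLink(d: Array[Array[Double]])(a: Set[Int], b: Set[Int]): Double =
  pairwise(a, b, d).max

Either one can be handed to the agglomerate sketch from slide 14, e.g. agglomerate(d, completeLink(d)).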
20. Example
Now let's start clustering.
The smallest distance is between three and five, so they get linked up or
merged first into the cluster "35".
● To obtain the new distance matrix, we need to remove the 3 and 5
entries and replace them with an entry "35".
● Since we are using complete linkage clustering, the distance between
"35" and every other item is the maximum of the distance between this
item and 3 and this item and 5. For example, d(1,3)=3 and d(1,5)=11.
So, D(1,"35")=11. This gives us the new distance matrix.
21. The items with the smallest distance in the new matrix get clustered
next. This will be 2 and 4.
Continuing in this way, after 6 steps, everything is clustered.
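In code terms, one call to the earlier sketches runs this whole sequence of merges:

// Assuming the matrix d built with the DistanceMatrix sketch and the
// completeLink criterion from slide 19; each element of merges
// corresponds to one of the steps above.
val merges = agglomerate(d, completeLink(d))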
22. Dendrogram
● On this plot, the y-axis shows the
distance between the objects at
the time they were clustered.
● This is called the cluster height.
● Different visualizations use
different measures of cluster
height.
23. Dendrogram
● This is the single linkage dendrogram for
the same distance matrix.
● It starts with cluster "35", but the
distance between "35" and each
item is now the minimum of d(x,3)
and d(x,5). So D(1,"35")=3.
24. Determining clusters
● One of the problems with hierarchical clustering is that there is no
objective way to say how many clusters there are.
● If we cut the single linkage tree at the point shown below, we would say
that there are two clusters.
25. However, if we cut the tree lower we might say that there is one cluster and
two singletons.
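Choosing the cut height is therefore the whole decision. A small sketch of "cutting the tree", reusing the merge history returned by the agglomerate function above (for single and complete linkage the merge heights are non-decreasing, so replaying merges up to the cut is enough):

def cutTree(n: Int,
            merges: List[(Set[Int], Set[Int], Double)],
            cutHeight: Double): List[Set[Int]] = {
  var clusters: List[Set[Int]] = (0 until n).map(Set(_)).toList
  // Replay only the merges that happen at or below the cut height;
  // whatever clusters survive are the ones read off the dendrogram.
  for ((a, b, h) <- merges if h <= cutHeight)
    clusters = (a ++ b) :: clusters.filterNot(c => c == a || c == b)
  clusters
}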
Initially we decided that KSAI stands for K-Scalable Artificial Intelligence (where the K comes from the ML algorithm K-Means). But users can relate it to whatever name they want, so we are calling it just KSAI.