2. Introduction
Clustering is the task of identifying groups of similar
data points in a dataset.
It is one of the most popular techniques in data science.
Entities within a group are more similar to one another
than to entities in other groups.
In this presentation, I will take you through the types of
clustering, different clustering algorithms, and a brief
look at two of the most commonly used clustering methods:
Graph Based Clustering and Density Based Clustering.
4. Graph Theory :
Graph theory can be used to gain detailed insight into the
internal structure of a data set in terms of :
- cliques (a subgraph in which every pair of vertices is connected)
- clusters (highly connected groups of nodes)
- centrality (a measure of the importance of a node in the network)
- outliers (isolated or weakly connected nodes)
Applications :
- Social Graphs (edges represent relationships between people)
- Path Optimization Algorithms (Minimal Spanning Tree, Kruskal’s, Prim’s)
- GPS Navigation Systems (shortest path APIs)
5. GRAPH BASED CLUSTERING
Graph-based clustering is a method for identifying
groups of similar cells or samples.
It makes no prior assumptions about the clusters in the
data.
This means the number, size, density, and shape of
clusters do not need to be known or assumed prior to
clustering.
Consequently, graph-based clustering is useful for
identifying clusters in complex data sets such as
scRNA-seq.
6. IDEA :
• Graph-Based clustering uses the proximity graph
– Start with the proximity matrix
– Consider each point as a node in a graph
– Each edge between two nodes has a weight which is the
proximity between the two points
– Initially the proximity graph is fully connected
– MIN (single-link) and MAX (complete-link) can be viewed as
starting with this graph
• In the simplest case, clusters are connected components in the graph.
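The steps above can be sketched in plain Python: build the proximity graph, keep only edges below a distance threshold, and read off the connected components as clusters. This is a minimal illustration, not code from the presentation; the sample points and the threshold of 1.5 are invented values.

```python
# Sketch: clustering as connected components of a thresholded
# proximity graph (pure Python; sample data and threshold are illustrative).
from math import dist

points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (5.3, 4.8)]
threshold = 1.5

# Build the proximity graph: an edge wherever two points are close enough.
# The initially fully connected graph loses its long edges here.
n = len(points)
edges = [(i, j) for i in range(n) for j in range(i + 1, n)
         if dist(points[i], points[j]) <= threshold]

# Clusters = connected components, found with a simple union-find.
parent = list(range(n))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

for i, j in edges:
    parent[find(i)] = find(j)

labels = [find(i) for i in range(n)]
print(labels)  # points 0 and 1 share one label, points 2 and 3 another
```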
8. HIERARCHICAL METHOD :
1) Determining a minimal spanning tree (MST)
2) Delete branches iteratively
New Connected Components = Cluster
MINIMAL SPANNING TREE :
A minimal spanning tree of a connected graph G = (V,E) is a
connected subgraph with minimal weight that contains all nodes of
G and has no cycles.
9. Minimal Spanning Trees can be calculated with :-
Prim’s Algorithm
- Prim's (also known as Jarník's) algorithm is a greedy algorithm that finds a
minimum spanning tree for a weighted undirected graph.
- This means it finds a subset of the edges that forms a tree that includes
every vertex, where the total weight of all the edges in the tree is
minimized.
Kruskal’s Algorithm
- Kruskal's algorithm is a minimum-spanning-tree algorithm which finds an
edge of the least possible weight that connects any two trees in the forest.
- It is a greedy algorithm in graph theory as it finds a minimum spanning tree
for a connected weighted graph adding increasing cost arcs at each step.
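Kruskal's greedy edge selection can be sketched with a union-find structure: sort the edges by weight and keep each edge that connects two different trees in the forest. The small weighted graph below is an invented example.

```python
# Sketch of Kruskal's algorithm on a small weighted undirected graph.
# Edges are (weight, u, v); the graph itself is an illustrative example.
edges = [
    (1, "A", "B"), (4, "A", "C"), (3, "B", "C"),
    (2, "B", "D"), (5, "C", "D"),
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

mst = []
for w, u, v in sorted(edges):   # greedily take the cheapest edge...
    if find(u) != find(v):      # ...that connects two different trees
        parent[find(u)] = find(v)
        mst.append((w, u, v))

print(mst)                         # [(1, 'A', 'B'), (2, 'B', 'D'), (3, 'B', 'C')]
print(sum(w for w, _, _ in mst))   # total weight: 6
```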
10. Branch Deletion
Delete Branches – Different Strategies :-
I. Delete the branch with maximum weight.
II. Delete inconsistent branches.
III. Delete by analysis of weights.
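Strategy I can be sketched as follows: deleting the k-1 heaviest branches of the MST leaves k connected components, each of which is one cluster. The MST edge list and the choice k = 2 below are illustrative values, not data from the slides.

```python
# Sketch of strategy I: start from an MST (hard-coded edge list here
# for illustration) and delete the heaviest branches; each remaining
# connected component is one cluster.
mst_edges = [(0, 1, 0.5), (1, 2, 0.4), (2, 3, 6.0), (3, 4, 0.3)]  # (u, v, weight)
k = 2  # desired number of clusters

# Deleting the k-1 heaviest branches leaves k connected components.
kept = sorted(mst_edges, key=lambda e: e[2])[: len(mst_edges) - (k - 1)]

# Recover the components with a depth-first search.
neighbours = {i: [] for i in range(5)}
for u, v, _ in kept:
    neighbours[u].append(v)
    neighbours[v].append(u)

clusters, seen = [], set()
for start in neighbours:
    if start in seen:
        continue
    stack, component = [start], set()
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            component.add(node)
            stack.extend(neighbours[node])
    clusters.append(component)

print(clusters)  # [{0, 1, 2}, {3, 4}] -- the heaviest branch (2, 3) was cut
```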
11. SUMMARY :-
In graph based clustering objects are represented as
nodes in a complete or connected graph.
The distance between two objects is given by the weight
of the corresponding branch.
Hierarchical Method :
(1) Determine a minimal spanning tree(MST).
(2) Delete branches iteratively.
Visualization of information in large datasets.
13. DBSCAN :
Density based spatial clustering of applications with noise.
It is one of the most cited clustering algorithms in the literature.
Features : -
• Spatial data
(geomarketing, tomography, satellite images)
• Discovery of clusters with arbitrary shape
(spherical, drawn out, linear, elongated)
• Good efficiency on large databases
(parallel programming)
• Only two parameters required.
• No prior knowledge of the number of clusters is required.
14. IDEA :
Clusters have a high density of points.
In the area of noise the density is lower than in any of the
clusters.
Goal :
Formalize the notions of clusters and noise.
15. Density-based cluster : definition
Relies on a density-based notion of cluster: a cluster is
defined as a maximal set of density-connected points.
A cluster C is a subset of D satisfying :
- For all p, q : if p is in C and q is density-reachable
from p, then q is also in C.
- For all p, q in C : p is density-connected to q.
16. DENSITY BASED CLUSTERING: DATA
● Two Parameters:
- Eps : Maximum radius of the neighbourhood
- MinPts : Minimum number of points in an Eps-neighbourhood of that point
● Neps(p) = {q ∈ D | dist(p, q) <= Eps}
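The Eps-neighbourhood follows directly from the set definition above. This is a minimal sketch; Eps, MinPts, and the sample database D are illustrative values.

```python
# Sketch of the Eps-neighbourhood Neps(p) = {q in D | dist(p, q) <= Eps}.
from math import dist

D = [(0.0, 0.0), (0.8, 0.0), (0.0, 0.9), (3.0, 3.0)]  # toy database
Eps, MinPts = 1.0, 3

def eps_neighbourhood(p, data, eps):
    """Neps(p): all points of the database within distance eps of p."""
    return [q for q in data if dist(p, q) <= eps]

neighbourhood = eps_neighbourhood(D[0], D, Eps)
print(neighbourhood)                 # [(0.0, 0.0), (0.8, 0.0), (0.0, 0.9)]
print(len(neighbourhood) >= MinPts)  # True: (0.0, 0.0) meets the core point condition
```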
17. Problem :
In each cluster there are two kinds of points :
- points inside the cluster (core points)
- points on the border (border points)
An Eps-neighbourhood of a border point contains significantly
fewer points than an Eps-neighbourhood of a core point.
18. IDEA :
For every point p in a cluster C there is a point q ∈
C, so that
1) p is inside the Eps-neighbourhood of q and
2) Neps(q) contains at least MinPts points.
19. DEFINITION :
● Directly density-reachable: A point p is directly
density-reachable from a point q with regard to Eps and MinPts if
1) p ∈ Neps(q) (reachability)
2) |Neps(q)| >= MinPts (core point condition)
20. Density-reachable:
A point p is density-reachable
from a point q wrt. Eps and
MinPts if there is a chain of
points p1, ..., pn with p1 = q and
pn = p such that pi+1 is directly
density-reachable from pi.
Density-connected:
A point p is density-connected
to a point q wrt. Eps and MinPts
if there is a point o such that
both p and q are density-reachable
from o wrt. Eps and MinPts.
21. DBSCAN (algorithm) :
Start with an arbitrary point p from the database and
retrieve all points density-reachable from p with regard to
Eps and MinPts.
If p is a core point, this procedure yields a cluster with
regard to Eps and MinPts, and all of its points are classified.
If p is a border point, no points are density-reachable
from p, and DBSCAN visits the next unclassified point in
the database.
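The algorithm above can be sketched in a few dozen lines of plain Python: expand everything density-reachable from each unclassified core point, and mark points with too few neighbours as noise until a cluster claims them as border points. Eps, MinPts, and the sample data are illustrative assumptions, not values from the presentation.

```python
# Minimal DBSCAN sketch following the slides' procedure.
from math import dist

def dbscan(data, eps, min_pts):
    NOISE, UNCLASSIFIED = -1, None
    labels = [UNCLASSIFIED] * len(data)

    def region_query(i):  # Neps of point i
        return [j for j, q in enumerate(data) if dist(data[i], q) <= eps]

    cluster_id = 0
    for i in range(len(data)):
        if labels[i] is not UNCLASSIFIED:
            continue
        neighbours = region_query(i)
        if len(neighbours) < min_pts:   # not a core point: noise, for now
            labels[i] = NOISE
            continue
        # i is a core point: grow a new cluster from everything
        # density-reachable from it.
        labels[i] = cluster_id
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:      # border point: claim it for the cluster
                labels[j] = cluster_id
            if labels[j] is not UNCLASSIFIED:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(j)
            if len(j_neighbours) >= min_pts:  # j is also a core point: expand
                seeds.extend(j_neighbours)
        cluster_id += 1
    return labels

points = [(0, 0), (0.3, 0), (0, 0.3), (5, 5), (5.3, 5), (5, 5.3), (9, 0)]
print(dbscan(points, eps=0.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point (9, 0) ends up labelled -1 (noise), while the two dense groups each form a cluster.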
23. CONCLUSION
Clustering is a descriptive technique.
The solution is not unique and it strongly depends
upon the analyst’s choices.
We described how it is possible to combine different
results in order to obtain stable clusters, not
depending too much on the criteria selected to
analyze data.
Clustering always provides groups, even if there is no
group structure.
24. REFERENCES :
Lecture material by Eric Kropat.
Wikipedia and other web sources.