3. Introduction
▪ Hierarchical clustering – organizes the given data into a hierarchical structure.
– It is structured and more informative than flat clustering.
– It is deterministic, but less efficient.
– It matters when one of the potential problems of flat clustering is a concern.
▪ Most flat clustering techniques are concerned with efficiency.
▪ Types
– Agglomerative clustering – bottom up
– Divisive clustering – top down
CS306 Presentation 3
6. Selected papers
▪ The paper proposes a new algorithm called CLUBS.
▪ CLUBS – Clustering Using Binary Splitting.
– Faster than existing algorithms.
– More accurate, robust and impervious to noise.
– Works in a completely unsupervised fashion.
– Also works for density-based clustering.
– Can be used to refine the results of other algorithms.
▪ The popular k-means algorithm has a repeatability problem: its results can differ from run to run.
– CLUBS overcomes this problem.
Elio Masciari, Giuseppe Mazzeo and Carlo Zaniolo: A New, Fast and Accurate Algorithm for Hierarchical Clustering
on Euclidean Distances. PAKDD (2) 2013: 111-122.
7. Approach
▪ CLUBS has two phases:
– Divisive – the original data set is split recursively into mini-clusters through binary splitting.
▪ The resulting partition may be non-optimal.
– Agglomerative – the final mini-clusters are recursively combined into the final result.
▪ This phase backtracks on earlier wrong decisions.
▪ The algorithm exploits the SSQ (sum of squares) to minimize the cost of the split operation.
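To make the SSQ criterion concrete, the following is a minimal sketch, not the implementation from the paper. It assumes SSQ means the sum of squared Euclidean distances from each point to the centroid of its cluster, and the helper names (`centroid`, `ssq`, `split_gain`) are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    d = len(points[0])
    return [sum(p[i] for p in points) / n for i in range(d)]


def ssq(points: List[Point]) -> float:
    """Sum of squared Euclidean distances from each point to the centroid."""
    if not points:
        return 0.0
    c = centroid(points)
    return sum((p[i] - c[i]) ** 2 for p in points for i in range(len(c)))


def split_gain(points: List[Point], dim: int, pos: float) -> float:
    """Reduction in total SSQ obtained by cutting on dimension `dim` at `pos`."""
    low = [p for p in points if p[dim] <= pos]
    high = [p for p in points if p[dim] > pos]
    return ssq(points) - ssq(low) - ssq(high)


# Two well-separated groups: cutting between them removes almost all of the SSQ.
data = [(1, 1), (2, 1), (1, 2), (9, 9), (10, 9), (9, 10)]
gain = split_gain(data, dim=0, pos=5)
```

A split that separates the two groups leaves only the small within-group scatter, which is why a large reduction in SSQ indicates a good split.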
8. Algorithm
▪ Phase 1:
▪ Definition 1 – binary partition BP.
– D – a d-dimensional data distribution (a multi-dimensional array of integers).
– N – the number of non-zero entries of D.
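– $\rho_i$ – a range $[l..u]$ on the i-th dimension of D, with $1 \le l \le u \le n$ and $1 \le i \le d$; $\mathrm{size}(\rho_i) = ub(\rho_i) - lb(\rho_i) + 1 = u - l + 1$.
– A block b (of D) is a d-tuple $\{\rho_1, \ldots, \rho_d\}$, with $\mathrm{vol}(b) = \mathrm{size}(\rho_1) \times \cdots \times \mathrm{size}(\rho_d)$.
– A point $x = (x_1, \ldots, x_d)$ is chosen, with $lb(\rho_i) \le x_i \le ub(\rho_i)$.
– Its coordinate $x_i$ divides the range $\rho_i$ of b into $\rho_i^{low} = [lb(\rho_i)..x_i]$ and $\rho_i^{high} = [(x_i + 1)..ub(\rho_i)]$, thus partitioning b into $b^{low} = \{\rho_1, \ldots, \rho_i^{low}, \ldots, \rho_d\}$ and $b^{high} = \{\rho_1, \ldots, \rho_i^{high}, \ldots, \rho_d\}$.
– $(b^{low}, b^{high})$ is the binary split, i the splitting dimension and $x_i$ the splitting position.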
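Definition 1 can be sketched in a few lines of code. This is an illustration only: the representation of a block as a list of inclusive `(lb, ub)` pairs, the 0-based dimension index, and the function names are assumptions. The check that the position is strictly below the upper bound is also an added assumption, so that the high half is never empty.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive bounds (lb, ub) of one dimension
Block = List[Range]      # one range per dimension


def size(r: Range) -> int:
    """size(rho) = ub - lb + 1."""
    lb, ub = r
    return ub - lb + 1


def vol(b: Block) -> int:
    """vol(b) = product of the sizes of all ranges of the block."""
    v = 1
    for r in b:
        v *= size(r)
    return v


def binary_split(b: Block, i: int, x: int) -> Tuple[Block, Block]:
    """Split block b on dimension i (0-based here) at position x."""
    lb, ub = b[i]
    if not (lb <= x < ub):
        raise ValueError("need lb <= x < ub so that both halves are non-empty")
    b_low = b[:i] + [(lb, x)] + b[i + 1:]
    b_high = b[:i] + [(x + 1, ub)] + b[i + 1:]
    return b_low, b_high


block = [(1, 8), (1, 4)]  # an 8 x 4 block
b_low, b_high = binary_split(block, i=0, x=3)
```

Because only the range on the splitting dimension changes, the volumes of the two halves always add up to the volume of the original block.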
9. Algorithm
▪ Definition 2 – stopping condition of BP
– $C_s$ – a cluster; p – a point of $C_s$.
– $S = (S_1, \ldots, S_d) = \sum_{p \in C_s} p$ is a vector.
– Centre of $C_s$: $C_s^0 = S/N$.
– $Q_i = \sum_{p \in C_s} p_i \times p_i$.
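The quantities in Definition 2 are enough to compute the SSQ of a cluster without revisiting its points, via the standard identity SSQ = Σ_i (Q_i − S_i² / N). The sketch below illustrates this; it assumes N is the number of points in the cluster, and the class and method names are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusterStats:
    n: int           # number of points in the cluster (assumed meaning of N)
    s: List[float]   # S_i: per-dimension sum of coordinates
    q: List[float]   # Q_i: per-dimension sum of squared coordinates

    @classmethod
    def from_points(cls, points: Sequence[Sequence[float]]) -> "ClusterStats":
        """Build the statistics from a non-empty list of points."""
        d = len(points[0])
        s = [sum(p[i] for p in points) for i in range(d)]
        q = [sum(p[i] * p[i] for p in points) for i in range(d)]
        return cls(len(points), s, q)

    def centre(self) -> List[float]:
        """Centre of the cluster: S / N."""
        return [si / self.n for si in self.s]

    def ssq(self) -> float:
        """SSQ from the summaries alone: sum_i (Q_i - S_i^2 / N)."""
        return sum(qi - si * si / self.n for si, qi in zip(self.s, self.q))

    def merged(self, other: "ClusterStats") -> "ClusterStats":
        """Statistics of the union of two disjoint clusters (all fields add up)."""
        return ClusterStats(
            self.n + other.n,
            [a + b for a, b in zip(self.s, other.s)],
            [a + b for a, b in zip(self.q, other.q)],
        )


stats = ClusterStats.from_points([(1, 1), (2, 1), (1, 2)])
both = stats.merged(ClusterStats.from_points([(9, 9), (10, 9), (9, 10)]))
```

Since N, S and Q are all additive, the statistics of a split or merged cluster follow from those of its parts, which is what makes evaluating many candidate splits cheap.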
10. Algorithm
– Binary splitting stops when avgSSQ > deltSSQ, which yields n' mini-clusters, where avgSSQ = SSQ0/n and deltSSQ is the overall reduction of SSQ.
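The stopping rule of Phase 1 can be sketched as follows. This is a simplified illustration under stated assumptions: deltSSQ is taken to be the SSQ reduction of the best candidate split of the cluster being examined, candidate splits are searched by brute force rather than with the fast computation of the paper, and all function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def ssq(points: List[Point]) -> float:
    """Sum of squared distances from each point to the centroid."""
    if not points:
        return 0.0
    n, d = len(points), len(points[0])
    c = [sum(p[i] for p in points) / n for i in range(d)]
    return sum((p[i] - c[i]) ** 2 for p in points for i in range(d))


def best_split(points: List[Point]) -> Tuple[float, Optional[int], Optional[float]]:
    """Return (delta, dim, pos) of the axis-parallel cut with the largest SSQ reduction."""
    total = ssq(points)
    best: Tuple[float, Optional[int], Optional[float]] = (0.0, None, None)
    for dim in range(len(points[0])):
        # Every distinct coordinate except the largest is a valid cut position.
        for pos in sorted({p[dim] for p in points})[:-1]:
            low = [p for p in points if p[dim] <= pos]
            high = [p for p in points if p[dim] > pos]
            delta = total - ssq(low) - ssq(high)
            if delta > best[0]:
                best = (delta, dim, pos)
    return best


def divisive_phase(points: List[Point]) -> List[List[Point]]:
    """Split recursively until avgSSQ > deltSSQ for every remaining cluster."""
    avg_ssq = ssq(points) / len(points)  # avgSSQ = SSQ0 / n
    pending, mini_clusters = [list(points)], []
    while pending:
        cluster = pending.pop()
        delta, dim, pos = best_split(cluster)
        if dim is None or avg_ssq > delta:  # stopping condition
            mini_clusters.append(cluster)
        else:
            pending.append([p for p in cluster if p[dim] <= pos])
            pending.append([p for p in cluster if p[dim] > pos])
    return mini_clusters


data = [(1, 1), (2, 1), (1, 2), (9, 9), (10, 9), (9, 10)]
mini = divisive_phase(data)
```

On this toy data the first cut separates the two groups, and any further cut would reduce the SSQ by less than avgSSQ, so splitting stops with two mini-clusters.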
▪ Phase 2:
– The n' mini-clusters are merged pairwise, choosing the best pair at each step (greedy approach).
– Merging continues until the increase in SSQ is greater than avgdeltSSQ.
– The clusters that remain are the final result.
▪ Complexity – O(n·d·l·s)
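The greedy merging of Phase 2 can be sketched as below. It is a minimal illustration, not the implementation from the paper: the increase in SSQ caused by a merge is computed with the standard identity n_a·n_b / (n_a + n_b) · ‖c_a − c_b‖², the `threshold` parameter stands in for avgdeltSSQ (whose exact definition is not given here), and the exhaustive pair search is chosen for brevity, not speed.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def merge_cost(a: List[Point], b: List[Point]) -> float:
    """Increase in total SSQ if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def agglomerative_phase(clusters: List[List[Point]], threshold: float) -> List[List[Point]]:
    """Greedily merge the cheapest pair until the SSQ increase exceeds `threshold`."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        cost, i, j = min(
            (merge_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if cost > threshold:  # hypothetical stand-in for avgdeltSSQ
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Four mini-clusters, as if the divisive phase had split too far.
mini = [[(1, 1)], [(2, 1), (1, 2)], [(9, 9), (10, 9)], [(9, 10)]]
final = agglomerative_phase(mini, threshold=5.0)
```

Here the two cheap merges undo the over-splitting, while merging the two distant groups would raise the SSQ far above the threshold, so the procedure stops with two clusters.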
13. Experiment
– Dataset 1 – 42 patients in 3 groups (RM, HN, PM); 98 differentially expressed genes were selected and analysed.
– Dataset 2 – samples extracted from human breast cancer cells, consisting of four cell groups, were analysed.
– Ek = error calculated at 10 clusters.
– ε = probability that two similar data points belong to the same cluster.
– Qk = average percentage of points in the k-neighborhood of a generic point that belong to the same class as that point.