Azhar A Shah Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space /31
3.
Introduction: authors
4.
Introduction: Hierarchical Clustering
Two fundamental problems in hierarchical clustering:
How to determine the similarity between two objects (e.g. proteins, genes)?
Calculate the distance between the two objects (e.g. RMSD).
How to determine the similarity between two clusters?
Use a linkage rule: single, complete, or average.
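The three linkage rules can be illustrated with a minimal sketch. The function name `linkage_distance` and the `dist` callback are hypothetical names chosen for this example, not part of the slides; the sketch simply takes the minimum, maximum, or mean over all cross-cluster pairs.

```python
from itertools import product

def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters under a given linkage rule.

    cluster_a, cluster_b: non-empty collections of objects.
    dist: function returning the distance between two objects.
    """
    pair_dists = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":      # closest pair
        return min(pair_dists)
    if method == "complete":    # farthest pair
        return max(pair_dists)
    if method == "average":     # arithmetic mean over all pairs (as in UPGMA)
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage method: {method}")

# Toy usage with points on a line and absolute difference as the distance.
A, B = [0, 1], [3, 5]
absdiff = lambda x, y: abs(x - y)
single = linkage_distance(A, B, absdiff, "single")
complete = linkage_distance(A, B, absdiff, "complete")
average = linkage_distance(A, B, absdiff, "average")
```

Single linkage depends on one pair only, which is why a single outlying edge can change the result, while average linkage spreads the influence over all pairs.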
6.
Introduction: about the topic
There is no guideline for selecting the best linkage method.
In practice, people almost always use average linkage: UPGMA (Unweighted Pair Group Method using arithmetic Averages).
Scalable to large datasets as it requires only O(1) edges in memory, but highly susceptible to outliers!
No self edges, i.e. cluster_id1 == cluster_id2 is illegal.
No repeated edges, i.e. if i<->j exists then j<->i must not.
Output format:
Four fields per line:
cluster_id1 cluster_id2 distance cluster_id3
cluster_id1 and cluster_id2 identify the pair of merged clusters, while cluster_id3 is the identifier of the new cluster, their union.
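The two input constraints can be checked with a short sketch. It assumes each input line holds three whitespace-separated fields (`cluster_id1 cluster_id2 distance`) with integer ids; that layout and the function name `read_edges` are assumptions made for this illustration.

```python
def read_edges(lines):
    """Parse edge lines and enforce the two input rules.

    Assumed line layout: cluster_id1 cluster_id2 distance
    Raises ValueError on self edges and on repeated edges
    (i<->j and j<->i count as the same edge).
    """
    seen = set()
    edges = []
    for lineno, line in enumerate(lines, 1):
        fields = line.split()
        if not fields:          # skip blank lines
            continue
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields")
        a, b, d = int(fields[0]), int(fields[1]), float(fields[2])
        if a == b:
            raise ValueError(f"line {lineno}: self edge {a}<->{b}")
        key = (min(a, b), max(a, b))   # undirected edge identity
        if key in seen:
            raise ValueError(f"line {lineno}: repeated edge {a}<->{b}")
        seen.add(key)
        edges.append((a, b, d))
    return edges

edges = read_edges(["1 2 1e-100", "1 3 1e-40", "2 3 1e-80"])
```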
8.
Introduction: UPGMA - Sparse input
N=11 input singletons (vertices): {1,2,3,4,11,12,13,14,21,22,23} and 14 edges in the sparse input.
The input is considered sparse since not all pairs are given, e.g. there is no edge between 1 and 22.
Clusters 1,2,3,4 form a clique A.
Clusters 11,12,13,14 are missing only edge <11,14> to form clique B.
Clusters 21,22,23 are loosely connected to each other and to clique A.
In total there are two connected components in the input graph: {1,2,3,4,21,22,23} (producing 6 merges for 7 vertices) and {11,12,13,14} (producing 3 merges for 4 vertices). The output is therefore a forest of two disjoint trees with 9 merges, rather than a full tree of N-1=10 merges.
UPGMA-input
1 2 1e-100
1 3 1e-40
1 4 2e-40
2 3 1e-80
2 4 1e-50
3 4 4e-10
11 12 1e+01
11 13 11
12 13 12
12 14 20
13 14 30
21 22 50
22 23 70
1 23 90
UPGMA-tree
1 2 1e-100 24
3 24 5e-41 25
4 25 1.33e-10 26
11 12 10 27
13 27 11.5 28
21 22 50 29
14 28 50 30
23 29 85 31
26 31 99.167 32
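The example can be reproduced with a naive in-memory sketch of average linkage over a sparse edge list. This is not the memory-constrained algorithm of the talk. It assumes that every missing pair counts with a default distance of 100 (the parameter `missing` is an assumption, chosen because it reproduces the distances listed in the example tree) and that new cluster ids continue from the largest input id.

```python
def sparse_upgma(edges, missing=100.0):
    """Naive average-linkage clustering of a sparse edge list.

    edges: iterable of (id1, id2, distance).
    missing: assumed distance of every pair absent from the input.
    Returns merges as (id1, id2, distance, new_id) tuples.
    """
    size = {}
    known = {}  # (a, b) -> [sum of given pair distances, number of given pairs]
    for a, b, d in edges:
        size[a] = size[b] = 1
        known[(min(a, b), max(a, b))] = [d, 1]
    next_id = max(size) + 1

    def dist(key):
        a, b = key
        total, count = known[key]
        pairs = size[a] * size[b]
        return (total + (pairs - count) * missing) / pairs

    merges = []
    while known:
        a, b = key = min(known, key=dist)  # ties: earliest-inserted edge wins
        d = dist(key)
        del known[key]
        new, next_id = next_id, next_id + 1
        size[new] = size[a] + size[b]
        pooled = {}  # neighbour id -> pooled [sum, count] towards the new cluster
        for (x, y), (total, count) in list(known.items()):
            if x in (a, b) or y in (a, b):
                other = y if x in (a, b) else x
                del known[(x, y)]
                acc = pooled.setdefault(other, [0.0, 0])
                acc[0] += total
                acc[1] += count
        for other, acc in pooled.items():
            known[(other, new)] = acc
        merges.append((a, b, d, new))
    return merges

example = [(1, 2, 1e-100), (1, 3, 1e-40), (1, 4, 2e-40), (2, 3, 1e-80),
           (2, 4, 1e-50), (3, 4, 4e-10), (11, 12, 10.0), (11, 13, 11.0),
           (12, 13, 12.0), (12, 14, 20.0), (13, 14, 30.0), (21, 22, 50.0),
           (22, 23, 70.0), (1, 23, 90.0)]
tree = sparse_upgma(example)
```

The loop stops once no edges remain, so components that share no edge are never joined, which is why the example yields a forest rather than a single tree.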
UPGMA requires the entire dissimilarity matrix to be in memory:
This data renders UPGMA impractical.
10.
Methodology: 1) Sparse-UPGMA
Can’t cope with huge datasets, where an O(E) memory requirement is intolerable (e.g. Table 1).
UPGMA (mean):
New eq:
Time and memory improvement:
Solution: To prevent false clustering of a non-minimal edge, suitable bounds are maintained per edge.
Each value d_ij is bounded from below by l_ij and from above by u_ij.
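One way such bounds could be computed is sketched below. It is an illustration under stated assumptions, not the formula from the slides: it assumes every pair not yet loaded into memory has a distance between a threshold `lam` and a maximum `psi` (both hypothetical parameter names), and that the cluster distance is the mean over all cross-cluster pairs.

```python
def edge_bounds(known_sum, known_pairs, size_i, size_j, lam, psi):
    """Lower and upper bound on the average distance between clusters i and j.

    known_sum: sum of the pair distances currently held in memory.
    known_pairs: how many pairs contributed to known_sum.
    lam, psi: assumed smallest and largest possible distance of an unseen pair.
    """
    total_pairs = size_i * size_j
    unseen = total_pairs - known_pairs
    if unseen < 0:
        raise ValueError("more known pairs than possible pairs")
    lower = (known_sum + unseen * lam) / total_pairs
    upper = (known_sum + unseen * psi) / total_pairs
    return lower, upper

# Clusters of size 2 and 3: 4 of the 6 pairs are known and sum to 12.
low, high = edge_bounds(12.0, 4, 2, 3, lam=5.0, psi=100.0)
```

When all pairs are known the two bounds coincide and the distance is exact; an edge whose distance is exact and no larger than every other edge's lower bound can be merged without risk of a false merge.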
13.
Methodology: 2) Multi-Round MC-UPGMA
When Multi-Round MC-UPGMA halts, it is not using its entire memory budget M, since each merge reduces the number of edges in memory.
Most of the computation time is spent on preprocessing for the next round of clustering.
14.
Methodology: 2) Single-Round MC-UPGMA
Requires O(n) memory to hold the forming tree!