Presentation 2009 Journal Club Azhar Ali Shah

Transcript

  • 1. Azhar Ali Shah @ Interdisciplinary Optimization and Decision Making Journal Club (IODMJC), March 20, 2009
  • 2. Overview
    • Introduction
      • About the authors
      • About the topic
        • Hierarchical Clustering
        • UPGMA
    • Research Problem
    • Methodology
      • Suite of algorithms
    • Results
    • Observations
    Azhar A Shah Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space /31
  • 3. Introduction: authors
  • 4. Introduction: Hierarchical Clustering
  • 5. Introduction: Hierarchical Clustering
    • Two fundamental problems in hierarchical clustering:
      • How to determine the similarity between two objects (e.g. proteins, genes)?
        • Calculate the distance between the two objects (e.g. RMSD).
      • How to determine the similarity between two clusters?
        • Single, complete, or average linkage.
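The three linkage rules named on this slide can be illustrated with a minimal sketch. The function name `linkage_distance`, the toy 1-D objects, and the absolute-difference distance are assumptions chosen for illustration; they do not come from the slides.

```python
from itertools import product


def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters, derived from pairwise object distances."""
    pair_distances = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":      # closest pair of members
        return min(pair_distances)
    if method == "complete":    # farthest pair of members
        return max(pair_distances)
    if method == "average":     # mean over all member pairs (the UPGMA rule)
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")


def abs_diff(x, y):
    """Toy distance between two 1-D objects (illustrative only)."""
    return abs(x - y)


points_a = [0.0, 1.0]
points_b = [4.0, 6.0]
for method in ("single", "complete", "average"):
    print(method, linkage_distance(points_a, points_b, abs_diff, method))
```

Single linkage looks only at the closest pair, which is why one outlier edge can pull two clusters together; average linkage dampens that effect by using every pair.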
  • 6. Introduction: about the topic
    • There is no guideline for selecting the best linkage method.
    • In practice, people almost always use average linkage: UPGMA (Unweighted Pair Group Method using arithmetic Averages).
    • Scalable to large datasets as it requires only O(1) edges in memory, BUT highly susceptible to outliers!
  • 7. Introduction: UPGMA
    • Input format:
      • Three fields per line
        • Cluster_id1 cluster_id2 distance
    • Assumptions on input:
      • Cluster IDs are positive integers
      • No self edges, i.e. cluster_id1 == cluster_id2 is illegal
      • No repeated edges, i.e. if i<->j exists then j<->i must not
    • Output format:
      • Four fields per line
        • Cluster_id1 cluster_id2 distance cluster_id3
          • Cluster_id1 cluster_id2 identify the pair of merged clusters while cluster_id3 is an identifier for a new cluster – their union.
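The input assumptions listed above can be checked mechanically. The following is a minimal sketch, not the tool's actual parser: the function name `parse_edges` and the choice to raise `ValueError` on violations are assumptions made here.

```python
def parse_edges(lines):
    """Parse 'cluster_id1 cluster_id2 distance' lines into {(i, j): distance}.

    Enforces the stated assumptions: positive integer IDs, no self edges,
    and no repeated edges (i<->j and j<->i count as the same edge).
    """
    edges = {}
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        i, j, distance = int(fields[0]), int(fields[1]), float(fields[2])
        if i <= 0 or j <= 0:
            raise ValueError(f"line {line_no}: cluster IDs must be positive")
        if i == j:
            raise ValueError(f"line {line_no}: self edge {i}<->{j}")
        key = (min(i, j), max(i, j))  # store each undirected edge once
        if key in edges:
            raise ValueError(f"line {line_no}: repeated edge {i}<->{j}")
        edges[key] = distance
    return edges


print(parse_edges(["1 2 1e-100", "1 3 1e-40"]))
```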
  • 8. Introduction: UPGMA - Sparse input
    • N=11 input singletons (vertices): {1,2,3,4,11,12,13,14,21,22,23}, with 14 edges in the sparse input.
    • The input is considered sparse since not all pairs are given, e.g. there is no edge between 1 and 22.
    • Clusters 1,2,3,4 form a clique A.
    • Clusters 11,12,13,14 are missing the edge <11,14> needed to form a clique B.
    • Clusters 21,22,23 are loosely connected to each other and to the cluster of clique A.
    • In total there are two connected components in the input graph: {1,2,3,4,21,22,23} (producing 6 merges for 7 vertices) and {11,12,13,14} (producing 3 merges for 4 vertices). The output is therefore a forest of two disjoint trees with 9 merges, rather than the full tree of N-1=10 merges.
    • UPGMA-input (cluster_id1 cluster_id2 distance):
      • 1 2 1e-100
      • 1 3 1e-40
      • 1 4 2e-40
      • 2 3 1e-80
      • 2 4 1e-50
      • 3 4 4e-10
      • 11 12 1e+01
      • 11 13 11
      • 12 13 12
      • 12 14 20
      • 13 14 30
      • 21 22 50
      • 22 23 70
      • 1 23 90
    • UPGMA-tree (cluster_id1 cluster_id2 distance cluster_id3):
      • 1 2 1e-100 24
      • 3 24 5e-41 25
      • 4 25 1.33e-10 26
      • 11 12 10 27
      • 13 27 11.5 28
      • 21 22 50 29
      • 14 28 50 30
      • 23 29 85 31
      • 26 31 99.167 32
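The example on this slide can be traced with a small average-linkage sketch over a sparse edge list. This is an illustration under stated assumptions, not the authors' implementation: the name `sparse_upgma` is hypothetical, every missing edge is assumed to count as a fixed value `psi = 100` (not stated on the slide, but consistent with the listed merge distances such as 85 and 99.167), new cluster IDs are assumed to continue from the largest input ID, and ties are broken in favor of older clusters.

```python
EXAMPLE_EDGES = {
    (1, 2): 1e-100, (1, 3): 1e-40, (1, 4): 2e-40, (2, 3): 1e-80,
    (2, 4): 1e-50, (3, 4): 4e-10, (11, 12): 10.0, (11, 13): 11.0,
    (12, 13): 12.0, (12, 14): 20.0, (13, 14): 30.0, (21, 22): 50.0,
    (22, 23): 70.0, (1, 23): 90.0,
}


def sparse_upgma(edges, psi):
    """Average-linkage clustering of a sparse graph.

    edges: {(i, j): distance}; psi: value assumed for every missing edge.
    Returns a list of merges (id1, id2, distance, new_id).
    """
    size = {v: 1 for pair in edges for v in pair}
    # Per cluster pair: (sum of known distances, number of known member pairs).
    known = {frozenset(pair): (d, 1) for pair, d in edges.items()}
    next_id = max(size) + 1
    merges = []

    def average(pair):
        i, j = pair
        total, count = known[pair]
        n_pairs = size[i] * size[j]
        return (total + (n_pairs - count) * psi) / n_pairs

    while known:  # stops when no edges remain, leaving a forest
        best = min(known, key=lambda p: (average(p), max(p), min(p)))
        distance = average(best)
        a, b = sorted(best)
        del known[best]
        # Fold the edges of a and b into edges of the new cluster.
        folded = {}
        for pair in list(known):
            if a in pair or b in pair:
                (other,) = pair - {a, b}
                total, count = known.pop(pair)
                old_total, old_count = folded.get(other, (0.0, 0))
                folded[other] = (old_total + total, old_count + count)
        new_id, next_id = next_id, next_id + 1
        for other, sums in folded.items():
            known[frozenset((other, new_id))] = sums
        size[new_id] = size.pop(a) + size.pop(b)
        merges.append((a, b, distance, new_id))
    return merges


tree = sparse_upgma(EXAMPLE_EDGES, psi=100.0)
for a, b, distance, new_id in tree:
    print(a, b, f"{distance:.5g}", new_id)
```

The linear scan for the minimum keeps the sketch short; a priority queue would be the natural choice for larger inputs.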
  • 9. Research Problem: UPGMA
    • UPGMA requires the entire dissimilarity matrix to be in memory:
    • This data renders UPGMA impractical.
  • 10. Methodology: 1) Sparse-UPGMA
    • Can't cope with huge datasets, where an O(E) memory requirement is intolerable (e.g. Table 1).
    • UPGMA (mean):
    • New eq:
    • Time and memory improvement:
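For reference, the first two lines below give the standard UPGMA definition: the mean over all member pairs and the equivalent size-weighted update after a merge. The third line is only a sketch of how a sparse variant could account for missing edges, assuming each missing edge takes a fixed value \psi. The symbols \psi and E_{ij} (the set of edges between C_i and C_j present in the input) are notation introduced here, not taken from the slide.

```latex
\begin{align}
D(C_i, C_j) &= \frac{1}{|C_i|\,|C_j|} \sum_{a \in C_i} \sum_{b \in C_j} d(a, b) \\
D(C_i \cup C_j, C_k) &= \frac{|C_i|\, D(C_i, C_k) + |C_j|\, D(C_j, C_k)}{|C_i| + |C_j|} \\
D_{\text{sparse}}(C_i, C_j) &= \frac{1}{|C_i|\,|C_j|}
  \Bigl( \sum_{(a,b) \in E_{ij}} d(a, b)
  + \bigl( |C_i|\,|C_j| - |E_{ij}| \bigr)\, \psi \Bigr)
\end{align}
```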
  • 11. Methodology: 2) Multi-Round MC-UPGMA
    • Requirements:
      • A correct clusterer must account for unseen edges (≥ λ), which can affect clustering below λ (the maximum of the loaded edges).
      • Such cases are rather prevalent in non-metric datasets, e.g. when clustering sequence similarities.
    Illustration of non-metric constraints imposed by BLAST sequence similarities (edges). False transitivity is possible due to CSKP_HUMAN.
  • 12. Methodology: 2) Multi-Round MC-UPGMA
      • Solution: to prevent false clustering of a non-minimal edge, suitable bounds are maintained per edge.
      • The value of d_ij is bounded below by l_ij and above by u_ij.
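One way to obtain such bounds is sketched below, assuming every unseen edge between two clusters lies between λ (the maximum of the loaded edges) and a maximal dissimilarity ψ. The function name `dissimilarity_bounds`, its parameters, and the use of ψ as the upper limit are assumptions made for illustration, not the formula from the slide.

```python
def dissimilarity_bounds(known_sum, known_count, size_i, size_j, lam, psi):
    """Lower and upper bounds on the average dissimilarity d_ij.

    known_sum / known_count: sum and number of loaded edges between the clusters.
    Unseen edges are assumed to lie in [lam, psi].
    """
    n_pairs = size_i * size_j
    unseen = n_pairs - known_count
    if unseen < 0:
        raise ValueError("more known edges than member pairs")
    lower = (known_sum + unseen * lam) / n_pairs  # every unseen edge equals lam
    upper = (known_sum + unseen * psi) / n_pairs  # every unseen edge equals psi
    return lower, upper


# Two clusters of size 2; three of the four edges are loaded and sum to 30.
print(dissimilarity_bounds(30.0, 3, 2, 2, lam=20.0, psi=100.0))
```

When no edges are unseen, the two bounds coincide and d_ij is known exactly.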
  • 13. Methodology: 2) Multi-Round MC-UPGMA
    • When Multi-Round MC-UPGMA halts, it is not using its entire memory budget M, since each merge reduces the number of edges in memory.
    • Most of the computation time is spent on preprocessing for the next round of clustering.
  • 14. Methodology: 2) Single-Round MC-UPGMA
    • Requires O(n) memory to hold the forming tree!
  • 15. Methodology: 2) Single-Round MC-UPGMA
  • 16. Methods
    • Clustered data:
      • UniRef90 (release 8.5) non-redundant
        • 1.80M sequences
    • BLAST Similarities:
      • blastp with E=100, run on a MOSIX grid
      • reciprocal-BLAST-like setting – each sequence is used both as a query and database entry
      • The directed multigraph (2.5x10^9 edges, 50 GB) is transformed to an undirected graph (1.5x10^9 edges, 30 GB) with symmetric dissimilarities
  • 17. Methods
    • Protein Family Keywords
      • InterPro classification is used as a mapping of keywords to protein sequences
    • Metrics
    Jaccard Score
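The Jaccard score of two sets is the size of their intersection divided by the size of their union. The sketch below assumes it is applied to a cluster's members and the set of proteins annotated with a keyword; that pairing and the protein IDs are illustrative assumptions.

```python
def jaccard_score(cluster, keyword_proteins):
    """Jaccard score |A ∩ B| / |A ∪ B| of two collections of protein IDs."""
    cluster, keyword_proteins = set(cluster), set(keyword_proteins)
    union = cluster | keyword_proteins
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(cluster & keyword_proteins) / len(union)


# Hypothetical protein IDs, for illustration only.
cluster_members = {"P1", "P2", "P3", "P4"}
annotated_with_keyword = {"P2", "P3", "P4", "P5", "P6"}
print(jaccard_score(cluster_members, annotated_with_keyword))
```

A score of 1 means the cluster and the keyword's protein set coincide exactly; a score of 0 means they share no protein.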
  • 18. Results
    • Out of 1 801 506 UniRef90 proteins:
      • 1107 (0.06%) proteins are singletons having no BLAST similarities.
      • From the clustered set, 1 791 206 proteins (99.5%) are clustered into a single tree.
    • 1 497 733 of the tree clusters (83.6%) are fully linked, including 426 360 large clusters with at least 10 members.
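The percentages on this slide can be reproduced with simple arithmetic. The denominators are assumptions: the 99.5% is taken relative to the non-singleton proteins, and the 83.6% relative to the n - 1 internal clusters of a binary merge tree over n leaves. The slide does not state either choice.

```python
total_proteins = 1_801_506
singletons = 1_107
largest_tree_leaves = 1_791_206
fully_linked_clusters = 1_497_733

# Share of proteins with no BLAST similarities.
singleton_pct = 100 * singletons / total_proteins

# Assumption: "clustered set" means all non-singleton proteins.
clustered = total_proteins - singletons
tree_pct = 100 * largest_tree_leaves / clustered

# Assumption: a binary merge tree over n leaves has n - 1 internal clusters.
tree_clusters = largest_tree_leaves - 1
fully_linked_pct = 100 * fully_linked_clusters / tree_clusters

print(f"{singleton_pct:.2f}% {tree_pct:.1f}% {fully_linked_pct:.1f}%")
```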
  • 19. Results
    • [Figure labels: Smith–Waterman, BLAST; Sparse UPGMA with reduced dataset; 220K, 1.80M]
  • 20. Results
    • 200 clustering rounds on a single 4-CPU workstation with 4 GB of memory took about 1-2 days.
  • 21. Results
  • 22. Observations
    • No detailed discussion of parallelization
    • No results for Single-Round MC-UPGMA
  • 23.  
  • 24. Cluster Card Page
  • 25. View Proteins of Cluster
  • 26. Keywords Appearances
  • 27. Cluster Similarity Distribution
  • 28. similarity matrix for the proteins in this cluster