Azhar Ali Shah @ Interdisciplinary Optimization and Decision Making Journal Club (IODMJC), March 20, 2009
Overview
Azhar A Shah: Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
Introduction: authors
Introduction: Hierarchical Clustering
Introduction: Hierarchical Clustering
Introduction: about the topic
There is no guideline for selecting the best linkage method. In practice, people almost always use average linkage.
UPGMA (Unweighted Pair Group Method using arithmetic Averages)
Scalable to large datasets as it requires only O(1) edges in memory. BUT highly susceptible to outliers!
Introduction: UPGMA
Introduction: UPGMA - Sparse input
N=11 input singletons (vertices): {1,2,3,4,11,12,13,14,21,22,23} and 14 edges in the sparse input.
The input is considered sparse since not all pairs are given, e.g. there is no edge between 1 and 22.
Clusters 1,2,3,4 form a clique A.
Clusters 11,12,13,14 are missing edge <11,14> to form clique B.
Clusters 21,22,23 are loosely connected to each other and to the cluster of clique A.
In total there are two connected components in the input graph: {1,2,3,4,21,22,23} (producing 6 merges for 7 vertices) and {11,12,13,14} (producing 3 merges for 4 vertices). The output is therefore a forest of two disjoint trees with 9 merges, rather than the full tree of N-1=10 merges.
UPGMA-input (weight, vertex, vertex):
90 23 1
70 23 22
50 22 21
30 14 13
20 14 12
12 13 12
11 13 11
1e+01 12 11
4e-10 4 3
1e-50 4 2
1e-80 3 2
2e-40 4 1
1e-40 3 1
1e-100 2 1
UPGMA-tree (new cluster, merge value, merged cluster, merged cluster):
32 99.167 31 26
31 85 29 23
30 50 28 14
29 50 22 21
28 11.5 27 13
27 10 12 11
26 1.33e-10 25 4
25 5e-41 24 3
24 1e-100 2 1
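The example tree can be reproduced with a small sketch. It is a naive average-linkage loop, not the efficient algorithm of the paper, and it rests on one assumption: every missing edge counts as dissimilarity 100. That value is inferred from the merge values listed in the tree (e.g. (30 + 20 + 100) / 3 = 50 for joining 14 to {11,12,13}) and is not stated on the slide. The names `sparse_upgma`, `EDGES` and `MISSING` are hypothetical.

```python
from itertools import combinations

# Assumption: absent edges count as dissimilarity 100 (inferred from the tree values).
MISSING = 100.0

# The 14 input edges from the slide, as (weight, vertex, vertex).
EDGES = [
    (90, 23, 1), (70, 23, 22), (50, 22, 21), (30, 14, 13), (20, 14, 12),
    (12, 13, 12), (11, 13, 11), (10, 12, 11), (4e-10, 4, 3), (1e-50, 4, 2),
    (1e-80, 3, 2), (2e-40, 4, 1), (1e-40, 3, 1), (1e-100, 2, 1),
]


def sparse_upgma(edges, missing=MISSING):
    """Naive average linkage over a sparse edge list; returns the merges."""
    dist = {frozenset((a, b)): float(d) for d, a, b in edges}
    # Each cluster id maps to the list of original vertices it contains.
    members = {v: [v] for _, a, b in edges for v in (a, b)}
    next_id = max(members) + 1
    merges = []

    def avg(x, y):
        # Mean over all cross pairs; pairs without an edge use `missing`.
        pairs = [(p, q) for p in members[x] for q in members[y]]
        return sum(dist.get(frozenset(pq), missing) for pq in pairs) / len(pairs)

    while len(members) > 1:
        height, x, y = min(
            (avg(x, y), x, y) for x, y in combinations(sorted(members), 2)
        )
        if height >= missing:
            break  # no edges left between clusters: the result is a forest
        members[next_id] = members.pop(x) + members.pop(y)
        merges.append((next_id, height, x, y))
        next_id += 1
    return merges


for node, height, left, right in sparse_upgma(EDGES):
    print(node, f"{height:.4g}", left, right)
```

Under this assumption the sketch yields 9 merges with the same merge values as the tree above. The ids assigned to the two merges tied at 50 may come out in the opposite order, since tie-breaking here is arbitrary.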
Research Problem: UPGMA
This data renders UPGMA impractical
Methodology: 1) Sparse-UPGMA
Can’t cope with huge datasets, where an O(E) memory requirement is intolerable (e.g. Table 1).
UPGMA (mean): New eq: Time and memory improvement:
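For reference, the textbook average-linkage (UPGMA) distance between clusters and its update after merging A and B can be written as below. This is the standard formulation with symbols chosen here; it is not necessarily the notation or the modified equation used in the talk.

```latex
d(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b),
\qquad
d(A \cup B,\, C) = \frac{|A|\, d(A,C) + |B|\, d(B,C)}{|A| + |B|}
```

The update form is what allows the distance to a merged cluster to be computed from the two previous cluster distances alone, without revisiting individual pairs.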
Methodology: 2) Multi-Round MC-UPGMA
Illustration of non-metric constraints imposed by BLAST sequence similarities (edges). False transitivity is possible due to CSKP_HUMAN.
Methodology: 2) Multi-Round MC-UPGMA
Methodology: 2) Multi-Round MC-UPGMA
Methodology: 2) Single-Round MC-UPGMA
Requires O(n) memory for holding the forming tree!
Methodology: 2)  Single-Round MC-UPGMA
Methods
Methods: Jaccard Score
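As a reminder of what the Jaccard score measures, here is a minimal sketch using the standard definition |A ∩ B| / |A ∪ B|. The protein IDs and the idea of comparing a cluster against a reference family are illustrative assumptions, and `jaccard` is a hypothetical helper name.

```python
def jaccard(a, b):
    """Jaccard score |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Convention assumed here: two empty sets score 0.0.
    return len(a & b) / len(union) if union else 0.0


cluster = {"P1", "P2", "P3", "P4"}  # hypothetical protein IDs in one cluster
family = {"P3", "P4", "P5"}         # hypothetical reference family
print(jaccard(cluster, family))     # 2 shared out of 5 distinct proteins: 0.4
```

A score of 1.0 means the two sets coincide, and 0.0 means they share no member.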
Results
Results Smith–Waterman BLAST Sparse UPGMA With reduced dataset 220K 1.80M
Results: 200 clustering rounds on a single 4-CPU workstation with 4 GB of memory took about 1-2 days.
Results
Observations
Cluster Card Page
View Proteins of Cluster
Keywords Appearances
Cluster Similarity Distribution
similarity matrix for the proteins in this cluster
 
 
 
 
