Crunching Huge Phylogenies. A. Stamatakis
Transcript

  • 1. Crunching Huge Phylogenies: A Rapid Bootstrap Algorithm and Massive Parallelism on the IBM BlueGene. Alexandros Stamatakis, Swiss Federal Institute of Technology Lausanne (EPFL), School of Computer & Communication Sciences, Laboratory for Computational Biology and Bioinformatics, Lausanne, Switzerland & Swiss Institute of Bioinformatics. Alexandros.Stamatakis@epfl.ch, icwww.epfl.ch/~stamatak
  • 2. The Missing Part Data Assembly Inference ? Tree Analysis Alexandros Stamatakis, October 2007
  • 3. The Missing Part: Data Assembly, Tree Analysis
  • 4. IBM BlueGene/L supercomputer
  • 5. Rapid Bootstrapping; Bootstopping Criterion
  • 6. The Big Hardware Problem: CPU Speed 40% p.a., Memory Speed 9% p.a. (1980 to 2007)
  • 7. ... and why this concerns Bioinformatics: Sequence Data, CPU Speed 40% p.a., Memory Speed 9% p.a. (1980 to 2007)
  • 8. ... and why this concerns Bioinformatics: Application of HPC techniques will become much more important. Sequence Data, CPU Speed 40% p.a., Memory Speed 9% p.a. (1980 to 2007)
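To make the gap on slides 6 to 8 concrete, the quoted rates can be compounded over the plotted period. This is a minimal sketch that assumes both rates are constant annual growth rates applied from 1980 to 2007; the slides only give the two percentages and the two years, so the compounding model is an assumption.

```python
# Compound the per-year growth rates quoted on the slide over 1980-2007.
# Assumption: both rates are constant and compound annually.
CPU_RATE = 0.40      # CPU speed growth per year (from the slide)
MEMORY_RATE = 0.09   # memory speed growth per year (from the slide)
YEARS = 2007 - 1980


def compound(rate: float, years: int) -> float:
    """Return the total growth factor after compounding `rate` for `years`."""
    return (1.0 + rate) ** years


cpu_factor = compound(CPU_RATE, YEARS)
memory_factor = compound(MEMORY_RATE, YEARS)
gap = cpu_factor / memory_factor  # how far CPU speed has outrun memory speed

print(f"CPU x{cpu_factor:,.0f}, memory x{memory_factor:.1f}, gap x{gap:,.0f}")
```

Under these assumptions the CPU factor is in the thousands while the memory factor is roughly ten, which is why memory access patterns and the cache hierarchy matter for likelihood code.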
  • 9. Cache Hierarchy
  • 10. Outline: Introduction; Computation of Phylogenies; Maximum Likelihood; Web & Grid Services; Three Steps Towards the Tree of Life; Parallelism on IBM BlueGene/L; Rapid Bootstrapping; A Bootstopping criterion; Related Projects; Outlook
  • 11. Phylogenetics: Input: “good” multiple Alignment; Output: unrooted binary tree; Various methods for phylogenetic inference: Neighbour Joining (fast & simple), Maximum Parsimony (relatively fast & simple), Maximum Likelihood (complex & slow), Bayesian Methods (complex & slower)
  • 12. Phylogenetics: Input: “good” multiple Alignment; Output: unrooted binary tree; Various methods for phylogenetic inference: Neighbour Joining (fast & simple), Maximum Parsimony (relatively fast & simple), Maximum Likelihood (complex & slow), Bayesian Methods (complex & slower). Callout: ML & Bayesian: explicit choice model
  • 13. Phylogenetics: Input: “good” multiple Alignment; Output: unrooted binary tree; Various methods for phylogenetic inference: Neighbour Joining (fast & simple), Maximum Parsimony (relatively fast & simple), Maximum Likelihood (complex & slow), Bayesian Methods (complex & slower). Callouts: Complex Methods & Models required to reconstruct large & complicated trees! Focus of this talk is on Maximum Likelihood!
  • 14. Phylogenetics: Input: “good” multiple Alignment; Output: unrooted binary tree; Various methods for phylogenetic inference: Neighbour Joining (fast & simple), Maximum Parsimony (relatively fast & simple), Maximum Likelihood (complex & slow), Bayesian Methods (complex & slower). Callout: The real reason for working on ML: ......
  • 15. Challenges for Phyloinformatics: Holy grail: “Tree of Life”; What is a good alignment in a phylogenetic context?; Simultaneous alignment and tree building; Improve/extend models ... but thereby size of computable trees decreases!; More HPC awareness; Exploit multi-core architectures; Amount of available data grows at a higher rate than algorithms are getting faster
  • 16. The algorithmic problem
  • 17. The number of trees
  • 18. The number of trees
  • 19. The number of trees
  • 20. The number of trees explodes! BANG!
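The explosion on slides 17 to 20 can be reproduced with the standard count of unrooted binary trees on n labelled leaves, (2n-5)!! = 3 x 5 x ... x (2n-5). The slides do not print the formula, so the sketch below uses this textbook result rather than anything taken from the deck; the function name is illustrative.

```python
def num_unrooted_binary_trees(n: int) -> int:
    """Number of distinct unrooted binary trees with n labelled leaves.

    Uses the textbook formula (2n-5)!! = 3 * 5 * ... * (2n-5), valid for n >= 3.
    """
    if n < 3:
        raise ValueError("formula is defined for n >= 3 leaves")
    count = 1
    for factor in range(3, 2 * n - 4, 2):  # 3, 5, ..., 2n-5
        count *= factor
    return count


for n in (4, 5, 10, 20, 50):
    print(n, num_unrooted_binary_trees(n))
```

Already for a few dozen sequences the count is far beyond what exhaustive search could visit, which is why ML programs rely on heuristic tree search.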
  • 21. Outline: Introduction; Computation of Phylogenies; Maximum Likelihood; Web & Grid Services; Three Steps Towards the Tree of Life; Parallelism on IBM BlueGene/L; Rapid Bootstrapping; A Bootstopping criterion; Related Projects; Outlook
  • 22. Maximum Likelihood: Alignment of length m (Seq1, Seq2, Seq3, Seq4)
  • 23. Maximum Likelihood: Alignment of length m (Seq1, Seq2, Seq3, Seq4); Substitution model (matrix over A, C, G, T)
  • 24. Maximum Likelihood: Alignment of length m (Seq1, Seq2, Seq3, Seq4); Substitution model (matrix over A, C, G, T); Prior probabilities, Empirical base frequencies (πA, πC, πG, πT)
  • 25. Maximum Likelihood: Alignment of length m (Seq1, Seq2, Seq3, Seq4); Substitution model (matrix over A, C, G, T); Prior probabilities, Empirical base frequencies (πA, πC, πG, πT); Tree over Seq 1, Seq 2, Seq 3, Seq 4 with branches b1, b2, b3, b4, b5
  • 26. Maximum Likelihood: Alignment of length m (Seq1, Seq2, Seq3, Seq4); Substitution model (matrix over A, C, G, T); Prior probabilities, Empirical base frequencies (πA, πC, πG, πT); Tree over Seq 1, Seq 2, Seq 3, Seq 4 with branches b1, b2, b3, b4, b5; virtual root: vr
  • 27. Maximum Likelihood: Alignment of length m (Seq1, Seq2, Seq3, Seq4); Substitution model (matrix over A, C, G, T); Prior probabilities, Empirical base frequencies (πA, πC, πG, πT); Tree over Seq 1, Seq 2, Seq 3, Seq 4 with branches b1, b2, b3, b4, b5 and virtual root vr; entries P(A), P(C), P(G), P(T) for each of the m sites
  • 28. Maximum Likelihood: Alignment of length m (Seq1, Seq2, Seq3, Seq4); Substitution model (matrix over A, C, G, T); Prior probabilities, Empirical base frequencies (πA, πC, πG, πT); Tree over Seq 1, Seq 2, Seq 3, Seq 4 with branches b1, b2, b3, b4, b5 and virtual root vr; entries P(A), P(C), P(G), P(T) for each of the m sites. Callout: Lots of floating point operations!
  • 29. Maximum Likelihood: Alignment of length m (Seq1, Seq2, Seq3, Seq4); Substitution model (matrix over A, C, G, T); Prior probabilities, Empirical base frequencies (πA, πC, πG, πT); Tree over Seq 1, Seq 2, Seq 3, Seq 4: optimize branch lengths
  • 30. Maximum Likelihood: Alignment of length m (Seq1, Seq2, Seq3, Seq4); Substitution model (matrix over A, C, G, T); Prior probabilities, Empirical base frequencies (πA, πC, πG, πT); Tree over Seq 1, Seq 2, Seq 3, Seq 4: optimize model parameters
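The build-up on slides 22 to 30 can be sketched for the four-sequence tree. This is a minimal sketch under stated assumptions, not RAxML's implementation: it swaps in the Jukes-Cantor model (equal base frequencies) instead of a model with empirical frequencies, it assumes Seq 1 and Seq 2 hang off one end of the inner branch b5 and Seq 3 and Seq 4 off the other, and all function names and the toy alignment are illustrative.

```python
import math

STATES = "ACGT"


def jc_prob(x, y, t):
    """Jukes-Cantor probability of observing y after time t when starting at x."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def tip_vector(base):
    """Likelihood vector of a leaf: 1 for the observed base, 0 otherwise."""
    return [1.0 if s == base else 0.0 for s in STATES]


def combine(left, t_left, right, t_right):
    """Likelihood vector of a parent node computed from its two children."""
    parent = []
    for x in STATES:
        from_left = sum(jc_prob(x, y, t_left) * left[k] for k, y in enumerate(STATES))
        from_right = sum(jc_prob(x, y, t_right) * right[k] for k, y in enumerate(STATES))
        parent.append(from_left * from_right)
    return parent


def log_likelihood(alignment, b, root_split=0.5):
    """Log-likelihood of ((Seq1,Seq2),(Seq3,Seq4)) with branch lengths b1..b5.

    root_split places the virtual root on the inner branch b5 (0 = left end,
    1 = right end); for a reversible model the result does not depend on it.
    """
    b1, b2, b3, b4, b5 = b
    total = 0.0
    for s1, s2, s3, s4 in zip(*alignment):  # one alignment column per iteration
        p = combine(tip_vector(s1), b1, tip_vector(s2), b2)
        q = combine(tip_vector(s3), b3, tip_vector(s4), b4)
        root = combine(p, b5 * root_split, q, b5 * (1.0 - root_split))
        total += math.log(sum(0.25 * value for value in root))  # pi = 1/4 under JC
    return total


alignment = ["ACGTAC", "ACGTCC", "AGGTAC", "AGGTAA"]  # toy data, 4 sequences
branches = [0.1, 0.2, 0.15, 0.1, 0.05]                # toy values for b1..b5
print(log_likelihood(alignment, branches))
```

Every column repeats the same small matrix-vector work, which is where the floating point cost comes from; optimizing branch lengths or model parameters means re-evaluating this function many times.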
  • 31. Maximum Likelihood: Goal: Obtain topology with maximum likelihood value. Problem I: Number of possible topologies is exponential in n. Problem II: Computation of likelihood function is expensive. Problem III: Probably high score accuracy required. Problem IV: High memory consumption. Solution: New Algorithms; New Models; High Performance Computing
  • 32. Maximum Likelihood: Goal: Obtain topology with maximum likelihood value. Problem I: Number of possible topologies is exponential in n. Problem II: Computation of likelihood function is expensive. Problem III: Probably high score accuracy required. Problem IV: High memory consumption. Solution: New Algorithms; New Models; High Performance Computing. Callout: RAxML, Randomized Axelerated Maximum Likelihood
  • 33. Web & Grid Services: RAxML Web-Server at San Diego Supercomputer Center via www.phylo.org (CIPRES project); Web-Server at Vital-IT unit of Swiss Institute of Bioinformatics, phylobench.vital-it.ch/raxml-bb/: includes novel search algorithm with 1 order of magnitude run-time improvement; since Sept 3, about 700 jobs from 130 IPs; extension to SwissGrid planned; novel algorithm with Bootstopping to be integrated into CIPRES portal soon; RAxML integration into Distributed European Infrastructure for Supercomputing Applications (www.deisa.org) started 10 days ago; Integration into Debian medical distribution
  • 34. RAxML Black Box
  • 35. RAxML Black Box: Why are Black Boxes useful?
  • 36. Outline: Introduction; Computation of Phylogenies; Maximum Likelihood; Web & Grid Services; Three Steps Towards the Tree of Life; Parallelism on IBM BlueGene/L; Rapid Bootstrapping; A Bootstopping criterion; Related Projects; Outlook
  • 37. Levels of Parallelism: Embarrassing Parallelism (MPI, CORBA, Grid Technologies)
  • 38. Coarse-Grained Parallelism: MPI Version of RAxML. PC-CLUSTER: Master Process, Worker Processes (B-0, B-1, B-2, B-3, B-4), Interconnection Network
  • 39. Levels of Parallelism: Embarrassing Parallelism (MPI, CORBA, Grid Technologies); Inference Parallelism (MPI, algorithm-dependent)
  • 40. Levels of Parallelism: Embarrassing Parallelism (MPI, CORBA, Grid Technologies); Inference Parallelism (MPI, algorithm-dependent); Loop-Level Parallelism (OpenMP, GPUs, IBM CELL (Playstation), IBM BlueGene, Clusters with fast Interconnect)
  • 41. Loop Level Parallelism: virtual root, P, Q, R; P[i] = f(Q[i], R[i])
  • 42. Loop Level Parallelism: virtual root, P, Q, R; P[i] = f(Q[i], R[i]). This operation uses ≥ 90% of total execution time!
  • 43. Loop Level Parallelism: virtual root, P, Q, R; P[i] = f(Q[i], R[i]). This operation uses ≥ 90% of total execution time! → simple fine-grained parallelization
  • 44. Loop Level Parallelism: virtual root, P, Q, R
  • 45. Loop Level Parallelism: virtual root, P, Q, R
  • 46. Loop Level Parallelism: virtual root, P, Q, R
  • 47. Loop Level Parallelism: virtual root, P, Q, R. The real reason for assuming independent evolution among sites: ......
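Because P[i] depends only on Q[i] and R[i], the site range can be cut into independent chunks. The sketch below illustrates only that decomposition; it is not RAxML's OpenMP or MPI code. The kernel `f` is a hypothetical stand-in for the real likelihood computation, and Python threads are used purely for illustration (in CPython they would not speed up a CPU-bound kernel).

```python
from concurrent.futures import ThreadPoolExecutor


def f(q, r):
    """Hypothetical stand-in for the per-site likelihood kernel."""
    return q * r


def compute_chunk(Q, R, start, stop):
    """Compute P[i] = f(Q[i], R[i]) for the sites in [start, stop)."""
    return [f(Q[i], R[i]) for i in range(start, stop)]


def compute_parallel(Q, R, workers=4):
    """Split the m sites into contiguous chunks and process each chunk independently."""
    m = len(Q)
    bounds = [(w * m // workers, (w + 1) * m // workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the chunk results in the order of `bounds`
        chunks = pool.map(lambda b: compute_chunk(Q, R, b[0], b[1]), bounds)
        return [value for chunk in chunks for value in chunk]


Q = [0.1 * i for i in range(10)]
R = [1.0 + 0.5 * i for i in range(10)]
P = compute_parallel(Q, R, workers=3)
print(P)
```

No chunk reads another chunk's results, so the only communication needed is collecting (or, for a likelihood score, reducing) the per-chunk outputs.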
  • 48. Fine-Grained Parallelism: OpenMP version of RAxML
  • 49. Fine-Grained Parallelism: OpenMP version of RAxML
  • 50. HPC for ML (Bayesian): Proof of Concept & Programming Techniques: RAxML on a Graphics Processing Unit, RAxML on the IBM CELL & Playstation; Production Level Implementations: RAxML with OpenMP, RAxML with MPI, RAxML on BlueGene; Multi-Core Architectures
  • 51. HPC for ML (Bayesian): Proof of Concept & Programming Techniques: RAxML on a Graphics Processing Unit, RAxML on the IBM CELL & Playstation (callout: A good excuse to buy one); Production Level Implementations: RAxML with OpenMP, RAxML with MPI, RAxML on BlueGene; Multi-Core Architectures
  • 52. RAxML-BlueGene: Many slow processors: 1024 in one rack; 512 MB or 1 GB of main memory per node; But: high performance network. Challenges: Distribute tree data structure among CPUs; Exploit fast collective communication network; For optimal efficiency: loop-level + embarrassing parallelism (hybrid parallelism with MPI). Test & Production Run Data: With Olaf Bininda-Emonds, Jena: 2,182 mammalian sequences x 51,000 base pairs; With Dan Janies, Ohio State: 270 Human Haplotype Map sequences x 500,000 base pairs
  • 53. RAxML-BlueGene: To be presented at IEEE/ACM 2007 Supercomputing Conference. Many slow processors: 1024 in one rack; 512 MB or 1 GB of main memory per node; But: high performance network. Challenges: Distribute tree data structure among CPUs; Exploit fast collective communication network; For optimal efficiency: loop-level + embarrassing parallelism (hybrid parallelism with MPI). Test & Production Run Data: With Olaf Bininda-Emonds, Jena: 2,182 mammalian sequences x 51,000 base pairs; With Dan Janies, Ohio State: 270 Human Haplotype Map sequences x 500,000 base pairs
  • 54. RAxML-BlueGene: Many slow processors: 1024 in one rack; 512 MB or 1 GB of main memory per node; But: high performance network. Challenges: Distribute tree data structure among CPUs; Exploit fast collective communication network; For optimal efficiency: loop-level + embarrassing parallelism (hybrid parallelism with MPI). Test & Production Run Data: With Olaf Bininda-Emonds, Jena: 2,182 mammalian sequences x 51,000 base pairs; With Dan Janies, Ohio State: 270 Human Haplotype Map sequences x 500,000 base pairs. Callout: Largest ML analysis to date in terms of memory footprint
  • 55. Loop-Level Parallelism on BlueGene
  • 56. 50 Seqs x 23,385 bp
  • 57. 50 Seqs x 23,385 bp: Superlinear Speedup
  • 58. 250 Seqs x 403,581 bp
  • 59. Embarrassing Parallelism (diagram with 4 nodes labelled M and 12 nodes labelled W)
  • 60. Outline: Introduction; Computation of Phylogenies; Maximum Likelihood; Web & Grid Services; Three Steps Towards the Tree of Life; Parallelism on IBM BlueGene/L; Rapid Bootstrapping; A Bootstopping criterion; Related Projects; Outlook
  • 61. Confidence Values: Tree without node confidence values is mostly useless. Problem: Confidence value calculation is major computational obstacle; We can compute large trees but not analyse them: compute ≠ analyse! Current Slow Methods: Sampling with Bayesian methods; Non-parametric Bootstrapping
  • 62. A Tree with Confidence Values. Joint work with Marc Gottschling, Charite Hospital, Berlin
  • 63. Bootstrapping: Original Alignment → perturbation → compute tree, compute tree, compute tree
  • 64. Bootstrapping: Original Alignment → perturbation → compute tree, compute tree, compute tree. This needs to be done 100-1000 times. Embarrassingly Parallel!
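The "perturbation" step of non-parametric bootstrapping is usually implemented by drawing alignment columns with replacement, which can be stored as one integer weight per column. The sketch below shows that resampling step only; tree inference is out of scope, the function names are illustrative, and the weight-vector representation is an assumption about how a replicate might be stored.

```python
import random


def bootstrap_weights(num_columns, rng):
    """Draw num_columns columns with replacement; return how often each was drawn."""
    weights = [0] * num_columns
    for _ in range(num_columns):
        weights[rng.randrange(num_columns)] += 1
    return weights


def resampled_alignment(alignment, weights):
    """Expand a weight vector into an explicit bootstrap replicate alignment."""
    return ["".join(base * w for base, w in zip(seq, weights)) for seq in alignment]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
alignment = ["ACGTACGTACGTAC", "ACGTTCGTACGAAC", "AGGTACGTCCGTAC", "AGGTACTTACGTAA"]
weights = bootstrap_weights(len(alignment[0]), rng)
replicate = resampled_alignment(alignment, weights)
print(weights)
print(replicate)
```

Each replicate is independent of the others, so the 100 to 1000 tree searches can be farmed out to separate workers without communication.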
  • 65. Two Questions: How to compute Bootstraps faster? How many Bootstrap replicates do we need?
  • 66. Current Work: Rapid Bootstrapping Algorithm. Tested on 22 diverse (mammals, bacteria, archaea, grasses, fishes, plants, viral) real-world DNA/AA single-/multi-gene datasets containing 125-7,764 sequences; Pearson correlation on best-scoring ML trees between RBS (Rapid BS) & SBS (Standard BS) support values 0.95-0.99 (except one dataset at 0.91), average 0.97; Weighted topological distance < 6%, average 4%; Program Acceleration: 8-20, average ≈ 15; Acceleration by one order of magnitude; Full ML analysis (100 BS + ML search) of datasets of up to 5,000 sequences within less than 5 days on your desktop!; Allows for a sufficiently large number of Bootstrap replicates
  • 67. Quick & Dirty Bootstrap: Modify Algorithm, Computational Experiments
  • 68. Quick & Dirty Bootstrap: Modify Algorithm, Computational Experiments, iterate
  • 69. Rapid Bootstrap: 11111111111111; 01102211111111; 10111102220111; 11111110112021
  • 70. Rapid Bootstrap: Compute Starting Tree. 11111111111111; 01102211111111; 10111102220111; 11111110112021
  • 71. Rapid Bootstrap: Optimize Model Params & Branch Lengths. 11111111111111; 01102211111111; 10111102220111; 11111110112021
  • 72. Rapid Bootstrap: Use Starting Tree & Model Params to compute RELL scores. 11111111111111; 01102211111111 (-110); 10111102220111 (-105); 11111110112021 (-100)
  • 73. Rapid Bootstrap: Use Starting Tree & Model Params to compute RELL scores. 11111111111111; 01102211111111 (-110); 10111102220111 (-105); 11111110112021 (-100). Sort by RELL
  • 74. Rapid Bootstrap: 11111111111111; 11111110112021 (-100), T0: Thorough Search; 10111102220111 (-105); 01102211111111 (-110)
  • 75. Rapid Bootstrap: 11111111111111; 11111110112021 (-100), T0: Thorough Search; 10111102220111 (-105), T1: Quick Search on T0; 01102211111111 (-110)
  • 76. Rapid Bootstrap: 11111111111111; 11111110112021 (-100), T0: Thorough Search; 10111102220111 (-105), T1: Quick Search on T0; 01102211111111 (-110), T2: Quick Search on T1
  • 77. Rapid Bootstrap: 11111111111111; 11111110112021 (-100), T0: Thorough Search; 10111102220111 (-105), T1: Quick Search on T0; 01102211111111 (-110), T2: Quick Search on T1. Sequential dependency is bad for parallelism
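Slides 72 to 77 can be read as: score each replicate's weight vector with RELL on the starting tree, sort by score, then search the replicates in that order, each search starting from the previous result. The sketch below follows that reading, which is an interpretation of the slides rather than the documented algorithm. A RELL score is the weighted sum of per-site log-likelihoods; the per-site values are invented for illustration, and `thorough_search` and `quick_search` are hypothetical callables supplied by the caller.

```python
def weights_from_string(text):
    """Turn a digit string such as '01102211111111' into a list of column weights."""
    return [int(ch) for ch in text]


def rell_score(weights, site_lnl):
    """RELL estimate: per-site log-likelihoods reweighted by the replicate's weights."""
    return sum(w * lnl for w, lnl in zip(weights, site_lnl))


def order_replicates(replicates, site_lnl):
    """Return (score, replicate) pairs, best (highest) RELL score first."""
    scored = [(rell_score(weights_from_string(r), site_lnl), r) for r in replicates]
    return sorted(scored, reverse=True)


def search_chain(ordered, thorough_search, quick_search):
    """T0 comes from a thorough search; every later tree starts from its predecessor."""
    trees = []
    for index, (_, replicate) in enumerate(ordered):
        if index == 0:
            trees.append(thorough_search(replicate))
        else:
            trees.append(quick_search(replicate, trees[-1]))  # needs the previous tree
    return trees


# Weight vectors as shown on the slides; per-site log-likelihoods are made up.
replicates = ["01102211111111", "10111102220111", "11111110112021"]
site_lnl = [-7.2, -6.8, -7.9, -6.1, -8.3, -7.0, -6.5, -7.7, -6.9, -8.0, -7.4, -6.3, -7.1, -6.6]
ordered = order_replicates(replicates, site_lnl)
trees = search_chain(ordered, lambda r: ["T0"], lambda r, prev: prev + [f"T{len(prev)}"])
print(ordered)
print(trees[-1])
```

The loop in `search_chain` cannot be parallelized across replicates because each call consumes the previous tree, which is the sequential dependency the last slide in the sequence points out.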
  • 78. Scalability of Rapid Bootstrap
  • 79. Scalability of Rapid Bootstrap: Some datasets are harder than others
  • 80. Scalability of Rapid Bootstrap
  • 81. ML-Scores: Garli, RAxML, PHYML (715 Sequences)
  • 82. Correlation, 125 Taxa: 0.91
  • 83. Support Value Distribution
  • 84. Bootstrap Likelihood Values: 125 x 19,436; 10,000 replicates; only 195 non-trivial bipartitions
  • 85. Bootstrap Likelihood Values: 125 x 19,436
  • 86. 3,491 rbcL sequences: Rapid versus Standard BS, Correlation: 0.98
  • 87. 7,764 DNA: Best Tree
  • 88. 7,764 DNA: All Bipartitions
  • 89. 775 x 3,838 AA
  • 90. New Opportunities: Assess Impact of Alignment Method on tree and support values; Test Bootstrap of the Bootstrap (double Bootstrap) procedures; Devise and empirically verify Bootstopping criteria
  • 91. Bootstrap of the Bootstrap: 140 AA (Efron et al PNAS 1996)
  • 92. Bootstrap of the Bootstrap: 3,491 rbcL
  • 93. Bootstopping: Rapid Bootstrapping makes it possible to assess Bootstopping criteria as follows: 1. Compute a high number of BS replicates (10,000). 2. Devise topology-based bootstopping criterion and apply it to these 10,000 replicates. 3. Compare support values induced by bootstopped trees (say 300 replicates) with 10,000 replicates. We have 10,000 replicates for 18 datasets containing 125 to 2,554 sequences
  • 94. Bootstopping Criterion: Every 50, 100, 150, ... replicates do a test: Say we have N BS trees. Do the following 100 times: randomly split up this set of N trees into 2 equal sets S1, S2 of size N/2; compute the bipartition support vectors for S1 and S2; compute Pearson correlation of the support vectors. Return average of the 100 Pearson correlations; if average > 0.99, stop
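The test on slide 94 translates almost line by line into code. In this minimal sketch each bootstrap tree is represented as a set of bipartition labels, which is an assumed encoding chosen for brevity; the handling of constant support vectors (where Pearson correlation is undefined) and of an odd N are choices made for this sketch, not something the slide specifies.

```python
import math
import random
from collections import Counter


def support_vector(trees, bipartitions):
    """Fraction of trees containing each bipartition, in a fixed order."""
    counts = Counter(b for tree in trees for b in tree)
    return [counts[b] / len(trees) for b in bipartitions]


def pearson(x, y):
    """Pearson correlation; constant vectors are a sketch-specific special case."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 1.0 if x == y else 0.0
    return cov / math.sqrt(var_x * var_y)


def bootstop_test(trees, rng, permutations=100, threshold=0.99):
    """Return (stop, average correlation) for the current set of N >= 2 BS trees."""
    bipartitions = sorted(set().union(*trees))
    half = len(trees) // 2  # with odd N, one tree is left out of each split
    total = 0.0
    for _ in range(permutations):
        shuffled = list(trees)
        rng.shuffle(shuffled)
        s1, s2 = shuffled[:half], shuffled[half:2 * half]
        total += pearson(support_vector(s1, bipartitions),
                         support_vector(s2, bipartitions))
    average = total / permutations
    return average > threshold, average


# Toy input: 400 trees; "a" is in every tree, "b" in half of them, "c" in a few.
trees = [frozenset({"a", "b"} if i % 2 == 0 else {"a"})
         | (frozenset({"c"}) if i % 40 == 0 else frozenset())
         for i in range(400)]
stop, average = bootstop_test(trees, random.Random(0))
print(stop, round(average, 4))
```

In a real run the function would be called after every 50 replicates, and bootstrapping would end the first time it returns `True`.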
  • 95. Result Overview: Bootstopped between 100-400 (avg 213); Correlation on best tree: Bootstopped versus 10,000 replicates > 0.99 (avg 0.995); Correlation of all bipartitions > 0.995 (avg 0.997)
  • 96. Bootstopping Best: 140 AA
  • 97. Bootstopping Best: 404 DNA (Multi-Gene)
  • 98. Bootstopping Best: 994 DNA
  • 99. Bootstopping All: 994 DNA
  • 100. Bootstopping Best: 1,908 DNA
  • 101. Bootstopping Best: 2,554 DNA
  • 102. Putting the Pieces together: BlueGene: Can handle huge datasets; Use CAT approximation on BlueGene: further speedup of factor 3.5, memory footprint reduction factor 4
  • 103. 8,864 Bacteria under GTR+Γ and GTR+CAT (plot: Log Likelihood Score under Γ versus Execution Time; 7 days, 14 days)
  • 104. Putting the Pieces together: BlueGene: Can handle huge datasets; Use CAT approximation on BlueGene: further speedup of factor 3.5, memory footprint reduction factor 4; Integrate rapid Bootstrap into BlueGene version: additional speedup ≈ 15; Mechanisms available to accelerate BlueGene version by factor 50-60; Integrate Bootstopping into BlueGene. Conclusion: We will soon be able to compute a small tree of life with 10,000 organisms and data from multiple genes!
  • 105. Outline: Introduction; Computation of Phylogenies; Maximum Likelihood; Web & Grid Services; Three Steps Towards the Tree of Life; Parallelism on IBM BlueGene/L; Rapid Bootstrapping; A Bootstopping criterion; Related Projects; Outlook
  • 106. Host-Parasite Co-Evolution: Parasites (e.g. Lice), Hosts (e.g. Mammals)
  • 107. Host-Parasite Co-Evolution: Hosts, Parasites, Co-Evolution Hypothesis; 8 Parasites, 6 Hosts, Adjacency Matrix 0/1
  • 108. Host-Parasite Co-Evolution: Hosts, Parasites, Co-Evolution Hypothesis; 8 Parasites, 6 Hosts, Adjacency Matrix 0/1; Statistical Test
  • 109. What can HPC do for Bioinformatics? Axelerated Parafit: “Parafit: statistical test of co-evolution”, Pierre Legendre, Syst. Biol. 2003; AxParafit (Axelerated Parafit): statistical test of hypotheses of host-parasite co-evolution; C porting, optimization, BLAS integration; Speedup up to factor 67; Master-Worker MPI-parallelization; Largest co-phylogenetic study to date conducted within 8 minutes instead of 4 weeks; Open-Source Code: http://icwww.epfl.ch/~stamatak/AxParafit.html; SwissGrid-based Web-Server planned
  • 110. AxParafit: Sequential Performance
  • 111. AxParafit: Parallel Performance
  • 112. The ML Benchmark: A Current Community Project. Standardized way required to test ML search programs; Web-Server with real-world alignments and performance data at Swiss Institute of Bioinformatics; Many developers of popular ML programs involved: Stephane Guindon (PHYML) Montpellier, Simon Wheelan (LeaPhy) Manchester, Bui Quang Minh (IQPNNI) Vienna, Derrick Zwickl (GARLI) Virginia, Thomas Keane (dprML) Cambridge; Byproduct: SPEC-like CPU benchmark for phylogenetics; Follow-up: (planned) ML competition at major conference with industrial sponsor
  • 113. A Current Problem: Handling Multi-Gene Alignments. Gene 1, Gene 2; Sequence 1 to Sequence 5; Missing Data ≠ Gap Data
  • 114. A Multi-Gene Model
  • 115. A Multi-Gene Model
  • 116. A Multi-Gene Model
  • 117. A Multi-Gene Model: LogLH(T) = LogLH(T|Red)
  • 118. A Multi-Gene Model: LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow)
  • 119. A Multi-Gene Model: LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow). Challenge: devise efficient data structures for this
  • 120. Why are Individual Branches per Gene a Challenge?
  • 121. Why are Individual Branches per Gene a Challenge?
  • 122. Outlook
  • 123. Outlook: Tree of Life; What is a good alignment in a phylogenetic context?; Simultaneous alignment and tree building; More HPC & memory-aware programming; Multi-core architectures; Models for “gappy” multi-gene alignments
  • 124. Acknowledgements: BlueGene Project: Michael Ott, TUM; Srinivas Aluru, Jaroslaw Zola, Iowa State; Dan Janies, Andrew Johnson, Ohio State. IBM CELL & Playstation: Filip Blagojevic, Dimitris Nikolopoulos, Virginia Tech; Christos Antonopoulos, Univ. of Thessaly. Bootstopping: Bernard Moret, Masoud Alipour, EPFL; Olaf Bininda-Emonds, Univ. Jena. RAxML Web-Server: Jacques Rougemont, SIB; Terri Liebowitz, SDSC. AxParafit/AxPcoords: Markus Goeker, Alexander Auch, Jan Meier-Kolthoff, University of Tuebingen. Datasets for Studies: Jun Inoue (Florida), Nicolas Salamin (Lausanne), Marc Gottschling (Berlin), Guido Grimm (Tuebingen), Nikos Poulakakis (Yale), Usman Roshan (NJIT)
  • 125. Thank you for your Attention! Lake Geneva, Switzerland