Crunching Huge Phylogenies: A Rapid Bootstrap Algorithm and Massive Parallelism on the IBM BlueGene
Host-Parasite Co-Evolution
    Hosts (eg Mammals)    Parasites (eg Lice)
    Co-Evolution Hypothesis
    ...
What can HPC do for Bioinformatics?
    Axelerated Parafit
    “Parafit: statistical test of co-evolution”,...
AxParafit: Sequential Performance
AxParafit: Parallel Performance
The ML Benchmark: A Current Community Project
    Standardized way required to test ML search programs
    ...
A Current Problem: Handling Multi-Gene Alignments
    Gene 1    Gene 2
    Sequence 1
    ...
A Multi-Gene Model
    LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow)
    Challenge: devise efficient data
    ...
Why are Individual Branches per Gene a Challenge?
Outlook
    Tree of Life
    What is a good alignment in a phylogenetic...
Acknowledgements
    BlueGene Project
        Michael Ott, TUM
        Srinivas Aluru, Jarosla...
Thank you for your Attention!
    Lake Geneva, Switzerland
    Alexandros Stamatakis, October 2007
Crunching Huge Phylogenies. A. Stamatakis

  1. Crunching Huge Phylogenies: A Rapid Bootstrap Algorithm and Massive Parallelism on the IBM BlueGene
     Alexandros Stamatakis
     Swiss Federal Institute of Technology Lausanne (EPFL), School of Computer & Communication Sciences, Laboratory for Computational Biology and Bioinformatics, Lausanne, Switzerland & Swiss Institute of Bioinformatics
     Alexandros.Stamatakis@epfl.ch, icwww.epfl.ch/~stamatak
  2. The Missing Part: Data Assembly, Inference ?, Tree Analysis
     Alexandros Stamatakis, October 2007
  3. The Missing Part: Data Assembly, Tree Analysis
  4. IBM BlueGene/L supercomputer
  5. Rapid Bootstrapping, Bootstopping Criterion
  6. The Big Hardware Problem [plot, 1980 to 2007: CPU Speed 40% p.a.; Memory Speed 9% p.a.]
  7. ... and why this concerns Bioinformatics [same plot with a Sequence Data curve added]
  8. ... and why this concerns Bioinformatics: Application of HPC techniques will become much more important [same plot]
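As a rough worked example of the gap on slides 6 to 8, the sketch below compounds the two growth rates over the plotted span. It assumes both rates compound annually from 1980 to 2007; that assumption and the names `compound_growth`, `cpu_factor`, `mem_factor`, and `gap` are illustrative and not taken from the slides.

```python
def compound_growth(rate_per_year: float, years: int) -> float:
    """Total growth factor after `years` of compounding at `rate_per_year`."""
    return (1.0 + rate_per_year) ** years


YEARS = 2007 - 1980                        # span shown on the plot's time axis
cpu_factor = compound_growth(0.40, YEARS)  # CPU speed, 40% p.a.
mem_factor = compound_growth(0.09, YEARS)  # memory speed, 9% p.a.
gap = cpu_factor / mem_factor              # how far memory falls behind the CPU

print(f"CPU x{cpu_factor:.0f}, memory x{mem_factor:.1f}, gap x{gap:.0f}")
```

Under these assumptions the ratio `gap` is what makes cache-aware, HPC-style programming increasingly important for data-heavy bioinformatics codes.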
  9. Cache Hierarchy
  10. Outline
     • Introduction
     • Computation of Phylogenies
     • Maximum Likelihood
     • Web & Grid Services
     • Three Steps Towards the Tree of Life
     • Parallelism on IBM BlueGene/L
     • Rapid Bootstrapping
     • A Bootstopping criterion
     • Related Projects
     • Outlook
  11. Phylogenetics
     • Input: “good” multiple Alignment
     • Output: unrooted binary tree
     • Various methods for phylogenetic inference
        • Neighbour Joining (fast & simple)
        • Maximum Parsimony (relatively fast & simple)
        • Maximum Likelihood (complex & slow)
        • Bayesian Methods (complex & slower)
  12. Phylogenetics: bullets as on slide 11, plus the callout “ML & Bayesian: explicit choice model”
  13. Phylogenetics: bullets as on slide 11, plus the callout “Complex Methods & Models required to reconstruct large & complicated trees! Focus of this talk is on Maximum Likelihood!”
  14. Phylogenetics: bullets as on slide 11, plus the callout “The real reason for working on ML: ......”
  15. Challenges for Phyloinformatics
     • Holy grail: “Tree of Life”
     • What is a good alignment in a phylogenetic context?
     • Simultaneous alignment and tree building
     • Improve/extend models ... but thereby size of computable trees decreases!
     • More HPC awareness
     • Exploit multi-core architectures
     • Amount of available data grows at a higher rate than algorithms are getting faster
  16. The algorithmic problem
  17. The number of trees
  18. The number of trees
  19. The number of trees
  20. The number of trees explodes! BANG!
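The transcript keeps only the slide titles here, so the sketch below fills in the standard counting result rather than anything shown on the slides: the number of distinct unrooted binary trees on n labelled taxa is (2n-5)!! = 3 · 5 · ... · (2n-5). The function name is illustrative.

```python
def num_unrooted_binary_trees(n_taxa: int) -> int:
    """Number of unrooted binary tree topologies on n_taxa labelled leaves."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted binary tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20, 50):
    print(n, num_unrooted_binary_trees(n))
```

Python integers are arbitrary precision, so the loop stays exact even for taxon counts where the result no longer fits into 64 bits.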
  21. Outline (as on slide 10)
  22. Maximum Likelihood [figure: alignment of length m with rows Seq1, Seq2, Seq3, Seq4]
  23. Maximum Likelihood [figure adds: substitution model, a matrix over A, C, G, T]
  24. Maximum Likelihood [figure adds: prior probabilities, empirical base frequencies πA, πC, πG, πT]
  25. Maximum Likelihood [figure adds: tree over Seq 1 to Seq 4 with branches b1 to b5]
  26. Maximum Likelihood [figure adds: virtual root vr]
  27. Maximum Likelihood [figure adds: a vector of length m with entries P(A), P(C), P(G), P(T)]
  28. Maximum Likelihood: Lots of floating point operations!
  29. Maximum Likelihood: optimize branch lengths
  30. Maximum Likelihood: optimize model parameters
  31. Maximum Likelihood
     • Goal: Obtain topology with maximum likelihood value
     • Problem I: Number of possible topologies is exponential in n
     • Problem II: Computation of likelihood function is expensive
     • Problem III: Probably high score accuracy required
     • Problem IV: High memory consumption
     • Solution: New Algorithms, New Models, High Performance Computing
  32. Maximum Likelihood: bullets as on slide 31, plus the callout “RAxML: Randomized Axelerated Maximum Likelihood”
  33. Web & Grid Services
     • RAxML Web-Server at San Diego Supercomputing Center via www.phylo.org (CIPRES project)
     • Web-Server at Vital-IT unit of Swiss Institute of Bioinformatics: phylobench.vital-it.ch/raxml-bb/
        • Includes novel search algorithm with 1 order of magnitude run-time improvement
        • Since Sept 3, about 700 jobs from 130 IPs
        • Extension to SwissGrid planned
        • Novel algorithm with Bootstopping to be integrated into CIPRES portal soon
     • RAxML integration into Distributed European Infrastructure for Supercomputing Applications (www.deisa.org) started 10 days ago
     • Integration into Debian medical distribution
  34. RAxML Black Box
  35. RAxML Black Box: Why are Black Boxes useful?
  36. Outline (as on slide 10)
  37. Levels of Parallelism
     • Embarrassing Parallelism: MPI, CORBA, Grid Technologies
  38. Coarse-Grained Parallelism: MPI Version of RAxML [figure: PC cluster with a master process and worker processes, labels B-0 to B-4, linked by an interconnection network]
  39. Levels of Parallelism
     • Embarrassing Parallelism: MPI, CORBA, Grid Technologies
     • Inference Parallelism: MPI, algorithm-dependent
  40. Levels of Parallelism
     • Embarrassing Parallelism: MPI, CORBA, Grid Technologies
     • Inference Parallelism: MPI, algorithm-dependent
     • Loop-Level Parallelism: OpenMP, GPUs, IBM CELL (Playstation), IBM BlueGene, Clusters with fast Interconnect
  41. Loop Level Parallelism [figure: virtual root above node P with children Q and R]: P[i] = f(Q[i], R[i])
  42. Loop Level Parallelism: P[i] = f(Q[i], R[i]). This operation uses ≥ 90% of total execution time!
  43. Loop Level Parallelism: P[i] = f(Q[i], R[i]). This operation uses ≥ 90% of total execution time! → simple fine-grained parallelization
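The loop P[i] = f(Q[i], R[i]) can be sketched as follows. This is a minimal illustration, not RAxML's code: it assumes the Jukes-Cantor substitution model (chosen here only because its transition probabilities have a short closed form), tip sequences as children, and hypothetical names `jc69_matrix`, `tip_vectors`, `f`, and `combine`. The point is that each alignment column is computed independently, so the columns can be split among threads or processes.

```python
import math

STATES = "ACGT"


def jc69_matrix(t: float):
    """Jukes-Cantor transition probabilities for a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if a == b else diff for b in range(4)] for a in range(4)]


def tip_vectors(seq: str):
    """One likelihood vector per site: 1.0 for the observed base, else 0.0."""
    return [[1.0 if s == base else 0.0 for s in STATES] for base in seq]


def f(q_i, r_i, p_q, p_r):
    """P[i] = f(Q[i], R[i]) for a single alignment column i."""
    return [
        sum(p_q[a][b] * q_i[b] for b in range(4))
        * sum(p_r[a][c] * r_i[c] for c in range(4))
        for a in range(4)
    ]


def combine(Q, R, t_q, t_r):
    """Compute the whole vector P; no iteration depends on another one."""
    p_q, p_r = jc69_matrix(t_q), jc69_matrix(t_r)
    return [f(q_i, r_i, p_q, p_r) for q_i, r_i in zip(Q, R)]


Q, R = tip_vectors("ACGTAC"), tip_vectors("ACGTTT")
P_full = combine(Q, R, 0.1, 0.2)
# Splitting the columns between two "workers" gives the same result.
P_split = combine(Q[:3], R[:3], 0.1, 0.2) + combine(Q[3:], R[3:], 0.1, 0.2)
print(P_full == P_split)
```

Because the halves are computed without any shared state, the same split maps directly onto an OpenMP parallel loop or onto MPI ranks that each own a slice of the alignment.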
  44. Loop Level Parallelism [figure: virtual root, P, Q, R]
  45. Loop Level Parallelism [figure: virtual root, P, Q, R]
  46. Loop Level Parallelism [figure: virtual root, P, Q, R]
  47. Loop Level Parallelism: The real reason for assuming independent evolution among sites: ......
  48. Fine-Grained Parallelism: OpenMP version of RAxML
  49. Fine-Grained Parallelism: OpenMP version of RAxML
  50. HPC for ML (Bayesian)
     • Proof of Concept & Programming Techniques:
        • RAxML on a Graphics Processing Unit
        • RAxML on the IBM CELL & Playstation
     • Production Level Implementations:
        • RAxML with OpenMP
        • RAxML with MPI
        • RAxML on BlueGene
        • Multi-Core Architectures
  51. HPC for ML (Bayesian): bullets as on slide 50, plus the callout “A good excuse to buy one”
  52. RAxML-BlueGene
     • Many slow processors: 1024 in one rack
     • 512 MB or 1GB of main memory per node
     • But: high performance network
     • Challenges:
        • Distribute tree data structure among CPUs
        • Exploit fast collective communication network
        • For optimal efficiency: loop-level + embarrassing parallelism → hybrid parallelism with MPI
     • Test & Production Run Data
        • With Olaf Bininda-Emonds, Jena: 2,182 mammalian sequences x 51,000 base pairs
        • With Dan Janies, Ohio State: 270 Human Haplotype Map sequences x 500,000 base pairs
  53. RAxML-BlueGene: bullets as on slide 52, plus the callout “To be presented at IEEE/ACM 2007 Supercomputing Conference.”
  54. RAxML-BlueGene: bullets as on slide 52, plus the callout “Largest ML analysis to date in terms of memory footprint”
  55. Loop-Level Parallelism on BlueGene
  56. 50 Seqs x 23,385 bp
  57. 50 Seqs x 23,385 bp: Superlinear Speedup
  58. 250 Seqs x 403,581 bp
  59. Embarrassing Parallelism [figure: several master processes (M), each with a group of worker processes (W)]
  60. Outline (as on slide 10)
  61. Confidence Values
     • Tree without node confidence values is mostly useless
     • Problem:
        • Confidence value calculation is major computational obstacle
        • We can compute large trees but not analyse them: compute ≠ analyse!
     • Current Slow Methods
        • Sampling with Bayesian methods
        • Non-parametric Bootstrapping
  62. A Tree with Confidence Values. Joint work with Marc Gottschling, Charite Hospital, Berlin
  63. Bootstrapping [figure: Original Alignment, perturbation, then compute tree for each perturbed alignment]
  64. Bootstrapping [same figure]: This needs to be done 100-1000 times. Embarrassingly Parallel!
  65. Two Questions
     • How to compute Bootstraps faster?
     • How many Bootstrap replicates do we need?
  66. Current Work: Rapid Bootstrapping Algorithm
     • Tested on 22 diverse (mammals, bacteria, archaea, grasses, fishes, plants, viral) real-world DNA/AA single-/multi-gene datasets containing 125-7,764 sequences
     • Pearson correlation on best-scoring ML trees between RBS (Rapid BS) & SBS (Standard BS) support values 0.95-0.99 (except one dataset at 0.91), average 0.97
     • Weighted topological distance < 6%, average 4%
     • Program Acceleration: 8-20, average ≈ 15
     • Acceleration by one order of magnitude
     • Full ML analysis (100 BS + ML search) of datasets of up to 5,000 sequences within less than 5 days on your desktop!
     • Allows for a sufficiently large number of Bootstrap replicates
  67. Quick & Dirty Bootstrap: Modify Algorithm, Computational Experiments
  68. Quick & Dirty Bootstrap: Modify Algorithm, Computational Experiments, iterate
  69. Rapid Bootstrap
        11111111111111
        01102211111111
        10111102220111
        11111110112021
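The digit strings on slide 69 can be read as per-column weight vectors: the all-ones vector is the original alignment, and each of the other three sums to 14, the number of columns, as a bootstrap replicate drawn with replacement would. A minimal sketch of that resampling step follows; the reading of the digits as weights is an interpretation, and `bootstrap_weights` and `parse_weights` are hypothetical helper names.

```python
import random


def bootstrap_weights(n_sites: int, rng: random.Random):
    """Draw n_sites columns with replacement; return how often each was drawn."""
    weights = [0] * n_sites
    for _ in range(n_sites):
        weights[rng.randrange(n_sites)] += 1
    return weights


def parse_weights(digits: str):
    """Turn a digit string such as '01102211111111' into a list of weights."""
    return [int(d) for d in digits]


# Weight vectors as they appear on the slide.
SLIDE_VECTORS = [
    "11111111111111",
    "01102211111111",
    "10111102220111",
    "11111110112021",
]

rng = random.Random(2007)          # fixed seed so the sketch is reproducible
replicate = bootstrap_weights(14, rng)
print("".join(str(w) for w in replicate), sum(replicate))
```

Storing a replicate as weights rather than as a copied alignment means the likelihood code can multiply each column's log likelihood by its weight instead of rebuilding the data.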
  70. Rapid Bootstrap: Compute Starting Tree (on the original alignment, 11111111111111)
  71. Rapid Bootstrap: Optimize Model Params & Branch Lengths
  72. Rapid Bootstrap: Use Starting Tree & Model Params to compute RELL scores
        01102211111111   -110
        10111102220111   -105
        11111110112021   -100
  73. Rapid Bootstrap: Sort by RELL
  74. Rapid Bootstrap
        11111110112021   -100   T0: Thorough Search
        10111102220111   -105
        01102211111111   -110
  75. Rapid Bootstrap
        11111110112021   -100   T0: Thorough Search
        10111102220111   -105   T1: Quick Search on T0
        01102211111111   -110
  76. Rapid Bootstrap
        11111110112021   -100   T0: Thorough Search
        10111102220111   -105   T1: Quick Search on T0
        01102211111111   -110   T2: Quick Search on T1
  77. Rapid Bootstrap: as on slide 76, plus the callout “sequential dependency is bad for parallelism”
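The ordering and chaining on slides 72 to 77 can be sketched as below. The sketch assumes the usual RELL definition (a replicate's score is the weighted sum of the per-site log likelihoods computed once on the starting tree); the per-site values, the function names, and the `thorough_search` / `quick_search` callables are hypothetical stand-ins, not RAxML's interface.

```python
def rell_score(weights, site_lnl):
    """Weighted sum of per-site log likelihoods for one bootstrap replicate."""
    return sum(w * lnl for w, lnl in zip(weights, site_lnl))


def sort_by_rell(replicates, site_lnl):
    """Best-scoring replicate first, as in the 'Sort by RELL' step."""
    return sorted(replicates, key=lambda w: rell_score(w, site_lnl), reverse=True)


def rapid_bootstrap(ordered, starting_tree, thorough_search, quick_search):
    """T0 gets a thorough search; every later tree is a quick search on its predecessor."""
    trees, current = [], starting_tree
    for k, weights in enumerate(ordered):
        search = thorough_search if k == 0 else quick_search
        current = search(current, weights)   # sequential dependency between replicates
        trees.append(current)
    return trees


# Hypothetical per-site log likelihoods for a 14-column alignment.
SITE_LNL = [-1.0, -2.5, -0.5, -3.0, -1.5, -2.0, -1.0,
            -0.8, -1.2, -2.2, -0.7, -1.9, -1.1, -1.6]
REPLICATES = [[int(d) for d in s] for s in
              ("01102211111111", "10111102220111", "11111110112021")]

ordered = sort_by_rell(REPLICATES, SITE_LNL)
print([round(rell_score(w, SITE_LNL), 1) for w in ordered])
```

The loop in `rapid_bootstrap` makes the drawback from slide 77 visible: each search needs the previous tree, so replicates within one chain cannot run concurrently.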
  78. Scalability of Rapid Bootstrap
  79. Scalability of Rapid Bootstrap: Some datasets are harder than others
  80. Scalability of Rapid Bootstrap
  81. ML-Scores: Garli, RAxML, PHYML; 715 Sequences
  82. Correlation 125 Taxa: 0.91
  83. Support Value Distribution
  84. Bootstrap Likelihood Values, 125 x 19,436: 10,000 replicates, only 195 non-trivial bipartitions
  85. Bootstrap Likelihood Values, 125 x 19,436
  86. 3,491 rbcL sequences, Rapid versus Standard BS. Correlation: 0.98
  87. 7,764 DNA Best Tree
  88. 7,764 DNA All Bipartitions
  89. 775 x 3,838 AA
  90. New Opportunities: Assess impact of alignment method on tree and support values; test Bootstrap of the Bootstrap (double Bootstrap) procedures; devise and empirically verify Bootstopping criteria
  91. Bootstrap of the Bootstrap: 140 AA (Efron et al., PNAS 1996)
  92. Bootstrap of the Bootstrap: 3,491 rBCL
  93. Bootstopping: Rapid Bootstrapping allows us to assess Bootstopping criteria as follows: 1. Compute a high number of BS replicates (10,000). 2. Devise a topology-based bootstopping criterion and apply it to these 10,000 replicates. 3. Compare the support values induced by the bootstopped trees (say, 300 replicates) with those from all 10,000 replicates. We have 10,000 replicates for 18 datasets containing 125 to 2,554 sequences.
  94. Bootstopping Criterion: After every 50 replicates (at 50, 100, 150, ...), run a test. Say we have N BS trees. Do the following 100 times: randomly split the set of N trees into 2 equal sets S1, S2 of size N/2; compute the bipartition support vectors for S1 and S2; compute the Pearson correlation of the two support vectors. Return the average of the 100 Pearson correlations; if the average > 0.99, stop.
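The test on this slide can be sketched in Python as follows. This is a minimal illustration, not the author's implementation. It assumes each bootstrap tree is given as a set of bipartition labels (a hypothetical encoding), that a bipartition missing from one half counts as support 0 there, and that two constant support vectors are treated as perfectly correlated when identical. The function names are made up for this example.

```python
import random
from math import sqrt


def support_vector(trees, bipartitions):
    """Fraction of trees that contain each bipartition, in a fixed order."""
    n = len(trees)
    return [sum(b in t for t in trees) / n for b in bipartitions]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Assumption: a constant vector has no defined correlation, so
        # identical vectors count as 1.0 and differing ones as 0.0.
        return 1.0 if x == y else 0.0
    return sxy / sqrt(sxx * syy)


def bootstop_test(trees, rng, n_splits=100, threshold=0.99):
    """Average Pearson correlation over random half/half splits of the trees."""
    bipartitions = sorted(set().union(*trees))
    half = len(trees) // 2
    total = 0.0
    for _ in range(n_splits):
        shuffled = trees[:]
        rng.shuffle(shuffled)
        s1, s2 = shuffled[:half], shuffled[half:2 * half]
        total += pearson(support_vector(s1, bipartitions),
                         support_vector(s2, bipartitions))
    avg = total / n_splits
    return avg > threshold, avg


def bootstop(trees, rng, step=50):
    """Return the first replicate count (50, 100, ...) that passes, else None."""
    for n in range(step, len(trees) + 1, step):
        stop, _ = bootstop_test(trees[:n], rng)
        if stop:
            return n
    return None
```

A call such as `bootstop(trees, random.Random(0))` returns the number of replicates at which the criterion would have stopped, or `None` if it never fires.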
  95. Result Overview: Bootstopped between 100 and 400 replicates (avg 213). Correlation on best tree, bootstopped versus 10,000 replicates: > 0.99 (avg 0.995). Correlation of all bipartitions: > 0.995 (avg 0.997).
  96. Bootstopping, Best: 140 AA
  97. Bootstopping, Best: 404 DNA (Multi-Gene)
  98. Bootstopping, Best: 994 DNA
  99. Bootstopping, All: 994 DNA
  100. Bootstopping, Best: 1,908 DNA
  101. Bootstopping, Best: 2,554 DNA
  102. Putting the Pieces together: BlueGene can handle huge datasets. Use CAT approximation on BlueGene: further speedup of factor 3.5; memory footprint reduction factor 4.
  103. 8,864 Bacteria under GTR+Γ and GTR+CAT (plot: Log Likelihood Score under Γ versus Execution Time; axis marks at 7 days and 14 days)
  104. Putting the Pieces together: BlueGene can handle huge datasets. Use CAT approximation on BlueGene: further speedup of factor 3.5; memory footprint reduction factor 4. Integrate rapid Bootstrap into BlueGene version: additional speedup ≈ 15. Mechanisms available to accelerate BlueGene version by factor 50-60. Integrate Bootstopping into BlueGene. Conclusion: We will soon be able to compute a small tree of life with 10,000 organisms and data from multiple genes!
  105. Outline: Introduction; Computation of Phylogenies; Maximum Likelihood; Web & Grid Services; Three Steps Towards the Tree of Life; Parallelism on IBM BlueGene/L; Rapid Bootstrapping; A Bootstopping criterion; Related Projects; Outlook
  106. Host-Parasite Co-Evolution: Parasites (e.g. Lice), Hosts (e.g. Mammals)
  107. Host-Parasite Co-Evolution: Hosts, Parasites; Co-Evolution Hypothesis: 0/1 Adjacency Matrix (8 parasites, 6 hosts)
  108. Host-Parasite Co-Evolution: Hosts, Parasites; Co-Evolution Hypothesis: 0/1 Adjacency Matrix (8 parasites, 6 hosts); Statistical Test
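A 0/1 host-parasite association matrix like the one named on the slide can be built from a list of observed links. The Python sketch below uses the slide's dimensions (6 hosts, 8 parasites). The row/column orientation and the link list are made up for illustration and do not come from the talk.

```python
def association_matrix(n_hosts, n_parasites, links):
    """Return an n_hosts x n_parasites 0/1 matrix.

    `links` is a list of (host_index, parasite_index) pairs, one per
    observed host-parasite association.
    """
    matrix = [[0] * n_parasites for _ in range(n_hosts)]
    for host, parasite in links:
        matrix[host][parasite] = 1
    return matrix


# Hypothetical associations, invented for this example.
links = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6), (5, 7)]
A = association_matrix(6, 8, links)

for row in A:
    print("".join(str(cell) for cell in row))
```

A statistical test of co-evolution would take this matrix together with the host and parasite phylogenies as input.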
  109. What can HPC do for Bioinformatics? Axelerated Parafit. “Parafit: statistical test of co-evolution”, Pierre Legendre, Syst. Biol. 2003. AxParafit (Axelerated Parafit): statistical test of hypotheses of host-parasite co-evolution; C porting, optimization, BLAS integration; speedup up to factor 67; Master-Worker MPI-parallelization. Largest co-phylogenetic study to date conducted within 8 minutes instead of 4 weeks. Open-Source Code at http://icwww.epfl.ch/~stamatak/AxParafit.html (SwissGrid-based Web-Server planned).
  110. AxParafit: Sequential Performance
  111. AxParafit: Parallel Performance
  112. The ML Benchmark: A Current Community Project. A standardized way to test ML search programs is required. Web-Server with real-world alignments and performance data at the Swiss Institute of Bioinformatics. Many developers of popular ML programs involved: Stephane Guindon (PHYML), Montpellier; Simon Whelan (LeaPhy), Manchester; Bui Quang Minh (IQPNNI), Vienna; Derrick Zwickl (GARLI), Virginia; Thomas Keane (dprML), Cambridge. Byproduct: SPEC-like CPU benchmark for phylogenetics. Follow-up: (planned) ML competition at a major conference with an industrial sponsor.
  113. A Current Problem: Handling Multi-Gene Alignments (figure labels: Gene 1, Gene 2, Sequence 1, Sequence 5). Missing Data ≠ Gap Data
  114. A Multi-Gene Model
  115. A Multi-Gene Model
  116. A Multi-Gene Model
  117. A Multi-Gene Model: LogLH(T) = LogLH(T|Red)
  118. A Multi-Gene Model: LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow)
  119. A Multi-Gene Model: LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow). Challenge: devise efficient data structures for this
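The sum on this slide can be sketched in Python, assuming the genes (partitions) evolve independently so that their log likelihoods add. The per-site likelihood values are invented for illustration. In a real program they would come from evaluating the tree under each gene's model, and the function names here are hypothetical.

```python
import math


def gene_loglh(site_likelihoods):
    """Log likelihood of one gene: sum of the logs of its per-site likelihoods."""
    return sum(math.log(p) for p in site_likelihoods)


def total_loglh(genes):
    """LogLH(T) as the sum of the per-gene log likelihoods."""
    return sum(gene_loglh(sites) for sites in genes.values())


# Invented per-site likelihoods for the two genes named on the slide.
genes = {
    "Red": [0.02, 0.10, 0.05],
    "Yellow": [0.30, 0.01],
}
total = total_loglh(genes)
print(f"LogLH(T) = {total:.4f}")
```

Keeping one list per gene makes it easy to skip taxa that have no data for a gene, which is one way to treat missing data differently from gaps.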
  120. Why are Individual Branches per Gene a Challenge?
  121. Why are Individual Branches per Gene a Challenge?
  122. Outlook
  123. Outlook: Tree of Life; What is a good alignment in a phylogenetic context?; Simultaneous alignment and tree building; More HPC & memory-aware programming; Multi-core architectures; Models for “gappy” multi-gene alignments
  124. Acknowledgements. BlueGene Project: Michael Ott, TUM; Srinivas Aluru, Jaroslaw Zola, Iowa State; Dan Janies, Andrew Johnson, Ohio State. IBM CELL & Playstation: Filip Blagojevic, Dimitris Nikolopoulos, Virginia Tech; Christos Antonopoulos, Univ. of Thessaly. Bootstopping: Bernard Moret, Masoud Alipour, EPFL; Olaf Bininda-Emonds, Univ. Jena. RAxML Web-Server: Jacques Rougemont, SIB; Terri Liebowitz, SDSC. AxParafit/AxPcoords: Markus Goeker, Alexander Auch, Jan Meier-Kolthoff, University of Tuebingen. Datasets for Studies: Jun Inoue (Florida), Nicolas Salamin (Lausanne), Marc Gottschling (Berlin), Guido Grimm (Tuebingen), Nikos Poulakakis (Yale), Usman Roshan (NJIT)
  125. Thank you for your Attention! Lake Geneva, Switzerland
