Crunching Huge Phylogenies:
A Rapid Bootstrap Algorithm and
Massive Parallelism on the IBM BlueGene
                  Alexandros Stamatakis

       Swiss Federal Institute of Technology Lausanne (EPFL)
          School of Computer & Communication Sciences
      Laboratory for Computational Biology and Bioinformatics
                       Lausanne, Switzerland
                                  &
                  Swiss Institute of Bioinformatics

                  Alexandros.Stamatakis@epfl.ch
                     icwww.epfl.ch/~stamatak
The Missing Part



Data Assembly                            Inference ?   Tree Analysis




   Alexandros Stamatakis, October 2007
The Missing Part



Data Assembly                            Tree Analysis




IBM BlueGene/L supercomputer

Rapid Bootstrapping
Bootstopping Criterion




The Big Hardware Problem
[Figure: growth from 1980 to 2007. CPU speed: 40% p.a. Memory speed: 9% p.a.]

... and why this concerns Bioinformatics
[Figure: growth from 1980 to 2007. CPU speed: 40% p.a. Memory speed: 9% p.a. A third curve shows sequence data.]
Application of HPC techniques will become much more important

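The two growth rates compound quickly. A minimal sketch of the arithmetic, assuming constant annual rates over the whole 1980 to 2007 span (the slide gives only the two percentages; the function name is illustrative):

```python
def growth_factor(rate_per_year, years):
    """Total growth after compounding a fixed annual rate."""
    return (1.0 + rate_per_year) ** years


years = 2007 - 1980                    # span shown on the slide
cpu = growth_factor(0.40, years)       # CPU speed: 40% p.a.
memory = growth_factor(0.09, years)    # memory speed: 9% p.a.
gap = cpu / memory                     # how far memory has fallen behind

print(f"CPU: x{cpu:,.0f}  memory: x{memory:.1f}  gap: x{gap:,.0f}")
```

Under these assumptions CPU speed grows by a factor in the thousands while memory speed grows by roughly ten, which is why memory access patterns matter for likelihood code.
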
Cache Hierarchy




Outline
     ● Introduction
         ● Computation of Phylogenies
         ● Maximum Likelihood
         ● Web & Grid Services
     ● Three Steps Towards the Tree of Life
         ● Parallelism on IBM BlueGene/L
         ● Rapid Bootstrapping
         ● A Bootstopping criterion
     ● Related Projects
     ● Outlook

Phylogenetics
    • Input: “good” multiple Alignment
    • Output: unrooted binary tree
    • Various methods for phylogenetic inference
        • Neighbour Joining (fast & simple)
        • Maximum Parsimony (relatively fast & simple)
        • Maximum Likelihood (complex & slow)
        • Bayesian Methods (complex & slower)

Phylogenetics
    ML & Bayesian: explicit model choice

Phylogenetics
    Complex Methods & Models required to reconstruct large & complicated trees!
    Focus of this talk is on Maximum Likelihood!

Phylogenetics
    The real reason for working on ML: ......

Challenges for Phyloinformatics
    • Holy grail: “Tree of Life”
    • What is a good alignment in a phylogenetic context?
    • Simultaneous alignment and tree building
    • Improve/extend models ... but thereby size of computable trees decreases!
    • More HPC awareness
    • Exploit multi-core architectures
    • Amount of available data grows at a higher rate than algorithms are getting faster

The algorithmic problem




The number of trees explodes!
    BANG!

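The explosion can be made concrete with the standard count of unrooted binary trees on n labelled taxa, (2n-5)!! = 1 · 3 · 5 · ... · (2n-5). The slides show the numbers only as figures, so the function below is an illustrative sketch rather than something taken from the talk:

```python
def num_unrooted_binary_trees(n):
    """Number of distinct unrooted binary trees with n labelled tips.

    Uses the double factorial (2n-5)!!; n = 4 gives 3 and n = 5 gives 15.
    """
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


for n in (4, 5, 10, 20, 50):
    print(n, num_unrooted_binary_trees(n))
```

Python integers are arbitrary precision, so the count stays exact even when it has dozens of digits.
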
Maximum Likelihood
    • Alignment: sequences Seq1, Seq2, Seq3, Seq4 of length m
    • Substitution model: 4 x 4 matrix over A, C, G, T
    • Prior probabilities (empirical base frequencies): πA πC πG πT
    • Unrooted tree on Seq 1, Seq 2, Seq 3, Seq 4 with branch lengths b1, b2, b3, b4, b5
    • Virtual root: vr (drawn on the inner branch b5)
    • Vectors P(A) P(C) P(G) P(T) of length m at the inner nodes
    • Lots of floating point operations!
    • Optimize branch lengths
    • Optimize model parameters

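The per-site vectors P(A) P(C) P(G) P(T) are commonly filled in with Felsenstein's pruning recursion. A sketch in standard notation follows; the symbols L, p and the node names k, u, v are assumptions introduced here, since the slides show only the picture.

```latex
% Conditional likelihood at inner node k with children u and v,
% for alignment site i and state a in {A, C, G, T};
% p_{ax}(b) is the substitution model's probability of a -> x along a branch of length b.
L_i^{(k)}(a) = \Big( \sum_{x} p_{ax}(b_u)\, L_i^{(u)}(x) \Big)
               \Big( \sum_{y} p_{ay}(b_v)\, L_i^{(v)}(y) \Big)

% Log-likelihood of the tree, evaluated at the virtual root vr
% with the base frequencies \pi_a as priors:
\log L = \sum_{i=1}^{m} \log \sum_{a} \pi_a\, L_i^{(vr)}(a)
```

At a tip, the vector has a 1 for the observed base and 0 elsewhere. Every site needs a handful of multiplications per inner node, which is where the floating point cost comes from.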
Maximum Likelihood
Goal: Obtain topology with maximum likelihood value
Problem I: Number of possible topologies is exponential in n
Problem II: Computation of likelihood function is expensive
Problem III: Probably high score accuracy required
Problem IV: High memory consumption
Solution:
    • New Algorithms
    • New Models
    • High Performance Computing


Maximum Likelihood
    RAxML: Randomized Axelerated Maximum Likelihood

Web & Grid Services
    • RAxML Web-Server at San Diego Supercomputer Center via www.phylo.org (CIPRES project)
    • Web-Server at Vital-IT unit of Swiss Institute of Bioinformatics: phylobench.vital-it.ch/raxml-bb/
        • Includes novel search algorithm with one order of magnitude run-time improvement
        • Since Sept 3, about 700 jobs from 130 IPs
        • Extension to SwissGrid planned
        • Novel algorithm with Bootstopping to be integrated into CIPRES portal soon
    • RAxML integration into Distributed European Infrastructure for Supercomputing Applications (www.deisa.org) started 10 days ago
    • Integration into Debian medical distribution

RAxML Black Box
    Why are Black Boxes useful?

Levels of Parallelism
    Embarrassing Parallelism: MPI, CORBA, Grid Technologies

Coarse-Grained Parallelism: MPI Version of RAxML
[Figure: PC cluster; boxes labelled B-0 to B-4 joined by an interconnection network, with one master process and several worker processes]

Levels of Parallelism
    Embarrassing Parallelism: MPI, CORBA, Grid Technologies
        Inference Parallelism: MPI, algorithm-dependent
            Loop-Level Parallelism: OpenMP, GPUs, IBM CELL (Playstation), IBM BlueGene, Clusters with fast Interconnect

Loop Level Parallelism
[Figure: node P below the virtual root, with child nodes Q and R]
    P[i] = f(Q[i], R[i])
    This operation uses ≥ 90% of total execution time!
    → simple fine-grained parallelization
    The real reason for assuming independent evolution among sites: ......

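Because P[i] depends only on Q[i] and R[i], the loop over sites can be cut into independent chunks. The sketch below illustrates that decomposition under stated assumptions: it uses the Jukes-Cantor model for the transition probabilities (the slides do not name a model at this point), and all function names are hypothetical. Python threads are used only to show the chunking; they do not reproduce the speedups of the OpenMP or BlueGene versions.

```python
import math
from concurrent.futures import ThreadPoolExecutor

STATES = "ACGT"


def jc69_matrix(t):
    """Jukes-Cantor transition probabilities for branch length t (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if a == b else diff for b in range(4)] for a in range(4)]


def f(q_site, r_site, pq, pr):
    """Combine the child vectors of one site into the parent vector."""
    return [
        sum(pq[a][b] * q_site[b] for b in range(4))
        * sum(pr[a][c] * r_site[c] for c in range(4))
        for a in range(4)
    ]


def compute_p(Q, R, t_q, t_r, lo=0, hi=None):
    """P[i] = f(Q[i], R[i]) for sites lo..hi-1."""
    hi = len(Q) if hi is None else hi
    pq, pr = jc69_matrix(t_q), jc69_matrix(t_r)
    return [f(Q[i], R[i], pq, pr) for i in range(lo, hi)]


def compute_p_parallel(Q, R, t_q, t_r, workers=4):
    """Same result, with the site range split into one chunk per worker."""
    m = len(Q)
    bounds = [(k * m // workers, (k + 1) * m // workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda b: compute_p(Q, R, t_q, t_r, *b), bounds)
    return [vec for chunk in chunks for vec in chunk]


def tip_vector(base):
    """Tip vector: 1 for the observed base, 0 for the others."""
    return [1.0 if s == base else 0.0 for s in STATES]


Q = [tip_vector(b) for b in "ACGTACGTAC"]
R = [tip_vector(b) for b in "ACGAACTTAC"]
serial = compute_p(Q, R, 0.1, 0.2)
parallel = compute_p_parallel(Q, R, 0.1, 0.2, workers=3)
print(serial == parallel)
```

No site reads another site's data, so the chunks need no synchronisation until the per-site results are combined.
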
Fine-Grained Parallelism: OpenMP version of RAxML

HPC for ML (Bayesian)
    • Proof of Concept & Programming Techniques:
        • RAxML on a Graphics Processing Unit
        • RAxML on the IBM CELL & Playstation
    • Production Level Implementations:
        • RAxML with OpenMP
        • RAxML with MPI
        • RAxML on BlueGene
        • Multi-Core Architectures

HPC for ML (Bayesian)
    RAxML on the IBM CELL & Playstation
        A good excuse to buy one

RAxML-BlueGene
    • Many slow processors: 1024 in one rack
    • 512 MB or 1 GB of main memory per node
    • But: high performance network
    • Challenges:
        • Distribute tree data structure among CPUs
        • Exploit fast collective communication network
    • For optimal efficiency: loop-level + embarrassing parallelism → hybrid parallelism with MPI
    • Test & Production Run Data
        • With Olaf Bininda-Emonds, Jena: 2,182 mammalian sequences x 51,000 base pairs
        • With Dan Janies, Ohio State: 270 Human Haplotype Map sequences x 500,000 base pairs

RAxML-BlueGene
    To be presented at the IEEE/ACM 2007 Supercomputing Conference.

RAxML-BlueGene
    Largest ML analysis to date in terms of memory footprint

Loop-Level Parallelism on BlueGene

50 Seqs x 23,385 bp
    Superlinear Speedup

250 Seqs x 403,581 bp

Embarrassing Parallelism
[Figure: 4 masters (M) and 12 workers (W) arranged in groups]

Confidence Values
    • Tree without node confidence values is mostly useless
    • Problem:
        • Confidence value calculation is major computational obstacle
        • We can compute large trees but not analyse them: compute ≠ analyse!
    • Current Slow Methods
        • Sampling with Bayesian methods
        • Non-parametric Bootstrapping

A Tree with Confidence Values




Joint work with Marc Gottschling, Charité Hospital, Berlin
Bootstrapping
[Figure: Original Alignment, then perturbation into several alignments, then "compute tree" for each]
    This needs to be done 100-1000 times
    Embarrassingly Parallel!

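The perturbation step in non-parametric bootstrapping draws alignment columns with replacement, which is equivalent to giving each column an integer weight. A minimal sketch, assuming a toy 4 x 14 alignment and hypothetical function names (none of this is RAxML code):

```python
import random


def bootstrap_weights(num_sites, rng):
    """Draw num_sites columns with replacement; return how often each was drawn."""
    weights = [0] * num_sites
    for _ in range(num_sites):
        weights[rng.randrange(num_sites)] += 1
    return weights


def resample_alignment(alignment, weights):
    """Build the perturbed alignment by repeating column i weights[i] times."""
    return {
        name: "".join(ch * w for ch, w in zip(seq, weights))
        for name, seq in alignment.items()
    }


# Toy data, made up for illustration only.
alignment = {
    "Seq1": "ACGTACGTACGTAC",
    "Seq2": "ACGTACGAACGTAC",
    "Seq3": "ACGAACGTACGTTC",
    "Seq4": "TCGTACGTACCTAC",
}

rng = random.Random(2007)
replicates = [bootstrap_weights(14, rng) for _ in range(3)]
for w in replicates:
    print("".join(str(x) for x in w), resample_alignment(alignment, w)["Seq1"])
```

Each replicate is independent of the others, so the tree computations can be handed to separate workers without any communication.
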
Two Questions
    • How to compute Bootstraps faster?
    • How many Bootstrap replicates do we need?

Current Work: Rapid Bootstrapping Algorithm
    • Tested on 22 diverse (mammals, bacteria, archaea, grasses, fishes, plants, viral) real-world DNA/AA single-/multi-gene datasets containing 125-7,764 sequences
    • Pearson correlation on best-scoring ML trees between RBS (Rapid BS) & SBS (Standard BS) support values: 0.95-0.99 (except one dataset at 0.91), average 0.97
    • Weighted topological distance < 6%, average 4%
    • Program Acceleration: 8-20, average ≈ 15
        • Acceleration by one order of magnitude
        • Full ML analysis (100 BS + ML search) of datasets of up to 5,000 sequences within less than 5 days on your desktop!
        • Allows for a sufficiently large number of Bootstrap replicates

Quick & Dirty Bootstrap
[Figure: cycle between "Modify Algorithm" and "Computational Experiments"; iterate]

Rapid Bootstrap
    Original alignment (site weights): 11111111111111
    Bootstrap replicates (site weights):
        01102211111111
        10111102220111
        11111110112021
    1. Compute Starting Tree
    2. Optimize Model Params & Branch Lengths
    3. Use Starting Tree & Model Params to compute RELL scores
        01102211111111    -110
        10111102220111    -105
        11111110112021    -100
    4. Sort by RELL
        11111110112021    -100    T0: Thorough Search
        10111102220111    -105    T1: Quick Search on T0
        01102211111111    -110    T2: Quick Search on T1
    sequential dependency is bad for parallelism

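One reading of the steps on these slides: a RELL score reuses the per-site log-likelihoods of the starting tree and only re-weights them, so scoring a replicate costs one weighted sum. The sketch below assumes that reading; the per-site values are made up, and the function name is hypothetical.

```python
def rell_score(weights, site_lnl):
    """Weighted sum of per-site log-likelihoods for one bootstrap replicate."""
    return sum(w * lnl for w, lnl in zip(weights, site_lnl))


# Replicate weight vectors as shown on the slides.
replicates = ["01102211111111", "10111102220111", "11111110112021"]

# Hypothetical per-site log-likelihoods of the starting tree (14 sites).
site_lnl = [-7.1, -6.8, -8.0, -7.5, -6.9, -7.7, -7.2,
            -8.3, -6.5, -7.9, -7.0, -7.4, -6.7, -8.1]

# Score every replicate, then sort with the best (highest) score first.
scored = sorted(
    ((rell_score([int(c) for c in rep], site_lnl), rep) for rep in replicates),
    reverse=True,
)

# Chain the searches: T0 is searched thoroughly, each later tree starts
# from the result of the previous one.
plan = []
for k, (score, rep) in enumerate(scored):
    how = "thorough search" if k == 0 else f"quick search on T{k - 1}"
    plan.append((f"T{k}", rep, score, how))

for step in plan:
    print(step)
```

The chain is what makes the procedure fast and also what makes it sequential: T1 cannot start before T0 is finished.
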
Scalability of Rapid Bootstrap
    Some datasets are harder than others

ML-Scores: Garli, RAxML, PHYML, 715 Sequences

Correlation 125 Taxa: 0.91

Support Value Distribution

Bootstrap Likelihood Values 125 x 19,436
    10,000 replicates
    only 195 non-trivial bipartitions

3,491 rbcL sequences: Rapid versus Standard BS
    Correlation: 0.98

7,764 DNA Best Tree

7,764 DNA All Bipartitions

775 x 3,838 AA

New Opportunities
    • Assess Impact of Alignment Method on tree and support values
    • Test Bootstrap of the Bootstrap (double Bootstrap) procedures
    • Devise and empirically verify Bootstopping criteria

Bootstrap of the Bootstrap: 140 AA (Efron et al., PNAS 1996)

Bootstrap of the Bootstrap: 3,491 rbcL

Bootstopping
    • Rapid Bootstrapping makes it possible to assess Bootstopping criteria as follows:
        1. Compute a high number of BS replicates (10,000)
        2. Devise topology-based bootstopping criterion and apply it to these 10,000 replicates
        3. Compare support values induced by bootstopped trees (say 300 replicates) with 10,000 replicates
    • We have 10,000 replicates for 18 datasets containing 125 to 2,554 sequences

Bootstopping Criterion
    • Every 50, 100, 150, ... replicates do a test:
        • Say we have N BS trees
        • Do the following 100 times:
            • Randomly split up this set of N trees into 2 equal sets S1, S2 of size N/2
            • Compute the bipartition support vectors for S1 and S2
            • Compute Pearson correlation of the support vectors
        • Return average of the 100 Pearson correlations
    • If average > 0.99, stop

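The test above can be sketched in a few lines. This is an illustration under assumptions, not the RAxML implementation: each tree is represented as a frozenset of bipartition labels, the function names are hypothetical, and the handling of zero-variance support vectors is a convention chosen here.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; constant vectors count as 1.0 if equal, else 0.0."""
    mx, my = mean(x), mean(y)
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 1.0 if x == y else 0.0
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (vx * vy) ** 0.5


def support_vector(trees, bipartitions):
    """Fraction of trees that contain each bipartition."""
    return [sum(b in t for t in trees) / len(trees) for b in bipartitions]


def bootstop_statistic(trees, rng, permutations=100):
    """Average correlation between support vectors of random half-splits."""
    bipartitions = list(set().union(*trees))
    correlations = []
    for _ in range(permutations):
        shuffled = trees[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        s1, s2 = shuffled[:half], shuffled[half:2 * half]
        correlations.append(
            pearson(support_vector(s1, bipartitions),
                    support_vector(s2, bipartitions))
        )
    return mean(correlations)


def should_stop(trees, rng, threshold=0.99):
    """True once the average correlation exceeds the threshold."""
    return bootstop_statistic(trees, rng) > threshold


# Toy example: 100 identical replicate trees agree perfectly.
identical = [frozenset({"AB|CDE", "ABC|DE"})] * 100
print(should_stop(identical, random.Random(1)))
```

In a real run the check would be called each time the number of collected replicates reaches the next multiple of 50.
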
Result Overview
    • Bootstopped between 100-400 (avg 213)
    • Correlation on best tree: Bootstopped versus 10,000 replicates > 0.99 (avg 0.995)
    • Correlation of all bipartitions > 0.995 (avg 0.997)

Bootstopping Best 140 AA

Bootstopping Best 404 DNA (Multi-Gene)

Bootstopping Best 994 DNA

Bootstopping All 994 DNA

Bootstopping Best 1,908 DNA

Bootstopping Best 2,554 DNA

Putting the Pieces together
      BlueGene: Can handle huge datasets
            Use CAT approximation on BlueGene
                 Further speedup of factor 3.5
                 Memory footprint reduction factor 4
             




8,864 Bacteria under GTR+Γ and GTR+CAT
[Plot: log likelihood score under Γ versus execution time; time axis marked at 7 days and 14 days]

Putting the Pieces together
      BlueGene: Can handle huge datasets
            Use CAT approximation on BlueGene
                 Further speedup of factor 3.5
                 Memory footprint reduction factor 4
            Integrate rapid Bootstrap into BlueGene version
                 Additional speedup ≈ 15
            Mechanisms available to accelerate
            BlueGene version by factor 50-60
            Integrate Bootstopping into BlueGene
   Conclusion: We will soon be able to
    compute a small tree of life with 10,000
    organisms and data from multiple genes!

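As a rough consistency check on the figures above: if the CAT speedup and the rapid-bootstrap speedup are assumed to be independent and to multiply (an assumption made here, not something the slide states), the two quoted factors combine to a value inside the 50-60 range.

```latex
\[
  \underbrace{3.5}_{\text{CAT approximation}}
  \times
  \underbrace{15}_{\text{rapid Bootstrap}}
  \approx 52.5
\]
```
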
Outline
     ● Introduction
         ● Computation of Phylogenies
         ● Maximum Likelihood
         ● Web & Grid Services
     ● Three Steps Towards the Tree of Life
         ● Parallelism on IBM BlueGene/L
         ● Rapid Bootstrapping
         ● A Bootstopping criterion
     ● Related Projects
     ● Outlook


Host-Parasite Co-Evolution
Hosts (e.g. Mammals)                     Parasites (e.g. Lice)

Host-Parasite Co-Evolution
Hosts                                    Parasites
          Co-Evolution Hypothesis
          Adjacency Matrix 0/1 (6 hosts x 8 parasites)
          Statistical Test

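The 0/1 adjacency matrix can be built from a list of host-parasite links, as in the short Python sketch below. The host and parasite labels and the links are hypothetical placeholders, not data from the talk. Only the shape (6 hosts x 8 parasites) follows the slide.

```python
def association_matrix(hosts, parasites, links):
    """Return a len(hosts) x len(parasites) 0/1 matrix.

    Entry [i][j] is 1 if parasite j is recorded on host i, else 0.
    """
    host_index = {h: i for i, h in enumerate(hosts)}
    parasite_index = {p: j for j, p in enumerate(parasites)}
    matrix = [[0] * len(parasites) for _ in hosts]
    for host, parasite in links:
        matrix[host_index[host]][parasite_index[parasite]] = 1
    return matrix


# Hypothetical labels and links, for illustration only.
hosts = ["H%d" % i for i in range(1, 7)]        # 6 hosts
parasites = ["P%d" % j for j in range(1, 9)]    # 8 parasites
links = [("H1", "P1"), ("H1", "P2"), ("H2", "P3"), ("H3", "P4"),
         ("H4", "P5"), ("H5", "P6"), ("H6", "P7"), ("H6", "P8")]
matrix = association_matrix(hosts, parasites, links)
```

A statistical test of co-evolution would take this matrix together with the host and parasite trees as input.
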
What can HPC do for Bioinformatics?
                 Axelerated Parafit
      “Parafit: statistical test of co-evolution”, Pierre
      Legendre, Syst. Biol. 2003
      AxParafit (Axelerated Parafit)
        Statistical test of hypotheses of host-parasite
         co-evolution
        C porting, optimization, BLAS integration
        Speedup up to factor 67
        Master-Worker MPI-parallelization
      Largest co-phylogenetic study to date conducted
      within 8 minutes instead of 4 weeks
      Open-Source Code:
      http://icwww.epfl.ch/~stamatak/AxParafit.html
      SwissGrid-based Web-Server planned
  




AxParafit: Sequential Performance

AxParafit: Parallel Performance




The ML Benchmark:
      A Current Community Project
      Standardized way required to test ML search programs
      Web-Server with real-world alignments and performance data
      at Swiss Institute of Bioinformatics
      Many developers of popular ML programs involved
            Stephane Guindon (PHYML) Montpellier
            Simon Whelan (LeaPhy) Manchester
            Bui Quang Minh (IQPNNI) Vienna
            Derrick Zwickl (GARLI) Virginia
            Thomas Keane (dprML) Cambridge
      Byproduct: SPEC-like CPU benchmark for phylogenetics
      Follow-up: (planned) ML competition at major conference with
      industrial sponsor



A Current Problem:
            Handling Multi-Gene Alignments
[Figure: alignment with rows Sequence 1 to Sequence 5 and columns Gene 1, Gene 2]
                 Missing Data ≠ Gap Data

A Multi-Gene Model
                                    Challenge: devise efficient data
                                    structures for this

LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow)
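
The sum above can be sketched in Python as follows. This is an illustration only. The alignment layout (taxon -> gene -> sequence, with missing genes simply absent) and the caller-supplied `gene_log_likelihood` function are hypothetical. Restricting each gene to the taxa that actually have data is one reading of "Missing Data ≠ Gap Data", not a description of the author's data structures.

```python
def taxa_with_data(alignment, gene):
    """Taxa that have sequence data for this gene (missing genes are absent keys)."""
    return {taxon for taxon, genes in alignment.items() if gene in genes}


def total_log_likelihood(tree, alignment, gene_log_likelihood):
    """Sum the per-gene log likelihoods of one tree: LogLH(T) = sum over genes of LogLH(T|gene).

    gene_log_likelihood(tree, gene, taxa) is a hypothetical callable that scores
    the tree for one gene using only the taxa that have data for that gene.
    """
    genes = sorted({g for per_taxon in alignment.values() for g in per_taxon})
    return sum(gene_log_likelihood(tree, gene, taxa_with_data(alignment, gene))
               for gene in genes)
```

Because the per-gene terms are independent given the tree, each gene can carry its own model parameters, which is what makes efficient per-gene data structures the main difficulty.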




Why are Individual Branches per Gene a Challenge?




Outlook
            Tree of Life
            What is a good alignment in a
            phylogenetic context?
            Simultaneous alignment and tree building
            More HPC & memory-aware programming
            Multi-core architectures
            Models for “gappy” multi-gene alignments
        




Acknowledgements
      BlueGene Project
            Michael Ott, TUM
            Srinivas Aluru, Jaroslaw Zola, Iowa State
            Dan Janies, Andrew Johnson, Ohio State
      IBM CELL & Playstation
            Filip Blagojevic, Dimitris Nikolopoulos, Virginia Tech
            Christos Antonopoulos, Univ. of Thessaly
      Bootstopping
            Bernard Moret, Masoud Alipour, EPFL
            Olaf Bininda-Emonds, Univ. Jena
      RAxML Web-Server
            Jacques Rougemont, SIB
            Terri Liebowitz, SDSC
      AxParafit/AxPcoords
            Markus Goeker, Alexander Auch, Jan Meier-Kolthoff, University of Tuebingen
      Datasets for Studies
            Jun Inoue (Florida), Nicolas Salamin (Lausanne), Marc Gottschling (Berlin), Guido Grimm
            (Tuebingen), Nikos Poulakakis (Yale), Usman Roshan (NJIT)


Thank you for your Attention!




Lake Geneva, Switzerland
Crunching Huge Phylogenies. A. Stamatakis

  • 1. Crunching Huge Phylogenies: A Rapid Bootstrap Algorithm and Massive Parallelism on the IBM BlueGene Alexandros Stamatakis Swiss Federal Institute of Technology Lausanne (EPFL) School of Computer & Communication Sciences Laboratory for Computational Biology and Bioinformatics Lausanne, Switzerland & Swiss Institute of Bioinformatics Alexandros.Stamatakis@epfl.ch icwww.epfl.ch/~stamatak
  • 2. The Missing Part Data Assembly Inference ? Tree Analysis Alexandros Stamatakis, October 2007
  • 3. The Missing Part Data Assembly Tree Analysis Alexandros Stamatakis, October 2007
  • 4. IBM BlueGene/L supercomputer Alexandros Stamatakis, October 2007
  • 5. Rapid Bootstrapping Bootstopping Criterion Alexandros Stamatakis, October 2007
  • 6. The Big Hardware Problem CPU Speed 40% p.a. Memory Speed 9% p.a. 2007 1980 Alexandros Stamatakis, October 2007
  • 7. ... and why this concerns Bioinformatics Sequence CPU Speed 40% p.a. Data Memory Speed 9% p.a. 2007 1980 Alexandros Stamatakis, October 2007
  • 8. ... and why this concerns Bioinformatics Application of HPC techniques will become Sequence much moreSpeed 40% p.a. CPU important Data Memory Speed 9% p.a. 2007 1980 Alexandros Stamatakis, October 2007
  • 10. Outline Introduction ● Computation of Phylogenies ● Maximum Likelihood ● Web & Grid Services ● Three Steps Towards the Tree of Life ● Parallelism on IBM BlueGene/L ● Rapid Bootstrapping ● A Bootstopping criterion ● Related Projects ● Outlook ● Alexandros Stamatakis, October 2007
  • 11. Phylogenetics Input: “good” multiple Alignment  Output: unrooted binary tree  Various methods for phylogenetic  inference Neighbour Joining (fast & simple)  Maximum Parsimony (relatively fast &  simple) Maximum Likelihood (complex & slow)  Bayesian Methods (complex & slower)  Alexandros Stamatakis, October 2007
  • 12. Phylogenetics Input: “good” multiple Alignment  Output: unrooted binary tree  ML & Bayesian: explicit Various methods choice model for phylogenetic  inference Neighbour Joining (fast & simple)  Maximum Parsimony (relatively fast &  simple) Maximum Likelihood (complex & slow)  Bayesian Methods (complex & slower)  Alexandros Stamatakis, October 2007
  • 13. Phylogenetics Complex Methods & Input: “good” multiple Alignment  Models required to Output: unrooted binary tree  reconstruct large & Various methods for phylogenetic complicated trees !  inference NeighbourFocus of(fast talk is on Joining this & simple)  Maximum Likelihood! Maximum Parsimony (relatively fast &  simple) Maximum Likelihood (complex & slow)  Bayesian Methods (complex & slower)  Alexandros Stamatakis, October 2007
  • 14. Phylogenetics Input: “good” multiple Alignment  Output: unrooted binary tree  Various methods for phylogenetic  inference NeighbourThe real (fast & simple) Joining reason for  Maximum working on (relatively fast & Parsimony ML: ......  simple) Maximum Likelihood (complex & slow)  Bayesian Methods (complex & slower)  Alexandros Stamatakis, October 2007
  • 15. Challenges for Phyloinformatics Holy grail: “Tree of Life”  What is a good alignment in a  phylogenetic context? Simultaneous alignment and tree building  Improve/extend models ... but thereby size  of computable trees decreases! More HPC awareness  Exploit multi-core architectures  Amount of available data grows at a  higher rate than algorithms are getting faster Alexandros Stamatakis, October 2007
  • 16. The algorithmic problem Alexandros Stamatakis, October 2007
  • 17. The number of trees Alexandros Stamatakis, October 2007
  • 18. The number of trees Alexandros Stamatakis, October 2007
  • 19. The number of trees Alexandros Stamatakis, October 2007
  • 20. The number of trees explodes! BANG ! Alexandros Stamatakis, October 2007
  • 21. Outline Introduction ● Computation of Phylogenies ● Maximum Likelihood ● Web & Grid Services ● Three Steps Towards the Tree of Life ● Parallelism on IBM BlueGene/L ● Rapid Bootstrapping ● A Bootstopping criterion ● Related Projects ● Outlook ● Alexandros Stamatakis, October 2007
  • 22. Maximum Likelihood Length: m Seq1 Seq2 Alignment Seq3 Seq4 Alexandros Stamatakis, October 2007
  • 23. Maximum Likelihood Length: m ACGT Seq1 A Seq2 C Substitution Alignment model Seq3 G Seq4 T Alexandros Stamatakis, October 2007
  • 24. Maximum Likelihood Length: m ACGT Prior probabilities, Empirical base frequencies Seq1 A Seq2 C Substitution Alignment πA πC πG πT model Seq3 G Seq4 T Alexandros Stamatakis, October 2007
  • 25. Maximum Likelihood Length: m ACGT Prior probabilities, Empirical base frequencies Seq1 A Seq2 C Substitution Alignment πA πC πG πT model Seq3 G Seq4 T Seq 3 Seq 1 b3 b1 b5 b2 b4 Seq 2 Seq 4 Alexandros Stamatakis, October 2007
  • 26. Maximum Likelihood Length: m ACGT Prior probabilities, Empirical base frequencies Seq1 A Seq2 C Substitution Alignment πA πC πG πT model Seq3 G Seq4 T Seq 3 Seq 1 b3 b1 b5 b2 b4 Seq 2 Seq 4 virtual root: vr Alexandros Stamatakis, October 2007
  • 27. Maximum Likelihood Length: m ACGT Prior probabilities, Empirical base frequencies Seq1 A Seq2 C Substitution Alignment πA πC πG πT model Seq3 G Seq4 T Seq 3 Seq 1 b3 b1 vr b5 b2 b4 Seq 2 Seq 4 P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T) m Alexandros Stamatakis, October 2007
  • 28. Maximum Likelihood Length: m ACGT Prior probabilities, Empirical base frequencies Seq1 A Seq2 C Substitution Alignment πA πC πG πT model Seq3 G Seq4 T Lots of floating pointSeq 3 Seq 1 b3 b1 operations! vr b5 b2 b4 Seq 2 Seq 4 P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T) m Alexandros Stamatakis, October 2007
  • 29. Maximum Likelihood Length: m ACGT Prior probabilities, Empirical base frequencies Seq1 A Seq2 C Substitution Alignment πA πC πG πT model Seq3 G Seq4 T Seq 3 Seq 1 Seq 2 Seq 4 optimize branch lengths Alexandros Stamatakis, October 2007
  • 30. Maximum Likelihood Length: m ACGT Prior probabilities, Empirical base frequencies Seq1 A Seq2 C Substitution Alignment πA πC πG πT model Seq3 G Seq4 T optimize model parameters Seq 3 Seq 1 Seq 2 Seq 4 Alexandros Stamatakis, October 2007
  • 31. Maximum Likelihood Goal: Obtain topology with maximum likelihood value Problem I: Number of possible topologies is exponential in n Problem II: Computation of likelihood function is expensive Problem III: Probably high score accuracy required Problem IV: High memory consumption Solution: • New Algorithms • New Models • High Performance Computing Alexandros Stamatakis, October 2007
  • 32. Maximum Likelihood Goal: Obtain topology with maximum likelihood value Problem I: Number of possible topologies is exponential in n RAxML Problem II: Computation of likelihood function is expensive Randomized Axelerated Problem III: Probably high score accuracy required Maximum Likelihood Problem IV: High memory consumption Solution: • New Algorithms • New Models • High Performance Computing Alexandros Stamatakis, October 2007
  • 33. Web & Grid Services RAxML Web-Server at San Diego Supercomputing  Center via www.phylo.org (CIPRES project) Web-Server at Vital-IT unit of Swiss Institute of  Bioinformatics phylobench.vital-it.ch/raxml-bb/  Includes novel search algorithm with 1 order of magnitude run-time improvement  Since Sept 3, about 700 jobs from 130 Ips  Extension to SwissGrid planned  Novel algorithm with Bootstopping to be integrated into CIPRES portal soon RAxML integration into Distributed European  Infrastructure for Supercomputing Applications www.deisa.org started 10 days ago Integration into Debian medical distribution  Alexandros Stamatakis, October 2007
  • 34. RAxML Black Box Alexandros Stamatakis, October 2007
  • 35. RAxML Black Box Why are Black Boxes useful? Alexandros Stamatakis, October 2007
  • 36. Outline Introduction ● Computation of Phylogenies ● Maximum Likelihood ● Web & Grid Services ● Three Steps Towards the Tree of Life ● Parallelism on IBM BlueGene/L ● Rapid Bootstrapping ● A Bootstopping criterion ● Related Projects ● Outlook ● Alexandros Stamatakis, October 2007
  • 37. Levels of Parallelism Embarrassing Parallelism MPI, CORBA, Grid Technologies Alexandros Stamatakis, October 2007
  • 38. Coarse-Grained Parallelism: MPI Version of RAxML PC-CLUSTER Worker Processes B-2 B-3 B-1 B-4 Interconnection B-0 Network Master Process Alexandros Stamatakis, October 2007
  • 39. Levels of Parallelism Embarrassing Parallelism MPI, CORBA, Grid Technologies Inference Parallelism MPI, algorithm-dependent Alexandros Stamatakis, October 2007
  • 40. Levels of Parallelism Embarrassing Parallelism MPI, CORBA, Grid Technologies Inference Parallelism MPI, algorithm-dependent Loop-Level Parallelism OpenMP, GPUs, IBM CELL (Playstation), IBM BlueGene, Clusters with fast Interconnect Alexandros Stamatakis, October 2007
  • 41. Loop Level Parallelism virtual root P Q R P[i] = f(Q[i], R[i]) Alexandros Stamatakis, October 2007
  • 42. Loop Level Parallelism virtual root This operation uses ≥ 90% P of total execution time ! Q R P[i] = f(Q[i], R[i]) Alexandros Stamatakis, October 2007
  • 43. Loop Level Parallelism virtual root This operation uses ≥ 90% P of total execution time ! → simple fine-grained parallelization Q R P[i] = f(Q[i], R[i]) Alexandros Stamatakis, October 2007
  • 44. Loop Level Parallelism virtual root P Q R Alexandros Stamatakis, October 2007
  • 45. Loop Level Parallelism virtual root P Q R Alexandros Stamatakis, October 2007
  • 46. Loop Level Parallelism virtual root P Q R Alexandros Stamatakis, October 2007
  • 47. Loop Level Parallelism virtual root The real reason for assuming independent evolution among sites: P ...... Q R Alexandros Stamatakis, October 2007
  • 48. Fine-Grained Parallelism: OpenMP version of RAxML Alexandros Stamatakis, October 2007
  • 49. Fine-Grained Parallelism: OpenMP version of RAxML Alexandros Stamatakis, October 2007
  • 50. HPC for ML (Bayesian) Proof of Concept & Programming  Techniques:  RAxML on a Graphics Processing Unit  RAxML on the IBM CELL & Playstation Production Level Implementations:   RAxML with OpenMP  RaxML with MPI  RAxML on BlueGene  Multi-Core Architectures Alexandros Stamatakis, October 2007
  • 51. HPC for ML (Bayesian) Proof of Concept & Programming  Techniques:  RAxML on a Graphics Processing Unit  RAxML on the IBM CELL & Playstation Production Level Implementations:  A good excuse to buy one  RAxML with OpenMP  RaxML with MPI  RAxML on BlueGene  Multi-Core Architectures Alexandros Stamatakis, October 2007
  • 52. RAxML-BlueGene Many slow processors: 1024 in one rack; 512 MB or 1 GB of main memory per node. But: high-performance network. Challenges: distribute the tree data structure among CPUs; exploit the fast collective communication network. For optimal efficiency: loop-level + embarrassing parallelism → hybrid parallelism with MPI. Test & Production Run Data: with Olaf Bininda-Emonds, Jena: 2,182 mammalian sequences x 51,000 base pairs; with Dan Janies, Ohio State: 270 Human Haplotype Map sequences x 500,000 base pairs
  • 53. RAxML-BlueGene (same slide, with callout) To be presented at the IEEE/ACM 2007 Supercomputing Conference.
  • 54. RAxML-BlueGene (same slide, with callout) Largest ML analysis to date in terms of memory footprint
  • 55. Loop-Level Parallelism on BlueGene Alexandros Stamatakis, October 2007
  • 56. 50 Seqs x 23,385 bp Alexandros Stamatakis, October 2007
  • 57. 50 Seqs x 23,385 bp Superlinear Speedup Alexandros Stamatakis, October 2007
  • 58. 250 Seqs x 403,581 bp Alexandros Stamatakis, October 2007
  • 59. Embarrassing Parallelism (figure: grid of processes labelled M and W, presumably masters and workers)
  • 60. Outline: Introduction; Computation of Phylogenies; Maximum Likelihood; Web & Grid Services; Three Steps Towards the Tree of Life; Parallelism on IBM BlueGene/L; Rapid Bootstrapping; A Bootstopping Criterion; Related Projects; Outlook
  • 61. Confidence Values A tree without node confidence values is mostly useless. Problem: confidence value calculation is a major computational obstacle. We can compute large trees but not analyse them: compute ≠ analyse! Current slow methods: sampling with Bayesian methods; non-parametric bootstrapping
  • 62. A Tree with Confidence Values Joint work with Marc Gottschling, Charité Hospital, Berlin
  • 63. Bootstrapping (diagram: Original Alignment → perturbation → three perturbed alignments, each followed by "compute tree")
  • 64. Bootstrapping (same diagram, with callout) This needs to be done 100-1000 times. Embarrassingly parallel!
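The "perturbation" step in the diagram is the standard non-parametric bootstrap: alignment columns are drawn with replacement until the replicate has as many columns as the original. A minimal sketch, assuming the alignment is a list of equal-length strings; the function names are hypothetical and not taken from RAxML.

```python
import random


def bootstrap_weights(n_sites, rng):
    """Draw n_sites columns with replacement; return how often each column was drawn."""
    weights = [0] * n_sites
    for _ in range(n_sites):
        weights[rng.randrange(n_sites)] += 1
    return weights


def replicate_alignment(alignment, weights):
    """Expand per-column weights into an explicit resampled alignment."""
    return ["".join(seq[i] * w for i, w in enumerate(weights)) for seq in alignment]


rng = random.Random(2007)              # fixed seed only to make the sketch repeatable
weights = bootstrap_weights(14, rng)   # one weight per column; the weights sum to 14
```

Storing a replicate as a weight vector rather than as a copied alignment keeps memory use constant per replicate, and each replicate can be handed to a different process since the replicates do not depend on each other.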
  • 65. Two Questions How to compute Bootstraps faster? How many Bootstrap replicates do we need?
  • 66. Current Work: Rapid Bootstrapping Algorithm Tested on 22 diverse (mammals, bacteria, archaea, grasses, fishes, plants, viral) real-world DNA/AA single-/multi-gene datasets containing 125-7,764 sequences. Pearson correlation on best-scoring ML trees between RBS (Rapid BS) & SBS (Standard BS) support values: 0.95-0.99 (except one dataset at 0.91), average 0.97. Weighted topological distance < 6%, average 4%. Program acceleration: factor 8-20, average ≈ 15, i.e. acceleration by one order of magnitude. Full ML analysis (100 BS + ML search) of datasets of up to 5,000 sequences in less than 5 days on your desktop! Allows for a sufficiently large number of Bootstrap replicates
  • 67. Quick & Dirty Bootstrap (diagram: Modify Algorithm → Computational Experiments)
  • 68. Quick & Dirty Bootstrap (diagram: Modify Algorithm → Computational Experiments → iterate)
  • 70. Rapid Bootstrap Compute Starting Tree. Column weight vectors: 11111111111111 (all ones, i.e. the unperturbed alignment); replicates 01102211111111, 10111102220111, 11111110112021
  • 71. Rapid Bootstrap Optimize Model Params & Branch Lengths (same weight vectors as on slide 70)
  • 72. Rapid Bootstrap Use Starting Tree & Model Params to compute RELL scores: 01102211111111 → -110; 10111102220111 → -105; 11111110112021 → -100
  • 73. Rapid Bootstrap Use Starting Tree & Model Params to compute RELL scores: 01102211111111 → -110; 10111102220111 → -105; 11111110112021 → -100. Sort by RELL
  • 74. Rapid Bootstrap Replicates sorted by RELL score: 11111110112021 (-100), T0: Thorough Search; 10111102220111 (-105); 01102211111111 (-110)
  • 75. Rapid Bootstrap Replicates sorted by RELL score: 11111110112021 (-100), T0: Thorough Search; 10111102220111 (-105), T1: Quick Search on T0; 01102211111111 (-110)
  • 76. Rapid Bootstrap Replicates sorted by RELL score: 11111110112021 (-100), T0: Thorough Search; 10111102220111 (-105), T1: Quick Search on T0; 01102211111111 (-110), T2: Quick Search on T1
  • 77. Rapid Bootstrap Replicates sorted by RELL score: 11111110112021 (-100), T0: Thorough Search; 10111102220111 (-105), T1: Quick Search on T0; 01102211111111 (-110), T2: Quick Search on T1. The sequential dependency is bad for parallelism
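The ordering and chaining shown on slides 72 to 77 can be sketched as below. RELL (resampling estimated log-likelihoods) scores a replicate on a fixed tree as the weighted sum of per-site log likelihoods, with the bootstrap weights as multipliers. Everything else here is an assumption made for illustration: `site_lnl` is a hypothetical list of per-site log likelihoods on the starting tree, and `thorough_search` and `quick_search` are caller-supplied placeholders for the two search modes named on the slides, not RAxML functions.

```python
def rell_score(weights, site_lnl):
    """RELL score of one replicate: per-site log likelihoods weighted by column counts."""
    return sum(w * l for w, l in zip(weights, site_lnl))


def order_replicates(replicates, site_lnl):
    """Sort replicate weight vectors by RELL score, highest (best) first."""
    return sorted(replicates, key=lambda w: rell_score(w, site_lnl), reverse=True)


def rapid_bootstrap(replicates, site_lnl, start_tree, thorough_search, quick_search):
    """Chain the searches as read from the slides: T0 thorough, then T(k) quick on T(k-1)."""
    trees = []
    current = start_tree
    for k, weights in enumerate(order_replicates(replicates, site_lnl)):
        search = thorough_search if k == 0 else quick_search
        current = search(current, weights)  # each search starts from the previous tree
        trees.append(current)
    return trees
```

The loop makes the slide's last remark concrete: replicate k cannot start before replicate k-1 has finished, so a single chain cannot be spread over several processes.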
  • 78. Scalability of Rapid Bootstrap Alexandros Stamatakis, October 2007
  • 79. Scalability of Rapid Bootstrap Some datasets are harder than others Alexandros Stamatakis, October 2007
  • 80. Scalability of Rapid Bootstrap Alexandros Stamatakis, October 2007
  • 81. ML-Scores: Garli, RAxML, PHYML 715 Sequences Alexandros Stamatakis, October 2007
  • 82. Correlation 125 Taxa: 0.91 Alexandros Stamatakis, October 2007
  • 83. Support Value Distribution Alexandros Stamatakis, October 2007
  • 84. Bootstrap Likelihood Values 125 x 19,436 10,000 replicates only 195 non-trivial bipartitions Alexandros Stamatakis, October 2007
  • 85. Bootstrap Likelihood Values 125 x 19,436 Alexandros Stamatakis, October 2007
  • 86. 3,491 rbcL sequences Rapid versus Standard BS Correlation: 0.98
  • 87. 7,764 DNA Best Tree Alexandros Stamatakis, October 2007
  • 88. 7,764 DNA All Bipartitions Alexandros Stamatakis, October 2007
  • 89. 775 x 3,838 AA Alexandros Stamatakis, October 2007
  • 90. New Opportunities Assess the impact of the alignment method on tree and support values. Test Bootstrap of the Bootstrap (double Bootstrap) procedures. Devise and empirically verify Bootstopping criteria
  • 91. Bootstrap of the Bootstrap 140 AA (Efron et al PNAS 1996) Alexandros Stamatakis, October 2007
  • 92. Bootstrap of the Bootstrap 3,491 rbcL
  • 93. Bootstopping Rapid Bootstrapping allows us to assess Bootstopping criteria as follows: 1. Compute a high number of BS replicates (10,000). 2. Devise a topology-based bootstopping criterion and apply it to these 10,000 replicates. 3. Compare the support values induced by the bootstopped trees (say 300 replicates) with those from all 10,000 replicates. We have 10,000 replicates for 18 datasets containing 125 to 2,554 sequences
  • 94. Bootstopping Criterion Every 50 replicates (at 50, 100, 150, ...) do a test. Say we have N BS trees. Do the following 100 times: 1. Randomly split this set of N trees into 2 equal sets S1, S2 of size N/2. 2. Compute the bipartition support vectors for S1 and S2. 3. Compute the Pearson correlation of the two support vectors. Return the average of the 100 Pearson correlations; if average > 0.99, stop
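The test described on this slide can be sketched as follows. This is one reading of the slide, not the author's implementation. Assumptions: each tree is represented as a frozenset of bipartition labels (any hashable, sortable labels will do), at least two trees are supplied, and when a support vector has zero variance (where the Pearson correlation is undefined) the sketch returns 1.0 for identical vectors and 0.0 otherwise. The function names and the `rng` parameter are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0 if x == y else 0.0  # assumed convention for the undefined case
    return sxy / sqrt(sxx * syy)


def support_vector(trees, bipartitions):
    """Fraction of trees containing each bipartition, in a fixed bipartition order."""
    return [sum(b in t for t in trees) / len(trees) for b in bipartitions]


def bootstop(trees, threshold=0.99, permutations=100, rng=None):
    """Return (stop, average correlation) for the current set of bootstrap trees."""
    rng = rng or random.Random()
    bipartitions = sorted(set().union(*trees))
    half = len(trees) // 2
    total = 0.0
    for _ in range(permutations):
        shuffled = list(trees)
        rng.shuffle(shuffled)
        s1, s2 = shuffled[:half], shuffled[half:2 * half]  # two equal halves
        total += pearson(support_vector(s1, bipartitions),
                         support_vector(s2, bipartitions))
    average = total / permutations
    return average > threshold, average
```

A driver would call `bootstop` after every 50 new replicates and stop generating replicates the first time it returns `True`.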
  • 95. Result Overview Bootstopped between 100 and 400 replicates (avg 213). Correlation on best tree, bootstopped versus 10,000 replicates: > 0.99 (avg 0.995). Correlation of all bipartitions: > 0.995 (avg 0.997)
  • 96. Bootstopping Best 140 AA Alexandros Stamatakis, October 2007
  • 97. Bootstopping Best 404 DNA (Multi-Gene) Alexandros Stamatakis, October 2007
  • 98. Bootstopping Best 994 DNA Alexandros Stamatakis, October 2007
  • 99. Bootstopping All 994 DNA Alexandros Stamatakis, October 2007
  • 100. Bootstopping Best 1,908 DNA Alexandros Stamatakis, October 2007
  • 101. Bootstopping Best 2,554 DNA Alexandros Stamatakis, October 2007
  • 102. Putting the Pieces together BlueGene: can handle huge datasets. Use the CAT approximation on BlueGene: further speedup by a factor of 3.5; memory footprint reduced by a factor of 4
  • 103. 8,864 Bacteria under GTR+Γ and GTR+CAT (plot: log likelihood score under Γ versus execution time, with marks at 7 days and 14 days)
  • 104. Putting the Pieces together BlueGene: can handle huge datasets. Use the CAT approximation on BlueGene: further speedup by a factor of 3.5; memory footprint reduced by a factor of 4. Integrate rapid Bootstrap into the BlueGene version: additional speedup ≈ 15. Mechanisms available to accelerate the BlueGene version by a factor of 50-60 (consistent with combining the two factors: 3.5 × 15 = 52.5). Integrate Bootstopping into BlueGene. Conclusion: We will soon be able to compute a small tree of life with 10,000 organisms and data from multiple genes!
  • 105. Outline: Introduction; Computation of Phylogenies; Maximum Likelihood; Web & Grid Services; Three Steps Towards the Tree of Life; Parallelism on IBM BlueGene/L; Rapid Bootstrapping; A Bootstopping Criterion; Related Projects; Outlook
  • 106. Host-Parasite Co-Evolution Parasites (e.g. Lice), Hosts (e.g. Mammals)
  • 107. Host-Parasite Co-Evolution (figure: host tree and parasite tree) Co-Evolution Hypothesis: 0/1 adjacency matrix, 8 parasites x 6 hosts
  • 108. Host-Parasite Co-Evolution (figure: host tree and parasite tree) Co-Evolution Hypothesis: 0/1 adjacency matrix, 8 parasites x 6 hosts → Statistical Test
  • 109. What can HPC do for Bioinformatics? Axelerated Parafit “Parafit: statistical test of co-evolution”, Pierre Legendre, Syst. Biol. 2003. AxParafit (Axelerated Parafit): statistical test of hypotheses of host-parasite co-evolution; C porting, optimization, BLAS integration; speedup up to factor 67; Master-Worker MPI parallelization. Largest co-phylogenetic study to date conducted within 8 minutes instead of 4 weeks. Open-Source Code: http://icwww.epfl.ch/~stamatak/AxParafit.html SwissGrid-based Web-Server planned
  • 110. AxParafit: Sequential Performance Alexandros Stamatakis, October 2007
  • 111. AxParafit: Parallel Performance Alexandros Stamatakis, October 2007
  • 112. The ML Benchmark: A Current Community Project A standardized way is required to test ML search programs. Web-Server with real-world alignments and performance data at the Swiss Institute of Bioinformatics. Many developers of popular ML programs involved: Stephane Guindon (PHYML), Montpellier; Simon Whelan (LeaPhy), Manchester; Bui Quang Minh (IQPNNI), Vienna; Derrick Zwickl (GARLI), Virginia; Thomas Keane (dprML), Cambridge. Byproduct: SPEC-like CPU benchmark for phylogenetics. Follow-up: (planned) ML competition at a major conference with an industrial sponsor
  • 113. A Current Problem: Handling Multi-Gene Alignments Gene 1 Gene 2 Sequence 1 Sequence 5 Missing Data ≠ Gap Data Alexandros Stamatakis, October 2007
  • 114. A Multi-Gene Model Alexandros Stamatakis, October 2007
  • 115. A Multi-Gene Model Alexandros Stamatakis, October 2007
  • 116. A Multi-Gene Model Alexandros Stamatakis, October 2007
  • 117. A Multi-Gene Model LogLH(T) = LogLH(T|Red)
  • 118. A Multi-Gene Model LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow)
  • 119. A Multi-Gene Model LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow) Challenge: devise efficient data structures for this
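The two-gene sum on these slides generalises to any number of gene partitions. The notation below is an assumption made for this note and does not appear on the slides: G is the set of genes (Red and Yellow in the figure), D_g is the alignment block of gene g, S_g is its set of sites, and L_i is the likelihood of site i.

```latex
\log L(T) \;=\; \sum_{g \in G} \log L(T \mid D_g)
          \;=\; \sum_{g \in G} \; \sum_{i \in S_g} \log L_i(T \mid D_g)
```

The second equality uses the same assumption of independent evolution among sites that the loop-level parallelism slides rely on, so the per-gene terms can be evaluated separately, each with its own model parameters.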
  • 120. Why are Individual Branches per Gene a Challenge? Alexandros Stamatakis, October 2007
  • 121. Why are Individual Branches per Gene a Challenge? Alexandros Stamatakis, October 2007
  • 123. Outlook Tree of Life. What is a good alignment in a phylogenetic context? Simultaneous alignment and tree building. More HPC & memory-aware programming. Multi-core architectures. Models for “gappy” multi-gene alignments
  • 124. Acknowledgements BlueGene Project: Michael Ott, TUM; Srinivas Aluru, Jaroslaw Zola, Iowa State; Dan Janies, Andrew Johnson, Ohio State. IBM CELL & Playstation: Filip Blagojevic, Dimitris Nikolopoulos, Virginia Tech; Christos Antonopoulos, Univ. of Thessaly. Bootstopping: Bernard Moret, Masoud Alipour, EPFL; Olaf Bininda-Emonds, Univ. Jena. RAxML Web-Server: Jacques Rougemont, SIB; Terri Liebowitz, SDSC. AxParafit/AxPcoords: Markus Goeker, Alexander Auch, Jan Meier-Kolthoff, University of Tuebingen. Datasets for Studies: Jun Inoue (Florida), Nicolas Salamin (Lausanne), Marc Gottschling (Berlin), Guido Grimm (Tuebingen), Nikos Poulakakis (Yale), Usman Roshan (NJIT)
  • 125. Thank you for your Attention ! Lake Geneva, Switzerland Alexandros Stamatakis, October 2007