Phylogenomics of microbes:
                          the dark matter of biology

                              Jonathan A. Eisen
                                 UC Davis

                             Talk for iEVOBIO
                               June 29, 2010

Tuesday, June 29, 2010
Eisen Lab - Phylogenomics of Novelty


         Origin of New Functions and Processes
         •New genes
         •Changes in old genes
         •Changes in pathways

         Genome Dynamics
         •Evolvability
         •Repair and recombination processes
         •Intragenomic variation

         Species Evolution
         •Phylogenetic history
         •Vertical vs. horizontal descent
         •Needed to track gain/loss of processes, infer convergence
Social Networking in Science




Bacteria evolve




An homage to Donald Rumsfeld

            • There are known knowns. These are things we
              know that we know.

            • There are known unknowns. That is to say,
              there are things that we know we don't know.

            • But there are also unknown unknowns. There
              are things we don't know we don't know.


Outline
            • Known knowns (background)
                   –rRNA Tree of Life
                   –Genomics
                   –rRNA PCR
                   –Metagenomics
            • Known unknowns
                   –GEBA project - past
                   –GEBA project - present
                   –GEBA project - future
            • Unknown unknowns?

Known Knowns 1:

                         rRNA Tree of Life




rRNA Tree of Life
                         Bacteria




                                                           Archaea




                          Eukaryotes

                            Figure from Barton, Eisen et al.
                               “Evolution”, CSHL Press.
                           Based on tree from Pace NR, 2003.
The Tree of Life: Three Main Domains

   Eukaryotes




                                                                             Bacteria

   Archaea




                                  Unrooted Tree of Life from Barton et al. Evolution
Known Knowns 2:

                         Genomics and Phylogenomics




Fleischmann et al., 1995
Microbial genomes




                             From http://genomesonline.org
Genome Sequences Have
                         Revolutionized Microbiology
            • Predictions of metabolic processes
            • Better vaccine and drug design
            • New insights into mechanisms of evolution
            • Genomes serve as template for functional
              studies
            • New enzymes and materials for engineering
              and synthetic biology

Microbes Run the Planet




4.

                         Microbes in the world I:
                              rRNA PCR




                           Perna et al. 2003
Lateral Transfer




                                            from Doolittle, 1999
from Lerat et al
Known Knowns 3:

                            rRNA PCR




Great Plate Count Anomaly




                         Culturing     Microscope

                          Count          Count


Great Plate Count Anomaly




                         Culturing          Microscope

                          Count      <<<<    Count


Great Plate Count Anomaly


                                                         DNA




                         Culturing          Microscope

                          Count      <<<<    Count


PCR Revolution
                                   Extract DNA


                          PCR w/ Universal rDNA Primers


                                    Sequence


                                Align and compare
                                 to other rDNAs


                         Phylogenetic          OTUs       Ecology
                         classification
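
The "align and compare" and "OTUs" steps above can be sketched in a few lines. This is only an illustration, not the pipeline used in the talk: it assumes the rDNA sequences are already aligned to a common length, it uses a simple greedy seed-based grouping, and the 97% identity default is a common convention that the slides do not specify. The names `percent_identity` and `greedy_otus` are hypothetical.

```python
def percent_identity(a, b):
    """Fraction of identical positions between two aligned sequences,
    skipping columns where either sequence has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def greedy_otus(aligned, threshold=0.97):
    """Group sequences into OTUs (assumed greedy scheme).

    Each sequence joins the first OTU whose seed it matches at or above
    `threshold`; otherwise it becomes the seed of a new OTU.
    `aligned` maps sequence name -> aligned sequence string.
    """
    otus = []  # list of (seed_sequence, [member names])
    for name, seq in aligned.items():
        for seed, members in otus:
            if percent_identity(seed, seq) >= threshold:
                members.append(name)
                break
        else:
            otus.append((seq, [name]))
    return [members for _, members in otus]


if __name__ == "__main__":
    # Toy aligned sequences; real rDNA alignments are far longer.
    toy = {
        "clone_a": "ACGTACGTAC",
        "clone_b": "ACGTACGTAT",
        "clone_c": "TTTTTTTTTT",
    }
    print(greedy_otus(toy, threshold=0.85))
```

The result depends on input order, because the first sequence seen in each group becomes its seed. That is one reason automated OTU picking still needs care.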
Uses of rDNA PCR
                                           Bohannan and Hughes
                                           2003




                         Hugenholtz 2002




rRNA challenges
            • Massive amounts of data from next-gen sequencing
            • Need for full automation, but:
                   –Reads are often non-overlapping
                   –Alignments are not always straightforward
                   –BLAST is insufficient
                   –Automated phylogenetic methods still need work
            • Tree of everything might be useful
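
One reading of the "non-overlapping" problem above is that short reads from different regions of the rRNA gene share few or no alignment columns, so no distance between them can be computed directly. The sketch below checks this for reads already placed in a common alignment. The function names and the `min_shared` cutoff are assumptions for illustration only.

```python
def covered_columns(aligned_read):
    """Alignment columns where the read has a base rather than a gap."""
    return {i for i, ch in enumerate(aligned_read) if ch not in "-."}


def shared_columns(read_a, read_b):
    """Number of alignment columns covered by both reads."""
    return len(covered_columns(read_a) & covered_columns(read_b))


def comparable(read_a, read_b, min_shared=50):
    """True if the reads overlap enough to estimate a distance.

    `min_shared` is an arbitrary illustrative cutoff.
    """
    return shared_columns(read_a, read_b) >= min_shared


if __name__ == "__main__":
    # Toy reads placed in a 10-column alignment.
    read_1 = "ACGTA-----"
    read_2 = "-----GGCTA"
    read_3 = "--GTACGG--"
    print(shared_columns(read_1, read_2))  # the two reads share no columns
    print(shared_columns(read_1, read_3))
    print(comparable(read_1, read_2, min_shared=1))
```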


Known Knowns 4:

                          Metagenomics




Metagenomics


                                 shotgun



                                           clone




Metagenomics Challenge



                          Who is out there?
                         What are they doing?




rRNA phylotyping from metagenomics




                             Venter et al., 2004
Shotgun Sequencing Allows Use of
                          Alternative Anchors (e.g., RecA)




                                                     Venter et al., 2004
Shotgun Sequencing Allows Use of Other Markers

[Figure: "Sargasso Phylotypes" bar chart.
 Y-axis: Weighted % of Clones (0 to 0.5).
 X-axis: Major Phylogenetic Group (Alphaproteobacteria, Betaproteobacteria,
 Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria,
 Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi,
 Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota,
 Crenarchaeota).
 Legend: EFG, EFTu, rRNA, RecA, RpoB, HSP70.]

                         Venter et al., 2004

Functional Inference from
                              Metagenomics
            • Can work well for individual genes
            • Predicting “community” function is
              challenging because treating community as a
              bag of genes does not work well
            • Better to “compartmentalize” data ...




Binning challenge

         A                                   T
         B                                   U
         C                                   V
         D                                   W
         E                                   X
         F                                   Y
         G                                   Z
Binning challenge

         A                                                        T
         B                                                        U
         C                                                        V
         D                                                        W
         E                                                        X
         F                                                        Y
         G               Best binning method: reference genomes   Z
Reference Genomes Coming from
                           Select Environment




Binning challenge

         A                                                      T
         B                                                      U
         C                                                      V
         D                                                      W
         E                                                      X
         F                                                      Y
         G               No reference genome? What do you do?   Z
Binning challenge

         A                                                             T
         B                                                             U
         C                                                             V
         D                                                             W
         E                                                             X
         F                                                             Y
         G               No reference genome? What do you do?          Z
                         Assembly? Composition? Get more references?
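
Of the options named above, "composition" is the easiest to sketch. The example below computes tetranucleotide frequency profiles for fragments and compares them with cosine similarity, on the assumption that fragments from the same genome tend to have similar profiles. It is a toy illustration, not a method from the talk: real composition-based binning typically uses much longer fragments and also merges reverse-complement k-mers. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

K = 4
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 256 tetranucleotides


def kmer_profile(seq):
    """Normalized tetranucleotide frequency vector for one DNA fragment.
    K-mers containing characters other than A, C, G, T are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[kmer] for kmer in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[kmer] / total for kmer in KMERS]


def cosine_similarity(u, v):
    """Cosine of the angle between two profile vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy fragments: two AT-rich, one GC-rich.
    frag_1 = "ATATATATATATATAT"
    frag_2 = "TATATATATATATATA"
    frag_3 = "GCGCGCGCGCGCGCGC"
    p1, p2, p3 = (kmer_profile(f) for f in (frag_1, frag_2, frag_3))
    print(cosine_similarity(p1, p2))  # high: similar composition
    print(cosine_similarity(p1, p3))  # no shared tetranucleotides
```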
Binning challenge

         A                                                      T
         B                                                      U
         C                                                      V
         D                                                      W
         E                                                      X
         F                                                      Y
         G               No reference genome? What do you do?   Z
                         Phylogeny ....
Metagenomic challenges

            • Massive amounts of data from next-gen sequencing
            • Need for full automation, but:
                   –Data are fragmentary
                   –BLAST is insufficient
                   –Automation of phylogenetic methods works somewhat better
                    for protein-coding genes because their alignments are better
                   –Reference databases are incomplete




Known Unknowns 1:

                            GEBA Past




Microbial genomes




                             From http://genomesonline.org
2002

            • At least 40 phyla of bacteria
            • Genome sequences are mostly from three phyla
            • Some other phyla are only sparsely sampled
            • Same trend in Archaea
            • Same trend in Eukaryotes

                         Proteobacteria
                         TM6
                         OS-K
                         Acidobacteria
                         Termite Group
                         OP8
                         Nitrospira
                         Bacteroides
                         Chlorobi
                         Fibrobacteres
                         Marine Group A
                         WS3
                         Gemmimonas
                         Firmicutes
                         Fusobacteria
                         Actinobacteria
                         OP9
                         Cyanobacteria
                         Synergistes
                         Deferribacteres
                         Chrysiogenetes
                         NKB19
                         Verrucomicrobia
                         Chlamydia
                         OP3
                         Planctomycetes
                         Spirochaetes
                         Coprothermobacter
                         OP10
                         Thermomicrobia
                         Chloroflexi
                         TM7
                         Deinococcus-Thermus
                         Dictyoglomus
                         Aquificae
                         Thermodesulfobacteria
                         Thermotogae
                         OP1
                         OP11

                         Based on Hugenholtz, 2002
Tuesday, June 29, 2010
The Tree is not Happy
                         Bacteria




                                                           Archaea




                          Eukaryotes

                            Figure from Barton, Eisen et al.
                               “Evolution”, CSHL Press.
                           Based on tree from Pace NR, 2003.
Tuesday, June 29, 2010
            • NSF-funded Tree of Life Project
            • A genome from each of eight phyla
              Eisen & Ward, PIs

            • At least 40 phyla of bacteria
            • Genome sequences are mostly from three phyla
            • Some other phyla are only sparsely sampled
            • Solution I: sequence more phyla

            [Figure: phylum-level tree of Bacteria. Tip labels: Proteobacteria, TM6,
             OS-K, Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi,
             Fibrobacteres, Marine Group A, WS3, Gemmimonas, Firmicutes, Fusobacteria,
             Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres,
             Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes,
             Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7,
             Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria,
             Thermotogae, OP1, OP11.]

Tuesday, June 29, 2010
The Tree of Life is Still Angry
                            Bacteria




                                                              Archaea




                             Eukaryotes

                               Figure from Barton, Eisen et al.
                                  “Evolution”, CSHL Press.
                              Based on tree from Pace NR, 2003.
Tuesday, June 29, 2010
            • At least 100 phyla of bacteria
            • Genome sequences are mostly from three phyla
            • Most phyla with cultured species are sparsely sampled
            • Lineages with no cultured taxa even more poorly sampled
            • Solution - use tree to really fill gaps

            [Figure: phylum-level tree of Bacteria. Legend: Well sampled phyla.
             Tip labels: Proteobacteria, TM6, OS-K, Acidobacteria, Termite Group, OP8,
             Nitrospira, Bacteroides, Chlorobi, Fibrobacteres, Marine Group A, WS3,
             Gemmimonas, Firmicutes, Fusobacteria, Actinobacteria, OP9, Cyanobacteria,
             Synergistes, Deferribacteres, Chrysiogenetes, NKB19, Verrucomicrobia,
             Chlamydia, OP3, Planctomycetes, Spirochaetes, Coprothermobacter, OP10,
             Thermomicrobia, Chloroflexi, TM7, Deinococcus-Thermus, Dictyoglomus,
             Aquificae, Thermodesulfobacteria, Thermotogae, OP1, OP11.]

Tuesday, June 29, 2010
A Genomic Encyclopedia of Bacteria and
                        Archaea (GEBA)




Tuesday, June 29, 2010
GEBA Pilot Project Overview

           • Identify major branches in rRNA tree for
             which no genomes are available
           • Identify branches with a cultured
             representative in DSMZ
           • Grow >200 of these and prepare DNA
           • Sequence and finish 100 (covering breadth of
             bacterial/archaeal diversity)
           • Annotate, analyze, release data
           • Assess benefits of tree guided sequencing
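
One way to picture tree-guided selection is a greedy search that repeatedly picks the candidate adding the most phylogenetic diversity (PD). The sketch below is only an illustration under stated assumptions: the toy tree, tip names, and branch lengths are invented, PD is taken as the total branch length connecting a set of tips to the root, and nothing here is claimed to be the procedure the GEBA project actually used.

```python
# Toy rooted tree (hypothetical): child -> (parent, branch_length).
TREE = {
    "n1": ("root", 0.1),
    "n2": ("root", 0.3),
    "A": ("n1", 0.2),
    "B": ("n1", 0.25),
    "C": ("n2", 0.4),
    "D": ("n2", 0.05),
}


def path_to_root(tip):
    """Return the branches (named by their child node) between a tip and the root."""
    branches = set()
    node = tip
    while node in TREE:
        branches.add(node)
        node = TREE[node][0]
    return branches


def pd(tips):
    """Rooted PD: summed length of all branches on the paths from the tips to the root."""
    covered = set()
    for tip in tips:
        covered |= path_to_root(tip)
    return sum(TREE[branch][1] for branch in covered)


def greedy_pick(sequenced, candidates, k):
    """Pick up to k candidates, each time taking the one that adds the most PD."""
    chosen = list(sequenced)
    remaining = set(candidates)
    picks = []
    for _ in range(min(k, len(remaining))):
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=lambda t: pd(chosen + [t]) - pd(chosen))
        picks.append(best)
        chosen.append(best)
        remaining.remove(best)
    return picks


# "A" already has a genome; choose two of the cultured candidates B, C, D.
picks = greedy_pick(sequenced=["A"], candidates=["B", "C", "D"], k=2)
print(picks)
```

In this toy tree, C is picked first because it brings in a whole unsampled branch, which is the intuition behind sequencing lineages with no existing genomes.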

Tuesday, June 29, 2010
GEBA Pilot Project: Components
         • Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan Eisen,
           Eddy Rubin, Jim Bristow)
         • Project management (David Bruce, Eileen Dalin, Lynne Goodwin)
         • Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
         • Sequencing and closure (Eileen Dalin, Susan Lucas, Alla Lapidus, Mat
           Nolan, Alex Copeland, Cliff Han, Feng Chen, Jan-Fang Cheng)
         • Annotation and data release (Nikos Kyrpides, Victor Markowitz, et al)
         • Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu, Victor
           Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain, Patrik D’Haeseleer,
           Sean Hooper, Iain Anderson, Amrita Pati, Natalia N. Ivanova,
           Athanasios Lykidis, Adam Zemla)
         • Adopt a microbe education project (Cheryl Kerfeld)
         • Outreach (David Gilbert)
         • $$$ (DOE, Eddy Rubin, Jim Bristow)



Tuesday, June 29, 2010
GEBA and Openness
  • All data released as quickly as
    possible w/ no restrictions to
    IMG-GEBA, GenBank, etc.
  • Data also available in Biotorrents
    (http://biotorrents.net)
  • Individual genome reports
    published in OA “Standards in
    Genomic Sciences (SIGS)”
  • 1st GEBA paper in Nature freely
    available and published using
    Creative Commons License
Tuesday, June 29, 2010
Known Unknowns 2:

                           GEBA present




Tuesday, June 29, 2010
GEBA Lesson 1

                         rRNA Tree is Useful for Identifying
                          Phylogenetically Novel Genomes




Tuesday, June 29, 2010
rRNA Tree of Life
                         Bacteria




                                                           Archaea




                          Eukaryotes

                            Figure from Barton, Eisen et al.
                               “Evolution”, CSHL Press.
                           Based on tree from Pace NR, 2003.
Tuesday, June 29, 2010
Network of Life
                         Bacteria




                                                           Archaea




                          Eukaryotes

                            Figure from Barton, Eisen et al.
                               “Evolution”, CSHL Press.
                           Based on tree from Pace NR, 2003.
Tuesday, June 29, 2010
“Whole Genome” Concatenation
                             Tree w/ AMPHORA




                See Wu and Eisen, Genome Biology 2008 9: R151
                http://bobcat.genomecenter.ucdavis.edu/AMPHORA/
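
The concatenation idea itself can be sketched in a few lines: per-gene alignments are joined end to end into one supermatrix, with gaps filling in for taxa that lack a given gene. This is a hedged illustration only; the function name, the toy alignments, and the gap-filling choice are assumptions, and it is not a description of how AMPHORA builds its alignments.

```python
def concatenate(alignments, gap="-"):
    """Join per-gene alignments (dicts of taxon -> aligned sequence) into one supermatrix.

    Taxa missing from a gene are padded with gap characters for that gene's columns.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon] += aln.get(taxon, gap * width)
    return supermatrix


# Hypothetical marker-gene alignments for three taxa.
gene1 = {"A": "MK", "B": "MR"}
gene2 = {"A": "GG-", "C": "GA-"}
combined = concatenate([gene1, gene2])
for taxon, seq in combined.items():
    print(taxon, seq)
```

The resulting supermatrix would then be passed to a tree-building program; that step is left out here.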
Tuesday, June 29, 2010
Compare PD in Trees




Tuesday, June 29, 2010
PD of rRNA, Genome Trees Similar




     From Wu et al. 2009 Nature 462, 1056-1060
Tuesday, June 29, 2010
GEBA Lesson 2

                         Phylogeny-driven genome selection
                         helps discover new genetic diversity




Tuesday, June 29, 2010
Network of Life
                         Bacteria




                                                           Archaea




                          Eukaryotes

                            Figure from Barton, Eisen et al.
                               “Evolution”, CSHL Press.
                           Based on tree from Pace NR, 2003.
Tuesday, June 29, 2010
Protein Family Rarefaction Curves

            • Take data set of multiple complete genomes
            • Identify all protein families using MCL
            • Plot # of genomes vs. # of protein families
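
The three steps above can be sketched as follows. This is a minimal sketch under assumptions: the family assignments are taken as given (in practice they would come from a clustering step such as MCL, which is not run here), the genome and family names are invented, and averaging over random genome orders is one common way to smooth the curve, not necessarily the one used for the talk.

```python
import random


def rarefaction_curve(genome_families, n_permutations=100, seed=0):
    """Mean number of distinct families seen after adding 1..N genomes in random order."""
    rng = random.Random(seed)
    genomes = list(genome_families)
    totals = [0] * len(genomes)
    for _ in range(n_permutations):
        rng.shuffle(genomes)
        seen = set()
        for i, genome in enumerate(genomes):
            seen |= genome_families[genome]
            totals[i] += len(seen)
    return [total / n_permutations for total in totals]


# Hypothetical input: genome -> set of protein family IDs.
toy = {
    "g1": {"f1", "f2", "f3"},
    "g2": {"f2", "f3", "f4"},
    "g3": {"f1", "f5"},
}
curve = rarefaction_curve(toy)
for n_genomes, mean_families in enumerate(curve, start=1):
    print(n_genomes, round(mean_families, 2))
```

A curve that keeps climbing as genomes are added suggests that new families are still being discovered; a curve that flattens suggests the sampled group is close to saturation.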




Tuesday, June 29, 2010
Synapomorphies exist




Tuesday, June 29, 2010
GEBA Lesson 3

                         Phylogeny-driven genome selection
                            improves genome annotation




Tuesday, June 29, 2010
Predicting Function

            • Key step in genome projects
            • More accurate predictions help guide
              experimental and computational analyses
            • Many diverse approaches
            • Comparative and evolutionary analysis greatly
              improves most predictions




Tuesday, June 29, 2010
Phylogeny vs. Blast
   Many methods focus on “top blast hits”

   But much better to build phylogenetic trees of genes and compare to relatives

   Allows better integration of evolutionary history (e.g., orthologs and paralogs)

   [Figure: flowchart of the phylogenomic method, with worked gene trees for
    Example A (genes 1A, 1B, 2A, 2B, 3A, 3B from Species 1, 2 and 3) and
    Example B (genes 1 to 6). Method steps: choose gene(s) of interest;
    identify homologs; align sequences; calculate gene tree; overlay known
    functions onto tree; infer likely function of gene(s) of interest.
    Final panel: actual evolution (assumed to be unknown). Labels on the trees
    include “Duplication?”, “Duplication” and “Ambiguous”.
    Based on Eisen, 1998 Genome Res 8: 163-167.]
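
The contrast between a top-hit call and a tree-based call can be shown with a toy case. Everything in this sketch is an assumption made for illustration: the gene names, distances, and function labels are invented, the tree is a nested tuple of leaf names, and "tree-based" is reduced to reading the function off the query's sister clade, which is a simplification of the method on the slide.

```python
def leaves(node):
    """Return all leaf names under a node (a leaf is a string, an internal node a 2-tuple)."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def sister_clade(node, query):
    """Return the leaves of the clade that is sister to `query`, or None if not found."""
    if isinstance(node, str):
        return None
    left, right = node
    if left == query:
        return leaves(right)
    if right == query:
        return leaves(left)
    return sister_clade(left, query) or sister_clade(right, query)


def predict_by_tree(tree, query, known):
    """Assign the sister clade's function if it is unanimous, else report 'ambiguous'."""
    sisters = sister_clade(tree, query)
    if sisters is None:
        raise ValueError("query not found in tree")
    functions = {known[name] for name in sisters if name in known}
    return functions.pop() if len(functions) == 1 else "ambiguous"


def predict_by_top_hit(distances, known):
    """Assign the function of the characterized gene with the smallest distance to the query."""
    _, best = min((d, name) for name, d in distances.items() if name in known)
    return known[best]


# Hypothetical gene family: a duplication produced subfamilies a and b.
# The query groups with subfamily a, but subfamily a evolves fast, so a
# slowly evolving paralog (b1) ends up as the closest sequence.
tree = (("query", ("a1", "a2")), ("b1", "b2"))
known = {"a1": "function A", "a2": "function A", "b1": "function B", "b2": "function B"}
distances = {"a1": 0.9, "a2": 0.8, "b1": 0.5, "b2": 0.7}

tree_call = predict_by_tree(tree, "query", known)
top_hit_call = predict_by_top_hit(distances, known)
print("tree-based:", tree_call)
print("top hit:   ", top_hit_call)
```

The two calls disagree here because similarity and relatedness are not the same thing when rates differ between lineages, which is the case the tree-based approach is meant to handle.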
Tuesday, June 29, 2010
Wu et al. 2005 PLoS Genetics 1: e65.
Tuesday, June 29, 2010
Most/All Functional Prediction Improves
                  w/ Better Phylogenetic Sampling
              • Conversion of hypotheticals into
                conserved hypotheticals
              • Improved phylogenomics
              • Linking distantly related members of
                protein families
              • Improved non-homology prediction



Tuesday, June 29, 2010
Known Unknowns 3:

                            GEBA future




Tuesday, June 29, 2010
GEBA Future 1

                         How much further should we go?




Tuesday, June 29, 2010
rRNA Tree of Life
                         Bacteria




                                                           Archaea




                          Eukaryotes

                            Figure from Barton, Eisen et al.
                               “Evolution”, CSHL Press.
                           Based on tree from Pace NR, 2003.
Tuesday, June 29, 2010
Phylogenetic Diversity:
                         Sequenced Bacteria & Archaea




                                   From Wu et al. 2009
Tuesday, June 29, 2010
Phylogenetic Diversity with GEBA




                              From Wu et al. 2009
Tuesday, June 29, 2010
Phylogenetic Diversity: Isolates




                                     From Wu et al. 2009
Tuesday, June 29, 2010
Phylogenetic Diversity: All




                                   From Wu et al. 2009
Tuesday, June 29, 2010
            • At least 40 phyla of bacteria
            • Genome sequences are mostly from three phyla
            • Most phyla with cultured species are sparsely sampled
            • Lineages with no cultured taxa even more poorly sampled

            [Figure: phylum-level tree of Bacteria. Legend: Well sampled phyla;
             Poorly sampled; No cultured taxa. Tip labels: Proteobacteria, TM6, OS-K,
             Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi,
             Fibrobacteres, Marine Group A, WS3, Gemmimonas, Firmicutes, Fusobacteria,
             Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres,
             Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes,
             Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7,
             Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria,
             Thermotogae, OP1, OP11.]
Tuesday, June 29, 2010
Uncultured Lineages:
                         Technical Approaches
            • Get into culture
            • Enrichment cultures
            • If abundant in low diversity ecosystems
            • Flow sorting
            • Microbeads
            • Microfluidic sorting
            • Single cell amplification


Tuesday, June 29, 2010
GEBA Future 2

                         How many gene families are there?




Tuesday, June 29, 2010
Compare PD in Trees




Tuesday, June 29, 2010
Gene Families vs PD
            [Figure: plot titled “PD vs. Gene Families (per genome)”.
             x-axis: Gene families / genome (0 to 1100);
             y-axis: PD/Genome (0 to 0.4).]


Tuesday, June 29, 2010
How many protein families?
            GEBA Genomes
              PD/Genome:     ~0.1
              PFAMs/Genome:  ~1000
              PFAMs/PD:      ~10,000
              Total PFAMs:   ~10,000,000



                                  From Wu et al. 2009
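
The arithmetic linking these numbers can be written out explicitly. The first three values are the approximate figures on the slide; the total PD at the end is only back-calculated from them and is an inference, not a number given in the talk.

```python
# Approximate values taken from the slide.
pd_per_genome = 0.1            # PD contributed per GEBA genome
families_per_genome = 1000     # protein families per genome
total_families = 10_000_000    # extrapolated total number of protein families

# Families per unit of PD: 1000 / 0.1, about 10,000, matching the slide.
families_per_pd = families_per_genome / pd_per_genome

# Inferred, not stated on the slide: the total PD that the extrapolation implies.
implied_total_pd = total_families / families_per_pd

print("families per unit PD:", round(families_per_pd))
print("implied total PD:", round(implied_total_pd))
```

Because the estimate is a simple linear extrapolation, it inherits every caveat about whether families scale with PD at a constant rate.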
Tuesday, June 29, 2010
Caveats (of many)

                    •    Novel protein families per genome likely taxon
                         specific

                    •    Parameters other than PD clearly important

                    •    Does not include viruses, eukaryotes




Tuesday, June 29, 2010
GEBA Future 3

                         Need to better leverage improved
                             phylogenetic sampling




Tuesday, June 29, 2010
Example 1: Protein Family Space

            • Much less biased sampling of protein family
              space now available

            • Need to rebuild / reassess many protein family
              databases (e.g., HMMs)

            • Structural space



Tuesday, June 29, 2010
Example 2: Experiments




Tuesday, June 29, 2010
As of 2002
            • At least 40 phyla of bacteria

            [Figure: phylum-level tree of Bacteria. Tip labels: Proteobacteria, TM6,
             OS-K, Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi,
             Fibrobacteres, Marine Group A, WS3, Gemmimonas, Firmicutes, Fusobacteria,
             Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres,
             Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes,
             Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7,
             Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria,
             Thermotogae, OP1, OP11. Based on Hugenholtz, 2002.]
Tuesday, June 29, 2010
As of 2002
            • At least 40 phyla of bacteria
            • Experimental studies are mostly from three phyla

            [Figure: phylum-level tree of Bacteria. Tip labels: Proteobacteria, TM6,
             OS-K, Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi,
             Fibrobacteres, Marine Group A, WS3, Gemmimonas, Firmicutes, Fusobacteria,
             Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres,
             Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes,
             Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7,
             Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria,
             Thermotogae, OP1, OP11. Based on Hugenholtz, 2002.]
Tuesday, June 29, 2010
As of 2002
            • At least 40 phyla of bacteria
            • Experimental studies are mostly from three phyla
            • Some studies in other phyla

            [Figure: phylum-level tree of Bacteria. Tip labels: Proteobacteria, TM6,
             OS-K, Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi,
             Fibrobacteres, Marine Group A, WS3, Gemmimonas, Firmicutes, Fusobacteria,
             Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres,
             Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes,
             Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7,
             Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria,
             Thermotogae, OP1, OP11. Based on Hugenholtz, 2002.]
Tuesday, June 29, 2010
As of 2002
            • At least 40 phyla of bacteria
            • Genome sequences are mostly from three phyla
            • Some other phyla are only sparsely sampled
            • Same trend in Eukaryotes

            [Figure: phylum-level tree of Bacteria. Tip labels: Proteobacteria, TM6,
             OS-K, Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi,
             Fibrobacteres, Marine Group A, WS3, Gemmimonas, Firmicutes, Fusobacteria,
             Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres,
             Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes,
             Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7,
             Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria,
             Thermotogae, OP1, OP11. Based on Hugenholtz, 2002.]
Tuesday, June 29, 2010
As of 2002
            • At least 40 phyla of bacteria
            • Genome sequences are mostly from three phyla
            • Some other phyla are only sparsely sampled
            • Same trend in Viruses

            [Figure: phylum-level tree of Bacteria. Tip labels: Proteobacteria, TM6,
             OS-K, Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi,
             Fibrobacteres, Marine Group A, WS3, Gemmimonas, Firmicutes, Fusobacteria,
             Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres,
             Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes,
             Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7,
             Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria,
             Thermotogae, OP1, OP11. Based on Hugenholtz, 2002.]
Tuesday, June 29, 2010
              Need experimental studies from across the tree too

            [Figure: phylum-level tree of Bacteria, scale bar 0.1. Tip labels:
             Proteobacteria, TM6, OS-K, Acidobacteria, Termite Group, OP8, Nitrospira,
             Bacteroides, Chlorobi, Fibrobacteres, Marine Group A, WS3, Gemmimonas,
             Firmicutes, Fusobacteria, Actinobacteria, OP9, Cyanobacteria, Synergistes,
             Deferribacteres, Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3,
             Planctomycetes, Spirochaetes, Coprothermobacter, OP10, Thermomicrobia,
             Chloroflexi, TM7, Deinococcus-Thermus, Dictyoglomus, Aquificae,
             Thermodesulfobacteria, Thermotogae, OP1, OP11.
             Tree based on Hugenholtz (2002) with some modifications.]
Example 3: Improving the tree

            • To make the best use of GEBA data, we need a
              better tree








   Concatenated alignment “whole genome tree” built using AMPHORA



Why does the tree matter?


   Whole genome tree built using AMPHORA
   by Martin Wu and Dongying Wu
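
The concatenation step behind a "whole genome tree" can be illustrated with a toy sketch. This is not AMPHORA's code: the function name, the dictionary layout, and the toy sequences are hypothetical, and it assumes each marker gene has already been aligned on its own. Taxa missing a gene are padded with gap characters so every row of the combined matrix has the same length.

```python
def concatenate_alignments(gene_alignments, missing="-"):
    """Join per-gene alignments into one concatenated alignment.

    gene_alignments maps gene name -> {taxon name -> aligned sequence}.
    Returns {taxon name -> concatenated sequence}, genes in sorted order.
    """
    # Every taxon seen in at least one gene gets a row in the result.
    taxa = sorted({taxon for aln in gene_alignments.values() for taxon in aln})
    parts = {taxon: [] for taxon in taxa}

    for gene in sorted(gene_alignments):
        aln = gene_alignments[gene]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Pad with gaps when this taxon lacks the gene.
            parts[taxon].append(aln.get(taxon, missing * width))

    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy input: two marker genes, three taxa, one taxon missing rpoB.
toy = {
    "recA": {"taxonA": "MAID", "taxonB": "MAVD", "taxonC": "M-VD"},
    "rpoB": {"taxonA": "KLE", "taxonC": "KIE"},
}
matrix = concatenate_alignments(toy)
```

The resulting matrix would then be handed to a tree-building program; that step is outside this sketch.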


Many Alternatives to Concatenation

            • Gene presence/absence
            • Supertrees / consensus methods
            • Separate phylogeny of genes and then
              integration of results (e.g., networks)
            • Models that incorporate gain/loss as well as
              gene phylogeny
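
As one illustration of the first alternative, gene presence/absence can be turned into pairwise distances between genomes. The sketch below uses the Jaccard distance, which is only one of several possible choices; the function names, genome labels, and gene-family labels are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |shared gene families| / |gene families in either genome|."""
    union = a | b
    if not union:
        return 0.0  # two empty genomes are treated as identical
    return 1.0 - len(a & b) / len(union)


def presence_absence_distances(gene_content):
    """gene_content maps genome name -> set of gene families present."""
    return {
        (x, y): jaccard_distance(gene_content[x], gene_content[y])
        for x, y in combinations(sorted(gene_content), 2)
    }


# Toy gene-content table (hypothetical families).
toy_content = {
    "genomeA": {"fam1", "fam2", "fam3"},
    "genomeB": {"fam1", "fam2", "fam4"},
    "genomeC": {"fam5"},
}
distances = presence_absence_distances(toy_content)
```

A distance matrix like this could feed a clustering or distance-based tree method. Unlike concatenation, it ignores sequence divergence within shared genes.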




Example 4: Metagenomic Analysis




Phylogeny for Typing and Binning

   [Figure: Sargasso Phylotypes. Bar chart of weighted % of clones
   (y-axis ticks 0, 0.1250, 0.2500, 0.3750, 0.5000) by major
   phylogenetic group (x-axis): Alphaproteobacteria,
   Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria,
   Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria,
   Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria,
   Deinococcus-Thermus, Euryarchaeota, Crenarchaeota.
   Legend (marker genes): EFG, EFTu, rRNA, RecA, RpoB, HSP70.]

                         Venter et al., 2004
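
The tallying behind a chart like the one above can be sketched roughly as follows, assuming each clone has already been assigned to a phylogenetic group from a marker-gene tree. The record layout, the per-clone weights, and the toy numbers are all hypothetical; the slide does not say how the weights in Venter et al. (2004) were computed.

```python
from collections import defaultdict


def weighted_fractions(assignments):
    """Per-marker weighted fraction of clones in each phylogenetic group.

    assignments is an iterable of (marker, group, weight) tuples.
    Returns {marker -> {group -> fraction of that marker's total weight}}.
    """
    totals = defaultdict(float)
    by_group = defaultdict(lambda: defaultdict(float))
    for marker, group, weight in assignments:
        totals[marker] += weight
        by_group[marker][group] += weight

    return {
        marker: {group: w / totals[marker] for group, w in groups.items()}
        for marker, groups in by_group.items()
        if totals[marker] > 0
    }


# Toy assignments (hypothetical weights, not real data).
toy_assignments = [
    ("RecA", "Alphaproteobacteria", 2.0),
    ("RecA", "Cyanobacteria", 1.0),
    ("RecA", "Alphaproteobacteria", 1.0),
    ("rRNA", "Alphaproteobacteria", 1.0),
    ("rRNA", "Gammaproteobacteria", 1.0),
]
fractions = weighted_fractions(toy_assignments)
```

Normalizing within each marker makes the profiles from different genes directly comparable, which is what the grouped bars in the figure do.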
Phylogeny for Typing and Binning

   [Figure repeated: Sargasso Phylotypes bar chart of weighted % of
   clones by major phylogenetic group, for marker genes EFG, EFTu,
   rRNA, RecA, RpoB, HSP70.]

   Should improve with better genomic sampling

                         Venter et al., 2004
Sargasso Phylotypes

   Figure: bar chart of weighted % of clones (y-axis ticks 0, 0.1250,
   0.2500, 0.3750, 0.5000) by major phylogenetic group:
   Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria,
   Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria,
   Firmicutes, Actinobacteria, Chlorobi, CFB




                                                                                                C
                                                                                                    hl
                                                                                                        or
                                                                                                          of
                                                                                                                 le
                                                                                                                      xi
                                                                                           Sp
                                                                                                  iro
                                                                                                       ch
                                                                                                            ae
                                                                                                                 te
                                                                                                                      s
                                                                                           Fu
                                                                                                  so
                                                                                                       ba
                                                                                                            ct
                                                                          De                                    er
                                                                            in                                       ia
                                                                                oc
                                                                                      oc
                                                                                           cu
                                                                                                                                                 Only improved a little




                                                                                             s-
                                                                                                     Th
                                                                                                           er
                                                                                          Eu                    m
                                                                                               ry                  us
                                                                                                  ar
                                                                                                    ch
                                                                                                                                                                                                   Phylogeny for Typing and Binning




                                                                                                           ae
                                                                                                                 ot
                                                                                      C                               a
                                                                                           re
                                                                                              n     ar
                                                                                                      ch
                                                                                                        ae
                                                                                                                 ot
                                                                                                                      a
                         Venter et al., 2004
                                                                                                                                        EFG
                                                                                                                                        EFTu



                                                                                                                                        rRNA
                                                                                                                                        RecA
                                                                                                                                        RpoB
                                                                                                                                        HSP70
How to improve phylogenetic analysis of metagenomic data
            • Better phylogenetic and OTU methods for
              fragmented data

            • Better assessment of which genes to use?

            • More automation of all methods




Phylogenetic challenge




                         How to place all fragments in one tree?
                         How to identify OTUs including all fragments?
                         Can you analyze more than 1 gene family at a
                         time?

Approach 1: Place Reads on Reference Tree
            • Examples
                   –AMPHORA (Wu and Eisen)
                   –PPlacer (Erik Matsen)
                   –FastTree (Morgan Price)
            • General approach
                   –Precompute reference tree for full length sequences
                   –Place individual reads on reference tree
                   –Merge trees
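
The general approach above can be sketched in a few lines. This is a toy illustration only, not the algorithm used by AMPHORA, pplacer, or FastTree. It assumes the read is already aligned to the reference alignment, scores each reference leaf by the mismatch fraction over the columns the read covers, and grafts the read as a sister to the closest leaf. The function names and example sequences are hypothetical.

```python
import re
from typing import Dict, Tuple


def mismatch_fraction(read: str, ref: str) -> float:
    """Fraction of mismatches over the columns where the read is not a gap."""
    if len(read) != len(ref):
        raise ValueError("read and reference must share one alignment")
    covered = [(r, f) for r, f in zip(read, ref) if r != "-"]
    if not covered:
        raise ValueError("read covers no alignment columns")
    return sum(r != f for r, f in covered) / len(covered)


def place_read(read: str, references: Dict[str, str]) -> Tuple[str, float]:
    """Return the closest reference leaf and its distance to the read."""
    best = min(references, key=lambda name: mismatch_fraction(read, references[name]))
    return best, mismatch_fraction(read, references[best])


def attach(newick: str, leaf: str, read_name: str) -> str:
    """Graft the read next to `leaf` in a Newick string without branch lengths."""
    pattern = r"(?<![\w.])" + re.escape(leaf) + r"(?![\w.])"
    return re.sub(pattern, lambda m: f"({leaf},{read_name})", newick, count=1)


# Precomputed reference alignment and tree (toy data).
references = {"A": "ACGTACGT", "B": "ACGTTCGT", "C": "TCGAACGA"}
reference_tree = "((A,B),C);"

# A short read that covers only the last four alignment columns.
read = "----TCGT"
leaf, distance = place_read(read, references)
print(leaf, distance)
print(attach(reference_tree, leaf, "read1"))
```

Real placement tools score attachment points on every branch under an evolutionary model rather than picking the nearest leaf, so this sketch only conveys the workflow of precomputing a reference and adding reads one at a time.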

Variants

            • Use concatenated alignment of markers not
              just individual genes (Steven Kembel)
            • Apply to OTU identification not just
              classification (Thomas Sharpton)
            • CoBinning: look for linkage among
              fragments/genes (Aaron Darling)




How to improve phylogenetic analysis of metagenomic data
            • Better phylogenetic and OTU methods for
              fragmented data

            • Better assessment of which genes to use?

            • More automation




New “Marker Genes”

            • 100 representative genomes, including many
              GEBAs
            • MCL gene families
            • Identify gene families w/
                   –High universality
                   –High uniformity of copy number
                   –Phylogenetic tree similar to “whole genome tree”
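
One way to score the first two criteria, sketched under stated assumptions: universality is taken as the percentage of genomes that carry the family, and uniformity as the percentage of those genomes that carry exactly one copy. These definitions, the function name `score_family`, and the toy counts are illustrative assumptions, not the definitions used in the talk.

```python
from typing import Dict


def score_family(copies: Dict[str, int], n_genomes: int) -> Dict[str, float]:
    """Score one gene family from its per-genome copy counts.

    `copies` maps genome name -> number of family members in that genome.
    Genomes lacking the family may be omitted or given a count of 0.
    """
    present = [c for c in copies.values() if c > 0]
    universality = 100.0 * len(present) / n_genomes
    single_copy = sum(1 for c in present if c == 1)
    uniformity = 100.0 * single_copy / len(present) if present else 0.0
    return {"universality": universality, "uniformity": uniformity}


# Toy family: present in 4 of 5 genomes, duplicated in one of them.
family = {"g1": 1, "g2": 1, "g3": 2, "g4": 1, "g5": 0}
scores = score_family(family, n_genomes=5)
print(scores)
```

The third criterion, agreement with the whole genome tree, needs a tree comparison measure and is not covered by this sketch.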




Distances between gene trees and the AMPHORA concatenated genome tree

[Figure: two panels, one bar per gene tree. Genes are listed in the order they appear on each panel, top to bottom.
 Panel 1, x-axis NODAL distance (0 to 6): rpmA, coaE, trmD, rpsS, radA, rplD, tsf, frr, ttf, rplR, rplM, rplI, rpsB, rpsO, mraW, rpsH, rplQ, rplL, rplT, rplE, rpsP, rplC, rplV, rplS, infC, rpsM, rplO, rplU, rpsL, rpsQ, guaA, rpsG, smpB, priA, rpsK, rplK, serS, rplA, rplF, ruvA, rpsC, rplN, rplP, rpsE, pyrH, rpsI, secY, rpsJ, purA, rplB, nusA, ruvB, rRNA16S.
 Panel 2, x-axis SPLIT distance (0 to 0.9): coaE, rpmA, rplL, rpsQ, rplR, rplQ, rpsH, smpB, rpsO, rplP, rpsS, rplV, rplT, rplO, rpsP, rpsK, rplU, tsf, trmD, rplS, ttf, rpsI, mraW, rpsL, rpsG, rplM, rplI, pyrH, rpsM, ruvA, radA, purA, rplK, rplD, infC, rplC, rplE, rplA, frr, rplF, serS, rplN, guaA, ruvB, rpsB, rpsJ, rRNA16S, secY, rplB, priA, rpsE, rpsC, nusA.
 Legend: AMPHORA marker; Ribosomal protein; Transcription/translation related protein; DNA repair protein; Protein of other function.
 Reference shown: distance between the genome tree and 100 random trees (average ± standard deviation).]
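
For readers unfamiliar with split-based tree comparison, here is a minimal sketch of a normalized split distance (in the style of Robinson-Foulds) between two trees on the same leaf set. The nested-tuple tree encoding, the function names, and the normalization (differing splits divided by the total number of nontrivial splits in both trees) are assumptions for illustration. The exact distance definitions behind the figure are not stated on the slide.

```python
from typing import FrozenSet, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """All leaf names under a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree: Tree) -> Set[FrozenSet[str]]:
    """Nontrivial bipartitions, each stored as the side lacking a fixed anchor leaf."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    found: Set[FrozenSet[str]] = set()

    def walk(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    walk(tree)
    return found


def split_distance(a: Tree, b: Tree) -> float:
    """Share of nontrivial splits that are found in only one of the two trees."""
    if leaves(a) != leaves(b):
        raise ValueError("trees must have the same leaf set")
    sa, sb = splits(a), splits(b)
    total = len(sa) + len(sb)
    return len(sa ^ sb) / total if total else 0.0


gene_tree = ((("A", "B"), "C"), ("D", "E"))
genome_tree = ((("A", "C"), "B"), ("D", "E"))
print(split_distance(gene_tree, genome_tree))
```

A value of 0 means the two topologies share every split, and values near 1 mean they share almost none, which is why comparing against random trees gives a useful reference point.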

Screen gene markers for different taxonomic groups


                         Phylum                  Genome Number     Gene Number
                         Actinobacteria          63                267783
                         Alphaproteobacteria     94                347287
                         Betaproteobacteria      56                266362
                         Gammaproteobacteria     126               483632
                         Deltaproteobacteria     25                102115
                         Epsilonproteobacteria   18                33416
                         Bacteroidetes           25                71531
                         Chlamydiae              13                13823
                         Chloroflexi             10                33577
                         Cyanobacteria           36                124080
                         Firmicutes              106               312309
                         Spirochaetes            18                38832
                         Thermi                  5                 14160
                         Thermotogae             9                 17037

Keep only the families with:

                                        Universality * Evenness * Monophyly >= 90*90*90

                         Phylogenetic group            Genome Number     Gene Number      Marker Candidates
                         Archaea                       62                145415           102
                         Actinobacteria                63                267783           136
                         Alphaproteobacteria           94                347287           142
                         Betaproteobacteria            56                266362           294
                         Gammaproteobacteria           126               483632           141
                         Deltaproteobacteria           25                102115           44
                         Epsilonproteobacteria         18                33416            446
                         Bacteroidetes                 25                71531            179
                         Chlamydiae                    13                13823            561
                         Chloroflexi                   10                33577            140
                         Cyanobacteria                 36                124080           532
                         Firmicutes                    106               312309           80
                         Spirochaetes                  18                38832            72
                         Thermi                        5                 14160            727
                         Thermotogae                   9                 17037            646
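
The cutoff above can be read as a product threshold on three percentage scores. A minimal sketch, assuming each score is on a 0 to 100 scale (the function name and example values are hypothetical):

```python
THRESHOLD = 90 * 90 * 90  # 729000, the cutoff written on the slide


def is_marker_candidate(universality: float, evenness: float, monophyly: float) -> bool:
    """True when the product of the three percentage scores meets the cutoff."""
    return universality * evenness * monophyly >= THRESHOLD


print(is_marker_candidate(95, 92, 91))    # all three scores above 90
print(is_marker_candidate(100, 100, 75))  # one score below 90, product still passes
print(is_marker_candidate(90, 90, 89))    # product falls just under the cutoff
```

Because the test is on the product, a family can make up for one weaker score with stronger ones, which is not the same as requiring each score to reach 90 on its own.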



How to improve phylogenetic analysis of metagenomic data
            • Better phylogenetic and OTU methods for
              fragmented data

            • Better assessment of which genes to use?

            • More automation




AMPHORA




                         Guide tree

AMPHORA - each read on its own tree

[Figure: "Phylogenetic Binning Using AMPHORA". Chart with value axis from 0 to 0.7, grouped by phylogenetic group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified Proteobacteria, Cyanobacteria, Chlamydiae, Acidobacteria, Bacteroidetes, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified Bacteria), one series per marker gene: frr, tsf, pgk, rplL, rplF, rplP, rplT, rplE, infC, rpsI, rplS, rplA, rplB, rplK, rplC, rpsJ, rplN, rplD, rplM, rpsE, rpsS, rpsB, rpsK, rpsC, rpoB, rpsM, pyrG, nusA, dnaG, rpmA, smpB.]

ZORRO

            • http://sourceforge.net/projects/probmask/
            • ZORRO is a probabilistic masking program
              that assigns a confidence score to each column
              in a multiple sequence alignment. These
              scores can then be used to account for
              alignment accuracy in phylogenetic inference
              pipelines.
            • Wu, Chatterji, Eisen (submitted)
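
To illustrate how per-column confidence scores might feed a downstream pipeline, the sketch below drops low-confidence columns from an alignment. It does not call ZORRO or read its output format. The scores, the cutoff, and the function name `mask_alignment` are made-up inputs for illustration.

```python
from typing import Dict, List


def mask_alignment(alignment: Dict[str, str], scores: List[float], cutoff: float) -> Dict[str, str]:
    """Keep only the alignment columns whose confidence score is at least `cutoff`."""
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(scores)}:
        raise ValueError("need exactly one score per alignment column")
    keep = [i for i, score in enumerate(scores) if score >= cutoff]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy alignment in which the third column is poorly supported.
alignment = {"sp1": "MK-LVA", "sp2": "MKQLIA", "sp3": "MR-LVA"}
scores = [9.5, 8.0, 1.2, 9.0, 6.5, 9.8]  # hypothetical per-column confidence
masked = mask_alignment(alignment, scores, cutoff=5.0)
print(masked)
```

A hard cutoff is the simplest choice. A pipeline could instead pass the scores along as column weights if the tree-building program supports weighted sites.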


Conclusions

            • Phylogeny-driven sampling produces many
              benefits immediately
            • To get the most benefit, we need to
              redirect many informatics efforts to take
              advantage of less biased data
            • Still a long way from full benefits
            • Note - most of the benefits can come from
              (aack) - unfinished genomes


MICROBES




A Happy Tree of Life





  • 26.
    rRNA challenges • Massive amounts of data from next-gen • Need for full automation but –Non overlapping –Alignments not always straightforward –BLAST insufficient –Phylogenetic methods that have been automated still need work • Tree of everything might be useful Tuesday, June 29, 2010
  • 27.
    Known Knowns 4: Metagenomics Tuesday, June 29, 2010
  • 28.
    4. Microbes in the world I: rRNA PCR Perna et al. 2003 Tuesday, June 29, 2010
  • 29.
    Metagenomics shotgun clone Tuesday, June 29, 2010
  • 30.
  • 31.
  • 32.
  • 33.
    Metagenomics Challenge Who is out there? What are they doing? Tuesday, June 29, 2010
  • 34.
    rRNA phylotyping frommetagenomics Venter et al., 2004 Tuesday, June 29, 2010
  • 35.
    Shotgun Sequencing AllowsUse of Alternative Anchors (e.g., RecA) Venter et al., 2004 Tuesday, June 29, 2010
  • 36.
    Weighted % ofClones 0 0.1250 0.2500 0.3750 0.5000 Al ph ap ro t eo ba Be ct t er Tuesday, June 29, 2010 ap ia ro t eo G ba ct am er m ia ap ro t eo Ep ba si ct lo er np ia ro t eo De ba lta ct er pr ia ot eo ba ct C er ia ya no ba ct er ia Fi rm ic ut es Ac tin ob ac te ria C hl or ob i Major Phylogenetic Group C FB Sargasso Phylotypes C hl or of le xi Sp iro ch ae te s Fu so ba ct De er in ia oc oc cu s- Th er Eu m ry us ar ch ae ot C a re n ar ch ae ot a Shotgun Sequencing Allows Use of Other Markers Venter et al., 2004 EFG EFTu rRNA RecA RpoB HSP70
  • 37.
    Functional Inference from Metagenomics • Can work well for individual genes • Predicting “community” function is challenging because treating community as a bag of genes does not work well • Better to “compartmentalize” data ... Tuesday, June 29, 2010
  • 38.
    Binning challenge A T B U C V D W E X F Y G Z Tuesday, June 29, 2010
  • 39.
    Binning challenge A T B U C V D W E X F Y G Best binning method: reference genomes Z Tuesday, June 29, 2010
  • 40.
    Reference Genomes Comingfrom Select Environment Tuesday, June 29, 2010
  • 41.
    Binning challenge A T B U C V D W E X F Y G No reference genome? What do you do? Z Tuesday, June 29, 2010
  • 42.
    Binning challenge A T B U C V D W E X F Y G No reference genome? What do you do? Z Assembly? Composition? Get more references? Tuesday, June 29, 2010
  • 43.
    Binning challenge A T B U C V D W E X F Y G No reference genome? What do you do? Z Phylogeny .... Tuesday, June 29, 2010
  • 44.
    Metagenomic challenges • Massive amounts of data from next-gen • Need for full automation but –Data fragmentary –BLAST insufficient –Automation of phylogenetic methods a bit better for protein coding genes b/c alignments better –Reference databases incomplete Tuesday, June 29, 2010
  • 45.
    Known Unknowns 1: GEBA Past Tuesday, June 29, 2010
  • 46.
    Microbial genomes From http://genomesonline.org Tuesday, June 29, 2010
  • 47.
    Proteobacteria 2002 TM6 OS-K Acidobacteria • At least 40 Termite Group OP8 phyla of Nitrospira Bacteroides bacteria Chlorobi Fibrobacteres Marine GroupA WS3 Gemmimonas Firmicutes Fusobacteria Actinobacteria OP9 Cyanobacteria Synergistes Deferribacteres Chrysiogenetes NKB19 Verrucomicrobia Chlamydia OP3 Planctomycetes Spriochaetes Coprothmermobacter OP10 Thermomicrobia Chloroflexi TM7 Deinococcus-Thermus Dictyoglomus Aquificae Thermudesulfobacteria Thermotogae OP1 Based on Hugenholtz, OP11 2002 Tuesday, June 29, 2010
  • 48.
    2002 Proteobacteria TM6 OS-K • At least 40 Acidobacteria Termite Group OP8 phyla of Nitrospira Bacteroides bacteria Chlorobi Fibrobacteres Marine GroupA • Genome WS3 Gemmimonas sequences are Firmicutes Fusobacteria mostly from Actinobacteria OP9 Cyanobacteria three phyla Synergistes Deferribacteres Chrysiogenetes NKB19 Verrucomicrobia Chlamydia OP3 Planctomycetes Spriochaetes Coprothmermobacter OP10 Thermomicrobia Chloroflexi TM7 Deinococcus-Thermus Dictyoglomus Aquificae Thermudesulfobacteria Thermotogae OP1 Based on Hugenholtz, OP11 2002 Tuesday, June 29, 2010
  • 49.
    2002 Proteobacteria TM6 OS-K • At least 40 Acidobacteria Termite Group OP8 phyla of Nitrospira Bacteroides bacteria Chlorobi Fibrobacteres Marine GroupA • Genome WS3 Gemmimonas sequences are Firmicutes Fusobacteria mostly from Actinobacteria OP9 Cyanobacteria three phyla Synergistes Deferribacteres Chrysiogenetes • Some other NKB19 Verrucomicrobia Chlamydia phyla are only OP3 Planctomycetes Spriochaetes sparsely Coprothmermobacter OP10 sampled Thermomicrobia Chloroflexi TM7 Deinococcus-Thermus Dictyoglomus Aquificae Thermudesulfobacteria Thermotogae OP1 Based on Hugenholtz, OP11 2002 Tuesday, June 29, 2010
  • 50.
    2002 Proteobacteria TM6 OS-K • At least 40 Acidobacteria Termite Group OP8 phyla of Nitrospira Bacteroides bacteria Chlorobi Fibrobacteres Marine GroupA • Genome WS3 Gemmimonas sequences are Firmicutes Fusobacteria mostly from Actinobacteria OP9 Cyanobacteria three phyla Synergistes Deferribacteres Chrysiogenetes • Some other NKB19 Verrucomicrobia Chlamydia phyla are only OP3 Planctomycetes Spriochaetes sparsely Coprothmermobacter OP10 sampled Thermomicrobia Chloroflexi TM7 • Same trend in Deinococcus-Thermus Dictyoglomus Aquificae Archaea Thermudesulfobacteria Thermotogae OP1 Based on Hugenholtz, OP11 2002 Tuesday, June 29, 2010
  • 51.
    2002 Proteobacteria TM6 OS-K • At least 40 Acidobacteria Termite Group OP8 phyla of Nitrospira Bacteroides bacteria Chlorobi Fibrobacteres Marine GroupA • Genome WS3 Gemmimonas sequences are Firmicutes Fusobacteria mostly from Actinobacteria OP9 Cyanobacteria three phyla Synergistes Deferribacteres Chrysiogenetes • Some other NKB19 Verrucomicrobia Chlamydia phyla are only OP3 Planctomycetes Spriochaetes sparsely Coprothmermobacter OP10 sampled Thermomicrobia Chloroflexi TM7 • Same trend in Deinococcus-Thermus Dictyoglomus Aquificae Eukaryotes Thermudesulfobacteria Thermotogae OP1 Based on Hugenholtz, OP11 2002 Tuesday, June 29, 2010
  • 52.
    The Tree isnot Happy Bacteria Archaea Eukaryotes FIgure from Barton, Eisen et al. “Evolution”, CSHL Press. Based on tree from Pace NR, 2003. Tuesday, June 29, 2010
  • 53.
    Proteobacteria • NSF-funded TM6 • At least 40 phyla OS-K Tree of Life Acidobacteria Termite Group of bacteria OP8 Project Nitrospira • Genome Bacteroides Chlorobi • A genome Fibrobacteres Marine GroupA sequences are from each of WS3 Gemmimonas mostly from eight phyla Firmicutes Fusobacteria three phyla Actinobacteria OP9 Cyanobacteria Synergistes • Some other Deferribacteres Chrysiogenetes phyla are only NKB19 Verrucomicrobia Chlamydia sparsely sampled OP3 Planctomycetes Spriochaetes • Solution I: Coprothmermobacter OP10 sequence more Thermomicrobia Chloroflexi TM7 phyla Deinococcus-Thermus Dictyoglomus Aquificae Eisen & Ward, PIs Thermudesulfobacteria Thermotogae OP1 OP11 Tuesday, June 29, 2010
  • 54.
  • 55.
    The Tree ofLife is Still Angry Bacteria Archaea Eukaryotes FIgure from Barton, Eisen et al. “Evolution”, CSHL Press. Based on tree from Pace NR, 2003. Tuesday, June 29, 2010
  • 56.
    Proteobacteria TM6 OS-K • At least 100 phyla of bacteria Acidobacteria Termite Group OP8 • Genome sequences are mostly Nitrospira Bacteroides from three phyla Chlorobi Fibrobacteres Marine GroupA • Most phyla with cultured WS3 Gemmimonas Firmicutes species are sparsely sampled Fusobacteria Actinobacteria • Lineages with no cultured OP9 Cyanobacteria Synergistes taxa even more poorly Deferribacteres Chrysiogenetes NKB19 sampled Verrucomicrobia Chlamydia OP3 • Solution - use tree to really Planctomycetes Spriochaetes fill gaps Coprothmermobacter OP10 Thermomicrobia Chloroflexi TM7 Deinococcus-Thermus Dictyoglomus Aquificae Well sampled phyla Thermudesulfobacteria Thermotogae OP1 OP11 Tuesday, June 29, 2010
  • 57.
    A Genomic Encyclopediaof Bacteria and Archaea (GEBA) Tuesday, June 29, 2010
  • 58.
    GEBA Pilot ProjectOverview • Identify major branches in rRNA tree for which no genomes are available • Identify branches with a cultured representative in DSMZ • Grow > 200 of these and prep. DNA • Sequence and finish 100 (covering breadth of bacterial/archaea diversity) • Annotate, analyze, release data • Assess benefits of tree guided sequencing Tuesday, June 29, 2010
  • 59.
    GEBA Pilot Project:Components • Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan Eisen, Eddy Rubin, Jim Bristow) • Project management (David Bruce, Eileen Dalin, Lynne Goodwin) • Culture collection and DNA prep (DSMZ, Hans-Peter Klenk) • Sequencing and closure (Eileen Dalin, Susan Lucas, Alla Lapidus, Mat Nolan, Alex Copeland, Cliff Han, Feng Chen, Jan-Fang Cheng) • Annotation and data release (Nikos Kyrpides, Victor Markowitz, et al) • Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu, Victor Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain, Patrik D’Haeseleer, Sean Hooper, Iain Anderson, Amrita Pati, Natalia N. Ivanova, Athanasios Lykidis, Adam Zemla) • Adopt a microbe education project (Cheryl Kerfeld) • Outreach (David Gilbert) • $$$ (DOE, Eddy Rubin, Jim Bristow) Tuesday, June 29, 2010
  • 60.
    GEBA and Openness • All data released as quickly as possible w/ no restrictions to IMG-GEBA; Genbank, etc • Data also available in Biotorrents (http://biotorrents.net) • Individual genome reports published in OA “Standards in Genome Sciences (SIGS)” • 1st GEBA paper in Nature freely available and published using Creative Commons License Tuesday, June 29, 2010
  • 61.
    Known Unknowns 2: GEBA present Tuesday, June 29, 2010
  • 62.
    GEBA Lesson 1 rRNA Tree is Useful for Identifying Phylogenetically Novel Genomes Tuesday, June 29, 2010
  • 63.
    rRNA Tree ofLife Bacteria Archaea Eukaryotes FIgure from Barton, Eisen et al. “Evolution”, CSHL Press. Based on tree from Pace NR, 2003. Tuesday, June 29, 2010
  • 64.
    Network of Life Bacteria Archaea Eukaryotes Figure from Barton, Eisen et al. “Evolution”, CSHL Press. Based on tree from Pace NR, 2003. Tuesday, June 29, 2010
  • 65.
    “Whole Genome” Concatenation Tree w/ AMPHORA See Wu and Eisen, Genome Biology 2008 9: R151 http://bobcat.genomecenter.ucdavis.edu/AMPHORA/ Tuesday, June 29, 2010
  • 66.
    Compare PD inTrees Tuesday, June 29, 2010
  • 67.
    PD of rRNA,Genome Trees Similar From Wu et al. 2009 Nature 462, 1056-1060 Tuesday, June 29, 2010
  • 68.
    GEBA Lesson 2 Phylogeny-driven genome selection helps discover new genetic diversity Tuesday, June 29, 2010
  • 69.
    Network of Life Bacteria Archaea Eukaryotes FIgure from Barton, Eisen et al. “Evolution”, CSHL Press. Based on tree from Pace NR, 2003. Tuesday, June 29, 2010
  • 70.
    Protein Family RarefactionCurves • Take data set of multiple complete genomes • Identify all protein families using MCL • Plot # of genomes vs. # of protein families Tuesday, June 29, 2010
  • 71.
  • 72.
  • 73.
  • 74.
  • 75.
  • 76.
  • 77.
    GEBA Lesson 3 Phylogeny-driven genome selection improves genome annotation Tuesday, June 29, 2010
  • 78.
    Predicting Function • Key step in genome projects • More accurate predictions help guide experimental and computational analyses • Many diverse approaches • Comparative and evolutionary analysis greatly improves most predictions Tuesday, June 29, 2010
  • 79.
    Phylogeny vs. Blast Many methods focus on “top EXAMPLE A METHOD EXAMPLE B blast hits” 2A CHOOSE GENE(S) OF INTEREST 5 1 3 4 3A 2 2B 5 1A 2A 1B 3B IDENTIFY HOMOLOGS 6 ALIGN SEQUENCES 1A 2A 3A 1B 2B 3B 1 2 3 4 5 6 CALCULATE GENE TREE Duplication? 1A 2A 3A 1B 2B 3B 1 2 3 4 5 6 OVERLAY KNOWN FUNCTIONS ONTO TREE But much better to build 1A 2A 3A 1B Duplication? 2B 3B 1 2 3 4 5 6 phylogenetic trees of genes and INFER LIKELY FUNCTION OF GENE(S) OF INTEREST compare to relatives Ambiguous Duplication? Species 1 Species 2 Species 3 1A 1B 2A 2B 3A 3B 1 2 3 4 5 6 ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN) Allows better integration of Duplication evolutionary history (e.g., orthologs and paralogs) Based on Eisen, 1998 Genome Res 8: 163-167. Tuesday, June 29, 2010
  • 80.
    Wu et al.2005 PLoS Genetics 1: e65. Tuesday, June 29, 2010
  • 81.
    Most/All Functional PredictionImproves w/ Better Phylogenetic Sampling • Conversion of hypothetical into conserved hypotheticals • Improved phylogenomics • Linking distantly related members of protein families • Improved non-homology prediction Tuesday, June 29, 2010
  • 82.
    Known Unknowns 3: GEBA future Tuesday, June 29, 2010
  • 83.
    GEBA Future 1 How much further should we go? Tuesday, June 29, 2010
  • 84.
    rRNA Tree ofLife Bacteria Archaea Eukaryotes FIgure from Barton, Eisen et al. “Evolution”, CSHL Press. Based on tree from Pace NR, 2003. Tuesday, June 29, 2010
  • 85.
    Phylogenetic Diversity: Sequenced Bacteria & Archaea From Wu et al. 2009 Tuesday, June 29, 2010
  • 86.
    Phylogenetic Diversity withGEBA From Wu et al. 2009 Tuesday, June 29, 2010
  • 87.
    Phylogenetic Diversity: Isolates From Wu et al. 2009 Tuesday, June 29, 2010
  • 88.
    Phylogenetic Diversity: All From Wu et al. 2009 Tuesday, June 29, 2010
  • 89.
    Proteobacteria TM6 OS-K • At least 40 phyla of bacteria Acidobacteria Termite Group OP8 • Genome sequences are mostly Nitrospira Bacteroides from three phyla Chlorobi Fibrobacteres Marine GroupA • Most phyla with cultured WS3 Gemmimonas Firmicutes species are sparsely sampled Fusobacteria Actinobacteria • Lineages with no cultured OP9 Cyanobacteria Synergistes taxa even more poorly Deferribacteres Chrysiogenetes NKB19 sampled Verrucomicrobia Chlamydia OP3 Planctomycetes Spriochaetes Coprothmermobacter OP10 Thermomicrobia Chloroflexi TM7 Deinococcus-Thermus Dictyoglomus Aquificae Well sampled phyla Thermudesulfobacteria Thermotogae Poorly sampled OP1 OP11 No cultured taxa Tuesday, June 29, 2010
  • 90.
    Proteobacteria TM6 OS-K Acidobacteria Termite Group • At least 40 phyla of bacteria OP8 Nitrospira • Genome sequences are mostly Bacteroides Chlorobi Fibrobacteres from three phyla Marine GroupA WS3 • Most phyla with cultured Gemmimonas Firmicutes species are sparsely sampled Fusobacteria Actinobacteria OP9 • Lineages with no cultured taxa Cyanobacteria Synergistes even more poorly sampled Deferribacteres Chrysiogenetes NKB19 Verrucomicrobia Chlamydia OP3 Planctomycetes Spriochaetes Coprothmermobacter OP10 Thermomicrobia Chloroflexi TM7 Deinococcus-Thermus Dictyoglomus Aquificae Well sampled phyla Thermudesulfobacteria Thermotogae Poorly sampled OP1 OP11 No cultured taxa Tuesday, June 29, 2010
  • 91.
    Uncultured Lineages: Technical Approaches • Get into culture • Enrichment cultures • If abundant in low diversity ecosystems • Flow sorting • Microbeads • Microfluidic sorting • Single cell amplification Tuesday, June 29, 2010
  • 92.
    GEBA Future 2 How many gene families are there? Tuesday, June 29, 2010
  • 93.
  • 94.
  • 95.
    Compare PD inTrees Tuesday, June 29, 2010
  • 96.
    Gene Families vsPD PD vs. Gene Families (per genome) 0.4 0.3 PD/Genome 0.2 0.1 0 0 275 550 825 1100 Gene families / genome Tuesday, June 29, 2010
  • 97.
    How many proteinfamilies? GEBA Genomes PD/Genome ~0.1 PFAMs/Genome Text ~1000 PFAMs/PD ~10000 Total PFAMS ~10,000,000 From Wu et al. 2009 Tuesday, June 29, 2010
  • 98.
    Caveats (of many) • Novel protein families per genome likely taxon specific • Parameters other than PD clearly important • Does not include viruses, eukaryotes Tuesday, June 29, 2010
  • 99.
    GEBA Future 3 Need to better leverage improved phylogenetic sampling Tuesday, June 29, 2010
  • 100.
    Example 1: ProteinFamily Space • Much less biased sampling of protein family space now available • Need to rebuild / reassess many protein family databases (e.g., HMMs) • Structural space Tuesday, June 29, 2010
  • 101.
  • 102.
    As of 2002 Proteobacteria TM6 OS-K • At least 40 Acidobacteria Termite Group OP8 phyla of Nitrospira Bacteroides bacteria Chlorobi Fibrobacteres Marine GroupA WS3 Gemmimonas Firmicutes Fusobacteria Actinobacteria OP9 Cyanobacteria Synergistes Deferribacteres Chrysiogenetes NKB19 Verrucomicrobia Chlamydia OP3 Planctomycetes Spriochaetes Coprothmermobacter OP10 Thermomicrobia Chloroflexi TM7 Deinococcus-Thermus Dictyoglomus Aquificae Thermudesulfobacteria Thermotogae OP1 Based on Hugenholtz, OP11 2002 Tuesday, June 29, 2010
  • 103.
    As of 2002 Proteobacteria TM6 OS-K • At least 40 Acidobacteria Termite Group OP8 phyla of Nitrospira Bacteroides bacteria Chlorobi Fibrobacteres Marine GroupA • Experimental WS3 Gemmimonas studies are Firmicutes Fusobacteria mostly from Actinobacteria OP9 Cyanobacteria three phyla Synergistes Deferribacteres Chrysiogenetes NKB19 Verrucomicrobia Chlamydia OP3 Planctomycetes Spriochaetes Coprothmermobacter OP10 Thermomicrobia Chloroflexi TM7 Deinococcus-Thermus Dictyoglomus Aquificae Thermudesulfobacteria Thermotogae OP1 Based on Hugenholtz, OP11 2002 Tuesday, June 29, 2010
  • 104.
    As of 2002 Proteobacteria TM6 OS-K • At least 40 Acidobacteria Termite Group OP8 phyla of Nitrospira Bacteroides bacteria Chlorobi Fibrobacteres Marine GroupA • Experimental WS3 Gemmimonas studies are Firmicutes Fusobacteria mostly from Actinobacteria OP9 Cyanobacteria three phyla Synergistes Deferribacteres Chrysiogenetes • Some studies NKB19 Verrucomicrobia Chlamydia in other phyla OP3 Planctomycetes Spriochaetes Coprothmermobacter OP10 Thermomicrobia Chloroflexi TM7 Deinococcus-Thermus Dictyoglomus Aquificae Thermudesulfobacteria Thermotogae OP1 Based on Hugenholtz, OP11 2002 Tuesday, June 29, 2010
  • 105.
    As of 2002 Proteobacteria TM6 OS-K • At least 40 Acidobacteria Termite Group OP8 phyla of Nitrospira Bacteroides bacteria Chlorobi Fibrobacteres Marine GroupA • Genome WS3 Gemmimonas sequences are Firmicutes Fusobacteria mostly from Actinobacteria OP9 Cyanobacteria three phyla Synergistes Deferribacteres Chrysiogenetes • Some other NKB19 Verrucomicrobia Chlamydia phyla are only OP3 Planctomycetes Spriochaetes sparsely Coprothmermobacter OP10 sampled Thermomicrobia Chloroflexi TM7 • Same trend in Deinococcus-Thermus Dictyoglomus Aquificae Eukaryotes Thermudesulfobacteria Thermotogae OP1 Based on Hugenholtz, OP11 2002 Tuesday, June 29, 2010
  • 106.
    As of 2002 Proteobacteria TM6 OS-K • At least 40 Acidobacteria Termite Group OP8 phyla of Nitrospira Bacteroides bacteria Chlorobi Fibrobacteres Marine GroupA • Genome WS3 Gemmimonas sequences are Firmicutes Fusobacteria mostly from Actinobacteria OP9 Cyanobacteria three phyla Synergistes Deferribacteres Chrysiogenetes • Some other NKB19 Verrucomicrobia Chlamydia phyla are only OP3 Planctomycetes Spriochaetes sparsely Coprothmermobacter OP10 sampled Thermomicrobia Chloroflexi TM7 • Same trend in Deinococcus-Thermus Dictyoglomus Aquificae Viruses Thermudesulfobacteria Thermotogae OP1 Based on Hugenholtz, OP11 2002 Tuesday, June 29, 2010
  • 107.
    Proteobacteria TM6 OS-K Need Acidobacteria Termite Group OP8 experimental Nitrospira Bacteroides Chlorobi studies from Fibrobacteres Marine GroupA WS3 across the tree Gemmimonas Firmicutes too Fusobacteria Actinobacteria OP9 Cyanobacteria Synergistes Deferribacteres Chrysiogenetes NKB19 Verrucomicrobia Chlamydia OP3 Planctomycetes Spriochaetes 0.1 Coprothmermobacter OP10 Thermomicrobia Chloroflexi TM7 Deinococcus-Thermus Dictyoglomus Aquificae Tree based on Thermudesulfobacteria Thermotogae Hugenholtz (2002) OP1 with some OP11 modifications. Tuesday, June 29, 2010
  • 108.
    Example 3: Improvingthe tree • To make best use of GEBA data we need a better tree Tuesday, June 29, 2010
  • 109.
    Wh Concatenated alignment “whole genome tree” built using AMPHORA Tuesday, June 29, 2010
  • 110.
    Why Wh does the tree matter? Whole genome tree built using AMPHORA by Martin Wu and Dongying Wu Tuesday, June 29, 2010
  • 111.
  • 112.
  • 113.
  • 114.
    Many Alternatives toConcatenation • Gene presence/absence • Supertrees / consensus methods • Separate phylogeny of genes and then integration of results (e.g., networks) • Models that incorporate gain/loss as well as gene phylogeny Tuesday, June 29, 2010
  • 115.
    Example 4: MetagenomicAnalysis Tuesday, June 29, 2010
  • 116.
    Weighted % ofClones 0 0.1250 0.2500 0.3750 0.5000 Al ph ap ro t eo ba Be ct t er Tuesday, June 29, 2010 ap ia ro t eo G ba ct am er m ia ap ro t eo Ep ba si ct lo er np ia ro t eo De ba lta ct er pr ia ot eo ba ct C er ia ya no ba ct er ia Fi rm ic ut es Ac tin ob ac te ria C hl or ob i Major Phylogenetic Group C FB Sargasso Phylotypes C hl or of le xi Sp iro ch ae te s Fu so ba ct De er in ia oc oc cu s- Th er Eu m ry us ar ch Phylogeny for Typing and Binning ae ot C a re n ar ch ae ot a Venter et al., 2004 EFG EFTu rRNA RecA RpoB HSP70
  • 117.
    Weighted % ofClones 0 0.1250 0.2500 0.3750 0.5000 Al ph ap ro t eo ba Be ct t er Tuesday, June 29, 2010 ap ia ro t eo G ba ct am er m ia ap ro t eo Ep ba si ct lo er np ia ro t eo De ba lta ct er pr ia ot eo ba ct C er ia ya no ba ct er ia Fi rm ic ut es Ac tin ob ac te ria C hl or ob i Major Phylogenetic Group C FB Sargasso Phylotypes C hl or of le xi Sp iro ch ae te s Fu so ba ct De er in ia oc Should improve with oc cu s- Th er Eu m better genomic sampling ry us ar ch Phylogeny for Typing and Binning ae ot C a re n ar ch ae ot a Venter et al., 2004 EFG EFTu rRNA RecA RpoB HSP70
  • 118.
    Weighted % ofClones 0 0.1250 0.2500 0.3750 0.5000 Al ph ap ro t eo ba Be ct t er Tuesday, June 29, 2010 ap ia ro t eo G ba ct am er m ia ap ro t eo Ep ba si ct lo er np ia ro t eo De ba lta ct er pr ia ot eo ba ct C er ia ya no ba ct er ia Fi rm ic ut es Ac tin ob ac te ria C hl or ob i Major Phylogenetic Group C FB Sargasso Phylotypes C hl or of le xi Sp iro ch ae te s Fu so ba ct De er in ia oc oc cu Only improved a little s- Th er Eu m ry us ar ch Phylogeny for Typing and Binning ae ot C a re n ar ch ae ot a Venter et al., 2004 EFG EFTu rRNA RecA RpoB HSP70
  • 119.
    How to improvephylogenetic analysis of metagenomic data • Better phylogenetic and OTU methods for fragmented data • Better assessment of which genes to use? • More automation of all methods Tuesday, June 29, 2010
  • 120.
    Phylogenetic challenge How place all in one tree? How identify OTUs including all fragments? Can you analyze more than 1 gene family at a time? Tuesday, June 29, 2010
  • 121.
    Approach 1: Place Reads on Reference Tree • Examples –AMPHORA (Wu and Eisen) –PPlacer (Erik Matsen) –FastTree (Morgan Price) • General approach –Precompute reference tree for full length sequences –Place individual reads on reference tree –Merge trees Tuesday, June 29, 2010
  • 122.
    Variants • Use concatenated alignment of markers not just individual genes (Steven Kembel) • Apply to OTU identification not just classification (Thomas Sharpton) • CoBinning: look for linkage among fragments/genes (Aaron Darling) Tuesday, June 29, 2010
  • 123.
    How to improvephylogenetic analysis of metagenomic data • Better phylogenetic and OTU methods for fragmented data • Better assessment of which genes to use? • More automation Tuesday, June 29, 2010
  • 124.
    New “Marker Genes” • 100 representative genomes, including many GEBAs • MCL gene families • Identify gene families w/ –High universality –High uniformity of copy number –Phylogenetic tree similar to “whole genome tree” Tuesday, June 29, 2010
  • 125.
    Distances between genetrees and the AMPHORA concatenated genome tree rpmA coaE coaE rpmA trmD rplL rpsS rpsQ radA rplR rplD rplQ tsf rpsH frr smpB ttf rpsO rplR rplP rplM rpsS rplI rplV rpsB rplT rpsO rplO mraW rpsP rpsH rpsK rplQ rplU rplL tsf rplT trmD rplE rplS rpsP ttf rplC rpsI rplV mraW rplS rpsL infC rpsG rpsM rplM rplO rplI rplU pyrH rpsL rpsM rpsQ ruvA guaA radA rpsG purA smpB rplK priA rplD rpsK infC rplK rplC serS rplE rplA rplA rplF frr ruvA rplF rpsC serS rplN rplN rplP guaA rpsE ruvB pyrH rpsB rpsI rpsJ secY rRNA16S rpsJ secY purA rplB rplB priA nusA rpsE ruvB rpsC rRNA16S nusA 0 1 2 3 4 5 6 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 NODAL distance SPLIT distance AMPHORA marker Ribosomal protein Transcription/translation related protein DNA repair protein Protein of other function Distance between the genome tree and 100 random trees (average ± standard deviation) Tuesday, June 29, 2010
  • 126.
    Screen gene markersfor different taxonomic groups phylum Genome Gene Number Actinobacteria Number 63 267783 Alphaproteobac 94 347287 teria Betaproteobact 56 266362 eria Gammaproteob 126 483632 acteria Deltaproteobact 25 102115 eria Epislonproteoba 18 33416 cteria Bacteriodes 25 71531 Chlamydae 13 13823 Chloroflexi 10 33577 Cyanobacteria 36 124080 Firmicutes 106 312309 Spirochaetes 18 38832 Thermi 5 14160 Thermotogae 9 17037 Tuesday, June 29, 2010
  • 127.
    Keep only thefamilies with: Universality * Evenness * monophyly >= 90*90*90 Phylogenetic group Genome Number Gene Number Maker Candidates Archaea 62 145415 102 Actinobacteria 63 267783 136 Alphaproteobacteria 94 347287 142 Betaproteobacteria 56 266362 294 Gammaproteobacteria 126 483632 141 Deltaproteobacteria 25 102115 44 Epislonproteobacteria 18 33416 446 Bacteriodes 25 71531 179 Chlamydae 13 13823 561 Chloroflexi 10 33577 140 Cyanobacteria 36 124080 532 Firmicutes 106 312309 80 Spirochaetes 18 38832 72 Thermi 5 14160 727 Thermotogae 9 17037 646 Tuesday, June 29, 2010
  • 128.
    How to improvephylogenetic analysis of metagenomic data • Better phylogenetic and OTU methods for fragmented data • Better assessment of which genes to use? • More automation Tuesday, June 29, 2010
  • 129.
    AMPHORA Guide tree Tuesday, June 29, 2010
  • 130.
    Al ph ap ro Be te Tuesday, June 29, 2010 ta ob G 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 pr ac am ot te m eo ria ap b ac ro te D te ria el ob ta pr ac Ep ot te U si lo eo ria nc ba la np ss ro ct ifi te er ed ob ia Pr ac ot te eo ria ba Cy ct an er ob ia ac Ch te ria la m Ac yd id ia ob e Ba act ct er er ia Ac oi de tin te ob s ac te ria Aq Pl ui an fic ct om ae yc Sp et AMPHORA - each read on its own tree iro es ch ae Fi te rm s ic ut Ch es lo ro U fle nc xi la Ch ss lo ifi ro ed bi Ba ct er ia Phylogenetic Binning Using AMPHORA frr tsf pgk rplL rplF rplP rplT rplE infC rpsI rplS rplA rplB rplK rplC rpsJ rplN rplD rplM rpsE rpsS rpsB rpsK rpsC rpoB rpsM pyrG nusA dnaG rpmA smpB
  • 131.
    Zorro • http://sourceforge.net/projects/probmask/ • ZORRO is a probabilistic masking program that assigns confidence scores to each column in a multiple sequence alignment. These scores can then be used to account for alignment accuracy in phylogenetic inference pipelines • Wu, Chatterji, Eisen submitted Tuesday, June 29, 2010
  • 132.
    Conclusions • Phylogeny-driven sampling produces many benefits immediately • For the most benefits to come, we need to re- direct many informatics efforts to take advantage of less biased data • Still a long way away from full benefits • Note - most of the benefits can come from (aack) - unfinished genomes Tuesday, June 29, 2010
  • 133.
  • 134.
  • 135.
    A Happy Treeof Life Tuesday, June 29, 2010