Phylogeny Driven Approaches to Genomic and Metagenomic Studies
    Phylogeny Driven Approaches to Genomic and Metagenomic Studies: Presentation Transcript

    • Searching for novelty using phylogeny-driven approaches to genomics and metagenomics. iPAM, November 15, 2011. Jonathan A. Eisen, University of California, Davis. Wednesday, November 16, 11
    • Phylogeny • Phylogeny is a description of the evolutionary history of relationships among organisms (or their parts). • This is frequently portrayed in a diagram called a phylogenetic tree. • Phylogenies can be more complex than a bifurcating tree (e.g., lateral gene transfer, recombination, hybridization)
    • Whatever the History: Trying to Incorporate it is Critical. Figure: Four Models for Rooting TOL, from Lake et al. doi:10.1098/rstb.2009.0035
    • Evolutionary Rate Variation
    • Uses of Phylogeny • Applies to – Species – Genes – Genomes
    • Uses of Phylogeny in Genomics and Metagenomics. Example 1: Phylotyping
    • rRNA Phylotyping
    • rRNA Phylotyping • Collect DNA from environment • PCR amplify rRNA genes using broad (so-called universal) primers • Sequence • Align to others • Infer evolutionary tree • Unknowns “identified” by placement on tree
    • Data Overload #1 circa 2003 • 1000s of rRNA sequences per sample being generated via Sanger Sequencing • most being classified by BLAST searches and ID of top hit • seemed like a bad idea ...
    • Metagenomics: shotgun sequence
    • STAP. Wu et al. 2008 PLoS One. Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001
    • STAP: Each sequence analyzed separately. Wu et al. 2008 PLoS One. Page excerpt: "... STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment ..." Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs. doi:10.1371/journal.pone.0002566.g002
    • Combine all into one alignment. Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001
    • Metagenomic Phylogenetic challenge: A single tree with everything
    • rRNA Phylotyping in Sargasso Sea Metagenomic Data. Venter et al., Science 304: 66. 2004
    • RecA Phylotyping in Sargasso Data. Venter et al., Science 304: 66. 2004
    • Sargasso Phylotypes. Figure: Weighted % of Clones (y-axis, 0 to 0.500) by Major Phylogenetic Group (x-axis) for the markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA. Venter et al., Science 304: 66-74. 2004
    • Really Weird Stuff Out There
    • rRNA Tree of Life: Bacteria, Archaea, Eukaryotes. Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007. Based on tree from Pace 1997 Science 276:734-740
    • rRNA Tree of Life: Bacteria, Archaea, Eukaryotes, and ?????? Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007. Based on tree from Pace 1997 Science 276:734-740. Wu et al. (2011) PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
    • rRNA Tree of Life: Bacteria, Archaea, Eukaryotes. Scanned through GOS data for rRNAs that fit this pattern. Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007. Based on tree from Pace 1997 Science 276:734-740
    • rRNA Tree of Life: Bacteria, Archaea, Eukaryotes. Found many, but closer examination revealed all to have issues. Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007. Based on tree from Pace 1997 Science 276:734-740
    • rRNA Tree of Life: Bacteria, Archaea, Eukaryotes. RecA???? Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007. Based on tree from Pace 1997 Science 276:734-740
    • RecA (figure labels: GOS 1, GOS 2, GOS 3, GOS 4, GOS 5)
    • RpoB Too
    • rRNA Tree of Life: Bacteria, Archaea, Eukaryotes, and +++++ Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007. Based on tree from Pace 1997 Science 276:734-740
    • Automation for Proteins
    • AMPHORA. Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
    • WGT. Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
    • AMPHORA: Guide tree. Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
    • Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
    • MEGAN analysis of metagenomic data (Resource). Daniel H. Huson, Alexander F. Auch, Ji Qi, and Stephan C. Schuster. Center for Bioinformatics, Tübingen University, Sand 14, 72076 Tübingen, Germany; Center for Comparative Genomics and Bioinformatics, Center for Infectious Disease Dynamics, Penn State University, University Park, Pennsylvania 16802, USA. Downloaded from www.genome.org on January 26, 2007.
      Abstract: Metagenomics is the study of the genomic content of a sample of organisms obtained from a common habitat using targeted or random sequencing. Goals include understanding the extent and role of microbial diversity. The taxonomical content of such a sample is usually estimated by comparison against sequence databases of known sequences. Most published studies use the analysis of paired-end reads, complete sequences of environmental fosmid and BAC clones, or environmental assemblies. Emerging sequencing-by-synthesis technologies with very high throughput are paving the way to low-cost random “shotgun” approaches. This paper introduces MEGAN, a new computer program that allows laptop analysis of large metagenomic data sets. In a preprocessing step, the set of DNA sequences is compared against databases of known sequences using BLAST or another comparison tool. MEGAN is then used to compute and explore the taxonomical content of the data set, employing the NCBI taxonomy to summarize and order the results. A simple lowest common ancestor algorithm assigns reads to taxa such that the taxonomical level of the assigned taxon reflects the level of conservation of the sequence. The software allows large data sets to be dissected without the need for assembly or the targeting of specific phylogenetic markers. It provides graphical and statistical output for comparing different data sets. The approach is applied to several data sets, including the Sargasso Sea data set, a recently published metagenomic data set sampled from a mammoth bone, and several complete microbial genomes. Also, simulations that evaluate the performance of the approach for different read lengths are presented. [MEGAN is freely available at http://www-ab.informatik.uni-tuebingen.de/software/megan.]
      Figure 4. The distribution of reads from Sample 1, pooled Samples 2–4, and the weighted average of these two data sets, over 16 major phylogenetic groups, as computed by MEGAN. For the sake of comparison, the diagram also shows the relative contribution of organisms to these groups, as estimated from Venter et al. (2004) by averaging over the values for all six genes that are reported there.
      Page excerpt (mammoth data set): Here we provide details of the MEGAN analysis, using a bit-score threshold of 30 and discarding any isolated assignments, that is, any taxon that has only a single read assigned to it. The LCA algorithm assigned 50,093 reads to taxa, and 2086 remained unassigned either because the bit-score of their matches fell below the threshold or because they gave rise to an isolated hit. A total of 19,841 reads were assigned to Eukaryota, of which 7969 were assigned to Gnathostomata (jawed vertebrates) and thus presumably derive from mammoth sequences. Furthermore, a total of 16,972 reads were assigned to Bacteria, 761 to Archaea, and 152 to Viruses, respectively.
    • Uses of Phylogeny in Genomics and Metagenomics. Example 2: Phylogenetic Ecology
    • rRNA survey • Sequence rRNAs • Cluster
    • rRNA survey • Sequence rRNAs • Cluster • Identify “OTUs” (figure labels: OTU1 to OTU10)
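The "cluster, then identify OTUs" step can be illustrated with a greedy clustering sketch. The assumptions here are loud ones: sequences are taken to be pre-aligned and of equal length, identity is a simple fraction of matching columns, and the 97% default threshold and the names `identity` and `greedy_otus` are illustrative choices rather than anything stated in the slides.

```python
def identity(a, b):
    """Fraction of matching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Greedy clustering: each sequence joins the first OTU whose centroid
    (the OTU's first member) is at least `threshold` identical, else it
    founds a new OTU. The result depends on input order."""
    centroids = []
    otus = []
    for seq in seqs:
        for idx, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                otus[idx].append(seq)
                break
        else:
            centroids.append(seq)
            otus.append([seq])
    return otus


if __name__ == "__main__":
    reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"]
    for number, members in enumerate(greedy_otus(reads, threshold=0.9), start=1):
        print(f"OTU{number}: {len(members)} sequence(s)")
```

Because OTUs defined this way ignore how the clusters relate to each other, the following slides place them on a tree.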
    • OTUs on Tree (figure labels: OTU1 to OTU10)
    • Metagenomic Phylogenetic challenge: A single tree with everything
    • PhylOTU (Sharpton et al. PLoS Comp. Bio 2011). Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders … workflow of PhylOTU. See Results section for details. doi:10.1371/journal.pcbi.1001061.g001
    • OTUs = Richness. Penn et al., Appl. Environ. Microbiol., p. 1682. FIG. 2. Rarefaction curves for the accumulated coral-associated 16S rRNA gene sequences generated for this study (CGOA, -C, -D, -F, and -G) and the sequences of Rohwer et al. (12, 13). Bars indicate 95% confidence intervals. Statistical resampling was performed using EstimateS.
      Page excerpt: "... containing at least 10% gammaproteobacteria. Sequences falling within the pseudomonad tree (see Fig. S2 in the supplemental material) appear most closely related to the oligotrophic marine gammaproteobacteria (OMG) (2). The lack of a close phylogenetic relationship between representatives of the described major OMG clades and our coral sequences suggests ..."
      Acknowledgments on the page: We thank R/V Atlantis and DSV Alvin personnel, NOAA’s Ocean Exploration Program, and Brad Stevens, Randy Keller, Tom Shirley, and Tom Guilderson for help with data acquisition. Phylogenetic analysis was supported in part by NSF Assembling the Tree of Life grant 0228651 to J.A.E. and N.W.
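A rarefaction curve of the kind shown in the figure can be approximated by repeated random subsampling. The sketch below is an assumption-laden stand-in, not what EstimateS does: it draws reads without replacement, averages the number of distinct OTUs over a fixed number of trials, and omits confidence intervals. The function name and parameters are hypothetical.

```python
import random


def rarefaction_curve(otu_labels, depths, trials=100, seed=0):
    """Mean number of distinct OTUs seen when `depth` reads are drawn
    without replacement, for each depth in `depths`."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        total = 0
        for _ in range(trials):
            total += len(set(rng.sample(otu_labels, depth)))
        curve.append(total / trials)
    return curve


if __name__ == "__main__":
    # One OTU label per sequenced read (toy data).
    reads = ["OTU1"] * 5 + ["OTU2"] * 3 + ["OTU3"] * 2
    for depth, richness in zip([1, 5, 10], rarefaction_curve(reads, [1, 5, 10])):
        print(depth, richness)
```

A curve that is still climbing at the full sampling depth suggests that richness has not been exhaustively sampled.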
    • OTUs on Tree • Clades • Rates of change • LGT • Convergence • Character history (figure labels: OTU1 to OTU10)
    • Unifrac
      Page excerpt: "... typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the P test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13). Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both are ..."
      Weighted UniFrac. Weighted UniFrac is a new variant of the original unweighted UniFrac measure that weights the branches of a phylogenetic tree based on the abundance of information (Fig. 1B). Weighted UniFrac is thus a quantitative measure of diversity that can detect changes in how many sequences from each lineage are present, as well as detect changes in which taxa are present. This ability is important because the relative abundance of different kinds of bacteria can be critical for describing community changes. In contrast, the original, unweighted UniFrac (Fig. 1A) is a qualitative diversity measure because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them). The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:
      $$ u = \sum_{i}^{n} b_i \times \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right| $$
      Here, n is the total number of branches in the tree, b_i is the length of branch i, A_i and B_i are the numbers of sequences that descend from branch i in communities A and B, respectively, and A_T and B_T are the total numbers of sequences in communities A and B, respectively. In order to control for unequal sampling effort, A_i and B_i are divided by A_T and B_T.
      If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:
      $$ D = \sum_{j}^{n} d_j \times \left( \frac{A_j}{A_T} + \frac{B_j}{B_T} \right) $$
      Here, d_j is the distance of sequence j from the root, A_j and B_j are the numbers of times the sequences were observed in communities A and B, respectively, and A_T and B_T are the total numbers of sequences from communities A and B, respectively. Clustering with normalized u values treats each sample equally instead of ...
      FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and the gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.
      Figure 1 (second paper on the slide). Estimates of Phylogenetic Diversity (PD) and PD Gain (G) for the grey community. The boxes represent taxa from the black, white, and grey communities. (A) PD is the sum of the branches leading to the grey taxa. (B) G is the sum of the branches leading only to the grey taxa. (C) PD rarefaction curves showing the increase in branch length with sampling effort for the intestinal and stool bacteria from three healthy individuals. Aligned 16S rRNA sequences from the three individuals were available with the Supplementary Materials in (Eckburg, et al., 2005). The Arb parsimony insertion tool was used to add the sequences to a tree containing over 9,000 sequences (Hugenholtz, 2002) that is available for download at the rRNA Database Project II website (Maidak, et al., 2001). The curves represent the average values for 50 replicate trials. FEMS Microbiol Rev. Author manuscript; available in PMC 2009 July 1.
    • Challenge • Each gene poorly sampled in metagenomes • Can we combine all into a single tree?
    • Kembel et al. PLoS One 2011
    • Figure 3. Taxonomic diversity and standardized phylogenetic diversity versus depth in environmental samples along an oceanic depth gradient at the HOT ALOHA site.
    • Uses of Phylogeny in Genomics and Metagenomics. Example 3: Binning
    • Metagenomics
    • Binning challenge
    • Binning challenge • Best binning method: reference genomes
    • Binning challenge • No reference genome? What do you do?
    • Binning challenge • No reference genome? What do you do? Composition, Assembly, others
    • Binning challenge • No reference genome? What do you do? Phylogeny
    • CFB Phyla
    • Sulcia makes amino acids. Baumannia makes vitamins and cofactors. Wu et al. 2006 PLoS Biology 4: e188.
    • Uses of Phylogeny in Genomics and Metagenomics. Example 4: Functional Diversity and Functional Predictions
    • Predicting Function • Identification of motifs – Short regions of sequence similarity that are indicative of general activity – e.g., ATP binding • Homology/similarity based methods – Gene sequence is searched against a database of other sequences – If significantly similar genes are found, their functional information is used • Problem – Genes frequently have similarity to hundreds of motifs and multiple genes, not all with the same function
    • PHYLOGENETIC PREDICTION OF GENE FUNCTION. Figure with three panels (EXAMPLE A, METHOD, EXAMPLE B). Method steps: CHOOSE GENE(S) OF INTEREST; IDENTIFY HOMOLOGS; ALIGN SEQUENCES; CALCULATE GENE TREE; OVERLAY KNOWN FUNCTIONS ONTO TREE; INFER LIKELY FUNCTION OF GENE(S) OF INTEREST. Example A uses genes 1A, 2A, 3A, 1B, 2B, 3B from Species 1, 2, and 3 (with a possible duplication marked "Duplication?" and one inference marked "Ambiguous"); Example B uses genes 1 to 6. Bottom row: ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN). Based on Eisen, 1998 Genome Res 8: 163-167.
    • rRNA Phylotyping • Collect DNA from environment • PCR amplify rRNA genes using broad (so-called universal) primers • Sequence • Align to others • Infer evolutionary tree • Unknowns “identified” by placement on tree
    • Massive Diversity of Proteorhodopsins. Venter et al., 2004
    • Uses of Phylogeny in Genomics and Metagenomics. Example 5: Selecting Organisms for Study
    • http://www.jgi.doe.gov/programs/GEBA/pilot.html
    • rRNA Tree of Life: Bacteria, Archaea, Eukaryotes. Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007. Based on tree from Pace 1997 Science 276:734-740
    • GEBA Lesson 1: The rRNA Tree of Life is a Useful Tool for Identifying Phylogenetically Novel … (From Wu et al. 2009 Nature 462, 1056-1060)
    • GEBA Lesson 2: The rRNA Tree of Life is not perfect ... (figure labels: 16S; WGT, 23S). Badger et al. 2005 Int J System Evol Microbiol 55: 1021-1026.
    • GEBA Lesson 3: Phylogeny driven genome selection (and phylogenetics) improves genome annotation • Took 56 GEBA genomes and compared results vs. 56 randomly sampled new genomes • Better definition of protein family sequence “patterns” • Greatly improves “comparative” and “evolutionary” based predictions • Conversion of hypotheticals into conserved hypotheticals • Linking distantly related members of protein families • Improved non-homology prediction
    • GEBA Lesson 4: Phylogeny-driven genome selection helps discover new genetic diversity
    • Protein Family Rarefaction Curves • Take data set of multiple complete genomes • Identify all protein families using MCL • Plot # of genomes vs. # of protein families
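The last step of the recipe above (number of genomes versus number of protein families) amounts to a cumulative count of distinct families. The sketch below assumes the family assignment (for example from MCL) has already been done upstream and is supplied as one set of family identifiers per genome; the genome names, family names, and the function name `family_accumulation` are all hypothetical.

```python
def family_accumulation(genome_families, order):
    """Cumulative number of distinct protein families as genomes are
    added in the given order.

    genome_families: {genome_name: set of protein family identifiers}
    order: sequence of genome names defining the x-axis
    """
    seen = set()
    curve = []
    for name in order:
        seen |= genome_families[name]
        curve.append(len(seen))
    return curve


if __name__ == "__main__":
    # Toy input: g1 and g3 are "close relatives", g2 is more distant.
    families = {
        "g1": {"famA", "famB"},
        "g2": {"famB", "famC", "famD"},
        "g3": {"famA"},
    }
    print(family_accumulation(families, ["g1", "g3", "g2"]))  # [2, 2, 4]
    print(family_accumulation(families, ["g1", "g2", "g3"]))  # [2, 4, 4]
```

Comparing curves for differently chosen genome sets (for instance phylogenetically selected versus randomly selected genomes) is one way to see how quickly each set uncovers new families.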
    • Wu et al. 2009 Nature 462, 1056-1060
    • Synapomorphies exist. Wu et al. 2009 Nature 462, 1056-1060
    • Families/PD not uniform
    • GEBA Lesson 5: Improves analysis of genome data from uncultured organisms
    • Shotgun Sequencing Allows Use of Other Markers. Figure: Sargasso Phylotypes, Weighted % of Clones by Major Phylogenetic Group for EFG, EFTu, HSP70, RecA, RpoB, and rRNA. Venter et al., Science 304: 66-74. 2004
    • Shotgun Sequencing Allows Use of Other Markers (same Sargasso Phylotypes figure). Cannot be done without good sampling of genomes. Venter et al., Science 304: 66-74. 2004
    • Phylogenetic Binning (same Sargasso Phylotypes figure). Venter et al., Science 304: 66-74. 2004
    • Shotgun Sequencing Allows Use of Other Markers (same Sargasso Phylotypes figure). Cannot be done without good sampling of genomes. Venter et al., Science 304: 66-74. 2004
    • Shotgun Sequencing Allows Use of Other Markers (same Sargasso Phylotypes figure). GEBA Project improves metagenomic analysis. Venter et al., Science 304: 66-74. 2004
    • Shotgun Sequencing Allows Use of Other Markers (same Sargasso Phylotypes figure). But not a lot. Venter et al., Science 304: 66-74. 2004
    • Phylogeny and Metagenomics Future 1: Need to adapt genomic and metagenomic methods to make better use of data
    • iSEEM Project
    • AMPHORA 2 Coming w/ More Markers
      Phylogenetic group      Genome Number   Gene Number   Marker Candidates
      Archaea                 62              145415        106
      Actinobacteria          63              267783        136
      Alphaproteobacteria     94              347287        121
      Betaproteobacteria      56              266362        311
      Gammaproteobacteria     126             483632        118
      Deltaproteobacteria     25              102115        206
      Epsilonproteobacteria   18              33416         455
      Bacteroides             25              71531         286
      Chlamydiae              13              13823         560
      Chloroflexi             10              33577         323
      Cyanobacteria           36              124080        590
      Firmicutes              106             312309        87
      Spirochaetes            18              38832         176
      Thermi                  5               14160         974
      Thermotogae             9               17037         684
      See posters by Dongying Wu and Guillaume Jospin
    • Build AMPHORA ALL reference tree with concatenated alignment • Align reads that match any of the HMMs to concatenated alignment • Place reads into reference tree one at a time
    • Phylogeny and Metagenomics Future 2: We have still only scratched the surface of microbial diversity
    • rRNA Tree of Life: Bacteria, Archaea, Eukaryotes. Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007. Based on tree from Pace 1997 Science 276:734-740
    • Phylogenetic Diversity: Genomes. From Wu et al. 2009 Nature 462, 1056-1060
    • Phylogenetic Diversity with GEBA. From Wu et al. 2009 Nature 462, 1056-1060
    • Phylogenetic Diversity: Isolates. From Wu et al. 2009 Nature 462, 1056-1060
    • Phylogenetic Diversity: All. From Wu et al. 2009 Nature 462, 1056-1060
    • GEBA uncultured: Number of SAGs from Candidate Phyla (Phil Hugenholtz)
      Site                                   OD1   OP1   OP3   SAR406
      Site A: Hydrothermal vent              4     1     -     -
      Site B: Gold Mine                      6     13    2     -
      Site C: Tropical gyres (Mesopelagic)   -     -     -     2
      Site D: Tropical gyres (Photic zone)   1     -     -     -
      Sample collections at 4 additional sites are underway.
    • Earth Microbiome Project www.earthmicrobiome.org • Goal – to systematically approach the problem of characterizing microbial life on earth • Strategy: – Explore microbes in environmental parameter space – Design ‘ideal’ strategy to interrogate these biomes – Acquire samples and sequence DNA, mRNA, and rRNA broadly and deeply – Define microbial community structure and the protein universe • Gilbert et al., 2010a,b SIGS
    • Phylogenomics Future 3: Need Experiments from Across the Tree of Life too
    • A Happy Tree of Life
    • Acknowledgements • GEBA: DOE-JGI, DSMZ • iSEEM: Katie Pollard, Jessica Green, Martin Wu, Steven Kembel, Tom Sharpton • RecA: Dongying Wu, Craig Venter, Aaron Halpern, Doug Rusch, et al. • Eisen Lab: Aaron Darling, Jenna Morgan, Dongying Wu • $$$ - Moore Foundation, NSF, DOE, DARPA, Sloan Foundation
    • MICROBES