Phylogeny-Driven Approaches to Genomics and Metagenomics              June 23, 2012    Canadian Society for Microbiology  ...
Acknowledgements• $$$  •   DOE  •   NSF  •   GBMF  •   Sloan  •   DARPA  •   DSMZ  •   DHS• People, places  • DOE JGI: Edd...
Phylogeny: What is it?
Phylogeny: What is it?• Phylogeny is a description of  the evolutionary history of  relationships among organisms  (or the...
Whatever the History: Trying to Incorporate it is Critical (from Lake et al. doi: 10.1098/rstb.2009.0035)
Phylogeny            • Applies to             • Species             • Genes             • Genomes
Phylogeny: What is it good for?
Phylogeny: What is it good for? Uses of Phylogeny in Genomics and Metagenomics
Uses of Phylogeny in Genomics and Metagenomics. Example 1: Phylotyping
rRNA Phylotyping                DNA                extraction                              PCR                            ...
rRNA Phylotyping          • Collect DNA from            environment          • PCR amplify rRNA            genes using bro...
Era IV: Genomes in Environment shotgun sequence Metagenomics
rRNA Phylotyping in Sargasso. Venter et al., Science 304: 66. 2004
RecA Phylotyping in Sargasso Data. Venter et al., Science 304: 66. 2004
Weighted % of Clones (figure: Sargasso phylotypes by major phylogenetic group) ...
Side benefit: binning
Metagenomics
Binning challenge
Binning challenge. Best binning method: reference genomes
Binning challenge. Best binning method: reference genomes
Binning challenge. No reference genome? What do you do?
Binning challenge. No reference genome? What do you do? Composition, Assembly, others
Binning challenge. No reference genome? What do you do? Phylogeny
Sulcia makes amino acids. Baumannia makes vitamins and cofactors. Wu et al. 2006 PLoS Biology 4: e188.
CFB Phyla
Side benefit II: PG Ecology
rRNA survey              • Sequence                rRNAs              • Cluster
rRNA survey (OTU1, OTU2, OTU3, OTU4 ...) • Sequence rRNAs • Cluster ...
OTUs on Tree         OTU1         OTU5  OTU4          OTU6  OTU2  OTU3   OTU7     OTU9                 OTU8     OTU10
OTUs on Tree (OTU1, OTU5, OTU4, OTU6 ...) • Clades • Rates of change ...
Unifrac (figure text; extraction garbled) ...
Caveat: Not Everything in Groups
RecA, RpoB in GOS                                     GOS 1                                     GOS 2                     ...
Uses of Phylogeny in Genomics and Metagenomics. Example 2: Functional Diversity and Functional Predictions
Predicting Function• Key step in genome projects• More accurate predictions help guide  experimental and computational  an...
Predicting Function• Identification of motifs  – Short regions of sequence similarity that are indicative of    general ac...
From Eisen et al. 1997 Nature Medicine 3: 1076-1078.
Blast Search of H. pylori “MutS”• Blast search pulls up Syn. sp MutS#2 with much higher p  value than other MutS homologs•...
MutL?? Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
Overlaying Functions onto Tree                                                                     MutS2                  ...
PHYLOGENETIC PREDICTION OF GENE FUNCTION: EXAMPLE A, METHOD ...
PHYLOGENETIC PREDICTION OF GENE FUNCTION: EXAMPLE A, METHOD ...
Diversity of Proteorhodopsins                      Venter et al., 2004
Carboxydothermus sporulates       Wu et al. 2005 PLoS Genetics 1: e65.
Wu et al. 2005 PLoS Genetics 1: e65.
Uses of Phylogeny in Genomics and Metagenomics. Example 3: Selecting Organisms for Study
As of 2002   Proteobacteria             TM6             OS-K                                     • At least 40            ...
As of 2002   Proteobacteria             TM6             OS-K                                     • At least 40            ...
As of 2002   Proteobacteria             TM6             OS-K                                     • At least 40            ...
As of 2002   Proteobacteria             TM6             OS-K                                     • At least 40            ...
As of 2002   Proteobacteria             TM6             OS-K                                     • At least 40            ...
GEBA http://www.jgi.doe.gov/programs/GEBA/pilot.html
GEBA: Components• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan  Eisen, Eddy Rubin, Jim Bristow)• Project ma...
GEBA Now • 300+ genomes • Rich sampling of major groups of cultured organisms
GEBA Lesson 1: The rRNA Tree of Life is a Useful Tool. From Wu et al. 2009 Nature 462, 1056-1060
GEBA Lesson 2:               The rRNA Tree of Life is not perfect ...              16s                                    ...
GEBA Lesson 3:     Phylogeny improves genome annotation• Took 56 GEBA genomes and compared results vs. 56  randomly sample...
GEBA Lesson 4: Metadata Important
GEBA Lesson 5: Improves discovering new genetic diversity
Phylogenetic Distribution Novelty:                  Bacterial Actin Related Protein                                       ...
Protein Family Rarefaction• Take data set of multiple complete  genomes• Identify all protein families using MCL• Plot # o...
Wu et al. 2009 Nature 462, 1056-1060
Wu et al. 2009 Nature 462, 1056-1060
Wu et al. 2009 Nature 462, 1056-1060
Wu et al. 2009 Nature 462, 1056-1060
Wu et al. 2009 Nature 462, 1056-1060
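Synapomorphies exist. Wu et al. 2009 Nature 462, 1056-1060
GEBA Lesson 6: Improves Analysis of Uncultured
Weighted % of Clones (figure: Sargasso phylotypes by major phylogenetic group) ...
Weighted % of Clones (figure: Sargasso phylotypes by major phylogenetic group) ...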
• AND THEN ALL OF THEM WERE  DECEIVED
• For each of these areas - need to do a  MUCH better job ...
Improving Phylotyping
Major Issues in Phylotyping: Beyond Moore’s Law, Metagenomics, Short reads
Major Issues in Phylotyping: Beyond Moore’s Law, Metagenomics, Short reads ...
Method 1: Each is an island • Each new sequence is an island • Take reference data • Build alignment, models, trees • Add new...
STAP                                             ss-rRNA Taxonomy Pip       Figure 1. A flow chart of the STAP pipeline.  ...
AMPHORA. Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151. Guide tree
Phylotyping w/ Proteins. Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
Whole Genome Tree. Wu and Eisen Genome Biology 2008 9:R151 doi: 10....
Method 2: Most in the Family
Phylogenetic Challenge          xxxxxxxxxxxxxxxxxxxxxxx        xxxxxx             xxxxxxxxxxxxx                         xx...
Phylogenetic Challenge               xxxxxxxxxxxxxxxxxxxxxxx             xxxxxx             xxxxxxxxxxxxx                 ...
Phylogenetic Challenge               xxxxxxxxxxxxxxxxxxxxxxx             xxxxxx             xxxxxxxxxxxxx                 ...
Phylogenetic Challenge. A single tree with everything?
rRNA in Sargasso Metagenome. Venter et al., Science 304: 66. 2004
STAP All           ss-rRNA Taxonomy Pip                                                          Combine all into         ...
RecA in Sargasso. Venter et al., Science 304: 66. 2004
Weighted % of Clones (figure) ...
Kembel Correction
Method 3: All in the family • Combine new sequences into one tree • Take reference data • Build alignment, models, trees • Add...
Phylogenetic Challenge. A single tree with everything?
Phylogenetic Challenge. A single tree with everything?
PhylOTU: Finding Metagenomic OT Figure 1. PhylOTU Workflow. Computational ...
Phylosift/ pplacer
Method 4: All in the genome • Combine new sequences from different gene families into one tree • Take reference data • Build...
Challenge • Each gene poorly sampled in metagenomes • Can we combine all into a single tree?
Kembel Combiner. Kembel et al. The phylogenetic diversity of metagenomes. PLoS One 2011
Kembel Combiner                  VOL. 73, 2007                                            PHYL                            ...
Improving Phylotyping II • We need to analyze more gene families
Families/PD not uniform    31	                            6
More Markers. Table columns: Phylogenetic group; Genome Number; Gene Number; Marker Candidates. Archae...
Improving Functional Predictions
Improving Functional Predictions • We need to analyze even more gene families
Sifting Families                   Representative                     Genomes                     Extract          New    ...
(Figure panels A, B, C.) Sharpton et al. submitted
Phylogenetic Contrasts
"Phylogeny-driven studies in genomics and metagenomics" talk by Jonathan Eisen at #CSMUBC2012
  1. 1. Phylogeny-Driven Approaches to Genomics and Metagenomics June 23, 2012 Canadian Society for Microbiology Jonathan A. Eisen University of California, Davis @phylogenomics
  2. 2. Acknowledgements • $$$ • DOE • NSF • GBMF • Sloan • DARPA • DSMZ • DHS • People, places • DOE JGI: Eddy Rubin, Phil Hugenholtz, Nikos Kyrpides • UC Davis: Aaron Darling, Dongying Wu, Holly Bik, Russell Neches, Jenna Morgan-Lang • Other: Jessica Green, Katie Pollard, Martin Wu, Tom Slezak, Jack Gilbert, Steven Kembel, J. Craig Venter, Naomi Ward, Hans-Peter Klenk
  3. 3. Phylogeny: What is it?
  4. 4. Phylogeny: What is it? • Phylogeny is a description of the evolutionary history of relationships among organisms (or their parts). • This is frequently portrayed in a diagram called a phylogenetic tree. • Phylogenies can be more complex than a bifurcating tree (e.g., lateral gene transfer, recombination, hybridization).
  5. 5. Whatever the History: Trying to Incorporate it is Critical (from Lake et al. doi: 10.1098/rstb.2009.0035)
  6. 6. Phylogeny • Applies to • Species • Genes • Genomes
  7. 7. Phylogeny: What is it good for?
  8. 8. Phylogeny: What is it good for? Uses of Phylogeny in Genomics and Metagenomics
  9. 9. Uses of Phylogeny in Genomics and Metagenomics. Example 1: Phylotyping
  10. 10. rRNA Phylotyping (figure): DNA extraction, then PCR (makes lots of copies of the rRNA genes in sample), then sequencing of the rRNA genes (rRNA1 to rRNA4). The sequence alignment is the data matrix from which a phylogenetic tree is built, alongside E. coli, human, and yeast rRNAs.
  11. 11. rRNA Phylotyping • Collect DNA from environment • PCR amplify rRNA genes using broad (so-called universal) primers • Sequence • Align to others • Infer evolutionary tree • Unknowns “identified” by placement on tree
  12. 12. Era IV: Genomes in Environment shotgun sequence Metagenomics
  13. 13. rRNA Phylotyping in Sargasso. Venter et al., Science 304: 66. 2004
  14. 14. RecA Phylotyping in Sargasso Data. Venter et al., Science 304: 66. 2004
  15. 15. Sargasso Phylotypes (figure): weighted % of clones (0 to 0.5) by major phylogenetic group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota) for the markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA. Venter et al., Science 304: 66-74. 2004
  16. 16. Side benefit: binning
  17. 17. Metagenomics
  18. 18. Binning challenge
  19. 19. Binning challenge. Best binning method: reference genomes
  20. 20. Binning challenge. Best binning method: reference genomes
  21. 21. Binning challenge. No reference genome? What do you do?
  22. 22. Binning challenge. No reference genome? What do you do? Composition, Assembly, others
  23. 23. Binning challenge. No reference genome? What do you do? Phylogeny
  24. 24. Sulcia makes amino acids. Baumannia makes vitamins and cofactors. Wu et al. 2006 PLoS Biology 4: e188.
  25. 25. CFB Phyla
  26. 26. Side benefit II: PG Ecology
  27. 27. rRNA survey • Sequence rRNAs • Cluster
  28. 28. rRNA survey • Sequence rRNAs • Cluster • Identify “OTUs” (OTU1 to OTU10)
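The "sequence, cluster, identify OTUs" steps on this slide can be illustrated with a small greedy clustering sketch. This is only a toy under stated assumptions: the sequences are taken to be already aligned to equal length, the function names `identity` and `cluster_otus` and the example reads are hypothetical, and the identity threshold is a parameter rather than a value taken from the talk.

```python
def identity(a, b):
    """Fraction of matching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def cluster_otus(seqs, threshold=0.97):
    """Greedy clustering: each sequence joins the first OTU whose seed it matches.

    seqs maps a read name to its aligned sequence. The first member of each
    OTU acts as the seed; a read that matches no seed starts a new OTU.
    """
    otus = []  # list of (seed_sequence, [member names])
    for name, seq in seqs.items():
        for seed, members in otus:
            if identity(seed, seq) >= threshold:
                members.append(name)
                break
        else:
            otus.append((seq, [name]))
    return [members for _, members in otus]


if __name__ == "__main__":
    # Hypothetical toy reads, 10 positions each.
    reads = {
        "r1": "ACGTACGTAC",
        "r2": "ACGTACGTAT",  # one mismatch to r1
        "r3": "TTTTACGTAC",  # three mismatches to r1
    }
    print(cluster_otus(reads, threshold=0.9))
```

Greedy seed-based clustering depends on input order, which is one reason the later slides argue for placing sequences on a tree rather than relying on similarity clusters alone.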
  29. 29. OTUs on Tree OTU1 OTU5 OTU4 OTU6 OTU2 OTU3 OTU7 OTU9 OTU8 OTU10
  30. 30. OTUs on Tree (OTU1 to OTU10) • Clades • Rates of change • LGT • Convergence • Character history
  31. 31. Unifrac (slide shows excerpts from two papers; fragments are reproduced as far as they can be recovered).
Excerpt 1 (left column, partial): "... typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the P test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13). Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both are ..."
Weighted UniFrac. Weighted UniFrac is a new variant of the original unweighted UniFrac measure that weights the branches of a phylogenetic tree based on the abundance of information (Fig. 1B). Weighted UniFrac is thus a quantitative measure of β diversity that can detect changes in how many sequences from each lineage are present, as well as detect changes in which taxa are present. This ability is important because the relative abundance of different kinds of bacteria can be critical for describing community changes. In contrast, the original, unweighted UniFrac (Fig. 1A) is a qualitative β diversity measure because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them). The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:
$$u = \sum_{i}^{n} b_i \times \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right|$$
Here, n is the total number of branches in the tree, b_i is the length of branch i, A_i and B_i are the numbers of sequences that descend from branch i in communities A and B, respectively, and A_T and B_T are the total numbers of sequences in communities A and B, respectively. In order to control for unequal sampling effort, A_i and B_i are divided by A_T and B_T.
If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:
$$D = \sum_{j}^{n} d_j \times \left( \frac{A_j}{A_T} + \frac{B_j}{B_T} \right)$$
Here, d_j is the distance of sequence j from the root, A_j and B_j are the numbers of times the sequences were observed in communities A and B, respectively, and A_T and B_T are the total numbers of sequences from communities A and B, respectively. Clustering with normalized u values treats each sample equally instead of ...
FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.
Excerpt 2. Figure 1. Estimates of Phylogenetic Diversity (PD) and PD Gain (G) for the grey community. The boxes represent taxa from the black, white, and grey communities. (A) PD is the sum of the branches leading to the grey taxa. (B) G is the sum of the branches leading only to the grey taxa. (C) PD rarefaction curves showing the increase in branch length with sampling effort for the intestinal and stool bacteria from three healthy individuals. Aligned 16S rRNA sequences from the three individuals were available with the Supplementary Materials in (Eckburg, et al., 2005). The Arb parsimony insertion tool was used to add the sequences to a tree containing over 9,000 sequences (Hugenholtz, 2002) that is available for download at the rRNA Database Project II website (Maidak, et al., 2001). The curves represent the average values for 50 replicate trials. FEMS Microbiol Rev. Author manuscript; available in PMC 2009 July 1.
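The two weighted UniFrac quantities quoted on this slide, the raw value u and the scaling factor D, can be sketched in a few lines of Python. This is a minimal illustration and not the UniFrac software: the tree encoding (a list of `(branch_length, descendant_tips)` pairs plus a tip-to-root distance table), the function names, and the toy tree are all assumptions made for the example.

```python
def raw_weighted_unifrac(branches, counts_a, counts_b):
    """u = sum over branches of b_i * |A_i/A_T - B_i/B_T|."""
    a_total = sum(counts_a.values())
    b_total = sum(counts_b.values())
    u = 0.0
    for length, tips in branches:
        # A_i and B_i: sequences from each community descending from this branch.
        a_i = sum(counts_a.get(t, 0) for t in tips)
        b_i = sum(counts_b.get(t, 0) for t in tips)
        u += length * abs(a_i / a_total - b_i / b_total)
    return u


def scaling_factor(root_dist, counts_a, counts_b):
    """D = sum over sequences of d_j * (A_j/A_T + B_j/B_T)."""
    a_total = sum(counts_a.values())
    b_total = sum(counts_b.values())
    return sum(
        d * (counts_a.get(tip, 0) / a_total + counts_b.get(tip, 0) / b_total)
        for tip, d in root_dist.items()
    )


def normalized_weighted_unifrac(branches, root_dist, counts_a, counts_b):
    """u / D: 0 for identical communities, 1 for nonoverlapping ones."""
    u = raw_weighted_unifrac(branches, counts_a, counts_b)
    return u / scaling_factor(root_dist, counts_a, counts_b)


if __name__ == "__main__":
    # Hypothetical toy tree ((x:1,y:1):1,z:2), written branch by branch.
    branches = [(1.0, {"x"}), (1.0, {"y"}), (1.0, {"x", "y"}), (2.0, {"z"})]
    root_dist = {"x": 2.0, "y": 2.0, "z": 2.0}
    community_a = {"x": 2}  # two sequences, both tip x
    community_b = {"z": 2}  # two sequences, both tip z
    print(raw_weighted_unifrac(branches, community_a, community_b))
    print(normalized_weighted_unifrac(branches, root_dist, community_a, community_b))
```

In the toy case the two communities share no branch, so the normalized value reaches its maximum; feeding the same community in twice gives 0, matching the range described in the excerpt.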
  32. 32. Caveat: Not Everything in Groups
  33. 33. RecA, RpoB in GOS (figure panels: GOS 1, GOS 2, GOS 3, GOS 4, GOS 5). Wu et al PLoS One 2011
  34. 34. Uses of Phylogeny in Genomics and Metagenomics. Example 2: Functional Diversity and Functional Predictions
  35. 35. Predicting Function • Key step in genome projects • More accurate predictions help guide experimental and computational analyses • Many diverse approaches • All are improved by “phylogenomic” type analyses that integrate both evolutionary reconstructions and an understanding of how new functions evolve
  36. 36. Predicting Function • Identification of motifs – Short regions of sequence similarity that are indicative of general activity – e.g., ATP binding • Homology/similarity based methods – Gene sequence is searched against a database of other sequences – If significantly similar genes are found, their functional information is used • Problem – Genes frequently have similarity to hundreds of motifs and multiple genes, not all with the same function
  37. 37. From Eisen et al. 1997 Nature Medicine 3: 1076-1078.
  38. 38. Blast Search of H. pylori “MutS” • Blast search pulls up Syn. sp MutS#2 with much higher p value than other MutS homologs • Based on this TIGR predicted this species had mismatch repair • Assumes functional constancy. Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
  39. 39. MutL?? Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
  40. 40. Overlaying Functions onto Tree (figure): MutS family tree with the subfamilies MutS1, MutS2, MSH1, MSH2, MSH3, MSH4, MSH5, and MSH6 labeled; tips include Aquae, Strpy, Bacsu, Synsp, Deira, Helpy, Borbu, Ecoli, Yeast, and Human sequences. Based on Eisen, Neigo 1998 Nucl Acids Res 26: 4291-4300.
  41. 41. PHYLOGENETIC PREDICTION OF GENE FUNCTION (figure, Examples A and B). Method: choose gene(s) of interest; identify homologs; align sequences; calculate gene tree; overlay known functions onto tree; infer likely function of gene(s) of interest. Genes 1A, 1B, 2A, 2B, 3A, 3B come from Species 1, 2, and 3 and are related by a duplication; the actual evolution is assumed to be unknown, and some inferences are marked ambiguous. Based on Eisen, 1998 Genome Res 8: 163-167.
  42. 42. PHYLOGENETIC PREDICTION OF GENE FUNCTION (figure, Examples A and B). Method: choose gene(s) of interest; identify homologs; align sequences; calculate gene tree; overlay known functions onto tree; infer likely function of gene(s) of interest. Genes 1A, 1B, 2A, 2B, 3A, 3B come from Species 1, 2, and 3 and are related by a duplication; the actual evolution is assumed to be unknown, and some inferences are marked ambiguous. Based on Eisen, 1998 Genome Res 8:
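The last two steps of the method on this slide, overlaying known functions onto the gene tree and inferring the likely function of a gene of interest, can be sketched as follows. This is a deliberately simplified stand-in, not the published procedure: the gene tree is assumed to be given as nested tuples, the rule "take the smallest clade around the query that contains annotated genes, and report `ambiguous` if they disagree" is an assumption of this sketch, and the labels `function X` and `function Y` are hypothetical.

```python
def leaves(tree):
    """Return all leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out


def predict_function(tree, query, known):
    """Infer a function for `query` from the smallest annotated clade around it.

    Returns the shared function if all annotated genes in that clade agree,
    "ambiguous" if they disagree, or None if no annotation is reachable.
    """
    if isinstance(tree, str):
        return None
    for child in tree:
        if query in leaves(child):
            # Prefer the answer from the smaller (deeper) clade when there is one.
            result = predict_function(child, query, known)
            if result is not None:
                return result
            break
    else:
        return None  # query is not in this tree
    functions = {known[name] for name in leaves(tree) if name != query and name in known}
    if not functions:
        return None
    return functions.pop() if len(functions) == 1 else "ambiguous"


if __name__ == "__main__":
    # Two paralog groups (A and B) arising from a duplication, as on the slide.
    gene_tree = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
    known = {"1A": "function X", "3A": "function X",
             "1B": "function Y", "3B": "function Y"}
    print(predict_function(gene_tree, "2A", known))
    print(predict_function(gene_tree, "2B", known))
```

The point of the toy is the contrast with a best-hit search: the prediction for 2A comes from its own paralog group, not from whichever annotated sequence happens to be most similar.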
  43. 43. Diversity of Proteorhodopsins Venter et al., 2004
  44. 44. Carboxydothermus sporulates Wu et al. 2005 PLoS Genetics 1: e65.
  45. 45. Wu et al. 2005 PLoS Genetics 1: e65.
  46. 46. Uses of Phylogeny in Genomics and Metagenomics. Example 3: Selecting Organisms for Study
  47. 47. As of 2002 • At least 40 phyla of bacteria. (Figure: tree of bacterial phyla: Proteobacteria, TM6, OS-K, Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi, Fibrobacteres, Marine Group A, WS3, Gemmimonas, Firmicutes, Fusobacteria, Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres, Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes, Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7, Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria, Thermotogae, OP1, OP11.) Based on Hugenholtz, 2002
  48. 48. As of 2002 • At least 40 phyla of bacteria • Most genomes from three phyla. (Figure: the same tree of bacterial phyla as on slide 47.) Based on Hugenholtz, 2002
  49. 49. As of 2002 • At least 40 phyla of bacteria • Most genomes from three phyla • Some studies in other phyla. (Figure: the same tree of bacterial phyla as on slide 47.) Based on Hugenholtz, 2002
  50. 50. As of 2002 • At least 40 phyla of bacteria • Most genomes from three phyla • Some other phyla are only sparsely sampled • Same trend in Eukaryotes. (Figure: the same tree of bacterial phyla as on slide 47.) Based on Hugenholtz, 2002
  51. 51. As of 2002 • At least 40 phyla of bacteria • Most genomes from three phyla • Some other phyla are only sparsely sampled • Same trend in Viruses. (Figure: the same tree of bacterial phyla as on slide 47.) Based on Hugenholtz, 2002
  52. 52. GEBA http://www.jgi.doe.gov/programs/GEBA/pilot.html
  53. 53. GEBA: Components • Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan Eisen, Eddy Rubin, Jim Bristow) • Project management (David Bruce, Eileen Dalin, Lynne Goodwin) • Culture collection and DNA prep (DSMZ, Hans-Peter Klenk) • Sequencing and closure (Eileen Dalin, Susan Lucas, Alla Lapidus, Mat Nolan, Alex Copeland, Cliff Han, Feng Chen, Jan-Fang Cheng) • Annotation and data release (Nikos Kyrpides, Victor Markowitz, et al) • Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu, Victor Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain, Patrik D’Haeseleer, Sean Hooper, Iain Anderson, Amrita Pati, Natalia N. Ivanova, Athanasios Lykidis, Adam Zemla) • Adopt a microbe education project (Cheryl Kerfeld) • Outreach (David Gilbert) • $$$ (DOE, Eddy Rubin, Jim Bristow)
  54. 54. GEBA Now • 300+ genomes • Rich sampling of major groups of cultured organisms
  55. 55. GEBA Lesson 1: The rRNA Tree of Life is a Useful Tool. From Wu et al. 2009 Nature 462, 1056-1060
  56. 56. GEBA Lesson 2: The rRNA Tree of Life is not perfect ... (figure panels: 16S; WGT, 23S). Badger et al. 2005 Int J System Evol Microbiol 55: 1021-1026.
  57. 57. GEBA Lesson 3: Phylogeny improves genome annotation • Took 56 GEBA genomes and compared results vs. 56 randomly sampled new genomes • Better definition of protein family sequence “patterns” • Greatly improves “comparative” and “evolutionary” based predictions • Conversion of hypotheticals into conserved hypotheticals • Linking distantly related members of protein families • Improved non-homology prediction
  58. 58. GEBA Lesson 4: Metadata Important
  59. 59. GEBA Lesson 5: Improves discovering new genetic diversity
  60. 60. Phylogenetic Distribution Novelty: Bacterial Actin Related Protein (figure): tree of actin and actin-related protein families (ACTIN, ARP1, ARP2, ARP3, ARP4, ARP5, ARP6, ARP7, ARP10) with bootstrap values and a 0.5 scale bar; the bacterial actin-related protein (BARP, H. ochraceum gi227395998) is from Haliangium ochraceum DSM 14365. Patrik D’haeseleer, Adam Zemla, Victor Kunin. Wu et al. 2009 Nature 462, 1056-1060. See also Guljamow et al. 2007 Current Biology.
  61. 61. Protein Family Rarefaction • Take data set of multiple complete genomes • Identify all protein families using MCL • Plot # of genomes vs. # of protein families
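The rarefaction recipe on this slide (take complete genomes, assign proteins to families, plot number of genomes against number of families) can be sketched as below. The sketch assumes the family assignment has already been done (for example by MCL, as the slide says) and only covers the counting step; the data layout, the function names, and the averaging over random genome orders are assumptions of this example rather than details from the talk.

```python
import random


def accumulation_curve(genome_families, order):
    """Cumulative number of distinct protein families as genomes are added in `order`."""
    seen = set()
    curve = []
    for genome in order:
        seen |= genome_families[genome]
        curve.append(len(seen))
    return curve


def rarefaction(genome_families, trials=100, seed=0):
    """Average the accumulation curve over `trials` random genome orders."""
    rng = random.Random(seed)
    genomes = sorted(genome_families)
    totals = [0] * len(genomes)
    for _ in range(trials):
        order = genomes[:]
        rng.shuffle(order)
        for i, n_families in enumerate(accumulation_curve(genome_families, order)):
            totals[i] += n_families
    return [total / trials for total in totals]


if __name__ == "__main__":
    # Hypothetical toy input: genome name -> set of protein family IDs.
    toy = {
        "genome1": {"fam1", "fam2"},
        "genome2": {"fam2", "fam3"},
        "genome3": {"fam4"},
    }
    for n_genomes, n_families in enumerate(rarefaction(toy), start=1):
        print(n_genomes, n_families)
```

A curve that keeps climbing as genomes are added indicates that new protein families are still being discovered, which is the comparison the following slides make between phylogenetically selected and other genome sets.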
  62. 62. Wu et al. 2009 Nature 462, 1056-1060
  63. 63. Wu et al. 2009 Nature 462, 1056-1060
  64. 64. Wu et al. 2009 Nature 462, 1056-1060
  65. 65. Wu et al. 2009 Nature 462, 1056-1060
  66. 66. Wu et al. 2009 Nature 462, 1056-1060
  67. 67. Synapomorphies exist. Wu et al. 2009 Nature 462, 1056-1060
  68. 68. GEBA Lesson 6: Improves Analysis of Uncultured
  69. 69. Metagenomic Phylotyping (figure: Sargasso phylotypes, weighted % of clones from 0 to 0.5 by major phylogenetic group, for the markers EFG, EFTu, rRNA, RecA, RpoB, and HSP70). Overlay text, word order uncertain after extraction: "GEBA Project improves metagenomic analysis". Venter et al., Science 304: 66-74. 200
  70. 70. Metagenomic Phylotyping (figure: Sargasso phylotypes, weighted % of clones from 0 to 0.5 by major phylogenetic group, for the markers EFG, EFTu, rRNA, RecA, RpoB, and HSP70). Overlay text: "But not a lot". Venter et al., Science 304: 66-74. 200
  71. 71. • AND THEN ALL OF THEM WERE DECEIVED
  72. 72. • For each of these areas - need to do a MUCH better job ...
  73. 73. Improving Phylotyping
  74. 74. Major Issues in PhylotpyingBeyond Moore’s Law Metagenomics Short reads
  75. 75. Major Issues in PhylotpyingBeyond Moore’s Law Metagenomics Short reads WE NEED NEW METHODS
  76. 76. Method 1: Each is an island• Each new sequences is an island• Take reference data• Build alignment, models, trees• Add new sequence to reference alignment and build tree
  77. 77. STAP: Each sequence analyzed separately. Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001 Excerpt: "[...] STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment [...]" Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs. doi:10.1371/journal.pone.0002566.g002 Wu et al. 2008 PLoS One
  78. 78. AMPHORA Guide tree Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
  79. 79. Phylotyping w/ ProteinsWu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
  80. 80. Whole Genome Tree Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
  81. 81. Method 2: Most in the Family
  82. 82. Phylogenetic Challenge xxxxxxxxxxxxxxxxxxxxxxx xxxxxx xxxxxxxxxxxxx xxxxxxxxxxxxxx xxxxxxxxxxxxxx A single tree with everything?
  83. 83. Phylogenetic Challenge xxxxxxxxxxxxxxxxxxxxxxx xxxxxx xxxxxxxxxxxxx xxxxxxxxxxxxxx xxxxxxxxxxxxxx A single tree with everything (as long as there is a lot of overlap)
  84. 84. Phylogenetic Challenge xxxxxxxxxxxxxxxxxxxxxxx xxxxxx xxxxxxxxxxxxx xxxxxxxxxxxxxx xxxxxxxxxxxxxx A single tree with everything (as long as there is a lot of overlap)
  85. 85. Phylogenetic Challenge: A single tree with everything?
  86. 86. rRNA in Sargasso Metagenome Venter et al., Science 304: 66. 2004
  87. 87. STAP: Combine all into one alignment. Figure 1. A flow chart of the STAP pipeline.
  88. 88. RecA in Sargasso Venter et al., Science 304: 66. 2004
  89. 89. Protein vs. rRNA Sargasso Data [Figure: Sargasso Phylotypes. x-axis: Major Phylogenetic Group; y-axis: Weighted % of Clones; series: EFG, EFTu, HSP70, RecA, RpoB, rRNA] Venter et al., Science 304: 66-74. 2004
  90. 90. Kembel Correction
  91. 91. Method 3: All in the family • Combine new sequences into one tree • Take reference data • Build alignment, models, trees • Add all sequences to reference alignment and build tree
  92. 92. Phylogenetic Challenge: A single tree with everything?
  93. 93. Phylogenetic Challenge: A single tree with everything?
  94. 94. PhylOTU. Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this general workflow of PhylOTU. See Results section for details. doi:10.1371/journal.pcbi.1001061.g001 PhylOTU - Sharpton et al. PLoS Comp. Bio 2011
  95. 95. PhyloSift / pplacer
  96. 96. Method 4: All in the genome • Combine new sequences from different gene families into one tree • Take reference data • Build alignment, models • Concatenate • Add all sequences to reference alignment and build tree
  97. 97. Challenge • Each gene poorly sampled in metagenomes • Can we combine all into a single tree?
  98. 98. Kembel Combiner Kembel et al. The phylogenetic diversity of metagenomes. PLoS One 2011
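The "Concatenate" step can be sketched as building a supermatrix: per-family alignments are joined end to end, and any taxon or read missing from a family is padded with gaps. This is a minimal sketch under the assumption that each family has already been aligned; the function name `concatenate_alignments` and the toy sequences are hypothetical.

```python
def concatenate_alignments(family_alignments):
    """Join per-family alignments into one supermatrix.

    family_alignments: list of dicts, each mapping taxon/read name -> aligned
                       sequence (all sequences within one family equal length).
    Names absent from a family are filled with gap characters for that block.
    """
    taxa = sorted({t for aln in family_alignments for t in aln})
    blocks = {t: [] for t in taxa}
    for aln in family_alignments:
        lengths = {len(s) for s in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one family must have equal length")
        width = lengths.pop()
        for t in taxa:
            blocks[t].append(aln.get(t, "-" * width))  # pad missing entries
    return {t: "".join(parts) for t, parts in blocks.items()}

# Hypothetical toy data: two marker families, metagenomic reads hit one each.
recA = {"ref1": "MKV-L", "ref2": "MKVAL", "read7": "MK---"}
rpoB = {"ref1": "GGSA", "ref2": "GGTA", "read9": "--TA"}
matrix = concatenate_alignments([recA, rpoB])
for name, seq in matrix.items():
    print(name, seq)
```

In the toy data, `read7` and `read9` share no aligned columns with each other, yet both land in the same matrix because the reference genomes span both families. That is the sense in which sparse per-gene sampling can still feed a single tree.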
  99. 99. Kembel Combiner. Excerpt shown on slide (VOL. 73, 2007; the page is cropped at the right margin, gaps marked [...]): TABLE 1. Measure: Only presence/absence of taxa considered (Qualitative); Additionally accounts for the no. of times that each taxon was observed (Quantitative). [...]cally defined by a sequence similarity threshold) in the sample as equally related. Newer β diversity measures that incorporate phylogenetic information are more powerful because they account for the degree of divergence between sequences (13 [...] 29, 30). Phylogenetic β diversity measures can also be either quantitative or qualitative depending on whether abundance is taken into account. The original, unweighted UniFrac measure (13) is a qualitative measure. Unweighted UniFrac measures the distance between two communities by calculating the fraction of the branch length in a phylogenetic tree that leads to descendants in either, but not both, of the two communities (Fig. 1A). The fixation index (FST), which measures the distance between two communities by comparing the genetic diversity within each community to the total genetic diversity of the communities combined (18), is a quantitative measure that accounts for different levels of divergence between sequences. The phylogenetic test (P test), which measures the significance of the association between environment and phylogeny (18), is typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the P test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13). Here we describe a quantitative version of UniFrac that we call "weighted UniFrac." We show that weighted UniFrac behaves similarly to the FST test in situations where both [...] FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between [...]
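The unweighted UniFrac definition quoted in the excerpt (the fraction of branch length leading to descendants in either, but not both, of two communities) can be written out directly. The sketch below assumes the tree has been flattened into a list of (branch length, set of descendant leaves) pairs; that representation, the function name, and the toy tree are assumptions made for illustration.

```python
def unweighted_unifrac(branches, community_a, community_b):
    """Unweighted UniFrac distance between two communities.

    branches:    list of (branch_length, set_of_descendant_leaf_names)
    community_a: set of leaf names observed in the first environment
    community_b: set of leaf names observed in the second environment
    Returns unique branch length / total branch length, counting only
    branches that lead to at least one observed leaf.
    """
    unique = 0.0
    total = 0.0
    for length, leaves in branches:
        in_a = bool(leaves & community_a)
        in_b = bool(leaves & community_b)
        if in_a or in_b:
            total += length
            if in_a != in_b:          # descendants in one community only
                unique += length
    return unique / total if total else 0.0

# Hypothetical toy tree ((s1:1,s2:1):1,(s3:1,s4:1):1) flattened to branches.
branches = [
    (1.0, {"s1"}), (1.0, {"s2"}), (1.0, {"s1", "s2"}),
    (1.0, {"s3"}), (1.0, {"s4"}), (1.0, {"s3", "s4"}),
]
mixed = unweighted_unifrac(branches, {"s1", "s3"}, {"s2", "s4"})
separated = unweighted_unifrac(branches, {"s1", "s2"}, {"s3", "s4"})
identical = unweighted_unifrac(branches, {"s1", "s2"}, {"s1", "s2"})
print(mixed, separated, identical)
```

Communities interleaved across both clades share the internal branches and score lower than communities confined to separate clades. The weighted variant described in the excerpt would additionally scale each branch by the difference in relative abundance, which this sketch does not do.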
  100. 100. Improving Phylotyping II • We need to analyze more gene families
  101. 101. Families/PD not uniform 31 6
  102. 102. More Markers
      Phylogenetic group       Genome number   Gene number   Marker candidates
      Archaea                  62              145415        106
      Actinobacteria           63              267783        136
      Alphaproteobacteria      94              347287        121
      Betaproteobacteria       56              266362        311
      Gammaproteobacteria      126             483632        118
      Deltaproteobacteria      25              102115        206
      Epsilonproteobacteria    18              33416         455
      Bacteroides              25              71531         286
      Chlamydiae               13              13823         560
      Chloroflexi              10              33577         323
      Cyanobacteria            36              124080        590
      Firmicutes               106             312309        87
      Spirochaetes             18              38832         176
      Thermi                   5               14160         974
      Thermotogae              9               17037         684
  103. 103. Improving Functional Predictions
  104. 104. Improving Functional Predictions • We need to analyze even more gene families
  105. 105. Sifting Families [Figure 1: workflow. Representative genomes: extract protein annotation, all v. all BLAST, homology clustering (MCL), align and build HMMs, yielding SFams. New genomes: extract protein annotation, screen for homologs with the HMMs.] Sharpton et al. submitted
  106. 106. [Figure panels A, B, C] Sharpton et al. submitted
  107. 107. Phylogenetic Contrasts
