10. Homework
• Do blastp search with other famous people
associated with Lake Arrowhead Meeting
• JEFFREYHMILLER
• SARAHPALIN and her relationship to fungi
B. fuckeliana
• see http://phylogenomics.blogspot.com/2008/09/tracing-evolutionary-history-of-sarah.html
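One way to prepare such a query is to treat the name as a protein sequence and write it out in FASTA format for pasting into a blastp form. The sketch below is only an illustration: the helper names `nonstandard_letters` and `name_to_fasta` are made up for this example, and it assumes you want to be warned about letters that are not among the 20 standard amino acid codes (for example the J in JEFFREYHMILLER).

```python
# Hypothetical helpers for turning a name into a blastp query (illustration only).

# One-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def nonstandard_letters(name):
    """Return the sorted letters of `name` that are not standard amino acid codes."""
    return sorted({c for c in name.upper()
                   if c.isalpha() and c not in STANDARD_AMINO_ACIDS})


def name_to_fasta(name, width=60):
    """Format `name` (letters only, upper case) as a single FASTA record."""
    seq = "".join(c for c in name.upper() if c.isalpha())
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + seq + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    for name in ("JEFFREYHMILLER", "SARAHPALIN"):
        print(name_to_fasta(name), end="")
        print("# non-standard letters:", nonstandard_letters(name) or "none")
```

The FASTA text can then be pasted into any blastp interface; how a given server treats ambiguity codes such as J is not covered here.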
19. Quotes 2004
• Space-time continuum of genes and genomes
• Gene sequences are the wormhole that allows
one to tunnel into the past
• The human mind can conceive of things with no
basis in physical reality
• Thoughts can go faster than the speed of light
21. Quotes 2006
• The human guts are a real milieu of stuff
• You better kiss everybody
• Microbes not only have a lot of sex, they have a
lot of weird sex
• This is how you do metagenomics on 50
dollars, and that’s Canadian dollars
22. Quotes 2008
• Antibiotics do not kill things, they corrupt them
• There comes a point in life when you have to bring
chemists into the picture
• The rectal swabs are here in tan color
• And there's Jeffrey Dahmer
• We are the environment. We live the phenotype.
• If I have time I will tell you about a dream
• A paper came out next year
23. Quotes 2010
• We have been using this word for many years without actually realizing it
was correct
• "Another thing you need to know" [pause] "Actually you don't NEED to
know any of this"
• I have been influenced by Fisher Price throughout my life
• Don't take that away from us
• It takes 1000 nanobiologists to make one microbiologist
• I am going to wrap up as I hear the crickets chirping
• And we will bring out the unused cheese from yesterday
• In an engineering sense, the vagina is a simple plug flow reactor
• This is going to be ironic coming from someone who studies circumcision
• A little bit about time, but I am going to spend a lot less time on time than
on space
24. Keywords I remember from 2010
• Penis
• Vagina
• Anthrax
• Acne
• Ulcer (multiple kinds)
• Global warming
• Antibiotic resistance
• Virulence
27. rRNA Tree of Life
Bacteria
Archaea
Eukaryotes
Figure from Barton, Eisen et al.
“Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
28. 2002
• At least 40 phyla of bacteria
[Figure: tree of bacterial phyla: Proteobacteria, TM6, OS-K, Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi, Fibrobacteres, Marine Group A, WS3, Gemmimonas, Firmicutes, Fusobacteria, Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres, Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes, Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7, Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria, Thermotogae, OP1, OP11]
Based on Hugenholtz, 2002
29. 2002
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
[Figure: tree of bacterial phyla, as on slide 28]
Based on Hugenholtz, 2002
30. 2002
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
[Figure: tree of bacterial phyla, as on slide 28]
Based on Hugenholtz, 2002
32. Why Increase Phylogenetic Coverage?
• Common approach within some eukaryotic
groups (FGP, NHGRI, etc)
• Many successful small projects to fill in
bacterial or archaeal gaps
• Phylogenetic gaps in bacterial and archaeal
projects commonly lamented in literature
• Many potential benefits
33. NSF-funded Tree of Life Project
• A genome from each of eight phyla
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Solution I: sequence more phyla
Eisen & Ward, PIs
[Figure: tree of bacterial phyla, as on slide 28]
35. NSF-funded Tree of Life Project
• A genome from each of eight phyla
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Still highly biased in terms of the tree
Eisen & Ward, PIs
[Figure: tree of bacterial phyla, as on slide 28]
37. NSF-funded Tree of Life Project
• A genome from each of eight phyla
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Archaea
Eisen & Ward, PIs
[Figure: tree of bacterial phyla, as on slide 28]
38. NSF-funded Tree of Life Project
• A genome from each of eight phyla
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Eukaryotes
Eisen & Ward, PIs
[Figure: tree of bacterial phyla, as on slide 28]
39. NSF-funded Tree of Life Project
• A genome from each of eight phyla
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Viruses
Eisen & Ward, PIs
[Figure: tree of bacterial phyla, as on slide 28]
40. GEBA
• A genomic encyclopedia of bacteria and archaea
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Solution: Really Fill in the Tree
Eisen & Ward, PIs
[Figure: tree of bacterial phyla, as on slide 28]
41. GEBA Pilot Project Overview
• Identify major branches in rRNA tree for
which no genomes are available
• Identify those with a cultured representative in
DSMZ
• DSMZ grew > 200 of these and prepped DNA
• Sequence and finish 100+ (covering breadth of
bacterial/archaeal diversity)
• Annotate, analyze, release data
• Assess benefits of tree-guided sequencing
• First paper: Wu et al., Nature, Dec 2009
42. GEBA Pilot Project: Components
• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan Eisen,
Eddy Rubin, Jim Bristow, Tanya Woyke)
• Project management (David Bruce, Eileen Dalin, Lynne Goodwin)
• Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
• Sequencing and closure (Eileen Dalin, Susan Lucas, Alla Lapidus, Mat
Nolan, Alex Copeland, Cliff Han, Feng Chen, Jan-Fang Cheng)
• Annotation and data release (Nikos Kyrpides, Victor Markowitz, et al)
• Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu, Victor
Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain, Patrik D’Haeseleer,
Sean Hooper, Iain Anderson, Amrita Pati, Natalia N. Ivanova,
Athanasios Lykidis, Adam Zemla)
• Adopt a microbe education project (Cheryl Kerfeld)
• Outreach (David Gilbert)
• $$$ (DOE, DSMZ, GBMF)
43. GEBA and Openness
• All data released as quickly as
possible w/ no restrictions to
IMG-GEBA, GenBank, etc.
• Data also available in
Biotorrents (http://biotorrents.net)
• Individual genome reports
published in OA “Standards in
Genomic Sciences (SIGS)”
• 1st GEBA paper in Nature freely
available and published using
Creative Commons License
44. GEBA Lesson 1
rRNA Tree is Useful for Identifying
Phylogenetically Novel Organisms
45. rRNA Tree of Life
Bacteria
Archaea
Eukaryotes
Figure from Barton, Eisen et al.
“Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
46. Network of Life?
Bacteria
Archaea
Eukaryotes
Figure from Barton, Eisen et al.
“Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
50. Network of Life?
Bacteria
Archaea
Eukaryotes
Figure from Barton, Eisen et al.
“Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
51. Protein Family Rarefaction
Curves
• Take data set of multiple complete genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
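The three steps above can be sketched roughly as follows. This is a minimal illustration, not the analysis used in the talk: it assumes the MCL clustering has already been run and summarised as a mapping from genome name to the set of protein family identifiers found in that genome, and the function name `rarefaction_curve`, the permutation count, and the toy data are all made up for the example.

```python
import random


def rarefaction_curve(genome_families, n_permutations=100, seed=0):
    """Average number of distinct protein families seen after adding 1..N genomes.

    genome_families: dict mapping genome name -> set of family identifiers
    (assumed to come from a prior clustering step such as MCL).
    Genomes are added in random order; counts are averaged over permutations.
    """
    rng = random.Random(seed)
    genomes = list(genome_families)
    totals = [0] * len(genomes)
    for _ in range(n_permutations):
        rng.shuffle(genomes)
        seen = set()
        for i, genome in enumerate(genomes):
            seen |= genome_families[genome]  # families contributed so far
            totals[i] += len(seen)
    return [total / n_permutations for total in totals]


if __name__ == "__main__":
    # Toy data: three genomes sharing some families.
    toy = {"g1": {"A", "B"}, "g2": {"B", "C"}, "g3": {"A", "B", "C"}}
    for n, families in enumerate(rarefaction_curve(toy), start=1):
        print(n, "genomes ->", round(families, 2), "families")
```

Plotting the returned list against 1..N gives the curve; a curve that keeps climbing suggests that new genomes are still contributing new families.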
58. Phylogenetic Distribution Novelty:
Bacterial Actin Related Protein
[Figure: phylogenetic tree of actin and actin-related proteins with bootstrap support values and a 0.5 scale bar. Labelled clades: ACTIN, ARP1, ARP2, ARP3, BARP, ARP4, ARP5, ARP6, ARP7, ARP10. The BARP sequence is H. ochraceum gi227395998.]
Haliangium ochraceum DSM 14365
Patrik D’Haeseleer, Adam Zemla, Victor Kunin
See also Guljamow et al. 2007 Current Biology.
60. Most/All Functional Prediction Improves
w/ Better Phylogenetic Sampling
• Took 56 GEBA genomes and compared results vs. 56
randomly sampled new genomes
• Better definition of protein family sequence “patterns”
• Greatly improves “comparative” and “evolutionary”
based predictions
• Conversion of hypothetical into conserved hypotheticals
• Linking distantly related members of protein families
• Improved non-homology prediction
Kostas Mavrommatis, Natalia Ivanova, Thanos Lykidis, Nikos Kyrpides, Iain Anderson
63. GEBA Lesson 5
Phylogeny-driven genome selection
improves analysis of metagenome data
64. Uses of phylogenetic classification in metagenomics
• Assigning reads to phylogenetic groups using multiple genes
• Phylogenetic binning
• Phylogenetic ecology
- especially important if no reference genomes
[Figure: Sargasso Phylotypes. Bar charts of weighted % of clones by major phylogenetic group, one series per marker gene: frr, tsf, pgk, rplL, rplF, rplP, rplT, rplE, infC, rpsI, rplS, rplA, rplB, rplK, rplC, rpsJ, rplN, rplD, rplM, rpsE, rpsS, rpsB, rpsK, rpsC, rpoB, rpsM, pyrG, nusA, dnaG, rpmA, smpB]
65. Uses of phylogenetic classification in metagenomics
• Assigning reads to phylogenetic groups using multiple genes
• Phylogenetic binning
• Phylogenetic ecology
- especially important if no reference genomes
Limited by poor genomic sampling in past
[Figure: Sargasso Phylotypes bar charts of weighted % of clones by major phylogenetic group, as on slide 64]
66. Metagenomic Analysis Improves
w/ Phylogenetic Sampling
• Small but real improvements in
–Gene identification / confirmation
–Functional prediction
–Binning
–Phylogenetic classification
67. Metagenomic Analysis Improves
w/ Phylogenetic Sampling
• Small but real improvements in
–Gene identification / confirmation
–Functional prediction
–Binning
–Phylogenetic classification
• But not a lot ...
68. GEBA Future 1
Need to adapt genomic and
metagenomic methods to make use of
GEBA data
69. Phylogenetic Binning Using AMPHORA
AMPHORA - each read on its own tree
Improves with better phylogenetic methods
[Figure: bar chart (y-axis 0 to 0.7) by phylogenetic group: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified Proteobacteria, Cyanobacteria, Chlamydiae, Acidobacteria, Bacteroidetes, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified Bacteria. One series per marker gene: frr, tsf, pgk, rplL, rplF, rplP, rplT, rplE, infC, rpsI, rplS, rplA, rplB, rplK, rplC, rpsJ, rplN, rplD, rplM, rpsE, rpsS, rpsB, rpsK, rpsC, rpoB, rpsM, pyrG, nusA, dnaG, rpmA, smpB]
70. Improving Phylogeny for
Metagenomic Reads
• Examples using reference trees
– AMPHORA (Wu and Eisen)
– PPlacer (Erik Matsen)
– FastTree (Morgan Price)
• Variants
– Use concatenated alignment of markers not just
individual genes (Steven Kembel)
– Apply to OTU identification not just classification
(Thomas Sharpton)
– CoBinning: look for linkage among fragments/genes
(Aaron Darling)
71. Phylogenetic Binning Using AMPHORA
AMPHORA - each read on its own tree
Improves with more gene families
[Figure: bar chart by phylogenetic group, one series per AMPHORA marker gene, as on slide 69]
72. Identifying new markers
• Take all genomes
• All vs. all search
• Identify protein families
• For each family measure
–Evenness in copy number
–Universality
–Phylogenetic congruence with WGT
–Monophyly for superfamilies
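Two of the per-family measures above, universality and evenness in copy number, can be sketched as below. The definitions are assumptions chosen for illustration (universality as the fraction of genomes carrying the family, evenness as the fraction of those genomes carrying exactly one copy); the slides do not say how these were actually scored, and the function names are hypothetical.

```python
def universality(copy_counts):
    """Fraction of genomes that contain at least one member of the family.

    copy_counts: dict mapping genome name -> number of family members found.
    """
    if not copy_counts:
        return 0.0
    present = sum(1 for count in copy_counts.values() if count > 0)
    return present / len(copy_counts)


def single_copy_fraction(copy_counts):
    """Among genomes that have the family, the fraction with exactly one copy.

    Used here as a simple stand-in for 'evenness in copy number' (an assumption).
    """
    present = [count for count in copy_counts.values() if count > 0]
    if not present:
        return 0.0
    return sum(1 for count in present if count == 1) / len(present)


if __name__ == "__main__":
    # Toy family: single copy in two genomes, duplicated in one, absent in one.
    counts = {"g1": 1, "g2": 1, "g3": 2, "g4": 0}
    print("universality:", universality(counts))
    print("single-copy fraction:", single_copy_fraction(counts))
```

A candidate marker would score high on both; congruence with the genome tree and monophyly need tree comparisons and are not covered by this sketch.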
73. Distances between gene trees and the AMPHORA concatenated genome tree
[Figure: two bar-chart panels, one bar per gene. Left panel x-axis: NODAL distance (0 to 6). Right panel x-axis: SPLIT distance (0 to 0.9). Legend: AMPHORA marker; Ribosomal protein; Transcription/translation related protein; DNA repair protein; Protein of other function. Reference line: distance between the genome tree and 100 random trees (average ± standard deviation).]
Genes, top to bottom, NODAL panel: rpmA, coaE, trmD, rpsS, radA, rplD, tsf, frr, ttf, rplR, rplM, rplI, rpsB, rpsO, mraW, rpsH, rplQ, rplL, rplT, rplE, rpsP, rplC, rplV, rplS, infC, rpsM, rplO, rplU, rpsL, rpsQ, guaA, rpsG, smpB, priA, rpsK, rplK, serS, rplA, rplF, ruvA, rpsC, rplN, rplP, rpsE, pyrH, rpsI, secY, rpsJ, purA, rplB, nusA, ruvB, rRNA16S
Genes, top to bottom, SPLIT panel: coaE, rpmA, rplL, rpsQ, rplR, rplQ, rpsH, smpB, rpsO, rplP, rpsS, rplV, rplT, rplO, rpsP, rpsK, rplU, tsf, trmD, rplS, ttf, rpsI, mraW, rpsL, rpsG, rplM, rplI, pyrH, rpsM, ruvA, radA, purA, rplK, rplD, infC, rplC, rplE, rplA, frr, rplF, serS, rplN, guaA, ruvB, rpsB, rpsJ, rRNA16S, secY, rplB, priA, rpsE, rpsC, nusA
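For readers unfamiliar with split-based tree comparison, the sketch below shows one common way to compute such a distance. It is an assumption that this matches the SPLIT distance in the figure: here the distance is the number of bipartitions found in only one of the two trees, divided by the total number of non-trivial bipartitions in both. Trees are written as nested tuples of leaf names purely for illustration, and both trees are assumed to share the same leaf set.

```python
def leaves(tree):
    """Return the set of leaf names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions (splits) of the leaf set induced by a tree."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        other = all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if len(clade) > 1 and len(other) > 1:
            found.add(frozenset([clade, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def split_distance(tree_a, tree_b):
    """Fraction of splits not shared by the two trees (0 = identical topology)."""
    splits_a, splits_b = splits(tree_a), splits(tree_b)
    total = len(splits_a) + len(splits_b)
    if total == 0:
        return 0.0
    return len(splits_a ^ splits_b) / total


if __name__ == "__main__":
    genome_tree = ((("A", "B"), "C"), ("D", "E"))
    gene_tree = ((("A", "C"), "B"), ("D", "E"))
    print(split_distance(genome_tree, gene_tree))
```

Dedicated phylogenetics packages handle branch lengths, differing leaf sets, and large trees far better than this toy version.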
74. Identifying new phylogenetic
markers within phyla
• Take all genomes within a phylum
• All vs. all search
• Identify protein families
• For each family measure
–Evenness in copy number
–Universality
–Phylogenetic congruence with WGT
–Monophyly for superfamilies
76. Phylogenetic Binning Using AMPHORA
AMPHORA - each read on its own tree
Other needs?
[Figure: bar chart by phylogenetic group, one series per AMPHORA marker gene, as on slide 69]
77. Other Ways to Make Better Use
of GEBA Data
• Rebuild protein family models
• Experiments from across the tree needed
• Need better phylogenies, including HGT
• Improved tools for using distantly related
genomes in metagenomic analysis
• Better recording and sharing of metadata
about organisms
84. Fantasy analysis of # PFAMs
GEBA Genomes
• PD/Genome: ~0.1
• PFAMs/Genome: ~1000
• PFAMs/PD: ~10000
• Total PFAMs: ~10,000,000
From Wu et al. 2009
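The arithmetic behind these round numbers can be checked with a few lines. The sketch below only uses the figures on the slide; the implied total phylogenetic diversity (about 1000 PD units) and the implied number of genomes are back-calculated here and are not stated in the source, so treat them as assumptions. The function name `extrapolate_pfams` is made up for this example.

```python
def extrapolate_pfams(pd_per_genome, pfams_per_genome, total_pfams):
    """Reproduce the slide's back-of-the-envelope extrapolation.

    Returns the PFAMs discovered per unit of phylogenetic diversity (PD),
    plus the total PD and genome count implied by the total PFAM estimate.
    """
    pfams_per_pd = pfams_per_genome / pd_per_genome
    implied_total_pd = total_pfams / pfams_per_pd        # not on the slide
    implied_genomes = implied_total_pd / pd_per_genome   # not on the slide
    return {
        "pfams_per_pd": pfams_per_pd,
        "implied_total_pd": implied_total_pd,
        "implied_genomes": implied_genomes,
    }


if __name__ == "__main__":
    result = extrapolate_pfams(pd_per_genome=0.1,
                               pfams_per_genome=1000,
                               total_pfams=10_000_000)
    for key, value in result.items():
        print(key, "~", round(value))
```

With the slide's inputs, PFAMs/PD comes out at roughly 10,000, consistent with the value shown.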
85. Conclusions
• Sequencing phylogenetically novel genomes
has many benefits
• To obtain the most benefits, we need to change
and adapt: computationally and
experimentally
• Most of the phylogenetic diversity of microbes
remains to be sampled
• Long live the Lake Arrowhead Microbial
Genomes meeting
88. GEBA
• A genomic encyclopedia of bacteria and archaea
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Solution: Really Fill in the Tree
Eisen & Ward, PIs
[Figure: tree of bacterial phyla, as on slide 28]
89. Thanks
Institutions: JGI etc, UC Davis, DSMZ, TIGR
$$$$: DOE, NSF, GBMF
People: Dongying Wu, Phil Hugenholtz, Nikos Kyrpides, Hans-Peter Klenk, Eddy Rubin
Figure from Barton, Eisen et al. “Evolution”, CSHL Press. Based on tree from Pace NR, 2003.
Editor's Notes
Gets better with more markers, but we do not have lots of sequences for these markers. We can get them from genomes. The more diverse the genomes, the better the marker set will be.