Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Phylogeny-Driven Approaches to
Studies of Microbial and Microbiome
Diversity
Jonathan A. Eisen
University of California, Davis
@phylogenomics
February 7, 2015
UCSB EEMB Graduate Student Symposium

Some Lessons I
Think I Have
Learned

Lesson 1:
Go With Your
Obsessions

Microbial Evolution
Lesson 2:
History Matters

Microbial Evolution
Lesson 2:
History (of
species, genes,
people, science)
Matters

Example I: Lost in Graduate School?

Lost in Graduate School?
Get A Map

Tree from Woese. 1987.
Microbiological Reviews 51:221
Map for Graduate School
Carl Woese

Limited Sampling of RRR Studies

My Study Organisms

H. volcanii Excision Repair
[Figure: H. volcanii UV Repair, 45 J/m2. x axis: Avg. Mol. Wt. (Base Pairs), 0 to 18000. Series: 45 J/m2 Dark 24 Hours; 45 J/m2 Photoreac.; 45 J/m2 t0; 0 J/m2 t0.]
[Image credit: By Grombo - from Wikipedia]
[Figure: UV Survival E. coli vs H. volcanii. x axis: UV J/m2, 0 to 400. y axis: Relative Survival, 1E-07 to 1 (log scale). Series: H. volcanii WFD11; E. coli NR10125 mfd+; E. coli NR10121 mfd-.]
From Eisen 1998. PhD Thesis.

Map for Graduate School
Lesson 3:
Go Fishing Where
Nobody Else Has

Example II: Rice Microbiomes and Phylogeny
Joseph
Edwards
@Bulk_Soil
Sundar
@sundarlab
Cameron
Johnson
Srijak
Bhatnagar
@srijakbhatnagar
Edwards et al. 2015. Structure, variation,
and assembly of the root-associated
microbiomes of rice. PNAS
Supplementary Figures
Fig. S1 Map depicting soil collection locations for greenhouse experiment.
Fig. S2. Sampling and collection of the rhizocompartments. Roots are collected from rice
plants and soil is shaken off the roots to leave ~1 mm of soil around the roots. The ~1 mm of soil

PCR and phylogenetic analysis of rRNA genes
[Workflow diagram: DNA extraction; PCR (makes lots of copies of the rRNA genes in sample); Sequence rRNA genes; Sequence alignment = Data matrix; Phylogenetic tree; Phylogeny.]
Sequences shown:
rRNA1 5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’
rRNA2 5’..TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’
rRNA3 5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’
rRNA4 5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’
Data matrix rows legible on the slide:
rRNA3 C A C T G T
rRNA4 C A C A G T
Yeast T A C A G T
[Tree panels place rRNA1, rRNA2, rRNA3, and rRNA4 relative to E. coli, Humans, and Yeast.]
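The step from data matrix to tree can be sketched in a few lines. This is only an illustration, assuming the three legible data matrix rows (rRNA3, rRNA4, Yeast) and a plain p-distance (fraction of differing columns); the function names are hypothetical and the talk does not say which distance or tree method was used.

```python
from itertools import combinations

# Aligned characters copied from the legible rows of the slide's data matrix.
alignment = {
    "rRNA3": "CACTGT",
    "rRNA4": "CACAGT",
    "Yeast": "TACAGT",
}


def p_distance(a, b):
    """Fraction of aligned columns at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(aln):
    """Pairwise p-distances for every pair of sequences in the alignment."""
    return {(s, t): p_distance(aln[s], aln[t]) for s, t in combinations(aln, 2)}


distances = distance_matrix(alignment)
for (s, t), d in sorted(distances.items(), key=lambda kv: kv[1]):
    print(f"{s} vs {t}: {d:.3f}")
```

A distance matrix like this is the usual input to a distance-based tree method such as neighbor joining; likelihood methods work from the alignment itself.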

STAP
An Automated Phylogenetic Tree-Based Small Subunit
rRNA Taxonomy and Alignment Pipeline (STAP)
Dongying Wu1*, Amber Hartman1,6, Naomi Ward4,5, Jonathan A. Eisen1,2,3
1 UC Davis Genome Center, University of California Davis, Davis, California, United States of America, 2 Section of Evolution and Ecology, College of Biological Sciences,
University of California Davis, Davis, California, United States of America, 3 Department of Medical Microbiology and Immunology, School of Medicine, University of
California Davis, Davis, California, United States of America, 4 Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, United States of America,
5 Center of Marine Biotechnology, Baltimore, Maryland, United States of America, 6 The Johns Hopkins University, Department of Biology, Baltimore, Maryland, United
States of America
Abstract
Comparative analysis of small-subunit ribosomal RNA (ss-rRNA) gene sequences forms the basis for much of what we know
about the phylogenetic diversity of both cultured and uncultured microorganisms. As sequencing costs continue to decline
and throughput increases, sequences of ss-rRNA genes are being obtained at an ever-increasing rate. This increasing flow of
data has opened many new windows into microbial diversity and evolution, and at the same time has created significant
methodological challenges. Those processes which commonly require time-consuming human intervention, such as the
preparation of multiple sequence alignments, simply cannot keep up with the flood of incoming data. Fully automated
methods of analysis are needed. Notably, existing automated methods avoid one or more steps that, though
computationally costly or difficult, we consider to be important. In particular, we regard both the building of multiple
sequence alignments and the performance of high quality phylogenetic analysis to be necessary. We describe here our
fully-automated ss-rRNA taxonomy and alignment pipeline (STAP). It generates both high-quality multiple sequence alignments
and phylogenetic trees, and thus can be used for multiple purposes including phylogenetically-based taxonomic
assignments and analysis of species diversity in environmental samples. The pipeline combines publicly-available packages
(PHYML, BLASTN and CLUSTALW) with our automatic alignment, masking, and tree-parsing programs. Most importantly,
this automated process yields results comparable to those achievable by manual analysis, yet offers speed and capacity that
are unattainable by manual efforts.
Citation: Wu D, Hartman A, Ward N, Eisen JA (2008) An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP). PLoS
ONE 3(7): e2566. doi:10.1371/journal.pone.0002566
multiple alignment and phylogeny was deemed unfeasible.
However, this we believe can compromise the value of the results.
For example, the delineation of OTUs has also been automated
via tools that do not make use of alignments or phylogenetic trees
(e.g., Greengenes). This is usually done by carrying out pairwise
comparisons of sequences and then clustering of sequences that
have better than some cutoff threshold of similarity with each
other). This approach can be powerful (and reasonably efficient)
but it too has limitations. In particular, since multiple sequence
alignments are not used, one cannot carry out standard
phylogenetic analyses. In addition, without multiple sequence
alignments one might end up comparing and contrasting different
regions of a sequence depending on what it is paired with.
The limitations of avoiding multiple sequence alignments and
phylogenetic analysis are readily apparent in tools to classify
sequences. For example, the Ribosomal Database Project’s
Classifier program [29] focuses on composition characteristics of
each sequence (e.g., oligonucleotide frequency) and assigns
taxonomy based upon clustering genes by their composition.
Though this is fast and completely automatable, it can be misled in
cases where distantly related sequences have converged on similar
composition, something known to be a major problem in ss-rRNA
sequences [30]. Other taxonomy assignment systems focus
primarily on the similarity of sequences. The simplest of these is
classification tools it does have some limitations. For example,
the generation of new alignments for each sequence is both
computational costly, and does not take advantage of available
curated alignments that make use of ss-RNA secondary structure
to guide the primary sequence alignment. Perhaps most
importantly however is that the tool is not fully automated. In
addition, it does not generate multiple sequence alignments for all
sequences in a dataset which would be necessary for doing many
analyses.
Automated methods for analyzing rRNA sequences are also
available at the web sites for multiple rRNA centric databases,
such as Greengenes and the Ribosomal Database Project (RDPII).
Though these and other web sites offer diverse powerful tools, they
do have some limitations. For example, not all provide multiple
sequence alignments as output and few use phylogenetic
approaches for taxonomy assignments or other analyses. More
importantly, all provide only web-based interfaces and their
integrated software, (e.g., alignment and taxonomy assignment),
cannot be locally installed by the user. Therefore, the user cannot
take advantage of the speed and computing power of parallel
processing such as is available on linux clusters, or locally alter and
potentially tailor these programs to their individual computing
needs (Table 1).
Given the limited automated tools that are available for
Table 1. Comparison of STAP’s computational abilities relative to existing commonly-used ss-RNA analysis tools.
Feature | STAP | ARB | Greengenes | RDP
Installed where? | Locally | Locally | Web only | Web only
User interface | Command line | GUI | Web portal | Web portal
Parallel processing | YES | NO | NO | NO
Manual curation for taxonomy assignment | NO | YES | NO | NO
Manual curation for alignment | NO | YES | NO* | NO
Open source | YES** | NO | NO | NO
Processing speed | Fast | Slow | Medium | Medium
It is important to note that STAP is the only software that runs on the command line and can take advantage of parallel processing on linux clusters and, further, is
more amenable to downstream code manipulation.
* Note: Greengenes alignment output is compatible with upload into ARB and downstream manual alignment.
** The STAP program itself is open source, the programs it depends on are freely available but not open source.
doi:10.1371/journal.pone.0002566.t001
ss-rRNA Taxonomy Pipeline
Figure 1. A flow chart of the STAP pipeline.
doi:10.1371/journal.pone.0002566.g001
STAP database, and the query sequence is aligned to them using
the CLUSTALW profile alignment algorithm [40] as described
above for domain assignment. By adapting the profile alignment
algorithm, the alignments from the STAP database remain intact,
while gaps are inserted and nucleotides are trimmed for the query
sequence according to the profile defined by the previous
alignments from the databases. Thus the accuracy and quality of
the alignment generated at this step depends heavily on the quality
of the Bacterial/Archaeal ss-rRNA alignments from the
Greengenes project or the Eukaryotic ss-rRNA alignments from
the RDPII project.
Phylogenetic analysis using multiple sequence alignments rests on
the assumption that the residues (nucleotides or amino acids) at the
same position in every sequence in the alignment are homologous.
Thus, columns in the alignment for which ‘‘positional homology’’
cannot be robustly determined must be excluded from subsequent
analyses. This process of evaluating homology and eliminating
questionable columns, known as masking, typically requires time-consuming,
skillful, human intervention. We designed an automated
masking method for ss-rRNA alignments, thus eliminating this
bottleneck in high-throughput processing.
First, an alignment score is calculated for each aligned column
by a method similar to that used in the CLUSTALX package [42].
Specifically, an R-dimensional sequence space representing all the
possible nucleotide character states is defined. Then for each
aligned column, the nucleotide populating that column in each of
the aligned sequences is assigned a score in each of the R
dimensions (Sr) according to the IUB matrix [42]. The consensus
‘‘nucleotide’’ for each column (X) also has R dimensions, with the
Figure 2. Domain assignment. In Step 1, STAP assigns a domain to
each query sequence based on its position in a maximum likelihood
tree of representative ss-rRNA sequences. Because the tree illustrated
here is not rooted, domain assignment would not be accurate and
Dongying Wu
Amber
Hartman Naomi Ward
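The masking idea in the STAP excerpt (score each alignment column, then drop the questionable ones) can be illustrated with a much simpler score than the one the paper describes. The sketch below is an assumption-laden stand-in: it uses gap fraction and majority-residue fraction instead of STAP's IUB-matrix column score, and the function names and thresholds are hypothetical.

```python
from collections import Counter


def column_mask(alignment, max_gap_fraction=0.5, min_majority=0.5):
    """Return one boolean per alignment column; True means keep the column.

    A column is kept when it is not mostly gaps and its most common residue
    is present in at least `min_majority` of the sequences. This is a
    simplified stand-in for a real column score, not STAP's method.
    """
    n_seqs = len(alignment)
    keep = []
    for column in zip(*alignment):
        gaps = column.count("-")
        residues = Counter(c for c in column if c != "-")
        majority = residues.most_common(1)[0][1] / n_seqs if residues else 0.0
        keep.append(gaps / n_seqs <= max_gap_fraction and majority >= min_majority)
    return keep


def apply_mask(alignment, keep):
    """Remove masked columns from every sequence in the alignment."""
    return ["".join(c for c, k in zip(seq, keep) if k) for seq in alignment]


toy = ["ACG-TA", "ACGTTC", "AC--TG", "ATG-TT"]  # made-up aligned sequences
keep = column_mask(toy)
masked = apply_mask(toy, keep)
print(keep)
print(masked)
```

In the toy alignment the fourth column is mostly gaps and the sixth has no majority residue, so both are dropped before any tree would be built.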

WATERS
chimeric sequences generated during PCR, identifying
closely related sets of sequences (also known as operational
taxonomic units or OTUs), removing redundant
sequences above a certain percent identity cutoff, assigning
putative taxonomic identifiers to each sequence or
representative of a group, inferring a phylogenetic tree of
the sequences, and comparing the phylogenetic structure
Figure 1 Overview of WATERS. Schema of WATERS where white
boxes indicate "behind the scenes" analyses that are performed in WATERS.
Quality control files are generated for white boxes, but not otherwise
routinely analyzed. Black arrows indicate that metadata (e.g.,
sample type) has been overlaid on the data for downstream interpretation.
Colored boxes indicate different types of results files that are
generated for the user for further use and biological interpretation.
Colors indicate different types of WATERS actors from Fig. 2 which
were used: green, Diversity metrics, WriteGraphCoordinates, Diversity
graphs; blue, Taxonomy, BuildTree, Rename Trees, Save Trees, CreateUnifrac;
yellow, CreateOtuTable, CreateCytoscape, CreateOTUFile;
white, remaining unnamed actors.
[Workflow boxes: Align; Check chimeras; Cluster; Build Tree; Assign Taxonomy; Tree w/ Taxonomy; Diversity statistics & graphs; Unifrac files; Cytoscape network; OTU table.]
Hartman et al 2010. W.A.T.E.R.S.: a Workflow for the Alignment,
Taxonomy, and Ecology of Ribosomal Sequences. BMC Bioinformatics
2010, 11:317 doi:10.1186/1471-2105-11-317
default is 97% and 99%), and they are also generated for
every metadata variable comparison that the user
includes.
Data pruning
To assist in troubleshooting and quality control,
WATERS returns to the user three fasta files of sequences
that were removed at various steps in the workflow. A
short_sequences.fas file is created that contains all
Figure 3 Biologically similar results automatically produced by WATERS on published colonic microbiota samples. (A) Rarefaction curves
similar to curves shown in Eckburg et al. Fig. 2; 70-72, indicate patient numbers, i.e., 3 different individuals. (B) Weighted Unifrac analysis based on
phylogenetic tree and OTU data produced by WATERS very similar to Eckburg et al. Fig. 3B. (C) Neighbor-joining phylogenetic tree (Quicktree) representing
the sequences analyzed by WATERS, which is clearly similar to Fig. S1 in Eckburg et al.
[Figure panels A and B: PCA plots; axes: Percent variation explained; point labels A, B, C, D, E, F, S.]
[Figure panel C: tree labeled BACTEROIDETES, BACTEROIDALES, DELTAPROTEOBACTERIA, ACTINOBACTERIA, VERRUCOMICROBIA, EPSILONPROTEOBACTERIA, FIRMICUTES, CLOSTRIDIA, CLOSTRIDIALES, GAMMAPROTEOBACTERIA, CYANOBACTERIA, ALPHAPROTEOBACTERIA, FUSOBACTERIA, FIRMICUTES, BACILLI, FIRMICUTES, MOLLICUTES.]
Amber
Hartman
Bertram
Ludaescher
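The OTU step mentioned in the WATERS excerpt (grouping sequences above a percent identity cutoff) can be sketched with a greedy pass over aligned sequences. This is a minimal illustration, not the clustering tool WATERS calls; the function names, the toy sequences, and the 0.9 cutoff used on ten-column toy data are all assumptions.

```python
def identity(a, b):
    """Fraction of identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, cutoff=0.97):
    """Greedy clustering: each sequence joins the first OTU whose
    representative it matches at or above the cutoff, otherwise it
    founds a new OTU and becomes that OTU's representative."""
    otus = []  # list of (representative sequence, member names)
    for name, seq in seqs.items():
        for representative, members in otus:
            if identity(seq, representative) >= cutoff:
                members.append(name)
                break
        else:
            otus.append((seq, [name]))
    return [members for _, members in otus]


toy_seqs = {  # made-up aligned reads, ten columns each
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAA",
    "s3": "TTGTACGTCC",
}
otus = greedy_otus(toy_seqs, cutoff=0.9)
print(otus)
```

Greedy clustering depends on input order, which is one reason alignment-based and tree-based OTU definitions (as in PhylOTU) can give different groupings.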

alignment used to build the profile, resulting in a multiple
sequence alignment of full-length reference sequences and
PD versus PID clustering, 2) to explore overlap between PhylOTU
clusters and recognized taxonomic designations, and 3) to quantify
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generalize
workflow of PhylOTU. See Results section for details.
doi:10.1371/journal.pcbi.1001061.g001
Finding Metagenomic OTUs
Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA,
Pollard KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial
Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput
Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061
PhylOTU
Tom Sharpton
@tjsharpton

QIIME Phylotyping and Phylogenetic Ecology
Fig. S6. A set of 96 OTUs mainly consisting of Proteobacteria is
compartment in the greenhouse experiment. (A) Number of OTU
they belong to that are enriched across all rhizocompartments in the
A subset of the Proteobacteria and the classes and families they belo
enriched across all rhizocompartments in the greenhouse.
https://evomics.org/2014/01/the-glories-of-the-gut-ask-a-fat-mouse/

Lesson 4:
Accept When You
Are Defeated

Rice Microbiome: Variation w/in Plant
Joseph
Edwards
@Bulk_Soil
Sundar
@sundarlab
Cameron
Johnson
Srijak
Bhatnagar
@srijakbhatnagar
growth. For our study, the rhizosphere compartment was com-
Fig. 1. Root-associated microbial communities are separable by rhizocompartment
and soil type. (A) A representation of a rice root cross-section
depicting the locations of the microbial communities sampled. (B) Within-sample
diversity (α-diversity) measurements between rhizospheric compartments
indicate a decreasing gradient in microbial diversity from the rhizosphere
to the endosphere independent of soil type. Estimated species

Rice Genotype Affects Microbiome
rhizocompartments were analyzed as before. Unfortunately,
collection of bulk soil controls for the field experiment was not
Fig. 3. Host plant genotype significantly affects microbial communities in
the rhizospheric compartments. (A) Ordination of CAP analysis using the
WUF metric constrained to rice genotype. (B) Within-sample diversity
measurements of rhizosphere samples of each cultivar grown in each soil.
Estimated species richness was calculated as e^(Shannon entropy). The horizontal
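The richness estimate named in the caption, e raised to the Shannon entropy, is easy to compute from a vector of OTU counts. A minimal sketch, assuming natural-log entropy and made-up counts; the function names are hypothetical.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a vector of OTU counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def effective_richness(counts):
    """e raised to the Shannon entropy: the number of equally abundant
    OTUs that would give the same entropy."""
    return math.exp(shannon_entropy(counts))


even = effective_richness([10, 10, 10, 10])   # four equally abundant OTUs
skewed = effective_richness([97, 1, 1, 1])    # one dominant OTU
print(round(even, 3), round(skewed, 3))
```

Four equally abundant OTUs give a value of 4, while the skewed community with the same four OTUs gives a value much closer to 1.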

Rice: Cultivation Site Effects
Edwards et al. 2015. Structure, variation, and assembly of the
root-associated microbiomes of rice. PNAS
the field plants again showed that the rhizosphere had the
highest microbial diversity, whereas the endosphere had the least
found to be enriche
greenhouse plants (S
OTUs were classifiabl
sisted of taxa in the fa
and Myxococcaceae, al
bidopsis root endosphe
Cultivation Practice Result
The rice fields that we
practices, organic farmi
tion called ecofarming
farming in that chemica
are all permitted but g
harvest fumigants are n
itself does significantly
partments overall (P =
a significant interaction
the rhizocompartments
indicating that the α-d
affected differentially by
the rhizosphere compa
practice, with the mean
zospheres than organic
Dataset S14), whereas
crobial communities (P
tests; Dataset S14). Un
practices are separable a
the WUF metric (Fig.

Rice: Functional Enrichment x Genotype
and mitochondrial) reads to analyze microbial abundance in
the endosphere over time (Fig. 6A). Using this technique, we
confirmed the sterility of seedling roots before transplantation.
(13 d) approach the endosphere and rhizoplane microbiome
compositions for plants that have been grown in the green-
house for 42 d.
Fig. 5. OTU coabundance network reveals modules of OTUs associated with methane cycling. (A) Subset of the entire network corresponding to 11
modules with methane cycling potential. Each node represents one OTU and an edge is drawn between OTUs if they share a Pearson correlation of
greater than or equal to 0.6. (B) Depiction of module 119 showing the relationship between methanogens, syntrophs, methanotrophs, and other
methane cycling taxonomies. Each node represents one OTU and is labeled by the presumed function of that OTU’s taxonomy in methane cycling. An
edge is drawn between two OTUs if they have a Pearson correlation of greater than or equal to 0.6. (C) Mean abundance profile for OTUs in module 119
across all rhizocompartments and field sites. The position along the x axis corresponds to a different field site. Error bars represent SE. The x and y axes
represent no particular scale.
Edwards et al. 2015. Structure, variation, and assembly of the root-associated
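The network rule in the Fig. 5 caption (draw an edge between two OTUs when their Pearson correlation is at least 0.6) can be sketched directly. The abundance profiles below are made up and the function names are hypothetical; this is not the authors' pipeline.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def coabundance_edges(abundance, threshold=0.6):
    """Edges between OTU pairs whose correlation meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(abundance, 2)
        if pearson(abundance[a], abundance[b]) >= threshold
    ]


abundance = {  # hypothetical OTU abundances across four samples
    "otu_a": [1, 2, 3, 4],
    "otu_b": [2, 4, 6, 8],
    "otu_c": [4, 3, 2, 1],
}
edges = coabundance_edges(abundance)
print(edges)
```

Modules such as the methane cycling ones in the figure would then be found by looking for densely connected groups of nodes in the resulting edge list.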

Rice Developmental Time Series
of magnitude greater than in any single plant species
Under controlled greenhouse conditions, the rhizocomp
described the largest source of variation in the microb
munities sampled (Dataset S5A). The pattern of separ
tween the microbial communities in each compar
consistent with a spatial gradient from the bulk soil a
rhizosphere and rhizoplane into the endosphere (F
Similarly, microbial diversity patterns within samples
same pattern where there is a gradient in α-diversity
rhizosphere to the endosphere (Fig. 1B). Enrichment
pletion of certain microbes across the rhizocompartme
cates that microbial colonization of rice roots is not a
process and that plants have the ability to select for ce
crobial consortia or that some microbes are better at f
root colonizing niche. Similar to studies in Arabidopsis, w
that the relative abundance of Proteobacteria is increas
endosphere compared with soil, and that the relative abu
of Acidobacteria and Gemmatimonadetes decrease from
to the endosphere (9–11), suggesting that the distrib
different bacterial phyla inside the roots might be simil
land plants (Fig. 1D and Dataset S6). Under controlle
house conditions, soil type described the second large
of variation within the microbial communities of each
However, the soil source did not affect the pattern of se
between the rhizospheric compartments, suggesting
rhizocompartments exert a recruitment effect on micro
sortia independent of the microbiome source.
By using differential OTU abundance analysis in t
partments, we observed that the rhizosphere serves an
ment role for a subset of microbial OTUs relative to
(Fig. 2). Further, the majority of the OTUs enriche
rhizosphere are simultaneously enriched in the rhizoplan
endosphere of rice roots (Fig. 2B and SI Appendix, Fig
consistent with a recruitment model in which factors pro
the root attract taxa that can colonize the endosphere. W
that the rhizoplane, although enriched for OTUs that
enriched in the endosphere, is also uniquely enriched for
of OTUs, suggesting that the rhizoplane serves as a sp
Edwards et al. 2015. Structure, variation, and assembly of the
root-associated microbiomes of rice. PNAS

Example III: rRNA Not Perfect
Lesson 5:
Nothing is Perfect

Taxa Phylogeny III: rRNA Not Perfect

rRNA Copy # Correction by Phylogeny
Kembel SW, Wu M, Eisen JA, Green JL (2012) Incorporating 16S Gene Copy Number Information Improves Estimates
of Microbial Diversity and Abundance. PLoS Comput Biol 8(10): e1002743. doi:10.1371/journal.pcbi.1002743
Jessica Green
@jessicaleegreen
Steven Kembel
@stevenkembel
Martin Wu
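The core of the copy number correction can be sketched as follows: divide each taxon's read count by its 16S copy number, then renormalize. This is only the arithmetic; in Kembel et al. the copy numbers for unsequenced taxa are estimated from their position in a phylogeny, which this sketch takes as given. Taxon names and copy numbers here are hypothetical.

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Convert 16S read counts to estimated relative organism abundance.

    Each count is divided by that taxon's 16S gene copy number, and the
    results are renormalized to sum to 1.
    """
    corrected = {t: read_counts[t] / copy_numbers[t] for t in read_counts}
    total = sum(corrected.values())
    return {t: value / total for t, value in corrected.items()}


read_counts = {"taxon_x": 70, "taxon_y": 10}   # hypothetical 16S read counts
copy_numbers = {"taxon_x": 7, "taxon_y": 1}    # hypothetical copies per genome
abundance = correct_for_copy_number(read_counts, copy_numbers)
print(abundance)
```

In this toy case the raw reads suggest taxon_x is seven times as abundant as taxon_y, but after correction the two are estimated to be equally abundant.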

DNA
extraction
PCR
Sequence
all genes
Phylogenetic tree
Shotgun
GeneX
E. coli Humans
GeneX
Yeast
GeneX
GeneX
Phylotyping
Phylogeny in Shotgun Metagenomics

RecA vs. rRNA
Eisen 1995 Journal of Molecular Evolution 41: 1105-1123.

RecA vs. rRNA
Eisen 1995 Journal of Molecular Evolution 41: 1105-1123.
Lesson 6:
Keep Going Back
to Your Past

Phylotyping w/ Protein Markers
AMPHORA
http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.7
[Figure: bar chart. y axis: Relative abundance, 0 to 0.8. x axis taxa: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified proteobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Acidobacteria, Thermotogae, Fusobacteria, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified bacteria. Legend (marker genes): dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf.]
Martin Wu

GOS 1
GOS 2
GOS 3
GOS 4
GOS 5
Phylogenetic ID of Novel Lineages
Wu et al PLoS One 2011
Dongying Wu

Phylogenetic Diversity of Metagenomes
typically used as a qualitative measure because duplicate s
quences are usually removed from the tree. However, the
test may be used in a semiquantitative manner if all clone
even those with identical or near-identical sequences, are i
cluded in the tree (13).
Here we describe a quantitative version of UniFrac that w
call “weighted UniFrac.” We show that weighted UniFrac b
haves similarly to the FST test in situations where both a
FIG. 1. Calculation of the unweighted and the weighted UniFrac
measures. Squares and circles represent sequences from two different
environments. (a) In unweighted UniFrac, the distance between the
circle and square communities is calculated as the fraction of the
branch length that has descendants from either the square or the circle
environment (black) but not both (gray). (b) In weighted UniFrac,
branch lengths are weighted by the relative abundance of sequences in
the square and circle communities; square sequences are weighted
twice as much as circle sequences because there are twice as many total
circle sequences in the data set. The width of branches is proportional
to the degree to which each branch is weighted in the calculations, and
gray branches have no weight. Branches 1 and 2 have heavy weights
since the descendants are biased toward the square and circles, respectively.
Branch 3 contributes no value since it has an equal contribution
from circle and square sequences after normalization.
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of
Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
Jessica
Green
Steven
Kembel
Katie
Pollard
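The unweighted UniFrac definition in the caption above (the fraction of branch length leading to descendants from one community but not both) can be sketched on a toy tree. The representation here, a list of branches with their lengths and descendant tips, is an assumption chosen to keep the example short; tip names and branch lengths are made up.

```python
def unweighted_unifrac(branches, community_a, community_b):
    """Unweighted UniFrac between two communities.

    `branches` is a list of (branch_length, set_of_descendant_tips).
    Branches with no descendants in either community are ignored.
    """
    unique = 0.0
    observed = 0.0
    for length, tips in branches:
        in_a = bool(tips & community_a)
        in_b = bool(tips & community_b)
        if in_a or in_b:
            observed += length
            if in_a != in_b:  # descendants from exactly one community
                unique += length
    return unique / observed if observed else 0.0


# Toy tree ((t1:1,t2:1):1,(t3:1,t4:1):1), one entry per branch.
branches = [
    (1.0, {"t1"}),
    (1.0, {"t2"}),
    (1.0, {"t1", "t2"}),
    (1.0, {"t3"}),
    (1.0, {"t4"}),
    (1.0, {"t3", "t4"}),
]
shared = unweighted_unifrac(branches, {"t1", "t3"}, {"t2", "t3"})
disjoint = unweighted_unifrac(branches, {"t1", "t2"}, {"t3", "t4"})
print(shared, disjoint)
```

The weighted version described in the caption would additionally scale each branch by the difference in relative abundance of its descendants in the two communities.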

Phylosift/ pplacer Workflow
[Workflow diagram. Input Sequences: each input sequence scanned against both workflows (rRNA workflow, protein workflow). Boxes: LAST (fast candidate search); LAST (search input against references); hmmalign (multiple alignment); Infernal (multiple alignment); pplacer (phylogenetic placement). Branch labels: <600 bp, >600 bp; parallel option. Note: profile HMMs used to align candidates to reference alignment. Outputs: Taxonomic Summaries (Krona plots, Number of reads placed for each marker gene); Sample Analysis & Comparison (Edge PCA, Tree visualization, Bayes factor tests).]
Aaron Darling
@koadman
Erik Matsen
@ematsen
Holly Bik
@hollybik
Guillaume Jospin
@guillaumejospin
Darling AE, Jospin G, Lowe E, Matsen FA IV, Bik HM, Eisen JA.
(2014) PhyloSift: phylogenetic analysis of genomes and
metagenomes. PeerJ 2:e243 http://dx.doi.org/10.7717/peerj.243
Erik Lowe

Whole Genome Tree of 2000 Taxa
Lang JM, Darling AE, Eisen JA (2013)
Phylogeny of Bacterial and Archaeal
Genomes Using Conserved Genes:
Supertrees and Supermatrices. PLoS
ONE 8(4): e62510. doi:10.1371/journal.pone.0062510
Jenna Lang
@jennnomics
Aaron Darling
@koadman

Phylosift Markers
• PMPROK – Dongying Wu’s Bac/Arch
markers
• Eukaryotic Orthologs – Parfrey 2011 paper
• 16S/18S rRNA
• Mitochondria - protein-coding genes
• Viral Markers – Markov clustering on
genomes
• Codon Subtrees – finer scale taxonomy
• Extended Markers – plastids, gene families

PhyEco Markers
Phylogenetic group | Genome Number | Gene Number | Marker Candidates
Archaea | 62 | 145415 | 106
Actinobacteria | 63 | 267783 | 136
Alphaproteobacteria | 94 | 347287 | 121
Betaproteobacteria | 56 | 266362 | 311
Gammaproteobacteria | 126 | 483632 | 118
Deltaproteobacteria | 25 | 102115 | 206
Epsilonproteobacteria | 18 | 33416 | 455
Bacteroides | 25 | 71531 | 286
Chlamydiae | 13 | 13823 | 560
Chloroflexi | 10 | 33577 | 323
Cyanobacteria | 36 | 124080 | 590
Firmicutes | 106 | 312309 | 87
Spirochaetes | 18 | 38832 | 176
Thermi | 5 | 14160 | 974
Thermotogae | 9 | 17037 | 684
Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families
for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological
Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE
8(10): e77033. doi:10.1371/journal.pone.0077033

Edge PCA: Identify
lineages that explain most
variation among samples
Edge PCA - Matsen and Evans 2013
Output: Edge PCA

Lesson 7:
Don’t Accept
When You Are
Defeated

Example IV: Functional Evolution

1st Genome Sequence
Fleischmann et al.
1995

TIGR Genome Projects

1st Genome Sequence
Fleischmann et al.
1995
Lesson 8:
If you can’t beat
them, critique
them or join them

• Leveraging an understanding of the
evolution of function to better predict
functions
Function & Phylogeny

PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: METHOD flow chart shown for EXAMPLE A and EXAMPLE B. Steps: CHOOSE GENE(S) OF INTEREST; IDENTIFY HOMOLOGS; ALIGN SEQUENCES; CALCULATE GENE TREE; OVERLAY KNOWN FUNCTIONS ONTO TREE; INFER LIKELY FUNCTION OF GENE(S) OF INTEREST. Gene labels: 1 to 6, and 1A, 2A, 3A, 1B, 2B, 3B, across Species 1, Species 2, and Species 3. Panel ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN) marks a Duplication; candidate duplication points on the inferred trees are marked "Duplication?" and one inference is marked Ambiguous.]
Based on Eisen, 1998 Genome Res 8: 163-167.
Phylogenomics

PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: METHOD flow chart shown for EXAMPLE A and EXAMPLE B. Steps: CHOOSE GENE(S) OF INTEREST; IDENTIFY HOMOLOGS; ALIGN SEQUENCES; CALCULATE GENE TREE; OVERLAY KNOWN FUNCTIONS ONTO TREE; INFER LIKELY FUNCTION OF GENE(S) OF INTEREST. Gene labels: 1 to 6, and 1A, 2A, 3A, 1B, 2B, 3B, across Species 1, Species 2, and Species 3. Panel ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN) marks a Duplication; candidate duplication points on the inferred trees are marked "Duplication?" and one inference is marked Ambiguous.]
Based on Eisen, 1998 Genome Res 8: 163-167.
Phylogenomics
Lesson 9:
If you invent your
own omics word,
you are stuck with it
so use it for
branding

Phylogenomics ~~ Phylotyping
Eisen et al. 1992. J. Bact. 174: 3416

Phylogenomics ~~ Phylotyping
Eisen et al. 1992. J. Bact. 174: 3416
Lesson 10:
Stealing (with
acknowledgement)
is OK

Proteorhodopsin Functional Diversity
Venter et al., Science 304: 66. 2004

• Leveraging understanding of gene gain
and loss to better predict genome
functions
Lesson 11:
Who you hang out
with matters

Carboxydothermus hydrogenoformans
• Isolated from a Russian hot spring
• Thermophile (grows at 80°C)
• Anaerobic
• Grows very efficiently on CO (Carbon
Monoxide)
• Produces hydrogen gas
• Low GC Gram positive (Firmicute)
• Genome Determined (Wu et al. 2005
PLoS Genetics 1: e65.)

Homologs of Sporulation Genes
Wu et al. 2005 PLoS
Genetics 1: e65.

Carboxydothermus sporulates
Wu et al. 2005 PLoS Genetics 1: e65.

Non-Homology Predictions:
Phylogenetic Profiling
• Step 1: Search all genes in
organisms of interest against all
other genomes
• Ask: Yes or No, is each gene
found in each other species
• Cluster genes by distribution
patterns (profiles)
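The three steps above can be sketched as follows, assuming the homology search has already been run and summarized as, for each gene, the set of genomes with a hit. Gene and genome names are hypothetical, and grouping by identical profile is the simplest possible clustering; real analyses usually allow near matches.

```python
from collections import defaultdict


def build_profiles(hits, genomes):
    """Turn search results into presence/absence profiles.

    `hits` maps each gene to the set of genomes in which a homolog was
    found; the profile is a tuple of 1/0 values in `genomes` order.
    """
    return {
        gene: tuple(int(g in found) for g in genomes)
        for gene, found in hits.items()
    }


def cluster_by_profile(profiles):
    """Group genes that share exactly the same distribution pattern."""
    clusters = defaultdict(list)
    for gene, profile in profiles.items():
        clusters[profile].append(gene)
    return dict(clusters)


genomes = ["sp1", "sp2", "sp3"]           # hypothetical comparison genomes
hits = {                                   # hypothetical search results
    "geneA": {"sp1", "sp3"},
    "geneB": {"sp1", "sp3"},
    "geneC": {"sp2"},
}
profiles = build_profiles(hits, genomes)
clusters = cluster_by_profile(profiles)
print(clusters)
```

Genes that land in the same cluster as known members of a process (for example sporulation) become candidates for involvement in that process, even without sequence similarity to them.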

Sporulation Gene Profile
Wu et al. 2005 PLoS Genetics 1: e65.

B. subtilis new sporulation genes
J Bacteriol. 2013 Jan;195(2):253-60. doi: 10.1128/JB.01778-12
Bjorn Traag
Richard Losick

Example V: More Gaps
Lesson 12:
Keep Returning to
the Same Theme
Over and Over
and Over

Yet Another Map
Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree

Genomes Poorly Sampled

TIGR Tree of Life Project

Genomic Encyclopedia of Bacteria & Archaea
Wu et al. 2009 Nature 462, 1056-1060

Family Diversity vs. PD
Wu et al. 2009 Nature 462, 1056-1060

GEBA Cyanobacteria
Shih et al. 2013. PNAS 10.1073/pnas.1217107110
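[Figure: phylogenetic tree, scale bar 0.3, panels A and B. Clade labels: A, B1, B2, B3, C1, C2, C3, D, E, F, G. Other labels: Paulinella, Glaucophyte, Green, Red, Chromalveolates. Caption text is cut off on the slide.]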

Haloarchaeal GEBA-like
Lynch et al. (2012) PLoS ONE 7(7): e41389. doi:10.1371/journal.pone.0041389

The Dark Matter of Biology
From Wu et al. 2009 Nature 462, 1056-1060

Number of SAGs from Candidate Phyla
Site | OD1 | OP11 | OP3 | SAR406
Site A: Hydrothermal vent | 4 | 1 | - | -
Site B: Gold Mine | 6 | 13 | 2 | -
Site C: Tropical gyres (Mesopelagic) | - | - | - | 2
Site D: Tropical gyres (Photic zone) | 1 | - | - | -
Sample collections at 4 additional sites are underway.
Phil Hugenholtz
GEBA Uncultured

JGI Dark Matter Project
environmental
samples (n=9)
isolation of single
cells (n=9,600)
whole genome
amplification (n=3,300)
SSU rRNA gene
based identification
(n=2,000)
genome sequencing,
assembly and QC (n=201)
draft genomes
(n=201)
SAK
HSM ETLTG
HOT
GOM
GBS
EPR
TAETL T
PR
EBS
AK E
SM G TATTG
OM
OT
seawater brackish/freshwater hydrothermal sediment bioreactor
GN04
WS3 (Latescibacteria)
GN01
+Gí
LD1
WS1
Poribacteria
BRC1
Lentisphaerae
Verrucomicrobia
OP3 (Omnitrophica)
Chlamydiae
Planctomycetes
NKB19 (Hydrogenedentes)
WYO
Armatimonadetes
WS4
Actinobacteria
Gemmatimonadetes
NC10
SC4
WS2
Cyanobacteria
:36í2
Deltaproteobacteria
EM19 (Calescamantes)
2FW6SDí )HUYLGLEDFWHULD

GAL35
Aquificae
EM3
Thermotogae
Dictyoglomi
SPAM
GAL15
CD12 (Aerophobetes)
OP8 (Aminicenantes)
AC1
SBR1093
Thermodesulfobacteria
Deferribacteres
Synergistetes
OP9 (Atribacteria)
:36í2
Caldiserica
AD3
Chloroflexi
Acidobacteria
Elusimicrobia
Nitrospirae
49S1 2B
Caldithrix
GOUTA4
6$5 0DULQLPLFURELD

Chlorobi
Firmicutes
Tenericutes
Fusobacteria
Chrysiogenetes
Proteobacteria
Fibrobacteres
TG3
Spirochaetes
WWE1 (Cloacamonetes)
TM6
ZB3
MVP-15
Deinococcus-Thermus
OP1 (Acetothermia)
Bacteroidetes
TM7
GN02 (Gracilibacteria)
SR1
BH1
OD1 (Parcubacteria)
:6
OP11 (Microgenomates)
Euryarchaeota
Micrarchaea
DSEG (Aenigmarchaea)
Nanohaloarchaea
Nanoarchaea
Cren MCG
Thaumarchaeota
Cren C2
Aigarchaeota
Cren pISA7
Cren Thermoprotei
Korarchaeota
pMC2A384 (Diapherotrites)
BACTERIA ARCHAEA
Figure panels:
archaeal toxins (Nanoarchaea)
lytic murein transglycosylase
stringent response (Diapherotrites, Nanoarchaea)
sigma factor (Diapherotrites, Nanoarchaea)
oxidoreductase
HGT from Eukaryotes (Nanoarchaea)
murein (peptido-glycan)
archaeal type purine synthesis (Microgenomates)
UGA recoded for Gly (Gracilibacteria)
Woyke et al. Nature 2013.

A Genomic Encyclopedia of Microbes (GEM)

Example VI: Beyond Sequence
Lesson 13:
Don’t Overdo It
With That Theme

DNA
extraction
PCR
Sequence
all genes
Shotgun
Shotgun Metagenomics

Wu et al. 2006 PLoS Biology 4: e188.
Baumannia makes vitamins and cofactors
Sulcia makes amino acids
Phylogenetic Binning

HiC Crosslinking Sequencing
Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore
RW, Eisen JA, Darling AE. (2014) Strain- and plasmid-
level deconvolution of a synthetic metagenome by
sequencing proximity ligation products. PeerJ 2:e415
http://dx.doi.org/10.7717/peerj.415
Table 1 Species alignment fractions. The number of reads aligning to each replicon present in the synthetic microbial community are shown before and after filtering, along with the percent of total constituted by each species. The GC content (“GC”) and restriction site counts (“#R.S.”) of each replicon, species, and strain are shown. Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21. An expanded version of this table can be found in Table S2.
Sequence   Alignment    % of Total   Filtered     % of aligned   Length      GC      #R.S.
Lac0       10,603,204   26.17%       10,269,562   96.85%         2,291,220   0.462   629
Lac1       145,718      0.36%        145,478      99.84%         13,413      0.386   3
Lac2       691,723      1.71%        665,825      96.26%         35,595      0.385   16
Lac        11,440,645   28.23%       11,080,865   96.86%         2,340,228   0.46    648
Ped        2,084,595    5.14%        2,022,870    97.04%         1,832,387   0.373   863
BL21       12,882,177   31.79%       2,676,458    20.78%         4,558,953   0.508   508
K12        9,693,726    23.92%       1,218,281    12.57%         4,686,137   0.507   568
E. coli    22,575,903   55.71%       3,894,739    17.25%         9,245,090   0.51    1076
Bur1       1,886,054    4.65%        1,797,745    95.32%         2,914,771   0.68    144
Bur2       2,536,569    6.26%        2,464,534    97.16%         3,809,201   0.672   225
Bur        4,422,623    10.91%       4,262,279    96.37%         6,723,972   0.68    369
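As a rough check on how the two percentage columns in Table 1 relate to the read counts, the species-level rows can be recomputed as below. This is only a sketch: the dictionary and function names are illustrative, and it assumes "% of Total" is each species' aligned reads over the sum of the four species rows and "% of aligned" is filtered reads over aligned reads.

```python
# Species-level read counts copied from Table 1 (Beitel et al. 2014).
aligned = {"Lac": 11_440_645, "Ped": 2_084_595, "E. coli": 22_575_903, "Bur": 4_422_623}
filtered = {"Lac": 11_080_865, "Ped": 2_022_870, "E. coli": 3_894_739, "Bur": 4_262_279}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Assumption: the denominator for "% of Total" is the sum of the species rows.
total_aligned = sum(aligned.values())

pct_of_total = {sp: percent(n, total_aligned) for sp, n in aligned.items()}
pct_retained = {sp: percent(filtered[sp], aligned[sp]) for sp in aligned}

for sp in aligned:
    print(f"{sp:8s} {pct_of_total[sp]:6.2f}% of total, {pct_retained[sp]:6.2f}% kept after filtering")
```

Under these assumptions the E. coli rows stand out: most E. coli alignments are removed by filtering, while the other species keep over 95% of theirs.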
Figure 1 Hi-C insert distribution. The distribution of genomic distances between Hi-C read pairs is
shown for read pairs mapping to each chromosome. For each read pair the minimum path length on
the circular chromosome was calculated and read pairs separated by less than 1000 bp were discarded.
The 2.5 Mb range was divided into 100 bins of equal size and the number of read pairs in each bin
was recorded for each chromosome. Bin values for each chromosome were normalized to sum to 1 and
plotted.
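The binning described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, read pairs are assumed to be given as (position, position) tuples on a single circular chromosome, and dropping pairs at or beyond the 2.5 Mb range is an added assumption.

```python
def circular_distance(pos_a, pos_b, chrom_len):
    """Minimum path length between two positions on a circular chromosome."""
    d = abs(pos_a - pos_b) % chrom_len
    return min(d, chrom_len - d)


def insert_distribution(pairs, chrom_len, min_sep=1000, max_range=2_500_000, n_bins=100):
    """Bin read-pair distances into equal-width bins and normalize to sum to 1."""
    bin_width = max_range / n_bins
    counts = [0] * n_bins
    for pos_a, pos_b in pairs:
        d = circular_distance(pos_a, pos_b, chrom_len)
        # Pairs closer than min_sep are discarded, as in the caption.
        # Assumption: pairs at or beyond max_range are also skipped.
        if d < min_sep or d >= max_range:
            continue
        counts[int(d // bin_width)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


# Toy example using the E. coli K12 length from Table 1.
k12_len = 4_686_137
toy_pairs = [(100, 600), (0, 4_686_037), (10, 50_010), (1_000, 2_001_000)]
dist = insert_distribution(toy_pairs, k12_len)
```

In the toy example the first two pairs are dropped (500 bp apart, and 100 bp apart once the wrap around the origin is taken into account), so the two remaining pairs each contribute half of the normalized total.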
E. coli K12 genome were distributed in a similar manner as previously reported (Fig. 1;
(Lieberman-Aiden et al., 2009)). We observed a minor depletion of alignments spanning
the linearization point of the E. coli K12 assembly (e.g., near coordinates 0 and 4686137)
due to edge effects induced by BWA treating the sequence as a linear chromosome rather
than circular.
Figure 2 Metagenomic Hi-C associations. The log-scaled, normalized number of Hi-C read pairs
associating each genomic replicon in the synthetic community is shown as a heat map (see color scale,
blue to yellow: low to high normalized, log scaled association rates). Bur1: B. thailandensis chromosome
1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2:
L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21.
reference assemblies of the members of our synthetic microbial community with the same
alignment parameters as were used in the top ranked clustering (described above). We first
Figure 3 Contigs associated by Hi-C reads. A graph is drawn with nodes depicting contigs and edges
depicting associations between contigs as indicated by aligned Hi-C read pairs, with the count thereof
depicted by the weight of edges. Nodes are colored to reflect the species to which they belong (see legend)
with node size reflecting contig size. Contigs below 5 kb and edges with weights less than 5 were excluded.
Contig associations were normalized for variation in contig size.
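The contig graph in the Figure 3 caption can be sketched roughly as below. The function name and input format are hypothetical. The caption does not say how associations were normalized for contig size, so dividing by the product of the two contig lengths is an assumption, as is applying the weight cutoff to raw counts.

```python
from collections import Counter


def contig_graph(read_pairs, contig_len, min_len=5000, min_weight=5):
    """Build weighted contig-contig edges from Hi-C read pairs.

    read_pairs: iterable of (contig_a, contig_b), one entry per aligned read pair.
    contig_len: dict mapping contig name to length in bp.
    Returns {(contig_a, contig_b): (raw_count, normalized_weight)}.
    """
    counts = Counter()
    for a, b in read_pairs:
        if a == b:
            continue  # pairs within one contig do not link contigs
        if contig_len[a] < min_len or contig_len[b] < min_len:
            continue  # contigs below 5 kb are excluded
        counts[tuple(sorted((a, b)))] += 1

    edges = {}
    for (a, b), n in counts.items():
        if n < min_weight:
            continue  # edges with weight below 5 are excluded
        # Assumption: normalize by the product of the two contig lengths.
        edges[(a, b)] = (n, n / (contig_len[a] * contig_len[b]))
    return edges


# Toy example: c3 is shorter than 5 kb, so its links are ignored.
toy_len = {"c1": 10_000, "c2": 20_000, "c3": 3_000}
toy_pairs = [("c1", "c2")] * 4 + [("c2", "c1")] * 2 + [("c1", "c3")] * 7 + [("c1", "c1")] * 9
toy_edges = contig_graph(toy_pairs, toy_len)
```

Here only the c1-c2 edge survives, with a raw weight of 6.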
typically represent the reads and variant sites as a variant graph wherein variant sites are
represented as nodes, and sequence reads define edges between variant sites observed in
the same read (or read pair). We reasoned that variant graphs constructed from Hi-C
data would have much greater connectivity (where connectivity is defined as the mean
path length between randomly sampled variant positions) than graphs constructed from
mate-pair sequencing data, simply because Hi-C inserts span megabase distances. Such
Figure 4 Hi-C contact maps for replicons of Lactobacillus brevis. Contact maps show the number of
Hi-C read pairs associating each region of the L. brevis genome. The L. brevis chromosome (Lac0, (A),
Chris Beitel
@datscimed
Aaron Darling
@koadman

Sequence Isn’t Everything
PB-PSB1
(Purple sulfur bacteria)
PB-SRB1
(Sulfate reducing bacteria)
(sulfate)
(sulfide)
Wilbanks, E.G. et al (2014). Environmental Microbiology
Lizzy Wilbanks
@lizzywilbanks

12C, 12C14N, 32S
Biomass
(RGB composite)
0.044 0.080
34S-incorporation
(34S/32S ratio)
Wilbanks, E.G. et al (2014). Environmental Microbiology
Transfer of 34S from SRB to PSB

Long Reads Help, A Lot
HiSeq MiSeq
100-250 bp
Moleculo
2-20 kb
PacBio RS II
2-20 kb
Micky Kertesz,
Tim Blauwcamp
Meredith Ashby
Cheryl Heiner
Illumina-based
“synthetic long reads”
Real-time single molecule
sequencing
(p4-c2, p5-c3)
295 Megabases 474 Megabases 61 Gigabases

Light-responsive sulfate reducer?
rhodopsin
w/ Susumu Yoshizawa

Lesson 14:
Asking for, and
getting, help, is a
good thing

Seagrass Microbiome
1000 samples collected.
Not a blade of seagrass touched.
YEAR ONE

ZEN (Zostera Experimental Network) 
25 partner sites
leaves, roots, sediment, and water samples
