Talk by J. Eisen for NZ Computational Genomics meeting

Phylogeny driven approaches
to the study of microbial diversity
September 3, 2015
Queenstown Computational Genomics
Conference
Jonathan A. Eisen
@phylogenomics
University of California, Davis

[Figure: Pubmed “Microbiome” hits per year, 2000 to 2013 (y-axis 0 to 4000)]
The Rise of the Microbiome

microBIOME or microbiOME
• microbi-OME
• collection of genomes of microbes from a
community (emphasis on OME)
• micro-BIOME
• a community of microbes (emphasis on
BIOME)
• see http://tinyurl.com/definemicrobiome

Not Just About Humans or Hosts

Why Now I: Appreciation of Microbial Diversity
Functional Diversity
Diversity of Form
Phylogenetic Diversity

MICROBES RUN THE PLANET

Why Now II: Post Genome Blues
The Microbiome
Transcriptome
Variome
Epigenome
Overselling the Human Genome?

Culturing
Observation
Count
DNA
Image source: http://www.biol.unt.edu/~jajohnson/DNA_sequencing_process
Why Now III: CSI-Microbiology Advances

Why Now IV: Sequencing Has Gone Crazy

Sequencing Revolution
• More genes and genomes
• Deeper sequencing
  • The rare biosphere
  • Relative abundance estimates
• More samples (with barcoding)
  • Time series
  • Spatially diverse sampling
  • Fine scale sampling

Turnbaugh et al Nature. 2006 444(7122):1027-31.
Why Now V: Microbiome Functions

Uses of Phylogeny 1: Species Phylogeny

Woese: Classification of Cultured Taxa by rRNA
rRNA rRNA rRNA
ACUGC
ACCUAU
CGUUCG
ACUCC
AGCUAU
CGAUCG
ACCCC
AGCUCU
CGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
R ACUCCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
F ACUCCAGGUAUCGAUCG
C ACCCCAGCUCUCGCUCG
W ACCCCAGCUCUGGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
C ACCCCAGCUCUCGCUCG
Eukaryotes Bacteria ????? Archaebacteria Archaea
Isolate Ribosomes
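The character matrix on this slide can be turned into pairwise distances, which is the raw signal behind grouping taxa by rRNA. The sketch below is only an illustration: it counts mismatches (Hamming distance) between the toy sequences copied from the slide. The function names are made up for this example, and counting mismatches is a simplification, not the tree-building method used in the original work.

```python
from itertools import combinations

# Toy character matrix copied from the slide (taxon label -> aligned rRNA characters).
ALIGNMENT = {
    "S": "ACUGCACCUAUCGUUCG",
    "R": "ACUCCACCUAUCGUUCG",
    "E": "ACUCCAGCUAUCGAUCG",
    "F": "ACUCCAGGUAUCGAUCG",
    "C": "ACCCCAGCUCUCGCUCG",
    "W": "ACCCCAGCUCUGGCUCG",
}


def hamming(a, b):
    """Count positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_table(alignment):
    """Return {(taxon1, taxon2): mismatches} for every pair, labels sorted."""
    return {
        (p, q): hamming(alignment[p], alignment[q])
        for p, q in combinations(sorted(alignment), 2)
    }


def closest_pairs(alignment):
    """Return the pairs of taxa separated by the smallest distance."""
    table = distance_table(alignment)
    best = min(table.values())
    return sorted(pair for pair, d in table.items() if d == best)


if __name__ == "__main__":
    for pair, d in sorted(distance_table(ALIGNMENT).items()):
        print(pair, d)
    print("closest pairs:", closest_pairs(ALIGNMENT))
```

With these toy characters the taxa fall into three pairs, each differing at a single position, which matches the three groups drawn on the slide.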

Archaea
Woese: Classification of Cultured Taxa by rRNA PCR
rRNA
rRNA
PCR
rRNA
PCR
ACUGC
ACCUAU
CGUUCG
ACUCC
AGCUAU
CGAUCG
ACCCC
AGCUCU
CGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
R ACUCCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
F ACUCCAGGUAUCGAUCG
C ACCCCAGCUCUCGCUCG
W ACCCCAGCUCUGGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
C ACCCCAGCUCUCGCUCG
Eukaryotes Bacteria
Isolate DNA

Archaea
rRNA
rRNA
PCR
rRNA
PCR
Eukaryotes Bacteria
Isolate DNA
ACTGC
ACCTAT
CGTTCG
ACTGC
ACCTAT
CGTTCG
ACTGC
ACCTAT
CGTTCG
Taxa Characters
B1 ACTGCACCTATCGTTCG
B2 ACTCCACCTATCGTTCG
E1 ACTCCAGCTATCGATCG
E2 ACTCCAGGTATCGATCG
A1 ACCCCAGCTCTCGCTCG
A2 ACCCCAGCTCTGGCTCG
New1 ACTGCACCTATCGTTCG
Phylotyping via rRNA PCR: One Taxon
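Phylotyping a new sequence against the reference matrix can be sketched as a nearest-match lookup. This is a stand-in for real phylogenetic placement, not the method used in the talk: it simply picks the reference with the fewest mismatches. The reference sequences are the toy ones from the slide. The mapping from label prefix (B, E, A) to domain and the function names are assumptions made for this example.

```python
# Toy reference matrix copied from the slide.
REFERENCES = {
    "B1": "ACTGCACCTATCGTTCG",
    "B2": "ACTCCACCTATCGTTCG",
    "E1": "ACTCCAGCTATCGATCG",
    "E2": "ACTCCAGGTATCGATCG",
    "A1": "ACCCCAGCTCTCGCTCG",
    "A2": "ACCCCAGCTCTGGCTCG",
}

# Assumption for illustration: the first letter of each label marks its domain.
DOMAIN_BY_PREFIX = {"B": "Bacteria", "E": "Eukaryotes", "A": "Archaea"}


def mismatches(a, b):
    """Count differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_domain(query, references=REFERENCES):
    """Return (domain, closest reference label, mismatch count) for a query.

    Ties are broken alphabetically by reference label.
    """
    best = min(sorted(references), key=lambda name: mismatches(query, references[name]))
    return DOMAIN_BY_PREFIX[best[0]], best, mismatches(query, references[best])


if __name__ == "__main__":
    # "New1" from the one-taxon slide.
    print(assign_domain("ACTGCACCTATCGTTCG"))
```

A nearest-match rule like this is fast, but it can mislead when no close relative is in the reference set, which is one reason tree-based placement is preferred later in the talk.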

Chemosymbiont rRNA Phylotyping
Eisen et al. 1992. J. Bact. 174: 3416
Colleen Cavanaugh

Taxa Characters
New1 ACCCCAGCTCTGCCTCG
Archaea Eukaryotes Bacteria
ACTGC
ACCTAT
CGTTCG
ACTGC
ACCTAT
CGTTCG
ACCCC
AGCTCT
CGCTCG
rRNA
rRNA
PCR
rRNA
PCR
Isolate DNA
Phylotyping via rRNA PCR: Two Taxa

ACTGC
ACCTAT
CGTTCG
ACTCC
AGCTAT
CGATCG
ACCCC
AGCTCT
CGCTCG
AGGGG
AGCTCT
CGCTCG
AGGGG
AGCTCT
CGCTCG
ACTGC
ACCTAT
CGTTCG
Taxa Characters
New3 ACCCCAGCTCTCGCTCG 
New4 AGGGGAGCTCTCGCTCG
rRNA
rRNA
PCR
rRNA
PCR
Isolate DNA
Phylotyping via rRNA PCR: Four Taxa

Approaching to NGS
1953  Discovery of DNA structure (Cold Spring Harb. Symp. Quant. Biol. 1953;18:123-31)
1977  Sanger sequencing method by F. Sanger (PNAS, 1977, 74: 560-564)
1983  PCR by K. Mullis (Cold Spring Harb Symp Quant Biol. 1986;51 Pt 1:263-73)
1993  Development of pyrosequencing (Anal. Biochem., 1993, 208: 171-175; Science, 1998, 281: 363-365)
1998  Single molecule emulsion PCR
1998  Founded Solexa
2000  Founded 454 Life Science
2001  Human Genome Project (Nature, 2001, 409: 860–92; Science, 2001, 291: 1304–1351)
2005  454 GS20 sequencer (first NGS sequencer)
2006  Solexa Genome Analyzer (first short-read NGS sequencer)
2006  Illumina acquires Solexa (Illumina enters the NGS business)
2007  ABI SOLiD (short-read sequencer based upon ligation)
2007  Roche acquires 454 Life Sciences (Roche enters the NGS business)
2008  GS FLX sequencer (NGS with 400-500 bp read length)
2008  NGS Human Genome sequencing (first Human Genome sequencing based upon NGS technology)
2010  Hi-Seq2000 (200Gbp per Flow Cell)
From Slideshare presentation of Cosentino Cristian
http://www.slideshare.net/cosentia/high-throughput-equencing
Miseq
Roche Jr
Ion Torrent
PacBio
Oxford
Automation is Critical
AAATCGCTAGCGC
CGGCGAGCTAGC
CGAGCGATCGAGC
CGAGCATCGAGTA

STAP (for rRNA)
An Automated Phylogenetic Tree-Based Small Subunit
rRNA Taxonomy and Alignment Pipeline (STAP)
Dongying Wu1*, Amber Hartman1,6, Naomi Ward4,5, Jonathan A. Eisen1,2,3
1 UC Davis Genome Center, University of California Davis, Davis, California, United States of America, 2 Section of Evolution and Ecology, College of Biological Sciences,
University of California Davis, Davis, California, United States of America, 3 Department of Medical Microbiology and Immunology, School of Medicine, University of
California Davis, Davis, California, United States of America, 4 Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, United States of America,
5 Center of Marine Biotechnology, Baltimore, Maryland, United States of America, 6 The Johns Hopkins University, Department of Biology, Baltimore, Maryland, United
States of America
Abstract
Comparative analysis of small-subunit ribosomal RNA (ss-rRNA) gene sequences forms the basis for much of what we know
about the phylogenetic diversity of both cultured and uncultured microorganisms. As sequencing costs continue to decline
and throughput increases, sequences of ss-rRNA genes are being obtained at an ever-increasing rate. This increasing flow of
data has opened many new windows into microbial diversity and evolution, and at the same time has created significant
methodological challenges. Those processes which commonly require time-consuming human intervention, such as the
preparation of multiple sequence alignments, simply cannot keep up with the flood of incoming data. Fully automated
methods of analysis are needed. Notably, existing automated methods avoid one or more steps that, though
computationally costly or difficult, we consider to be important. In particular, we regard both the building of multiple
sequence alignments and the performance of high quality phylogenetic analysis to be necessary. We describe here our fully-
automated ss-rRNA taxonomy and alignment pipeline (STAP). It generates both high-quality multiple sequence alignments
and phylogenetic trees, and thus can be used for multiple purposes including phylogenetically-based taxonomic
assignments and analysis of species diversity in environmental samples. The pipeline combines publicly-available packages
(PHYML, BLASTN and CLUSTALW) with our automatic alignment, masking, and tree-parsing programs. Most importantly,
this automated process yields results comparable to those achievable by manual analysis, yet offers speed and capacity that
are unattainable by manual efforts.
Citation: Wu D, Hartman A, Ward N, Eisen JA (2008) An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP). PLoS
ONE 3(7): e2566. doi:10.1371/journal.pone.0002566
multiple alignment and phylogeny was deemed unfeasible.
However, this we believe can compromise the value of the results.
For example, the delineation of OTUs has also been automated
via tools that do not make use of alignments or phylogenetic trees
(e.g., Greengenes). This is usually done by carrying out pairwise
comparisons of sequences and then clustering sequences that
have better than some cutoff threshold of similarity with each
other. This approach can be powerful (and reasonably efficient)
but it too has limitations. In particular, since multiple sequence
alignments are not used, one cannot carry out standard
phylogenetic analyses. In addition, without multiple sequence
alignments one might end up comparing and contrasting different
regions of a sequence depending on what it is paired with.
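The pairwise-similarity approach to OTU delineation described above can be sketched as greedy threshold clustering. This is a minimal illustration under stated assumptions, not the Greengenes procedure: sequences are assumed to be the same length already, identity is the fraction of matching positions, and each sequence joins the first cluster whose seed it matches at or above the cutoff. The sequences are toy data and all names are hypothetical.

```python
# Toy equal-length sequences (illustrative only).
SEQS = {
    "B1": "ACTGCACCTATCGTTCG",
    "B2": "ACTCCACCTATCGTTCG",
    "E1": "ACTCCAGCTATCGATCG",
    "E2": "ACTCCAGGTATCGATCG",
    "A1": "ACCCCAGCTCTCGCTCG",
    "A2": "ACCCCAGCTCTGGCTCG",
}


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, cutoff=0.97):
    """Greedy clustering: {seed name: [member names]} in input order.

    A sequence joins the first existing seed it matches at >= cutoff,
    otherwise it becomes the seed of a new OTU.
    """
    seeds = []   # (name, sequence) of each OTU seed, in creation order
    otus = {}
    for name, seq in seqs.items():
        for seed_name, seed_seq in seeds:
            if identity(seq, seed_seq) >= cutoff:
                otus[seed_name].append(name)
                break
        else:
            seeds.append((name, seq))
            otus[name] = [name]
    return otus


if __name__ == "__main__":
    print(greedy_otus(SEQS, cutoff=0.9))
    print(greedy_otus(SEQS, cutoff=0.97))
```

The result depends on the cutoff and on input order, and no alignment or tree comes out of it, which is the limitation the text points to.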
The limitations of avoiding multiple sequence alignments and
phylogenetic analysis are readily apparent in tools to classify
sequences. For example, the Ribosomal Database Project’s
Classifier program [29] focuses on composition characteristics of
each sequence (e.g., oligonucleotide frequency) and assigns
taxonomy based upon clustering genes by their composition.
Though this is fast and completely automatable, it can be misled in
cases where distantly related sequences have converged on similar
composition, something known to be a major problem in ss-rRNA
sequences [30]. Other taxonomy assignment systems focus
primarily on the similarity of sequences. The simplest of these is
classification tools it does have some limitations. For example,
the generation of new alignments for each sequence is both
computationally costly, and does not take advantage of available
curated alignments that make use of ss-rRNA secondary structure
to guide the primary sequence alignment. Perhaps most
importantly however is that the tool is not fully automated. In
addition, it does not generate multiple sequence alignments for all
sequences in a dataset which would be necessary for doing many
analyses.
Automated methods for analyzing rRNA sequences are also
available at the web sites for multiple rRNA centric databases,
such as Greengenes and the Ribosomal Database Project (RDPII).
Though these and other web sites offer diverse powerful tools, they
do have some limitations. For example, not all provide multiple
sequence alignments as output and few use phylogenetic
approaches for taxonomy assignments or other analyses. More
importantly, all provide only web-based interfaces and their
integrated software, (e.g., alignment and taxonomy assignment),
cannot be locally installed by the user. Therefore, the user cannot
take advantage of the speed and computing power of parallel
processing such as is available on linux clusters, or locally alter and
potentially tailor these programs to their individual computing
needs (Table 1).
Given the limited automated tools that are available for
Table 1. Comparison of STAP’s computational abilities relative to existing commonly-used ss-rRNA analysis tools.
                                          STAP          ARB      Greengenes  RDP
Installed where?                          Locally       Locally  Web only    Web only
User interface                            Command line  GUI      Web portal  Web portal
Parallel processing                       YES           NO       NO          NO
Manual curation for taxonomy assignment   NO            YES      NO          NO
Manual curation for alignment             NO            YES      NO*         NO
Open source                               YES**         NO       NO          NO
Processing speed                          Fast          Slow     Medium      Medium
It is important to note that STAP is the only software that runs on the command line and can take advantage of parallel processing on linux clusters and, further, is
more amenable to downstream code manipulation.
* Note: Greengenes alignment output is compatible with upload into ARB and downstream manual alignment.
** The STAP program itself is open source, the programs it depends on are freely available but not open source.
doi:10.1371/journal.pone.0002566.t001
ss-rRNA Taxonomy Pipeline
Figure 1. A flow chart of the STAP pipeline.
doi:10.1371/journal.pone.0002566.g001
STAP database, and the query sequence is aligned to them using
the CLUSTALW profile alignment algorithm [40] as described
above for domain assignment. By adapting the profile alignment
algorithm, the alignments from the STAP database remain intact,
while gaps are inserted and nucleotides are trimmed for the query
sequence according to the profile defined by the previous
alignments from the databases. Thus the accuracy and quality of
the alignment generated at this step depends heavily on the quality
of the Bacterial/Archaeal ss-rRNA alignments from the
Greengenes project or the Eukaryotic ss-rRNA alignments from
the RDPII project.
Phylogenetic analysis using multiple sequence alignments rests on
the assumption that the residues (nucleotides or amino acids) at the
same position in every sequence in the alignment are homologous.
Thus, columns in the alignment for which ‘‘positional homology’’
cannot be robustly determined must be excluded from subsequent
analyses. This process of evaluating homology and eliminating
questionable columns, known as masking, typically requires time-
consuming, skillful, human intervention. We designed an automat-
ed masking method for ss-rRNA alignments, thus eliminating this
bottleneck in high-throughput processing.
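Automated masking can be illustrated with a much simpler rule than the one STAP uses. The sketch below is an assumption-laden stand-in: it drops alignment columns that are mostly gaps or that have no clear majority character, whereas STAP scores columns with a CLUSTALX-like method. The thresholds, function names, and toy alignment are all hypothetical.

```python
from collections import Counter


def column_mask(alignment, max_gap_fraction=0.5, min_consensus=0.5):
    """Return a list of booleans, True for columns to keep.

    A column is dropped when more than max_gap_fraction of its cells are
    gaps ('-'), or when the most common non-gap character makes up less
    than min_consensus of the non-gap cells.
    """
    rows = list(alignment.values())
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same aligned length")
    keep = []
    for i in range(length):
        column = [row[i] for row in rows]
        residues = [c for c in column if c != "-"]
        gap_fraction = (len(column) - len(residues)) / len(column)
        if not residues or gap_fraction > max_gap_fraction:
            keep.append(False)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        keep.append(top_count / len(residues) >= min_consensus)
    return keep


def apply_mask(alignment, keep):
    """Remove masked columns from every sequence in the alignment."""
    return {
        name: "".join(c for c, k in zip(seq, keep) if k)
        for name, seq in alignment.items()
    }


if __name__ == "__main__":
    toy = {"a": "AC-GT", "b": "AC-GA", "c": "ACTGC", "d": "A--GG"}
    mask = column_mask(toy)
    print(mask)
    print(apply_mask(toy, mask))
```

The masked alignment is what would be passed on to tree building, so only columns with reasonably trustworthy positional homology contribute to the phylogeny.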
First, an alignment score is calculated for each aligned column
by a method similar to that used in the CLUSTALX package [42].
Specifically, an R-dimensional sequence space representing all the
possible nucleotide character states is defined. Then for each
aligned column, the nucleotide populating that column in each of
the aligned sequences is assigned a score in each of the R
dimensions (Sr) according to the IUB matrix [42]. The consensus
‘‘nucleotide’’ for each column (X) also has R dimensions, with the
Figure 2. Domain assignment. In Step 1, STAP assigns a domain to
each query sequence based on its position in a maximum likelihood
tree of representative ss-rRNA sequences. Because the tree illustrated
here is not rooted, domain assignment would not be accurate and
Dongying Wu
Amber Hartman
Naomi Ward

alignment used to build the profile, resulting in a multiple PD versus PID clustering, 2) to explore overlap between PhylOT
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generaliz
workflow of PhylOTU. See Results section for details.
doi:10.1371/journal.pcbi.1001061.g001
Finding Metagenomic OTU
Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061
PhylOTU
Tom Sharpton
Katie Pollard
Jessica Green

rRNA PCR: Community Comparisons

Taxa Characters
New4 AGGGGAGCTCTCGCTCG
rRNA
rRNA
PCR
rRNA
PCR
Isolate DNA
A A A A
AA
A A A A
AA
A A
A A A
AA
A A

Hartman et al. BMC Bioinformatics 2010, 11:317
http://www.biomedcentral.com/1471-2105/11/317
Open Access
Software
Introducing W.A.T.E.R.S.: a Workflow for the
Alignment, Taxonomy, and Ecology of Ribosomal
Sequences
Amber L Hartman†1,3, Sean Riddle†2, Timothy McPhillips2, Bertram Ludäscher2 and Jonathan A Eisen*1
Abstract
Background: For more than two decades microbiologists have used a highly conserved microbial gene as a
phylogenetic marker for bacteria and archaea. The small-subunit ribosomal RNA gene, also known as 16S rRNA, is
encoded by ribosomal DNA, 16S rDNA, and has provided a powerful comparative tool to microbial ecologists. Over
time, the microbial ecology field has matured from small-scale studies in a select number of environments to massive
collections of sequence data that are paired with dozens of corresponding collection variables. As the complexity of
data and tool sets has grown, the need for flexible automation and maintenance of the core processes of 16S rDNA
sequence analysis has increased correspondingly.
Results: We present WATERS, an integrated approach for 16S rDNA analysis that bundles a suite of publicly available
16S rDNA analysis software tools into a single software package. The "toolkit" includes sequence alignment, chimera
removal, OTU determination, taxonomy assignment, phylogenetic tree construction as well as a host of ecological
analysis and visualization tools. WATERS employs a flexible, collection-oriented 'workflow' approach using the open-
source Kepler system as a platform.
Conclusions: By packaging available software tools into a single automated workflow, WATERS simplifies 16S rDNA
analyses, especially for those without specialized bioinformatics or programming expertise. In addition, WATERS, like
some of the newer comprehensive rRNA analysis tools, allows researchers to minimize the time dedicated to carrying
out tedious informatics steps and to focus their attention instead on the biological interpretation of the results. One
advantage of WATERS over other comprehensive tools is that the use of the Kepler workflow system facilitates result
interpretation and reproducibility via a data provenance sub-system. Furthermore, new "actors" can be added to the
workflow as desired and we see WATERS as an initial seed for a sizeable and growing repository of interoperable, easy-
to-combine tools for asking increasingly complex microbial ecology questions.
Background
Microbial communities and how they are surveyed
Microbial communities abound in nature and are crucial
for the success and diversity of ecosystems. There is no
end in sight to the number of biological questions that
can be asked about microbial diversity on earth. From
animal and human guts to open ocean surfaces and deep
sea hydrothermal vents, to anaerobic mud swamps or
boiling thermal pools, to the tops of the rainforest canopy
and the frozen Antarctic tundra, the composition of
microbial communities is a source of natural history,
intellectual curiosity, and reservoir of environmental
health [1]. Microbial communities are also mediators of
insight into global warming processes [2,3], agricultural
success [4], pathogenicity [5,6], and even human obesity
[7,8].
In the mid-1980s, researchers began to sequence ribosomal
RNAs from environmental samples in order to
characterize the types of microbes present in those sam-
ples, (e.g., [9,10]). This general approach was revolution-
ized by the invention of the polymerase chain reaction
(PCR), which made it relatively easy to clone and then
* Correspondence: jaeisen@ucdavis.edu
1 Department of Medical Microbiology and Immunology and the Department
of Evolution and Ecology, Genome Center, University of California Davis, One
Shields Avenue, Davis, CA, 95616, USA
† Contributed equally
Full list of author information is available at the end of the article
WATERS - Kepler Workflow for rRNA
genes for ribosomal RNA) in partic-
ubunit ribosomal RNA (ss-rRNA).
ed a large amount of previously
l diversity [1,11-13]. Researchers
all subunit rRNA gene not only
ith which it can be PCR amplified,
has variable and highly conserved
to be universally distributed among
nd it is useful for inferring phyloge-
4,15]. Since then, "cultivation-inde-
" have brought a revolution to the
by allowing scientists to study a
mount of diversity in many different
ments [16-18]. The general premise
Figure 1 Overview of WATERS. Schema of WATERS where white
boxes indicate "behind the scenes" analyses that are performed in WA-
Align
Check chimeras
Cluster
Build Tree
Assign Taxonomy
Tree w/ Taxonomy
Diversity statistics & graphs
Unifrac files
Cytoscape network
OTU table
Motivations
As outlined above, successfully processing microbial
sequence collections is far from trivial. Each step is com-
plex and usually requires significant bioinformatics
expertise and time investment prior to the biological
interpretation. In order to both increase efficiency and
ensure that all best-practice tools are easily usable, we
sought to create an "all-inclusive" method for performing
all of these bioinformatics steps together in one package.
To this end, we have built an automated, user-friendly,
workflow-based system called WATERS: a Workflow for
the Alignment, Taxonomy, and Ecology of Ribosomal
Sequences (Fig. 1). In addition to being automated and
simple to use, because WATERS is executed in the Kepler
scientific workflow system (Fig. 2) it also has the advan-
tage that it keeps track of the data lineage and provenance
of data products [23,24].
Automation
The primary motivation in building WATERS was to
minimize the technical, bioinformatics challenges that
arise when performing DNA sequence clustering, phylogenetic
tree, and statistical analyses by automating the 16S
rDNA analysis workflow. We also hoped to exploit
additional features that workflow-based approaches
entail, such as optimized execution and data lineage
tracking and browsing [23,25-27]. In the earlier days of 16S
rDNA analysis, simply knowing which microbes were
present and whether they were biologically novel was a
noteworthy achievement. It was reasonable and expected,
therefore, to invest a large amount of time and effort to
get to that list of microbes. But now that current efforts
are significantly more advanced and often require com-
parison of dozens of factors and variables with datasets of
thousands of sequences, it is not practically feasible to
process these large collections "by hand", and hugely inef-
ficient if instead automated methods can be successfully
employed.
Broadening the user base
A second motivation and perspective is that by minimizing
the technical difficulty of 16S rDNA analysis through
the use of WATERS, we aim to make the analysis of these
datasets more widely available and allow individuals with
Figure 2 Screenshot of WATERS in Kepler software. Key features: the library of actors un-collapsed and displayed on the left-hand side, the input
and output paths where the user declares the location of their input files and desired location for the results files. Each green box is an individual Kepler
actor that performs a single action on the data stream. The connectors (black arrows) direct and hook up the actors in a defined sequence. Double-
clicking on any actor or connector allows it to be manipulated and re-arranged.
default is 97% and 99%), and they are also generated for
every metadata variable comparison that the user
includes.
Data pruning
To assist in troubleshooting and quality contro
WATERS returns to the user three fasta files of sequenc
Figure 3 Biologically similar results automatically produced by WATERS on published colonic microbiota samples. (A) Rarefaction curves similar
to curves shown in Eckburg et al. Fig. 2; 70-72, indicate patient numbers, i.e., 3 different individuals. (B) Weighted Unifrac analysis based on phylogenetic
tree and OTU data produced by WATERS very similar to Eckburg et al. Fig. 3B. (C) Neighbor-joining phylogenetic tree (Quicktree) representing
the sequences analyzed by WATERS, which is clearly similar to Fig. S1 in Eckburg et al.
[Figure 3 panels. (B) PCA plot; axes: "Percent variation explained"; sample labels A, B, C, D, E, F, S. (C) Tree labels: Bacteroidetes, Bacteroidales, Deltaproteobacteria, Actinobacteria, Verrucomicrobia, Epsilonproteobacteria, Firmicutes, Clostridia, Clostridiales, Gammaproteobacteria, Cyanobacteria, Alphaproteobacteria, Fusobacteria, Bacilli, Mollicutes]
Amber Hartman

Tree from Woese. 1987.
Microbiological Reviews 51:221
rRNA Not Perfect
Nothing is Perfect

rRNA Phylogeny Copy # Correction
Kembel SW, Wu M, Eisen JA, Green JL (2012) Incorporating 16S Gene Copy Number Information Improves Estimates of Microbial Diversity and Abundance. PLoS Comput Biol 8(10): e1002743. doi:10.1371/journal.pcbi.1002743
Steven Kembel
Jessica Green
Martin Wu
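The idea behind copy number correction can be shown with a small arithmetic sketch. It assumes the 16S copy number of each taxon is already known, which is a simplification: the cited paper estimates copy number from phylogeny, and that step is not reproduced here. The taxon names, counts, and function name are hypothetical.

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Convert 16S read counts to estimated relative cell abundances.

    Each taxon's read count is divided by its 16S gene copy number, then
    the adjusted values are rescaled so they sum to 1.
    """
    adjusted = {}
    for taxon, reads in read_counts.items():
        copies = copy_numbers[taxon]
        if copies <= 0:
            raise ValueError(f"copy number for {taxon} must be positive")
        adjusted[taxon] = reads / copies
    total = sum(adjusted.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {taxon: value / total for taxon, value in adjusted.items()}


if __name__ == "__main__":
    # Two hypothetical taxa with equal read counts but different copy numbers.
    reads = {"taxonA": 100, "taxonB": 100}
    copies = {"taxonA": 1, "taxonB": 4}
    print(correct_for_copy_number(reads, copies))
```

In this toy case the raw reads suggest a 50:50 community, while the corrected estimate is 80:20, because each cell of the second taxon contributes four times as many 16S reads.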

Tree Complications 1
rRNA rRNA rRNA
ACUGC
ACCUAU
CGUUCG
ACUCC
AGCUAU
CGAUCG
ACCCC
AGCUCU
CGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
R ACUCCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
F ACUCCAGGUAUCGAUCG
C ACCCCAGCUCUCGCUCG
W ACCCCAGCUCUGGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
C ACCCCAGCUCUCGCUCG
Euks Bacteria Arch
Isolate Ribosomes
Arch

Automated Accurate Genome Tree
Lang JM, Darling AE, Eisen JA (2013) Phylogeny of
Bacterial and Archaeal Genomes Using Conserved
Genes: Supertrees and Supermatrices. PLoS ONE
8(4): e62510. doi:10.1371/journal.pone.0062510
Jenna Lang
Aaron Darling

Metagenomics
metagenomics
ACUGC
ACCUAU
CGUUCG
ACUCC
AGCUAU
CGAUCG
ACCCC
AGCUCU
CGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
R ACUCCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
F ACUCCAGGUAUCGAUCG
C ACCCCAGCUCUCGCUCG
W ACCCCAGCUCUGGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
C ACCCCAGCUCUCGCUCG
Eukaryotes Bacteria Archaea

inputs of fixed carbon or nitrogen from external sources. As with
Leptospirillum group I, both Leptospirillum group II and III have the
genes needed to fix carbon by means of the Calvin–Benson–
Bassham cycle (using type II ribulose 1,5-bisphosphate carboxy-
lase–oxygenase). All genomes recovered from the AMD system
contain formate hydrogenlyase complexes. These, in combination
with carbon monoxide dehydrogenase, may be used for carbon
fixation via the reductive acetyl coenzyme A (acetyl-CoA) pathway
by some, or all, organisms. Given the large number of ABC-type
sugar and amino acid transporters encoded in the Ferroplasma type
Figure 4 Cell metabolic cartoons constructed from the annotation of 2,180 ORFs
identified in the Leptospirillum group II genome (63% with putative assigned function) and
1,931 ORFs in the Ferroplasma type II genome (58% with assigned function). The cell
cartoons are shown within a biofilm that is attached to the surface of an acid mine
drainage stream (viewed in cross-section). Tight coupling between ferrous iron oxidation,
pyrite dissolution and acid generation is indicated. Rubisco, ribulose 1,5-bisphosphate
carboxylase–oxygenase. THF, tetrahydrofolate.
NATURE | doi:10.1038/nature02340 | www.nature.com/nature
© 2004 Nature Publishing Group
Metagenomics

Culture Independent “Metagenomics”
DNA DNA DNA
Taxa Characters
New2 AGGGGAGCTCTGCCTCG
New3 ACTCCAGCTATCGATCG
RecA RecA RecA
http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.7
Genome Biology 2008, 9:R151
sequences are not conserved at the nucleotide level [29]. As a
result, the nr database does not actually contain many more
protein marker sequences that can be used as references than
those available from complete genome sequences.
Comparison of phylogeny-based and similarity-based phylotyping
Although our phylogeny-based phylotyping is fully auto-
mated, it still requires many more steps than, and is slower
than, similarity based phylotyping methods such as
MEGAN [30]. Is it worth the trouble? Similarity based phylo-
typing works by searching a query sequence against a refer-
ence database such as NCBI nr and deriving taxonomic
information from the best matches or 'hits'. When species
that are closely related to the query sequence exist in the ref-
erence database, similarity-based phylotyping can work well.
However, if the reference database is a biased sample or if it
contains no closely related species to the query, then the top
hits returned could be misleading [31]. Furthermore, similar-
ity-based methods require an arbitrary similarity cut-off
value to define the top hits. Because individual bacterial
genomes and proteins can evolve at very different rates, a uni-
versal cut-off that works under all conditions does not exist.
As a result, the final results can be very subjective.
In contrast, our tree-based bracketing algorithm places the
query sequence within the context of a phylogenetic tree and
only assigns it to a taxonomic level if that level has adequate
sampling (see Materials and methods [below] for details of
the algorithm). With the well sampled species Prochlorococ-
cus marinus, for example, our method can distinguish closely
related organisms and make taxonomic identifications at the
species level. Our reanalysis of the Sargasso Sea data placed
672 sequences (3.6% of the total) within a P. marinus clade.
On the other hand, for sparsely sampled clades such as
Aquifex, assignments will be made only at the phylum level.
Thus, our phylogeny-based analysis is less susceptible to data
sampling bias than a similarity based approach, and it makes
Figure 3
Major phylotypes identified in Sargasso Sea metagenomic data. The metagenomic data previously obtained from the Sargasso Sea was reanalyzed using
AMPHORA and the 31 protein phylogenetic markers. The microbial diversity profiles obtained from individual markers are remarkably consistent. The
breakdown of the phylotyping assignments by markers and major taxonomic groups is listed in Additional data file 5.
[Figure 3 plot. y-axis: relative abundance (0 to 0.8). x-axis groups: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified proteobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Acidobacteria, Thermotogae, Fusobacteria, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified bacteria. Series (31 marker genes): dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf]
RpoB RpoB RpoB
Rpl4 Rpl4 Rpl4
rRNA rRNA rRNA
Hsp70 Hsp70 Hsp70
EFTu EFTu EFTu
Many other genes better than rRNA

Phylotyping w/ Protein Markers
AMPHORA
Wu and Eisen. Genome Biology 2008, 9:R151. http://genomebiology.com/2008/9/10/R151
[Figure: relative abundance (0 to 0.8) of major taxonomic groups in Sargasso Sea data, by AMPHORA marker gene]
Martin Wu

GOS 1
GOS 2
GOS 3
GOS 4
GOS 5
Phylogenetic ID of Novel Lineages
Dongying Wu
Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, et al. (2011) Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011

Phylogenetic Diversity of Metagenomes
typically used as a qualitative measure because duplicate sequences
are usually removed from the tree. However, the
test may be used in a semiquantitative manner if all clones,
even those with identical or near-identical sequences, are included
in the tree (13).
Here we describe a quantitative version of UniFrac that we
call “weighted UniFrac.” We show that weighted UniFrac behaves
similarly to the FST test in situations where both a
FIG. 1. Calculation of the unweighted and the weighted UniFr
measures. Squares and circles represent sequences from two differe
environments. (a) In unweighted UniFrac, the distance between t
circle and square communities is calculated as the fraction of t
branch length that has descendants from either the square or the circ
environment (black) but not both (gray). (b) In weighted UniFra
branch lengths are weighted by the relative abundance of sequences
the square and circle communities; square sequences are weight
twice as much as circle sequences because there are twice as many tot
circle sequences in the data set. The width of branches is proportion
to the degree to which each branch is weighted in the calculations, an
gray branches have no weight. Branches 1 and 2 have heavy weigh
since the descendants are biased toward the square and circles, respe
tively. Branch 3 contributes no value since it has an equal contributio
from circle and square sequences after normalization.
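The FIG. 1 caption above describes both UniFrac measures in words. A minimal Python sketch may make the arithmetic concrete. It assumes a toy tree stored as a child-to-parent map; the tree, the counts, and all function names are hypothetical. The weighted form shown is the raw (unnormalized) sum of branch length times the difference in relative abundance, which is one reading of the caption and not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical toy tree: node -> (parent, branch length leading to that node).
TOY_TREE = {
    "a": ("x", 1.0),
    "b": ("x", 1.0),
    "c": ("y", 2.0),
    "d": ("y", 2.0),
    "x": ("root", 0.5),
    "y": ("root", 0.5),
}


def branch_counts(tree, leaf_counts):
    """Sum the sequence counts of all leaves that descend from each branch."""
    totals = defaultdict(float)
    for leaf, count in leaf_counts.items():
        node = leaf
        while node in tree:  # walk up until the root is reached
            totals[node] += count
            node = tree[node][0]
    return totals


def unweighted_unifrac(tree, counts_a, counts_b):
    """Fraction of observed branch length that leads to only one community."""
    a = branch_counts(tree, counts_a)
    b = branch_counts(tree, counts_b)
    unique = shared = 0.0
    for node, (_, length) in tree.items():
        in_a, in_b = a[node] > 0, b[node] > 0
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
    return unique / (unique + shared)


def weighted_unifrac(tree, counts_a, counts_b):
    """Branch lengths weighted by the difference in relative abundance."""
    a = branch_counts(tree, counts_a)
    b = branch_counts(tree, counts_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    return sum(
        length * abs(a[node] / total_a - b[node] / total_b)
        for node, (_, length) in tree.items()
    )


squares = {"a": 1, "c": 1}  # hypothetical community 1
circles = {"b": 1, "c": 1}  # hypothetical community 2
print(unweighted_unifrac(TOY_TREE, squares, circles))
print(weighted_unifrac(TOY_TREE, squares, circles))
```

Dividing each count by its community total is what makes communities of different sequencing depth comparable, matching the normalization the caption mentions for branch 3.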
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of
Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
Jessica Green
Steven Kembel
Katie Pollard

Phylosift
Input Sequences
rRNA workflow
protein workflow
profile HMMs used to align
candidates to reference alignment
Taxonomic
Summaries
parallel option
hmmalign
multiple alignment
LAST
fast candidate search
pplacer
phylogenetic placement
LAST
LAST
search input against references
hmmalign
multiple alignment
hmmalign
multiple alignment
Infernal
multiple alignment
LAST
<600 bp
>600 bp
Sample Analysis &
Comparison
Krona plots,
Number of reads placed
for each marker gene
Edge PCA,
Tree visualization,
Bayes factor tests
each input sequence scanned against both workflows
Aaron Darling
@koadman
Erik Matsen
@ematsen
Holly Bik
@hollybik
Guillaume Jospin
@guillaumejospin
Darling AE, Jospin G, Lowe E, Matsen FA IV, Bik HM, Eisen JA. (2014) PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2:e243
http://dx.doi.org/10.7717/peerj.243
Erik Lowe

Edge PCA: Identify
lineages that explain most
variation among samples
Edge PCA - Matsen and Evans 2013
Output: Edge PCA
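As a rough illustration of the Edge PCA idea above, the sketch below builds a samples-by-edges matrix and extracts the first principal axis by power iteration. Several things here are assumptions: each matrix entry is taken to be the mass on the distal side of an edge minus the mass on the proximal side, reads are assumed to be placed on leaves only, and the toy tree, sample counts, and function names are hypothetical. It is not the implementation used by PhyloSift or by Matsen and Evans.

```python
import math

# Hypothetical toy tree: node -> (parent, branch length); each key names an edge.
EDGE_TREE = {
    "a": ("x", 1.0),
    "b": ("x", 1.0),
    "c": ("y", 2.0),
    "d": ("y", 2.0),
    "x": ("root", 0.5),
    "y": ("root", 0.5),
}


def distal_fractions(tree, leaf_counts):
    """Fraction of a sample's reads placed below (distal to) each edge."""
    total = sum(leaf_counts.values())
    frac = {edge: 0.0 for edge in tree}
    for leaf, count in leaf_counts.items():
        node = leaf
        while node in tree:
            frac[node] += count / total
            node = tree[node][0]
    return frac


def edge_pca_first_axis(tree, samples, iterations=100):
    """Return edge -> loading on the first principal component."""
    edges = list(tree)
    n_edges = len(edges)
    # Rows are samples, columns are edges; entry = distal mass - proximal mass.
    rows = []
    for counts in samples:
        frac = distal_fractions(tree, counts)
        rows.append([2.0 * frac[e] - 1.0 for e in edges])
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(n_edges)]
    centered = [[r[j] - means[j] for j in range(n_edges)] for r in rows]
    # Power iteration on X^T X, computed without forming the matrix.
    v = [1.0 / (j + 1) for j in range(n_edges)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        new_v = [sum(scores[i] * centered[i][j] for i in range(n))
                 for j in range(n_edges)]
        norm = math.sqrt(sum(w * w for w in new_v))
        if norm == 0.0:  # no variation among samples
            break
        v = [w / norm for w in new_v]
    return dict(zip(edges, v))


# Three hypothetical samples shifting from clade x (leaf a) to clade y (leaf c).
samples = [{"a": 9, "c": 1}, {"a": 5, "c": 5}, {"a": 1, "c": 9}]
loadings = edge_pca_first_axis(EDGE_TREE, samples)
for edge, weight in sorted(loadings.items(), key=lambda kv: -abs(kv[1])):
    print(edge, round(weight, 3))
```

Edges with large loadings of opposite sign mark the lineages whose abundance trades off across samples, which is the sense in which the method identifies lineages that explain most of the variation.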

Using Phylogeny 2: Functional Prediction

PHYLOGENETIC PREDICTION OF GENE FUNCTION
METHOD
CHOOSE GENE(S) OF INTEREST
IDENTIFY HOMOLOGS
ALIGN SEQUENCES
CALCULATE GENE TREE
OVERLAY KNOWN FUNCTIONS ONTO TREE
INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
[Figure: EXAMPLE A and EXAMPLE B. Gene trees for genes 1-6 and for paralogs 1A-3A and 1B-3B from Species 1, Species 2, and Species 3, shown against the ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN). Inferred events are marked "Duplication?" and some predictions are marked "Ambiguous".]
Based on Eisen, 1998 Genome Res 8: 163-167.
Phylogenomics

Overlaying Functions onto Tree
[Figure: phylogenetic tree of MutS homologs from bacterial, archaeal, and eukaryotic species (e.g., Ecoli, Bacsu, Yeast, Human, Mouse, Arath), with subfamilies labeled MSH1, MSH2, MSH3, MSH4, MSH5, MSH6, MutS1, and MutS2]
Based on Eisen, 1998 
Nucl Acids Res 26: 4291-4300.

Phylogenomics ~~ Phylotyping
Eisen et al. 1992. J. Bact. 174: 3416

Proteorhodopsin Functional Diversity
Venter et al., Science 304: 66. 2004

Shotmap
Simulate metagenomic library
Translate metagenomic reads
Search metagenomic peptides
Classify metagenomic peptides
Estimate protein family abundance
Taxonomic profiles from real metagenomes
Protein family database
IMG/ER reference genomes
Construct mock community
Annotate genes in genomes
Expected abundance of gene families
Protein family database
Evaluate estimation accuracy
Tom Sharpton
Katie Pollard
https://github.com/sharpton/shotmap

Functional Prediction from Metagenomes
DNA
Taxa Characters
New2 AGGGGAGCTCTGCCTCG
New3 ACTCCAGCTATCGATCG
inputs of fixed carbon or nitrogen from external sources. As with Leptospirillum group I, both Leptospirillum group II and III have the genes needed to fix carbon by means of the Calvin–Benson–Bassham cycle (using type II ribulose 1,5-bisphosphate carboxylase–oxygenase). All genomes recovered from the AMD system contain formate hydrogenlyase complexes. These, in combination with carbon monoxide dehydrogenase, may be used for carbon fixation via the reductive acetyl coenzyme A (acetyl-CoA) pathway by some, or all, organisms. Given the large number of ABC-type sugar and amino acid transporters encoded in the Ferroplasma type
Figure 4 Cell metabolic cartoons constructed from the annotation of 2,180 ORFs identified in the Leptospirillum group II genome (63% with putative assigned function) and 1,931 ORFs in the Ferroplasma type II genome (58% with assigned function). The cell cartoons are shown within a biofilm that is attached to the surface of an acid mine drainage stream (viewed in cross-section). Tight coupling between ferrous iron oxidation, pyrite dissolution and acid generation is indicated. Rubisco, ribulose 1,5-bisphosphate carboxylase–oxygenase. THF, tetrahydrofolate.
NATURE | doi:10.1038/nature02340 | www.nature.com/nature © 2004 Nature Publishing Group

Phylogenetic Prediction of Function
• Many powerful and automated similarity based
methods for assigning genes to protein families
• COGs
• PFAM HMM searches
• Some limitations of similarity based methods can be
overcome by phylogenetic approaches
• Automated methods now available
• Sean Eddy
• Steven Brenner
• Kimmen Sjölander
• But …

Carboxydothermus hydrogenoformans
• Isolated from a Russian hotspring
• Thermophile (grows at 80°C)
• Anaerobic
• Grows very efficiently on CO (Carbon
Monoxide)
• Produces hydrogen gas
• Low GC Gram positive (Firmicute)
• Genome Determined (Wu et al. 2005 PLoS Genetics 1: e65)

Homologs of Sporulation Genes
Wu et al. 2005 PLoS
Genetics 1: e65.

Carboxydothermus sporulates
Wu et al. 2005 PLoS Genetics 1: e65.

Non-Homology Predictions:
Phylogenetic Profiling
• Step 1: Search all genes in
organisms of interest against all
other genomes
• Ask: Yes or No, is each gene
found in each other species
• Cluster genes by distribution
patterns (profiles)
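The profiling steps above can be sketched in a few lines of Python. This is a minimal illustration that assumes the homology search has already been run and summarized as a map from each gene to the genomes containing a hit. The gene and genome names are placeholders, and the greedy grouping by number of mismatching genomes is a simple stand-in for whatever clustering method is actually used.

```python
def build_profiles(hits, genomes):
    """Turn gene -> {genomes with a homolog} into gene -> tuple of 0/1 calls."""
    return {
        gene: tuple(int(genome in found) for genome in genomes)
        for gene, found in hits.items()
    }


def cluster_by_profile(profiles, max_mismatches=0):
    """Greedy grouping: a gene joins the first cluster whose seed profile
    differs from its own profile at no more than max_mismatches genomes."""
    clusters = []  # list of (seed profile, member genes)
    for gene in sorted(profiles):
        profile = profiles[gene]
        for seed, members in clusters:
            if sum(a != b for a, b in zip(seed, profile)) <= max_mismatches:
                members.append(gene)
                break
        else:
            clusters.append((profile, [gene]))
    return [members for _, members in clusters]


# Placeholder data: which genomes contain a homolog of each gene.
GENOMES = ["genome1", "genome2", "genome3", "genome4"]
HITS = {
    "geneA": {"genome1", "genome2"},
    "geneB": {"genome1", "genome2"},
    "geneC": {"genome1", "genome2", "genome3", "genome4"},
    "geneD": {"genome3"},
}

profiles = build_profiles(HITS, GENOMES)
print(cluster_by_profile(profiles))
```

Genes that land in the same cluster share a presence/absence pattern across genomes, which is the signal used to propose a shared function without relying on sequence similarity to a characterized gene.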

Sporulation Gene Profile
Wu et al. 2005 PLoS Genetics 1: e65.

B. subtilis new sporulation genes
J Bacteriol. 2013 Jan;195(2):253-60. doi: 10.1128/JB.01778-12
Bjorn Traag
Richard Losick

Phylogenetic Profiling for Metagenomics?

Using Phylogeny 3: Linking Function and Phylogeny

HiC Crosslinking & Sequencing
Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore
RW, Eisen JA, Darling AE. (2014) Strain- and plasmid-
level deconvolution of a synthetic metagenome by
sequencing proximity ligation products. PeerJ 2:e415
http://dx.doi.org/10.7717/peerj.415
Table 1 Species alignment fractions. The number of reads aligning to each replicon present in the
synthetic microbial community are shown before and after filtering, along with the percent of total
constituted by each species. The GC content (“GC”) and restriction site counts (“#R.S.”) of each replicon,
species, and strain are shown. Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome
2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus,
K12: E. coli K12 DH10B, BL21: E. coli BL21. An expanded version of this table can be found in Table S2.
Sequence  Alignment   % of Total  Filtered    % of aligned  Length     GC     #R.S.
Lac0      10,603,204  26.17%      10,269,562  96.85%        2,291,220  0.462  629
Lac1      145,718     0.36%       145,478     99.84%        13,413     0.386  3
Lac2      691,723     1.71%       665,825     96.26%        35,595     0.385  16
Lac       11,440,645  28.23%      11,080,865  96.86%        2,340,228  0.46   648
Ped       2,084,595   5.14%       2,022,870   97.04%        1,832,387  0.373  863
BL21      12,882,177  31.79%      2,676,458   20.78%        4,558,953  0.508  508
K12       9,693,726   23.92%      1,218,281   12.57%        4,686,137  0.507  568
E. coli   22,575,903  55.71%      3,894,739   17.25%        9,245,090  0.51   1076
Bur1      1,886,054   4.65%       1,797,745   95.32%        2,914,771  0.68   144
Bur2      2,536,569   6.26%       2,464,534   97.16%        3,809,201  0.672  225
Bur       4,422,623   10.91%      4,262,279   96.37%        6,723,972  0.68   369
Figure 1 Hi-C insert distribution. The distribution of genomic distances between Hi-C read pairs is
shown for read pairs mapping to each chromosome. For each read pair the minimum path length on
the circular chromosome was calculated and read pairs separated by less than 1000 bp were discarded.
The 2.5 Mb range was divided into 100 bins of equal size and the number of read pairs in each bin
was recorded for each chromosome. Bin values for each chromosome were normalized to sum to 1 and
plotted.
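The procedure in the Figure 1 caption above (shortest path on the circle, a 1000 bp cutoff, 100 equal bins over 2.5 Mb, normalization to sum to 1) can be sketched as follows. The function names and the example read pairs are hypothetical, and dropping pairs at or beyond 2.5 Mb is an assumption, since the caption does not say how such pairs were handled.

```python
def circular_distance(pos1, pos2, chrom_length):
    """Shortest path between two positions on a circular chromosome."""
    direct = abs(pos1 - pos2)
    return min(direct, chrom_length - direct)


def insert_distribution(read_pairs, chrom_length, min_dist=1000,
                        max_dist=2_500_000, n_bins=100):
    """Normalized histogram of Hi-C read-pair separations for one chromosome."""
    bin_width = max_dist / n_bins
    bins = [0] * n_bins
    for pos1, pos2 in read_pairs:
        dist = circular_distance(pos1, pos2, chrom_length)
        if dist < min_dist or dist >= max_dist:
            continue  # too close (discarded in the caption) or out of range
        bins[int(dist // bin_width)] += 1
    total = sum(bins)
    if total == 0:
        return [0.0] * n_bins
    return [count / total for count in bins]  # bin values sum to 1


# Hypothetical read-pair coordinates on a 4,686,137 bp circular chromosome.
K12_LENGTH = 4_686_137
pairs = [(0, 4_686_000), (100, 30_100), (0, 4_676_137), (0, 2_000_000)]
dist = insert_distribution(pairs, K12_LENGTH)
print([(i, round(v, 3)) for i, v in enumerate(dist) if v > 0])
```

Using the circular distance avoids the artifact noted for linearized references, where pairs spanning the origin would otherwise appear to be megabases apart.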
E. coli K12 genome were distributed in a similar manner as previously reported (Fig. 1;
(Lieberman-Aiden et al., 2009)). We observed a minor depletion of alignments spanning
the linearization point of the E. coli K12 assembly (e.g., near coordinates 0 and 4686137)
due to edge effects induced by BWA treating the sequence as a linear chromosome rather
than circular.
Figure 2 Metagenomic Hi-C associations. The log-scaled, normalized number of Hi-C read pairs
associating each genomic replicon in the synthetic community is shown as a heat map (see color scale,
blue to yellow: low to high normalized, log scaled association rates). Bur1: B. thailandensis chromosome
1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2:
L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21.
reference assemblies of the members of our synthetic microbial community with the same
alignment parameters as were used in the top ranked clustering (described above). We first
Figure 3 Contigs associated by Hi-C reads. A graph is drawn with nodes depicting contigs and edges
depicting associations between contigs as indicated by aligned Hi-C read pairs, with the count thereof
depicted by the weight of edges. Nodes are colored to reflect the species to which they belong (see legend)
with node size reflecting contig size. Contigs below 5 kb and edges with weights less than 5 were excluded.
Contig associations were normalized for variation in contig size.
typically represent the reads and variant sites as a variant graph wherein variant sites are
represented as nodes, and sequence reads define edges between variant sites observed in
the same read (or read pair). We reasoned that variant graphs constructed from Hi-C
data would have much greater connectivity (where connectivity is defined as the mean
path length between randomly sampled variant positions) than graphs constructed from
mate-pair sequencing data, simply because Hi-C inserts span megabase distances. Such
Figure 4 Hi-C contact maps for replicons of Lactobacillus brevis. Contact maps show the number of
Hi-C read pairs associating each region of the L. brevis genome. The L. brevis chromosome (Lac0, (A),
Chris Beitel
@datscimed
Aaron Darling
@koadman

Pink Berries
PB-PSB1
(Purple sulfur bacteria)
PB-SRB1
(Sulfate reducing bacteria)
(sulfate)
(sulfide)
Wilbanks, E.G. et al (2014). Environmental Microbiology
Lizzy Wilbanks
@lizzywilbanks

Long Reads Help, A Lot
Hiseq & Miseq
100-250 bp
Moleculo
2-20 kb
Pacbio RSII
2-20kb
Micky Kertesz,
Tim Blauwcamp
Meredith Ashby
Cheryl Heiner
Illumina-based
“synthetic long
reads”
Real-time single
molecule
sequencing
(p4-c2, p5-c3)
295 Megabases 474 Megabases 61 Gigabases

Using Phylogeny 4: Better Reference Data

PhyEco Markers
Phylogenetic group     Genome Number  Gene Number  Marker Candidates
Archaea                62             145415       106
Actinobacteria         63             267783       136
Alphaproteobacteria    94             347287       121
Betaproteobacteria     56             266362       311
Gammaproteobacteria    126            483632       118
Deltaproteobacteria    25             102115       206
Epsilonproteobacteria  18             33416        455
Bacteroides            25             71531        286
Chlamydiae             13             13823        560
Chloroflexi            10             33577        323
Cyanobacteria          36             124080       590
Firmicutes             106            312309       87
Spirochaetes           18             38832        176
Thermi                 5              14160        974
Thermotogae            9              17037        684
Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families
for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological
Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE
8(10): e77033. doi:10.1371/journal.pone.0077033

Better Protein Families
Representative
Genomes
Extract
Protein
Annotation
All v. All
BLAST
Homology
Clustering
(MCL)
SFams
Align &
Build
HMMs
HMMs
Screen for
Homologs
New
Genomes
Extract
Protein
Annotation
Figure 1 (panels A, B, C) from Sharpton et al. 2012. BMC Bioinformatics, 13(1), 264.

Microbial Dark Matter Part 2
• Ramunas Stepanauskas
• Tanja Woyke
• Jonathan Eisen
• Duane Moser
• Tullis Onstott

Phylogeny Isn’t Everything .. Model Systems

Wu et al. 2006 PLoS Biology 4: e188.
Baumannia makes vitamins and cofactors
Sulcia makes amino acids
Simple Symbioses
Phylogenetic Binning
Nancy Moran
Dongying Wu

Drosophila microbiome w/ Kopp Lab
Both natural surveys and laboratory
experiments indicate that host diet
plays a major role in shaping the
Drosophila bacterial microbiome.
Laboratory strains provide only a
limited model of natural host–microbe
interactions
Jenna Lang Angus Chandler

Rice Microbiome w/ Sundar Lab
Edwards et al. 2015. Structure, variation,
and assembly of the root-associated
microbiomes of rice. PNAS
Supplementary Figures
Fig. S1. Map depicting soil collection locations for greenhouse experiment.
Fig. S2. Sampling and collection of the rhizocompartments. Roots are collected from rice plants and soil is shaken off the roots to leave ~1 mm of soil around the roots. The ~1 mm of soil
three separate rhizocompartments: the rhizosphere, rhizoplane,
and endosphere (Fig. 1A). Because the root microbiome has
been shown to correlate with the developmental stage of the
plant (10), the root-associated microbial communities were
sampled at 42 d (6 wk), when rice plants from all genotypes were
well-established in the soil but still in their vegetative phase of
growth. For our study, the rhizosphere compartment was com-
Fig. 1. Root-associated microbial communities are separable by rhizocompartment and soil type. (A) A representation of a rice root cross-section depicting the locations of the microbial communities sampled. (B) Within-sample diversity (α-diversity) measurements between rhizospheric compartments indicate a decreasing gradient in microbial diversity from the rhizosphere to the endosphere independent of soil type. Estimated species richness was calculated as e^(Shannon entropy). The horizontal bars within boxes represent median. The tops and bottoms of boxes represent 75th and 25th quartiles, respectively. The upper and lower whiskers extend 1.5× the interquartile range from the upper edge and lower edge of the box, respectively. All outliers are plotted as individual points. (C) PCoA using the WUF metric indicates that the largest separation between microbial communities is spatial proximity to the root (PCo 1) and the second largest source of variation is soil type (PCo 2). (D) Histograms of phyla abundances in each compartment and soil. B, bulk soil; E, endosphere; P, rhizoplane; S, rhizosphere; Sac, Sacramento.
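The richness estimate named in the caption above, the exponential of the Shannon entropy, is short enough to show directly. The sketch below assumes natural logarithms and raw OTU counts as input; the function names and example counts are hypothetical.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a vector of OTU counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def estimated_richness(counts):
    """Exponential of Shannon entropy: the effective number of OTUs."""
    return math.exp(shannon_entropy(counts))


even_sample = [10, 10, 10, 10]  # four equally abundant OTUs
skewed_sample = [37, 1, 1, 1]   # four OTUs, one dominant
print(estimated_richness(even_sample))
print(estimated_richness(skewed_sample))
```

For a perfectly even community the value equals the number of OTUs, and it shrinks toward 1 as a single OTU comes to dominate, so the measure reflects evenness as well as raw OTU count.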
2 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1414592112
gate the relationship between rice ge-
icrobiome, domesticated rice varieties
rated growing regions were tested. Six
spanning two species within the Oryza
2 d in the greenhouse before sampling.
a) cultivars M104, Nipponbare (both
ties), IR50, and 93-11 (both indica va-
gside two cultivars of African cultivated
g7102 (Glab B) and TOg7267 (Glab E).
ed that rice genotype accounted for
ariation between microbial communities
% of the variance, P < 0.001; Dataset
f the variance, P < 0.066; Dataset S5H);
ntations for clustering patterns of the
nt on the first two axes of unconstrained
ppendix, Fig. S10). We then used CAP
ffect of rice genotype on the microbial
g on rice cultivar and controlling for
and technical factors, we found that ge-
ice have a significant effect on root-
mmunities (5.1%, P = 0.005, WUF, Fig.
, UUF, SI Appendix, Fig. S11A). Ordi-
AP analysis revealed clustering patterns
only partially consistent with genetic
F and UUF metrics. The two japonica
her and the two O. glaberrima cultivars
ver, the indica cultivars were split, with
O. glaberrima cultivars and IR50 clus-
cultivars.
enotypic effect manifests in individual
eparated the whole dataset to focus on
vidually and conducted CAP analysis
and technical factors. The rhizosphere
eight sites were operated under two cultivation practices: organic
cultivation and a more conventional cultivation practice termed
“ecofarming” (see below). Because genotype explained the least
variance in the greenhouse data, we limited the analysis to one
cultivar, S102, a California temperate japonica variety that is
widely cultivated by commercial growers and is closely related to
M104 (26). Field samples were collected from vegetatively
growing rice plants in flooded fields and the previously defined
rhizocompartments were analyzed as before. Unfortunately,
collection of bulk soil controls for the field experiment was not
Fig. 3. Host plant genotype significantly affects microbial communities in the rhizospheric compartments. (A) Ordination of CAP analysis using the WUF metric constrained to rice genotype. (B) Within-sample diversity measurements of rhizosphere samples of each cultivar grown in each soil. Estimated species richness was calculated as e^(Shannon entropy). The horizontal bars within boxes represent median. The tops and bottoms of boxes represent 75th and 25th quartiles, respectively. The upper and lower whiskers extend 1.5× the interquartile range from the upper edge and lower edge of the box, respectively. All outliers are plotted as individual points.
fields are too high to find representative soil that is unlikely to
be affected by nearby plants. Amplification and sequencing of
the field microbiome samples yielded 13,349,538 high-quality
sequences (median: 54,069 reads per sample; range: 12,535–
148,233 reads per sample; Dataset S13). The sequences were
clustered into OTUs using the same criteria as the greenhouse
experiment, yielding 222,691 microbial OTUs and 47,983 OTUs
with counts >5 across the field dataset.
We found that the microbial diversity of field rice plants is
significantly influenced by the field site. α-Diversity measure-
ments of the field rhizospheres indicated that the cultivation site
significantly impacts microbial diversity (SI Appendix, Fig. S14A,
P = 2.00E-16, ANOVA and Dataset S14). Unconstrained PCoA
using both the WUF and UUF metrics showed that microbial
communities separated by field site across the first axis (Fig. 4B,
WUF and SI Appendix, Fig. S14B, UUF). PERMANOVA agreed
with the unconstrained PCoA in that field site explained the
largest proportion of variance between the microbial communi-
ties for field plants (30.4% of variance, P < 0.001, WUF, Dataset
S5O and 26.6% of variance, P < 0.001, UUF, Dataset S5P). CAP
analysis constrained to field site and controlled for rhizocom-
partment, cultivation practice, and technical factors (sequencing
batch and biological replicate) agreed with the PERMANOVA
results in that the field site explains the largest proportion of
variance between the root-associated microbial communities in
field plants (27.3%, P = 0.005, WUF, SI Appendix, Fig. S15A
and 28.9%, P = 0.005, UUF, SI Appendix, Fig. S15E), sug-
gesting that geographical factors may shape root-associated
microbial communities.
Rhizospheric Compartmentalization Is Retained in Field Plants. Sim-
ilar to the greenhouse plants, the rhizospheric microbiomes of
field plants are distinguishable by compartment. α-Diversity of
the field plants again showed that the rhizosphere had the
highest microbial diversity, whereas the endosphere had the least
S15). PCoA
the WUF a
compartmen
Appendix, F
separation i
ond largest
(20.76%, P
UUF, Data
biomes cons
trolled for f
agreed with
variance bet
compartmen
and 10.9%,
Taxonomi
overall sim
Chloroflexi,
microbiota.
endosphere
Proteobacteri
and Plancto
distribution
trend from t
Appendix, Fi
We again
OTUs in the
S16). We fo
endosphere c
representing
Fig. S17). Th
the genus A
and Alphap
terestingly, 1
found to b
greenhouse
OTUs were
sisted of tax
and Myxoco
bidopsis roo
Cultivation Pr
The rice fiel
practices, org
tion called
farming in th
are all perm
harvest fumi
itself does si
partments ov
a significant
the rhizocom
indicating th
affected diffe
the rhizosph
practice, with
zospheres th
Dataset S14)
crobial comm
tests; Datase
practices are
the WUF m
S14D). PER
Fig. 4. Root-associated microbiomes from field-grown plants are separable
by cultivation site, rhizospheric compartment, and cultivation practice. (A)
Variation w/in Plant
Cultivation Site Effects
Rice Genotype Effects
and mitochondrial) reads to analyze microbial abundance in
the endosphere over time (Fig. 6A). Using this technique, we
confirmed the sterility of seedling roots before transplantation.
We found that microbial penetrance into the endosphere oc-
curred at or before 24 h after transplantation and that the pro-
portion of microbial reads to organellar reads increased over the
first 2 wk after transplantation (Fig. 6A). To further support the
evidence for microbiome acquisition within the first 24 h, we
sampled root endospheric microbiomes from sterilely germi-
nated seedlings before transplanting into Davis field soil as well
as immediately after transplantation and 24 h after transplan-
tation (SI Appendix, Fig. S24). The root endospheres of sterilely
germinated seedlings, as well as seedlings transplanted into
Davis field soil for 1 min, both had a very low percentage of
microbial reads compared with organellar reads (0.22% and
0.71%), with the differences not statistically significant (P = 0.1,
Wilcoxon test). As before, endospheric microbial abundance
increased significantly, by >10-fold after 24 h in field soil (3.95%,
P = 0.05, Wilcoxon test). We conclude that brief soil contact
does not strongly increase the proportion of microbial reads, and
therefore the increase in microbial reads at 24 h is indicative of
endophyte acquisition within 1 d after transplantation.
α-Diversity significantly varied by rhizocompartment (P < 2E-
16; Dataset S23) and there was a significant interaction between
rhizocompartment and collection time (P = 0.042; Dataset S23);
however, when each rhizocompartment was analyzed individ-
ually, the bulk soil was the only compartment that showed
(13 d) approach the endosphere and rhizoplane microbiome
compositions for plants that have been grown in the green-
house for 42 d.
There are slight shifts in the distribution of phyla over time;
however, there are significant distinctions between the com-
partments starting as early as 24 h after transplantation into soil
(Fig. 6D, SI Appendix, Figs. S24B and S26, and Dataset S24).
Because each phylum consists of diverse OTUs that could ex-
hibit very different behaviors during acquisition, we next ex-
amined the dynamics and colonization patterns of specific
OTUs within the time-course experiment. The core set of 92
endosphere-enriched OTUs obtained from the previous green-
house experiment (SI Appendix, Fig. S9C) was analyzed for
relative abundances at different time points (Fig. 6E). Of the 92
core endosphere-enriched microbes present in the greenhouse
experiment, 53 OTUs were detectable in the endosphere in the
time-course experiment. The average abundance profile over
time revealed a colonization pattern for the core endospheric
microbiome. Relative abundance of the core endosphere-
enriched microbiome peaks early (3 d) in the rhizosphere and
then decreases back to a steady, low level for the remainder of
the time points. Similarly, the rhizoplane profile shows an in-
crease after 3 d with a peak at 8 d with a decline at 13 d. The
endosphere generally follows the rhizoplane profile, except that
relative abundance is still increasing at 13 d. These results sug-
gest that the core endospheric microbes are first attracted to the
rhizosphere and then locate to the rhizoplane, where they attach
Fig. 5. OTU coabundance network reveals modules of OTUs associated with methane cycling. (A) Subset of the entire network corresponding to 11
modules with methane cycling potential. Each node represents one OTU and an edge is drawn between OTUs if they share a Pearson correlation of
greater than or equal to 0.6. (B) Depiction of module 119 showing the relationship between methanogens, syntrophs, methanotrophs, and other
methane cycling taxonomies. Each node represents one OTU and is labeled by the presumed function of that OTU’s taxonomy in methane cycling. An
edge is drawn between two OTUs if they have a Pearson correlation of greater than or equal to 0.6. (C) Mean abundance profile for OTUs in module 119
across all rhizocompartments and field sites. The position along the x axis corresponds to a different field site. Error bars represent SE. The x and y axes
represent no particular scale.
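The edge rule in the Fig. 5 caption above (connect two OTUs when their Pearson correlation is at least 0.6) can be sketched as follows. The abundance profiles and function names are hypothetical, and the sketch stops at listing edges; how the paper grouped OTUs into modules is not described in this excerpt and is not reproduced here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / (sd_x * sd_y)


def coabundance_edges(abundance, threshold=0.6):
    """Return (otu_a, otu_b, r) for every pair with r >= threshold."""
    edges = []
    for otu_a, otu_b in combinations(sorted(abundance), 2):
        r = pearson(abundance[otu_a], abundance[otu_b])
        if r >= threshold:
            edges.append((otu_a, otu_b, r))
    return edges


# Hypothetical abundance of four OTUs across four samples.
ABUNDANCE = {
    "otu1": [1, 2, 3, 4],
    "otu2": [2, 4, 6, 8],
    "otu3": [4, 3, 2, 1],
    "otu4": [1, 3, 2, 4],
}

for otu_a, otu_b, r in coabundance_edges(ABUNDANCE):
    print(otu_a, otu_b, round(r, 2))
```

Only positive correlations pass the threshold, so strongly anti-correlated OTUs such as otu1 and otu3 in this toy data remain unconnected.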
Function x Genotype
of magnitude greater than in any single plant species to date.
Under controlled greenhouse conditions, the rhizocompartments
described the largest source of variation in the microbial com-
munities sampled (Dataset S5A). The pattern of separation be-
tween the microbial communities in each compartment is
consistent with a spatial gradient from the bulk soil across the
rhizosphere and rhizoplane into the endosphere (Fig. 1C).
Similarly, microbial diversity patterns within samples hold the
same pattern where there is a gradient in α-diversity from the
rhizosphere to the endosphere (Fig. 1B). Enrichment and de-
pletion of certain microbes across the rhizocompartments indi-
cates that microbial colonization of rice roots is not a passive
process and that plants have the ability to select for certain mi-
crobial consortia or that some microbes are better at filling the
root colonizing niche. Similar to studies in Arabidopsis, we found
that the relative abundance of Proteobacteria is increased in the
endosphere compared with soil, and that the relative abundances
of Acidobacteria and Gemmatimonadetes decrease from the soil
to the endosphere (9–11), suggesting that the distribution of
different bacterial phyla inside the roots might be similar for all
land plants (Fig. 1D and Dataset S6). Under controlled green-
house conditions, soil type described the second largest source
of variation within the microbial communities of each sample.
However, the soil source did not affect the pattern of separation
between the rhizospheric compartments, suggesting that the
rhizocompartments exert a recruitment effect on microbial con-
sortia independent of the microbiome source.
By using differential OTU abundance analysis in the com-
partments, we observed that the rhizosphere serves an enrich-
ment role for a subset of microbial OTUs relative to bulk soil
(Fig. 2). Further, the majority of the OTUs enriched in the
rhizosphere are simultaneously enriched in the rhizoplane and/or
endosphere of rice roots (Fig. 2B and SI Appendix, Fig. S16B),
consistent with a recruitment model in which factors produced by
the root attract taxa that can colonize the endosphere. We found
that the rhizoplane, although enriched for OTUs that are also
Time Series

Acknowledgements
DOE JGI, Sloan, GBMF, NSF, DHS, DARPA
Aaron Darling
Lizzy Wilbanks
Jenna Lang
Russell Neches
Rob Knight
Jack Gilbert
Tanja Woyke
Rob Dunn
Katie Pollard
Jessica Green
Darlene Cavalier
Eddy Rubin
Wendy Brown
Dongying Wu
Phil Hugenholtz
DSMZ
Sundar
Srijak Bhatnagar
David Coil
Alex Alexiev
Hannah Holland-Moritz
Holly Bik
John Zhang
Holly Menninger
Guillaume Jospin
David Lang
Cassie Ettinger
Tim Harkins
Jennifer Gardy
Holly Ganz
