Jonathan Eisen talk for 2019 ADVANCE Scholar Award Symposium
Jonathan Eisen
Slides for my talk at the 2019 ADVANCE Scholar Award Symposium. Talk covered a little bit about my research and more about STEM Diversity. See https://diversity.ucdavis.edu/2019-advance-scholar-award-symposium
Similar to Phylogenomic Case Studies: The Benefits (and Occasional Drawbacks) of Integrating Evolutionary and Genomic Studies. Talk by J. Eisen for BIATA 2021
Use the Harvard Business Case, West Jet Airlines Information Tec.docx
jessiehampson
Use the Harvard Business Case, “West Jet Airlines: Information Technology Governance and Corporate Strategy,” as the basis for answering the following questions:
What was West Jet’s strategic plan?
What were the main problems faced by the West Jet IT organization?
Discuss how West Jet transformed their IT organizational structure. How was the structure itself realigned? What methods and processes were introduced or removed?
Discuss IT governance models that were considered to enable IT to function more efficiently at West Jet.
How does IT affect a company’s corporate strategy and the overall strategic impact?
Business School, Cespedes, Frank & Kindley, James
Minimum 2 scholarly Articles References.
Minimum of 500 Words, APA Format
Your paper will be submitted to Turnitin software, No plagiarism.
Contents lists available at ScienceDirect
Infection, Genetics and Evolution
journal homepage: www.elsevier.com/locate/meegid
Short communication
Genetic diversity and evolution of SARS-CoV-2
Tung Phan⁎
Division of Clinical Microbiology, University of Pittsburgh and University of Pittsburgh Medical Center, Pittsburgh, PA, USA
ARTICLE INFO
Keywords:
Coronavirus
SARS-CoV-2
Mutations
Genomic diversity
ABSTRACT
COVID-19 is a viral respiratory illness caused by a new coronavirus called SARS-CoV-2. The World Health
Organization declared the SARS-CoV-2 outbreak a global public health emergency. We performed genetic
analyses of eighty-six complete or near-complete genomes of SARS-CoV-2 and revealed many mutations and
deletions on coding and non-coding regions. These observations provided evidence of the genetic diversity and
rapid evolution of this novel coronavirus.
1. The study
A new coronavirus SARS-CoV-2 is spreading across the world (Phan,
2020). Since the virus emerged at the seafood wholesale market at the
end of last year (Zhu et al., 2019), the number of infected cases has
been rising dramatically (Velavan and Meyer, 2020). Human-to-human
transmission of SARS-CoV-2 has been confirmed (Nishiura et al., 2020).
The virus has been detected in bronchoalveolar-lavage (Zhu et al.,
2019), sputum (Lin et al., 2020), saliva (K.K. To et al., 2020), throat
(Bastola et al., 2020) and nasopharyngeal swabs (To et al., 2020).
Nucleotide substitution has been proposed to be one of the most
important mechanisms of viral evolution in nature (Lauring and
Andino, 2010). The rapid spread of SARS-CoV-2 raises intriguing
questions such as whether its evolution is driven by mutations. To
assess the genetic variation, eighty-six complete or near-complete
genomes of SARS-CoV-2 were collected from GISAID
[https://www.gisaid.org/]. These SARS-CoV-2 strains were detected in
infected patients from China (50), USA (11), Australia (5), Japan (5),
France (4), Singapore (3), England (2), Taiwan (2), South Korea (1),
Belgium (1), Germany (1), and Vietnam (1). The pair-wise nucleotide
sequence alignment was performed by ClustalX2 (Saitou and Nei, ...
Heavy metals, particularly silver and mercury, have a variety of applications in controlling microbial populations. Ps. aeruginosa has high intrinsic resistance to antibiotics and to heavy metals, including copper sulfate, silver sulfate, mercury chloride, lead nitrate, zinc sulfate, cadmium sulfate, and nickel sulfate.
Detection of Parapoxvirus in goats during contagious ecthyma outbreak in Cear...
Agriculture Journal IJOEAR
Contagious ecthyma, or contagious pustular dermatitis, is a viral skin disease of sheep, goats and wild ruminants caused by a member of the Parapoxvirus genus. It is characterized by the formation of papules, nodules or vesicles that progress into thick crusts or heavy scabs on the lips, gingiva and tongue. Humans are occasionally affected, making it an important zoonosis. The disease not only has an economic impact on farmers worldwide but also has a considerable negative effect on animal welfare. In this study, a contagious ecthyma outbreak that occurred in one flock of 90 goats in Ceará State, Brazil, is described. Twenty-two goats older than 6 months were affected. The animals presented crusted lesions on the buccal region, tongue, udder and teats, which began with swelling in the mouth area. Dried crusts and serum were collected and processed for transmission electron microscopy using negative staining (rapid preparation), immunocytochemistry (immunolabelling with colloidal gold particles) and resin embedding techniques. All samples were analyzed by the negative staining technique on a Philips EM 208 transmission electron microscope, and a great number of ovoid or cylindrical parapoxvirus particles, measuring 300 x 180 nm, were observed. They showed two morphological forms: a mulberry (M) form with a distinctive crisscross filament pattern derived from the superimposition of upper and lower virion surfaces, and a capsular (C) form caused by stain penetration and distention of the virion core. The antigen-antibody reaction was enhanced by the colloidal gold particles.
In the ultrathin sections of crusts, we verified the presence of three types of intracytoplasmic inclusion bodies. Type A, or Bollinger, inclusion bodies were outlined by a membrane and contained oval, mature (complete) viral particles measuring on average 225 nm x 130 nm, showing an inner dumbbell-shaped or cigar-shaped core, two lateral bodies and an external envelope. In the type B electron-dense inclusion bodies, parapoxvirus particles were seen budding from dense, amorphous material. Fibrillar intracytoplasmic inclusions were also found between the virions, consisting of fibrils arranged in groups or concentrically in the middle of the granular material. Intracytoplasmic vesicles outlined by membranes, measuring 560 x 420 nm and containing granular material, were also observed. The nuclei appeared deformed.
Applications of Whole Genome Sequencing (WGS) to Food Safety – Perspective fr...
ExternalEvents
http://tiny.cc/faowgsworkshop
Applications of genome sequencing technology on food safety management- United Kingdom. Presentation from the FAO expert workshop on practical applications of Whole Genome Sequencing (WGS) for food safety management - 7-8 December 2015, Rome, Italy.
S. prasanth kumar young scientist awarded presentation
Prasanthperceptron
Recipient of the Young Scientist Award for Research Article Presentation on “Emergence of Indian Tomato Yellow Leaf Curl Viral (TYLCV) Disease: Insights from Evolutionary Divergence and Molecular Prospects of Coat Protein” at a National Symposium on “Evolving Paradigm to Improve Productivity from Dynamic Management and Value Addition for Plant Genetic Resources” held at the Department of Botany, Gujarat University, Ahmedabad 380 009, Oct 13-15, 2011.
Avs significant achievements and present status of trichoderma spp. in
AMOL SHITOLE
By AMOL VIJAY SHITOLE
Innovations in Sequencing & Bioinformatics
Talk for
Healthy Central Valley Together Research Workshop
Jonathan A. Eisen University of California, Davis
January 31, 2024 linktr.ee/jonathaneisen
Thoughts on UC Davis' COVID Current Actions
Jonathan Eisen
Slides I used for a presentation to Chancellor May's leadership council about the current state of UC Davis' response to COVID and how it could be improved
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
What are greenhouse gases, how do they affect the Earth and its environment, and what does the future hold for the environment, the weather and the climate?
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing by which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) is reported in a wide range of eukaryotes, including worms, insects, mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993 Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22nt and 61nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that are causing the silencing by RNA-RNA interactions.
Types of RNAi (non-coding RNA)
miRNA
Length (23-25 nt)
Trans acting
Binds to target mRNA with mismatches
Translation inhibition
siRNA
Length 21 nt
Cis acting
Binds to target mRNA with a perfectly complementary sequence
Piwi-RNA (piRNA)
Length: 25 to 36 nt
Expressed in germ cells
Regulates transposon activity
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
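The three steps above (dicing, strand selection, target pairing) can be illustrated with a toy Python sketch. It is only a string-matching cartoon under stated assumptions: the function names, the fixed 21-nt fragment size, and the requirement of perfect complementarity are all illustrative choices, not a model of the real enzymology.

```python
# Toy illustration of the RNAi steps described above.
# All names and parameters here are hypothetical, chosen for illustration.

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string (A/C/G/U only)."""
    return rna.translate(COMPLEMENT)[::-1]


def dice(long_rna: str, size: int = 21) -> list[str]:
    """Cut one strand of a long dsRNA into consecutive fragments of `size` nt.

    Assumption: fixed-size, non-overlapping cuts; leftover bases are dropped.
    """
    return [long_rna[i:i + size] for i in range(0, len(long_rna) - size + 1, size)]


def find_target_sites(guide: str, mrna: str) -> list[int]:
    """Return start positions in `mrna` that pair perfectly with `guide`."""
    site = reverse_complement(guide)  # sequence on the mRNA that the guide pairs with
    hits = []
    start = mrna.find(site)
    while start != -1:
        hits.append(start)
        start = mrna.find(site, start + 1)
    return hits


if __name__ == "__main__":
    target = "GCUAGCUAGGAUCCAUGCAUG"           # 21-nt site on a made-up mRNA
    mrna = "AAAA" + target + "CCCC"
    guide = reverse_complement(target)        # strand retained by "RISC"
    print(find_target_sites(guide, mrna))
```

Requiring a perfect match mirrors the siRNA case listed earlier; miRNA-style pairing with mismatches would need an approximate matcher instead.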
THE RISC COMPLEX:
RISC is a large (>500 kD) multi-protein RNA-binding complex that triggers mRNA degradation in response to siRNA.
Unwinding of double-stranded siRNA by an ATP-independent helicase.
The active components of RISC are Ago proteins (endonucleases), which cleave the target mRNA.
DICER: endonuclease (RNase III family)
Argonaute: central component of the RNA-induced silencing complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute.
ARGONAUTE PROTEIN:
1. PAZ (PIWI/Argonaute/Zwille): recognition of target mRNA
2. PIWI (P-element induced wimpy testis): breaks the phosphodiester bond of mRNA (RNase H activity)
MiRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression.
Richard's adventures in two entangled wonderlands
Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
Nucleic Acid-its structural and functional complexity.
Phylogenomic Case Studies: The Benefits (and Occasional Drawbacks) of Integrating Evolutionary and Genomic Studies. Talk by J. Eisen for BIATA 2021
1. Phylogenomic Case Studies:
The Benefits (and Occasional Drawbacks)
of Integrating
Evolutionary and Genomic Studies
BIATA 2021
Jonathan A. Eisen
University of California, Davis
@phylogenomics
http://phylogenomics.me
10. RecA Structure & Function I
Intrinsic
Liu SK, Eisen JA, Hanawalt PC, Tessman IW. 1993. recA mutations that reduce the constitutive coprotease activity of the RecA1202(PrtC) protein: possible involvement of interfilament
association in proteolytic and recombination activities. Journal of Bacteriology 175: 6518-6529. PMID: 8407828. PMCID: PMC206762.
11. RecA vs. rRNA
Eisen 1995 Journal of Molecular Evolution 41: 1105-1123.
More on this later …
I
Intrinsic
13. RecA Missing From Some Taxa
Those taxa without RecA
homologs have no
homologous recombination
which has major impacts on
tempo and modes of
evolution
I
Intrinsic
Moran NA, Mira A. The process of genome shrinkage in the obligate symbiont Buchnera aphidicola. Genome Biol. 2001;2(12):RESEARCH0054. doi:10.1186/gb-2001-2-12-research0054
21. Wu et al., 2004. Collaboration between Jonathan Eisen and Scott
O’Neill (Yale, U. Queensland).
Wolbachia pipientis wMel E1
Extrinsic
Collaboration with Scott O’Neill and others
Wu M, Sun LV, Vamathevan J, et al. Phylogenomics of the reproductive parasite Wolbachia pipientis wMel: a streamlined genome overrun by mobile genetic elements.
PLoS Biol. 2004;2(3):E69. doi:10.1371/journal.pbio.0020069
27. Symbiosis Under Stress
When organisms are placed under selective
pressure or stress where novelty would be
beneficial, can we predict which pathway
they will use?
What leads to interactions / symbioses
being a potential solution?
Can we manipulate interactions and/or force
new ones upon systems?
Extrinsic
Novelty
29. HMS Type 1: Nutrient Acquisition
Host
Microbiome Nutrients
E2
Extrinsic
30. HMS Type 1: Chemosymbioses
Marine Invertebrates
Endosymbionts Carbon
Colleen
Cavanaugh
E2
Extrinsic
31. HMS Type 1: Xylem Feeders
Glassy Winged Sharpshooter
Gut
Endosymbionts
Trying to
Live on
Xylem Fluid
Nancy Moran
Dongying Wu
E2
Extrinsic
32. HMS Type 1: Nitrogen Acquisition
Oloton
Corn
Mucilage
Microbiome
Low
N
Van Deynze A, Zamora P, Delaux PM, Heitmann C, Jayaraman D, Rajasekar S, Graham D, Maeda J, Gibson D, Schwartz KD, Berry AM,
Bhatnagar S, Jospin G, Darling A, Jeannotte R, Lopez J, Weimer BC, Eisen JA, Shapiro HY, Ané JM, Bennett AB. 2018. Nitrogen fixation in a
landrace of maize is supported by a mucilage-associated diazotrophic microbiota. PLoS Biology 16(8):e2006352.
doi:10.1371/journal.pbio.2006352. PMID: 30086128. PMCID: PMC6080747.
E2
Extrinsic
33. HMS Type 2: Pathogens
Host
Microbiome Pathogen
E2
Extrinsic
34. HMS Type 2: Flu & Ducks
Ducks
Gut
Microbiome
Flu
Walter
Boyce
Holly
Ganz
Sarah
Hird
Ladan
Doroud
Alana
Firl
Hird SM, Ganz H, Eisen JA, Boyce WM. 2018. The cloacal microbiome of five wild duck species varies by species and influenza A virus infection status. mSphere 3:e00382-18. https://doi.org/10.1128/mSphere.00382-18
Ganz, H.H., Doroud, L., Firl, A.J., Hird, S.M., Eisen, J.A. and Boyce, W.M., 2017. Community-level differences in the microbiome of healthy wild mallards and those infected by influenza A viruses. mSystems, 2(1):e00188-16.
E2
Extrinsic
35. HMS Type 2: Koalas & Chlamydia
Koala
Gut
Microbiome
Chlamydia
&
Antibiotics
Katherine
Dahlhausen
E2
Extrinsic
Dahlhausen KE, Jospin G, Coil DA, Eisen JA, Wilkins LGE. Isolation and sequence-based characterization of a koala symbiont: Lonepinella koalarum. PeerJ. 2020;8:e10177.
Published 2020 Oct 20. doi:10.7717/peerj.10177
Dahlhausen KE, Doroud L, Firl AJ, Polkinghorne A, Eisen JA. Characterization of shifts of koala (Phascolarctos cinereus) intestinal microbial communities associated with
antibiotic treatment. PeerJ. 2018;6:e4452. Published 2018 Mar 12. doi:10.7717/peerj.4452
38. HMS Type 3: Rice Microbiome
Rice
Root
Microbiome Domestication
E2
Extrinsic
Sundar Lab Srijak
Bhatnagar
Edwards J, Johnson C, Santos-Medellín C, et al. Structure, variation, and assembly of the root-associated microbiomes of rice. Proc Natl Acad Sci U S A. 2015;112(8):E911-E920. doi:10.1073/pnas.1414592112
48. STAP
An Automated Phylogenetic Tree-Based Small Subunit
rRNA Taxonomy and Alignment Pipeline (STAP)
Dongying Wu1*, Amber Hartman1,6, Naomi Ward4,5, Jonathan A. Eisen1,2,3
1 UC Davis Genome Center, University of California Davis, Davis, California, United States of America, 2 Section of Evolution and Ecology, College of Biological Sciences,
University of California Davis, Davis, California, United States of America, 3 Department of Medical Microbiology and Immunology, School of Medicine, University of
California Davis, Davis, California, United States of America, 4 Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, United States of America,
5 Center of Marine Biotechnology, Baltimore, Maryland, United States of America, 6 The Johns Hopkins University, Department of Biology, Baltimore, Maryland, United
States of America
Abstract
Comparative analysis of small-subunit ribosomal RNA (ss-rRNA) gene sequences forms the basis for much of what we know
about the phylogenetic diversity of both cultured and uncultured microorganisms. As sequencing costs continue to decline
and throughput increases, sequences of ss-rRNA genes are being obtained at an ever-increasing rate. This increasing flow of
data has opened many new windows into microbial diversity and evolution, and at the same time has created significant
methodological challenges. Those processes which commonly require time-consuming human intervention, such as the
preparation of multiple sequence alignments, simply cannot keep up with the flood of incoming data. Fully automated
methods of analysis are needed. Notably, existing automated methods avoid one or more steps that, though
computationally costly or difficult, we consider to be important. In particular, we regard both the building of multiple
sequence alignments and the performance of high quality phylogenetic analysis to be necessary. We describe here our fully-
automated ss-rRNA taxonomy and alignment pipeline (STAP). It generates both high-quality multiple sequence alignments
and phylogenetic trees, and thus can be used for multiple purposes including phylogenetically-based taxonomic
assignments and analysis of species diversity in environmental samples. The pipeline combines publicly-available packages
(PHYML, BLASTN and CLUSTALW) with our automatic alignment, masking, and tree-parsing programs. Most importantly,
this automated process yields results comparable to those achievable by manual analysis, yet offers speed and capacity that
are unattainable by manual efforts.
Citation: Wu D, Hartman A, Ward N, Eisen JA (2008) An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP). PLoS
ONE 3(7): e2566. doi:10.1371/journal.pone.0002566
multiple alignment and phylogeny was deemed unfeasible.
However, this we believe can compromise the value of the results.
For example, the delineation of OTUs has also been automated
via tools that do not make use of alignments or phylogenetic trees
(e.g., Greengenes). This is usually done by carrying out pairwise
comparisons of sequences and then clustering sequences that
have better than some cutoff threshold of similarity with each
other. This approach can be powerful (and reasonably efficient)
but it too has limitations. In particular, since multiple sequence
alignments are not used, one cannot carry out standard
phylogenetic analyses. In addition, without multiple sequence
alignments one might end up comparing and contrasting different
regions of a sequence depending on what it is paired with.
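To make the alignment-free approach above concrete, here is a minimal Python sketch of threshold clustering from pairwise comparisons. It is not the method of any tool named in the text: the similarity measure (`difflib.SequenceMatcher.ratio`, a generic string similarity standing in for pairwise sequence identity), the single-linkage rule, and the 0.97 default cutoff are all assumptions chosen for illustration.

```python
# Sketch of OTU delineation by pairwise similarity and a cutoff threshold.
# Assumptions: difflib's ratio stands in for pairwise identity; single linkage.
from difflib import SequenceMatcher


def pairwise_similarity(a: str, b: str) -> float:
    """Generic string similarity in [0, 1]; a stand-in for sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_otus(seqs: list[str], cutoff: float = 0.97) -> list[list[int]]:
    """Group sequence indices into clusters linked by similarity >= cutoff."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once; merge clusters whose members pass the cutoff.
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if pairwise_similarity(seqs[i], seqs[j]) >= cutoff:
                parent[find(j)] = find(i)

    clusters: dict[int, list[int]] = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAC", "TTTTTTTTTT"]
    print(cluster_otus(reads, cutoff=0.9))
```

Note that each pair is scored in isolation, which is exactly the limitation raised above: without a shared multiple alignment, different pairs may effectively be compared over different regions.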
The limitations of avoiding multiple sequence alignments and
phylogenetic analysis are readily apparent in tools to classify
sequences. For example, the Ribosomal Database Project’s
Classifier program [29] focuses on composition characteristics of
each sequence (e.g., oligonucleotide frequency) and assigns
taxonomy based upon clustering genes by their composition.
Though this is fast and completely automatable, it can be misled in
cases where distantly related sequences have converged on similar
composition, something known to be a major problem in ss-rRNA
sequences [30]. Other taxonomy assignment systems focus
primarily on the similarity of sequences. The simplest of these is
to use BLASTN to search a sequence database (e.g., Genbank) and
to then use information about the top match to assign some sort of
classification tools it does have some limitations. For example,
the generation of new alignments for each sequence is both
computationally costly, and does not take advantage of available
curated alignments that make use of ss-rRNA secondary structure
to guide the primary sequence alignment. Perhaps most
importantly however is that the tool is not fully automated. In
addition, it does not generate multiple sequence alignments for all
sequences in a dataset which would be necessary for doing many
analyses.
Automated methods for analyzing rRNA sequences are also
available at the web sites for multiple rRNA centric databases,
such as Greengenes and the Ribosomal Database Project (RDPII).
Though these and other web sites offer diverse powerful tools, they
do have some limitations. For example, not all provide multiple
sequence alignments as output and few use phylogenetic
approaches for taxonomy assignments or other analyses. More
importantly, all provide only web-based interfaces and their
integrated software, (e.g., alignment and taxonomy assignment),
cannot be locally installed by the user. Therefore, the user cannot
take advantage of the speed and computing power of parallel
processing such as is available on linux clusters, or locally alter and
potentially tailor these programs to their individual computing
needs (Table 1).
Given the limited automated tools that are available for
researchers have had to choose between two non-ideal options:
manually generating and/or curating alignments (an expensive
Table 1. Comparison of STAP’s computational abilities relative to existing commonly-used ss-rRNA analysis tools.

Feature                                    STAP           ARB       Greengenes   RDP
Installed where?                           Locally        Locally   Web only     Web only
User interface                             Command line   GUI       Web portal   Web portal
Parallel processing                        YES            NO        NO           NO
Manual curation for taxonomy assignment    NO             YES       NO           NO
Manual curation for alignment              NO             YES       NO*          NO
Open source                                YES**          NO        NO           NO
Processing speed                           Fast           Slow      Medium       Medium

It is important to note that STAP is the only software that runs on the command line and can take advantage of parallel processing on linux clusters and, further, is more amenable to downstream code manipulation.
* Note: Greengenes alignment output is compatible with upload into ARB and downstream manual alignment.
** The STAP program itself is open source; the programs it depends on are freely available but not open source.
doi:10.1371/journal.pone.0002566.t001
ss-rRNA Taxonomy Pipeline
Figure 1. A flow chart of the STAP pipeline.
doi:10.1371/journal.pone.0002566.g001
STAP database, and the query sequence is aligned to them using
the CLUSTALW profile alignment algorithm [40] as described
above for domain assignment. By adapting the profile alignment
algorithm, the alignments from the STAP database remain intact,
while gaps are inserted and nucleotides are trimmed for the query
sequence according to the profile defined by the previous
alignments from the databases. Thus the accuracy and quality of
the alignment generated at this step depends heavily on the quality
of the Bacterial/Archaeal ss-rRNA alignments from the
Greengenes project or the Eukaryotic ss-rRNA alignments from
the RDPII project.
Phylogenetic analysis using multiple sequence alignments rests on
the assumption that the residues (nucleotides or amino acids) at the
same position in every sequence in the alignment are homologous.
Thus, columns in the alignment for which ‘‘positional homology’’
cannot be robustly determined must be excluded from subsequent
analyses. This process of evaluating homology and eliminating
questionable columns, known as masking, typically requires time-
consuming, skillful, human intervention. We designed an automated
masking method for ss-rRNA alignments, thus eliminating this
bottleneck in high-throughput processing.
First, an alignment score is calculated for each aligned column
by a method similar to that used in the CLUSTALX package [42].
Specifically, an R-dimensional sequence space representing all the
possible nucleotide character states is defined. Then for each
aligned column, the nucleotide populating that column in each of
the aligned sequences is assigned a score in each of the R
dimensions (Sr) according to the IUB matrix [42]. The consensus
‘‘nucleotide’’ for each column (X) also has R dimensions, with the
score for each dimension (Xr) calculated as the average of the
scores for that column in that dimension (average of Sr). Thus the
score of the consensus nucleotide is a mathematical expression
describing the average ‘‘nucleotide’’ in that column for that
Figure 2. Domain assignment. In Step 1, STAP assigns a domain to
each query sequence based on its position in a maximum likelihood
tree of representative ss-rRNA sequences. Because the tree illustrated
here is not rooted, domain assignment would not be accurate and
reliable (sequence similarity based methods cannot make an accurate
assignment in this case either). However the figure illustrates an
important role of the tree-based domain assignment step, namely
automatic identification of deep-branching environmental ss-rRNAs.
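The column-scoring step of the masking method described earlier (per-nucleotide score vectors Sr and a consensus vector Xr taken as their average) can be sketched in Python as follows. This is only an illustration under assumptions: the 1.9/0 match/mismatch values are a simplified stand-in for the IUB matrix, gaps are scored as all zeros, and the final `column_spread` measure (mean Euclidean distance to the consensus) is a guess at how a column score might be derived, since the excerpt is cut off before that step.

```python
# Sketch of per-column scoring for alignment masking.
# Assumptions: simplified IUB-like scores; mean distance as the column score.
import math

ALPHABET = "ACGT"            # R = 4 dimensions, one per nucleotide state
MATCH, MISMATCH = 1.9, 0.0   # hypothetical simplified scoring values


def score_vector(nt: str) -> list[float]:
    """Score of one nucleotide in each of the R dimensions (Sr)."""
    return [MATCH if nt == ref else MISMATCH for ref in ALPHABET]


def consensus_vector(column: str) -> list[float]:
    """Consensus 'nucleotide' X: the average of Sr over the column."""
    vectors = [score_vector(nt) for nt in column]
    n = len(vectors)
    return [sum(v[r] for v in vectors) / n for r in range(len(ALPHABET))]


def column_spread(column: str) -> float:
    """Mean Euclidean distance of each sequence's vector from the consensus."""
    x = consensus_vector(column)
    dists = [math.dist(score_vector(nt), x) for nt in column]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    for col in ("AAAA", "AAAC", "ACGT"):
        print(col, round(column_spread(col), 3))
```

Under this sketch a perfectly conserved column has a spread near zero and a fully mixed column has the largest spread, so a masking rule could drop columns above some chosen spread.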
Wu D, Hartman A, Ward N, Eisen JA. An automated phylogenetic tree-based small subunit rRNA taxonomy and alignment
pipeline (STAP) [published correction appears in PLoS ONE. 2008;3(7). doi:10.1371/annotation/c1aa88dd-4360-4902-8599-4d7edca79817].
PLoS One. 2008;3(7):e2566. Published 2008 Jul 2. doi:10.1371/journal.pone.0002566
49. Venter et al., Science 304: 66. 2004
STAP for Sargasso Metagenome
51. alignment used to build the profile, resulting in a multiple
sequence alignment of full-length reference sequences and
metagenomic reads. The final step of the alignment process is a
quality control filter that 1) ensures that only homologous SSU-
rRNA sequences from the appropriate phylogenetic domain are
included in the final alignment, and 2) masks highly gapped
alignment columns (see Text S1).
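The second quality-control step above, masking highly gapped columns, might look roughly like the following Python sketch. The 50% threshold and the function name are assumptions for illustration; the actual criteria are given in Text S1 of the paper and are not reproduced here.

```python
# Sketch of masking highly gapped alignment columns.
# Assumption: a column is dropped when its gap fraction exceeds a threshold.


def mask_gapped_columns(alignment: list[str], max_gap_fraction: float = 0.5) -> list[str]:
    """Return the alignment with columns whose gap fraction is too high removed.

    `alignment` is a list of equal-length aligned sequences using '-' for gaps.
    """
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    keep = [
        col
        for col in range(n_cols)
        if sum(row[col] == "-" for row in alignment) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]


if __name__ == "__main__":
    aln = ["AC-T", "A--T", "ACGT"]
    print(mask_gapped_columns(aln))
```

Short metagenomic reads cover only part of a full-length reference, so in practice the threshold would need to tolerate many gaps per column; the value used here is arbitrary.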
We use this high quality alignment of metagenomic reads and
reference sequences to construct a fully-resolved, phylogenetic
tree and hence determine the evolutionary relationships between
the reads. Reference sequences are included in this stage of the
analysis to guide the phylogenetic assignment of the relatively
short metagenomic reads. While the software can be easily
extended to incorporate a number of different phylogenetic tools
capable of analyzing metagenomic data (e.g., RAxML [27],
pplacer [28], etc.), PhylOTU currently employs FastTree as a
default method due to its relatively high speed-to-performance
PD versus PID clustering, 2) to explore overlap between PhylOTU
clusters and recognized taxonomic designations, and 3) to quantify
the accuracy of PhylOTU clusters from shotgun reads relative to
those obtained from full-length sequences.
PhylOTU Clusters Recapitulate PID Clusters
We sought to identify how PD-based clustering compares to
commonly employed PID-based clustering methods by applying
the two methods to the same set of sequences. Both PID-based
clustering and PhylOTU may be used to identify OTUs from
overlapping sequences. Therefore we applied both methods to a
dataset of 508 full-length bacterial SSU-rRNA sequences (reference
sequences; see above) obtained from the Ribosomal Database
Project (RDP) [25]. Recent work has demonstrated that PID is
more accurately calculated from pairwise alignments than multiple
sequence alignments [32–33], so we used ESPRIT, which
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generalized
workflow of PhylOTU. See Results section for details.
doi:10.1371/journal.pcbi.1001061.g001
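The PID-based clustering that the excerpt above compares against PhylOTU can be sketched as follows. This is a minimal illustration, not ESPRIT's own algorithm: it assumes pairwise alignments are already available, ignores gapped columns when computing percent identity, and uses a greedy representative-based grouping. The function names and the 97% default threshold are assumptions made for this example.

```python
from typing import Dict, FrozenSet, List


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the ungapped columns of one pairwise alignment.

    Assumption: columns where either sequence has a gap ('-') are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def cluster_by_pid(names: List[str],
                   pairwise_pid: Dict[FrozenSet[str], float],
                   threshold: float = 97.0) -> List[List[str]]:
    """Greedy clustering: a sequence joins the first cluster whose
    representative (its first member) is at least `threshold` percent identical."""
    clusters: List[List[str]] = []
    for name in names:
        for members in clusters:
            representative = members[0]
            if pairwise_pid[frozenset((representative, name))] >= threshold:
                members.append(name)
                break
        else:
            # No existing representative is close enough: start a new OTU.
            clusters.append([name])
    return clusters


# Hypothetical pairwise PID values for three sequences.
pid_table = {
    frozenset(("a", "b")): 98.0,
    frozenset(("a", "c")): 90.0,
    frozenset(("b", "c")): 91.0,
}
print(cluster_by_pid(["a", "b", "c"], pid_table))
```

A greedy scheme like this depends on input order; hierarchical clustering avoids that at higher cost.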
Finding Metagenomic OTUs
Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061
OTUs via Phylogeny (PhylOTU)
Tom Sharpton, Katie Pollard, Jessica Green
52. rRNA Copy # vs. Phylogeny
Steven Kembel, Jessica Green, Martin Wu
Kembel SW, Wu M, Eisen JA, Green JL (2012) Incorporating 16S Gene Copy Number Information Improves Estimates of Microbial Diversity and Abundance. PLoS Comput Biol 8(10): e1002743. doi:10.1371/journal.pcbi.1002743
60. GOS 1
GOS 2
GOS 3
GOS 4
GOS 5
Phylogenetic ID of Novel Lineages
Wu et al PLoS One 2011
61. Metagenomics
[Diagram labels: DNA; marker genes RecA, RpoB, Rpl4, rRNA, Hsp70, EFTu]
http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.7
sequences are not conserved at the nucleotide level [29]. As a
result, the nr database does not actually contain many more
protein marker sequences that can be used as references than
those available from complete genome sequences.
Comparison of phylogeny-based and similarity-based phylotyping
Although our phylogeny-based phylotyping is fully auto-
mated, it still requires many more steps than, and is slower
than, similarity-based phylotyping methods such as
MEGAN [30]. Is it worth the trouble? Similarity-based phylotyping
works by searching a query sequence against a reference
database such as NCBI nr and deriving taxonomic
information from the best matches or 'hits'. When species
that are closely related to the query sequence exist in the ref-
erence database, similarity-based phylotyping can work well.
However, if the reference database is a biased sample or if it
contains no closely related species to the query, then the top
hits returned could be misleading [31]. Furthermore, similar-
ity-based methods require an arbitrary similarity cut-off
value to define the top hits. Because individual bacterial
genomes and proteins can evolve at very different rates, a uni-
versal cut-off that works under all conditions does not exist.
As a result, the final results can be very subjective.
In contrast, our tree-based bracketing algorithm places the
query sequence within the context of a phylogenetic tree and
only assigns it to a taxonomic level if that level has adequate
sampling (see Materials and methods [below] for details of
the algorithm). With the well sampled species Prochlorococ-
cus marinus, for example, our method can distinguish closely
related organisms and make taxonomic identifications at the
species level. Our reanalysis of the Sargasso Sea data placed
672 sequences (3.6% of the total) within a P. marinus clade.
On the other hand, for sparsely sampled clades such as
Aquifex, assignments will be made only at the phylum level.
Thus, our phylogeny-based analysis is less susceptible to data
sampling bias than a similarity based approach, and it makes
Major phylotypes identified in Sargasso Sea metagenomic data
Figure 3
Major phylotypes identified in Sargasso Sea metagenomic data. The metagenomic data previously obtained from the Sargasso Sea was reanalyzed using
AMPHORA and the 31 protein phylogenetic markers. The microbial diversity profiles obtained from individual markers are remarkably consistent. The
breakdown of the phylotyping assignments by markers and major taxonomic groups is listed in Additional data file 5.
[Bar chart: relative abundance (y-axis, 0 to 0.8) of major taxonomic groups as estimated from each of the 31 protein markers.
Groups: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified proteobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Acidobacteria, Thermotogae, Fusobacteria, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified bacteria.
Markers: dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf.]
Many other genes
better than rRNA
62. Sargasso Phylotypes
[Bar chart: weighted % of clones (y-axis, 0 to 0.5) by major phylogenetic group (x-axis).
Groups: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota.
Series (one bar per marker):]
EFG EFTu HSP70 RecA RpoB rRNA
Venter et al., Science 304: 66. 2004
RecA Phylotyping - Sargasso Metagenome
63. Amphora
Martin Wu
Wu M, Eisen JA. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 2008;9(10):R151.
Published 2008 Oct 13. doi:10.1186/gb-2008-9-10-r151
64. AMPHORA
AMPHORA Phylotyping w/ Protein Markers
Figure 3 of Wu and Eisen, Genome Biology 2008, 9:R151. Major phylotypes identified in Sargasso Sea metagenomic data. The metagenomic data previously obtained from the Sargasso Sea was reanalyzed using AMPHORA and the 31 protein phylogenetic markers. The microbial diversity profiles obtained from individual markers are remarkably consistent.
[Bar chart: relative abundance (y-axis, 0 to 0.8) of major taxonomic groups, Alphaproteobacteria through Unclassified bacteria, for each of the 31 protein markers, dnaG through tsf.]
Martin
Wu
Wu M, Eisen JA. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 2008;9(10):R151.
Published 2008 Oct 13. doi:10.1186/gb-2008-9-10-r151
65. Phylosift - Bayesian Phylotyping
[Workflow diagram]
Input Sequences: each input sequence scanned against both workflows (rRNA workflow and protein workflow)
LAST: fast candidate search (search input against references)
hmmalign, Infernal: multiple alignment (profile HMMs used to align candidates to reference alignment); paths labeled <600 bp and >600 bp; parallel option
pplacer: phylogenetic placement
Taxonomic Summaries: Krona plots, number of reads placed for each marker gene
Sample Analysis & Comparison: Edge PCA, tree visualization, Bayes factor tests
Aaron Darling, Erik Matsen, Holly Bik, Guillaume Jospin, Erik Lowe
Darling AE, Jospin G, Lowe E, Matsen FA IV, Bik HM, Eisen JA. (2014) PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2:e243 http://dx.doi.org/10.7717/peerj.243
66. PD from Metagenomes
typically used as a qualitative measure because duplicate sequences
are usually removed from the tree. However, the
test may be used in a semiquantitative manner if all clones,
even those with identical or near-identical sequences, are included
in the tree (13).
Here we describe a quantitative version of UniFrac that we
call “weighted UniFrac.” We show that weighted UniFrac behaves
similarly to the FST test in situations where both a
FIG. 1. Calculation of the unweighted and the weighted UniFrac
measures. Squares and circles represent sequences from two different
environments. (a) In unweighted UniFrac, the distance between the
circle and square communities is calculated as the fraction of the
branch length that has descendants from either the square or the circle
environment (black) but not both (gray). (b) In weighted UniFrac,
branch lengths are weighted by the relative abundance of sequences in
the square and circle communities; square sequences are weighted
twice as much as circle sequences because there are twice as many total
circle sequences in the data set. The width of branches is proportional
to the degree to which each branch is weighted in the calculations, and
gray branches have no weight. Branches 1 and 2 have heavy weights
since the descendants are biased toward the square and circles, respectively.
Branch 3 contributes no value since it has an equal contribution
from circle and square sequences after normalization.
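The two measures described in the caption above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the tree is given as a flat list of branches, each with a length and the set of leaves below it, and the weighted form is the raw, unnormalized sum of branch length times the absolute difference in relative abundance. The toy tree, leaf names, and counts are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# A branch is (length, set of leaf names descending from it).
Branch = Tuple[float, FrozenSet[str]]


def unweighted_unifrac(branches: List[Branch],
                       counts_a: Dict[str, int],
                       counts_b: Dict[str, int]) -> float:
    """Fraction of branch length leading to descendants from only one community."""
    unique = 0.0
    total = 0.0
    for length, leaves in branches:
        in_a = any(counts_a.get(leaf, 0) > 0 for leaf in leaves)
        in_b = any(counts_b.get(leaf, 0) > 0 for leaf in leaves)
        if in_a or in_b:
            total += length
            if in_a != in_b:  # descendants from exactly one community
                unique += length
    return unique / total if total else 0.0


def weighted_unifrac(branches: List[Branch],
                     counts_a: Dict[str, int],
                     counts_b: Dict[str, int]) -> float:
    """Sum of branch length times |relative abundance in A - relative abundance in B|."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both communities need at least one sequence")
    distance = 0.0
    for length, leaves in branches:
        frac_a = sum(counts_a.get(leaf, 0) for leaf in leaves) / total_a
        frac_b = sum(counts_b.get(leaf, 0) for leaf in leaves) / total_b
        distance += length * abs(frac_a - frac_b)
    return distance


# Hypothetical tree ((s1,s2),(s3,s4)) with four tip branches and two internal branches.
TOY_BRANCHES: List[Branch] = [
    (1.0, frozenset({"s1"})),
    (1.0, frozenset({"s2"})),
    (1.0, frozenset({"s3"})),
    (1.0, frozenset({"s4"})),
    (0.5, frozenset({"s1", "s2"})),
    (0.5, frozenset({"s3", "s4"})),
]
COMM_A = {"s1": 2, "s3": 2}  # twice as many sequences as community B
COMM_B = {"s2": 1, "s3": 1}

print(unweighted_unifrac(TOY_BRANCHES, COMM_A, COMM_B))
print(weighted_unifrac(TOY_BRANCHES, COMM_A, COMM_B))
```

In the toy data the two internal branches carry equal relative abundance from both communities after normalization, so they add nothing to the weighted distance, as the caption describes for Branch 3.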
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of
Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
Jessica Green, Steven Kembel, Katie Pollard
67. Zorro - Automated Masking
[Figure panels A to D: distance to true tree (y-axis) versus sequence length (x-axis: 200, 400, 800, 1600, 3200) for NJ and ML trees, comparing no masking, zorro, and gblocks.]
Wu M, Chatterji S, Eisen JA (2012) Accounting For Alignment Uncertainty in Phylogenomics. PLoS ONE 7(1): e30288. doi:10.1371/journal.pone.0030288
71. Helicobacter pylori genome 1997
“The ability of H. pylori to
perform mismatch repair is
suggested by the presence of
methyl transferases, mutS
and uvrD. However,
orthologues of MutH and
MutL were not identified.”
73. Blast Search of H. pylori “MutS”
Sequences producing significant alignments:        Score (bits)  E Value
sp|P73625|MUTS_SYNY3 DNA MISMATCH REPAIR PROTEIN   117           3e-25
sp|P74926|MUTS_THEMA DNA MISMATCH REPAIR PROTEIN   69            1e-10
sp|P44834|MUTS_HAEIN DNA MISMATCH REPAIR PROTEIN   64            3e-09
sp|P10339|MUTS_SALTY DNA MISMATCH REPAIR PROTEIN   62            2e-08
sp|O66652|MUTS_AQUAE DNA MISMATCH REPAIR PROTEIN   57            4e-07
sp|P23909|MUTS_ECOLI DNA MISMATCH REPAIR PROTEIN   57            4e-07
Blast search pulls up Syn. sp MutS#2 with a much higher score (and much lower E value)
than other MutS homologs
Based on this, TIGR predicted this species had mismatch repair
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
75. Overlaying Functions onto Tree
[Tree figure: MutS family gene tree with leaves labeled by species abbreviation (e.g., Ecoli, Bacsu, Synsp, Helpy, Aquae, Human, Mouse, Yeast) and clades labeled MSH1, MSH2, MSH3, MSH4, MSH5, MSH6, MutS1, and MutS2.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
76. High Mutation Rate in H. pylori
Blast search pulls up Syn. sp MutS#2 with a much higher score (and much lower E value)
than other MutS homologs
Based on this, TIGR predicted this species had mismatch repair
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
77. PHYLOGENETIC PREDICTION OF GENE FUNCTION
METHOD
CHOOSE GENE(S) OF INTEREST
IDENTIFY HOMOLOGS
ALIGN SEQUENCES
CALCULATE GENE TREE
OVERLAY KNOWN FUNCTIONS ONTO TREE
INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
[Diagram: EXAMPLE A (genes 1 to 6) and EXAMPLE B (genes 1A to 3A and 1B to 3B, from Species 1, 2, and 3), each shown with its ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN); inferred events are marked "Duplication?" or "Duplication", and one inference is marked "Ambiguous".]
Based on Eisen, 1998 Genome Res 8: 163-167.
Phylogenomics
81. Limitations of Phylogenetic Prediction of Function
• Still imperfectly automated
• Each gene family different
• Each function different
• In some cases, function does not track with phylogeny well
• Does not work when NO members of a gene family have
been characterized
86. Non-Homology Predictions: Phylogenetic Profiling
• Step 1: Search all genes in organisms of
interest against all other genomes
• Ask: Yes or No, is each gene found in each
other species
• Cluster genes by distribution patterns
(profiles)
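The three steps above can be sketched as follows. This is a minimal illustration, assuming the homology search has already been run and summarized as, for each gene, the set of genomes in which a homolog was found. The gene names, genome names, and the choice to group only identical profiles are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_profiles(hits: Dict[str, Set[str]],
                   genomes: List[str]) -> Dict[str, Tuple[int, ...]]:
    """Turn search results into yes/no (1/0) presence profiles, one per gene."""
    return {
        gene: tuple(int(genome in found) for genome in genomes)
        for gene, found in hits.items()
    }


def cluster_profiles(profiles: Dict[str, Tuple[int, ...]]) -> Dict[Tuple[int, ...], List[str]]:
    """Group genes that share exactly the same distribution pattern."""
    groups: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for gene, profile in profiles.items():
        groups[profile].append(gene)
    return dict(groups)


# Hypothetical search results: gene -> genomes with a detected homolog.
genomes = ["species1", "species2", "species3"]
hits = {
    "geneA": {"species1", "species3"},
    "geneB": {"species1", "species3"},
    "geneC": {"species2"},
}

profiles = build_profiles(hits, genomes)
for pattern, genes in cluster_profiles(profiles).items():
    print(pattern, genes)
```

Genes that land in the same group (here geneA and geneB) are candidates for a shared function. In practice a distance between profiles, such as Hamming distance, would let near-identical patterns cluster together as well.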
88. B. subtilis new sporulation genes
Bjorn Traag
Richard Losick
Antonia Pugliese
J Bacteriol. 2013 Jan;195(2):253-60. doi: 10.1128/JB.01778-12
89. PG Profiling Works Better with Orthology
Martin Wu
Eisen JA, Wu M. 2002. Phylogenetic analysis and gene functional predictions: phylogenomics in action. Theoretical and Population Biology
61: 481-487. PMID: 12167367.
90. PG Profiling for Metagenomes
Jiang X, Langille MGI, Neches RY, Elliot M, Levin SA, Eisen JA, et al. (2012) Functional Biogeography of Ocean Microbes Revealed
through Non-Negative Matrix Factorization. PLoS ONE 7(9): e43866. doi:10.1371/journal.pone.0043866
Unidentified Pfams with high association to Components 1, 2
and 5 may have similar functional themes to other Pfams seen in
these components, or they may have functions that are ecologically
linked to the identified theme, or they may be associated
taxonomically rather than functionally (i.e., they may be expressed
by the same taxa that express the identified Pfams). In the future,
Additionally, we inspected the Pfams that were associated with
the ‘‘ubiquitous’’ cluster previously identified in Figure 2. Many of
these Pfams are associated with bacterial primary metabolism and
only 1% of these had unknown functions (Table S6). This is a
striking difference compared to the 15–54% proportion of
unknown Pfams seen in the five NMF components.
Figure 3. Components across sites. a) Weight for each of the five components at each of the 45 sites ($H^T$); b) the site-similarity matrix ($\hat{H}^T\hat{H}$); c)
environmental variables for the sites. The matrices are aligned so that the same row corresponds to the same site in each matrix. Sites are ordered by
applying spectral reordering to the similarity matrix (see Materials and Methods). Rows are aligned across the three matrices.
doi:10.1371/journal.pone.0043866.g003
92. We need to know how organisms are
related to each other
Tools: Whole Genome Phylogeny
93. 16S Says Hyphomonas is in Rhodobacterales
Badger et al. 2005 Int J System Evol Microbiol 55: 1021-1026.
Naomi Ward, Jonathan Badger
94. WGT & gene trees: Related to Caulobacterales
Badger et al. 2005 Int J System Evol Microbiol 55: 1021-1026.
Naomi Ward, Jonathan Badger
95. HMS Type 1: Xylem Feeders
Glassy Winged Sharpshooter
Gut
Endosymbionts
Trying to
Live on
Xylem Fluid
Nancy Moran
Dongying Wu
E2
Extrinsic
96. WGT: Higher Evolutionary Rates in Endosymbionts
Wu et al. 2006 PLoS Biology 4: e188. Collaboration with Nancy Moran’s Lab
Higher
Evolutionary
Rates in
Endosymbionts
97. Wu et al. 2006 PLoS Biology 4: e188. Collaboration with Nancy Moran’s Lab
MutS MutL
+ +
+ +
+ +
+ +
_ _
_ _
Variation in Evolution Rates Correlated with Repair Gene Presence
Highest Rates
In Those Missing
Mismatch Repair
Genes
98. Wu et al. 2006 PLoS Biology 4: e188. Collaboration with Nancy Moran’s Lab
MutS MutL
+ +
+ +
+ +
+ +
_ _
_ _
Variation in Evolution Rates Correlated with Repair Gene Presence
Important Use of
Whole Genome Trees
99. Whole Genome Trees: Many Possible Methods
Lang JM, Darling AE, Eisen JA (2013) Phylogeny of
Bacterial and Archaeal Genomes Using Conserved
Genes: Supertrees and Supermatrices. PLoS ONE
8(4): e62510. doi:10.1371/journal.pone.0062510
Jenna Lang
101. Automated WGT: Phylosift
[PhyloSift workflow diagram: input sequences are scanned against both the rRNA workflow and the protein workflow; LAST fast candidate search against references; hmmalign and Infernal multiple alignment; pplacer phylogenetic placement; Taxonomic Summaries (Krona plots, number of reads placed for each marker gene); Sample Analysis & Comparison (Edge PCA, tree visualization, Bayes factor tests).]
Aaron Darling, Erik Matsen, Holly Bik, Guillaume Jospin, Erik Lowe
Darling AE, Jospin G, Lowe E, Matsen FA IV, Bik HM, Eisen JA. (2014) PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2:e243 http://dx.doi.org/10.7717/peerj.243
102. Normalizing Across Genes Tree OTU
Wu, D., Doroud, L, Eisen, JA 2013. arXiv. TreeOTU: Operational Taxonomic Unit Classification Based on Phylogenetic
Dongying Wu
106. HiC Metagenomic Binning
Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore RW, Eisen JA,
Darling AE. (2014) Strain- and plasmid-level deconvolution of a
synthetic metagenome by sequencing proximity ligation products.
PeerJ 2:e415 http://dx.doi.org/10.7717/peerj.415
Table 1 Species alignment fractions. The number of reads aligning to each replicon present in the
synthetic microbial community are shown before and after filtering, along with the percent of total
constituted by each species. The GC content (“GC”) and restriction site counts (“#R.S.”) of each replicon,
species, and strain are shown. Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome
2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus,
K12: E. coli K12 DH10B, BL21: E. coli BL21. An expanded version of this table can be found in Table S2.
Sequence  Alignment   % of Total  Filtered    % of aligned  Length     GC     #R.S.
Lac0      10,603,204  26.17%      10,269,562  96.85%        2,291,220  0.462  629
Lac1      145,718     0.36%       145,478     99.84%        13,413     0.386  3
Lac2      691,723     1.71%       665,825     96.26%        35,595     0.385  16
Lac       11,440,645  28.23%      11,080,865  96.86%        2,340,228  0.46   648
Ped       2,084,595   5.14%       2,022,870   97.04%        1,832,387  0.373  863
BL21      12,882,177  31.79%      2,676,458   20.78%        4,558,953  0.508  508
K12       9,693,726   23.92%      1,218,281   12.57%        4,686,137  0.507  568
E. coli   22,575,903  55.71%      3,894,739   17.25%        9,245,090  0.51   1076
Bur1      1,886,054   4.65%       1,797,745   95.32%        2,914,771  0.68   144
Bur2      2,536,569   6.26%       2,464,534   97.16%        3,809,201  0.672  225
Bur       4,422,623   10.91%      4,262,279   96.37%        6,723,972  0.68   369
Figure 1 Hi-C insert distribution. The distribution of genomic distances between Hi-C read pairs is
shown for read pairs mapping to each chromosome. For each read pair the minimum path length on
the circular chromosome was calculated and read pairs separated by less than 1000 bp were discarded.
The 2.5 Mb range was divided into 100 bins of equal size and the number of read pairs in each bin
was recorded for each chromosome. Bin values for each chromosome were normalized to sum to 1 and
plotted.
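The procedure in this caption can be sketched as below. This is a minimal illustration under stated assumptions: read pairs are given as two mapped coordinates on the same circular chromosome, and pairs whose separation falls at or beyond the 2.5 Mb range are skipped (the caption does not say how these are handled). The function and parameter names are hypothetical; the 1000 bp cutoff, 2.5 Mb range, and 100 bins come from the caption.

```python
from typing import Iterable, List, Tuple


def circular_distance(pos_a: int, pos_b: int, chrom_length: int) -> int:
    """Minimum path length between two positions on a circular chromosome."""
    direct = abs(pos_a - pos_b) % chrom_length
    return min(direct, chrom_length - direct)


def insert_distribution(read_pairs: Iterable[Tuple[int, int]],
                        chrom_length: int,
                        min_separation: int = 1000,
                        max_range: int = 2_500_000,
                        n_bins: int = 100) -> List[float]:
    """Bin read-pair separations into equal-size bins and normalize to sum to 1."""
    bin_width = max_range / n_bins
    counts = [0] * n_bins
    for pos_a, pos_b in read_pairs:
        separation = circular_distance(pos_a, pos_b, chrom_length)
        # Discard close pairs; skipping pairs beyond the range is an assumption.
        if separation < min_separation or separation >= max_range:
            continue
        counts[int(separation // bin_width)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


# Hypothetical read pairs on a chromosome the length of the E. coli K12 assembly.
K12_LENGTH = 4_686_137
pairs = [
    (100, 600),              # 500 bp apart: discarded
    (0, 4_686_000),          # 137 bp apart around the origin: discarded
    (0, 30_000),             # 30 kb apart
    (1_000_000, 1_010_000),  # 10 kb apart
    (10, 4_676_137),         # 10,010 bp apart around the origin
]
distribution = insert_distribution(pairs, K12_LENGTH)
print(distribution[:3])
```

Measuring distance around the circle matters for pairs that span the linearization point of the assembly, which would otherwise look millions of bases apart.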
E. coli K12 genome were distributed in a similar manner as previously reported (Fig. 1;
(Lieberman-Aiden et al., 2009)). We observed a minor depletion of alignments spanning
the linearization point of the E. coli K12 assembly (e.g., near coordinates 0 and 4686137)
due to edge effects induced by BWA treating the sequence as a linear chromosome rather
than circular.
Figure 2 Metagenomic Hi-C associations. The log-scaled, normalized number of Hi-C read pairs
associating each genomic replicon in the synthetic community is shown as a heat map (see color scale,
blue to yellow: low to high normalized, log scaled association rates). Bur1: B. thailandensis chromosome
1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2:
L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21.
reference assemblies of the members of our synthetic microbial community with the same
alignment parameters as were used in the top ranked clustering (described above). We first
counted the number of Hi-C reads associating each reference assembly replicon (Fig. 2;
Figure 3 Contigs associated by Hi-C reads. A graph is drawn with nodes depicting contigs and edges
depicting associations between contigs as indicated by aligned Hi-C read pairs, with the count t
depicted by the weight of edges. Nodes are colored to reflect the species to which they belong (see legend)
with node size reflecting contig size. Contigs below 5 kb and edges with weights less than 5 were excluded.
Contig associations were normalized for variation in contig size.
typically represent the reads and variant sites as a variant graph wherein variant sites are
represented as nodes, and sequence reads define edges between variant sites observed in
the same read (or read pair). We reasoned that variant graphs constructed from Hi-C
data would have much greater connectivity (where connectivity is defined as the m
path length between randomly sampled variant positions) than graphs constructed
Chris Beitel @datscimed
Aaron Darling @koadman
107. Long Reads Help, A Lot
Hiseq & Miseq
100-250 bp
Moleculo
2-20 kb
Pacbio RSII
2-20kb
Micky Kertesz,
Tim Blauwcamp
Meredith Ashby
Cheryl Heiner
Illumina-based
“synthetic long
reads”
Real-time single
molecule
sequencing
(p4-c2, p5-c3)
295 Megabases 474 Megabases
61 Gigabases
Meredith Ashby
116. 2002-2007: TIGR Tree of Life Project
Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree
Naomi Ward, Karen Nelson
117. 2007-2014: GEBA
Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree
Dongying Wu, Phil Hugenholtz, Nikos Kyrpides, Hans-Peter Klenk, Alla Lapidus
120. GEBA Cyanobacteria
Shih et al. 2013. PNAS 10.1073/pnas.1217107110
[Tree figure, panels A and B: scale bar 0.3; labels include subclades A, B1, B2, B3, C1, C2, C3, D, E, F, G and Paulinella, Glaucophyte, Green, Red, Chromalveolates.]
Fig. 2. Implications on plastid evolution. (A) Maximum-likelihood phylogenetic tree of plastids and cyanobacteria, grouped by subclades (Fig. 1). The red dot
Cheryl Kerfeld