An introduction to Phage Genome Annotation (Viruses of Microbes 2018)
3. From Sequence to Knowledge
Assembly, Annotation, and Analysis of Phage Genomes
from Isolated Phages and Metagenomic Data Sets
A helping hand through the annotation bottleneck
Ramy K. Aziz
Professor & Chair, Microbiology and Immunology, Faculty of
Pharmacy, Cairo University
(Twitter @azizrk)
5. A bit of history…
• Since 2009, the Genomics Workshop has
become an essential part of the Evergreen
phage meeting
• The challenge always is: how to meet
needs/expectations that are so many and
so diverse, in ~4 hours
• The answer is:
…….
9 July 2018 Phage Genomics - VoM 2018
10. “The analysis bottleneck”
• Observation:
– We generate more data than we can analyze.
– We generate sequence data faster
than we can analyze them.
• Opinion:
– Bottlenecks are not
created equal!
– It is important to define the question(s)
before working on the answer(s)!
15. Attendees’ expectations
• Who (how many) among you have:
– annotated at least a phage genome?
– worked on a viral metagenome?
– used the command line (Unix, Linux, Mac
Terminal) for sequence analysis?
• To optimize the content, let’s
take this survey on SOCRATIVE
(http://socrative.com)
– Enter ROOM: AZIZ15
16. Activity: think, pair, share!
Defining the question(s):
• Introduce yourself, your institution, and your
favorite phage/virus
• Do you have a genome sequenced? Planning to?
– Why have you sequenced your phage genome?
– Why do you want to sequence your phage genome?
• What is the single most pressing question you
want to have answered from genome analysis?
• What are your top wishes for analysis tools that are
not in the current programs?
19. What you want …... is
- complete
- accurate
[Figure: two panels, “from genome” and “from metagenome”, labeled “Incomplete”, “frameshift”, and “faulty assembly”. Credits: Andrew Kropinski, Bas Dutilh]
23. A process of reconstruction
• Experimentally
• Computationally
DNA
TGATTGTGTGTTTGCGCAATGCG
ATGTGTATATATAGTGAGCTTGCCC
GTCTCTCTNNNTCTCTTG
TGATTGGTCTNNNTCTCTTGCGCAATGCG
24. A process of reconstruction
• Experimentally
• Computationally
“Any phage one can get!”
“eDNA”
TGATTGTGTGTTTGCGCAATGCG
ATGTGTATATATAGTGAGCTTGCCC
GTCTCTCTNNNTCTCTTG
TGATTGGTCTNNNTCTCTTGCGCAATGCG
25. What will be covered?
1. Annotation overview
2. Using the RAST family for genome annotation:
– Optimizing RAST for phages
– Command line/ Batch options
3. Introducing PATRIC and resources in
development
– Therapeutic phage database
– Assembly
– Variation analysis
– Metagenome binning
26. From Sequence to Knowledge
From raw sequence data to genome submission/publication
[Workflow diagram; steps shown:]
• Assembly
• Gene finding/ORF calling
• tRNA calling
• Annotation (assigning functions)
• Orienting
• Validation
• Fixing frameshifts
• Introns and inteins
• Subsystem assignment
• Refinement/secondary annotation loop
• Special purpose: toxins, morons, integrases, lifestyle prediction
• Regulatory elements (promoters, terminators)
• Output: files and graphics
31. The Kropinski wisdom
1. Always use more than one tool.
2. Never blindly trust any automated (or manual)
process.
3. Use your eyes and hands: visual inspection/
manual proofreading, re-annotation
– Every apparently complicated file is still editable in
your favorite text editor (e.g., Notepad).
4. If you don’t know a gene’s function (if you
can’t convince your grandma), better keep it
unnamed than contribute to error propagation.
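Rule 1 can be made concrete with a small comparison of gene calls from two tools. The sketch below is only an illustration: the names `Gene` and `compare_calls` are hypothetical, and it assumes each tool's output has already been parsed into (start, stop, strand) coordinates. Calls on the same strand that share a stop coordinate are treated as the same gene, so a disagreement shows up as a differing start, which is a good candidate for manual inspection.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """One gene call (hypothetical container, not any tool's native format)."""
    start: int   # 1-based position of the first base of the start codon
    stop: int    # 1-based position of the last base of the stop codon
    strand: str  # "+" or "-"


def compare_calls(calls_a, calls_b):
    """Sort the gene calls of two tools into agreement classes.

    Assumption: two calls describe the same gene when they lie on the
    same strand and end at the same stop coordinate.
    """
    def by_stop(calls):
        return {(g.strand, g.stop): g for g in calls}

    a, b = by_stop(calls_a), by_stop(calls_b)
    shared = a.keys() & b.keys()
    return {
        # Same start and same stop in both tools.
        "identical": sorted(a[k] for k in shared if a[k] == b[k]),
        # Same stop, different start: inspect these by eye.
        "start_differs": sorted((a[k], b[k]) for k in shared if a[k] != b[k]),
        # Genes called by only one of the two tools.
        "only_a": sorted(a[k] for k in a.keys() - b.keys()),
        "only_b": sorted(b[k] for k in b.keys() - a.keys()),
    }


tool_a = [Gene(100, 400, "+"), Gene(500, 800, "+"), Gene(1500, 1000, "-")]
tool_b = [Gene(100, 400, "+"), Gene(530, 800, "+"), Gene(2000, 2300, "+")]
for category, genes in compare_calls(tool_a, tool_b).items():
    print(category, genes)
```

Neither tool is treated as the reference here; the output is a to-do list for proofreading, in the spirit of rules 2 and 3.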
35. Desired outcome
A well-characterized genome in which, ideally, we
know:
• the location & function of all the genes
• the location of promoters & terminators
• the correct taxonomy
[Figure: genome map showing genes 20–33 (including 26A), two PstI sites, and a 30.0 kb marker]
Viruses; dsDNA viruses, no RNA stage; Caudovirales; Siphoviridae; T1virus
36. Desired outcome:
Create GenBank submission
• Complete, accurate description of the
genome and its taxonomy
39. Desired outcome (4)
Protein products of concern, particularly
for those interested in phage therapy:
Integrases
Toxins
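A first-pass screen for such products can be as simple as a keyword search over the annotated product names. The sketch below is an assumption-laden illustration, not a curated resource: the pattern list, the function name `flag_products`, and the example locus tags are all hypothetical. A hit only marks a gene for manual review, and the absence of hits says nothing about safety.

```python
import re

# Illustrative patterns only (an assumption, not a validated list).
CONCERN_PATTERNS = {
    "integrase": re.compile(r"integrase", re.IGNORECASE),
    # The lookbehind skips "antitoxin", which would otherwise match.
    "toxin": re.compile(r"(?<!anti)toxin", re.IGNORECASE),
}


def flag_products(products: dict) -> dict:
    """Map each locus tag to the concern categories its product name hits.

    `products` maps a locus tag to an annotated product name.
    Loci without any hit are left out of the result.
    """
    flagged = {}
    for locus, name in products.items():
        hits = [cat for cat, pat in CONCERN_PATTERNS.items() if pat.search(name)]
        if hits:
            flagged[locus] = hits
    return flagged


example = {
    "gp01": "phage integrase",
    "gp02": "hypothetical protein",
    "gp03": "Shiga toxin subunit A",
    "gp04": "antitoxin HigA",
}
print(flag_products(example))
```

Because the screen relies on product names, it inherits every naming error in the annotation, so it complements similarity searches rather than replacing them.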
[Figure: genome map showing genes 20–33 (including 26A), two PstI sites, and a 30.0 kb marker]
40. Processes and Steps
I. Primary analysis
(QC/ pre-annotation proofreading: e.g., orient with BLASTN)
II. Genome annotation
– Gene finding (ORF calling)
– Automated annotation
– Massaging (editing, functional assignment)
III. Second dimension (regulatory elements)
IV. Comparative genomics
V. Metadata
VI. Visualization
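To make the gene-finding step (II) less abstract, here is a minimal six-frame ORF scan. It is a teaching sketch under stated assumptions, not how any real gene caller works: it only accepts ATG as a start codon (phages also use GTG and TTG), it reports the longest ORF for each stop codon, and the names `find_orfs` and `reverse_complement` are hypothetical. Real tools add coding-potential models, ribosome-binding-site detection, and more.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}  # simplification: GTG and TTG starts are ignored
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 90):
    """Scan all six frames for start-to-stop open reading frames.

    Returns sorted (start, end, strand) tuples. Coordinates are 1-based,
    inclusive, given on the forward strand, and include the stop codon.
    `min_len` is the minimum ORF length in nucleotides.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i  # first in-frame start after the last stop
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3  # exclusive end on the scanned strand
                    if end - start >= min_len:
                        if strand == "+":
                            orfs.append((start + 1, end, "+"))
                        else:
                            # Convert back to forward-strand coordinates.
                            orfs.append((n - end + 1, n - start, "-"))
                    start = None
    return sorted(orfs)


toy = "CC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
print(find_orfs(toy, min_len=15))
```

Running the same scan on the reverse complement of `toy` reports the ORF on the minus strand, which is why orienting the genome (step I) comes before annotation.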
41. From Sequence to Knowledge
From raw sequence data to genome submission/publication
[Workflow diagram, as on slide 26; the validation step is labeled “Validation (segmenter)”]
42. Classification
• The phage sequence space (Lima-Mendez et al.)
• The phage proteomic tree (Edwards & Rohwer)
• New: ViPTree http://www.genome.jp/viptree
44. RAST (subsystems-based tools)
• Will be the major focus of this short
tutorial…
• The goal is:
1. Quick demo of how to use RAST
2. Optimize RAST for phage annotation
3. New RAST implementation in the PATRIC
database
4. PATRIC features and future development
45. Nomenclature Sins
Calling a hypothetical protein a “DNA polymerase” on no
or poor-quality evidence is far worse than calling a
DNA polymerase a “hypothetical protein”.
Be cautious about using BLASTP hits in naming
gene products (gps): is there additional evidence to back the
designation up?
46. All of these describe homologs of the
product of the coliphage T4 rIIA gene!
rIIA protector from prophage-induced early lysis
protector from prophage-induced early lysis
protector from prophage-induced early lysis rIIA
membrane-associated affects host membrane ATPase
rIIA membrane-associated affects host membrane ATPase
phage rIIA lysis inhibitor
rIIA protector
rIIA
rIIA protein
membrane integrity protector
hypothetical protein
unnamed protein product !!!!!!
protein of unknown function
Consistent Nomenclature
47. What do I call my gene product
(i.e. protein)?
“phage hypothetical protein” – redundant; use “hypothetical protein”
“gp87” (gp = gene product) → “hypothetical protein”
“gp200” describes radically different proteins in
Listeria, Enterococcus, Mycobacterium,
Rhodococcus, Sphingomonas, Pseudomonas,
Bacillus, and Synechococcus phage genomes
Add /note=“similar to gp43 of Escherichia coli
phage T4”
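The two naming rules on this slide are mechanical enough to check automatically. The sketch below is a hypothetical helper (`suggest_product` is not part of RAST or any other tool) that normalizes whitespace and applies only those two rules: a bare gp number or “phage hypothetical protein” becomes “hypothetical protein”. Any similarity that motivated a gp-style name would then go into a /note qualifier, as suggested above.

```python
import re

# A product name consisting only of a gp number, e.g. "gp87" or "gp 26A".
GP_ONLY = re.compile(r"gp\s*\d+[A-Za-z]?", re.IGNORECASE)
# The redundant form called out on the slide.
PHAGE_HYPOTHETICAL = re.compile(r"phage\s+hypothetical\s+protein", re.IGNORECASE)


def suggest_product(name: str) -> str:
    """Suggest a product name following the two rules on this slide.

    Anything that is not a bare gp number or "phage hypothetical protein"
    is returned unchanged apart from whitespace cleanup.
    """
    cleaned = " ".join(name.split())
    if GP_ONLY.fullmatch(cleaned) or PHAGE_HYPOTHETICAL.fullmatch(cleaned):
        return "hypothetical protein"
    return cleaned


for raw in ["gp87", "phage hypothetical protein", "DNA polymerase"]:
    print(f"{raw!r} -> {suggest_product(raw)!r}")
```

The helper deliberately leaves functional names such as “DNA polymerase” alone; whether the evidence supports such a name is a judgment call that still needs a human.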
49. Bottom line:
Manual vs. Automated
• “Turtles know the road better than
rabbits… ” Khalil Gibran
• “… but they may never reach the end!”
• The best approach?
– Human expert-based annotation
60. BLAST Ring Image Generator
Editor's Notes
Gp200 from Pseudomonas phage 201phi2-1 is related to phiKZ gp120 and EL gp78
"Shifting the genomic gold standard for the prokaryotic species definition" Michael Richter and Ramon Rosselló-Móra. PNAS vol. 106 no. 45 pg 19126–19131, doi: 10.1073/pnas.0906412106
JSpeciesWS is a quick, easy-to-use online service that estimates whether two or more (draft) genomes belong to the same species by pairwise comparison of (1) their Average Nucleotide Identity (ANI) and/or (2) correlation indexes of their tetranucleotide signatures.
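The idea behind a tetranucleotide signature comparison can be sketched in a few lines. This is a simplified illustration and not the JSpeciesWS implementation: it assumes raw single-strand 4-mer frequencies rather than the normalized scores a production tool would use, and the function names `tetra_freqs` and `pearson` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 tetranucleotides in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freqs(seq: str) -> list:
    """Return the relative frequency of each tetranucleotide in `seq`.

    Windows containing anything other than A, C, G, or T (e.g. N) are skipped.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        raise ValueError("sequence has no unambiguous tetranucleotides")
    return [counts[k] / total for k in KMERS]


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Toy inputs taken from the reads shown earlier in the deck; real
# comparisons need whole genomes, since short reads give noisy signatures.
sig_a = tetra_freqs("TGATTGTGTGTTTGCGCAATGCG")
sig_b = tetra_freqs("ATGTGTATATATAGTGAGCTTGCCC")
print(round(pearson(sig_a, sig_b), 3))
```

A correlation close to 1 indicates similar oligonucleotide usage; on its own it is supporting evidence, which is why the service pairs it with ANI.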