This is an introduction to the PATRIC Phage Genomics Workshop at the 24th Biennial Evergreen International Phage Meeting, Aug 6 2021.
It introduces the workshop outline, the system, and the genome annotation workflow.
From Sequence to Knowledge (Tools for Phage Genome Annotation)
1. From Sequence to Knowledge
Computational tools for
phage genome annotation
A helping hand through
The Annotation Bottleneck
Ramy K. Aziz
Professor of Microbiology and Immunology,
Children’s Cancer Hospital (Egypt 57357) &
Faculty of Pharmacy, Cairo University
Twitter: @azizrk
3. A bit of history…
• Since 2009, a Genomics Workshop has been an essential part of the world-famous Biennial Evergreen phage meeting
• The challenge: how to meet so many diverse needs and expectations in ~4 hours
• The next-level challenge: objectively keeping up with all the excellent tools that are being developed
6 August 2021 Phage Genomics - Evergreen 2021
5. “The analysis bottleneck”
• Observation:
– We generate more data than we can analyze.
– We generate sequence data faster
than we can analyze them.
• Opinion:
– Bottlenecks are not
created equal!
– It is important to define the question(s)
before working on the answer(s)!
10. Attendees’ expectations
• How many Evergreen/Annotation workshops have you attended?
• Have you:
– annotated at least a phage genome?
– compared several phage genomes?
– worked on a viral metagenome?
– used the command line (Unix, Linux,
Mac Terminal) for sequence analysis?
• To optimize the content, let’s
take this survey on SOCRATIVE
(http://socrative.com)
– Enter ROOM: AZIZ15
11. What do biologists want?
• A flawless, fully automated machine that reads the scientist’s mind, takes sequence as input, and converts it into publishable knowledge
Charlie Chaplin - Feeding Machine - Modern Times
13. What working in genomics really is: “It takes two to tango”
• Biologist (aka human): “Give me everything tonight!”
• Computer (aka machine): “Garbage IN, Garbage OUT”
15. What you want … is
• complete
• accurate
[Figure: two panels, “from genome” (credit: Andrew Kropinski) and “from metagenome” (credit: Bas Dutilh); panel labels: incomplete, frameshift, faulty assembly]
19. A process of reconstruction
• Experimentally
• Computationally
DNA
GTCTCTCTNNNTCTCTTG
GTCTCTCTNNNTCTCTTG
20. A process of reconstruction
• Experimentally
• Computationally
“Any phage
one can get!”
“eDNA”
GTCTCTCTNNNTCTCTTG
GTCTCTCTNNNTCTCTTG
21. THE PROCESS / PIPELINE
22. From Sequence to Knowledge: from raw sequence data to genome submission/publication
• Assembly
• Gene finding/ORF calling
• tRNA calling
• Annotation (assigning functions)
• Orienting
• Validation
• Fixing frameshifts
• Introns and inteins
• Subsystem assignment
• Refinement/secondary annotation loop
• Special purpose: toxins, morons, integrases, lifestyle prediction
• Regulatory elements (promoters, terminators)
• Output: files and graphics
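To make the "gene finding/ORF calling" step of the pipeline concrete, here is a minimal Python sketch of a naive six-frame ORF scan. It is only an illustration under stated assumptions: the start-codon set, the `min_codons` threshold, and the function names are choices made for this example, and dedicated gene callers use far more evidence than this.

```python
# Toy ORF scan: NOT how RAST, PATRIC, or any real gene caller works.
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed start set for this sketch
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (start, end) half-open, 0-based ORFs in the three forward frames.

    For each stop codon, the first in-frame start after the previous stop is
    used, so the longest candidate ORF is reported. The codon count includes
    the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end) calls on both strands, sorted by start.

    Coordinates always refer to the forward strand of the input sequence.
    """
    seq = seq.upper()
    n = len(seq)
    calls = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    for s, e in orfs_in_strand(reverse_complement(seq), min_codons):
        calls.append(("-", n - e, n - s))
    return sorted(calls, key=lambda call: call[1])
```

Calling `find_orfs(genome_sequence)` on a contig gives a rough list of candidate ORFs. The later pipeline steps (validation, fixing frameshifts, refinement) exist because simple scans like this one over-call and mis-call genes.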
28. Desired outcome
A well-characterized genome in which, ideally, we know:
• the location and function of all the genes
• the location of promoters and terminators
• the correct taxonomy
[Figure: genome map with genes 20 to 33 (including 26A), two PstI sites, and a 30.0 kb label]
Viruses; dsDNA viruses, no RNA stage; Caudovirales; Siphoviridae; T1virus
29. Desired outcome:
Create GenBank submission
• Complete, accurate description of the
genome and its taxonomy
34. Classification
• The phage sequence space (Lima-Mendez et al.)
• The phage proteomic tree (Edwards & Rohwer)
• New: ViPTree (http://www.genome.jp/viptree)
37. It is all about matching/comparing and classifying
From:
Current Opinion in Biotechnology
2003, 14:303–310
48. What to count? How to bin?
How to classify these?
49. What to count? How to bin?
Assembly or long-reads
50. What to count? How to bin?
“Truth”
51. What to count? How to bin?
Similarity, variability, and functional prediction
52. What to count? How to bin?
Counting genes/gene families (protein families)…
Counting domains/motifs
53. From Sequence to Knowledge: from raw sequence data to genome submission/publication
• Assembly
• Gene finding/ORF calling
• tRNA calling
• Annotation (assigning functions)
• Orienting
• Validation (segmenter)
• Fixing frameshifts
• Introns and inteins
• Subsystem assignment
• Refinement/secondary annotation loop
• Special purpose: toxins, morons, integrases, lifestyle prediction
• Regulatory elements (promoters, terminators)
• Output: files and graphics
54. Nomenclature Sins
• Changing “hypothetical protein” to “DNA polymerase” with no or poor-quality evidence is far worse than changing “DNA polymerase” to “hypothetical protein”.
• Be cautious about using BLASTP hits when naming gps (gene products): is there additional evidence to back the designation up?
55. All of these describe homologs of the
product of the coliphage T4 rIIA gene!
rIIA protector from prophage-induced early lysis
protector from prophage-induced early lysis
protector from prophage-induced early lysis rIIA
membrane-associated affects host membrane ATPase
rIIA membrane-associated affects host membrane ATPase
phage rIIA lysis inhibitor
rIIA protector
rIIA
rIIA protein
membrane integrity protector
hypothetical protein
unnamed protein product !!!!!!
protein of unknown function
Consistent Nomenclature
56. What do I call my gene product
(i.e. protein)?
• “phage hypothetical protein”: the word “phage” is redundant
• “gp87” (gp = gene product) → “hypothetical protein”
• gp200 describes radically different proteins in Listeria, Enterococcus, Mycobacterium, Rhodococcus, Sphingomonas, Pseudomonas, Bacillus and Synechococcus phage genomes
• Add /note=“similar to gp43 of Escherichia coli phage T4”
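The naming advice on this slide can be sketched as a small Python helper. This is a hypothetical illustration, not part of any annotation tool: the function name `tidy_product`, the `similar_to` parameter, and the regular expression for bare "gp" names are all assumptions made for the example.

```python
import re

# Assumed pattern for bare gene-product names such as "gp87" or "gp26A".
GP_NAME = re.compile(r"gp\d+[A-Za-z]?", re.IGNORECASE)


def tidy_product(name, similar_to=None):
    """Return (product, note) following the advice on the slide.

    - A bare "gpNN" name says nothing about function, so the product
      becomes "hypothetical protein".
    - "phage hypothetical protein" loses the redundant word "phage".
    - Similarity to a named protein is recorded in a note, not in the
      product name.
    """
    product = " ".join(name.split())  # collapse stray whitespace
    if GP_NAME.fullmatch(product):
        product = "hypothetical protein"
    elif product.lower() == "phage hypothetical protein":
        product = "hypothetical protein"
    note = f"similar to {similar_to}" if similar_to else None
    return product, note
```

For example, `tidy_product("gp87", similar_to="gp43 of Escherichia coli phage T4")` keeps the product as a hypothetical protein and moves the similarity statement into the note, which is where the slide suggests it belongs.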
58. Bottom line:
Manual vs. Automated
• “Tortoises can tell you more about the road
than rabbits… ” Khalil Gibran
• “… but they may never reach the end!”
• The best approach?
– Human expert-based annotation
60. PATRIC/SEED/RAST: Main concept
One genome
All genomes
“Subsystems-based technologies were developed in the SEED with the view that
the interpretation of one genome can be made more efficient and consistent if
hundreds of genomes are simultaneously annotated in one subsystem at a time”
62. Subsystems-based tools
(Extended RAST family)
• (At least) Five ways to annotate a genome via RAST:
– RAST (http://rast.nmpdr.org)
• annotates online, saves your genome on server
– Use your favorite gene caller then upload gbk file to RAST
– myRAST (local)
• uses the server, but you can edit offline
– RASTtk (second-generation RAST)
• modular
• batch upload
– PATRIC
– PHANOTATE
69. Genomic pairwise comparisons
• EMBOSS Stretcher: http://emboss.bioinformatics.nl/cgi-bin/emboss/stretcher (N.B. genomes must be collinear)
• BLASTN (NCBI)
• ANI (Average Nucleotide Identity): http://enve-omics.ce.gatech.edu/ani/
• GGDC 2.0 (Genome-to-Genome Distance Calculator): http://ggdc.dsmz.de/distcalc2.php
• jSpeciesWS (ANI): http://jspecies.ribohost.com/jspeciesws/
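As a rough illustration of what a pairwise identity score measures, the Python sketch below computes percent identity over two sequences that have already been aligned (for example, the output of a global aligner). It is a simplification under stated assumptions: the function name and the gap/N handling are choices for this example, and the ANI services listed above do considerably more than this.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences agree.

    Assumptions of this sketch: both strings come from the same alignment
    (equal length), columns where both have a gap are skipped, and an
    ambiguous base "N" never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue  # column carries no information
        compared += 1
        if a == b and a != "N":
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared
```

A column with a gap in only one sequence counts as a mismatch here. Other conventions exist, which is one reason why identity values from different tools are not always directly comparable.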
70. Proteomic pairwise comparisons
CoreGenes
http://binf.gmu.edu:8080/CoreGenes3.5/
tBLASTX
Remember that protein sequences are more conserved than DNA sequences, so protein-level comparisons are probably more useful for distant relationships.
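The point about conservation can be shown with a small Python example: two made-up DNA sequences that differ at several synonymous positions still encode the same peptide. The sequences and function names are invented for this illustration. The codon assignments are the standard genetic code, which matches the bacterial table for everything except start-codon usage.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate frame 1 of a DNA string; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def identity(a, b):
    """Percent identity of two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Two invented sequences that differ only at synonymous positions.
gene_a = "ATGGCTAAAGGTCTG"
gene_b = "ATGGCCAAGGGATTA"

dna_identity = identity(gene_a, gene_b)
protein_identity = identity(translate(gene_a), translate(gene_b))
```

Here the DNA sequences differ at 5 of 15 positions while the translated peptides are identical, which is why translated searches such as tBLASTX can detect relationships that nucleotide comparisons miss.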
80. The Kropinski wisdom
1. Always use more than one tool.
2. Never blindly trust any automated (or manual)
process.
3. Use your eyes and hands: visual inspection/
manual proofreading, re-annotation
– Every apparently complicated file is still editable in your favorite text editor (e.g., Notepad).
4. If you don’t know a gene’s function (if you
can’t convince your grandma), better keep it
unnamed than contribute to error propagation.
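One way to act on "always use more than one tool" is to compare the gene calls from two callers and inspect the disagreements by eye. The Python sketch below is a hypothetical helper written for this illustration: the `(strand, start, stop)` tuple format and the idea of matching calls by strand and stop coordinate are assumptions of the example, not the output format of any particular tool.

```python
def compare_calls(calls_a, calls_b):
    """Compare two lists of (strand, start, stop) gene calls.

    Calls are matched on (strand, stop). Two callers that agree on a stop
    but not on the start usually found the same gene with different start
    sites, which is a good candidate for manual inspection.
    """
    by_stop_a = {(strand, stop): start for strand, start, stop in calls_a}
    by_stop_b = {(strand, stop): start for strand, start, stop in calls_b}
    shared = by_stop_a.keys() & by_stop_b.keys()
    return {
        "same": sorted(k for k in shared if by_stop_a[k] == by_stop_b[k]),
        "different_start": sorted(
            k for k in shared if by_stop_a[k] != by_stop_b[k]
        ),
        "only_a": sorted(by_stop_a.keys() - by_stop_b.keys()),
        "only_b": sorted(by_stop_b.keys() - by_stop_a.keys()),
    }
```

The "different_start", "only_a", and "only_b" groups are the places to look first when proofreading an annotation manually.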
Editor's Notes
Gp200 from Pseudomonas phage 201phi2-1 is related to phiKZ gp120 and EL gp78
"Shifting the genomic gold standard for the prokaryotic species definition" Michael Richter and Ramon Rosselló-Móra. PNAS vol. 106 no. 45 pg 19126–19131, doi: 10.1073/pnas.0906412106
JSpeciesWS is a quick and easy-to-use online service that estimates whether two or more (draft) genomes belong to the same species by pairwise comparison of (1) their Average Nucleotide Identity (ANI) and/or (2) correlation indexes of their tetranucleotide signatures.