First part of the phage annotation workshop at the 2016 EMBO Viruses of Microbes Meeting (Liverpool, UK), presented on 21 July 2016 (http://events.embo.org/16-virus-microbe)
From Sequence to Knowledge: The Art and Science of Phage Genome Annotation
1. From Sequence to Knowledge: The Art & Science of Phage Genome Annotation
Ramy K. Aziz – Cairo University
2. From Sequence to Knowledge: PhAnToMe, RAST, and the Ultimate Kropinski Toolkit
A helping hand through the annotation bottleneck
Compiled by: Andrew Kropinski and Ramy Aziz
3. Online material
• Data & links:
– http://egybio.net/tutorial
• Slides
– http://bit.ly/annotation2016
– http://bit.ly/phantome4
– Old tutorials (more detailed, but missing the latest updates):
• Evergreen 2011: http://slidesha.re/phantome1
• http://slidesha.re/phiRAST1 (Karin)
• Evergreen 2013: http://bit.ly/phantome2
• Evergreen 2015: http://bit.ly/phantome3
21 July 2016 Phage Genomics - VoM 2016
5. “The analysis bottleneck”
• Observation:
– We generate more data than we can analyze.
– We generate sequence data faster than we can analyze them.
• Opinion:
– Bottlenecks are not created equal!
– It is important to define the question(s) before working on the answer(s)!
8. Quick group activity
Defining the question(s):
• How many of you have annotated a genome?
• How many phage genomes have you sequenced (or are in the process of sequencing)?
a) None b) 1–5 c) 6–50 d) > 50
• What is the single most pressing question you want to answer from genome analysis?
10. What You Want
The goal:
• complete
• accurate
Pitfalls:
• Incomplete: genome termini
• Faulty assembly
• Frameshift
• Chimeric fragments
11. A process of reconstruction
12. Annotation from genome; reconstruction from metagenome
The goal:
• complete
• accurate
Pitfalls:
• Incomplete
• Frameshift
• Faulty assembly
Credit: Andrew Kropinski; Bas Dutilh
14. A process of reconstruction
• Experimentally
DNA
TGATTGTGTGTTTGCGCAATGCG
ATGTGTATATATAGTGAGCTTGCCC
GTCTCTCTNNNTCTCTTG
TGATTGGTCTNNNTCTCTTGCGCAATGCG
15. A process of reconstruction
• Experimentally
• Computationally
TGATTGTGTGTTTGCGCAATGCG
ATGTGTATATATAGTGAGCTTGCCC
GTCTCTCTNNNTCTCTTG
TGATTGGTCTNNNTCTCTTGCGCAATGCG
DNA
TGATTGTGTGTTTGCGCAATGCG
ATGTGTATATATAGTGAGCTTGCCC
GTCTCTCTNNNTCTCTTG
TGATTGGTCTNNNTCTCTTGCGCAATGCG
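The computational side of this reconstruction can be illustrated with a toy greedy overlap merge. This is only a minimal sketch: the function names, the `min_len` threshold, and the three example reads are hypothetical, the reads are error-free, and real assemblers handle sequencing errors, repeats, and both strands.

```python
from __future__ import annotations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best = (0, None, None)  # (overlap length, index of left, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            # No remaining pair overlaps by at least min_len bases.
            return contigs
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)] + [merged]


# Hypothetical error-free reads sampled from one short sequence.
toy_reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(toy_reads))
```

When no overlap of at least `min_len` bases remains, the function returns several contigs instead of one, which is the toy analogue of an incomplete assembly.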
16. From Sequence to Knowledge
From raw sequence data to genome submission/publication
• Assembly
• Gene finding/ORF calling
• tRNA calling
• Annotation (assigning functions)
• Orienting
• Validation (segmenter)
• Fixing frameshifts
• Introns and inteins
• Subsystem assignment
• Refinement/secondary annotation loop
• Special purpose: toxins, morons, integrases, lifestyle prediction
• Regulatory elements (promoters, terminators)
• Output: files and graphics
19. General outline
• Part I: The “Kropinski toolkit”
– Tools approved and recommended by Andrew
Kropinski (http://molbiol-tools.ca): from seq to pub
• Part II: SEED-based tools:
– The RAST family
– The PhAnToMe database/portal
21. What we want, according to Andrew
A well-characterized genome in which, ideally, we know:
• the location and function of all the genes
• the location of promoters and terminators
• the correct taxonomy
[Figure: map of a 30.0 kb genome region showing genes 20–33 (including 26A) and PstI sites]
Taxonomy: Viruses; dsDNA viruses, no RNA stage; Caudovirales; Siphoviridae; T1virus
22. Desired outcome: Create GenBank
submission
• Complete, accurate description of the
genome and its taxonomy
Good title
25. Desired outcome (4)
Protein products of concern, particularly
for those interested in phage therapy:
Integrases
Toxins
[Figure: map of a 30.0 kb genome region showing genes 20–33 (including 26A) and PstI sites]
26. Processes and Steps
I. Primary analysis
(QC/ pre-annotation proofreading: e.g., orient with BLASTN)
II. Genome annotation
– Gene finding (ORF calling)
– Automated annotation
– Massaging (editing, functional assignment)
III. Second dimension (regulatory elements)
IV. Comparative genomics
V. Metadata
VI. Visualization
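To make the gene-finding (ORF calling) step concrete, here is a minimal six-frame ORF scan. It is a sketch under stated assumptions, not how any particular gene caller works: the start-codon set (ATG, GTG, TTG), the `min_codons` cutoff, and the function names are illustrative choices, and real tools also weigh ribosome-binding sites, codon usage, and overlaps.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed start codons for this sketch


def reverse_complement(seq):
    """Reverse complement, also handy for re-orienting a genome."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (start, end, strand) tuples in forward-strand coordinates.

    Coordinates are 0-based and half-open; each ORF runs from the first
    start codon after the previous stop up to and including its stop codon.
    `min_codons` counts the stop codon as well.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for pos in range(frame, n - 2, 3):
                codon = s[pos:pos + 3]
                if start is None and codon in STARTS:
                    start = pos
                elif start is not None and codon in STOPS:
                    end = pos + 3
                    if (end - start) // 3 >= min_codons:
                        if strand == "+":
                            orfs.append((start, end, strand))
                        else:
                            # Map reverse-strand positions back to the forward strand.
                            orfs.append((n - end, n - start, strand))
                    start = None
    return orfs


# Hypothetical toy sequence containing one short ORF (ATG ... TAA).
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(find_orfs(toy, min_codons=5))
```

Running the same scan on `reverse_complement(toy)` reports the ORF on the "-" strand, which is why orienting the genome first (step I) makes later comparisons easier to read.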
29. RAST (subsystems-based tools)
• Will be the major focus of this short tutorial…
• The goals are:
1. A quick demo of how to use RAST
2. A quick preview of batch annotation in RAST
3. Optimizing RAST for phage annotation
4. Demonstrating and discussing how to improve RAST output
31. The Kropinski wisdom
1. Always use more than one tool
2. Never blindly trust any automated (or manual)
process
3. Use your eyes and hands: visual inspection/
manual proofreading, re-annotation
– Every apparently complicated file is still editable in your favorite text editor (e.g., Notepad)
4. If you don’t know a gene’s function (if you
can’t convince your grandma), better keep it
unnamed than contribute to error propagation
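Rule 1 (use more than one tool) can be sketched as a comparison of gene calls from two callers. The example below is an illustration only: the coordinates are made up, and matching calls on (stop, strand) is an assumption based on the fact that two callers that find the same ORF share its stop codon even when they choose different starts.

```python
def compare_calls(calls_a, calls_b):
    """Compare two gene-call sets given as (start, stop, strand) tuples.

    Calls are matched on (stop, strand); matched calls are then split into
    those with identical starts and those whose starts differ.
    """
    a = {(stop, strand): start for start, stop, strand in calls_a}
    b = {(stop, strand): start for start, stop, strand in calls_b}
    report = {"agree": [], "start_differs": [], "only_a": [], "only_b": []}
    for key in sorted(set(a) | set(b)):
        if key not in b:
            report["only_a"].append(key)
        elif key not in a:
            report["only_b"].append(key)
        elif a[key] == b[key]:
            report["agree"].append(key)
        else:
            report["start_differs"].append(key)
    return report


# Hypothetical calls from two different gene callers on the same genome.
tool_a = [(100, 400, "+"), (500, 900, "+"), (1500, 1000, "-")]
tool_b = [(100, 400, "+"), (530, 900, "+"), (2000, 2300, "+")]
for category, genes in compare_calls(tool_a, tool_b).items():
    print(category, genes)
```

Everything outside the "agree" list is a candidate for the visual inspection and manual proofreading called for in rule 3.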
2 Aug 2015 Phage Genomics - Evergreen 2015
32. What do I call my gene product (i.e. protein)?
• “phage hypothetical protein”: redundant
• “gp87” (gp = gene product): use “hypothetical protein”
– gp200 describes radically different proteins in Listeria, Enterococcus, Mycobacterium, Rhodococcus, Sphingomonas, Pseudomonas, Bacillus and Synechococcus phage genomes
• Add /note=“similar to gp43 of Escherichia coli phage T4”
33. What do I call my gene product (i.e. protein)?
/product=“UboA”; “NrdA”; “hypothetical protein SA5_0153/152”; “ORF184” (as bad as gp184); “RNAP1”; “32 kDa protein”
Bad because they don't mean anything to the casual (or informed) reader.
Unless you are a bioinformatician or biostatistician, be conservative in recording “hits.” Could you convince your grandma? If not, list it as a “hypothetical protein”. But do take a stand: “putative DNA polymerase” is cowardly.
34. Nomenclature Sins
Renaming a hypothetical protein as “DNA polymerase” with no or poor-quality evidence is far worse than labeling a DNA polymerase as “hypothetical protein”.
Be cautious about using BLASTP hits in naming gps: is there additional evidence to back the designation up?
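Some of the naming advice on the last three slides can be checked mechanically before submission. The sketch below is a hypothetical lint, not part of any existing tool: the patterns and messages are illustrative, and it cannot catch uninformative gene-symbol names such as “UboA”, which still need a human reader.

```python
import re

# (pattern, message) pairs; all patterns are illustrative assumptions.
RULES = [
    (re.compile(r"^phage\s+hypothetical protein$", re.I),
     "redundant 'phage' prefix; use 'hypothetical protein'"),
    (re.compile(r"^(gp|orf)\s*\d+\w*$", re.I),
     "gp/ORF number is not a function; use 'hypothetical protein' and a /note"),
    (re.compile(r"^\d+(\.\d+)?\s*kDa protein$", re.I),
     "size-only name carries no functional meaning"),
    (re.compile(r"^putative\b", re.I),
     "take a stand: name the function or use 'hypothetical protein'"),
]


def check_product(name):
    """Return a list of warnings for one /product value (empty if none apply)."""
    name = name.strip()
    return [message for pattern, message in RULES if pattern.search(name)]


for product in ["phage hypothetical protein", "gp87", "ORF184",
                "32 kDa protein", "putative DNA polymerase",
                "hypothetical protein", "DNA polymerase"]:
    print(f"{product!r}: {check_product(product) or 'ok'}")
```

A clean result only means the name passes these surface checks; whether “DNA polymerase” is actually supported by evidence is still the annotator's call.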
35. Consistent Nomenclature
All of these describe homologs of the
product of the coliphage T4 rIIA gene!
rIIA protector from prophage-induced early lysis
protector from prophage-induced early lysis
protector from prophage-induced early lysis rIIA
membrane-associated affects host membrane ATPase
rIIA membrane-associated affects host membrane ATPase
phage rIIA lysis inhibitor
rIIA protector
rIIA
rIIA protein
membrane integrity protector
hypothetical protein
unnamed protein product !!!!!!
protein of unknown function
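Inconsistencies like the rIIA list above can be surfaced by grouping product names per homolog cluster and flagging clusters that carry more than one name. This is a minimal sketch: the cluster IDs, the "terminase" entry, and the normalization (lowercasing, collapsing whitespace) are assumptions, and the clustering itself has to come from a separate homology search.

```python
def inconsistent_clusters(clusters):
    """Return {cluster_id: sorted distinct names} for clusters with >1 name.

    `clusters` maps a homolog-cluster ID to the product names used for its
    members. Names are compared case-insensitively with whitespace collapsed.
    """
    flagged = {}
    for cluster_id, names in clusters.items():
        distinct = sorted({" ".join(name.lower().split()) for name in names})
        if len(distinct) > 1:
            flagged[cluster_id] = distinct
    return flagged


# Names for the rIIA cluster are taken from the slide; the rest is hypothetical.
example = {
    "rIIA_homologs": ["rIIA protein", "rIIA", "hypothetical protein",
                      "phage rIIA lysis inhibitor", "membrane integrity protector"],
    "terminase": ["terminase large subunit", "Terminase large subunit"],
}
for cluster_id, names in inconsistent_clusters(example).items():
    print(cluster_id, names)
```

Deciding which of the competing names should win remains a manual curation step.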
36. Bottom line:
Manual vs. Automated
• “Turtles know the road better than
rabbits… ” Khalil Gibran
• “… but they may never reach the end!”
• The best approach?
– Human expert-based annotation
39. Genomic pairwise comparisons
• EMBOSS Stretcher: http://emboss.bioinformatics.nl/cgi-bin/emboss/stretcher (N.B. genomes must be collinear)
• BLASTN (NCBI)
• ANI (Average Nucleotide Identity): http://enve-omics.ce.gatech.edu/ani/
• GGDC 2.0 (Genome to Genome Distance Calculator): http://ggdc.dsmz.de/distcalc2.php
• jSpeciesWS (ANI): http://jspecies.ribohost.com/jspeciesws/
40. Proteomic pairwise comparisons
• CoreGenes: http://binf.gmu.edu:8080/CoreGenes3.0/
• TBLASTX
Remember that protein sequence is more conserved than DNA sequence, so protein-level comparison is probably more useful for more distant relationships.
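The point about protein conservation can be shown with a toy pair of coding sequences that differ only at synonymous third-codon positions. This is a sketch with made-up sequences: it uses the standard genetic code, assumes the sequences are already aligned without gaps, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate frame 1 of a DNA string; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def identity(a, b):
    """Percent identity of two equal-length, already-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical homologous genes differing only by synonymous substitutions.
gene_1 = "ATGGCTGAAAAA"
gene_2 = "ATGGCCGAGAAG"
print("DNA identity:    ", identity(gene_1, gene_2))
print("Protein identity:", identity(translate(gene_1), translate(gene_2)))
```

In this toy case the DNA sequences differ at three of twelve positions while the encoded peptides are identical, which is why TBLASTX and CoreGenes can detect relationships that BLASTN misses.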
Gp200 from Pseudomonas phage 201phi2-1 is related to phiKZ gp120 and EL gp78
"Shifting the genomic gold standard for the prokaryotic species definition" Michael Richter and Ramon Rosselló-Móra. PNAS vol. 106 no. 45 pg 19126–19131, doi: 10.1073/pnas.0906412106
JSpeciesWS is a quick, easy-to-use online service that estimates whether two or more (draft) genomes belong to the same species by pairwise comparison of (1) their Average Nucleotide Identity (ANI) and/or (2) the correlation indexes of their tetranucleotide signatures.
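The idea behind a tetranucleotide-signature correlation can be sketched in a few lines. This is a simplified stand-in, not the JSpeciesWS implementation: it correlates raw tetranucleotide frequencies on one strand only, whereas the published approach is understood to compare normalized over- and under-representation scores. The sequences and function names are hypothetical.

```python
from itertools import product
from math import sqrt

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # all 256 tetranucleotides


def tetra_freqs(seq):
    """Relative frequency of each tetranucleotide (windows with N are skipped)."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in KMERS]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant vector")
    return cov / sqrt(vx * vy)


# Hypothetical short sequences; real signatures need whole genomes.
seq_a = "ATGCGTACGTTAGCATGCCGATAGCTAGGCTAATGCGTACG"
seq_b = "TTTTAAAATTTTAAAACCCCGGGGTTTTAAAATTTTAAAAC"
print(pearson(tetra_freqs(seq_a), tetra_freqs(seq_b)))
```

On sequences this short the value is not meaningful; the sketch only shows the mechanics of turning two genomes into signature vectors and correlating them.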