From Sequence to Knowledge:
The Art & Science of Phage
Genome Annotation
Ramy K. Aziz – Cairo University
From Sequence to
Knowledge:
PhAnToMe, RAST, and the
Ultimate Kropinski Toolkit
A helping hand through
The Annotation Bottleneck
Compiled by: Andrew Kropinski and Ramy Aziz
Online material
• Data & links:
– http://egybio.net/tutorial
• Slides
– http://bit.ly/annotation2016
– http://bit.ly/phantome4
– Old tutorials (more detailed, but missing the latest updates):
• Evergreen 2011: http://slidesha.re/phantome1
• http://slidesha.re/phiRAST1 (Karin)
• Evergreen 2013: http://bit.ly/phantome2
• Evergreen 2015: http://bit.ly/phantome3
21 July 2016 Phage Genomics - VoM 2016
INTRODUCTION
“The analysis bottleneck”
• Observation:
– We generate more data than we can analyze.
– We generate sequence data faster than we can analyze them.
• Opinion:
– Bottlenecks are not created equal!
– It is important to define the question(s) before working on the answer(s)!
“The analysis bottleneck”
• The Lavigne paradox
Quick group activity
Defining the question(s):
• How many of you have annotated a genome?
• How many phage genomes have you sequenced (or are in the process of sequencing)?
a) None b) 1-5 c) 5-50 d) > 50
• What is the single most pressing question you want to answer from genome analysis?
DEFINING THE QUESTION(S)
“Begin with the end in mind” (Stephen Covey, The 7 Habits of Highly Effective People)
What You Want
The goal: a genome that is
• complete
• accurate
Problems to watch for:
• incomplete genome termini
• faulty assembly
• frameshifts
• chimeric fragments
A process of reconstruction
Annotation from genome, reconstruction from metagenome
[Figure: two panels. The goal in both is a sequence that is complete and accurate; the problems shown are incomplete sequence, frameshifts, and faulty assembly. Credits: Andrew Kropinski, Bas Dutilh]
A process of reconstruction
• Experimentally
• Computationally
DNA fragments (example shown on the slide):
TGATTGTGTGTTTGCGCAATGCG
ATGTGTATATATAGTGAGCTTGCCC
GTCTCTCTNNNTCTCTTG
TGATTGGTCTNNNTCTCTTGCGCAATGCG
From Sequence to Knowledge
From raw sequence data to genome submission/publication
[Flowchart; steps listed in the order extracted from the slide]
• Assembly
• Gene finding/ORF calling
• tRNA calling
• Annotation (assigning functions)
• Orienting
• Validation (segmenter)
• Fixing frameshifts
• Introns and inteins
• Subsystem assignment
• Refinement/secondary annotation loop
• Special purpose: toxins, morons, integrases, lifestyle prediction
• Regulatory elements (promoters, terminators)
• Output: files and graphics
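To make the "gene finding/ORF calling" step concrete, here is a minimal sketch of a naive six-frame ORF scan. It is an illustration only, not what RAST or any dedicated gene caller does: it assumes ATG as the only start codon, the three standard stop codons, and a hypothetical `min_codons` cutoff chosen by the user.

```python
# Naive six-frame ORF scan: an illustration only, not a real gene caller.
# Assumptions: ATG is the only start codon; TAA/TAG/TGA are the stops.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase/lowercase DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end, n_codons) for every ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    that was scanned. n_codons counts the start and stop codons.
    """
    orfs = []
    for strand, dna in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i + 3 - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, start, i + 3, n_codons))
                    start = None  # look for the next ORF in this frame
    return orfs
```

Calling `find_orfs(genome, min_codons=30)` on an assembled sequence gives candidate coordinates that still need the manual inspection the rest of this tutorial argues for.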
Countless tools
Authority figures
Andrew Kropinski, Rob Lavigne, Rob Edwards
General outline
• Part I: The “Kropinski toolkit”
– Tools approved and recommended by Andrew Kropinski (http://molbiol-tools.ca): from sequence to publication
• Part II: SEED-based tools:
– The RAST family
– The PhAnToMe database/portal
The Kropinski Toolkit
What we want, according to Andrew
A well-characterized genome in which, ideally, we know:
• the location and function of all the genes
• the location of promoters and terminators
• the correct taxonomy
[Figure: gene map with PstI sites, genes 20 to 33 (including 26A), and a 30.0 kb label]
Taxonomy shown: Viruses; dsDNA viruses, no RNA stage; Caudovirales; Siphoviridae; T1virus
Desired outcome: Create GenBank submission
• Complete, accurate description of the genome and its taxonomy
• Good title
Desired outcome (2)
Desired outcome (3)
Desired outcome (4)
• Protein products of concern, particularly for those interested in phage therapy:
– Integrases
– Toxins
[Figure: gene map with PstI sites, genes 20 to 33 (including 26A), and a 30.0 kb label]
Processes and Steps
I. Primary analysis
(QC/ pre-annotation proofreading: e.g., orient with BLASTN)
II. Genome annotation
– Gene finding (ORF calling)
– Automated annotation
– Massaging (editing, functional assignment)
III. Second dimension (regulatory elements)
IV. Comparative genomics
V. Metadata
VI. Visualization
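As an illustration of the "orient with BLASTN" part of step I, the sketch below reorients a genome once a BLASTN comparison against a reference has told you the strand of the best hit and, for circularly permuted assemblies, where the sequence should start. The function name `reorient` and its parameters are hypothetical; the BLASTN run itself is assumed to have been done separately.

```python
# Reorient a genome after inspecting a BLASTN hit against a reference.
# Assumption: the strand and start position were read off the BLASTN
# report by hand; this code does not run BLAST.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string; characters such as N are kept."""
    return seq.translate(COMPLEMENT)[::-1]


def reorient(seq, hit_strand="+", new_start=0):
    """Return seq in the same orientation as the reference.

    hit_strand: strand of the best BLASTN hit to the reference ("+" or "-").
    new_start:  0-based position (after any strand flip) that should become
                position 0. Rotation only makes sense when the assembly is
                circularly permuted relative to the reference.
    """
    if hit_strand not in ("+", "-"):
        raise ValueError("hit_strand must be '+' or '-'")
    if hit_strand == "-":
        seq = reverse_complement(seq)
    if not 0 <= new_start < max(len(seq), 1):
        raise ValueError("new_start out of range")
    return seq[new_start:] + seq[:new_start]
```

Reorienting before annotation keeps gene numbering and pairwise comparisons consistent with related genomes, which matters later for tools that require collinear input.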
AUTOMATED ANNOTATION
II. Genome Annotation
RAST (subsystems-based tools)
• Will be the major focus of this short tutorial…
• The goals are:
1. A quick demo of how to use RAST
2. A quick preview of batch annotation in RAST
3. Optimizing RAST for phage annotation
4. Demonstrating and discussing how to improve RAST output
RAST (subsystems-based tools)
• But, before getting there …
The Kropinski wisdom
1. Always use more than one tool
2. Never blindly trust any automated (or manual) process
3. Use your eyes and hands: visual inspection, manual proofreading, re-annotation
– Every apparently complicated file is still editable in your favorite text editor (e.g., Notepad)
4. If you don’t know a gene’s function (if you can’t convince your grandma), it is better to keep it unnamed than to contribute to error propagation
2 Aug 2015 Phage Genomics - Evergreen 2015
What do I call my gene product (i.e. protein)?
• “phage hypothetical protein”: redundant
• “gp87” (gp = gene product) → “hypothetical protein”
• “gp200” describes radically different proteins in Listeria, Enterococcus, Mycobacterium, Rhodococcus, Sphingomonas, Pseudomonas, Bacillus, and Synechococcus phage genomes
• Add /note="similar to gp43 of Escherichia coli phage T4"
What do I call my gene product (i.e. protein)?
• /product="UboA"; "NrdA"; "hypothetical protein SA5_0153/152"; "ORF184" (as bad as "gp184"); "RNAP1"; "32 kDa protein"
• These are bad because they don't mean anything to the casual (or informed) reader.
• Unless you are a bioinformatician or biostatistician, be conservative in recording “hits.” Could you convince your grandma? If not, list the product as a “hypothetical protein”. But do take a stand: “putative DNA polymerase” is cowardly.
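The naming advice above can be turned into a quick sanity check over /product values before submission. The sketch below is a hypothetical rule set built only from the examples on these slides; it is not an official GenBank validator, and the pattern list and function name are assumptions.

```python
import re

# Patterns for /product names that the slides call uninformative.
# Illustrative rule set only; extend or prune it for your own genomes.
UNINFORMATIVE = [
    (re.compile(r"^gp\d+$", re.IGNORECASE), "bare gp number"),
    (re.compile(r"^orf\d+$", re.IGNORECASE), "bare ORF number"),
    (re.compile(r"^\d+(\.\d+)?\s*kDa protein$", re.IGNORECASE),
     "molecular weight only"),
    (re.compile(r"^phage hypothetical protein$", re.IGNORECASE),
     "redundant 'phage'"),
    (re.compile(r"^hypothetical protein\s+\S+", re.IGNORECASE),
     "locus tag appended"),
    (re.compile(r"^putative\b", re.IGNORECASE),
     "hedged name ('putative ...')"),
]


def check_product(name):
    """Return a list of complaints about a /product value (empty if none)."""
    cleaned = name.strip()
    return [reason for pattern, reason in UNINFORMATIVE
            if pattern.search(cleaned)]
```

A flagged name still needs a human decision: either there is enough evidence to state the function plainly, or the product stays a "hypothetical protein" with the similarity recorded in a /note.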
Nomenclature Sins
• Changing “hypothetical protein” → “DNA polymerase” with no or poor-quality evidence is far worse than changing “DNA polymerase” → “hypothetical protein”.
• Be cautious about using BLASTP hits in naming gene products: is there additional evidence to back the designation up?
Consistent Nomenclature
• All of these describe homologs of the product of the coliphage T4 rIIA gene!
rIIA protector from prophage-induced early lysis
protector from prophage-induced early lysis
protector from prophage-induced early lysis rIIA
membrane-associated affects host membrane ATPase
rIIA membrane-associated affects host membrane ATPase
phage rIIA lysis inhibitor
rIIA protector
rIIA
rIIA protein
membrane integrity protector
hypothetical protein
unnamed protein product !!!!!!
protein of unknown function
Bottom line:
Manual vs. Automated
• “Turtles know the road better than rabbits…” (Khalil Gibran)
• “… but they may never reach the end!”
• The best approach?
– Human expert-based annotation
IV. COMPARATIVE GENOMICS
Genomic pairwise comparisons
• EMBOSS Stretcher: http://emboss.bioinformatics.nl/cgi-bin/emboss/stretcher (N.B. genomes must be collinear)
• BLASTN (NCBI)
• ANI (Average Nucleotide Identity): http://enve-omics.ce.gatech.edu/ani/
• GGDC 2.0 (Genome-to-Genome Distance Calculator): http://ggdc.dsmz.de/distcalc2.php
• jSpeciesWS (ANI): http://jspecies.ribohost.com/jspeciesws/
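The idea behind these nucleotide comparisons can be sketched with a simple percent-identity calculation over the two rows of a global alignment, such as the gapped sequences produced by a pairwise aligner. This is a simplification and not the ANI algorithm of the services listed above, which work on fragmented genomes; the function name and the treatment of N as a mismatch are assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Both inputs must be rows of the same pairwise alignment (equal length).
    An ambiguous base N is never counted as a match.
    Returns (identity_percent, compared_columns).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns entirely
        compared += 1
        if a == b and a != "N":
            matches += 1
    if compared == 0:
        return 0.0, 0
    return 100.0 * matches / compared, compared
```

Reporting the number of compared columns alongside the percentage helps spot cases where a high identity rests on only a small aligned fraction of the two genomes.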
Proteomic pairwise comparisons
• CoreGenes (http://binf.gmu.edu:8080/CoreGenes3.0/)
• TBLASTX
• Remember: protein sequence is more conserved than DNA sequence, so protein-level comparison is probably more useful for more distant relationships
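A small worked example shows why protein sequence is more conserved than DNA sequence: synonymous codon changes alter the nucleotides but not the encoded amino acids. The two short coding sequences below are made up for illustration, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate frame 1 of a DNA string; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def identity(a, b):
    """Percent of identical positions between two equal-length strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Two hypothetical coding sequences that differ only at synonymous sites.
gene_a = "ATGGCTCTGAGCGGTAAA"
gene_b = "ATGGCACTTTCCGGAAAG"

dna_identity = identity(gene_a, gene_b)
protein_identity = identity(translate(gene_a), translate(gene_b))
```

Here the two genes differ at 6 of 18 nucleotides yet encode the same peptide, which is why TBLASTX or protein-based tools can still detect relationships that BLASTN misses.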
VI. “POLISH” IT TO PUBLISH IT
Servers & software
• BLAST Ring Image Generator (http://brig.sourceforge.net)
• CGView (http://wishart.biology.ualberta.ca/cgview)
• CGView Comparison Tool (http://stothard.afns.ualberta.ca/downloads/CCT)
• Circos (http://circos.ca)
• DNAPlotter (http://www.sanger.ac.uk/science/tools/dnaplotter)
• Easyfig (http://easyfig.sourceforge.net)
• GenomeVx (http://wolfe.ucd.ie/GenomeVx)
• GView Server (https://server.gview.ca)
• progressiveMauve and ACT
EasyFig
CGView Comparison Tool
BLAST Ring Image Generator


Editor's Notes

  • #33 Gp200 from Pseudomonas phage 201phi2-1 is related to phiKZ gp120 and EL gp78
  • #40 "Shifting the genomic gold standard for the prokaryotic species definition", Michael Richter and Ramon Rosselló-Móra. PNAS vol. 106 no. 45 pp. 19126–19131, doi: 10.1073/pnas.0906412106. JSpeciesWS is a quick, easy-to-use online service that estimates whether two or more (draft) genomes belong to the same species by pairwise comparison of (1) their Average Nucleotide Identity (ANI) and/or (2) correlation indexes of their tetranucleotide signatures.
  • #44 Star - online