1. Rapid automatic microbial
genome annotation
using Prokka
Dr Torsten Seemann
Applied Bioinformatics and Public Health Microbiology - Wed 15 May 2013 @ 1530h - Moller Centre, Cambridge, UK
4. The team
● Simon Gladman
o VelvetOptimiser author, presenting Galaxy poster
● Paul Harrison
o author of Nesoni toolkit
● David Powell
o author of VAGUE, software wizard, theoretician
● Dieter Bulach
o sequence magician, closes genomes at will
... and we are recruiting.
8. De novo assembly
Ideally, one sequence per replicon.
Millions of short
sequences
(reads)
A few long
sequences
(contigs)
Reconstruct the original genome sequence
from the sequence reads only
9. Annotation
Adding biological information to sequences.
ACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGA
AAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTC
CCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTG
GCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGG
ACAGAATGCCCTGCAGGAACTTCTTCTAGAAGACCTTCTCCTCCTG
CAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGA
CCTGAAACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCT
CTCCGTCCGTCCGTGGGCCACGGCCACCGCTTTTTTTTTTGCC
delta toxin
PubMed: 15353161
ribosome
binding site
transfer RNA
Leu-(UUR)
tandem repeat
CCGT x 3
homopolymer
10 x T
10. What's in an annotation?
● Location
o which sequence? chromosome 2
o where on the sequence? 100..659
o what strand? -ve
● Feature type
o what is it? protein
coding gene
● Attributes
o protein product? alcohol
dehydrogenase
o enzyme code? EC:1.1.1.1
o subcellular location? cytoplasm
11. Bacterial feature types
● protein coding genes
o promoter (-10, -35)
o ribosome binding site (RBS)
o coding sequence (CDS)
signal peptide, protein domains, structure
o terminator
● non coding genes
o transfer RNA (tRNA)
o ribosomal RNA (rRNA)
o non-coding RNA (ncRNA)
● other
o repeat patterns, operons, origin of replication, ...
13. Key bacterial features
● tRNA
o easy to find and annotate: anti-codon
● rRNA
o easy to find and annotate: 5s 16s 23s
● CDS
o straightforward to find candidates
false positives are often small ORFs
wrong start codon
o partial genes, remnants
o pseudogenes
o assigning function is the bulk of the workload
14. Automatic annotation
Two strategies for identifying coding genes:
● sequence alignment
o find known protein sequences in the contigs
transfer the annotation across
o will miss proteins not in your database
o may miss partial proteins
● ab initio gene finding
o find candidate open reading frames
build model of ribosome binding sites
predict coding regions
o may choose the incorrect start codon
o may miss atypical genes, overpredict small genes
15. Some good existing tools
Software
ab
initio
align-
ment
Availability Speed
RAST yes yes web only 12-24 hours
xBASE yes no web only >8 hours
BG7 no yes standalone >10 hours
PGAAP
(NCBI)
yes yes email / we >1 month
16. Why another tool?
● Convenience
o I have sequence, just tell me what's in it, please.
● Speed
o exploit multi-core computers (aim < 15min)
● Standards compliant
o GFF3/GBK for viewing, TBL/FSA for Genbank sub.
● Rich consistent trustworthy output
o /product /gene /EC_number
● Provenance
o a record of where/how/why is was annotated so
17. Why "Prokka" ?
● Unique in Google
● I like the letter "k"
● Easy to type
● It sounds Aussie
● Loosely fits "Prokaryotic Annotation"
● It rhymes with "Quokka"
o Australian cat-sized nocturnal marsupial herbivore
o first Aussie mammal seen by Europeans - "giant rat"
20. Predicting protein function
Sequence similarity is a proxy for homology
● Sequence based (alignment)
o tools: BLAST, BLAT, FASTA, Exonerate
o databases: RefSeq, Uniprot, ...
● Model based ("fuzzy sequence" matching)
o PSSM: position specific scoring matrix
tools: RPS-BLAST, Psi-BLAST
databases: CDD, COG, Smart
o HMM: hidden Markov models
tools: HMMER, HHblits
databases: Pfam, TIGRfams
21. Sequence databases
I'll just BLAST against the non-redundant database.
-- Anonymous
● Which one?
o nucleotide (nt) or protein (nr)
● It's actually quite redundant
o only eliminates exact matching sequences
● It's not picky
o nearly anything is admitted, garbage in garbage out
● It's too big
o searching takes too long
22. Hierarchical searching
● Facts
o searching against smaller databases is faster
o searching against similar sequences is faster
● Idea
o start with small set of close proteins
o advance to larger sets of more distant proteins
● Prokka
o your own custom "trusted" set (optional)
o core bacterial proteome (default)
o genus specific proteome (optional)
o whole protein HMMs: PRK clusters, TIGRfams
o protein domain HMMs: Pfam
23. Core bacterial proteome
● Many bacterial proteins are conserved
o experimentally validated
o small number of them
o good annotations
● Prokka provides this database
o derived from UniProt-Swissprot
o only bacterial proteins
o only accept evidence level 1 (aa) or 2 (RNA)
o reject "Fragment" entries
o extract /gene /EC_number /product /db_xref
● First step gets ~50% of the genes
o BLAST+ blastp, multi-threading to use all CPUs
24. The remainder
● Prokka has genus specific databases
o aim to capture "genus specific" naming conventions
o derived from proteins in completed genomes
o proteins are clustered and majority annotation wins
o some annotations are rubbish though
● Custom model databases
o I took COG/PRK MSAs and made HMMs
● Existing model databases
o Pfam, TIGRfams are well curated
● And if all else fails
26. Provenance
Recording where an annotation came from.
Prokka uses Genbank "evidence qualifier" tags:
Wet lab
/experiment="EXISTENCE:Northern blot"
Dry lab
/inference="similar to DNA sequence:INSD:AACN010222672.1"
/inference="profile:tRNAscan:2.1"
/inference="protein motif:InterPro:IPR001900"
/inference="ab initio prediction:Glimmer:3.0"
27. Example from Prokka
Feature Type:
tRNA
Location:
contig000341 @ 655..730 +
Attributes:
/gene="tRNA-Leu(UUR)"
/anticodon=(pos:678..680,aa:Leu)
/product="transfer RNA-Leu(UUR)"
/inference="profile:Aragorn:1.2"
29. Software goals
● Follow basic conventions
o "prokka" should say something helpful
o "prokka -h" or "prokka --help" should show help
● All options should be optional
o "prokka contigs.fa" should do something useful
● Fail gracefully
o check your dependencies exist
o produce useful error messages
o generate a log file (provenance!)
● Use standard input and output file formats
o or at least tab-separated values if you insist...
30. Prokka in context
● Prokka is not
o particularly original
o technically or algorithmically significant
o foolproof to install some dependencies
o for everyone
BUT
● Prokka
o is an ongoing project which will only improve :-)
o checks it will run properly before wasting your time
o does what it claims, and does it quickly
o is being used widely
32. Prokka in the wild
● Pathogen Informatics @ Sanger UK
o Andrew Page
o 50,000 draft genomes in 2 weeks (24 sec each!)
● Austin Hospital @ Melbourne AU
o Ben Howden - Dept Infectious Diseases
o assembly & annotation of MiSeq clinical isolates
● VBC @ Monash AU
o assemble & annotate all of SRA
● Many more
o Public Health Agency of Canada
33. Planned features
● Modularity
o alternate sub-tools eg. Aragorn vs tRNAscan-SE
o every sub-system should be optional
o facilitate entry into Galaxy Toolshed
● Better support for
o metagenome assemblies, viruses and archaea
o broken genes, pseudogenes, assembly breakpoints
● Faster
o smaller core databases
o better parallelisation and less disk i/o
● Prokka-Web
o web server version currently in beta testing
34. Acknowledgements
● Organisers
o Conference committee - for inviting me
o Wellcome Trust - Laura Hubbard
● Original Prokka testers
o Simon Gladman & Dieter Bulach (internal)
o Tim Stinear & Scott Chandry (external)
● Funding
o VLSCI / LSCC
o Monash University
● Family
o Naomi, Oskar, Zoe - for tolerating my absences