Prokka - rapid bacterial genome annotation - ABPHM 2013

Rapid automatic microbial
genome annotation
using Prokka
Dr Torsten Seemann
Applied Bioinformatics and Public Health Microbiology - Wed 15 May 2013 @ 1530h - Moller Centre, Cambridge, UK

We come from a land down-under

The team
● Simon Gladman
o VelvetOptimiser author, presenting Galaxy poster
● Paul Harrison
o author of Nesoni toolkit
● David Powell
o author of VAGUE, software wizard, theoretician
● Dieter Bulach
o sequence magician, closes genomes at will
... and we are recruiting.

History
2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
?
?
?
?
?
?
?
?

De novo
assembly
Align reads to
a reference
Process

De novo assembly
Ideally, one sequence per replicon.
Millions of short
sequences
(reads)
A few long
sequences
(contigs)
Reconstruct the original genome sequence
from the sequence reads only

Annotation
Adding biological information to sequences.
ACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGA
AAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTC
CCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTG
GCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGG
ACAGAATGCCCTGCAGGAACTTCTTCTAGAAGACCTTCTCCTCCTG
CAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGA
CCTGAAACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCT
CTCCGTCCGTCCGTGGGCCACGGCCACCGCTTTTTTTTTTGCC
delta toxin
PubMed: 15353161
ribosome
binding site
transfer RNA
Leu-(UUR)
tandem repeat
CCGT x 3
homopolymer
10 x T

What's in an annotation?
● Location
o which sequence? chromosome 2
o where on the sequence? 100..659
o what strand? -ve
● Feature type
o what is it? protein
coding gene
● Attributes
o protein product? alcohol
dehydrogenase
o enzyme code? EC:1.1.1.1
o subcellular location? cytoplasm

Bacterial feature types
● protein coding genes
o promoter (-10, -35)
o ribosome binding site (RBS)
o coding sequence (CDS)
 signal peptide, protein domains, structure
o terminator
● non coding genes
o transfer RNA (tRNA)
o ribosomal RNA (rRNA)
o non-coding RNA (ncRNA)
● other
o repeat patterns, operons, origin of replication, ...

Key bacterial features
● tRNA
o easy to find and annotate: anti-codon
● rRNA
o easy to find and annotate: 5s 16s 23s
● CDS
o straightforward to find candidates
 false positives are often small ORFs
 wrong start codon
o partial genes, remnants
o pseudogenes
o assigning function is the bulk of the workload

Automatic annotation
Two strategies for identifying coding genes:
● sequence alignment
o find known protein sequences in the contigs
 transfer the annotation across
o will miss proteins not in your database
o may miss partial proteins
● ab initio gene finding
o find candidate open reading frames
 build model of ribosome binding sites
 predict coding regions
o may choose the incorrect start codon
o may miss atypical genes, overpredict small genes

Some good existing tools
Software
ab
initio
align-
ment
Availability Speed
RAST yes yes web only 12-24 hours
xBASE yes no web only >8 hours
BG7 no yes standalone >10 hours
PGAAP
(NCBI)
yes yes email / we >1 month

Why another tool?
● Convenience
o I have sequence, just tell me what's in it, please.
● Speed
o exploit multi-core computers (aim < 15min)
● Standards compliant
o GFF3/GBK for viewing, TBL/FSA for Genbank sub.
● Rich consistent trustworthy output
o /product /gene /EC_number
● Provenance
o a record of where/how/why is was annotated so

Why "Prokka" ?
● Unique in Google
● I like the letter "k"
● Easy to type
● It sounds Aussie
● Loosely fits "Prokaryotic Annotation"
● It rhymes with "Quokka"
o Australian cat-sized nocturnal marsupial herbivore
o first Aussie mammal seen by Europeans - "giant rat"

Prokka pipeline (simplified)
tRNA
rRNA
ncRNA
CDS
FASTA
contigs
Infernal
RNAmmer
Prodigal SignalP
Aragorn
sig_peptide
protein domains
HMMER3
protein annotation
BLAST+
Rfam
Swiss Pfam TIGRUser
GFF3
GBK
ASN1

Predicting protein function
Sequence similarity is a proxy for homology
● Sequence based (alignment)
o tools: BLAST, BLAT, FASTA, Exonerate
o databases: RefSeq, Uniprot, ...
● Model based ("fuzzy sequence" matching)
o PSSM: position specific scoring matrix
 tools: RPS-BLAST, Psi-BLAST
 databases: CDD, COG, Smart
o HMM: hidden Markov models
 tools: HMMER, HHblits
 databases: Pfam, TIGRfams

Sequence databases
I'll just BLAST against the non-redundant database.
-- Anonymous
● Which one?
o nucleotide (nt) or protein (nr)
● It's actually quite redundant
o only eliminates exact matching sequences
● It's not picky
o nearly anything is admitted, garbage in garbage out
● It's too big
o searching takes too long

Hierarchical searching
● Facts
o searching against smaller databases is faster
o searching against similar sequences is faster
● Idea
o start with small set of close proteins
o advance to larger sets of more distant proteins
● Prokka
o your own custom "trusted" set (optional)
o core bacterial proteome (default)
o genus specific proteome (optional)
o whole protein HMMs: PRK clusters, TIGRfams
o protein domain HMMs: Pfam

Core bacterial proteome
● Many bacterial proteins are conserved
o experimentally validated
o small number of them
o good annotations
● Prokka provides this database
o derived from UniProt-Swissprot
o only bacterial proteins
o only accept evidence level 1 (aa) or 2 (RNA)
o reject "Fragment" entries
o extract /gene /EC_number /product /db_xref
● First step gets ~50% of the genes
o BLAST+ blastp, multi-threading to use all CPUs

The remainder
● Prokka has genus specific databases
o aim to capture "genus specific" naming conventions
o derived from proteins in completed genomes
o proteins are clustered and majority annotation wins
o some annotations are rubbish though
● Custom model databases
o I took COG/PRK MSAs and made HMMs
● Existing model databases
o Pfam, TIGRfams are well curated
● And if all else fails

Provenance
Recording where an annotation came from.
Prokka uses Genbank "evidence qualifier" tags:
Wet lab
/experiment="EXISTENCE:Northern blot"
Dry lab
/inference="similar to DNA sequence:INSD:AACN010222672.1"
/inference="profile:tRNAscan:2.1"
/inference="protein motif:InterPro:IPR001900"
/inference="ab initio prediction:Glimmer:3.0"

Example from Prokka
Feature Type:
tRNA
Location:
contig000341 @ 655..730 +
Attributes:
/gene="tRNA-Leu(UUR)"
/anticodon=(pos:678..680,aa:Leu)
/product="transfer RNA-Leu(UUR)"
/inference="profile:Aragorn:1.2"

Software goals
● Follow basic conventions
o "prokka" should say something helpful
o "prokka -h" or "prokka --help" should show help
● All options should be optional
o "prokka contigs.fa" should do something useful
● Fail gracefully
o check your dependencies exist
o produce useful error messages
o generate a log file (provenance!)
● Use standard input and output file formats
o or at least tab-separated values if you insist...

Prokka in context
● Prokka is not
o particularly original
o technically or algorithmically significant
o foolproof to install some dependencies
o for everyone
BUT
● Prokka
o is an ongoing project which will only improve :-)
o checks it will run properly before wasting your time
o does what it claims, and does it quickly
o is being used widely

Prokka in the wild
● Pathogen Informatics @ Sanger UK
o Andrew Page
o 50,000 draft genomes in 2 weeks (24 sec each!)
● Austin Hospital @ Melbourne AU
o Ben Howden - Dept Infectious Diseases
o assembly & annotation of MiSeq clinical isolates
● VBC @ Monash AU
o assemble & annotate all of SRA
● Many more
o Public Health Agency of Canada

Planned features
● Modularity
o alternate sub-tools eg. Aragorn vs tRNAscan-SE
o every sub-system should be optional
o facilitate entry into Galaxy Toolshed
● Better support for
o metagenome assemblies, viruses and archaea
o broken genes, pseudogenes, assembly breakpoints
● Faster
o smaller core databases
o better parallelisation and less disk i/o
● Prokka-Web
o web server version currently in beta testing

Acknowledgements
● Organisers
o Conference committee - for inviting me
o Wellcome Trust - Laura Hubbard
● Original Prokka testers
o Simon Gladman & Dieter Bulach (internal)
o Tim Stinear & Scott Chandry (external)
● Funding
o VLSCI / LSCC
o Monash University
● Family
o Naomi, Oskar, Zoe - for tolerating my absences

Contact
Email torsten.seemann@monash.edu
Twitter @torstenseemann
Blog TheGenomeFactory.blogspot.com
Web www.bioinformatics.net.au

Prokka - rapid bacterial genome annotation - ABPHM 2013

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Prokka - rapid bacterial genome annotation - ABPHM 2013

Similar to Prokka - rapid bacterial genome annotation - ABPHM 2013 (20)

More from Torsten Seemann

More from Torsten Seemann (15)

Recently uploaded

Recently uploaded (20)

Prokka - rapid bacterial genome annotation - ABPHM 2013