SlideShare a Scribd company logo
1 of 36
Rapid automatic microbial
genome annotation
using Prokka
Dr Torsten Seemann
Applied Bioinformatics and Public Health Microbiology - Wed 15 May 2013 @ 1530h - Moller Centre, Cambridge, UK
Background
We come from a land down-under
The team
● Simon Gladman
o VelvetOptimiser author, presenting Galaxy poster
● Paul Harrison
o author of Nesoni toolkit
● David Powell
o author of VAGUE, software wizard, theoretician
● Dieter Bulach
o sequence magician, closes genomes at will
... and we are recruiting.
History
2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
?
?
?
?
?
?
?
?
Introduction
De novo
assembly
Align reads to
a reference
Process
De novo assembly
Ideally, one sequence per replicon.
Millions of short
sequences
(reads)
A few long
sequences
(contigs)
Reconstruct the original genome sequence
from the sequence reads only
Annotation
Adding biological information to sequences.
ACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGA
AAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTC
CCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTG
GCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGG
ACAGAATGCCCTGCAGGAACTTCTTCTAGAAGACCTTCTCCTCCTG
CAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGA
CCTGAAACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCT
CTCCGTCCGTCCGTGGGCCACGGCCACCGCTTTTTTTTTTGCC
delta toxin
PubMed: 15353161
ribosome
binding site
transfer RNA
Leu-(UUR)
tandem repeat
CCGT x 3
homopolymer
10 x T
What's in an annotation?
● Location
o which sequence? chromosome 2
o where on the sequence? 100..659
o what strand? -ve
● Feature type
o what is it? protein
coding gene
● Attributes
o protein product? alcohol
dehydrogenase
o enzyme code? EC:1.1.1.1
o subcellular location? cytoplasm
Bacterial feature types
● protein coding genes
o promoter (-10, -35)
o ribosome binding site (RBS)
o coding sequence (CDS)
 signal peptide, protein domains, structure
o terminator
● non coding genes
o transfer RNA (tRNA)
o ribosomal RNA (rRNA)
o non-coding RNA (ncRNA)
● other
o repeat patterns, operons, origin of replication, ...
Automatic annotation
Key bacterial features
● tRNA
o easy to find and annotate: anti-codon
● rRNA
o easy to find and annotate: 5s 16s 23s
● CDS
o straightforward to find candidates
 false positives are often small ORFs
 wrong start codon
o partial genes, remnants
o pseudogenes
o assigning function is the bulk of the workload
Automatic annotation
Two strategies for identifying coding genes:
● sequence alignment
o find known protein sequences in the contigs
 transfer the annotation across
o will miss proteins not in your database
o may miss partial proteins
● ab initio gene finding
o find candidate open reading frames
 build model of ribosome binding sites
 predict coding regions
o may choose the incorrect start codon
o may miss atypical genes, overpredict small genes
Some good existing tools
Software
ab
initio
align-
ment
Availability Speed
RAST yes yes web only 12-24 hours
xBASE yes no web only >8 hours
BG7 no yes standalone >10 hours
PGAAP
(NCBI)
yes yes email / we >1 month
Why another tool?
● Convenience
o I have sequence, just tell me what's in it, please.
● Speed
o exploit multi-core computers (aim < 15min)
● Standards compliant
o GFF3/GBK for viewing, TBL/FSA for Genbank sub.
● Rich consistent trustworthy output
o /product /gene /EC_number
● Provenance
o a record of where/how/why is was annotated so
Why "Prokka" ?
● Unique in Google
● I like the letter "k"
● Easy to type
● It sounds Aussie
● Loosely fits "Prokaryotic Annotation"
● It rhymes with "Quokka"
o Australian cat-sized nocturnal marsupial herbivore
o first Aussie mammal seen by Europeans - "giant rat"
Prokka pipeline (simplified)
tRNA
rRNA
ncRNA
CDS
FASTA
contigs
Infernal
RNAmmer
Prodigal SignalP
Aragorn
sig_peptide
protein domains
HMMER3
protein annotation
BLAST+
Rfam
Swiss Pfam TIGRUser
GFF3
GBK
ASN1
What can you trust?
Predicting protein function
Sequence similarity is a proxy for homology
● Sequence based (alignment)
o tools: BLAST, BLAT, FASTA, Exonerate
o databases: RefSeq, Uniprot, ...
● Model based ("fuzzy sequence" matching)
o PSSM: position specific scoring matrix
 tools: RPS-BLAST, Psi-BLAST
 databases: CDD, COG, Smart
o HMM: hidden Markov models
 tools: HMMER, HHblits
 databases: Pfam, TIGRfams
Sequence databases
I'll just BLAST against the non-redundant database.
-- Anonymous
● Which one?
o nucleotide (nt) or protein (nr)
● It's actually quite redundant
o only eliminates exact matching sequences
● It's not picky
o nearly anything is admitted, garbage in garbage out
● It's too big
o searching takes too long
Hierarchical searching
● Facts
o searching against smaller databases is faster
o searching against similar sequences is faster
● Idea
o start with small set of close proteins
o advance to larger sets of more distant proteins
● Prokka
o your own custom "trusted" set (optional)
o core bacterial proteome (default)
o genus specific proteome (optional)
o whole protein HMMs: PRK clusters, TIGRfams
o protein domain HMMs: Pfam
Core bacterial proteome
● Many bacterial proteins are conserved
o experimentally validated
o small number of them
o good annotations
● Prokka provides this database
o derived from UniProt-Swissprot
o only bacterial proteins
o only accept evidence level 1 (aa) or 2 (RNA)
o reject "Fragment" entries
o extract /gene /EC_number /product /db_xref
● First step gets ~50% of the genes
o BLAST+ blastp, multi-threading to use all CPUs
The remainder
● Prokka has genus specific databases
o aim to capture "genus specific" naming conventions
o derived from proteins in completed genomes
o proteins are clustered and majority annotation wins
o some annotations are rubbish though
● Custom model databases
o I took COG/PRK MSAs and made HMMs
● Existing model databases
o Pfam, TIGRfams are well curated
● And if all else fails
Provenance
Provenance
Recording where an annotation came from.
Prokka uses Genbank "evidence qualifier" tags:
Wet lab
/experiment="EXISTENCE:Northern blot"
Dry lab
/inference="similar to DNA sequence:INSD:AACN010222672.1"
/inference="profile:tRNAscan:2.1"
/inference="protein motif:InterPro:IPR001900"
/inference="ab initio prediction:Glimmer:3.0"
Example from Prokka
Feature Type:
tRNA
Location:
contig000341 @ 655..730 +
Attributes:
/gene="tRNA-Leu(UUR)"
/anticodon=(pos:678..680,aa:Leu)
/product="transfer RNA-Leu(UUR)"
/inference="profile:Aragorn:1.2"
Software quality
Software goals
● Follow basic conventions
o "prokka" should say something helpful
o "prokka -h" or "prokka --help" should show help
● All options should be optional
o "prokka contigs.fa" should do something useful
● Fail gracefully
o check your dependencies exist
o produce useful error messages
o generate a log file (provenance!)
● Use standard input and output file formats
o or at least tab-separated values if you insist...
Prokka in context
● Prokka is not
o particularly original
o technically or algorithmically significant
o foolproof to install some dependencies
o for everyone
BUT
● Prokka
o is an ongoing project which will only improve :-)
o checks it will run properly before wasting your time
o does what it claims, and does it quickly
o is being used widely
Conclusions
Prokka in the wild
● Pathogen Informatics @ Sanger UK
o Andrew Page
o 50,000 draft genomes in 2 weeks (24 sec each!)
● Austin Hospital @ Melbourne AU
o Ben Howden - Dept Infectious Diseases
o assembly & annotation of MiSeq clinical isolates
● VBC @ Monash AU
o assemble & annotate all of SRA
● Many more
o Public Health Agency of Canada
Planned features
● Modularity
o alternate sub-tools eg. Aragorn vs tRNAscan-SE
o every sub-system should be optional
o facilitate entry into Galaxy Toolshed
● Better support for
o metagenome assemblies, viruses and archaea
o broken genes, pseudogenes, assembly breakpoints
● Faster
o smaller core databases
o better parallelisation and less disk i/o
● Prokka-Web
o web server version currently in beta testing
Acknowledgements
● Organisers
o Conference committee - for inviting me
o Wellcome Trust - Laura Hubbard
● Original Prokka testers
o Simon Gladman & Dieter Bulach (internal)
o Tim Stinear & Scott Chandry (external)
● Funding
o VLSCI / LSCC
o Monash University
● Family
o Naomi, Oskar, Zoe - for tolerating my absences
Contact
Email torsten.seemann@monash.edu
Twitter @torstenseemann
Blog TheGenomeFactory.blogspot.com
Web www.bioinformatics.net.au
Thank you.

More Related Content

What's hot

Whole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisWhole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysis
drelamuruganvet
 
New generation sequencing equipments
New generation sequencing equipmentsNew generation sequencing equipments
New generation sequencing equipments
Kalaivani P
 
The European Nucleotide Archive
The European Nucleotide ArchiveThe European Nucleotide Archive
The European Nucleotide Archive
EBI
 

What's hot (20)

NEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCINGNEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCING
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 
Whole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisWhole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysis
 
Genome annotation
Genome annotationGenome annotation
Genome annotation
 
Illumina sequencing introduction
Illumina sequencing introductionIllumina sequencing introduction
Illumina sequencing introduction
 
2016. daisuke tsugama. next generation sequencing (ngs) for plant research
2016. daisuke tsugama. next generation sequencing (ngs) for plant research2016. daisuke tsugama. next generation sequencing (ngs) for plant research
2016. daisuke tsugama. next generation sequencing (ngs) for plant research
 
Tools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisTools for Transcriptome Data Analysis
Tools for Transcriptome Data Analysis
 
Genomics(functional genomics)
Genomics(functional genomics)Genomics(functional genomics)
Genomics(functional genomics)
 
Introduction to Next Generation Sequencing
Introduction to Next Generation SequencingIntroduction to Next Generation Sequencing
Introduction to Next Generation Sequencing
 
Introduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seqIntroduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seq
 
Motif andpatterndatabase
Motif andpatterndatabaseMotif andpatterndatabase
Motif andpatterndatabase
 
New generation sequencing equipments
New generation sequencing equipmentsNew generation sequencing equipments
New generation sequencing equipments
 
The European Nucleotide Archive
The European Nucleotide ArchiveThe European Nucleotide Archive
The European Nucleotide Archive
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Protein purification
Protein purificationProtein purification
Protein purification
 
Validation of homology modeling
Validation of homology modelingValidation of homology modeling
Validation of homology modeling
 
PCR Primer desining
PCR Primer desiningPCR Primer desining
PCR Primer desining
 
Illumina (sequencing by synthesis) method
Illumina (sequencing by synthesis) methodIllumina (sequencing by synthesis) method
Illumina (sequencing by synthesis) method
 
Introduction to second generation sequencing
Introduction to second generation sequencingIntroduction to second generation sequencing
Introduction to second generation sequencing
 
ChIP-seq
ChIP-seqChIP-seq
ChIP-seq
 

Viewers also liked

Dna sequencing powerpoint
Dna sequencing powerpointDna sequencing powerpoint
Dna sequencing powerpoint
14cummke
 
De novo assemble for NGS
De novo assemble for NGSDe novo assemble for NGS
De novo assemble for NGS
langkang
 
Annotating the cytochrome P450 CYP720B gene family in white spruce
Annotating the cytochrome P450 CYP720B gene family in white spruceAnnotating the cytochrome P450 CYP720B gene family in white spruce
Annotating the cytochrome P450 CYP720B gene family in white spruce
Shaun Jackman
 

Viewers also liked (20)

Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014
Rapid outbreak characterisation  - UK Genome Sciences 2014 - wed 3 sep 2014Rapid outbreak characterisation  - UK Genome Sciences 2014 - wed 3 sep 2014
Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014
 
Dna sequencing
Dna    sequencingDna    sequencing
Dna sequencing
 
DNA Sequencing
DNA SequencingDNA Sequencing
DNA Sequencing
 
Dna sequencing powerpoint
Dna sequencing powerpointDna sequencing powerpoint
Dna sequencing powerpoint
 
De novo assemble for NGS
De novo assemble for NGSDe novo assemble for NGS
De novo assemble for NGS
 
Earthquake Fuse
Earthquake FuseEarthquake Fuse
Earthquake Fuse
 
Genome Assembly Forensics
Genome Assembly ForensicsGenome Assembly Forensics
Genome Assembly Forensics
 
Genome assembly: then and now — v1.1
Genome assembly: then and now — v1.1Genome assembly: then and now — v1.1
Genome assembly: then and now — v1.1
 
The use of Fuse Connectors in Cold-Formed Steel Drive-In Racks, Thesis Presen...
The use of Fuse Connectors in Cold-Formed Steel Drive-In Racks, Thesis Presen...The use of Fuse Connectors in Cold-Formed Steel Drive-In Racks, Thesis Presen...
The use of Fuse Connectors in Cold-Formed Steel Drive-In Racks, Thesis Presen...
 
Improving and validating the Atlantic Cod genome assembly using PacBio
Improving and validating the Atlantic Cod genome assembly using PacBioImproving and validating the Atlantic Cod genome assembly using PacBio
Improving and validating the Atlantic Cod genome assembly using PacBio
 
20140711 3 t_clark_ercc2.0_workshop
20140711 3 t_clark_ercc2.0_workshop20140711 3 t_clark_ercc2.0_workshop
20140711 3 t_clark_ercc2.0_workshop
 
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
 
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
 
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic SequencesThe NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
 
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
 
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
 
2015 12-09 nmdd
2015 12-09 nmdd2015 12-09 nmdd
2015 12-09 nmdd
 
GenomeTrakr: Whole-Genome Sequencing for Food Safety and A New Way Forward in...
GenomeTrakr: Whole-Genome Sequencing for Food Safety and A New Way Forward in...GenomeTrakr: Whole-Genome Sequencing for Food Safety and A New Way Forward in...
GenomeTrakr: Whole-Genome Sequencing for Food Safety and A New Way Forward in...
 
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
 
Annotating the cytochrome P450 CYP720B gene family in white spruce
Annotating the cytochrome P450 CYP720B gene family in white spruceAnnotating the cytochrome P450 CYP720B gene family in white spruce
Annotating the cytochrome P450 CYP720B gene family in white spruce
 

Similar to Prokka - rapid bacterial genome annotation - ABPHM 2013

Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
Nikolay Vyahhi
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshop
c.titus.brown
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical Research
David Ruau
 

Similar to Prokka - rapid bacterial genome annotation - ABPHM 2013 (20)

RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2
 
Snapgene
SnapgeneSnapgene
Snapgene
 
Introducing data analysis: reads to results
Introducing data analysis: reads to resultsIntroducing data analysis: reads to results
Introducing data analysis: reads to results
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
 
Thesis biobix
Thesis biobixThesis biobix
Thesis biobix
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshop
 
Open pacbiomodelorgpaper j_landolin_20150121
Open pacbiomodelorgpaper j_landolin_20150121Open pacbiomodelorgpaper j_landolin_20150121
Open pacbiomodelorgpaper j_landolin_20150121
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
Making Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and AnnotationsMaking Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and Annotations
 
Part 2 of RNA-seq for DE analysis: Investigating raw data
Part 2 of RNA-seq for DE analysis: Investigating raw dataPart 2 of RNA-seq for DE analysis: Investigating raw data
Part 2 of RNA-seq for DE analysis: Investigating raw data
 
Reproducible research - to infinity
Reproducible research - to infinityReproducible research - to infinity
Reproducible research - to infinity
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical Research
 
IPK - Reproducible research - To infinity
IPK - Reproducible research - To infinityIPK - Reproducible research - To infinity
IPK - Reproducible research - To infinity
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research Community
 
Toolbox for bacterial population analysis using NGS
Toolbox for bacterial population analysis using NGSToolbox for bacterial population analysis using NGS
Toolbox for bacterial population analysis using NGS
 
Apollo Collaborative genome annotation editing
Apollo Collaborative genome annotation editing Apollo Collaborative genome annotation editing
Apollo Collaborative genome annotation editing
 
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisSo you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
 
New Technologies at the Center for Bioinformatics & Functional Genomics at Mi...
New Technologies at the Center for Bioinformatics & Functional Genomics at Mi...New Technologies at the Center for Bioinformatics & Functional Genomics at Mi...
New Technologies at the Center for Bioinformatics & Functional Genomics at Mi...
 
Introduction to Apollo for i5k
Introduction to Apollo for i5kIntroduction to Apollo for i5k
Introduction to Apollo for i5k
 
GLBIO/CCBC Metagenomics Workshop
GLBIO/CCBC Metagenomics WorkshopGLBIO/CCBC Metagenomics Workshop
GLBIO/CCBC Metagenomics Workshop
 

More from Torsten Seemann

De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015
Torsten Seemann
 

More from Torsten Seemann (15)

How to write bioinformatics software no one will use
How to write bioinformatics software no one will useHow to write bioinformatics software no one will use
How to write bioinformatics software no one will use
 
How to write bioinformatics software people will use and cite - t.seemann - ...
How to write bioinformatics software people will use and cite -  t.seemann - ...How to write bioinformatics software people will use and cite -  t.seemann - ...
How to write bioinformatics software people will use and cite - t.seemann - ...
 
Snippy - T.Seemann - Poster - Genome Informatics 2016
Snippy - T.Seemann - Poster - Genome Informatics 2016Snippy - T.Seemann - Poster - Genome Informatics 2016
Snippy - T.Seemann - Poster - Genome Informatics 2016
 
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
 
What can we do with microbial WGS data? - t.seemann - mc gill summer 2016 - ...
What can we do with microbial WGS data?  - t.seemann - mc gill summer 2016 - ...What can we do with microbial WGS data?  - t.seemann - mc gill summer 2016 - ...
What can we do with microbial WGS data? - t.seemann - mc gill summer 2016 - ...
 
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
 
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...
Sequencing your poo with a usb stick -  Linux.conf.au 2016 miniconf  - mon 1 ...Sequencing your poo with a usb stick -  Linux.conf.au 2016 miniconf  - mon 1 ...
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...
 
De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015
 
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
 
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
 
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
 
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
 
Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015
 
Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012
 
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
 

Recently uploaded

Nanoparticles for the Treatment of Alzheimer’s Disease_102718.pptx
Nanoparticles for the Treatment of Alzheimer’s Disease_102718.pptxNanoparticles for the Treatment of Alzheimer’s Disease_102718.pptx
Nanoparticles for the Treatment of Alzheimer’s Disease_102718.pptx
ssusera4ec7b
 
Warming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptxWarming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptx
GlendelCaroz
 

Recently uploaded (20)

EU START PROJECT. START-Newsletter_Issue_4.pdf
EU START PROJECT. START-Newsletter_Issue_4.pdfEU START PROJECT. START-Newsletter_Issue_4.pdf
EU START PROJECT. START-Newsletter_Issue_4.pdf
 
Harry Coumnas Thinks That Human Teleportation is Possible in Quantum Mechanic...
Harry Coumnas Thinks That Human Teleportation is Possible in Quantum Mechanic...Harry Coumnas Thinks That Human Teleportation is Possible in Quantum Mechanic...
Harry Coumnas Thinks That Human Teleportation is Possible in Quantum Mechanic...
 
Taphonomy and Quality of the Fossil Record
Taphonomy and Quality of the  Fossil RecordTaphonomy and Quality of the  Fossil Record
Taphonomy and Quality of the Fossil Record
 
Factor Causing low production and physiology of mamary Gland
Factor Causing low production and physiology of mamary GlandFactor Causing low production and physiology of mamary Gland
Factor Causing low production and physiology of mamary Gland
 
Classification of Kerogen, Perspective on palynofacies in depositional envi...
Classification of Kerogen,  Perspective on palynofacies in depositional  envi...Classification of Kerogen,  Perspective on palynofacies in depositional  envi...
Classification of Kerogen, Perspective on palynofacies in depositional envi...
 
Nanoparticles for the Treatment of Alzheimer’s Disease_102718.pptx
Nanoparticles for the Treatment of Alzheimer’s Disease_102718.pptxNanoparticles for the Treatment of Alzheimer’s Disease_102718.pptx
Nanoparticles for the Treatment of Alzheimer’s Disease_102718.pptx
 
Information science research with large language models: between science and ...
Information science research with large language models: between science and ...Information science research with large language models: between science and ...
Information science research with large language models: between science and ...
 
Technical english Technical english.pptx
Technical english Technical english.pptxTechnical english Technical english.pptx
Technical english Technical english.pptx
 
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
 
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY // USES OF ANTIOBIOTICS TYPES OF ANTIB...
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY  // USES OF ANTIOBIOTICS TYPES OF ANTIB...ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY  // USES OF ANTIOBIOTICS TYPES OF ANTIB...
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY // USES OF ANTIOBIOTICS TYPES OF ANTIB...
 
PARENTAL CARE IN FISHES.pptx for 5th sem
PARENTAL CARE IN FISHES.pptx for 5th semPARENTAL CARE IN FISHES.pptx for 5th sem
PARENTAL CARE IN FISHES.pptx for 5th sem
 
RACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptxRACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptx
 
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdfFORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
 
Vital Signs of Animals Presentation By Aftab Ahmed Rahimoon
Vital Signs of Animals Presentation By Aftab Ahmed RahimoonVital Signs of Animals Presentation By Aftab Ahmed Rahimoon
Vital Signs of Animals Presentation By Aftab Ahmed Rahimoon
 
NUMERICAL Proof Of TIme Electron Theory.
NUMERICAL Proof Of TIme Electron Theory.NUMERICAL Proof Of TIme Electron Theory.
NUMERICAL Proof Of TIme Electron Theory.
 
GBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of AsepsisGBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of Asepsis
 
Warming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptxWarming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptx
 
Vital Signs of Animals Presentation By Aftab Ahmed Rahimoon
Vital Signs of Animals Presentation By Aftab Ahmed RahimoonVital Signs of Animals Presentation By Aftab Ahmed Rahimoon
Vital Signs of Animals Presentation By Aftab Ahmed Rahimoon
 
ANITINUTRITION FACTOR GYLCOSIDES SAPONINS CYANODENS
ANITINUTRITION FACTOR GYLCOSIDES SAPONINS CYANODENSANITINUTRITION FACTOR GYLCOSIDES SAPONINS CYANODENS
ANITINUTRITION FACTOR GYLCOSIDES SAPONINS CYANODENS
 
Film Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfFilm Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdf
 

Prokka - rapid bacterial genome annotation - ABPHM 2013

  • 1. Rapid automatic microbial genome annotation using Prokka Dr Torsten Seemann Applied Bioinformatics and Public Health Microbiology - Wed 15 May 2013 @ 1530h - Moller Centre, Cambridge, UK
  • 3. We come from a land down-under
  • 4. The team ● Simon Gladman o VelvetOptimiser author, presenting Galaxy poster ● Paul Harrison o author of Nesoni toolkit ● David Powell o author of VAGUE, software wizard, theoretician ● Dieter Bulach o sequence magician, closes genomes at will ... and we are recruiting.
  • 5. History 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 ? ? ? ? ? ? ? ?
  • 7. De novo assembly Align reads to a reference Process
  • 8. De novo assembly Ideally, one sequence per replicon. Millions of short sequences (reads) A few long sequences (contigs) Reconstruct the original genome sequence from the sequence reads only
  • 9. Annotation Adding biological information to sequences. ACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGA AAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTC CCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTG GCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGG ACAGAATGCCCTGCAGGAACTTCTTCTAGAAGACCTTCTCCTCCTG CAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGA CCTGAAACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCT CTCCGTCCGTCCGTGGGCCACGGCCACCGCTTTTTTTTTTGCC delta toxin PubMed: 15353161 ribosome binding site transfer RNA Leu-(UUR) tandem repeat CCGT x 3 homopolymer 10 x T
  • 10. What's in an annotation? ● Location o which sequence? chromosome 2 o where on the sequence? 100..659 o what strand? -ve ● Feature type o what is it? protein coding gene ● Attributes o protein product? alcohol dehydrogenase o enzyme code? EC:1.1.1.1 o subcellular location? cytoplasm
  • 11. Bacterial feature types ● protein coding genes o promoter (-10, -35) o ribosome binding site (RBS) o coding sequence (CDS)  signal peptide, protein domains, structure o terminator ● non coding genes o transfer RNA (tRNA) o ribosomal RNA (rRNA) o non-coding RNA (ncRNA) ● other o repeat patterns, operons, origin of replication, ...
  • 13. Key bacterial features ● tRNA o easy to find and annotate: anti-codon ● rRNA o easy to find and annotate: 5s 16s 23s ● CDS o straightforward to find candidates  false positives are often small ORFs  wrong start codon o partial genes, remnants o pseudogenes o assigning function is the bulk of the workload
  • 14. Automatic annotation Two strategies for identifying coding genes: ● sequence alignment o find known protein sequences in the contigs  transfer the annotation across o will miss proteins not in your database o may miss partial proteins ● ab initio gene finding o find candidate open reading frames  build model of ribosome binding sites  predict coding regions o may choose the incorrect start codon o may miss atypical genes, overpredict small genes
  • 15. Some good existing tools Software ab initio align- ment Availability Speed RAST yes yes web only 12-24 hours xBASE yes no web only >8 hours BG7 no yes standalone >10 hours PGAAP (NCBI) yes yes email / we >1 month
  • 16. Why another tool? ● Convenience o I have sequence, just tell me what's in it, please. ● Speed o exploit multi-core computers (aim < 15min) ● Standards compliant o GFF3/GBK for viewing, TBL/FSA for Genbank sub. ● Rich consistent trustworthy output o /product /gene /EC_number ● Provenance o a record of where/how/why is was annotated so
  • 17. Why "Prokka" ? ● Unique in Google ● I like the letter "k" ● Easy to type ● It sounds Aussie ● Loosely fits "Prokaryotic Annotation" ● It rhymes with "Quokka" o Australian cat-sized nocturnal marsupial herbivore o first Aussie mammal seen by Europeans - "giant rat"
  • 18. Prokka pipeline (simplified) tRNA rRNA ncRNA CDS FASTA contigs Infernal RNAmmer Prodigal SignalP Aragorn sig_peptide protein domains HMMER3 protein annotation BLAST+ Rfam Swiss Pfam TIGRUser GFF3 GBK ASN1
  • 19. What can you trust?
  • 20. Predicting protein function Sequence similarity is a proxy for homology ● Sequence based (alignment) o tools: BLAST, BLAT, FASTA, Exonerate o databases: RefSeq, Uniprot, ... ● Model based ("fuzzy sequence" matching) o PSSM: position specific scoring matrix  tools: RPS-BLAST, Psi-BLAST  databases: CDD, COG, Smart o HMM: hidden Markov models  tools: HMMER, HHblits  databases: Pfam, TIGRfams
  • 21. Sequence databases I'll just BLAST against the non-redundant database. -- Anonymous ● Which one? o nucleotide (nt) or protein (nr) ● It's actually quite redundant o only eliminates exact matching sequences ● It's not picky o nearly anything is admitted, garbage in garbage out ● It's too big o searching takes too long
  • 22. Hierarchical searching ● Facts o searching against smaller databases is faster o searching against similar sequences is faster ● Idea o start with small set of close proteins o advance to larger sets of more distant proteins ● Prokka o your own custom "trusted" set (optional) o core bacterial proteome (default) o genus specific proteome (optional) o whole protein HMMs: PRK clusters, TIGRfams o protein domain HMMs: Pfam
  • 23. Core bacterial proteome ● Many bacterial proteins are conserved o experimentally validated o small number of them o good annotations ● Prokka provides this database o derived from UniProt-Swissprot o only bacterial proteins o only accept evidence level 1 (aa) or 2 (RNA) o reject "Fragment" entries o extract /gene /EC_number /product /db_xref ● First step gets ~50% of the genes o BLAST+ blastp, multi-threading to use all CPUs
  • 24. The remainder ● Prokka has genus specific databases o aim to capture "genus specific" naming conventions o derived from proteins in completed genomes o proteins are clustered and majority annotation wins o some annotations are rubbish though ● Custom model databases o I took COG/PRK MSAs and made HMMs ● Existing model databases o Pfam, TIGRfams are well curated ● And if all else fails
  • 26. Provenance Recording where an annotation came from. Prokka uses Genbank "evidence qualifier" tags: Wet lab /experiment="EXISTENCE:Northern blot" Dry lab /inference="similar to DNA sequence:INSD:AACN010222672.1" /inference="profile:tRNAscan:2.1" /inference="protein motif:InterPro:IPR001900" /inference="ab initio prediction:Glimmer:3.0"
  • 27. Example from Prokka Feature Type: tRNA Location: contig000341 @ 655..730 + Attributes: /gene="tRNA-Leu(UUR)" /anticodon=(pos:678..680,aa:Leu) /product="transfer RNA-Leu(UUR)" /inference="profile:Aragorn:1.2"
  • 29. Software goals ● Follow basic conventions o "prokka" should say something helpful o "prokka -h" or "prokka --help" should show help ● All options should be optional o "prokka contigs.fa" should do something useful ● Fail gracefully o check your dependencies exist o produce useful error messages o generate a log file (provenance!) ● Use standard input and output file formats o or at least tab-separated values if you insist...
  • 30. Prokka in context ● Prokka is not o particularly original o technically or algorithmically significant o foolproof to install some dependencies o for everyone BUT ● Prokka o is an ongoing project which will only improve :-) o checks it will run properly before wasting your time o does what it claims, and does it quickly o is being used widely
  • 32. Prokka in the wild ● Pathogen Informatics @ Sanger UK o Andrew Page o 50,000 draft genomes in 2 weeks (24 sec each!) ● Austin Hospital @ Melbourne AU o Ben Howden - Dept Infectious Diseases o assembly & annotation of MiSeq clinical isolates ● VBC @ Monash AU o assemble & annotate all of SRA ● Many more o Public Health Agency of Canada
  • 33. Planned features ● Modularity o alternate sub-tools eg. Aragorn vs tRNAscan-SE o every sub-system should be optional o facilitate entry into Galaxy Toolshed ● Better support for o metagenome assemblies, viruses and archaea o broken genes, pseudogenes, assembly breakpoints ● Faster o smaller core databases o better parallelisation and less disk i/o ● Prokka-Web o web server version currently in beta testing
  • 34. Acknowledgements ● Organisers o Conference committee - for inviting me o Wellcome Trust - Laura Hubbard ● Original Prokka testers o Simon Gladman & Dieter Bulach (internal) o Tim Stinear & Scott Chandry (external) ● Funding o VLSCI / LSCC o Monash University ● Family o Naomi, Oskar, Zoe - for tolerating my absences
  • 35. Contact Email torsten.seemann@monash.edu Twitter @torstenseemann Blog TheGenomeFactory.blogspot.com Web www.bioinformatics.net.au