Automated Prokaryotic Annotation at JCVI


Conference: Annual BRC Meeting (BRC6), Oct 28-29, 2008 in Ft. Lauderdale, Florida.
Presenter: Dan Haft


  1. Automated Prokaryotic Annotation at the JCVI. Daniel Haft, 2008
  2. A Dual-Use Pipeline
     • Multiple types of stored evidence
     • Persistent and flexibly interleaved
     • Supports selective re-annotation
     • Features annotation-driving databases: CHAR, TIGRFAMs, Genome Properties, BrainGrab rules
     • Evidence used by machine and by experts
     • MANATEE interface for annotators
     • Capture new rules with BrainGrab
  3. Computable objects: output from one program becomes input to another.
     • HMM results drive Genome Properties
     • Genome Properties guide GO process assignments
     • GO process terms
  4. Identification of Genome Features
     • Figure labels: Genome Sequence, ORFs, IMM built
     • Glimmer builds a statistical model from the training set
     • Other genome features: rRNA, tRNA, Rfam; IS elements; phage regions; repeats
  5. • Gene Finding: Glimmer & friends, homology methods
     • Homology Searches (gathering evidence): BLAST-Extend-Repraze, Hidden Markov Models, misc.
     • Structural Curation (ORF Management): Auto_Gene_Curate (start sites, overlaps), InterEvidence
     • Functional Assignments: Auto_Annotate, Manual, Mapped
     • Data Availability
  6. Homology Searches
     • HMM searches: TIGRFAMs & Pfam
     • BLAST searches: against internal NIAA
     • PROSITE motifs
     • InterPro
     • TmHMM
     • SignalP
     • Lipoprotein
     • Psort
     • Generate Paralogous Families
     • Custom database searches (TransportDB, Rules)
  7. Gene Model Curation
     • Overlaps resolved by evidence competition
     • Start site curation
     • Missed genes / unsupported gene calls
  8. Evidence can Overhang the Gene: Blast-Extend-Repraze (BER)
     The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence. The blue line indicates the predicted protein coding sequence, the green lines indicate the up- and downstream extensions, and the red line is the match protein.
     Figure labels: end5, end3, ORFxxxxx, 300 bp extension on each side, search protein, match protein.
     • Normal full-length match
     • "!": similarity extends upstream through a start, or downstream past a frameshift
     • "*": similarity extends in the same frame through a stop codon
  9. Pfam vs. TIGRFAMs
     Pfam: Explanations
     • Names for homology domains in proteins
     • Granularity tuned for twilight-level sequence similarity detection
     • Explains things to annotator
     TIGRFAMs: RULES
     • Functional assignments to proteins
     • Granularity tuned for single-hit equivalogs (mono-functional!)
     • Generates computable objects --> pathway reconstructions
  10. TIGRFAMs equivalogs vs. Pfam domains (figure labels: TIGRxxxxx, PFxxxxx; X, Y, Z)
  11. TIGRFAMs as annotation rules
      • EC number: computable!
      • GO term: computable!
      • Protein name: computable?
      • HMM hit: computable!!
  12. Isology (homology) types: ranking our rules
      • EXCEPTION: additional info, e.g. “vegetative”
      • EQUIVALOG: the SAME (in enough ways) to receive the same name across multiple genomes, reflecting one specific function
      • SUBFAMILY: can name a whole class
      • DOMAIN: class name for a protein region
      (and apply these classifications also to Pfam)
  13. CHAR: Experimentally Characterized Protein Database
      • Highly curated database of experimentally characterized proteins; connects protein accessions, known function, and the scientific literature.
      • What it includes:
        – A controlled vocabulary describing the type of experimentation performed in each publication
        – Key annotation fields extracted from each record: protein name, gene symbol, Enzyme Commission (EC) number, taxonomic data, Gene Ontology (GO) terms
        – Synonymous protein accessions obtained from public databases (GenBank and UniProt)
  14. Annotation Proceeds from …
      Inside --> out (e.g. AutoAnnotate): for every protein
      • Collect evidence
      • Best-guess annotation
      Outside --> in (e.g. TIGRFAMs): for every model
      • Search tool + cutoff + standards = annotation rule
      • Achieves partial coverage
      Hybrid (BrainGrab): for every unfinished protein
      • Look for means to annotate: blastp, synteny, hole-filling, etc.
      • Capture annotator logic as a new rule
      • Add to library of rules/models for all future genomes
  15. Diagram labels: RULES; IMPORT; validate; NEW; Subject Genome; share; BrainGrab; Proper Realm of Annotator Attention; Trusted; Complete; Automatic; genome (×2)
  16. A Teachable Moment
      EcHS_A1984 is manually annotated confidently because it is similar enough to:
      • SP|P07363|CHEA_ECOLI
      • Chemotaxis protein cheA
      • EC
      (method: defines “similar enough”) BLASTP_MATCH [SP|P07363, 1600, 95, 92, 60, 1]
      Must be the only protein in the genome that scores >= 1600 by blastp, covering >= 95 % of the length of the characterized protein and >= 92 % of the target protein, with >= 60 % sequence identity.
  17. A sample of expert opinion: “For This Particular Protein Family”
      • I (D.H.H.) assert that any > 75 %-identical, full-length match is the same protein.
      • Ditto any > 65 % match, as long as the region is clearly syntenic.
      • Ditto any single-copy > 50 % match, as long as it fills this hole in this otherwise mostly complete pathway.
  18. B: “Bag of Genes”; G: Genome Properties; E: Evidence to drive other programs
      Image from Gödel, Escher, Bach: an Eternal Golden Braid by Douglas Hofstadter, 1979
  19. Genome Properties: annotation at the level of systems
      States: YES; NO; some evidence; not supported
      • Pathway (glyoxylate shunt)
      • System (type III secretion)
      • Structure (outer membrane)
      • Genometrics (GC content)
      • Phenotype (motility, pathogenesis)
  20. Some Novel Genome Properties
      • 12 subtypes of CRISPR/Cas system
      • PEP-CTERM / exo-sortase: biofilm-associated protein sorting
      • Type VI secretion (53 loci in B. mallei 23344)
      • Post-translational selenium-modified enzymes
      • Heterocycle-containing bacterial toxin production: BA_2677 = “heterocyclo-anthracin”
  21. A family of variable putative toxins with patterns of CGG insertions.
  22. Future Annotation Pipeline Enhancements
      • Populate the Characterized Protein Database
      • Develop META-RULES from CHAR
      • BrainGrab for novel content
      • Import additional computable evidence
      • Improve exchanges of validation sets
      • Build a protein names ontology
  23. Acknowledgements: Ramana Madupu, Jeremy Selengut, Alex Richter, JCVI microbial annotation team
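
The BLASTP_MATCH rule quoted on the "A Teachable Moment" slide can be read as a small, machine-checkable predicate. The sketch below is one possible reading and is not taken from the JCVI pipeline. The class names, the field names, and the `apply_rule` function are hypothetical. The trailing `1` in the rule is assumed to be the flag for "must be the only protein in the genome" reaching the score cutoff. The hit values in the usage example are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BlastpMatchRule:
    """Hypothetical container for the fields of a BLASTP_MATCH rule."""
    reference: str              # accession of the characterized protein
    min_score: float            # minimum blastp score
    min_ref_coverage: float     # % of the characterized protein covered
    min_target_coverage: float  # % of the target protein covered
    min_identity: float         # % sequence identity
    require_unique: bool        # assumed meaning of the trailing "1"


@dataclass(frozen=True)
class BlastpHit:
    """One blastp hit of a genome protein (target) against a reference."""
    target: str
    reference: str
    score: float
    ref_coverage: float
    target_coverage: float
    identity: float


def apply_rule(rule: BlastpMatchRule, hits: Iterable[BlastpHit]) -> List[str]:
    """Return the target proteins that may inherit the reference annotation."""
    # Keep only hits against the rule's reference that reach the score cutoff.
    scoring = [
        h for h in hits
        if h.reference == rule.reference and h.score >= rule.min_score
    ]
    # Assumption: uniqueness is judged on the score cutoff alone.
    if rule.require_unique and len(scoring) > 1:
        return []
    # Apply the coverage and identity thresholds to what remains.
    return [
        h.target for h in scoring
        if h.ref_coverage >= rule.min_ref_coverage
        and h.target_coverage >= rule.min_target_coverage
        and h.identity >= rule.min_identity
    ]


# Usage with the thresholds from the slide and an invented hit.
chea_rule = BlastpMatchRule("SP|P07363", 1600, 95, 92, 60, True)
example_hit = BlastpHit("EcHS_A1984", "SP|P07363", 1700, 99, 98, 97)
print(apply_rule(chea_rule, [example_hit]))
```

Returning a list rather than a single identifier keeps the function usable for rules that do not demand uniqueness; with the uniqueness flag set, the list has at most one entry.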