• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
 

Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview

on

  • 2,142 views

Conference: Sept 24 - 26, 2008 at the JCVI Rockville, MD Campus

Conference: Sept 24 - 26, 2008 at the JCVI Rockville, MD Campus
Presenters: Ramana Madupu, Lauren Brinkac, Derek Harkins

Statistics

Views

Total Views
2,142
Views on SlideShare
2,141
Embed Views
1

Actions

Likes
0
Downloads
20
Comments
0

1 Embed 1

http://www.slideshare.net 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview Presentation Transcript

    • Prokaryotic Annotation Overview Ramana Madupu 2008
    • Outline
      • Overview of annotation methodology
        • Gene finding
        • Gathering evidence
        • Gene model curation
        • Functional assignments
    • What is Annotation?
      • Webster’s definition of “to annotate”:
        • “ to make or furnish critical or explanatory notes or comment”
      • some of what this includes for genomics
        • gene product names/symbols
        • functional characteristics of gene products
        • physical characteristics of gene products
        • overall metabolic profile of the organism
      • elements of the annotation process
        • gene finding
        • homology searches
        • functional assignment
        • gene model curation
        • providing data to the community
      • manual vs. automatic
        • slow vs. fast
        • expensive vs. cheap
        • higher accuracy vs. lower accuracy
        • decision requires a cost-benefit analysis to determine what is required to achieve the goals of the individual project
    • Gene Finding
    • ORFs vs. Protein Coding Genes
      • ORF = open reading frame
        • defined by the absence of a translational “stop” codon
          • 3 “stops” in bacteria: TAA, TAG, TGA
          • an ORF goes from “stop” to “stop”
        • ORFs can easily occur by chance in any given DNA sequence
        • Since “stop” codons are AT rich:
          • GC rich DNA has less random stops and therefore, on average, more/longer ORFs
          • AT rich DNA has more random stops and therefore, on average, fewer/shorter ORFs
      • Protein Coding Gene
        • coding sequence is contained within an ORF
        • requires translational “start” codon
          • 3 “starts” in bacteria: ATG, GTG, TTG
          • coding sequence goes from “start” to “stop”
          • in JCVI jargon we tend to equate gene and coding sequence, even though we know that a gene consists of more than coding sequence.
          • you may hear JCVI people use gene and protein interchangeably
      • Telling the difference between random ORFs and genes is the goal in the gene finding process.
    • Each DNA sequence has 6 possible translations, one for each frame (Very) Short sample sequence TAGATGATTAGCTTGGATGAGCTCATATAGCCCGTAAGA ATCTACTAATCGAACCTACTCGAGTATATCGGGCATTGT Frame and corresponding codons +1 TAG ATG ATT AGC TTG GAT GAG CTC ATA TAG CCC GTA +2 AGA TGA TTA GCT TGG ATG AGC TCA TAT AGC CCG TAA +3 GAT GAT TAG CTT GGA TGA GCT CAT ATA GCC CGT AAG -1 TGT TAC GGG CTA TAT GAG CTC ATC CAA GCT AAG CAT -2 GTT ACG GGC TAT ATG AGC TCA TCC AAG CTA AGC ATC -3 TTA CGG GCT ATA TGA GCT CAT CCA AGC TAA GCA TCT color key translational start translational stop Just one sequence can result in many different potential coding sequence situations: frame +1 and +2 have relatively long ORFs with start codons and therefore are potential genes. Frames +3 and -3 have short ORFs with no starts and therefore no genes. Frame -1 and -2 are all ORF since there are no stops at all, but only one has a start.
    • More on 6-Frame translations In order to visualize the genes within the context of their neighbors along a DNA sequence, the sequence is often represented as a 6-frame translation. There are 6 possible frames for translation in every sequence of DNA, 3 in the forward (+) direction on the DNA sequence, and 3 in the reverse (-) direction on the DNA sequence. These are represented as horizontal bars with vertical marks for stops (long) and starts (short) in the order shown below. +1 +2 +3 -1 -2 -3 These are some of the many ORFs in this graphic. stop start
    • Gene Finding with Glimmer
      • Glimmer is a tool which uses Interpolated Markov Models (IMMs) to predict which ORFs in a genome contain real genes.
      • Glimmer does this by comparing the nucleotide patterns it finds in a training set of “known real genes” to the nucleotide patterns of the ORFs in the whole genome. ORFs with patterns similar to the patterns in the training genes are considered real themselves.
      • Using Glimmer is a two-part process:
        • Train Glimmer with genes from the organism that was sequenced.
        • Run the trained Glimmer against the entire genome sequence.
    • Gene finding with Glimmer: Gathering the Training Set Glimmer built-in system: long ORFs that do not overlap other long ORFs IMM built Glimmer builds a model from the training set One can gather published sequences from the organism sequenced, however, this generally does not provide enough sequence for training - a general guideline is that ~250 kb of total sequence is needed (that is if you line up all of the training genes end to end they total 250 kb) BLAST all ORFs against your favorite protein database, retain only extremely strong matches
    • Gene Finding with Glimmer What happens during training. ATGCG T AAGGCTTTCACAGT ATGCG A GTAAGCTGCGTCGTAAGG A TGCGT A AGGCTTTCACAGTATGCGAGTAAGC TGCGT C GTAAGG AT GCGTA A GGCTTTCACAGTATGCGAGTAAGCTGCGTCGTAAGG ATG CGTAA G GCTTTCACAGTATGCGAGTAAGCTGCGTCGTAAGG ATGC GTAAG G CTTTCACAGTATGCGA GTAAG C TGCGTC GTAAG G Glimmer moves sequentially through each sequence in the training set, recording the nucleotide that occurs after each possible oligomer up to oligomers of length eight Glimmer then calculates the statistical probability of each pattern appearing in a real gene. These probabilities form the statistical model of what a real gene looks like in the given organism. This model is then run against the complete genome sequence. All ORFs above the chosen minimum length (90 bp at JCVI) are scored according to how closely they match the model of a real gene. Example for a 5-mer:
    • Candidate ORFs 6-frame translation map for a region of DNA: Each horizontal bar corresponds to one of the 6 translation frames (3 forward, 3 reverse) long vertical lines represent stops (TAA, TAG, TGA) short vertical lines represent starts (ATG, GTG, TTG) = ORFs meeting minimum length In the first panel the ORFs highlighted in blue meet the minimum length requirement of what can be considered a gene. (in our pipeline 90 bp) In the second panel, potential coding sequences within the candidate ORFs are represented by arrows, going from start to stop. A long ORF does not necessarily result in a long putative gene. Green ORFs scored well to the model, red ORFs scored less well. The green ORFs are chosen by Glimmer as the set of likely genes. ORFs in the area of lateral transfer, although real genes, often will not be chosen since their nucleotide patterns likely won’t match the model built from the patterns of the genome as a whole. Below genes are represented as arrows drawn above (forward orientation) or below (reverse orientation) a line representing the DNA molecule. Glimmer numbers the genes sequentially from the beginning of the DNA molecule on which they reside. Genes missed by Glimmer and overlapping genes are resolved by post-Glimmer processes, which will be discussed on later slides. ORF00001 ORF00002 ORF00003 ORF00004 = laterally transferred DNA ? +1 +2 +3 -1 -2 -3
    • Coordinates Genes are mapped to the underlying genome sequence via coordinates. Each gene is defined by two coordinates: end5 (the 5 prime end of the gene) and end3 (the 3 prime end of the gene). Nucleotide #1 for each molecule in the genome is the beginning of each final assembled molecule. Circular chromosomes are treated as linear for the purposes of annotation. One position is chosen to be nucleotide #1 and that becomes the “beginning” of the molecule. Some genomes have just one DNA molecule, some have several (multiple chromosomes or plasmids). 1 10,000 gene end5 end3 purple 12 527 red 802 675 blue 927 1543 green 9425 7894 pink 9575 9945 Note that forward genes have end5<end3, while reverse genes have end5>end3.
    • Gathering Evidence Determining How The New Proteins Function
    • Finding the function of a new protein
      • Experimental characterization
        • mutant phenotype
        • enzyme assay
        • difficult on a whole-genome scale
        • Literature curation
      • Homology searching
        • looks for similarity between sequences
        • comparing sequences of unknown function to those of known function
        • relies on the assumption that shared sequence implies shared function
    • Protein Homology
      • shared sequence implies shared function
        • binding sites
        • catalytic sites
        • entire proteins
      • but beware
        • there are occurrences of proteins where one amino acid substitution changes the function
        • all functional assignments based on sequence similarity should be considered putative until experimental confirmation
      • assessed using protein alignments
        • proteins are aligned one on top of the other so that the maximal number of amino acids “match”
        • identity in an alignment means that at a given position in the alignment the amino acids are identical
        • similarity in an alignment means that at a given position in the alignment the amino acids are similar in chemical structure to each other, and may therefore be able to carry out the same role in the protein.
      • pairwise alignments
        • two protein’s amino acid sequences aligned next to each other so that the maximum number of amino acids match
      • multiple alignments
        • 3 or more protein’s amino acid sequence aligned next to each other so that the maximum number of amino acids match in each column
        • more meaningful than pairwise alignments since it is much less likely that several proteins will share sequence similarity due to chance alone, than that 2 will share sequence similarity due to chance alone. Therefore, such shared similarity is more likely to be indicative of shared function.
      • protein families
        • clusters of proteins that all share sequence similarity
        • may be modeled by various statistical techniques
      • motifs
        • short regions of amino acid sequence shared by many proteins
          • transmembrane regions
          • active sites
          • domains
          • signal peptides
      Collecting Sequence Similarity Evidence
    • Protein Alignment Tools
      • Local pairwise alignment tools do not worry about matching proteins over their entire lengths, they look for any regions of similarity within the proteins that score well.
        • BLAST
          • fast
          • comes in many varieties (see NCBI site)
        • Smith-Waterman
          • finds best out of all possible local alignments
          • slow but sensitive
      • Global pairwise alignment tools take two sequences and attempt to find an alignment of the two over their full lengths.
        • Needleman-Wunsch
          • finds best out of all possible alignments
      • Multiple alignments tools try to align 3 or more proteins so that the maximal number of amino acids from each protein are matched in the alignment - this may or may not include the full length of some or all of the proteins
        • clustalW
        • muscle
    • Sample Alignments Pairwise Multiple -two rows of amino acids compared to each other, the top row is the search protein and the bottom row is the match protein, numbers indicate amino acid position in the sequence -solid lines between amino acids indicate identity -dashed lines (colons) between amino acids indicate similarity -next slide shows a full alignment Different shadings indicate amount of matching
    • Sample full-length protein alignment
    • Useful Protein Databases
      • Swiss-Prot
        • European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB)
        • all entries manually curated
        • annotation includes
          • links to references
          • coordinates of protein features
          • links to cross-referenced databases
            • HMMs
            • Enzyme Commission
      • TrEMBL
        • EBI and SIB
        • entries have not been manually curated
        • once they are accessions remain the same but move into Swiss-Prot
      • PIR (at Georgetown University)
      • UniProt
        • Swiss-Prot + TrEMBL + PIR
      • NCBI
        • National Center for Biotechnology Information
        • protein and DNA sequences
        • taxonomy resource
        • many other resources
      • Omnium
        • database that underlies JCVI’s CMR
        • contains data from all completed sequenced bacterial genomes
        • data is downloaded from the sequencing center
      • Enzyme Commission
        • not sequence based
        • catagorized collection of enzymatic reactions
        • reactions have accession numbers indicating the type of reaction
          • Ex. 1.2.1.5
      • KEGG, Metacyc, etc.
      Other Useful Databases
    • NIAA
      • Non-Identical Amino Acid
      • JCVI’s protein file used for searching
      • File composed of protein sequences from several source databases
        • Swiss-Prot
        • Omnium
        • NCBI
        • PIR
      • The file is made non-redundant
        • identical protein sequences that came into the file from different source databases are collapsed into one entry
        • all of the protein’s accession numbers from the various source databases where it is found are stored linked to the protein
        • users can always view the protein at the source database
    • Sample NIAA entry >biotin synthase, Escherichia coli MAHRPRWTLSQVTELFEKPLLDLLFEAQQVHRQHFDPRQVQVSTLLSIKTGACPEDCKYC PQSSRYKTGLEAERLMEVEQVLESARKAKAAGSTRFCMGAAWKNPHERDMPYLEQMVQGV KAMGLEACMTLGTLSESQAQRLANAGLDYYNHNLDTSPEFYGNIITTRTYQERLDTLEKV RDAGIKVCSGGIVGLGETVKDRAGLLLQLANLPTPPESVPINMLVKVKGTPLADNDDVDA FDFIRTIAVARIMMPTSYVRLSAGREQMNEQTQAMCFMAGANSIFYGCKLLTTPNPEEDK DLQLFRKLGLNPQQTAVLAGDNEQQQRLEQALMTPDTDEYYNAAAL Source databases where this protein is found: -Swiss-Prot, accession # SP:P12996 -Protein Information Resource, accession # PIR:JC2517 -NCBI’s GenBank, accession # GB:AAC73862.1 All of these are collapsed into one entry in NIAA that is linked to all three accessions.
    • BLAST-extend-repraze (BER) Our pairwise protein search tool
      • Initial BLAST search of new query proteins from the genome sequence
        • search is against NIAA
        • stores good hits in mini-database for each query protein from the genome
      • Query protein sequence is extended by 300 nucleotides on each end and translated (see later slide)
      • A modified Smith-Waterman alignment is generated between the query and each match sequence in the mini-database
        • extends local alignments as far as homology continues over lengths of extended proteins
        • produces a file of alignments between the query protein and the match protein for each match protein in the mini-database
        • as the alignment generating algorithm builds the alignment, if the level of similarity falls below the necessary threshold, the program looks in different frames and through stop codons for homology to continue - this similarity can continue into the upstream and/or downstream extensions
    • BLAST-extend-repraze (BER) niaa BLAST Extend 300 nucleotides on both ends (see later slide) modified Smith- Waterman Alignment genome.pep vs. Significant hits put into mini-dbs for each protein (non-identical amino acid) vs. extended protein mini-database from BLAST search , mini-db for protein #1 mini-db for protein #2 , mini-db for protein #3 ... mini-db for protein #3000
    • BER Alignment An alignment like this will be generated for every match protein in the mini-database. In the next slides we will look closely at the types of information displayed here.
    • BER Alignment detail: Boxed Header -The background color of this box will be gold if the protein is in the characterized table and grey if it is not. -The top bar lists the percent identity/similarity and the organism from which the protein comes (if available). -The bottom section lists all of the accession numbers and names for all the instances of the match protein from the source databases (used in building NIAA for the searches.) -The accession numbers are links to pages for the match protein in the source databases. -A particular entry in the list will have colored text (the color corresponding to its characterized status) if that is the accession that is entered into the characterized table - this tells the annotators which link they should follow to find experimental characterization information. Only one accession for the match protein need be in the characterized table for the header to turn gold. -There are links at the end of each line to enter the accession into the characterized table or to edit an already existing entry in the characterized table.
    • BER Alignment detail: alignment header -It is most important to look at the range over which the alignment stretches and the percent identity -The top line show the amino acid coordinates over which the match extends for our protein -The second line shows the amino acid coordinates over which the match extends for the match protein, along with the name and accession of the match protein -The last line indicates the number of amino acids in the alignment found in each forward frame for the sequence as defined by the coordinates of the gene. The primary frame is the one starting with nucleotide one of the gene. If all is well with the protein, all of the matching amino acids should be in frame 1. -If there is a frameshift in the alignment (see later slide) the phrase “Frame Shifts = #” will flash and indicate how many frameshifts there are. In addition, inside the brackets in the last line will be a count of how many amino acids of match are in each frame.
    • BER Alignment detail: alignment of amino acids -In these alignments the codons of the DNA sequence read down in columns with the corresponding amino acid underneath. -The numbers refer to amino acid position. Position 1 is the first amino acid of the protein. The first nucleotide of the codon coding for amino acid 1 is nucleotide 1 of the coding sequence. Negative amino acid numbers indicate positions upstream of the predicted start of the protein. -Vertical lines between amino acids of our protein and the match protein (bottom line) indicate exact matches, dotted lines (colons) indicate similar amino acids. -Start sites are color coded: ATG is green, GTG is blue, TTG is red/orange -Stop codons are represented as asterisks in the amino acid sequence. An open reading frame goes from an upstream stop codon to the stop at the end of the protein, while the gene starts at the chosen start codon.
    • BER Skim A list of best matches from niaa to the search protein with statistics on length of match and BLAST p-value. Colored backgrounds indicate presence in characterized table and corresponding status.
    • Extensions in BER The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence. Blue line indicates predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein. ORFxxxxx 300 bp 300 bp FS PM FS or PM ? two functionally unrelated genes from other species matching one query protein could indicate incorrectly fused ORFs end5 end3 search protein match protein similarity extending through a frameshift upstream or downstream into extensions similarity extending in the same frame through a stop codon ? normal full length match * ! !
    • Frameshifted alignment
    • H idden M arkov M odels - HMMs
      • HMMs are statistical models of the patterns of amino acids in a group of functionally related proteins found across species.
        • this group is called the “seed”
        • HMMs are built from multiple alignments of the seed members.
      • Proteins searched against an HMM receive a score indicating how well they match the model.
        • Proteins scoring well to the model can be expected to share the function that the HMM represents.
      • HMMs can be built at varying levels of functional relationship.
        • The most powerful level of relationship is one representing the exact same function.
        • It is important to know the kind of relationship an HMM models to be able to draw the correct conclusions from it
      • Annotation can be attached to HMMs
        • protein name
        • gene symbol
        • EC number
        • role information
    • Our HMM-Type Labels Domain: These HMMs describe a region of homology that is not required to be the full length of a protein. The function of the region may or may not be known. Superfamily : This type of HMM describes a group of proteins which have full length protein sequence similarity and have the same domain architecture, but which do not necessarily have the same function. Subfamily: This type of HMM describes a group of proteins which also have full length homology, which represent more specific sub-groupings with a superfamily. Equivalog: The supreme HMM, designed so that all members of the family being modeled and all proteins scoring above trusted share the exact same function. Equivalog_domain: Just like equivalog, only applies to a polypeptide that can be found as a single function protein or part of a multifunctional protein. (The above are just a sampling of the most commonly seen isology types arranged from most general to most specific, there are more types than those listed here.) Pfam: Indicates that no isology type has yet been assigned to the Pfam HMM. The starred HMMs provide the most specific information, look for matches to these first. They are very strong evidence for function.
    • Annotation attached to HMMs
      • TIGR00433
        • isology: equivalog
        • name: biotin synthase
        • EC: 2.8.1.6
        • gene symbol: bioB
        • TIGR role: 77 (Biotin biosynthesis)
        • GO terms: GO:0004076 (biotin synthase activity), GO:0009102 (biotin biosynthesis)
      • PF04055
        • isology: domain
        • name: radical SAM domain protein
        • EC: not applicable
        • gene symbol: not applicable
        • TIGR role: 703 (enzymes of unknown specificity)
        • GO terms: GO:0003824 (catalytic activity), GO:0008152 (metabolism)
    • Building HMMs Collect proteins to be in the “seed” Generate and Curate Multiple Alignment of Seed proteins Run HMM algorithm Choose “noise” and “trusted” cutoff scores based on what scores the “known” vs. “unknown” proteins receive. HMM is ready to go! Region of good alignment and closest similarity (same function/similar domain/ family membership) Computes statistical probabilities for amino acid patterns in the seed Search new model against all proteins this step may need a few iterations
    • Choosing cutoff scores 100 300 matches (seed members bold) score protein “definitely” 547 protein “absolutely” 501 protein “sure thing” 487 protein “confident” 398 protein “safe bet” 376 protein “very confident” 365 protein “has to be one” 355 protein “could be” 210 protein “maybe” 198 protein “not sure” 150 protein “no way” 74 protein “can’t be” 54 protein “not a chance” 47 =search the new HMM against NIAA =see the range of scores the match proteins receive =do analysis to determine where confirmed members score =do analysis to determine where confirmed non-members score =set the cut-offs accordingly =proteins that score above trusted can be considered part of the protein family modeled by the HMM =proteins that score below noise should not be considered part of the protein family modeled by the HMM =usefulness of an HMM is directly related to the care taken by the person building the HMM since some steps are subjective
    • HMM Searches HMM hits for protein #1 , , , etc. HMM hits for protein #3 HMM hits for protein #2 genome.pep vs. HMM database TIGRFAMS + Pfams TIGR01234 PF00012 TIGR00005 TIGR01004 NONE Each protein in the genome is searched against all HMMs in our db. Some will not have significant hits to any HMM, some will have significant hits to several HMMs. Multiple HMM hits can arise in many ways, for example: the same protein could hit an equivalog model, a superfamily model to which the equivalog function belongs, and a domain model representing the catalytic domain for the particular equivalog function. There is also overlap between TIGR and Pfam HMMs.
    • Evaluating HMM scores - if the protein’s total score is….. … above trusted: the protein is a member of family the HMM models … below noise: the protein is not a member of family the HMM models … in-between noise and trusted: the protein MAY be a member of the family the HMM models ...above trusted and some or all scores are negative: the protein is a member of the family the HMM models 0 100 -50 0 100 -50 0 100 -50 T N P 0 100 -50
    • HMM Output in Manatee
    • Genome Properties
      • Record and/or predict the presence/absence of:
        • metabolic pathways
          • for example, biotin biosynthesis
        • protein complexes
          • for example, ATP synthase
        • cellular structures
          • for example, outer membrane
        • traits
          • for example, optimal growth temperature, cell shape
      • Particular property has a given “state” in each organism, for example:
        • some are values (37 degrees C, rod)
        • some are presence/absence of pathway/structure
          • YES - the property is definitely present
          • NO - the property is definitely not present
          • Some evidence
      • The state of some properties can be determined computationally
        • the property is defined be several reaction steps or protein components which are modeled by HMMs
        • HMM matches to all steps/components indicate that the organism has the property
      • Other property’s states must be entered manually (growth temp, shape, etc.)
      • data for a particular genome viewable in Manatee
        • links from HMM section
        • links from gene list for role category
        • entire list of properties and states can be viewed
      • Searchable across genomes on the CMR site
        • covered in the CMR segment of the course
    • Genome Property: “Biotin Biosynthesis”
    • Genome Property: “Cell Shape”
    • Paralogous Families genome.pep genome.pep vs. Proteins from the same genome are grouped into families (minimum two members) according to sequence similarity. First, proteins are clustered according to HMM hits, second, other regions of the proteins, not found in HMM hits, are searched and clustered. Those families associated with HMMs get names containing the HMM accession (ex. fam_PF00528), those not associated with HMMs are given sequential numbered names as they are built (ex. fam_11). Reveals expansion/contraction of various families of proteins in one genome verses another. Helps in annotation consistency, frameshift detection, and start site editing.
    • Paralogous family output in Manatee
      • PROSITE Motifs
        • collection of protein motifs associated with active sites, binding sites, etc.
        • help in classifying genes into functional families when HMMs for that family have not been built
      • InterPro
        • Brings together HMMs (both TIGR and Pfam) Prosite motifs and other forms of motif/domain clustering (Prints, Smart)
        • Useful annotation information
        • GO terms have been assigned to many of these
      • TmHMM
        • an HMM that recognizes membrane spans
        • a product of the Center for Biological Sequence Analysis (CBS), Denmark
      • Signal P
        • potential secreted proteins
        • another CBS product
      • Lipoprotein
        • potential lipoproteins
        • this is actually a specific Prosite motif
      Other searches
    • non-coding RNAs
      • tRNAs are found using tRNAscan (Lowe/ Eddy, 1997)
      • structural RNAs are found using BLAST searches
      • OR
      • Rfam, a set of multiple sequence alignments and profile stochaswtic context-free grammars used to predict non-coding RNAs (Sanger, WashU) www.sanger.ac.uk/Software/Rfam/
    • Other Information
      • Molecular Weight/pI
      • DNA repeats
      • GC content
        • whole genome/individual genes
      • terminators
      • operons
    • Gene Model Curation
      • The process of insuring that the predicted genes have the correct coordinates and that the set of predicted genes is complete and correct (to the best of our ability.)
        • Start site curation
        • overlap resolution (false positives)
        • missed genes (false negatives)
    • Gene Model Curation: Start site edits What to consider: - Start site frequency: ATG >> GTG >> TTG - Ribosome Binding Site (RBS): a string of AG rich sequence located 5-11 bp upstream of the start codon - Similarity to match protein, in BER, Paralogous Family, and multiple alignments - the most important factor. (Remember to note, that the DNA sequence reads down in columns for each codon.) -In the example below (showing just the beginning of one BER alignment), homology starts exactly at the first atg (the current chosen start, aa #1), there is a very favorable RBS beginning 9bp upstream of this atg (gagggaga). There is no reason to consider the ttg, and no justification for moving to the second atg (this would cut off some similarity and it does not have an RBS.) RBS upstream of chosen start 3 possible start sites This ORF’s upstream boundary
    • Gene Model Curation: Overlaps and InterEvidence Overlap analysis When two ORFs overlap (boxed areas), the one without similarity to anything (another protein, an HMM, etc.) is removed. If both don’t match anything, other considerations such as presence in a putative operon and potential start codon quality are considered. This process has both automated (for the easy ones) and manual (for the hard ones) components. Small regions of overlap are allowed (circle). InterEvidence regions Areas of the genome with no genes and areas within genes without any kind of evidence (no match to another protein, HMM, etc., such regions may include an entire gene in case of “hypothetical proteins”) are translated in all 6 frames and searched against niaa. Results are evaluated by the annotation team.
    • Functional Assignments Making the annotations: Assigning names and roles to the proteins
    • Functional Assignments: What we want to accomplish. Name and associated info Descriptive common name for the protein, with as much specificity as the evidence supports; gene symbol. EC number if the protein is an enzyme Role TIGR role Gene Ontology terms from all 3 ontologies (more on this soon) to describe what the protein is doing in the cell and why. Supporting evidence HMMs, Prosite, InterPro Characterized match from BER search Paralogous family membership.
    • Functional Assignments: What we want to avoid!
        • Transitive Annotation is the process of passing annotation attached to one gene to another based on sequence similarity:
        • A B, so B gets A’s name
        • B C, so C gets B’s name
        • C D, so D gets C’s name
        • A’s name has passed to D through several intermediates, however, if
        • A D
        • this is a transitive annotation error .
        • We take a very conservative approach and err on the side of missing homology rather than stretching weak data.
      Genome Rot! ~ ~ ~ ~ ~ ~ ~ ~
      • NO!
      • In order to prevent transitive annotation errors, we require that in order for a BER match to be used as justification for a functional annotation the match protein must be experimentally characterized (we call these “characterized matches”.)
      • Increasingly, the BER search results are filled with sequences from genome projects, the names of those proteins can not be considered reliable. Therefore, it becomes more and more difficult to find matches to experimentally characterized proteins.
      • To help annotators in this effort we have started storing in our database accessions of proteins known to be experimentally characterized.
      Are all pairwise matches with equal alignment quality of equal value to annotation?
    • Experimentally Characterized Proteins
      • It is important to know what proteins in our search database are characterized.
        • We store the accessions of proteins known or suspected to be characterized in the “characterized table” in our database
        • A confidence status is assigned to each entry in the characterized table.
      • Annotators see this information in the search results as color coded output:
        • green = full experimental characterization
        • sky blue = partial characterization
        • red = automated process (Swiss-Prot parse)
      • Our table does not contain all characterized proteins, not even close.
        • Any time additional characterized proteins are found it is important that they be entered into the table
    • AutoAnnotate Software tool which gives a preliminary name and role assignment to all the proteins in the genome. AutoAnnotate is not designed to give the most accurate annotation, it is designed to give at least some annotation (particularly TIGR roles) to every gene in the genome it can. It has been designed with the knowledge that manual annotation will follow the automatic annotation. This tool would need to be modified if we wanted it to produce stand-alone automatic annotation. Makes decisions based on ranked evidence types Equivalog HMM Other HMMs Characterized BER match Other BER match best evidence
    • Manual Annotation: Assigning Names to Proteins
    • Annotation must be grounded with supporting evidence The process of manual annotation involves assessing all available evidence and reaching a conclusion about what you think the protein is doing in the cell and why. Functional annotations should only be as specific as the supporting evidence allows, you must have high confidence in the level of functional annotation you are asserting. We have evolved a layered set of naming conventions that allows us to reflect the level of functional specificity of which we are confident in the names and other annotations assigned to a protein. All evidence that led to the annotation conclusions that were made must be stored so that others can assess the annotation for themselves. In addition, detailed documentation of methodologies and general rules or guidelines used in any annotation process should be provided.
    • Evaluate the Evidence
      • Visually inspect alignments
        • they should be full length
          • or partial matches to identified domains
        • at least 35% identity
      • Check HMM scores
        • need to be above trusted to be considered part of the family
        • how specific is the HMM?
      • Look at Genome Properties analysis
        • pathways, complexes?
      • Check for operon structure or other information from neighboring genes.
        • presence of a gene in an operon can supplement weak similarity evidence
      • Are there transmembrane regions?
      • Is there a signal peptide?
      • Are there any motifs that might give a clue to function?
      • Is there a paralogous family?
    • Functional Assignments: High Confidence in Precise Function Required evidence: -At least one good alignment (minimum 35% identity, over the full lengths of both proteins) to a protein from another organism that has been experimentally characterized, preferably multiple such alignments. -Hits to appropriate HMMs above the trusted cutoff. -Conservation of active sites, binding sites, appropriate number of membrane spans, etc. Action: Give the protein a specific name and accompanying gene symbol, this is the only confidence level where we assign gene symbols. We default to E. coli gene symbols when possible, for Gram positive genes we use B. subtilis gene symbols. Example: name: “adenylosuccinate lyase” gene symbol: purB EC number: 4.3.2.2
    • Functional Assignments: High Confidence in General Function, Unsure of Substrate Specificity A good example of this is seen with transporters: If you can definitively identify a specific substrate: “ ribose ABC transporter, permease protein” -Hit scoring above trusted to an equivalog HMM for one substrate -high quality match to an experimentally characterized protein in the BER with that substrate specificity If you can identify a substrate group: “ sugar ABC transporter, permease protein” -Multiple characterized hits to a specific type of transporter in the BER results -Hits to appropriate HMMs for that type of transporter -The substrate identified for in these matches are not all be the same, but may fall into a group, for example they are all sugars. If you can only identify the type of transporter and nothing about substrate specificity: “ ABC transporter, permease protein” -Hits to HMMs defining the transporter family but without substrate specificity -Matches to characterized proteins in the BER where the substrates identified do not fall into a group ------------ Another example of a known function but not exact substrate: “ carbohydrate kinase, FGGY family”
    • Functional Assignments : Precise Function Unclear The “family” designation: -No matches to specific characterized protein -score above trusted cutoff to an HMM which defines a family, but not a specific function. Example: “CbbY family protein ” The general function name: -no match to a specific enzyme -HMM match to a superfamily of functions Example: “kinase” The “homolog” designation, used when we want to show sequence relationship but not functional relationship: -good match to a function not expected in the organism (like a photosynthesis gene in a non-photosynthetic bug) OR -two matches exist in the genome to a particular function, however, one has much stonger evidence than the other. One is likely to be the “real” one while the other is likely not. Example: “galactokinase homolog” The “putative” designation, used when evidence is missing one small element but is otherwise strong: -has been largely replace by “homolog” and “family” Example: “putative galactokinase”
    • Functional Assignments: Hypotheticals If a protein has no matches to any protein from another species in BER, HMM, Prosite, or TmHMM it is called: “ hypothetical protein” This term is used since we are not actually sure the gene/protein is actually real. It has been predicted by Glimmer but no evidence exists that it is really expressed. --------------------------------------------------------- If a “hypothetical protein” from one species matches a “hypothetical protein” from another, they both now become: “ conserved hypothetical protein” These are really no longer “hypothetical” because it is very unlikely that two species will both retain a sequence if it is not an actual gene. Other centers use terms for this such as: “ protein of unknown function” “ unknown protein” NOTE: the above names will likely change soon as discussion is underway to agree on new names for these that would be used by the whole community.
    • Functional Assignments: Frameshifts and Point Mutations When a frameshift or in-frame stop codon is seen within a region of alignment in our BER search results, two possible things may have occurred: a sequencing error, or a mutation in the gene. When these are found, the underlying sequence data is manually reviewed for accuracy. Occasionally we find that the sequence we are using has an error (missed base, extra base, miscalled base). When these are corrected the frameshift/in-frame stop is resolved. If we find that the sequence is correct as reported to us, the protein is annotated to reflect the presence of a disruption in the open reading frame: “ great protein, authentic frameshift” “ fun protein, authentic point mutation” We do NOT use the term “pseudogene” as this implies that some experiment was done to confirm the lack of function of the gene. We have done no such experiments and indeed it may be that the protein may be functional in some truncated form, or that some read-through mechanism allows expression. Also, it may be that mutations we see only occurred in the cells grown for the DNA sample used for the sequencing project.
    • Functional Assignments: Other ORF disruptions -many FS / PM, “ degenerate” -”truncation” -”deletion” -”insertion” (~20-50aa) -interruption “ interruption-N” “ interruption-C” -”fusion” -”fragment” ! ! ! * * * These are given descriptive terms in the common name and all are put into a “Disrupted reading frame” role category to make them easy to find. Again we do not use the term “pseudogene”. (some genes) N-term C-term
    • TIGR Roles Amino acid biosynthesis Purines, pyrimidines, nucleosides, and nucleotides Fatty acid and phospholipid metabolism Biosynthesis of cofactors, prosthetic groups, and carriers Central intermediary metabolism Energy metabolism Transport and binding proteins DNA metabolism Transcription Protein synthesis Protein Fate Regulatory Functions Signal Transduction Cell envelope Cellular processes Other categories Unknown Hypothetical Disrupted Reading Frame Unclassified ( not a real role ) Role Notes: Notes written by annotators expert in each role category to aid other annotators in knowing what belongs in that category and the JCVI naming conventions for it. AutoAnnotate makes a first pass at assigning role, based on roles associated with HMMs or with match proteins. Human annotator checks and adjusts as necessary. TIGR bacterial roles were first adapted from Monica Riley’s roles for E. coli - both systems have since undergone much change.
    • What is the Gene Ontology? and Why is it useful?
    • GO offers an annotation capture system that allows the unambiguous communication of annotation information in a format readable by both computers and humans and which facilitates data exchange.
      • Traditionally, annotations have been captured in “name”, “gene symbol”, “comment” and perhaps “EC number” or “comment” fields
      • However,
        • this mode of capturing information is hard to standardize (just think of the many nomenclature battles that have gone on over the years).
        • generally the name field is all that is seen when people view annotations (think of a Blast search).
        • comment fields, although they can store a lot of detailed information, rarely are made to conform to format standards and it is difficult for a computer to search effectively since text-matching search tools will be forever challenged by free text and the inconsistencies of human language
    • Do we use precise, consistent terminology in biology?
      • Well, sometimes…
        • DNA, RNA, protein and many other terms are universally understood by biologists
      • but not always
        • there are numerous instances of the same thing being known by more than one term
        • there are numerous instances of different things being known by the same term
    • Enzymes pyruvate kinase and phosphoenol transphosphorylase both are names for the reaction: ATP + pyruvate = ADP + phosphoenolpyruvate The Enzyme Commission has standardized enzyme reactions and provides id numbers and alternate names
    • sporulation
      • reproductive sporulation
      • sporulation to survive adverse conditions
      (both photos from Wikipedia site) Neurospora , photo from NIH
    • Is a little ambiguity a big problem?
      • Not so much in the past when datasets were relatively small and different research communities did their own thing
      • But with the arrival of hundreds of complete genomes and many millions of new sequences it is quite important.
      • In order to be useful, genomic data must be organized so that meaningful comparisons and searches can be done
    • Biologists need to:
      • Describe gene products ( a.k.a. annotate them)
      • Search annotation data to answer biological questions
      • To accomplish this we must all speak the same biological language!
      • GO fills this need be providing a controlled set of defined terms to use in making annotations.
    • GO History
      • GO began as a collaboration between the databases for the model organisms of mouse (MGI) fruit fly (FlyBase) and baker’s yeast (SGD)
      • GO first appeared on the scene in 1998 with the initial vocabularies coming mostly from Michael Ashburner of FlyBase
      • First publication on GO was in 2000
      • The Consortium has now grown to include approximately 20 consortium and associate members who actively contribute to GO development and annotation efforts.
      • In addition there are a host of other groups that use the GO in many different ways for their research.
      • Organisms represented with GO annotations come from every domain and kingdom of life.
    • (Thanks to Jen Clark for this slide) The groups that contribute to the GO effort: ZFIN Reactome
    • The Structure of GO
    • GO is a set of 3 controlled vocabularies, stored as ontologies, which are used to tag gene products with information.
      • what is a controlled vocabulary?
      • what is an ontology?
    • What is a controlled vocabulary?
      • a set of official terms, each with a precise definition
      • solves problems associated with
        • synonyms (two different words/phrases mean the same thing)
        • homographs (the same word/phrase means more than one thing)
      • allows one to use precise and consistent terminology in describing something
    • GO consists of 3 separate controlled vocabularies The vocabularies are used to describe 3 aspects of a gene products existence: what it does (molecular function), why it does what it does (biological process), and where it does what it does (cellular component)
    • What is an ontology?
      • I quote from Wikipedia: “In both computer science and information science, an ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain.”
        • the set of concepts are the three controlled vocabularies of the GO
        • the domains involved are the aspects of the activities of gene products: what, why, and where
        • every term is connected to other terms in the vocabulary using one of a set of defined relationship types
    • GO terms
      • Each term has a name and a detailed definition - these are understandable by humans.
      • Each term also has a unique id number - this makes the term easily understandable and searchable by a computer.
      • Each term is related to at least one other term in a “parent-child” relationship where the child term is more specific than the parent term.
      • Each term may have one or more synonyms which capture alternate names, spellings, phrasing, etc.
    • GO term relationships
      • Each GO term has a relationship to at least one other term
        • “ biological process”, “molecular function”, and ‘cellular component” are roots
        • terms always have at least one parent (accept the roots)
        • a term may have children terms, children terms are more specific than their parents
        • a term may have sibling terms, siblings share the same parent
        • as one moves down the tree from parents to children, to grandchildren, the functions, processes, and structures become more specific (or granular) in nature
        • terms may have more than one parent
      • relationship types
        • is a
          • ribokinase “is a” kinase
        • part of
          • periplasm is “part of” a cell
    • General GO tree structure Root term general term less general term more specific term even more specific term most specific term
    • Examples Molecular function catalytic activity kinase activity carbohydrate kinase activity ribokinase activity Biological process cellular process metabolism carbohydrate metabolism ribose metabolism transporter activity carbohydrate transporter activity ribose transporter activity lets look at just one branch: now two branches:
    • A look at some general terms
    • Example GO term
      • ID number: GO:0004076
      • Name: biotin synthase activity
      • Definition: Catalysis of the reaction: dethiobiotin + sulfur = biotin.
      • parent term: sulfurtransferase activity (GO:0016783)
      • relationship to parent: “is a”
      • cross reference: EC:2.8.1.6
    • Abbreviated Tree
    • Expanded Tree
    • A Multi-parent term
      • ID number : GO:0015171
      • Name : amino acid transporter activity
      • Definition : Enables the directed movement of amino acids, organic acids containing one or more amino substituents, into, out of, within or between cells.
      • parent term : amine transporter activity (GO:0005275)
      • relationship to parent : “is a”
      • parent term : carboxylic acid transporter activity (GO:0046943)
      • relationship to parent : “is a”
      amine group carboxylic acid group
    • Tree view in AmiGO of a multi-parent term: Each node in an AmiGO tree can be expanded or collapsed as needed by clicking on the +/- in the box at the beginning of each line. The letter in the green circle (I/P) indicates the relationship between the term and its parent. The numbers in parentheses at the end of each line indicate the number of gene products in the GO repository annotation collection which have been annotated to that term or one of its children.
    • True Path Rule The pathway from a term all the way up to its top-level parent(s) must always be true for any gene product annotated to that term. cell organelle mitochondrian proton-transporting ATP synthase complex Incorrect for Bacteria cell intracellular proton-transporting ATP synthase complex plasma membrane proton-transporting ATP synthase complex mitochondrial proton-transporting ATP synthase complex membrane plasma membrane plasma membrane proton-transporting ATP synthase complex organelle mitochondrion mitochondrial inner membrane mitochondrial proton-transporting ATP synthase complex Correct for Bacteria (and Eukaryotes) NOTE: these trees are abbreviated versions of the actual ones
      • Obsolete terms - if a term is found to be incorrect, it is made obsolete, it gets the tag is_obsolete yes in the GO ontology file. Obsolete terms should not be used for annotations.
      • Secondary ids - if two terms are found to mean the same thing, they are merged, and one is made secondary to the other. The primary term should be used in annotations .
      • as the ontologies change, annotations must be changed too - in particular annotations to terms that have become obsolete
      • The ontologies change on a daily basis.
      GO is a work in progress
    • Getting involved with Ontology Development
      • SourceForge site to submit requests for new terms or changes to existing terms
      • Anyone can submit a SF item
      • GO maintains a 4-person full-time staff to care and feed the ontologies and respond to SF items
      • Join “interest groups” to participate in ontology development in specific areas of biological interest
    • https://sourceforge.net/tracker/?func=add&group_id=36855&atid=440764 The Gene Ontology SourceForge Tracker for Ontology Development
    • http://www.geneontology.org/GO.interests.shtml?all
    • Making associations between GO terms and gene products: GO Annotations
    • What is GO Annotation?
      • Linking a GO term to a gene product is making a GO annotation
      • GO terms from all 3 ontologies should be used to capture all available information about the gene product
    • 6-phosphofructokinase
      • function
        • 6-phosphofructokinase activity (GO:0003872)
      • process
        • glycolysis (GO:0006096)
      • component
        • cytoplasm (GO:0005737)
    • Making Annotations
      • literature based
        • read all literature for a gene product
        • find corresponding GO terms
      • sequence similarity based
        • find similarity between your gene product and one that has annotation attached to it
          • BLAST (BER at JCVI)
          • HMMs (Pfam, TIGRFAMS)
          • Prosite, InterPro, TMHMM, SignalP
          • many more….
        • find corresponding GO terms
      • automatic vs. manual
      • Using GO to capture annotations is SOP (standard operating procedure) independent.
    • Evaluate the Evidence
      • Visually inspect alignments
        • they should be full length
          • or partial matches to identified domains
        • at least 35% identity
      • Check HMM scores
        • need to be above trusted to be considered part of the family
        • how specific is the HMM?
      • Look at Genome Properties analysis
        • pathways, complexes?
      • Check for operon structure or other information from neighboring genes.
        • presence of a gene in an operon can supplement weak similarity evidence
      • Are there transmembrane regions?
      • Is there a signal peptide?
      • Are there any motifs that might give a clue to function?
      • Is there a paralogous family?
      • Decide what you think the protein does, why it does it, and where it does it.
    • Functional confidence captured with GO GO trees (very abbreviated) Function catalytic activity kinase activity carbohydrate kinase activity ribokinase activity glucokinase activity fructokinase activity Process metabolism carbohydrate metabolism monosaccharide metabolism hexose metabolism glucose metabolism fructose metabolism pentose metabolism ribose metabolism Available evidence for 3 genes #1 -good match to an HMM specific for “ribokinase” -a high-quality BER match to an experimentally characterized ribokinase #2 -good match to an HMM for “kinase” -a high-quality BER match to an experimentally characterized “glucokinase’ AND a ‘fructokinase’ #3 -good match to an HMM for “kinase’
    • Finding the terms to assign to your protein
      • Once you know what the protein does, why it does it, and where it does it, you need to find the terms to describe your knowledge about the protein.
        • Manatee Suggestions
          • mapping files (just 2 examples out of many)
            • ec2go - EC numbers and corresponding GO terms
            • Tigrfams2go - GO terms assigned to TIGR HMMs
          • Genome Property suggestions
            • available in Manatee and on the Comprehensive Microbial Resource
        • Search the ontologies with your favorite tree viewing tool
          • Manatee has an integrated GO viewer
          • AmiGO is the tool available on the GO web site
          • there are numerous other GO viewers, most listed on the GO web site
        • Search against proteins already annotated to GO
          • GOst (at GO web page)
          • Manatee search option (in “Select Function” menu)
          • (but don’t contribute to transitive annotation errors with this one - make sure you view the terms you find as suggestions that need to be confirmed with your evidence any your knowledge of the organism)
    • Full Annotation
      • Try to get a term from every ontology at the level of specificity you are confidant of. If there is no information on what the function, process, and/or component are, or if there is insufficient information to make even the most general of annotations, then the function, process, and/or component is “unknown”. This is indicated by an annotation to the root term(s).
      • Assign as many terms as are appropriate to completely describe what is known about the protein. This is likely to result in assignment of multiple terms from one or more ontologies.
    • chaperone DnaK
      • function
        • ATP binding (GO:0005524)
        • ATPase activity (GO:0016887)
        • unfolded protein binding (GO:0051082)
        • misfolded protein binding (GO:0051787)
        • denatured protein binding (GO:0031249)
      • process
        • protein folding (GO:0006457)
        • protein refolding (GO:0042026)
        • protein stabilization (GO:0050821)
        • response to stress (GO:0006950)
      • component
        • cytoplasm (GO:0005737)
    • Evidence
      • It is vitally important to store the evidence that was used to make annotations
        • literature
          • method (direct assay, mutant phenotype, etc.)
          • references where experiments/analysis were done
        • sequence based
          • methods of manual curation
          • accession numbers of similar sequences, HMMs, etc. that were used in analysis
          • references describing method of curation
      • GO has a format for storage of this information
        • evidence codes, references, “with”
    • GO Evidence A key element of a good annotation system is the storage of evidence:
      • Assign “Evidence Code” - evidence codes tell users what kind of evidence was used
          • “ inferred from sequence similarity” = ISS
          • Various experimental characterizations - IMP, IDA, etc.
          • “ inferred from electronic annotation” = IEA - immediately allows users to tell manual curation from automatic
      • Assign “Reference” - describes source of the annotation
          • PMID of paper describing characterization or method used for annotation
          • Standard GO reference set - submitted by various groups, text descriptions of their methods - JCVI has submitted three (HMM, pairwise match, operon). These can be found here: www.geneontology.org/doc/GO.references
      • Assign “with” (when appropriate)
          • Used with ISS to store the accession of the thing the sequence similarity is with
          • Used with IPI /IGI (inferred from physical interaction/genetic interaction)
      • Modifier column
          • “ contributes to” - use this modifier when you assign a function term representing the function of a complex to proteins that are part of the complex but can not themselves carry out the function of the complex
          • NOT - went you want to assert that a gene product should NOT get an annotation
          • “ co-localizes_with” - gene product transiently or peripherally associated
    • • IDA inferred from direct assay - Enzyme assays - In vitro reconstitution (e.g. transcription) - Immunofluorescence (for cellular component) - Cell fractionation (for cellular component) - Physical interaction/binding • IEP inferred from expression pattern - Transcript levels (e.g. Northerns, microarray data) - Protein levels (e.g. Western blots) • IGI inferred from genetic interaction - &quot;Traditional&quot; genetic interactions such as suppressors, synthetic lethals, etc. - Functional complementation - Rescue experiments - Inference about one gene drawn from the phenotype of a mutation in a different gene. • IMP inferred from mutant phenotype - Any gene mutation/knockout - Overexpression/ectopic expression of wild-type or mutant genes - Anti-sense experiments - RNAi experiments - Specific protein inhibitors • IPI inferred from physical interaction - 2-hybrid interactions - Co-purification/Co-immunoprecipitation - Ion/protein binding experiments • ISS inferred from sequence or structural similarity - Sequence similarity (to characterized proteins, domains, motifs) - Structural similarity • IGC inferred from genomic context -operon structure -syntenic regions -pathway/system reconstruction • ND no biological data available (used when annotating to the root terms) • IC inferred by curator • TAS/NAS traceable/nontraceable author statement • IEA inferred from electronic annotation http://www.geneontology.org/GO.evidence.html GO Evidence codes
    • GO Standard References GO_REF:0000011 A Hidden Markov Model (HMM) is a statistical representation of patterns found in a data set. When using HMMs with proteins, the HMM is a statistical model of the patterns of the amino acids found in a multiple alignment of a set of proteins called the &quot;seed&quot;. Seed proteins are chosen based on sequence similarity to each other. Seed members can be chosen with different levels of relationship to each other. They can be members of a superfamily (ex. ABC transporter, ATP-binding proteins), they can all share the same exact specific function (ex. biotin synthase) or they could share another type of relationship of intermediate specificity (ex. subfamily, domain). New proteins can be scored against the model generated from the seed according to how closely the patterns of amino acids in the new proteins match those in the seed. There are two scores assigned to the HMM which allow annotators to judge how well any new protein scores to the model. Proteins scoring above the &quot;trusted cutoff&quot; score can be assumed to be part of the group defined by the seed. Proteins scoring below the &quot;noise cutoff&quot; score can be assumed to NOT be a part of the group. Proteins scoring between the trusted and noise cutoffs may be part of the group but may not. One of the important features of HMMs is that they are built from a multiple alignment of protein sequences, not a pairwise alignment. This is significant, since shared similarity between many proteins is much more likely to indicate shared functional relationship than sequence similarity between just two proteins. The usefulness of an HMM is directly related to the amount of care that is taken in chosing the seed members, building a good multiple alignment of the seed members, assessing the level of specificity of the model, and choosing the cutoff scores correctly. GO_REF:0000012 Pairwise alignments are generated by taking two sequences and aligning them so that the maximum number of amino acids in each protein match, or are similar to, each other. Tools such as BLAST work by comparing a protein-of-interest individually with every protein in a database of known protein sequences and retaining only those matches with a high probability of being significant. Basic BLAST generates local alignments between proteins for regions of high similarity. Other pairwise alignment tools attempt to generate global (full-length) protein alignments. GO_REF:0000025 Genes in prokaryotic organisms are often arranged in operons. Genes in an operon are all transcribed into one mRNA. Generally the genes in the operons code for proteins that all have related functions. For example, they may be the steps in a biochemical pathway, or they may be the subunits of a protein complex. Often the genes in operons shared between organisms are syntenic; that is, the same genes are in the same order in the operon in different species. When assessing sequence-comparison-based evidence during the process of manual annotation of a genome, it is often the case that some of the genes in the operon will have strong sequence-based evidence while others will have weak evidence. If seen alone, not in the presence of an operon, the weak evidence in question may not be sufficient to make a functional annotation. However, in the presence of an operon in which there is strong evidence for some of the genes, the very presence of the gene in the operon is a strong indication that the gene shares in the process carried out by the operon. If the putative function is one expected to exist for the process in question and particularly if that function has been observed in the same operon in another species, then the annotation should be made. This type of evidence is inferred from the context of the gene in an operon, and therefore the evidence code is IGC &quot;inferred from genomic context.&quot;
      • Examples:
      • Protein sequence from UniProt --> UniProt:P12345
      • TIGR HMM --> TIGR_TIGRFAMS:TIGR01234
      • Publication describing experiments --> PMID:12345678
      Accession Format All objects (papers, proteins, etc.) used in evidence must be represented according to GO’s format: “ database:accession” where “database” is an abbreviation defined at GO for the database housing the object, and “accession” is the accession number or other identifier for the object Each database must have a registered abbreviation with GO. New abbreviations can be requested at any time. The current file can be viewed at: www.geneontology.org/doc/GO.xrf_abbs
    • abbreviation: JCVI database: J Craig Venter Institute generic_url: http://www.jcvi.org/ abbreviation: JCVI_CMR database: Comprehensive Microbial Resource object: Locus example_id: JCVI_CMR:VCA0557 generic_url: http://www.jcvi.org/ url_syntax: http://www.jcvi.org/tigr-scripts/CMR2/GenePage.spl?locus= url_example: http://www.jcvi.org/tigr-scripts/CMR2/GenePage.spl?locus=VCA0557 abbreviation: TIGR_TIGRFAMS database: J Craig Venter Institute, TIGRFAMs HMM collection object: Accession example_id: TIGR_TIGRFAMS:TIGR00254 generic_url: http://www.jcvi.org/ url_syntax: http://www.jcvi.org/tigr-scripts/CMR2/hmm_report.spl?acc= url_example: http://www.jcvi.org/tigr-scripts/CMR2/hmm_report.spl?acc=TIGR00254 Example Abbreviations from GO file
    • GO annotation examples
    • From the literature…… Enzymes involved in DNA ligation and end-healing in the radioresistant bacterium Deinococcus radiodurans Blasius, et al, 2007, BMC Molecular Biology “ Results: in this report, we describe the cloning and expression of a Deinococcus radiodurans DNA ligase in Escherichia coli . This enzyme efficiently catalysis DNA ligation in the presence of Mn(II) and NAD+ as cofactors…….” Read on in the text to see how they did the experiments and we find that they performed a direct assay on purified protein. This is IDA for the ligase activity. cellular component biological process molecular function aspect GO:0003909 PMID: 17705817 IC cytoplasm GO:0005737 N/A PMID: 17705817 IDA DNA ligation GO:0006266 N/A PMID: 17705817 IDA DNA ligase activity GO:0003909 with reference ev code term name GO id
    • Sequence similarity-based Annotations
      • find similarity between your gene product and one that is experimentally characterized
        • BLAST-type alignments
      • find similarity between your gene product and a defined protein family
        • HMMs (Pfam, TIGRFAMS)
        • Prosite
        • InterPro
        • and more….
      • use prediction tools to find motifs in your gene product
        • TMHMM
        • SignalP
        • and more….
    • For example…. … . a high quality alignment to an experimentally characterized triosephosphate isomerase from Vibrio marinus
    • further down the page Info from Swiss-Prot database on experimentally characterized match protein
    • … . full-length match, high percent identity (67.8%), conserved active and binding sites (boxed in red). High quality…..
    • Genome Property for glycolysis
    • Annotations resulting from this match (and others in the pathway for the biological process annotation) …. cellular component biological process molecular function aspect GO:0004807 GO_REF: 0000012 IC cytoplasm GO:0005737 TIGR_GenProp: GenProp0120 PMID: 15347579 IGC glycolysis GO:0006096 Swiss-Prot: P50921 GO_REF: 0000012 ISS triose-phosphate isomerase activity GO:0004807 with reference ev code term name GO id
    • Sharing Annotations
      • Send annotation to GO to be placed in the repository of annotated genes to be a resource to the community (this is optional but benefits the whole community.)
      • All files are in the same format
        • 15 column tab-delimited file
        • stores gene product identifiers, GO term identifiers and associated evidence
      • Facilitates use of data by everyone and development of tools that work on one consistent file format
      • All data in the repository is free to anyone who wishes to use it.
      • There are hundreds of thousands of gene product annotations available in the repository with more arriving all the time
    • Annotation (or Association) file format The annotation file consists of 15 tab-delimited columns: DB - the database submitting the annotation DB_Object_ID - unique identifier for the object being annotated DB_Object_Symbol - gene symbol or other symbol for the gene product Qualifier - one of: NOT, contributes_to, colocalizes_with Goid - the GO id number for the term annotated to the gene product DB:Reference - the reference which describes the annotation source Evidence Code - ISS, IEA, IMP, etc. With/From - accession number of item with sequence similarity, or physical interaction, etc. Aspect - one of: process, function, component stored as P, F, or C DB_Object_Name - common name of gene product (eg. “ribokinase”) Synonym - any alternate names for the gene product, can put gene symbol here too DB_Object_Type - what type of thing is being annotated (protein, gene, etc.) Taxon - the NCBI taxon id for the organism from which the gene product came Date - date the annotation was made Assigned_by - the database that made the annotation (same as the submitting db in most, but not all cases
    • Some of the Association files at GO
    • Updating Annotations and Enforcing Standards in the Repository
      • obsoletes/secondary ids
        • as terms become obsolete or secondary to other terms, those gene products annotated to those terms must be re-annotated to other terms
        • GO will remove such annotations from association files if they are not fixed
      • old IEAs
        • GO will remove IEA annotations more than one year old
      • new information
        • as new information or new terms become available older annotations should be revisited
        • some groups have limited ability to do this
      • GO enforces certain annotation policies by absolutely requiring certain evidence fields be populated.
      • GO enforces file format rules by deleting lines in the files that do not meet the standards or requirements.
    • GO Slims
      • Provide a “high-level” or “general” view of the GO
      • Specific annotations can be mapped up to Slim terms
        • GO provides the “map2slim” script
      • Small example slim:
        • amino acid metabolism
        • carbohydrate metabolism
        • lipid metabolism
        • amino acid transport
        • carbohydrate transport
        • lipid transport
        • cell division
        • pathogenesis
        • motility
      • Several Slim files are stored at GO for use by anyone
        • plant
        • yeast
        • prokaryotes (coming very soon)
        • http://www.geneontology.org/GO.slims.shtml
      • Specific slims can be made for specific organisms or projects
    • Using GO for data mining
      • Find all genes that share a particular function regardless of sequence.
      • Do comparisons across as many species as have been annotated to GO
        • currently there are representatives of every domain and kingdom of life
      • GO removes the problems associated with alternate names for the same concept AND the same name used for different concepts
        • queries are based on id numbers linked to precise definitions, not unreliable text-matching tools
      • Since all data is stored in one format, data-mining tools can be made to work off of one set of standards.
    • Annotation Training
      • Conducted by consortium mentors as needed for new groups (AgBase, PAMGO)
        • JCVI has mentored PAMGO (the Plant-Associated Microbe Gene Ontology group) in the development of terms to describe the interactions between microbes and their hosts
        • The result has been more than 500 new terms entered into the GO and annotations attached to several species important to agriculture
      • Annotation Camps held by GO
      • Workshops at conferences
      • Scheduled training classes provided by JCVI (which you are all familiar with)
        • 3-day workshops on-site at JCVI
        • prokaryotic: March/June/August/Oct.
        • eukaryotic: March/June/Sept.
        • http://www.JCVI.org/edutraining/training
    • JCVI’s Annotation Service
      • FREE service for anyone with a prokaryotic DNA sequence they wish to annotate
      • submit the sequence to us and we return a MySQL database to you with output of our pipeline (search results, automatic annotation)
      • you can then download the free open source tool Manatee to manually annotate the genome
      • We have >100 users with >200 genomes through the Annotation Engine so far
      • We recently received funding from the National Institute of General Medical Science to expand the service and add new features to the available tools
      • go to www.jcvi.org/AnnotationService for more information
    • Data Availability
      • Publication
        • JCVI staff/collaborators analysis of genome data
      • GenBank
        • Sequence and annotation submitted to GenBank at the time of publication
        • Updates sent as needed
      • Comprehensive Microbial Resource (CMR)
        • Data available for download
        • extensive analyses within and between genomes
      • GO repository
        • Data available for download
        • whole repository searchable with AmiGO
    • Acknowledgements Director of bioinformatics: Granger Sutton Prokaryotic Annotation: Scott Durkin Bob Dodson Ramana Madupu Sagar Kothari Susmita Shrivastava Yinong Sebastian Derek Harkins And the many other TIGR/JCVI employees, present and past, who have contributed to the development of these tools and to the annotation protocols I have described. Also thanks to the funding agencies that make our work possible including NIH, NSF, and DOE. CMR/Pathema team: Tanja Davidsen Erin Kelsey BRC Project Manager: Lauren Brinkac HMM/Genome Properties team: Dan Haft Jeremy Selengut Building the tools: Nikhat Zafar Alex Richter and many more….