defined by the absence of a translational “stop” codon
3 “stops” in bacteria: TAA, TAG, TGA
an ORF goes from “stop” to “stop”
ORFs can easily occur by chance in any given DNA sequence
Since “stop” codons are AT rich:
GC rich DNA has less random stops and therefore, on average, more/longer ORFs
AT rich DNA has more random stops and therefore, on average, fewer/shorter ORFs
Protein Coding Gene
coding sequence is contained within an ORF
requires translational “start” codon
3 “starts” in bacteria: ATG, GTG, TTG
coding sequence goes from “start” to “stop”
in JCVI jargon we tend to equate gene and coding sequence, even though we know that a gene consists of more than coding sequence.
you may hear JCVI people use gene and protein interchangeably
Telling the difference between random ORFs and genes is the goal in the gene finding process.
Each DNA sequence has 6 possible translations, one for each frame (Very) Short sample sequence TAGATGATTAGCTTGGATGAGCTCATATAGCCCGTAAGA ATCTACTAATCGAACCTACTCGAGTATATCGGGCATTGT Frame and corresponding codons +1 TAG ATG ATT AGC TTG GAT GAG CTC ATA TAG CCC GTA +2 AGA TGA TTA GCT TGG ATG AGC TCA TAT AGC CCG TAA +3 GAT GAT TAG CTT GGA TGA GCT CAT ATA GCC CGT AAG -1 TGT TAC GGG CTA TAT GAG CTC ATC CAA GCT AAG CAT -2 GTT ACG GGC TAT ATG AGC TCA TCC AAG CTA AGC ATC -3 TTA CGG GCT ATA TGA GCT CAT CCA AGC TAA GCA TCT color key translational start translational stop Just one sequence can result in many different potential coding sequence situations: frame +1 and +2 have relatively long ORFs with start codons and therefore are potential genes. Frames +3 and -3 have short ORFs with no starts and therefore no genes. Frame -1 and -2 are all ORF since there are no stops at all, but only one has a start.
More on 6-Frame translations In order to visualize the genes within the context of their neighbors along a DNA sequence, the sequence is often represented as a 6-frame translation. There are 6 possible frames for translation in every sequence of DNA, 3 in the forward (+) direction on the DNA sequence, and 3 in the reverse (-) direction on the DNA sequence. These are represented as horizontal bars with vertical marks for stops (long) and starts (short) in the order shown below. +1 +2 +3 -1 -2 -3 These are some of the many ORFs in this graphic. stop start
Glimmer is a tool which uses Interpolated Markov Models (IMMs) to predict which ORFs in a genome contain real genes.
Glimmer does this by comparing the nucleotide patterns it finds in a training set of “known real genes” to the nucleotide patterns of the ORFs in the whole genome. ORFs with patterns similar to the patterns in the training genes are considered real themselves.
Using Glimmer is a two-part process:
Train Glimmer with genes from the organism that was sequenced.
Run the trained Glimmer against the entire genome sequence.
Gene finding with Glimmer: Gathering the Training Set Glimmer built-in system: long ORFs that do not overlap other long ORFs IMM built Glimmer builds a model from the training set One can gather published sequences from the organism sequenced, however, this generally does not provide enough sequence for training - a general guideline is that ~250 kb of total sequence is needed (that is if you line up all of the training genes end to end they total 250 kb) BLAST all ORFs against your favorite protein database, retain only extremely strong matches
Gene Finding with Glimmer What happens during training. ATGCG T AAGGCTTTCACAGT ATGCG A GTAAGCTGCGTCGTAAGG A TGCGT A AGGCTTTCACAGTATGCGAGTAAGC TGCGT C GTAAGG AT GCGTA A GGCTTTCACAGTATGCGAGTAAGCTGCGTCGTAAGG ATG CGTAA G GCTTTCACAGTATGCGAGTAAGCTGCGTCGTAAGG ATGC GTAAG G CTTTCACAGTATGCGA GTAAG C TGCGTC GTAAG G Glimmer moves sequentially through each sequence in the training set, recording the nucleotide that occurs after each possible oligomer up to oligomers of length eight Glimmer then calculates the statistical probability of each pattern appearing in a real gene. These probabilities form the statistical model of what a real gene looks like in the given organism. This model is then run against the complete genome sequence. All ORFs above the chosen minimum length (90 bp at JCVI) are scored according to how closely they match the model of a real gene. Example for a 5-mer:
Candidate ORFs 6-frame translation map for a region of DNA: Each horizontal bar corresponds to one of the 6 translation frames (3 forward, 3 reverse) long vertical lines represent stops (TAA, TAG, TGA) short vertical lines represent starts (ATG, GTG, TTG) = ORFs meeting minimum length In the first panel the ORFs highlighted in blue meet the minimum length requirement of what can be considered a gene. (in our pipeline 90 bp) In the second panel, potential coding sequences within the candidate ORFs are represented by arrows, going from start to stop. A long ORF does not necessarily result in a long putative gene. Green ORFs scored well to the model, red ORFs scored less well. The green ORFs are chosen by Glimmer as the set of likely genes. ORFs in the area of lateral transfer, although real genes, often will not be chosen since their nucleotide patterns likely won’t match the model built from the patterns of the genome as a whole. Below genes are represented as arrows drawn above (forward orientation) or below (reverse orientation) a line representing the DNA molecule. Glimmer numbers the genes sequentially from the beginning of the DNA molecule on which they reside. Genes missed by Glimmer and overlapping genes are resolved by post-Glimmer processes, which will be discussed on later slides. ORF00001 ORF00002 ORF00003 ORF00004 = laterally transferred DNA ? +1 +2 +3 -1 -2 -3
Coordinates Genes are mapped to the underlying genome sequence via coordinates. Each gene is defined by two coordinates: end5 (the 5 prime end of the gene) and end3 (the 3 prime end of the gene). Nucleotide #1 for each molecule in the genome is the beginning of each final assembled molecule. Circular chromosomes are treated as linear for the purposes of annotation. One position is chosen to be nucleotide #1 and that becomes the “beginning” of the molecule. Some genomes have just one DNA molecule, some have several (multiple chromosomes or plasmids). 1 10,000 gene end5 end3 purple 12 527 red 802 675 blue 927 1543 green 9425 7894 pink 9575 9945 Note that forward genes have end5<end3, while reverse genes have end5>end3.
Gathering Evidence Determining How The New Proteins Function
there are occurrences of proteins where one amino acid substitution changes the function
all functional assignments based on sequence similarity should be considered putative until experimental confirmation
assessed using protein alignments
proteins are aligned one on top of the other so that the maximal number of amino acids “match”
identity in an alignment means that at a given position in the alignment the amino acids are identical
similarity in an alignment means that at a given position in the alignment the amino acids are similar in chemical structure to each other, and may therefore be able to carry out the same role in the protein.
two protein’s amino acid sequences aligned next to each other so that the maximum number of amino acids match
3 or more protein’s amino acid sequence aligned next to each other so that the maximum number of amino acids match in each column
more meaningful than pairwise alignments since it is much less likely that several proteins will share sequence similarity due to chance alone, than that 2 will share sequence similarity due to chance alone. Therefore, such shared similarity is more likely to be indicative of shared function.
clusters of proteins that all share sequence similarity
may be modeled by various statistical techniques
short regions of amino acid sequence shared by many proteins
Local pairwise alignment tools do not worry about matching proteins over their entire lengths, they look for any regions of similarity within the proteins that score well.
comes in many varieties (see NCBI site)
finds best out of all possible local alignments
slow but sensitive
Global pairwise alignment tools take two sequences and attempt to find an alignment of the two over their full lengths.
finds best out of all possible alignments
Multiple alignments tools try to align 3 or more proteins so that the maximal number of amino acids from each protein are matched in the alignment - this may or may not include the full length of some or all of the proteins
Sample Alignments Pairwise Multiple -two rows of amino acids compared to each other, the top row is the search protein and the bottom row is the match protein, numbers indicate amino acid position in the sequence -solid lines between amino acids indicate identity -dashed lines (colons) between amino acids indicate similarity -next slide shows a full alignment Different shadings indicate amount of matching
File composed of protein sequences from several source databases
The file is made non-redundant
identical protein sequences that came into the file from different source databases are collapsed into one entry
all of the protein’s accession numbers from the various source databases where it is found are stored linked to the protein
users can always view the protein at the source database
Sample NIAA entry >biotin synthase, Escherichia coli MAHRPRWTLSQVTELFEKPLLDLLFEAQQVHRQHFDPRQVQVSTLLSIKTGACPEDCKYC PQSSRYKTGLEAERLMEVEQVLESARKAKAAGSTRFCMGAAWKNPHERDMPYLEQMVQGV KAMGLEACMTLGTLSESQAQRLANAGLDYYNHNLDTSPEFYGNIITTRTYQERLDTLEKV RDAGIKVCSGGIVGLGETVKDRAGLLLQLANLPTPPESVPINMLVKVKGTPLADNDDVDA FDFIRTIAVARIMMPTSYVRLSAGREQMNEQTQAMCFMAGANSIFYGCKLLTTPNPEEDK DLQLFRKLGLNPQQTAVLAGDNEQQQRLEQALMTPDTDEYYNAAAL Source databases where this protein is found: -Swiss-Prot, accession # SP:P12996 -Protein Information Resource, accession # PIR:JC2517 -NCBI’s GenBank, accession # GB:AAC73862.1 All of these are collapsed into one entry in NIAA that is linked to all three accessions.
BLAST-extend-repraze (BER) Our pairwise protein search tool
Initial BLAST search of new query proteins from the genome sequence
search is against NIAA
stores good hits in mini-database for each query protein from the genome
Query protein sequence is extended by 300 nucleotides on each end and translated (see later slide)
A modified Smith-Waterman alignment is generated between the query and each match sequence in the mini-database
extends local alignments as far as homology continues over lengths of extended proteins
produces a file of alignments between the query protein and the match protein for each match protein in the mini-database
as the alignment generating algorithm builds the alignment, if the level of similarity falls below the necessary threshold, the program looks in different frames and through stop codons for homology to continue - this similarity can continue into the upstream and/or downstream extensions
BLAST-extend-repraze (BER) niaa BLAST Extend 300 nucleotides on both ends (see later slide) modified Smith- Waterman Alignment genome.pep vs. Significant hits put into mini-dbs for each protein (non-identical amino acid) vs. extended protein mini-database from BLAST search , mini-db for protein #1 mini-db for protein #2 , mini-db for protein #3 ... mini-db for protein #3000
BER Alignment An alignment like this will be generated for every match protein in the mini-database. In the next slides we will look closely at the types of information displayed here.
BER Alignment detail: Boxed Header -The background color of this box will be gold if the protein is in the characterized table and grey if it is not. -The top bar lists the percent identity/similarity and the organism from which the protein comes (if available). -The bottom section lists all of the accession numbers and names for all the instances of the match protein from the source databases (used in building NIAA for the searches.) -The accession numbers are links to pages for the match protein in the source databases. -A particular entry in the list will have colored text (the color corresponding to its characterized status) if that is the accession that is entered into the characterized table - this tells the annotators which link they should follow to find experimental characterization information. Only one accession for the match protein need be in the characterized table for the header to turn gold. -There are links at the end of each line to enter the accession into the characterized table or to edit an already existing entry in the characterized table.
BER Alignment detail: alignment header -It is most important to look at the range over which the alignment stretches and the percent identity -The top line show the amino acid coordinates over which the match extends for our protein -The second line shows the amino acid coordinates over which the match extends for the match protein, along with the name and accession of the match protein -The last line indicates the number of amino acids in the alignment found in each forward frame for the sequence as defined by the coordinates of the gene. The primary frame is the one starting with nucleotide one of the gene. If all is well with the protein, all of the matching amino acids should be in frame 1. -If there is a frameshift in the alignment (see later slide) the phrase “Frame Shifts = #” will flash and indicate how many frameshifts there are. In addition, inside the brackets in the last line will be a count of how many amino acids of match are in each frame.
BER Alignment detail: alignment of amino acids -In these alignments the codons of the DNA sequence read down in columns with the corresponding amino acid underneath. -The numbers refer to amino acid position. Position 1 is the first amino acid of the protein. The first nucleotide of the codon coding for amino acid 1 is nucleotide 1 of the coding sequence. Negative amino acid numbers indicate positions upstream of the predicted start of the protein. -Vertical lines between amino acids of our protein and the match protein (bottom line) indicate exact matches, dotted lines (colons) indicate similar amino acids. -Start sites are color coded: ATG is green, GTG is blue, TTG is red/orange -Stop codons are represented as asterisks in the amino acid sequence. An open reading frame goes from an upstream stop codon to the stop at the end of the protein, while the gene starts at the chosen start codon.
BER Skim A list of best matches from niaa to the search protein with statistics on length of match and BLAST p-value. Colored backgrounds indicate presence in characterized table and corresponding status.
Extensions in BER The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence. Blue line indicates predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein. ORFxxxxx 300 bp 300 bp FS PM FS or PM ? two functionally unrelated genes from other species matching one query protein could indicate incorrectly fused ORFs end5 end3 search protein match protein similarity extending through a frameshift upstream or downstream into extensions similarity extending in the same frame through a stop codon ? normal full length match * ! !
HMMs are statistical models of the patterns of amino acids in a group of functionally related proteins found across species.
this group is called the “seed”
HMMs are built from multiple alignments of the seed members.
Proteins searched against an HMM receive a score indicating how well they match the model.
Proteins scoring well to the model can be expected to share the function that the HMM represents.
HMMs can be built at varying levels of functional relationship.
The most powerful level of relationship is one representing the exact same function.
It is important to know the kind of relationship an HMM models to be able to draw the correct conclusions from it
Annotation can be attached to HMMs
Our HMM-Type Labels Domain: These HMMs describe a region of homology that is not required to be the full length of a protein. The function of the region may or may not be known. Superfamily : This type of HMM describes a group of proteins which have full length protein sequence similarity and have the same domain architecture, but which do not necessarily have the same function. Subfamily: This type of HMM describes a group of proteins which also have full length homology, which represent more specific sub-groupings with a superfamily. Equivalog: The supreme HMM, designed so that all members of the family being modeled and all proteins scoring above trusted share the exact same function. Equivalog_domain: Just like equivalog, only applies to a polypeptide that can be found as a single function protein or part of a multifunctional protein. (The above are just a sampling of the most commonly seen isology types arranged from most general to most specific, there are more types than those listed here.) Pfam: Indicates that no isology type has yet been assigned to the Pfam HMM. The starred HMMs provide the most specific information, look for matches to these first. They are very strong evidence for function.
GO terms: GO:0004076 (biotin synthase activity), GO:0009102 (biotin biosynthesis)
name: radical SAM domain protein
EC: not applicable
gene symbol: not applicable
TIGR role: 703 (enzymes of unknown specificity)
GO terms: GO:0003824 (catalytic activity), GO:0008152 (metabolism)
Building HMMs Collect proteins to be in the “seed” Generate and Curate Multiple Alignment of Seed proteins Run HMM algorithm Choose “noise” and “trusted” cutoff scores based on what scores the “known” vs. “unknown” proteins receive. HMM is ready to go! Region of good alignment and closest similarity (same function/similar domain/ family membership) Computes statistical probabilities for amino acid patterns in the seed Search new model against all proteins this step may need a few iterations
Choosing cutoff scores 100 300 matches (seed members bold) score protein “definitely” 547 protein “absolutely” 501 protein “sure thing” 487 protein “confident” 398 protein “safe bet” 376 protein “very confident” 365 protein “has to be one” 355 protein “could be” 210 protein “maybe” 198 protein “not sure” 150 protein “no way” 74 protein “can’t be” 54 protein “not a chance” 47 =search the new HMM against NIAA =see the range of scores the match proteins receive =do analysis to determine where confirmed members score =do analysis to determine where confirmed non-members score =set the cut-offs accordingly =proteins that score above trusted can be considered part of the protein family modeled by the HMM =proteins that score below noise should not be considered part of the protein family modeled by the HMM =usefulness of an HMM is directly related to the care taken by the person building the HMM since some steps are subjective
HMM Searches HMM hits for protein #1 , , , etc. HMM hits for protein #3 HMM hits for protein #2 genome.pep vs. HMM database TIGRFAMS + Pfams TIGR01234 PF00012 TIGR00005 TIGR01004 NONE Each protein in the genome is searched against all HMMs in our db. Some will not have significant hits to any HMM, some will have significant hits to several HMMs. Multiple HMM hits can arise in many ways, for example: the same protein could hit an equivalog model, a superfamily model to which the equivalog function belongs, and a domain model representing the catalytic domain for the particular equivalog function. There is also overlap between TIGR and Pfam HMMs.
Evaluating HMM scores - if the protein’s total score is….. … above trusted: the protein is a member of family the HMM models … below noise: the protein is not a member of family the HMM models … in-between noise and trusted: the protein MAY be a member of the family the HMM models ...above trusted and some or all scores are negative: the protein is a member of the family the HMM models 0 100 -50 0 100 -50 0 100 -50 T N P 0 100 -50
Paralogous Families genome.pep genome.pep vs. Proteins from the same genome are grouped into families (minimum two members) according to sequence similarity. First, proteins are clustered according to HMM hits, second, other regions of the proteins, not found in HMM hits, are searched and clustered. Those families associated with HMMs get names containing the HMM accession (ex. fam_PF00528), those not associated with HMMs are given sequential numbered names as they are built (ex. fam_11). Reveals expansion/contraction of various families of proteins in one genome verses another. Helps in annotation consistency, frameshift detection, and start site editing.
The process of insuring that the predicted genes have the correct coordinates and that the set of predicted genes is complete and correct (to the best of our ability.)
Start site curation
overlap resolution (false positives)
missed genes (false negatives)
Gene Model Curation: Start site edits What to consider: - Start site frequency: ATG >> GTG >> TTG - Ribosome Binding Site (RBS): a string of AG rich sequence located 5-11 bp upstream of the start codon - Similarity to match protein, in BER, Paralogous Family, and multiple alignments - the most important factor. (Remember to note, that the DNA sequence reads down in columns for each codon.) -In the example below (showing just the beginning of one BER alignment), homology starts exactly at the first atg (the current chosen start, aa #1), there is a very favorable RBS beginning 9bp upstream of this atg (gagggaga). There is no reason to consider the ttg, and no justification for moving to the second atg (this would cut off some similarity and it does not have an RBS.) RBS upstream of chosen start 3 possible start sites This ORF’s upstream boundary
Gene Model Curation: Overlaps and InterEvidence Overlap analysis When two ORFs overlap (boxed areas), the one without similarity to anything (another protein, an HMM, etc.) is removed. If both don’t match anything, other considerations such as presence in a putative operon and potential start codon quality are considered. This process has both automated (for the easy ones) and manual (for the hard ones) components. Small regions of overlap are allowed (circle). InterEvidence regions Areas of the genome with no genes and areas within genes without any kind of evidence (no match to another protein, HMM, etc., such regions may include an entire gene in case of “hypothetical proteins”) are translated in all 6 frames and searched against niaa. Results are evaluated by the annotation team.
Functional Assignments Making the annotations: Assigning names and roles to the proteins
Functional Assignments: What we want to accomplish. Name and associated info Descriptive common name for the protein, with as much specificity as the evidence supports; gene symbol. EC number if the protein is an enzyme Role TIGR role Gene Ontology terms from all 3 ontologies (more on this soon) to describe what the protein is doing in the cell and why. Supporting evidence HMMs, Prosite, InterPro Characterized match from BER search Paralogous family membership.
Functional Assignments: What we want to avoid!
Transitive Annotation is the process of passing annotation attached to one gene to another based on sequence similarity:
A B, so B gets A’s name
B C, so C gets B’s name
C D, so D gets C’s name
A’s name has passed to D through several intermediates, however, if
this is a transitive annotation error .
We take a very conservative approach and err on the side of missing homology rather than stretching weak data.
In order to prevent transitive annotation errors, we require that in order for a BER match to be used as justification for a functional annotation the match protein must be experimentally characterized (we call these “characterized matches”.)
Increasingly, the BER search results are filled with sequences from genome projects, the names of those proteins can not be considered reliable. Therefore, it becomes more and more difficult to find matches to experimentally characterized proteins.
To help annotators in this effort we have started storing in our database accessions of proteins known to be experimentally characterized.
Are all pairwise matches with equal alignment quality of equal value to annotation?
It is important to know what proteins in our search database are characterized.
We store the accessions of proteins known or suspected to be characterized in the “characterized table” in our database
A confidence status is assigned to each entry in the characterized table.
Annotators see this information in the search results as color coded output:
green = full experimental characterization
sky blue = partial characterization
red = automated process (Swiss-Prot parse)
Our table does not contain all characterized proteins, not even close.
Any time additional characterized proteins are found it is important that they be entered into the table
AutoAnnotate Software tool which gives a preliminary name and role assignment to all the proteins in the genome. AutoAnnotate is not designed to give the most accurate annotation, it is designed to give at least some annotation (particularly TIGR roles) to every gene in the genome it can. It has been designed with the knowledge that manual annotation will follow the automatic annotation. This tool would need to be modified if we wanted it to produce stand-alone automatic annotation. Makes decisions based on ranked evidence types Equivalog HMM Other HMMs Characterized BER match Other BER match best evidence
Manual Annotation: Assigning Names to Proteins
Annotation must be grounded with supporting evidence The process of manual annotation involves assessing all available evidence and reaching a conclusion about what you think the protein is doing in the cell and why. Functional annotations should only be as specific as the supporting evidence allows, you must have high confidence in the level of functional annotation you are asserting. We have evolved a layered set of naming conventions that allows us to reflect the level of functional specificity of which we are confident in the names and other annotations assigned to a protein. All evidence that led to the annotation conclusions that were made must be stored so that others can assess the annotation for themselves. In addition, detailed documentation of methodologies and general rules or guidelines used in any annotation process should be provided.
need to be above trusted to be considered part of the family
how specific is the HMM?
Look at Genome Properties analysis
Check for operon structure or other information from neighboring genes.
presence of a gene in an operon can supplement weak similarity evidence
Are there transmembrane regions?
Is there a signal peptide?
Are there any motifs that might give a clue to function?
Is there a paralogous family?
Functional Assignments: High Confidence in Precise Function Required evidence: -At least one good alignment (minimum 35% identity, over the full lengths of both proteins) to a protein from another organism that has been experimentally characterized, preferably multiple such alignments. -Hits to appropriate HMMs above the trusted cutoff. -Conservation of active sites, binding sites, appropriate number of membrane spans, etc. Action: Give the protein a specific name and accompanying gene symbol, this is the only confidence level where we assign gene symbols. We default to E. coli gene symbols when possible, for Gram positive genes we use B. subtilis gene symbols. Example: name: “adenylosuccinate lyase” gene symbol: purB EC number: 220.127.116.11
Functional Assignments: High Confidence in General Function, Unsure of Substrate Specificity A good example of this is seen with transporters: If you can definitively identify a specific substrate: “ ribose ABC transporter, permease protein” -Hit scoring above trusted to an equivalog HMM for one substrate -high quality match to an experimentally characterized protein in the BER with that substrate specificity If you can identify a substrate group: “ sugar ABC transporter, permease protein” -Multiple characterized hits to a specific type of transporter in the BER results -Hits to appropriate HMMs for that type of transporter -The substrate identified for in these matches are not all be the same, but may fall into a group, for example they are all sugars. If you can only identify the type of transporter and nothing about substrate specificity: “ ABC transporter, permease protein” -Hits to HMMs defining the transporter family but without substrate specificity -Matches to characterized proteins in the BER where the substrates identified do not fall into a group ------------ Another example of a known function but not exact substrate: “ carbohydrate kinase, FGGY family”
Functional Assignments : Precise Function Unclear The “family” designation: -No matches to specific characterized protein -score above trusted cutoff to an HMM which defines a family, but not a specific function. Example: “CbbY family protein ” The general function name: -no match to a specific enzyme -HMM match to a superfamily of functions Example: “kinase” The “homolog” designation, used when we want to show sequence relationship but not functional relationship: -good match to a function not expected in the organism (like a photosynthesis gene in a non-photosynthetic bug) OR -two matches exist in the genome to a particular function, however, one has much stonger evidence than the other. One is likely to be the “real” one while the other is likely not. Example: “galactokinase homolog” The “putative” designation, used when evidence is missing one small element but is otherwise strong: -has been largely replace by “homolog” and “family” Example: “putative galactokinase”
Functional Assignments: Hypotheticals If a protein has no matches to any protein from another species in BER, HMM, Prosite, or TmHMM it is called: “ hypothetical protein” This term is used since we are not actually sure the gene/protein is actually real. It has been predicted by Glimmer but no evidence exists that it is really expressed. --------------------------------------------------------- If a “hypothetical protein” from one species matches a “hypothetical protein” from another, they both now become: “ conserved hypothetical protein” These are really no longer “hypothetical” because it is very unlikely that two species will both retain a sequence if it is not an actual gene. Other centers use terms for this such as: “ protein of unknown function” “ unknown protein” NOTE: the above names will likely change soon as discussion is underway to agree on new names for these that would be used by the whole community.
Functional Assignments: Frameshifts and Point Mutations When a frameshift or in-frame stop codon is seen within a region of alignment in our BER search results, two possible things may have occurred: a sequencing error, or a mutation in the gene. When these are found, the underlying sequence data is manually reviewed for accuracy. Occasionally we find that the sequence we are using has an error (missed base, extra base, miscalled base). When these are corrected the frameshift/in-frame stop is resolved. If we find that the sequence is correct as reported to us, the protein is annotated to reflect the presence of a disruption in the open reading frame: “ great protein, authentic frameshift” “ fun protein, authentic point mutation” We do NOT use the term “pseudogene” as this implies that some experiment was done to confirm the lack of function of the gene. We have done no such experiments and indeed it may be that the protein may be functional in some truncated form, or that some read-through mechanism allows expression. Also, it may be that mutations we see only occurred in the cells grown for the DNA sample used for the sequencing project.
Functional Assignments: Other ORF disruptions -many FS / PM, “ degenerate” -”truncation” -”deletion” -”insertion” (~20-50aa) -interruption “ interruption-N” “ interruption-C” -”fusion” -”fragment” ! ! ! * * * These are given descriptive terms in the common name and all are put into a “Disrupted reading frame” role category to make them easy to find. Again we do not use the term “pseudogene”. (some genes) N-term C-term
TIGR Roles Amino acid biosynthesis Purines, pyrimidines, nucleosides, and nucleotides Fatty acid and phospholipid metabolism Biosynthesis of cofactors, prosthetic groups, and carriers Central intermediary metabolism Energy metabolism Transport and binding proteins DNA metabolism Transcription Protein synthesis Protein Fate Regulatory Functions Signal Transduction Cell envelope Cellular processes Other categories Unknown Hypothetical Disrupted Reading Frame Unclassified ( not a real role ) Role Notes: Notes written by annotators expert in each role category to aid other annotators in knowing what belongs in that category and the JCVI naming conventions for it. AutoAnnotate makes a first pass at assigning role, based on roles associated with HMMs or with match proteins. Human annotator checks and adjusts as necessary. TIGR bacterial roles were first adapted from Monica Riley’s roles for E. coli - both systems have since undergone much change.
What is the Gene Ontology? and Why is it useful?
GO offers an annotation capture system that allows the unambiguous communication of annotation information in a format readable by both computers and humans and which facilitates data exchange.
Traditionally, annotations have been captured in “name”, “gene symbol”, “comment” and perhaps “EC number” or “comment” fields
this mode of capturing information is hard to standardize (just think of the many nomenclature battles that have gone on over the years).
generally the name field is all that is seen when people view annotations (think of a Blast search).
comment fields, although they can store a lot of detailed information, rarely are made to conform to format standards and it is difficult for a computer to search effectively since text-matching search tools will be forever challenged by free text and the inconsistencies of human language
Do we use precise, consistent terminology in biology?
DNA, RNA, protein and many other terms are universally understood by biologists
but not always
there are numerous instances of the same thing being known by more than one term
there are numerous instances of different things being known by the same term
Enzymes pyruvate kinase and phosphoenol transphosphorylase both are names for the reaction: ATP + pyruvate = ADP + phosphoenolpyruvate The Enzyme Commission has standardized enzyme reactions and provides id numbers and alternate names
a set of official terms, each with a precise definition
solves problems associated with
synonyms (two different words/phrases mean the same thing)
homographs (the same word/phrase means more than one thing)
allows one to use precise and consistent terminology in describing something
GO consists of 3 separate controlled vocabularies The vocabularies are used to describe 3 aspects of a gene products existence: what it does (molecular function), why it does what it does (biological process), and where it does what it does (cellular component)
I quote from Wikipedia: “In both computer science and information science, an ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain.”
the set of concepts are the three controlled vocabularies of the GO
the domains involved are the aspects of the activities of gene products: what, why, and where
every term is connected to other terms in the vocabulary using one of a set of defined relationship types
Each GO term has a relationship to at least one other term
“ biological process”, “molecular function”, and ‘cellular component” are roots
terms always have at least one parent (accept the roots)
a term may have children terms, children terms are more specific than their parents
a term may have sibling terms, siblings share the same parent
as one moves down the tree from parents to children, to grandchildren, the functions, processes, and structures become more specific (or granular) in nature
terms may have more than one parent
ribokinase “is a” kinase
periplasm is “part of” a cell
General GO tree structure Root term general term less general term more specific term even more specific term most specific term
Examples Molecular function catalytic activity kinase activity carbohydrate kinase activity ribokinase activity Biological process cellular process metabolism carbohydrate metabolism ribose metabolism transporter activity carbohydrate transporter activity ribose transporter activity lets look at just one branch: now two branches:
Definition : Enables the directed movement of amino acids, organic acids containing one or more amino substituents, into, out of, within or between cells.
parent term : amine transporter activity (GO:0005275)
relationship to parent : “is a”
parent term : carboxylic acid transporter activity (GO:0046943)
relationship to parent : “is a”
amine group carboxylic acid group
Tree view in AmiGO of a multi-parent term: Each node in an AmiGO tree can be expanded or collapsed as needed by clicking on the +/- in the box at the beginning of each line. The letter in the green circle (I/P) indicates the relationship between the term and its parent. The numbers in parentheses at the end of each line indicate the number of gene products in the GO repository annotation collection which have been annotated to that term or one of its children.
True Path Rule The pathway from a term all the way up to its top-level parent(s) must always be true for any gene product annotated to that term. cell organelle mitochondrian proton-transporting ATP synthase complex Incorrect for Bacteria cell intracellular proton-transporting ATP synthase complex plasma membrane proton-transporting ATP synthase complex mitochondrial proton-transporting ATP synthase complex membrane plasma membrane plasma membrane proton-transporting ATP synthase complex organelle mitochondrion mitochondrial inner membrane mitochondrial proton-transporting ATP synthase complex Correct for Bacteria (and Eukaryotes) NOTE: these trees are abbreviated versions of the actual ones
need to be above trusted to be considered part of the family
how specific is the HMM?
Look at Genome Properties analysis
Check for operon structure or other information from neighboring genes.
presence of a gene in an operon can supplement weak similarity evidence
Are there transmembrane regions?
Is there a signal peptide?
Are there any motifs that might give a clue to function?
Is there a paralogous family?
Decide what you think the protein does, why it does it, and where it does it.
Functional confidence captured with GO GO trees (very abbreviated) Function catalytic activity kinase activity carbohydrate kinase activity ribokinase activity glucokinase activity fructokinase activity Process metabolism carbohydrate metabolism monosaccharide metabolism hexose metabolism glucose metabolism fructose metabolism pentose metabolism ribose metabolism Available evidence for 3 genes #1 -good match to an HMM specific for “ribokinase” -a high-quality BER match to an experimentally characterized ribokinase #2 -good match to an HMM for “kinase” -a high-quality BER match to an experimentally characterized “glucokinase’ AND a ‘fructokinase’ #3 -good match to an HMM for “kinase’
Once you know what the protein does, why it does it, and where it does it, you need to find the terms to describe your knowledge about the protein.
mapping files (just 2 examples out of many)
ec2go - EC numbers and corresponding GO terms
Tigrfams2go - GO terms assigned to TIGR HMMs
Genome Property suggestions
available in Manatee and on the Comprehensive Microbial Resource
Search the ontologies with your favorite tree viewing tool
Manatee has an integrated GO viewer
AmiGO is the tool available on the GO web site
there are numerous other GO viewers, most listed on the GO web site
Search against proteins already annotated to GO
GOst (at GO web page)
Manatee search option (in “Select Function” menu)
(but don’t contribute to transitive annotation errors with this one - make sure you view the terms you find as suggestions that need to be confirmed with your evidence any your knowledge of the organism)
Try to get a term from every ontology at the level of specificity you are confidant of. If there is no information on what the function, process, and/or component are, or if there is insufficient information to make even the most general of annotations, then the function, process, and/or component is “unknown”. This is indicated by an annotation to the root term(s).
Assign as many terms as are appropriate to completely describe what is known about the protein. This is likely to result in assignment of multiple terms from one or more ontologies.
It is vitally important to store the evidence that was used to make annotations
method (direct assay, mutant phenotype, etc.)
references where experiments/analysis were done
methods of manual curation
accession numbers of similar sequences, HMMs, etc. that were used in analysis
references describing method of curation
GO has a format for storage of this information
evidence codes, references, “with”
GO Evidence A key element of a good annotation system is the storage of evidence:
Assign “Evidence Code” - evidence codes tell users what kind of evidence was used
“ inferred from sequence similarity” = ISS
Various experimental characterizations - IMP, IDA, etc.
“ inferred from electronic annotation” = IEA - immediately allows users to tell manual curation from automatic
Assign “Reference” - describes source of the annotation
PMID of paper describing characterization or method used for annotation
Standard GO reference set - submitted by various groups, text descriptions of their methods - JCVI has submitted three (HMM, pairwise match, operon). These can be found here: www.geneontology.org/doc/GO.references
Assign “with” (when appropriate)
Used with ISS to store the accession of the thing the sequence similarity is with
Used with IPI /IGI (inferred from physical interaction/genetic interaction)
“ contributes to” - use this modifier when you assign a function term representing the function of a complex to proteins that are part of the complex but can not themselves carry out the function of the complex
NOT - went you want to assert that a gene product should NOT get an annotation
“ co-localizes_with” - gene product transiently or peripherally associated
• IDA inferred from direct assay - Enzyme assays - In vitro reconstitution (e.g. transcription) - Immunofluorescence (for cellular component) - Cell fractionation (for cellular component) - Physical interaction/binding • IEP inferred from expression pattern - Transcript levels (e.g. Northerns, microarray data) - Protein levels (e.g. Western blots) • IGI inferred from genetic interaction - "Traditional" genetic interactions such as suppressors, synthetic lethals, etc. - Functional complementation - Rescue experiments - Inference about one gene drawn from the phenotype of a mutation in a different gene. • IMP inferred from mutant phenotype - Any gene mutation/knockout - Overexpression/ectopic expression of wild-type or mutant genes - Anti-sense experiments - RNAi experiments - Specific protein inhibitors • IPI inferred from physical interaction - 2-hybrid interactions - Co-purification/Co-immunoprecipitation - Ion/protein binding experiments • ISS inferred from sequence or structural similarity - Sequence similarity (to characterized proteins, domains, motifs) - Structural similarity • IGC inferred from genomic context -operon structure -syntenic regions -pathway/system reconstruction • ND no biological data available (used when annotating to the root terms) • IC inferred by curator • TAS/NAS traceable/nontraceable author statement • IEA inferred from electronic annotation http://www.geneontology.org/GO.evidence.html GO Evidence codes
GO Standard References GO_REF:0000011 A Hidden Markov Model (HMM) is a statistical representation of patterns found in a data set. When using HMMs with proteins, the HMM is a statistical model of the patterns of the amino acids found in a multiple alignment of a set of proteins called the "seed". Seed proteins are chosen based on sequence similarity to each other. Seed members can be chosen with different levels of relationship to each other. They can be members of a superfamily (ex. ABC transporter, ATP-binding proteins), they can all share the same exact specific function (ex. biotin synthase) or they could share another type of relationship of intermediate specificity (ex. subfamily, domain). New proteins can be scored against the model generated from the seed according to how closely the patterns of amino acids in the new proteins match those in the seed. There are two scores assigned to the HMM which allow annotators to judge how well any new protein scores to the model. Proteins scoring above the "trusted cutoff" score can be assumed to be part of the group defined by the seed. Proteins scoring below the "noise cutoff" score can be assumed to NOT be a part of the group. Proteins scoring between the trusted and noise cutoffs may be part of the group but may not. One of the important features of HMMs is that they are built from a multiple alignment of protein sequences, not a pairwise alignment. This is significant, since shared similarity between many proteins is much more likely to indicate shared functional relationship than sequence similarity between just two proteins. The usefulness of an HMM is directly related to the amount of care that is taken in chosing the seed members, building a good multiple alignment of the seed members, assessing the level of specificity of the model, and choosing the cutoff scores correctly. GO_REF:0000012 Pairwise alignments are generated by taking two sequences and aligning them so that the maximum number of amino acids in each protein match, or are similar to, each other. Tools such as BLAST work by comparing a protein-of-interest individually with every protein in a database of known protein sequences and retaining only those matches with a high probability of being significant. Basic BLAST generates local alignments between proteins for regions of high similarity. Other pairwise alignment tools attempt to generate global (full-length) protein alignments. GO_REF:0000025 Genes in prokaryotic organisms are often arranged in operons. Genes in an operon are all transcribed into one mRNA. Generally the genes in the operons code for proteins that all have related functions. For example, they may be the steps in a biochemical pathway, or they may be the subunits of a protein complex. Often the genes in operons shared between organisms are syntenic; that is, the same genes are in the same order in the operon in different species. When assessing sequence-comparison-based evidence during the process of manual annotation of a genome, it is often the case that some of the genes in the operon will have strong sequence-based evidence while others will have weak evidence. If seen alone, not in the presence of an operon, the weak evidence in question may not be sufficient to make a functional annotation. However, in the presence of an operon in which there is strong evidence for some of the genes, the very presence of the gene in the operon is a strong indication that the gene shares in the process carried out by the operon. If the putative function is one expected to exist for the process in question and particularly if that function has been observed in the same operon in another species, then the annotation should be made. This type of evidence is inferred from the context of the gene in an operon, and therefore the evidence code is IGC "inferred from genomic context."
Accession Format All objects (papers, proteins, etc.) used in evidence must be represented according to GO’s format: “ database:accession” where “database” is an abbreviation defined at GO for the database housing the object, and “accession” is the accession number or other identifier for the object Each database must have a registered abbreviation with GO. New abbreviations can be requested at any time. The current file can be viewed at: www.geneontology.org/doc/GO.xrf_abbs
abbreviation: JCVI database: J Craig Venter Institute generic_url: http://www.jcvi.org/ abbreviation: JCVI_CMR database: Comprehensive Microbial Resource object: Locus example_id: JCVI_CMR:VCA0557 generic_url: http://www.jcvi.org/ url_syntax: http://www.jcvi.org/tigr-scripts/CMR2/GenePage.spl?locus= url_example: http://www.jcvi.org/tigr-scripts/CMR2/GenePage.spl?locus=VCA0557 abbreviation: TIGR_TIGRFAMS database: J Craig Venter Institute, TIGRFAMs HMM collection object: Accession example_id: TIGR_TIGRFAMS:TIGR00254 generic_url: http://www.jcvi.org/ url_syntax: http://www.jcvi.org/tigr-scripts/CMR2/hmm_report.spl?acc= url_example: http://www.jcvi.org/tigr-scripts/CMR2/hmm_report.spl?acc=TIGR00254 Example Abbreviations from GO file
From the literature…… Enzymes involved in DNA ligation and end-healing in the radioresistant bacterium Deinococcus radiodurans Blasius, et al, 2007, BMC Molecular Biology “ Results: in this report, we describe the cloning and expression of a Deinococcus radiodurans DNA ligase in Escherichia coli . This enzyme efficiently catalysis DNA ligation in the presence of Mn(II) and NAD+ as cofactors…….” Read on in the text to see how they did the experiments and we find that they performed a direct assay on purified protein. This is IDA for the ligase activity. cellular component biological process molecular function aspect GO:0003909 PMID: 17705817 IC cytoplasm GO:0005737 N/A PMID: 17705817 IDA DNA ligation GO:0006266 N/A PMID: 17705817 IDA DNA ligase activity GO:0003909 with reference ev code term name GO id
Annotations resulting from this match (and others in the pathway for the biological process annotation) …. cellular component biological process molecular function aspect GO:0004807 GO_REF: 0000012 IC cytoplasm GO:0005737 TIGR_GenProp: GenProp0120 PMID: 15347579 IGC glycolysis GO:0006096 Swiss-Prot: P50921 GO_REF: 0000012 ISS triose-phosphate isomerase activity GO:0004807 with reference ev code term name GO id
Send annotation to GO to be placed in the repository of annotated genes to be a resource to the community (this is optional but benefits the whole community.)
All files are in the same format
15 column tab-delimited file
stores gene product identifiers, GO term identifiers and associated evidence
Facilitates use of data by everyone and development of tools that work on one consistent file format
All data in the repository is free to anyone who wishes to use it.
There are hundreds of thousands of gene product annotations available in the repository with more arriving all the time
Annotation (or Association) file format The annotation file consists of 15 tab-delimited columns: DB - the database submitting the annotation DB_Object_ID - unique identifier for the object being annotated DB_Object_Symbol - gene symbol or other symbol for the gene product Qualifier - one of: NOT, contributes_to, colocalizes_with Goid - the GO id number for the term annotated to the gene product DB:Reference - the reference which describes the annotation source Evidence Code - ISS, IEA, IMP, etc. With/From - accession number of item with sequence similarity, or physical interaction, etc. Aspect - one of: process, function, component stored as P, F, or C DB_Object_Name - common name of gene product (eg. “ribokinase”) Synonym - any alternate names for the gene product, can put gene symbol here too DB_Object_Type - what type of thing is being annotated (protein, gene, etc.) Taxon - the NCBI taxon id for the organism from which the gene product came Date - date the annotation was made Assigned_by - the database that made the annotation (same as the submitting db in most, but not all cases
Sequence and annotation submitted to GenBank at the time of publication
Updates sent as needed
Comprehensive Microbial Resource (CMR)
Data available for download
extensive analyses within and between genomes
Data available for download
whole repository searchable with AmiGO
Acknowledgements Director of bioinformatics: Granger Sutton Prokaryotic Annotation: Scott Durkin Bob Dodson Ramana Madupu Sagar Kothari Susmita Shrivastava Yinong Sebastian Derek Harkins And the many other TIGR/JCVI employees, present and past, who have contributed to the development of these tools and to the annotation protocols I have described. Also thanks to the funding agencies that make our work possible including NIH, NSF, and DOE. CMR/Pathema team: Tanja Davidsen Erin Kelsey BRC Project Manager: Lauren Brinkac HMM/Genome Properties team: Dan Haft Jeremy Selengut Building the tools: Nikhat Zafar Alex Richter and many more….