5. 5
The value of genome sequences lies in
their annotation
Annotation – Characterizing genomic
features using computational and
experimental methods
Genes: Four levels of annotation
Gene Prediction – Where are genes?
What do they look like?
Domains – What do the proteins do?
Role – Which pathway(s) are they involved in?
6. 6
How many genes?
Consortium: 35,000 genes?
Celera: 30,000 genes?
Affymetrix: 60,000 human genes on
GeneChips?
Incyte and HGS: over 120,000 genes?
GenBank: 49,000 unique gene coding
sequences?
UniGene: > 89,000 clusters of unique
ESTs?
7. 7
Current consensus (in flux …)
15,000 known genes (similarity to
previously isolated genes and expressed
sequences from a large variety of different
organisms)
17,000 predicted (GenScan, GeneFinder,
GRAIL)
Based on and limited to previous
knowledge
10. 10
What are genes? - 1
Complete DNA segments responsible for
making functional products
Products
Proteins
Functional RNA molecules
siRNA (small interfering RNA, used in RNAi)
rRNA (ribosomal RNA)
snRNA (small nuclear RNA)
snoRNA (small nucleolar RNA)
tRNA (transfer RNA)
11. 11
What are genes? - 2
Definition vs. dynamic concept
Consider
Prokaryotic vs. eukaryotic gene models
Introns/exons
Posttranscriptional modifications
Alternative splicing
Differential expression
Genes-in-genes
Genes-ad-genes
Posttranslational modifications
Multi-subunit proteins
12. 12
Prokaryotic gene model: ORF-genes
“Small” genomes, high gene density
Haemophilus influenzae genome 85% genic
Operons
One transcript, many genes
No introns.
One gene, one protein
Open reading frames
One ORF per gene
ORFs begin with start,
end with stop codon (def.)
TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
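The ORF definition on this slide (start codon, triplets, stop codon) can be sketched as a simple scan of the three forward reading frames. This is only an illustration, not how any particular ORF detector is implemented. The function name `find_orfs`, the `min_codons` threshold, and the choice of ATG as the only start codon are assumptions made for the example; a real tool would also scan the reverse strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the start codon; end is exclusive and
    includes the stop codon. min_codons counts start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 2)]
```

The single hit covers ATG AAA TTT TAG, which starts at index 2 and is therefore in frame 2.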
13. 13
Eukaryotic gene model: spliced genes
Posttranscriptional modification
5’-CAP, polyA tail, splicing
Open reading frames
Mature mRNA contains ORF
All internal exons contain open “read-through”
Pre-start and post-stop sequences are UTRs
Multiple translates
One gene – many proteins via alternative splicing
14. 14
Expansions and Clarifications
ORFs
Start – triplets – stop
Prokaryotes: gene = ORF
Eukaryotes: spliced genes or ORF genes
Exons
Remain after introns have been removed
Flanking parts contain non-coding
sequence (5’- and 3’-UTRs)
15. 15
Where do genes live?
In genomes
Example: human genome
Ca. 3,200,000,000 base pairs
25 chromosomes : 1-22, X, Y, mt
28,000-45,000 genes (current estimate)
128 nucleotides (RNA gene) – 2,800 kb (DMD)
Ca. 25% of the genome is genes (introns, exons)
Ca. 1% of genome codes for amino acids (CDS)
30 kb gene length (average)
1.4 kb ORF length (average)
3 transcripts per gene (average)
16. 16
Sample genomes
Species          Size      Genes    Genes/Mb
H.sapiens        3,200 Mb  35,000   11
D.melanogaster   137 Mb    13,338   97
C.elegans        85.5 Mb   18,266   214
A.thaliana       115 Mb    25,800   224
S.cerevisiae     15 Mb     6,144    410
E.coli           4.6 Mb    4,300    934
List of 68 eukaryotes, 141 bacteria, and 17 archaea at
http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html
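The Genes/Mb column is the gene count divided by the genome size in megabases. A short sketch using the sizes and gene counts from the table reproduces it; the names `GENOMES` and `genes_per_mb` are chosen here for illustration.

```python
# (genome size in Mb, gene count) as listed in the table above
GENOMES = {
    "H.sapiens": (3200, 35000),
    "D.melanogaster": (137, 13338),
    "C.elegans": (85.5, 18266),
    "A.thaliana": (115, 25800),
    "S.cerevisiae": (15, 6144),
    "E.coli": (4.6, 4300),
}


def genes_per_mb(size_mb, genes):
    """Gene density: number of genes per megabase of genome."""
    return genes / size_mb


for species, (size_mb, genes) in GENOMES.items():
    print(f"{species:15s} {genes_per_mb(size_mb, genes):6.1f} genes/Mb")
```

The E.coli value computes to about 934.8, so the 934 in the table appears to be truncated rather than rounded.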
17. 17
So much DNA – so “few” genes …
[Figure: genic vs. intergenic sequence]
18. 18
Genomic sequence features
Repeats (“Junk DNA”)
Transposable elements, simple repeats
RepeatMasker
Genes
Vary in density, length, structure
Identification depends on evidence and methods and
may require concerted application of bioinformatics
methods and lab research
Pseudogenes
Look-alikes of genes that obstruct gene-finding efforts.
Non-coding RNAs (ncRNA)
tRNA, rRNA, snRNA, snoRNA, miRNA
tRNAscan-SE, COVE
20. 20
Gene prediction through comparative genomics
Highly similar (conserved) regions
between two genomes are likely functional;
otherwise they would have diverged
If genomes are too closely related, all
regions are similar, not just genes
If genomes are too far apart, corresponding
regions may be too dissimilar to be found
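One simple way to make "highly similar regions" concrete is to slide a window along two already-aligned sequences and report windows whose percent identity exceeds a threshold. The sketch below assumes the alignment has been computed elsewhere, with `-` as the gap character. The function names, window size, and threshold are illustrative choices and are not taken from any specific comparative genomics tool.

```python
def window_identity(a, b, window=10):
    """Fraction of identical, non-gap positions in each sliding window."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for i in range(len(a) - window + 1):
        pairs = zip(a[i:i + window], b[i:i + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        scores.append(matches / window)
    return scores


def conserved_windows(a, b, window=10, threshold=0.8):
    """Start positions of windows whose identity is at least threshold."""
    scores = window_identity(a, b, window)
    return [i for i, s in enumerate(scores) if s >= threshold]


# Toy alignment: the first five columns agree, the last five do not.
print(conserved_windows("AAAAACCCCC", "AAAAAGGGGG", window=5))  # [0, 1]
```

With two nearly identical genomes every window passes the threshold, and with very distant genomes none does, which is the trade-off described on this slide.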
22. 22
Gene discovery using ESTs
Expressed Sequence Tags (ESTs)
represent sequences from expressed
genes.
If a region matches an EST with high
stringency, the region is probably a
gene or pseudogene.
An EST that overlaps an exon boundary gives
an accurate prediction of that boundary.
23. 23
Ab initio gene prediction
Prokaryotes
ORF-Detectors
Eukaryotes
Position, extent & direction: through promoter
and polyA-signal predictors
Structure: through splice site predictors
Exact location of coding sequences: through
determination of relationships between
potential start codons, splice sites, ORFs,
and stop codons
25. 25
How it works I – Motif identification
Exon-Intron Borders = Splice Sites
     Exon                Intron                 Exon
~~gaggcatcag|gtttgtagac~~~~~~~~~~~tgtgtttcag|tgcacccact~~
~~ccgccgctga|gtgagccgtg~~~~~~~~~~~tctattctag|gacgcgcggg~~
~~tgtgaattag|gtaagaggtt~~~~~~~~~~~atatctccag|atggagatca~~
~~ccatgaggag|gtgagtgcca~~~~~~~~~~~ttatttccag|gtatgagacg~~
       Splice site                     Splice site
     Exon                Intron                 Exon
~~gaggcatcag|GTttgtagac~~~~~~~~~~~tgtgtttcAG|tgcacccact~~
~~ccgccgctga|GTgagccgtg~~~~~~~~~~~tctattctAG|gacgcgcggg~~
~~tgtgaattag|GTaagaggtt~~~~~~~~~~~atatctccAG|atggagatca~~
~~ccatgaggag|GTgagtgcca~~~~~~~~~~~ttatttccAG|gtatgagacg~~
       Splice site                     Splice site
Motif Extraction Programs at http://www-btls.jst.go.jp/
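The GT...AG pattern highlighted above can be recovered by counting bases column by column across the aligned intron ends. The sketch below uses the four example introns from this slide; the helper names `position_counts` and `invariant_positions` are chosen for illustration. Four sequences are far too few for a usable model, so this only demonstrates the idea behind motif extraction.

```python
from collections import Counter

# First 10 intron bases (donor side) and last 10 intron bases
# (acceptor side) of the four examples on this slide.
donor_sites = ["gtttgtagac", "gtgagccgtg", "gtaagaggtt", "gtgagtgcca"]
acceptor_sites = ["tgtgtttcag", "tctattctag", "atatctccag", "ttatttccag"]


def position_counts(sites):
    """One Counter of base frequencies per alignment column."""
    return [Counter(column) for column in zip(*sites)]


def invariant_positions(sites):
    """Map column index -> base for columns where all sites agree."""
    return {
        i: next(iter(counts))
        for i, counts in enumerate(position_counts(sites))
        if len(counts) == 1
    }


print(invariant_positions(donor_sites))     # {0: 'g', 1: 't', 4: 'g'}
print(invariant_positions(acceptor_sites))  # {5: 't', 8: 'a', 9: 'g'}
```

Columns 0 and 1 of the donor side are the GT, and columns 8 and 9 of the acceptor side are the AG. The other invariant columns may simply reflect the small sample.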
26. 26
How it works II - Movies
Pribnow-Box Finder 0/1
Pribnow-Box Finder all
28. 28
Gene prediction programs
Rule-based programs
Use explicit set of rules to make decisions.
Example: GeneFinder
Neural Network-based programs
Use data set to build rules.
Examples: GRAIL, GrailEXP
Hidden Markov Model-based programs
Use probabilities of states and transitions
between these states to predict features.
Examples: Genscan, GenomeScan
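To illustrate "probabilities of states and transitions", here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. It is not the model used by Genscan or GenomeScan. The two states, the GC-rich "coding" emissions, and every probability value are invented for this example.

```python
import math

STATES = ("noncoding", "coding")
# All probabilities below are made-up illustration values.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    seq = seq.upper()
    if not seq:
        return []
    # scores[s]: best log-probability of any path ending in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi("ATATATATATGCGCGCGCGC")
print(path.count("noncoding"), path.count("coding"))  # 10 10
```

With these numbers the AT-rich half is labelled noncoding and the GC-rich half coding. Real gene finders use many more states (exons by phase, introns, UTRs, promoters) and parameters trained on known genes.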
29. 29
Evaluating prediction programs
Sensitivity vs. Specificity
Sensitivity
How many genes were found out of all
present?
Sn = TP/(TP+FN)
Specificity
How many predicted genes are indeed genes?
Sp = TP/(TP+FP)
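A worked example of the two formulas, using hypothetical counts. The values 80, 20 and 40 are invented for illustration. Note that Sp as defined on this slide, TP/(TP+FP), is the quantity called precision or positive predictive value in other fields.

```python
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN): fraction of real genes that were found."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Sp = TP / (TP + FP): fraction of predictions that are real genes."""
    return tp / (tp + fp)


# Hypothetical benchmark: 100 real genes, 80 of them predicted,
# plus 40 predictions that do not correspond to a real gene.
tp, fn, fp = 80, 20, 40
print(f"Sn = {sensitivity(tp, fn):.2f}")  # Sn = 0.80
print(f"Sp = {specificity(tp, fp):.2f}")  # Sp = 0.67
```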
30. 30
Gene prediction accuracies
Nucleotide level: 95% Sn, 90% Sp (lows below 50%)
Exon level: 75% Sn, 68% Sp (lows below 30%)
Gene level: 40% Sn, 30% Sp (lows below 10%)
Programs that combine statistical evaluations with
similarity searches are the most powerful.
31. 31
Common difficulties
First and last exons difficult to annotate
because they contain UTRs.
Small genes often do not reach statistical
significance, so they are discarded.
Algorithms are trained with sequences from
known genes, which biases them against genes
about which nothing is known.
Masking repeats frequently removes potentially
indicative chunks from the untranslated regions
of genes that contain repetitive elements.
32. 32
The annotation pipeline
Mask repeats using RepeatMasker.
Run sequence through several programs.
Take predicted genes and do similarity
search against ESTs and genes from
other organisms.
Do similarity search for non-coding
sequences to find ncRNA.
33. 33
Annotation nomenclature
Known Gene – Predicted gene matches the
entire length of a known gene.
Putative Gene – Predicted gene contains region
conserved with known gene. Also referred to as
“like” or “similar to”.
Unknown Gene – Predicted gene matches a
gene or EST whose function is not known.
Hypothetical Gene – Predicted gene that does
not contain significant similarity to any known
gene or EST.
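The four labels can be read as a small decision rule over the best similarity hit for a predicted gene. The sketch below is one possible reading. The `BestHit` record, its two fields, and the order in which the rules are checked are assumptions made for illustration, since the slide does not specify how overlapping cases are resolved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestHit:
    """Hypothetical summary of the best match to a known gene or EST."""
    full_length: bool     # match covers the entire known gene
    function_known: bool  # the matched gene/EST has a known function


def annotation_label(hit: Optional[BestHit]) -> str:
    """Map similarity evidence to one of the four nomenclature labels."""
    if hit is None:
        return "Hypothetical Gene"  # no significant similarity at all
    if not hit.function_known:
        return "Unknown Gene"       # match exists, function unknown
    if hit.full_length:
        return "Known Gene"         # full-length match to a known gene
    return "Putative Gene"          # only a conserved region matches


print(annotation_label(BestHit(full_length=False, function_known=True)))
# Putative Gene
```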