2. 2
Prokaryotic gene model: ORF-genes
• “Small” genomes, high gene density
• Haemophilus influenzae genome is 85% genic
• Operons
• One transcript, many genes
• No introns.
• One gene, one protein
• Open reading frames
• One ORF per gene
• ORFs begin with a start codon and
end with a stop codon (by definition)
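The ORF definition above can be sketched in a few lines of Python. This is a minimal illustration, not a real gene finder: it assumes ATG as the only start codon and TAA/TAG/TGA as stop codons, scans only the forward strand, and the function name `find_orfs` and the `min_codons` parameter are made up for this example.

```python
# Minimal forward-strand ORF scan (illustrative sketch only).
START = "ATG"                      # assumption: ATG is the only start codon
STOPS = {"TAA", "TAG", "TGA"}      # standard stop codons

def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs; end is exclusive and includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF
    return sorted(orfs)

example = "CCATGAAATTTTAGGG"
print(find_orfs(example))          # [(2, 14)] -> ATGAAATTTTAG
```

A real prokaryotic ORF detector would also scan the reverse complement and consider alternative start codons.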
3. 3
Eukaryotic gene model: spliced genes
Posttranscriptional modification
5’-CAP, polyA tail, splicing
Open reading frames
Mature mRNA contains ORF
All internal exons are open throughout (“read-through”, no in-frame stop codon)
Pre-start and post-stop sequences are UTRs
Multiple translation products
One gene – many proteins via alternative splicing
4. 4
Where do genes live?
• In genomes
• Example: human genome
• Ca. 3,200,000,000 base pairs
• 23 pairs of chromosomes (autosomes 1-22 plus X/Y), and mitochondrial DNA (mt)
• 20,000-25,000 genes (current estimate)
• Gene length ranges from 128 nucleotides (an RNA gene) to 2,800 kb (DMD)
• Ca. 25% of the genome is genic (introns and exons)
• Ca. 1% of genome codes for amino acids (CDS)
• 30 kb gene length (average)
• 1.4 kb ORF length (average)
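The figures on this slide can be cross-checked against each other with simple arithmetic. The sketch below assumes the upper gene estimate (25,000) and treats the averages as exact, which they are not; the variable names are chosen for this example only.

```python
# Back-of-the-envelope consistency check of the slide's human genome figures.
GENOME_BP = 3_200_000_000   # ca. genome size in base pairs
GENES = 25_000              # assumption: upper end of the 20,000-25,000 estimate
AVG_GENE_BP = 30_000        # average gene length (introns + exons)
AVG_ORF_BP = 1_400          # average ORF length

def fraction(total_bp):
    """Share of the genome covered by total_bp base pairs."""
    return total_bp / GENOME_BP

genic = fraction(GENES * AVG_GENE_BP)
coding = fraction(GENES * AVG_ORF_BP)
print(f"genic fraction:  {genic:.1%}")   # close to the quoted ca. 25%
print(f"coding fraction: {coding:.1%}")  # close to the quoted ca. 1%
```

Under these assumptions the averages reproduce the quoted genic and coding percentages to within rounding.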
5. 5
Sample genomes
Species            Size (Mb)   Genes     Genes/Mb
H. sapiens         3,200       35,000    11
D. melanogaster    137         13,338    97
C. elegans         85.5        18,266    214
A. thaliana        115         25,800    224
S. cerevisiae      15          6,144     410
E. coli            4.6         4,300     934
List of 68 eukaryotes, 141 bacteria, and 17 archaea at
http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html
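The Genes/Mb column in the table is simply gene count divided by genome size in megabases. A short sketch that recomputes it from the table's own numbers (the dictionary layout and the function name `gene_density` are illustrative choices):

```python
# Recompute gene density (genes per Mb) from the table's size and gene counts.
genomes = {
    "H. sapiens": (3200, 35000),
    "D. melanogaster": (137, 13338),
    "C. elegans": (85.5, 18266),
    "A. thaliana": (115, 25800),
    "S. cerevisiae": (15, 6144),
    "E. coli": (4.6, 4300),
}

def gene_density(size_mb, genes):
    """Genes per megabase."""
    return genes / size_mb

for species, (size_mb, genes) in genomes.items():
    print(f"{species:16s} {gene_density(size_mb, genes):7.1f} genes/Mb")
```

The density rises by almost two orders of magnitude from human to E. coli, which is the point of the table.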
6. 6
Genomic sequence features
• Repeats (“Junk DNA”)
• Transposable elements, simple repeats
• RepeatMasker
• Genes
• Vary in density, length, structure
• Identification depends on evidence and methods and may
require concerted application of bioinformatics methods
and lab research
• Pseudogenes
• Gene look-alikes that obstruct gene-finding efforts.
• Non-coding RNAs (ncRNA)
• tRNA, rRNA, snRNA, snoRNA, miRNA
• tRNAscan-SE, COVE
8. 8
Gene prediction through comparative genomics
• Highly similar (conserved) regions between two genomes are
likely functional; otherwise they would have diverged
• If genomes are too closely related all regions are
similar, not just genes
• If genomes are too far apart, analogous regions may
be too dissimilar to be found
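One simple way to look for conserved regions is to slide a window along two aligned sequences and report windows with high identity. The sketch below assumes the sequences are already aligned to equal length; the window size, the 90% threshold, and the function names are arbitrary choices for illustration, not values from the slides.

```python
# Sliding-window identity over two pre-aligned sequences (illustrative sketch).
def window_identity(a, b, window=10):
    """Fraction of identical, non-gap positions in each window."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    scores = []
    for i in range(len(a) - window + 1):
        pairs = zip(a[i:i + window], b[i:i + window])
        matches = sum(x == y and x != "-" for x, y in pairs)
        scores.append(matches / window)
    return scores

def conserved_windows(a, b, window=10, threshold=0.9):
    """Start positions of windows whose identity reaches the threshold."""
    scores = window_identity(a, b, window)
    return [i for i, s in enumerate(scores) if s >= threshold]

# Toy data: the first 10 bases are identical, the rest has diverged.
seq_a = "ATGGCCAAGC" + "TTTTTTTTTT"
seq_b = "ATGGCCAAGC" + "AAGGCCGGAA"
print(conserved_windows(seq_a, seq_b))   # [0, 1]
```

If the two genomes are too close, nearly every window passes; if they are too distant, almost none do, which mirrors the two caveats above.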
10. 10
Gene discovery using ESTs
• Expressed Sequence Tags (ESTs) represent
sequences from expressed genes.
• If a region matches an EST with high stringency, then the
region is probably a gene or pseudogene.
• An EST that spans an exon boundary gives an accurate
prediction of that boundary.
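How an EST that spans two exons pins down an exon boundary can be shown with a toy example. The sketch below is a strong simplification: it uses exact string matching only (no mismatches), handles a single intron, and assumes the intron starts with GT and ends with AG to resolve ambiguous split points. The function name `map_spliced_est` and the `min_exon` parameter are invented for this illustration; real tools use spliced alignment.

```python
# Toy spliced mapping of an EST onto genomic DNA (exact matches, one intron).
def map_spliced_est(genome, est, min_exon=4):
    """Return (exon1_start, exon1_end, exon2_start, exon2_end), ends exclusive."""
    for k in range(min_exon, len(est) - min_exon + 1):
        left, right = est[:k], est[k:]          # try every split point of the EST
        p = genome.find(left)
        while p != -1:
            donor = p + k                        # first intron base if split is right
            q = genome.find(right, donor + 4)    # leave room for at least GT..AG
            while q != -1:
                intron = genome[donor:q]
                # Assumption: canonical GT-AG intron.
                if intron.startswith("GT") and intron.endswith("AG"):
                    return (p, donor, q, q + len(right))
                q = genome.find(right, q + 1)
            p = genome.find(left, p + 1)
    return None

genome = "GAGGCATCAG" + "GTTTGTAGACTGTGTTTCAG" + "TGCACCCACT"  # exon, intron, exon
est = "GCATCAG" + "TGCACC"                                      # spliced transcript piece
print(map_spliced_est(genome, est))   # (3, 10, 30, 36)
```

In this example the earlier split points also match the genome exactly, but only the split at position 10 leaves an intron with GT and AG ends, so the boundary is placed there.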
11. 11
Ab initio gene prediction
• Prokaryotes
• ORF-Detectors
• Eukaryotes
• Position, extent & direction: through promoter and polyA-
signal predictors
• Structure: through splice site predictors
• Exact location of coding sequences: through
determination of relationships between potential start
codons, splice sites, ORFs, and stop codons
13. 13
How it works I – Motif identification
Exon-Intron Borders = Splice Sites
Exon Intron Exon
~~gaggcatcag|gtttgtagac~~~~~~~~~~~tgtgtttcag|tgcacccact~~
~~ccgccgctga|gtgagccgtg~~~~~~~~~~~tctattctag|gacgcgcggg~~
~~tgtgaattag|gtaagaggtt~~~~~~~~~~~atatctccag|atggagatca~~
~~ccatgaggag|gtgagtgcca~~~~~~~~~~~ttatttccag|gtatgagacg~~
Splice site Splice site
Exon Intron Exon
~~gaggcatcag|GTttgtagac~~~~~~~~~~~tgtgtttcAG|tgcacccact~~
~~ccgccgctga|GTgagccgtg~~~~~~~~~~~tctattctAG|gacgcgcggg~~
~~tgtgaattag|GTaagaggtt~~~~~~~~~~~atatctccAG|atggagatca~~
~~ccatgaggag|GTgagtgcca~~~~~~~~~~~ttatttccAG|gtatgagacg~~
Splice site Splice site
Motif Extraction Programs at http://www-btls.jst.go.jp/
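The motif in the alignment above can be recovered mechanically by looking for columns where all sequences agree. The sketch below uses the ten intron bases on each side of the four example introns; the function name `invariant_positions` is made up for this illustration, and with only four sequences some columns agree by chance.

```python
# Find invariant columns in the aligned splice-site examples from the slide.
donor_sites = ["gtttgtagac", "gtgagccgtg", "gtaagaggtt", "gtgagtgcca"]     # intron starts
acceptor_sites = ["tgtgtttcag", "tctattctag", "atatctccag", "ttatttccag"]  # intron ends

def invariant_positions(sites):
    """Return {column index: base} for columns where every site has the same base."""
    return {i: col[0] for i, col in enumerate(zip(*sites)) if len(set(col)) == 1}

print(invariant_positions(donor_sites))     # {0: 'g', 1: 't', 4: 'g'}
print(invariant_positions(acceptor_sites))  # {5: 't', 8: 'a', 9: 'g'}
```

The GT at the start and the AG at the end of the intron stand out as expected; the other invariant columns may or may not hold up in a larger sample, which is why motif extraction programs work with position-specific frequencies rather than strict invariance.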