1. Genome Annotation
Delivered by
Muhammad Tajammal Khan
M.Phil (Botany)
11-arid-3759
2. Definition: Genome annotation is the process of
interpreting raw sequence data into useful biological
information. Annotations describe the genome and
transform raw genome sequences into biological
information by integrating computational analyses,
other biological data and biological expertise.
3. Unannotated DNA (5' to 3') versus annotated DNA
Legend:
Exon (protein coding)
Intron
Intergenic sequence
4. Annotation may be
Structural annotation
ORFs and their localisation (http://www.ncbi.nlm.nih.gov/gorf/gorf.html)
gene structure
coding regions
location of regulatory motifs
Functional annotation
biochemical function
biological function
regulation and interactions involved
expression
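The ORF localisation step listed under structural annotation can be sketched in a few lines of Python. This is a toy scan under simplifying assumptions (ATG as the only start codon, the three standard stop codons, no introns); the function names and the `min_codons` parameter are illustrative and are not taken from the NCBI ORF Finder.

```python
# Toy ORF scan: forward strand and reverse complement, three frames each.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end) for each ATG...stop ORF.

    start/end are 0-based, end-exclusive offsets on the scanned strand
    (for strand '-', offsets refer to the reverse complement).
    min_codons counts every codon, including ATG and the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs
```

Calling `find_orfs` on a sequence returns candidate coding regions only; deciding which of them are real genes still needs the evidence-based steps covered later.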
5. What are we looking to annotate?
CDS
mRNA
Promoter and Poly-A Signal
Pseudogenes
ncRNA
8. Two approaches to genome sequencing
Whole Genome Shotgun
An approach used to decode an organism's genome
by shredding it into smaller fragments of DNA which
can be sequenced individually. The sequences of these
fragments are then ordered, based on overlaps in the
genetic code, and finally reassembled into the complete
sequence. The 'whole genome shotgun' (WGS) method is
applied to the entire genome all at once, while the
'hierarchical shotgun' method is applied to large,
overlapping DNA fragments of known location in
the genome.
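The idea of ordering fragments by their overlaps and reassembling them can be illustrated with a greedy merge. This is only a sketch, assuming error-free reads from a single strand and no repeats longer than the overlaps; real assemblers use far more robust methods. The names `overlap`, `greedy_assemble` and `min_len` are made up for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


# Three overlapping fragments of a short made-up sequence.
print(greedy_assemble(["ATGGCGTA", "CGTACGTT", "CGTTAGC"]))
```

If no overlaps of at least `min_len` bases remain, the function returns several sequences, which corresponds to an assembly left as separate contigs.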
9. Two approaches to genome sequencing
Hierarchical shotgun method
Assemble contigs from various chromosomes, then sequence and assemble
them. A contig is a set of overlapping clones or sequences from which a
sequence can be obtained.
A contig is thus a chromosome map showing the locations of those regions of
a chromosome where contiguous DNA segments overlap. Contig maps are
important because they provide the ability to study a complete, and often
large segment of the genome by examining a series of overlapping clones
which then provide an unbroken succession of information about
that region.
12. 1. Prepare genomic DNA
2. Attach DNA to surface
3. Bridge amplification
4. Fragments become double stranded
5. Denature the double-stranded molecules
6. Complete amplification
Randomly fragment genomic DNA and ligate
adapters to both ends of the fragments.
(Figure labels: DNA, adapters)
13. 2. Attach DNA to surface
Bind single-stranded fragments randomly to
the inside surface of the flow cell channels.
(Figure labels: adapter, DNA fragment, dense lawn of primers)
14. 3. Bridge amplification
Add unlabeled nucleotides and enzyme to
initiate solid-phase bridge amplification.
15. 4. Fragments become double stranded
The enzyme incorporates nucleotides to
build double-stranded bridges on the
solid-phase substrate.
(Figure labels: attached terminus, free terminus, attached terminus)
16. 5. Denature the double-stranded molecules
Denaturation leaves single-stranded
templates anchored to the substrate.
(Figure labels: attached, attached)
17. 6. Complete amplification
Several million dense clusters of
double-stranded DNA are generated in
each channel of the flow cell.
(Figure label: clusters)
18. 7. Determine first base
8. Image first base
9. Determine second base
10. Image second chemistry cycle
11. Sequencing over multiple chemistry cycles
12. Align data
The first sequencing cycle begins by
adding four labeled reversible terminators,
primers, and DNA polymerase.
(Figure label: laser)
19. 8. Image first base
After laser excitation, the emitted
fluorescence from each cluster is captured
and the first base is identified.
20. 9. Determine second base
The next cycle repeats the incorporation
of four labeled reversible terminators,
primers, and DNA polymerase.
(Figure label: laser)
21. 10. Image second chemistry cycle
After laser excitation, the image is
captured as before, and the identity of
the second base is recorded.
22. 11. Sequencing over multiple chemistry cycles
The sequencing cycles are repeated to
determine the sequence of bases in a
fragment, one base at a time.
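The cycle-by-cycle reading of bases can be mimicked with a small sketch: for each chemistry cycle, take the channel with the strongest fluorescence as the base call. The channel order A, C, G, T and the intensity values below are made up for illustration, and real base callers also correct for effects such as signal overlap between channels.

```python
BASES = "ACGT"


def call_bases(intensities):
    """Call one base per cycle from per-channel fluorescence.

    intensities: list of cycles; each cycle is a 4-tuple of signal
    strengths in the (hypothetical) channel order A, C, G, T.
    """
    read = []
    for cycle in intensities:
        # The brightest channel in this cycle gives the base call.
        strongest = max(range(4), key=lambda ch: cycle[ch])
        read.append(BASES[strongest])
    return "".join(read)


# Made-up intensities for one cluster over four chemistry cycles.
cluster = [
    (0.91, 0.03, 0.04, 0.02),
    (0.05, 0.02, 0.06, 0.87),
    (0.04, 0.02, 0.90, 0.04),
    (0.02, 0.88, 0.05, 0.05),
]
print(call_bases(cluster))  # prints ATGC
```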
23. 12. Align data
The data are aligned and compared to
a reference, and sequencing
differences are identified.
(Figure label: reference sequence)
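Aligning a read to a reference and reporting the differences can be sketched as follows. This is a deliberately naive version, assuming a single strand, no insertions or deletions, and a reference short enough to scan exhaustively; production aligners use indexed, gapped alignment. The function names are hypothetical.

```python
def best_ungapped_hit(read, reference):
    """Slide the read along the reference; return (offset, mismatches)."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for r, w in zip(read, window) if r != w)
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best  # None if the read is longer than the reference


def differences(read, reference):
    """List (reference_position, reference_base, read_base) mismatches."""
    hit = best_ungapped_hit(read, reference)
    if hit is None:
        return []
    offset, _ = hit
    return [
        (offset + i, reference[offset + i], base)
        for i, base in enumerate(read)
        if base != reference[offset + i]
    ]


reference = "GATCGGTCGAGCGTAAGCTAGCTAG"
read = "TCGAGCCTAAG"  # copy of reference[6:17] with one base changed
print(differences(read, reference))
```

Positions are 0-based offsets into the reference string.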
24. The generic structure of an automatic genome annotation pipeline and
delivery system
28. What is gene
prediction?
Detecting meaningful signals in uncharacterised DNA sequences.
Knowledge of the interesting information in DNA.
GATCGGTCGAGCGTAAGCTAGCTAG
ATCGATGATCGATCGGCCATATATC
ACTAGAGCTAGAATCGATAATCGAT
CGATATAGCTATAGCTATAGCCTAT
Gene prediction is ‘recognising protein-
coding regions in genomic sequence’
29. Basic Gene Prediction Flow Chart
Obtain new genomic DNA sequence
1. Translate in all six reading frames and compare to protein
sequence databases
2. Perform database similarity search of expressed sequence tag
(EST) database of the same organism, or cDNA sequences if
available
Use gene prediction program to locate genes
Analyze regulatory sequences in the gene
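Step 1 of the flow chart, translation in all six reading frames, can be sketched with the standard genetic code. The frame labels (+1 to +3, -1 to -3) and the use of 'X' for codons containing ambiguous bases are choices made for this example, not something the slides prescribe.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Map frame labels (+1..+3, -1..-3) to translated protein strings."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```

Each of the six protein strings would then be compared against a protein sequence database; that search step is outside this sketch.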
30. Approaches to gene prediction
Ab Initio Gene Finding
http://exon.gatech.edu/GenMark/eukhmm.cgi
http://sun1.softberry.com/berry.phtml=fgenesh&group=programs&subgroup=gfind
Repeat Masking
http://www.repeatmasker.org
http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker
Transcript based prediction
http://plantta.tigr.org
http://harvest.ucr.edu/
Gene function CDNA
http://au.expasy.org/sprot/
http://www.pir.uniprot.org/
Gene Ontologies
http://www.geneontology.org
49. Some Concluding remarks
Trust but verify
Beware of gene prediction tools!
Always use more than one gene
prediction tool and more than one
genome when possible.
Gene prediction is an active area of
bioinformatics research, so keep up with
the new literature in this field.