Genome annotation & comparative genomics
2. Objectives for lecture 04
▶ An appreciation for:
▶ An overview of some techniques and methods used for
comparative genomics
▶ An understanding of genome annotation methods, particularly
the advantages and disadvantages of the different methods:
▶ Sequence analysis (ORF finding)
▶ Comparative sequence analysis
▶ Experimental methods (RNA-seq & mass spectrometry)
3. Different comparative genomics work flows
▶ A genome sequencing project: an individual → DNA sequencing →
genome assembly → genome annotation → compare genomes & genes
▶ An RNA sequencing project: an individual → RNA sequencing →
map to genome → genome annotation → quantify gene expression
▶ A population genomics project: a population → DNA sequencing →
map to genome → call variation → compare populations
▶ A metagenome sequencing project: an environment → DNA sequencing →
genome assembly → map to taxonomy → compare environments
▶ Each workflow starts from an idea and ends by answering & asking
questions
6. Discussion
▶ How should these researchers annotate their genomes (after
they have sequenced and assembled them)?
▶ What are the fast and cheap methods?
▶ What are the most accurate methods?
7. The data tsunami
▶ Thanks to new sequencing technologies:
▶ Biologists no longer spend years acquiring data.
▶ The bottleneck for research is now the analysis phase.
▶ Biologists with good mathematical and statistical skills, and
mathematicians, statisticians and computer scientists with an
interest in biology, are in high demand.
[Figure: the research cycle (gather data → analyze/classify → hypotheses/predictions → experiment), illustrated with an example RNA sequence alignment with consensus-structure and sequence-logo panels]
8. We can annotate genomes with sequence analysis...
▶ Genes can leave a statistical signal in the genome...
▶ Example 1: in A+T rich genomes, genes can be discovered by
looking for high G+C regions
▶ Example 2: identify promoters, ribosome binding sites,
open reading frames (ORFs), terminators
▶ In eukaryotes, CpG islands, splicing signals and poly-A tails may
also be incorporated
Figure from: http://zerocool.is-a-geek.net/?p=630
9. ORF reminder
▶ ORF (open reading frame): a stretch of codons that begins
with a start codon (usually AUG) and ends at a stop codon
(usually UAA, UAG or UGA).
▶ So, 3/64 codons are expected to be stops by chance...
▶ The probability of observing at least one stop codon by chance in a
random sequence of n codons is $1 - \left(\frac{61}{64}\right)^{n}$
AUGAAACGCAUUAGCACCACCAUUACCACCACCAUCACCAUUACCACAGGUAACGGUGCGGGCUGA
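The ORF definition and the stop-codon probability above can be sketched in a few lines of Python. This is a minimal illustration, not a real gene finder: the function names `first_orf` and `p_stop` are made up for this example, only the first AUG on the forward strand is considered, and all 64 codons are assumed to be equally likely.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def first_orf(rna):
    """Return the ORF starting at the first AUG (stop codon included), or None.

    Simplification: only the first AUG on the given strand is tried.
    """
    start = rna.find("AUG")
    if start == -1:
        return None
    # Step through the sequence codon by codon, in frame with the AUG.
    for i in range(start, len(rna) - 2, 3):
        if rna[i:i + 3] in STOP_CODONS:
            return rna[start:i + 3]
    return None

def p_stop(n_codons):
    """Probability of at least one stop in n random codons: 1 - (61/64)^n."""
    return 1 - (61 / 64) ** n_codons

# The example sequence from this slide.
slide_sequence = ("AUGAAACGCAUUAGCACCACCAUUACCACCACCAUCACCAUU"
                  "ACCACAGGUAACGGUGCGGGCUGA")

orf = first_orf(slide_sequence)
print("ORF length:", len(orf), "nt =", len(orf) // 3, "codons (including the stop)")
print("P(stop by chance within 21 codons):", round(p_stop(21), 3))
```

Longer ORFs are less likely to arise by chance, because `p_stop(n)` approaches 1 as n grows.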
10. Sequence analysis: strengths and weaknesses
▶ ORF prediction: Prodigal (Bacterial/Archaeal), MAKER
(Eukaryotic prediction)
▶ Statistical model of ORF lengths, codon use & RBS sequences
▶ Strengths:
▶ very fast & cheap
▶ No prior knowledge about the genome is required (e.g. gene
sequences, etc.)
▶ Weaknesses:
▶ doesn’t account for splicing – less effective in eukaryotes.
▶ false positives (see AntiFam)
▶ misses short peptides (e.g. toxin-antitoxin systems)
▶ No ncRNAs, pseudogenes, recoding & frame-shift elements
12. The null hypothesis is important for science!
How do we know what is the truth?
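[Figure: score distributions under H0 (null, negative control) and HA (alternative, positive control), separated by the effect size and divided by a critical value]
▶ α: false positive rate (FPR), the FP region
▶ β: false negative rate (FNR), the FN region
▶ Power = 1 − β (sensitivity), the TP region
▶ 1 − α (specificity), the TN region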
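The quantities on this slide can be estimated from control data. A minimal sketch, assuming we have scores for a negative control (e.g. predictions on a shuffled genome) and a positive control (e.g. known genes), and that anything scoring at or above the critical value is called positive. The function name `error_rates` and the toy scores are hypothetical.

```python
def error_rates(negative_scores, positive_scores, critical_value):
    """Estimate alpha, beta, power and specificity at a given critical value."""
    # Negative controls that pass the threshold are false positives.
    fp = sum(s >= critical_value for s in negative_scores)
    # Positive controls that pass the threshold are true positives.
    tp = sum(s >= critical_value for s in positive_scores)
    fn = len(positive_scores) - tp
    alpha = fp / len(negative_scores)   # false positive rate (FPR)
    beta = fn / len(positive_scores)    # false negative rate (FNR)
    return {
        "FPR": alpha,
        "FNR": beta,
        "power": 1 - beta,              # sensitivity
        "specificity": 1 - alpha,
    }

# Toy scores, invented for illustration only.
negative_control = [1, 2, 3, 4]
positive_control = [3, 4, 5, 6]
print(error_rates(negative_control, positive_control, critical_value=4))
```

Raising the critical value lowers α but raises β. The trade-off is easiest when the effect size (the separation between the two distributions) is large.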
13. ORF length distributions
An experiment: annotate a native and a shuffled bacterial genome.
Note that many predicted short ORFs are likely to be false positives!
[Figure: ORF lengths, native vs shuffled bacterial genome. x-axis: ORF length (nts), log scale; y-axis: frequency, log scale; series: K12 (all ORFs) and K12 shuffled (all ORFs); overlaid curve: probability of a stop codon (0.0 to 1.0)]
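The shuffling experiment can be sketched as follows. This is a rough illustration under stated assumptions: the slide does not say how the genome was shuffled or which ORF finder was used, so a simple mononucleotide shuffle and a naive forward-strand ATG-to-stop scan are used here. The function names are hypothetical, and the toy input is the slide-9 example ORF converted to DNA, not a real genome.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}

def orf_lengths(dna):
    """Lengths (nt, stop included) of ATG...stop ORFs in the 3 forward frames."""
    lengths = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOPS:
                lengths.append(i + 3 - start)  # close it at the next in-frame stop
                start = None
    return lengths

def shuffle_sequence(dna, seed=0):
    """Mononucleotide shuffle: same base composition, no biological signal."""
    letters = list(dna)
    random.Random(seed).shuffle(letters)
    return "".join(letters)

# Toy input only; a real comparison needs a whole genome (e.g. E. coli K12).
toy = ("AUGAAACGCAUUAGCACCACCAUUACCACCACCAUCACCAUU"
       "ACCACAGGUAACGGUGCGGGCUGA").replace("U", "T")

print("native  :", orf_lengths(toy))
print("shuffled:", orf_lengths(shuffle_sequence(toy, seed=1)))
```

On a real genome, the shuffled ORF lengths act as the negative control: any length that is common in the shuffled genome is weak evidence for a gene in the native one.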
14. Another annotation strategy is to use homology...
▶ Based on the principle that evolution tends to preserve
functional genomic regions...
▶ Example 1: Use an existing set of genes from related species
and map these onto your genome (e.g. Roary)
▶ Useful for closely related species where one is well annotated
▶ Example 2: Align two or more related genomes and look for
conserved regions; patterns of variation can be indicative of
function (e.g. RNAz & RNAcode)
▶ coding sequences are enriched in synonymous mutations and
INDELs of size 3, 6, 9, ...
▶ ncRNA sequences may conserve base pairs, resulting in
covariation between alignment columns
▶ these methods require accurate and deep genome alignments
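The INDEL signal mentioned above can be checked with a short sketch: collect the lengths of gap runs in an aligned sequence and ask what fraction are multiples of three. This is only an illustration; the function names and the toy aligned row are invented, and tools such as RNAcode use far more careful statistics.

```python
from itertools import groupby

def gap_lengths(aligned_row, gap_chars="-."):
    """Lengths of consecutive gap runs in one row of an alignment."""
    return [
        len(list(run))
        for is_gap, run in groupby(aligned_row, key=lambda c: c in gap_chars)
        if is_gap
    ]

def fraction_in_frame(lengths):
    """Fraction of gap runs whose length is a multiple of 3 (None if no gaps)."""
    if not lengths:
        return None
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)

# Toy aligned row, invented for illustration.
row = "ATG---AAA.TAG"
lengths = gap_lengths(row)
print(lengths, fraction_in_frame(lengths))
```

A region where most INDELs preserve the reading frame is more likely to be protein coding than one where gap lengths look arbitrary.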
15. Example 1: mapping annotations between genomes...
Image source: Michael Schatz
16. Example 2: the DNA encoding a protein has a distinct
conservation pattern
# STOCKHOLM 1.0
#33 unique RNA sequences, 1 peptide sequence
#=GR PR1 G..A..D..V..T..H..P..P..A..G..D..
#=GR PR3 GlyAlaAspValThrHisProProAlaGlyAsp
platypus GGAGCAGACGTCACTCACCCCCCAGCCGGAGAT
opossum GGAGCAGATGTTACTCACCCTCCTGCTGGAGAT
sloth GGAGCAGACGTCACACACCCTCCCGCGGGGGAT
armadillo GGAGCAGACGTCACGCACCCTCCGGCAGGGGAT
tenrec GGGGCCGACGTCACGCACCCCCCTGCGGGCGAT
elephant GGAGCGGATGTCACACACCCGCCTGCGGGGGAT
shrew GGCGCAGATGTCACGCATCCTCCAGCAGGGGAC
hedgehog GGAGCAGATGTCACACACCCCCCAGCAGGAGAT
megabat GGAGCAGATGTCACACACCCTCCTGCAGGAGAT
microbat GGAGCAGATGTCACCCACCCCCCTGCAGGGGAC
dog GGAGCGGATGTCACACACCCCCCAGCCGGGGAC
cat GGAGCCGATGTCACGCACCCCCCAGCAGGGGAT
horse GGAGCGGATGTCACACACCCTCCGGCAGGGGAT
pika GGAGCAGATGTCACTCACCCTCCAGCTGGGGAT
rabbit GGTGCAGATGTCACACACCCCCCAGCTGGAGAT
squirrel GGAGCAGATGTCACTCACCCTCCAGCGGGAGAT
guinea_pig GGAGCAGATGTCACACACCCACCAGCGGGAGAT
mouse GGAGCAGATGTCACTCATCCGCCTGCTGGGGAC
rat GGAGCAGATGTCACTCATCCACCTGCTGGGGAT
kangaroo_rat GGAGCAGATGTTACACACCCTCCAGCAGGGGAT
tree_shrew GGCGCAGACGTCACGCACCCCCCGGCCGGGGAT
human GGAGCGGATGTCACACACCCCCCAGCAGGGGAT
tarsier GGTGCTGATGTCACACACCCCCCTGCAGGGGAT
marmoset GGAGCAGATGTCACACACCCACCAGCAGGGGAT
zebrafinch GGAGCAGATGTCACTCACCCTCCCGCCGGGGAT
green_anole GGGGCAGACGTCACTCACCCGCCAGCCGGGGAC
xenopus GGAGCAGATGTTACACACCCACCTGCTGGTGAT
pufferfish GGTGCGGATGTTACTCATCCTCCTGCTGGTGAT
fugu GGGGCTGATGTTACTCACCCTCCAGCTGGTGAT
stickleback GGTGCAGACGTCACACATCCTCCAGCGGGTGAT
medaka GGTGCCGATGTCACTCATCCTCCTGCCGGGGAC
zebrafish GGGGCAGATGTTACACACCCGCCGGCTGGTGAT
lamprey GGTGCCGATGTGACACACCCTCCAGCGGGAGAC
//
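One way to see the coding pattern in this alignment is to count variable columns by codon position. The sketch below uses five rows copied from the alignment above. It assumes the alignment starts in frame (as the annotation rows indicate) and contains no gaps, and the function name is hypothetical.

```python
def variable_sites_by_codon_position(seqs):
    """Count variable alignment columns at codon positions 1, 2 and 3.

    Assumes column 0 is the first position of a codon and all rows have
    equal length with no gaps.
    """
    counts = [0, 0, 0]
    for column_index, column in enumerate(zip(*seqs)):
        if len(set(column)) > 1:            # more than one base in this column
            counts[column_index % 3] += 1
    return counts

# A subset of the rows from the Stockholm alignment above.
slide_subset = {
    "platypus":  "GGAGCAGACGTCACTCACCCCCCAGCCGGAGAT",
    "opossum":   "GGAGCAGATGTTACTCACCCTCCTGCTGGAGAT",
    "human":     "GGAGCGGATGTCACACACCCCCCAGCAGGGGAT",
    "mouse":     "GGAGCAGATGTCACTCATCCGCCTGCTGGGGAC",
    "zebrafish": "GGGGCAGATGTTACACACCCGCCGGCTGGTGAT",
}

counts = variable_sites_by_codon_position(list(slide_subset.values()))
print("variable columns at codon positions 1, 2, 3:", counts)
```

In a protein-coding region most changes are synonymous, so variation is expected to concentrate at the third codon position.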
[Figure: the standard genetic code wheel (1st, 2nd and 3rd codon positions), showing for each amino acid its name, three- and one-letter codes, molecular weight, chemical structure, class (basic, acidic, polar, nonpolar/hydrophobic) and modifications (S: Sumo, M: Methyl, P: Phospho, U: Ubiquitin, nM: N-Methyl, oG: O-glycosyl, nG: N-glycosyl)]
Image source: http://upload.wikimedia.org/wikipedia/en/d/d6/GeneticCode21-version-2.svg
17. Example 2: DNA encodes non-coding RNAs
Covariation can be important!
[Figure: a tRNA shown as (A) primary structure, (B) secondary structure and (C) tertiary structure, with the acceptor stem, D loop, anticodon loop, variable loop and TΨC loop labelled]
5’ GCGGAUUUAGCUCAGDDGGGAGAGCGCCAGACUGAAYA.CUGGAGGUCCUGUGT.CGAUCCACAGAAUUCGCACCA 3’
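Covariation between two alignment columns can be illustrated with a small sketch: if two columns always form a base pair, but the identity of the pair changes between species, that supports a conserved RNA structure. The function name and the toy alignment are invented for illustration; tools such as RNAz use proper statistical models.

```python
# Canonical Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def covariation(alignment, i, j):
    """Return (fraction of rows where columns i and j can pair,
    number of distinct pair types observed)."""
    observed = [(row[i], row[j]) for row in alignment]
    paired = [p for p in observed if p in PAIRS]
    return len(paired) / len(observed), len(set(paired))

# Toy alignment: columns 0 and 4 always pair, but with different bases.
toy_alignment = [
    "GAAAC",
    "CAAAG",
    "AAAAU",
    "UAAAA",
]

print(covariation(toy_alignment, 0, 4))  # pairing columns
print(covariation(toy_alignment, 1, 3))  # non-pairing columns
```

Many distinct pair types with a high paired fraction is the covariation signal; a perfectly conserved pair gives a high fraction but only one pair type, which is weaker evidence.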
18. Homology-based annotation: strengths and weaknesses
▶ Example 1: map known genes onto genomes
▶ Strengths: fast, cheap, ...
▶ Weaknesses:
▶ Inaccurate for divergent species (e.g. the tuatara genome)
▶ Requires manual correction of border-line results
▶ Errors are propagated throughout the databases
▶ Example 2: align genomes, analyse patterns of variation
▶ Strengths:
▶ “cheap” if genomes already exist
▶ fast for small genomes
▶ evolutionary support for all discoveries
▶ Weaknesses:
▶ Requires lots of powerful computers for large genomes
▶ Inaccurate for divergent species (e.g. the tuatara genome)
▶ Requires manual correction of border-line results
19. Homology annotation: proteins are much easier to align
than nucleotides
[Figure: Conservation of Xfam families in bacterial genomes. x-axis: phylogenetic distance (0.0 to 0.6); y-axis: conserved families (%), 0 to 100; series: Pfam (N=6671) and Rfam (N=331); marginal histogram: frequency of RNA-seq species]
Lindgreen et al. (2014) Robust identification of noncoding RNA from transcriptomes requires
phylogenetically-informed sampling. PLOS Computational Biology.
20. Another annotation strategy is to use RNA sequencing...
▶ Protein and ncRNA genes require a transcription step...
▶ Example: sequence RNAs from multiple tissues,
developmental stages and environmental conditions
Wang, Gerstein & Snyder (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics.
21. RNA-seq based annotation
Sorek & Cossart (2010) Prokaryotic transcriptomics: a new view on regulation, physiology and pathogenicity.
Nature Reviews Genetics.
22. RNA-seq: strengths and weaknesses
▶ RNA-seq
▶ Strengths:
▶ Experimental support for transcribed regions
▶ Identifies untranslated regions (UTRs), ncRNAs, antisense
RNAs, ...
▶ Can identify alternatively spliced and edited RNAs
▶ Weaknesses:
▶ Expensive & lots of work
▶ RNA degradation and genomic contamination
▶ Misses genes transcribed only in unsampled developmental stages,
tissues & environmental conditions (e.g. the lsy-6 microRNA)
▶ Transcription does not prove translation
▶ Not all transcription is functional!
Hundreds of transcriptome papers e.g.:
Di Giorgio et al. (2020) Evidence for host-dependent RNA editing in the transcriptome of SARS-CoV-2. Science
Advances.
Lee et al. (2019) Diagnostic utility of transcriptome sequencing for rare Mendelian diseases. Genetics in Medicine.
Kalucka et al. (2020) Single-Cell Transcriptome Atlas of Murine Endothelial Cells. Cell.
Uhlen et al. (2017) A pathology atlas of the human cancer transcriptome. Science.
etc.
23. Another annotation strategy is to use direct protein
detection methods...
▶ Central dogma of molecular biology
▶ Example: Protein mass spectrometry
Figure from: http://en.wikipedia.org/wiki/Protein_mass_spectrometry
24. Protein mass spectrometry: strengths and weaknesses
▶ Protein mass spectrometry
▶ Strengths:
▶ Experimental support for translated regions
▶ Identifies alternative isoforms and post-translational
modifications (Ezkurdia et al. 2012)
▶ Weaknesses:
▶ Expensive & lots of work
▶ Misses genes expressed only in unsampled developmental stages,
tissues & environmental conditions
▶ Current technology generally detects only the most abundant
proteins
▶ Requires a reference protein DB
▶ How to deal with paralogues (duplicated genes)?
▶ Not all translation is functional!
Hundreds of proteomics papers e.g.:
Bojkova et al. (2020) Proteomics of SARS-CoV-2-infected host cells reveals therapy targets. Nature.
Nusinow et al. (2020) Quantitative Proteomics of the Cancer Cell Line Encyclopedia. Cell.
Messner et al. (2020) Ultra-High-Throughput Clinical Proteomics Reveals Classifiers of COVID-19 Infection. Cell
Systems.
Cheung et al. (2020) Defining the carrier proteome limit for single-cell proteomics. Nature Methods.
etc.
25. Weird gene interlude: Inside-out genes
▶ SNHG1 (a.k.a. UHG) is transcribed like a normal mRNA
▶ Spliced
▶ Exons have no open reading frame, and are quickly degraded
▶ The introns are highly conserved, stable transcripts: they are
snoRNAs. SnoRNAs are important for maturing rRNA.
[Figure: UCSC Genome Browser view (hg38, chr11:62,853,000-62,855,500) of SNHG1. Tracks: GENCODE v29 Comprehensive Transcript Set (SNHG1 isoforms); C/D and H/ACA box snoRNAs, scaRNAs and microRNAs from snoRNABase and miRBase (SNORD22, SNORD30, RF00099, SNORD28, SNORD27, SNORD26, SNORD25; also labelled U22, U31, U30, U29, U28, U27, U26, U25); H3K27Ac mark on 7 cell lines from ENCODE; DNase I hypersensitivity peak clusters from ENCODE (95 cell types); 100 vertebrates basewise conservation by PhyloP; vertebrate Multiz alignment & conservation (100 species)]
Tycowski KT, Shu M & Steitz JA (1996) A mammalian gene with introns instead of exons generating stable RNA
products. Nature.
26. Multi-omics: How cool is this?!
Blevins et al. (2019) Extensive post-transcriptional buffering of gene expression in the response to severe oxidative
stress in baker’s yeast. Scientific Reports.
27. Combining evidence
▶ Robust science does not rely upon a single methodology, or
dataset...
▶ Certainty in annotations comes from combining multiple lines
of evidence
▶ COLLECT COLLAGES OF EVIDENCE...
Giglio et al. (2019) ECO, the Evidence & Conclusion Ontology: community standard for evidence information.
Nucleic Acids Research.
28. The main points
▶ An overview of comparative genomics
▶ An understanding of computational and experimental
strategies for annotating genomes
▶ Computational: ORF prediction, mapping from existing
genome annotations, pattern hunting in alignments
▶ Experimental: RNA-seq and Mass-spec (many other emerging
tools too)
▶ An understanding of the advantages and limitations of
different genome annotation methods
29. Self-evaluation exercises
▶ Genome annotation is a fundamental problem in
bioinformatics and genomics. Describe four methods for
annotating genes encoded in genome sequences. What are
their main disadvantages and advantages?
▶ You’ve been given the task of annotating the Giant Wētā
genome. A colleague has recently spent a year annotating the
related African King Cricket genome. Outline a fast and
cheap method for annotating your Wētā genome.
▶ Define the term “Open reading frame” (ORF). Describe the
relationship between an ORF and a gene. Describe how an
ORF can be located in an uncharacterised sequence.
30. Self-evaluation exercises
▶ You have run an ORF finder and a non-coding RNA
annotation tool on the above genomic sequence. The results
show an overlap between a predicted ORF and a ribosomal
RNA. Which annotation is likely to be correct? Justify your
answer.
▶ Consider the alignment below. Based upon the patterns of
sequence variation, predict whether the sequences encoded in
this genomic region are protein coding or non-coding. Justify
your answer.
species1 GGTAAGCTGGCGCGTCAGTTTGAGCAGCAG...GGT
species2 GGTAAACTGGCGCGCCAGTTTGAGCAGCAG...GGT
species3 GGCAAACTCGCCCGCCAGTTGGAACACCATcagGGG
31. Further reading
▶ Reviews:
▶ Zerbino, Frankish & Flicek (2020) Progress, Challenges, and
Surprises in Annotating the Human Genome. Annual Review
of Genomics and Human Genetics.
▶ RNA-seq
▶ Wang, Gerstein & Snyder (2009) RNA-Seq: a revolutionary
tool for transcriptomics. Nature Reviews Genetics.
▶ Proteomics
▶ Nesvizhskii (2014) Proteogenomics: concepts, applications and
computational strategies. Nature Methods.
32. Questions relating to my lectures can be asked & viewed
here:
https://docs.google.com/document/d/1PQd dp7C 0cXA8SwUv-
qrkTOj8c8fUAt-U Z5dg2yc8/edit?usp=sharing
33. Homework: How to make a sequence alignment?
▶ Play: http://phylo.cs.mcgill.ca