GENE315: Sequence alignment & comparative
genomics
Paul Gardner
March 13, 2023
Objectives for lecture 05
▶ Able to answer:
▶ What’s the difference between homology and analogy?
▶ How homology is estimated?
▶ Where do sequence similarity scores come from?
▶ What are the BLOSUM protein scoring matrices?
▶ How are insertions and deletions scored?
What is homology?
▶ A homologous trait is any
characteristic of organisms that is
derived from a common ancestor
Eg. vertebrate forelimbs.
▶ Contrast this to analogous traits:
similarities between organisms that
were not in the last common
ancestor Eg. wings from
pterosaurs, bats & birds.
Text & image adapted from:
http://en.wikipedia.org/wiki/Homology (biology)
Homologs can be further broken down into...
▶ Orthologues: derived from a single ancestor, seperated by a
speciation event. Often presumed to be functionally
equivalent.
▶ Paralogues: genes that are related by a gene duplication event.
▶ Xenologues: genes that are the result of horizontal gene
transfer events (e.g. viral genes)
Fitch W (1970) Distinguishing homologous from analogous proteins. Systematic zoology.
Example of similarity due to orthology, paralogy and
analogy
https://en.wikipedia.org/wiki/Sequence homology
Convergent/analogous changes...
▶ Sequence similarity of
homologs is expected to
decline with evolutionary
distance (e.g. human &
chimp proteins more
similar than human &
mouse)
▶ Convergent & analogous
proteins may deviate
from this pattern (e.g.
pathogens may use
molecular mimicry to
evade an immune
system), proteins
involved in echolocation
may converge (see
figure)
Marcovitz et al. (2019) A functional enrichment
test for molecular convergent evolution finds a
clear protein-coding signal in echolocating bats and
whales. PNAS.
The homology search problem
▶ Given a biological
sequence, can we identify
homologues in other
species?
Hug et al. (2016) A new view of the tree of life. Nature Microbiology
Given gazillions of biomolecular sequences, how can we
estimate homology?
▶ By sequence comparison! Usually by alignment.
▶ NOTE: sequence similarity is used to infer homology.
Homology is binary, either something is or isn’t homologous.
Sequences can be 87% similar, they can’t be 87% homologous!
▶ Alignments are two or more sequences placed on top of each
other, identifying a functional, structural, or evolutionary
relationships between the sequences.
2
4
5
3
1
>1
GCAUCCAUGGCUGAAUGGUUAAAGCGCCCAACUCAUAAUUGGCGAACUCGCGGGUUCAAUUCCUGCUGGAUGCA
>2
GCAUUGGUGGUUCAGUGGUAGAAUUCUCGCCUGCCACGCGGGAGGCCCGGGUUCGAUUCCCGGCCAAUGCA
>3
UGGGCUAUGGUGUAAUUGGCAGCACGACUGAUUCUGGUUCAGUUAGUCUAGGUUCGAGUCCUGGUAGCCCAG
>4
GAAGAUCGUCGUCUCCGGUGAGGCGGCUGGACUUCAAAUCCAGUUGGGGCCGCCAGCGGUCCCGGGCAGGUUCGACUCCUGUGAUCUUCCG
>5
CUAAAUAUAUUUCAAUGGUUAGCAAAAUACGCUUGUGGUGCGUUAAAUCUAAGUUCGAUUCUUAGUAUUUACC
** *
1 GCAUCCAUGGCUGAAU-GGUU-AAAGCGCCCAACUCAUAAUUGGCGAA--
2 GCAUUGGUGGUUCAGU-GGU--AGAAUUCUCGCCUGCCACGCGG-GAG--
3 UGGGCUAUGGUGUAAUUGGC--AGCACGACUGAUUCUGGUUCAG-UUA--
4 GAAGAUCGUCGUCUCC-GGUG-AGGCGGCUGGACUUCAAAUCCA-GU-UG
5 CUAAAUAUAUUUCAAU-GGUUAGCAAAAUACGCUUGUGGUGCGU-UAA--
**** * **
1 ------------------CUCGCGGGUUCAAUUCCUGCUGGAUGC-A
2 ------------------G-CCCGGGUUCGAUUCCCGGCCAAUGC-A
3 ------------------G-UCUAGGUUCGAGUCCUGGUAGCCCA-G
4 GGGCCGCCAGCGGUCCCG--GGCAGGUUCGACUCCUGUGAUCUUCCG
5 ------------------A-UCUAAGUUCGAUUCUUAGUAUUUAC-C
S
M
A
D
M
Y
M
U
R
S
Y
U
C
A
M
Y
-
G
G
Y
u a A
V M M M
R M
H
C
R
M
Y
U
S
H V R
H
K
C
V
R
c
K
W
A
-
-
-
-
- c c - c
c
a
-
c
-
-
-
c
c
c
-
V
-
Y
S Y R R G
U U
C
R
A
Y
U
C
C
Y
R
S
Y
M
D
M
Y
V
M
c
V
Sequence alignments are an essential tool for genetics
▶ Sequence assembly
▶ Homology search (e.g. BLAST)
▶ Phylogenetics
▶ Structure-function analysis
▶ Variant calling
▶ Outbreak tracing
▶ And much much more...
So, geneticists of the future really should have a good
understanding of sequence alignment!
What is this sequence, and what is it related to?
Try BLASTP and PHMMER.
MPFVNKQFNYKDPVNGVDIAYIKIPNAGQMQPVKAFKIHNKIWVIPERDTFTNPEEGDLN
PPPEAKQVPVSYYDSTYLSTDNEKDNYLKGVTKLFERIYSTDLGRMLLTSIVRGIPFWGG
STIDTELKVIDTNCINVIQPDGSYRSEELNLVIIGPSADIIQFECKSFGHEVLNLTRNGY
GSTQYIRFSPDFTFGFEESLEVDTNPLLGAGKFATDPAVTLAHELIHAGHRLYGIAINPN
RVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYYYNKFKDIASTLNKA
KSIVGTTASLQYMKNVFKEKYLLSEDTSGKFSVDKLKFDKLYKMLTEIYTEDNFVKFFKV
LNRKTYLNFDKAVFKINIVPKVNYTIYDGFNLRNTNLAANFNGQNTEINNMNFTKLKNFT
GLFEFYKLLCVRGIITSKTKSLDKGYNKALNDLCIKVNNWDLFFSPSEDNFTNDLNKGEE
ITSDTNIEAAEENISLDLIQQYYLTFNFDNEPENISIENLSSDIIGQLELMPNIERFPNG
KKYELDKYTMFHYLRAQEFEHGKSRIALTNSVNEALLNPSRVYTFFSSDYVKKVNKATEA
AMFLGWVEQLVYDFTDETSEVSTTDKIADITIIIPYIGPALNIGNMLYKDDFVGALIFSG
AVILLEFIPEIAIPVLGTFALVSYIANKVLTVQTIDNALSKRNEKWDEVYKYIVTNWLAK
VNTQIDLIRKKMKEALENQAEATKAIINYQYNQYTEEEKNNINFNIDDLSSKLNESINKA
MININKFLNQCSVSYLMNSMIPYGVKRLEDFDASLKDALLKYIYDNRGTLIGQVDRLKDK
VNNTLSTDIPFQLSKYVDNQRLLSTFTEYIKNIINTSILNLRYESNHLIDLSRYASKINI
GSKVNFDPIDKNQIQLFNLESSKIEVILKNAIVYNSMYENFSTSFWIRIPKYFNSISLNN
EYTIINCMENNSGWKVSLNYGEIIWTLQDTQEIKQRVVFKYSQMINISDYINRWIFVTIT
NNRLNNSKIYINGRLIDQKPISNLGNIHASNNIMFKLDGCRDTHRYIWIKYFNLFDKELN
EKEIKDLYDNQSNSGILKDFWGDYLQYDKPYYMLNLYDPNKYVDVNNVGIRGYMYLKGPR
GSVMTTNIYLNSSLYRGTKFIIKKYASGNKDNIVRNNDRVYINVVVKNKEYRLATNASQA
GVEKILSALEIPDVGNLSQVVVMKSKNDQGITNKCKMNLQDNNGNDIGFIGFHQFNNIAK
LVASNWYNRQIERSSRTLGCSWEFIPVDDGWGERPL
Paul should explain what...
▶ a sequence alignment is
▶ what the “|” and “-” symbols mean in the alignment
▶ discuss how we can make these
Query 1 CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC
|| |||||||| ||||||||| |||||| | | | || |||| |||| ||||
Sbjct 828507 CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC
Query 57 TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG
| | || |||||| ||| ||||||||||| |||||| ||| |||
Sbjct 828567 ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
How should a sequence alignment be interpreted & scored?
What do the bit score and and expect value tell you?
>gb|CP001191.1| Rhizobium leguminosarum bv. trifolii WSM2304, complete genome
Length=4537948
Features in this part of subject sequence:
cold-shock DNA-binding domain protein
Score = 57.2 bits (62), Expect = 2e-05
Identities = 78/106 (74%), Gaps = 6/106 (6%)
Strand=Plus/Plus
Query 1 CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC
|| |||||||| ||||||||| |||||| | | | || |||| |||| ||||
Sbjct 828507 CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC
Query 57 TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG
| | || |||||| ||| ||||||||||| |||||| ||| |||
Sbjct 828567 ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
Let’s play a game of coin toss
▶ How can we detect cheating?
Ratios between frequencies (or odds) are widely used...
▶ But the ratios ...
▶ 10
1 , 100
1 , 1,000
1 , · · · are found in the interval (1, inf), while
▶ 1
10 , 1
100 , 1
1,000 , · · · are found in the interval (0, 1).
0 200 400 600 800 1000
Ratio
|
|
|| | |
|
Ratio (log scale)
|
|
| | | |
|
0.001 0.01 0.1 1 10 100 1000
Where do the sequence alignment scores come from?
▶ Most alignment scores (from BLAST, HMMER, ...) can be
interpreted as log-odds ratios or bit-scores
▶ What is a log-odds ratio?
▶ The observed frequency of two independent events is fAB
▶ The expected frequency of two independent events is fA ∗ fB
(i.e. what is the likelihood of the event occuring by chance)
▶ The log-odds ratio is log2(FreqObserved
FreqExpected
) = log2( fAB
fA∗fB
)
▶ What happens when fAB = fA ∗ fB ?
▶ What happens when fAB > fA ∗ fB or fAB < fA ∗ fB ?
▶ Often called the bit-score, information theorists like to discuss
how many “bits” of information they have, hence “log2”.
Logarithms – reminder
Where do the nucleotide scores come from?
▶ Align lots of nucleotide sequences that are ≥ 99 percent
identity
▶ Assuming a uniform substitution model for sequences,
simulate changes over 1% identity intervals (steps in time)
▶ Compute a match & mismatch log-odds score for each
interval, E.g.
▶ at 99% ID ≈ Match:+2, Mismatch:−6.
▶ at 90% ID ≈ Match:+2, Mismatch:−3.
▶ at 75% − 62% ID ≈ Match:+1, Mismatch:−1.
▶ I.e. The nucleotide scores have been extrapolated from closely
related sequences, assuming a simple model of sequence
evolution
Where do the nucleotide scores come from?
TABLE 1. PAM Substitution Scores Based
on the Uniform Mutation Model
PAM
Dist.
Percent
Con-
served
Match
Score
(Bits)
Mis-
match
Score
(Bits)
Match/
Mis-
match
Score
Ratio
Ave.
Informa
tion Per
Posi-
tion
(Bits)
1 99.0 1.99 -6.24 0.32 1.90
2 98.0 1.97 -5.25 0.38 1.83
5 95.2 1.93 -3.95 0.49 1.64
10 90.6 1.86 -3.00 0.62 1.40
15 86.4 1.79 -2.46 0.73 1.21
20 82.4 1.72 -2.09 0.82 1.05
25 78.7 1.66 -1.82 0.91 0.92
30 75.3 1.59 -1.60 0.99 0.80
35 72.0 1.53 -1.42 1.07 0.70
40 69.0 1.46 -1.27 1.15 0.62
45 66.2 1.40 -1.15 1.22 0.54
50 63.5 1.34 -1.04 1.29 0.47
States, DJ et al. (1991) Improved sensitivity of nucleic acid database searches using application-specific scoring
matrices. Methods Enzymol.
Margaret Dayhoff’s amino acid classes
▶ Invented the one letter codes, developed the first substitution
matrices, one of the founders of bioinformatics
▶ (C)(MILV),(STAGP),(DEQN),(WYF),(HRK)
▶ MILV went to a STAGParty, DEQN went home WYF HRK.
C stayed home and studied.
▶ MILV and WYF are the most important classes to remember
(statistically).
Dayhoff, Schwartz & Orcutt (1978). A model of Evolutionary Change in Proteins. Atlas of protein sequence and
structure.
Where do the protein score matrices come from?
▶ The BLOSUM62 matrix
▶ Take a database of protein alignments (BLOCKS)
▶ Split alignments: collapse sequences with percent sequence
identity > 62% (To reduce contributions from closely related
sequences)
▶ Compute log-odds ratios (lods) from these
▶ More accurate than PAM matrices (no extrapolation)
G
A
A
A
A
A
G
G
G
G
C
C
C
C
U
U
U
U
UC AG
UC
A
G
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
U
C
AG
U
C
AG
UCA
G
P
S
U
nG
nG
oG
oG
oG
G
P
P
P
P
P
nM
nM
M
M
nM
nM
nM
Phenylalanine
Phe
Leucine
Leu
Leucine
Leu
Proline
Pro
Histidine
His
Glutamine
Gln
Isoleucine
Ile
Methionine
Met
Threonine
Thr
Asparagine
Asn
Lysine
Lys
Arginine
Arg
Arginine
Arg
Valine
Val
Alanine
Ala
Glutamic acid
Glu
Aspartic acid
Asp
Glycine
Gly
Serine
Ser
Serine
Ser
Tyrosine
Tyr
Cysteine
Cys
Tryptophan
Trp
Stops
Stop
E
G F L
S
S
Y
C
W
L
P
H
R
R
Q
I
M
T
N
K
V
A
D
89.09
75.07
174.20
174.20
146.19
165.19
133.11
117.15
147.13
146.15
155.16
115.13
105.09
105.09
131.18
132.12
MW
=
149.
21
Da
131.18
119.12
204.23
131.18
181.19
121.16
HN
NH2
NH
H2N
OH
O
H2N
C
H3 OH
O
H2N
O
H2N
OH
O
O
HO
H2N
OH
O
HS
H2N
OH
O
H2N
O
NH2
OH
O
O
OH
H2N
OH
O
H2N
OH
O
NH
H2N
OH
O
N
C
H3 CH3
H2N
OH
O
C
H3
C
H3
H2N
OH
O
C
H3
C
H3
H2N
OH
O
H2N
H2N
OH
O
C
H3 S
H2N
OH
O
H2N
OH
O
NH
OH
O
H2N
HO OH
O
H2N
HO OH
O
H2N
HO
CH3
OH
O
NH
H2N
OH
O
HO
H2N
OH
O
H2N
C
H3
CH3
OH
O
Basic
Acidic
Polar
Nonpolar
(hydrophobic)
S -
M -
P -
U -
nM -
oG -
nG -
Sumo
Methyl
Phospho
Ubiquitin
N-Methyl
O-glycosyl
N-glycosyl
Modification
amino
acid
2nd
1st position 3rd
U
C
Henikoff & Henikoff (1992) Amino acid substitution matrices from protein blocks. PNAS.
Image source: http://www.mathgon.com/Cours/TP/TP1/Alignements.html
How are insertions and deletions (INDELs) penalised?
▶ Usually INDELs occur as larger (≥ 1) residue events
▶ Therefore scoring schemes that tend to lump multiple
potential events into a few large events are prefered
▶ Hence, the “affine gap penalty” is preferred
▶ requires a gap-opening penalty (A) and a gap-extension
penalty (B)
▶ The penalty for introducing a single gap-block into an
alignment is then A + B ∗ L, where L is the length of the
gap-block
Query CGG----GAAC
||X ||||
Sbjct CGCAGGCGAAC
Liu et al. (2015) Development of genome-wide insertion and deletion markers for maize, based on next-generation
sequencing data. BMC Genomics
How are alignment bitscores calculated?
Query CGG----GAAC
||X ||||
Sbjct CGCAGGCGAAC
bitscore =
X
(match scores) +
X
(mismatch scores)
+
X
(gap penalties)
NB. usually mismatch score are negative (penalties).
Therefore, for a +5/ − 4 match/mismatch scoring scheme and
−10/ − 1 for gap-open and gap-extension penalties, then the score
for the above alignment with 6 matches, 1 mismatch and 1
INDEL of length 4 is:
bitscore = 6 ∗ 5 + 1 ∗ (−4) + (−10 + 4 ∗ (−1))
= 12 bits
How are alignments made?
▶ A method called dynamic programming (DP) is used to find
an alignment with the highest possible score
▶ local, global & glocal modes
▶ I wont go into any detail about DPs in this course...
The main points
▶ Homology & analogy
▶ How sequence similarity is scored
▶ Log-odds ratios (a.k.a. bitscores) are useful for scoring
evolutionary changes in sequences
▶ How the BLOSUM and nucleotide score matrices where
produced
▶ We use an affine-gap method for penalising INDELs
Self-evaluation exercises
▶ If I was to throw two dice 360 times and rolled double 6’s 80
times, what is the log-odds ratio for that (base 2)?
▶ What is the log-odds ratio of rolling 10/360 double 6’s?
▶ What is the log-odds ratio of rolling 5/360 double 6’s?
▶ With reference to protein sequences: Define “homology” and
“similarity”. What is the difference between the two?
▶ BLOSUM62 is an alignment scoring matrix. Describe why a
scoring matrix is needed and how BLOSUM62 was derived.
Self-evaluation exercises
▶ Given a +5/ − 4 match/mismatch scoring scheme and
−10/ − 1 for gap-open and gap-extension penalties. What is
the score for the below alignment?
▶ What evolutionary event does the ’-’ symbol in the alignment
correspond to?
Query AGCGAT
| ||X|
Sbjct A-CGGT
Further reading
▶ Eddy SR (2004) Where did the BLOSUM62 alignment score
matrix come from? Nature Biotechnology
▶ Blast Bitscores & E-values:
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
▶ NB: you don’t need to memorise the BLOSUM62 matrix, the
genetic code, or biochemical properties of amino-acids
▶ Concentrate on understanding the principles
Post questions on the FAQ GoogleDoc:
https://docs.google.com/document/d/1PQd dp7C 0cXA8SwUv-
qrkTOj8c8fUAt-U Z5dg2yc8/edit?usp=sharing
The End
Matanaka, Waikouaiti, 11 March 2023.

ppgardner-lecture05-alignment-comparativegenomics.pdf

  • 1.
    GENE315: Sequence alignment& comparative genomics Paul Gardner March 13, 2023
  • 2.
    Objectives for lecture05 ▶ Able to answer: ▶ What’s the difference between homology and analogy? ▶ How homology is estimated? ▶ Where do sequence similarity scores come from? ▶ What are the BLOSUM protein scoring matrices? ▶ How are insertions and deletions scored?
  • 3.
    What is homology? ▶A homologous trait is any characteristic of organisms that is derived from a common ancestor Eg. vertebrate forelimbs. ▶ Contrast this to analogous traits: similarities between organisms that were not in the last common ancestor Eg. wings from pterosaurs, bats & birds. Text & image adapted from: http://en.wikipedia.org/wiki/Homology (biology)
  • 4.
    Homologs can befurther broken down into... ▶ Orthologues: derived from a single ancestor, seperated by a speciation event. Often presumed to be functionally equivalent. ▶ Paralogues: genes that are related by a gene duplication event. ▶ Xenologues: genes that are the result of horizontal gene transfer events (e.g. viral genes) Fitch W (1970) Distinguishing homologous from analogous proteins. Systematic zoology.
  • 5.
    Example of similaritydue to orthology, paralogy and analogy https://en.wikipedia.org/wiki/Sequence homology
  • 6.
    Convergent/analogous changes... ▶ Sequencesimilarity of homologs is expected to decline with evolutionary distance (e.g. human & chimp proteins more similar than human & mouse) ▶ Convergent & analogous proteins may deviate from this pattern (e.g. pathogens may use molecular mimicry to evade an immune system), proteins involved in echolocation may converge (see figure) Marcovitz et al. (2019) A functional enrichment test for molecular convergent evolution finds a clear protein-coding signal in echolocating bats and whales. PNAS.
  • 7.
    The homology searchproblem ▶ Given a biological sequence, can we identify homologues in other species? Hug et al. (2016) A new view of the tree of life. Nature Microbiology
  • 8.
    Given gazillions ofbiomolecular sequences, how can we estimate homology? ▶ By sequence comparison! Usually by alignment. ▶ NOTE: sequence similarity is used to infer homology. Homology is binary, either something is or isn’t homologous. Sequences can be 87% similar, they can’t be 87% homologous! ▶ Alignments are two or more sequences placed on top of each other, identifying a functional, structural, or evolutionary relationships between the sequences. 2 4 5 3 1 >1 GCAUCCAUGGCUGAAUGGUUAAAGCGCCCAACUCAUAAUUGGCGAACUCGCGGGUUCAAUUCCUGCUGGAUGCA >2 GCAUUGGUGGUUCAGUGGUAGAAUUCUCGCCUGCCACGCGGGAGGCCCGGGUUCGAUUCCCGGCCAAUGCA >3 UGGGCUAUGGUGUAAUUGGCAGCACGACUGAUUCUGGUUCAGUUAGUCUAGGUUCGAGUCCUGGUAGCCCAG >4 GAAGAUCGUCGUCUCCGGUGAGGCGGCUGGACUUCAAAUCCAGUUGGGGCCGCCAGCGGUCCCGGGCAGGUUCGACUCCUGUGAUCUUCCG >5 CUAAAUAUAUUUCAAUGGUUAGCAAAAUACGCUUGUGGUGCGUUAAAUCUAAGUUCGAUUCUUAGUAUUUACC ** * 1 GCAUCCAUGGCUGAAU-GGUU-AAAGCGCCCAACUCAUAAUUGGCGAA-- 2 GCAUUGGUGGUUCAGU-GGU--AGAAUUCUCGCCUGCCACGCGG-GAG-- 3 UGGGCUAUGGUGUAAUUGGC--AGCACGACUGAUUCUGGUUCAG-UUA-- 4 GAAGAUCGUCGUCUCC-GGUG-AGGCGGCUGGACUUCAAAUCCA-GU-UG 5 CUAAAUAUAUUUCAAU-GGUUAGCAAAAUACGCUUGUGGUGCGU-UAA-- **** * ** 1 ------------------CUCGCGGGUUCAAUUCCUGCUGGAUGC-A 2 ------------------G-CCCGGGUUCGAUUCCCGGCCAAUGC-A 3 ------------------G-UCUAGGUUCGAGUCCUGGUAGCCCA-G 4 GGGCCGCCAGCGGUCCCG--GGCAGGUUCGACUCCUGUGAUCUUCCG 5 ------------------A-UCUAAGUUCGAUUCUUAGUAUUUAC-C S M A D M Y M U R S Y U C A M Y - G G Y u a A V M M M R M H C R M Y U S H V R H K C V R c K W A - - - - - c c - c c a - c - - - c c c - V - Y S Y R R G U U C R A Y U C C Y R S Y M D M Y V M c V
  • 9.
    Sequence alignments arean essential tool for genetics ▶ Sequence assembly ▶ Homology search (e.g. BLAST) ▶ Phylogenetics ▶ Structure-function analysis ▶ Variant calling ▶ Outbreak tracing ▶ And much much more... So, geneticists of the future really should have a good understanding of sequence alignment!
  • 10.
    What is thissequence, and what is it related to? Try BLASTP and PHMMER. MPFVNKQFNYKDPVNGVDIAYIKIPNAGQMQPVKAFKIHNKIWVIPERDTFTNPEEGDLN PPPEAKQVPVSYYDSTYLSTDNEKDNYLKGVTKLFERIYSTDLGRMLLTSIVRGIPFWGG STIDTELKVIDTNCINVIQPDGSYRSEELNLVIIGPSADIIQFECKSFGHEVLNLTRNGY GSTQYIRFSPDFTFGFEESLEVDTNPLLGAGKFATDPAVTLAHELIHAGHRLYGIAINPN RVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYYYNKFKDIASTLNKA KSIVGTTASLQYMKNVFKEKYLLSEDTSGKFSVDKLKFDKLYKMLTEIYTEDNFVKFFKV LNRKTYLNFDKAVFKINIVPKVNYTIYDGFNLRNTNLAANFNGQNTEINNMNFTKLKNFT GLFEFYKLLCVRGIITSKTKSLDKGYNKALNDLCIKVNNWDLFFSPSEDNFTNDLNKGEE ITSDTNIEAAEENISLDLIQQYYLTFNFDNEPENISIENLSSDIIGQLELMPNIERFPNG KKYELDKYTMFHYLRAQEFEHGKSRIALTNSVNEALLNPSRVYTFFSSDYVKKVNKATEA AMFLGWVEQLVYDFTDETSEVSTTDKIADITIIIPYIGPALNIGNMLYKDDFVGALIFSG AVILLEFIPEIAIPVLGTFALVSYIANKVLTVQTIDNALSKRNEKWDEVYKYIVTNWLAK VNTQIDLIRKKMKEALENQAEATKAIINYQYNQYTEEEKNNINFNIDDLSSKLNESINKA MININKFLNQCSVSYLMNSMIPYGVKRLEDFDASLKDALLKYIYDNRGTLIGQVDRLKDK VNNTLSTDIPFQLSKYVDNQRLLSTFTEYIKNIINTSILNLRYESNHLIDLSRYASKINI GSKVNFDPIDKNQIQLFNLESSKIEVILKNAIVYNSMYENFSTSFWIRIPKYFNSISLNN EYTIINCMENNSGWKVSLNYGEIIWTLQDTQEIKQRVVFKYSQMINISDYINRWIFVTIT NNRLNNSKIYINGRLIDQKPISNLGNIHASNNIMFKLDGCRDTHRYIWIKYFNLFDKELN EKEIKDLYDNQSNSGILKDFWGDYLQYDKPYYMLNLYDPNKYVDVNNVGIRGYMYLKGPR GSVMTTNIYLNSSLYRGTKFIIKKYASGNKDNIVRNNDRVYINVVVKNKEYRLATNASQA GVEKILSALEIPDVGNLSQVVVMKSKNDQGITNKCKMNLQDNNGNDIGFIGFHQFNNIAK LVASNWYNRQIERSSRTLGCSWEFIPVDDGWGERPL
  • 11.
    Paul should explainwhat... ▶ a sequence alignment is ▶ what the “|” and “-” symbols mean in the alignment ▶ discuss how we can make these Query 1 CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC || |||||||| ||||||||| |||||| | | | || |||| |||| |||| Sbjct 828507 CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC Query 57 TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG | | || |||||| ||| ||||||||||| |||||| ||| ||| Sbjct 828567 ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
  • 12.
    How should asequence alignment be interpreted & scored? What do the bit score and and expect value tell you? >gb|CP001191.1| Rhizobium leguminosarum bv. trifolii WSM2304, complete genome Length=4537948 Features in this part of subject sequence: cold-shock DNA-binding domain protein Score = 57.2 bits (62), Expect = 2e-05 Identities = 78/106 (74%), Gaps = 6/106 (6%) Strand=Plus/Plus Query 1 CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC || |||||||| ||||||||| |||||| | | | || |||| |||| |||| Sbjct 828507 CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC Query 57 TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG | | || |||||| ||| ||||||||||| |||||| ||| ||| Sbjct 828567 ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
  • 13.
    Let’s play agame of coin toss ▶ How can we detect cheating?
  • 14.
    Ratios between frequencies(or odds) are widely used... ▶ But the ratios ... ▶ 10 1 , 100 1 , 1,000 1 , · · · are found in the interval (1, inf), while ▶ 1 10 , 1 100 , 1 1,000 , · · · are found in the interval (0, 1). 0 200 400 600 800 1000 Ratio | | || | | | Ratio (log scale) | | | | | | | 0.001 0.01 0.1 1 10 100 1000
  • 15.
    Where do thesequence alignment scores come from? ▶ Most alignment scores (from BLAST, HMMER, ...) can be interpreted as log-odds ratios or bit-scores ▶ What is a log-odds ratio? ▶ The observed frequency of two independent events is fAB ▶ The expected frequency of two independent events is fA ∗ fB (i.e. what is the likelihood of the event occuring by chance) ▶ The log-odds ratio is log2(FreqObserved FreqExpected ) = log2( fAB fA∗fB ) ▶ What happens when fAB = fA ∗ fB ? ▶ What happens when fAB > fA ∗ fB or fAB < fA ∗ fB ? ▶ Often called the bit-score, information theorists like to discuss how many “bits” of information they have, hence “log2”.
  • 16.
  • 17.
    Where do thenucleotide scores come from? ▶ Align lots of nucleotide sequences that are ≥ 99 percent identity ▶ Assuming a uniform substitution model for sequences, simulate changes over 1% identity intervals (steps in time) ▶ Compute a match & mismatch log-odds score for each interval, E.g. ▶ at 99% ID ≈ Match:+2, Mismatch:−6. ▶ at 90% ID ≈ Match:+2, Mismatch:−3. ▶ at 75% − 62% ID ≈ Match:+1, Mismatch:−1. ▶ I.e. The nucleotide scores have been extrapolated from closely related sequences, assuming a simple model of sequence evolution
  • 18.
    Where do thenucleotide scores come from? TABLE 1. PAM Substitution Scores Based on the Uniform Mutation Model PAM Dist. Percent Con- served Match Score (Bits) Mis- match Score (Bits) Match/ Mis- match Score Ratio Ave. Informa tion Per Posi- tion (Bits) 1 99.0 1.99 -6.24 0.32 1.90 2 98.0 1.97 -5.25 0.38 1.83 5 95.2 1.93 -3.95 0.49 1.64 10 90.6 1.86 -3.00 0.62 1.40 15 86.4 1.79 -2.46 0.73 1.21 20 82.4 1.72 -2.09 0.82 1.05 25 78.7 1.66 -1.82 0.91 0.92 30 75.3 1.59 -1.60 0.99 0.80 35 72.0 1.53 -1.42 1.07 0.70 40 69.0 1.46 -1.27 1.15 0.62 45 66.2 1.40 -1.15 1.22 0.54 50 63.5 1.34 -1.04 1.29 0.47 States, DJ et al. (1991) Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods Enzymol.
  • 19.
    Margaret Dayhoff’s aminoacid classes ▶ Invented the one letter codes, developed the first substitution matrices, one of the founders of bioinformatics ▶ (C)(MILV),(STAGP),(DEQN),(WYF),(HRK) ▶ MILV went to a STAGParty, DEQN went home WYF HRK. C stayed home and studied. ▶ MILV and WYF are the most important classes to remember (statistically). Dayhoff, Schwartz & Orcutt (1978). A model of Evolutionary Change in Proteins. Atlas of protein sequence and structure.
  • 20.
    Where do theprotein score matrices come from? ▶ The BLOSUM62 matrix ▶ Take a database of protein alignments (BLOCKS) ▶ Split alignments: collapse sequences with percent sequence identity > 62% (To reduce contributions from closely related sequences) ▶ Compute log-odds ratios (lods) from these ▶ More accurate than PAM matrices (no extrapolation) G A A A A A G G G G C C C C U U U U UC AG UC A G U C A G U C A G U C A G U C A G U C A G U C A G U C A G U C A G U C A G U C A G U C A G U C AG U C AG UCA G P S U nG nG oG oG oG G P P P P P nM nM M M nM nM nM Phenylalanine Phe Leucine Leu Leucine Leu Proline Pro Histidine His Glutamine Gln Isoleucine Ile Methionine Met Threonine Thr Asparagine Asn Lysine Lys Arginine Arg Arginine Arg Valine Val Alanine Ala Glutamic acid Glu Aspartic acid Asp Glycine Gly Serine Ser Serine Ser Tyrosine Tyr Cysteine Cys Tryptophan Trp Stops Stop E G F L S S Y C W L P H R R Q I M T N K V A D 89.09 75.07 174.20 174.20 146.19 165.19 133.11 117.15 147.13 146.15 155.16 115.13 105.09 105.09 131.18 132.12 MW = 149. 21 Da 131.18 119.12 204.23 131.18 181.19 121.16 HN NH2 NH H2N OH O H2N C H3 OH O H2N O H2N OH O O HO H2N OH O HS H2N OH O H2N O NH2 OH O O OH H2N OH O H2N OH O NH H2N OH O N C H3 CH3 H2N OH O C H3 C H3 H2N OH O C H3 C H3 H2N OH O H2N H2N OH O C H3 S H2N OH O H2N OH O NH OH O H2N HO OH O H2N HO OH O H2N HO CH3 OH O NH H2N OH O HO H2N OH O H2N C H3 CH3 OH O Basic Acidic Polar Nonpolar (hydrophobic) S - M - P - U - nM - oG - nG - Sumo Methyl Phospho Ubiquitin N-Methyl O-glycosyl N-glycosyl Modification amino acid 2nd 1st position 3rd U C Henikoff & Henikoff (1992) Amino acid substitution matrices from protein blocks. PNAS. Image source: http://www.mathgon.com/Cours/TP/TP1/Alignements.html
  • 21.
    How are insertionsand deletions (INDELs) penalised? ▶ Usually INDELs occur as larger (≥ 1) residue events ▶ Therefore scoring schemes that tend to lump multiple potential events into a few large events are prefered ▶ Hence, the “affine gap penalty” is preferred ▶ requires a gap-opening penalty (A) and a gap-extension penalty (B) ▶ The penalty for introducing a single gap-block into an alignment is then A + B ∗ L, where L is the length of the gap-block Query CGG----GAAC ||X |||| Sbjct CGCAGGCGAAC Liu et al. (2015) Development of genome-wide insertion and deletion markers for maize, based on next-generation sequencing data. BMC Genomics
  • 22.
    How are alignmentbitscores calculated? Query CGG----GAAC ||X |||| Sbjct CGCAGGCGAAC bitscore = X (match scores) + X (mismatch scores) + X (gap penalties) NB. usually mismatch score are negative (penalties). Therefore, for a +5/ − 4 match/mismatch scoring scheme and −10/ − 1 for gap-open and gap-extension penalties, then the score for the above alignment with 6 matches, 1 mismatch and 1 INDEL of length 4 is: bitscore = 6 ∗ 5 + 1 ∗ (−4) + (−10 + 4 ∗ (−1)) = 12 bits
  • 23.
    How are alignmentsmade? ▶ A method called dynamic programming (DP) is used to find an alignment with the highest possible score ▶ local, global & glocal modes ▶ I wont go into any detail about DPs in this course...
  • 24.
    The main points ▶Homology & analogy ▶ How sequence similarity is scored ▶ Log-odds ratios (a.k.a. bitscores) are useful for scoring evolutionary changes in sequences ▶ How the BLOSUM and nucleotide score matrices where produced ▶ We use an affine-gap method for penalising INDELs
  • 25.
    Self-evaluation exercises ▶ IfI was to throw two dice 360 times and rolled double 6’s 80 times, what is the log-odds ratio for that (base 2)? ▶ What is the log-odds ratio of rolling 10/360 double 6’s? ▶ What is the log-odds ratio of rolling 5/360 double 6’s? ▶ With reference to protein sequences: Define “homology” and “similarity”. What is the difference between the two? ▶ BLOSUM62 is an alignment scoring matrix. Describe why a scoring matrix is needed and how BLOSUM62 was derived.
  • 26.
    Self-evaluation exercises ▶ Givena +5/ − 4 match/mismatch scoring scheme and −10/ − 1 for gap-open and gap-extension penalties. What is the score for the below alignment? ▶ What evolutionary event does the ’-’ symbol in the alignment correspond to? Query AGCGAT | ||X| Sbjct A-CGGT
  • 27.
    Further reading ▶ EddySR (2004) Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology ▶ Blast Bitscores & E-values: https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html ▶ NB: you don’t need to memorise the BLOSUM62 matrix, the genetic code, or biochemical properties of amino-acids ▶ Concentrate on understanding the principles Post questions on the FAQ GoogleDoc: https://docs.google.com/document/d/1PQd dp7C 0cXA8SwUv- qrkTOj8c8fUAt-U Z5dg2yc8/edit?usp=sharing
  • 28.