Sequence Alignment
Aligning sequences
Homology
Homology = similarity between objects due to a common ancestor.
Example: Hund = Dog, Schwein = Pig
Sequence homology
VLSPAVKWAKVGAHAAGHG
||| || |||| |  ||||
VLSEAVLWAKVEADVAGHG
Similarity between sequences as a
result of common ancestry.
Sequence alignment
Alignment: Comparing two (pairwise)
or more (multiple) sequences.
Searching for a series of identical or
similar characters in the sequences.
Why align?
VLSPAVKWAKV
||| || ||||
VLSEAVLWAKV
1. To detect if two sequences are homologous. If so, homology may indicate similarity in structure (and function).
2. Given a sequenced DNA fragment from an unknown region, align it to the genome.
3. Required for evolutionary studies (e.g., tree reconstruction).
4. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).
Evolution of genes
Orthologous genes
- Genes that are found in different species and share a common ancestor.
Orthologs usually have similar gene sequences and a similar function in both species.
[Figure: tree in which Hemoglobin in the common ancestor gives rise to Hemoglobin in human and Hemoglobin in rhesus macaque]
Evolution of genes
How do we find orthologs?
Search for similar gene sequences between different organisms:
Human genome | Macaque genome
[Figure: two gene lists, one per genome: Gene 1, Gene 1, Gene 3, Gene 4, … and Gene A, Gene B, Gene C, Gene D, …]
Search for similar sequences (BLAST similarity search):
Human -> Macaque
Macaque -> Human
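One common way to combine the two search directions is to keep only gene pairs that are each other's best hit. The sketch below assumes this "reciprocal best hit" rule, which the slides do not name. The function name, the gene-to-genome assignment, and all scores are hypothetical illustrations, not real BLAST output.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Return (gene_a, gene_b) pairs that are each other's best-scoring hit."""
    best_ab = {a: max(hits, key=hits.get) for a, hits in a_to_b.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in b_to_a.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores (higher = more similar).
human_to_macaque = {
    "Gene 1": {"Gene A": 950, "Gene B": 120},
    "Gene 3": {"Gene C": 800},
    "Gene 4": {"Gene C": 300},
}
macaque_to_human = {
    "Gene A": {"Gene 1": 940},
    "Gene B": {"Gene 1": 110},
    "Gene C": {"Gene 3": 790, "Gene 4": 310},
}

pairs = reciprocal_best_hits(human_to_macaque, macaque_to_human)
print(pairs)
```

In this toy data, Gene 4 finds Gene C, but Gene C's best hit is Gene 3, so Gene 4 is not reported as an ortholog candidate.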
Insertions, deletions, and
substitutions
Sequence alignment
If two sequences share a common ancestor,
we can represent their evolutionary
relationship using a tree
Example: human and dog hemoglobin
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
VLSPAV-WAKV VLSEAVLWAKV
Perfect match
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
VLSPAV-WAKV VLSEAVLWAKV
A perfect match suggests that no change has
occurred from the common ancestor (although
this is not always the case).
A substitution
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
VLSPAV-WAKV VLSEAVLWAKV
A substitution suggests that at least one
change has occurred since the common
ancestor
We cannot say in which lineage it occurred!
Indel
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
VLSPAV-WAKV
VLSEAVLWAKV
Option 1: The ancestor had L and it was lost
here. In such a case, the event was a deletion.
VLSEAVLWAKV
Indel
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
VLSPAV-WAKV
VLSEAVWAKV
Option 2: The ancestor was shorter and the L
was inserted here. In such a case, the event
was an insertion.
VLSEAVLWAKV
L
Indel
VLSPAV-WAKV
Normally, given two sequences we cannot tell
whether it was an insertion or a deletion, so
we term the event as an indel.
VLSEAVLWAKV
Deletion? Insertion?
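Indels in protein coding genes
Indels in protein coding genes are often of 3bp, 6bp, 9bp, etc...
Why? Because codons are 3 bp long: an indel whose length is not a multiple of 3 shifts the reading frame of everything downstream.
Gene Search
In fact, searching for indels of length 3k (k=1,2,3,…) can help algorithms that search a genome for coding regions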
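A minimal sketch of the 3k check, assuming the two aligned DNA rows use "-" for gap positions. The function names and the example sequences are made up for illustration.

```python
import re


def gap_lengths(aligned_seq):
    """Lengths of each consecutive run of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned_seq)]


def all_gaps_in_frame(aligned_a, aligned_b):
    """True if every indel in the pairwise alignment has a length divisible by 3."""
    lengths = gap_lengths(aligned_a) + gap_lengths(aligned_b)
    return all(n % 3 == 0 for n in lengths)


# Hypothetical aligned DNA fragments.
in_frame = all_gaps_in_frame("ATGGCC---AAATGA", "ATGGCCGGGAAATGA")
out_of_frame = all_gaps_in_frame("ATGGCC-AAATGA", "ATGGCCGAAATGA")
print(in_frame, out_of_frame)
```

A region in which all indels pass this check is only a hint of coding sequence, not proof.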
Global and Local pairwise
alignments
Global vs. Local
• Global alignment – finds the best alignment across the entire two sequences.
• Local alignment – finds regions of similarity in parts of the sequences.

Global alignment: forces alignment in regions which differ
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

Local alignment: returns only regions of good alignment
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
Global alignment
FAK1 (a.k.a. PTK2, protein tyrosine kinase 2) of human and mouse
…
Proteins are composed of domains
Human PTK2 domains: Domain A, protein tyrosine kinase domain, Domain B
Global alignment may be problematic!
PTK2 domains: Domain A, protein tyrosine kinase domain, Domain B
Leukocyte TK domains: Domain X, protein tyrosine kinase domain
The sequence similarity is restricted to a single domain
Global alignment of PTK and LTK
(partial)
Local alignment of PTK and LTK
Conclusions
Use global alignment when the two sequences
share the same overall sequence arrangement.
Use local alignment to detect regions of
similarity.
How alignments are computed
Pairwise alignment
Two sequences:
AAGCTGAATTCGAA
AGGCTCATTTCTGA

One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

This alignment includes:
2 mismatches
4 indels (gaps)
10 perfect matches
Choosing an alignment for a pair of sequences
AAGCTGAATTCGAA
AGGCTCATTTCTGA

Many different alignments are possible for 2 sequences:

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Which alignment is better?
Scoring system (naïve)
Perfect match: +1
Mismatch: -2
Indel (gap): -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)x10 + (-2)x2 + (-1)x4 = 2

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)x9 + (-2)x2 + (-1)x6 = -1

Higher score → better alignment
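The naive scoring can be sketched as a column-by-column sum. The function name `naive_score` is an illustration only. It assumes both rows have equal length, with the trailing gap written explicitly and "-" marking an indel.

```python
def naive_score(row1, row2, match=1, mismatch=-2, gap=-1):
    """Sum independent per-column scores for two equal-length alignment rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # indel column
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


score_1 = naive_score("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
score_2 = naive_score("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(score_1, score_2)  # 2 -1
```

Changing the three parameters changes which alignment wins, which is why the choice of scoring system matters.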
Alignment scoring - scoring of
sequence similarity:
Assumes independence between positions:
each position is considered separately
Scores each position:
• Positive if identical (match)
• Negative if different (mismatch or gap)
Total score = sum of position scores
Can be positive or negative
Scoring systems
Scoring system
• In the example above, the choice of +1 for
match,-2 for mismatch, and -1 for gap is
quite arbitrary
• Different scoring systems → different alignments
• We want a good scoring system…
Scoring matrix
      A    G    C    T
A     2
G    -6    2
C    -6   -6    2
T    -6   -6   -6    2
• The scoring system is represented as an n x n table or matrix (n is the number of letters in the alphabet: n=4 for nucleotides, n=20 for amino acids)
• The matrix is symmetric
DNA scoring matrices
• Uniform substitutions between all nucleotides (match = 2, mismatch = -6):
From \ To    A    G    C    T
A            2   -6   -6   -6
G           -6    2   -6   -6
C           -6   -6    2   -6
T           -6   -6   -6    2
DNA scoring matrices
Can take into account biological phenomena
such as:
• Transition-transversion
DNA scoring matrices
• Non-uniform substitutions between nucleotides (match = 2; mismatch = -2 for the transitions A<->G and C<->T, -6 for transversions):
From \ To    A    G    C    T
A            2   -2   -6   -6
G           -2    2   -6   -6
C           -6   -6    2   -2
T           -6   -6   -2    2
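A matrix that distinguishes transitions from transversions can be generated rather than typed by hand. This is a sketch using the values shown in the slide (2, -2, -6). The function name `dna_score` and the dictionary layout are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(a, b, match=2, transition=-2, transversion=-6):
    """Score one nucleotide pair: transitions stay within purines or within pyrimidines."""
    if a == b:
        return match
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition
    return transversion


# Full 4 x 4 matrix keyed by (from, to).
matrix = {(a, b): dna_score(a, b) for a in "AGCT" for b in "AGCT"}

for a in "AGCT":
    print(a, [matrix[(a, b)] for b in "AGCT"])
```

Setting `transition` equal to `transversion` reproduces the uniform matrix.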
Scoring gaps (I)
In advanced algorithms, two separate gaps of one amino acid each are scored differently from a single gap of two amino acids. This is achieved by charging a penalty for each gap that is opened, in addition to a penalty for its length.
Gap extension penalty < Gap opening penalty
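Scoring gaps (II)
Assume that the gap opening cost is -3 and that each gap position contributes a further -1 per base pair. Only the gap penalties are counted in the scores below.
AGGGTTC-GA
AGGGTTCTGA
Score = -4
AGGGTT--GA
AGGGTTCTGA
Score = -5
AGGGT---GA
AGGGTTCTGA
Score = -6
AGGG--C-GA
AGGGTTCTGA
Score = -9
The penalty grows linearly with gap length (an opening cost plus a per-position cost, i.e. an affine gap penalty).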
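The gap scores above can be reproduced with a few lines. This is a sketch in which gaps are written as "-", each run of gaps pays the opening cost once, and every gap position pays the per-position cost. The function name `gap_penalty` is hypothetical.

```python
import re


def gap_penalty(aligned_seq, gap_open=-3, gap_extend=-1):
    """Total gap penalty for one alignment row: opening cost per run plus cost per position."""
    runs = re.findall(r"-+", aligned_seq)
    return sum(gap_open + gap_extend * len(run) for run in runs)


examples = ["AGGGTTC-GA", "AGGGTT--GA", "AGGGT---GA", "AGGG--C-GA"]
penalties = [gap_penalty(row) for row in examples]
print(penalties)  # [-4, -5, -6, -9]
```

The last example costs more than the third even though both contain three gap positions, because it opens two gaps instead of one.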
Intermediate summary
1. Scoring system =
substitution matrix + gap penalty.
2. Used for both global and local alignment
Computational Aspects
Many possible alignments
AAGCTGAATTCGAA
AGGCTCATTTCTGA

AAGCT-GAATT-C-GAA
A-GGCT-CATTTCTGA-

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

AAG-CTGAATT-C-GAA
AGGCT-CATTT-CTGA-

Which alignment has the best score?
• Two sequences of length 10 have >> 1,000,000 possible alignments
• Two sequences of length 20 have >> 100,000,000,000,000 possible alignments
• Two sequences of length 30 have >> 10,000,000,000,000,000,000,000 possible alignments
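One way to get a feel for these numbers is to count paths through the alignment grid, where each step is a diagonal move (two letters aligned), a move right, or a move down (a gap). This is an assumed counting convention: it treats alignments that differ only in the order of adjacent gaps as distinct, so other conventions give somewhat different totals. The function name is hypothetical.

```python
def count_alignments(m, n):
    """Count grid paths from (0, 0) to (m, n) using diagonal, right and down steps."""
    table = [[1] * (n + 1) for _ in range(m + 1)]  # a single path along each border
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1]
    return table[m][n]


for length in (10, 20, 30):
    print(length, count_alignments(length, length))
```

Under this convention the counts are of roughly the same order of magnitude as the figures quoted above, which is why trying every alignment is not practical.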
Optimal alignment algorithms
• Needleman-Wunsch (global) [1970]
• Smith-Waterman (local) [1981]
Only length(seq1) x length(seq2) operations!
Length of seq1, seq2    # operations    Instead of
10                      100             1,000,000
20                      400             100,000,000,000,000
30                      900             10,000,000,000,000,000,000,000
Matrix Representation
Match = 1
Mismatch = -1
Indel = -2
Seq1 (AAAC) runs down the left column; Seq2 (AGC) runs across the top row.
           A    G    C
      0   -2   -4   -6
A    -2    1   -1   -3
A    -4   -1    0   -2
A    -6   -3   -2   -1
C    -8   -5   -4   -1
score(AAAC,AGC) = -1
Matrix Representation
score(AAA,AG) = -2
A G C
0 -2 -4 -6
A -2 1 -1 -3
A -4 -1 0 -2
A -6 -3 -2 -1
C -8 -5 -4 -1
Match = 1
Mismatch = -1
Indel = -2
Seq2
Seq1
Matrix Representation
score(,AG) = -4
A G C
0 -2 -4 -6
A -2 1 -1 -3
A -4 -1 0 -2
A -6 -3 -2 -1
C -8 -5 -4 -1
Match = 1
Mismatch = -1
Indel = -2
Seq2
Seq1
The best alignment path
A G – C
A A A C
Match = 1
Mismatch = -1
Indel = -2
Seq2
Seq1
A G C
0 -2 -4 -6
A -2 1 -1 -3
A -4 -1 0 -2
A -6 -3 -2 -1
C -8 -5 -4 -1
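Computing the score of each cell
$M_{i,j}$ = the score in cell (i,j)

$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch in the diagonal)} \\ M_{i,j-1} + w & \text{(gap in sequence 1)} \\ M_{i-1,j} + w & \text{(gap in sequence 2)} \end{cases}$$

Each cell depends on three neighbors:
M_{i-1,j-1}   M_{i-1,j}
M_{i,j-1}     M_{i,j}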
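The recurrence can be sketched as a matrix fill, assuming sequence 1 indexes the rows, sequence 2 indexes the columns, S is +1 for a match and -1 for a mismatch, and w = -2 as in the slides. The function name `nw_matrix` is an illustration, not a library call.

```python
def nw_matrix(seq1, seq2, match=1, mismatch=-1, indel=-2):
    """Fill the global-alignment score matrix (rows: seq1, columns: seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        M[i][0] = i * indel              # prefix of seq1 against gaps only
    for j in range(1, cols):
        M[0][j] = j * indel              # prefix of seq2 against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i][j] = max(
                M[i - 1][j - 1] + s,     # diagonal: align the two letters
                M[i][j - 1] + indel,     # from the left: gap in sequence 1
                M[i - 1][j] + indel,     # from above: gap in sequence 2
            )
    return M


M = nw_matrix("AAAC", "AGC")
for row in M:
    print(row)
```

The bottom-right cell holds score(AAAC,AGC), and every other cell holds the best score for the corresponding pair of prefixes.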
Computing the Matrix
A G C
0 -2 -4 -6
A -2
A -4
A -6
C -8
Match = 1
Mismatch = -1
Indel = -2
This cell reflects putting A in the sequence in the
row against a gap in the sequence of the column:
Col -
Row A
Mi-1,j-1 Mi-1,j
Mi,j-1 Mi,j
Computing the Matrix
A G C
0 -2 -4 -6
A -2
A -4
A -6
C -8
Match = 1
Mismatch = -1
Indel = -2
This cell reflects putting AAAC in the sequence in
the column against 4 gaps in the sequence of the
row: Col AAAC
Row ----
Mi-1,j-1 Mi-1,j
Mi,j-1 Mi,j
Computing the Matrix
A G C
0 -2 -4 -6
A -2 ?
A -4
A -6
C -8
Match = 1
Mismatch = -1
Indel = -2
This arrow reflects putting A against A and
coming from putting nothing against nothing:
Col A
Row A
Col
Row
score =0+1=1
Mi-1,j-1 Mi-1,j
Mi,j-1 Mi,j
Computing the Matrix
A G C
0 -2 -4 -6
A -2 ?
A -4
A -6
C -8
Match = 1
Mismatch = -1
Indel = -2
This arrow reflects putting A from the column
sequence against an indel:
Mi-1,j-1 Mi-1,j
Mi,j-1 Mi,j
Col -A
Row A-
Col -
Row A
score =-2-2=-4
Computing the Matrix
A G C
0 -2 -4 -6
A -2 ?
A -4
A -6
C -8
Match = 1
Mismatch = -1
Indel = -2
This arrow reflects putting A from the row
sequence against an indel:
Mi-1,j-1 Mi-1,j
Mi,j-1 Mi,j
Col A-
Row -A
Col A
Row -
score =-2-2=-4
Computing the Matrix
A G C
0 -2 -4 -6
A -2 1
A -4
A -6
C -8
Match = 1
Mismatch = -1
Indel = -2
We have three possible paths to the A against A,
and we chose the path that has the best score
Computing the Matrix
A G C
0 -2 -4 -6
A -2 1
A -4
A -6
C -8
Match = 1
Mismatch = -1
Indel = -2
This arrow reflects putting A from the row
sequence against indel:
Col AA –
Row – –A
Col AA
Row – –
score=
-4-2=-6
Mi-1,j-1 Mi-1,j
Mi,j-1 Mi,j
Computing the Matrix
A G C
0 -2 -4 -6
A -2 1
A -4
A -6
C -8
Match = 1
Mismatch = -1
Indel = -2
This arrow reflects putting A from the column
sequence:
Col AA
Row A–
Col A
Row A
score=
1-2=-1
Mi-1,j-1 Mi-1,j
Mi,j-1 Mi,j
Computing the Matrix
A G C
0 -2 -4 -6
A -2 1
A -4
A -6
C -8
Match = 1
Mismatch = -1
Indel = -2
This arrow reflects putting A from the column
sequence and A from the row sequence:
Col AA
Row –A
Col A
Row –
score=
-2+1=-1
Mi-1,j-1 Mi-1,j
Mi,j-1 Mi,j
Computing the Matrix
A G C
0 -2 -4 -6
A -2 1
A -4 -1
A -6
C -8
Match = 1
Mismatch = -1
Indel = -2
We chose one of the best arrows. The best
alignment for this cell is
Col AA
Row A–
Computing the entire Matrix
A G C
0 -2 -4 -6
A -2 1 -1 -3
A -4 -1 0 -2
A -6 -3 -2 -1
C -8 -5 -4 -1
Match = 1
Mismatch = -1
Indel = -2
A G C
0 -2 -4 -6
A -2 1 -1 -3
A -4 -1 0 -2
A -6 -3 -2 -1
C -8 -5 -4 -1
Reconstructing the alignment from the full matrix
AG-C
AAAC
We chose a path that goes from the end to
the beginning (marked here in Red). This is
called the Trace Back
A G C
0 -2 -4 -6
A -2 1 -1 -3
A -4 -1 0 -2
A -6 -3 -2 -1
C -8 -5 -4 -1
Reconstructing the alignment from the full matrix
A-GC
AAAC
A G C
0 -2 -4 -6
A -2 1 -1 -3
A -4 -1 0 -2
A -6 -3 -2 -1
C -8 -5 -4 -1
Reconstructing the alignment from the full matrix
-AGC
AAAC
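The trace back can be sketched as a walk from the bottom-right cell to the top-left cell, at each step moving to a neighbor that explains the current score. Several paths can tie, as the three alignments above show. This sketch breaks ties by trying the diagonal first, then the cell above; that order is an arbitrary assumption, and the function name is hypothetical.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, indel=-2):
    """Return (score, aligned seq1, aligned seq2) for one optimal global alignment."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        M[i][0] = i * indel
    for j in range(1, cols):
        M[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] + indel, M[i - 1][j] + indel)

    # Trace back from the end to the beginning.
    out1, out2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:      # came from the diagonal
                out1.append(seq1[i - 1])
                out2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and M[i][j] == M[i - 1][j] + indel:  # came from above
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
            continue
        out1.append("-")                              # came from the left
        out2.append(seq2[j - 1])
        j -= 1
    return M[-1][-1], "".join(reversed(out1)), "".join(reversed(out2))


score, aligned1, aligned2 = needleman_wunsch("AAAC", "AGC")
print(score)
print(aligned2)
print(aligned1)
```

With a different tie-breaking order the same code would return one of the other equally scoring alignments.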
Local alignment (note, letters were changed)
G A C
0 0 0 0
A 0
G 0
A 0
T 0
Match = 1
Mismatch = -1
Indel = -2
We initialize everything with a 0. Across the
matrix, there are no negative values (every time
we need to put a negative value, we put a 0).
Local alignment
G A C
0 0 0 0
A 0 0
G 0
A 0
T 0
Match = 1
Mismatch = -1
Indel = -2
We initialize everything with a 0. Across the
matrix, there are no negative values (every time
we need to put a negative value, we put a 0).
Local alignment
G A C
0 0 0 0
A 0 0 1 0
G 0 1 0 0
A 0 0
T 0 0
Match = 1
Mismatch = -1
Indel = -2
We initialize everything with a 0. Across the
matrix, there are no negative values (every time
we need to put a negative value, we put a 0).
Local alignment
G A C
0 0 0 0
A 0 0 1 0
G 0 1 0 0
A 0 0 2 0
T 0 0 0 1
Match = 1
Mismatch = -1
Indel = -2
We initialize everything with a 0. Across the
matrix, there are no negative values (every time
we need to put a negative value, we put a 0).
Local alignment
G A C
0 0 0 0
A 0 0 1 0
G 0 1 0 0
A 0 0 2 0
T 0 0 0 1
Match = 1
Mismatch = -1
Indel = -2
In trace-back, we start with the highest value and
go back to a zero (anywhere).
Col GA
Row GA
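Local alignment changes the global procedure in two places: every cell is floored at 0, and the trace back starts at the highest cell and stops at a 0. The sketch below assumes the same match, mismatch and indel scores as the slides, with AGAT on the rows and GAC on the columns. The function name is an illustration only.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, indel=-2):
    """Return (best score, aligned part of seq1, aligned part of seq2, matrix)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]      # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i][j - 1] + indel, H[i - 1][j] + indel)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest cell until a zero is reached.
    out1, out2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + indel:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2)), H


best, local1, local2, H = smith_waterman("AGAT", "GAC")
print(best, local1, local2)
```

Only the matching GA segment is reported; the flanking letters that do not align well are left out.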
Global and Local pairwise alignments -
summary
• Global alignment – finds the best
alignment over all the positions
• Local alignment – finds regions of
similarity in parts of the sequences
Global & Local Alignments
Global Most useful when the sequences are similar in the
entire alignment.
Local More useful for dissimilar sequences that
are suspected to contain regions of similarity
or similar sequence motifs.
Pairwise alignment
• Local alignment – Smith-Waterman algorithm
• Global alignment – Needleman-Wunsch algorithm
• Protein and nucleotide alignments
• Input: two sequences
• Output: optimal sequence alignment
Question #1:
For highly similar sequences, which method is more suitable for alignment?
Global
Local
Answer #1:
Global alignment
Global Alignment Webserver
(Needleman-Wunsch algorithm)
• https://www.ebi.ac.uk/jdispatcher/psa/emboss_needle
Local Alignment Webserver (Smith-
Waterman algorithm)
• https://www.ebi.ac.uk/jdispatcher/psa/emboss_water
Altering parameters
• Scoring matrix, e.g. BLOSUM:
– Use a higher-numbered matrix for more similar sequences
• Gap open & extend penalty
– Use a higher gap open penalty for more similar sequences
– If gap length matters, increase the gap extend penalty
• Output format (pair)
Protein scoring matrix
The BLOSUM62 matrix.
The score for substituting one amino acid with another.
BLOcks SUbstitution Matrix
• Based on comparisons of blocks of sequences from the
Blocks database.
• BLOSUM62 was built using data from sequences that
were up to 62% identical.
Results page
• Submission details
• Parameters
• Scores
– Identity
– Similarity
– Gaps
– Total score
Results page
• The alignment
• Notice
– Positions
– Matches |
– Similarities :
– Mismatches .
– Indels -
Question #3:
If the sequences are of unequal length, which of them will contain more gaps in a global alignment?
1
2
Answer #3:
Sequence no. 1
In local alignment:
Question #4:
For closely related sequences,
which BLOSUM matrix is better – BLOSUM62 or
BLOSUM30?
Answer #4:
BLOSUM62
Summary
• Sequence similarity
– Alignment - scoring system, local vs. global
• Pairwise alignment tools
– Needle for global
– Water for local

Pairwise Sequence Alignment is alignment.pptx


Editor's Notes

  • #6 Orthologous genes: genes with similar sequence and function that are found in different species.
  • #50 Write the formula on the board, because we will use it later.