2. Alignment
• In molecular biology, a common question is
whether or not two sequences are related.
• The most common way to tell is to compare
the two sequences with one another and see
whether they are similar.
3. • Biological sequences that are similar (but not identical)
can help reveal functional, structural, and evolutionary
relationships.
• One common mistake is to describe two sequences as
having some degree of homology, or a percent homology,
based on their sequence similarity.
• This is a misuse of the biological term.
• Two sequences are homologous only if
they have been derived from a common ancestor
sequence; homology is either present or absent, so it cannot
be expressed as a percentage.
• However, the greater the sequence similarity, the greater
the chance that the sequences share similar function and/or
structure.
4. Biological Definitions for Related
Sequences
• Homologs are similar sequences that have
been derived from a common ancestor
sequence.
– Homologs can be described as either orthologous or
paralogous.
• Orthologs are similar sequences in two different
organisms that have arisen due to a speciation
event.
– Orthologs typically retain their functionality
throughout evolution.
5. • Paralogs are similar sequences within a single
organism that have arisen due to a gene
duplication event.
• Xenologs are similar sequences that were not
inherited through vertical descent, but rather
have arisen out of horizontal transfer events
(through symbiosis, viruses, etc.).
6. Hamming or edit distance
• One method in determining sequence
similarity is to determine the edit distance
between two sequences.
• If we take the example of pear and tear, how
similar are these two words?
• There is a mismatch in the first letter, and
matches in the last three.
• An alignment of these two is as follows:
7.
P E A R
  | | |
T E A R
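• One way to score this alignment is to calculate the Hamming distance,
which is the number of positions at which two aligned words of equal
length differ.
• The Hamming distance is calculated by summing up the
number of mismatches when the two words are aligned to one another.
• In this example, the Hamming distance is 1.
• The edit distance is more general: it is the minimum number of
single-letter substitutions, insertions, and deletions needed to turn
one word into the other.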
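The two distances can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `hamming_distance` and `edit_distance` are hypothetical, and "edit distance" is assumed here to mean the Levenshtein distance (unit cost for each substitution, insertion, or deletion).

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires words of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed meaning of 'edit distance' here)."""
    # prev[j] holds the distance between the first i-1 letters of a
    # and the first j letters of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]  # turning i letters of a into an empty word costs i deletions
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(hamming_distance("PEAR", "TEAR"))  # 1
print(edit_distance("PEAR", "TEAR"))     # 1
```

For words of equal length the edit distance is never larger than the Hamming distance, and unlike the Hamming distance it is also defined for words of different lengths.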
8. Scoring alignment
• Which alignment is the better alignment? One
way to judge this is to assign a positive score for
each match, a negative score for each
mismatch, and a negative score for each
insertion/deletion
(collectively referred to as indels).
• One scoring scheme might assign the following
values:
– match: +2
– mismatch: -1
– indel: -2
9. Using this scoring scheme, the first alignment has 5 matches, 1 mismatch, and 4
indels.
The score for this alignment is: 5(2) - 1(1) - 4(2) = 10 - 1 - 8 = 1.
The second alignment has 6 matches, 1 mismatch, and 2 indels.
The score for the second alignment is: 6(2) - 1(1) - 2(2) = 12 - 1 - 4 = 7.
Therefore, using the above scoring scheme, the second alignment is the
better alignment, since it produces a higher alignment score.
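The arithmetic above can be checked with a short Python sketch. The function names, the default parameter values, and the use of "-" as the gap character are assumptions made for illustration; the two alignments being compared are not reproduced in the text, so the check uses only the match, mismatch, and indel counts quoted above.

```python
def score_from_counts(matches, mismatches, indels,
                      match=2, mismatch=-1, indel=-2):
    """Score an alignment from its match, mismatch, and indel counts."""
    return matches * match + mismatches * mismatch + indels * indel


def score_alignment(top, bottom, match=2, mismatch=-1, indel=-2, gap="-"):
    """Score two aligned (gapped) sequences column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == gap and y == gap:
            raise ValueError("a column cannot contain two gaps")
        if x == gap or y == gap:
            score += indel       # insertion/deletion
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


print(score_from_counts(5, 1, 4))       # first alignment: 1
print(score_from_counts(6, 1, 2))       # second alignment: 7
print(score_alignment("PEAR", "TEAR"))  # 3 matches, 1 mismatch: 5
```

Changing the three score values changes which alignment wins, so the scheme should be fixed before alignments are compared.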
10. Dot Plots
• One of the more basic, yet important, techniques
for determining the alignment between two
sequences is a visual comparison known
as a dot plot.
• Dot plots of sequence similarity are created
using a matrix where the rows in the matrix
correspond to the characters in the first sequence
and the columns in the matrix correspond to the
characters in the second sequence.
11. • The dot plot is created as follows:
– Loop through each row. For the current row,
take the character in that row and compare it
to the character in each column.
– If they are equal, place a dot in that cell of the matrix.
– Continue until all cells in the matrix have
been considered.
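The procedure above can be sketched in Python as follows. The names `dot_plot` and `render` are hypothetical, and the text rendering ("*" for a dot, "." for an empty cell) is only one possible way to display the matrix.

```python
def dot_plot(seq1: str, seq2: str) -> list[list[bool]]:
    """Build the dot matrix: rows follow seq1, columns follow seq2."""
    # A cell is True (a dot) when the row character equals the column character.
    return [[a == b for b in seq2] for a in seq1]


def render(matrix: list[list[bool]], seq1: str, seq2: str) -> str:
    """Render the matrix as text, with seq2 across the top and seq1 down the side."""
    lines = ["  " + " ".join(seq2)]
    for ch, row in zip(seq1, matrix):
        lines.append(ch + " " + " ".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


matrix = dot_plot("PEAR", "TEAR")
print(render(matrix, "PEAR", "TEAR"))
```

For PEAR against TEAR, the three matching letters E, A, and R fall on the main diagonal, while the row for P stays empty.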
12.
13.
14. Information within Dot Plots
• Dot plots are useful as a first-level filter for
determining an alignment between two
sequences.
• Regions of similarity show up as diagonals
within the dot plot matrix.
• Regions containing insertions/deletions can be
readily identified as breaks or shifts in a diagonal.
• One potential application is to determine the
number of coding regions (exons) contained
within a processed mRNA by plotting it against the
corresponding genomic sequence.
15. • Regions of genomic DNA can contain
repetitive regions.
• For instance, approximately 50 percent of the
human genome consists of repetitive elements, which
can be on the order of a few hundred bases in length.
• Low-complexity regions are present as well.
• In addition to repetitive elements, regions of a
genome can be duplicated.
16. • The duplicated region can be found either as a
direct repeat or as an inverted repeat:
– Direct repeat: the duplicated region occurs in the same direction.
– Inverted repeat: the sequence of the duplicated region is found in the
reverse complement direction.
• Dot plots can readily show regions of direct
and inverted repeats.
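One way to make inverted repeats visible is to place a dot wherever a base in the first sequence is the complement of a base in the second; an inverted repeat then appears as an anti-diagonal run of dots. The sketch below assumes DNA written with the letters A, C, G, and T, and the function names are hypothetical.

```python
# Translation table for the standard DNA base pairs (A-T, C-G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def direct_dots(seq1: str, seq2: str) -> list[list[bool]]:
    """Dots where bases are identical; direct repeats give diagonals."""
    return [[a == b for b in seq2.upper()] for a in seq1.upper()]


def inverted_dots(seq1: str, seq2: str) -> list[list[bool]]:
    """Dots where a base in seq1 pairs with a base in seq2;
    inverted repeats give anti-diagonals."""
    comp2 = seq2.upper().translate(COMPLEMENT)
    return [[a == b for b in comp2] for a in seq1.upper()]


site = "GAATTC"  # reads the same as its own reverse complement
print(reverse_complement(site) == site)  # True
m = inverted_dots(site, site)
print(all(m[i][len(site) - 1 - i] for i in range(len(site))))  # True
```

Plotting both matrices for the same pair of sequences shows direct repeats as diagonals in one and inverted repeats as anti-diagonals in the other.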
17.
18. • Dot plots show all possible matches of
residues between two sequences.
• The researcher can then decide which alignments
are the most significant.