Bioinformatica t4-alignments

FBW
16-10-2012
Wim Van Criekinge

Course Contents: Bioinformatics
NO CLASS

Rat versus mouse RBP
Rat versus bacterial lipocalin

– Henikoff and Henikoff have compared the
BLOSUM matrices to PAM by evaluating how
effectively the matrices can detect known members
of a protein family from a database when searching
with the ungapped local alignment program
BLAST. They conclude that overall the BLOSUM
62 matrix is the most effective.
• However, all the substitution matrices investigated
perform better than BLOSUM 62 for a proportion of
the families. This suggests that no single matrix is
the complete answer for all sequence comparisons.
• It is probably best to complement the BLOSUM 62
matrix with comparisons using the PAM 250 matrix and
the Overington structurally derived matrices.
– It seems likely that as more protein
three-dimensional structures are determined, substitution
tables derived from structure comparison will give
the most reliable data.
Overview

Dotplots
• What is it ?
– Graphical representation using two orthogonal
axes and “dots” for regions of similarity.
– In a bioinformatics context two sequences are
used on the axes and dots are plotted when a
given threshold is met in a given window.
• Dot-plotting is the best way to see all of the
structures in common between two
sequences or to visualize all of the repeated
or inverted repeated structures in one
sequence

Visual Alignments (Dot Plots)
• Matrix
– Rows: Characters in one sequence
– Columns: Characters in second sequence
• Filling
– Loop through every cell; if the row and column characters match, fill
in the cell
– Continue until all cells have been examined

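Dotplot-simulator.pl
# header row: seq1 along the top
print " $seq1\n";
# one output row per position in seq2 (stop before the end of the string)
for (my $teller = 0; $teller < $seq2_length; $teller++) {
    print substr($seq2, $teller, 1);
    $w2 = substr($seq2, $teller, $window);
    for (my $teller2 = 0; $teller2 < $seq_length; $teller2++) {
        $w1 = substr($seq1, $teller2, $window);
        # a dot marks two identical windows
        if ($w1 eq $w2) { print "*"; } else { print " "; }
    }
    print "\n";
}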
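The same idea can be sketched in Python. This is a minimal illustration, not a port of the course script: the function name `dotplot` and the choice to skip incomplete windows at the sequence ends are assumptions.

```python
def dotplot(seq1, seq2, window=1):
    """Return the plot as text rows: seq1 runs along the top, seq2 down the side."""
    rows = []
    for i in range(len(seq2) - window + 1):
        w2 = seq2[i:i + window]
        # a "*" marks a position where the two windows are identical
        cells = ("*" if seq1[j:j + window] == w2 else " "
                 for j in range(len(seq1) - window + 1))
        rows.append(seq2[i] + "".join(cells))
    return rows


if __name__ == "__main__":
    seq = "ACCTGAGCTCACCTGAGTTA"  # self-alignment example from the slides
    print(" " + seq)
    print("\n".join(dotplot(seq, seq, window=3)))
```

Raising `window` removes most of the isolated dots and leaves the diagonals.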

Overview
Window size = 1, stringency 100%

Noise in Dot Plots
• Nucleic Acids (DNA, RNA)
– 1 out of 4 bases matches at random
• Stringency
– Window size is considered
– Percentage of bases matching in the window is
set as threshold
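Stringency can be sketched as a fraction of matching positions inside the window. The function name `dotplot_stringency` and the default values are hypothetical and only serve the illustration.

```python
def dotplot_stringency(seq1, seq2, window=10, stringency=0.7):
    """Return (row, column) pairs whose windows match in at least `stringency` of positions."""
    dots = []
    for i in range(len(seq2) - window + 1):
        for j in range(len(seq1) - window + 1):
            matches = sum(a == b for a, b in zip(seq1[j:j + window],
                                                 seq2[i:i + window]))
            if matches / window >= stringency:
                dots.append((i, j))
    return dots
```

With `window=1` and `stringency=1.0` roughly one cell in four lights up for random DNA; a longer window with a threshold below 100% keeps imperfect but real similarities while dropping most random hits.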

Reduction of Dot Plot Noise
Self alignment of ACCTGAGCTCACCTGAGTTA

Dotplot-simulator.pl
Example: ZK822 Genomic and cDNA
Gene prediction:
How many exons ?
Confirm donor and acceptor sites ?
Remember to check the reverse complement !

• Regions of similarity appear
as diagonal runs of dots
• Reverse diagonals
(perpendicular to diagonal)
indicate inversions
• Reverse diagonals crossing
diagonals (Xs) indicate
palindromes
• A gap is introduced by each
vertical or horizontal skip

• Window size changes with goal
of analysis
– size of average exon
– size of average protein structural
element
– size of gene promoter
– size of enzyme active site

Rules of thumb
• Don't get too many points: about 3-5 times
the length of the sequence is about right (1-2%)
• Window size about 20 for distant
proteins, 12 for nucleic acids
• Check sequence vs. itself
• Check sequence vs. sequence
• Anticipate results
(e.g. “in-house” sequence vs. genomic,
question)

Available Dot Plot Programs
Dotlet (Java Applet)
http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Sequence Alignments
Introduction
Algorithms
What ?
Examples
Properties
Dynamic Programming for Pairwise Alignment
Concept
Example
Needleman-Wunsch(.pl)
Smith-Waterman(.pl)
Multiple Alignment
MSA
Hierarchical Pairwise Alignment
ClustalW, PileUp
Formatting
Interpretation
Alternative Methods
SIM
Blast2
Dali

Global and local alignment
Pairwise sequence alignment can be global or local
Global: the sequences are completely aligned
(Needleman and Wunsch, 1970)
Local: only the best sub-regions are aligned
(Smith and Waterman, 1981). BLAST
uses local alignment.

– In order to characterize protein families, identify
shared regions of homology in a multiple
sequence alignment; (this happens generally
when a sequence search revealed homologies to
several sequences)
– Determination of the consensus sequence of
several aligned sequences
– Help prediction of the secondary and tertiary
structures of new sequences;
– Preliminary step in molecular evolution analysis
using Phylogenetic methods for constructing
phylogenetic trees
– Garbage in, Garbage out
– Chicken/egg
Why we do multiple alignments?

Why we do multiple alignments?
• To find conserved regions
– Local multiple alignment reveals conserved
regions
– Conserved regions usually are key functional
regions
– These regions are prime targets for drug
development
• To do phylogenetic analysis:
– Same protein from different species
– Optimal multiple alignment probably implies
history
– Discover irregularities, such as Cystic Fibrosis
gene

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Algorithms and Programs
• Algorithm: a method or a process followed
to solve a problem.
– A recipe.
• An algorithm takes the input to a problem
(function) and transforms it to the output.
– A mapping of input to output.
• A problem can have many algorithms.

Bubble Sort Algorithm
One of the simplest sorting algorithms proceeds by walking down the list, comparing
adjacent elements, and swapping them if they are in the wrong order. The process is
continued until the list is sorted.
More formally:
1. Initialize the size of the list to be sorted to be the actual size of the list.
2. Loop through the list until no element needs to be exchanged with another
to reach its correct position.
2.1 Loop (i) from 0 to size of the list to be sorted - 2.
2.1.1 Compare the ith and (i + 1)st elements in the unsorted list.
2.1.2 Swap the ith and (i + 1)st elements if not in order (ascending or
descending as desired).
2.2 Decrease the size of the list to be sorted by 1.
Each pass "bubbles" the largest element in the unsorted part of the list to its correct location.
Example array A: 13 7 43 5 3 19 2 23 29

Bubble Sort Implementation
Here is an ascending-order implementation of the bubblesort algorithm for integer arrays:
void BubbleSort(int List[] , int Size) {
    int tempInt; // temp variable for swapping list elems
    for (int Stop = Size - 1; Stop > 0; Stop--) {
        for (int Check = 0; Check < Stop; Check++) { // make a pass
            if (List[Check] > List[Check + 1]) { // compare elems
                tempInt = List[Check]; // swap if in the
                List[Check] = List[Check + 1]; // wrong order
                List[Check + 1] = tempInt;
            }
        }
    }
}
Bubblesort compares and swaps adjacent elements; simple but not very efficient.
Efficiency note: the outer loop could be modified to exit if the list is already sorted.
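The efficiency note can be illustrated with a short Python sketch. It is one possible reading of "exit if the list is already sorted": a `swapped` flag (an assumed name) ends the outer loop as soon as a full pass makes no exchange.

```python
def bubble_sort(values):
    """Return a new ascending list; stops early once a pass makes no swap."""
    items = list(values)
    stop = len(items) - 1          # last index of the unsorted part
    swapped = True
    while swapped and stop > 0:
        swapped = False
        for check in range(stop):  # one pass over the unsorted part
            if items[check] > items[check + 1]:
                items[check], items[check + 1] = items[check + 1], items[check]
                swapped = True
        stop -= 1                  # the largest element is now in place
    return items
```

On an already sorted list this version makes a single pass, so the best case becomes linear while the worst case stays quadratic.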

Ice cream
• 6 egg yolks + 105 g S1 granulated sugar
• Whisk for 1’ to the “ruban” (ribbon) stage
• Meanwhile, heat 500 ml whole milk
with 105 g S1 sugar
• Add vanilla and/or chocolate (cinnamon)
• Slowly whisk the nearly boiling milk into the ribbon
mixture (off the heat)
• Back on the heat: “Porter à la nappe”
• Cool
• Churn (in an ice cream machine)
• 15” before it sets, add 500 ml cream

"Great algorithms are the poetry of computation"

"Great algorithms are the poetry of computation"
1946: The Metropolis Algorithm for Monte Carlo. Through the use of random
processes, this algorithm offers an efficient way to stumble toward answers to
problems that are too complicated to solve exactly.
1947: Simplex Method for Linear Programming. An elegant solution to a common
problem in planning and decision-making.
1950: Krylov Subspace Iteration Method. A technique for rapidly solving the linear
equations that abound in scientific computation.
1951: The Decompositional Approach to Matrix Computations. A suite of techniques
for numerical linear algebra.
1957: The Fortran Optimizing Compiler. Turns high-level code into efficient
computer-readable code.
1959: QR Algorithm for Computing Eigenvalues. Another crucial matrix operation
made swift and practical.
1962: Quicksort Algorithms for Sorting. For the efficient handling of large databases.
1965: Fast Fourier Transform. Perhaps the most ubiquitous algorithm in use today, it
breaks down waveforms (like sound) into periodic components.
1977: Integer Relation Detection. A fast method for spotting simple equations satisfied
by collections of seemingly unrelated numbers.
1987: Fast Multipole Method. A breakthrough in dealing with the complexity of n-body
calculations, applied in problems ranging from celestial mechanics to protein folding.
From Random Samples, Science page 799, February 4, 2000.

Algorithm Properties
• An algorithm possesses the following
properties:
– It must be correct.
– It must be composed of a series of concrete steps.
– There can be no ambiguity as to which step will be
performed next.
– It must be composed of a finite number of steps.
– It must terminate.
• A computer program is an instance, or
concrete representation, for an algorithm
in some programming language.

Measuring Algorithm Efficiency
• Types of complexity
– Space complexity
– Time complexity
• Analysis of algorithms
– The measuring of the complexity of an algorithm
• Cannot compute actual time for an algorithm
– We usually measure worst-case time

Three algorithms for computing
1 + 2 + … + n for an integer n > 0

The number of operations required by the algorithms

The number of operations required by the algorithms as a
function of n

Big Oh Notation
• To say "Algorithm A has a worst-case time
requirement proportional to n"
– We say A is O(n)
– Read "Big Oh of n"
• For the other two algorithms
– Algorithm B is O(n^2)
– Algorithm C is O(1)
• O is derived from order (magnitude)
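The three algorithms appear only as a figure in the slides. The sketch below is an assumed reconstruction that matches the stated complexities (A linear, B quadratic, C constant); the function names are hypothetical.

```python
def sum_a(n):
    """Algorithm A, O(n): add the numbers 1..n one by one."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_b(n):
    """Algorithm B, O(n^2): add 1 to the total i times, for every i in 1..n."""
    total = 0
    for i in range(1, n + 1):
        for _ in range(i):
            total += 1
    return total


def sum_c(n):
    """Algorithm C, O(1): closed formula n(n + 1)/2."""
    return n * (n + 1) // 2
```

All three return the same value; only the number of elementary operations grows differently with n.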

Picturing Efficiency
O(n) algorithm

An O(n^2) algorithm.

Another O(n^2) algorithm.

The best alignment:
The one with the maximum total
score

• Exhaustive …
– All combinations
• Algorithm
– Dynamic programming (much faster)
– Needleman-Wunsch for global
alignments
(Journal of Molecular Biology, 1970)
– Later adapted by Smith-Waterman
for local alignment
• Heuristics

• Score of an alignment: reward
matches and penalize mismatches
and spaces.
– eg, each column gets a (different)
value for:
• a match: +1, (both have the same
characters);
• a mismatch : -1, (both have different
characters); and
• a space in a column: -2.
– The total score of an alignment is the
sum of the values assigned to its
columns.

A metric …
GACGGATTAG, GATCGGAATAG
GA-CGGATTAG
GATCGGAATAG
+1 (a match), -1 (a mismatch),-2 (gap)
9*1 + 1*(-1)+1*(-2) = 6
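The column-by-column scoring can be checked with a few lines of Python. The function name `score_alignment` is an assumption; the default values are the ones from the slide (+1, -1, -2).

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two aligned rows of equal length ('-' marks a space)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))  # 9 matches, 1 mismatch, 1 space
```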

Dynamic programming
Reduce the problem:
solving a large problem becomes
simpler if we first know the
solution to a smaller problem that
is a subset of the larger problem
[Figure: problem P split into subproblems P1, P2 and P3]

Dynamic Programming
• Finding optimal solution to search
problem
• Recursively computes solution
• Fundamental principle is to produce
optimal solutions to smaller pieces of
the problem first and then glue them
together
• Efficient divide-and-conquer strategy
because it uses a bottom-up approach
and utilizes a look-up table instead of
recomputing optimal solutions to sub-
problems
[Figure: problem P split into subproblems P1, P2 and P3]

Dynamic Programming
What is the best way to get from A to C ?
Rules: Three stops
Solution: try all routes and select the best, which requires
combin(13,3) = 286 calculations
[Figure: map of stops between A and C]

Dynamic Programming
What is the best way to get from A to C ?
If we know that B is on the optimal path ?
[Figure: map with A, B and C]

Dynamic Programming
What is the best way to get from A to B ?
[Figure: six routes (1-6) from A to B]

Dynamic Programming
What is the best way to get from B to C ?
[Figure: six routes (1-6) from B to C]

Dynamic Programming
How many paths from A to C via B ?
6 * 6 = 36
[Figure: six routes from A to B combined with six routes from B to C]

Dynamic Programming
Solve the subproblem A to B: 6 calculations
[Figure: six routes (1-6) from A to B]

Dynamic Programming
Solve the subproblem B to C: 6 calculations
[Figure: six routes (1-6) from B to C]

Dynamic Programming
If B is on optimal path from A->C, this
optimal path = optimal path from A to B +
optimal path from B to C
12 calculations needed (not 36 or 286)
[Figure: optimal path from A to C through B, formed by routes 5 and 3]
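The counts on these slides can be reproduced with a short sketch. The route costs below are invented for illustration only; the point is that the best A to C route through B is the best A to B route plus the best B to C route.

```python
from itertools import product
from math import comb

exhaustive = comb(13, 3)   # choose 3 stops out of 13
all_pairs = 6 * 6          # every A->B route combined with every B->C route
subproblems = 6 + 6        # solve A->B and B->C separately

# Hypothetical costs for routes 1-6 (not taken from the slides).
a_to_b = [7, 4, 9, 6, 3, 8]
b_to_c = [5, 8, 2, 7, 6, 4]

brute_force = min(x + y for x, y in product(a_to_b, b_to_c))  # 36 sums
dynamic = min(a_to_b) + min(b_to_c)                           # 12 look-ups

print(exhaustive, all_pairs, subproblems, brute_force, dynamic)
```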

the best alignment between
• a zinc-finger core sequence:
–CKHVFCRVCI
• and a sequence fragment
from a viral polyprotein:
–CKKCFCKCV

    C K H V F C R V C I
  +--------------------
C | 1         1     1
K |   1
K |   1
C | 1         1     1
F |         1
C | 1         1     1
K |   1
C | 1         1     1
V |       1       1
Dynamic Programming

C K H V F C R V C I
+--------------------
C | 1 1 1 0
K | 1 0
K | 1 0
C | 1 1 1 0
F | 1 0
C | 1 1 1 0
K | 1 0
C | 1 1 1 0
V | 0 0 0 1 0 0 0 1 0 0
Dynamic Programming

C K H V F C R V C I
+--------------------
C | 1 1 1 0
K | 1 0
K | 1 0
C | 1 1 1 0
F | 1 0
C | 1 1 1 0
K | 1 0
C | 2 1 1 0
V | 0 0 0 1 0 0 0 1 0 0
Dynamic Programming

C K H V F C R V C I
+--------------------
C | 1 1 1 0
K | 1 0 0
K | 1 0 0
C | 1 1 1 0
F | 1 0 0
C | 1 1 1 0
K | 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Dynamic Programming

C K H V F C R V C I
+--------------------
C | 1 1 1 1 0
K | 1 1 0 0
K | 1 1 0 0
C | 1 1 1 1 0
F | 1 1 0 0
C | 1 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Dynamic Programming

C K H V F C R V C I
+--------------------
C | 1 1 1 1 1 0
K | 1 1 1 0 0
K | 1 1 1 0 0
C | 1 1 1 1 1 0
F | 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Dynamic Programming

C K H V F C R V C I
+--------------------
C | 1 2 1 1 1 0
K | 1 1 1 1 0 0
K | 1 1 1 1 0 0
C | 1 2 1 1 1 0
F | 2 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Dynamic Programming

C K H V F C R V C I
+--------------------
C | 1 2 2 1 1 1 0
K | 1 2 1 1 1 0 0
K | 1 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 2 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Dynamic Programming

C K H V F C R V C I
+--------------------
C | 1 3 2 2 1 1 1 0
K | 1 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 2 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Dynamic Programming

C K H V F C R V C I
+--------------------
C | 1 3 3 2 2 1 1 1 0
K | 4 4 3 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 2 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Dynamic Programming

C K H V F C R V C I
+--------------------
C | 5 3 3 3 2 2 1 1 1 0
K | 4 4 3 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 2 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Dynamic Programming

C K H V F C R V C I
+--------------------
C | 5 3 3 3 2 2 1 1 1 0
K | 4 4 3 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Dynamic Programming

C K H V F C R V C I
+--------------------
C | 5 3 3 3 2 2 1 1 1 0
K | 4 4 3 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
C K H V F C R V C I
C K K C F C - K C V
C K H V F C R V C I
C K K C F C K - C V
C - K H V F C R V C I
C K K C - F C - K C V
C K H - V F C R V C I
C K K C - F C - K C V
Dynamic Programming
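The matrices above appear to follow the original Needleman-Wunsch filling scheme: start from a 0/1 match matrix and, working from the bottom-right corner, add to each cell the largest value found in the next row (to its right) and the next column (below it). The sketch below is a reconstruction under that assumption; the name `max_match_matrix` is hypothetical.

```python
def max_match_matrix(rows_seq, cols_seq):
    """Fill the match matrix from the bottom-right corner, without gap penalties."""
    n, m = len(rows_seq), len(cols_seq)
    M = [[0] * m for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best = 0
            if i + 1 < n and j + 1 < m:
                next_row = max(M[i + 1][j + 1:])                    # row i+1, columns j+1..
                next_col = max(M[k][j + 1] for k in range(i + 1, n))  # column j+1, rows i+1..
                best = max(next_row, next_col)
            M[i][j] = (1 if rows_seq[i] == cols_seq[j] else 0) + best
    return M


if __name__ == "__main__":
    for residue, row in zip("CKKCFCKCV", max_match_matrix("CKKCFCKCV", "CKHVFCRVCI")):
        print(residue, "|", " ".join(map(str, row)))
```

The top-left cell then holds the maximum number of matches, and an alignment is read off by following the largest values back toward the bottom-right.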

Extensions to basic dynamic programming method
use gap penalties
– constant gap penalty for gap > 1
– gap penalty proportional to gap size
• one penalty for starting a gap (gap opening
penalty)
• different (lower) penalty for adding to a gap
(gap extension penalty)
• for nucleic acids, can be used to mimic
thermodynamics of helix formation
– two kinds of gap opening penalties
• one for gap closed by AT, different for GC
Dynamic Programming
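The difference between a proportional and an opening/extension gap penalty can be written down as two small functions. The penalty values are placeholders chosen for illustration, and the convention that the first gap position pays the opening penalty is an assumption (programs differ on this).

```python
def linear_gap_cost(length, gap=-2):
    """Penalty proportional to the gap size."""
    return gap * length


def affine_gap_cost(length, gap_open=-10, gap_extend=-1):
    """One penalty for starting a gap, a lower one for each further position."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


for size in (1, 2, 5):
    print(size, linear_gap_cost(size), affine_gap_cost(size))
```

With an affine scheme one long gap is cheaper than several short ones of the same total length, which tends to be closer to how insertions and deletions arise.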

• See the course notes for an example with gap penalties
– find the errors ;-)
• Available as a Perl program that we can
experiment with

Needleman-Wunsch.pl
# initialization
my @matrix;
$matrix[0][0]{score} = 0;
$matrix[0][0]{pointer} = "none";
for(my $j = 1; $j <= length($seq1); $j++) {
$matrix[0][$j]{score} = $GAP * $j;
$matrix[0][$j]{pointer} = "left";
}
for (my $i = 1; $i <= length($seq2); $i++) {
$matrix[$i][0]{score} = $GAP * $i;
$matrix[$i][0]{pointer} = "up";
}

Needleman-Wunsch-edu.pl
The Score Matrix
----------------
Seq1(j)       1   2   3   4   5   6   7
Seq2      *   C   K   H   V   F   C   R
(i) *     0  -1  -2  -3  -4  -5  -6  -7
1   C    -1   1   0  -1  -2  -3  -4  -5
2   K    -2   0   2   1   0  -1  -2  -3
3   K    -3  -1   1   1   0  -1  -2  -3
4   C    -4  -2   0   0   0  -1   0  -1
5   F    -5  -3  -1  -1  -1   1   0  -1
6   C    -6  -4  -2  -2  -2   0   2   1
7   K    -7  -5  -3  -3  -3  -1   1   1
8   C    -8  -6  -4  -4  -4  -2   0   0
9   V    -9  -7  -5  -5  -3  -3  -1  -1

Needleman-Wunsch.pl
# fill
for(my $i = 1; $i <= length($seq2); $i++) {
for(my $j = 1; $j <= length($seq1); $j++) {
my ($diagonal_score, $left_score, $up_score);
# calculate match score
my $letter1 = substr($seq1, $j-1, 1);
my $letter2 = substr($seq2, $i-1, 1);
if ($letter1 eq $letter2) {
$diagonal_score = $matrix[$i-1][$j-1]{score} + $MATCH;
}
else {
$diagonal_score = $matrix[$i-1][$j-1]{score} + $MISMATCH;
}
# calculate gap scores
$up_score = $matrix[$i-1][$j]{score} + $GAP;
$left_score = $matrix[$i][$j-1]{score} + $GAP;
# choose best score
if ($diagonal_score >= $up_score) {
if ($diagonal_score >= $left_score) {
$matrix[$i][$j]{score} = $diagonal_score;
$matrix[$i][$j]{pointer} = "diagonal";
}
else {
$matrix[$i][$j]{score} = $left_score;
$matrix[$i][$j]{pointer} = "left";
}
} else {
if ($up_score >= $left_score) {
$matrix[$i][$j]{score} = $up_score;
$matrix[$i][$j]{pointer} = "up";
}
else {
$matrix[$i][$j]{score} = $left_score;
$matrix[$i][$j]{pointer} = "left";
}
}
}
}

Needleman-Wunsch.pl
#!e:\perl\bin -w
use strict;
# usage statement
die "usage: $0 <sequence 1> <sequence 2>\n" unless @ARGV == 2;
# get sequences from command line
my ($seq1, $seq2) = @ARGV;
# scoring scheme
my $MATCH = 1; # +1 for letters that match
my $MISMATCH = -1; # -1 for letters that mismatch
my $GAP = -1; # -1 for any gap

The Score Matrix
----------------
Seq1(j)       1   2   3   4   5   6   7
Seq2      *   C   K   H   V   F   C   R
(i) *     0  -1  -2  -3  -4  -5  -6  -7
1   C    -1   1   0  -1  -2  -3  -4  -5
2   K    -2   0   2   1   0  -1  -2  -3
3   K    -3  -1   1   1   0  -1  -2  -3
4   C    -4  -2   0   0   0  -1   0  -1
5   F    -5  -3  -1  -1  -1   1   0  -1
6   C    -6  -4  -2  -2  -2   0   2   1
7   K    -7  -5  -3  -3  -3  -1   1   1
8   C    -8  -6  -4  -4  -4  -2   0   0
9   V    -9  -7  -5  -5  -3  -3  -1  -1
A: diagonal_score = matrix(i-1,j-1) + MATCH
if substr(seq1,j-1,1) eq substr(seq2,i-1,1), otherwise + MISMATCH
B: up_score = matrix(i-1,j) + GAP
C: left_score = matrix(i,j-1) + GAP
matrix(i,j) = the largest of A, B and C

Needleman-Wunsch.pl
my $align1 = "";
my $align2 = "";
my $j = length($seq1);
my $i = length($seq2);
while (1) {
last if $matrix[$i][$j]{pointer} eq "none";
if ($matrix[$i][$j]{pointer} eq "diagonal") {
$align1 .= substr($seq1, $j-1, 1);
$align2 .= substr($seq2, $i-1, 1);
$i--; $j--;
}
elsif ($matrix[$i][$j]{pointer} eq "left") {
$align1 .= substr($seq1, $j-1, 1);
$align2 .= "-";
$j--;
}
elsif ($matrix[$i][$j]{pointer} eq "up") {
$align1 .= "-";
$align2 .= substr($seq2, $i-1, 1);
$i--;
}
}
$align1 = reverse $align1;
$align2 = reverse $align2;
print "$align1\n";
print "$align2\n";

Seq1:CKHVFCRVCI
Seq2:CKKCFC-KCV
++--++--+- score = 0
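For comparison, the three phases of the Perl script (initialization, fill, trace-back) can be condensed into one Python function. This is a sketch rather than a line-by-line port; it keeps the same scoring scheme (+1, -1, -1) and resolves ties in the order diagonal, up, left, which is what the nested comparisons in the fill step amount to.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment; seq1 runs along the columns (j), seq2 along the rows (i)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    score = [[0] * cols for _ in range(rows)]
    pointer = [["none"] * cols for _ in range(rows)]
    # initialization: the edges carry increasing gap penalties
    for j in range(1, cols):
        score[0][j], pointer[0][j] = gap * j, "left"
    for i in range(1, rows):
        score[i][0], pointer[i][0] = gap * i, "up"
    # fill: keep the best of the three moves (first maximum wins on ties)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq1[j - 1] == seq2[i - 1] else mismatch
            candidates = [(score[i - 1][j - 1] + step, "diagonal"),
                          (score[i - 1][j] + gap, "up"),
                          (score[i][j - 1] + gap, "left")]
            score[i][j], pointer[i][j] = max(candidates, key=lambda c: c[0])
    # trace-back from the bottom-right corner to the origin
    align1, align2 = [], []
    i, j = rows - 1, cols - 1
    while pointer[i][j] != "none":
        move = pointer[i][j]
        align1.append(seq1[j - 1] if move != "up" else "-")
        align2.append(seq2[i - 1] if move != "left" else "-")
        if move != "left":
            i -= 1
        if move != "up":
            j -= 1
    return "".join(reversed(align1)), "".join(reversed(align2)), score[-1][-1]


print(needleman_wunsch("CKHVFCRVCI", "CKKCFCKCV"))
```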

• Practicum: use similarity function in
initialization step -> scoring tables
• Time Complexity
• Use random proteins to generate
histogram of scores from aligned
random sequences

Time complexity with needleman-wunsch.pl
Sequence Length (aa)    Execution Time (s)
10                      0
25                      0
50                      0
100                     1
500                     5
1000                    19
2500                    559
5000                    (failed: "Memory could not be written")

• -edu version
• Monte-carlo version

Average around -64 !
-80
-78
-76
-74
-72 **
-70 *******
-68 ***************
-66 *************************
-64 ************************************************************
-60 ***********************
-58 ***************
-56 ********
-54 ****
-52 *
-50
-48
-46
-44
-42
-40
-38

If the sequences are similar, the path
of the best alignment should be very
close to the main diagonal.
Therefore, we may not need to fill the
entire matrix, rather, we fill a narrow
band of entries around the main
diagonal.
An algorithm that fills in a band of
width 2k+1 around the main
diagonal.
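A banded fill can be sketched as follows. This is an illustrative score-only version, not the algorithm from any particular program: cells further than k from the main diagonal are treated as unreachable, and the function name and the error for a too-narrow band are assumptions.

```python
def banded_nw_score(a, b, k, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only cells with |i - j| <= k (band width 2k+1)."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")
    # row 0 of the matrix, restricted to the band
    prev = {j: gap * j for j in range(0, min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = gap * i
                continue
            step = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev.get(j - 1, neg_inf) + step,  # diagonal
                         prev.get(j, neg_inf) + gap,       # up
                         cur.get(j - 1, neg_inf) + gap)    # left
        prev = cur
    return prev[m]
```

Only about (2k+1) times n cells are computed instead of n times m. The result equals the full score whenever the optimal path stays inside the band, and is otherwise a lower bound.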

Smith-Waterman.pl
• Three changes
– The edges of the matrix are initialized to 0 instead
of increasing gap penalties
– The maximum score is never less than 0, and no
pointer is recorded unless the score is greater
than 0
– The trace-back starts from the highest score in
the matrix (rather than at the end of the matrix)
and ends at a score of 0 (rather than the start of
the matrix)
• Demonstration
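The three changes can be made concrete in a short Python sketch. It is an illustration under the simple scoring scheme used earlier (+1, -1, -1), not the course's Smith-Waterman.pl.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns the two aligned fragments and the best score."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    # change 1: the edges (and all other cells) start at 0
    score = [[0] * cols for _ in range(rows)]
    pointer = [["none"] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq1[j - 1] == seq2[i - 1] else mismatch
            candidates = [(score[i - 1][j - 1] + step, "diagonal"),
                          (score[i - 1][j] + gap, "up"),
                          (score[i][j - 1] + gap, "left")]
            value, move = max(candidates, key=lambda c: c[0])
            # change 2: never below 0, and a pointer only for positive scores
            if value > 0:
                score[i][j], pointer[i][j] = value, move
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # change 3: trace back from the highest score until a cell with score 0
    align1, align2 = [], []
    i, j = best_pos
    while pointer[i][j] != "none":
        move = pointer[i][j]
        align1.append(seq1[j - 1] if move != "up" else "-")
        align2.append(seq2[i - 1] if move != "left" else "-")
        if move != "left":
            i -= 1
        if move != "up":
            j -= 1
    return "".join(reversed(align1)), "".join(reversed(align2)), best
```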

The best alignment:
The one with the maximum total score
Multiple Alignment: n>2

On its top-left side, the cube is
"covered" by the polyhedron. The
edges 1, 2, 3, 6 and 7 are coming
from the inside, and edges 4 and 5
can be ignored (and are therefore
not labeled in the figure).

• Each node in the k-dimensional hyperlattice is
visited once, and therefore the running time
must be proportional to the number of nodes in
the lattice.
– This number is the product of the lengths of the
sequences.
– eg. the 3-dimensional lattice as visualized.
Computational Complexity of MA by standard Dynamic Programming
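The growth of the lattice is easy to quantify. The sketch below simply multiplies the sequence lengths; the example of eight sequences of 2000 residues is an assumed illustration.

```python
from math import prod


def lattice_nodes(lengths):
    """Number of nodes visited: the product of the sequence lengths."""
    return prod(lengths)


print(lattice_nodes([10, 9]))          # pairwise case from the earlier example
print(f"{lattice_nodes([2000] * 8):.2e}")  # eight sequences of 2000 residues
```

Each extra sequence multiplies both the running time and the memory needed for the trace-back by its length.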

• The memory space requirement is even worse.
To trace back the alignment, we need to store the
whole lattice, a data structure the size of a
multidimensional skyscraper.
– In fact, space is the No.1 problem here, bogging down
multiple alignment methods that try to achieve
optimality.
– Furthermore, incorporating a realistic gap model, we
will further increase our demands on space and running
time

• The most practical and widely used
method in multiple sequence alignment
is the hierarchical extension of
pairwise alignment methods.
• The principle is that a multiple alignment
is achieved by successive application
of pairwise methods.
– First do all pairwise alignments (not just one
sequence with all others)
– Then combine pairwise alignments to generate
overall alignment
Multiple Alignment Method

• The steps are summarized as follows:
– Compare all sequences pairwise.
– Perform cluster analysis on the pairwise data to
generate a hierarchy for alignment. This may be in
the form of a binary tree or a simple ordering
– Build the multiple alignment by first aligning the
most similar pair of sequences, then the next most
similar pair and so on. Once an alignment of two
sequences has been made, then this is fixed.
Thus for a set of sequences A, B, C, D having
aligned A with C and B with D the alignment of A,
B, C, D is obtained by comparing the alignments
of A and C with that of B and D using averaged
scores at each aligned position.
Multiple Alignment Method

• Automatic multiple alignment
– extend dynamic programming (MSA - Lipman)
• limit: computing power: length and number of sequences
(e.g. 2000^8)
– progressive alignment (Feng & Doolittle)
• use “guide tree” (PileUp, ClustalW etc)
• Dedicated alignment editing program
– Boxshade
– SeaView
– SeqPup (Java)
• Combination (Biology – Computation)
Multiple Sequence Alignment programs

• ClustalW is a general purpose multiple
alignment program for DNA or proteins.
• ClustalW is produced by Julie D. Thompson and
Toby Gibson of the European Molecular Biology
Laboratory, Germany, and Desmond Higgins
of the European Bioinformatics Institute,
Cambridge, UK.
• Algorithm described in: "CLUSTAL W: improving the
sensitivity of progressive
multiple sequence alignment through
sequence weighting, position-specific gap
penalties and weight matrix choice", Nucleic
Acids Research, 22:4673-4680.
ClustalW

****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps between alignments? = OFF
8. Toggle screen display = ON
9. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice:
Running ClustalW

• Before you run PILEUP, it is necessary to study
the sequences that will be aligned.
• PILEUP is very sensitive to gaps, so if a set of
sequences are of different lengths, gaps will be
added to the ends of all shorter sequences to
make them equal to the longest one in the set.
• If you try to align five 300 nucleotide EST's with a
single 20,000 nucleotide cosmid, you are adding
5 X 19,700 gaps to the alignment - and PILEUP
will crash!
PileUp

• The final product of a PILEUP run is a set of aligned
sequences, which are stored in a Multiple
Sequence File (called .msf by GCG).
This msf file is a text file that can be formatted with
a text editor, but GCG has some dedicated tools for
improving the looks of msf files for easier
interpretation and for publication.
• Consensus sequences can be calculated and the
relationship of each character of each sequence to
the consensus can be highlighted using the
program PRETTY
Formatting Multiple Alignments

• Shading of regions of high homology can be created using
the programs BOXSHADE and PRETTYBOX , but that
goes beyond the scope of this tutorial. (Boxshade:
http://www.ch.embnet.org/software/BOX_form.html)
• In addition to these programs that run on the Alpha, the
output of PILEUP (or CLUSTAL) can be moved by FTP
from your RCR account to a local Mac or PC.
• Since this output is a plain text file, it can be edited with
any word processing program, or imported into any
drawing program to add boldface text, underlining,
shading, boxes, arrows, etc
Formatting Multiple Alignments

http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
An example of Multiple Alignment … immunoglobulin
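Conserved columns in such an alignment can be picked out programmatically. The helper names below are hypothetical, and the consensus rule (most frequent character per column, gaps included) is a deliberately simple assumption.

```python
from collections import Counter

ALIGNMENT = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def conserved_columns(alignment):
    """0-based indices of columns where every sequence has the same residue."""
    return [i for i, col in enumerate(zip(*alignment))
            if len(set(col)) == 1 and col[0] != "-"]


def consensus(alignment):
    """Most frequent character in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))


print(conserved_columns(ALIGNMENT))
print(consensus(ALIGNMENT))
```

For this fragment the fully conserved columns are the cysteine and the tryptophan.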

• Their alignment highlights conserved
residues (one of the cysteines forming the
disulphide bridges, and the tryptophan are
notable)
• conserved regions (in particular, "Q.PG" at
the end of the first 4 sequences), and more
sophisticated patterns, like the dominance of
hydrophobic residues at fragment positions 1
and 3.
• The alternating hydrophobicity pattern is
typical for the surface beta-strand at the
beginning of each fragment. Indeed, multiple
alignments are helpful for protein structure
prediction.
An example of Multiple Alignment … immunoglobulin

• Providing the alignment is accurate
then the following may be inferred
about the secondary structure from
a multiple sequence alignment.
• The position of insertions and
deletions (INDELS) suggests
regions where surface loops exist.
• Conserved glycine or proline
suggests a beta-turn.
A Practical Approach: Interpretation

• Residues with hydrophobic properties
conserved at i, i+2, i+4 separated by
unconserved or hydrophilic residues
suggest surface beta-strands.
• A short run of hydrophobic amino acids
(4 residues) suggests a buried beta-strand.
• Pairs of conserved hydrophobic amino
acids separated by pairs of
unconserved or hydrophilic residues
suggest an alpha-helix with one face
packing in the protein core. Likewise,
an i, i+3, i+4, i+7 pattern of conserved
hydrophobic residues.
A Practical Approach: Interpretation

• Take out noise (GAPS)
• Extra information (structure - function)
• Recursive selection
– first most similar to have an idea about
conserved regions
– manual scan for these in more distant
members then include these
A Practical Approach: Which sequences to use ?

L-align (2 sequences)
SIM (www.expasy.ch)
LALNVIEW is available for UNIX, Mac
and PC on the ExPASy anonymous
FTP server.
very nice TWEAKING tool (70% criteria)

How can I use NCBI
to compare two
sequences?
Answer:
Use the
“BLAST 2 Sequences”
program

• Go to http://www.ncbi.nlm.nih.gov/BLAST
• Choose BLAST 2 sequences
• In the program,
[1] choose blastp (protein search) or blastn (for DNA)
[2] paste in your accession numbers
(or use FASTA format)
[3] select optional parameters, such as
--BLOSUM62 matrix is default for proteins
try PAM250 for distantly related proteins
--gap creation and extension penalties
[4] click “align”
Practical guide to pairwise alignment:
the “BLAST 2 sequences” website

Question #2:
How can I use NCBI
to compare a
sequence to an
entire database?
BLAST!

• An introduction to Basic Concepts in
Computer Science for Life Scientists
• Dotplot patterns: A Literal Look at
Pattern Languages

• CpG Islands
– Download 1000 (random) promoters (3000 bp) from ENSEMBL (hint:
use Biomart)
– How many times would you expect to observe CG if all nucleotides
were equiprobable?
– Count the number of times CG is observed for these 1000 genes and
make a histogram from these scores.
– Are there any other dinucleotides over- or underrepresented?
– CG repeats are often methylated. In order to study methylation
patterns, bisulfite treatment of DNA is used. Bisulfite changes every C
which is not followed by G into T. Generate computationally the
bisulfite-treated version of DNA (hint: while (s/C([^G])/T$1/g) {};)
– How would you find primers that discriminate between methylated and
unmethylated DNA? Given that the genome is 3×10^9 bp, how long do
you need to make a primer to avoid mispriming?
Practicum 3
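Some of the counting steps in this practicum can be sketched in Python as a starting point. The function names are hypothetical, the model is deliberately naive (one strand, equiprobable and independent nucleotides), and the bisulfite function also converts a C at the very end of the sequence, which the Perl hint leaves untouched.

```python
import re


def expected_cg_count(length):
    """Expected number of CG dinucleotides: (length - 1) positions, each with probability 1/16."""
    return (length - 1) / 16


def bisulfite_convert(seq):
    """Change every C that is not followed by G into T."""
    return re.sub(r"C(?!G)", "T", seq)


def min_unique_primer_length(genome_size):
    """Smallest n for which the 4**n possible n-mers outnumber the genome positions."""
    n = 1
    while 4 ** n <= genome_size:
        n += 1
    return n


print(expected_cg_count(3000))
print(bisulfite_convert("ACCGTCGCA"))
print(min_unique_primer_length(3_000_000_000))
```

The primer length is a lower bound under the random-sequence assumption; real genomes contain repeats, so practical primers are usually chosen somewhat longer.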

Weblems
W4.1: Align the amino acid sequence of acetylcholine
receptor from human, rat, mouse, dog with
ClustalW
T-Coffee
Dali
MSA
W4.2: Use BoxShade to create a Word file indicating
the different conserved residues in colours
W4.3: Perform a local alignment using SIM and Lalign
on the same sequences and Blast2
W4.4: Do the different methods give different results,
what are the default settings they use ?
W4.5: How would you identify critical residues for
catalytic activity ?
