FBW
20-10-2015
Wim Van Criekinge
Rat versus mouse RBP
Rat versus bacterial lipocalin
– Henikoff and Henikoff have compared the
BLOSUM matrices to PAM by evaluating how
effectively the matrices can detect known members
of a protein family from a database when searching
with the ungapped local alignment program
BLAST. They conclude that overall the BLOSUM
62 matrix is the most effective.
• However, all the substitution matrices investigated
perform better than BLOSUM 62 for a proportion of
the families. This suggests that no single matrix is
the complete answer for all sequence comparisons.
• It is probably best to complement the BLOSUM 62
matrix with comparisons using the PAM 250 matrix and
Overington structurally derived matrices.
– It seems likely that as more protein three
dimensional structures are determined, substitution
tables derived from structure comparison will give
the most reliable data.
Overview
Available Dot Plot Programs
Dotlet (Java Applet)
http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
Sequence Alignments
Introduction
Algorithms
What ?
Examples
Properties
Dynamic Programming for Pairwise Alignment
Concept
Example
Needleman-Wunsch(.pl)
Smith-Waterman(.pl)
Multiple Alignment
MSA
Hierarchical Pairwise Alignment
ClustalW, PileUp
Formatting
Interpretation
Alternative Methods
SIM
Blast2
Dali
Global and local alignment
Pairwise sequence alignment can be global or local
Global: the sequences are completely aligned
(Needleman and Wunsch, 1970)
Local: only the best sub-regions are aligned
(Smith and Waterman, 1981). BLAST
uses local alignment.
– To characterize protein families: identify shared
regions of homology in a multiple sequence alignment
(this generally happens when a sequence search
reveals homologies to several sequences)
– To determine the consensus sequence of several
aligned sequences
– To help predict the secondary and tertiary
structures of new sequences
– As a preliminary step in molecular evolution analysis,
where phylogenetic methods are used to construct
phylogenetic trees
– Garbage in, Garbage out
– Chicken/egg
Why do we do multiple alignments?
• To find conserved regions
– Local multiple alignment reveals conserved
regions
– Conserved regions usually are key functional
regions
– These regions are prime targets for drug
development
• To do phylogenetic analysis:
– Same protein from different species
– Optimal multiple alignment probably reflects
evolutionary history
– Discover irregularities, such as in the Cystic Fibrosis
gene
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
Algorithms and Programs
• Algorithm: a method or a process followed
to solve a problem.
– A recipe.
• An algorithm takes the input to a problem
(function) and transforms it to the output.
– A mapping of input to output.
• A problem can have many algorithms.
Aryabhata-Euclid's algorithm: how to find gcd(a,b),
the greatest common divisor of a and b
Based on a single observation: if a = b*q + r, then
any divisor of a and b is also a divisor of r, and any divisor
of b and r is also a divisor of a, so gcd(a,b) = gcd(b,r)
Euclid's algorithm: use the division algorithm repeatedly
to reduce the problem to one you can solve.
Example: gcd(55,35)
55 = 35*1 + 20 so gcd(55,35) = gcd(35,20)
35 = 20*1 + 15 so gcd(35,20) = gcd(20,15)
20 = 15*1 + 5 so gcd(20,15) = gcd(15,5)
15 = 5*3 + 0 done: gcd(55,35) = 5
Pseudocode
GGD.py
def gcd(a, b):
    # Replace (a, b) by (b mod a, a) until the remainder a reaches 0.
    while a != 0:
        a, b = b % a, a  # parallel assignment
    return b

print(gcd(55, 35))  # prints 5
Bubble Sort Algorithm
One of the simplest sorting algorithms proceeds by walking down the list, comparing
adjacent elements, and swapping them if they are in the wrong order. The process is
continued until the list is sorted.
Each pass "bubbles" the largest element in the unsorted part of the list to its correct location.
More formally:
1. Initialize the size of the list to be sorted to be the actual size of the list.
2. Loop through the list until no element needs to be exchanged with another
to reach its correct position.
2.1 Loop (i) from 0 to size of the list to be sorted - 2.
2.1.1 Compare the ith and (i + 1)st elements in the unsorted list.
2.1.2 Swap the ith and (i + 1)st elements if not in order (ascending or
descending as desired).
2.2 Decrease the size of the list to be sorted by 1.
A: 13 7 43 5 3 19 2 23 29 ?? ?? ?? ?? ??
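The numbered steps above can be sketched in Python. This is an illustrative sketch, not code from the lecture: the name bubble_sort and the swapped flag are assumptions. The flag implements the stopping rule of step 2 (stop as soon as a pass exchanges nothing).

```python
def bubble_sort(items):
    """Return a new list sorted in ascending order (hypothetical helper)."""
    data = list(items)          # work on a copy
    stop = len(data) - 1        # step 1: last index of the unsorted part
    while stop > 0:             # step 2
        swapped = False
        for check in range(stop):                 # step 2.1
            if data[check] > data[check + 1]:     # step 2.1.1
                data[check], data[check + 1] = data[check + 1], data[check]  # step 2.1.2
                swapped = True
        if not swapped:         # nothing exchanged: the list is sorted
            break
        stop -= 1               # step 2.2
    return data

print(bubble_sort([13, 7, 43, 5, 3, 19, 2, 23, 29]))
```

The input values are the known entries of the list A shown above.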
Bubble Sort Implementation
Here is an ascending-order implementation of the bubblesort algorithm for integer arrays:
void BubbleSort(int List[], int Size) {
    int tempInt;  // temp variable for swapping list elems
    for (int Stop = Size - 1; Stop > 0; Stop--) {
        for (int Check = 0; Check < Stop; Check++) {  // make a pass
            if (List[Check] > List[Check + 1]) {      // compare elems
                tempInt = List[Check];                // swap if in the
                List[Check] = List[Check + 1];        // wrong order
                List[Check + 1] = tempInt;
            }
        }
    }
}
Bubblesort compares and swaps adjacent elements; simple but not very efficient.
Efficiency note: the outer loop could be modified to exit if the list is already sorted.
"Great algorithms are the poetry of computation"
1946: The Metropolis Algorithm for Monte Carlo. Through the use of random
processes, this algorithm offers an efficient way to stumble toward answers to
problems that are too complicated to solve exactly.
1947: Simplex Method for Linear Programming. An elegant solution to a common
problem in planning and decision-making.
1950: Krylov Subspace Iteration Method. A technique for rapidly solving the linear
equations that abound in scientific computation.
1951: The Decompositional Approach to Matrix Computations. A suite of techniques
for numerical linear algebra.
1957: The Fortran Optimizing Compiler. Turns high-level code into efficient
computer-readable code.
1959: QR Algorithm for Computing Eigenvalues. Another crucial matrix operation
made swift and practical.
1962: Quicksort Algorithms for Sorting. For the efficient handling of large databases.
1965: Fast Fourier Transform. Perhaps the most ubiquitous algorithm in use today, it
breaks down waveforms (like sound) into periodic components.
1977: Integer Relation Detection. A fast method for spotting simple equations satisfied
by collections of seemingly unrelated numbers.
1987: Fast Multipole Method. A breakthrough in dealing with the complexity of n-body
calculations, applied in problems ranging from celestial mechanics to protein folding.
From Random Samples, Science page 799, February 4, 2000.
Algorithm Properties
• An algorithm possesses the following
properties:
– It must be correct.
– It must be composed of a series of concrete steps.
– There can be no ambiguity as to which step will be
performed next.
– It must be composed of a finite number of steps.
– It must terminate.
• A computer program is an instance, or
concrete representation, for an algorithm
in some programming language.
Measuring Algorithm Efficiency
• Types of complexity
– Space complexity
– Time complexity
• Analysis of algorithms
– The measuring of the complexity of an algorithm
• Cannot compute actual time for an algorithm
– We usually measure worst-case time
Three algorithms for computing
1 + 2 + … + n for an integer n > 0
The number of operations required by the algorithms as a
function of n
Big Oh Notation
• To say "Algorithm A has a worst-case time
requirement proportional to n"
– We say A is O(n)
– Read "Big Oh of n"
• For the other two algorithms
– Algorithm B is O(n^2)
– Algorithm C is O(1)
• O is derived from order (magnitude)
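The slides that listed the three algorithms did not survive as text, so the sketch below is a reconstruction under assumptions: Algorithm A adds the terms in a single loop, Algorithm B reaches the same total with two nested loops, and Algorithm C uses the closed form n(n+1)/2. The function names are hypothetical; the growth rates match the O(n), O(n^2) and O(1) classes stated above.

```python
def sum_a(n):
    """Assumed Algorithm A: one loop, about n additions -> O(n)."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_b(n):
    """Assumed Algorithm B: nested loops, about n*(n+1)/2 additions -> O(n^2)."""
    total = 0
    for i in range(1, n + 1):
        for _ in range(i):
            total += 1
    return total

def sum_c(n):
    """Assumed Algorithm C: closed form, a fixed number of operations -> O(1)."""
    return n * (n + 1) // 2

print(sum_a(100), sum_b(100), sum_c(100))
```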
Picturing Efficiency
An O(n) algorithm.
An O(n^2) algorithm.
Another O(n^2) algorithm.
The best alignment:
The one with the maximum total
score
• Exhaustive …
– All combinations:
• Algorithm
– Dynamic programming (much faster)
– Needleman-Wunsch for global
alignments
(Journal of Molecular Biology, 1970)
– Later adapted by Smith-Waterman
for local alignment
• Heuristics
Overview
• Score of an alignment: reward
matches and penalize mismatches
and spaces.
– eg, each column gets a (different)
value for:
• a match: +1, (both have the same
characters);
• a mismatch : -1, (both have different
characters); and
• a space in a column: -2.
– The total score of an alignment is the
sum of the values assigned to its
columns.
A metric …
GACGGATTAG, GATCGGAATAG
GA-CGGATTAG
GATCGGAATAG
+1 (a match), -1 (a mismatch),-2 (gap)
9*1 + 1*(-1)+1*(-2) = 6
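The column-by-column scoring above can be checked with a short Python sketch. The function name alignment_score is an assumption made for illustration; the scoring values (+1, -1, -2) are the ones given on the slide.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum the per-column values of an alignment given as two gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # a space in the column
        elif x == y:
            total += match      # both characters are the same
        else:
            total += mismatch   # the characters differ
    return total

print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))
```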
Dynamic programming
Reduce the problem:
a large problem becomes easier to solve
if we first know the solution to a
smaller problem that is a subset of
the larger problem
Overview
[Figure: problem P decomposed into subproblems P1, P2, P3]
Dynamic Programming
• Finding optimal solution to search
problem
• Recursively computes solution
• Fundamental principle is to produce
optimal solutions to smaller pieces of
the problem first and then glue them
together
• Efficient divide-and-conquer strategy
because it uses a bottom-up approach
and utilizes a look-up table instead of
recomputing optimal solutions to sub-
problems
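The look-up table idea can be illustrated on a problem much smaller than alignment. The sketch below uses Fibonacci numbers, an example chosen here and not taken from the lecture: fib_naive recomputes the same subproblems again and again, while fib_table solves each subproblem once, bottom-up, and stores it. Both function names are hypothetical.

```python
def fib_naive(n):
    """Plain recursion: the same subproblems are solved many times."""
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_table(n):
    """Bottom-up dynamic programming: each subproblem is solved once and stored."""
    table = [0, 1] + [0] * max(0, n - 1)
    for k in range(2, n + 1):
        table[k] = table[k - 1] + table[k - 2]   # look up, do not recompute
    return table[n]

print(fib_naive(15), fib_table(15))
```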
the best alignment between
• a zinc-finger core sequence:
–CKHVFCRVCI
• and a sequence fragment
from a viral polyprotein:
–CKKCFCKCV
(1 = identical residues, . = no match)
    C K H V F C R V C I
  +--------------------
C | 1 . . . . 1 . . 1 .
K | . 1 . . . . . . . .
K | . 1 . . . . . . . .
C | 1 . . . . 1 . . 1 .
F | . . . . 1 . . . . .
C | 1 . . . . 1 . . 1 .
K | . 1 . . . . . . . .
C | 1 . . . . 1 . . 1 .
V | . . . 1 . . . 1 . .
Dynamic Programming
The matrix is filled from the lower right corner. Each cell receives its own
match value (1 or 0) plus the largest score found in the next row and the next
column towards the lower right.
(. = cell not yet computed; a 1 in such a cell is the match value of that cell)
Step 1: the last row and the last column
    C K H V F C R V C I
  +--------------------
C | 1 . . . . 1 . . 1 0
K | . 1 . . . . . . . 0
K | . 1 . . . . . . . 0
C | 1 . . . . 1 . . 1 0
F | . . . . 1 . . . . 0
C | 1 . . . . 1 . . 1 0
K | . 1 . . . . . . . 0
C | 1 . . . . 1 . . 1 0
V | 0 0 0 1 0 0 0 1 0 0
Step 2: one cell of the next row (match 1 + best score 1 to the lower right = 2)
    C K H V F C R V C I
  +--------------------
C | 1 . . . . 1 . . 1 0
K | . 1 . . . . . . . 0
K | . 1 . . . . . . . 0
C | 1 . . . . 1 . . 1 0
F | . . . . 1 . . . . 0
C | 1 . . . . 1 . . 1 0
K | . 1 . . . . . . . 0
C | 2 . . . . 1 . . 1 0
V | 0 0 0 1 0 0 0 1 0 0
Step 3: that row and the matching column are complete
    C K H V F C R V C I
  +--------------------
C | 1 . . . . 1 . . 1 0
K | . 1 . . . . . . 0 0
K | . 1 . . . . . . 0 0
C | 1 . . . . 1 . . 1 0
F | . . . . 1 . . . 0 0
C | 1 . . . . 1 . . 1 0
K | . 1 . . . . . . 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Step 4
    C K H V F C R V C I
  +--------------------
C | 1 . . . . 1 . 1 1 0
K | . 1 . . . . . 1 0 0
K | . 1 . . . . . 1 0 0
C | 1 . . . . 1 . 1 1 0
F | . . . . 1 . . 1 0 0
C | 1 . . . . 1 . 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Step 5
    C K H V F C R V C I
  +--------------------
C | 1 . . . . 1 1 1 1 0
K | . 1 . . . . 1 1 0 0
K | . 1 . . . . 1 1 0 0
C | 1 . . . . 1 1 1 1 0
F | . . . . 1 . 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Step 6
    C K H V F C R V C I
  +--------------------
C | 1 . . . . 2 1 1 1 0
K | . 1 . . . 1 1 1 0 0
K | . 1 . . . 1 1 1 0 0
C | 1 . . . . 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Step 7
    C K H V F C R V C I
  +--------------------
C | 1 . . . 2 2 1 1 1 0
K | . 1 . . 2 1 1 1 0 0
K | . 1 . . 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Step 8
    C K H V F C R V C I
  +--------------------
C | 1 . . 3 2 2 1 1 1 0
K | . 1 . 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Step 9
    C K H V F C R V C I
  +--------------------
C | 1 . 3 3 2 2 1 1 1 0
K | 4 4 3 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
Step 10: the complete matrix; the top-left cell holds the best score (5 matches)
    C K H V F C R V C I
  +--------------------
C | 5 3 3 3 2 2 1 1 1 0
K | 4 4 3 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0
C K H V F C R V C I
C K K C F C - K C V
C K H V F C R V C I
C K K C F C K - C V
C - K H V F C R V C I
C K K C - F C - K C V
C K H - V F C R V C I
C K K C - F C - K C V
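The completed matrix can be reproduced with a few lines of Python. The sketch assumes the rule that the filled cells are consistent with: a cell gets 1 for a match (0 otherwise) plus the largest score among the cells of the next row to its right and the cells of the next column below it, with no gap penalty. The function name fill_from_lower_right is hypothetical.

```python
def fill_from_lower_right(cols, rows):
    """Fill the match-count matrix starting at the lower right corner."""
    n_rows, n_cols = len(rows), len(cols)
    score = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows - 1, -1, -1):
        for j in range(n_cols - 1, -1, -1):
            match = 1 if rows[i] == cols[j] else 0
            best = 0
            if i + 1 < n_rows and j + 1 < n_cols:
                best = max(
                    max(score[i + 1][j + 1:]),                           # next row, to the right
                    max(score[r][j + 1] for r in range(i + 1, n_rows)),  # next column, below
                )
            score[i][j] = match + best
    return score

matrix = fill_from_lower_right("CKHVFCRVCI", "CKKCFCKCV")
for residue, row in zip("CKKCFCKCV", matrix):
    print(residue, "|", " ".join(str(v) for v in row))
```

The top-left value equals the number of matches in each of the alignments listed above.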
Needleman-Wunsch-Simple.py
The Score Matrix
----------------
Seq1 (j)       1   2   3   4   5   6   7
Seq2       *   C   K   H   V   F   C   R
(i)  *     0  -1  -2  -3  -4  -5  -6  -7
 1   C    -1   1   0  -1  -2  -3  -4  -5
 2   K    -2   0   2   1   0  -1  -2  -3
 3   K    -3  -1   1   1   0  -1  -2  -3
 4   C    -4  -2   0   0   0  -1   0  -1
 5   F    -5  -3  -1  -1  -1   1   0  -1
 6   C    -6  -4  -2  -2  -2   0   2   1
 7   K    -7  -5  -3  -3  -3  -1   1   1
 8   C    -8  -6  -4  -4  -4  -2   0   0
 9   V    -9  -7  -5  -5  -3  -3  -1  -1
A: diagonal_score = matrix(i-1,j-1) + MATCH if seq1[j-1] == seq2[i-1], otherwise + MISMATCH
B: up_score = matrix(i-1,j) + GAP
C: left_score = matrix(i,j-1) + GAP
matrix(i,j) = max(diagonal_score, up_score, left_score)
Seq1:CKHVFCRVCI
Seq2:CKKCFC-KCV
++--++--+- score = 0
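A minimal Python sketch of the matrix fill follows. It is not the lecture's Needleman-Wunsch-Simple.py, whose source is not in the slides; the function name is hypothetical, and the values MATCH = +1, MISMATCH = -1, GAP = -1 are inferred from the score matrix shown earlier rather than stated.

```python
def needleman_wunsch_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment score matrix; rows follow seq2 (i), columns follow seq1 (j)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        matrix[0][j] = j * gap          # first row: growing gap penalties
    for i in range(rows):
        matrix[i][0] = i * gap          # first column: growing gap penalties
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[j - 1] == seq2[i - 1] else mismatch
            matrix[i][j] = max(
                matrix[i - 1][j - 1] + s,   # diagonal: (mis)match
                matrix[i - 1][j] + gap,     # up: gap
                matrix[i][j - 1] + gap,     # left: gap
            )
    return matrix

m = needleman_wunsch_matrix("CKHVFCRVCI", "CKKCFCKCV")
print("score =", m[-1][-1])
```

The bottom-right cell holds the score of the best global alignment, which agrees with the score = 0 reported for the alignment above.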
Extensions to basic dynamic programming method
• use gap penalties
– constant gap penalty for gap > 1
– gap penalty proportional to gap size
– affine gap penalty:
• one penalty for starting a gap (gap opening penalty)
• different (lower) penalty for adding to a gap (gap extension penalty)
• use BLOSUM62 instead of MATCH and MISMATCH
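The difference between a proportional and an opening/extension gap penalty can be sketched as two small functions. The function names and the penalty values (-2, -10, -1) are illustrative assumptions, and the convention that the opening penalty covers the first gap position is only one of several in use.

```python
def linear_gap(length, gap=-2):
    """Penalty proportional to gap size."""
    return gap * length

def affine_gap(length, gap_open=-10, gap_extend=-1):
    """One penalty for starting a gap, a lower one for each added position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)

for length in (1, 2, 4, 8):
    print(length, linear_gap(length), affine_gap(length))
```

With these example values one long gap costs less than several short ones, which is the usual motivation for separating opening from extension.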
Dynamic Programming: Needleman-Wunsch-Complete.py
Uses of Needleman-Wunsch-Complete.py
• Time Complexity
• Use random proteins to generate
histogram of scores from aligned
random sequences
Time complexity with Needleman-Wunsch-Complete.py
Sequence Length (aa)    Execution Time (h:mm:ss)
10                      0:00:00.001500
25                      0:00:00.005340
50                      0:00:00.020112
100                     0:00:00.081580
500                     0:00:01.960721
1000                    0:00:07.720884
10000                   0:11:36.344549
100000                  Memory could not be written
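The timings above can be compared with the quadratic growth expected when a matrix of length x length cells is filled. The sketch below only re-uses the numbers from the table; the helper names to_seconds and slowdown are assumptions.

```python
timings = {
    10: "0:00:00.001500",
    25: "0:00:00.005340",
    50: "0:00:00.020112",
    100: "0:00:00.081580",
    500: "0:00:01.960721",
    1000: "0:00:07.720884",
    10000: "0:11:36.344549",
}

def to_seconds(text):
    """Convert an h:mm:ss.ffffff string to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def slowdown(shorter, longer):
    """Measured time ratio between two sequence lengths from the table."""
    return to_seconds(timings[longer]) / to_seconds(timings[shorter])

for shorter, longer in [(50, 100), (500, 1000), (1000, 10000)]:
    quadratic = (longer / shorter) ** 2
    print(f"{shorter} -> {longer}: measured x{slowdown(shorter, longer):.1f}, quadratic x{quadratic:.0f}")
```

Doubling the length costs roughly four times as long, and a tenfold increase roughly a hundred times, as expected for an O(n^2) method.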
Simple version (Match/Mismatch) – no gap extension
Complete version !
                              Sequences reported    Sequences reported
                              as related            as unrelated
homologous sequences          True positives        False negatives
non-homologous sequences      False positives       True negatives
Sensitivity: ability to find true positives
Specificity: ability to minimize false positives
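The two measures are commonly computed from the four counts as shown below. The formulas are the usual definitions; the counts in the example are hypothetical and only serve to show the calculation.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of homologous sequences that were reported as related."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of non-homologous sequences that were reported as unrelated."""
    return true_neg / (true_neg + false_pos)

# Hypothetical search result: 100 homologous and 1000 non-homologous sequences.
print(sensitivity(true_pos=90, false_neg=10))
print(specificity(true_neg=950, false_pos=50))
```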
If the sequences are similar, the path
of the best alignment should be very
close to the main diagonal.
Therefore, we may not need to fill the
entire matrix, rather, we fill a narrow
band of entries around the main
diagonal.
An algorithm that fills in a band of
width 2k+1 around the main
diagonal.
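A banded fill can be sketched as follows. This is an illustrative version, not the lecture's code: the function name is hypothetical, it uses the +1 / -1 / -2 scoring from earlier in the deck, and for clarity it still allocates the full matrix while computing only the cells with |i - j| <= k.

```python
def banded_global_score(a, b, k, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only a band of width 2k+1 around the diagonal."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the last cell")
    neg_inf = float("-inf")
    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(max(0, i - k), min(m, i + k) + 1):   # cells inside the band
            if i == 0 and j == 0:
                score[i][j] = 0
                continue
            best = neg_inf
            if i > 0 and j > 0:
                s = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + s)
            if i > 0:
                best = max(best, score[i - 1][j] + gap)      # outside the band this is -inf
            if j > 0:
                best = max(best, score[i][j - 1] + gap)
            score[i][j] = best
    return score[n][m]

print(banded_global_score("GACGGATTAG", "GATCGGAATAG", k=2))
```

The result equals the full-matrix score only when the best path stays inside the band, which is why the method suits similar sequences.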
Local alignment
• The concept of ‘local alignment’ was
introduced by Smith & Waterman in 1981
• A local alignment of 2 sequences is an
alignment between parts of the 2
sequences
Two proteins may share one stretch of high sequence
similarity, but be very dissimilar outside that region
A global (N-W) alignment of such sequences would have:
(i) lots of matches in the region of high sequence similarity
(ii) lots of mismatches & gaps (insertions/deletions)
outside the region of similarity
It makes sense to find the best local alignment instead
Smith-Waterman.py
• Three changes
– The edges of the matrix are initialized to 0 instead
of increasing gap penalties
– The maximum score is never less than 0, and no
pointer is recorded unless the score is greater
than 0
– The trace-back starts from the highest score in
the matrix (rather than at the end of the matrix)
and ends at a score of 0 (rather than the start of
the matrix)
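The three changes can be seen in a compact Python sketch. It is not the lecture's Smith-Waterman.py; the function name and the scoring values (+1, -1, -2) are assumptions, and instead of stored pointers the trace-back re-derives each step from the scores.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]   # change 1: edges stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                      # change 2: never below 0
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # change 3: trace back from the highest score until a 0 is reached
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

print(smith_waterman("AAAGGGTTT", "CCCGGGCCC"))
```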
The best alignment:
The one with the maximum total score
Multiple Alignment: n > 2
From 2 to 3 sequences: hyperlattice
On its top-left side, the cube is
"covered" by the polyhedron. The
edges 1, 2, 3, 6 and 7 are coming
from the inside, and edges 4 and 5
can be ignored (and are therefore
not labeled in the figure).
• Each node in the k-dimensional hyperlattice is
visited once, and therefore the running time
must be proportional to the number of nodes in
the lattice.
– This number is the product of the lengths of the
sequences.
– eg. the 3-dimensional lattice as visualized.
Computational Complexity of MA by standard Dynamic Programming
• The memory space requirement is even worse.
To trace back the alignment, we need to store the
whole lattice, a data structure the size of a
multidimensional skyscraper.
– In fact, space is the No.1 problem here, bogging down
multiple alignment methods that try to achieve
optimality.
– Furthermore, incorporating a realistic gap model, we
will further increase our demands on space and running
time
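The growth of the lattice can be made concrete with a short calculation. The sketch assumes, as stated above, that the number of nodes is the product of the sequence lengths; the function name lattice_nodes and the example lengths are illustrative.

```python
import math

def lattice_nodes(lengths):
    """Number of nodes in the hyperlattice: the product of the sequence lengths."""
    return math.prod(lengths)

print(lattice_nodes([2000, 2000]))          # pairwise alignment
print(lattice_nodes([2000, 2000, 2000]))    # three sequences
print(f"{lattice_nodes([2000] * 8):.2e}")   # eight sequences
```

Each extra sequence multiplies both the running time and the storage by its length, which is why exact dynamic programming is limited to a handful of sequences.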
Size/Time limits…
• The most practical and widely used
method in multiple sequence alignment
is the hierarchical extension of
pairwise alignment methods.
• The principle is that a multiple alignment
is achieved by successive application
of pairwise methods.
– First do all pairwise alignments (not just one
sequence with all others)
– Then combine pairwise alignments to generate
overall alignment
Multiple Alignment Method
• The steps are summarized as follows:
– Compare all sequences pairwise.
– Perform cluster analysis on the pairwise data to
generate a hierarchy for alignment. This may be in
the form of a binary tree or a simple ordering
– Build the multiple alignment by first aligning the
most similar pair of sequences, then the next most
similar pair and so on. Once an alignment of two
sequences has been made, then this is fixed.
Thus for a set of sequences A, B, C, D having
aligned A with C and B with D the alignment of A,
B, C, D is obtained by comparing the alignments
of A and C with that of B and D using averaged
scores at each aligned position.
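The first two steps (pairwise comparison and clustering into an alignment order) can be sketched as below. This is a simplification under stated assumptions: similarity is a plain global alignment score (+1 / -1 / -2), clusters are joined by average score, and the final step of aligning the alignments is left out. ClustalW and PileUp use their own distance measures and tree methods; the function names and example sequences here are hypothetical.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score, keeping one matrix row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def guide_tree(seqs):
    """Return nested tuples giving the order in which to align the sequences."""
    # Step 1: compare all sequences pairwise.
    pair = {frozenset(p): nw_score(seqs[p[0]], seqs[p[1]]) for p in combinations(seqs, 2)}

    def average(c1, c2):
        scores = [pair[frozenset((x, y))] for x in c1[1] for y in c2[1]]
        return sum(scores) / len(scores)

    # Step 2: cluster; each cluster is (subtree, list of leaf names).
    clusters = [(name, [name]) for name in seqs]
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

example = {"A": "ACGTACGT", "B": "TTTTGGGG", "C": "ACGTACGA", "D": "TTTTGGCC"}
print(guide_tree(example))
```

For these example sequences A is joined with C and B with D before the two pairs are combined, mirroring the A, B, C, D case described above.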
Multiple Alignment Method
• Automatic multiple alignment
– extend dynamic programming (MSA - Lipman)
• limit: computing power: length and number of sequences
(e.g. 2000^8)
– progressive alignment (Feng & Doolittle)
• use “guide tree” (PileUp, ClustalW etc)
• Dedicated alignment editing program
– Boxshade
– SeaView
– SeqPup (Java)
• Combination (Biology – Computation)
Multiple Sequence Alignment programs
• ClustalW is a general purpose multiple
alignment program for DNA or proteins.
• ClustalW is produced by Julie D. Thompson and
Toby Gibson of the European Molecular Biology
Laboratory, Germany, and Desmond Higgins
of the European Bioinformatics Institute,
Cambridge, UK.
• Reference: "CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through
sequence weighting, position-specific gap
penalties and weight matrix choice". Nucleic
Acids Research (1994), 22:4673-4680.
ClustalW
****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps between alignments? = OFF
8. Toggle screen display = ON
9. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice:
Running ClustalW
• The final product of a PILEUP run is a set of aligned
sequences, which are stored in a Multiple
Sequence File (called .msf by GCG).
This msf file is a text file that can be formatted with
a text editor, but GCG has some dedicated tools for
improving the looks of msf files for easier
interpretation and for publication.
• Consensus sequences can be calculated and the
relationship of each character of each sequence to
the consensus can be highlighted using the
program PRETTY
Formatting Multiple Alignments
• Shading of regions of high homology can be created using
the programs BOXSHADE and PRETTYBOX , but that
goes beyond the scope of this tutorial. (Boxshade:
http://www.ch.embnet.org/software/BOX_form.html)
• In addition to these programs that run on the Alpha, the
output of PILEUP (or CLUSTAL) can be moved by FTP
from your RCR account to a local Mac or PC.
• Since this output is a plain text file, it can be edited with
any word processing program, or imported into any
drawing program to add boldface text, underlining,
shading, boxes, arrows, etc
Formatting Multiple Alignments
http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
An example of Multiple Alignment … immunoglobulin
• Their alignment highlights conserved
residues (one of the cysteines forming the
disulphide bridges, and the tryptophan are
notable)
• conserved regions (in particular, "Q.PG" at
the end of the first 4 sequences), and more
sophisticated patterns, like the dominance of
hydrophobic residues at fragment positions 1
and 3.
• The alternating hydrophobicity pattern is
typical for the surface beta-strand at the
beginning of each fragment. Indeed, multiple
alignments are helpful for protein structure
prediction.
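The fully conserved positions can be picked out of the alignment with a few lines of Python. The function name conserved_columns is an assumption; the sequences are the eight immunoglobulin fragments shown above, and positions are counted from 0.

```python
alignment = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]

def conserved_columns(rows):
    """Return (position, residue) for columns where every sequence has the same residue."""
    hits = []
    for index, column in enumerate(zip(*rows)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            hits.append((index, column[0]))
    return hits

print(conserved_columns(alignment))
```

Only the cysteine and the tryptophan are identical in all eight fragments.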
An example of Multiple Alignment … immunoglobulin
• Provided the alignment is accurate,
the following may be inferred
about the secondary structure from
a multiple sequence alignment:
– The position of insertions and
deletions (INDELS) suggests
regions where surface loops exist.
– Conserved glycine or proline
suggests a beta-turn.
A Practical Approach: Interpretation
– Residues with hydrophobic properties
conserved at i, i+2, i+4 separated by
unconserved or hydrophilic residues
suggest surface beta-strands.
– A short run of hydrophobic amino acids
(4 residues) suggests a buried beta-strand.
– Pairs of conserved hydrophobic amino
acids separated by pairs of
unconserved or hydrophilic residues
suggest an alpha-helix with one face
packing in the protein core. Likewise,
an i, i+3, i+4, i+7 pattern of conserved
hydrophobic residues.
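The i, i+2, i+4 rule can be turned into a simple scan. The sketch is a simplification: it looks at a single sequence instead of conservation across an alignment, and the set of residues treated as hydrophobic is an assumption, as are the function name and the test sequence.

```python
HYDROPHOBIC = set("AVLIMFWYC")   # assumed classification, other schemes exist

def surface_strand_candidates(seq):
    """Start positions where i, i+2, i+4 are hydrophobic and i+1, i+3 are not."""
    hits = []
    for i in range(len(seq) - 4):
        window = seq[i:i + 5]
        alternating = (all(window[k] in HYDROPHOBIC for k in (0, 2, 4))
                       and all(window[k] not in HYDROPHOBIC for k in (1, 3)))
        if alternating:
            hits.append(i)
    return hits

print(surface_strand_candidates("GSVKVEVTG"))
```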
A Practical Approach: Interpretation
• Take out noise (GAPS)
• Extra information (structure - function)
• Recursive selection
– first most similar to have an idea about
conserved regions
– manual scan for these in more distant
members then include these
A Practical Approach: Which sequences to use ?
L-align (2 sequences)
SIM (www.expasy.ch)
LALNVIEW is available for UNIX, Mac
and PC on the ExPASy anonymous
FTP server.
very nice TWEAKING tool (70% criteria)
Length
P-value
SIM
How can I use NCBI
to compare two
sequences?
Answer:
Use the
“BLAST 2 Sequences”
program
• Go to http://www.ncbi.nlm.nih.gov/BLAST
• Choose BLAST 2 sequences
• In the program,
[1] choose blastp (protein search) or blastn (for DNA)
[2] paste in your accession numbers
(or use FASTA format)
[3] select optional parameters, such as
-- the BLOSUM62 matrix is the default for proteins;
try PAM250 for distantly related proteins
-- gap creation and extension penalties
[4] click “align”
Practical guide to pairwise alignment:
the “BLAST 2 sequences” website
Question #2:
How can I use NCBI
to compare a
sequence to an
entire database?
BLAST!
Weblems
W4.1: Align the amino acid sequence of acetylcholine
receptor from human, rat, mouse, dog with
ClustalW
T-Coffee
Dali
MSA
W4.2: Use BoxShade to create a Word file indicating
the different conserved residues in colours
W4.3: Perform a local alignment using SIM and Lalign
on the same sequences, and with Blast2
W4.4: Do the different methods give different results?
What are the default settings they use?
W4.5: How would you identify critical residues for
catalytic activity?
ALN Risk Management Survey at ILTACON 2016
 
Tugas menejement pak yayann.
Tugas menejement pak yayann.Tugas menejement pak yayann.
Tugas menejement pak yayann.
 
Crc3adtica de-la-razc3b3n-boliviana
Crc3adtica de-la-razc3b3n-bolivianaCrc3adtica de-la-razc3b3n-boliviana
Crc3adtica de-la-razc3b3n-boliviana
 
Sistem ekonomi indonesia
Sistem ekonomi indonesiaSistem ekonomi indonesia
Sistem ekonomi indonesia
 
02 epc
02 epc02 epc
02 epc
 
CONTINUIDAD
CONTINUIDADCONTINUIDAD
CONTINUIDAD
 
Introduction to educational video
Introduction to educational videoIntroduction to educational video
Introduction to educational video
 
Educational Video3 - camera shots
Educational Video3 - camera shotsEducational Video3 - camera shots
Educational Video3 - camera shots
 
T5 II Revolució Industrial i Imperialisme
T5 II Revolució Industrial i ImperialismeT5 II Revolució Industrial i Imperialisme
T5 II Revolució Industrial i Imperialisme
 
Video6 editing
Video6 editingVideo6 editing
Video6 editing
 
Alternativa 01 2017
Alternativa 01   2017Alternativa 01   2017
Alternativa 01 2017
 

Similar to 2015 bioinformatics alignments_wim_vancriekinge

Bioinformatics t4-alignments wim_vancriekingev2013
Bioinformatics t4-alignments wim_vancriekingev2013Bioinformatics t4-alignments wim_vancriekingev2013
Bioinformatics t4-alignments wim_vancriekingev2013Prof. Wim Van Criekinge
 
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastB.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastRai University
 
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastB.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastRai University
 
sequence alignment
sequence alignmentsequence alignment
sequence alignmentammar kareem
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadfalizain9604
 
B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentRai University
 
B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentRai University
 
AI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfAI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfH K Yoon
 
Laboratory 1 sequence_alignments
Laboratory 1 sequence_alignmentsLaboratory 1 sequence_alignments
Laboratory 1 sequence_alignmentsseham15
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...journal ijrtem
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...IJRTEMJOURNAL
 
Introduction to sequence alignment
Introduction to sequence alignmentIntroduction to sequence alignment
Introduction to sequence alignmentKubuldinho
 
Approaches to online quantile estimation
Approaches to online quantile estimationApproaches to online quantile estimation
Approaches to online quantile estimationData Con LA
 
Artificial Intelligence Applications in Petroleum Engineering - Part I
Artificial Intelligence Applications in Petroleum Engineering - Part IArtificial Intelligence Applications in Petroleum Engineering - Part I
Artificial Intelligence Applications in Petroleum Engineering - Part IRamez Abdalla, M.Sc
 

Similar to 2015 bioinformatics alignments_wim_vancriekinge (20)

Bioinformatics t4-alignments v2014
Bioinformatics t4-alignments v2014Bioinformatics t4-alignments v2014
Bioinformatics t4-alignments v2014
 
Bioinformatica t4-alignments
Bioinformatica t4-alignmentsBioinformatica t4-alignments
Bioinformatica t4-alignments
 
Bioinformatics t4-alignments wim_vancriekingev2013
Bioinformatics t4-alignments wim_vancriekingev2013Bioinformatics t4-alignments wim_vancriekingev2013
Bioinformatics t4-alignments wim_vancriekingev2013
 
Bioinformatica 27-10-2011-t4-alignments
Bioinformatica 27-10-2011-t4-alignmentsBioinformatica 27-10-2011-t4-alignments
Bioinformatica 27-10-2011-t4-alignments
 
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastB.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blast
 
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastB.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blast
 
PPT
PPTPPT
PPT
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
 
B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignment
 
B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignment
 
Seq alignment
Seq alignment Seq alignment
Seq alignment
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence Alignment
 
AI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfAI 바이오 (4일차).pdf
AI 바이오 (4일차).pdf
 
Laboratory 1 sequence_alignments
Laboratory 1 sequence_alignmentsLaboratory 1 sequence_alignments
Laboratory 1 sequence_alignments
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
 
Introduction to sequence alignment
Introduction to sequence alignmentIntroduction to sequence alignment
Introduction to sequence alignment
 
Approaches to online quantile estimation
Approaches to online quantile estimationApproaches to online quantile estimation
Approaches to online quantile estimation
 
Artificial Intelligence Applications in Petroleum Engineering - Part I
Artificial Intelligence Applications in Petroleum Engineering - Part IArtificial Intelligence Applications in Petroleum Engineering - Part I
Artificial Intelligence Applications in Petroleum Engineering - Part I
 

More from Prof. Wim Van Criekinge

2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_uploadProf. Wim Van Criekinge
 
2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_upload2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_uploadProf. Wim Van Criekinge
 
2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_upload2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_uploadProf. Wim Van Criekinge
 
2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_upload2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_uploadProf. Wim Van Criekinge
 
2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_upload2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_uploadProf. Wim Van Criekinge
 
Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]Prof. Wim Van Criekinge
 
2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_upload2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_uploadProf. Wim Van Criekinge
 
2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_uploadProf. Wim Van Criekinge
 
2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_uploadProf. Wim Van Criekinge
 

More from Prof. Wim Van Criekinge (20)

2020 02 11_biological_databases_part1
2020 02 11_biological_databases_part12020 02 11_biological_databases_part1
2020 02 11_biological_databases_part1
 
2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload
 
2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_upload2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_upload
 
2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_upload2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_upload
 
2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_upload2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_upload
 
2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_upload2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_upload
 
P6 2018 biopython2b
P6 2018 biopython2bP6 2018 biopython2b
P6 2018 biopython2b
 
P4 2018 io_functions
P4 2018 io_functionsP4 2018 io_functions
P4 2018 io_functions
 
P3 2018 python_regexes
P3 2018 python_regexesP3 2018 python_regexes
P3 2018 python_regexes
 
T1 2018 bioinformatics
T1 2018 bioinformaticsT1 2018 bioinformatics
T1 2018 bioinformatics
 
P1 2018 python
P1 2018 pythonP1 2018 python
P1 2018 python
 
Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]
 
2018 05 08_biological_databases_no_sql
2018 05 08_biological_databases_no_sql2018 05 08_biological_databases_no_sql
2018 05 08_biological_databases_no_sql
 
2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_upload2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_upload
 
2018 03 20_biological_databases_part3
2018 03 20_biological_databases_part32018 03 20_biological_databases_part3
2018 03 20_biological_databases_part3
 
2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload
 
2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload
 
P7 2017 biopython3
P7 2017 biopython3P7 2017 biopython3
P7 2017 biopython3
 
P6 2017 biopython2
P6 2017 biopython2P6 2017 biopython2
P6 2017 biopython2
 
Van criekinge 2017_11_13_rodebiotech
Van criekinge 2017_11_13_rodebiotechVan criekinge 2017_11_13_rodebiotech
Van criekinge 2017_11_13_rodebiotech
 

Recently uploaded

On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701bronxfugly43
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docxPoojaSen20
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIShubhangi Sonawane
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 

Recently uploaded (20)

On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 

2015 bioinformatics alignments_wim_vancriekinge

  • 4. Rat versus mouse RBP; rat versus bacterial lipocalin
  • 5. Overview – Henikoff and Henikoff compared the BLOSUM matrices to PAM by evaluating how effectively each matrix detects known members of a protein family in a database when searching with the ungapped local alignment program BLAST. They conclude that, overall, the BLOSUM 62 matrix is the most effective. • However, every substitution matrix investigated performs better than BLOSUM 62 for a proportion of the families. This suggests that no single matrix is the complete answer for all sequence comparisons. • It is probably best to complement the BLOSUM 62 matrix with comparisons using the PAM 250 matrix and the structurally derived matrices of Overington. – It seems likely that, as more protein three-dimensional structures are determined, substitution tables derived from structure comparison will give the most reliable data.
  • 6. Available Dot Plot Programs: Dotlet (Java Applet) http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
  • 7. Sequence Alignments Introduction Algorithms What ? Examples Properties Dynamic Programming for Pairwise Alignment Concept Example Needleman-Wunsch(.pl) Smith-Waterman(.pl) Multiple Alignment MSA Hierarchical Pairwise Alignment ClustalW, PileUp Formatting Interpretation Alternative Methods SIM Blast2 Dali
  • 8. Global and local alignment. Pairwise sequence alignment can be global or local. Global: the sequences are completely aligned (Needleman and Wunsch, 1970). Local: only the best sub-regions are aligned (Smith and Waterman, 1981). BLAST uses local alignment.
  • 9. Why we do multiple alignments? – In order to characterize protein families, identify shared regions of homology in a multiple sequence alignment (this generally happens when a sequence search has revealed homologies to several sequences) – Determination of the consensus sequence of several aligned sequences – Help predict the secondary and tertiary structures of new sequences – Preliminary step in molecular evolution analysis using phylogenetic methods for constructing phylogenetic trees – Garbage in, garbage out – Chicken/egg
  • 10. Why we do multiple alignments? • To find conserved regions – Local multiple alignment reveals conserved regions – Conserved regions usually are key functional regions – These regions are prime targets for drug development • To do phylogenetic analysis: – Same protein from different species – Optimal multiple alignment probably implies history – Discover irregularities, such as the cystic fibrosis gene
  • 12. Sequence Alignments Introduction Algorithms What ? Examples Properties Dynamic Programming for Pairwise Alignment Concept Example Needleman-Wunsch(.pl) Smith-Waterman(.pl) Multiple Alignment MSA Hierarchical Pairwise Alignment ClustalW, PileUp Formatting Interpretation Alternative Methods SIM Blast2 Dali
  • 13. Algorithms and Programs • Algorithm: a method or a process followed to solve a problem. – A recipe. • An algorithm takes the input to a problem (function) and transforms it to the output. – A mapping of input to output. • A problem can have many algorithms.
  • 15. Aryabhata-Euclid's algorithm: how to find gcd(a,b), the greatest common divisor of a and b. Based on a single observation: if a = b q + r, then any divisor of a and b is also a divisor of r, and any divisor of b and r is also a divisor of a, so gcd(a,b) = gcd(b,r). Euclid's algorithm: use the division algorithm repeatedly to reduce the problem to one you can solve. Example: gcd(55,35)
        55 = 35*1 + 20, so gcd(55,35) = gcd(35,20)
        35 = 20*1 + 15, so gcd(35,20) = gcd(20,15)
        20 = 15*1 + 5, so gcd(20,15) = gcd(15,5)
        15 = 5*3 + 0, done: gcd(55,35) = 5
  • 17. GCD.py
        def gcd(a, b):
            while a != 0:
                a, b = b % a, a  # parallel assignment
            return b

        print(gcd(55, 35))
  • 18. Bubble Sort Algorithm. One of the simplest sorting algorithms proceeds by walking down the list, comparing adjacent elements, and swapping them if they are in the wrong order. The process is continued until the list is sorted. More formally:
        1. Initialize the size of the list to be sorted to be the actual size of the list.
        2. Loop through the list until no element needs to be exchanged with another to reach its correct position.
           2.1 Loop (i) from 0 to size of the list to be sorted - 2.
               2.1.1 Compare the ith and (i + 1)st elements in the unsorted list.
               2.1.2 Swap the ith and (i + 1)st elements if not in order (ascending or descending as desired).
           2.2 Decrease the size of the list to be sorted by 1.
        Each pass "bubbles" the largest element in the unsorted part of the list to its correct location.
        Example array A: 13 7 43 5 3 19 2 23 29
  • 19. Bubble Sort Implementation. Bubblesort compares and swaps adjacent elements; simple but not very efficient. Here is an ascending-order implementation of the bubblesort algorithm for integer arrays:
        void BubbleSort(int List[], int Size) {
            int tempInt;  // temp variable for swapping list elems
            for (int Stop = Size - 1; Stop > 0; Stop--) {
                for (int Check = 0; Check < Stop; Check++) {  // make a pass
                    if (List[Check] > List[Check + 1]) {      // compare elems
                        tempInt = List[Check];                // swap if in the
                        List[Check] = List[Check + 1];        // wrong order
                        List[Check + 1] = tempInt;
                    }
                }
            }
        }
        Efficiency note: the outer loop could be modified to exit if the list is already sorted.
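The efficiency note on the bubble sort slide can be illustrated with a short Python sketch. It is not part of the original deck: the function name `bubble_sort` and the `swapped` flag are assumptions made for illustration. The outer loop stops as soon as a full pass makes no swap, which means the list is already sorted.

```python
def bubble_sort(values):
    """Return a new ascending-sorted list (bubble sort with early exit)."""
    items = list(values)       # copy so the caller's list is left untouched
    stop = len(items) - 1      # last index of the still-unsorted part
    while stop > 0:
        swapped = False
        for check in range(stop):                  # make one pass
            if items[check] > items[check + 1]:    # compare neighbours
                items[check], items[check + 1] = items[check + 1], items[check]
                swapped = True
        if not swapped:        # a pass without swaps: already sorted, exit early
            break
        stop -= 1              # the largest element has bubbled into place
    return items


# Example array A from the bubble sort slide
print(bubble_sort([13, 7, 43, 5, 3, 19, 2, 23, 29]))
```

On an already sorted input this version makes a single pass and stops, whereas the version without the flag always performs every pass.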
  • 20. "Great algorithms are the poetry of computation"
  • 21. "Great algorithms are the poetry of computation"
        1946: The Metropolis Algorithm for Monte Carlo. Through the use of random processes, this algorithm offers an efficient way to stumble toward answers to problems that are too complicated to solve exactly.
        1947: Simplex Method for Linear Programming. An elegant solution to a common problem in planning and decision-making.
        1950: Krylov Subspace Iteration Method. A technique for rapidly solving the linear equations that abound in scientific computation.
        1951: The Decompositional Approach to Matrix Computations. A suite of techniques for numerical linear algebra.
        1957: The Fortran Optimizing Compiler. Turns high-level code into efficient computer-readable code.
        1959: QR Algorithm for Computing Eigenvalues. Another crucial matrix operation made swift and practical.
        1962: Quicksort Algorithm for Sorting. For the efficient handling of large databases.
        1965: Fast Fourier Transform. Perhaps the most ubiquitous algorithm in use today, it breaks down waveforms (like sound) into periodic components.
        1977: Integer Relation Detection. A fast method for spotting simple equations satisfied by collections of seemingly unrelated numbers.
        1987: Fast Multipole Method. A breakthrough in dealing with the complexity of n-body calculations, applied in problems ranging from celestial mechanics to protein folding.
        From Random Samples, Science page 799, February 4, 2000.
  • 22. Algorithm Properties • An algorithm possesses the following properties: – It must be correct. – It must be composed of a series of concrete steps. – There can be no ambiguity as to which step will be performed next. – It must be composed of a finite number of steps. – It must terminate. • A computer program is an instance, or concrete representation, of an algorithm in some programming language.
  • 23. Measuring Algorithm Efficiency • Types of complexity – Space complexity – Time complexity • Analysis of algorithms – The measuring of the complexity of an algorithm • Cannot compute actual time for an algorithm – We usually measure worst-case time
  • 24. Measuring Algorithm Efficiency. Three algorithms for computing 1 + 2 + … + n for an integer n > 0
  • 25. Measuring Algorithm Efficiency The number of operations required by the algorithms
  • 26. Measuring Algorithm Efficiency The number of operations required by the algorithms as a function of n
  • 27. Big Oh Notation • To say "Algorithm A has a worst-case time requirement proportional to n" – We say A is O(n) – Read "Big Oh of n" • For the other two algorithms – Algorithm B is O(n²) – Algorithm C is O(1) • O is derived from order (magnitude)
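The three algorithms themselves appear only as images in the deck, so the sketch below is an assumption about what they might look like, chosen to match the stated orders: A as a single loop (O(n)), B as a nested loop that adds 1 repeatedly (O(n²)), and C as the closed formula (O(1)). The function names and the convention of counting one operation per addition are hypothetical.

```python
def sum_a(n):
    """Algorithm A (assumed): one loop, n additions -> O(n)."""
    total, ops = 0, 0
    for i in range(1, n + 1):
        total += i
        ops += 1
    return total, ops


def sum_b(n):
    """Algorithm B (assumed): add 1 repeatedly in a nested loop -> O(n^2)."""
    total, ops = 0, 0
    for i in range(1, n + 1):
        for _ in range(i):
            total += 1
            ops += 1
    return total, ops


def sum_c(n):
    """Algorithm C (assumed): closed formula n(n+1)/2 -> O(1)."""
    return n * (n + 1) // 2, 1


for n in (10, 100, 1000):
    print(n, sum_a(n), sum_b(n), sum_c(n))
```

All three return the same sum; only the operation count grows differently with n, which is what the Big Oh classification captures.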
  • 31. Sequence Alignments Introduction Algorithms What ? Examples Properties Dynamic Programming for Pairwise Alignment Concept Example Needleman-Wunsch(.pl) Smith-Waterman(.pl) Multiple Alignment MSA Hierarchical Pairwise Alignment ClustalW, PileUp Formatting Interpretation Alternative Methods SIM Blast2 Dali
  • 32. The best alignment: The one with the maximum total score
  • 33. Overview • Exhaustive … – All combinations • Algorithm – Dynamic programming (much faster) – Needleman-Wunsch for global alignments (Journal of Molecular Biology, 1970) – Later adapted by Smith and Waterman for local alignment • Heuristics
  • 35. • Score of an alignment: reward matches and penalize mismatches and spaces. – E.g., each column gets a value depending on its content: • a match: +1 (both rows have the same character); • a mismatch: -1 (the rows have different characters); and • a space in a column: -2. – The total score of an alignment is the sum of the values assigned to its columns.
  • 36. A metric … Sequences: GACGGATTAG, GATCGGAATAG
        GA-CGGATTAG
        GATCGGAATAG
        Scores: +1 (a match), -1 (a mismatch), -2 (a gap). Total: 9*1 + 1*(-1) + 1*(-2) = 6
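The column-by-column scoring of the example alignment can be checked with a few lines of Python. This is a minimal sketch, not code from the deck: the function name `score_alignment` and its parameter names are assumptions, and `-` is taken to mark a space.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum the column values of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":   # a space in the column
            total += gap
        elif a == b:               # same character in both rows
            total += match
        else:                      # different characters
            total += mismatch
    return total


# 9 matches, 1 mismatch, 1 gap
print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))  # 6
```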
  • 37. Overview: Dynamic programming. Reduce the problem: a large problem becomes simpler to solve if we first know the solution to a smaller problem that is a subset of the larger problem. [Diagram: problem P decomposed into subproblems P1, P2, P3]
  • 38. Dynamic Programming • Finding optimal solution to search problem • Recursively computes solution • Fundamental principle is to produce optimal solutions to smaller pieces of the problem first and then glue them together • Efficient divide-and-conquer strategy because it uses a bottom-up approach and utilizes a look-up table instead of recomputing optimal solutions to subproblems [Diagram: problem P decomposed into subproblems P1, P2, P3]
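A bottom-up look-up table of this kind can be sketched for global alignment scoring. The code below is an illustration rather than the deck's Needleman-Wunsch(.pl) script: it assumes the +1 / -1 / -2 scheme from the scoring slide, and the name `global_alignment_score` is hypothetical. Each table cell holds the best score for aligning two prefixes and is computed from three smaller subproblems that are already in the table.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of s and t via a bottom-up table."""
    rows, cols = len(s) + 1, len(t) + 1
    # table[i][j] = best score for aligning s[:i] with t[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap          # s[:i] against nothing: i spaces
    for j in range(1, cols):
        table[0][j] = j * gap          # t[:j] against nothing: j spaces
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,   # align s[i-1] with t[j-1]
                table[i - 1][j] + gap,        # s[i-1] against a space
                table[i][j - 1] + gap,        # t[j-1] against a space
            )
    return table[-1][-1]


print(global_alignment_score("GACGGATTAG", "GATCGGAATAG"))  # 6
```

The table has (len(s) + 1) × (len(t) + 1) cells and each is filled once, so the work grows with the product of the sequence lengths instead of with the number of possible alignments.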
  • 39. The best alignment between • a zinc-finger core sequence: CKHVFCRVCI • and a sequence fragment from a viral polyprotein: CKKCFCKCV
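  • 40. Dynamic Programming (1 = identical residues, . = empty cell)
           C K H V F C R V C I
         +--------------------
       C | 1 . . . . 1 . . 1 .
       K | . 1 . . . . . . . .
       K | . 1 . . . . . . . .
       C | 1 . . . . 1 . . 1 .
       F | . . . . 1 . . . . .
       C | 1 . . . . 1 . . 1 .
       K | . 1 . . . . . . . .
       C | 1 . . . . 1 . . 1 .
       V | . . . 1 . . . 1 . .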
  • 42. Dynamic Programming (. = cell not yet filled)
           C K H V F C R V C I
         +--------------------
       C | 1 . . . . 1 . . 1 0
       K | . 1 . . . . . . . 0
       K | . 1 . . . . . . . 0
       C | 1 . . . . 1 . . 1 0
       F | . . . . 1 . . . . 0
       C | 1 . . . . 1 . . 1 0
       K | . 1 . . . . . . . 0
       C | 1 . . . . 1 . . 1 0
       V | 0 0 0 1 0 0 0 1 0 0
  • 43. Dynamic Programming
           C K H V F C R V C I
         +--------------------
       C | 1 . . . . 1 . . 1 0
       K | . 1 . . . . . . . 0
       K | . 1 . . . . . . . 0
       C | 1 . . . . 1 . . 1 0
       F | . . . . 1 . . . . 0
       C | 1 . . . . 1 . . 1 0
       K | . 1 . . . . . . . 0
       C | 2 . . . . 1 . . 1 0
       V | 0 0 0 1 0 0 0 1 0 0
  • 44. Dynamic Programming
           C K H V F C R V C I
         +--------------------
       C | 1 . . . . 1 . . 1 0
       K | . 1 . . . . . . 0 0
       K | . 1 . . . . . . 0 0
       C | 1 . . . . 1 . . 1 0
       F | . . . . 1 . . . 0 0
       C | 1 . . . . 1 . . 1 0
       K | . 1 . . . . . . 0 0
       C | 2 1 1 1 1 2 1 0 1 0
       V | 0 0 0 1 0 0 0 1 0 0
  • 45. Dynamic Programming
           C K H V F C R V C I
         +--------------------
       C | 1 . . . . 1 . 1 1 0
       K | . 1 . . . . . 1 0 0
       K | . 1 . . . . . 1 0 0
       C | 1 . . . . 1 . 1 1 0
       F | . . . . 1 . . 1 0 0
       C | 1 . . . . 1 . 1 1 0
       K | 2 3 2 2 2 1 1 1 0 0
       C | 2 1 1 1 1 2 1 0 1 0
       V | 0 0 0 1 0 0 0 1 0 0
  • 46. Dynamic Programming
           C K H V F C R V C I
         +--------------------
       C | 1 . . . . 1 1 1 1 0
       K | . 1 . . . . 1 1 0 0
       K | . 1 . . . . 1 1 0 0
       C | 1 . . . . 1 1 1 1 0
       F | . . . . 1 . 1 1 0 0
       C | 4 2 2 2 2 2 1 1 1 0
       K | 2 3 2 2 2 1 1 1 0 0
       C | 2 1 1 1 1 2 1 0 1 0
       V | 0 0 0 1 0 0 0 1 0 0
  • 47. Dynamic Programming
           C K H V F C R V C I
         +--------------------
       C | 1 . . . . 2 1 1 1 0
       K | . 1 . . . 1 1 1 0 0
       K | . 1 . . . 1 1 1 0 0
       C | 1 . . . . 2 1 1 1 0
       F | 2 2 2 2 3 1 1 1 0 0
       C | 4 2 2 2 2 2 1 1 1 0
       K | 2 3 2 2 2 1 1 1 0 0
       C | 2 1 1 1 1 2 1 0 1 0
       V | 0 0 0 1 0 0 0 1 0 0
  • 48. Dynamic Programming
           C K H V F C R V C I
         +--------------------
       C | 1 . . . 2 2 1 1 1 0
       K | . 1 . . 2 1 1 1 0 0
       K | . 1 . . 2 1 1 1 0 0
       C | 4 3 3 3 2 2 1 1 1 0
       F | 2 2 2 2 3 1 1 1 0 0
       C | 4 2 2 2 2 2 1 1 1 0
       K | 2 3 2 2 2 1 1 1 0 0
       C | 2 1 1 1 1 2 1 0 1 0
       V | 0 0 0 1 0 0 0 1 0 0
  • 49. Dynamic Programming
           C K H V F C R V C I
         +--------------------
       C | 1 . . 3 2 2 1 1 1 0
       K | . 1 . 3 2 1 1 1 0 0
       K | 3 4 3 3 2 1 1 1 0 0
       C | 4 3 3 3 2 2 1 1 1 0
       F | 2 2 2 2 3 1 1 1 0 0
       C | 4 2 2 2 2 2 1 1 1 0
       K | 2 3 2 2 2 1 1 1 0 0
       C | 2 1 1 1 1 2 1 0 1 0
       V | 0 0 0 1 0 0 0 1 0 0
  • 50. Dynamic Programming
           C K H V F C R V C I
         +--------------------
       C | 1 . 3 3 2 2 1 1 1 0
       K | 4 4 3 3 2 1 1 1 0 0
       K | 3 4 3 3 2 1 1 1 0 0
       C | 4 3 3 3 2 2 1 1 1 0
       F | 2 2 2 2 3 1 1 1 0 0
       C | 4 2 2 2 2 2 1 1 1 0
       K | 2 3 2 2 2 1 1 1 0 0
       C | 2 1 1 1 1 2 1 0 1 0
       V | 0 0 0 1 0 0 0 1 0 0
  • 51. Dynamic Programming
           C K H V F C R V C I
         +--------------------
       C | 5 3 3 3 2 2 1 1 1 0
       K | 4 4 3 3 2 1 1 1 0 0
       K | 3 4 3 3 2 1 1 1 0 0
       C | 4 3 3 3 2 2 1 1 1 0
       F | 2 2 2 2 3 1 1 1 0 0
       C | 4 2 2 2 2 2 1 1 1 0
       K | 2 3 2 2 2 1 1 1 0 0
       C | 2 1 1 1 1 2 1 0 1 0
       V | 0 0 0 1 0 0 0 1 0 0
  • 52. Dynamic Programming
           C K H V F C R V C I
         +--------------------
       C | 5 3 3 3 2 2 1 1 1 0
       K | 4 4 3 3 2 1 1 1 0 0
       K | 3 4 3 3 2 1 1 1 0 0
       C | 4 3 3 3 2 2 1 1 1 0
       F | 3 2 2 2 3 1 1 1 0 0
       C | 4 2 2 2 2 2 1 1 1 0
       K | 2 3 2 2 2 1 1 1 0 0
       C | 2 1 1 1 1 2 1 0 1 0
       V | 0 0 0 1 0 0 0 1 0 0
  • 61. Dynamic Programming
           C K H V F C R V C I
         +--------------------
       C | 5 3 3 3 2 2 1 1 1 0
       K | 4 4 3 3 2 1 1 1 0 0
       K | 3 4 3 3 2 1 1 1 0 0
       C | 4 3 3 3 2 2 1 1 1 0
       F | 3 2 2 2 3 1 1 1 0 0
       C | 4 2 2 2 2 2 1 1 1 0
       K | 2 3 2 2 2 1 1 1 0 0
       C | 2 1 1 1 1 2 1 0 1 0
       V | 0 0 0 1 0 0 0 1 0 0

       C K H V F C R V C I      C K H V F C R V C I
       C K K C F C - K C V      C K K C F C K - C V

       C - K H V F C R V C I    C K H - V F C R V C I
       C K K C - F C - K C V    C K K C - F C - K C V
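The numbers in the completed matrix are consistent with a fill rule in which each cell holds its own match score (1 or 0) plus the best value found in the next row or next column beyond it, with no gap penalty, working from the bottom-right corner. The sketch below assumes that rule; it is inferred from the values on the slides rather than taken from a course script, and `sum_matrix` is a hypothetical name.

```python
def sum_matrix(seq1, seq2):
    """Fill the match-sum matrix from the bottom-right corner.

    seq1 runs along the columns, seq2 along the rows, as on the slides.
    """
    rows, cols = len(seq2), len(seq1)
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            match = 1 if seq2[i] == seq1[j] else 0
            best = 0
            if i + 1 < rows and j + 1 < cols:
                # Best continuation: anywhere further right in the next row,
                # or anywhere further down in the next column.
                in_next_row = max(m[i + 1][j + 1:])
                in_next_col = max(m[k][j + 1] for k in range(i + 1, rows))
                best = max(in_next_row, in_next_col)
            m[i][j] = match + best
    return m


matrix = sum_matrix("CKHVFCRVCI", "CKKCFCKCV")
for residue, row in zip("CKKCFCKCV", matrix):
    print(residue, "|", " ".join(str(v) for v in row))
```

The top-left value is the maximum number of matched residues; the alignments listed on the slide are obtained by tracing back from that cell through the largest values.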
  • 64. Needleman-Wunsch-Simple.py: The Score Matrix
       Seq1 (j)          1   2   3   4   5   6   7
       Seq2 (i)      *   C   K   H   V   F   C   R
           *         0  -1  -2  -3  -4  -5  -6  -7
        1  C        -1   1   0  -1  -2  -3  -4  -5
        2  K        -2   0   2   1   0  -1  -2  -3
        3  K        -3  -1   1   1   0  -1  -2  -3
        4  C        -4  -2   0   0   0  -1   0  -1
        5  F        -5  -3  -1  -1  -1   1   0  -1
        6  C        -6  -4  -2  -2  -2   0   2   1
        7  K        -7  -5  -3  -3  -3  -1   1   1
        8  C        -8  -6  -4  -4  -4  -2   0   0
        9  V        -9  -7  -5  -5  -3  -3  -1  -1
  • 66. Needleman-Wunsch-Simple.py: each cell (i,j) of the score matrix above is the maximum of three candidates, labelled a, b and c in the figure. A (diagonal): matrix(i-1,j-1) + MATCH if seq1[j-1] equals seq2[i-1], otherwise matrix(i-1,j-1) + MISMATCH. B (up): up_score = matrix(i-1,j) + GAP. C (left): left_score = matrix(i,j-1) + GAP.
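The three-way recurrence can be sketched as below. This is not the course's Needleman-Wunsch-Simple.py, which is not reproduced in the slides; `nw_score_matrix` is a hypothetical name, and the values MATCH = +1, MISMATCH = -1, GAP = -1 are an assumption inferred from the printed matrix (note that GAP differs from the -2 used in the earlier scoring example).

```python
def nw_score_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment score matrix; seq1 indexes columns (j), seq2 rows (i)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    matrix = [[0] * cols for _ in range(rows)]
    # First row and column: increasing gap penalties.
    for j in range(1, cols):
        matrix[0][j] = j * gap
    for i in range(1, rows):
        matrix[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
            diagonal = matrix[i - 1][j - 1] + pair   # A: (mis)match
            up_score = matrix[i - 1][j] + gap        # B: gap
            left_score = matrix[i][j - 1] + gap      # C: gap
            matrix[i][j] = max(diagonal, up_score, left_score)
    return matrix


for row in nw_score_matrix("CKHVFCR", "CKKCFCKCV"):
    print(" ".join(f"{v:3d}" for v in row))
```

The bottom-right cell is the score of the best global alignment; a traceback from that cell recovers the alignment itself.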
  • 71. Dynamic Programming: Needleman-Wunsch-Complete.py • Extensions to the basic dynamic programming method • Use gap penalties – constant gap penalty for gap > 1 – gap penalty proportional to gap size • one penalty for starting a gap (gap opening penalty) • different (lower) penalty for adding to a gap (gap extension penalty) • Use BLOSUM62 instead of MATCH and MISMATCH
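A gap model with separate opening and extension penalties can be written as a small function. This is a sketch only: the name `affine_gap_penalty` and the default values -10 and -1 are illustrative assumptions, not the settings of Needleman-Wunsch-Complete.py, and conventions differ on whether the opening penalty already covers the first gap position (here it does).

```python
def affine_gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Penalty for one gap of `length` positions.

    The first position costs gap_open; each further position costs gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# One long gap is cheaper than several short ones of the same total length.
print(affine_gap_penalty(4))        # a single gap of 4 positions
print(4 * affine_gap_penalty(1))    # four separate gaps of 1 position
```

Making extension cheaper than opening favours a few long gaps over many scattered ones.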
  • 78. Uses of Needleman-Wunsch-Complete.py • Time Complexity • Use random proteins to generate histogram of scores from aligned random sequences
  • 79. Time complexity with Needleman-Wunsch-Complete.py
       Sequence length (aa)   Execution time (h:mm:ss)
       10                     0:00:00.001500
       25                     0:00:00.005340
       50                     0:00:00.020112
       100                    0:00:00.081580
       500                    0:00:01.960721
       1000                   0:00:07.720884
       10000                  0:11:36.344549
       100000                 Memory could not be written
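One way to read these timings is to estimate the exponent p in time ∝ n^p from pairs of measurements. The sketch below uses only the numbers in the table, converted to seconds; `growth_exponent` is a hypothetical helper name.

```python
import math

# Execution times from the table, in seconds (0:11:36.344549 = 696.344549 s).
timings = {
    10: 0.001500,
    25: 0.005340,
    50: 0.020112,
    100: 0.081580,
    500: 1.960721,
    1000: 7.720884,
    10000: 696.344549,
}


def growth_exponent(n1, n2):
    """Estimate p in time ~ n**p from two measured sequence lengths."""
    return math.log(timings[n2] / timings[n1]) / math.log(n2 / n1)


for n1, n2 in [(50, 100), (500, 1000), (1000, 10000)]:
    print(n1, n2, round(growth_exponent(n1, n2), 2))
```

For the longer sequences the estimate is close to 2, which is what filling an n-by-n matrix predicts; the same quadratic growth in memory is a plausible reason for the failure at 100000 residues.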
  • 80. Simple version (Match/Mismatch) – no gap extension
  • 82. Sensitivity: ability to find true positives. Specificity: ability to minimize false positives.
                                         homologous sequences   non-homologous sequences
       Sequences reported as related     True positives         False positives
       Sequences reported as unrelated   False negatives        True negatives
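The two measures can be computed from the four cells of the table. The formulas below are the usual definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the slide gives them only in words, and the counts in the example are made up for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of truly homologous sequences that were reported as related."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-homologous sequences that were reported as unrelated."""
    return tn / (tn + fp)


# Hypothetical search result: 100 homologous and 100 non-homologous sequences.
tp, fn, fp, tn = 80, 20, 10, 90
print(sensitivity(tp, fn), specificity(tn, fp))
```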
  • 83. If the sequences are similar, the path of the best alignment should be very close to the main diagonal. Therefore, we may not need to fill the entire matrix; instead, we fill a narrow band of entries around the main diagonal. Such an algorithm fills in a band of width 2k+1 around the main diagonal.
  • 84. Local alignment • The concept of ‘local alignment’ was introduced by Smith & Waterman in 1981 • A local alignment of 2 sequences is an alignment between parts of the 2 sequences • Two proteins may share one stretch of high sequence similarity, but be very dissimilar outside that region • A global (N-W) alignment of such sequences would have: (i) lots of matches in the region of high sequence similarity (ii) lots of mismatches & gaps (insertions/deletions) outside the region of similarity • It makes sense to find the best local alignment instead
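A banded fill can be sketched as follows. The function name `banded_nw_score` and its defaults are assumptions for illustration. For readability the sketch still allocates the full matrix and only computes the cells with |i - j| <= k; a memory-saving version would store just the band.

```python
def banded_nw_score(a, b, k, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells within k of the main diagonal."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the bottom-right cell")
    neg_inf = float("-inf")
    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(n + 1):
        # Only columns inside the band of width 2k+1 are visited.
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                pair = match if a[i - 1] == b[j - 1] else mismatch
                best = score[i - 1][j - 1] + pair
            if i > 0:
                best = max(best, score[i - 1][j] + gap)
            if j > 0:
                best = max(best, score[i][j - 1] + gap)
            score[i][j] = best
    return score[n][m]


print(banded_nw_score("GACGGATTAG", "GATCGGAATAG", k=1))
```

The result equals the unrestricted optimum only when the best path stays inside the band, so k has to be at least as large as the biggest diagonal shift the true alignment needs.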
  • 85. Smith-Waterman.py • Three changes – The edges of the matrix are initialized to 0 instead of increasing gap penalties – The maximum score is never less than 0, and no pointer is recorded unless the score is greater than 0 – The trace-back starts from the highest score in the matrix (rather than at the end of the matrix) and ends at a score of 0 (rather than the start of the matrix)
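The three changes can be seen in a compact sketch. This is not the course's Smith-Waterman.py; the name `smith_waterman`, the scoring defaults, and the choice to recompute the traceback direction instead of storing pointers are all assumptions made to keep the example short.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    # Change 1: the edges stay at 0 instead of increasing gap penalties.
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Change 2: a cell never drops below 0.
            h[i][j] = max(0, h[i - 1][j - 1] + pair,
                          h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Change 3: trace back from the highest score until a 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("XXXHELLOYYY", "ZZHELLOWW"))
```

Only the shared stretch is reported; the dissimilar flanks do not pull the score down as they would in a global alignment.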
  • 90. The best alignment: The one with the maximum total score. Multiple Alignment: n>2
  • 91. 2 to 3: hyperlattice
  • 92. On its top-left side, the cube is "covered" by the polyhedron. The edges 1, 2, 3, 6 and 7 are coming from the inside, and edges 4 and 5 can be ignored (and are therefore not labeled in the figure).
  • 93. • Each node in the k-dimensional hyperlattice is visited once, and therefore the running time must be proportional to the number of nodes in the lattice. – This number is the product of the lengths of the sequences. – eg. the 3-dimensional lattice as visualized. Computational Complexity of MA by standard Dynamic Programming
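To get a feel for these numbers, the node count can be computed directly as the product of the sequence lengths, following the slide's statement (a full dynamic programming lattice has one extra row per dimension, which does not change the order of magnitude). `lattice_nodes` is a hypothetical helper name.

```python
import math


def lattice_nodes(lengths):
    """Number of lattice nodes, taken as the product of the sequence lengths."""
    return math.prod(lengths)


print(lattice_nodes([300, 300]))        # two sequences
print(lattice_nodes([300, 300, 300]))   # three sequences
print(lattice_nodes([2000] * 8))        # eight sequences of 2000 residues
```

Each added sequence multiplies both the running time and the storage by its length, which is why exact multiple alignment is limited to a handful of short sequences.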
  • 94. • The memory space requirement is even worse. To trace back the alignment, we need to store the whole lattice, a data structure the size of a multidimensional skyscraper. – In fact, space is the No.1 problem here, bogging down multiple alignment methods that try to achieve optimality. – Furthermore, incorporating a realistic gap model further increases our demands on space and running time.
  • 96. Multiple Alignment Method • The most practical and widely used method in multiple sequence alignment is the hierarchical extension of pairwise alignment methods. • The principle is that a multiple alignment is achieved by successive application of pairwise methods. – First do all pairwise alignments (not just one sequence with all others) – Then combine pairwise alignments to generate the overall alignment
  • 97. Multiple Alignment Method • The steps are summarized as follows: – Compare all sequences pairwise. – Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering. – Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair, and so on. Once an alignment of two sequences has been made, it is fixed. Thus, for a set of sequences A, B, C, D, having aligned A with C and B with D, the alignment of A, B, C, D is obtained by comparing the alignment of A and C with that of B and D, using averaged scores at each aligned position.
  • 100. Multiple Sequence Alignment programs • Automatic multiple alignment – extend dynamic programming (MSA - Lipman) • limit: computing power: length and number of sequences (e.g. 2000^8) – progressive alignment (Feng & Doolittle) • use “guide tree” (PileUp, ClustalW etc) • Dedicated alignment editing program – Boxshade – SeaView – SeqPup (Java) • Combination (Biology – Computation)
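The first step, and the ranking that decides which pair is aligned first, can be sketched as below. This covers only the all-against-all comparison, not the clustering or the merging of alignments; the function names, the simple +1 / -1 / -1 scoring and the toy sequences are assumptions for illustration.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score, keeping only two rows of the matrix."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def rank_pairs(seqs):
    """Score every pair of sequences and sort from most to least similar."""
    scored = [(nw_score(seqs[x], seqs[y]), x, y)
              for x, y in combinations(sorted(seqs), 2)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Toy input: three short sequences identified by name.
sequences = {"A": "ACGT", "B": "ACGA", "C": "TTTT"}
for score, x, y in rank_pairs(sequences):
    print(x, y, score)
```

The pair at the top of the ranking would be aligned first and then kept fixed; programs such as ClustalW derive a guide tree from these pairwise values instead of using a plain sorted list.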
  • 101. ClustalW • ClustalW is a general purpose multiple alignment program for DNA or proteins. • ClustalW is produced by Julie D. Thompson and Toby Gibson of the European Molecular Biology Laboratory, Germany, and Desmond Higgins of the European Bioinformatics Institute, Cambridge, UK. • Algorithmic: improves the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.
  • 102. Running ClustalW
       ****** MULTIPLE ALIGNMENT MENU ******
       1. Do complete multiple alignment now (Slow/Accurate)
       2. Produce guide tree file only
       3. Do alignment using old guide tree file
       4. Toggle Slow/Fast pairwise alignments = SLOW
       5. Pairwise alignment parameters
       6. Multiple alignment parameters
       7. Reset gaps between alignments? = OFF
       8. Toggle screen display = ON
       9. Output format options
       S. Execute a system command
       H. HELP
       or press [RETURN] to go back to main menu
       Your choice:
  • 103. • The final product of a PILEUP run is a set of aligned sequences, which are stored in a Multiple Sequence File (called .msf by GCG). This msf file is a text file that can be formatted with a text editor, but GCG has some dedicated tools for improving the looks of msf files for easier interpretation and for publication. • Consensus sequences can be calculated and the relationship of each character of each sequence to the consensus can be highlighted using the program PRETTY Formatting Multiple Alignments
  • 104. • Shading of regions of high homology can be created using the programs BOXSHADE and PRETTYBOX , but that goes beyond the scope of this tutorial. (Boxshade: http://www.ch.embnet.org/software/BOX_form.html) • In addition to these programs that run on the Alpha, the output of PILEUP (or CLUSTAL) can be moved by FTP from your RCR account to a local Mac or PC. • Since this output is a plain text file, it can be edited with any word processing program, or imported into any drawing program to add boldface text, underlining, shading, boxes, arrows, etc Formatting Multiple Alignments
  • 107. • Their alignment highlights conserved residues (one of the cysteines forming the disulphide bridges, and the tryptophan are notable) • conserved regions (in particular, "Q.PG" at the end of the first 4 sequences), and more sophisticated patterns, like the dominance of hydrophobic residues at fragment positions 1 and 3. • The alternating hydrophobicity pattern is typical for the surface beta-strand at the beginning of each fragment. Indeed, multiple alignments are helpful for protein structure prediction. An example of Multiple Alignment … immunoglobulin
  • 108. A Practical Approach: Interpretation • Provided the alignment is accurate, the following may be inferred about the secondary structure from a multiple sequence alignment: – The position of insertions and deletions (INDELS) suggests regions where surface loops exist. – Conserved glycine or proline suggests a beta-turn.
  • 109. A Practical Approach: Interpretation • Residues with hydrophobic properties conserved at i, i+2, i+4, separated by unconserved or hydrophilic residues, suggest surface beta-strands. • A short run of hydrophobic amino acids (4 residues) suggests a buried beta-strand. • Pairs of conserved hydrophobic amino acids separated by pairs of unconserved or hydrophilic residues suggest an alpha-helix with one face packing in the protein core. Likewise, an i, i+3, i+4, i+7 pattern of conserved hydrophobic residues suggests such a helix.
  • 110. • Take out noise (GAPS) • Extra information (structure - function) • Recursive selection – first most similar to have an idea about conserved regions – manual scan for these in more distant members then include these A Practical Approach: Which sequences to use ?
  • 112. L-align (2 sequences) SIM (www.expasy.ch) LALNVIEW is available for UNIX, Mac and PC on the ExPASy anonymous FTP server. very nice TWEAKING tool (70% criteria)
  • 114. SIM
  • 116. How can I use NCBI to compare two sequences? Answer: Use the “BLAST 2 Sequences” program
  • 117. Practical guide to pairwise alignment: the “BLAST 2 sequences” website • Go to http://www.ncbi.nlm.nih.gov/BLAST • Choose BLAST 2 sequences • In the program, [1] choose blastp (protein search) or blastn (for DNA) [2] paste in your accession numbers (or use FASTA format) [3] select optional parameters, such as the scoring matrix (BLOSUM62 is the default for proteins; try PAM250 for distantly related proteins) and the gap creation and extension penalties [4] click “align”
  • 120. Question #2: How can I use NCBI to compare a sequence to an entire database? BLAST!
  • 123. Weblems W4.1: Align the amino acid sequences of the acetylcholine receptor from human, rat, mouse and dog with ClustalW, T-Coffee, Dali and MSA. W4.2: Use BoxShade to create a Word file indicating the different conserved residues in colours. W4.3: Perform a local alignment using SIM, Lalign and Blast2 on the same sequences. W4.4: Do the different methods give different results? What default settings do they use? W4.5: How would you identify critical residues for catalytic activity?