FBW 
28-10-2014 
Wim Van Criekinge
Class on 4 November and NO class on 18 November
BPC 2014 ?
DataBase Searching 
Dynamic Programming 
Reloaded 
Database Searching 
Fasta 
Blast 
Statistics 
Practical Guide 
Extensions
PSI-Blast 
PHI-Blast 
Local Blast 
BLAT
Needleman-Wunsch-edu.pl
The Score Matrix
----------------
Seq1 (j)         1   2   3   4   5   6   7
Seq2 (i)     *   C   K   H   V   F   C   R
     *       0  -1  -2  -3  -4  -5  -6  -7
  1  C      -1   1   0  -1  -2  -3  -4  -5
  2  K      -2   0   2   1   0  -1  -2  -3
  3  K      -3  -1   1   1   0  -1  -2  -3
  4  C      -4  -2   0   0   0  -1   0  -1
  5  F      -5  -3  -1  -1  -1   1   0  -1
  6  C      -6  -4  -2  -2  -2   0   2   1
  7  K      -7  -5  -3  -3  -3  -1   1   1
  8  C      -8  -6  -4  -4  -4  -2   0   0
  9  V      -9  -7  -5  -5  -3  -3  -1  -1
Example for cell (2,2) = 2: a = matrix(1,1) = 1 (diagonal), b = matrix(1,2) = 0 (up), c = matrix(2,1) = 0 (left)
A: diag_score = matrix(i-1,j-1) + MATCH if (substr(seq1,j-1,1) eq substr(seq2,i-1,1)), otherwise + MISMATCH
B: up_score = matrix(i-1,j) + GAP
C: left_score = matrix(i,j-1) + GAP
matrix(i,j) = max(A, B, C); the values shown are consistent with MATCH = +1, MISMATCH = -1, GAP = -1
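The fill rule A/B/C above can be sketched in a few lines of Perl. This is a minimal illustration, not the course script itself: the function name nw_matrix and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen to be consistent with the matrix shown.

```perl
use strict;
use warnings;

# Assumed scoring scheme, consistent with the matrix on the slide.
my ($MATCH, $MISMATCH, $GAP) = (1, -1, -1);

# Fill the global alignment score matrix; rows follow seq2 (i), columns seq1 (j).
sub nw_matrix {
    my ($seq1, $seq2) = @_;
    my @m;
    $m[0][$_] = $_ * $GAP for 0 .. length($seq1);   # first row: leading gaps
    $m[$_][0] = $_ * $GAP for 0 .. length($seq2);   # first column: leading gaps
    for my $i (1 .. length($seq2)) {
        for my $j (1 .. length($seq1)) {
            my $same = substr($seq1, $j - 1, 1) eq substr($seq2, $i - 1, 1);
            my $diag = $m[$i - 1][$j - 1] + ($same ? $MATCH : $MISMATCH);  # A
            my $up   = $m[$i - 1][$j] + $GAP;                              # B
            my $left = $m[$i][$j - 1] + $GAP;                              # C
            my $best = $diag;
            $best = $up   if $up   > $best;
            $best = $left if $left > $best;
            $m[$i][$j] = $best;
        }
    }
    return \@m;
}

my $matrix = nw_matrix('CKHVFCR', 'CKKCFCKCV');
print join(' ', map { sprintf('%3d', $_) } @$_), "\n" for @$matrix;
```

The bottom-right cell of the printed matrix is the global alignment score of the two sequences.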
• The most practical and widely used
method in multiple sequence alignment
is the hierarchical extension of
pairwise alignment methods.
• The principle is that a multiple alignment
is achieved by successive application
of pairwise methods.
– First do all pairwise alignments (not just one
sequence with all others)
– Then combine pairwise alignments to generate
the overall alignment
Multiple Alignment Method
• Consider the task of searching
SWISSPROT with a query
sequence:
– say our query sequence is 362
amino acids long
– SWISSPROT release 38 contains
29,085,265 amino acids
– finding local alignments via
dynamic programming would
entail O(10^10) matrix operations
(362 × 29,085,265 ≈ 1.05 × 10^10)
• Given the size of databases, more
efficient methods are needed
Database Searching
Heuristic approaches to DP for database searching 
FASTA (Pearson 1995) 
Uses heuristics to avoid 
calculating the full dynamic 
programming matrix 
Speed up searches by an 
order of magnitude 
compared to full Smith- 
Waterman 
The statistical side of FASTA is 
still stronger than BLAST 
BLAST (Altschul 1990, 1997) 
Uses rapid word lookup 
methods to completely skip 
most of the database 
entries 
Extremely fast 
One order of magnitude 
faster than FASTA 
Two orders of magnitude 
faster than Smith- 
Waterman 
Almost as sensitive as FASTA
« Hit and extend heuristic» 
• Problem: Too many calculations 
“wasted” by comparing regions 
that have nothing in common 
• Initial insight: Regions that are 
similar between two sequences 
are likely to share short 
stretches that are identical 
• Basic method: Look for similar 
regions only near short 
stretches that match exactly 
FASTA
FASTA-Stages 
1. Find k-tups in the two sequences (k=1,2 for 
proteins, 4-6 for DNA sequences) 
2. Score and select top 10 scoring “local diagonals” 
3. Rescan top 10 regions, score with PAM250 
(proteins) or DNA scoring matrix. Trim off the 
ends of the regions to achieve highest scores. 
4. Try to join regions with gapped alignments. Join 
if similarity score is one standard deviation above 
average expected score 
5. After finding the best initial region, FASTA 
performs a global alignment of a 32 residue wide 
region centered on the best initial region, and 
uses the score as the optimized score.
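Stage 1 and the start of stage 2 can be sketched as a lookup table of k-tuples followed by a count of hits per diagonal. This is a simplified illustration only: the function name ktup_diagonals and the example sequences are made up, and real FASTA scores diagonals with a more elaborate scheme than a plain hit count.

```perl
use strict;
use warnings;

# Count shared k-tuples per diagonal.
# Diagonal = offset in the query minus offset in the subject.
sub ktup_diagonals {
    my ($query, $subject, $k) = @_;
    my %positions;                       # k-tuple => list of query offsets
    for my $i (0 .. length($query) - $k) {
        push @{ $positions{ substr($query, $i, $k) } }, $i;
    }
    my %diagonal;                        # diagonal => number of k-tuple hits
    for my $j (0 .. length($subject) - $k) {
        my $hits = $positions{ substr($subject, $j, $k) } or next;
        $diagonal{ $_ - $j }++ for @$hits;
    }
    return \%diagonal;
}

my $diag = ktup_diagonals('HARFYAAQIVL', 'VDMAAQIA', 2);
for my $d (sort { $diag->{$b} <=> $diag->{$a} or $a <=> $b } keys %$diag) {
    print "diagonal $d: $diag->{$d} shared 2-tuples\n";
}
```

A diagonal that collects many hits marks a region worth rescoring with a substitution matrix; sequence pairs that share no exact k-tuple produce an empty table and are never examined further.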
• Sensitivity: the ability of a 
program to identify weak but 
biologically significant sequence 
similarity. 
• Selectivity: the ability of a 
program to discriminate between 
true matches and matches 
occurring by chance alone. 
– A decrease in selectivity results in 
more false positives being reported. 
FastA
FastA (http://www.ebi.ac.uk/fasta33/) 
Blosum50 
default. 
Lower PAM 
higher blosum 
to detect close 
sequences 
Higher PAM and 
lower blosum 
to detect distant 
sequences 
Gap opening penalty 
-12, -16 by default 
for fasta with 
proteins and DNA, 
respectively 
Gap extension 
penalty -2, -4 by 
default for fasta 
with proteins and 
DNA, respectively 
The larger the 
word-length the 
less sensitive, but 
faster the search 
will be 
Max number of 
scores and 
alignments is 100
FastA Output 
Database 
code 
hyperlinked 
to the SRS 
database at 
EBI 
Accession 
number 
Description Length 
Initn, init1, opt, z-score 
calculated 
during run 
E score - 
expectation 
value, how 
many hits are 
expected to be 
found by 
chance with 
such a score 
while 
comparing 
this query to 
this database. 
E() does not 
represent the 
% similarity
FastA is a family of programs 
FastA, TFastA, FastX, FastY 
Query: DNA Protein 
Database:DNA Protein
FASTA can miss significant 
similarity since 
– For proteins, similar sequences do 
not have to share identical residues 
• Asp-Lys-Val is quite similar to 
• Glu-Arg-Ile yet it is missed even with 
ktuple size of 1 since no amino acid 
matches 
• Gly-Asp-Gly-Lys-Gly is quite similar 
to Gly-Glu-Gly-Arg-Gly but there is 
no match with ktuple size of 2 
FASTA problems
FASTA can miss significant 
similarity since 
– For nucleic acids, due to codon 
“wobble”, DNA sequences may 
look like XXyXXyXXy where X’s 
are conserved and y’s are not 
• GGuUCuACgAAg and 
GGcUCcACaAAa both code for 
the same peptide sequence (Gly-Ser- 
Thr-Lys) but they don’t match with 
ktuple size of 3 or higher 
FASTA problems
BLAST - Basic Local Alignment 
Search Tool
What does BLAST do? 
• Search a large target set of sequences... 
• …for hits to a query sequence... 
• …and return the alignments and scores from those 
hits... 
• Do it fast. 
Show me those sequences that deserve a second look. 
Blast programs were designed for fast database 
searching, with minimal sacrifice of sensitivity to 
distantly related sequences.
The big red button 
Do My Job 
It is dangerous to hide too much of the 
underlying complexity from the scientists.
• Approach: find segment pairs 
by first finding word pairs that 
score above a threshold, i.e., 
find word pairs of fixed length w 
with a score of at least T 
• Key concept “Neighborhood”: 
Seems similar to FASTA, but 
we are searching for words 
which score above T rather than 
that match exactly 
• Calculate neighborhood (T) for 
substrings of query (size W) 
Overview
Overview 
Compile a list of words which give a score 
above T when paired with the query sequence. 
– Example using PAM-120 for query sequence ACDE 
(w=4, T=17): 
A C D E 
A C D E = +3 +9 +5 +5 = 22 
• try all possibilities: 
A A A A = +3 -3 0 0 = 0 no good 
A A A C = +3 -3 0 -7 = -7 no good 
• ...too slow, try directed change
Overview 
A C D E 
A C D E = +3 +9 +5 +5 = 22 
• change 1st pos. to all acceptable substitutions 
g C D E = +1 +9 +5 +5 = 20 ok 
n C D E = +0 +9 +5 +5 = 19 ok 
i C D E = -1 +9 +5 +5 = 18 ok 
k C D E = -2 +9 +5 +5 = 17 ok 
• change 2nd pos.: can't - all alternatives negative 
and the other three positions only add up to 13 
• change 3rd pos. in combination with first position 
gCnE = 1 9 2 5 = 17 ok 
• continue - use recursion 
• For "best" values of w and T there are typically 
about 50 words in the list for every residue in the 
query sequence
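The "directed change with recursion" idea can be sketched as a depth-first search that abandons a prefix as soon as even the best possible completion cannot reach T. The scores below are only the few PAM-120 values quoted on these slides; every other pair falls back to a placeholder default that is NOT a real PAM-120 value, and the reduced alphabet and the function names are assumptions made for the sketch.

```perl
use strict;
use warnings;

# Partial scores taken from the example above; unlisted pairs use $DEFAULT,
# which is a placeholder and not a real PAM-120 value.
my %SCORE = (
    AA => 3, CC => 9, DD => 5, EE => 5,
    AG => 1, AN => 0, AI => -1, AK => -2,
    AC => -3, AD => 0, AE => 0, CE => -7, DN => 2,
);
my $DEFAULT  = -3;
my @ALPHABET = qw(A C D E G I K N);     # reduced alphabet for the sketch

sub pair_score {
    my ($x, $y) = @_;
    my $key = $x lt $y ? "$x$y" : "$y$x";   # the matrix is symmetric
    return exists $SCORE{$key} ? $SCORE{$key} : $DEFAULT;
}

sub neighborhood {
    my ($word, $T) = @_;
    my @query = split //, $word;
    # $best_rest[$p] = highest score still obtainable from position $p onward
    my @best_rest = (0) x (@query + 1);
    for my $p (reverse 0 .. $#query) {
        my $best = pair_score($query[$p], $query[$p]);
        for my $c (@ALPHABET) {
            my $s = pair_score($query[$p], $c);
            $best = $s if $s > $best;
        }
        $best_rest[$p] = $best + $best_rest[$p + 1];
    }
    my %found;
    my $extend;
    $extend = sub {
        my ($prefix, $score) = @_;
        my $p = length $prefix;
        if ($p == @query) {
            $found{$prefix} = $score;   # pruning guarantees $score >= $T here
            return;
        }
        for my $c (@ALPHABET) {
            my $s = $score + pair_score($query[$p], $c);
            # prune: skip prefixes that cannot reach $T any more
            $extend->($prefix . $c, $s) if $s + $best_rest[$p + 1] >= $T;
        }
    };
    $extend->('', 0);
    undef $extend;                      # break the closure's self-reference
    return \%found;
}

my $nh = neighborhood('ACDE', 17);
print "$_ $nh->{$_}\n" for sort { $nh->{$b} <=> $nh->{$a} or $a cmp $b } keys %$nh;
```

With these toy scores the second position can never change (any substitution for C costs at least 12), so whole subtrees are skipped after one comparison, and words such as GCNE (score 17) are still found.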
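Neighborhood.pl 
# Calculate neighborhood 
# Assumes the following are set up earlier in the script:
#   %S  scoring matrix, accessed as $S{residue1}{residue2}
#   @A  alphabet of residues
#   @W  the three residues of the query word
#   $T  neighborhood score threshold
my %NH; 
for (my $i = 0; $i < @A; $i++) { 
    my $s1 = $S{$W[0]}{$A[$i]}; 
    for (my $j = 0; $j < @A; $j++) { 
        my $s2 = $S{$W[1]}{$A[$j]}; 
        for (my $k = 0; $k < @A; $k++) { 
            my $s3 = $S{$W[2]}{$A[$k]}; 
            my $score = $s1 + $s2 + $s3; 
            my $word = "$A[$i]$A[$j]$A[$k]"; 
            next if $word =~ /[BZX*]/;            # skip ambiguity codes and stop 
            $NH{$word} = $score if $score >= $T; 
        } 
    } 
} 
# Output neighborhood, highest score first, then alphabetically 
foreach my $word (sort {$NH{$b} <=> $NH{$a} or $a cmp $b} keys %NH) { 
    print "$word $NH{$word}\n"; 
}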
BLOSUM62, word RGD, T = 11 
RGD 17 
KGD 14 
QGD 13 
RGE 13 
EGD 12 
HGD 12 
NGD 12 
RGN 12 
AGD 11 
MGD 11 
RAD 11 
RGQ 11 
RGS 11 
RND 11 
RSD 11 
SGD 11 
TGD 11 
PAM200, word RGD, T = 13 
RGD 18 
RGE 17 
RGN 16 
KGD 15 
RGQ 15 
KGE 14 
HGD 13 
KGN 13 
RAD 13 
RGA 13 
RGG 13 
RGH 13 
RGK 13 
RGS 13 
RGT 13 
RSD 13 
WGD 13
[Figure: score versus length of extension; labels: S, trim to max, indexed]
*Two non-overlapping HSP’s on a diagonal within distance A
The BLAST algorithm 
• Break the search sequence into words 
– W = 3 for proteins, W = 12 for DNA 
MCGPFILGTYC 
CGP 
MCG 
MCG, CGP, GPF, PFI, FIL, 
ILG, LGT, GTY, TYC 
• Include in the search all words that score 
above a certain value (T) for any search word 
MCG CGP 
MCT MGP … 
MCN CTP 
… … 
This list can be 
computed in linear 
time
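Breaking the query into overlapping words is a single pass over the sequence, which is why the list can be computed in linear time. A minimal sketch follows; the function name query_words is made up for illustration.

```perl
use strict;
use warnings;

# Return every overlapping word of length $w in $seq.
sub query_words {
    my ($seq, $w) = @_;
    return map { substr($seq, $_, $w) } 0 .. length($seq) - $w;
}

print join(', ', query_words('MCGPFILGTYC', 3)), "\n";
```

For the 11-residue example this yields the nine words listed on the slide, from MCG through TYC.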
The Blast Algorithm (2) 
• Search for the words in the database 
– Word locations can be precomputed and indexed 
– Searching for a short string in a long string 
• HSP (High Scoring Pair) = A match between 
a query word and the database 
• Find a “hit”: Two non-overlapping HSP’s on a 
diagonal within distance A 
• Extend the hit until the score falls below a 
threshold value, S
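The extension step can be sketched as follows. The stop rule used here, giving up once the running score has fallen more than a fixed amount below the best score seen and then trimming back to that best point, is an assumption about how the cutoff is applied; the +5/-4 scoring, the function names and the example sequences are also made up for illustration. Only extension to the right is shown.

```perl
use strict;
use warnings;

# Toy scoring, assumed for illustration: +5 for identity, -4 otherwise.
sub residue_score { my ($x, $y) = @_; return $x eq $y ? 5 : -4 }

# Extend an ungapped hit to the right of a seed of length $w that starts at
# $qpos in the query and $spos in the subject. Returns the best score and the
# length of the segment trimmed back to that best score.
sub extend_right {
    my ($query, $subject, $qpos, $spos, $w, $drop) = @_;
    my $score = 0;
    for my $o (0 .. $w - 1) {            # score the seed itself
        $score += residue_score(substr($query, $qpos + $o, 1),
                                substr($subject, $spos + $o, 1));
    }
    my ($best, $best_len) = ($score, $w);
    my $len = $w;
    while ($qpos + $len < length($query) and $spos + $len < length($subject)) {
        $score += residue_score(substr($query, $qpos + $len, 1),
                                substr($subject, $spos + $len, 1));
        $len++;
        ($best, $best_len) = ($score, $len) if $score > $best;
        last if $best - $score > $drop;  # fallen too far below the maximum
    }
    return ($best, $best_len);
}

my ($score, $length) = extend_right('ACGTACGTTT', 'ACGTACGAAA', 0, 0, 3, 8);
print "best score $score over $length residues\n";
```

The three trailing mismatches are scanned but not reported: the returned segment ends at the last position where the score was maximal.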
BLAST parameters 
• Lowering the neighborhood word threshold (T) 
allows more distantly related sequences to be found, 
at the expense of increased noise in the results set. 
• Choosing a value for w 
– small w: many matches to expand 
– big w: many words to be generated 
– w=4 is a good compromise 
• Lowering the segment extension cutoff (S) returns 
longer extensions for each hit. 
• Changing the minimum E-value changes the 
threshold for reporting a hit.
Critical parameters: T, W and scoring matrix 
• The proper value of T depends on both the 
values in the scoring matrix and the balance 
between speed and sensitivity 
• Higher values of T progressively remove 
more word hits and reduce the search space. 
• Word size (W) of 1 will produce more hits 
than a word size of 10. In general, if T is 
scaled uniformly with W, smaller word 
sizes increase sensitivity and decrease 
speed. 
• The interplay between W, T and the scoring 
matrix is critical, and choosing them wisely 
is the most effective way of controlling the 
speed and sensitivity of BLAST
Database Searching 
• How can we find a particular short sequence 
in a database of sequences (or one HUGE 
sequence)? 
• Problem is identical to local sequence 
alignment, but on a much larger scale. 
• We must also have some idea of the 
significance of a database hit. 
– Databases always return some kind of hit, how 
much attention should be paid to the result? 
• How can we determine how “unusual” a 
particular alignment score is?
Sentence 1: 
“These algorithms are trying to find the best way to match up 
two sequences” 
Sentence 2: 
“This does not mean that they will find anything profound” 
ALIGNMENT: 
THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES 
:: :.. . .. ...: : ::::.. :: . : ... 
THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND------ 
12 exact matches 
14 conservative substitutions 
Is this a good alignment? 
Significance
• A key to the utility of BLAST is 
the ability to calculate expected 
probabilities of occurrence of 
Maximum Segment Pairs 
(MSPs) given w and T 
• This allows BLAST to rank 
matching sequences in order of 
“significance” and to cut off 
listings at a user-specified 
probability 
Overview
Mathematical Basis of BLAST 
• Model matches as a sequence of coin tosses 
• Let p be the probability of a “head” 
– For a “fair” coin, p = 0.5 
• (Erdős-Rényi) If there are n throws, then the 
expected length R of the longest run of heads is 
$R = \log_{1/p}(n)$. 
• Example: Suppose n = 20 for a “fair” coin 
$R = \log_2(20) \approx 4.32$ 
• Trick is how to model DNA (or amino acid) 
sequence alignments as coin tosses.
Mathematical Basis of BLAST 
• To model random sequence alignments, replace a 
match with a “head” and mismatch with a “tail”. 
AATCAT 
ATTCAG 
HTHHHT 
• For DNA, the probability of a “head” is 1/4 
– What is it for amino acid sequences?
Mathematical Basis of BLAST 
• So, for one particular alignment, the Erdős-Rényi 
property can be applied 
• What about for all possible alignments? 
– Consider that sequences are being shifted back and forth, 
dot matrix plot 
• The expected length of the longest match is 
$R = \log_{1/p}(mn)$ 
where m and n are the lengths of the two sequences.
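The two formulas differ only in the number of "tosses", n for a single sequence of coin flips and mn for all positions of a comparison, so one small function covers both. A minimal sketch; the function name and the sequence lengths in the second call are made up for illustration.

```perl
use strict;
use warnings;

# Expected length of the longest run: R = log base 1/p of the number of trials.
sub expected_longest_run {
    my ($p, $trials) = @_;
    return log($trials) / log(1 / $p);
}

printf "%.2f\n", expected_longest_run(0.5, 20);           # fair coin, n = 20
printf "%.2f\n", expected_longest_run(0.25, 100 * 100);   # DNA, m = n = 100
```

The first call reproduces the coin example (about 4.32). The second uses p = 1/4 for DNA and shows how slowly the expected longest match grows with sequence length.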
Analytical derivation 
Erdős-Rényi 
… 
… 
… 
Karlin-Altschul
Karlin-Altschul Statistics 
$E = k\,m\,n\,e^{-\lambda S}$ 
This equation states that the number of alignments 
expected by chance (E) during the sequence 
database search is a function of the size of the 
search space (m*n), the normalized score (λS) 
and a minor constant (k mostly 0.1) 
E-Value grows linearly with the product of target and 
query sizes. Doubling target set size and doubling 
query length have the same effect on e-value
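The linear dependence on the search space is easy to check numerically. In this sketch k = 0.1 follows the slide, while λ = 0.3 and the raw score of 80 are assumed values chosen only for illustration; the function name is made up.

```perl
use strict;
use warnings;

# E = k * m * n * exp(-lambda * S)
sub expected_hits {
    my ($k, $lambda, $m, $n, $S) = @_;
    return $k * $m * $n * exp(-$lambda * $S);
}

my ($k, $lambda) = (0.1, 0.3);   # k as on the slide; lambda is an assumed value
my $e1 = expected_hits($k, $lambda, 362, 29_085_265, 80);
my $e2 = expected_hits($k, $lambda, 362, 2 * 29_085_265, 80);
printf "E = %.3g, after doubling the database: %.3g\n", $e1, $e2;
```

Doubling n (or m) doubles E, whereas raising S lowers E exponentially.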
Analytical derivation 
Erdős-Rényi 
… 
… 
… 
Karlin-Altschul 
$R = \log_{1/p}(mn)$ 
$E = k\,m\,n\,e^{-\lambda S}$
Scoring alignments 
• Score: S (~R) 
– $S = \sum_i M(q_i, t_i) - \sum \text{gaps}$ 
• Any alignment has a score 
• Any two sequences have a(t least one) 
optimal alignment
• For a particular scoring matrix and its 
associated gap initiation and extension costs 
one must calculate λ and k 
• Unfortunately (for gapped alignments), you 
can’t do this analytically and the values must 
be estimated empirically 
– The procedure involves aligning random 
sequences (Monte Carlo approach) with a specific 
scoring scheme and observing the alignment 
properties (scores, target frequencies and 
lengths)
Significance 
“Monte Carlo” Approach: 
• Compares result to randomized result, 
similarly to results generated by a roulette 
wheel at Monte Carlo 
• Typical procedure for alignments 
– Randomize sequence A 
– Align to sequence B 
– Repeat many times (hundreds) 
– Keep track of the optimal score 
• Histogram of scores …
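The procedure above can be sketched by combining a score-only global alignment with a shuffle. This is an illustration under assumed scoring (match +1, mismatch -1, gap -1), not the Needleman-wunsch-Monte-Carlo.pl script from the course; the function names and the number of trials are made up.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

my ($MATCH, $MISMATCH, $GAP) = (1, -1, -1);   # assumed scoring scheme

# Global alignment score only, keeping one previous row of the matrix.
sub nw_score {
    my ($s1, $s2) = @_;
    my @prev = map { $_ * $GAP } 0 .. length($s1);
    for my $i (1 .. length($s2)) {
        my @cur = ($i * $GAP);
        for my $j (1 .. length($s1)) {
            my $same = substr($s1, $j - 1, 1) eq substr($s2, $i - 1, 1);
            my $best = $prev[$j - 1] + ($same ? $MATCH : $MISMATCH);
            my $up   = $prev[$j] + $GAP;
            my $left = $cur[$j - 1] + $GAP;
            $best = $up   if $up   > $best;
            $best = $left if $left > $best;
            $cur[$j] = $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Align sequence B against $trials shuffled copies of sequence A.
sub shuffled_scores {
    my ($seq_a, $seq_b, $trials) = @_;
    return map { nw_score(join('', shuffle(split //, $seq_a)), $seq_b) } 1 .. $trials;
}

my ($seq_a, $seq_b) = ('CKHVFCR', 'CKKCFCKCV');
my $real = nw_score($seq_a, $seq_b);
my %histogram;
$histogram{$_}++ for shuffled_scores($seq_a, $seq_b, 200);
printf "%4d %s\n", $_, '*' x $histogram{$_} for sort { $a <=> $b } keys %histogram;
my $as_good = 0;
$as_good += $histogram{$_} for grep { $_ >= $real } keys %histogram;
printf "real score %d; %d of 200 shuffles score at least as well\n", $real, $as_good;
```

The fraction of shuffles that score at least as well as the real alignment is a rough empirical estimate of how unusual the real score is. Shuffling preserves composition, so the comparison controls for compositional bias.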
Assessing significance requires a distribution 
• I have a pumpkin of diameter 1 m. Is that unusual? 
[Figure: histogram of frequency against diameter (m)]
Normal Distribution does NOT Fit Alignment Scores !! 
• In seeking optimal Alignments between two 
sequences, one desires those that have the highest 
score - i.e. one is seeking a distribution of maxima 
• In seeking optimal Matches between an Input 
Sequence and Sequence Entries in a Database, one 
again desires the matches that have the highest 
score, and these are obtained via examination of the 
distribution of such scores for the entries in the 
database - this is again a distribution of maxima. 
“A Normal Distribution is a distribution of Sums of 
independent variables rather than a sum of their 
Maxima.“ 
Significance
Comparing distributions 
Gaussian: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Extreme Value: $f(x) = \frac{1}{\sigma}\, e^{-\frac{x-\mu}{\sigma}}\, e^{-e^{-\frac{x-\mu}{\sigma}}}$
(μ: location, σ: scale)
Alignment scores follow extreme value distributions 
Alignment of unrelated/random sequences result in scores 
following an extreme value distribution 
$P = 1 - e^{-E}$, equivalently $E = -\ln(1 - P)$ 
$P(x \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$ 
m, n: sequence lengths. 
k, λ: free parameters. 
This can be shown analytically for ungapped alignments and has 
been found empirically to also hold for gapped alignments under 
commonly used conditions.
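The conversion between E and P is a one-liner in each direction. A minimal sketch; the function names are made up for illustration.

```perl
use strict;
use warnings;

# P = 1 - exp(-E): probability of at least one chance hit, given E expected hits.
sub p_from_e { my ($E) = @_; return 1 - exp(-$E); }

# E = -ln(1 - P): the inverse relation.
sub e_from_p { my ($P) = @_; return -log(1 - $P); }

printf "E = %-6g P = %.4f\n", $_, p_from_e($_) for (0.001, 0.1, 1, 10);
```

For small values E and P are nearly identical; they diverge as E approaches 1, because P can never exceed 1 while E is unbounded.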
Alignment scores follow extreme value distributions 
Alignment algorithms will always produce alignments, 
regardless of whether they are meaningful or not 
=> important to have way of selecting significant alignments 
from large set of database hits. 
Solution: fit distribution of scores from database search to 
extreme value distribution; determine p-value of hit from this 
fitted distribution. 
Example: scores fitted to 
extreme value distribution. 
99.9% of this distribution is 
located below score=112 
=> hit with score = 112 has a 
p-value of 0.1%
BLAST uses precomputed extreme 
value distributions to calculate E-values 
from alignment scores 
For this reason BLAST only allows 
certain combinations of substitution 
matrices and gap penalties 
This also means that the fit is based on 
a different data set than the one you 
are working on 
Significance 
A word of caution: BLAST tends to overestimate the significance of its 
matches 
E-values from BLAST are fine for identifying sure hits 
One should be careful using BLAST’s E-values to judge if a marginal hit 
can be trusted (e.g., you may want to use E-values of 10^-4 to 10^-5).
Determining P-values 
• If we can estimate the location and scale 
parameters of the extreme value distribution, then we can 
determine, for a given match score x, the 
probability that a random match with score x 
or greater would have occurred in the 
database. 
• For sequence matches, a scoring system and 
database can be parameterized by two 
parameters, k and λ, related to that location and scale. 
– It would be nice if we could compare hit 
significance without regard to the scoring system 
used!
Bit Scores 
• The expected number of hits with score ≥ S 
is: 
$E = k\,m\,n\,e^{-\lambda S}$ 
– Where m and n are the sequence lengths 
• Normalize the raw score using: 
$S' = \frac{\lambda S - \ln k}{\ln 2}$ 
• Obtains a “bit score” S’, with a standard set of 
units. 
• The new E-value is: 
$E = m\,n\,2^{-S'}$ 
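That the bit-score form gives the same E-value as the raw-score form can be checked numerically. In this sketch the values of k, λ, the raw score and the sequence lengths are assumed for illustration, and the function names are made up.

```perl
use strict;
use warnings;

# Bit score: S' = (lambda * S - ln k) / ln 2
sub bit_score {
    my ($S, $lambda, $k) = @_;
    return ($lambda * $S - log($k)) / log(2);
}

# E-value from a bit score: E = m * n * 2^(-S')
sub e_from_bits {
    my ($bits, $m, $n) = @_;
    return $m * $n * 2 ** (-$bits);
}

my ($k, $lambda, $S, $m, $n) = (0.1, 0.3, 80, 362, 29_085_265);   # assumed values
my $bits = bit_score($S, $lambda, $k);
printf "bit score %.1f\n", $bits;
printf "E from bit score %.4g\n", e_from_bits($bits, $m, $n);
printf "E from raw score %.4g\n", $k * $m * $n * exp(-$lambda * $S);
```

The last two lines print the same value: k and λ are absorbed into the bit score, so hits scored under different matrices can be compared using only m, n and S'.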
-74 
-73 
-72 * 
-71 ***** 
-70 ******* 
-69 ********** 
-68 *************** 
-67 ************************* 
-66 ************************* 
-65 ************************************ 
-64 ***************************************** 
-63 ************************************************************ 
-61 ************************ 
-60 ***************************** 
-59 ******************* 
-58 ************** 
-57 ********* 
-56 ******** 
-55 ***** 
-54 **** 
-53 * 
-52 * 
-51 * 
-50 
-49 
Needleman-wunsch-Monte-Carlo.pl 
(Average around -64 !)
• The histogram plots the frequency of 
observed scores (equals signs) 
• expected curve (asterisks) according 
to the extreme value distribution 
–the theoretical curve should be 
similar to the observed results 
• deviations indicate that the fitting 
parameters are wrong 
–gap penalties that are too weak 
–compositional biases 
FastA Output
< 20 222 0 :* 
22 30 0 :* 
24 18 1 :* 
26 18 15 :* 
28 46 159 :* 
30 207 963 :* 
32 1016 3724 := * 
34 4596 10099 :==== * 
36 9835 20741 :========= * 
38 23408 34278 :==================== * 
40 41534 47814 :=================================== * 
42 53471 58447 :============================================ * 
44 73080 64473 :====================================================*======= 
46 70283 65667 :=====================================================*==== 
48 64918 62869 :===================================================*== 
50 65930 57368 :===============================================*======= 
52 47425 50436 :======================================= * 
54 36788 43081 :=============================== * 
56 33156 35986 :============================ * 
58 26422 29544 :====================== * 
60 21578 23932 :================== * 
62 19321 19187 :===============* 
64 15988 15259 :============*= 
66 14293 12060 :=========*== 
68 11679 9486 :=======*== 
70 10135 7434 :======*== 
FastA Output
72 8957 5809 :====*=== 
74 7728 4529 :===*=== 
76 6176 3525 :==*=== 
78 5363 2740 :==*== 
80 4434 2128 :=*== 
82 3823 1628 :=*== 
84 3231 1289 :=*= 
86 2474 998 :*== 
88 2197 772 :*= 
90 1716 597 :*= 
92 1430 462 :*= :===============*======================== 
94 1250 358 :*= :============*=========================== 
96 954 277 :* :=========*======================= 
98 756 214 :* :=======*=================== 
100 678 166 :* :=====*================== 
102 580 128 :* :====*=============== 
104 476 99 :* :===*============= 
106 367 77 :* :==*========== 
108 309 59 :* :==*======== 
110 287 46 :* :=*======== 
112 206 36 :* :=*====== 
114 161 28 :* :*===== 
116 144 21 :* :*==== 
118 127 16 :* :*==== 
>120 886 13 :* :*============================== 
Related 
FastA Output
• A summary of the statistics and of the 
program parameters follows the histogram. 
– An important number in this summary is the 
Kolmogorov-Smirnov statistic, which indicates 
how well the actual data fit the theoretical 
statistical distribution. The lower this value, the 
better the fit, and the more reliable the statistical 
estimates. 
– In general, a Kolmogorov-Smirnov statistic under 
0.1 indicates a good fit with the theoretical model. 
If the statistic is higher than 0.2, the statistics may 
not be valid, and it is recommended to repeat the 
search, using more stringent (more negative) 
values for the gap penalty parameters. 
FastA Output
Statistics summary 
• Optimal local alignment scores for pairs of random 
amino acid sequences of the same length follow an 
extreme-value distribution. For any score S, the 
probability of observing a score >= S is given by the 
Karlin-Altschul statistic: $P(\text{score} \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$ 
• k and λ are parameters related to the position 
of the maximum and the width of the distribution. 
• Note the long tail at the right. This means that a 
score several standard deviations above the mean 
has a higher probability of arising by chance (that is, it 
is less significant) than if the scores followed a 
normal distribution.
P-values 
• Many programs report P = the probability that the 
alignment is no better than random. The relationship 
between Z and P depends on the distribution of the 
scores from the control population, which do NOT 
follow the normal distribution 
– P <= 10^-100 (exact match) 
– P in range 10^-100 to 10^-50 (sequences nearly identical, e.g. 
alleles or SNPs) 
– P in range 10^-50 to 10^-10 (closely related sequences, 
homology certain) 
– P in range 10^-5 to 10^-1 (usually distant relatives) 
– P > 10^-1 (match probably insignificant)
E 
• For database searches, most programs report E-values. The 
E-value of an alignment is the expected number of sequences 
that give the same Z-score or better if the database is probed 
with a random sequence. E is found by multiplying the value 
of P by the size of the database probed. Note that E but not P 
depends on the size of the database. Values of P are 
between 0 and 1. Values of E are between 0 and the number 
of sequences in the database searched: 
– E<=0.02: sequences probably homologous 
– E between 0.02 and 1: homology cannot be ruled out 
– E>1: you would expect a match this good just by chance
BLAST is actually a family of programs: 
• BLASTN - Nucleotide query searching a 
nucleotide database. 
• BLASTP - Protein query searching a 
protein database. 
• BLASTX - Translated nucleotide query 
sequence (6 frames) searching a protein 
database. 
• TBLASTN - Protein query searching a 
translated nucleotide (6 frames) database. 
• TBLASTX - Translated nucleotide query (6 
frames) searching a translated nucleotide 
(6 frames) database. 
Blast
• Be aware of what options you 
have selected when using 
BLAST, or FASTA 
implementations. 
• Treat BLAST searches as 
scientific experiments 
• So you should try your searches 
with the filters on and off to see 
whether it makes any difference 
to the output 
Tips
Tips: Low-complexity and Gapped Blast Algorithm 
• The common Web-based BLAST implementations often have 
default settings that will affect the outcome 
of your searches. By default all NCBI BLAST 
implementations filter out biased sequence 
composition from your query sequence (e.g. 
signal peptide and transmembrane 
sequences - beware!). 
• The SEG program has been implemented 
as part of the blast routine in order to mask 
low-complexity regions 
• Low-complexity regions are denoted by 
strings of Xs in the query sequence
• The sequence databases contain a 
wealth of information. They also 
contain a lot of errors. Contaminants 
… 
• Annotation errors, frameshifts that 
may result in erroneous conceptual 
translations. 
• Hypothetical proteins ? 
• In the words of Fox Mulder, "Trust 
no one." 
Tips
• Once you get a match to things 
in the databases, check whether 
the match is to the entire protein, 
or to a domain. Don't 
immediately assume that a 
match means that your protein 
carries out the same function 
(see above). Compare your 
protein and the match protein(s) 
along their entire lengths before 
making this assumption. 
Tips
• Domain matches can also cause problems 
by hiding other informative matches. For 
instance if your protein contains a common 
domain you'll get significant matches to 
every homologous sequence in the 
database. BLAST only reports back a 
limited number of matches, ordered by P 
value. 
• If this list consists only of matches to the 
same domain, cut this bit out of your query 
sequence and do the BLAST search again 
with the edited sequence (e.g. NHR). 
Tips
• Do controls wherever possible. In 
particular when you use a particular 
search software for the first time. 
• Suitable positive controls would be protein 
sequences known to have distant 
homologues in the databases to check 
how good the software is at detecting such 
matches. 
• Negative controls can be employed to 
make sure the compositional bias of the 
sequence isn't giving you false positives. 
Shuffle your query sequence and see what 
difference this makes to the matches that 
are returned. A real match should be lost 
upon shuffling of your sequence. 
Tips
• Perform Controls 
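#!/usr/bin/perl -w 
# Shuffle the residues of a FASTA sequence read from a file or STDIN 
use strict; 
my ($def, @seq) = <>;                    # first line is the FASTA definition line 
print $def; 
chomp @seq; 
@seq = split(//, join("", @seq));        # one array element per residue 
my $count = 0; 
while (@seq) { 
    my $index = rand(@seq);              # random position (splice truncates to an integer) 
    my $base = splice(@seq, $index, 1);  # remove that residue so it is used only once 
    print $base; 
    print "\n" if ++$count % 60 == 0;    # wrap output at 60 residues per line 
} 
print "\n" unless $count % 60 == 0;      # final newline if the last line is incomplete 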
Tips
• Read the footer first 
• View results graphically 
• Parse Blasts with Bioperl 
Tips
• BLAST's major advantage is its speed. 
– 2-3 minutes for BLAST versus several hours 
for a sensitive FastA search of the whole of 
GenBank. 
• When both programs use their default 
settings, BLAST is usually more sensitive 
than FastA for detecting protein sequence 
similarity. 
– Since it doesn't require a perfect sequence 
match in the first stage of the search. 
FastA vs. Blast
Weakness of BLAST: 
– The long word size it uses in the initial stage of DNA 
sequence similarity searches was chosen for speed, and not 
sensitivity. 
– For a thorough DNA similarity search, FastA is the 
program of choice, especially when run with a lowered 
KTup value. 
– FastA is also better suited to the specialised task of 
detecting genomic DNA regions using a cDNA query 
sequence, because it allows the use of a gap extension 
penalty of 0. BLAST, which only creates ungapped 
alignments, will usually detect only the longest exon, or fail 
altogether. 
• In general, a BLAST search using the default 
parameters should be the first step in a database 
similarity search strategy. In many cases, this is all 
that may be required to yield all the information 
needed, in a very short time. 
FastA vs. Blast
1. Old (ungapped) BLAST 
2. New BLAST (allows gaps) 
3. Profile -> PSI Blast - Position Specific 
Iterated 
– Strategy: Multiple alignment of the hits 
Calculates a position-specific score matrix 
Searches with this matrix 
– In many cases is much more sensitive to weak but 
biologically relevant sequence similarities 
– PSSM !!! 
PSI-Blast
• Patterns of conservation from the alignment of 
related sequences can aid the recognition of 
distant similarities. 
– These patterns have been variously called motifs, 
profiles, position-specific score matrices, and 
Hidden Markov Models. 
For each position in the derived pattern, every 
amino acid is assigned a score. 
(1) Highly conserved residue at a position: that 
residue is assigned a high positive score, and 
others are assigned high negative scores. 
(2) Weakly conserved positions: all residues receive 
scores near zero. 
(3) Position-specific scores can also be assigned to 
potential insertions and deletions. 
PSI-Blast
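How conservation turns into position-specific scores can be sketched with log-odds values computed per alignment column. The uniform background frequency (1/20), the pseudocount of 1, the function names and the four aligned peptides are all simplifying assumptions made for this illustration; PSI-BLAST itself uses sequence weighting and a more elaborate pseudocount scheme.

```perl
use strict;
use warnings;

my @AMINO = split //, 'ACDEFGHIKLMNPQRSTVWY';

# Build per-column log-odds scores (in bits) from an ungapped alignment of
# equal-length sequences.
sub build_pssm {
    my ($pseudo, @aligned) = @_;
    my @pssm;
    for my $col (0 .. length($aligned[0]) - 1) {
        my %count = map { $_ => 0 } @AMINO;
        $count{ substr($_, $col, 1) }++ for @aligned;
        my $total = @aligned + $pseudo * @AMINO;
        for my $aa (@AMINO) {
            my $freq = ($count{$aa} + $pseudo) / $total;
            $pssm[$col]{$aa} = log($freq / (1 / @AMINO)) / log(2);
        }
    }
    return \@pssm;
}

# Score a segment of the same length as the profile.
sub pssm_score {
    my ($pssm, $segment) = @_;
    my $score = 0;
    $score += $pssm->[$_]{ substr($segment, $_, 1) } for 0 .. $#$pssm;
    return $score;
}

my $pssm = build_pssm(1, qw(RGDS RGDT KGDS RGES));   # hypothetical alignment
printf "%s %.2f\n", $_, pssm_score($pssm, $_) for qw(RGDS KGET AAAA);
```

The fully conserved G in column 2 receives the highest score in the profile, while residues never observed at a position score below zero. With only four sequences and a pseudocount of 1 the negative scores stay mild; more aligned sequences sharpen the contrast.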
Pattern 
• a set of alternative 
sequences, using 
“regular expressions” 
• Prosite 
(http://www.expasy.org/prosite/)
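A PROSITE-style pattern maps almost directly onto a Perl regular expression. The converter below is a minimal sketch covering the common elements (x, [..], {..}, repetition counts, < and > anchors); it does not handle every corner of the syntax, such as anchors inside square brackets. The function name and the example sequence are made up for illustration.

```perl
use strict;
use warnings;

# Convert a PROSITE-style pattern into a Perl regular expression string.
sub prosite_to_regex {
    my ($pattern) = @_;
    my $re = $pattern;
    $re =~ s/\.$//;                       # optional trailing period
    $re =~ s/-//g;                        # element separators
    $re =~ s/\{([A-Z]+)\}/[^$1]/g;        # {AB}   -> [^AB]  any residue except
    $re =~ s/\((\d+(?:,\d+)?)\)/{$1}/g;   # x(2,4) -> x{2,4} repetition
    $re =~ s/x/./g;                       # x      -> .      any residue
    $re =~ s/^</^/;                       # <      -> ^      N-terminal anchor
    if ($re =~ s/>$//) { $re .= '$' }     # >      -> $      C-terminal anchor
    return $re;
}

# A zinc-finger-style pattern written in PROSITE syntax.
my $re = prosite_to_regex('C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H');
print "$re\n";
print "match\n" if 'AACAACAAALAAAAAAAAHAAAHAA' =~ /$re/;
```

The same approach lets a pattern be run over a local sequence file with nothing more than Perl's regex engine.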
PSSM (Position Specific Scoring Matrix)
• The power of profile methods can be 
further enhanced through iteration of 
the search procedure. 
– After a profile is run against a database, 
new similar sequences can be detected. A 
new multiple alignment, which includes 
these sequences, can be constructed, a 
new profile abstracted, and a new 
database search performed. 
– The procedure can be iterated as often as 
desired or until convergence, when no new 
statistically significant sequences are 
detected. 
PSI-Blast
(1) PSI-BLAST takes as an input a single protein sequence 
and compares it to a protein database, using the gapped 
BLAST program. 
(2) The program constructs a multiple alignment, and then a 
profile, from any significant local alignments found. 
The original query sequence serves as a template for the multiple 
alignment and profile, whose lengths are identical to that of the 
query. Different numbers of sequences can be aligned in different 
template positions. 
(3) The profile is compared to the protein database, again 
seeking local alignments using the BLAST algorithm. 
(4) PSI-BLAST estimates the statistical significance of the local 
alignments found. 
Because profile substitution scores are constructed to a fixed 
scale, and gap scores remain independent of position, the 
statistical theory and parameters for gapped BLAST alignments 
remain applicable to profile alignments. 
(5) Finally, PSI-BLAST iterates, by returning to step (2), a 
specified number of times or until convergence. 
PSI-Blast
PSI-BLAST 
PSSM 
PSSM 
From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
PSI-BLAST pitfalls 
• Avoid sequences that are too close: overfitting! 
• Can include false homologs! Therefore check 
the matches carefully: include or exclude 
sequences based on biological knowledge. 
• The E-value reflects the significance of the 
match to the previous training set, not to the 
original sequence! 
• Choose your query sequence carefully. 
• Try the reverse experiment to confirm.
Reduce overfitting risk by Cobbler 
• A single sequence is selected 
from a set of blocks and enriched 
by replacing the conserved 
regions delineated by the blocks 
by consensus residues derived 
from the blocks. 
• Embedding consensus residues 
improves performance 
• S. Henikoff and J.G. Henikoff; 
Protein Science (1997) 6:698-705.
PHI-Blast Local Blast 
(Pattern-Hit Initiated BLAST)
From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html 
PHI-Blast Local Blast
Installing Blast Locally 
• 2 flavors: NCBI/WuBlast 
• Executables: 
– ftp://ftp.ncbi.nih.gov/blast/executables/ 
• Database: 
– ftp://ftp.ncbi.nih.gov/blast/db/ 
• Formatdb 
– formatdb -i ecoli.nt -p F 
– formatdb -i ecoli.protein -p T 
• For options: blastall - 
– blastall -p blastp -i query -d database -o output
Main database: BLAT 
• BLAT: BLAST-Like Alignment Tool 
• Aligns the input sequence to the 
Human Genome 
• Connected to several databases, like: 
– mRNAs - GenScan 
– ESTs - TwinScan 
– RepeatMasker - UniGene 
– RefSeq - CpG Islands
BLAT Human Genome Browser
BLAT method 
• Align sequence with BLAT, get alignment 
info 
• Per BLAT hit, pick up additional info from 
connected databases: 
– mRNAs 
– ESTs 
– RepeatMasker 
– CpG Islands 
– RefSeq Genes
Weblems 
W5.1: Submit the amino acid sequence of papaya 
papain to a BLAST (gapped and ungapped) and to a 
PSI-BLAST search. What are the main differences in the 
results? 
W5.2: Is there a relationship between Klebsiella 
aerogenes urease, Pseudomonas diminuta 
phosphotriesterase and mouse adenosine deaminase 
? Also use DALI, ClustalW and T-coffee. 
W5.3: Yeast two-hybrid typically yields DNA 
sequences. How would you find the corresponding 
protein ? 
W5.4: When and why would you use tblastn ? 
W5.5: How would you search a database if you want to 
restrict the search space to those entries having a 
secretion signal consisting of 4 consecutive (N-terminal) 
basic residues ?
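One way to picture the restriction in W5.5 is as a pattern filter applied to the database entries before or alongside the search. The Python sketch below is only an illustration under stated assumptions: "basic" is taken to mean K, R or H, "N-terminal" is taken to mean the first `window` residues (10 by default, an arbitrary choice), and the function names and example sequences are hypothetical.

```python
import re

# Four consecutive basic residues (Lys, Arg, His).
BASIC_RUN = re.compile(r"[KRH]{4}")


def has_nterm_basic_run(seq, window=10):
    """True if the first `window` residues contain four consecutive basic residues."""
    return BASIC_RUN.search(seq[:window].upper()) is not None


def filter_entries(entries, window=10):
    """Keep only the entries (name -> sequence) that carry the N-terminal motif."""
    return {name: seq for name, seq in entries.items()
            if has_nterm_basic_run(seq, window)}


# Hypothetical example sequences.
entries = {
    "with_motif": "MKRKRLLAVAGLS",
    "motif_too_late": "MAAAAAAAAAAAAKRKR",
    "no_motif": "MALWTELLPA",
}
selected = filter_entries(entries)
print(sorted(selected))
```

The retained entries could then be written to a FASTA file and formatted as a smaller search database.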

Bioinformatics t5-databasesearching v2014

  • 1.
  • 2. FBW 28-10-2014 Wim Van Criekinge
  • 3. Wel les op 4 november en GEEN les op 18 november
  • 5. DataBase Searching Dynamic Programming Reloaded Database Searching Fasta Blast Statistics Practical Guide Extentions PSI-Blast PHI-Blast Local Blast BLAT
  • 6. Needleman-Wunsch-edu.pl The Score Matrix ---------------- Seq1(j)1 2 3 4 5 6 7 Seq2 * C K H V F C R (i) * 0 -1 -2 -3 -4 -5 -6 -7 1 C -1 1 a 0 -1 -2 -3 -4 -5 2 K -2 0 c 2 b 1 0 -1 -2 -3 3 K -3 -1 1 1 0 -1 -2 -3 4 C -4 -2 0 0 0 -1 0 -1 5 F -5 -3 -1 -1 -1 1 0 -1 6 C -6 -4 -2 -2 -2 0 2 1 7 K -7 -5 -3 -3 -3 -1 1 1 8 C -8 -6 -4 -4 -4 -2 0 0 9 V -9 -7 -5 -5 -3 -3 -1 -1 A: matrix(i,j) = matrix(i-1,j-1) + (MIS)MATCH if (substr(seq1,j-1,1) eq substr(seq2,i-1,1) B: up_score = matrix(i-1,j) + GAP C: left_score = matrix(i,j-1) + GAP
  • 7. • The most practical and widely used method in multiple sequence alignment is the hierarchical extensions of pairwise alignment methods. • The principal is that multiple alignments is achieved by successive application of pairwise methods. – First do all pairwise alignments (not just one sequence with all others) – Then combine pairwise alignments to generate overall alignment Multiple Alignment Method
  • 8. • Consider the task of searching SWISSPROT against a query sequence: – say our query sequence is 362 amino acids long – SWISSPROT release 38 contains 29,085,265 amino acids – finding local alignments via dynamic programming would entail O(1010) matrix operations • Given size of databases, more efficient methods needed Database Searching
  • 9. Heuristic approaches to DP for database searching FASTA (Pearson 1995) Uses heuristics to avoid calculating the full dynamic programming matrix Speed up searches by an order of magnitude compared to full Smith- Waterman The statistical side of FASTA is still stronger than BLAST BLAST (Altschul 1990, 1997) Uses rapid word lookup methods to completely skip most of the database entries Extremely fast One order of magnitude faster than FASTA Two orders of magnitude faster than Smith- Waterman Almost as sensitive as FASTA
  • 10. « Hit and extend heuristic» • Problem: Too many calculations “wasted” by comparing regions that have nothing in common • Initial insight: Regions that are similar between two sequences are likely to share short stretches that are identical • Basic method: Look for similar regions only near short stretches that match exactly FASTA
  • 11. FASTA-Stages 1. Find k-tups in the two sequences (k=1,2 for proteins, 4-6 for DNA sequences) 2. Score and select top 10 scoring “local diagonals” 3. Rescan top 10 regions, score with PAM250 (proteins) or DNA scoring matrix. Trim off the ends of the regions to achieve highest scores. 4. Try to join regions with gapped alignments. Join if similarity score is one standard deviation above average expected score 5. After finding the best initial region, FASTA performs a global alignment of a 32 residue wide region centered on the best initial region, and uses the score as the optimized score.
  • 12.
  • 13.
  • 14. • Sensitivity: the ability of a program to identify weak but biologically significant sequence similarity. • Selectivity: the ability of a program to discriminate between true matches and matches occurring by chance alone. – A decrease in selectivity results in more false positives being reported. FastA
  • 15. FastA (http://www.ebi.ac.uk/fasta33/) Blosum50 default. Lower PAM higher blosum to detect close sequences Higher PAM and lower blosum to detect distant sequences Gap opening penalty -12, -16 by default for fasta with proteins and DNA, respectively Gap extension penalty -2, -4 by default for fasta with proteins and DNA, respectively The larger the word-length the less sensitive, but faster the search will be Max number of scores and alignments is 100
  • 16. FastA Output Database code hyperlinked to the SRS database at EBI Accession number Description Length Initn, init1, opt, z-score calculated during run E score - expectation value, how many hits are expected to be found by chance with such a score while comparing this query to this database. E() does not represent the % similarity
  • 17. FastA is a family of programs FastA, TFastA, FastX, FastY Query: DNA Protein Database:DNA Protein
  • 18. FASTA can miss significant similarity since – For proteins, similar sequences do not have to share identical residues • Asp-Lys-Val is quite similar to • Glu-Arg-Ile yet it is missed even with ktuple size of 1 since no amino acid matches • Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly but there is no match with ktuple size of 2 FASTA problems
  • 19. FASTA can miss significant similarity since – For nucleic acids, due to codon “wobble”, DNA sequences may look like XXyXXyXXy where X’s are conserved and y’s are not • GGuUCuACgAAg and GGcUCcACaAAA both code for the same peptide sequence (Gly-Ser- Thr-Lys) but they don’t match with ktuple size of 3 or higher FASTA problems
  • 20. DataBase Searching Dynamic Programming Reloaded Database Searching Fasta Blast Statistics Practical Guide Extentions PSI-Blast PHI-Blast Local Blast Blast
  • 21. BLAST - Basic Local Alignment Search Tool
  • 22. What does BLAST do? • Search a large target set of sequences... • …for hits to a query sequence... • …and return the alignments and scores from those hits... • Do it fast. Show me those sequences that deserve a second look. Blast programs were designed for fast database searching, with minimal sacrifice of sensitivity to distant related sequences.
  • 23. The big red button Do My Job It is dangerous to hide too much of the underlying complexity from the scientists.
  • 24. • Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T • Key concept “Neigborhood”: Seems similar to FASTA, but we are searching for words which score above T rather than that match exactly • Calculate neigborhood (T) for substrings of query (size W) Overview
  • 25. Overview Compile a list of words which give a score above T when paired with the query sequence. – Example using PAM-120 for query sequence ACDE (w=4, T=17): A C D E A C D E = +3 +9 +5 +5 = 22 • try all possibilities: A A A A = +3 -3 0 0 = 0 no good A A A C = +3 -3 0 -7 = -7 no good • ...too slow, try directed change
  • 26. Overview A C D E A C D E = +3 +9 +5 +5 = 22 • change 1st pos. to all acceptable substitutions g C D E = +1 +9 +5 +5 = 20 ok n C D E = +0 +9 +5 +5 = 19 ok I C D E = -1 +9 +5 +5 = 18 ok k C D E = -2 +9 +5 +5 = 17 ok • change 2nd pos.: can't - all alternatives negative and the other three positions only add up to 13 • change 3rd pos. in combination with first position gCnE = 1 9 2 5 = 17 ok • continue - use recursion • For "best" values of w and T there are typically about 50 words in the list for every residue in the query sequence
  • 27. Neighborhood.pl # Calculate neighborhood my %NH; for (my $i = 0; $i < @A; $i++) { my $s1 = $S{$W[0]}{$A[$i]}; for (my $j = 0; $j < @A; $j++) { my $s2 = $S{$W[1]}{$A[$j]}; for (my $k = 0; $k < @A; $k++) { my $s3 = $S{$W[2]}{$A[$k]}; my $score = $s1 + $s2 + $s3; my $word = "$A[$i]$A[$j]$A[$k]"; next if $word =~ /[BZX*]/; $NH{$word} = $score if $score >= $T; } } } # Output neighborhood foreach my $word (sort {$NH{$b} <=> $NH{$a} or $a cmp $b} keys %NH) { print "$word $NH{$word}n"; }
  • 28. BLOSUM62 RGD 11 RGD 17 KGD 14 QGD 13 RGE 13 EGD 12 HGD 12 NGD 12 RGN 12 AGD 11 MGD 11 RAD 11 RGQ 11 RGS 11 RND 11 RSD 11 SGD 11 TGD 11 PAM200 RGD 13 RGD 18 RGE 17 RGN 16 KGD 15 RGQ 15 KGE 14 HGD 13 KGN 13 RAD 13 RGA 13 RGG 13 RGH 13 RGK 13 RGS 13 RGT 13 RSD 13 WGD 13
  • 29.
  • 30. S Length of extension Score Trim to max indexed * *Two non-overlapping HSP’s on a diagonal within distance A
  • 31. S Length of extension Score Trim to max indexed * *Two non-overlapping HSP’s on a diagonal within distance A
  • 32. The BLAST algorithm • Break the search sequence into words – W = 3 for proteins, W = 12 for DNA MCGPFILGTYC CGP MCG MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC • Include in the search all words that score above a certain value (T) for any search word MCG CGP MCT MGP … MCN CTP … … This list can be computed in linear time
  • 33. The Blast Algorithm (2) • Search for the words in the database – Word locations can be precomputed and indexed – Searching for a short string in a long string • HSP (High Scoring Pair) = A match between a query word and the database • Find a “hit”: Two non-overlapping HSP’s on a diagonal within distance A • Extend the hit until the score falls below a threshold value, S
  • 35. BLAST parameters • Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the result set. • Choosing a value for w – small w: many matches to expand – big w: many words to be generated – w=4 is a good compromise • Lowering the segment extension cutoff (S) returns longer extensions for each hit. • Changing the E-value cutoff changes the threshold for reporting a hit.
  • 36. Critical parameters: T, W and scoring matrix • The proper value of T depends on both the values in the scoring matrix and the balance between speed and sensitivity. • Higher values of T progressively remove more word hits and reduce the search space. • A word size (W) of 1 will produce more hits than a word size of 10. In general, if T is scaled uniformly with W, smaller word sizes increase sensitivity and decrease speed. • The interplay between W, T and the scoring matrix is critical, and choosing them wisely is the most effective way of controlling the speed and sensitivity of BLAST.
  • 38. Database Searching • How can we find a particular short sequence in a database of sequences (or one HUGE sequence)? • Problem is identical to local sequence alignment, but on a much larger scale. • We must also have some idea of the significance of a database hit. – Databases always return some kind of hit, so how much attention should be paid to the result? • How can we determine how “unusual” a particular alignment score is?
  • 39. Sentence 1: “These algorithms are trying to find the best way to match up two sequences” Sentence 2: “This does not mean that they will find anything profound” ALIGNMENT: THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES aligned with THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND------ (12 exact matches, 14 conservative substitutions). Is this a good alignment? Significance
  • 40. • A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of Maximum Segment Pairs (MSPs) given w and T • This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability Overview
  • 41. Mathematical Basis of BLAST • Model matches as a sequence of coin tosses • Let p be the probability of a “head” – For a “fair” coin, p = 0.5 • (Erdős-Rényi) If there are n throws, then the expected length R of the longest run of heads is $R = \log_{1/p}(n)$. • Example: Suppose n = 20 for a “fair” coin: $R = \log_2(20) = 4.32$ • Trick is how to model DNA (or amino acid) sequence alignments as coin tosses.
  • 42. Mathematical Basis of BLAST • To model random sequence alignments, replace a match with a “head” and a mismatch with a “tail”: AATCAT aligned with ATTCAG gives HTHHHT • For DNA, the probability of a “head” is 1/4 – What is it for amino acid sequences?
  • 43. Mathematical Basis of BLAST • So, for one particular alignment, the Erdős-Rényi property can be applied • What about for all possible alignments? – Consider that sequences are being shifted back and forth, as in a dot matrix plot • The expected length of the longest match is $R = \log_{1/p}(mn)$, where m and n are the lengths of the two sequences.
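As a quick numerical check of the Erdős-Rényi estimate, the short Python sketch below evaluates the logarithm for the coin example and for two DNA sequences. The function name and the choice of m = n = 100 are illustrative assumptions.

```python
import math

def expected_longest_run(n, p):
    """Expected length of the longest run of heads: R = log base 1/p of n."""
    return math.log(n) / math.log(1.0 / p)

# Fair coin, n = 20 throws (the example from the slides).
print(round(expected_longest_run(20, 0.5), 2))

# Two random DNA sequences of length m = n = 100, p = 1/4 per position.
print(round(expected_longest_run(100 * 100, 0.25), 2))
```

Because m and n enter only through the logarithm of their product, the expected longest match grows slowly even when the search space grows a lot.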
  • 44. Analytical derivation Erdős-Rényi … … … Karlin-Altschul
  • 45. Karlin-Altschul Statistics: $E = K m n e^{-\lambda S}$. This equation states that the number of alignments expected by chance (E) during the sequence database search is a function of the size of the search space (m·n), the normalized score (λS) and a minor constant ($K$, mostly 0.1). The E-value grows linearly with the product of target and query sizes: doubling the target set size and doubling the query length have the same effect on the E-value.
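A small Python sketch makes the linear dependence on the search space concrete. The function name, the raw score of 50 and the value of λ are illustrative assumptions; K = 0.1 follows the "mostly 0.1" remark, and the query and database sizes are the SWISSPROT figures quoted earlier in the lecture.

```python
import math

def expected_hits(score, m, n, lam, k):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)

LAMBDA = 0.3          # assumed value, for illustration only
K = 0.1               # "mostly 0.1" according to the slide
QUERY_LEN = 362       # query length from the SWISSPROT example
DB_LEN = 29_085_265   # residues in SWISSPROT release 38

e = expected_hits(50, QUERY_LEN, DB_LEN, LAMBDA, K)
e_double_query = expected_hits(50, 2 * QUERY_LEN, DB_LEN, LAMBDA, K)
e_double_db = expected_hits(50, QUERY_LEN, 2 * DB_LEN, LAMBDA, K)
print(e, e_double_query, e_double_db)
```

Doubling either the query length or the database size doubles E, while each additional score unit divides E by a constant factor of e^λ.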
  • 46. Analytical derivation: Erdős-Rényi, $R = \log_{1/p}(mn)$ … … … Karlin-Altschul, $E = K m n e^{-\lambda S}$
  • 47. Scoring alignments • Score: S (~R) – $S = \sum_i M(q_i, t_i) - \sum \text{gaps}$ • Any alignment has a score • Any two sequences have (at least one) optimal alignment
  • 48. • For a particular scoring matrix and its associated gap initiation and extension costs one must calculate λ and K • Unfortunately (for gapped alignments), you can’t do this analytically and the values must be estimated empirically – The procedure involves aligning random sequences (Monte Carlo approach) with a specific scoring scheme and observing the alignment properties (scores, target frequencies and lengths)
  • 49. Significance “Monte Carlo” Approach: • Compares result to randomized result, similarly to results generated by a roulette wheel at Monte Carlo • Typical procedure for alignments – Randomize sequence A – Align to sequence B – Repeat many times (hundreds) – Keep track of optimal score • Histogram of scores …
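The procedure above can be sketched in Python as follows. This is a minimal illustration, not the lecture's Needleman-wunsch-Monte-Carlo.pl: the scoring values (match +1, mismatch -1, gap -1), the function names and the fixed random seed are all assumptions made for the example.

```python
import random

def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score, keeping two rows only."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def shuffled_scores(a, b, trials=200, seed=0):
    """Scores of b against randomly shuffled versions of a."""
    rng = random.Random(seed)
    letters = list(a)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)                 # randomize sequence A
        scores.append(nw_score("".join(letters), b))
    return scores

seq_a, seq_b = "CKHVFCR", "CKKCFCKCV"
observed = nw_score(seq_a, seq_b)
null = shuffled_scores(seq_a, seq_b)
at_least = sum(s >= observed for s in null) / len(null)
print(observed, sum(null) / len(null), at_least)
```

The last printed number is the fraction of shuffled alignments that score at least as well as the real one, which is an empirical estimate that makes no assumption about the shape of the score distribution.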
  • 50. Assessing significance requires a distribution • I have a pumpkin of diameter 1 m. Is that unusual? [Figure: histogram of frequency against diameter (m)]
  • 53. Normal Distribution does NOT Fit Alignment Scores !! • In seeking optimal alignments between two sequences, one desires those that have the highest score, i.e. one is seeking a distribution of maxima • In seeking optimal matches between an input sequence and sequence entries in a database, one again desires the matches that have the highest score, and these are obtained via examination of the distribution of such scores for the entries in the database. This is again a distribution of maxima. “A Normal Distribution is a distribution of Sums of independent variables rather than a sum of their Maxima.” Significance
  • 54. Comparing distributions. Gaussian: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Extreme value: $f(x) = \frac{1}{\beta}\, e^{-\frac{x-\mu}{\beta}}\, e^{-e^{-\frac{x-\mu}{\beta}}}$.
  • 55. Alignment scores follow extreme value distributions. Alignment of unrelated/random sequences results in scores following an extreme value distribution: $P(x \ge S) = 1 - \exp(-K m n e^{-\lambda S})$, where m, n are the sequence lengths and $K$, $\lambda$ are free parameters. With $E = K m n e^{-\lambda S}$ this reads $P = 1 - e^{-E}$, or equivalently $E = -\ln(1 - P)$. This can be shown analytically for ungapped alignments and has been found empirically to also hold for gapped alignments under commonly used conditions.
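The conversion between P and E can be written down directly. The Python sketch below uses `math.expm1` and `math.log1p` so that very small values are not lost to rounding; the function names are illustrative.

```python
import math

def p_from_e(e):
    """P = 1 - exp(-E)."""
    return -math.expm1(-e)

def e_from_p(p):
    """E = -ln(1 - P)."""
    return -math.log1p(-p)

for e in (10.0, 1.0, 0.01, 1e-5):
    print(e, p_from_e(e))
```

For small values the two numbers are practically identical (P ≈ E), so the distinction only matters for weak hits with E around 1 or larger.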
  • 56. Alignment scores follow extreme value distributions. Alignment algorithms will always produce alignments, regardless of whether they are meaningful or not => it is important to have a way of selecting significant alignments from a large set of database hits. Solution: fit the distribution of scores from the database search to an extreme value distribution; determine the p-value of a hit from this fitted distribution. Example: scores fitted to extreme value distribution. 99.9% of this distribution is located below score = 112 => a hit with score = 112 has a p-value of 0.1%
  • 57. Significance: BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores. For this reason BLAST only allows certain combinations of substitution matrices and gap penalties. This also means that the fit is based on a different data set than the one you are working on. A word of caution: BLAST tends to overestimate the significance of its matches. E-values from BLAST are fine for identifying sure hits. One should be careful using BLAST’s E-values to judge if a marginal hit can be trusted (e.g., you may want to use E-values of $10^{-4}$ to $10^{-5}$).
  • 58. Determining P-values • If we can estimate the two parameters of the extreme value distribution, $\beta$ and $\mu$, then we can determine, for a given match score x, the probability that a random match with score x or greater would have occurred in the database. • For sequence matches, a scoring system and database can be parameterized by two parameters, $K$ and $\lambda$, related to $\beta$ and $\mu$. – It would be nice if we could compare hit significance without regard to the scoring system used!
  • 59. Bit Scores • The expected number of hits with score ≥ S is: $E = K m n e^{-\lambda S}$ – Where m and n are the sequence lengths • Normalize the raw score using: $S' = \frac{\lambda S - \ln K}{\ln 2}$ • Obtains a “bit score” S’, with a standard set of units. • The new E-value is: $E = m n\, 2^{-S'}$
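That the bit-score form gives the same E-value as the raw-score form can be checked numerically. In the Python sketch below the function names and the values of λ, K, m, n and the raw score are illustrative assumptions.

```python
import math

def bit_score(raw, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)

def evalue_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)

lam, k = 0.3, 0.1          # assumed parameters of some scoring system
m, n, raw = 362, 29_085_265, 80

bits = bit_score(raw, lam, k)
e_bits = evalue_from_bits(bits, m, n)
e_raw = k * m * n * math.exp(-lam * raw)
print(bits, e_bits, e_raw)
```

Once a score is expressed in bits, λ and K are no longer needed, which is why bit scores from searches with different scoring systems can be compared.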
  • 60. Needleman-wunsch-Monte-Carlo.pl (Average around -64 !)
      -74
      -73
      -72 *
      -71 *****
      -70 *******
      -69 **********
      -68 ***************
      -67 *************************
      -66 *************************
      -65 ************************************
      -64 *****************************************
      -63 ************************************************************
      -61 ************************
      -60 *****************************
      -59 *******************
      -58 **************
      -57 *********
      -56 ********
      -55 *****
      -54 ****
      -53 *
      -52 *
      -51 *
      -50
      -49
  • 61. • The distribution-of-scores graph shows the frequency of observed scores • The expected curve (asterisks) follows the extreme value distribution – the theoretical curve should be similar to the observed results • Deviations indicate that the fitting parameters are wrong – too weak gap penalties – compositional biases FastA Output
  • 62. FastA Output (columns: score, observed count, expected count, bars with = for observed and * for expected; bar spacing was lost in extraction)
      < 20 222 0 :*
      22 30 0 :*
      24 18 1 :*
      26 18 15 :*
      28 46 159 :*
      30 207 963 :*
      32 1016 3724 := *
      34 4596 10099 :==== *
      36 9835 20741 :========= *
      38 23408 34278 :==================== *
      40 41534 47814 :=================================== *
      42 53471 58447 :============================================ *
      44 73080 64473 :====================================================*=======
      46 70283 65667 :=====================================================*====
      48 64918 62869 :===================================================*==
      50 65930 57368 :===============================================*=======
      52 47425 50436 :======================================= *
      54 36788 43081 :=============================== *
      56 33156 35986 :============================ *
      58 26422 29544 :====================== *
      60 21578 23932 :================== *
      62 19321 19187 :===============*
      64 15988 15259 :============*=
      66 14293 12060 :=========*==
      68 11679 9486 :=======*==
      70 10135 7434 :======*==
  • 63. FastA Output, continued (columns: score, observed count, expected count, bars; from score 92 on, the second bar column is the inset that enlarges the tail; slide annotation on the tail: Related)
      72 8957 5809 :====*===
      74 7728 4529 :===*===
      76 6176 3525 :==*===
      78 5363 2740 :==*==
      80 4434 2128 :=*==
      82 3823 1628 :=*==
      84 3231 1289 :=*=
      86 2474 998 :*==
      88 2197 772 :*=
      90 1716 597 :*=
      92 1430 462 :*= :===============*========================
      94 1250 358 :*= :============*===========================
      96 954 277 :* :=========*=======================
      98 756 214 :* :=======*===================
      100 678 166 :* :=====*==================
      102 580 128 :* :====*===============
      104 476 99 :* :===*=============
      106 367 77 :* :==*==========
      108 309 59 :* :==*========
      110 287 46 :* :=*========
      112 206 36 :* :=*======
      114 161 28 :* :*=====
      116 144 21 :* :*====
      118 127 16 :* :*====
      >120 886 13 :* :*==============================
  • 64. • A summary of the statistics and of the program parameters follows the histogram. – An important number in this summary is the Kolmogorov-Smirnov statistic, which indicates how well the actual data fit the theoretical statistical distribution. The lower this value, the better the fit, and the more reliable the statistical estimates. – In general, a Kolmogorov-Smirnov statistic under 0.1 indicates a good fit with the theoretical model. If the statistic is higher than 0.2, the statistics may not be valid, and it is recommended to repeat the search, using more stringent (more negative) values for the gap penalty parameters. FastA Output
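To make the Kolmogorov-Smirnov statistic concrete, the Python sketch below computes the largest gap between the empirical distribution of a sample and an extreme value (Gumbel) distribution. It is a generic textbook version, not FASTA's own calculation; the function names and the parameters `mu` and `beta` are assumptions for illustration.

```python
import math

def gumbel_cdf(x, mu, beta):
    """Cumulative extreme value distribution with location mu and scale beta."""
    return math.exp(-math.exp(-(x - mu) / beta))

def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF of sample and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

# Scores placed exactly at evenly spaced quantiles of a Gumbel distribution.
mu, beta, n = 40.0, 8.0, 100
scores = [mu - beta * math.log(-math.log((i + 0.5) / n)) for i in range(n)]
print(ks_statistic(scores, lambda x: gumbel_cdf(x, mu, beta)))
```

A sample that follows the theoretical curve closely gives a value near zero, while a poor fit pushes the statistic towards the 0.1 to 0.2 range mentioned on the slide.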
  • 65. Statistics summary • Optimal local alignment scores for pairs of random amino acid sequences of the same length follow an extreme-value distribution. For any score S, the probability of observing a score ≥ S is given by the Karlin-Altschul statistic $P(\text{score} \ge S) = 1 - \exp(-K m n e^{-\lambda S})$ • $K$ and $\lambda$ are parameters related to the position of the maximum and the width of the distribution. • Note the long tail at the right. This means that a score several standard deviations above the mean has a higher probability of arising by chance (that is, it is less significant) than if the scores followed a normal distribution.
  • 66. P-values • Many programs report P = the probability that the alignment is no better than random. The relationship between the Z-score and P depends on the distribution of the scores from the control population, which do NOT follow the normal distribution – P ≤ $10^{-100}$ (exact match) – P in range $10^{-100}$ to $10^{-50}$ (sequences nearly identical, e.g. alleles or SNPs) – P in range $10^{-50}$ to $10^{-10}$ (closely related sequences, homology certain) – P in range $10^{-5}$ to $10^{-1}$ (usually distant relatives) – P > $10^{-1}$ (match probably insignificant)
  • 67. E-values • For database searches, most programs report E-values. The E-value of an alignment is the expected number of sequences that give the same Z-score or better if the database is probed with a random sequence. E is found by multiplying the value of P by the size of the database probed. Note that E, but not P, depends on the size of the database. Values of P are between 0 and 1. Values of E are between 0 and the number of sequences in the database searched: – E ≤ 0.02: sequences probably homologous – E between 0.02 and 1: homology cannot be ruled out – E > 1: a match this good would be expected just by chance
  • 69. BLAST is actually a family of programs: • BLASTN - Nucleotide query searching a nucleotide database. • BLASTP - Protein query searching a protein database. • BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database. • TBLASTN - Protein query searching a translated nucleotide (6 frames) database. • TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database. Blast
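The "6 frames" used by BLASTX, TBLASTN and TBLASTX are the three reading frames of the sequence plus the three of its reverse complement. The Python sketch below shows one way to produce them with the standard genetic code; the function names and frame labels are illustrative, and incomplete or ambiguous codons are simply written as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from the start of dna; unknown codons give X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the six conceptual translations keyed by strand and frame."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Stop codons appear as `*` in the output, which is how translated searches can run straight through frames that would never be expressed.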
  • 70. Blast
  • 85. • Be aware of what options you have selected when using BLAST, or FASTA implementations. • Treat BLAST searches as scientific experiments • So you should try your searches with the filters on and off to see whether it makes any difference to the output Tips
  • 86. Tips: Low-complexity and Gapped Blast Algorithm • The common Web-based BLAST implementations often have default settings that will affect the outcome of your searches. By default all NCBI BLAST implementations filter out biased sequence composition from your query sequence (e.g. signal peptide and transmembrane sequences - beware!). • The SEG program has been implemented as part of the blast routine in order to mask low-complexity regions • Low-complexity regions are denoted by strings of Xs in the query sequence
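The effect of masking can be illustrated with a toy filter in Python. This is not the SEG algorithm: it simply replaces every window whose Shannon entropy falls below a threshold with Xs. The function names, the window length of 12 and the threshold of 2.2 bits are assumptions chosen for the example.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the residue composition of window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())

def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every window with entropy below threshold by Xs."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)

query = "MCGPFILGTYCD" + "S" * 15 + "MCGPFILGTYCD"
print(mask_low_complexity(query))
```

Running a search with the masked and the unmasked query, as the previous slide suggests, shows directly which hits depend on the low-complexity stretch.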
  • 87. • The sequence databases contain a wealth of information. They also contain a lot of errors. Contaminants … • Annotation errors, frameshifts that may result in erroneous conceptual translations. • Hypothetical proteins ? • In the words of Fox Mulder, "Trust no one." Tips
  • 88. • Once you get a match to things in the databases, check whether the match is to the entire protein, or to a domain. Don't immediately assume that a match means that your protein carries out the same function (see above). Compare your protein and the match protein(s) along their entire lengths before making this assumption. Tips
  • 89. • Domain matches can also cause problems by hiding other informative matches. For instance if your protein contains a common domain you'll get significant matches to every homologous sequence in the database. BLAST only reports back a limited number of matches, ordered by P value. • If this list consists only of matches to the same domain, cut this bit out of your query sequence and do the BLAST search again with the edited sequence (e.g. NHR). Tips
  • 90. • Do controls wherever possible. In particular when you use a particular search software for the first time. • Suitable positive controls would be protein sequences known to have distant homologues in the databases to check how good the software is at detecting such matches. • Negative controls can be employed to make sure the compositional bias of the sequence isn't giving you false positives. Shuffle your query sequence and see what difference this makes to the matches that are returned. A real match should be lost upon shuffling of your sequence. Tips
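  • 91. • Perform Controls
      #!/usr/bin/perl -w
      # Shuffle the residues of a FASTA sequence read from a file or STDIN.
      use strict;
      my ($def, @seq) = <>;          # first line is the FASTA definition line
      print $def;
      chomp @seq;
      @seq = split(//, join("", @seq));
      my $count = 0;
      while (@seq) {
          my $index = rand(@seq);    # random position; splice truncates it to an integer
          my $base = splice(@seq, $index, 1);
          print $base;
          print "\n" if ++$count % 60 == 0;
      }
      print "\n" unless $count % 60 == 0;
    Tips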
  • 92. • Read the footer first • View results graphically • Parse Blasts with Bioperl Tips
  • 93. • BLAST's major advantage is its speed. – 2-3 minutes for BLAST versus several hours for a sensitive FastA search of the whole of GenBank. • When both programs use their default setting, BLAST is usually more sensitive than FastA for detecting protein sequence similarity. – Since it doesn't require a perfect sequence match in the first stage of the search. FastA vs. Blast
  • 94. Weakness of BLAST: – The long word size it uses in the initial stage of DNA sequence similarity searches was chosen for speed, and not sensitivity. – For a thorough DNA similarity search, FastA is the program of choice, especially when run with a lowered KTup value. – FastA is also better suited to the specialised task of detecting genomic DNA regions using a cDNA query sequence, because it allows the use of a gap extension penalty of 0. The original BLAST, which only creates ungapped alignments, will usually detect only the longest exon, or fail altogether. • In general, a BLAST search using the default parameters should be the first step in a database similarity search strategy. In many cases, this is all that may be required to yield all the information needed, in a very short time. FastA vs. Blast
  • 96. 1. Old (ungapped) BLAST 2. New BLAST (allows gaps) 3. Profile -> PSI-BLAST (Position-Specific Iterated) • Strategy: multiple alignment of the hits; calculates a position-specific score matrix; searches with this matrix • In many cases it is much more sensitive to weak but biologically relevant sequence similarities • PSSM !!! PSI-Blast
  • 97. • Patterns of conservation from the alignment of related sequences can aid the recognition of distant similarities. – These patterns have been variously called motifs, profiles, position-specific score matrices, and Hidden Markov Models. For each position in the derived pattern, every amino acid is assigned a score. (1) Highly conserved residue at a position: that residue is assigned a high positive score, and others are assigned high negative scores. (2) Weakly conserved positions: all residues receive scores near zero. (3) Position-specific scores can also be assigned to potential insertions and deletions. PSI-Blast
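How conservation turns into position-specific scores can be sketched in Python. The version below is deliberately simplified: it uses a uniform background, a flat pseudocount and no sequence weighting, which is not how PSI-BLAST builds its matrix; the function names and the tiny example alignment are illustrative assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, background, pseudocount=1.0):
    """Log-odds score (in bits) for every residue at every alignment column."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(background)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / freq)
            for aa, freq in background.items()
        })
    return pssm

def score(pssm, segment):
    """Sum of position-specific scores for a segment of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
pssm = build_pssm(["RGD", "RGD", "KGD", "RGE"], uniform)
print(round(pssm[1]["G"], 2), round(pssm[1]["A"], 2))
print(round(score(pssm, "RGD"), 2), round(score(pssm, "AAA"), 2))
```

The fully conserved G column gives G a clearly positive score and every other residue a negative one, while the more variable first and third columns spread their positive scores over several residues.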
  • 98. Pattern • a set of alternative sequences, using “regular expressions” • Prosite (http://www.expasy.org/prosite/)
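PROSITE-style patterns map almost one-to-one onto ordinary regular expressions. The Python sketch below converts the most common pattern elements; it is a partial, illustrative converter (the function name is made up, and less common syntax is not handled), and the example pattern and test sequence are chosen for demonstration only.

```python
import re

def prosite_to_regex(pattern):
    """Convert common PROSITE pattern elements to a Python regular expression.
    Handles x, [..], {..}, repeats such as (3) or (2,4), and the < > anchors."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element.strip("<>")).groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # any residue except these
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(regex)
print(bool(re.search(regex, "CPECGKSFSRSDELTRHIRIH")))
```

The `<` anchor restricts a match to the N-terminus of the sequence, which is the kind of constraint a plain similarity search cannot express.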
  • 99. PSSM (Position Specific Scoring Matrix)
  • 102. • The power of profile methods can be further enhanced through iteration of the search procedure. – After a profile is run against a database, new similar sequences can be detected. A new multiple alignment, which includes these sequences, can be constructed, a new profile abstracted, and a new database search performed. – The procedure can be iterated as often as desired or until convergence, when no new statistically significant sequences are detected. PSI-Blast
  • 103. (1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program. (2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions. (3) The profile is compared to the protein database, again seeking local alignments using the BLAST algorithm. (4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments. (5) Finally, PSI-BLAST iterates, by returning to step (2), a specified number of times or until convergence. PSI-Blast
  • 104. [Figure: PSI-BLAST iteration with PSSMs.] From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
  • 109. PSI-BLAST pitfalls • Avoid sequences that are too close: overfit! • Can include false homologues! Therefore check the matches carefully: include or exclude sequences based on biological knowledge. • The E-value reflects the significance of the match to the previous training set, not to the original sequence! • Choose your query sequence carefully. • Try the reverse experiment to certify.
  • 110. Reduce overfitting risk by Cobbler • A single sequence is selected from a set of blocks and enriched by replacing the conserved regions delineated by the blocks by consensus residues derived from the blocks. • Embedding consensus residues improves performance • S. Henikoff and J.G. Henikoff; Protein Science (1997) 6:698-705.
  • 112. PHI-Blast (Pattern-Hit Initiated BLAST)
  • 118. Installing Blast Locally • 2 flavors: NCBI/WuBlast • Executables: – ftp://ftp.ncbi.nih.gov/blast/executables/ • Database: – ftp://ftp.ncbi.nih.gov/blast/db/ • Formatdb – formatdb -i ecoli.nt -p F – formatdb -i ecoli.protein -p T • For options: blastall - – blastall -p blastp -i query -d database -o output
  • 120. Main database: BLAT • BLAT: BLAST-Like Alignment Tool • Aligns the input sequence to the Human Genome • Connected to several databases, like: mRNAs, ESTs, RepeatMasker, RefSeq, GenScan, TwinScan, UniGene, CpG Islands
  • 121. BLAT Human Genome Browser
  • 122. BLAT method • Align sequence with BLAT, get alignment info • Per BLAT hit, pick up additional info from connected databases: – mRNAs – ESTs – RepeatMasker – CpG Islands – RefSeq Genes
  • 124. Weblems W5.1: Submit the amino acid sequence of papaya papain to a BLAST (gapped and ungapped) and to a PSI-BLAST search. What are the main differences in results? W5.2: Is there a relationship between Klebsiella aerogenes urease, Pseudomonas diminuta phosphotriesterase and mouse adenosine deaminase? Also use DALI, ClustalW and T-coffee. W5.3: Yeast two-hybrid typically yields DNA sequences. How would you find the corresponding protein? W5.4: When and why would you use tblastn? W5.5: How would you search a database if you want to restrict the search space to those entries having a secretion signal consisting of 4 consecutive (N-terminal) basic residues?