blast and fasta

DR. HARI SINGH GOUR CENTRAL UNIVERSITY, SAGAR
2018-19
FASTA AND BLAST
MADE BY: NAGENDRA SAHU (MSC 1 SEM)
GUIDED BY: DR. SIDHARTH KUMAR MISHRA

Database Searching for Similar Sequences
- Search a sequence database for sequences that are similar to a query sequence
- Provide a list of database sequences with which the query sequence can be aligned well
- Key issue: efficiency

Database Searching for Similar Sequences: Methods
- Smith-Waterman requires on the order of N^2 L computations
- Popular database searching methods (heuristic methods)
  - FASTA [Pearson & Lipman, 1988]
  - BLAST [Altschul et al., 1990]
- Tradeoff of using the fast heuristic methods
  - Accuracy (sensitivity and selectivity)

FASTA
- Problem with the Smith-Waterman algorithm: too many calculations are "wasted" comparing regions that have nothing in common
- Initial insight: regions that are similar between two sequences are likely to share short stretches that are identical
- Basic method: look for similar regions only near short stretches that match exactly ("hit and extend" sequence searching)

Diagonal Method Example
s = H A R F Y A A Q I V L   (positions 1 to 11)
t = V D M A A Q I A         (positions 1 to 8)

Look-up table (residue: positions in s)
  A: 2, 6, 7
  F: 4
  H: 1
  I: 9
  L: 11
  Q: 8
  R: 3
  V: 10
  Y: 5

Offsets (position in s minus position in t) for each residue of t
  t1 = V: +9
  t2 = D: no entry in the table
  t3 = M: no entry in the table
  t4 = A: -2, +2, +3
  t5 = A: -3, +1, +2
  t6 = Q: +2
  t7 = I: +2
  t8 = A: -6, -2, -1

Offset vector (offset: number of hits; offsets range from +10 to -7)
  +9: 1
  +3: 1
  +2: 4
  +1: 1
  -1: 1
  -2: 2
  -3: 1
  -6: 1
  all other offsets: 0
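The look-up table and offset vector of this example can be reproduced with a short script. This is a minimal sketch for k-tuples of size 1, using 1-based positions as on the slide; the function names `build_lookup` and `offset_vector` are illustrative and are not taken from the FASTA program itself.

```python
from collections import Counter, defaultdict


def build_lookup(s, k=1):
    """Map every k-tuple of s to the list of its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(s) - k + 1):
        table[s[i:i + k]].append(i + 1)
    return table


def offset_vector(s, t, k=1):
    """Count k-tuple hits per diagonal, where offset = position in s - position in t."""
    table = build_lookup(s, k)
    offsets = Counter()
    for j in range(len(t) - k + 1):
        for i in table.get(t[j:j + k], []):
            offsets[i - (j + 1)] += 1
    return offsets


s = "HARFYAAQIVL"
t = "VDMAAQIA"
offsets = offset_vector(s, t)
best_offset, hits = offsets.most_common(1)[0]
print(best_offset, hits)  # prints: 2 4
```

The diagonal with the most hits (offset +2) is where AAQI at positions 6 to 9 of s lines up with AAQI at positions 4 to 7 of t, so that is the region worth extending.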

Limitations of FASTA
- FASTA can miss significant similarity because:
  - For nucleic acids, due to codon "wobble", DNA sequences may look like XXy, where the X's are conserved and the y's are not
    - GGuUCuACgAAg and GGcUCcACaAAa both code for the same peptide sequence (Gly-Ser-Thr-Lys), but they do not match with a k-tuple size of 3 or higher
  - For proteins, similar sequences do not have to share identical residues
    - Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly, but there is no match with a k-tuple of size 2. Score?
    - Asp-Lys-Val is quite similar to Glu-Arg-Ile, yet it is missed even with a k-tuple size of 1. Score?
    - Ala-Ala-Ala-Ala-Ala vs Ala-Ala-Ala-Ala-Ala. Score?
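The claims above can be checked directly by listing the k-tuples two sequences share. The following is a small sketch, with the protein examples written in one-letter code (G, D, K, E, R, V, I); the helper name `shared_ktuples` is made up for this illustration.

```python
def ktuples(seq, k):
    """Return the set of all substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_ktuples(a, b, k):
    """Return the k-tuples that occur in both sequences."""
    return ktuples(a, k) & ktuples(b, k)


# Two codings of Gly-Ser-Thr-Lys that differ only at the third codon position
print(shared_ktuples("GGUUCUACGAAG", "GGCUCCACAAAA", 3))  # empty set
print(sorted(shared_ktuples("GGUUCUACGAAG", "GGCUCCACAAAA", 2)))

# Gly-Asp-Gly-Lys-Gly vs Gly-Glu-Gly-Arg-Gly
print(shared_ktuples("GDGKG", "GEGRG", 2))  # empty set
# Asp-Lys-Val vs Glu-Arg-Ile
print(shared_ktuples("DKV", "ERI", 1))  # empty set
```

An exact-match seed finds nothing to extend in these cases, which is the gap that a scoring-based word list is meant to close.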

BLAST
- What does BLAST stand for?
  - Basic Local Alignment Search Tool

BLAST
- BLAST is similar to FASTA, but it searches for words that score above a threshold T rather than words that match exactly. It is also faster because its implementation has been optimized for parallel UNIX architectures from an early stage.
- Reference
  - S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman. Basic Local Alignment Search Tool. J. Mol. Biol. 215:403-410 (1990)

BLAST basics
- BLAST is mainly a 3-step algorithm:
  1. Compile a list of high-scoring strings (words)
  2. Search for hits; each hit gives a seed
  3. Extend seeds to obtain segment pairs

BLAST
- For protein sequences, the list of high-scoring words consists of all words of w characters that score at least T against some word in the query sequence (w = 3 or 4 for protein searches, 11 or 12 for nucleotide sequences).
- Search for "hits" using a hash table or a finite state machine.
- Key concept: search for words that score above T rather than words that match exactly
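The word-list idea can be sketched in a few lines. This is an illustration under stated assumptions, not BLAST's actual implementation: the scoring table holds only the PAM-120 values quoted in these slides for the query word ACDE (w = 4, T = 17), and every pair not quoted there is given a placeholder score of -8 so that the example runs. The resulting list is therefore illustrative, not the true PAM-120 neighborhood. The names `SLIDE_SCORES`, `score` and `neighborhood` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PLACEHOLDER = -8  # assumption: stands in for every score the slides do not quote

# Scores quoted on the slides, indexed as SLIDE_SCORES[query residue][word residue]
SLIDE_SCORES = {
    "A": {"A": 3, **dict.fromkeys("GPST", 1), **dict.fromkeys("NDEV", 0),
          **dict.fromkeys("IQ", -1), **dict.fromkeys("KM", -2)},
    "C": {"C": 9, "A": -3},
    "D": {"D": 5, "N": 2, "A": 0},
    "E": {"E": 5, "A": 0, "C": -7},
}


def score(query_residue, word_residue):
    return SLIDE_SCORES.get(query_residue, {}).get(word_residue, PLACEHOLDER)


def neighborhood(query_word, threshold):
    """List all words scoring at least `threshold` against `query_word`.

    Instead of trying all 20**w words, the search abandons a partial word as
    soon as even the best possible remaining positions cannot reach the
    threshold (the "directed change" idea).
    """
    w = len(query_word)
    best_tail = [0] * (w + 1)  # best score obtainable from position i onward
    for i in range(w - 1, -1, -1):
        best_here = max(score(query_word[i], r) for r in AMINO_ACIDS)
        best_tail[i] = best_tail[i + 1] + best_here

    found = []

    def extend(prefix, total):
        i = len(prefix)
        if i == w:
            found.append((prefix, total))
            return
        for r in AMINO_ACIDS:
            new_total = total + score(query_word[i], r)
            if new_total + best_tail[i + 1] >= threshold:
                extend(prefix + r, new_total)

    extend("", 0)
    return sorted(found, key=lambda item: (-item[1], item[0]))


for word, word_score in neighborhood("ACDE", 17)[:5]:
    print(word, word_score)
```

The pruning test is what rules out any change at the second position: C contributes 9, and the other three positions add up to only 13, which is below T = 17.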

BLAST method for proteins
1. Compile a list of words that score at least T when paired with the query sequence.
- Example using PAM-120 for query sequence ACDE (w = 4, T = 17):
           A   C   D   E
    ACDE = +3  +9  +5  +5 = 22
- Try all possibilities:
    AAAA = +3  -3   0   0 = 0    no good
    AAAC = +3  -3   0  -7 = -7   no good
- ...too slow, try directed change

Generating word list
           A   C   D   E
    ACDE = +3  +9  +5  +5 = 22
- Change 1st position to all acceptable substitutions:
    gCDE =  1   9   5   5 = 20   ok (= pCDE, sCDE, tCDE)
    nCDE =  0   9   5   5 = 19   ok (= dCDE, eCDE, vCDE)
    iCDE = -1   9   5   5 = 18   ok (= qCDE)
    kCDE = -2   9   5   5 = 17   ok (= mCDE)
- Change 2nd position: not possible, since all alternatives are negative and the other three positions only add up to 13
- Change 3rd position in combination with the first position:
    gCnE =  1   9   2   5 = 17   ok
- Continue, using recursion

2. Scan the database for hits with the compiled list of words.
- Use a finite state machine (the approach actually used)
- Calculate a state transition table that tells which state to go to based on the next character in the sequence
3. Extend hits in both directions to form segment pairs (without allowing gaps)
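Step 3 can be sketched as follows. The slides do not say when an extension stops, so this sketch assumes a simple drop-off rule: extension in a direction ends once the running score falls more than `x_drop` below the best score seen so far. The match/mismatch scores (+2/-1) and all function names are also assumptions made for the example.

```python
def simple_score(a, b):
    """Toy scoring: +2 for identical residues, -1 otherwise (assumption)."""
    return 2 if a == b else -1


def extend_one_direction(query, subject, q, s, step, score, x_drop):
    """Walk from (q, s) in direction `step`; return (best gain, residues kept)."""
    best = total = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        total += score(query[q], subject[s])
        length += 1
        if total > best:
            best, best_len = total, length
        elif best - total > x_drop:
            break
        q += step
        s += step
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, word_len, score=simple_score, x_drop=3):
    """Extend a word hit in both directions without gaps.

    Returns (score, query start, subject start, length), 0-based.
    """
    seed = sum(score(query[q_pos + i], subject[s_pos + i]) for i in range(word_len))
    right, r_len = extend_one_direction(
        query, subject, q_pos + word_len, s_pos + word_len, +1, score, x_drop)
    left, l_len = extend_one_direction(
        query, subject, q_pos - 1, s_pos - 1, -1, score, x_drop)
    return seed + left + right, q_pos - l_len, s_pos - l_len, word_len + l_len + r_len


query, subject = "HARFYAAQIVL", "VDMAAQIA"
# The word AAQ occurs at index 5 of the query and index 3 of the subject
print(extend_hit(query, subject, 5, 3, 3))  # prints: (8, 5, 3, 4)
```

Here the seed AAQ grows by one residue to the right (AAQI) and not at all to the left, because the residues before it mismatch.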

- Example of a finite state machine for string matching (input alphabet: a, b, c):
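Word: ababaca
States: 0 1 2 3 4 5 6 7 (state 7 means the word has been recognized)
Forward transitions: 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -a-> 5 -c-> 6 -a-> 7
Back transitions: 1 -a-> 1, 3 -a-> 1, 5 -a-> 1, 5 -b-> 4, 7 -a-> 1, 7 -b-> 2
All other inputs lead back to state 0
Database sequence: bcabccaaababacababacabb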
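A state transition table for this example can be computed mechanically. The sketch below builds the table for a single word by brute force (for each state and input character, find the longest prefix of the word that is a suffix of what has been read) and then scans the database sequence one character at a time. It handles one word only, whereas BLAST's machine recognizes a whole word list; the function names are illustrative.

```python
def build_transition_table(word, alphabet):
    """table[state][ch] = next state; state k means the first k characters matched."""
    m = len(word)
    table = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            read = word[:state] + ch
            k = min(m, state + 1)
            # Shrink k until the first k characters of the word end `read`
            while k > 0 and not read.endswith(word[:k]):
                k -= 1
            row[ch] = k
        table.append(row)
    return table


def scan(text, word, alphabet):
    """Return the 0-based start positions of every occurrence of word in text."""
    table = build_transition_table(word, alphabet)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        state = table[state][ch]
        if state == len(word):
            hits.append(i - len(word) + 1)
    return hits


print(scan("bcabccaaababacababacabb", "ababaca", "abc"))  # prints: [8, 14]
```

The two occurrences overlap by one character: after reaching state 7 the machine moves to state 2 on the next b instead of starting over, so no database character is read twice.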

Exercise
- Construct a finite state machine that recognizes the word ATG
- Assume the sequence is a nucleotide sequence

BLAST Method for DNA
1. Make a list of all words of length w in the query sequence (often w = 11 or 12)
2. Compress the database by packing 4 nucleotides into a single byte (use an auxiliary table to record where sequences start and stop within the compressed database). This does not allow for unspecified bases (wildcards).
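Packing 4 nucleotides into a byte amounts to using 2 bits per base. The following sketch assumes the encoding A=0, C=1, G=2, T=3, which the slides do not specify, and returns the sequence length alongside the bytes as a stand-in for the auxiliary table. The function names are illustrative.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string 4 bases per byte; return (packed bytes, original length)."""
    packed = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            if base not in CODE:
                # Two bits leave no room for a fifth symbol such as N
                raise ValueError(f"cannot pack unspecified base {base!r}")
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        packed.append(byte)
    return bytes(packed), len(seq)


def unpack(packed, length):
    """Reverse pack(); `length` trims the padding of the last byte."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


packed, n = pack("GATTACA")
print(packed.hex(), n)
print(unpack(packed, n))
```

With only four codes available per position, a wildcard base has no representation, which is why the packed database cannot hold unspecified bases.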

BLAST Method for DNA (continued)
3. Compress the words from the query sequence in the same way.
4. Search the compressed database for matches with the compressed words.
Since all frames of the query sequence are considered separately, any match of length w >= 11 must contain a match of length 8 that lies on a byte boundary of one of the words from the query sequence. The search can therefore scan one packed byte at a time, improving speed 4-fold over comparing one nucleotide at a time.
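The byte-boundary argument is a small piece of arithmetic: a match can start at most 3 bases past a boundary, and 3 + 8 = 11. The check below is a sketch with a made-up helper name; it looks for a window of two whole bytes (8 bases) inside a match, for every possible position of the match relative to the byte boundaries.

```python
def aligned_window(start, length, bases_per_byte=4, window=8):
    """Return the start of the first byte-aligned window inside the match, or None."""
    first_boundary = -(-start // bases_per_byte) * bases_per_byte  # round up
    if first_boundary + window <= start + length:
        return first_boundary
    return None


# Every match of length 11 contains two whole packed bytes
print([aligned_window(start, 11) for start in range(4)])  # prints: [0, 4, 4, 4]
# A match of length 10 that starts one base past a boundary does not
print(aligned_window(1, 10))  # prints: None
```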

Low-Complexity Regions
- Low-complexity regions are segments that contain certain bases or amino acids more often than one would expect in "normal" nucleotide or protein sequences.
- Problem: if the query sequence has a stretch of unusual base composition (e.g., A-T rich) or a repeated sequence element (e.g., an Alu sequence), there will be many hits with "uninteresting" regions.

Low-Complexity Regions (continued)
- Solution:
  - Make a list of the words that occur very frequently (more frequently than expected by chance).
  - Remove these words from the query list of words before searching the database. (The words are replaced by strings of Xs.)
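The masking step can be sketched as below. How "more frequently than expected by chance" is decided is not stated in the slides, so this example simply assumes a fixed count threshold (`max_count`); the function names and the toy sequences are also invented for the illustration.

```python
from collections import Counter


def frequent_words(sequences, w, max_count):
    """Return the words of length w that occur more than max_count times."""
    counts = Counter(
        seq[i:i + w] for seq in sequences for i in range(len(seq) - w + 1)
    )
    return {word for word, n in counts.items() if n > max_count}


def mask_query(query, words, w, mask_char="X"):
    """Replace every occurrence of a listed word in the query with mask_char."""
    masked = list(query)
    for i in range(len(query) - w + 1):
        if query[i:i + w] in words:
            masked[i:i + w] = mask_char * w
    return "".join(masked)


database = ["ATATATATAT", "GCATATATGC"]  # toy A-T rich sequences
overrepresented = frequent_words(database, w=4, max_count=3)
print(sorted(overrepresented))  # prints: ['ATAT', 'TATA']
print(mask_query("GGCATATATCCG", overrepresented, w=4))  # prints: GGCXXXXXXCCG
```

Masked positions can no longer seed a hit, so the A-T rich stretch stops generating matches while the flanking sequence is searched as usual.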

BLAST Statistical significance
- A key to the utility of BLAST is the ability to calculate the expected probabilities of occurrence of maximal segment pairs (MSPs) given w and T
- This allows BLAST to rank matching sequences in order of "significance" and to cut off listings at a user-specified probability
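The slides do not give the formula behind these probabilities. For reference, the standard Karlin-Altschul form (quoted from the general BLAST literature, not from this deck) relates a segment-pair score S to the expected number E of chance matches scoring at least S. Here m and n are the lengths of the query and the database, and K and \lambda are constants that depend on the scoring matrix and the residue frequencies:

```latex
E = K \, m \, n \, e^{-\lambda S},
\qquad
P(\text{at least one chance MSP with score} \ge S) = 1 - e^{-E}
```

A higher score gives an exponentially smaller E, so ranking by E (or P) and cutting the list at a chosen value is equivalent to setting a minimum score for a given search.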

Choosing Values for w and T
- Trade-off: sensitivity vs. running time
- Choosing a value for w
  - Small w: many matches to expand
  - Big w: many words to be generated
  - w = 3 or 4 is a good compromise for protein searches
- Choosing a value for T
  - Small T: greater sensitivity, more matches to expand
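To see why a big w is costly, it helps to count how many distinct words of length w exist at all, since the word list is drawn from that space. The snippet below is only this arithmetic, for a 20-letter protein alphabet and a 4-letter nucleotide alphabet; `word_space` is a made-up helper name.

```python
def word_space(alphabet_size, w):
    """Number of distinct words of length w over the given alphabet."""
    return alphabet_size ** w


for w in (2, 3, 4, 5):
    print("protein, w =", w, "->", word_space(20, w))

for w in (11, 12):
    print("DNA, w =", w, "->", word_space(4, w))
```

Each extra protein position multiplies the space by 20 (8,000 words at w = 3, 160,000 at w = 4), while each position removed makes chance matches more common and so adds seeds that must be extended.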

BLAST Notes
- May fail to find optimal MSPs
- May miss seeds if T is too stringent
- Empirically, 10 to 50 times faster than Smith-Waterman

Basic BLAST Family
- BLASTN
  - DNA query against a DNA database
- BLASTP
  - protein query against a protein database
- TBLASTN
  - protein query against a DNA database (translated)
- BLASTX
  - DNA query (translated) against a protein database
- TBLASTX
  - DNA query (translated) against a DNA database (translated)

BLAST Refinements
- Gapped alignments
- "Two-hit" method for extending word pairs
- Iteration with a position-specific matrix (PSI-BLAST)
- Pattern-hit initiated BLAST (PHI-BLAST)
