DR. HARI SINGH GOUR CENTRAL UNIVERSITY SAGAR
2018-19
FASTA AND BLAST
MADE BY:- NAGENDRA SAHU (MSC 1 SEM)
GUIDED BY:- DR. SIDHARTH KUMAR MISHRA

Database Searching for Similar Sequences
• Search a sequence database for sequences that are similar to a query sequence
• Provide a list of database sequences with which the query sequence can be aligned well
• Key issue: efficiency

Database Searching for Similar Sequences: Methods
• Smith-Waterman requires on the order of N²L computations
• Popular database searching methods (heuristic methods)
  • FASTA [Pearson & Lipman, 1988]
  • BLAST [Altschul et al., 1990]
• Trade-off of using the fast heuristic methods
  • Accuracy (sensitivity and selectivity)

FASTA
• Problem with the Smith-Waterman algorithm: too many calculations are “wasted” comparing regions that have nothing in common
• Initial insight: regions that are similar between two sequences are likely to share short stretches that are identical
• Basic method: look for similar regions only near short stretches that match exactly (“hit and extend” sequence searching)

Diagonal Method Example
s = H A R F Y A A Q I V L   (positions 1 to 11)
t = V D M A A Q I A         (positions 1 to 8)

Look-up table (positions of each residue in s)
  A: 2, 6, 7
  F: 4
  H: 1
  I: 9
  L: 11
  Q: 8
  R: 3
  V: 10
  Y: 5

Offsets (position in s minus position in t) for each residue of t
  V (1): +9
  D (2): no hit
  M (3): no hit
  A (4): -2, +2, +3
  A (5): -3, +1, +2
  Q (6): +2
  I (7): +2
  A (8): -6, -2, -1

Offset vector (number of hits per offset, over the range +10 to -7)
  +9: 1   +3: 1   +2: 4   +1: 1   -1: 1   -2: 2   -3: 1   -6: 1   (all other offsets: 0)
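
The look-up table and offset vector of the example can be reproduced with a few lines of Python. This is a minimal sketch, assuming a k-tuple size of 1 as in the example; the function name `offset_vector` is made up for illustration and is not taken from the FASTA program.

```python
from collections import Counter, defaultdict

def offset_vector(s, t, k=1):
    """Count k-tuple hits per diagonal (offset = position in s - position in t)."""
    # Look-up table: every k-tuple of s mapped to its 1-based start positions.
    lookup = defaultdict(list)
    for i in range(len(s) - k + 1):
        lookup[s[i:i + k]].append(i + 1)
    # Scan t once; each shared k-tuple votes for one offset per occurrence in s.
    offsets = Counter()
    for j in range(len(t) - k + 1):
        for i in lookup.get(t[j:j + k], []):
            offsets[i - (j + 1)] += 1
    return offsets

s = "HARFYAAQIVL"
t = "VDMAAQIA"
hits = offset_vector(s, t)
best_offset, best_count = hits.most_common(1)[0]
print(best_offset, best_count)  # 2 4
```

The offset with the most hits (+2, four hits) marks the diagonal on which AAQI in s lines up with AAQI in t, which is where a hit-and-extend search would concentrate its effort.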

Limitations of FASTA
• FASTA can miss significant similarity because:
  • For nucleic acids, due to codon “wobble”, DNA sequences may look like XXy, where the X’s are conserved and the y’s are not
    • GGuUCuACgAAg and GGcUCcACaAAa both code for the same peptide sequence (Gly-Ser-Thr-Lys), but they do not match with a k-tuple size of 3 or higher
  • For proteins, similar sequences do not have to share identical residues
    • Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly, but there is no match with a k-tuple of size 2. Score ?
    • Asp-Lys-Val is quite similar to Glu-Arg-Ile, yet it is missed even with a k-tuple size of 1. Score ?
    • Ala-Ala-Ala-Ala-Ala vs Ala-Ala-Ala-Ala-Ala. Score ?
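
The missed matches can be checked directly by listing the k-tuples two sequences have in common. The sketch below is only an illustration; `shared_ktuples` is a hypothetical helper, and the peptides are written in one-letter code (GDGKG, GEGRG, DKV, ERI).

```python
def shared_ktuples(a, b, k):
    """Return the set of k-tuples that occur in both sequences."""
    tuples_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    tuples_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return tuples_a & tuples_b

rna1 = "GGUUCUACGAAG"  # Gly-Ser-Thr-Lys
rna2 = "GGCUCCACAAAA"  # Gly-Ser-Thr-Lys, different wobble bases

print(len(shared_ktuples(rna1, rna2, 2)))        # some 2-tuples are shared
print(len(shared_ktuples(rna1, rna2, 3)))        # 0: no exact 3-tuple in common
print(len(shared_ktuples("GDGKG", "GEGRG", 2)))  # 0
print(len(shared_ktuples("DKV", "ERI", 1)))      # 0
```

With no shared k-tuple there is no hit to extend, so the similarity goes unnoticed however well the sequences would score under a substitution matrix.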

BLAST
• What does BLAST stand for?
  • Basic Local Alignment Search Tool

BLAST
• BLAST is similar to FASTA, but it searches for words that score above T rather than words that match exactly. It is also faster because its implementation has been optimized to work with parallel UNIX architecture from an early stage.
• Reference
  • S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman. Basic Local Alignment Search Tool. J. Mol. Biol. 215:403-410 (1990)

BLAST basics
• BLAST is mainly a 3-step algorithm:
  • Compile a list of high-scoring strings (words)
  • Search for hits; each hit gives a seed
  • Extend seeds to obtain segment pairs

BLAST
• For protein sequences, the list of high-scoring words consists of all words of w characters that score at least T with some word in the query sequence (w = 3 or 4 for protein searches, 11 or 12 for nucleotide sequences).
• Search for “hits” using a hash table or a finite state machine.
• Key concept: searching for words that score above T rather than words that match exactly

BLAST method for proteins
1. Compile a list of words which give a score of at least T when paired with the query sequence.
• Example using PAM-120 for query sequence ACDE (w=4, T=17):
           A  C  D  E
    ACDE = +3 +9 +5 +5 = 22
• Try all possibilities:
    AAAA = +3 -3  0  0 =  0  no good
    AAAC = +3 -3  0 -7 = -7  no good
• ...too slow, try directed change

Generating word list
           A  C  D  E
    ACDE = +3 +9 +5 +5 = 22
• Change 1st pos. to all acceptable substitutions
    gCDE =  1  9  5  5 = 20  ok (=pCDE, sCDE, tCDE)
    nCDE =  0  9  5  5 = 19  ok (=dCDE, eCDE, vCDE)
    iCDE = -1  9  5  5 = 18  ok (=qCDE)
    kCDE = -2  9  5  5 = 17  ok (=mCDE)
• Change 2nd pos.: can't; all alternatives are negative and the other three positions only add up to 13
• Change 3rd pos. in combination with first position
    gCnE =  1  9  2  5 = 17  ok
• Continue; use recursion
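
The directed, recursive search can be sketched as a branch-and-bound enumeration: a partial word is abandoned as soon as even the best possible choices for the remaining positions cannot reach T. The code below is an illustration only. The function name `neighborhood_words` and the scoring function `toy_score` (+5 for a match, -2 for a mismatch, over a four-letter alphabet) are assumptions made for the example; they are not PAM-120 and not BLAST's own code.

```python
def neighborhood_words(query_word, alphabet, score, threshold):
    """List every word that scores at least `threshold` against `query_word`."""
    w = len(query_word)
    # best_rest[i]: highest score still reachable from positions i..w-1.
    best_rest = [0] * (w + 1)
    for i in range(w - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(score(query_word[i], c) for c in alphabet)

    results = []

    def extend(prefix, running):
        pos = len(prefix)
        if pos == w:
            results.append((prefix, running))
            return
        for c in alphabet:
            total = running + score(query_word[pos], c)
            # Prune: even perfect choices for the rest cannot reach the threshold.
            if total + best_rest[pos + 1] >= threshold:
                extend(prefix + c, total)

    extend("", 0)
    return results

def toy_score(a, b):
    """Hypothetical scoring: +5 for identical residues, -2 otherwise."""
    return 5 if a == b else -2

words = neighborhood_words("ACD", "ACDE", toy_score, threshold=8)
print(len(words))  # 10: the word itself plus nine single substitutions
```

Passing a real substitution matrix as `score` and the full 20-letter alphabet would give word lists of the kind shown on the slide; the pruning step is what makes this faster than trying all possibilities.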

BLAST method for proteins
2. Scan the database for hits with the compiled list of words.
• Use a finite state machine (actually used)
• Calculate a state transition table that tells what state to go to based on the next character in the sequence
3. Extend hits in both directions to form segment pairs (without allowing gaps)
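
Step 3 can be sketched as follows. The slide does not say when extension stops, so this example assumes a drop-off rule: extension in one direction ends once the running score falls more than `drop_off` below the best score seen in that direction. The names `extend_ungapped`, `match_score` and `drop_off`, and the +1/-1 scoring, are illustrative assumptions.

```python
def extend_ungapped(query, subject, q_pos, s_pos, w, score, drop_off):
    """Extend a word hit left and right without gaps.

    Returns (segment score, query start, query end), end exclusive, 0-based.
    """
    seed = sum(score(query[q_pos + i], subject[s_pos + i]) for i in range(w))

    def one_direction(q, s, step):
        best = running = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > drop_off:
                break  # score fell too far below the best seen so far
            q += step
            s += step
        return best, best_len

    right, right_len = one_direction(q_pos + w, s_pos + w, +1)
    left, left_len = one_direction(q_pos - 1, s_pos - 1, -1)
    return seed + left + right, q_pos - left_len, q_pos + w + right_len

def match_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

query, subject = "HARFYAAQIVL", "VDMAAQIA"
# Seed: the word "AQ" at query index 6 and subject index 4 (0-based).
print(extend_ungapped(query, subject, 6, 4, 2, match_score, drop_off=1))  # (4, 5, 9)
```

The returned interval covers `query[5:9]`, that is AAQI, the ungapped segment pair grown from the two-letter seed.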

BLAST method for proteins
• Example of a finite state machine for string matching (input alphabet: a, b, c)
Word: ababaca
[Figure: state diagram with states 0 to 7. The forward transitions from state 0 through to state 7 are labelled a, b, a, b, a, c, a; the remaining edges, labelled a or b, return to earlier states.]
Database sequence: bcabccaaababacababacabb
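
A state transition table for this example can be built mechanically. The sketch below uses the textbook string-matching automaton for a single word, where each state is the number of characters of the word matched so far; this is an assumption for illustration, since BLAST's own machine recognizes a whole list of words at once. The function names are hypothetical.

```python
def build_automaton(word, alphabet):
    """Build the state transition table of a string-matching automaton for `word`."""
    m = len(word)
    table = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            # Next state = length of the longest prefix of `word`
            # that is a suffix of (matched prefix + ch).
            seen = word[:state] + ch
            k = min(m, state + 1)
            while k > 0 and not seen.endswith(word[:k]):
                k -= 1
            table[state][ch] = k
    return table

def scan(table, text):
    """Return the 0-based start of every occurrence found in one pass over `text`."""
    final = len(table) - 1
    state, starts = 0, []
    for i, ch in enumerate(text):
        state = table[state].get(ch, 0)  # characters outside the alphabet reset to 0
        if state == final:
            starts.append(i - final + 1)
    return starts

table = build_automaton("ababaca", "abc")
print(scan(table, "bcabccaaababacababacabb"))  # [8, 14]
```

Each database character is read exactly once, and the two overlapping occurrences (they share the `a` at index 14) are both reported because the table encodes how much of the word is still matched after a hit.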

Exercise
• Construct a finite state machine that recognizes the word: ATG
• Assume the sequence is a nucleotide sequence

BLAST Method for DNA
1. Make a list of all words of length w in the query sequence (often w = 11 or 12)
2. Compress the database by packing 4 nucleotides into a single byte (use an auxiliary table to tell you where sequences start and stop within the compressed database). This doesn't allow for unspecified bases (wildcards).

BLAST Method for DNA
3. Compress the words from the query sequence the same way.
4. Search the compressed database for matches with the compressed words.
Since all frames of the query sequence are considered separately, any match of length w >= 11 must contain a match of length 8 that lies on a byte boundary of one of the words from the query sequence. Thus one can scan a (packed) byte at a time, improving speed 4-fold over comparing one nucleotide at a time.
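
Packing four nucleotides into one byte needs two bits per base. The sketch below assumes the encoding A=0, C=1, G=2, T=3 with the first base in the high bits; the slides do not give the actual bit layout, so both choices, and the names `CODE` and `pack`, are illustrative.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding

def pack(seq):
    """Pack a nucleotide string into bytes, 4 bases per byte (first base in the high bits)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]  # raises KeyError on N or other wildcards
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)

packed = pack("ACGTACGTTTGA")
print(packed.hex())  # 1b1bf8
```

With only four codes available there is no room for a wildcard symbol, which is why the compressed database cannot represent unspecified bases. Comparing two packed bytes compares four nucleotides at once.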

Low-Complexity Regions
• Low-complexity regions are segments that contain certain bases or amino acids more often than one would expect in “normal” nucleotide or protein sequences.
• Problem: if the query sequence has a stretch of unusual base composition (e.g., A-T rich) or a repeated sequence element (e.g., an Alu sequence), there will be many hits with "uninteresting" regions.

Low-Complexity Regions
• Solution:
  • Make a list of the words occurring very frequently (more frequently than expected by chance).
  • Remove these words from the query list of words before searching the database. (The words are replaced by strings of Xs.)
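
The masking idea can be sketched in a simplified form. The example below counts words within the query itself and masks any word seen more than `max_count` times; a real filter would compare against frequencies expected by chance, so the counting rule, the threshold and the name `mask_frequent_words` are all assumptions made for illustration.

```python
from collections import Counter

def mask_frequent_words(seq, w, max_count):
    """Replace every word of length w that occurs more than max_count times with X's."""
    counts = Counter(seq[i:i + w] for i in range(len(seq) - w + 1))
    masked = list(seq)
    for i in range(len(seq) - w + 1):
        if counts[seq[i:i + w]] > max_count:
            masked[i:i + w] = "X" * w
    return "".join(masked)

print(mask_frequent_words("GATTACAATATATATATATGCCGTA", w=4, max_count=2))
# GATTACAXXXXXXXXXXXXGCCGTA
```

The A-T rich repeat in the middle is replaced by Xs, so it no longer contributes words to the query list, while the flanking sequence is left untouched.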

BLAST Statistical significance
• A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of maximum segment pairs (MSPs) given w and T
• This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability
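
The slides do not give the formula, but the standard Karlin-Altschul estimate for the expected number of segment pairs scoring at least S is E = K·m·n·exp(-λS), where m and n are the query and database lengths. The sketch below applies it; the values passed for `k` and `lam` are placeholders, since the real constants depend on the scoring matrix and residue frequencies.

```python
import math

def expected_hits(score, query_len, db_len, k, lam):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)

def p_value(e_value):
    """Probability of seeing at least one such segment pair by chance."""
    return 1.0 - math.exp(-e_value)

# Placeholder constants, for illustration only.
for s in (30, 40, 50):
    e = expected_hits(s, query_len=250, db_len=1_000_000, k=0.1, lam=0.3)
    print(s, e, p_value(e))
```

Because E falls exponentially with the score, a modest increase in S moves a hit from likely chance to significant, which is what lets BLAST rank matches and cut off the listing at a user-specified level.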

Choosing Values for w and T
• Trade-off: sensitivity vs. running time
• Choosing a value for w
  • Small w: many matches to expand
  • Big w: many words to be generated
  • w = 3 or 4 is a good compromise
• Choosing a value for T
  • Small T: greater sensitivity, more matches to expand

BLAST Notes
• May fail to find optimal MSPs
  • May miss seeds if T is too stringent
• Empirically, 10 to 50 times faster than Smith-Waterman

Basic BLAST Family
• BLASTN
  • DNA to DNA database
• BLASTP
  • protein to protein database
• TBLASTN
  • protein to DNA database (translated)
• BLASTX
  • DNA (translated) to protein database
• TBLASTX
  • DNA (translated) to DNA database (translated)

BLAST Refinements
• Gapped alignments
• “Two-hit” method for extending word pairs
• Iterate with position-specific matrix (PSI-BLAST)
• Pattern-hit initiated BLAST (PHI-BLAST)

Editor's Notes

  • #4 Sensitivity: the ability of the method to find most of the members of the protein family represented by the query sequence. Selectivity: the ability of the method not to report known members of other families as false positives.