BLAST
06/01/2012
Introduction:
BLAST is an acronym for Basic Local Alignment Search Tool.
The BLAST program was developed by Stephen Altschul et al. of NCBI in 1990.
Like FASTA, it is a heuristic method.
It is one of the most popular programs for sequence analysis.
It enables a researcher to compare a query sequence with a library or database of sequences and identify library sequences that resemble the query sequence above a certain threshold.
The objective is to find high-scoring ungapped segments among related sequences.

Using BLAST
http://www.ncbi.nlm.nih.gov/BLAST
1. Select the BLAST program to use (blastn, blastp, blastx, tblastn, tblastx)
2. Select the database to search (different BLAST programs have different databases)
3. Enter the query sequence
4. Submit the search
Steps in BLAST
The sequence is optionally filtered to remove low-complexity regions (AGAGAG…).
The next step is to create a list of words from the query sequence.
Each word is typically 3 residues for protein sequences and 11 residues for DNA sequences.
The list includes every possible word extracted from the query sequence, i.e. every overlapping window of the word length.
This step is also called seeding.
PROTEIN WORDS
Query: GTQITVEDLFYNIATRRKALKN
Word Size = 3 (word size can be 2 or 3; default = 3)

Make a lookup table of words:
GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...

Neighborhood Words (for the word ITV): LTV, MTV, ISV, LSV, etc.
NUCLEOTIDE WORDS
Query: GTACTGGACATGGACCCTACAGGAA
Word Size = 11 (minimum word size = 7; blastn default = 11; megablast default = 28)

Make a lookup table of words:
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
...
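The "make a lookup table of words" step can be sketched in a few lines of Python. This is only an illustration of the idea, not the data structure BLAST itself uses; the function name build_word_lookup is hypothetical.

```python
from collections import defaultdict


def build_word_lookup(query, word_size):
    """Map each overlapping word of length word_size to its start positions in the query."""
    lookup = defaultdict(list)
    for start in range(len(query) - word_size + 1):
        lookup[query[start:start + word_size]].append(start)
    return dict(lookup)


# The two queries from the slides, with their default word sizes.
protein_words = build_word_lookup("GTQITVEDLFYNIATRRKALKN", 3)
dna_words = build_word_lookup("GTACTGGACATGGACCCTACAGGAA", 11)

print(list(protein_words)[:8])
print(list(dna_words)[:8])
```

A query of length L yields L - w + 1 overlapping words of size w, so the 22-residue protein query gives 20 words and the 25-base DNA query gives 15.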


The third step is to search a sequence database for the occurrence of these words.
This step identifies database sequences containing the matching words.
Using a substitution score matrix, the query sequence words are evaluated for matches with any database sequence, and the (log-odds) scores of the aligned residues are added.
A cut-off score (T) is selected to reduce the number of matches to the most significant ones.
The above procedure is repeated for each word in the query sequence.
The remaining high-scoring words are organised into an efficient search tree and rapidly compared to the database sequences.
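As a rough illustration of scoring words with a substitution matrix and applying a cut-off T, the sketch below scores the neighborhood words from the protein example (LTV, MTV, ISV, LSV) against the query word ITV. The matrix is only a small excerpt of BLOSUM62 covering the residues needed here, and the threshold value of 10 is an assumption chosen for the example, not a BLAST default.

```python
# Excerpt of BLOSUM62 (symmetric), limited to the residue pairs used below.
BLOSUM62_EXCERPT = {
    ("I", "I"): 4, ("I", "L"): 2, ("I", "M"): 1,
    ("T", "T"): 5, ("T", "S"): 1,
    ("V", "V"): 4,
}


def pair_score(a, b, matrix):
    """Look up a substitution score, trying both residue orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def word_score(query_word, candidate, matrix):
    """Sum the substitution scores of the aligned residue pairs."""
    return sum(pair_score(q, c, matrix) for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, matrix, threshold):
    """Keep only candidate words scoring at least the cut-off T against the query word."""
    return [w for w in candidates if word_score(query_word, w, matrix) >= threshold]


candidates = ["ITV", "LTV", "MTV", "ISV", "LSV"]
for word in candidates:
    print(word, word_score("ITV", word, BLOSUM62_EXCERPT))

# T = 10 is an illustrative value only.
print(neighborhood("ITV", candidates, BLOSUM62_EXCERPT, threshold=10))
```

Raising T shortens the word list and speeds up the search at the cost of sensitivity; lowering it does the opposite.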



If a good match is found, an alignment is extended from the match area in both directions as far as the score continues to grow.
The extension continues until the score of the alignment drops below a threshold due to mismatches (the drop threshold is twenty-two for proteins and twenty for DNA).
The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP).
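The ungapped extension of a word hit into an HSP might be sketched as follows. This assumes one common reading of the drop threshold: extension in each direction stops once the running score falls more than x_drop below the best score seen so far. The function names, the +1/-3 match/mismatch scores and the small x_drop value are assumptions for illustration only.

```python
def extend_one_way(query, subject, q, s, step, score_fn, x_drop):
    """Walk from (q, s) in one direction without gaps.

    Returns (best_score, best_length): the highest running score seen and
    the number of residue pairs consumed to reach it.
    """
    running = best = best_length = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += score_fn(query[q], subject[s])
        length += 1
        if running > best:
            best, best_length = running, length
        elif best - running > x_drop:
            break  # score fell too far below the best seen so far
        q += step
        s += step
    return best, best_length


def ungapped_extend(query, subject, q_pos, s_pos, word_size, score_fn, x_drop):
    """Extend a word hit in both directions; return (score, query range, subject range)."""
    seed = sum(score_fn(query[q_pos + i], subject[s_pos + i]) for i in range(word_size))
    right, right_len = extend_one_way(
        query, subject, q_pos + word_size, s_pos + word_size, +1, score_fn, x_drop)
    left, left_len = extend_one_way(
        query, subject, q_pos - 1, s_pos - 1, -1, score_fn, x_drop)
    length = left_len + word_size + right_len
    q_from, s_from = q_pos - left_len, s_pos - left_len
    return seed + left + right, (q_from, q_from + length), (s_from, s_from + length)


def dna_score(a, b):
    """Assumed scoring: +1 for a match, -3 for a mismatch."""
    return 1 if a == b else -3


query, subject = "TTACGTACGG", "GGACGTACTT"
hsp = ungapped_extend(query, subject, 3, 3, 4, dna_score, x_drop=2)
print(hsp, query[hsp[1][0]:hsp[1][1]])
```

Here the 4-base seed CGTA grows by one matching base on each side and then stops, because the next mismatch drops the running score by more than x_drop.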
In the original version of BLAST, the highest-scoring HSPs are presented as the final report.

A later improvement in the implementation of BLAST is the ability to provide gapped alignments.
In gapped BLAST, the highest-scoring segment is chosen to be extended in both directions using dynamic programming, where gaps may be introduced.
The extension continues as long as the alignment score stays above a certain threshold; otherwise it is terminated.
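To show what "dynamic programming where gaps may be introduced" means, here is a plain Smith-Waterman-style local alignment score with a linear gap penalty. It is a stand-in, not the gapped BLAST procedure: it fills the whole matrix rather than extending outward from a seed HSP, and it has no drop-off threshold. The scoring values are assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-3, gap=-2):
    """Best local alignment score of a and b, allowing gaps (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(
                0,                      # start a new local alignment
                diagonal,               # align a[i-1] with b[j-1]
                previous[j] + gap,      # gap in b
                current[j - 1] + gap,   # gap in a
            )
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# One gap opposite the extra T lets all six remaining residues match.
print(local_alignment_score("AAACCC", "AAATCCC"))
```

With these scores the gapped alignment (six matches, one gap) beats any ungapped segment of the same two sequences, which is the situation gapped BLAST is meant to capture.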

BLAST Output
1. An introduction that tells where the search occurred and what database and query were compared
2. A list of the sequences in the database containing segment pairs whose scores were least likely to occur by chance
3. Alignments of the high-scoring segment pairs showing identical and similar residues
4. A complete list of the parameter settings used for the search
BLAST Variants
Program    Query sequence             Database sequence
BLASTP     protein                    protein
BLASTN     nucleic acid               nucleic acid
BLASTX     translated nucleic acid    protein
TBLASTN    protein                    translated nucleic acid
TBLASTX    translated nucleic acid    translated nucleic acid
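The variants listed above amount to a lookup from (query type, database type) to program name. A minimal sketch, with a hypothetical helper name choose_program:

```python
# (query sequence type, database sequence type) -> BLAST program
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleic acid", "nucleic acid"): "blastn",
    ("translated nucleic acid", "protein"): "blastx",
    ("protein", "translated nucleic acid"): "tblastn",
    ("translated nucleic acid", "translated nucleic acid"): "tblastx",
}


def choose_program(query_type, database_type):
    """Return the BLAST variant for a query/database type pair, or raise ValueError."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no BLAST variant for query={query_type!r}, database={database_type!r}"
        ) from None


print(choose_program("translated nucleic acid", "protein"))
```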
Databases available on BLAST Web server
Database - Description
A. Peptide sequence databases
1. nr - translations of GenBank DNA sequences with redundancies removed, PDB, SwissProt, PIR, and PRF
2. month - new or revised entries or updates to nr in the previous 30 days
3. Swissprot - latest release of the SwissProt protein sequence database
4. Drosophila genome - provided by Celera and Berkeley Drosophila genome project
5. yeast - yeast (Saccharomyces cerevisiae) genomic sequences
6. E. coli - E. coli genomic sequences
7. pdb - sequences of proteins of known three-dimensional structure from the Brookhaven Protein Data Bank
8. yeast - yeast (S. cerevisiae) protein sequences
9. E. coli - E. coli genomic coding sequence translations
10. kabat [kabatpro] - Kabat’s database of sequences of immunological interest
11. alu - translations of select Alu repeats from REPBASE, a database of sequence repeats
B. Nucleotide sequence databases
1. nr - GenBank, EMBL, DDBJ, and PDB sequences with redundancies removed (EST, STS, GSS, and HTGS sequences excluded)
2. month - new or revised entries or updates to nr in the previous 30 days
3. dbest - EST sequences from GenBank, EMBL, and DDBJ with redundancies removed
4. dbsts - STS sequences from GenBank, EMBL, and DDBJ with redundancies removed
5. htgs - high-throughput genomic sequences
6. kabat [kabatnuc] - Kabat’s database of sequences of immunological interest
7. vector - vector subset of GenBank
8. mito - database of mitochondrial sequences
9. alu - select Alu repeats from REPBASE, a database of sequence repeats; suitable for masking Alu repeats from query sequences
10. epd - eukaryotic promoter database
11. gss - genome survey sequences, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences
Difference between BLAST and FASTA
BLAST                                                         FASTA
Uses a substitution matrix to find matching words             Uses the hashing procedure
Word size: protein = 3; DNA = 11                              K-tuple: protein = 2; DNA = 4-6
Faster than FASTA                                             Slower than BLAST
Higher specificity than FASTA due to low-complexity masking   Lower specificity
E-value (expectation value)
An important statistical indicator in sequence alignment.
It indicates how likely it is that the alignments resulting from a database search are caused by random chance; strictly, it is the number of equally good matches expected by chance in a database of that size, so it can exceed 1.
The lower the E-value, the less likely the database match is a result of random chance and therefore the more significant the match is.
Formula
The E-value is determined by the equation
E = m × n × P
where
m is the total number of residues in the database,
n is the number of residues in the query sequence, and
P is the probability that an HSP alignment is a result of random chance.
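A small worked example of E = m × n × P, with made-up numbers chosen only to show how the E-value scales with database size; the function name e_value is hypothetical.

```python
def e_value(db_residues, query_residues, hsp_probability):
    """E = m * n * P, as defined above."""
    if not 0.0 <= hsp_probability <= 1.0:
        raise ValueError("P must be a probability between 0 and 1")
    return db_residues * query_residues * hsp_probability


# Illustrative values: a 1,000,000-residue database, a 100-residue query, P = 1e-10.
small_db = e_value(1_000_000, 100, 1e-10)
# Same query and P against a database ten times larger.
large_db = e_value(10_000_000, 100, 1e-10)

print(small_db, large_db)
```

The same alignment is ten times less significant in the larger database, which is why E-values from searches against different databases are not directly comparable.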

Bit Score
A bit score is another prominent statistical indicator used in addition to the E-value in a BLAST output.
The bit score measures sequence similarity independent of query sequence length and database size and is normalized based on the raw pairwise alignment score.
Formula
The bit score (S) is determined by the following formula:
S = (λ × s − ln K) / ln 2
where
λ is the Gumbel distribution constant,
s is the raw alignment score, and
K is a constant associated with the scoring matrix used.
Thus, the bit score (S) is linearly related to the raw alignment score (s).
Hence, the higher the bit score, the more significant the match is.
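The bit score formula can be checked numerically with a short sketch. The λ and K values below are placeholders of a plausible magnitude, assumed for illustration; real values depend on the scoring matrix and gap penalties and are reported in the BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """S = (lambda * s - ln K) / ln 2."""
    if lam <= 0 or k <= 0:
        raise ValueError("lambda and K must be positive")
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed, illustrative parameters (not taken from any particular search).
LAMBDA, K = 0.318, 0.134

for raw in (50, 100, 150):
    print(raw, round(bit_score(raw, LAMBDA, K), 1))
```

Each extra unit of raw score adds the same amount, λ / ln 2, to the bit score, which is the linear relationship stated above.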
