PRESENTED BY :
ARUNDHATI MEHTA
BLAST
( A Bioinformatics tool )
© Arundhati Mehta 2016
BIOINFORMATICS
It is the science of managing and
analyzing biological data (information
associated with biomolecules such as DNA,
RNA and proteins) using advanced
computing techniques.
Software tools for bioinformatics range
from simple command-line tools to more
complex graphical programs and standalone
web services available from various
bioinformatics companies or public
institutions.
Introduction to BLAST
It is a sequence similarity search program for
comparing biological sequences, such as the amino acid
sequences of proteins or the nucleotide sequences of
DNA, against a sequence database or library of
sequences.
It is an in silico hybridisation experiment used
to identify significant similarities between a query
sequence and the library sequences.
BLAST stands for :
B - Basic
L - Local
A - Alignment
S - Search
T - Tool
BLAST was designed by Eugene Myers, Samuel Karlin, Stephen Altschul,
Warren Gish, David J. Lipman and Webb Miller (1990, 1994, 1997) at the
National Institutes of Health and was published in the Journal of Molecular Biology
in 1990.
It was originally developed and is maintained by NCBI.
Link: http://www.ncbi.nlm.nih.gov/BLAST/
NCBI Home Page
NCBI- BLAST Home Page
EMBL-EBI Home Page
EMBL-EBI BLAST Page
BLAST - Input & Output
Input: FASTA format, GenBank format
Output: HTML format, XML format, plain text format
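As an illustration of the FASTA input format listed above, here is a minimal Python sketch of a FASTA reader. The function name parse_fasta and the example records are hypothetical and are not part of BLAST itself; the sketch only assumes the usual convention of a ">" header line followed by one or more sequence lines.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated and upper-cased.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example input: one protein record and one DNA record.
example = """>query1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>query2
acgtacgt
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

Each record keeps its full header line (without the ">") and the sequence joined across lines, which is the form a query would take before being submitted to a search.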
The default database is the
non-redundant (nr)
database maintained by
NCBI.
BLAST programs that compare
protein sequences use a
substitution scoring matrix
(BLOSUM or PAM) to determine
pairwise raw alignment scores;
Blastn uses simple match and
mismatch scores instead.
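To make the idea of a substitution scoring matrix concrete, the sketch below sums pairwise scores over an ungapped alignment. The small matrix TOY_MATRIX covers only four residues and is an illustrative assumption, not a complete BLOSUM or PAM table; the function names are hypothetical.

```python
# Illustrative scores for four residues only (an assumption for this sketch,
# not a complete BLOSUM or PAM matrix).
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "L"): -1, ("A", "W"): -3,
    ("G", "G"): 6, ("G", "L"): -4, ("G", "W"): -2,
    ("L", "L"): 4, ("L", "W"): -2,
    ("W", "W"): 11,
}


def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up the score for a residue pair; the matrix is symmetric."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def ungapped_raw_score(seq1, seq2, matrix=TOY_MATRIX):
    """Sum the substitution scores of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


score = ungapped_raw_score("AGLW", "AGWL")
print(score)
```

Identical residues contribute positive scores and unlikely substitutions contribute negative ones, so the total reflects how plausible the aligned pair is under the chosen matrix.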
BLAST PROCESS
BLAST works through the use of a heuristic algorithm, an
algorithm that produces an acceptable solution to a
problem in many practical scenarios and is much faster than
classical methods. Heuristics are typically used when no
known method can find an optimal solution quickly enough under
the given constraints.
Using this approach, BLAST finds homologous sequences not by
comparing the two sequences in their entirety, but rather by
locating short matches between them.
The short sets of common letters used to find
homologous sequences are known as WORDS.
SEEDING: Find similar words between the query and each database sequence.
EXTENSION: Extend such words to obtain high-scoring segment pairs (HSPs).
EVALUATION: Calculate the statistical significance of each HSP analytically.
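The seeding and extension steps above can be sketched in Python as follows. This is a simplified illustration, not the actual BLAST implementation: it assumes exact word matches only, an ungapped extension, and hypothetical parameter names and defaults (word_size, match, mismatch, x_drop).

```python
def find_seeds(query, subject, word_size=3):
    """SEEDING: return (query_pos, subject_pos) for every exact shared word."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, qpos, spos, word_size=3,
                match=1, mismatch=-2, x_drop=3):
    """EXTENSION: grow a seed without gaps in both directions.

    Extension in a direction stops once the running score falls x_drop
    below the best score seen. Returns (score, q_start, q_end, s_start).
    """
    score = best = word_size * match
    steps = best_right = 0
    i, j = qpos + word_size, spos + word_size
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        steps += 1
        if score > best:
            best, best_right = score, steps
        elif best - score >= x_drop:
            break
        i, j = i + 1, j + 1

    score = best  # restart from the best right-hand extension
    steps = best_left = 0
    i, j = qpos - 1, spos - 1
    while i >= 0 and j >= 0:
        score += match if query[i] == subject[j] else mismatch
        steps += 1
        if score > best:
            best, best_left = score, steps
        elif best - score >= x_drop:
            break
        i, j = i - 1, j - 1

    q_start = qpos - best_left
    q_end = qpos + word_size + best_right
    return best, q_start, q_end, spos - best_left


query, subject = "ACGTTGCA", "TTACGTTGGA"
seeds = find_seeds(query, subject, word_size=4)
hsp = extend_seed(query, subject, *seeds[0], word_size=4)
print(seeds, hsp)
```

The first seed grows into the segment pair covering query[0:6] and subject[2:8]; the EVALUATION step would then judge whether a score of that size is likely to arise by chance.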
BLAST Types
Amino acid sequence query: Blastp, tBlastn
DNA sequence query: Blastn, Blastx, tBlastx
• Blastp : compares a protein query
against a protein sequence database.
• tBlastn : compares a protein query
against all six reading frames of a
translated nucleotide sequence database.
• Blastn : compares a nucleotide query
against a nucleotide sequence database.
• Blastx : compares the six-frame conceptual
translation products of a nucleotide
query sequence (both strands) against a
protein sequence database.
• tBlastx : compares the six-frame translations
of a nucleotide query against the six-frame
translations of a nucleotide sequence
database.
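The six-frame conceptual translation used by Blastx, tBlastn and tBlastx can be illustrated with a short Python sketch. It assumes the standard genetic code and uses hypothetical function names; frames are labelled +1 to +3 for the given strand and -1 to -3 for its reverse complement.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate a nucleotide sequence in all six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six protein strings would then be compared at the protein level, which is why translated searches can detect similarity that is no longer obvious at the nucleotide level.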
BLAST Search
Graphic summary
• The query sequence is at the top, with a
colour key for alignment scores.
• Each bar represents the portion of
another sequence that is similar to
your query sequence:
Red bars: most similar sequences
Pink bars: good but weaker matches
Green bars: moderate matches
Blue bars: poor matches
Black bars: lowest-scoring hits
1 - This portion of each description links to the sequence record for a particular hit.
2 - The score, or bit score, is a value calculated from the number of gaps and substitutions
associated with each aligned sequence. The higher the score, the more significant the
alignment. Each score links to the corresponding pairwise alignment between the query sequence
and the hit sequence (also referred to as the subject sequence).
3 - The E Value (Expect Value) is the number of hits with a similar or better score that are
expected to occur in the database by chance. The smaller the E Value, the more significant the alignment.
4 - These links give the user direct access from BLAST results to related entries in
other databases. 'L' links to LocusLink records and 'S' links to structure records in NCBI's
Molecular Modeling Database (MMDB).
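The relationship between score and E Value can be sketched numerically. The example below assumes the standard Karlin-Altschul relations, bit score S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'), where m and n are the query and database lengths. The function names and the numeric values used are illustrative assumptions, not figures from this presentation.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score to a bit score: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance hits scoring at least `bits`: m * n * 2^-S'."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative numbers only: a 100-residue query against 1,000,000 residues.
weak = expect_value(40.0, 100, 1_000_000)
strong = expect_value(60.0, 100, 1_000_000)
print(weak, strong)
```

Every additional bit halves the E Value, and a larger database raises it, which is why the same alignment can look more or less significant depending on the database searched.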
Percentage of identity: This gives you a concrete complement to the E-value. An
identity of more than 25 percent is good news. (The identity is the number of identical
residues divided by the number of matched residues; gaps are simply ignored.)
The Positives field gives the fraction of residues that are either identical or
similar; similar residues are represented with a + on the actual alignment.
The Gaps field shows the number of positions where a residue is aligned to a gap.
Length : the length of the alignment produced by BLAST.
Top sequence : query sequence
Bottom sequence : hit (referred to as the subject sequence)
Line between sequences : letter (identical residues), + sign (similar amino acids),
space (mismatch)
XXXX region : low-complexity segments
Numbers : at the start and end of each line, indicate the coordinates of the match on
the query and on the hit sequence.
BLAST Statistics
Raw score
R = aI + bX - cO - dG
Percentage of identities
% I = (No. of identical residues / No. of matched residues) x 100
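A worked example may help with the two formulas above. The slide does not define the symbols, so this sketch assumes the common reading: I = identities, X = mismatches, O = gap openings, G = gap characters, with a the match reward, b the (negative) mismatch score, c the gap-opening penalty and d the per-gap-character penalty. The function name and default values are hypothetical.

```python
def alignment_stats(query_aln, subject_aln, a=1, b=-3, c=5, d=2):
    """Compute R = a*I + b*X - c*O - d*G and percent identity.

    Both strings must be the same length and use '-' for gaps. A run of
    consecutive gapped columns counts as one gap opening (a simplification).
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")
    identities = mismatches = gap_opens = gap_chars = 0
    in_gap = False
    for q, s in zip(query_aln, subject_aln):
        if q == "-" or s == "-":
            gap_chars += 1
            if not in_gap:
                gap_opens += 1
            in_gap = True
        else:
            in_gap = False
            if q == s:
                identities += 1
            else:
                mismatches += 1
    raw = a * identities + b * mismatches - c * gap_opens - d * gap_chars
    matched = identities + mismatches  # gaps are ignored
    percent = 100.0 * identities / matched if matched else 0.0
    return {
        "identities": identities,
        "mismatches": mismatches,
        "gap_opens": gap_opens,
        "gap_chars": gap_chars,
        "raw_score": raw,
        "percent_identity": percent,
    }


stats = alignment_stats("ACGT-ACGT", "ACCTTAC-T")
print(stats)
```

In this example there are 6 identities, 1 mismatch and 2 single-character gaps, so R = 6 - 3 - 10 - 4 = -11 and the identity is 6 of 7 matched residues, about 85.7 percent.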
Applications of BLAST
BLAST can be used for several purposes.
These include:
• Identifying Species:
With BLAST, you may be able to
identify a species and/or find
closely related species. This can be useful, for
example, when working with a DNA
sequence from an unknown species.
• Establishing Phylogeny:
Using the results returned by BLAST,
one can create a phylogenetic tree on
the BLAST web page.
• DNA Mapping:
When you are working with a known species and want
to sequence a gene at an unknown location, BLAST
can compare the chromosomal position of the
sequence of interest with relevant sequences in the
database(s).
• Locating Domains:
When working with a protein sequence, you can
input it into BLAST to locate known domains
within the sequence of interest.
• Comparison:
When working with genes, BLAST can locate
common genes in two related species and can be
used to map annotations from one organism to
another.
QUESTIONS
???