FASTA is a sequence alignment tool that was developed before BLAST. It uses a hashing strategy to find matches between k-tuples, or short stretches of identical residues, in query and target sequences. FASTA breaks sequences down into k-tuples and searches target databases to find similarities. While faster than dynamic programming, FASTA and BLAST may not find optimal alignments or true homologs.
2. FASTA stands for “fast-all” or “FastA”.
It was the first database similarity search tool developed, preceding the development of
BLAST.
FASTA is a sequence alignment tool used to search for similarities between DNA
sequences or between protein sequences.
FASTA uses a “hashing” strategy to find matches for short stretches of identical residues
of length k. Each such string of residues is known as a k-tuple or ktup. K-tuples are
equivalent to words in BLAST, but are normally shorter.
Typically, a ktup is composed of two residues for protein sequences and six residues for
DNA sequences.
The query sequence is thus broken down into sequence patterns or words known as
k-tuples, and the target sequences are searched for these k-tuples in order to find the
similarities between the two.
FASTA is a fine tool for similarity searches.
Heuristic methods such as FASTA and BLAST are not guaranteed to find the optimal
alignment or true homologs, but are 50–100 times faster than dynamic programming.
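The k-tuple lookup described above can be illustrated with a minimal sketch. It assumes plain Python strings for sequences; the function names `build_ktup_index` and `ktup_hits` are hypothetical and not part of the FASTA program. It shows only the hashing idea: index every k-tuple of the query once, then look up each k-tuple of the target.

```python
from collections import defaultdict


def build_ktup_index(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def ktup_hits(query, target, k):
    """Return (query_pos, target_pos) pairs where identical k-tuples start."""
    index = build_ktup_index(query, k)
    hits = []
    for j in range(len(target) - k + 1):
        # A dictionary lookup replaces a scan of the whole query.
        for i in index.get(target[j:j + k], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Toy DNA example with k = 3 (real searches typically use 6 for DNA).
    print(ktup_hits("ACGTAC", "TTACGT", k=3))
```

Because the index is built once per query, each target position costs a single lookup, which is where the speed advantage over dynamic programming comes from.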
3. FastA - Compares a DNA query sequence to a DNA
database, or a protein query to a protein database,
detecting the sequence type automatically.
Versions 2 and 3 are in common use, version 3
having a highly improved score normalization
method. It significantly reduces the overlap between
the score distributions.
FASTX - Compares a DNA query to a protein
database. It may introduce gaps only between
codons.
FASTY - Compares a DNA query to a protein
database, optimizing gap location, even within
codons.
TFASTA - Compares a protein query to a DNA
database.
6. FASTA is used for:
• Identification of species.
• Establishment of phylogeny.
• DNA mapping.
• Understanding the biochemical functions of proteins.
• Studying the evolution of a species: where that species evolved from, and which
species are its ancestors.
• Calculation of molecular weight.
• Identification of mutations, by comparing sequences with reference sequences.
7. Basic steps
Step 1: Hashing: Set a word size, usually 6 for DNA and 2 for protein. The length of the matched word is called the k-tuple parameter. FASTA locates regions of the query sequence and matching regions in the database sequences that have high densities of exact word matches (without gaps).
Step 2: Scoring: The ten highest-scoring regions are rescored using the BLOSUM50 scoring matrix. The score of the highest-scoring region is saved as the init1 score.
Step 3: Introduction of gaps: FASTA determines whether any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined. The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap. The score of the highest-scoring region at the end of this step is saved as the initn score.
Step 4: Alignment: After computing the initial scores, FASTA determines the best segment of similarity between the query sequence and the search set sequence, using a variation of the Smith-Waterman algorithm. The score for this alignment is the opt score.
Step 5: Random sequence simulation: To evaluate the significance of such an alignment, FASTA empirically estimates the score distribution from the alignment of many random pairs of sequences. More precisely, the residues of the query sequence are shuffled (which preserves its length and composition) and searched against a random subset of the database. This empirical distribution is extrapolated, assuming it is an extreme value distribution, and each alignment to the real query is assigned a Z-score and an E-value.
Modifications: In step 4, use a band around init1
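As a rough illustration of step 1, the sketch below counts exact k-tuple matches per diagonal, where a diagonal is identified by the offset between the target and query positions. The names `diagonal_hit_counts` and `best_diagonals` are hypothetical, and the sketch deliberately stops at counting: it does not rescore with BLOSUM50, join regions, or run Smith-Waterman.

```python
from collections import Counter


def diagonal_hit_counts(query, target, k):
    """Count exact k-tuple matches on each diagonal (target_pos - query_pos)."""
    positions = {}
    for i in range(len(query) - k + 1):
        positions.setdefault(query[i:i + k], []).append(i)

    counts = Counter()
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], ()):
            # Matches with the same offset lie on the same diagonal.
            counts[j - i] += 1
    return counts


def best_diagonals(query, target, k, top=10):
    """Return up to `top` (diagonal, hit_count) pairs, densest first."""
    return diagonal_hit_counts(query, target, k).most_common(top)


if __name__ == "__main__":
    print(best_diagonals("GATTACA", "CCGATTACAGG", k=3))
```

A diagonal with many hits suggests an ungapped region of similarity; in the procedure described above, the ten best such regions are the ones carried forward to rescoring.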
8. FASTA calculates significance “on the fly”.
This can be problematic if the dataset is
small. To identify an unknown protein
sequence, use one of these programs: FASTA3,
SSEARCH3 or TFASTX3. FASTA3 has improved
methods of aligning sequences and of
calculating the statistical significance of
an alignment.
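To make the idea of an empirical significance estimate concrete, here is a toy sketch, not FASTA's actual statistics. It assumes a stand-in similarity score (the number of shared k-tuples) instead of the opt score, shuffles the query to build a background distribution, and reports a plain Z-score without any extreme value fit. All function names are hypothetical.

```python
import random
import statistics


def shared_ktup_count(a, b, k=2):
    """Stand-in score: number of k-tuples of a that also occur in b."""
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return sum(1 for i in range(len(a) - k + 1) if a[i:i + k] in words_b)


def shuffled_scores(query, target, n=200, k=2, seed=0):
    """Score n shuffled copies of the query against the target."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = []
    for _ in range(n):
        residues = list(query)
        rng.shuffle(residues)  # keeps length and composition, destroys order
        scores.append(shared_ktup_count("".join(residues), target, k))
    return scores


def z_score(score, background):
    """Number of standard deviations by which score exceeds the background mean."""
    sigma = statistics.pstdev(background)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (score - statistics.mean(background)) / sigma


if __name__ == "__main__":
    query, target = "ACGTACGTAC", "TTACGTACGG"
    background = shuffled_scores(query, target)
    print(z_score(shared_ktup_count(query, target), background))
```

With only a handful of background scores, the mean and standard deviation are poorly estimated, which is one way to see why significance computed on the fly is unreliable for small datasets.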
9. There is no standard filename extension for a
text file containing FASTA formatted
sequences. The table below shows each
extension and its respective meaning.
10. BLAST was introduced in 1990 by Stephen Altschul
and colleagues, with its scoring statistics based on
work by Samuel Karlin and Altschul.
• Compares nucleotide or amino acid sequences.
• Is a heuristic method.
• Is a fast but approximate method of
alignment.
• Locates local alignments by first finding
short matches called words.
12. blastp: compares a protein sequence against a
protein sequence database.
blastn: compares a nucleotide sequence against a
nucleotide sequence database.
blastx: compares a six frame translation of a
nucleotide sequence against a protein database
tblastn: compares a protein sequence against a
six frame translation of a nucleotide database
tblastx: compares a six frame translation of a
nucleotide sequence against a six frame
translation of a nucleotide database
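The six-frame translation used by blastx, tblastn and tblastx can be sketched as follows. This is an illustration only, assuming the standard genetic code and unambiguous A/C/G/T input; the function names and the frame labels "+1" to "-3" are choices made for this example, not BLAST's internal code.

```python
# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; a trailing partial codon is dropped
    and any codon not in the table becomes 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCC").items():
        print(name, protein)
```

Each of the six protein strings is then compared like an ordinary protein sequence, which is why the translated searches cost several times more than blastp or blastn.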
13. Blast searches begin with a query sequence
that will be matched against sequence
databases specified by the user.
• Begins by breaking down the query sequence
into a series of short overlapping “words”.
• Default word size for BLASTN is 28 nucleotides
(the megablast setting; the classic blastn algorithm uses 11).
• Default word size for BLASTP is 3 amino acids.
• Results obtained depend on the scoring matrix
used.
• BLOSUM62 is the default scoring matrix
for BLASTP.
14. Basic steps
Step 1: Set a word size, usually 11 for DNA and 3 for protein. Given the query sequence, compile the list of its overlapping words, e.g. with a word size of 2, qlnfsagw -> (ql, ln, nf, fs, sa, ag, gw), after filtering out low-complexity regions. Extend the list with similar words that form high-scoring word pairs with the query words (using some threshold T).
Step 2: Scan the database for exact matches to the words in the list compiled in step 1.
Step 3: Whenever a word in the list is found, try to extend the match in both directions (no gaps) to reach a score beyond a threshold S. While extending, a parameter L defines how long an extension will be tried to raise the score over S.
Modifications of step 3:
- Original BLAST: extension continues as long as the score continues to increase.
- BLAST2 (gapped BLAST): a lower value of T is used; after extension, try to combine the extended segments (allowing gaps) and find the maximal scoring segment.
This program uses the BLASTP or BLASTN algorithms for aligning two sequences.
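The seed-and-extend idea in these steps can be sketched in a few lines. This is a simplified illustration under several assumptions: seeds are exact query words only (no neighbourhood words from a threshold T), scoring is +1 for a match and -1 for a mismatch rather than a substitution matrix, and extension stops once the running score falls more than `drop` below the best score seen, a drop-off rule standing in for the parameter L. All names are hypothetical.

```python
def overlapping_words(seq, w):
    """Break seq into its overlapping words of length w."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def _extend(query, subject, q, s, step, match, mismatch, drop):
    """Walk from (q, s) in direction step; return (best gain, residues kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > drop:
            break  # the score has fallen too far below the best seen
        q += step
        s += step
    return best, best_len


def extend_seed(query, subject, q, s, w, match=1, mismatch=-1, drop=2):
    """Extend a word hit without gaps; return (score, query_start, query_end)."""
    right, r_len = _extend(query, subject, q + w, s + w, 1, match, mismatch, drop)
    left, l_len = _extend(query, subject, q - 1, s - 1, -1, match, mismatch, drop)
    return w * match + left + right, q - l_len, q + w + r_len


def find_hsps(query, subject, w, threshold_s):
    """Return (score, query_start, query_end, subject_start) for every
    extended segment scoring at least threshold_s, best first."""
    hsps = set()
    for q, word in enumerate(overlapping_words(query, w)):
        s = subject.find(word)
        while s != -1:
            score, q_start, q_end = extend_seed(query, subject, q, s, w)
            if score >= threshold_s:
                hsps.add((score, q_start, q_end, s - (q - q_start)))
            s = subject.find(word, s + 1)
    return sorted(hsps, reverse=True)


if __name__ == "__main__":
    print(overlapping_words("qlnfsagw", 2))
    print(find_hsps("AACGTTGG", "CCACGTTGA", w=3, threshold_s=5))
```

Several seeds usually extend to the same segment, so the sketch collects results in a set to report each segment once.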
15. BLAST calculates probabilities, and this can fail if
some assumptions are invalid for that search. There
are versions of BLAST for searching nucleic acid and
protein databases, including versions that translate
DNA sequences before comparing them to protein
sequence databases. Improvements introduced in 1997
are GAPPED-BLAST (three times faster than the
original BLAST) and PSI-BLAST (position-specific
iterated BLAST). The GAPPED-BLAST algorithm allows
gaps to be introduced into the alignments, so
similar regions are not broken into
several segments (as in the older versions). This
method reflects biological relationships much better
than ordinary BLAST.