In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.
2. Introduction
The FASTA program searches a biological database for nucleotide or protein
sequences that are similar to a query sequence.
Query: a nucleotide sequence or a protein sequence
Database: nucleotide sequences or protein sequences
3. FASTA Algorithm
It starts from a dot plot (dot matrix).
[Figure: dot plot with axes labelled "First Sequence (Query)" and "Second Sequence (Database)"; the example sequences are ABCDEF and ABMDLF.]
The dot plot shows regions of similarity between the two sequences as diagonals.
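A minimal sketch of a dot plot, using the two example sequences from the slide. Which sequence sits on which axis is an assumption here, and the function names `dot_plot` and `render` are illustrative only.

```python
def dot_plot(query, database):
    """Return a matrix that is True wherever query[i] == database[j]."""
    return [[q == d for d in database] for q in query]


def render(query, database):
    """Draw the matrix as text: one row per query residue, '*' marks a match."""
    grid = dot_plot(query, database)
    lines = ["  " + " ".join(database)]
    for residue, row in zip(query, grid):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Assumed layout: query residues as rows, database residues as columns.
print(render("ABMDLF", "ABCDEF"))
```

For these two sequences the four matches (A, B, D, F) all fall on the main diagonal, which is the kind of diagonal run the dot plot is meant to expose.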
4. FASTA Algorithm
• FASTA goes a step beyond the dot plot.
• It calculates the sum of dots along each diagonal.
• It is a word-based method.
• It looks for matching words, that is, short sequence patterns called k-tuples.
Tuple: Finite ordered list of elements
Sequence patterns: 1 or 2 amino acids, or 5 or 6 nucleotides
• Build local alignments from these words (k-tuples).
• Match identical words.
• Create diagonals by joining adjacent matches.
• Rescore the highest-scoring diagonal regions using a PAM or BLOSUM substitution matrix.
• The best of these scores is called init1.
• Join compatible segments, allowing gaps between them; the best score obtained this way is called initn.
• Use dynamic programming (the Smith-Waterman algorithm) to compute the optimal local alignment; its score is called opt.
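The first steps above (word lookup and counting matches per diagonal) can be sketched as follows. This is an illustration under simplifying assumptions, not the FASTA program's own code: the helper names are hypothetical, and rescoring with PAM or BLOSUM, joining segments, and the Smith-Waterman step are left out.

```python
from collections import defaultdict


def ktuple_index(sequence, k):
    """Map every k-letter word in the sequence to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, database, k):
    """Count identical k-tuple matches on each diagonal.

    A diagonal is identified by its offset: database position minus query position.
    """
    index = ktuple_index(query, k)
    hits = defaultdict(int)
    for db_pos in range(len(database) - k + 1):
        word = database[db_pos:db_pos + k]
        for q_pos in index.get(word, []):
            hits[db_pos - q_pos] += 1
    return dict(hits)


def best_diagonal(query, database, k):
    """Return (offset, hit count) for the diagonal with the most word matches."""
    hits = diagonal_hits(query, database, k)
    return max(hits.items(), key=lambda item: item[1]) if hits else None


# Example with k = 2, a typical word size for protein sequences.
print(diagonal_hits("HEAGAWGHEE", "PAWHEAE", 2))
print(best_diagonal("HEAGAWGHEE", "PAWHEAE", 2))
```

Indexing the query once and scanning the database with a lookup table is what makes the word-based approach faster than filling a full dot matrix.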
7. FASTA Output
• The Histogram
• The Sequence listing
• The Local alignments
8. FASTA Output: The Histogram
• The first part of the FASTA output is the histogram.
• The predicted extreme-value distribution is marked with an asterisk (*).
• The actual number of sequences observed is drawn with equals signs (=).
• First column: z-opt score
• Second column: number of sequences with this z-opt score
• Third column: expected number of alignments with this score
The histogram is used to judge whether the statistical theory is valid for the search.
• If the equals signs follow the predicted values: valid
• If the equals signs do not follow the predicted values: invalid
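To make the layout concrete, here is a rough sketch of how one histogram row could be drawn from the three columns. It is not FASTA's actual output routine: the function name, the scaling of one symbol per 10 sequences, and the counts in the example are all invented for illustration.

```python
import math


def histogram_row(score, observed, expected, per_symbol=10):
    """Draw one row: '=' bar for the observed count, '*' at the expected count."""
    bar_len = math.ceil(observed / per_symbol)
    star_at = math.ceil(expected / per_symbol)
    width = max(bar_len, star_at)
    cells = ["="] * bar_len + [" "] * (width - bar_len)
    if star_at > 0:
        cells[star_at - 1] = "*"  # the asterisk overwrites the bar at that column
    return f"{score:>4} {observed:>6} {expected:>6.0f}:" + "".join(cells)


# Made-up (score, observed, expected) triples, for illustration only.
for row in [(30, 18, 20), (40, 52, 48), (50, 31, 33), (60, 3, 25)]:
    print(histogram_row(*row))
```

In the first three rows the asterisk lands on or next to the end of the bar, which is the "valid" pattern; in the last row the asterisk sits well beyond the bar, the kind of mismatch that suggests the statistics do not fit.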
10. FASTA Output: The Sequence listing
• Listing of the best scoring sequences in the database.
• Best sequence: reported first
• Worst sequence: reported last
• First column: database, database accession number, and database identifier
• Second column: total length of the database sequence
• Opt column: final score
• Last column: E-value
12. FASTA Output: The Local alignments
This part of the output displays:
• The local alignment
• init1 and initn scores
• E-value
• Opt score
• Z-score
• Percent identity
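As an illustration of the last item, percent identity can be computed from a gapped alignment as sketched below. The function name is hypothetical, and dividing by the full alignment length (gap columns included) is an assumption; programs differ in which denominator they report.

```python
def percent_identity(aligned_query, aligned_subject, gap="-"):
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_query)
    if columns == 0:
        return 0.0
    identical = sum(
        1 for a, b in zip(aligned_query, aligned_subject) if a == b and a != gap
    )
    return 100.0 * identical / columns


# Five identical columns out of seven.
print(round(percent_identity("AWGHE-E", "AW-HEAE"), 1))
```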
13. Significance of E-Value
• The E-value (expected value) is the number of alignments with a score at
least this good that are expected to occur by chance in the database search.
• The smaller the E-value, the less likely it is that a given alignment
occurred by chance.
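A hedged sketch of the idea: if similarity scores of unrelated sequences follow an extreme value distribution, a standardized z-score (mean 0, variance 1) gives a probability, and multiplying by the number of database sequences gives an expected count. The function names are illustrative, and the way FASTA actually fits and scales its z-scores is not reproduced here.

```python
import math

EULER_GAMMA = 0.5772156649015329


def p_value(z):
    """P(Z >= z) for an extreme value distribution with mean 0 and variance 1."""
    exponent = -math.pi / math.sqrt(6.0) * z - EULER_GAMMA
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-math.exp(exponent))


def e_value(z, database_size):
    """Expected number of chance hits scoring at least z (assumed E = N * P)."""
    return database_size * p_value(z)


# The same z-score is less surprising in a larger database.
for size in (1_000, 100_000):
    print(size, e_value(5.0, size))
```

Higher z-scores give smaller E-values, and the E-value grows in proportion to the database size, which is why the same alignment can be significant in a small database and unremarkable in a large one.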
14. Variants of FASTA
• FASTA - Compares a DNA query sequence to a DNA database, or a
protein query to a protein database, detecting the sequence type
automatically.
• FASTX - Compares a DNA query to a protein database. It may
introduce gaps only between codons.
• FASTY - Compares a DNA query to a protein database, optimizing
gap location, even within codons.
• TFASTA - Compares a protein query to a DNA database.
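The list above amounts to a lookup from query type and database type to a program. A small sketch of that lookup, with a hypothetical table and function name (this is not part of the FASTA package):

```python
# Program names taken from the list above; the table itself is illustrative.
PROGRAMS = {
    ("dna", "dna"): ["FASTA"],
    ("protein", "protein"): ["FASTA"],
    ("dna", "protein"): ["FASTX", "FASTY"],
    ("protein", "dna"): ["TFASTA"],
}


def candidate_programs(query_type, database_type):
    """Return the variants that handle this query/database combination."""
    key = (query_type.lower(), database_type.lower())
    if key not in PROGRAMS:
        raise ValueError(f"unsupported combination: {key}")
    return PROGRAMS[key]


print(candidate_programs("DNA", "protein"))
```

For a DNA query against a protein database both FASTX and FASTY apply; the choice between them depends on whether gaps should be allowed within codons.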