“Discussing the FastA Homology Search Algorithm.”
BY: MUUNDA MUDENDA, MSc Molecular Biology and Biotechnology
Email: muundamudenda@gmail.com
FastA stands for “Fast All”. It is a sequence alignment algorithm that was developed in 1985
by Lipman and Pearson. The FastA algorithm uses heuristics to find local similarities between
nucleotide or protein sequences. FastA programs operate on four main parameters: Expect value
(E-value), true homology, threshold, and putative conserved domains.
According to Sagar Aryal (2019), the FastA algorithm uses a hashing strategy to find short
stretches of identical residues shared by two sequences. During a FastA search, the query is
broken into short words known as Ktuples or Ktups, which are then looked up in the target
sequences. By default, a Ktup is composed of two residues for proteins and six residues for DNA
sequences. Ktups are the equivalent of words in the BLAST algorithm.
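The Ktup lookup described above can be sketched in a few lines of Python. This is only a minimal illustration, not the actual FASTA implementation: the function names and the two toy protein sequences are hypothetical, and a plain dictionary stands in for the hash table.

```python
from collections import defaultdict


def build_ktup_index(query, ktup):
    """Map every ktup-length word in the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    return index


def find_hot_spots(query, target, ktup):
    """Return (query_pos, target_pos) pairs where identical ktup words occur."""
    index = build_ktup_index(query, ktup)
    hot_spots = []
    for j in range(len(target) - ktup + 1):
        # Look up each target word in the query index (no rescanning of the query).
        for i in index.get(target[j:j + ktup], []):
            hot_spots.append((i, j))
    return hot_spots


# Toy protein sequences (hypothetical), using the protein Ktup of two residues.
spots = find_hot_spots("HARFYAAQIVL", "VDMAAQIA", ktup=2)
print(spots)  # [(5, 3), (6, 4), (7, 5)]
```

All three matches have the same offset between query and target position (5 - 3 = 6 - 4 = 7 - 5 = 2), which means they lie on the same diagonal of the comparison matrix.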
The programs in the FastA package include FASTA, FASTX and FASTY, SSEARCH, GGSEARCH, and
GLSEARCH.
• FASTA: The most commonly used of the programs. It compares protein sequences to protein
databases. It can also be used to search nucleotide, genome, and whole-genome shotgun
databases.
• FASTX, FASTY: Used to compare translated nucleotide sequences to protein databases.
• SSEARCH: This program performs Smith-Waterman alignment of protein to protein or
nucleotide to nucleotide sequences.
• GGSEARCH: This program compares the query to database sequences using global alignment.
• GLSEARCH: This program compares DNA or protein sequences to sequences in
databases. The alignment is global in the query and local in the database
sequence.
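To illustrate the kind of local alignment that SSEARCH performs, the following is a minimal sketch of the Smith-Waterman score calculation. It is not SSEARCH itself: the match, mismatch, and gap values are arbitrary assumptions, a single linear gap penalty is used instead of a substitution matrix with separate gap-open and gap-extend penalties, and only the best score is returned, not the alignment.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty, assumed values)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Toy sequences (hypothetical); the shared segment AAQI gives 4 matches x 2 = 8.
print(smith_waterman_score("HARFYAAQIVL", "VDMAAQIA"))  # 8
```

A global alignment, as used by GGSEARCH, would differ mainly in that scores are not reset to zero and the alignment must span both sequences from end to end.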
The following are the steps involved in the FastA algorithm, adapted from Itshack Pe’er
(1999).
1. Specifying an integer parameter, Ktup, and looking for all matching substrings of length
Ktup in the two strings. Each such match is called a hot spot. The standard recommended
Ktup values are six for DNA sequence matching and two for protein sequence matching.
2. Finding the 10 best diagonal runs of hot spots in the matrix. A diagonal run is a sequence
of nearby hot spots on the same diagonal. A run need not contain all the hot spots on its
diagonal, and a diagonal may contain more than one of the 10 best runs found.
3. Evaluating the runs using an amino acid (or nucleotide) substitution matrix, and picking
the best-scoring run. The single best sub-alignment found in this stage is called init1.
The runs are then filtered, and diagonal runs with relatively low scores are discarded.
4. Constructing a directed weighted graph whose vertices are the sub-alignments found in
the previous stage, where the weight of each vertex is the score of the sub-alignment it
represents. Essentially, FASTA then finds a maximum-weight path in this graph, which
joins compatible sub-alignments from different diagonals. The best alignment found in
this stage is called initn. Low-scoring alignments are discarded.
5. Computing an alternative local alignment score, in addition to initn, using dynamic
programming restricted to a narrow band around init1. The best local alignment computed
in this stage is called opt.
6. In the last step, the database sequences are ranked according to their initn or opt scores,
and the full dynamic programming algorithm is used to align the query sequence against
each of the highest-ranking database sequences.
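Steps 1 to 3 above can be sketched as follows. This is a simplified illustration under stated assumptions, not the FASTA source code: the function names and toy sequences are hypothetical, whole diagonals are ranked by their hot spot count instead of scoring individual runs, and a fixed match/mismatch score stands in for a real substitution matrix such as BLOSUM or PAM.

```python
from collections import defaultdict


def find_hot_spots(query, target, ktup):
    """Step 1: (query_pos, target_pos) pairs of identical ktup-length words."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    return [(i, j)
            for j in range(len(target) - ktup + 1)
            for i in index.get(target[j:j + ktup], [])]


def top_diagonals(spots, n_best=10):
    """Step 2 (simplified): rank diagonals (i - j) by their number of hot spots."""
    counts = defaultdict(int)
    for i, j in spots:
        counts[i - j] += 1
    return sorted(counts, key=lambda d: (-counts[d], d))[:n_best]


def rescore_diagonal(query, target, diag, match=5, mismatch=-4):
    """Step 3: best ungapped segment score along one diagonal (assumed scores)."""
    best = running = 0
    i, j = max(diag, 0), max(-diag, 0)  # first cell of this diagonal
    while i < len(query) and j < len(target):
        running += match if query[i] == target[j] else mismatch
        running = max(running, 0)  # drop a prefix that only lowers the score
        best = max(best, running)
        i += 1
        j += 1
    return best


def init1_score(query, target, ktup):
    """Score of the single best rescored diagonal, in the spirit of init1."""
    diagonals = top_diagonals(find_hot_spots(query, target, ktup))
    return max((rescore_diagonal(query, target, d) for d in diagonals), default=0)


# Toy protein sequences (hypothetical): the segment AAQI matches on diagonal 2.
print(init1_score("HARFYAAQIVL", "VDMAAQIA", ktup=2))  # 20
```

Joining runs from different diagonals (initn) and the banded alignment (opt) are left out of this sketch; both build on the per-diagonal scores computed here.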
REFERENCES
EMBL-EBI. (2021). Sequence Similarity Searching. Link: ebi.ac.uk/tools/sss/
Itshack Pe’er. (1999). FASTA. Link: cs.tau.ac.il
Pearson, W. R. (2016). Finding Protein and Nucleotide Similarities with FASTA. Current
Protocols in Bioinformatics, 53, 3.9.1–3.9.25. https://doi.org/10.1002/0471250953.bi0309s53
Pearson, W. R., & Lipman, D. J. (1988). FASTA Sequence Comparison. P.N.A.S., 85:2444-2448.
Link: gen.tcd.ie/molevol/fasta.html
Sagar Aryal. (2019). FASTA and BLAST. Microbe Notes. Link: microbenotes.com
