This document provides instructions for students to conduct basic BLAST searches using unknown DNA sequences as queries to identify homologous sequences and determine the identity of the unknown sequences. It describes how to analyze the BLAST results, including the descriptions, alignments and significance values. Students are guided to find homologs of a known sequence on GenBank and explore the BLAST results, alignments and formatting options.
Overview on Edible Vaccine: Pros & Cons with Mechanism
Basic BLAST (BLASTn)
1. Asian University For Women
BINF 2000: Introduction to
Bioinformatics (Lab#03)
Basic BLAST (BLASTn)
Fall 2022
This content is prepared by A.K.M Moniruzzaman Mollah (Ph.D., Head, Science & Math Program, Associate Professor
of Life Sciences), Syed Mohammad Lokman (Adjunct Faculty, Bioinformatics & Environmental Sciences), Nabila
Ishaque (Adjunct Faculty, Life Sciences), and Fatema-Tuz-Zohora (Adjunct Faculty, Bioinformatics & Environmental
Sciences), Asian University For Women, Chittagong, Bangladesh.
References are cited in between the contents.
2. Lab#3: Basic BLAST (blastn)
One of the most important bioinformatic strategies used for the functional annotation of
genes and genomes is to predict the function of uncharacterized genes or proteins
based on their similarity to sequences with better functional annotations. BLAST is
perhaps the single most important tool for finding database sequences that are
similar to a query sequence of interest.
The Basic Local Alignment and Search Tool (BLAST; Altschul et al., 1997) is a very
powerful approach to identifying database sequences that share local similarity to a
query sequence (see below for definitions). There is a very important chain of
assumptions used in biological research that is generally followed when using BLAST:
● Homologous genes share sequence similarity
○ Orthologous genes have the highest similarity among multiple species
■ Orthologous genes most likely have similar functions
● Consequently, sequences that are most similar between
multiple species share similar functions.
Note, it is very important to understand that these are only assumptions, and there are
many reasons and instances where these assumptions prove to be false. Nevertheless,
they are a reasonable starting place.
3.1. Definitions:
● Similar sequences – sequences that share a significant number of residues
(nucleotides or amino acids). Sequences can be similar due to homology or
simply by chance. The higher the similarity between sequences, the more likely
they are to be homologous.
● Homologous sequences – sequences that are related through common
ancestry. Homology is qualitative – two sequences either are, or are not related
through common ancestry. Homologous sequences can vary greatly in their level
of similarity – from 100% to 0%.
3. ● Orthologous sequences – sequences that are related through a past speciation
event. Orthologous sequences are assumed to share common functions.
● Paralogous sequences – sequences that are related through a past gene
duplication event. Genes often diverge in function after duplicating; therefore,
paralogous sequences are not assumed to share a common function.
● Query sequence – your sequence; the sequence you are interested in finding
more about.
● High Scoring Segment Pair (HSP) – ‘hits’ to the database. A subsequence
match between your query sequence and a database sequence returned by
BLAST.
● Local alignment – a sequence alignment that extends only across part of the
sequence.
● Global alignment – a sequence alignment that extends across the entire
sequence (from end to end).
● HSP: high scoring sequence pair.
● FASTA format: FASTA format is used to represent either nucleotide or peptide
sequences. The first line is a comment line, beginning with “>” and describing the
sequence. All the following lines are the sequence, in plain text.
○ Example DNA sequence in FASTA format:
■ >gi|23423|ref|NM_23542.0| Homo sapiens protein
ATGAATCGATACGATAGCTAGCTATCGATGCAGATCAGAGAGGG
GCTTTAGCTAGCTAAGCTAG
4. ○ Example protein sequence in FASTA format:
■ >MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTE
AELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFR
VFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQ
VNYEEFVQMMTAK*
3.2. How does BLAST work?
BLAST identifies homologous sequences using a heuristic method which initially finds
short matches between two sequences; thus, the method does not take the entire
sequence space into account. After the initial match, BLAST attempts to start local
alignments from these initial matches. Four major steps of BLAST are:
1. Seeding:
a. Break sequence(s) into words, search for similar words in
database
b. Calculate E-value to select HSPs
c. Pass selected number of HSPs to extension step
2. Extension: Drawing the lines and generate alignment
3. Evaluation: Figuring out which lines are best (Perform
additional evaluations (%identity, %positives, %gaps)
4. Output: Writing the results out to disk
3.3. Types of BLAST
Method Option Query Type DB Type Comparison
Natural
BLAST
BLASTn Nucleotide Nucleotide Nucleotide-Nucleotide
BLASTp Protein Protein Protein-Protein
Translated
BLAST
BLASTx Nucleotide Protein Protein-Protein
tBLASTn Protein Nucleotide Protein-Protein
5. 3.4. Understanding BLAST results:
The amount of information on the BLAST website is a bit overwhelming — even for the
scientists who use it on a frequent basis! You are not expected to know every detail of
the BLAST program. BLAST results have the following fields:
E value: The E value (expected value) is a number that describes how many times you
would expect a match by chance in a database of that size. The lower the E value is,
the more significant the match.
• E-value < 10e-100 Identical sequences. You will get long alignments across the
entire query and hit sequence.
• 10e-50 < E-value < 10e-100 Almost identical sequences. A long stretch of the
query protein is matched to the database.
• 10e-10 < E-value < 10e-50 Closely related sequences.
• 1 < E-value < 10e-6 Could be a true homologue but it is a gray area.
• E-value > 1 Proteins are most likely not related
• E-value > 10 Hits are most likely junk unless the query sequence is very short.
Percent Identity: The percent identity is a number that describes how similar the query
sequence is to the target sequence (how many characters in each sequence are
identical). The higher the percent identity is, the more significant the match.
Query Cover: The query cover is a number that describes how much of the query
sequence is covered by the target sequence. If the target sequence in the database
spans the whole query sequence, then the query cover is 100%. This tells us how long
the sequences are, relative to each other.
3.4. Examples of BLAST usage
BLAST can be used for a lot of different purposes. A few of them are mentioned below.
• Looking for species. If you are sequencing DNA from unknown species, BLAST may
help identify the correct species or homologous species.
• Looking for domains. If you BLAST a protein sequence (or a translated nucleotide
sequence) BLAST will look for known domains in the query sequence.
6. • Looking at phylogeny. You can use the BLAST web pages to generate a
phylogenetic tree of the BLAST result.
• Mapping DNA to a known chromosome. If you are sequencing a gene from a known
species but have no idea of the chromosome location, BLAST can help you. BLAST will
show you the position of the query sequence in relation to the hit sequences.
• Annotations. BLAST can also be used to map annotations from one organism to
another or look for common genes in two related species.
Exercise 3. 1: Identify unknown sequences
The Sequences:
>Unknown sequence 1
AAATGAGTTAATAGAATCTTTACAAATAAGAATATACACTTCTGCTTAGGATGATAATTG
GAGGCAAGTGAATCCTGAGCGTGATTTGATAATGACCTAATAATGATGGGTTTTATTT
CCAGACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAAT
TAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCAC
CATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCA
TCAAAGCATGCCAACTAGAAGAGGTAAGAAACTATGTGAAAACTTTTTGATTATGCAT
ATGAACCCTTCACACTACCCAAATTATATATTTGGCTCCATATTCAATCGGTTAGTCTA
CATATATTTATGTTTCCTCTATGGGTAAGCTACTGTGAATGGATCAATTAATAAAACAC
ATGACCTATGCTTTAAGAAGCTTGCAAACACATGAA
>Unknown sequence 2
AAATGAGTTAATAGAATCTTTACAAATAAGAATATACACTTCTGCTTAGGATGATAATTG
GAGGCAAGTGAATCCTGAGCGTGATTTGATAATGACCTAATAATGATGGGTTTTATTT
CCAGACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAAT
TAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCAC
CATTAAAGAAAATATCATTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCA
AAGCATGCCAACTAGAAGAGGTAAGAAACTATGTGAAAACTTTTTGATTATGCATATG
AACCCTTCACACTACCCAAATTATATATTTGGCTCCATATTCAATCGGTTAGTCTACAT
ATATTTATGTTTCCTCTATGGGTAAGCTACTGTGAATGGATCAATTAATAAAACACATG
ACCTATGCTTTAAGAAGCTTGCAAACACATGAA
7. Procedure:
1. Navigate to the main BLAST page (https://blast.ncbi.nlm.nih.gov/Blast.cgi)
2. Select the appropriate type of BLAST for your sequence
3. Paste the “unknown sequence 1” into the box
4. Click on the “BLAST” button and wait for the results. BLAST is usually fairly quick
for short sequences, but should still take a few seconds.
5. Once the results are displayed, notice there are three main headings: Graphic
Summary, Descriptions, and Alignments (these may be expanded so you’ll
have to scroll down).
6. Use these results to answer the questions below.
7. Repeat steps 1-6 for the other “unknown sequence 2”.
Questions:
Q1. In the Descriptions section, look at the top result, which should be the result with
the highest score. Write down information about the best match:
● Description (no need to write the whole thing)
● E value
● Identity
● Query cover
Q2. Now scroll down to the Alignments heading. Look at the top result, which should be
the same one. Look at the alignment between your query and the reference. Do you see
any mismatches?
Q3. How can you judge whether this is a good match?
Q4. What is this gene? Google the name of the gene and write down something
significant you learned about it.
Q5. How are the “unknown sequence 1” and “unknown sequence 1” related?
8. Exercise 3.2: Finding Homologes
1. Visit the GenBank page of your known sequence NM_001336190.1
(https://www.ncbi.nlm.nih.gov/nuccore/NM_001336190.1)
2. Click on “Run BLAST” from the right panel (in the “Analyze This Sequence” part of
the webpage.)
3. On the BLAST page, note that under the Enter Query Sequence section, the NCBI
system has automatically entered the accession number (but you can also enter a GI
number, or FASTA sequence)
4. We want to query the full NCBI database; the NCBI linking system has automatically
changed the default Database (which is Human) to Other and Nucleotide collection
(nr/nt) because our sequence is non-human. The nr database is the non-redundant
collection of sequences in GenBank.
5. In the “program selection”, Change the Program Selected / Optimized for to
“Somewhat similar sequences (blastn).”
6. Open the Algorithm Parameters near the bottom.
7. Check the box next to “Show results in a new window” near the BLAST button.
Now (finally) click the BLAST button.
Questions:
Q6. When would you want to use megaBLAST? What about discontiguous
megaBLAST? (try each to see how your results differ)
Q7. What is the Expect threshold?
Q8. What would happen if you decreased it? Increased it?
Q9. What would be the effect of increasing the Word size?
Q10. Why is there a Low complexity regions filter? Should we keep it on?
9. 8. Analyze the Results Page:
The results page is broken up into sections. At the very top is the job summary, which
simply shows details about your query and the database searched. You can find more
details about your search by clicking Search Summary.
Questions:
Q11. How many sequences are in the nr database?
Q12. What sequences are not included in the nr database? (Trick question: this
information is actually available by clicking on the question mark beside the
Database option on the input page!)
9. Explore the Graphic Summary tab. Scroll your mouse over the coloured bars.
Questions:
Q13. What do the coloured bars mean?
Q14. How does the color code work?
Q15. What information is displayed when you hover on an entry?
Q16. What do you notice about the significance values as you move down the graphical
summary?
Q17. What is the genus and species of the top (best) hit?
Q18. What happens if you click on one of the entries?
10. The Descriptions tab lists the following:
● Description [hyperlinked to corresponding Alignment(s) in Alignments section]
● Max Score – the alignment bit score
● Total Score – another alignment bit score which may differ from the Max Score if
your query matched a single database entry in multiple regions.
● Query Coverage – what percent of the query had similarity to the database hit.
● E-value – probably the best measure of hit quality. Smaller numbers mean better
● hits, with 0.0 being the best value possible.
● Identity – the highest identity found between query and HSP.
● Accession – linked to the indicated sequence at NCBI
10. Questions:
Q19. How many sequence matches are listed for this query sequence? How are they
ordered? (you can sort these segments in other ways, like by identity, score, and
query start position.)
Q20. What happens if you click the Accession hotlink?
11. Finally we can explore the actual HSP Alignments in the Alignments tab.
Compare the information presented for the first HSP alignment to the first entry in the
graphical summary and HSP summary. As you scroll down the alignments, you will see
the alignment quality drop – that is, the e-value increases.
Questions:
Q22. What do the vertical bars ( | )represent between the Query and the Sbjct
(database sequence)?
Q23. What does Strand=Plus/Plus, Strand=Plus/Minus mean? Hint: are genes
always in the same direction on a piece of chromosomal DNA?
12. Go back to the top of the page and click Formatting options. Change the
Alignment View to Query-anchored with dots for identities. Click Reformat and score
down to the HSP alignment section.
Questions:
Q24. Describe the difference between this format and the previous format. Can you
imagine cases where the different formats might be most useful?
Q25. Play with these format options to get a feel for what they mean.