Week 3: Gene Annotation
By Prof. Jackson
NCBI ORF Finder
You can also search by accession ID. Because the sequence you are trying to annotate is unknown, this will not be a good option here.
Copy and paste your sequence in this window.
Go to: https://www.ncbi.nlm.nih.gov/orffinder/
ORF Finder parameters
If you are working with a genome sequence that still contains mitochondrial sequence, you may want to clean it and remove the mitochondrial genome (not applicable for this assignment).
Depending on the organism (if you know it), you may increase or decrease the minimal ORF length. This means that if the algorithm reaches a stop codon before the selected length, that candidate is ignored and the search continues until it finds an ORF that meets the length restriction. For small organisms such as viruses and bacteria, it may be a good idea to reduce the length from 75.
Some organisms, such as E. coli, have known start codons other than ATG. You may want to identify a CDS starting with those alternative codons as well.
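The effect of the minimal length and start codon parameters can be sketched in a few lines of Python. This is only an illustration, not ORF Finder's actual algorithm: the function name find_orfs_in_frame, its arguments, and the choice to count the stop codon in the length are assumptions made for this example. GTG and TTG are used as examples of alternative bacterial start codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs_in_frame(seq, frame=0, min_len=75, start_codons=("ATG",)):
    """Scan one forward reading frame and return (start, end) ORF spans.

    Positions are 0-based and end-exclusive. The length check counts
    nucleotides from the start codon through the stop codon (an assumption
    of this sketch, not a statement about ORF Finder).
    """
    seq = seq.upper()
    orfs = []
    start = None  # position of the currently open start codon, if any
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None:
            if codon in start_codons:
                start = i
        elif codon in STOP_CODONS:
            end = i + 3
            if end - start >= min_len:  # too-short candidates are dropped
                orfs.append((start, end))
            start = None
    return orfs

# Toy sequence: a 9-nt ORF is dropped at the default length of 75
print(find_orfs_in_frame("ATGAAATAG"))             # → []
print(find_orfs_in_frame("ATGAAATAG", min_len=9))  # → [(0, 9)]
# Allowing alternative start codons recovers an ORF that starts with GTG
print(find_orfs_in_frame("GTGAAATAA", min_len=9,
                         start_codons=("ATG", "GTG", "TTG")))  # → [(0, 9)]
```

Lowering min_len or widening start_codons increases the number of candidates, which is why these settings should follow what is known about the organism.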
ORF Finder results
• In most cases the longest predicted ORF is the correct one, but this is not always true.
• + is for the positive (forward) strand. There are three reading frames on the forward (+) strand and three on the reverse (-) strand, making a total of six frames.
• The algorithm reads in a sliding window. Frame 1 starts at the nucleotide in the first position, frame 2 at the second position and frame 3 at the third position. You can easily implement your own simple ORF predictor with Python, Biopython or any other programming language.
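The six reading frames described above can be sketched in plain Python as follows. This is a minimal illustration that assumes the standard genetic code; the function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example, and the reverse-strand labels are not guaranteed to match how ORF Finder numbers its frames.

```python
# Standard genetic code, built from the usual TCAG ordering
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translation(seq):
    """Translate three forward and three reverse frames of a sequence."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames

for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

In the output, * marks a stop codon. A simple ORF predictor would then look in each frame for a stretch that begins at a start codon and runs to the next stop.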
Predicted ORF
Amino acid sequence matching the ORF
You can click on the Mark flag to select multiple ORFs
Highlight the ORF you would like to include in your selection
Hover your mouse over the graphic for more information
ORF Finder results cont.
• BLAST is a validation step that can also be used to further identify your sequence/gene.
• Ideally you should choose the same database you used when BLASTing to identify your sequence.
• You expect the correct ORF to give results similar to your sequence BLAST results.
• The correct ORF should also have better statistics/scores, e.g., a better E-value than the rest.
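If the BLAST hits are saved as a tab-separated hit table, comparing the ORFs by E-value can be sketched as below. The column order is the standard 12-column tabular layout used by command-line BLAST+ (-outfmt 6). The function name and the example rows (ORF1, ORF2, subject_A, ...) are hypothetical placeholders, not results from this assignment.

```python
# Column names of the standard 12-column BLAST tabular format
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hit_per_query(tabular_text):
    """Return the lowest-E-value hit for each query (ties: higher bit score)."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(FIELDS, line.split("\t")))
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or (row["evalue"], -row["bitscore"]) < (
                current["evalue"], -current["bitscore"]):
            best[row["qseqid"]] = row
    return best

# Hypothetical rows for illustration only
example = "\n".join([
    "ORF1\tsubject_A\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-150\t540",
    "ORF1\tsubject_B\t90.0\t300\t30\t0\t1\t300\t1\t300\t1e-90\t350",
    "ORF2\tsubject_C\t45.0\t80\t44\t2\t1\t80\t5\t84\t0.002\t38.5",
])

for orf, hit in best_hit_per_query(example).items():
    print(orf, hit["sseqid"], hit["evalue"], hit["bitscore"])
```

An ORF with no hits simply does not appear in the result, and an ORF whose best E-value is much weaker than the others is a less likely candidate.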
Select a BLAST database to search
BLAST your marked ORFs
BLAST
The number of sequences should match the number of ORFs you marked/selected.
You can also select the BLAST search database here and change any other parameters you want
ORF BLAST Results
• The default results shown are for the first ORF you selected.
• ORF1 seems to have the most hits and acceptable scores, so it is likely to be the correct ORF.
• The selected ORF3 has no results.
Use the drop-down to select results for different ORFs
ORF Finder
• Now that you know ORF1 is the most likely ORF, use SmartBLAST instead.
• You can also try searching the multiple ORFs in different databases for further evaluation.
Use SmartBLAST
SmartBLAST results
• Pay attention to what is suggested as the top hits. They do not necessarily have the best scores on every measure, as we will examine in the next slides.
• SmartBLAST also generates a phylogenetic tree.
• In addition, you can perform a multiple sequence alignment.
Analyzing the selected best hits
• They all have the same E-value, so that cannot be the deciding attribute.
• The Homo sapiens hit has a good E-value and no gaps, although its 94% identity suggests it may not be from the same organism as the unknown sequence.
• The Mus musculus hit has 4% gaps, which is high compared to the others, and its identity is too low to be acceptable as the same organism.
• All the rest have gaps and comparable identities, making the Homo sapiens hit the best hit.
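The identity and gap percentages compared above are both computed over the alignment length. A minimal sketch, assuming two already-aligned strings that use "-" for gaps (the function name and the toy alignment are made up for illustration):

```python
def alignment_stats(query_aln, subject_aln):
    """Compute identity and gap percentages for a pairwise alignment.

    Both strings must be the same length and use '-' for gap positions.
    Percentages are taken over the full alignment length, gaps included.
    """
    if len(query_aln) != len(subject_aln) or not query_aln:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    length = len(query_aln)
    pairs = list(zip(query_aln.upper(), subject_aln.upper()))
    identities = sum(1 for q, s in pairs if q == s and q != "-")
    gaps = sum(1 for q, s in pairs if q == "-" or s == "-")
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "pct_identity": 100.0 * identities / length,
        "pct_gaps": 100.0 * gaps / length,
    }

# Toy alignment: 6 matches, 1 gap and 1 mismatch over 8 columns
print(alignment_stats("ATGC-TAA", "ATGCGTAT"))
```

When the E-values are tied, these two numbers are what separate the hits: a hit with no gaps and the highest identity is preferred.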
Comparing BLAST results
• The first hit seems to have perfect scores and would seem like an obvious choice.
• Most of the hits that follow are only predicted, so they give no basis for dismissing the first hit.
• The 14th and 15th hits, under Homo sapiens, have quite competitive scores and are indicated as complete. Those should be included for further evaluation.
Click GenBank to go to the Nucleotide database
Click on each selection for detailed information
Click on gene to link to the Gene database for additional information
Further analysis of the BLAST results
• The top hit is marked Provisional, while the Homo sapiens hits are Reviewed.
• For annotation purposes, always choose the reviewed hit.
• This explains why SmartBLAST selected Homo sapiens as the best hit even though Macaca mulatta had better scores.
• By now you should be familiar with the concepts of Reviewed, Provisional, Model, Predicted, etc., so that you can make better decisions when selecting hits.
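The rule of preferring a reviewed record over a better-scoring provisional one can be expressed as a two-level sort. This is a sketch only: the exact ordering in STATUS_RANK beyond "Reviewed comes first" is an assumption made for illustration, and the hit names and scores are hypothetical.

```python
# Lower rank = more curated. Ordering below Reviewed is an assumption.
STATUS_RANK = {
    "REVIEWED": 0,
    "VALIDATED": 1,
    "PROVISIONAL": 2,
    "PREDICTED": 3,
    "INFERRED": 4,
    "MODEL": 5,
}

def rank_hits(hits):
    """Sort hits by curation status first, then by descending bit score."""
    return sorted(
        hits,
        key=lambda h: (STATUS_RANK.get(h["status"].upper(), len(STATUS_RANK)),
                       -h["bitscore"]),
    )

# Hypothetical hits for illustration only
hits = [
    {"name": "hit_provisional", "status": "Provisional", "bitscore": 600.0},
    {"name": "hit_reviewed", "status": "Reviewed", "bitscore": 580.0},
    {"name": "hit_model", "status": "Model", "bitscore": 590.0},
]

for hit in rank_hits(hits):
    print(hit["name"], hit["status"], hit["bitscore"])
```

The reviewed hit is listed first even though the provisional hit has the higher score, which mirrors the choice made on this slide.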
• As covered in week 2, the NCBI Gene database has graphics that can be used to capture various genomic features.
• Hovering your mouse over the graphics, as shown in the figures, gives the intron and exon positions.
• You can also download the information by clicking at the end of the download arrows.
On slide 12, you can choose the Ensembl ID under the gene summary to go directly to this page. You can also go to the Ensembl website and paste the ID, or search with the NCBI accession ID.
The gene position is shown here. The forward strand is your ORF; it can also be represented as the positive or + strand.
Select regulation for regulatory regions such as enhancers and promoters
Select transcript to get sequence regions such as exon and intron positions
Select exons to get the coding positions table
Results
Select cDNA to retrieve other sequence regions, including UTR regions
Select variant table for the results below
Results from selecting regulation on the gene tab
Follow the color flag and hover your mouse to see the feature
Results of the regulation features
Same information as Ensembl. UCSC also has all the ENCODE data