Submitting DNA sequences to the databases, SEQUIN.pptx

SUBMITTING DNA
SEQUENCES TO THE
DATABASES, SEQUIN
P:1 U:2
Vedanti S. Gharat
Roll No. :- 09
M.Sc. Biotech part 1

• DNA sequence records from the public databases
(DDBJ/EMBL/GenBank) are essential components of
computational analysis in molecular biology.
• The sequence records are also reagents for improved curated
resources like LocusLink or many of the protein databases.
• Accurate and informative biological annotation of sequence
records is critical in determining the function of a disease gene
by sequence similarity search.
• The names or functions of the encoded protein products, the
name of the genetic locus, and the link to the original
publication of that sequence make a sequence record of
immediate value to the scientist who retrieves it as the result of a
BLAST or Entrez search.
• The submission process is governed by an international,
collaborative agreement. Sequences submitted to any one of the
three databases participating in this collaboration will appear in
the other two databases within a few days of their release to the
public.

WHY, WHERE, AND WHAT TO
SUBMIT?
• One should submit to whichever of the three public databases is most
convenient.
• This may be the database that is closest geographically, it may be the
repository one has always used in the past, or it may simply be the
place one’s submission is likely to receive the best attention.
• Under normal circumstances, an accession number will be returned
within one workday, and a finished record should be available
within 5–10 working days, depending on the information provided
by the submitter.
• Submitting data to the database is not the end of one’s scientific
obligation. Updating the record as more information becomes
available will ensure that the information within the record will
survive time and scientific rigor.
• Submissions of sequences are done electronically: via the World
Wide Web, by electronic mail, or on a computer disk sent via regular
postal mail.

DNA/RNA
• The submission process is quite simple, but care must be taken
to provide information that is accurate and as biologically sound
as possible, to ensure maximal usability by the scientific
community.
1] Nature of the Sequence.
Is it of genomic or mRNA origin?
2] Is the Sequence Synthetic, But Not Artificial?
There is a special division in the nucleotide databases for synthetic
molecules, sequences put together experimentally that do not
occur naturally in the environment. The DNA sequence databases
do not accept computer-generated sequences.
3] How Accurate is the Sequence?
The assumption that the submitted sequence is as accurate as
possible usually means at least two-pass coverage on the whole
submitted sequence. Equally important is the verification of the
final submitted sequence.

• Organism
All DNA sequence records must show the organism from which the
sequence was derived. Many inferences are made from the phylogenetic
position of the records present in the databases.
• Citation
Having a citation in the submission being prepared is of great
importance, even if it consists of just a temporary list of authors and a
working title. Updating these citations at publication time is also
important to the value of the record.
• Coding Sequence(s)
A nucleotide submission should also include the protein sequences
that the nucleotide sequence encodes. This is important for two reasons:
• Protein databases (e.g., SWISS-PROT and PIR) are almost entirely
populated by protein sequences present in DNA sequence database
records.
• The inclusion of the protein sequence serves as an important, if not
essential, validation step in the submission process.
The coding sequence features, or CDS, are the links between the DNA
or RNA and the protein sequences, and their correct positioning is
central in the validation, as is the correct genetic code.
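
Why the genetic code matters for CDS validation can be illustrated with a short Python sketch. It is not the databases' own code: the function names and the example sequence are made up for illustration, and only two NCBI translation tables (1, standard; 2, vertebrate mitochondrial) are included.

```python
# Sketch: check a CDS against its claimed protein under a chosen genetic code.
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

# Amino-acid strings in NCBI translation-table order (codons enumerated T, C, A, G).
GENETIC_CODES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",  # standard
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",  # vertebrate mitochondrial
}

def translate(cds, code=1):
    """Translate a coding sequence codon by codon; unknown codons become 'X'."""
    table = dict(zip(CODONS, GENETIC_CODES[code]))
    cds = cds.upper().replace("U", "T")
    return "".join(table.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3))

def cds_matches_protein(cds, protein, code=1):
    """True if the CDS translates to the protein (trailing stops are ignored)."""
    return translate(cds, code).rstrip("*") == protein.upper()

# Made-up example: TGA is a stop in the standard code but Trp (W) in code 2.
cds = "ATGTGAAAATAA"
print(translate(cds, 1))                    # M*K*
print(translate(cds, 2))                    # MWK*
print(cds_matches_protein(cds, "MWK", 2))   # True
print(cds_matches_protein(cds, "MWK", 1))   # False
```

The same nucleotide interval passes or fails depending on the code chosen, which is why a wrong genetic code shows up as a CDS/protein mismatch.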

POPULATION, PHYLOGENETIC, AND MUTATION
STUDIES
• The nucleotide databases are now accepting population,
phylogenetic, and mutational studies as submitted sequence sets,
and, although this information is not adequately represented in
the flatfile records, it is appearing in the various databases.
• This allows the submission of a group of related sequences
together, with entry of shared information required only once.
• Sequin also allows the user to include the alignment generated
with a favorite alignment tool and to submit this information
with the DNA sequence.
PROTEIN-ONLY SUBMISSIONS
• In most cases, protein sequences come with a DNA sequence.
There are some exceptions—people do sequence proteins
directly—and such sequences must be submitted without a
corresponding DNA sequence. SWISS-PROT presently is the
best venue for these submissions.

HOW TO SUBMIT ON THE WORLD WIDE WEB
• The World Wide Web is now the most common interface used to submit sequences to the
three databases. The Web-based submission systems include Sakura at DDBJ, WebIn at
EBI, and BankIt at the NCBI.
• Some 75–80% of individual submissions to NCBI are done via the Web.
• On entering a BankIt submission, the user is asked about the length of the nucleotide
sequence to be submitted. The next BankIt form is straightforward: it asks about the contact
person, the citations, the organism, the location, some map information, and the nucleotide
sequence itself.
• At the end of the form, there is a BankIt button, which calls up the next form. At this point,
some validation is made, and, if any necessary fields were not filled in, the form is
presented again. If all is well, the next form asks how many features are to be added and
prompts the user to indicate their types.
• If no features were added, BankIt will issue a warning and ask for confirmation that not
even one CDS is to be added to the submission. The user can say no (zero new CDSs) or
take the opportunity to add one or more CDS.
• To begin to save a record, press the BankIt button again. The view that now appears must be
approved before the submission is completed; that is, more changes may be made, or other
features may be added. To finish, press BankIt one more time.
• The final screen will then appear; after the user toggles the Update/Finished set of buttons
and hits BankIt one last time, the submission will go to NCBI for processing. A copy of the
just-finished submission should arrive promptly via E-mail.

HOW TO SUBMIT WITH SEQUIN
• Sequin is designed for preparing new sequence records and updating
existing records for submission to DDBJ, EMBL, and GenBank.
• It is a tool that works on most computer platforms and is suitable for a
wide range of sequence lengths and complexities, including traditional
(gene-sized) nucleotide sequences, segmented entries, long (genome-
sized) sequences with many annotated features, and sets of related
sequences (i.e., population, phylogenetic, or mutation studies of a
particular gene, region, or viral genome).
• Sequin is more practical for more complex cases. Certain types of
submission (e.g., segmented sets) cannot be made via the Web unless
explicit instructions to the database staff are inserted.
• For sets of related or similar sequences (e.g., population or phylogenetic
studies), Sequin accepts information from the submitter on how the
multiple sequences are aligned to each other.
• Finally, Sequin can be used to edit and resubmit a record that already
exists in GenBank, either by extending (or replacing) the sequence or by
annotating additional features or alignments.

SUBMISSION MADE EASY
• Sequin has a number of attributes that greatly simplify the process of
building and annotating a record.
• The most profound aspect is automatic calculation of the intervals on
a CDS feature given only the nucleotide sequence, the sequence of
the protein product, and the genetic code. This ‘‘Suggest Intervals’’
process takes consensus splice sites into account in its calculations.
• Another important attribute is the ability to enter relevant annotation
in a simple format in the definition line of the sequence data file.
• Sequin recognizes and extracts this information when reading the
sequences and then puts it in the proper places in the record. This is
especially important for population and phylogenetic studies, where
the source modifiers are necessary to distinguish one component
from another.
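
The definition-line annotation described above can be pictured with a small Python sketch. It assumes the bracketed [modifier=value] form commonly documented for Sequin; the parser, its function name, and the example line are hypothetical illustrations, not Sequin's own code.

```python
import re

# One bracketed source modifier, e.g. [organism=Homo sapiens]
MODIFIER = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")

def parse_defline(line):
    """Split a FASTA definition line into (seq_id, modifiers, title)."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    body = line[1:].strip()
    seq_id, _, rest = body.partition(" ")
    # Modifier names are lower-cased so lookups are case-insensitive.
    modifiers = {k.strip().lower(): v.strip() for k, v in MODIFIER.findall(rest)}
    # Whatever is left after removing the bracketed parts is treated as the title.
    title = " ".join(MODIFIER.sub(" ", rest).split())
    return seq_id, modifiers, title

# Made-up definition line for one member of a population study.
example = (">HIV-iso1 [organism=Human immunodeficiency virus 1] "
           "[isolate=iso1] envelope glycoprotein gene, partial cds")
seq_id, modifiers, title = parse_defline(example)
print(seq_id)      # HIV-iso1
print(modifiers)   # {'organism': 'Human immunodeficiency virus 1', 'isolate': 'iso1'}
print(title)       # envelope glycoprotein gene, partial cds
```

In a population or phylogenetic set, each sequence would carry its own modifiers (isolate, strain, and so on) on its definition line, so the members can be told apart.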

STARTING A NEW SUBMISSION
• Sequin begins with a window that allows the user to start a new
submission or load a file containing a saved record. If Sequin has been
configured to be network aware, this window also allows the
downloading of existing database records that are to be updated.
• A new submission is made by filling out several forms. The forms use
folder tabs to subdivide a window into several pages, allowing all the
requested data to be entered without the need for a huge computer
screen. These entry forms have buttons for Previous Page and Next
Page. When the user arrives at the last page on a form, the Next Page
button changes to Next Form.
• The Submitting Authors form requests a tentative title, information
on the contact person, the authors of the sequence, and their
institutional affiliations.
• The Sequence Format form asks for the type of submission (single
sequence, segmented sequence, or population, phylogenetic, or
mutation study).
• The Organism and Sequences form asks for the biological data. On
the Organism page, as the user starts to type the scientific name, the list
of frequently used organisms scrolls automatically.

Entering a Single Nucleotide Sequence
and its Protein Products
• For a single sequence or a segmented sequence, the rest of the
Organism and Sequences form contains Nucleotide and Protein folder
tabs.
• The Nucleotide page has controls for setting the molecule type (e.g.,
genomic DNA or mRNA) and topology (usually linear, occasionally
circular) and for indicating whether the sequence is incomplete at the
5′ or 3′ ends.
• For each protein sequence, Suggest Intervals is run against the
nucleotide sequence, and a CDS feature is made with the resulting
intervals. A Gene feature is generated, with a single interval spanning
the CDS intervals. A protein product sequence is made, with a Protein
feature to give it a name.
• In most cases, it is much easier to enter the protein sequence and let
Sequin construct the record automatically than to manually add a CDS
feature later.
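
The idea of deriving CDS intervals from a nucleotide sequence and its protein can be sketched in Python. This is a deliberately simplified stand-in, not Sequin's Suggest Intervals algorithm: it assumes a single unspliced exon on the forward strand and the standard genetic code, and it ignores splice sites entirely. All names and the example sequence are hypothetical.

```python
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
# Standard genetic code in NCBI table order (codons enumerated T, C, A, G).
STANDARD = dict(zip(
    CODONS, "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"))

def translate(seq):
    """Translate from the first base; unknown codons become 'X'."""
    return "".join(STANDARD.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def find_cds_interval(nucleotide, protein):
    """Return the 1-based inclusive (start, stop) encoding the protein, or None."""
    nucleotide = nucleotide.upper().replace("U", "T")
    protein = protein.upper().rstrip("*")
    if not protein:
        return None
    for frame in range(3):                      # try each forward reading frame
        aa = translate(nucleotide[frame:])
        pos = aa.find(protein)
        if pos != -1:
            start = frame + 3 * pos             # 0-based offset of the first codon
            stop = start + 3 * len(protein)     # 0-based, exclusive
            if aa[pos + len(protein):pos + len(protein) + 1] == "*":
                stop += 3                       # include the stop codon in the CDS
            return start + 1, stop
    return None

# Made-up example: two flanking bases on each side of ATG AAA TGG TAA.
print(find_cds_interval("GGATGAAATGGTAACC", "MKW"))  # (3, 14)
```

A real implementation would also have to handle introns, the reverse strand, alternative genetic codes, and imperfect matches.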

Entering an Aligned Set of Sequences
• A growing class of submissions involves sets of related sequences. A large number of HIV
sequences come in as population studies. A common phylogenetic study involves ribulose-
1,5-bisphosphate carboxylase (RUBISCO).
• The same submission information form is used to enter author and contact information.
• In the Sequence Format form, the user chooses the desired type of submission. Population
studies are generally from different individuals in the same (crossbreeding) species.
Phylogenetic studies are from different species.
• Multiple sequence studies can be submitted in FASTA format, in which case Sequin should
later be called on to calculate an alignment.
• The Organism and Sequences form is slightly different for sets of sequences. The
Organism page for phylogenetic studies allows the setting of a default genetic code only
for organisms not in Sequin’s local list of popular species. Instead of a Protein page, there
is now an Annotation page.
• As a final step, Sequin displays an editor that allows all organism and source modifiers on
each sequence to be edited. On confirmation of the modifiers, Sequin finishes assembling
the record into the proper structure.

Viewing the Sequence Record
• Sequin provides a number of different views of a sequence record.
The traditional flatfile can be presented in FASTA, GenBank or
EMBL format.
• There is a more detailed view that shows the features on the actual
sequence. For records containing alignments one can request either
a graphical overview showing insertions, deletions, and mismatches
or a detailed view showing the alignment of sequence letters.
• Clicking on a feature, a sequence, or the graphical representation of
an alignment between sequences will highlight that object.
Validation
• To ensure the quality of data being submitted, Sequin has a built-in
validator that searches for missing organism information, incorrect
coding region lengths, internal stop codons in coding regions,
mismatched amino acids, and nonconsensus splice sites.
• The validator also checks for inconsistent use of ‘‘partial’’
indications, especially among coding regions, the protein product,
and the protein feature on the product.
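
A few of the checks listed above can be sketched in Python. This is an illustrative approximation, not Sequin's validator: the function name, the messages, the use of the standard stop codons, and the rule that a complete CDS starts with ATG are all assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

def validate_cds(organism, cds, partial5=False, partial3=False):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    cds = cds.upper().replace("U", "T")
    if not organism.strip():
        problems.append("missing organism")
    if len(cds) % 3 != 0:
        problems.append("coding region length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # Any stop codon before the last codon is an internal stop.
    if any(c in STOP_CODONS for c in codons[:-1]):
        problems.append("internal stop codon")
    if codons:
        ends_with_stop = codons[-1] in STOP_CODONS
        # 'Partial' flags should agree with what the sequence itself shows.
        if not partial3 and not ends_with_stop:
            problems.append("no stop codon, but 3' end not marked partial")
        if partial3 and ends_with_stop:
            problems.append("3' end marked partial, but a stop codon is present")
        if not partial5 and codons[0] != "ATG":
            problems.append("no ATG start, but 5' end not marked partial")
    return problems

# Made-up examples.
print(validate_cds("Homo sapiens", "ATGAAATGGTAA"))             # []
print(validate_cds("", "ATGTAAAAATG"))                          # four problems
print(validate_cds("Homo sapiens", "AAATGGTAA", partial5=True)) # []
```

Returning a list of problems rather than stopping at the first one mirrors how a validator report lets the submitter fix everything in one pass.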

SENDING THE SUBMISSION
• A finished submission can be saved to disk and E-mailed to one of the
databases. It is also a good practice to save frequently throughout the
Sequin session, to make sure nothing is inadvertently lost.
CONCLUDING REMARKS
• The act of depositing records into a database and seeing these records
made public has always been an exercise of pride on the part of
submitters, a segment of the scientific activity from their laboratory
that they present to the scientific community. In this process,
submitters always hope to provide information in the most complete
and useful fashion, allowing maximum use of their data by the
scientific community.
• The databases strongly encourage the submission of sequence data
and of all appropriate updates. Many tools are available to facilitate
this task, and together the databases support Sequin as the tool to
use for new submissions, in addition to their respective Web
submissions tools. Submitting data to the databases has now become
a manageable (and sometimes enjoyable) task, with scientists no
longer having good excuses for neglecting it.
