Presented By: Alphy Joseph
Date: 03 March 2016
Important file formats
Early Data Formats
• These early databases stored sequence
data in a file. The file held the sequence
in ASCII (plain) text and had a
• This method became limiting when
researchers wanted to include
annotations and information about the
source of the sequence.
• Difficulty in searching for sequences
was also an issue.
Flat File Storage Data
•When GenBank, EMBL and DDBJ
formed a collaboration (1986),
sequence databases had moved to a
defined flat file format with a shared
feature table format and annotation
• The PIR also adopted a similar format
for protein sequences.
•The flat file formats from the
sequence databases are still used to
access and display sequence and
annotation. They are also convenient
for storage of local copies.
• Bioinformaticists have developed a
standard format for nucleotide and
protein sequences that allows them to
be read by a wide range of programs.
This format is called FASTA format.
• In FASTA format, each nucleotide or
amino acid is represented using a
single-letter code.
• The first line of a FASTA record is the
comment line, identified by the
greater-than symbol ‘>’. This line
identifies the sequence and typically
includes the accession number from NCBI,
GenBank or another repository.
• The remaining lines contain the
sequence, typically wrapped at 80 or 120
characters per line.
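
The FASTA layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function names `parse_fasta` and `format_fasta`, the identifier `XX000001.1`, and the sequence data are hypothetical, and the sketch assumes that every record starts with a '>' line.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each comment line (text after '>') to its joined sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()  # start of a new record
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


def format_fasta(header: str, sequence: str, width: int = 80) -> List[str]:
    """Return the record as lines, wrapping the sequence at `width`."""
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return [">" + header] + wrapped


# Hypothetical record, for illustration only.
example = [
    ">XX000001.1 hypothetical example sequence",
    "ACGTACGTAC",
    "GTACGT",
]
records = parse_fasta(example)
fasta_lines = format_fasta("XX000001.1 hypothetical example sequence",
                           "ACGT" * 50, width=80)
```

Writing a record with `format_fasta` and reading it back with `parse_fasta` should return the original sequence, since the line wrapping carries no meaning.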
•A sequence in PIR format consists of:
–One line starting with
•a ">" (greater-than) sign, followed by
•a two-letter code describing the
sequence type (P1, F1, DL, DC, RL,
RC, or XX), followed by
•a semicolon, followed by
•the sequence identification
–One line containing a textual
description of the sequence.
–One or more lines containing the
sequence itself. The end of the
sequence is marked by a "*"
(asterisk) character.
–Optionally, this can be followed by
one or more lines describing the
sequence. Software that is
supposed to read only the sequence
should ignore these.
•A file in PIR format may comprise
more than one sequence.
•The PIR format is also often referred
to as the NBRF format.
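
The PIR record structure listed above can be read with a short Python sketch. The names `PirRecord` and `parse_pir`, the identifiers, and the sequences are hypothetical. The sketch assumes that the optional free-text lines after the "*" never begin with ">", since such a line would be mistaken for a new record.

```python
from typing import List, NamedTuple


class PirRecord(NamedTuple):
    seq_type: str      # two-letter code such as P1 or DL
    identifier: str    # sequence identification
    description: str   # free-text description line
    sequence: str      # sequence without the terminating '*'


def parse_pir(text: str) -> List[PirRecord]:
    records: List[PirRecord] = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue  # blank lines and optional trailing text are skipped
        seq_type, _, identifier = line[1:].partition(";")
        description = next(lines, "").strip()
        chunks = []
        for seq_line in lines:  # keeps consuming the same iterator
            chunks.append(seq_line.strip())
            if seq_line.rstrip().endswith("*"):
                break  # '*' marks the end of the sequence
        sequence = "".join(chunks).rstrip("*").replace(" ", "")
        records.append(
            PirRecord(seq_type, identifier.strip(), description, sequence)
        )
    return records


# Hypothetical two-record file, for illustration only.
pir_text = """>P1;EXAMPLE1
Hypothetical example protein
MKVLAAGIVG LLLAQ
PSW*
Free-text note that a sequence reader should skip.
>DL;EXAMPLE2
Hypothetical example DNA
ACGTACGT*
"""
pir_records = parse_pir(pir_text)
```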
• In a CLUSTAL W alignment file, the first
line must start with the words
"CLUSTAL W" or "CLUSTALW". Other
information in the first line is ignored.
• One or more empty lines.
• One or more blocks of sequence data. Each
block consists of:
– One line for each sequence in the alignment.
Each line consists of:
•the sequence name
•white space
•up to 60 sequence symbols
•optionally, white space followed by a cumulative
count of residues for the sequence
– A line showing the degree of
conservation for the columns of the
alignment in this block.
– One or more empty lines
•Some rules about representing
sequences:
•Case doesn't matter.
•Sequence symbols should be from a
valid alphabet.
•Gaps are represented using hyphens
("-").
•The characters used to represent the
degree of conservation are:
* - all residues or nucleotides in that
column are identical
: - conserved substitutions have been
observed
. - semi-conserved substitutions have
been observed
(space) - no match.
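
The block layout and conservation symbols described above can be illustrated with a short Python sketch. The names `parse_clustal` and `identical_columns` and the alignment data are hypothetical. The sketch accepts any first line that starts with "CLUSTAL", and it assumes that sequence names never begin with white space, so that conservation lines (which start with spaces) can be told apart from sequence lines. Only the '*' symbol is recomputed; ':' and '.' need substitution groups that are not covered here.

```python
from typing import Dict


def parse_clustal(text: str) -> Dict[str, str]:
    """Join the per-block pieces of each sequence into one string."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL alignment file")
    alignment: Dict[str, str] = {}
    for line in lines[1:]:
        if not line.strip() or line[0].isspace():
            continue  # empty line or conservation line
        fields = line.split()
        name, symbols = fields[0], fields[1]  # optional count is ignored
        alignment[name] = alignment.get(name, "") + symbols
    return alignment


def identical_columns(alignment: Dict[str, str]) -> str:
    """Mark fully identical, gap-free columns with '*', others with ' '."""
    rows = [seq.upper() for seq in alignment.values()]  # case doesn't matter
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != "-" else " "
        for col in zip(*rows)
    )


# Hypothetical two-block alignment, for illustration only.
clustal_text = """CLUSTALW example (hypothetical data)

seq1      ACGT-ACGTA 9
seq2      ACGTTACGAA 10
          **** *** *

seq1      GG 11
seq2      GG 12
          **
"""
alignment = parse_clustal(clustal_text)
marks = identical_columns(alignment)
```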
•msf formatted multiple sequence files
are most often created when using
programs of the GCG suite.
• msf files include the sequence name
and the sequence itself, which is
usually aligned with other sequences
in the file.
• You can specify a single sequence or
many sequences within an msf file.
•Some of the hallmarks of an msf
formatted sequence are the same as
those of a single-sequence GCG format
file:
•Begins with a file-type line (all
uppercase) that starts with !!; one form
is used for nucleic acid sequences and
another for amino acid sequences.
• Do not edit or delete the file-type
line if it is present.
•A description line which contains
informative text describing what is in
the file. You can add this information
to the top of the MSF file using a text
editor.
•A dividing line which contains the
number of bases or residues in the
sequence, when the file was created,
and importantly, two dots (..) which
act as a divider between the
descriptive information and the
sequence information that follows.
•msf files contain some other
information as well:
•Name/Weight: The name of each
sequence included in the alignment, as
well as its length and checksum (both
non-editable) and weight (editable).
•Separating Line. Must include two
slashes (//) to divide the name/weight
information from the sequence
alignment.
•Multiple Sequence Alignment. Each
sequence named in the above
Name/Weight lines is included. The
alignment allows you to view the
relationship among sequences.
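
The msf layout described above (descriptive lines, a dividing line ending in "..", Name/Weight lines, a "//" separator, then the alignment) can be read with a rough Python sketch. The function name `parse_msf` and the example file are hypothetical. The exact field labels in the example ("MSF:", "Name:", "Len:", "Check:", "Weight:") and the use of "." for gaps are assumptions about typical GCG output, and the checksum values are placeholders rather than real checksums.

```python
from typing import Dict


def parse_msf(text: str) -> Dict[str, str]:
    """Collect the aligned sequence for each name declared before '//'."""
    alignment: Dict[str, str] = {}
    in_alignment = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        if not in_alignment:
            if fields[0] == "Name:" and len(fields) > 1:
                alignment[fields[1]] = ""  # Name/Weight line
            elif fields[0].startswith("//"):
                in_alignment = True  # separating line reached
            continue
        if fields[0] in alignment:
            # Sequence line: name followed by space-separated chunks.
            alignment[fields[0]] += "".join(fields[1:])
    return alignment


# Hypothetical file, for illustration only.
msf_text = """Hypothetical example alignment

 example.msf  MSF: 12  Type: N  Check: 0  ..

 Name: seq1  Len: 12  Check: 0  Weight: 1.00
 Name: seq2  Len: 12  Check: 0  Weight: 1.00

//

seq1  ACGT.ACGTA GG
seq2  ACGTTACGAA GG
"""
msf_alignment = parse_msf(msf_text)
```

Lines in the alignment section that do not start with a declared name, such as coordinate rulers, are ignored by this sketch.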