BIOINFORMATICS
SEQUENCE FILE
FORMATS
Presented By: Alphy Joseph
Date: 03 March 2016
Important file formats
•GenBank
•FASTA
•PIR
•ALN/ClustalW2
•GCG/MSF
Early Data Formats
•The early databases stored sequence
data in files. Each file held the sequence
in ASCII (plain) text and had a
descriptive filename.
• This method became limiting when
researchers wanted to include
annotations and information about the
source of the sequence.
• Searching for sequences was also
difficult.
Flat File Storage Data
Formats
•By the time GenBank, EMBL and DDBJ
formed a collaboration (1986),
sequence databases had moved to a
defined flat file format with a shared
feature table format and annotation
standards.
•The PIR also adopted a similar format
for protein sequences.
•The flat file formats from the
sequence databases are still used to
access and display sequence and
annotation. They are also convenient
for storage of local copies.
FASTA Format
• Bioinformaticists have developed a
standard format for nucleotide and
protein sequences that allows them to
be read by a wide range of programs.
This format is called FASTA format.
•In FASTA format, each nucleotide or
amino acid is represented by a
single letter.
•The first line of a FASTA record is the
description line, identified by the
greater-than symbol ‘>’. This line
identifies the sequence and usually
includes the accession number from
GenBank or another repository.
•The remaining lines contain the
sequence, usually wrapped at 80 or 120
characters per line.
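Reading FASTA: a sketch
The layout described above can be read with a few lines of code. The following is a minimal Python sketch, not a full parser: the function name parse_fasta and both example records are hypothetical, made up for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {description: sequence} for FASTA text given line by line."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()  # drop the '>' marker
            records[header] = ""
        elif header is not None:
            records[header] += line  # join the wrapped sequence lines
    return records


# Made-up records for illustration only (not real database entries).
example = """>seq1 hypothetical example record
ATGCGTACGT
TTGACA
>seq2 another made-up record
MKVLAA
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

The sketch keys each record by its full description line, so it assumes that no two records share the same description.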
PIR Format
•A sequence in PIR format consists of:
–One line starting with
•a ">" (greater-than) sign, followed
by
•a two-letter code describing the
sequence type (P1, F1, DL, DC, RL,
RC, or XX), followed by
•a semicolon, followed by
•the sequence identification
–One line containing a textual
description of the sequence.
–One or more lines containing the
sequence itself. The end of the
sequence is marked by a "*"
(asterisk) character.
–Optionally, this can be followed by
one or more lines describing the
sequence. Software that is
supposed to read only the sequence
should ignore these.
•A file in PIR format may comprise
more than one sequence.
•The PIR format is also often referred
to as the NBRF format.
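Reading PIR: a sketch
The record structure listed above (header line, description line, sequence ending in "*", optional trailing lines) can be walked through in code. This is a minimal Python sketch under simplifying assumptions: the names PirRecord and parse_pir are hypothetical, the example entries are made up, and every record is assumed to be well formed.

```python
from typing import List, NamedTuple


class PirRecord(NamedTuple):
    seq_type: str      # two-letter code, e.g. "P1"
    identifier: str
    description: str
    sequence: str


def parse_pir(text: str) -> List[PirRecord]:
    records: List[PirRecord] = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        # A record starts with '>', a two-letter code and a semicolon.
        if line.startswith(">") and len(line) > 3 and line[3] == ";":
            seq_type, identifier = line[1:3], line[4:].strip()
            description = lines[i + 1].strip() if i + 1 < len(lines) else ""
            i += 2
            chunks = []
            # Collect sequence lines up to the '*' terminator.
            while i < len(lines):
                part = lines[i].strip()
                i += 1
                if "*" in part:
                    chunks.append(part.split("*", 1)[0])
                    break
                chunks.append(part)
            sequence = "".join("".join(chunks).split())
            records.append(PirRecord(seq_type, identifier, description, sequence))
        else:
            i += 1  # ignore optional trailing lines after the '*'
    return records


# Made-up entries for illustration only.
pir_example = """>P1;EXAMPLE1
Hypothetical protein, made up for illustration
MKVLAAGIVG
LLLAQ*
Optional trailing comment line.
>DL;EXAMPLE2
Hypothetical DNA fragment
ATGCATGC*
"""

pir_records = parse_pir(pir_example)
for rec in pir_records:
    print(rec.seq_type, rec.identifier, len(rec.sequence))
```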
ALN/ClustalW
• The first line in the file must start with
the words "CLUSTAL W" or "CLUSTALW".
Other information in the first line is ignored.
• One or more empty lines.
• One or more blocks of sequence data. Each
block consists of:
– One line for each sequence in the alignment.
Each line consists of:
•the sequence name
•white space
•up to 60 sequence symbols.
•optionally, white space followed by a cumulative
count of residues for the sequence
– A line showing the degree of
conservation for the columns of the
alignment in this block.
– One or more empty lines
•Some rules about representing
sequences:
•Case doesn't matter.
•Sequence symbols should be from a
valid alphabet.
•Gaps are represented using hyphens
("-").
•The characters used to represent the
degree of conservation are:
* - all residues or nucleotides in that
column are identical
: - conserved substitutions have been
observed
. - semi-conserved substitutions have
been observed
(space) - no match.
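Reading ALN/ClustalW: a sketch
The block layout above can be turned into a small reader that joins the pieces of each sequence across blocks. This is a minimal Python sketch, not a reference implementation: parse_clustal is a hypothetical name, the alignment is made up, and sequence names are assumed to contain no white space.

```python
from typing import Dict


def parse_clustal(text: str) -> Dict[str, str]:
    """Return {sequence name: aligned sequence} from ALN/ClustalW text."""
    lines = text.splitlines()
    if not lines or not lines[0].upper().startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL alignment: bad first line")
    alignment: Dict[str, str] = {}
    for line in lines[1:]:
        # Blank lines separate blocks; a line that starts with white
        # space is the conservation line of the block.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # a name without sequence symbols: skip it
        name, symbols = fields[0], fields[1]
        # A third field, if present, is the optional residue count.
        alignment[name] = alignment.get(name, "") + symbols
    return alignment


# Made-up two-block alignment for illustration only.
aln_example = """CLUSTAL W (1.83) multiple sequence alignment

seqA      ATGC-TAGCA 9
seqB      ATGCTTAGCT 10
          **** ****

seqA      GGT 12
seqB      GGA 13
          **
"""

aln = parse_clustal(aln_example)
for name, seq in aln.items():
    print(name, seq)
```

The conservation line is skipped here because it can be recomputed from the aligned columns.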
GCG/MSF
•MSF-formatted multiple sequence files
are most often created by
programs of the GCG suite.
• MSF files include the sequence name
and the sequence itself, which is
usually aligned with other sequences
in the file.
• An MSF file can hold a single sequence
or many sequences.
•Some of the hallmarks of an MSF-
formatted file are the same as for a
single-sequence GCG format file:
•It begins with the line (all uppercase)
!!NA_MULTIPLE_ALIGNMENT 1.0
for nucleic acid sequences or
!!AA_MULTIPLE_ALIGNMENT 1.0
for amino acid sequences.
• Do not edit or delete the file type line
if it is present.
•A description line which contains
informative text describing what is in
the file. You can add this information
to the top of the MSF file using a text
editor.
•A dividing line which contains the
number of bases or residues in the
sequence, when the file was created,
and importantly, two dots (..) which
act as a divider between the
descriptive information and the
sequence information that follows.
•MSF files contain some other
information as well:
•Name/Weight: The name of each
sequence included in the alignment, as
well as its length and checksum (both
non-editable) and weight (editable).
•Separating Line. Must include two
slashes (//) to divide the name/weight
information from the sequence
alignment.
•Multiple Sequence Alignment. Each
sequence named in the above
Name/Weight lines is included. The
alignment allows you to view the
relationship among sequences.
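Reading GCG/MSF: a sketch
The parts of an MSF file described above (Name/Weight lines, the // separator, then the alignment) can be read as follows. This is a minimal Python sketch under stated assumptions: parse_msf is a hypothetical name, the file content is made up, and the Check values are placeholders rather than real GCG checksums.

```python
import re
from typing import Dict, Tuple


def parse_msf(text: str) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Return ({name: weight}, {name: aligned sequence}) from MSF text."""
    weights: Dict[str, float] = {}
    alignment: Dict[str, str] = {}
    in_alignment = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_alignment:
            if stripped.startswith("//"):
                in_alignment = True  # separator reached: alignment follows
                continue
            match = re.match(r"Name:\s*(\S+).*?Weight:\s*([\d.]+)", stripped)
            if match:
                name = match.group(1)
                weights[name] = float(match.group(2))
                alignment[name] = ""
            continue
        fields = stripped.split()
        # Only lines that start with a declared sequence name are kept.
        if fields and fields[0] in alignment:
            alignment[fields[0]] += "".join(fields[1:])
    return weights, alignment


# Made-up file for illustration; Check values are placeholders.
msf_example = """!!NA_MULTIPLE_ALIGNMENT 1.0
Hypothetical example alignment
 example.msf  MSF: 13  Type: N  Check: 0  ..

 Name: seqA  Len: 13  Check: 0  Weight: 1.00
 Name: seqB  Len: 13  Check: 0  Weight: 1.00

//

seqA  ATGCTTAGCA
seqB  ATGCTTAGCT

seqA  GGT
seqB  GGA
"""

msf_weights, msf_alignment = parse_msf(msf_example)
for name, seq in msf_alignment.items():
    print(name, msf_weights[name], seq)
```

The sketch does not verify lengths or checksums, which a real GCG-compatible reader would be expected to do.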
THANK YOU
