
Sequence file formats

Important file formats:
GenBank, FASTA, PIR, ALN/ClustalW2 & GCG/MSF

Published in: Science


  1. 1. BIOINFORMATICS SEQUENCE FILE FORMATS Presented By: Alphy Joseph Date: 03 March 2016
  2. 2. Important file formats • GenBank • FASTA • PIR • ALN/ClustalW2 • GCG/MSF
  3. 3. Early Data Formats • Early databases stored each sequence in a file. The file held the sequence as plain ASCII text and had a descriptive filename. • This method became limiting when researchers wanted to include annotations and information about the source of the sequence. • Sequences were also difficult to search.
  4. 4. Flat File Storage Data Formats • By the time GenBank, EMBL and DDBJ formed a collaboration (1986), sequence databases had moved to a defined flat file format with a shared feature table format and annotation standards. • The PIR also adopted a similar format for protein sequences.
  5. 5. • The flat file formats from the sequence databases are still used to access and display sequences and annotations. They are also convenient for storing local copies.
  6. 6. FASTA Format • Bioinformaticians have developed a standard format for nucleotide and protein sequences that allows them to be read by a wide range of programs. This format is called FASTA format. • In FASTA format, each nucleotide or amino acid is represented by a single letter.
  7. 7. • The first line of a FASTA record is the comment (description) line, identified by the greater-than symbol '>'. This line identifies the sequence and typically includes its accession number from NCBI, GenBank or another repository. • The remaining lines contain the sequence, typically wrapped at 80 or 120 characters per line.
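The FASTA layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `parse_fasta` and `format_fasta`, the record identifier `XX000001.1`, and the sequence itself are all hypothetical, and the parser assumes every record starts with a `>` line.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        yield header, "".join(chunks)


def format_fasta(header: str, sequence: str, width: int = 80) -> str:
    """Write one record, wrapping the sequence at `width` characters."""
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + body)


# Hypothetical record, used only to exercise the functions.
example = """>XX000001.1 hypothetical example record
ACGTACGTAC
GTTTAACC
""".splitlines()

records = list(parse_fasta(example))
print(records)
print(format_fasta("XX000001.1 hypothetical example record", records[0][1], width=10))
```

Because the sequence lines are joined on reading, the line width only matters when writing; `width` can be set to 80 or 120 to match the convention mentioned on the slide.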
  8. 8. PIR FORMAT • A sequence in PIR format consists of: – One line starting with a ">" (greater-than) sign, followed by a two-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX), followed by a semicolon, followed by the sequence identification.
  9. 9. – One line containing a textual description of the sequence. – One or more lines containing the sequence itself. The end of the sequence is marked by a "*" (asterisk) character. – Optionally, one or more further lines describing the sequence. Software that reads only the sequence should ignore these.
  10. 10. • A file in PIR format may contain more than one sequence. • The PIR format is also often referred to as the NBRF format.
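The PIR rules listed above translate fairly directly into a reader. The sketch below is an assumption-laden illustration: `parse_pir` is a hypothetical name, the two example entries are invented, and the code assumes each sequence is properly terminated by `*` (an unterminated sequence would swallow the following lines).

```python
from typing import Dict, List


def parse_pir(text: str) -> List[Dict[str, str]]:
    """Parse PIR/NBRF-formatted text into a list of record dictionaries."""
    records = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        # Header: ">", a two-letter type code, ";", then the identifier.
        if line.startswith(">") and len(line) > 4 and line[3] == ";":
            seq_type = line[1:3]
            identifier = line[4:].strip()
            # The line after the header is the free-text description.
            description = lines[i + 1].strip() if i + 1 < len(lines) else ""
            i += 2
            chunks = []
            terminated = False
            while i < len(lines) and not terminated:
                chunk = lines[i].strip()
                i += 1
                if "*" in chunk:
                    # The asterisk marks the end of the sequence.
                    chunk = chunk.split("*", 1)[0]
                    terminated = True
                chunks.append(chunk.replace(" ", ""))
            records.append({
                "type": seq_type,
                "id": identifier,
                "description": description,
                "sequence": "".join(chunks),
            })
        else:
            i += 1  # optional trailing descriptive lines are ignored
    return records


# Two hypothetical entries in one file.
pir_text = """>P1;EXAMPLE1
Hypothetical example protein
MKTAYIAKQR QISFVKSHFS
RQLEERLGLI*
Optional trailing comment line.
>P1;EXAMPLE2
Second hypothetical protein
MSTNPKPQRK*
"""

pir_records = parse_pir(pir_text)
for rec in pir_records:
    print(rec["type"], rec["id"], rec["sequence"])
```

The trailing comment line after the first entry is skipped, which matches the rule that software reading only the sequence should ignore such lines.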
  11. 11. ALN/ClustalW • The first line in the file must start with the words "CLUSTAL W" or "CLUSTALW". Other information in the first line is ignored. • One or more empty lines. • One or more blocks of sequence data. Each block consists of: – One line for each sequence in the alignment. Each line consists of: • the sequence name • white space • up to 60 sequence symbols • optionally, white space followed by a cumulative count of residues for the sequence
  12. 12. – A line showing the degree of conservation for the columns of the alignment in this block. – One or more empty lines. • Some rules about representing sequences: • Case doesn't matter. • Sequence symbols should be from a valid alphabet. • Gaps are represented using hyphens ("-").
  13. 13. • The characters used to represent the degree of conservation are: "*" (all residues or nucleotides in that column are identical), ":" (conserved substitutions have been observed), "." (semi-conserved substitutions have been observed), and " " (a blank space: no match).
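The block structure of an ALN file can be read with a short Python sketch. The names `parse_clustal` and `identical_columns` and the two-sequence alignment are hypothetical. The parser assumes that conservation lines begin with white space (because the name column is padded) and that sequence names contain no spaces. `identical_columns` only recomputes the "*" case; the ":" and "." classes depend on substitution groups that are not covered here.

```python
from typing import Dict, List


def parse_clustal(text: str) -> Dict[str, str]:
    """Return {name: aligned sequence} from ALN/ClustalW-formatted text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("first line must start with CLUSTAL")
    sequences: Dict[str, str] = {}
    for line in lines[1:]:
        # Skip empty lines and conservation lines (assumed to start
        # with white space).
        if not line.strip() or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, symbols = parts[0], parts[1]  # an optional residue count is ignored
        sequences[name] = sequences.get(name, "") + symbols
    return sequences


def identical_columns(sequences: Dict[str, str]) -> List[int]:
    """Indices of columns where all symbols match (the '*' case)."""
    seqs = [s.upper() for s in sequences.values()]  # case doesn't matter
    return [
        i for i, column in enumerate(zip(*seqs))
        if len(set(column)) == 1 and column[0] != "-"
    ]


# Hypothetical alignment with two blocks.
aln_text = """CLUSTAL W (1.83) multiple sequence alignment

seqA      ACGT-ACGTA 9
seqB      ACGTTACGAA 10
          **** *** *

seqA      GG 11
seqB      GG 12
          **
"""

alignment = parse_clustal(aln_text)
print(alignment)
print(identical_columns(alignment))
```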
  14. 14. GCG/MSF • MSF-formatted multiple sequence files are most often created when using programs of the GCG suite. • MSF files include the sequence name and the sequence itself, which is usually aligned with other sequences in the file. • You can specify a single sequence or many sequences within an MSF file.
  15. 15. • Some of the hallmarks of an MSF-formatted file are the same as those of a single-sequence GCG format file: • It begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences. • Do not edit or delete the file type line if it is present.
  16. 16. •A description line which contains informative text describing what is in the file. You can add this information to the top of the MSF file using a text editor. •A dividing line which contains the number of bases or residues in the sequence, when the file was created, and importantly, two dots (..) which act as a divider between the descriptive information and the
  17. 17. • MSF files contain some other information as well: • Name/Weight: the name of each sequence included in the alignment, as well as its length and checksum (both non-editable) and weight (editable). • Separating line: must include two slashes (//) to divide the name/weight information from the sequence alignment.
  18. 18. • Multiple Sequence Alignment. Each sequence named in the above Name/Weight lines is included. The alignment allows you to view the relationships among the sequences.
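The MSF sections described above (file type line, description, dividing line with "..", Name/Weight lines, "//" separator, alignment) can be walked through with a small Python sketch. Everything in it is illustrative: `parse_msf` is a hypothetical name, the example file is invented, and its `Check:` values are placeholders rather than real GCG checksums, which this code neither computes nor verifies. Alignment symbols are kept exactly as they appear in the file.

```python
from typing import Dict, List, Tuple


def parse_msf(text: str) -> Tuple[List[str], Dict[str, float], Dict[str, str]]:
    """Return (names, weights, sequences) from MSF-formatted text."""
    names: List[str] = []
    weights: Dict[str, float] = {}
    sequences: Dict[str, str] = {}
    in_alignment = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_alignment:
            if stripped.startswith("//"):
                # The separating line ends the Name/Weight section.
                in_alignment = True
            elif stripped.startswith("Name:"):
                fields = stripped.split()
                name = fields[1]
                names.append(name)
                sequences[name] = ""
                if "Weight:" in fields:
                    weights[name] = float(fields[fields.index("Weight:") + 1])
            continue
        parts = stripped.split()
        # Only lines whose first token is a declared name carry sequence data.
        if parts and parts[0] in sequences:
            sequences[parts[0]] += "".join(parts[1:])
    return names, weights, sequences


# Hypothetical file; the Check values are placeholders, not real checksums.
msf_text = """!!NA_MULTIPLE_ALIGNMENT 1.0
Hypothetical example alignment

 example.msf  MSF: 12  Type: N  Check: 0  ..

 Name: seqA  Len: 12  Check: 0  Weight: 1.00
 Name: seqB  Len: 12  Check: 0  Weight: 1.00

//

seqA  ACGT.ACGTA GG
seqB  ACGTTACGAA GG
"""

msf_names, msf_weights, msf_sequences = parse_msf(msf_text)
print(msf_names)
print(msf_weights)
print(msf_sequences)
```

Keying the alignment section on the names declared in the Name/Weight lines means that any other lines after "//", such as blank lines or position rulers, are skipped without special handling.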
  19. 19. THANK YOU
