SlideShare a Scribd company logo
1 of 27
BIOINFORMARICS
SEQUENCE FILE
FORMATS
Presented By: Alphy Joseph
Date: 03 March 2016
Important file formats
•Genbank
•FASTA
•PIR
•ALN/ClustalW2
•GCG/MSF
Early Data Formats
•These early databases stored sequence
data in a file. The file held the sequence
in ASCII (plain)text and had a
descriptive filename.
• This method became limiting when
researchers wanted to include
annotations and information about the
source of the sequence.
• Difficulty in searching for sequences
was also an issue.
Flat File Storage Data
Formats
•When GenBank, EMBL and DDBJ
formed a collaboration (1986),
sequence databases had moved to a
defined flat file format with a shared
feature table format and annotation
standards.
•The PIR also adopted a similar format
for protein sequences
•The flat file formats from the
sequence databases are still used to
access and display sequence and
annotation. They are also convenient
for storage of local copies.
FASTA Format
• Bioinformaticists have developed a
standard format for nucleotide and
protein sequences that allows them to
be read by a wide range of programs.
This format is called FASTA format.
•FASTA format each nucleotide or
amino acid is represented using a
single letter.
•The first line of a FASTA is the
comment line, identified with either the
greater than symbol ‘>’. This line
identifies the sequence and includes the
accession number from NCBI,
Genbank or another repository.
•The remaining lines contain the
sequence,in lines of 80 or 120
characters per line.
PIR FORMAT
•A sequence in PIR format consists of:
–One line starting with
•a ">" (greater-than) sign, followed
by
•a two-letter code describing the
sequence type (P1, F1, DL, DC, RL,
RC, or XX), followed by
•a semicolon, followed by
•the sequence identification
–One line containing a textual
description of the sequence.
–One or more lines containing the
sequence itself. The end of the
sequence is marked by a "*"
(asterisk) character.
–Optionally, this can be followed by
one or more lines describing the
sequence. Software that is
supposed to read only the sequence
should ignore these.
•A file in PIR format may comprise
more than one sequence.
•The PIR format is also often referred
to as the NBRF format.
ALN/ClustalW
• The first line in the file must start with
the words "CLUSTALW". Other
information in the first line is ignored.
• One or more empty lines.
• One or more blocks of sequence data. Each
block consists of:
– One line for each sequence in the alignment.
Each line consists of:
•the sequence name
•white space
•up to 60 sequence symbols.
•optional - white space followed by a cumulative
count of residues for the sequences
– A line showing the degree of
conservation for the columns of the
alignment in this block.
– One or more empty lines
•Some rules about representing
sequences:
•Case doesn't matter.
•Sequence symbols should be from a
valid alphabet.
•Gaps are represented using hyphens
("-").
•The characters used to represent the
degree of conservation are
* -all residues or nucleotides in that
column are identical
: - conserved substitutions have been
observed
. -semi-conserved substitutions have
been observed
- no match.
GCG/MSF
•msf formatted multiple sequence files
are most often created when using
programs of the GCG suite.
• msf files include the sequence name
and the sequence itself, which is
usually aligned with other sequences
in the file.
• You can specify a single sequence or
many sequences within an msf file.
•Some of the hallmarks of a msf
formatted sequence are the same as a
single sequence gcg format file:
•Begins with the line (all uppercase) !!
NA_MULTIPLE_ALIGNMENT 1.0
for nucleic acid sequences or !!
AA_MULTIPLE_ALIGNMENT 1.0
for amino acid sequences.
• Do not edit or delete the file type if
its present.
•A description line which contains
informative text describing what is in
the file. You can add this information
to the top of the MSF file using a text
editor.
•A dividing line which contains the
number of bases or residues in the
sequence, when the file was created,
and importantly, two dots (..) which
act as a divider between the
descriptive information and the
•msf files contain some other
information as well:
•Name/Weight: The name of each
sequence included in the alignment, as
well as its length and checksum (both
non-editable) and weight (editable).
•Separating Line. Must include two
slashes (//) to divide the name/weight
information from the sequence
alignment.
•Multiple Sequence Alignment. Each
sequence named in the above
Name/Weight lines is included. The
alignment allows you to view the
relationship among sequences
THANK YOU

More Related Content

What's hot (20)

Blast and fasta
Blast and fastaBlast and fasta
Blast and fasta
 
Needleman-Wunsch Algorithm
Needleman-Wunsch AlgorithmNeedleman-Wunsch Algorithm
Needleman-Wunsch Algorithm
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)
 
Swiss prot database
Swiss prot databaseSwiss prot database
Swiss prot database
 
Entrez databases
Entrez databasesEntrez databases
Entrez databases
 
Gen bank databases
Gen bank databasesGen bank databases
Gen bank databases
 
UniProt
UniProtUniProt
UniProt
 
Nucleic acid and protein databanks
Nucleic acid and protein databanksNucleic acid and protein databanks
Nucleic acid and protein databanks
 
Dynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignmentDynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignment
 
Genome Database Systems
Genome Database Systems Genome Database Systems
Genome Database Systems
 
OMIM Database
OMIM DatabaseOMIM Database
OMIM Database
 
methods for protein structure prediction
methods for protein structure predictionmethods for protein structure prediction
methods for protein structure prediction
 
Clustal
ClustalClustal
Clustal
 
Fasta
FastaFasta
Fasta
 
Protein 3 d structure prediction
Protein 3 d structure predictionProtein 3 d structure prediction
Protein 3 d structure prediction
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Sequence Alignment In Bioinformatics
Sequence Alignment In BioinformaticsSequence Alignment In Bioinformatics
Sequence Alignment In Bioinformatics
 
PIR- Protein Information Resource
PIR- Protein Information ResourcePIR- Protein Information Resource
PIR- Protein Information Resource
 

Viewers also liked

Intro to Open Babel
Intro to Open BabelIntro to Open Babel
Intro to Open Babelbaoilleach
 
Computational biology bls 303
Computational biology bls 303Computational biology bls 303
Computational biology bls 303Bruno Mmassy
 
molecular file formats in bioinformatics
molecular file formats in bioinformaticsmolecular file formats in bioinformatics
molecular file formats in bioinformaticsnadeem akhter
 
BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES nadeem akhter
 
Chemical File Formats for storing chemical data
Chemical File Formats for storing chemical dataChemical File Formats for storing chemical data
Chemical File Formats for storing chemical dataAbhik Seal
 
databases in bioinformatics
databases in bioinformaticsdatabases in bioinformatics
databases in bioinformaticsnadeem akhter
 

Viewers also liked (12)

Intro to Open Babel
Intro to Open BabelIntro to Open Babel
Intro to Open Babel
 
Computational biology bls 303
Computational biology bls 303Computational biology bls 303
Computational biology bls 303
 
molecular file formats in bioinformatics
molecular file formats in bioinformaticsmolecular file formats in bioinformatics
molecular file formats in bioinformatics
 
Design your own test automation tool
Design your own test automation toolDesign your own test automation tool
Design your own test automation tool
 
BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Chemical File Formats for storing chemical data
Chemical File Formats for storing chemical dataChemical File Formats for storing chemical data
Chemical File Formats for storing chemical data
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Biological Databases
Biological DatabasesBiological Databases
Biological Databases
 
Biological databases
Biological databasesBiological databases
Biological databases
 
databases in bioinformatics
databases in bioinformaticsdatabases in bioinformatics
databases in bioinformatics
 
Biological databases
Biological databasesBiological databases
Biological databases
 

Similar to Sequence file formats

Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Databricks
 
Standarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesStandarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesYasset Perez-Riverol
 
Bdam presentation on parquet
Bdam presentation on parquetBdam presentation on parquet
Bdam presentation on parquetManpreet Khurana
 
16119 - Get to Know Your Data Sets (1).pdf
16119 - Get to Know Your Data Sets (1).pdf16119 - Get to Know Your Data Sets (1).pdf
16119 - Get to Know Your Data Sets (1).pdf3operatordcslipiPeng
 
SQL Server 2014 Memory Optimised Tables - Advanced
SQL Server 2014 Memory Optimised Tables - AdvancedSQL Server 2014 Memory Optimised Tables - Advanced
SQL Server 2014 Memory Optimised Tables - AdvancedTony Rogerson
 
ELF(executable and linkable format)
ELF(executable and linkable format)ELF(executable and linkable format)
ELF(executable and linkable format)Seungha Son
 
(Very u seful) different file format
(Very u seful) different file format(Very u seful) different file format
(Very u seful) different file formatJitendra Chinchore
 
Parquet and impala overview external
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview externalmattlieber
 
SRA-System (7).ppsx
SRA-System (7).ppsxSRA-System (7).ppsx
SRA-System (7).ppsxlaibayyy38
 
picard_poster_12_16_15
picard_poster_12_16_15picard_poster_12_16_15
picard_poster_12_16_15David E. Kling
 
Bibliographic format ISO 2709
Bibliographic format ISO 2709 Bibliographic format ISO 2709
Bibliographic format ISO 2709 Shahil mohammed
 
Tools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisTools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisSANJANA PANDEY
 
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Prof. Wim Van Criekinge
 
File handling in C by Faixan
File handling in C by FaixanFile handling in C by Faixan
File handling in C by FaixanٖFaiXy :)
 

Similar to Sequence file formats (20)

Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...
 
Avro intro
Avro introAvro intro
Avro intro
 
1650607.ppt
1650607.ppt1650607.ppt
1650607.ppt
 
Standarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesStandarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata files
 
Bdam presentation on parquet
Bdam presentation on parquetBdam presentation on parquet
Bdam presentation on parquet
 
16119 - Get to Know Your Data Sets (1).pdf
16119 - Get to Know Your Data Sets (1).pdf16119 - Get to Know Your Data Sets (1).pdf
16119 - Get to Know Your Data Sets (1).pdf
 
SQL Server 2014 Memory Optimised Tables - Advanced
SQL Server 2014 Memory Optimised Tables - AdvancedSQL Server 2014 Memory Optimised Tables - Advanced
SQL Server 2014 Memory Optimised Tables - Advanced
 
ELF(executable and linkable format)
ELF(executable and linkable format)ELF(executable and linkable format)
ELF(executable and linkable format)
 
(Very u seful) different file format
(Very u seful) different file format(Very u seful) different file format
(Very u seful) different file format
 
Parquet and impala overview external
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview external
 
SRA-System (7).ppsx
SRA-System (7).ppsxSRA-System (7).ppsx
SRA-System (7).ppsx
 
FS Mod2@AzDOCUMENTS.in.pdf
FS Mod2@AzDOCUMENTS.in.pdfFS Mod2@AzDOCUMENTS.in.pdf
FS Mod2@AzDOCUMENTS.in.pdf
 
picard_poster_12_16_15
picard_poster_12_16_15picard_poster_12_16_15
picard_poster_12_16_15
 
Ch6
Ch6Ch6
Ch6
 
Bibliographic format ISO 2709
Bibliographic format ISO 2709 Bibliographic format ISO 2709
Bibliographic format ISO 2709
 
Data.ppt
Data.pptData.ppt
Data.ppt
 
Oracle
OracleOracle
Oracle
 
Tools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisTools for Transcriptome Data Analysis
Tools for Transcriptome Data Analysis
 
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
 
File handling in C by Faixan
File handling in C by FaixanFile handling in C by Faixan
File handling in C by Faixan
 

Recently uploaded

Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspectsmuralinath2
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.Silpa
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxRenuJangid3
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsSérgio Sacani
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptxryanrooker
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxDiariAli
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.Silpa
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry Areesha Ahmad
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptxArvind Kumar
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Silpa
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...Monika Rani
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIADr. TATHAGAT KHOBRAGADE
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body Areesha Ahmad
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Silpa
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Silpa
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfSumit Kumar yadav
 

Recently uploaded (20)

Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptx
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 

Sequence file formats

  • 1. BIOINFORMARICS SEQUENCE FILE FORMATS Presented By: Alphy Joseph Date: 03 March 2016
  • 3. Early Data Formats •These early databases stored sequence data in a file. The file held the sequence in ASCII (plain)text and had a descriptive filename. • This method became limiting when researchers wanted to include annotations and information about the source of the sequence. • Difficulty in searching for sequences was also an issue.
  • 4. Flat File Storage Data Formats •When GenBank, EMBL and DDBJ formed a collaboration (1986), sequence databases had moved to a defined flat file format with a shared feature table format and annotation standards. •The PIR also adopted a similar format for protein sequences
  • 5. •The flat file formats from the sequence databases are still used to access and display sequence and annotation. They are also convenient for storage of local copies.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10. FASTA Format • Bioinformaticists have developed a standard format for nucleotide and protein sequences that allows them to be read by a wide range of programs. This format is called FASTA format. •FASTA format each nucleotide or amino acid is represented using a single letter.
  • 11. •The first line of a FASTA is the comment line, identified with either the greater than symbol ‘>’. This line identifies the sequence and includes the accession number from NCBI, Genbank or another repository. •The remaining lines contain the sequence,in lines of 80 or 120 characters per line.
  • 12.
  • 13. PIR FORMAT •A sequence in PIR format consists of: –One line starting with •a ">" (greater-than) sign, followed by •a two-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX), followed by •a semicolon, followed by •the sequence identification
  • 14. –One line containing a textual description of the sequence. –One or more lines containing the sequence itself. The end of the sequence is marked by a "*" (asterisk) character. –Optionally, this can be followed by one or more lines describing the sequence. Software that is supposed to read only the sequence should ignore these.
  • 15. •A file in PIR format may comprise more than one sequence. •The PIR format is also often referred to as the NBRF format.
  • 16.
  • 17. ALN/ClustalW • The first line in the file must start with the words "CLUSTALW". Other information in the first line is ignored. • One or more empty lines. • One or more blocks of sequence data. Each block consists of: – One line for each sequence in the alignment. Each line consists of: •the sequence name •white space •up to 60 sequence symbols. •optional - white space followed by a cumulative count of residues for the sequences
  • 18. – A line showing the degree of conservation for the columns of the alignment in this block. – One or more empty lines •Some rules about representing sequences: •Case doesn't matter. •Sequence symbols should be from a valid alphabet. •Gaps are represented using hyphens ("-").
  • 19. •The characters used to represent the degree of conservation are * -all residues or nucleotides in that column are identical : - conserved substitutions have been observed . -semi-conserved substitutions have been observed - no match.
  • 20.
  • 21. GCG/MSF •msf formatted multiple sequence files are most often created when using programs of the GCG suite. • msf files include the sequence name and the sequence itself, which is usually aligned with other sequences in the file. • You can specify a single sequence or many sequences within an msf file.
  • 22. •Some of the hallmarks of a msf formatted sequence are the same as a single sequence gcg format file: •Begins with the line (all uppercase) !! NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !! AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences. • Do not edit or delete the file type if its present.
  • 23. •A description line which contains informative text describing what is in the file. You can add this information to the top of the MSF file using a text editor. •A dividing line which contains the number of bases or residues in the sequence, when the file was created, and importantly, two dots (..) which act as a divider between the descriptive information and the
  • 24. •msf files contain some other information as well: •Name/Weight: The name of each sequence included in the alignment, as well as its length and checksum (both non-editable) and weight (editable). •Separating Line. Must include two slashes (//) to divide the name/weight information from the sequence alignment.
  • 25. •Multiple Sequence Alignment. Each sequence named in the above Name/Weight lines is included. The alignment allows you to view the relationship among sequences
  • 26.