1
 Data in a biological database is stored as sequences or in molecular (structural) form
 Each database uses its own file format
 The file format defines how data is represented in the database
 Categories of file formats
 Sequence database formats
 Molecular database formats
2
 GenBank flat-file Format
 FASTA Format
 Multi-FASTA Format
 GCG Format
 GCG-MSF Format
 EMBL Format
 Clustal Format
 Swiss-Prot Format
3
 Used by NCBI
 Divided into three parts
 Header: a brief, precise introductory part
 Features: the genes in the sequence, their locations in the genome, protein products, coding regions, etc.
 Sequence: starts with ORIGIN and ends with //, e.g. ORIGIN atcgatcgatgcgctat //
4
 HEADER
 Locus
 Definition
 Accession
 Version
 Dbsource: the database record the entry was derived from (the modification date appears on the Locus line)
 Keywords
 Source
 Organism
 References
 Authors
 Title
 Journal
 Medline ID: identifier of the published reference in MEDLINE
 Comment
 FEATURES
 SEQUENCE
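The three-part layout (header, features, sequence) can be illustrated with a short Python sketch. This is only a minimal illustration, not a full GenBank parser: the function name `split_genbank` and the sample record `DEMO0001` are hypothetical, and the sketch assumes one record whose sections are introduced by `FEATURES` and `ORIGIN` and closed by `//`.

```python
def split_genbank(record):
    """Split one GenBank flat-file record into header, features and sequence."""
    header, features, sequence = [], [], []
    section = header                      # a record starts with the header
    for line in record.splitlines():
        if line.startswith("//"):         # record terminator
            break
        if line.startswith("FEATURES"):   # start of the feature table
            section = features
            continue
        if line.startswith("ORIGIN"):     # start of the sequence part
            section = sequence
            continue
        section.append(line)
    # Sequence lines carry position numbers and spaces; keep letters only.
    bases = "".join(ch for ch in "".join(sequence) if ch.isalpha())
    return {"header": header, "features": features, "sequence": bases}


# Hypothetical record, made up for illustration only.
SAMPLE = """\
LOCUS       DEMO0001      17 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical demo record.
ACCESSION   DEMO0001
FEATURES             Location/Qualifiers
     source          1..17
ORIGIN
        1 atcgatcgat gcgctat
//
"""

parts = split_genbank(SAMPLE)
print(len(parts["header"]), "header lines")
print(parts["sequence"])
```

Real records have many more header keywords and nested feature qualifiers; an established library is a better choice for production work.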
5
6
7
8
 One-line header
 Starts with > followed by the name of the gene
 Sequence of the gene or protein
 Blank spaces, paragraph marks and numerals in the sequence are all ignored
 An asterisk (*) may mark the end of the sequence
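The FASTA rules listed here can be sketched in a few lines of Python. The function name `read_fasta_record` is a hypothetical helper, and the sketch assumes a single record in which spaces, line breaks and numerals are ignored and a trailing asterisk, if present, is dropped.

```python
def read_fasta_record(text):
    """Return (name, sequence) for a single FASTA record."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' header line")
    name = lines[0][1:].strip()           # text after '>' is the name
    body = "".join(lines[1:])
    # Drop blanks, line breaks and numerals; keep letters and '*'.
    residues = "".join(ch for ch in body if ch.isalpha() or ch == "*")
    return name, residues.rstrip("*")     # remove the optional end marker


name, sequence = read_fasta_record(
    ">p53\nctcgaggggc ctagacattg\n61 gccagggtgt*\n"
)
print(name, len(sequence))
```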
9
 >p53
ctcgaggggc ctagacattg ccctccagag agagcaccca
acaccctcca ggcttgaccg
61 gccagggtgt ccccttccta ccttggagag agcagcccca
gggcatcctg cagggggtgc
121 tgggacacca gctggccttc aaggtctctg cctccctcca
gccaccccac tacacgctgc
181 tgggatcctg gatctcagct ccctggccga caacactggc
aaactcctac tcatccacga
241 aggccctcct gggcatggtg gtccttccca gcctggcagt
ctgttcctca cacaccttgt
301 tagtgcccag cccctgaggt tgcagctggg ggtgtctctg
aagggctgtg agcccccagg
361 aagccctggg gaagtgcctg ccttgcctcc ccccggccct
10
11
 An aggregation of FASTA records like the one listed above
 Multiple sequences follow one after the other
 All in a single file
 Accepted by several databases and tools, for example
 Clustal W
 Multalin
12
 >jhuma
gccagggtgt ccccttccta ccttggagag agcagcccca
gggcatcctg cagggggtgc
 >bhuma
gccagggtgt ccccttccta ccttggagag agcagcccca
gggcatcctg cagggggtgc
 >puma
gccagggtgt ccccttccta ccttggagag agcagcccca
gggcatcctg cagggggtgc
 >zuma
gccagggtgt ccccttccta ccttggagag agcagcccca
gggcatcctg cagggggtgc
13
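Because a Multi-FASTA file is simply several FASTA records in a row, it can be read by starting a new record at every `>` line. The sketch below assumes that layout; `read_multi_fasta` is a hypothetical helper name, and the two records reuse names and sequence from the slide.

```python
def read_multi_fasta(text):
    """Return {name: sequence} for every record in a Multi-FASTA text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):          # header line opens a new record
            name = line[1:].strip()
            records[name] = []
        elif name is not None:            # sequence line of the current record
            records[name].append("".join(ch for ch in line if ch.isalpha()))
    return {n: "".join(chunks) for n, chunks in records.items()}


MULTI = """\
>jhuma
gccagggtgt ccccttccta ccttggagag agcagcccca
gggcatcctg cagggggtgc
>bhuma
gccagggtgt ccccttccta ccttggagag agcagcccca
gggcatcctg cagggggtgc
"""

sequences = read_multi_fasta(MULTI)
for record_name, seq in sequences.items():
    print(record_name, len(seq))
```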
14
 GCG: Genetics Computer Group
 The first line identifies the type of sequence
 !!NA_SEQUENCE 1.0 (nucleic acid)
 !!AA_SEQUENCE 1.0 (amino acid)
 A simple format that gives just the sequence of the gene or protein
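Since the first line of a GCG file announces the sequence type, a reader only needs to inspect that line. The sketch below assumes the tags are spelled `!!NA_SEQUENCE` and `!!AA_SEQUENCE`; `gcg_sequence_type` is a hypothetical helper name.

```python
# Tags expected on the first line of a GCG file (assumed spelling).
GCG_TYPES = {
    "!!NA_SEQUENCE": "nucleic acid",
    "!!AA_SEQUENCE": "amino acid",
}


def gcg_sequence_type(first_line):
    """Return the sequence type declared on the first line of a GCG file."""
    words = first_line.split()
    tag = words[0] if words else ""
    return GCG_TYPES.get(tag, "unknown")


print(gcg_sequence_type("!!NA_SEQUENCE 1.0"))
print(gcg_sequence_type("!!AA_SEQUENCE 1.0"))
```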
15
16
 Multiple sequences
 Sequence name
 Sequences
 Alignment
 The word PileUp indicates that the file contains multiple sequences
 The mandatory word MSF tells that it is a GCG-MSF file and not just GCG
 Comments are terminated with //
 Followed by 2 consecutive blank lines
 Then the multiple sequences
17
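A rough Python sketch of reading such a file is given here. It assumes the common MSF layout in which `Name:` lines list the sequences before the `//` divider and each line after it starts with a sequence name followed by a chunk of the alignment. The helper `read_msf`, the file name `demo.msf` and the checksum values in the sample are all made up for illustration.

```python
def read_msf(text):
    """Return {name: aligned sequence} from a GCG-MSF alignment text."""
    lines = text.splitlines()
    if not any("MSF:" in line for line in lines):
        raise ValueError("no MSF keyword found; not a GCG-MSF file")
    aligned, in_body = {}, False
    for line in lines:
        stripped = line.strip()
        if not in_body:
            if stripped.startswith("Name:"):   # one Name: line per sequence
                aligned[stripped.split()[1]] = ""
            elif stripped == "//":             # divider before the alignment
                in_body = True
            continue
        parts = stripped.split()
        if parts and parts[0] in aligned:      # alignment row for a known name
            aligned[parts[0]] += "".join(parts[1:])
    return aligned


# Hypothetical file content; checksums are placeholders, not computed values.
SAMPLE_MSF = """\
PileUp

 demo.msf  MSF: 12  Type: N  Check: 0  ..

 Name: jhuma  Len: 12  Check: 0  Weight: 1.00
 Name: bhuma  Len: 12  Check: 0  Weight: 1.00

//


jhuma  gccagggtgt cc
bhuma  gccag.gtgt cc
"""

alignment = read_msf(SAMPLE_MSF)
for seq_name, row in alignment.items():
    print(seq_name, row)
```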
18
 Sequence format of the European Molecular Biology Laboratory (EMBL) database
 Starts with an ID (identification) line
 Ends with // as terminator
 Each line type has its own code and format
 Used to record various forms of data
 e.g. DNA, RNA, gene, protein, etc.
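The line-by-line structure of an EMBL entry can be sketched as follows. The sketch assumes that each line begins with a two-letter code, that `XX` lines are empty spacers, and that sequence lines after `SQ` begin with blanks. `read_embl` and the sample entry `DEMO0001` are hypothetical and heavily shortened.

```python
from collections import defaultdict


def read_embl(record):
    """Group the lines of one EMBL entry by their two-letter line code."""
    if not record.lstrip().startswith("ID"):
        raise ValueError("an EMBL entry must start with an ID line")
    fields = defaultdict(list)
    for line in record.splitlines():
        if line.startswith("//"):              # entry terminator
            break
        if not line.strip() or line.startswith("XX"):
            continue                           # blank or spacer line
        code = line[:2].strip() or "sequence"  # sequence lines have no code
        fields[code].append(line[5:].strip())
    return dict(fields)


# Hypothetical, shortened entry used only for illustration.
SAMPLE_EMBL = """\
ID   DEMO0001; SV 1; linear; genomic DNA; STD; UNC; 17 BP.
XX
AC   DEMO0001;
XX
DE   Hypothetical demo record.
XX
SQ   Sequence 17 BP;
     atcgatcgat gcgctat                                                17
//
"""

entry = read_embl(SAMPLE_EMBL)
# Remove blanks and the trailing position count from the sequence lines.
embl_sequence = "".join(ch for ch in "".join(entry["sequence"]) if ch.isalpha())
print(sorted(entry), embl_sequence)
```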
19
20
 Clustal is one of the most widely used multiple sequence alignment tools
 CLUSTAL W
 CLUSTAL X
 The format holds aligned protein or gene sequences
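A Clustal alignment file can be read with a short sketch like the one here. It assumes the usual layout: a first line starting with `CLUSTAL`, then blocks in which each row is a sequence name followed by a chunk of the alignment, with an indented conservation line under each block. `read_clustal` is a hypothetical helper and the header text of the sample is illustrative.

```python
def read_clustal(text):
    """Return {name: aligned sequence} from a Clustal alignment text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("a Clustal file must start with the word CLUSTAL")
    aligned = {}
    for line in lines[1:]:
        # Skip blank lines and the indented conservation line (* : .).
        if not line.strip() or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, chunk = parts[0], parts[1]
        aligned[name] = aligned.get(name, "") + chunk   # join the blocks
    return aligned


SAMPLE_ALN = """\
CLUSTAL W (1.83) multiple sequence alignment

jhuma      gccagggtgt
bhuma      gccag-gtgt
           ***** ****

jhuma      cccc
bhuma      cccc
           ****
"""

clustal_rows = read_clustal(SAMPLE_ALN)
for row_name, row in clustal_rows.items():
    print(row_name, row)
```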
21
22
 Protein sequence database
 ID: identification
 AC: accession number
 DE: description
 GN: gene name
 OS: organism species
 OG: organelle
 OC: organism classification
 OX: organism taxonomy cross-reference
 RN: reference number
 RP: reference position
23
 RC: reference comment
 RX: reference cross-reference
 RA: reference author
 RT: reference title
 RL: reference location
 CC: comments
 DR: database cross-reference
 KW: keywords
 FT: feature table
 SQ: sequence header
24
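The two-letter line codes lend themselves to a simple lookup table. The sketch below labels each line of an entry with the meaning of its code; `annotate_swissprot` and the sample lines are hypothetical, and `CC` is treated as the comments line.

```python
# Meaning of each Swiss-Prot line code covered on the slides.
LINE_CODES = {
    "ID": "identification", "AC": "accession number", "DE": "description",
    "GN": "gene name", "OS": "organism species", "OG": "organelle",
    "OC": "organism classification", "OX": "organism taxonomy cross-reference",
    "RN": "reference number", "RP": "reference position",
    "RC": "reference comment", "RX": "reference cross-reference",
    "RA": "reference author", "RT": "reference title",
    "RL": "reference location", "CC": "comments",
    "DR": "database cross-reference", "KW": "keywords",
    "FT": "feature table", "SQ": "sequence header",
}


def annotate_swissprot(entry_text):
    """Return (code, meaning, content) for every coded line of an entry."""
    annotated = []
    for line in entry_text.splitlines():
        code = line[:2]
        if code in LINE_CODES:
            annotated.append((code, LINE_CODES[code], line[5:].strip()))
    return annotated


# Hypothetical entry lines, made up for illustration.
SAMPLE_SP = """\
ID   DEMO_HUMAN     Reviewed;   5 AA.
AC   DEMO01;
DE   Hypothetical demo protein.
CC   -!- FUNCTION: Made-up example.
SQ   SEQUENCE   5 AA;
     MKTAY
//
"""

for code, meaning, content in annotate_swissprot(SAMPLE_SP):
    print(code, "=", meaning, "|", content)
```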
25
 Several software tools have been designed by … ?
 The aim of these tools is to convert one sequence format into another
 Some of the software widely used for sequence interconversion:
 ReadSeq
 GCG
 SeqVerter
 Seqret
26
 ReadSeq was developed by Dr. D. G. Gilbert
 Automated conversion
 Supports 18 file formats that can be interconverted into one another
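To make the idea of format interconversion concrete, here is a small pure-Python sketch that turns a GenBank flat-file record into FASTA. It is not how ReadSeq or any of the listed tools work internally; `genbank_to_fasta` and the record `DEMO0001` are hypothetical, and only the LOCUS name and the ORIGIN sequence are carried over.

```python
import textwrap


def genbank_to_fasta(record, width=60):
    """Convert one GenBank flat-file record to a FASTA record."""
    name, seq_lines, in_origin = "unknown", [], False
    for line in record.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]        # second word is the locus name
        elif line.startswith("ORIGIN"):
            in_origin = True              # sequence lines follow
        elif line.startswith("//"):
            break                         # end of record
        elif in_origin:
            seq_lines.append(line)
    # Drop position numbers and spaces, then wrap to the chosen width.
    seq = "".join(ch for ch in "".join(seq_lines) if ch.isalpha())
    return ">" + name + "\n" + "\n".join(textwrap.wrap(seq, width)) + "\n"


# Hypothetical record, made up for illustration only.
GENBANK_SAMPLE = """\
LOCUS       DEMO0001      17 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical demo record.
ORIGIN
        1 atcgatcgat gcgctat
//
"""

print(genbank_to_fasta(GENBANK_SAMPLE), end="")
```

Going the other way (FASTA to GenBank) is not possible without extra input, because FASTA does not store the header and feature information.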
27
28
29
 FASTA
 Multi-FASTA
 GenBank flat file
 GCG format
 EMBL
 Clustal
 Swiss-Prot
Make a file in each format by this Friday and send them as attachments in an email
30
31

sequence of file formats in bioinformatics
