Chemical File Formats for storing chemical data

Published in: Technology

  1. Molecular File Formats
  2. Types of File Formats
     Elsevier MDL supports a number of file formats for the representation and communication of chemical information.
     • molfiles: Each molfile describes a single molecular structure, which can contain disjoint fragments such as salts.
     • SDfiles: Structure-data files, which contain structures and data for any number of molecules. SDfiles are the primary format for large-scale data transfer between MDL databases.
     • RGfiles: An RGfile describes a single molecular query with Rgroups. Each RGfile is a combination of Ctabs defining the root molecule and each member of each Rgroup in the query.
     • rxnfiles: Reaction files. Each rxnfile contains the structural information for the reactants and products of a single reaction.
     • RDfiles: Reaction-data files. An RDfile is a more general format that can include reactions as well as molecules.
  3. File Formats
     http://c4.cabrillo.edu/404/ctfile.pdf
  4. Connection Table [Ctab]
     A connection table (Ctab) contains information describing the structural relationships and properties of a collection of atoms. The connection table is fundamental to all of the MDL file formats.

     Counts line:
       9  9  0  0  0  0  0  0  0  0999 V2000

     Atom block:
        -1.0200    1.5300    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        -0.5100    2.4100    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
         0.5000    2.3900    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
         1.0000    3.2700    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
         2.0300    3.2700    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
         0.5000    4.1500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        -0.5000    4.1500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        -1.0100    3.2800    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        -2.0300    3.2800    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0

     Bond block:
       1  2  1  0
       2  8  1  0
       2  3  2  3
       3  4  1  0
       4  5  2  0
       4  6  1  0
       6  7  2  3
       7  8  1  0
       8  9  2  0
  5. Ctab Features
     Parts of a Ctab:
     • Counts Line: Important specifications here relate to the number of atoms, bonds, and atom lists, the chiral flag setting, and the Ctab version.
     • Atom Block: Specifies the atomic symbol and any mass difference, charge, stereochemistry, and associated hydrogens for each atom.
     • Bond Block: Specifies the two atoms connected by the bond, the bond type, and any bond stereochemistry and topology (chain or ring properties) for each bond.
     • Properties Block: Provides for future expandability of Ctab features, while maintaining compatibility with earlier Ctab configurations.
  6. Counts Line (Ctab part 1)
     Format: aaabbblllfffcccsssxxxrrrpppiiimmmvvvvvv
     where
     • aaa = number of atoms (current max 255) [Generic]
     • bbb = number of bonds (current max 255) [Generic]
     • lll = number of atom lists (max 30) [Query]
     • fff = (obsolete)
     • ccc = chiral flag: 0 = not chiral, 1 = chiral [Generic]
     • sss = number of stext entries [MDL ISIS/Desktop]
     • xxx, rrr, ppp, iii = (obsolete)
     • mmm = number of lines of additional properties, including the M END line. This count is no longer supported; the default is set to 999. [Generic]
     • vvvvvv = Ctab version (V2000 in the examples here)

     Examples:
     • Six atoms, five bonds, the CHIRAL flag on, and three lines in the properties block:
         6  5  0  0  1  0              3 V2000
     • Nine atoms, nine bonds, and the CHIRAL flag off:
         9  9  0  0  0  0  0  0  0  0999 V2000
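The fixed-width layout of the counts line can be read by slicing three-character columns. The following is a minimal sketch, not an MDL implementation: the function name `parse_counts_line` and the dictionary keys are made up for illustration, and blank columns are assumed to mean 0.

```python
def parse_counts_line(line):
    """Split a V2000 counts line into its fixed-width fields.

    Hypothetical helper for illustration; blank fields are read as 0.
    """
    line = line.rstrip("\n").ljust(39)  # pad so short lines can still be sliced

    def field(start, width=3):
        text = line[start:start + width].strip()
        return int(text) if text else 0

    return {
        "atoms": field(0),           # aaa
        "bonds": field(3),           # bbb
        "atom_lists": field(6),      # lll
        "chiral": field(12) == 1,    # ccc
        "stext_entries": field(15),  # sss
        "property_lines": field(30), # mmm
        "version": line[33:39].strip(),  # vvvvvv
    }


counts = parse_counts_line("  9  9  0  0  0  0  0  0  0  0999 V2000")
print(counts["atoms"], counts["bonds"], counts["chiral"], counts["version"])
```

Slicing by column rather than splitting on whitespace matters here because adjacent fields can touch, as in `0999`, where the last obsolete field and the `mmm` value run together.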
  7. Atom Block (Ctab part 2)
     The Atom Block is made up of atom lines, one line per atom, with the following format:
     xxxxx.xxxxyyyyy.yyyyzzzzz.zzzz aaaddcccssshhhbbbvvvHHHrrriiimmmnnneee

     Field: Meaning: Values
     • x y z: Atom coordinates
     • aaa: Atom symbol: entry in the periodic table, or L for atom list, A, Q, * for unspecified atom, LP for lone pair, or R# for Rgroup label
     • dd: Mass difference: -3, -2, -1, 0, 1, 2, 3, 4 (0 if the value is beyond these limits)
     • ccc: Charge: 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3
     • sss: Atom stereo parity: 0 = not stereo, 1 = odd, 2 = even, 3 = either or unmarked stereo center
     • hhh: Hydrogen count + 1: 1 = H0, 2 = H1, 3 = H2, 4 = H3, 5 = H4
     • bbb: Stereo care box: 0 = ignore stereo configuration of this double bond atom, 1 = stereo configuration of double bond atom must match
     • vvv: Valence: 0 = no marking (default), 1 to 14 = valence 1 to 14, 15 = zero valence
     • HHH: H0 designator: 0 = not specified, 1 = no H atoms allowed
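As a rough illustration of the atom line layout, the sketch below slices the coordinate, symbol, mass difference, and charge columns and decodes the charge code using the table above. The names `parse_atom_line` and `CHARGE_CODES` are hypothetical, and the column offsets assume the standard V2000 layout with a single space between the z coordinate and the atom symbol.

```python
# Charge codes from the atom block table; code 4 (doublet radical) carries no formal charge.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}


def _int_field(line, start, width):
    """Read a fixed-width integer column; blank or missing columns count as 0."""
    text = line[start:start + width].strip()
    return int(text) if text else 0


def parse_atom_line(line):
    """Decode the leading fields of one V2000 atom line (illustrative only)."""
    charge_code = _int_field(line, 36, 3)
    return {
        "x": float(line[0:10]),
        "y": float(line[10:20]),
        "z": float(line[20:30]),
        "symbol": line[31:34].strip(),
        "mass_difference": _int_field(line, 34, 2),
        "charge": CHARGE_CODES.get(charge_code, 0),
        "is_doublet_radical": charge_code == 4,
        "stereo_parity": _int_field(line, 39, 3),
    }


# First atom of the connection table example, rebuilt with fixed-width formatting.
example = "%10.4f%10.4f%10.4f %-3s%2d%3d%3d" % (-1.02, 1.53, 0.0, "C", 0, 0, 0)
atom = parse_atom_line(example)
print(atom["symbol"], atom["x"], atom["y"], atom["charge"])
```

Keeping the charge table as a dictionary makes the non-obvious encoding (1 means +3, 3 means +1) explicit in one place instead of burying it in arithmetic.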
  8. Bond Block (Ctab part 3)
     The Bond Block is made up of bond lines, one line per bond, with the following format:
     111222tttsssxxxrrrccc

     Field: Meaning: Values
     • 111: First atom number: 1 to number of atoms
     • 222: Second atom number: 1 to number of atoms
     • ttt: Bond type: 1 = Single, 2 = Double, 3 = Triple, 4 = Aromatic, 5 = Single or Double, 6 = Single or Aromatic, 7 = Double or Aromatic, 8 = Any
     • sss: Bond stereo:
         Single bonds: 0 = not stereo, 1 = Up, 4 = Either, 6 = Down
         Double bonds: 0 = use x-, y-, z-coordinates from the atom block to determine cis or trans, 3 = cis or trans (either) double bond
     • rrr: Bond topology: 0 = Either, 1 = Ring, 2 = Chain
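Bond lines follow the same three-character column scheme, so a parser can index fields by position. This is a minimal sketch under that assumption; `parse_bond_line`, `BOND_TYPES`, and `BOND_TOPOLOGY` are illustrative names, and trailing fields that are absent from a line (as in the short bond lines of the connection table example) are read as 0.

```python
BOND_TYPES = {
    1: "single", 2: "double", 3: "triple", 4: "aromatic",
    5: "single or double", 6: "single or aromatic",
    7: "double or aromatic", 8: "any",
}
BOND_TOPOLOGY = {0: "either", 1: "ring", 2: "chain"}


def parse_bond_line(line):
    """Decode one V2000 bond line laid out as 111222tttsssxxxrrrccc."""

    def field(index):
        # Each field is three characters wide; missing trailing fields read as 0.
        text = line[index * 3:index * 3 + 3].strip()
        return int(text) if text else 0

    return {
        "first_atom": field(0),
        "second_atom": field(1),
        "bond_type": BOND_TYPES.get(field(2), "unknown"),
        "stereo": field(3),
        "topology": BOND_TOPOLOGY.get(field(5), "unknown"),
    }


bond = parse_bond_line("  2  3  2  3")
print(bond["first_atom"], bond["second_atom"], bond["bond_type"], bond["stereo"])
```

The stereo value is returned as the raw code because its meaning depends on the bond type: 3 on a double bond means cis or trans (either), while single bonds use 1, 4, and 6.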
  9. Mol File
     A molfile consists of a header block and a connection table. The slide shows a molfile for alanine next to the corresponding structure (figure not reproduced in this transcript), with these annotations:
     • Header block: identifies the molfile with the molecule name, user's name, program, date, and other miscellaneous information and comments.
     • Charge entries: atom 4 has charge +1 and atom 6 has charge -1.
     • One isotope entry: atom 3 has mass 13.
  10. Representation of Stereochemistry
      What is stereochemistry? See http://www.chemhelper.com/enantiomers.html
  11. Representation of Stereochemistry: Atom Block
  12. Representation of Stereochemistry: Bond Block
      A bond stereo value of 1 marks an Up stereo bond.
  13. RGfiles
      In RGfiles, lines beginning with $ define the overall structure of the Rgroup query; the molfile header block is embedded in the Rgroup header block. In addition to the primary connection table (Ctab block) for the root structure, a Ctab block defines each member (*m) within each Rgroup (*r).
  14. Example of RGfile
  15. SDfile
      An SDfile (structure-data file) contains the structural information and associated data items for one or more compounds.
      • *l is repeated for each line of data
      • *d is repeated for each data item
      • *c is repeated for each compound
  16. Example of SDfile
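The repeating structure of an SDfile (one molfile plus data items per compound) can be sketched as a small reader. The details below are not stated on the slides and are taken from the CTfile specification as assumptions: each compound ends with a `$$$$` line, each data item starts with a header line such as `> <NAME>`, and a blank line ends a data item. The function names and the sample text are hypothetical.

```python
def parse_sd_record(lines):
    """Split one compound's lines into its molfile text and its data items."""
    # The molfile part runs through the "M  END" line (assumed to be present).
    end = next((i for i, l in enumerate(lines) if l.startswith("M  END")),
               len(lines) - 1)
    data, name = {}, None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            start, stop = line.find("<"), line.rfind(">")
            name = line[start + 1:stop] if 0 <= start < stop else None
            if name is not None:
                data[name] = []
        elif not line.strip():
            name = None                 # a blank line closes the data item
        elif name is not None:
            data[name].append(line)     # one entry per line of data
    return {"molfile": "\n".join(lines[:end + 1]),
            "data": {key: "\n".join(value) for key, value in data.items()}}


def read_sdfile_records(text):
    """Return one parsed record per compound in an SDfile string."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(parse_sd_record(current))
            current = []
        else:
            current.append(line)
    return records


# Made-up two-compound SDfile with empty structures, for illustration only.
empty_mol = ["name", "  program", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END"]
sample = "\n".join(
    empty_mol + ["> <ID>", "A-1", "", "$$$$"]
    + empty_mol + ["> <ID>", "A-2", "", "> <NOTE>", "line one", "line two", "", "$$$$"]
)
for record in read_sdfile_records(sample):
    print(record["data"])
```

Reading line by line, instead of splitting the whole text on `$$$$`, keeps the three header lines of each molfile intact even when the first header line is blank.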
  17. RXNfile
      Rxnfiles contain structural data for the reactants and products of a reaction, where:
      • *r is repeated for each reactant
      • *p is repeated for each product
  18. RXNfile example
  19. RDfiles
      • An RDfile (reaction-data file) consists of a set of editable "records". Each record defines a molecule or reaction, and its associated data.
      • The [RDfile Header] must occur at the beginning of the physical file and identifies the file as an RDfile. A version stamp of 1 is given for future expansion of the format.
      • $DATM: Date/time (M/D/Y, c) stamp. This line is treated as a comment and ignored when the file is read.
      • *d is repeated for each data item
      • *r is repeated for each reaction or molecule
  20. RDfile example
  21. Mol2 Files from Tripos
      The mol2 format originates from Tripos. It contains atom coordinates, bonds, and substructure information, and it supports partial charges and isotopes. In the example shown on the slide:
      • Lines 1, 2, 3, 5 and 6 are comments. They contain the molecule name and information about when the molecule was created and last modified.
      • Lines 8, 15, 28, and 41 are Record Type Indicators (RTIs). An RTI indicates the type of data that follows in a .mol2 file.
      • Lines 9-12, 16-27, 29-40, and 42 are all data records.
  22. Parts of a mol2 File

      @<TRIPOS>MOLECULE
      • The first data line is the name of the molecule.
      • The second data line contains the number of atoms, bonds, substructures, features, and sets associated with the molecule.
      • The third data line is the molecule type.
      • The fourth data line gives the type of charges associated with the molecule.
      • The fifth data line contains the internal SYBYL status bits associated with the molecule.
      • The last data line contains any comment associated with the molecule.

      @<TRIPOS>ATOM
      Format: atom_id atom_name x y z atom_type [subst_id [subst_name [charge [status_bit]]]]
      Example: 1 CA -0.149 0.299 0.000 C.3 1 ALA1 0.000 BACKBONE|DICT|DIRECT
      In this example the atom has ID number 1. It is named CA and is located at (-0.149, 0.299, 0.000). Its atom type is C.3. It belongs to the substructure with ID 1, which is named ALA1. The charge associated with the atom is 0.000, and the SYBYL status bits associated with the atom are BACKBONE, DICT, and DIRECT.

      @<TRIPOS>BOND
      Format: bond_id origin_atom_id target_atom_id bond_type [status_bits]
      Example: 1 1 2 ar
      The example bond has ID number 1 and connects atoms 1 and 2. It is an aromatic bond.

      @<TRIPOS>SUBSTRUCTURE
      Format: subst_id subst_name root_atom [subst_type [dict_type [chain [sub_type [inter_bonds [status [comment]]]]]]]
      Example: 1 BENZENE1 PERM 0 **** ****** 0 ROOT
      The substructure has ID 1 and the name BENZENE1. It is of type PERM and is associated with dictionary type 0. The SYBYL status bits indicate that it is the ROOT substructure.
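Unlike the MDL formats, mol2 data records are free-format, so an ATOM record can be read by splitting on whitespace and treating the trailing fields as optional. The sketch below assumes exactly the field order given above; `parse_mol2_atom` is a hypothetical helper, not part of any Tripos or SYBYL library.

```python
def parse_mol2_atom(line):
    """Parse one @<TRIPOS>ATOM data record; bracketed fields are optional."""
    parts = line.split()
    atom = {
        "atom_id": int(parts[0]),
        "atom_name": parts[1],
        "xyz": (float(parts[2]), float(parts[3]), float(parts[4])),
        "atom_type": parts[5],
    }
    optional = parts[6:]
    if len(optional) > 0:
        atom["subst_id"] = int(optional[0])
    if len(optional) > 1:
        atom["subst_name"] = optional[1]
    if len(optional) > 2:
        atom["charge"] = float(optional[2])
    if len(optional) > 3:
        # Status bits are separated by "|" within a single field.
        atom["status_bits"] = optional[3].split("|")
    return atom


atom = parse_mol2_atom("1 CA -0.149 0.299 0.000 C.3 1 ALA1 0.000 BACKBONE|DICT|DIRECT")
print(atom["atom_name"], atom["atom_type"], atom["status_bits"])
```

Because the optional fields are nested in the format definition, each one can only appear when all the fields before it are present, which is why positional checks on the remaining tokens are enough.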
  23. References
      • http://www.tripos.com/data/support/mol2.pdf
      • http://accelrys.com/products/informatics/cheminformatics/ctfile-formats/no-fee.php
      • Arthur Dalby et al. Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci. 1992, 32, 244-255.
      • http://www.chem.ucla.edu/harding/tutorials/stereochem/rsez.pdf
      • http://www.chem.ucla.edu/harding/notes/notes_14C_stereo03.pdf
