Data model


  1. CVSP New Data Model: Standardization and Parents
  2. Why is there a need for a new data model?
     • Details of the proposed new data model
     • Concept of Substance and Compound
     • Standardization workflow
     • The benefit of having parents (we all know this)
  3. Deposited Record: “Substance”
     A unique Substance Identifier (SID) is assigned to each deposited record (a record identified by the combination of DataSource and the depositor’s internal database registry identifier).
     Benefits of keeping deposited record data (“Substances”) and standardized record data (“Compounds”) in separate, independent layers (the archive model):
     • Depositors’ records (Substances) are preserved as they are, with no alteration
     • Depositors’ records are versioned when changes occur. Only the most recent version is used in links and calculations
     • The same chemical may be deposited by several depositors; each deposit gets a different SID, but all of them are linked to the same standardized compound
     • Any record can be accepted, even one that does not produce an InChI (e.g. plant extracts, blood samples, polymers)
     • The Substance identifier (SID) is guaranteed not to change
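A minimal sketch of the two-layer archive model described on this slide, assuming an in-memory store. The class and field names (`Archive`, `Substance`, `SubstanceVersion`, `registry_id`) are hypothetical and chosen only for illustration; the slides do not specify a schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SubstanceVersion:
    """One snapshot of a deposited record, stored exactly as deposited."""
    version: int
    sdf: str


@dataclass
class Substance:
    sid: int                      # assigned once, never changes
    data_source: str
    registry_id: str              # depositor's internal identifier (hypothetical name)
    versions: List[SubstanceVersion] = field(default_factory=list)
    csid: Optional[int] = None    # standardized Compound; None if none could be produced

    @property
    def current(self) -> SubstanceVersion:
        """Only the most recent version is used in links and calculations."""
        return self.versions[-1]


class Archive:
    """Toy in-memory store keyed by (DataSource, registry identifier)."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], Substance] = {}
        self._next_sid = 1

    def deposit(self, data_source: str, registry_id: str, sdf: str,
                csid: Optional[int] = None) -> Substance:
        key = (data_source, registry_id)
        substance = self._by_key.get(key)
        if substance is None:
            # First deposit of this record: assign a new SID.
            substance = Substance(self._next_sid, data_source, registry_id)
            self._next_sid += 1
            self._by_key[key] = substance
        # Every deposit appends a version; older versions are kept unaltered.
        substance.versions.append(SubstanceVersion(len(substance.versions) + 1, sdf))
        substance.csid = csid
        return substance
```

Two depositors submitting the same chemical get two SIDs that share one CSID, while a re-deposit by the same depositor keeps its SID and only adds a version.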
  4. [Diagram: deposited Substances (SID 1, SID 2, each with its SDF, DataSource, synonyms and XRefs) link to standardized Compounds (CSID 1, CSID 2, each with a standardized MOL plus DataSources, synonyms and XRefs), which link to Parent compounds: Fragment Parent (CSID2), Charge Parent (CSID3), Isotope Parent (CSID4), Stereo Parent (CSID5), Tautomer Parent (CSID6), Super Parent (CSID7)]
  5. What happens when standardization rules change?
     Would that affect Substance-Compound relationships? Would the SID-CSID mapping change? Yep, it is possible!
     • After an occasional total ChemSpider re-standardization we can’t guarantee that the same standardized compound (CSID) will be linked to a Substance; the mapping may change. This, however, will not in any way affect depositors’ SIDs.
     • Depositors should be encouraged to use their substance identifiers (SIDs) when referring to ChemSpider
     • We need to develop compound permalinks (URLs) that depositors can always use to get to their up-to-date CSIDs via SIDs. In that case, our re-standardization wouldn’t affect external references.
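The permalink idea can be sketched as a lookup that is keyed by the stable SID and whose target is swapped whenever re-standardization runs. This is only an illustration under assumptions: `SidResolver` and its methods are hypothetical names, and no actual ChemSpider URL scheme is implied.

```python
from typing import Dict, Optional


class SidResolver:
    """Resolves a stable SID to whatever CSID it currently maps to."""

    def __init__(self) -> None:
        self._sid_to_csid: Dict[int, int] = {}

    def restandardize(self, new_mapping: Dict[int, int]) -> None:
        # A total re-standardization replaces the whole SID -> CSID mapping.
        # The SIDs themselves (the keys) are untouched.
        self._sid_to_csid = dict(new_mapping)

    def resolve(self, sid: int) -> Optional[int]:
        # A permalink handler would redirect to this CSID at request time.
        return self._sid_to_csid.get(sid)
```

Because external references hold only the SID, a change of the CSID behind it is invisible to the depositor.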
  6. What happens when a depositor revokes a Substance (SID)?
     • Revoking still versions the substance record: a new version of the record is created with a “not alive” flag
     • Revoked substances are no longer indexed
     • If no more Substances point at a Compound, the Compound gets deleted. Otherwise, the data from the revoked substances is pulled off the compound
     • If a revoked substance gets re-deposited, a new version is created with a “live” flag
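The revoke rules above can be sketched as follows, assuming a toy in-memory registry. `Registry` and its attributes are hypothetical; the real system presumably stores far more per version than a live flag.

```python
from typing import Dict, List, Set, Tuple


class Registry:
    def __init__(self) -> None:
        self.versions: Dict[int, List[Tuple[int, bool]]] = {}  # sid -> [(version, alive)]
        self.links: Dict[int, int] = {}                        # sid -> csid
        self.compounds: Dict[int, Set[int]] = {}               # csid -> live sids
        self.index: Set[int] = set()                           # searchable sids

    def _add_version(self, sid: int, alive: bool) -> None:
        history = self.versions.setdefault(sid, [])
        history.append((len(history) + 1, alive))

    def deposit(self, sid: int, csid: int) -> None:
        # A deposit (or re-deposit) creates a new version flagged "live".
        self._add_version(sid, alive=True)
        self.links[sid] = csid
        self.compounds.setdefault(csid, set()).add(sid)
        self.index.add(sid)

    def revoke(self, sid: int) -> None:
        # Revoking is itself a new version, flagged "not alive".
        self._add_version(sid, alive=False)
        self.index.discard(sid)
        csid = self.links[sid]
        live = self.compounds.get(csid, set())
        live.discard(sid)  # pull the revoked substance's data off the compound
        if not live:
            # No Substance points at the Compound any more: delete it.
            self.compounds.pop(csid, None)
```

Note that the version history is never truncated, so a revoke followed by a re-deposit leaves three versions: live, not alive, live.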
  7. FDA Substance Registration System, Version 5c, 2007
     • This guide is used to standardize the entry of substances into the Food and Drug Administration (FDA) Substance Registration System (SRS)
     • The primary purpose of this guide is to prevent duplicate entries of a single substance
     • Conventions for drawing structures and for organizing the characteristics of substances are included
     • The lack of a standardization system at FDA gave birth to the SRS SOP, which served as guidelines for curators to draw chemicals the same way and so avoid duplication in the database
  8. Standardization: is it possible to please all interested parties?
     Depending on the area of specialization:
     • Some folks may insist on neutralizing charges while others may feel differently
     • Some may think that the canonical tautomer should always be in a specific form
     • We believe that combining “mild” standardization with parents may be the right choice to please as many interested parties as possible
  9. Standardization, Step I: Organometallics
     • Always disconnect N, O, and F from metals. Example: (Ph3)Sn+… HO-
     • Disconnect nonmetals (except N, O, F) from transition metals (except Hg)
     • Ionize a free metal with a carboxylic acid (metals: Group I and II)
     • Whenever covalent bonds with metals are disconnected, charges are adjusted
  10. Standardization, Step II (CVSP only): Tautomer Canonicalization
      • In CVSP, tautomer canonicalization is part of standardization
      • In the OpenPHACTS model, tautomer canonicalization is not part of standardization. Instead, a tautomer-unsensitive (canonicalized tautomer) parent is generated
      Why is the OpenPHACTS approach different?
      • Mapping different tautomers of the same family to different standardized compounds gives better tautomer-specific annotation mapping (e.g. tautomer-specific NMR spectra, calculated properties)
      • Standardized compounds representing the same tautomeric family will have the same tautomeric parent: the canonicalized tautomer
  11. Standardization, Step III
      Some of the basic InChI normalization experience/rules were used (~30):
        [*;H+:1]>>[*;H:1]
        [O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
        [N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
        etc.
      FDA SRS rules added (~30):
        [n:1]=[O:2]>>[n+:1][O-:2]
        [*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
        [N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
        Thiopurine:
        [H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
        etc.
  12. CVSP standardization vs OpenPHACTS
      CVSP Standardization                       OpenPHACTS Standardization
      1. Disconnecting metals                    1. Disconnecting metals
      2. Canonicalizing tautomer                 2. Omitted
      3. Applying SMIRKS (InChI + FDA) rules     3. Applying SMIRKS (InChI + FDA) rules

      PARENTS
      1. Tautomer-unsensitive
      2. Charge-unsensitive
      3. Isotope-unsensitive
      4. Stereo-unsensitive
      5. Super-unsensitive
  13. For each Compound (CSID) parent generation is attempted
      “Tautomerism in large databases”, Sitzmann and others, J. Comput. Aided Mol. Des. (2010)

      Parent                  Description
      Fragment-Unsensitive    The largest fragment is identified and set as the fragment parent. The parent is set to the biggest organic fragment.
      Charge-Unsensitive      An attempt is made to neutralize ionized acids and bases. Envisioned as an ongoing improvement as new cases appear.
      Isotope-Unsensitive     Isotopes are replaced by the common weight
      Stereo-Unsensitive      Stereo is stripped
      Tautomer-Unsensitive    Tautomer canonicalization attempts to generate a “reasonable” tautomer
      Super-Unsensitive       This parent is all of the above
  14. [Figure: three transformations, each labeled “standardization”]
  15. Tricky cases of generating charge-unsensitive parents
      • DrugBank ID: DB00152
      • DrugBank ID: DB00209
      Currently not dealt with
  16. What do we use as the chemical identity of standardized records (primary compound key)?
      • Standard InChI/InChIKey (currently used by ChemSpider)
      • Absolute SMILES (isomeric canonical)
      Drawbacks
      • SMILES: can be too long; no accepted standard; needs to be hashed
      • Standard InChI:
        • does not distinguish between undefined and unknown stereo
        • by default does some basic tautomer canonicalization (not needed in the new model)
        • by default assumes absolute stereo
      Proposed solution
      Non-standard InChI with options: SUU SLUUD FixedH SUCF
      • Much more sensitive to stereo description
      • Fixes mobile hydrogens
      • Pays attention to the chiral flag in the mol file (relative/absolute stereo)
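To illustrate the “needs to be hashed” drawback of a SMILES key: an arbitrarily long SMILES string can be reduced to a fixed-length key with a cryptographic hash. This is a sketch only; SHA-256, the 27-character truncation and the name `smiles_key` are assumptions made for the example and are not what the slides propose.

```python
import hashlib


def smiles_key(smiles: str, length: int = 27) -> str:
    """Return a fixed-length hexadecimal key for a SMILES string.

    The input must already be canonical: the hash sees only characters,
    so two different spellings of one molecule give two different keys.
    """
    digest = hashlib.sha256(smiles.encode("utf-8")).hexdigest()
    return digest[:length]
```

For example, `smiles_key("CCO")` and `smiles_key("OCC")` differ even though both strings describe ethanol, which is why the lack of an accepted canonical SMILES standard matters for a primary key.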
  17. Preliminary Data Flow
      SDF file → Split into chunks → Parallel processing (Standardize → Generate Parents) → Upload to DB (optional)
      Moving forward to Hadoop-based processing
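The data flow above can be sketched on a single machine as follows. This is an illustration under assumptions: `standardize` and `generate_parents` are placeholders standing in for the real steps, threads stand in for whatever parallel workers are actually used, and the optional upload step is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List


def split_sdf(text: str) -> List[str]:
    """Split SDF text into records; each record is terminated by a '$$$$' line."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Trailing lines without a '$$$$' terminator are ignored.
    return records


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` records."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def standardize(record: str) -> str:
    return record.strip()  # placeholder for the real standardization


def generate_parents(record: str) -> Dict[str, object]:
    return {"standardized": record, "parents": []}  # placeholder


def process_chunk(chunk: List[str]) -> List[Dict[str, object]]:
    return [generate_parents(standardize(record)) for record in chunk]


def run(text: str, chunk_size: int = 2, workers: int = 4) -> List[Dict[str, object]]:
    chunks = list(chunked(split_sdf(text), chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so output order matches input order.
        results = list(pool.map(process_chunk, chunks))
    return [item for chunk in results for item in chunk]
```

Each chunk is independent of the others, which is what makes the same split-then-process shape a natural fit for the Hadoop-based processing mentioned on the slide.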
  18. Thanks
      We would appreciate any comments.
      For comments or questions