Data model
Presentation Transcript

  • CVSP New Data Model: Standardization and Parents
  • Why is there a need for a new data model?
    • Details of the proposed new data model
    • Concept of Substance and Compound
    • Standardization workflow
    • The benefit of having parents (we all know this)
  • Deposited Record – “Substance”
    A unique Substance Identifier (SID) is assigned to each deposited record (a record identified by the combination of DataSource and the depositor’s internal database registry identifier).
    Benefits of keeping separate, independent layers of deposited record data (“Substances”) and standardized record data (“Compounds”), the archive model:
    • Depositors’ records (Substances) are preserved as they are, with no alteration
    • Depositors’ records are versioned when changes occur. Only the most recent version is used in links and calculations
    • The same chemical may be deposited by several depositors; each deposit gets a different substance ID, but all of them are linked to the same standardized compound
    • Any record can be accepted, even those that do not produce an InChI (e.g. plant extracts, blood samples, polymers, etc.)
    • The substance identifier (SID) is guaranteed not to change
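The Substance layer described on this slide can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names (`Substance`, `SubstanceVersion`, `Registry`), the in-memory dictionary, and the sequential SID counter are hypothetical and are not taken from the CVSP implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SubstanceVersion:
    """One snapshot of a deposited record, stored exactly as received."""
    version: int
    raw_record: str
    live: bool = True


@dataclass
class Substance:
    """A deposited record; its SID is assigned once and never changes."""
    sid: int
    data_source: str      # the depositor
    registry_id: str      # the depositor's internal registry identifier
    versions: list = field(default_factory=list)

    def latest(self):
        # Only the most recent version is used in links and calculations.
        return self.versions[-1]


class Registry:
    """Hypothetical in-memory store keyed by (data_source, registry_id)."""

    def __init__(self):
        self._by_key = {}
        self._next_sid = 1

    def deposit(self, data_source, registry_id, raw_record):
        key = (data_source, registry_id)
        sub = self._by_key.get(key)
        if sub is None:
            # First deposit of this record: assign a new SID.
            sub = Substance(self._next_sid, data_source, registry_id)
            self._next_sid += 1
            self._by_key[key] = sub
        # Every deposit, first or repeated, adds a new version.
        sub.versions.append(SubstanceVersion(len(sub.versions) + 1, raw_record))
        return sub


registry = Registry()
a = registry.deposit("DataSource1", "REG-001", "mol block v1")
b = registry.deposit("DataSource1", "REG-001", "mol block v2")  # same record, new version
c = registry.deposit("DataSource2", "X-17", "mol block v1")     # same chemical, other depositor
```

Here `a` and `b` are the same Substance with two versions, while `c` gets its own SID even though the deposited chemical is identical; linking both to one standardized compound would happen in a separate Compound layer.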
  • [Diagram: Deposited Substances → Compounds → Parent Compounds. Substances SID 1 and SID 2 (each with a DataSource, SDF, synonyms and XRefs) link to standardized Compounds CSID 1 and CSID 2 (each with a standardized MOL and the DataSources, synonyms and XRefs of its substances). Parents shown: Fragment Parent (CSID2), Charge Parent (CSID3), Isotope Parent (CSID4), Stereo Parent (CSID5), Tautomer Parent (CSID6), Super Parent (CSID7)]
  • What happens when standardization rules change?
    Would that affect Substance-Compound relationships? Would the SID-CSID mapping change? Yep, it is possible!!
    • After an occasional total ChemSpider re-standardization we can’t guarantee that the same standardized compound (CSID) will be linked to a Substance; the mapping may change. This, however, will not affect depositors’ SIDs in any way.
    • Depositors should be encouraged to use their substance identifiers (SIDs) when referring to ChemSpider
    • We need to develop compound permalinks (URLs) that depositors can always use to get to their up-to-date CSIDs via SIDs. With these, our re-standardization wouldn’t affect external references.
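A rough sketch of why SID-based permalinks survive re-standardization while CSID links may not. Everything here is an assumption made for illustration: the `CompoundLinks` class, its method names, and the toy "standardizers" (plain string functions standing in for real chemistry) are hypothetical.

```python
class CompoundLinks:
    """Hypothetical SID -> CSID mapping that is rebuilt on re-standardization."""

    def __init__(self, standardize):
        self._standardize = standardize  # maps a deposited structure to a compound key
        self._structures = {}            # SID -> deposited structure (never altered)
        self._csid_by_key = {}           # compound key -> CSID
        self._sid_to_csid = {}           # current links

    def _csid_for(self, key):
        if key not in self._csid_by_key:
            self._csid_by_key[key] = len(self._csid_by_key) + 1
        return self._csid_by_key[key]

    def add(self, sid, structure):
        self._structures[sid] = structure
        self._sid_to_csid[sid] = self._csid_for(self._standardize(structure))

    def restandardize(self, standardize):
        """Apply new rules to every substance: SIDs stay, links may move."""
        self._standardize = standardize
        for sid, structure in self._structures.items():
            self._sid_to_csid[sid] = self._csid_for(standardize(structure))

    def resolve(self, sid):
        """What a permalink would do: return the current CSID for a SID."""
        return self._sid_to_csid[sid]


# Toy rules only: the first keeps structures as deposited, the second
# "neutralizes" by string replacement, which is not real chemistry.
links = CompoundLinks(lambda s: s)
links.add(1, "CC(=O)[O-]")
links.add(2, "CC(=O)O")
before = (links.resolve(1), links.resolve(2))

links.restandardize(lambda s: s.replace("[O-]", "O"))
after = (links.resolve(1), links.resolve(2))
```

Before the rule change the two substances point at different compounds; afterwards both resolve to the same CSID, yet a caller holding only the SID never notices the remapping.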
  • What happens when a depositor revokes a Substance (SID)?
    • Revoking still versions the substance record: a new version of the record is created with a “not alive” flag.
    • Revoked substances are no longer indexed
    • If no more Substances point at the Compound, the Compound gets deleted. Otherwise, the data from the revoked substances is pulled off the compound
    • If a revoked substance gets re-deposited, a new version is created with a “live” flag
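The revocation rules above can be sketched as follows. The `Store` class and its fields are hypothetical names chosen for this example; the sketch only tracks flags and links, not the actual record data.

```python
class Store:
    """Hypothetical store tracking substance versions and compound links."""

    def __init__(self):
        self.versions = {}   # SID -> list of "live" flags, one per version
        self.link = {}       # SID -> CSID
        self.compounds = {}  # CSID -> set of live SIDs contributing data

    def deposit(self, sid, csid):
        # A deposit (or re-deposit) creates a new version with the "live" flag.
        self.versions.setdefault(sid, []).append(True)
        self.link[sid] = csid
        self.compounds.setdefault(csid, set()).add(sid)

    def revoke(self, sid):
        # Revoking is still versioning: add a "not alive" version.
        self.versions[sid].append(False)
        csid = self.link[sid]
        sids = self.compounds.get(csid)
        if sids is not None:
            sids.discard(sid)            # pull this substance's data off the compound
            if not sids:
                del self.compounds[csid]  # no substances point at it any more

    def is_indexed(self, sid):
        # Only substances whose latest version is live are indexed.
        return self.versions[sid][-1]


store = Store()
store.deposit(1, 10)
store.deposit(2, 10)
store.revoke(1)
after_first_revoke = set(store.compounds[10])
store.revoke(2)
compound_deleted = 10 not in store.compounds
store.deposit(1, 10)  # re-deposit of a revoked substance
```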
  • FDA Substance Registration System, Version 5c, 2007
    • This guide is used to standardize the entry of substances into the Food and Drug Administration (FDA) Substance Registration System (SRS)
    • The primary purpose of this guide is to prevent duplicate entries of a single substance
    • Conventions for drawing structures and for organizing the characteristics of substances are included
    • The lack of a standardization system at the FDA gave rise to the SRS SOP, which serves as guidelines for curators to draw chemicals the same way and so avoid duplication in the database
  • Standardization – is it possible to please all interested parties?
    Depending on the area of specialization:
    • Some folks may insist on neutralizing charges while others may feel differently
    • Some may think that the canonical tautomer should always be in a specific form
    • We believe that combining “mild” standardization with parents may be the right choice to please as many interested parties as possible
  • Standardization – Step I - Organometallics
    • Always disconnect N, O, and F from metals. Example: (Ph3)Sn+ … HO-
    • Disconnect nonmetals (except N, O, F) from transition metals (except Hg)
    • Ionize free metal with carboxylic acid. Metals: Group I and II
    • Whenever covalent bonds with metals are disconnected, charges are adjusted.
  • Standardization – Step II (CVSP only): Tautomer Canonicalization
    • In CVSP, tautomer canonicalization is a part of standardization
    • In the OpenPHACTS model, tautomer canonicalization is not part of standardization. Instead, a tautomer-unsensitive (canonicalized tautomer) parent is generated.
    Why is the OpenPHACTS approach different?
    • Mapping different tautomers of the same family to different standardized compounds gives better tautomer-specific annotation mapping (e.g. tautomer-specific NMR spectra, calculated properties, etc.)
    • Standardized compounds representing the same tautomeric family will have the same tautomeric parent, the canonicalized tautomer
  • Standardization – Step III
    • Some of the basic InChI normalization rules were used (~30):
      [*;H+:1]>>[*;H:1]
      [O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
      [N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
      etc.
    • FDA SRS rules added (~30):
      [n:1]=[O:2]>>[n+:1][O-:2]
      [*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
      [N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
      Thiopurine: [H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
      etc.
  • CVSP standardization vs OpenPHACTS
    CVSP Standardization                     | OpenPHACTS Standardization
    1. Disconnecting metals                  | 1. Disconnecting metals
    2. Canonicalizing tautomer               | 2. Omitted
    3. Applying SMIRKS rules (InChI + FDA)   | 3. Applying SMIRKS rules (InChI + FDA)
    PARENTS
    1. Tautomer-unsensitive
    2. Charge-unsensitive
    3. Isotope-unsensitive
    4. Stereo-unsensitive
    5. Super-unsensitive
  • For each Compound (CSID), parent generation is attempted
    Reference: “Tautomerism in large databases”, Sitzmann and others, J. Comput. Aided Mol. Des. (2010)
    Parent               | Description
    Fragment-Unsensitive | The largest fragment is identified and set as the fragment parent. Parent set to the biggest organic fragment.
    Charge-Unsensitive   | An attempt is made to neutralize ionized acids and bases. Envisioned to be improved continually as new cases appear.
    Isotope-Unsensitive  | Isotopes replaced by common weight
    Stereo-Unsensitive   | Stereo is stripped
    Tautomer-Unsensitive | Tautomer canonicalization attempts to generate a “reasonable” tautomer
    Super-Unsensitive    | This parent is all of the above
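Two of the parents in the table are simple enough to illustrate with plain string handling on SMILES. This is a toy sketch, not the CVSP method: it assumes dot-separated SMILES components are independent fragments, counts atoms with a rough regular expression, and does not re-canonicalize the result. A real implementation would use a cheminformatics toolkit. The function names are hypothetical.

```python
import re

# Rough atom matcher: bracket atoms, two-letter halogens, then the
# organic-subset symbols (aliphatic and aromatic). An approximation only.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def fragment_parent(smiles):
    """Return the dot-separated component with the most atoms."""
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: len(ATOM.findall(frag)))


def isotope_parent(smiles):
    """Drop isotope labels, which lead a bracket atom (e.g. [13CH4])."""
    return re.sub(r"\[\d+", "[", smiles)


acetate = fragment_parent("CC(=O)[O-].[Na+]")
methane = isotope_parent("[13CH4]")
water = isotope_parent("[2H]O[2H]")
```

With these inputs the sodium counter-ion is discarded and the isotope labels on carbon-13 methane and heavy water are removed. Charge, stereo and tautomer parents need real structure perception and are not attempted here.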
  • [Figure: three “standardization” labels; the structures shown are not recoverable from the transcript]
  • Tricky cases of generating charge-unsensitive parents
    • DrugBank ID: DB00152
    • DrugBank ID: DB00209
    Currently not dealt with
  • What do we use as the chemical identity of the standardized records (primary compound key)?
    • Standard InChI/InChIKey (currently used by ChemSpider)
    • Absolute SMILES (isomeric canonical)
    Drawbacks
    • SMILES: can be too long; no accepted standard; needs to be hashed
    • Standard InChI:
      • does not distinguish between undefined and unknown stereo
      • by default does some basic tautomer canonicalization (not needed in the new model)
      • by default assumes absolute stereo
    Proposed Solution
    Non-standard InChI with options: SUU SLUUD FixedH SUCF
    • much more sensitive to stereo description
    • fixes mobile hydrogens
    • pays attention to the chiral flag in the mol file (relative/absolute stereo)
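To illustrate the “needs to be hashed” drawback listed for SMILES: a long canonical SMILES can be reduced to a fixed-length key with a standard hash. This is a minimal sketch, assuming the SMILES strings were already canonicalized by a single toolkit; the choice of SHA-256 and the function name `smiles_key` are assumptions made here, not something the slides specify.

```python
import hashlib


def smiles_key(smiles):
    """Fixed-length (64 hex characters) key for a canonical SMILES string."""
    return hashlib.sha256(smiles.encode("utf-8")).hexdigest()


# Two stereoisomers written as isomeric SMILES give different keys.
key_a = smiles_key("C[C@H](N)C(=O)O")
key_b = smiles_key("C[C@@H](N)C(=O)O")
```

The key is only as reliable as the canonicalization behind it, which is the “no accepted standard” problem the slide points out.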
  • Preliminary Data Flow
    SDF file → Split to chunks → Parallel processing: Standardize → Generate Parents → Upload to DB (optional)
    Moving forward to HADOOP-based processing
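The split-and-process flow can be sketched with the standard library. The only format fact relied on is that each SDF record ends with a `$$$$` line. The function names, the chunk size, and the placeholder `process_chunk` (which merely returns each record's title line instead of standardizing it) are assumptions for illustration; a thread pool is used to keep the example simple, whereas CPU-bound work would more likely use processes or, as the slide says, Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor


def split_sdf(text):
    """Split SDF text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records  # trailing lines without a '$$$$' terminator are ignored


def chunks(items, size):
    """Group records into lists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(records):
    # Placeholder for "Standardize" and "Generate Parents":
    # here we only return the title (first line) of each record.
    return [(r.splitlines() or [""])[0].strip() for r in records]


def run(text, chunk_size=2, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input chunks.
        results = list(pool.map(process_chunk, chunks(split_sdf(text), chunk_size)))
    return [title for chunk in results for title in chunk]


sdf_text = (
    "mol1\n  demo\n\nM  END\n$$$$\n"
    "mol2\n  demo\n\nM  END\n$$$$\n"
    "mol3\n  demo\n\nM  END\n$$$$\n"
)
titles = run(sdf_text)
```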
  • Thanks
    We would appreciate any comments. For comments or questions, email karapetyank@rsc.org