Standardization and Generation of Parents for Open PHACTS Chemical Registry System
Karen Karapetyan, Valery Tkachenko
Colin Batchelor, Antony Williams
Validation checks
• Correct file format (SDF, MOL, CDX, etc.)
• “Valid” chemical structure
  • Valid atoms (not query atoms)
  • Valid bonds
  • Valid valences
  • Valid charges
  • SP3 stereo
• Synonyms
  • Names (name to structure)
  • SMILES, InChIs (SMILES/InChI to structure)
• XRefs
Severity assigned to every validation issue
Filtering by severity and by issues
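A minimal sketch of how validation issues could carry a severity and be filtered by severity or by issue type. The severity levels, issue codes, and record IDs here are hypothetical; the slides do not name them.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; the slides only say that a severity is assigned.
    INFO = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Issue:
    record_id: str
    code: str            # hypothetical issue code, e.g. "query_atom"
    severity: Severity


def filter_issues(issues, min_severity=Severity.INFO, codes=None):
    """Keep issues at or above min_severity, optionally limited to given codes."""
    return [
        issue for issue in issues
        if issue.severity >= min_severity
        and (codes is None or issue.code in codes)
    ]


issues = [
    Issue("rec1", "query_atom", Severity.ERROR),
    Issue("rec2", "undefined_stereo", Severity.WARNING),
    Issue("rec3", "synonym_mismatch", Severity.INFO),
]

# Filter by severity, then by issue type.
errors_and_warnings = filter_issues(issues, min_severity=Severity.WARNING)
stereo_only = filter_issues(issues, codes={"undefined_stereo"})
```

Using an ordered enum keeps "at or above this severity" a plain comparison, so the same function serves both filters.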
Standardization – Organometallics/Salts
• Always disconnect N, O, and F from metals
• Disconnect nonmetals (except N, O, F) from transition metals (except Hg)
• Ionize free metal with carboxylic acid (metals of Groups I and II)
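The two disconnection rules above can be read as a simple decision on a metal/nonmetal bond. The sketch below is one such reading, not the system's actual code: the element sets are assumptions made for illustration (in particular, which elements count as transition metals), and the function name is hypothetical.

```python
# Nonmetals that are always disconnected from any metal (from the slide).
ALWAYS_DISCONNECT = {"N", "O", "F"}

# Assumption: d-block elements of periods 4-6 are treated as transition metals.
TRANSITION_METALS = {
    "Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
    "Y", "Zr", "Nb", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Cd",
    "Hf", "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg",
}

# Assumption: a partial list of Group I and II metals, enough for the example.
GROUP_1_2_METALS = {"Li", "Na", "K", "Rb", "Cs", "Be", "Mg", "Ca", "Sr", "Ba"}


def should_disconnect(metal, nonmetal):
    """Return True if the metal-nonmetal bond should be disconnected."""
    if nonmetal in ALWAYS_DISCONNECT:
        return True
    # Other nonmetals: only for transition metals, with Hg excluded.
    return metal in TRANSITION_METALS and metal != "Hg"


def ionizes_with_carboxylic_acid(metal):
    """Return True if a free metal is ionized against a carboxylic acid."""
    return metal in GROUP_1_2_METALS
```

For example, `should_disconnect("Fe", "Cl")` is true under these assumptions, while `should_disconnect("Hg", "C")` is false because Hg is excluded from the second rule.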
Standardization SMIRKS
(based on InChI normalization and on FDA SRS)
Examples of InChI normalization
• [*;H+:1]>>[*;H:1]
• [O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
• [N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
Examples of FDA SRS rules
• [n:1]=[O:2]>>[n+:1][O-:2]
• [*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
• [N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
• Thiopurine: [H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
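Long SMIRKS rules such as the ones above are easy to mistype, so a quick check of the atom-map numbers on each side can be useful. The sketch below is only a text-level sanity check written for illustration (it does not parse or apply SMIRKS); the function names are hypothetical.

```python
import re

# Matches an atom-map number at the end of a bracket atom, e.g. ":13]".
ATOM_MAP = re.compile(r":(\d+)\]")


def atom_maps(pattern):
    """Collect the atom-map numbers used in one side of a SMIRKS."""
    return {int(number) for number in ATOM_MAP.findall(pattern)}


def map_changes(smirks):
    """Return (maps only on the left, maps only on the right) of '>>'."""
    reactant, product = smirks.split(">>")
    left, right = atom_maps(reactant), atom_maps(product)
    return left - right, right - left


ionize_amine_acid = (
    "[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]"
    ">>[N+1;H4:1].[C:3](=[O:4])[O-:5]"
)
thiopurine = (
    "[H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1"
    "[n:11][c:10]([H,*:12])[n:9]2"
    ">>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]"
    "([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]"
)

removed, added = map_changes(ionize_amine_acid)   # mapped H 6 is dropped
balanced = map_changes(thiopurine)                # every map appears on both sides
```

A map that appears on only one side is not necessarily an error (the acid proton is removed on purpose), but an unexpected one usually points to a typo.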
Standardization
• Dearomatize
• Double bond with adjacent wiggly single bond
• Fold hydrogen atoms with no up or down bonds
Standardization
• Remove symmetric stereocenters
• Turn off chiral flag if no up or down bonds
• Do layout
(Figure label: chiral flag is set)
Standardization – partially ionized acids
(move the proton from a stronger acid to a weaker one)
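One way to picture this step: keep the number of ionized acid sites the same, but make sure the strongest acids are the ones that are ionized. The sketch below assumes each site has a unique name and a strength estimate; the `pka` values and the `AcidSite` type are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AcidSite:
    name: str       # assumed unique within a molecule
    pka: float      # hypothetical strength estimate; lower means stronger
    ionized: bool


def redistribute_protons(sites):
    """Ionize the strongest acids first, keeping the count of ionized sites."""
    n_ionized = sum(site.ionized for site in sites)
    strongest_first = sorted(sites, key=lambda site: site.pka)
    to_ionize = {site.name for site in strongest_first[:n_ionized]}
    return [replace(site, ionized=site.name in to_ionize) for site in sites]


# A weaker acid is ionized while a stronger acid still carries its proton.
sites = [
    AcidSite("sulfonic acid", -2.0, ionized=False),
    AcidSite("carboxylic acid", 4.5, ionized=True),
]
fixed = redistribute_protons(sites)
```

After the call the sulfonic acid is ionized and the carboxylic acid is protonated, which is the proton move the slide describes.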
For each Compound, parent generation is attempted
“Tautomerism in large databases”, Sitzmann and others, J. Comput. Aided Mol. Des. (2010)

Parent: Charge-Unsensitive
Description: An attempt is made to neutralize ionized acids and bases. Envisioned as an ongoing improvement as new cases appear.
RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000460;

Parent: Isotope-Unsensitive
Description: Isotopes replaced by common weight.
RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000459;

Parent: Stereo-Unsensitive
Description: SP3 and double bond stereo removed.
RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000456;

Parent: Tautomer-Unsensitive
Description: Tautomer canonicalization attempts to generate a canonical tautomer.
RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000486;

Parent: Super Parent
Description: Generated by applying all of the above modifications.
RDF: void:linkPredicate skos:broadMatch; dul:expresses cheminf:CHEMINF_000458;
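The parent types and their RDF annotations can be kept in a small lookup table. The sketch below only restates the predicates and CHEMINF identifiers listed above; the dictionary and function names are hypothetical, and the output is a fragment without a subject, not a complete RDF document.

```python
# parent type -> (void:linkPredicate value, dul:expresses value)
PARENT_LINKS = {
    "Charge-Unsensitive": ("skos:closeMatch", "cheminf:CHEMINF_000460"),
    "Isotope-Unsensitive": ("skos:closeMatch", "cheminf:CHEMINF_000459"),
    # The slide lists this identifier without a predicate; dul:expresses
    # is assumed here by analogy with the other rows.
    "Stereo-Unsensitive": ("skos:closeMatch", "cheminf:CHEMINF_000456"),
    "Tautomer-Unsensitive": ("skos:closeMatch", "cheminf:CHEMINF_000486"),
    "Super Parent": ("skos:broadMatch", "cheminf:CHEMINF_000458"),
}


def describe_link(parent_type):
    """Format the predicate/object pairs for one parent type."""
    predicate, concept = PARENT_LINKS[parent_type]
    return (
        f"void:linkPredicate {predicate} ;\n"
        f"dul:expresses {concept} ."
    )


super_parent_link = describe_link("Super Parent")
```

Only the Super Parent uses skos:broadMatch; the four single-aspect parents all use skos:closeMatch.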
[Figure: data model linking deposited Substances to standardized Compounds and their Parents]
Deposited Substances
• SID 1: SDF1, DataSource1, Synonym1, Synonym2, XRef1
• SID 2: SDF2, DataSource2, Synonym1, Synonym3, XRef2
Compounds
• OPS_ID 1: Standardized MOLECULE; DataSource1, DataSource2; Synonym1, Synonym2, Synonym3; XRef1, XRef2
• OPS_ID 2: Standardized MOL; DataSource3, DataSource4; Synonym4, Synonym5, Synonym6; XRef3, XRef4
Parents
• Charge Parent (OPS_ID 6)
• Isotope Parent (OPS_ID 4)
• Stereo Parent (OPS_ID 3)
• Tautomer Parent (OPS_ID 5)
• Super Parent (OPS_ID 7)
• Fragment
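The substance and compound records listed above suggest that a compound collects the data sources, synonyms, and cross-references of every substance that standardizes to it. A minimal sketch of that merge follows; the class and function names are hypothetical, and the step that decides which substances share a compound is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Substance:
    sid: int
    data_source: str
    synonyms: list
    xrefs: list


@dataclass
class Compound:
    ops_id: int
    data_sources: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)
    xrefs: list = field(default_factory=list)


def _add_unique(target, values):
    """Append values to target, skipping ones already present."""
    for value in values:
        if value not in target:
            target.append(value)


def merge_into_compound(ops_id, substances):
    """Build one compound record from substances sharing a structure."""
    compound = Compound(ops_id)
    for substance in substances:
        _add_unique(compound.data_sources, [substance.data_source])
        _add_unique(compound.synonyms, substance.synonyms)
        _add_unique(compound.xrefs, substance.xrefs)
    return compound


compound = merge_into_compound(1, [
    Substance(1, "DataSource1", ["Synonym1", "Synonym2"], ["XRef1"]),
    Substance(2, "DataSource2", ["Synonym1", "Synonym3"], ["XRef2"]),
])
```

With the two substances from the figure, the shared Synonym1 appears once in the merged compound.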
What do we use as the chemical identity of the standardized records (primary compound key)?
• Standard InChI/InChIKey (currently used by ChemSpider)
• Absolute SMILES (isomeric canonical)
Drawbacks
• SMILES: can be too long; no accepted standard; needs to be hashed
• Standard InChI
  • does not distinguish between undefined and unknown stereo
  • by default does some basic tautomer canonicalization (not needed in the new model)
  • by default assumes absolute stereo
Proposed Solution
Non-standard InChI with options: SUU SLUUD FixedH SUCF
• Much more sensitive to stereo description
• Fixes mobile hydrogens (so tautomers can be distinguished)
• Handles “AND-ed” relative stereo
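Two small illustrations of the points above. The first shows what "needs to be hashed" could mean for a SMILES key: an arbitrarily long string is reduced to a fixed-length key. The second builds an argument list from the proposed option names. The choice of SHA-256, the key length, and the `-` prefix are assumptions made for this sketch, not details from the slides.

```python
import hashlib

# Option names as given on the slide.
INCHI_OPTIONS = ["SUU", "SLUUD", "FixedH", "SUCF"]


def option_args(options, prefix="-"):
    """Prefix each option name; the prefix convention is an assumption."""
    return [prefix + option for option in options]


def smiles_key(smiles, length=32):
    """Reduce a SMILES string of any length to a fixed-length hex key."""
    digest = hashlib.sha256(smiles.encode("utf-8")).hexdigest()
    return digest[:length]


args = option_args(INCHI_OPTIONS)
key_a = smiles_key("C[C@H](N)C(=O)O")
key_b = smiles_key("C[C@@H](N)C(=O)O")
```

The two example SMILES differ only in stereo and give different keys, but such a key is only as stable as the canonicalization that produced the SMILES, which is the "no accepted standard" drawback noted above.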
Thanks
We would appreciate any comments.
For comments or questions email
karapetyank@rsc.org

Editor's Notes

  • #3 I would like to start by defining what a “quality” record means, because that is what the validation part of CVSP is about. A chemical record has several aspects to its quality. The easiest to check is file format correctness: each file format has its own formatting rules that a record in that format needs to follow. This type of file validation is done by all database maintainers that have deposition systems. Another, more relevant, type of validation is chemical validation. A record can be perfectly formatted from the file format point of view but make no chemical sense. Structure validation is usually overlooked or not prioritized highly. Chemical validation includes atom validation (checking that each atom is a legal chemical atom with valid charges and valences) and checking that stereo is defined. Synonym validation is very useful for spotting inconsistent records that are worth pointing out to the depositor. During data export/import, synonyms and/or structures are often manipulated, and the relationships between them can become faulty, so attempting to verify that a synonym and a structure actually match is worth doing. SMILES/InChIs: again, the relationship between the chemical record and the depositor-provided InChI or SMILES can be faulty. As I'll show later, this inconsistency can reveal a systematic issue with a data set, as sometimes the InChI or SMILES does not match the structure.
  • #4 The result of processing is a list of records with validation messages in the middle. If a record was standardized, the “Standardized” column is present with the structure.
  • #5 Here is the bigger DrugBank dataset we have processed. Some warnings are shown in the dropdown list: warnings about metals, stereo, enol presence, etc.