I would like to start by defining what a “quality” record means, because that is what the validation part of CVSP is about. A chemical record has several aspects to its quality. The easiest one to check is file format correctness: each file format has its own formatting rules that a record in that format needs to follow. This type of validation is done by all database maintainers that have deposition systems.
Another, more relevant, type is chemical validation. A record can be perfectly formatted from the file format point of view and still make no chemical sense, yet structure validation is usually overlooked or not given high priority. Chemical validations include atom validation (checking that each atom is a legal chemical element and that its charge and valence are sensible) and checking that stereo is defined.
Synonym validation is very useful for spotting records that are inconsistent and worth pointing out to the depositor. Synonyms and/or structures are often manipulated during data export and import, and the relationships between them can become faulty, so attempting to verify that a synonym and a structure actually match is worth doing.
The same applies to SMILES and InChIs: the relationship between the chemical record and the depositor-provided InChI or SMILES can be faulty. As I’ll show later, this inconsistency can reveal a systematic issue with a data set, as sometimes the InChI or SMILES does not match the structure.
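The checks described above can be sketched roughly as follows. This is a minimal illustration, not CVSP's actual implementation: the record layout (`atoms`, `deposited_inchi`, `computed_inchi`), the function name `validate_record`, the charge limit, and the small element set are all hypothetical, and the computed InChI is assumed to have been regenerated from the structure by some cheminformatics toolkit.

```python
# Hypothetical, simplified record validation; not the CVSP implementation.
# The element set is deliberately incomplete and only for illustration.
KNOWN_ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "Na", "K"}


def validate_record(record):
    """Return a list of human-readable validation messages for one record."""
    messages = []
    for index, atom in enumerate(record.get("atoms", [])):
        symbol = atom.get("symbol")
        if symbol not in KNOWN_ELEMENTS:
            messages.append(f"atom {index}: unknown element symbol {symbol!r}")
        # Assumed plausibility window for formal charges.
        if abs(atom.get("charge", 0)) > 4:
            messages.append(f"atom {index}: implausible formal charge")
        if atom.get("stereo") == "undefined":
            messages.append(f"atom {index}: stereo is not defined")
    # Compare the depositor's identifier with the one derived from the structure.
    deposited = record.get("deposited_inchi")
    computed = record.get("computed_inchi")
    if deposited and computed and deposited != computed:
        messages.append("deposited InChI does not match the structure")
    return messages


record = {
    "atoms": [
        {"symbol": "C", "charge": 0, "stereo": "undefined"},
        {"symbol": "Xx", "charge": 0},
    ],
    "deposited_inchi": "InChI=1S/CH4/h1H4",
    "computed_inchi": "InChI=1S/C2H6/c1-2/h1-2H3",
}
for message in validate_record(record):
    print(message)
```

Returning messages rather than raising an error matches the idea of pointing the depositor at suspicious records instead of rejecting them outright.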
The result of processing is a list of records, with validation messages in the middle. If a record was standardized, a “Standardized” column is present showing the structure.
Here is the larger DrugBank dataset we have processed. Some warnings are shown in the dropdown list: warnings about metals, stereo, enol presence, etc.
Transcript of "Standardization and Generation of Parents for Open PHACTS Chemical Registry System"
Standardization and Generation of Parents
Open PHACTS Chemical Registry System
Karen Karapetyan, Valery Tkachenko
Colin Batchelor, Antony Williams
Standardization – Organometallics/Salts
Always disconnect N, O, and F from metals:
Disconnect nonmetals (except N, O, F) from transition metals (except Hg)
Ionize free metal with carboxylic acid (Metals of Group I and II)
(based on InChI normalization and on FDA SRS)
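The two disconnection rules above can be read as a simple decision on each metal-nonmetal bond. The following is a minimal sketch under stated assumptions: the function name `should_disconnect` is hypothetical, the element sets are illustrative and not exhaustive (group 12 is counted among the transition metals here), and the carboxylic acid ionization rule is not covered.

```python
# Illustrative, non-exhaustive element sets (an assumption of this sketch).
TRANSITION_METALS = {
    "Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
    "Y", "Zr", "Nb", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Cd",
    "Hf", "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg",
}
OTHER_METALS = {"Li", "Na", "K", "Rb", "Cs", "Be", "Mg", "Ca", "Sr", "Ba", "Al", "Sn", "Pb"}
METALS = TRANSITION_METALS | OTHER_METALS


def should_disconnect(element_a, element_b):
    """Decide whether a bond between two elements should be broken."""
    if element_a in METALS and element_b not in METALS:
        metal, partner = element_a, element_b
    elif element_b in METALS and element_a not in METALS:
        metal, partner = element_b, element_a
    else:
        return False  # not a metal-nonmetal bond
    # Rule 1: N, O, and F are always disconnected from metals.
    if partner in {"N", "O", "F"}:
        return True
    # Rule 2: other nonmetals are disconnected from transition metals, except Hg.
    return metal in TRANSITION_METALS and metal != "Hg"


print(should_disconnect("Na", "O"))   # rule 1 applies
print(should_disconnect("Fe", "Cl"))  # rule 2 applies
print(should_disconnect("Hg", "C"))   # Hg is the exception to rule 2
```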
Examples of InChI normalization
Examples of FDA SRS rules
Double bond with adjacent wiggly single bond
Fold hydrogen atoms with no up or down bonds
Remove symmetric stereocenters
Turn off chiral flag if no up or down bonds
Chiral flag is set
Standardization – partially ionized acids
(move the proton from a stronger acid to a weaker one)
For each compound, parent generation is attempted
“Tautomerism in large databases”, Sitzmann and others, J. Comput. Aided Mol. Des. (2010)
Parent: Description [RDF]
Charge-Unsensitive: An attempt is made to neutralize ionized acids and bases. Envisioned to be an ongoing improvement as new cases appear.
Isotope-Unsensitive: Isotopes replaced by common weight [void:linkPredicate skos:closeMatch]
Stereo-Unsensitive: SP3 and double bond stereo removed [void:linkPredicate skos:closeMatch]
Tautomer canonicalization: An attempt is made to generate a canonical tautomer
Super Parent: Generated by applying the modifications of all of the above
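The way parents build on one another can be sketched as a chain of transformations. This is a toy model only: the `Atom` dataclass and the function names are hypothetical, and only the isotope and stereo steps are shown, because charge neutralization and tautomer canonicalization need real chemistry and are left out. The last function is therefore a partial stand-in for the super parent, not the full thing.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Atom:
    """Toy atom record (hypothetical layout, for illustration only)."""
    symbol: str
    isotope: Optional[int] = None  # None means the common weight
    stereo: Optional[str] = None   # e.g. "R" or "S"; None means no stereo


def isotope_unsensitive(atoms):
    """Replace every isotope label by the common weight."""
    return tuple(replace(atom, isotope=None) for atom in atoms)


def stereo_unsensitive(atoms):
    """Remove stereo labels from every atom."""
    return tuple(replace(atom, stereo=None) for atom in atoms)


def partial_super_parent(atoms):
    """Compose the steps modelled here; charge and tautomer steps are omitted."""
    return stereo_unsensitive(isotope_unsensitive(atoms))


molecule = (Atom("C", isotope=13, stereo="R"), Atom("O"))
print(partial_super_parent(molecule))
```

Keeping each parent as a separate function makes it easy to link a compound to each of its parents individually, which is what the per-parent RDF predicates suggest.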
What do we use as the chemical identity of the standardized records (primary compound key)?
• Standard InChI/InChIKey (currently used by ChemSpider)
• Absolute SMILES (isomeric canonical)
• SMILES – can be too long; no accepted standard; needs to be hashed
• Standard InChI
• does not distinguish between undefined and unknown stereo
• by default standard InChI does some basic tautomer canonicalization
(not needed in new model)
• By default assumes absolute stereo
Non-standard InChI with options: SUU SLUUD FixedH SUCF
• much more sensitive to stereo description
• Fixes mobile hydrogens (so tautomers could be distinguished)
• Handles “AND-ed” relative stereo
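The remark that a SMILES string "needs to be hashed" to serve as a key can be illustrated as follows. This is a minimal sketch, not the scheme used by the registry: the function name `smiles_key` and the choice of SHA-256 are assumptions, and the input is assumed to already be a canonical isomeric SMILES produced by a single toolkit, since different toolkits canonicalize differently.

```python
import hashlib


def smiles_key(canonical_smiles: str) -> str:
    """Return a fixed-length hexadecimal key for a canonical SMILES string."""
    return hashlib.sha256(canonical_smiles.encode("utf-8")).hexdigest()


# Keys have the same length regardless of how long the SMILES is.
short_key = smiles_key("CCO")
long_key = smiles_key("CC(C)Cc1ccc(cc1)C(C)C(=O)O")
print(len(short_key), len(long_key))
```

The hash only gives identical keys for identical strings, so the quality of the key depends entirely on the canonicalization step that precedes it.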
We would appreciate any comments.
For comments or questions email