Validation and Standardization of Molecular Structures in General and Sugars in Particular: a Case Study
Colin Batchelor,
...
  1. Validation and Standardization of Molecular Structures in General and Sugars in Particular: a Case Study
     Colin Batchelor, Ken Karapetyan, Valery Tkachenko, Antony Williams
     6th Joint Sheffield Conference on Chemoinformatics, 2013-07-24
  2. Overview
     • Open PHACTS and chemical validation and standardization
     • RDF for chemoinformatics calculations
     • General case study: ChEMBL and DrugBank
     • Sugar case study: Perspective perception
  3. Overview
     • Open PHACTS and chemical validation and standardization
     • RDF for chemoinformatics calculations
     • General case study: ChEMBL and DrugBank
     • Sugar case study: Perspective perception
  4. Who is involved?
     28 Consortium Members, >45 Associated Partners
     3-year European project funded by:
     • European Pharmaceutical Industry
     • Innovative Medicines Initiative
     Open PHACTS API, Applications using the Open PHACTS API, dev.openphacts.org, Explorer, www.openphacts.org, Twitter: @open_phacts
  5. How do we fit in?
     We integrate and standardize the chemical compound collection underpinning Open PHACTS and provide regular updates and ongoing data curation.
     The validation and standardization rules are derived from the FDA structure guidelines, modified for consistency and in response to input from members of EFPIA.
  6. Open PHACTS provides an integrated platform of publicly available pharmacological and physicochemical data
     Data accessible via:
     • Free application programming interface (API): dev.openphacts.org
     • Third-party applications built to use the API
     Open PHACTS app ecosystem
  7. How does Open PHACTS work?
  8. Currently integrated databases
         Database                 Millions of triples
         ACD Labs / ChemSpider    161.3
         ChEBI                    0.9
         ChEMBL                   146.1
         ConceptWiki              3.7
         DrugBank                 0.5
         Enzyme                   0.1
         Gene Ontology            0.9
         SwissProt                156.6
         WikiPathways             0.1
         TOTAL                    470.2
  9. CVSP and the OPS CRS
     Standardization workflows (CVSP, FDA, OPS, custom) using modules such as:
     • SMIRKS transformations
     • layout (GGA)
     • canonical tautomers (ChemAxon)
     • sugar interpretation (RSC)
  10. Overview
      • Open PHACTS and chemical validation and standardization
      • RDF for chemoinformatics calculations
      • General case study: ChEMBL and DrugBank
      • Sugar case study: Perspective perception
  11. RDF and Open PHACTS
      The underlying language of Open PHACTS is RDF.
      There are few constraints as such, only guidelines for which classes of identifier to use and accounts of best practice.
      This RDF goes into the data cache and we access the results through user interfaces built on RESTful JSON web services.
  12. What does RDF look like?
      In the Turtle format below, each line is a triple, in which a binary predicate links a subject and an object.
          :CSID1execution obo:OBO_0000299 :CSID1prop11 .
          :CSID1prop11 obo:IAO_0000136 ops:OPS1 .
          :CSID1prop11 rdf:type cheminf:CHEMINF_000349 .
          :CSID1prop11 qudt:numericValue "1.049E-17"^^xsd:double .
          :CSID1prop11 qudt:unit obo:UO_0000324 .
      There is also RDF/XML, which is less human-readable.
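As a rough illustration of the "each line is a triple" point, the sketch below splits the five lines from the slide into (subject, predicate, object) tuples using only the Python standard library. It is not a Turtle parser: it assumes one complete statement per line, no `;` or `,` abbreviations, and no whitespace inside literals. The function names are hypothetical.

```python
# Naive splitter for the one-triple-per-line Turtle shown on slide 12.
# Assumption: each statement is "subject predicate object ." on a single line.

TURTLE = '''
:CSID1execution obo:OBO_0000299 :CSID1prop11 .
:CSID1prop11 obo:IAO_0000136 ops:OPS1 .
:CSID1prop11 rdf:type cheminf:CHEMINF_000349 .
:CSID1prop11 qudt:numericValue "1.049E-17"^^xsd:double .
:CSID1prop11 qudt:unit obo:UO_0000324 .
'''


def parse_simple_turtle(text):
    """Return a list of (subject, predicate, object) string tuples."""
    triples = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.endswith("."):
            raise ValueError(f"statement does not end with '.': {line!r}")
        body = line[:-1].strip()
        subject, predicate, obj = body.split(None, 2)
        triples.append((subject, predicate, obj))
    return triples


def double_value(obj):
    """Return a float for an xsd:double literal, or None for anything else."""
    if obj.startswith('"') and obj.endswith("^^xsd:double"):
        return float(obj[1:obj.rindex('"')])
    return None


triples = parse_simple_turtle(TURTLE)
for s, p, o in triples:
    print(s, "|", p, "|", o)
```

For real data a proper RDF library would be the safer choice; the point here is only that subject, predicate and object can be read straight off each line.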
  13. Royal Society of Chemistry data in Open PHACTS
      1. Molecule synonyms and identifiers
      2. Linksets between ChEBI, ChEMBL, DrugBank and OPS identifiers
      3. Molecule–molecule relations ("parent–child") of interest for drug discovery
      4. Calculated physicochemical properties for compounds (both molecular and macroscopic)
  14. Royal Society of Chemistry data in Open PHACTS
      1. Molecule synonyms and identifiers
      2. Linksets between ChEBI, ChEMBL, DrugBank and OPS identifiers
      3. Molecule–molecule relations ("parent–child") of interest for drug discovery
      4. Calculated physicochemical properties for compounds (both molecular and macroscopic)
  15. Calculated physicochemical properties (ACD 12.0)
      • log P
      • log D (at pH 5.5, at pH 7.4)
      • bioconcentration factor, KOC (at pH 5.5, at pH 7.4)
      • index of refraction
      • polar surface area
      • molar refractivity
      • molar volume
      • polarizability
      • surface tension
      • density at STP
      • boiling point at 1 atm
      • flash point at 1 atm
      • enthalpy of vaporization at STP
      • vapour pressure at STP
  16. RDF for calculated properties: vocabularies
      Two dozen calculated properties for each of >10^6 molecules.
      • CHEMINF ontology for kinds of calculation and chemical data
      • QUDT for results
      • OPS IDs for molecules
      • OBI and IAO to connect calculations to results
  17. RDF for calculated properties: schema
      Diagram nodes:
      • calculation process (rdf:type CHEMINF execution of ACD/Labs PhysChem software library version 12.01)
      • benzene's connection table (rdf:type CHEMINF connection table)
      • calculation result (rdf:type CHEMINF calculated log P)
      • OPS benzene
      • QUDT dimensionless quantity
      • "2.17"^^xsd:float
      • "0.234"^^xsd:float
      Diagram edge labels:
      • OBI has specified input
      • OBI has specified output
      • IAO is about
      • QUDT has value
      • QUDT has standard uncertainty
      • QUDT has unit
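A minimal sketch of how one calculated-property record might be emitted, restricted to the five-triple shape that the Turtle example on slide 12 spells out. The predicates and class identifiers are copied verbatim from that slide; the function names, parameter names and the URI naming pattern are hypothetical, and the input and standard-uncertainty edges from the schema are left out because the source gives no identifiers for them.

```python
# Sketch only: builds the execution -> result -> molecule pattern of slide 12.
# Predicates are copied from the slide as written; helper names are assumptions.

def property_triples(csid, prop_no, ops_id, result_class, lexical_value, unit):
    """Return five (subject, predicate, object) tuples for one property."""
    execution = f":CSID{csid}execution"
    result = f":CSID{csid}prop{prop_no}"
    return [
        # read here as OBI "has specified output" (calculation -> result)
        (execution, "obo:OBO_0000299", result),
        # IAO "is about" (result -> molecule)
        (result, "obo:IAO_0000136", f"ops:{ops_id}"),
        (result, "rdf:type", result_class),
        (result, "qudt:numericValue", f'"{lexical_value}"^^xsd:double'),
        (result, "qudt:unit", unit),
    ]


def to_turtle(triples):
    """Serialize tuples as one Turtle statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


record = property_triples(
    csid=1,
    prop_no=11,
    ops_id="OPS1",
    result_class="cheminf:CHEMINF_000349",
    lexical_value="1.049E-17",  # kept as text so the lexical form is preserved
    unit="obo:UO_0000324",
)
print(to_turtle(record))
```

Passing the value as a string avoids any change to the lexical form that float formatting would introduce.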
  18. Overview
      • Open PHACTS and chemical validation and standardization
      • RDF for chemoinformatics calculations
      • General case study: ChEMBL and DrugBank
      • Sugar case study: Perspective perception
  19. ChEMBL and DrugBank analysed
      ChEMBL 16 (http://www.ebi.ac.uk/chembl/) contains 1 295 510 distinct molecules; CVSP found something to say about 456 250 of them (35%).
      DrugBank 3.0 (http://www.drugbank.ca/) contains 6510 distinct molecules; CVSP found something to say about 662 of them (10%).
      (We haven't done all of ChemSpider yet; we will.)
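The two headline percentages follow directly from the counts on this slide; the short check below reproduces them. The function name is hypothetical.

```python
# Share of molecules for which CVSP reported at least one issue,
# using the counts quoted on slide 19.

def flagged_share(flagged, total):
    """Percentage of `total` molecules that were flagged."""
    return 100.0 * flagged / total


chembl_share = flagged_share(456_250, 1_295_510)
drugbank_share = flagged_share(662, 6_510)

print(f"ChEMBL 16:    {chembl_share:.1f}%")
print(f"DrugBank 3.0: {drugbank_share:.1f}%")
```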
  20. Potentially serious things
          ChEMBL            DrugBank       Issue
          14218   1.09%     202   3.10%    Not an overall neutral system
          485     0.04%     21    0.32%    Forbidden-valence atoms
          44      -         0     -        Has adjacent atoms with like charges
          4       -         0     -        Has more than one radical centre
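Two of the checks in this table can be sketched on a toy molecule representation. This is an illustration only, not CVSP's implementation: the data model (a list of formal charges plus bonds as index pairs) and all function names are assumptions.

```python
# Toy model (an assumption, not CVSP's data model): formal charges indexed by
# atom, and bonds given as pairs of atom indices.

def net_charge(charges):
    """Sum of formal charges over the whole record."""
    return sum(charges)


def is_overall_neutral(charges):
    return net_charge(charges) == 0


def like_charged_neighbours(charges, bonds):
    """Bonds whose two atoms carry formal charges of the same sign."""
    return [(i, j) for i, j in bonds if charges[i] * charges[j] > 0]


# Charge-separated nitromethane drawn as C, N+, O-, O: neutral overall.
nitro_charges = [0, 1, -1, 0]
nitro_bonds = [(0, 1), (1, 2), (1, 3)]

# Hypothetical problem drawing: two bonded cationic atoms.
bad_charges = [1, 1, 0]
bad_bonds = [(0, 1), (1, 2)]

print(is_overall_neutral(nitro_charges), like_charged_neighbours(nitro_charges, nitro_bonds))
print(is_overall_neutral(bad_charges), like_charged_neighbours(bad_charges, bad_bonds))
```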
  21. Aesthetics
          ChEMBL            DrugBank       Issue
          57275   4.42%     70    1.08%    Uneven-length bonds
          25736   1.99%     78    1.20%    Congested layout
          23622   1.82%     24    0.37%    Containing not-quite-linear cyano groups
          167     0.01%     1     -        Zero-dimensional structures
          70      0.01%     0     -        Containing not-quite-linear isocyano groups
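Two of these aesthetic checks are purely geometric and can be sketched on 2D coordinates. The tolerances below (5% of the mean bond length, 1 degree from linear) are illustrative assumptions, as are the function names; the slides do not state the thresholds CVSP uses.

```python
import math

# Sketch of two layout checks on 2D coordinates. Thresholds are assumptions.

def has_uneven_bonds(coords, bonds, tolerance=0.05):
    """True if any bond length differs from the mean by more than tolerance * mean."""
    lengths = [math.dist(coords[i], coords[j]) for i, j in bonds]
    mean = sum(lengths) / len(lengths)
    return any(abs(length - mean) > tolerance * mean for length in lengths)


def angle_deg(a, b, c):
    """Angle a-b-c at vertex b, in degrees, in the range 0..180."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return math.degrees(math.atan2(abs(cross), dot))


def cyano_not_quite_linear(r, c, n, tolerance_deg=1.0):
    """True if the R-C-N angle deviates from 180 degrees by more than the tolerance."""
    return 180.0 - angle_deg(r, c, n) > tolerance_deg


even = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
uneven = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0)]
chain = [(0, 1), (1, 2)]

print(has_uneven_bonds(even, chain), has_uneven_bonds(uneven, chain))
print(cyano_not_quite_linear((0.0, 0.0), (1.0, 0.0), (2.0, 0.1)))
```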
  22. Artwork molecules
          ChEMBL   DrugBank   Finding
          0        0          Cyclobutane
          8        0          Ethane molecules in the structure
          6        0          Sulfur atoms with no explicit bonds
          4        0          Boron atoms with no explicit bonds
          1        0          Ethyne molecule (in the ChEMBL case it actually is acetylene)
          3        0          Stray methane molecules
  23. FDA tautomer and metal rules
          ChEMBL            DrugBank       Finding
          17508   1.35%     80    1.29%    In enol form (or chalcogenoenol form)
          9526    0.74%     4     0.07%    N=C–OH tautomer of a carbonyl compound
          2       -         1     -        Nitroso-form oximes
          1104    0.09%     6     0.09%    Metal–nitrogen bond
          845     0.06%     10    0.15%    Non-metal–transition-metal bond
          432     0.03%     10    0.15%    Metal–oxygen bond
          3       -         2     -        Aluminium–non-metal bond
          2       -         0     -        Metal–fluorine bond
  24. Stereochemistry
          ChEMBL            DrugBank       Finding
          185742  14.3%     39    0.60%    G2-4: Has a single unknown stereocentre and no defined stereocentres: probably a racemate
          68572   5.3%      13    0.20%    G2-42: Has more than one unknown stereocentre and no defined stereocentres: probably problematic. Could indicate relative stereochemistry?
          36572   2.8%      27    0.44%    G2-44: At least one defined stereocentre, and one stereocentre is undefined or unknown: probably an epimer or mixture of anomers
          26076   2.0%      11    0.17%    G2-46: Has more than one unknown stereocentre and more than one defined stereocentre: probably problematic again
          23113   1.8%      13    0.20%    Unknown double bond arrangement
          883     0.1%      1     -        At least one ring containing stereobonds
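The four G2 rules in this table depend only on how many stereocentres are defined and how many are unknown, so they can be sketched as a small classifier. The rule labels come from the slide; the function name is hypothetical, and reading G2-44 as "exactly one unknown stereocentre" is an assumption. Combinations the slide does not describe return None.

```python
# Sketch of the G2 stereocentre rules as a function of two counts.
# Assumption: G2-44 is read as "at least one defined, exactly one unknown".

def classify_stereocentres(defined, unknown):
    """Return the slide's rule label, or None if no listed rule applies."""
    if defined == 0 and unknown == 1:
        return "G2-4"   # probably a racemate
    if defined == 0 and unknown > 1:
        return "G2-42"  # probably problematic; relative stereochemistry?
    if defined >= 1 and unknown == 1:
        return "G2-44"  # probably an epimer or mixture of anomers
    if defined > 1 and unknown > 1:
        return "G2-46"  # probably problematic again
    return None         # fully defined, no stereocentres, or a case not listed


for counts in [(0, 1), (0, 3), (4, 1), (2, 2), (5, 0), (1, 2)]:
    print(counts, classify_stereocentres(*counts))
```

The last example, one defined and two unknown stereocentres, falls between the wording of G2-44 and G2-46, which is why the sketch leaves it unclassified rather than guessing.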
  25. Overview
      • Open PHACTS and chemical validation and standardization
      • RDF for chemoinformatics calculations
      • General case study: ChEMBL
      • Sugar case study: Perspective perception
  26. Sugar depiction challenges
      Stereochemistry not stored in V2000 format (though present in .cdx).
  27. Consequences
  28. Sugar questions
          ChEMBL (19275)    DrugBank (153)   Finding
          5359    27.8%     138   90.2%      At least one L-pyranose ring (often antibiotics contain these)
          4748    24.6%     0     -          At least one perspective chair
          416     2.16%     0     -          At least one Haworth ring
          52      0.03%     0     -          At least one perspective boat or twist boat
  29. Sugar ring redepiction algorithm
      1. Identify perspective conformation (boat, chair, Haworth)
      2. Determine perspective stereo
      3. Assign wedge or hash to bonds accordingly
      4. Reconstruct sugar ring so as to minimize disruption to the rest of molecule
      5. Tidy
  30. Take the x-axis as parallel to the line through the top two chair atoms or through the bottom two chair atoms.
      Δy positive: wedge
      Δy negative: hash
      Then remap chair to homotropous hexagon.
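One way to read the chair rule is sketched below: measure each substituent bond in a frame whose x-axis runs along the line through the two top (or two bottom) chair atoms, and use the sign of the bond's y-component in that frame. The assumptions are that Δy means the substituent's displacement from its ring atom, that page y increases upwards, and that the axis is oriented left to right; the function name and the `eps` threshold are hypothetical.

```python
import math

# Sketch of the chair rule: wedge if the substituent bond points "up" relative
# to the chair's own x-axis, hash if it points "down".
# Assumptions: y increases up the page; delta-y is substituent minus ring atom.

def chair_bond_stereo(axis_a, axis_b, ring_atom, substituent, eps=1e-6):
    ax, ay = axis_b[0] - axis_a[0], axis_b[1] - axis_a[1]
    norm = math.hypot(ax, ay)
    if norm == 0:
        raise ValueError("axis atoms coincide")
    if ax < 0:  # orient the axis left to right so that "up" stays up
        ax, ay = -ax, -ay
    ux, uy = ax / norm, ay / norm
    bx, by = substituent[0] - ring_atom[0], substituent[1] - ring_atom[1]
    # Component of the bond along the rotated y-axis (-uy, ux).
    delta_y = -uy * bx + ux * by
    if delta_y > eps:
        return "wedge"
    if delta_y < -eps:
        return "hash"
    return None  # bond lies along the axis: undetermined


# Horizontal axis: the bond pointing up the page is the wedge.
print(chair_bond_stereo((0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0)))
# Tilted axis: a bond that rises slightly on the page still falls below the axis.
print(chair_bond_stereo((0.0, 0.0), (1.0, 0.2), (0.0, 0.0), (1.0, 0.1)))
```

The second call shows why the frame matters: the bond has a positive page Δy of 0.1 but a negative Δy once measured against the tilted chair axis.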
  31. In the boat case, the substituent further up the page is the wedge, while the one further down the page is the hash, regardless of whether bridgehead or not.
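The boat rule compares page positions directly, so a sketch needs only the y-coordinates of the two substituents on a ring atom. It assumes y increases up the page; the function name and the input shape (a two-entry dict of label to coordinates) are hypothetical.

```python
# Sketch of the boat rule: of two substituents on the same ring atom, the one
# higher on the page gets the wedge and the lower one the hash.
# Assumption: y increases up the page.

def boat_substituent_stereo(substituents):
    """substituents: dict of exactly two labels mapped to (x, y) positions."""
    if len(substituents) != 2:
        raise ValueError("expected exactly two substituents")
    (label_a, pos_a), (label_b, pos_b) = substituents.items()
    if pos_a[1] == pos_b[1]:
        raise ValueError("substituents are level on the page: undetermined")
    if pos_a[1] > pos_b[1]:
        upper, lower = label_a, label_b
    else:
        upper, lower = label_b, label_a
    return {upper: "wedge", lower: "hash"}


print(boat_substituent_stereo({"OH": (1.2, 0.9), "H": (1.3, -0.4)}))
```

Whether the ring atom is a bridgehead plays no part, so the sketch takes no such argument.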
  32. Depiction
      1. Identify mean bond length and chair centroid.
      2. Snap ring atoms to a regular-hexagonal grid.
      3. Remove superfluous hydrogen atoms.
      4. Only mark stereo on a single substituent if they are paired (cf. Grice).
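Steps 1 and 2 can be sketched as follows: compute the ring centroid and mean bond length, then place the six ring atoms on a regular hexagon with that side length, centred on the centroid. This is a simplification of "snapping to a grid": the orientation choice (vertices at 30°, 90°, 150° and so on) and all function names are assumptions, and the sketch handles a single six-membered ring only.

```python
import math

# Sketch: replace a drawn six-membered ring by a regular hexagon that keeps
# the ring centroid, the mean bond length and the winding direction.
# Assumption: hexagon vertices sit at 30 degrees plus multiples of 60 degrees.

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def mean_bond_length(ring):
    closed = ring[1:] + ring[:1]
    return sum(math.dist(a, b) for a, b in zip(ring, closed)) / len(ring)


def signed_area(ring):
    """Shoelace area: positive for anticlockwise atom order."""
    closed = ring[1:] + ring[:1]
    return 0.5 * sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, closed))


def snap_to_hexagon(ring):
    if len(ring) != 6:
        raise ValueError("expected six ring atoms")
    cx, cy = centroid(ring)
    side = mean_bond_length(ring)  # circumradius of a regular hexagon equals its side
    sixth = math.pi / 3
    start = math.atan2(ring[0][1] - cy, ring[0][0] - cx)
    start = round((start - math.pi / 6) / sixth) * sixth + math.pi / 6
    step = math.copysign(sixth, signed_area(ring))  # keep the input winding
    return [
        (cx + side * math.cos(start + k * step), cy + side * math.sin(start + k * step))
        for k in range(6)
    ]


chair = [(0.0, 0.0), (1.0, 0.4), (2.0, 0.0), (2.6, 0.9), (1.6, 0.5), (0.6, 0.9)]
for x, y in snap_to_hexagon(chair):
    print(f"{x:.3f} {y:.3f}")
```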
  33. Tidying: desiderata
      Different problem from structure layout in general.
      The structure we end up with is, in many important respects, fine.
      Preserve drawing conventions, such as aglycones being on the top right-hand side.
  34. Next steps
      • Stable user-facing URI for CVSP (currently http://cvsp.beta.rsc-us.org/, but subject to change)
      • Apply CVSP to all of ChemSpider.
      • Investigate fused rings.
  35. Acknowledgements
      In particular, Jon Steele (RSC), David Sharpe (RSC), John Blunt (Canterbury, NZ)
  36. Any questions?
      batchelorc@rsc.org
      @documentvector