The InChI, the International Chemical Identifier, has been the basis of both indexing and deduplication in the ChemSpider database since the platform's inception. When we adopted the InChI, we envisaged a future in which the identifier would proliferate across journals, databases and the internet in general, giving us a basis for “structure searching the internet”. This presentation gives an overview of how the InChI has facilitated the integration of ChemSpider with chemistry on the internet, describes some of the surprising findings that have resulted from this work, and extrapolates the influence of InChIs into the future of a chemically enabled web.
11. Of course it is out there…
Drugbox: 3001/5080 with InChIs
Chembox: 5436/7690 with InChIs
12. Tell me more…
Where can I find the molfile for Yohimbine?
Papers/Patents about Yohimbine?
What are the side effects of Yohimbine?
Where can I order Yohimbine?
What are the physicochemical properties?
Metabolic pathways?
Different synonyms of Yohimbine?
Synthesis of Yohimbine?
Side effects of Yohimbine?
Etc.
15. How do we build it?
We deal in Molfiles or SDF files – with coordinates
Deposit anything that has an InChI – we support what InChI can handle, good and bad
Standardization based on “InChI standardization”
InChIs aggregate (certain) tautomers
We link out to external sites using their IDs
16. Downsides of InChI
InChI was a moving target (multiple versions) but overall worked as planned.
Good for small molecules – but no polymers, and issues with inorganics, organometallics and imperfect stereochemistry. ChemSpider is “small molecules”.
InChI is used as the “deduplicator” – the FIRST version of a compound into the database becomes THE structure to deduplicate against…
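The “first version in wins” behaviour can be sketched in a few lines. This is a minimal illustration, not ChemSpider's actual code: the `Deposit` class, the `deduplicate` function and the vendor names are all assumptions made for the example, and the molfile text is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class Deposit:
    source: str   # hypothetical depositor name
    inchi: str    # InChI supplied with, or generated from, the molfile
    molfile: str  # original structure record (placeholder text here)


def deduplicate(deposits):
    """Keep the first deposit seen for each InChI; later ones only add a source."""
    registry = {}  # InChI -> the Deposit that became THE structure
    sources = {}   # InChI -> every source that supplied that InChI, in order
    for dep in deposits:
        if dep.inchi not in registry:
            registry[dep.inchi] = dep
        sources.setdefault(dep.inchi, []).append(dep.source)
    return registry, sources


deposits = [
    Deposit("vendor_a", "InChI=1S/CH4/h1H4", "<molfile from vendor_a>"),
    Deposit("vendor_b", "InChI=1S/CH4/h1H4", "<molfile from vendor_b>"),
    Deposit("vendor_b", "InChI=1S/H2O/h1H2", "<molfile from vendor_b>"),
]
registry, sources = deduplicate(deposits)
```

In this sketch the second methane deposit never replaces the first, so any layout or stereo problem in the first record is what every later deposit gets matched against.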
21. Downsides of Overall Approach
Meshing data together based on InChIs worked for simple molecules
2D layout errors inherited or limited by algorithm
Complex molecules that are meant to be the same thing were NOT deduplicated. Compounds that differ by one stereocenter but share a name, and are meant to be the same, are not treated as the same
30. Recognizing Compound Dilution
So much chemistry on the web….
And so much dilution – “structural uniqueness” versus “accidental ambiguity”
InChI as an easy skeleton search
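One way to read “skeleton search” is through the InChIKey: the first 14-character block of a standard InChIKey is a hash of the connectivity, so records that share it have the same skeleton even when their stereochemistry layers differ. The sketch below groups records on that block. The function names are assumptions for the example, and the keys are made-up placeholders rather than keys of real compounds (only the `UHFFFAOYSA` block is the real “no stereo or isotope layers” value).

```python
from collections import defaultdict


def skeleton_block(inchikey):
    """Return the first 14-character block of a standard InChIKey."""
    parts = inchikey.split("-")
    if len(parts) != 3 or len(parts[0]) != 14:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return parts[0]


def group_by_skeleton(records):
    """Group (label, InChIKey) pairs by their connectivity block."""
    groups = defaultdict(list)
    for label, key in records:
        groups[skeleton_block(key)].append(label)
    return dict(groups)


# Placeholder keys, invented for illustration only.
records = [
    ("record 1 (full stereo)", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("record 2 (no stereo)", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),
    ("record 3 (other skeleton)", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"),
]
groups = group_by_skeleton(records)
```

Records 1 and 2 land in the same group, which is the kind of cluster where “structural uniqueness” and “accidental ambiguity” have to be told apart by inspection.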
35. Many Problems Can be Solved…
Clean up databases – structure validation, structure standardization
Warn about: valency, charge balance, depiction issues, bond types, absent stereo, and another 100 rules (or so…)
Standardize: agree community rules to “Standardize”
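Two of the warnings listed above, valency and charge balance, are simple enough to sketch as rule functions. Everything here is an assumption made for illustration: the dictionary-based molecule model, the tiny `MAX_VALENCE` table (neutral H, C, N and O only) and the rule names. A real validator would cover far more elements, charge states and the other hundred or so rules.

```python
# Simplified maximum valences for neutral atoms (assumption for this sketch).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def check_charge_balance(mol):
    """Warn when the formal charges do not sum to zero."""
    net = sum(atom["charge"] for atom in mol["atoms"])
    return [f"net charge is {net:+d}"] if net != 0 else []


def check_valence(mol):
    """Warn when a neutral atom has more bond order than the table allows."""
    totals = [0] * len(mol["atoms"])
    for i, j, order in mol["bonds"]:
        totals[i] += order
        totals[j] += order
    warnings = []
    for idx, (atom, total) in enumerate(zip(mol["atoms"], totals)):
        limit = MAX_VALENCE.get(atom["element"])
        if limit is not None and atom["charge"] == 0 and total > limit:
            warnings.append(
                f"atom {idx} ({atom['element']}) has valence {total}, "
                f"expected at most {limit}"
            )
    return warnings


RULES = [check_charge_balance, check_valence]


def validate(mol):
    """Run every rule and collect the warnings."""
    return [warning for rule in RULES for warning in rule(mol)]


# Nitro group drawn with a pentavalent nitrogen (hydrogens omitted).
pentavalent_nitro = {
    "atoms": [{"element": e, "charge": 0} for e in ("C", "N", "O", "O")],
    "bonds": [(0, 1, 1), (1, 2, 2), (1, 3, 2)],
}
# The same group drawn charge-separated: N+ and O-.
charge_separated_nitro = {
    "atoms": [
        {"element": "C", "charge": 0},
        {"element": "N", "charge": 1},
        {"element": "O", "charge": 0},
        {"element": "O", "charge": -1},
    ],
    "bonds": [(0, 1, 1), (1, 2, 2), (1, 3, 1)],
}
warnings = validate(pentavalent_nitro)
```

Keeping each rule as a separate function makes it easy for a community to agree, add or drop rules one at a time.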
38. What needs to happen?
If we could validate:
  Catch errors in databases (and clean)
  Proactively catch errors in publications/patents
  Reduce junk in the ether – improve QUALITY!
If we standardized:
  Interlinking should improve
42.
Substructure | # of Hits | # of Correct Hits | No stereochemistry | Incomplete stereochemistry | Complete but incorrect stereochemistry
Gonane | 34 | 5 | 8 | 21 | 0
Gon-4-ene | 55 | 12 | 3 | 33 | 7
Gon-1,4-diene | 60 | 17 | 10 | 23 | 10
43. Structure-Name Validation
[Figure: chemical structure drawings labeled Choladine, Taxol, Cholane and Chlotrimazole]
44. Standardize
Use the SRS (the FDA Substance Registration System) as a guidance document for standardization
Adjust as necessary to our needs
49. ChemSpider Standardization
The entire ChemSpider database will be standardized using a modified FDA rule set
Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, names) will be regenerated
Standardization procedures will be applied automatically to all future depositions
50. Identifier Dictionaries
Reciprocal curation processes… share curation with each other.
If a database already has a compound, use InChIKeys to match “suggested” validation against the compound.
A series of “added” and “removed” synonyms against InChIKeys for matching.
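A shared curation feed of this kind could be applied as sketched below. The feed layout (InChIKey, action, synonym), the function name and the placeholder keys and names are assumptions for the example; no actual feed format is given on the slide.

```python
def apply_synonym_feed(dictionary, feed):
    """Apply "added"/"removed" synonym suggestions to compounds already held.

    dictionary: InChIKey -> set of synonyms in the receiving database
    feed: iterable of (inchikey, action, synonym) tuples from a partner
    Returns the feed entries whose InChIKey is not in the dictionary.
    """
    unmatched = []
    for inchikey, action, synonym in feed:
        if inchikey not in dictionary:
            unmatched.append((inchikey, action, synonym))
            continue
        if action == "added":
            dictionary[inchikey].add(synonym)
        elif action == "removed":
            dictionary[inchikey].discard(synonym)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return unmatched


# Placeholder InChIKeys and names, invented for illustration only.
dictionary = {
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N": {"compound-x", "wrong-name"},
}
feed = [
    ("AAAAAAAAAAAAAA-BBBBBBBBSA-N", "added", "compound-x hydrate-free"),
    ("AAAAAAAAAAAAAA-BBBBBBBBSA-N", "removed", "wrong-name"),
    ("CCCCCCCCCCCCCC-UHFFFAOYSA-N", "added", "compound-y"),
]
unmatched = apply_synonym_feed(dictionary, feed)
```

Entries for compounds the receiving database does not hold are returned rather than applied, so they can be reviewed separately.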
51. Proof of Concept Data Curation Sharing
Who wants to work with us?
52. Structure Validation using feed
Look for approved synonyms
Compare feed InChIKey with database InChIKey
If different, flag for inspection
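The three steps above can be sketched as a single pass over the feed. The data layout (approved synonym mapped to InChIKey), the function name and the placeholder names and keys are assumptions made for this example.

```python
def flag_mismatches(database, feed):
    """Return feed entries whose InChIKey disagrees with the database.

    database: approved synonym (lower case) -> InChIKey held locally
    feed: iterable of (synonym, inchikey) pairs from a partner
    """
    flagged = []
    for synonym, feed_key in feed:
        db_key = database.get(synonym.lower())
        if db_key is None:
            continue  # not an approved synonym here, nothing to compare
        if db_key != feed_key:
            flagged.append((synonym, db_key, feed_key))
    return flagged


# Placeholder names and InChIKeys, invented for illustration only.
database = {"compound-x": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"}
feed = [
    ("Compound-X", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),  # agrees
    ("Compound-X", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),  # same skeleton, no stereo
    ("compound-z", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"),  # unknown synonym
]
flagged = flag_mismatches(database, feed)
```

The flagged entry here differs only in the second InChIKey block, the typical case where one source carries stereochemistry and the other does not.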
53. It is so difficult to navigate…
[Figure: questions surrounding a compound]
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Competitors?
Working On Now?
Connections to disease?
Expressed in right cell type?
IP?
54. Open PHACTS Project
Develop a set of robust standards…
Implement the standards in a semantic integration hub
Deliver services to support drug discovery programs in pharma and public domain
22 partners, 8 pharmaceutical companies, 3 biotechs
36-month project
Guiding principle is open access, open usage, open source – key to standards adoption
56. Chemistry in Open PHACTS
Selected data slices of ChemSpider carrying pharmacological links into the “linked data cache”
ChemSpiderIDs and InChIs/InChIKeys will be in Open PHACTS and available for linking
A structure ID standard to enable further linking across the semantic web of science
57. ChemSpider and InChI
Small organic molecules
Undefined materials
Organometallics
Nanomaterials
Polymers
Minerals
Particle bound
Links to Biologicals

Internet Data
Commercial Software
Pre-competitive Data
Open Science
Open Data
Publishers
Educators
Open Databases
Chemical Vendors
58. The great promise should be obvious
InChIs are here to stay
They will evolve, they will encompass, we will adopt and adapt
Public and private databases will federate & build a linked environment of validated data!
Data validation and standardization are needed
Open Data will continue to proliferate
InChIs are in the “Semantic Web” already
59. If InChI never existed or went away…
ChemSpider would never have been built
Database linking would suffer dramatically
The web would not be “structure searchable”
Cheminformatics tools would likely not be linking to public domain databases in the same way
And we would not have the pleasure of today…
60. Acknowledgments
The inspiration of the InChI Masters – Steve H., Steve S., Alan, Dmitrii, Igor
IUPAC, NIST, all adopters, supporters, challengers and users
The InChI Trust and its supporters for funding continued development
Al Gore – enabling us to search InChIs on the web