Great promise of navigating the
internet using InChIs

Antony J Williams
ACS San Diego March 2012

Openness and Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)

Science Translational Medicine 2011

Warning…
 This talk is not about Quality…it’s about quantity

Drugbank was here

It’s about what’s out there…

And getting out of overwhelm…

Of course it is out there…

Drugbox: 3001/5080 with InChIs

Chembox: 5436/7690 with InChIs

Tell me more…
 Where can I find the molfile for Yohimbine?
 Papers/Patents about Yohimbine?
 What are the side effects of Yohimbine?
 Where can I order Yohimbine?
 What are the physicochemical properties?
 Metabolic pathways?
 Different synonyms of Yohimbine?
 Synthesis of Yohimbine?
 Etc….

Yohimbine on ChemSpider… Quality?

How do we build it?
 We deal in Molfiles or SDF files – with coordinates

 Deposit anything that has an InChI – we support
what InChI can handle, good and bad

 Standardization based on “InChI standardization”

 InChIs aggregate (certain) tautomers

 We link out to external sites using their IDs

Downsides of InChI
 InChI was a moving target (multiple versions) but
overall worked as planned.

 Good for small molecules – but no polymers,
issues with inorganics, organometallics, imperfect
stereochemistry. ChemSpider is “small molecules”

 InChI used as the “deduplicator” – FIRST version
of a compound into the database becomes THE
structure to deduplicate against…
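
The "first in wins" behaviour described above can be pictured with a minimal sketch. It assumes each deposited record is a plain dictionary with hypothetical `inchikey`, `molfile` and `source` fields; it is not ChemSpider's actual deposition code.

```python
def deduplicate(records):
    """Keep the first structure seen for each InChIKey.

    Later records with the same key do not replace the stored structure;
    they only add their source to the existing entry.
    """
    registry = {}
    for rec in records:
        key = rec["inchikey"]
        if key not in registry:
            # First deposition becomes THE structure for this key.
            registry[key] = {"structure": rec["molfile"], "sources": [rec["source"]]}
        else:
            registry[key]["sources"].append(rec["source"])
    return registry


# Illustrative records; the molfile strings are placeholders, not real molfiles.
records = [
    {"inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "molfile": "ethanol drawing A", "source": "vendor-1"},
    {"inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "molfile": "caffeine drawing", "source": "vendor-1"},
    {"inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "molfile": "ethanol drawing B", "source": "vendor-2"},
]
registry = deduplicate(records)
```

If the first deposited drawing is poor, every later deposition is still matched against it, which is the downside the slide points at.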

Standardization Issues
Depiction based on molfile

Downsides of Overall Approach
 Meshing data together based on InChIs worked
for simple molecules

 2D layout errors inherited or limited by algorithm

 Complex molecules that are meant to be the
same thing were NOT deduplicated. Compounds
differing by one stereocenter, named the same,
meant to be the same, are not the same

InChI String Search via Google
Give me InChIKeys…

 ChemSpider – Aggregator
 BRENDA – Enzymes
 Wikipedia – Encyclopedia
 ChEMBL – Pharmacology
 ChEBI – Curated Chemicals
 DrugBank – Drug-Drug Target
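
Searching the web for an InChIKey amounts to an exact-phrase text query. A minimal sketch, assuming a standard Google search URL and a hypothetical helper name `inchikey_search_url`; the optional `site` argument narrows the query to one of the resources listed above.

```python
from urllib.parse import quote_plus


def inchikey_search_url(inchikey, site=None):
    """Build a web search URL for an exact InChIKey match."""
    query = f'"{inchikey}"'  # quotes force an exact-phrase match
    if site:
        query += f" site:{site}"
    return "https://www.google.com/search?q=" + quote_plus(query)


# Caffeine's standard InChIKey, used here only as an example query.
url = inchikey_search_url("RYYVLZVUVIJVGH-UHFFFAOYSA-N")
wiki_url = inchikey_search_url("RYYVLZVUVIJVGH-UHFFFAOYSA-N", site="en.wikipedia.org")
```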

Recognizing Compound Dilution
 So much chemistry on the web….

 And so much dilution – “structural uniqueness”
versus “accidental ambiguity”

 InChI as an easy skeleton search
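
The skeleton search works because the first 14-character block of an InChIKey hashes the connectivity, while the second block carries stereochemistry and other layers. A minimal sketch of grouping by that first block; the helper names are hypothetical, and the alanine keys are quoted for illustration and should be checked against a trusted source before reuse.

```python
from collections import defaultdict


def skeleton(inchikey):
    """Return the first (connectivity) block of an InChIKey."""
    return inchikey.split("-")[0]


def group_by_skeleton(inchikeys):
    """Group full InChIKeys that share the same connectivity block."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[skeleton(key)].append(key)
    return dict(groups)


keys = [
    "QNAYBMKLOCPYGJ-REOHZTLTSA-N",  # L-alanine
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # D-alanine
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",  # alanine, no stereo defined
    "RYYVLZVUVIJVGH-UHFFFAOYSA-N",  # caffeine
]
groups = group_by_skeleton(keys)
```

One skeleton with many full keys behind it is a quick indicator of the dilution discussed here.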

Vancomycin – Search the Internet

Vancomycin

Search Molecular Search Full Molecule
SKELETON

All aggregators suffer dilution!

Many Problems Can be Solved…
 Clean up databases – structure validation,
structure standardization

 Warn about
 Valency, charge balance, depiction issues,
bond types, absent stereo, and another 100
rules (or so…)

 Standardize
 Agree on community rules to “Standardize”
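
A rough sketch of how "warn about" rules might be organised: each rule is a function that returns warnings, and a record is run through all of them. The record layout (atoms with `element` and `charge`, bonds as index pairs with an order) and the two rules are assumptions for illustration, far short of the hundred or so rules mentioned above.

```python
# Typical maximum valences for a few neutral elements (illustrative, not exhaustive).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def check_valence(mol):
    """Warn when a neutral atom carries more bonds than its typical maximum."""
    totals = [0] * len(mol["atoms"])
    for a, b, order in mol["bonds"]:
        totals[a] += order
        totals[b] += order
    warnings = []
    for idx, atom in enumerate(mol["atoms"]):
        limit = MAX_VALENCE.get(atom["element"])
        # Charged atoms are skipped to keep the sketch simple.
        if atom["charge"] == 0 and limit is not None and totals[idx] > limit:
            warnings.append(
                f"valence: atom {idx} ({atom['element']}) has {totals[idx]} bonds, max {limit}"
            )
    return warnings


def check_charge_balance(mol):
    """Warn when the formal charges do not sum to zero."""
    net = sum(atom["charge"] for atom in mol["atoms"])
    return [f"charge: net charge is {net:+d}"] if net != 0 else []


RULES = [check_valence, check_charge_balance]


def validate(mol):
    """Run every rule and collect all warnings."""
    return [w for rule in RULES for w in rule(mol)]


# Hypothetical record: a carbon drawn with five hydrogens.
bad = {
    "atoms": [{"element": "C", "charge": 0}] + [{"element": "H", "charge": 0}] * 5,
    "bonds": [(0, i, 1) for i in range(1, 6)],
}
warnings = validate(bad)
```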

What needs to happen?
 If we could validate
 Catch errors in databases (and clean)
 Proactively catch errors in publications/patents
 Reduce junk in the ether – improve QUALITY!

 If we standardized
 Interlinking should improve

Substructure     # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
Gonane           34          5                   8                    21                           0
Gon-4-ene        55          12                  3                    33                           7
Gon-1,4-diene    60          17                  10                   23                           10
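
The table reads more easily as percentages. A small sketch, using only the counts shown above, that checks each row adds up (correct + no stereo + incomplete + incorrect = hits) and computes the share of correct hits; the variable names are my own.

```python
# (hits, correct, no stereo, incomplete stereo, complete but incorrect stereo)
rows = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}

pct_correct = {}
for name, (hits, correct, no_stereo, incomplete, incorrect) in rows.items():
    # Each hit falls into exactly one of the four categories.
    assert correct + no_stereo + incomplete + incorrect == hits
    pct_correct[name] = round(100 * correct / hits, 1)

print(pct_correct)
# {'Gonane': 14.7, 'Gon-4-ene': 21.8, 'Gon-1,4-diene': 28.3}
```

In every row fewer than a third of the hits carry correct stereochemistry.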

Structure-Name Validation
[Figure: structure drawings labelled Choladine, Taxol, Cholane and Chlotrimazole]

Standardize

 Use the SRS as a guidance document for
standardization
 Adjust as necessary to our needs

Millions of structures? Lots of Issues

ChemSpider Standardization
 Entire ChemSpider database will be standardized
using modified FDA rule set

 Original Molfiles will be standardized and all
properties (predicted properties, SMILES, InChIs,
Names) will be regenerated

 Standardization procedures automatically applied
to all future depositions

Identifier Dictionaries
 Reciprocal curation processes…share curation
with each other.

 If a database already has a compound, use
InChIKeys to match “suggested” validation
against the compound.

 A series of “added” and “removed” synonyms
against InChIKeys for matching.
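
One way to picture such a shared curation feed: each entry names an InChIKey plus synonyms to add and remove, and a receiving database applies only the entries for compounds it already holds. The feed layout and the function name below are assumptions for illustration, not an agreed exchange format.

```python
def apply_synonym_feed(dictionary, feed):
    """Apply added/removed synonyms to compounds already in the dictionary.

    dictionary maps InChIKey -> set of synonyms.
    Returns the InChIKeys from the feed that had no match and were skipped.
    """
    skipped = []
    for entry in feed:
        key = entry["inchikey"]
        if key not in dictionary:
            skipped.append(key)  # compound not held locally; nothing to curate
            continue
        dictionary[key] |= set(entry.get("added", []))
        dictionary[key] -= set(entry.get("removed", []))
    return skipped


# Local identifier dictionary with one misspelled synonym for aspirin.
dictionary = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": {"aspirin", "asprin"}}
feed = [
    {"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
     "added": ["acetylsalicylic acid"], "removed": ["asprin"]},
    {"inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "added": ["caffeine"]},
]
skipped = apply_synonym_feed(dictionary, feed)
```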

Proof of Concept Data Curation Sharing
Who wants to work with us?

Structure Validation using feed
 Look for approved synonyms

 Compare feed InChIKey with database InChIKey

 If different, flag for inspection
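
The three steps above can be sketched as a lookup and a comparison. This assumes the database is reduced to a name-to-InChIKey mapping and the feed to (approved synonym, InChIKey) pairs; both layouts and the function name are hypothetical, and the alanine keys are illustrative.

```python
def validate_against_feed(database, feed):
    """Flag names whose database InChIKey differs from the feed InChIKey.

    database: dict mapping lower-cased name -> InChIKey
    feed: iterable of (approved synonym, InChIKey) pairs
    """
    flagged = []
    for name, feed_key in feed:
        db_key = database.get(name.lower())
        if db_key is not None and db_key != feed_key:
            flagged.append((name, db_key, feed_key))  # needs manual inspection
    return flagged


database = {
    "l-alanine": "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # stored with the D-alanine key
    "caffeine": "RYYVLZVUVIJVGH-UHFFFAOYSA-N",
}
feed = [
    ("L-Alanine", "QNAYBMKLOCPYGJ-REOHZTLTSA-N"),
    ("Caffeine", "RYYVLZVUVIJVGH-UHFFFAOYSA-N"),
]
flagged = validate_against_feed(database, feed)
```

Only the alanine record is flagged; the two keys share a first block, so the mismatch is in stereochemistry rather than connectivity.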

It is so difficult to navigate…
 IP?
 What’s the structure?
 Are they in our file?
 What’s similar?
 What’s the target?
 Pharmacology data?
 Known Pathways?
 Competitors?
 Working On Now?
 Connections to disease?
 Expressed in right cell type?

Open PHACTS Project
 Develop a set of robust standards…
 Implement the standards in a semantic integration hub
 Deliver services to support drug discovery programs in
pharma and public domain
 22 partners, 8 pharmaceutical companies, 3 biotechs
 36-month project

Guiding principle is open access, open usage, open source
- Key to standards adoption -

Chemistry in Open PHACTS
 Selected data slices of ChemSpider carrying
pharmacological links into the “linked data cache”

 ChemSpiderIDs and InChIs/InChIKeys will be in
Open PHACTS and available for linking

 A structure ID standard to enable further linking
across the semantic web of science

ChemSpider and InChI
Internet Data

Small organic molecules     Commercial Software
Undefined materials         Pre-competitive Data
Organometallics             Open Science
Nanomaterials               Open Data
Polymers                    Publishers
Minerals                    Educators
Particle bound              Open Databases
Links to Biologicals        Chemical Vendors

The great promise should be obvious
 InChIs are here to stay
 They will evolve, they will encompass, we will
adopt and adapt
 Public and private databases will federate &
build a linked environment of validated data!
 Data validation and standardization is
needed
 Open Data will continue to proliferate
 InChIs are in the “Semantic Web” already

If InChI had never existed or went away…
 ChemSpider would never have been built

 Database linking would suffer dramatically

 The web would not be “structure searchable”

 Cheminformatics tools would likely not be linking
to public domain databases in the same way

 And we would not have the pleasure of today…

Acknowledgments
 The inspiration of the InChI Masters – Steve H.,
Steve S., Alan, Dmitrii, Igor

 IUPAC, NIST, all adopters, supporters,
challengers and users

 The InChI Trust and its supporters for funding
continued development

 Al Gore – enabling us to search InChIs on the web

Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
