Great promise of navigating the internet using InChIs

The InChI, the International Chemical Identifier, has been the basis of both indexing and deduplication of the ChemSpider database since the inception of the platform. When the InChI was adopted we envisaged a future in which the identifier would proliferate across journals, databases and the internet in general, providing us a basis for “structure searching the internet”. This presentation will provide an overview of how the InChI has facilitated the integration of ChemSpider with chemistry on the internet, describe some of the surprising findings that have resulted from this work, and extrapolate the influence of InChIs into the future for a chemically enabled web.

Published in: Technology

  1. 1. Great promise of navigating the internet using InChIs Antony J Williams ACS San Diego March 2012
  2. 2. Openness and Quality Issues. Williams and Ekins, DDT, 16: 747-750 (2011); Science Translational Medicine 2011
  3. 3. Warning… This talk is not about Quality…it’s about quantity
  4. 4. Warning… This talk is not about Quality…it’s about quantity Drugbank was here
  5. 5. Data quality is a known issue
  6. 6. We ALL have issues!!!
  7. 7. It’s about what’s out there…
  8. 8. How to Link it…
  9. 9. And getting out of overwhelm…
  10. 10. So what is Yohimbine?
  11. 11. Of course it is out there… Drugbox: 3001/5080 with InChIs; Chembox: 5436/7690 with InChIs
  12. 12. Tell me more… Where can I find the molfile for Yohimbine? Papers/Patents about Yohimbine? What are the side effects of Yohimbine? Where can I order Yohimbine? What are the physicochemical properties? Metabolic pathways? Different synonyms of Yohimbine? Synthesis of Yohimbine? Side effects of Yohimbine? Etc….
  13. 13. Quantity!
  14. 14. Yohimbine on ChemSpider..Quality?
  15. 15. How do we build it? We deal in Molfiles or SDF files – with coordinates. Deposit anything that has an InChI – we support what InChI can handle, good and bad. Standardization based on “InChI standardization”. InChIs aggregate (certain) tautomers. We link out to external sites using their IDs.
  16. 16. Downsides of InChI: InChI was a moving target (multiple versions) but overall worked as planned. Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules”. InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
  17. 17. Side Effects of InChI Usage
  18. 18. SMILES by comparison…
  19. 19. Side Effects of InChI Usage
  20. 20. Standardization Issues: Depiction based on molfile
  21. 21. Downsides of Overall Approach: Meshing data together based on InChIs worked for simple molecules. 2D layout errors inherited or limited by algorithm. Complex molecules that are meant to be the same thing were NOT deduplicated. Compounds differing by one stereocenter, named the same, meant to be the same, are not the same.
  22. 22. Yohimbine on ChemSpider..Quality?
  23. 23. So where can we travel???
  24. 24. So where can we travel???
  25. 25. InChI String Search via Google. Give me InChIKeys…
  26. 26. And where can we travel???
  27. 27.  ChemSpider BRENDA Wikipedia ChEMBL ChEBI DrugBank
  28. 28.  Aggregator Enzymes Encyclopedia Pharmacology Curated Chemicals Drug-Drug Target
  29. 29. Recognizing Compound Dilution So much chemistry on the web…. And so much dilution – “structural uniqueness” versus “accidental ambiguity” InChI as an easy skeleton search
  30. 30. Vancomycin – Search the Internet
  31. 31. Vancomycin Search: Molecular Search, Full Molecule, SKELETON
  32. 32. Full Skeleton Search
  33. 33. All aggregators suffer dilution!
  34. 34. Many Problems Can be Solved… Clean up databases – structure validation, structure standardization. Warn about: valency, charge balance, depiction issues, bond types, absent stereo, and another 100 rules (or so…). Standardize: agree community rules to “Standardize”.
  35. 35. Structure Validation
  36. 36. Structure Validation - Fixed
  37. 37. What needs to happen? If we could validate: catch errors in databases (and clean); proactively catch errors in publications/patents; reduce junk in the ether – improve QUALITY! If we standardized: interlinking should improve.
  38. 38. NPC Browser Set
  39. 39. Download, Deposit, Reprocess
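  40. 40.

      | Substructure  | # of Hits | # of Correct Hits | No stereochemistry | Incomplete stereochemistry | Complete but incorrect stereochemistry |
      |---------------|-----------|-------------------|--------------------|----------------------------|----------------------------------------|
      | Gonane        | 34        | 5                 | 8                  | 21                         | 0                                      |
      | Gon-4-ene     | 55        | 12                | 3                  | 33                         | 7                                      |
      | Gon-1,4-diene | 60        | 17                | 10                 | 23                         | 10                                     |
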
  41. 41. Structure-Name Validation [structure drawings labeled: Choladine, Taxol, Cholane, Chlotrimazole]
  42. 42. Standardize: Use the SRS as a guidance document for standardization. Adjust as necessary to our needs.
  43. 43. Nitro groups
  44. 44. Salt and Ionic Bonds
  45. 45. Ammonium salts
  46. 46. Millions of structures? Lots of Issues
  47. 47. ChemSpider Standardization: Entire ChemSpider database will be standardized using modified FDA rule set. Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, Names) will be regenerated. Standardization procedures automatically applied to all future depositions.
  48. 48. Identifier Dictionaries: Reciprocal curation processes…share curation with each other. If a database has a compound already then use InChIKeys to match “suggested” validation against the compound. A series of “added” and “removed” synonyms against InChIKeys for matching.
  49. 49. Proof of Concept Data Curation Sharing. Who wants to work with us?
  50. 50. Structure Validation using feed: Look for approved synonyms. Compare feed InChIKey with database InChIKey. If different, flag for inspection.
  51. 51. It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?
  52. 52. Open PHACTS Project: Develop a set of robust standards… Implement the standards in a semantic integration hub. Deliver services to support drug discovery programs in pharma and public domain. 22 partners, 8 pharmaceutical companies, 3 biotechs. 36-month project. Guiding principle is open access, open usage, open source - Key to standards adoption -
  53. 53. Chemistry in Open PHACTS: Selected data slices of ChemSpider carrying pharmacological links into the “linked data cache”. ChemSpiderIDs and InChIs/InChIKeys will be in Open PHACTS and available for linking. A structure ID standard to enable further linking across the semantic web of science.
  54. 54. ChemSpider and InChI. Internet Data, Commercial Software, Pre-competitive Data, Open Science, Open Data, Publishers, Educators, Open Databases, Chemical Vendors. Small organic molecules, Undefined materials, Organometallics, Nanomaterials, Polymers, Minerals, Particle bound, Links to Biologicals.
  55. 55. The great promise should be obvious. InChIs are here to stay. They will evolve, they will encompass, we will adopt and adapt. Public and private databases will federate & build a linked environment of validated data! Data validation and standardization are needed. Open Data will continue to proliferate. InChIs are in the “Semantic Web” already.
  56. 56. If InChI had never existed or went away… ChemSpider would never have been built. Database linking would suffer dramatically. The web would not be “structure searchable”. Cheminformatics tools would likely not be linking to public domain databases in the same way. And we would not have the pleasure of today…
  57. 57. Acknowledgments: The inspiration of the InChI Masters – Steve H., Steve S., Alan, Dmitrii, Igor. IUPAC, NIST, all adopters, supporters, challengers and users. The InChI Trust and its supporters for funding continued development. Al Gore – enabling us to search InChIs on the web.
  58. 58. Steve Heller
  59. 59. Steve Heller
  60. 60. Thank you. Email: williamsa@rsc.org; Twitter: ChemConnector; Personal Blog: www.chemconnector.com; SLIDES:
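
The feed-based validation steps on slide 50 (look up the synonym, compare InChIKeys, flag differences) and the skeleton search idea on slide 29 can be sketched together in Python. This is a minimal sketch under stated assumptions, not ChemSpider's implementation: the function names `skeleton_block` and `validate_feed`, the synonym-to-InChIKey dictionaries, the case-insensitive synonym match and the status strings are all hypothetical, and the keys are made-up placeholders rather than real compounds. It relies on one property of the standard InChIKey: the first 14-letter block is a hash of the connectivity layers, so two keys that share it very likely share a skeleton and differ only in the remaining layers, such as stereochemistry.

```python
import re

# Shape of a standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def skeleton_block(inchikey):
    """Return the first 14-letter block, which hashes the connectivity layers."""
    if not INCHIKEY_PATTERN.match(inchikey):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return inchikey.split("-")[0]


def validate_feed(feed, database):
    """Compare feed InChIKeys with database InChIKeys, matched by synonym.

    Both arguments map a synonym to an InChIKey (a hypothetical data shape).
    Returns one (synonym, status) tuple per feed entry, in feed order.
    """
    # Case-insensitive synonym lookup: an assumption, not a ChemSpider rule.
    db_by_name = {name.lower(): key for name, key in database.items()}
    report = []
    for name, feed_key in feed.items():
        db_key = db_by_name.get(name.lower())
        if db_key is None:
            status = "not in database"
        elif feed_key == db_key:
            status = "match"
        elif skeleton_block(feed_key) == skeleton_block(db_key):
            # Same connectivity hash; stereo or other layers disagree.
            status = "flag: same skeleton, other layers differ"
        else:
            status = "flag: different skeleton"
        report.append((name, status))
    return report


# Placeholder keys: syntactically valid but made up, not real compounds.
database = {
    "compound-a": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "compound-b": "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
    "compound-c": "EEEEEEEEEEEEEE-FFFFFFFFSA-N",
}
feed = {
    "Compound-A": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "compound-b": "CCCCCCCCCCCCCC-GGGGGGGGSA-N",
    "compound-c": "HHHHHHHHHHHHHH-FFFFFFFFSA-N",
    "compound-d": "IIIIIIIIIIIIII-JJJJJJJJSA-N",
}

for name, status in validate_feed(feed, database):
    print(f"{name}: {status}")
```

Splitting the flag into "same skeleton" and "different skeleton" is a design choice added here for illustration. It separates the stereochemistry disagreements described on slide 21 from cases where a name points at a different molecule altogether, which a curator would probably want to triage differently.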