The Importance of the InChI Identifier as a Foundation Technology for eScience Platforms at RSC
Antony Williams

The Royal Society of Chemistry hosts one of the largest online chemistry databases containing almost 30 million unique chemical structures. The database, ChemSpider, provides the underpinning for a series of eScience projects allowing for the integration of chemical compounds with our archive of scientific publications, the delivery of a reaction database containing millions of reactions as well as a chemical validation and standardization platform developed to help improve the quality of structural representations on the internet. The InChI has been a fundamental part of each of our projects and has been pivotal in our support of international projects such as the Open PHACTS semantic web project integrating chemistry and biology data and the PharmaSea project focused on identifying novel chemical components from the ocean with the intention of identifying new antibiotics. This presentation will provide an overview of the importance of InChI in the development of many of our eScience platforms and how we have used it specifically in the ChemSpider project to provide integration across hundreds of websites and chemistry databases across the web. We will discuss how we are now expanding our efforts to develop a Global Chemistry Network encompassing efforts in Open Source Drug Discovery and the support of data management for neglected diseases.

Published in: Science, Technology

  1. The Importance of the InChI Identifier as a Foundation Technology for eScience Platforms at RSC
     Antony Williams
     Bio-IT, Boston, April 27th 2014
  2. Without the InChI…
     • ChemSpider is unlikely to have been built
     • It would not have grown into one of the domain's primary online chemistry resources
     • The Royal Society of Chemistry would not have it as an online database, would not have a large cheminformatics team and would not be involved in a number of large-scale funded projects around chemistry data
  3. • ~30 million chemicals and growing
     • Data sourced from >500 different sources
     • Crowdsourced curation and annotation
     • Ongoing deposition of data from our journals and our collaborators
     • Structure-centric hub for web searching
     • …and a really big dictionary!!!
  4. ChemSpider
  5. ChemSpider
  6. Experimental/Predicted Properties
  7. Literature references
  8. Patent references
  9. So what is Yohimbine?
  10. Of course it is out there…
      Drugbox: 3001/5080 with InChIs
      Chembox: 5436/7690 with InChIs
  11. Tell me more…
      • Where can I find the molfile for Yohimbine?
      • Papers/Patents about Yohimbine?
      • What are the side effects of Yohimbine?
      • Where can I order Yohimbine?
      • What are the physicochemical properties?
      • Metabolic pathways?
      • Different synonyms of Yohimbine?
      • Synthesis of Yohimbine?
      • Side effects of Yohimbine?
      • Etc.
  12. Quantity!
  13. Yohimbine on ChemSpider
  14. Downsides of Overall Approach
      • Meshing data together based on InChIs worked for simple molecules
      • 2D layout errors inherited or limited by algorithm
      • Complex molecules that are meant to be the same thing were NOT deduplicated. Compounds differing by one stereocenter, named the same, meant to be the same, are not the same
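
Slide 14's point about compounds that differ by one stereocenter can be made concrete with the InChI layers themselves. The sketch below is an illustration only, not ChemSpider's method: it removes the stereo layers (`/b`, `/t`, `/m`, `/s`) from two identifiers and flags pairs that become identical once stereochemistry is ignored. The function names are hypothetical, and the alanine strings in the usage note are a familiar example that does not appear in the talk.

```python
# Layer prefixes that carry stereochemistry in an InChI string:
# /b = double-bond stereo, /t /m /s = tetrahedral stereo.
STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo_layers(inchi: str) -> str:
    """Return the InChI with every stereo layer removed (hypothetical helper)."""
    head, *layers = inchi.split("/")
    # The formula layer starts with an uppercase element symbol, so it is
    # never mistaken for one of the lowercase stereo prefixes.
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join([head, *kept])


def differ_only_in_stereo(inchi_a: str, inchi_b: str) -> bool:
    """True when two distinct InChIs match after stereo layers are dropped."""
    return inchi_a != inchi_b and strip_stereo_layers(inchi_a) == strip_stereo_layers(inchi_b)
```

For example, the InChIs of L-alanine (`…/t2-/m0/s1`) and D-alanine (`…/t2-/m1/s1`) are different strings, so a strict InChI match keeps them apart, while `differ_only_in_stereo` would flag the pair as candidates for curator review.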
  15. Yohimbine on ChemSpider… Quality?
  16. So where can we travel???
  17. So where can we travel???
  18. InChI String Search via Google
      Give me InChIKeys…
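
Searching the web by InChIKey, as slide 18 suggests, comes down to passing the key to a search engine as an exact phrase. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the pattern checks only the shape of a key (14 letters, 10 letters, 1 letter, separated by hyphens) and not its hash, and the Google URL is simply the public search endpoint with a quoted query.

```python
import re
from urllib.parse import urlencode

# Shape of an InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey(text: str) -> bool:
    """Check that the text has the layout of an InChIKey (shape only)."""
    return INCHIKEY_PATTERN.fullmatch(text) is not None


def google_search_url(inchikey: str) -> str:
    """Build an exact-phrase web search URL for an InChIKey (hypothetical helper)."""
    if not is_inchikey(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    # Quoting the key asks the search engine for an exact match.
    return "https://www.google.com/search?" + urlencode({"q": f'"{inchikey}"'})
```

Because the key is a fixed-length string of plain letters and hyphens, it survives indexing far better than a full InChI string, which is why the slide asks for InChIKeys.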
  19. And where can we travel???
  20. ChemSpider, BRENDA, Wikipedia, ChEMBL, ChEBI, DrugBank
  21. Aggregator, Enzymes, Encyclopedia, Pharmacology, Curated Chemicals, Drug-Drug Target
  22. How do we build it?
      • We deal in Molfiles or SDF files, with coordinates
      • Deposit anything that has an InChI: we support what InChI can handle, good and bad
      • Standardization based on “InChI standardization”
      • InChIs aggregate (certain) tautomers
      • We link out to external sites using their IDs
  23. Downsides of InChI
      • InChI was a moving target (multiple versions) but overall worked as planned.
      • Good for small molecules, but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules”
      • InChI used as the “deduplicator”: the FIRST version of a compound into the database becomes THE structure to deduplicate against…
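
The deposition policy on slides 22 and 23 (InChI as the deduplication key, the first deposited structure kept, external IDs retained for link-outs) can be sketched as a small in-memory registry. This is an assumption-laden illustration, not the ChemSpider implementation: `CompoundRecord`, `Registry`, `deposit`, and the source names and IDs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CompoundRecord:
    inchi: str
    molfile: str  # structure from the FIRST deposition, never overwritten
    links: Dict[str, Set[str]] = field(default_factory=dict)  # source -> external IDs


class Registry:
    """Hypothetical registry that deduplicates depositions by InChI."""

    def __init__(self) -> None:
        self._by_inchi: Dict[str, CompoundRecord] = {}

    def deposit(self, inchi: str, molfile: str, source: str, external_id: str) -> CompoundRecord:
        record = self._by_inchi.get(inchi)
        if record is None:
            # First deposition wins: its molfile becomes the reference structure.
            record = CompoundRecord(inchi=inchi, molfile=molfile)
            self._by_inchi[inchi] = record
        # Later depositions only add a link-out; their molfile is ignored.
        record.links.setdefault(source, set()).add(external_id)
        return record

    def __len__(self) -> int:
        return len(self._by_inchi)
```

The design makes the downside on slide 23 visible: if the first depositor's molfile has a poor 2D layout, every later deposition with the same InChI inherits it.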
  24. Side Effects of InChI Usage
  25. SMILES by comparison…
  26. Side Effects of InChI Usage
  27. Standardization Issues
      Depiction based on molfile
  28. Standardize
      Use the SRS as a guidance document for standardization
      Adjust as necessary to our needs
  29. Nitro groups
  30. Salt and Ionic Bonds
  31. Ammonium salts
  32. CVSP
  33. NPC Browser Set
  34. Checks include InChI
      • Many SDF files contain InChIs and SMILES. Comparing the structure contained within the file with the associated InChI is useful and turned up a number of errors when checking online databases
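
The check on slide 34 can be sketched as follows: read each SDF record, take the InChI stored in its data block, and compare it with an InChI regenerated from the molblock. Everything here is a hedged illustration. The function names are hypothetical, the data field is assumed to be tagged `InChI` (files vary), and `compute_inchi` is a caller-supplied function, for instance a wrapper around whatever cheminformatics toolkit is available; the sketch does not generate InChIs itself.

```python
import re
from typing import Callable, Dict, Iterator, List, Tuple


def _parse_record(lines: List[str]) -> Tuple[str, Dict[str, str]]:
    """Split one SDF record into its molblock and its data fields."""
    end = next((i for i, l in enumerate(lines) if l.startswith("M  END")), len(lines) - 1)
    molblock = "\n".join(lines[: end + 1])
    fields: Dict[str, List[str]] = {}
    current = None
    for line in lines[end + 1:]:
        header = re.match(r">.*<([^>]+)>", line)  # e.g. "> <InChI>"
        if header:
            current = header.group(1)
            fields[current] = []
        elif not line.strip():
            current = None  # a blank line closes the current data item
        elif current is not None:
            fields[current].append(line.strip())
    return molblock, {name: "\n".join(value) for name, value in fields.items()}


def iter_sdf_records(sdf_text: str) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (molblock, data_fields) for each "$$$$"-terminated record."""
    lines: List[str] = []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            if lines:
                yield _parse_record(lines)
            lines = []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        yield _parse_record(lines)


def find_inchi_mismatches(
    sdf_text: str,
    compute_inchi: Callable[[str], str],  # assumed: molblock -> InChI string
    field_name: str = "InChI",            # assumed tag name; varies by file
) -> List[Tuple[int, str, str]]:
    """Return (record index, stated InChI, computed InChI) for each disagreement."""
    mismatches = []
    for index, (molblock, fields) in enumerate(iter_sdf_records(sdf_text)):
        stated = fields.get(field_name)
        if stated is not None:
            computed = compute_inchi(molblock)
            if computed != stated:
                mismatches.append((index, stated, computed))
    return mismatches
```

Records without the field are skipped rather than reported, so the output lists only genuine disagreements between a file's structure and its stated identifier.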
  35. So, I’m writing an article…
  36. With these… I will lose data ☹
  37. But linking with InChI…
  38. Structure Searching the Web
  39. Data in Publications
      • This is not new, you know the story…
      • So much data of value is contained within a publication and delivered in a PDF form
      • PDF files, and unclear licensing/copyright, limit access to data so I can rework, reuse, repurpose, text mine etc.
      • “I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with the capabilities I need, and the publishers should just do it”
  40. “Data enable” publications?
      • We would LOVE to bring data out of our archive
      • What could we do?
      • Find chemical names and generate structures
      • Find chemical images and generate structures
      • Find reactions and make a database!
      • Find data (MP, BP, LogP) and host. Build models!
      • Find figures and database them
      • Find spectra (and link to structures)
      • Validate the data algorithmically
  41. RSC Archive: since 1841
  42. Text Mining
      The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
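
One small piece of the text-mining task on this slide is picking out reagents that are followed by an amount in parentheses, such as "thionyl chloride ( 5 ml )". The sketch below is a deliberately naive, regex-only illustration and not the approach used on the RSC archive; `extract_amounts`, the unit list, and the stripping of a leading "and" are all assumptions made for this example.

```python
import re
from typing import List, Tuple

# A run of letters, spaces and hyphens, followed by "( <number> <unit> )".
AMOUNT_PATTERN = re.compile(
    r"([A-Za-z][A-Za-z \-]*?)\s*\(\s*(\d+(?:\.\d+)?)\s*(ml|mL|g|mg|mmol|mol)\s*\)"
)


def extract_amounts(text: str) -> List[Tuple[str, float, str]]:
    """Return (name, value, unit) for each parenthesised amount found in text."""
    results = []
    for name, value, unit in AMOUNT_PATTERN.findall(text):
        # Drop a leading conjunction picked up from "... and benzene ( 50 ml )".
        name = re.sub(r"^(?:and|or)\s+", "", name.strip())
        results.append((name, float(value), unit))
    return results
```

Applied to the fragment "thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel", this yields two entries, one per reagent. Systematic names containing digits, brackets or Greek letters fall outside the pattern, which is one reason real chemical text mining relies on dedicated name recognition and name-to-structure conversion rather than simple patterns.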
  43. Text Mining
      The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  44. But names = structures
      • Systematic names can be generated FROM chemical structures algorithmically
  45. But names = structures
      • …and structures from systematic names
  46. But what of trivial names?
      • What about trivial names, trade names, CAS numbers, multilingual names etc.?
  47. Searching that lipid in patents
  48. Aspirin on ChemSpider
  49. Work in Progress
  50. Work in Progress
  51. Work in Progress
  52. Work in Progress
  53. But Context Gives Reactions
      The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  54. ChemSpider Reactions
  55. ChemSpider as a Foundation
      • >30 million chemicals (and growing)
      • ChemSpider is free to access for everyone, and the API means people program against it
      • What projects can we benefit?
  56. Support grant-based services
      • Multiple European consortium-based grants
      • PharmaSea (FP7 funded)
      • Open PHACTS (IMI funded)
      • UK National Chemical Database Service (http://cds.rsc.org): developing a data repository for lab data, integrating Electronic Lab Notebooks
      • Open Drug Discovery projects
  57. PharmaSea
  58. • 3-year Innovative Medicines Initiative project
      • Integrating chemistry and biology data using semantic web technologies
      • Open code, open data, open standards
      • Academics, Pharmas, Publishers…
      • To put medicines in the pipeline…
  59. Open PHACTS
  60. All Databases We Generate…
      • All databases and systems we build now include generated InChIs
      • InChIs are facilitating discoverability via searching on Google (see Chris’ talk) but also for querying and linking
  61. But we are still VERY LIMITED
      • RSC deals with way more than organics, inorganics, organometallics: we are building a data repository to include materials, polymers, ambiguous materials etc.
      • There are many plans for InChI moving forward: Markush, polymers, organometallics etc.
  62. The great promise should be obvious
      • InChIs are here to stay
      • They will evolve, they will encompass, we will adopt and adapt
      • Public and private databases will federate & build a linked environment of validated data!
      • Data validation and standardization is needed
      • Open Data will continue to proliferate
      • InChIs are in the “Semantic Web” already
  63. If InChI never existed…
      • ChemSpider would never have been built
      • Database linking would suffer dramatically
      • The web would not be “structure searchable”
      • Cheminformatics tools would likely not be linking to public domain databases in the same way
  64. Thank you
      Email: williamsa@rsc.org
      ORCID: 0000-0002-2668-4821
      Twitter: @ChemConnector
      Personal Blog: www.chemconnector.com
      SLIDES: www.slideshare.net/AntonyWilliams