How the InChI identifier is used to underpin our online chemistry databases at Royal Society of Chemistry

The Royal Society of Chemistry hosts a growing collection of online chemistry content. For much of our work the InChI identifier is an important component underpinning our projects. This enables the integration of chemical compounds with our archive of scientific publications, the delivery of a reaction database containing millions of reactions as well as a chemical validation and standardization platform developed to help improve the quality of structural representations on the internet. The InChI has been a fundamental part of each of our projects and has been pivotal in our support of international projects such as the Open PHACTS semantic web project integrating chemistry and biology data and the PharmaSea project focused on identifying novel chemical components from the ocean with the intention of identifying new antibiotics. This presentation will provide an overview of the importance of InChI in the development of many of our eScience platforms and how we have used it to provide integration across hundreds of websites and chemistry databases across the web. We will discuss how we are now expanding our efforts to develop a platform encompassing efforts in Open Source Drug Discovery and the support of data management for neglected diseases.

Published in: Science
  • The content of the 15th Edition was produced by the Editorial team and Merck, and the book was published by us earlier this year. It is available for purchase by both individuals and libraries.
    In addition to the book, we have produced an online version of The Merck Index, which is available solely through the Royal Society of Chemistry. This is available as a subscription, with a one-year free trial for individual purchasers of the book.
  • Compound list (rather than articles)
  • Transcript of "How the InChI identifier is used to underpin our online chemistry databases at Royal Society of Chemistry"

    1. How the InChI identifier is used to underpin our online chemistry databases at RSC
       Antony Williams, Valery Tkachenko and Ken Karapetyan
       ACS San Francisco, August 2014
    2. What can I say that I haven’t said?
    3. What can I say that I haven’t said?
    4. What can I say that I haven’t said?
       YouTube InChIKey Collision Movie
    5. What can I say that I haven’t said?
    6. InChIs are for machines but do have a human aspect…
    7. Many Names, One Structure
    8. Structure Identifiers
    9. OPSIN (chemical name to structure) http://opsin.ch.cam.ac.uk/
       • InChI support systems…
    10. InChI mapping helps a lot!
        • We wanted to map together chemical data on the web
        • We knew that chemical name mapping was difficult but dictionaries were useful
        • It is InChI that became the foundation technology for our database…
        • We accepted all the limitations of InChI
        • We lived with the “Useful but not ideal”
        • And so….
    11. • ~32 million chemicals and growing
        • Data sourced from >500 different sources
        • Crowd-sourced curation and annotation
        • Ongoing deposition of data from our journals and our collaborators
        • Structure-centric hub for web-searching
        • …and a really big dictionary!!!
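Slide 10's point, that InChI rather than chemical names became the key for mapping data together, can be illustrated with a minimal sketch. The two source lists, their record IDs and the field names below are hypothetical and are not ChemSpider's actual schema; the sketch only assumes each source can supply a Standard InChIKey per record.

```python
from collections import defaultdict

# Hypothetical records from two sources that name the same compounds differently.
SOURCE_A = [
    {"id": "A-1", "name": "aspirin", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"id": "A-2", "name": "ethanol", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
]
SOURCE_B = [
    {"id": "B-7", "name": "acetylsalicylic acid", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"id": "B-9", "name": "ethyl alcohol", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
]


def link_by_inchikey(*sources):
    """Group records from all sources under the InChIKey they share."""
    linked = defaultdict(lambda: {"names": set(), "ids": []})
    for records in sources:
        for rec in records:
            entry = linked[rec["inchikey"]]
            entry["names"].add(rec["name"])   # many names ...
            entry["ids"].append(rec["id"])    # ... one structure
    return dict(linked)


linked = link_by_inchikey(SOURCE_A, SOURCE_B)
for key, entry in linked.items():
    print(key, sorted(entry["names"]), entry["ids"])
```

The names never have to agree: the two records for each compound meet only because the structure-derived key is identical, which is also how the collected names become the "really big dictionary" mentioned on slide 11.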
    12. ChemSpider
    13. So where can we travel???
    14. InChI String Search via Google
        So give me InChIKeys…
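Slide 14's preference for InChIKeys in web search comes from their fixed shape: 27 characters in three hyphen-separated blocks of 14, 10 and 1 uppercase letters, which a search engine can treat as a single token. The sketch below only checks that shape and builds a quoted search query. The helper names are hypothetical, and the check does not confirm that a key corresponds to any real structure.

```python
import re
from urllib.parse import quote_plus

# 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 characters in total).
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text):
    """Return True if the text has the shape of an InChIKey."""
    return INCHIKEY_RE.fullmatch(text.strip()) is not None


def web_search_url(inchikey):
    """Build a search URL that queries for the exact, quoted InChIKey."""
    if not looks_like_inchikey(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return "https://www.google.com/search?q=" + quote_plus(f'"{inchikey.strip()}"')


print(web_search_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```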
    15. And where can we travel???
    16. And where can we travel???
    17. And where can we travel???
    18. And where can we travel???
    19. NEW 15th Edition
        *The name THE MERCK INDEX is owned by Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc., Whitehouse Station, N.J., U.S.A., and is licensed to The Royal Society of Chemistry for use in the U.S.A. and Canada.
        Where else is RSC using InChIs?
    20. Text Mining
        The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
    21. Text Mining
        The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
    22. SO MANY reactions!
    23. Extracting our Archive
        • What could we get from our archive?
        • Find chemical names and generate structures
        • Find chemical images and generate structures
        • Find reactions
        • Find data (MP, BP, LogP) and deposit
        • Find figures and database them
        • Find spectra (and link to structures)
        • And of course InChIfy the entire collection
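The "Find data (MP, BP, LogP)" item on slide 23 can be illustrated with a very small pattern-based sketch for melting points. The regular expression and the sample sentence are assumptions made for illustration; the slides do not describe the actual extraction method, and real article text would need far more patterns and normalization than this.

```python
import re

# Matches e.g. "mp 135-136 °C" or "melting point of 72.5 °C" (illustrative only).
MP_RE = re.compile(
    r"\b(?:m\.?p\.?|melting point)\D{0,15}?"
    r"(\d+(?:\.\d+)?)"                    # single value or low end of a range
    r"(?:\s*[-–]\s*(\d+(?:\.\d+)?))?"     # optional high end of a range
    r"\s*°\s*C",
    re.IGNORECASE,
)


def find_melting_points(text):
    """Return (low, high) tuples in °C; high is None for single values."""
    results = []
    for match in MP_RE.finditer(text):
        low = float(match.group(1))
        high = float(match.group(2)) if match.group(2) else None
        results.append((low, high))
    return results


# Hypothetical sentence in the style of an experimental section.
sample = (
    "The product was obtained as a white solid, mp 135-136 °C. "
    "Compound 4 had a melting point of 72.5 °C."
)
melting_points = find_melting_points(sample)
print(melting_points)
```

Each extracted value would still need to be attached to the right compound, for example through the InChI generated from the chemical name found nearby, before it could be deposited.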
    24. After we mine the Archive
    25. Models published from data
    26. Text-mining Data to compare
    27. Progress to date
        • We have text-mined all 21st century articles… >100k articles from 2000-2013
        • Marked up with XML and published onto the HTML forms of the articles
        • Required multiple iterations based on dictionaries, markup, text mining iterations
        • New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!
    28. MedChemComm markup
    29. MedChemComm markup
    30. MedChemComm markup
    31. InChIs under our “repository”
        • Scientific publications are a summary of work
        • Is all work reported?
        • How much science is lost to pruning?
        • What of value sits in notebooks and is lost?
        • Publications offering access to “real data”?
        • How much data is lost?
        • How many compounds never reported?
        • How many syntheses fail or succeed?
        • How many characterization measurements?
    32. New Repository Architecture
        doi: 10.1007/s10822-014-9784-5
    33. What are we building?
        • We are building the “RSC Data Repository”
        • Containers for compounds, reactions, analytical data, tabular data
        • Algorithms for data validation and standardization
        • Flexible indexing and search technologies
        • A platform for modeling data and hosting existing models and predictive algorithms
    34. New Repository Architecture
        • Data tier: Compounds, Reactions, Spectra, Materials, Documents
        • Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API
        • User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets
        • User interface tier (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Paid 3rd party integrations (various platforms – SharePoint, Google, etc), Chemical Inventory application
    35. Deposition of Data
    36. Compounds
    37. Reactions
    38. Analytical data
    39. Crystallography data
    40. InChIs under the repository
        • All compound-based data handling will of course connect with InChIs
        • Compounds
        • Reactions
        • Compound-spectra matching
        • Etc. etc. etc…
    41. For Deposition of Data
        • Developing systems that provide feedback to users regarding data quality
        • Validate/standardize chemical compounds
        • Check for balanced reactions
        • Check spectral data
        • EXAMPLE Future work
          • Properties – compare experimental to predicted
          • Automated structure verification - NMR
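The "check for balanced reactions" item on slide 41 amounts to comparing atom counts on both sides of a reaction. The sketch below is one simple way to do that from molecular formulas. It is an assumption-laden illustration rather than the repository's implementation: it accepts only plain formulas such as C2H6O (no parentheses, charges or isotopes) and assumes every stoichiometric coefficient is 1. The example uses thionyl chloride chemistry like that in the text-mining passage, applied to ethanol to keep the formulas short.

```python
import re
from collections import Counter

ELEMENT_COUNT = re.compile(r"([A-Z][a-z]?)(\d*)")
PLAIN_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def atom_counts(formula):
    """Count atoms in a plain molecular formula such as 'C2H5Cl'."""
    if not PLAIN_FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = Counter()
    for element, number in ELEMENT_COUNT.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def is_balanced(reactants, products):
    """True if both sides contain the same number of each element."""
    left = sum((atom_counts(f) for f in reactants), Counter())
    right = sum((atom_counts(f) for f in products), Counter())
    return left == right


# Ethanol + thionyl chloride -> chloroethane + sulfur dioxide + hydrogen chloride
balanced = is_balanced(["C2H6O", "SOCl2"], ["C2H5Cl", "SO2", "HCl"])
# The same reaction with the HCl by-product omitted, as often happens in print.
unbalanced = is_balanced(["C2H6O", "SOCl2"], ["C2H5Cl", "SO2"])
print(balanced, unbalanced)
```

A deposition system could use a failed check like the second one to prompt the depositor about missing by-products rather than rejecting the record outright.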
    42. RSC Cheminformatics Projects
        • RSC as a provider of support for grant-based projects
        • Utilizing ChemSpider initially as a platform
        • Developing Chemical Registry Service
        • Utilizing core architecture and widgets to serve the projects
    43. The PharmaSea Website
    44. • ChemSpider IDs and InChIs/InChIKeys made open and available for linking
        • Exposed via the Open PHACTS RDF export
        • A structure ID standard to enable further linking across the semantic web of science
    45. InChIs and DDP
    46. Electronic Notebook Data
        • Development work integrating chemistry into the Southampton Labtrove notebook
        • Stoichiometry table development
        • Analytical data integration
        • “ChemTrove” includes chemistry widgets and InChI as an important data field
    47. Side Effects of InChI Usage
    48. SMILES by comparison…
    49. Side Effects of InChI Usage
    50. Standardization Issues
        Depiction based on molfile
    51. Standardize
        • Use the SRS as guidance for standardization
        • Adjust as necessary to our needs
    52. Nitro groups
    53. Salt and Ionic Bonds
    54. What needs to happen?
        • If we could validate
          • Catch errors in databases (and clean)
          • Proactively catch errors in publications/patents
          • Reduce junk in the ether – improve QUALITY!
        • If we standardized
          • Interlinking should improve
    55. Validate and Standardize
    56. CVSP Filtering
    57. CVSP Filtering of DrugBank
    58. DrugBank (ca. 6000 records)
        • 38 records with InChI not matching the structure, e.g. DB08521, DB08187
        • 24 records where names (IUPAC_NAME) did not match the structure, e.g. DB08346
        • 38 records with SMILES not matching the structure, e.g. DB08293
        • 53 records with unusual valence, e.g. DB01983 with boron(V)
    59. ChEMBL (1.3 million records)
        • 11,020 records with 4 bonds and zero charge, e.g. CHEMBL501101 or CHEMBL501973
        • 271 records with hypervalent oxygen (e.g. CHEMBL2219679), carbon (e.g. 1005895), boron, chlorine, iodine or phosphine
        • 6,177 records where direction of bond makes no sense, e.g. CHEMBL12760 and CHEMBL34704
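The valence findings on slides 58 and 59 (boron(V), hypervalent oxygen and carbon) suggest a rule-based check of each atom's total bond order against what its element and formal charge allow. The sketch below shows the general idea only. The valence table is a deliberately small, hypothetical one and not the CVSP rule set, and the atom tuples stand in for data that would normally be read from a molfile. The slide does not say which element the "4 bonds and zero charge" finding refers to, so the nitrogen entries here are purely illustrative.

```python
# (element, formal charge) -> maximum total bond order.
# Illustrative values only; a real rule set covers far more cases.
MAX_VALENCE = {
    ("B", 0): 3, ("B", -1): 4,
    ("C", 0): 4,
    ("N", 0): 3, ("N", 1): 4,
    ("O", 0): 2, ("O", 1): 3, ("O", -1): 1,
}


def valence_issues(atoms):
    """Flag atoms whose total bond order exceeds the table's limit.

    atoms: list of (element, formal_charge, total_bond_order) tuples.
    Atoms not covered by the table are skipped rather than flagged.
    """
    issues = []
    for index, (element, charge, bond_order) in enumerate(atoms):
        limit = MAX_VALENCE.get((element, charge))
        if limit is not None and bond_order > limit:
            issues.append(
                (index, f"{element} (charge {charge:+d}) has total bond order "
                        f"{bond_order}, limit {limit}")
            )
    return issues


# Hypothetical atoms: neutral N with 4 bonds, N+ with 4 bonds,
# neutral B with 5 bonds, neutral O with 2 bonds.
record = [("N", 0, 4), ("N", 1, 4), ("B", 0, 5), ("O", 0, 2)]
issues = valence_issues(record)
for index, message in issues:
    print(index, message)
```

Only the neutral four-bonded nitrogen and the five-bonded boron are reported; the charged nitrogen passes because the formal charge accounts for the extra bond.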
    60. ChemSpider Standardization
        • Entire ChemSpider database will be standardized using modified FDA rule set
        • Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, Names) will be regenerated
        • CLEAN’ed database to compounds repository
        • Standardization procedures automatically applied to all future depositions
    61. Recent Data (last week)
    62. Data Repositories and InChI
        • Internet Data, Commercial Software, Pre-competitive Data, Open Science, Open Data, Publishers, Educators, Open Databases, Chemical Vendors
        • Small organic molecules, Undefined materials, Organometallics, Nanomaterials, Polymers, Minerals, Particle bound, Links to Biologicals
    63. If InChI was not developed…
        • Database linking would suffer dramatically
        • The web would not be “structure searchable”
        • Cheminformatics tools would likely not be linking to public domain databases in the same way
        • We wouldn’t be here discussing….
        • And ChemSpider would not have been built
    64. Acknowledgments
        • The InChI team
        • The entire RSC cheminformatics team…
        • Daniel Lowe for the text mining work
        • Igor Tetko for OCHEM modeling
    65. Thank you
        Email: williamsa@rsc.org
        ORCID: 0000-0002-2668-4821
        Twitter: @ChemConnector
        Personal Blog: www.chemconnector.com
        SLIDES: www.slideshare.net/AntonyWilliams