Serving the medicinal chemistry community with Royal Society of Chemistry cheminformatics platforms

The Royal Society of Chemistry (RSC) is a major participant in providing access to chemistry-related data via the web. As an internationally renowned society for the chemical sciences, a scientific publisher and the host of the ChemSpider database for the community, RSC continues to make dramatic strides in providing online access to data. ChemSpider provides access to over 30 million chemicals sourced from over 500 data suppliers and linked out to related information on the web. The platform is a crowdsourcing environment in which members of the community can participate in validating and expanding the content of the database. Through a set of application programming interfaces, ChemSpider is used by various organizations and projects to serve up data for a range of purposes. These include structure identification for mass spectrometry instrument vendors, RSC databases such as the MarinLit natural products database, and a European grant-based project from the Innovative Medicines Initiative fund. This presentation will provide an overview of the cheminformatics activities and projects that RSC is involved with to serve the medicinal chemistry community. This will include the Open PHACTS semantic web project, the PharmaSea project to identify new pharmaceutical leads from the ocean, and the UK National Compound Collection to identify new lead compounds contained within PhD theses.

Published in: Science

  1. Serving the Medicinal Chemistry Community with RSC Cheminformatics Platforms. Antony Williams. Brazilian Medicinal Chemistry Conference, November 11th 2014
  2. www.slideshare.net/AntonyWilliams
  3. Chemistry for the Community • The Royal Society of Chemistry as a provider of chemistry for the community: • As a charity • As a scientific publisher • As a host of commercial databases • As a partner in grant-based projects • As the host of ChemSpider • New: the RSC Data Repository for Chemistry
  4. Overwhelmed with data…
  5. So much online data…
  6. Organizations releasing data
  7. Funders encourage openness
  8. We model data – then lose it • What if we could share models and the underlying data via a central repository? • This is MOSTLY not a technology issue!!!
  9. Pharma Companies Repeat Work. Pre-competitive Informatics: Pharma are all accessing, processing, storing & re-processing external research data. (Diagram labels: Literature, Genbank, PubChem, Patents, Databases, Downloads; Data Integration; Data Analysis; Firewalled Databases; repeat at each company)
  10. Publications lock up data
  11. When I finish this article…
  12. The data will be locked up..
  13. But what if we could navigate? What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?
  14. • ~30 million chemicals and growing • Data sourced from >500 different sources • Crowd sourced curation and annotation • Ongoing deposition of data from our journals and our collaborators • Structure centric hub for web-searching • It’s a really big dictionary!!!
  15. ChemSpider
  16. ChemSpider
  17. Experimental/Predicted Properties
  18. Literature references
  19. Google Books
  20. Patents references
  21. RSC Databases
  22. Vendors and data sources
  23. With structures and names…
  24. Name Searching
  25. Standards for Integration
  26. Structure Searching
  27. What did we learn??? • Data Quality is an enormous challenge • Crowd sourced annotation can help!
  28. Crowdsourced Enhancement • The community can clean and enhance the database by providing Feedback and direct curation • Tens of thousands of edits made
  29. But Software Can Help • SRS as guidance for standardization rules
  30. http://cvsp.chemspider.com
  31. ChemSpider is a building block
  32. …for grant-based services • Use ChemSpider data slices and API/Web services to support grant-based projects • Multiple European consortium-based grants • PharmaSea (FP7 funded) • Open PHACTS (IMI funded) • Support Open Drug Discovery projects
  33. Over half of all drugs introduced between 1940 and 2006 were of natural origin or inspired by natural compounds
  34. http://www.pharma-sea.eu/
  35. PharmaSea
  36. ..as a dereplication platform
  37. http://www.openphacts.org/
  38. • 3-year Innovative Medicines Initiative project • Integrating chemistry and biology data using semantic web technologies • Open code, open data, open standards • Academics, Pharma Companies, Publishers…
  39. The Open PHACTS community ecosystem
  40. Open PHACTS http://ops.rsc.org
  41. Chemistry Searching…
  42. But what about Biology???
  43. http://explorer.openphacts.org
  44. Open PHACTS Explorer
  45. Pharmacology by Compound
  46. Compounds and enzymes
  47. Compounds and enzymes
  48. Pharmacology by Target
  49. Facilitated by ChemSpider RDF
  50. Open Sourcing Data and Code • Open PHACTS data licensed as Open Data – approx. 2 Million chemicals • Open Source code to release to community (from Open PHACTS github site) • Chemical Registration Service • Chemical Validation and Standardization Platform
  51. ChemSpider as a “dictionary” • Systematic name(s) • Trivial Name(s) • SMILES • InChI Strings • InChIKeys • Database IDs • Registry Number
  52. Valium on ChemSpider. With strong dictionaries connections can be made…
  53. Semantic Mark-up of Articles
  54. Linking Names to Structures
  55. …and providing more links… MedChemComm
  56. Mark-up of MedChemComm
  57. Mark-up of MedChemComm
  58. Mark-up of MedChemComm
  59. Links out from MedChemComm
  60. Links out from MedChemComm
  61. What about old publications? • We would LOVE to bring data out of our archive • Find chemical names and generate structures • Find chemical images and generate structures • Find reactions – and make a database! • Find data (MP, BP, LogP) and host. Build models! • Find figures and database them • Find spectra (and link to structures) • Validate the data algorithmically
  62. RSC Archive – since 1841
  63. SO MANY reactions!
  64. Text Mining: The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  65. Text Mining: The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  66. But names = structures • Systematic names can be generated FROM chemical structures algorithmically
  67. ..and Context Gives Reactions: The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  68. ChemSpider Reactions
  69. Text spectra? • 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
  70. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
  71. Turn “Figures” Into Data (FIGURE, EXTRACTED)
  72. Models published from data
  73. Text-mining Data to compare. Presently extracting property data from Google Patents as test
  74. National Compound Collection • Unlock chemistry data in PhD theses • Discover novel molecules for biosciences • Working together with industry and the academic community • Build the data into RSC online platforms • Perform virtual screening/modeling and access physical samples to screen
  75. We should make sure thesis data are available in consumable formats – compounds, reactions, analytical data etc.
  76. What are we building? • We are building the “RSC Data Repository” • Containers for compounds, reactions, analytical data, tabular data • Algorithms for data validation and standardization • Flexible indexing and search technologies • A platform for modeling data and hosting existing models and predictive algorithms • Chemistry RESEARCH DATA MANAGEMENT
  77. Compounds
  78. Reactions
  79. Analytical data
  80. Crystallography data
  81. With data in hand maybe it’s time: What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?
  82. Conclusions • We are building platforms that can support multiple communities, including MedChem • We are working hard to extend our data and improve quality of online data • Opening APIs to the platforms and data provides a powerful building block • We are Open Sourcing components of our platforms to the community
  83. Thank you. Email: williamsa@rsc.org ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams
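
Slide 51 describes ChemSpider as a “dictionary” in which names, SMILES, InChI strings, database IDs and registry numbers all point to one structure. The sketch below is one possible way to picture that idea as an in-memory index; it is not ChemSpider's implementation or API. The class names, the `demo-0001` record ID and the single example entry are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class CompoundRecord:
    """One structure plus the identifiers that point to it (hypothetical schema)."""
    record_id: str  # made-up local ID, not a real ChemSpider ID
    smiles: str
    names: Set[str] = field(default_factory=set)
    registry_numbers: Set[str] = field(default_factory=set)


class ChemicalDictionary:
    """Maps any known name or identifier back to a single compound record."""

    def __init__(self) -> None:
        self._by_name: Dict[str, CompoundRecord] = {}
        self._by_identifier: Dict[str, CompoundRecord] = {}

    def add(self, record: CompoundRecord) -> None:
        # Names are matched case-insensitively.
        for name in record.names:
            self._by_name[name.strip().casefold()] = record
        # Identifiers are matched exactly: SMILES is case-sensitive.
        for identifier in record.registry_numbers | {record.smiles}:
            self._by_identifier[identifier] = record

    def lookup(self, text: str) -> Optional[CompoundRecord]:
        text = text.strip()
        return self._by_identifier.get(text) or self._by_name.get(text.casefold())


# Illustrative entry for the compound shown on slide 52 (Valium / diazepam).
diazepam = CompoundRecord(
    record_id="demo-0001",
    smiles="CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21",
    names={"Valium", "diazepam"},
    registry_numbers={"439-14-5"},
)
dictionary = ChemicalDictionary()
dictionary.add(diazepam)
```

Calling `dictionary.lookup("valium")` or `dictionary.lookup("439-14-5")` returns the same record, which is the sense in which a strong dictionary lets “connections be made” between a trivial name in an article and a structure.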
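
Slides 64 to 67 suggest that the context around chemical names in an experimental paragraph is enough to recover a reaction. As a very rough illustration of one small step, the sketch below pulls reagent names and amounts out of the tokenized sentence quoted on those slides. The regular expression and the function name `extract_amounts` are assumptions for this example; a production text-mining pipeline would rely on proper chemical named-entity recognition rather than a single pattern.

```python
import re
from typing import List, Tuple

# A reagent phrase introduced by a comma or "and", followed by "( <number> <unit> )".
AMOUNT = re.compile(
    r"(?:,|\band\b)\s*"                        # delimiter before the reagent name
    r"([a-z][a-z -]*?)\s*"                     # reagent name (lower-case words)
    r"\(\s*(\d+(?:\.\d+)?)\s*(ml|g|mg)\s*\)"   # amount and unit in parentheses
)


def extract_amounts(text: str) -> List[Tuple[str, float, str]]:
    """Return (reagent, amount, unit) triples found in the text."""
    return [
        (match.group(1), float(match.group(2)), match.group(3))
        for match in AMOUNT.finditer(text)
    ]


sentence = (
    "prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) "
    "were charged into a glass reaction vessel"
)
reagents = extract_amounts(sentence)
# reagents == [("thionyl chloride", 5.0, "ml"), ("benzene", 50.0, "ml")]
```

The pattern is deliberately narrow: it only recognizes lower-case reagent names with an explicit amount, so the long systematic names of the starting material and product would need a separate name-to-structure step.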
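
Slides 69 and 70 ask whether spectra reported as text can be turned into data. A minimal sketch of that idea, assuming the conventional "shift (multiplicity, nH, ...)" layout used in the 1H NMR line on slide 70, is shown below. The regular expression and the dictionary keys are choices made for this example, not a description of any RSC tool, and coupling constants and assignments are ignored.

```python
import re
from typing import Dict, List

# Matches "2.57 (m, 4H" or a range such as "7.18–7.94 (m, 11H".
PEAK = re.compile(r"(\d+\.\d+)(?:[–-](\d+\.\d+))?\s*\((\w+),\s*(\d+)H")


def parse_1h_nmr(text: str) -> List[Dict[str, object]]:
    """Extract chemical shift (or shift range), multiplicity and proton count."""
    peaks = []
    for low, high, multiplicity, n_h in PEAK.findall(text):
        peaks.append({
            "shift_min": float(low),
            "shift_max": float(high) if high else float(low),
            "multiplicity": multiplicity,
            "protons": int(n_h),
        })
    return peaks


spectrum_text = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
    "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
    "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)"
)
peaks = parse_1h_nmr(spectrum_text)
total_protons = sum(peak["protons"] for peak in peaks)
```

Once the peaks are structured like this, simple algorithmic checks become possible, for example comparing `total_protons` against the hydrogen count of the proposed structure.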