0
Current Initiatives in Developing
Research Data Repositories at
the Royal Society of Chemistry
Antony Williams
July 28th
2...
Chemistry for the Community
• The Royal Society of Chemistry as a
provider of chemistry for the community:
• As a charity
...
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowd sourced curation and annotation
• O...
ChemSpider
ChemSpider
Experimental/Predicted Properties
Literature references
Patents references
Books
Vendors and data sources
Aspirin on ChemSpider
Many Names, One Structure
What is the Structure of Vitamin K?
MeSH
• A lipid cofactor that is required for normal
blood clotting.
• Several forms of vitamin K have been
identified:
• V...
What is the Structure of Vitamin K?
The ultimate “dictionary”
• Search all forms of structure IDs
• Systematic name(s)
• Trivial Name(s)
• SMILES
• InChI Stri...
Linking Names to Structures
Semantic Mark-up of Articles
Crowdsourced “Annotations”
• Users can add
• Descriptions, Syntheses and Commentaries
• Links to PubMed articles
• Links t...
Crowdsourced Enhancement
• The community can clean and enhance the
database by providing Feedback and direct
curation
• Te...
ChemSpider
• ChemSpider allowed the community to
participate in linking the internet of
chemistry & crowdsourcing of data
...
ChemSpider Spectra
www.SpectralGame.com
http://www.jcheminf.com/content/1/1/9
Publications-summary of work
• Scientific publications are a summary of work
• Is all work reported?
• How much science is...
Deposition of Research Data
• If we manage compounds, syntheses and
analytical data…
• If we have security and provenance ...
What are we building?
• We are building the “RSC Data Repository”
• Containers for compounds, reactions, analytical
data, ...
Deposition of Data
Compounds
Reactions
Analytical data
Crystallography data
Deposition of Data
• Developing systems that provides
feedback to users regarding data quality
• Validate/standardize chem...
Can we get historical data?
• Text and data can be mined
• Spectra can be extracted and converted
• SO MUCH Open Source Co...
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thio...
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thio...
Text spectra?
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3),
30.11 (CH, benzylic methane), 30.77 (CH,
benzylic methane), 66.12...
1H NMR (CDCl3, 400 MHz):
δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz,
C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H...
Turn “Figures” Into Data
Make it interactive
SO MANY reactions!
Extracting our Archive
• What could we get from our archive?
• Find chemical names and generate structures
• Find chemical...
Models published from data
Text-mining Data to compare
Support grant-based services
• Multiple European consortium-based grants
• PharmaSea (FP7 funded)
• Open PHACTS (IMI funde...
• 3-year Innovative Medicines Initiative project
• Integrating chemistry and biology data using
semantic web technologies
...
The Open PHACTS community ecosystem
Open Source Drug Discovery India
The Drug Discovery Portal
Internet Data
The Future
Commercial Software
Pre-competitive Data
Open Science
Open Data
Publishers
Educators
Open Databas...
Current initiatives in developing research data repositories at the Royal Society of Chemistry
Upcoming SlideShare
Loading in...5
×

Current initiatives in developing research data repositories at the Royal Society of Chemistry

531

Published on

Access to scientific information has changed in a manner that was likely never even imagined by the early pioneers of the internet. The quantities of data, the array of tools available to search and analyze, the devices and the shift in community participation continues to expand while the pace of change does not appear to be slowing. RSC hosts a number of chemistry data resources for the community including ChemSpider, one of the community’s primary online public compound databases. Containing tens of millions of chemical compounds and its associated data ChemSpider serves data tens of thousands of chemists every day and it serves as the foundation for many important international projects to integrate chemistry and biology data, facilitate drug discovery efforts and help to identify new chemicals from under the ocean. This presentation will provide an overview of the expanding reach of this cheminformatics platform and the nature of the solutions that it helps to enable including structure validation and text mining and semantic markup. ChemSpider is limited in scope as a chemical compound database and we are presently architecting the RSC Data Repository, a platform that will enable us to extend our reach to include chemical reactions, analytical data, and diverse data depositions from chemists across various domains. We will also discuss the possibilities it offers in terms of supporting data modeling and sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community.

Published in: Science
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
531
On Slideshare
0
From Embeds
0
Number of Embeds
8
Actions
Shares
0
Downloads
6
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "Current initiatives in developing research data repositories at the Royal Society of Chemistry"

  1. 1. Current Initiatives in Developing Research Data Repositories at the Royal Society of Chemistry Antony Williams July 28th 2014, FDA
  2. 2. Chemistry for the Community • The Royal Society of Chemistry as a provider of chemistry for the community: • As a charity • As a scientific publisher • As a host of commercial databases • As a partner in grant-based projects • As the host of ChemSpider • And now in development : the RSC Data Repository for Chemistry
  3. 3. • ~30 million chemicals and growing • Data sourced from >500 different sources • Crowd sourced curation and annotation • Ongoing deposition of data from our journals and our collaborators • Structure centric hub for web-searching • …and a really big dictionary!!!
  4. 4. ChemSpider
  5. 5. ChemSpider
  6. 6. Experimental/Predicted Properties
  7. 7. Literature references
  8. 8. Patents references
  9. 9. Books
  10. 10. Vendors and data sources
  11. 11. Aspirin on ChemSpider
  12. 12. Many Names, One Structure
  13. 13. What is the Structure of Vitamin K?
  14. 14. MeSH • A lipid cofactor that is required for normal blood clotting. • Several forms of vitamin K have been identified: • VITAMIN K 1 (phytomenadione) derived from plants, • VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, • VITAMIN K 3 (menadione).
  15. 15. What is the Structure of Vitamin K?
  16. 16. The ultimate “dictionary” • Search all forms of structure IDs • Systematic name(s) • Trivial Name(s) • SMILES • InChI Strings • InChIKeys • Database IDs • Registry Number
  17. 17. Linking Names to Structures
  18. 18. Semantic Mark-up of Articles
  19. 19. Crowdsourced “Annotations” • Users can add • Descriptions, Syntheses and Commentaries • Links to PubMed articles • Links to articles via DOIs • Add spectral data • Add Crystallographic Information Files • Add photos • Add MP3 files • Add Videos
  20. 20. Crowdsourced Enhancement • The community can clean and enhance the database by providing Feedback and direct curation • Tens of thousands of edits made
  21. 21. ChemSpider • ChemSpider allowed the community to participate in linking the internet of chemistry & crowdsourcing of data • Successful experiment in terms of building a central hub for integrated web search • More people are “users” than “contributors” • Yet basic feedback and game-play helps
  22. 22. ChemSpider Spectra
  23. 23. www.SpectralGame.com http://www.jcheminf.com/content/1/1/9
  24. 24. Publications-summary of work • Scientific publications are a summary of work • Is all work reported? • How much science is lost to pruning? • What of value sits in notebooks and is lost? • Publications offering access to “real data”? • How much data is lost? • How many compounds never reported? • How many syntheses fail or succeed? • How many characterization measurements?
  25. 25. Deposition of Research Data • If we manage compounds, syntheses and analytical data… • If we have security and provenance of data… • If we deliver user interfaces to satisfy the various use cases… • Then we have delivered electronic lab notebooks for chemistry laboratories. ELNs are research data repositories
  26. 26. What are we building? • We are building the “RSC Data Repository” • Containers for compounds, reactions, analytical data, tabular data • Algorithms for data validation and standardization • Flexible indexing and search technologies • A platform for modeling data and hosting existing models and predictive algorithms
  27. 27. Deposition of Data
  28. 28. Compounds
  29. 29. Reactions
  30. 30. Analytical data
  31. 31. Crystallography data
  32. 32. Deposition of Data • Developing systems that provides feedback to users regarding data quality • Validate/standardize chemical compounds • Check for balanced reactions • Checks spectral data • EXAMPLE Future work • Properties – compare experimental to pred. • Automated structure verification - NMR
  33. 33. Can we get historical data? • Text and data can be mined • Spectra can be extracted and converted • SO MUCH Open Source Code available
  34. 34. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4- thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N- methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  35. 35. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4- thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N- methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  36. 36. Text spectra? 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
  37. 37. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
  38. 38. Turn “Figures” Into Data
  39. 39. Make it interactive
  40. 40. SO MANY reactions!
  41. 41. Extracting our Archive • What could we get from our archive? • Find chemical names and generate structures • Find chemical images and generate structures • Find reactions • Find data (MP, BP, LogP) and deposit • Find figures and database them • Find spectra (and link to structures)
  42. 42. Models published from data
  43. 43. Text-mining Data to compare
  44. 44. Support grant-based services • Multiple European consortium-based grants • PharmaSea (FP7 funded) • Open PHACTS (IMI funded) • UK National Chemical Database Service ( http://cds.rsc.org) – developing data repository for lab data, integrate Electronic Lab Notebooks • Open Drug Discovery projects
  45. 45. • 3-year Innovative Medicines Initiative project • Integrating chemistry and biology data using semantic web technologies • Open code, open data, open standards • Academics, Pharmas, Publishers… • To put medicines in the pipeline…
  46. 46. The Open PHACTS community ecosystem
  47. 47. Open Source Drug Discovery India
  48. 48. The Drug Discovery Portal
  49. 49. Internet Data The Future Commercial Software Pre-competitive Data Open Science Open Data Publishers Educators Open Databases Chemical Vendors Small organic molecules Undefined materials Organometallics Nanomaterials Polymers Minerals Particle bound Links to Biologicals
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×