Experiences in Hosting Big Chemistry Data Collections for the Community
Antony Williams
July 30th 2014, NIST

Access to scientific information has changed dramatically as a result of the web and its underpinning technologies. The quantities of data, the array of tools available to search and analyze them, the devices, and the shift in community participation continue to expand, and the pace of change does not appear to be slowing. RSC hosts a number of chemistry data resources for the community, including ChemSpider, one of the community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day. The platform supports crowdsourcing, enabling the community to deposit and curate data. This presentation will provide an overview of the expanding reach of this cheminformatics platform and the nature of the solutions that it helps to enable, including structure validation, text mining, and semantic markup. ChemSpider is limited in scope as a chemical compound database, and we are presently architecting the RSC Data Repository, a platform that will enable us to extend our reach to include chemical reactions, analytical data, and diverse data depositions from chemists across various domains. We will also discuss the possibilities it offers in terms of supporting data modeling and sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community.

Published in: Science
Transcript of "Experiences in Hosting Big Chemistry Data Collections for the Community"

  1. Experiences in Hosting Big Chemistry Data Collections for the Community
     Antony Williams
     July 30th 2014, NIST
  2. Overview of Our Activities
     • The Royal Society of Chemistry as a provider of chemistry for the community:
       • As a charity
       • As a scientific publisher
       • As a host of commercial databases
       • As a partner in grant-based projects
       • As the host of ChemSpider
       • And now in development: the RSC Data Repository for Chemistry
  3. • ~30 million chemicals and growing
     • Data sourced from >500 different sources
     • Crowd sourced curation and annotation
     • Ongoing deposition of data from our journals and our collaborators
     • Structure centric hub for web-searching
     • …and a really big dictionary!!!
  4. ChemSpider
  5. ChemSpider
  6. ChemSpider
  7. Experimental/Predicted Properties
  8. Literature references
  9. Patents references
  10. RSC Books
  11. Google Books
  12. Vendors and data sources
  13. Crowdsourced “Annotations”
      • Users can add
      • Descriptions, Syntheses and Commentaries
      • Links to PubMed articles
      • Links to articles via DOIs
      • Add spectral data
      • Add Crystallographic Information Files
      • Add photos
      • Add MP3 files
      • Add Videos
  14. APIs
  15. APIs
  16. WebBook and ChemSpider
  17. WebBook and ChemSpider
  18. WebBook and ChemSpider
  19. WebBook and ChemSpider
  20. WebBook and ChemSpider
  21. Javascript viewer NMR, MS, IR
  22. Aspirin on ChemSpider
  23. Many Names, One Structure
  24. What is the Structure of Vitamin K?
  25. MeSH
      • A lipid cofactor that is required for normal blood clotting.
      • Several forms of vitamin K have been identified:
        • VITAMIN K 1 (phytomenadione) derived from plants,
        • VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins,
        • VITAMIN K 3 (menadione).
  26. What is the Structure of Vitamin K?
  27. The ultimate “dictionary”
      • Search all forms of structure IDs
        • Systematic name(s)
        • Trivial Name(s)
        • SMILES
        • InChI Strings
        • InChIKeys
        • Database IDs
        • Registry Number
  28. Linking Names to Structures
  29. Semantic Mark-up of Articles
  30. Data Quality Issues
      Williams and Ekins, DDT, 16: 747-750 (2011)
      Science Translational Medicine 2011
  31. Data quality is a known issue
  32. Standardize
      • Use the SRS as a guidance document for standardization
      • Adjust as necessary to our needs
  33. Nitro groups
  34. Salt and Ionic Bonds
  35. Ammonium salts
  36. CVSP Filtering and Flagging
  37. Openness and Quality Issues
      Williams and Ekins, DDT, 16: 747-750 (2011)
      Science Translational Medicine 2011
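  38. Stereochemistry quality by substructure

      | Substructure  | # of Hits | # of Correct Hits | No stereochemistry | Incomplete Stereochemistry | Complete but incorrect stereochemistry |
      |---------------|-----------|-------------------|--------------------|----------------------------|----------------------------------------|
      | Gonane        | 34        | 5                 | 8                  | 21                         | 0                                      |
      | Gon-4-ene     | 55        | 12                | 3                  | 33                         | 7                                      |
      | Gon-1,4-diene | 60        | 17                | 10                 | 23                         | 10                                     |
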
  39. Crowdsourced Enhancement
      • The community can clean and enhance the database by providing Feedback and direct curation
      • Tens of thousands of edits made
  40. Data Quality is Work
      • Cholesterol
      • Taxol
  41. Maybe we can help?
      • Is there an interest in data checking the WebBook or other NIST data sources?
  42. Publications-summary of work
      • Scientific publications are a summary of work
      • Is all work reported?
      • How much science is lost to pruning?
      • What of value sits in notebooks and is lost?
      • Publications offering access to “real data”?
      • How much data is lost?
      • How many compounds never reported?
      • How many syntheses fail or succeed?
      • How many characterization measurements?
  43. What are we building?
      • We are building the “RSC Data Repository”
      • Containers for compounds, reactions, analytical data, tabular data
      • Algorithms for data validation and standardization
      • Flexible indexing and search technologies
      • A platform for modeling data and hosting existing models and predictive algorithms
  44. Deposition of Data
  45. Compounds
  46. Reactions
  47. Analytical data
  48. Crystallography data
  49. Can we get historical data?
      • Text and data can be mined
      • Spectra can be extracted and converted
      • SO MUCH Open Source Code available
  50. Text Mining
      The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  51. Text Mining
      The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  52. Text spectra?
      13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
  53. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
  54. Turn “Figures” Into Data
  55. Make it interactive
  56. SO MANY reactions!
  57. Extracting our Archive
      • What could we get from our archive?
        • Find chemical names and generate structures
        • Find chemical images and generate structures
        • Find reactions
        • Find data (MP, BP, LogP) and deposit
        • Find figures and database them
        • Find spectra (and link to structures)
  58. Models published from data
  59. Text-mining Data to compare
  60. How is DERA going?
      • We have text-mined all 21st century articles… >100k articles from 2000-2013
      • Marked up with XML and published onto the HTML forms of the articles
      • Required multiple iterations based on dictionaries, markup, text mining iterations
      • New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!
  61. Work in Progress
  62. Work in Progress
  63. Work in Progress
  64. Work in Progress
  65. Is It Easy?
      (Workflow diagram; labels in the order extracted)
      • Dictionary (ontologies)
      • RSC ontologies (methods, reactions)
      • Dictionary (chemistry)
      • Text-mining
      • Curated dictionaries for known names
      • ACD N2S
      • OPSIN
      • Unknown names: automated name to structure conversion
      • XML ready for publication
      • Marked-up XML
      • Production processes
      • CDX integration (coming soon)
      • Chemical structures SD file
  66. Acknowledgments
      • Regarding InChI – Steve Stein, Steve Heller, Dmitrii Tchekhovskoi, Igor Pletnev
  67. Email: williamsa@rsc.org
      ORCID: 0000-0002-2668-4821
      Twitter: @ChemConnector
      Personal Blog: www.chemconnector.com
      SLIDES: www.slideshare.net/AntonyWilliams
      Thank you
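
Slides 52 and 53 ask whether spectra reported as text can be turned back into data. The following is a minimal sketch of that idea in Python, not the RSC text-mining pipeline: the regular expressions and the function name `parse_nmr_text` are assumptions made for illustration, and the input is the 13C peak list quoted on slide 52. It reads the nucleus, solvent, and spectrometer frequency from the header, drops the parenthesised assignments, and collects the chemical shifts.

```python
import re

# Peak list quoted verbatim from slide 52.
SLIDE_52_TEXT = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, "
    "120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), "
    "99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)"
)

# Hypothetical patterns, tuned only to the flat format shown on slide 52.
HEADER = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?) NMR "
    r"\((?P<solvent>[^,]+), (?P<mhz>\d+(?:\.\d+)?) MHz\)"
)
ANNOTATION = re.compile(r"\([^()]*\)")  # e.g. "(CH3)" or "(ArCH)"
SHIFT = re.compile(r"\d+\.\d+")         # chemical shifts such as "14.12"


def parse_nmr_text(text):
    """Extract the header fields and chemical shifts from a text peak list."""
    header = HEADER.search(text)
    if header is None:
        raise ValueError("no NMR header found")
    # Everything after the header is the peak list; remove the assignments
    # so digits inside them (the 3 in "CH3") cannot be mistaken for shifts.
    body = ANNOTATION.sub("", text[header.end():])
    return {
        "nucleus": header.group("nucleus"),
        "solvent": header.group("solvent"),
        "frequency_mhz": float(header.group("mhz")),
        "shifts_ppm": [float(value) for value in SHIFT.findall(body)],
    }


spectrum = parse_nmr_text(SLIDE_52_TEXT)
print(spectrum["nucleus"], spectrum["solvent"], spectrum["frequency_mhz"])
print(len(spectrum["shifts_ppm"]), "shifts extracted")
```

The sketch covers only the flat 13C listing. A 1H listing like the one on slide 53 carries multiplicities, integrals, and coupling constants inside nested parentheses, so it would need a richer grammar than these patterns.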