Experiences in Hosting Big Chemistry Data Collections for the Community


Access to scientific information has changed dramatically as a result of the web and its underpinning technologies. The quantities of data, the array of tools available to search and analyze, the devices, and the shift in community participation continue to expand, and the pace of change does not appear to be slowing. RSC hosts a number of chemistry data resources for the community, including ChemSpider, one of the community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day. The platform supports crowdsourcing, enabling the community to deposit and curate data. This presentation will provide an overview of the expanding reach of this cheminformatics platform and the nature of the solutions that it helps to enable, including structure validation, text mining, and semantic markup. ChemSpider is limited in scope as a chemical compound database, and we are presently architecting the RSC Data Repository, a platform that will enable us to extend our reach to include chemical reactions, analytical data, and diverse data depositions from chemists across various domains. We will also discuss the possibilities it offers in terms of supporting data modeling and sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community.


Transcript

  • 1. Experiences in Hosting Big Chemistry Data Collections for the Community Antony Williams July 30th 2014, NIST
  • 2. Overview of Our Activities • The Royal Society of Chemistry as a provider of chemistry for the community: • As a charity • As a scientific publisher • As a host of commercial databases • As a partner in grant-based projects • As the host of ChemSpider • And now in development: the RSC Data Repository for Chemistry
  • 3. • ~30 million chemicals and growing • Data sourced from >500 different sources • Crowd sourced curation and annotation • Ongoing deposition of data from our journals and our collaborators • Structure centric hub for web-searching • …and a really big dictionary!!!
  • 4. ChemSpider
  • 5. ChemSpider
  • 6. ChemSpider
  • 7. Experimental/Predicted Properties
  • 8. Literature references
  • 9. Patent references
  • 10. RSC Books
  • 11. Google Books
  • 12. Vendors and data sources
  • 13. Crowdsourced “Annotations” • Users can add • Descriptions, Syntheses and Commentaries • Links to PubMed articles • Links to articles via DOIs • Add spectral data • Add Crystallographic Information Files • Add photos • Add MP3 files • Add Videos
  • 14. APIs
  • 15. APIs
  • 16. WebBook and ChemSpider
  • 17. WebBook and ChemSpider
  • 18. WebBook and ChemSpider
  • 19. WebBook and ChemSpider
  • 20. WebBook and ChemSpider
  • 21. Javascript viewer NMR, MS, IR
  • 22. Aspirin on ChemSpider
  • 23. Many Names, One Structure
  • 24. What is the Structure of Vitamin K?
  • 25. MeSH • A lipid cofactor that is required for normal blood clotting. • Several forms of vitamin K have been identified: • VITAMIN K 1 (phytomenadione) derived from plants, • VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, • VITAMIN K 3 (menadione).
  • 26. What is the Structure of Vitamin K?
  • 27. The ultimate “dictionary” • Search all forms of structure IDs • Systematic name(s) • Trivial Name(s) • SMILES • InChI Strings • InChIKeys • Database IDs • Registry Number
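    The identifier types listed on slide 27 can often be told apart by their syntax alone. The following is a minimal sketch of that idea, not ChemSpider's actual lookup logic: the function names are hypothetical, names and SMILES are lumped together because separating them needs a chemistry toolkit, and the only checks used are the "InChI=" prefix, the 14-10-1 uppercase block layout of an InChIKey, and the check digit of a CAS Registry Number.

```python
import re

# InChIKey layout: 14 letters, 10 letters, 1 letter, separated by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_checksum_ok(cas: str) -> bool:
    """Return True if the string is a well-formed CAS RN with a valid check digit."""
    m = CAS_RE.match(cas)
    if not m:
        return False
    digits = m.group(1) + m.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(m.group(3))


def classify_identifier(query: str) -> str:
    """Hypothetical helper: guess the identifier type of a search string."""
    q = query.strip()
    if q.startswith("InChI="):
        return "InChI"
    if INCHIKEY_RE.match(q):
        return "InChIKey"
    if cas_checksum_ok(q):
        return "CAS Registry Number"
    # Telling names from SMILES reliably needs a parser; left undecided here.
    return "name or SMILES"


if __name__ == "__main__":
    # Aspirin (slide 22) in several identifier forms.
    for q in ["50-78-2", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
              "CC(=O)Oc1ccccc1C(=O)O", "aspirin"]:
        print(q, "->", classify_identifier(q))
```

    A syntactic pre-check like this only routes the query; resolving it to a structure still requires the dictionary lookup the slide describes.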
  • 28. Linking Names to Structures
  • 29. Semantic Mark-up of Articles
  • 30. Data Quality Issues Williams and Ekins, DDT, 16: 747-750 (2011) Science Translational Medicine 2011
  • 31. Data quality is a known issue
  • 32. Standardize • Use the SRS as a guidance document for standardization • Adjust as necessary to our needs
  • 33. Nitro groups
  • 34. Salt and Ionic Bonds
  • 35. Ammonium salts
  • 36. CVSP Filtering and Flagging
  • 37. Openness and Quality Issues Williams and Ekins, DDT, 16: 747-750 (2011) Science Translational Medicine 2011
  • 38.
      Substructure  | # of Hits | # of Correct Hits | No stereochemistry | Incomplete stereochemistry | Complete but incorrect stereochemistry
      Gonane        | 34        | 5                 | 8                  | 21                         | 0
      Gon-4-ene     | 55        | 12                | 3                  | 33                         | 7
      Gon-1,4-diene | 60        | 17                | 10                 | 23                         | 10
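    The counts on slide 38 are easier to compare as percentages. A short sketch of that arithmetic follows; the variable and function names are illustrative, the numbers are copied from the slide, and the consistency check assumes the four outcome columns are meant to partition the hits, which the figures bear out.

```python
# Rows from slide 38: (hits, correct, no stereo, incomplete stereo, incorrect stereo)
SLIDE_38 = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}


def percent_correct(rows):
    """Return the percentage of correct hits per substructure, rounded to 0.1."""
    result = {}
    for name, (hits, correct, none, incomplete, wrong) in rows.items():
        # The four outcome categories should account for every hit.
        if correct + none + incomplete + wrong != hits:
            raise ValueError(f"row {name!r} does not add up to its hit count")
        result[name] = round(100 * correct / hits, 1)
    return result


if __name__ == "__main__":
    for name, pct in percent_correct(SLIDE_38).items():
        print(f"{name}: {pct}% of hits fully correct")
```

    On these figures fewer than a third of the hits in any row are fully correct, which is the point the slide makes about stereochemistry in public data.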
  • 39. Crowdsourced Enhancement • The community can clean and enhance the database by providing Feedback and direct curation • Tens of thousands of edits made
  • 40. Data Quality is Work • Cholesterol • Taxol
  • 41. Maybe we can help? • Is there an interest in data checking the WebBook or other NIST data sources?
  • 42. Publications-summary of work • Scientific publications are a summary of work • Is all work reported? • How much science is lost to pruning? • What of value sits in notebooks and is lost? • Publications offering access to “real data”? • How much data is lost? • How many compounds never reported? • How many syntheses fail or succeed? • How many characterization measurements?
  • 43. What are we building? • We are building the “RSC Data Repository” • Containers for compounds, reactions, analytical data, tabular data • Algorithms for data validation and standardization • Flexible indexing and search technologies • A platform for modeling data and hosting existing models and predictive algorithms
  • 44. Deposition of Data
  • 45. Compounds
  • 46. Reactions
  • 47. Analytical data
  • 48. Crystallography data
  • 49. Can we get historical data? • Text and data can be mined • Spectra can be extracted and converted • SO MUCH Open Source Code available
  • 50. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  • 51. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  • 52. Text spectra? 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
  • 53. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
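    Peak lists written as text, like the 13C example on slide 52, can be turned back into numbers with very little code. This is a minimal sketch under a simplifying assumption, not the extraction pipeline used in the talk: everything after "δ =" that looks like a decimal number is treated as a chemical shift in ppm, and the function name is hypothetical.

```python
import re

# 13C peak list copied from slide 52.
C13_TEXT = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, "
    "120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), "
    "99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)"
)


def parse_shifts(text: str) -> list[float]:
    """Return decimal numbers found after 'δ =' as chemical shifts in ppm."""
    # Drop the header (nucleus, solvent, frequency) so "100 MHz" is never seen.
    _, _, peaks = text.partition("δ =")
    # Assignments such as CH3 or CH2 contain no decimal point, so they are skipped.
    return [float(value) for value in re.findall(r"\d+\.\d+", peaks)]


if __name__ == "__main__":
    shifts = parse_shifts(C13_TEXT)
    print(len(shifts), "shifts from", min(shifts), "to", max(shifts), "ppm")
```

    The same shortcut would not be safe for the 1H list on slide 53, where coupling constants such as "J = 4.8 Hz" are also decimals; a real extractor has to parse each parenthesised assignment separately.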
  • 54. Turn “Figures” Into Data
  • 55. Make it interactive
  • 56. SO MANY reactions!
  • 57. Extracting our Archive • What could we get from our archive? • Find chemical names and generate structures • Find chemical images and generate structures • Find reactions • Find data (MP, BP, LogP) and deposit • Find figures and database them • Find spectra (and link to structures)
  • 58. Models published from data
  • 59. Text-mining Data to compare
  • 60. How is DERA going? • We have text-mined all 21st century articles… >100k articles from 2000-2013 • Marked up with XML and published onto the HTML forms of the articles • Required multiple iterations based on dictionaries, markup, text mining iterations • New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!
  • 61. Work in Progress
  • 62. Work in Progress
  • 63. Work in Progress
  • 64. Work in Progress
  • 65. Is It Easy? [Workflow diagram; labels: Dictionary (ontologies): RSC ontologies (methods, reactions) • Dictionary (chemistry): curated dictionaries for known names • Unknown names: automated name to structure conversion (ACD N2S, OPSIN) • Text-mining • Marked-up XML • XML ready for publication • Chemical structures: SD file • Production processes • CDX integration (coming soon)]
  • 66. Acknowledgments • Regarding InChI – Steve Stein, Steve Heller, Dmitrii Tchekhovskoi, Igor Pletnev
  • 67. Email: williamsa@rsc.org ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams Thank you