ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

This is the presentation I gave at OpenSciNY 2010. It was a great gathering of librarians and people interested in Open Science. Sharing the stage with Beth Brown, Jean-Claude Bradley and Heather Joseph was, as usual, a good opportunity to discuss how openness and online data sharing are changing the way we access and share data. We live in interesting and exciting times.

Transcript

  • 1. ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community OpenSciNY, New York, May 2010
  • 2. Once Upon a Time Over a “Coffee”
  • 3. Which is better for Plants? Vodka, Sprite or Viagra?
  • 4. It Works – Viagra Wins the Day
  • 5. Now Which is Better?
    • Viagra or Cialis?
    • Images sourced from Wikipedia
  • 6. Cialis
    • I want…
        • The structure
        • Any patent information
        • Related publications
        • Where can I buy it?
        • Metabolic pathway info
        • What else is easy to find…
  • 7. Cialis on Google?
  • 8. What is Cialis?
  • 9. What is Cialis? Can we trust Wikipedia?
  • 10. What is Cialis? 6 hits on PubChem
  • 11. What is Cialis?
  • 12. Search by Trade Name
  • 13. Are there other names???
  • 14. Are there other names???
    • PubMed hits:
      • 736 Tadalafil
      • 744 Cialis
  • 15. Are there other names???
  • 16. Are There Other Names?
  • 17. IC351 on PubChem? 5 HITS for IC351 ZERO HITS for IC 351
  • 18. Chemistry on the Web
    • Text searching the web is far from optimal
    • The quality of data on the web is a problem
    • It may be hard to find but it is “out there”
    • What was once locked up behind an expensive license can generally be found
    • Structure searching the web is already possible!
  • 19. Text Searching the Web
    • Text searching the web for chemical compounds is an enormous challenge
    • RSC has multiple databases, >500,000 articles and a lot of other resources. How do we do?
  • 20. The RSC Publishing Platform (Beta)
  • 21. 2+2 = 4 Articles?
  • 22. CAS Number Search
  • 23. Text Searching the Web
    • Disambiguation dictionaries of name-structure relationships would be very enabling.
      • IC351 = IC 351 = Tadalafil = Cialis = …
    • Creating validated dictionaries that cover chemistry is an enormous challenge
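
The name-structure dictionary idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not ChemSpider's implementation: the `SYNONYMS` table, the `normalize` rule (lowercase, drop spaces and hyphens) and the `resolve` helper are all assumptions, and the table contains only the names shown on the slide.

```python
import re
from typing import Dict, List, Optional

# Hypothetical, hand-made synonym table: only the names shown on the slide.
SYNONYMS: Dict[str, List[str]] = {
    "Tadalafil": ["IC351", "IC 351", "Tadalafil", "Cialis"],
}


def normalize(name: str) -> str:
    """Lowercase and drop whitespace/hyphens so 'IC 351' and 'IC351' collide."""
    return re.sub(r"[\s\-]+", "", name).lower()


def build_index(synonyms: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every normalized synonym to its preferred name."""
    index: Dict[str, str] = {}
    for preferred, names in synonyms.items():
        for name in names:
            index[normalize(name)] = preferred
    return index


INDEX = build_index(SYNONYMS)


def resolve(name: str) -> Optional[str]:
    """Return the preferred name for a synonym, or None if it is unknown."""
    return INDEX.get(normalize(name))
```

With this table, `resolve("IC 351")` and `resolve("IC351")` both return `"Tadalafil"`. The normalization is deliberately crude; a real dictionary would need validated entries rather than string tricks, which is the hard part the slide points to.
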
  • 24. CAS Registry – LOTS of Chemicals!
  • 25.  
  • 26.  
  • 27. The Final Search Strategy A “Disambiguation Query!”
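
A "disambiguation query" of this kind might be assembled as below. This is a sketch under stated assumptions: the function name `disambiguation_query` is hypothetical, and the quoted-term `OR` syntax is only one common convention; individual search engines and databases differ.

```python
def disambiguation_query(names):
    """Join synonyms into one OR query, quoting each term and dropping repeats."""
    seen = set()
    terms = []
    for name in names:
        cleaned = name.strip()
        key = cleaned.lower()
        # Skip empty strings and case-insensitive duplicates.
        if key and key not in seen:
            seen.add(key)
            terms.append('"' + cleaned + '"')
    return " OR ".join(terms)


# Names taken from the earlier slides; "cialis" is a deliberate duplicate.
query = disambiguation_query(["Tadalafil", "Cialis", "IC351", "IC 351", "cialis"])
print(query)  # "Tadalafil" OR "Cialis" OR "IC351" OR "IC 351"
```
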
  • 28. All Those Names, One Structure A problem to solve…
  • 29. ChemSpider - A Pragmatic Vision
      • “Build a Structure Centric Community to Serve Chemists”
      • Aggregate and integrate chemical structure data on the web – names, structures, links
      • Create a “structure-based hub” to information, data and algorithmic predictions
      • Let chemists contribute their own data
      • Allow the community to curate/correct data
  • 30. What do humans want? (Image: media.obsessable.com)
    • As few interfaces as possible
  • 31. Aggregating Data – Who to Trust???
    • Encyclopedic articles (Wikipedia)
    • Chemical vendor databases
    • Metabolic pathway databases
    • Property databases
    • Patents with chemical structures
    • Drug Discovery data
    • Scientific publications
    • Compound aggregators
    • Blogs/Wikis and Open Notebook Science
  • 32. Just “Public Compound” Databases
    • PubChem
    • Drugbank
    • ChEBI/ChEMBL
    • KEGG
    • LipidMAPs
    • ChemIDPlus
    • eMolecules
    • ZINC
    • Lots of chemical vendors
  • 33. Question Everything online: www.dhmo.org
  • 34. Di-Hydrogen Monoxide
    • 2H
  • 35. Di-Hydrogen Monoxide
    • 2H + 1O
  • 36. Di-Hydrogen Monoxide
    • H2O
  • 37. Di-Hydrogen Monoxide
    • H2O
    • Water
  • 38. It’s all on Wikipedia…
  • 39. What About Gases? Methane…
  • 40. What’s Methane?
  • 41. What’s Methane?
  • 42. What ELSE is Methane???
  • 43. Structural Data for Life Sciences DailyMed
  • 44. Lack of Stereochemistry
  • 45. Incorrect Structures
  • 46. Pragmatic Vision Delivered…
    • Aggregate, integrate and link data from across the internet
    • Almost 25 million structures from > 300 data sources
    • Linked to vendors, literature, online databases (open and commercial), open notebook science, patents and….
    • Robotic and Crowdsourced Curation
  • 47. Search “OEA”
  • 48. Search OEA
  • 49. Search OEA
  • 50. Search OEA
  • 51. Linked Patents for OEA
  • 52. Answering Questions…
    • Questions a student might ask…
      • What is the structure of levulinic acid?
      • Chemically, what is phenolphthalein?
      • What are the stereocenters of cholesterol?
      • Where can I find publications about xylene?
      • What are the different trade names for Ketoconazole?
      • What is the NMR spectrum of Aspirin?
      • How can I synthesize 2,4-dichlorophenol?
      • What are the safety handling issues for Thymol Blue?
  • 53. Back to Cialis…
  • 54. Cialis on ChemSpider : 1 hit
    • Chemicals are curated/validated on ChemSpider by ourselves and the community
    • Based on assertions from various sources. Iterative, time-consuming and exacting!
    • We believe we know the structure now
    • What is linked and available?
  • 55. Google Patents
  • 56. ChemSpider – Patents Linked SURECHEM PATENTS GOOGLE
  • 57. Google Books
  • 58. Microsoft Academic Search
  • 59. Google Scholar – Articles were found by CAS Number !
  • 60. Identifiers for Tadalafil
  • 61. How Many Articles in RSC Journals ?
    • Based on 171596-29-5 there are 13 articles in RSC journals
    • What about if we VALIDATE identifiers?
  • 62. Validated Dictionaries Hit APIs This is data curation...
  • 63. Does this generate more results?
  • 64. RSC Journals
  • 65. RSC Journals REMEMBER 2+2 = 4
  • 66. PubMed
  • 67. Google Scholar – Expanded Hit Set
  • 68. Microsoft Academic Search
  • 69. Microsoft Academic Search
    • Be careful! More mussels than drugs…
  • 70. Searching Chemistry on the Internet
    • Do we get a complete result set if we search for “chemicals” only by name?
    • Is there a better way to link chemistry databases? Linking by “names” is dangerous
    • Chemists want structure and SUBstructure searching
  • 71. Structure Searching the Web
    • We have resources about Tadalafil actively linked to ChemSpider
    • What about searching the web for Tadalafil by structure…not based on the various identifiers
    • How?
  • 72. Link the Internet with InChIKeys! Taken from: Rafael Sidis’ Blog
  • 73. The InChI Identifier
  • 74. Multiple Layers
  • 75. InChIStrings Hash to InChIKeys
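
The layered, hashed layout of an InChIKey can be illustrated by splitting one into its blocks. This is a minimal sketch for standard 27-character keys: the first 14-letter block hashes the core constitution (the "skeleton" in these slides), the next 8 letters hash the remaining layers such as stereochemistry, and single letters flag standard/non-standard, version and protonation. The names `InChIKeyParts` and `parse_inchikey` are hypothetical; the example key is the one for water.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 8 letters + standard flag + version letter,
# hyphen, protonation letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    skeleton: str      # hash of the core constitution (connectivity)
    details: str       # hash of the remaining layers (stereo, isotopes, ...)
    is_standard: bool  # True when the flag letter is "S"
    version: str       # InChI version letter ("A" for version 1)
    protonation: str   # "N" means neutral


def parse_inchikey(key: str) -> InChIKeyParts:
    """Split a 27-character InChIKey into its blocks, or raise ValueError."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a 27-character InChIKey: {key!r}")
    skeleton, details, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, details, flag == "S", version, protonation)


water = parse_inchikey("XLYOFNOQVPJJNP-UHFFFAOYSA-N")
print(water.skeleton, water.is_standard)  # XLYOFNOQVPJJNP True
```

Because the key is a hash, it can be compared and searched as plain text, but it cannot be turned back into a structure without a lookup service.
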
  • 76. Cialis – Searching the Web by InChI Search Molecular SKELETON Search Full Molecule
  • 77. InChI Search the Web by Skeleton 78 Hits by Skeleton
  • 78. InChI Search the Web Exact Match 32 Hits by InChIKey
  • 79. InChI Search the Web Exact Match 6 Hits by Standard InChIKey
  • 80. InChifying the Web
    • There are more than twice as many “skeleton” hits for Cialis as exact matches – different stereo? Mistakes?
    • Our judgment…MISTAKES
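
The difference between a skeleton search and an exact-match search comes down to comparing only the first InChIKey block versus the whole key. The sketch below is illustrative only: `skeleton_block` and `count_matches` are hypothetical helpers, the list of "found" keys is invented, and the alanine and water keys are quoted from memory and should be verified before reuse.

```python
def skeleton_block(inchikey: str) -> str:
    """Return the first 14-letter block, which hashes the connectivity."""
    return inchikey.split("-")[0]


def count_matches(query: str, found: list) -> tuple:
    """Count exact InChIKey matches and same-skeleton matches for a query."""
    exact = sum(1 for key in found if key == query)
    same_skeleton = sum(
        1 for key in found if skeleton_block(key) == skeleton_block(query)
    )
    return exact, same_skeleton


# Illustrative keys quoted from memory (alanine and water); verify before reuse.
L_ALANINE = "QNAYBMKLOCPYGJ-REOHSHDNSA-N"
D_ALANINE = "QNAYBMKLOCPYGJ-UWTATZPHSA-N"
ALANINE_NO_STEREO = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"
WATER = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"

# Hypothetical list of keys collected from web pages.
found_on_web = [L_ALANINE, L_ALANINE, D_ALANINE, ALANINE_NO_STEREO, WATER]

exact, same_skeleton = count_matches(L_ALANINE, found_on_web)
print(exact, same_skeleton)  # 2 4
```

Here the skeleton count is twice the exact count because the other stereoisomer and the stereo-free record share the same first block, which mirrors the pattern the slide describes for Cialis.
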
  • 81. Vancomycin – Search the Internet
  • 82. Full Molecule Search: 4 Hits
  • 83. Full Skeleton Search: 104 Hits
  • 84. InChIKeys Make the internet searchable by adding InChIKeys Publishers add InChIKeys to papers now… But what is the structure???
  • 85. We need an InChI “Resolver”
  • 86. InChI Resolver to DOIs Structure Search the Web
  • 87. Semantic Markup: Project Prospect
  • 88. Depends on Validated Dictionaries Link to a Structure or the Right Structure?
  • 89. Name-Structure Pairs
  • 90. Semantic Linking of Structures
    • What would you want to link off a structure?
      • Chemical suppliers
      • Other publications
      • Analytical Data
      • Related Reactions
      • Wikipedia
      • Patents
      • “Everything”
      • Through ChemSpider!
  • 91. Unpublished Chemistry
    • Only a fraction of chemistry is published
    • Only a tiny fraction of chemistry is patented
    • What of the “Lost Chemistry” – never published and cannot be abstracted?
      • Reactions performed
      • Structures made and studied
      • Spectra acquired and then disposed of
      • Available chemicals never found
  • 92. Org Prep Daily (Blog)
  • 93. ChemSpider SyntheticPages
  • 94. Submission process
    • Register as a user
    • Use the Submit button and fill in the fields…
  • 95. Submission Process
    • Submissions reviewed by editorial board
    • Published as is or comments sent to author
    • Online Peer Review process
    • Data supported include web movies, images, live spectra etc.
  • 96. Micro- and Nano-publications
    • Blogs, wiki entries and even Amazon book reviews are micro/nano-publications
    • ChemSpider SyntheticPages will be DOI’ed – students can add these “micro-publications” to their resume
    • Structures and spectra are nano-publications – these can also be tracked and referenced (depositions, curations, etc.). Students participate in building one of the premier sources of chemistry data.
  • 97. ChemSpider : Spectra Linked
  • 98. Spectra Linked
  • 99. Spectra Linked
  • 100. Not Just NMR Data
  • 101. www.SpectralGame.com http://www.jcheminf.com/content/1/1/9
  • 102. Spectral Game
  • 103. Increasing Complexity
  • 104. Spectral Game
  • 105. ChemSpider Content
    • ChemSpider is a container…supports multimedia
      • Spectra
      • Crystal structures
      • Images
      • MP3s
      • Videos
  • 106. Roses’ Crystal Image Collection
  • 107. MP3s and Videos : Titanium
  • 108. Periodic Table Images
  • 109. How Can You Help ChemSpider?
    • Deposit your data and share with the community
      • Structures – one or many
      • Spectra
      • Links
      • Syntheses into SyntheticPages
    • Curate data – most basic level…just add comments
    • Spread the word – ChemSpider is an untapped resource
  • 110. Community Contribution
    • We can make a bigger contribution to the community if the community shares via ChemSpider
    • Don’t underestimate what others will find of value
    • ChemSpider wins “Community contribution” best practice award
  • 111. Chemistry on the Internet FUTURE
    • The semantic web for chemistry is in place
    • Crowdsourced contributions are commonplace
    • Chemists will search by structure/substructure
    • Chemistry articles indexed and searchable
    • Reduced number of searches to find data
    • Data are integrated – compounds, vendors, syntheses, data, publications and patents
    • A world of Open Access and Open Data
  • 112. Thank you [email_address] Twitter: ChemSpiderman www.chemspider.com/blog SLIDES: www.slideshare.net/AntonyWilliams