RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn
These are the slides I will be presenting at the Science Commons Symposium Pacific Northwest, on the Microsoft campus in Redmond, in about five minutes' time.


  1. ChemSpider: Collecting and Curating the World’s Chemistry with the Community
  2. A Pragmatic Vision “Build a Structure Centric Community”
     • December 2006 – A hobby project initiated to connect chemistry on the web
     • Integrate chemical structure data on the web
     • Create a “structure-based hub” to information and data
     • Provide access to structure-based “algorithms”
     • Let chemists contribute their own data
     • Allow the community to curate/correct data
  3. Where is chemistry online?
     • Encyclopedic articles (Wikipedia)
     • Chemical vendor databases
     • Metabolic pathway databases
     • Property databases
     • Patents with chemical structures
     • Drug Discovery data
     • Scientific publications
     • Compound aggregators
     • Blogs/Wikis and Open Notebook Science
  4. Chemistry on the Internet TODAY
     • Chemistry searches are generally limited to text-based searches across the internet
     • Data are dirty: sorting the wheat from the chaff. Who can you trust?
     • Too many searches required to resource data
  5. What do humans want?
     • As few interfaces as possible
     • media.obsessable.com
  6. Chemistry on the Internet FUTURE
     • The semantic web for chemistry is in place
     • Crowdsourced contributions are commonplace
     • Chemists will search by structure/substructure
     • Chemistry articles indexed and searchable
     • Reduced number of searches to find data
     • Data are integrated – compounds, vendors, syntheses, data, publications and patents
     • A world of Open Access and Open Data
     • Classical business models will have to morph
  7. Getting it done
     • March 2007 – A beta system opened online
     • One purchased computer, two home-built
     • Seeded with 10.5 million structures
     • Structure/substructure searching
     • June 2007
     • A curating layer to flag data
     • A deposition interface to add to the data
     • And so it continued….
  8. ChemSpider Searches
  9. Search Cholesterol
  10. Search Cholesterol
  11. Search Cholesterol
  12. Search Cholesterol
  13. Search Cholesterol
  14. Linked across the internet
  15. Kyoto Encyclopedia of Genes and Genomes
  16. Links to Patents based on structure
  17. Articles Linked
  18. ChemSpider Complex Searches
  19. Link off a structure in ChemSpider
     • Chemical suppliers
     • Other publications
     • Analytical Data
     • Related Reactions
     • Wikipedia
     • Patents
     • “Everything”
  20. Answering Questions for Chemists
     • Questions a chemist might ask…
     • What is the melting point of n-butanol?
     • What is the chemical structure of Xanax?
     • Chemically, what is phenolphthalein?
     • What are the stereocenters of cholesterol?
     • Where can I find publications about xylene?
     • What are the different trade names for Ketoconazole?
     • What is the NMR spectrum of Aspirin?
     • What are the safety handling issues for Thymol Blue?
  21. What is a compound?
  22. ChemSpider is a structure-centric hub
     • ChemSpider aggregates and links out across the internet
     • Data aggregate based on “structures and links”
     • What defines a chemical compound?
  23. Linked Data on the Web
     • Taken from: Rafael Sidi’s Blog
  24. Where Would You look? What Do You Trust?
  25. Question Everything online: www.dhmo.org
  26. Di-Hydrogen Monoxide
     • 2H
  27. Di-Hydrogen Monoxide
     • 2H + 1O
  28. Di-Hydrogen Monoxide
     • H2O
  29. Di-Hydrogen Monoxide
     • H2O
     • Water
  30. It’s all on Wikipedia…
  31. Chemistry on The Internet Is Messy
  32. It’s Methane…
  33. What’s Methane?
  34. What’s Methane?
  35. What ELSE is Methane???
  36. PubChem
  37. Chemistry is REALLY Messy
  38. Vancomycin
     • Who will curate?
     • How would you clean such a large dataset?
     • Assertions!!!
  39. Vancomycin on ChemSpider
     • 1 compound – 3 days
  40. The EXPERTS must get it right?!
  41. Wikipedia, C&E News, PubChem
     • C&E News (from ACS)
  42. The InChI Identifier
  43. Multiple Layers
  44. InChIStrings Hash to InChIKeys
  45. InChIs for Taxol
  46. InChIKeys for Taxol
     • DrugBank: RCINICONZNJXQF-CLDWUXIMDD
     • ChEBI: RCINICONZNJXQF-GXKQXQCDDN
     • Wikipedia: RCINICONZNJXQF-MZXODVADBJ
     • ChEBI and Wikipedia are the SAME structure
     • DrugBank is a DIFFERENT structure – ONE stereocenter
  47. Does one stereocenter matter?
  48. Does one stereocenter matter?
     • Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon
  49. Does one stereocenter matter?
     • Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon
  50. Assertion and Chemical Entities
     • Who says what Taxol is?
     • What is the “timeline” for a molecule?
     • How do we clean up the Public data?
     • The Quality source is Chemical Abstracts Service…
  51. Vancomycin – Search the Internet
  52. Full Molecule Search: 4 Hits
  53. Full Skeleton Search: 104 Hits
  54. The InChI “Resolver”
  55. Citizen Scientists
  56. Crowd-sourcing Chemistry Curation
     • Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
  57. Building a Structure Centric
     • Multi-level Curation and Approval
  58. Citizens as Data Sources
  59. Semantic Markup: Project Prospect
  60. Entity-Extraction, Mark-up, Annotate
  61. Success Depends on Dictionaries
  62. ChemMantis and CJOC
  63. Name-Structure Pairs
  64. Species – linked to Wikipedia
  65. Semantic Linking of Structures
     • What would you want to link off a structure?
     • Chemical suppliers
     • Other publications
     • Analytical Data
     • Related Reactions
     • Wikipedia
     • Patents
     • “Everything”
  66. ChemSpider Everywhere: Embed
  67. ChemSpider Everywhere: Spectral Game
  68. ChemSpider Everywhere
     • Crowdsourced Curation of Spectra
  69. ChemSpider Everywhere: What do computers want?
     • Web services
     • flickr.com/photos/microcosmos
  70. ChemSpider Everywhere
     • Linked from Wikipedia and many Public Databases
     • Linked from Open Notebook Science sites
     • Linked from Blogs using Structure/Spectra EMBED
     • Integrated into structure drawing packages
     • Integrated to software offerings from Thermo, Waters, Agilent, Bruker
  71. ChemSpider Everywhere: ChemMobi
  72. There will always be gaps...
     • What ChemSpider does not deal with, yet...
     • Materials
     • Minerals
     • Polymers
     • Biological macromolecules
  73. Open Source, Access and Data
     • ChemSpider is NOT Open Source but we do use Open Source components (OpenBabel, JSpecView, Jmol). Thanks Microsoft!
     • ChemSpider is not an “Open Access Database” – it’s a “free access” resource
     • We do not assume copyright. Rights to the data and the creative works remain with the depositor
     • Is ChemSpider “Open Data”?
  74. Open Data?
  75. Who declares data as Open?
     • Data licensing is very interesting and can spark “interesting” conversations. Opinions differ:
     • Are images data? Are assertions data?
     • What on a ChemSpider record is data?
     • Is PubChem or PubMed Open Data?
     • We allow people to declare their data as Open and add an Open Data button at upload
     • A lot of data on ChemSpider are free but not Open
     • Pragmatism: Our focus is a community resource
  76. Conclusions: ChemSpider Today
     • ChemSpider is an established community resource
     • >23 million compounds from >300 data sources
     • About 7000 unique users per day and up to ½ million transactions per day
     • A crowdsourced deposition and curation platform
     • Grows daily – more depositions, more links, more data
     • Web services provider
     • Linked to commercial and open source software
     • Supporting analytical companies: Agilent, Thermo, Waters, Bruker
     • Serving ONS, providing games to students, ChemSpidey robot
     • A publishing platform for the community
  77. ChemSpider Tomorrow
     • Continue the curation effort and keep cleaning
     • Finish depositions – millions left to deposit
     • Integrate RSC content – a massive archive!
     • Integrate RSC publishing workflows and databases
     • Enable the semantic web for chemistry
  78. Acknowledgments
     • Royal Society of Chemistry
     • Valery Tkachenko and Sergey Shevelev
     • Commercial Software: Microsoft, Advanced Chemistry Development, OpenEye and Symyx
     • Open Source Software: Jmol, OpenBabel, JSpecView
     • JC Bradley, Andrew Lang – The Spectral Game and Open Notebook Science integration
     • The “Crowd” of curators
     • 306 Data Source providers
     • SyntheticPages.org
  79. Thank you
     • antony.williams@chemspider.com
     • Twitter: ChemSpiderman
     • www.chemspider.com/blog
     • SLIDES: www.slideshare.net/AntonyWilliams
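
The InChIKey slides (44 to 46) and the skeleton search slides (52 and 53) rest on one idea: the block of an InChIKey before the first hyphen is a hash of the molecule's connectivity, while the characters after it reflect further layers such as stereochemistry. The Python sketch below is only an illustration of that idea, not ChemSpider's implementation. It takes the three Taxol keys exactly as transcribed on slide 46 and groups the sources first by skeleton block, then by full key. The names `TAXOL_KEYS`, `skeleton_block` and `group_sources` are hypothetical helpers introduced here.

```python
from collections import defaultdict

# Keys copied as transcribed from slide 46; nothing here is looked up online.
TAXOL_KEYS = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def skeleton_block(inchikey: str) -> str:
    """Return the part of the key before the first hyphen (the connectivity hash)."""
    return inchikey.split("-", 1)[0]


def group_sources(keys_by_source, key_func):
    """Group source names by whatever key_func extracts from each InChIKey."""
    groups = defaultdict(list)
    for source, key in keys_by_source.items():
        groups[key_func(key)].append(source)
    return dict(groups)


if __name__ == "__main__":
    # "Full skeleton" style match: compare only the first block.
    by_skeleton = group_sources(TAXOL_KEYS, skeleton_block)
    # "Full molecule" style match: compare the whole key.
    by_full_key = group_sources(TAXOL_KEYS, lambda key: key)

    print("Groups when matching on skeleton block:", len(by_skeleton))
    print("Groups when matching on full key:", len(by_full_key))
    for block, sources in by_skeleton.items():
        print(block, "->", ", ".join(sources))
```

Matching on the first block alone is deliberately loose: it collects every record that shares a skeleton, which is one plausible reason a skeleton search returns far more hits than a full molecule search. Deciding which of those records describe the same stereoisomer still needs the remaining characters of the key, or the InChI strings themselves.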