
The expansive reach of ChemSpider as a resource for the chemistry community


Our access to scientific information has changed in ways that were hardly imagined even by the early pioneers of the internet. The immense quantities of data and the array of tools available to search and analyze online content continue to expand, and the pace of change does not appear to be slowing. ChemSpider is one of the chemistry community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day, and it is the foundation for many important international projects to integrate chemistry and biology data, facilitate drug discovery efforts and help to identify new chemicals from under the ocean. This presentation will provide an overview of the expanding reach of the ChemSpider platform and the nature of the solutions that it helps to enable. We will also discuss the possibilities it offers in the domain of crowdsourcing and open data sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community, and facilitated by collaboration, ultimately accelerating scientific progress.

Published in: Technology, Education

  1. 1. The Expansive Reach of ChemSpider as a Resource for the Chemistry Community. Antony Williams. University of Oregon, April 24th 2013
  2. 2. The World of Online Chemistry • Property databases • Compound aggregators • Screening assay results • Scientific publications • Encyclopedic articles (Wikipedia) • Metabolic pathway databases • ADME/Tox data – eTOX for example • Blogs/Wikis and Open Notebook Science
  3. 3. We Have … Too Much Data!!!
  4. 4. e-Science and Primary Data • How much data generated in a lab, that COULD go public, is lost forever?
  5. 5.
  6. 6. e-Science and Primary Data • How much data generated in a lab, that COULD go public, is lost forever? • Public Domain reference databases of value? – Syntheses – Properties – Spectra – CIFs – Images
  7. 7. Collaborative Knowledge Management
  8. 8. e-Science and Primary Data • How much data generated in a lab, that COULD go public, is lost forever? • Public Domain reference databases of value? – Syntheses – Properties – Spectra – CIFs – Images • Much of chemistry is chemical structure-based – where and how could we host these data?
  9. 9. RSC’s ChemSpider
  10. 10. Crowdsourced “Annotations” • Users can add – Descriptions/Syntheses/Commentaries – Links to PubMed articles – Links to articles via DOIs – Add spectral data – Add Crystallographic Information Files – Add photos – Add MP3 files – Add Videos
  11. 11. Spectra
  12. 12. Chemistry Data online is messy • We have inherited errors • All public compound databases, including ours, have errors • “Incorrect” structures – assertions, timelines etc • “Incorrect” names associated with structures • Properties • Links • Publications • ENORMOUS CHALLENGE
  13. 13. The Structure of Vitamin K?
  14. 14. MeSH • A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K
  15. 15. The Structure of Vitamin K1?
  16. 16. What is the Structure of Vitamin K1?
  17. 17. CAS’s Common Chemistry
  18. 18. Wikipedia
  19. 19. “2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione” • Variants of systematic names on PubChem – 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl – 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl – 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl – 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl – 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl – 2-methyl-3-[(E)-3,7,11,15-tetramethyl – 2-methyl-3-(3,7,11,15-tetramethyl – 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  20. 20. Question Everything online:
  21. 21. It’s all on Wikipedia…
  22. 22. Chemistry on The Internet Is Messy
  23. 23. It’s Methane…
  24. 24. What’s Methane?
  25. 25. What’s Methane?
  26. 26. What ELSE is Methane???
  27. 27. With Great Fanfare…
  28. 28. NPC Browser
  29. 29. NPC Browser
  30. 30. Public Domain Databases • Our databases are a mess… • Non-curated databases are proliferating errors • We source and deposit data between databases • Original sources of errors hard to determine • Curation is time-consuming and challenging
  31. 31. Stop Whining – Fix it
  32. 32. Crowdsourced Curation • Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
  33. 33. Search “Vitamin H”
  34. 34. “Curate” Identifiers
  35. 35. “Curate” Identifiers
  36. 36. “Curate” Identifiers
  37. 37. Standards : Structure Standardization
  38. 38. Standards : Structure Standardization
  39. 39. Standards : Structure Standardization
  40. 40. The InChI Identifier
  41. 41. Multiple Layers
  42. 42. InChI Strings Hash to InChIKeys
  43. 43. Vancomycin – Search the Internet
  44. 44. Vancomycin: Search Molecular SKELETON / Search Full Molecule
  45. 45. Full Skeleton Search: 104 Hits
  46. 46. Full Molecule Search: 4 Hits
  47. 47. Validated Name-Structure Dictionaries • Chemical name dictionaries are used for: • Text-mining (publications, patents) – Used to index PubMed and link to Google Patents • Linking to other databases – think Biology! – When structures are not available drug names link • Searching the web – Names link to structures link to InChIs
  48. 48. I want to know about “Vincristine”. If all algorithms work then everything on the page is correct by default except the name-structure relationship!
  49. 49. Vincristine: Identifiers and Properties
  50. 50. Vincristine: Vendors and Sources, Linked by Structure
  51. 51. Vincristine: Patents, Linked by Name
  52. 52. Vincristine: Articles, Linked by Name
  53. 53. ChemSpider Resources for Chemistry
  54. 54. Micropublishing Syntheses
  55. 55. ChemSpider SyntheticPages
  56. 56. Olympicene
  57. 57. So you Want a Profile???
  58. 58. Interactive Data
  59. 59. PharmaSea • Dereplication via ChemSpider • Segregation of natural products datasets • Analytical data algorithms & integration – Mass spec searching – predicted fragmentation – NMR feature searching – NMR prediction – Computer-assisted structure elucidation
  60. 60. It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?
  61. 61. • 3-year Innovative Medicines Initiative project • Integrating chemistry and biology data using semantic web technologies • Open source code, open data and open standards • Academics, Pharma companies, Publishers….
  62. 62. ChemSpider Contributions • The host of the chemistry services – Supplier of “standardized” chemical data files – Chemistry searching (structure, substructure etc) – Provider of data in RDF format – Curator and data quality checking • Now building the Open PHACTS chemical registration system
  63. 63. ChemSpider Contributions • Supplier of chemistry UI components • “Quality Police” for data checking • Chemical Validation and Standardization Platform • Nanopublications from RSC publications
  64. 64. Integrate to instruments and software • Integration to analytical instrumentation vendors already in place – Agilent, Bruker, Thermo, Waters • Also, Cheminformatics vendors link to ChemSpider – Accelrys, ACD/Labs, ChemAxon, iChemLabs, and…
  65. 65. Natural Products Updates • Names hard, Structures “Obvious” • New content based on monthly updates of the database • Click through to the Natural Products Updates entry
  66. 66. National Chemical Database Service
  67. 67. Chemical Database Service • National Chemical Database Service for UK Academics • Integrating Commercial Databases and Services • Chemicals, analytical data, prediction algorithms • Development of data repository
  68. 68. Publications - a summary of work • Scientific publications are a summary of work – Is all work reported? – How much science is lost to pruning? – What of value sits in notebooks and is lost? • How much data is lost? – How many compounds never reported? – How many syntheses fail or succeed? – How many characterization measurements?
  69. 69. Community Repository for Data • Funding agencies encourage sharing of data • Increasing availability of “Open Data” • Institutional repositories have no specific domain support • Develop a community repository for chemistry data – private, public, embargoed • Provides data to develop models/algorithms
  70. 70. Community Repository for Data • Automated depositions of data • DOI’ed data objects for citation purposes • A database of reference data, but validated by the community • National services feeding the repository – crystallography, mass spectrometry • Integrate to blogging tools for chemistry • Integrate to Electronic Lab Notebooks as feeds
  71. 71. Model Building with Community Data • Community data as a basis of model building – Consume data from available databases, community data, new publications and build predictive algorithms for the community – How many algorithms are reported and lost? How much repeat work is done in the domain of algorithmic development?
  72. 72. Pulling Data from our Archive • Our contribution to the world of chemistry data • DERA – digitally enabling the RSC archive – Text mining • Find chemicals, reactions, analytical data, properties – Algorithmic checking • Validate algorithmically what we can - robots – “Web 2.0 interfaces” for curating and validating
  73. 73. What if we could capture it all? Digitally Enhancing the RSC Archive
  74. 74. Data Validation and Curation Required. Encouraging Participation with Rewards and RECOGNITION
  75. 75. Manual Curation • Integrated commenting, curating and validation platform across ALL eScience and publishing platforms • All integrated to a central RSC profile and feeding the AltMetrics tools
  76. 76. Structure Review
  77. 77. Maybe Hybrid Man-Machine
  78. 78. Where we are now…
  79. 79. Rewards and Recognition. Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP. The First Step badge is awarded when a user submits (& has published) their 1st CSSP article.
  80. 80. Future Recognition in AltMetrics? ChemSpider
  81. 81. Internet Data, The Future, Commercial Software, Pre-competitive Data, Open Science, Open Data, Publishers, Educators, Open Databases, Chemical Vendors, Small organic molecules, Undefined materials, Organometallics, Nanomaterials, Polymers, Minerals, Particle bound, Links to Biologicals
  82. 82. The Future of Chemistry on the Web? • Public compound databases federate & build a linked environment of validated data! • Data validation needs are not ignored • Publishers layer on information to make publications discoverable • Public-Private databases can be linked • Open Data proliferate • The “Semantic Web” in action
  83. 83. Acknowledgments • Valery Tkachenko and the eScience team • Our data providers, depositors, collaborators and curators • Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)
  84. 84. Thank you. Email: williamsa@rsc.org Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES:
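
Slides 40 to 46 refer to InChI layers, to InChI strings being hashed into InChIKeys, and to skeleton versus full-molecule searches. A minimal Python sketch of the string handling involved follows, under stated assumptions. The function names (`split_inchi_layers`, `skeleton_block`, `same_skeleton`) and the layer labels are hypothetical and are not ChemSpider's API. The layer table covers only common standard-InChI prefixes. The sketch also assumes that a skeleton search compares the first 14-character block of the InChIKey, while a full-molecule search compares the whole key. The methane InChI is a real standard InChI; `KEY_A` and `KEY_B` are made-up placeholders in InChIKey format, not real compounds.

```python
# Illustrative only: helper names and layer labels are assumptions,
# not part of any ChemSpider or InChI library API.

# Common one-letter prefixes of standard InChI layers (not exhaustive).
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo parity",
    "s": "stereo type",
    "i": "isotopes",
}


def split_inchi_layers(inchi):
    """Split an InChI string into a list of (layer label, value) pairs.

    Simplification: the second field is assumed to be the formula,
    which holds for ordinary molecules but not for every InChI.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = [("version", parts[0])]
    if len(parts) > 1:
        layers.append(("formula", parts[1]))
    for part in parts[2:]:
        label = LAYER_NAMES.get(part[:1], "other")
        layers.append((label, part[1:]))
    return layers


def skeleton_block(inchikey):
    """Return the first 14-character block of an InChIKey."""
    blocks = inchikey.split("-")
    if len(blocks) != 3 or len(blocks[0]) != 14:
        raise ValueError("not in InChIKey format")
    return blocks[0]


def same_skeleton(key_a, key_b):
    """True when two InChIKeys share the first block (assumed skeleton match)."""
    return skeleton_block(key_a) == skeleton_block(key_b)


# Placeholder keys: same first block, different second block.
KEY_A = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
KEY_B = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"

for label, value in split_inchi_layers("InChI=1S/CH4/h1H4"):
    print(label, value)
print("skeleton match:", same_skeleton(KEY_A, KEY_B))
print("full match:", KEY_A == KEY_B)
```

Under these assumptions, the two placeholder keys count as a skeleton hit but not as a full-molecule hit, which mirrors the gap between the 104 skeleton hits and the 4 full-molecule hits reported on slides 45 and 46.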