eScience at the Royal Society of Chemistry: Current Initiatives
Antony Williams, Cornell University, May 14th 2013

Access to scientific information has changed in a manner that was likely never even imagined by the early pioneers of the internet. The quantities of data, the array of tools available to search and analyze, the devices and the shift in community participation continue to expand, while the pace of change does not appear to be slowing. ChemSpider is one of the chemistry community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day. It also serves as the foundation for many important international projects to integrate chemistry and biology data, facilitate drug discovery efforts and help to identify new chemicals from under the ocean. This presentation will provide an overview of the expanding reach of this eScience cheminformatics platform and the nature of the solutions that it helps to enable, including structure validation, text mining and semantic markup, the National Chemical Database Service for the United Kingdom and the development of a chemistry data repository. We will also discuss the possibilities it offers in the domain of crowdsourcing and open data sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community, and will facilitate collaboration and ultimately accelerate scientific progress.

  1. eScience at the Royal Society of Chemistry: Current Initiatives. Antony Williams, Cornell University, May 14th 2013
  2. We Have … Too Much Data!!!
  3. The World of Online Chemistry • Property databases • Compound aggregators • Screening assay results • Scientific publications • Encyclopedic articles (Wikipedia) • Metabolic pathway databases • ADME/Tox data – eTOX for example • Blogs/Wikis and Open Notebook Science
  4. e-Science and Primary Data • How much data generated in a lab, that COULD go public, is lost forever?
  5. e-Science and Primary Data • How much data generated in a lab, that COULD go public, is lost forever? • Public Domain reference databases of value? – Syntheses – Properties – Spectra – CIFs – Images
  6. e-Science and Primary Data • How much data generated in a lab, that COULD go public, is lost forever? • Public Domain reference databases of value? – Syntheses – Properties – Spectra – CIFs – Images • Much of chemistry is chemical structure-based – where and how could we host these data?
  7. RSC’s ChemSpider
  8. ChemSpider • >28.5 million unique chemicals from >400 data sources • Focus on improving data quality, enhancing functionality, integrating and enabling
  9. Crowdsourced “Annotations” • Users can add – Descriptions/Syntheses/Commentaries – Links to PubMed articles – Links to articles via DOIs – Add spectral data – Add Crystallographic Information Files – Add photos – Add MP3 files – Add Videos
  10. Spectra
  11. Chemistry Data online are messy • We have inherited errors • All public compound databases have errors • “Incorrect” structures – assertions, timelines etc • “Incorrect” names associated with structures • Properties • Links • Publications • ENORMOUS CHALLENGE
  12. Crowdsourced Curation • Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
  13. Search “Vitamin H”
  14. “Curate” Identifiers
  15. “Curate” Identifiers
  16. “Curate” Identifiers
  17. Validated Name-Structure Dictionaries • Chemical name dictionaries are used for: • Text-mining (publications, patents) – Used to index PubMed and link to Google Patents • Linking to other databases – think Biology! – When structures are not available drug names link • Searching the web – Names link to structures link to InChIs
  18. I want to know about “Vincristine”
  19. Vincristine: Identifiers and Properties
  20. Vincristine: Vendors and Sources, Linked by Structure
  21. Vincristine: Patents, Linked by Name
  22. Vincristine: Articles, Linked by Name
  23. Semantic Mark-up of Articles
  24. Linking Names to Structures
  25. The InChI Identifier
  26. InChI Strings Hash to InChIKeys
  27. Vancomycin – Search the Internet
  28. Vancomycin • Search Molecular SKELETON • Search Full Molecule
  29. Full Skeleton Search: 104 Hits
  30. Full Molecule Search: 4 Hits
  31. ChemSpider Resources for Chemistry
  32. Some usage statistics • ca. 200 visitors at any one time, ~30,000 visits per day • Mar 4-Apr 3, 2013 – Visits = 731,656 – Unique Visitors = 527,008 • Independent servers to support other projects
  33. Access ChemSpider • APIs – Programmatic access used by Mobile Apps, Funded Consortia projects, many Academic groups • Widgets – UI components for embedding in other websites • Data – Data access, downloads, reuse, licensing
  34. Flexible ChemSpider API http://www.chemspider.com/google/
  35. Flexible ChemSpider API
  36. Publications - a summary of work • Scientific publications are a summary of work – Is all work reported? – How much science is lost to pruning? – What of value sits in notebooks and is lost? • How much data is lost? – How many compounds never reported? – How many syntheses fail or succeed? – How many characterization measurements?
  37. Micropublishing Syntheses
  38. ChemSpider SyntheticPages
  39. Olympicene
  40. So you Want a Profile???
  41. Interactive Data
  42. Integrate to instruments and software • Integration to analytical instrumentation vendors already in place – Agilent, Bruker, Thermo, Waters • Also, Cheminformatics vendors link to ChemSpider – Accelrys, ACD/Labs, ChemAxon, iChemLabs, and…
  43. PharmaSea • Dereplication via ChemSpider • Segregation of natural products datasets • Analytical data algorithms & integration – Mass spec searching – predicted fragmentation – NMR feature searching – NMR prediction – Computer-assisted structure elucidation
  44. It is so difficult to navigate… • What’s the structure? • Are they in our file? • What’s similar? • What’s the target? • Pharmacology data? • Known Pathways? • Working On Now? • Connections to disease? • Expressed in right cell type? • Competitors? • IP?
  45. • 3-year Innovative Medicines Initiative project • Integrating chemistry and biology data using semantic web technologies • Open source code, open data and open standards • Academics, Pharma companies, Publishers….
  46. ChemSpider Contributions • The host of the chemistry services – Supplier of “standardized” chemical data files – Chemistry searching (structure, substructure etc) – Curator and data quality checking • Now building the Open PHACTS chemical registration system
  47. Natural Products Updates • Names hard, Structures “Obvious” • New content based on monthly updates of the database • Click through to the Natural Products Updates entry
  48. National Chemical Database Service
  49. Chemical Database Service • National Chemical Database Service for UK Academics • Integrating Commercial Databases and Services • Chemicals, analytical data, prediction algorithms • Development of data repository
  50. Community Repository for Data • Funding agencies encourage sharing of data • Increasing availability of “Open Data” • Institutional repositories no specific domain support • Develop a community repository for chemistry data – private, public, embargoed • Provides data to develop models/algorithms
  51. Community Repository for Data • Automated depositions of data • DOI’ed data objects for citation purposes • A database of reference data, but validated by the community • National services feeding the repository – crystallography, mass spectrometry • Integrate to blogging tools for chemistry • Integrate to Electronic Lab Notebooks as feeds
  52. Model Building with Community Data • Community data as a basis of model building – Consume data from available databases, community data, new publications and build predictive algorithms for the community – How many algorithms are reported and lost? How much repeat work is done in the domain of algorithmic development?
  53. Support for Chemical Reactions • Integrating mined reaction data from patents • Will also incorporate and integrate RSC Databases: Methods of Organic Synthesis, Catalysts and Catalyzed Reactions and…
  54. Inside our Publication Archive • How much data is in the archive, in the publications and in the supplementary info? – How many compounds for ChemSpider? – How many syntheses for ChemSpider reactions? – How many characterization measurements? • Property Data • Spectral Data • Graphs and charts to be used for modeling?
  55. What if we could capture it all? Digitally Enhancing the RSC Archive
  56. Start with data in publications
  57. Data Validation and Curation Required • Encouraging Participation with Rewards and RECOGNITION
  58. Manual Curation • Integrated commenting, curating and validation platform across ALL eScience and publishing platforms • All integrated to a central RSC profile and feeding the AltMetrics tools
  59. Structure Review
  60. Future Recognition in AltMetrics? ChemSpider
  61. The Future • Internet Data • Commercial Software • Pre-competitive Data • Open Science • Open Data • Publishers • Educators • Open Databases • Chemical Vendors • Small organic molecules • Undefined materials • Organometallics • Nanomaterials • Polymers • Minerals • Particle bound • Links to Biologicals
  62. The Future of Chemistry on the Web? • Public compound databases federate & build a linked environment of validated data! • Data validation needs are not ignored • Publishers layer on information to make publications discoverable • Open Data proliferate • The “Semantic Web” will continue to develop…
  63. Thank you • Email: williamsa@rsc.org • Twitter: @ChemConnector • Personal Blog: www.chemconnector.com • SLIDES: www.slideshare.net/AntonyWilliams
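
The skeleton versus full-molecule searches on the InChI slides (25 to 30) can be illustrated with a minimal Python sketch. It relies only on the published layout of a standard InChIKey: a 14-letter block hashed from the connectivity ("skeleton") layers, an 8-letter block hashed from the remaining layers (stereochemistry, isotopes and so on), a flag letter, a version letter, and a final protonation letter. The helper names (`split_inchikey`, `skeleton_query`, `same_skeleton`) are hypothetical and are not part of any ChemSpider API, and how the slides' searches were actually run is an assumption here. The aspirin key is real; the second key in the usage lines is invented purely to show a skeleton match.

```python
import re
from typing import NamedTuple

# Layout of an InChIKey: 14 letters, hyphen, 8 letters + flag + version,
# hyphen, 1 protonation letter (27 characters in total).
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    skeleton: str      # hash of the connectivity layers
    remainder: str     # hash of stereo, isotope and other layers
    flag: str          # "S" for a standard InChI, "N" for non-standard
    version: str       # InChI version letter
    protonation: str   # "N" means neutral


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its blocks, or raise ValueError if malformed."""
    match = INCHIKEY_PATTERN.match(key.strip().upper())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return InChIKeyParts(*match.groups())


def skeleton_query(key: str) -> str:
    """Search term for a 'molecular skeleton' search: the first block only."""
    return split_inchikey(key).skeleton


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True when two keys share the connectivity block."""
    return skeleton_query(key_a) == skeleton_query(key_b)


# Usage: aspirin's key, and a made-up key that differs only in the second block.
aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
made_up_variant = "BSYNRYMUTXBXSQ-ABCDEFGHSA-N"  # hypothetical, for illustration
print(skeleton_query(aspirin))
print(same_skeleton(aspirin, made_up_variant))
```

Searching on the first block alone matches every record that shares the connectivity, regardless of stereochemistry or isotopic labelling, which is consistent with the skeleton search on the slides returning more hits (104) than the full-molecule search (4).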