Dealing with the complex challenge of managing diverse chemistry data online

The Royal Society of Chemistry has provided access to data associated with millions of chemical compounds via our ChemSpider database for over 5 years. During this period the richness and complexity of the data have continued to expand dramatically and the original vision for providing an integrated hub for structure-centric data has been delivered across the world to hundreds of thousands of users. With an intention of expanding the reach to cover more diverse aspects of chemistry-related data including compounds, reactions and analytical data, to name just a few data-types, we are in the process of implementing a new architecture to build a Chemistry Data Repository. The data repository will manage the challenges of associated metadata, the various levels of required security (private, shared and public) and exposing the data as appropriate using semantic web technologies. Ultimately this platform will become the host for all chemicals, reactions and analytical data contained within RSC publications and specifically supplementary information. This presentation will report on how our efforts to manage chemistry-related data have impacted chemists and projects across the world and will review specifically our contributions to projects involving natural products for collaborators in Brazil and China, for the Open Source Drug Discovery project in India, and our collaborations with scientists in Russia.
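
As an illustration only, the security levels named above (private, shared and public, plus the embargoed state that the slides also mention) could be modelled along the following lines. This is a minimal sketch, not the repository's actual implementation: the names `Access`, `DataSlice` and `can_read`, and the rule that an embargoed slice becomes readable once its embargo date has passed, are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Access(Enum):
    """Hypothetical access levels, named after the terms used in the talk."""
    PRIVATE = "private"
    SHARED = "shared"
    EMBARGOED = "embargoed"
    PUBLIC = "public"


@dataclass
class DataSlice:
    """Hypothetical unit of deposited data with a single owner."""
    owner: str
    access: Access
    shared_with: frozenset = frozenset()   # only used for SHARED slices
    embargo_ends: Optional[date] = None    # only used for EMBARGOED slices


def can_read(data_slice: DataSlice, user: str, today: date) -> bool:
    """Return True if `user` may read `data_slice` on the given day."""
    if user == data_slice.owner:
        return True  # owners can always read their own data
    if data_slice.access is Access.PUBLIC:
        return True
    if data_slice.access is Access.SHARED:
        return user in data_slice.shared_with
    if data_slice.access is Access.EMBARGOED:
        # Assumption: an embargo without an end date never opens up.
        return (data_slice.embargo_ends is not None
                and today >= data_slice.embargo_ends)
    return False  # PRIVATE: nobody but the owner
```

Keeping the decision in one function means each data type (compounds, reactions, spectra) could reuse the same check, though whether the real system is organised that way is not stated in the source.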

Published in: Science

Transcript of "Dealing with the complex challenge of managing diverse chemistry data online"

  1. 1. Dealing with the complex challenge of managing diverse chemistry data online Antony Williams, Valery Tkachenko, Alexey Pshenichnov and Ken Karapetyan ACS San Francisco August 2014
  2. 2. CAS Counter http://www.cas.org/content/counter
  3. 3. About Me…as a Chemist • I’ve performed a few dozen chemical syntheses • I’ve run thousands of analytical spectra • I’ve generated thousands of NMR assignments • I’ve probably published <5% of all work • Most of it has been lost • But things can be different today…. • But it still needs to be associated with me…
  4. 4. Think about chemistry a mo’ • If we imagine that permission exists… (i.e. forget IP, chemical and pharma companies etc…think students…) – How many syntheses are performed – How many spectra are run – How many properties are measured – How many compounds are made – How many, how much, how big??..... – Let’s go manage it all!!
  5. 5. Consider a shift to Openness
  6. 6. Times have changed… Open Access funder mandates…
  7. 7. Publishers are responding
  8. 8. The world of Open Data is here
  9. 9. Open Data are everywhere • Is Openness and Social Sharing changing the world? • The cultural experiments in Open Data and exchange are almost daily • Mobile platforms enhance participation • And then what of Chemistry Data???
  10. 10. An Experiment - ChemSpider • ChemSpider allowed the community to participate in linking the internet of chemistry & crowdsourcing of data • Successful experiment in terms of building a central hub for integrated web search • More people are “users” than “contributors” • Yet basic feedback and game-play helps
  11. 11. An Experiment - CSSP
  12. 12. An EPSRC Call “…the identification of the need for a UK national service for the provision of a searchable, electronic chemical database for the UK academic research community.”
  13. 13. National Chemical Database Service
  14. 14. We set a vision… • Manage “all” of the chemistry data associated with chemical substances – PUBLISHED and UNPUBLISHED • Based on user-selected licensing, the data are to be downloadable, reusable and interactive • Build a platform that enables the scientist • Data storage, validation, standardization and curation • Collaborative data sharing • Provide a data platform that can enable and enhance publishing of scientific papers
  15. 15. Data Repository • Registration of chemical compounds • Deposition of chemical syntheses • Addition of analytical data • Integration to electronic notebooks • Rewards and recognition for data sharing • Document processing • Hosting of data as private, embargoed or public
  16. 16. Development of Data Repository • Data repository should not just be a data dump – should not be a “big disk” • Searchable, integrated, segregated repository of data types • Data access including private, shared, embargoed and public • Delivery of derived models from data
  17. 17. New Repository Architecture doi: 10.1007/s10822-014-9784-5
  18. 18. New Repository Architecture. Data tier: Compounds, Reactions, Spectra, Materials, Documents. Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API. User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets. User interface tier (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Chemical Inventory application, paid 3rd party integrations (various platforms – SharePoint, Google, etc).
  19. 19. Input data pipeline Deposition Gateway Staging databases Compounds Reactions Spectra Materials Articles / CSSP Compounds Module Spectra Module Reactions Module Materials Module Textmining Module … Module Web UI for unified depositions DropBox, Google Drive, SkyDrive, etc LabTrove and other templated data Documents API, FTP, etc Raw data Validated data Staging databases All databases are sliced by data sources/data collections and have a simple security model where each data slice/source is private, public or embargoed
  20. 20. Compounds
  21. 21. Reactions
  22. 22. Analytical data
  23. 23. Crystallography data
  24. 24. For Deposition of Data • Quality of data at source • ensuring chemicals are correct - VALIDATION • reactions map and balance as appropriate – VALIDATION and STANDARDIZATION • file format handling for analytical data types – binary file formats are proprietary - STANDARDIZATION • valid interpretation of data – VALIDATION and ANNOTATION
  25. 25. Input data pipeline Deposition Gateway Staging databases Compounds Reactions Spectra Materials Articles / CSSP Compounds Module Spectra Module Reactions Module Materials Module Textmining Module … Module Web UI for unified depositions DropBox, Google Drive, SkyDrive, etc LabTrove and other templated data Documents API, FTP, etc Raw data Validated data Staging databases All databases are sliced by data sources/data collections and have a simple security model where each data slice/source is private, public or embargoed
  26. 26. Depositions Gateway User Interface
  27. 27. Deposition of Data
  28. 28. Validate and Standardize
  29. 29. CVSP Filtering
  30. 30. CVSP Filtering of DrugBank
  31. 31. ChEMBL (1.3 million records) • 11,020 records with 4 bonds and zero charge, e.g. CHEMBL501101 or CHEMBL501973 • 271 records with hypervalent oxygen (e.g. CHEMBL2219679), carbon (e.g. 1005895), boron, chlorine, iodine or phosphine • 6,177 records where direction of bond makes no sense, e.g. CHEMBL12760 and CHEMBL34704
  32. 32. Depositions User Interface
  33. 33. The challenges of analytical data • Vendors produce complex proprietary data formats and standard formats are required (JCAMP, NetCDF, AniML) • ChemSpider already hosts thousands of JCAMP spectra • Support of “assigned spectra” in place • Data validation approaches understood • There are a myriad of analytical data types…
  34. 34. ChemSpider ID 24528095 H1 NMR
  35. 35. ChemSpider ID 24528095 C13 NMR
  36. 36. ChemSpider ID 24528095 HHCOSY
  37. 37. ChemSpider ID 24528095 HSQC
  38. 38. ChemSpider ID 24528095 HMBC
  39. 39. Managing Assignments?
  40. 40. Depositions User Interface
  41. 41. Depositions from ELNs • Development work integrating chemistry into the Southampton LabTrove notebook • Stoichiometry table development • Analytical data integration • “ChemTrove” rolled out to a small test group in January
  42. 42. Document deposition/processing
  43. 43. Experimental data checker
  44. 44. User Interface Approach. Data tier: Compounds, Reactions, Spectra, Materials, Documents. Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API. User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets. User interface tier (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Chemical Inventory application, paid 3rd party integrations (various platforms – SharePoint, Google, etc).
  45. 45. User Interface Approach. Data tier: Compounds, Reactions, Spectra, Materials, Documents. Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API. User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets. User interface tier (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Chemical Inventory application, paid 3rd party integrations (various platforms – SharePoint, Google, etc).
  46. 46. Display Widgets
  47. 47. Work in Progress
  48. 48. Work in Progress
  49. 49. User Interface Approach. Data tier: Compounds, Reactions, Spectra, Materials, Documents. Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API. User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets. User interface tier (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Chemical Inventory application, paid 3rd party integrations (various platforms – SharePoint, Google, etc).
  50. 50. Use case diagram. Analytical Chemist: Characterize, Measure, Search, Store (linked by <<include>> relationships). Synthetic Chemist: Search (synthetic procedure), Document (publish synthetic procedure), Retrosynthetic analysis.
  51. 51. A Compounds Repository Interface
  52. 52. A Reactions/Document Interface
  53. 53. The PharmaSea Website
  54. 54. The Open PHACTS community ecosystem
  55. 55. Open Source Drug Discovery India
  56. 56. What can drive participation? • What can drive scientists to participate and contribute? • Ensuring provenance of their data for reuse • Mandates from funding agencies • Improved systems to ease contribution • Additional contributions to science • Improved publishing processes • Recognition for contributions
  57. 57. Scientists are Increasingly Quantified…
  58. 58. AltMetrics as Scientist Impact
  59. 59. AltMetrics
  60. 60. Detailed Usage Statistics
  61. 61. Rewards and Recognition Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP. The First Step badge is awarded when a user submits (& has published) their 1st CSSP article.
  62. 62. http://orcid.org/0000-0002-2668-4821
  63. 63. AltMetrics Feeds • For our data repository ensure contribution of data will feed out to the AltMetrics platforms • Every data point, every data download, use and reuse will be associated with the scientist • Data will be DOI’ed (presently under review) • Services provided will allow for AltMetrics use
  64. 64. What do we have in place? • We are testing an early form of the data repository on our data – ChemSpider and our archive of publications • Working with collaborators to define needs • Testing and enhancing deposition systems • Chemical validation & standardization platform • Analytical data handling formats • And lots in development…
  65. 65. The Challenges Ahead • Chemistry is NOT just nicely defined structures! • Materials, minerals, attached to beads, polymers, ambiguous materials • Domain-specific measurements • File format standards are limited in application • Encouraging scientists to free up their data • AltMetrics, open data mandates, systems • The data explosion continues
  66. 66. But it’s not easy of course • Not everything we would like around data handling is there for sure • Many systems, tools, platforms are already available but we don’t know about them, or even if we did, contributing is “more work” • “What’s in it for me?”, “It’s my data”, “It’s too much work”, “What credit do I get?”
  67. 67. And yes…we know…
  68. 68. Thank you Email: williamsa@rsc.org ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams
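
The kind of structure validation described on the deposition and ChEMBL slides (slides 24 and 31) can be illustrated with a small sketch. It is not the CVSP implementation: the `Atom` record, the `valence_issues` function and the three-element valence table are hypothetical, and the check covers only neutral atoms whose total bond order exceeds a textbook maximum.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical, deliberately tiny table: maximum total bond order for a
# neutral atom. A real validator would cover many more elements and states.
MAX_NEUTRAL_VALENCE = {"C": 4, "N": 3, "O": 2}


@dataclass(frozen=True)
class Atom:
    """Hypothetical atom record: element, summed bond orders, formal charge."""
    symbol: str
    bond_order_sum: int
    charge: int = 0


def valence_issues(atoms: List[Atom]) -> List[Tuple[int, str, int, int]]:
    """Return (index, symbol, bond_order_sum, limit) for each suspect atom.

    Only neutral atoms of elements in the table are checked; charged atoms
    and unknown elements are skipped rather than guessed at.
    """
    issues = []
    for index, atom in enumerate(atoms):
        limit = MAX_NEUTRAL_VALENCE.get(atom.symbol)
        if limit is None or atom.charge != 0:
            continue
        if atom.bond_order_sum > limit:
            issues.append((index, atom.symbol, atom.bond_order_sum, limit))
    return issues
```

A record would then be flagged for curation whenever `valence_issues` returns a non-empty list. Skipping charged atoms keeps the sketch from raising false alarms on species such as ammonium, at the cost of missing genuine errors there.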