Importance of data standards for large scale data integration in chemistry

The Royal Society of Chemistry hosts large scale data collections and provides access to the data to the chemistry community. The largest RSC data set of wide scale interest to the community offers access to tens of millions of compounds. The host platform, ChemSpider, is limited as it is a structure centric hub only. A new architecture, the RSC data repository, has been developed that extends support to reactions, spectral data, crystallography data and related property data. It is also the architecture underlying a series of exemplar projects for managing data for a number of diverse laboratories. The adoption of data standards for the integration and distribution of data has been essential. Specific standards include molecular structure formats such as molfiles and InChIs, and spectral data formats such as JCAMP. This presentation will report on our development of the data repository, the importance of utilizing standards for data integration, the flexible nature of the architecture to deliver solutions for various laboratories and our efforts to develop new large data collections. This includes text-mining efforts to extract large spectrum-structure collections from large corpuses.

Published in: Science

  1. Importance of data standards for large scale data integration in chemistry
      Antony Williams, Valery Tkachenko, Alexey Pshenichnov, Ken Karapetyan, Stuart Chalk, Daniel Lowe and Carlos Cobas
      ACS Denver, March 2015
  2. Free and Easy
      • To make it easy to “take notes” these slides will be available at: www.slideshare.net/AntonyWilliams/
  3. Charles Holland Duell
  4. Charles Holland Duell
      • 1898-1901: US Commissioner of Patents
      • "Everything that can be invented has been invented."
  5. Antony John Williams (et al)
  6. Antony John Williams (et al)
      • “We don’t need more standards!”
      • “Of COURSE we can build a spectral database!”
      • “The standards we have are good enough”
  7. A Pragmatic View to Progress
      • Let’s consider progressing an NMR spectral database for the community!
      • MUST HAVES – spectra (1D/2D), associated structures, assignments
      • WANTS – predict NMR spectra, spectral searching, privacy/embargoes
      • What would we need in terms of standards?
      • Molfiles and JCAMP
  8. Standards without adoption..
  9. Standards
  10. 2D NMR
  11. Progress in standards
  12. Progress in standards
  13. Standards without adoption are limited in value
      • If the instrument vendors don’t support or adopt the standards, success is limited
      • YESTERDAY: discussion about publishing NMR – JCAMP
      • But what is already available will work – JEOL, Bruker, Thermo, Anasazi, Agilent/Varian – imperfect but useful
  14. www.ChemSpider.com
  15. 9400 Spectra and growing
      http://www.chemspider.com/spectra.aspx
  16. JCAMP NMR Spectra
  17. Data on ChemSpider
  18. JCAMP file downloads
      • When NMR spectra are stored as JCAMP, downloads into offline packages are feasible – MestreLabs, ACD/Labs etc.
      • Open Data – download versus view
      • Store spectra locally and reuse
      • Java is increasingly a pain!
      • Need to move to HTML5 viewing on ChemSpider, especially for mobile viewing
  19. Challenges with Spectra
      • JCAMP is good for a lot of spectral data – IR, Raman, 1D NMR
      • MS data is rarely made available in JCAMP
      • We would love a ratified JCAMP 6.0 for 2D data exchange – it would allow third parties to build support for download
      • ASSIGNED JCAMP spectra supported
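Part of what makes JCAMP downloads reusable offline is that the header is plain labelled text. The following is a minimal sketch, not a full JCAMP-DX reader: it only collects `##LABEL=value` records, and it assumes the usual convention that labels are compared ignoring case, spaces, hyphens, slashes and underscores. The sample file content is made up for illustration and does not come from ChemSpider.

```python
import re


def parse_jcamp_headers(text):
    """Collect ##LABEL=value records from JCAMP-DX text into a dict."""
    headers = {}
    for line in text.splitlines():
        # "$$" starts an inline comment in JCAMP-DX; drop it.
        line = line.split("$$", 1)[0].strip()
        if not line.startswith("##") or "=" not in line:
            continue  # data rows and continuation lines are skipped here
        label, value = line[2:].split("=", 1)
        # Normalise the label so "DATA TYPE" and "DATATYPE" compare equal.
        key = re.sub(r"[\s\-/_]", "", label).upper()
        headers[key] = value.strip()
    return headers


# Made-up header block, for illustration only.
SAMPLE = """\
##TITLE=example 13C spectrum (made-up values)
##JCAMP-DX=5.01
##DATA TYPE=NMR SPECTRUM
##.OBSERVE NUCLEUS=^13C
##.OBSERVE FREQUENCY=100.0
##NPOINTS=4
##END=
"""

headers = parse_jcamp_headers(SAMPLE)
print(headers["DATATYPE"])
print(headers[".OBSERVENUCLEUS"])
```

Decoding the compressed data tables (and anything 2D) is deliberately left out; that is where dedicated packages earn their keep.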
  20. Proper Verification
      (slide credit: Advanced Chemistry Development, Inc. (ACD/Labs), 03/25/15)
  21. Jmol - JSpecView
  22. ChemDoodle Components
  23. Spectral Display in the hand
  24. New Repository Architecture
      doi: 10.1007/s10822-014-9784-5
  25. Compounds
  26. Reactions
  27. Analytical data
  28. Deposition of Data
  29. 1,000,000 Spectra Online?
  30. ESI – Text Spectra
  31. Developing Proof-of-Concept
      • Extract from 1976-2014 USPTO applications
        Nucleus     Count
        H           975543
        C           56536
        unknown*    44306
        F           9429
        P           3241
        B           91
        Si          62
        Sn          22
        Se          11
        N           8
        * unknown – text starts off with "NMR:" followed by a peak list (no nucleus given)
  32. We want to find text spectra?
      • We can find and index text spectra:
        13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
      • What would be better are spectral figures – and include assignments where possible!
  33. MestreLabs Mnova NMR
  34. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
  35. 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
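Text spectra like the two above follow a loose but recognisable pattern: a header with nucleus, solvent and frequency, then shifts with optional parenthesised annotations. The sketch below shows one way such a string could be turned into a peak list. It is an illustration only, not the text-mining pipeline used for the USPTO extraction; the regular expressions and function names are assumptions, negative shifts are not handled, and assignments and multiplicities are simply discarded.

```python
import re

# Header such as "13C NMR (CDCl3, 100 MHz):"
HEADER = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?) NMR \((?P<solvent>[^,]+), (?P<mhz>[\d.]+) MHz\):"
)
# A single shift ("4.24") or a range ("7.18–7.94", en dash or hyphen).
SHIFT = re.compile(r"(\d+\.\d+)(?:\s*[–-]\s*(\d+\.\d+))?")


def strip_parentheses(text):
    """Remove parenthesised annotations, including nested ones like C(5a)H."""
    depth, kept = 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        elif depth == 0:
            kept.append(ch)
    return "".join(kept)


def parse_text_spectrum(text):
    """Return nucleus, solvent, frequency and (low, high) ppm pairs."""
    header = HEADER.search(text)
    if header is None:
        raise ValueError("no 'NMR (solvent, MHz):' header found")
    body = strip_parentheses(text[header.end():])
    peaks = [
        (float(low), float(high) if high else float(low))
        for low, high in SHIFT.findall(body)
    ]
    return {
        "nucleus": header.group("nucleus"),
        "solvent": header.group("solvent"),
        "mhz": float(header.group("mhz")),
        "peaks": peaks,
    }


# Shortened version of the 1H example from the slides.
EXAMPLE = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), "
    "7.18–7.94 (m, 11H, ArH)"
)
spectrum = parse_text_spectrum(EXAMPLE)
print(spectrum["nucleus"], spectrum["solvent"], spectrum["mhz"])
print(spectrum["peaks"])
```

Stripping the parentheses first is what keeps coupling constants such as "J = 4.8 Hz" from being mistaken for shifts.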
  36. ESI Data also contains figures
  37. Publications & “Real Spectra”
      • We are turning text into spectra
      • We are turning figures into spectra
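The slides do not say how a peak list becomes a displayable spectrum. One common approach, assumed here purely for illustration, is to place a Lorentzian line at each shift and sum them on a ppm axis. The linewidth, the equal peak heights and the function names are all hypothetical choices; real intensities vary from peak to peak.

```python
def lorentzian(x, center, width):
    """Unit-height Lorentzian with full width at half maximum `width` (ppm)."""
    half = width / 2.0
    return half * half / ((x - center) ** 2 + half * half)


def synthesize_spectrum(shifts, start, stop, points, width=0.5):
    """Sum one equal-height Lorentzian per shift on an evenly spaced ppm axis."""
    step = (stop - start) / (points - 1)
    axis = [start + i * step for i in range(points)]
    intensity = [sum(lorentzian(x, s, width) for s in shifts) for x in axis]
    return axis, intensity


# First five 13C shifts from the text spectrum in the slides.
shifts = [14.12, 30.11, 30.77, 66.12, 68.49]
axis, intensity = synthesize_spectrum(shifts, start=0.0, stop=100.0, points=10001)

# Report the axis position of the tallest point between 60 and 67 ppm.
window = [(y, x) for x, y in zip(axis, intensity) if 60.0 <= x <= 67.0]
print(round(max(window)[1], 2))
```

A spectrum built this way is a rendering of the reported peaks, not a recovery of the measured data, which is the distinction the later slides turn on.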
  38. Early Test Experiments
      • Input
        • 74 supplementary data documents, 3444 pages
      • Output
        • Plot2Txt extracted content from 1069 pages
        • 1151 spectra total; >80% of peaks extracted to within 1-2 decimal places (ppm)
  39. “Where is the real data please?”
      FIGURE / DATA
  40. Manual Curation Layer
      • ALL SPECTRA WILL BE STORED AS JCAMP
      • ChemSpider has had a manual curation layer for >8 years
      • Users can annotate data on ChemSpider
      • We do receive useful feedback from the community on the data and are optimistic!
  41. Extraction is the WRONG WAY
      • We should NOT have to mine data out – it should already be in digital form!
      • Structures should be submitted “correctly”
      • Spectra should be in digital spectral formats, not images
      • ESI should be RICH and interactive
      • Data should be open and available, with metadata and provenance
  42. We can solve for Authors here
      • Will it be used though??? YES!
  43. Supplementary Info Data now..
  44. Data mining – it’s MINE!!!
  45. What should we be doing?
      • Settle on a short-term format – JCAMP-JMOL?
  46. But there ARE solutions!
  47. But there ARE solutions!
  48. What should we be doing?
      • Settle on a short-term format – JCAMP-JMOL?
      • Convince the instrument vendors to export in this format
      • Push button depositions into “containers” – ChemSpider, NMRShiftDB, Institutional Repositories
      • Encourage format support in software (read and write) – Mestre, ACD/Labs, Bruker TopSpin, etc.
  49. NMRShiftDB anyone?
  50. Standards in Large Scale Data Integration
      • ALL of these are imperfect standards
      • Molfiles
      • SDF
      • InChI
      • JCAMP
      • But what can be done with them?
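As one small answer to "what can be done with them?", an SDF file can be read with nothing more than string handling: each record is a molfile followed by data items, and a line containing `$$$$` closes the record. The sketch below assumes V2000 molfiles, where the fourth line of each record holds the atom and bond counts in its first two three-character columns. The two records are hand-written examples, and the function names are hypothetical.

```python
import re

# Two hand-written V2000 records (heavy atoms only), for illustration.
SDF_TEXT = """\
ethanol
  hand-written example

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.2500    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
M  END
> <NAME>
ethanol

$$$$
water
  hand-written example

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <NAME>
water

$$$$
"""


def split_sdf(text):
    """Yield each record as a list of lines; a '$$$$' line ends a record."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)


def read_record(lines):
    """Read title, atom/bond counts and data items from one SDF record."""
    counts = lines[3]  # V2000 counts line: columns 1-3 atoms, 4-6 bonds
    fields, current = {}, None
    for line in lines:
        if line.startswith(">"):
            match = re.search(r"<([^>]+)>", line)
            current = match.group(1) if match else None
            if current is not None:
                fields[current] = []
        elif current is not None:
            if line.strip() == "":
                current = None  # a blank line ends the data item
            else:
                fields[current].append(line)
    return {
        "title": lines[0].strip(),
        "atoms": int(counts[0:3]),
        "bonds": int(counts[3:6]),
        "fields": {name: "\n".join(value) for name, value in fields.items()},
    }


records = [read_record(r) for r in split_sdf(SDF_TEXT)]
for rec in records:
    print(rec["title"], rec["atoms"], rec["bonds"], rec["fields"])
```

The fixed-column counts line is a good example of the "imperfect but useful" point: trivial to read, and equally easy to break with a stray space.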
  51. Compound Data
      • The standards of chemical structure handling are primarily molfile, SDfile, SMILES, InChI
      • We primarily depend on molfiles and SDF files for data deposition and interchange
      • We use InChI a lot – especially for integrated searching across the web
  52. Searching the Entire Web?
  53. Searching Internet by Structure
  54. Compound Data
      • The standards of chemical structure handling are primarily molfile, SDfile, SMILES, InChI
      • We primarily depend on molfiles and SDF files for data deposition and interchange
      • We use InChI a lot – especially for integrated searching across the web
      • There ARE data interchange problems associated with structures….
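InChI helps integration because its hashed form, the InChIKey, is a fixed-length string that can be used as a join key or pasted into a web search. The sketch below groups records from several sources by InChIKey. The source names and record layout are hypothetical; the two keys are the standard InChIKeys for ethanol and benzene. The optional `first_block_only` flag compares only the first 14-character block, which encodes connectivity, so records differing in stereochemistry would be grouped together.

```python
from collections import defaultdict

# Hypothetical records from three made-up sources.
RECORDS = [
    {"source": "vendor_catalog", "name": "ethanol",
     "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"source": "spectra_db", "name": "ethyl alcohol",
     "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"source": "spectra_db", "name": "benzene",
     "inchikey": "UHOVQNZJYSORNB-UHFFFAOYSA-N"},
    {"source": "patent_mining", "name": "EtOH",
     "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
]


def merge_by_inchikey(records, first_block_only=False):
    """Group records that share an InChIKey (or only its first block)."""
    merged = defaultdict(list)
    for record in records:
        key = record["inchikey"]
        if first_block_only:
            key = key.split("-", 1)[0]  # connectivity block only
        merged[key].append(record)
    return dict(merged)


merged = merge_by_inchikey(RECORDS)
for key, group in merged.items():
    print(key, sorted(r["name"] for r in group))
```

Names alone would have left "ethanol", "ethyl alcohol" and "EtOH" as three separate entries; the identifier collapses them into one. The grouping is only as good as the structures the keys were generated from, which is where the interchange problems on the last bullet come in.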
  55. USE and TEACH Standards
      • Too few people are aware of the existing standards and their capabilities
      • Part of the CINF mission activities should be to teach standards, and this is being done
      • Still too few people have heard of InChI and JCAMP, for example
      • Still little is known about the importance of correct structure representations – kudos to people like Leah et al who TEACH THIS!
  56. USE and TEACH Standards!
  57. USE and TEACH Standards!
  58. CVSP: Validate and Standardize
  59. CVSP Rules Sets
  60. CVSP Filtering of DrugBank
  61. Compounds
  62. Reactions
  63. Use Ontologies
  64. Contribute to PUBLIC Ontologies
      • Yes there are “company” ontologies – but for the good of the community contribute to public ontologies and standards
      • For data interchange and meshing this is soooooo beneficial!
  65. ChAMP – Stuart Chalk
  66. Use standards in APIs, endpoints and widgets
  67. Semanticize content : RDF
  68. Actions
      • Support and encourage new standards
      • In the meantime, reawaken and modernize the JCAMP standard
      • Show up and listen to Bob Hanson today
      • Encourage scientists to provide data
  69. Charles Holland Duell in 1902
      “…all previous advances in the various lines of invention will appear totally insignificant when compared with those which the present century will witness. I almost wish that I might live my life over again to see the wonders which are at the threshold”
  70. “Git-r-Done”
  71. Acknowledgments
      • Daniel Lowe – NextMove, Reactions and Spectra
      • Bill Brouwer – Plot2Txt Development
      • Carlos Cobas and Stan Sykora – MestreLabs
      • The ChemSpider team – led by Richard Kidd
      • The RSC Data Repository team
  72. Thank you
      Email: williamsa@rsc.org
      ORCID: 0000-0002-2668-4821
      Twitter: @ChemConnector
      Personal Blog: www.chemconnector.com
      SLIDES: www.slideshare.net/AntonyWilliams
