Approaches for extraction and “digital chromatography” of chemical data: A perspective from the RSC

The traditional perception of the publishing process has been that it culminates in a print article. The Royal Society of Chemistry (RSC) has for many years been acutely aware that there is a wealth of information contained in the scientific communications that we publish and that its true value can only be unlocked by enabling the discovery of the data within them. This is challenging because scientists provide data in a variety of ways: textually, graphically, and increasingly in supplementary information. This talk will outline how the RSC has applied innovative approaches, developed both internally and externally, to identifying important chemical data within the literature, and how it provides tools that let anyone using chemical data analyse and improve its quality. Examples will include Project Prospect, the Experimental Data Checker, our CIF data importer, ChemSpider and our structure validation and standardization service.

This presentation was given by David Sharpe at the ACS Fall Meeting 2012 in Philadelphia.


  1. Approaches for extraction and “digital chromatography” of chemical data: A perspective from the RSC
  2. Overview
     • Introduction
       – What data can we consider?
       – What are the challenges?
       – What data and sources does the RSC have?
       – Experimental Data Checker
     • Case Studies:
       – Project Prospect
       – Chair forms of sugars/cyclohexanes
  3. Traditional Chromatography
     Images taken from:
     http://www.sciencemadness.org/talk/viewthread.php?tid=3960&page=3
     http://en.wikipedia.org/wiki/Column_chromatography
  4. Why Digital Chromatography?
     • Usable information is mixed in with description and analysis
       – Makes it difficult to find
     • Despite our best efforts, there is still a lot of ambiguous or plain wrong/unusable chemical information
     • Why?
       – Human error
       – Processing errors
       – Incorrect use of data generation/extraction
       – Style over meaning
       – Data not generated with reuse in mind
       – Data generated for humans
  5. Style/Layout vs Meaning
     • Structures drawn to illustrate more than just the identity
     • Data not generated with reuse in mind
     • Author practices
     • Mixed 2D and perspective representations
     • Unintentional definition of stereochemistry
  6. Data generated for humans
     • Separated/orphaned information, including Markush structures and information passed by reference
  7. What chemical data can we consider?
     • Chemistry is especially challenging: it has a wide range of types of data
       – Numeric data
       – Names
       – Structures
       – Terminology
     • Spread over very different topics: organic, inorganic, physical
       – Meanings/interpretations are not perfectly aligned
     • Application of standards can be challenging
     • Drawing conventions are documented but not used
  8. What chemical data and sources does the RSC have?
  9. A beginning: helping chemists review their own work
     Amphidinoketide I
     To a solution of…….…. Amphidinoketide I was isolated as a ……..
     [α]D25 −17.6 (c 0.085, CH2Cl2); Rf = 0.61 (1:1 hexane:ethyl acetate); νmax(CHCl3)/cm−1 1707.2 (CO), 1686.9 (CO), 1632.4 (CO), 1618.9 (CC), 1458.1; 1H NMR (CD2Cl2, 500 MHz) δH 6.08 (1H, t, J = 1.3 Hz, 3-CHC), 5.82 (1H, ddt, J = 16.9, 10.2, 6.7 Hz, 19-CHCH2), 4.99 (1H, m (17.1 Hz), 20-CHA), 4.92 (1H, m (10.2 Hz), 20-CHB), 3.05 (1H, dd, J = 17.9, 9.3 Hz, 8-CHA), 3.00–2.90 (3H, m, 9-CHCH3, 11-CHA, 12-CHCH3), 2.72–2.64 (2H, m, 5-CHA, 6-CHA), 2.62–2.55 (2H, m, 5-CHB, 6-CHB), 2.51–2.45 (3H, m, 8-CHB, 11-CHB, 14-CHA), 2.33 (1H, dd, J = 16.9, 7.4 Hz, 14-CHB), 2.09 (3H, s, 21-CH3), 2.05–1.99 (2H, m, 18-CH2), 1.99–1.96 (1H, m, 15-CHCH3), 1.88 (3H, s, 1-CH3), 1.39–1.25 (3H, 17-CH2, 16-CHA), 1.14–1.10 (1H, m, 16-CHB), 1.07 (3H, d, J = 7.0 Hz, 22-CH3), 1.05 (3H, d, J = 7.2 Hz, 23-CH3), 0.87 (3H, d, J = 6.7 Hz, 24-CH3); 13C NMR (CD2Cl2, 125 MHz) δC 213.15 (13-CO), 212.08 (10-CO), 208.40 (7-CO), 198.76 (4-CO), 155.40 (2-CCH), 138.41 (19-CHCH2), 123.54 (3-CHC), 114.19 (20-CH2C), 48.81 (14-CH2), 45.93 (11-CH2), 44.50 (8-CH2), 41.43 (9-CHCH3), 41.01 (12-CHCH3), 37.74 (5-CH2), 36.55 (16-CH2), 36.27 (6-CH2), 34.18 (18-CH2), 28.74 (15-CHCH3), 27.57 (1-CH3), 26.60 (17-CH2), 20.63 (21-CH3), 19.77 (24-CH3), 16.65 (22 or 23-CH3), 16.62 (22 or 23-CH3); HRMS (ESI) Calculated for C24H38O4 413.2668, found 413.26600 (MNa+). (9R, 12R, 15S)-1 had [α]D25 +11 (c 0.245, CH2Cl2).
     • http://www.rsc.org/is/journals/checker/run.htm
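The HRMS entry on slide 9 is the kind of value a reader can re-derive from the formula alone. The sketch below is only an illustration of that idea and is not the Experimental Data Checker's implementation: it assumes standard monoisotopic masses, and the function names (`monoisotopic_mass`, `ppm_error`) are hypothetical. It computes the mass for C24H38O4 plus Na both with and without the electron-mass correction for the cation, so the reported "calculated" figure can be compared against each.

```python
# Monoisotopic masses (u) of the most abundant isotopes; standard reference values.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462, "Na": 22.98976928}
ELECTRON = 0.00054858  # electron mass (u), removed when forming a 1+ cation


def monoisotopic_mass(composition):
    """Sum monoisotopic masses for a composition such as {"C": 24, "H": 38}."""
    return sum(MONOISOTOPIC[element] * count for element, count in composition.items())


def ppm_error(found, calculated):
    """Signed deviation of the found value from the calculated value, in ppm."""
    return (found - calculated) / calculated * 1e6


# Values quoted on the slide (assumed here to refer to the [M+Na]+ ion).
compound = {"C": 24, "H": 38, "O": 4}
reported_calculated = 413.2668
reported_found = 413.26600

adduct_sum = monoisotopic_mass(compound) + MONOISOTOPIC["Na"]  # no electron correction
cation_mz = adduct_sum - ELECTRON                              # m/z of the 1+ ion

print(f"M + Na (atom masses only): {adduct_sum:.4f}")
print(f"[M+Na]+ (electron removed): {cation_mz:.4f}")
print(f"found vs reported calculated: {ppm_error(reported_found, reported_calculated):.2f} ppm")
```

Printing both variants makes it easy to see which convention a reported "calculated" value follows before judging whether the found value is within tolerance.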
  10. Case study 1: Project Prospect
  11. What is Prospect?
     • Output layer (visible output): RSS; Enhanced HTML; Prospect database; InChI–name pairs (in ChemSpider); Better ontologies
     • Information layer: Enhanced RSC XML
     • Tool layer and input layer: OSCAR; Ontologies; InChI–name pairs (from ChemSpider); RSC XML; Author CDX files
  12. People and machines
     • People: Can understand narratives. Machines: Can’t understand narratives.
     • People: Can interpret pictures. Machines: Can’t interpret pictures.
     • People: Can reason about three-dimensional objects. Machines: Not able to infer 3D structure from 2D without cues.
     • People: Can do a high-quality job. Machines: Can do a lower-quality, but still useful job.
  13. Case study 2: The chair representation issue
     InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2
     WQZGKKKJIJFFOK-UHFFFAOYSA-N
     • 5 stereocentres = 2^5 isomers = 32 structures
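The arithmetic on slide 13 can be made concrete. The InChI shown carries no tetrahedral stereo layer (no `/t` segment), so it cannot distinguish between the possible configurations of the five stereocentres. The sketch below is a minimal illustration under that reading: `has_tetrahedral_stereo` is a hypothetical helper that only inspects layer prefixes, and the `+`/`-` labels simply stand for the two possible configurations at each centre.

```python
from itertools import product

# InChI quoted on the slide: connectivity and hydrogens only.
INCHI = "InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2"


def has_tetrahedral_stereo(inchi):
    """True if any layer after the formula starts with 't' (tetrahedral stereo)."""
    layers = inchi.split("/")[2:]  # skip the "InChI=1S" prefix and the formula
    return any(layer.startswith("t") for layer in layers)


# Each of the 5 stereocentres can take one of two configurations.
n_centres = 5
assignments = list(product("+-", repeat=n_centres))

print("stereo layer present:", has_tetrahedral_stereo(INCHI))
print("possible assignments:", len(assignments))
```

Every one of those assignments collapses to the same stereo-free identifier, which is why a drawing whose stereochemistry cannot be interpreted loses so much information.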
  14. Case study 2: Chair forms of hexacycles: what could go wrong?
  15. How do we “fix” chair representations?
     How we normalize them:
     1. Identify 6-membered rings (Indigo)
     2. Identify what sort of ring it is
     3. Map atoms onto a standard structure (e.g. beta-D-glucopyranose)
     4. Tidy
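Steps 1 and 2 of the normalization on slide 15 can be sketched on a plain molecular graph. The slide names Indigo for ring detection; the example below deliberately does not use Indigo and instead substitutes a small depth-first search so that it is self-contained. The atom numbering, the hand-built heavy-atom graph for a glucopyranose skeleton, and the helper names `six_membered_rings` and `classify_ring` are all assumptions made for illustration. Steps 3 and 4 (atom mapping and tidying) are not attempted.

```python
def six_membered_rings(adjacency):
    """Find each 6-membered ring once, returned as atom indices in ring order."""
    rings = {}

    def walk(path):
        for nxt in adjacency[path[-1]]:
            if len(path) == 6 and nxt == path[0]:
                # Closed a 6-cycle; key on the atom set so both directions collapse.
                rings.setdefault(frozenset(path), tuple(path))
            elif len(path) < 6 and nxt not in path and nxt > path[0]:
                # Only extend through higher-numbered atoms so each ring
                # is discovered from its lowest-numbered atom.
                walk(path + [nxt])

    for start in adjacency:
        walk([start])
    return list(rings.values())


def classify_ring(ring, elements):
    """Very coarse ring typing by element counts (bond orders are ignored)."""
    symbols = sorted(elements[i] for i in ring)
    if symbols == ["C"] * 6:
        return "all-carbon ring"
    if symbols == ["C"] * 5 + ["O"]:
        return "oxane (pyranose-like)"
    return "other"


# Heavy atoms of a glucopyranose skeleton (hypothetical numbering):
# 0-4 ring carbons, 5 ring oxygen, 6 exocyclic carbon, 7-11 hydroxyl oxygens.
elements = ["C"] * 5 + ["O"] + ["C"] + ["O"] * 5
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
         (4, 6), (0, 7), (1, 8), (2, 9), (3, 10), (6, 11)]

adjacency = {i: [] for i in range(len(elements))}
for a, b in bonds:
    adjacency[a].append(b)
    adjacency[b].append(a)

rings = six_membered_rings(adjacency)
for ring in rings:
    print(ring, classify_ring(ring, elements))
```

Keying rings on their atom set is adequate for ordinary molecules; a production toolkit would also handle fused systems and use bond orders when deciding what sort of ring it has found.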
  16. The future: “The digester”
     • Ability to:
       – Reconnect R-groups
       – Expand abbreviations
       – Expand brackets
       – Link structures with reference IDs
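Two of the abilities listed on slide 16, expanding brackets and expanding abbreviations, have a simple text analogue. The slide gives no detail on how the digester would work, and it presumably targets drawn structures rather than strings, so the following is only a toy sketch on condensed formulae. The abbreviation table and the function names `expand_brackets` and `expand_abbreviations` are assumptions made for illustration.

```python
import re

# Small illustrative abbreviation table (an assumption, not an RSC list).
ABBREVIATIONS = {"Me": "CH3", "Et": "C2H5", "Ph": "C6H5"}

# Innermost bracketed group followed by a repeat count, e.g. "(CH2)3".
_BRACKET = re.compile(r"\(([^()]+)\)(\d+)")
# Longest abbreviations first so longer keys win over shorter prefixes.
_ABBREV = re.compile("|".join(sorted(ABBREVIATIONS, key=len, reverse=True)))


def expand_brackets(text):
    """Repeat bracketed groups until no '(group)n' patterns remain."""
    previous = None
    while previous != text:
        previous = text
        text = _BRACKET.sub(lambda m: m.group(1) * int(m.group(2)), text)
    return text


def expand_abbreviations(text):
    """Replace known group abbreviations with their condensed formulae."""
    return _ABBREV.sub(lambda m: ABBREVIATIONS[m.group(0)], text)


print(expand_brackets("CH3(CH2)3CH3"))
print(expand_abbreviations("PhOMe"))
```

Looping until the string stops changing lets nested brackets unwind from the inside out without a full parser.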
  17. Other examples that we didn’t mention in case studies
     • CIF data importer
     • Structure Validation and Standardisation
       – (Thurs Aug 23, 9:15 am, Marriott Downtown, Franklin Hall 6)
     • Work on creation of ontologies, RXNO, CMO
       – Also collaborating on: ChEBI ontology, GO, SO
     • Collaboration with Utopia to enable Prospect mark-up of PDFs
  18. Summary
     • Many data sharing practices are based on:
       – Traditional print articles
       – Consumption of data by humans only
     • This poses issues for publishers and users alike
     • The RSC is developing innovative solutions to address some of these problems
       – Chemical structures are challenging
       – There are limitations to what machine methods can achieve
       – Need to educate authors to think differently
  19. Acknowledgements
     • Colin Batchelor: Development and Technical work
     • Jeff White & Aileen Day
     • Richard Kidd, Graham McCann and Will Russell
     • RSC ICT staff
  20. Thank you
     Email: chemspider@rsc.org
     Twitter: @ChemSpider
     http://www.chemspider.com