Capturing Chemistry In XML

Presentation at the ACS conference. How can we, and how do we, convert various legacy formats into XML and CML?

Transcript of "Capturing Chemistry In XML"

  1. Capturing Chemistry in XML/CML. J. A. Townsend*, S. E. Adams*, J. M. Goodman*, P. Murray-Rust*, C. A. Waudby*. ACS March 2004. *Unilever Centre for Molecular Informatics, University of Cambridge
  2. The Agony Of Publication - Loss: The World
  3. The Agony Of Publication - Loss: The World; Sad; The Scientist; The Lab; Journals; Web Pages
  4. The Vision-1: Human-readable: mp 65-66 °C. Machine-readable: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66"/>
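As a rough illustration of slide 4, the human-readable and machine-readable forms of a melting range can be converted into each other in a few lines of Python. This is a minimal sketch using only the standard library; the function names are invented for this example and do not come from any CML toolkit.

```python
import xml.etree.ElementTree as ET


def melting_range_to_scalar(min_c, max_c):
    """Build the machine-readable <scalar> element shown on slide 4."""
    return ET.Element("scalar", {
        "dictRef": "ccml:mp",
        "units": "units:c",
        "minValue": str(min_c),
        "maxValue": str(max_c),
    })


def scalar_to_text(scalar):
    """Render the element back into the human-readable form."""
    return "mp {}-{} \u00b0C".format(
        scalar.get("minValue"), scalar.get("maxValue"))


scalar = melting_range_to_scalar(65, 66)
# Serialise the element to an XML string.
print(ET.tostring(scalar, encoding="unicode"))
```

Keeping the numbers in attributes rather than in free text is what lets a program read the range without re-parsing the sentence.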
  5. The Vision-2: Chemists can carry on doing what they want. But also: reuse chemistry; archive data; ensure validity of data; create new sources of data / molecules
  6. Our Approach: Let chemists use familiar programs and document templates; focus on Journal Articles, Theses, CompChem; create data for knowledge-based discovery; let computers do the work; evolution…
  7. Machine Parsing of Chemistry: Structured (CompChem); Semi-Structured (Articles); Unstructured (Discussion); MACHINE PARSING?; Structured documents and data in XML
  8. How? Article (semi-structured): Abstract; Discussion; Experimental. Add Structure; Parse with Regular Expressions; Legacy to CML converters
  9. Regular Expressions: Melting point, two possible syntaxes: "m.p. > 23.5 °C" and "mp 23.5 – 25 °C". Pattern: [Mm]\.?p\p{Punct}?\s+>?\s?\d*\.?\d?\s?-\s?\d*?\.?\d?\s°?\s?C. Annotations: capital or lowercase 'm'; maybe '.'; lowercase 'p'; any punctuation; maybe whitespace; 0 or more digits; maybe degrees sign; capital 'C'
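The two melting-point syntaxes on slide 9 can be matched with Python's `re` module. The pattern below is a simplified rewrite for illustration, not the pattern from the slide: Python's `re` has no `\p{Punct}` class, so the optional punctuation after "p" is narrowed to a full stop, and the named groups and the `parse_melting_point` helper are assumptions made for this sketch.

```python
import re

MP_PATTERN = re.compile(
    r"[Mm]\.?p\.?"                                   # mp, m.p., Mp, ...
    r"\s+"                                           # whitespace after the label
    r"(?P<gt>>)?\s?"                                 # optional 'greater than'
    r"(?P<low>\d+(?:\.\d+)?)"                        # single value or lower bound
    r"(?:\s?[-\u2013]\s?(?P<high>\d+(?:\.\d+)?))?"   # optional upper bound
    r"\s?\u00b0?\s?C"                                # optional degree sign, then C
)


def parse_melting_point(text):
    """Return the parsed melting point, or None if nothing matches."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    high = match.group("high")
    return {
        "greater_than": match.group("gt") is not None,
        "low": float(match.group("low")),
        "high": float(high) if high is not None else None,
    }


print(parse_melting_point("m.p. > 23.5 \u00b0C"))
print(parse_melting_point("mp 23.5 \u2013 25 \u00b0C"))
```

Unlike the slide's pattern, this version requires at least one digit, which avoids matching a bare "mp C" at the cost of being stricter than the original.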
  10. CML - XML For Chemistry: Based on W3C XML Schemas; 300+ components; customisable; extensible through dictionaries; openly available software. (J. Chem. Inf. Comp. Sci., 2003, 43, 757)
  11. The CML Family: Controlled XML namespaces: CMLCore (compounds and properties); CMLReact (reactions); CMLSpect (spectra); *CMLComp (compChem); CMLCryst (crystallography and condensed matter). Interoperates with HTML, MathML, SVG, *AniML+, *ThermoML$, etc. (+ spectra: ANSI/JCAMP; $ thermochemistry: NIST.) (J. Chem. Inf. Comp. Sci., 2003, 43, 757)
  12. Case Studies: Parsing output from 750,000 MOPAC jobs; high-throughput parsing of journals
  13. CompChem Logs: Coordinates; Molecular Formula; Calculation Type; Point Group; Dipole; Total Energy
  14. Loss From CompChem: Coordinates; Molecular Formula; Calculation Type; Dipole; Total Energy; Ionisation Potential
  15. Loss From CompChem: Coordinates; Molecular Formula; Calculation Type; Dipole; Total Energy; Ionisation Potential
  16. Parsing Data: CompChem Output: Coordinates; Energy Levels; Vibrations. General Parsers. CML File: CMLCore; CMLCore; CMLComp; CMLSpect; Input/jobControl
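The log-to-CML step on slide 16 might look roughly like the following sketch. The log excerpt, the `dictRef` values other than the `ccml:` prefix, the unit identifiers, and the `<module>` wrapper are all hypothetical; real MOPAC output is laid out differently and the talk does not show the parsers themselves.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical log excerpt, invented for this example.
SAMPLE_LOG = """\
TOTAL ENERGY = -1234.5 EV
DIPOLE = 1.85 DEBYE
POINT GROUP = C2v
"""

# (label in the log, hypothetical dictRef, hypothetical units or None)
FIELDS = [
    ("TOTAL ENERGY", "ccml:totalEnergy", "units:ev"),
    ("DIPOLE", "ccml:dipole", "units:debye"),
    ("POINT GROUP", "ccml:pointGroup", None),
]


def log_to_cml(log_text):
    """Collect labelled values from the log into <scalar> elements."""
    module = ET.Element("module")
    for label, dict_ref, units in FIELDS:
        pattern = rf"^{re.escape(label)}\s*=\s*(\S+)"
        match = re.search(pattern, log_text, re.MULTILINE)
        if match is None:
            continue  # field absent from this log
        attributes = {"dictRef": dict_ref}
        if units is not None:
            attributes["units"] = units
        scalar = ET.SubElement(module, "scalar", attributes)
        scalar.text = match.group(1)
    return module


cml = log_to_cml(SAMPLE_LOG)
print(ET.tostring(cml, encoding="unicode"))
```

Driving the parser from a table of fields keeps each new property a one-line addition, which matters when the same code has to cope with hundreds of thousands of jobs.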
  17. Display Process 1: CompChem Log; Xindice; CML; XSLT
  18. Display Process 2: CML File: CMLCore; CMLCore; CMLComp; CMLSpect. compChem Output: 3D structure, electronic properties; Coordinates; Energy Levels; Vibrations; Input/jobControl. XSLT Display: Normal modes; 2D structure, thermodynamic properties
  19. Parsing Data: Dictionary Entry: The pointgroup of a molecule ... The Schoenflies convention is normally used, but Hermann-Mauguin is also allowed. Units entry: D [debye]; ParentSI: c.m; Multiplier: 3.335641E-30; CGS units for electric dipole
  20. Dictionaries: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66"/> is linked to the CML schema and accesses the CCML namespace. Units dictionary entry: id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp". Dictionary entry: id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
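The units entries on slides 19 and 20 carry enough information to convert a value to SI. The sketch below assumes the conversion is `value * multiplierToSI + constantToSI`, which is an interpretation of the attribute names rather than something stated in the talk; the `constantToSI` of 0 for debye is also an assumption, since the slide lists only a multiplier.

```python
# Entries transcribed from slides 19 and 20.
# ASSUMPTION: debye has constantToSI = 0 (the slide gives only a multiplier).
UNITS = {
    "celsius": {
        "parentSI": "k",
        "multiplierToSI": 1.0,
        "constantToSI": 273.15,
    },
    "debye": {
        "parentSI": "c.m",
        "multiplierToSI": 3.335641e-30,
        "constantToSI": 0.0,
    },
}


def to_si(value, unit_id):
    """Convert a value to its parent SI unit using the dictionary entry.

    ASSUMPTION: si = value * multiplierToSI + constantToSI.
    """
    entry = UNITS[unit_id]
    si_value = value * entry["multiplierToSI"] + entry["constantToSI"]
    return si_value, entry["parentSI"]


# The melting range from the <scalar> example, converted to kelvin.
print(to_si(65, "celsius"))
print(to_si(66, "celsius"))
# A dipole of 1 D expressed in coulomb metres.
print(to_si(1.0, "debye"))
```

Because the conversion factors live in the dictionary and not in the code, adding a new unit means adding an entry, not writing a new function.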
  21. OSCAR: Open Source Chemistry Analysis Routines. Sponsored by the Royal Society of Chemistry (Cambridge). Mounted on http://www.rsc.org/
  22. Article Structure: Article: Front Matter; Abstract; Introduction; Discussion; Experimental; References; Results
  23. Article Structure: Article: Front Matter; Abstract; Introduction; Discussion; Experimental; References; Results
  24. Article Structure: Article: Front Matter; Abstract; Introduction; Discussion; Experimental; References; Results
  25. Article Structure: Article: Front Matter; Abstract; Introduction; Discussion; Experimental; References; Results
  26. Article Structure: Article: Front Matter; Abstract; Introduction; Discussion; Experimental; References; Results
  27. Article Structure: Article: Front Matter; Abstract; Introduction; Discussion; Experimental; References; Results. Experimental: Synthesis; Set up; Analysis; Compound Name
  28. Information Checked / Extracted: Chemical name; Yield; Boiling / Melting point; Carbon NMR; Hydrogen NMR; Infra Red spectrometry; Mass spectrometry; Elemental Analysis; Optical Rotation; Refractive Index; Rf value; Ultra Violet spectrometry; Nature (colour, state, modifiers, description, etc.)
  29. OSCAR Parsing Data: H NMR; Nature; HRMS
  30. OSCAR Parsing Data
  31. OSCAR Data Found: Results from one paper
  32. OSCAR Error Checking: Serious Error; Warning Type 1; Warning Type 2
  33. OSCAR Error Checking: ~30 errors / warnings searched for. This article has: 4 errors; 2 warnings (type 1); 30 warnings (type 2). Example: elemental analysis incorrect: calculations are for a different molecular formula
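The elemental-analysis error on slide 33 is the kind of check that is easy to automate once the formula and the reported percentages have been extracted. The sketch below is not OSCAR's code: the function names, the short table of rounded atomic masses, and the 0.4 percentage-point tolerance are assumptions chosen for illustration, and the formula parser handles only simple formulae without brackets.

```python
import re

# Rounded standard atomic masses; extend the table as needed.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def percent_composition(formula):
    """Return the mass percentage of each element in a simple formula."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    total = sum(ATOMIC_MASS[s] * n for s, n in counts.items())
    return {s: 100.0 * ATOMIC_MASS[s] * n / total for s, n in counts.items()}


def check_analysis(formula, reported, tolerance=0.4):
    """List the elements whose reported percentage disagrees with the formula.

    ASSUMPTION: a difference above `tolerance` percentage points is an error.
    """
    calculated = percent_composition(formula)
    return [
        element
        for element, value in reported.items()
        if abs(value - calculated.get(element, 0.0)) > tolerance
    ]


# Values consistent with benzene pass; values for another formula are flagged.
print(check_analysis("C6H6", {"C": 92.3, "H": 7.7}))
print(check_analysis("C6H6", {"C": 85.6, "H": 14.4}))
```

A mismatch on every element at once, as in the second call, is the signature of percentages calculated for a different molecular formula.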
  34. OSCAR Data Presentation
  35. OSCAR Speed: A typical paper contains ca. 20 compounds. JOC (Feb 2004) contains ~600 compounds; OSCAR could extract and tabulate in under 5 minutes. OBC (Feb 2004) contains ~300 compounds; OSCAR could extract and tabulate in under 3 minutes. High throughput, high precision
  36. OSCAR Accuracy: 92 % of data correctly identified; 3 % incorrect author entry; 5 % missed; false positives: 3 %. Test set: 437 items, ~10,000 data fields, working with current Regular Expressions
  37. XML-CML Databases: CML; Journals; Theses; CompChem; Xindice. XMLDb can support > 250,000 molecules. Millisecond retrieval on INChI, properties
  38. Capturing Molecules: Encourage chemists to: autogenerate IUPAC INChI universal identifier; embed MDLMol or Chemdraw files in MSWord; autoconvert to CML connection table. Next phase: parse chemical names into CML using modern NLP (Natural Language Processing); learning-machine rather than rule-based
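The "autoconvert to CML connection table" step on slide 38 can be illustrated with a small MDL molfile reader. This is a sketch under stated assumptions, not the converter used in the project: the sample molecule is hand-written, the function name is invented, and fields are split on whitespace for brevity even though the V2000 format is fixed-width, so it will fail on large molecules whose counts run together.

```python
import xml.etree.ElementTree as ET

# Hand-written sample: ethanol heavy atoms, illustrative coordinates.
SAMPLE_MOL = """\
ethanol, heavy atoms only
  hand-written sample
  coordinates are illustrative
  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.2000    1.2000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
M  END
"""


def molfile_to_cml(mol_text):
    """Convert the atom and bond blocks of a small molfile to CML."""
    lines = mol_text.splitlines()
    counts = lines[3].split()  # counts line follows the 3 header lines
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    molecule = ET.Element("molecule")
    atom_array = ET.SubElement(molecule, "atomArray")
    for index, line in enumerate(lines[4:4 + n_atoms], start=1):
        x, y, z, symbol = line.split()[:4]
        ET.SubElement(atom_array, "atom", {
            "id": f"a{index}", "elementType": symbol,
            "x3": x, "y3": y, "z3": z,
        })

    bond_array = ET.SubElement(molecule, "bondArray")
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = line.split()[:3]
        ET.SubElement(bond_array, "bond", {
            "atomRefs2": f"a{first} a{second}", "order": order,
        })
    return molecule


molecule = molfile_to_cml(SAMPLE_MOL)
print(ET.tostring(molecule, encoding="unicode"))
```

Atom identifiers are generated from the molfile's 1-based atom numbers, so each bond's `atomRefs2` can refer back to them directly.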
  39. NLP & Parsing Names: KEY: Locant; Characteristic Group; Monovalent parent hydride; Multiplier; Heterocyclic parent hydride
  40. Thank You: Unilever; RSC; Jonathan Goodman; Sam Adams; Fraser Norton; Chris Waudby; Yong Zhang
