Capturing Chemistry in XML/CML. J. A. Townsend*, S. E. Adams*, J. M. Goodman*, P. Murray-Rust*, C. A. Waudby* (* Unilever Centre for Molecular Informatics, University of Cambridge). ACS, March 2004.
The Agony Of Publication - Loss: The World; Sad; The Scientist; The Lab; Journals; Web Pages
The Vision-1: Human-readable: mp 65-66 °C. Machine-readable: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66"/>
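The mapping from the human-readable melting range to the machine-readable element can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the function name `melting_point_scalar` is hypothetical, the attribute names and values are the ones shown on the slide, and the CML namespace declaration is left out for brevity.

```python
import xml.etree.ElementTree as ET


def melting_point_scalar(min_c: float, max_c: float) -> str:
    """Serialise a melting range (degrees Celsius) as a CML-style <scalar>."""
    attrs = {
        "dictRef": "ccml:mp",      # dictionary term, as shown on the slide
        "units": "units:c",        # units reference, as shown on the slide
        "minValue": f"{min_c:g}",  # ':g' drops a trailing '.0'
        "maxValue": f"{max_c:g}",
    }
    # Assumption: no namespace is attached; a real CML document would declare one.
    return ET.tostring(ET.Element("scalar", attrs), encoding="unicode")


print(melting_point_scalar(65, 66))
```

Parsing the result back with `ET.fromstring` recovers the four attributes, which is what makes the value reusable by other programs.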
The Vision-2: Chemists can carry on doing what they want. But also: reuse chemistry; archive data; ensure validity of data; create new sources of data / molecules
Our Approach: Let chemists use familiar programs … and document templates; focus on journal articles, theses, CompChem; create data for knowledge-based discovery; let computers do the work; evolution…
Machine Parsing of Chemistry: Structured (CompChem); Semi-Structured (Articles); Unstructured (Discussion); MACHINE PARSING?; Structured documents and data in XML
How? Article (semi-structured): Abstract; Discussion; Experimental. Add structure; parse with regular expressions; legacy-to-CML converters
Regular Expressions: Melting point, two possible syntaxes: "m.p. > 23.5 °C" and "mp 23.5 – 25 °C". Pattern: [Mm]\.?p\p{Punct}?\s+>?\s?\d*\.?\d?\s?-\s?\d*?\.?\d?\s°?\s?C Annotations: capital or lowercase 'm'; maybe '.'; lowercase 'p'; any punctuation; maybe whitespace; 0 or more digits; maybe degrees sign; capital 'C'
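A hedged Python sketch of the same idea follows. It is not the pattern from the slide: Python's `re` module has no `\p{Punct}` class, so an explicit punctuation set stands in for it, and the two syntaxes are written as separate alternatives with named groups. The names `MP_PATTERN` and `parse_melting_point` are hypothetical.

```python
import re

# Assumption: an explicit character class replaces the \p{Punct} of the slide's
# pattern, and both a hyphen and an en dash are accepted as the range separator.
MP_PATTERN = re.compile(
    r"""
    \b[Mm]\.?p[.:,;]?                 # 'mp', 'm.p', 'Mp.', 'm.p.'
    \s+
    (?:
        >\s?(?P<lower>\d+(?:\.\d+)?)  # syntax 1: lower bound only
      |
        (?P<min>\d+(?:\.\d+)?)        # syntax 2: range minimum
        \s?[-\u2013]\s?
        (?P<max>\d+(?:\.\d+)?)        # syntax 2: range maximum
    )
    \s?°?\s?C                         # optional degree sign, capital C
    """,
    re.VERBOSE,
)


def parse_melting_point(text):
    """Return a dict of minValue/maxValue floats, or None if nothing matches."""
    m = MP_PATTERN.search(text)
    if m is None:
        return None
    if m.group("lower") is not None:
        return {"minValue": float(m.group("lower"))}
    return {"minValue": float(m.group("min")), "maxValue": float(m.group("max"))}


for sample in ("m.p. > 23.5 °C", "mp 23.5 \u2013 25 °C", "yield 80%"):
    print(sample, "->", parse_melting_point(sample))
```

Named groups keep the two syntaxes apart, so the caller can tell a lower bound from a full range without re-inspecting the text.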
CML - XML For Chemistry: based on W3C XML Schemas; 300+ components; customisable; extensible through dictionaries; openly available software. Reference: J. Chem. Inf. Comp. Sci., 2003, 43, 757
The CML Family: Controlled XML namespaces: CMLCore (compounds and properties); CMLReact (reactions); CMLSpect (spectra); *CMLComp (compChem); CMLCryst (crystallography and condensed matter). Interoperates with HTML, MathML, SVG, *AniML+, *ThermoML$, etc. (+ spectra: ANSI/JCAMP; $ thermochemistry: NIST). Reference: J. Chem. Inf. Comp. Sci., 2003, 43, 757
Case Studies: parsing output from 750,000 MOPAC jobs; high-throughput parsing of journals
CompChem Logs: Coordinates; Molecular Formula; Calculation Type; Point Group; Dipole; Total Energy
Loss From CompChem: Coordinates; Molecular Formula; Calculation Type; Dipole; Total Energy; Ionisation Potential
Parsing Data: CompChem Output (Coordinates; Energy Levels; Vibrations); General Parsers; CML File (Input/jobControl; Coordinates; Energy Level; Vibration) using CMLCore, CMLComp, CMLSpect
Display Process 1: CompChem Log; Xindice; CML; XSLT
Display Process 2: compChem Output; CML File (CMLCore, CMLComp, CMLSpect): Input/jobControl; Coordinates; Energy Levels; Vibrations. XSLT Display: 3D structure, electronic properties; normal modes; 2D structure, thermodynamic properties
Parsing Data: Dictionary entry: "The pointgroup of a molecule ... The Schoenflies convention is normally used, but Hermann-Mauguin is also allowed." Units entry: D [debye]; ParentSI: c.m; Multiplier: 3.335641E-30; CGS units for electric dipole
Dictionaries: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66"/> Linked to CML schema; accesses CCML namespace; units dictionary. Units entry: id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp". Dictionary entry: id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
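The units entries suggest how a program could normalise a value to its parent SI unit. The sketch below assumes the conversion is `value * multiplierToSI + constantToSI`, which is consistent with the Celsius entry (multiplier 1, constant 273.15) but is not stated on the slides. The `Unit` class, the `UNITS` table and the key `units:debye` are hypothetical; the numbers are the ones shown in the dictionary entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    parent_si: str
    multiplier_to_si: float
    constant_to_si: float = 0.0

    def to_si(self, value: float) -> float:
        # Assumed conversion rule: scale first, then add the offset.
        return value * self.multiplier_to_si + self.constant_to_si


UNITS = {
    # Celsius entry from the units dictionary slide.
    "units:c": Unit(parent_si="k", multiplier_to_si=1.0, constant_to_si=273.15),
    # Debye entry from the parsing slide; the key name is an assumption.
    "units:debye": Unit(parent_si="c.m", multiplier_to_si=3.335641e-30),
}

melting_range_c = (65.0, 66.0)
melting_range_k = tuple(UNITS["units:c"].to_si(v) for v in melting_range_c)
print(melting_range_k)                  # melting range in kelvin
print(UNITS["units:debye"].to_si(1.0))  # one debye in coulomb metres
```

Keeping the factors in a dictionary rather than in code means a new unit needs only a new entry, which matches the "extensible through dictionaries" point made earlier.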
OSCAR: Open Source Chemistry Analysis Routines. Sponsored by the Royal Society of Chemistry (Cambridge); mounted on http://www.rsc.org/
Article Structure: Article: Front Matter; Abstract; Introduction; Discussion; Experimental; References; Results. Experimental: Compound Name; Synthesis; Set up; Analysis
Information Checked / Extracted: Chemical name; Yield; Boiling / Melting point; Carbon NMR; Hydrogen NMR; Infrared spectrometry; Mass spectrometry; Elemental analysis; Optical rotation; Refractive index; Rf value; Ultraviolet spectrometry; Nature (colour, state, modifiers, description, etc.)
OSCAR Parsing Data: H NMR; Nature; HRMS
OSCAR Data Found: Results from one paper
OSCAR Error Checking: Serious Error; Warning Type 1; Warning Type 2
OSCAR Error Checking: ~30 errors / warnings searched for. This article has: 4 errors; 2 warnings (type 1); 30 warnings (type 2). Example: elemental analysis incorrect: calculations are for a different molecular formula
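The elemental analysis error above is the kind of check that is easy to automate once the formula and the reported values are machine-readable. The following is a sketch under stated assumptions, not OSCAR's implementation: the function names, the small atomic mass table, the simple formula parser (no brackets or charges) and the 0.05 percentage point tolerance are all hypothetical.

```python
import re

# Assumption: standard atomic masses for a handful of common elements only.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no brackets)."""
    counts = {}
    for symbol, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(n) if n else 1)
    return counts


def percent_composition(formula):
    """Mass percentage of each element in the formula."""
    counts = parse_formula(formula)
    total = sum(ATOMIC_MASS[s] * n for s, n in counts.items())
    return {s: 100.0 * ATOMIC_MASS[s] * n / total for s, n in counts.items()}


def mismatched_elements(formula, reported_calcd, tolerance=0.05):
    """Elements whose reported 'calcd' percentage disagrees with the formula."""
    calc = percent_composition(formula)
    return [
        element
        for element, value in reported_calcd.items()
        if abs(calc.get(element, 0.0) - value) > tolerance
    ]


# Reported values consistent with C6H12O6, then values computed for C6H14O6.
print(mismatched_elements("C6H12O6", {"C": 40.00, "H": 6.71}))
print(mismatched_elements("C6H12O6", {"C": 39.56, "H": 7.75}))
```

A non-empty result means the author's calculated percentages do not belong to the stated formula, which is the situation flagged on the slide.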
OSCAR Data Presentation
OSCAR Speed: A typical paper contains ca. 20 compounds. JOC (Feb 2004) contains ~600 compounds; OSCAR could extract and tabulate in under 5 minutes. OBC (Feb 2004) contains ~300 compounds; OSCAR could extract and tabulate in under 3 minutes. High throughput, high precision
OSCAR Accuracy: 92 % of data correctly identified; 3 % incorrect author entry; 5 % missed; false positives: 3 %. Test set: 437 items, ~10,000 data fields, working with current regular expressions
XML-CML Databases: Journals; Theses; CompChem; CML; Xindice. XMLDb can support > 250,000 molecules; millisecond retrieval on INChI, properties
Capturing Molecules: Encourage chemists to embed MDLMol or ChemDraw files in MSWord; autoconvert to CML connection table; autogenerate IUPAC INChI universal identifier. Next phase: parse chemical names into CML using modern NLP (Natural Language Processing), learning-machine rather than rule-based
NLP & Parsing Names: KEY: Locant; Characteristic Group; Monovalent parent hydride; Multiplier; Heterocyclic parent hydride
Thank You: Unilever; RSC; Jonathan Goodman; Sam Adams; Fraser Norton; Chris Waudby; Yong Zhang
