Capturing Chemistry In XML


Presentation at an ACS conference. How can we, and how do we, convert various legacy formats into XML and CML?

Capturing Chemistry In XML: Presentation Transcript

  • Capturing Chemistry in XML/CML
    J. A. Townsend*, S. E. Adams*, J. M. Goodman*, P. Murray-Rust*, C. A. Waudby*
    * Unilever Centre for Molecular Informatics, University of Cambridge
    Capturing Chemistry in XML/CML ACS March 2004
  • The Agony Of Publication - Loss
    [Diagram labels: The World; Sad; The Scientist; The Lab; Journals; Web Pages]
  • The Vision-1
    Human-readable: mp 65-66 °C
    Machine-readable: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />
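The mapping from the human-readable melting range to the machine-readable element can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name `melting_range_to_scalar` is hypothetical, the attribute names and values are copied from the slide, and the CML namespace declaration that a real document would need is omitted.

```python
import xml.etree.ElementTree as ET


def melting_range_to_scalar(min_c, max_c):
    """Build a <scalar> element for a melting range given in degrees Celsius.

    Attribute names and the 'ccml:mp' / 'units:c' values come from the slide;
    everything else here is an illustrative assumption.
    """
    elem = ET.Element(
        "scalar",
        {
            "dictRef": "ccml:mp",   # dictionary entry for melting point
            "units": "units:c",     # units dictionary entry for Celsius
            "minValue": str(min_c),
            "maxValue": str(max_c),
        },
    )
    # Serialise to a text string rather than bytes.
    return ET.tostring(elem, encoding="unicode")


if __name__ == "__main__":
    print(melting_range_to_scalar(65, 66))
```

Building the element through an XML library rather than string concatenation keeps attribute quoting and escaping correct by construction.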
  • The Vision-2
    • Chemists can carry on doing what they want
    But also
    • Reuse chemistry
    • Archive data
    • Ensure validity of data
    • Create new sources of data / molecules
  • Our Approach
    • Let chemists use familiar programs …
    • … and document templates
    • Focus on Journal Articles, Theses, CompChem
    • Create data for knowledge-based discovery
    • Let computers do the work
    • Evolution…
  • Machine Parsing of Chemistry
    [Diagram labels: Structured (CompChem); Semi-Structured (Articles); Unstructured (Discussion); MACHINE PARSING ?; Structured documents and data in XML]
  • How?
    [Diagram labels: Article (semi-structured): Abstract, Discussion, Experimental; Add Structure; Parse with Regular Expressions; Legacy to CML converters]
  • Regular Expressions
      • [Mm]\.?p\p{Punct}?\s+>?\s?\d*\.?\d?\s?-\s?\d*?\.?\d?\s°?\s?C
    Melting point: two possible syntaxes
    m.p. > 23.5 °C
    mp 23.5 – 25 °C
    Annotations: Capital or lowercase ‘m’; Maybe ‘.’; Lowercase ‘p’; Any punctuation; Maybe whitespace; 0 or more digits; Maybe degrees sign; Capital ‘C’
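A minimal Python sketch of the same idea follows. It is not the authors' expression: the slide's pattern appears to rely on a `\p{Punct}` class, which Python's built-in `re` module does not support, so the pattern below is a simplified stand-in written for this example. The names `MP_PATTERN` and `find_melting_points` are hypothetical.

```python
import re

# Simplified stand-in pattern (an assumption, not the expression on the slide).
MP_PATTERN = re.compile(
    r"""
    \b[Mm]\.?\s?p\.?                         # 'mp', 'm.p.', 'Mp', ...
    [:,]?\s*                                 # optional punctuation, whitespace
    (?P<gt>>)?\s*                            # optional '>' sign
    (?P<low>\d+(?:\.\d+)?)                   # first temperature
    (?:\s*[-–]\s*(?P<high>\d+(?:\.\d+)?))?   # optional end of a range
    \s*°?\s*C\b                              # optional degree sign, capital C
    """,
    re.VERBOSE,
)


def find_melting_points(text):
    """Return one dict per melting point found in the text."""
    results = []
    for match in MP_PATTERN.finditer(text):
        high = match.group("high")
        results.append(
            {
                "low": float(match.group("low")),
                "high": float(high) if high else None,
                "greater_than": match.group("gt") is not None,
            }
        )
    return results


if __name__ == "__main__":
    sample = "White solid, m.p. > 23.5 °C. Second batch: mp 23.5 – 25 °C."
    for hit in find_melting_points(sample):
        print(hit)
```

Named groups keep the extracted minimum and maximum separate, which makes it straightforward to fill `minValue` and `maxValue` attributes afterwards.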
  • CML - XML For Chemistry
    • Based on W3C XML Schemas
    • 300+ components
    • Customisable
    • Extensible through dictionaries
    • Openly available software
    J. Chem. Inf. Comp. Sci., 2003, 43, 757
  • The CML Family
    Controlled XML Namespaces:
    • CMLCore – compounds and properties
    • CMLReact – reactions
    • CMLSpect – spectra
    • * CMLComp – compChem
    • CMLCryst – crystallography and condensed matter
    Interoperates with HTML, MathML, SVG, * AniML +, * ThermoML $, etc.
    + spectra: ANSI/JCAMP
    $ thermochemistry: NIST
    J. Chem. Inf. Comp. Sci., 2003, 43, 757
  • Case Studies
    • Parsing output from 750,000 MOPAC jobs
    • High-throughput parsing of journals
  • CompChem Logs
    [Figure labels: Coordinates; Molecular Formula; Calculation Type; Point Group; Dipole; Total Energy]
  • Loss From CompChem
    [Figure labels: Coordinates; Molecular Formula; Calculation Type; Dipole; Total Energy; Ionisation Potential]
  • Parsing Data
    [Diagram labels: CompChem Output (Coordinates, Energy Levels, Vibrations); General Parsers; Input/jobControl; CML File (CMLCore, CMLCore, CMLComp, CMLSpect)]
  • Display Process 1
    [Diagram labels: CompChem Log; Xindice; CML; XSLT]
  • Display Process 2
    [Diagram labels: compChem Output (Coordinates, Energy Levels, Vibrations); Input/jobControl; CML File (CMLCore, CMLCore, CMLComp, CMLSpect); XSLT; Display: 3D structure, electronic properties; Normal modes; 2D structure, thermodynamic properties]
  • Parsing Data
    Dictionary Entry: The pointgroup of a molecule ... The Schoenflies convention is normally used, but Hermann-Mauguin is also allowed.
    D [debye]: CGS units for electric dipole; ParentSI: c.m; Multiplier: 3.335641E-30
  • Dictionaries
    <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />
    • Linked to CML schema
    • Accesses CCML namespace
    • Units dictionary
    Units dictionary entry: id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp"
    Dictionary entry: id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
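One plausible reading of the `multiplierToSI` and `constantToSI` attributes is a linear conversion, SI value = value × multiplier + constant. The sketch below applies that reading to the Celsius entry on this slide and to the debye entry on the preceding slide. The formula itself, the `Unit` class, and the `units:debye` key are assumptions made for illustration; only the numeric values are taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One units-dictionary entry (field names mirror the slide's attributes)."""

    id: str
    parent_si: str
    multiplier_to_si: float
    constant_to_si: float = 0.0

    def to_si(self, value):
        # Assumed linear conversion: SI = value * multiplier + constant.
        return value * self.multiplier_to_si + self.constant_to_si


# Numeric values come from the slides; the 'units:debye' key is hypothetical.
UNITS = {
    "units:c": Unit("celsius", "k", 1.0, 273.15),
    "units:debye": Unit("debye", "c.m", 3.335641e-30),
}

if __name__ == "__main__":
    celsius = UNITS["units:c"]
    # Melting range 65-66 degrees Celsius expressed in kelvin.
    print(celsius.to_si(65), celsius.to_si(66))
    # A dipole of 1.85 D (an arbitrary example value) in coulomb metres.
    print(UNITS["units:debye"].to_si(1.85))
```

Keeping the conversion data in a dictionary rather than in code means a new unit needs only a new entry, which matches the extensibility through dictionaries described earlier in the talk.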
  • OSCAR
    Open Source Chemistry Analysis Routines
    Sponsored by the Royal Society of Chemistry (Cambridge)
    Mounted on http://www.rsc.org/
  • Article Structure
    [Diagram labels: Article: Front Matter, Abstract, Introduction, Discussion, Experimental, References, Results; Experimental: Synthesis, Set up, Analysis, Compound Name]
  • Information Checked / Extracted
    • Chemical name
    • Yield
    • Boiling / Melting point
    • Carbon NMR
    • Hydrogen NMR
    • Infra Red spectrometry
    • Mass spectrometry
    • Elemental Analysis
    • Optical Rotation
    • Refractive Index
    • Rf value
    • Ultra Violet spectrometry
    • Nature (colour, state, modifiers, description, etc.)
  • OSCAR Parsing Data
    [Figure labels: H NMR; Nature; HRMS]
  • OSCAR Data Found
    Results from one paper
  • OSCAR Error Checking
    [Figure labels: Serious Error; Warning Type 1; Warning Type 2]
  • OSCAR Error Checking
    • ~30 errors / warnings searched for
    • This article has: 4 errors, 2 warnings (type 1), 30 warnings (type 2)
    • Elemental analysis incorrect – calculations are for a different molecular formula
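The elemental-analysis check mentioned on this slide can be illustrated with a short Python sketch: compute the mass percentages implied by a molecular formula and flag any reported value that differs by more than a tolerance. This is not OSCAR's code. The function names, the four-element mass table, the flat (no brackets) formula syntax, and the 0.4 percentage-point tolerance are all assumptions chosen for the example.

```python
import re

# Standard atomic masses for a few common elements (table kept short on purpose).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Count atoms in a flat formula such as 'C6H12O6' (no brackets handled)."""
    counts = {}
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(count) if count else 1)
    return counts


def mass_percentages(formula):
    """Return the calculated mass percentage of each element in the formula.

    Raises KeyError for elements missing from ATOMIC_MASS.
    """
    counts = parse_formula(formula)
    total = sum(ATOMIC_MASS[el] * n for el, n in counts.items())
    return {el: 100.0 * ATOMIC_MASS[el] * n / total for el, n in counts.items()}


def check_elemental_analysis(formula, reported, tolerance=0.4):
    """Return {element: (calculated, reported)} for values outside the tolerance.

    The default tolerance of 0.4 percentage points is an assumed threshold.
    """
    calculated = mass_percentages(formula)
    problems = {}
    for element, found in reported.items():
        calc = calculated.get(element, 0.0)
        if abs(calc - found) > tolerance:
            problems[element] = (round(calc, 2), found)
    return problems


if __name__ == "__main__":
    # Values consistent with the formula: nothing is flagged.
    print(check_elemental_analysis("C6H12O6", {"C": 40.1, "H": 6.6}))
    # Carbon value calculated for some other formula: flagged.
    print(check_elemental_analysis("C6H12O6", {"C": 52.2, "H": 6.6}))
```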
  • OSCAR Data Presentation
  • OSCAR Speed
    • A typical paper contains ca. 20 compounds
    • JOC (Feb 2004) contains ~600 compounds; OSCAR could extract and tabulate in under 5 minutes
    • OBC (Feb 2004) contains ~300 compounds; OSCAR could extract and tabulate in under 3 minutes
    • High throughput, high precision
  • OSCAR Accuracy
    • 92 % of data correctly identified
    • 3 % incorrect author entry
    • 5 % missed
    • False positives: 3 %
    • 437 items, ~10,000 data fields in test set, working with current Regular Expressions
  • XML-CML Databases
    [Diagram labels: CML; Journals; Theses; CompChem; Xindice]
    • XMLDb can support > 250,000 molecules
    • Millisecond retrieval on INChI, properties
  • Capturing Molecules
    Encourage chemists to
    • Autogenerate IUPAC INChI universal identifier
    • Embed MDL Mol or ChemDraw files in MS Word
    • Autoconvert to CML connection table
    Next phase:
    • Parse chemical names into CML using modern NLP +
    • Learning-machine rather than rule-based
    + Natural Language Processing
  • NLP & Parsing Names
    KEY: Locant; Characteristic Group; Monovalent parent hydride; Multiplier; Heterocyclic parent hydride
  • Thank You
    Unilever, RSC, Jonathan Goodman, Sam Adams, Fraser Norton, Chris Waudby, Yong Zhang