The document discusses capturing chemistry information in XML/CML format. It describes parsing journal articles and computational chemistry logs to extract structured data like molecular formulas, properties, reactions, and spectra. This data is stored in XML files using the Chemical Markup Language (CML) to enable indexing, searching, and reuse of chemistry data. Tools like OSCAR were created to automatically parse articles and capture information for thousands of compounds.
Capturing Chemistry In XML
1. Capturing Chemistry in XML/CML. J. A. Townsend*, S. E. Adams*, J. M. Goodman*, P. Murray-Rust*, C. A. Waudby*. * Unilever Centre for Molecular Informatics, University of Cambridge. ACS, March 2004.
2. The Agony Of Publication - Loss: The World
3. The Agony Of Publication - Loss: The World; Sad; The Scientist; The Lab; Journals; Web Pages
4. The Vision-1. Human-readable: mp 65-66 C. Machine-readable: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66"/>
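The step from the human-readable string to the machine-readable element on this slide can be sketched in Python. This is only an illustration, not the authors' converter: the regular expression and the function name `melting_point_to_scalar` are assumptions, and only the attribute names and values are taken from the slide.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern for melting points written as "mp 65-66 C" or "mp 65 C".
MP_PATTERN = re.compile(
    r"\bmp\s+(\d+(?:\.\d+)?)(?:\s*-\s*(\d+(?:\.\d+)?))?\s*°?\s*C\b"
)

def melting_point_to_scalar(text):
    """Return a <scalar> element for the first melting point in text, or None."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    low = match.group(1)
    # A single value such as "mp 65 C" is stored as a zero-width range.
    high = match.group(2) or low
    return ET.Element("scalar", {
        "dictRef": "ccml:mp",   # dictionary reference shown on the slide
        "units": "units:c",     # units reference shown on the slide
        "minValue": low,
        "maxValue": high,
    })

element = melting_point_to_scalar("white solid, mp 65-66 C")
print(ET.tostring(element, encoding="unicode"))
```

The element carries the same four attributes as the slide's example, so a downstream tool can read the range without re-parsing free text.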
7. Machine Parsing of Chemistry: Structured (CompChem); Semi-Structured (Articles); Unstructured (Discussion); Structured documents and data in XML; MACHINE PARSING?
8. How? Article (semi-structured): Abstract, Discussion, Experimental. Add Structure; Parse with Regular Expressions; Legacy to CML converters
11. The CML Family. Controlled XML Namespaces: CMLCore: compounds and properties; CMLReact: reactions; CMLSpect: spectra*; CMLComp: compChem; CMLCryst: crystallography and condensed matter. Interoperates with HTML, MathML, SVG, *AniML+, *ThermoML$, etc. (+ spectra: ANSI/JCAMP; $ thermochemistry: NIST). J. Chem. Inf. Comp. Sci., 2003, 43, 757
12. Case Studies: Parsing output from 750,000 MOPAC jobs; High-throughput parsing of journals
13. CompChem Logs: Coordinates, Molecular Formula, Calculation Type, Point Group, Dipole, Total Energy
14. Loss From CompChem: Coordinates, Molecular Formula, Calculation Type, Dipole, Total Energy, Ionisation Potential
16. Parsing Data. CompChem Output: Input/jobControl, Coordinates, Energy Levels, Vibrations. General Parsers. CML File: CMLCore, CMLCore, CMLComp, CMLSpect
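A general parser of the kind named on this slide could look roughly like the following Python sketch. The log excerpt is invented for illustration and does not reproduce real MOPAC output; the `ccml:totalEnergy` and `ccml:dipole` references and the function `log_to_cml` are likewise assumptions. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `x3`, `y3`, `z3`, `scalar`) follow common CML usage.

```python
import re
import xml.etree.ElementTree as ET

# Invented log excerpt for illustration; real MOPAC output is laid out differently.
SAMPLE_LOG = """\
TOTAL ENERGY = -348.56 EV
DIPOLE = 1.85 DEBYE
ATOM O 0.000 0.000 0.117
ATOM H 0.000 0.757 -0.469
ATOM H 0.000 -0.757 -0.469
"""

SCALAR_LINE = re.compile(
    r"^(TOTAL ENERGY|DIPOLE)\s*=\s*(-?\d+(?:\.\d+)?)\s+(\w+)$"
)
ATOM_LINE = re.compile(
    r"^ATOM\s+([A-Z][a-z]?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)$"
)

# Hypothetical dictionary references for the two scalar quantities.
DICT_REFS = {"TOTAL ENERGY": "ccml:totalEnergy", "DIPOLE": "ccml:dipole"}

def log_to_cml(log_text):
    """Convert the simplified log format above into a <molecule> element."""
    molecule = ET.Element("molecule")
    atom_array = ET.SubElement(molecule, "atomArray")
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        scalar = SCALAR_LINE.match(line)
        atom = ATOM_LINE.match(line)
        if scalar:
            name, value, unit = scalar.groups()
            node = ET.SubElement(molecule, "scalar", {
                "dictRef": DICT_REFS[name],
                "units": "units:" + unit.lower(),
            })
            node.text = value
        elif atom:
            symbol, x, y, z = atom.groups()
            # Atom ids are numbered in order of appearance: a1, a2, ...
            atom_id = "a%d" % (len(atom_array) + 1)
            ET.SubElement(atom_array, "atom", {
                "id": atom_id, "elementType": symbol,
                "x3": x, "y3": y, "z3": z,
            })
    return molecule

print(ET.tostring(log_to_cml(SAMPLE_LOG), encoding="unicode"))
```

Lines that match neither pattern are skipped, so the same loop tolerates banner text and other output the parser does not understand.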
17. Display Process 1: CompChem Log; Xindice; CML; XSLT
18. Display Process 2. CML File: CMLCore, CMLCore, CMLComp, CMLSpect. compChem Output: Input/jobControl, Coordinates, Energy Levels, Vibrations. XSLT Display: 3D structure, electronic properties; Normal modes; 2D structure, thermodynamic properties
19. Parsing Data. Dictionary Entry: The point group of a molecule ... The Schoenflies convention is normally used, but Hermann-Mauguin is also allowed. Units entry: D [debye]; CGS units for electric dipole; ParentSI: c.m; Multiplier: 3.335641E-30
20. Dictionaries. <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66"/> Linked to CML schema; Accesses CCML namespace; Units dictionary. Units entry: id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp". Dictionary entry: id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
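The units entries on this slide and the previous one carry enough information to convert a value to its parent SI unit. The Python sketch below assumes the conversion is value × multiplierToSI + constantToSI, which is consistent with the Celsius entry (multiplier 1, constant 273.15) but is not stated on the slides; the `Unit` class and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    """One units-dictionary entry; field names mirror the slide's attributes."""
    id: str
    parent_si: str
    multiplier_to_si: float
    constant_to_si: float = 0.0

    def to_si(self, value):
        # Assumed rule: SI value = value * multiplierToSI + constantToSI.
        return value * self.multiplier_to_si + self.constant_to_si

# The two entries shown on the slides.
UNITS = {
    "celsius": Unit("celsius", "k", 1.0, 273.15),
    "debye": Unit("debye", "c.m", 3.335641e-30),
}

def melting_range_to_si(min_value, max_value, unit_id="celsius"):
    """Convert a melting range to the parent SI unit of the given entry."""
    unit = UNITS[unit_id]
    return unit.to_si(min_value), unit.to_si(max_value)

low_k, high_k = melting_range_to_si(65, 66)
print(round(low_k, 2), round(high_k, 2))
```

Keeping the conversion factors in the dictionary rather than in code means a new unit only needs a new entry.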
21. OSCAR: Open Source Chemistry Analysis Routines. Sponsored by the Royal Society of Chemistry (Cambridge). Mounted on http://www.rsc.org/
22. Article Structure. Article: Front Matter, Abstract, Introduction, Discussion, Experimental, References, Results
27. Article Structure. Article: Front Matter, Abstract, Introduction, Discussion, Experimental, References, Results. Experimental: Synthesis, Set up, Analysis, Compound Name
29. OSCAR Parsing Data: H NMR; Nature; HRMS
31. OSCAR Data Found: Results from one paper
32. OSCAR Error Checking: Serious Error; Warning Type 1; Warning Type 2
33. OSCAR Error Checking. ~30 errors / warnings searched for. This article has: 4 errors; 2 warnings (type 1); 30 warnings (type 2). Elemental analysis, incorrect: calculations are for a different molecular formula
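The elemental-analysis error on this slide is the kind of check that can be automated by recomputing the expected percentages from the stated molecular formula. The following Python sketch shows one way to do that. It is not OSCAR's code: the function names and the 0.05 tolerance are assumptions, and the formula parser only handles simple formulas without brackets.

```python
import re

# Standard atomic masses for a few common elements (g/mol).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no brackets)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts

def calculated_percentages(formula):
    """Return the mass percentage of each element in the formula."""
    counts = parse_formula(formula)
    total = sum(ATOMIC_MASS[s] * n for s, n in counts.items())
    return {s: 100.0 * ATOMIC_MASS[s] * n / total for s, n in counts.items()}

def check_elemental_analysis(formula, reported_calc, tolerance=0.05):
    """List elements whose reported 'calculated' value disagrees with the formula.

    The tolerance (in percentage points) is a hypothetical choice.
    """
    expected = calculated_percentages(formula)
    return sorted(
        element
        for element, value in reported_calc.items()
        if abs(expected.get(element, 0.0) - value) > tolerance
    )

# Consistent entry: the calculated values match C6H12O6.
print(check_elemental_analysis("C6H12O6", {"C": 40.00, "H": 6.71}))
# Inconsistent entry: the values were calculated for some other formula.
print(check_elemental_analysis("C6H12O6", {"C": 54.53, "H": 9.15}))
```

An empty list means the reported calculated values agree with the formula; any listed element would be reported as an error for that compound.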
35. OSCAR Speed. A typical paper contains ca. 20 compounds. JOC (Feb 2004) contains ~600 compounds; OSCAR could extract and tabulate in under 5 minutes. OBC (Feb 2004) contains ~300 compounds; OSCAR could extract and tabulate in under 3 minutes. High throughput, high precision
36. OSCAR Accuracy. 92 % of data correctly identified; 3 % incorrect author entry; 5 % missed. False-positives: 3 %. 437 items, ~10,000 data fields in test set, working with current Regular Expressions
37. XML-CML Databases. CML: Journals, Theses, CompChem. Xindice. XMLDb can support > 250,000 molecules. Millisecond retrieval on INChI, properties
39. NLP & Parsing Names. KEY: Locant; Characteristic Group; Monovalent parent hydride; Multiplier; Heterocyclic parent hydride
40. Thank You: Unilever; RSC; Jonathan Goodman; Sam Adams; Fraser Norton; Chris Waudby; Yong Zhang