CHIC - Converting Hamburgers Into Cows



Describes how to convert legacy documents into more semantic forms and demonstrates what becomes possible once they are in that form.

Published in: Technology, Art & Photos

  • Most scientific research is communicated in a formal manner. Group vs. rest of community. Full text and supplementary information. More data points require semantics. Sliding scale: syntax, vocabulary, ontology, model. (Re)use: very hard; it has required human glue before now. This is why we need semantics.
  • Scan of a printout. Picture with text. Comp chem: more structure, but still hard. Free text.
  • Character encoding: many papers are unreadable because the various glyphs are unresolved. Markup: XML, RDF, Semantic Web. The components have meaning and possibly behaviour associated with them (ontology). Not just interpreted data. Not the whole document: sometimes entities, sometimes sections.
  • PDF to text: hard. SAF. OSCAR.
  • NCEs. Chemical terms. Chemical data. OMII. Sections are important: false positives.
  • The only way to determine sections correctly is to preprocess the document before it goes into OSCAR, using SciXML to hold the section information. Hard with PDF because of the loss of line breaks and text from pictures.
  • SciXML: sections, formatting. Embedded objects can be turned directly into CML (JumboConverters). Suddenly find DataXML too.
  • DataXML loses formatting (regex). Hard to recombine. Need to know which data is associated with which preparation, and hence with which molecule. Each step adds semantics: incremental addition of information.
  • Object Reuse and Exchange
  • We know that this is a preparation. Bold numbers. Stir phrase. Add phrase.
  • Tokens. Entities. POS. Chunking.
  • Tokens in boxes. Double boxes = entities.
  • Chunks.
  • Complete description of the reaction and added data (structures). The following query could be used to search for all reactions using N,N-dimethylformamide as a solvent and with yields greater than 80%: SELECT ?preparation WHERE { ?preparation hasSubstance ?substance . ?substance hasMolecule <> . ?substance hasRole <> . ?preparation hasSubstance ?product . ?product hasYield ?yield . FILTER(?yield > 80) }
  • Maps outside. 55 compounds made. Completely new view of this thesis.
  • University of Cambridge (UC) and the University of Southern Queensland (USQ), funded by the JISC. Integrated repository deposition into author workflow. Fine-grained embargo. ICE allows linking / inclusion of external data files. Chem4Word: semantic authoring for chemistry. Linked zones. Chemically intelligent authoring.
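
The query in the notes above can be mirrored in plain Python as a rough sketch of the pattern it expresses. Everything concrete here is an assumption made for illustration: the triples, the identifiers `prep1`, `DMF` and `solvent`, and the helper names `objects` and `high_yield_preps` are hypothetical, because the notes leave the molecule and role IRIs empty (`<>`). It shows the shape of the match, not the implementation used in the talk.

```python
# Hypothetical (subject, predicate, object) triples; not data from the talk.
TRIPLES = [
    ("prep1", "hasSubstance", "s1"),
    ("s1", "hasMolecule", "DMF"),
    ("s1", "hasRole", "solvent"),
    ("prep1", "hasSubstance", "p1"),
    ("p1", "hasYield", 95.0),
    ("prep2", "hasSubstance", "s2"),
    ("s2", "hasMolecule", "DMF"),
    ("s2", "hasRole", "solvent"),
    ("prep2", "hasSubstance", "p2"),
    ("p2", "hasYield", 40.0),
    ("prep3", "hasSubstance", "s3"),
    ("s3", "hasMolecule", "ethanol"),
    ("s3", "hasRole", "solvent"),
    ("prep3", "hasSubstance", "p3"),
    ("p3", "hasYield", 90.0),
]


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is a triple."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def high_yield_preps(triples, molecule, role, min_yield):
    """Preparations that use `molecule` in `role` and report a yield above `min_yield`."""
    found = []
    preparations = sorted({s for s, p, _ in triples if p == "hasSubstance"})
    for prep in preparations:
        substances = objects(triples, prep, "hasSubstance")
        # Mirrors: ?substance hasMolecule <...> . ?substance hasRole <...> .
        uses_molecule = any(
            molecule in objects(triples, sub, "hasMolecule")
            and role in objects(triples, sub, "hasRole")
            for sub in substances
        )
        # Mirrors: ?product hasYield ?yield . FILTER(?yield > min_yield)
        good_yield = any(
            y > min_yield
            for sub in substances
            for y in objects(triples, sub, "hasYield")
        )
        if uses_molecule and good_yield:
            found.append(prep)
    return found


print(high_yield_preps(TRIPLES, "DMF", "solvent", 80))
```

In a real setting the same pattern would be run as SPARQL against a triplestore; the nested loops here only stand in for the joins the query engine would perform.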

    1. CHIC – Converting Hamburgers Into Cows (Joseph Townsend)
    2. The Scholarly Publication Cycle
    3.
    4. What is a Cow? The character encoding is clearly stated; the document uses a mark-up technology to identify components; the components have meaning and possibly behaviour associated with them; unreduced data available.
    5. What we thought the workflow should look like: Standoff Annotation File
    6. OSCAR
    7. Article: Front Matter, Abstract, Introduction, Discussion, Results, Experimental, References
    8. Experimental (diagram labels). Article sections: Front Matter, Abstract, Introduction, Discussion, Results, Experimental, References. Experimental parts: Set up, Compound Name, Synthesis, Analysis.
    9. DOCX Workflow (part 1)
    10. DOCX Workflow (part 2)
    11.
    12. OREChem (diagram labels): PDF, PSU, Soton, Atom, SVG, Text, Cam, CrystalEye, PubChem, Molecules, Gaussian workflow, ORE Triplestore, IU
    13. What can we do with a Cow? 5-Cyclobutyl-2,3-dihydro-[1H]-2-benzazepine 82: Potassium carbonate (0.63 g, 4.56 mmol) and thiophenol (0.19 g, 1.69 mmol) were added to the 2-nitrobenzene sulfonamide 50 (0.50 g, 1.302 mmol) in N,N-dimethylformamide (33 cm3) at room temperature and the mixture was stirred for 16 h. Deionised water (50 cm3) was added and the aqueous phase was extracted with ethyl acetate (5 x 50 cm3). The organic extracts were dried (MgSO4) and concentrated under reduced pressure to give the title compound 82 (0.259 g, 1.302 mmol, ca. 100%) as an oil used without further purification.
    14. Parsing and Semantics
    15. Tokenization and Chunking
    16. Phrase identification
    17. RDF of reaction components
    18. 3D Boxes: Solid; Double Circles: Oil; Octagon: Gum; Triple Octagon: Foam; Diamond: Crystals or Needles; Ellipses: Unknown or Unspecified
    19. Semantic Authoring: ICE-TheOREM, Chem4Word
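
The preparation text on slide 13 gives a feel for what pulling data out of free text involves. The following is a minimal sketch using only regular expressions; it is not OSCAR and not the pipeline described in the talk. The pattern names and the `extract` helper are hypothetical, and the text is the slide 13 paragraph with its spacing normalised.

```python
import re

# Preparation text from slide 13, with spacing normalised.
PREPARATION = (
    "Potassium carbonate (0.63 g, 4.56 mmol) and thiophenol (0.19 g, 1.69 mmol) "
    "were added to the 2-nitrobenzene sulfonamide 50 (0.50 g, 1.302 mmol) in "
    "N,N-dimethylformamide (33 cm3) at room temperature and the mixture was "
    "stirred for 16 h. Deionised water (50 cm3) was added and the aqueous phase "
    "was extracted with ethyl acetate (5 x 50 cm3). The organic extracts were "
    "dried (MgSO4) and concentrated under reduced pressure to give the title "
    "compound 82 (0.259 g, 1.302 mmol, ca. 100%) as an oil used without further "
    "purification."
)

NUMBER = r"\d+(?:\.\d+)?"
# "(0.63 g, 4.56 mmol" style mass / amount pairs.
AMOUNT = re.compile(rf"\(\s*({NUMBER})\s*g,\s*({NUMBER})\s*mmol")
# "stirred for 16 h" style stir phrase.
STIR = re.compile(rf"stirred for ({NUMBER})\s*h\b")
# "100%" style yield.
YIELD = re.compile(rf"({NUMBER})\s*%")


def extract(text):
    """Pull mass/amount pairs, stirring time and yield out of a preparation."""
    amounts = [(float(g), float(mmol)) for g, mmol in AMOUNT.findall(text)]
    stir = STIR.search(text)
    yld = YIELD.search(text)
    return {
        "amounts_g_mmol": amounts,
        "stir_hours": float(stir.group(1)) if stir else None,
        "yield_percent": float(yld.group(1)) if yld else None,
    }


result = extract(PREPARATION)
for key, value in result.items():
    print(key, value)
```

A sketch like this recovers the numbers but not which compound each number belongs to, which is the association problem the notes raise about knowing what data goes with which preparation and molecule.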