CHIC - Converting Hamburgers Into Cows

How to convert legacy documents into more semantic forms, with a demonstration of what becomes possible once a document is in such a form.

Published in: Technology, Art & Photos
Notes
  • Most scientific research is communicated in a formal manner. Group vs. rest of community. Full text and supplementary information. More data points require semantics. Sliding scale: syntax, vocabulary, ontology, model. (Re)use: very hard; it has required human glue before now. This is why we need semantics.
  • Scan of a printout. Picture with text. Comp chem: more structure, but still hard. Free text.
  • Character encoding: many papers are unreadable because the various glyphs are unresolved. Markup: XML, RDF, Semantic Web. The components have meaning and possibly behaviour associated with them: ontology. Not just interpreted data. Not the whole document: sometimes entities, sometimes sections.
  • PDF to text: hard. SAF (standoff annotation file). OSCAR.
  • NCEs. Chemical terms. Chemical data. OMII. Sections are important: false positives.
  • The only way to determine sections correctly is to preprocess the document before it goes into OSCAR, using SciXML to hold the section information. Hard with PDF because of the loss of line breaks, text from pictures.
  • SciXML: sections, formatting. Embedded objects can be turned directly into CML (JumboConverters). Suddenly find DataXML too.
  • DataXML loses formatting (regex). Hard to recombine. Need to know which data is associated with which preparation, and hence which molecule. Each step adds semantics: incremental addition of information.
  • Object Reuse and Exchange
  • We know that this is a preparation. Bold numbers. Stir phrase. Add phrase.
  • Tokens. Entities. POS. Chunking.
  • Tokens in boxes. Double boxes = entities.
  • Chunks.
  • Complete description of reaction and added data (structures). The following query could be used to search for all reactions using N,N-dimethylformamide as a solvent and yields greater than 80%:
      SELECT ?preparation
      WHERE {
        ?preparation hasSubstance ?substance .
        ?substance hasMolecule <http://www.polymerinformatics.com/#DMF> .
        ?substance hasRole <http://www.polymerinformatics.com/#Solvent> .
        ?preparation hasSubstance ?product .
        ?product hasYield ?yield .
        FILTER(?yield > 80) .
      }
  • Maps outside. 55 compounds made. Completely new view of this thesis.
  • University of Cambridge (UC) and the University of Southern Queensland (USQ), funded by the JISC. Integrated repository deposition into author workflow. Fine-grained embargo. ICE allows linking / inclusion of external data files. Chem4Word: semantic authoring for chemistry. Linked zones. Chemically intelligent authoring.
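
The selection described in the note on the RDF of reaction components (N,N-dimethylformamide as solvent, yield above 80%) can be illustrated without a triple store. The following is a minimal Python sketch, not the author's implementation: the triples are held as plain tuples, the preparations (`prep1`, `prep2`), their yields, and the function names are all made up for illustration, and only the two polymerinformatics.com URIs and the predicate names come from the query in the notes.

```python
# Illustrative only: the data below is invented, and a real system would run
# the SPARQL query against a triple store rather than scan a Python list.
NS = "http://www.polymerinformatics.com/#"
DMF = NS + "DMF"
SOLVENT = NS + "Solvent"

# (subject, predicate, object) triples for two hypothetical preparations.
TRIPLES = [
    ("prep1", "hasSubstance", "prep1_s1"),
    ("prep1_s1", "hasMolecule", DMF),
    ("prep1_s1", "hasRole", SOLVENT),
    ("prep1", "hasSubstance", "prep1_p"),
    ("prep1_p", "hasYield", 95.0),
    ("prep2", "hasSubstance", "prep2_s1"),
    ("prep2_s1", "hasMolecule", DMF),
    ("prep2_s1", "hasRole", SOLVENT),
    ("prep2", "hasSubstance", "prep2_p"),
    ("prep2_p", "hasYield", 40.0),
]


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is a triple."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def preparations_in_dmf(triples, min_yield=80):
    """Preparations with DMF as solvent and some substance yield > min_yield."""
    preparations = sorted({s for s, p, _ in triples if p == "hasSubstance"})
    hits = []
    for prep in preparations:
        substances = objects(triples, prep, "hasSubstance")
        # One substance must be both DMF and in the solvent role.
        uses_dmf = any(
            DMF in objects(triples, sub, "hasMolecule")
            and SOLVENT in objects(triples, sub, "hasRole")
            for sub in substances
        )
        # Any substance of the preparation may carry the yield.
        good_yield = any(
            y > min_yield
            for sub in substances
            for y in objects(triples, sub, "hasYield")
        )
        if uses_dmf and good_yield:
            hits.append(prep)
    return hits


print(preparations_in_dmf(TRIPLES))
```

With the invented data above, only `prep1` satisfies both conditions. The point of the sketch is that once the reaction is expressed as triples, the search reduces to simple pattern matching.
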
  • Transcript of "CHIC - Converting Hamburgers Into Cows"

    1. CHIC – Converting Hamburgers Into Cows. Joseph Townsend, jat45@cam.ac.uk
    2. The Scholarly Publication Cycle
    3.
    4. What is a Cow? The character encoding is clearly stated; the document uses a mark-up technology to identify components; the components have meaning and possibly behaviour associated with them; unreduced data available.
    5. What we thought the workflow should look like: Standoff Annotation File
    6. OSCAR: http://sourceforge.net/projects/oscar3-chem/ http://www.omii.ac.uk/wiki/Nwsltr1209OSCAR http://tinyurl.com/yakzgkd
    7. Article: Front Matter; Abstract; Introduction; Discussion; Results; Experimental; References
    8. Experimental: Front Matter; Set up; Abstract; Introduction; Compound Name; Discussion; Results; Synthesis; Experimental; Analysis; References
    9. DOCX Workflow (part 1)
    10. DOCX Workflow (part 2)
    11.
    12. OREChem: PDF; PSU; Soton; Atom; Atom; SVG; Text; Cam; CrystalEye; PubChem; Atom; Molecules; Gaussian; workflow; ORE Triplestore; IU. http://research.microsoft.com/en-us/projects/orechem/
    13. What can we do with a Cow? 5-Cyclobutyl-2,3-dihydro-[1H]-2-benzazepine 82: Potassium carbonate (0.63 g, 4.56 mmol) and thiophenol (0.19 g, 1.69 mmol) were added to the 2-nitrobenzene sulfonamide 50 (0.50 g, 1.302 mmol) in N,N-dimethylformamide (33 cm3) at room temperature and the mixture was stirred for 16 h. Deionised water (50 cm3) was added and the aqueous phase was extracted with ethyl acetate (5 x 50 cm3). The organic extracts were dried (MgSO4) and concentrated under reduced pressure to give the title compound 82 (0.259 g, 1.302 mmol, ca. 100%) as an oil used without further purification.
    14. Parsing and Semantics
    15. Tokenization and Chunking
    16. Phrase identification
    17. RDF of reaction components
    18. 3D Boxes: Solid; Double Circles: Oil; Octagon: Gum; Triple Octagon: Foam; Diamond: Crystals or Needles; Ellipses: Unknown or Unspecified
    19. Semantic Authoring: ICE-TheOREM http://tinyurl.com/y85vh22 Chem4Word http://research.microsoft.com/en-us/projects/chem4word/ http://bit.ly/c4w
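
The experimental paragraph on the "What can we do with a Cow?" slide is the kind of text that the parsing, tokenization and chunking slides work on. As a much simpler stand-in for that pipeline (this is not OSCAR and not the author's method), a regular expression can already recover the parenthesised amounts. The following is a minimal Python sketch; the patterns, the short unit list, and the function name `extract_amounts` are assumptions made for illustration, and the text is a cleaned excerpt of the slide.

```python
import re

TEXT = (
    "Potassium carbonate (0.63 g, 4.56 mmol) and thiophenol (0.19 g, "
    "1.69 mmol) were added to the 2-nitrobenzene sulfonamide 50 "
    "(0.50 g, 1.302 mmol) in N,N-dimethylformamide (33 cm3) at room "
    "temperature and the mixture was stirred for 16 h."
)

# A number followed by one of a few units seen in the paragraph (assumed list).
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mmol|cm3|g)\b")
# A parenthesised group with no nesting, e.g. "(0.63 g, 4.56 mmol)".
GROUP = re.compile(r"\(([^()]*)\)")


def extract_amounts(text):
    """Return one list of (value, unit) pairs per parenthesised group."""
    groups = []
    for match in GROUP.finditer(text):
        pairs = [(float(v), u) for v, u in QUANTITY.findall(match.group(1))]
        if pairs:  # skip parentheses that hold no quantity
            groups.append(pairs)
    return groups


for pairs in extract_amounts(TEXT):
    print(pairs)
```

This recovers the numbers but not what they belong to. Linking each amount to its substance and to the role that substance plays is the part that needs the entity recognition and phrase identification described in the slides.
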