Embedding semantic annotations within texts: the FRETTA approach
In order to make semantic assertions about the text content of a document, we need a mechanism to identify and organize the text structures of the document itself. Such a mechanism would closely resemble a document-oriented markup language, but would be free of the classical constraints of an embedded markup language, with no limitations imposed by sequentiality, containment, or contiguity of text fragments. In recent years we developed EARMARK, our OWL proposal for expressing arbitrary semantic annotations about the structure and the text content of a document. In this paper we describe FRETTA, our mechanism for rendering arbitrary EARMARK annotations (including non-sequential, non-hierarchical and non-contiguous ones) in XML, bringing into a unifying framework half a dozen syntactic tricks used in the literature to handle overlapping structures in a strictly hierarchical language.

Published in: Technology, News & Politics
Transcript

  • 1. Embedding semantic annotations within texts: the FRETTA approach
      Gioele Barabucci - barabucc@cs.unibo.it
      Silvio Peroni - essepuntato@cs.unibo.it
      Francesco Poggi - fpoggi@cs.unibo.it
      Fabio Vitali - fabio@cs.unibo.it
      http://creativecommons.org/licenses/by-sa/3.0
  • 2. Outline
      • Conversion from an XML format into another
      • Overlapping markup
      • Abstract conversion framework
      • FRETTA
      • Evaluation
      • Conclusions
  • 3. Converting XML vocabularies that use syntactic workarounds
      • The conversion of OpenOffice Writer documents (ODT) into Microsoft Word documents (DOCX) (and vice versa) is not a straightforward operation
      • Converters exist and are included as core components of word processors
      • Those converters do not implement mechanisms for a full and effective document conversion, especially when particular features are needed, e.g., information tracking document changes occurring over time
  • 4. What happens to markup
      OpenOffice (ODT), original paragraph:
        <text:p>
          The beginning
          and the end.
        </text:p>
      OpenOffice (ODT), with a tracked insertion:
        <text:tracked-changes>
          <text:changed-region text:id="S1">
            <text:insertion><office:change-info>
              <dc:creator>John Smith</dc:creator>
              <dc:date>2009-10-27T18:45:00</dc:date>
            </office:change-info></text:insertion>
          </text:changed-region>
        </text:tracked-changes>
        […]
        <text:p>The beginning and
          <text:change-start text:change-id="S1"/></text:p>
        <text:p>also <text:change-end text:change-id="S1"/> the end.</text:p>
      Microsoft Word (DOCX), original paragraph:
        <w:p>
          <w:r>
            <w:t>
              The beginning
              and the end.
            </w:t>
          </w:r>
        </w:p>
      Microsoft Word (DOCX), with a tracked insertion:
        <w:p>
          <w:pPr><w:rPr>
            <w:ins w:id="0" w:author="John Smith"
              w:date="2009-10-27T18:50:00Z"/>
          </w:rPr></w:pPr>
          <w:r><w:t>The beginning and </w:t></w:r></w:p>
        <w:p>
          <w:ins w:id="1" w:author="John Smith"
            w:date="2009-10-27T18:50:00Z">
            <w:r><w:t>also </w:t></w:r></w:ins>
          <w:r><w:t>the end.</w:t></w:r></w:p>
  • 5. Overlapping markup
      • Overlapping markup is needed when different markup items refer to the same document fragment
        Previous example in incorrect XML:
          <p>The beginning and <ins></p>
          <p>also </ins> the end</p>
        XML formalisation via workarounds:
          <p>The beginning and <ins start="foo"/></p>
          <p>also <ins end="foo"/>the end</p>
      • Different techniques to embed overlapping structures in XML hierarchies:
        ✦ milestones: a pair of empty elements representing the start and the end tags, connected to each other by special attributes
        ✦ fragmentation: elements separated within the primary hierarchy and connected to each other by special attributes
        ✦ twin documents: each hierarchy is represented by a different document which contains the same textual content
        ✦ stand-off: places overlapping elements in a separate resource (e.g. another file), specifying the position (down to the individual character) of each start and end location within the main structure
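The milestone workaround on slide 5 can be read back into an explicit character range. The following is a minimal Python sketch of that idea, not FRETTA or EARMARK code: the function name `milestone_spans` is hypothetical, and it assumes the two `<p>` elements are wrapped in a single `<doc>` root (the slide shows no root) and that milestones carry the `start`/`end` attributes used in the slide.

```python
import xml.etree.ElementTree as ET


def milestone_spans(xml_text, tag="ins"):
    """Pair start/end milestones and return ({id: (start, end)}, text).

    Offsets index into the concatenated text content of the document.
    Hypothetical helper: assumes empty milestone elements named `tag`
    with a `start` or `end` attribute holding a shared identifier.
    """
    root = ET.fromstring(xml_text)
    offset = 0
    open_starts = {}
    spans = {}

    def walk(elem):
        nonlocal offset
        if elem.tag == tag and "start" in elem.attrib:
            open_starts[elem.attrib["start"]] = offset
        elif elem.tag == tag and "end" in elem.attrib:
            ident = elem.attrib["end"]
            spans[ident] = (open_starts.pop(ident), offset)
        # Text directly inside the element, then each child and its tail.
        offset += len(elem.text or "")
        for child in elem:
            walk(child)
            offset += len(child.tail or "")

    walk(root)
    return spans, "".join(root.itertext())


# The <doc> wrapper is an assumption added to make the fragment well-formed.
SLIDE_EXAMPLE = (
    '<doc><p>The beginning and <ins start="foo"/></p>'
    '<p>also <ins end="foo"/>the end</p></doc>'
)
spans, text = milestone_spans(SLIDE_EXAMPLE)
```

The recovered range crosses the paragraph boundary, which is exactly what the strictly hierarchical `<ins>…</ins>` form cannot express.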
  • 6. Abstract conversion framework
      Input: XML format 1 with overlapping workarounds (e.g., ODT + change tracking)
      Output: XML format 2 with overlapping workarounds (e.g., DOCX + change tracking)
      • Step 1: Identification of XML overlapping workarounds and creation of a document with explicit overlap (XML document format 1 → EARMARK document format 1)
      • Step 2: Syntactic and semantic conversion from format 1 into format 2 (EARMARK document format 1 → EARMARK document format 2)
      • Step 3: Linearisation into an XML document with overlapping workarounds (EARMARK document format 2 → XML document format 2). Today's contribution
      EARMARK is a non-XML markup metalanguage used as the intermediate language for the conversion. It allows markup structures to be organized both as trees and as generic graphs with no particular limitations.
  • 7. FRETTA
      • FRETTA (From EARMARK To Tag) is a general and extensible Java framework for expressing EARMARK documents in an embedded XML syntax
      • Users who want to convert from EARMARK into XML document formats must indicate which workarounds are used in a given target format
      • FRETTA performs the requested conversion in four consecutive steps, from EARMARK document to XML document:
        ✦ workaround specification: the user specifies which workaround to use to represent an (EARMARK) overlapping element in XML
        ✦ structural conversion: a purely structural conversion that produces a new EARMARK document in which overlapping elements are transformed appropriately according to the specified workarounds
        ✦ semantic conversion: a conversion that may change the current structure of the EARMARK document according to how the target XML format handles the specified workarounds
        ✦ linearisation: generation of the resulting XML tree with the requested workarounds
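To make the final linearisation step concrete, here is a minimal Python sketch that emits the milestone workaround for ranges crossing paragraph boundaries. It is an illustration only: FRETTA itself is a Java framework operating on EARMARK documents, and the function `linearise_with_milestones`, the offset-based input model, and the sample range are all assumptions made for this example.

```python
from xml.sax.saxutils import escape, quoteattr


def linearise_with_milestones(text, paragraphs, overlaps):
    """Serialise text as <p> elements, marking overlaps with milestones.

    paragraphs: list of (start, end) offsets, the primary hierarchy.
    overlaps:   list of (name, ident, start, end) ranges that may cross
                paragraph boundaries.
    Hypothetical helper, not part of FRETTA.
    """
    out = []
    for p_start, p_end in paragraphs:
        marks = []  # (offset, order, markup); end markers sort first
        for name, ident, start, end in overlaps:
            if p_start <= start < p_end:
                marks.append((start, 1, f"<{name} start={quoteattr(ident)}/>"))
            if p_start < end <= p_end:
                marks.append((end, 0, f"<{name} end={quoteattr(ident)}/>"))
        marks.sort()

        pieces, cursor = [], p_start
        for offset, _, markup in marks:
            pieces.append(escape(text[cursor:offset]))
            pieces.append(markup)
            cursor = offset
        pieces.append(escape(text[cursor:p_end]))
        out.append("<p>" + "".join(pieces) + "</p>")
    return "".join(out)


# Illustrative input: the range "and also " starts in the first paragraph
# and ends in the second, so it cannot be a single well-formed element.
TEXT = "The beginning and also the end"
xml_out = linearise_with_milestones(
    TEXT,
    paragraphs=[(0, 18), (18, 30)],
    overlaps=[("ins", "foo", 14, 23)],
)
```

A fragmentation workaround would instead close and reopen the overlapping element at each paragraph boundary and link the fragments through attributes; the choice between the two is what the workaround specification step records.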
  • 8. Evaluation
      • Comparing FRETTA's outputs against a set of twelve TEI documents (TEIDocs) written by markup experts
      • The evaluation took into account four different principles:
        ✦ well-formedness (WF): whether the framework returns well-formed XML documents
        ✦ validity (V): whether the framework returns valid XML documents according to the particular target XML vocabulary
        ✦ naturalness (N): how structurally similar the XML documents returned by the framework are to TEIDocs
        ✦ minimality (M): how much the number of nodes (i.e., elements, attributes and text nodes) in the XML documents returned by the framework varies from TEIDocs

      document          workarounds     WF  V   N   M
      agrippine         fragmentation   ✓   ✓   ✓   ✓
      agrippine         milestones      ✓   ✓   ✓   ✓
      drivemycar        fragmentation   ✓   ✓   X   X
      johnlovesmary     fragmentation   ✓   ✓   ✓   ✓
      johnlovesmary     milestones      ✓   ✓   ✓   ✓
      peergynt          fragmentation   ✓   ✓   ✓   ✓
      peergynt          milestones      ✓   ✓   ✓   ✓
      peterpaulhammer   milestones      ✓   ✓   ✓   ✓
      thoughtalice      fragmentation   ✓   ✓   ✓   ✓
      titwillow         fragmentation   ✓   ✓   X   ✓
      titwillow         fragmentation   ✓   ✓   X   X
      titwillow         milestones      ✓   ✓   X   ✓

      100% of the documents are well-formed and valid
      67% remain natural (N) against TEIDocs
      83% remain minimal (M) against TEIDocs
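The summary percentages on slide 8 follow directly from the twelve per-document rows. As a quick check, this short Python sketch transcribes the rows and recomputes each rate; the names `RESULTS` and `pass_rate` are introduced here for illustration only.

```python
# Rows transcribed from the evaluation slide:
# (document, workaround, WF, V, N, M), with True for a tick and False for an X.
RESULTS = [
    ("agrippine", "fragmentation", True, True, True, True),
    ("agrippine", "milestones", True, True, True, True),
    ("drivemycar", "fragmentation", True, True, False, False),
    ("johnlovesmary", "fragmentation", True, True, True, True),
    ("johnlovesmary", "milestones", True, True, True, True),
    ("peergynt", "fragmentation", True, True, True, True),
    ("peergynt", "milestones", True, True, True, True),
    ("peterpaulhammer", "milestones", True, True, True, True),
    ("thoughtalice", "fragmentation", True, True, True, True),
    ("titwillow", "fragmentation", True, True, False, True),
    ("titwillow", "fragmentation", True, True, False, False),
    ("titwillow", "milestones", True, True, False, True),
]

COLUMNS = {"WF": 2, "V": 3, "N": 4, "M": 5}


def pass_rate(rows, principle):
    """Percentage of rows passing the given principle, rounded to an integer."""
    index = COLUMNS[principle]
    passed = sum(1 for row in rows if row[index])
    return round(100 * passed / len(rows))


rates = {name: pass_rate(RESULTS, name) for name in COLUMNS}
```

Naturalness fails in four of twelve rows (8/12) and minimality in two (10/12), which round to the 67% and 83% reported on the slide.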
  • 9. Conclusions
      • Converting XML documents with overlaps expressed via XML workarounds is not a straightforward task
      • We propose an abstract framework to address this issue, composed of three consecutive steps
      • FRETTA implements the third step of the conversion framework. It enables one to convert any EARMARK document (which allows multiple overlapping hierarchies at the same time) into one or more embedded XML markup structures
      • Future work:
        ✦ developing algorithms that autonomously select the workarounds to adopt in the conversions
        ✦ integrating FRETTA into the broader framework for the semi-automatic and round-trip conversion from any supported XML format into another
  • 10. Thanks for your attention
