Handling Markup Overlaps Using OWL

A lot of applications handle XML documents where multiple overlapping hierarchies are necessary and make use of a number of workarounds to force overlaps into the single hierarchy of an XML format. Although these workarounds are transparent to the users, they are very difficult to handle by applications reading into these formats. This paper proposes an approach to document markup based on Semantic Web technologies. Our model allows the same expressiveness as XML and any other hierarchical meta-markup language, and, rather than requiring complex workarounds, allows the explicit expression of overlapping structures in such a way that search and manipulation of these structures does not require any specific tool or language. By simply using mainstream technologies such as OWL and SPARQL, our model – called EARMARK (Extremely Annotational RDF Markup) – can perform rather sophisticated tasks with no special tricks.


    Handling Markup Overlaps Using OWL Presentation Transcript

    • Handling markup overlaps using OWL Angelo Di Iorio (diiorio@cs.unibo.it) Silvio Peroni (speroni@cs.unibo.it) Fabio Vitali (fabio@cs.unibo.it) http://creativecommons.org/licenses/by-sa/3.0
    • Summary
      • Overlapping markup in everyday life
      • EARMARK: an OWL-based meta-markup language
      • Conclusions and future work
    • Overlapping markup... wait, what?
      • A definition: overlapping markup “describes cases where some markup structures do not nest neatly into others”
        DeRose, S. (2004). Markup Overlap: A Review and a Horse. In Proceedings of Extreme Markup Languages 2004. Montreal, Canada.
          <body>
            <p>Some <em>very</p>
            <p>interesting</em> text</p>
          </body>
      • Different techniques to embed overlap in XML hierarchies, for instance:
        ✦ milestones – expressed through empty elements to mark the boundaries of the content
          <body>
            <p>Some <em start="id1"/>very</p>
            <p>interesting<em end="id1"/> text</p>
          </body>
        ✦ fragmentation – expressed by two non-overlapping elements linked through id-idref pairs
          <body>
            <p>Some <em id="em1" next="em2">very</em></p>
            <p><em id="em2">interesting</em> text</p>
          </body>
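To make the cost of the milestone workaround concrete, here is a minimal Python sketch (not part of the original slides) that recovers the emphasised span from the milestone-encoded example. The function name `milestone_ranges` and the convention of returning character offsets are assumptions made for illustration; the input is the slide's example, written without whitespace between elements so that the offsets are easy to follow.

```python
import xml.etree.ElementTree as ET

# The milestone example from the slide, without inter-element whitespace.
MILESTONE_XML = (
    '<body>'
    '<p>Some <em start="id1"/>very</p>'
    '<p>interesting<em end="id1"/> text</p>'
    '</body>'
)


def milestone_ranges(xml_text):
    """Return (plain_text, {id: (begin, end)}) for start/end milestone pairs.

    Hypothetical helper: the XML tree only knows about empty <em/> elements,
    so the application has to walk the tree and pair them up itself.
    """
    root = ET.fromstring(xml_text)
    chunks = []   # pieces of character data, in document order
    offset = 0    # number of characters emitted so far
    starts = {}   # milestone id -> offset of its start marker
    ranges = {}   # milestone id -> (begin, end) offsets

    def emit(text):
        nonlocal offset
        if text:
            chunks.append(text)
            offset += len(text)

    def walk(node):
        if "start" in node.attrib:
            starts[node.attrib["start"]] = offset
        if "end" in node.attrib:
            ident = node.attrib["end"]
            ranges[ident] = (starts.pop(ident), offset)
        emit(node.text)
        for child in node:
            walk(child)
            emit(child.tail)

    walk(root)
    return "".join(chunks), ranges


text, ranges = milestone_ranges(MILESTONE_XML)
begin, end = ranges["id1"]
print(text[begin:end])  # veryinteresting
```

The emphasised span crosses the paragraph boundary, so no single XML element contains it; the pairing logic lives entirely in application code.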
    • Overlapping everywhere
      • Where we can find it: word processor formats + change tracking (e.g., ODT)
        What the document is:
          <office:text>
            <text:changed-region text:id="S1">
              <text:insertion>
                <office:change-info>
                  <dc:creator>John Smith</dc:creator>
                  <dc:date>2009-10-27T18:45:00</dc:date>
                </office:change-info>
              </text:insertion>
            </text:changed-region>
            <text:p>
              The beginning and <text:change-start text:change-id="S1"/>
            </text:p>
            <text:p>
              also <text:change-end text:change-id="S1"/> the end.
            </text:p>
          </office:text>
        What the document represents:
          [Figure: two hierarchies over the same text. Before the change (2009-10-27T18:45:00), office:text contains a single text:p over "The beginning and the end."; after it, office:text contains two text:p elements and the added text "also", inserted by John Smith.]
    • EARMARK is a vocabulary that defines a meta-markup language by means of OWL ontologies – http://www.essepuntato.it/2008/12/earmark
      • It is more expressive than XML

                            XML                     EARMARK
          Data structure    Tree                    DAG
          Overlapping       Only by using tricks    Of course, it is a feature here
          Semantics         What?                   Yes, it is OWL!

      • Three disjoint base classes:
        ✦ Docuverse – it represents the textual content of a document
          Subclasses: StringDocuverse, URIDocuverse
        ✦ Range – it describes any text lying between two locations
          Subclasses: PointerRange, XPathRange, XPathPointerRange
        ✦ MarkupItem – a collection of individuals belonging to the classes MarkupItem and Range
          Subclasses: Element, Attribute, Comment
    • An example
      Document text: The beginning and the end.
          @prefix earmark: <http://www.essepuntato.it/2008/12/earmark#> .
          @prefix : <http://www.example.com/> .
          @prefix c: <http://swan.mindinformatics.org/ontologies/1.2/collections/> .
          @prefix dc: <http://purl.org/dc/elements/1.1/> .
          @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

          :aDoc a earmark:StringDocuverse ;
            earmark:hasContent "The beginning and the end."^^xsd:string .

          :r2 a earmark:PointerRange ;
            earmark:refersTo :aDoc ;
            earmark:begins "14"^^xsd:nonNegativeInteger ;
            earmark:ends "26"^^xsd:nonNegativeInteger .

          :aMarkupItem a earmark:Element ;
            earmark:hasGeneralIdentifier "p" ;
            earmark:hasNamespace "urn:oasis:names:tc:opendocument:xmlns:text:1.0" ;
            c:firstItem :item1 ;
            c:lastItem :item2 .

          :item1 c:itemContent :r1 ; c:nextItem :item2 .
          :item2 c:itemContent :r2 .

          :p2 a Insertion ;
            dc:creator "John Smith" ;
            dc:date "2009-10-27T18:45:00"^^xsd:dateTime .
      [Figure labels alongside the code: office:text, text:p, "also", inserted by John Smith.]
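The core idea of the example, ranges that point into a docuverse instead of elements that wrap text, can be sketched in a few lines of Python. This is an illustration only, not the EARMARK ontology or its Java API: the class and method names are assumptions, and the second range (`hypothetical_range`, offsets 0 to 17) is invented here to show an overlap. Only `r2` (14 to 26) comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Docuverse:
    """Holds the textual content, as earmark:StringDocuverse does."""
    content: str


@dataclass(frozen=True)
class PointerRange:
    """A span of a docuverse between two character offsets."""
    docuverse: Docuverse
    begins: int
    ends: int

    def text(self):
        return self.docuverse.content[self.begins:self.ends]

    def overlaps(self, other):
        # Two ranges overlap when they share a docuverse and at least one character.
        return (
            self.docuverse == other.docuverse
            and self.begins < other.ends
            and other.begins < self.ends
        )


a_doc = Docuverse("The beginning and the end.")
r2 = PointerRange(a_doc, 14, 26)                 # offsets taken from the slide
hypothetical_range = PointerRange(a_doc, 0, 17)  # assumption, for illustration

print(r2.text())                        # and the end.
print(hypothetical_range.text())        # The beginning and
print(r2.overlaps(hypothetical_range))  # True
```

Because the text is stored once and markup items only refer to ranges, two structures sharing the word "and" need no milestone or fragmentation trick.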
    • EARMARK Data Structure
      • It is an API and a Java library that makes it easy to create and modify EARMARK documents within Java applications
      • Open Source project: http://earmark.sourceforge.net
          EARMARKDocument ed = new EARMARKDocument(new URI("http://www.example.com"));
          Docuverse aDoc = ed.createStringDocuverse("The beginning and the end.");
          [...]
          Range aRange = ed.createPointerRange(aDoc, 14, 26);
          [...]
          Element aMarkupItem = ed.createElement("p",
              "urn:oasis:names:tc:opendocument:xmlns:text:1.0", Collection.Type.List);
          ed.appendChild(anotherMarkupItem);
          [...]
    • Semantic Web technologies as added value
      • Because every EARMARK document is expressed as a proper ABox of an ontology, we can use Semantic Web technologies:
        ✦ to manipulate documents
        ✦ to query them
        ✦ to infer new assertions
        ✦ to check some integrity constraints on document structure and on content semantics
      • In EARMARK, those technologies can be very helpful in solving issues that are difficult or impossible to solve with XML tools
      • An example: “get all the text fragments inserted by John Smith”
        ✦ XPath
          for $id in //@text:id[../text:insertion//(dc:creator[. = 'John Smith'] |
              @office:chg-author[. = 'John Smith'])]
          return //text:p//text()[
              (preceding-sibling::text:change-start[1][@text:change-id = $id] and
               following-sibling::text:change-end[1][@text:change-id = $id])
              or ancestor::text:changed-region/@text:id = $id]
        ✦ SPARQL
          SELECT ?r WHERE {
            ?r a earmark:Range , Insertion ;
               dc:creator "John Smith" .
          }
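The SPARQL query is short because it is a plain conjunctive pattern over triples, with the class hierarchy doing part of the work. The toy Python sketch below mimics that pattern on a hand-made set of triples. It is not a SPARQL engine or an OWL reasoner: the `ex:` names, the sample data, and the helper functions are all assumptions for illustration. The only fact taken from the slides is that PointerRange is a subclass of Range.

```python
RDF_TYPE = "rdf:type"

# From the slides: PointerRange is a subclass of Range.
SUBCLASS_OF = {"earmark:PointerRange": "earmark:Range"}

# Hypothetical sample data; none of these individuals come from the slides.
TRIPLES = {
    ("ex:r2", RDF_TYPE, "earmark:PointerRange"),
    ("ex:r2", RDF_TYPE, "ex:Insertion"),
    ("ex:r2", "dc:creator", "John Smith"),
    ("ex:r9", RDF_TYPE, "earmark:PointerRange"),
    ("ex:r9", RDF_TYPE, "ex:Insertion"),
    ("ex:r9", "dc:creator", "Jane Doe"),
    ("ex:note1", "dc:creator", "John Smith"),  # not a range, must be filtered out
}


def types_of(subject, graph):
    """Asserted types of a subject plus their superclasses (a tiny inference step)."""
    found = set()
    for s, p, o in graph:
        if s == subject and p == RDF_TYPE:
            found.add(o)
            while o in SUBCLASS_OF:
                o = SUBCLASS_OF[o]
                found.add(o)
    return found


def inserted_by(author, graph):
    """Subjects that are both a Range and an Insertion and were created by `author`."""
    candidates = {s for s, p, o in graph if p == "dc:creator" and o == author}
    wanted = {"earmark:Range", "ex:Insertion"}
    return sorted(s for s in candidates if wanted <= types_of(s, graph))


print(inserted_by("John Smith", TRIPLES))  # ['ex:r2']
```

A real deployment would load the EARMARK document into a triple store and let a reasoner supply the subclass inference; the sketch only shows why the query needs no document-order navigation.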
    • Conclusions and future work
      • We presented a new meta-markup language called EARMARK, defined by means of OWL ontologies, that makes it possible to build very complex markup documents
      • We applied it to a real-case scenario (ODT format with change tracking), showing that it handles, manipulates and queries complex documents better than XML does
      • Future work on this topic includes:
        ✦ Rocco and Fretta, two ongoing projects that allow transformations from XML documents (with overlapping markup specified by using tricks) to EARMARK documents, and vice versa
        ✦ a formalism to explicitly specify the semantics of markup and of textual content
        ✦ a word processor for defining EARMARK documents in a very simple way, with the possibility of adding any kind of semantic assertion to any entity of the document (both markup items and textual content)
    • Thanks for your attention I think it’s time for questions :-)
    • Late time example: A more complex ODT document...
          <office:text>
            <text:changed-region text:id="S2">
              <text:deletion><office:change-info>
                  <dc:creator>Silvio Peroni</dc:creator>
                  <dc:date>2009-10-27T18:45:00</dc:date>
                </office:change-info><text:p>.</text:p></text:deletion>
              <text:insertion>
                <office:change-info office:chg-author="Angelo Di Iorio"
                    office:chg-date-time="2009-10-27T18:42:00"/>
              </text:insertion>
            </text:changed-region>
            <text:changed-region text:id="A2">
              <text:insertion><office:change-info>
                  <dc:creator>Angelo Di Iorio</dc:creator>
                  <dc:date>2009-10-27T18:42:00</dc:date>
                </office:change-info></text:insertion>
            </text:changed-region>
            [...]
            <text:p>This is one paragraph<text:change-start text:change-id="S1"/>;
              actually, it was!<text:change-end text:change-id="S1"/>
              <text:change text:change-id="S2"/> <text:change-start text:change-id="A2"/></text:p>
            <text:p><text:change-end text:change-id="A2"/>
              <text:change text:change-id="A3"/><text:change-start text:change-id="A4"/>S
              <text:change-end text:change-id="A4"/>plit in two.</text:p>
          </office:text>
    • ... and its representation in EARMARK
      [Figure: the document laid out along a TIME axis in four columns: docuverses, ranges, markup items, assertions. Ranges r1 to r6 point into the docuverse content "This is one paragraph that will be split in two." and into the strings "actually, it was!" and ".S"; they are grouped into p markup items with text. Assertions shown: text:insertion and text:deletion with dc:creator "Silvio Peroni" and dc:date "2009-10-27T18:45:00"; text:insertion and text:deletion with dc:creator "Angelo Di Iorio" and dc:date "2009-10-27T18:42:00". Legend: string in the range, docuverse content, begin location, end location.]