Tracking Changes through EARMARK: a Theoretical Perspective and an Implementation

The Extremely Annotational RDF Markup, a.k.a. EARMARK, is an OWL 2 DL ontology that defines document meta-markup. It is an ontologically precise definition of markup that instantiates the structure of a text document as an independent OWL document outside of the text string it annotates, and through appropriate OWL and SWRL characterizations it can define organizations such as trees or graphs and can be used to generate validity constraints. In this paper we present an extension of EARMARK that allows us to describe how markup documents evolve in time, which complies with concepts expressed in the Functional Requirements for Bibliographic Records (FRBR).

Published in: Technology


  1. Tracking changes through EARMARK: a theoretical perspective and an implementation
     Silvio Peroni – essepuntato@cs.unibo.it
     Francesco Poggi – fpoggi@cs.unibo.it
     Fabio Vitali – fabio@cs.unibo.it
     1st International Workshop on Document Changes: Modeling, Detection, Storage and Visualization (http://diff.cs.unibo.it/dchanges2013/) @ DocEng 2013, Florence, Italy - September 10, 2013
     License: http://creativecommons.org/licenses/by-sa/3.0
  2. Outline
     • Documents and their changes in time through FRBR
     • Change tracking and provenance data of XML documents
     • Defining multi-hierarchy documents through EARMARK
     • EARMARK Changes Ontology and its application
     • Querying and byte-counting
     • Conclusions
  3. Documents do change in time
     • Any text resulting from a creative act
       ✦ starts from a particular draft made by someone at a certain time
       ✦ is then modified through consecutive revisions
       ✦ may end up being forked into different variants
       ✦ may be modified by additional editorial activities such as typo-fixing, shortening, restructuring, etc.
     • Importance of keeping track of changes
       ✦ Computer Science: to show how programming code or computational models evolve throughout the natural lifecycle of software development
       ✦ Philology: to tell how variant copies of the same book overlap in time and content
       ✦ Scientific Publishing: to understand the extent and quality of the modifications, driving the final acceptance or rejection of a paper
       ✦ etc.
  4. About (textual) documents and their changes
     A document is more than the string it is composed of
     • Alice produces a hand-written document on a piece of paper, composed by the string “Hello world.”
     • Bob produces a digital document through a word processor, composed by the string “Hello world.”
       ✦ different documents, same string
     • Charles modifies Bob’s document, producing a new version composed by the string “Hey, hello world!”
       ✦ to a certain extent the same document, different strings
     A document refers to different strings in time
     [Diagram relation labels: “composed by the string”, “used by”]
  5. Introducing FRBR for change tracking
     • “Functional Requirements for Bibliographic Records (FRBR) is a conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions (IFLA) that relates user tasks of retrieval and access in online library catalogues and bibliographic databases from a user’s perspective.” (from http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records)
     • According to FRBR, document is an overloaded word, and is better replaced by four different concepts:
       ✦ Work, coupled with the concept of identity: a distinct intellectual or artistic creation
       ✦ Expression, to record the evolution in time: the intellectual or artistic realisation of a work in the form of alpha-numeric, musical, or choreographic notation, sound, image, object, movement, etc., or any combination of such forms
       ✦ Manifestation, which specifies the form and format
       ✦ Item, identifying the concrete object
     • FRBR Work layer: the abstract conceptualisation of a document
     • FRBR Expression layer: the string of characters constituting the content of a document
       ✦ Expressions never change in time
     [Diagram: over time, Bob’s “Hello world.” and Charles’ “Hey, hello world!” are each a realisation of the same Work; Charles’ Expression is a revision of Bob’s]
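The Work/Expression split on this slide can be sketched in a few lines of Python. This is only an illustration, not part of EARMARK or of any FRBR library: the class names `Work` and `Expression`, the `revise` method and the title "greeting" are all hypothetical. It shows a Work keeping its identity while every revision adds a new, immutable Expression.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Expression:
    """An immutable realisation of a Work: a fixed string and its author."""
    author: str
    content: str
    revision_of: Optional["Expression"] = None


@dataclass
class Work:
    """The abstract document; its identity survives every revision."""
    title: str  # hypothetical label, not taken from the slides
    expressions: List[Expression] = field(default_factory=list)

    def revise(self, author: str, content: str) -> Expression:
        # A revision never edits an existing Expression; it appends a new one
        # that points back to the Expression it revises.
        previous = self.expressions[-1] if self.expressions else None
        expression = Expression(author, content, previous)
        self.expressions.append(expression)
        return expression


work = Work("greeting")
bob = work.revise("Bob", "Hello world.")
charles = work.revise("Charles", "Hey, hello world!")
```

Because `Expression` is frozen, changing Bob's string after the fact raises an error, which mirrors the slide's statement that Expressions never change in time.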
  6. What happens to XML-based markup documents?
     • Which of the changes that occur in XML documents do we have to keep track of, and which markup elements are, after edits and document changes,
       ✦ directly affected
       ✦ hierarchically affected
       ✦ completely unaffected
     • Research questions:
       ✦ When a markup element E1 within a document version V1 changes in some way (e.g. by adding something to the text it contains), thereby generating document version V2, are the two instances, E1 in V1 and E2 in V2, to be considered the same element?
       ✦ If the aforementioned instances are to be considered different, is the difference meant to be propagated to their ancestor elements as well?
  7. Applying FRBR to XML markup
     Elements and text nodes as FRBR Expressions
     Alice’s version:
       <section>
         <p>Some <em>interesting</em> content.</p>
         <p>It was written by me.</p>
       </section>
     revised by Bob:
       <section>
         <p>Some <em>very interesting</em> content.</p>
         <p>It was written by me.</p>
       </section>
     revised by Charles:
       <section>
         <p>Some <em>very interesting</em> content.</p>
         <p>It was written by me.</p>
       </section>
     [Diagram: trees of section, p and em elements over the text nodes “Some”, “interesting”, “content.”, “It was written by me.” and the inserted “very”; node labels NEW, INS and DEL; relations “has part” and “revision of”]
  8. Who, What, How and When
     • An important part of change tracking operations involves keeping track of provenance information
       ✦ who made the modification
       ✦ what was modified
       ✦ how it was modified
       ✦ when it was modified
     • How do we keep track of all these data in practice?
       ✦ XML-based languages use workarounds to implement overlapping markup
       ✦ Other solutions?
     Diagram annotations:
       ✦ Version made by Alice: June 19, 2013, at 13:45
       ✦ Version made by Bob: July 3, 2013, at 04:15
       ✦ Text inserted by Bob: July 3, 2013, at 04:15
       ✦ Version made by Charles: July 5, 2013, at 03:33
       ✦ Markup deleted by Charles: July 5, 2013, at 03:33
  9. From theory to practice through EARMARK
     The Extremely Annotational RDF Markup (EARMARK) is at the same time a markup meta-language and an ontology of (document) markup
       ✦ More expressive than XML: it allows markup structures to be organised as graphs
       ✦ It makes it easy to associate annotations, such as change tracking information, with document items: since an EARMARK document is a set of OWL assertions, all the markup items and text nodes are individuals of particular classes identified by an IRI
       ✦ Many tools available:
         - a Java API
         - frameworks to convert XML documents into EARMARK ones and vice versa
     More information at http://palindrom.es/phd/research/earmark
  10. Example of EARMARK document
      Linearised using Turtle (full Turtle source of the document available at http://www.essepuntato.it/2013/dchanges)
      Text: Some interesting content. It was written by me.

      # Textual content of the document
      :content a earmark:StringDocuverse ;
        earmark:hasContent "Some interesting content. It was written by me."^^xsd:string .

      # String 'Some '
      :r1 a earmark:PointerRange ;
        earmark:refersTo :content ;
        earmark:begins "0"^^xsd:nonNegativeInteger ;
        earmark:ends "5"^^xsd:nonNegativeInteger .

      # String 'interesting'
      :r2 a earmark:PointerRange ...

      # String ' content.'
      :r3 a earmark:PointerRange ...

      # String 'It was written by me.'
      :r4 a earmark:PointerRange ...

      # Element 'section'
      :section a earmark:Element ;
        earmark:hasGeneralIdentifier "section"^^xsd:string ;
        co:firstItem [ a co:ListItem ; co:itemContent :p1 ;
          co:nextItem [ a co:ListItem ; co:itemContent :p2 ] ] .

      # First element 'p'
      :p1 a earmark:Element ;
        earmark:hasGeneralIdentifier "p"^^xsd:string ;
        co:firstItem [ a co:ListItem ; co:itemContent :r1 ;
          co:nextItem [ a co:ListItem ; co:itemContent :em ;
            co:nextItem [ a co:ListItem ; co:itemContent :r3 ] ] ] .

      ... and similarly for the other markup elements
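To make the role of `earmark:begins` and `earmark:ends` concrete, here is a minimal Python sketch of how a pointer range could select text from a docuverse. It is not the EARMARK Java API. The classes, the `range_for` helper and the reading of `begins` as inclusive and `ends` as exclusive are assumptions, chosen because they reproduce the 0 to 5 range given for 'Some '. The offsets computed for the other three ranges come from this sketch, since the slide elides them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringDocuverse:
    """Holds the text that ranges point into (mirrors earmark:StringDocuverse)."""
    content: str


@dataclass(frozen=True)
class PointerRange:
    """A span over a docuverse (mirrors earmark:PointerRange)."""
    refers_to: StringDocuverse
    begins: int
    ends: int

    def text(self) -> str:
        # Assumption: begins is inclusive and ends is exclusive.
        return self.refers_to.content[self.begins:self.ends]


def range_for(docuverse: StringDocuverse, fragment: str, start: int = 0) -> PointerRange:
    """Hypothetical helper: build the range covering `fragment`, searching from `start`."""
    begins = docuverse.content.index(fragment, start)
    return PointerRange(docuverse, begins, begins + len(fragment))


content = StringDocuverse("Some interesting content. It was written by me.")
r1 = range_for(content, "Some ")
r2 = range_for(content, "interesting", r1.ends)
r3 = range_for(content, " content.", r2.ends)
r4 = range_for(content, "It was written by me.", r3.ends)
```

Under these assumptions `r1` spans 0 to 5, matching the Turtle above, and each range's `text()` returns the string named in the corresponding comment.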
  11. EARMARK Changes Ontology
      Extending EARMARK to manage change tracking information
      • The EARMARK Changes Ontology (EChO) extends the EARMARK ontology and includes the OWL 2 DL implementation of FRBR (http://purl.org/spar/frbr) and the Provenance Ontology (http://www.w3.org/ns/prov#), so as to keep track of all the changes and provenance data related to different versions of the same document
        ✦ the EARMARK items (docuverses, ranges and markup items) to model the structure of the different document versions and to store them all within a single EARMARK document
        ✦ frbr:revisionOf to indicate that a markup item is a revision of another
        ✦ prov:wasDerivedFrom to indicate that a range is actually derived from another one defined in a previous version of the document
        ✦ prov:wasGeneratedBy (coupled with instances of echo:VersionCreation and echo:ItemInsertion) and prov:generatedAtTime to indicate that a particular markup item, a range or a whole document version has been created at a certain time
        ✦ prov:wasInvalidatedBy (coupled with instances of echo:VersionRemoval and echo:ItemDeletion) and prov:invalidatedAtTime to indicate that a particular markup item, a range or a whole document version has been deleted at a certain time
        ✦ prov:wasAssociatedWith to indicate the agent involved in the activity of generation/invalidation of a certain item
  12. Version creation
      • Who: Alice
      • What: document version (implicitly identified by the document element of the markup document :section)
      • How: creation
      • When: June 19, 2013 at 13:45

      :section
        # Provenance information
        prov:wasGeneratedBy :creation-by-alice ;
        prov:generatedAtTime "2013-06-19T13:45:00Z"^^xsd:dateTime .

      # Activity of creation of a new version
      :creation-by-alice a echo:VersionCreation ;
        prov:wasAssociatedWith :alice .
  13. Revision, insertion and deletion
      • Bob’s revision of Alice’s version was made on July 3, 2013, at 04:15, and concerns only the insertion of the string “very ” as first textual node of the element em
      • Charles’ revision of Bob’s version was made on July 5, 2013, at 03:33, and deletes the second p of Bob’s section

      Revision:
      # Element 'section' by Bob
      :section-by-bob a earmark:Element ;
        earmark:hasGeneralIdentifier "section"^^xsd:string ;
        co:firstItem [ a co:ListItem ; co:itemContent :p1-by-bob ;
          co:nextItem [ a co:ListItem ; co:itemContent :p2 ] ] ;
        # relation with previous version
        frbr:revisionOf :section ;
        # provenance information
        prov:wasGeneratedBy :creation-by-bob ;
        prov:generatedAtTime "2013-07-03T04:15:00Z"^^xsd:dateTime .

      :creation-by-bob a echo:VersionCreation ;
        prov:wasAssociatedWith :bob .

      Insertion:
      # New content of the document
      :content-by-bob a earmark:StringDocuverse ;
        earmark:hasContent "very "^^xsd:string .

      # New string 'very '
      :r5 a earmark:PointerRange ;
        earmark:refersTo :content-by-bob ;
        earmark:begins "0"^^xsd:nonNegativeInteger ;
        earmark:ends "5"^^xsd:nonNegativeInteger ;
        # provenance information
        prov:wasGeneratedBy :insertion-by-bob ;
        prov:generatedAtTime "2013-07-03T04:15:00Z"^^xsd:dateTime .

      :insertion-by-bob a echo:ItemInsertion ;
        prov:wasAssociatedWith :bob .

      Deletion:
      # Second element 'p' of Bob's 'section'
      :p2
        prov:wasInvalidatedBy :deletion-by-charles ;
        prov:invalidatedAtTime "2013-07-05T03:33:00Z"^^xsd:dateTime .

      :deletion-by-charles a echo:ItemDeletion ;
        prov:wasAssociatedWith :charles .
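One use of the generation and invalidation timestamps is to ask which items exist at a given moment. The Python sketch below illustrates that idea under stated assumptions and is not part of EChO or of any PROV library. The `Provenance` class and the `valid_items` function are hypothetical. The generation time given to `:p2` is also assumed: the slides state it only for `:section`, and `:p2` is taken here to be created together with Alice's version.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Provenance:
    """Generation and (optional) invalidation time of one EARMARK item."""
    item: str
    generated_at: datetime
    invalidated_at: Optional[datetime] = None

    def is_valid_at(self, moment: datetime) -> bool:
        # An item exists from its generation until (not including) its invalidation.
        if moment < self.generated_at:
            return False
        return self.invalidated_at is None or moment < self.invalidated_at


def utc(year: int, month: int, day: int, hour: int, minute: int) -> datetime:
    return datetime(year, month, day, hour, minute, tzinfo=timezone.utc)


records = [
    Provenance(":section", utc(2013, 6, 19, 13, 45)),
    # Assumption: :p2 was generated together with Alice's version.
    Provenance(":p2", utc(2013, 6, 19, 13, 45), utc(2013, 7, 5, 3, 33)),
    Provenance(":section-by-bob", utc(2013, 7, 3, 4, 15)),
    Provenance(":r5", utc(2013, 7, 3, 4, 15)),
]


def valid_items(moment: datetime) -> List[str]:
    """Hypothetical helper: names of the items that exist at `moment`."""
    return [record.item for record in records if record.is_valid_at(moment)]
```

Asking for the items valid on July 6, 2013 leaves out `:p2`, which Charles' deletion invalidated the day before.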
  14. Splitting ranges up
      • Daniel’s revision of Alice’s version, in which Daniel substitutes the string “me” in the second p with his name (i.e. the string “Daniel”)
      • In EARMARK, this string substitution (i.e. a deletion plus an insertion) is possible by defining four new ranges
      • We use prov:wasDerivedFrom statements between ranges to describe (at an abstract level) a more complex scenario of overlapping markup between the two versions
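The substitution can be pictured as splitting the original range around “me” and adding a range over a new docuverse that holds “Daniel”. The Python sketch below shows one plausible reading of the four ranges (text before, the replaced text, text after, and the inserted text). The slide does not spell the ranges out, so this breakdown, the `split_range` helper and the offsets 26 to 47 for the second paragraph's text are assumptions.

```python
from typing import Tuple

Span = Tuple[int, int]  # (begins, ends), with ends exclusive by assumption


def split_range(text: str, begins: int, ends: int, target: str) -> Tuple[Span, Span, Span]:
    """Split [begins, ends) around the first occurrence of `target` inside it."""
    # Bounding the search matters here: "Some" also contains "me".
    start = text.index(target, begins, ends)
    stop = start + len(target)
    return (begins, start), (start, stop), (stop, ends)


content = "Some interesting content. It was written by me."
# Assumed offsets of 'It was written by me.' in the original docuverse.
before, replaced, after = split_range(content, 26, 47, "me")

# The inserted text lives in a new docuverse, as "very " does in Bob's revision.
inserted_content = "Daniel"
inserted: Span = (0, len(inserted_content))

new_paragraph = (
    content[before[0]:before[1]]
    + inserted_content[inserted[0]:inserted[1]]
    + content[after[0]:after[1]]
)
```

In EChO terms, the `before` and `after` ranges would be the ones linked to the original range through `prov:wasDerivedFrom`.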
  15. Querying the history of changes
      • Since EARMARK is defined by means of Semantic Web technologies, we can use already implemented standards such as SPARQL 1.1 to query over the change tracking history of a certain EARMARK document

      ✦ Return a new EARMARK document that contains only Bob’s version

      CONSTRUCT { ?other ?p ?o . ?version ?pv ?ov . ?docuverse ?pd ?od }
      WHERE {
        { SELECT DISTINCT ?other ?version WHERE {
            { SELECT DISTINCT ?version WHERE {
                ?version a earmark:Element ;
                  prov:wasGeneratedBy ?activity .
                ?activity a echo:VersionCreation ;
                  prov:wasAssociatedWith :bob } }
            ?other (^co:itemContent?/^co:item)+ ?version } }
        ?version ?pv ?ov .
        ?other ?p ?o .
        OPTIONAL {
          ?other a earmark:PointerRange ;
            earmark:refersTo ?docuverse .
          ?docuverse ?pd ?od } }

      ✦ Select the textual content of all paragraphs removed by Charles

      SELECT DISTINCT ?range WHERE {
        ?p a earmark:Element ;
          earmark:hasGeneralIdentifier "p"^^xsd:string ;
          co:item/co:itemContent ?range ;
          prov:wasInvalidatedBy ?activity .
        ?range a earmark:PointerRange .
        ?activity a echo:ItemDeletion ;
          prov:wasAssociatedWith :charles }
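To show what the second query matches, the Python sketch below walks the same pattern over a small in-memory set of triples, using only the standard library instead of a SPARQL engine. The triple set is a toy: that `:p2` holds range `:r4`, and the list item names `:p1-item1` and `:p2-item1`, are assumptions. The helpers `objects`, `subjects` and `ranges_removed_by` are hypothetical too.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Toy data; the list items and the :p2 -> :r4 link are assumptions.
TRIPLES: Set[Triple] = {
    (":p1", "rdf:type", "earmark:Element"),
    (":p1", "earmark:hasGeneralIdentifier", "p"),
    (":p1", "co:item", ":p1-item1"),
    (":p1-item1", "co:itemContent", ":r1"),
    (":p2", "rdf:type", "earmark:Element"),
    (":p2", "earmark:hasGeneralIdentifier", "p"),
    (":p2", "co:item", ":p2-item1"),
    (":p2-item1", "co:itemContent", ":r4"),
    (":p2", "prov:wasInvalidatedBy", ":deletion-by-charles"),
    (":r1", "rdf:type", "earmark:PointerRange"),
    (":r4", "rdf:type", "earmark:PointerRange"),
    (":deletion-by-charles", "rdf:type", "echo:ItemDeletion"),
    (":deletion-by-charles", "prov:wasAssociatedWith", ":charles"),
}


def objects(subject: str, predicate: str) -> Set[str]:
    return {o for (s, p, o) in TRIPLES if s == subject and p == predicate}


def subjects(predicate: str, obj: str) -> Set[str]:
    return {s for (s, p, o) in TRIPLES if p == predicate and o == obj}


def ranges_removed_by(agent: str) -> Set[str]:
    """Ranges inside 'p' elements invalidated by an ItemDeletion of `agent`."""
    found: Set[str] = set()
    for element in subjects("rdf:type", "earmark:Element"):
        if "p" not in objects(element, "earmark:hasGeneralIdentifier"):
            continue
        deleted_by_agent = any(
            "echo:ItemDeletion" in objects(activity, "rdf:type")
            and agent in objects(activity, "prov:wasAssociatedWith")
            for activity in objects(element, "prov:wasInvalidatedBy")
        )
        if not deleted_by_agent:
            continue
        # Equivalent of the property path co:item/co:itemContent.
        for item in objects(element, "co:item"):
            for content in objects(item, "co:itemContent"):
                if "earmark:PointerRange" in objects(content, "rdf:type"):
                    found.add(content)
    return found
```

A real deployment would run the SPARQL query itself against an RDF store; the nested loops here only spell out the joins the query performs.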
  16. Byte-counting EARMARK documents
      • We used two documents
        ✦ The first document is composed of seven different versions, named after the “Seven Dwarfs” for recognizability and obtained by applying very common edits according to three authors
        ✦ The second document is composed of seven different versions, named after the weekdays and created by seven different authors when editing a very simple document
      • We compared the size in bytes of consecutive versions of such documents according to OpenDocument and OpenXML formats, and to EARMARK linearised in six different formats: Turtle, RDF/XML, OWL/XML, N-Triples, HDT and Manchester Syntax
      • Findings
        ✦ Turtle seems the best linearisation format for EARMARK
        ✦ HDT (compressed) shows performance similar to ODT and OOXML
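Byte-counting itself is simple to reproduce. The Python sketch below measures the UTF-8 size of consecutive versions and the growth between them. It is not the authors' measurement setup: the helper names are hypothetical, and the two sample strings are the XML of Alice's and Bob's versions from the running example rather than the documents used in the experiment.

```python
from typing import List


def size_in_bytes(serialisation: str) -> int:
    """Size of a serialised document when encoded as UTF-8."""
    return len(serialisation.encode("utf-8"))


def growth(versions: List[str]) -> List[int]:
    """Byte difference between each version and the one before it."""
    sizes = [size_in_bytes(version) for version in versions]
    return [later - earlier for earlier, later in zip(sizes, sizes[1:])]


# Sample data from the running example (not the experiment's documents).
alice = "<section><p>Some <em>interesting</em> content.</p><p>It was written by me.</p></section>"
bob = "<section><p>Some <em>very interesting</em> content.</p><p>It was written by me.</p></section>"

deltas = growth([alice, bob])
```

Counting encoded bytes rather than characters matters once the text contains non-ASCII characters, which take more than one byte each in UTF-8.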
  17. Conclusions
      • In this paper we presented a theoretical approach to track document changes based on FRBR and provenance data
      • We proposed one possible implementation of it through EARMARK, a Semantic Web-aware meta-markup language that enables the definition of multiple overlapping markup hierarchies representing different versions of the same document
      • We highlighted the main advantages and drawbacks in terms of querying and storing such EARMARK documents
      • In the future we plan to extend EChO so as to enable the description of additional change tracking operations (e.g. swap, update)
      • We also plan to experiment with translation mechanisms that convert EARMARK documents with change tracking information into XML formats, e.g. ODT and OOXML
  18. Thanks for your attention
