Dealing with Markup Semantics



My paper presentation.
i-Semantics 2011, Graz, Austria.

Published in: Technology, Education

Dealing with Markup Semantics

  1. 1. Dealing with Markup Semantics Silvio Peroni – Aldo Gangemi – Fabio Vitali – fabio@cs.unibo.it
  2. 2. Summary
     • Semantic markup vs. markup semantics
     • Why markup semantics
     • Why XML is not enough
     • Markup semantics with EARMARK and Linguistic Act
     • Real-world scenarios
     • Conclusions
  3. 3. Shift of meaning: markup, tag, and the relation between semantics and markup
     • 1990, Web of documents: First Era of the Web (WWW)
       ✦ Markup as document markup: it tells us something about the text or content of a document
       ✦ Tag as markup element: a syntactic item representing the building block of a document structure
       ✦ Semantics and markup as markup semantics: “what is the meaning of a markup element title contained in a document d?”
     • Today, Web of data: Second Era of the Web (Semantic Web)
       ✦ Markup as resource markup: it is used to identify any data added to a resource with the intention to semantically describe it
       ✦ Tag as keyword: a non-hierarchical keyword or term assigned to a piece of information (such as an Internet bookmark, digital image or computer file)
       ✦ Semantics and markup as semantic markup: “the resource r has the string Dealing with Markup Semantics as title”
  4. 4. Markup semantics today
     • Document markup is still here:
       ✦ a lot of research issues are still open problems
       ✦ some of the partially solved issues can be addressed better with today's tools and technologies
     • So, our question is: why has the Semantic Web not yet properly addressed markup semantics? Possible answers:
       ✦ Because document markup is dead, really
       ✦ Because markup semantics is not an interesting research topic
       ✦ Because markup semantics is not a useful tool for solving valuable problems
       ✦ Actually, the Semantic Web has addressed markup semantics
  5. 5. Document markup is dead... wait, really?
     • Claim: document markup does not play any important role in today's research fields and company interests
     • Are we really sure? Maybe not!
  6. 6. Research groups’ interest in markup semantics
     • Does this mean that no research communities are interested in this issue? Actually, it is an old issue that is still alive:
       ✦ Renear, A., Dubin, D., Sperberg-McQueen, C. M. (2002). Towards a Semantics for XML Markup.
       ✦ Dubin, D. (2003). Object mapping for markup semantics.
       ✦ Renear, A., Dubin, D., Sperberg-McQueen, C. M., Huitfeldt, C. (2003). XML Semantics and Digital Libraries.
       ✦ Simons, G. F., Lewis, W. D., Farrar, S. O., Langendoen, D. T., Fitzsimons, B., Gonzalez, H. (2004). The semantics of markup: mapping legacy markup schemas to a common semantics.
       ✦ Garcia, R., Celma, O. (2005). Semantic Integration and Retrieval of Multimedia Metadata.
       ✦ Marcoux, Y. (2006). A natural-language approach to modeling: Why is some XML so difficult to write?
       ✦ Van Deursen, D., Poppe, C., Martens, G., Mannens, E., Van de Walle, R. (2008). XML to RDF Conversion: a Generic Approach.
       ✦ Marcoux, Y., Rizkallah, E. (2009). Intertextual semantics: A semantics for information design.
       ✦ Sperberg-McQueen, C. M., Marcoux, Y., Huitfeldt, C. (2009). Two representations of the semantics of TEI Lite.
       ✦ Nuzzolese, A., Gangemi, A., Presutti, V. (2010). Gathering Lexical Linked Data and Knowledge Patterns from FrameNet.
     • “The problem addressed seems old and seems to have been solved before, but actually has not [sufficiently]” (an anonymous reviewer)
  7. 7. Markup semantics and real-world problems
     • A formal, machine-readable semantics of markup makes it possible to:
       ✦ perform both syntactic and semantic validation
       ✦ infer facts from documents automatically
       ✦ simplify the federation, conversion and translation of documents among digital repositories
       ✦ query the structure of a document by considering its semantics
       ✦ create visualisations of documents based on the semantics of their structures rather than on their markup vocabularies
       ✦ increase the accessibility of documents’ content (see the “tag abuse” issue)
       ✦ guarantee better maintainability when a markup schema evolves
     • Fields of interest: digital libraries and digital (and semantic) publishing
  8. 8. Semantic Web approaching markup semantics
     • RDFa may be a valid choice for associating formal semantics with arbitrary text fragments
       ✦ Pros: easy to use and parse, compliant with XML-like formats
       ✦ Cons: we need to modify the structure of the document (more attributes, more elements)
     • Original document (1 markup element only):
         <?xml version="1.0" encoding="UTF-8"?>
         <p>Fabio says that overlhappens</p>
     • After RDFa enhancing (2 markup elements, 3 attributes):
         <?xml version="1.0" encoding="UTF-8"?>
         <p prefix=": foaf:">
           <span about=":fv" property="foaf:firstName">Fabio</span>
           says that overlhappens
         </p>
     • There are domains (e.g., those dealing with administrative and juridical documents) in which we cannot modify the structure of documents
     • How can we say that the element p in the document means “paragraph”?
  9. 9. Our problems in addressing markup semantics
     • Let’s use XML for defining document markup structures
       ✦ Pros: it is today's common format, used in a lot of tools and applications
       ✦ Cons: it does not define a formal way of specifying markup semantics
     • Let’s use OWL for defining formal semantics and then associate it with XML markup
       ✦ Pros: OWL was created to define semantics
       ✦ Cons: we have to use XML-based approaches (RDFa, GRDDL) to link semantics to XML markup, and this is not always possible
     • A compromise between XML and OWL is not fully satisfying
     • A solution: elevate either the document markup formalism or the formal semantics model to the level of the other, which means:
       ✦ using XML for document markup and another formalism, fully compliant with XML in all possible scenarios, for defining its markup semantics (does it exist?), or
       ✦ developing an OWL ontology for defining document markup and another OWL ontology for specifying its semantics
     • Try to guess what we did
  10. 10. The Extremely Annotational RDF Markup (EARMARK) is at the same time a markup meta-language and an ontology of (document) markup
     ✦ More expressive than XML: it allows markup structures to be organised as graphs
     ✦ It makes it easy to associate OWL semantics with document items: an EARMARK document is a set of OWL assertions, and all the markup items and text nodes are individuals of particular classes
     ✦ A lot of tools are available: a Java API, and frameworks to convert XML documents into EARMARK ones and to convert complex EARMARK documents (i.e., those having a graph structure) into XML ones, applying overlapping tricks to store as much information as possible in the simple XML tree hierarchy
     ✦ more information at
  11. 11. An example: XML tricks
     • Text: “Fabio says that overlhappens”
     • Intended structure: p contains agent (“Fabio”), noun (“overlap”) and verb (“happens”)
     • This is not directly representable in XML (unless using tricks): “noun” and “verb” overlap
     • To be representable in XML, the structure should be: p contains agent, a first noun fragment, and verb, which in turn contains a second noun fragment
     • XML serialisation with TEI fragmentation:
         <p>
           <agent>Fabio</agent> says that
           <noun xml:id="e1" next="e2">overl</noun>
           <verb>h<noun xml:id="e2">ap</noun>pens</verb>
         </p>
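
To make the fragmentation trick on this slide concrete, here is a minimal Python sketch, assuming the fragments are chained only through the xml:id and next attributes shown in the serialisation. The function name reassemble is hypothetical, not part of TEI or of any EARMARK tool; it simply follows each chain and joins the text of the pieces.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined, so ElementTree expands xml:id to this name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

FRAGMENTED = (
    '<p><agent>Fabio</agent> says that '
    '<noun xml:id="e1" next="e2">overl</noun>'
    '<verb>h<noun xml:id="e2">ap</noun>pens</verb></p>'
)


def reassemble(root, tag):
    """Return the text of each logical `tag` element, joining fragments
    that are chained through xml:id / next (hypothetical helper)."""
    elements = list(root.iter(tag))
    by_id = {el.get(XML_ID): el for el in elements if el.get(XML_ID)}
    # Ids that some other fragment points to: these are not chain heads.
    continued = {el.get("next") for el in elements if el.get("next")}
    logical = []
    for el in elements:
        if el.get(XML_ID) in continued:
            continue
        parts = []
        while el is not None:
            parts.append("".join(el.itertext()))
            next_id = el.get("next")
            el = by_id.get(next_id) if next_id else None
        logical.append("".join(parts))
    return logical


root = ET.fromstring(FRAGMENTED)
print(reassemble(root, "noun"))  # the two noun fragments joined
print(reassemble(root, "verb"))  # the verb, including the nested fragment
```

The sketch shows the cost of the trick: the logical noun only exists after an application-level join, which a plain XML parser knows nothing about.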
  12. 12. An example: EARMARK document
     • Same text and structure as before: p contains agent, noun and verb over “Fabio says that overlhappens”
     • Docuverse and ranges:
         ex:doc a :StringDocuverse;
           :hasContent "Fabio says that overlhappens".
         ex:r0-5 a :PointerRange; :refersTo ex:doc;
           :begins "0"; :ends "5".
         ex:r5-16 a :PointerRange; :refersTo ex:doc;
           :begins "5"; :ends "16".
         ex:r16-21 a :PointerRange; :refersTo ex:doc;
           :begins "16"; :ends "21".
         ex:r22-24 a :PointerRange; :refersTo ex:doc;
           :begins "22"; :ends "24".
         ex:r21-28 a :PointerRange; :refersTo ex:doc;
           :begins "21"; :ends "28".
     • Markup items:
         ex:agent a :Element;
           :hasGeneralIdentifier "agent";
           c:firstItem [c:itemContent ex:r0-5].
         ex:noun a :Element;
           :hasGeneralIdentifier "noun";
           c:firstItem [c:itemContent ex:r16-21;
             c:nextItem [c:itemContent ex:r22-24]].
         ex:verb a :Element;
           :hasGeneralIdentifier "verb";
           c:firstItem [c:itemContent ex:r21-28].
         ex:p a :Element;
           :hasGeneralIdentifier "p";
           c:firstItem [c:itemContent ex:agent;
             c:nextItem [c:itemContent ex:r5-16;
               c:nextItem [c:itemContent ex:noun;
                 c:nextItem [c:itemContent ex:verb]]]].
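
The range-based model on this slide can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the classes PointerRange and Element below are plain dataclasses named after the EARMARK classes, not the EARMARK Java API, and the helper names covered and overlap are hypothetical. It shows that noun and verb can share characters without any fragmentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointerRange:
    """A span [begins, ends) over a string docuverse."""
    docuverse: str
    begins: int
    ends: int

    def text(self):
        return self.docuverse[self.begins:self.ends]

    def positions(self):
        return set(range(self.begins, self.ends))


@dataclass
class Element:
    """A markup item with an ordered list of ranges and/or elements."""
    general_identifier: str
    items: list

    def text(self):
        return "".join(item.text() for item in self.items)

    def positions(self):
        covered = set()
        for item in self.items:
            covered |= item.positions()
        return covered


def overlap(a, b):
    """Character positions shared by two markup items (hypothetical helper)."""
    return sorted(a.positions() & b.positions())


DOC = "Fabio says that overlhappens"
agent = Element("agent", [PointerRange(DOC, 0, 5)])
noun = Element("noun", [PointerRange(DOC, 16, 21), PointerRange(DOC, 22, 24)])
verb = Element("verb", [PointerRange(DOC, 21, 28)])
p = Element("p", [agent, PointerRange(DOC, 5, 16), noun, verb])

print(noun.text(), verb.text())  # each element reads as a whole word
print(overlap(noun, verb))       # positions both elements cover
```

Because elements point at ranges instead of containing text, the same characters can belong to several elements, which is what a tree of nested tags cannot express directly.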
  13. 13. Towards markup semantics
     • EARMARK is suitable for expressing markup semantics straightforwardly using OWL
     • What model can we use? It must:
       ✦ follow precise and theoretically founded principles
       ✦ be interoperable across different markup vocabularies
     • A large number of vocabularies address the representation of terms vs. meanings vs. things, e.g., SKOS, FRBR, CIDOC, OWL-WordNet
     • Problems:
       ✦ they are too specific to particular contexts
       ✦ they are not interoperable
  14. 14. Linguistic Act ontology design pattern
     • References: any individual from the world we are describing, e.g., Fabio
     • Meanings: any (meta-level) object that explains something, e.g., person
     • Information entities: any symbol that has a meaning or denotes one or more references, e.g., the string “Fabio”
     • Linguistic acts: any communicative situation including information entities, agents, meanings, references, and a possible spatio-temporal context, e.g., to add markup to a document
  15. 15. Example: “Results” section of a paper
     • 2 XML excerpts of “Results” sections
       ✦ First excerpt:
           <div class="section">
             <h1>Results</h1>
             <p>...</p>
           </div>
       ✦ Second excerpt:
           <section>
             <info>
               <title>Results</title>
             </info>
             <para>...</para>
           </section>
     • Related EARMARK conversions
       ✦ First excerpt:
           ex1:div a :Element;
             :hasGeneralIdentifier "div";
             c:firstItem [c:itemContent ex1:class;
               c:nextItem [c:itemContent ex1:h1;
                 c:nextItem [c:itemContent ex1:p]]];
             la:expresses doco:Section, deo:Results.
           ...
           ex1:p a :Element;
             :hasGeneralIdentifier "p";
             c:firstItem [c:itemContent ex1:someText];
             la:expresses doco:Paragraph.
           ...
       ✦ Second excerpt:
           ex2:section a :Element;
             :hasGeneralIdentifier "section";
             c:firstItem [c:itemContent ex2:info;
               c:nextItem [c:itemContent ex2:para]];
             la:expresses doco:Section, deo:Results.
           ...
           ex2:para a :Element;
             :hasGeneralIdentifier "para";
             c:firstItem [c:itemContent ex2:someText];
             la:expresses doco:Paragraph.
           ...
     • We are using the Document Components Ontology (prefix doco:) and the Discourse Elements Ontology (prefix deo:) to specify the semantics of markup elements
  16. 16. Searches on heterogeneous repositories
     • Problem: how to search across a large number of digital libraries that store documents as XML in different, non-interoperable formats?
     • Query: give me all the markup elements that represent paragraphs of any “Results” section of any available document written by any person called Fabio
         SELECT ?x WHERE {
           ?x a :Element ;
              la:expresses doco:Paragraph ;
              dc:creator [a foaf:Person ; foaf:name "Fabio"] ;
              (^c:itemContent/^c:item)+ [a :Element ;
                la:expresses doco:Section , deo:Results]
         }
     • ex1:p and ex2:para are returned
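
The logic of this query can be sketched in plain Python, without a SPARQL engine. Everything below is an illustrative assumption: the Element class, the parent link (standing in for the inverse c:itemContent / c:item path), the creator field (standing in for dc:creator with a foaf:name) and the extra ex3 elements used as a negative case are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    name: str
    expresses: frozenset            # concepts linked through la:expresses
    creator: Optional[str] = None   # stand-in for dc:creator / foaf:name
    parent: Optional["Element"] = None


def ancestors(element):
    """Walk up the containment chain, one or more steps (the '+' in the path)."""
    while element.parent is not None:
        element = element.parent
        yield element


def paragraphs_in_results(elements, author):
    """Names of paragraph elements by `author` inside a Results section."""
    wanted = {"doco:Section", "deo:Results"}
    return [
        e.name
        for e in elements
        if "doco:Paragraph" in e.expresses
        and e.creator == author
        and any(wanted <= a.expresses for a in ancestors(e))
    ]


results = frozenset({"doco:Section", "deo:Results"})
paragraph = frozenset({"doco:Paragraph"})

div = Element("ex1:div", results)
p = Element("ex1:p", paragraph, creator="Fabio", parent=div)
section = Element("ex2:section", results)
para = Element("ex2:para", paragraph, creator="Fabio", parent=section)
# Hypothetical negative case: a section that is not a Results section.
intro = Element("ex3:sec", frozenset({"doco:Section"}))
other = Element("ex3:p", paragraph, creator="Fabio", parent=intro)

ELEMENTS = [div, p, section, para, intro, other]
print(paragraphs_in_results(ELEMENTS, "Fabio"))
```

The filter never looks at tag names such as p or para, only at what the elements express, which is why both vocabularies are matched by the same query.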
  17. 17. Semantic format conversion
     • Problem: how to convert a document from an (unknown) format into a target one, without knowing the markup vocabulary of the former but with the possibility of querying its semantics
     • Convert: substitute any markup element representing a section with a new one named “sec” that contains the same elements and text content as the removed one
         DELETE {?s :hasGeneralIdentifier ?gi}
         INSERT {?s :hasGeneralIdentifier "sec"}
         WHERE {
           ?s a :Element;
              :hasGeneralIdentifier ?gi;
              la:expresses doco:Section
         }
     • The previous excerpts change as follows:
       ✦ First excerpt:
           <sec class="section">
             <h1>Results</h1>
             ...
       ✦ Second excerpt:
           <sec>
             <info>
               <title>Results</title>
             ...
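
The same rename-by-meaning idea can be sketched directly on XML with Python's standard library. This is a simplification, not the EARMARK workflow: the function expresses is a hypothetical, hard-coded stand-in for the la:expresses assertions, and rename_by_meaning is an illustrative name.

```python
import xml.etree.ElementTree as ET


def expresses(el):
    """Hypothetical stand-in for la:expresses: the concepts an element expresses."""
    if el.tag == "section" or (el.tag == "div" and el.get("class") == "section"):
        return {"doco:Section"}
    if el.tag in ("p", "para"):
        return {"doco:Paragraph"}
    return set()


def rename_by_meaning(xml_text, concept, new_name):
    """Rename every element expressing `concept`; content and attributes are kept."""
    root = ET.fromstring(xml_text)
    for el in list(root.iter()):
        if concept in expresses(el):
            el.tag = new_name
    return ET.tostring(root, encoding="unicode")


FIRST = '<div class="section"><h1>Results</h1><p>...</p></div>'
SECOND = '<section><info><title>Results</title></info><para>...</para></section>'

print(rename_by_meaning(FIRST, "doco:Section", "sec"))
print(rename_by_meaning(SECOND, "doco:Section", "sec"))
```

As in the SPARQL update, only the general identifier changes; attributes such as class and all child content are left untouched.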
  18. 18. Markup sensibility
     • Problem: how to estimate whether a markup element that is valid at the syntactic and structural level is also valid at the semantic level
     • Semantic constraints can be defined as axioms of the underlying ontology, in order to check whether a document adheres to them or violates them
     • Akoma Ntoso excerpt:
         <akomaNtoso> ...
           <TLCPerson id="smith" href="/ontology/uk/person/JohnSmith" /> ...
           <speech id="sp_1" by="#smith" as="#mineconomy">
             <p>Honorable Members of the Parliament...</p>
           </speech> ...
         </akomaNtoso>
     • Related EARMARK assertions:
         <smith> a :Element;
           :hasGeneralIdentifier "TLCPerson";
           la:denotes </ontology/ul/person/JohnSmith> ...
         </ontology/ul/person/JohnSmith> a akomantoso:Person.
         <sp_1> a :Element;
           :hasGeneralIdentifier "speech";
           la:expresses akomantoso:Speech;
           la:denotes _:aSpeechEvent; ...
         _:aSpeechEvent a akomantoso:SpeechEvent;
           akomantoso:hasSpeaker </ontology/ul/person/JohnSmith>.
         [] a la:LinguisticAct;
           sit:isSettingFor <sp_1>, akomantoso:Speech,
             </ontology/ul/person/JohnSmith>, _:aSpeechEvent.
  19. 19. Verifying semantic constraints
     • Verify: check whether the markup element “speech” denotes exactly one speech event that involves at least one person as speaker, where that person is introduced in the document through a markup element
         (Element that hasGeneralIdentifier value "speech")
         SubClassOf
         (sit:hasSetting only
           (la:LinguisticAct
             that sit:isSettingFor exactly 1 (Element and la:InformationEntity)
             and sit:isSettingFor exactly 1 (
               (akomantoso:SpeechEvent and la:Reference)
               that akomantoso:hasSpeaker some (
                 akomantoso:Person that la:isDenotedBy some Element
               )
             )
             and sit:isSettingFor value akomantoso:Speech
           )
         )
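
A rough Python approximation of this check follows. It is a sketch under loud assumptions: it works in a closed world over an explicit set of triples, whereas an OWL reasoner works under the open-world assumption; it skips the la:LinguisticAct setting and tests the denotation chain directly; and the short identifiers (JohnSmith, gi) and the function speech_is_sensible are hypothetical.

```python
TRIPLES = {
    ("smith", "gi", "TLCPerson"),
    ("smith", "la:denotes", "JohnSmith"),
    ("JohnSmith", "a", "akomantoso:Person"),
    ("sp_1", "gi", "speech"),
    ("sp_1", "la:expresses", "akomantoso:Speech"),
    ("sp_1", "la:denotes", "_:aSpeechEvent"),
    ("_:aSpeechEvent", "a", "akomantoso:SpeechEvent"),
    ("_:aSpeechEvent", "akomantoso:hasSpeaker", "JohnSmith"),
}


def objects(triples, subject, predicate):
    return {o for (s, p, o) in triples if s == subject and p == predicate}


def subjects(triples, predicate, obj):
    return {s for (s, p, o) in triples if p == predicate and o == obj}


def speech_is_sensible(triples, element):
    """True if `element` denotes exactly one speech event with at least one
    speaker who is a Person denoted by some markup element (closed-world check)."""
    events = {
        o
        for o in objects(triples, element, "la:denotes")
        if "akomantoso:SpeechEvent" in objects(triples, o, "a")
    }
    if len(events) != 1:
        return False
    (event,) = events
    speakers = objects(triples, event, "akomantoso:hasSpeaker")
    return any(
        "akomantoso:Person" in objects(triples, speaker, "a")
        and bool(subjects(triples, "la:denotes", speaker))
        for speaker in speakers
    )


print(speech_is_sensible(TRIPLES, "sp_1"))
```

Removing the triple in which the TLCPerson element denotes the speaker makes the check fail, which is the kind of semantic inconsistency that schema validation alone would not detect.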
  20. 20. Conclusions
     • Markup semantics is still an interesting research field, with a lot of possible applications in real-world scenarios
     • We proposed our approach for addressing markup semantics through Semantic Web technologies, and we introduced EARMARK, a new document markup meta-language, and the Linguistic Act ontology design pattern for expressing the semantics of EARMARK document markup
     • We showed how to use these models in real scenarios where markup semantics helps with particular tasks, such as querying heterogeneous document repositories, converting document markup across different vocabularies, and verifying the validity of markup elements at a semantic level
     • Future development:
       ✦ a software assistant that helps users define the markup semantics of a given XML schema
       ✦ two applications for the semantic validation of markup documents and for the visualisation of document parts according to their semantics
  21. 21. Thanks for your attention