GRDDL
The Why, What, How, and Where




                            Chimezie Ogbuji
                            Cleveland Clinic Foundation
GRDDL: The Acronym
 Gleaning
 Resource
 Descriptions (from)
 Dialects (of)
 Language



   Rather long and intimidating
GRDDL: By Deconstruction

   Wordnet Definition of Glean:
    ◦ (gather, as of natural products)
    ◦ Synonyms: reap, harvest.
   Resource Description Framework (RDF)
    ◦ Logical assertions
   Dialects of Language
    ◦ XML document families (XHTML, for instance)
GRDDL: By Analogy
           GRDDL can be thought of
           as a protocol for sowing
           semantics in web content
           for later harvest.
The Why
   Vast amount of latent semantics in markup
        <span>Chimezie Ogbuji</span>
   Web content today is primarily built for
    human consumption
   Text indexing will only get you so far for
    document retrieval
   If machines are meant to harvest RDF from
    documents, reproducible protocols are
    needed
The Why (Cont.)
 Microformats, eRDF, and RDFa
     Specific to a particular family of
      documents
     XHTML and HTML
 If the goal is machine consumption, the
  bar needs to be raised beyond XHTML
The Why (Cont.)
 It seems easy to forget that XHTML is
  indeed an XML dialect
     You would think the (X) would make
      that obvious
 What was needed was a standard way to
  harvest RDF that is applicable to all XML
  dialects
The What
   Faithful rendition
   Transformations
   GRDDL result
   Source documents
   GRDDL-aware Agents
Faithful Rendition
“By specifying a GRDDL transformation, the author of a document
  states that the transformation will provide a faithful rendition in
  RDF of information (or some portion of the information)
  expressed through the XML dialect used in the source document.”

 Licenses an author-certified interpretation of
  an XML document
 A powerful paradigm for messaging
    See David Booth's “RDF and SOA”
        http://www.w3.org/2007/01/wos-papers/booth
GRDDL Transformations
   Functions that take an XML document and
    return an RDF graph
   Transformations can be written in any
    particular language
   The “reference” transformation language is
    XSLT
        “[XSLT1] is the format most widely supported by GRDDL-
         aware agents as of this writing […] is specifically designed to
         express XML to XML transformations and has some good
         safety characteristics”
Other Transformation Languages
   “.. technically Javascript, C, or virtually any
    other programming language may be used to
    express transformations for GRDDL”
   However, these transformations need to be
    deterministic in order to ensure the result is
    a faithful rendition
   Hence, they must be functions
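
To make the "must be functions" point concrete, here is a minimal sketch in Python using only the standard library. The function name glean_authors, the <author> element, and the example.org URI are hypothetical, and triples are modelled as plain tuples rather than a real RDF graph. The only point is that the same input always yields the same output.

```python
import xml.etree.ElementTree as ET

# Dublin Core "creator" property, used here purely for illustration.
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"


def glean_authors(xml_text, doc_uri):
    """Hypothetical transformation: one dc:creator triple per <author> element.

    It reads nothing but its arguments (no clock, no randomness, no network),
    so it behaves as a function in the mathematical sense.
    """
    root = ET.fromstring(xml_text)
    triples = {
        (doc_uri, DC_CREATOR, el.text.strip())
        for el in root.iter("author")
        if el.text and el.text.strip()
    }
    # Sorting gives a stable order, so repeated runs compare equal.
    return sorted(triples)


source = "<book><author>Chimezie Ogbuji</author></book>"
first = glean_authors(source, "http://example.org/book")
second = glean_authors(source, "http://example.org/book")
```

Running the transformation twice over the same source gives identical results, which is the property that makes the output trustworthy as a faithful rendition.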
GRDDL Result
   The result of applying the transformation is
    an RDF serialization
   The RDF graph that corresponds to the
    serialization is a GRDDL result of the
    original document
   The “reference” result format is RDF/XML
   Other formats can be used (Turtle, N3, etc.)
GRDDL Source Documents
   The class of documents for which GRDDL
    defines a way to extract a result graph:
      XML Documents
      XML Namespace Documents
      Valid XHTML
      XHTML Profiles
GRDDL Source Documents
GRDDL: XML Documents
   GRDDL Namespace (grddl prefix)
              http://www.w3.org/2003/g/data-view#


   transformation attribute
    <?xml version="1.0" encoding="UTF-8"?>
    <root
     xmlns:grddl="http://www.w3.org/2003/g/data-view#"
     grddl:transformation=".. path to transform ..">
    … XML content ..
    </root>
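
How an agent might read that attribute can be sketched as follows. This assumes the attribute value is a whitespace-separated list of URI references resolved against the document's own location. The function name xml_transformations and the example.org URIs are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def xml_transformations(xml_text, base_uri):
    """Return absolute URIs of the transformations named on the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced attribute names as "{namespace}local".
    value = root.get("{%s}transformation" % GRDDL_NS, "")
    return [urljoin(base_uri, ref) for ref in value.split()]


doc = """<root xmlns:grddl="http://www.w3.org/2003/g/data-view#"
      grddl:transformation="glean.xsl http://example.org/other.xsl">
  <item/>
</root>"""

found = xml_transformations(doc, "http://example.org/docs/data.xml")
```

A document without the attribute simply yields an empty list, so the caller can fall back to the other source types.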
Namespace Documents
“Transformations can be associated not only with individual
   documents but also with whole dialects that share an XML
   namespace”

   A GRDDL source document lives at the
    location of the namespace URI of the root
    element (the namespace document)
   The GRDDL result of the namespace
    document has a statement of the form:
            ?nsDoc grddl:namespaceTransformation ?txDoc
•   txDoc is the location of a transformation
    applicable to such XML documents
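
A rough sketch of the lookup in Python, under simplifying assumptions: the GRDDL result of the namespace document is already available as a list of (subject, predicate, object) tuples, and the namespace URI itself is used as ?nsDoc. The function names and example.org URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"
NS_TRANSFORMATION = GRDDL_NS + "namespaceTransformation"


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or None if it has none."""
    tag = ET.fromstring(xml_text).tag
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return None


def namespace_transformations(ns_uri, ns_doc_triples):
    """Pick out every ?txDoc in: ?nsDoc grddl:namespaceTransformation ?txDoc."""
    return [o for (s, p, o) in ns_doc_triples
            if s == ns_uri and p == NS_TRANSFORMATION]


order = '<po:order xmlns:po="http://example.org/po"/>'
ns_uri = root_namespace(order)

# Stand-in for the GRDDL result of the namespace document.
ns_result = [(ns_uri, NS_TRANSFORMATION, "http://example.org/po2rdf.xsl")]
transforms = namespace_transformations(ns_uri, ns_result)
```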
Valid XHTML Documents
    <html xmlns="http://www.w3.org/1999/xhtml">
     <head
      profile="http://www.w3.org/2003/g/data-view">
      <title>Some Document</title>
        <link rel="transformation"
              href=".. path to transformation .." />
        ...
     </head>
    …
    </html>
   Refers to the GRDDL XHTML profile
      Licenses the interpretation of
       rel="transformation" links
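
The profile check and link harvesting can be sketched like this. It is a simplified illustration that only looks at link elements inside head, as on the slide, and assumes well-formed XHTML without named entities. The function name xhtml_transformations and the example.org URIs are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def xhtml_transformations(xhtml_text, base_uri):
    """Return transformation URIs, but only if the GRDDL profile is declared."""
    root = ET.fromstring(xhtml_text)
    head = root.find("{%s}head" % XHTML_NS)
    # The profile attribute is a space-separated list of URIs.
    if head is None or GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    found = []
    for link in head.iter("{%s}link" % XHTML_NS):
        # rel is also space-separated, so test for the token, not the string.
        if "transformation" in link.get("rel", "").split() and link.get("href"):
            found.append(urljoin(base_uri, link.get("href")))
    return found


page = """<html xmlns="http://www.w3.org/1999/xhtml">
 <head profile="http://www.w3.org/2003/g/data-view">
  <title>Some Document</title>
  <link rel="transformation" href="glean.xsl" />
 </head>
 <body/>
</html>"""

links = xhtml_transformations(page, "http://example.org/page.html")
```

Without the profile on head, the same link is not licensed as a GRDDL transformation and the sketch returns an empty list.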
XHTML Profiles
“Adding a GRDDL profileTransformation assertion to a profile
  document is much like adding a namespaceTransformation
  assertion to a namespace document”

   A GRDDL source document lives at the
    location of the profile URI of an XHTML
    document
   The GRDDL result of the profile document
    has a statement of the form:
            ?profileDoc grddl:profileTransformation ?txDoc
•   txDoc is the location of a transformation
    applicable to such XML documents
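
The profile case mirrors the namespace case. A minimal sketch, again assuming the profile document's GRDDL result is available as (subject, predicate, object) tuples. The function names and the example.org profile URI are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_NS = "http://www.w3.org/2003/g/data-view#"
PROFILE_TRANSFORMATION = GRDDL_NS + "profileTransformation"


def profile_uris(xhtml_text):
    """Return every profile URI declared on the document's head element."""
    head = ET.fromstring(xhtml_text).find("{%s}head" % XHTML_NS)
    return head.get("profile", "").split() if head is not None else []


def profile_transformations(profile_uri, profile_doc_triples):
    """Pick out every ?txDoc in: ?profileDoc grddl:profileTransformation ?txDoc."""
    return [o for (s, p, o) in profile_doc_triples
            if s == profile_uri and p == PROFILE_TRANSFORMATION]


card = """<html xmlns="http://www.w3.org/1999/xhtml">
 <head profile="http://example.org/profile/card"><title>Card</title></head>
 <body/>
</html>"""

profiles = profile_uris(card)

# Stand-in for the GRDDL result of the profile document.
profile_result = [("http://example.org/profile/card",
                   PROFILE_TRANSFORMATION,
                   "http://example.org/card2rdf.xsl")]
profile_txs = profile_transformations(profiles[0], profile_result)
```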
The How
   GRDDL builds on existing XML & RDF
    standards
   An implementation mostly needs to
    orchestrate:
       Parsing of data representations
       Resolving representations from web locations
       The necessary XML processing to peek into and
        harvest RDF from the various sources
       The highly recursive nature of GRDDL 
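
The recursion can be sketched in a few lines of Python. This is not how GRDDL.py is written; it is a simplified illustration covering only plain XML and namespace documents. The callables fetch and run are hypothetical stand-ins supplied by the caller for retrieving a representation and applying a transformation, and triples are plain tuples.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"
NS_TX = GRDDL_NS + "namespaceTransformation"


def glean(uri, fetch, run, seen=None):
    """Return the set of triples gleaned from the XML document at uri.

    fetch(uri) -> XML text; run(tx_uri, xml_text, uri) -> iterable of triples.
    """
    seen = set() if seen is None else seen
    if uri in seen:  # guard against cycles between documents
        return set()
    seen.add(uri)
    xml_text = fetch(uri)
    root = ET.fromstring(xml_text)
    # 1. Transformations named directly on the root element.
    attr = root.get("{%s}transformation" % GRDDL_NS, "")
    txs = [urljoin(uri, ref) for ref in attr.split()]
    # 2. Transformations named by the namespace document, which is itself
    #    a GRDDL source document: this is where the recursion comes from.
    if root.tag.startswith("{"):
        ns_uri = root.tag[1:].split("}", 1)[0]
        ns_result = glean(ns_uri, fetch, run, seen)
        txs += [o for (s, p, o) in ns_result if s == ns_uri and p == NS_TX]
    result = set()
    for tx in txs:
        result.update(run(tx, xml_text, uri))
    return result


# Toy "web": a document whose namespace document names the transformation.
pages = {
    "http://example.org/doc.xml": '<d:doc xmlns:d="http://example.org/ns"/>',
    "http://example.org/ns": '<ns xmlns:grddl="%s" '
                             'grddl:transformation="ns2rdf.xsl"/>' % GRDDL_NS,
}


def fake_run(tx_uri, xml_text, uri):
    # Pretend transformations, keyed by their URI.
    if tx_uri == "http://example.org/ns2rdf.xsl":
        return {(uri, NS_TX, "http://example.org/doc2rdf.xsl")}
    if tx_uri == "http://example.org/doc2rdf.xsl":
        return {(uri, "http://example.org/title", "hello")}
    return set()


triples = glean("http://example.org/doc.xml", pages.__getitem__, fake_run)
```

Passing fetch and run in as parameters keeps the orchestration separate from the parsing, retrieval, and transformation machinery, in the spirit of the layering described for GRDDL.py.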
Technological Overlap
Anatomy of a GRDDL
Implementation: GRDDL.py
   A reference implementation from scratch
   650 LOC
        RDFLib, 4Suite-XML, and Python control logic
   A layered approach
        Core module that handles transformations
        One module per source type stacked on top of the
         core
        A top layer that orchestrates the recursion and
         identification of which ‘class’ a source document
         belongs to
GRDDL.py Core
Component Stack
The Where
   GRDDL services online:
        http://triplr.org/ (Stuff in, triples out)
        http://www.w3.org/2007/08/grddl/ (W3C GRDDL
         Service)
   Primary GRDDL implementations:
        Redland
        GRDDL.py
        Virtuoso
        GRDDL Reader for Jena
   RDFa is the most common GRDDL source
    content format in the wild
Hidden Value Proposition
   Supports separation of concerns:
      XML for messaging, data collection,
       structural validation
      RDF for expressive assertions, inference,
       etc.
   A way to invest in data richness and
    accessibility
GRDDL Use Cases
   Embedding scheduling assertions on
    personal pages
   Using GRDDL for extracting RDF from XML
    medical record documents
      Cleveland Clinic use case (clinical
       research)
   Aggregating web-based product reviews
   Embedding web service descriptions
   Adding semantic assertions to XML schemas
   Embedding semantic assertions in wikis
