Virtuoso Sponger - RDFizer Middleware for creating RDF from non RDF Data Sources

Published in: Technology

  1. 1. Virtuoso Sponger Extracting RDF Structured Data from Non-RDF Sources
  2. 2. Growing the Semantic Web <ul><li>Classic “chicken ‘n’ egg” problem has impeded the growth of the Semantic Web </li></ul><ul><ul><li>Development of applications for the Semantic Web will remain small-scale without a critical mass of RDF data. </li></ul></ul><ul><ul><li>A critical mass of RDF data won’t be achieved without adequate Semantic Web applications and tools. </li></ul></ul><ul><li>A new class of tools is emerging in response to this need… “RDFizers” </li></ul><ul><ul><li>Transform non-RDF data into RDF </li></ul></ul><ul><ul><li>Virtuoso Sponger is one such RDFizer </li></ul></ul>
  3. 3. Virtuoso Sponger <ul><li>An RDFizer introduced in Virtuoso 5.0 </li></ul><ul><li>Provides built-in RDF middleware for transforming non-RDF data into RDF “on the fly”. </li></ul><ul><li>You can use non-RDF data sources as Semantic Web data sources. </li></ul><ul><li>Inputs : Wide variety of non-RDF Web data sources, e.g.: </li></ul><ul><ul><li>(X)HTML Web Pages (including hosted microformats) </li></ul></ul><ul><ul><li>Web services (Google, Del.icio.us, Flickr etc.) </li></ul></ul><ul><ul><li>Binary files (MS Office, PDF, OpenDocument etc.) </li></ul></ul><ul><li>Output : RDF structured data </li></ul>
  4. 4. Inputs: Supported Data Sources <ul><li>RDF (inc. N3, Turtle) </li></ul><ul><ul><li>SIOC, SKOS, FOAF, AtomOWL, Annotea … </li></ul></ul><ul><li>(X)HTML pages </li></ul><ul><ul><li>HTML header metadata: Dublin Core </li></ul></ul><ul><ul><li>Microformats: eRDF, RDFa, hCard, hCalendar, XFN, xFolk … </li></ul></ul><ul><li>Syndication formats </li></ul><ul><ul><li>RSS 2.0, Atom, OPML, OCS, XBEL </li></ul></ul><ul><li>GRDDL </li></ul><ul><li>Web service APIs: Google Base, Flickr, Del.icio.us, Ning … </li></ul><ul><li>Files: </li></ul><ul><ul><li>Binary files: MS Office, OpenOffice, images, audio, video … </li></ul></ul><ul><ul><li>Data exchange formats: iCalendar, vCard </li></ul></ul><ul><li>3rd-party metadata extractors: Aperture, Spotlight, SIMILE RDFizers </li></ul><ul><li>or add your own! </li></ul>
  5. 5. Output: Structured Data <ul><li>In the context of the Semantic Data Web: </li></ul><ul><li>“Data organized into semantic chunks or entities, with similar entities grouped together into relations or classes” – Michael Bergman (http://www.mkbergman.com), article: “More Structure, More Terminology and (hopefully) More Clarity” </li></ul>
  6. 6. Sponger Benefits <ul><li>Majority of the world's data resides in non-RDF form at the current time </li></ul><ul><li>Sponger provides a “Swiss army knife” for RDF structured data generation from non-RDF sources </li></ul><ul><li>Extracting data from non-RDF Web sources and converting it to RDF </li></ul><ul><ul><li>helps “bootstrap” the Semantic Web </li></ul></ul><ul><ul><li>helps drive the transition of the traditional Document-Web into the emerging Semantic Data-Web </li></ul></ul><ul><ul><li>exposes the data in a canonical form for querying and inference </li></ul></ul>
  7. 7. Sponger Inputs & Outputs
  8. 8. Sponger Architecture <ul><li>Sponger is comprised of Sponger Cartridges </li></ul><ul><li>Default cartridge collection is bundled as a Virtuoso VAD </li></ul><ul><li>Cartridge = Metadata Extractor + Ontology Mapper </li></ul><ul><li>Metadata extracted from non-RDF resources is mapped to a suitable ontology by Ontology Mapper to produce Structured Data </li></ul><ul><li>Sponger is highly customizable </li></ul><ul><li>Custom cartridges can be developed </li></ul><ul><ul><li>Using any language (e.g. Virtuoso PL, C/C++, Java) supported by Virtuoso Server Extensions API </li></ul></ul>
  9. 9. Using The Sponger <ul><li>Can be invoked in several ways, via: </li></ul><ul><li>Virtuoso SPARQL query processor </li></ul><ul><li>Virtuoso RDF Proxy Service (/proxy) </li></ul><ul><ul><li>E.g. http://localhost:8890/proxy </li></ul></ul><ul><li>OpenLink RDF client applications </li></ul><ul><li>ODS-Briefcase (Virtuoso WebDAV) </li></ul><ul><li>Directly through Virtuoso PL </li></ul>
  10. 10. Using the Sponger: SPARQL Query Processor <ul><li>Virtuoso extends SPARQL with IRI/URI dereferencing </li></ul><ul><li>Because the Semantic Web is highly distributed, it is unlikely that all the referenced resources/IRIs will already be in the local quad store </li></ul><ul><li>During query execution: </li></ul><ul><ul><li>From a given IRI, remote RDF resources can be downloaded and parsed, and the resulting triples stored in the local quad store </li></ul></ul><ul><li>IRI dereferencing of FROM clauses </li></ul><ul><ul><li>Downloads & stores triples from named graphs </li></ul></ul><ul><li>IRI dereferencing of SPARQL variables </li></ul><ul><ul><li>Downloads & stores triples based on proximity search from a starting IRI to a given depth (# of hops) via specified predicates </li></ul></ul>
  11. 11. SPARQL Extensions: IRI Dereferencing of FROM Clauses <ul><li>Enabled through ‘define get:…’ pragmas </li></ul><ul><li>DEFINE get:method "GET" </li></ul><ul><li>DEFINE get:soft "soft" </li></ul><ul><li>SELECT ?id </li></ul><ul><li>FROM NAMED <http://myhost/user1.ttl> </li></ul><ul><li>FROM NAMED <http://myhost/user2.ttl> </li></ul><ul><li>WHERE { GRAPH ?g { ?id a ?o } }; </li></ul><ul><li>get:soft – retrieval mode: “soft” / “replace” </li></ul><ul><li>get:uri – IRI to retrieve if not equal to IRI of FROM clause </li></ul><ul><li>get:method – HTTP “GET” or URIQA “MGET” </li></ul><ul><li>get:refresh – max allowed age (seconds) of cached resource </li></ul><ul><ul><li>can reduce expiry time specified in HTTP headers </li></ul></ul><ul><li>get:proxy – proxy server address if direct download not possible </li></ul>
  12. 12. SPARQL Extensions: IRI Dereferencing of Variables <ul><li>Enabled through ‘define input:grab-…’ pragmas </li></ul><ul><li>DEFINE input:grab-var "?more" </li></ul><ul><li>DEFINE input:grab-depth 10 </li></ul><ul><li>DEFINE input:grab-limit 100 </li></ul><ul><li>DEFINE input:grab-base "http://myhost/" </li></ul><ul><li>SELECT ?id ?fullname ?email </li></ul><ul><li>WHERE { GRAPH ?g { </li></ul><ul><li> ?id a <Person> ; <FullName> ?fullname ; <Email> ?email . </li></ul><ul><li> OPTIONAL { ?id <SeeAlso> ?more } </li></ul><ul><li>} } ; </li></ul><ul><li>input:grab-var – SPARQL variable identifying IRIs to be downloaded </li></ul><ul><li>input:grab-depth – max # of links (predicates) between nodes in graph </li></ul><ul><li>input:grab-limit – max # of resources (subject/object nodes) to retrieve </li></ul><ul><li>input:grab-base – base IRI for converting relative IRIs to absolute </li></ul><ul><li>plus others (grab-seealso, grab-destination …) – see Reference Manual. </li></ul>
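
The pragmas on the two slides above are plain text placed in front of the query, so a client can assemble them programmatically. A minimal Python sketch, assuming string values are double-quoted and integers are written bare as in the slide examples; the helper name `with_pragmas` is hypothetical, and the code only builds query text without contacting any server:

```python
def with_pragmas(query, pragmas):
    """Return the SPARQL query text prefixed with one DEFINE line per pragma.

    Integers are written bare (e.g. input:grab-depth 10); everything else is
    written as a double-quoted string, mirroring the slide examples.
    """
    lines = []
    for name, value in pragmas.items():
        if isinstance(value, int):
            rendered = str(value)
        else:
            # Escape embedded double quotes so the pragma stays one token.
            rendered = '"' + str(value).replace('"', '\\"') + '"'
        lines.append(f"DEFINE {name} {rendered}")
    return "\n".join(lines + [query.strip()])


# Query text taken from slide 12; <Person>, <FullName> etc. are the slide's
# own placeholder IRIs.
query = """
SELECT ?id ?fullname ?email
WHERE { GRAPH ?g {
  ?id a <Person> ; <FullName> ?fullname ; <Email> ?email .
  OPTIONAL { ?id <SeeAlso> ?more }
} }
"""

sponged = with_pragmas(query, {
    "input:grab-var": "?more",
    "input:grab-depth": 10,
    "input:grab-limit": 100,
    "input:grab-base": "http://myhost/",
})
print(sponged)
```

The same helper would cover the `get:` pragmas of slide 11, for example `with_pragmas(q, {"get:soft": "soft"})`.
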
  13. 13. Using the Sponger: RDF Proxy Service <ul><li>Sponger functionality is also exposed by Virtuoso “/proxy” endpoint </li></ul><ul><li>An in-built REST style Web service </li></ul><ul><li>Takes a target URL & returns its content “as is” or tries to transform it (by sponging) to RDF </li></ul><ul><li>http://demo.openlinksw.com/proxy?url=http://www.w3c.org/People/Connelly/&force=rdf </li></ul><ul><li>Provides a “pipe” for RDF browsers to browse non-RDF sources </li></ul><ul><li>Caches to temporary Virtuoso storage </li></ul><ul><ul><li>Cache invalidation similar to traditional Web Browser, based on HTTP ‘expires’ header </li></ul></ul>
  14. 14. RDF Proxy Service <ul><li>Parameters : </li></ul><ul><li>url: the URL of the target </li></ul><ul><li>force: if ‘rdf’ is specified, will try to extract RDF data from the target and return it </li></ul><ul><li>header: HTTP headers to be sent to the target </li></ul><ul><li>output-format: output MIME type of the RDF data </li></ul><ul><ul><li>‘rdf+xml’ (default) / ‘n3’ / ‘turtle’ / ‘ttl’ </li></ul></ul><ul><ul><li>if not specified, proxy service uses content negotiation </li></ul></ul>
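
The proxy parameters listed above map directly onto a query string. A minimal Python sketch of building such a request URL; the function name `proxy_url` and the target `http://example.org/page.html` are illustrative assumptions, and the sketch percent-encodes the target (standard for query strings) whereas the slide shows it unencoded:

```python
from urllib.parse import urlencode


def proxy_url(endpoint, target, force_rdf=True, output_format=None):
    """Build a /proxy request URL from the parameters named on the slide.

    endpoint      -- address of the proxy service, e.g. http://localhost:8890/proxy
    target        -- URL of the resource to sponge ('url' parameter)
    force_rdf     -- if true, add force=rdf so RDF extraction is attempted
    output_format -- optional 'output-format' value such as 'n3' or 'turtle';
                     when omitted the service is said to use content negotiation
    """
    params = [("url", target)]
    if force_rdf:
        params.append(("force", "rdf"))
    if output_format is not None:
        params.append(("output-format", output_format))
    return endpoint + "?" + urlencode(params)


example = proxy_url("http://localhost:8890/proxy",
                    "http://example.org/page.html",
                    output_format="n3")
print(example)
```
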
  15. 15. Using the Sponger: OpenLink RDF Client Applications <ul><li>Bundled as part of OpenLink AJAX Toolkit (OAT) </li></ul><ul><li>RDF Browser </li></ul><ul><li>Uses /proxy service by default </li></ul><ul><li>iSPARQL – Interactive SPARQL query builder </li></ul><ul><li>Uses /sparql service & 5 sponging settings (translated to IRI dereferencing pragmas on server) </li></ul><ul><ul><li>Get Local Data Only </li></ul></ul><ul><ul><li>Get Remote Data When Missing Locally </li></ul></ul><ul><ul><li>Get All Remote Data </li></ul></ul><ul><ul><li>Get All Remote Data & Related Data </li></ul></ul><ul><ul><li>Get Everything </li></ul></ul>
  16. 16. Using the Sponger: ODS-Briefcase (Virtuoso WebDAV) <ul><li>Briefcase = A component of OpenLink Data Spaces </li></ul><ul><li>Includes high level interface to Virtuoso WebDAV repository </li></ul><ul><li>Web browser based interaction </li></ul><ul><li>Web services support (direct use of WebDAV protocol) </li></ul><ul><li>SPARQL queryable (WebDAV location acts as RDF graph URI) </li></ul><ul><ul><li>Metadata automatically extracted at file upload time </li></ul></ul><ul><ul><li>Wide variety of file formats supported </li></ul></ul><ul><ul><li>All WebDAV resources are exposed as SIOC instance data </li></ul></ul><ul><ul><li>Extracted metadata available in two forms </li></ul></ul><ul><ul><ul><li>Pure WebDAV </li></ul></ul></ul><ul><ul><ul><li>RDF (RDF/XML, N3, Turtle) optionally synchronized with Quad Store </li></ul></ul></ul><ul><li>Virtuoso Content Crawler / RDF_Sink folder help automate uploading </li></ul>
  17. 17. SIOC as a Data Space “Glue” Ontology <ul><li>ODS has its own built-in cartridges for mapping to SIOC </li></ul><ul><li>All ODS data containers (ODS-Briefcase, ODS-Weblog, ODS-Wiki, ODS-FeedManager etc.) expose their data as SIOC instance data </li></ul><ul><li>SIOC </li></ul><ul><ul><li>provides a generic data model of containers, items and associations between items </li></ul></ul><ul><ul><ul><li>Classes include: User, UserGroup, Role, Site, Forum, Post </li></ul></ul></ul><ul><ul><li>SIOC Types Module (sioc-t) defines further types. </li></ul></ul><ul><ul><ul><li>Classes include: AddressBook, BookmarkFolder, ImageGallery etc. </li></ul></ul></ul><ul><ul><li>permits the use of other ontologies (e.g. FOAF) when describing attributes of SIOC entities </li></ul></ul><ul><ul><li>provides a generic wrapper (“glue” ontology) for describing RDF structured data derived from OpenLink Data Spaces </li></ul></ul><ul><li>All ODS-related SIOC data can be queried through SPARQL </li></ul>
  18. 18. Using the Sponger: Directly via Virtuoso PL <ul><li>Sponger cartridges are invoked through a cartridge hook </li></ul><ul><ul><li>Provides a Virtuoso PL entry point to the packaged functionality </li></ul></ul><ul><ul><li>Can be called directly from your own Virtuoso PL procedures </li></ul></ul>
  19. 19. Sponger Cartridges
  20. 20. Sponger Architecture <ul><li>Sponger is comprised of cartridges </li></ul><ul><li>Cartridge = metadata extractor + ontology mapper </li></ul><ul><li>Cartridge is invoked through cartridge hook (Virtuoso PL entry point) </li></ul><ul><li>Metadata extractor </li></ul><ul><ul><li>Performs initial data extraction </li></ul></ul><ul><li>Ontology mapper </li></ul><ul><ul><li>Generates RDF instance data from extracted (non-RDF) metadata </li></ul></ul><ul><ul><li>Extracted metadata is mapped to an ontology associated (via an internal mapping table) with the data source type </li></ul></ul><ul><ul><li>Typically uses XSLT (GRDDL or in-built Virtuoso mapping scheme) or Virtuoso PL </li></ul></ul>
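
The extractor-plus-mapper split described above can be illustrated with a toy model. The Python sketch below is purely conceptual and is not Virtuoso's API: the function names, the "key: value" input format, and the mapping table are all assumptions made for illustration (only the Dublin Core namespace IRI is real). It shows an extractor pulling raw metadata and a mapper turning it into triples via a lookup table, loosely analogous to the internal mapping table mentioned on the slide:

```python
# Dublin Core elements namespace (real); everything else here is hypothetical.
DC = "http://purl.org/dc/elements/1.1/"

# Toy mapping table: extracted metadata key -> ontology predicate IRI.
MAPPING = {"title": DC + "title", "author": DC + "creator"}


def extract_metadata(content):
    """Toy metadata extractor: read 'key: value' lines from plain text."""
    metadata = {}
    for line in content.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            metadata[key.strip().lower()] = value.strip()
    return metadata


def map_to_ontology(subject, metadata, mapping):
    """Toy ontology mapper: emit one (s, p, o) triple per mapped key."""
    return [(subject, mapping[key], value)
            for key, value in metadata.items() if key in mapping]


def cartridge(subject, content):
    """A 'cartridge' in this sketch is simply extractor followed by mapper."""
    return map_to_ontology(subject, extract_metadata(content), MAPPING)


triples = cartridge("http://example.org/doc",
                    "Title: Sponger\nAuthor: OpenLink\nPages: 45")
for triple in triples:
    print(triple)
```

Keys without an entry in the mapping table (here `pages`) are dropped, which is one possible design choice; a real cartridge may handle unmapped metadata differently.
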
  21. 21. Sponger Cartridge Invocation
  22. 22. Sponger Configuration using Conductor UI <ul><li>Virtuoso Conductor provides a browser-based graphical UI for most Virtuoso administration tasks </li></ul><ul><ul><li>including managing Sponger Cartridges and VADs </li></ul></ul><ul><li>VAD = Virtuoso Application Distribution </li></ul><ul><ul><li>Packaging & distribution system for Virtuoso extensions </li></ul></ul><ul><li>RDF Cartridges VAD </li></ul><ul><ul><li>Bundles a variety of pre-built cartridges for popular Web resources and file types </li></ul></ul><ul><ul><li>Installed as part of default Virtuoso installation </li></ul></ul>
  23. 23. Sponger Configuration using Conductor UI: RDF Cartridges Pane
  24. 24. Sponger Configuration using Conductor UI: GRDDL Filters
  25. 25. Sponger Configuration using Conductor UI: XSLT Templates
  26. 26. Sponger Configuration using Conductor UI: Schema Files / Supported Ontologies
  27. 27. Custom Cartridges <ul><li>Sponger is extensible via pluggable cartridge architecture </li></ul><ul><li>Sponge new data formats by creating your own cartridges </li></ul><ul><li>Use Virtuoso PL or any language supported by Virtuoso Server Extensions API (incl: C/C++, Java) </li></ul><ul><li>Before use, register your cartridge in the Cartridge Registry (SYS_RDF_CARTRIDGES table) via Conductor or DML </li></ul>
  28. 28. Custom Cartridges <ul><li>Cartridge Hook - Virtuoso PL Prototype </li></ul><ul><li>in graph_iri varchar : IRI of graph being retrieved </li></ul><ul><li>in new_origin_uri varchar : URI of the document being retrieved </li></ul><ul><li>in destination varchar : destination graph IRI </li></ul><ul><li>inout content any : the document content </li></ul><ul><li>inout async_queue any : preallocated asynchronous queue used to call the configured ping service </li></ul><ul><li>inout ping_service any : URL of the ping service, as assigned to the PingService parameter in the [SPARQL] section of the virtuoso.ini file. This argument could be used to notify the PingTheSemanticWeb RDF document repository & notification service </li></ul><ul><li>inout api_key any : unique string providing cartridge specific data taken from the RC_KEY column of the DB.DBA.SYS_RDF_CARTRIDGES table </li></ul>
  29. 29. Flickr Cartridge Extracts <ul><li>procedure DB.DBA.RDF_LOAD_FLICKR_IMG ( </li></ul><ul><li>in graph_iri varchar, in new_origin_uri varchar, in dest varchar, </li></ul><ul><li>inout _ret_body any, inout aq any, inout ps any, inout _key any) </li></ul><ul><li>{ </li></ul><ul><li>declare xd, xt, url, tmp, api_key, img_id, hdr, exif any; </li></ul><ul><li>... </li></ul><ul><li>url := sprintf ('http://api.flickr.com/services/rest/?method=flickr.photos.getInfo&photo_id=%s&api_key=%s', img_id, api_key); </li></ul><ul><li>tmp := http_get (url, hdr); </li></ul><ul><li>... </li></ul><ul><li>xd := xtree_doc (tmp); </li></ul><ul><li>... </li></ul><ul><li>xt := xslt (registry_get ('_rdf_cartridges_path_') || 'xslt/flickr2rdf.xsl', </li></ul><ul><li>xd, vector ('baseUri', coalesce (dest, graph_iri), 'exif', exif)); </li></ul><ul><li>xd := serialize_to_UTF8_xml (xt); </li></ul><ul><li>DB.DBA.RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri)); </li></ul><ul><li>return 1; </li></ul><ul><li>} </li></ul>
  30. 30. Custom Resolvers <ul><li>Sponger supports pluggable “Custom Resolver” cartridges </li></ul><ul><li>Support dereferencing of other forms of URIs besides HTTP URLs, e.g.: </li></ul><ul><ul><li>URN schemes (LSIDs) and handle schemes (DOIs) </li></ul></ul><ul><li>Greatly extends range of data sources which can be linked into the Semantic Web </li></ul><ul><li>http://demo.openlinksw.com/sparql?default-graph-uri=urn:lsid:ubio.org:namebank:11815&should-sponge=soft&query=SELECT+*+WHERE+{?s+?p+?o}&format=text/html </li></ul><ul><li>Proxy service also recognizes URNs </li></ul><ul><li>http://demo.openlinksw.com/proxy?url=urn:lsid:ubio.org:namebank:11815&force=rdf </li></ul>
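
The /sparql URL on the slide above combines four query-string parameters: `default-graph-uri`, `should-sponge`, `query`, and `format`. A minimal Python sketch of assembling it and reading it back; the helper name `sponging_sparql_url` is hypothetical, the parameter names and values are those shown on the slide, and no request is actually sent:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sponging_sparql_url(endpoint, graph_uri, query,
                        sponge="soft", fmt="text/html"):
    """Build a SPARQL endpoint URL that asks for the graph to be sponged.

    graph_uri may be a non-HTTP identifier such as an LSID URN, as on the
    slide; urlencode percent-encodes it for safe transport.
    """
    params = {
        "default-graph-uri": graph_uri,
        "should-sponge": sponge,
        "query": query,
        "format": fmt,
    }
    return endpoint + "?" + urlencode(params)


url = sponging_sparql_url("http://demo.openlinksw.com/sparql",
                          "urn:lsid:ubio.org:namebank:11815",
                          "SELECT * WHERE {?s ?p ?o}")
print(url)

# Decoding the query string recovers the original parameter values.
decoded = parse_qs(urlsplit(url).query)
print(decoded["default-graph-uri"][0])
```
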
