Virtuoso Sponger - RDFizer Middleware for creating RDF from non RDF Data Sources

Published in: Technology

  1. 1. Virtuoso Sponger Extracting RDF Structured Data from Non-RDF Sources
  2. 2. Growing the Semantic Web <ul><li>Classic “chicken ‘n’ egg” problem has impeded the growth of the Semantic Web </li></ul><ul><ul><li>Development of applications for the Semantic Web will remain small-scale without a critical mass of RDF data. </li></ul></ul><ul><ul><li>A critical mass of RDF data won’t be achieved without adequate Semantic Web applications and tools. </li></ul></ul><ul><li>A new class of tools is emerging in response to this need: “RDFizers” </li></ul><ul><ul><li>Transform non-RDF data into RDF </li></ul></ul><ul><ul><li>Virtuoso Sponger is one such RDFizer </li></ul></ul>
  3. 3. Virtuoso Sponger <ul><li>An RDFizer introduced in Virtuoso 5.0 </li></ul><ul><li>Provides built-in RDF middleware for transforming non-RDF data into RDF “on the fly”. </li></ul><ul><li>You can use non-RDF data sources as Semantic Web data sources. </li></ul><ul><li>Inputs : Wide variety of non-RDF Web data sources, e.g.: </li></ul><ul><ul><li>(X)HTML Web Pages (including hosted microformats) </li></ul></ul><ul><ul><li>Web services (Google, Del.icio.us, Flickr etc.) </li></ul></ul><ul><ul><li>Binary files (MS Office, PDF, OpenDocument etc.) </li></ul></ul><ul><li>Output : RDF structured data </li></ul>
  4. 4. Inputs: Supported Data Sources <ul><li>RDF (inc. N3, Turtle) </li></ul><ul><ul><li>SIOC, SKOS, FOAF, AtomOWL, Annotea … </li></ul></ul><ul><li>(X)HTML pages </li></ul><ul><ul><li>HTML header metadata: Dublin Core </li></ul></ul><ul><ul><li>Microformats: eRDF, RDFa, hCard, hCalendar, XFN, xFolk … </li></ul></ul><ul><li>Syndication formats </li></ul><ul><ul><li>RSS 2.0, Atom, OPML, OCS, XBEL </li></ul></ul><ul><li>GRDDL </li></ul><ul><li>Web service APIs: Google Base, Flickr, Del.icio.us, Ning … </li></ul><ul><li>Files: </li></ul><ul><ul><li>Binary files: MS Office, OpenOffice, images, audio, video … </li></ul></ul><ul><ul><li>Data exchange formats: iCalendar, vCard </li></ul></ul><ul><li>3rd-party metadata extractors: Aperture, Spotlight, SIMILE RDFizers </li></ul><ul><li>or add your own! </li></ul>
  5. 5. Output: Structured Data <ul><li>In the context of the Semantic Data Web: </li></ul><ul><li>“Data organized into semantic chunks or entities, with similar entities grouped together into relations or classes” (Michael Bergman, http://www.mkbergman.com, article: “More Structure, More Terminology and (hopefully) More Clarity”) </li></ul>
  6. 6. Sponger Benefits <ul><li>Majority of the world's data resides in non-RDF form at the current time </li></ul><ul><li>Sponger provides a “Swiss army knife” for RDF structured data generation from non-RDF sources </li></ul><ul><li>Extracting data from non-RDF Web sources and converting it to RDF </li></ul><ul><ul><li>helps “bootstrap” the Semantic Web </li></ul></ul><ul><ul><li>helps drive the transition of the traditional Document-Web into the emerging Semantic Data-Web </li></ul></ul><ul><ul><li>exposes the data in a canonical form for querying and inference </li></ul></ul>
  7. 7. Sponger Inputs & Outputs
  8. 8. Sponger Architecture <ul><li>Sponger is comprised of Sponger Cartridges </li></ul><ul><li>Default cartridge collection is bundled as a Virtuoso VAD </li></ul><ul><li>Cartridge = Metadata Extractor + Ontology Mapper </li></ul><ul><li>Metadata extracted from non-RDF resources is mapped to a suitable ontology by Ontology Mapper to produce Structured Data </li></ul><ul><li>Sponger is highly customizable </li></ul><ul><li>Custom cartridges can be developed </li></ul><ul><ul><li>Using any language (e.g. Virtuoso PL, C/C++, Java) supported by Virtuoso Server Extensions API </li></ul></ul>
  9. 9. Using The Sponger <ul><li>Can be invoked in several ways, via: </li></ul><ul><li>Virtuoso SPARQL query processor </li></ul><ul><li>Virtuoso RDF Proxy Service (/proxy) </li></ul><ul><ul><li>E.g. http://localhost:8890/proxy </li></ul></ul><ul><li>OpenLink RDF client applications </li></ul><ul><li>ODS-Briefcase (Virtuoso WebDAV) </li></ul><ul><li>Directly through Virtuoso PL </li></ul>
  10. 10. Using the Sponger: SPARQL Query Processor <ul><li>Virtuoso extends SPARQL with IRI/URI dereferencing </li></ul><ul><li>Highly distributed nature of Semantic Web makes it highly unlikely all the referenced resources/IRIs will be in the local quad store </li></ul><ul><li>During query execution: </li></ul><ul><ul><li>From a given IRI, remote RDF resources can be downloaded, parsed & the resulting triples stored in local quad store </li></ul></ul><ul><li>IRI dereferencing of FROM clauses </li></ul><ul><ul><li>Downloads & stores triples from named graphs </li></ul></ul><ul><li>IRI dereferencing of SPARQL variables </li></ul><ul><ul><li>Downloads & stores triples based on proximity search from a starting IRI to a given depth (# of hops) via specified predicates </li></ul></ul>
  11. 11. SPARQL Extensions: IRI Dereferencing of FROM Clauses <ul><li>Enabled through ‘define get:…’ pragmas </li></ul><ul><li>DEFINE get:method "GET" </li></ul><ul><li>DEFINE get:soft "soft" </li></ul><ul><li>SELECT ?id </li></ul><ul><li>FROM NAMED <http://myhost/user1.ttl> </li></ul><ul><li>FROM NAMED <http://myhost/user2.ttl> </li></ul><ul><li>WHERE { GRAPH ?g { ?id a ?o } }; </li></ul><ul><li>get:soft – retrieval mode: “soft” / “replace” </li></ul><ul><li>get:uri – IRI to retrieve if not equal to IRI of FROM clause </li></ul><ul><li>get:method – HTTP “GET” or URIQA “MGET” </li></ul><ul><li>get:refresh – max allowed age (seconds) of cached resource </li></ul><ul><ul><li>can reduce expiry time specified in HTTP headers </li></ul></ul><ul><li>get:proxy – proxy server address if direct download not possible </li></ul>
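The `define get:…` pragmas are plain text lines placed in front of the query, so a client can add them before submitting it. The following is a minimal Python sketch of that idea, not part of Virtuoso: the helper name `with_pragmas` is hypothetical, and the quoting rule (strings double-quoted, integers bare) is an assumption modelled on the slide examples.

```python
def with_pragmas(query, pragmas):
    """Prefix a SPARQL query with 'define' pragma lines.

    Hypothetical helper: string values are double-quoted and integers are
    written bare, following the slide examples. Values containing double
    quotes are not escaped here.
    """
    lines = []
    for name, value in pragmas.items():
        rendered = str(value) if isinstance(value, int) else '"%s"' % value
        lines.append("define %s %s" % (name, rendered))
    return "\n".join(lines + [query])


# The query from the slide, with both graph IRIs in angle brackets.
query = """SELECT ?id
FROM NAMED <http://myhost/user1.ttl>
FROM NAMED <http://myhost/user2.ttl>
WHERE { GRAPH ?g { ?id a ?o } }"""

sponged = with_pragmas(query, {"get:method": "GET", "get:soft": "soft"})
print(sponged)
```

The resulting text can be sent to a SPARQL endpoint like any other query string; whether a given pragma is honoured depends on the server configuration.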
  12. 12. SPARQL Extensions: IRI Dereferencing of Variables <ul><li>Enabled through ‘define input:grab-…’ pragmas </li></ul><ul><li>DEFINE input:grab-var "?more" </li></ul><ul><li>DEFINE input:grab-depth 10 </li></ul><ul><li>DEFINE input:grab-limit 100 </li></ul><ul><li>DEFINE input:grab-base "http://myhost/" </li></ul><ul><li>SELECT ?id ?fullname ?email </li></ul><ul><li>WHERE { GRAPH ?g { </li></ul><ul><li> ?id a <Person> ; <FullName> ?fullname ; <Email> ?email . </li></ul><ul><li> OPTIONAL { ?id <SeeAlso> ?more } </li></ul><ul><li>} } ; </li></ul><ul><li>input:grab-var – SPARQL variable identifying IRIs to be downloaded </li></ul><ul><li>input:grab-depth – max # of links (predicates) between nodes in graph </li></ul><ul><li>input:grab-limit – max # of resources (subject/object nodes) to retrieve </li></ul><ul><li>input:grab-base – base IRI for converting relative IRIs to absolute </li></ul><ul><li>plus others (grab-seealso, grab-destination …) - see Reference Manual. </li></ul>
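The interplay of a depth bound and a resource limit can be pictured as a bounded breadth-first walk over links. The Python sketch below is a toy illustration over an in-memory dictionary and is not Virtuoso's implementation: the function `grab`, the `links` data, and the choice to count the starting IRI against the limit are all assumptions made for this example.

```python
from collections import deque


def grab(start, links, depth, limit):
    """Return IRIs reached from `start`, following at most `depth` hops
    and collecting at most `limit` resources (start included)."""
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        iri, hops = queue.popleft()
        if hops == depth:
            continue  # depth bound reached: do not follow links from here
        for target in links.get(iri, []):
            if target in seen:
                continue
            if len(seen) >= limit:
                return seen  # resource limit reached: stop the walk
            seen.append(target)
            queue.append((target, hops + 1))
    return seen


# Hypothetical "see also" links between documents.
links = {
    "http://myhost/a": ["http://myhost/b", "http://myhost/c"],
    "http://myhost/b": ["http://myhost/d"],
    "http://myhost/d": ["http://myhost/e"],
}

print(grab("http://myhost/a", links, depth=2, limit=100))
print(grab("http://myhost/a", links, depth=10, limit=2))
```

With a small depth the walk stops before reaching distant documents; with a small limit it stops as soon as enough resources have been collected, regardless of depth.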
  13. 13. Using the Sponger: RDF Proxy Service <ul><li>Sponger functionality is also exposed by the Virtuoso “/proxy” endpoint </li></ul><ul><li>A built-in REST-style Web service </li></ul><ul><li>Takes a target URL & returns its content “as is” or tries to transform it (by sponging) to RDF </li></ul><ul><li>http://demo.openlinksw.com/proxy?url=http://www.w3.org/People/Connolly/&force=rdf </li></ul><ul><li>Provides a “pipe” for RDF browsers to browse non-RDF sources </li></ul><ul><li>Caches to temporary Virtuoso storage </li></ul><ul><ul><li>Cache invalidation similar to traditional Web Browser, based on HTTP ‘expires’ header </li></ul></ul>
  14. 14. RDF Proxy Service <ul><li>Parameters : </li></ul><ul><li>url: the URL of the target </li></ul><ul><li>force: if ‘rdf’ is specified, will try to extract RDF data from the target and return it </li></ul><ul><li>header: HTTP headers to be sent to the target </li></ul><ul><li>output-format: output MIME type of the RDF data </li></ul><ul><ul><li>‘rdf+xml’ (default) / ‘n3’ / ‘turtle’ / ‘ttl’ </li></ul></ul><ul><ul><li>if not specified, proxy service uses content negotiation </li></ul></ul>
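A request to the proxy service is an ordinary URL whose query string carries the parameters listed above. As a minimal sketch, the Python helper below assembles such a URL with proper percent-encoding; the function name `proxy_url` and the target `http://example.com/page.html` are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode


def proxy_url(endpoint, target, force=None, output_format=None):
    """Build a request URL for a /proxy endpoint (hypothetical helper).

    Only parameters named on the slide are used: url, force, output-format.
    """
    params = {"url": target}
    if force is not None:
        params["force"] = force
    if output_format is not None:
        params["output-format"] = output_format
    # urlencode percent-encodes the target URL so it survives as one value.
    return endpoint + "?" + urlencode(params)


request = proxy_url(
    "http://localhost:8890/proxy",
    "http://example.com/page.html",
    force="rdf",
    output_format="turtle",
)
print(request)
```

Encoding the target matters when it contains its own `?` or `&`, which would otherwise be read as parameters of the proxy request.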
  15. 15. Using the Sponger: OpenLink RDF Client Applications <ul><li>Bundled as part of OpenLink AJAX Toolkit (OAT) </li></ul><ul><li>RDF Browser </li></ul><ul><li>Uses /proxy service by default </li></ul><ul><li>iSPARQL – Interactive SPARQL query builder </li></ul><ul><li>Uses /sparql service & 5 sponging settings (translated to IRI dereferencing pragmas on server) </li></ul><ul><ul><li>Get Local Data Only </li></ul></ul><ul><ul><li>Get Remote Data When Missing Locally </li></ul></ul><ul><ul><li>Get All Remote Data </li></ul></ul><ul><ul><li>Get All Remote Data & Related Data </li></ul></ul><ul><ul><li>Get Everything </li></ul></ul>
  16. 16. Using the Sponger: ODS-Briefcase (Virtuoso WebDAV) <ul><li>Briefcase = A component of OpenLink Data Spaces </li></ul><ul><li>Includes high level interface to Virtuoso WebDAV repository </li></ul><ul><li>Web browser based interaction </li></ul><ul><li>Web services support (direct use of WebDAV protocol) </li></ul><ul><li>SPARQL queryable (WebDAV location acts as RDF graph URI) </li></ul><ul><ul><li>Metadata automatically extracted at file upload time </li></ul></ul><ul><ul><li>Wide variety of file formats supported </li></ul></ul><ul><ul><li>All WebDAV resources are exposed as SIOC instance data </li></ul></ul><ul><ul><li>Extracted metadata available in two forms </li></ul></ul><ul><ul><ul><li>Pure WebDAV </li></ul></ul></ul><ul><ul><ul><li>RDF (RDF/XML, N3, Turtle) optionally synchronized with Quad Store </li></ul></ul></ul><ul><li>Virtuoso Content Crawler / RDF_Sink folder help automate uploading </li></ul>
  17. 17. SIOC as a Data Space “Glue” Ontology <ul><li>ODS has its own built-in cartridges for mapping to SIOC </li></ul><ul><li>All ODS data containers (ODS-Briefcase, ODS-Weblog, ODS-Wiki, ODS-FeedManager etc.) expose their data as SIOC instance data </li></ul><ul><li>SIOC </li></ul><ul><ul><li>provides a generic data model of containers, items and associations between items </li></ul></ul><ul><ul><ul><li>Classes include: User, UserGroup, Role, Site, Forum, Post </li></ul></ul></ul><ul><ul><li>SIOC Types Module (sioc-t) defines further types. </li></ul></ul><ul><ul><ul><li>Classes include: AddressBook, BookmarkFolder, ImageGallery etc. </li></ul></ul></ul><ul><ul><li>permits the use of other ontologies (e.g. FOAF) when describing attributes of SIOC entities </li></ul></ul><ul><ul><li>provides a generic wrapper (“glue” ontology) for describing RDF structured data derived from OpenLink Data Spaces </li></ul></ul><ul><li>All ODS-related SIOC data can be queried through SPARQL </li></ul>
  18. 18. Using the Sponger: Directly via Virtuoso PL <ul><li>Sponger cartridges are invoked through a cartridge hook </li></ul><ul><ul><li>Provides a Virtuoso PL entry point to the packaged functionality </li></ul></ul><ul><ul><li>Can be called directly from your own Virtuoso PL procedures </li></ul></ul>
  19. 19. Sponger Cartridges
  20. 20. Sponger Architecture <ul><li>Sponger is comprised of cartridges </li></ul><ul><li>Cartridge = metadata extractor + ontology mapper </li></ul><ul><li>Cartridge is invoked through cartridge hook (Virtuoso PL entry point) </li></ul><ul><li>Metadata extractor </li></ul><ul><ul><li>Performs initial data extraction </li></ul></ul><ul><li>Ontology mapper </li></ul><ul><ul><li>Generates RDF instance data from extracted (non-RDF) metadata </li></ul></ul><ul><ul><li>Extracted metadata is mapped to an ontology associated (via an internal mapping table) with the data source type </li></ul></ul><ul><ul><li>Typically uses XSLT (GRDDL or in-built Virtuoso mapping scheme) or Virtuoso PL </li></ul></ul>
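The extractor/mapper split can be illustrated outside Virtuoso with a toy pipeline. The Python sketch below is only an analogy, not a real cartridge: it extracts `<meta>` tags from an HTML page (the metadata extractor) and maps selected names to Dublin Core predicate IRIs through a lookup table (the ontology mapper). The class `MetaExtractor`, the `MAPPING` table and the function `cartridge` are hypothetical names; real cartridges are Virtuoso PL hooks that typically rely on XSLT.

```python
from html.parser import HTMLParser

DC = "http://purl.org/dc/elements/1.1/"


class MetaExtractor(HTMLParser):
    """Metadata extractor: collects <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            self.metadata[name.lower()] = content


# Ontology mapper: lookup table from extracted field name to predicate IRI.
MAPPING = {"dc.title": DC + "title", "dc.creator": DC + "creator"}


def cartridge(doc_iri, html):
    """Run extraction, then mapping; return (subject, predicate, object) triples."""
    extractor = MetaExtractor()
    extractor.feed(html)
    return [(doc_iri, MAPPING[key], value)
            for key, value in extractor.metadata.items() if key in MAPPING]


page = ('<html><head><meta name="DC.title" content="Sponger">'
        '<meta name="DC.creator" content="OpenLink">'
        '<meta name="viewport" content="width=device-width"></head></html>')
triples = cartridge("http://example.com/doc", page)
for triple in triples:
    print(triple)
```

Keeping the mapping in a table mirrors the slide's point that the target ontology is associated with the data source type separately from the extraction step.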
  21. 21. Sponger Cartridge Invocation
  22. 22. Sponger Configuration using Conductor UI <ul><li>Virtuoso Conductor provides a browser-based graphical UI for most Virtuoso administration tasks </li></ul><ul><ul><li>including managing Sponger Cartridges and VADs </li></ul></ul><ul><li>VAD = Virtuoso Application Distribution </li></ul><ul><ul><li>Packaging & distribution system for Virtuoso extensions </li></ul></ul><ul><li>RDF Cartridges VAD </li></ul><ul><ul><li>Bundles a variety of pre-built cartridges for popular Web resources and file types </li></ul></ul><ul><ul><li>Installed as part of default Virtuoso installation </li></ul></ul>
  23. 23. Sponger Configuration using Conductor UI: RDF Cartridges Pane
  24. 24. Sponger Configuration using Conductor UI: GRDDL Filters
  25. 25. Sponger Configuration using Conductor UI: XSLT Templates
  26. 26. Sponger Configuration using Conductor UI: Schema Files / Supported Ontologies
  27. 27. Custom Cartridges <ul><li>Sponger is extensible via pluggable cartridge architecture </li></ul><ul><li>Sponge new data formats by creating your own cartridges </li></ul><ul><li>Use Virtuoso PL or any language supported by Virtuoso Server Extensions API (incl. C/C++, Java) </li></ul><ul><li>Register your cartridge in the Cartridge Registry (SYS_RDF_CARTRIDGES table) before use, via Conductor or DML </li></ul>
  28. 28. Custom Cartridges <ul><li>Cartridge Hook - Virtuoso PL Prototype </li></ul><ul><li>in graph_iri varchar : IRI of graph being retrieved </li></ul><ul><li>in new_origin_uri varchar : URI of the document being retrieved </li></ul><ul><li>in destination varchar : destination graph IRI </li></ul><ul><li>inout content any : the document content </li></ul><ul><li>inout async_queue any : preallocated asynchronous queue used to call the configured ping service </li></ul><ul><li>inout ping_service any : URL of the ping service, as assigned to the PingService parameter in the [SPARQL] section of the virtuoso.ini file. This argument could be used to notify the PingTheSemanticWeb RDF document repository & notification service </li></ul><ul><li>inout api_key any : unique string providing cartridge specific data taken from the RC_KEY column of the DB.DBA.SYS_RDF_CARTRIDGES table </li></ul>
  29. 29. Flickr Cartridge Extracts <ul><li>procedure DB.DBA.RDF_LOAD_FLICKR_IMG ( </li></ul><ul><li>in graph_iri varchar, in new_origin_uri varchar, in dest varchar, </li></ul><ul><li>inout _ret_body any, inout aq any, inout ps any, inout _key any) </li></ul><ul><li>{ </li></ul><ul><li>declare xd, xt, url, tmp, api_key, img_id, hdr, exif any; </li></ul><ul><li>... </li></ul><ul><li>url := sprintf ('http://api.flickr.com/services/rest/?method=flickr.photos.getInfo&photo_id=%s&api_key=%s', img_id, api_key); </li></ul><ul><li>tmp := http_get (url, hdr); </li></ul><ul><li>... </li></ul><ul><li>xd := xtree_doc (tmp); </li></ul><ul><li>... </li></ul><ul><li>xt := xslt (registry_get ('_rdf_cartridges_path_') || 'xslt/flickr2rdf.xsl', </li></ul><ul><li>xd, vector ('baseUri', coalesce (dest, graph_iri), 'exif', exif)); </li></ul><ul><li>xd := serialize_to_UTF8_xml (xt); </li></ul><ul><li>DB.DBA.RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri)); </li></ul><ul><li>return 1; </li></ul><ul><li>} </li></ul>
  30. 30. Custom Resolvers <ul><li>Sponger supports pluggable “Custom Resolver” cartridges </li></ul><ul><li>Support dereferencing of other forms of URIs besides HTTP URLs, e.g.: </li></ul><ul><ul><li>URN schemes (LSIDs) and handle schemes (DOIs) </li></ul></ul><ul><li>Greatly extends range of data sources which can be linked into the Semantic Web </li></ul><ul><li>http://demo.openlinksw.com/sparql?default-graph-uri=urn:lsid:ubio.org:namebank:11815&should-sponge=soft&query=SELECT+*+WHERE+{?s+?p+?o}&format=text/html </li></ul><ul><li>Proxy service also recognizes URNs </li></ul><ul><li>http://demo.openlinksw.com/proxy?url=urn:lsid:ubio.org:namebank:11815&force=rdf </li></ul>
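The SPARQL endpoint URL on the slide combines a graph URN, a sponging mode and a query in one query string. As a minimal sketch, the Python helper below rebuilds that kind of URL with each value percent-encoded; the function name `sponge_query_url` is hypothetical, the parameter names are those visible in the slide's URL, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def sponge_query_url(endpoint, graph_iri, query, sponge="soft", fmt="text/html"):
    """Build a /sparql request URL that asks the server to sponge
    `graph_iri` (hypothetical helper; parameter names taken from the slide)."""
    params = {
        "default-graph-uri": graph_iri,
        "should-sponge": sponge,
        "query": query,
        "format": fmt,
    }
    return endpoint + "?" + urlencode(params)


lsid_url = sponge_query_url(
    "http://demo.openlinksw.com/sparql",
    "urn:lsid:ubio.org:namebank:11815",
    "SELECT * WHERE {?s ?p ?o}",
)
print(lsid_url)
```

Encoding the braces and spaces of the query avoids relying on the client or server to tolerate them unescaped.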
