Virtuoso Sponger - RDFizer Middleware for creating RDF from non RDF Data Sources

From rumito, 8 months ago

Slideshow transcript

Slide 1: Virtuoso Sponger Extracting RDF Structured Data from Non-RDF Sources © 2007 OpenLink Software, All rights reserved

Slide 2: Growing the Semantic Web
- Classic “chicken ‘n’ egg” problem has impeded the growth of the Semantic Web
  - Development of applications for the Semantic Web will remain small-scale without a critical mass of RDF data.
  - A critical mass of RDF data won’t be achieved without adequate Semantic Web applications and tools.
- A new class of tools is emerging in response to this need: “RDFizers”
  - Transform non-RDF data into RDF
  - Virtuoso Sponger is one such RDFizer

Slide 3: Virtuoso Sponger
- An RDFizer introduced in Virtuoso 5.0
- Provides built-in RDF middleware for transforming non-RDF data into RDF “on the fly”
- You can use non-RDF data sources as Semantic Web data sources
- Inputs: wide variety of non-RDF Web data sources, e.g.:
  - (X)HTML Web pages (including hosted microformats)
  - Web services (Google, Del.icio.us, Flickr etc.)
  - Binary files (MS Office, PDF, OpenDocument etc.)
- Output: RDF structured data

Slide 4: Inputs: Supported Data Sources
- RDF (inc. N3, Turtle)
  - SIOC, SKOS, FOAF, AtomOWL, Annotea …
- (X)HTML pages
  - HTML header metadata: Dublin Core
  - Microformats: eRDF, RDFa, hCard, hCalendar, XFN, xFolk …
- Syndication formats
  - RSS 2.0, Atom, OPML, OCS, XBEL
- GRDDL
- Web service APIs: Google Base, Flickr, Del.icio.us, Ning …
- Files:
  - Binary files: MS Office, OpenOffice, images, audio, video …
  - Data exchange formats: iCalendar, vCard
- 3rd party metadata extractors: Aperture, Spotlight, SIMILE RDFizers
- or add your own!

Slide 5: Output: Structured Data
In the context of the Semantic Data Web: “Data organized into semantic chunks or entities, with similar entities grouped together into relations or classes”
Michael Bergman (http://www.mkbergman.com), article: “More Structure, More Terminology and (hopefully) More Clarity”

Slide 6: Sponger Benefits
- The majority of the world's data currently resides in non-RDF form
- Sponger provides a “Swiss army knife” for RDF structured data generation from non-RDF sources
- Extracting data from non-RDF Web sources and converting it to RDF
  - helps “bootstrap” the Semantic Web
  - helps drive the transition of the traditional Document-Web into the emerging Semantic Data-Web
  - exposes the data in a canonical form for querying and inference

Slide 7: Sponger Inputs & Outputs

Slide 8: Sponger Architecture
- Sponger is composed of Sponger Cartridges
  - Default cartridge collection is bundled as a Virtuoso VAD
- Cartridge = Metadata Extractor + Ontology Mapper
  - Metadata extracted from non-RDF resources is mapped to a suitable ontology by the Ontology Mapper to produce Structured Data
- Sponger is highly customizable
  - Custom cartridges can be developed
  - Using any language (e.g. Virtuoso PL, C/C++, Java) supported by the Virtuoso Server Extensions API

Slide 9: Using The Sponger
Can be invoked in several ways, via:
- Virtuoso SPARQL query processor
- Virtuoso RDF Proxy Service (/proxy)
  - E.g. http://localhost:8890/proxy
- OpenLink RDF client applications
- ODS-Briefcase (Virtuoso WebDAV)
- Directly through Virtuoso PL

Slide 10: Using the Sponger: SPARQL Query Processor
- Virtuoso extends SPARQL with IRI/URI dereferencing
  - The highly distributed nature of the Semantic Web makes it unlikely that all referenced resources/IRIs will be in the local quad store
- During query execution:
  - From a given IRI, remote RDF resources can be downloaded and parsed, and the resulting triples stored in the local quad store
- IRI dereferencing of FROM clauses
  - Downloads & stores triples from named graphs
- IRI dereferencing of SPARQL variables
  - Downloads & stores triples based on a proximity search from a starting IRI to a given depth (# of hops) via specified predicates

Slide 11: SPARQL Extensions: IRI Dereferencing of FROM Clauses
Enabled through ‘define get:…’ pragmas

    DEFINE get:method "GET"
    DEFINE get:soft "soft"
    SELECT ?id
    FROM NAMED <http://myhost/user1.ttl>
    FROM NAMED <http://myhost/user2.ttl>
    WHERE { GRAPH ?g { ?id a ?o } };

- get:soft – retrieval mode: “soft” / “replace”
- get:uri – IRI to retrieve if not equal to IRI of FROM clause
- get:method – HTTP “GET” or URIQA “MGET”
- get:refresh – max allowed age (seconds) of cached resource
  - can reduce expiry time specified in HTTP headers
- get:proxy – proxy server address if direct download not possible
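A client that sends such queries has to put the DEFINE lines ahead of the query text. The following Python sketch shows one way to do that. The helper name with_pragmas is hypothetical, and the quoting rule (integers bare, everything else double-quoted) is an assumption drawn from the examples on the slides, not from the Reference Manual.

```python
def with_pragmas(query, pragmas):
    """Prepend DEFINE pragmas to a SPARQL query string.

    Assumption: integer values are written bare and all other values are
    double-quoted, as in the slide examples.
    """
    lines = []
    for name, value in pragmas.items():
        if isinstance(value, int):
            rendered = str(value)
        else:
            rendered = '"' + str(value).replace('"', '\\"') + '"'
        lines.append(f"DEFINE {name} {rendered}")
    return "\n".join(lines + [query])


query = with_pragmas(
    "SELECT ?id\n"
    "FROM NAMED <http://myhost/user1.ttl>\n"
    "FROM NAMED <http://myhost/user2.ttl>\n"
    "WHERE { GRAPH ?g { ?id a ?o } }",
    {"get:method": "GET", "get:soft": "soft"},
)
print(query)
```

The resulting string can then be submitted to a SPARQL endpoint by whatever client library is in use.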

Slide 12: SPARQL Extensions: IRI Dereferencing of Variables
Enabled through ‘define input:grab-…’ pragmas

    DEFINE input:grab-var "?more"
    DEFINE input:grab-depth 10
    DEFINE input:grab-limit 100
    DEFINE input:grab-base "http://myhost/"
    SELECT ?id ?fullname ?email
    WHERE {
      GRAPH ?g {
        ?id a <Person> ;
            <FullName> ?fullname ;
            <Email> ?email .
        OPTIONAL { ?id <SeeAlso> ?more }
      }
    };

- input:grab-var – SPARQL variable identifying IRIs to be downloaded
- input:grab-depth – max # of links (predicates) between nodes in graph
- input:grab-limit – max # of resources (subject/object nodes) to retrieve
- input:grab-base – base IRI for converting relative IRIs to absolute
- plus others (grab-seealso, grab-destination …) - see Reference Manual.
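The Python sketch below illustrates what a depth bound and a resource limit mean for a link-following crawl. It is an illustration only and not Virtuoso's algorithm: the function grab, the in-memory links table, and the choice to count the starting IRI toward the limit are all assumptions.

```python
from collections import deque


def grab(start, links, depth, limit):
    """Breadth-first walk from `start`.

    Follows at most `depth` hops and collects at most `limit` resources
    (the starting IRI counts toward the limit in this sketch).
    """
    seen = [start]
    queue = deque([(start, 0)])
    while queue and len(seen) < limit:
        node, hops = queue.popleft()
        if hops == depth:
            continue  # depth bound reached; do not follow links from here
        for target in links.get(node, []):
            if target not in seen:
                seen.append(target)
                queue.append((target, hops + 1))
                if len(seen) >= limit:
                    break
    return seen


# Hypothetical link structure standing in for <SeeAlso> triples.
links = {
    "http://myhost/a": ["http://myhost/b", "http://myhost/c"],
    "http://myhost/b": ["http://myhost/d"],
    "http://myhost/d": ["http://myhost/e"],
}
print(grab("http://myhost/a", links, depth=2, limit=100))
```

With depth=2 the walk stops before reaching http://myhost/e, which is three hops from the start.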

Slide 13: Using the Sponger: RDF Proxy Service
- Sponger functionality is also exposed by the Virtuoso “/proxy” endpoint
  - A built-in REST-style Web service
  - Takes a target URL & returns its content “as is” or tries to transform it (by sponging) to RDF
    http://demo.openlinksw.com/proxy?url=http://www.w3c.org/People/Connelly/&force=rdf
- Provides a “pipe” for RDF browsers to browse non-RDF sources
- Caches to temporary Virtuoso storage
  - Cache invalidation similar to a traditional Web browser, based on the HTTP ‘expires’ header
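Expiry-based cache invalidation can be sketched in a few lines of Python. This is not Virtuoso's implementation. The function names are hypothetical, and the rule that a client-supplied maximum age (such as get:refresh) can only shorten the lifetime given by the Expires header is an assumption based on the slide wording.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def expiry_time(fetched_at, expires_header=None, max_age_seconds=None):
    """Return the moment a cached copy should be treated as stale."""
    candidates = []
    if expires_header:
        # HTTP dates such as "Fri, 01 Jun 2007 13:00:00 GMT"
        candidates.append(parsedate_to_datetime(expires_header))
    if max_age_seconds is not None:
        candidates.append(fetched_at + timedelta(seconds=max_age_seconds))
    # With no expiry information at all, treat the copy as stale immediately.
    return min(candidates) if candidates else fetched_at


def is_fresh(now, fetched_at, expires_header=None, max_age_seconds=None):
    return now < expiry_time(fetched_at, expires_header, max_age_seconds)


fetched = datetime(2007, 6, 1, 12, 0, tzinfo=timezone.utc)
expires = "Fri, 01 Jun 2007 13:00:00 GMT"
later = fetched + timedelta(minutes=15)
print(is_fresh(later, fetched, expires))                       # header only
print(is_fresh(later, fetched, expires, max_age_seconds=600))  # shortened
```

In the example, the copy is still fresh 15 minutes after download under the header alone, but stale once a 600-second maximum age is applied.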

Slide 14: RDF Proxy Service
Parameters:
- url: the URL of the target
- force: if ‘rdf’ is specified, will try to extract RDF data from the target and return it
- header: HTTP headers to be sent to the target
- output-format: output MIME type of the RDF data
  - ‘rdf+xml’ (default) / ‘n3’ / ‘turtle’ / ‘ttl’
  - if not specified, proxy service uses content negotiation
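The parameters above can be assembled into a request URL as follows. This is a minimal Python sketch: proxy_url is a hypothetical helper, and percent-encoding the target URL is an assumption (the slide example passes it unencoded).

```python
from urllib.parse import urlencode


def proxy_url(endpoint, target, force_rdf=True, output_format=None):
    """Build a /proxy request URL from the parameters listed on the slide."""
    params = {"url": target}
    if force_rdf:
        params["force"] = "rdf"
    if output_format:
        params["output-format"] = output_format
    # urlencode percent-encodes the target so it survives as one parameter.
    return endpoint + "?" + urlencode(params)


url = proxy_url(
    "http://demo.openlinksw.com/proxy",
    "http://www.w3c.org/People/Connelly/",
    output_format="n3",
)
print(url)
```

Leaving output_format unset omits the parameter, in which case the slide says the service falls back to content negotiation.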

Slide 15: Using the Sponger: OpenLink RDF Client Applications
Bundled as part of OpenLink AJAX Toolkit (OAT)
- RDF Browser
  - Uses /proxy service by default
- iSPARQL – Interactive SPARQL query builder
  - Uses /sparql service & 5 sponging settings (translated to IRI dereferencing pragmas on server)
    - Get Local Data Only
    - Get Remote Data When Missing Locally
    - Get All Remote Data
    - Get All Remote Data & Related Data
    - Get Everything

Slide 16: Using the Sponger: ODS-Briefcase (Virtuoso WebDAV)
Briefcase = A component of OpenLink Data Spaces
- Includes high-level interface to Virtuoso WebDAV repository
  - Web browser based interaction
  - Web services support (direct use of WebDAV protocol)
  - SPARQL queryable (WebDAV location acts as RDF graph URI)
- Metadata automatically extracted at file upload time
  - Wide variety of file formats supported
  - All WebDAV resources are exposed as SIOC instance data
- Extracted metadata available in two forms
  - Pure WebDAV
  - RDF (RDF/XML, N3, Turtle) optionally synchronized with Quad Store
- Virtuoso Content Crawler / RDF_Sink folder help automate uploading

Slide 17: SIOC as a Data Space “Glue” Ontology
- ODS has its own built-in cartridges for mapping to SIOC
- All ODS data containers (ODS-Briefcase, ODS-Weblog, ODS-Wiki, ODS-FeedManager etc.) expose their data as SIOC instance data
- SIOC
  - provides a generic data model of containers, items and associations between items
    - Classes include: User, UserGroup, Role, Site, Forum, Post
  - SIOC Types Module (sioc-t) defines further types
    - Classes include: AddressBook, BookmarkFolder, ImageGallery etc.
  - permits the use of other ontologies (e.g. FOAF) when describing attributes of SIOC entities
  - provides a generic wrapper (“glue” ontology) for describing RDF structured data derived from OpenLink Data Spaces
- All ODS-related SIOC data can be queried through SPARQL

Slide 18: Using the Sponger: Directly via Virtuoso PL
- Sponger cartridges are invoked through a cartridge hook
  - Provides a Virtuoso PL entry point to the packaged functionality
  - Can be called directly from your own Virtuoso PL procedures

Slide 19: Sponger Cartridges

Slide 20: Sponger Architecture
- Sponger is composed of cartridges
  - Cartridge = metadata extractor + ontology mapper
  - Cartridge is invoked through cartridge hook (Virtuoso PL entry point)
- Metadata extractor
  - Performs initial data extraction
- Ontology mapper
  - Generates RDF instance data from extracted (non-RDF) metadata
  - Extracted metadata is mapped to an ontology associated (via an internal mapping table) with the data source type
  - Typically uses XSLT (GRDDL or built-in Virtuoso mapping scheme) or Virtuoso PL
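The split into extractor and mapper can be modelled as two functions composed by a cartridge object. The Python sketch below is a conceptual illustration only. The class Cartridge, the title extractor, and the one-entry mapping table are hypothetical, and real cartridges are Virtuoso PL procedures that typically use XSLT rather than Python.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]

DC_TITLE = "http://purl.org/dc/elements/1.1/title"
# Stand-in for the internal mapping table: metadata field -> ontology property.
MAPPING = {"title": DC_TITLE}


@dataclass
class Cartridge:
    extract: Callable[[str], Dict[str, str]]                   # metadata extractor
    map_to_rdf: Callable[[str, Dict[str, str]], List[Triple]]  # ontology mapper

    def sponge(self, source_iri, content):
        """Run extraction, then map the extracted metadata to triples."""
        return self.map_to_rdf(source_iri, self.extract(content))


def extract_title(content):
    """Naive extractor: pull the text of the first <title> element."""
    start = content.find("<title>")
    end = content.find("</title>")
    if start == -1 or end == -1:
        return {}
    return {"title": content[start + len("<title>"):end].strip()}


def map_with_table(source_iri, metadata):
    """Map each known metadata field to its ontology property."""
    return [(source_iri, MAPPING[k], v) for k, v in metadata.items() if k in MAPPING]


html_cartridge = Cartridge(extract_title, map_with_table)
triples = html_cartridge.sponge(
    "http://myhost/page.html",
    "<html><head><title> Hello </title></head></html>",
)
print(triples)
```

Keeping the two stages separate means a new source format needs only a new extractor, while a new target vocabulary needs only a new mapping.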

Slide 21: Sponger Cartridge Invocation

Slide 22: Sponger Configuration using Conductor UI
- Virtuoso Conductor provides a browser-based graphical UI for most Virtuoso administration tasks
  - including managing Sponger Cartridges and VADs
- VAD = Virtuoso Application Distribution
  - Packaging & distribution system for Virtuoso extensions
- RDF Cartridges VAD
  - Bundles a variety of pre-built cartridges for popular Web resources and file types
  - Installed as part of default Virtuoso installation

Slide 23: Sponger Configuration using Conductor UI: RDF Cartridges Pane

Slide 24: Sponger Configuration using Conductor UI: GRDDL Filters

Slide 25: Sponger Configuration using Conductor UI: XSLT Templates

Slide 26: Sponger Configuration using Conductor UI: Schema Files / Supported Ontologies

Slide 27: Custom Cartridges
- Sponger is extensible via a pluggable cartridge architecture
- Sponge new data formats by creating your own cartridges
  - Use Virtuoso PL or any language supported by the Virtuoso Server Extensions API (incl. C/C++, Java)
- Before use, register your cartridge in the Cartridge Registry (SYS_RDF_CARTRIDGES table) via Conductor or DML

Slide 28: Custom Cartridges
Cartridge Hook - Virtuoso PL Prototype
- in graph_iri varchar: IRI of graph being retrieved
- in new_origin_uri varchar: URI of the document being retrieved
- in destination varchar: destination graph IRI
- inout content any: the document content
- inout async_queue any: preallocated asynchronous queue used to call the configured ping service
- inout ping_service any: URL of the ping service, as assigned to the PingService parameter in the [SPARQL] section of the virtuoso.ini file. This argument could be used to notify the PingTheSemanticWeb RDF document repository & notification service
- inout api_key any: unique string providing cartridge-specific data taken from the RC_KEY column of the DB.DBA.SYS_RDF_CARTRIDGES table
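How a registry might hand a document to a hook with this seven-argument shape can be sketched in Python. Several parts are assumptions rather than facts from the slides: selecting hooks by a regular expression on the document URI, trying them in registration order, and treating a return value of 1 as success. The names register, dispatch, flickr_hook and fallback_hook, the URL pattern and the key value are all hypothetical.

```python
import re

registry = []  # each entry: (compiled URI pattern, hook, key)


def register(pattern, hook, key=None):
    """Add a hook to the in-memory registry (stand-in for a registry table)."""
    registry.append((re.compile(pattern), hook, key))


def dispatch(graph_iri, new_origin_uri, destination, content):
    """Call matching hooks in order; return the name of the first that succeeds."""
    for pattern, hook, key in registry:
        if pattern.search(new_origin_uri):
            # async_queue and ping_service are not modelled; pass None.
            result = hook(graph_iri, new_origin_uri, destination, content,
                          None, None, key)
            if result == 1:
                return hook.__name__
    return None


def flickr_hook(graph_iri, new_origin_uri, destination, content,
                async_queue, ping_service, api_key):
    # A real hook would call the remote API; here success just needs a key.
    return 1 if api_key else 0


def fallback_hook(graph_iri, new_origin_uri, destination, content,
                  async_queue, ping_service, api_key):
    return 1


register(r"^http://(www\.)?flickr\.com/photos/", flickr_hook, key="hypothetical-key")
register(r".*", fallback_hook)

print(dispatch("g", "http://www.flickr.com/photos/someone/1/", None, ""))
print(dispatch("g", "http://myhost/page.html", None, ""))
```

The per-cartridge key travels with the registration, which mirrors the slide's description of api_key being read from the registry rather than hard-coded in the hook.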

Slide 29: Flickr Cartridge Extracts

    procedure DB.DBA.RDF_LOAD_FLICKR_IMG (
      in graph_iri varchar, in new_origin_uri varchar, in dest varchar,
      inout _ret_body any, inout aq any, inout ps any, inout _key any)
    {
      declare xd, xt, url, tmp, api_key, img_id, hdr, exif any;
      ...
      -- Metadata extractor: fetch the photo description from the Flickr REST API
      url := sprintf ('http://api.flickr.com/services/rest/?method=flickr.photos.getInfo&photo_id=%s&api_key=%s', img_id, api_key);
      tmp := http_get (url, hdr);
      ...
      xd := xtree_doc (tmp);
      ...
      -- Ontology mapper: transform the API response to RDF/XML with XSLT
      xt := xslt (registry_get ('_rdf_cartridges_path_') || 'xslt/flickr2rdf.xsl', xd,
                  vector ('baseUri', coalesce (dest, graph_iri), 'exif', exif));
      xd := serialize_to_UTF8_xml (xt);
      -- Load the generated triples into the destination graph
      DB.DBA.RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));
      return 1;
    }

Slide 30: Custom Resolvers
- Sponger supports pluggable “Custom Resolver” cartridges
- Support dereferencing of other forms of URIs besides HTTP URLs, e.g.:
  - URN schemes (LSIDs) and handle schemes (DOIs)
- Greatly extends the range of data sources which can be linked into the Semantic Web
  http://demo.openlinksw.com/sparql?default-graph-uri=urn:lsid:ubio.org:namebank:11815&should-sponge=soft&query=SELECT+*+WHERE+{?s+?p+?o}&format=text/html
- Proxy service also recognizes URNs
  http://demo.openlinksw.com/proxy?url=urn:lsid:ubio.org:namebank:11815&force=rdf
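A request of the first kind can be built programmatically. The Python sketch below uses only the parameter names visible in the example URL (default-graph-uri, should-sponge, query, format). The helper name sponge_query_url is hypothetical, and percent-encoding every value is an assumption, since the slide shows a partly unencoded URL.

```python
from urllib.parse import urlencode


def sponge_query_url(endpoint, graph_iri, query, sponge="soft", fmt="text/html"):
    """Build a /sparql request that asks the server to sponge the graph IRI."""
    params = {
        "default-graph-uri": graph_iri,
        "should-sponge": sponge,
        "query": query,
        "format": fmt,
    }
    return endpoint + "?" + urlencode(params)


url = sponge_query_url(
    "http://demo.openlinksw.com/sparql",
    "urn:lsid:ubio.org:namebank:11815",
    "SELECT * WHERE {?s ?p ?o}",
)
print(url)
```

Because the graph IRI is an ordinary parameter value, the same helper works unchanged for HTTP URLs and for URN-style identifiers such as LSIDs.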