SWPM'12 report on the Dagstuhl Seminar on Semantic Data Management

Published in: Education

  1. Provenance at the Dagstuhl Seminar on Semantic Data Management, April 2012
     Paolo Missier, Jose Manuel Gómez-Perez, Satya Sahoo
     SWPM'12, June 2012
     Dagstuhl report @ SWPM'12 - P. Missier
  2. Previously at Dagstuhl...
     Much provenance, not much semantics - final report to be published soon
     Interim Seminar wiki
  3. The provenance day @ Dagstuhl
     Tuesday (main topic: provenance; person in charge: Grigoris Antoniou)
     Session 1. Provenance in semantic data management
     ■ Tutorial: Provenance - some useful concepts (Paul, 20 minutes)
     ■ An introduction to the W3C PROV family of specs (Paul Groth / Luc Moreau / Paolo Missier / Olaf, 30 minutes)
     ■ Presentations from other attendees:
       ■ Manuel Salvadores: "Access Control in SPARQL: The BioPortal Use Case" (15-20 min)
       ■ Bryan Thompson: Simple and effective provenance mechanism for triples or quads based on composition
     Session 2. Presentations
     ■ Kerry Taylor: Reaping the rewards: what is the provenance saying? (20 min)
     ■ Martin Theobald: Reasoning in Uncertain RDF Knowledge Bases with Lineage (20 min)
     ■ James Cheney: Database Wiki and provenance for SPARQL updates (10-15 min)
     Session 3. Working groups and wrap-up
     ■ Objective: obtain roadmaps for typical problems in provenance
     ■ Working groups:
       ■ Frank van Harmelen: Provenance and scalability
       ■ Paolo Missier: Provenance-specific benchmarks and corpora
       ■ José Manuel Gómez-Pérez: Novel usages of provenance information
       ■ Norbert Fuhr: Provenance and uncertainty
  4. WG: Novel usages of provenance information (José Manuel Gómez-Pérez)
     • Data integration
       – assisted analysis, exploration along different dimensions of quality
       – SmartCities, OpenStreetMap
     • Analytics in social networks
       – detect "cool" members in social networks
     • Provenance diff (hard in general)
     • Billing / Privacy
       – emerging pay-per-query models
     • Credit, attribution, citation and licensing
     • Result reproducibility (e.g., the Executable Paper Challenge)
     • Determining the quality of reports generated by third parties for an organisation (e.g., government reports)
  5. WG: creating provenance-specific benchmarks
     • Another of the spontaneous working-group activities at Dagstuhl
     • Not strictly "semantic" - but PROV-RDF is one of the expected encodings
     • Led by Satya Sahoo, PM
     • A community initiative
     Goal: to collect a corpus of reference provenance traces from multiple contributors in multiple domains, and make it available as a community resource
  6. Collecting reference provenance datasets
     Why:
     • to better understand actual usages of provenance
     • to analyse properties of provenance graphs
       – patterns in graphs
     • to create a level field for performance comparison
       – storage, compression methods
       – query models, query processing
         • SPARQL
         • Datalog
         • graph query languages
     • to test algorithms that probe interesting hypotheses
       – "prov(D) contains valid indicators for quality(D)"
     How:
     • by collecting submissions from the community
     • by generating synthetic provenance
  7. What: submissions
     A submission is:
     - a collection of traces
     - a collection of queries
     hopefully from a variety of different domains
     Interesting properties of each trace:
     • graph structure - regularity, recognizable patterns
     • graph size
     • scaling factors
     • what it is to be used for
     Interesting properties of each submission:
     • diversity of structure and size within the family
     • number of traces
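The trace properties listed above (graph size, structure) can be summarised mechanically once a trace is in hand. A minimal sketch, assuming a trace is given as a plain list of dependency edges (the edge list below is made up for illustration, not a real submission):

```python
# Summarise simple structural properties of a provenance trace,
# represented here as (source, target) dependency edges.
# This toy edge list is illustrative only.
edges = [("d2", "d1"), ("d3", "d1"), ("d4", "d2"), ("d4", "d3")]

nodes = {n for e in edges for n in e}
# Fan-out per node: how many dependencies each node declares.
fan_out = {n: sum(1 for s, _ in edges if s == n) for n in nodes}

summary = {
    "nodes": len(nodes),
    "edges": len(edges),
    "max_fan_out": max(fan_out.values()),
}
print(summary)  # {'nodes': 4, 'edges': 4, 'max_fan_out': 2}
```

Collecting such summaries across a submission would give the "diversity of structure and size within the family" the slide asks about.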
  8. What: trace format
     • The PROV assumptions:
       – uptake: PROV will be successful (!)
       – interoperability: PROV will be sufficiently expressive to provide interoperability
     • Thus, expecting PROV encoding for submissions seems reasonable
     • Advantages:
       – tools are being built to parse, visualize, validate, and analyse PROV-compliant traces
       – multiple encodings available
         • especially good if RDF is your thing
     • Issues:
       – conversion: existing traces are not natively PROV
       – is there a need to dereference data at the end of URIs?
       – licensing: multiple tiers? specific to each dataset?
  9. What: queries
     Hypothesis: some queries are generic, in the sense that they apply across multiple collections of traces
     Single-trace queries:
     • reachability queries over data and activity dependencies
       – backwards (diagnosis)
       – forwards (impact analysis)
     • "chains of responsibility" (delegation)
     Aggregation queries:
     • production/usage of data and activities across traces
       – assumes uniformity within a collection
     • Do graph mining problems apply? Do they have interesting interpretations?
       – e.g., subgraph discovery
     • feature extraction for learning, mining
     • pairwise trace comparison:
       – "earliest divergence" queries between pairs of "nearly isomorphic" traces
       – differencing (complex)
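The backward (diagnosis) reachability query above is just transitive closure over derivation edges. A minimal sketch on a toy trace, with the dictionary keyed in the PROV "wasDerivedFrom" direction (entity names and encoding are illustrative, not part of any real submission); a forward (impact analysis) query would be the same traversal over the reversed edges:

```python
from collections import deque

# Toy trace: each entity maps to the entities it was derived from
# (wasDerivedFrom direction). Names are made up for illustration.
derived_from = {
    "report": ["table", "chart"],
    "table": ["raw_data"],
    "chart": ["raw_data"],
    "raw_data": [],
}

def ancestors(trace, entity):
    """Backward reachability: everything `entity` transitively depends on."""
    seen, queue = set(), deque(trace.get(entity, []))
    while queue:
        e = queue.popleft()
        if e not in seen:
            seen.add(e)
            queue.extend(trace.get(e, []))
    return seen

print(sorted(ancestors(derived_from, "report")))  # ['chart', 'raw_data', 'table']
```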
  10. A provenance repository
     • If traces are submitted in one of the PROV standard encodings, then the P-rep can provide validation services upon admission
     • PROV is expected to support the following encodings:
       – PROV-N - the technology-neutral notation
       – RDF - the main official encoding
       – XML - unofficial XSD available
       – JSON - unofficial
       – (Datalog? - even more unofficial, but syntactically very close to PROV-N)
     Available validations:
     • Syntax:
       – PROV-N syntax; translations PROV-N → PROV-JSON, PROV-N → PROV-RDF, PROV-N → PROV-XML
       – XML schema validation
     • Consistency:
       – validation wrt PROV-Constraints
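A syntax-level admission check of the kind the slide describes could start as simply as pattern-matching statements. This is only a rough sketch covering a toy subset of PROV-N-style statements, not the real PROV-N grammar (a repository would use a proper parser):

```python
import re

# Toy subset of PROV-N-like statements: name(args), no nesting.
# Illustrative only - the real PROV-N grammar is far richer.
STMT = re.compile(
    r"^(entity|activity|used|wasGeneratedBy|wasDerivedFrom)\(([^()]*)\)$"
)

def looks_like_prov_n(line):
    """Shallow syntax check on a single statement line."""
    return STMT.match(line.strip()) is not None

trace = [
    "entity(e1)",
    "activity(a1)",
    "wasGeneratedBy(e1, a1)",
    "derivedFrom e1 e2",   # malformed on purpose
]
print([looks_like_prov_n(s) for s in trace])  # [True, True, True, False]
```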
  11. Low-hanging fruit
     • Wikipedia history pages
       – dumps freely available
       – or, through the Wikipedia REST API
     • OpenStreetMap history pages
       – very similar structure
     • ...any others?
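For the Wikipedia option, one accessible route is the classic MediaWiki web API's revision listing, from which per-page edit histories (a natural provenance trace) can be pulled. A sketch that only builds the request URL, without sending it; the parameter values (page title, limit) are illustrative:

```python
from urllib.parse import urlencode

# Build a MediaWiki API request for a page's revision history.
# Values are illustrative; no network request is made here.
params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Provenance",
    "rvlimit": 5,
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```

Each revision record (author, timestamp, parent revision) maps naturally onto PROV activities, agents, and derived entities.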
  12. Can we learn from similar initiatives?
     • Well-established repositories for testing machine-learning methods
       – the UCI Machine Learning repositories
       – the KDD Cup datasets
       – ...and more
     • "Building better RDF benchmarks": Kavitha Srinivas @ Dagstuhl
       – DBpedia, UniProt - large, but no representative query workload
       – YAGO: Wikipedia <-> WordNet, 8 queries
       – Barton Library, 7 queries
       – Linked Sensor Dataset, no queries
       – TPC-H as RDF
       – Berlin SPARQL Benchmark (BSBM), 12 queries + mixes
       – Lehigh University Benchmark (LUBM), 14 queries
       – SP2Bench (DBLP), 12 queries
       – Original approach: turn every dataset into a benchmark by editing the dataset to enforce measures of coverage and coherence
  13. WG: Provenance and uncertainty (Norbert Fuhr)
     • Uncertainty in the data
       – sensor data, customer reviews
     • Issues
       – reliability ("is this the original painting?")
       – authenticity
     • Sources of uncertain provenance
       – information extraction / NLP methods
       – human errors
       – inferences
       – instruments
     • Challenges
       – we need a data model for uncertainty in provenance
         • probabilistic dependency relations
       – explanation of the derivation of uncertain results
     • Limitations
       – hard rules vs soft rules
       – knowledge acquisition process for those rules
       – provenance incompleteness vs uncertainty
