Successfully reported this slideshow.

Provenance and Trust


Published on

"At the toolbar (menu, whatever) associated with a document there is a button marked "Oh, yeah?". You press it when you lose that feeling of trust. It says to the Web, 'so how do I know I can trust this information?'. The software then goes directly or indirectly back to metainformation about the document, which suggests a number of reasons."

Tim Berners-Lee, W3C Chair, Web Design Issues, September 1997

Provenance is focused on the description and understanding of where and how data is produced, the actors involved in the production of such data, and the processes by which the data was manipulated and transformed until it arrived to the collection from which it is being accessed. Provenance aims at providing the ability to trace the sources of data, enabling the exploration not just of the relationships between datasets, but also of their authors and affiliations, with the goal of preserving data ownership and establishing a notion of trust based on authenticity and reliability.

The Future Internet poses important challenges for provenance, derived from complex and rich scenarios characterized by the presence of large amounts of data stemming from heterogeneous sources like user communities, services, and things. Such challenges span across technical but also socioeconomic dimensions. The former includes aspects like vocabularies for representing provenance, interoperability and scalability issues, and means to produce, acquire, and reason with provenance in order to provide measures of trust and information quality. However, it is probably in the socieconomic dimension where more significant efforts need to be made as to addressing issues like the role of provenance in the overall picture of the Future Internet, entry barriers preventing the generation of provenance-aware internet content, means required to incentivate the production of such content, and ways to prevent provenance forgery.

In this talk, we provide and overview on provenance and the above mentioned challenges and introduce ongoing work in order to address trust issues from the provenance perspective in the Future Internet. We also link provenance to other relevant aspects for trust discussed in the session, like security, legal frameworks, and economics.

Published in: Technology
  • Be the first to comment

Provenance and Trust

  1. 1. iSOCO Provenance and Trust José Manuel Gómez Pérez Fundations of Trust in the Future Internet Future Internet Assembly Valencia, 15th April 2010 W3C Provenance Incubator Group
  2. 2. Why provenance is important For the Web Architecture - "At the toolbar (menu, whatever) associated with a document there is a button marked "Oh, yeah?". You press it when you lose that feeling of trust. It says to the Web, 'so how do I know I can trust this information?'. The software then goes directly or indirectly back to metainformation about the document, which suggests a number of reasons.“-- Tim Berners-Lee, W3C Chair, Web Design Issues, September 1997 For Linked Data - "Provenance is the number one issue we face when publishing government data as linked data for” -- John Sheridan, UK National Archives,, February 2010 For Science - "We need a paradigm that makes it simple [...] to perform and publish reproducible computational research. [...] A Reproducible Research Environment (RRE) [...] provides computational tools together with the ability to automatically track the provenance of data, analyses, and results and to package them (or pointers to persistent versions of them) for redistribution." W3C Provenance Incubator Group 2
  3. 3. Provenance is… Records of Sources of information, including entities and processes, involved in producing or delivering an artifact History of subsequent owners (change of custody) W3C Provenance Incubator Group 3
  4. 4. Provenance is… Evidence of authenticity, integrity, and information quality Certifies products of good process W3C Provenance Incubator Group 4
  5. 5. Provenance is… Valuable Hard to collect and verify Necessary to assign credit …and blame i.e. to establish Trust W3C Provenance Incubator Group 5
  6. 6. Provenance representation Provenance models define - Types of provenance elements - Relationships between them Provenance represented as a graph - Nodes: Pieces of provenance information (Artifacts, Processes, Agents) - Edges: relate provenance elements to each other Additionally: roles, temporal, dependency, causal dependency The Open Provenance Model (OPM) W3C Provenance Incubator Group 6
  7. 7. Provenance-related vocabularies General Provenance-specific DC – Dublin Core Metadata OPM - Open Provenance Model Terms Provenir FOAF – Friend of a Friend PRV - Provenance Vocabulary SIOC – Semantically-Interlinked PML - Proof Markup Language Online Communities DC - Dublin Core SWP – Semantic Web Publishing PREMIS vocabulary WOT – Web Of Trust schema VOID – VOcabulary of Interlinked Datasets PAV - SWAN Provenance Ontology SWP - Semantic Web Publishing Vocabulary CS - Changeset Vocabulary W3C Provenance Incubator Group 7
  8. 8. A snapshot on provenance Uses Research areas Domains • Meaningful data • Scientific workflows • Science integration • Databases (where and • Open Government and • Establishing trust when why provenance) Linked Data information sources are diverse and of varying • Security (Information • Web search & use quality e.g. in the Web flow and trust) • Cultural heritage • Providing justification to • Justification and the conclusions of Argumentation in KRR reasoning processes • Licensing and attribution • Information Retrieval • Manufacturing • Attribution (who did (analysis of information what) sources) • Enabling comparison and reproducibility of processes W3C Provenance Incubator Group 8
  9. 9. The Linked Data paradigm How can we exploit Tim Berners Lee, 2006 (Design Issues) all the available data? 1. Use URIs to identify things - Anything, not just documents 2. Use HTTP URIs for people to Data reuse and remix lookup such names Common flexible and usable APIs - Globally unique names Standard vocabularies to - Distributed ownership describe interlinked datasets 3. Provide useful information in RDF Tools upon URI resolution Realize the Semantic Web vision 4. Include RDF links to other URIs - Enable discovery of related information W3C Provenance Incubator Group 9
  10. 10. Provenance is key in Linked Data Data Lineage - The origins of data - Related artifacts and actors Information quality - Timeliness - (Semantic) consistency between datasets - Stable and meaningful data links Trust - Data authenticity - Reliability W3C Provenance Incubator Group 10
  11. 11. Provenance is key in Open Government Open Government Initiative ( - Release quickly, improve later - NSF to develop plan ( “NSF is developing an Open Goverment Plan, which will serve as the roadmap for our plans to improve transparency, better integrate public participation and collaboration into our core mission, and become more innovative and efficient.” Similar initiative in the UK ( "Provenance is the number one issue we face when publishing government data as linked data for” -- John Sheridan, UK National Archives Why is provenance so important - Government data comes from very diverse data sources - Varying quality - Different scope - Different assumptions W3C Provenance Incubator Group 11
  12. 12. Large amounts of data and processes In 2006, the size of the digital universe Source: IDC was estimated in 161 exabytes 3 million times the information in all books ever written In 2010, expected to turn 988 exabytes …and all this data is potentially published online W3C Provenance Incubator Group 12
  13. 13. It’s all about producing and consuming information… Automated means to evaluate Information Quality and Trust are needed! W3C Provenance Incubator Group 13
  14. 14. Provenance and Trust in the Web: A real-life example • Two fake web sites Reusing web data without the means that allow • A fake Wikipedia entry contrasting its provenance can be harmful, • Fake California public especially in sensitive domains. safety phone numbers • Fake local TV station The hoax caused a 1000-word tome on Frankfurter Allgemeine Zeitung… and public apologies from DPA Trust on Wikipedia misled DPA In a provenance-aware world, DPA would have had means based on data provenance to evaluate trust - Bluewater did not exist - The Berlin Boys do not exist W3C Provenance Incubator Group 14
  15. 15. Trust evaluation: Useful provenance questions Who created that content (author/attribution)? Was the content ever manipulated, if so by what processes/entities? Who is providing that content (repository)? Can any of the answers to these questions be verified e.g., through e-signatures? W3C Provenance Incubator Group 15
  16. 16. Immediate need for provenance at W3C Trust-driven Others Web of trust Reasoning - Making trust judgments - Attribution of assertions from based on provenance diverse sources Social trust Linked data - Attribution, authority, - Use of conflicting data of propagation varying degrees of quality Social web Life sciences and e-Science - Privacy and use policies of at large sensitive (personal) data - Reproducibility of scientific results W3C Provenance Incubator Group 16
  17. 17. W3C Provenance Group: Charter and goals of the incubator group Provide state-of-the-art understanding and develop a roadmap for development and possible standardization Articulate requirements for accessing and reasoning about provenance information - Develop use cases Identify issues in provenance that are direct concern to the Semantic Web - Articulate relationships with other aspects of Web architecture Report on state-of-the-art work on provenance Report on a roadmap for provenance in the Semantic Web - Identify starting points for provenance representations - Identifying elements of a provenance architecture that would benefit from standardization W3C Provenance Incubator Group 17
  18. 18. W3C Provenance Group: Products of the group to date Group formed in September 2009 - All information is public: Developed a set of key dimensions for provenance (11/09) - Grouped into three major categories: content, management, use Developed use cases for provenance (12/09) - More than 30 use cases, most were improved and curated Developed requirements for provenance that arise from the use cases (1/10) - User requirements: what is the purpose/use of the provenance information - Technical requirements: derived from the user requirements Currently developing state-of-the-art report (expected 6/10) W3C Provenance Incubator Group 18
  19. 19. Scientific and technical challenges of provenance Provenance information needs to be : Represented Captured and recorded Stored and secured, queried, and reasoned about Visualized and browsed W3C Provenance Incubator Group 19
  20. 20. Scientific and technical challenges of provenance (II) Vocabularies for representation of provenance content - Need representations of processes (workflows), entities,roles, data collections, meta-assertions, etc. - The Open Provenance Model (OPM): Granularity of provenance records - How much detail is useful, manageable/scalable in practice? • Size of provenance can be orders of magnitude larger than base data Provenance evaluation for Information Quality and Trust Management Evolution and updates - Shelf timeliness of data • Determine when data becomes obsolete based on provenance info - Versioning of data sources • Relate updates of data based on provenance info - Reproducibility of processes Provenance-aware visualization, navigation, and resource consumption W3C Provenance Incubator Group 20
  21. 21. Scientific and technical challenges of Provenance & Trust Policies based on provenance information - Association-based policies • Source is NYT, source cites NYT • Source is cited in Wikipedia - Bias-based policies • Source is an oil company - Distrust policies • Source is a blog Policies may be restricted to a context - Topic of search, topics of page, tags of page Trust policies may be shared across users - Like bookmarks in W3C Provenance Incubator Group 21
  22. 22. Socioeconomic challenges How to incentivize provenance take-up in the Web architecture so that content and service providers move into a provenance-aware paradigm? - Generation of provenance metadata by content providers - Consumption of provenance metadata by infrastructure and applications like search engines, browsers, etc. Objective evidence of trust in online data or service allows - Ranking increase in search engines - Attracting internet traffic - Increasing automation in data and service consumption But, can you can trust such provenance information itself? - Non-forgery of provenance metadata • Authoritative agencies W3C Provenance Incubator Group 22
  23. 23. José Manuel Gómez-Pérez R&D Director T +34913349778 Thanks for M +34609077103 your attention! iSOCO Para obtener más información sobre como puede ayudar a su empresa a optimizar sus negocios digitales y aportar una solución innovadora, contáctenos en www. .com Barcelona Madrid Valencia Tel +34 93 5677200 +34 91 3349797 +34 96 3467143 Edificio Testa A C/Pedro de Valdivia, 10 Oficina 107 C/ Alcalde Barnils 64-68 28006 Madrid C/ Prof. Beltrán Báguena 4, St. Cugat del Vallès 46009 Valencia 08174 Barcelona W3C Provenance Incubator Group 23