Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

SemSci2017 - Detailed Provenance Capture of Data Processing

227 views

Published on

A large part of Linked Data generation entails processing the raw data. However, this process is only documented in human-readable form or as a software repository. This inhibits reproducibility and comparability, as current documentation solutions do not provide detailed metadata and rely on the availability of specific software environments.
We propose an automatic capturing mechanism for interchangeable and implementation independent metadata and provenance that includes data processing. Using declarative mapping documents to describe the computational experiment allows automatic capturing of term-level provenance for both schema and data transformations, and for both the used software tools as the input-output pairs of the data processing executions. This approach is applied to mapping documents described using RML and FnO, and implemented in the RMLMapper. The captured metadata can be used to more easily share, reproduce, and compare the dataset generation process, across software environments.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

SemSci2017 - Detailed Provenance Capture of Data Processing

  1. 1. Detailed Provenance Capture of Data Processing Ben De Meester, Anastasia Dimou, Ruben Verborgh, and Erik Mannens Ghent University – imec – IDLab, Belgium
  2. 2. *
  3. 3. Outline Linked Data Generation Problem Solution Results
  4. 4. Outline Linked Data Generation Problem Solution Results
  5. 5. Linked Data comes from data Unstructured data Semi-structured data Structured data …
  6. 6. Linked Data comes from processed data Unstructured data Parse Semi-structured data Extract Structured data Add schema annotations … …
  7. 7. Going from data to linked data Data Schema transformations Data transformations Linked Data
  8. 8. Linked Data generation = schema + data transformations dbr: Barney_G umble
  9. 9. Linked Data generation = schema + data transformations dbr: Barney_G umble dbo:birthDate dbp:voiceactor dbp:gender dbp:name …
  10. 10. Linked Data generation = schema + data transformations dbr: Hawaii dbr: Barney_G umble dbo:birthDate dbp:voiceactor dbp:gender dbp:name “1954-4-20" dbr: Dan_Caste llaneta “Male” “Barney Gumble”@en … …
  11. 11. Problem: there’s always a drunk Barney Data Schema processing Data processing Linked Data
  12. 12. Data Data Processing
  13. 13. Knowing where the data comes from is as important as the data itself Oh Yeah?
  14. 14. Outline Linked Data Generation Problem Solution Results
  15. 15. dbr: Hawaii dbr: Barney_G umble dbo:birthDate dbp:voiceactor dbp:gender dbp:name “1954-4-20" dbr: Dan_Caste llaneta “Male” “Barney Gumble”@en … … Linked Data re-generation? Provenance of those transformations
  16. 16. How it’s done for data processing Provenance log: Person A used Software B, on System C
  17. 17. Problem: how to reproduce? Provenance log: Person A used Software B, on System C Software B offline? System C not booting?
  18. 18. What do we need? Fine-grained provenance for schema transformations for data transformations Independent of the implementation
  19. 19. How can we tell where the data comes from, without depending on the system?
  20. 20. Outline Linked Data Generation Problem: Data Processing Provenance Solution Results
  21. 21. What do we want? Term-level, implementation-independent provenance for schema transformations for data transformations
  22. 22. What do we want? Term-level, implementation-independent provenance for schema transformations for data transformations Generated automatically
  23. 23. What do we want? Term-level, implementation-independent provenance for schema transformations for data transformations Generated automatically Declarative generation process
  24. 24. Steps Align schema and data transformations in a declarative document Generate provenance based on declarative schema transformations Generate provenance based on declarative data transformations
  25. 25. Declarative generation process? Align schema and data transformations in a declarative document
  26. 26. Declarative generation process? Solved! Align schema and data transformations in a declarative document RML + FnO
  27. 27. Declarative generation process? Solved! Align schema and data transformations in a declarative document RML + FnO for DBpedia EF Declarative data transformations for Linked Data generation: the case of DBpedia De Meester, B., Maroy, W., Dimou, A., Verborgh, R., and Mannens, E. Sustainable Linked Data Generation: The Case of DBpedia Maroy, W., Dimou, A., Kontokostas, D., De Meester, B., Verborgh, R., Lehmann, J., Mannens, E. and Hellmann, S.
  28. 28. Schema transformations provenance? Generate provenance based on declarative mapping document
  29. 29. Schema transformations provenance? Solved! Generate provenance based on declarative mapping document RML + PROV Automated metadata generation for Linked Data generation and publishing workflows Dimou, A., De Nies, T., Verborgh, R., Mannens, E., and Van de Walle, R.
  30. 30. Data transformations provenance? Generate provenance based on declarative data transformations
  31. 31. Data transformations provenance?
  32. 32. Outline Linked Data Generation Problem: Data Processing Provenance Solution Declarative generation FnO and PROV Results
  33. 33. FnO: Function expects output inputString predicate outputString predicate DBpedia_ date_parser fno:Function
  34. 34. FnO: Execution DBpedia_ date_parser Function “April 20th 1954” parseExecution fno:Execution “1954-04-20” outputString executesinputString
  35. 35. FnO: General Execution Function Data Transformation Output Input
  36. 36. Aligning FnO and PROV Output prov:Entity Tool prov:Agent wasGeneratedBy Data Transformation prov:Activity Function prov:Entity used Input prov:Entity used wasAssociatedWith wasAttributedTo
  37. 37. Outline Linked Data Generation Problem: Data Processing Provenance Solution Declarative generation FnO and PROV Results
  38. 38. Uncool thing: It’s big When including provenance generation, for every processed term, you add 10 triples
  39. 39. Cool thing #1: System details complementary Output prov:Entity Tool prov:Agent wasGeneratedBy Data Transformation prov:Activity Function prov:Entity used Input prov:Entity used wasAssociatedWith wasAttributedTo
  40. 40. Cool thing #2: Aligning with RML complementary wasInformedBy Output prov:Entity Tool prov:Agent wasGeneratedBy Data Transformation prov:Activity Function prov:Entity used Input prov:Entity Schema Transformation prov:Activity used wasAssociatedWith wasAttributedTo
  41. 41. Cool thing #3: It actually works RMLMapper https://github.com/RMLio/RML-Mapper FunctionProcessor https://github.com/FnOio/function-processor-java DBpedia Extraction Sample https://fno.io/prov/dbpedia/
  42. 42. How can we find a drunk Barney? Query for long-lasting processes Query all outputs of a certain function/tool Query all input-output pairs
  43. 43. What to do with a drunk Barney? Performance evaluation Qualitative comparison Iterative improvement (only changing what is needed!)
  44. 44. Outline Linked Data Generation Problem: Data Processing Provenance Solution Declarative generation FnO and PROV Results
  45. 45. Detailed Provenance Capture of Data Processing Ben De Meester, Anastasia Dimou, Ruben Verborgh, and Erik Mannens Ghent University – imec – IDLab, Belgium

×