Representing Interoperable Provenance Descriptions for ETL Workflows

The increasing availability of data on the Web, driven by the emergence of Web 2.0 applications and, more recently, by Linked Data, has brought additional complexity to data management tasks, as the number of available data sources and their heterogeneity increase drastically. In this scenario, where data is reused and repurposed on a new scale, the Extract-Transform-Load (ETL) pattern emerges as a fundamental and recurrent process for both producers and consumers of data on the Web. In addition to ETL, provenance, the representation of the source artifacts, processes and agents behind data, becomes another cornerstone of Web data management, playing a fundamental role in data quality assessment and data semantics and facilitating the reproducibility of data transformation processes. This paper proposes the convergence of these two Web data management concerns, introducing a principled provenance model for ETL processes in the form of a vocabulary based on the Open Provenance Model (OPM) standard and focusing on the provision of an interoperable provenance model for Web-based ETL environments. The proposed ETL provenance model is instantiated in a real-world sustainability reporting scenario.

Published in: Education, Technology


  1. Representing Interoperable Provenance Descriptions for ETL Workflows
     André Freitas, Benedikt Kämpgen, Joao Gabriel Oliveira, Seán O’Riain, Edward Curry
     The role of Semantic Web in Provenance Management, Extended Semantic Web Conference 2012, 28 May 2012
     Institute of Applied Informatics and Formal Description Methods (AIFB), KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association, www.kit.edu
  2. Motivation
     • Decision support in more complex and heterogeneous data environments (dataspaces, Linked Open Data)
     • Extract-Transform-Load (ETL) workflows are an inherent part of data analysis
     • Challenges: management of complex ETL workflows; information quality and trust
     28 May 2012, B. Kämpgen – Representing Interoperable Provenance Descriptions for ETL Workflows, Institute of Applied Informatics and Formal Description Methods (AIFB)
  3. Problem
     • ETL workflow feeding the sustainability report: (1) lookup printer log file, 20 sec; (2) parse to RDF, 30 sec; (3) filter for 2010, 1 sec; (4) aggregate over people, 1 sec
     • Sustainability report, carbon dioxide emissions in kg (2009 / 2010): printing emissions 600 / 503; paper usage 4 165 / 3 968; travel emissions 534 000 / 429 193; commute emissions 456 / 391
  4. Problem
     • One ETL workflow: (1) extract from travel form DB, 20 sec; (2) parse from CSV to RDF, 30 sec; (3) aggregate over people, 1 sec; (4) filter for 2010, 1 sec
     • Another ETL workflow: (1) crawl from RDFa on website, 1 h; (2) apply constant factor, 1 sec
     • Sustainability report, carbon dioxide emissions in kg (2009 / 2010): printing emissions 600 / 503; paper usage 4 165 / 3 968; travel emissions 534 000 / 429 193; commute emissions 456 / 391
  5. Solution: Provenance information about ETL workflows
     • Prospective provenance: representation of the ETL workflow at design time
     • Retrospective provenance: representation of the ETL workflow after execution
     • Applications of provenance information for ETL workflows: documentation (reproducibility and reuse); data quality assessment (trustworthiness); management (consistency checking, debugging and semantic reconciliation)
  6. Outline: Motivation & Problem; Gap of ETL Descriptions; Interoperable ETL Provenance Model; Case Study; Conclusions
  7. Gap of ETL Descriptions (1)
     • Related work areas: provenance models; conceptual modelling of ETL workflows; formal models of ETL workflows
     • Works shown: Davidson and Buneman (1998); Galhardas et al. (2001); Cui and Widom (2003); Becker and Ghedini (2005); Simmhan et al. (2005); conceptual modelling using ontologies; CWM, PMML, BPMN, BPEL + ontologies; Data Mining Ontology (2009)
     • Requirements: provenance representation from an ETL perspective; semantic interoperability across different ETL applications; usability and ontological commitment
     • Gap: an interoperable ETL provenance model
  8. Gap of ETL Descriptions (2)
     • Common ETL applications such as Kapow Software, Pentaho Data Integration, Google Refine and Yahoo Pipes either do not create and use provenance information or do not support sharing and integrating it
  9. Outline: Motivation & Problem; Gap of ETL Descriptions; Interoperable ETL Provenance Model; Case Study; Conclusions
  10. Outline: Motivation & Problem; Gap of ETL Descriptions; Interoperable ETL Provenance Model (Requirements Analysis; High-level approach; Cogs: Linked Data vocabulary; Requirements Coverage Analysis); Case Study; Conclusions
  11. Requirements Analysis
     Goals: provenance representation from an ETL perspective; semantic interoperability across different ETL platforms; usability and ontological commitment
     Each requirement is marked (+) against the goals it supports:
     • Prospective and retrospective descriptions: + +
     • Separation of concerns: +
     • Common terminology: + +
     • Terminological completeness: + +
     • Lightweight ontology structure: +
     • Availability of different abstraction levels: + +
     • Data representation independence: +
     • Accessibility: + +
     • Decentralization: + +
  12. Interoperable Provenance Model for ETL Workflows
     High-level approach (figure: Three-layered Provenance Model):
     • Reuse of the OPM Vocabulary (OPMV) workflow structure as abstract provenance model
     • Creation of Cogs, an RDF vocabulary for representing ETL provenance; can be extended by domain-specific models
     • Use of the Linked Data principles for representing provenance descriptors
  13. Open Provenance Model Vocabulary (OPMV)
     • Community-built provenance model
     • Simple workflow structure (processes, artifacts, agents)
     • Designed to be a minimal level of provenance interoperability
     • Designed to be extensible
     • ETL and provenance share workflow-level semantics
     • http://open-biomed.sourceforge.net/opmv/ns.html
  14. RDF vocabulary for representing ETL elements
     • Complementary vocabulary for expressing the elements present in an ETL workflow
     • Based on ETL/data transformation tools (Pentaho Data Integration, Google Refine) and on concepts and structures from the ETL literature
     • https://sites.google.com/site/cogsvocab/
  15. Cogs – OPMV workflow extension
     • The representation of nested workflows allows different abstraction levels
  16. Cogs – Structure
     • Taxonomy of ETL elements mapping to provenance processes and artifacts
     • High-level classes: cogs:Execution, e.g., ScheduledJob; cogs:State, e.g., Running
     • Under opmv:Process: cogs:Extraction, e.g., Parsing; cogs:Transformation, e.g., RegexFilter; cogs:Loading, e.g., IncrementalLoad
     • Under opmv:Artifact: cogs:Object, e.g., CSV File; cogs:Layer, e.g., StagingArea
     • Cogs comprises 151 classes and 17 properties
  17. Requirements Coverage Analysis
     Building blocks: OPMV; Cogs; LD principles
     Each requirement is marked (+) against the building blocks that cover it:
     • Prospective and retrospective descriptions: + +
     • Separation of concerns: + +
     • Common terminology: + +
     • Terminological completeness: + + + (OPMV, Cogs and LD principles)
     • Lightweight ontology structure: + +
     • Availability of different abstraction levels: +
     • Data representation independence: + + + (OPMV, Cogs and LD principles)
     • Accessibility: + +
     • Decentralization: +
  18. Outline: Motivation & Problem; Gap of ETL Descriptions; Interoperable ETL Provenance Model; Case Study; Conclusions
  19. Case Study – Sustainability Reporting
     • ETL over heterogeneous data sources (e.g., log files, survey results, travel request DB, RDF)
     • Sustainability report, carbon dioxide emissions in kg (2009 / 2010): printing emissions 600 / 503; paper usage 4 165 / 3 968; travel emissions 534 000 / 429 193; commute emissions 456 / 391
  20. Case Study – Sustainability Reporting
     • ETL over heterogeneous data sources (e.g., log files, survey results, travel request DB, RDF)
     • Figure labels: steps 1. to 4. leading to the sustainability report
     • Sustainability report, carbon dioxide emissions in kg (2009 / 2010): printing emissions 600 / 503; paper usage 4 165 / 3 968; travel emissions 534 000 / 429 193; commute emissions 456 / 391
  21. Case Study – Architecture with Provenance-aware ETL Applications
  22. Case Study – Sustainability Report Values
  23. Case Study – Provenance Descriptor Visualization
  24. Case Study – Provenance Descriptor Visualization (figure labels: 1. Lookup; 4. Aggregation)
  25. Case Study – Possible Queries
     • With OPMV: What are the data artifacts, processes and agents behind this data value? When were the processes executed, and for how long?
     • With OPMV + Cogs: How long did all lookups take? What scripts have been used to transform the data into RDF? To which values have constant factors been applied? Which aggregation functions were used to calculate this indicator?
  26. Outline: Motivation & Problem; Gap of ETL Descriptions; Interoperable ETL Provenance Model; Case Study; Conclusions
  27. Conclusions
     • Goals: provenance representation from an ETL perspective; semantic interoperability across different ETL applications; usability and ontological commitment
     • Evaluation in a small case study
     • For a full evaluation of the interoperability benefits, the model needs to be adopted in provenance-aware ETL applications
     • Starting point: provenance-aware Google Refine using Cogs
  28. Conclusions
     • Goals: provenance representation from an ETL perspective; semantic interoperability across different ETL applications; usability and ontological commitment
     • Thanks!
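
Slides 13 and 16 describe Cogs classes as specialisations of the OPMV process and artifact classes. The following is a minimal Python sketch of that idea, using plain tuples as triples instead of an RDF library. The parent/child pairs follow slide 16; the `subclass_of` dictionary, the `opmv_kind` helper and every `ex:` identifier are hypothetical names introduced for illustration. The spellings `cogs:Parsing` and `cogs:RegexFilter`, and the property names `opmv:used` and `opmv:wasGeneratedBy`, are assumptions and should be checked against the published vocabularies.

```python
# Cogs high-level classes and their OPMV parents, following slide 16.
# cogs:Execution and cogs:State appear there without an OPMV parent,
# so they are deliberately absent from this mapping.
subclass_of = {
    "cogs:Extraction": "opmv:Process",
    "cogs:Transformation": "opmv:Process",
    "cogs:Loading": "opmv:Process",
    "cogs:Object": "opmv:Artifact",
    "cogs:Layer": "opmv:Artifact",
    # Example leaf classes (identifier spellings are assumed).
    "cogs:Parsing": "cogs:Extraction",
    "cogs:RegexFilter": "cogs:Transformation",
}


def opmv_kind(cls):
    """Walk up the hierarchy until an opmv: class is reached, else None."""
    while cls is not None and not cls.startswith("opmv:"):
        cls = subclass_of.get(cls)
    return cls


# Retrospective description of the "Parse to RDF" step from slide 3.
# All ex: resources are hypothetical.
triples = [
    ("ex:parseLog", "rdf:type", "cogs:Parsing"),
    ("ex:printerLog", "rdf:type", "cogs:Object"),
    ("ex:parseLog", "opmv:used", "ex:printerLog"),
    ("ex:printingRdf", "opmv:wasGeneratedBy", "ex:parseLog"),
]

# Resolve every typed resource to its generic OPMV category.
kinds = {s: opmv_kind(o) for s, p, o in triples if p == "rdf:type"}
print(kinds)
```

A consumer that only understands OPMV can still interpret `ex:parseLog` as a process and `ex:printerLog` as an artifact, which is the interoperability argument made on slide 13.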

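The query "How long did all lookups take?" from slide 25 can be sketched over the four steps and durations shown on slide 3. This is an illustrative Python sketch over in-memory triples, not the authors' query implementation; the step types (`ex:Lookup`, `ex:Parsing`, `ex:Filter`, `ex:Aggregation`), the `ex:durationSeconds` property and the `total_duration` helper are hypothetical names.

```python
# (subject, predicate, object) triples for the four steps of slide 3.
# Types and the duration property are hypothetical placeholders.
triples = [
    ("ex:step1", "rdf:type", "ex:Lookup"),
    ("ex:step1", "ex:durationSeconds", 20),
    ("ex:step2", "rdf:type", "ex:Parsing"),
    ("ex:step2", "ex:durationSeconds", 30),
    ("ex:step3", "rdf:type", "ex:Filter"),
    ("ex:step3", "ex:durationSeconds", 1),
    ("ex:step4", "rdf:type", "ex:Aggregation"),
    ("ex:step4", "ex:durationSeconds", 1),
]


def total_duration(triples, step_type=None):
    """Sum step durations, optionally restricted to steps of one type."""
    selected = {
        s for s, p, o in triples
        if p == "rdf:type" and (step_type is None or o == step_type)
    }
    return sum(
        o for s, p, o in triples
        if p == "ex:durationSeconds" and s in selected
    )


lookup_seconds = total_duration(triples, "ex:Lookup")  # only the lookup step
all_seconds = total_duration(triples)                  # the whole workflow
print(lookup_seconds, all_seconds)
```

Restricting by step type is what the ETL-specific typing adds here: with generic process typing alone, only the whole-workflow total could be computed.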