Capturing Interactive Data Transformation Operations using Provenance Workflows

Andre Freitas, Lecturer at University of Manchester
Digital Enterprise Research Institute                                          www.deri.ie




            Capturing interactive data transformation
             operations using provenance workflows

             Tope Omitola, Andre Freitas, Edward Curry, Sean
             O'Riain, Nicholas Gibbins and Nigel Shadbolt



  SWPM Workshop 28.05.2012, Herakleion, Crete


 Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
Outline




           Motivation
           Interactive data transformations (IDTs)
           IDT & Provenance
           Modelling IDTs
           Provenance Representation
           Provenance Capture
           Case Study
           Conclusion
Motivation




           Dataspaces:
                 High number of heterogeneous data sources
                 Complex data transformation environment
                 Need for both repeatable data transformations and
                  one-off transformations
           Traditional ETL approaches for data
            transformation/integration:
                 Based on scripting/programming
                 Focus on repeatable data transformation processes
Interactive Data Transformation (IDTs)




        Based on user interaction paradigms for creating data
         transformations
        Explores the mapping of GUI elements to data
         transformation operations
        Instant feedback after each iteration
        Complementary to existing ETL tools
        Lowers the barrier for non-programmers to perform data
         transformations (reduces programming effort)
        Example platforms: Google Refine, Potter's Wheel,
         Wrangler
Challenges




           How to model IDTs?

           Facilitating the reuse of previous IDTs

           Representing IDTs
                                                           Provenance

           Making IDT platforms provenance-aware

           Enabling transportability across IDT and ETL
            platforms
IDT & Provenance




           Provenance supports representation of interactive
            data transformations
           Output: a provenance descriptor which shows the
            relationship between the inputs, the outputs, and
            the applied transformation operations
           Both retrospective and prospective provenance
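
The distinction between retrospective provenance (a record of what was executed) and prospective provenance (a reusable recipe) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the operation registry `OPS` and the helpers `run_and_record` and `prospective_view` are hypothetical names, not part of any IDT platform or of the approach presented here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry of predefined cell-level operations.
OPS: Dict[str, Callable[[str], str]] = {"strip": str.strip, "upper": str.upper}


def run_and_record(cells: List[str], op_names: List[str]) -> Tuple[List[str], List[dict]]:
    """Apply the operations in order and keep a retrospective log of each step."""
    log = []
    for name in op_names:
        outputs = [OPS[name](cell) for cell in cells]
        # Retrospective provenance: the operation plus its concrete inputs and outputs.
        log.append({"operation": name, "inputs": cells, "outputs": outputs})
        cells = outputs
    return cells, log


def prospective_view(log: List[dict]) -> List[str]:
    """Derive a reusable recipe (prospective provenance) from the retrospective log."""
    return [entry["operation"] for entry in log]


result, log = run_and_record([" x ", "y"], ["strip", "upper"])
recipe = prospective_view(log)
# The recipe can be replayed on data that was never part of the original session.
replayed, _ = run_and_record([" z "], recipe)
```

The log ties each output to the inputs and the operation that produced it, while the recipe drops the concrete data and keeps only what is needed for reuse.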
IDT




           IDT model
           Formal model (Algebra for IDT)
           Provenance representation
           Provenance capture of IDTs
IDT Model: Core Elements




           Schema and instance data
           Set of predefined operations
           GUI elements mapping to predefined operations
           User actions
                 Operation selection
                 Parameter selection
                 Operation composition (workflow)
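
The core elements listed above can be illustrated with a minimal Python sketch. The operation names (`trim`, `upper`, `replace`), the `OPERATIONS` registry, and the `apply_workflow` helper are hypothetical and chosen for illustration only; the sketch shows how operation selection, parameter selection, and composition into a workflow might fit together.

```python
from typing import Callable, Dict, List, Tuple

# Instance data: a list of rows, each mapping a column name (schema) to a cell value.
Row = Dict[str, str]
Dataset = List[Row]


# Hypothetical set of predefined operations; each acts on one column
# and returns a new dataset rather than modifying its input.
def trim(data: Dataset, column: str) -> Dataset:
    return [{**row, column: row[column].strip()} for row in data]


def upper(data: Dataset, column: str) -> Dataset:
    return [{**row, column: row[column].upper()} for row in data]


def replace(data: Dataset, column: str, old: str, new: str) -> Dataset:
    return [{**row, column: new if row[column] == old else row[column]} for row in data]


# Stand-in for the mapping from GUI elements to predefined operations.
OPERATIONS: Dict[str, Callable[..., Dataset]] = {
    "trim": trim,
    "upper": upper,
    "replace": replace,
}

# One user action: the selected operation and the selected parameters.
Step = Tuple[str, dict]


def apply_workflow(data: Dataset, steps: List[Step]) -> Dataset:
    """Operation composition: apply each selected step to the previous result."""
    for name, params in steps:
        data = OPERATIONS[name](data, **params)
    return data


data = [{"name": "  ada "}, {"name": "alan"}]
workflow = [("trim", {"column": "name"}), ("upper", {"column": "name"})]
result = apply_workflow(data, workflow)
```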
IDT Model
Formalizing the mapping from IDT to Provenance




           Definition 1: A provenance-based interactive data
            transformation engine consists of a set of
            transformations (or activities) on a set of datasets,
            generating outputs in the form of other datasets or
            events, which may trigger further transformations

           Definition 2: An interactive data transformation
            event consists of the input dataset, the output
            dataset(s), the applied transformation function,
            and the time the transformation took place
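
Definitions 1 and 2 could be read as the following data structures. This is a sketch under assumptions, not the formal model itself: the class names `TransformationEvent` and `Engine`, the tuple-of-strings stand-in for a dataset, and the use of wall-clock UTC timestamps are all illustrative choices.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Tuple

# Immutable stand-in for a dataset (assumption: a flat tuple of cell values).
Dataset = Tuple[str, ...]


@dataclass(frozen=True)
class TransformationEvent:
    """Definition 2: input dataset, output dataset(s), applied function, and time."""
    input_dataset: Dataset
    output_datasets: Tuple[Dataset, ...]
    function_name: str
    timestamp: datetime


class Engine:
    """Definition 1 (simplified): transformations over datasets that emit events."""

    def __init__(self) -> None:
        self.events: List[TransformationEvent] = []

    def apply(self, name: str, fn: Callable[[str], str], dataset: Dataset) -> Dataset:
        output = tuple(fn(cell) for cell in dataset)
        # Record one event per applied transformation.
        self.events.append(
            TransformationEvent(dataset, (output,), name, datetime.now(timezone.utc))
        )
        return output


engine = Engine()
cleaned = engine.apply("strip", str.strip, (" a ", "b "))
shouted = engine.apply("upper", str.upper, cleaned)
```

Because the output of one event is the input of the next, the recorded events already form a chain that a provenance descriptor can expose.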
Formalizing the mapping from IDT to Provenance (continued)




           Definition 3: A run is a function from time to
            dataset(s) and the transformation applied to those
            dataset(s)

           Definition 4: A trace is the sequence of pairs of a
            run and the time the run was made
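
One possible reading of Definitions 3 and 4 is sketched below in Python. The type aliases, the use of integer time points instead of clock times, and the helpers `make_run` and `replay` are assumptions made for illustration; the definitions leave these details open.

```python
from typing import Callable, List, Tuple

Dataset = Tuple[str, ...]
# What a run returns for a time point: the dataset(s) and the transformation applied to them.
State = Tuple[Tuple[Dataset, ...], str]
# Definition 3 (one reading): a run is a function from time to such a state.
Run = Callable[[int], State]
# Definition 4 (one reading): a trace is a sequence of (run, time) pairs.
Trace = List[Tuple[Run, int]]


def make_run(history: List[State]) -> Run:
    """Build a run from a recorded history, indexed by logical time."""
    def run(t: int) -> State:
        return history[t]
    return run


history: List[State] = [
    (((" a ", "b "),), "strip"),  # at time 0, "strip" is applied to the raw dataset
    ((("a", "b"),), "upper"),     # at time 1, "upper" is applied to the stripped dataset
]
run = make_run(history)
trace: Trace = [(run, t) for t in range(len(history))]


def replay(trace: Trace) -> List[str]:
    """List the transformations in the order in which the trace recorded them."""
    return [r(t)[1] for r, t in trace]
```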
Provenance Representation




           Proposed in "Representing Interoperable Provenance
            Descriptions for ETL Workflows"

           Three-layered provenance model:
                 Open Provenance Model Vocabulary Layer
                 Cogs ETL Provenance Vocabulary
                 Domain-Specific Model Layer


           Linked Data standards
Provenance Capture Layers
Provenance Event-Capture Sequence Flow
Case study




        Implementation over the Google Refine (GR) platform
        Example descriptor

   @prefix grf: <http://127.0.0.1:3333/project/1402144365904/> .
   # The rdf:, rdfs:, xsd:, opmv: and cogs: prefixes are assumed to be declared elsewhere.

   # Process
   grf:MassCellChange-1092380975 rdf:type opmv:Process,
       cogs:ColumnOperation, cogs:Transformation ;
     cogs:operationName "MassCellChange"^^xsd:string ;
     # Mapping to the actual program
     cogs:programUsed "com.google.refine.operations.cell.MassEditOperation"^^xsd:string ;
     rdfs:label "Mass edit 1 cells in column ==List of winners=="^^xsd:string .

   # Input Artifact
   grf:MassCellChange-1092380975/1_0 rdf:type opmv:Artifact ;
     rdfs:label "* '''1955 [[Meena Kumari]]'[[Parineeta (1953 film)|Parineeta]]''''' as '''Lolita'''"^^xsd:string .

   # Output Artifact
   grf:MassCellChange-1092380975/1_1 rdf:type opmv:Artifact ;
     rdfs:label "* '''John Wayne'''"^^xsd:string .

   # Workflow structure
   grf:MassCellChange-1092380975/1_1 opmv:wasDerivedFrom grf:MassCellChange-1092380975/1_0 .
   grf:MassCellChange-1092380975 opmv:used grf:MassCellChange-1092380975/1_0 .
   grf:MassCellChange-1092380975/1_1 opmv:wasGeneratedBy grf:MassCellChange-1092380975 .
   grf:MassCellChange-1092380975/1_1 opmv:wasGeneratedAt "2011-11-16T11:02:14"^^xsd:dateTime .
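
The workflow-structure statements in the descriptor follow a regular pattern, which the Python sketch below reproduces as plain strings. It is not the code of the Google Refine extension: the helper `workflow_triples` is hypothetical, and the `_0`/`_1` suffix convention for input and output artifacts is inferred from this single example. No RDF library is used; the OPMV property names are taken from the descriptor above.

```python
from datetime import datetime
from typing import List, Tuple

Triple = Tuple[str, str, str]


def workflow_triples(process: str, cell: str, when: datetime) -> List[Triple]:
    """Build the four workflow-structure statements for one changed cell."""
    # Assumption: "<process>/<cell>_0" names the input and "<process>/<cell>_1" the output.
    source = f"{process}/{cell}_0"
    result = f"{process}/{cell}_1"
    stamp = f'"{when.isoformat()}"^^xsd:dateTime'
    return [
        (result, "opmv:wasDerivedFrom", source),
        (process, "opmv:used", source),
        (result, "opmv:wasGeneratedBy", process),
        (result, "opmv:wasGeneratedAt", stamp),
    ]


triples = workflow_triples(
    "grf:MassCellChange-1092380975", "1", datetime(2011, 11, 16, 11, 2, 14)
)
# Render each triple as one statement line.
lines = [f"{s} {p} {o} ." for s, p, o in triples]
```

Generating the statements from the process identifier keeps the artifact names and the process name consistent, which is what lets the derivation chain be followed later.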
Conclusion




           The proposed approach has low impact on the
            existing IDT process
           The provenance representation supports different data
            models
           Preliminary implementation of a Google Refine
            provenance extension