Digital Enterprise Research Institute                                          www.deri.ie




            Capturing interactive data transformation
             operations using provenance workflows

             Tope Omitola, Andre Freitas, Edward Curry, Sean
             O'Riain, Nicholas Gibbins and Nigel Shadbolt



  SWPM Workshop 28.05.2012, Herakleion, Crete


 Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
Outline




           Motivation
           Interactive data transformations (IDTs)
           IDT & Provenance
           Modelling IDTs
           Provenance Representation
           Provenance Capture
           Case Study
           Conclusion
Motivation




           Dataspaces:
                 High number of heterogeneous data sources
                 Complex data transformation environment
                 Need for both repeatable data transformations and
                  once-off transformations
           Traditional ETL approaches for data
            transformation/integration:
                 Based on scripting/programming
                 Focus on repeatable data transformation processes
Interactive Data Transformations (IDTs)




        Based on user interaction paradigms for creating
         data transformations
        Uses GUI elements mapped to data transformation
         operations
        Instant feedback on each iteration
        Complementary to existing ETL tools
        Lowers the barrier for non-programmers to perform
         data transformations (reduces programming effort)
        Example platforms: Google Refine, Potter's Wheel,
         Wrangler
Interactive Data Transformations (IDTs)
Challenges




           How to model IDTs?

           Facilitating the reuse of previous IDTs

           Representing IDTs
                                                           Provenance

           Making IDT platforms provenance-aware

           Enabling transportability across IDT and ETL
            platforms
IDT & Provenance




           Provenance supports representation of interactive
            data transformations
           Output: a provenance descriptor which shows the
            relationship between the inputs, the outputs, and
            the applied transformation operations
           Both retrospective and prospective provenance
IDT




           IDT model
           Formal model (Algebra for IDT)
           Provenance representation
           Provenance capture of IDTs
IDT Model: Core Elements




           Schema and instance data
           Set of predefined operations
           GUI elements mapping to predefined operations
           User actions
                 Operation selection
                 Parameter selection
                 Operation composition (workflow)
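
The core elements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two operations, the GUI element identifiers, and the `UserAction` type are hypothetical names, not part of any IDT platform.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]
Table = List[Row]  # instance data; the column names act as the schema


# Hypothetical predefined operations (illustrative only).
def uppercase_column(table: Table, column: str) -> Table:
    return [{**row, column: row[column].upper()} for row in table]


def rename_column(table: Table, old: str, new: str) -> Table:
    return [{(new if key == old else key): value for key, value in row.items()}
            for row in table]


# GUI elements mapped to predefined operations (identifiers are made up).
GUI_TO_OPERATION: Dict[str, Callable[..., Table]] = {
    "menu.edit_cells.to_uppercase": uppercase_column,
    "menu.edit_column.rename": rename_column,
}


@dataclass
class UserAction:
    gui_element: str            # operation selection
    parameters: Dict[str, str]  # parameter selection


def apply_workflow(table: Table, actions: List[UserAction]) -> Table:
    """Operation composition: apply the selected operations in order."""
    for action in actions:
        operation = GUI_TO_OPERATION[action.gui_element]
        table = operation(table, **action.parameters)
    return table


data = [{"name": "ada"}, {"name": "grace"}]
workflow = [
    UserAction("menu.edit_cells.to_uppercase", {"column": "name"}),
    UserAction("menu.edit_column.rename", {"old": "name", "new": "winner"}),
]
result = apply_workflow(data, workflow)
```

Each operation returns a new table instead of mutating its input, so every intermediate state stays available for later provenance capture.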
IDT Model
Formalizing the mapping from IDT to Provenance




           Definition 1: A provenance-based interactive data
            transformation engine consists of a set of
            transformations (or activities) on a set of datasets,
            generating outputs in the form of other datasets or
            events that may trigger further transformations

           Definition 2: An interactive data transformation
            event consists of the input dataset, the output
            dataset(s), the applied transformation function,
            and the time the transformation took place
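
One possible reading of Definition 2 as a data structure is sketched below. The names `TransformationEvent`, `apply_and_record`, and `strip_whitespace` are assumptions made for illustration, and a dataset is simplified to a tuple of cell values.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Tuple

Dataset = Tuple[str, ...]  # simplification: a dataset is a tuple of cell values


@dataclass(frozen=True)
class TransformationEvent:
    """Hypothetical record holding the four parts named in Definition 2."""
    input_dataset: Dataset
    output_datasets: Tuple[Dataset, ...]
    function_name: str
    timestamp: datetime


def apply_and_record(function: Callable[[Dataset], Dataset],
                     dataset: Dataset,
                     when: datetime) -> TransformationEvent:
    """Apply a transformation and return the event describing it."""
    output = function(dataset)
    return TransformationEvent(dataset, (output,), function.__name__, when)


def strip_whitespace(dataset: Dataset) -> Dataset:
    return tuple(cell.strip() for cell in dataset)


event = apply_and_record(strip_whitespace, (" a ", "b "),
                         datetime(2011, 11, 16, 11, 2, 14))
```

The record is frozen so that an event cannot be altered after it has been captured.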
Formalizing the mapping from IDT to Provenance




           Definition 3: A run is a function from time to
            dataset(s) and the transformation applied to those
            dataset(s)

           Definition 4: A trace is the sequence of pairs of a
            run and the time the run was made
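
Definitions 3 and 4 could be modelled as follows. This is a sketch of one interpretation, not the formal model itself: a run is represented as a mapping from a point in time to the dataset(s) and the name of the transformation applied to them, and a trace as a list of (run, time) pairs. The helpers `record` and `transformations_in_order` are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Dataset = Tuple[str, ...]
# Run: time -> (dataset(s), name of the transformation applied to them)
Run = Dict[datetime, Tuple[Tuple[Dataset, ...], str]]
# Trace: sequence of (run, time the run was made) pairs
Trace = List[Tuple[Run, datetime]]


def record(trace: Trace, when: datetime,
           datasets: Tuple[Dataset, ...], transformation: str) -> Trace:
    """Return a new trace extended with one run made at time `when`."""
    run: Run = {when: (datasets, transformation)}
    return trace + [(run, when)]


def transformations_in_order(trace: Trace) -> List[str]:
    """List the applied transformations, ordered by the time of each run."""
    ordered = sorted(trace, key=lambda pair: pair[1])
    return [run[when][1] for run, when in ordered]


trace: Trace = []
trace = record(trace, datetime(2011, 11, 16, 11, 2, 14), ((" a ", "b "),), "trim")
trace = record(trace, datetime(2011, 11, 16, 11, 3, 0), (("a", "b"),), "uppercase")
history = transformations_in_order(trace)
```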
Provenance Representation




           Proposed in "Representing Interoperable Provenance
            Descriptions for ETL Workflows"

           Three-layered provenance model:
                 Open Provenance Model Vocabulary Layer
                 Cogs ETL Provenance Vocabulary
                 Domain-Specific Model Layer


           Linked Data standards
Provenance Capture Layers
Provenance Event-Capture Sequence Flow
Case study




        Implementation on the Google Refine (GR) platform
        Example descriptor

   # Prefixes rdf:, rdfs:, xsd:, opmv: and cogs: are assumed to be declared elsewhere.
   @prefix grf: <http://127.0.0.1:3333/project/1402144365904/> .

   # Process
   grf:MassCellChange-1092380975 rdf:type opmv:Process,
       cogs:ColumnOperation, cogs:Transformation ;
       cogs:operationName "MassCellChange"^^xsd:string ;
       # Mapping to the actual program
       cogs:programUsed "com.google.refine.operations.cell.MassEditOperation"^^xsd:string ;
       rdfs:label "Mass edit 1 cells in column ==List of winners=="^^xsd:string .

   # Input Artifact
   grf:MassCellChange-1092380975/1_0 rdf:type opmv:Artifact ;
       rdfs:label "* '''1955 [[Meena Kumari]]'[[Parineeta (1953 film)|Parineeta]]''''' as '''Lolita'''"^^xsd:string .

   # Output Artifact
   grf:MassCellChange-1092380975/1_1 rdf:type opmv:Artifact ;
       rdfs:label "* '''John Wayne'''"^^xsd:string .

   # Workflow structure
   grf:MassCellChange-1092380975/1_1 opmv:wasDerivedFrom grf:MassCellChange-1092380975/1_0 .
   grf:MassCellChange-1092380975 opmv:used grf:MassCellChange-1092380975/1_0 .
   grf:MassCellChange-1092380975/1_1 opmv:wasGeneratedBy grf:MassCellChange-1092380975 .
   grf:MassCellChange-1092380975/1_1 opmv:wasGeneratedAt "2011-11-16T11:02:14"^^xsd:dateTime .
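
A descriptor of this shape could be produced from a captured event roughly as follows. This is a minimal sketch, not the Google Refine extension's code: `describe_event` and `literal` are hypothetical helpers, the process types are hard-coded to match the example, and the output mirrors the slide's notation without being validated against a Turtle parser.

```python
from datetime import datetime
from typing import List


def literal(text: str, datatype: str = "xsd:string") -> str:
    """Format a typed literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^{datatype}'


def describe_event(process_id: str, operation_name: str, program: str,
                   input_label: str, output_label: str,
                   generated_at: datetime) -> List[str]:
    """Build descriptor lines for one transformation event (illustrative)."""
    process = f"grf:{process_id}"
    source = f"{process}/1_0"  # input artifact
    result = f"{process}/1_1"  # output artifact
    timestamp = literal(generated_at.isoformat(), "xsd:dateTime")
    return [
        f"{process} rdf:type opmv:Process, cogs:ColumnOperation, cogs:Transformation ;",
        f"    cogs:operationName {literal(operation_name)} ;",
        f"    cogs:programUsed {literal(program)} .",
        f"{source} rdf:type opmv:Artifact ;",
        f"    rdfs:label {literal(input_label)} .",
        f"{result} rdf:type opmv:Artifact ;",
        f"    rdfs:label {literal(output_label)} .",
        f"{result} opmv:wasDerivedFrom {source} .",
        f"{process} opmv:used {source} .",
        f"{result} opmv:wasGeneratedBy {process} .",
        f"{result} opmv:wasGeneratedAt {timestamp} .",
    ]


descriptor = describe_event(
    "MassCellChange-1092380975",
    "MassCellChange",
    "com.google.refine.operations.cell.MassEditOperation",
    'before "edit"',          # made-up input label, chosen to show escaping
    "* '''John Wayne'''",
    datetime(2011, 11, 16, 11, 2, 14),
)
print("\n".join(descriptor))
```

Using `datetime.isoformat()` gives a zero-padded timestamp, which is the lexical form that `xsd:dateTime` requires.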
Conclusion




           The proposed approach has a low impact on the
            existing IDT process
           Provenance representation supports different data
            models
           Preliminary implementation of a Google Refine
            provenance extension
