Detecting Duplicate Records in Scientific Workflow Results



Khalid Belhajjame (1), Paolo Missier (2), and Carole A. Goble (1)
               (1) University of Manchester

               (2) University of Newcastle
Scientific Workflows
                  Scientific workflows are increasingly
                  used by scientists as a means for
                  specifying and enacting their
                  experiments.
                  They tend to be data-intensive.

                  The data sets obtained as a result of
                  their enactment can be stored in
                  public repositories to be queried,
                  analyzed and used to feed the
                  execution of other workflows.
2   IPAW 2012
Duplicates in Workflow Results

      The datasets obtained as a result of workflow execution often
       contain duplicates.
      As a result:
         The analysis and interpretation of workflow results may become
          tedious.
         The presence of duplicates also unnecessarily increases the size
          of workflow results.



Duplicate Record Detection
      Research in duplicate record detection has been active for
        more than three decades.
          Elmagarmid et al. (2007) conducted a comprehensive survey of
            the topic.

      We do not aim to design yet another algorithm for
        comparing and matching records.
      Rather, we investigate how provenance traces produced as a
        result of workflow executions can be used to guide the
        detection of duplicate records in workflow results.
    Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE
    Trans. Knowl. Data Eng., 19(1):1–16, 2007.
Outline

      Data-Driven Workflows and Provenance Trace


      A method for guiding duplicate detection in workflow
       results based on provenance traces.

      Preliminary validation using real-world workflows.




Preliminaries: Data-Driven Workflows
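      A data-driven workflow can be defined as a directed graph:

                         $wf = \langle N, E \rangle$
      A node represents an analysis operation, which has a set of
       input and output parameters:
                      $\langle op, I_{op}, O_{op} \rangle \in N$
      The edges are dataflow dependencies, each connecting an output
       of one operation to an input of another:

                  $\langle \langle op, o \rangle, \langle op', i' \rangle \rangle \in E$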
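
      As an illustration only, the graph definition above could be held in memory roughly as follows. The class names (Operation, Edge, Workflow) and the connect helper are assumptions made for this sketch, not part of the paper; the two operations are the ones used in the example later in the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Operation:
    """A node <op, I_op, O_op>: operation name plus parameter names."""
    name: str
    inputs: tuple
    outputs: tuple


@dataclass(frozen=True)
class Edge:
    """A dataflow dependency <<op, o>, <op', i'>>."""
    src_op: str
    src_output: str
    dst_op: str
    dst_input: str


@dataclass
class Workflow:
    """wf = <N, E>, with nodes indexed by operation name (sketch only)."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_operation(self, op: Operation) -> None:
        self.nodes[op.name] = op

    def connect(self, src_op, src_output, dst_op, dst_input) -> None:
        # An edge must link an existing output to an existing input.
        if src_output not in self.nodes[src_op].outputs:
            raise ValueError(f"{src_op} has no output {src_output!r}")
        if dst_input not in self.nodes[dst_op].inputs:
            raise ValueError(f"{dst_op} has no input {dst_input!r}")
        self.edges.append(Edge(src_op, src_output, dst_op, dst_input))


wf = Workflow()
wf.add_operation(Operation("IdentifyProtein", ("i",), ("o",)))
wf.add_operation(Operation("GetGOTerm", ("i'",), ("o'",)))
wf.connect("IdentifyProtein", "o", "GetGOTerm", "i'")
```

      Validating parameter names when an edge is added keeps every element of E consistent with the parameter sets declared in N.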
Preliminaries: Provenance Trace
    The execution of workflows gives rise to a provenance trace,
    which we capture using two relations.
      Transformation: specifies that the execution of an
    operation took as input a given ordered set of records and
    generated another ordered set of records:
      $OutB_{op} \leftarrow InB_{op}$, where
      $OutB_{op} = \langle op, o_1, r_{o_1} \rangle, \dots, \langle op, o_m, r_{o_m} \rangle$ and
      $InB_{op} = \langle op, i_1, r_{i_1} \rangle, \dots, \langle op, i_n, r_{i_n} \rangle$

      Transfer: specifies the transfer of records along the edges of
    the workflow:
      $\langle op', i', r \rangle \leftarrow \langle op, o, r \rangle$
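
    A minimal sketch of how the two relations might be stored, assuming records are referred to by string identifiers. The type names (Binding, Transformation, Transfer), the transfer helper and the record identifiers are hypothetical and chosen for this illustration.

```python
from typing import NamedTuple, Tuple


class Binding(NamedTuple):
    """<op, parameter, record>: a record bound to a parameter of op."""
    op: str
    param: str
    record: str


class Transformation(NamedTuple):
    """OutB_op obtained from InB_op by one invocation of op."""
    out_binding: Tuple[Binding, ...]
    in_binding: Tuple[Binding, ...]


class Transfer(NamedTuple):
    """A record moved unchanged along a workflow edge."""
    target: Binding
    source: Binding


def transfer(source: Binding, dst_op: str, dst_param: str) -> Transfer:
    # The record itself is the same on both ends of the edge.
    return Transfer(Binding(dst_op, dst_param, source.record), source)


# Hypothetical record identifiers, for illustration only.
r_in = Binding("IdentifyProtein", "i", "peptide-1")
r_out = Binding("IdentifyProtein", "o", "protein-1")
trace = [
    Transformation(out_binding=(r_out,), in_binding=(r_in,)),
    transfer(r_out, "GetGOTerm", "i'"),
]
```

    Keeping bindings as ordered tuples mirrors the "ordered set of records" wording of the Transformation relation.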


Provenance-Guided Detection of
    Duplicates: Approach
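    To guide the detection of duplicates in workflow results we
      exploit the following fact:

      An operation that is known to be deterministic produces
       identical output bindings given the same input binding.

    $deterministic(op) \wedge (OutB_{op} \leftarrow InB_{op}) \in T \wedge (OutB'_{op} \leftarrow InB_{op}) \in T \implies id(OutB_{op}, OutB'_{op})$

    where $T$ is the provenance trace and $id$ states that two bindings are identical.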
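
    The rule above can be sketched as a grouping step over the trace. This is an illustration under assumptions, not the algorithm from the paper: each transformation is flattened to a hypothetical (op, input binding, output binding) triple, and the set of deterministic operations is taken as given.

```python
from collections import defaultdict


def duplicate_output_bindings(transformations, deterministic_ops):
    """Group output bindings that the determinism rule marks as identical.

    transformations: iterable of (op, in_binding, out_binding) triples,
    where each binding is a tuple of record identifiers (assumed layout).
    Returns a list of groups containing two or more output bindings.
    """
    groups = defaultdict(list)
    for op, in_binding, out_binding in transformations:
        # The rule only applies to operations known to be deterministic.
        if op in deterministic_ops:
            groups[(op, in_binding)].append(out_binding)
    return [outs for outs in groups.values() if len(outs) > 1]


# Hypothetical trace: "Shuffle" is an invented non-deterministic operation.
T = [
    ("IdentifyProtein", ("pep1",), ("r1",)),
    ("IdentifyProtein", ("pep1",), ("r2",)),
    ("IdentifyProtein", ("pep2",), ("r3",)),
    ("Shuffle", ("x",), ("r4",)),
    ("Shuffle", ("x",), ("r5",)),
]
duplicates = duplicate_output_bindings(T, {"IdentifyProtein"})
```

    Here r1 and r2 are reported as duplicates without comparing their contents, while r4 and r5 are left alone because Shuffle is not in the deterministic set.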




Provenance-Guided Detection of
     Duplicates: Example
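     [Figure: dataflow i → IdentifyProtein → o → i' → GetGOTerm → o', with record sets R_i, R_o, R'_i and R'_o bound to the parameters i, o, i' and o' respectively.]

     1.  The set of records $R_i$ bound to the input parameter
         of the starting operation is compared to identify duplicate
         records.

          The result of this phase is a partition of $R_i$ into disjoint sets of
          identical records:
                                   $R_i = R_i^1 \cup \dots \cup R_i^n$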
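
     Step 1 can be sketched as follows, assuming a record comparison function is supplied by an existing duplicate detection technique and behaves as an equivalence relation. The function name partition_identical, the sample accession strings and the case-insensitive comparator are all hypothetical.

```python
def partition_identical(records, identical):
    """Partition records into disjoint classes of identical records.

    identical(a, b) is assumed to be an equivalence relation, so it is
    enough to compare each record with one representative per class.
    """
    classes = []
    for record in records:
        for cls in classes:
            if identical(record, cls[0]):
                cls.append(record)
                break
        else:
            # No existing class matched: start a new one.
            classes.append([record])
    return classes


def same_accession(a, b):
    # Toy comparator: ignore case and surrounding whitespace.
    return a.strip().upper() == b.strip().upper()


R_i = ["P12345", "p12345 ", "Q99999"]
classes = partition_identical(R_i, same_accession)
```

     Only this first phase needs pairwise record comparison; the later phases reuse its result through the provenance trace.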
Provenance-Guided Detection of
      Duplicates: Example
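     [Figure: the dataflow i → IdentifyProtein → o → i' → GetGOTerm → o', with record sets R_i, R_o, R'_i and R'_o.]

      2.  The sets of records $R_o$, $R'_i$ and $R'_o$ are partitioned into sets
          of identical records based on the partitioning of $R_i$. For
          example:
                                    $R_o = R_o^1 \cup \dots \cup R_o^n$
          where
          $R_o^k = \{ r_o \in R_o \ \text{s.t.}\ \exists\, r_i \in R_i^k,\ \langle IdentifyProtein, o, r_o \rangle \leftarrow \langle IdentifyProtein, i, r_i \rangle \}$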
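
      A sketch of step 2 for an operation with one input and one output, assuming the Transformation relation has been reduced to a dictionary from output record to input record. The function name and record identifiers are hypothetical; the multi-parameter case handled in the paper is not covered here.

```python
def propagate_partition(input_classes, derived_from):
    """Carry a partition of R_i over to R_o through the provenance trace.

    input_classes: list of classes (lists) of input record identifiers.
    derived_from: dict mapping each output record to the input record it
    was generated from by a deterministic operation (assumed layout).
    Returns output classes in the same order as the input classes.
    """
    class_of = {
        record: k
        for k, cls in enumerate(input_classes)
        for record in cls
    }
    output_classes = [[] for _ in input_classes]
    for r_out, r_in in derived_from.items():
        # Outputs of identical inputs fall into the same class.
        output_classes[class_of[r_in]].append(r_out)
    return output_classes


R_i_classes = [["ri1", "ri2"], ["ri3"]]
xform = {"ro1": "ri1", "ro2": "ri2", "ro3": "ri3"}
R_o_classes = propagate_partition(R_i_classes, xform)
```

      Since Transfer moves records unchanged along an edge, the partition of R'_i follows directly from that of R_o, and the same propagation can then be repeated for GetGOTerm to obtain the partition of R'_o.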



Provenance-Guided Detection of
     Duplicates: Example
       In the example just described, the operations that compose
        the workflow have exactly one input and one output
        parameter.
          However, the algorithm presented in the paper supports
           operations with multiple input and output parameters.

       Note that we assume that the analysis operations that
        compose the workflow are deterministic. This is not always
        the case.
          This raises the question of how to determine whether a given
           operation is deterministic.
Verifying The Determinism of Analysis
     Operations
       To verify the determinism of operations, we use an approach
       whereby operations are probed.
     1.  Given an operation op, we select example values that can
           be used by the inputs of op, and invoke op using those
           values multiple times.
     2.     If op produces identical output values given identical input
           values, then it is likely to be deterministic; otherwise, it is
           not deterministic.
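
     The probing procedure can be sketched as below, assuming the operation is exposed as a Python callable and outputs can be compared with ==. The function name probe_determinism and the two toy operations are hypothetical; the default of three runs follows the number of runs reported in the validation.

```python
import itertools


def probe_determinism(op, example_inputs, runs=3):
    """Invoke op repeatedly on each example input and compare outputs.

    Returns False as soon as two runs on the same input disagree.
    A True result only means "likely deterministic": probing cannot
    prove determinism.
    """
    for args in example_inputs:
        first = op(*args)
        for _ in range(runs - 1):
            if op(*args) != first:
                return False
    return True


# Toy operations, for illustration only.
def to_upper(sequence):
    return sequence.upper()


_counter = itertools.count()


def stamped(sequence):
    # Output depends on hidden state, so repeated calls differ.
    return (sequence, next(_counter))


likely_deterministic = probe_determinism(to_upper, [("acgt",), ("ttga",)])
not_deterministic = probe_determinism(stamped, [("acgt",)])
```

     The asymmetry in the return value matches step 2: a single disagreement refutes determinism, whereas agreement only makes it likely.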



Collection-Based Workflows
     To support duplicate detection in collection-based workflows we
     need to be able to:
       Identify when two collections are identical
        Two collections $R_i$ and $R_j$ are identical if they are of the same size and
        there is a bijective mapping:
                                $map : R_i \rightarrow R_j$
        that maps each record $r_i$ in $R_i$ to a record $r_j$ in $R_j$ such that $r_i$ and $r_j$ are
        identical.
       Identify duplicate records between two collections that
        are known to be identical
        Identify a bijective mapping that maps every $r_i$ in $R_i$ to an identical
        $r_j$ in $R_j$.
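
     Both requirements can be sketched with one routine that either returns a bijective mapping or reports that none exists. This is an illustration, not the procedure from the paper: the function name is hypothetical, and the greedy matching is only guaranteed correct under the assumption that the record comparison is an equivalence relation.

```python
def match_collections(r_i, r_j, identical):
    """Return a bijection between two collections, or None.

    The mapping is a list of (index in r_i, index in r_j) pairs such
    that the paired records are identical. Greedy matching suffices
    when identical(a, b) is an equivalence relation (assumption).
    """
    if len(r_i) != len(r_j):
        return None
    unmatched = list(range(len(r_j)))
    mapping = []
    for a_idx, a in enumerate(r_i):
        for b_idx in unmatched:
            if identical(a, r_j[b_idx]):
                mapping.append((a_idx, b_idx))
                unmatched.remove(b_idx)
                break
        else:
            # Some record of r_i has no identical partner left in r_j.
            return None
    return mapping


def equal(a, b):
    return a == b


mapping = match_collections(["a", "b", "a"], ["b", "a", "a"], equal)
no_mapping = match_collections(["a", "b"], ["a", "a"], equal)
```

     A non-None result shows that the collections are identical, and the returned pairs are the duplicate records across the two collections.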
Validation
       The method that we presented in this paper can be applied when the operations
        are deterministic.

       To gain insight into the degree to which the operations that compose
        workflows are deterministic, we ran an experiment.

       Datasets: 15 bioinformatics workflows that cover a wide range of analyses,
        namely biological pathway analysis, sequence alignment, and molecular interaction
        analysis.

       Process: To identify which of the operations in these workflows are deterministic, we ran each
        of them 3 times using example values that were found either within
        myExperiment or BioCatalogue.


Validation
       Manual analysis of the results showed that 5 of the 151 operations
         (about 3.3%) that compose the workflows are not deterministic.

       Note that many of the operations that we analyzed access and use
         underlying data sources in their computation. Therefore, updates to such
         sources may break the determinism assumption (Chirigati and Freire,
         2012).

       This suggests that determinism holds within a window of time
         during which the underlying sources remain the same, and that there is a
         need for monitoring techniques to identify such windows.
     Fernando Chirigati and Juliana Freire. Towards Integrating Workflow and Database Provenance: A Practical
     Approach. IPAW, 2012.
Conclusions and Future Work

      We described a method that can be used to guide duplicate
       detection in workflow results.

       Future work: monitoring the determinism of analysis operations.


       Future work: extending the method to support duplicate detection across
        the results of different workflows.


