Detecting Duplicate Records in Scientific Workflow Results
 


This talk presents a solution whereby duplicate records in workflow results are detected using provenance traces.


    • Detecting Duplicate Records in Scientific Workflow Results
      Khalid Belhajjame (1), Paolo Missier (2), and Carole A. Goble (1)
      (1) University of Manchester, (2) University of Newcastle
    • Scientific Workflows
      • Scientific workflows are increasingly used by scientists as a means for specifying and enacting their experiments.
      • They tend to be data intensive.
      • The data sets obtained as a result of their enactment can be stored in public repositories to be queried, analyzed, and used to feed the execution of other workflows.
      IPAW 2012
    • Duplicates in Workflow Results
      • The datasets obtained as a result of workflow execution often contain duplicates.
      • As a result:
        • The analysis and interpretation of workflow results may become tedious.
        • The presence of duplicates also unnecessarily increases the size of workflow results.
    • Duplicate Record Detection
      • Research in duplicate record detection has been active for more than three decades.
      • Elmagarmid et al. (2007) conducted a comprehensive survey of the topic.
      • We do not aim to design yet another algorithm for comparing and matching records.
      • Rather, we investigate how provenance traces produced as a result of workflow executions can be used to guide the detection of duplicate records in workflow results.
      Reference: Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.
    • Outline
      • Data-driven workflows and provenance traces
      • A method for guiding duplicate detection in workflow results based on provenance traces
      • Preliminary validation using real-world workflows
    • Preliminaries: Data-Driven Workflows
      • A data-driven workflow can be defined as a directed graph: $wf = \langle N, E \rangle$
      • A node represents an analysis operation, which has a set of input and output parameters: $\langle op, I_{op}, O_{op} \rangle \in N$
      • The edges are dataflow dependencies: $\langle \langle op, o \rangle, \langle op', i \rangle \rangle \in E$
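The graph definition above can be made concrete with a small data structure. This is a minimal sketch, not the authors' implementation: the class names `Operation` and `Workflow` and the method names are assumptions introduced for illustration, and the two operation names are borrowed from the example that appears later in the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Operation:
    """A node <op, I_op, O_op>: a name plus input and output parameter names."""
    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


@dataclass
class Workflow:
    """A workflow <N, E>; class and method names are illustrative assumptions."""
    nodes: dict[str, Operation] = field(default_factory=dict)
    # Each edge links an output parameter (op, o) to an input parameter (op', i).
    edges: set[tuple[tuple[str, str], tuple[str, str]]] = field(default_factory=set)

    def add_operation(self, op: Operation) -> None:
        self.nodes[op.name] = op

    def connect(self, src_op: str, out_param: str, dst_op: str, in_param: str) -> None:
        # Reject edges that mention parameters the operations do not declare.
        if out_param not in self.nodes[src_op].outputs:
            raise ValueError(f"{src_op} has no output parameter {out_param}")
        if in_param not in self.nodes[dst_op].inputs:
            raise ValueError(f"{dst_op} has no input parameter {in_param}")
        self.edges.add(((src_op, out_param), (dst_op, in_param)))


wf = Workflow()
wf.add_operation(Operation("IdentifyProtein", inputs=("i",), outputs=("o",)))
wf.add_operation(Operation("GetGOTerm", inputs=("i",), outputs=("o",)))
wf.connect("IdentifyProtein", "o", "GetGOTerm", "i")
```

Parameter names are scoped by their operation, so both operations can use `i` and `o` without ambiguity.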
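    • Preliminaries: Provenance Trace
      The execution of workflows gives rise to a provenance trace, which we capture using two relations.
      • Transformation: specifies that the execution of an operation took as input a given ordered set of records and generated another ordered set of records: $OutB_{op} \Leftarrow InB_{op}$, where $OutB_{op} = \langle \langle op, o_1 \rangle, r_{o_1} \rangle, \dots, \langle \langle op, o_m \rangle, r_{o_m} \rangle$ and $InB_{op} = \langle \langle op, i_1 \rangle, r_{i_1} \rangle, \dots, \langle \langle op, i_n \rangle, r_{i_n} \rangle$
      • Transfer: specifies the transfer of records along the edges of the workflow: $\langle \langle op', i \rangle, r \rangle \leftarrow \langle \langle op, o \rangle, r \rangle$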
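One possible in-memory encoding of the two trace relations is sketched below. The type names `Binding`, `Transformation`, and `Transfer` are assumptions made for this illustration, and the record values are placeholders rather than real data.

```python
from typing import NamedTuple


class Binding(NamedTuple):
    """A record bound to a parameter of an operation."""
    op: str
    param: str
    record: str


class Transformation(NamedTuple):
    """One execution of an operation: ordered output bindings derived from ordered input bindings."""
    outputs: tuple[Binding, ...]
    inputs: tuple[Binding, ...]


class Transfer(NamedTuple):
    """A record moved along an edge, from an output parameter to an input parameter."""
    target: Binding
    source: Binding


# Placeholder records; a real trace would be emitted by the workflow engine.
trace = [
    Transformation(
        outputs=(Binding("IdentifyProtein", "o", "protein-1"),),
        inputs=(Binding("IdentifyProtein", "i", "masses-1"),),
    ),
    Transfer(
        target=Binding("GetGOTerm", "i", "protein-1"),
        source=Binding("IdentifyProtein", "o", "protein-1"),
    ),
]
```

A transfer carries the same record on both sides, whereas a transformation relates different records.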
    • Outline
      • Data-driven workflows and provenance traces
      • A method for guiding duplicate detection in workflow results based on provenance traces
      • Preliminary validation using real-world workflows
    • Provenance-Guided Detection of Duplicates: Approach
      To guide the detection of duplicates in workflow results, we exploit the following fact:
      • An operation that is known to be deterministic produces identical output bindings given the same input binding: $deterministic(op) \wedge (OutB_{op} \Leftarrow InB_{op}) \in T \wedge (OutB'_{op} \Leftarrow InB_{op}) \in T \implies id(OutB_{op}, OutB'_{op})$
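The rule above can be read as a grouping step over the transformations in a trace. The sketch below assumes that identical input bindings have already been replaced by a single canonical value, so plain equality is enough to group them; the function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def duplicate_output_groups(transformations, deterministic_ops):
    """Group output bindings that the determinism rule marks as identical.

    `transformations` is an iterable of (op, input_binding, output_binding)
    tuples with hashable bindings. Operations that are not in
    `deterministic_ops` are ignored, because the rule says nothing about them.
    """
    groups = defaultdict(list)
    for op, input_binding, output_binding in transformations:
        if op in deterministic_ops:
            groups[(op, input_binding)].append(output_binding)
    # Only groups with at least two members point at duplicates.
    return [outs for outs in groups.values() if len(outs) > 1]


# Hypothetical trace: "lookup" is deterministic, "sample" is not.
transformations = [
    ("lookup", ("a",), ("out#1",)),
    ("lookup", ("a",), ("out#2",)),
    ("lookup", ("b",), ("out#3",)),
    ("sample", ("a",), ("x#1",)),
    ("sample", ("a",), ("x#2",)),
]
groups = duplicate_output_groups(transformations, deterministic_ops={"lookup"})
```

Here `out#1` and `out#2` end up in one group without their contents ever being compared, which is the saving the approach aims for.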
    • Provenance-Guided Detection of Duplicates: Example
      [Figure: a two-step workflow in which IdentifyProtein (input i, output o) feeds GetGOTerm (input i', output o'); the record sets bound to these parameters are $R_i$, $R_o$, $R'_i$, and $R'_o$.]
      1. The records in $R_i$, which are bound to the input parameter of the starting operation, are compared to identify duplicate records. The result of this phase is a partition of $R_i$ into disjoint sets of identical records: $R_i = R_i^1 \cup \dots \cup R_i^n$
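    • Provenance-Guided Detection of Duplicates: Example
      [Figure: the same two-step workflow, IdentifyProtein (input i, output o) feeding GetGOTerm (input i', output o'), with record sets $R_i$, $R_o$, $R'_i$, and $R'_o$.]
      2. The sets of records $R_o$, $R'_i$, and $R'_o$ are partitioned into sets of identical records based on the partitioning of $R_i$. For example, $R_o = R_o^1 \cup \dots \cup R_o^n$, where $R_o^k = \{ r_o \in R_o \ \text{s.t.}\ \exists\, r_i \in R_i^k,\ \langle \langle IdentifyProtein, o \rangle, r_o \rangle \Leftarrow \langle \langle IdentifyProtein, i \rangle, r_i \rangle \}$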
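The two steps of the example can be sketched as a single propagation function applied once per trace relation. This is an illustration under simplifying assumptions: every operation is deterministic and has one input and one output, each dictionary maps a derived record id to the record id it came from, and all record ids are invented.

```python
def propagate(partition, derived_from):
    """Carry a partition of source records over to the records derived from them.

    `partition` is a list of sets of record ids already known to be identical.
    `derived_from` maps a derived record id to its source record id, as read
    from the transformation or transfer relation of the provenance trace.
    """
    return [
        {derived for derived, source in derived_from.items() if source in block}
        for block in partition
    ]


# Step 1 (assumed done by record comparison): ri1 and ri2 are identical.
partition_ri = [{"ri1", "ri2"}, {"ri3"}]

# Hypothetical trace content for the IdentifyProtein -> GetGOTerm chain.
xform_identify = {"ro1": "ri1", "ro2": "ri2", "ro3": "ri3"}  # transformations
xfer = {"r'i1": "ro1", "r'i2": "ro2", "r'i3": "ro3"}          # transfers
xform_go = {"r'o1": "r'i1", "r'o2": "r'i2", "r'o3": "r'i3"}   # transformations

# Step 2: partition R_o, R'_i and R'_o without comparing any records.
partition_ro = propagate(partition_ri, xform_identify)
partition_ri_next = propagate(partition_ro, xfer)
partition_ro_next = propagate(partition_ri_next, xform_go)
```

Only the first partition requires comparing record contents; the remaining three follow from the trace alone.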
    • Provenance-Guided Detection of Duplicates: Example
      • In the example just described, the operations that compose the workflow have exactly one input and one output parameter.
      • However, the algorithm presented in the paper supports operations with multiple input and output parameters.
      • Notice that we assume that the analysis operations that compose the workflow are deterministic. This is not always the case.
      • This raises the question of how to determine whether a given operation is deterministic.
    • Verifying the Determinism of Analysis Operations
      To verify the determinism of operations, we use an approach whereby operations are probed.
      1. Given an operation op, we select example values that can be used as inputs of op, and invoke op multiple times using those values.
      2. If op produces identical output values given identical input values, then it is likely to be deterministic; otherwise, it is not deterministic.
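The probing procedure above can be sketched as follows. This is an assumption-laden illustration: operations are modelled as plain Python callables, output identity is approximated by `==`, and the default of three runs mirrors the number of runs reported in the validation part of the talk. The two sample operations are invented.

```python
from itertools import count


def probe_determinism(op, example_inputs, runs=3):
    """Return True if `op` gave identical outputs on every repeated call.

    `example_inputs` is a list of argument tuples. A True result only means
    the operation is likely deterministic; a False result is conclusive.
    """
    for args in example_inputs:
        outputs = [op(*args) for _ in range(runs)]
        if any(out != outputs[0] for out in outputs[1:]):
            return False
    return True


def normalise(sequence):
    """A deterministic stand-in for an analysis operation."""
    return sequence.strip().upper()


_job_ids = count()


def submit(sequence):
    """A non-deterministic stand-in: each call embeds a fresh job id."""
    return f"{sequence}-job{next(_job_ids)}"


likely_deterministic = probe_determinism(normalise, [("acgt ",), ("ttga",)])
not_deterministic = probe_determinism(submit, [("acgt",)])
```

The asymmetry in the return value matches the wording of step 2: agreement across runs is only evidence of determinism, while a single disagreement rules it out.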
    • Collection-Based Workflows
      To support duplicate detection in collection-based workflows, we need to be able to:
      • Identify when two collections are identical. Two collections $R_i$ and $R_j$ are identical if they are of the same size and there is a bijective mapping $map : R_i \rightarrow R_j$ that maps each record $r_i$ in $R_i$ to a record $r_j$ in $R_j$ such that $r_i$ and $r_j$ are identical.
      • Identify duplicate records between two collections that are known to be identical, that is, identify a bijective mapping that maps every $r_i$ in $R_i$ to an identical $r_j$ in $R_j$.
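Both requirements above reduce to finding a bijective mapping between two collections. The sketch below uses a greedy search, which is sufficient only under the assumption that record identity is an equivalence relation; the function name, the `same` predicate, and the sample records are all hypothetical.

```python
def match_collections(ri, rj, same):
    """Return a bijective mapping {index in ri: index in rj}, or None.

    `same(a, b)` decides whether two records are identical. Greedy matching
    is correct here only because `same` is assumed to be an equivalence
    relation, so any identical unmatched record is as good as any other.
    """
    if len(ri) != len(rj):
        return None
    unmatched = list(range(len(rj)))
    mapping = {}
    for a_idx, a in enumerate(ri):
        hit = next((b_idx for b_idx in unmatched if same(a, rj[b_idx])), None)
        if hit is None:
            return None  # no identical partner left, so the collections differ
        unmatched.remove(hit)
        mapping[a_idx] = hit
    return mapping


def same_label(a, b):
    """Illustrative identity test: ignore case and surrounding whitespace."""
    return a.strip().lower() == b.strip().lower()


mapping = match_collections(["Kinase", " ligase"], ["ligase", "KINASE "], same_label)
no_mapping = match_collections(["Kinase", "Kinase"], ["kinase", "ligase"], same_label)
```

A returned mapping answers both questions at once: its existence shows that the collections are identical, and its entries pair up the duplicate records.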
    • Outline
      • Data-driven workflows and provenance traces
      • A method for guiding duplicate detection in workflow results based on provenance traces
      • Preliminary validation using real-world workflows
    • Validation
      • The method that we presented in this paper can be applied when the operations are deterministic.
      • To gain insight into the degree to which the operations that compose workflows are deterministic, we ran an experiment.
      • Datasets: 15 bioinformatics workflows that cover a wide range of analyses, namely biological pathway analysis, sequence alignment, and molecular interaction analysis.
      • Process: to identify which of the operations in these workflows are deterministic, we ran each of them 3 times using example values found within either myExperiment or BioCatalogue.
    • Validation
      • After manual analysis of the results, it transpired that 5 of the 151 operations that compose the workflows (about 3.3%) are not deterministic.
      • Note that many of the operations that we analyzed access and use underlying data sources in their computation. Updates to such sources may therefore break the determinism assumption (Chirigati and Freire, 2012).
      • This suggests that determinism holds within a window of time during which the underlying sources remain the same, and that there is a need for monitoring techniques to identify such windows.
      Reference: Fernando Chirigati and Juliana Freire. Towards Integrating Workflow and Database Provenance: A Practical Approach. IPAW, 2012.
    • Conclusions and Future Work
      • We described a method that can be used to guide duplicate detection in workflow results.
      • Future work:
        • Monitoring the determinism of analysis operations.
        • Extending the method to support duplicate detection across the results of different workflows.