Irpb workshop

Processing why-not queries over scientific workflow provenance

Published in: Education

  1. On Answering Why-Not Queries Against Scientific Workflow Provenance. Khalid Belhajjame, PSL Research University, Paris-Dauphine University, LAMSADE, Paris, 75016, France (khalid.belhajjame@dauphine.fr). Khalid Belhajjame (Paris-Dauphine), IRPb Workshop, July 13, 2018.
  2. Context: Scientific Workflows. Scientific workflows have been shown to facilitate and accelerate scientific data exploration and analysis in many areas of science, including proteomics, metabolomics, astronomy, and biomedicine. The figure on the right illustrates a simple workflow used for identifying the pathways associated with a given input metabolite (compound). Given a compound identifier, the first module returns a compound name, which is fed to the second module to obtain the corresponding pathways. [Figure: workflow input port compound_id; modules get_compound_info and extract_pathway_from_compounds_file; workflow output port output_pathways.]
  3. Aim: Evaluating Why-Not Queries Against Workflow Executions. Why-not queries help scientists understand why a given data item, e.g., their favorite biological pathway, was not returned by the workflow executions. While answering such queries has been thoroughly investigated for relational databases, only a few proposals have examined their evaluation in the context of scientific workflows. Objective: to elaborate a solution for evaluating why-not queries against workflows with black-box modules.
  4. Related Work: Database (Querying) Land. Instance-based approaches attempt to find the data items in the inputs that are responsible for the non-appearance of a given data item in the result. Consider the example below (taken from Huang et al., VLDB 2008). The query returns the schools in the state of California that are within the top 4 and have job openings. The answer returned by the query is Stanford and its rank in the result. Why-not query: Why does Berkeley not appear in the results? What change shall I make to the source to obtain (Homer, 25) in the results? If a potential tuple (berkeley, ca, yes) is inserted into the openings table, Berkeley will become an answer.
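The instance-based idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the table layouts, and every row except the inserted tuple (berkeley, ca, yes) are hypothetical and are not taken from Huang et al.

```python
def top_schools_with_openings(rankings, openings, state="ca", top_k=4):
    """Schools in `state` ranked within `top_k` that also have job openings.

    rankings: list of (school, state, rank)      -- hypothetical layout
    openings: list of (school, state, has_opening)
    """
    open_schools = {
        (school, st) for school, st, has_opening in openings if has_opening == "yes"
    }
    return sorted(
        (rank, school)
        for school, st, rank in rankings
        if st == state and rank <= top_k and (school, st) in open_schools
    )


# Hypothetical table contents, invented for illustration only.
rankings = [("stanford", "ca", 1), ("berkeley", "ca", 2), ("school_x", "ma", 3)]
openings = [("stanford", "ca", "yes"), ("school_x", "ma", "yes")]

# Berkeley is missing because the join finds no matching openings tuple.
before = top_schools_with_openings(rankings, openings)

# Instance-based answer: inserting the potential tuple makes Berkeley appear.
after = top_schools_with_openings(rankings, openings + [("berkeley", "ca", "yes")])
```

Comparing `before` and `after` shows that the missing openings tuple, rather than the query itself, accounts for the non-answer in this toy setting.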
  5. Related Work: Database (Querying) Land. Module-based approaches attempt to identify the modules (sub-queries) that are responsible for the non-appearance of a given data item in the workflow results. In the previous example there is only one join, and it is this join that is responsible for the non-appearance of Berkeley in the result set of the query.
  6. Related Work: Workflow Land. The only proposal in this category for workflow provenance is the Why-Not algorithm proposed by Chapman and Jagadish (2009). In this algorithm, the user query is expressed as a set of atomic predicates that are combined using AND and OR. Chapman and Jagadish assume that the attributes of the input datasets are preserved by the modules that compose the workflow. This assumption does not hold in general, however.
  7. Related Work: Workflow Land. For example, the modules in the workflow illustrated on the right do not preserve the attribute of the input, viz. Compound-ID, in that the outputs of the first and the second module do not contain information about the compound identifier. In the work presented in this talk, we drop the assumption made by Chapman and Jagadish, and propose a solution that can be used for answering why-not queries for workflows with modules that do not preserve the attributes of the input datasets. Furthermore, unlike the Why-Not algorithm, which is module-based, our proposal is hybrid in that it seeks to answer both instance-based and module-based why-not queries. [Figure: workflow input port compound_id; modules get_compound_info and extract_pathway_from_compounds_file; workflow output port output_pathways.]
  8. Foundations. Why-not query: a user specifies a why-not query by providing a data item d_why-not that has the same data type as the output of the last module of the workflow and was not returned by the workflow executions. Module pickiness: central to the evaluation of why-not queries is the pickiness of the workflow's modules. A module M in a workflow is picky with respect to a data item d if its inverse M_inv does not accept d as input. More specifically, M_inv throws an illegal input exception when it is fed d.
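The definition of pickiness above can be read as a small predicate. The sketch below assumes that an inverse module is available as a Python callable and signals rejection with an exception; the names `IllegalInputError`, `is_picky`, `get_compound_info_inv`, and the lookup table are hypothetical and only serve the illustration.

```python
class IllegalInputError(Exception):
    """Raised by an inverse module when it is fed a data item it cannot accept."""


def is_picky(inverse_module, data_item):
    """Return True if the module is picky w.r.t. data_item, i.e. its inverse rejects it."""
    try:
        inverse_module(data_item)
    except IllegalInputError:
        return True
    return False


# Hypothetical inverse of a compound-name lookup: compound name -> identifier.
NAME_TO_ID = {"glucose": "cpd-1"}  # made-up identifier


def get_compound_info_inv(name):
    if name not in NAME_TO_ID:
        raise IllegalInputError(name)
    return NAME_TO_ID[name]
```

With this toy inverse, `is_picky(get_compound_info_inv, "glucose")` is false, while any name absent from the table makes the module picky.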
  9. Processing Why-Not Queries. The algorithm for processing why-not queries takes as input a data item d_why-not specified by the user. To answer a why-not query, the modules of the workflow are explored from the sink to the source in a breadth-first fashion. To do so, we group the workflow modules into levels, as illustrated in the figure below.
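One way to obtain such levels is a breadth-first traversal that starts at the sink modules and follows data links backwards. The sketch below is an assumption-laden illustration, not the talk's implementation: the workflow is represented by a hypothetical `predecessors` dictionary, and a module reachable by several paths is assigned to the level of its shortest path from a sink.

```python
from collections import deque


def group_into_levels(predecessors, sinks):
    """Group modules into levels by breadth-first traversal from the sinks.

    predecessors: dict mapping a module name to the modules feeding it.
    sinks: non-empty list of modules that produce the workflow outputs (level 0).
    """
    level_of = {sink: 0 for sink in sinks}
    queue = deque(sinks)
    while queue:
        module = queue.popleft()
        for upstream in predecessors.get(module, []):
            if upstream not in level_of:  # first visit gives the shortest distance
                level_of[upstream] = level_of[module] + 1
                queue.append(upstream)
    levels = [[] for _ in range(max(level_of.values()) + 1)]
    for module, level in level_of.items():
        levels[level].append(module)
    return levels


# The two-module workflow from the slides, sink first.
pathway_workflow = {
    "extract_pathway_from_compounds_file": ["get_compound_info"],
    "get_compound_info": [],
}
pathway_levels = group_into_levels(
    pathway_workflow, ["extract_pathway_from_compounds_file"]
)
```

For the example workflow this yields one module per level, with `extract_pathway_from_compounds_file` at level 0 and `get_compound_info` at level 1.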
  10. Processing Why-Not Queries. The modules of each level are examined to identify whether each module is picky. Specifically, the inverse of the module in question, M, is examined to check which of two cases applies: (1) It does not accept the corresponding data items that were generated by the inverses of the modules in the previous level, in which case M is picky. (2) It accepts the corresponding data items that were generated by the inverses of the modules in the previous level. In this case, the data items that the inverse of M produces are saved and used to feed the inverses of the modules in the succeeding levels, if any.
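The two cases can be sketched for the simplest setting, a linear workflow with one module per level. This is a simplification made for illustration: the function `process_why_not`, the helper `inverse_from_table`, and the lookup tables are hypothetical, each inverse is assumed to return a single data item, and a general workflow would additionally need the port wiring between levels.

```python
class IllegalInputError(Exception):
    """Raised by an inverse module when it is fed a data item it cannot accept."""


def inverse_from_table(table):
    """Build a toy inverse module from a lookup table (output -> input)."""
    def inverse(item):
        if item not in table:
            raise IllegalInputError(item)
        return table[item]
    return inverse


def process_why_not(inverse_chain, d_why_not):
    """Walk the inverses from sink to source.

    inverse_chain: list of (module_name, inverse_callable), sink first.
    Returns (picky_module_name or None, data items traced so far).
    """
    traced = [d_why_not]
    current = d_why_not
    for name, inverse in inverse_chain:
        try:
            current = inverse(current)   # case 2: accepted, feed the next level
        except IllegalInputError:
            return name, traced          # case 1: this module is picky
        traced.append(current)
    return None, traced                  # reached the workflow input


# Made-up tables standing in for the inverses of the example workflow.
chain = [
    ("extract_pathway_from_compounds_file",
     inverse_from_table({"glycolysis": "glucose", "tca_cycle": "citrate"})),
    ("get_compound_info", inverse_from_table({"glucose": "cpd-1"})),
]
```

When no module is picky, the last traced item is the workflow input that would have been needed, which corresponds to the instance-based answer; otherwise the returned module name is the module-based answer.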
  15. Identifying Picky Modules. To identify whether a module M is picky, we need to invoke its inverse M_inv and check whether it accepts the data items in question. However, the inverse module rarely exists. To overcome the non-existence of the inverse module, we can probe the module with inputs until we obtain the output we are after, or else fail and deduce that the module in question is picky. This is not a reasonable solution because the space of valid input values of a module can be very large or even infinite. The problem is exacerbated by the fact that a module may have multiple inputs, therefore requiring the construction of all possible combinations for probing. Is there a more reasonable solution that at least allows us to probe the modules using fewer inputs?
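The cost of naive probing is easy to see in a small sketch. It assumes finite candidate domains for each input, which is already optimistic given that real input spaces may be infinite; `probe_for_output`, `probe_count`, and the toy module `add` are hypothetical names used only for illustration.

```python
from itertools import product
from math import prod


def probe_for_output(module, input_domains, target_output):
    """Try every combination of candidate inputs.

    Returns the first combination whose output equals target_output, or None
    if the whole (finite) space is exhausted.
    """
    for combination in product(*input_domains):
        if module(*combination) == target_output:
            return combination
    return None


def probe_count(input_domains):
    """Worst-case number of invocations: the product of the domain sizes."""
    return prod(len(domain) for domain in input_domains)


def add(a, b):
    """Toy two-input module standing in for a black-box service."""
    return a + b


domains = [range(10), range(10)]
```

Two inputs with ten candidates each already require up to 100 invocations, and the count multiplies with every additional input, which is why exhaustive probing does not scale.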
  16. Identifying Picky Modules by Harvesting the Web. A solution that we explored consists in harvesting what is probably the biggest source of information, namely the Web, using the information extraction process illustrated below. Indeed, a significant number of the scientific modules provided by major institutions, such as the EBI and DDBJ, can also be invoked by users on the Web, and the traces of those module invocations remain, in a number of cases, accessible on the Web.
  17. Identifying Picky Modules by Harvesting the Web. (Figure-only slide; the illustration is not included in this transcript.)
  18. Identifying Picky Modules by Harvesting the Web. If none of the candidate inputs is found to be a true positive, then we conclude that the module is likely to be picky.
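The decision rule on this slide can be sketched as follows. The transcript does not say how a true positive is established, so the sketch assumes that a harvested candidate counts as a true positive when invoking the module on it reproduces the data item in question; the candidate list is assumed to come from an extraction step that is not shown, and all names (`find_true_positive`, `likely_picky`, the toy `get_compound_info`) are hypothetical.

```python
def find_true_positive(module, target_output, candidate_inputs):
    """Return the first candidate input that the module maps to target_output.

    Assumption: a candidate is a true positive if re-invoking the module on it
    yields the target. Candidates the module rejects are skipped.
    """
    for candidate in candidate_inputs:
        try:
            result = module(candidate)
        except Exception:
            continue  # the module rejected this candidate; try the next one
        if result == target_output:
            return candidate
    return None


def likely_picky(module, target_output, candidate_inputs):
    """The module is deemed likely picky when no candidate is a true positive."""
    return find_true_positive(module, target_output, candidate_inputs) is None


# Toy stand-in for a compound lookup service (identifier -> name), made-up data.
ID_TO_NAME = {"cpd-1": "glucose", "cpd-2": "citrate"}


def get_compound_info(compound_id):
    return ID_TO_NAME[compound_id]  # raises KeyError for unknown identifiers
```

Because the verdict depends on which candidates the harvesting step happened to find, a negative result only suggests pickiness, which matches the hedged wording "likely to be picky".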
  19. Feasibility Study. The approach we have just described raises the following questions. Is the proposed algorithm able to identify the reason why a given data item does not appear in the workflow results? More specifically, how effective is this solution in identifying picky modules and missing input data items? To answer these questions, we ran a feasibility experiment in which we used a sample of 6 real-world workflows from the myExperiment repository. We selected workflows that involve deterministic modules, that is, modules that deliver the same result (if any) given the same input. We did not consider workflows that include modules performing data mining operations, for instance. We also selected workflows for which the inverse modules are deterministic functions.
  20. Feasibility Study. We executed each workflow using example data inputs provided by the workflow authors. We then specified two kinds of queries for each workflow. Instance-based why-not query: to assess the ability of the algorithm to answer this type of query, we randomly selected an output data item d that was returned by the workflow executions. Next, we used our algorithm to see whether it is able to reconstruct the lineage of d by harvesting the Web to identify the input data items that were responsible for its derivation. Module-based why-not query: this kind of query is used to assess whether the algorithm is able to identify picky modules. In total we had 6 queries of the first kind, which we denote by {q+_1, ..., q+_6}, and 6 queries of the second kind, which we denote by {q-_1, ..., q-_6}.
  21. Feasibility Study: Results. Of the queries {q+_1, ..., q+_6}, our algorithm was able to successfully construct the provenance of the why-not query up to the workflow input for 3 queries. Most of the modules composing these workflows, namely 8 out of 11, provide information about their input and output datasets on the Web in tabular formats. After examining the three remaining workflows, we found that one of them utilizes proprietary data sources, the content of which is not accessible on the surface web. The last two workflows, on the other hand, contain modules that manipulate excerpts from HTML web pages. Because of this, our algorithm was not able to find the inputs and outputs of those modules on the Web.
  22. Feasibility Study: Results. We also measured the number of top-k web pages that needed to be examined to identify the input data item corresponding to a given output data item. On average, we needed to examine the content of the top 4 web pages returned by the keyword search engine (we used the Google search engine for our experiment). In several cases, however, the top web page was the right one, in the sense that it contained the input data item we were after.
  23. Feasibility Study: Results. Regarding the queries {q-_1, ..., q-_6}, our algorithm was more successful in the sense that it was able to correctly identify 4 picky modules out of 6. For the two remaining workflows, the module that was identified as picky by our algorithm was not the correct one. After examination, it transpired that for certain modules the corresponding data items could not be found on the Web. Again, this issue was due to shim modules whose input and output data items are not published on the Web.
  24. Conclusions. To sum up, this small feasibility study has shown that our method is promising. It has also brought some insights into the way our solution can be improved. Our ongoing work includes: (i) tuning our algorithm to deal with shim modules in a workflow, (ii) exploring new sources of information for identifying picky modules, and (iii) running an experiment involving a large number of scientific workflows.
  25. References. K. Belhajjame (2018). On Answering Why-Not Queries Against Scientific Workflow Provenance. Proceedings of EDBT, Open Proceedings, 465–468. N. Bidoit, M. Herschel, K. Tzompanaki (2014). Why not? Proceedings of EDBT, Open Proceedings, 145–156. A. Chapman and H.V. Jagadish (2009). Why not? Proceedings of SIGMOD, ACM, 523–534. J. Huang, T. Chen, A. Doan, and J. F. Naughton (2008). On the provenance of non-answers to queries over extracted data. Proceedings of VLDB, ACM, 736–747.
  26. The End
