Edbt2014 talk

I gave this talk at the EDBT 2014 conference, which took place in Athens, Greece.
I show how data examples can be used to characterize the behavior of scientific modules. I present a new method that automatically generates the data examples, and show that such data examples help the human user understand the task of the modules, and that they can be used to assist curators in repairing broken workflows (i.e., workflows for which one or more modules are no longer supplied by their providers).

  1. 1. Annotating the Behavior of Scientific Modules Using Data Examples: A Practical Approach Khalid Belhajjame Université Paris-Dauphine, LAMSADE Khalid.Belhajjame@dauphine.fr
  2. 2. Scientific Workflows We have recorded a dramatic increase in the number of scientists who use scientific modules as building blocks in the composition of their experiments. In 2011, the EBI recorded 21 million invocations of the scientific modules it hosts. Typically, an experiment is designed as a workflow, the steps of which represent invocations of scientific modules.
  3. 3. Scientific Module Annotation Semantic annotations can be used to describe scientific modules. Existing semantic annotations are confined to the description of module parameters. Annotations describing the behavior of the modules, in terms of the task they perform, are rarely available. Designing an ontology that captures precisely the behavior of modules is challenging. Proposal: To describe the behavior of scientific modules using data examples.
  4. 4. Data Example Describes >
  5. 5. Outline Annotation: (1) Annotate Module Parameters, (2) Generate Data Examples. Use: (3) Explore and Understand Modules, (4) Compare Modules. [Diagram labels: Scientific Module Registry; Curator; Experiment Designer; APIHUT, Radiant, Meteor-s, Galaxy, Taverna, Vistrails]
  6. 6. Generating Data Examples Data examples can be used as a means to describe the behavior of scientific modules. Enumerating all possible data examples that can be used to describe a given module may be expensive, and the resulting set may contain redundant data examples that describe the same behavior. Issue: which data examples should be used to characterize the functionality of a given module? Solution: We show how software testing techniques can be adapted to the problem of generating data examples without relying on the availability of the module specification, which is often not accessible.
  7. 7. Identifying the Classes of Behavior of a Scientific Module To generate data examples, we start by identifying the classes of behavior of the module. Consider a module m with an input parameter i. The domain of legal values of i is divided into partitions p1, …, pn. The partitioning is performed in a way to cover all classes of behavior of the module. To do so, we would need access to the module specification, which is rarely available. In this work, we use a different source of information, namely the domain ontology used for annotating module parameters.
  8. 8. Identifying the Classes of Behavior of a Scientific Module An ontology can be viewed as a hierarchy of concepts. We use this hierarchy to specify the classes of behavior of scientific modules. Consider the module getAccession, which, given an input annotated as biological sequence, returns the accession used for its identification. The domain of its input can be partitioned into the following: BiologicalSequence, NucleotideSequence, RNASequence, DNASequence, and ProteinSequence.
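The partitioning by ontology hierarchy can be sketched as follows. This is a minimal illustration, not the author's implementation: the `PARENT` map is a hypothetical toy ontology, and the assumption that RNASequence and DNASequence sit under NucleotideSequence is ours, since the slide only lists the concept names.

```python
# Hypothetical toy ontology, written as child concept -> parent concept.
# The exact hierarchy is an assumption made for illustration.
PARENT = {
    "NucleotideSequence": "BiologicalSequence",
    "ProteinSequence": "BiologicalSequence",
    "RNASequence": "NucleotideSequence",
    "DNASequence": "NucleotideSequence",
}


def partitions(concept, parent=PARENT):
    """Return the concept followed by all of its direct and indirect subconcepts."""
    result = [concept]
    for child, par in parent.items():
        if par == concept:
            # Recurse so that indirect subconcepts are included as well.
            result.extend(partitions(child, parent))
    return result


if __name__ == "__main__":
    print(partitions("BiologicalSequence"))
```

Each concept returned stands for one candidate class of behavior of the input annotated with `concept`.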
  9. 9. Generating Data Examples Covering Input Parameter Partitions Given the partitions of input parameters identified using the domain ontology, and given a pool of annotated instances, the input values necessary for constructing data examples can be automatically identified. Data examples covering the partitions in question can then be constructed by invoking the module using the input values identified. Such data examples can be specified by soliciting from the human annotator example input values that belong to the respective partitions, and then invoking the module m to obtain the corresponding output values necessary for constructing the data examples. The construction of such data examples can, however, be fully automated if a pool of annotated instances is available. Specifically, given pl, a pool of annotated instances, the values of i necessary for constructing data examples that cover the partitions of the input i of the module m can be obtained as follows: $\{\langle c, \mathit{getInstance}(c, pl)\rangle \ \text{s.t.}\ c \sqsubseteq \mathit{sem}(i)\}$, where getInstance(c, pl) is a function that returns an instance of the concept c from the annotated pool of instances pl. Note that this function returns a realization of the concept in question [25], in the sense that the instance of c chosen is not an instance of any strict subconcept of c, i.e. not an instance of any concept $c' \sqsubset c$.
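A minimal sketch of the selection of input values from a pool of annotated instances. The ontology, the pool contents and the helper names (`subsumed_by`, `get_instance`, `input_values`) are hypothetical. The sketch assumes that every instance in the pool is annotated with its most specific concept, so that an exact match on the annotation yields a realization of that concept.

```python
# Hypothetical toy ontology, written as child concept -> parent concept.
PARENT = {
    "NucleotideSequence": "BiologicalSequence",
    "ProteinSequence": "BiologicalSequence",
    "RNASequence": "NucleotideSequence",
    "DNASequence": "NucleotideSequence",
}

# Hypothetical pool: (value, most specific concept annotating the value).
POOL = [
    ("ACGT", "DNASequence"),
    ("ACGU", "RNASequence"),
    ("MKVL", "ProteinSequence"),
]


def subsumed_by(c, d):
    """True if concept c is d itself or a direct or indirect subconcept of d."""
    while c is not None:
        if c == d:
            return True
        c = PARENT.get(c)
    return False


def get_instance(c, pool):
    """Return a value annotated exactly with c, or None if the pool has none."""
    for value, concept in pool:
        if concept == c:
            return value
    return None


def input_values(sem_i, pool):
    """Map each concept subsumed by sem_i to one instance, where one exists."""
    concepts = set(PARENT) | set(PARENT.values())
    pairs = {}
    for c in sorted(concepts):
        if subsumed_by(c, sem_i):
            value = get_instance(c, pool)
            if value is not None:
                pairs[c] = value
    return pairs


if __name__ == "__main__":
    print(input_values("BiologicalSequence", POOL))
```

Concepts for which the pool holds no realization are skipped here; a real system would have to decide how to report such uncovered partitions.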
  10. 10. Generating Data Examples Covering Output Parameter Partitions The method for constructing data examples based on the partitioning of the domains of output parameters can be difficult to implement. Given a partition po of the output parameter o of a module m, we need to find input values for m such that the value produced for the output o belongs to the partition po. A source that we use for identifying (some of) the data examples that cover the output partitions is the set of data examples generated to cover the partitions of the input parameters.
  11. 11. Evaluation The method that we have just described is not an exact method. Rather, it is a heuristic that provides a working solution. Because of this, the domain of a module may be over-partitioned or, inversely, under-partitioned. We therefore assessed the effectiveness of the proposed method by generating data examples for 252 scientific modules. Notice that the availability of a pool of annotated instances is crucial to our method. We constructed such a pool by harvesting existing provenance traces of scientific workflows.
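One way to reuse the input-driven data examples for the output partitions is sketched below. This is an illustrative assumption rather than the method from the talk: `concept_of` stands for some hypothetical way of determining the concept of an output value, and the accession concepts and values are invented for the example.

```python
def covered_output_partitions(examples, output_partitions, concept_of):
    """Split output partitions into covered and uncovered ones.

    examples: list of (input_value, output_value) data examples.
    output_partitions: list of concept names partitioning the output domain.
    concept_of: function mapping an output value to its concept name.
    """
    covered = {}
    for inp, out in examples:
        concept = concept_of(out)
        # Keep the first data example found for each output partition.
        if concept in output_partitions and concept not in covered:
            covered[concept] = (inp, out)
    uncovered = [p for p in output_partitions if p not in covered]
    return covered, uncovered


def toy_concept_of(accession):
    """Hypothetical classifier: accessions starting with 'P' denote proteins."""
    if accession.startswith("P"):
        return "ProteinAccession"
    return "NucleotideAccession"


# Hypothetical data examples obtained while covering the input partitions.
EXAMPLES = [("MKVL", "P12345"), ("ACGT", "X00001")]
OUTPUT_PARTITIONS = ["ProteinAccession", "NucleotideAccession", "GeneAccession"]

if __name__ == "__main__":
    print(covered_output_partitions(EXAMPLES, OUTPUT_PARTITIONS, toy_concept_of))
```

Partitions left in `uncovered` are those for which other input values would still have to be found.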
  12. 12. Evaluation: Metrics Coverage Completeness Conciseness
  13. 13. Coverage We were able to construct data examples that cover all the partitions of the input parameters. Moreover, the data examples generated were found to cover most of the partitions of the output parameters. Indeed, with the exception of the partitions of the outputs of 19 modules (e.g., get_genes_by_enzyme, link and binfo), all the partitions of the outputs of the remaining 233 modules were covered by the data examples generated.
  14. 14. Completeness Conciseness
  15. 15. Outline (repeated from slide 5)
  16. 16. Understanding the Behavior of a Module Using Data Examples Question: Do data examples allow human users to understand the behavior of scientific modules? Evaluation exercise: given a module m, we adopted the following two-step process: 1. In the first step, the user was asked to describe the behavior of the module based on its name, the names of its input and output parameters, and the structural and semantic types of those parameters. 2. In the second step, the user was additionally given the data examples that characterize the module and was asked to update the description of the module's behavior if they deemed it necessary.
  17. 17. Understanding the Behavior of a Module Using Data Examples
  18. 18. Understanding the Behavior of a Module Using Data Examples An analysis of the results and the modules showed that the ability of human users to identify the behavior of a module is correlated with the nature of the transformation carried out by the module. The human users correctly identified the behavior of modules implementing data retrieval, format transformation and identifier mapping. On the other hand, they were less successful with modules implementing data filtering and complex data analysis, such as text mining. For such modules, data examples may not have the same value as for other module kinds, as far as the human user is concerned. Note, however, that a large proportion of scientific modules implement format transformation, data retrieval and identifier mapping, which are referred to in the scientific workflow literature using the term Shims [35]. For example, Table 3 classifies the modules that we analyzed in the experiment. It shows that format transformation, data retrieval and identifier mapping modules represent between them 66% of the total number of modules that we analyzed.

      Table 3: Kinds of data manipulation carried out by the scientific modules.

      Kind of data manipulation | # of modules
      Format transformation     | 53
      Data retrieval            | 51
      Mapping identifiers       | 62
      Filtering                 | 27
      Data analysis             | 59
  19. 19. Outline (repeated from slide 5)
  20. 20. Comparing Scientific Modules Using Data Examples As well as understanding scientific modules, users may be interested in comparing the behavior of two or more modules. Module comparison, as a functionality, is particularly requested by workflow curators who need to repair broken workflows.
  21. 21. Comparing Scientific Modules Using Data Examples Consider two modules m and m’, and consider that the inputs and outputs of those modules are semantically and structurally compatible. To be able to compare the behavior of m and m’, we generate data examples that characterize their behavior using the method presented earlier. However, to make the comparison of their behavior straightforward, we generate the data examples of m and m’ in a way that their data examples have the same input values.
  22. 22. Comparing Scientific Modules Using Data Examples By comparing the output values of the data examples of m and m’ that have the same input values, we determine if the two modules have behaviors that are: Equivalent: the data examples of the two modules have the same output values Overlapping: Some (but not all) of the data examples of the two modules have the same output values. Disjoint: None of the data examples of the two modules have the same output values.
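The three-way classification can be sketched as follows. This is a minimal illustration under the assumption that each module can be invoked as a Python callable on the shared input values; `compare_modules` and the toy modules in the usage lines are hypothetical names.

```python
def compare_modules(m1, m2, inputs):
    """Classify two modules by comparing their outputs on shared input values.

    Returns "equivalent" if all outputs match, "disjoint" if none match,
    and "overlapping" if some (but not all) match.
    """
    if not inputs:
        raise ValueError("at least one shared input value is required")
    matches = sum(1 for value in inputs if m1(value) == m2(value))
    if matches == len(inputs):
        return "equivalent"
    if matches == 0:
        return "disjoint"
    return "overlapping"


if __name__ == "__main__":
    # Toy modules standing in for real scientific modules.
    print(compare_modules(str.upper, lambda s: s.upper(), ["a", "b"]))
    print(compare_modules(str.upper, str.lower, ["a", "1"]))
    print(compare_modules(str.upper, str.lower, ["a", "B"]))
```

Note that the verdict only holds with respect to the data examples used: two modules judged equivalent here may still differ on input values that no data example covers.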
  23. 23. Evaluation To assess the effectiveness of the above method for comparing modules’ behavior, we used it to assist in the curation of broken workflows. We were able to identify 72 modules that are in the composition of scientific workflows (in the myExperiment repository), that are no longer provided by their suppliers, and for which we were able to construct data examples. We compared those modules with the 252 modules that we characterized using data examples.
  24. 24. 16 23 33
  25. 25. Outline (repeated from slide 5)
  26. 26. Conclusions We showed that it is possible to characterize scientific modules using data examples without relying on module specifications. We also presented two functionalities that use the generated data examples: understanding of module behavior by human users, and module comparison. Research questions for future work: How can we make data examples more concise (less redundant)? How can we compose modules based only on data examples?