Annotating the Behavior of
Scientific Modules Using Data
Examples: A Practical Approach
Khalid Belhajjame
Université Paris-Dauphine, LAMSADE
Khalid.Belhajjame@dauphine.fr
Scientific Workflows
We have recorded a dramatic
increase in the number of scientists
who use scientific modules as
building blocks in the composition of
their experiments
In 2011, the EBI recorded 21
million invocations of the
scientific modules it hosts
Typically, an experiment is designed
as a workflow, the steps of which
represent invocations of scientific
modules
Scientific Module Annotation
Semantic annotations can be used to describe scientific modules.
Existing semantic annotations are confined to the description of
module parameters.
Annotations describing the
behavior of the modules, i.e. the
task they perform, are rarely available
Designing an ontology that captures precisely the behavior of modules is
challenging.
Proposal: To describe the behavior of scientific modules using data examples
[Figure: a data example describes the behavior of a module]
Outline
[Figure: outline of the approach, with steps numbered 1 to 4.
Annotation: Annotate Module Parameters, Generate Data Examples.
Use: Explore and Understand Modules, Compare Modules.
Shared resource: Scientific Module Registry.
Actors: Curator, Experiment Designer.
Systems shown: APIHUT, Radiant, Meteor-s, Galaxy, Taverna, Vistrails.]
Generating Data Examples
Data examples can be used as a means to
describe the behavior of scientific modules.
Enumerating all possible data examples that
can be used to describe a given module may be
expensive, and the resulting set may contain redundant
data examples that describe the same behavior.
Issue: which data examples should be used to characterize the functionality
of a given module?
Solution: We show how software testing techniques can be adapted
to the problem of generating data examples without relying on the
availability of the module specification, which often is not accessible.
Identifying the Classes of
Behavior of a Scientific Module
To generate data examples, we start by identifying the classes of
behavior of the module.
Consider a module m with an input parameter i. The
domain of legal values of i is divided into partitions p1, …,
pn. The partitioning is performed so as to cover all
classes of behavior of the module.
To do so, we need access to the module specification, which is
rarely available.
In this work, we use a different source of information, namely
the domain ontology used for annotating module parameters.
Identifying the Classes of
Behavior of a Scientific Module
An ontology can be viewed as a hierarchy of concepts.
We use this hierarchy to specify the classes of behavior
of scientific modules
Consider the module getAccession,
which given an input annotated as
biological sequence returns the
accession used for its identification.
The domain of its input can be partitioned into the following:
BiologicalSequence, NucleotideSequence, RNASequence,
DNASequence, and ProteinSequence.
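The ontology-based partitioning above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the concept names come from the getAccession example, while the shape of the hierarchy (which concept sits under which) and the names ONTOLOGY and partitions are assumptions made for the sketch.

```python
from typing import Dict, List

# Hypothetical toy ontology: each concept maps to its direct subconcepts.
# The parent/child links below are an assumption made for illustration.
ONTOLOGY: Dict[str, List[str]] = {
    "BiologicalSequence": ["NucleotideSequence", "ProteinSequence"],
    "NucleotideSequence": ["RNASequence", "DNASequence"],
    "RNASequence": [],
    "DNASequence": [],
    "ProteinSequence": [],
}


def partitions(concept: str, ontology: Dict[str, List[str]]) -> List[str]:
    """Return the concept followed by all of its descendants, depth first."""
    result = [concept]
    for child in ontology.get(concept, []):
        result.extend(partitions(child, ontology))
    return result


if __name__ == "__main__":
    # Prints: ['BiologicalSequence', 'NucleotideSequence', 'RNASequence',
    #          'DNASequence', 'ProteinSequence']
    print(partitions("BiologicalSequence", ONTOLOGY))
```

Each returned concept stands for one candidate class of behavior: the concept annotating the input plus every concept below it in the hierarchy.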
Generating Data Examples Covering
Input Parameter Partitions
Given the partitions of input parameters identified
using the domain ontology, and given a pool of
annotated instances, the input values necessary for
constructing data examples can be automatically
identified:
Data examples covering the partitions in question can
then be constructed by invoking the module using the
input values identified.
[Excerpt from the paper:]
[...] that cover those partitions. Such data examples can be specified by
soliciting from the human annotator example input values that belong
to the respective partitions, and then invoking the module m
to obtain the corresponding output values, necessary for constructing
the data examples. The construction of such data examples
can, however, be fully automated if a pool of annotated instances is
available. Specifically, given pl, a pool of annotated instances, the
values of i necessary for constructing data examples that cover the
partitions of the input i of the module m can be obtained as follows:
$\{ \langle c, \mathit{getInstance}(c, pl) \rangle \ \text{s.t.}\ c \sqsubseteq \mathit{sem}(i) \}$
where getInstance(c, pl) is a function that returns an instance
of the concept c from the annotated pool of instances pl. Note that
this function returns a realization of the concept in question [25],
in the sense that the instance of c chosen is not an instance of any
strict subconcept of c, i.e. not an instance of any concept $c' \sqsubset c$.
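The selection of input values from a pool of annotated instances can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the hierarchy in PARENT, the contents of POOL, and the helper names subsumed_by, get_instance and input_values are all hypothetical. The sketch assumes every pool value is annotated with its most specific concept, so matching on that annotation yields a realization of the concept.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical ontology given as child -> parent links (assumed hierarchy).
PARENT: Dict[str, Optional[str]] = {
    "BiologicalSequence": None,
    "NucleotideSequence": "BiologicalSequence",
    "ProteinSequence": "BiologicalSequence",
    "RNASequence": "NucleotideSequence",
    "DNASequence": "NucleotideSequence",
}

# Hypothetical pool: (value, most specific concept annotating the value).
POOL: List[Tuple[str, str]] = [
    ("ACGTTGCA", "DNASequence"),
    ("ACGUUGCA", "RNASequence"),
    ("MKTAYIAK", "ProteinSequence"),
]


def subsumed_by(concept: str, ancestor: str) -> bool:
    """True if `concept` equals `ancestor` or lies below it in the hierarchy."""
    current: Optional[str] = concept
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT[current]
    return False


def get_instance(concept: str, pool: List[Tuple[str, str]]) -> Optional[str]:
    """Return a value whose most specific annotation is exactly `concept`.

    Such a value is not an instance of any strict subconcept of `concept`.
    Returns None when the pool holds no such value.
    """
    for value, annotated in pool:
        if annotated == concept:
            return value
    return None


def input_values(sem_i: str, pool: List[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Pair each concept c subsumed by sem_i with get_instance(c, pool)."""
    return {c: get_instance(c, pool) for c in PARENT if subsumed_by(c, sem_i)}


if __name__ == "__main__":
    for concept, value in input_values("BiologicalSequence", POOL).items():
        print(concept, "->", value)
```

In this toy pool no value is annotated directly as BiologicalSequence or NucleotideSequence, so those partitions get None. That illustrates why the coverage achievable in practice depends on the contents of the pool.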
Generating Data Examples Covering
Output Parameter Partitions
The method for constructing data examples based on
the partitioning of the domains of output parameters
can be difficult to implement.
Given a partition po of the output parameter o of a
module m, we need to find input values that, when fed
to m, make the output o take a value that
belongs to the partition po.
One source that we use to identify (some of) the data
examples that cover the output partitions is the set of
data examples generated to cover the partitions of the
input parameters.
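Reusing the input-derived data examples to cover output partitions can be sketched as below. This is only an illustration: the partition names, the classify function that maps an output value to a partition, and the sample data example are hypothetical and do not come from the talk.

```python
from typing import Callable, Dict, List, Tuple

DataExample = Tuple[str, str]  # (input value, output value)


def covered_output_partitions(
    examples: List[DataExample],
    output_partitions: List[str],
    classify: Callable[[str], str],
) -> Dict[str, List[DataExample]]:
    """Group existing data examples by the output partition they fall into."""
    covered: Dict[str, List[DataExample]] = {p: [] for p in output_partitions}
    for example in examples:
        partition = classify(example[1])
        if partition in covered:
            covered[partition].append(example)
    return covered


def uncovered(covered: Dict[str, List[DataExample]]) -> List[str]:
    """Output partitions for which no existing data example was found."""
    return [p for p, found in covered.items() if not found]


def toy_classify(output_value: str) -> str:
    """Hypothetical rule standing in for a real semantic classifier."""
    if output_value.startswith("P"):
        return "ProteinAccession"
    return "NucleotideAccession"


if __name__ == "__main__":
    examples = [("MKTAYIAK", "P12345")]  # made-up data example
    result = covered_output_partitions(
        examples, ["ProteinAccession", "NucleotideAccession"], toy_classify
    )
    print(uncovered(result))
```

Partitions reported by uncovered are those for which other input values would still have to be found, which is the part of the problem that is hard to automate.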
Evaluation
The method that we have just described is not an exact
method. Rather, it is a heuristic that provides a working
solution. Because of this:
The domain of a module may be over-partitioned, or
Conversely, it may be under-partitioned
We therefore assessed the effectiveness of the method proposed
for generating data examples on 252 scientific modules
Note that the availability of a pool of annotated instances
is crucial to our method.
We constructed such a pool by harvesting existing
provenance traces of scientific workflows.
Evaluation: Metrics
Coverage
Completeness
Conciseness
Coverage
We were able to construct data examples that cover all
the partitions of the input parameters.
Moreover, the data examples generated were found to
cover most of the partitions of the output parameters.
Indeed, with the exception of the partitions of the
outputs of 19 modules, e.g., get_genes_by_enzyme,
link and binfo, all the partitions of the outputs of the
remaining 233 modules were covered by the data
examples generated.
Completeness
Conciseness
Understanding the Behavior of a
Module Using Data Examples
Question: Do data examples allow human users to understand
the behavior of scientific modules?
Evaluation exercise: given a module m, we adopted the
following two-step process:
1. In the first step, the user was asked to describe the
behavior of a module based on its name, the names of its
input and output parameters, and the structural and
semantic types of those parameters.
2. In the second step, the user was additionally given the data
examples that characterize the module and was asked to update
the description of the module’s behavior if they deemed it necessary
Understanding the Behavior of a
Module Using Data Examples
An analysis of the results and the modules showed that whether
human users were able to identify the behavior of a module is correlated
with the nature of the transformation carried out by the module.
The human users correctly identified the behavior of modules
implementing data retrieval, format transformation and identifier
mapping.
On the other hand, they were less successful with modules implementing
data filtering and complex data analysis, such as text mining.
Kind of data manipulation    # of modules
Format transformation        53
Data retrieval               51
Mapping identifiers          62
Filtering                    27
Data analysis                59
Table 3: Kinds of data manipulation carried out by the scientific modules.
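As a quick check on the figures in Table 3, the share of modules that perform format transformation, data retrieval or identifier mapping can be recomputed from the counts. The names COUNTS, SHIM_KINDS and shim_share are chosen for this sketch only.

```python
from typing import Dict, Tuple

# Counts copied from Table 3.
COUNTS: Dict[str, int] = {
    "Format transformation": 53,
    "Data retrieval": 51,
    "Mapping identifiers": 62,
    "Filtering": 27,
    "Data analysis": 59,
}

# Kinds of manipulation whose share is being computed.
SHIM_KINDS: Tuple[str, ...] = (
    "Format transformation",
    "Data retrieval",
    "Mapping identifiers",
)


def shim_share(counts: Dict[str, int], kinds: Tuple[str, ...]) -> Tuple[int, int, float]:
    """Return (modules of the given kinds, all modules, their ratio)."""
    selected = sum(counts[k] for k in kinds)
    total = sum(counts.values())
    return selected, total, selected / total


if __name__ == "__main__":
    selected, total, share = shim_share(COUNTS, SHIM_KINDS)
    print(f"{selected}/{total} = {share:.1%}")  # prints: 166/252 = 65.9%
```

The total of 252 matches the number of modules used in the evaluation, and 65.9% rounds to the 66% quoted in the paper excerpt.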
[Excerpt from the paper:]
[...] complex data analysis, data examples may not have the same value
as for other module kinds, as far as the human user is concerned.
Note, however, that a large proportion of scientific modules implement
format transformation, data retrieval and mapping identifiers,
which are referred to in the scientific workflow literature using the
term Shims [35]. For example, Table 3 classifies the modules that
we analyzed in the experiment. It shows that format transformation,
data retrieval and mapping identifiers modules represent between
them 66% of the total number of modules that we analyzed.
That said, it is worth stressing, as we will demonstrate in the next [...]
Comparing Scientific Modules
Using Data Examples
As well as understanding
scientific modules, users may
be interested in comparing the
behavior of two or more
modules.
Module comparison, as a
functionality, is particularly
requested by workflow
curators to repair broken
workflows.
Comparing Scientific Modules
Using Data Examples
Consider two modules m and m’, and consider that
the inputs and outputs of those modules are
semantically and structurally compatible.
To be able to compare the behavior of m and m’, we
generate data examples that characterize their behavior
using the method presented earlier.
However, to make the comparison of their behavior
straightforward, we generate the data examples of m
and m’ so that they have the same
input values.
Comparing Scientific Modules
Using Data Examples
By comparing the output values of the data examples
of m and m’ that have the same input values, we
determine if the two modules have behaviors that are:
Equivalent: All the data examples of the two modules have
the same output values.
Overlapping: Some (but not all) of the data examples of
the two modules have the same output values.
Disjoint: None of the data examples of the two modules
have the same output values.
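The three-way classification above can be sketched as follows. This is a minimal illustration, assuming that both modules can be invoked as plain functions on the same input values. The function name compare and the toy modules in the usage example are hypothetical.

```python
from typing import Callable, List


def compare(
    m: Callable[[str], str],
    m_prime: Callable[[str], str],
    inputs: List[str],
) -> str:
    """Classify two modules by comparing their outputs on shared inputs."""
    if not inputs:
        raise ValueError("at least one shared input value is required")
    # Count the data examples on which both modules return the same output.
    matches = sum(1 for value in inputs if m(value) == m_prime(value))
    if matches == len(inputs):
        return "equivalent"
    if matches == 0:
        return "disjoint"
    return "overlapping"


if __name__ == "__main__":
    shared_inputs = ["acgt", "Tt"]  # made-up input values
    print(compare(str.upper, lambda s: s.upper(), shared_inputs))  # equivalent
    print(compare(str.upper, str.lower, shared_inputs))            # disjoint
    print(compare(str.strip, lambda s: s, ["a", " b"]))            # overlapping
```

The verdict only holds with respect to the data examples used. Two modules reported as equivalent may still differ on inputs that fall outside the partitions covered.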
Evaluation
To assess the effectiveness of the above method for
comparing modules’ behavior, we used it to assist in
the curation of broken workflows.
We were able to identify 72 modules that are used in the
composition of scientific workflows (in the
myExperiment repository), that are no longer provided
by their suppliers, and for which we were able to
construct data examples.
We compared those modules with the 252 modules
that we characterized using data examples.
[Figure: results of the comparison for the 72 modules; values shown: 16, 23, 33]
Conclusions
We showed that it is possible to characterize scientific
modules using data examples without relying on module
specifications.
We also presented two functionalities that utilize the
generated data examples.
Understanding the module behavior by human users
Module comparison
Research questions for future work:
How can we make data examples more concise (less redundant)?
How can we compose modules based only on data examples?

EDBT 2014 talk
