An incremental learning method to support the annotation of workflows with data-to-data relations

Workflow formalisations are often focused on the representation of a process, with the primary objective of supporting execution. However, there are scenarios where what needs to be represented is the effect of the process on the data artefacts involved, for example when reasoning over the corresponding data policies. This can be achieved by annotating the workflow with the semantic relations that occur between these data artefacts. However, manually producing such annotations is difficult and time-consuming. In this paper we introduce a method based on recommendations to support users in this task. Our approach is centred on an incremental association rule mining technique that compensates for the cold start problem caused by the lack of a training set of annotated workflows. We discuss the implementation of a tool relying on this approach and how its application to an existing repository of workflows effectively enables the generation of such annotations.
--
Presented at
20th International Conference on Knowledge Engineering and Knowledge Management
Bologna, Italy
19-23 November 2016
http://link.springer.com/chapter/10.1007/978-3-319-49004-5_9

Published in: Data & Analytics
  1. An incremental learning method to support the annotation of workflows with data-to-data relations. Enrico Daga, Mathieu d’Aquin, Aldo Gangemi, Enrico Motta. Feedback: @enridaga. 20th International Conference on Knowledge Engineering and Knowledge Management, Bologna, Italy, 19-23 November 2016. http://link.springer.com/chapter/10.1007/978-3-319-49004-5_9
  2. “LipidMaps Query” from http://www.myexperiment.org/workflows/1052
  3. Workflow models are focused on actions, to support multiple and parametric executions.
  4. There are scenarios in which we need to focus on the data…
  5. … and understand how the data is affected by the actions of the workflow.
  6. Data flow (DF): to express the implications of the actions on the data.
  7. Datanode, a taxonomy of the relations between data objects, used for example to support reasoning on policy propagation. http://purl.org/datanode/ns/ Daga, E., d’Aquin, M., Gangemi, A., Motta, E.: Propagation of policies in rich data flows. In: Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015. http://doi.acm.org/10.1145/2815833.2815839
  8. Our objective is to derive such data flows from the representation of existing workflows.
  9. APPROACH: to learn how to label data-to-data relations using the description of the actions in the workflow.
     ASSUMPTION: there is a correlation between the features of a workflow action and the labels.
     PROBLEM: Cold start: this requires a pre-existing training set, which we do not have!
  10. Incremental learning method
  11. HYPOTHESIS: the quality of the recommendations improves over time.
  13. WORKFLOW to DATA FLOW. Arcs = I/O port pairs (1->3 ; 2->3). 1234 workflows from www.myexperiment.org = 30612 I/O port pairs.
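
The translation on slide 13 (arcs = I/O port pairs, 1->3 ; 2->3) can be illustrated with a minimal Python sketch. It assumes that every input port of a processor is paired with every output port of the same processor; the function name `io_port_pairs` is hypothetical and not taken from the tool.

```python
from itertools import product


def io_port_pairs(input_ports, output_ports):
    """Return one data-flow arc (input, output) per input/output port combination.

    Assumption: all inputs of a processor are linked to all of its outputs.
    """
    return list(product(input_ports, output_ports))


# The processor from the slide: input ports 1 and 2, output port 3.
pairs = io_port_pairs([1, 2], [3])
print(pairs)  # [(1, 3), (2, 3)]
```

Each resulting pair is the unit that later receives features and a data-to-data annotation.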
  14. FEATURES
      Direct: about the ports and processors involved: ids, data types, annotations, scripts …
      Derived: from annotations: bag of words, NER/DBPedia entities plus types and categories.

      Table 2. Example of derived features (bag of words and DBPedia entities) generated for the IO port pair 1 → 3.
      Type                                   Value
      From/FromPortName-word                 string
      To/ToPortName-word                     split
      From/FromLinkedPortDescription-word    single
      From/FromLinkedPortDescription-word    possibilities
      From/FromLinkedPortDescription-word    orb
      From/FromLinkedPortDescription-word    mass
      FromToPorts/DbPediaType                wgs84:SpatialThing
      FromToPorts/DbPediaType                resource:Text file
      FromToPorts/DbPediaType                resource:Mass
      FromToPorts/DbPediaType                Category:State functions
      FromToPorts/DbPediaType                Category:Physical quantities
      FromToPorts/DbPediaType                Category:Mathematical notation

      Distribution (30612 I/O port pairs):
      Fig. 4. Distribution of features extracted from the workflow descriptions: 80% (< 10), 18% (10 ~ 100), 2% (> 100).
      Fig. 5. Distribution of features (including derived features): 68% (< 10), 28% (10 ~ 100), 4% (> 100).
  15. FEATURES
      Direct:
      This processor has three ports: two input ports (1 and 2) and one output port […]. We can translate this model into a graph connecting the data objects of the inputs to the one of the output.

      Table 1. Sample of the features extracted for the IO port pair 1 → 3 in the example of Figure 3.
      Type                             Value
      From/FromPortName                string
      To/ToPortName                    split
      Activity/ActivityConfField       script
      Activity/ActivityType            http://ns.taverna.org.uk/2010/activity/beanshell
      Activity/ActivityName            reformat list
      Activity/ConfField/derivedFrom   http://ns.taverna.org.uk/2010/activity/localworker/org.embl.ebi.escience.scuflworkers.java.SplitByRegex
      Activity/ConfField/script        List split = new ArrayList();if (!string.equals("")) { String regexString = ","; if (regex != void) ...
      Processor/ProcessorType          Processor
      Processor/ProcessorName          reformat list

      However, the objective of these feature sets is to support the clustering of the annotated IO port pair through finding similarities with IO port pairs to be annotated. At this stage of the study we performed a preliminary evaluation of the distribution of the features extracted. We discovered that very few of them were shared between a significant number of port pairs (see Figure 4). In order to increase the number of shared features we generated a set of derived features by extracting bags of words from lexical feature values and by performing Named Entity Recognition on the features that constituted textual annotations (…s and comments), when present. Moreover, from the extracted entities we added the related DBPedia categories and types as additional features. As an example, Table 2 shows a sample of the bag of words and entities extracted from the features listed in the previous Table 1.
      Derived: Table 2, as on slide 14.
      Distribution (30612 I/O port pairs): Figs. 4 and 5, as on slide 14.
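
The bag-of-words part of the derived features can be sketched as follows. This is only an illustration under stated assumptions: the tokenisation rule (split on non-letters and on camelCase boundaries, then lowercase) and the function name `bag_of_words_features` are hypothetical, and the Named Entity Recognition / DBPedia step is not covered.

```python
import re


def bag_of_words_features(direct_features):
    """Derive (type + "-word", token) features from (type, value) pairs.

    Assumption: tokens are runs of letters, with camelCase split into words.
    """
    derived = []
    for feature_type, value in direct_features:
        # Insert a space at lower-to-upper case boundaries (fooBar -> foo Bar).
        spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", value)
        for token in re.findall(r"[A-Za-z]+", spaced):
            pair = (feature_type + "-word", token.lower())
            if pair not in derived:  # keep each derived feature once
                derived.append(pair)
    return derived


direct = [
    ("From/FromPortName", "string"),
    ("To/ToPortName", "split"),
    ("Activity/ActivityName", "reformat list"),
]
for feature in bag_of_words_features(direct):
    print(feature)
```

For the first two rows this yields `From/FromPortName-word: string` and `To/ToPortName-word: split`, in line with the first two rows of Table 2.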
  17. Formal Concept Analysis (FCA)
      • FCA is a clustering method for association rule mining
      • Lattice of ordered closed item sets (concepts)
      • Item: I/O port pair <-> features + annotations
      • FCA Concept:
        • Extent (I/O port pairs)
        • Intent (features, annotations)
      • Incremental lattice construction (Godin algorithm).
      • Lattice is reconstructed on each item addition.
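
To make the extent/intent terminology of slide 17 concrete, here is a minimal sketch of the two FCA derivation operators on a toy context. It does not implement the incremental Godin algorithm mentioned on the slide; the item names (`io1`, `io2`, `io3`) and the feature/annotation names are hypothetical.

```python
def extent(context, attrs):
    """All items (I/O port pairs) that have every attribute in attrs."""
    return {item for item, item_attrs in context.items() if attrs <= item_attrs}


def intent(context, items):
    """All attributes (features and annotations) shared by every item in items."""
    attr_sets = [context[item] for item in items]
    if not attr_sets:
        # By convention, the empty extent is described by all attributes.
        return set().union(*context.values())
    return set.intersection(*attr_sets)


def concept_of(context, attrs):
    """Close a set of attributes into a formal concept (extent, intent)."""
    ext = extent(context, attrs)
    return ext, intent(context, ext)


# Toy context: item -> features (f*) and annotations (a*).
context = {
    "io1": {"f1", "f2", "a1"},
    "io2": {"f1", "f2", "f3", "a1"},
    "io3": {"f3", "a2"},
}
ext, itt = concept_of(context, {"f1"})
print(sorted(ext), sorted(itt))  # ['io1', 'io2'] ['a1', 'f1', 'f2']
```

In this toy context, closing {f1} shows that every item with f1 also has f2 and a1, which is the kind of regularity that association rules capture.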

  18. Step 0. At the beginning, the user adds a single item, without support. The lattice contains a single concept.
  19. Step 1. As new annotations are added, the lattice makes it possible to derive association rules. (f1, f2, ..., fn) → (a1, a2, ..., an)
  20. Step 2. As new annotations are added, the lattice grows, making it possible to generate recommendations. (f1, f2, ..., fn) → (a1, a2, ..., an)
  21. Step 3. As new annotations are added, the lattice grows, making it possible to generate more recommendations. (f1, f2, ..., fn) → (a1, a2, ..., an)
  22. Step 4. As new annotations are added, the lattice grows, making it possible to generate many recommendations. (f1, f2, ..., fn) → (a1, a2, ..., an)
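
For a rule of the form (f1, f2, ..., fn) → (a1, a2, ..., an), support and confidence follow the standard association rule definitions. The sketch below computes them directly from a list of annotated items rather than from a lattice, which is a simplification; the example items are hypothetical.

```python
def support(items, itemset):
    """Fraction of items that contain every element of itemset."""
    itemset = set(itemset)
    return sum(1 for attrs in items if itemset <= attrs) / len(items)


def confidence(items, body, head):
    """support(body ∪ head) / support(body); 0.0 if the body never occurs."""
    body_support = support(items, body)
    if body_support == 0:
        return 0.0
    return support(items, set(body) | set(head)) / body_support


# Hypothetical annotated I/O port pairs: features (f*) plus annotations (a*).
items = [
    {"f1", "f2", "a1"},
    {"f1", "f2", "a1"},
    {"f1", "f2", "a2"},
    {"f3", "a2"},
]
print(support(items, {"f1", "f2", "a1"}))       # 0.5
print(confidence(items, {"f1", "f2"}, {"a1"}))  # 2 of the 3 items with f1 and f2 carry a1
```

As more annotated items are added, these counts change, which is why the set of usable rules grows from step to step.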
  23. ASSOCIATION RULE MINING
      Generating all association rules on each iteration is expensive. We query the lattice to retrieve only the rules applicable to a given I/O port pair:
      • only rules that have annotations in the rule consequent:
        • This: (f1, f2, ..., fn) → (a1, a2, ..., an)
        • Not these: (f1, f2, a6) → (f3, f4), (f1, f2, a6) → (f3, a4)
      • avoid redundancies (select the best for a certain head)
      • rank the rules according to support, confidence and relevance.
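
The three selection criteria on slide 23 can be sketched as a filter, a de-duplication by head, and a sort. Several points are assumptions made for illustration only: the `Rule` class, the convention that annotation names start with "a", the use of precomputed `relevance` values (the slides do not define relevance), and the lexicographic order support, then confidence, then relevance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    body: frozenset
    head: frozenset
    support: float
    confidence: float
    relevance: float  # assumed to be given; not defined in the slides


def is_annotation(name):
    # Hypothetical naming convention: annotations "a*", features "f*".
    return name.startswith("a")


def score(rule):
    return (rule.support, rule.confidence, rule.relevance)


def select_rules(rules):
    """Keep rules whose head holds only annotations, one best rule per head, ranked."""
    best = {}
    for rule in rules:
        if not rule.head or not all(is_annotation(x) for x in rule.head):
            continue  # e.g. (f1, f2, a6) -> (f3, a4) is discarded
        if rule.head not in best or score(rule) > score(best[rule.head]):
            best[rule.head] = rule
    return sorted(best.values(), key=score, reverse=True)


rules = [
    Rule(frozenset({"f1", "f2"}), frozenset({"a1"}), 0.5, 0.8, 0.6),
    Rule(frozenset({"f1"}), frozenset({"a1"}), 0.5, 0.9, 0.4),
    Rule(frozenset({"f1", "f2", "a6"}), frozenset({"f3", "f4"}), 0.9, 0.9, 0.9),
    Rule(frozenset({"f1", "f2", "a6"}), frozenset({"f3", "a4"}), 0.9, 0.9, 0.9),
    Rule(frozenset({"f3"}), frozenset({"a2"}), 0.7, 0.5, 0.5),
]
for rule in select_rules(rules):
    print(sorted(rule.body), "->", sorted(rule.head))
```

Here the two rules with features in the head are dropped, the two rules for a1 collapse into the higher-scoring one, and the rule for a2 ranks first because of its higher support.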
  24. io6: f7, f8, f9, f10, f11, a?
      (f7, f8) → (a0)
      (f8, f9) → (a2)
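
The example on slide 24 can be read as follows: the item io6 has features f7 to f11 and no annotation yet, and the two rules shown apply to it. A minimal sketch, assuming that a rule is applicable when all features in its body are present on the item (the slides do not spell out the matching criterion) and adding one hypothetical non-matching rule for contrast:

```python
def recommend(item_features, rules):
    """Return the annotations suggested by every rule whose body the item satisfies."""
    features = set(item_features)
    recommended = []
    for body, head in rules:
        if set(body) <= features:  # assumed applicability test
            for annotation in head:
                if annotation not in recommended:
                    recommended.append(annotation)
    return recommended


io6 = ["f7", "f8", "f9", "f10", "f11"]
rules = [
    (("f7", "f8"), ("a0",)),
    (("f8", "f9"), ("a2",)),
    (("f1", "f2"), ("a1",)),  # hypothetical rule that does not apply to io6
]
print(recommend(io6, rules))  # ['a0', 'a2']
```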
  25. EVALUATION
      • Expectation: the quality of the recommendations improves over time.
      • EXPERIMENT:
        • Dinowolf (Datanode in workflows): http://github.com/enridaga/dinowolf. Uses SCUFL2, Apache Taverna, Apache Lucene, DBPedia Spotlight.
        • 6 users annotated 20 workflows from www.myexperiment.org, for a total of 260 I/O port pairs.
  26. RESULTS
      Time required to make a choice: Fig. 6. Evolution of the time spent by each user on a given annotation page of the tool before a decision was made. (Horizontal axis: 20 to 260; vertical axis: 5s to 10m.)
      Selections from recommendations: Fig. 7. Progress of the ratio of annotations selected from recommendations. (Horizontal axis: 20 to 260; vertical axis: 0.0 to 1.0.)
      This confirms our hypothesis that the quality of recommendations increases, stabilizing within the upper region after a critical mass of annotated items is produced, reflecting the same behavior observed in Fig. 7.
      Effort reduced. Cold start problem tackled.
  27. RESULTS
      Fig. 7. Progress of the ratio of annotations selected from recommendations. (Horizontal axis: 20 to 260; vertical axis: 0.0 to 1.0.)
      Rank of selected recommendations: Fig. 8. Average rank of selected recommendations. The vertical axis represents the score placing at the top the first position.
      Relevance score of selected recommendations: Fig. 9. Progress of the average relevance score of picked recommendations.
      Quality of recommendations increases.
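
A quantity like the one plotted in Fig. 7 could be computed from a log of user decisions as sketched below. This is an assumption about the metric, not the authors' evaluation code: the slides do not say whether the ratio is cumulative or windowed, and the input log here is hypothetical.

```python
def running_ratio(from_recommendation):
    """Cumulative share of annotations that were picked from the recommendations.

    from_recommendation: one boolean per annotated I/O port pair, in annotation order.
    """
    ratios = []
    hits = 0
    for count, picked in enumerate(from_recommendation, start=1):
        if picked:
            hits += 1
        ratios.append(hits / count)
    return ratios


# Hypothetical log: the first annotation is made by hand, later ones from recommendations.
print(running_ratio([False, True, True, True]))
```

With a cold start, the first decisions cannot come from recommendations, so the curve starts at 0 and rises as rules become available.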
  28. CONCLUSIONS
      • Supporting users in annotating workflows with data-to-data relations through recommendations is problematic because of the lack of an initial training set (cold start problem).
      • We tackled this issue by means of an incremental learning process that leverages FCA and an information retrieval approach to association rule mining (ARM).
      • Future work:
        • Integrate this approach in Data Hub metadata management to support policy propagation.
        • Study the quality and consistency of annotations.
        • Agreement/disagreement between users.
      • The solution is domain independent and can be applied to other scenarios.
  29. Thank you. Enrico Daga. Feedback: @enridaga. http://link.springer.com/chapter/10.1007/978-3-319-49004-5_9
  30. REFERENCES
      • Daga, E., d’Aquin, M., Adamou, A., Motta, E.: Addressing exploitability of smart city data. In: 2016 IEEE Second International Smart Cities Conference (ISC2). IEEE (2016)
      • Daga, E., d’Aquin, M., Gangemi, A., Motta, E.: Describing semantic web applications through relations between data nodes. Technical report kmi-14-05, Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes (2014). http://kmi.open.ac.uk/publications/techreport/kmi-14-05
      • Daga, E., d’Aquin, M., Gangemi, A., Motta, E.: Propagation of policies in rich data flows. In: Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, New York, NY, USA, pp. 5:1–5:8 (2015). http://doi.acm.org/10.1145/2815833.2815839
      • Godin, R., Missaoui, R., Alaoui, H.: Incremental concept formation algorithms based on Galois (concept) lattices. Comput. Intell. 11(2), 246–267 (1995)
      • Poelmans, J., Elzinga, P., Viaene, S., Dedene, G.: Formal concept analysis in knowledge discovery: a survey. In: Croitoru, M., Ferré, S., Lukose, D. (eds.) ICCS 2010. LNCS (LNAI), vol. 6208, pp. 139–153. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14197-3_15
