Deploying your Predictive Models as a Service via Domino (Jo-fai Chow)
The document discusses how Domino Data Lab can be used to deploy predictive models as APIs. It provides examples of using Domino to build, evaluate, and deploy predictive models for the Iris dataset and stock market forecasting. Key features discussed include the web and R interfaces, code sharing, scheduled runs, automatic version control, and publishing models as APIs for other applications to access.
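As an illustration of the pattern the summary describes (train a model, then publish a single predict function as an API endpoint), here is a minimal Python/scikit-learn sketch for the Iris case. The talk itself worked through Domino's web and R interfaces; the function below and its signature are hypothetical, since Domino lets you choose which file and function to publish.

```python
# Minimal sketch (not the talk's code): an Iris classifier plus an entry-point
# function that a platform such as Domino could publish as an API endpoint.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(iris.data, iris.target)

def predict(sepal_length, sepal_width, petal_length, petal_width):
    """Hypothetical entry point: called once per API request."""
    row = [[sepal_length, sepal_width, petal_length, petal_width]]
    return str(iris.target_names[model.predict(row)[0]])
```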
Taverna workflows can be run in the cloud to automate complex analysis pipelines and access remote data and services. This allows sophisticated computational analyses to be shared as web services. The BioVeL and CA4LS projects are developing cloud-based workflow systems to support life scientists and clinical researchers. Workflows are hidden from users, who access pre-configured analyses via a web interface. This "workflow as a service" approach scales easily and provides a secure environment for data-intensive biomedical research.
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe... (Carole Goble)
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is, the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure the reproducibility of results. Funding agencies expect data (and increasingly software) management, retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes for pre- and post-publication. The multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
The FAIR Guiding Principles for scientific data management and stewardship (http://www.nature.com/articles/sdata201618) have been an effective rallying cry for EU and USA Research Infrastructures. The FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has 8 years of experience of asset sharing and data infrastructure, ranging across European programmes (the SysMO and EraSysAPP ERA-Nets), national initiatives (de.NBI, the German Virtual Liver Network, UK SynBio centres) and PIs' labs. It aims to support Systems and Synthetic Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs and models for Systems Biology, highlighting the challenges of and approaches to sharing, credit, citation and asset infrastructures in practice. I'll also highlight recent experiments in influencing sharing through behavioural interventions.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Presented at COMBINE 2016, Newcastle, 19 September.
http://co.mbine.org/events/COMBINE_2016
Tutorial given at the European Conference on Machine Learning (ECML PKDD 2015). It covers OpenML, how to use it in your research, its interfaces in Java, R and Python, and its use through machine learning tools such as WEKA and MOA. It also covers topics in open science and reproducible research.
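For a flavour of the Python interface the tutorial covers, a minimal sketch with the openml package might look like this; the task id is illustrative, and the commented publish call would share the run on OpenML.

```python
# Sketch of the OpenML Python interface (assumes `pip install openml` and,
# for publishing, an OpenML API key). Task 59 is used here for illustration.
import openml
from sklearn.tree import DecisionTreeClassifier

task = openml.tasks.get_task(59)                 # a supervised classification task
clf = DecisionTreeClassifier(max_depth=3)
run = openml.runs.run_model_on_task(clf, task)   # evaluated on the task's splits
print(run)
# run.publish()  # would upload the run to OpenML for reproducible comparison
```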
PhD Thesis: Mining abstractions in scientific workflows (dgarijo)
Slides of the presentation for my PhD dissertation. I strongly recommend downloading the slides, as they have animations that are easier to see in PowerPoint. The abstract of the thesis is as follows: "Scientific workflows have been adopted in the last decade to represent the computational methods used in in silico scientific experiments and their associated research products. Scientific workflows have proven to be useful for sharing and reproducing scientific experiments, allowing scientists to visualize, debug and save time when re-executing previous work. However, scientific workflows may be difficult to understand and reuse. The large number of available workflows in repositories, together with their heterogeneity and lack of documentation and usage examples, may become an obstacle for a scientist aiming to reuse the work of other scientists. Furthermore, given that it is often possible to implement a method using different algorithms or techniques, seemingly disparate workflows may be related at a higher level of abstraction, based on their common functionality. In this thesis we address the issue of reusability and abstraction by exploring how workflows relate to one another in a workflow repository, mining abstractions that may be helpful for workflow reuse. In order to do so, we propose a simple model for representing and relating workflows and their executions, we analyze the typical common abstractions that can be found in workflow repositories, we explore the current practices of users regarding workflow reuse, and we describe a method for discovering useful abstractions for workflows based on existing graph mining techniques. Our results expose the common abstractions and practices of users in terms of workflow reuse, and show how our proposed abstractions have the potential to become useful for users designing new workflows."
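To make the graph-mining idea concrete, here is a toy sketch (my illustration, not the thesis code): each workflow is reduced to labeled step-to-step edges, and patterns recurring across several workflows become candidate abstractions. Real systems mine larger subgraphs, but the principle is the same.

```python
# Toy illustration of mining common abstractions across workflow DAGs:
# count step-to-step patterns and keep those appearing in several workflows.
from collections import Counter

workflows = [  # each workflow as (step, next_step) edges; names are invented
    [("download", "clean"), ("clean", "plot")],
    [("download", "clean"), ("clean", "train")],
    [("download", "clean"), ("clean", "plot")],
]
counts = Counter(edge for wf in workflows for edge in wf)
abstractions = [edge for edge, n in counts.items() if n >= 2]
print(abstractions)  # [('download', 'clean'), ('clean', 'plot')]
```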
Results may vary: Collaborations Workshop, Oxford 2014 (Carole Goble)
Thoughts on computational science reproducibility, with a focus on software. Given at the Software Sustainability Institute's 2014 Collaborations Workshop.
Agile Development in a Regulated Environment (TechWell)
There is no doubt that agile is an accepted development methodology. However, if you work in a regulated industry like health care where you have to comply with its standard operating procedures, heaps of paperwork, and frequent audits, don’t these conflict with agile’s core tenets? Chris Ampenberger describes his operating environment and the applicable regulations that define the constraints for the software development process he can use. He shares how they overcame the incongruity between agile and regulatory requirements. With real-world examples, Chris demonstrates how you can produce the required documentation as a byproduct of the scrum team’s everyday work and illustrates how his teams succeeded in an agile way, achieving significant increases in productivity. Chris points out common pitfalls, details the hurdles they had to overcome, and discusses how to obtain buy-in from stakeholders at all levels of the organization. If you are working in a regulated environment, this session is for you.
The SEALS project conducted the first worldwide evaluation of semantic tools using their SEALS platform. They evaluated ontology engineering tools, storage and reasoning systems, ontology matching tools, semantic search tools, and semantic web services tools. The results showed that certain tools performed better than others in each category. A white paper summarizing the results will be published soon, and the next evaluation campaign using the SEALS platform will begin in July 2011.
In this video from the 2017 Argonne Training Program on Extreme-Scale Computing, Phil Carns from Argonne presents: HPC I/O for Computational Scientists.
"Darshan is a scalable HPC I/O characterization tool. It captures an accurate but concise picture of application I/O behavior with minimum overhead."
Darshan was originally developed on the IBM Blue Gene series of computers deployed at the Argonne Leadership Computing Facility, but it is portable across a wide variety of platforms, including the Cray XE6, Cray XC30, and Linux clusters. Darshan routinely instruments jobs using up to 786,432 compute cores on the Mira system at ALCF.
Watch the video: https://wp.me/p3RLHQ-hv9
Learn more: https://extremecomputingtraining.anl.gov/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article:
Carole Goble, "Better Software, Better Research", IEEE Internet Computing, vol. 18, no. 5, pp. 4-8, Sept.-Oct. 2014 (IEEE Computer Society).
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
From Scientific Workflows to Research Objects: Publication and Abstraction of... (dgarijo)
Presentation of my PhD work to the UPM group on 12 February 2014. Summary of goals, motivation, OPMW, standards, PROV, P-Plan, Workflow Motifs, workflow fragment detection and Research Objects.
Using SigOpt to Tune Deep Learning Models with Nervana Cloud (SigOpt)
This document discusses using SigOpt to tune deep learning models. It notes that tuning deep learning systems is non-intuitive and expert-intensive using traditional random search or grid search methods. SigOpt provides a more efficient approach using Bayesian optimization to suggest optimal hyperparameters after each trial, reducing wasted expert time and computation. The document provides examples applying SigOpt to tune convolutional neural networks on CIFAR10, demonstrating a 1.6% reduction in error rate over expert tuning with no wasted trials.
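The suggest/observe loop the summary describes is compact enough to sketch. The outline below follows the shape of SigOpt's Python client from that period; the token, parameter ranges and training stub are placeholders.

```python
# Hedged sketch of a SigOpt tuning loop (assumes the `sigopt` client of that
# era); the API token, parameters, and training stub are all placeholders.
from sigopt import Connection

def train_and_evaluate(log_lr, batch_size):
    """Placeholder: train the CNN here and return validation accuracy."""
    return 0.9  # stand-in value; replace with a real training run

conn = Connection(client_token="YOUR_SIGOPT_TOKEN")
experiment = conn.experiments().create(
    name="CIFAR10 CNN (sketch)",
    parameters=[
        {"name": "log_lr", "type": "double", "bounds": {"min": -6.0, "max": -1.0}},
        {"name": "batch_size", "type": "int", "bounds": {"min": 32, "max": 256}},
    ],
)
for _ in range(60):  # each trial: get a suggestion, evaluate it, report back
    suggestion = conn.experiments(experiment.id).suggestions().create()
    accuracy = train_and_evaluate(**suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id, value=accuracy)
```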
- Systems biology uses computational approaches to produce quantitative, predictive models of biological processes by integrating math, biology, and high-throughput data.
- Eclipse technology can help by providing an extensible and customizable user interface for biologists to access modeling tools and IDEs for computational modelers, with reusable components.
- The SBSI software provides clients, a dispatcher, numerics algorithms, and a repository for systems biology modeling and optimization, with plugins for tasks like pathway editing, simulation, and data visualization.
Opquast desktop: quick analysis of an open data dataset (Temesis)
Opquast desktop is a Firefox add-on that analyzes the quality of web pages. In Brussels, at the 14th Libre Software Meeting, the creators of this add-on demonstrated how Opquast desktop can check the quality of an open data dataset.
From Scientific Workflows to Research Objects: Publication and Abstraction of... (dgarijo)
Overview of my current work done at the Ontology Engineering Group. This presentation is similar to http://www.slideshare.net/dgarijo/from-scientific-workflows-to-research-objects-publication-and-abstraction-of-scientific-experiments, with a couple of extra slides with some details of my future plans.
The document discusses the Planets Testbed, which provides a controlled environment for experimenting with and evaluating digital preservation tools and strategies. The Testbed allows for systematic testing of tools on shared content, automated comparison of experiment results, and reproducibility of experiments. This enables more informed decision making about digital preservation approaches tailored to institutional needs and contexts. Key benefits of the Testbed include access to preservation tools and experimental data, as well as contributing to the growing body of knowledge on digital preservation.
Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. The Super Learner algorithm, also called "stacking", learns the optimal combination of the base learner fits. The latest version of H2O now contains a "Stacked Ensemble" method, which allows the user to stack H2O models into a Super Learner. The Stacked Ensemble method is the native H2O version of stacking, previously available only in the h2oEnsemble R package, and it now enables stacking from all the H2O APIs: Python, R, Scala, etc.
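A minimal Python sketch of the Stacked Ensemble route follows; it assumes the h2o package and an illustrative CSV, and uses the documented requirement that base models share fold assignments and keep their cross-validated predictions.

```python
# Minimal sketch of H2O's Stacked Ensemble from Python; file path and column
# layout (last column = binary target) are illustrative assumptions.
import h2o
from h2o.estimators import (H2OGradientBoostingEstimator,
                            H2ORandomForestEstimator,
                            H2OStackedEnsembleEstimator)

h2o.init()
train = h2o.import_file("train.csv")       # hypothetical dataset
x, y = train.columns[:-1], train.columns[-1]
train[y] = train[y].asfactor()             # binary outcome

# Base models must use identical folds and keep their CV predictions.
common = dict(nfolds=5, fold_assignment="Modulo",
              keep_cross_validation_predictions=True, seed=1)
gbm = H2OGradientBoostingEstimator(**common)
gbm.train(x=x, y=y, training_frame=train)
drf = H2ORandomForestEstimator(**common)
drf.train(x=x, y=y, training_frame=train)

# The metalearner is fit on the base models' cross-validated predictions.
ensemble = H2OStackedEnsembleEstimator(base_models=[gbm, drf])
ensemble.train(x=x, y=y, training_frame=train)
```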
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io (acquired by GE Digital) and Marvin Mobile Security (acquired by Veracode) and the founder of DataScientific, Inc. Erin received her Ph.D. from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing.
Multimodal graph-based analysis over the DBLP repository: critical discoverie... (Universidade de São Paulo)
The use of graph theory for analyzing network-like data has gained central importance with the rise of Web 2.0. However, many graph-based techniques are neither well disseminated nor explored to their full potential, which may call for a complementary approach combining multiple techniques. This paper describes the systematic use of graph-based techniques of different types (multimodal), combining the resulting analytical insights around a common domain, the Digital Bibliography & Library Project (DBLP). To do so, we introduce an analytical ensemble based on statistical (degree and weakly-connected-component distributions), topological (average clustering coefficient, and effective diameter evolution), algorithmic (link prediction/machine learning), and algebraic techniques to inspect non-evident features of DBLP, while interpreting the heterogeneous discoveries found along the way. As a result, we have put together a set of techniques demonstrating over DBLP what we call multimodal analysis, an innovative process of information understanding that demands broad technical knowledge and a deep understanding of the data domain. We expect that our methodology and our findings will foster other multimodal analyses and shed light on Computer Science research.
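As a small self-contained illustration of combining these measure families (my sketch, not the paper's code), the networkx snippet below computes a degree distribution, weakly-connected components, average clustering and a link-prediction score over an invented directed graph.

```python
# Illustrative combination of statistical, topological and algorithmic
# measures on a tiny invented citation-style graph, using networkx.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])

# Statistical: degree distribution and weakly-connected components
degrees = sorted(d for _, d in G.degree())
components = list(nx.weakly_connected_components(G))

# Topological: average clustering coefficient
avg_clustering = nx.average_clustering(G.to_undirected())

# Algorithmic: link prediction via Jaccard coefficient (undirected view)
preds = list(nx.jaccard_coefficient(G.to_undirected(), [("b", "d")]))
print(degrees, len(components), avg_clustering, preds)
```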
Fyber implemented XGBoost models for two main use cases: Audience Vault Reach prediction and CTR prediction for their offer wall. For Audience Vault Reach, XGBoost with Spark was used to predict audience size over the next 14 days using historical user activity data. For CTR prediction, XGBoost ranked offers based on attributes to better estimate performance compared to old manual configurations. Both models involved data preprocessing, feature engineering, training XGBoost pipelines on Spark, and integrating the models into products.
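Since the summary names the technique, a simplified stand-in may help: the talk's pipelines ran XGBoost on Spark, but the single-machine Python sketch below (with invented features and labels) shows the core of a CTR ranking model.

```python
# Simplified single-machine sketch of a CTR model; the talk used XGBoost
# pipelines on Spark, and the features/labels here are synthetic.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((1000, 3))                  # e.g. payout, category score, recency
y = (rng.random(1000) < 0.1).astype(int)   # clicked / not clicked

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)
ctr = model.predict_proba(X[:5])[:, 1]     # rank offers by predicted CTR
print(ctr)
```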
Scientific Workflows: what do we have, what do we miss? (Paolo Romano)
This document discusses scientific workflows and outlines some key points:
- Scientific workflows are used to automate data retrieval and analysis processes from multiple databases and tools. Workflow management systems help implement these processes.
- Issues with current workflow systems include lack of automatic composition capabilities, performance limitations especially with large data volumes, and ensuring reproducibility of results over time as databases and tools change.
- The document outlines approaches to address these issues such as using ontologies to support automatic composition, optimizing for performance through parallelization and alternative services, and capturing provenance data to improve reproducibility and reuse of analyses.
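As a concrete (if toy) illustration of the provenance point in the last item, the sketch below records each step's inputs, outputs and timing in a log that can later be audited or compared across runs; it is system-agnostic and not tied to any workflow manager mentioned in the talk.

```python
# Toy provenance capture: a decorator logs each step's input/output hashes
# and duration so a pipeline run can be traced and reproduced later.
import functools, hashlib, json, time

PROVENANCE = []

def traced(step):
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = step(*args, **kwargs)
        PROVENANCE.append({
            "step": step.__name__,
            "inputs": hashlib.sha1(repr((args, kwargs)).encode()).hexdigest(),
            "output": hashlib.sha1(repr(result).encode()).hexdigest(),
            "seconds": round(time.time() - start, 3),
        })
        return result
    return wrapper

@traced
def normalise(values):
    total = sum(values)
    return [v / total for v in values]

normalise([1, 2, 3])
print(json.dumps(PROVENANCE, indent=2))
```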
The document discusses some of the promises and perils of mining software repositories like Git and GitHub for research purposes. It notes that while these sources contain rich data on software development, there are also challenges to consider. For example, decentralized version control systems like Git allow private collaboration that may be missed. Moreover, most GitHub projects are personal and inactive, and GitHub is also used purely for storage and hosting. The document recommends that researchers approach these data sources carefully and provides lessons on how to properly analyze and interpret data from repositories like Git and GitHub.
This paper presents an approach for mapping products, processes, and resources for assembly automation using ontologies. The approach uses ontologies to represent product, process, and resource knowledge and SWRL rules to infer required components and tasks for product assembly. The approach was tested on a Festo test rig case study. The results demonstrated that the ontology mappings enabled dynamic configuration and analysis of the automation system and eased the modeling task.
Creating abstractions from scientific workflows: PhD symposium 2015 (dgarijo)
This document discusses the creation of abstractions in scientific workflows. It hypothesizes that it is possible to automatically extract reusable patterns and abstractions from scientific workflow repositories that could be useful for developers. The document outlines challenges in workflow representation, abstraction, reuse, and annotation. It then describes an approach to define vocabularies and methodologies for publishing workflows as linked data. This includes defining a catalog of common workflow abstractions and techniques for finding and evaluating these abstractions across different workflow corpora. Evaluation shows the extracted patterns are similar to those defined by users and are considered useful.
Software tools to facilitate materials science research (Anubhav Jain)
The document discusses software tools to facilitate materials science research, noting that the author's group works to standardize and automate computational methods for high-throughput calculations and discovery of new functional materials. It advocates for developing automated workflows and analysis frameworks to reduce errors, improve efficiency, and enable non-experts to easily conduct complex simulations and analyses through intuitive online interfaces. The goal is to make advanced computational materials science accessible to a wider audience.
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles (dgarijo)
This document describes FOOPS, an ontology validation service that checks ontologies for adherence to the FAIR principles. FOOPS tests ontologies against criteria related to findability, accessibility, interoperability, and reusability. It provides explanations for test failures to help users improve their ontologies. FOOPS validation results include an overall FAIRness score and coverage of FAIR categories to assess ontology quality, though there is no single threshold for what makes an ontology fully FAIR. The document demonstrates FOOPS and lists the types of tests it supports under each FAIR category. It invites feedback to help further improve FOOPS.
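FOOPS! can also be invoked programmatically. The sketch below assumes a REST endpoint shaped like the public demo service; the URL, payload and response field are assumptions to check against the FOOPS! documentation.

```python
# Hedged sketch: submitting an ontology URI to FOOPS! for a FAIRness report.
# Endpoint, payload shape and response fields are assumptions, not verified API.
import requests

resp = requests.post(
    "https://foops.linkeddata.es/assessOntology",           # assumed endpoint
    json={"ontologyUri": "https://w3id.org/example/onto"},  # hypothetical ontology
    headers={"Content-Type": "application/json"},
    timeout=60,
)
report = resp.json()
print(report.get("overall_score"))  # field name is an assumption
```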
FAIR Workflows: A step closer to the Scientific Paper of the Future (dgarijo)
Keynote presented at the Computational and Autonomous Workflows workshop (CAW-2021) at Oak Ridge National Laboratory. The keynote gives an overview of the different aspects to take into account when aiming to create FAIR workflows and associated resources.
An increasing number of researchers rely on computational methods to generate the results described in their publications. Research software created to this end is heterogeneous (scripts, libraries, packages, notebooks, etc.) and usually difficult to find, reuse, compare and understand due to its disconnected documentation (dispersed across manuals, readme files, web sites, and code comments) and a lack of structured metadata to describe it. In this talk I will describe the main challenges in finding, comparing and reusing research software; how structured metadata can help address some of them; the best practices being proposed by the community; and current initiatives to aid their adoption by researchers within EOSC.
Impact: The talk addresses an important aspect of the EOSC infrastructure for quality research software by ensuring that software contributed to the EOSC ecosystem can be found, compared and reused by researchers. The talk also aims to address metadata quality of current research products, which is critical for successful adoption.
Presented at the EOSC symposium
SOMEF: a metadata extraction framework from software documentation (dgarijo)
Presentation given at the Council of Software Registries in March 2021. SOMEF is a Python package for automatically extracting over 25 metadata categories from a readme file. The output is then exported in JSON, or in JSON-LD using the CodeMeta representation.
A Template-Based Approach for Annotating Long-Tailed Datasets (dgarijo)
An increasing amount of data is shared on the Web through heterogeneous spreadsheets and CSV files. In order to homogenize and query these data, the scientific community has developed Extract, Transform and Load (ETL) tools and services that help make these files machine readable in Knowledge Graphs (KGs). However, tabular data may be complex, and the level of expertise required by existing ETL tools makes it difficult for users to describe their own data. In this paper we propose a simple annotation schema to guide users when transforming complex tables into KGs. We have implemented our approach by extending T2WML, a table annotation tool designed to help users annotate their data and upload the results to a public KG. We have evaluated our effort with six non-expert users, obtaining promising preliminary results.
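To make the template idea concrete (a generic sketch, not T2WML's actual syntax), a single declarative mapping can turn every row of a table into a knowledge-graph statement:

```python
# Generic illustration of template-based annotation: one template describes
# how every row of a table becomes a (subject, property, value) statement.
import pandas as pd

df = pd.DataFrame({"country": ["Peru", "Chile"],
                   "year": [2020, 2020],
                   "population": [32970000, 19116000]})  # invented figures

def apply_template(row):
    """One template applied uniformly to each data row."""
    return {"subject": row["country"],
            "property": "population",
            "value": row["population"],
            "qualifier": {"year": row["year"]}}

statements = [apply_template(row) for _, row in df.iterrows()]
print(statements)
```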
OBA: An Ontology-Based Framework for Creating REST APIs for Knowledge Graphs (dgarijo)
In this presentation we describe the Ontology-Based APIs framework (OBA), our approach to automatically create REST APIs from ontologies while following RESTful API best practices. Given an ontology (or ontology network), OBA uses standard technologies familiar to web developers (OpenAPI Specification, JSON) and combines them with W3C standards (OWL, JSON-LD frames and SPARQL) to create maintainable APIs with documentation, unit tests, automated validation of resources, and clients (in Python, JavaScript, etc.) that let non-Semantic-Web experts access the contents of a target knowledge graph. We showcase OBA with three examples that illustrate the capabilities of the framework for different ontologies.
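From the consumer side, an OBA-generated API behaves like any REST service. The calls below are hypothetical (invented host, resource paths and JSON shape) and only indicate the kind of requests a generated client wraps.

```python
# Hypothetical usage of an OBA-generated REST API; the base URL, resource
# names and JSON fields are invented for illustration.
import requests

BASE = "https://api.example.org/v1"
regions = requests.get(f"{BASE}/regions", params={"page": 1}, timeout=30).json()
first = requests.get(f"{BASE}/regions/{regions[0]['id']}", timeout=30).json()
print(first["label"])
```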
Towards Knowledge Graphs of Reusable Research Software Metadata (dgarijo)
Research software is a key asset for understanding, reusing and reproducing results in computational sciences. An increasing amount of software is stored in code repositories, which usually contain human readable instructions indicating how to use it and set it up. However, developers and researchers often need to spend a significant amount of time to understand how to invoke a software component, prepare data in the required format, and use it in combination with other software. In addition, this time investment makes it challenging to discover and compare software with similar functionality. In this talk I will describe our efforts to address these issues by creating and using Open Knowledge Graphs that describe research software in a machine readable manner. Our work includes: 1) an ontology that extends schema.org and CodeMeta, designed to describe software and the specific data formats it uses; 2) an approach to publish software metadata as an open knowledge graph, linked to other Web of Data objects; 3) a framework for automatically extracting metadata from software repositories; and 4) a framework to curate, query, explore and compare research software metadata in a collaborative manner. The talk will illustrate our approach with real-world examples, including a domain application for inspecting and discovering hydrology, agriculture, and economic software models; and the results of our framework when enriching the research software entries in Zenodo.org.
Scientific Software Registry Collaboration Workshop: From Software Metadata r... (dgarijo)
In this talk I briefly describe our work on OntoSoft for easy software metadata representation, and how new requirements for software reusability are moving us towards knowledge graphs of scientific software metadata.
WDPlus: Leveraging Wikidata to Link and Extend Tabular Data (dgarijo)
Today, data about any domain can be found on the web in data repositories, web APIs and many millions of spreadsheets and CSV files. Researchers and organizations make these data available in a myriad of formats, layouts, terminologies and states of cleanliness that make them difficult to integrate. As a result, researchers aiming to use data in their analyses face three main challenges. The first one is finding datasets related to a feature, variable or topic of interest. For example, climate scientists need to look for years of observational data from authoritative sources when estimating the climate of a region. The second challenge is completing a given dataset with existing knowledge: machine learning applications are data hungry and require as many data points and features as possible to improve their predictions, which often requires integrating data from different sources. The third challenge is sharing integrated results: once several datasets have been merged together, how to make them available to the rest of the community?
OKG-Soft: An Open Knowledge Graph With Machine Readable Scientific Software M... (dgarijo)
Scientific software is crucial for understanding, reusing and reproducing results in computational sciences. Software is often stored in code repositories, which may contain human readable instructions necessary to use it and set it up. However, a significant amount of time is usually required to understand how to invoke a software component, prepare data in the format it requires, and use it in combination with other software. In this presentation we introduce OKG-Soft, an open knowledge graph that describes scientific software in a machine readable manner. OKG-Soft includes: 1) an ontology designed to describe software and the specific data formats it uses; 2) an approach to publish software metadata as an open knowledge graph, linked to other Web of Data objects; and 3) a framework to annotate, query, explore and curate scientific software metadata.
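Because OKG-Soft exposes the metadata as an open knowledge graph, it can be queried with plain SPARQL. In the hedged sketch below the endpoint URL is a placeholder, and the sd: properties follow the Software Description Ontology this work builds on, so verify them against the published vocabulary.

```python
# Hedged sketch of querying a software-metadata knowledge graph with SPARQL;
# the endpoint URL is a placeholder and property names should be verified.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/sparql")  # hypothetical endpoint
sparql.setQuery("""
    PREFIX sd: <https://w3id.org/okn/o/sd#>
    SELECT ?software ?desc WHERE {
        ?software a sd:Software ;
                  sd:description ?desc .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["software"]["value"], row["desc"]["value"])
```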
Towards Human-Guided Machine Learning - IUI 2019 (dgarijo)
Automated Machine Learning (AutoML) systems are emerging that automatically search for possible solutions from a large space of possible kinds of models. Although fully automated machine learning is appropriate for many applications, users often have knowledge that supplements and constrains the available data and solutions. This paper proposes human-guided machine learning (HGML) as a hybrid approach where a user interacts with an AutoML system and tasks it to explore different problem settings that reflect the user’s knowledge about the data available. We present: 1) a task analysis of HGML that shows the tasks that a user would want to carry out, 2) a characterization of two scientific publications, one in neuroscience and one in political science, in terms of how the authors would search for solutions using an AutoML system, 3) requirements for HGML based on those characterizations, and 4) an assessment of existing AutoML systems in terms of those requirements.
Capturing Context in Scientific Experiments: Towards Computer-Driven Science (dgarijo)
Scientists publish computational experiments in ways that do not facilitate reproducibility or reuse. Significant domain expertise, time and effort are required to understand scientific experiments and their research outputs. In order to improve this situation, mechanisms are needed to capture the exact details and the context of computational experiments. Only then will intelligent systems be able to help researchers understand, discover, link and reuse the products of existing research.
In this presentation I will introduce my work and vision towards enabling scientists to share, link, curate and reuse their computational experiments and results. In the first part of the talk, I will present my work on capturing and sharing the context of scientific experiments by using scientific workflows and machine readable representations. Thanks to this approach, experiment results are described in an unambiguous manner, have a clear trace of their creation process and include a pointer to the sources used for their generation. In the second part of the talk, I will describe examples of how the context of scientific experiments may be exploited to browse, explore and inspect research results. I will end the talk by presenting new ideas for improving and benefiting from the capture of the context of scientific experiments, and for involving scientists in the process of curating and creating abstractions on top of available research metadata.
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met... (dgarijo)
Traditional approaches to ontology development have a large lapse between the time when a user of the ontology finds a need to extend it and the time when it actually gets extended. For scientists, this delay can be weeks or months and can be a significant barrier to adoption. We present a new approach to ontology development and data annotation enabling users to add new metadata properties on the fly as they describe their datasets, creating terms that can be immediately adopted by others and eventually become standardized. This approach combines a traditional, consensus-based approach to ontology development with a crowdsourced approach where expert users (the crowd) can dynamically add terms as needed to support their work. We have implemented this approach as a socio-technical system that includes: 1) a crowdsourcing platform to support metadata annotation and the addition of new terms, 2) a range of social editorial processes to make standardization decisions for those new terms, and 3) a framework for ontology revision and updates to the metadata created with the previous version of the ontology. We present a prototype implementation for the paleoclimate community, the Linked Earth Framework, currently containing 700 datasets and engaging over 50 active contributors. Users exploit the platform to do science while extending the metadata vocabulary, thereby producing useful and practical metadata.
WIDOCO: A Wizard for Documenting Ontologies (dgarijo)
WIDOCO is a WIzard for DOCumenting Ontologies that guides users through the documentation process of their vocabularies. Given an RDF vocabulary, WIDOCO detects missing vocabulary metadata and creates a documentation with diagrams, human readable descriptions of the ontology terms and a summary of changes with respect to previous versions of the ontology. The documentation consists of a set of linked, enriched HTML pages that can be further extended by end users. WIDOCO is open source and builds on well-established Semantic Web tools. So far, it has been used to document more than one hundred ontologies in different domains.
We propose a new area of research on automating data narratives. Data narratives are containers of information about computationally generated research findings. They have three major components: 1) a record of events that describes a new result through a workflow and/or the provenance of all the computations executed; 2) persistent entries for the key entities involved, such as data, software versions, and workflows; 3) a set of narrative accounts that are automatically generated, human-consumable renderings of the record and entities, and can be included in a paper. Different narrative accounts can be used for different audiences with different content and details, based on the level of interest or expertise of the reader. Data narratives can make science more transparent and reproducible, because they ensure that the text description of the computational experiment reflects with high fidelity what was actually done. Data narratives can be incorporated in papers, either in the methods section or as supplementary materials. We introduce DANA, a prototype that illustrates how to generate data narratives automatically, and describe the information it uses from the computational records. We also present a formative evaluation of our approach and discuss potential uses of automated data narratives.
Automated Hypothesis Testing with Large Scale Scientific Workflows (dgarijo)
(Credit to Varun Ratnakar and Yolanda Gil).
The automation of important aspects of scientific data analysis would significantly accelerate the pace of science and innovation. Although important aspects of data analysis can be automated, the hypothesize-test-evaluate discovery cycle is largely carried out by hand by researchers. This introduces a significant human bottleneck, which is inefficient and can lead to erroneous and incomplete explorations. We introduce a novel approach to automate the hypothesize-test-evaluate discovery cycle with an intelligent system that a scientist can task to test hypotheses of interest in a data repository. Our approach captures three types of data analytics knowledge: 1) common data analytic methods represented as semantic workflows; 2) meta-analysis methods that aggregate those results, represented as meta-workflows; and 3) data analysis strategies that specify for a type of hypothesis what data and methods to use, represented as lines of inquiry. Given a hypothesis specified by a scientist, appropriate lines of inquiry are triggered, which lead to retrieving relevant datasets, running relevant workflows on that data, and finally running meta-workflows on workflow results. The scientist is then presented with a level of confidence on the initial hypothesis (or a revised hypothesis) based on the data and methods applied. We have implemented this approach in the DISK system, and applied it to multi-omics data analysis.
OntoSoft: A Distributed Semantic Registry for Scientific Software (dgarijo)
Credit to Yolanda Gil.
OntoSoft is a distributed semantic registry for scientific software. This paper describes three major novel contributions of OntoSoft: 1) a software metadata registry designed for scientists, 2) a distributed approach to software registries that targets communities of interest, and 3) metadata crowdsourcing through access control. Software metadata is organized using the OntoSoft ontology along six dimensions that matter to scientists: identify software, understand and assess software, execute software, get support for the software, do research with the software, and update the software. OntoSoft is a distributed registry where each site is owned and maintained by a community of interest, with a distributed semantic query capability that allows users to search across all sites. The registry has metadata crowdsourcing capabilities, supported through access control so that software authors can allow others to expand on specific metadata properties.
OEG tools for supporting Ontology Engineering (dgarijo)
The document summarizes several tools developed by the Ontology Engineering Group (OEG) to support ontology engineering, including Vocabularium for serving ontologies online, OnToology for evaluation reports, documentation and publishing of ontologies, AR2DTool for ontology diagrams, Widoco for HTML documentation, and OOPS! for ontology quality evaluations. It provides an overview of the capabilities of each tool and URLs for their websites and GitHub repositories.
Software Metadata: Describing "dark software" in GeoSciences (dgarijo)
This document discusses describing "dark software" or unshared scientific software in geosciences. It proposes using the OntoSoft ontology to capture standardized metadata about scientific software. This would allow software to be more discoverable, reusable and reproducible. The document outlines the types of metadata captured by OntoSoft and demonstrates how it can be used to describe software and facilitate search and comparison of different tools.
Reproducibility Using Semantics: An Overview (dgarijo)
Overview of the different approaches for addressing reproducibility (using semantics) in laboratory protocols, workflow description and publication, and workflow infrastructure. Furthermore, Research Objects are introduced as a means to capture the context and annotations of scientific experiments, together with the privacy and IPR concerns that may arise. This presentation was given at Dagstuhl Seminar 16041: http://www.dagstuhl.de/16041
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence (IndexBug)
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Driving Business Innovation: Latest Generative AI Advancements & Success Story (Safe Software)
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Webinar: Designing a schema for a Data Warehouse (Federico Razzoli)
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous sources, including the databases that back the applications used by the company, data files exported by some applications, and APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which first requires gathering information about the business processes to be analysed. These processes must then be translated into so-called star schemas: denormalised schemas in which each table represents either a dimension or facts (a minimal sketch follows the topic list below).
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
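Not covered verbatim in the webinar, but as an illustration of the star-schema idea above, here is a minimal sketch using Python's built-in sqlite3; the toy retail tables, columns, and values are all invented for this example:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: denormalised descriptive attributes.
CREATE TABLE dim_date (
    date_id   INTEGER PRIMARY KEY,
    full_date TEXT NOT NULL,          -- e.g. '2024-06-01'
    month     TEXT NOT NULL,
    year      INTEGER NOT NULL
);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    category   TEXT NOT NULL
);
-- Fact table: one row per sale (the chosen granularity), holding measures
-- plus a foreign key into each dimension.
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    date_id    INTEGER NOT NULL REFERENCES dim_date(date_id),
    product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
    quantity   INTEGER NOT NULL,
    amount_eur REAL NOT NULL
);
""")
conn.execute("INSERT INTO dim_date VALUES (1, '2024-06-01', 'June', 2024)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Espresso machine', 'Kitchen')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 2, 398.0)")

# A typical analytical query: aggregate the facts, slice by dimension attributes.
for row in conn.execute("""
    SELECT d.year, p.category, SUM(f.amount_eur)
    FROM fact_sales f
    JOIN dim_date d ON d.date_id = f.date_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY d.year, p.category
"""):
    print(row)

Keeping the fact table narrow and the dimensions denormalised is what makes such queries simple to write and fast to aggregate.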
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system (a minimal metrics sketch follows this list).
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
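As a small taste of topic 8 (not code from the tutorial itself), here is a minimal sketch using the official prometheus_client library; the metric names, threshold, and fake model are invented, and a Prometheus server would scrape the endpoint this script exposes on port 8000:

import random
import time

from prometheus_client import Counter, Gauge, start_http_server

ANOMALIES = Counter("anomalies_detected_total",
                    "Number of anomalies flagged by the model")
SCORE = Gauge("anomaly_score",
              "Most recent anomaly score produced by the model")

def fake_inference() -> float:
    # Stand-in for the real anomaly detection model; scores in [0, 1].
    return random.random()

if __name__ == "__main__":
    start_http_server(8000)   # metrics at http://localhost:8000/metrics
    while True:
        score = fake_inference()
        SCORE.set(score)
        if score > 0.9:        # arbitrary threshold for this sketch
            ANOMALIES.inc()
        time.sleep(1)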
Main news related to the CCS TSI 2023 (2023/1695) (Jakub Marek)
An English 🇬🇧 translation of the presentation accompanying the talk I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, held at the Clarion Hotel Olomouc from 7 to 9 November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack (shyamraj55)
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Building Production Ready Search Pipelines with Spark and Milvus (Zilliz)
Spark is a widely used ETL tool for processing, indexing, and ingesting data into the serving stack for search. Milvus is a production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data into vector representations and push the vectors to the Milvus vector database for search serving. A minimal sketch of that ingestion path follows.
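The sketch below is not from the talk: it assumes a Milvus instance reachable at localhost:19530 and stubs the embedding model with a deterministic toy function; the collection name and sample documents are invented:

import hashlib

from pymilvus import MilvusClient
from pyspark.sql import SparkSession

DIM = 8

def toy_embed(text: str) -> list:
    # Deterministic stand-in for a real embedding model.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:DIM]]

# Use Spark to turn unstructured text into vector representations.
spark = SparkSession.builder.appName("milvus-ingest").getOrCreate()
docs = spark.createDataFrame([(0, "anomaly detection"), (1, "vector search")],
                             ["id", "text"])
rows = docs.rdd.map(lambda r: {"id": r.id, "vector": toy_embed(r.text)}).collect()
spark.stop()

# Push the vectors to Milvus and run a sample search against them.
client = MilvusClient(uri="http://localhost:19530")
client.create_collection(collection_name="docs", dimension=DIM)
client.insert(collection_name="docs", data=rows)
hits = client.search(collection_name="docs", data=[toy_embed("search")], limit=1)
print(hits)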
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar, with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing, so that you can lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to part 6 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover test automation with generative AI and OpenAI.
This webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into integrating generative AI into UiPath's test automation solution using OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. Testers and automation professionals will gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Threats to mobile devices are more prevalent than ever and increasing in scope and complexity. Users of mobile devices want to take full advantage of their features, but many of those features trade security for convenience and capability. This best-practices guide outlines steps users can take to better protect their personal devices and information.
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Generating privacy-protected synthetic data using Secludy and Milvus (Zilliz)
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf (Chart Kalyan)
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
What do a Lego brick and the XZ backdoor have in common? (Speck&Tech)
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case share much more than that.
Join the presentation to immerse yourself in a story of interoperability, standards, and open formats, and then discuss the important role that contributors play in a sustainable open-source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia association, where she was involved in several events, migrations, and training activities related to LibreOffice. She previously worked on LibreOffice migrations and training courses for several public administrations and private companies. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not following her passion for computers and for Geeko, she cultivates her curiosity about astronomy (which is where her nickname, deneb_alpha, comes from).
On Specifying and Sharing Scientific Workflow Optimization Results Using Research Objects
1. On Specifying and Sharing Scientific Workflow Optimization Results Using Research Objects
8th Workshop On Workflows in Support of Large-Scale Science, 17 November 2013
Sonja Holl*, Daniel Garijo+, Khalid Belhajjame$, Olav Zimmermann*, Renato De Giovanni#, Matthias Obst~, Carole Goble$
*Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich, Germany
+Ontology Engineering Group, Facultad de Informática, Universidad Politécnica de Madrid, Spain
$School of Computer Science, University of Manchester, UK
#Reference Center on Environmental Information, Campinas, SP, Brazil
~Department of Biological and Environmental Sciences, University of Gothenburg, Sweden
2. Scientific Workflows
- Popular choice to design, manage, and execute in silico experiments
- Sharing and reuse via workflow repositories
3. Ecological Niche Modeling
- Modelling species adaptation to environmental changes (BioVeL Project)
[Figure: the five numbered steps of the ecological niche modelling process]
4. Ecological Niche Modeling Workflow
[Figure: workflow diagram. Inputs Parameter, Occurrence Data, Environmental Layer, and Geographic Mask feed createModel; the model then passes through testModel and calcAUC, producing the AUC output]
6. Ecological Niche Modeling Workflow
[Figure: the same workflow with its choice points made explicit: parameters Gamma, Cost, and NumberOfPseudoAbsences, and a selection of modelling algorithms (SVM, Maxent, GARP) feeding createModel, followed by testModel and calcAUC producing the AUC]
7. Ecological Niche Modeling Workflow
[Figure: the same workflow overlaid with scattered candidate values (e.g. ‐3.2, 0.5, 100, gaussian) under the labels "Select Algorithms" and "Select Parameters", illustrating how quickly the combined search space of algorithm and parameter settings grows]
8. Common strategies to handle this challenge
- Default parameters & applications
- Trial and error
- Parameter sweeps
But:
- Increasing complexity of scientific workflows
- Rising number of parameters
- Work-time and compute intensive
10. Intelligent automated optimization techniques
Goal: an automated way to find the workflow settings that optimize the output
- Define workflow output(s) as the fitness value
- Use the fitness value for evaluation (e.g. AUC or a correlation coefficient)
- Use a heuristic search algorithm to find the best settings (a minimal sketch follows)
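The deck contains no code, so here is a minimal sketch of the idea, assuming a hypothetical run_workflow function that executes the (sub-)workflow with one parameter setting and returns its fitness (e.g. the AUC). The genetic-algorithm rates match slide 17 (mutation 0.1, crossover 0.7) and the Gamma range matches slide 18; everything else, including the toy fitness surface and the Cost range, is invented for illustration:

import random

# Hypothetical stand-in for executing the (sub-)workflow with one parameter
# setting and reading back its fitness output (e.g. AUC).
def run_workflow(gamma: float, cost: float) -> float:
    # Toy fitness surface for illustration only; a real run would execute
    # createModel/testModel/calcAUC and return the AUC.
    return 1.0 - ((gamma - 4.0) ** 2 + (cost - 2.0) ** 2) / 200.0

# Search space: Gamma is a double in [0, 10] (slide 18); the same range is
# assumed for Cost purely for this sketch.
BOUNDS = {"gamma": (0.0, 10.0), "cost": (0.0, 10.0)}
MUTATION_RATE, CROSSOVER_RATE = 0.1, 0.7   # values shown on slide 17

def random_individual():
    return {k: random.uniform(*b) for k, b in BOUNDS.items()}

def mutate(ind):
    return {k: random.uniform(*BOUNDS[k]) if random.random() < MUTATION_RATE else v
            for k, v in ind.items()}

def crossover(a, b):
    if random.random() < CROSSOVER_RATE:
        return {k: random.choice((a[k], b[k])) for k in a}
    return dict(a)

def optimize(pop_size=20, generations=10):
    population = [random_individual() for _ in range(pop_size)]
    best = max(population, key=lambda i: run_workflow(**i))
    for _ in range(generations):           # termination: fixed generation count
        scored = sorted(population, key=lambda i: run_workflow(**i), reverse=True)
        parents = scored[: pop_size // 2]  # simple truncation selection
        population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                      for _ in range(pop_size)]
        best = max(population + [best], key=lambda i: run_workflow(**i))
    return best, run_workflow(**best)

if __name__ == "__main__":
    setting, fitness = optimize()
    print(f"Best setting {setting} with fitness {fitness:.3f}")

In the real framework, the evaluation step would dispatch workflow runs through Taverna (potentially in parallel) rather than call a local function.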
11. How does it work?
- Development of an optimization framework that extends the Taverna workflow management system
- Abstracts the optimization process (e.g. parallel execution, security)
- A developer API allows rapid adaptation of new optimization methods
- Optimization plugins can be added independently
[Figure: architecture stack with the Taverna WMS at the base, the optimization framework layer above it, and an API through which plugins such as Parameter Optimization and Component Optimization are attached]
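The deck names the developer API but does not show it, so the following is a purely hypothetical sketch of what a pluggable optimization contract could look like; every class and method name here is invented, and the real Taverna API will differ:

from abc import ABC, abstractmethod
from typing import Callable

class OptimizationPlugin(ABC):
    # Hypothetical contract: plugins see only parameter settings and fitness
    # values; execution details stay inside the framework.

    @abstractmethod
    def propose(self, history: list) -> list:
        """Return the next batch of parameter settings (dicts) to evaluate."""

    @abstractmethod
    def finished(self, history: list) -> bool:
        """Termination condition, e.g. a generation limit or a fitness plateau."""

def optimize(plugin: OptimizationPlugin,
             run_workflow: Callable[[dict], float]):
    # The framework abstracts parallel execution and security away from the
    # plugin; runs are sequential here for simplicity.
    history = []   # list of (setting, fitness) pairs
    while not plugin.finished(history):
        for setting in plugin.propose(history):
            history.append((setting, run_workflow(setting)))
    return max(history, key=lambda h: h[1])

A genetic-algorithm plugin like the one on the following slides would implement propose as selection, crossover, and mutation over the best settings in history.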
12. Taverna Optimization Framework & Plugin
(1) Define the sub-workflow
(2) Specify input parameters (constraints)
(3) Select fitness output parameters (e.g. AUC)
(4) Define optimization method parameters (population size, termination criteria)
[Figure: screenshot of the Genetic Algorithm Parameter Optimization Plugin stepping through generations 1, 2, ..., x with the best fitness improving from 0.34 to 0.42, 0.48, and finally 0.49, before displaying the optimization result]
13. Status quo
- Workflow optimization starts from scratch each time
- Optimization meta-data are lost
Idea: capture optimization meta-data next to traditional provenance data
⇒ learn from and extend prior optimization runs
⇒ improve and accelerate the optimization process
14. Research Objects
- Aligned with W3C standards
- Aggregate various resources
- Describe scientific processes in a machine-readable format
- Specified by several ontologies
[Figure: a Research Object bundle linking its resources via ore:aggregates]
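To make the ore:aggregates relation concrete, here is a minimal rdflib sketch; the ORE and wf4ever RO namespaces are the real ones, while the example.org resource URIs are invented:

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")
RO = Namespace("http://purl.org/wf4ever/ro#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("ore", ORE)
g.bind("ro", RO)

# A Research Object aggregating a workflow and one of its result files.
ro = EX["research-object/1"]
g.add((ro, RDF.type, RO.ResearchObject))
for resource in (EX["workflow.t2flow"], EX["results/auc.csv"]):
    g.add((ro, ORE.aggregates, resource))

print(g.serialize(format="turtle"))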
15. Taverna Optimization Framework & Plugin
[Slide repeats the plugin screenshot and the four setup steps from slide 12, as the starting point for capturing optimization meta-data]
16. Optimization Research Object Ontology
opt:OptimizationResearchObject is an rdfs:subClassOf ro:ResearchObject and ore:aggregates the following resources:
- opt:Algorithm: describes the optimization algorithm and its parameters
- opt:Fitness: describes the fitness functions
- opt:Generation: defines the population size and generation number for an optimization run
- opt:OptimizationRun: represents one result set: sub-workflow, parameters, and obtained fitness values
- opt:SearchSpace: describes the dependencies and parameter constraints
- opt:TerminationCondition: describes the termination condition defined by the user
- opt:Workflow: the workflow that was optimized
17. Algorithm
- Genetic Algorithm
- Mutation rate: 0.1
- Crossover rate: 0.7
18. Search Space
Gamma:
- Double
- Range 0-10
- Constraint: Cost/2 < Gamma (fictional)
19. Optimization Run
- Origin of the result
- Parameter setting
- Fitness value
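Pulling slides 16-19 together, here is a minimal rdflib sketch of how one optimization run could be recorded. The class names (opt:Algorithm, opt:SearchSpace, opt:OptimizationRun) and the example values (genetic algorithm with mutation rate 0.1 and crossover rate 0.7; Gamma as a double in 0-10 with the fictional constraint Cost/2 < Gamma) come from the deck, but the opt namespace URI and all property names are invented, since the slides do not show them:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

OPT = Namespace("http://example.org/opt#")   # placeholder namespace
EX = Namespace("http://example.org/enm/")

g = Graph()
g.bind("opt", OPT)

# opt:Algorithm -- the optimization algorithm and its parameters (slide 17).
algo = EX["algorithm"]
g.add((algo, RDF.type, OPT.Algorithm))
g.add((algo, OPT.name, Literal("Genetic Algorithm")))
g.add((algo, OPT.mutationRate, Literal(0.1)))
g.add((algo, OPT.crossoverRate, Literal(0.7)))

# opt:SearchSpace -- parameter types, ranges, and constraints (slide 18).
space = EX["searchSpace"]
g.add((space, RDF.type, OPT.SearchSpace))
g.add((space, OPT.parameter, Literal("Gamma: double, 0-10")))
g.add((space, OPT.constraint, Literal("Cost/2 < Gamma")))

# opt:OptimizationRun -- one result set: origin, setting, fitness (slide 19).
run = EX["run/1"]
g.add((run, RDF.type, OPT.OptimizationRun))
g.add((run, OPT.usedAlgorithm, algo))
g.add((run, OPT.parameterSetting, Literal("Gamma=4.55, Cost=6.7")))
g.add((run, OPT.fitnessValue, Literal(0.49)))

print(g.serialize(format="turtle"))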
20. Taverna Optimization Framework & Plugin
[Figure: the plugin screenshot again, now expanded to show generation 1, iteration 1 (fitness 0.05) behind the best-fitness values 0.34, 0.42, 0.48, ..., 0.49]
21. Taverna Optimization Framework & Plugin
[Figure: the screenshot expanded further: generation 1, iterations 1-6 with fitness values 0.05, 0.22, 0.27, 0.19, 0.31, and 0.34, showing the per-iteration meta-data captured behind the best-fitness display]
24. Benefits of sharing and exploiting Optimization Research Objects
- What is the optimal setting? Reuse optimized settings
- What ranges have been explored? Adopt the parameter ranges used
- What algorithm settings were used? Reuse algorithm settings
- Are there similar optimizations? Reuse existing results
- Resume the optimization
- Embed optimization provenance into workflow infrastructures, to be reused by other scientists
25. Conclusion
- Scientific workflows are hard to configure
- Optimization can help, but meta-data get lost
- Extend Research Objects
- Build a new Optimization Research Object Ontology
- Reuse optimization meta-data to speed up optimization
- Shareable with the community in workflow infrastructures
- Outlook: how to learn from similar workflows?