VIP: design and implementation of the portal and execution service
1. VIP: design and implementation
of the portal and execution service
Rafael FERREIRA DA SILVA
CNRS, CREATIS, INSA-Lyon, Université Lyon 1, INSERM
For the VIP Project Consortium:
VIP Launching Workshop
Lyon, December 14th 2012
Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
2. Outline
Introduction
VIP Architecture
Web Portal
Data Transfers
Workflow Execution
Workflow Self-Healing
Conclusions
http://vip.creatis.insa-lyon.fr Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
3. Platform goals
Multi-modality medical image simulators
MRI, US, CT and PET
Objectives
Workflow execution on EGI
Access to storage resources
High-level interface for non-experts
No IT required
Software as a Service (SaaS)
No client software installation
New features automatically available
Consolidated support and troubleshooting
4. VIP – Architecture
[Figure: VIP architecture. Components: Object Model Repository, Simulated Data Repository, Data Management, Workflow Engine, Job Generation (GASW), Job Scheduler]
5. VIP – Web Portal
User Front-End
Openly-accessible web portal
Access point to models and simulators.
User-friendly interface which assists users in using image
simulators.
Modular code design (GWT + SmartGWT)
6. Users/Apps Management
[Figure: management panels for Users, Groups, Application Classes and Applications]
7. VIP – GRIDA
Grid Data Management Agent
Handles file catalog and transfer operations by pooling
Performs data replication
8. Data Transfers Management
User Machine, VIP Server, Grid Storage
Upload: user uploads file to VIP Server; GRIDA uploads file to the grid (replication)
Download: GRIDA downloads file to VIP Server; user downloads the file
9. VIP – Data Repositories
Easy integration of third-party libraries
NeuSemStore-Provenance for simulated data
NeuSemStore-Simulated-Objects for the model catalog
Encapsulation of objects as GWT serialized beans
More details in the presentation of B. Gibaud
[Figure: GWT Client, GWT Server, Databases; labels: RPC call, NeuSemStore, GWT Bean]
10. VIP – Workflow Engine
MOTEUR workflow engine
Applications described in a formal language
http://modalis.i3s.unice.fr/softwares/moteur
Generic Application Service Wrapper (GASW)
Bash scripts wrapped in grid jobs
Self-healing of workflow execution
11. VIP – Architecture
Workload Management System with Pilot Jobs
Distributed Infrastructure with Remote Agent Control (DIRAC) [CPPM-LHCb]
http://diracgrid.org
Hosted by CC-IN2P3
French National Instance
Data Storage and Computing Back-End
EGI infrastructure, Biomed VO
http://www.egi.eu
12. Workflow Execution
1. Input data upload
2. User launches a simulation
3. MOTEUR generates invocations
4. GASW generates grid jobs
5. Jobs are submitted to DIRAC
6. Pilot jobs are submitted to EGI
7. Pilot jobs fetch grid jobs
8. Inputs download
9. Execution
10. Results upload
11. Download results
14. Workflow Self-Healing
Problem: costly manual operations
Rescheduling tasks, restarting services, killing misbehaving
experiments or replicating data files
Objective: automated platform administration
Autonomous detection of operational incidents
Perform appropriate set of actions
Assumptions: online and non-clairvoyant
Only partial information available
Decisions must be fast
Production conditions: no prediction of user activity or workloads
15. General MAPE-K loop
[Figure: MAPE-K loop (Monitoring, Analysis, Planning, Execution, Knowledge)]
Monitoring: triggered by an event (job completion and failures) or a timeout
Analysis: monitoring data yields a degree η for each incident; each incident has levels 1 to 3
Incident 1: degree η = 0.8
Incident 2: degree η = 0.4
Incident 3: degree η = 0.1
Roulette wheel selection: incident i is selected with probability $\frac{\eta_i}{\sum_{j=1}^{n} \eta_j}$ (in the example: Incident 1)
Planning: roulette wheel selection based on association rules for incident 1 (in the example: Incident 2)
Rule | Confidence (ρ) | ρ × η
2 → 1 | 0.8 | 0.32
3 → 1 | 0.2 | 0.02
1 → 1 | 1.0 | 0.80
Execution: set of actions
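The two selection steps of this loop can be sketched in a few lines of Python. This is only an illustration: it assumes that the ρ × η products are used directly as the weights of the second roulette wheel, and the function names are hypothetical, not taken from the VIP code base.

```python
import random


def selection_probabilities(degrees):
    """Normalize incident degrees into roulette wheel probabilities."""
    total = sum(degrees.values())
    if total <= 0:
        raise ValueError("at least one degree must be positive")
    return {name: eta / total for name, eta in degrees.items()}


def roulette_select(weights, rng=random):
    """Pick one key with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Incident degrees (eta) from the slide.
degrees = {1: 0.8, 2: 0.4, 3: 0.1}
probabilities = selection_probabilities(degrees)

# Association rules "source -> 1" with their confidence rho, from the slide.
rules_for_incident_1 = {2: 0.8, 3: 0.2, 1: 1.0}
# Assumption: each rule is weighted by rho times the degree of its source.
rule_weights = {
    source: rho * degrees[source]
    for source, rho in rules_for_incident_1.items()
}

rng = random.Random(42)  # seeded only to make the sketch reproducible
first_pick = roulette_select(degrees, rng)        # selection on degrees
second_pick = roulette_select(rule_weights, rng)  # selection on rules
```

Because the selection is stochastic, an incident with a low degree can still be picked occasionally, which is the point of using a roulette wheel rather than always taking the maximum.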
16. Incident: Activity Blocked
An invocation is late compared to the others
[Figures: Invocation completion rate for a simulation; Job flow for a simulation]
Possible causes
Longer waiting times
Lost tasks (e.g. killed by site due to quota violation)
Resources with poor performance
17. Activity blocked: degree
Degree computed from all completed jobs of the activity
Job phases: setup, inputs download, execution, outputs upload
Assumption: bag-of-tasks (all jobs have equal durations)
Median-based estimation:
Median duration of job phases | Estimated job duration | Real job duration | Status
50s | 42s | 42s | completed
250s | 300s | 300s | completed
400s | 400s* | 20s | current
15s | 15s | ? |
Mi = 715s | Ei = 757s
*: max(400s, 20s) = 400s
Incident degree: job performance w.r.t. median
$d = \frac{E_i}{M_i + E_i} \in [0, 1]$
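The estimation above can be reproduced with a short Python sketch. It assumes that the four table rows map, in order, to the four job phases, and that the estimate uses the real duration for completed phases, the larger of median and elapsed time for the current phase, and the median for phases not yet started. The function and variable names are hypothetical.

```python
PHASES = ("setup", "inputs download", "execution", "outputs upload")


def blocked_degree(medians, real, current_phase):
    """Return (M, E, d) for one job.

    medians: median duration of each phase over completed jobs (seconds)
    real: real duration of each completed phase, plus the time already
          spent in the current phase (seconds)
    current_phase: name of the phase the job is currently in
    """
    current_index = PHASES.index(current_phase)
    estimated = 0.0
    for index, phase in enumerate(PHASES):
        if index < current_index:
            estimated += real[phase]  # completed phase: real duration
        elif index == current_index:
            # current phase: at least the median, more if already exceeded
            estimated += max(medians[phase], real[phase])
        else:
            estimated += medians[phase]  # future phase: median
    median_total = float(sum(medians[phase] for phase in PHASES))
    return median_total, estimated, estimated / (median_total + estimated)


# Values from the slide.
medians = {"setup": 50, "inputs download": 250,
           "execution": 400, "outputs upload": 15}
real = {"setup": 42, "inputs download": 300, "execution": 20}

m_i, e_i, d = blocked_degree(medians, real, "execution")
print(m_i, e_i, round(d, 3))
```

With these inputs the job is estimated at 757s against a median of 715s, so d is slightly above 0.5; a job that runs exactly like the median job gives d = 0.5.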
18. Activity blocked: levels and actions
Levels: identified from the platform logs
A threshold τ1 on the degree d separates Level 1 (no actions) from Level 2 (action: replicate jobs)
[Figure: Replication process for one task]
Actions
Job replication
Cancel replicas with bad performance
Replicate only if all active replicas are running
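The replication guard can be illustrated with a minimal Python sketch. Several points are assumptions made for the example and are not stated on the slide: Level 2 is taken to mean d ≥ τ1, "active" is taken to mean queued or running, and the `Replica` type and its status names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    job_id: str
    status: str  # hypothetical states: "queued", "running", "completed", "cancelled"


def should_replicate(degree, tau1, replicas):
    """Decide whether to submit one more replica of a task.

    Assumptions: Level 2 starts at degree >= tau1, and a replica is
    active while it is queued or running.
    """
    if degree < tau1:
        return False  # Level 1: no actions
    active = [r for r in replicas if r.status in ("queued", "running")]
    # Replicate only if every active replica has actually started running;
    # a replica still waiting in a queue may yet start and catch up.
    return bool(active) and all(r.status == "running" for r in active)


running_only = [Replica("a", "running")]
one_queued = [Replica("a", "running"), Replica("b", "queued")]

print(should_replicate(0.6, 0.5, running_only))  # eligible for replication
print(should_replicate(0.6, 0.5, one_queued))    # blocked by the queued replica
```

The guard keeps the number of replicas bounded: a new copy is only submitted once all previous copies have left the queue, so slow queues alone do not trigger a burst of submissions.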
19. Experimental results
Goal: Self-Healing vs No-Healing
Cope with recoverable errors
Metrics
Makespan of the activity execution
Self-Healing speeds up FIELD-II execution by up to a factor of 4
Resource waste
$w = \frac{(\mathrm{CPU} + \mathrm{data})_{\text{self-healing}}}{(\mathrm{CPU} + \mathrm{data})_{\text{no-healing}}} - 1$
For w < 0: self-healing consumed fewer resources
For w > 0: self-healing wasted resources
Repetition | w
1 | -0.10
2 | -0.15
3 | -0.09
4 | 0.05
5 | -0.26
Self-Healing process reduced resource consumption by up to 26% compared to the No-Healing execution
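As a quick check of how the waste metric reads, the following Python sketch computes w from two consumption totals and recovers the 26% figure from the reported values. The function name and the 74/100 input pair are hypothetical and chosen only to reproduce w = -0.26; the per-repetition values are those reported on the slide.

```python
def resource_waste(self_healing_total, no_healing_total):
    """w = (CPU + data) with self-healing / (CPU + data) without, minus 1."""
    return self_healing_total / no_healing_total - 1


# Hypothetical totals: 74 units with self-healing versus 100 without.
example_w = resource_waste(74.0, 100.0)

# Values of w reported per repetition.
reported_w = {1: -0.10, 2: -0.15, 3: -0.09, 4: 0.05, 5: -0.26}

best_w = min(reported_w.values())
max_saving_percent = -best_w * 100
wasteful_repetitions = [rep for rep, w in reported_w.items() if w > 0]

print(example_w, max_saving_percent, wasteful_repetitions)
```

Only repetition 4 has w > 0, meaning that self-healing consumed slightly more resources than the control execution in that run.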
R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of operational workflow incidents on distributed computing infrastructures, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Ottawa, Canada, 2012.
20. VIP – Facts
321 registered users, from
38 countries
Most used portal certificate in
EGI (August 2012)
https://wiki.egi.eu/wiki/EGI_robot_certificate_users
Consumed 379 CPU years from
January 2011 to August 2012
http://accounting.egi.eu
1/10 of the total activity of the
biomed international VO. One of
the most active users
21. VIP – Facts
Applications
1155 executed simulations during the last year (~3/day)
Users
[Figure: Distribution of portal users on EGI (August 2012)]
(source: https://wiki.egi.eu/wiki/EGI_robot_certificate_users)
[Figure: Distribution of application executions in VIP (Nov 2011 – Oct 2012)]
22. Concluding remarks
VIP is an openly-accessible web portal for multi-modality
medical image simulators
MRI, US, CT, PET and other tools
Workflow execution on EGI
Access to storage resources
High-level interface for non-experts
No IT required (Software as a Service)
Facts
321 registered users from 38 countries
Consumed about 400 CPU years / year
Limits and perspectives
Fair resource allocation among workflows
User support
Heavy data transfers
23. VIP: design and implementation
of the portal and execution services
Thank you for your attention.
Questions?
http://vip.creatis.insa-lyon.fr