Presentation held at CGWS 2012 - Rhodes Island - Greece
Abstract. Archives of distributed workloads acquired at the infrastructure level reputedly lack information about users and application-level middleware. Science gateways provide consistent access points to the infrastructure, and therefore are an interesting information source to cope with this issue. In this paper, we describe a workload archive acquired at the science-gateway level, and we show its added value on several case studies related to user accounting, pilot jobs, fine-grained task analysis, bag of tasks, and workflows. Results show that science-gateway workload archives can detect workload wrapped in pilot jobs, improve user identification, give information on distributions of data transfer times, make bag-of-task detection accurate, and retrieve characteristics of workflow executions. Some limits are also identified.
More information: www.rafaelsilva.com
A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions
1. A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps and workflow executions
Rafael FERREIRA DA SILVA and Tristan GLATARD
University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France
CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing
August 27th 2012
Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
2. Context: Workload Archives
Information produced by grid workflow executions:
exit_code, task_status, submit_time, execution_time, site_name, input_file, workflow_id, activity_name
Useful for:
• Assumptions validation
• Computational activity modeling
• Methods evaluation (simulation or experimental)
3. Science-gateway architecture
Components: User, Web Portal, Workflow Engine, Storage Element, Pilot Manager, Meta-Scheduler, Computing site
0. Login
1. Send input data
2. Transfer input files
3. Launch workflow
4. Generate and submit task
5. Submit pilot jobs
6. Schedule pilot jobs
7. Get task
8. Get files
9. Execute
10. Upload results
4. State of the Art
Grid Workload Archives
• Parallel Workloads Archive (http://www.cs.huji.ac.il/labs/parallel/workload/)
• Grid Workloads Archive (http://gwa.ewi.tudelft.nl/pmwiki/)
Information gathered at infrastructure-level: tasks
[Figure: fields exit_code, task_status, submit_time, execution_time, site_name, input_file, workflow_id, activity_name]
Lack of critical information:
• Dependencies among tasks
• Task sub-steps
• Application-level scheduling artifacts
• User
5. At infrastructure-level
[Figure: science-gateway architecture diagram (same components and steps 0 to 10 as in slide 3)]
6. Outline
A science-gateway workload archive
Case studies
Pilot Jobs
Accounting
Task analysis
Bag of tasks
Workflows
Conclusions
7. Our approach
Science-Gateway Workload Archive
Information gathered at science-gateway level: workflow executions
[Figure: fields exit_code, task_status, submit_time, execution_time, site_name, input_file, workflow_id, activity_name]
Advantages:
• Fine-grained information about tasks
• Dependencies among tasks
• Workflow characterization
• Accounting
8. At science-gateway level
[Figure: science-gateway architecture diagram (same components and steps 0 to 10 as in slide 3)]
9. Virtual Imaging Platform
Virtual Imaging Platform (VIP): http://vip.creatis.insa-lyon.fr
Medical imaging science-gateway
Grid of 129 sites (EGI – http://www.egi.eu)
Significant usage
• Registered users: 244 from 26 countries
• Applications: 18
• Consumed 32 CPU years in 2011
[Figure: VIP web portal screenshot (Applications, File transfer)]
[Figure: VIP usage in 2011: CPU consumption of VIP and related platforms on EGI.]
10. SGWA
Science Gateway Workload Archive (SGWA)
Archive is extracted from VIP
Science-gateway archive model
• Task, Site and Workflow Execution: acquired from databases populated by the workflow engine at runtime
• File and Pilot Job: extracted from the parsing of task standard output and error files
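The archive model can be pictured as a handful of linked record types. The Python sketch below is illustrative only: the class names follow the slide, but the relations between them and every field not listed on the "Context" slide (for example `pilot_id`, `user` or `transfer_time_s`) are assumptions, not the actual SGWA schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Site:
    site_name: str


@dataclass
class PilotJob:
    pilot_id: str  # hypothetical identifier parsed from task stdout/stderr
    site: Site


@dataclass
class File:
    name: str  # hypothetical: file name parsed from task stdout/stderr
    transfer_time_s: float  # hypothetical: measured transfer duration


@dataclass
class Task:
    task_id: str
    activity_name: str
    task_status: str
    exit_code: int
    submit_time: float
    pilot: Optional[PilotJob] = None
    files: List[File] = field(default_factory=list)


@dataclass
class WorkflowExecution:
    workflow_id: str
    user: str  # hypothetical: gateway login of the submitting user
    tasks: List[Task] = field(default_factory=list)


# Build a tiny archive by hand (all values are made up).
site = Site("site-A")
pilot = PilotJob("pilot-1", site)
task = Task("t1", "simulation", "completed", 0, 12.0, pilot,
            [File("input.dat", 3.5)])
execution = WorkflowExecution("wf-1", "alice", [task])
```

Linking each task to both its workflow execution and its pilot job is what later allows users, tasks and pilots to be associated with each other.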
11. Workload for Case Studies
Based on the workload of VIP, January 2011 to April 2012
112 users, 2,941 workflow executions, 339,545 pilot jobs, 680,988 tasks
Tasks by status:
• 338,989 completed
• 138,480 error
• 105,488 aborted
• 15,576 aborted replicas
• 48,293 stalled
• 34,162 queued
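The six status counts add up exactly to the 680,988 tasks of the archive. A short Python check, using only the numbers on this slide, also gives the share of each status:

```python
# Task counts by status, copied from the slide.
status_counts = {
    "completed": 338_989,
    "error": 138_480,
    "aborted": 105_488,
    "aborted replicas": 15_576,
    "stalled": 48_293,
    "queued": 34_162,
}

total_tasks = sum(status_counts.values())
shares = {status: count / total_tasks
          for status, count in status_counts.items()}

# Print one line per status with its percentage of all tasks.
for status, share in shares.items():
    print(f"{status:17s} {share:6.1%}")
```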
12. Pilot Jobs
A single pilot can wrap several tasks and users
At infrastructure-level
• Assimilates pilot jobs to tasks and users
• Valid for only 62% of the tasks
• Valid for 95% of user-task associations
At science-gateway level
• Users and tasks are correctly associated to pilots
[Figure: Tasks per pilot (frequency). 1: 282331, 2: 28121, 3: 11885, 4: 6721, 5: 10487]
[Figure: Users per pilot (frequency). 1: 323214, 2: 15178, 3: 1079, 4: 70, 5: 4]
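One plausible reading of the 62% and 95% figures can be reproduced from the two histograms on this slide. The sketch below assumes that the bar labels map to 1 to 5 tasks (or users) per pilot in order, and that the last bin means exactly 5; both are assumptions made for this illustration.

```python
# Bar labels read from the two histograms on the slide.
# Assumption: labels map to 1..5 in order, and the last bin is exactly 5.
tasks_per_pilot = {1: 282_331, 2: 28_121, 3: 11_885, 4: 6_721, 5: 10_487}
users_per_pilot = {1: 323_214, 2: 15_178, 3: 1_079, 4: 70, 5: 4}

total_pilots = sum(tasks_per_pilot.values())

# Number of tasks wrapped by these pilots (tasks per pilot times frequency).
tasks_in_pilots = sum(n * freq for n, freq in tasks_per_pilot.items())

# Share of tasks that ran alone in their pilot: only for those does
# "one pilot job = one task" hold at infrastructure level.
single_task_share = tasks_per_pilot[1] / tasks_in_pilots

# Share of pilots that served a single user.
single_user_share = users_per_pilot[1] / sum(users_per_pilot.values())

print(f"{single_task_share:.0%} of tasks, {single_user_share:.0%} of pilots")
```

Under these assumptions both histograms sum to the 339,545 pilot jobs of the archive, and the two shares round to 62% and 95%.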
13. Accounting: Users
Authentications based on login and password are mapped to
X.509 robot certificates
At infrastructure-level
All VIP users are reported as a single user
At science-gateway level
Maps task executions to VIP users
[Figure: Number of reported EGI and VIP users per month (months 1 to 16)]
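The effect of the robot certificate on user accounting can be illustrated with a few made-up task records. In this Python sketch the record layout, the user names and the certificate label are all hypothetical; it only shows why counting identities on the certificate column collapses every gateway user into one.

```python
from collections import defaultdict

# Hypothetical task records: (month, gateway_user, certificate_subject).
tasks = [
    (1, "alice", "robot-cert"),
    (1, "bob", "robot-cert"),
    (2, "alice", "robot-cert"),
    (2, "carol", "robot-cert"),
    (2, "dave", "robot-cert"),
]


def users_per_month(records, key_index):
    """Count distinct identities per month using the chosen record column."""
    seen = defaultdict(set)
    for record in records:
        seen[record[0]].add(record[key_index])
    return {month: len(ids) for month, ids in sorted(seen.items())}


# Science-gateway view: identities are gateway logins.
gateway_view = users_per_month(tasks, 1)
# Infrastructure view: identities are certificate subjects.
infrastructure_view = users_per_month(tasks, 2)
```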
14. Accounting: CPU and Wall-clock Time
Huge discrepancy of values
• Pilot jobs do not register to the pilot system
• Absence of workload
• Outputs unretrievable
• Pilot setup time
• Lost tasks (a.k.a. stalled)
Undetectable at infrastructure-level
[Figure: Number of submitted pilot jobs by EGI and VIP, per month (VIP jobs, EGI jobs)]
[Figure: Consumed CPU and wall-clock time by EGI and VIP, in years, per month (VIP CPU time, VIP wall-clock time, EGI CPU time, EGI wall-clock time)]
15. Task Analysis
At infrastructure-level
• Limited to task exit codes
At science-gateway level
• Fine-grained information
• Steps in task life
• Error causes
• Replicas per task
[Figure: Error causes (number of tasks). Categories: application, input, stalled, output, folder. Bar labels: 55165, 50925, 48293, 19463, 1123.]
[Figure: Replicas per task (frequency) for 1, 2, 3, 4, 5 and +5 replicas. Bar labels: 1191, 1285, 401, 347, 322, 6.]
[Figure: Different steps in task life. CDF of download, execution and upload times; x-axis: Time (s), from 1 to 10000.]
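The "steps in task life" figure plots one empirical CDF per task step. A minimal Python sketch of how such curves can be computed from per-task step durations follows; the duration values are invented for the example and the function name is hypothetical.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return (value, fraction of samples <= value) for each distinct value."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(v, bisect_right(ordered, v) / n) for v in sorted(set(ordered))]


# Hypothetical per-step durations in seconds for four tasks.
durations = {
    "download": [2.0, 5.0, 5.0, 40.0],
    "execution": [120.0, 300.0, 900.0, 4000.0],
    "upload": [1.0, 3.0, 8.0, 20.0],
}

cdfs = {step: empirical_cdf(values) for step, values in durations.items()}
```

Plotting each list of points with a logarithmic x-axis would give curves of the same kind as those in the figure.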
16. Bag of Tasks: at Infrastructure level
Evaluation of the accuracy of Iosup et al. [8] method to detect bag of tasks (BoT)
Two successively submitted tasks are in the same BoT if the time interval between submission times is lower or equal to Δ.
Example with tasks 1, 2 and 3 submitted at times t1, t2 and t3:
• Δ1,2 = |t1 - t2| ≤ Δ: Task 1 and Task 2 are in BoT 1
• Δ2,3 = |t2 - t3| > Δ: Task 3 is in BoT 2
[8] Iosup, A., Jan, M., Sonmez, O., Epema, D.: The characteristics and performance of groups of jobs in grids. In: Euro-Par. (2007) 382-393
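The Δ criterion stated on this slide can be sketched in a few lines of Python. This is a simplified illustration, not the implementation of [8]: it works on a single stream of submission times, ignores any further condition the original method may apply (such as grouping per user), and the value of Δ in the example is arbitrary.

```python
def detect_bots(submit_times, delta):
    """Group submission times into bags of tasks using the Δ criterion.

    A task joins the current bag when it was submitted at most `delta`
    seconds after the previously submitted task; otherwise it opens a new bag.
    """
    bots = []
    for t in sorted(submit_times):
        if bots and t - bots[-1][-1] <= delta:
            bots[-1].append(t)
        else:
            bots.append([t])
    return bots


# Three tasks submitted close together, then one much later.
bots = detect_bots([0, 50, 100, 500], delta=120)
```

Because the comparison is always made with the previous task, a long chain of closely spaced submissions ends up in one bag even if its first and last tasks are far apart.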
17. Bag of Tasks: Size and Duration
Infrastructure vs science-gateway
• 90% of Batch BoT sizes range from 2 to 10 tasks, whereas this range covers only 50% of Real Batch
• Non-Batch duration is overestimated by up to 400%
[Figure: CDF of size (number of tasks) for Real Batch and Batch]
[Figure: CDF of duration (s) for Real Batch, Real Non-Batch, Batch and Non-Batch]
Real Batch = ground-truth BoT
Real Non-Batch = ground-truth non-BoT
Batch = Iosup et al. BoT
Non-Batch = Iosup et al. non-BoT
18. Bag of Tasks: Inter-arrival Time and Consumed CPU Time
• Batch and Non-Batch inter-arrival times are underestimated by about 30%
• CPU times are underestimated by 25% for Non-Batch and by about 20% for Batch
[Figure: CDF of inter-arrival time (s) for Real Batch, Real Non-Batch, Batch and Non-Batch]
[Figure: CDF of consumed CPU time (KCPUs) for Real Batch, Real Non-Batch, Batch and Non-Batch]
Real Batch = ground-truth BoT
Real Non-Batch = ground-truth non-BoT
Batch = Iosup et al. BoT
Non-Batch = Iosup et al. non-BoT
19. Workflow Characterization
At infrastructure-level
• Hardly possible
At science-gateway level
• Small (52%): ≤ 100 tasks
• Medium (31%): between 101 and 500 tasks
• Large (17%): > 500 tasks
[Figure: CDFs of size (number of tasks), makespan (s), speedup and critical path length for small, medium, large and total workflow executions]
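The three size classes translate directly into a small classification function. The thresholds in this Python sketch come from the slide; the function name and the sample task counts are made up for the illustration.

```python
from collections import Counter


def size_class(n_tasks):
    """Classify a workflow execution by its number of tasks."""
    if n_tasks <= 100:
        return "small"
    if n_tasks <= 500:
        return "medium"
    return "large"


# Hypothetical task counts for six workflow executions.
sizes = [12, 100, 101, 500, 501, 8000]
distribution = Counter(size_class(n) for n in sizes)
```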
20. Conclusions
Science-gateway model of workload archive
Illustration by using traces of the VIP from 2011/2012
Added value when compared to infrastructure-level traces
Exactly identifies tasks and users
Distinguishes additional workload artifacts from real workload
Fine-grained information about tasks
Ground-truth of bag of tasks
Workflow characterization
Traces are available to the community in the Grid Observatory: http://www.grid-observatory.org
21. A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps and workflow executions
Thank you for your attention.
Questions?
ACKNOWLEDGMENTS
VIP users and project members
French National Agency for Research (ANR-09-COSI-03)
European Grid Initiative (EGI)
France-Grilles