Why do Users kill HPC Jobs?

Venkatesh-Prasad Ranganath, Daniel Andresen
December 17-20, 2018

Context
• HPC clusters are worth millions of dollars

• Critical computations depend on HPC

• Numerous efforts have explored HPC ROI

• Most efforts have focused on improving non-human ROI

• System monitoring and management

• failure, power quality, temperature

• Programming support

• novel abstractions, exascale debugging, couplings
between experiments

Context
• Very few efforts have explored human ROI

• Understand how software engineering aspects
influence development and use of scientific software

• Propose methods to model and measure human ROI

• No observational studies of user-triggered wastage

Human ROI/productivity

• Effort expended by users to use HPC clusters

• Gains/Losses incurred by HPC users

Study: Questions
1. For what reasons do users terminate HPC jobs?

2. How often do users terminate jobs?

3. How much compute resource is wasted by user-terminated
jobs?

4. How do user-terminated jobs compare to system- and
scheduler-terminated jobs, and to all jobs executed on the
cluster, in terms of consumed compute resources?

5. How does wasted computation translate into user wait
times?

6. How do user-terminated jobs compare to system- and
scheduler-terminated jobs, and to all jobs executed on the
cluster, in terms of user wait times?

Study: Environment
1. Beocat cluster at Kansas State University

   1. XSEDE Federation (XF) Level 3 cluster

   2. ~7900 processor cores / 300+ nodes

   3. 16-80 cores per node

   4. 32GB-1.5TB RAM per node

2. Sun Grid Engine (SGE) was used for job scheduling

3. Around 400 unique users (students + researchers)

4. Supported by 1 application scientist and 2 system administrators

Study: Offline Design
$ qdel 1234

Study: Online Design
$ qdel 1234 -issue "scripting error" -app VASP
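
The online design adds options to qdel for recording why a job is being deleted. The slides do not show how this was implemented, so the following is only a minimal sketch of how a wrapper might separate the study options from the arguments meant for the scheduler. The names split_qdel_args and make_log_record, and the JSON record layout, are assumptions for illustration; only the -issue and -app options come from the slide.

```python
import json
import time

# Options shown on the slide; everything else here is hypothetical.
STUDY_OPTIONS = ("-issue", "-app")


def split_qdel_args(argv):
    """Split wrapper arguments into study metadata and arguments for qdel."""
    metadata = {}
    passthrough = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in STUDY_OPTIONS:
            if i + 1 >= len(argv):
                raise ValueError(f"{arg} requires a value")
            metadata[arg.lstrip("-")] = argv[i + 1]
            i += 2
        else:
            passthrough.append(arg)
            i += 1
    return metadata, passthrough


def make_log_record(user, metadata, passthrough, now=None):
    """Build one JSON line describing the termination (assumed layout)."""
    record = {
        "time": int(time.time() if now is None else now),
        "user": user,
        "qdel_args": passthrough,
        "issue": metadata.get("issue"),  # None when the user gave no reason
        "app": metadata.get("app"),
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    meta, rest = split_qdel_args(
        ["1234", "-issue", "scripting error", "-app", "VASP"]
    )
    print(make_log_record("alice", meta, rest, now=0))
    # A real wrapper would append the record to a log file and then invoke
    # the scheduler's own qdel with `rest` (for example via subprocess.run).
```

Because both options are optional in this sketch, a plain "qdel 1234" still works and is logged with no reason, which would correspond to the "No reasons were provided" category.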

Study: Execution
• Conducted from Aug 15, 2016 through Dec 31, 2017

• Used an intervention to encourage users to participate in the
study; participation was voluntary and IRB-approved

• Manually aggregated the collected free-form reasons

• Used SGE accounting files as the source of runtime
information

• Analyzed collected reasons and runtime info using Awk :)

• Artifacts and scripts available at
https://bitbucket.org/rvprasad/why-do-users-kill-hpc-jobs
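
The study's own analysis scripts are written in Awk and live in the repository above. As a rough illustration of the kind of processing involved, here is a Python sketch that reads colon-separated SGE accounting lines and totals CPU and wall-clock time by exit status. The field positions are assumed from the SGE accounting(5) file format and should be checked against the scheduler version in use; treating "failed == 0 and exit_status == 0" as a normal exit is also an assumption, not necessarily the study's definition.

```python
from collections import namedtuple

# Zero-based field positions, assumed from the accounting(5) man page.
OWNER, JOB_NUMBER = 3, 5
SUBMISSION_TIME, START_TIME = 8, 9
FAILED, EXIT_STATUS, RU_WALLCLOCK = 11, 12, 13
CPU = 36

JobRecord = namedtuple(
    "JobRecord", "owner job_id wait_s wallclock_s cpu_s normal_exit"
)


def parse_accounting_line(line):
    """Return a JobRecord, or None for comments, blanks, and short lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(":")
    if len(fields) <= CPU:
        return None
    return JobRecord(
        owner=fields[OWNER],
        job_id=fields[JOB_NUMBER],
        # Time spent queued before the job started running.
        wait_s=float(fields[START_TIME]) - float(fields[SUBMISSION_TIME]),
        wallclock_s=float(fields[RU_WALLCLOCK]),
        cpu_s=float(fields[CPU]),
        # Assumed definition of a normal exit.
        normal_exit=(fields[FAILED] == "0" and fields[EXIT_STATUS] == "0"),
    )


def totals_by_exit(lines):
    """Sum [cpu_s, wallclock_s] separately for normal and abnormal exits."""
    totals = {True: [0.0, 0.0], False: [0.0, 0.0]}
    for line in lines:
        rec = parse_accounting_line(line)
        if rec is None:
            continue
        totals[rec.normal_exit][0] += rec.cpu_s
        totals[rec.normal_exit][1] += rec.wallclock_s
    return totals
```

Joining these records with the logged termination reasons by job ID would then give per-reason costs.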

Job Costs
Normal Exit     CPU Time (s)       WC Time (s)
Y             59,664,147,967    13,865,524,891
N             17,452,418,827     3,088,336,839
Total         77,116,566,794    16,953,861,730
Terminated     7,375,029,412     2,162,356,250
9.56% of total CPU time was wasted

12.75% of total WC (User) time was wasted

42.25% of total abnormal exit CPU time was wasted

70.02% of total abnormal exit WC (User) time was wasted
639,102 jobs were executed (649,542 were submitted)

26,967 jobs were terminated by users

13,598 of these jobs were executing when they were terminated
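
The wastage percentages on this slide can be re-derived from the table. The sketch below assumes the denominators are the "Total" row (normal plus abnormal exits) and the "N" row (abnormal exits), with user-terminated jobs counted among the abnormal exits; the slides do not state this explicitly, but the figures agree with it to within rounding.

```python
# Figures copied from the Job Costs table, in seconds.
CPU = {
    "normal": 59_664_147_967,
    "abnormal": 17_452_418_827,
    "terminated": 7_375_029_412,
}
WC = {
    "normal": 13_865_524_891,
    "abnormal": 3_088_336_839,
    "terminated": 2_162_356_250,
}


def wasted_share(costs):
    """Return (% of total, % of abnormal-exit) time used by terminated jobs."""
    total = costs["normal"] + costs["abnormal"]
    of_total = 100.0 * costs["terminated"] / total
    of_abnormal = 100.0 * costs["terminated"] / costs["abnormal"]
    return of_total, of_abnormal


if __name__ == "__main__":
    for name, costs in (("CPU", CPU), ("WC", WC)):
        of_total, of_abnormal = wasted_share(costs)
        print(f"{name}: {of_total:.2f}% of total, "
              f"{of_abnormal:.2f}% of abnormal-exit time")
```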

Reasons & Their Costs
Reasons for User Triggered Terminations CPU Time % WC Time %
1 Exploring and testing Beocat 10.41 32.50
2 System errors 10.10 6.06
3 Incorrect application parameters
4 Decided to change application parameters
5 Computation has converged 4.99
6 Computation is not converging 3.98
7 Application code crashed or encountered errors
8 Job script encountered errors 5.46
9 Decided to change job parameters
10 Issues with requested amount of memory
11 Job will not finish on time 3.08 5.23
12 Testing or debugging code
13 External user error
14 Conflicts with other submitted jobs 4.98
15 Unable to understand the provided reason 9.79 3.83
16 Inefficient use of resources
17 No reasons were provided 45.57 37.13
Total (seconds) 7,375,029,412 2,162,356,250
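
To give the percentages a sense of scale, they can be converted into hours using the totals in the last row. The sketch below does this only for rows that list both a CPU and a WC percentage; the helper name pct_to_hours is hypothetical.

```python
# Totals from the last row of the table, in seconds.
TOTAL_CPU_S = 7_375_029_412
TOTAL_WC_S = 2_162_356_250

# (CPU time %, WC time %) for rows that list both values.
REASONS = {
    "Exploring and testing Beocat": (10.41, 32.50),
    "System errors": (10.10, 6.06),
    "Job will not finish on time": (3.08, 5.23),
    "Unable to understand the provided reason": (9.79, 3.83),
    "No reasons were provided": (45.57, 37.13),
}


def pct_to_hours(pct, total_seconds):
    """Convert a percentage of a total given in seconds into hours."""
    return pct / 100.0 * total_seconds / 3600.0


if __name__ == "__main__":
    for reason, (cpu_pct, wc_pct) in REASONS.items():
        cpu_h = pct_to_hours(cpu_pct, TOTAL_CPU_S)
        wc_h = pct_to_hours(wc_pct, TOTAL_WC_S)
        print(f"{reason}: {cpu_h:,.0f} CPU hours, {wc_h:,.0f} WC hours")
```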

Remediations for Top Reasons
• System errors: Improve cluster reliability and reduce system
failures

• Conflicts with other submitted jobs: Help users identify and
use useful configurations

• Computation has converged: Use automation to detect
convergence

• Computation is not converging: Use automation to avoid/
detect divergent computations

• Job will not finish on time: Help users to better estimate time
required for jobs

• Exploring and testing Beocat: Limit compute time or use
dedicated testing sub-cluster or job queue with different SLA
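
As one illustration of "use automation to detect convergence", a job could monitor a scalar quantity and stop on its own once that quantity stabilises, or give up after a fixed iteration budget. The sketch below uses a relative-change test over a short window; the function names, tolerance, and window size are assumptions for illustration, and real applications would need a criterion suited to their numerics.

```python
def has_converged(history, tol=1e-6, window=3):
    """True if the last `window` relative changes are all below `tol`."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    for prev, curr in zip(recent, recent[1:]):
        scale = max(abs(prev), abs(curr), 1e-12)  # avoid division by zero
        if abs(curr - prev) / scale >= tol:
            return False
    return True


def run(step, x0, max_iters=1000, tol=1e-6, window=3):
    """Apply `step` repeatedly; return (value, iterations, converged)."""
    history = [x0]
    for i in range(max_iters):
        history.append(step(history[-1]))
        if has_converged(history, tol, window):
            return history[-1], i + 1, True
    # Budget exhausted: report non-convergence instead of running on.
    return history[-1], max_iters, False


if __name__ == "__main__":
    # Newton iteration for sqrt(2) settles quickly.
    print(run(lambda x: 0.5 * (x + 2.0 / x), 1.0))
    # Doubling never settles, so the run stops at the iteration budget.
    print(run(lambda x: 2.0 * x, 1.0, max_iters=50))
```

The same check covers both remediation bullets: the job exits early when the computation has converged, and the iteration budget bounds the cost of a computation that is not converging.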

Possible Data Quality Issues
• Missing Reasons
• Incomprehensible reasons / No reasons are provided

• Ungathered Reasons
• Crashed or unterminated jobs whose results were
discarded

• Inconsistent Reasons
• Differing reasons for same kind of jobs or situations

• Misclassified Reasons
• Rater biases and human error

Offline Design vs Online Design

Current and Future Work
• Educate users about using Beocat

• Reduce wastage using existing techniques

• E.g., explore use of checkpointing solutions

• Revamp monitoring and data collection on Beocat

• Explore options to address data quality

• Repeat the study on other clusters similar to Beocat

• E.g., other XSEDE (XF) level 3 clusters

• Repeat the study on clusters not similar to Beocat

• E.g., XSEDE (XF) level 1 and 2 clusters

Takeaway
Call to Action
• User-terminated HPC jobs contribute a non-trivial amount of
wasted computation, e.g., about 10% of execution time

• Top reasons for users to terminate HPC jobs can

• be tackled with existing techniques or

• serve as good research directions to improve HPC

• Repeat the study on your clusters to understand the kinds of
wastage in different HPC environments

• Explore human (soft) aspects in HPC
https://bitbucket.org/rvprasad/why-do-users-kill-hpc-jobs
