Why do users kill HPC jobs?
Venkatesh-Prasad Ranganath Daniel Andresen
December 17-20, 2018
Context
• HPC clusters are worth millions of dollars

• Critical computations depend on HPC

• Numerous efforts have explored HPC ROI

• Most efforts have focused on improving non-human ROI

• System monitoring and management

• failure, power quality, temperature

• Programming support

• novel abstractions, exascale debugging, couplings
between experiments
Context
• Very few efforts have explored human ROI

• Understand how software engineering aspects
influence development and use of scientific software

• Propose methods to model and measure human ROI

• No observational studies of user triggered wastage

Human ROI/productivity

• Effort expended by users to use HPC clusters

• Gains/Losses incurred by HPC users
Study: Questions
1. For what reasons do users terminate HPC jobs?

2. How often do users terminate jobs?

3. How much compute resource is wasted due to user
terminated jobs? 

4. How do user terminated jobs compare to system and
scheduler terminated jobs and all jobs executed on the
cluster in terms of consumed compute resources? 

5. How does wasted computation translate into user wait
times?

6. How do user terminated jobs compare to system and
scheduler terminated jobs and all jobs executed on the
cluster in terms of user wait times?
Study: Environment
1. Beocat cluster at Kansas State University

   1. XSEDE Federation (XF) Level 3 cluster

   2. ~7900 processor cores / 300+ nodes

   3. 16-80 cores per node

   4. 32GB-1.5TB RAM per node

2. Sun Grid Engine (SGE) was used for job scheduling

3. Around 400 unique users (students + researchers)

4. Supported by 1 application scientist and 2 sys admins
Study: Offline Design
$ qdel 1234
Study: Online Design
$ qdel 1234 -issue "scripting error" -app VASP
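The slides show only the command line of the online design, not how it was implemented. As a rough illustration, a wrapper placed ahead of the real `qdel` could parse the extra options, log the reason, and then delegate. Everything below is an assumption: the function names, the log file `qdel_reasons.log`, and the path `/usr/bin/qdel` are hypothetical, and `-issue` and `-app` are options of the study's wrapper, not of stock SGE `qdel`.

```python
import argparse
import getpass
import json
import subprocess
import sys
import time


def parse_kill_request(argv):
    """Parse `<job_id> [-issue TEXT] [-app NAME]` as shown on the slide."""
    parser = argparse.ArgumentParser(prog="qdel")
    parser.add_argument("job_id")
    parser.add_argument("-issue", default="")  # free-form reason for the kill
    parser.add_argument("-app", default="")    # application the job was running
    return parser.parse_args(argv)


def make_record(args, user, now):
    """Serialize one termination reason as a JSON line."""
    return json.dumps(
        {"job_id": args.job_id, "user": user, "issue": args.issue,
         "app": args.app, "time": now},
        sort_keys=True,
    )


def main(argv, log_path="qdel_reasons.log", real_qdel="/usr/bin/qdel"):
    # log_path and real_qdel are hypothetical defaults for this sketch.
    args = parse_kill_request(argv)
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(make_record(args, getpass.getuser(), time.time()) + "\n")
    # Hand the job id to the scheduler's own qdel.
    return subprocess.run([real_qdel, args.job_id]).returncode


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

Keeping the reason optional matches the voluntary participation described in the study: a plain `qdel 1234` still works and simply records an empty reason.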
Study: Execution
• Conducted from Aug 15, 2016 through Dec 31, 2017

• Used intervention to encourage users to participate in the
study; participation was voluntary and IRB approved

• Manually aggregated collected free-form reasons

• Used SGE accounting files as the source of runtime
information

• Analyzed collected reasons and runtime info using Awk :)

• Artifacts and scripts available at
https://bitbucket.org/rvprasad/why-do-users-kill-hpc-jobs
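The study's own analysis used Awk scripts from the repository above. The following Python sketch is not that code; it only illustrates how per-job runtime can be summed from an SGE accounting file. It assumes the colon-separated record layout described in the SGE `accounting(5)` man page (0-based positions 11, 12, 13, and 36 for `failed`, `exit_status`, `ru_wallclock`, and `cpu`), and it assumes that "normal exit" means both `failed` and `exit_status` are zero. Both assumptions should be checked against the local SGE version and the study's scripts.

```python
from collections import defaultdict

# Assumed 0-based field positions in an SGE accounting record (accounting(5)).
FAILED, EXIT_STATUS, RU_WALLCLOCK, CPU = 11, 12, 13, 36


def summarize(lines):
    """Sum wall-clock and CPU seconds, grouped by normal ("Y") or abnormal ("N") exit."""
    totals = defaultdict(lambda: {"jobs": 0, "wc": 0.0, "cpu": 0.0})
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split(":")
        if len(fields) <= CPU:
            continue  # skip truncated or malformed records
        # Assumption: a job exited normally if failed == 0 and exit_status == 0.
        normal = fields[FAILED] == "0" and fields[EXIT_STATUS] == "0"
        bucket = totals["Y" if normal else "N"]
        bucket["jobs"] += 1
        bucket["wc"] += float(fields[RU_WALLCLOCK])
        bucket["cpu"] += float(fields[CPU])
    return dict(totals)
```

A typical call would be `summarize(open(path))`, where `path` points at the cluster's accounting file.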
Job Costs
Normal Exit    CPU Time (s)      WC Time (s)
Y              59,664,147,967    13,865,524,891
N              17,452,418,827     3,088,336,839
Total          77,116,566,794    16,953,861,730
Terminated      7,375,029,412     2,162,356,250
9.56% of total CPU time was wasted

12.75% of total WC (User) time was wasted

42.25% of total abnormal exit CPU time was wasted

70.02% of total abnormal exit WC (User) time was wasted
639,102 (649,542) jobs were executed (submitted)

26,967 jobs were terminated by users

13,598 jobs were executing during termination
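The four percentages follow directly from the table. The short Python check below recomputes them from the slide's totals, treating user-terminated time as a share of all time and as a share of abnormal-exit time; the variable and function names are illustrative. The recomputed values agree with the slide to within about 0.01 percentage point.

```python
# Totals copied from the Job Costs table, in seconds.
CPU = {"normal": 59_664_147_967, "abnormal": 17_452_418_827, "terminated": 7_375_029_412}
WC = {"normal": 13_865_524_891, "abnormal": 3_088_336_839, "terminated": 2_162_356_250}


def wasted_share(costs):
    """Return (% of total time, % of abnormal-exit time) spent in user-terminated jobs."""
    total = costs["normal"] + costs["abnormal"]
    of_total = 100 * costs["terminated"] / total
    of_abnormal = 100 * costs["terminated"] / costs["abnormal"]
    return of_total, of_abnormal


for name, costs in (("CPU", CPU), ("WC", WC)):
    of_total, of_abnormal = wasted_share(costs)
    print(f"{name}: {of_total:.2f}% of total, {of_abnormal:.2f}% of abnormal exits")
```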
Reasons & Their Costs
Reasons for User Triggered Terminations CPU Time % WC Time %
1 Exploring and testing Beocat 10.41 32.50
2 System errors 10.10 6.06
3 Incorrect application parameters
4 Decided to change application parameters
5 Computation has converged 4.99
6 Computation is not converging 3.98
7 Application code crashed or encountered errors
8 Job script encountered errors 5.46
9 Decided to change job parameters
10 Issues with requested amount of memory
11 Job will not finish on time 3.08 5.23
12 Testing or debugging code
13 External user error
14 Conflicts with other submitted jobs 4.98
15 Unable to understand the provided reason 9.79 3.83
16 Inefficient use of resources
17 No reasons were provided 45.57 37.13
Total (seconds) 7,375,029,412 2,162,356,250
Remediations for Top Reasons
• System errors: Improve cluster reliability and reduce system
failures

• Conflicts with other submitted jobs: Help users identify and
use useful configurations

• Computation has converged: Use automation to detect
convergence

• Computation is not converging: Use automation to avoid or
detect divergent computations

• Job will not finish on time: Help users to better estimate time
required for jobs

• Exploring and testing Beocat: Limit compute time or use
dedicated testing sub-cluster or job queue with different SLA
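As one possible reading of "use automation to detect convergence", a job could track a scalar quantity such as a residual or energy per iteration and stop itself once that quantity stops changing. The sketch below is a generic relative-change criterion, not a technique taken from the study; the function name, tolerance, and window size are illustrative assumptions.

```python
def has_converged(history, rel_tol=1e-6, window=3):
    """True if the last `window` successive relative changes are all below rel_tol."""
    if len(history) < window + 1:
        return False  # not enough iterations to judge yet
    recent = history[-(window + 1):]
    for prev, curr in zip(recent, recent[1:]):
        # Guard the denominator so two exact zeros count as unchanged.
        scale = max(abs(prev), abs(curr), 1e-300)
        if abs(curr - prev) / scale >= rel_tol:
            return False
    return True
```

Calling this check once per iteration and exiting cleanly when it returns `True` would let the job end with a normal exit instead of waiting for the user to issue `qdel`.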
Possible Data Quality Issues
• Missing Reasons
• Incomprehensible reasons / No reasons are provided

• Ungathered Reasons
• Crashed or unterminated jobs whose results were
discarded

• Inconsistent Reasons
• Differing reasons for same kind of jobs or situations

• Misclassified Reasons
• Rater biases and human error
Offline Design vs Online Design
Current and Future Work
• Educate users about using Beocat

• Reduce wastage using existing techniques

• E.g., explore use of checkpointing solutions

• Revamp monitoring and data collection on Beocat

• Explore options to address data quality

• Repeat the study on other clusters similar to Beocat

• E.g., other XSEDE (XF) level 3 clusters

• Repeat the study on clusters not similar to Beocat

• E.g., XSEDE (XF) level 1 and 2 clusters
Takeaway
Call to Action
• User terminated HPC jobs contribute a non-trivial amount of
wasted computation, e.g., about 10% of execution time

• Top reasons for users to terminate HPC jobs can

• be tackled with existing techniques or

• serve as good research directions to improve HPC

• Repeat the study on your clusters to understand the kinds of
wastage in different HPC environments

• Explore human (soft) aspects in HPC
https://bitbucket.org/rvprasad/why-do-users-kill-hpc-jobs
