This document discusses workflow monitoring and performance analysis using Stampede. Stampede models monitoring data from running scientific workflows in real-time and performs analysis to predict failures and identify problematic resources. Unsupervised learning is used to cluster workflows based on historical data. Online analysis then classifies new workflows to identify those with high failure rates. The goal is to provide feedback to users and workflow engines to adapt workflows and improve performance.
Online Workflow Management and Performance Analysis with Stampede
1. Online Workflow Management and Performance Analysis with Stampede
Dan Gunter1, Taghrid Samak1, Monte Goode1,
Ewa Deelman2, Gaurang Mehta2, Fabio Silva2, Karan Vahi2,
Christopher Brooks3,
Priscilla Moraes4, Martin Swany4
1 Lawrence Berkeley National Laboratory
2 University of Southern California, Information Sciences Institute
3 University of San Francisco
4 University of Delaware
3. Goal: Predict behavior of running scientific workflows
— Primarily failures
— Is a given workflow going to “fail”?
— Are specific resources causing problems?
— Which application sub-components are failing?
— Is the data staging a problem?
— In large workflows, some failures, etc. are normal
— This work is about learning from known problems, which patterns of failures, etc. are unusual and require adaptation
— Do all of this as generally as possible: can we provide a solution that can apply to all workflow engines?
4. Approach
— Model the monitoring data from running workflows
— Collect all the data in real-time
— Run analysis, also in real-time, on the collected data
— Map low-level failures to application-level characteristics
— Feed back analysis to user, workflow engine
5. Scientific Applications
— Montage (Astronomy)
— Epigenome (Bioinformatics)
— LIGO (Astrophysics)
— CyberShake (Geophysics)
6. Domain: Large Scientific Workflows
SCEC-2009: Millions of tasks completed per day
[Figure annotation: Radius = 11 million]
8. Basic terms and concepts
[Diagram: a Workflow Management System runs a Workflow on Resources; each Execution ends in Success or Fail]
9. Base technologies
— Workflow management system: Pegasus (www.pegasus.isi.edu)
— Monitoring and data analysis: NetLogger (www.netlogger.lbl.gov)
10. Data Model
11. Data Model Goals
— Be widely applicable: there are many workflow engines out there that could benefit
— Provide everything we need for Pegasus workflows
12. Abstract and Executable Workflows
— Workflows start as a resource-independent statement of computations, input and output data, and dependencies
— This is called the Abstract Workflow (AW)
— For each workflow run, Pegasus-WMS plans the workflow, adding helper tasks and clustering small computations together
— This is called the Executable Workflow (EW)
— Note: Most of the logs are from the EW, but the user really only knows the AW
13. Additional Terminology
— Workflow: Container for an entire computation
— Sub-workflow: Workflow that is contained in another workflow
— Task: Representation of a computation in the AW
— Job: Node in the EW
— May represent part of a task (e.g., a stage-in/out), one task, or many tasks
— Job instance: Job scheduled or running by underlying system
— Due to retries, there may be multiple job instances per job
— Invocation: One or more executables for a job instance
— Invocations are the instantiation of tasks, whereas jobs are an intermediate abstraction for use by the planning and scheduling sub-systems
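To make the containment between these terms concrete, here is a minimal Python sketch of the hierarchy; the class and field names are illustrative, not the actual Stampede schema:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Invocation:      # one executable run for a job instance
    exitcode: int = 0

@dataclass
class JobInstance:     # one scheduling attempt of a job; retries create more
    invocations: List[Invocation] = field(default_factory=list)

@dataclass
class Job:             # node in the Executable Workflow (EW)
    task_ids: List[str] = field(default_factory=list)  # part of, one, or many AW tasks
    instances: List[JobInstance] = field(default_factory=list)

@dataclass
class Workflow:        # container for an entire computation
    jobs: List[Job] = field(default_factory=list)
    sub_workflows: List["Workflow"] = field(default_factory=list)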
14. Denormalized Data Model
— Stream of timestamped “events”:
— a unique, hierarchical name
— unique identifiers (workflow, job, etc.)
— values and metadata
— Used the NETCONF YANG data-modeling language, keyed on event name [RFCs 6020, 6021 (6087)]
— YANG schema (see bit.ly/nQfPd1) documents and validates each log event
Snippet of schema
container stampede.xwf.start {
  description "Start of executable workflow";
  uses base-event;
  leaf restart_count {
    type uint32;
    description "Number of times workflow was restarted (due to failures)";
  }
}
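For illustration, a NetLogger Best Practices (BP) event described by a schema like this is a single line of space-separated name=value pairs. The line below is a made-up example: ts, event, and restart_count come from the slide, while level and xwf.id (and all values) are assumptions for illustration:

ts=2011-10-27T10:15:00.000000Z event=stampede.xwf.start level=Info xwf.id=a4799e33-f65e-4c1c-9a27-318f5b2e8f9e restart_count=0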
15. Relational data model
[Schema diagram: on the Abstract Workflow (AW) side, task and task_edge (task parent/child dependencies); on the Executable Workflow (EW) side, job, job_edge (job parent/child dependencies), job_instance, jobstate (job status), and invocation; workflow and workflow_state tie the AW and EW together with overall workflow status]
16. Infrastructure
17. Infrastructure overview
[Diagram: raw logs are collected and normalized; normalized logs can then be queried or subscribed to]
18. Detailed data flow
[Diagram: Pegasus produces raw logs; NetLogger performs log collection and normalization; normalized events feed real-time analysis (failure detection) and a relational archive]
19. Message bus usage
[Diagram: BP log events are published to an AMQP exchange with routing key = event name; each analysis client subscribes to a queue bound to the exchange and receives the matching data]
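As a minimal sketch of this pattern (not the actual Stampede code), the following Python uses the pika AMQP client against an assumed local RabbitMQ broker; the exchange name, queue binding, and BP line are made up:

import pika

# Connect to an assumed local broker and declare a topic exchange.
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.exchange_declare(exchange="stampede", exchange_type="topic")

# Analysis client: bind a private queue to the event names of interest.
q = ch.queue_declare(queue="", exclusive=True).method.queue
ch.queue_bind(exchange="stampede", queue=q, routing_key="stampede.xwf.*")

# Producer side: the routing key is the event name, the body is the BP line.
bp_line = "ts=2011-10-27T10:15:00.000000Z event=stampede.xwf.start restart_count=0"
ch.basic_publish(exchange="stampede", routing_key="stampede.xwf.start",
                 body=bp_line.encode())

# Deliver matching events to the analysis callback.
def on_event(channel, method, properties, body):
    print(method.routing_key, body.decode())

ch.basic_consume(queue=q, on_message_callback=on_event, auto_ack=True)
ch.start_consuming()  # runs until interrupted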
22. Workflow clustering
— Features collected for each workflow run
— Successful jobs
— Failed jobs
— Success duration
— Fail duration
— Offline clustering on historical data
— Algorithm: k-means
— Online analysis classifies workflows according to nearest cluster (sketch below)
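A minimal sketch of this offline-cluster / online-classify loop, assuming scikit-learn; the feature values and the choice of k are made up for illustration:

import numpy as np
from sklearn.cluster import KMeans

# One row per historical workflow run:
# [successful jobs, failed jobs, success duration (s), fail duration (s)]
history = np.array([
    [900.0, 5.0, 3600.0, 40.0],
    [880.0, 20.0, 3500.0, 200.0],
    [100.0, 400.0, 700.0, 5000.0],  # a failure-heavy run
    [850.0, 10.0, 3400.0, 90.0],
])

# Offline: cluster the historical runs.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(history)

# Online: assign a new (possibly still running) workflow to the nearest cluster.
new_run = np.array([[120.0, 380.0, 800.0, 4800.0]])
print("cluster:", model.predict(new_run)[0])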
23. “High Failure” Workflows (HFW)
— The workflow engine keeps retrying workflows until they complete or time out
— But in the experimental logs, workflows are never marked as “failed”
— Aside: this is fixed in the newest version
— Therefore, we use a simple heuristic for identifying workflows as problematic:
— HFW means: > 50% of jobs failed
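The heuristic itself is a one-liner; this sketch just spells it out:

def is_high_failure(failed_jobs: int, total_jobs: int) -> bool:
    """HFW heuristic: more than 50% of the workflow's jobs failed."""
    return total_jobs > 0 and failed_jobs / total_jobs > 0.5

print(is_high_failure(512, 905))  # True (512/905 is about 0.57)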
24. HFW failure patterns
[Figure: Montage application. X-axis is normalized workflow execution time. Y-axis shows the percent of total job failures for this workflow, so far. Legend shows, for each workflow, jobs failed/jobs total.]
25. More HFW Failure Patterns
[Figure panels: Epigenome, Broadband, Montage, CyberShake]
27. Online classification
[Figure: workflow classification over time. X-axis: Lifetime % (0-100). Y-axis: Class (1-4), with annotations marking a class that doesn’t converge and the high-failure workflow class. Legend (as on slide 24, jobs failed/jobs total per workflow): 21:512/905, 24:28/29, 25:28/29, 27:4/4, 33:28/30, 41:64/89.]
28. Anomaly detection
[Figure: Montage application. X-axis: total number of failures. Y-axis: cumulative percent, i.e. the proportion of time-windows experiencing that number of failures or less. One workflow is marked “Anomalous!” (see slide 24). Legend (jobs failed/jobs total): 46:281/496, 48:62/65, 49:44/73, 50:36/65, 51:22/37, 52:38/51, 53:42/57, 54:32/48.]
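A minimal sketch of the curve behind this plot, assuming a per-time-window failure count is already available for one workflow; the windowing and the flagging threshold are assumptions, not necessarily the paper's:

import numpy as np

# Failures counted in each time window of one workflow run (made-up data).
window_failures = np.array([0, 1, 0, 3, 2, 0, 1, 14, 2, 1])

# Empirical CDF: proportion of time-windows with <= x failures.
xs = np.arange(window_failures.max() + 1)
cdf = np.array([(window_failures <= x).mean() for x in xs])
print(dict(zip(xs.tolist(), cdf.round(2).tolist())))

# One plausible flagging rule: windows far out in the tail of the
# historical distribution are anomalous.
threshold = np.quantile(window_failures, 0.95)
print("anomalous windows:", window_failures[window_failures > threshold])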
29. System performance
[Figure: one panel per application (Broadband, CyberShake, Epigenome, LIGO, Montage, Periodograms). Bars show the rate for each type of query, in median queries per minute (log10 scale); dashed black lines are the median arrival rate for the application. Query types: 01-JobsTot, 02-JobsState, 03-JobsType, 04-JobsHost, 05-TimeTot, 06-TimeState, 07-TimeType, 08-TimeHost, 09-JobDelay, 10-WfSumm, 11-HostSumm.]
30. Summary
— Real-time failure prediction for scientific workflows is a challenging but important task
— Unsupervised learning can be used to model high-level workflow failures from historical data
— High failure classes of workflows can be predicted in real-time with high accuracy
— Future directions
— Analysis: root-cause investigation
— System: notifications and updates
— Working with data from other workflow systems
31. Thank you!
For more information, visit the Stampede wiki at:
https://confluence.pegasus.isi.edu/display/stampede/
32. Extra slides..
34. Pegasus
— Maps from abstract to concrete workflow
— Algorithmic and AI-based techniques
— Automatically locates physical locations for both workflow components and data
— Finds appropriate resources to execute
— Reuses existing data products where applicable
— Publishes newly derived data products
— Provides provenance information
35. NetLogger
— Logging Methodology
— Timestamped, named messages at the start and end of significant events, with additional identifiers and metadata in a std. line-oriented ASCII format (Best Practices or BP)
— APIs are provided, incl. in-memory log aggregation for high-frequency events; but message generation is often best done within an existing framework
— Logging and Analysis Tools
— Parse many existing formats to BP
— Load BP into message bus, MySQL, MongoDB, etc.
— Generate profiles, graphs, and CSV from BP data
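Since BP lines are space-separated name=value pairs, a minimal parser sketch is shown below; it deliberately ignores quoting, which real BP values may use for embedded spaces:

def parse_bp(line: str) -> dict:
    """Parse a simple NetLogger BP line into a dict (no quoting support)."""
    return dict(pair.split("=", 1) for pair in line.split())

event = parse_bp("ts=2011-10-27T10:15:00.000000Z event=stampede.xwf.start restart_count=0")
print(event["event"], event["restart_count"])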