Online Workflow Management and Performance Analysis with Stampede

Predicting performance of large scientific workflows.

  1. Online Workflow Management and Performance Analysis with Stampede
     Dan Gunter (1), Taghrid Samak (1), Monte Goode (1), Ewa Deelman (2), Gaurang Mehta (2), Fabio Silva (2), Karan Vahi (2), Christopher Brooks (3), Priscilla Moraes (4), Martin Swany (4)
     (1) Lawrence Berkeley National Laboratory; (2) University of Southern California, Information Sciences Institute; (3) University of San Francisco; (4) University of Delaware
  2. Background (CNSM 2011, October 24-28, Paris, France)
  3. Goal: Predict behavior of running scientific workflows
     - Primarily failures
       - Is a given workflow going to “fail”?
       - Are specific resources causing problems?
       - Which application sub-components are failing?
       - Is the data staging a problem?
     - In large workflows, some failures, etc. are normal
       - This work is about learning, from known problems, which patterns of failures, etc. are unusual and require adaptation
     - Do all of this as generally as possible: can we provide a solution that applies to all workflow engines?
  4. Approach
     - Model the monitoring data from running workflows
     - Collect all the data in real-time
     - Run analysis, also in real-time, on the collected data
       - Map low-level failures to application-level characteristics
     - Feed back analysis to user, workflow engine
  5. Scientific Applications: Montage (Astronomy), Epigenome (Bioinformatics), LIGO (Astrophysics), CyberShake (Geophysics)
  6. Domain: Large Scientific Workflows. SCEC-2009: millions of tasks completed per day. [Figure annotation: Radius = 11 million]
  7. Workflow structure
  8. Basic terms and concepts. [Diagram labels: Workflow, Workflow Management System, Resources, Execution, Success, Fail]
  9. Base technologies
     - Workflow management systems
       - Pegasus
       - www.pegasus.isi.edu
     - Monitoring and data analysis
       - NetLogger
       - www.netlogger.lbl.gov
  10. Data Model
  11. Data Model Goals
     - Be widely applicable: there are many workflow engines out there that could benefit.
     - Provide everything we need for Pegasus workflows
  12. Abstract and Executable Workflows
     - Workflows start as a resource-independent statement of computations, input and output data, and dependencies
       - This is called the Abstract Workflow (AW)
     - For each workflow run, Pegasus-WMS plans the workflow, adding helper tasks and clustering small computations together
       - This is called the Executable Workflow (EW)
     - Note: most of the logs are from the EW, but the user really only knows the AW.
  13. Additional Terminology
     - Workflow: container for an entire computation
     - Sub-workflow: workflow that is contained in another workflow
     - Task: representation of a computation in the AW
     - Job: node in the EW
       - May represent part of a task (e.g., a stage-in/out), one task, or many tasks
     - Job instance: job scheduled or run by the underlying system
       - Due to retries, there may be multiple job instances per job
     - Invocation: one or more executables for a job instance
       - Invocations are the instantiation of tasks, whereas jobs are an intermediate abstraction for use by the planning and scheduling sub-systems
  14. Denormalized Data Model
     - Stream of timestamped “events”:
       - unique, hierarchical name
       - unique identifiers (workflow, job, etc.)
       - values and metadata
     - Used the NETCONF YANG data-modeling language, keyed on event name [RFCs: 6020 6021 (6087)]
       - YANG schema (see bit.ly/nQfPd1) documents and validates each log event
     - Snippet of schema:

         container stampede.xwf.start {
             description "Start of executable workflow";
             uses base-event;
             leaf restart_count {
                 type uint32;
                 description "Number of times workflow was restarted (due to failures)";
             }
         }
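To make the event stream concrete, here is a minimal sketch of reading one such event. It assumes each event arrives as a single line of space-separated `name=value` pairs carrying `ts` and `event` fields, in the spirit of the NetLogger Best Practices format. Only the event name `stampede.xwf.start` and the `restart_count` leaf come from the slide; the sample line, the `xwf.id` field, and the `parse_event` helper are illustrative assumptions.

```python
import shlex


def parse_event(line):
    """Parse one line of name=value pairs into a dict (hypothetical helper)."""
    fields = {}
    # shlex.split keeps quoted values such as msg="two words" in one token.
    for token in shlex.split(line):
        name, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"not a name=value pair: {token!r}")
        fields[name] = value
    return fields


# Made-up sample; only the event name and restart_count appear in the slide.
sample = ("ts=2011-10-27T10:00:00.000000Z event=stampede.xwf.start "
          "xwf.id=wf-0001 restart_count=0")
event = parse_event(sample)

# The schema snippet declares restart_count as uint32, so convert and check it.
restart_count = int(event["restart_count"])
if not 0 <= restart_count < 2**32:
    raise ValueError("restart_count out of uint32 range")
```

Keying the dictionary lookup on the `event` field mirrors the slide's point that the schema is keyed on event name; real validation would be done against the YANG schema rather than by hand as here.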
  15. Relational data model. [Diagram: tables grouped under Abstract Workflow (AW), Executable Workflow (EW), and AW and EW]
     - workflow: Workflow
     - workflow_state: Workflow status
     - task: Task
     - task_edge: Task parent and child
     - job: Job
     - job_edge: Job parent and child
     - job_instance: Job Instance
     - jobstate: Job status
     - invocation: Invocation
  16. Infrastructure
  17. Infrastructure overview. [Diagram labels: Raw logs, Normalized logs, Query, Subscribe]
  18. Detailed data flow. [Diagram labels: Pegasus, Log collection and normalization, NetLogger, Failure detection, Real-time analysis, Relational archive]
  19. Message bus usage. [Diagram: BP log events, with routing key = event name, go to an AMQP exchange; the exchange feeds one or more queues; analysis clients subscribe to the queues and receive the data]
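Because the routing key is the hierarchical event name, a subscriber can select whole families of events by pattern. The sketch below illustrates this with AMQP topic-exchange matching rules (`*` matches exactly one dot-separated word, `#` matches zero or more). The slide does not say which exchange type Stampede uses, so the topic exchange, the queue names, and the `stampede.job.end` event name are assumptions for illustration; no broker is contacted.

```python
def topic_matches(pattern, routing_key):
    """Return True if an AMQP topic binding pattern matches a routing key."""
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i, j):
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more words of the routing key.
            return any(match(i + 1, j2) for j2 in range(j, len(k) + 1))
        if j < len(k) and (p[i] == "*" or p[i] == k[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


def route(bindings, routing_key):
    """Names of the queues whose binding pattern matches the routing key."""
    return sorted(q for q, pat in bindings.items()
                  if topic_matches(pat, routing_key))


# Hypothetical queues, one per analysis client.
bindings = {
    "archive": "#",                      # receives every event
    "workflow_monitor": "stampede.xwf.#",
    "job_failures": "stampede.job.#",
}
workflow_queues = route(bindings, "stampede.xwf.start")
job_queues = route(bindings, "stampede.job.end")
```

With these bindings, a workflow-start event reaches the archive and the workflow monitor, while a job event reaches the archive and the job-failure client.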
  20. Analysis
  21. Experimental dataset summary

      Application     Workflows       Jobs       Tasks       Edges
      CyberShake            881    288,668     577,330   1,245,845
      Periodograms           45     80,158   1,894,921      80,113
      Epigenome              46     10,059      29,837      23,425
      Montage                76     56,018     613,107     287,146
      Broadband              66     44,182     104,275     141,922
      LIGO                   26      2,116       2,141       6,203
      Total               1,140    481,201   3,221,611   1,784,654
  22. Workflow clustering
     - Features collected for each workflow run
       - Successful jobs
       - Failed jobs
       - Success duration
       - Fail duration
     - Offline clustering on historical data
       - Algorithm: k-means
     - Online analysis classifies workflows according to nearest cluster
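The two phases on this slide can be sketched in a few lines: k-means over historical feature vectors, then nearest-centroid assignment for a running workflow. This is a plain Lloyd's-algorithm sketch, not the authors' implementation; the feature values, `k=2`, the unscaled Euclidean distance, and the function names are all assumptions made for illustration.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group; keep the old
        # centroid if a group happens to be empty.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Made-up history: (successful jobs, failed jobs, success duration, fail duration)
history = [
    (95, 5, 900.0, 30.0), (90, 10, 850.0, 60.0), (98, 2, 950.0, 10.0),
    (10, 90, 100.0, 800.0), (20, 80, 150.0, 700.0), (5, 95, 60.0, 900.0),
]
centroids = kmeans(history, k=2)          # offline step

running_workflow = (8, 92, 80.0, 850.0)   # features observed so far
assigned = nearest(running_workflow, centroids)  # online step
```

In practice the features would likely need scaling so that durations do not dominate job counts; the slide does not say how this was handled.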
  23. “High Failure” Workflows (HFW)
     - The workflow engine keeps retrying workflows until they complete or time out
     - But in the experimental logs, workflows are never marked as “failed”
       - Aside: this is fixed in the newest version
     - Therefore, we use a simple heuristic for identifying workflows as problematic:
       - HFW means: > 50% of jobs failed
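The HFW heuristic is a single comparison. A minimal sketch follows; the function name and the `threshold` parameter are hypothetical, and the strict inequality follows the "> 50%" wording on the slide.

```python
def is_high_failure(failed_jobs, total_jobs, threshold=0.5):
    """True when strictly more than `threshold` of the jobs failed."""
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    return failed_jobs / total_jobs > threshold


# 512 failed out of 905 jobs is about 56.6%, so this workflow counts as HFW.
flagged = is_high_failure(512, 905)

# Exactly half failed: not flagged, because the rule is strictly "> 50%".
borderline = is_high_failure(5, 10)
```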
  24. HFW failure patterns: Montage application
     - X-axis is normalized workflow execution time
     - Y-axis shows the percent of total job failures for this workflow, so far
     - Legend shows, for each workflow, jobs failed/jobs total
  25. More HFW Failure Patterns: Epigenome, Broadband, Montage, CyberShake
  26. Offline clustering: Epigenome. [Figure: projection onto the first 2 principal components (Component 1 vs. Component 2), showing the high-failure workflow cluster and the other 3 clusters]
  27. Online classification. [Figure: workflow classification, Class (1 to 4) vs. Lifetime % (0 to 100); annotations: "High-failure workflow class" and "Doesn’t converge". Legend, Workflows: 21:512/905, 24:28/29, 25:28/29, 27:4/4, 33:28/30, 41:64/89]
  28. Anomaly detection: Montage application
     - X: total number of failures
     - Y: proportion of time-windows experiencing that number of failures or less (cumulative percent)
     - One workflow is marked "Anomalous! See Slide #24"
     - Legend: 46:281/496, 48:62/65, 49:44/73, 50:36/65, 51:22/37, 52:38/51, 53:42/57, 54:32/48
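The curve plotted on this slide is an empirical cumulative distribution over time windows. The sketch below computes it from a list of per-window failure counts. The window counts are made-up numbers and `failure_ecdf` is a hypothetical helper; how the authors decided that a curve is anomalous is not stated on the slide, so the code only produces the curve.

```python
from collections import Counter


def failure_ecdf(failures_per_window):
    """Return sorted (failure_count, proportion of windows with <= count)."""
    n = len(failures_per_window)
    if n == 0:
        raise ValueError("need at least one time window")
    counts = Counter(failures_per_window)
    ecdf, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        ecdf.append((value, running / n))
    return ecdf


# Made-up failure counts for eight consecutive time windows of one workflow.
typical = [0, 0, 1, 0, 2, 0, 1, 0]
typical_curve = failure_ecdf(typical)

# A workflow whose failures arrive in large bursts rises much more slowly.
bursty = [0, 12, 25, 3, 30, 18, 0, 22]
bursty_curve = failure_ecdf(bursty)
```

Plotting each workflow's curve on the same axes reproduces the layout described above: a curve that stays low until a large failure count stands apart from the rest.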
  29. System performance
     - Bars show the rate for each type of query
     - Each panel is an application: broadband, cybershake, epigenome, ligo, montage, periodograms
     - Dashed black lines are median arrival rate for the application
     - X-axis: query type; Y-axis: median queries per minute, log10 scale
     - Query types: 01-JobsTot, 02-JobsState, 03-JobsType, 04-JobsHost, 05-TimeTot, 06-TimeState, 07-TimeType, 08-TimeHost, 09-JobDelay, 10-WfSumm, 11-HostSumm
  30. Summary
     - Real-time failure prediction for scientific workflows is a challenging but important task
     - Unsupervised learning can be used to model high-level workflow failures from historical data
     - High-failure classes of workflows can be predicted in real-time with high accuracy
     - Future directions
       - Analysis: root-cause investigation
       - System: notifications and updates
       - Working with data from other workflow systems
  31. Thank you! For more information, visit the Stampede wiki at: https://confluence.pegasus.isi.edu/display/stampede/
  32. Extra slides
  33. Scalability
  34. Pegasus
     - Maps from abstract to concrete workflow
       - Algorithmic and AI-based techniques
     - Automatically locates physical locations for both workflow components and data
     - Finds appropriate resources to execute
     - Reuses existing data products where applicable
     - Publishes newly derived data products
       - Provides provenance information
  35. NetLogger
     - Logging methodology
       - Timestamped, named messages at the start and end of significant events, with additional identifiers and metadata, in a standard line-oriented ASCII format (Best Practices, or BP)
       - APIs are provided, including in-memory log aggregation for high-frequency events, but message generation is often best done within an existing framework
     - Logging and analysis tools
       - Parse many existing formats to BP
       - Load BP into message bus, MySQL, MongoDB, etc.
       - Generate profiles, graphs, and CSV from BP data
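Logging a message at both the start and the end of an event makes durations recoverable by pairing the two. The sketch below shows that pairing on already-parsed events. The event names (`example.job.start`, `example.job.end`), the `id` field, and the `durations` helper are hypothetical; it assumes events are dicts with an ISO 8601 `ts`, an `event` name ending in `.start` or `.end`, and an identifier.

```python
from datetime import datetime


def durations(events):
    """Pair <name>.start with <name>.end per identifier; return seconds."""
    started = {}
    result = {}
    for ev in events:
        base, _, phase = ev["event"].rpartition(".")
        key = (base, ev["id"])
        ts = datetime.fromisoformat(ev["ts"])
        if phase == "start":
            started[key] = ts
        elif phase == "end" and key in started:
            result[key] = (ts - started.pop(key)).total_seconds()
    return result


# Made-up events for two interleaved jobs.
events = [
    {"ts": "2011-10-27T10:00:00", "event": "example.job.start", "id": "j1"},
    {"ts": "2011-10-27T10:00:05", "event": "example.job.start", "id": "j2"},
    {"ts": "2011-10-27T10:00:42.500000", "event": "example.job.end", "id": "j1"},
    {"ts": "2011-10-27T10:01:05", "event": "example.job.end", "id": "j2"},
]
elapsed = durations(events)
```

Keying on both the event name and the identifier is what lets interleaved events from different jobs be matched correctly.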
