Patrick Hoolboom
September 22, 2016
Incident Management with Workflows
© 2016 BROCADE COMMUNICATIONS SYSTEMS, INC. INTERNAL USE ONLY
What Is a Workflow?
• A sequence of processes through which a piece of work passes from
initiation to completion
• Process as Code
• Living Documentation
– Document your process in a human-readable, executable format
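A minimal sketch of "process as code" in Python, not tied to any particular workflow engine. The names `Step`, `run_workflow`, `describe`, and the two sample steps are hypothetical. The point is only that one ordered list of steps can serve as both the documentation and the thing that executes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical illustration: the "piece of work" is a plain dictionary.
Context = Dict[str, Any]


@dataclass
class Step:
    name: str                          # human-readable description of the step
    run: Callable[[Context], Context]  # the executable part of the step


def run_workflow(steps: List[Step], context: Context) -> Context:
    """Pass the work through every step, from initiation to completion."""
    for step in steps:
        context = step.run(context)
        # Record which steps ran, in order.
        context.setdefault("history", []).append(step.name)
    return context


def describe(steps: List[Step]) -> str:
    """Render the same definition as numbered documentation."""
    return "\n".join(f"{i}. {s.name}" for i, s in enumerate(steps, start=1))


# Sample process, made up for illustration.
steps = [
    Step("Record the alert", lambda ctx: {**ctx, "recorded": True}),
    Step("Notify the on-call engineer", lambda ctx: {**ctx, "notified": True}),
]
```

Calling `describe(steps)` yields the written procedure, while `run_workflow(steps, {...})` executes it, so the two cannot drift apart.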
Event Driven Automation 2.0
• Tools shown on the slide:
– FBAR (saving 1532 hours/day)
– Naoru
– Nurse
– Winston (powered by StackStorm)
– Azure Automation
– Mistral workflow service
– StackStorm automation platform
• Loop shown on the slide: Observe, Orient, Decide, Act
When to Use a Workflow
• Clearly defined process
• When multiple systems or services need to be touched
• Frequently performed tasks
Why Use Workflows?
• Consistency
– Trust that your automations will perform the same tasks every time
for a given event
• Speed
– Reduce time to resolution for an incident
• Audit
– Creates a clear audit trail of what was done when
• Connect Disparate Systems
Tools…
What Can Be Automated?
• Security checks
– On malware detection in a VM, isolate
network port on a switch
• Blue-green app deployment
– When Jenkins tests pass, bring up a new VM
cluster, deploy and configure the app, set the
load balancer to send a % of traffic to the new app,
monitor, then roll forward or back out
• Networking
– When a BGP peer goes down: collect
troubleshooting data, post to Slack, and create a
JIRA ticket
– On a link aggregation member error: check the
load, and if the rest of the LAG bundle has enough
capacity, disable the link with the error
• Restart a down service
– On monitoring event, bounce a service
• OpenStack orphan VM clean-up
– When orphans are detected: shut down, email the owner,
keep for a few days, then delete
• NFV:
– Nokia, AT&T, with Mistral and OpenStack
• OpenStack VM evacuation on
hardware failures
– On host RAID failure, get list of impacted VMs,
email VM owners, evacuate VMs, create JIRA
ticket for hardware replacement.
• Cassandra “node down”
recovery
– Replace a node on alert
• Clean up disk space
– On monitoring event, clean up disk space
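The link aggregation item above hinges on one decision: can the remaining LAG members carry the current load? A minimal sketch of that check follows. The function name, the link names, and the 80% utilization ceiling are assumptions made for illustration; none of them come from the slides.

```python
from typing import Dict


def can_disable_link(
    link_capacity_gbps: Dict[str, float],
    current_load_gbps: float,
    faulty_link: str,
    max_utilization: float = 0.8,  # assumed safety margin, not from the slides
) -> bool:
    """Return True if the remaining LAG members can carry the current load."""
    remaining = sum(
        capacity
        for name, capacity in link_capacity_gbps.items()
        if name != faulty_link
    )
    if remaining == 0:
        return False  # never disable the last working member
    return current_load_gbps <= remaining * max_utilization


# Hypothetical three-member bundle of 10 Gbps links.
lag = {"eth1": 10.0, "eth2": 10.0, "eth3": 10.0}
```

A workflow would run this check after collecting the load and disable the faulty link only when it returns `True`. Otherwise it would leave the link alone and escalate.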
StackStorm
Architecture
• Clients: Web GUI, CLI, ChatOps
• Platform: REST API, Rules Engine (IFTTT.yml), Workflow Engine, KV Store, Sensor Containers, Action Runners, AMQP message bus
• Plugins: Sensor Plugins (inbound integrations), Action Plugins (outbound integrations)
• Other labels on the diagram: Master, Content Repo, to Audit…
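To make the flow through these components concrete, here is a minimal if-this-then-that sketch in Python: a sensor emits a trigger, the rules engine matches it against rules, and a matching rule runs its action. This is not the StackStorm API. `Rule`, `RulesEngine`, the trigger name `monitoring.service_down`, and the payload fields are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]


@dataclass
class Rule:
    name: str
    trigger: str                         # trigger type emitted by a sensor
    criteria: Callable[[Payload], bool]  # the "if this" part
    action: Callable[[Payload], Any]     # the "then that" part


@dataclass
class RulesEngine:
    rules: List[Rule] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)

    def dispatch(self, trigger: str, payload: Payload) -> List[Any]:
        """Run the action of every rule that matches the incoming trigger."""
        results = []
        for rule in self.rules:
            if rule.trigger == trigger and rule.criteria(payload):
                # Keep a record of what fired and why.
                self.audit_log.append(f"{rule.name} fired on {trigger}")
                results.append(rule.action(payload))
        return results


engine = RulesEngine()
engine.rules.append(
    Rule(
        name="restart_down_service",
        trigger="monitoring.service_down",
        criteria=lambda p: p.get("service") == "nginx",
        # A real action would call out to a plugin; here it only returns text.
        action=lambda p: f"restart {p['service']} on {p['host']}",
    )
)
```

Keeping the audit log inside the dispatch loop means every automated action leaves a record, whichever rule caused it.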
Workflow Design Patterns
Diagnostic Workflows
• Troubleshooting and data gathering steps
• No remediations or changes to the system
• Good way to “get your feet wet” with workflows
Workflow Design Patterns
Remediation Workflows
• Fix the issue!
• Should be triggered after diagnostic workflows if applicable
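The split between the two patterns can be sketched as two functions: a read-only diagnostic step whose output feeds a remediation step. This is an illustrative sketch only. The probe fields, the function names, and the "restart service" action are assumptions, and the probe data is passed in rather than collected from a real host.

```python
from typing import Any, Dict, Optional

Report = Dict[str, Any]


def diagnose(probe: Dict[str, bool]) -> Report:
    """Diagnostic workflow: read-only, collects facts about the service."""
    return {
        "process_running": probe.get("process_running", False),
        "port_open": probe.get("port_open", False),
    }


def remediate(report: Report) -> Optional[str]:
    """Remediation workflow: acts only on a cause it recognizes."""
    if not report["process_running"]:
        return "restart service"
    return None  # unrecognized state: make no change


def handle(probe: Dict[str, bool]) -> Dict[str, Any]:
    report = diagnose(probe)    # diagnostics always run first
    action = remediate(report)  # remediation consumes the diagnostic output
    return {"report": report, "action": action}
```

Because `diagnose` changes nothing, it can be deployed on its own first, with `remediate` added once the reports have proven reliable.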
Workflow Use Cases
Facilitated Troubleshooting
• Useful if you don’t quite trust the automation
– Gain confidence in your workflows
• Faster Time to Resolution
• Consistent Data Collection
• Diagnostic workflow with notifications
– Send data to user via
• Email
• Chat
• Ticketing System
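A diagnostic workflow with notifications can be sketched as one summary fanned out to several channels. The sender functions below are hypothetical stand-ins that only format a string. Real ones would call an email, chat, or ticketing API, which is deliberately left out here.

```python
from typing import Callable, Dict, List


# Hypothetical channel senders, kept side-effect free so the sketch runs anywhere.
def send_email(summary: str) -> str:
    return f"email: {summary}"


def send_chat(summary: str) -> str:
    return f"chat: {summary}"


def open_ticket(summary: str) -> str:
    return f"ticket: {summary}"


CHANNELS: Dict[str, Callable[[str], str]] = {
    "email": send_email,
    "chat": send_chat,
    "ticket": open_ticket,
}


def notify(diagnostics: Dict[str, str], channels: List[str]) -> List[str]:
    """Send the same diagnostic summary to every requested channel."""
    # Sort keys so every channel, and every run, gets the same layout.
    summary = "; ".join(f"{key}={value}" for key, value in sorted(diagnostics.items()))
    return [CHANNELS[name](summary) for name in channels]
```

Building the summary once is what gives the consistent data collection: the engineer sees the same facts whether they arrive by chat or in a ticket.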
Workflow Use Cases
Auto-Remediation
• Trusted Automation
– Will make automated changes to the system
• Much Faster Time to Resolution
• Consistent Solutions
• Less Pager Fatigue
Low Disk Space Event Example
Automation Example
Diagram lanes: Service, Monitoring, Automation, Incident Management, Engineer
• Event: “low disk on web301”
• Web301 is “low disk”
• Resolve known cases, fast. Is it /var/log? Clean up!
• Unknown problem, need a human
• Wake up, buddy. Something real is going on…
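The decision in this example (known case, clean up; unknown case, wake a human) can be sketched in a few lines of Python. The `Event` class, the function name, and the 0.5 cut-off for "the logs filled the disk" are assumptions for illustration. The per-path usage figures stand in for whatever a diagnostic step would actually collect.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Event:
    host: str
    kind: str


def handle_low_disk(event: Event, usage_fraction_by_path: Dict[str, float]) -> str:
    """Choose between auto-remediation and escalation for a low-disk event.

    usage_fraction_by_path maps a path to its share of the disk (0.0 to 1.0).
    """
    if event.kind != "low disk":
        return "ignore"
    # Known case: logs filled the disk. The 0.5 cut-off is an assumed value.
    if usage_fraction_by_path.get("/var/log", 0.0) >= 0.5:
        return f"clean up /var/log on {event.host}"
    # Unknown problem: hand over to a human.
    return f"page engineer about {event.host}"
```

Returning a description of the action rather than performing it keeps the sketch safe to run. A real workflow would replace each returned string with a call to the corresponding action.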
Thank You!
• Email: phoolboo@brocade.com
• Twitter: @DoriftoShoes


Editor's Notes

  • #10 We run on different cloud stacks, using public and private clouds. There are 140 monitoring systems alone, by my last count. We monitor everything. But when something happens and our monitoring triggers an event, what do we do?
  • #11 With triggers, rules, workflows, and actions, what can be automated? The shorter answer is to say what can NOT be automated. Everything that you do manually but don’t want to do manually, all…
  • #21 Assume you run an app on a server in your data center. The server is running out of disk space. Shamefully, it’s a much more common source of failure than most care to admit. The monitoring tool picks up that the server is running low on disk and raises an event. The automation system catches the event and fires a “low disk space” trigger. The rule is set to run a “remediate out-of-disk” workflow on the “low disk space” trigger; it matches, so it runs the workflow. The workflow runs the process as defined. It may go and check what the problem is, and if it’s a known problem with a known fix, it fixes it automatically. For instance, if the logs didn’t rotate and filled up the space, it cleans the logs. If something unusual is happening, it escalates to a human.
  • #22 Here is the simplified version of a troubleshooting workflow (shown in the BWC workflow designer). <Describe in words>