Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016
Failures happen. Building resilient cloud infrastructure requires an end-to-end automated approach to failure remediation. This approach must go beyond the current DevOps model of monitoring the system and getting engineers alerted when a failure condition occurs.
Recently, event-driven automation and workflows re-emerged as a way to automate troubleshooting, remediation, and a variety of Day-2 operations. Facebook famously uses FBAR to "save 16,000 engineer-hours, a day, in ops". Similar approaches have been reported by other hyper-scale cloud providers. Open-source auto-remediation platforms like StackStorm are replacing legacy runbook-automation products, and have been successfully used to automate applications, networks, security, and cloud infrastructure.
In this presentation we give a brief history of workflow automation, overview the common architectural ingredients of a typical event-driven automation framework, compare and contrast alternative approaches to Day-2 automation, and, most importantly, share real-world use cases and examples of applying event-driven automation in operations.
[Slide: "On-call, Without Automation". Timeline from 2:00 AM to 2:30 AM: the PagerDuty alert fires at 2:00 AM; the engineer wakes up, logs in and ACKs, checks the runbook, studies the alert, runs diagnostics, and finally fixes the problem at 2:30 AM.]
And now with Winston.
Netflix started using Winston for Cassandra auto-remediation, and it grew into remediation-as-a-service.
It was presented at QCon.
Winston gets the alert. Using its rule engine, it decides what the right action is. The action then analyzes the issue, and if it is identified as a false positive, there is no need to page the on-call.
Another use case: Winston identifies that it can fix the issue itself. When it does, again, there is no need to page the on-call.
The last use case, the one we want you to focus on, is Assisted Diagnostics. While the on-call is being paged, Winston runs a series of pre-defined diagnostics and prepares a report for the on-call, so that when they log in to the system they have comprehensive information: the discovery status, a list of recent exceptions or errors, or any other relevant context to help them make a decision faster.
Now let’s talk about workflows.
Remember that workflow is a part of event-driven automation, but a very important part.
Sequence: tasks run one after another.
Typical remediation sequence: update config, clean the logs, restart the server.
Note the workflow definition: the name of the task, an action with its input, and a transition. Simple, concise, readable YAML.
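A minimal sketch of that remediation sequence in the Mistral v2 DSL; the workflow name, the shell commands, and the use of the stock StackStorm `core.remote` action are illustrative assumptions, not taken from the slides:

```yaml
version: '2.0'

remediate_server:
  type: direct
  input:
    - hostname
  tasks:
    update_config:
      # push the corrected config to the affected host
      action: core.remote cmd="cp /etc/myapp/good.conf /etc/myapp/app.conf" hosts=<% $.hostname %>
      on-success:
        - clean_logs
    clean_logs:
      # free disk space before restarting
      action: core.remote cmd="rm -f /var/log/myapp/*.log" hosts=<% $.hostname %>
      on-success:
        - restart_server
    restart_server:
      action: core.remote cmd="service myapp restart" hosts=<% $.hostname %>
```

Each task names an action with its input and declares the transition to the next task, nothing more.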
Data passing: a workflow’s ability to carry data downstream, and to refer to that data efficiently, is key.
In this example, troubleshooting results obtained by task 1 are published to chatops by task 2.
We can refer to task results directly, or “publish” a named variable for convenience.
This funny syntax here is YAQL – Yet Another Query Language – we preferred it over Jinja for extensibility and type support.
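A hedged sketch of the data-passing pattern; the troubleshooting command and the channel name are made up, while `chatops.post_message` is the stock StackStorm ChatOps action:

```yaml
version: '2.0'

diagnose_and_report:
  type: direct
  input:
    - hostname
  tasks:
    troubleshoot:
      action: core.remote cmd="df -h; tail -n 50 /var/log/syslog" hosts=<% $.hostname %>
      publish:
        # name the result once, refer to it downstream as $.report
        report: <% task(troubleshoot).result %>
      on-success:
        - post_results
    post_results:
      # <% ... %> delimits a YAQL expression
      action: chatops.post_message channel="ops" message=<% $.report %>
```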
Simple conditions:
Simply: deploy the app; on success, post to chat; on failure, page the on-call admin.
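A sketch of that branching using Mistral’s `on-success` and `on-error` transitions; the `my_pack.deploy_app` and `my_pack.page_oncall` actions are hypothetical placeholders:

```yaml
version: '2.0'

deploy:
  type: direct
  tasks:
    deploy_app:
      action: my_pack.deploy_app        # hypothetical deploy action
      on-success:
        - post_to_chat
      on-error:
        - page_oncall
    post_to_chat:
      action: chatops.post_message channel="ops" message="Deploy succeeded"
    page_oncall:
      # hypothetical action wrapping the paging service API
      action: my_pack.page_oncall message="Deploy failed, paging on-call"
```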
Conditions can be based on data:
This workflow runs a switch diagnostic action, which may be just a shell script, and acts based on the return code. This is the most common pattern.
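A sketch of branching on data with YAQL conditions on the transitions; the `switch_diag.sh` script, the meaning of its return codes, and the `my_pack.*` actions are all assumptions for illustration:

```yaml
version: '2.0'

switch_remediation:
  type: direct
  input:
    - switch_name
  tasks:
    run_diagnostic:
      # the diagnostic may be just a shell script; we branch on its exit code
      action: core.local cmd="./switch_diag.sh <% $.switch_name %>"
      publish:
        code: <% task(run_diagnostic).result.return_code %>
      on-complete:
        - reboot_port: <% $.code = 1 %>
        - page_oncall: <% $.code = 2 %>
    reboot_port:
      action: my_pack.reboot_port switch=<% $.switch_name %>
    page_oncall:
      action: my_pack.page_oncall message="Switch diagnostic needs a human"
```

Using `on-complete` means the transitions are evaluated whether the diagnostic action succeeded or failed.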
And that’s it!
In my view, that set of patterns is sufficient. To make it “efficient”, we may want a few more patterns.
Parallel task execution.
This example is from our own CI: we use StackStorm to build StackStorm.
Once it is built and packaged, I deploy and test it on three operating systems. In parallel, obviously.
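A sketch of the fan-out, with a hypothetical `my_ci` pack standing in for the real build and test actions; in Mistral, listing several tasks under one `on-success` is what starts them in parallel:

```yaml
version: '2.0'

build_and_test:
  type: direct
  tasks:
    build_package:
      action: my_ci.build_package       # hypothetical build action
      on-success:
        # all three test tasks start as soon as the build succeeds
        - test_ubuntu
        - test_centos
        - test_debian
    test_ubuntu:
      action: my_ci.run_tests os="ubuntu"
    test_centos:
      action: my_ci.run_tests os="centos"
    test_debian:
      action: my_ci.run_tests os="debian"
```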
Now that the execution is split in parallel, how do we join it back?
How do we put Humpty Dumpty back together again? It’s not easy.
According to workflow patterns, there are 16 ways to join.
How many times t5 runs, and when, depends on the type of join.
Simple merge: t5 runs three times, once for each upstream execution.
That’s what I want here: report the completion on each of parallel tasks.
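In Mistral, the simple merge is simply the absence of a `join` keyword on the downstream task. A fragment, continuing the hypothetical CI example above:

```yaml
    # simple merge: no "join" on t5, so it runs once for each
    # upstream task that transitions to it, three times in total
    test_ubuntu:
      action: my_ci.run_tests os="ubuntu"
      on-success:
        - t5
    test_centos:
      action: my_ci.run_tests os="centos"
      on-success:
        - t5
    test_debian:
      action: my_ci.run_tests os="debian"
      on-success:
        - t5
    t5:
      action: chatops.post_message channel="ci" message="an OS test finished"
```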
Now: to tag the release, we want the tests on all three operating systems to pass.
That is what the “AND” join pattern does.
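In Mistral the AND join is expressed with `join: all`. A fragment, with a placeholder action name:

```yaml
    tag_release:
      # "join: all" waits for every incoming transition;
      # tag_release runs once, only after all three OS tests succeed
      join: all
      action: my_ci.tag_release
```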
If, on the other hand, any of the OS tests fails, we don’t wait for the rest before calling it a failure.
In this example, t5 also runs only once, but it does so on whichever upstream task completes first, and the workflow moves on.
This join is called a “discriminator”, because the US legal compliance people haven’t reviewed the workflow pattern language yet…
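In Mistral the discriminator is `join: one`. A fragment, with a placeholder action:

```yaml
    t5:
      # "join: one" is Mistral's discriminator: t5 fires on the first
      # upstream task to arrive, and the workflow moves on without
      # waiting for the rest
      join: one
      action: chatops.post_message channel="ci" message="first result is in"
```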
Finally, “multiple data”.
People ask: “can a workflow have loops?” My answer: “it can, but you don’t want it to.”
If all you need is the same action run over a set of data, use this pattern. In Mistral, the keyword for it is “with-items”.
Here, task 1 gets the list of available IP addresses from the inventory system, and task 2 uses them as input to the create-vm action.
Here is a cool thing about Mistral workflows: the actions run in parallel, AND you can control the concurrency.
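A sketch of the whole with-items example; the inventory and VM-creation actions are hypothetical placeholders:

```yaml
version: '2.0'

provision_vms:
  type: direct
  tasks:
    get_ips:
      # hypothetical action querying the inventory system
      action: my_pack.get_available_ips
      publish:
        ips: <% task(get_ips).result %>
      on-success:
        - create_vms
    create_vms:
      # runs the create_vm action once per ip, at most two at a time
      with-items: ip in <% $.ips %>
      concurrency: 2
      action: my_pack.create_vm ip=<% $.ip %>
```

The `concurrency` attribute is what keeps a large list from stampeding the downstream system.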
That’s it, that’s all you need.
This is the minimal set that gives enough power but keeps workflows simple to create, track, and reason about.