Event Driven Automation and
Workflows for Auto-remediation
© 2016 BROCADE COMMUNICATIONS SYSTEMS, INC.
About myself
Past
• Opalis Software (now aka M$ SC Orchestrator)
• VMware
• OpenStack Mistral core team member
• StackStorm founder & CTO
Present:
• Automation and Integration @ Brocade
Agenda
• Brief History of Event Driven Automation and Workflows
• How it works
• What can be automated
• Workflows - detailed
• Workflow based automation vs alternatives
Automation starts with the workflow
“Workflow is a set of tasks strung together
to achieve some meaningful business objective”
Business Process Management
Apply BPM to IT Automation?
The TIBCO Integration Platform
Hype Cycle for Real-Time Infrastructure, 2008
BMC
BMC
CA
Cisco
VMware
Citrix
OpsWare HP
Microsoft
The problem is bigger
than it was 5 years ago
Speed
Tools
More Tools…
Still…
• Manual operations
• Custom scripts
Event Driven Automation 2.0
FBAR (saving 13,680 hours/day)
Naoru
Nurse
Winston (powered by StackStorm)
Azure Automation
Mistral workflow service
StackStorm automation platform
ACT
OBSERVE
ORIENT
DECIDE
Ingredients
IT Domains
Config mgmt, Storage, Networking, Containers, Cloud Infra, Monitoring
Actions, Sensors
Workflows, Rules
Ops Support
Automation Example
Actors: Service Monitoring, Automation, Incident Management, Engineer
• Event: “low disk on web301”
• Web301 is “low disk”
• Resolve known cases, fast. Is it /var/log? Clean up!
• Unknown problem, need a human
• Wake up, buddy. Something real is going on…
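The decision in this example can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not StackStorm's rule engine: the `Event` fields, the trigger name `low_disk`, and the action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    host: str      # e.g. "web301"
    trigger: str   # hypothetical trigger name, e.g. "low_disk"
    path: str      # hypothetical field: the filesystem that is filling up


def decide(event: Event) -> str:
    """Pick an action name for an incoming monitoring event."""
    # Known case: low disk caused by logs, safe to clean up automatically.
    if event.trigger == "low_disk" and event.path.startswith("/var/log"):
        return "cleanup_logs"
    # Anything else is an unknown problem and needs a human.
    return "page_on_call"


if __name__ == "__main__":
    print(decide(Event("web301", "low_disk", "/var/log")))
    print(decide(Event("web301", "low_disk", "/data")))
```

Known cases get resolved without waking anyone; everything else falls through to the on-call engineer.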
What can be automated?
• Security checks
– On malware detection in a VM, isolate
network port on a switch
• App blue-green deployment
– On Jenkins tests passed, bring up new VM
cluster, deploy and configure app, set
load balancer to send % of traffic to new
app, monitor, roll forward, or back out
• Networking
– On BGP peer down: collect
troubleshooting data, post on Slack & create
JIRA ticket
– On link aggregation member error, check
load; if capacity of the rest of the LAG bundle is
enough, disable the link with the error
• OpenStack
– Orphan VM clean-up: On orphans detected,
shut down, email owner, keep for a few days,
delete
– VM evacuation on HW failures: On host RAID
failure, get list of impacted VMs, email VM
owners, evacuate VMs, create JIRA ticket for
hardware replacement.
• NFV:
– Nokia, Ericsson, AT&T, with Mistral and
OpenStack
• Service remediation:
– Cassandra “node down” recovery: On ring
node dying, deploy new node, configure, add
to the ring.
– Remediating RabbitMQ, Galera cluster,
MySQL, and more…
What can be automated?
From: Practice of Cloud System Administration, by Thomas Limoncelli
Benefits
• Avoid failures (fixing on computer time, not human time)
• Reduce incident MTTR (Mean Time To Recover)
• Reduce risk of human error (no fat fingers)
On-call, Without Automation
Timeline from 2:00 AM to 2:30 AM (marks at 2:00, 2:02, 2:07, 2:10, 2:15, 2:20, 2:30):
• PagerDuty Alert
• Engineer wakes up
• Logs in and ACK
• Studies the alert
• Checks runbook
• Runs diagnostics
• Fixes the problem
On-call With Winston
Winston receives the alert at 2:00 AM. Outcomes shown (time marks at 2:05 AM, 2:05 AM and 2:15 AM):
• False Positive
• Fixed the problem
• Assisted Diagnostics
Benefits
Uses event driven automation and
workflows with Brocade Workflow
Composer to run Virtual Desktop Service
• Virus Detection: 80% reduction in ops man-hours to detect, isolate and resolve
• Adding tenant: 70% reduction in man-hours
• Environment Verification: 50% reduction in time to verify, 120% verification coverage
• Threshold Monitoring: 40% decrease in incidents caused by lack of resources
• Troubleshooting: 40% reduction in data collection time
• Network Troubleshooting (congestion, loops): 80% reduction in man-hours, minimizing operational mistakes
“Sleep Better at Night: OpenStack Cloud Auto-Healing” @ OpenStack Summit Barcelona
Mirantis: Auto-remediating 2,000 node OpenStack cluster at Symantec with StackStorm
Benefits
• Reduce MTTR (Mean Time To Resolution)
• Avoid failures (fixing on computer time, not human time)
• Reduce risk of human error (no fat fingers)
• Positive team impact
– Avoid pager fatigue and team burn-out
– Turn from reactive to proactive (break reactive vicious cycle)
– Capture operational knowledge – as code
Into Details:
Workflows
IT Domains
Config mgmt, Storage, Networking, Containers, Cloud Infra, Monitoring
Actions, Sensors
Workflows, Rules
Ops Support
MISTRAL
N.B: Event Driven Automation > Workflow,
but Workflow is a key element.
Key Workflow Patterns
• Theory: ~100 patterns - http://www.workflowpatterns.com/
• Practice: IMAO, only a few are sufficient for IT & DC automation
Basic: Sequence
...
tasks:
  t1_update_config:
    action: core.remote_sudo
    input:
      cmd: sed -i -e"s/keepalive_timeout
      hosts: my_webserver.example.com
    on-complete: t2_cleanup_logs
  t2_cleanup_logs:
    action: core.remote_sudo
    input:
      cmd: rm /var/log/nginx/
      hosts: my_webserver.example.com
    on-complete: t3_restart_service
  t3_restart_service:
    action: core.remote_sudo cmd="servic
Diagram: t1 → t2 → t3
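To make the sequence pattern concrete, here is a toy Python runner that follows each task's `on-complete` link until none is left. It is a sketch of the control flow only, not how Mistral executes workflows; the dictionary layout and the no-op actions are assumptions made for illustration.

```python
def run_sequence(tasks, start):
    """Run tasks one at a time, following each task's on-complete link."""
    executed = []
    name = start
    while name is not None:
        task = tasks[name]
        task["action"]()                 # run the step
        executed.append(name)
        name = task.get("on-complete")   # None ends the workflow
    return executed


# Stand-in actions; a real engine would run remote commands here.
tasks = {
    "t1_update_config": {"action": lambda: None, "on-complete": "t2_cleanup_logs"},
    "t2_cleanup_logs": {"action": lambda: None, "on-complete": "t3_restart_service"},
    "t3_restart_service": {"action": lambda: None},
}

if __name__ == "__main__":
    print(run_sequence(tasks, "t1_update_config"))
```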
Basic: Data Passing
examples.data_pass:
  input:
    - host
  tasks:
    t1_diagnose:
      action: diag.run_mysql_diag
      input:
        host: <% $.host %>
      publish:
        - msg: <% $.t1_diagnose.stdout.summary %>
      on-complete: t2_post_to_chat
    t2_post_to_chat:
      action: chatops.say
      input:
        header: Returned <% $.t1_diagnose.code %>
        details: <% $.msg %>
Diagram: t1 → t2, passing t1.code=0 and msg="Some string.."
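The data-passing idea can be sketched in Python as a shared context dictionary: each task's result is stored under the task name, and `publish` copies selected values into named variables. This is an illustrative model, not Mistral's implementation; the lambdas stand in for the `<% ... %>` expressions, and `run_mysql_diag` and `say` are hypothetical stand-ins for the actions on the slide.

```python
def run(tasks, start, context):
    """Run tasks in sequence, storing results and published variables in context."""
    name = start
    while name is not None:
        task = tasks[name]
        # Resolve each input from the shared context (stand-in for <% ... %>).
        kwargs = {key: expr(context) for key, expr in task.get("input", {}).items()}
        context[name] = task["action"](**kwargs)        # task result, keyed by task name
        for var, expr in task.get("publish", {}).items():
            context[var] = expr(context)                # named variable for convenience
        name = task.get("on-complete")
    return context


def run_mysql_diag(host):
    # Hypothetical diagnostic; returns a canned result.
    return {"code": 0, "summary": f"{host}: replication OK"}


chat_log = []


def say(header, details):
    # Hypothetical chat action; records the message instead of posting it.
    chat_log.append(f"{header} | {details}")


tasks = {
    "t1_diagnose": {
        "action": run_mysql_diag,
        "input": {"host": lambda ctx: ctx["host"]},
        "publish": {"msg": lambda ctx: ctx["t1_diagnose"]["summary"]},
        "on-complete": "t2_post_to_chat",
    },
    "t2_post_to_chat": {
        "action": say,
        "input": {
            "header": lambda ctx: f"Returned {ctx['t1_diagnose']['code']}",
            "details": lambda ctx: ctx["msg"],
        },
    },
}

if __name__ == "__main__":
    run(tasks, "t1_diagnose", {"host": "db1.example.com"})
    print(chat_log)
```

Task 2 reads the result of task 1 in two ways: directly by task name for the header, and through the published `msg` variable for the details.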
Basic: Conditions
tasks:
  ...
  t1_deploy:
    action: ops.deploy_fleet
    on-success: t2_post_to_chat
    on-failure: t3_page_admin
  t2_post_to_chat:
    action: chatops.say
    input:
      header: Successfully deployed <% $.t1_diag
  t3_page_admin:
    action: pagerduty.launch_incident
    input:
      details: Have to wake up dude...
      details: <% $.msg %>
Diagram: t1 → t2 on success, t1 → t3 on failure
Basic: Conditions on Data
t1_diagnose:
  action: ops.run_switch_diag
  publish:
    - code: <% $.t1_diagnose.return_code %>
  on-complete:
    - t2_post_to_chat: <% $.code == 0 %>
    - t3_page_network_admin: <% $.code > 0 %>
t2_post_to_chat:
  action: slack.post
  input:
    header: "Switch <% $.switch %> checked, OK"
t3_page_network_admin:
  action: pagerduty.launch_incident
  input:
    details: Have to wake up dude...
    details: <% $.t1_diagnose.stdout %>
Diagram: t1 → t2 when t1.code == 0, t1 → t3 when t1.code > 0
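Conditions on data amount to guarded transitions: after a task completes, each outgoing transition is taken only if its expression holds for the published data. A minimal Python sketch of that selection step, with lambdas standing in for the `<% ... %>` expressions (the function name `next_tasks` is made up for this example):

```python
def next_tasks(transitions, context):
    """Return the targets whose condition holds for the current context."""
    return [target for target, condition in transitions if condition(context)]


# Mirrors the on-complete list of t1_diagnose on the slide.
on_complete = [
    ("t2_post_to_chat", lambda ctx: ctx["code"] == 0),
    ("t3_page_network_admin", lambda ctx: ctx["code"] > 0),
]

if __name__ == "__main__":
    print(next_tasks(on_complete, {"code": 0}))
    print(next_tasks(on_complete, {"code": 2}))
```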
That’s the basics!
Sufficient. But there is more…
More: Parallel Execution
...
t1_do_build:
  action: cicd.do_build_and_packages
  on-success:
    - t2_test_ubuntu14
    - t3_test_fedora20
    - t4_test_rhel6
t2_test_ubuntu14:
  action: cicd.deploy_and_test distro="UBUNTU14"
t3_test_fedora20:
  action: cicd.deploy_and_test distro="F20"
t4_test_rhel6:
  action: cicd.deploy_and_test distro="RHEL6"
Diagram: t1 → t2, t3, t4 in parallel
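The fan-out on this slide can be approximated in plain Python with a thread pool: one build step, then the three test tasks started together. This is a sketch of the pattern, not of the workflow engine; `deploy_and_test` is a hypothetical stand-in that returns a canned string.

```python
from concurrent.futures import ThreadPoolExecutor


def deploy_and_test(distro):
    # Hypothetical stand-in for cicd.deploy_and_test; always "passes".
    return f"{distro}: passed"


def build_then_test(distros):
    """Run the build once, then test all distros in parallel."""
    build = "build ok"                      # stand-in for the build step
    with ThreadPoolExecutor(max_workers=len(distros)) as pool:
        # map() runs the calls concurrently but returns results in input order.
        results = list(pool.map(deploy_and_test, distros))
    return build, results


if __name__ == "__main__":
    print(build_then_test(["UBUNTU14", "F20", "RHEL6"]))
```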
More: Join
16 ways to join
Diagram: t1 fans out to t2, t3 and t4, which converge on t5
More: Join—Simple Merge
http://www.workflowpatterns.com/patterns/control/basic/wcp5.php
...
t2_test_ubuntu14:
  action: cicd.deploy_and_test distro="UBUNTU14"
  on-success: t5_post_status
t3_test_fedora20:
  action: cicd.deploy_and_test distro="F20"
  on-success: t5_post_status
t4_test_rhel6:
  action: cicd.deploy_and_test distro="RHEL6"
  on-success: t5_post_status
t5_post_status:
  action: chatops.say
  input:
    header: Test completed!
Simple Merge: t2, t3 and t4 each trigger t5, so t5 runs three times
More: Join—AND Join
http://www.workflowpatterns.com/patterns/control/new/wcp33.php
Full “AND” Join
...
t2_test_ubuntu14:
  action: cicd.deploy_and_test distro="UBUNTU14"
  on-success: t5_tag_release
t3_test_fedora20:
  action: cicd.deploy_and_test distro="F20"
  on-success: t5_tag_release
t4_test_rhel6:
  action: cicd.deploy_and_test distro="RHEL6"
  on-success: t5_tag_release
t5_tag_release:
  join: all
  action: cicd.tag_release
Diagram: t2, t3, t4 → t5 (runs once, after all three succeed)
More: Join—Discriminator
http://www.workflowpatterns.com/patterns/control/advanced_branching/wcp9.php
Discriminator
...
t2_test_ubuntu14:
  action: cicd.deploy_and_test distro="UBUNTU14"
  on-failure: t5_report_and_fail
t3_test_fedora20:
  action: cicd.deploy_and_test distro="F20"
  on-failure: t5_report_and_fail
t4_test_rhel6:
  action: cicd.deploy_and_test distro="RHEL6"
  on-failure: t5_report_and_fail
t5_report_and_fail:
  join: one
  action: chatops.say header="FAILURE!"
  on-complete: fail
Diagram: t2, t3, t4 → t5 (runs once, on the first failure)
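The three join types shown on the last slides differ only in how many times the downstream task t5 fires. A small Python model of that difference, assuming three upstream tasks and treating each as either having triggered its transition to t5 or not (the function `run_join` and its arguments are invented for this sketch):

```python
def run_join(upstream_triggered, join=None):
    """Return how many times the downstream task fires.

    upstream_triggered: one boolean per upstream task, True if that task
        took its transition to the downstream task.
    join: None for simple merge, "all" for AND join, "one" for discriminator.
    """
    fired = sum(1 for triggered in upstream_triggered if triggered)
    if join is None:
        return fired                                   # once per incoming transition
    if join == "all":
        return 1 if fired == len(upstream_triggered) else 0
    if join == "one":
        return 1 if fired > 0 else 0                   # first arrival wins, rest ignored
    raise ValueError(f"unknown join type: {join!r}")


if __name__ == "__main__":
    print(run_join([True, True, True]))            # simple merge
    print(run_join([True, True, True], "all"))     # AND join
    print(run_join([False, True, False], "one"))   # discriminator
```

With no `join`, t5 reports once per finished test; with `join: all` it tags the release once, and only if every test passed; with `join: one` it reports failure as soon as any test fails.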
More: Multiple Data
...
t1_get_ip_list:
  action: myinventory.allocate_ips num=4
  publish:
    - ip_list: <% $.t1_get_ip_list.ips %>
  on-complete: t2_create_vms
t2_create_vms:
  with-items: ip in <% $.ip_list %>
  action: myaws.create_vms ip=<% $.ip %>
Diagram: t1 → t2, passing ip_list=[...]
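The "multiple data" pattern runs the same action once per item, in parallel, with a cap on how many run at once. A Python approximation using a bounded thread pool; `create_vm` is a hypothetical stand-in for the VM-creation action, and the `concurrency` argument is this sketch's own name for the limit.

```python
from concurrent.futures import ThreadPoolExecutor


def create_vm(ip):
    # Hypothetical stand-in for the VM-creation action.
    return {"ip": ip, "status": "created"}


def with_items(action, items, concurrency):
    """Apply action to every item, running at most `concurrency` at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Results come back in the same order as the input items.
        return list(pool.map(action, items))


if __name__ == "__main__":
    ip_list = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
    print(with_items(create_vm, ip_list, concurrency=2))
```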
Recap: Key Workflow Operations
• Sequence
• Data passing
• Conditions (on data)
• Parallel execution
• Joins
• Multiple Data Items
Why not Scripts?
Workflows Better in Operations
• Simple to define, reason, visualize
• Transparent
– state is clear, execution is trackable: running, complete, failed steps
• Reliable
– Workflows are long-running
– Crash tolerance
– “Restart from point of failure”
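"Restart from point of failure" is the property that most separates a workflow engine from a script. One way to picture it, as a rough Python sketch rather than a description of how any particular engine does it: record each completed step in persisted state, and skip recorded steps on the next run. The `state` dictionary here lives in memory only; a real system would store it durably.

```python
def run_with_checkpoint(steps, state):
    """Run (name, action) steps in order, skipping those already completed.

    state["done"] lists completed step names and is assumed to survive
    between runs (kept in memory here for simplicity).
    """
    for name, action in steps:
        if name in state["done"]:
            continue                  # finished in an earlier run
        action()                      # may raise; earlier steps stay recorded
        state["done"].append(name)
    return state


if __name__ == "__main__":
    calls = []
    attempts = {"t2": 0}

    def flaky():
        attempts["t2"] += 1
        if attempts["t2"] == 1:
            raise RuntimeError("transient failure")
        calls.append("t2")

    steps = [
        ("t1", lambda: calls.append("t1")),
        ("t2", flaky),
        ("t3", lambda: calls.append("t3")),
    ]
    state = {"done": []}
    try:
        run_with_checkpoint(steps, state)   # fails at t2, t1 is recorded
    except RuntimeError:
        pass
    run_with_checkpoint(steps, state)       # resumes at t2, t1 is not re-run
    print(calls, state)
```

A script that crashes at step 2 has to be re-run from the top or patched by hand; here the second run picks up at the failed step.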
Why not Legacy RunBook Automation?
DevOps:
Infrastructure as Code
Infrastructure as code
Case Study
• Automated provisioning, 4 Data centers
• Before: CPO, operator updates via GUI, click and pray, x4
• After: BWC, dev -> code review -> staging -> QA -> prod
Infrastructure as code
Top predictor of IT performance?
Version control used by Ops
for Ops artifacts!
Designed for DevOps
1. Support infrastructure as code
2. Open Source
3. Scale and reliability
4. Part of tool chain
5. Social coding & collaboration
6. More demanding - requires skills
Part of tool chain
DevOps Tools vs Enterprise Suites
OR
Leverage social coding
Community packs @ StackStorm exchange
More demanding
OR
Requires skills – CLI, scripting, understanding
Operation Patterns
Capture and share operational patterns
as code!
Summary
• Event-driven automation works
– benefits to reliable cloud operations
• Automation must be reliable and transparent
– workflows beat scripts
• Infra as code is key
– repeatable, testable, reliable automation
StackStorm
Open Source, Apache 2.0
• GitHub: github.com/StackStorm/st2
• Twitter: Stack_Storm
• IRC: #stackstorm on FreeNode
• stackstorm.slack.com on Slack
• www.stackstorm.com
Brocade Workflow Composer
Commercial Edition
• Enterprise features
• Priority support
• brocade.com/bwc
• docs: bwc-docs.brocade.com
• Network lifecycle automation suite
Questions & Answers


Editor's Notes

  • #27 And now with Winston. They started using Winston for Cassandra auto-remediation, and it grew into remediation-as-a-service (presented at QCon). Winston gets the alert and uses its rule engine to decide what the right action is. The action then analyses the issue, and if it is identified as a false positive, there is no need to page the on-call. Another use case is when Winston identifies that it can fix the issue itself; again, no need to page the on-call. The last use case, the one we want you to focus on, is assisted diagnostics. While the on-call is being paged, Winston runs a series of pre-defined diagnostics and prepares a report, so that when the on-call logs in to the system, they have comprehensive information, such as the Discovery status, a list of recent exceptions or errors, or any other relevant context, to help them make a decision faster.
  • #32 Now let’s talk about workflows
  • #33 remember that workflow is a part of event driven automation… but a very important part
  • #35 Sequence: tasks run one after another. Typical remediation sequence: update config, clean the logs, restart the server. Note the workflow definition: name of the task, action with input, transition. Simple, concise, readable YAML.
  • #36 Data passing: the workflow's ability to carry data downstream, and to refer to that data efficiently, is the key. In this example, troubleshooting results obtained by task 1 are published to chatops by task 2. We can refer to the task results directly, or “publish” a named variable for convenience. This funny syntax here is YAQL (Yet Another Query Language); we preferred it over Jinja for extensibility and type support.
  • #37 Simple conditions: deploy the app; on success, post to chat; on failure, page the admin on call.
  • #38 Conditions can be based on data: this workflow runs a switch diagnostic action, which may be just a shell script, and acts based on the return code. Most common pattern.
  • #39 And that’s it! In my view, that set of patterns is sufficient. To make it “efficient”, we may want a few more patterns.
  • #40 Parallel task execution. This example is from our own CI: we use StackStorm to build StackStorm. When it is built and packaged, I deploy and test it on 3 operating systems. Obviously, in parallel.
  • #41 Now that the execution is split in parallel, how to join it? How to get this Humpty Dumpty back together again? It’s not easy.
  • #42 According to workflow patterns, there are 16 ways to join. How many times t5 is going to run, and how, depends on the type of join.
  • #43 Simple merge. T5 runs 3 times, once for each upstream execution. That’s what I want here: report the completion of each of the parallel tasks.
  • #44 Now: to tag the release, we want the tests on all 3 operating systems to pass. That is what the “AND” join pattern will do.
  • #45 If, on the other hand, any of the OS tests fails, we don’t wait for the rest to call it a failure. In this example, t5 also runs only once, but it does so on whichever upstream task finishes first, and the workflow moves on. This join is called “discriminator”, because US legal compliance people didn’t review workflow pattern language yet…
  • #46 Finally, “multiple data”. People ask “can a workflow have loops”? My answer is “it can, but you don’t want it”. If all you need is the same action run on a set of data, use this pattern. In Mistral, the keyword for it is “with-items”. Here, task 1 gets the list of available IP addresses from the inventory system, and task 2 uses them as input to the create VM action. Here is a cool thing about Mistral workflows: actions run in parallel, AND you can control concurrency.
  • #47 That’s it, that’s all you need. This is the minimal set that gives enough power but keeps workflows simple to create, track, and reason about.
  • #54 Apply DevOps to automation itself