
Event-driven automation and workflows

Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016

Failures happen. Building resilient cloud infrastructure requires an end-to-end automated approach to failure remediation. This approach must go beyond the current DevOps model of monitoring the system and alerting engineers when a failure condition occurs.

Recently, event-driven automation and workflows have re-emerged as a way to automate troubleshooting, remediation, and a variety of Day-2 operations. Facebook famously uses FBAR to "save 16,000 engineer-hours, a day, in ops". Similar approaches have been reported by other hyper-scale cloud providers. Open-source auto-remediation platforms like StackStorm are replacing legacy runbook automation products, and have been successfully used to automate applications, networks, security, and cloud infrastructure.

In this presentation we give a brief history of workflow automation, outline the common architectural ingredients of a typical event-driven automation framework, compare and contrast alternative approaches to Day-2 automation, and, most importantly, share real-world use cases and examples of applying event-driven automation in operations.

Published in: Technology

  1. 1. Event Driven Automation and Workflows for Auto-remediation © 2016 BROCADE COMMUNICATIONS SYSTEMS, INC.
  2. 2. About myself. Past: • Opalis Software (now Microsoft System Center Orchestrator) • VMware • OpenStack Mistral core team member • StackStorm founder & CTO. Present: • Automation and Integration @ Brocade
  3. 3. Agenda • Brief History of Event Driven Automation and Workflows • How it works • What can be automated • Workflows - detailed • Workflow based automation vs alternatives
  4. 4. Automation starts with the workflow. “Workflow is a set of tasks strung together to achieve some meaningful business objective.”
  5. 5. 5
  6. 6. Business Process Management
  7. 7. Apply BPM to IT Automation? (Figure: The TIBCO Integration Platform)
  8. 8. © 2016 BROCADE COMMUNICATIONS SYSTEMS, INC. 8
  9. 9. Hype Cycle for Real-Time Infrastructure, 2008
  10. 10. Logos: BMC, CA, Cisco, VMware, Citrix, OpsWare, HP, Microsoft
  11. 11. © 2016 BROCADE COMMUNICATIONS SYSTEMS, INC. 11
  12. 12. 12
  13. 13. The problem is bigger than it was 5 years ago
  14. 14. Speed
  15. 15. Tools
  16. 16. More Tools…
  17. 17. Still… • Manual operations • Custom scripts
  18. 18. Event Driven Automation 2.0 • FBAR (saving 13,680 hours/day) • Naoru • Nurse • Winston (powered by StackStorm) • Azure Automation • Mistral workflow service • StackStorm automation platform. Loop: Observe → Orient → Decide → Act
  19. 19. Ingredients • Sensors • Rules • Workflows • Actions. IT Domains: Config mgmt, Storage, Networking, Containers, Cloud Infra, Monitoring, Ops Support
  20. 20. Automation Example. Participants: Service, Monitoring, Automation, Incident Management, Engineer • Monitoring: Web301 is “low disk” • Event: “low disk on web301” • Automation: Resolve known cases, fast. Is it /var/log? Clean up! • Unknown problem, need a human • To the engineer: Wake up, buddy. Something real is going on…
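
The flow on slide 20 (an event arrives, known cases are fixed automatically, unknown cases are escalated to a human) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Rule`, `clean_up_logs`, `page_engineer`, and the event fields `host`, `trigger`, and `path` are hypothetical names, not StackStorm APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, str]


@dataclass
class Rule:
    """Maps a matching event to a remediation action (hypothetical structure)."""
    name: str
    matches: Callable[[Event], bool]
    action: Callable[[Event], str]


def clean_up_logs(event: Event) -> str:
    # Placeholder for a real remediation, e.g. rotating or deleting old logs.
    return f"cleaned /var/log on {event['host']}"


def page_engineer(event: Event) -> str:
    # Placeholder for opening an incident in an incident-management tool.
    return f"paged on-call engineer about {event['host']}: {event['trigger']}"


RULES: List[Rule] = [
    Rule(
        name="low_disk_var_log",
        matches=lambda e: e["trigger"] == "low disk" and e.get("path") == "/var/log",
        action=clean_up_logs,
    ),
]


def handle_event(event: Event, rules: List[Rule]) -> str:
    """Run the first matching rule; escalate to a human if nothing matches."""
    for rule in rules:
        if rule.matches(event):
            return rule.action(event)
    return page_engineer(event)


known = {"host": "web301", "trigger": "low disk", "path": "/var/log"}
unknown = {"host": "web301", "trigger": "low disk", "path": "/data"}
print(handle_event(known, RULES))
print(handle_event(unknown, RULES))
```

The point of the design is that the pager is the fallback, not the default: a human is woken only when no rule claims the event.
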
  21. 21. What can be automated? • Security checks – On malware detection in a VM, isolate network port on a switch • App blue-green deployment – On Jenkins tests passed, bring up new VM cluster, deploy and configure app, set load balancer to send % of traffic to new app, monitor, roll forward, or back out • Networking – On BGP peer goes down: collect troubleshooting data, post on Slack & create JIRA ticket – On link aggregation member error, check load; if capacity of the rest of the LAG bundle is enough, disable link with error • OpenStack – Orphan VM clean-up: On orphans detected, shut down, email owner, keep for a few days, delete – VM evacuation on HW failures: On host RAID failure, get list of impacted VMs, email VM owners, evacuate VMs, create JIRA ticket for hardware replacement • NFV – Nokia, Ericsson, AT&T, with Mistral and OpenStack • Service remediation – Cassandra “node down” recovery: On ring node dying, deploy new node, configure, add to the ring – Remediating RabbitMQ, Galera cluster, MySQL, and more…
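
The link-aggregation case on slide 21 hinges on one check: is the rest of the bundle able to carry the current load? A minimal sketch of that decision follows. The function name, the Gbps units, and the 80% `headroom` factor are assumptions made for illustration; the slides do not specify a threshold.

```python
from typing import Sequence


def can_disable_member(
    member_capacity_gbps: Sequence[float],
    faulty_index: int,
    current_load_gbps: float,
    headroom: float = 0.8,
) -> bool:
    """Return True if the LAG can carry the current load without the faulty link.

    The remaining members are only allowed to be filled up to `headroom`
    (an assumed safety margin) of their combined capacity.
    """
    remaining = sum(
        capacity
        for index, capacity in enumerate(member_capacity_gbps)
        if index != faulty_index
    )
    return current_load_gbps <= remaining * headroom


# Four 10G members, one faulty, 20 Gbps of traffic: safe to disable the link.
print(can_disable_member([10, 10, 10, 10], faulty_index=0, current_load_gbps=20))
# Two 10G members, one faulty, 9 Gbps of traffic: not enough headroom.
print(can_disable_member([10, 10], faulty_index=1, current_load_gbps=9))
```
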
  22. 22. What can be automated? From: Practice of Cloud System Administration, by Thomas Limoncelli
  23. 23. Benefits • Avoid failures (fixing on computer time, not human time) • Reduce incident MTTR (Mean Time To Recover) • Reduce risk of human error (no fat fingers)
  24. 24. On-call, Without Automation. Timeline steps: PagerDuty Alert • Engineer wakes up • Logs in and ACK • Studies the alert • Checks runbook • Runs diagnostics • Fixes the problem. Times shown: 2:00 AM, 2:02 AM, 2:07 AM, 2:10 AM, 2:15 AM, 2:20 AM, 2:30 AM
  25. 25. On-call With Winston. Outcomes: False Positive • Assisted Diagnostics • Fixed the problem. Times shown: 2:00 AM, 2:05 AM, 2:05 AM, 2:15 AM
  26. 26. Benefits. Case study: uses event-driven automation and workflows with Brocade Workflow Composer to run a Virtual Desktop Service • Virus Detection: 80% reduction in ops man-hours to detect, isolate and resolve • Adding tenant: 70% reduction in man-hours • Environment Verification: 50% reduction in time to verify, 120% verification coverage • Threshold Monitoring: 40% decrease in incidents caused by lack of resources • Troubleshooting: 40% reduction in data collection time • Network Troubleshooting (congestion, loops): 80% reduction in man-hours, minimizing operational mistakes
  27. 27. “Sleep Better at Night: OpenStack Cloud Auto-Healing” @ OpenStack Summit Barcelona Mirantis: Auto-remediating 2,000 node OpenStack cluster at Symantec with StackStorm
  28. 28. Benefits • Reduce MTTR (Mean Time to Resolution) • Avoid failures (fixing on computer time, not human time) • Reduce risk of human error (no fat fingers) • Positive team impact – Avoid pager fatigue and team burn-out – Turn from reactive to proactive (break reactive vicious cycle) – Capture operational knowledge as code
  29. 29. Into Details:
  30. 30. Workflows
  31. 31. Workflows (MISTRAL) • Sensors • Rules • Workflows • Actions. IT Domains: Config mgmt, Storage, Networking, Containers, Cloud Infra, Monitoring, Ops Support. N.B.: Event Driven Automation > Workflow, but Workflow is a key element.
  32. 32. Key Workflow Patterns • Theory: ~100 patterns - http://www.workflowpatterns.com/ • Practice: IMAO, only a few are sufficient for IT & DC automation
  33. 33. Basic: Sequence (diagram: t1 → t2 → t3)
          ...
          tasks:
            t1_update_config:
              action: core.remote_sudo
              input:
                cmd: sed -i -e"s/keepalive_timeout   # truncated on the slide
                hosts: my_webserver.example.com
              on-complete: t2_cleanup_logs
            t2_cleanup_logs:
              action: core.remote_sudo
              input:
                cmd: rm /var/log/nginx/
                hosts: my_webserver.example.com
              on-complete: t3_restart_service
            t3_restart_service:
              action: core.remote_sudo cmd="servic   # truncated on the slide
  34. 34. Basic: Data Passing (diagram: t1 → t2, carrying t1.code=0 and msg=“Some string..”)
          examples.data_pass:
            input:
              - host
            tasks:
              t1_diagnose:
                action: diag.run_mysql_diag
                input:
                  host: <% $.host %>
                publish:
                  - msg: <% $.t1_diagnose.stdout.summary %>
                on-complete: t2_post_to_chat
              t2_post_to_chat:
                action: chatops.say
                input:
                  header: Returned <% $.t1_diagnose.code %>
                  details: <% $.msg %>
  35. 35. Basic: Conditions (diagram: t1 → t2 on success, t1 → t3 on failure)
          tasks:
            ...
            t1_deploy:
              action: ops.deploy_fleet
              on-success: t2_post_to_chat
              on-failure: t3_page_admin
            t2_post_to_chat:
              action: chatops.say
              input:
                header: Successfully deployed <% $.t1_diag   # truncated on the slide
            t3_page_admin:
              action: pagerduty.launch_incident
              input:
                details: Have to wake up dude... <% $.msg %>
  36. 36. Basic: Conditions on Data (diagram: t1 → t2 when t1.code == 0, t1 → t3 when t1.code > 0)
          t1_diagnose:
            action: ops.run_switch_diag
            publish:
              - code: <% $.t1_diagnose.return_code %>
            on-complete:
              - t2_post_to_chat: <% $.code == 0 %>
              - t3_page_network_admin: <% $.code > 0 %>
          t2_post_to_chat:
            action: slack.post
            input:
              header: "Switch <% $.switch %> checked, OK"
          t3_page_network_admin:
            action: pagerduty.launch_incident
            input:
              details: Have to wake up dude... <% $.t1_diagnose.stdout %>
  37. 37. That’s the basics! Sufficient. But there is more…
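
The basic patterns (sequence, data passing, and conditions on data) are small enough to model as a toy interpreter. The sketch below is not how Mistral is implemented; `Task`, `run_workflow`, and `build_switch_check` are hypothetical names, and the engine follows only the first transition whose guard is true. It mirrors the "Conditions on Data" example: `t1_diagnose` publishes `code`, and the next task is chosen from that value.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

Context = Dict[str, Any]


@dataclass
class Task:
    """One workflow step: an action, variables to publish, guarded transitions."""
    action: Callable[[Context], Dict[str, Any]]
    publish: Dict[str, Callable[[Context], Any]] = field(default_factory=dict)
    on_complete: List[Tuple[str, Callable[[Context], bool]]] = field(default_factory=list)


def run_workflow(tasks: Dict[str, Task], start: str, ctx: Context) -> List[str]:
    """Run tasks one after another and return the names of the tasks executed."""
    trace: List[str] = []
    current: Optional[str] = start
    while current is not None:
        task = tasks[current]
        # Store the task result under the task name, like $.t1_diagnose on the slides.
        ctx[current] = task.action(ctx)
        # Publish named variables derived from the context, like $.code.
        for variable, expression in task.publish.items():
            ctx[variable] = expression(ctx)
        trace.append(current)
        # Follow the first transition whose guard holds; stop if none does.
        current = next(
            (name for name, guard in task.on_complete if guard(ctx)), None
        )
    return trace


def build_switch_check(return_code: int) -> Dict[str, Task]:
    """Build the three-task workflow with a simulated diagnostic return code."""
    return {
        "t1_diagnose": Task(
            action=lambda ctx: {"return_code": return_code, "stdout": "diag output"},
            publish={"code": lambda ctx: ctx["t1_diagnose"]["return_code"]},
            on_complete=[
                ("t2_post_to_chat", lambda ctx: ctx["code"] == 0),
                ("t3_page_network_admin", lambda ctx: ctx["code"] > 0),
            ],
        ),
        "t2_post_to_chat": Task(
            action=lambda ctx: {"posted": f"Switch {ctx['switch']} checked, OK"}
        ),
        "t3_page_network_admin": Task(
            action=lambda ctx: {"paged": ctx["t1_diagnose"]["stdout"]}
        ),
    }


print(run_workflow(build_switch_check(0), "t1_diagnose", {"switch": "sw1"}))
print(run_workflow(build_switch_check(2), "t1_diagnose", {"switch": "sw1"}))
```
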
  38. 38. More: Parallel Execution (diagram: t1 fans out to t2, t3, t4)
          ...
          t1_do_build:
            action: cicd.do_build_and_packages
            on-success:
              - t2_test_ubuntu14
              - t3_test_fedora20
              - t4_test_rhel6
          t2_test_ubuntu14:
            action: cicd.deploy_and_test distro="UBUNTU14"
          t3_test_fedora20:
            action: cicd.deploy_and_test distro="F20"
          t4_test_rhel6:
            action: cicd.deploy_and_test distro="RHEL6"
  39. 39. More: Join (diagram: t1 fans out to t2, t3, t4, which converge on t5)
  40. 40. More: Join. 16 ways to join (diagram: t1, t2, t3, t4, t5)
  41. 41. More: Join (Simple Merge). http://www.workflowpatterns.com/patterns/control/basic/wcp5.php (diagram: t2, t3, t4 each trigger t5, so t5 runs three times)
          ...
          t2_test_ubuntu14:
            action: cicd.deploy_and_test distro="UBUNTU14"
            on-success: t5_post_status
          t3_test_fedora20:
            action: cicd.deploy_and_test distro="F20"
            on-success: t5_post_status
          t4_test_rhel6:
            action: cicd.deploy_and_test distro="RHEL6"
            on-success: t5_post_status
          t5_post_status:
            action: chatops.say
            input:
              header: Test completed!
  42. 42. More: Join (Full “AND” Join). http://www.workflowpatterns.com/patterns/control/new/wcp33.php (diagram: t2, t3, t4 all complete before t5 runs once)
          ...
          t2_test_ubuntu14:
            action: cicd.deploy_and_test distro="UBUNTU14"
            on-success: t5_tag_release
          t3_test_fedora20:
            action: cicd.deploy_and_test distro="F20"
            on-success: t5_tag_release
          t4_test_rhel6:
            action: cicd.deploy_and_test distro="RHEL6"
            on-success: t5_tag_release
          t5_tag_release:
            join: all
            action: cicd.tag_release
  43. 43. More: Join (Discriminator). http://www.workflowpatterns.com/patterns/control/advanced_branching/wcp9.php (diagram: the first of t2, t3, t4 to fail triggers t5 once)
          ...
          t2_test_ubuntu14:
            action: cicd.deploy_and_test distro="UBUNTU14"
            on-failure: t5_report_and_fail
          t3_test_fedora20:
            action: cicd.deploy_and_test distro="F20"
            on-failure: t5_report_and_fail
          t4_test_rhel6:
            action: cicd.deploy_and_test distro="RHEL6"
            on-failure: t5_report_and_fail
          t5_report_and_fail:
            join: one
            action: chatops.say header="FAILURE!"
            on-complete: fail
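
The three join flavours differ only in how many times the downstream task runs and when. The sketch below counts that for each mode; the `Join` class and the mode name `"merge"` are hypothetical, while `"all"` and `"one"` echo the `join:` values used on the slides.

```python
class Join:
    """Decides whether a downstream task runs each time an inbound task arrives."""

    def __init__(self, mode: str, inbound: int) -> None:
        if mode not in ("merge", "all", "one"):
            raise ValueError(f"unknown join mode: {mode}")
        self.mode = mode
        self.inbound = inbound
        self.arrived = 0

    def arrive(self) -> bool:
        """Record one inbound transition; return True if the target runs now."""
        self.arrived += 1
        if self.mode == "merge":
            # Simple merge: no synchronisation, the target runs on every arrival.
            return True
        if self.mode == "all":
            # AND join: the target runs once, after the last inbound task arrives.
            return self.arrived == self.inbound
        # Discriminator: the target runs once, on the first arrival only.
        return self.arrived == 1


def count_runs(mode: str, inbound: int = 3) -> int:
    """Count how often the target runs when all inbound tasks arrive."""
    join = Join(mode, inbound)
    return sum(join.arrive() for _ in range(inbound))


for mode in ("merge", "all", "one"):
    print(mode, count_runs(mode))
```

With three inbound tasks, a simple merge runs the target three times, while both the AND join and the discriminator run it once; they differ in whether that single run happens after the last arrival or the first.
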
  44. 44. More: Multiple Data (diagram: t1 → t2, with t2 repeated for each item of ip_list=[...])
          ...
          t1_get_ip_list:
            action: myinventory.allocate_ips num=4
            publish:
              - ip_list: <% $.t1_get_ip_list.ips %>
            on-complete: t2_create_vms
          t2_create_vms:
            with-items: ip in <% $.ip_list %>
            action: myaws.create_vms ip=<% $.ip %>
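
The `with-items` pattern runs the same action once per element of a published list. A rough Python analogue, using a thread pool to fan out, is sketched below. `allocate_ips` and `create_vm` are stand-ins for the slide's `myinventory.allocate_ips` and `myaws.create_vms` actions and do no real work; the addresses come from the 192.0.2.0/24 documentation range.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def allocate_ips(num: int) -> List[str]:
    # Stand-in for an inventory call; returns addresses from a documentation range.
    return [f"192.0.2.{host}" for host in range(10, 10 + num)]


def create_vm(ip: str) -> Dict[str, str]:
    # Stand-in for a cloud API call that creates one VM with the given address.
    return {"ip": ip, "status": "created"}


def create_vms(ip_list: List[str]) -> List[Dict[str, str]]:
    """Run create_vm once per IP, in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(create_vm, ip_list))


ip_list = allocate_ips(4)
for vm in create_vms(ip_list):
    print(vm)
```
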
  45. 45. Recap: Key Workflow Operations • Sequence • Data passing • Conditions (on data) • Parallel execution • Joins • Multiple Data Items
  46. 46. Why not Scripts?
  47. 47. Why not Scripts? • Simple to define, reason, visualize • Transparent – state is clear, execution is trackable: running, complete, failed steps
  48. 48. © 2016 BROCADE COMMUNICATIONS SYSTEMS, INC.49
  49. 49. Workflows Better in Operations • Simple to define, reason about, visualize • Transparent – state is clear, execution is trackable: running, complete, failed steps • Reliable – Workflows are long-running – Crash tolerance – “Restart from point of failure”
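
"Restart from point of failure" depends on the engine recording per-step state, which is also what makes execution trackable. A minimal sketch of the idea follows; `run_with_checkpoints` and the in-memory `state` dictionary are hypothetical, and a real engine would persist this state outside the process so it survives a crash.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_with_checkpoints(steps: List[Step], state: Dict[str, str]) -> None:
    """Run steps in order, skipping those already recorded as succeeded."""
    for name, action in steps:
        if state.get(name) == "succeeded":
            continue  # Completed in an earlier run; do not repeat it.
        state[name] = "running"
        try:
            action()
        except Exception:
            state[name] = "failed"
            raise
        state[name] = "succeeded"


calls: List[str] = []
restart_should_fail = True


def update_config() -> None:
    calls.append("update_config")


def restart_service() -> None:
    calls.append("restart_service")
    if restart_should_fail:
        raise RuntimeError("service did not come up")


steps: List[Step] = [
    ("update_config", update_config),
    ("restart_service", restart_service),
]
state: Dict[str, str] = {}

try:
    run_with_checkpoints(steps, state)
except RuntimeError:
    pass  # First run stops at the failing step; its state is recorded.

restart_should_fail = False
run_with_checkpoints(steps, state)  # Second run resumes at restart_service.
print(calls)
print(state)
```

A plain script would have to rerun everything from the top or carry its own ad hoc bookkeeping; here the first step executes exactly once across both runs.
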
  50. 50. Why not Legacy RunBook Automation? DevOps: Infrastructure as Code
  51. 51. Infrastructure as code: Case Study • Automated provisioning, 4 data centers • Before: CPO, operator updates via GUI, click and pray, x4 • After: BWC, dev -> code review -> staging -> QA -> prod
  52. 52. Infrastructure as code. Top predictor of IT performance? Version control used by Ops for Ops artifacts!
  53. 53. Designed for DevOps 1. Support infrastructure as code 2. Open Source 3. Scale and reliability 4. Part of tool chain 5. Social coding & collaboration 6. More demanding - requires skills
  54. 54. Part of tool chain: DevOps tools vs enterprise suites
  55. 55. Leverage social coding Community packs @ StackStorm exchange
  56. 56. More demanding. Requires skills: CLI, scripting, understanding
  57. 57. Operation Patterns. Capture and share operational patterns as code!
  58. 58. © 2016 BROCADE COMMUNICATIONS SYSTEMS, INC.59
  59. 59. © 2016 BROCADE COMMUNICATIONS SYSTEMS, INC.60
  60. 60. 61
  61. 61. Summary • Event-driven automation works: benefits to reliable cloud operations • Automation must be reliable and transparent: workflows beat scripts • Infra as code is key: repeatable, testable, reliable automation
  62. 62. StackStorm: Open Source, Apache 2.0 • GitHub: github.com/StackStorm/st2 • Twitter: Stack_Storm • IRC: #stackstorm on FreeNode • stackstorm.slack.com on Slack • www.stackstorm.com. Brocade Workflow Composer: Commercial Edition • Enterprise features • Priority support • Network lifecycle automation suite • brocade.com/bwc • docs: bwc-docs.brocade.com
  63. 63. Questions & Answers
