Managing IT outages with
Icinga and StackStorm
@jfryman
Placing tools directly in the
middle of the conversation
— Jesse Newland
Chat
HipChat
Slack
FlowDock
CampFire
IRC
Bots
Lita
Hubot
Err
Lazlo*
Bots are NOT 100%
necessary
Benefits
· Sharing
· Learning
· Speed
· Security
· Brainstorming
· Fun
Design
Inform Users
The problem isn’t that
they don’t trust you; they
want to know it’s being fixed.
Trust
Integration
Create integration
from Icinga ->
StackStorm
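One way to picture the Icinga -> StackStorm hand-off is a small script that an Icinga notification command could call. The sketch below only builds the HTTP request and does not send it. It assumes StackStorm's generic webhook endpoint (`/api/v1/webhooks/st2`) and the `St2-Api-Key` header; the base URL, the key, and the payload field names are hypothetical and should be checked against your own installation.

```python
import json
import urllib.request


def build_st2_webhook_request(base_url, api_key, host, service, state):
    """Build (but do not send) a POST carrying one Icinga state change."""
    body = {
        # Trigger name borrowed from the rule example in this deck.
        "trigger": "nagios.service-state-change",
        # Payload field names are assumptions chosen for illustration.
        "payload": {"host": host, "service": service, "state": state},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/webhooks/st2",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "St2-Api-Key": api_key},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host name and API key, for illustration only.
    req = build_st2_webhook_request(
        "https://stackstorm.example.com", "hypothetical-key",
        "camera01", "ping4", "CRITICAL",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would actually deliver the event; keeping construction separate makes the payload easy to inspect first.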
Integration
Event API coming
soon! (CW42)
Rules
Events are going to
StackStorm: Now what?
Rules
Triggers are emitted into
the system, match against
criteria, and then take an
action if successful.
---
name: "notify_catcam_alerts"
pack: "frymanio"
description: "Notify users when an error comes from CatCam hosts"
enabled: true
trigger:
  type: "nagios.service-state-change"
criteria:
  trigger.host:
    type: "matchregex"
    pattern: "^camera"
action:
  ref: "slack.post_message"
  parameters:
    channel: "#catcam"
    message: "[{{trigger.event_id}}] Problem on {{trigger.host}}/{{trigger.service}} is currently {{trigger.state}}"
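To make the match-then-act flow concrete, here is a rough Python sketch of what evaluating the rule above could look like. It is not StackStorm's implementation: `get_path`, `rule_matches`, and `render` are hypothetical helpers, only the `matchregex` operator is covered, and the sample event values are invented.

```python
import re


def get_path(data, dotted):
    """Resolve 'trigger.host' style keys against nested dictionaries."""
    for part in dotted.split("."):
        data = data[part]
    return data


def rule_matches(criteria, context):
    """Return True when every criterion matches the trigger payload."""
    for key, check in criteria.items():
        value = str(get_path(context, key))
        if check["type"] == "matchregex":
            # re.match anchors at the start of the value.
            if re.match(check["pattern"], value) is None:
                return False
        else:
            raise ValueError("operator not covered by this sketch: " + check["type"])
    return True


def render(template, context):
    """Fill {{dotted.key}} placeholders with values from the context."""
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}",
                  lambda m: str(get_path(context, m.group(1))), template)


CRITERIA = {"trigger.host": {"type": "matchregex", "pattern": "^camera"}}
MESSAGE = ("[{{trigger.event_id}}] Problem on {{trigger.host}}/"
           "{{trigger.service}} is currently {{trigger.state}}")

if __name__ == "__main__":
    # Invented sample event; real payloads depend on the sensor.
    event = {"trigger": {"event_id": 42, "host": "camera01",
                         "service": "ping4", "state": "CRITICAL"}}
    if rule_matches(CRITERIA, event):
        print(render(MESSAGE, event))
```

An event from a host that does not start with `camera` fails the criteria, so no message would be posted to the channel.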
Design
Failures in Slack: Visibility
for all!
StackStorm -> Icinga
Integration
Design
Allows you to expose tools
that would normally be
foreign to developers/users.
Give control back to users.
Let users run the
commands themselves.
Give Back Control
Local Runner
Remote Runner
Python Runner
HTTP Runner
WinRM Runner
Runners
name: view_alerts
runner_type: remote-shell-cmd
description: View all alerts on Icinga System
enabled: true
entry_point: ""
parameters:
  cmd:
    default: 'icingacli monitoring list'
• username (string) - Username used to log in. If not provided, default username from config is used.
• private_key (string) - Private key used to log in. If not provided, private key from the config file is used.
• timeout (integer) - Action timeout in seconds. Action will get killed if it doesn’t finish in timeout seconds.
• sudo (boolean) - The remote command will be executed with sudo.
• kwarg_op (string) - Operator to use in front of keyword args, i.e. "--" or "-".
• password (string) - Password used to log in. If not provided, private key from the config file is used.
• parallel (boolean) - Default to parallel execution.
• cmd (string) - Arbitrary Linux command to be executed on the remote host(s).
• hosts (string) - A comma-delimited list of hosts where the remote command will be executed.
• env (object) - Environment variables which will be available to the command (e.g. key1=val1,key2=val2).
• cwd (string) - Working directory where the script will be executed.
• dir (string) - The working directory where the script will be copied to on the remote host.
Additional metadata with remote-shell-cmd
id: 55ba7a198bc962174a3911c5
status: succeeded
result:
{
"192.168.33.5": {
"succeeded": true,
"failed": false,
"return_code": 0,
"stderr": "No entry for terminal type "unknown";
using dumb terminal settings.
tput: unknown terminal "unknown"",
"stdout": "UP icinga2: PING OK - Packet loss = 0%, RTA = 0.09 ms
OK random-001 (For 0m 3s)
CRIT random-002 (For 0m 1s)
WARN random-003 (For 0m 1s)
CRIT random-004 (For 0m 5s)
WARN random-005 (For 0m 2s)
OK dns icinga.org (Since 19:18)
OK dns netways.org (Since 19:18)
OK ping4 (Since 04:19)
OK ping6 (Since 04:19)
OK Icinga Web 2 (Since 01:17)
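Raw `stdout` like the listing above is noisy in a chat room, so a small post-processing step can help. The following sketch counts states and pulls out the WARN/CRIT entries. The line layout is inferred only from the sample output shown on this slide, and `summarize` is a hypothetical helper, not part of Icinga or StackStorm.

```python
import re
from collections import Counter

# STATE, then a name, then an optional "(For ...)" or "(Since ...)" suffix.
LINE = re.compile(
    r"^(?P<state>[A-Z]+)\s+(?P<name>.+?)(?:\s+\((?:For|Since)\s[^)]*\))?$"
)

SAMPLE = """UP icinga2: PING OK - Packet loss = 0%, RTA = 0.09 ms
OK random-001 (For 0m 3s)
CRIT random-002 (For 0m 1s)
WARN random-003 (For 0m 1s)
CRIT random-004 (For 0m 5s)
WARN random-005 (For 0m 2s)
OK dns icinga.org (Since 19:18)"""


def summarize(stdout):
    """Return (state counts, list of (state, name) for WARN/CRIT lines)."""
    counts = Counter()
    problems = []
    for raw in stdout.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # skip anything that does not look like a status line
        counts[m.group("state")] += 1
        if m.group("state") in ("WARN", "CRIT"):
            problems.append((m.group("state"), m.group("name")))
    return counts, problems


if __name__ == "__main__":
    counts, problems = summarize(SAMPLE)
    print(dict(counts))
    for state, name in problems:
        print(state, name)
```

Posting only the `problems` list back to the channel keeps the conversation focused on what is actually broken.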
What About
ChatOps?
Before:
Create a playbook, execute it by hand
when things break. Fix it “later”.
Today:
Create script/alias and expose via
ChatOps.
State of Repair
Design
Describe The
Service
/ci
/graph
How do folks actually
communicate?
Discovery
---
name: "broken_catcam_discovery"
pack: "frymanio"
description: "Make a log when someone talks about CatCam being broken"
enabled: true
trigger:
  type: "slack.message"
criteria:
  trigger.text:
    type: "includes"
    pattern: "camera"
  # YAML keeps only one value per key, so this second trigger.text
  # entry collides with the one above.
  trigger.text:
    type: "matchregex"
    pattern: "(broken|not working)"
action:
  ref: "core.local"
  parameters:
    cmd: "echo {{trigger.text}} >> /tmp/catcam_problems.txt"
---
name: "cameras_clean"
action_ref: "catcam.cameras_clean"
description: "Clean up artifacts on cameras"
formats:
- "cameras clean {{host}}"
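The alias format above maps a chat message onto action parameters. As a rough illustration of that mapping (not StackStorm's own parser), the sketch below turns a format string into a regular expression and extracts the placeholder values. `format_to_regex` and `parse_command` are hypothetical names, and placeholders are assumed to capture a single whitespace-free token.

```python
import re


def format_to_regex(fmt):
    """Turn 'cameras clean {{host}}' into a regex with one named group per placeholder."""
    pattern = ""
    pos = 0
    for m in re.finditer(r"\{\{\s*(\w+)\s*\}\}", fmt):
        # Literal text between placeholders must match exactly.
        pattern += re.escape(fmt[pos:m.start()])
        # Assumption: each placeholder captures one whitespace-free token.
        pattern += "(?P<%s>\\S+)" % m.group(1)
        pos = m.end()
    pattern += re.escape(fmt[pos:])
    return re.compile("^" + pattern + "$")


def parse_command(fmt, message):
    """Return the extracted parameters, or None if the message does not match."""
    m = format_to_regex(fmt).match(message.strip())
    return m.groupdict() if m else None


if __name__ == "__main__":
    print(parse_command("cameras clean {{host}}", "cameras clean camera01"))
    print(parse_command("cameras clean {{host}}", "cameras reboot camera01"))
```

The extracted dictionary is what would be handed to the action named in `action_ref` as its parameters.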
Discovery
Design
Help comes first, not last
Execution
Create Feedback Loops
Closed Loop
Does trust exist in automation?
Garner trust via visibility.
Culture Awareness
Next level: kick off workflow.
---
chain:
  -
    name: notify_chatops
    ref: core.local
    params:
      cmd: "echo 'Removing old camera files...'"
    on-success: delete_old_files
  -
    name: delete_old_files
    ref: core.remote
    params:
      cmd: "sudo find . -type f -mtime +1 -delete"
      cwd: "/var/lib/catcam"
      hosts: "{{host}}"
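The chain above can be read as a tiny state machine: run the first task, then follow its `on-success` link. The sketch below walks such a chain in plain Python. It is an illustration only, not the StackStorm runner; `run_chain` and the `execute` callback are hypothetical, and the `on-failure` key is included on the assumption that a chain may also name a failure branch.

```python
CHAIN = [
    {"name": "notify_chatops", "ref": "core.local",
     "params": {"cmd": "echo 'Removing old camera files...'"},
     "on-success": "delete_old_files"},
    {"name": "delete_old_files", "ref": "core.remote",
     "params": {"cmd": "sudo find . -type f -mtime +1 -delete",
                "cwd": "/var/lib/catcam", "hosts": "{{host}}"}},
]


def run_chain(chain, execute):
    """Run tasks from the first entry, following on-success/on-failure links.

    `execute(task)` is a hypothetical callback returning True on success.
    Returns the names of the tasks that ran, in order.
    """
    by_name = {task["name"]: task for task in chain}
    ran = []
    current = chain[0]
    while current is not None:
        if current["name"] in ran:
            raise ValueError("cycle detected at task " + current["name"])
        ran.append(current["name"])
        ok = execute(current)
        next_name = current.get("on-success") if ok else current.get("on-failure")
        current = by_name.get(next_name) if next_name else None
    return ran


if __name__ == "__main__":
    # Pretend every task succeeds; nothing is executed for real.
    print(run_chain(CHAIN, lambda task: True))
```

If the first task fails and no failure branch is defined, the chain stops there, which is why the cleanup never runs without the notification going out first.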
Doesn’t solve the problem?
At least you’ve ruled out
known issues now.
Auto-Remediation
WHY
Shared
CLI
Shared
Context
A different kind of bus…
Official Release:
11.11.2015
https://stackstorm.com/community-signup
https://stackstorm.com/#try-now
AMI / VMDK / All in One Installer
Thanks!

ChatOps with Icinga and StackStorm