Michael Kehoe
Senior Site Reliability Engineer
LinkedIn
Using SaltStack to Auto Triage and
Remediate Production Systems
Topics
• $ whoami
• Salt @ LinkedIn
• LinkedIn’s auto-remediation story
• Nurse
• Salt API
• Auto-remediation Salt Modules
• How to get started
$ whoami
Salt @ LinkedIn
• 11-14k minions per master
• Up to 8.5 events/second per master
• Deployment system heavily utilizes Salt
• 5 dedicated SREs work on LinkedIn’s Salt infrastructure
• A number of other SREs also contribute
Alert Growth vs NOC Engineers
LinkedIn’s Auto-remediation Story
[Chart: number of alerts and number of NOC engineers per year, 2010-2017]
The problem
LinkedIn’s Auto-remediation Story
• Growing number of high-priority alerts vs. engineers to watch them
• Complicated ITRs (runbooks) took time for NOC engineers to execute and escalate
• SREs would forget to run diagnostic tooling during outages
• Longer MTTR == bad experience for members
Building the Solution
LinkedIn’s Auto-Remediation Story
• LinkedIn needed a workflow engine
• Requirements
• Take automated actions against applications/hosts
• Perform the runbook automatically
• Appropriate auditing
• Scalable
What’s already out there
Autoremediation
• StackStorm
• fbar
• SaltStack
Started building an in-house solution in mid-2014
Nurse
Image: freeflaticons.com
Nurse
• ‘Event-based’ system
• Built on top of LinkedIn’s existing monitoring infrastructure
• Allows for manual execution via Web UI
• Allows other systems to plug in via REST API
• Built-in rules engine
• Allows for complicated checks/branching
• Global environment awareness
• Scalable!
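The slides do not show Nurse's internals. As a rough sketch of what a rules engine with checks and branching could look like, the hypothetical Python below evaluates an ordered list of rules against an alert and returns the first matching action. Every name here (`Rule`, `decide`, `RULES`, the alert fields) is an assumption for illustration, not part of Nurse.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    check: Callable[[Dict], bool]  # predicate over the alert payload
    action: str                    # label of the workflow step to run


def decide(alert: Dict, rules: List[Rule], default: str = "escalate") -> str:
    """Return the action of the first rule whose check passes, else the default."""
    for rule in rules:
        if rule.check(alert):
            return rule.action
    return default


# Hypothetical rule set: order matters, earlier rules win.
RULES = [
    Rule("maintenance", lambda a: a.get("in_maintenance", False), "suppress"),
    Rule("healthcheck", lambda a: a.get("type") == "healthcheck", "restart"),
]
```

Keeping the rules as ordered data rather than nested if/else makes it easier to audit which branch fired for a given alert.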
The basics
Salt REST API
• RESTful interface to Salt
• Allows for external authentication
• PAM
• LDAP
• Your own plugin…
• Built on top of CherryPy
Interface
Salt REST API
• /login - Log in to receive a session token
• /logout - Remove or invalidate sessions
• /minions - Work directly with minions
• /jobs - Get lists of previously run jobs or the return from a single job
• /run - Run commands without normal session handling
• /events - Expose the Salt event bus
• /hook - A generic web hook entry point that fires an event on Salt's event bus
• /keys - Wrapper around key management
• /ws - Open a WebSocket connection to Salt's event bus
• /stats - Return a dump of statistics collected from the CherryPy server
https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
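As a minimal sketch of how a client might use these endpoints with only the Python standard library: POST credentials to /login to obtain a session token, then POST a lowstate chunk to the root URL with the token in the `X-Auth-Token` header. The host name and the helper names (`build_login_request`, `build_run_request`, `extract_token`) are assumptions for illustration; the requests are only constructed here, not sent.

```python
import json
import urllib.request

API_URL = "https://salt.example.com:8888"  # assumed host; port as in the config slides


def build_login_request(username, password, eauth="ldap"):
    """Build the POST /login request that exchanges credentials for a token."""
    body = json.dumps({"username": username, "password": password, "eauth": eauth})
    return urllib.request.Request(
        API_URL + "/login",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def build_run_request(token, tgt, fun, arg=()):
    """Build a POST / request that runs an execution function on minions."""
    lowstate = [{"client": "local", "tgt": tgt, "fun": fun, "arg": list(arg)}]
    return urllib.request.Request(
        API_URL + "/",
        data=json.dumps(lowstate).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "X-Auth-Token": token,
        },
        method="POST",
    )


def extract_token(login_response_body):
    """Pull the session token out of a /login JSON response body."""
    return json.loads(login_response_body)["return"][0]["token"]
```

To actually send a request, pass it to `urllib.request.urlopen(...)` and read the JSON response; the pepper library shown later wraps these same calls.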
Configuration
Salt REST API
external_auth:
  ldap:
    headless-nurse:
      - '*':
          - test.*
          - nurse.*
      - '@jobs'
https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
/etc/salt/master.d/salt-api.conf
Configuration
Salt REST API
rest_cherrypy:
  port: 8888
  ssl_crt: /etc/salt/pki/api/salt-api.crt
  ssl_key: /etc/salt/pki/api/salt-api.key
https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
/etc/salt/master.d/salt-api.conf
Configuration
Salt REST API
auth.ldap.basedn: 'dc=example,dc=com'
auth.ldap.binddn: 'readonly-ldap@example'
auth.ldap.bindpw: 'password'
auth.ldap.filter: sAMAccountName={{ username }}
auth.ldap.server: ldap.example.com
auth.ldap.tls: True
auth.ldap.persontype: 'person'
auth.ldap.groupclass: 'group'
auth.ldap.groupou: 'Users'
https://docs.saltstack.com/en/latest/topics/eauth/index.html
/etc/salt/master.d/salt-api.conf
Code Interfaces
Salt REST API
• There is a Python interface to the Salt API: pepper
• Provides a basic wrapper around the REST API
• https://github.com/saltstack/pepper/
• See https://github.com/SUSE/salt-netapi-client for Java bindings
Python Code example
Salt REST API
from pepper.libpepper import Pepper

api = Pepper('https://salt.example.com:8888')
# Authenticate with username, password, and eauth backend
api.login('saltdev', 'saltdev', 'ldap')

# Run a simple function on all minions
api.low([{'client': 'local', 'tgt': '*', 'fun': 'test.ping'}])

# Execute a runner function
api.runner('jobs.lookup_jid', jid=12345)
Goals
Auto-remediation Salt Modules
• Auto-triage issues
• Reduce context-switching
• Run labor-intensive tasks automatically
• Gather data while the engineer is logging in after escalation
• Auto-remediate issues
• Let the engineer sleep
• Faster MTTR
Implementation
Auto-remediation Salt Modules
• Auto-triage issues
• Identify abusive clients
• Collect & analyze thread/heap dumps
• Parse & summarize log files
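To illustrate the kind of triage step listed above, here is a minimal sketch of identifying abusive clients from log lines. It assumes each line begins with the client identifier followed by whitespace, as in common access-log formats; the function name `top_clients` and the threshold are hypothetical and would need adapting to your own logs.

```python
from collections import Counter


def top_clients(log_lines, threshold):
    """Count requests per client and return those at or above the threshold.

    Assumes the first whitespace-separated field of each line is the
    client identifier; blank lines are skipped.
    """
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return [(client, n) for client, n in counts.most_common() if n >= threshold]
```

A function like this could be wrapped in a custom Salt execution module so that the summary is already available when the engineer logs in.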
Implementation
Auto-remediation Salt Modules
• Auto-remediate issues
• Restart/OOR applications
• Scale up applications
• Block abusive clients
• Update A/B experiment definitions
• Otherwise escalate…
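The "remediate, otherwise escalate" pattern can be sketched as follows. This is a hypothetical illustration, not LinkedIn's implementation: `remediations` is an assumed mapping from alert type to a callable that returns True on success, and `escalate` is an assumed callback that pages a human.

```python
def remediate_or_escalate(alert, remediations, escalate):
    """Try the remediation registered for the alert type; escalate otherwise.

    Any missing handler, unsuccessful run, or exception is routed to
    `escalate` so that a failed fix never hides the alert.
    """
    handler = remediations.get(alert["type"])
    if handler is None:
        return escalate(alert, reason="no remediation defined")
    try:
        if handler(alert):
            return "remediated"
    except Exception as exc:
        return escalate(alert, reason=f"remediation raised: {exc}")
    return escalate(alert, reason="remediation did not succeed")
```

Passing the reason to the escalation callback gives the on-call engineer context about what automation already tried.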
Implementation
Auto-remediation Salt Modules
• Plenty of Salt modules/runners already out there
• Over 300 modules are available in Salt core
• Modules to remediate & notify
• Write your own
• Make sure you test!
Success
LinkedIn’s Auto-remediation Story
• 854k actions taken
• 100% of service health check alerts are onboarded
• ~37k man-hours have now been automated
• Now automating ~1100 hours/week
Success
LinkedIn’s Auto-remediation Story
[Chart: number of alerts and number of NOC engineers per year, 2010-2017]
Some lessons learnt
How to get started
• Decide how you will architect YOUR ‘event bus’
• Use external monitoring to trigger Salt via API
• Use reactors/beacons internally within Salt’s event bus
• See ‘Thorium’ documentation for Salt’s new reactor engine
• Auditing is important!
• Need to know what/who triggered actions
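One way an external monitoring system could trigger Salt while leaving an audit trail is the /hook endpoint, which fires the request body onto the event bus under the tag `salt/netapi/hook/<suffix>`, where a reactor can pick it up. The sketch below only builds the request; the host, the tag suffix, and the `triggered_by` field are assumptions for illustration.

```python
import json
import urllib.request

API_URL = "https://salt.example.com:8888"  # assumed host


def build_hook_request(token, tag_suffix, alert, triggered_by):
    """Build a POST /hook/<tag_suffix> request carrying an alert payload.

    `triggered_by` is added to the payload so the resulting event
    records who or what requested the action.
    """
    payload = dict(alert, triggered_by=triggered_by)
    return urllib.request.Request(
        f"{API_URL}/hook/{tag_suffix}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="POST",
    )
```

Because the payload travels on the event bus, a returner or an event listener can persist it for later auditing.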
Some lessons learnt
How to get started
• Reporting is important!
• Need to know how often automated actions are being taken
• Find failure hotspots
• Leverage event-bus or returners
• Safety first
• Don’t give yourself too much rope…
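One simple way to limit the rope is to cap how many automated actions may run in a time window and escalate to a human once the cap is hit. The class below is a hypothetical sketch of such a guard (a sliding-window budget); the name `ActionBudget` and its parameters are assumptions, not something described in the talk.

```python
import time
from collections import deque


class ActionBudget:
    """Allow at most `limit` automated actions per `window` seconds."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock      # injectable for testing
        self._stamps = deque()  # timestamps of recently allowed actions

    def allow(self):
        """Return True and record the action if the budget permits it."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False
        self._stamps.append(now)
        return True
```

A caller would check `budget.allow()` before each restart and escalate instead when it returns False, so a flapping alert cannot restart a whole fleet.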
Conclusion
• Think carefully about your ‘Event-Driven-Automation’ architecture
• The sky is the limit with Salt… don’t limit yourself
• Again…safety first
Questions?
Thank You


Editor's Notes

  • #3 Who am I; background on Salt & auto-remediation at LinkedIn; Nurse, LinkedIn’s auto-remediation engine; Salt API & auto-remediation Salt modules; how to get started.
  • #4 Been at LinkedIn for nearly 2.5 years. Part of LinkedIn’s ARVT group. Occasional committer to LinkedIn’s Salt ecosystem.
  • #5 Want to give you some context on our infrastructure. Average is 11-14k minions per production datacenter. Up to 8.5 events/second per master. Deployment system heavily utilizes Salt. 5 dedicated SREs work on LinkedIn’s Salt infrastructure. A number of other SREs also contribute. Keep these numbers in mind as the session progresses.
  • #6 Let’s talk about LinkedIn’s auto-remediation story. The graph depicts the number of alerts our NOC watches vs. the number of NOC engineers we employ globally.
  • #9 What is out there? Has anyone heard of these?
  • #15 Login/logout; Jobs; Run; Events: get data from the event bus; Hook: fire an event INTO the Salt bus; WS: open a WebSocket; Stats.
  • #21 So what are the goals of building an auto-remediation engine?
  • #24 Remediate: Apache/Nginx restarts. Notify: Slack/HipChat modules.
  • #26 In 2014 we stopped increasing headcount; now we’re lowering our NOC engineer headcount.