Using SaltStack to Auto Triage and Remediate Production Systems

Michael Kehoe
Architect of reliable, scalable infrastructure at LinkedIn
Senior Site Reliability Engineer
LinkedIn
Topics
• $ whoami
• Salt @ LinkedIn
• LinkedIn’s auto remediation story
• Nurse
• Salt API
• Auto-remediation Salt Modules
• How to get started
$ whoami
Salt @ LinkedIn
• 11-14k minions per master
• Up to 8.5 events/ second per master
• Deployment system heavily utilizes Salt
• 5 dedicated SREs work on LinkedIn’s Salt infrastructure
• A number of other SREs also contribute
Alert Growth vs NOC Engineers
LinkedIn’s Auto-remediation Story
[Chart: Num of Alerts and Num of Engineers by year, 2010 to 2017]
The problem
LinkedIn’s Auto-remediation Story
• Growing number of high priority alerts vs Engineers to watch
• Complicated ITRs (runbooks) took time for NOC engineers to execute and escalate
• SREs would forget to run diagnostic tooling during outages
• Longer MTTR == Bad experience for members
Building the Solution
LinkedIn’s Auto-Remediation Story
• LinkedIn needed a workflow engine
• Requirements
• Take automated actions against applications/ hosts
• Perform the run book automatically
• Appropriate auditing
• Scalable
What’s already out there
Autoremediation
• StackStorm
• fbar
• SaltStack
Started building an in-house solution in Mid-2014
Nurse
• ‘Event based’ system
• Built on top of LinkedIn’s existing monitoring infrastructure
• Allows for manual execution via Web UI
• Allows for other systems to plug-in via REST API
• Built in rules-engine
• Allows for complicated checks/ branching
• Global environment awareness
• Scalable!
The basics
Salt REST API
• RESTful interface to Salt
• Allows for external authentication
• PAM
• LDAP
• Your own plugin…
• Built on top of CherryPy
Interface
Salt REST API
• /login - Log in to receive a session token
• /logout - Remove or invalidate sessions
• /minions – Working directly with minions
• /jobs - Getting lists of previously run jobs or getting the return from a single job
• /run - Run commands without normal session handling
• /events - Expose the Salt event bus
• /hook - A generic web hook entry point that fires an event on Salt's event bus
• /keys – Wrapper around key management
• /ws - Open a WebSocket connection to Salt's event bus
• /stats - Return a dump of statistics collected from the CherryPy server
https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
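As a rough illustration of how the /login endpoint and the root endpoint fit together, here is a minimal sketch using only the Python standard library. It assumes the rest_cherrypy conventions of a JSON login body with `username`, `password`, and `eauth`, a session token returned under `return[0].token`, and an `X-Auth-Token` header on later calls. The host name and the helper function names are placeholders for illustration, not part of the talk.

```python
import json
import urllib.request

JSON_HEADERS = {"Accept": "application/json", "Content-Type": "application/json"}


def build_login_request(base_url, username, password, eauth="ldap"):
    """Build (but do not send) a POST /login request."""
    body = {"username": username, "password": password, "eauth": eauth}
    return urllib.request.Request(
        base_url + "/login",
        data=json.dumps(body).encode("utf-8"),
        headers=JSON_HEADERS,
        method="POST",
    )


def extract_token(login_response_body):
    """Pull the session token out of the JSON body returned by /login."""
    return json.loads(login_response_body)["return"][0]["token"]


def build_run_request(base_url, token, tgt, fun):
    """Build a POST / request that runs one execution function on `tgt`."""
    lowstate = [{"client": "local", "tgt": tgt, "fun": fun}]
    headers = dict(JSON_HEADERS)
    headers["X-Auth-Token"] = token
    return urllib.request.Request(
        base_url + "/",
        data=json.dumps(lowstate).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(build_login_request(url, user, pw)) as resp:
#       token = extract_token(resp.read())
```

Keeping request construction separate from sending makes the payloads easy to inspect and unit test without a running salt-api.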
Configuration
Salt REST API
external_auth:
  ldap:
    headless-nurse:
      - '*':
        - test.*
        - nurse.*
      - '@jobs'
https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
/etc/salt/master.d/salt-api.conf
Configuration
Salt REST API
rest_cherrypy:
  port: 8888
  ssl_crt: /etc/salt/pki/api/salt-api.crt
  ssl_key: /etc/salt/pki/api/salt-api.key
https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
/etc/salt/master.d/salt-api.conf
Configuration
Salt REST API
auth.ldap.basedn: 'dc=example,dc=com'
auth.ldap.binddn: 'readonly-ldap@example'
auth.ldap.bindpw: 'password'
auth.ldap.filter: sAMAccountName={{ username }}
auth.ldap.server: ldap.example.com
auth.ldap.tls: True
auth.ldap.persontype: 'person'
auth.ldap.groupclass: 'group'
auth.ldap.groupou: 'Users'
https://docs.saltstack.com/en/latest/topics/eauth/index.html
/etc/salt/master.d/salt-api.conf
Code Interfaces
Salt REST API
• There is a Python interface to the Salt API – pepper
• Provides basic wrapper around REST API
• https://github.com/saltstack/pepper/
• See https://github.com/SUSE/salt-netapi-client for Java bindings
Python Code example
Salt REST API
from pepper.libpepper import Pepper

# Connect to salt-api and authenticate via the LDAP eauth backend
api = Pepper('https://salt.example.com:8888')
api.login('saltdev', 'saltdev', 'ldap')

# Run simple function
api.low([{'client': 'local', 'tgt': '*', 'fun': 'test.ping'}])

# Execute a runner function
api.runner('jobs.lookup_jid', jid=12345)
Goals
Auto-remediation Salt Modules
• Auto-triage issues
• Reduce context-switching
• Run labor intensive tasks automatically
• Gather data while engineer is logging in after escalation
• Auto-remediate issues
• Let the engineer sleep
• Faster MTTR
Implementation
Auto-remediation Salt Modules
• Auto-triage issues
• Identify abusive clients
• Collect & analyze thread/ heap dumps
• Parse & summarize log files
Implementation
Auto-remediation Salt Modules
• Auto-remediate issues
• Restart/ OOR applications
• Scale-up applications
• Blocking abusive clients
• Update A/ B experiment definitions
• Otherwise escalate…
Implementation
Auto-remediation Salt Modules
• Plenty of Salt modules/ runners already out there
• Over 300 modules are available in Salt core
• Modules to remediate & notify
• Write your own
• Make sure you test!
Success
LinkedIn’s Auto-remediation Story
• 854k actions taken
• 100% of service health check alerts are onboarded
• ~37k man hours have now been automated
• Now automating ~1100 hours/ week
Success
LinkedIn’s Auto-remediation Story
[Chart: Num of Alerts and Num of Engineers by year, 2010 to 2017]
Some lessons learnt
How to get started
• Need to decide how you architect YOUR ‘event bus’
• Use external monitoring to trigger Salt via API
• Use reactors/ beacons internally within Salt’s event bus
• See ‘Thorium’ documentation for Salt’s new reactor engine
• Auditing is important!
• Need to know what/ who triggered actions
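One way to combine the "external monitoring triggers Salt via API" option with the auditing point is to have the monitoring system post to the /hook endpoint and include who and what triggered the action in the payload. The sketch below assumes the rest_cherrypy behaviour of firing an event tagged `salt/netapi/hook/<path>` with the posted body, and an `X-Auth-Token` header for authentication. The field names `alert`, `triggered_by`, and `triggered_at` are hypothetical choices, not a Salt or Nurse schema.

```python
import json
import urllib.request
from datetime import datetime, timezone


def event_tag(tag_suffix):
    """Tag expected on the Salt event bus for a POST to /hook/<tag_suffix>."""
    return "salt/netapi/hook/" + tag_suffix


def build_hook_request(base_url, token, tag_suffix, alert, triggered_by,
                       triggered_at=None):
    """Build (but do not send) a POST /hook/<tag_suffix> request with audit fields."""
    payload = {
        "alert": alert,                # what fired, e.g. alert name and host
        "triggered_by": triggered_by,  # who or which system asked for the action
        "triggered_at": triggered_at or datetime.now(timezone.utc).isoformat(),
    }
    return urllib.request.Request(
        base_url + "/hook/" + tag_suffix,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "X-Auth-Token": token,
        },
        method="POST",
    )
```

A reactor or other event-bus consumer listening on the resulting tag can then log the audit fields alongside whatever action it takes.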
Some lessons learnt
How to get started
• Reporting is important!
• Need to know how often automated actions are being taken
• Find failure hotspots
• Leverage event-bus or returners
• Safety first
• Don’t give yourself too much rope…
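The reporting points above can be sketched as a small aggregation over action records collected from the event bus or a returner. The record shape used here (`host`, `action`, `success`) is an assumption made for illustration; real job returns would need to be mapped into it first.

```python
from collections import Counter


def action_counts(records):
    """How often each automated action was taken."""
    return dict(Counter(record["action"] for record in records))


def failure_hotspots(records, top=3):
    """Most frequent (host, action) pairs among failed actions."""
    failures = Counter(
        (record["host"], record["action"])
        for record in records
        if not record["success"]
    )
    return failures.most_common(top)
```

Even a simple count like this shows which hosts or remediations fail repeatedly and may need a better runbook or an escalation rule.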
Conclusion
• Think carefully about your ‘Event-Driven-Automation’ architecture
• The sky is the limit with Salt…don’t limit yourself
• Again…safety first
Questions?
Thank You
Anowar Hossain10 views
2_DVD_ASIC_Design_FLow.pdf by Usha Mehta
2_DVD_ASIC_Design_FLow.pdf2_DVD_ASIC_Design_FLow.pdf
2_DVD_ASIC_Design_FLow.pdf
Usha Mehta14 views
9_DVD_Dynamic_logic_circuits.pdf by Usha Mehta
9_DVD_Dynamic_logic_circuits.pdf9_DVD_Dynamic_logic_circuits.pdf
9_DVD_Dynamic_logic_circuits.pdf
Usha Mehta21 views
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,... by AakashShakya12
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...
AakashShakya1245 views

Using SaltStack to Auto Triage and Remediate Production Systems

  • 1. Michael Kehoe Senior Site Reliability Engineer LinkedIn Using SaltStack to Auto Triage and Remediate Production Systems
  • 2. Topics • $ whoami • Salt @ LinkedIn • LinkedIn’s auto remediation story • Nurse • Salt API • Auto-remediation Salt Modules • How to get started
  • 4. Salt @ LinkedIn • 11-14k minions per master • Up to 8.5 events/ second per master • Deployment system heavily utilizes Salt • 5 dedicated SRE’s work on LinkedIn’s Salt infrastructure • Number of SRE’s who also contribute
  • 5. Alert Growth vs NOC Engineers LinkedIn’s Auto-remediation Story [Chart: Num of Alerts and Num of Engineers per year, 2010 to 2017]
  • 6. The problem LinkedIn’s Auto-remediation Story • Growing number of high priority alerts vs Engineers to watch • Complicated ITR’s (runbooks) took time for NOC engineers to execute and escalate • SRE’s would forget to run diagnostic tooling during outages • Longer MTTR == Bad experience for members
  • 7. Building the Solution LinkedIn’s Auto-Remediation Story • LinkedIn needed a workflow engine • Requirements • Take automated actions against applications/ hosts • Perform the run book automatically • Appropriate auditing • Scalable
  • 8. What’s already out there Autoremediation • StackStorm • fbar • SaltStack Started building an in-house solution in Mid-2014
  • 10. Nurse • ‘Event based’ system • Built on top of LinkedIn’s existing monitoring infrastructure • Allows for manual execution via Web UI • Allows for other systems to plug-in via REST API • Built in rules-engine • Allows for complicated checks/ branching • Global environment awareness • Scalable!
  • 11. Nurse
  • 12. Nurse
  • 13. The basics Salt REST API • RESTful interface to SALT • Allows for external-authentication • PAM • LDAP • Your own plugin… • Built on top of CherryPy
  • 14. Interface Salt REST API • /login - Log in to receive a session token • /logout - Remove or invalidate sessions • /minions – Working directly with minions • /jobs - Getting lists of previously run jobs or getting the return from a single job • /run - Run commands without normal session handling • /events - Expose the Salt event bus • /hook - A generic web hook entry point that fires an event on Salt's event bus • /keys – Wrapper around key management • /ws - Open a WebSocket connection to Salt's event bus • /stats - Return a dump of statistics collected from the CherryPy server https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
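The endpoints on this slide can be exercised with nothing more than the Python standard library. The following is a minimal sketch, not LinkedIn's implementation: the host name is an assumption (the port matches the rest_cherrypy configuration slide), and the helpers only build the HTTP requests for `/login` and for submitting lowstate to the root URL; `send` is the one function that needs a reachable salt-api.

```python
import json
import urllib.request

API_URL = "https://salt.example.com:8888"  # assumed host; port taken from the config slide


def build_login_request(username, password, eauth="ldap"):
    """Build the POST /login request that exchanges credentials for a session token."""
    body = json.dumps({"username": username, "password": password, "eauth": eauth})
    return urllib.request.Request(
        API_URL + "/login",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def build_run_request(token, lowstate):
    """Build the POST / request that submits lowstate chunks with a session token."""
    return urllib.request.Request(
        API_URL + "/",
        data=json.dumps(lowstate).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "X-Auth-Token": token,
        },
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON reply (requires a live salt-api)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a live server, the token would typically be read from `send(build_login_request(...))["return"][0]["token"]` and passed to `build_run_request`.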
  • 15. Configuration Salt REST API
        external_auth:
          ldap:
            headless-nurse:
              - '*':
                - test.*
                - nurse.*
              - '@jobs'
        https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
        /etc/salt/master.d/salt-api.conf
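To make the effect of this `external_auth` entry concrete, here is a simplified illustration of how such a permission list restricts what the `headless-nurse` account may run. It uses plain glob matching and is only a sketch: Salt's own matcher is richer, and `nurse.restart` is a hypothetical function name used for the example.

```python
from fnmatch import fnmatch

# Permissions mirroring the slide: on any minion ('*') allow test.* and nurse.*,
# plus '@jobs' for the jobs runner.
PERMS = [{"*": ["test.*", "nurse.*"]}, "@jobs"]


def is_function_allowed(perms, minion_id, fun):
    """Return True if `fun` on `minion_id` matches a target/function rule in `perms`."""
    for rule in perms:
        if not isinstance(rule, dict):
            continue  # string entries such as '@jobs' are not minion function rules
        for target, patterns in rule.items():
            if fnmatch(minion_id, target) and any(fnmatch(fun, p) for p in patterns):
                return True
    return False
```

Under these rules `test.ping` and anything in the `nurse` module pass, while an arbitrary call such as `cmd.run` is rejected, which keeps the automation account on a short leash.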
  • 16. Configuration Salt REST API
        rest_cherrypy:
          port: 8888
          ssl_crt: /etc/salt/pki/api/salt-api.crt
          ssl_key: /etc/salt/pki/api/salt-api.key
        https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
        /etc/salt/master.d/salt-api.conf
  • 17. Configuration Salt REST API
        auth.ldap.basedn: 'dc=example,dc=com'
        auth.ldap.binddn: 'readonly-ldap@example'
        auth.ldap.bindpw: 'password'
        auth.ldap.filter: sAMAccountName={{ username }}
        auth.ldap.server: ldap.example.com
        auth.ldap.tls: True
        auth.ldap.persontype: 'person'
        auth.ldap.groupclass: 'group'
        auth.ldap.groupou: 'Users'
        https://docs.saltstack.com/en/latest/topics/eauth/index.html
        /etc/salt/master.d/salt-api.conf
  • 18. Code Interfaces Salt REST API • There is a Python interface to the Salt API – pepper • Provides basic wrapper around REST API • https://github.com/saltstack/pepper/ • See https://github.com/SUSE/salt-netapi-client for JAVA bindings
  • 19. Python Code example Salt REST API
        from pepper.libpepper import Pepper
        api = Pepper('https://salt.example.com:8888')
        api.login('saltdev', 'saltdev', 'ldap')
        # Run simple function
        api.low([{'client': 'local', 'tgt': '*', 'fun': 'test.ping'}])
        # Execute a runner function
        api.runner('jobs.lookup_jid', jid=12345)
  • 20. Goals Auto-remediation Salt Modules • Auto-triage issues • Reduce context-switching • Run labor intensive tasks automatically • Gather data while engineer is logging in after escalation • Auto-remediate issues • Let the engineer sleep • Faster MTTR
  • 21. Implementation Auto-remediation Salt Modules • Auto-triage issues • Identify abusive clients • Collect & analyze thread/ heap dumps • Parse & summarize log files
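As an illustration of the "identify abusive clients" triage step, a module could count requests per client in an access log and report the heaviest callers. This is a hypothetical sketch: the function name, the threshold, and the assumption that the client address is the first whitespace-separated field of each log line are not from the talk.

```python
from collections import Counter


def find_abusive_clients(log_lines, threshold):
    """Count requests per client and return those at or above `threshold`, busiest first."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1  # assumes the client address is the first field
    return [(client, n) for client, n in counts.most_common() if n >= threshold]
```

Returning plain data rather than printing keeps the result easy to attach to an escalation, so the engineer sees the summary as soon as they log in.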
  • 22. Implementation Auto-remediation Salt Modules • Auto-remediate issues • Restart/ OOR applications • Scale-up applications • Blocking abusive clients • Update A/ B experiment definitions • Otherwise escalate…
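The "remediate, otherwise escalate" flow on this slide amounts to a small decision rule. The sketch below is an assumption about how such a rule might look, with a hypothetical restart budget; it is not Nurse's actual rules engine.

```python
def choose_action(healthy, restarts_so_far, max_restarts=2):
    """Pick the next step for one host: do nothing, restart, or escalate to a human."""
    if healthy:
        return "none"
    if restarts_so_far < max_restarts:
        return "restart"
    return "escalate"  # the restart budget is spent, so hand over to an engineer
```

Capping the number of automated restarts is one way to avoid a restart loop that hides a real outage.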
  • 23. Implementation Auto-remediation Salt Modules • Plenty of Salt modules/ runners already out there • Over 300 modules are available in Salt core • Modules to remediate & notify • Write your own • Make sure you test!
  • 24. Success LinkedIn’s Auto-remediation Story • 854k actions taken • 100% of service health check alerts are onboarded • ~37k man hours have now been automated • Now automating ~1100 hours/ week
  • 26. Some lessons learnt How to get started • Need to make a decision on how you architect YOUR ‘event bus’ • Use external monitoring to trigger Salt via API • Use reactors/ beacons internally within Salt’s event bus • See ‘Thorium’ documentation for Salt’s new reactor engine • Auditing is important! • Need to know what/ who triggered actions
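The auditing point can be sketched as one structured record per automated action, capturing what ran and who or what triggered it. The field names and example values here are assumptions for illustration, not a format described in the talk.

```python
import json
import time


def audit_record(action, target, triggered_by, source, now=None):
    """Build one JSON audit line describing an automated action and its trigger."""
    record = {
        "timestamp": time.time() if now is None else now,
        "action": action,              # e.g. the Salt function that was run
        "target": target,              # host or application acted upon
        "triggered_by": triggered_by,  # user or service account
        "source": source,              # e.g. "alert", "web-ui", "rest-api"
    }
    return json.dumps(record, sort_keys=True)
```

Emitting one JSON object per line makes it simple to ship these records to whatever log pipeline is already in place and to answer "who triggered this restart?" later.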
  • 27. Some lessons learnt How to get started • Reporting is important! • Need to know how often automated actions are being taken • Find failure hotspots • Leverage event-bus or returners • Safety first • Don’t give yourself too much rope…
  • 28. Conclusion • Think carefully about your ‘Event-Driven-Automation’ architecture • The sky is the limit with Salt…don’t limit yourself • Again…safety first

Editor's Notes

  1. WhoAmI Background on Salt & Auto-remediation at LinkedIn Nurse – LinkedIn’s Auto-remediation engine Salt API & Autoremediation Salt Modules How to get started
  2. Been at LinkedIn for nearly 2.5 years. Part of LinkedIn’s ARVT group. Occasional committer to LinkedIn’s Salt ecosystem.
  3. Want to give you some context into our infrastructure Average is 11-14k minions per production datacenter Up to 8.5 events/ second per master Deployment system heavily utilizes Salt 5 dedicated SRE’s work on LinkedIn’s Salt infrastructure Number of SRE’s who also contribute Keep these numbers in mind as the session progresses
  4. Let’s talk about LinkedIn’s autoremediation story. The graph depicts the number of alerts our NOC watches vs the number of NOC engineers we employ globally.
  5. What is out there Has anyone heard of these
  6. Login/ logout; Jobs; Run; Events – get data from event bus; Hook – Fire an event INTO the salt bus; WS – Open websocket; Stats
  7. So what are the goals of building an auto-remediation engine
  8. Remediate Apache/ Nginx restarts Notify Slack/ Hipchat modules
  9. Since 2014 we haven’t increased headcount; now we’re lowering our NOC engineer headcount