Incident Management in the Age of DevOps and SRE

1.
Damon Edwards Incident Management in the Age of DevOps and SRE Salt Lake City DevOps Nov 13, 2019

3.
Assertion: The ability to respond to and resolve incidents is the true indicator of an organization’s operational capabilities

5.
Assertion 2: Everybody now works in “Operations”

6.
What Is an Incident? An unplanned disruption impacting customers or business operations

7.
What Is an Incident? An unplanned disruption impacting customers or business operations: Outages, Service Degradation

8.
What Is an Incident? An unplanned disruption impacting customers or business operations: Outages, Service Degradation, Work interruption, Delay/Waiting, “Short-Notice” Requests

12.
Board

13.
Integrated Board

14.
Integrated Responsive Board

15.
Integrated Responsive Everywhere Board

16.
Integrated Responsive Everywhere Always Board

17.
Integrated Responsive Everywhere Always Board Tech Org Execution

20.
Kubernetes, AWS, GCP, Azure, Docker, Consul, Terraform, Istio, Zipkin, Envoy, Serverless, OpenShift, Kafka, Lambda, Prometheus, Containerd, Helm, Cloud Foundry, Linkerd, Etcd, CoreDNS, MongoDB, Redis, InfluxDB, Jaeger, gRPC, CRI-O, Cognito, Fargate, Cloud Functions, Cosmos, BigQuery, Spark, Rook, Ceph, NGINX, HAProxy, Open vSwitch, NSX, Sensu, Vault, Aurora, Nomad

23.
Kubernetes, AWS, GCP, Azure, Docker, Consul, Terraform, Istio, Zipkin, Envoy, Serverless, OpenShift, Kafka, Lambda, Prometheus, Containerd, Helm, Cloud Foundry, Linkerd, Etcd, CoreDNS, MongoDB, Redis, InfluxDB, Jaeger, gRPC, CRI-O, Cognito, Fargate, Cloud Functions, Cosmos, BigQuery, Spark, Rook, Ceph, NGINX, HAProxy, Open vSwitch, NSX, Sensu, Vault, Aurora, Nomad. SAIL/cornell.edu

24.
Adrian Cockcroft, DockerCon EU 2014: Immutable microservice deployment scales, is faster with large teams and diverse platform components. Architecture enables speed. Speed is the advantage. [Diagram labels: Developer; Release Plan; Deploy Feature to Production; Bugs; Old Release Still Running]

27.
The Three Ways (2013)

28.
The Three Ways (2013) The Five Ideals (2019)

30.
DEV

31.
Go! Go! Go! DEV

32.
Go! Go! Go! DEV…OPS?

33.
0000 Go! Go! Go! DEV…OPS?

34.
0000 Go! Go! Go! DEV…OPS? Operations: The Last Mile

37.
Principles of SRE
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload

41.
DevOps + SRE: Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native

43.
Dev Ops Cross-Functional Team Cross-Functional Team. DevOps + SRE: Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native

44.
Dev Ops Cross-Functional Team Cross-Functional Team. DevOps + SRE: Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native. “Value-Aligned” and Self-Regulating Shared Responsibility Model
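The slides name error budgets without showing the arithmetic. The following is a minimal sketch of one common way to compute a budget from a request-based SLO; the `ErrorBudget` class, its field names, and the release-gating rule are illustrative assumptions, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Error budget for one service over one SLO window (hypothetical model)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = budget untouched, 0.0 = fully spent, negative = overspent.
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def can_release(self) -> bool:
        # One possible "consequence": pause feature releases once the budget is spent.
        return self.remaining_fraction > 0.0


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
overspent = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)
```

Here `budget` has used roughly 40% of its allowance and may still release, while `overspent` has exceeded it. Which consequence applies when the budget runs out is a policy decision for each team.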

46.
Traditional ITSM

47.
Traditional ITSM (ITIL, 1989 - ?)

49.
Traditional ITSM (ITIL, 1989 - ?): Unintentionally Encourages Silos

51.
Traditional ITSM (ITIL, 1989 - ?): Unintentionally Encourages Silos. Encourages command & control management

53.
Old Way New Way

70.
REDeploy.io

71.
There is no root cause. (That’s just a political distinction) REDeploy.io

72.
Why? Why? Why? Why? Why? There is no root cause. (That’s just a political distinction) REDeploy.io

73.
Why? Why? Why? Why? Why? There is no root cause. (That’s just a political distinction) Right, Wrong, Safety II, and You. REDeploy.io

74.
Why? Why? Why? Why? Why? There is no root cause. (That’s just a political distinction) Right, Wrong, Safety II, and You. Incidents = unplanned investments REDeploy.io

76.
You Not

82.
18 Million IT Ops, 22.3 Million Developers

85.
Col. John Boyd: OODA Loop

88.
Monitoring Spotting the knowns

89.
Monitoring: Spotting the knowns. Observability: Interrogating the unknowns

90.
Observability Interrogating the unknowns

91.
Observability: Interrogating the unknowns. Logging: The event

92.
Observability: Interrogating the unknowns. Logging: The event. Metrics: Data points over time

93.
Observability: Interrogating the unknowns. Logging: The event. Metrics: Data points over time. Tracing: Events in context of a single request
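To make the three signals concrete, here is a small standard-library sketch in which one request produces a log event, a metric data point, and two trace spans that share a trace ID. The in-memory stores and the helper names (`log`, `record_metric`, `span`, `handle_request`) are hypothetical stand-ins for a real telemetry backend.

```python
import time
import uuid
from collections import defaultdict

log_events = []                      # logging: discrete events
metric_points = defaultdict(list)    # metrics: (timestamp, value) series
trace_spans = []                     # tracing: spans tied to one request


def log(message, **fields):
    # A log entry describes a single event, with whatever context is at hand.
    log_events.append({"ts": time.time(), "message": message, **fields})


def record_metric(name, value):
    # A metric is a named series of data points over time.
    metric_points[name].append((time.time(), value))


def span(trace_id, name, started, ended):
    # A span is a timed step that belongs to one request (one trace ID).
    trace_spans.append({"trace_id": trace_id, "name": name,
                        "duration_s": ended - started})


def handle_request(path):
    trace_id = uuid.uuid4().hex
    started = time.monotonic()
    log("request received", path=path, trace_id=trace_id)
    db_started = time.monotonic()
    # ... the real work (for example a database query) would happen here ...
    span(trace_id, "db_query", db_started, time.monotonic())
    ended = time.monotonic()
    span(trace_id, "handle_request", started, ended)
    record_metric("request_latency_s", ended - started)
    return trace_id


trace_id = handle_request("/checkout")
```

Filtering `trace_spans` by `trace_id` recovers every step of that single request, which is what separates tracing from the aggregate view that metrics give.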

95.
Automated Governance: Objective, automated attestation of GRC controls

98.
Monitoring Observability Governance Everyone Everyone Everyone Everyone

99.
Incident Command: Mobilization, Coordination, Communication

100.
Incident Command: Mobilization, Coordination, Communication. Incident Command System (FEMA)

104.
Incident Command: Mobilization, Coordination, Communication. Incident Command System (FEMA). GitHub: PagerDuty/incident-response-docs

105.
Ops = Platform Eng + SRE. Divide and conquer

107.
Ops = Platform Eng + SRE. Divide and conquer. SRE: Expert Operators (distributed). Platform Eng: Build and Operate Platform Services (centralized)

110.
New Views on Escalations: Avoid… but swarm if you do. Support at the edge. Swarm

111.
Take Action! Diagnose: Health checks, exploratory actions. Restore: Restart, repair actions, rollback
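The diagnose-then-restore flow on this slide can be sketched as a small function: run the health checks, apply restore steps one at a time, and re-check after each. Everything here (`run_runbook`, the `service` dictionary, and the check and step functions) is a hypothetical illustration and not the API of any particular runbook tool.

```python
from typing import Callable, Dict, List


def run_runbook(checks: Dict[str, Callable[[], bool]],
                restore_steps: List[Callable[[], None]]) -> List[str]:
    """Diagnose first; run restore steps only if a check fails, re-checking after each."""
    journal = []
    failed = [name for name, check in checks.items() if not check()]
    journal.append(f"diagnose: failed checks = {failed}")
    if not failed:
        return journal
    for step in restore_steps:
        step()
        journal.append(f"restore: ran {step.__name__}")
        still_failing = [name for name in failed if not checks[name]()]
        if not still_failing:
            journal.append("verify: all checks passing")
            return journal
    journal.append("verify: still failing, escalate")
    return journal


# Hypothetical service state standing in for a real system.
service = {"running": False}


def service_is_up() -> bool:
    return service["running"]


def restart_service() -> None:
    service["running"] = True


def rollback_release() -> None:
    service["running"] = True


journal = run_runbook({"service_is_up": service_is_up},
                      [restart_service, rollback_release])
```

Ordering the restore steps from least to most disruptive (restart before rollback) means the sketch stops as soon as the checks pass, so the rollback never runs in this example.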

112.
The Return of Runbooks: A while ago, Not that long ago, Now

113.
The Return of Runbooks: A while ago, Not that long ago, Now. Runbooks (Mostly Manual)

114.
The Return of Runbooks: A while ago, Not that long ago, Now. Runbooks (Mostly Manual) …

115.
The Return of Runbooks: A while ago, Not that long ago, Now. Runbooks (Mostly Manual) Runbooks (Automate!…How?)… Thanks SRE!

116.
Runbook Automation: Safe self-service access to the expert knowledge you need to take action.

119.
Runbook Automation: Safe self-service access to the expert knowledge you need to take action. Moving the bits is the easy part!

121.
Empower those closest to the action! Runbook Automation: Safe self-service access to the expert knowledge you need to take action.

123.
De-risk! Runbook Automation: Safe self-service access to the expert knowledge you need to take action.
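One way to read "safe self-service" is that anyone may request a runbook job, but a policy decides who may run it and every attempt is recorded. The sketch below assumes a simple role-based allow-list; the job names, roles, and the `run_job` helper are hypothetical and do not describe how any specific product works.

```python
import datetime

# Hypothetical policy: which roles may run which runbook jobs.
ALLOWED = {
    "restart-web": {"sre", "developer", "support"},
    "failover-db": {"sre"},
}
audit_log = []


def run_job(job, user, role, action):
    """Run `action` only if `role` is allowed for `job`; always record the attempt."""
    permitted = role in ALLOWED.get(job, set())
    audit_log.append({
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user, "job": job, "permitted": permitted,
    })
    if not permitted:
        raise PermissionError(f"{user} ({role}) may not run {job}")
    return action()


result = run_job("restart-web", "alex", "support", lambda: "web restarted")
```

A support engineer can restart the web tier without escalating, while a request to run `failover-db` from the same role would raise `PermissionError` and still leave an audit entry.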

124.
Before Runbook Automation…

125.
Before Runbook Automation… 3 options:

126.
Before Runbook Automation… 3 options: 1. Decipher the wiki

127.
Before Runbook Automation… 3 options: 1. Decipher the wiki 2. Ad-hoc tool/script usage

128.
Before Runbook Automation… 3 options: 1. Decipher the wiki 2. Ad-hoc tool/script usage 3. ESCALATE!

129.
…with Runbook Automation

130.
Shorter Incidents. Fewer Escalations. Before RBA

132.
Shorter Incidents. Fewer Escalations. With RBA

134.
Shorter Incidents. Fewer Escalations. Before RBA

136.
Solve Difficult Security & Compliance Problems. Before RBA

137.
Solve Difficult Security & Compliance Problems. With RBA

138.
Everything Through an SDLC. Promote

139.
Runbooks as a Service

144.
Incidents = unplanned investments …the ROI is up to you.

145.
Recap! Elevate the Human.

146.
@damonedwards damon@rundeck.com Let’s talk… Special thanks to
