The intended audience for this presentation is developers unfamiliar with owning a production environment. I aim to share lessons I’ve learned while supporting production environments and to chart a path for building ownership.
By no means is this intended to be a comprehensive guide to production ownership. Instead, it should be treated as an introduction or one of the first few steps into the topic.
This presentation was motivated by a former colleague seeking to help frame his team's mindset toward production ownership. He joined a team that was not accustomed to production deploys, on-call, etc., and thought it would be valuable to share insight from our experience together in an environment where developers co-owned production.
What to Expect When You're Expecting (to Own Production)
1. What to Expect When You’re Expecting (to Own Production)
Considerations for Acclimating Developers to Production Ownership
Michael Diamant
2. The Road Ahead
Source: http://originfinance.com.au/wp-content/uploads/2017/03/End-of-the-Road.jpg
3. Where Are We Trying To Go?
Developers delivering software into production
Developers triaging and remediating production issues
Cultural change to include operational requirements in definition of “done”
Developers proactively addressing issues before they manifest
Axis: Time
5. Metrics: Understand the Domain
Question: What questions do non-technical stakeholders ask?
Motivation: These topics are likely the ones that matter most for a particular constituency.
Question: If left unnoticed, what is the one failure that will cause the business significant harm?
Motivation: Repeat this question over time to learn where visibility is most needed.
Question: What SLAs / uptime contracts exist?
Motivation: If a topic is important enough to be recorded in the legalese, visibility is crucial.
6. Metrics: Surface Non-functional Requirements
Question: What happens as reads and writes to a resource (e.g. file system, database) take longer?
Motivation: Tracking read/write latencies ensures that a situation heading towards “too slow” can be proactively addressed.
Question: What artifact sizes (e.g. values in a k-v store) are unbounded?
Motivation: Production grinds to a halt when system outputs are “too large”. Visibility into growth over time provides time to react calmly.
Question: What are critical thresholds for system resources (e.g. CPU, disks, memory)?
Motivation: Without understanding system usage, it is difficult to suggest optimization techniques or to plan capacity.
Question: What third-party integration points exist?
Motivation: When a third-party integration inevitably fails, it will be a challenge to understand what happened without proper visibility.
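As a minimal sketch of the latency tracking described above, the Python below times an operation and flags when its 95th-percentile latency approaches a budget. The names `timed`, `latencies_ms`, and `is_heading_towards_too_slow`, the in-memory store, and the 80% warning ratio are all assumptions made for illustration; a real system would likely send these samples to a metrics backend instead.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store keyed by operation name (illustration only).
latencies_ms = defaultdict(list)


@contextmanager
def timed(operation):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies_ms[operation].append((time.perf_counter() - start) * 1000.0)


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def is_heading_towards_too_slow(operation, budget_ms, warn_ratio=0.8):
    """True once p95 latency reaches warn_ratio of the budget (assumed policy)."""
    return percentile(latencies_ms[operation], 95) >= warn_ratio * budget_ms
```

Wrapping a read in `with timed("db_read"): ...` collects a sample per call. Warning at a fraction of the budget, rather than at the budget itself, is what leaves room to react before reads are actually “too slow”.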
7. Alerts: Trigger Responsibly
Suggestion: Distinguish between soft alerts (broadcast a message without paging) and hard alerts (broadcast and page).
Motivation: Soft alerts enable the on-call team to sleep through the night and provide a heads up that danger is looming during the day.
Example: Candidate soft alert: Frequently scheduled job (e.g. machine learning algorithm) fails once. Candidate hard alert: Job fails 3+ times in a row.
Suggestion: Consider the absence of a desired event/outcome an alert trigger.
Motivation: Who watches the watchers? This can be a safety mechanism to validate the assumption that a system is “working”. As an added bonus, this type of alert does not require output from the system being observed.
Example: Alert when the metric tracking the latency of events transferred between systems has no new observations (i.e. no data) in the last 10 min.
Suggestion: Where possible, evaluate proportional rather than absolute values.
Motivation: Absolute alert thresholds more easily become stale over time and are fragile in heterogeneous environments.
Example: Since the load average is an aggregate number across all CPUs, track the load average per core.
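The three suggestions above can be sketched as small Python rules. This is an illustrative sketch, not a prescribed implementation: the function names, the `hard_after=3` default (taken from the "3+ times in a row" example), and the 600-second silence window (from the 10 min example) are assumptions.

```python
import os
import time


def classify_job_alert(consecutive_failures, hard_after=3):
    """Return None, "soft", or "hard" for a scheduled job's failure streak."""
    if consecutive_failures <= 0:
        return None
    if consecutive_failures >= hard_after:
        return "hard"  # broadcast and page
    return "soft"  # broadcast only, no page


def is_stale(last_observation_ts, now=None, max_silence_s=600):
    """True when no observation has arrived within max_silence_s seconds."""
    now = time.time() if now is None else now
    return now - last_observation_ts > max_silence_s


def load_per_core(load_average, cores=None):
    """Normalize a load average by core count so one threshold fits all hosts."""
    cores = cores or os.cpu_count() or 1
    return load_average / cores
```

On Unix hosts, `load_per_core(os.getloadavg()[0])` gives the one-minute load per core. Note that `is_stale` needs only the timestamp of the last observation, so it can run outside the system being observed.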
8. Deploys: What stages exist?
Before deployment is planned
Pre-deployment
Deployment
Post-deployment
Note: Box size proportional to effort needed
9. Deploys: Questions to Consider Before a Deploy is Planned
● What are common rollback scenarios and how are they executed?
● What is the escalation policy should something break?
● What development strategies will be followed to avoid backwards incompatible changes?
● What procedures (e.g. testing) certify that software is ready for deployment?
● Involve other stakeholders:
– What are amenable times of day or days of week for deployments?
– What questions / constraints should be cleared prior to a deploy (e.g. confirm there are no high-touch client meetings the day of the deploy)?
– How much downtime is acceptable?
10. Deploys: Questions to Consider Pre-deployment
● Have all artifacts been versioned (e.g. remove branch/RC/SNAPSHOT modifiers)?
● Have all possible combinations of versions in production and to-be-deployed versions been exercised together to ensure compatibility?
● Have any side-effecting updates (e.g. DB schema changes) been tested in a non-production environment?
● Are deployment steps documented?
● Consider outcomes:
– What will a successful deployment look like?
– What signs will a failing/failed deployment show?
– In addition to engineering, what stakeholders are needed to confirm success/failure?
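Two of the questions above, versioned artifacts and version combinations, lend themselves to automated checks. The Python sketch below assumes release versions follow a three-part `MAJOR.MINOR.PATCH` pattern, so anything else (a SNAPSHOT, RC, or branch suffix) is flagged; that convention and the function names are assumptions for illustration, not something the slides prescribe.

```python
import itertools
import re

# Assumption: a released artifact has a plain three-part numeric version.
RELEASE_VERSION = re.compile(r"\d+\.\d+\.\d+")


def unreleased_artifacts(artifact_versions):
    """Return names of artifacts whose version is not a plain release version."""
    return sorted(
        name
        for name, version in artifact_versions.items()
        if not RELEASE_VERSION.fullmatch(version)
    )


def version_combinations(in_production, to_deploy):
    """Yield every mix of current and candidate versions across services."""
    services = sorted(in_production)
    choices = [
        sorted({in_production[s], to_deploy.get(s, in_production[s])})
        for s in services
    ]
    for combo in itertools.product(*choices):
        yield dict(zip(services, combo))
```

Each dictionary yielded by `version_combinations` is one set of versions that may coexist during a rollout and is therefore a candidate for a compatibility test run. The number of combinations doubles with each service being upgraded, which is one argument for keeping deploys small.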
11. Parting Thoughts
● Trial and error is a part of this process. Mistakes will be made!
● Consider the next step outcome (e.g. What happens when…?).
● Codify operational concerns (e.g. alerting) into definition of “done”.
● Vigilantly review alerts firing frequently and/or without action items to minimize on-call fatigue.
● Periodically audit alerts to identify gaps and remove stale alerts.
● Consider adding developers to on-call rotation.
● Retain flexibility:
– With sufficient alerting in place, less stringent deploys can facilitate faster feedback loops.
– Differentiate the definition of done for proof-of-concept (POC) vs production work, and define the transition point between POC and “production”.
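The alert review and audit bullets above could start from something as simple as the Python sketch below. The input shape (a list of `(alert_name, action_taken)` pairs for a review period), the `noisy_threshold` default, and treating never-fired alerts as candidates for a staleness check are all assumptions for illustration; an alert that never fired may simply be guarding a rare failure.

```python
from collections import Counter


def audit_alerts(firings, defined_alerts, noisy_threshold=5):
    """Summarize one review period of alert firings.

    firings: iterable of (alert_name, action_taken) pairs.
    defined_alerts: names of every alert currently configured.
    """
    firings = list(firings)  # allow generators; the data is read twice
    counts = Counter(name for name, _ in firings)
    acted_on = {name for name, acted in firings if acted}
    return {
        # Fired often enough to risk on-call fatigue.
        "noisy": sorted(n for n, c in counts.items() if c >= noisy_threshold),
        # Fired, but nobody ever needed to do anything.
        "no_action": sorted(n for n in counts if n not in acted_on),
        # Configured but silent: worth checking whether still relevant.
        "never_fired": sorted(set(defined_alerts) - set(counts)),
    }
```

The output is a list of review candidates for people to discuss, not a verdict on which alerts to delete.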
12. Thank you!
To complete the definition of done for this presentation, let’s answer questions :)