What to Expect When You’re
Expecting (to Own Production)
Considerations for Acclimating
Developers to Production Ownership
Michael Diamant
The Road Ahead
Source: http://originfinance.com.au/wp-content/uploads/2017/03/End-of-the-Road.jpg
Where Are We Trying To Go?
A progression over time:
1. Developers delivering software into production
2. Developers triaging and remediating production issues
3. Cultural change to include operational requirements in the definition of “done”
4. Developers proactively addressing issues before they manifest
Focus Areas
● Metrics
● Alerts
● Deploys
● Shared Ownership
Metrics: Understand the Domain

Question: What questions do non-technical stakeholders ask?
Motivation: These topics are likely the ones that matter most for a particular constituency.

Question: If left unnoticed, what is the one failure that will cause the business significant harm?
Motivation: Repeat this question over time to learn where visibility is most needed.

Question: What SLAs / uptime contracts exist?
Motivation: If a topic is important enough to be recorded in the legalese, visibility is crucial.
Metrics: Surface Non-functional Requirements

Question: What happens as reads and writes to a resource (e.g. file system, database) take longer?
Motivation: Tracking read/write latencies ensures that a situation heading toward “too slow” can be proactively addressed.

Question: What artifact sizes (e.g. values in a k-v store) are unbounded?
Motivation: Production grinds to a halt when system outputs grow “too large”. Visibility into growth over time provides time to react calmly.

Question: What are critical thresholds for system resources (e.g. CPU, disks, memory)?
Motivation: Without understanding system usage, it is difficult to suggest optimization techniques, and capacity planning is limited.

Question: What 3rd-party integration points exist?
Motivation: When a 3rd-party integration inevitably fails, it will be a challenge to understand what happened without proper visibility.
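The first question above, tracking read/write latencies, might be sketched as a small in-process recorder (a minimal illustration only; the `timed` decorator, the operation names, and the in-memory store are hypothetical, and a real system would export these observations to a metrics backend rather than keep them in a dict):

```python
import time
from collections import defaultdict

# Hypothetical in-process latency tracker, keyed by operation name.
latencies_ms = defaultdict(list)

def timed(operation):
    """Decorator that records the wall-clock latency of each call."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                latencies_ms[operation].append(elapsed_ms)
        return inner
    return wrap

@timed("kv_store.write")
def write_value(store, key, value):
    store[key] = value

store = {}
write_value(store, "a", 1)
print(len(latencies_ms["kv_store.write"]))  # → 1
```

With observations recorded per operation, a dashboard can then plot the trend and show a write path drifting toward “too slow” long before it becomes an outage.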
Alerts: Trigger Responsibly

Suggestion: Distinguish between soft alerts (broadcast a message without paging) and hard alerts (broadcast and page).
Motivation: Soft alerts enable the on-call team to sleep through the night and provide a heads-up during the day that danger is looming.
Example: Candidate soft alert: a frequently scheduled job (e.g. a machine learning algorithm) fails once. Candidate hard alert: the job fails 3+ times in a row.

Suggestion: Consider the absence of a desired event/outcome an alert trigger.
Motivation: Who watches the watchers? This can be a safety mechanism to validate the assumption that a system is “working”. As an added bonus, this type of alert does not require output from the system being observed.
Example: Alert when the monitor tracking latency of events transferred between systems has no new observations (i.e. no data) in the last 10 minutes.

Suggestion: Where possible, evaluate proportional rather than absolute values.
Motivation: Absolute alert thresholds more easily become stale over time and are fragile in heterogeneous environments.
Example: Since the load average is an aggregate number across all CPUs, track the load average per core.
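The soft/hard distinction in the first suggestion can be sketched as a small escalation rule (a minimal sketch; `notify`, `page`, the threshold, and the job name are hypothetical stand-ins for hooks into whatever alerting system is in use):

```python
HARD_ALERT_THRESHOLD = 3  # consecutive failures before paging on-call

class JobFailureAlerter:
    def __init__(self, notify, page):
        self.notify = notify  # soft alert: broadcast without paging
        self.page = page      # hard alert: broadcast and page on-call
        self.consecutive_failures = 0

    def record_success(self):
        self.consecutive_failures = 0

    def record_failure(self, job_name):
        self.consecutive_failures += 1
        if self.consecutive_failures >= HARD_ALERT_THRESHOLD:
            self.page(f"{job_name} failed {self.consecutive_failures}x in a row")
        else:
            self.notify(f"{job_name} failed (attempt {self.consecutive_failures})")

# Usage: first two failures produce soft alerts, the third pages.
sent = []
alerter = JobFailureAlerter(notify=sent.append, page=sent.append)
for _ in range(3):
    alerter.record_failure("nightly-model-train")
print(sent[-1])  # → nightly-model-train failed 3x in a row
```

Resetting the counter on success is what keeps a flaky-but-recovering job in soft-alert territory while a persistently broken one escalates to a page.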
Deploys: What stages exist?
● Before deployment is planned
● Pre-deployment
● Deployment
● Post-deployment
Note: Box size proportional to effort needed
Deploys: Questions to Consider Before a Deploy is Planned
● What are common rollback scenarios and how are they executed?
● What is the escalation policy should something break?
● What development strategies will be followed to avoid backwards
incompatible changes?
● What procedures (e.g. testing) certify that software is ready for
deployment?
● Involve other stakeholders:
– What are amenable times of day or days of week for
deployments?
– What questions / constraints should be cleared prior to a deploy
(e.g. confirm there are no high-touch client meetings the day of
the deploy)?
– How much downtime is acceptable?
Deploys: Questions to Consider Pre-deployment
● Have all artifacts been versioned (e.g. remove branch/RC/SNAPSHOT modifiers)?
● Have all possible combinations of versions in production and to-be-deployed versions been exercised together to ensure compatibility?
● Have any side-effecting updates (e.g. DB schema changes) been tested in a non-production environment?
● Are deployment steps documented?
● Consider outcomes:
– What will a successful deployment look like?
– What signs will a failing/failed deployment show?
– In addition to engineering, what stakeholders are needed to confirm success/failure?
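The first checklist item, confirming that no artifact still carries a branch/RC/SNAPSHOT modifier, lends itself to automation. A sketch, assuming semver-style versions where pre-release modifiers appear after a hyphen (the regex is illustrative, not exhaustive):

```python
import re

# Version strings carrying a branch/RC/SNAPSHOT modifier are not releasable.
UNRELEASED = re.compile(r"-(snapshot|rc\d*|.*branch.*)$", re.IGNORECASE)

def is_release_version(version):
    """True when the version carries no branch/RC/SNAPSHOT modifier."""
    return UNRELEASED.search(version) is None

assert is_release_version("2.1.0")
assert not is_release_version("2.1.0-SNAPSHOT")
assert not is_release_version("2.1.0-RC1")
assert not is_release_version("2.1.0-feature-branch")
```

Running a check like this in the release pipeline turns an easy-to-forget manual step into a hard gate before deployment.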
Parting Thoughts
● Trial and error is a part of this process. Mistakes will be made!
● Consider the next-step outcome (e.g. What happens when…?).
● Codify operational concerns (e.g. alerting) into the definition of “done”.
● Vigilantly review alerts that fire frequently and/or without action items to minimize on-call fatigue.
● Periodically audit alerts to identify gaps and remove stale alerts.
● Consider adding developers to the on-call rotation.
● Retain flexibility:
– With sufficient alerting in place, less stringent deploy processes can facilitate faster feedback loops.
– Differentiate the definition of done between proof-of-concept (POC) and production work, and define the transition point between POC and “production”.
Thank you!
To complete the definition of done for this presentation, let’s answer
questions :)
