What to Expect When You’re Expecting (to Own Production)
Considerations for Acclimating Developers to Production Ownership
Michael Diamant
The Road Ahead
Source: http://originfinance.com.au/wp-content/uploads/2017/03/End-of-the-Road.jpg
Where Are We Trying To Go?
Over time:
● Developers delivering software into production
● Developers triaging and remediating production issues
● Cultural change to include operational requirements in definition of “done”
● Developers proactively addressing issues before they manifest
Focus Areas
● Metrics
● Alerts
● Deploys
● Shared Ownership
Metrics: Understand the Domain
● Question: What questions do non-technical stakeholders ask?
– Motivation: These topics are likely the ones that matter most for a particular constituency.
● Question: If left unnoticed, what is the one failure that will cause the business significant harm?
– Motivation: Repeat this question over time to learn where visibility is most needed.
● Question: What SLAs / uptime contracts exist?
– Motivation: If a topic is important enough to be recorded in the legalese, visibility is crucial.
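To make an uptime contract concrete, it can help to translate the percentage into permitted downtime. The sketch below is only arithmetic; the 30-day period and the example targets are assumptions chosen for illustration, not figures from any particular contract.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an uptime target over a period."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# Example targets are illustrative only.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime over 30 days allows {minutes:.1f} minutes of downtime")
```

Seeing the budget in minutes makes it easier to decide which metrics need visibility first.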
Metrics: Surface Non-functional Requirements
● Question: What happens as reads and writes to a resource (e.g. file system, database) take longer?
– Motivation: Tracking read/write latencies ensures that a situation heading towards “too slow” can be proactively addressed.
● Question: What artifact sizes (e.g. values in a k-v store) are unbounded?
– Motivation: Production grinds to a halt when system outputs are “too large”. Visibility into growth over time provides time to react calmly.
● Question: What are critical thresholds for system resources (e.g. CPU, disks, memory)?
– Motivation: Without understanding system usage, it is difficult to suggest optimization techniques or to plan capacity.
● Question: What 3rd party integration points exist?
– Motivation: When a 3rd party integration inevitably fails, it will be a challenge to understand what happened without proper visibility.
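The latency and artifact-size questions above could be instrumented along these lines. This is a minimal sketch: the in-memory `observations` store, the metric names, and the `timed` / `record_size` helpers are all hypothetical, and a real system would send these values to whichever metrics backend it already uses.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store standing in for a real metrics backend.
observations = defaultdict(list)


@contextmanager
def timed(metric_name):
    """Record how long the wrapped block takes, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        observations[metric_name].append(time.perf_counter() - start)


def record_size(metric_name, payload: bytes):
    """Record the size of an artifact that could grow without bound."""
    observations[metric_name].append(len(payload))


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Usage: wrap writes to a stand-in key-value store (a plain dict here).
store = {}
for i in range(5):
    value = b"x" * (100 * (i + 1))
    with timed("kv_write_seconds"):
        store[f"key-{i}"] = value
    record_size("kv_value_bytes", value)

print("p95 write latency (s):", percentile(observations["kv_write_seconds"], 95))
print("largest value (bytes):", max(observations["kv_value_bytes"]))
```

Tracking a high percentile rather than the mean tends to surface the "heading towards too slow" trend earlier, since slow outliers show up there first.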
Alerts: Trigger Responsibly
● Suggestion: Distinguish between soft alerts (broadcast a message without paging) and hard alerts (broadcast and page).
– Motivation: Soft alerts enable the on-call team to sleep through the night and provide a heads-up during the day that danger is looming.
– Example: Candidate soft alert: a frequently scheduled job (e.g. machine learning algorithm) fails once. Candidate hard alert: the job fails 3+ times in a row.
● Suggestion: Consider the absence of a desired event/outcome an alert trigger.
– Motivation: Who watches the watchers? This can be a safety mechanism to validate the assumption that a system is “working”. As an added bonus, this type of alert does not require output from the system being observed.
– Example: The metric monitoring the latency of events transferred between systems has had no new observations (i.e. no data) in the last 10 minutes.
● Suggestion: Where possible, evaluate proportional rather than absolute values.
– Motivation: Absolute alert thresholds more easily become stale over time and are fragile in heterogeneous environments.
– Example: Since the load average is an aggregate number across all CPUs, track the load average per core.
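The three alerting suggestions above can be sketched as small, tool-independent checks. The function names and default thresholds are hypothetical; treating two consecutive failures as still "soft" is an assumption, since the examples only name one failure (soft) and three or more in a row (hard).

```python
import time

SOFT, HARD = "soft", "hard"


def classify_job_failures(recent_results, hard_after=3):
    """Return None, SOFT, or HARD from a job's most recent run results.

    recent_results is ordered oldest to newest; True means the run succeeded.
    """
    consecutive_failures = 0
    for succeeded in reversed(recent_results):
        if succeeded:
            break
        consecutive_failures += 1
    if consecutive_failures >= hard_after:
        return HARD
    if consecutive_failures >= 1:
        return SOFT
    return None


def is_data_missing(last_observation_ts, now=None, max_silence_seconds=600):
    """True when no observation has arrived within the silence window."""
    now = time.time() if now is None else now
    return (now - last_observation_ts) > max_silence_seconds


def load_per_core(load_average, core_count):
    """Normalize a system-wide load average by the number of cores."""
    if core_count <= 0:
        raise ValueError("core_count must be positive")
    return load_average / core_count


print(classify_job_failures([True, True, False]))          # one failure
print(classify_job_failures([True, False, False, False]))  # three in a row
print(is_data_missing(last_observation_ts=0, now=601))     # silent for over 10 min
print(load_per_core(8.0, 16))                              # same load, 16 cores
```

Because `load_per_core` is proportional, one threshold can be shared by hosts with different core counts, which is the point of the third suggestion.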
Deploys: What stages exist?
● Before deployment is planned
● Pre-deployment
● Deployment
● Post-deployment
Note: Box size proportional to effort needed
Deploys: Questions to Consider Before a Deploy is Planned
● What are common rollback scenarios and how are they executed?
● What is the escalation policy should something break?
● What development strategies will be followed to avoid backwards-incompatible changes?
● What procedures (e.g. testing) certify that software is ready for deployment?
● Involve other stakeholders:
– What are amenable times of day or days of week for deployments?
– What questions / constraints should be cleared prior to a deploy (e.g. confirm there are no high-touch client meetings the day of the deploy)?
– How much downtime is acceptable?
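Once stakeholders have agreed on amenable deploy times, the agreement can be encoded as a simple guard. The window below (Tuesday to Thursday, 10:00 to 15:00) is a placeholder assumption, and `in_deploy_window` is a hypothetical helper, not part of any deploy tool.

```python
from datetime import datetime, time

# Placeholder policy: Tuesday to Thursday, 10:00 to 15:00 local time.
ALLOWED_WEEKDAYS = {1, 2, 3}  # datetime.weekday() counts Monday as 0
WINDOW_START = time(10, 0)
WINDOW_END = time(15, 0)


def in_deploy_window(moment: datetime) -> bool:
    """True when the moment falls inside the agreed deploy window."""
    return (moment.weekday() in ALLOWED_WEEKDAYS
            and WINDOW_START <= moment.time() < WINDOW_END)


print(in_deploy_window(datetime(2024, 1, 3, 11, 30)))  # a Wednesday morning
print(in_deploy_window(datetime(2024, 1, 5, 16, 0)))   # a Friday afternoon
```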
Deploys: Questions to Consider Pre-deployment
● Have all artifacts been versioned (e.g. remove branch/RC/SNAPSHOT modifiers)?
● Have all possible combinations of versions in production and to-be-deployed versions been exercised together to ensure compatibility?
● Have any side-effecting updates (e.g. DB schema changes) been tested in a non-production environment?
● Are deployment steps documented?
● Consider outcomes:
– What will a successful deployment look like?
– What signs will a failing/failed deployment show?
– In addition to engineering, what stakeholders are needed to confirm success/failure?
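The first two questions above lend themselves to a mechanical check. The sketch below assumes release versions look exactly like MAJOR.MINOR.PATCH and treats any suffix (such as -SNAPSHOT or -RC1) as unreleased; that convention, the artifact names, and both helper functions are hypothetical.

```python
import re
from itertools import product

# Assumed convention: a release version is exactly MAJOR.MINOR.PATCH.
RELEASE_PATTERN = re.compile(r"^\d+\.\d+\.\d+$")


def unreleased_artifacts(artifacts):
    """Return the artifacts whose version is not a plain release version."""
    return {name: version for name, version in artifacts.items()
            if not RELEASE_PATTERN.match(version)}


def combinations_to_exercise(production_versions, candidate_versions):
    """Pair each version running in production with each candidate version."""
    return list(product(sorted(production_versions), sorted(candidate_versions)))


artifacts = {"api": "2.4.0", "worker": "1.9.0-SNAPSHOT", "ui": "3.0.0-RC1"}
print(unreleased_artifacts(artifacts))
print(combinations_to_exercise({"1.8.0", "1.8.1"}, {"1.9.0"}))
```

The number of pairs grows multiplicatively, so keeping few versions live in production keeps the compatibility matrix small enough to actually test.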
Parting Thoughts
● Trial and error is a part of this process. Mistakes will be made!
● Consider the next step outcome (e.g. What happens when…?).
● Codify operational concerns (e.g. alerting) into the definition of “done”.
● Vigilantly review alerts that fire frequently and/or without action items to minimize on-call fatigue.
● Periodically audit alerts to identify gaps and remove stale alerts.
● Consider adding developers to the on-call rotation.
● Retain flexibility:
– With sufficient alerting in place, deploys can be less stringent, which facilitates faster feedback loops.
– Differentiate the definition of done for proof-of-concept (POC) work from that for production work, and define the transition point between POC and “production”.
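The review of alerts that fire frequently without action items can start from a simple tally of alert history. The `(alert_name, action_taken)` record shape, the thresholds, and the alert names below are all assumptions for illustration.

```python
from collections import Counter


def noisy_alerts(history, min_firings=5, min_action_ratio=0.5):
    """Names of alerts that fire often yet rarely lead to an action.

    history is a list of (alert_name, action_taken) tuples.
    """
    firings = Counter(name for name, _ in history)
    actioned = Counter(name for name, acted in history if acted)
    return sorted(
        name for name, count in firings.items()
        if count >= min_firings and actioned[name] / count < min_action_ratio
    )


# Illustrative history: cpu_high fires six times and never needs action.
history = (
    [("disk_full", True)] * 2
    + [("cpu_high", False)] * 6
    + [("job_failed", True)] * 5
)
print(noisy_alerts(history))
```

Alerts returned by such a tally are candidates for retuning, downgrading to soft alerts, or removal.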
Thank you!
To complete the definition of done for this presentation, let’s answer
questions :)
