SlideShare a Scribd company logo
1 of 23
Download to read offline
Confidential 1
Testing
in
Production
Saturday 13th October, 2018
shaun@abram.com
shaunabram.com
@shaunabram
linkedin.com/in/sabram/
Evaluation: https://goo.gl/VTAwgw
Confidential 2
Confidential
Testing in Production
3
How is Production different?
Confidential 4
Testing in Production
IS NOT
a replacement
for non-prod testing
Treat production validation with the respect it deserves.
Confidential 5
Testing in Production
Observability
Testing
at Release
Chaos
Engineering
Confidential 6
Observability
The ability to ask new questions of your system without deploying new code
Confidential
We need Observability in our systems
7
 Everything is sometimes broken
 Something is always broken
 If nothing seems broken...
…your monitoring is broken
It’s impossible to predict the myriad states of partial failures we’ll see
Confidential
Privileged and Confidential 8
Observability
How do we observe our apps?
Logs
Metrics
Monitoring &
Alerting
Traces
Tools
Confidential 9
Testing in Production
Observability
Confidential 10
Testing in Production
Observability
Testing
at Release
Chaos
Engineering
Confidential 11
Testing in Production
Testing
at Release
Confidential 12
Testing at Release
Deploy
Config Tests
Smoke Tests
Shadowing
Load Tests
Release
Canary release
Internal release
Post-Release
Feature Flags
A/B Testing
Chaos Engineering…
Confidential 13
Testing in Production
Testing
at Release
Confidential 14
Testing in Production
Observability
Testing
at Release
Chaos
Engineering
Confidential 15
Testing in Production
Chaos
Engineering
Carefully planned experiments
designed to reveal weaknesses in our
systems
aka
Resilience Engineering
Confidential
Game Days
16
An exercise where we place systems
under stress to
learn and improve resilience
(And even just getting the team together to discuss resilience can be worthwhile)
Confidential
Chaos Engineering – a step by step guide
17
Hypothesis
(Steady state)
Minimize
Blast Radius
Run
Analyze
Increase
Repeat,
Automate
Confidential 18
Testing in Production
Chaos
Engineering
Confidential 19
Testing in Production
Observability
Testing
at Release
Chaos
Engineering
Confidential
Reading material
20
Chaos Engineering (free eBook)
https://www.oreilly.com/webops-perf/free/chaos-engineering.csp
Distributed Systems Observability (free eBook)
https://distributed-systems-observability-ebook.humio.com/
Confidential
Reading material
21
shaunabram.com
Principles of Chaos Engineering
principlesofchaos.org
How to run a Game Day
https://www.gremlin.com/community/tutorials/how-to-run-a-gameday/
Testing in production:
https://medium.com/@copyconstruct/testing-in-production-the-safe-way-18ca102d0ef1
Monitoring in the time of Cloud Native
https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
Deploy != Release
https://blog.turbinelabs.io/deploy-not-equal-release-part-one-4724bc1e726b
Confidential
Testing in production – The Industry Experts
22
Nora Jones
Charity Majors
Cindy Sridharan
Tammy Butow
Confidential
Questions?
And please evaluate!
https://goo.gl/VTAwgw
23

More Related Content

Similar to Testing inproduction svcc18

DevSecOps for Developers, How To Start (ETC 2020)
DevSecOps for Developers, How To Start (ETC 2020)DevSecOps for Developers, How To Start (ETC 2020)
DevSecOps for Developers, How To Start (ETC 2020)Patricia Aas
 
[Rakuten TechConf2014] [G-4] Beyond Agile Testing to Lean Development
[Rakuten TechConf2014] [G-4] Beyond Agile Testing to Lean Development[Rakuten TechConf2014] [G-4] Beyond Agile Testing to Lean Development
[Rakuten TechConf2014] [G-4] Beyond Agile Testing to Lean DevelopmentRakuten Group, Inc.
 
Beyond Agile Testing to Lean Development — Rakuten Technology Conference
Beyond Agile Testing to Lean Development — Rakuten Technology ConferenceBeyond Agile Testing to Lean Development — Rakuten Technology Conference
Beyond Agile Testing to Lean Development — Rakuten Technology ConferenceJames Coplien
 
What Every Developer And Tester Should Know About Software Security
What Every Developer And Tester Should Know About Software SecurityWhat Every Developer And Tester Should Know About Software Security
What Every Developer And Tester Should Know About Software SecurityAnne Oikarinen
 
CyberSecurity - Future Risks, Zero Trust and the Optus Data Leak.pdf
CyberSecurity - Future Risks, Zero Trust and the Optus Data Leak.pdfCyberSecurity - Future Risks, Zero Trust and the Optus Data Leak.pdf
CyberSecurity - Future Risks, Zero Trust and the Optus Data Leak.pdfRoger Qiu
 
12 Crucial Windows Security Skills for 2018
12 Crucial Windows Security Skills for 201812 Crucial Windows Security Skills for 2018
12 Crucial Windows Security Skills for 2018Paula Januszkiewicz
 
Applying formal methods to existing software by B.Monate
Applying formal methods to existing software by B.MonateApplying formal methods to existing software by B.Monate
Applying formal methods to existing software by B.MonateMahaut Gouhier
 
Testistanbul 2016 - Keynote: "Why Automated Verification Matters" by Kristian...
Testistanbul 2016 - Keynote: "Why Automated Verification Matters" by Kristian...Testistanbul 2016 - Keynote: "Why Automated Verification Matters" by Kristian...
Testistanbul 2016 - Keynote: "Why Automated Verification Matters" by Kristian...Turkish Testing Board
 
Multipoint Conferencing Unit Comparative Study
Multipoint Conferencing Unit Comparative StudyMultipoint Conferencing Unit Comparative Study
Multipoint Conferencing Unit Comparative StudyVideoguy
 
Securing the Pipeline
Securing the PipelineSecuring the Pipeline
Securing the PipelineThoughtworks
 
Foundations of Software Testing Lecture 4
Foundations of Software Testing Lecture 4Foundations of Software Testing Lecture 4
Foundations of Software Testing Lecture 4Iosif Itkin
 
Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...
Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...
Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...Simon Storm
 
Building Trust in Automated Tests
Building Trust in Automated TestsBuilding Trust in Automated Tests
Building Trust in Automated TestsJyoti Mittal
 
Are Your Continuous Tests Too Fragile for Agile?
Are Your Continuous Tests Too Fragile for Agile?Are Your Continuous Tests Too Fragile for Agile?
Are Your Continuous Tests Too Fragile for Agile?Parasoft
 
CLL19 - Acceptance Tests as Monitors
CLL19 - Acceptance Tests as MonitorsCLL19 - Acceptance Tests as Monitors
CLL19 - Acceptance Tests as MonitorsPhill Barber
 
How to Better Manage Technical Debt While Innovating on DevOps
How to Better Manage Technical Debt While Innovating on DevOpsHow to Better Manage Technical Debt While Innovating on DevOps
How to Better Manage Technical Debt While Innovating on DevOpsDynatrace
 
Testing in a glance
Testing in a glanceTesting in a glance
Testing in a glanceRajesh Kumar
 
Making Security Agile - Oleg Gryb
Making Security Agile - Oleg GrybMaking Security Agile - Oleg Gryb
Making Security Agile - Oleg GrybSeniorStoryteller
 

Similar to Testing inproduction svcc18 (20)

DevSecOps for Developers, How To Start (ETC 2020)
DevSecOps for Developers, How To Start (ETC 2020)DevSecOps for Developers, How To Start (ETC 2020)
DevSecOps for Developers, How To Start (ETC 2020)
 
[Rakuten TechConf2014] [G-4] Beyond Agile Testing to Lean Development
[Rakuten TechConf2014] [G-4] Beyond Agile Testing to Lean Development[Rakuten TechConf2014] [G-4] Beyond Agile Testing to Lean Development
[Rakuten TechConf2014] [G-4] Beyond Agile Testing to Lean Development
 
Webinar: "La supply chain del software vista a raggi X"
Webinar: "La supply chain del software vista a raggi X" Webinar: "La supply chain del software vista a raggi X"
Webinar: "La supply chain del software vista a raggi X"
 
Beyond Agile Testing to Lean Development — Rakuten Technology Conference
Beyond Agile Testing to Lean Development — Rakuten Technology ConferenceBeyond Agile Testing to Lean Development — Rakuten Technology Conference
Beyond Agile Testing to Lean Development — Rakuten Technology Conference
 
What Every Developer And Tester Should Know About Software Security
What Every Developer And Tester Should Know About Software SecurityWhat Every Developer And Tester Should Know About Software Security
What Every Developer And Tester Should Know About Software Security
 
CyberSecurity - Future Risks, Zero Trust and the Optus Data Leak.pdf
CyberSecurity - Future Risks, Zero Trust and the Optus Data Leak.pdfCyberSecurity - Future Risks, Zero Trust and the Optus Data Leak.pdf
CyberSecurity - Future Risks, Zero Trust and the Optus Data Leak.pdf
 
12 Crucial Windows Security Skills for 2018
12 Crucial Windows Security Skills for 201812 Crucial Windows Security Skills for 2018
12 Crucial Windows Security Skills for 2018
 
Applying formal methods to existing software by B.Monate
Applying formal methods to existing software by B.MonateApplying formal methods to existing software by B.Monate
Applying formal methods to existing software by B.Monate
 
Testistanbul 2016 - Keynote: "Why Automated Verification Matters" by Kristian...
Testistanbul 2016 - Keynote: "Why Automated Verification Matters" by Kristian...Testistanbul 2016 - Keynote: "Why Automated Verification Matters" by Kristian...
Testistanbul 2016 - Keynote: "Why Automated Verification Matters" by Kristian...
 
Multipoint Conferencing Unit Comparative Study
Multipoint Conferencing Unit Comparative StudyMultipoint Conferencing Unit Comparative Study
Multipoint Conferencing Unit Comparative Study
 
Securing the Pipeline
Securing the PipelineSecuring the Pipeline
Securing the Pipeline
 
Foundations of Software Testing Lecture 4
Foundations of Software Testing Lecture 4Foundations of Software Testing Lecture 4
Foundations of Software Testing Lecture 4
 
Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...
Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...
Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...
 
Building Trust in Automated Tests
Building Trust in Automated TestsBuilding Trust in Automated Tests
Building Trust in Automated Tests
 
Are Your Continuous Tests Too Fragile for Agile?
Are Your Continuous Tests Too Fragile for Agile?Are Your Continuous Tests Too Fragile for Agile?
Are Your Continuous Tests Too Fragile for Agile?
 
nullcon 2011 - Fuzzing with Complexities
nullcon 2011 - Fuzzing with Complexitiesnullcon 2011 - Fuzzing with Complexities
nullcon 2011 - Fuzzing with Complexities
 
CLL19 - Acceptance Tests as Monitors
CLL19 - Acceptance Tests as MonitorsCLL19 - Acceptance Tests as Monitors
CLL19 - Acceptance Tests as Monitors
 
How to Better Manage Technical Debt While Innovating on DevOps
How to Better Manage Technical Debt While Innovating on DevOpsHow to Better Manage Technical Debt While Innovating on DevOps
How to Better Manage Technical Debt While Innovating on DevOps
 
Testing in a glance
Testing in a glanceTesting in a glance
Testing in a glance
 
Making Security Agile - Oleg Gryb
Making Security Agile - Oleg GrybMaking Security Agile - Oleg Gryb
Making Security Agile - Oleg Gryb
 

More from Shaun Abram

Unit testing - the hard parts
Unit testing - the hard partsUnit testing - the hard parts
Unit testing - the hard partsShaun Abram
 
RESTful Microservices
RESTful MicroservicesRESTful Microservices
RESTful MicroservicesShaun Abram
 
Rest and Microservices at the Las Vegas Dot Net Group
Rest and Microservices at the Las Vegas Dot Net GroupRest and Microservices at the Las Vegas Dot Net Group
Rest and Microservices at the Las Vegas Dot Net GroupShaun Abram
 
REST and Microservices
REST and MicroservicesREST and Microservices
REST and MicroservicesShaun Abram
 
Software Quality via Unit Testing
Software Quality via Unit TestingSoftware Quality via Unit Testing
Software Quality via Unit TestingShaun Abram
 

More from Shaun Abram (6)

Ship it boise
Ship it boiseShip it boise
Ship it boise
 
Unit testing - the hard parts
Unit testing - the hard partsUnit testing - the hard parts
Unit testing - the hard parts
 
RESTful Microservices
RESTful MicroservicesRESTful Microservices
RESTful Microservices
 
Rest and Microservices at the Las Vegas Dot Net Group
Rest and Microservices at the Las Vegas Dot Net GroupRest and Microservices at the Las Vegas Dot Net Group
Rest and Microservices at the Las Vegas Dot Net Group
 
REST and Microservices
REST and MicroservicesREST and Microservices
REST and Microservices
 
Software Quality via Unit Testing
Software Quality via Unit TestingSoftware Quality via Unit Testing
Software Quality via Unit Testing
 

Recently uploaded

Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Arti Languages Pre Seed Pitchdeck 2024.pdf
Arti Languages Pre Seed Pitchdeck 2024.pdfArti Languages Pre Seed Pitchdeck 2024.pdf
Arti Languages Pre Seed Pitchdeck 2024.pdfwill854175
 
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024BookNet Canada
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
Why Agile? - A handbook behind Agile Evolution
Why Agile? - A handbook behind Agile EvolutionWhy Agile? - A handbook behind Agile Evolution
Why Agile? - A handbook behind Agile EvolutionDEEPRAJ PATHAK
 
Introduction-to-Wazuh-and-its-integration.pptx
Introduction-to-Wazuh-and-its-integration.pptxIntroduction-to-Wazuh-and-its-integration.pptx
Introduction-to-Wazuh-and-its-integration.pptxmprakaash5
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2
Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2
Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2DianaGray10
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Tecnogravura, Cylinder Engraving for Rotogravure
Tecnogravura, Cylinder Engraving for RotogravureTecnogravura, Cylinder Engraving for Rotogravure
Tecnogravura, Cylinder Engraving for RotogravureAntonio de Llamas
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023Joshua Flannery
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Deliver Latency Free Customer Experience
Deliver Latency Free Customer ExperienceDeliver Latency Free Customer Experience
Deliver Latency Free Customer ExperienceOpsTree solutions
 
Dynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientationDynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientationBuild Intuit
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Which standard is best for your content?
Which standard is best for your content?Which standard is best for your content?
Which standard is best for your content?Rustici Software
 
Automation Ops Series: Session 3 - Solutions management
Automation Ops Series: Session 3 - Solutions managementAutomation Ops Series: Session 3 - Solutions management
Automation Ops Series: Session 3 - Solutions managementDianaGray10
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneUiPathCommunity
 

Recently uploaded (20)

Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Arti Languages Pre Seed Pitchdeck 2024.pdf
Arti Languages Pre Seed Pitchdeck 2024.pdfArti Languages Pre Seed Pitchdeck 2024.pdf
Arti Languages Pre Seed Pitchdeck 2024.pdf
 
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Why Agile? - A handbook behind Agile Evolution
Why Agile? - A handbook behind Agile EvolutionWhy Agile? - A handbook behind Agile Evolution
Why Agile? - A handbook behind Agile Evolution
 
Introduction-to-Wazuh-and-its-integration.pptx
Introduction-to-Wazuh-and-its-integration.pptxIntroduction-to-Wazuh-and-its-integration.pptx
Introduction-to-Wazuh-and-its-integration.pptx
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2
Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2
Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Tecnogravura, Cylinder Engraving for Rotogravure
Tecnogravura, Cylinder Engraving for RotogravureTecnogravura, Cylinder Engraving for Rotogravure
Tecnogravura, Cylinder Engraving for Rotogravure
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Deliver Latency Free Customer Experience
Deliver Latency Free Customer ExperienceDeliver Latency Free Customer Experience
Deliver Latency Free Customer Experience
 
Dynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientationDynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientation
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Which standard is best for your content?
Which standard is best for your content?Which standard is best for your content?
Which standard is best for your content?
 
Automation Ops Series: Session 3 - Solutions management
Automation Ops Series: Session 3 - Solutions managementAutomation Ops Series: Session 3 - Solutions management
Automation Ops Series: Session 3 - Solutions management
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyone
 

Testing inproduction svcc18

Editor's Notes

  1. Joke… or Good? Non-prod = pale imitation, like mocks, or “it works on my machine” Prod is different; 4th trimester… “Testing in production” You may have seen this meme before: The DosEquis guys saying “I don’t always test, but when I do, I test in production” “Testing in production” has been kind of a joke -> what you’re really saying is you don’t test anywhere. And instead you’re just winging it: deploying to production and <CROSS FINGERS> hoping it all works. But then I began to look at it differently. The DosEquis guy usually says “I don’t always drink beer, but when I do, I drink DosEquis” Meaning DosEquis is the best beer to drink. So, the implication here is not that testing in production is a joke, but that Production is actually the BEST place to test. And I’m increasingly believing that to be the case. Or, at least, that production is an environment we shouldn’t be ignoring for testing. After all, prod is the only place your software has an impact on your customers is production. But there has been this status quo of production being sacrosanct. Instead of testing there, it is common to keep a non-prod env, such as staging, as identical to production as possible, and test there. Such environments are usually a pale imitation of production however. Testing in staging is kind of like testing with mocks, an imitation, but not the real thing. Saying “works in staging” is only one step better than “works on my machine”. Production is a different beast! I’ve heard of software being released to production as being like a baby’s 4th trimester. When software leave’s it artificial environments and slams into the real world But what makes the real world of production so special?
  2. Serious question: In what ways in Production different from other environments? Hardware & Cluster size, Data Configuration, Traffic, Monitoring Some things we can only test in production As our architecture becomes more compilated (particularly with Microservices), we need to consider all options to allow us to test and deliver working software to our customer. Including testing in production.
  3. So should we skip testing in non-prod first? No! Testing in production is by no means a substitute for pre-production testing I’ve given talks on unit testing, integration testing Mocks About code coverage and Continuous Integration I believe very firmly in all those things. Testing in Production is an addition to all those. Most production testing is really validation only – although there is at least one exception (A/B testing) Respect production Beware of unwanted side effects Stateless services are good candidates Think SAFE methods e.g. GET, HEAD Consider testing using expected failures of others e.g. PUT that results in 400 error (still tells you something) Or at least be able to tell the difference between test data and “real” prod data
  4. Today we’re going to cover some of the different ways we can test in production We’ll start with Observability, the foundation for any testing in production. Observability = Knowing what the heck your app is doing anyway. Going beyond just logs and alerting Around deployment & release times Chaos engineering. Perhaps the most advanced form of production testing, but I would argue its actually not that advanced. I talk about what is is, some basic rules for doing and how we’ve been starting to use it where I work.
  5. Observability is The ability… Being able to answer questions that you have never thought of before You can think of it as the next step beyond just monitoring and alerting Systems have become more distributed, and in the case of containerization, more ephemeral. It is increasingly difficult to know what our software is doing And Observability means bringing better visibility into systems  To have better visibility, we need to acknowledge that…
  6. Everything is sometimes broken Something is always broker -> No complex system is ever fully healthy If nothing broken… Distributed systems are unpredictable. In particular, it’s impossible to predict all the ways a system might fail Failure needs to be embraced at every phase (from design to implementation, testing, deployment, and operation) Ease of debugging is of high importance
  7. Logging: Structured logging: plain text -> splunk friendly -> json Eventlog can be a great source of logs for debugging too Consider sampling rather than aggregation Metrics: Time series metrics, like tracking system stats such as CPU and mem usage, stats like # logins Tracing: Distributed traceability using Correlation ID lib; Zipkin etc Alerting: Useful for about proactively learning about, typically, predictable issues Tools: e.g. Splunk, NR, OverOps; Wavefront EPX, TDA (Thread Dump Analyzer) UX OverOps / HoneyComb Stacktraces and exception trackers?
  8. OK, so that was Observability The ability to answer questions about our applications behavior in production. Questions we may have never even though of before.
  9. And with Observability in place, what types of production testing can we do… Let’s move onto…
  10. Testing at Release time Let’s start by defining some terms: Deployment vs release
  11. When talking with engineers, I usually use the term Chaos Engineering, because it sounds cool! When talking with management, I tend to use the term resilience engineering, since it sounds less scary. The terms are synonymous. In the past, terms such as Disaster Recovery and Contingency Planning have been used to describe somewhat similar processes. Whatever term you use, it basically refers to  -> Conducting carefully planned experiments designed to reveal weaknesses in our systems. In other words, CE is the practice of confirming that your applications work as you expect them to in production. Despite the name, Chaos Engineering is not about introducing Chaos into your system! Instead it is about identifying any chaos already there, so that you can remediate. For example, if you GIVE A GOOD CANONICAL EXAMPLE OF A CHAOS ENGINEERING EXPERIMENT HERE if you believe your application will failover to something if x happens or can handle x requests per second before failing.  
  12. What are Game Days? If Chaos Engineering is the theory, Game days are the practice; the execution Game days are where you start with Chaos engineering -> Game days are “An exercise where we place systems under stress to learn and improve resilience” Systems can be technology, people, process They are like fire drills – an opportunity to practice a potentially dangerous scenario in a safer environment
  13. To start with, what are we trying to test! Pick a hypothesis. Typically in Chaos Engineering experiments, the hypothesis is that if I do X (take out a server, kill a region), everything should be OK But we need to be specific about how to measure things are OK If out hypothesis is “if we fail primary DB, everything should be ok,” Then we need to define what OK is! And a big part of defining OK is to define “Steady State” Steady state is essentially what the key metrics are for you to monitor as part of your test. It could be things like: Loan application remain constant Or response times remain in an acceptable range If you don’t define steady state, how do you know your test is working on not? How do you know if you are breaking things? With a hypothesis in mind, and a way to test, but first think abut blast radius 2. Minimize the blast radius The blast radius refers to to how much damage can be done by the experiment If you take out a server, and everything is in fact NOT OK, how bad might it be Try to ensure that you limit he possible damage For example, if your hypothesis is that When Foo service is running in a pool of 2 servers And one of those servers dies, CPU and memory utilization should increase on the remain servers, but response time remain unaffected That is a fine thing to test But if you have 10 services depending on that service (even in non prod), and your wrong that response times will be unaffected, you may have caused 10 other services to have problems So a way to limit the blast radius in that test would be to test using a pool of Foo Service that only one other service relies on. Hopefully a service that you also control and that is closely monitored as part of the test. Another way to minimize possible damage is to make sure that you have the equivalent of a big red Stop Test button! If you metrics aren’t looking good, have the ability to abort the test immediately. Remember: our goal here is to build confidence in the resilience of the system and we so that with Controlled, contained experiments, that grow in scope incrementally. 3. Run the experiment Figure out the best way to test your hypothesis If you plan to take out a server, how do you do it? ssh in and kill -9? Orderly shutdown? Have Ops do it for you? Do you simulate failure by using bogus IP addresses, or simply removing a server from a VIP pool? And again, stop if metrics or alert dictate 4. Analyze the results Were your expectations correct? Did your system handle things correctly Did you spot issue with you alerts, metrics that should be improved before any future tests 5. Increase scope The idea is to start small 1 service, in non-prod, and gradually expand to prod. And the goal should be prod. Prod is where’s it’s at!
  14. That brings us to the end of the presentation We have talked about Testing in production No longer a joke, instead increasingly viewed as a best practice. It is not a replacement for the essential and high value non-prod testing we do, but instead an addition. Observability: Testing in production, and indeed in all envs, requires being able to understand what our applications do. Conventional logs, monitoring and alerting are all good, but Observability is about more than that. It’s about the ability to answer complex questions about our apps at run time. Questions we may not have even thought of before like: why is my app slow. Is it me or a downstream service? Where is all my memory being used. We can use metrics, tracing, any tools at our disposal so that we can see what’s going on when things go wrong. Or better still, to proactively spot problems in advance. And with Observability in place, we can actually start to test in production! We ran through different types of Testing at Release We can do, including after deployment (Config, smoke, load, shadow) At release time: Canary and internal release After release: Feature flags and A/B testing Finally, even when everything is up and running in prod, customers are using it, and all looks good, there is still more testing we can do Chaos Engineering Not introducing chaos, but exposing the already present chaos! Carefully planned experiments designed to reveal weaknesses in our systems