Testing in Production (svcc18)

While testing in demo and stage is good (indeed, essential), testing in production is all too often overlooked. Deploying to production and hoping for the best is a gamble, not a strategy.

In this talk, we discuss:
1) Better production deployment and testing strategies including dark pool testing, canary releases and feature switching.
2) After deployment, your work is still not done. We'll talk about Observability, including monitoring, tracing and metrics.
3) Finally, even with the best deployment strategies and monitoring techniques, your software WILL fail in production. It's a question of when, not if. So why not simulate those failures first? We'll finish with game days and chaos engineering.

This talk should be of interest to all developers, QA and Ops folks who are responsible for getting working software in front of users.

Slides

  1. Testing in Production. Saturday 13th October, 2018. shaun@abram.com, shaunabram.com, @shaunabram, linkedin.com/in/sabram/. Evaluation: https://goo.gl/VTAwgw
  2. (no slide text; image only)
  3. Testing in Production: How is Production different?
  4. Testing in Production IS NOT a replacement for non-prod testing. Treat production validation with the respect it deserves.
  5. Testing in Production: Observability, Testing at Release, Chaos Engineering
  6. Observability: The ability to ask new questions of your system without deploying new code
  7. We need Observability in our systems: Everything is sometimes broken. Something is always broken. If nothing seems broken... your monitoring is broken. It's impossible to predict the myriad states of partial failures we'll see.
  8. Observability: How do we observe our apps? Logs, Metrics, Monitoring & Alerting, Traces, Tools
  9. Testing in Production: Observability
  10. Testing in Production: Observability, Testing at Release, Chaos Engineering
  11. Testing in Production: Testing at Release
  12. Testing at Release. Deploy: Config Tests, Smoke Tests, Shadowing, Load Tests. Release: Canary release, Internal release. Post-Release: Feature Flags, A/B Testing, Chaos Engineering…
  13. Testing in Production: Testing at Release
  14. Testing in Production: Observability, Testing at Release, Chaos Engineering
  15. Testing in Production: Chaos Engineering. Carefully planned experiments designed to reveal weaknesses in our systems (aka Resilience Engineering)
  16. Game Days: An exercise where we place systems under stress to learn and improve resilience. (And even just getting the team together to discuss resilience can be worthwhile)
  17. Chaos Engineering, a step-by-step guide: Hypothesis (Steady state), Minimize Blast Radius, Run, Analyze, Increase, Repeat, Automate
  18. Testing in Production: Chaos Engineering
  19. Testing in Production: Observability, Testing at Release, Chaos Engineering
  20. Reading material: Chaos Engineering (free eBook) https://www.oreilly.com/webops-perf/free/chaos-engineering.csp; Distributed Systems Observability (free eBook) https://distributed-systems-observability-ebook.humio.com/
  21. Reading material: shaunabram.com; Principles of Chaos Engineering principlesofchaos.org; How to run a Game Day https://www.gremlin.com/community/tutorials/how-to-run-a-gameday/; Testing in production https://medium.com/@copyconstruct/testing-in-production-the-safe-way-18ca102d0ef1; Monitoring in the time of Cloud Native https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e; Deploy != Release https://blog.turbinelabs.io/deploy-not-equal-release-part-one-4724bc1e726b
  22. Testing in production, the industry experts: Nora Jones, Charity Majors, Cindy Sridharan, Tammy Butow
  23. Questions? And please evaluate! https://goo.gl/VTAwgw

Editor's Notes

  • Joke… or Good?
    Non-prod = pale imitation, like mocks, or “it works on my machine”
    Prod is different; 4th trimester…


    “Testing in production”
    You may have seen this meme before: The DosEquis guys saying
    “I don’t always test, but when I do, I test in production”
    “Testing in production” has been kind of a joke -> what you’re really saying is you don’t test anywhere.
    And instead you’re just winging it: deploying to production and <CROSS FINGERS> hoping it all works.

    But then I began to look at it differently.
    The DosEquis guy usually says “I don’t always drink beer, but when I do, I drink DosEquis”
    Meaning DosEquis is the best beer to drink.
    So, the implication here is not that testing in production is a joke, but that Production is actually the BEST place to test.

    And I’m increasingly believing that to be the case. Or, at least, that production is an environment we shouldn’t be ignoring for testing.
    After all, production is the only place your software has an impact on your customers.
    But there has been this status quo of production being sacrosanct. Instead of testing there, it is common to keep a non-prod env, such as staging, as identical to production as possible, and test there.
    Such environments are usually a pale imitation of production however.
    Testing in staging is kind of like testing with mocks, an imitation, but not the real thing.
    Saying “works in staging” is only one step better than “works on my machine”.

    Production is a different beast!
    I’ve heard of software being released to production as being like a baby’s 4th trimester.
    When software leaves its artificial environments and slams into the real world
    But what makes the real world of production so special?
  • Serious question: In what ways is Production different from other environments?
    Hardware & Cluster size, Data
    Configuration, Traffic, Monitoring

    Some things we can only test in production
    As our architecture becomes more complicated (particularly with Microservices), we need to consider all options to allow us to test and deliver working software to our customers. Including testing in production.
  • So should we skip testing in non-prod first? No!
    Testing in production is by no means a substitute for pre-production testing

    I’ve given talks on
    unit testing, integration testing
    Mocks
    About code coverage and Continuous Integration
    I believe very firmly in all those things.
    Testing in Production is an addition to all those.
    Most production testing is really validation only – although there is at least one exception (A/B testing)

    Respect production
    Beware of unwanted side effects
    Stateless services are good candidates
    Think SAFE methods e.g. GET, HEAD
    Consider testing using expected failures, e.g. a PUT that results in a 400 error (it still tells you something)
    Or at least be able to tell the difference between test data and “real” prod data
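    A minimal sketch of a production-safe smoke check along these lines, using only safe methods and tagging traffic as synthetic (the endpoints, header name, and thresholds are illustrative assumptions, not from the talk):

```python
# smoke_check.py -- production-safe smoke test sketch. Assumptions: the
# service exposes read-only endpoints, and the X-Synthetic-Test header is
# how test traffic is distinguished from real traffic (both hypothetical).
import sys
import requests

BASE_URL = "https://example.com"
TEST_HEADERS = {"X-Synthetic-Test": "smoke"}  # tag so test data is identifiable

def check(path: str, max_latency_s: float = 1.0) -> bool:
    """Issue a safe (GET) request and verify status and latency."""
    resp = requests.get(BASE_URL + path, headers=TEST_HEADERS, timeout=5)
    latency = resp.elapsed.total_seconds()
    ok = resp.status_code == 200 and latency < max_latency_s
    print(f"{path}: status={resp.status_code} latency={latency:.3f}s ok={ok}")
    return ok

if __name__ == "__main__":
    # Only GETs against read-only endpoints: no side effects to worry about.
    results = [check("/health"), check("/api/status")]
    sys.exit(0 if all(results) else 1)  # non-zero exit can fail a pipeline gate
```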



  • Today we’re going to cover some of the different ways we can test in production
    We’ll start with Observability, the foundation for any testing in production. Observability = Knowing what the heck your app is doing anyway. Going beyond just logs and alerting

    Around deployment & release times

    Chaos engineering. Perhaps the most advanced form of production testing, but I would argue it's actually not that advanced. I talk about what it is, some basic rules for doing it, and how we've been starting to use it where I work.
  • Observability is
    The ability…
    Being able to answer questions that you have never thought of before

    You can think of it as the next step beyond just monitoring and alerting
    Systems have become more distributed, and in the case of containerization, more ephemeral. It is increasingly difficult to know what our software is doing
    And Observability means bringing better visibility into systems 
    To have better visibility, we need to acknowledge that…
  • Everything is sometimes broken
    Something is always broken
    -> No complex system is ever fully healthy
    If nothing seems broken… your monitoring is broken


    Distributed systems are unpredictable. In particular, it’s impossible to predict all the ways a system might fail
    Failure needs to be embraced at every phase (from design to implementation, testing, deployment, and operation)
    Ease of debugging is of high importance
  • Logging:
    Structured logging: plain text -> Splunk-friendly -> JSON
    Eventlog can be a great source of logs for debugging too
    Consider sampling rather than aggregation
    Metrics: Time series metrics, like tracking system stats such as CPU and mem usage, stats like # logins
    Tracing: Distributed traceability using Correlation ID lib; Zipkin etc
    Alerting: useful for proactively learning about typically predictable issues
    Tools: e.g. Splunk, NR, OverOps; Wavefront
    EPX, TDA (Thread Dump Analyzer) UX
    OverOps / HoneyComb
    Stacktraces and exception trackers?
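    To make the "plain text -> JSON" structured-logging idea (and correlation IDs) concrete, here is a stdlib-only sketch; the field names are assumptions, not a standard:

```python
# json_logging.py -- structured (JSON) logs with a correlation ID field,
# so a log aggregator (Splunk etc.) can query and join events per request.
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # ties together every log line emitted for one request
            "correlation_id": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# In a real service the correlation ID arrives on an incoming request header
# (or is minted at the edge) and is propagated on every downstream call.
cid = str(uuid.uuid4())
log.info("payment authorized", extra={"correlation_id": cid})
log.info("order placed", extra={"correlation_id": cid})
```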


  • OK, so that was Observability
    The ability to answer questions about our application's behavior in production. Questions we may have never even thought of before.
  • And with Observability in place, what types of production testing can we do…
    Let’s move onto…
  • Testing at Release time
    Let’s start by defining some terms:
    Deployment vs release
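    One way to make the deploy-vs-release distinction concrete: new code can be deployed dark behind a feature flag, then released gradually to a percentage of users. A minimal sketch (the flag store, flow functions and percentages are illustrative assumptions):

```python
# feature_flags.py -- deploy != release: the new code path ships dark, and
# the flag releases it to a sticky 5% of users. In practice the flag state
# would live in a flag service or config store, not a dict.
import hashlib

FLAGS = {
    "new_checkout": {"enabled": True, "rollout_percent": 5},
}

def is_enabled(flag: str, user_id: str) -> bool:
    cfg = FLAGS.get(flag)
    if not cfg or not cfg["enabled"]:
        return False
    # Hash the user ID so each user gets a sticky yes/no decision
    # instead of flipping on every request.
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < cfg["rollout_percent"]

def old_checkout_flow(user_id: str) -> str:
    return f"old flow for {user_id}"

def new_checkout_flow(user_id: str) -> str:
    return f"new flow for {user_id}"

def checkout(user_id: str) -> str:
    if is_enabled("new_checkout", user_id):
        return new_checkout_flow(user_id)  # deployed AND released (to 5%)
    return old_checkout_flow(user_id)      # deployed code exists, but not released
```

    Turning rollout_percent up is the release; rolling back is just flipping enabled to False, with no redeploy.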
  • When talking with engineers, I usually use the term Chaos Engineering, because it sounds cool! When talking with management, I tend to use the term resilience engineering, since it sounds less scary. The terms are synonymous. In the past, terms such as Disaster Recovery and Contingency Planning have been used to describe somewhat similar processes.

    Whatever term you use, it basically refers to 
    ->
    Conducting carefully planned experiments designed to reveal weaknesses in our systems.
    In other words, CE is the practice of confirming that your applications work as you expect them to in production.

    Despite the name, Chaos Engineering is not about introducing Chaos into your system! Instead it is about identifying any chaos already there, so that you can remediate.

    For example, the canonical Chaos Engineering experiment is randomly terminating instances (à la Netflix's Chaos Monkey) to confirm the system tolerates instance loss. More generally:
    if you believe your application will fail over if x happens,
    or can handle x requests per second before failing, design an experiment to test that belief.
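    A crude sketch of probing that second kind of hypothesis ("handles x requests per second"); a real load test would use a dedicated tool, and the URL and rates here are assumptions:

```python
# load_probe.py -- crude concurrency probe of "handles ~X requests/second".
# Sketch only: a proper load test would pace requests precisely and use a
# dedicated tool; the target URL and numbers are placeholders.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "https://example.com/health"  # use a safe, read-only endpoint
TARGET_RPS = 50
DURATION_S = 10

def hit(_) -> bool:
    try:
        return requests.get(URL, timeout=2).status_code == 200
    except requests.RequestException:
        return False

with ThreadPoolExecutor(max_workers=TARGET_RPS) as pool:
    results = list(pool.map(hit, range(TARGET_RPS * DURATION_S)))

success_rate = sum(results) / len(results)
print(f"success rate at roughly {TARGET_RPS} rps: {success_rate:.1%}")
# The hypothesis holds only if the success rate stays near 100%.
```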
     
  • What are Game Days?
    If Chaos Engineering is the theory, Game days are the practice; the execution
    Game days are where you start with Chaos engineering
    ->
    Game days are “An exercise where we place systems under stress to learn and improve resilience”
    Systems can be technology, people, process
    They are like fire drills – an opportunity to practice a potentially dangerous scenario in a safer environment





  • To start with, what are we trying to test? Pick a hypothesis.
    Typically in Chaos Engineering experiments, the hypothesis is that if I do X (take out a server, kill a region), everything should be OK
    But we need to be specific about how to measure things are OK
    If our hypothesis is “if we fail the primary DB, everything should be ok,”
    Then we need to define what OK is!
    And a big part of defining OK is to define “Steady State”
    Steady state is essentially the set of key metrics you monitor as part of your test. It could be things like:
    Loan applications remain constant
    Or response times remain in an acceptable range
    If you don’t define steady state, how do you know your test is working or not? How do you know if you are breaking things?

    With a hypothesis in mind, and a way to test it, first think about the blast radius
    2. Minimize the blast radius
    The blast radius refers to how much damage can be done by the experiment
    If you take out a server, and everything is in fact NOT OK, how bad might it be?
    Try to ensure that you limit the possible damage
    For example, if your hypothesis is that
    When Foo service is running in a pool of 2 servers
    And one of those servers dies, CPU and memory utilization should increase on the remaining server, but response times remain unaffected
    That is a fine thing to test
    But if you have 10 services depending on that service (even in non-prod), and you’re wrong that response times will be unaffected, you may cause 10 other services to have problems
    So a way to limit the blast radius in that test would be to test using a pool of Foo Service that only one other service relies on. Hopefully a service that you also control and that is closely monitored as part of the test.
    Another way to minimize possible damage is to make sure that you have the equivalent of a big red Stop Test button!
    If your metrics aren’t looking good, have the ability to abort the test immediately.
    Remember: our goal here is to build confidence in the resilience of the system, and we do that with controlled, contained experiments that grow in scope incrementally.

    3. Run the experiment
    Figure out the best way to test your hypothesis
    If you plan to take out a server, how do you do it?
    ssh in and kill -9? Orderly shutdown? Have Ops do it for you? Do you simulate failure by using bogus IP addresses, or simply removing a server from a VIP pool?
    And again, stop if metrics or alerts dictate

    4. Analyze the results
    Were your expectations correct?
    Did your system handle things correctly?
    Did you spot issues with your alerts or metrics that should be improved before any future tests?

    5. Increase scope
    The idea is to start small
    1 service, in non-prod, and gradually expand to prod.
    And the goal should be prod. Prod is where it’s at!
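    The five steps above can be sketched as a small harness: check steady state, inject a limited-blast-radius fault, watch the metrics, and hit the big red button the moment they leave the acceptable range. The hook functions here (inject_fault, revert_fault, get_error_rate) are hypothetical stand-ins for your own fault-injection and metrics tooling:

```python
# chaos_experiment.py -- skeleton of the hypothesis / blast radius / run /
# analyze loop. inject_fault(), revert_fault() and get_error_rate() are
# hypothetical hooks into your own tooling (e.g. a metrics API).
import time

STEADY_STATE_MAX_ERROR_RATE = 0.01  # hypothesis: error rate stays under 1%
CHECK_INTERVAL_S = 5
EXPERIMENT_DURATION_S = 60

def run_experiment(inject_fault, revert_fault, get_error_rate) -> bool:
    print(f"baseline error rate: {get_error_rate():.2%}")
    inject_fault()  # e.g. kill one instance in a small, low-dependency pool
    try:
        deadline = time.time() + EXPERIMENT_DURATION_S
        while time.time() < deadline:
            rate = get_error_rate()
            print(f"error rate: {rate:.2%}")
            if rate > STEADY_STATE_MAX_ERROR_RATE:
                print("steady state violated -- aborting (big red button)")
                return False  # hypothesis falsified: go analyze and fix
            time.sleep(CHECK_INTERVAL_S)
        print("steady state held at this blast radius")
        return True  # next run: increase the scope incrementally
    finally:
        revert_fault()  # always restore the system, pass or fail
```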
  • That brings us to the end of the presentation
    We have talked about Testing in production
    No longer a joke, instead increasingly viewed as a best practice. It is not a replacement for the essential and high value non-prod testing we do, but instead an addition.

    Observability: Testing in production, and indeed in all envs, requires being able to understand what our applications do. Conventional logs, monitoring and alerting are all good, but Observability is about more than that. It’s about the ability to answer complex questions about our apps at run time, questions we may not have even thought of before, like: Why is my app slow? Is it me or a downstream service? Where is all my memory being used? We can use metrics, tracing, and any tools at our disposal so that we can see what’s going on when things go wrong. Or better still, to proactively spot problems in advance.

    And with Observability in place, we can actually start to test in production!
    We ran through the different types of Testing at Release we can do, including:
    After deployment: config, smoke, load and shadow testing
    At release time: canary and internal releases
    After release: feature flags and A/B testing

    Finally, even when everything is up and running in prod, customers are using it, and all looks good, there is still more testing we can do

    Chaos Engineering
    Not introducing chaos, but exposing the already present chaos!
    Carefully planned experiments designed to reveal weaknesses in our systems
