Testing Microservices:
From Development to Production
Daniel Bryant
@danielbryantuk
Rube Goldberg’s self-operating napkin
Was Rube Goldberg the first microservice architect?
tl;dr
• Many of the microservice testing efforts I see sit at the extremes of a spectrum that runs from “YOLO” to seeking absolute correctness
• I believe the key tradeoffs should be around pre-prod vs post-prod tests
• Contract testing, API simulation and chaos experimentation can be useful
techniques for microservice testing
@danielbryantuk
• Independent Technical Consultant, Product Architect at Datawire
• Architecture, DevOps, Java, microservices, cloud, containers
• Continuous Delivery (CI/CD) advocate
• Leading change through technology and teams
bit.ly/2jWDSF7
oreil.ly/2E63nCR
Testing microservices 101
Testing: Core concepts
lisacrispin.com/2011/11/08/using-the-agile-testing-quadrants/
martinfowler.com/bliki/TestPyramid.html
The test pyramid (is just a model)
• This model was created before the
rise in popularity of microservices…
• …but after David Parnas’ work on modularity
• Applies at system and service level
• Probably needs updating…
martinfowler.com/bliki/TestPyramid.html
New testing strategies for microservices
https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16 http://distributed-systems-observability-ebook.humio.com/
Microservice test funnel
https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16
General lessons learned
(and mistakes made)
I’m not suggesting that you avoid unit tests
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf https://www.youtube.com/watch?v=ZMbqbXxRthE
I’m not suggesting that you avoid unit tests
• 77% of production failures can be reproduced by a unit test
• Testing error-handling code could have prevented 58% of catastrophic failures
• 35% of catastrophic failures were caused by trivially bad error handling:
• an empty error handler, or one containing only a “FIXME”/“TODO”
• an error handler that aborts the whole system
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
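The takeaway from the study above is that ordinary unit tests exercising error-handling paths catch a large share of these failures. A minimal, hypothetical JUnit 5 sketch of such a test; the service and gateway names are invented for illustration, not taken from the talk:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Hypothetical order service that depends on a payment gateway.
// The test exercises the error path, not just the happy path.
class OrderServiceErrorHandlingTest {

    static class PaymentException extends Exception { }

    interface PaymentGateway {
        void charge(String orderId) throws PaymentException;
    }

    static class OrderService {
        private final PaymentGateway gateway;

        OrderService(PaymentGateway gateway) {
            this.gateway = gateway;
        }

        // A real error handler: translate the failure into a meaningful result,
        // rather than an empty catch block or aborting the whole process.
        String placeOrder(String orderId) {
            try {
                gateway.charge(orderId);
                return "CONFIRMED";
            } catch (PaymentException e) {
                return "PAYMENT_FAILED";
            }
        }
    }

    @Test
    void failedPaymentIsHandledGracefully() {
        // Stub the dependency so that it always fails.
        OrderService service = new OrderService(orderId -> {
            throw new PaymentException();
        });

        assertEquals("PAYMENT_FAILED", service.placeOrder("order-123"));
    }
}
```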
Integration/component tests
If your component/integration tests look too complicated,
they probably are
Coupling and cohesion apply to everything!
https://itnext.io/microservice-testing-coupling-and-cohesion-all-the-way-down-a9f100cda523
End-to-end tests
Representative test data is often the weakest link
Understand your data “shape” and volume
Synthetic datastores/middleware
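One way to get a synthetic-but-realistic datastore for component tests is to run it as a throwaway container per test run. A hedged sketch using Testcontainers with JUnit 5 (my tooling choice here; the slide does not prescribe a specific library):

```java
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.PostgreSQLContainer;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

import static org.junit.jupiter.api.Assertions.assertTrue;

// Spins up a throwaway Postgres instance for the test run, so the
// component test exercises real SQL rather than an in-memory fake.
class CustomerRepositoryComponentTest {

    static PostgreSQLContainer<?> postgres =
            new PostgreSQLContainer<>("postgres:11-alpine");

    @BeforeAll
    static void startDatabase() {
        postgres.start();
    }

    @AfterAll
    static void stopDatabase() {
        postgres.stop();
    }

    @Test
    void canConnectToSyntheticDatastore() throws Exception {
        try (Connection conn = DriverManager.getConnection(
                postgres.getJdbcUrl(), postgres.getUsername(), postgres.getPassword());
             ResultSet rs = conn.createStatement().executeQuery("SELECT 1")) {
            assertTrue(rs.next());
        }
    }
}
```

The same pattern works for message brokers and other middleware: the dependency is real, but disposable, so the test data “shape” can be controlled per run.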
General strategies
• Test outside-in
• Acceptance tests for system and services
• “LUFD” the context and TDD the API
• Virtualise dependencies
• Test contracts of unstable APIs
• Invest in monitoring, synthetic transactions and chaos engineering (in that order)
https://itnext.io/microservice-testing-coupling-and-cohesion-all-the-way-down-a9f100cda523
Let’s look at some techniques in more depth
Contract
(Testing syntax)
So, where do contracts fit into this…
martinfowler.com/bliki/TestPyramid.html
Contract tests sit between tests focused on the overall system and tests focused on an individual service/function
API contracts
• APIs are service contracts
• Many are producer-driven
• It’s possible to design outside-in:
• Consumer-Driven Contracts
• martinfowler.com/articles/consumerDrivenContracts.html
CDC concepts
https://codefresh.io/docker-tutorial/how-to-test-microservice-integration-with-pact/
CDC workflow
1. The consumer writes a contract that defines an interaction with the API
   1. For HTTP RPC this is simply a request with acceptable parameters and the expected response
   2. Often the contract can be autogenerated from a test (see the sketch after this list)
2. The consumer issues a pull request to the producer containing the contract
3. The producer runs the SUT (via the pipeline) and checks whether the contract is valid
   1. If yes, simply accept the pull request
   2. If no, modify the SUT to meet the contract (this often involves inter-team communication), and then accept the pull request
4. The producer deploys (via the pipeline), and the consumer deploys (via the pipeline)
   1. Take care with backwards compatibility
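To make step 1 of the workflow concrete, here is a hedged sketch of a consumer-side Pact test in Java. The service names, endpoint and fields are invented for illustration, and the rule/annotation packages differ between pact-jvm versions, so treat it as the shape of such a test rather than copy-paste code:

```java
// Imports shown for pact-jvm 4.x; package names differ in older versions.
import au.com.dius.pact.consumer.dsl.PactDslJsonBody;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit.PactProviderRule;
import au.com.dius.pact.consumer.junit.PactVerification;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;
import org.junit.Rule;
import org.junit.Test;

import java.net.HttpURLConnection;
import java.net.URL;

import static org.junit.Assert.assertEquals;

// Consumer-side contract: "order-service" describes how it expects
// "user-service" to behave. Running the test generates a pact file that
// the producer later verifies in its own pipeline.
public class UserServiceConsumerPactTest {

    @Rule
    public PactProviderRule mockProvider =
            new PactProviderRule("user-service", this);

    @Pact(consumer = "order-service", provider = "user-service")
    public RequestResponsePact userExists(PactDslWithProvider builder) {
        return builder
                .given("user 42 exists")
                .uponReceiving("a request for user 42")
                    .path("/users/42")
                    .method("GET")
                .willRespondWith()
                    .status(200)
                    .body(new PactDslJsonBody()
                            .integerType("id", 42)
                            .stringType("name", "Ada"))
                .toPact();
    }

    @Test
    @PactVerification("user-service")
    public void consumerCanReadUser() throws Exception {
        // The rule starts a mock provider; the consumer code under test
        // would normally be pointed at mockProvider.getUrl().
        URL url = new URL(mockProvider.getUrl() + "/users/42");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        assertEquals(200, conn.getResponseCode());
    }
}
```

Running the test produces the pact (a JSON contract) file, which is what the consumer hands to the producer in step 2.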
CDC frameworks
docs.pact.io
cloud.spring.io/spring-cloud-contract
github.com/spring-cloud-samples/spring-cloud-contract-samples
CDC for messaging
• What about messaging?
• Message schemas are an API
• Pact supports AMQP contracts
• www.infoq.com/presentations/contracts-streaming-microservices
CDC for messaging
www.infoq.com/presentations/contracts-streaming-microservices
docs.confluent.io/current/schema-registry/docs/maven-plugin.html
Contract testing musings
• Great in organisations with low trust or poor communication between teams
• Contracts act as a cue for a conversation
• Can be used to implement TDD for the API
• Resource-intensive to create and maintain
API Simulation
(Testing semantics)
API simulation musings
• Great when a dependency is “expensive” to access or tricky to mock
• Useful when failure modes of dependency are hard to recreate
• Simulations can be fragile and/or complicated
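As an illustration of the second point, a hedged Java sketch using WireMock (my tool choice; the slides do not name a specific simulator) that recreates a failure mode which is awkward to trigger against the real dependency: a slow 503 response.

```java
import com.github.tomakehurst.wiremock.WireMockServer;

import static com.github.tomakehurst.wiremock.client.WireMock.aResponse;
import static com.github.tomakehurst.wiremock.client.WireMock.get;
import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;
import static com.github.tomakehurst.wiremock.core.WireMockConfiguration.options;

// Simulates a downstream "recommendations" API that responds slowly and
// with a 503, a failure mode that is hard to trigger on demand against
// the real service.
public class FlakyDependencySimulation {

    public static void main(String[] args) {
        WireMockServer wireMock = new WireMockServer(options().port(8089));
        wireMock.start();

        // Slow, failing response: a 2s delay followed by 503 Service Unavailable.
        wireMock.stubFor(get(urlEqualTo("/recommendations"))
                .willReturn(aResponse()
                        .withStatus(503)
                        .withFixedDelay(2000)
                        .withBody("{\"error\":\"downstream unavailable\"}")));

        // Point the service under test at http://localhost:8089, assert that it
        // times out and falls back gracefully, then call wireMock.stop().
    }
}
```

The service under test is then configured to use the simulator’s URL, so its timeout and fallback behaviour can be asserted deterministically.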
Fault injection
(Testing resilience)
When engineers hear the phrase “chaos engineering”
When non-technical folk hear the phrase “chaos engineering”
https://principlesofchaos.org/
https://chaostoolkit.org/
https://www.gremlin.com/ https://www.infoq.com/news/2018/10/gremlin-alfi
Chaos engineering prerequisites
Tammy Butow’s three prerequisites:
1. High severity incident management
2. Monitoring
3. Measure the impact of downtime
https://www.infoq.com/news/2018/03/resilient-systems-chaos-engineer
Chaos engineering musings
• Great for codifying/asserting system quality attributes
• Can prompt the team to think about monitoring and DR/BC
• Can cause a lot of damage if approached casually
Wrapping up
Conclusion
• Try to avoid microservice testing strategies that are solely “YOLO” or that seek absolute correctness
• Balance pre-prod (generally technology facing and supporting the team)
vs post-prod tests (generally business facing and critiquing the product)
• Contract testing, API simulation and chaos experimentation can be useful
techniques for microservice testing
Thanks for listening…
Twitter: @danielbryantuk
Email: daniel.bryant@tai-dev.co.uk
Writing: https://www.infoq.com/profile/Daniel-Bryant
Talks: https://www.youtube.com/playlist?list=PLoVYf_0qOYNeBmrpjuBOOAqJnQb3QAEtM
oreil.ly/2E63nCR

Jax London 2018: "Testing Microservices from Development to Production"

Editor's Notes

  • #13–14 We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that, from a testing point of view, almost all failures require only three or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes.