Testing Distributed Systems in Production.
Paul Bakker (@pbakker)
Edge Developer Productivity @ Netflix
Contents.
Types of testing for microservices
Simulation testing
Simone example
Simulation testing architecture
Client libs vs gRPC
[Diagram: Edge (Zuul, Device API scripts; Edge Developer Experience) and microservices (DRM, Playback, Streaming logs, Content), connected over REST and gRPC]
Types of testing
Within each service
Unit / Integration / Smoke testing
Functional tests in a test environment
Performance / squeeze testing
Shadow traffic / Replay
Canary
Simulations in production
Infra
Failure Injection Testing
Chaos engineering
Failover/Evacuation exercises
Wait, aren’t microservices supposed to be easy!?
Test vs Prod environment
Two separate AWS environments
Test closely mimics Prod
Smaller capacity
Data is different (caches, customer data, etc.)
In-process smoke tests.
Bootstraps the service from a JUnit test
All dependencies are real (test environment)
Tests hit the in-process server over HTTP or gRPC
Service owners should spend a lot of time here
In-process smoke tests.
Service to test:

@Singleton
@Path("v1/hello")
public final class MemeTestingExamplesResource {
    /** Where we keep our greetings for each user. */
    private final MemeTestingExamplesDao memeTestingExamplesDao;
    private final MemeTestingExamplesConfig config;

    @Inject
    public MemeTestingExamplesResource(
            MemeTestingExamplesDao memeTestingExamplesDao,
            MemeTestingExamplesConfig config,
            Registry registry) {
        this.memeTestingExamplesDao = memeTestingExamplesDao;
        this.config = config;
    }

    /** Returns the stored greeting for the user, or a default anonymous greeting. */
    @Path("{user}")
    @GET
    @Produces({MediaType.APPLICATION_JSON})
    public Greeting getGreeting(@PathParam("user") String userEmail) {
        return memeTestingExamplesDao.loadGreeting(userEmail).orElseGet(() -> {
            Greeting anonymousGreeting = new Greeting();
            anonymousGreeting.setUserEmail(userEmail);
            anonymousGreeting.setFirstName(config.getDefaultAnonymousName());
            anonymousGreeting.setMessage(config.getDefaultGreeting());
            return anonymousGreeting;
        });
    }

    /** Stores the posted greeting. */
    @POST
    @Produces({MediaType.APPLICATION_JSON})
    @Consumes({MediaType.APPLICATION_JSON})
    public String setGreeting(Greeting greeting) {
        memeTestingExamplesDao.storeGreeting(greeting);
        return "OK";
    }
}

Test:

@RunWith(GovernatorJunit4ClassRunner.class)
@ModulesForTesting({MemeTestingExamplesModule.class, Archaius2JettyModule.class})
// Port 0 asks for a free ephemeral port, which is injected below.
@TestPropertyOverride(value = {"governator.jetty.embedded.port=0"},
        propertyFiles = {"laptop.properties"})
public class SmokeTest {
    @Inject
    @Named("embeddedJettyPort")
    private int ephemeralPort;

    @Test
    public void testRestEndpoint() {
        given().port(ephemeralPort).log().ifValidationFails()
            .when()
                .get("/REST/v1/hello/n@n.com")
            .then()
                .assertThat().statusCode(200)
            .and()
                .body("userEmail", equalTo("n@n.com"));
    }
}
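The same idea can be tried without Governator or Jetty. The following is a minimal sketch, assuming only the JDK's built-in `com.sun.net.httpserver.HttpServer`; the class `InProcessSmokeSketch` and its methods are hypothetical and not part of the talk. It boots a tiny greeting endpoint in-process on port 0, calls it over real HTTP, and shuts it down again.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Framework-free sketch of an in-process smoke test (all names are hypothetical). */
class InProcessSmokeSketch {

    private static final String PREFIX = "/REST/v1/hello/";

    /** Starts a tiny greeting endpoint; port 0 lets the OS pick a free ephemeral port. */
    static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext(PREFIX, exchange -> {
            String user = exchange.getRequestURI().getPath().substring(PREFIX.length());
            byte[] body = ("{\"userEmail\":\"" + user + "\"}").getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    /** Calls the in-process server over real HTTP and returns "<status> <body>". */
    static String get(int port, String path) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                URI.create("http://127.0.0.1:" + port + path).toURL().openConnection();
        try {
            int status = conn.getResponseCode();
            String body = new String(conn.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            return status + " " + body;
        } finally {
            conn.disconnect();
        }
    }

    /** Boots the server, runs one smoke check, and always shuts the server down. */
    static String runSmokeCheck() {
        try {
            HttpServer server = startServer();
            try {
                return get(server.getAddress().getPort(), PREFIX + "n@n.com");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real smoke test the handler would be the actual service wired to its real test-environment dependencies; only the bootstrapping pattern is illustrated here.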
In-process smoke tests vs deployments.
Faster to run
Fewer moving parts (Jenkins, Spinnaker, AWS etc.) => less flakiness
Options for mocking some dependencies
Functional and squeeze tests
Validate service from perspective of service consumers
Deploy to a test cluster
Create a reference application that integrates with your service, using your client lib
Run tests against the reference application
[Diagram: Tests -> Reference App -> Service, each hop over HTTP / gRPC]
Shadow traffic
[Diagram: Zuul sends 100% of traffic to the Prod cluster and 100% to the Shadow cluster]
Send all traffic to a second cluster
Shadow cluster drops responses, but generates metrics
Good for performance testing, and high level correctness (e.g. error rate)
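The mirroring behavior can be sketched in a few lines. This is an illustrative model only, not how Zuul implements it: the class `ShadowTrafficSketch` is hypothetical, clusters are modeled as functions from a request to a status code, and the mirror call is made synchronously for determinism, whereas a real proxy would mirror off the request thread.

```java
import java.util.function.Function;

/** Sketch of shadow traffic: every request goes to both clusters, only prod answers the caller. */
class ShadowTrafficSketch {
    private final Function<String, Integer> prod;    // request -> HTTP-like status code
    private final Function<String, Integer> shadow;
    private int shadowRequests;
    private int shadowErrors;

    ShadowTrafficSketch(Function<String, Integer> prod, Function<String, Integer> shadow) {
        this.prod = prod;
        this.shadow = shadow;
    }

    /** Serves the caller from prod, then mirrors the same request to the shadow cluster. */
    int handle(String request) {
        int response = prod.apply(request);
        mirror(request);
        return response;
    }

    /** The shadow response is dropped; only metrics are kept, and failures never reach the caller. */
    private void mirror(String request) {
        shadowRequests++;
        try {
            int status = shadow.apply(request);
            if (status >= 500) {
                shadowErrors++;
            }
        } catch (RuntimeException e) {
            shadowErrors++;
        }
    }

    int shadowRequests() { return shadowRequests; }

    int shadowErrors() { return shadowErrors; }

    double shadowErrorRate() {
        return shadowRequests == 0 ? 0.0 : (double) shadowErrors / shadowRequests;
    }
}
```

Comparing `shadowErrorRate()` with the production error rate gives the high-level correctness signal mentioned above.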
Canaries
[Diagram: Zuul sends 90% of traffic to the Baseline cluster and 10% to the Canary cluster]
Send a percentage of traffic to a second cluster
Track metrics to find issues compared to baseline
This affects production traffic!
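A canary split and its comparison against the baseline might look roughly like this. The sketch assumes a deterministic split on a request sequence number and a simple error-rate tolerance; the class `CanarySketch`, its method names, and the tolerance rule are hypothetical, and real canary analysis typically compares many more metrics.

```java
/** Sketch of canary routing plus a naive baseline comparison (all names hypothetical). */
class CanarySketch {
    enum Cluster { BASELINE, CANARY }

    private final int canaryPercent;
    private final long[] requests = new long[Cluster.values().length];
    private final long[] errors = new long[Cluster.values().length];

    CanarySketch(int canaryPercent) {
        if (canaryPercent < 0 || canaryPercent > 100) {
            throw new IllegalArgumentException("canaryPercent must be between 0 and 100");
        }
        this.canaryPercent = canaryPercent;
    }

    /** Sends canaryPercent out of every 100 consecutive request numbers to the canary. */
    Cluster route(long requestNumber) {
        return Math.floorMod(requestNumber, 100L) < canaryPercent ? Cluster.CANARY : Cluster.BASELINE;
    }

    /** Records the outcome of one request on the cluster that served it. */
    void record(Cluster cluster, boolean error) {
        requests[cluster.ordinal()]++;
        if (error) {
            errors[cluster.ordinal()]++;
        }
    }

    double errorRate(Cluster cluster) {
        long total = requests[cluster.ordinal()];
        return total == 0 ? 0.0 : (double) errors[cluster.ordinal()] / total;
    }

    /** The canary passes if its error rate is within the tolerance of the baseline's. */
    boolean canaryLooksHealthy(double tolerance) {
        return errorRate(Cluster.CANARY) <= errorRate(Cluster.BASELINE) + tolerance;
    }
}
```

Because the canary serves real users, a failing comparison should lead to a quick rollback of the canary cluster.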
Chaos and Failure Injection Testing
Test failover and fallback scenarios
Introduce failures or latency in the HTTP/gRPC layer
Terminate instances
Datastore failure/latency
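To make the failure and latency injection concrete, here is a minimal sketch that wraps a remote call and checks that the caller's fallback kicks in. It does not reflect Netflix's actual Failure Injection Testing tooling; `FailureInjectionSketch`, `Fault`, `inject`, and `withFallback` are hypothetical names.

```java
import java.util.function.Supplier;

/** Sketch of injecting failures or latency around a call (all names hypothetical). */
class FailureInjectionSketch {
    enum Fault { NONE, FAIL, DELAY }

    /** Wraps a call so that it fails, is delayed, or runs normally depending on the fault. */
    static <T> Supplier<T> inject(Supplier<T> call, Fault fault, long delayMillis) {
        return () -> {
            if (fault == Fault.FAIL) {
                throw new IllegalStateException("injected failure");
            }
            if (fault == Fault.DELAY) {
                sleep(delayMillis);
            }
            return call.get();
        };
    }

    /** The behavior under test: the caller falls back when the dependency fails. */
    static <T> T withFallback(Supplier<T> call, T fallback) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            return fallback;
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

For example, `withFallback(inject(loadLicense, Fault.FAIL, 0), cachedLicense)` exercises the fallback path without touching the real dependency, where `loadLicense` and `cachedLicense` stand for whatever the service would normally use.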
Region failover exercises
Exercises the capability to fail out of a region
E.g. what happens if an AWS region goes dark?
Commonly exposes (minor) issues
Make sure you can recover quickly!
https://www.techrepublic.com/article/aws-outage-how-netflix-weathered-the-storm-by-preparing-for-the-worst/
Simulation Testing.
Simulation Testing
End to end tests
Device certification
Content validation (e.g. fixes to subtitles)
The device is unaware of any simulations
How to externally trigger special behavior in a service?
Simone architecture
Components: Test script, Simone Server (/REST/variants, /REST/insights), Kafka, Micro service (embedding simone-client), API
Storage: Cassandra, ElasticSearch, Dynomite (variant storage and insights)
1. Create Variant
2. Publish Variant to clients
3. Receive Variant
4. Start test
5. Check variants
6. Consume
7. Verify variant insights
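The numbered flow in the architecture diagram can be modeled in memory to see how the pieces interact. This is a toy sketch under stated assumptions: Kafka is replaced by a plain loop over subscribers, storage by in-memory collections, and the mapping of steps to methods is an interpretation of the diagram. `SimoneFlowSketch`, `Server`, `Client`, and all method names are hypothetical and are not the real Simone API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Supplier;

/** In-memory model of the variant flow (all names hypothetical, not the real Simone API). */
class SimoneFlowSketch {

    /** Stand-in for the server: accepts variants, publishes them, collects insights. */
    static class Server {
        private final List<Client> subscribers = new ArrayList<>();
        private final List<String> insights = new ArrayList<>();

        void subscribe(Client client) { subscribers.add(client); }

        /** Steps 1 and 2: create a variant and publish it to every client. */
        void createVariant(String template, String esn, String domainData) {
            for (Client client : subscribers) {
                client.receive(template, esn, domainData);
            }
        }

        void recordInsight(String insight) { insights.add(insight); }

        /** Step 7: the test script reads back the insights to verify the variant was applied. */
        List<String> insights() { return insights; }
    }

    /** Stand-in for the client library embedded in a microservice. */
    static class Client {
        private final Server server;
        private final Map<String, String> variants = new ConcurrentHashMap<>();

        Client(Server server) { this.server = server; }

        /** Step 3: receive a published variant and keep it locally. */
        void receive(String template, String esn, String domainData) {
            variants.put(template + "|" + esn, domainData);
        }

        /** Step 5: check for a matching variant; run the simulation or the default path. */
        <T> T execute(String template, String esn,
                      Supplier<T> defaultHandler, Function<String, T> simulationHandler) {
            String domainData = variants.get(template + "|" + esn);
            if (domainData == null) {
                return defaultHandler.get();
            }
            server.recordInsight(template + " applied for " + esn);
            return simulationHandler.apply(domainData);
        }
    }
}
```

A test script in this model creates a variant for one device ESN, drives a request for that ESN through the service (step 4), and then checks `insights()`; requests for any other ESN take the default path and the device never knows a simulation exists.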
Simone Demo.
Does a device behave correctly with different bit rates?
Bitrates are adaptive: they change depending on bandwidth
Who needs a test environment… when you can just test in prod!?
Why run in prod!?
Devices are unaware of tests
They only know the real thing
They can’t access internal systems
Also, caching is hard…
Other examples
Testing license failures
Testing “too many devices”
Testing CDN overrides
Forcing API errors
Simone client example
SimulationResponse response = simoneClient.execute(
        // Type of simulation to run
        "com.netflix.okja.regressiontests.TestTemplate",
        // How to trigger the simulation, e.g. by customerId, esn, viewableId…
        Trigger.esnTrigger(request.getEsn()),
        // User for this request (can we pre-check for test users!?)
        getPassport(request.getCustomerId(), request.getEsn()),
        // Default handler: no simulation is found
        () -> SimulationResponse.newBuilder().setVariantApplied(false).setMessage("No variant found").build(),
        // Simulation handler
        (ctx) -> SimulationResponse.newBuilder().setVariantApplied(true).setMessage("Variant applied").build(),
        // Post handler: apply extra logging for insights
        (ctx) -> ctx.setDomainData(request.getDomainDataJson())
);
Get out of the critical path!
We’re embedded in most tier 1 services
What if we mess up?
Have really aggressive gRPC timeouts
Circuit breakers
Pre-checks if we should run at all
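The three safeguards can be combined in one small wrapper. The following is a simplified sketch, assuming a consecutive-failure circuit breaker that stays open until it is reset manually and a timeout enforced with a `Future`; `CriticalPathGuardSketch` and its methods are hypothetical, and a production breaker would normally also have a half-open state.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Sketch of keeping an embedded client out of the critical path (all names hypothetical). */
class CriticalPathGuardSketch {
    private final ExecutorService executor = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable);
        thread.setDaemon(true); // never keep the host service alive
        return thread;
    });
    private final long timeoutMillis;
    private final int failureThreshold;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    CriticalPathGuardSketch(long timeoutMillis, int failureThreshold) {
        this.timeoutMillis = timeoutMillis;
        this.failureThreshold = failureThreshold;
    }

    /** Runs the lookup only if the pre-check passed and the breaker is closed. */
    <T> T call(boolean preCheckPassed, Supplier<T> lookup, T fallback) {
        if (!preCheckPassed || isOpen()) {
            return fallback; // skip the remote call entirely
        }
        Callable<T> task = lookup::get;
        Future<T> future = executor.submit(task);
        try {
            T result = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            consecutiveFailures.set(0);
            return result;
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true); // aggressive timeout: give up and interrupt the lookup
            consecutiveFailures.incrementAndGet();
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    /** The breaker opens after failureThreshold consecutive failures or timeouts. */
    boolean isOpen() { return consecutiveFailures.get() >= failureThreshold; }

    /** Closes the breaker again; a real breaker would usually do this after a cool-down. */
    void reset() { consecutiveFailures.set(0); }
}
```

The host service always gets an answer within roughly `timeoutMillis`, and once the breaker opens it stops paying even that cost.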
Thank you.
Paul Bakker
@pbakker
