Testing in a distributed world

Fernando Mayo
CTO & co-founder, Undefined Labs
Testing in a distributed systems world

About me
Previously
Currently
@fernandomayo
fernando@undefinedlabs.com

Agenda
• Why testing?
• Why is testing microservices so hard?
• Let’s test in production!
• Using distributed context propagation

Debugging an issue in production using
observability is great
… but it should be our last resort!

Cost of solving an issue
• Detection cost
• Troubleshooting cost
• Fixing cost
• Verification cost
• User impact cost
• Engineering burnout cost
[Diagram: Developer, End user, cost]

Testing is about reducing the risk
of your application not performing as expected,
at the lowest possible cost
We want applications to be
robust, performant and correct

Testing monoliths
Pre-production:
Robust: test your few known failure modes
Performant: benchmark, load, stress tests
Correct: unit, integration, end-to-end tests
Production:
Monitoring to detect issues (error, latency)
Logging to troubleshoot them
[Diagram: UI, Backend, Database]

Testing microservices
Pre-production:
We test each individual service in isolation
Using mocks, contract tests, traffic replays…
Production:
Tracing to troubleshoot issues
Wide divergence

Testing in production
If we can no longer replicate production…
• # of services
• # of third-party APIs
• Data
• Traffic patterns
• Configuration
• Scale
• Release cadence
• OS kernel version
• Service mesh
• Network latency
• DNS records
• SSL certificates
• Backups
• Monitoring
• Serverless
…let’s test in production!

Testing in production
❌ It is NOT a replacement for pre-production testing
✅ It is another testing technique to add to your toolbox
✅ It requires engineering investment to do it right

Types of tests in production
• Integration testing
• End-to-end testing
• Shadowing/traffic mirroring
• Canary deployments
• Feature flags
• Chaos engineering
• API testing/Real user monitoring
[Diagram labels: Deploy, Release, Operate]

Risks of testing in production
• User impact
• State poisoning
• Traffic saturation
• Telemetry data skew
• Misfired alerts
The application needs to be
aware of tests being
performed in production

Test tenancy
End user
Tests
• Test label is propagated across services per-request
• Services and routing layer are aware of test tenancy

Risks of testing in production
• User impact → Test before releasing
• State poisoning → Separate writes to datastores
• Traffic saturation → Implement QoS based on test label
• Telemetry data skew → Mark telemetry with test label
• Misfired alerts → Exclude test telemetry from alerts

Context propagation allows developers to attach
arbitrary metadata to the current request
that will be propagated automatically
to all downstream dependencies

Context propagation
✅ It comes for “free” with tracing
✅ Developer-friendly API
✅ Read/write at any point in the request
✅ Compatible with threads and co-routines
✅ Compatible with multiple sync and async protocols
⚠ Increases request size

Context propagation
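OpenTracing: “Baggage”
• Definition: Key (string): Value (string)
• Serialization: via Tracer.Extract/Tracer.Inject
• Text-based format (e.g. HTTP): tracer-specific
• Binary format (e.g. gRPC): tracer-specific
OpenCensus: “Tags”
• Definition: TagKey (string): TagValue (string) + TagMetadata
• Serialization: via OpenCensus plugins
• Text-based format (e.g. HTTP): varies
• Binary format (e.g. gRPC): varies
OpenTelemetry: “DistributedContext”
• Definition: EntryKey (string): EntryValue (string) + EntryMetadata
• Serialization: via OpenTelemetry API
• Text-based format (e.g. HTTP): uses W3C Correlation Context
• Binary format (e.g. gRPC): own binary format similar to W3C Correlation Context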

Context propagation
Trace Context (W3C Candidate Recommendation)
traceparent: 00-0af7651916cd43dd8448eb211c80319c-00f067aa0ba902b7-01
• Fields: version, trace ID (16 bytes, 32 hex characters), parent span ID (8 bytes, 16 hex characters), trace flags
tracestate: rojo=00f067aa0ba902b7,congo=t61rcWkgMzE
• Entries: vendor ID = vendor-specific payload
Correlation Context (W3C Editor’s Draft)
Correlation-Context: tenancy=test;ttl=-1,user.id=3
• Entries: user-defined key = user-defined value, optionally followed by ;property
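The field layout of the traceparent header can be checked with a small parser. The sketch below handles version 00 only and is an illustration, not a replacement for a tracing library; the type and function names are hypothetical.

```go
package main

import (
	"encoding/hex"
	"errors"
	"strings"
)

// traceParent holds the four dash-separated fields of a version-00 header.
type traceParent struct {
	Version  byte
	TraceID  [16]byte // 32 hex characters on the wire
	ParentID [8]byte  // 16 hex characters on the wire
	Flags    byte
}

// Sampled reports whether the least significant flag bit is set.
func (tp traceParent) Sampled() bool { return tp.Flags&0x01 != 0 }

var errMalformed = errors.New("traceparent: malformed header")

func parseTraceParent(s string) (traceParent, error) {
	var tp traceParent
	parts := strings.Split(s, "-")
	// Version 00 has exactly four lowercase hex fields of fixed width.
	if len(parts) != 4 || s != strings.ToLower(s) {
		return tp, errMalformed
	}
	if len(parts[0]) != 2 || len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return tp, errMalformed
	}
	raw, err := hex.DecodeString(strings.Join(parts, ""))
	if err != nil {
		return tp, errMalformed
	}
	// raw is 1 + 16 + 8 + 1 = 26 bytes.
	tp.Version = raw[0]
	copy(tp.TraceID[:], raw[1:17])
	copy(tp.ParentID[:], raw[17:25])
	tp.Flags = raw[25]
	if tp.Version != 0 {
		return tp, errors.New("traceparent: unsupported version")
	}
	// All-zero IDs are invalid.
	if tp.TraceID == ([16]byte{}) || tp.ParentID == ([8]byte{}) {
		return tp, errMalformed
	}
	return tp, nil
}
```

Parsing the example header from this slide yields version 0, a 16-byte trace ID, an 8-byte parent span ID, and the sampled flag set.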

Context propagation
Already used for:
• Tracing information
• Sampling information
But we can also use it for:
• Test traffic label
• Fault injection instructions
• User account information
• Feature flags

Example: integration testing
[Diagram: End user and Tests → Proxy → svc v1 (×3), svc v2 → Downstream services]
Correlation-Context: tenancy=test

Instrumenting tests
With OpenTracing:
func TestIntegration(t *testing.T) {
	span, ctx := opentracing.StartSpanFromContext(context.Background(), t.Name())
	defer span.Finish()
	// Baggage items travel with the request to all downstream services.
	span.SetBaggageItem("tenancy", "test")
	// ... (use ctx for the calls made by the test)
}
With OpenTelemetry:
func TestIntegration(t *testing.T) {
	tracer := global.TraceProvider().GetTracer("")
	// Attach the test label to the distributed context before starting the span.
	ctx := distributedcontext.NewContext(context.Background(), key.String("tenancy", "test"))
	ctx, span := tracer.Start(ctx, t.Name())
	defer span.End()
	// ... (use ctx for the calls made by the test)
}

Managing state
• Multi-tenant service, multi-tenant datastore
• Multi-tenant service, single-tenant datastores
• Single-tenant services, single-tenant datastores
• Single-tenant services, multi-tenant datastore
[Diagram: End users and Tests traffic flowing through each of the four configurations]

Managing telemetry data
With OpenTelemetry:
// Init measure
meter := global.MeterProvider().GetMeter("")
tenancyKey := key.New("tenancy")
measure := meter.NewInt64Measure("myMeasure", metric.WithKeys(tenancyKey))
// Extract tenancy from distributed context
var labels []core.KeyValue
if tenancyValue, ok := distributedcontext.FromContext(ctx).Value("tenancy"); ok {
	labels = append(labels, core.KeyValue{Key: tenancyKey, Value: tenancyValue})
}
// Attach labels to measurement
measure.Record(ctx, 123, meter.Labels(labels...))

Managing other side effects
With OpenTelemetry:
// Check if current request belongs to test tenancy
func inTesting(ctx context.Context) bool {
	value, ok := distributedcontext.FromContext(ctx).Value("tenancy")
	return ok && value == core.String("test")
}
Examples:
• Implementing multi-tenant storage
• Using sandbox accounts from third-party services

Example: shadow e2e testing
[Diagram: End user and Tests → Proxy → svc v1 (×3), svc v2 → Downstream services]
Correlation-Context: tenancy=test,svc.target=v2

Example: fault injection testing
[Diagram: End user and Tests → Proxy → svc v1 (×3) → Downstream services, with a delay 🕘 injected for test traffic]
Correlation-Context: tenancy=test,svc.fault.http.delay=10s

Example: feature flagging
[Diagram: End user and Tests → Proxy → svc v1 (×3) → Downstream services]
Correlation-Context: tenancy=test,svc.feature1.enabled=true

Example: test accounts
[Diagram: End user → Proxy → svc v1 (×3) → Downstream services, with an Auth service]
Correlation-Context: tenancy=test,user.kind=test

Consequences
• No need to replicate the entire application stack anywhere
• Locally, on CI or staging
• Segregated telemetry allows us to monitor and troubleshoot tests
• Same tools and visibility as with production traffic
• We can add other types of traffic to our application
• Examples: “sandbox”, “development” traffic

Key takeaways
• Let’s catch issues as early as possible through proactive testing
• Testing in production can be the most efficient way to test complex
systems
• We should design our applications to allow safely testing in production
• Let’s make use of context propagation and make the most of our
observability instrumentation

Thank you!
fernando@undefinedlabs.com
@fernandomayo
