Diffy
Automatic Testing of Microservices @Twitter
Puneet Khanduri, Arun Kejariwal
(@pzdk, @arun_kejariwal)
Oct 8, 2014
Twitter, Inc. Down 2% Due To Broken Signup
Oct 8, 2014
Twitter, Inc. NOT Down 2% Due To NOT Broken Signup
“I just refactored a critical part of my service. How do I know I didn’t break anything?”
- Every Service Developer @ Twitter
“They just refactored a critical part of their service. How do I know they didn’t break anything?”
- Every Site Reliability Engineer @ Twitter
Tier #0 - Unit Tests
Cost
Writing good tests takes ~1.5x of development time
Limited Scope
Testing classes/methods in isolation
High Coverage % per Test
e.g. A method has 5 independent code paths
=> 1 unit test yields 20% coverage
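To make the coverage arithmetic concrete, here is a minimal Scala sketch (hypothetical method and test, not from the deck): a method with five independent code paths, of which a single unit test exercises exactly one.

// Hypothetical method with 5 independent code paths.
def classify(x: Int): String =
  if (x < 0) "negative"
  else if (x == 0) "zero"
  else if (x < 10) "small"
  else if (x < 100) "medium"
  else "large"

// One unit test covers exactly 1 of the 5 paths => 20% coverage.
assert(classify(-1) == "negative")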
Tier #1 - Component Tests
Testing a service in isolation with a fully mocked environment.
Cost of a single test
Same as unit tests
Low Coverage % per Test
Cyclomatic complexity is O(k^n) - impractical to target 100%
Handpicked test cases
e.g. A request path has 6 methods with 5 paths per method
=> 1 test yields 0.03% coverage
Tier #2 - Integration Tests
Testing a service and its downstream dependencies in a real (staging) environment.
Cost
Same as unit tests + amortized cost of a staging environment
Negligible Coverage % per Test
Much less than component tests
e.g. A request path has 4 services, 6 methods/service, 5 paths/method
The Emerging Pattern
Super-exponential cost of coverage
Diffy Approach
Higher coverage for free
Diffy Approach
Free test inputs
Sample production traffic or whatever traffic source you prefer
Free assertions
Use “known good” versions of your code to generate assertions
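As a minimal sketch of the idea (hypothetical types and stubs, not Diffy's actual API): replay a sampled production request against both a known-good build and the candidate build, and treat the known-good response as the expected value.

case class Request(path: String, body: String)
case class Response(status: Int, body: String)

// Stub stand-ins for the two deployments being compared.
def knownGood(req: Request): Response = Response(200, req.body.toUpperCase)  // last good release
def candidate(req: Request): Response = Response(200, req.body.toUpperCase)  // code under test

// Free test input: a sampled production request.
// Free assertion: the known-good response is the expected value.
def check(req: Request): Boolean = candidate(req) == knownGood(req)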
What about the noise?
Server generated timestamps
Random number generators
Downstream non-determinism
Race conditions
Diffy Topology
[Diagram: sampled production traffic is multicast to three instances - the candidate (new code) plus the primary and secondary (two instances of the known-good code). Candidate vs. primary yields raw differences; secondary vs. primary yields non-deterministic noise; removing the noise from the raw differences yields the filtered differences.]
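A sketch of the noise-filtering step under stated assumptions (responses flattened to field/value maps, hypothetical helper names): any field that differs between primary and secondary - two instances of the same known-good code - must be non-deterministic, so it is excluded from the candidate-vs-primary differences.

type Fields = Map[String, String]

// Fields whose values disagree between two responses.
def differingFields(a: Fields, b: Fields): Set[String] =
  (a.keySet ++ b.keySet).filter(k => a.get(k) != b.get(k))

def filteredDifferences(primary: Fields, secondary: Fields, candidate: Fields): Set[String] = {
  val raw   = differingFields(candidate, primary)  // raw differences
  val noise = differingFields(secondary, primary)  // non-deterministic noise
  raw -- noise                                     // filtered differences
}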
Automation
Compare latest in master against last deploy to production
Automatically deploy master as candidate
Automatically deploy prod tag as primary and secondary
Automation (contd.)
Reporting
Diffy e-mails a report with highlighted critical endpoints and fields
Sample requests and responses available for further analysis
Performance Regression
Why is it challenging?
Software
New release
Hardware performance
Uncontrolled parameter
Makes robust analysis challenging
Large variability across nodes
Performance Regression: Diffy Approach
Observation
All target service instances see identical load
Key Idea
Discover all performance metrics (thousands of time series)
Compare reference instances to test instances
Report metrics with significant deviations
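A sketch of the comparison step under stated assumptions (hypothetical representation of the discovered metrics): compare every metric observed on both clusters and report the ones whose test-side mean deviates from the reference-side mean by more than a relative threshold.

type Metrics = Map[String, Seq[Double]]  // metric name -> samples

def mean(xs: Seq[Double]): Double = xs.sum / xs.size

// Report metrics deviating by more than `threshold` (e.g. 0.1 = 10%).
def deviatingMetrics(reference: Metrics, test: Metrics, threshold: Double): Set[String] =
  reference.keySet.intersect(test.keySet).filter { name =>
    val r = mean(reference(name))
    val t = mean(test(name))
    r != 0.0 && math.abs(t - r) / math.abs(r) > threshold
  }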
Performance Regression (contd.)
Visual analysis: Error prone
[Plot: example time series illustrating a false negative in visual analysis]
Common Statistical Methods
Welch’s t-Test
Two sample test
H0: Means of two populations are equal
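For reference, Welch's two-sample statistic, with sample means \bar{x}_1, \bar{x}_2, sample variances s_1^2, s_2^2, and sample sizes n_1, n_2:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}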
Common Statistical Methods (contd.)
F-Test
H0: Means of a set of populations are equal
Two groups
F = t^2, where t is Student's t statistic
Assumptions
Normally distributed populations [1]
Equal variance (Homoscedastic)
Independent samples
[1] M. L. Tiku, "Power Function of the F-Test Under Non-Normal Situations", Journal of the American Statistical Association, Vol. 66, No. 336 (Dec. 1971), pp. 913-916.
Other Previous Work
Similarity based
Match count
Longest subsequence based
Clustering
k-Means, phased k-Means
EM
Dynamic clustering
k-Medoids
Single linkage clustering
PCA, SVM
Diffy Performance Topology
[Diagram: sampled production traffic drives both a reference cluster and a test cluster; Diffy compares the discovered metrics from the two clusters and a classifier labels each metric PASSED, IGNORED, or FAILED.]
Classifiers
Sample count
Minimum number of samples
Relative Threshold
Variance within reference vs. distance between reference and test
Absolute Threshold
Distance between reference and test vs. median of reference
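A minimal Scala sketch of the two threshold checks (hypothetical signatures and parameters, not Diffy's actual classifier API):

def mean(xs: Seq[Double]): Double = xs.sum / xs.size
def variance(xs: Seq[Double]): Double = {
  val m = mean(xs)
  xs.map(x => (x - m) * (x - m)).sum / xs.size
}
def median(xs: Seq[Double]): Double = {
  val s = xs.sorted
  if (s.size % 2 == 1) s(s.size / 2)
  else (s(s.size / 2 - 1) + s(s.size / 2)) / 2
}

// Relative: distance between reference and test means, measured against
// the spread (standard deviation) within the reference samples.
def relativeExceeded(ref: Seq[Double], test: Seq[Double], k: Double): Boolean =
  math.abs(mean(ref) - mean(test)) > k * math.sqrt(variance(ref))

// Absolute: distance between reference and test means, measured against
// the median of the reference samples.
def absoluteExceeded(ref: Seq[Double], test: Seq[Double], frac: Double): Boolean =
  math.abs(mean(ref) - mean(test)) > frac * math.abs(median(ref))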
Classifiers (contd.)
MAD
Median Absolute Deviation
Robust Statistic
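MAD is median(|x_i - median(x)|), a dispersion estimate that, unlike the variance, is not dominated by a few outliers. A sketch of using it to flag a test observation (hypothetical cutoff value):

def median(xs: Seq[Double]): Double = {
  val s = xs.sorted
  if (s.size % 2 == 1) s(s.size / 2)
  else (s(s.size / 2 - 1) + s(s.size / 2)) / 2
}

// Median Absolute Deviation of the samples.
def mad(xs: Seq[Double]): Double = {
  val m = median(xs)
  median(xs.map(x => math.abs(x - m)))
}

// Flag a test observation far from the reference, in MAD units.
def isOutlier(ref: Seq[Double], test: Double, cutoff: Double = 3.0): Boolean =
  math.abs(test - median(ref)) > cutoff * mad(ref)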
Classifiers (contd.)
Ensemble of Composable Classifiers
val classifier = {
  SampleCountClassifier(40) and (
    RelativeThresholdClassifier(50, 0.1) or
    AbsoluteThresholdClassifier(50, 0.1) or
    MadClassifier
  )
}
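The and/or composition above can be modeled with a small combinator trait; a sketch of the shape (hypothetical, not Diffy's actual internals):

trait Classifier { self =>
  def passes(ref: Seq[Double], test: Seq[Double]): Boolean

  // A metric passes the combined classifier only if it passes both.
  def and(other: Classifier): Classifier = new Classifier {
    def passes(r: Seq[Double], t: Seq[Double]): Boolean =
      self.passes(r, t) && other.passes(r, t)
  }
  // A metric passes the combined classifier if it passes either one.
  def or(other: Classifier): Classifier = new Classifier {
    def passes(r: Seq[Double], t: Seq[Double]): Boolean =
      self.passes(r, t) || other.passes(r, t)
  }
}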
DEMO
Open Source (@diffyproject)
Github
https://github.com/twitter/diffy
Blog
https://blog.twitter.com/2015/diffy-testing-services-without-writing-tests