Agile development has become the norm. Though it fosters faster product development cycles, it often results in a higher number of functional and/or performance regressions. In an SOA setting such as Twitter's, such regressions may cascade from one service to others. Detecting these regressions manually is not practically feasible in light of the hundreds of services and the tens of thousands of metrics each service collects. To this end, we developed a novel tool called Diffy to automatically detect such regressions.
The key highlights of the talk are the following:
A simple yet effective approach for detecting functional regressions. False positives are minimized via statistical analysis of metrics obtained from a <primary, secondary, candidate> tuple of nodes, where the same traffic is sent to each node.
An ensemble approach for detecting performance regressions. The need for an ensemble of classifiers stems from the multifaceted characteristics of the performance data. To minimize the impact of hardware performance variability across nodes, we used two clusters, corresponding to the release candidate and the production code, instead of a tuple of nodes. The approach is robust against the presence of anomalies in the performance data.
The proposed techniques work well with per-minute data. Diffy has been in production use by multiple services at Twitter, and has been baked into the continuous build process so as to proactively detect functional and/or performance regressions.
We shall take the audience through how the techniques are being used at Twitter with REAL data.
4. “I just refactored a critical part of my service. How do I know I didn’t break anything?”
- Every Service Developer @ Twitter
5. “They just refactored a critical part of their service. How do I know they didn’t break anything?”
- Every Site Reliability Engineer @ Twitter
6. Tier #0 - Unit Tests
Cost
Writing good tests takes ~1.5x of development time
Limited Scope
Testing classes/methods in isolation
High Coverage % per Test
e.g. A method has 5 independent code paths
=> 1 test yields 20% coverage
7. Tier #1 - Component Tests
Testing a service in isolation with a fully mocked environment.
Cost of a single test
Same as unit tests
Low Coverage % per test
Cyclomatic complexity is O(k^n) - impractical to target 100%
Handpicked test cases
e.g. A request path has 6 methods with 5 paths per method
=> 1 test = 0.03% coverage
9. Tier #2 - Integration Tests
Testing a service and its downstream dependencies in a real (staging) environment
Cost
Same as unit tests + amortized cost of a staging environment
Negligible coverage per test
Much less than component tests
e.g. A request path has 4 services, 6 methods/service, 5 paths/method
12. Diffy Approach
Free test inputs
Sample production traffic or whatever traffic source you prefer
Free assertions
Use “known good” versions of your code to generate assertions
13. What about the noise?
Server generated timestamps
Random number generators
Downstream non-determinism
Race conditions
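The noise sources above are the reason Diffy runs two known-good instances side by side. A minimal Python sketch of that noise-canceling idea, assuming flat-dict responses; the function names and response format are illustrative, not Diffy's actual (Scala) API:

```python
# Illustrative sketch of noise-canceling diffs (not Diffy's actual API).
# Each sampled request is multicast to three instances:
#   primary, secondary -> both run the known-good (production) code
#   candidate          -> runs the release candidate
# Fields that differ between primary and secondary are nondeterministic
# (timestamps, random numbers, downstream races), so primary-vs-candidate
# differences on those fields are filtered out as noise.

def diff_fields(a: dict, b: dict) -> set:
    """Return the set of top-level fields whose values differ."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}

def regressions(primary: dict, secondary: dict, candidate: dict) -> set:
    noise = diff_fields(primary, secondary)   # nondeterministic fields
    raw = diff_fields(primary, candidate)     # all observed differences
    return raw - noise                        # differences blamed on new code

primary   = {"user": "ada", "ts": 1001, "score": 7}
secondary = {"user": "ada", "ts": 1002, "score": 7}   # ts is noisy
candidate = {"user": "ada", "ts": 1003, "score": 9}   # score changed

print(regressions(primary, secondary, candidate))  # {'score'}
```

The "free assertions" from slide 12 fall out of this: the primary/secondary pair generates both the expected values and the list of fields to ignore.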
16. Automation
Compare latest in master against last deploy to production
Automatically deploy master as candidate
Automatically deploy prod tag as primary and secondary
19. Performance Regression
Why is it challenging?
Software
New release
Hardware performance
Uncontrolled parameter
Makes robust analysis challenging
Large variability across nodes
20. Performance Regression: Diffy Approach
Observation
All target service instances see identical load
Key Idea
Discover all performance metrics (thousands of time series)
Compare reference instances to test instances
Report metrics with significant deviations
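One way to sketch that per-metric comparison step in Python; the metric names, the MAD-based deviation rule, and the threshold are illustrative assumptions, not Diffy's actual classifier:

```python
# Hypothetical sketch: for each metric, compare samples from the reference
# cluster (production code) against the test cluster (release candidate)
# and flag metrics whose median shifts far beyond the reference spread.
import statistics

def mad(xs):
    """Median absolute deviation: a robust measure of spread."""
    m = statistics.median(xs)
    return statistics.median(abs(x - m) for x in xs)

def deviating_metrics(reference, test, k=3.0):
    """reference/test: {metric_name: [samples]}. Flag large median shifts."""
    flagged = []
    for name, ref in reference.items():
        shift = abs(statistics.median(test[name]) - statistics.median(ref))
        spread = mad(ref) or 1e-9          # guard against zero spread
        if shift / spread > k:
            flagged.append(name)
    return flagged

reference = {"p99_latency_ms": [50, 52, 51, 49, 50], "qps": [1000, 1005, 995]}
test      = {"p99_latency_ms": [80, 82, 79, 81, 80], "qps": [1001, 998, 1003]}
print(deviating_metrics(reference, test))  # ['p99_latency_ms']
```

Using medians and MAD rather than means and standard deviations is one way to get the robustness to anomalies that the abstract mentions.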
23. Common Statistical Methods (contd.)
F-Test
H0: Means of a set of populations are equal
Two groups
F = t^2, where t is Student’s statistic
Assumptions
Normally distributed populations [1]
Equal variance (Homoscedastic)
Independent samples
[1] M. L. Tiku, “Power Function of the F-Test Under Non-Normal Situations”, Journal of the American Statistical Association, Vol. 66, No. 336 (Dec. 1971), pp. 913-916.
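The F = t^2 identity for two groups can be checked numerically. A stdlib-only sketch with made-up samples; the helper functions are illustrative, not from the talk:

```python
# Numerical check of the slide's claim F = t^2 for two groups, using only
# the standard library: pooled-variance Student's t and one-way ANOVA F.
import statistics

def t_statistic(a, b):
    """Two-sample Student's t with pooled variance (equal-variance case)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * statistics.variance(a) +
           (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

def f_statistic(a, b):
    """One-way ANOVA F for two groups: between-group over within-group variance."""
    na, nb = len(a), len(b)
    ma, mb = statistics.mean(a), statistics.mean(b)
    grand = statistics.mean(a + b)
    ss_between = na * (ma - grand) ** 2 + nb * (mb - grand) ** 2   # df = 1
    ss_within = (sum((x - ma) ** 2 for x in a) +
                 sum((x - mb) ** 2 for x in b))                    # df = na+nb-2
    return ss_between / (ss_within / (na + nb - 2))

a = [50.1, 49.8, 50.4, 50.0, 49.9]   # made-up latency samples, reference
b = [51.2, 51.0, 50.8, 51.3, 51.1]   # made-up latency samples, test
assert abs(f_statistic(a, b) - t_statistic(a, b) ** 2) < 1e-9
```

The normality and equal-variance assumptions listed above are exactly what makes the plain F-test fragile on raw production metrics.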
24. Other Previous Work
Similarity based
Match count
Longest subsequence based
Clustering
k-Means, phased k-Means
EM
Dynamic clustering
k-Medoids
Single linkage clustering
PCA, SVM
26. Classifiers
Sample count
Minimum number of samples
Relative Threshold
Variance within reference vs. distance between reference and test
Absolute Threshold
Distance between reference and test vs. median of reference
28. Classifiers (contd.)
Ensemble of Composable Classifiers
val classifier = {
  SampleCountClassifier(40) and (
    RelativeThresholdClassifier(50, 0.1) or
    AbsoluteThresholdClassifier(50, 0.1) or
    MadClassifier
  )
}
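The Scala snippet above composes classifiers with and/or. A hypothetical Python translation of the same idea; the class names mirror the slide, but the internals, parameters, and the &/| operator spelling are illustrative assumptions, not Diffy's actual code:

```python
# Illustrative sketch of composable classifiers combined with & and |.
import statistics

class Classifier:
    """A classifier votes True when the metric looks like a regression."""
    def __call__(self, reference, test):
        raise NotImplementedError
    def __and__(self, other):
        return _Combined(lambda r, t: self(r, t) and other(r, t))
    def __or__(self, other):
        return _Combined(lambda r, t: self(r, t) or other(r, t))

class _Combined(Classifier):
    def __init__(self, fn):
        self.fn = fn
    def __call__(self, reference, test):
        return self.fn(reference, test)

class SampleCountClassifier(Classifier):
    """Only vote when both sides have enough samples to be meaningful."""
    def __init__(self, min_samples):
        self.min_samples = min_samples
    def __call__(self, reference, test):
        return len(reference) >= self.min_samples and len(test) >= self.min_samples

class AbsoluteThresholdClassifier(Classifier):
    """Flag when the median shift exceeds a percentage of the reference median."""
    def __init__(self, pct):
        self.pct = pct / 100.0
    def __call__(self, reference, test):
        ref_med = statistics.median(reference)
        return abs(statistics.median(test) - ref_med) > self.pct * abs(ref_med)

# Mirrors the shape of the Scala ensemble: a gate AND a deviation vote.
classifier = SampleCountClassifier(3) & AbsoluteThresholdClassifier(10)
print(classifier([100, 101, 99], [150, 152, 149]))  # True: median up ~50%
print(classifier([100, 101, 99], [101, 100, 102]))  # False: within 10%
```

The sample-count gate is ANDed in front so that noisy, thinly-sampled metrics never reach the threshold votes, which are ORed so any one deviation signal suffices.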