Tools for
Software Engineers
Testing as a Bottleneck of Modern
Software Development Processes
Kim Herzig (Software engineer & Researcher)
kimh@microsoft.com
www.research.microsoft.com/people/kimh
www.kim-herzig.de
Build Tools/Service
Verification Tools
(Testing, Code Review)
Artifact Management
Analytics
Tools/Services
Tools for
Software Engineers
We keep treating testing as a parallel process
instead of linking it to other processes.
Verification process / Risk Management
Functional correctness
(Unit testing)
Constraint verification
(system & integration
testing)
[Comparison of verification methods — build verification, static analysis, code review — ranging from fast but limited in scope to full scope but slow, from essential to basic, with caveats such as many false positives or not finding bugs at all …]
INNER & OUTER DEVELOPMENT LOOP
Engineer's desktop ↔ integration process
FASTER RELEASE CYCLES
Releasing software monthly / weekly / daily!
• Gaining or defending market share.
• Customers got used to it and now demand it.
• Enforces agility / flexibility in product teams.
• We simply cannot run all tests on all changes anymore.
• We need to be clever in selecting the right test at the right time.
• It’s all about RISK MANAGEMENT.
We have less time for (system) testing
[Trade-off triangle: Speed vs. Cost vs. Quality / Risk]
PRODUCT TYPES & RELEASE CYCLES
[Diagram comparing release cycles, from easier to hard:]
Service / Agile: Planning → Implementing → Unit testing → Code review → Deploy → Dogfooding
Box product: Planning → Implementing → Unit testing → Code review → System testing → Platform testing → Shipping → Dogfooding
How do we test a system like
Windows / Office / SQL-Server?
Where? When? Who? How frequently?
SYSTEM AND INTEGRATION TESTING
Software testing is expensive
• 10k+ gates executed, 1M+ test cases
• Different branches, architectures, languages, devices, platforms, …
• Aims to find code issues as early as possible
• Slows down product development
Verification time defines a lower bound
on how fast we can deliver software.
EXAMPLE CHALLENGES (@ MICROSOFT)
1 False Test Alarms
TEST FAILURE SEVERITY
DON’T CARE: test issue, not my code, will not fix
CAN WAIT: low-priority bug, before release, refactoring
BLOCKER: fix it now, I need to know this
False test alarms
Each false alarm is expensive
• Requires human inspection,
• Fails the integration request, which needs to be re-run,
• Might hide real defects.
WHAT IS A FALSE TEST ALARM?
[Classification flow: Test execution fails → Mapped to a bug report? → Bug report fixed? → Resolved via a code change? If yes at every step: true positive failure; otherwise: false positive failure.]
Detection is possible, but only with a long delay
• System & integration tests usually depend on their environment (automated tests).
• False alarms are test failures caused by any reason other than a code defect.
• This does not imply a broken test: system tests usually consume external data,
• e.g. fetching a media file to test a media player,
• and depend on hardware, network, remote servers, …
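For illustration, the classification flow above can be written as a small predicate. This is a sketch only; the record fields (bug_report, bug_fixed, fixed_by_code_change) are hypothetical names that mirror the questions in the flow, not an actual schema.

```python
def is_false_alarm(failure: dict) -> bool:
    """A failure counts as a true positive only if it was mapped to a bug
    report that was fixed via a code change; everything else is a false alarm."""
    return not (
        failure.get("bug_report") is not None      # mapped to a bug report?
        and failure.get("bug_fixed", False)         # bug report fixed?
        and failure.get("fixed_by_code_change", False)  # resolved via code change?
    )

# Example: a failure whose bug report was closed without a code change.
print(is_false_alarm({"bug_report": 1234, "bug_fixed": True, "fixed_by_code_change": False}))  # True
```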
GOAL
Immediate identification of false test alarms
• Filter false test alarms,
• Let engineers focus on other failures,
• Use false failures later to identify potential test issues.
Avoid any runtime overhead
• System tests are slow enough.
Self-adaptive method
• New issues come and go, and we want to detect them automatically.
• No manual effort to add new whitelist or blacklist rules.
HISTORIC PATTERNS
[Diagram: repeated executions of a test case with test steps 1, 2, 3, …, n; across historical runs, certain combinations of failing test steps consistently ended in false alarms.]
$TestStep_2 \wedge TestStep_n \Rightarrow \text{False alarm}$
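One way such historic patterns could be mined (a sketch, not necessarily the exact method behind the slide) is to count which combinations of failing test steps co-occur with false alarms across past runs and keep the combinations that almost always signal a false alarm. The history data below is made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical history: (set of failed step ids, was the run a false alarm?)
history = [
    ({2, 5}, True),
    ({2, 5}, True),
    ({1}, False),
    ({2, 5, 7}, True),
]

# Count, for each pair of failed steps, how often it occurs at all
# and how often it occurs in runs that were false alarms.
pair_total = Counter()
pair_false = Counter()
for failed_steps, false_alarm in history:
    for pair in combinations(sorted(failed_steps), 2):
        pair_total[pair] += 1
        if false_alarm:
            pair_false[pair] += 1

# Keep pairs seen at least twice that signal a false alarm >= 90% of the time.
rules = {
    pair: pair_false[pair] / pair_total[pair]
    for pair in pair_total
    if pair_total[pair] >= 2 and pair_false[pair] / pair_total[pair] >= 0.9
}
print(rules)  # e.g. {(2, 5): 1.0}: "steps 2 and 5 failing together => false alarm"
```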
USE MACHINE LEARNING
Rows are events (e.g. builds or QTest runs); columns record which tests failed and whether the run turned out to be a false alarm:
Test A | Test B | Test C | Test D | Test E | Test F | Test G | Test H | False alarm?
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
… | … | … | … | … | … | … | … | …
[Decision tree over test-failure features ("Test A failed?", "Test B failed?", "Test C failed?"); its leaves predict "false alarm" with probabilities such as 0.8 and 0.6.]
Results: precision of 84%* (share of predicted false alarms that are real false alarms; the rest are wrongly classified false alarms, worst case) and 8% missed false alarms (1 − recall).
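A minimal sketch of this kind of learner using scikit-learn; the talk does not prescribe a particular library, and the tiny matrix below just mirrors the example table above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

# Each row is one event (e.g. a build or QTest run); each column is
# "did test X fail?" (1 = failed). y marks runs known to be false alarms.
X = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
])
y = np.array([0, 1, 0, 0, 0])

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# predict_proba yields the per-leaf probabilities ("prob. = 0.8 / 0.6" style);
# precision and (1 - recall) are the kinds of metrics reported on the slide.
pred = model.predict(X)
print(precision_score(y, pred, zero_division=0), recall_score(y, pred, zero_division=0))
```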
Great. Test failures translate to code defects, but …
2 We need to cut test time
REDUCE EXECUTION FREQUENCY
Do not sacrifice code quality
• Run every test at least once on every code change.
• Eventually find all code defects; taking the risk of finding some defects later is acceptable.
[Quadrant: test reliability (low ↔ high) vs. effectiveness (low ↔ high).
• High reliability, high effectiveness: OK.
• Low reliability: cannot trust the result.
• Low effectiveness: rarely finds a defect.
• Low reliability and low effectiveness: why are we running this test?]
Executing tests costs money.
What is the return on investment?
COST MODEL
$Cost_{Execution} > Cost_{Skip}$ ? suspend : execute test

$Cost_{Execution} = Cost_{Machine/Time} \cdot Time_{Execution} + \text{cost of a potential false alarm}$
$= Cost_{Machine/Time} \cdot Time_{Execution} + P_{FP} \cdot Cost_{Developer/Time} \cdot Time_{Triage}$

$Cost_{Skip} = \text{potential cost of a bug escaping to the next higher branch level}$
$= P_{TP} \cdot Cost_{Developer/Time} \cdot Time_{Freeze\ branch} \cdot \#Developers_{Branch}$
[1] K. Herzig, M. Greiler, J. Czerwonka, and B. Murphy, “The Art of Testing Less without Sacrificing Quality,”
in Proceedings of the 2015 International Conference on Software Engineering, 2015.
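The decision rule above can be sketched as a small function. All parameter values in the example call are made-up placeholders, not actual cost figures.

```python
def should_execute(
    p_fp: float,                 # probability the test raises a false alarm
    p_tp: float,                 # probability the test finds a real defect
    machine_cost_per_h: float,
    developer_cost_per_h: float,
    execution_h: float,
    triage_h: float,
    freeze_h: float,             # time the branch is frozen if a bug escapes
    developers_on_branch: int,
) -> bool:
    """Execute the test only if skipping it is expected to be more expensive."""
    cost_execution = (
        machine_cost_per_h * execution_h
        + p_fp * developer_cost_per_h * triage_h                         # potential false alarm
    )
    cost_skip = p_tp * developer_cost_per_h * freeze_h * developers_on_branch  # escaped bug
    return cost_execution <= cost_skip

# Example with illustrative numbers: a cheap test on a branch with 200 developers.
print(should_execute(0.05, 0.01, 2.0, 80.0, 1.5, 0.5, 8.0, 200))  # True -> execute
```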
Windows results
Simulated on the Windows 8.1 development period (BVT only)
[1] K. Herzig, M. Greiler, J. Czerwonka, and B. Murphy, “The Art of Testing Less without Sacrificing Quality,”
in Proceedings of the 2015 International Conference on Software Engineering, 2015.
Great. We are testing fast, but …
3 Are we testing reality?
What shall we test next?
RESEARCH APPROACHES
Code Coverage
• Coverage does not imply verification
• Collecting coverage slows down testing
(how frequently do we need to collect coverage?)
• Expensive to collect and store (100s of GB per run per product)
Test Generation
• Engineers do not trust generated code (except their own).
• Mostly uses code-coverage-bound fitness functions to cover new areas (see above).
• How well does the generated test match the intended / user behaviour?
What is happening out there (in the real world)?
DIFFS ON (USAGE) SCENARIOS!
Engineer | Test | Automation??
In collaboration with
Christopher Theisen, Dr. Laurie Williams
& Brendan Murphy (Microsoft Research)
[Venn diagram legend: Verification, Customer, Overlap, Legacy code]
TESTING AS REALITY CHECK
Are we testing new code?
• No test failures on new code (unlikely)?
• Is our test code base growing relative to the code base?
• What's the average age of tests?
Are features used as intended/tested?
• If not: we need to change tests or prevent misuse.
• Why are people using the feature differently?
How does usage change over time?
• Tests need to adapt to current usage scenarios.
• What is the intended lifetime of a test case?
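A rough sketch of the underlying comparison: diff the set of features (or code regions) exercised by tests against the set observed in customer telemetry. The feature names and data below are hypothetical.

```python
# Hypothetical feature identifiers exercised by tests vs. observed in telemetry.
tested_features = {"open_file", "save_file", "print", "legacy_export"}
used_features = {"open_file", "save_file", "cloud_sync", "share_link"}

overlap = tested_features & used_features         # verified and actually used
untested_usage = used_features - tested_features  # reality we do not test
unused_tests = tested_features - used_features    # candidates for legacy/retirement

print(sorted(overlap), sorted(untested_usage), sorted(unused_tests))
```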
WRAP UP
We need sustainable, applicable test strategies.
We simply cannot afford to run all tests on all changes anymore.
Testing can be more than just verification.
Combined with customer telemetry, we can turn testing into a feedback loop.
Software development processes are changing radically.
We haven't changed testing significantly.
Testing significantly impacts development speed.
Verification time defines a lower bound on how fast we can deliver software.

Editor's Notes

  • #9 The system is designed for maximal protection But that is not possible anymore: we simply cannot run all tests on all code changes anymore Which one to run? Coverage: no!
  • #20 NO EXTRA DATA COLLECTED! No runtime overhead Data exists in every (!) test execution database (at least it should) 
  • #22 FOR EACH CODE CHANGE WE RUN EVERY TEST AT LEAST ONCE BEFORE TRUNK INTEGRATION: This check is only performed when we know that the test will be executed in a later stage, e.g. branch closer to root node of tree