Tools for
Software Engineers
Testing as a Bottleneck of Modern
Software Development Processes
Kim Herzig (Software engineer & Researcher)
kimh@microsoft.com
www.research.microsoft.com/people/kimh
www.kim-herzig.de
Build Tools/Service
Verification Tools
(Testing, Code Review)
Artifact Management
Analytics
Tools/Services
Tools for
Software Engineers
We keep treating testing as a parallel process
instead of linking it to other processes.
Verification process / Risk Management
Functional correctness
(Unit testing)
Constraint verification
(system & integration
testing)
[Comparison of verification methods — build verification, static analysis, code review — ranging from fast but limited in scope to full scope but slow, from essential to basic, with caveats such as many false positives or not finding bugs at all …]
INNER & OUTER DEVELOPMENT LOOP
Engineer's desktop ↔ integration process
FASTER RELEASE CYCLES
Releasing software monthly / weekly / daily!
• Gaining or defending market share.
• Customers got used to it and now demand it.
• Enforces agility / flexibility in product teams.
• We simply cannot run all tests on all changes anymore.
• We need to be clever in selecting the right test at the right time.
• It’s all about RISK MANAGEMENT.
We have less time for (system) testing
[Trade-off triangle: Speed vs. Cost vs. Quality / Risk]
PRODUCT TYPES & RELEASE CYCLES
[Diagram comparing release cycles, from easier to hard:]
Service / Agile: Planning → Implementing → Unit testing → Code review → Deploy → Dogfooding
Box product: Planning → Implementing → Unit testing → Code review → System testing → Platform testing → Shipping → Dogfooding
How do we test a system like
Windows / Office / SQL-Server?
Where? When? Who? How frequently?
SYSTEM AND INTEGRATION TESTING
Software testing is expensive
• 10k+ gates executed, 1M+ test cases
• Different branches, architectures, languages, devices, platforms, …
• Aims to find code issues as early as possible
• Slows down product development
Verification time defines a lower bound
on how fast we can deliver software.
EXAMPLE CHALLENGES (@ MICROSOFT)
1 False Test Alarms
TEST FAILURE SEVERITY
DON’T CARE: test issue, not my code, will not fix
CAN WAIT: low-priority bug, before release, refactoring
BLOCKER: fix it now, I need to know this
False test alarms
Each false alarm is expensive
• Requires human inspection,
• Fails the integration request, which needs to be re-run,
• Might hide real defects.
WHAT IS A FALSE TEST ALARM?
[Classification flow: Test execution fails → Mapped to a bug report? → Bug report fixed? → Resolved via a code change? If yes at every step: true positive failure; otherwise: false positive failure.]
Detection is possible, but only with a long delay
• System & integration tests usually depend on their environment (automated tests).
• False alarms are test failures caused by any reason other than a code defect.
• This does not imply a broken test: system tests usually consume external data,
• e.g. fetching a media file to test a media player,
• and depend on hardware, network, remote servers, …
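For illustration, the classification flow above can be written as a small predicate. This is a sketch only; the record fields (bug_report, bug_fixed, fixed_by_code_change) are hypothetical names that mirror the questions in the flow, not an actual schema.

```python
def is_false_alarm(failure: dict) -> bool:
    """A failure counts as a true positive only if it was mapped to a bug
    report that was fixed via a code change; everything else is a false alarm."""
    return not (
        failure.get("bug_report") is not None      # mapped to a bug report?
        and failure.get("bug_fixed", False)         # bug report fixed?
        and failure.get("fixed_by_code_change", False)  # resolved via code change?
    )

# Example: a failure whose bug report was closed without a code change.
print(is_false_alarm({"bug_report": 1234, "bug_fixed": True, "fixed_by_code_change": False}))  # True
```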
GOAL
Immediate identification of false test alarms
• Filter false test alarms,
• Let engineers focus on other failures,
• Use false failures later to identify potential test issues.
Avoid any runtime overhead
• System tests are slow enough.
Self-adaptive method
• New issues come and go, and we want to detect them automatically.
• No manual effort to add new whitelist or blacklist rules.
HISTORIC PATTERNS
[Diagram: repeated executions of a test case with test steps 1, 2, 3, …, n; across historical runs, certain combinations of failing test steps consistently ended in false alarms.]
$TestStep_2 \wedge TestStep_n \Rightarrow \text{False alarm}$
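One way such historic patterns could be mined (a sketch, not necessarily the exact method behind the slide) is to count which combinations of failing test steps co-occur with false alarms across past runs and keep the combinations that almost always signal a false alarm. The history data below is made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical history: (set of failed step ids, was the run a false alarm?)
history = [
    ({2, 5}, True),
    ({2, 5}, True),
    ({1}, False),
    ({2, 5, 7}, True),
]

# Count, for each pair of failed steps, how often it occurs at all
# and how often it occurs in runs that were false alarms.
pair_total = Counter()
pair_false = Counter()
for failed_steps, false_alarm in history:
    for pair in combinations(sorted(failed_steps), 2):
        pair_total[pair] += 1
        if false_alarm:
            pair_false[pair] += 1

# Keep pairs seen at least twice that signal a false alarm >= 90% of the time.
rules = {
    pair: pair_false[pair] / pair_total[pair]
    for pair in pair_total
    if pair_total[pair] >= 2 and pair_false[pair] / pair_total[pair] >= 0.9
}
print(rules)  # e.g. {(2, 5): 1.0}: "steps 2 and 5 failing together => false alarm"
```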
USE MACHINE LEARNING
Rows are events (e.g. builds or QTest runs); columns record which tests failed and whether the run turned out to be a false alarm:
Test A | Test B | Test C | Test D | Test E | Test F | Test G | Test H | False alarm?
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
… | … | … | … | … | … | … | … | …
[Decision tree over test-failure features ("Test A failed?", "Test B failed?", "Test C failed?"); its leaves predict "false alarm" with probabilities such as 0.8 and 0.6.]
Results: precision of 84%* (share of predicted false alarms that are real false alarms; the rest are wrongly classified false alarms, worst case) and 8% missed false alarms (1 − recall).
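A minimal sketch of this kind of learner using scikit-learn; the talk does not prescribe a particular library, and the tiny matrix below just mirrors the example table above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

# Each row is one event (e.g. a build or QTest run); each column is
# "did test X fail?" (1 = failed). y marks runs known to be false alarms.
X = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
])
y = np.array([0, 1, 0, 0, 0])

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# predict_proba yields the per-leaf probabilities ("prob. = 0.8 / 0.6" style);
# precision and (1 - recall) are the kinds of metrics reported on the slide.
pred = model.predict(X)
print(precision_score(y, pred, zero_division=0), recall_score(y, pred, zero_division=0))
```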
Great. Test failures translate to code defects, but …
2 We need to cut test time
REDUCE EXECUTION FREQUENCY
Do not sacrifice code quality
• Run every test at least once on every code change.
• Eventually find all code defects; taking the risk of finding some defects later is acceptable.
[Quadrant: test reliability (low ↔ high) vs. effectiveness (low ↔ high).
• High reliability, high effectiveness: OK.
• Low reliability: cannot trust the result.
• Low effectiveness: rarely finds a defect.
• Low reliability and low effectiveness: why are we running this test?]
Executing tests costs money.
What is the return on investment?
COST MODEL
$Cost_{Execution} > Cost_{Skip}$ ? suspend : execute test

$Cost_{Execution} = Cost_{Machine/Time} \cdot Time_{Execution} + \text{cost of a potential false alarm}$
$= Cost_{Machine/Time} \cdot Time_{Execution} + P_{FP} \cdot Cost_{Developer/Time} \cdot Time_{Triage}$

$Cost_{Skip} = \text{potential cost of a bug escaping to the next higher branch level}$
$= P_{TP} \cdot Cost_{Developer/Time} \cdot Time_{Freeze\ branch} \cdot \#Developers_{Branch}$
[1] K. Herzig, M. Greiler, J. Czerwonka, and B. Murphy, “The Art of Testing Less without Sacrificing Quality,”
in Proceedings of the 2015 International Conference on Software Engineering, 2015.
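The decision rule above can be sketched as a small function. All parameter values in the example call are made-up placeholders, not actual cost figures.

```python
def should_execute(
    p_fp: float,                 # probability the test raises a false alarm
    p_tp: float,                 # probability the test finds a real defect
    machine_cost_per_h: float,
    developer_cost_per_h: float,
    execution_h: float,
    triage_h: float,
    freeze_h: float,             # time the branch is frozen if a bug escapes
    developers_on_branch: int,
) -> bool:
    """Execute the test only if skipping it is expected to be more expensive."""
    cost_execution = (
        machine_cost_per_h * execution_h
        + p_fp * developer_cost_per_h * triage_h                         # potential false alarm
    )
    cost_skip = p_tp * developer_cost_per_h * freeze_h * developers_on_branch  # escaped bug
    return cost_execution <= cost_skip

# Example with illustrative numbers: a cheap test on a branch with 200 developers.
print(should_execute(0.05, 0.01, 2.0, 80.0, 1.5, 0.5, 8.0, 200))  # True -> execute
```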
Windows results
Simulated on the Windows 8.1 development period (BVT only)
[1] K. Herzig, M. Greiler, J. Czerwonka, and B. Murphy, “The Art of Testing Less without Sacrificing Quality,”
in Proceedings of the 2015 International Conference on Software Engineering, 2015.
Great. We are testing fast, but …
3 Are we testing reality?
What shall we test next?
RESEARCH APPROACHES
Code Coverage
• Coverage does not imply verification
• Collecting coverage slows down testing
(how frequently do we need to collect coverage?)
• Expensive to collect and store (100s of GB per run per product)
Test Generation
• Engineers do not trust generated code (except their own).
• Mostly uses code-coverage-bound fitness functions to cover new areas (see above).
• How well does the generated test match the intended / user behaviour?
What is happening out there (in the real world)?
DIFFS ON (USAGE) SCENARIOS!
Engineer | Test | Automation??
In collaboration with
Christopher Theisen, Dr. Laurie Williams
& Brendan Murphy (Microsoft Research)
[Venn diagram legend: Verification, Customer, Overlap, Legacy code]
TESTING AS REALITY CHECK
Are we testing new code?
• No test failures on new code (unlikely)?
• Is our test code base growing relative to the code base?
• What's the average age of tests?
Are features used as intended/tested?
• If not: we need to change tests or prevent misuse.
• Why are people using the feature differently?
How does usage change over time?
• Tests need to adapt to current usage scenarios.
• What is the intended lifetime of a test case?
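A rough sketch of the underlying comparison: diff the set of features (or code regions) exercised by tests against the set observed in customer telemetry. The feature names and data below are hypothetical.

```python
# Hypothetical feature identifiers exercised by tests vs. observed in telemetry.
tested_features = {"open_file", "save_file", "print", "legacy_export"}
used_features = {"open_file", "save_file", "cloud_sync", "share_link"}

overlap = tested_features & used_features         # verified and actually used
untested_usage = used_features - tested_features  # reality we do not test
unused_tests = tested_features - used_features    # candidates for legacy/retirement

print(sorted(overlap), sorted(untested_usage), sorted(unused_tests))
```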
WRAP UP
We need sustainable, applicable test strategies.
We simply cannot afford to run all tests on all changes anymore.
Testing can be more than just verification.
Combined with customer telemetry, we can turn testing into a feedback loop.
Software development processes are changing radically.
We haven't changed testing significantly.
Testing significantly impacts development speed.
Verification time defines a lower bound on how fast we can deliver software.

Editor's Notes

  • #9 The system is designed for maximal protection But that is not possible anymore: we simply cannot run all tests on all code changes anymore Which one to run? Coverage: no!
  • #20 NO EXTRA DATA COLLECTED! No runtime overhead Data exists in every (!) test execution database (at least it should) 
  • #22 FOR EACH CODE CHANGE WE RUN EVERY TEST AT LEAST ONCE BEFORE TRUNK INTEGRATION: This check is only performed when we know that the test will be executed in a later stage, e.g. branch closer to root node of tree