We study more than 20,000 non-trivial software
projects and explore the correlation of test cases with various
project development characteristics including: project size,
development team size, number of bugs, number of bug
reporters, and the programming languages of these projects.
An Empirical Study of Adoption of Software Testing in Open Source Projects
1. Pavneet Singh Kochhar¹, Tegawendé F. Bissyandé², David Lo¹, Lingxiao Jiang¹
¹ Singapore Management University
² University of Luxembourg
2. 2/24
Importance of Software Testing
Functionality -- Requirements
Debugging -- Software complexity
Costs -- $59 billion* for inadequate testing
How widely are test cases adopted in open-source projects?
*G. Tassey, “The economic impacts of inadequate infrastructure for software
testing,” National Institute of Standards and Technology, RTI Project, 2002.
3. 3/24
Objective & Contributions
Popularity of test cases
Presence of test cases – project characteristics
Influence of software development artifacts
Large Scale Study on over 20,000 GitHub projects
4. 4/24
Dataset & Statistical computations
Downloaded over 100,000 projects from GitHub
Randomly selected 50,000 projects
Preliminary study
Filter out projects with < 500 Lines of Code (LOC)
20,817 projects
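The selection steps above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the `(name, loc)` record shape, the function name `select_projects`, and the fixed seed are assumptions; only the 500 LOC threshold comes from the slide.

```python
import random

MIN_LOC = 500  # threshold stated on the slide


def select_projects(projects, sample_size, min_loc=MIN_LOC, seed=0):
    """Randomly sample projects, then drop those with fewer than min_loc LOC.

    `projects` is a list of (name, loc) pairs (hypothetical record shape).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = rng.sample(projects, min(sample_size, len(projects)))
    return [(name, loc) for name, loc in sample if loc >= min_loc]
```

Applied to the study's numbers, `sample_size` would be 50,000 and the filter would leave the 20,817 projects reported above.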
5. 5/24
Dataset & Statistical computations
Lines of code: LOC*, by programming language
Number of test cases: count of test files
Developer contributions: project team size
Bug count: tags
Bug reporters: user names
*SLOCCount (http://www.dwheeler.com/sloccount/)
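The slides say only that test cases are counted as test files and that a heuristic is used to detect them. A minimal sketch of one such heuristic, assuming a file counts as a test file when its base name contains "test" (case-insensitive); the function name `count_test_files` is hypothetical:

```python
import posixpath


def count_test_files(file_paths):
    """Count files whose base name contains 'test', ignoring case.

    Assumed heuristic: directory names are ignored, only the file name counts.
    """
    return sum(
        1 for path in file_paths
        if "test" in posixpath.basename(path).lower()
    )
```

A name-based rule like this is cheap to run over tens of thousands of repositories, but it can miss tests that follow other naming conventions and can count non-test files such as `latest.txt`.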
7. 7/24
RQ1– Popularity of Test Cases
Projects              % of Projects
Without Test Cases    38.34%
With Test Cases       61.65%
Distribution of Test Cases
84.87% of the projects have fewer than 100 test cases
10.7% of the projects have between 100 and 500 test cases
4.4% of the projects have more than 500 test cases
9. 9/24
RQ1– Popularity of Test Cases
LOC (Projects with & without Test cases)
Difference between the distributions is statistically significant
(p-value < 0.05)
10. 10/24
RQ1– Popularity of Test Cases
LOC & Test Cases
Positive correlation between #LOC and #Test Cases (ρ=0.427)
(p-value < 0.05)
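The slides report ρ without naming the coefficient. Assuming it is Spearman's rank correlation, a minimal pure-Python sketch of the computation looks like this (function names are illustrative; ties receive average ranks):

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Here `xs` would hold each project's LOC and `ys` its number of test cases. A rank-based coefficient is a natural fit for such heavily skewed counts because it depends only on ordering, not on magnitudes.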
11. 11/24
RQ1– Popularity of Test Cases
LOC & Test cases/LOC
Negative correlation between #LOC and #Test Cases/LOC (ρ=-0.451)
(p-value < 0.05)
12. 12/24
RQ2– Developers & Test Cases
Developers (Projects with & without Test cases)
Difference between the distributions is statistically significant
(p-value < 0.05)
13. 13/24
RQ2– Developers & Test Cases
Developers & Test cases
Weak correlation between #Developers and #Test Cases (ρ=0.207)
(p-value < 0.05)
14. 14/24
RQ2– Developers & Test Cases
Developers & Test cases/developer
Negative correlation between Team size and #Test Cases per developer (ρ=-0.444)
(p-value < 0.05)
15. 15/24
RQ3–Bug Count and Test Cases
Identifying bugs (Tags)
bug: bug; T bug; Bug Confirmed; bugs; starter bug; bug fix etc.
defect: defect; Type-Defect; minor defect
error: error; Wow error; build error; error page; user error etc.
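The tag matching above can be sketched as a case-insensitive substring check on the three keywords. This is an assumed reading of the slide, not the authors' exact rule; the names `is_bug_tag` and `count_bug_issues` and the list-of-tag-lists input shape are hypothetical.

```python
BUG_KEYWORDS = ("bug", "defect", "error")  # keywords listed on the slide


def is_bug_tag(tag):
    """True if the tag contains any bug keyword, ignoring case."""
    lowered = tag.lower()
    return any(keyword in lowered for keyword in BUG_KEYWORDS)


def count_bug_issues(issues):
    """Count issues carrying at least one bug-like tag.

    `issues` is a list of tag lists, one per issue (hypothetical shape).
    """
    return sum(1 for tags in issues if any(is_bug_tag(t) for t in tags))
```

Substring matching picks up variants such as "Type-Defect" and "build error", but it would also match an unrelated tag like "debug".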
16. 16/24
RQ3–Bug Count and Test Cases
Test cases & Bugs
Weak correlation between # bugs and #Test Cases (ρ=0.181)
(p-value < 0.05)
17. 17/24
RQ4–Bug Reporters and Test Cases
Bug reporters (Projects with & without Test cases)
Difference between the distributions is statistically significant
(p-value < 0.05)
18. 18/24
RQ4– Bug Reporters and Test Cases
Test cases & Bug reporters
Weak correlation between # bug reporters and #Test Cases (ρ=0.171)
(p-value < 0.05)
19. 19/24
RQ5–Programming Languages and Test Cases
Projects (Top 10 Languages)
1. Java
2. Ruby
3. PHP
4. Python
5. ANSI C
6. C++
7. Objective-C
8. C#
9. JavaScript
10. Perl
20. 20/24
RQ5–Programming Languages and Test Cases
Test Cases/Project (Top 10 Languages)
Language      # of Projects   # of Test Cases   Test Cases/Project
C++           1,920           648,773           337.90
ANSI C        2,197           286,009           130.18
PHP           2,902           255,553            88.06
C#            1,042            81,334            78.05
Java          3,112           196,703            63.20
Ruby          3,016           173,864            57.64
JavaScript      819            39,070            47.70
Python        2,536           103,600            40.85
Objective-C   1,153            21,343            18.51
Perl            630             7,690            12.20
23. 23/24
Threats to Validity
Heuristics to detect test cases
Counting bugs
Tags: bug, error, defect
Not all projects use GitHub’s issue tracking system
24. 24/24
Conclusion
Findings:
o Projects with test cases are larger in size.
o The # of test cases per LOC decreases as LOC increases.
o The more developers, the more test cases.
o The more developers, the lower the ratio of test cases per developer.
o Weak positive correlation between # of test cases and # of bugs.
o Weak positive correlation between # of test cases and # of bug reporters.
o Projects written in popular languages such as C++, ANSI C & PHP have a higher mean number of test cases.
Future agenda:
-- Exploration of the influence of more project characteristics/metrics
-- Check with other open source datasets
-- Use language specific heuristics
26. Bug Tags
installation rich Improvement Reporting
duplicated pat New feature community
feature mark Confirmed documentation
routing needs review In Progress categorization
optimization Samples Feature request publishing
security Unable to reproduce
Wont fix ranker
translations nack Resolved server
ui rich Bug confirmed Fatal
TODO pat backend Build System
low priority mark low-priority MS AspNet
Sam presentation frontend OAuth2
27. 22/23
C++ test cases
URL Language # of test cases
https://github.com/isis-project/WebKit cpp 166,488
https://github.com/cswei/Olympia_on_Desktop cpp 94,591
https://github.com/librelab/qtmoko-test cpp 52,039
https://github.com/mozilla/mozilla-central cpp 36,671
https://github.com/weissms/owb-mirror cpp 29,340
29. 30
RQ5–Programming Languages and Test Cases
Test Cases (Top 10 Languages)
[Figure: box plot of test cases for the top 10 languages. Legend: median, lower quartile, upper quartile, lower whisker, upper whisker, outliers; the box spans 50% of the data.]