Modeling the Effect of Size on Defect Proneness for Open-Source Software
A. Gunes Koru and Donsong Zhang and Hongfang Liu

Transcript

  • 1. PROMISE at ICSE’07: Modeling the Effect of Size on Defect Proneness for Open-Source Software
      • A. Güneş Koru (1), Donsong Zhang (1), and Hongfang Liu (2)
      • (1) Department of Information Systems, UMBC, Baltimore, MD, USA
      • (2) Department of Bioinformatics, Biostatistics, and Biomathematics, Georgetown Medical Center, Georgetown University, Washington, D.C., USA
      • E-mails: gkoru@umbc.edu, zhangd@umbc.edu, hl224@georgetown.edu
  • 2. UMBC
      • UMBC: University of Maryland, Baltimore County (http://umbc.edu/~gkoru)
      • Public research university with a focus on graduate education.
      • In theory, all campuses belong to the University of Maryland, but in practice they look like different universities.
      • UMBC is located in Catonsville, a small suburban neighborhood in Baltimore.
      • UMBC is not:
          • University of Maryland, College Park
          • University of Baltimore (business school)
          • University of Maryland, Baltimore (medical school)
      • Hongfang Liu is with Georgetown University, located in Washington, D.C., and is interested in bioinformatics and health care.
  • 3. Size--Defect Relationship
      • Size is perhaps the oldest measure. It is mostly measured in lines of code (sometimes in function points).
      • Several studies found size to be associated with defect count. The earliest was a linear model in [Akiyama 71].
      • Many other measures (e.g., cyclomatic complexity [McCabe 76], software science measures [Halstead 77]) are also correlated with size. There is some consensus that these are also size measures [Fenton and Pfleeger 96].
      • “Maybe size does not explain everything, but it explains a lot.” Bojan Cukic, PROMISE 2007
      • The functional form of this relationship is still not well understood.
      • Commonly, practitioners assume a linear relationship [El Emam 05].
      • The only general conclusion is that there is a continuously increasing relationship between the two [Fenton and Ohlsson 00, El Emam et al. 01].
  • 4. Size--Defect Relationship: Alternative Forms
      • [Figure: three sketches of defects (y-axis) versus size (x-axis), labeled (a), (b), and (c)]
      • “Things are linear is open to questions” Tim Menzies, PROMISE 2007
      • Implications:
          • (a) Linear: Smaller and larger modules are proportionally equally problematic
          • (b) Quadratic: Larger modules are proportionally more problematic
          • (c) Logarithmic: Smaller modules are proportionally more problematic
      • Theoretical and Practical Importance:
          • Decomposition
          • Focused quality assurance
          • Functional Enhancements
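To make "proportionally more problematic" concrete, the following minimal sketch compares expected defects per line of code for a small and a large module under the three functional forms. The coefficients and module sizes are hypothetical, chosen only for illustration; they are not estimates from the study.

```python
import math

# Hypothetical functional forms for expected defects as a function of size (LOC).
# All coefficients are illustrative assumptions, not values from the study.
def linear(size):
    return 0.01 * size

def quadratic(size):
    return 0.00001 * size ** 2

def logarithmic(size):
    return 2.0 * math.log(size)

def defects_per_loc(form, size):
    """Expected defects divided by size: the proportional burden of a module."""
    return form(size) / size

small, large = 100, 1000  # hypothetical module sizes in LOC
for name, form in [("linear", linear),
                   ("quadratic", quadratic),
                   ("logarithmic", logarithmic)]:
    # Linear: equal ratios. Quadratic: the large module is worse per LOC.
    # Logarithmic: the small module is worse per LOC.
    print(name, defects_per_loc(form, small), defects_per_loc(form, large))
```

Under the logarithmic form the total expected defects still grow with size; only the defects per line of code shrink.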
  • 5. Why the relationship is still unclear...
      • Many earlier studies did not fully explore alternative functional forms or test whether the deviation from linearity was significant.
      • Linear models [Akiyama 71] or correlations [Andersson and Runeson 07] were found sufficient.
      • One study stated that linear models could be good first approximations and had better tool support [Shen 85].
      • The number of data points was very limited in the earlier studies (e.g., [Akiyama 71]).
      • Deriving models analytically and then fitting data to validate those models [Lipow 82].
      • Accepting correlations as a sign of a linear relationship [Schneidewind and Hoffman 78]. Correlations do not imply proportionality.
      • Focus shifted to defect density: observations of an optimal module size that minimizes defect density, the U-shaped curve (Goldilocks conjecture) [Withrow 90, Hatton 97, Hatton 98, etc.]. See [El Emam 02] for a detailed review.
      • This approach can mask the plain size--defect relationship and mislead us [El Emam 02, Fenton and Neil 00, and Rosenberg 97].
      • The relationship gets more difficult to understand from multivariate and sophisticated machine learning models (e.g., from neural networks in [Khoshgoftaar 97]).
  • 6. Conventional Approach to Investigate Size--Defect Relationship
      • All these studies share a common characteristic:
          • A software system is measured at a snapshot time, then the obtained measurements are associated with the future defect count (note that this might be pre-release or post-release). For example: [Koru and Tian 03], [Khoshgoftaar 96]
      • Usually, measurement and analysis are performed at the module level.
      • A common problem is the availability of data [Fenton and Ohlsson 02].
      • Publicly available Open Source Software (OSS) repositories provide source code, change data, and defect data [Koru and Tian 04].
  • 7. Challenges with Using Conventional Method in OSS Context
      • Evolutionary aspects of OSS: continuous and concurrent functional enhancements, defect fixes, and all other changes (perfective, adaptive, etc.). Bazaar model rather than cathedral model [Raymond 99].
      • OSS is usually developed by volunteers, with little planning and no requirements or design documents; source code is the main artifact [Mockus et al. 00, Mockus et al. 02].
      • Quality assurance activities are not systematic in OSS (see [Zhao and Elbaum 03, Koru et al. 07]).
      • So far, research using the conventional approach has focused on relatively better planned, analyzed, designed, and tested closed-source products.
      • Internal validity problems caused by the dynamic OSS context:
          • Deleted classes
          • Size changes
      • There might be closed-source products developed in an evolutionary manner and vice versa. Such comparisons are outside the scope here (see [Paulson et al. 04]).
  • 8. In this study...
      • “If developers play with a file, it can change its defect proneness” Elaine Weyuker, PROMISE 2007
      • To gain a better understanding of the size--defect relationship, we used both the conventional approach and a novel approach that adopts Cox Proportional Hazards Modeling with Recurrent Events (Cox Modeling) [Cox 72].
      • The data come from Mozilla (http://www.mozilla.org), a large-scale, long-lived OSS product.
      • The evolutionary aspects of the Mozilla project were shown in other studies:
          • Gyimóthy et al. [04] found that the size of Mozilla increased significantly during successive releases.
          • Mockus et al. [02] found that there was no particular development process in Mozilla.
  • 9. In the rest of this presentation...
      • Methodology
      • Demonstrating the evolutionary aspects of Mozilla
      • Cox Modeling
      • Data Collection
      • Modeling and Results
      • Future Work
      • Conclusion
  • 10. Results: Demonstrating Evolutionary Aspects of Mozilla
      • For only Mozilla 1.0 classes
      • [Figure, panel (a): cumulative number of deleted classes (y-axis, 0 to 1,000) versus years (x-axis, 2003 to 2006)]
      • [Figure, panel (b): caption begins "Cumulative"; the rest of the caption and the plot are not recoverable from the transcript]
  • 11. Cox Modeling (A non-parametric approach)
      • The instantaneous relative risk (hazard) of a defect fix, also called an event, becomes the response variable. Note that it can recur.
      • A complete size history is obtained for each class by measuring size at each change; corrective changes are marked.
      • The time of each change is also noted. At each unique time, the hazard is calculated by dividing the number of events at that time by the number of classes at risk at that time.
      • Hazard function: $\lambda_i(t) = \lambda_0(t)\, e^{\beta x_i(t)}$ (1)
      • Here $\beta$ is the regression coefficient for $x_i(t)$, and $\lambda_0(t)$ is an unspecified non-negative function of time called the baseline hazard function. It is the instantaneous hazard of having an event without any covariate effect (i.e., when $\beta = 0$).
      • Relative hazard: $e^{\beta (x_j(t) - x_k(t))}$
      • Note that the relative hazard depends only on the difference in covariate values (its logarithm is proportional to that difference), not on the baseline hazard. This is called the proportional hazards assumption and needs to be checked.
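The hazard and relative hazard formulas above can be sketched in a few lines. This is only an illustration of the algebra, assuming a single covariate; the coefficient, covariate values, and baseline hazards are hypothetical and not taken from the study.

```python
import math

def hazard(baseline, beta, x):
    """Cox hazard for one class at time t: lambda_0(t) * exp(beta * x(t))."""
    return baseline * math.exp(beta * x)

def relative_hazard(beta, x_j, x_k):
    """Hazard of class j relative to class k at the same time t."""
    return math.exp(beta * (x_j - x_k))

beta = 0.5            # hypothetical regression coefficient
x_j, x_k = 3.0, 1.0   # hypothetical covariate values of two classes at time t

# The baseline hazard cancels in the ratio, so the relative hazard is the
# same whatever value lambda_0(t) takes at that time.
for baseline in (0.01, 0.2, 5.0):
    ratio = hazard(baseline, beta, x_j) / hazard(baseline, beta, x_k)
    assert math.isclose(ratio, relative_hazard(beta, x_j, x_k))

print(relative_hazard(beta, x_j, x_k))
```

Because the baseline cancels, it can be left unspecified when comparing classes, which is why only the coefficient needs to be estimated for such comparisons.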
  • 12. Methodology
      • Relative log risk is denoted by f(size) (for the median size, it is set to zero).
      • Examined the functional form with cubic spline functions using four knots:
        $f(\mathrm{size}) = \beta_0 + \beta_1\,\mathrm{size} + \beta_2 (\mathrm{size} - k_1)_+^3 + \beta_3 (\mathrm{size} - k_2)_+^3 + \beta_4 (\mathrm{size} - k_3)_+^3 + \beta_5 (\mathrm{size} - k_4)_+^3$ (1)
        where $(\mathrm{size} - k_n)_+ = \mathrm{size} - k_n$ if $\mathrm{size} - k_n > 0$, and $0$ otherwise. (2)
      • Examined the alternative model visually.
      • Tested whether the deviation from linearity was statistically significant: $H_0: \beta_2 = \beta_3 = \beta_4 = 0$
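A minimal sketch of the truncated cubic spline function defined above, written directly from the two equations. The knot positions and coefficients are hypothetical placeholders; the slides do not report the values used in the study.

```python
def positive_part(value):
    """(v)_+ : v if v > 0, otherwise 0."""
    return value if value > 0 else 0.0

def spline_log_risk(size, betas, knots):
    """f(size) = b0 + b1*size + sum over knots of b_n * (size - k_n)_+^3."""
    b0, b1, *cubic_coefs = betas
    total = b0 + b1 * size
    for coef, knot in zip(cubic_coefs, knots):
        total += coef * positive_part(size - knot) ** 3
    return total

# Hypothetical knots (LOC) and coefficients, for illustration only.
knots = [50, 200, 500, 2000]
betas = [0.0, 0.01, -1e-6, 5e-7, 2e-7, 1e-8]

# Below the first knot every cubic term vanishes and f is exactly linear.
print(spline_log_risk(10, betas, knots))
# Between the first and second knots only the first cubic term contributes.
print(spline_log_risk(100, betas, knots))

# With all cubic coefficients set to zero, f reduces to a straight line,
# which is the shape that a test for deviation from linearity compares against.
linear_betas = [0.0, 0.01, 0.0, 0.0, 0.0, 0.0]
print(spline_log_risk(1000, linear_betas, knots))
```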
  • 13. Methodology - Data Layout and Collection
      • We developed PERL scripts to extract source code, analyze CVS changes, and find whether a class is affected or not.
      • (A) What the data would look like if the conventional approach was used:
            class name   size   defect count
            A              75   0
            B             250   2
            C             300   2
            D             600   2
            E             800   3
            F             220   0
            G             300   0
            ...
      • (B) Novel approach: classes added to the system after the Mozilla 1.0 release date were measured until Feb 22, 2006.
            class name   start    end   event   size   state
            Y                0     50   0         75   0
            Y               50    100   1        200   1
            Y              100    200   0        300   1
            Z                0    200   1        250   0
            Z              200    800   0        180   1
            Z              800   1400   1        400   1
            Z             1400   1800   0        300   1
            ...
      • Each change resulted in an observation.
      • 15,545 observations
      • Events were identified by searching the CVS logs for the words ‘bug’, ‘defect’, and ‘fix’. When we sampled 100 logs randomly, we saw that this automated approach was correct for 98 of them.
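The following sketch shows one way the per-change observations of layout (B) could be built from a class's change history. It is written in Python rather than the PERL the authors used, and it is an assumption-laden illustration: the function names, the input format, and the log messages are hypothetical, the event flag is assumed to mark whether the change that closes an interval is a defect fix, and the `state` column is omitted because the slides do not define it.

```python
KEYWORDS = ("bug", "defect", "fix")  # the words searched for in the CVS logs

def is_defect_fix(log_message):
    """Flag a change as corrective if its log mentions any keyword."""
    text = log_message.lower()
    return any(word in text for word in KEYWORDS)

def to_observations(name, created_at, initial_size, changes, end_of_study):
    """Turn a change history into (name, start, end, event, size) rows.

    changes: list of (time, size_after_change, log_message), sorted by time.
    Each row covers the interval during which the class had the given size.
    """
    rows = []
    start, size = created_at, initial_size
    for time, new_size, log_message in changes:
        rows.append((name, start, time, int(is_defect_fix(log_message)), size))
        start, size = time, new_size
    # The last interval ends at the end of the study with no event observed.
    rows.append((name, start, end_of_study, 0, size))
    return rows

# Hypothetical history reproducing the rows shown for class Y.
changes_y = [
    (50, 200, "Add printing support"),
    (100, 300, "Fix crash on startup, bug 12345"),
]
rows_y = to_observations("Y", 0, 75, changes_y, 200)
for row in rows_y:
    print(row)
```

Plain substring matching also flags words such as "prefix", which is one reason a manual check of sampled logs, like the 98-out-of-100 check reported above, is worthwhile.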
  • 14. Results - Functional Form
      • [Figure: instantaneous relative risk of defect fix (y-axis, −1.0 to 2.0) versus size in LOC (x-axis, 0 to 12,000)]
      • When we use cubic spline functions, the logarithmic form is also obvious. The downward curve at the end covers less than 0.3% of the data points.
      • We can use log(size) directly in the Cox model.
  • 15. Results -- Modeling results
      • Fig. 5. Modeling results using logarithmic transform of size (from the manuscript submitted to TSE):
                        coef    exp(coef)   se(coef)   robust se   z      p
            log(size)   0.368   1.44        0.00732    0.018       20.4   0

            Rsquare = 0.152 (max possible = 1)
            Likelihood ratio test = 2565 on 1 df, p=0
            Wald test = 416 on 1 df, p=0
            Score (logrank) test = 2565 on 1 df, p=0, Robust Score = 142, p=0
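As a quick reading aid, the reported columns can be related to one another with two short calculations. This is a consistency check on the rounded values shown above, assuming the z statistic is the coefficient divided by the robust standard error; it is not a re-analysis.

```latex
\exp(\hat{\beta}) = \exp(0.368) \approx 1.44
\qquad\text{(a one-unit increase in } \ln(\mathrm{size}) \text{ multiplies the hazard by about } 1.44\text{)}

z = \frac{\hat{\beta}}{\text{robust se}} = \frac{0.368}{0.018} \approx 20.4
```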
  • 16. Outlier Analysis - Checking for overly influential data points
      • Plotted Martingale residuals.
      • [Figure: influence (y-axis) versus observation id (x-axis, 0 to 15,000), with the outliers marked]
      • Outliers are still valid observations.
      • Removing them only brings the unit effect of log size to 45% (small change).
      • Decided to keep them.
  • 17. Test of Proportional Hazards
      • Commonly, interaction with time is tested.
      • Example: a drug that is only effective in the first hour.
      • Note: this test can also become significant when a wrong functional form is used.
      • Result: p = 0.835, far from significant.
      • A smooth plot of the Schoenfeld residuals shows an almost perfectly straight line.
      • [Figure: Beta(t) for log(size) (y-axis, −20 to 20) versus time (x-axis, 0 to 2,000,000)]
  • 18. Model Fitness - Arjas plot
      • The Arjas plot shows the cumulative sum of expected events versus the cumulative sum of actual events.
      • It should follow the 45-degree line.
      • Ours follows the line closely.
      • Spearman correlation between actual and expected: 0.79
      • The model passes all the tests.
      • [Figure: cumulative sum of expected events (y-axis, 0 to 8,000) versus cumulative sum of actual events (x-axis, 0 to 8,000)]
  • 19. Interpretation of the results - bootstrapping
      • For Mozilla classes, one unit of increase in the natural log of size caused the rate of defect fix to increase by 44%.
      • We ran bootstrapping to derive an estimate of the 95% confidence interval.
          • Sampled 1,000 classes 1,000 times.
          • For each sample, a different Cox model was produced using all observations in the sample, and a point estimate was made for the log(size) effect.
          • The point estimate was 44%, and the 95% confidence interval calculated via this procedure was [39%, 49%].
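The resampling procedure above can be sketched as a generic percentile bootstrap over classes. This is a minimal illustration under stated assumptions: the `estimate` callable stands in for fitting a Cox model to the sampled classes (here a simple mean is swapped in so the example stays self-contained), the per-class values are synthetic, and the percentile method is an assumption, since the slides do not say how the interval was derived from the 1,000 estimates.

```python
import random
import statistics

def bootstrap_ci(units, estimate, n_resamples=1000, sample_size=None,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval.

    units: the resampling units (here, one entry per class, so that all
           observations of a class stay together in a resample).
    estimate: function mapping a list of units to a point estimate.
    """
    rng = random.Random(seed)
    size = sample_size or len(units)
    estimates = sorted(
        estimate(rng.choices(units, k=size))  # sample with replacement
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Synthetic per-class values, for illustration only (not study data).
data_rng = random.Random(1)
classes = [data_rng.gauss(0.37, 0.2) for _ in range(200)]

# statistics.mean is a stand-in for "fit a Cox model and read off the
# log(size) coefficient"; any estimator with the same interface fits here.
low, high = bootstrap_ci(classes, statistics.mean, n_resamples=1000)
print(low, high)
```

Resampling whole classes rather than individual observations keeps the repeated observations of one class together, which matters when a class contributes several rows.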
  • 20. Overall Results
      • The results from both approaches showed that the functional form of the relationship was close to logarithmic.
      • This implies that smaller modules are proportionally more troublesome.
      • Note that the results do not imply that smaller modules have fewer defects or a lower probability of having defects.
      • There is no threshold effect of size on defect proneness. The plot curves down for very large classes, but the confidence band gets wider too.
  • 21. Implications
      • Practical:
          • A 1,000-LOC class, although 10 times bigger than a 100-LOC class, is estimated to be only 2.33 times more defect-prone (95% CI [2.13, 2.54]).
          • If one has resources to inspect 10,000 LOC, it is better to pick 100 classes of 100 LOC than 10 classes of 1,000 LOC. The first approach is estimated to be 329% more effective (95% CI [293.70%, 369.48%]).
      • Theoretical:
          • Decomposition might have side effects.
          • If interface defects are responsible, as suggested in [Basili and Perricone 84], the extent of decomposition needs to be reconsidered.
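The practical figures can be reproduced from the log(size) coefficient of 0.368 reported in the modeling results. The sketch below assumes that the effectiveness of an inspection is proportional to the summed relative hazard of the inspected classes; that reading is an assumption, although it yields the numbers quoted on the slide.

```python
COEF = 0.368  # log(size) coefficient reported in the modeling results

def relative_risk(size_a, size_b, beta=COEF):
    """Hazard ratio of a class of size_a versus one of size_b.

    With log(size) as the covariate, exp(beta * (ln a - ln b)) = (a / b) ** beta.
    """
    return (size_a / size_b) ** beta

# A 1,000-LOC class versus a 100-LOC class: roughly 2.33.
ratio = relative_risk(1000, 100)

# Inspecting 10,000 LOC as 100 classes of 100 LOC versus 10 classes of
# 1,000 LOC, comparing the summed relative hazards (assumed measure).
many_small = 100 * relative_risk(100, 100)
few_large = 10 * relative_risk(1000, 100)
percent_more_effective = (many_small / few_large - 1) * 100  # roughly 329

# The interval endpoints follow from the CI of the ratio, [2.13, 2.54];
# the larger ratio gives the smaller gain, so the endpoints swap.
ci_low = (10 / 2.54 - 1) * 100
ci_high = (10 / 2.13 - 1) * 100

print(round(ratio, 2), round(percent_more_effective), ci_low, ci_high)
```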
  • 22. Related Work
      • Similar observations were reported earlier, but with a focus on the relationship between size and defect density [Hatton 97], [Hatton 98], and [Withrow 90].
      • Such studies observed a U-shaped curve and identified an optimum size that minimized defect density.
      • Later, El Emam et al. [El Emam 02] reported that such an approach can mask the true relationship between size and defects and mislead us by showing some threshold effects.
      • Indeed, the same points were made earlier in [Fenton and Neil 99] and even earlier in [Rosenberg 98].
      • In our study, we focused on the basic size--defect relationship.
  • 23. Directions for Future Research
      • Replicated studies for validation
      • Studying modularity: Are interface defects the reason, as suggested earlier [Basili and Perricone 84]? What is the interplay between coupling and size?
      • Studying people aspects: Do experts develop larger modules?
      • Studying process aspects: Are larger modules inspected and tested better? Systematic or non-systematic reuse? Copy-paste into larger classes?
  • 24. Conclusion
      • Our empirical results, using a large-scale product that offered thousands of data points, showed that:
          • The size--defect relationship took a logarithmic form.
          • Defect proneness increased as size increased.
          • There is no threshold value; the relationship is continuously increasing.
          • Smaller modules are proportionally more troublesome.
      • The results can be immediately useful for the Mozilla project.
      • The results also trigger many interesting research questions for the future.