PROMISE at ICSE’07
MODELING the EFFECT of SIZE on
DEFECT PRONENESS for OPEN-SOURCE
A. Güneş Koru1, Dongsong Zhang1, and Hongfang Liu2
1Department of Information Systems
Baltimore, MD, USA
2Georgetown Medical Center
Department of Bioinformatics, Biostatistics, and Biomathematics
Georgetown University, Washington, D.C., USA
E-mails: email@example.com, firstname.lastname@example.org, email@example.com
• UMBC, University of Maryland, Baltimore County (http://umbc.edu/)
• Public research university with a focus on graduate education.
• Theoretically, all campuses belong to the University of Maryland but
practically they look like different universities.
• UMBC is located in Baltimore, in a small suburban neighborhood called
Catonsville. UMBC is not:
• University of Maryland, College Park
• University of Baltimore (Business school)
• University of Maryland Baltimore (Medical School)
• Hongfang Liu is with Georgetown University, located in Washington,
D.C., and is interested in bioinformatics and health care.
• Size is perhaps the oldest software measure. It is mostly measured in lines
of code (sometimes in function points).
• Several studies found size to be associated with defect count. Earliest:
A linear model in [Akiyama 71].
• Many other measures (e.g. cyclomatic complexity [McCabe 76],
software science measures [Halstead 77]) are also correlated with size.
There is some consensus that these are also size measures [Fenton and
“Maybe size does not explain everything, but it explains a lot.”
Bojan Cukic, PROMISE 2007
• The functional form of this relationship is still not well understood.
• Commonly, practitioners assume a linear relationship [El Emam 05].
• The only general conclusion is that there is a continuously increasing
relationship between the two [Fenton and Ohlsson 00, El Emam et al.
Size--Defect Relationship: Alternative Forms
[Figure: three panels (a), (b), and (c), each plotting defects (y-axis) against size (x-axis)]
“Things are linear is open to questions”
Tim Menzies, PROMISE 2007
• (a) Linear: Smaller and larger modules are proportionally equally problematic
• (b) Quadratic: Larger modules are proportionally more problematic
• (c) Logarithmic: Smaller modules are proportionally more problematic
• Implications:
• Theoretical and Practical Importance
• Focused quality assurance
• Functional Enhancements
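The difference between the three forms is easiest to see as defects per line of code. The sketch below uses made-up coefficients (none of them come from the study) and only illustrates what "proportionally more problematic" means under each shape.

```python
import math

# Hypothetical coefficients chosen only to make the three shapes comparable;
# none of these numbers come from the study.
FORMS = {
    "linear": lambda size: 0.01 * size,
    "quadratic": lambda size: 0.00001 * size ** 2,
    "logarithmic": lambda size: 2.0 * math.log(size),
}

def defects_per_loc(form, size):
    """Expected defects divided by size: the 'proportional' view of a form."""
    return FORMS[form](size) / size

for name in FORMS:
    small = defects_per_loc(name, 100)    # a 100-LOC module
    large = defects_per_loc(name, 1000)   # a 1,000-LOC module
    print(f"{name:12s} small={small:.5f} large={large:.5f}")
```

Under the linear form the two ratios agree, under the quadratic form the larger module has the higher ratio, and under the logarithmic form the smaller module does.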
Why the relationship is still unclear...
• Many earlier studies did not fully explore alternative functional forms or test
whether the deviation from linearity was statistically significant.
• Linear models [Akiyama 71] or correlations [Andersson and Runeson 07] were
• A study stated that linear models could be good as first approximations and
that there was better tool support [Shen 85]
• The number of data points was very limited in the earlier studies (e.g. Akiyama 71).
• Deriving models analytically and then fitting data to validate those models [Lipow
• Accepting correlations as a sign of a linear relationship [Schneidewind and
Hoffman 78]. Correlations do not imply proportionality.
• The focus shifted to defect density, with observations of an optimal module
size that minimizes defect density: the U-shaped curve (Goldilocks conjecture)
[Withrow 90, Hatton 97, Hatton 98, etc.]. See [El Emam 02] for a detailed review.
• This approach can mask the plain size--defect relationship and mislead us. [El
Emam 02, Fenton and Neil 00, and Rosenberg 97]
• The relationship gets more difficult to understand from multivariate and
sophisticated machine learning models (e.g. from Neural Networks in [Khoshgoftaar 97]).
Conventional Approach to
Investigate Size--Defect Relationship
• All these studies share a common characteristic
• A software system is measured at a snapshot time, then the
obtained measurements are associated with the future defect count
(which might be pre-release or post-release). For example: [Koru and
Tian 03], [Khoshgoftaar 96]
• Usually, measurement and analysis performed at module level.
• A common problem is the availability of data [Fenton and Ohlsson
• Publicly available Open Source Software (OSS) repositories: Source
code, change data, and defect data [Koru and Tian 04].
Challenges with Using Conventional Method in
• Evolutionary aspects of OSS: continuous and concurrent functional
enhancements, defect fixes, and all other changes (perfective, adaptive, etc.).
Bazaar model rather than cathedral model [Raymond 99].
• OSS is usually developed by volunteers, with little planning and no requirements
or design documents; source code is the main artifact [Mockus et al. 00, Mockus
et al. 02].
• Quality assurance activities are not systematic in OSS (see [Zhao and Elbaum 03,
Koru et al. 07])
• So far, research using the conventional approach has focused on relatively better
planned, analyzed, designed, and tested closed source products.
• Internal validity problems caused by the dynamic OSS context:
• Deleted classes
• Size changes
• There might be closed source products developed in an evolutionary manner and
vice versa. Such comparisons are outside of the scope here (see [Paulson et al.
In this study...
“If developers play with a file, it can change its defect proneness”
Elaine Weyuker, PROMISE 2007
• To gain a better understanding of the size--defect relationship, we
• Novel approach that adopts Cox Proportional Hazards Modeling
with Recurrent Events (Cox Modeling) [Cox 72].
• The data comes from a large-scale long-lived OSS product Mozilla
• The evolutionary aspects of the Mozilla project were shown in other
• Gyimóthy et al. found that the size of Mozilla increased
significantly during successive releases.
• Mockus et al. found that there was no particular development
process in Mozilla.
In the rest of this presentation...
•Demonstrating the evolutionary aspects of Mozilla
•Modeling and Results
(A non-parametric approach)
• The instantaneous relative risk (hazard) of a defect fix, also called an event,
becomes the response variable. Note that it can recur.
• A complete size history is obtained for each class by measuring size at each
change, and corrective changes are marked.
• The time of each change is also noted. At each unique time, the hazard is calculated by
dividing the events at that time by the classes at risk at that time.
• Hazard function:
$$\lambda_i(t) = \lambda_0(t)\, e^{\beta x_i(t)} \qquad (1)$$
$\beta$ is the regression coefficient for $x_i(t)$ and $\lambda_0(t)$ is an unspecified non-negative
function of time called the baseline hazard function. It is the instantaneous
hazard of having an event without any covariate effect (i.e., when $\beta = 0$).
• Relative hazard: $e^{\beta (x_j(t) - x_k(t))}$
• Note that the relative hazard depends only on the difference in covariate
values, not on time (its logarithm is proportional to that difference). This is
called the proportional hazards assumption and needs to be
• Relative log risk is denoted by f(size) (for median size, it is set to zero).
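• Examine the functional model with cubic spline functions using four knots:
$$f(\text{size}) = \beta_0 + \beta_1\,\text{size} + \beta_2 (\text{size} - k_1)_+^3 + \beta_3 (\text{size} - k_2)_+^3 + \beta_4 (\text{size} - k_3)_+^3 + \beta_5 (\text{size} - k_4)_+^3$$
$$(\text{size} - k_n)_+ = \begin{cases} \text{size} - k_n, & \text{if } (\text{size} - k_n) > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$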
• Examined the alternative model visually
• Tested whether the deviation from linearity was statistically significant
$H_0: \beta_2 = \beta_3 = \beta_4 = 0$
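As a rough illustration of the spline model, the sketch below evaluates f(size) with the positive-part terms of equation (2). The knot positions and coefficients are hypothetical placeholders, not values estimated in the study.

```python
def positive_part(value):
    """(value)_+ from equation (2): the value itself if positive, else zero."""
    return value if value > 0 else 0.0

def spline_log_risk(size, betas, knots):
    """Evaluate f(size) = b0 + b1*size + sum over knots of b * (size - k)_+^3."""
    b0, b1, *knot_betas = betas
    total = b0 + b1 * size
    for beta, knot in zip(knot_betas, knots):
        total += beta * positive_part(size - knot) ** 3
    return total

# Hypothetical knots and coefficients, for illustration only.
knots = [50, 200, 800, 3000]
betas = [0.0, 0.002, -1e-8, 5e-9, 1e-10, 0.0]

print(spline_log_risk(10, betas, knots))    # below every knot: purely linear part
print(spline_log_risk(100, betas, knots))   # first knot term is now active
```

When every knot coefficient is zero, f(size) reduces to the linear term, which is the idea behind testing the deviation from linearity.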
Methodology - Data Layout and Collection
• We developed PERL scripts to extract source code, analyze CVS changes, and
to find whether a class is affected or not.
• (a) What the data would look like if the conventional approach was used:

class name   size   defect count
A            75     0
B            250    2
C            300    2
D            600    2
E            800    3
F            220    0
G            300    0
. . .

• (b) Novel approach: Classes added to the system after the Mozilla 1.0
release date were measured until Feb 22, 2006.
• Each change resulted in an observation
• 15,545 observations

class name   start   end    event   size   state
Y            0       50     0       75     0
Y            50      100    1       200    1
Y            100     200    0       300    1
Z            0       200    1       250    0
Z            200     800    0       180    1
Z            800     1400   1       400    1
Z            1400    1800   0       300    1
. . .

• Events were identified by searching the CVS logs for the words ‘bug’, ‘defect’,
and ‘fix’. When we sampled 100 logs randomly, we saw that this automated
approach was correct for 98 of them.
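To make layout (b) concrete, the sketch below computes the per-time hazard described earlier (events at a time divided by classes at risk at that time) for the Y and Z rows shown. It assumes each row covers the interval (start, end] on a shared clock and that an event is recorded at the end of its row; both are assumptions made for illustration, not details confirmed by the slides.

```python
# Rows copied from layout (b): (class name, start, end, event).
ROWS = [
    ("Y", 0, 50, 0), ("Y", 50, 100, 1), ("Y", 100, 200, 0),
    ("Z", 0, 200, 1), ("Z", 200, 800, 0), ("Z", 800, 1400, 1),
    ("Z", 1400, 1800, 0),
]

def empirical_hazard(rows):
    """Map each event time to events / classes at risk at that time."""
    event_times = sorted({end for _, _, end, event in rows if event == 1})
    hazard = {}
    for t in event_times:
        # Assumption: a row is at risk at t when its interval (start, end] contains t.
        at_risk = sum(1 for _, start, end, _ in rows if start < t <= end)
        events = sum(1 for _, _, end, event in rows if end == t and event == 1)
        hazard[t] = events / at_risk
    return hazard

print(empirical_hazard(ROWS))  # {100: 0.5, 200: 0.5, 1400: 1.0}
```

With only two classes the numbers are coarse; the point is the bookkeeping of who is at risk at each event time.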
Results - Functional Form
[Figure: instantaneous relative risk of defect fix (y-axis); x-axis ticks from 0 to 12000]
• When we use cubic spline functions, the logarithmic form is also obvious. The
downward curve at the end covers less than 0.3% of the data points. We can
therefore use log(size) directly in the Cox model.
Results -- Modeling results
MANUSCRIPT SUBMITTED TO TSE
            coef    exp(coef)   se(coef)   robust se   z      p
log(size)   0.368   1.44        0.00732    0.018       20.4   0

Rsquare= 0.152 (max possible= 1)
Likelihood ratio test= 2565 on 1 df, p=0
Wald test = 416 on 1 df, p=0
Score (logrank) test = 2565 on 1 df, p=0, Robust Score = 142 p=0
Fig. 5. Modeling results using logarithmic transform of size
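Two of the reported columns can be recomputed from the others as a sanity check. The sketch below assumes that exp(coef) is the hazard ratio per unit of log(size) and that z is the coefficient divided by the robust standard error.

```python
import math

coef = 0.368        # reported coefficient for log(size)
robust_se = 0.018   # reported robust standard error

hazard_ratio = math.exp(coef)   # multiplicative change in rate per unit of log(size)
z_score = coef / robust_se      # z statistic based on the robust standard error

print(round(hazard_ratio, 2), round(z_score, 1))  # 1.44 20.4
```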
Test of Proportional Hazards
• Commonly, interaction with time is tested
• Example: A drug only effective in the first hour.
• Note: This test can also become significant when a wrong functional form is
• Result: p = 0.835 highly
• A smooth plot of Schoenfeld residuals shows almost a perfectly straight line.
[Figure: Beta(t) for log(size) (y-axis); x-axis ticks from 0 to 2000000]
Interpretation of the results - bootstrapping
For Mozilla classes, one unit of increase in the natural
log of size caused the rate of defect fix to increase by
• We ran bootstrapping to derive an estimate of the 95%
• Sampled 1,000 classes 1,000 times.
• For each sample, a different Cox model was produced
using all observations in the sample, and a point
estimate was made for the log(size) effect.
• The point estimate was 44% and the 95% confidence
interval calculated via this procedure was [39% -- 49%].
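The resampling scheme above can be sketched as a cluster bootstrap: whole classes are drawn with replacement and every observation of a drawn class is kept. In the sketch below the Cox refit is replaced by a stand-in estimator (a least-squares slope on log(size)) and the data are synthetic; both are assumptions made to keep the example self-contained, and the resample counts are reduced for speed.

```python
import math
import random

def percentile(sorted_values, q):
    """Nearest-rank style percentile on an already sorted list (0 <= q <= 1)."""
    index = min(len(sorted_values) - 1, max(0, round(q * (len(sorted_values) - 1))))
    return sorted_values[index]

def cluster_bootstrap_ci(observations_by_class, estimator, n_resamples, n_classes, seed=0):
    """Resample whole classes with replacement; return a 95% percentile interval."""
    rng = random.Random(seed)
    names = list(observations_by_class)
    estimates = []
    for _ in range(n_resamples):
        sample = []
        for name in rng.choices(names, k=n_classes):
            sample.extend(observations_by_class[name])  # keep all rows of a class
        estimates.append(estimator(sample))
    estimates.sort()
    return percentile(estimates, 0.025), percentile(estimates, 0.975)

def slope_on_log_size(rows):
    """Stand-in estimator (not a Cox model): least-squares slope of y on log(size)."""
    xs = [math.log(size) for size, _ in rows]
    ys = [y for _, y in rows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

# Synthetic data: 200 hypothetical classes with 3 observations each.
TRUE_SLOPE = 0.5  # arbitrary value, unrelated to the reported results
data_rng = random.Random(1)
data = {}
for i in range(200):
    rows = []
    for _ in range(3):
        size = data_rng.randint(10, 5000)
        rows.append((size, TRUE_SLOPE * math.log(size) + data_rng.gauss(0, 0.1)))
    data[f"class_{i}"] = rows

low, high = cluster_bootstrap_ci(data, slope_on_log_size, n_resamples=200, n_classes=200)
print(low, high)
```

Resampling by class rather than by observation keeps the repeated measurements of one class together, which matters because those observations are not independent.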
• Note that the results using both approaches showed
that the functional form of the relationship was close
to a logarithmic form.
• Implies that smaller modules are proportionally more
• Note that the results do not imply that smaller modules have
fewer defects or a lower probability of having defects.
• There is no threshold effect of size on defect
proneness. The plot curves down for very large
classes but the confidence band gets larger too.
• A 1,000 LOC class, although 10 times bigger, is estimated to
be only 2.33 times more defect-prone (95% CI [2.13 , 2.54]).
• If one has resources to inspect 10,000 LOC, it is better to
pick 100 classes of 100 LOC each as opposed to picking
10 classes of 1,000 LOC each. The first approach would be
estimated to be 329% more effective (95% CI
• Decomposition might have side effects
• If the interface defects are responsible as suggested in
[Basili and Perricone 84], the extent of decomposition needs
to be reconsidered.
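The 2.33 and 329% figures follow from the log(size) coefficient. The sketch below assumes that the effectiveness of an inspection plan is the sum of the relative defect-fix rates of the inspected classes; that reading is an interpretation that reproduces the reported numbers, not a definition stated on the slides.

```python
coef = 0.368  # reported coefficient for log(size)

def relative_rate(size_a, size_b, beta=coef):
    """Rate of defect fixes at size_a relative to size_b: exp(beta * (ln a - ln b))."""
    return (size_a / size_b) ** beta

# One 1,000-LOC class compared with one 100-LOC class.
big_vs_small = relative_rate(1000, 100)

# Inspecting 10,000 LOC either way: 100 small classes versus 10 large ones.
many_small_vs_few_big = (100 * 1.0) / (10 * big_vs_small)

print(round(big_vs_small, 2))                      # 2.33
print(round((many_small_vs_few_big - 1) * 100))    # 329
```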
• Similar earlier observations were reported, however, by
focusing on the size-defect density relationship [Hatton 97]
[Hatton 98] and [Withrow 90].
• Such studies observed a U-shaped curve and identified an
optimum size that minimized defect density.
• Later, El Emam et al. [El Emam 02] reported that such an
approach can mask the true relationship between size and
defects and mislead us by showing some threshold effects.
• Indeed, the same points were made earlier in [Fenton and Neil 99]
and even earlier in [Rosenberg 98]
• In our study, we focused on the basic size--defect relationship.
Directions for Future Research
• Replicated Studies for validation
• Studying Modularity: Is the reason interface defects as
suggested earlier [Basili and Perricone 84]? What is the
interplay between coupling and size?
• Studying people aspects: Experts develop larger
• Studying process aspects: Larger modules inspected and
tested better? Systematic or non-systematic reuse? Copy-
paste into larger classes?
Our empirical results using a large-scale product that offered
thousands of data points showed that:
• Size--defect relationship took a logarithmic form
• Defect-proneness increased as size increased
• There is no threshold value; a continuously increasing
• Smaller modules are proportionally more troublesome
• Results can be immediately useful for the Mozilla project
• Results also trigger many interesting research questions
for the future.