Predictive uncertainty of deep models and its applications - NAVER Engineering
Presenter: Kimin Lee (PhD student, KAIST)
Date: April 2018
The predictive uncertainty (e.g., the entropy of the softmax distribution of a deep classifier) is indispensable: it is useful in many machine learning applications (e.g., active learning and ensemble learning) as well as when deploying the trained model in real-world systems. In order to improve the quality of the predictive uncertainty, we proposed a novel loss function for training deep models (ICLR 2018). We showed that confidence-calibrated deep models trained by our method can be very useful in various machine learning applications such as novelty detection (CVPR 2018) and ensemble learning (ICML 2017).
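As a quick illustration of the predictive uncertainty mentioned above, the entropy of a classifier's softmax distribution can be computed directly from its logits (a toy sketch; the logit values are illustrative):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_entropy(logits):
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

confident = np.array([8.0, 0.0, 0.0])   # peaked distribution -> low entropy
uncertain = np.array([1.0, 1.0, 1.0])   # flat distribution -> max entropy ln(3)

print(predictive_entropy(confident))
print(predictive_entropy(uncertain))
```

A peaked distribution yields entropy near 0, while a uniform one attains the maximum ln K; thresholding this value is a common way to flag inputs the model is unsure about.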
The ever-increasing number of parameters in deep neural networks poses challenges for memory-limited applications. Regularize-and-prune methods aim at meeting these challenges by sparsifying the network weights. In this context we quantify the output sensitivity to the parameters (i.e., their relevance to the network output) and introduce a regularization term that gradually lowers the absolute value of parameters with low sensitivity. Thus, a very large fraction of the parameters approach zero and are eventually set to zero by simple thresholding. Our method surpasses most of the recent techniques both in terms of sparsity and error rates. In some cases, the method reaches twice the sparsity obtained by other techniques at equal error rates.
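The regularize-and-prune idea can be sketched in a few lines: weights with low output sensitivity are shrunk toward zero and finally thresholded (a toy sketch; the sensitivity values here are random stand-ins, not actual gradients of a network output):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)                 # toy parameter vector
sens = np.abs(rng.normal(size=100))      # stand-in for output sensitivity |dy/dw|
sens /= sens.max()                       # normalize to [0, 1]

lam, thr = 0.05, 1e-2
for _ in range(200):
    # shrink low-sensitivity weights harder; high-sensitivity ones barely move
    w -= lam * (1.0 - sens) * np.sign(w)
    w[np.abs(w) < lam] = 0.0             # stop shrinkage from oscillating through zero

w[np.abs(w) < thr] = 0.0                 # final thresholding
sparsity = float(np.mean(w == 0.0))
print(f"sparsity: {sparsity:.2f}")
```

In the paper's setting the penalty is a regularization term added to the training loss, so shrinkage competes with the task gradient rather than acting alone as it does here.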
http://imatge-upc.github.io/vqa-2016-cvprw/
This thesis studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework. As a preliminary step, we explore Long Short-Term Memory (LSTM) networks used in Natural Language Processing (NLP) to tackle text-based Question-Answering. We then modify the previous model to accept an image as an input in addition to the question. For this purpose, we explore the VGG-16 and K-CNN convolutional neural networks to extract visual features from the image. These are merged with the word embedding or with a sentence embedding of the question to predict the answer. This work was successfully submitted to the Visual Question Answering Challenge 2016, where it achieved an accuracy of 53.62% on the test dataset. The developed software has followed the best programming practices and Python code style, providing a consistent baseline in Keras for different configurations.
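The merging step described above can be pictured with plain arrays: visual features and a question embedding are concatenated and mapped to answer probabilities (a toy numpy sketch with made-up dimensions, not the VGG-16/K-CNN pipeline itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real extractors: an image feature vector (VGG-16-style)
# and a sentence embedding of the question; dims are toy-sized.
img_feat = rng.normal(size=16)
q_emb = rng.normal(size=8)

merged = np.concatenate([img_feat, q_emb])    # merge by concatenation

n_answers = 5
W = rng.normal(scale=0.1, size=(n_answers, merged.size))
logits = W @ merged
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax over candidate answers
answer = int(np.argmax(probs))
print(answer, float(probs.sum()))
```

In the real system the projection W is learned end to end, and the question side may be a per-word embedding fed through an LSTM rather than a single sentence vector.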
Usage of Generative Adversarial Networks (GANs) in Healthcare - GlobalLogic Ukraine
The presentation is devoted to the application of Generative Adversarial Networks (GANs) in healthcare. We will briefly review the basic principles and features of such networks and outline the types of tasks in medical research and practice that can be solved with GANs. Then we will discuss examples of using GANs to solve some medical tasks.
This presentation by Vladyslav Kolbasin (Lead Software Developer, Consultant, GlobalLogic, Kharkiv) was delivered at AI Ukraine 2017 (Kharkiv) on September 24, 2017.
Continual Learning with Deep Architectures - Tutorial ICML 2021 - Vincenzo Lomonaco
Humans have the extraordinary ability to learn continually from experience. Not only can we apply previously learned knowledge and skills to new situations, we can also use these as the foundation for later learning. One of the grand goals of Artificial Intelligence (AI) is building an artificial "continual learning" agent that constructs a sophisticated understanding of the world from its own experience through the autonomous incremental development of ever more complex knowledge and skills (Parisi, 2019). However, despite early speculations and a few pioneering works (Ring, 1998; Thrun, 1998; Carlson, 2010), very little research and effort has been devoted to addressing this vision. Current AI systems greatly suffer from exposure to new data or environments that differ even slightly from the ones they were trained on (Goodfellow, 2013). Moreover, the learning process is usually constrained to fixed datasets within narrow and isolated tasks, which can hardly lead to the emergence of more complex and autonomous intelligent behaviors. In essence, continual learning and adaptation capabilities, while more often than not thought of as fundamental pillars of every intelligent agent, have been mostly left out of the main AI research focus.
In this tutorial, we propose to summarize the application of these ideas in light of more recent advances in machine learning research and in the context of deep architectures for AI (Lomonaco, 2019). Starting from a motivation and a brief history, we link recent Continual Learning advances to previous research endeavours on related topics, and we summarize the state of the art in terms of major approaches, benchmarks and key results. In the second part of the tutorial we plan to cover more exploratory studies about Continual Learning with weak supervision signals and its relationships with other paradigms such as Unsupervised, Semi-Supervised and Reinforcement Learning. We will also highlight the impact of recent neuroscience discoveries on the design of original continual learning algorithms, as well as their deployment in real-world applications. Finally, we will underline the notion of continual learning as a key technological enabler for Sustainable Machine Learning and its societal impact, and recap interesting research questions and directions worth addressing in the future.
Authors: Vincenzo Lomonaco, Irina Rish
Official Website: https://sites.google.com/view/cltutorial-icml2021
A critical review on Adversarial Attacks on Intrusion Detection Systems: Must... - PhD Assistance
The present article helps students in the USA, the UK, Europe, and Australia pursuing Computer Science postgraduate degrees to identify the right topic in the area of computer science, specifically on deep learning, adversarial attacks, and intrusion detection systems. These topics are researched in-depth at the University of Spain, Cornell University, the University of Modena and Reggio Emilia (Modena, Italy), and many more.
http://www.phdassistance.com/industries/computer-science-information/
PhD Assistance offers UK Dissertation Research Topics Services in Computer Science Engineering Domain. When you Order Computer Science Dissertation Services at PhD Assistance, we promise you the following – Plagiarism free, Always on Time, outstanding customer support, written to Standard, Unlimited Revisions support and High-quality Subject Matter Experts http://www.phdassistance.com/services/phd-literature-review/gap-identification/
For Any Queries : Website: www.phdassistance.com
Phd Research Lab : www.research.phdassistance.com
Email: info@phdassistance.com
Phone : +91-4448137070
Contact Name Ganesh / Vinoth Kumar
Functional-connectome biomarkers to meet clinical needs? - Gael Varoquaux
Extracting Functional-Connectome Biomarkers with Machine Learning: a talk in the symposium on how current predictive connectivity models meet clinicians' needs.
This talk is a bit provocative: it first sets out visions, before offering a few technical suggestions.
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati... - Ian Morgan
Professor Steve Roberts, Machine learning research group and Oxford-Man Institute + Alan Turing Institute. Steve gave this talk on the 24th January at the London Bayes Nets meetup.
EXTENDING OUTPUT ATTENTIONS IN RECURRENT NEURAL NETWORKS FOR DIALOG GENERATION - ijaia
In natural language processing, attention mechanisms in neural networks are widely utilized. In this paper, the research team explores a new mechanism for extending output attention in recurrent neural networks for dialog systems. The new attention method was compared with the current method in generating dialog sentences using a real dataset. Our architecture exhibits several attractive properties: it better handles long sequences, and it can generate more reasonable replies in many cases.
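A minimal form of attention over RNN outputs looks like this: score each output vector, normalize the scores with a softmax, and take the weighted sum as a context vector (a toy sketch; the scoring vector and the "RNN outputs" are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
T, H = 6, 4                        # sequence length, hidden size
outputs = rng.normal(size=(T, H))  # stand-in for RNN outputs at each time step

# Toy dot-product scoring vector; real models learn this (or an MLP scorer)
v = rng.normal(size=H)
scores = outputs @ v
weights = np.exp(scores - scores.max())
weights /= weights.sum()           # softmax attention weights over the T steps
context = weights @ outputs        # attention-weighted summary of the outputs

print(weights.round(3), context.shape)
```

The context vector then conditions the decoder when generating the reply, letting it draw on all time steps instead of only the final hidden state.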
Recommendation system using collaborative deep learning - Ritesh Sawant
Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendations. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse. To address this problem, we generalize recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and propose in this paper a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.
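The coupling CDL describes, ratings factorization tied to a representation learned from content, can be caricatured with a one-layer "encoder" in numpy (a toy sketch; CDL itself uses a stacked denoising autoencoder inside a hierarchical Bayesian model):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k, d = 20, 15, 4, 10
R = (rng.random((n_users, n_items)) < 0.2).astype(float)  # sparse implicit feedback
X = rng.normal(size=(n_items, d))                         # item content features

U = rng.normal(scale=0.1, size=(n_users, k))              # user factors
V = rng.normal(scale=0.1, size=(n_items, k))              # item factors
W = rng.normal(scale=0.1, size=(d, k))                    # one-layer "encoder" (toy)

lr, lam = 0.05, 0.1
for _ in range(300):
    E = U @ V.T - R                                # rating reconstruction error
    U -= lr * (E @ V)
    V -= lr * (E.T @ U + lam * (V - X @ W))        # pull V toward the content encoding
    W -= lr * (lam * X.T @ (X @ W - V))            # train encoder to match V

loss = float(np.mean((U @ V.T - R) ** 2))
print(f"final rating MSE: {loss:.4f}")
```

The lam-weighted term is what ties the item factors V to the content encoding; setting lam = 0 recovers plain matrix factorization, which is exactly what suffers under rating sparsity.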
A Deep Dive into Classification with Naive Bayes. Along the way we take a look at some basics from Ian Witten's Data Mining book and dig into the algorithm.
Presented on Wed Apr 27 2011 at SeaHUG in Seattle, WA.
Imputation techniques for missing data in clinical trials - Nitin George
Missing data are unavoidable in clinical and epidemiological research. Missing data lead to bias and loss of information in analyses. Often we are unaware of missing-data techniques because we rely on software defaults. The objective of this seminar is to introduce different missing-data mechanisms and imputation techniques for missing data, with the help of examples.
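Two of the simplest imputation techniques the seminar alludes to, mean imputation and last observation carried forward, are easy to show on a toy series (the values are made up):

```python
import numpy as np

# Toy series of lab values with two missing entries
x = np.array([5.1, np.nan, 4.8, 6.0, np.nan, 5.5])

mean_imp = np.where(np.isnan(x), np.nanmean(x), x)      # mean of observed = 5.35
median_imp = np.where(np.isnan(x), np.nanmedian(x), x)  # median of observed = 5.3

# Last observation carried forward (LOCF), common in longitudinal trials
locf = x.copy()
for i in range(1, len(locf)):
    if np.isnan(locf[i]):
        locf[i] = locf[i - 1]

print(mean_imp)
print(locf)
```

Both are single-value imputations and understate uncertainty; multiple imputation, which the seminar topic covers, draws several plausible values instead of one.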
Analysis of crop yield prediction using data mining techniques - eSAT Journals
Abstract
The agrarian sector in India is facing a rigorous problem in maximizing crop productivity. More than 60 percent of crops still depend on monsoon rainfall. Recent developments in information technology for agriculture have made crop yield prediction an interesting research area. The problem of yield prediction is a major problem that remains to be solved using available data, and data mining techniques are a good choice for this purpose. Different data mining techniques are used and evaluated in agriculture for estimating the coming year's crop production. This paper presents a brief analysis of crop yield prediction using the Multiple Linear Regression (MLR) technique and a density-based clustering technique for a selected region, the East Godavari district of Andhra Pradesh in India.
Keywords: Agrarian Sector, Crop Production, Data Mining, Density based clustering, Information Technology, Multiple Linear Regression, Yield Prediction.
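A multiple linear regression of the kind used for yield prediction can be fit with ordinary least squares (a toy sketch; the rainfall/area/yield numbers are invented, not data from East Godavari):

```python
import numpy as np

# Hypothetical predictors: rainfall (mm) and area sown (ha), with yield (t/ha)
rain = np.array([900.0, 1100.0, 1000.0, 950.0, 1200.0])
area = np.array([50.0, 55.0, 52.0, 51.0, 60.0])
crop_yield = np.array([2.1, 2.8, 2.5, 2.3, 3.1])

X = np.column_stack([np.ones_like(rain), rain, area])  # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, crop_yield, rcond=None)  # least-squares coefficients

pred = X @ beta
print(beta.round(4), pred.round(2))
```

The fitted beta gives the intercept and one slope per predictor; predicting next year's yield is then a single matrix-vector product with that year's rainfall and area.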
A FUZZY LOGIC BASED SCHEME FOR THE PARAMETERIZATION OF THE INTER-TROPICAL DIS... - ijfls
In this paper, a fuzzy logic based scheme for the parameterization of the Inter-Tropical Discontinuity (ITD) over Nigeria is presented. The scheme was developed in order to provide a computational basis for Numerical Weather Prediction (NWP) modeling over Nigeria. The scheme uses a fuzzified 2.5° by 5° resolution grid box, a 10-rows-by-4-columns (10×4) matrix with the rows classified into 10 zones. The two extreme zones, represented by the five (5) boundary points or two-dimensional (2-D) lattice nodes (O1 – O5), define the matrix boundaries or lattice edges, and hence the meridional limits of the ITD position. The scheme is simple enough to be included as an ITD parameterization by NWP modelers over West Africa.
The Naive Bayes classifier is based on Bayes' theorem. In statistics and probability theory, Bayes' theorem describes the probability of an event based on conditions related to that event. For more information on Naive Bayes, see: http://www.transtutors.com/homework-help/statistics/naive-bayes.aspx
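Bayes' theorem in action on a toy screening example (all the probabilities are illustrative):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Toy screening example: 1% prevalence, 95% sensitivity, 5% false-positive rate
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Total probability of a positive test (law of total probability)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
# Posterior probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # 0.161
```

Despite the accurate test, the low prior drags the posterior down to about 16%, which is exactly the prior-times-likelihood trade-off the theorem expresses.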
Sentiment analysis using naive bayes classifier - Dev Sahu
This presentation contains a short description of the naive Bayes classifier algorithm, a machine learning approach to sentiment detection and text classification.
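The core of a naive Bayes sentiment classifier fits in a few lines: class priors plus smoothed per-word likelihoods (a toy sketch on a four-document corpus):

```python
from collections import Counter
import math

train = [("good great fun", "pos"), ("great movie", "pos"),
         ("bad boring", "neg"), ("bad awful movie", "neg")]

counts = {"pos": Counter(), "neg": Counter()}
docs = Counter()
for text, label in train:
    docs[label] += 1
    counts[label].update(text.split())

vocab = {w for c in counts.values() for w in c}

def predict(text):
    scores = {}
    for label in counts:
        # log prior + sum of log likelihoods with Laplace (add-one) smoothing
        score = math.log(docs[label] / sum(docs.values()))
        total = sum(counts[label].values())
        for w in text.split():
            score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("great fun movie"))   # pos
print(predict("boring awful"))      # neg
```

The add-one smoothing keeps unseen words from zeroing out a class, and working in log space avoids underflow on longer documents.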
To present at the seminar in DASH Lab, SKKU, I chose the paper "Transferable GAN-generated Images Detection" (ICML 2020).
For more detail, see: https://arxiv.org/abs/2008.04115
ESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACH - cscpconf
Due to the intangible nature of "software", accurate and reliable software effort estimation is a challenge in the software industry. It is unlikely to expect very accurate estimates of software development effort because of the inherent uncertainty in software development projects and the complex and dynamic interaction of factors that impact software development. Heterogeneity exists in software engineering datasets because data is made available from diverse sources. This can be reduced by defining certain relationships between the data values by classifying them into different clusters. This study focuses on how the combination of clustering and regression techniques can reduce the potential problems in predictive effectiveness due to heterogeneity of the data. Using a clustered approach creates subsets of data having a degree of homogeneity that enhances prediction accuracy. It was also observed in this study that ridge regression performs better than the other regression techniques used in the analysis.
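The cluster-then-regress idea can be sketched end to end: split heterogeneous data with k-means, fit a ridge model per cluster, and compare against a single global ridge fit (a toy sketch with synthetic "project" data, not the study's datasets):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical project populations with different effort dynamics
x1 = rng.normal(10, 1, size=30); y1 = 3 * x1 + rng.normal(0, 0.5, 30)
x2 = rng.normal(30, 1, size=30); y2 = 8 * x2 + rng.normal(0, 0.5, 30)
x = np.concatenate([x1, x2]); y = np.concatenate([y1, y2])

# Tiny 1-D k-means (2 clusters) to split the heterogeneous data
c = np.array([x.min(), x.max()])
for _ in range(10):
    assign = np.argmin(np.abs(x[:, None] - c), axis=1)
    c = np.array([x[assign == j].mean() for j in (0, 1)])

def ridge(xs, ys, lam=1.0):
    A = np.column_stack([np.ones_like(xs), xs])
    return np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ ys)

def mse(beta, xs, ys):
    A = np.column_stack([np.ones_like(xs), xs])
    return float(np.mean((A @ beta - ys) ** 2))

models = [ridge(x[assign == j], y[assign == j]) for j in (0, 1)]
clustered_mse = np.mean([mse(models[j], x[assign == j], y[assign == j]) for j in (0, 1)])
global_mse = mse(ridge(x, y), x, y)
print(f"clustered: {clustered_mse:.3f}  global: {global_mse:.3f}")
```

Because each cluster is internally homogeneous, the per-cluster fits track their own slopes, while the single global line has to compromise between the two populations.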
In the modern world, we are permanently using, leveraging, interacting with, and relying upon systems of ever higher sophistication, ranging from our cars, recommender systems in eCommerce, and networks when we go online, to integrated circuits when using our PCs and smartphones, security-critical software when accessing our bank accounts, and spreadsheets for financial planning and decision making. The complexity of these systems coupled with our high dependency on them implies both a non-negligible likelihood of system failures and a high potential that such failures have significant negative effects on our everyday life. For that reason, it is a vital requirement to keep the harm of emerging failures to a minimum, which means minimizing the system downtime as well as the cost of system repair. This is where model-based diagnosis comes into play.
Model-based diagnosis is a principled, domain-independent approach that can be generally applied to troubleshoot systems of a wide variety of types, including all the ones mentioned above. It exploits and orchestrates techniques for knowledge representation, automated reasoning, heuristic problem solving, intelligent search, learning, stochastics, statistics, decision making under uncertainty, as well as combinatorics and set theory to detect, localize, and fix faults in abnormally behaving systems.
In this talk, we will give an introduction to the topic of model-based diagnosis, point out the major challenges in the field, and discuss a selection of approaches from our research addressing these challenges. For instance, we will present methods for the optimization of the time and memory performance of diagnosis systems, show efficient techniques for a semi-automatic debugging by interacting with a user or expert, and demonstrate how our algorithms can be effectively leveraged in important application domains such as scheduling or the Semantic Web.
Software Reliability Growth Model with Logistic-Exponential Testing-Effort F... - IDES Editor
Software reliability is one of the important factors of software quality. Before software is delivered to market, it is thoroughly checked and errors are removed. Every software company wants to develop software that is error free. Software reliability growth models help the software industry to develop software which is error free and reliable. In this paper an analysis is done by incorporating a logistic-exponential testing-effort function into an NHPP software reliability growth model, and its release policy is also observed. Experiments are performed on real datasets. Parameters are estimated, and it is observed that our model fits the datasets best.
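The general shape of a testing-effort-dependent NHPP growth model is m(t) = a(1 - exp(-b W(t))), where W(t) is the cumulative testing effort. Below is a sketch with a plain logistic effort curve standing in for the paper's logistic-exponential function (all parameter values are assumed, not fitted):

```python
import math

# Generic effort-dependent NHPP sketch: m(t) = a * (1 - exp(-b * W(t)))
a, b = 100.0, 0.05           # total expected faults, fault detection rate (assumed)
N, k, t0 = 40.0, 0.3, 10.0   # effort-curve parameters (assumed)

def W(t):
    return N / (1.0 + math.exp(-k * (t - t0)))    # cumulative testing effort

def m(t):
    return a * (1.0 - math.exp(-b * W(t)))        # expected faults found by time t

for t in (0, 10, 20, 40):
    print(t, round(m(t), 2))
```

m(t) rises monotonically and saturates below a, which is what makes such models usable for release-policy decisions: testing can stop once the expected remaining faults a - m(t) fall under a target.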
Adversarial Variational Autoencoders to extend and improve generative model -... - Loc Nguyen
Generative artificial intelligence (GenAI) has been developing with many incredible achievements like ChatGPT and Bard. The deep generative model (DGM) is a branch of GenAI which is preeminent in generating raster data such as images and sound, due to the strong points of deep neural networks (DNNs) in inference and recognition. The built-in inference mechanism of the DNN, which simulates the synaptic plasticity of the human neural network, fosters the generation ability of the DGM, which produces surprising results with the support of statistical flexibility. Two popular approaches in DGM are Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN). Both VAE and GAN have their own strong points, although they share an underlying theory of statistics as well as incredible complexity via the hidden layers of the DNN, where the DNN becomes an effective encoding/decoding function without concrete specifications. In this research, I try to unify VAE and GAN into a consistent and consolidated model called Adversarial Variational Autoencoders (AVA), in which VAE and GAN complement each other: for instance, VAE is a good data generator, encoding data via the excellent ideology of Kullback-Leibler divergence, and GAN is a significantly important method to assess the reliability of data as realistic or fake. In other words, AVA aims to improve the accuracy of generative models; besides, AVA extends the function of simple generative models. In methodology, this research focuses on a combination of applied mathematical concepts and skillful techniques of computer programming in order to implement and solve complicated problems as simply as possible.
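One way to picture the unification (the notation below is generic, not necessarily the paper's exact formulation): the VAE contributes an evidence lower bound and the GAN contributes a minimax game, and an AVA-style model trains the two jointly.

```latex
% VAE term: reconstruction plus KL regularization of the encoder q(z|x)
\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]
  - \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)

% GAN term: discriminator D judges real data x against decoded samples G(z)
\mathcal{L}_{\mathrm{GAN}} = \mathbb{E}_{x}\big[\log D(x)\big]
  + \mathbb{E}_{z\sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]

% Joint training: maximize the ELBO while playing the adversarial game,
% with the VAE decoder reused as the generator G
\max_{q,\,p}\;\mathcal{L}_{\mathrm{VAE}}
  \qquad \min_{G}\max_{D}\;\mathcal{L}_{\mathrm{GAN}}
```

The key structural point is that the decoder appears in both terms: the ELBO pushes it to reconstruct faithfully, while the discriminator pushes its samples toward realism.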
Image De-Noising Using Deep Neural Network - aciijournal
A deep neural network, as part of a deep learning algorithm, is a state-of-the-art approach to finding higher-level representations of input data, and it has been introduced successfully to many practical and challenging learning problems. The primary goal of deep learning is to use large data to help solve a given machine learning task. We propose a methodology for an image de-noising project defined by this model and train on a large image database to obtain the experimental output. The result shows the robustness and efficiency of our algorithm.
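The supervised de-noising setup, noisy input in and clean target out, can be shown with a single linear layer trained by gradient descent (a drastically simplified numpy stand-in for a deep network; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "images": clean 16-pixel patches plus Gaussian noise
clean = rng.random((200, 16))
noisy = clean + rng.normal(0, 0.3, clean.shape)

# One linear layer trained by gradient descent to map noisy -> clean
W = np.zeros((16, 16))
lr = 0.1
for _ in range(1000):
    grad = noisy.T @ (noisy @ W - clean) / len(clean)  # MSE gradient
    W -= lr * grad

before = float(np.mean((noisy - clean) ** 2))
after = float(np.mean((noisy @ W - clean) ** 2))
print(f"MSE before: {before:.4f}, after: {after:.4f}")
```

A deep network replaces the single matrix with stacked nonlinear layers, but the training signal is the same: minimize reconstruction error against clean targets.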
Science has escaped the lab and is roaming free in the world. People use software to understand the world. What tools are needed to support that work?
GALE: Geometric active learning for Search-Based Software Engineering - CS, NcState
Multi-objective evolutionary algorithms (MOEAs) help software engineers find novel solutions to complex problems. When automatic tools explore too many options, they are slow to use and hard to comprehend. GALE is a near-linear time MOEA that builds a piecewise approximation to the surface of best solutions along the Pareto frontier. For each piece, GALE mutates solutions towards the better end. In numerous case studies, GALE finds comparable solutions to standard methods (NSGA-II, SPEA2) using far fewer evaluations (e.g. 20 evaluations, not 1,000). GALE is recommended when a model is expensive to evaluate, or when some audience needs to browse and understand how an MOEA has made its conclusions.
Three Laws of Trusted Data Sharing: Building a Better Business Case for Dat... (CS, NcState)
Discussions about sharing:
- Too much fear
- Not enough about benefits
Can we learn more from sharing than hoarding?
- Yes (results from SE)
Three laws of trusted data sharing:
- For SE quality prediction...
- Better models from shared privatized data than from all raw data
Q: does this work for other kinds of data?
A: don’t know… yet
Ken and Tim: Software Assurance Research at West Virginia (CS, NcState)
SA @ WV(software assurance research at West Virginia)
Kenneth McGill
NASA IV&V Facility Research Lead
304.367.8300
Kenneth.McGill@ivv.nasa.gov
Dr. Tim Menzies Ph.D. (WVU)
Software Engineering Research Chair
tim@menzies.us
Next Generation “Treatment Learning” (finding the diamonds in the dust) (CS, NcState)
Q: How have dummies (like me) managed to gain (some) control over a (seemingly) complex world?
A: The world is simpler than we think.
◆ Models contain clumps
◆ A few collar variables decide which clumps to use.
Essentials of Automations: The Art of Triggers and Actions in FME (Safe Software)
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that lead to closing the deal.
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI Powered automation technology capabilities of UiPath. Also, hosted by our local partners Marc Ellis, you will enjoy a half-day packed with industry insights and automation peers networking.
📕 Curious on our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35 Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 Discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services.
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
The Metaverse and AI: how can decision-makers harness the Metaverse for their... (Jen Stirrup)
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs (Alex Pruden)
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Quantum Computing: Current Landscape and the Future Role of APIs
PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"
1. Introduction
Naive Bayes and EM for software effort prediction
Missing data handling strategies
Experiments
Threats.
Conclusion and future work
Handling missing data in software effort prediction with naive Bayes and EM algorithm
Wen Zhang, Ye Yang, Qing Wang
Laboratory for Internet Software Technologies
Institute of Software, Chinese Academy of Sciences
Beijing 100190, P.R. China
{zhangwen,ye,wq}@itechs.iscas.ac.cn
7th International Conference on Predictive Models in Software Engineering (PROMISE), 2011
Wen Zhang, Ye Yang, Qing Wang Software effort prediction with naive Bayes and EM algorithm
Outline
1 Introduction
2 Naive Bayes and EM for software effort prediction
3 Missing data handling strategies
Missing data toleration strategy.
Missing data imputation strategy
4 Experiments
The datasets
Experiment setup
Experimental results
5 Threats.
6 Conclusion and future work
Effort prediction with missing data.
The knowledge on software project effort stored in historical datasets can be used to develop predictive models, e.g. by statistical methods such as linear regression and correlation analysis, to predict the effort of new incoming projects.
Usually, most historical effort datasets contain a large amount of missing data.
Effort prediction with missing data.
Due to the small sizes of most historical databases, the common practice of ignoring projects with missing data will lead to biased and inaccurate prediction models.
For these reasons, how to handle missing data in software effort datasets is becoming an important problem.
Sample data
The historical effort data of projects is organized as shown in the following table.

Table: The sample data in a historical project dataset.

D     X1    ...  Xj    ...  Xn    H
D1    x11   ...  x1j   ...  x1n   h1
...   ...   ...  ...   ...  ...   ...
Di    xi1   ...  xij   ...  xin   hi
...   ...   ...  ...   ...  ...   ...
Dm    xm1   ...  xmj   ...  xmn   hm

X_j (1 ≤ j ≤ n) denotes an attribute of project D_i (1 ≤ i ≤ m). h_i is the effort class label of D_i, and it is derived from the real effort of project D_i.
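As a concrete illustration of this layout, the table can be held as a binary matrix with a mask for the missing entries discussed later. This is only a sketch: the toy values and the names X, h, and observed are illustrative, not taken from the paper or its datasets.

```python
import numpy as np

# A toy stand-in for the historical project table above: rows are projects
# D_i, columns are Boolean attributes X_j, and h holds the effort class
# label of each project. np.nan marks an unobserved (missing) value.
X = np.array([
    [1.0, 0.0, 1.0],       # D1: fully observed
    [0.0, np.nan, 1.0],    # D2: X2 missing
    [1.0, 1.0, np.nan],    # D3: X3 missing
    [np.nan, 0.0, 0.0],    # D4: X1 missing
])
h = np.array([0, 1, 0, 1])   # effort class labels c_t, encoded as 0..l-1

m, n = X.shape               # m projects, n attributes
observed = ~np.isnan(X)      # mask: True where x_ij is observed

print(m, n, int(observed.sum()))
```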
Sample data.
There are l effort classes over all the projects in a dataset; that is, h_i is equal to one of the elements of {c_1, ..., c_l}.
The attributes X_j are independent of each other and take Boolean values without missing data, i.e. x_ij ∈ {0, 1}.
Formulation of the problem.
An effort dataset Y_com contains m historical projects, Y_com = (D_1, ..., D_i, ..., D_m)^T, where D_i (1 ≤ i ≤ m) is a historical project and D_i = (x_i1, ..., x_ij, ..., x_in)^T is represented by n attributes X_j (1 ≤ j ≤ n).
h_i denotes the effort class label of project D_i. Each x_ij, the value of attribute X_j (1 ≤ j ≤ n) on D_i, may be observed or missing.
Cross validation on effort prediction is used to evaluate the performance of the missing data handling techniques.
Motivation.
The EM (Expectation Maximization) algorithm is a method for finding maximum likelihood or maximum a posteriori estimates of parameters in statistical models.
The motivation of applying EM to naive Bayes is to augment the unlabeled projects, with their estimated effort class labels, into the labeled data sets.
Thus, the performance of classification would be improved by using more data to train the prediction model.
Labeled projects and unlabeled projects.
For a labeled project D_i^L, its effort class P(h_i = c_t | D_i^L) ∈ {0, 1} is determinate.
For an unlabeled project D_i^U, its label P(h_i = c_t | D_i^U) is unknown.
However, if we can assign a predicted effort class to D_i^U, then D_i^U can also be used to update the estimates P(X_j = 0 | c_t), P(X_j = 1 | c_t) and P(c_t), and further to refine the effort prediction model P(c_t | D_i). This process is described in Equations 1, 2, 3 and 4.
Estimating P^{(τ+1)}(X_j = 1 | c_t).
The likelihood of occurrence of X_j with respect to c_t at the (τ+1)-th iteration is updated by Equation 1, using the estimates at the τ-th iteration:

\[
P^{(\tau+1)}(X_j = 1 \mid c_t)
  = \frac{1 + \sum_{i=1}^{m} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}
         {n + \sum_{j=1}^{n} \sum_{i=1}^{m} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)} \qquad (1)
\]

In practice, we interpret P^{(τ+1)}(X_j = 1 | c_t) as the probability of attribute X_j appearing in a project whose effort class is c_t.
Estimating P^{(τ+1)}(X_j = 0 | c_t).
Accordingly, the likelihood of non-occurrence of X_j with respect to c_t at the (τ+1)-th iteration, P^{(τ+1)}(X_j = 0 | c_t), is estimated by Equation 2:

\[
P^{(\tau+1)}(X_j = 0 \mid c_t) = 1 - P^{(\tau+1)}(X_j = 1 \mid c_t) \qquad (2)
\]
Estimating P^{(τ+1)}(c_t).
Second, the effort class prior probability, P^{(τ+1)}(c_t), is updated in the same manner by Equation 3, using the estimates at the τ-th iteration. In practice, we may regard P^{(τ+1)}(c_t) as the prior probability of class label c_t appearing in all the software projects.

\[
P^{(\tau+1)}(c_t) = \frac{1 + \sum_{i=1}^{m} P^{(\tau)}(h_i = c_t \mid D_i)}{l + m} \qquad (3)
\]
Estimating P^{(τ+1)}(h_i' = c_t | D_i').
Third, the posterior probability of an unlabeled project D_i' belonging to an effort class c_t at the (τ+1)-th iteration, P^{(τ+1)}(h_i' = c_t | D_i'), is updated using Equation 4:

\[
P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})
  = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})}
  = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}
         {\sum_{t=1}^{l} P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)} \qquad (4)
\]
Estimating P^{(τ+1)}(h_i' = c_t | D_i') (continued).
Hereafter,
for labeled projects, if x_ij = 1, then P^{(τ)}(x_ij | c_t) = P^{(τ)}(X_j = 1 | c_t); otherwise x_ij = 0, and P^{(τ)}(x_ij | c_t) = P^{(τ)}(X_j = 0 | c_t).
for unlabeled projects, if x_i'j = 1, then P^{(τ)}(x_i'j | c_t) = P^{(τ)}(X_j = 1 | c_t); otherwise x_i'j = 0, and P^{(τ)}(x_i'j | c_t) = P^{(τ)}(X_j = 0 | c_t).
Here, P^{(0)}(X_j = 1 | c_t) and P^{(0)}(c_t) are initially estimated from the labeled projects alone at the first iteration, and the unlabeled projects are appended into the learning process after they have been assigned probabilistic effort classes by P^{(1)}(h_i' = c_t | D_i').
Predicting the effort class of unlabeled projects.
We iterate Equations 1, 2, 3 and 4 until their estimates converge to stable values.
Then, P^{(τ+1)}(h_i' = c_t | D_i') is used to predict the effort class of D_i'.
The c_t ∈ {c_1, ..., c_l} that maximizes P^{(τ+1)}(h_i' = c_t | D_i') is regarded as the effort class of D_i'.
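The iterative procedure over Equations 1–4 can be sketched in Python. This is a minimal sketch under stated assumptions: fully observed Boolean attributes, labels encoded 0..l-1 with -1 marking unlabeled projects, and illustrative names (nb_em, post, lik1) that are not from the paper.

```python
import numpy as np

def nb_em(X, h, l, iters=50):
    """Naive Bayes + EM over Equations 1-4 (sketch).
    X: (m, n) Boolean attribute matrix, fully observed.
    h: (m,) effort class labels in 0..l-1, or -1 for unlabeled projects.
    Returns (prior, lik1, post)."""
    m, n = X.shape
    # P(h_i = c_t | D_i): one-hot for labeled projects, uniform otherwise.
    post = np.full((m, l), 1.0 / l)
    labeled = h >= 0
    post[labeled] = np.eye(l)[h[labeled]]

    for _ in range(iters):
        s = X.T @ post                              # sum_i x_ij P(h_i=c_t|D_i), shape (n, l)
        lik1 = (1.0 + s) / (n + s.sum(axis=0))      # Equation 1: P(X_j=1|c_t)
        # Equation 2 is implicit: P(X_j=0|c_t) = 1 - lik1.
        prior = (1.0 + post.sum(axis=0)) / (l + m)  # Equation 3: P(c_t)
        # Equation 4: class posteriors, computed in log space for stability.
        logp = (np.log(prior)
                + X @ np.log(lik1)
                + (1.0 - X) @ np.log(1.0 - lik1))   # shape (m, l)
        new = np.exp(logp - logp.max(axis=1, keepdims=True))
        new /= new.sum(axis=1, keepdims=True)
        post[~labeled] = new[~labeled]              # labeled posteriors stay fixed
    return prior, lik1, post

# Two labeled projects per class plus two unlabeled ones.
X = np.array([[1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 0],
              [1, 0, 1], [0, 1, 1]], dtype=float)
h = np.array([0, 0, 1, 1, -1, -1])
prior, lik1, post = nb_em(X, h, l=2)
print(post.argmax(axis=1))   # predicted effort classes per project
```

Note that, as in Equation 1, the smoothed likelihoods of each class sum to one over the n attributes, and the labeled posteriors are never overwritten during the loop.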
Initial setting.
When we use Equation 1 to estimate the likelihood of X_j with respect to c_t, P(X_j = 1 | c_t) or P(X_j = 0 | c_t), we do not consider missing values involved in x_ij (1 ≤ i ≤ m).
For each X_j, we can divide the whole historical dataset D into two subsets, D = {D_obs,j | D_mis,j}, where D_obs,j is the set of projects whose values on attribute X_j are observed and D_mis,j is the set of projects whose values on X_j are unobserved.
We may also divide the attributes of a project D_i into two subsets, D_i = {X_obs,i | X_mis,i}, where X_obs,i is the set of attributes whose values are observed in project D_i and X_mis,i is the set of attributes whose values are unobserved in D_i.
Missing data toleration strategy.
This strategy is very similar to the method adopted by C4.5 to handle missing data: we ignore missing values when training the prediction model.
To estimate P^{(τ+1)}(X_j = 1 | c_t) under this strategy, we rewrite Equation 1 as Equation 5:

\[
P^{(\tau+1)}(X_j = 1 \mid c_t)
  = \frac{1 + \sum_{i=1}^{|D_{\mathrm{obs},j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}
         {n + \sum_{j=1}^{n} \sum_{i=1}^{|D_{\mathrm{obs},j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)} \qquad (5)
\]
Missing data toleration strategy.
The difference between Equations 1 and 5 lies in that only the projects observed on attribute X_j, i.e. D_obs,j, are used to estimate P^{(τ+1)}(X_j = 1 | c_t).
Equation 2 can still be used here to estimate P^{(τ+1)}(X_j = 0 | c_t), and Equation 3 to estimate P^{(τ+1)}(c_t).
Missing data toleration strategy.
Accordingly, the prediction model should be adapted from Equation 4 to Equation 6:

\[
P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})
  = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})}
  = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{|X_{\mathrm{obs},i}|} P^{(\tau)}(x_{i'j} \mid c_t)}
         {\sum_{t=1}^{l} P^{(\tau)}(c_t) \prod_{j=1}^{|X_{\mathrm{obs},i}|} P^{(\tau)}(x_{i'j} \mid c_t)} \qquad (6)
\]
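The toleration strategy of Equations 5 and 6 can be sketched as follows, under stated assumptions: missing entries are encoded as np.nan, and the function names (tolerate_lik, tolerate_posterior) are illustrative, not from the paper.

```python
import numpy as np

def tolerate_lik(X, post, n):
    """Equation 5 sketch: estimate P(X_j = 1 | c_t) using only the projects
    whose value on X_j is observed. X: (m, n) with np.nan for missing;
    post: (m, l) current class posteriors P(h_i = c_t | D_i)."""
    obs = ~np.isnan(X)                   # membership of D_obs,j per attribute
    Xo = np.where(obs, X, 0.0)           # missing entries contribute nothing
    s = Xo.T @ post                      # sums over D_obs,j only, shape (n, l)
    return (1.0 + s) / (n + s.sum(axis=0))

def tolerate_posterior(x, prior, lik1):
    """Equation 6 sketch: class posterior for one project x, multiplying
    only over its observed attributes X_obs,i."""
    obs = ~np.isnan(x)
    term = np.where(x[obs][:, None] == 1.0,
                    np.log(lik1[obs]),
                    np.log(1.0 - lik1[obs])).sum(axis=0)
    logp = np.log(prior) + term
    p = np.exp(logp - logp.max())
    return p / p.sum()                   # normalized over the l classes
```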
Missing data imputation strategy.
The basic idea of this strategy is that unobserved values of
attributes can be imputed using the observed values.
Then, both observed values and imputed values are used
to construct the prediction model.
Missing data imputation strategy.
This strategy is embedded in the naive Bayes and EM procedure, and we may rewrite Equation 1 as Equation 7 to estimate P^{(τ+1)}(X_j = 1 | c_t):

\[
P^{(\tau+1)}(X_j = 1 \mid c_t)
  = \frac{1 + \sum_{i=1}^{|D_{\mathrm{obs},j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)
            + \sum_{s=1}^{|D_{\mathrm{mis},j}|} \tilde{x}_{sj}\, P^{(\tau)}(h_s = c_t \mid D_s)}
         {n + \sum_{j=1}^{n} \Big\{ \sum_{i=1}^{|D_{\mathrm{obs},j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)
            + \sum_{s=1}^{|D_{\mathrm{mis},j}|} \tilde{x}_{sj}\, P^{(\tau)}(h_s = c_t \mid D_s) \Big\}} \qquad (7)
\]
Missing data imputation strategy.
The missing value x_sj, i.e. the value of attribute X_j on project D_s, is imputed as x̃_sj by Equation 8:

\[
\tilde{x}_{sj}
  = \frac{\sum_{i=1}^{|D_{\mathrm{obs},j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}
         {\sum_{i=1}^{|D_{\mathrm{obs},j}|} P^{(\tau)}(h_i = c_t \mid D_i)} \qquad (8)
\]

x̃_sj is a constant independent of D_s given c_t.
We stipulate that x̃_sj is rounded to 1 if x̃_sj ≥ 0.5; otherwise, x̃_sj is rounded to 0.
Here, we also use Equation 3 to estimate P^{(τ+1)}(c_t).
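This imputation step (Equation 8 plus the 0.5 rounding rule) can be sketched as below. Assumptions are stated in the code: np.nan marks missing values, the name impute_xtilde is illustrative, and the sketch omits degenerate-case handling (e.g. an attribute with no observed values).

```python
import numpy as np

def impute_xtilde(X, post, t):
    """Equation 8 sketch: class-conditional imputation of missing x_sj.
    Returns a copy of X with each np.nan in column j replaced by the
    rounded x~_sj computed from the projects observed on X_j, given
    class index t. X: (m, n) with np.nan for missing; post: (m, l)."""
    Xf = X.copy()
    w = post[:, t]                                   # P(h_i = c_t | D_i)
    for j in range(X.shape[1]):
        obs = ~np.isnan(X[:, j])                     # D_obs,j for attribute j
        xt = (X[obs, j] * w[obs]).sum() / w[obs].sum()
        Xf[~obs, j] = 1.0 if xt >= 0.5 else 0.0      # round as stipulated
    return Xf
```

After this step, both the observed and the imputed values feed the Equation 7 update, which is why the strategy is described as embedded in the naive Bayes and EM procedure.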
Missing data imputation strategy.
The prediction model P^{(τ+1)}(c_t | D_i) can then be constructed by Equation 9, taking the missing values into account:

\[
P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})
  = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})}
  = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}
         {\sum_{t=1}^{l} P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)} \qquad (9)
\]

Note that if x_i'j is unobserved, its value is substituted with x̃_i'j given by Equation 8.
The ISBSG dataset.
The ISBSG data set (http://www.isbsg.org) has 70 attributes, and many attributes have missing values.
We extract 188 projects with 16 attributes, with the criterion that each project has at least 2/3 of its attribute values observed and, for each attribute, its values are observed in at least 2/3 of the projects.
13 attributes are nominal and 3 attributes are continuous.
The ISBSG dataset.
We use Equation 10 to normalize the efforts of projects into l (= 3) classes:

\[
c_t = \left\lfloor \frac{l \times (\mathit{effort}_{D_i} - \mathit{effort}_{\min})}{\mathit{effort}_{\max} - \mathit{effort}_{\min}} \right\rfloor + 1 \qquad (10)
\]

Table: The effort classes in the ISBSG data set.

Class No.   # of projects   Label
1           85              Low
2           76              Medium
3           27              High
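Equation 10 can be sketched directly. One detail is an assumption of this sketch, not stated in the slides: at effort = effort_max the floor yields l + 1, so the maximum-effort project is clamped into class l here.

```python
import math

def effort_class(effort, e_min, e_max, l=3):
    """Equation 10 sketch: normalize a project's real effort into one of
    l classes. The clamp keeps the maximum-effort project in class l."""
    c = math.floor(l * (effort - e_min) / (e_max - e_min)) + 1
    return min(c, l)  # the floor reaches l exactly at effort == e_max

print(effort_class(100, 100, 1000))   # 1 (Low)
print(effort_class(550, 100, 1000))   # 2 (Medium)
print(effort_class(1000, 100, 1000))  # 3 (High, clamped)
```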
The CSBSG dataset.
The CSBSG data set contains 1103 projects collected from 140
organizations and 15 regions across China by the Chinese
association of software industry.
We extract 94 projects and 21 attributes (15 nominal and 6
continuous) with the same selection criterion as for the ISBSG
data set, and use Equation 10 to normalize the project efforts
into l (= 3) classes.
Table: The effort classes in CSBSG data set.
Class No. # of projects Label
1 27 Low
2 31 Medium
3 36 High
Experiment setup.
To evaluate the proposed method comparatively, we adopt
MI and MINI to impute the missing values in the ISBSG and
CSBSG data sets.
BPNN is used to classify the projects in the data sets after
imputation.
Our experiments are conducted with the 10-fold
cross-validation technique.
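The 10-fold cross-validation used above can be sketched as a generic splitter. This is not the authors' code; the fold assignment by shuffled round-robin is my own choice:

```python
import random

def kfold(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation:
    shuffle indices once, then rotate each fold out as the test set."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# 188 ISBSG projects split into 10 folds:
splits = list(kfold(188, k=10))
```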
EM-T and EM-I on ISBSG dataset.
The following figure illustrates the performance of the
missing data toleration strategy (hereafter called EM-T)
and the missing data imputation strategy (hereafter called
EM-I) in handling missing data for effort prediction on the
ISBSG data set.
EM-T and EM-I on ISBSG dataset.
[Plot: accuracy (0.6–0.8) against the number of unlabeled projects
(0–20) for EM-I, EM-T, BPNN+MI, and BPNN+MINI.]
Figure: Performances of naive Bayes with EM-I and EM-T in
comparison with BPNN on effort prediction using ISBSG data set.
EM-T and EM-I on ISBSG dataset.
What we can see from the figure.
Both EM-I and EM-T perform better than BPNN with either
MI or MINI at classifying the projects in the ISBSG data set.
The performance of naive Bayes with EM improves as
unlabeled projects are appended. This outcome illustrates
that semi-supervised learning can improve the prediction of
software effort.
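The semi-supervised scheme referred to above, naive Bayes whose parameters are re-estimated by EM over unlabeled projects, can be sketched as follows. This is a generic illustration with invented toy data, handling nominal attributes only and assuming Laplace smoothing; it is not the authors' implementation:

```python
import math

def nb_em(labeled, unlabeled, n_classes, n_iters=10):
    """Semi-supervised naive Bayes via EM over nominal attributes.
    labeled: list of (features, class); unlabeled: list of features.
    Returns a function mapping features -> class posterior."""
    n_attrs = len(labeled[0][0])
    # Attribute value sets, used for Laplace smoothing.
    vals = [{x[j] for x, _ in labeled} | {x[j] for x in unlabeled}
            for j in range(n_attrs)]

    def m_step(resp):
        # Fractional counts: labeled projects count fully for their class;
        # unlabeled projects count by their current responsibilities.
        prior = [1.0] * n_classes
        cnt = [[{v: 1.0 for v in vals[j]} for j in range(n_attrs)]
               for _ in range(n_classes)]
        for x, c in labeled:
            prior[c] += 1.0
            for j in range(n_attrs):
                cnt[c][j][x[j]] += 1.0
        for x, r in zip(unlabeled, resp):
            for c in range(n_classes):
                prior[c] += r[c]
                for j in range(n_attrs):
                    cnt[c][j][x[j]] += r[c]
        z = sum(prior)
        like = [[{v: n / sum(d.values()) for v, n in d.items()}
                 for d in row] for row in cnt]
        return [p / z for p in prior], like

    def e_step(x, prior, like):
        # Class posterior under the naive (conditional independence) assumption.
        p = [prior[c] * math.prod(like[c][j][x[j]] for j in range(n_attrs))
             for c in range(n_classes)]
        z = sum(p)
        return [pi / z for pi in p]

    resp = [[1.0 / n_classes] * n_classes for _ in unlabeled]
    for _ in range(n_iters):
        prior, like = m_step(resp)
        resp = [e_step(x, prior, like) for x in unlabeled]
    return lambda x: e_step(x, prior, like)

# Invented toy projects: (size, language) -> effort class 0 or 1.
labeled = [(("small", "java"), 0), (("large", "cobol"), 1)]
unlabeled = [("small", "java"), ("large", "cobol")]
predict = nb_em(labeled, unlabeled, n_classes=2)
probs = predict(("small", "java"))
```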
EM-T and EM-I on ISBSG dataset.
What we can see from the figure.
If supervised learning were used for software effort
prediction, the MINI method would be favorable for imputing
the missing values, but the missing data toleration strategy
may not be desirable for handling them.
The imputation strategy for missing data is more effective
than the toleration strategy when naive Bayes and EM are
used to predict ISBSG software efforts.
EM-T and EM-I on CSBSG dataset.
The following figure illustrates EM-T and EM-I in handling
missing data for effort prediction on the CSBSG data set.
[Plot: accuracy (0.5–0.8) against the number of unlabeled projects
(0–8) for EM-I, EM-T, BPNN+MI, and BPNN+MINI.]
Figure: Performances of EM-I and EM-T in comparison with BPNN on predicting effort with different
number of unlabeled projects using CSBSG dataset.
EM-T and EM-I on CSBSG dataset.
What we can see from the above figure.
The better performance of EM-I over EM-T is also observed
on the CSBSG data set, just as on the ISBSG data set. This
further validates our conjecture that EM-I outperforms EM-T
in software effort prediction.
EM-T outperforms EM-I once the number of unlabeled
projects grows beyond the point of maximum accuracy,
which differs from the ISBSG result. We suggest this is
caused by the relatively small size of the CSBSG data set,
where the imputation strategy is more prone than the
toleration strategy to introduce bias into the prediction.
More experiments and hypothesis testing.
More experimental results, with explanations, are detailed in the
paper. We also conducted hypothesis testing to examine the
significance of the conclusions drawn from our experiments.
Interested readers may refer to the paper.
Threats to validity
The threat to external validity is primarily the degree to
which the attributes we used to describe the projects, and
the ISBSG and CSBSG samples, are representative.
The threat to internal validity concerns measurement and
data effects that can bias our results, caused by using
accuracy as the performance measure.
The threat to construct validity is that our experiments clip
attributes and project data from both the ISBSG and
CSBSG data sets.
Conclusion
Semi-supervised learning, in the form of naive Bayes with
EM, is employed to predict software effort.
We propose two strategies, embedded in naive Bayes and
EM, to handle missing data.
Future work
We plan to compare the proposed techniques with other
missing data imputation techniques, such as FIML and
MSWR.
We will develop more missing data techniques embedded
with naive Bayes and EM for software effort prediction.
We have already investigated the underlying mechanism of
missingness (structural or unstructured missing) in software
effort data. Building on this, we will tailor the missing data
handling strategies to the underlying missing mechanism of
software effort data.
Thanks
Any further questions about the content of the slides and the
paper can be sent to Mr. Wen Zhang.
Email: zhangwen@itechs.iscas.ac.cn