Naive Bayes and EM for software effort prediction
               Missing data handling strategies
                    Conclusion and future work

   Handling missing data in software effort
prediction with naive Bayes and EM algorithm

                   Wen Zhang                 Ye Yang         Qing Wang

                     Laboratory for Internet Software Technologies
                 Institute of Software, Chinese Academy of Sciences
                               Beijing 100190, P.R.China

    7th International Conference on Predictive Models in
           Software Engineering (PROMISE), 2011

               Wen Zhang, Ye Yang, Qing Wang        Software effort prediction with naive Bayes and EM algorithm

  1   Introduction
  2   Naive Bayes and EM for software effort prediction
  3   Missing data handling strategies
        Missing data toleration strategy.
        Missing data imputation strategy
  4   Experiments
        The datasets
        Experiment setup
        Experimental results
  5   Threats.
  6   Conclusion and future work


Effort prediction with missing data.

       The knowledge of software project effort stored in
       historical datasets can be used to develop predictive
       models, using statistical methods such as linear
       regression and correlation analysis, to predict the effort of
       new incoming projects.
       Most historical effort datasets, however, contain a large
       amount of missing data.


Effort prediction with missing data.

       Because most historical databases are small, the
       common practice of ignoring projects with missing data
       leads to biased and inaccurate prediction models.
       For these reasons, handling missing data in software
       effort datasets is becoming an important problem.


Sample data
       The historical effort data of projects were organized as
       shown in the following Table.

             Table: The sample data in historical project dataset.

                    D     X1    ...   Xj    ...   Xn    H
                    D1    x11   ...   x1j   ...   x1n   h1
                    ...   ...   ...   ...   ...   ...   ...
                    Di    xi1   ...   xij   ...   xin   hi
                    ...   ...   ...   ...   ...   ...   ...
                    Dm    xm1   ...   xmj   ...   xmn   hm

       Xj (1 ≤ j ≤ n) denotes an attribute of project Di
       (1 ≤ i ≤ m). hi is the effort class label of Di and it is
       derived from the real effort of project Di .

Sample data.

       There are l effort classes for all the projects in a dataset,
       that is, hi is equal to one of the elements in {c1 , ..., cl }.
       The attributes Xj are assumed to be mutually independent and
       to take Boolean values without missing data, i.e. xij ∈ {0, 1}.


Formulation of the problem.

       An effort dataset Ycom containing m historical projects is written as
       Ycom = (D1 , ..., Di , ..., Dm )T , where Di (1 ≤ i ≤ m) is a
       historical project and Di = (xi1 , ..., xij , ..., xin )T is
       represented by n attributes Xj (1 ≤ j ≤ n).
       hi denotes the effort class label of project Di . Each xij ,
       the value of attribute Xj (1 ≤ j ≤ n) on Di , is either
       observed or missing.
       Cross-validation on effort prediction is used to evaluate
       the performance of missing data handling techniques.



       The EM (Expectation Maximization) algorithm is a method for
       finding maximum likelihood or maximum a posteriori
       estimates of parameters in statistical models.
       The motivation for applying EM to naive Bayes is to augment
       the labeled data set with unlabeled projects and
       their estimated effort class labels.
       The classification performance can then be improved
       by using more data to train the prediction model.


Labeled projects and unlabeled projects.

       For a labeled project DiL , its effort class is known, so
       P(hi = ct ∣DiL ) ∈ {0, 1} is determinate.
       For an unlabeled project DiU , its label P(hi = ct ∣DiU ) is
       unknown.
       However, if we can assign a predicted effort class to DiU ,
       then DiU can also be used to update the estimates
       P(Xj = 0∣ct ), P(Xj = 1∣ct ) and P(ct ), and further to refine
       the effort prediction model P(ct ∣Di ). This process is
       described in Equations 1, 2, 3 and 4.


Estimating P^(k+1)(Xj = 1∣ct ).

       The likelihood of occurrence of Xj with respect to ct at
       iteration k + 1 is updated by Equation 1 using the
       estimates at iteration k.

       \[
       P^{(k+1)}(X_j = 1 \mid c_t) =
         \frac{1 + \sum_{i=1}^{m} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i)}
              {n + \sum_{j=1}^{n} \sum_{i=1}^{m} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i)}.
       \tag{1}
       \]

       In practice, we interpret P^(k+1)(Xj = 1∣ct ) as the probability of
       attribute Xj appearing in a project whose effort class is ct .
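
The update in Equation 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute matrix is a list of m rows with n Boolean values each, the posteriors from the previous iteration are an m-by-l list, and the names update_likelihood, X and post are hypothetical rather than taken from the slides.

```python
def update_likelihood(X, post):
    """Sketch of Equation 1: return p1[t][j] = P(X_j = 1 | c_t).

    X    -- m x n list of 0/1 attribute values x_ij (no missing data)
    post -- m x l list of posteriors P(h_i = c_t | D_i) from the previous iteration
    """
    m, n, l = len(X), len(X[0]), len(post[0])
    p1 = []
    for t in range(l):
        # Posterior-weighted count of projects of class c_t in which X_j = 1.
        counts = [sum(X[i][j] * post[i][t] for i in range(m)) for j in range(n)]
        # Smoothed denominator: n plus the weighted counts summed over all attributes.
        denom = n + sum(counts)
        p1.append([(1 + c) / denom for c in counts])
    return p1
```

For labeled projects the rows of post are one-hot vectors, so the same function also covers the purely supervised initial estimate.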


Estimating P^(k+1)(Xj = 0∣ct ).

       Accordingly, the likelihood of non-occurrence of Xj with
       respect to ct at iteration k + 1, P^(k+1)(Xj = 0∣ct ), is
       estimated by Equation 2.

       \[
       P^{(k+1)}(X_j = 0 \mid c_t) = 1 - P^{(k+1)}(X_j = 1 \mid c_t).
       \tag{2}
       \]


Estimating P^(k+1)(ct ).

  Second, the effort class prior probability, P^(k+1)(ct ), is updated
  in the same manner by Equation 3 using the estimates at
  iteration k. In practice, we may regard P^(k+1)(ct ) as the prior
  probability of class label ct appearing in all the software
  projects.

  \[
  P^{(k+1)}(c_t) = \frac{1 + \sum_{i=1}^{m} P^{(k)}(h_i = c_t \mid D_i)}{l + m}.
  \tag{3}
  \]
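
As a rough illustration of Equation 3, the class prior is a smoothed average of the posteriors. The sketch below assumes the posteriors are stored as an m-by-l list; the name update_prior is hypothetical.

```python
def update_prior(post):
    """Sketch of Equation 3: return prior[t] = P(c_t).

    post -- m x l list of posteriors P(h_i = c_t | D_i) from the previous iteration
    """
    m, l = len(post), len(post[0])
    # Add-one smoothing: 1 in the numerator and l (number of classes) in the denominator.
    return [(1 + sum(row[t] for row in post)) / (l + m) for t in range(l)]
```

Because each row of post sums to 1, the returned priors also sum to 1.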


Estimating P^(k+1)(hi′ = ct ∣Di′ ).

  Third, the posterior probability of an unlabeled project Di′
  belonging to an effort class ct at iteration k + 1,
  P^(k+1)(hi′ = ct ∣Di′ ), is updated using Equation 4.

  \[
  P^{(k+1)}(h_{i'} = c_t \mid D_{i'})
    = \frac{P^{(k)}(c_t)\, P^{(k)}(D_{i'} \mid c_t)}{P^{(k)}(D_{i'})}
    = \frac{P^{(k)}(c_t) \prod_{j=1}^{n} P^{(k)}(x_{i'j} \mid c_t)}
           {\sum_{t=1}^{l} P^{(k)}(c_t) \prod_{j=1}^{n} P^{(k)}(x_{i'j} \mid c_t)}.
  \tag{4}
  \]
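
A small Python sketch of Equation 4 follows, assuming complete Boolean data. The function name posterior and the layout of p1 (p1[t][j] holds P(Xj = 1∣ct )) are illustrative assumptions, not part of the original slides.

```python
def posterior(x, prior, p1):
    """Sketch of Equation 4: return [P(h = c_t | D)] for one project.

    x     -- list of n attribute values in {0, 1}
    prior -- list of l class priors P(c_t)
    p1    -- l x n list with p1[t][j] = P(X_j = 1 | c_t)
    """
    joint = []
    for t, p_ct in enumerate(prior):
        value = p_ct
        for j, xj in enumerate(x):
            # P(x_j | c_t) is P(X_j = 1 | c_t) when x_j = 1, else its complement (Equation 2).
            value *= p1[t][j] if xj == 1 else 1.0 - p1[t][j]
        joint.append(value)
    total = sum(joint)  # P(D), the sum over all classes
    return [v / total for v in joint]
```

For data with many attributes, summing logarithms instead of multiplying probabilities would be a more numerically stable variant of the same computation.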


Estimating P^(k+1)(hi′ = ct ∣Di′ ).

               For labeled projects, if xij = 1, then
               P^(k)(xij ∣ct ) = P^(k)(Xj = 1∣ct ); otherwise xij = 0 and
               P^(k)(xij ∣ct ) = P^(k)(Xj = 0∣ct ).
               For unlabeled projects, if xi′j = 1, then
               P^(k)(xi′j ∣ct ) = P^(k)(Xj = 1∣ct ); otherwise xi′j = 0 and
               P^(k)(xi′j ∣ct ) = P^(k)(Xj = 0∣ct ).
       Here, P^(0)(Xj = 1∣ct ) and P^(0)(ct ) are initially estimated from
       the labeled projects only, at the first step of the iteration. The
       unlabeled projects are appended to the learning
       process once they have been assigned probabilistic effort classes
       by P^(1)(hi′ = ct ∣Di′ ).


Predicting the effort class of unlabeled projects.

       We iterate Equations 1, 2, 3 and 4 until their estimates
       converge to stable values.
       Then, P^(k+1)(hi′ = ct ∣Di′ ) is used to predict the effort class of
       Di′ .
       The ct ∈ {c1 , ..., cl } that maximizes P^(k+1)(hi′ = ct ∣Di′ ) is
       regarded as the effort class of Di′ .
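
The whole loop can be sketched as follows, for complete Boolean data. This is a minimal illustration rather than the authors' implementation: the function names, the stopping rule (largest change in any posterior below tol) and the defaults max_iter and tol are all assumptions made for the example.

```python
def m_step(X, post):
    """Equations 1 and 3: re-estimate P(X_j = 1 | c_t) and P(c_t)."""
    m, n, l = len(X), len(X[0]), len(post[0])
    prior = [(1 + sum(row[t] for row in post)) / (l + m) for t in range(l)]
    p1 = []
    for t in range(l):
        counts = [sum(X[i][j] * post[i][t] for i in range(m)) for j in range(n)]
        denom = n + sum(counts)
        p1.append([(1 + c) / denom for c in counts])
    return prior, p1


def e_step(x, prior, p1):
    """Equations 2 and 4: posterior over effort classes for one project."""
    joint = []
    for t, value in enumerate(prior):
        for j, xj in enumerate(x):
            value *= p1[t][j] if xj == 1 else 1.0 - p1[t][j]
        joint.append(value)
    total = sum(joint)
    return [v / total for v in joint]


def em_naive_bayes(X_lab, y_lab, X_unl, l, max_iter=100, tol=1e-6):
    """Return (predicted class indices, posteriors) for the unlabeled projects."""
    # Labeled projects keep fixed one-hot posteriors.
    post_lab = [[1.0 if y == t else 0.0 for t in range(l)] for y in y_lab]
    # Initial estimates use the labeled projects only.
    prior, p1 = m_step(X_lab, post_lab)
    post_unl = [e_step(x, prior, p1) for x in X_unl]
    for _ in range(max_iter):
        # Re-estimate the parameters from labeled and unlabeled projects together.
        prior, p1 = m_step(X_lab + X_unl, post_lab + post_unl)
        new_post = [e_step(x, prior, p1) for x in X_unl]
        delta = max((abs(a - b) for r, s in zip(new_post, post_unl)
                     for a, b in zip(r, s)), default=0.0)
        post_unl = new_post
        if delta < tol:  # assumed convergence criterion
            break
    # The class with the largest posterior is the predicted effort class.
    labels = [max(range(l), key=lambda t: row[t]) for row in post_unl]
    return labels, post_unl
```

Classes are indexed from 0 here, so class index t corresponds to c(t+1) in the slides' notation.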


Initial setting.
        When we use Equation 1 to estimate the likelihood of Xj
        with respect to ct , P(Xj = 1∣ct ) or P(Xj = 0∣ct ), we do not
        account for missing values among the xij (1 ≤ i ≤ m).
        For each Xj , we can divide the whole historical dataset D
        into two subsets, i.e. D = {Dobs,j ∣Dmis,j }, where Dobs,j is the
        set of projects whose values on attribute Xj are observed
        and Dmis,j is the set of projects whose values on attribute
        Xj are unobserved.
        We may also divide the attributes in a project Di into two
        subsets, i.e. Di = {Xobs,i ∣Xmis,i } where Xobs,i is the set of
        attributes whose values are observed in project Di and
        Xmis,i denotes the set of attributes whose values are
        unobserved in project Di .

Missing data toleration strategy.

       This strategy is similar to the method adopted by
       C4.5 to handle missing data. That is, we ignore missing
       values when training the prediction model.
       To estimate P^(k+1)(Xj = 1∣ct ) under this strategy, we
       rewrite Equation 1 as Equation 5.

       \[
       P^{(k+1)}(X_j = 1 \mid c_t) =
         \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i)}
              {n + \sum_{j=1}^{n} \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i)}.
       \tag{5}
       \]


Missing data toleration strategy.

       The difference between Equations 1 and 5 is that only
       the projects with observed values on attribute Xj , i.e. Dobs,j , are used to
       estimate P^(k+1)(Xj = 1∣ct ).
       Equation 2 can still be used to estimate
       P^(k+1)(Xj = 0∣ct ), and Equation 3 to estimate P^(k+1)(ct ).
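
A possible Python sketch of the toleration update in Equation 5 is shown below. It assumes that a missing value is encoded as None, which is a choice made for this example only; the function name is likewise hypothetical.

```python
def update_likelihood_tolerant(X, post):
    """Sketch of Equation 5: estimate P(X_j = 1 | c_t) from observed values only.

    X    -- m x n list of attribute values in {0, 1, None}; None marks a missing value
    post -- m x l list of posteriors P(h_i = c_t | D_i) from the previous iteration
    """
    m, n, l = len(X), len(X[0]), len(post[0])
    p1 = []
    for t in range(l):
        # For attribute X_j, sum only over the projects in D_obs,j.
        counts = [sum(X[i][j] * post[i][t] for i in range(m) if X[i][j] is not None)
                  for j in range(n)]
        denom = n + sum(counts)
        p1.append([(1 + c) / denom for c in counts])
    return p1
```

With no None entries in X this reduces to the estimate of Equation 1.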


Missing data toleration strategy.

       Accordingly, the prediction model should be adapted from
       Equation 4 to Equation 6.

       \[
       P^{(k+1)}(h_{i'} = c_t \mid D_{i'})
         = \frac{P^{(k)}(c_t)\, P^{(k)}(D_{i'} \mid c_t)}{P^{(k)}(D_{i'})}
         = \frac{P^{(k)}(c_t) \prod_{j=1}^{|X_{obs,i'}|} P^{(k)}(x_{i'j} \mid c_t)}
                {\sum_{t=1}^{l} P^{(k)}(c_t) \prod_{j=1}^{|X_{obs,i'}|} P^{(k)}(x_{i'j} \mid c_t)}.
       \tag{6}
       \]
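
The adapted prediction model can be sketched in Python as below: the product simply skips attributes whose values are unobserved in the project being classified. Encoding missing values as None and the name posterior_tolerant are assumptions of this example; the normalizing term is taken to be the sum of the numerator over all l classes, as in Equation 4.

```python
def posterior_tolerant(x, prior, p1):
    """Sketch of Equation 6: posterior over classes using observed attributes only.

    x     -- list of n attribute values in {0, 1, None}; None marks a missing value
    prior -- list of l class priors P(c_t)
    p1    -- l x n list with p1[t][j] = P(X_j = 1 | c_t)
    """
    joint = []
    for t, value in enumerate(prior):
        for j, xj in enumerate(x):
            if xj is None:
                continue  # attribute X_j is not in X_obs for this project
            value *= p1[t][j] if xj == 1 else 1.0 - p1[t][j]
        joint.append(value)
    total = sum(joint)
    return [v / total for v in joint]
```

If every attribute of a project is missing, the posterior falls back to the class prior.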


Missing data imputation strategy.

       The basic idea of this strategy is that unobserved values of
       attributes can be imputed using the observed values.
       Then, both observed values and imputed values are used
       to construct the prediction model.


Missing data imputation strategy.

       This strategy is embedded in naive Bayes
       and EM: we rewrite Equation 1 as Equation 7 to
       estimate P^(k+1)(Xj = 1∣ct ).

       \[
       P^{(k+1)}(X_j = 1 \mid c_t) =
         \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i)
                 + \sum_{s=1}^{|D_{mis,j}|} \tilde{x}_{sj}\, P^{(k)}(h_s = c_t \mid D_s)}
              {n + \sum_{j=1}^{n} \left\{ \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i)
                 + \sum_{s=1}^{|D_{mis,j}|} \tilde{x}_{sj}\, P^{(k)}(h_s = c_t \mid D_s) \right\}}.
       \tag{7}
       \]


Missing data imputation strategy.
       The missing value xsj , which is the value of attribute Xj on
       the project Ds , is imputed with x̃sj given by Equation 8.

       \[
       \tilde{x}_{sj} =
         \frac{\sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i)}
              {\sum_{i=1}^{|D_{obs,j}|} P^{(k)}(h_i = c_t \mid D_i)}.
       \tag{8}
       \]

       x̃sj is a constant independent of Ds given ct .
       We round x̃sj to 1 if x̃sj ≥ 0.5;
       otherwise, x̃sj is rounded to 0.
       Here, we also use Equation 3 to estimate P^(k+1)(ct ).
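
The imputation rule of Equation 8, including the rounding at 0.5, might look like this in Python. Missing values are assumed to be encoded as None, the name impute_value is hypothetical, and returning 0 when no observed project carries any weight for the class is a fallback chosen for this sketch, not something stated in the slides.

```python
def impute_value(X, post, j, t):
    """Sketch of Equation 8: imputed Boolean value of attribute X_j given class c_t.

    X    -- m x n list of attribute values in {0, 1, None}; None marks a missing value
    post -- m x l list of posteriors P(h_i = c_t | D_i) from the previous iteration
    """
    num = 0.0
    den = 0.0
    for row, weights in zip(X, post):
        if row[j] is None:
            continue  # only projects in D_obs,j contribute
        num += row[j] * weights[t]
        den += weights[t]
    # Assumed fallback: no observed evidence for this class yields 0.
    x_tilde = num / den if den > 0 else 0.0
    # Round the weighted mean to a Boolean value.
    return 1 if x_tilde >= 0.5 else 0
```

The imputed value depends only on the attribute index j and the class index t, which matches the remark that it is a constant independent of Ds given ct .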

Missing data imputation strategy.
       The prediction model, P^(k+1)(hi′ = ct ∣Di′ ), is
       constructed by Equation 9, taking the missing values into account.

       \[
       P^{(k+1)}(h_{i'} = c_t \mid D_{i'})
         = \frac{P^{(k)}(c_t)\, P^{(k)}(D_{i'} \mid c_t)}{P^{(k)}(D_{i'})}
         = \frac{P^{(k)}(c_t) \prod_{j=1}^{n} P^{(k)}(x_{i'j} \mid c_t)}
                {\sum_{t=1}^{l} P^{(k)}(c_t) \prod_{j=1}^{n} P^{(k)}(x_{i'j} \mid c_t)}.
       \tag{9}
       \]

       Note that if xi′j is unobserved, its value is substituted
       with x̃i′j given by Equation 8.


The ISBSG dataset.

       The ISBSG data set has 70
       attributes, and many attribute values are missing.
       We extract 188 projects with 16 attributes, using the criteria
       that each project has observed values for at least 2/3 of its
       attributes and that each attribute is
       observed in at least 2/3 of the projects.
       13 attributes are nominal and 3 are
       continuous.


The ISBSG dataset.

       We use Equation 10 to normalize the efforts of projects
       into l(= 3) classes.

       \[
       c_t = \left\lfloor
               \frac{l \times (\mathit{effort}_{D_i} - \mathit{effort}_{min})}
                    {\mathit{effort}_{max} - \mathit{effort}_{min}}
             \right\rfloor + 1
       \tag{10}
       \]

                   Table: The effort classes in ISBSG data set.
                        Class No.             # of projects           Label
                            1                      85                  Low
                            2                      76                Medium
                            3                      27                 High
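
Equation 10 is an equal-width binning of the effort range, which the following Python sketch illustrates. The function and parameter names are hypothetical. One detail is an assumption of this sketch rather than something stated in the slides: taken literally, the formula assigns the project with the maximum effort to class l + 1, so the result is clamped to l here.

```python
import math


def effort_class(effort, effort_min, effort_max, l=3):
    """Sketch of Equation 10: map a real effort value to a class index in 1..l."""
    c = math.floor(l * (effort - effort_min) / (effort_max - effort_min)) + 1
    # Assumption: the maximum-effort project is placed in the top class l.
    return min(c, l)
```

With l = 3 and efforts ranging from 0 to 300, for instance, an effort of 150 falls into class 2 (Medium).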


The CSBSG dataset.
      The CSBSG dataset contains 1103 projects collected from 140
      organizations and 15 regions across China by the Chinese
      association of software industry.
      We extract 94 projects and 21 attributes (15 nominal
      attributes and 6 continuous attributes) using the same selection
      criteria as for the ISBSG data set. We use Equation 10 to
      normalize the efforts of projects into l(= 3) classes.

                  Table: The effort classes in CSBSG data set.
                           Class No.             # of projects           Label
                               1                      27                  Low
                               2                      31                Medium
                               3                      36                 High

Experiment setup.

       To evaluate the proposed method comparatively, we adopt
       MI and MINI to impute the missing values of the
       ISBSG and CSBSG datasets.
       BPNN is used to classify the projects in the data sets after
       imputation.
       Our experiments are conducted with 10-fold
       cross-validation.
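
For readers unfamiliar with the procedure, a 10-fold split can be sketched as below. The slides do not say how the folds were drawn, so the shuffling, the seed and the name k_fold_indices are assumptions of this example.

```python
import random


def k_fold_indices(m, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over m projects."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)  # assumed: folds are drawn at random
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test
```

Each project appears in exactly one test fold, so with the 188 ISBSG projects every fold holds 18 or 19 projects.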


EM-T and EM-I on ISBSG dataset.

       The following figure illustrates the performance of the
       missing data toleration strategy (hereafter called EM-T)
       and the missing data imputation strategy (hereafter called
       EM-I) in handling missing data for effort prediction on the
       ISBSG data set.


EM-T and EM-I on ISBSG dataset.






  Figure: Performances of naive Bayes with EM-I and EM-T in
  comparison with BPNN on effort prediction using ISBSG data set
  (x-axis: # of unlabeled projects, 0 to 20).


EM-T and EM-I on ISBSG dataset.

  What we can see from the figure.
       Both EM-I and EM-T perform better than BPNN with either
       MI or MINI in classifying the projects in the ISBSG
       dataset.
       The performance of naive Bayes with EM improves when
       unlabeled projects are added. This outcome illustrates
       that semi-supervised learning can improve the prediction
       of software effort.
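
  The semi-supervised idea in the last point can be sketched in
  code. The Python below is a minimal illustration, assuming
  categorical attributes coded as integers 0..n_values-1 and effort
  discretized into classes. The function names (fit_nb, posterior,
  em_nb), the Laplace smoothing parameter alpha, and the toy data
  are assumptions made for this sketch; they are not taken from the
  paper.

```python
import numpy as np

def fit_nb(X, R, n_values, alpha=1.0):
    """M-step: estimate class priors and per-attribute conditionals
    from soft class memberships R (one row per project, one column per class)."""
    n, d = X.shape
    k = R.shape[1]
    prior = (R.sum(axis=0) + alpha) / (n + alpha * k)
    counts = np.zeros((k, d, n_values))
    for j in range(d):
        for v in range(n_values):
            # Weighted count of projects with attribute j equal to v, per class.
            counts[:, j, v] = R[X[:, j] == v].sum(axis=0)
    cond = (counts + alpha) / (counts.sum(axis=2, keepdims=True) + alpha * n_values)
    return prior, cond

def posterior(X, prior, cond):
    """E-step: P(class | attributes) for each project under naive Bayes."""
    n, d = X.shape
    log_p = np.tile(np.log(prior), (n, 1))
    for j in range(d):
        # cond[:, j, :] has shape (k, n_values); pick each project's value.
        log_p += np.log(cond[:, j, :][:, X[:, j]]).T
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def em_nb(X_lab, y_lab, X_unl, n_classes, n_values, n_iter=20):
    """Train on labeled projects, then refine with unlabeled ones via EM."""
    R_lab = np.eye(n_classes)[y_lab]           # hard labels as one-hot rows
    prior, cond = fit_nb(X_lab, R_lab, n_values)
    if len(X_unl) == 0:
        return prior, cond
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R_unl = posterior(X_unl, prior, cond)  # E-step on unlabeled projects
        prior, cond = fit_nb(X_all, np.vstack([R_lab, R_unl]), n_values)  # M-step
    return prior, cond

# Toy usage with made-up data: 4 labeled and 4 unlabeled projects.
X_lab = np.array([[0, 0], [0, 1], [2, 2], [2, 1]])
y_lab = np.array([0, 0, 1, 1])
X_unl = np.array([[0, 0], [2, 2], [0, 0], [2, 2]])
prior, cond = em_nb(X_lab, y_lab, X_unl, n_classes=2, n_values=3)
print(posterior(np.array([[0, 0], [2, 2]]), prior, cond).argmax(axis=1))
```

  In this sketch the labeled projects keep their hard labels in
  every iteration, while the unlabeled projects contribute only
  through their current posterior class memberships.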


  What we can see from the figure.
       If supervised learning is used for software effort
       prediction, the MINI method is preferable for imputing
       missing values, whereas the missing data toleration
       strategy may not be desirable.
       The imputation strategy for missing data is more effective
       than the toleration strategy when naive Bayes and EM are
       used to predict ISBSG software effort.
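
  The difference between tolerating and imputing a missing value
  can be illustrated with a small sketch. This is a simplified
  reading of the two ideas for a single project under a fixed naive
  Bayes model, not the exact EM-T and EM-I procedures of the paper,
  which embed them in the EM iterations. The MISSING marker, the
  function names, and the toy parameters below are all hypothetical.

```python
import numpy as np

MISSING = -1  # hypothetical marker for a missing attribute value

def class_posterior(x, prior, cond):
    """Toleration: score each class using the observed attributes only,
    simply skipping attributes whose value is missing."""
    scores = np.log(prior)
    for j, v in enumerate(x):
        if v != MISSING:
            scores = scores + np.log(cond[:, j, v])
    p = np.exp(scores - scores.max())
    return p / p.sum()

def impute(x, prior, cond):
    """Imputation: fill each missing attribute with its most probable
    value, averaging the class-conditional distributions by the
    posterior computed from the observed attributes."""
    post = class_posterior(x, prior, cond)
    filled = np.array(x, copy=True)
    for j, v in enumerate(x):
        if v == MISSING:
            filled[j] = int(np.argmax(post @ cond[:, j, :]))
    return filled

# Toy model (made-up numbers): 2 classes, 2 attributes, 3 values each.
prior = np.array([0.5, 0.5])
cond = np.array([
    [[0.6, 0.2, 0.2], [0.6, 0.3, 0.1]],   # class 0
    [[0.2, 0.2, 0.6], [0.1, 0.3, 0.6]],   # class 1
])
x = [0, MISSING]                          # second attribute is missing
print(class_posterior(x, prior, cond))    # toleration: uses attribute 0 only
print(impute(x, prior, cond))             # imputation: fills attribute 1
```

  Toleration leaves the record as it is and bases the class
  estimate on less evidence, whereas imputation commits to a filled
  value that later training steps treat as if it had been observed.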


EM-T and EM-I on CSBSG dataset.
       Performance of EM-T and EM-I in handling missing data for
       effort prediction on the CSBSG dataset.





       Figure: Performances of EM-I and EM-T in comparison with BPNN on predicting effort with different
       numbers of unlabeled projects using the CSBSG dataset (x-axis: # of unlabeled projects, 0 to 8).

  What we can see from the above figure.
       EM-I also performs better than EM-T on the CSBSG
       dataset, as it does on the ISBSG dataset. This further
       supports our conjecture that EM-I outperforms EM-T in
       software effort prediction.
       EM-T performs better than EM-I when the number of
       unlabeled projects exceeds that of the "maxima", which
       differs from the ISBSG dataset. We attribute this result
       to the relatively small size of the CSBSG dataset, where
       the imputation strategy is more prone to introduce bias
       into the predictive model than the toleration strategy.


More experiments and hypothesis testing.

  More experimental results with explanations are detailed in the
  paper. We also conduct hypothesis testing to examine the
  significance of the conclusions drawn from our experiments.
  Interested readers may refer to the paper.


   The threat to external validity is primarily the degree to
   which the attributes we used describe the projects, and how
   representative the ISBSG and CSBSG samples are.
   The threats to internal validity are measurement and data
   effects that can bias our results, caused by using accuracy
   as the performance measure.
   The threat to construct validity is that our experiments
   use clipped attributes and clipped project data from both
   the ISBSG and CSBSG datasets.



       Semi-supervised learning, in the form of naive Bayes with
       EM, is employed to predict software effort.
       We propose two strategies embedded in naive Bayes and
       EM to handle the missing data.


Future work

       We plan to compare the proposed techniques with other
       missing data imputation techniques, such as FIML and
       We will develop more missing data techniques embedded
       with naive Bayes and EM for software effort prediction.
       We have already investigated the underlying mechanism of
       missingness (structural or unstructured missingness) in
       software effort data. Building on this, we will tailor the
       missing data handling strategies to the underlying
       missingness mechanism of software effort data.



  Any further questions about the content of the slides and the
  paper can be sent to Mr. Wen Zhang.

Automated Software Enging, Fall 2015, NCSU
Requirements Engineering
Requirements EngineeringRequirements Engineering
Requirements Engineering
172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia
Automated Software Engineering
Automated Software EngineeringAutomated Software Engineering
Automated Software Engineering
Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)
Tim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceTim Menzies, directions in Data Science
Tim Menzies, directions in Data Science
Dagstuhl14 intro-v1
Dagstuhl14 intro-v1Dagstuhl14 intro-v1
Dagstuhl14 intro-v1
Know thy tools
Know thy toolsKnow thy tools
Know thy tools
The Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software DataThe Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software Data

Recently uploaded

Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
Jen Stirrup
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu

Recently uploaded (20)

Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs

PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

  • 1. Handling missing data in software effort prediction with naive Bayes and EM algorithm. Wen Zhang, Ye Yang, Qing Wang. Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing 100190, P.R.China. {zhangwen,ye,wq} 7th International Conference on Predictive Models in Software Engineering (PROMISE), 2011.
  • 2. Outline. 1 Introduction; 2 Naive Bayes and EM for software effort prediction; 3 Missing data handling strategies (missing data toleration strategy, missing data imputation strategy); 4 Experiments (the datasets, experiment setup, experimental results); 5 Threats; 6 Conclusion and future work.
  • 3. Effort prediction with missing data. The knowledge on software project effort stored in historical datasets can be used to develop predictive models, using statistical methods such as linear regression and correlation analysis, to predict the effort of new incoming projects. Most historical effort datasets, however, contain a large amount of missing data.
  • 4. Effort prediction with missing data. Because most historical databases are small, the common practice of ignoring projects with missing data leads to biased and inaccurate prediction models. For these reasons, how to handle missing data in software effort datasets is becoming an important problem.
  • 5. Sample data. The historical effort data of projects are organized as shown in the following table.
    Table: The sample data in historical project dataset.
      D     X1    ...   Xj    ...   Xn    H
      D1    x11   ...   x1j   ...   x1n   h1
      ...   ...   ...   ...   ...   ...   ...
      Di    xi1   ...   xij   ...   xin   hi
      ...   ...   ...   ...   ...   ...   ...
      Dm    xm1   ...   xmj   ...   xmn   hm
    Xj (1 ≤ j ≤ n) denotes an attribute of project Di (1 ≤ i ≤ m). hi is the effort class label of Di and is derived from the real effort of project Di.
  • 6. Sample data. There are l effort classes for all the projects in a dataset, that is, hi is equal to one of the elements in {c1, ..., cl}. The attributes Xj are assumed to be independent of one another and Boolean-valued, so that when no data are missing, xij ∈ {0, 1}.
  • 7. Formulation of the problem. An effort dataset Ycom containing m historical projects is written as Ycom = (D1, ..., Di, ..., Dm)^T, where Di (1 ≤ i ≤ m) is a historical project and Di = (xi1, ..., xij, ..., xin)^T is represented by n attributes Xj (1 ≤ j ≤ n). hi denotes the effort class label of project Di. Each xij, the value of attribute Xj (1 ≤ j ≤ n) on Di, is either observed or missing. Cross-validation on effort prediction is used to evaluate the performance of missing data handling techniques.
  • 8. Motivation. The EM (Expectation Maximization) algorithm is a method for finding maximum likelihood or maximum a posteriori estimates of parameters in statistical models. The motivation for applying EM to naive Bayes is to augment the labeled data set with the unlabeled projects and their estimated effort class labels. Classification performance can then be improved by using more data to train the prediction model.
  • 9. Labeled projects and unlabeled projects. For a labeled project $D_i^L$, its effort class is determinate: $P(h_i = c_t \mid D_i^L) \in \{0, 1\}$. For an unlabeled project $D_i^U$, $P(h_i = c_t \mid D_i^U)$ is unknown. However, if we can assign a predicted effort class to $D_i^U$, then $D_i^U$ can also be used to update the estimates $P(X_j = 0 \mid c_t)$, $P(X_j = 1 \mid c_t)$ and $P(c_t)$, and further to refine the effort prediction model $P(c_t \mid D_i)$. This process is described in Equations 1, 2, 3 and 4.
  • 10. Estimating $P^{(\tau+1)}(X_j = 1 \mid c_t)$. The likelihood of occurrence of $X_j$ with respect to $c_t$ at iteration $\tau + 1$ is updated by Equation 1 using the estimates at iteration $\tau$:
    $$P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{m} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}{n + \sum_{j=1}^{n} \sum_{i=1}^{m} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}. \tag{1}$$
    In practice, we interpret $P^{(\tau+1)}(X_j = 1 \mid c_t)$ as the probability of attribute $X_j$ appearing in a project whose effort class is $c_t$.
  • 11. Estimating $P^{(\tau+1)}(X_j = 0 \mid c_t)$. Accordingly, the likelihood of non-occurrence of $X_j$ with respect to $c_t$ at iteration $\tau + 1$ is estimated by Equation 2:
    $$P^{(\tau+1)}(X_j = 0 \mid c_t) = 1 - P^{(\tau+1)}(X_j = 1 \mid c_t). \tag{2}$$
  • 12. Estimating $P^{(\tau+1)}(c_t)$. Second, the effort class prior probability $P^{(\tau+1)}(c_t)$ is updated in the same manner by Equation 3 using the estimates at iteration $\tau$. In practice, we may regard $P^{(\tau+1)}(c_t)$ as the prior probability of class label $c_t$ appearing in all the software projects.
    $$P^{(\tau+1)}(c_t) = \frac{1 + \sum_{i=1}^{m} P^{(\tau)}(h_i = c_t \mid D_i)}{l + m}. \tag{3}$$
  • 13. Estimating $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$. Third, the posterior probability of an unlabeled project $D_{i'}$ belonging to an effort class $c_t$ at iteration $\tau + 1$ is updated using Equation 4:
    $$P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}{\sum_{t=1}^{l} P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}. \tag{4}$$
  • 14. Estimating $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$. For labeled projects, if $x_{ij} = 1$ then $P^{(\tau)}(x_{ij} \mid c_t) = P^{(\tau)}(X_j = 1 \mid c_t)$; otherwise $x_{ij} = 0$ and $P^{(\tau)}(x_{ij} \mid c_t) = P^{(\tau)}(X_j = 0 \mid c_t)$. Likewise for unlabeled projects, if $x_{i'j} = 1$ then $P^{(\tau)}(x_{i'j} \mid c_t) = P^{(\tau)}(X_j = 1 \mid c_t)$; otherwise $x_{i'j} = 0$ and $P^{(\tau)}(x_{i'j} \mid c_t) = P^{(\tau)}(X_j = 0 \mid c_t)$. Here, $P^{(0)}(X_j = 1 \mid c_t)$ and $P^{(0)}(c_t)$ are initially estimated from the labeled projects alone at the first step of the iteration, and the unlabeled projects are added to the learning process once they have been assigned probabilistic effort classes by $P^{(1)}(h_{i'} = c_t \mid D_{i'})$.
  • 15. Predicting the effort class of unlabeled projects. We loop over Equations 1, 2, 3 and 4 until their estimates converge to stable values. Then $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$ is used to predict the effort class of $D_{i'}$: the $c_t \in \{c_1, ..., c_l\}$ that maximizes $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$ is regarded as the effort class of $D_{i'}$.
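
The loop over Equations 1 to 4 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`m_step`, `posterior`, `em_naive_bayes`), the convergence tolerance `tol`, the iteration cap `max_iter`, and the toy data are all hypothetical, and every attribute value is assumed to be observed.

```python
from math import prod

def m_step(X, resp, n_classes):
    """Equations 1 and 3: re-estimate P(X_j=1|c_t) and P(c_t).

    X[i][j] is the Boolean value x_ij; resp[i][t] is P(h_i = c_t | D_i).
    """
    m, n = len(X), len(X[0])
    # weighted[t][j] = sum_i x_ij * P(h_i = c_t | D_i)
    weighted = [[sum(X[i][j] * resp[i][t] for i in range(m)) for j in range(n)]
                for t in range(n_classes)]
    p1 = [[(1 + weighted[t][j]) / (n + sum(weighted[t])) for j in range(n)]
          for t in range(n_classes)]
    prior = [(1 + sum(resp[i][t] for i in range(m))) / (n_classes + m)
             for t in range(n_classes)]
    return p1, prior

def posterior(x, p1, prior):
    """Equation 4, with Equation 2 supplying P(X_j=0|c_t) = 1 - P(X_j=1|c_t)."""
    joint = [prior[t] * prod(p1[t][j] if v == 1 else 1 - p1[t][j]
                             for j, v in enumerate(x))
             for t in range(len(prior))]
    total = sum(joint)
    return [q / total for q in joint]

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, max_iter=100, tol=1e-6):
    # Labeled projects have determinate (one-hot) class probabilities.
    resp_lab = [[1.0 if y == t else 0.0 for t in range(n_classes)] for y in y_lab]
    # Initial estimates use the labeled projects only.
    p1, prior = m_step(X_lab, resp_lab, n_classes)
    resp_unl = [posterior(x, p1, prior) for x in X_unl]
    for _ in range(max_iter):
        p1, prior = m_step(X_lab + X_unl, resp_lab + resp_unl, n_classes)
        new_resp = [posterior(x, p1, prior) for x in X_unl]
        change = max((abs(a - b) for r, s in zip(new_resp, resp_unl)
                      for a, b in zip(r, s)), default=0.0)
        resp_unl = new_resp
        if change < tol:  # assumed stopping rule: posteriors have stabilized
            break
    # The class with the largest posterior is taken as the predicted effort class.
    labels = [max(range(n_classes), key=row.__getitem__) for row in resp_unl]
    return labels, resp_unl

# Hypothetical toy data: 4 labeled and 2 unlabeled projects, 4 attributes, 2 classes.
X_lab = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
y_lab = [0, 0, 1, 1]
X_unl = [[1, 1, 0, 0], [0, 0, 1, 1]]
labels, resp = em_naive_bayes(X_lab, y_lab, X_unl, n_classes=2)
```

Classes are indexed from 0 here, so class index t corresponds to the effort class c with subscript t + 1 in the slides.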
  • 17. Initial setting. When we use Equation 1 to estimate the likelihood of $X_j$ with respect to $c_t$, $P(X_j = 1 \mid c_t)$ or $P(X_j = 0 \mid c_t)$, we do not consider missing values among the $x_{ij}$ ($1 \le i \le m$). For each $X_j$, we can divide the whole historical dataset $D$ into two subsets, $D = \{D_{obs,j} \mid D_{mis,j}\}$, where $D_{obs,j}$ is the set of projects whose values on attribute $X_j$ are observed and $D_{mis,j}$ is the set of projects whose values on attribute $X_j$ are unobserved. We may also divide the attributes of a project $D_i$ into two subsets, $D_i = \{X_{obs,i} \mid X_{mis,i}\}$, where $X_{obs,i}$ is the set of attributes whose values are observed in project $D_i$ and $X_{mis,i}$ is the set of attributes whose values are unobserved in project $D_i$.
  • 18. Missing data toleration strategy. This strategy is very similar to the method adopted by C4.5 to handle missing data. That is, we ignore missing values when training the prediction model. To estimate $P^{(\tau+1)}(X_j = 1 \mid c_t)$ under this strategy, we rewrite Equation 1 as Equation 5:
    $$P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}{n + \sum_{j=1}^{n} \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}. \tag{5}$$
  • 19. Missing data toleration strategy. The difference between Equations 1 and 5 is that only the projects observed on attribute $X_j$, i.e. $D_{obs,j}$, are used to estimate $P^{(\tau+1)}(X_j = 1 \mid c_t)$. Equation 2 can still be used to estimate $P^{(\tau+1)}(X_j = 0 \mid c_t)$, and Equation 3 to estimate $P^{(\tau+1)}(c_t)$.
  • 20. Missing data toleration strategy. Accordingly, the prediction model should be adapted from Equation 4 to Equation 6, in which the products run over the observed attributes of $D_{i'}$ only:
    $$P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{|X_{obs,i'}|} P^{(\tau)}(x_{i'j} \mid c_t)}{\sum_{t=1}^{l} P^{(\tau)}(c_t) \prod_{j=1}^{|X_{obs,i'}|} P^{(\tau)}(x_{i'j} \mid c_t)}. \tag{6}$$
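
A minimal sketch of the toleration strategy, assuming missing values are encoded as `None` and class probabilities are held in a list `resp[i][t]`. The function names and the toy data are hypothetical; the code only illustrates how Equations 5 and 6 skip unobserved entries.

```python
from math import prod

def likelihood_tolerate(X, resp, n_classes):
    """Equation 5: estimate P(X_j=1|c_t) from the projects observed on X_j only.

    X[i][j] is 0, 1, or None (missing); resp[i][t] is P(h_i = c_t | D_i).
    """
    m, n = len(X), len(X[0])
    weighted = [[sum(X[i][j] * resp[i][t] for i in range(m) if X[i][j] is not None)
                 for j in range(n)]
                for t in range(n_classes)]
    return [[(1 + weighted[t][j]) / (n + sum(weighted[t])) for j in range(n)]
            for t in range(n_classes)]

def posterior_tolerate(x, p1, prior):
    """Equation 6: the product runs over the observed attributes of x only."""
    joint = [prior[t] * prod(p1[t][j] if v == 1 else 1 - p1[t][j]
                             for j, v in enumerate(x) if v is not None)
             for t in range(len(prior))]
    total = sum(joint)
    return [q / total for q in joint]

# Hypothetical toy data: 4 labeled projects, 2 attributes, 2 classes.
X = [[1, None], [1, 0], [0, 1], [None, 1]]
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
p1 = likelihood_tolerate(X, resp, n_classes=2)
post = posterior_tolerate([1, None], p1, prior=[0.5, 0.5])
```

With this encoding, a project whose attributes are all missing contributes an empty product, so its posterior falls back to the class prior.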
  • 22. Missing data imputation strategy. The basic idea of this strategy is that unobserved values of attributes can be imputed using the observed values. Then, both observed values and imputed values are used to construct the prediction model.
  • 23. Missing data imputation strategy. This strategy is embedded in the naive Bayes and EM procedure, and we may rewrite Equation 1 as Equation 7 to estimate $P^{(\tau+1)}(X_j = 1 \mid c_t)$:
    $$P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i) + \sum_{s=1}^{|D_{mis,j}|} \tilde{x}_{sj}\, P^{(\tau)}(h_s = c_t \mid D_s)}{n + \sum_{j=1}^{n} \left\{ \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i) + \sum_{s=1}^{|D_{mis,j}|} \tilde{x}_{sj}\, P^{(\tau)}(h_s = c_t \mid D_s) \right\}}. \tag{7}$$
  • 24. Missing data imputation strategy. The missing value $x_{sj}$, which is the value of attribute $X_j$ on the project $D_s$, is imputed as $\tilde{x}_{sj}$ using Equation 8:
    $$\tilde{x}_{sj} = \frac{\sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}{\sum_{i=1}^{|D_{obs,j}|} P^{(\tau)}(h_i = c_t \mid D_i)}. \tag{8}$$
    $\tilde{x}_{sj}$ is a constant independent of $D_s$ given $c_t$. We round $\tilde{x}_{sj}$ to 1 if $\tilde{x}_{sj} \ge 0.5$ and to 0 otherwise. Here, we also use Equation 3 to estimate $P^{(\tau+1)}(c_t)$.
  • 25. Missing data imputation strategy. The prediction model $P^{(\tau+1)}(c_t \mid D_i)$ can be constructed as in Equation 9, taking the missing values into account:
    $$P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}{\sum_{t=1}^{l} P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}. \tag{9}$$
    Note that if $x_{i'j}$ is unobserved, its value is substituted with $\tilde{x}_{i'j}$ given by Equation 8.
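
The imputation rule of Equation 8 can be sketched as below, assuming missing values are encoded as `None`. The function names and toy data are hypothetical, and returning 0 when no observed project carries any weight for the class is an assumption added here, since the slides do not cover that case.

```python
def soft_estimate(X, resp, j, t):
    """Equation 8 before rounding: weighted mean of the observed x_ij for class c_t.

    X[i][j] is 0, 1, or None (missing); resp[i][t] is P(h_i = c_t | D_i).
    """
    observed = [i for i in range(len(X)) if X[i][j] is not None]
    denom = sum(resp[i][t] for i in observed)
    if denom == 0:
        return 0.0  # assumption: no observed support for this class
    return sum(X[i][j] * resp[i][t] for i in observed) / denom

def impute(X, resp, j, t):
    """Round the estimate to a Boolean value: 1 if it is at least 0.5, else 0."""
    return 1 if soft_estimate(X, resp, j, t) >= 0.5 else 0

# Hypothetical toy data: one attribute, the last project is missing its value.
X = [[1], [0], [0], [None]]
resp = [[1.0, 0.0], [0.5, 0.5], [0.5, 0.5], [0.3, 0.7]]
value_if_class_0 = impute(X, resp, j=0, t=0)
value_if_class_1 = impute(X, resp, j=0, t=1)
```

The imputed value depends on the class but not on the project being imputed, which matches the remark that it is a constant independent of the project given the class.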
  • 27. The ISBSG dataset. The ISBSG data set has 70 attributes, and many attribute values are missing. We extract 188 projects with 16 attributes under the criterion that each project has observed values for at least 2/3 of its attributes and each attribute is observed in at least 2/3 of the projects. 13 attributes are nominal and 3 are continuous.
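
One way to apply such a selection criterion is sketched below. The slides do not say in which order projects and attributes were filtered, so the alternating loop, the function name `filter_by_observed_ratio`, and the `None` encoding for missing values are all assumptions made for illustration.

```python
def filter_by_observed_ratio(rows, num=2, den=3):
    """Keep projects and attributes that are observed in at least num/den of cases.

    rows[i][j] is an attribute value, or None when missing.
    Returns the indices of the kept projects and kept attributes.
    Integer arithmetic (den * count >= num * total) avoids rounding issues.
    """
    row_idx = list(range(len(rows)))
    col_idx = list(range(len(rows[0]))) if rows else []
    while True:
        # Keep attributes observed in enough of the remaining projects.
        new_cols = [j for j in col_idx
                    if den * sum(rows[i][j] is not None for i in row_idx)
                    >= num * len(row_idx)]
        # Keep projects observed on enough of the remaining attributes.
        new_rows = [i for i in row_idx
                    if den * sum(rows[i][j] is not None for j in new_cols)
                    >= num * len(new_cols)]
        if new_cols == col_idx and new_rows == row_idx:
            return row_idx, col_idx
        row_idx, col_idx = new_rows, new_cols

# Hypothetical toy data: 4 projects, 3 attributes.
rows = [[1, 0, None],
        [1, None, None],
        [0, 1, None],
        [None, None, None]]
kept_projects, kept_attributes = filter_by_observed_ratio(rows)
```

The loop repeats because dropping a project changes the observed ratio of every attribute, and vice versa; it stops once neither set shrinks.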
  • 28. The ISBSG dataset. We use Equation 10 to normalize the efforts of projects into l (= 3) classes.
    $$c_t = \left\lfloor \frac{l \times (\mathit{effort}_{D_i} - \mathit{effort}_{min})}{\mathit{effort}_{max} - \mathit{effort}_{min}} \right\rfloor + 1 \tag{10}$$
    Table: The effort classes in ISBSG data set.
      Class No.   # of projects   Label
      1           85              Low
      2           76              Medium
      3           27              High
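
Equation 10 amounts to equal-width binning of the effort range, which can be sketched as follows. The function name and the sample effort values are hypothetical. Clamping the result to l is an assumption added here: taken literally, Equation 10 maps the project with the maximum effort to l + 1, and the slides do not say how that case was handled.

```python
from math import floor

def effort_class(effort, effort_min, effort_max, l=3):
    """Equation 10: map an effort value to a class number in 1..l."""
    c = floor(l * (effort - effort_min) / (effort_max - effort_min)) + 1
    return min(c, l)  # assumption: the maximum-effort project stays in class l

# Hypothetical effort values on a range from 0 to 300.
classes = [effort_class(e, 0, 300) for e in (0, 99, 100, 250, 300)]
```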
  • 29. The CSBSG dataset. The CSBSG dataset contains 1103 projects collected from 140 organizations and 15 regions across China by the Chinese association of software industry. We extract 94 projects and 21 attributes (15 nominal attributes and 6 continuous attributes) with the same selection criterion as for the ISBSG data set. We use Equation 10 to normalize the efforts of projects into l (= 3) classes.
    Table: The effort classes in CSBSG data set.
      Class No.   # of projects   Label
      1           27              Low
      2           31              Medium
      3           36              High
  • 31. Experiment setup. To evaluate the proposed method comparatively, we adopt MI and MINI to impute the missing values of the ISBSG and CSBSG datasets. BPNN is then used to classify the projects in the data sets after imputation. Our experiments are conducted with 10-fold cross-validation.
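A 10-fold cross-validation split can be sketched as follows. The slides do not state whether the folds were shuffled or stratified by effort class, so this sketch assumes a seeded shuffle without stratification. The function name `k_fold_indices` is hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_items: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train indices, test indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: shuffled, not stratified
    # Deal the shuffled indices round-robin so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With the 94 CSBSG projects, `k_fold_indices(94)` produces ten test folds of 9 or 10 projects each. Every project appears in exactly one test fold.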
  • 33. EM-T and EM-I on ISBSG dataset. The following figure illustrates the performance of the missing data toleration strategy (hereafter called EM-T) and the missing data imputation strategy (hereafter called EM-I) in handling the missing data for effort prediction on the ISBSG data set.
  • 34. EM-T and EM-I on ISBSG dataset. Figure: Performances of naive Bayes with EM-I and EM-T in comparison with BPNN on effort prediction using ISBSG data set. (Series: EM-I, EM-T, BPNN+MI, BPNN+MINI. x-axis: # of unlabeled projects, 0 to 20. y-axis: accuracy, 0.6 to 0.8.)
  • 35. EM-T and EM-I on ISBSG dataset. What we can see from the figure. Both EM-I and EM-T perform better than BPNN with either MI or MINI when classifying the projects in the ISBSG data set. The performance of naive Bayes and EM improves when unlabeled projects are added. This outcome illustrates that semi-supervised learning can improve the prediction of software effort.
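How unlabeled projects can feed into naive Bayes through EM is sketched below. This is a generic illustration, not the authors' implementation: it handles nominal attributes only, uses Laplace smoothing, runs a fixed number of iterations, and assumes that "toleration" means skipping missing values (encoded as `None`) when counting and when computing likelihoods. All function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Callable, List, Optional, Sequence, Tuple

Row = Sequence[Optional[str]]  # None marks a missing nominal value
Model = Tuple[List[float], Callable[[int, int, str], float]]


def m_step(rows, weights, n_classes, domains) -> Model:
    """Weighted, Laplace-smoothed estimates of P(c) and P(value | attribute, c)."""
    class_mass = [1.0] * n_classes  # one pseudo-count per class
    counts = [defaultdict(float) for _ in range(n_classes)]     # (attr, value) -> mass
    attr_mass = [defaultdict(float) for _ in range(n_classes)]  # attr -> observed mass
    for row, w in zip(rows, weights):
        for c in range(n_classes):
            class_mass[c] += w[c]
            for j, v in enumerate(row):
                if v is not None:  # missing values contribute nothing
                    counts[c][(j, v)] += w[c]
                    attr_mass[c][j] += w[c]
    total = sum(class_mass)
    priors = [m / total for m in class_mass]

    def cond(c: int, j: int, v: str) -> float:
        return (counts[c][(j, v)] + 1.0) / (attr_mass[c][j] + len(domains[j]))

    return priors, cond


def posterior(row: Row, priors, cond) -> List[float]:
    """P(c | row), computed in log space; missing attributes are skipped."""
    logs = []
    for c, p in enumerate(priors):
        s = math.log(p)
        for j, v in enumerate(row):
            if v is not None:
                s += math.log(cond(c, j, v))
        logs.append(s)
    top = max(logs)
    exps = [math.exp(s - top) for s in logs]
    z = sum(exps)
    return [e / z for e in exps]


def fit_nb_em(labeled, labels, unlabeled, n_classes, n_iter=10) -> Model:
    """Train on labeled rows, then refine with soft labels for unlabeled rows."""
    rows = list(labeled) + list(unlabeled)
    n_attrs = len(rows[0])
    domains = [{r[j] for r in rows if r[j] is not None} for j in range(n_attrs)]
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    priors, cond = m_step(labeled, hard, n_classes, domains)
    for _ in range(n_iter):
        soft = [posterior(r, priors, cond) for r in unlabeled]        # E-step
        priors, cond = m_step(rows, hard + soft, n_classes, domains)  # M-step
    return priors, cond


def predict(row: Row, priors, cond) -> int:
    """Index of the most probable class for `row`."""
    post = posterior(row, priors, cond)
    return max(range(len(post)), key=post.__getitem__)
```

Labeled projects keep their known class throughout, while unlabeled projects contribute fractional counts weighted by their current posterior.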
  • 36. EM-T and EM-I on ISBSG dataset. What we can see from the figure. If supervised learning is used for software effort prediction, the MINI method is the preferable way to impute the missing values, whereas the missing data toleration strategy may not be a desirable way to handle them. The imputation strategy is more effective than the toleration strategy when naive Bayes and EM are used to predict ISBSG software efforts.
  • 37. EM-T and EM-I on CSBSG dataset. EM-T and EM-I in handling the missing data for effort prediction on the CSBSG dataset. Figure: Performances of EM-I and EM-T in comparison with BPNN on predicting effort with different numbers of unlabeled projects using the CSBSG dataset. (Series: EM-I, EM-T, BPNN+MI, BPNN+MINI. x-axis: # of unlabeled projects, 0 to 8. y-axis: accuracy, 0.5 to 0.8.)
  • 38. EM-T and EM-I on CSBSG dataset. What we can see from the above figure. EM-I also performs better than EM-T on the CSBSG data set, as it does on the ISBSG dataset. This further validates our conjecture that EM-I outperforms EM-T in software effort prediction. However, EM-T performs better than EM-I once the number of unlabeled projects exceeds the number at which accuracy reaches its "maxima", which differs from the ISBSG result. A possible explanation is the relatively small size of the CSBSG dataset, where the imputation strategy is more prone than the toleration strategy to introduce bias into the prediction.
  • 39. More experiments and hypotheses testing. More experimental results with explanations are detailed in the paper. We also conduct hypothesis testing to examine the significance of the conclusions drawn from our experiments. Interested readers may refer to the paper.
  • 40. Threats. The threat to external validity is primarily the degree to which the attributes we used describe the projects, and how representative the ISBSG and CSBSG sample datasets are. The threats to internal validity are measurement and data effects that can bias our results, caused by using accuracy as the performance measure. The threat to construct validity is that our experiments use clipped attributes and clipped project data from both the ISBSG and CSBSG datasets.
  • 41. Conclusion. Semi-supervised learning, in the form of naive Bayes and EM, is employed to predict software effort. We propose two strategies embedded in naive Bayes and EM to handle the missing data.
  • 42. Future work. We plan to compare the proposed techniques with other missing data imputation techniques, such as FIML and MSWR. We will develop more missing data techniques embedded in naive Bayes and EM for software effort prediction. We have already investigated the underlying mechanism of missingness (structural or unstructured) in software effort data. Building on this progress, we will tailor the missing data handling strategies to the underlying missingness mechanism of software effort data.
  • 43. Thanks. Any further questions about the content of the slides and the paper can be sent to Mr. Wen Zhang. Email: