Difference Between Filter-Based Methods and Feature Selection:
2. Context:
• Dataset selection can lead to better performance for cross-project defect prediction (CPDP). On the other hand, feature selection and data quality are issues to consider in CPDP.
• With the huge amount of data that can be obtained by mining software historical repositories, some features (metrics) may be uncorrelated with the faults; such features mislead the learning algorithm and thus decrease its performance.
3. Objective:
• We aim to utilize the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, to produce validation sets for generating evolving training datasets that tackle CPDP while accounting for potential noise in defect labels. We also investigate the impact of using different feature sets.
• A novel feature selection (FS) approach is proposed to enhance the performance of a layered recurrent neural network (L-RNN), which is used as a classification technique for the software fault prediction (SFP) problem.
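The (NN)-Filter mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the paper: the function name `nn_filter`, the Euclidean distance, k = 2, and the toy metric vectors are all hypothetical.

```python
import math

def nn_filter(train_rows, test_rows, k=2):
    """Illustrative NN-Filter: for every target-project (test) instance,
    keep its k nearest cross-project (training) instances by Euclidean
    distance over the metric values; the union forms the filtered
    training set."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    selected = set()
    for t in test_rows:
        nearest = sorted(range(len(train_rows)),
                         key=lambda i: dist(train_rows[i], t))[:k]
        selected.update(nearest)
    return [train_rows[i] for i in sorted(selected)]

# toy metric vectors (e.g., size-like and complexity-like values)
train = [[1.0, 1.0], [1.1, 0.9], [9.0, 9.0], [8.5, 9.5], [5.0, 5.0]]
test = [[1.05, 1.0], [0.9, 1.1]]
print(nn_filter(train, test, k=2))  # → [[1.0, 1.0], [1.1, 0.9]]
```

Only the cross-project instances that resemble the target-project data survive the filter; the distant ones are discarded.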
4. Method:
• We use 41 releases of 11 multi-version projects to assess the performance of GIS in comparison with benchmark CPDP approaches (NN-filter and Naive-CPDP) and within-project approaches (Cross-Validation (CV) and Previous Releases (PR)). To assess the impact of feature sets, we use two sets of features, SCM+OO+LOC (all) and CK+LOC (ckloc), as well as iterative info-gain subsetting (IG) for feature selection.
• Three different wrapper FS algorithms (i.e., Binary Genetic Algorithm (BGA), Binary Particle Swarm Optimization (BPSO), and Binary Ant Colony Optimization (BACO)) were employed iteratively. To assess the performance of the proposed approach, 19 real-world software projects from the PROMISE repository are investigated.
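The ranking step behind iterative info-gain subsetting (IG) could look like the sketch below, assuming already-discretized metric values. The helper names (`entropy`, `info_gain`) and the toy data are hypothetical, and the outer loop that iteratively grows and re-evaluates the top-ranked subset is omitted.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """IG(feature) = H(labels) - sum_v p(v) * H(labels | feature == v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# toy discretized metrics: feature 0 tracks the defect label, feature 1 is noise
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]   # (feature0, feature1) per instance
labels = [0, 0, 1, 1]                     # defect labels
gains = [info_gain([r[j] for r in rows], labels) for j in range(2)]
ranked = sorted(range(2), key=lambda j: -gains[j])
print(gains, ranked)  # feature 0 gets IG 1.0, the noise feature 0.0
```

Features with zero gain carry no information about the defect labels, which is exactly the kind of metric the Context section warns can mislead the learner.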
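A wrapper FS loop in the spirit of the Binary Genetic Algorithm (BGA) can be sketched as below. This is a hedged simplification, not the papers' setup: truncation selection, one-point crossover, a 1-NN accuracy fitness, and all names and toy data are assumptions (the actual work wraps classifiers such as the L-RNN and optimizes measures like AUC).

```python
import random

random.seed(0)  # deterministic toy run

def fitness(mask, train, val):
    """Wrapper fitness: 1-NN accuracy on a validation split, using only
    the features switched on in the binary mask."""
    if not any(mask):
        return 0.0
    idx = [j for j, bit in enumerate(mask) if bit]
    def predict(x):
        return min(train, key=lambda r: sum((r[0][j] - x[j]) ** 2 for j in idx))[1]
    return sum(predict(x) == y for x, y in val) / len(val)

def binary_ga(train, val, n_feat, pop=8, gens=15, p_mut=0.1):
    """Evolve binary feature masks: keep the top half each generation
    (truncation selection), refill with one-point crossover + mutation."""
    population = [[random.randint(0, 1) for _ in range(n_feat)] for _ in range(pop)]
    for _ in range(gens):
        parents = sorted(population, key=lambda m: -fitness(m, train, val))[: pop // 2]
        children = []
        while len(children) < pop - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_feat)
            children.append([1 - g if random.random() < p_mut else g
                             for g in a[:cut] + b[cut:]])
        population = parents + children
    return max(population, key=lambda m: fitness(m, train, val))

# toy data: feature 0 separates the classes, features 1 and 2 are constant noise
train = [([0.0, 3.0, 7.0], 0), ([0.1, 3.0, 7.0], 0),
         ([1.0, 3.0, 7.0], 1), ([0.9, 3.0, 7.0], 1)]
val = [([0.05, 3.0, 7.0], 0), ([0.94, 3.0, 7.0], 1)]
best = binary_ga(train, val, n_feat=3)
print(best, fitness(best, train, val))  # the winning mask keeps feature 0
```

The same skeleton yields BPSO or BACO by swapping the crossover/mutation step for particle-velocity or pheromone updates over the same binary masks.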
5. Results:
• The performance of GIS is comparable to that of the within-project defect prediction (WPDP) benchmarks, i.e., CV and PR. In terms of the multiple comparisons test, all variants of GIS belong to the top-ranking group of approaches. Better feature selection techniques coupled with the proposed instance selection approach, i.e., GIS, can lead to better predictions and even outperform WPDP.
• The results are compared with other state-of-the-art approaches, including Naïve Bayes (NB), Artificial Neural Network (ANN), logistic regression (LR), k-nearest neighbors (k-NN), and C4.5 decision trees, in terms of area under the curve (AUC).
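For reference, the AUC used in these comparisons can be computed directly from predicted scores via the Mann-Whitney statistic; the function name `auc` and the toy scores below are illustrative, not taken from either study.

```python
def auc(scores, labels):
    """Area under the ROC curve as the Mann-Whitney statistic: the
    probability that a randomly chosen defective instance (label 1) is
    scored above a randomly chosen clean one (label 0); ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0]))  # → 1.0 (perfect ranking)
print(auc([0.2, 0.8, 0.4, 0.9], [1, 1, 0, 0]))  # → 0.25
```

Because AUC depends only on the ranking of scores, it is insensitive to class imbalance, which is why it is a common yardstick in defect prediction.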
8. Conclusions:
• With the results of this study, we show the usefulness of third-party project data and of search-based methods in the context of cross-project defect prediction. We observed that the performance of a simple classifier like Naive Bayes could be boosted with such approaches. Using a different fitness function targeting other measures, such as precision or AUC (Area Under the Curve), may lead to different results while giving practitioners the flexibility to guide the process toward their desired goals.
• The proposed algorithm is able to obtain an excellent classification rate (with an average AUC of 0.8358 over all datasets), which outperforms existing results found in the literature for Naïve Bayes (NB), Artificial Neural Network (ANN), logistic regression (LR), k-nearest neighbors (k-NN), and C4.5. The obtained results support our claim about the importance of feature selection in building a high-quality classifier, rather than using a fixed set of features or all features.
9. Future Work:
• Possible future works include other validation-dataset selection techniques based on approaches such as clustering, distributional characteristics, or small portions of within-project data; better and more powerful feature selection techniques; tuning the parameters of the genetic model; and designing other fitness functions focused on different measures.
• For future work, we plan to investigate the performance of different classifiers, such as genetic programming, to build a model that is able to predict faults based on the selected metrics.