Introduction
Ø Software defect prediction models reveal defect-prone parts of the software to guide managers in allocating testing resources efficiently
Ø Popular studies
  Ø Estimate the number of defects remaining in software systems
  Ø Discover defect associations
  Ø Classify the defect-proneness of software components into two classes, defect-prone and not defect-prone
Ø Metrics
  Ø Static code
  Ø History
  Ø Social
Introduction
Ø Numerous defect prediction studies in the last 40 years
Ø Statistical techniques with machine learning algorithms are adopted
  Ø Nagappan et al., Ostrand et al., Zimmermann et al., Fenton et al., Khoshgoftaar et al.
Ø Benchmarking studies
  Ø Lessmann et al. and Menzies et al.
Ø Systematic literature surveys
  Ø Hall et al.
Ø Industrial case studies
  Ø Tosun et al.
Introduction
Ø Major challenges in building defect prediction models:
Ø High dimensionality of software defect data
  Ø The number of available software metrics is too large for a classifier to work with
Ø Skewed, imbalanced datasets
  Ø The proportion of one class is considerably larger than the proportion of the other class
Ø Performance limitations
  Ø Limited information content
  Ø Performance ceiling effect
Ø Incomplete datasets
  Ø Features of the training set may differ from the features of the test set
  Ø Some of the test set attributes may be missing
  Ø There may be extra attributes in test sets
  Ø Building a model with several datasets
    Ø Different datasets may have different attribute sets
Introduction
Ø Missing value patterns may be in different forms:
  Ø Data may be missing at individual points
  Ø Some attribute values may be considered outliers
  Ø Data may be missing in chunks
Ø You may want to build your model with several datasets, and the attributes of these datasets may differ
  Ø When these datasets are concatenated, there will probably be missing chunks
Ø Solutions might be:
  Ø To use the largest common attribute set, OR
  Ø To apply imputation to the missing attributes
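The two options above can be sketched with a small hypothetical example (the column names and values are illustrative, not taken from the study's datasets):

```python
import pandas as pd

# Two hypothetical metric tables whose attribute sets differ.
d1 = pd.DataFrame({"loc": [120, 80], "churn": [5, 2], "defective": [1, 0]})
d2 = pd.DataFrame({"loc": [60, 200], "fanin": [3, 7], "defective": [0, 1]})

# Concatenating them leaves missing chunks (NaN) where attributes do not overlap.
merged = pd.concat([d1, d2], ignore_index=True)

# Option 1: keep only the largest common attribute set.
common_cols = [c for c in d1.columns if c in d2.columns]
common = merged[common_cols]

# Option 2: mean-value imputation for the missing attributes.
imputed = merged.fillna(merged.mean())
```

With the common-attribute option, information in the non-shared columns is discarded; with imputation, every instance is kept but the filled-in values carry no project-specific signal.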
Proposed Solution
Matrix Factorization is a solution to the data scarcity problem in recommendation systems
Related Work
• Recommendation systems
  o Netflix Prize competition
    • Koren, Bell, and Volinsky
  o Collective Matrix Factorization
    • Singh et al. and Lippert et al.
Matrix Factorization
• Netflix competition
  o Matrix factorization models are superior to classical nearest-neighbor techniques, as they allow the incorporation of additional information and offer scalable predictive accuracy (Bell et al.)
• Matrix factorization is basically factorizing a large matrix into two smaller matrices called factors.
• The factors are multiplied to approximately reconstruct the original matrix.
Matrix Factorization
• Nonnegative MF algorithms (Berry et al.)
  o Multiplicative update algorithms
  o Gradient descent algorithms
    • Easiest to implement and to scale
  o Alternating least squares algorithms
• Multi-Relational Matrix Factorization by Lippert et al.
  o Low-norm Matrix Factorization based on a gradient descent algorithm
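A minimal gradient-descent factorization over only the observed entries might look like the following sketch. This is not the authors' implementation; the toy matrix, rank, and hyperparameters are illustrative:

```python
import numpy as np

def factorize(R, mask, k=2, lr=0.01, reg=0.02, epochs=1500, seed=0):
    """Approximate R ~= U @ V.T by gradient descent,
    using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    rows, cols = np.nonzero(mask)
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            ui, vj = U[i].copy(), V[j].copy()
            err = R[i, j] - ui @ vj
            U[i] += lr * (err * vj - reg * ui)  # gradient step on the row factor
            V[j] += lr * (err * ui - reg * vj)  # gradient step on the column factor
    return U, V

# Toy matrix; zeros mark the missing entries in this example.
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [0., 1., 5., 4.]])
U, V = factorize(R, R > 0)
R_hat = U @ V.T  # the dense reconstruction fills in the missing entries
```

The key property used for missing data is that the loss is summed only over observed entries, yet the product U @ V.T is dense, so predictions for the missing cells come for free.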
Experiment
Datasets

Dataset        Static Code Metrics  Churn Metrics  Social Metrics  Instances  Defective %
Android        106                  15             25              12981      6.4
Linux Kernel   106                  15             25              14801      5.5
Perl           106                  15             25              125        61.6
VLC            106                  15             25              936        39.2

• Android
  o Open source operating system designed for mobile devices
• Linux Kernel
  o Open source operating system
• Perl
  o Stable, cross-platform, open source interpreted language
• VLC
  o Open source multimedia player
Experiment

Experiment 1
• The performance of the Naive Bayes algorithm is explored
• Run 10 times 10-fold cross-validation while gradually removing attributes from the datasets
• Attributes are removed according to their correlation with the class attribute
  o Pearson correlation is used
• 4 (datasets) x 10 (removal steps) x 10x10 (cross-validation) = 4000 Naive Bayes prediction models are built

Experiment 2
• The performances of Naive Bayes with imputation and Matrix Factorization are compared
• Attributes are chosen according to their correlation with the class attribute
  o Pearson correlation is used
• The imputation or removal procedure is applied to the chosen attributes in increasing proportion
• 4 (datasets) x 10 (attribute selection steps) x 10 (imputation steps) x 10 (fold size) = 4000 Naive Bayes and Matrix Factorization models are built
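The loop of Experiment 1 can be sketched as follows with scikit-learn. A synthetic dataset stands in for the real metric data, which is not reproduced here, and GaussianNB is an assumed variant of Naive Bayes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Stand-in data; the study uses the Android, Kernel, Perl, and VLC metrics.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Rank attributes by absolute Pearson correlation with the class attribute.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
order = np.argsort(corr)[::-1]  # most correlated first

# 10 times 10-fold cross-validation at each removal step.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = []
for pct_removed in range(0, 100, 10):  # gradually drop the least correlated attributes
    n_keep = max(1, round(X.shape[1] * (100 - pct_removed) / 100))
    kept = order[:n_keep]
    scores.append(cross_val_score(GaussianNB(), X[:, kept], y, cv=cv).mean())
```

Each of the 10 removal steps trains 10x10 = 100 models, matching the model count described above for a single dataset.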
Results (Exp. 1)
[Figure: Balance values of Naive Bayes with respect to feature reduction percentage, for the Android, Kernel, Perl, and VLC datasets]
Results (Exp. 2)
[Figure: Balance values of MF with respect to the missing Churn and Social attribute data, and of NB with imputation on Churn and Social attributes, for the Android, Kernel, Perl, and VLC datasets]
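The balance measure reported in these results is commonly defined (following Menzies et al., cited earlier) as one minus the normalized distance from the classifier's (pf, pd) point to the ideal point pf = 0, pd = 1; a small sketch:

```python
import math

def balance(pd_rate, pf_rate):
    """1 minus the Euclidean distance from (pf, pd) to the ideal
    point (pf=0, pd=1), normalized to lie in [0, 1]."""
    return 1 - math.sqrt((0 - pf_rate) ** 2 + (1 - pd_rate) ** 2) / math.sqrt(2)

balance(1.0, 0.0)  # perfect detector: balance = 1.0
balance(0.0, 0.0)  # flags nothing as defective: balance ~= 0.29
```

The measure rewards high probability of detection (pd) and low probability of false alarm (pf) simultaneously, which matters on skewed datasets like these.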
Threats to Validity
• Internal validity
  o Naive Bayes, mean-value imputation, and Matrix Factorization have been widely used in previous studies.
  o The performance measures used for evaluation have also been adopted by several researchers in the past.
  o Studies discussing static code, history, and social metrics are abundant.
  o The datasets are extracted from open source project repositories and have also been used in previous studies.
• External validity
  o Four different datasets extracted from open source project repositories.
  o Nevertheless, our results are limited to the analyzed data and context.
Conclusion
• Collective matrix factorization from recommender systems is applied to the missing data problem in defect prediction
• Two experiments conducted
  o The performance of NB with feature reduction
  o The performance of NB with mean-value imputation vs. the performance of MF with missing data
• NB performance decreases as the number of features is reduced
• Matrix Factorization performs better on datasets with missing data than the benchmark model with imputation
• Future Work
  o Support the findings using more complex imputation techniques
  o Different missing data scenarios may be adopted