Handling Missing Attributes using Matrix Factorization 

Övünç Bozcan, Raise'13
Ayşe Başar Bener


  1. Handling Missing Attributes using Matrix Factorization
     Övünç Bozcan, Software Research Lab, Dept. of Computer Engineering, Boğaziçi University, Istanbul, Turkey (ovuncbozcan@gmail.com)
     Ayşe Başar Bener, Data Science Lab, Mechanical and Industrial Engineering, Ryerson University, Toronto, Canada (ayse.bener@ryerson.ca)
  2. Outline
     • Introduction
     • Related Work
     • Matrix Factorization
     • Experiment
     • Results
     • Conclusion
  3. Introduction
     • Software defect prediction models reveal defect-prone parts of the software to guide managers in allocating testing resources efficiently
     • Popular studies
       o Estimate the number of defects remaining in software systems
       o Discover defect associations
       o Classify defect-proneness of software components into two classes, defect-prone and not defect-prone
     • Metrics
       o Static code
       o History
       o Social
  4. Introduction
     • Numerous defect prediction studies over the last 40 years
       o Statistical techniques with machine learning algorithms are adopted
       o Nagappan et al., Ostrand et al., Zimmermann et al., Fenton et al., Khoshgoftaar et al.
     • Benchmarking studies
       o Lessmann et al. and Menzies et al.
     • Systematic literature surveys
       o Hall et al.
     • Industrial case studies
       o Tosun et al.
  5. Introduction
     • Major challenges in building defect prediction models:
       o High dimensionality of software defect data
         - The number of available software metrics is too large for a classifier to work
       o Skewed, imbalanced data sets
         - The proportion of one class is much larger than the proportion of the other class
       o Performance limitations
         - Limited information content
         - Performance ceiling effect
       o Incomplete datasets
         - Features of the training set may differ from the features of the test set
         - Some of the test set attributes may be missing
         - There may be extra attributes in test sets
       o Building a model with several datasets
         - Different datasets may have different attribute sets
  6. Introduction
     • The missing value pattern may take different forms:
       o Data may be missing at individual points
         - Some attribute values may be considered as outliers
       o Data may be missing in chunks
         - You may want to build your model with several datasets, and the attributes of these datasets may differ
         - When these datasets are concatenated, there will probably be missing chunks
     • Solution might be:
       o To use the largest common attribute set, OR
       o To impute the missing attributes
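The two options on this slide can be illustrated with a small sketch. The attribute names (`loc`, `churn`, `commits`) and all values below are hypothetical. They only show how concatenating datasets with different attribute sets produces missing chunks, and how much is discarded when falling back to the largest common attribute set.

```python
def concatenate(datasets):
    """Stack rows from several datasets over the union of their attributes.

    Attributes that a dataset does not provide are filled with None,
    which is how the missing chunks arise.
    """
    all_attrs = sorted({attr for rows in datasets for row in rows for attr in row})
    merged = [
        {attr: row.get(attr) for attr in all_attrs}
        for rows in datasets
        for row in rows
    ]
    return all_attrs, merged


def common_attributes(datasets):
    """Return the largest attribute set shared by every row of every dataset."""
    attr_sets = [set(row) for rows in datasets for row in rows]
    return sorted(set.intersection(*attr_sets))


# Hypothetical projects that share only one metric.
project_a = [{"loc": 120, "churn": 14}, {"loc": 45, "churn": 3}]
project_b = [{"loc": 300, "commits": 7}]

attrs, merged = concatenate([project_a, project_b])
shared = common_attributes([project_a, project_b])
```

Here the union keeps all three attributes at the cost of `None` entries, while the common set keeps only `loc`.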
  7. Proposed Solution
     Matrix Factorization is a solution to the data scarcity problem in recommendation systems
  8. Related Work
     • Recommendation systems
       o Netflix Prize competition: Koren, Bell, and Volinsky
       o Collective Matrix Factorization: Singh et al. and Lippert et al.
  9. Matrix Factorization
     • Netflix competition
       o Matrix Factorization models are superior to classical nearest-neighbor techniques because they allow additional information to be incorporated and offer scalable predictive accuracy (Bell et al.)
     • Matrix factorization is basically factorizing a large matrix into two smaller matrices called factors.
     • Multiplying the factors reconstructs (approximately) the original matrix.
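A minimal sketch of the idea on this slide, assuming a plain stochastic gradient descent fit over the observed entries only. This is not the collective factorization used in the talk. The function names, hyperparameters, and the small `data` matrix are hypothetical, and `None` marks a missing value.

```python
import random


def factorize(matrix, rank=2, steps=3000, lr=0.01, reg=0.02, seed=0):
    """Fit factors U (rows x rank) and V (cols x rank) so that U V^T
    approximates the observed entries of `matrix` (None = missing)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    U = [[rng.random() for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.random() for _ in range(rank)] for _ in range(n_cols)]
    observed = [
        (i, j, matrix[i][j])
        for i in range(n_rows)
        for j in range(n_cols)
        if matrix[i][j] is not None
    ]
    for _ in range(steps):
        for i, j, value in observed:
            pred = sum(U[i][k] * V[j][k] for k in range(rank))
            err = value - pred
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Gradient step on squared error with L2 regularization.
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)
    return U, V


def reconstruct(U, V):
    """Multiply the factors back together; missing cells get predictions."""
    return [[sum(a * b for a, b in zip(u, v)) for v in V] for u in U]


# Hypothetical 5x4 matrix with missing entries.
data = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [1, None, None, 4],
    [None, 1, 5, 4],
]
U, V = factorize(data)
approx = reconstruct(U, V)
```

Because only observed cells contribute to the updates, the product `approx` also contains values for cells that were `None`, which is what makes the technique usable for missing attributes.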
  10. Matrix Factorization
     • Nonnegative MF algorithms (Berry et al.)
       o Multiplicative update algorithms
       o Gradient descent algorithms
         - Easiest to implement and to scale
       o Alternating least squares algorithms
     • Multi-Relational Matrix Factorization by Lippert et al.
       o Low-norm Matrix Factorization based on a gradient descent algorithm
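For the first algorithm family listed above, the following is a sketch of the standard multiplicative update rules for nonnegative matrix factorization (the Lee and Seung form for the Frobenius norm), written in pure Python for a fully observed matrix. The helper names, the random initialization, and the example matrix `X` are assumptions made for illustration. The slides do not say which variant was used.

```python
import random


def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(X, W, H):
    """Frobenius norm of X - W H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry)) ** 0.5


def nmf_multiplicative(X, rank=2, iterations=200, eps=1e-9, seed=0):
    """Factor nonnegative X (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    initial_error = frobenius_error(X, W, H)
    for _ in range(iterations):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H, initial_error


# Hypothetical nonnegative matrix.
X = [[1, 2, 3], [2, 4, 6], [3, 6, 9], [1, 0, 1]]
W, H, initial_error = nmf_multiplicative(X)
final_error = frobenius_error(X, W, H)
```

The updates only multiply by nonnegative ratios, so factors that start positive never become negative. The small `eps` guards against division by zero.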
  11. Experiment
     Datasets

     Dataset       Static Code Metrics  Churn Metrics  Social Metrics  Instances  Defective %
     Android       106                  15             25              12981      6.4
     Linux Kernel  106                  15             25              14801      5.5
     Perl          106                  15             25              125        61.6
     VLC           106                  15             25              936        39.2

     • Android
       o Open source operating system designed for mobile devices
     • Linux Kernel
       o Open source operating system
     • Perl
       o Stable, cross-platform, open source interpreted language
     • VLC
       o Open source multimedia player
  12. Experiment
     Performance Measures
       o Pd
       o Pf
       o Balance
     Learning Algorithms
       o Naive Bayes
       o Matrix Factorization
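The slide names the measures without defining them. The sketch below uses the definitions common in the defect prediction literature (probability of detection, probability of false alarm, and balance as the normalized distance from the ideal point pd = 1, pf = 0). It is an assumption that the talk uses exactly these forms, and the confusion-matrix counts are hypothetical.

```python
import math


def defect_metrics(tp, fn, fp, tn):
    """Compute pd, pf and balance from confusion-matrix counts."""
    pd = tp / (tp + fn)  # share of defective modules that were flagged
    pf = fp / (fp + tn)  # share of clean modules that were flagged
    # Balance: 1 minus the normalized Euclidean distance to (pf=0, pd=1).
    balance = 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)
    return pd, pf, balance


# Hypothetical counts for one cross-validation fold.
pd, pf, balance = defect_metrics(tp=8, fn=2, fp=10, tn=80)
```

Under these definitions a perfect predictor scores a balance of 1, and one that flags every clean module while missing every defective one scores 0.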
  13. Experiment
     Experiment 1
     • The performance of the Naive Bayes algorithm is explored
     • Run 10 times 10-fold cross-validation while gradually removing attributes from the datasets
     • Attributes are removed according to their correlation with the class attribute
       o Pearson correlation is used
     • 4 (datasets) x 10 (removal steps) x 10 x 10 (fold size) = 4000 Naive Bayes prediction models are built
     Experiment 2
     • The performances of Naive Bayes with imputation and Matrix Factorization are compared
     • Attributes are chosen according to their correlation with the class attribute
       o Pearson correlation is used
     • The imputation or removal procedure is applied to the chosen attributes in increasing proportions
     • 4 (datasets) x 10 (attribute selection steps) x 10 (imputation steps) x 10 (fold size) = 4000 Naive Bayes and Matrix Factorization models are built
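The two building blocks of these experiments, ranking attributes by Pearson correlation with the class attribute and mean-value imputation, can be sketched as follows. The slides do not state whether the most or the least correlated attributes are removed first, so the sketch only produces a ranking by absolute correlation. The column names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(columns, labels):
    """Order attribute names by |correlation with the class|, strongest first."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical attribute columns and class labels (1 = defective).
columns = {"loc": [10, 20, 30, 40], "churn": [5, 1, 4, 2], "commits": [3, 3, 3, 3]}
labels = [0, 0, 1, 1]

ranking = rank_by_correlation(columns, labels)
imputed = mean_impute([2.0, None, 4.0, None])
```

In an experiment of this shape, the ranking would decide which attributes are removed or blanked out at each step, and `mean_impute` would fill the blanked cells before training the baseline classifier.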
  14. Results (Exp. 1)
     [Figure: Balance values of Naive Bayes with respect to feature reduction percentage. Panels: Android, Kernel, Perl, VLC]
  15. Results (Exp. 2)
     [Figure: Balance values of MF with respect to the missing Churn and Social Attribute data, and of NB with imputation on Churn and Social Attributes. Panels: Android, Kernel, Perl, VLC]
  16. Threats to Validity
     • Internal validity
       o Naive Bayes, Mean-Value Imputation and Matrix Factorization are widely used in previous studies.
       o The performance measures used for evaluation have also been adopted by several researchers in the past.
       o Studies discussing static code, history and social metrics are abundant.
       o The datasets are extracted from open source project repositories and are also used in previous studies.
     • External validity
       o Four different datasets extracted from open source project repositories.
       o Nevertheless, our results are limited to the analyzed data and context.
  17. Conclusion
     • Collective matrix factorization from recommender systems is applied to the missing data problem in defect prediction
     • Two experiments conducted
       o The performance of NB with feature reduction
       o The performance of NB with mean-value imputation vs. the performance of MF with missing data
     • NB performance decreases as the number of features is reduced.
     • Matrix Factorization performs better on datasets with missing data than the benchmark model with imputation
     • Future Work
       o Support the findings using more complex imputation techniques
       o Different missing data scenarios may be adopted
  18. Thank You
