Feature Selection Techniques For
Software Fault Prediction
(Summary)
Sungdo Gu
2015.03.27
MOTIVATION & PAPERS
 What is the minimum number of software metrics (features) that should be
considered for building an effective defect prediction model?
• A typical software defect prediction model is trained using software metrics
and fault data collected from previously developed software releases or
similar projects.
• Software quality is an important concern, and software fault prediction
helps teams concentrate on the fault-prone modules.
• With the increasing complexity of today's software, feature selection is
important for removing redundant, irrelevant, and erroneous data from the
dataset.
“How Many Software Metrics Should be Selected for Defect Prediction?”
“Measuring Stability of Threshold-based Feature Selection Techniques”
“A Hybrid Feature Selection Model For Software Fault Prediction”
FEATURE SELECTION TECHNIQUE
 Feature Selection
: the process of choosing a subset of features.
 By output, feature selection techniques divide into:
 feature ranking
 feature subset selection
 By search strategy, they divide into (see the sketch below):
 filter : a feature subset is selected without involving any
learning algorithm.
 wrapper : uses feedback from a learning algorithm to determine which
features to include in building a classification model.
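To make the filter/wrapper contrast concrete, here is a minimal scikit-learn sketch; the synthetic dataset, the mutual-information filter, and the RFE-around-logistic-regression wrapper are illustrative assumptions, not the papers' setup.

```python
# Filter vs. wrapper feature selection: a minimal illustrative sketch.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Filter: score each feature against the class without any learner,
# then keep the top k (here: mutual information).
filter_sel = SelectKBest(mutual_info_classif, k=5).fit(X, y)

# Wrapper: use a learner's feedback (here: recursive feature
# elimination around logistic regression) to decide what to keep.
wrapper_sel = RFE(LogisticRegression(max_iter=1000),
                  n_features_to_select=5).fit(X, y)

print("filter keeps: ", filter_sel.get_support(indices=True))
print("wrapper keeps:", wrapper_sel.get_support(indices=True))
```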
SOFTWARE METRICS
 A software metric is a quantitative measure of the degree to which a
software system or process possesses some property.
 CK metrics were designed:
 to measure unique aspects of the object-oriented approach.
 to measure the complexity of the design.
 McCabe & Halstead metrics were designed:
 to measure the complexity of module-based (procedural) programs.
SOFTWARE METRICS: Examples
<McCabe & Halstead Metrics> <CK Metrics>
CK Metrics: Examples
 WMC (Weighted Methods per Class)
 Definition
• WMC is the sum of the complexities of the methods of a class.
• WMC = Number of Methods (NOM) when every method's complexity is
taken to be unity (see the toy example below).
 DIT (Depth of Inheritance Tree)
 Definition
• The maximum path length from a class's node to the root of the inheritance tree.
 CBO (Coupling Between Objects)
 Definition
• The number of other classes to which a given class is coupled.
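A tiny worked example of the WMC definition; the class and its per-method cyclomatic complexities are made up for illustration.

```python
# Hypothetical per-method cyclomatic complexities for one class.
method_complexities = {"open": 2, "read": 4, "close": 1}

wmc = sum(method_complexities.values())  # WMC = 2 + 4 + 1 = 7
nom = len(method_complexities)           # NOM = 3

# If every method's complexity is taken to be unity, WMC collapses to NOM.
wmc_unity = sum(1 for _ in method_complexities)
assert wmc_unity == nom
print(wmc, nom)  # 7 3
```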
THRESHOLD-BASED FEATURE RANKING
 Threshold-Based Feature Selection technique (TBFS)
: belongs to the filter-based feature ranking category.
 Five versions of TBFS rankers, based on five different performance
metrics, are considered (one is sketched below):
• Mutual Information (MI)
• Kolmogorov-Smirnov (KS)
• Deviance (DV)
• Area Under the ROC (Receiver Operating Characteristic) Curve (AUC)
• Area Under the Precision-Recall Curve (PRC)
 TBFS can be extended to additional performance metrics such as the
F-measure, the odds ratio, etc.
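A minimal sketch of one TBFS variant, the AUC ranker, assuming a pandas DataFrame `X` of numeric module metrics and a binary label vector `y` (1 = fault-prone); the function name and data layout are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def tbfs_rank_auc(X: pd.DataFrame, y: np.ndarray) -> pd.Series:
    """Rank each feature by the TBFS-AUC score, best first."""
    scores = {}
    for col in X.columns:
        v = X[col].to_numpy(dtype=float)
        # Normalize the attribute to [0, 1] and treat it as a posterior
        # probability of membership in the fault-prone class.
        rng = v.max() - v.min()
        v_hat = (v - v.min()) / rng if rng > 0 else np.zeros_like(v)
        # Sweeping a classification threshold over [0, 1] and measuring
        # TPR/FPR at each threshold is exactly what the ROC curve
        # summarizes, so the AUC of the normalized attribute against the
        # class attribute is the feature's TBFS-AUC score.
        auc = roc_auc_score(y, v_hat)
        # A feature is equally informative whether high values indicate
        # fp or nfp, so fold the score around 0.5.
        scores[col] = max(auc, 1.0 - auc)
    return pd.Series(scores).sort_values(ascending=False)
```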
CLASSIFIER
 Three classifiers
 Multilayer Perceptron
 k-Nearest Neighbors
 Logistic Regression
 Classifier Performance Metric
→ AUC (Area Under the ROC (Receiver Operating Characteristic) Curve)
: a performance metric that captures a classifier's ability to differentiate
between the two classes.
- The AUC is a single-value measure whose value ranges from 0 to 1.
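A minimal sketch of the three learners scored by cross-validated AUC, assuming a numeric feature matrix `X` and binary labels `y` as above; the hyperparameters are illustrative defaults, not the papers' settings.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

learners = {
    "Multilayer Perceptron": MLPClassifier(max_iter=1000),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, clf in learners.items():
    # AUC summarizes ranking quality across all thresholds: 1.0 is a
    # perfect separator, 0.5 is chance level.
    aucs = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {aucs.mean():.3f}")
```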
SOFTWARE MEASUREMENT DATA
 The software metrics and fault data were collected from a real-world software
project: the Eclipse datasets from the PROMISE data repository.
 Transform the original data by
(1) removing all non-numeric attributes
(2) converting the post-release defects attribute to a binary class attribute
: fault-prone (fp) / not-fault-prone (nfp)
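A minimal sketch of this transformation, assuming a pandas DataFrame `df` loaded from a PROMISE Eclipse file whose post-release defect count lives in a column named `post` (the column name is an assumption).

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    # (1) Remove all non-numeric attributes.
    numeric = df.select_dtypes(include="number")
    # (2) Convert post-release defects to a binary class attribute:
    # fault-prone (fp) if at least one defect was reported, else nfp.
    y = (numeric["post"] > 0).astype(int)  # 1 = fp, 0 = nfp
    X = numeric.drop(columns=["post"])
    return X, y
```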
EMPIRICAL DESIGN
 Rank the metrics and choose the top 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, and 20
metrics according to their respective scores.
 The defect prediction models are evaluated in terms of the AUC performance
metric.
 Five-fold cross-validation is used to understand the impact on the models'
predictive power of (see the sketch below):
 different feature subset sizes
 the five filter-based rankers
 the three different learners
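A minimal sketch of the experimental loop, reusing `tbfs_rank_auc`, `learners`, `X`, and `y` from the earlier sketches; the exact protocol is an assumption.

```python
from sklearn.model_selection import cross_val_score

# Rank once with the TBFS-AUC filter, then evaluate each top-k subset
# with each learner under five-fold cross-validated AUC.
ranking = tbfs_rank_auc(X, y.to_numpy())
for k in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]:
    top_k = ranking.index[:k]
    for name, clf in learners.items():
        aucs = cross_val_score(clf, X[top_k], y, cv=5, scoring="roc_auc")
        print(f"top-{k:2d} | {name}: AUC = {aucs.mean():.3f}")
```

For brevity this sketch ranks features once on the full dataset; a stricter replication would re-rank inside each training fold to avoid selection bias.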
EMPIRICAL RESULT
STABILITY (ROBUSTNESS)
 The STABILITY of a feature selection method is normally defined as the
degree of agreement between its outputs when applied to randomly
selected subsets of the same input data.
• To assess the robustness (stability) of feature selection techniques,
the consistency index was used.
 Let T_i and T_j be subsets of features, where |T_i| = |T_j| = k. The
consistency index is

I_C(T_i, T_j) = (dn − k²) / (k(n − k))

where n is the total number of features in the dataset and d is the
cardinality of the intersection between subsets T_i and T_j.
=> The greater the consistency index, the more similar the subsets are.
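A minimal sketch of this consistency index (Kuncheva's index, which matches the n, d, k definitions above); the subset values are made up for illustration.

```python
def consistency_index(t_i: set, t_j: set, n: int) -> float:
    k = len(t_i)
    assert len(t_j) == k and 0 < k < n, "need |T_i| = |T_j| = k, 0 < k < n"
    d = len(t_i & t_j)  # cardinality of the intersection
    # dn - k^2 recenters the raw overlap d so that chance-level overlap
    # (k^2 / n features expected in common) maps to 0 and identical
    # subsets map to 1.
    return (d * n - k * k) / (k * (n - k))

# Example: 20 features total, two top-5 subsets sharing 4 features.
print(consistency_index({0, 1, 2, 3, 4}, {0, 1, 2, 3, 9}, n=20))  # ~0.733
```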
OTHER RESULTS
A HYBRID FEATURE SELECTION MODEL
 Filter-method
• Correlation-based Feature Selection
• Chi-Squared
• OneR
• Gain Ratio
 Wrapper-method (see the sketch below)
• Naïve Bayes
• RBF Network (Radial Basis Function Network)
• J48 (Decision Tree)
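A minimal sketch of one plausible hybrid filter + wrapper pipeline, not the paper's exact algorithm: a chi-squared filter pre-ranks the metrics, then a wrapper keeps the subset size whose cross-validated AUC with a (Gaussian) Naïve Bayes learner is best. Assumes the DataFrame `X` and labels `y` from the preparation sketch.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def hybrid_select(X, y, max_k: int = 10):
    # Filter stage: chi-squared scores (requires non-negative features,
    # hence the abs; software metrics are typically non-negative anyway).
    scores, _ = chi2(np.abs(X), y)
    order = np.argsort(scores)[::-1]
    # Wrapper stage: grow the candidate subset along the filter ranking
    # and keep the size that maximizes cross-validated AUC.
    best_auc, best_cols = -np.inf, order[:1]
    for k in range(1, min(max_k, X.shape[1]) + 1):
        cols = order[:k]
        auc = cross_val_score(GaussianNB(), X.iloc[:, cols], y,
                              cv=5, scoring="roc_auc").mean()
        if auc > best_auc:
            best_auc, best_cols = auc, cols
    return X.columns[best_cols], best_auc
```

The filter stage keeps the wrapper cheap by pruning the search space; the wrapper stage then pays for learner feedback only on a short list of candidates.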
A HYBRID FEATURE SELECTION: RESULT
Thank you
Q & A


Editor's Notes

  • #2 Today, I'd like to give a presentation about software quality. It covers the feature selection issue in software quality, and it is a summary of a couple of papers I have read; I gave it the title "Feature Selection Techniques for Software Fault Prediction".
  • #3 Quality is important, and fault prediction helps us concentrate on the faulty modules. As software complexity grows, feature selection becomes important for removing redundant and unnecessary data. A typical software fault prediction model is trained using metrics and fault data, which are collected from previously developed or similar projects.
  • #4 Feature selection techniques divide into feature ranking and feature subset selection. Feature ranking orders the features by their individual predictive power; feature subset selection looks for a subset of attributes that collectively has good predictive power. Feature selection techniques can also be divided into filter / wrapper / embedded. Filter: selects a feature subset without using any learning algorithm. Wrapper: uses feedback from a learning algorithm to decide which features to include when building the classification model.
  • #5 There are many types of software metrics, but I am going to introduce the two kinds that are mainly used.
  • #9 First, each attribute's values are normalized between 0 and 1, and the performance metric is calculated on the normalized attribute to create the feature ranking. Each independent attribute is paired with the class attribute (presumably the Y value), and the reduced two-attribute dataset is evaluated with eleven different performance metrics, based on posterior probabilities.
  • #10 A performance metric that considers a classifier's ability to distinguish between the two classes.
  • #12 They wanted to figure out the impact of the feature subset size, so they ranked the metrics and selected the top 1, 2, 3, ... up to 20, in order to understand the effects listed on the slide.
  • #15 Besides, one of the papers I read focuses on the robustness (or stability) of feature selection techniques: the degree of agreement between outputs when the technique is applied to randomly selected subsets of the same input data. (Cardinality: the number of elements in a set; d is the number of elements in the intersection.)
  • #16 To examine stability, this paper ran experiments while repeatedly changing (perturbing) the dataset.
  • #17 Furthermore, there is a model that mixes the filter and wrapper approaches: a hybrid feature selection model for software fault prediction.
  • #18 Hybrid feature selection model for software fault prediction.