Cukic Promise08 V3



Comparing Design and Code Metrics for Software Quality Prediction



Cukic Promise08 V3 Presentation Transcript

  • Comparing Design and Code Metrics for Software Quality Prediction
    • Y. Jiang, B. Cukic, T. Menzies
    • Lane Department of CSEE, West Virginia University
    • PROMISE 2008
  • Predicting Faults Earlier Matters
    • Boehm observed that fault removal is 50 to 200 times less costly when performed in the design phase rather than after deployment.
    • NASA research shows that a fault introduced in the requirements phase and leaking into the design, code, test, integration, and operational phases incurs correction cost factors of 5, 10, 50, 130, and 368, respectively.
    • Therefore, the earlier we can identify fault-prone artifacts, the better.
  • How early?
    • Do requirements metrics correlate with fault-proneness? [Jiang et al., ISSRE 07]
  • Predicting From Design Metrics?
    • It has been successfully demonstrated.
    • Ohlsson and Alberg (’96) demonstrated that design metrics predict fault-prone modules effectively.
      • “Design metrics are better predictors than code size (LOC)”
      • Telephone switching domain
    • Basili validated the so-called CK object-oriented (design) metrics using eight student-developed systems.
    • Nagappan, Ball & Zeller confirmed Ohlsson’s findings using OO design metrics on five Microsoft systems.
  • Goal of This Study
    • Thorough comparison of fault prediction models which utilize:
      • Design metrics
      • Static code metrics
      • Combination of both
    • Statistically significant number of projects and modules within projects.
  • Metrics Description (1)
    • Code metrics
  • Metrics Description (2)
    • Design metrics
  • Experimental Design
    • Design, Code, and All metrics
    • Classification with 10x10 CV
    • Illustrate results using ROC
    • Evaluate results using AUC (trapezoid rule)
    • Visualize using boxplot diagrams
    • Compare using nonparametric statistical tests
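The AUC step named above can be sketched in a few lines. This is only an illustration of the trapezoid rule, not the authors' tooling: the function name `auc_trapezoid` and the ROC points are hypothetical, and the points are assumed to include the (0, 0) and (1, 1) endpoints.

```python
def auc_trapezoid(roc_points):
    """Area under an ROC curve given (false positive rate, true positive rate) points."""
    # Sort by false positive rate so the curve is traversed left to right.
    pts = sorted(roc_points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Each pair of neighbouring points bounds one trapezoid.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points (not taken from the study).
roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.9), (1.0, 1.0)]
print(auc_trapezoid(roc))
```

A classifier no better than chance follows the diagonal from (0, 0) to (1, 1), which this function scores as 0.5.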
  • Datasets: NASA MDP
    • Used every dataset that offered both design- and code-level metrics.
  • Experimental Design (2)
    • 5 classification algorithms
      • Random forest, bagging, boosting, logistic regression, Naive Bayes
    • 10x10-way cross-validation:
      • Each 10-way experiment generates one ROC curve, so 10 repetitions yield 10 ROCs and 10 AUCs
    • We analyzed 1950 experiments!
      • 13 [Data sets] *3 [Metrics sets] *5 [Classifiers] *10 [CV]
    • We only show the best model from each metrics set in each data set (project).
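The experiment count on this slide can be checked by enumerating the grid. The sketch below is illustrative only: the data set labels are placeholders, since the transcript does not list all 13 names, and each tuple stands for one 10-way cross-validation run that would produce one AUC.

```python
from itertools import product

# Placeholder labels; the transcript does not list all 13 NASA MDP data sets.
datasets = [f"dataset_{i}" for i in range(1, 14)]
metric_sets = ["design", "code", "all"]
classifiers = ["random forest", "bagging", "boosting",
               "logistic regression", "naive Bayes"]
cv_repeats = range(1, 11)  # ten repetitions of 10-way cross-validation

# One tuple per experiment: (data set, metrics set, classifier, repetition).
experiments = list(product(datasets, metric_sets, classifiers, cv_repeats))
print(len(experiments))  # 13 * 3 * 5 * 10 = 1950
```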
  • Analysis example: PC5 data set
    • The mean AUC:
      • All: 0.979
      • Code: 0.967
      • Design: 0.956
    • [Figures: ROC curves and boxplot]
  • Typical Results
  • Not So Typical Results
  • Atypical Results
  • Test Statistical Significance
    • Use the procedure recommended by Demsar for each of the 13 data sets.
      • The Friedman test checks whether performance differs among the design, code, and all experiments.
        • If no, no further test is necessary.
        • If yes, use pairwise nonparametric tests (typically the Wilcoxon test or the Mann-Whitney test) to determine which group of metrics is the best.
    • 95% confidence level used in all experiments
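As a rough illustration of the first step, the Friedman statistic for three matched groups can be computed without external libraries. This is a sketch under stated assumptions, not the authors' procedure: each row is assumed to be one cross-validation repetition, no tie correction is applied, and the function names and AUC values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def friedman_three_groups(design, code, all_metrics):
    """Friedman chi-square statistic and p-value for three matched samples."""
    rows = list(zip(design, code, all_metrics))
    n, k = len(rows), 3
    rank_sums = [0.0] * k
    for row in rows:
        for j, r in enumerate(average_ranks(list(row))):
            rank_sums[j] += r
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    # With k = 3 the reference distribution is chi-square with 2 degrees of
    # freedom, whose survival function is exactly exp(-x / 2).
    return stat, math.exp(-stat / 2.0)


# Hypothetical AUCs from ten cross-validation repetitions (not the study's data).
design = [0.95, 0.96, 0.94, 0.95, 0.96, 0.95, 0.94, 0.96, 0.95, 0.95]
code = [0.96, 0.97, 0.96, 0.97, 0.97, 0.96, 0.96, 0.97, 0.97, 0.96]
all_m = [0.98, 0.98, 0.97, 0.98, 0.98, 0.97, 0.98, 0.98, 0.98, 0.97]
stat, p = friedman_three_groups(design, code, all_m)
print(stat, p, p < 0.05)
```

If the p-value falls below 0.05, the procedure moves on to the pairwise tests; otherwise it stops.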
  • Pairwise comparison
    • Test the following hypotheses for pairwise comparison of two experiments A and B.
      • H0: There is no difference in the performance of the models from metrics from group A and group B;
      • H1: The performance of the group A metrics is better than that of group B metrics;
      • H2: The performance of the group A metrics is worse than that of group B metrics.
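The three hypotheses can be mapped to a decision rule as sketched below. This is a simplified illustration, not the authors' implementation: it uses the normal approximation to the Mann-Whitney U statistic without tie or continuity correction, it assumes each one-sided test is run at alpha = 0.05, and the function name is hypothetical. A statistics package with exact p-values would be preferable for samples as small as ten AUCs.

```python
import math


def mann_whitney_decision(a, b, alpha=0.05):
    """Return 'A>B', 'A<B', or 'A=B' from a normal-approximation Mann-Whitney U test."""
    n1, n2 = len(a), len(b)
    # U counts the pairs in which A beats B; ties count one half.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    # One-sided p-values from the standard normal distribution.
    p_a_better = 0.5 * math.erfc(z / math.sqrt(2.0))
    p_a_worse = 0.5 * math.erfc(-z / math.sqrt(2.0))
    if p_a_better < alpha:
        return "A>B"  # reject H0 in favour of H1
    if p_a_worse < alpha:
        return "A<B"  # reject H0 in favour of H2
    return "A=B"      # H0 is not rejected


# Hypothetical AUC samples (not the study's data).
group_a = [0.97, 0.98, 0.99, 0.975, 0.985]
group_b = [0.95, 0.96, 0.94, 0.955, 0.965]
print(mann_whitney_decision(group_a, group_b))
```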
  • The Result of Hypothesis Test (1)
    • Friedman’s test
      • Average p-value = 0.00003604 (<0.05)
      • Strongly suggests there is a statistically significant difference amongst the models from all, code, and design over all 13 datasets.
    • The two pairwise nonparametric tests (the Wilcoxon test and the Mann-Whitney test) agree in all cases but one.
      • PC2: the Mann-Whitney test has all>code, but the Wilcoxon test has all=code.
      • This discrepancy does not affect the overall trend.
  • Findings
    • Statistical significance tests utilized AUC for model comparison
    • In 7 datasets, all=code.
    • In 6 datasets, all>code.
    • In all 13 datasets, all>design.
    • In 12 datasets, code>design.
    • The only exception is the KC4 project, where design>code.
  • Summary of Observations
    • The performance of models is influenced
      • MORE by metrics
      • THAN by classification algorithms.
    • Combination of design AND code metrics provides better models than code or design metrics alone.
    • The models from code metrics generally perform better than those formed from design metrics only.
    • Design metrics useful to predict fault prone modules earlier.
    • Clear indication that integrating metrics from different phases of development is useful.
  • Threats to Validity
    • Noise in the metrics data sets.
      • Would feature selection change some outcomes?
    • Generality of NASA datasets.
    • Design metrics reengineered from code.
      • More accurately reflect the code base than those computed from design documentation.
    • The All metrics data contains a few independent variables which are not in the Code or Design groups.
      • Needs correction, but the results are unlikely to change.
  • Ensuing Research
    • Software fault prediction can be improved
      • Improvement unlikely to come from the application of more off-the-shelf data mining algorithms.
      • Accounting for project’s “business context” may contribute to improvement.
        • Metrics from different development stages add information not available from the code.
        • Evaluation of effectiveness should be tailored to project-specific (subsystem/module-specific) risks.
      • Reliable metrics collection.