
Explainable Artificial Intelligence (XAI) 
to Predict and Explain Future Software Defects

Dr. Chakkrit (Kla) Tantithamthavorn

In today's increasingly digitalised world, software defects are enormously expensive. In 2018, the Consortium for IT Software Quality reported that software defects cost the global economy $2.84 trillion and affected more than 4 billion people. Software defects cost Australian businesses A$29 billion per year. Failure to eliminate defects in safety-critical systems could result in serious injury to people, threats to life, death, and disasters. Traditionally, software quality assurance activities such as testing and code review are widely adopted to discover software defects in a software product. However, ultra-large-scale systems, such as Google's, can consist of more than two billion lines of code, so exhaustively reviewing and testing every single line of code is not feasible with limited time and resources. This project aims to create technologies that enable software engineers to produce the highest quality software systems at the lowest operational cost. To achieve this, the project will invent an end-to-end explainable AI platform to (1) understand the nature of critical defects; (2) predict and locate defects; (3) explain and visualise the characteristics of defects; (4) suggest potential patches to automatically fix defects; and (5) integrate the platform as a GitHub bot plugin.

  1. 1. Explainable Artificial Intelligence (XAI) 
 to Predict and Explain Future Software Defects Dr. Chakkrit (Kla) Tantithamthavorn Monash University, Melbourne, Australia. chakkrit@monash.edu @klainfo http://chakkrit.com
  2. 2. Bug Dr. Chakkrit Tantithamthavorn
  3. 3. Software bugs globally cost $2.84 trillion https://www.it-cisq.org/the-cost-of-poor-quality-software-in-the-us-a-2018-report/The-Cost-of-Poor-Quality-Software-in-the-US-2018-Report.pdf A failure to eliminate defects in safety-critical systems could result in serious injury to people, threats to life, death, and disasters https://news.microsoft.com/en-au/features/direct-costs-associated-with-cybersecurity-incidents-costs-australian-businesses-29-billion-per-annum/ $59.5 billion annually for the US; $29 billion annually for Australia
  4. 4. Software evolves extremely fast: 50% of Google's code base changes every month, and Windows 8 involved 100K+ code changes. Software is written in multiple languages, by many people, over a long period of time, and is changed every day to fix bugs, add new features, and improve code quality. And software is released faster, at massive scale (e.g., every 6 months or every 6 weeks).
  5. 5. How to find bugs? Use unit testing to test functional correctness, but manual testing of all files is time-consuming. Use static analysis tools to check code quality. Use code review to find bugs and check code quality. Use CI/CD to automatically build, test, and merge with confidence. Others: UI testing, fuzzing, load/performance testing, etc.
  6. 6. QA activities take too much time (~50% of a project) • Large and complex code base: 1 billion lines of code • > 10K developers in 40+ office locations • 5K+ projects under active development • 17K code reviews per day • 100 million test cases run per day. Given limited time, how can we effectively prioritise QA resources on the most risky program elements? Google's rule: all changes must be reviewed* https://www.codegrip.tech/productivity/what-is-googles-internal-code-review-process/ https://eclipsecon.org/2013/sites/eclipsecon.org.2013/files/2013-03-24%20Continuous%20Integration%20at%20Google%20Scale.pdf Within 6 months, 1K developers perform 80K+ code reviews (~77 reviews per person) for 30K+ code changes per release
  7. 7. Software Analytics = Software Data + Data Analytics 26 million 
 developers 57 million
 repositories 100 million
 pull requests + code review + CI logs + test logs + docker config files + others
  8. 8. WHY DO WE NEED SOFTWARE ANALYTICS? To make informed decisions, glean actionable insights, and build empirical theories PROCESS IMPROVEMENT How do code review practices and rapid releases impact software quality? PRODUCTIVITY IMPROVEMENT How do continuous integration practices impact team productivity? QUALITY IMPROVEMENT Why do programs crash? How to prevent bugs in the future? EMPIRICAL THEORY BUILDING A Theory of Software Quality A Theory of Effort/Cost Estimation Beyond predicting defects
  9. 9. AI/ML IS SHAPING SOFTWARE ENGINEERING IMPROVE SOFTWARE QUALITY Predict defects, vulnerabilities, malware Generate test cases
  10. 10. AI/ML IS SHAPING SOFTWARE ENGINEERING IMPROVE SOFTWARE QUALITY Predict defects, vulnerabilities, malware Generate test cases Generating UI/requirements/code/comments
 Predict developer/team productivity
 Recommend developers/reviewers Identify developer turnover IMPROVE PRODUCTIVITY
  11. 11. AI/ML MODELS FOR SOFTWARE DEFECTS Focus on predicting and explaining future software defects, and building empirical theories: predicting future software defects so practitioners can effectively optimize limited resources; building empirically-grounded theories of software quality; explaining what makes software fail so managers can develop the most effective improvement plans
  12. 12. ANALYTICAL MODELLING FRAMEWORK MAME: Mining, Analyzing, Modelling, Explaining. Raw Data -(Mining)-> Clean Data -(Analyzing)-> Correlations -(Modelling)-> Analytical Models -(Explaining)-> Knowledge
  13. 13. Raw Data ITS Issue 
 Tracking
 System (ITS) MINING SOFTWARE DEFECTS Issue 
 Reports VCS Version
 Control
 System (VCS) Code Changes Code Snapshot Commit Log STEP 1: EXTRACT DATA
  14. 14. Reference: https://github.com/apache/lucene-solr/tree/662f8dd3423b3d56e9e1a197fe816393a33155e2 What are the source files in this release? ITS Code Changes Code Snapshot Commit Log VCS Issue 
 Tracking
 System (ITS) Version
 Control
 System (VCS) Issue 
 Reports Raw Data STEP 2: COLLECT METRICS STEP 1: EXTRACT DATA MINING SOFTWARE DEFECTS
  15. 15. Reference: https://github.com/apache/lucene-solr/commit/662f8dd3423b3d56e9e1a197fe816393a33155e2 How many lines are added or deleted? ITS VCS Issue 
 Tracking
 System (ITS) Version
 Control
 System (VCS) Raw Data Commit Log Issue 
 Reports MINING SOFTWARE DEFECTS Code Changes Code Snapshot STEP 2: COLLECT METRICS STEP 1: EXTRACT DATA Who edited this file?
  16. 16. ITS VCS Issue 
 Tracking
 System (ITS) Version
 Control
 System (VCS) Raw Data STEP 1: EXTRACT DATA CODE METRICS Size, Code Complexity, Cognitive Complexity,
 OO Design (e.g., coupling, cohesion) PROCESS METRICS Development Practices 
 (e.g., #commits, #dev, churn, #pre- release defects, change complexity) HUMAN FACTORS Code Ownership, #MajorDevelopers, 
 #MinorDevelopers, Author Ownership,
 Developer Experience Code Changes Code Snapshot Commit Log Issue 
 Reports STEP 2: COLLECT METRICS MINING SOFTWARE DEFECTS
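To make the process and ownership metrics above concrete, here is a minimal R sketch of the aggregation step, assuming a hypothetical commits data frame with one row per (commit, file) pair and columns file, author, lines_added, and lines_deleted (these names are illustrative and not part of the tutorial's datasets or mining pipeline):
# Minimal sketch (hypothetical input): aggregate per-file process metrics from a commit log.
library(dplyr)
process_metrics <- commits %>%
  group_by(file) %>%
  summarise(
    n_commits = n(),                              # number of commits touching the file
    n_dev     = n_distinct(author),               # number of distinct developers
    churn     = sum(lines_added + lines_deleted)  # total churned lines
  )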
  17. 17. Reference: https://issues.apache.org/jira/browse/LUCENE-4128 Issue Reference ID Bug / New Feature Which releases are affected? Which commits belong to this issue report? ITS Code Changes Code Snapshot Commit Log VCS Issue 
 Tracking
 System (ITS) Version
 Control
 System (VCS) Issue 
 Reports Raw Data STEP 1: EXTRACT DATA STEP 2: COLLECT METRICS STEP 3: IDENTIFY DEFECTS MINING SOFTWARE DEFECTS Was this report created after the release of interest?
  18. 18. ITS Code Changes Code Snapshot Commit Log VCS Issue 
 Tracking
 System (ITS) Version
 Control
 System (VCS) Issue 
 Reports Raw Data STEP 1: EXTRACT DATA STEP 3: IDENTIFY DEFECTS STEP 2: COLLECT METRICS …… …… A B Defect
 Dataset MINING SOFTWARE DEFECTS Which files were changed to fix the defect? Link Check “Mining Software Defects” paper [Yatish et al., ICSE 2019]
  19. 19. LABELLING SOFTWARE DEFECTS Release 1.0 Changes Issues Timeline Timeline C1: Fixed ID-1 ID=1, v=1.0 A.java ID=2, v=0.9 C2: Fixed ID-2 B.java ID=3, v=1.0 C3: Fixed ID-3 C.java ID=4, v=1.0 C4: Fixed ID-4 D.java Post-release defects are defined as modules that are fixed for a defect report that affected a release of interest ID indicates a defect report ID, 
 C indicates a commit hash,
 v indicates affected release(s) DEFECTIVE CLEAN DEFECTIVE DEFECTIVE FILE A.java B.java C.java D.java LABEL Yatish et al., Mining Software Defects: Should We Consider Affected Releases?, In ICSE’19
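The labelling rule on this slide can be sketched in a few lines of R. The data frames below (issues, fixing_changes, files) are hypothetical stand-ins for illustration; the actual pipeline of Yatish et al. (ICSE'19) extracts this information from the ITS and VCS.
# Sketch: label post-release defective files for one release of interest.
# issues:         one row per issue report (id, type, affected_release)
# fixing_changes: one row per (fixing commit, issue id, changed file)
# files:          all files in the snapshot of the release of interest
release <- "1.0"
defect_ids      <- issues$id[issues$type == "Bug" & issues$affected_release == release]
defective_files <- unique(fixing_changes$file[fixing_changes$issue_id %in% defect_ids])
labels <- data.frame(file = files, defective = files %in% defective_files)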
  20. 20. HIGHLY-CURATED DATASETS 32 releases that span across 9 open-source software systems Name %DefectiveRatio KLOC ActiveMQ 6%-15% 142-299 Camel 2%-18% 75-383 Derby 14%-33% 412-533 Groovy 3%-8% 74-90 HBase 20%-26% 246-534 Hive 8%-19% 287-563 JRuby 5%-18% 105-238 Lucene 3%-24% 101-342 Wicket 4%-7% 109-165 Each dataset has 65 software metrics • 54 code metrics • 5 process metrics • 6 ownership metrics https://awsm-research.github.io/Rnalytica/ Yatish et al., Mining Software Defects: Should We Consider Affected Releases?, In ICSE’19
  21. 21. ANALYTICAL MODELLING FRAMEWORK MAME: Mining, Analyzing, Modelling, Explaining. Raw Data -(Mining)-> Clean Data -(Analyzing)-> Correlations -(Modelling)-> Analytical Models -(Explaining)-> Knowledge
  22. 22. Black-Box 
 Models Training 
 Data Learning Algorithms A.java A.java is 
 likely to be defective
 (P=0.90) SOFTWARE DEFECT MODELLING FRAMEWORK Using well-established AI/ML learning algorithms Developers make 
 an informed decision
  23. 23. Black-Box 
 Models Training 
 Data Learning Algorithms A.java A.java is 
 likely to be defective
 (P=0.90) SOFTWARE DEFECT MODELLING FRAMEWORK Using well-established AI/ML learning algorithms Developers make 
 an informed decision Why is A.java defective? Why is A.java defective rather than clean? Why is file A.java defective, 
 while file B.java is clean?
  24. 24. Article 22 of the European Union’s General Data Protection Regulation “The use of data in decision- making that affects an individual or group requires 
 an explanation for any decision made by an algorithm.” http://www.privacy-regulation.eu/en/22.htm
  25. 25. AI/ML BRINGS CONCERNS TO REGULATORS FAT: Fairness, Accountability, and Transparency What if AI-assisted productivity analytics tend to promote males more than females? Do the AI systems conform to regulation and legislation? Do we understand how machines work? Why do models make those predictions?
  26. 26. EXPLAINABLE ARTIFICIAL INTELLIGENCE (XAI) A suite of AI/ML techniques that produce accurate predictions, while being able to explain such predictions Black-Box 
 Models Training 
 Data Learning Algorithms A.java Prediction A.java is 
 likely to be defective
 (P=0.90) Explainable
 Interface Explanation The system provides an explanation that justifies its prediction to the user
  27. 27. EXPLAINING A BLACK-BOX MODEL Model-Specific Techniques (e.g., ANOVA for Regression / Variable Importance for Random Forest) Unseen Data Black-Box 
 Models Explaining a black-box model to identify the most important features based on the training data Model-specific interpretation 
 techniques
 (VarImp) Global Explanation A prediction score of 90% Predictions
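As a concrete illustration of a model-specific global explanation (VarImp), the sketch below computes a random forest's permutation importance with the randomForest package, reusing the data/indep/dep objects introduced later in the hands-on part; this is an illustrative sketch, not the exact script behind this slide.
# Sketch: global explanation via model-specific variable importance
library(randomForest)
rf <- randomForest(x = data[, indep], y = data[, dep], importance = TRUE)
importance(rf, type = 1)   # permutation importance (MeanDecreaseAccuracy) of each metric
varImpPlot(rf)             # visualize the globally most important metrics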
  28. 28. EXPLAINING AN INDIVIDUAL PREDICTION Model-Agnostic Techniques to Generate An Outcome Explanation Explaining an individual prediction: how do features contribute to the final probability of each prediction? A prediction score of 90% Model- Agnostic
 Techniques Unseen Data Model-specific interpretation 
 techniques
 (VarImp) Black-Box 
 Models Global Explanation Instance ExplanationsPredictions
  29. 29. WHY IS A.JAVA DEFECTIVE? Explaining the importance of each metric that contributes to the final probability of each prediction. [Waterfall chart: the contributions of MAJOR_LINE = 2, ADEV = 12, CountDeclMethodPrivate = 6, CountDeclMethodPublic = 44, CountClassCoupled = 16, and the remaining 21 variables add up to a final prognosis of 0.832] #ActiveDevelopers (ADEV) contributed the most to the likelihood of being defective for this module (see the sketch below)
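Model-agnostic explainers such as LIME and BreakDown produce per-instance breakdowns like the one above. As a simplified, model-specific sketch of the same idea for a logistic regression defect model m built on a set of non-correlated metrics sel.metrics (both names are assumptions of this sketch; such a model is fitted later in the hands-on part), each metric's contribution to the log-odds of one file can be read off directly:
# Sketch: additive contribution of each metric to one prediction (intercept omitted).
x <- data[1, sel.metrics, drop = FALSE]           # the file to explain
contrib <- coef(m)[sel.metrics] * as.numeric(x)   # coefficient * metric value, per metric
sort(contrib, decreasing = TRUE)                  # metrics ranked by contribution to the log-odds
predict(m, x, type = "response")                  # the final predicted probability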
  30. 30. A Quality Improvement Plan “A policy to maintain the maximum number of (two) developers who can edit a module in the past (six) months”
  31. 31. Software Analytics in Action
 A Hands-on Tutorial on Analyzing and Modelling Software Data Dr. Chakkrit (Kla) Tantithamthavorn Monash University, Melbourne, Australia. chakkrit@monash.edu @klainfo http://chakkrit.com
  32. 32. Statistical
 Model Training
 Corpus Classifier 
 Parameters (7) Model
 Construction Performance
 Measures Data 
 Sampling (2) Data Cleaning and Filtration (3) Metrics Extraction and Normalization (4) Descriptive Analytics (+/-) Relationship to the Outcome Y X x Software
 Repository Software
 Dataset Clean
 Dataset Studied Dataset Outcome Studied Metrics Control Metrics +~ (1) Data Collection Predictive 
 Analytics Prescriptive Analytics (8) Model Validation (9) Model Analysis and Interpretation Importance 
 Score Testing
 Corpus PredictionsPerformance
 Estimates Patterns Challenges of the Data Analytics Pipeline: How to clean data? How to collect ground-truths? Should we rebalance the data? Are features correlated? Which ML technique is best? Which model validation technique should I use? What is the benefit of optimising ML parameters? How to analyse or explain the ML models? Should we apply feature reduction? What is the best data analytics pipeline for software defects?
  33. 33. Mining Software Data Analyzing Software Data Affected Releases
 [ICSE’19] Issue Reports
 [ICSE’15] Control Features
 [ICSE-SEIP’18] Feature Selection
 [ICSME’18] Correlation Analysis
 [TSE’19] Modelling Software Data Class Imbalance
 [TSE’19] Parameters
 [ICSE’16,TSE’18] Model Validation
 [TSE’17] Measures
 [ICSE-SEIP’18] Explaining Software Data Model Statistics
 [ICSE-SEIP’18] Interpretation
 [TSE’19] ANALYZING AND MODELLING 
 SOFTWARE DEFECTS Tantithamthavorn and Hassan. An Experience Report on Defect Modelling in Practice: Pitfalls and Challenges. In ICSE-SEIP’18 MSR’19 
 Education
  34. 34. RUN JUPYTER + R ANYTIME AND ANYWHERE http://github.com/awsm-research/tutorial Shift + Enter to run a cell
  35. 35. EXAMPLE DATASET [Zimmermann et al., PROMISE’07] 6,729 files, 32 metrics, 14% defective ratio
# Load a defect dataset
> source("import.R")
> eclipse <- loadDefectDataset("eclipse-2.0")
> data <- eclipse$data
> indep <- eclipse$indep
> dep <- eclipse$dep
> data[,dep] <- factor(data[,dep])
# Understand your dataset
> describe(data)
data
33 Variables 6729 Observations
-------------------------------------------------
CC_sum
      n  missing distinct   Mean
   6729        0      268   26.9
lowest : 0 1, highest: 1052 1299
-------------------------------------------------
post
      n  missing distinct
   6729        0        2
Value      FALSE  TRUE
Frequency   5754   975
Proportion 0.855 0.145
Tantithamthavorn and Hassan. An Experience Report on Defect Modelling in Practice: Pitfalls and Challenges. In ICSE-SEIP’18, pages 286-295.
  36. 36. Is program complexity associated with software quality? BUILDING A THEORY OF SOFTWARE QUALITY
  37. 37. INTRO: BASIC REGRESSION ANALYSIS
# Develop a logistic regression
> m <- glm(post ~ CC_max, data = data, family = "binomial")
# Print a model summary
> summary(m)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.490129   0.051777  -48.09   <2e-16 ***
CC_max       0.104319   0.004819   21.65   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Theoretical Assumptions: 1. Binary dependent variable and ordinal independent variables 2. Observations are independent 3. No (multi-)collinearity among independent variables 4. Assume a linear relationship between the logit of the outcome and each variable
# Visualize the relationship of the studied variable
> install.packages("effects")
> library(effects)
> plot(allEffects(m))
[CC_max effect plot: the predicted probability of post-release defects (post) rises from about 0.2 to 0.8 as CC_max increases from 0 to 300]
  38. 38. Which factors share the strongest association with software quality? BUILDING A THEORY OF SOFTWARE QUALITY
  39. 39. BEST PRACTICES FOR ANALYTICAL MODELLING 7 DOs and 3 DON'Ts. DOs: (1) include control features; (2) remove correlated features; (3) build interpretable models; (4) explore different settings; (5) use out-of-sample bootstrap; (6) summarize by a Scott-Knott test; (7) visualize the relationship. DON'Ts: (1) don't use ANOVA Type-I; (2) don't optimize probability thresholds; (3) don't solely use F-measure.
  40. 40. STEP1: INCLUDE CONTROL FEATURES Size, OO Design 
 (e.g., coupling, cohesion), Program Complexity Software Defects Control features are features that are not of interest even though they could affect the outcome of a model (e.g., lines of code when modelling defects). #commits, #dev, churn, 
 #pre-release defects, 
 change complexity Code Ownership,
 #MinorDevelopers, Experience Principles of designing factors 1. Easy/simple measurement 2. Explainable and actionable 3. Support decision making Tantithamthavorn et al., An Experience Report on Defect Modelling in Practice: Pitfalls and Challenges. ICSE-SEIP’18
  41. 41. The risks of not including control features (e.g., lines of code) STEP1: INCLUDE CONTROL FEATURES # post ~ CC_max + PAR_max + FOUT_max
> m1 <- glm(post ~ CC_max + PAR_max + FOUT_max, data = data, family="binomial")
> anova(m1)
Analysis of Deviance Table (Model: binomial, link: logit; Response: post; terms added sequentially, first to last)
          Df Deviance Resid. Df Resid. Dev
NULL                       6728     5568.3
CC_max     1   600.98      6727     4967.3
PAR_max    1   131.45      6726     4835.8
FOUT_max   1    60.21      6725     4775.6
Complexity (CC_max) is the top rank.
# post ~ TLOC + CC_max + PAR_max + FOUT_max
> m2 <- glm(post ~ TLOC + CC_max + PAR_max + FOUT_max, data = data, family="binomial")
> anova(m2)
Analysis of Deviance Table (Model: binomial, link: logit; Response: post; terms added sequentially, first to last)
          Df Deviance Resid. Df Resid. Dev
NULL                       6728     5568.3
TLOC       1   709.19      6727     4859.1
CC_max     1    74.56      6726     4784.5
PAR_max    1    63.35      6725     4721.2
FOUT_max   1    17.41      6724     4703.8
Lines of code (TLOC) is the top rank. Conclusions may change when including control features.
  42. 42. STEP2: REMOVE CORRELATED FEATURES The state of practices in software engineering Jiarpakdee et al: The Impact of Correlated Metrics on the Interpretation of Defect Models. TSE’19 “82% of SE datasets have 
 correlated features” “63% of SE studies do not 
 mitigate correlated features” Why? Most metrics are aggregated. Collinearity is a phenomenon in which one feature can be linearly predicted by another feature
  43. 43. The risks of not removing correlated factors STEP2: REMOVE CORRELATED FEATURES CC_max is highly correlated with CC_avg.
Model1: Post ~ CC_max + CC_avg + PAR_max + FOUT_max
Model2: Post ~ CC_avg + CC_max + PAR_max + FOUT_max
            Model 1   Model 2
CC_max          74        19
CC_avg           2        58
PAR_max         16        16
FOUT_max         7         7
The values indicate the contribution of each factor to the model (from an ANOVA analysis). Conclusions may change when reordering the correlated features (see the sketch below). Jiarpakdee et al: The Impact of Correlated Metrics on the Interpretation of Defect Models. TSE’19
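The reordering comparison above can be reproduced on the eclipse-2.0 dataset loaded earlier; the sketch below fits the two models and prints their sequential (Type-I) ANOVA tables. It is a sketch of the analysis, not the exact script behind the table.
# Sketch: with correlated metrics, Type-I contributions depend on the order of the terms.
m1 <- glm(post ~ CC_max + CC_avg + PAR_max + FOUT_max, data = data, family = "binomial")
m2 <- glm(post ~ CC_avg + CC_max + PAR_max + FOUT_max, data = data, family = "binomial")
anova(m1)   # CC_max absorbs most of the explained deviance
anova(m2)   # after reordering, CC_avg absorbs it instead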
  44. 44. STEP2: REMOVE CORRELATED FACTORS Using Spearman's correlation analysis to detect collinearity
# Visualize spearman’s correlation for all metrics using a hierarchical clustering
> library(rms)
> plot(varclus(as.matrix(data[,indep]), similarity="spear", trans="abs"))
> abline(h=0.3, col="red")
[Hierarchical clustering dendrogram of |Spearman ρ| for all 32 metrics, showing 7 groups of correlated metrics]
  45. 45. STEP2: REMOVE CORRELATED FACTORS Using Spearman's correlation analysis to detect collinearity
# Visualize spearman’s correlation for all metrics using a hierarchical clustering
> library(rms)
> plot(varclus(as.matrix(data[,indep]), similarity="spear", trans="abs"))
> abline(h=0.3, col="red")
[Same dendrogram, annotated with the groups of correlated metrics and the non-correlated metrics] Use domain knowledge to manually select one metric from each group. After mitigating correlated metrics, we should have 9 factors (7+2).
  46. 46. STEP2: REMOVE CORRELATED FACTORS How to automatically mitigate (multi-)collinearity?
# Visualize spearman’s correlation for all metrics using a hierarchical clustering
> library(rms)
> plot(varclus(as.matrix(data[,indep]), similarity="spear", trans="abs"))
> abline(h=0.3, col="red")
[Same dendrogram with 7 groups of correlated metrics] AutoSpearman (1) removes constant factors, and (2) selects the one factor of each group that shares the least correlation with the other factors that are not in that group. Jiarpakdee et al: AutoSpearman: Automatically Mitigating Correlated Software Metrics for Interpreting Defect Models. ICSME’18
  47. 47. STEP2: REMOVE CORRELATED FACTORS How to automatically mitigate (multi-)collinearity?
# Run AutoSpearman
> library(Rnalytica)
> filterindep <- AutoSpearman(data, indep)
> plot(varclus(as.matrix(data[, filterindep]), similarity="spear", trans="abs"))
> abline(h=0.3, col="red")
[Dendrogram of the 9 metrics selected by AutoSpearman (NSF_avg, NSM_avg, PAR_avg, pre, NBD_avg, NOT, ACD, NOF_avg, NOM_avg): the selected metrics share low pairwise correlation] Jiarpakdee et al: AutoSpearman: Automatically Mitigating Correlated Software Metrics for Interpreting Defect Models. ICSME’18
  48. 48. STEP3: BUILD AND EXPLAIN DECISION TREES R implementation of a Decision Tree-based model (C5.0)
# Build a C5.0 tree-based model
> library(C50)   # provides C5.0
> tree.model <- C5.0(x = data[,indep], y = data[,dep])
> summary(tree.model)
Read 6,729 cases (10 attributes) from undefined.data
Decision tree:
pre <= 1:
:...NOM_avg <= 17.5: FALSE (4924/342)
:   NOM_avg > 17.5:
:   :...NBD_avg > 1.971831:
:       :...ACD <= 2: TRUE (51/14)
:       :   ACD > 2: FALSE (5)
# Plot a Decision Tree-based model
> plot(tree.model)
[Decision tree plot: the root splits on pre, with further splits on NOM_avg, NBD_avg, PAR_avg, NOF_avg, NSF_avg, NSM_avg, and ACD; each leaf shows the proportion of defective (TRUE) vs clean (FALSE) files]
Tantithamthavorn et al: Automated parameter optimization of classification techniques for defect prediction models. ICSE’16
  49. 49. STEP3: BUILD AND EXPLAIN RULES MODELS R implementation of a Rules-Based model (C5.0)
# Build a C5.0 rule-based model
> rule.model <- C5.0(x = data[, indep], y = data[,dep], rules = TRUE)
> summary(rule.model)
Rules:
Rule 1: (2910/133, lift 1.1)
  pre <= 6
  NBD_avg <= 1.16129
  -> class FALSE [0.954]
Rule 2: (3680/217, lift 1.1)
  pre <= 2
  NOM_avg <= 6.5
  -> class FALSE [0.941]
Rule 3: (4676/316, lift 1.1)
  pre <= 1
  NBD_avg <= 1.971831
  NOM_avg <= 64
  -> class FALSE [0.932]
Rule 13: (56/19, lift 4.5)
  pre <= 1
  NBD_avg > 1.971831
  NOM_avg > 17.5
  -> class TRUE [0.655]
Rule 14: (199/70, lift 4.5)
  pre > 1
  NBD_avg > 1.012195
  NOM_avg > 23.5
  -> class TRUE [0.647]
Rule 15: (45/16, lift 4.4)
  pre > 2
  pre <= 6
  NBD_avg > 1.012195
  PAR_avg > 1.75
  -> class TRUE [0.638]
Tantithamthavorn et al: Automated parameter optimization of classification techniques for defect prediction models. ICSE’16
  50. 50. STEP3: BUILD AND EXPLAIN RF MODELS R implementation of a Random Forest model
# Build a random forest model
> library(randomForest)
> f <- as.formula(paste("RealBug", '~', paste(indep, collapse = "+")))
> rf.model <- randomForest(f, data = data, importance = TRUE)
> print(rf.model)
Call:
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 12.3%
Confusion matrix:
      FALSE TRUE class.error
FALSE   567   42  0.06896552
TRUE     57  139  0.29081633
# Plot a Random Forest model
> plot(rf.model)
[Variable importance plots: by MeanDecreaseAccuracy, pre, NOM_avg, and NBD_avg rank highest while ACD and NOT rank lowest]
  51. 51. STEP4: EXPLORE DIFFERENT SETTINGS The risks of using default parameter settings Tantithamthavorn et al: Automated parameter optimization of classification techniques for defect prediction models. ICSE’16
 Fu et al. Tuning for software analytics: Is it really necessary? IST'16 87% of the widely-used classification techniques require at least one parameter setting [ICSE’16] #trees for 
 random forest #clusters for 
 k-nearest neighbors #hidden layers for neural networks "80% of top-50 highly-cited defect studies rely on a default setting [IST’16]”
  52. 52. STEP4: EXPLORE DIFFERENT SETTINGS The risks of using default parameter settings Dataset Generate training samples Training
 samples Testing
 samples Models Build 
 models
 w/ diff settings Random Search
 Differential Evolution [Boxplots of AUC for C5.0 (1 trial vs 100 trials), RF (10 vs 100 trees), and GLM, showing the AUC improvement from tuning C5.0 and the AUC improvement from tuning RF; a caret-based sketch follows below]
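One way to explore different settings is sketched below: a random search over the random forest's parameters with the caret package. This is a choice made for this write-up (the ICSE'16 study describes its own grid and differential-evolution search), and the resampling settings are illustrative.
# Sketch: random search over the random forest's parameters using caret.
library(caret)
set.seed(1234)
ctrl <- trainControl(method = "boot", number = 25, search = "random",
                     classProbs = TRUE, summaryFunction = twoClassSummary)
tuned <- train(x = data[, indep],
               y = factor(data[, dep], labels = c("clean", "defect")),
               method = "rf", metric = "ROC", tuneLength = 5, trControl = ctrl)
tuned$bestTune   # the best parameter setting found by the random search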
  53. 53. STEP5: USE OUT-OF-SAMPLE BOOTSTRAP To estimate how well a model will perform on unseen data Tantithamthavorn et al: An Empirical Comparison of Model Validation Techniques for Defect Prediction Models. TSE’17 Testing 70% 30% Training Holdout Validation k-Fold Cross Validation Repeat k times Bootstrap Validation 50% Holdout 70% Holdout Repeated 50% Holdout Repeated 70% Holdout Leave-one-out CV 2 Fold CV 10 Fold CV Repeated 10 fold CV Ordinary bootstrap Optimism-reduced bootstrap Out-of-sample bootstrap .632 Bootstrap TestingTraining Repeat N times TestingTraining
  54. 54. STEP5: USE OUT-OF-SAMPLE BOOTSTRAP R implementation of out-of-sample bootstrap and 10-fold cross-validation. Tantithamthavorn et al: An Empirical Comparison of Model Validation Techniques for Defect Prediction Models. TSE’17
# Out-of-sample Bootstrap Validation
> for(i in seq(1,100)){
    set.seed(1234+i)
    indices <- sample(nrow(data), replace=TRUE)
    training <- data[indices,]
    testing <- data[-indices,]
    …
  }
# 10-Fold Cross-Validation
> indices <- createFolds(data[, dep], k = 10, list = TRUE, returnTrain = TRUE)
> for(i in seq(1,10)){
    training <- data[indices[[i]],]
    testing <- data[-indices[[i]],]
    …
  }
[Boxplots of AUC estimates: 100-repetition out-of-sample bootstrap vs 10×10-fold cross-validation] More accurate and more stable performance estimates [TSE’17] (a complete, runnable sketch of the elided loop body follows below)
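The elided model-building step ("…") in the bootstrap loop might look like the sketch below, which fits a logistic regression on each bootstrap sample and evaluates AUC on the held-out rows; the chosen metrics and the pROC package are assumptions of this sketch, not part of the original slide.
# Sketch: one possible body for the out-of-sample bootstrap loop above.
library(pROC)
boot.auc <- NULL
for(i in seq(1, 100)){
  set.seed(1234 + i)
  indices  <- sample(nrow(data), replace = TRUE)
  training <- data[indices, ]
  testing  <- data[-indices, ]                      # rows not drawn into the bootstrap sample
  m    <- glm(post ~ pre + NBD_avg + NOM_avg, data = training, family = "binomial")
  prob <- predict(m, testing, type = "response")
  boot.auc <- c(boot.auc, as.numeric(auc(testing$post, prob)))
}
summary(boot.auc)   # distribution of the out-of-sample AUC estimates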
  55. 55. STEP5: USE OUT-OF-SAMPLE BOOTSTRAP The risks of using 10-fold CV on small datasets: with 100 modules and a 5% defective rate, there is a high chance that a 10-fold testing sample does not contain any defective modules. Out-of-sample bootstrap: the bootstrap sample is a sample with replacement of the same size as the original sample (used for training); the modules that do not appear in the bootstrap sample (~36.8%) are used for testing. A bootstrap sample is nearly representative of the original dataset. Tantithamthavorn et al: An Empirical Comparison of Model Validation Techniques for Defect Prediction Models. TSE’17
  56. 56. STEP6: SUMMARIZE BY A SCOTT-KNOTT ESD TEST To statistically determine the ranks of the most significant metrics
# Run a ScottKnottESD test
> library(car)            # provides Anova
> library(ScottKnottESD)  # provides sk_esd
> importance <- NULL
> indep <- AutoSpearman(data, eclipse$indep)
> f <- as.formula(paste("post", '~', paste(indep, collapse = "+")))
> for(i in seq(1,100)){
    indices <- sample(nrow(data), replace=TRUE)
    training <- data[indices,]
    m <- glm(f, data = training, family="binomial")
    importance <- rbind(importance, Anova(m, type="2", test="LR")$"LR Chisq")
  }
> importance <- data.frame(importance)
> colnames(importance) <- indep
> sk_esd(importance)
Groups:
    pre NOM_avg NBD_avg  ACD NSF_avg PAR_avg  NOT NSM_avg NOF_avg
      1       2       3    4       5       6    7       7       8
[Boxplots of each metric's importance score (LR Chisq) across the 100 bootstrap samples, ordered by Scott-Knott ESD rank] Each rank has a statistically significant difference with a non-negligible effect size [TSE’17]
  57. 57. STEP7: VISUALIZE THE RELATIONSHIP To understand the relationship between the studied metric and the outcome
# Visualize the relationship of the studied variable
> library(effects)
> indep <- AutoSpearman(data, eclipse$indep)
> f <- as.formula(paste("post", '~', paste(indep, collapse = "+")))
> m <- glm(f, data = data, family="binomial")
> plot(effect("pre", m))
[pre effect plot: the predicted probability of post-release defects (post) rises from about 0.2 to 0.8 as the number of pre-release defects (pre) grows from 0 to 70]
  58. 58. FIRST, DON’T USE ANOVA TYPE-I To measure the significance/contribution of each metric to the model. ANOVA Type-I measures the improvement of the Residual Sum of Squares (RSS) (i.e., the unexplained variance) when each metric is sequentially added into the model, e.g., RSS(post ~ NSF_max) - RSS(post ~ 1) = 45.151 and RSS(post ~ NSF_max + NSM_max) - RSS(post ~ NSF_max) = 17.178. Jiarpakdee et al: The Impact of Correlated Metrics on the Interpretation of Defect Models. TSE’19
# ANOVA Type-I
> anova(m)
         Df Deviance Resid. Df Resid. Dev
NULL                      6728     5568.3
NSF_max   1   45.151      6727     5523.1
NSM_max   1   17.178      6726     5505.9
NOF_max   1   50.545      6725     5455.4
ACD       1   43.386      6724      5412.
  59. 59. FIRST, DON’T USE ANOVA TYPE-I To measure the significance/contribution of each metric to the model. ANOVA Type-II measures the improvement of the Residual Sum of Squares (RSS) (i.e., the unexplained variance) when adding the metric under examination to the model after the other metrics: RSS(post ~ all except the studied metric) - RSS(post ~ all metrics), e.g., glm(post ~ X2 + X3 + X4, data=data)$deviance - glm(post ~ X1 + X2 + X3 + X4, data=data)$deviance. Jiarpakdee et al: The Impact of Correlated Metrics on the Interpretation of Defect Models. TSE’19
# ANOVA Type-II
> Anova(m)
Analysis of Deviance Table (Type II tests)
Response: post
        LR Chisq Df Pr(>Chisq)
NSF_max   10.069  1   0.001508 **
NSM_max   17.756  1  2.511e-05 ***
NOF_max   21.067  1  4.435e-06 ***
ACD       43.386  1  4.493e-11 ***
  60. 60. DON’T USE ANOVA TYPE-I Instead, future studies must use ANOVA Type-II/III. Jiarpakdee et al: The Impact of Correlated Metrics on the Interpretation of Defect Models. TSE’19
Model1: post ~ NSF_max + NSM_max + NOF_max + ACD
Model2: post ~ NSM_max + ACD + NSF_max + NOF_max (reordered)
           Model 1            Model 2
           Type I   Type II   Type I   Type II
ACD          28%      47%       49%      47%
NOF_max      32%      23%       13%      23%
NSM_max      11%      19%       31%      19%
NSF_max      29%      11%        7%      11%
(see the sketch below)
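The comparison behind the table above can be reproduced with a short sketch (assuming the eclipse dataset from the hands-on part and the car package, which provides the Anova function used earlier):
# Sketch: Type-II contributions are stable under reordering; Type-I contributions are not.
library(car)
m1 <- glm(post ~ NSF_max + NSM_max + NOF_max + ACD, data = data, family = "binomial")
m2 <- glm(post ~ NSM_max + ACD + NSF_max + NOF_max, data = data, family = "binomial")
anova(m1); anova(m2)                        # Type-I: contributions change with the order
Anova(m1, type = 2); Anova(m2, type = 2)    # Type-II: contributions stay the same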
  61. 61. DON’T SOLELY USE F-MEASURES Other (domain-specific) practical measures should also be included Threshold-independent Measures
 Area Under the ROC Curve = The discrimination ability to classify 2 outcomes. Ranking Measures
 Precision@20%LOC = The precision when inspecting the top 20% LOC
 Recall@20%LOC = The recall when inspecting the top 20% LOC
 Initial False Alarm (IFA) = The number of false alarms to find the first bug [Xia ICSME’17]
 Effort-Aware Measures
 Popt = an effort-based cumulative lift chart [Mende PROMISE’09]
 Inspection Effort = The amount of effort (LOC) that is required to find the first bug. [Arisholm JSS’10]
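As an example of such a ranking measure, Recall@20%LOC can be computed with the short sketch below (the variable names prob, actual, and loc are illustrative): rank the files by predicted risk and count how many defective files are found within the top 20% of the total lines of code.
# Sketch: Recall@20%LOC for predicted probabilities `prob`, true labels `actual`
# (TRUE = defective), and file sizes `loc` (lines of code).
recall_at_20_loc <- function(prob, actual, loc) {
  ord      <- order(prob, decreasing = TRUE)      # rank files by predicted risk
  selected <- cumsum(loc[ord]) <= 0.2 * sum(loc)  # files within the top 20% of total LOC
  sum(actual[ord][selected]) / sum(actual)        # fraction of all defective files found
}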
  62. 62. DON’T SOLELY USE F-MEASURES The risks of changing probability thresholds. [Boxplots of F-measure at probability thresholds of 0.2, 0.5, and 0.8 for C5.0 (1 trial and 100 trials), RF (10 and 100 trees), and GLM] Tantithamthavorn and Hassan. An Experience Report on Defect Modelling in Practice: Pitfalls and Challenges. In ICSE-SEIP’18
  63. 63. DO NOT IMPLY CAUSATION Avoid causal claims such as “complexity is the root cause of software defects” or “software defects are caused by high code complexity”; the supported claim is that complexity shares the strongest association with defect-proneness.
  64. 64. PH.D. SCHOLARSHIP • Tuition Fee Waivers • $28,000 Yearly Stipend • Travel Funding • A University-Selected Laptop (e.g., MacBook Pro) • Access to HPC/GPU clusters (4,112 CPU cores, 168 GPU co-processors, 3PB) + NVIDIA DGX1-V. 7 Developing Skills: 1. Written Communication Skills 2. Research 3. Public Speaking 4. Project Management 5. Leadership 6. Critical Thinking Skills 7. Team Collaboration
