Predicting Defective Lines Using
a Model-Agnostic Technique
Journal-First Paper (Transactions on Software Engineering)
Supatsara Wattanakriengkrai, Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Hideaki Hata, Kenichi Matsumoto
An SQA team needs to carefully identify defects in
changed files that will be merged into the release branch
2
Source Code
Files
Version Control
System
Localize defects
…
Defect-Prone
Code
SQA Team
B. Adams and S. McIntosh, “Modern release engineering in a nutshell–why researchers should care,” in SANER, 2016, pp. 78–90.
To spend the optimal effort, the SQA team needs to
prioritize files that are likely to have defects in the future
3
Source Code
Files
Version Control
System
Prioritize files
SQA Team
Ranked Source
Code Files
Defect-Prone
Code
…
Localize
defective lines
…
B. Adams and S. McIntosh, “Modern release engineering in a nutshell–why researchers should care,” in SANER, 2016, pp. 78–90.
Defect prediction models are proposed to help
SQA teams prioritize their effort
4
Source Code
Files
Version Control
System
Prioritize files
using defect
models
Ranked Source
Code Files
…
[Kamei et al. ICSME 2010]
[Mende and Koschke CSMR 2010]
…
However, developers could still waste SQA effort on
manually identifying the most risky lines
5
Source Code
Files
Version Control
System
Prioritize files
using defect
models
Ranked Source
Code Files
Defect-Prone
Code
…
Localize
defective lines
…
SQA Team
[Kamei et al. ICSME 2010]
[Mende and Koschke CSMR 2010]
…
As little as 1%-3% of the lines of code in a file are actually
defective after release
6
Studied system   %Defective files   %Defective lines in defective files (at the median)
Activemq         2-7%               2%
Camel            2-8%               2%
Derby            6-28%              2%
Groovy           2-4%               2%
Hbase            7-11%              1%
Hive             6-19%              2%
Jruby            2-13%              2%
Lucene           2-8%               3%
Wicket           2-16%              3%
**Defective lines are the source code lines that will be changed by bug-fixing commits to fix post-release defects
As little as 1%-3% of the lines of code in a file are actually
defective after release
7
Developers may still waste 97%-99% of their effort
when inspecting defect-prone files
Predicting Defective Lines Using
a Model-Agnostic Technique
8
Defect
Dataset
Extracting
Features
Building
file-level
defect models
File_A.java
File_B.java
File_C.java
Token_1
Token_2
… Token_N
An overview of building file-level defect models
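The feature-extraction step above can be sketched in a few lines: each file becomes a bag-of-tokens count vector, on which a file-level classifier is then trained with the file's defect label. This is a minimal illustrative sketch, not the paper's exact implementation; the tokenizer regex and helper names are assumptions.

```python
import re
from collections import Counter

# Illustrative sketch of "Extracting Features": each file is turned into
# a bag-of-tokens vector (Token_1 ... Token_N) over a shared vocabulary.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def extract_tokens(source):
    """Count code-token occurrences in one file."""
    return Counter(TOKEN_RE.findall(source))

def build_feature_matrix(files):
    """Turn {filename: source} into a shared vocabulary and count vectors."""
    counts = {name: extract_tokens(src) for name, src in files.items()}
    vocab = sorted({tok for c in counts.values() for tok in c})
    matrix = {name: [c[tok] for tok in vocab] for name, c in counts.items()}
    return vocab, matrix

# Toy input standing in for the defect dataset's source files.
files = {
    "File_A.java": "if (closure != null) { closure.call(); }",
    "File_B.java": "Object oldCurrent = current;",
}
vocab, matrix = build_feature_matrix(files)
```

A file-level defect model (the deck's "Building file-level defect models" step) would then be fit on `matrix` together with per-file defect labels.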
Predicting Defective Lines Using
a Model-Agnostic Technique
9
File-Level
Defect Models
LIME
Testing Files
Predicting
defect-prone
lines
Defect-Prone
Files
Identifying
defect-prone
lines
Identified Defect-Prone Lines
Ranking
defect-prone
lines
…
Most-risky
Least-risky
[Ribeiro et al. SIGKDD 2016]
**LIME is a model-agnostic technique that mimics the behavior of the defect model in order to explain its individual predictions
An illustrative example of our approach
for identifying defect-prone lines
10
A File of Interest (Testing file):
if(closure != null){
  Object oldCurrent = current;
  setClosure(closure, node);
  closure.call();
  current = oldCurrent;
}
The file-level defect prediction model flags the file as defect-prone.
LIME scores each code token: oldCurrent 0.8, current 0.1, node -0.3, closure -0.7.
Tokens with a LIME score > 0 are considered defective; tokens with a LIME score < 0 are considered clean.
Ranking tokens based on LIME scores, then mapping tokens to lines, yields the identified defect-prone lines.
[Ribeiro et al. SIGKDD 2016]
An illustrative example of our approach
for identifying defect-prone lines
11
Code tokens that frequently appeared in defective files
in the past may also appear in the lines that
will be fixed after release
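The token-to-line mapping in the illustrative example can be sketched as follows. The LIME scores and the code snippet come from the slide; the max-positive-score aggregation used to rank lines is an assumption for illustration, not necessarily the paper's exact rule.

```python
import re

# Per-token LIME scores from the slide's example (score > 0 => defective token).
lime_scores = {"oldCurrent": 0.8, "current": 0.1, "node": -0.3, "closure": -0.7}

source = """\
if(closure != null){
Object oldCurrent = current;
setClosure(closure, node);
closure.call();
current = oldCurrent;
}"""

def rank_defect_prone_lines(source, lime_scores):
    """Flag a line as defect-prone when it contains a positively scored token,
    and rank flagged lines by their highest token score (assumed aggregation)."""
    ranked = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line)
        positive = [lime_scores[t] for t in tokens if lime_scores.get(t, 0) > 0]
        if positive:
            ranked.append((lineno, max(positive)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

ranked = rank_defect_prone_lines(source, lime_scores)
```

On the slide's snippet this flags the two lines containing `oldCurrent` and `current`, matching the example's identified defect-prone lines.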
Research Questions
12
Computation
Time
Ranking
Performance
Predictive Accuracy
We compare our approach
against six line-level baseline approaches
13
Our Approach (LIME) vs. the baselines:
- Random Guessing
- PMD [Copeland, PMD Applied, 2005]
- ErrorProne [Aftandilian et al. SCAM 2012]
- NLP [Hellendoorn et al. ESEC/FSE 2017]
- TMI-RF (Random Forest)
- TMI-LR (Logistic Regression)
**TMI-LR is a traditional model-interpretation-based approach with logistic regression
**TMI-RF is a traditional model-interpretation-based approach with random forest
Our approach achieves better overall predictive accuracy
than the baseline approaches
14
Metric                                 Our Approach (LIME)   Line-level Baselines
Recall (higher is better)              0.61 – 0.62           0.01 – 0.51
MCC (higher is better)                 0.04 – 0.05           -0.01 – 0.03
False Alarm (lower is better)          0.47 – 0.48           0.01 – 0.54
Distance to Heaven (lower is better)   0.43 – 0.44           0.52 – 0.70
(Distance to Heaven is the root mean square of (1 − recall) and the false alarm value.)
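The Distance to Heaven entries can be reproduced from the recall and false-alarm ranges above. A minimal sketch, assuming the d2h formulation common in the defect-prediction literature, where "heaven" is recall 1 and false alarm 0:

```python
import math

def distance_to_heaven(recall, false_alarm):
    """Root mean square of (1 - recall) and the false alarm rate.
    0 is ideal ("heaven": recall = 1, false alarm = 0); 1 is worst."""
    return math.sqrt(((1 - recall) ** 2 + false_alarm ** 2) / 2)

# Mid-points of the ranges reported for our approach:
# recall 0.61-0.62, false alarm 0.47-0.48 => d2h of about 0.43.
d2h = distance_to_heaven(recall=0.615, false_alarm=0.475)
```

Plugging in the mid-points of our approach's recall and false-alarm ranges lands inside the reported 0.43 – 0.44 band, which is a useful sanity check on the table.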
Experimental Results
15
Computation
Time
Ranking
Performance
Predictive Accuracy
Our approach
achieves an overall
predictive accuracy
better than baseline
approaches
Given a fixed amount of effort, our approach identifies
actual defective lines better than baseline approaches
16
Metric                                     Our Approach (LIME)   Line-level Baselines
Recall@Top20%LOC (higher is better)        0.26 – 0.27           0.17 – 0.22
Initial False Alarm (lower is better)      9 – 16                10 – 403
(Initial False Alarm is the number of clean lines on which developers spend SQA effort until the first defective line is found when lines are ranked.)
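Both ranking metrics can be computed directly from a risk-ranked list of lines. A minimal sketch with an invented toy ranking; the 20%-of-LOC cutoff and the metric definitions follow the slide:

```python
def recall_at_top_20_loc(ranked_is_defective):
    """Fraction of all defective lines found in the top 20% of ranked lines."""
    total = sum(ranked_is_defective)
    k = max(1, int(0.20 * len(ranked_is_defective)))
    return sum(ranked_is_defective[:k]) / total if total else 0.0

def initial_false_alarm(ranked_is_defective):
    """Number of clean lines inspected before the first defective line."""
    for inspected, defective in enumerate(ranked_is_defective):
        if defective:
            return inspected
    return len(ranked_is_defective)

# Toy ranking: True = actually defective, ordered most- to least-risky.
ranking = [False, True, False, False, True] + [False] * 15
```

For this toy ranking, the top 20% (4 of 20 lines) contains one of the two defective lines, and one clean line is inspected before the first hit.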
Experimental Results
17
Computation
Time
Ranking
Performance
Predictive Accuracy
Our approach
achieves an overall
predictive accuracy
better than baseline
approaches
Given a fixed amount
of effort, our
approach identifies
actual defective lines
better than baseline
approaches
The computation time of our approach is manageable,
considering its accuracy in predicting defective lines
18
(Table: computation-time rankings, within-release and cross-release, of Our Approach, PMD, ErrorProne, NLP, TMI-LR, and TMI-RF, ranked 1st to 6th in each setting.)
Computation
Time
Ranking
Performance
Predictive Accuracy
Experimental Results
19
Our approach
achieves an overall
predictive accuracy
better than baseline
approaches
Given a fixed amount
of effort, our
approach identifies
actual defective lines
better than baseline
approaches
The computational
time of our approach
is manageable
when considering the
predictive accuracy of
defective lines
Predicting Defective Lines Using
a Model-Agnostic Technique
20
Our work takes an important step towards line-level defect prediction
by leveraging a model-agnostic technique.
Our framework will help developers effectively prioritize SQA effort.
Supatsara Wattanakriengkrai
wattanakri.supatsara.ws3@is.naist.jp
21
