Micro Interaction Metrics for Defect Prediction



Taek Lee, Jaechang Nam, Dongyun Han, Sunghun Kim, Hoh Peter In
                FSE 2011, Hungary, Sep. 5-9
Outline

• Research motivation
• The existing metrics
• The proposed metrics
• Experiment results
• Threats to validity
• Conclusion
Defect Prediction?
Why is it necessary?
Software quality assurance is inherently a resource-constrained activity!
Predicting defect-prone software entities* helps put the limited labor effort on the entities most likely to be defective

* functions or code files
Indicators of defects

• Complexity of source code (Chidamber and Kemerer 1994)
• Frequent code changes (Moser et al. 2008)
• Previous defect information (Kim et al. 2007)
• Code dependencies (Zimmermann 2007)
Indeed, where do defects come from?
Human Error!
Programmers make mistakes; consequently, defects are injected and software fails

Human Errors → Bugs Injected → Software Fails
Programmer Interaction and Software Quality

“Errors are from cognitive breakdown while understanding and implementing requirements”
- Ko et al. 2005

“Work interruptions or task switching may affect programmer productivity”
- DeLine et al. 2006
Don’t we need to also consider developers’ interactions as defect indicators?

…, but the existing indicators can NOT directly capture developers’ interactions
Using Mylyn data, we propose novel
“Micro Interaction Metrics (MIMs)”
    capturing developers’ interactions
The Mylyn* data is stored as an attachment to the corresponding bug reports in the XML format

* Eclipse plug-in storing and recovering task contexts
<InteractionEvent … Kind=“ ” … StartDate=“ ” EndDate=“ ” … StructureHandle=“ ” … Interest=“ ” … >
Two levels of MIMs Design

File-level MIMs: specific interactions for a file in a task
(e.g., AvgTimeIntervalEditEdit)

Task-level MIMs: property values shared over the whole task
(e.g., TimeSpent)

Example Mylyn task log:
10:30  Selection  file A
11:00  Edit       file B
12:30  Edit       file B
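To make the two levels concrete, a small Python sketch computes both example metrics from the task log above. The semantics are assumptions for illustration (AvgTimeIntervalEditEdit as the mean gap between consecutive edits of the same file, TimeSpent as the whole-task duration); the paper’s exact definitions may differ.

log = [  # (time, kind, file) as in the slide
    ("10:30", "Selection", "A"),
    ("11:00", "Edit", "B"),
    ("12:30", "Edit", "B"),
]

def minutes(t):
    h, m = map(int, t.split(":"))
    return h * 60 + m

def avg_time_interval_edit_edit(log, target_file):
    # file-level MIM: mean gap (minutes) between consecutive edits of one file
    edits = [minutes(t) for t, kind, f in log if kind == "Edit" and f == target_file]
    gaps = [b - a for a, b in zip(edits, edits[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

def time_spent(log):
    # task-level MIM: duration of the whole task
    times = [minutes(t) for t, _, _ in log]
    return max(times) - min(times)

print(avg_time_interval_edit_edit(log, "B"))  # 90.0 minutes for file B
print(time_spent(log))                        # 120 minutes for the task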
The Proposed Micro Interaction Metrics
For example, NumPatternSXEY is to capture this interaction:

“How many times did a programmer Select a file of group X and then Edit a file of group Y in a task activity?”
Group X or Y: X if a file shows defect locality* properties, Y otherwise

Group H or L: H if a file has a high** DOI value, L otherwise

* hinted by the paper [Kim et al. 2007]
** threshold: median of degree of interest (DOI) values in a task
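As an illustration only, a short Python sketch of this count; it assumes the pattern is counted whenever a Selection of a group-X file is immediately followed by an Edit of a group-Y file within the same task (whether non-adjacent pairs also count is not stated here).

def num_pattern_sxey(events, group_of):
    # events: ordered (kind, file) pairs from one task; group_of: file -> "X"/"Y"
    count = 0
    for (kind_a, file_a), (kind_b, file_b) in zip(events, events[1:]):
        if (kind_a == "Selection" and group_of[file_a] == "X"
                and kind_b == "Edit" and group_of[file_b] == "Y"):
            count += 1
    return count

events = [("Selection", "A.java"), ("Edit", "B.java"), ("Edit", "B.java")]
groups = {"A.java": "X", "B.java": "Y"}  # illustrative grouping
print(num_pattern_sxey(events, groups))  # 1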
Bug Prediction Process
STEP1: Counting & Labeling Instances

[Timeline figure: Mylyn tasks Task 1 … Task i+3 touching f1.java, f2.java, f3.java between Dec 2005 and Sep 2010, split at time P]

All the Mylyn task data collectable from Eclipse subprojects (Dec 05 ~ Sep 10)

Post-defect counting period: after time P

The number of counted post-defects (edited files only within bug-fixing tasks):
f1.java = 1
f2.java = 1
f3.java = 2

Labeling rule for a file instance:
“buggy” if # of post-defects > 0
“clean” if # of post-defects = 0
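A compact Python sketch of this step, under the assumption that one post-defect is counted for a file each time it is edited within a bug-fixing task after time P; the task contents below are illustrative and chosen only to reproduce the counts on the slide.

from collections import Counter

post_fix_tasks = [                       # illustrative bug-fixing tasks after time P
    ["f3.java"],
    ["f1.java", "f2.java", "f3.java"],
]

post_defects = Counter(f for task in post_fix_tasks for f in task)
# -> f1.java = 1, f2.java = 1, f3.java = 2
labels = {f: ("buggy" if n > 0 else "clean") for f, n in post_defects.items()}
# files never edited in a bug-fixing task after time P would be labeled "clean"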
STEP2: Extraction of MIMs

[Timeline figure: tasks Task 1 … Task 4 in the metrics extraction period (Dec 2005 up to time P); Task 1 edits f3.java, Task 2 edits f1.java, Task 3 edits f2.java, Task 4 edits f1.java and f2.java]

Metrics Computation (a file’s MIM value is averaged over the tasks that edited it):
MIM_f3.java ← value_Task1
MIM_f1.java ← (value_Task2 + value_Task4) / 2
MIM_f2.java ← (value_Task3 + value_Task4) / 2
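A sketch of the aggregation used here: a per-task MIM value is computed for each file the task edited, and a file’s final MIM value is the average over all extraction-period tasks that touched it. The per-task values below are placeholders.

from collections import defaultdict

per_task_values = [              # (task, file, MIM value computed within that task)
    ("Task1", "f3.java", 4.0),
    ("Task2", "f1.java", 2.0),
    ("Task3", "f2.java", 6.0),
    ("Task4", "f1.java", 3.0),
    ("Task4", "f2.java", 1.0),
]

sums, counts = defaultdict(float), defaultdict(int)
for _, f, v in per_task_values:
    sums[f] += v
    counts[f] += 1

mim_per_file = {f: sums[f] / counts[f] for f in sums}
# e.g., MIM_f1.java = (2.0 + 3.0) / 2 = 2.5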
The Understand JAVA tool was used for extracting 32 source code metrics (CMs)* from the last CVS revision before time P

[List of selected source code metrics — table not captured]

* Chidamber and Kemerer, and OO metrics
Fifteen History Metrics (HMs)* were collected from the corresponding CVS repository (revisions from Dec 2005 up to time P)

[List of history metrics (HMs) — table not captured]

* Moser et al.
STEP3: Creating a training corpus

Instance Name | Extracted MIMs … | Label             → training a Classifier
Instance Name | Extracted MIMs … | # of post-defects → training a Regression model
STEP4: Building prediction models

Classification and regression modeling with different machine learning algorithms using the WEKA* tool

* an open source data mining tool
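The paper’s models were built in WEKA; purely as an illustrative stand-in, an equivalent Python sketch with scikit-learn (not the tool used in the study), using random placeholder data in place of the real metric matrix, labels, and post-defect counts.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 30))        # 200 file instances x 30 metric values (placeholder)
y = rng.integers(0, 2, 200)      # buggy (1) / clean (0) labels (placeholder)
c = rng.poisson(1.0, 200)        # # of post-defects per instance (placeholder)

clf = RandomForestClassifier(random_state=0)
f1 = cross_val_score(clf, X, y, cv=10, scoring="f1")                         # classification

reg = RandomForestRegressor(random_state=0)
mae = -cross_val_score(reg, X, c, cv=10, scoring="neg_mean_absolute_error")  # regression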
STEP5: Prediction Evaluation

Classification Measures
Precision(B): how many instances are really buggy among the buggy-predicted outcomes?
Recall(B): how many instances are correctly predicted as ‘buggy’ among the real buggy ones?
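In the usual notation (TP, FP, FN: true positives, false positives, false negatives on the buggy class), these are the standard definitions:

Precision(B) = TP / (TP + FP)
Recall(B)    = TP / (TP + FN)
F-measure(B) = 2 · Precision · Recall / (Precision + Recall)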
STEP5: Prediction Evaluation

Regression Measures (between the # of real buggy instances and the # of instances predicted as buggy):
correlation coefficient (-1~1)
mean absolute error (0~1)
root squared error (0~1)
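For n instances with real post-defect counts y_i and predictions ŷ_i, these are taken here as the standard definitions (the slide’s “root square error” is read as root mean squared error; WEKA also reports relative variants):

r    = Σ (y_i − ȳ)(ŷ_i − m) / ( sqrt(Σ (y_i − ȳ)²) · sqrt(Σ (ŷ_i − m)²) ),  where m = mean of ŷ
MAE  = (1/n) Σ |y_i − ŷ_i|
RMSE = sqrt( (1/n) Σ (y_i − ŷ_i)² )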
T-test with 100 runs of 10-fold cross-validation

Reject H0* and accept H1* if p-value < 0.05 (at the 95% confidence level)

* H0: no difference in average performance, H1: different (better!)
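A minimal sketch of this comparison, assuming each metric set yields 100 per-run performance scores (e.g., F-measures) from the repeated cross-validation; a paired t-test is shown as one reasonable choice, and the score values are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores_mim = rng.normal(0.60, 0.03, 100)  # illustrative per-run F-measures with MIMs
scores_cm  = rng.normal(0.45, 0.03, 100)  # illustrative per-run F-measures with CMs

t, p = stats.ttest_rel(scores_mim, scores_cm)
better = p < 0.05 and scores_mim.mean() > scores_cm.mean()
print(f"p = {p:.4f}, MIMs significantly better: {better}")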
Result Summary

MIMs improve prediction accuracy for:
1. different Eclipse project subjects
2. different machine learning algorithms
3. different model training periods
Prediction for different project subjects



          File instances and % of defects
Prediction for different project subjects




    MIM: the proposed metrics   CM: source code metrics   HM: history metrics
Prediction for different project subjects

BASELINE: a dummy classifier that predicts in a purely random manner
e.g., for 12.5% buggy instances: Precision(B) = 12.5%, Recall(B) = 50%, F-measure(B) = 20%

MIM: the proposed metrics   CM: source code metrics   HM: history metrics
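These baseline numbers follow directly from the definitions: a classifier that predicts “buggy” at random for half of the instances catches half of the real buggy ones (Recall = 50%), and its buggy-predicted set is a random sample, so the fraction that is really buggy equals the 12.5% base rate (Precision = 12.5%). Then:

F-measure(B) = 2 · 0.125 · 0.5 / (0.125 + 0.5) = 0.125 / 0.625 = 0.20 = 20%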
Prediction for different project subjects



    T-test results (significant figures are in bold, p-value < 0.05)
Prediction with different algorithms




  T-test results (significant figures are in bold, p-value < 0.05)
Prediction in different training periods

Model training period : Model testing period
        50%            :        50%
        70%            :        30%
        80%            :        20%

[Timeline: Dec 2005 — time P — Sep 2010]
Prediction in different training periods




   T-test results (significant figures are in bold, p-value < 0.05)
Top 42 metrics (37%) among the total 113 metrics (MIMs + CMs + HMs) are from MIMs
Possible Insight

TOP 1: NumLowDOIEdit
TOP 2: NumPatternEXSX
TOP 3: TimeSpentOnEdit

Chances are that more defects are introduced when a programmer repeatedly edits and browses a file (TOP 2), especially one related to previous defects, spends more of the time on editing (TOP 3), and especially edits files with low DOI, i.e., files accessed less frequently or less recently (TOP 1).
Performance comparison
with regression modeling
for predicting # of post-defects
Predicting Post-Defect Numbers




T-test results (significant figures are in bold, p-value < 0.05)
Threats to Validity
• Systems examined might not be representative

• Systems are all open source projects

• Defect information might be biased
Conclusion

Our findings exemplify that developers’ interactions can affect software quality

Our proposed micro interaction metrics improve defect prediction accuracy significantly
…
We believe future defect prediction models will use more of developers’ direct and micro-level interaction information

MIMs are a first step towards it
Thank you! Any Questions?
• Problem
  – Can developers’ interaction information affect software quality (defects)?
• Approach
  – We proposed novel micro interaction metrics (MIMs) that go beyond the popular static metrics
• Result
  – MIMs significantly improve prediction accuracy compared to source code metrics (CMs) and history metrics (HMs)
Backup Slides
One possible ARGUMENT: some developers may not have used Mylyn to fix bugs

This creates a chance of error in counting post-defects and, as a result, biased labels (i.e., an incorrect % of buggy instances)
We repeated the experiment using the same instances but with a different defect-counting heuristic: a CVS-log-based approach*

* with keywords: “fix”, “bug”, “bug report ID” in change logs
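As an illustration only (the exact matching rule is not detailed here), a Python sketch of such keyword matching over CVS change-log messages; the regular expression, including the digit pattern standing in for a bug report ID, is an assumption.

import re

FIX_PATTERN = re.compile(r"\b(fix(e[sd])?|bugs?)\b|\b\d{4,6}\b", re.IGNORECASE)
# \d{4,6} is a rough stand-in for an Eclipse bug report ID mentioned in the log

def is_bug_fix(log_message):
    return bool(FIX_PATTERN.search(log_message))

print(is_bug_fix("Fixed NPE in editor, see bug 231245"))  # True
print(is_bug_fix("Update copyright headers"))             # False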
Prediction with CVS-log-based approach

T-test results for the CVS-log-based labels (significant figures are in bold, p-value < 0.05)
The CVS-log-based approach reported additional post-defects (a higher % of buggy-labeled instances)

MIMs failed to capture them due to the lack of the corresponding Mylyn data
Note that the quality of CVS change logs cannot be fully guaranteed (e.g., no explicit bug ID, missing logs)
