Automatic Fine-Grained Issue Report Reclassification

Pavneet Singh Kochhar, Ferdian Thung, David Lo
Singapore Management University
{kochharps.2012, ferdiant.2013, davidlo}@smu.edu.sg

Misclassification of Issue Reports
Herzig et al. *
• 40% of issue reports are misclassified.
• 1/3 of issue reports are wrongly classified as bugs.
[Figure: the label BUG shown alongside the categories DOCUMENTATION, IMPROVEMENT, REFACTORING, BACKPORT, CLEANUP, DESIGN DEFECT, TASK, TEST]
* It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction, K. Herzig, S. Just, A. Zeller, ICSE 2013

Impact of Misclassification
• Well-known projects receive a large number of issue reports
• A large number of bug reports can overwhelm the available developers.
• Mozilla developer: “Everyday, almost 300 bugs appear that need triaging.” *
• Triaging is a manual process
• Misclassified reports take more time to fix +
* J. Anvik, L. Hiew, and G. C. Murphy, “Coping with an open bug repository,” in ETX, pp. 35–39, 2005.
+ X. Xia, D. Lo, M. Wen, E. Shihab, and B. Zhou, “An empirical study of bug report field reassignment,” in CSMR-WCRE, pp. 174–183, 2014.

Related Work
• Herzig et al. [1]:
  • Manually classified over 7000 issue reports.
  • 14 different categories
  → We use the same dataset
  → We use 13 categories (merging UNKNOWN & OTHERS)
• Antoniol et al. [2]:
  • Classify issue reports as either “bug” or “enhancement”
  → We consider the “reclassification” problem
  → We use 13 different categories
[1] K. Herzig, S. Just, and A. Zeller, “It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction,” in ICSE, 2013.
[2] G. Antoniol, K. Ayari, M. D. Penta, F. Khomh, and Y.-G. Gueheneuc, “Is it a bug or an enhancement? A text-based approach to classify change requests,” in CASCON, pp. 23:304–23:318, 2008.

Our Study
Fine-Grained Issue Report Reclassification
13 Categories*
BUG, RFE, IMPROVEMENT, DOCUMENTATION, TASK, BUILD, REFACTORING, DESIGN DEFECT, TEST, CLEANUP, BACKPORT, SPECIFICATION, OTHERS
Annotations shown next to categories on the slide: (Adaptive Maintenance), (Perfective Maintenance), (Deallocating memory), (Removing Duplicate methods)

Overall Framework
[Figure: two-phase pipeline.
Training Phase: Training Issue Reports + Ground Truth Categories* → Feature Extraction → Model Building → Model
Deployment Phase: New Issue Reports → Feature Extraction → Model → Predicted Reclassified Categories]
*Herzig et al.

Pre-Processing
• Text Pre-Processing
  • Summary & Description fields
• Stop-word removal
  • e.g., “is”, “are”, “if”
• Stemming (reducing words to their root form)
  • e.g., “reads” and “reading” → “read”
  • Uses the Porter Stemmer*
*http://tartarus.org/martin/PorterStemmer/
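The pre-processing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list is a tiny hypothetical subset, `toy_stem` is a crude suffix stripper standing in for the Porter Stemmer named on the slide, and the function names are invented for this example.

```python
import re

# Illustrative subset only; a real stop-word list is much longer.
STOP_WORDS = {"is", "are", "if", "the", "a", "an", "and", "to", "of"}

def toy_stem(token):
    """Crude stand-in for the Porter Stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(summary, description):
    """Join the two text fields, lowercase, drop stop words, stem the rest."""
    text = f"{summary} {description}".lower()
    tokens = re.findall(r"[a-z]+", text)
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Parser reads files", "Reading is slow if cache fails"))
```

In this toy version both “reads” and “reading” reduce to “read”, matching the slide's example; a real Porter Stemmer applies many more rules.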

Feature Extraction
1. TF-IDF
TF: Term Frequency, IDF: Inverse Document Frequency
2. Reported Category (C1-C13)
Cn = 1 if the report was originally filed under category n (n = 1 to 13), 0 otherwise
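A minimal sketch of these two feature groups, under assumptions: the slides do not say which TF-IDF variant is used, so this example takes raw term count times log(N / document frequency); the category order and the function names are likewise hypothetical.

```python
import math
from collections import Counter

# Assumed ordering of the 13 categories; the slides do not fix an order.
CATEGORIES = [
    "BUG", "RFE", "IMPROVEMENT", "DOCUMENTATION", "TASK", "BUILD",
    "REFACTORING", "DESIGN DEFECT", "TEST", "CLEANUP", "BACKPORT",
    "SPECIFICATION", "OTHERS",
]

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

def reported_category_features(reported):
    """One indicator per category: 1 for the reported category, 0 elsewhere."""
    return [1 if reported == c else 0 for c in CATEGORIES]

vectors = tf_idf([["test", "fail", "test"], ["test", "doc"]])
print(vectors[0])
print(reported_category_features("RFE"))
```

A term that occurs in every report (here “test”) gets weight 0 under this variant, which is why TF-IDF favors terms that separate reports from one another.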

Feature Extraction
3. Exception Trace (S)
a) Phrase: “Exception in thread”
b) Regex: [A-Za-z0-9$.]+Exception
e.g., java.lang.NullPointerException
c) Regex: [A-Za-z0-9$.]+[A-Za-z0-9]+([A-Za-z0-9]+(java:[0-9]+)?)
e.g., oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:447)
4. Issue Reporter (R1-RM)
where M is the total number of reporters
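The exception-trace feature could be computed roughly as below. Two assumptions are made here: S is treated as a binary flag that is set when any of the three patterns matches, and pattern (c) is written with escaped dots and parentheses, since the version on the slide appears to have lost its escaping and would not match the slide's own example as literal text. The constant and function names are hypothetical.

```python
import re

PHRASE = "Exception in thread"
# Pattern (b) as given on the slide.
EXCEPTION_NAME = re.compile(r"[A-Za-z0-9$.]+Exception")
# Pattern (c), escaped variant (an assumption, see the note above).
STACK_FRAME = re.compile(
    r"[A-Za-z0-9$.]+\.[A-Za-z0-9]+\([A-Za-z0-9]+\.java(:[0-9]+)?\)"
)

def has_exception_trace(text):
    """Return 1 if the report text looks like it contains a Java exception trace."""
    found = (
        PHRASE in text
        or EXCEPTION_NAME.search(text) is not None
        or STACK_FRAME.search(text) is not None
    )
    return int(found)

print(has_exception_trace("java.lang.NullPointerException"))
print(has_exception_trace("at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:447)"))
print(has_exception_trace("Typo in the user guide"))
```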

Model Building
• LibSVM (Support Vector Machine)*
• Multi-class classification
• Inputs
  • L, the learner (training algorithm)
  • X, the set of training data, i.e., issue reports
  • y, where $y_i \in \{1, \dots, K\}$, the labels, i.e., the 13 categories
• Output
  • A list of classifiers $f_k$ for $k \in \{1, \dots, K\}$
  • Classifiers are applied to unseen data to predict label k
*http://www.csie.ntu.edu.tw/~cjlin/libsvm/
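One way to read “a list of classifiers f_k” is a one-vs-rest scheme: train one binary classifier per category and predict the category whose classifier scores highest. The sketch below follows that reading, which is an assumption; it also swaps in a plain perceptron as the learner L so the example stays self-contained, whereas the slides use LibSVM. All function names are hypothetical.

```python
def train_perceptron(X, y, epochs=20):
    """Stand-in learner L: binary perceptron with labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # misclassified or on the boundary: update
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
    return w, b

def train_one_vs_rest(learner, X, y, labels):
    """Build one classifier f_k per label k (label k versus all others)."""
    return {k: learner(X, [1 if yi == k else -1 for yi in y]) for k in labels}

def predict(classifiers, x):
    """Return the label whose classifier gives the highest score for x."""
    def score(model):
        w, b = model
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(classifiers, key=lambda k: score(classifiers[k]))

X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y = [1, 2, 3]
models = train_one_vs_rest(train_perceptron, X, y, labels=[1, 2, 3])
print(predict(models, [0.9, 0.1, 0.0]))
```

Passing the learner in as an argument mirrors the slide's list of inputs (L, X, y), so a different training algorithm can be dropped in without changing the rest.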

Dataset
Project        Organization   Tracker    Number of Issue Reports
HTTPClient     Apache         JIRA       746
Jackrabbit     Apache         JIRA       2402
Lucene-Java    Apache         JIRA       2443
Rhino          Mozilla        BugZilla   1226
Tomcat5        Apache         BugZilla   584
Total = 7401 Issue Reports *

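Evaluation Metrics
$$Pre_{category} = \frac{\#TP_{category}}{\#TP_{category} + \#FP_{category}} \quad \text{(Precision)}$$
$$Rec_{category} = \frac{\#TP_{category}}{\#TP_{category} + \#FN_{category}} \quad \text{(Recall)}$$
$$F1_{category} = \frac{2 \times Pre_{category} \times Rec_{category}}{Pre_{category} + Rec_{category}} \quad \text{(F-Measure)}$$
$$WF1 = \frac{1}{N} \sum_{category=1}^{\#category} n_{category} \times F1_{category} \quad \text{(Weighted F-Measure)}$$
where $n_{category}$ is the number of issue reports in a category and $N$ is the total number of issue reports.
We use weighted precision, recall, and F-measure.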
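The weighted metrics can be computed as in the sketch below. It assumes that each category's weight is its number of ground-truth reports and that a ratio with a zero denominator counts as 0; the function name is hypothetical.

```python
from collections import Counter

def weighted_scores(truth, predicted):
    """Return (weighted precision, weighted recall, weighted F1)."""
    n = len(truth)
    support = Counter(truth)  # n_category: ground-truth reports per category
    w_pre = w_rec = w_f1 = 0.0
    for category, n_cat in support.items():
        pairs = list(zip(truth, predicted))
        tp = sum(t == category and p == category for t, p in pairs)
        fp = sum(t != category and p == category for t, p in pairs)
        fn = sum(t == category and p != category for t, p in pairs)
        pre = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        w_pre += n_cat * pre
        w_rec += n_cat * rec
        w_f1 += n_cat * f1
    return w_pre / n, w_rec / n, w_f1 / n

truth = ["BUG", "BUG", "RFE", "RFE"]
predicted = ["BUG", "RFE", "RFE", "RFE"]
print(weighted_scores(truth, predicted))
```

In this toy input BUG has precision 1 and recall 0.5, RFE has precision 2/3 and recall 1, and both categories carry equal weight because each has two ground-truth reports.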

Baselines
• Baseline-1
Predicts that the reclassified category is the same as the originally assigned category
• Baseline-2
Always predicts the reclassified category as “BUG”
(the majority of the issues are bugs)
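Both baselines need no training and can be written in a few lines. The sketch below is only an illustration; the function names are hypothetical.

```python
def baseline_1(reported_categories):
    """Keep the category the reporter originally assigned."""
    return list(reported_categories)

def baseline_2(reported_categories):
    """Predict the majority category, BUG, for every report."""
    return ["BUG"] * len(reported_categories)

reported = ["BUG", "RFE", "TASK"]
print(baseline_1(reported))
print(baseline_2(reported))
```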

Research Questions
RQ1: Effectiveness of Our Approach
RQ2: Varying the Amount of Training Data
RQ3: Most Discriminative Features
RQ4: Analysis of Correctly & Wrongly Classified Issue Reports
RQ5: Comparison to Other Classification Algorithms

RQ1: Effectiveness of Our Approach
                    HTTPClient             Jackrabbit             Lucene-Java
                    Prec   Rec    WF1      Prec   Rec    WF1      Prec   Rec    WF1
Ours                0.61   0.63   0.60     0.71   0.72   0.71     0.62   0.63   0.62
Baseline-1          0.54   0.52   0.43     0.61   0.62   0.54     0.50   0.50   0.43
Baseline-2          0.16   0.40   0.23     0.15   0.39   0.21     0.08   0.28   0.12
Improvement-1 (%)   12.96  21.15  39.53    16.39  16.12  31.48    24.00  26.00  44.18
Improvement-2 (%)   281.2  57.4   160.8    373.3  84.6   238.0    675.0  125.0  416.6

                    Rhino                  Tomcat5
                    Prec   Rec    WF1      Prec   Rec    WF1
Ours                0.58   0.61   0.57     0.58   0.62   0.58
Baseline-1          0.35   0.57   0.43     0.36   0.58   0.45
Baseline-2          0.26   0.51   0.35     0.30   0.54   0.38
Improvement-1 (%)   65.71  7.01   32.55    61.11  6.89   28.88
Improvement-2 (%)   123.0  19.6   62.85    93.3   14.8   52.63
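The Improvement rows are consistent with a relative gain in percent, (ours - baseline) / baseline × 100. The slides do not state the formula, so this is an inferred reading; the function name is hypothetical.

```python
def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100

# HTTPClient precision: ours 0.61 versus Baseline-1 0.54
print(round(relative_improvement(0.61, 0.54), 2))
# Tomcat5 WF1: ours 0.58 versus Baseline-1 0.45
print(round(relative_improvement(0.58, 0.45), 2))
```

Because the baselines' scores are sometimes very small (for example 0.08 precision for Baseline-2 on Lucene-Java), the relative gain can reach several hundred percent even when the absolute gain is modest.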

RQ2: Varying Training Data
% of Issue   HTTPClient             Jackrabbit             Lucene-Java
Reports      Prec   Rec    WF1      Prec   Rec    WF1      Prec   Rec    WF1
10           0.49   0.56   0.47     0.63   0.65   0.60     0.55   0.57   0.53
20           0.54   0.55   0.46     0.64   0.66   0.61     0.57   0.57   0.54
30           0.58   0.60   0.54     0.68   0.70   0.67     0.59   0.60   0.58
40           0.54   0.53   0.48     0.69   0.71   0.68     0.59   0.58   0.56
50           0.58   0.61   0.57     0.69   0.71   0.69     0.62   0.63   0.61
60           0.59   0.62   0.58     0.64   0.65   0.62     0.61   0.62   0.61
70           0.60   0.62   0.58     0.70   0.72   0.70     0.62   0.63   0.62
80           0.62   0.68   0.61     0.70   0.72   0.70     0.63   0.64   0.63
90           0.61   0.64   0.60     0.71   0.73   0.71     0.62   0.63   0.62

RQ2: Varying Training Data
% of Issue   Rhino                  Tomcat5
Reports      Prec   Rec    WF1      Prec   Rec    WF1
10           0.45   0.52   0.40     0.47   0.54   0.43
20           0.46   0.50   0.39     0.50   0.55   0.45
30           0.46   0.50   0.40     0.54   0.60   0.53
40           0.47   0.48   0.40     0.56   0.62   0.56
50           0.52   0.58   0.50     0.56   0.61   0.56
60           0.55   0.59   0.53     0.50   0.48   0.42
70           0.56   0.60   0.54     0.49   0.44   0.38
80           0.58   0.61   0.56     0.57   0.62   0.58
90           0.59   0.61   0.56     0.54   0.59   0.55

RQ3: Most Discriminative Features

HTTPClient
Feature                       Fisher Score
Stemmed word “test”           1.73
Reported Category (TASK)      0.58
Stemmed word “privat”         0.56
Reported Category (BUG)       0.54
Stemmed word “cleanup”        0.50

Jackrabbit
Feature                       Fisher Score
Reported Category (BUG)       0.72
Stemmed word “test”           0.55
Stemmed word “maven”          0.51
Stemmed word “backport”       0.46
Reported Category (IMPR)      0.43

Lucene-Java
Feature                       Fisher Score
Stemmed word “test”           0.94
Reported Category (BUG)       0.61
Reported Category (TEST)      0.50
Stemmed word “backport”       0.45
Stemmed word “remov”          0.38

Rhino
Feature                       Fisher Score
Stemmed word “test”           3.84
Stemmed word “suit”           0.43
Stemmed word “patch”          0.32
Stemmed word “driver”         0.29
Stemmed word “regress”        0.27

Tomcat5
Feature                       Fisher Score
Stemmed word “longer”         1.15
Issue Reporter “starksm”      0.71
Stemmed word “class”          0.64
Stemmed word “ant”            0.62
Reported Category (BUG)       0.56
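The slides do not say which Fisher score variant produced these numbers. As a rough guide, one common definition divides the between-class spread of a feature by its within-class spread, so a feature scores high when its mean differs across categories while its values stay tight within each category. The sketch below implements that definition for a single feature; the function name and the zero-variance handling are assumptions.

```python
from collections import defaultdict

def fisher_score(values, labels):
    """Between-class scatter over within-class scatter for one feature."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for group in groups.values():
        n_k = len(group)
        mean_k = sum(group) / n_k
        var_k = sum((v - mean_k) ** 2 for v in group) / n_k  # population variance
        between += n_k * (mean_k - overall_mean) ** 2
        within += n_k * var_k
    if within == 0.0:
        # Perfectly tight classes: infinitely discriminative unless means coincide.
        return float("inf") if between > 0 else 0.0
    return between / within

labels = ["TEST", "TEST", "BUG", "BUG"]
print(fisher_score([1.0, 0.8, 0.2, 0.0], labels))  # separates the two classes well
print(fisher_score([1.0, 0.0, 1.0, 0.0], labels))  # same mean in both classes
```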

RQ4: Correctly & Wrongly Classified Reports
Rows: ground-truth labels. Columns: predicted labels.
          BUG    RFE    IMPR   TEST   DOC    BUILD   CLEANUP   REFAC
BUG       2631   48     119    26     23     8       8         1
RFE       139    765    223    6      13     7       13        31
IMPR      320    214    658    8      12     13      16        19
TEST      84     12     15     220    1      8       4         3
DOC       95     39     37     0      209    13      17        2
BUILD     29     17     19     11     10     127     5         1
CLEANUP   58     30     42     6      11     5       104       12
REFAC     20     51     61     1      2      0       16        91
Table shows 8 of the 13 categories.
Correctly classified:
BUG: 2631/2914 (90.3%)
TEST: 220/349 (63%)
RFE: 765/1221 (62.7%)
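The percentages above divide the diagonal count by the number of ground-truth reports in that category across all 13 categories, which is why the denominators are slightly larger than the sums of the eight columns shown (the BUG row, for instance, sums to 2864 rather than 2914). A small sketch using the counts from the slide; the variable and function names are hypothetical.

```python
# Diagonal entries and ground-truth totals as reported on the slide.
correct = {"BUG": 2631, "TEST": 220, "RFE": 765}
ground_truth_total = {"BUG": 2914, "TEST": 349, "RFE": 1221}

def per_category_accuracy(correct, total):
    """Share of each category's reports that were predicted correctly, in percent."""
    return {c: round(100 * correct[c] / total[c], 1) for c in correct}

print(per_category_accuracy(correct, ground_truth_total))
```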

RQ5: Comparison with Other Algorithms
Approach              HTTPClient             Jackrabbit             Lucene-Java
                      Prec   Rec    WF1      Prec   Rec    WF1      Prec   Rec    WF1
Ours (LibSVM)         0.61   0.63   0.60     0.71   0.72   0.71     0.62   0.63   0.62
Naïve Bayes           0.49   0.47   0.48     0.51   0.39   0.43     0.46   0.37   0.40
NB Multinomial        0.53   0.60   0.54     0.64   0.66   0.61     0.60   0.59   0.56
K-Nearest Neighbors   0.47   0.29   0.34     0.60   0.58   0.59     0.46   0.40   0.42
Random Forest         0.45   0.56   0.46     0.54   0.58   0.53     0.45   0.48   0.43
RBF Network           0.37   0.39   0.37     0.39   0.41   0.40     0.31   0.31   0.30

RQ5: Comparison with Other Algorithms
Approach              Rhino                  Tomcat5
                      Prec   Rec    WF1      Prec   Rec    WF1
Ours (LibSVM)         0.58   0.61   0.57     0.58   0.62   0.58
Naïve Bayes           0.51   0.51   0.51     0.48   0.40   0.42
NB Multinomial        0.52   0.58   0.49     0.51   0.58   0.47
K-Nearest Neighbors   0.50   0.43   0.43     0.43   0.43   0.42
Random Forest         0.51   0.56   0.47     0.45   0.56   0.46
RBF Network           0.40   0.43   0.41     0.33   0.54   0.39

Conclusion & Future Work
• Automated approach to reclassify issue reports
• Evaluated on over 7000 issue reports
• Extracts features such as TF-IDF, reported category, exception trace, and issue reporter
• Performs multi-class classification (13 categories)
• Weighted F-measure of 0.57-0.71
• Improvement of 28.88% to 416.6% over baselines
Future Work:
  → Analyse more issue reports
  → Design an advanced multi-class solution

Thank You!
Email: kochharps.2012@smu.edu.sg
