NLBSE’22: Tool Competition
@NLBSE_workshop · nlbse2022.github.io/tools

Oscar Chaparro (College of William & Mary, USA)
Rafael Kallis (Rafael Kallis Consulting, Switzerland)
The competition at a glance
Goal: develop more accurate models for issue classification
Baseline model: TicketTagger
Dataset: 800k+ issue reports from 127k+ GitHub projects
Competitors: 5 teams
Issue report classification

[Diagram: an issue report is fed to a classification model, which labels it as Bug, Enhancement, or Question.]

• An important task in issue management and prioritization
• Extensive research in the field applies NLP/ML techniques
Baseline model: TicketTagger

• Rafael Kallis et al., "Ticket Tagger: Machine Learning Driven Issue Classification", ICSME’19
• Rafael Kallis et al., "Predicting Issue Types on GitHub", Science of Computer Programming, 2021

[Diagram: the issue title & description are fed to TicketTagger, which predicts Bug, Enhancement, or Question.]
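TicketTagger's underlying model is a fastText supervised classifier trained on the issue title and description. A minimal sketch of that setup, with illustrative file paths and hyperparameters rather than the published configuration:

```python
import fasttext

# fastText expects one example per line, label first, e.g.:
#   __label__bug App crashes when opening the settings page ...
# train.txt / test.txt are hypothetical files built from issue
# titles and descriptions.
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.5,          # illustrative hyperparameters,
    epoch=25,        # not the published configuration
    wordNgrams=2,
)

# Classify a new issue (title + description concatenated).
labels, probs = model.predict("Crash when clicking the save button")
print(labels[0], probs[0])

# Evaluate: returns (#samples, precision@1, recall@1).
print(model.test("test.txt"))
```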
Benchmark dataset

800k+ issues from 127k+ GitHub projects
Closed issues carrying any of the three labels, collected via Google BigQuery (query sketch below)

Each issue record includes:
• Label (aka issue type)
• Title and description
• URL (issue and repository), timestamp
• Author type (owner, contributor, etc.)
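A hypothetical sketch of such a BigQuery collection step using the Python client; the project, table, and schema below are stand-ins, not the exact query behind the benchmark:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes GCP credentials are configured

# Hypothetical table and schema illustrating the selection criteria:
# closed issues carrying one of the three labels of interest.
query = """
SELECT title, body, label, issue_url, repo_url, created_at, author_type
FROM `my-project.github_data.closed_issues`
WHERE label IN ('bug', 'enhancement', 'question')
"""

for row in client.query(query).result():
    print(row.label, row.title)
```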
Benchmark dataset

[Chart: label distribution — Bug 50.0%, Enhancement 41.4%, Question 8.6% of issues.]

Training set: 90% (723k issues)
Testing set: 10% (80.5k issues)
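The official split is fixed and shipped with the competition; for intuition, a stratified 90/10 split that preserves the label proportions could be sketched like this (the CSV file and column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# issues.csv is a hypothetical export with 'text' and 'label' columns.
df = pd.read_csv("issues.csv")

# 90% training / 10% testing, stratified so both splits keep the
# Bug/Enhancement/Question proportions of the full dataset.
train_df, test_df = train_test_split(
    df, test_size=0.10, stratify=df["label"], random_state=42
)
print(len(train_df), len(test_df))
```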
Infrastructure
Competition rules

• Training and fine-tuning must use only the training set
• Models are compared by classification accuracy on the testing set
• Preprocessing and manipulation of the training set are allowed: feature engineering, data balancing, holding out a validation set, etc. (balancing sketch below)
• No balancing or modification of the testing set
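For instance, balancing the training set could be done with random oversampling, as in this illustrative sketch (a common technique, not necessarily what any team used):

```python
import pandas as pd

# Toy training frame; in practice this is the 723k-issue training set.
train_df = pd.DataFrame({
    "text": ["crash on save", "add dark mode", "how do I install?", "NPE in parser"],
    "label": ["bug", "enhancement", "question", "bug"],
})

# Randomly oversample each minority class up to the majority class size.
max_n = train_df["label"].value_counts().max()
balanced = (
    pd.concat(g.sample(max_n, replace=True, random_state=42)
              for _, g in train_df.groupby("label"))
    .sample(frac=1, random_state=42)  # shuffle
)
print(balanced["label"].value_counts())  # all three labels equally frequent

# The testing set is never touched by this step.
```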
Metrics

• Precision, recall, and F1-score for each label
• Micro-averaged F1-score to declare the winner (sketch below)
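Both can be computed with scikit-learn; a minimal sketch with toy predictions:

```python
from sklearn.metrics import classification_report, f1_score

labels = ["bug", "enhancement", "question"]
y_true = ["bug", "bug", "enhancement", "question", "enhancement"]
y_pred = ["bug", "enhancement", "enhancement", "question", "enhancement"]

# Per-label precision, recall, and F1-score.
print(classification_report(y_true, y_pred, labels=labels))

# Micro-averaged F1-score: the single number that decides the winner.
print(f1_score(y_true, y_pred, average="micro"))
```

Note that for single-label multi-class classification, the micro-averaged F1-score coincides with overall accuracy.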
Competitors

• Team 1 (Siddiq & Santos)
• Team 2 (Bharadwaj & Kadam)
• Team 3 (Trautsch & Herbold)
• Team 4 (Colavito et al.)
• Team 5 (Izadi)
Submitted tools at a glance

• Models: BERT*, XLNet, MLP, logistic regression
• Features: title, description, repository, timestamp, author
• Preprocessing: text normalization, duplicate removal, …
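Several of these submissions fine-tune a pre-trained transformer on the issue text. A minimal Hugging Face sketch of that general recipe; the model choice, toy data, and hyperparameters are illustrative, not any team's exact setup:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["bug", "enhancement", "question"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

# Toy examples; real submissions train on the 723k-issue training set,
# typically concatenating issue title and description.
ds = Dataset.from_dict({
    "text": ["App crashes when saving a file", "Please add a dark mode"],
    "label": [0, 1],  # indices into `labels`
}).map(lambda ex: tokenizer(ex["text"], truncation=True,
                            padding="max_length", max_length=128))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=ds,
)
trainer.train()
```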
Classification results

[Chart: micro avg. F1-score — Team A: 0.872, Team B: 0.865, Team C: 0.859, Team D: 0.858, Team E: 0.857, Baseline: 0.818.]
Results for the best model

[Chart: per-label precision, recall, and F1-score. Bug and Enhancement score between 0.84 and 0.897 on all three metrics; Question is the hardest label, with 0.72 precision, 0.664 recall, and 0.691 F1-score.]
Competition ranking

Places 3–5: Team 4 (Colavito et al.), Team 3 (Trautsch & Herbold), Team 1 (Siddiq & Santos)
Competition ranking

1st place: Team 5 (Izadi)
2nd place: Team 2 (Bharadwaj & Kadam)
Tool presentations

• Motivation for choosing the models
• Challenges during model training and evaluation
• Features that contributed most to performance
• Preprocessing pipelines and their effect on performance
• Examples of successful and failed predictions
• Ideas for a customized model to improve performance
Discussion panel

• Additional Q&A
• Feedback on the competition
• Ideas for the next edition
