ISEC-2021-Presentation-Saikat-Mondal

Early Detection and Guidelines to
Improve Unanswered Questions on
Stack Overflow
Saikat Mondal
Software Research Lab
University of Saskatchewan
Canada
Chanchal Roy
Canada
ISEC 2021
Bhubaneswar, India
C M Khaled Saifullah
Canada
Avijit Bhattacharjee
Canada
Masud Rahman
RAISE Lab
Dalhousie University
Canada
I S E C 2 0 2 1 M a i n T r a c k | | F e b r u a r y 2 0 2 1

2
Most popular among 174 sites of Stack Exchange !
53rd most-visited site in the world, and 49th in the USA !
Get indexed top by Google !

3
Unanswered
Questions
(%)
Research Problem
0
5
10
15
20
25
30
35
2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
29%

4
Stack Overflow
Questions
Data
Filtration
Questions
of Interest
Question
Separation
Answered
Question
Unanswered
Question
Feature
Extraction
Comparative
Analysis
(Quantitative)
Comparative
Analysis
(Qualitative)
Selection of
Dimensions
Question
Prediction
Result
Analysis
Methodology

Comparison between Unanswered and Answered
Questions using Quantitative Features
6
RQ1
Text Readability
Topic Response Ratio
Topic Entropy
Metric Entropy
Average Terms Entropy
Text-code Ratio
User Reputation
Received Response Ratio

Text Readability
7
RQ1
Readability
Grade
Level
Fig: Text readability (TAQ=Total Answered Questions,
RSAQ=Randomly Sampled Answered Questions,
UAQ=Unanswered Questions)
Extract the title and textual description from
the body of each question
Discard the code elements, code segments,
and stack traces
Automated Reading Index (ARI), Coleman
Liau Index, SMOG Grade, Gunning Fox Index,
Readability Index (RIX) Readability of the answered questions
is significantly higher than that of the
unanswered questions regardless of
the programming languages

Topic Response Ratio
8
RQ1
Topic
Response
Ratio
Fig: Topic response ratio (TAQ=Total Answered
Questions, RSAQ=Randomly Sampled Answered
Questions, UAQ=Unanswered Questions)
First, calculate the answering ratio of each
topic
Then, measure the topic response ratio of
each question
Topic response ratio of the answered
questions is significantly higher than
that of the unanswered questions.

Topic Entropy
9
RQ1
We consider the probability of a question
discussing a certain topic as a random
variable and determine topic entropy of each
question
Topic entropy of answered questions is significantly
lower than that of the unanswered questions

Metric Entropy and Average Terms Entropy
10
RQ1
Metric Entropy is the Shannon entropy divided by the character length of the
question’s text.
We find a relatively lower metric and average terms entropy for
the unanswered questions than that of the answered questions.
To compute Average Term Entropy, we first calculate the entropy for each
term in the Stack Overflow data dump. We then determine the equivalent
entropy of each term. Finally, we calculate the Average Term Entropy for each
question by summing each of the terms entropy for the question divide by the
character length of the question’s text.

Text-code Ratio
11
RQ1
Large code segments with little explanation might
make a question hard to comprehend
Text-code ratio of the unanswered questions is
lower than that of the answered questions
We calculate the text-code ratio by dividing the
character length of the text with the length of the code
segment.

User Reputation
12
RQ1
User
Reputation
in
Posting
Time
Fig: User reputation during question submission time
(TAQ=Total Answered Questions, RSAQ=Randomly
Sampled Answered Questions, UAQ=Unanswered
Questions)
Stack Overflow stores all the activities (e.g.,
votes, acceptances, bounties) of users with
their dates. We thus use the snapshot of the
activities of users to calculate the reputation
during their question submission
For reputation calculation, we used a
standard equation provided by the Stack
Overflow.
Reputation of the authors of unanswered questions is
lower than that of the authors of answered questions

Received Response Ratio
13
RQ1
The past question-answering history of a user might
provide useful information as to whether a question
submitted by the user will be answered or not.
A user with a higher percentage of receiving answers in the
past has a higher chance of receiving answers in the future
We thus measure the percentage of answered
questions asked by users during their new question
submission.

Agreement between Quantitative and Qualitative Findings
14
RQ2
Study Setup
Manual Analysis
Results

15
RQ2
1. Problem description and code explanation
2. Topic assignment
3. Question formatting and
4. Appropriateness of the code segments
included with questions.
Study Setup
Manual Analysis
Results

16
RQ2
Fig: Comparison (qualitative) between unanswered and
answered questions.
Study Setup
Manual Analysis
Results

Prediction Models and Result Analysis
17
RQ3
Algorithms
Metrics of Success
Model Configuration
Dataset Selection
Prediction of Unanswered Questions
Feature Strength Analysis
Comparison with Baseline Models
Generalizability of the Models’ Performance
 Decision Trees (CART)
 Random Forest (RF)
 K-Nearest Neighbors (KNN)
 Artificial Neural Network (ANN)

18
RQ3
Algorithms
Metrics of Success
Model Configuration
Dataset Selection
 Precision
 Recall
 F1-score, and
 Classification accuracy

19
RQ3
Algorithms
Metrics of Success
Model Configuration
Dataset Selection
 We use a grid search algorithm, namely
GridSearchCV, from the scikit-learn library to
select the best model configuration.

20
RQ3
Algorithms
Metrics of Success
Model Configuration
Dataset Selection
 We use the attributes of the questions submitted
on Stack Overflow in the year 2016 or earlier for
training and questions submitted in the year 2017
for the testing purpose.

21
RQ3
Algorithms
Metrics of Success
Model Configuration
Dataset Selection

22
RQ3
Algorithms
Metrics of Success
Model Configuration
Dataset Selection

23
RQ3
Algorithms
Metrics of Success
Model Configuration
Dataset Selection
R. K. Saha, A. K Saha, and D. E. Perry. Toward understanding the
causes of unanswered questions in software information sites:
a case study of stack overflow. FSE 2013
F. Calefato, F. Lanubile, and N. Novielli. How to ask for technical
help? Evidence-based guidelines for writing questions on Stack
Overflow. IST 94 (2018).

25
RQ3
Algorithms
Metrics of Success
Model Configuration
Dataset Selection

Key Findings & Guidelines
26
1 Question format matters
2 Attach code segment when required
3 Improper and non-technical description hurts
4 Stack Overflow is not suitable for all topics
5 Multiple issues in a single question hurts
6 Unanswered questions are not always really unanswered
7 Miscellaneous findings

27

28

29

30

31

32

Acknowledgement
33
This research is supported by the Natural Sciences and Engineering Research Council of
Canada (NSERC), and by a Canada First Research Excellence Fund (CFREF) grant
coordinated by the Global Institute for Food Security (GIFS).

34
THANKS!
Any questions?
Early Detection and Guidelines to Improve
Unanswered Questions on Stack Overflow
Analyze 4.8 million questions
Investigate generalizability of models
Build models to predict unanswered questions
Guide question submitters
@saikatcsebd
saikatcsebd@gmail.com

ISEC-2021-Presentation-Saikat-Mondal

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Similar to ISEC-2021-Presentation-Saikat-Mondal

Similar to ISEC-2021-Presentation-Saikat-Mondal (20)

Recently uploaded

Recently uploaded (20)

ISEC-2021-Presentation-Saikat-Mondal

Editor's Notes