1. AN INSIGHT INTO THE UNRESOLVED
QUESTIONS AT STACK OVERFLOW
Mohammad Masudur Rahman, Chanchal K. Roy
Department of Computer Science
University of Saskatchewan
Presented By: Ripon K. Saha
12th Working Conference on Mining Software
Repositories (MSR 2015) (Challenge Track)
Florence, Italy
2. RESEARCH PROBLEM: HIGHER RATE OF
UNRESOLVED QUESTIONS
Unresolved question:
none of the answers was
accepted as a solution.
Exponential increase over
the last 6 years.
2.4m (27%) unresolved
out of 8.8m questions at SO
(Feb, 2015)
RQ1: Why do questions at Stack Overflow remain unresolved for
a long time?
RQ2: Can we predict the questions for which none of the answers
might be accepted as solutions?
3. ASPECTS OF STUDY
Comparative analysis (RQ1)
between questions using four
aspects:
Lexical Analysis
Code Readability (CR)
Text Readability (TR)
Semantic Analysis
Topic Similarity (TS)
Topic Entropy (TE)
User Behaviour Analysis
Answer Rejection Ratio (ARR)
Last Access Delay (LAD)
Popularity Analysis
Votes for Questions (V)
Reputation of Question Owners (R)
Dataset Used
3,956 Unresolved
questions & 4,101
Resolved questions
Each question has at
least 10 answers.
4. CODE & TEXT READABILITY
Existing readability tools used: Buse and Weimer (TSE,
2010) and Readability Grade levels (Ponzanelli et al., ICSME,
2014)
Distribution Fitting Curves of readability
No significant difference in readability between the two
types of questions.
5. TOPIC SIMILARITY & TOPIC ENTROPY
Mallet (McCallum, 2002) for topic modeling
Topic Similarity (Fig-a) between questions and
corresponding answers identical for both question types.
Topic Entropy (i.e., topic uncertainty) (Fig-b) is higher for
unresolved questions: unresolved questions are
less specific about the topics of their requirements.
6. USER BEHAVIOUR ANALYSIS
Distribution Fitting Curves of rejection ratio.
Owners of unresolved questions have greater
answer rejection ratio.
Owners of unresolved questions visit Stack Overflow
less frequently.
7. POPULARITY ANALYSIS
Used Question Votes and User Reputation
Unresolved questions are less popular than resolved
questions.
Owners of unresolved questions are less reputed.
8. PREDICTION MODELS (RQ2)
Algorithm | Metrics | Overall Accuracy | Precision (Unresolved Questions) | Recall (Unresolved Questions)
J48 | {TE, ARR, LAD, V, R} | 78.11% | 78.70% | 76.10%
J48 | {ARR, LAD, V} | 77.90% | 79.60% | 73.90%
Logistic Regression | {TE, ARR, LAD, V, R} | 73.58% | 72.60% | 74.20%
Logistic Regression | {ARR, LAD, V} | 73.28% | 71.70% | 75.20%
Naïve Bayes | {TE, ARR, LAD, V, R} | 71.69% | 69.50% | 75.50%
Naïve Bayes | {ARR, LAD, V} | 74.48% | 80.00% | 64.00%
Three prediction models used from WEKA with 10-fold
cross-validation.
78.11% prediction accuracy with 78.70% precision
and 76.10% recall.
The identified features are satisfactorily predictive.
9. TAKE-HOME MESSAGE
27% of SO questions are unresolved, and they are
increasing almost exponentially.
Unresolved questions are ambiguous, less
focused and less popular.
Owners of unresolved questions are less reputed
and less frequent at SO.
Identified features can satisfactorily separate
unresolved from resolved questions.
Findings can assist in question quality
management at SO.
Introduce yourself + introductory statements.
Today, I am going to talk about the findings on unresolved questions from Stack Overflow.
First, let's clarify what we mean by unresolved questions.
We call a question unresolved if it was posted at least 6 months ago but none of its answers has been accepted as a solution.
Right now, 27% of SO questions are unresolved, and their number has increased almost exponentially over the last 6 years.
So, in this paper we answer two research questions:
Why do questions at Stack Overflow remain unresolved for a long time?
Can we develop a model that would predict unresolved questions?
To answer RQ1, we conduct a comparative study between unresolved and resolved questions (those with an answer accepted as a solution) from SO.
We collect about 4K questions of each type, and compare them using four different analyses:
Lexical analysis, which checks the readability of the code and text in the questions.
Semantic analysis, which focuses on question-answer topic similarity and topic entropy.
User behaviour analysis, which focuses on certain activities of the question owners.
Popularity analysis, which compares question votes and user reputation for both types of questions.
This slide shows the readability comparison between unresolved and resolved questions.
Green refers to readability distribution fit for resolved questions, and red means the same for unresolved questions.
We find no significant difference in readability between the two types of questions.
However, we got an interesting finding in the case of question topics.
Using topic modeling and information theory, we calculate topic entropy (analogous to information entropy) for both resolved and unresolved questions.
We found that topic entropy is higher for unresolved questions, which suggests that
unresolved questions are less specific about their requirements, that is, less focused, which probably prevents them from receiving satisfactory answers.
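The entropy idea above can be illustrated with a minimal sketch. It assumes topic entropy is the Shannon entropy of a question's topic probability distribution, such as a topic model would produce; the base-2 logarithm, the function name `topic_entropy` and the example distributions are assumptions for illustration, not details taken from the study.

```python
import math


def topic_entropy(topic_probs):
    """Shannon entropy (in bits) of a topic distribution.

    Assumption: the input holds non-negative topic weights for one
    question; they are normalised here so they sum to 1.
    """
    total = sum(topic_probs)
    if total <= 0:
        raise ValueError("topic_probs must contain positive mass")
    entropy = 0.0
    for weight in topic_probs:
        p = weight / total
        if p > 0:  # treat 0 * log(0) as 0
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical distributions over four topics.
focused = [0.97, 0.01, 0.01, 0.01]   # question dominated by one topic
diffuse = [0.25, 0.25, 0.25, 0.25]   # question spread evenly over topics

print(topic_entropy(focused))
print(topic_entropy(diffuse))
```

A question dominated by a single topic gets a low value, while one spread evenly across topics gets the maximum for that number of topics, which matches the reading of higher entropy as a less focused question.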
In the user behaviour analysis, we found that owners of unresolved questions are relatively reluctant to accept answers as solutions, which suggests that they are either careless or skeptical.
Our analysis also shows that they visit SO less frequently.
In the popularity analysis, we found that unresolved questions are less popular than resolved questions, and
owners of unresolved questions are generally less reputed than the owners of resolved questions.
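The two user behaviour metrics can be sketched as follows. The talk names Answer Rejection Ratio (ARR) and Last Access Delay (LAD) but does not define them, so both definitions here are assumptions: ARR is taken as the share of an owner's answered questions that have no accepted answer, and LAD as the number of days between the owner's last access and a reference date. The record fields and dates are hypothetical.

```python
from datetime import date


def answer_rejection_ratio(questions):
    """Assumed ARR: fraction of an owner's answered questions
    for which no answer was accepted."""
    answered = [q for q in questions if q["answer_count"] > 0]
    if not answered:
        return 0.0
    rejected = sum(1 for q in answered if not q["accepted"])
    return rejected / len(answered)


def last_access_delay(last_access, reference):
    """Assumed LAD: days between the last access and a reference date."""
    return (reference - last_access).days


# Hypothetical question history for one owner.
history = [
    {"answer_count": 12, "accepted": False},
    {"answer_count": 10, "accepted": False},
    {"answer_count": 11, "accepted": True},
    {"answer_count": 15, "accepted": False},
    {"answer_count": 0, "accepted": False},  # unanswered, ignored
]

print(answer_rejection_ratio(history))
print(last_access_delay(date(2014, 11, 1), date(2015, 2, 1)))
```

Under these assumed definitions, a higher ARR indicates an owner who rarely accepts answers, and a larger LAD indicates an owner who has not visited the site recently.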
Now, to answer RQ2, we take the features identified in RQ1 and collect them for both question types (about 8K questions in total).
We then develop 3 prediction models using J48, Logistic regression and Naïve Bayes from WEKA, and apply 10-fold cross-validation.
We found an overall classification accuracy of 78.11%, which is impressive.
In the case of unresolved questions, we found 78.70% precision and 76.10% recall, which suggests that the identified features are quite predictive.
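The evaluation described here can be sketched in plain Python. This is not the WEKA setup used in the study: it only shows how 10 folds can be formed and how accuracy, precision and recall for the unresolved class follow from predicted labels. The helper names and the toy labels are assumptions, and the fold split is sequential with no shuffling or stratification.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k nearly equal folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def evaluate(y_true, y_pred, positive="unresolved"):
    """Return (accuracy, precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels (hypothetical), only to show how the metrics are computed.
y_true = ["unresolved"] * 4 + ["resolved"] * 4
y_pred = ["unresolved", "unresolved", "unresolved", "resolved",
          "unresolved", "resolved", "resolved", "resolved"]
print(evaluate(y_true, y_pred))

# Fold sizes for a dataset of 3,956 + 4,101 = 8,057 questions.
print([len(test) for _, test in k_fold_indices(8057, k=10)])
```

Precision answers how many of the questions flagged as unresolved really are unresolved, and recall answers how many of the truly unresolved questions were flagged, which is why the two are reported separately from overall accuracy.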
So, here are the take-home messages:
27% of SO questions are unresolved and they are increasing almost exponentially.
Unresolved questions are ambiguous, less focused and less popular.
Owners of unresolved questions are less reputed and less frequent at SO.
The identified features in this study are quite predictive for unresolved questions. So, they can be used for question quality management.