MSR2015-Challenge

Masud Rahman

An Insight into the Unresolved Questions at Stack Overflow

RESEARCH PROBLEM: HIGHER RATE OF UNRESOLVED QUESTIONS
• Unresolved question: none of the answers was accepted as a solution.
• Exponential increase over the last 6 years.
• 2.4m (27%) unresolved out of 8.8m questions at SO (Feb, 2015)
RQ1: Why do questions at Stack Overflow remain unresolved for a long time?
RQ2: Can we predict the questions for which none of the answers might be accepted as solutions?

ASPECTS OF STUDY
• Comparative analysis (RQ1) between questions using four aspects:
• Lexical Analysis
    • Code Readability (CR)
    • Text Readability (TR)
• Semantic Analysis
    • Topic Similarity (TS)
    • Topic Entropy (TE)
• User Behaviour Analysis
    • Answer Rejection Ratio (ARR)
    • Last Access Delay (LAD)
• Popularity Analysis
    • Votes for Questions (V)
    • Reputation of Question Owners (R)
Dataset Used
• 3,956 unresolved questions & 4,101 resolved questions
• Each question has at least 10 answers.

CODE & TEXT READABILITY
• Existing readability tools used: Buse and Weimer (TSE, 2010) and Readability Grade Levels (Ponzanelli et al., ICSME, 2014)
• Distribution fitting curves of readability
• No significant difference in readability between the two types of questions.

TOPIC SIMILARITY & TOPIC ENTROPY
• Mallet (McCallum, 2002) for topic modeling
• Topic Similarity (Fig-a) between questions and corresponding answers is identical for both question types.
• Topic Entropy (i.e., topic uncertainty) (Fig-b) is higher for unresolved questions: unresolved questions are less specific about topics of requirement.
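The slide describes topic entropy as topic uncertainty. A minimal sketch of that idea, assuming it is the Shannon entropy of a question's topic proportions (as produced by a topic model such as Mallet), might look like the following. The function name and the two example distributions are hypothetical and are not taken from the study.

```python
import math


def topic_entropy(topic_probs):
    """Shannon entropy (in bits) of a topic probability distribution.

    Assumption: topic_probs holds one non-negative weight per topic,
    e.g. the per-document topic proportions from a topic model.
    """
    total = sum(topic_probs)
    if total <= 0:
        raise ValueError("topic_probs must have a positive sum")
    entropy = 0.0
    for p in topic_probs:
        q = p / total  # normalise so the proportions sum to 1
        if q > 0:      # a zero-probability topic contributes nothing
            entropy -= q * math.log2(q)
    return entropy


# Hypothetical examples: one question dominated by a single topic,
# and one spread evenly over four topics.
focused = [0.90, 0.05, 0.05]
diffuse = [0.25, 0.25, 0.25, 0.25]

print(topic_entropy(focused) < topic_entropy(diffuse))
```

Under this reading, a question whose text is spread over many topics scores higher, which matches the slide's claim that higher entropy means a less specific question.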

USER BEHAVIOUR ANALYSIS
• Distribution fitting curves of rejection ratio.
• Owners of unresolved questions have a greater answer rejection ratio.
• Owners of unresolved questions are less frequent at Stack Overflow.
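The slides do not spell out how the answer rejection ratio is computed. As an illustration only, the sketch below assumes it is the fraction of an owner's answered questions for which no answer was accepted. The function name and the dictionary keys are hypothetical.

```python
def answer_rejection_ratio(questions):
    """Fraction of a user's answered questions with no accepted answer.

    Assumed definition (not confirmed by the slides). Each question is a
    dict with hypothetical keys:
      'answer_count'       - number of answers the question received
      'accepted_answer_id' - id of the accepted answer, or None
    """
    answered = [q for q in questions if q["answer_count"] > 0]
    if not answered:
        return 0.0  # nothing to reject if no question received answers
    rejected = sum(1 for q in answered if q["accepted_answer_id"] is None)
    return rejected / len(answered)


# Hypothetical posting history of one question owner.
history = [
    {"answer_count": 3, "accepted_answer_id": 101},
    {"answer_count": 2, "accepted_answer_id": None},
    {"answer_count": 5, "accepted_answer_id": None},
    {"answer_count": 0, "accepted_answer_id": None},  # unanswered, ignored
]
print(answer_rejection_ratio(history))
```

Unanswered questions are excluded here so that an owner is not counted as rejecting answers that were never posted; a different definition would change the values.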

POPULARITY ANALYSIS
• Used Question Votes and User Reputation
• Unresolved questions are less popular than resolved questions.
• Owners of unresolved questions are less reputed.

PREDICTION MODELS (RQ2)
Algorithm            Metrics (features)      Overall Accuracy   Precision (Unresolved)   Recall (Unresolved)
J48                  {TE, ARR, LAD, V, R}    78.11%             78.70%                   76.10%
J48                  {ARR, LAD, V}           77.90%             79.60%                   73.90%
Logistic Regression  {TE, ARR, LAD, V, R}    73.58%             72.60%                   74.20%
Logistic Regression  {ARR, LAD, V}           73.28%             71.70%                   75.20%
Naïve Bayes          {TE, ARR, LAD, V, R}    71.69%             69.50%                   75.50%
Naïve Bayes          {ARR, LAD, V}           74.48%             80.00%                   64.00%
• Three prediction models used from WEKA with 10-fold cross-validation.
• Best overall result (J48 with all five features): 78.11% prediction accuracy with 78.70% precision and 76.10% recall.
• The identified features are satisfactorily predictive.
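The table reports overall accuracy plus precision and recall for the unresolved class under 10-fold cross-validation. The sketch below shows, in plain Python, how such folds and metrics can be computed. It is not WEKA's implementation: the folds here are contiguous and neither shuffled nor stratified, and the function names and toy labels are hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for plain k-fold cross-validation.

    Folds are contiguous index ranges; any remainder is spread over the
    first folds so fold sizes differ by at most one.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def evaluate(y_true, y_pred, positive="unresolved"):
    """Return (accuracy, precision, recall) with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels, for illustration only.
y_true = ["unresolved", "unresolved", "resolved", "resolved"]
y_pred = ["unresolved", "resolved", "unresolved", "resolved"]
print(evaluate(y_true, y_pred))
```

In a full run, a classifier would be trained on each `train_idx` split and its predictions on the matching `test_idx` split would be pooled before calling `evaluate`.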

TAKE-HOME MESSAGE
• 27% of SO questions are unresolved, and they are increasing almost exponentially.
• Unresolved questions are ambiguous, less focused and less popular.
• Owners of unresolved questions are less reputed and less frequent at SO.
• Identified features can satisfactorily separate unresolved from resolved questions.
• Findings can assist in question quality management at SO.

AN INSIGHT INTO THE UNRESOLVED QUESTIONS AT STACK OVERFLOW
Mohammad Masudur Rahman, Chanchal K. Roy
Department of Computer Science, University of Saskatchewan
Presented by: Ripon K. Saha
12th Working Conference on Mining Software Repositories (MSR 2015), Challenge Track
Florence, Italy

THANK YOU!!

Editor's Notes

Introduce yourself + introductory statements. Today, I am going to talk about the findings on unresolved questions from Stack Overflow.
First, let's clarify what we mean by unresolved questions. We call a question unresolved if it was posted at least 6 months ago but none of the posted answers has been accepted as a solution. Right now, 27% of SO questions are unresolved, and their number has increased almost exponentially over the last 6 years. So, in this paper we answer two research questions: Why do questions at Stack Overflow remain unresolved for a long time? Can we develop a model that would predict unresolved questions?
For answering RQ1, we conduct a comparative study between unresolved and resolved questions (answer accepted as solution) from SO. We collect about 4K questions of each type, and compare them using four different analyses. Lexical analysis checks the readability of code and text in the questions. Semantic analysis focuses on question-answer topic similarity and topic entropy. User behaviour analysis focuses on certain activities of the question owners. Popularity analysis compares question votes and user reputation for both types of questions.
This slide shows the readability comparison between unresolved and resolved questions. Green refers to the readability distribution fit for resolved questions, and red means the same for unresolved questions. We find no significant difference in readability between the two question types.
However, we got an interesting finding in the case of question topics. Using topic modeling and information theory, we calculate topic entropy (analogous to information entropy) for both resolved and unresolved questions. We found that topic entropy is higher for unresolved questions, which suggests that unresolved questions are less specific about their requirements, that is, less focused, which probably prevents them from receiving satisfactory answers.
In the case of user behaviour analysis, we found that owners of unresolved questions are relatively reluctant to accept answers as solutions, which suggests they are either careless or skeptical. Our analysis also shows that they are less frequent at SO.
In case of popularity analysis, we found that unresolved questions are less popular than resolved questions, and owners of unresolved questions are generally less reputed than the owners of resolved questions.
Now, in order to answer RQ2, we use the features identified in RQ1 and collect them for both question types (8K questions). We then develop 3 prediction models using J48, Logistic Regression and Naïve Bayes from WEKA, and apply 10-fold cross-validation. We found an overall classification accuracy of 78.11%, which is impressive. In the case of unresolved questions, we found 78.70% precision and 76.10% recall, which suggests that the identified features are quite predictive.
So, here are the take-home messages: 27% of SO questions are unresolved, and they are increasing almost exponentially. Unresolved questions are ambiguous, less focused and less popular. Owners of unresolved questions are less reputed and less frequent at SO. The features identified in this study are quite predictive for unresolved questions, so they can be used for question quality management.
Thanks for your time. Questions!!
