Assignment 6 
 
Analyzing response time in Q&A websites 
 
Question and Answer (Q&A) sites such as StackOverflow, Yahoo! Answers, Naver, Quora,
LiveQnA and WikiAnswers are becoming increasingly popular with the growth of the Web.
These are large collaborative production and social computing platforms, aimed at
crowd-sourcing knowledge by allowing users to post and answer questions. They not only
provide a platform for experts to share their knowledge and gain recognition but also help novice
users solve their problems effectively. StackOverflow is one such community-driven Q&A
website used by more than a million software developers who post and answer questions
related to computer programming. It is governed by a reputation system which rewards
users with reputation points, badges, extra privileges on the website, etc. according to the usefulness
of their posts. The usefulness of a question or an answer is largely determined by the number of
votes it receives.
 
In such a crowd-sourced system driven by a reputation mechanism, the time a question
takes to receive its first answer (its response time) plays an important role and largely determines the
popularity of the website. People who post questions want to know when they
can expect a response. In this assignment, we want to investigate whether,
besides several other factors, the tags of a question have a strong correlation with its response time.
Tagging involves askers selecting appropriate keywords (e.g., android, jquery, c#) to
broadly identify the domains to which their questions are related. There also exist mechanisms
by which other users can subscribe to tags, search via tags, mark tags as favorites, etc. As a
result, tags should play a crucial role in how questions are answered and hence in determining
their response time.
 
Input Dataset: 
 
http://gaming.stackexchange.com/
(dataset: https://archive.org/download/stackexchange/gaming.stackexchange.com.7z) is
a sister site of StackOverflow where questions related to gaming are discussed. We have
attached the data dump of the website up to 26th September, 2014. Download and unzip the
dataset and you will find the following files:
● Badges.xml 
● Comments.xml 
● PostHistory.xml 
● PostLinks.xml 
● Posts.xml 
● Tags.xml 
● Users.xml 
● Votes.xml 
 
Information about all the posts (questions and answers) and tags can be found in “Posts.xml”
and “Tags.xml” respectively. Example rows from these two files are given below.
 
Typical Question

<row Id="7" PostTypeId="1" AcceptedAnswerId="10" CreationDate="2014-05-14T00:11:06.457"
Score="1" ViewCount="185" Body="&lt;p&gt;As a researcher and instructor, I'm looking for
open-source books (or similar materials) that provide a relatively thorough overview of data
science from an applied perspective. To be clear, I'm especially interested in a thorough
overview that provides material suitable for a college-level course, not particular pieces or
papers.&lt;/p&gt;&#xA;" OwnerUserId="36" LastEditorUserId="97"
LastEditDate="2014-05-16T13:45:00.237" LastActivityDate="2014-05-16T13:45:00.237"
Title="What open-source books (or other materials) provide a relatively thorough overview of
data science?" Tags="&lt;education&gt;&lt;open-source&gt;" AnswerCount="3"
CommentCount="4" FavoriteCount="1" ClosedDate="2014-05-14T08:40:54.950" ></row>
  
Typical Answer

<row Id="10" PostTypeId="2" ParentId="7" CreationDate="2014-05-14T00:53:43.273" Score="8"
Body="&lt;p&gt;One book that's freely available is &quot;The Elements of Statistical
Learning&quot; by Hastie, Tibshirani, and Friedman (published by Springer): &lt;a
href=&quot;http://statweb.stanford.edu/~tibs/ElemStatLearn/&quot;&gt;see Tibshirani's
website&lt;/a&gt;.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Another fantastic source, although it isn't a book,
is Andrew Ng's Machine Learning course on Coursera. This has a much more applied-focus
than the above book, and Prof. Ng does a great job of explaining the thinking behind several
different machine learning algorithms/situations.&lt;/p&gt;&#xA;" OwnerUserId="22"
LastActivityDate="2014-05-14T00:53:43.273" CommentCount="1" />
  
Typical Tag 
 
<row Id="3" TagName="bigdata" Count="46" ExcerptPostId="66" WikiPostId="65" /> 
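
The row format above can be read with a streaming parser, so that Posts.xml never has to be held in memory as a whole. The following is only a minimal sketch of one possible approach, not a required design: the class name PostsParser and its method names are hypothetical, it assumes that PostTypeId="1" marks a question and PostTypeId="2" an answer (as in the examples above), and it measures response time in whole seconds from a question's CreationDate to the earliest CreationDate among its answers. Questions without any answer are simply left out.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class; names are illustrative only.
final class PostsParser {
    private static final Pattern TAG = Pattern.compile("<([^<>]+)>");

    /** Splits an already unescaped Tags attribute such as "<education><open-source>". */
    static List<String> parseTags(String raw) {
        List<String> tags = new ArrayList<>();
        if (raw == null) return tags;            // answers carry no Tags attribute
        Matcher m = TAG.matcher(raw);
        while (m.find()) tags.add(m.group(1));
        return tags;
    }

    /** Maps question Id to seconds until its first answer; unanswered questions are omitted. */
    static Map<String, Long> responseTimes(InputStream xml) {
        Map<String, LocalDateTime> asked = new HashMap<>();
        Map<String, LocalDateTime> firstAnswer = new HashMap<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT || !"row".equals(r.getLocalName())) continue;
                String type = r.getAttributeValue(null, "PostTypeId");
                if ("1".equals(type)) {          // assumption: 1 = question
                    asked.put(r.getAttributeValue(null, "Id"),
                              LocalDateTime.parse(r.getAttributeValue(null, "CreationDate")));
                } else if ("2".equals(type)) {   // assumption: 2 = answer
                    firstAnswer.merge(r.getAttributeValue(null, "ParentId"),
                                      LocalDateTime.parse(r.getAttributeValue(null, "CreationDate")),
                                      (a, b) -> a.isBefore(b) ? a : b);   // keep the earliest answer
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse posts file", e);
        }
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, LocalDateTime> e : asked.entrySet()) {
            LocalDateTime first = firstAnswer.get(e.getKey());
            if (first != null) result.put(e.getKey(), Duration.between(e.getValue(), first).getSeconds());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Usage: pass the path to Posts.xml as the only argument.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            responseTimes(in).forEach((id, secs) -> System.out.println(id + "\t" + secs));
        }
    }
}
```

For the example question (Id 7) and answer (Id 10) shown above, this sketch would report 2556 seconds, i.e. about 42.6 minutes.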
  
Output Deliverables: 
 
A. Feature Calculation 
You should use Java to parse these XML files and, for each question, calculate the
response time and the following tag-based features:

1. tag_popularity: We define the popularity of a tag t as its frequency, i.e., the number of
questions that contain t as one of their tags. For each question, you should compute the
average popularity of all its tags.
2. num_pop_tags: We consider a tag to be popular if its frequency is more than 20. Here
you should count the number of popular tags each question contains. There will be at most
6 boxes in the plot, as each question can contain at most 5 tags (so the count ranges from 0 to 5).
3. num_subs_ans: We define an “active subscriber” of a tag t to be a user who has posted
“sufficient” answers in the “recent past” to questions containing t. We say that a user has
posted “sufficient” answers when the number of their answers is greater than 5, and by
“recent past” we mean answers posted after 7th Jan 2014. After computing the number
of active subscribers for every tag, you should compute, for each question, the average number of active
subscribers over its tags.
4. percent_subs_ans: For each tag, you should also compute the ratio of the number of
“active subscribers” to the total number of subscribers, where the total number of
subscribers is the number of users who have posted at least one answer to a
question containing that tag. After computing the ratio for every tag, you should
compute, for each question, the average ratio over its tags.
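
The four features above can be computed in two passes: one over the questions to count tag frequencies, and one over the answers to count answerers per tag. The sketch below shows one way this might look; it is not a reference solution. The class TagFeatures and its Answer record are hypothetical, and two readings of the definitions are assumed: only answers dated after 7th Jan 2014 count towards the "more than 5" threshold, and "after 7th Jan 2014" means a calendar date later than that day.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical helper class; names and interpretations are assumptions.
final class TagFeatures {
    static final int POPULAR_THRESHOLD = 20;                          // "more than 20"
    static final int SUFFICIENT_ANSWERS = 5;                          // "greater than 5"
    static final LocalDate RECENT_CUTOFF = LocalDate.of(2014, 1, 7);  // "after 7th Jan 2014"

    /** Minimal view of an answer: who wrote it, for which question, and when. */
    static final class Answer {
        final String userId;
        final String questionId;
        final LocalDateTime created;
        Answer(String userId, String questionId, LocalDateTime created) {
            this.userId = userId;
            this.questionId = questionId;
            this.created = created;
        }
    }

    private final Map<String, List<String>> questionTags;             // question Id -> its tags
    private final Map<String, Integer> frequency = new HashMap<>();
    private final Map<String, Integer> activeSubscribers = new HashMap<>();
    private final Map<String, Integer> totalSubscribers = new HashMap<>();

    TagFeatures(Map<String, List<String>> questionTags, List<Answer> answers) {
        this.questionTags = questionTags;
        for (List<String> tags : questionTags.values())
            for (String t : tags) frequency.merge(t, 1, Integer::sum);

        // tag -> user -> number of recent answers (0 if the user only has older answers)
        Map<String, Map<String, Integer>> recent = new HashMap<>();
        for (Answer a : answers) {
            int inc = a.created.toLocalDate().isAfter(RECENT_CUTOFF) ? 1 : 0;
            for (String t : questionTags.getOrDefault(a.questionId, Collections.emptyList()))
                recent.computeIfAbsent(t, k -> new HashMap<>()).merge(a.userId, inc, Integer::sum);
        }
        for (Map.Entry<String, Map<String, Integer>> e : recent.entrySet()) {
            totalSubscribers.put(e.getKey(), e.getValue().size());
            long active = e.getValue().values().stream().filter(n -> n > SUFFICIENT_ANSWERS).count();
            activeSubscribers.put(e.getKey(), (int) active);
        }
    }

    /** Averages a per-tag value over the tags of one question (0 if it has no tags). */
    private double average(String questionId, ToDoubleFunction<String> perTag) {
        return questionTags.get(questionId).stream().mapToDouble(perTag).average().orElse(0.0);
    }

    double tagPopularity(String q) { return average(q, t -> frequency.get(t)); }

    long numPopTags(String q) {
        return questionTags.get(q).stream().filter(t -> frequency.get(t) > POPULAR_THRESHOLD).count();
    }

    double numSubsAns(String q) { return average(q, t -> activeSubscribers.getOrDefault(t, 0)); }

    double percentSubsAns(String q) {
        return average(q, t -> {
            int total = totalSubscribers.getOrDefault(t, 0);
            return total == 0 ? 0.0 : (double) activeSubscribers.get(t) / total;
        });
    }
}
```

Tags that never received an answer are given a ratio of 0 here to avoid dividing by zero; that choice is also an assumption and should be stated in your report if you adopt it.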
 
B. Feature Analysis 
To analyze the question features and their correlation with response time, you should
construct plots of the response time against the values of the different features. You should
distribute the feature values into ten equal bins and then use gnuplot to produce the following
two plots:
1. Box plots that capture the median and the 25th and 75th percentiles of the response time distributions, as
well as the minimum and maximum values, and
2. Cumulative distribution function (CDF) plots of the response time.
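
Before calling gnuplot, the binned response times have to be reduced to the numbers each plot needs. The following sketch illustrates one way to do that; the class ResponseTimeStats is hypothetical. It assumes that "equal bins" means bins of equal width between the smallest and largest feature value, and it computes percentiles by linear interpolation between neighbouring ranks, which is only one of several common conventions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; binning and percentile conventions are assumptions.
final class ResponseTimeStats {
    /** Index in [0, bins) of the equal-width bin holding value; the maximum falls in the last bin. */
    static int binIndex(double value, double min, double max, int bins) {
        if (max == min) return 0;
        int i = (int) ((value - min) / (max - min) * bins);
        return Math.min(i, bins - 1);
    }

    /** Percentile p in [0, 1] of an ascending, non-empty array, by linear interpolation. */
    static double percentile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    /** Returns {min, 25th percentile, median, 75th percentile, max} for one bin. */
    static double[] boxStats(double[] values) {
        double[] s = values.clone();
        Arrays.sort(s);
        return new double[] {
            s[0], percentile(s, 0.25), percentile(s, 0.5), percentile(s, 0.75), s[s.length - 1]
        };
    }

    /** Empirical CDF as "value fraction" lines, one per observation, for a gnuplot data file. */
    static List<String> cdfLines(double[] values) {
        double[] s = values.clone();
        Arrays.sort(s);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < s.length; i++) {
            lines.add(s[i] + " " + (i + 1) / (double) s.length);
        }
        return lines;
    }
}
```

Writing one line of box statistics per bin and one file of CDF lines per bin gives plain whitespace-separated data files that gnuplot can read directly.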
