StackOverflow Ask to Answer
A discovery based approach to recommend users
Neel Tiwari Amit Tiwari Manpreet Singh
neelt@sfu.ca amitt@sfu.ca msa175@sfu.ca
Abstract
One of the most popular categories of start-ups today is the Q&A site, where users answer questions and hold discussions within communities. One drawback of these sites is the lack of a mechanism for discovering the users who could best answer a question correctly and quickly, which in turn keeps people coming back to the site. We propose an application for discovering and recommending such users, taking as our use case the popular website stackoverflow.com, which has a huge collection of users but no system in place to direct askers to the users most likely to answer their questions correctly.
Introduction
We propose to recommend a list of users likely to be able to answer a question asked on the site. Because a question may be vague and may contain no direct indication of its tag, we first classify it into one or more candidate tags. This is done by testing the question iteratively against models built on the training set, one model per tag, trained with classification techniques such as logistic regression and SVM. Testing the question against these models returns the list of tags to which it belongs.
On the other hand, we have a list of active users, each associated with a number of features such as user id, name, and location. When a user answers a question under one or more tags, we use a few particularly informative features (accepted answers, upvotes, downvotes, and favourite count) to compute a score function:

F(userid, tag*) → score

where the score is a weighted combination of the features above, giving each user a projected score for every tag. Once a question is classified into one or more tags, we look up the top users for those tags by weighted score and suggest them as the users most likely to answer. Because the score function is recomputed as users answer, the list stays dynamically updated, so the top users reflect the quality of their answers.
We also attempted to find relationships among users through a clustering-based approach, generating user communities based on shared interests via power iteration clustering (PIC). Since many tags, such as Machine Learning, Clustering, and Big Data, are interlinked, we additionally looked for associations between the tags themselves: we first mined frequent tag patterns with the FP-Growth algorithm and then clustered the resulting co-occurrence graph with PIC.
Approach
The dataset [1] we downloaded is an XML file in which each question or answer is a row with attributes such as body, tags, and title. We combined body and title to form the input feature and used the tags as classes, as shown in the figure.
Figure 1
First, we transformed the input text into a TF-IDF vector using the HashingTF utility in Spark's MLlib library [2]. Then, for each tag, we trained a model using logistic regression and a support vector machine (Figure 2).
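To make this step concrete, the hashing idea behind HashingTF and the one-vs-rest labelling (one binary problem per tag) can be sketched in plain Python. This is an illustration, not Spark's API: the field names, the sample questions, and the 2^10 feature dimension are made up for the example.

```python
# Sketch of hashed term frequencies plus per-tag one-vs-rest labels.
# Spark MLlib's HashingTF works on the same principle with a larger dimension.

NUM_FEATURES = 1 << 10  # hypothetical hashing dimension

def hashing_tf(text):
    """Map a question's text to a sparse term-frequency vector by hashing tokens."""
    vec = {}
    for token in text.lower().split():
        idx = hash(token) % NUM_FEATURES
        vec[idx] = vec.get(idx, 0) + 1
    return vec

def one_vs_rest_labels(questions, tag):
    """For a given tag, label each question 1 if it carries the tag, else 0."""
    return [(hashing_tf(q["body"] + " " + q["title"]),
             1 if tag in q["tags"] else 0)
            for q in questions]

questions = [
    {"title": "tuning svm", "body": "how to pick C for svm", "tags": {"svm"}},
    {"title": "tokenizing text", "body": "nlp preprocessing steps", "tags": {"nlp"}},
]
labeled = one_vs_rest_labels(questions, "nlp")
```

Training 155 models then amounts to calling the labelling step once per tag and fitting a binary classifier on each labelled set.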
We had around 2050 questions and answers, drawn from datascience.stackexchange.com. There are 155 distinct tags for this site, so we trained precisely 155 different models. We also created a table with attributes user, tag, and score, which stores each user's score for every tag. The score is a function of a few specific attributes, weighted by the frequency of each tag in the corpus.
Specifically:

Score(s) = Sum(Upvotes + AcceptedAnswers − Downvotes)

WeightedScore(ws) = f(userid, tag*, score) = {userid, score / count(tag*)}

where tag* refers to each tag and count(tag*) is the total number of occurrences of that tag in the corpus.
One way to interpret WeightedScore is that a user with a good score in a less popular tag such as libsvm will have a higher weight than users with high scores in a more common tag such as machine-learning: a rare tag, having a low count, carries a high weight. This ensures that users who answer many questions in a specific tag retain a higher score there. From this table we recommend the 5 users with the highest weighted scores for the tags predicted by the classifier.
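The scoring table can be sketched as follows; the users, tags, and counts are invented purely to illustrate the rare-tag weighting described above.

```python
# Minimal sketch of the (user, tag, score) table: the per-answer score is
# upvotes + accepted - downvotes, summed per (user, tag), then divided by
# the tag's frequency in the corpus so rare tags carry more weight.
from collections import Counter, defaultdict

def build_weighted_scores(answers, tag_counts):
    """answers: (userid, tag, upvotes, downvotes, accepted) tuples."""
    raw = defaultdict(float)
    for userid, tag, up, down, accepted in answers:
        raw[(userid, tag)] += up + (1 if accepted else 0) - down
    return {(u, t): s / tag_counts[t] for (u, t), s in raw.items()}

def top_users(scores, tag, k=5):
    ranked = sorted(((u, s) for (u, t), s in scores.items() if t == tag),
                    key=lambda x: -x[1])
    return [u for u, _ in ranked[:k]]

tag_counts = Counter({"machine-learning": 100, "libsvm": 4})
answers = [
    ("alice", "machine-learning", 10, 1, True),  # raw score 10
    ("bob",   "libsvm",           3,  0, True),  # raw score 4
]
scores = build_weighted_scores(answers, tag_counts)
# bob's weighted score (4/4 = 1.0) exceeds alice's (10/100 = 0.1),
# even though alice's raw score is higher
```

This mirrors the intuition above: the user strong in the rare tag libsvm outranks the user strong in the common tag machine-learning.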
Using the tags in the corpus, we performed frequent pattern mining to find which tags occur together most often, using the FP-Growth algorithm from Spark's MLlib library. From the results we generated the graph shown in Figure 3, where source and target nodes are tags and the edge weight is the count of their co-occurrence, and we then clustered this graph data.
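The edge-weighted graph construction can be sketched in plain Python. Note this only counts the pairwise case; Spark's FPGrowth mines general frequent itemsets. The tag sets and the support threshold below are hypothetical.

```python
# Sketch of the tag co-occurrence edges: every pair of tags appearing on the
# same question adds one to that edge's weight; pairs below a minimum
# support are dropped.
from collections import Counter
from itertools import combinations

def cooccurrence_edges(tag_sets, min_support=2):
    edges = Counter()
    for tags in tag_sets:
        for a, b in combinations(sorted(tags), 2):
            edges[(a, b)] += 1
    return {e: c for e, c in edges.items() if c >= min_support}

questions_tags = [
    {"machine-learning", "classification"},
    {"machine-learning", "classification", "svm"},
    {"bigdata", "apache-spark"},
    {"bigdata", "apache-spark"},
]
edges = cooccurrence_edges(questions_tags)
# edges keeps ("classification", "machine-learning") and
# ("apache-spark", "bigdata"), each with weight 2
```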
Figure 2
Setting the number of clusters to 3 reproduced the structure visible in the graph: our clusters were centered around big data, data mining, and machine learning.
Figure 3
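The PIC idea can be illustrated in plain Python: row-normalize an affinity matrix and run a few power iteration steps, stopping early, since nodes in the same cluster converge to similar values long before the vector flattens out. The 6-node graph below (two dense groups joined by one weak edge) is a made-up stand-in for the real tag graph; Spark's PowerIterationClustering implementation differs in detail.

```python
# Rough sketch of power iteration clustering (PIC) on a toy affinity matrix.
A = [[0, 8, 7, 0, 0, 0],
     [8, 0, 9, 0, 0, 0],
     [7, 9, 0, 1, 0, 0],
     [0, 0, 1, 0, 6, 7],
     [0, 0, 0, 6, 0, 8],
     [0, 0, 0, 7, 8, 0]]
n = len(A)
W = [[a / sum(row) for a in row] for row in A]  # row-normalized affinities

v = [1.0, 0, 0, 0, 0, 0]  # arbitrary non-uniform starting vector
for _ in range(10):       # stop early: that is the point of PIC
    v = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]

# after a few steps, v[0..2] are close together, as are v[3..5],
# while the two groups remain clearly separated
```

A final k-means pass on the one-dimensional embedding v would recover the two groups; in our case k = 3 recovers the big data, data mining, and machine learning clusters.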
Experiments
We present our observations for two categories of tags: one very common in the corpus (machine-learning) and one relatively rare (nlp). We trained models for each tag using logistic regression and SVM.
NLP
We achieved 95% accuracy using logistic regression and an area under the ROC curve of 0.82 using SVM. The accuracy figure is actually misleading, because the training data for nlp was scarce: the classifier handles the abundant true negatives correctly, but misclassifies most of the positive examples, which makes it inaccurate in practice.
Confusion Matrix for Logistic Regression (NLP)

Actual \ Predicted    Absent    Present
Absent                559       13
Present               16        9
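Working the numbers from the matrix above (assuming rows are actual labels and columns are predictions) shows why the headline accuracy is misleading: the "absent" class dominates, while recall on the "present" class is poor.

```python
# Accuracy vs. recall from the NLP confusion matrix reported above.
tn, fp = 559, 13   # actual absent:  predicted absent / present
fn, tp = 16, 9     # actual present: predicted absent / present

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 568/597, roughly 0.95
recall = tp / (tp + fn)                     # 9/25 = 0.36
```

An accuracy near 95% coexists with a recall of only 0.36, which is exactly the imbalanced-class effect described in the text.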
Machine Learning
Logistic regression achieved an accuracy of 69.8%, and SVM an area under the ROC curve of 0.71.
Confusion Matrix for Logistic Regression (ML)

Actual \ Predicted    Absent    Present
Absent                329       95
Present               92        105
User recommendation
In our experiments, classifying questions into tags yielded fairly satisfactory results, with accuracy above 71% for common tags and above 90% for less common ones. Under these assumptions, and applying the weighted score function to the (limited) training data at hand, we were able to predict users quite well. But since we could only draw the set of users from the training set, we may be making false assumptions about the reliability of our model. Another immediate problem, which will form the basis of our future work, is ensuring that the recommended users actually end up answering the question. For this we will need a predictor function that takes into account features such as interest, locality, and activity level (active vs. less active users). The predictor function, in conjunction with the weighted score, may form a better measure for suggesting users and is the direction of future work in this context.
Contributions
1. Classification (Logistic Regression and SVM) : Neel Kamal and Amit Tiwari
2. Scoring Function : Neel Kamal and Manpreet Singh
3. Frequent Pattern Mining of Tags : Manpreet Singh and Amit Tiwari
4. Clustering of Tags : Neel Kamal and Manpreet Singh
5. Clustering Graph : Neel Kamal and Amit Tiwari
Difficulties Encountered:
1. Multi-label, multi-class classification: each input question needs to be classified into multiple tags.
2. Imbalanced class problem: some tags were infrequent in the data, and classification for such tags could not be achieved with high accuracy.
3. 99% accuracy yet still inaccurate: for rare tags, the reported accuracy was not meaningful, since true positives were not correctly classified; the high figure came from correctly classifying the abundant negative examples.
4. Large number of models to test: the data contained many different tags, and the number of models trained equaled the number of tags.
Conclusions and Future work
1.) SVM performs better than logistic regression for tags such as machine-learning where the data is balanced, i.e. the tag appears frequently in the data.
2.) Both algorithms fail if the data is imbalanced, i.e. the training data for one class is much smaller than for the other.
Future work will focus on improving the user experience on Q&A sites like stackoverflow.com. We could extend our work on the score function to define a predictor function that ensures recommended users would actually be interested in answering questions. Along the same lines, user communities could be refined and extended to account not only for the types of comments users make but also for activities such as marking a question invalid or closing a question. The challenge will be quantifying such activities to identify genuinely active users.
Citations
1) Stack Exchange data dump, https://archive.org/details/stackexchange
2) Apache Spark MLlib 1.5.1, http://spark.apache.org/