StackOverflow Ask to Answer
A discovery based approach to recommend users
Neel Tiwari Amit Tiwari Manpreet Singh
neelt@sfu.ca amitt@sfu.ca msa175@sfu.ca
Abstract
Q&A sites, where users answer questions and hold discussions within communities, are currently one of the most
popular categories of start-ups. One drawback of these sites is the lack of a mechanism to discover the users who
could answer a question correctly and quickly, which would in turn encourage people to visit the site more
often. We propose an application for discovering and recommending such users, taking as our use case the
popular website stackoverflow.com, which has a huge collection of users but no system in place to direct people
with questions to the users who can answer them correctly.
Introduction
We propose to recommend a list of users who are likely to be able to answer a question asked by a visitor to the
site. Because a question may be vague or may not contain any direct indication of its tag, we first classify the
question into one or more of the possible tags. This is done by testing the question against each of the models
built on the training set. These models, one per tag, were generated using classification techniques such as
logistic regression and SVM. Testing the question against these models returns the list of tags to which the
question belongs.
Separately, we have a list of active users, where each user is associated with a number of features such as
userid, name, location and many more. When a user answers a question related to one or several tags, we use
some particularly informative features (accepted answer, upvotes, downvotes and favourite count) to build a
score function. The score function can be defined as:
F(userid, tag*) -> score
where the score is a weighted score defined on the features above, so that each user has a projected score for
every tag. Once the question has been classified into one or more tags, we look up the top users for those tags
by weighted score and suggest them as the users most likely to answer the question. Because the score is
computed from answer activity, the list of top users stays up to date and reflects the quality of their
answers.
Another attempt to find relationships among users involved a clustering-based approach that generates user
communities based on shared interests. This was attempted through power iteration clustering (PIC).
As many tags such as Machine Learning, Clustering and Big Data are interlinked, we also attempt to find
associations between the tags involved. This was achieved by first mining the tags with frequent pattern mining
using the FP-Growth algorithm and then clustering them with PIC to find possible associations.
Approach
The dataset [1] that we downloaded is an XML file in which each question or answer is represented as a row
with attributes such as body, tags and title.
We combined the body and title to form the input feature and used the tags as classes, as shown in Figure 1.
Figure 1
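The preparation step described above could be sketched roughly as follows. This is only an illustration, not
the authors' code: the sample rows are hypothetical, the attribute names (PostTypeId, Title, Body, Tags) and
the "<tag1><tag2>" tag encoding are assumptions about the dump layout, and parse_questions is a made-up helper.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-row sample; attribute names and the "<a><b>" tag encoding
# are assumptions about the layout of the dump, not taken from the report.
SAMPLE_XML = """<posts>
  <row Id="1" PostTypeId="1" Title="What is TF-IDF?" Body="&lt;p&gt;How is it computed?&lt;/p&gt;" Tags="&lt;nlp&gt;&lt;text-mining&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="&lt;p&gt;Term frequency times inverse document frequency.&lt;/p&gt;" />
</posts>"""

TAG_PATTERN = re.compile(r"<([^<>]+)>")   # pulls tag names out of "<a><b>"
HTML_MARKUP = re.compile(r"<[^>]+>")      # crude removal of HTML markup in bodies


def parse_questions(xml_text):
    """Return (text, tags) pairs for question rows, where text = title + body."""
    examples = []
    for row in ET.fromstring(xml_text).iter("row"):
        if row.get("PostTypeId") != "1":
            continue  # assumed: "1" marks questions; other rows carry no tags
        body = HTML_MARKUP.sub(" ", row.get("Body", ""))
        # Join title and body, collapsing repeated whitespace.
        text = " ".join((row.get("Title", "") + " " + body).split())
        tags = TAG_PATTERN.findall(row.get("Tags", ""))
        examples.append((text, tags))
    return examples


questions = parse_questions(SAMPLE_XML)
```

Each resulting pair holds the combined title and body as the input text and the list of tags as the classes
for that question.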
First we transformed the input text into a TF-IDF vector using the HashingTF utility provided in the Spark MLlib
library [2]. Then, for each tag, we trained a model using logistic regression and a support vector machine (Figure 2).
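To make the feature step concrete, here is a minimal pure-Python sketch of hashed term frequencies followed by
IDF weighting. It is not the Spark implementation: the crc32 hash, the vector size of 1024, the whitespace
tokenizer and the smoothed IDF form log((N + 1) / (df + 1)) are all assumptions chosen for illustration, and
the three documents are invented.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 1 << 10  # hypothetical vector size, kept small for the demo


def hashing_tf(text, num_features=NUM_FEATURES):
    """Hash each token to a bucket index and count occurrences (sparse dict)."""
    counts = Counter()
    for token in text.lower().split():
        # crc32 is a deterministic stand-in hash, not the one Spark uses.
        counts[zlib.crc32(token.encode("utf-8")) % num_features] += 1
    return dict(counts)


def idf_weights(tf_vectors):
    """Smoothed inverse document frequency per bucket: log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    doc_freq = Counter(idx for vec in tf_vectors for idx in vec)
    return {idx: math.log((n_docs + 1) / (df + 1)) for idx, df in doc_freq.items()}


def tf_idf(tf_vector, idf):
    """Scale each term frequency by the IDF weight of its bucket."""
    return {idx: tf * idf.get(idx, 0.0) for idx, tf in tf_vector.items()}


docs = [
    "how to tune svm kernel",
    "svm versus logistic regression",
    "tokenizing text for nlp",
]
tf_vectors = [hashing_tf(d) for d in docs]
idf = idf_weights(tf_vectors)
vectors = [tf_idf(v, idf) for v in tf_vectors]
svm_bucket = zlib.crc32(b"svm") % NUM_FEATURES
```

Because the token "svm" occurs in two of the three documents, its bucket receives a lower IDF weight than a
bucket seen in only one document. Hashing avoids building a vocabulary, at the cost of occasional collisions
between unrelated tokens.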
We had around 2050 questions and answers, which we picked from datascience.stackexchange.com. There are
155 different tags on this site, so we trained 155 different models, one per tag. We created a table with
attributes user, tag and score, which stores the score of each user for every tag. The score is a function of
some specific attributes and is weighted according to the frequency of the tag in the corpus.
Specifically:
Score(s) = Sum(Upvotes + Accepted Answers - Downvotes)
Weighted Score(ws) = f(userid, tag*, score) = {userid, score / count(tag*)}
where tag* refers to each tag and count(tag*) is the total count of that tag in the corpus.
One way to interpret the weighted score (ws) is that a user with a good score in a less popular tag such as
libsvm receives a higher weight than users with a high score in a more common tag such as machine learning.
This ensures that users who answer mostly in a few specific tags retain a higher score. From this table, we
recommend the 5 users with the highest weighted scores for the tags predicted by the classifier.
Note that a rare tag, i.e. one with a low count, carries a high weight.
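The scoring and lookup can be illustrated with a short sketch. The answer records, user names and tag counts
below are invented, and summing a user's weighted scores across several predicted tags is an assumption, since
the report does not say how scores for multiple tags are combined.

```python
from collections import defaultdict

# Hypothetical answer records:
# (userid, tags of the answered question, upvotes, downvotes, accepted)
answers = [
    ("alice", ["machine-learning"], 10, 1, True),
    ("alice", ["machine-learning", "libsvm"], 4, 0, False),
    ("bob", ["libsvm"], 3, 0, True),
    ("carol", ["machine-learning"], 6, 2, False),
]
# Hypothetical number of occurrences of each tag in the corpus.
tag_counts = {"machine-learning": 200, "libsvm": 4}


def weighted_scores(answers, tag_counts):
    """ws(user, tag) = sum(upvotes + accepted - downvotes) / count(tag)."""
    raw = defaultdict(int)
    for user, tags, up, down, accepted in answers:
        for tag in tags:
            raw[(user, tag)] += up + int(accepted) - down
    return {(user, tag): s / tag_counts[tag] for (user, tag), s in raw.items()}


def recommend(predicted_tags, ws, k=5):
    """Top-k users by weighted score, summed over the predicted tags (assumed)."""
    totals = defaultdict(float)
    for (user, tag), score in ws.items():
        if tag in predicted_tags:
            totals[user] += score
    # Sort by descending score; break ties by userid for a stable order.
    return sorted(totals, key=lambda u: (-totals[u], u))[:k]


ws = weighted_scores(answers, tag_counts)
top_libsvm = recommend({"libsvm"}, ws)
top_ml = recommend({"machine-learning"}, ws)
```

In this toy data bob's raw score of 4 in libsvm becomes a weighted score of 1.0, while alice's raw score of 14
in machine-learning becomes 0.07, which shows how dividing by the tag count favours rare tags.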
Using the tags in the corpus, we did frequent pattern mining to check which tags occur together most frequently.
For frequent pattern mining we used the FP-Growth algorithm from the Spark MLlib library. From the result of
frequent pattern mining we generated a graph, as shown in Figure 3. We also clustered this graph data,
where source and target are tags and the weight is the count of their co-occurrence.
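The (source, target, weight) edge list can be illustrated with a brute-force pair count. This is a simpler
stand-in, not FP-Growth: it only finds frequent pairs of tags, whereas FP-Growth mines itemsets of any size
without enumerating candidates. The tag sets and the minimum support of 2 are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tag sets, one list per question.
question_tags = [
    ["machine-learning", "classification", "svm"],
    ["machine-learning", "svm"],
    ["bigdata", "hadoop"],
    ["machine-learning", "classification"],
    ["bigdata", "hadoop", "apache-spark"],
]
MIN_SUPPORT = 2  # hypothetical threshold: keep pairs seen at least twice


def cooccurrence_edges(question_tags, min_support=MIN_SUPPORT):
    """Return sorted (source, target, weight) edges for frequent tag pairs."""
    pair_counts = Counter()
    for tags in question_tags:
        # Sort so that each unordered pair is always counted under one key.
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[(a, b)] += 1
    return sorted((a, b, n) for (a, b), n in pair_counts.items() if n >= min_support)


edges = cooccurrence_edges(question_tags)
```

An edge list of this shape is what a graph clustering method such as PIC would take as its similarity input.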
Figure 2
With the number of clusters set to 3, we got the same result as shown in the graph. Our clusters were centered
around big data, data mining and machine learning.
Figure 3
Experiments
We present our observations on two categories of tags: one that is very common in the corpus, such as
machine-learning, and one that is less common, such as nlp.
We trained models for each tag using logistic regression and SVM.
NLP
We achieved 95% accuracy using logistic regression and an ROC of 0.82 using SVM.
This is actually misleading because there was little training data for NLP. Although the classifier identifies
the true negatives correctly, and these are very numerous, it does not identify the true positives correctly,
which makes the classifier inaccurate.
Confusion Matrix for Logistic Regression (NLP)
NLP        Absent    Present
Absent     559.0     13.0
Present    16.0      9.0
Machine Learning
Logistic regression achieved an accuracy of 69.8%, and SVM an ROC of 0.71.
Confusion Matrix for Logistic Regression (ML)
Machine Learning    Absent    Present
Absent              329.0     95.0
Present             92.0      105.0
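The gap between accuracy and actual detection of the tag can be checked directly from the two confusion
matrices above. The sketch below assumes that rows are actual labels and columns are predicted labels, which
the report does not state; under the opposite orientation precision and recall swap, while accuracy is
unchanged.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Assumption: rows = actual (Absent, Present), columns = predicted.
nlp = metrics(tn=559, fp=13, fn=16, tp=9)
ml = metrics(tn=329, fp=95, fn=92, tp=105)

for name, m in (("nlp", nlp), ("machine-learning", ml)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Under that assumption the NLP model has an accuracy of about 0.951 but a recall of only 9/25 = 0.36, whereas
the machine-learning model has a lower accuracy of about 0.699 but a recall of 105/197, roughly 0.533.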
User recommendation
In our experiments, the classification of questions into tags yielded fairly satisfactory results, with accuracy
of more than 71% for common tags and greater than 90% for less common tags. Under these assumptions, and
taking into account the weighted score function, we were able to predict users quite well for the limited
training data at hand. But since we could only draw the set of users from the training set, we may be making
false assumptions about the reliability of our model. Another immediate problem that needs to be addressed,
and that will form the basis of our future work, is ensuring that the recommended users actually end up
answering the question. For this we will have to come up with a predictor function that takes into account
features such as interest and locality and distinguishes active from less active users. The predictor function,
in conjunction with the weighted score, may form a better measure for suggesting users.
Contributions
1. Classification (Logistic Regression and SVM) : Neel Kamal and Amit Tiwari
2. Scoring Function : Neel Kamal and Manpreet Singh
3. Frequent Pattern Mining of Tags : Manpreet Singh and Amit Tiwari
4. Clustering of Tags : Neel Kamal and Manpreet Singh
5. Clustering Graph : Neel Kamal and Amit Tiwari
Difficulties Encountered:
1. Multi-label, multiclass classification - Each input question needs to be classified into multiple tags.
2. Imbalanced class problem - Some tags were not frequent in the data. Under such conditions, these tags could
not be classified with high accuracy.
3. Accuracy is 99% but the classifier is still inaccurate - For rare tags, the reported accuracy was not
meaningful, in the sense that true positives were not correctly classified. The displayed accuracy was due to
correct classification of true negatives.
4. Large number of models to test - The data contained a lot of different tags, and the number of models
trained was equal to the number of tags.
Conclusions and Future work
1.) SVM performs better than logistic regression for tags such as machine learning where the data is balanced,
i.e. more examples of the tag are available in the data.
2.) Both algorithms fail if the data is imbalanced, i.e. there is less training data for one class than for the other.
Future work will focus on improving user experience on Q&A sites like stackoverflow.com. We could extend
our work on the score function to define a predictor function which ensures that recommended users would
actually be interested in answering questions. Along the same lines, user communities could be further refined
and extended to include not only the types of comments by users but also user activities such as marking a
question invalid or closing a question. Challenges would include quantifying such activities to find the users
who are actually active.
Citations
[1] Stack Exchange data dump, https://archive.org/details/stackexchange
[2] Apache Spark, MLlib 1.5.1, http://spark.apache.org/

Report

  • 1. StackOverflow Ask to Answer A discovery based approach to recommend users Neel Tiwari Amit Tiwari Manpreet Singh neelt@sfu.ca amitt@sfu.ca msa175@sfu.ca Abstract These days one of the most popular categories of start-ups can be characterized as Q&A sites, where users partake in answering questions and have discussions among communities. One drawback to these sites is a lack of mechanism to discover users who could best answer questions correctly and quickly and thus ensure people continue to visit the site more often. We propose to build an application for discovering and recommending such users, taking the use case of the popular website - stackoverflow.com which has a huge collection of users, but not a system in place to direct people with questions to users who can correctly answer questions. We propose a solution for this problem. Introduction We propose to recommend a list of probable users who can answer questions asked by users visiting the site. Taking into account the fact that questions could be vague or may not contain any direct indication of the tag, we first classify the question into one or more of the possible tags. This is done by iteratively testing against the models created on the training set. These models which are associated with each tag were generated using classification techniques like logistic regression and SVM. Testing the question asked against these models returns a list of tags to which the question belongs. On the other hand we have a list of active users, where each user is associated with a number of features ranging from userid, name, location and many more. When a user answers a question related to one or several tags, we use some particularly interesting features like accepted answer, upvotes, downvotes and favourite count to come up with a score function. 
The score function can be defined as: F(userid, tag*)  score Where the score is a weighted score defined on the features above and associates each user for every tag with a projected score. Once the question asked by the user is classified into one or more tags, we look up the top users related to those tags based on the weighted score function and suggest these users as the most probable to answer the questions. The nature of the score function keeps the list of users dynamically updated so that the top users are based on the quality of their answers. Other attempts to find relationships among users involve a clustering based approach to generate a user community based on identical interests. This was attempted through a power iterative clustering (PIC) approach. As many tags like Machine Learning, Clustering, Big Data etc. are interlinked, we also attempt to find associations between the tags involved. This was achieved by first mining the tags by frequent pattern mining using FP Growth algorithm and then clustering them using power iteration clustering (PIC) to find possible associations. Approach The dataset [1] which we downloaded is in the form of xml file where each question/answer is represented as a row with different attributes like body, tags, title etc.
  • 2. We combined body and title to form input feature and tags as classes as shown in the figure. Figure 1 First we transformed the input text into TF IDF vector using HashingTF utility provided in spark MLLib library [2]. Then for each tag we trained a model using logistic regression and support vector machine. (Figure 2) We had around 2050 questions and answers, which we picked from datascience.stackexchange.com. There are 155 different tags for this url. So, precisely we trained 155 different models. We created a table with attributes user, tag and score which stores score of each user for every tag. The score is a function of some specific attributes and weighted according to the frequency of tags in the corpus. Specifically; Score(s) = Sum(Upvotes + Accepted Answers - Downvotes) Weighted Score(ws) = f(userid, tag* , score) = {userid, score/count(tag*)} where tag* refers to each tag and count(tag*) is the total count of each tag in the corpus. One way to interpret the WeightedScore(ws) is that if a user has a better score in some less popular tag like libsvm he will have a high weight than the users who have high score in more common tag like machine learning. This ensures that users who answer more is some specific tags will retain a higher score. From this table, we recommend 5 users, whose weighted score is more for the tags predicted by the classifier. The thing to be noted here is that the tag which is rare i.e. it has less count has high weight. With the tags in the corpus, we did frequent pattern mining to check which tags come together more frequently. For frequent pattern mining we used FP growth algorithm from Spark MLLib Library. Using the result of frequent pattern mining we generated a graph as shown in figure 3. Also, we did the clustering of this graph data where source and target are tags and weight is count of their co-occurrence.
Figure 2

With the number of clusters set to 3, we obtained the same grouping as shown in the graph. Our clusters were centered around big data, data mining and machine learning.

Figure 3
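The scoring table and the top-5 lookup described earlier in this section can be sketched as follows. This is a plain-Python illustration, not the authors' Spark code. The record layout, the function names `weighted_scores` and `recommend`, and the sample numbers are assumptions. The source also does not say how scores are combined when several tags are predicted, so summing the weighted scores across predicted tags is an assumption as well.

```python
from collections import defaultdict


def weighted_scores(answers, tag_counts):
    """Build {(userid, tag): weighted score} from answer records.

    Score = sum(upvotes + accepted answers - downvotes) per user and tag,
    divided by count(tag), the number of times the tag occurs in the corpus.
    """
    raw = defaultdict(float)
    for a in answers:
        s = a["upvotes"] + (1 if a["accepted"] else 0) - a["downvotes"]
        for tag in a["tags"]:
            raw[(a["userid"], tag)] += s
    return {(user, tag): s / tag_counts[tag] for (user, tag), s in raw.items()}


def recommend(predicted_tags, scores, k=5):
    """Return up to k user ids ranked by weighted score over the predicted tags.

    Assumption: scores for several predicted tags are simply added together.
    """
    totals = defaultdict(float)
    for (user, tag), ws in scores.items():
        if tag in predicted_tags:
            totals[user] += ws
    # Highest total first; ties broken by user id for a stable order.
    return sorted(totals, key=lambda user: (-totals[user], user))[:k]


# Made-up sample data, for illustration only.
tag_counts = {"machine-learning": 100, "libsvm": 4}
answers = [
    {"userid": 1, "tags": ["machine-learning"], "upvotes": 20, "downvotes": 0, "accepted": True},
    {"userid": 2, "tags": ["libsvm"], "upvotes": 3, "downvotes": 1, "accepted": True},
    {"userid": 3, "tags": ["machine-learning", "libsvm"], "upvotes": 2, "downvotes": 0, "accepted": False},
]

scores = weighted_scores(answers, tag_counts)
top_users = recommend({"machine-learning", "libsvm"}, scores)
print(top_users)
```

In this made-up data, user 2 has a raw score of only 3 but ranks first because libsvm is rare, which matches the interpretation of the weighted score given above.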
Experiments
We present our observations on two categories of tags: one that is very common in the corpus, such as machine-learning, and one that is less common, such as nlp. We trained models for each tag using logistic regression and SVM.

NLP
We achieved 95% accuracy using logistic regression and an ROC of 0.82 using SVM. This is actually misleading because there was little training data for NLP. The classifier labels the true negatives, which are very numerous, correctly, but it does not classify the true positives correctly, which makes it inaccurate.

Confusion Matrix for Logistic Regression (NLP)

NLP       Absent   Present
Absent    559.0    13.0
Present   16.0     9.0
Machine Learning
The accuracy achieved for logistic regression was 69.8%, and the ROC was 0.71 for SVM.

Confusion Matrix for Logistic Regression (ML)

Machine Learning   Absent   Present
Absent             329.0    95.0
Present            92.0     105.0
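To make the point about misleading accuracy concrete, the two confusion matrices above can be turned into accuracy, precision and recall. The sketch below assumes that rows are actual labels and columns are predicted labels, which the report does not state; the function name `binary_metrics` is also an assumption.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall for one binary (tag present/absent) classifier."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Cell values copied from the confusion matrices above,
# read as rows = actual, columns = predicted (an assumption).
nlp = binary_metrics(tn=559, fp=13, fn=16, tp=9)
ml = binary_metrics(tn=329, fp=95, fn=92, tp=105)

for name, metrics in (("nlp", nlp), ("machine-learning", ml)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Under that reading, the nlp classifier reaches about 95% accuracy while recovering only 9 of the 25 questions that actually carry the tag (recall of 0.36), whereas the machine-learning classifier has lower accuracy but finds roughly half of its positives.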
User recommendation
In our experiments, the classification of questions into tags yielded fairly satisfactory results, with scores of about 0.7 for common tags (69.8% accuracy with logistic regression and an ROC of 0.71 with SVM for machine learning) and accuracy above 90% for less common tags. Under these assumptions, and taking the weighted score function into account, we were able to predict users quite well for the limited training data at hand. But since we could only draw the set of users from the training set, we may be making false assumptions about the reliability of our model.

There is another immediate problem that needs to be addressed and will form the basis of our future work: ensuring that the recommended users actually end up answering the question. For this we will have to come up with a predictor function that takes into account features such as interest and locality and that distinguishes active from less active users. The predictor function, in conjunction with the weighted score, may form a better measure for suggesting users.

Contributions
1. Classification (Logistic Regression and SVM): Neel Kamal and Amit Tiwari
2. Scoring Function: Neel Kamal and Manpreet Singh
3. Frequent Pattern Mining of Tags: Manpreet Singh and Amit Tiwari
4. Clustering of Tags: Neel Kamal and Manpreet Singh
5. Clustering Graph: Neel Kamal and Amit Tiwari

Difficulties Encountered
1. Multi-label, multi-class classification: the input questions need to be classified into multiple tags.
2. Imbalanced class problem: some tags were infrequent in the data. Under such conditions, these tags could not be classified with high accuracy.
3. Accuracy is 99% but the classifier is still inaccurate: for rare tags, the accuracy achieved was misleading in the sense that true positives were not correctly classified. The displayed accuracy was due to the correct classification of true negatives.
4. Large number of models to test: the data contained a lot of different tags, and the number of models trained was equal to the number of tags.
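The first two difficulties can be illustrated with a small sketch of how a multi-label problem is split into one binary problem per tag, and how quickly rare tags become imbalanced. This is a plain-Python illustration under assumed names (`one_vs_rest_labels`, `positive_rate`) and made-up sample tags, not the code used in the experiments.

```python
def one_vs_rest_labels(tag_lists):
    """Map each tag to a 0/1 label vector with one entry per question."""
    all_tags = sorted({tag for tags in tag_lists for tag in tags})
    return {
        tag: [1 if tag in tags else 0 for tags in tag_lists]
        for tag in all_tags
    }


def positive_rate(labels):
    """Fraction of questions that carry the tag; low values mean imbalance."""
    return sum(labels) / len(labels)


# Made-up sample questions, for illustration only.
question_tags = [
    ["machine-learning"],
    ["machine-learning", "nlp"],
    ["machine-learning", "clustering"],
    ["clustering"],
]

labels = one_vs_rest_labels(question_tags)
rates = {tag: positive_rate(vector) for tag, vector in labels.items()}
print(rates)
```

Each tag yields its own label vector and therefore its own model, which is why the number of models equals the number of tags; a tag with a small positive rate gives a classifier that can score high accuracy by predicting "absent" almost everywhere.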
Conclusions and Future work
1. SVM performs better than logistic regression for tags such as machine learning, where the data is balanced, i.e. more examples of the tag are available in the data.
2. Both algorithms fail if the data is imbalanced, i.e. there is less training data for one class than for the other.

Future work will focus on improving the user experience on Q&A sites like stackoverflow.com. We could extend our work on the score function to define a predictor function which ensures that recommended users would actually be interested in answering questions. Along the lines of improving user experience, user communities could be further refined and extended to include not only the types of comments made by users but also user activities such as marking a question invalid or closing a question. Challenges would include quantifying such activities to find the users who are actually active.

Citations
[1] Stack Exchange data dump, https://archive.org/details/stackexchange
[2] Apache Spark, MLlib 1.5.1, http://spark.apache.org/