Introduction
Yahoo Answers, a community-driven question-and-answer (Q&A) site from Yahoo!, allows users both to
submit questions and to answer questions asked by other users. When a user posts a question on Yahoo
Answers, he/she assigns it to a category to make it easier for other users to answer.
Once the question is posted, the user must wait a minimum of one hour before selecting a best answer
from among the responses of other users.
Our project focuses on using various techniques to predict the best answer for a particular question. We
base our predictions on published work on the characteristics of best answers. Among the
criteria we use to select the best answer are the length of the answer and the answer’s cosine and
Jaccard similarity with a combination of the question and the other answers.
Dataset
The dataset was acquired from Yahoo on request for research purposes. Yahoo Labs provided us with an XML
dataset of 143,627 question-answer pairs that contained the following tags:
<vespaadd> - Holds the components of a question.
<uri> - Each question has a unique anonymized URI.
<subject> - Holds a question.
<content> - An optional element that holds additional information about the question.
<bestanswer> - Holds the answer that is selected as the best.
<nbestanswers> - Holds all answers posted for a question. Answers are separated using <answer_item>
sub-elements.
The question is optionally classified into the question taxonomy using three elements:
<maincat> - Holds the main category of the question.
<cat> - Holds the category of the question.
<subcat> - Holds the sub-category of the question.
Due to resource limitations, we limited our analysis to the following four categories:
Science & Mathematics
Health
Finance & Business
Family & Relationships
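The record structure above can be read with a short ElementTree sketch (a minimal illustration, assuming the <vespaadd> records sit under a single root element; the helper name parse_questions is ours):

```python
import xml.etree.ElementTree as ET

def parse_questions(xml_text):
    """Parse <vespaadd> records into question/best-answer/answers dicts."""
    root = ET.fromstring(xml_text)
    records = []
    for doc in root.iter("vespaadd"):
        records.append({
            "question": doc.findtext(".//subject", default=""),
            "content": doc.findtext(".//content", default=""),  # optional description
            "best": doc.findtext(".//bestanswer", default=""),
            # <nbestanswers> holds one <answer_item> per posted answer.
            "answers": [a.text or "" for a in doc.iter("answer_item")],
        })
    return records
```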
Dataset Pre-Processing
To standardize the data for analysis, we performed the following pre-processing steps. The XML
dataset was parsed using the ElementTree module in Python.
Data was standardized to lowercase using lower().
Whitespace at the beginning and end of the data was removed using strip().
Digits and punctuation were removed from the dataset, as they offer little insight during
analysis, and removing them helps improve prediction accuracy.
Words that offer no insight during analysis (stop words) were removed using the NLTK
module.
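The steps above can be sketched as follows (a minimal illustration; the small STOPWORDS set here is a stand-in for NLTK's full English stop-word list):

```python
import string

# Stand-in stop-word list; the project used NLTK's English stop words.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it"}

def preprocess(text):
    """Lowercase, strip, drop digits/punctuation, and remove stop words."""
    text = text.lower().strip()
    # Delete every digit and punctuation character.
    text = text.translate(str.maketrans("", "", string.digits + string.punctuation))
    return [w for w in text.split() if w not in STOPWORDS]
```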
Feature Extraction
To extract features from our dataset, we used the following methods.
Tf-idf Vectorizer
Term frequency–inverse document frequency (tf-idf) is a method to evaluate the importance
of a word in a document. It converts textual information into sparse features. The dataset was
processed to extract the textual information and obtain the relative importance of its terms.
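A minimal pure-Python sketch of the tf-idf weighting (illustrative only; the conventions here, raw term counts and log(N/df), are one common choice and an assumption on our part):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    tf = term count in the document; idf = log(N / df), where df is the
    number of documents containing the term.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency per term
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]
```

Terms that appear in every document get weight zero, which is what makes the resulting features sparse.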
Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a statistical technique for extracting and computing the similarity
in meaning of words and passages by analyzing large bodies of text. It uses singular value
decomposition (SVD) to reduce a large word-by-context matrix to a smaller-dimensional
representation.
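The SVD reduction can be sketched with NumPy (illustrative; the matrix shapes and the helper name lsa_embed are our assumptions):

```python
import numpy as np

def lsa_embed(term_doc, k=2):
    """Reduce a term-by-document count matrix to k latent dimensions via SVD."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Each row of the result embeds one document in the k-dimensional
    # latent space (document coordinates scaled by the singular values).
    return (np.diag(s[:k]) @ vt[:k]).T   # shape: (n_docs, k)
```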
Model Evaluation
Cosine-Similarity
We use cosine similarity as one approach to predicting the best answer. The question, its content
(description), and all its corresponding answers were combined into one document. We compared this
with a second document containing the question, its content, and a single answer, and computed the
similarity score between the two. We repeated this for every answer to the question and then checked
whether the answer with the highest similarity score was in fact the flagged best answer. This process
was repeated for all the questions in each category.
Cosine-Similarity Accuracies
Category   Accuracy (%)
Science    63.0479
Health     49.6361
Finance    66.4629
Family     45.0614
Longest Answer
From multiple research papers and publications, we learned that the length of an answer can be a good
predictor of its quality. In general, answerers post an answer only if no adequate, explanatory answer
has been posted before. They read through the answers already written by other answerers and, in the
process, build on the knowledge in those answers. When they feel they have a more explanatory answer,
they post one that builds on the insights gained from the previous answers. This process tends to yield
longer, higher-quality answers, which are more likely to be selected as the best answer. Intuitively,
it also means that in most cases the last answer tends to be longer. We took the longest answer for
each question and compared it with the answer flagged as the best answer. This process was then
repeated for all the questions, across all categories.
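A sketch of this check (assuming each question is given as a (best_answer, answers) pair; the helper name is ours):

```python
def longest_answer_accuracy(questions):
    """Fraction of questions whose longest answer is the flagged best answer.

    `questions` is a list of (best_answer, answers) pairs.
    """
    hits = sum(1 for best, answers in questions
               if max(answers, key=len) == best)
    return hits / len(questions)
```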
Longest Answer Accuracies
Category   Accuracy (%)
Science    65.5059
Health     59.5235
Finance    70.7016
Family     57.6882
Jaccard Similarity
The Jaccard coefficient is a statistic used to compute the similarity of finite sample sets:
the size of the intersection of two sets divided by the size of their union. In our project,
one set contains the words of the question, its content, and all its answers; the other
contains the words of the question and a single answer. We compute the Jaccard index for
every question-answer pair, check whether the answer with the highest index is the flagged
best answer, and compute the accuracy.
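The Jaccard index itself is a one-liner over word sets:

```python
def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B| for two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0
```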
Jaccard Similarity Accuracies
Category   Accuracy (%)
Science    65.3420
Health     59.4616
Finance    70.6317
Family     57.5770
Latent Semantic Analysis
We came across a technique in NLP known as Latent Semantic Analysis (LSA). We used it for feature
extraction and to analyze word-word and word-passage semantic relations. For our analysis, we extracted
the top 20 words from the combination of a question, its content, and all its answers, and stored them
in a list. The top 20 words of each answer were then stored in a separate list per answer. This gave us
the most important words describing each passage. Cosine similarity was then used to estimate the
similarity between the list containing the top 20 words of the QA combination and each list containing
the top 20 words of an answer. This process was repeated for all the questions, across all categories.
We selected the answer with the highest similarity score as the predicted best answer. However, while
estimating the scores, we found that in a few instances two answers had very close similarity scores,
and the cosine-similarity algorithm picked the one that was not flagged as the best answer. This led us
to the observation that in many cases more than one answer might reasonably be considered the best.
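A simplified sketch of the top-20-words comparison (here we score answers by raw overlap of their top-k word lists rather than by the LSA-weighted cosine, so the scoring is a stand-in; the helper names are ours):

```python
from collections import Counter

def top_k_words(tokens, k=20):
    """The k most frequent words — a stand-in for the LSA-weighted terms."""
    return [w for w, _ in Counter(tokens).most_common(k)]

def best_two(qa_tokens, answers_tokens, k=20):
    """Rank answers by overlap of their top-k words with the Q&A combination
    and return the indices of the two highest-scoring answers."""
    target = set(top_k_words(qa_tokens, k))
    scores = [len(target & set(top_k_words(a, k))) for a in answers_tokens]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:2]
```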
Consider this scenario: a questioner posts a question on the first day of a month and gets a couple of
answers within a few days. Let the answers be A1, A2, A3, A4, and so on, of which A1 and A3 are not
close to what the questioner expected, A4 is somewhat close, and A2 exactly matches the requirement,
so A2 is flagged as the best answer. A week later, a new user wants to post a similar question whose
ideal answer is exactly A4, but first checks whether a similar question already exists in the
repository. By the time this user reaches the question, many more answers have been posted, so the
second-best answer A4 has been pushed down the answer stack; it goes unnoticed, and the user does not
find the flagged best answer useful. In such scenarios, this algorithm could be adopted by Yahoo
Answers to rank the closest answers at the top of the answer stack.
LSA Similarity Accuracies
Category   LSA – 1 Best Answer (%)   LSA – 2 Best Answers (%)
Science    43.88912                  68.53748
Health     29.95                     52.425
Finance    49.3125                   75.675
Family     27.875                    47.5125
Gensim
We use the Gensim module in Python as another approach to finding the similarity between the
question/description and the answers. In this approach, all the answers are stored in a list as a
corpus. The answer corpus and the question/description are then transformed into vector spaces.
Similarity is calculated between the question and every answer, and the answers are ordered by
their similarity score with the question. We consider two scenarios: in the first, we compute the
accuracy using only the answer with the highest similarity score; in the second, we pick the two
most probable best answers and compute the accuracy against the answer actually selected by the
user.
Gensim Similarity Accuracies
Category   Gensim – 1 Best Answer (%)   Gensim – 2 Best Answers (%)
Science    48.2589                      69.5753
Health     32.5625                      51.3
Finance    30.375                       48.675
Family     56.575                       78
Inferences
From these results we see that the longest answer is usually selected as the best answer in Yahoo!
Answers, and answer length is the simplest and fastest way to predict the best answer. We also see
that LSA and Gensim do well at selecting answers related to the question, and if we expand our
analysis to predict two answers instead of one, we obtain much better accuracy than the longest-answer
method, though these methods are more resource-hungry.
Scope of Future Analysis
Extending the scope of the analysis to the other categories.
Implementing the Vowpal Wabbit fast learning algorithm.
Using a voting model to decide on the best answer based on the predictions of the individual
models.
Optimizing the code to take advantage of multiple threads.
[Figure: Comparison of Accuracies (%) by category (Science, Health, Finance, Family) for Cosine
Similarity, Longest Answer, Jaccard Similarity, LSA-1, LSA-2, Gensim-1, and Gensim-2.]
References
[1] Taikai Takeda, Weicheng Yu and Xingwei Liu. Best Answer Prediction on Stack Overflow Data Set.
[2] Alina Beygelzimer, Ruggiero Cavallo and Joel Tetreault. On Yahoo Answers, Long Answers are Best.
[3] Thomas K Landauer, Peter W. Foltz and Darrell Laham. An Introduction to Latent Semantic Analysis.
[4] M. Ikonomakis, S. Kotsiantis and V. Tampakas. Text Classification Using Machine Learning
Techniques.
[5] Latent Semantic Analysis in Python. http://blog.josephwilk.net/projects/latent-semantic-analysis-in-python.html
[6] Text Analytics - Latent Semantic Analysis. https://www.youtube.com/watch?v=BJ0MnawUpaU
[7] Gensim, Topic Modelling for Humans. https://radimrehurek.com/gensim