Contents
Introduction
Dataset
Dataset Pre-Processing
Feature Extraction
  Tf-idf Vectorizer
  Latent Semantic Analysis
Model Evaluation
  Cosine-Similarity
  Longest Answer
  Jaccard Similarity
  Latent Semantic Analysis
  Gensim
Inferences
Scope of Future Analysis
References
Introduction
Yahoo Answers, a community-driven question-and-answer (Q&A) site from Yahoo!, allows users to both
submit questions and answer questions asked by other users. When a user posts a question on Yahoo
Answers, he or she assigns it to a category to make it easier for other users to find and answer it.
Once the question is posted, the user must wait a minimum of one hour before selecting a best answer
from among the responses of other users.
Our project focuses on using various techniques to predict the best answer for a given question. We
base our predictions on published papers on the characteristics of a best answer. The criteria we use
to select the best answer include the length of the answer and the answer's cosine and Jaccard
similarity with a combination of the question and the other answers.
Dataset
The dataset was acquired from Yahoo on request for research purposes. Yahoo Labs provided us with an XML
dataset of 143,627 question-answer pairs that contained the following tags.
<vespaadd> - Holds the components of a question.
<uri> - Each question has a unique anonymized URI.
<subject> - Holds a question.
<content> - An optional element that holds additional information about the question.
<bestanswer> - Holds the answer that is selected as the best.
<nbestanswers> - Holds all answers posted for a question. Answers are separated using <answer_item>
sub-elements.
The question is optionally classified into the question taxonomy using three elements:
<maincat> - Holds the main category of the question.
<cat> - Holds the category of the question.
<subcat> - Holds the sub-category of the question.
Due to resource limitations, we limited our analysis to the following four categories:
• Science & Mathematics
• Health
• Finance & Business
• Family & Relationships
Dataset Pre-Processing
To standardize the data for analysis, we performed the following pre-processing steps. The XML
dataset was parsed using the ElementTree module in Python.
• Text was converted to lowercase using lower().
• Leading and trailing whitespace was removed using strip().
• Digits and punctuation were removed from the dataset, as they offer little insight during the
analysis; removing them helps improve the prediction accuracy.
• Words that offer no insight during analysis (stop words) were removed using the NLTK
module.
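The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the report's actual code: the sample record, the flat element nesting, the function names clean_text and parse_question, and the small hard-coded stop-word set (a stand-in for the NLTK stop-word list) are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Stand-in stop-word set (assumption); the report used the NLTK list instead.
STOP_WORDS = {"a", "an", "the", "is", "it", "of", "to", "and", "in", "what"}

def clean_text(text):
    """Lowercase, strip, drop digits and punctuation, remove stop words."""
    text = (text or "").lower().strip()
    # Replace everything except lowercase ASCII letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

# Hypothetical record using the tags listed in the Dataset section.
SAMPLE_XML = """
<vespaadd>
  <subject>What is the boiling point of water?</subject>
  <content>At sea level, in Celsius.</content>
  <bestanswer>It is 100 degrees Celsius.</bestanswer>
  <nbestanswers>
    <answer_item>It is 100 degrees Celsius.</answer_item>
    <answer_item>About 212 F!</answer_item>
  </nbestanswers>
</vespaadd>
"""

def parse_question(xml_string):
    """Parse one question record and clean each text field."""
    root = ET.fromstring(xml_string)
    return {
        "subject": clean_text(root.findtext("subject")),
        "content": clean_text(root.findtext("content")),
        "best": clean_text(root.findtext("bestanswer")),
        "answers": [clean_text(a.text) for a in root.iter("answer_item")],
    }

record = parse_question(SAMPLE_XML)
```

The optional <content> element is handled by treating a missing value as an empty string.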
Feature Extraction
To extract features from our dataset, we used the following methods.
Tf-idf Vectorizer
Term frequency-inverse document frequency (tf-idf) is a method for evaluating the importance of a
word to a document within a collection of documents. It converts text into sparse numeric features.
The dataset was processed to extract the text and obtain the relative importance of its terms.
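To make the weighting concrete, here is a small pure-Python sketch of tf-idf. It assumes the plain formula tf * log(N / df); the report does not state which variant was used, and library vectorizers typically apply smoothed or normalized versions. The function name tfidf_vectors and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors

docs = ["water boils at sea level", "water freezes", "salt water boils later"]
vectors = tfidf_vectors(docs)
```

A term that occurs in every document, such as "water" here, receives weight zero, while a term unique to one document receives the largest weight.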
Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a statistical technique for extracting and comparing the meaning
of words and passages by analyzing large bodies of text. It uses singular value decomposition (SVD)
to reduce a large word-by-context matrix to a lower-dimensional representation.
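The dimensionality reduction can be written out as follows. The symbols are chosen here for illustration and do not appear in the report: X is the m-by-n word-by-context matrix, and k is the number of retained dimensions.

```latex
X = U \Sigma V^{\top}, \qquad
X_k = U_k \Sigma_k V_k^{\top}, \qquad k \ll \operatorname{rank}(X)
```

Here U_k and V_k hold the first k left and right singular vectors and \Sigma_k the k largest singular values. X_k is the best rank-k approximation of X in the least-squares sense, and column j of \Sigma_k V_k^{\top} gives the k-dimensional representation of context j.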
Model Evaluation
Cosine-Similarity
We use cosine similarity as one of the approaches to predict the best answer. For each question, the
question, its content (description), and all of its answers were put in one set. We compared this with
a second set that contained the question, its content, and one of its answers, and computed a
similarity score. We repeated this for every answer to the question. We then checked whether the
answer with the highest similarity score was the one flagged as the best answer. We repeated this
process for all the questions in a category.
Cosine-Similarity Accuracies
Category    Accuracy (%)
Science     63.0479
Health      49.6361
Finance     66.4629
Family      45.0614
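The cosine-similarity procedure described above can be sketched as follows. This version assumes raw term-count vectors; the report does not state the weighting, and tf-idf weights could be substituted. The names cosine_similarity and predict_best are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def predict_best(question, content, answers):
    """Return the index of the answer whose (question, content, answer) text
    is most similar to the text of the question, content, and all answers."""
    full = " ".join([question, content] + answers)
    scores = [
        cosine_similarity(full, " ".join([question, content, answer]))
        for answer in answers
    ]
    return max(range(len(answers)), key=scores.__getitem__)

predicted = predict_best("q", "c", ["x", "x x x y"])
```

Comparing the predicted index with the index of the flagged best answer for every question in a category gives the accuracy for that category.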
Longest Answer
Several research papers and publications report that the length of an answer can be a good
predictor of its quality. In general, answerers post an answer only if no adequate, explanatory answer
has been posted already. They therefore read through the answers written by other answerers and, in
doing so, build on their own knowledge. When they feel they can give a more thorough explanation,
they post an answer that builds on the insights gained from the previous answers. This tends to yield
longer, higher-quality answers, which are more likely to be selected as the best answer. Intuitively,
this also means that in most cases the last answer tends to be longer. We took the longest answer to
each question and compared it with the answer that was flagged as the best answer. This process was
repeated for all the questions across all categories.
Longest Answer Accuracies
Category    Accuracy (%)
Science     65.5059
Health      59.5235
Finance     70.7016
Family      57.6882
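The longest-answer baseline is simple enough to sketch in a few lines. Length is measured in characters here, which is an assumption (word counts would work equally well), and the dictionary keys "answers" and "best" are hypothetical names.

```python
def longest_answer_accuracy(questions):
    """Percentage of questions whose longest answer is the flagged best answer."""
    hits = sum(
        1 for q in questions if max(q["answers"], key=len) == q["best"]
    )
    return 100.0 * hits / len(questions)

toy_questions = [
    {"answers": ["yes", "it depends on the altitude"],
     "best": "it depends on the altitude"},
    {"answers": ["no", "a long but unhelpful reply"],
     "best": "no"},
]
accuracy = longest_answer_accuracy(toy_questions)
```

On the toy data the longest answer matches the flagged best answer for one of the two questions.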
Jaccard Similarity
The Jaccard coefficient is a statistic for measuring the similarity between finite sample
sets: the size of the intersection of two sets divided by the size of their union. In our
project, we use the question, its content, and all of its answers as one set, and each
question-answer pair as the other set. We compute the intersection divided by the union of
the two sets for every answer, take the answer with the highest Jaccard index as the
predicted best answer, and compute the accuracy over all questions.
Jaccard Similarity Accuracies
Category    Accuracy (%)
Science     65.3420
Health      59.4616
Finance     70.6317
Family      57.5770
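A minimal sketch of the Jaccard-based selection, assuming the sets are sets of whitespace-separated words; the function names jaccard and predict_best_jaccard are illustrative and not taken from the report.

```python
def jaccard(text_a, text_b):
    """Size of the intersection of two word sets divided by the size of their union."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def predict_best_jaccard(question, content, answers):
    """Return the index of the answer whose question-answer pair has the highest
    Jaccard index with the set built from question, content, and all answers."""
    full = " ".join([question, content] + answers)
    scores = [jaccard(full, question + " " + answer) for answer in answers]
    return max(range(len(answers)), key=scores.__getitem__)

predicted = predict_best_jaccard("q", "c", ["x", "x y z"])
```

Because longer answers share more words with the combined set, this measure tends to favor them.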
Latent Semantic Analysis
We also applied Latent Semantic Analysis (LSA), a technique from NLP, for feature extraction and for
analyzing word-word and word-passage semantic relations. For each question, we extracted the
top 20 words from the combination of the question, its content, and all of its answers, and stored
them in a list. The top 20 words of each answer were stored in a separate list per answer. These
lists capture the most important words describing each passage. Cosine similarity was then used to
estimate the similarity between the list of top 20 words from the question-answer combination and
each list of top 20 words from an individual answer. This process was repeated for all the
questions across all categories, and the answer with the highest similarity score was selected as
the predicted best answer. However, while estimating the scores we found that in a few instances two
of the answers had very close similarity scores, and the answer with the highest cosine similarity
was not the one flagged as the best answer. This suggests that in many cases more than one answer
might reasonably be considered the best answer.
Consider this scenario: a questioner posts a question on the first day of a month and receives
several answers within a few days. Let the answers be A1, A2, A3, A4, and so on. A1 and A3 are
not close to what the questioner expected, A4 is somewhat close, and A2 exactly matches the
requirement, so the questioner flags A2 as the best answer. A week later, a second user wants to
post a similar question, for which the ideal answer is exactly A4, but first checks whether a
similar question already exists in the repository. The user finds this question, but in the
meantime many more answers have been posted to it. As a result, A4, the second-best answer, has
been pushed down the answer stack and is effectively hidden, and the second user does not find the
flagged best answer useful. In such scenarios, this algorithm could be adopted by Yahoo Answers to
rank the answers and display the closest ones at the top of the answer stack.
LSA Similarity Accuracies
Category    LSA – 1 Best Answer (%)    LSA – 2 Best Answers (%)
Science     43.88912                   68.53748
Health      29.95                      52.425
Finance     49.3125                    75.675
Family      27.875                     47.5125
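The "1 Best Answer" and "2 Best Answers" columns can be read as top-1 and top-2 accuracy: a prediction counts as correct when the flagged best answer is among the k highest-scoring answers. The sketch below assumes that interpretation; the keys "scores" and "best_index" and the toy values are hypothetical.

```python
def top_k_accuracy(questions, k):
    """Percentage of questions whose flagged best answer is among the k
    answers with the highest similarity scores."""
    hits = 0
    for q in questions:
        ranked = sorted(
            range(len(q["scores"])), key=lambda i: q["scores"][i], reverse=True
        )
        if q["best_index"] in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(questions)

toy_questions = [
    {"scores": [0.9, 0.8, 0.1], "best_index": 1},
    {"scores": [0.2, 0.7, 0.6], "best_index": 1},
    {"scores": [0.5, 0.4, 0.3, 0.1], "best_index": 3},
    {"scores": [0.1, 0.9], "best_index": 1},
]
top1 = top_k_accuracy(toy_questions, 1)
top2 = top_k_accuracy(toy_questions, 2)
```

Top-2 accuracy can never be lower than top-1 accuracy, so the two columns are not directly comparable with single-answer methods.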
Gensim
We use the Gensim module in Python as another approach to finding the similarity between the
question/description and each answer. In this approach, all the answers to a question are stored
in a list as the corpus. The answer corpus and the question/description are then transformed into
vector spaces. Similarity is calculated between the question and every answer, and the answers
are ordered by their similarity score with the question. We consider two scenarios in the
analysis. In the first, we compute the accuracy using only the answer with the highest
similarity score. In the second, we pick the two most probable best answers and count the
prediction as correct if either matches the answer actually selected by the user.
Gensim Similarity Accuracies
Category    Gensim – 1 Best Answer (%)    Gensim – 2 Best Answers (%)
Science     48.2589                       69.5753
Health      32.5625                       51.3
Finance     30.375                        48.675
Family      56.575                        78
Inferences
From these results we can see that the longest answer is usually selected as the best answer in Yahoo!
Answers. Answer length is the most accurate single-answer predictor in all four categories and also
the fastest to compute. LSA and Gensim do well at selecting answers that are related to the question,
and if we expand the analysis to predict two answers instead of one, each of them exceeds the accuracy
of the longest answer in two of the four categories, though these methods are more resource hungry.
Scope of Future Analysis
• Extending the analysis to the other categories.
• Implementing the Vowpal Wabbit fast learning algorithm.
• Using a voted model to decide on the best answer based on the predictions of the individual
models.
• Optimizing the code to take advantage of multiple threads.
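As one possible shape for the voted model proposed above, the sketch below takes the answer index predicted by each individual method and returns the index chosen most often. This is an illustration only, since the report does not implement voting; the function name majority_vote and the tie-breaking rule (the earliest prediction wins) are assumptions.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the answer index predicted by the most models.
    Ties go to the index that appears first in the list."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from cosine, longest-answer, and Jaccard models.
voted = majority_vote([1, 2, 1])
```

Ordering the list so that the most accurate model comes first makes that model the tie-breaker.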
[Figure: Comparison of Accuracies. Bar chart of accuracy (vertical axis, 0 to 90) by category (Science, Health, Finance, Family) for Cosine Similarity, Longest Answer, Jaccard Similarity, LSA1, LSA2, Gensim1, and Gensim2.]
References
[1] Taikai Takeda, Weicheng Yu and Xingwei Liu. Best Answer Prediction on Stack Overflow Data Set.
[2] Alina Beygelzimer, Ruggiero Cavallo and Joel Tetreault. On Yahoo Answers, Long Answers are Best.
[3] Thomas K Landauer, Peter W. Foltz and Darrell Laham. An Introduction to Latent Semantic Analysis.
[4] M. Ikonomakis, S. Kotsiantis and V. Tampakas. Text Classification Using Machine Learning Techniques.
[5] Latent Semantic Analysis in Python. http://blog.josephwilk.net/projects/latent-semantic-analysis-in-python.html
[6] Text Analytics - Latent Semantic Analysis. https://www.youtube.com/watch?v=BJ0MnawUpaU
[7] Gensim, Topic Modelling for Humans. https://radimrehurek.com/gensim

Yahoo Answers! Answer Evaluation

  • 1.
  • 2. Contents Introduction ............................................................................................................................................3 Dataset....................................................................................................................................................3 Dataset Pre-Processing............................................................................................................................4 Feature Extraction...................................................................................................................................4 Tf-idf Vectorizer...................................................................................................................................4 Latent Semantic Analysis.....................................................................................................................4 Model Evaluation ....................................................................................................................................4 Cosine-Similarity..................................................................................................................................4 Longest Answer...................................................................................................................................5 Jaccard Similarity.................................................................................................................................5 Latent Semantic Analysis.....................................................................................................................5 Gensim................................................................................................................................................6 Inferences ...............................................................................................................................................7 Scope of Future 
Analysis..........................................................................................................................7 References ..............................................................................................................................................8
  • 3. Introduction Yahoo Answers, a community-driven question-and-answer (Q&A) site from Yahoo! allows users to both submit questions and answer the questions asked by other users. In Yahoo Answers when a user posts a question, he/she categorizes it into a category in order to make it easier for other users to answer it. Once the question is posted, there is minimum time limit of an hour for the user to select a best answer amongst the responses from other users. Our project focuses on using various techniques to predict the best answer for a particular question. We base our predictions on certain papers published on the characteristics of a best answer. Some of the criteria we use to select the best answer are the length of the answer and, the answer’s cosine and jaccard similarity with a combination of question and other answers. Dataset Dataset was acquired from Yahoo on request for research purpose. Yahoo Labs provided us with an xml dataset of size 143, 627 question answer pairs that contained the following tags. <vespaadd> - Holds the components of a question. <uri> - Each question has a unique anonymized URI. <subject> - Holds a question. <content> - An optional element that holds additional information about the question. <bestanswer> - Holds the answer that is selected as the best. <nbestanswers> - Holds all answers posted for a question. Answers are separated using <answer_item> sub-elements. The question is optionally classified into the question taxonomy using three elements: <maincat> - Holds the main category of the question. <cat> - Holds the category of the question. <subcat> - Holds the sub-category of the question. Due to resource limitations we limited our analysis to the following four categories:  Science & Mathematics  Health  Finance & Business  Family & Relationships
  • 4. Dataset Pre-Processing
To standardize the data for analysis, we performed the following pre-processing activities. The XML dataset was parsed using the ElementTree module in Python.
- Data was standardized to lowercase using lower().
- Whitespace at the beginning and the end of the data was removed using strip().
- Digits and punctuation were removed from the dataset, as they offer little insight during the analysis; removing them helps to improve the prediction accuracy.
- Words that don’t offer any insight during analysis (stop words) were removed using the NLTK module.

Feature Extraction
To extract the features from our dataset we used the following methods.

Tf-idf Vectorizer
Term frequency-inverse document frequency (tf-idf) is a method to evaluate the importance of a word in a document relative to a collection of documents. It converts textual information into sparse features. The dataset was processed to extract the textual information and obtain the relative importance of its words.

Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a statistical technique to extract and compute the similarity of meaning of words and passages by analysis of large bodies of text. This technique uses singular value decomposition (SVD) to reduce a large word-by-context matrix to a lower-dimensional representation.

Model Evaluation

Cosine-Similarity
We use cosine similarity as one of the approaches to predict the best answer. The question, its content (description) and all its corresponding answers were put in one set. We compared it with another set that contained the question, its content and one of its answers, and estimated the similarity score. We repeated this process for every answer to a question. We then checked whether the answer with the highest similarity score was actually the best answer. We repeated this process for all the questions in a category.

Cosine-Similarity Accuracies
Category   Accuracy (%)
Science    63.0479
Health     49.6361
Finance    66.4629
Family     45.0614
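The pre-processing steps and the cosine-similarity procedure described above can be sketched as follows. This is a dependency-free illustration, not the project's code: the small `STOP_WORDS` set is a stand-in for the NLTK stop-word list, the smoothed idf formula is one common choice and an assumption here, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stand-in for the NLTK English stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "it", "of", "to", "and", "in", "by"}


def preprocess(text):
    """Lowercase, strip, drop digits/punctuation, remove stop words."""
    text = text.lower().strip()
    text = re.sub(r"[^a-z\s]", " ", text)  # keeps only ASCII letters
    return [word for word in text.split() if word not in STOP_WORDS]


def tfidf_vectors(token_lists):
    """Build one sparse tf-idf vector (a dict) per token list."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            # smoothed idf: log((1 + n) / (1 + df)) + 1
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_best_by_cosine(question, content, answers):
    """Score (question + content + one answer) against the full thread."""
    reference = " ".join([question, content] + answers)
    candidates = [" ".join([question, content, a]) for a in answers]
    vectors = tfidf_vectors(
        [preprocess(reference)] + [preprocess(c) for c in candidates]
    )
    scores = [cosine(vectors[0], v) for v in vectors[1:]]
    best_index = max(range(len(answers)), key=scores.__getitem__)
    return best_index, scores


best, scores = predict_best_by_cosine(
    "Why is the sky blue?",
    "",
    ["Blue light is scattered by air molecules.", "No idea."],
)
print(best, scores)
```

The predicted index would then be compared with the answer flagged in <bestanswer> to compute the per-category accuracy.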
  • 5. Longest Answer
Multiple research papers and publications report that the length of an answer can be a good predictor of its quality. In general, answerers post an answer only if no proper, explanatory answer has been posted before. They therefore read through all the answers written by other answerers and, in the process, build on their knowledge from those answers. When they feel they have a more explanatory answer, they post one that builds on the insights gained from the previous answers. This procedure tends to yield longer, higher-quality answers that are more likely to be selected as the best answer. Intuitively, this also means that in most cases the last answer tends to be longer. We took the longest answer to a question and compared it with the answer that was flagged as the best answer. This process was then repeated for all the questions, across all categories.

Longest Answer Accuracies
Category   Accuracy (%)
Science    65.5059
Health     59.5235
Finance    70.7016
Family     57.6882

Jaccard Similarity
The Jaccard coefficient is a statistic that measures the similarity between finite sample sets: the size of the intersection of two sets divided by the size of their union. In our project, we use the question, its content and all its answers as one set, and the question paired with one of its answers as another set, for every answer and every question. We then compute the intersection divided by the union of the two sets. Finally, we pick the answer with the highest Jaccard index and compute the accuracy.

Jaccard Similarity Accuracies
Category   Accuracy (%)
Science    65.3420
Health     59.4616
Finance    70.6317
Family     57.5770

Latent Semantic Analysis
We came across a technique in NLP known as Latent Semantic Analysis (LSA). We used it for feature extraction and to analyze word-word and word-passage semantic relations.
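The longest-answer and Jaccard baselines described earlier on this slide can be sketched as follows. Two details are assumptions, since the report does not state them: answer length is measured in characters, and sets are built from whitespace-separated lowercase words. The function names are hypothetical.

```python
def tokenize(text):
    """Assumption: a set of lowercase, whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_longest(answers):
    """Index of the longest answer (assumption: length in characters)."""
    return max(range(len(answers)), key=lambda i: len(answers[i]))


def predict_by_jaccard(question, content, answers):
    """Compare (question + content + one answer) with the whole thread."""
    reference = tokenize(" ".join([question, content] + answers))
    scores = [
        jaccard(reference, tokenize(" ".join([question, content, answer])))
        for answer in answers
    ]
    best_index = max(range(len(answers)), key=scores.__getitem__)
    return best_index, scores


answers = ["No idea.", "Air molecules scatter blue light more than red light."]
print(predict_longest(answers))
print(predict_by_jaccard("Why is the sky blue?", "", answers))
```

Because each candidate set is contained in the reference set under this construction, the score reduces to the share of the thread's vocabulary that the candidate covers. That favors answers with more distinct words, which may be why the Jaccard accuracies track the longest-answer accuracies so closely.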
For our analysis, we extracted the top 20 words from the combination of the question, its content and all its answers, and stored them in a list.
  • 6. Then the top 20 words of each answer were stored in a separate list per answer. We used these lists to capture the most important words that describe each passage. Cosine similarity was then used to estimate the similarity between the list containing the top 20 words from the QA combination and each list containing the top 20 words from an answer. This process was then repeated for all the questions, across all categories.

In this process we selected the answer with the highest similarity score as the best answer. However, while estimating the scores we found that in a few instances two of the answers had very close similarity scores, and the cosine-similarity algorithm chose the one that wasn’t flagged as the best answer. This led us to the observation that in many cases more than one answer might be considered the best answer.

Consider this scenario: a questioner posts a question on the first day of a month and gets a couple of answers within a few days. Let the answers be A1, A2, A3, A4 and so on. A1 and A3 are not close to what the questioner expected, A4 is somewhat close, and A2 exactly matches his requirement, so he flags A2 as the best answer. A week later, a new questioner wants to post a similar question whose ideal answer is exactly A4, but first checks whether a similar question already exists in the repository. He finds this question, but in the meantime many more answers have been posted to it. As a result, the second-best answer A4 has been pushed down the answer stack and is effectively hidden, and the flagged best answer is of no use to the new questioner. In such scenarios, this algorithm could be adopted by Yahoo Answers to rank and display the closest answers at the top of the answer stack.
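The report does not spell out how the top 20 words were derived, so the sketch below does not reproduce that procedure. It instead shows the standard LSA formulation as an illustration of the same idea: build a term-document matrix, reduce it with a truncated SVD, and rank answers by cosine similarity to the combined question-and-answers document in the reduced space. NumPy is assumed to be available; the function names and the default of two latent components are hypothetical choices.

```python
import numpy as np


def term_document_matrix(docs):
    """Raw term counts: one row per word, one column per document."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    row = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for col, doc in enumerate(docs):
        for word in doc.lower().split():
            matrix[row[word], col] += 1.0
    return matrix


def lsa_rank(question, content, answers, n_components=2):
    """Rank answers by cosine similarity to the thread in LSA space."""
    # Document 0 is the QA combination; documents 1..n are the answers.
    docs = [" ".join([question, content] + answers)] + answers
    matrix = term_document_matrix(docs)
    _, singular_values, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(n_components, len(singular_values))
    # Each row is one document expressed in the k-dimensional latent space.
    doc_vectors = (singular_values[:k, None] * vt[:k]).T
    reference = doc_vectors[0]
    scores = []
    for vector in doc_vectors[1:]:
        denom = np.linalg.norm(reference) * np.linalg.norm(vector)
        scores.append(float(reference @ vector / denom) if denom else 0.0)
    order = sorted(range(len(answers)), key=lambda i: scores[i], reverse=True)
    return order, scores


order, scores = lsa_rank(
    "why is the sky blue",
    "",
    [
        "the sky scatters blue light",
        "cats like fish",
        "blue light scatters in the sky",
    ],
)
print(order, scores)
```

Taking the first element of `order` corresponds to predicting one best answer, and taking the first two corresponds to the two-best-answers setting.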
LSA Similarity Accuracies (%)
Category   LSA – 1 Best Answer   LSA – 2 Best Answers
Science    43.88912              68.53748
Health     29.95                 52.425
Finance    49.3125               75.675
Family     27.875                47.5125

Gensim
We use the Gensim module in Python as another approach to finding the similarity between the question/description and each answer. In this approach all the answers are stored in a list as the corpus. The answer corpus and the question/description are then transformed into a vector space. Similarity is then calculated between the question and every answer, and the answers are ordered by their similarity score with the question. We consider two scenarios in the analysis. In the first, we compute the accuracy using the answer with the highest similarity score. In the second, we pick the two most probable best answers and compute the accuracy against the actual answer selected by the user.
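The two scenarios (one predicted best answer versus two) amount to top-1 and top-2 accuracy over ranked answer lists. A minimal sketch of that evaluation follows; the function name `top_k_accuracy` and the toy rankings are hypothetical, and the rankings would in practice come from whichever similarity model is being evaluated.

```python
def top_k_accuracy(ranked_answers, best_answers, k=1):
    """Percentage of questions whose flagged best answer is in the top k.

    ranked_answers: one list per question, ordered by descending similarity.
    best_answers:   the answer flagged as best for each question.
    """
    if not best_answers:
        return 0.0
    hits = sum(
        1
        for ranked, best in zip(ranked_answers, best_answers)
        if best in ranked[:k]
    )
    return 100.0 * hits / len(best_answers)


# Hypothetical rankings for four questions.
ranked = [
    ["a1", "a2", "a3"],
    ["b2", "b1"],
    ["c3", "c1", "c2"],
    ["d1", "d2"],
]
flagged_best = ["a1", "b1", "c2", "d1"]

top1 = top_k_accuracy(ranked, flagged_best, k=1)
top2 = top_k_accuracy(ranked, flagged_best, k=2)
print(top1, top2)
```

Top-2 accuracy can never be lower than top-1 accuracy on the same rankings, so the two columns are not directly comparable with single-prediction baselines.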
  • 7. Gensim Similarity Accuracies (%)
Category   Gensim – 1 Best Answer   Gensim – 2 Best Answers
Science    48.2589                  69.5753
Health     32.5625                  51.3
Finance    30.375                   48.675
Family     56.575                   78

Inferences
From these results we can see that the longest answer is usually selected as the best answer on Yahoo! Answers; among the single-prediction methods we tested, picking the longest answer is both the most accurate and the fastest way to predict the best answer. We also see that LSA and Gensim do well at selecting answers that are related to the question. If we expand our analysis to predict two answers instead of one, their accuracy exceeds that of the longest answer in some categories (Science for both methods, Finance for LSA and Family for Gensim), though these methods are more resource hungry.

Scope of Future Analysis
- Extending the scope of analysis to the other categories.
- Implementing the Vowpal Wabbit fast learning algorithm.
- Using a voted model to decide on the best answer based on the predictions of the individual models.
- Optimizing the code to take advantage of multiple threads.

[Figure: Comparison of Accuracies. Bar chart of accuracy (axis from 0 to 90) by category (Science, Health, Finance, Family) for Cosine Similarity, Longest Answer, Jaccard Similarity, LSA1, LSA2, Gensim1 and Gensim2.]
  • 8. References
[1] Taikai Takeda, Weicheng Yu and Xingwei Liu. Best Answer Prediction on Stack Overflow Data Set.
[2] Alina Beygelzimer, Ruggiero Cavallo and Joel Tetreault. On Yahoo Answers, Long Answers are Best.
[3] Thomas K. Landauer, Peter W. Foltz and Darrell Laham. An Introduction to Latent Semantic Analysis.
[4] M. Ikonomakis, S. Kotsiantis and V. Tampakas. Text Classification Using Machine Learning Techniques.
[5] Latent Semantic Analysis in Python. http://blog.josephwilk.net/projects/latent-semantic-analysis-in-python.html
[6] Text Analytics - Latent Semantic Analysis. https://www.youtube.com/watch?v=BJ0MnawUpaU
[7] Gensim, Topic Modelling for Humans. https://radimrehurek.com/gensim