1. Sustainable Questions: Determining the expiration date of answers. Bart de Goede, 27 August 2012. Supervisors: Maarten de Rijke, Anne Schuth. Universiteit van Amsterdam
2. Outline • Introduction to CQA • Problem statement • Approach (cluster similar questions, compare answers in clusters, classify sustainable clusters) • Discussion and conclusion
3. Community Question Answering • Community of users asking and answering questions • Natural language • Formally, a service that involves: 1) a method for a person to present his/her information need in natural language, 2) a place where other people can respond to that information need, and 3) a community built around such a service based on participation (Shah et al., 2009)
5. Community Question Answering • CQA services have many answered questions • CQA retrieval aims to find answered questions similar to the question a user posts • However, not all questions can be readily reused: • Who designed the Eiffel Tower? Alexandre Gustave Eiffel. • Who is the prime minister of the UK? Now: David Cameron. Before: Gordon Brown.
6. Problem statement • Some questions are sustainable and can readily be reused, others are not • A question is sustainable if its answer is independent of the point in time at which the question is asked • So, if the answers to semantically similar questions do not change over time, the questions are considered sustainable
7. Research questions • RQ1: What are the distinguishing properties of sustainable questions? • RQ2: Can we measure these properties of sustainability? • RQ3: Can we tell sustainable and non-sustainable questions apart based on these properties?
8. Approach: What makes a question sustainable? 1. Cluster semantically similar questions 2. Compare answers in each cluster 3. Classify clusters as sustainable
9. Cluster semantically similar questions • Questions are semantically similar if they would be satisfied by the same information when asked at the same time • However, questions tend to be very short, phrased in different ways, noisy, and littered with function words
10. Cluster semantically similar questions • Latent Semantic Analysis (LSA; Deerwester et al., 1990) or Latent Dirichlet Allocation (LDA; Blei et al., 2003): topic modeling techniques; cosine distance between topic vectors • Locality Sensitive Hashing (LSH; Charikar, 2002): used for near-duplicate detection; intuition: near-duplicates are very likely to be similar
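The LSH idea on this slide can be sketched as follows. This is a minimal illustration of random-hyperplane hashing in the spirit of Charikar (2002), not the implementation used in the work: the whitespace tokenizer, the bag-of-words representation, and the function names (`tokenize`, `make_hyperplanes`, `signature`, `cluster_by_signature`) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization.
    return text.lower().split()


def make_hyperplanes(vocabulary, n_bits, seed=0):
    # One random Gaussian weight per term for each of the n_bits hyperplanes.
    rng = random.Random(seed)
    return [{term: rng.gauss(0.0, 1.0) for term in vocabulary}
            for _ in range(n_bits)]


def signature(tokens, hyperplanes):
    # Bit i is 1 if the term-count vector lies on the non-negative side
    # of hyperplane i, else 0.
    counts = defaultdict(int)
    for token in tokens:
        counts[token] += 1
    bits = []
    for plane in hyperplanes:
        # Sorted iteration keeps the floating-point sum order-independent.
        dot = sum(c * plane.get(t, 0.0) for t, c in sorted(counts.items()))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def cluster_by_signature(questions, n_bits=16, seed=0):
    # Questions that share the full bit signature land in the same cluster.
    tokenized = [tokenize(q) for q in questions]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    planes = make_hyperplanes(vocabulary, n_bits, seed)
    buckets = defaultdict(list)
    for question, tokens in zip(questions, tokenized):
        buckets[signature(tokens, planes)].append(question)
    return list(buckets.values())
```

More bits make the signature stricter, so fewer dissimilar questions collide, at the cost of splitting some near-duplicates into separate buckets.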
11. Cluster semantically similar questions • Manually labeled set of 559 question pairs • Calculate accuracy on samples of Yahoo! Answers Comprehensive Questions and Answers version 1.0
Table 2: Accuracy of several question clustering methods. Missing values represent experiments that never terminated.
algorithm     10K     100K    all
LDA           0.435   0.500   -
LSA           0.706   0.638   -
LSH 16 bits   0.472   0.484   0.500
LSH 24 bits   0.465   0.502   0.495
LSH 32 bits   0.512   0.514   0.509
LSH 40 bits   0.523   0.537   0.542
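One plausible way to compute such an accuracy figure is sketched below. The slide does not define the metric, so this assumes that a labeled pair counts as correct when the clustering's "same cluster" decision agrees with the manual label; `pair_accuracy` and its arguments are hypothetical names.

```python
def pair_accuracy(labeled_pairs, cluster_of):
    """Fraction of labeled pairs on which the clustering agrees with the label.

    labeled_pairs: iterable of (question_id_1, question_id_2, is_similar).
    cluster_of: dict mapping question id to cluster id (assumed structure).
    """
    labeled_pairs = list(labeled_pairs)
    correct = 0
    for q1, q2, is_similar in labeled_pairs:
        c1, c2 = cluster_of.get(q1), cluster_of.get(q2)
        # Unclustered questions are treated as not sharing a cluster.
        same_cluster = c1 is not None and c1 == c2
        if same_cluster == is_similar:
            correct += 1
    return correct / len(labeled_pairs)
```

Under this definition, a method that places every question in its own cluster scores exactly the share of pairs labeled dissimilar, so the scores are best read against that baseline.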
12. Compare answers in each cluster • Answers to similar questions that do not change over time indicate sustainable questions • Output of LSA contained 904 clusters: 9 clusters considered sustainable, 143 clusters considered similar, 756 clusters considered all • Compute properties of question-answer pairs (change, time, number of answers, etc.)
13. Compare answers in each cluster [Figure: cumulative cosine distance between answers in a single cluster, plotted over time from Jan 2006 to Jun 2006, with a linear fitted line]
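The quantity plotted here can be sketched as below: sum the cosine distances between consecutive answer vectors, then fit a straight line to obtain a slope and a sum of squared errors (SSE). The vector representation of answers, the choice to start the cumulative series at 0, and all function names are assumptions for illustration.

```python
import math


def cosine_distance(u, v):
    # 1 - cosine similarity; an all-zero vector is treated as maximally distant.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def cumulative_distances(vectors):
    # Running total of the distance between each answer and the previous one.
    total = 0.0
    series = [0.0]
    for previous, current in zip(vectors, vectors[1:]):
        total += cosine_distance(previous, current)
        series.append(total)
    return series


def linear_fit(xs, ys):
    # Ordinary least squares for y = slope * x + intercept, plus the SSE.
    # Requires at least two distinct x values.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse
```

Using answer indices as `xs` gives change per answer; using timestamps (for example, days since the first answer) gives change over time. A slope near zero suggests answers that stay the same.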
14. Compare answers in each cluster
[Figure 5: Kernel density estimation of the average cosine distance (i.e. change rate) between answers labeled as best according to either the user or the community. Series: all, similar, sustainable.]
[Figure 6: Kernel density estimation of the average cosine distance (i.e. change rate) between semanticized answers labeled as best according to either the user or the community. Series: all, similar, sustainable.]
[Figure 8: Kernel density estimation of the average time in days between posting of a question and the last answer a question received. x-axis: days between question posted and last answer. Series: all, similar, sustainable.]
The clusters in the similar class are only required to have similar questions (questions asking for the same information) regardless of the answers; these clusters can thus be sustainable and unsustainable. Additionally, the clusters in the sustainable class are required to have answers that do not change over time. Note that this definition implies that the sustainable class is a subset of the similar class.
The time between the posting of a question and receiving its last answer is very indicative in describing sustainability: the longer a question solicits answers, the higher the probability that the question is sustainable.
15. Classify clusters as sustainable • Construct feature sets (change, change over time, time to answer) • Train a classifier* on re-sampled data • Accuracy in stratified 10-fold cross-validation:
feature set                       accuracy
change per question               66.9%
change over time                  86.0%
semanticized change over time     75.3%
time to answer                    89.3%
change/time combination           91.5%
*We use the WEKA (Hall et al., 2009) implementation of C4.5 by Quinlan (1993).
In addition, from the simple properties (average, standard deviation, slope, SSE; detailed in Section 4.1.2) of clusters, we constructed five feature sets, as listed in Table 3. These correspond to approaches discussed in Section 3.2: change per question (i.e. the amount of change between sequential questions), change per question normalised for time, and the change over time for semanticized representations of questions, as well as the time between asking and answering of questions (both between asking and labeling of the best answer, and time between asking and reception of the last answer). Also, we used a combination of the 'change over time' and 'time to answer' sets.
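The evaluation protocol named on this slide, stratified 10-fold cross-validation, can be sketched as follows. This is not the WEKA/C4.5 setup used in the work: the learner is a deliberately simple single-feature threshold rule standing in for a decision tree, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    # Deal the indices of each class round-robin over k folds so that every
    # fold roughly preserves the overall class proportions.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for j, index in enumerate(indices):
            folds[(offset + j) % k].append(index)
        offset += len(indices)
    return folds


def train_stump(xs, ys):
    # Stand-in learner (NOT C4.5): choose the threshold and direction on one
    # numeric feature that maximise accuracy on the training data.
    best_model, best_correct = None, -1
    for threshold in sorted(set(xs)):
        for positive_above in (True, False):
            correct = sum(((x >= threshold) == positive_above) == y
                          for x, y in zip(xs, ys))
            if correct > best_correct:
                best_model, best_correct = (threshold, positive_above), correct
    return best_model


def predict_stump(model, x):
    threshold, positive_above = model
    return (x >= threshold) == positive_above


def cross_validate(xs, ys, k=10, seed=0):
    # Train on k-1 folds, test on the held-out fold, pool the accuracy.
    correct = 0
    for held_out in stratified_folds(ys, k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = train_stump(train_x, train_y)
        correct += sum(predict_stump(model, xs[i]) == ys[i] for i in held_out)
    return correct / len(xs)
```

Here `xs` could hold one numeric feature per cluster, such as the average number of days until the last answer, and `ys` boolean sustainability labels. Stratifying matters because sustainable clusters are rare, so unstratified folds could easily contain none at all.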
16. Conclusions • Explored a new problem concerning sustainability and reusability of questions in a CQA setting • Sustainability can be reasonably estimated from simple question properties, of which time is the most descriptive (RQ1) • These properties can be obtained easily, also from the data of other CQA services (RQ2) • Using a simple classifier, these properties can be used to distinguish sustainable from non-sustainable questions (RQ3)
17. Future work • Scaling (the sample considered is 3% of the training set) • Clustering: on answers (twice as long as questions); on both (where do clusters of answers and questions 'agree'?); retrieval approach • Evaluation: does factoring in sustainability have a positive effect on precision?
19. References • D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3:993–1022, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944937 • M. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388. ACM, 2002. • S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990. • M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. • J. Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1993. • C. Shah, S. Oh, and J. Oh. Research agenda for social Q&A. Library & Information Science Research, 31(4):205–209, 2009.
20. Data descriptives
Table 1: Descriptive statistics of the Yahoo! Answers data set.
Statistic (by sample size)                 10K      100K     all
Number of questions                        10K      100K     3.2M
Average number of answers/question         7.1      7.1      7.1
Std. dev. number of answers/question       7.4      7.2      8.1
Average number of characters/question      175.0    176.7    177.3
Std. dev. of characters/question           204.2    200.0    201.7
Median of characters/question              103      104      105
Average number of characters/answer        332.8    336.5    336.0
Std. dev. of characters/answer             507.6    503.7    499.6
Median of characters/answer                168      175      177
Average number of sentences/question       2.8      2.8      2.9
Std. dev. number of sentences/question     2.7      2.6      2.6
Median number of sentences/question        2        2        2
Average number of sentences/answer         3.9      3.9      3.9
Std. dev. number of sentences/answer       6.3      5.2      5.1
Median number of sentences/answer          2        2        2
Question languages                         6        12       28
Main categories                            163      176      179
Categories                                 869      1744     2853
Sub categories                             677      1245     1539
21. Cluster properties [Figure: kernel density estimation of the average cosine distance. Series: all, similar, sustainable.]
22. Cluster properties [Figure 8: Kernel density estimation of the average time in days between posting of a question and the last answer a question received. x-axis: days between question posted and last answer. Series: all, similar, sustainable.]
23. Cluster properties [Figure 7: Kernel density estimation of average time in days between posting of a question and that question being marked as resolved. x-axis: days between question posted and best answer. Series: all, similar, sustainable.] Almost all questions are answered within days of posting, although similar and sustainable question clusters seem to incorporate more questions that require longer to be answered satisfactorily than regular clusters. However, the distinction is not that clear.
24. Cluster properties [Figure: cumulative cosine distance with linear fitted line; x-axis from 0 to 8.] Figure 2: As in Figure 1, the cumulative cosine distance between vector representations of answers with linear fitted line for a single cluster. However, here the timing of the answers is taken into account. Figure 3: Cumulative cosine distance between semanticized [caption truncated]
25. Cluster properties [Figure: cumulative cosine distance with linear fitted line, plotted over time from Jan 2006 to Jun 2006]