Predicting Question Quality in Community Question Answering Sites

Juan M. Caicedo C. (AndrewID: jcaicedo, jcaicedo@cs.cmu.edu)
Seshadri Sridharan (AndrewID: seshadrs, seshadrs@andrew.cmu.edu)
Aarti Singh (project mentor)

Abstract

We present a model to predict question quality, an abstract measure, using only the content-level features available at the time a new question is posted. We first predict asker satisfaction using preexisting labels, and then predict different aspects of question quality using human annotations. For the former task, we use features from the question text and community interaction to improve on the baseline model of Liu et al. For the latter, we hypothesize that question content and community response can independently model question quality, and we enrich the content-based model using co-training.

1 Introduction

Community Question Answering (CQA) web sites allow users to post questions, submit answers and interact with other users by voting, writing comments or using other mechanisms of participation. These sites have become a popular source for seeking information of different kinds, both for general topics, on sites such as Yahoo Answers or Answer Bag, and for more specialized ones, like Quora and StackOverflow. One key element of the success of a CQA site is the quality of the content generated by the community of users. In particular, the quality of the questions directly affects the relevance of the content, the willingness of the community to participate and the likelihood that visitors to the site will want to engage in the process. For this reason, we think it is important to understand the factors that affect the quality of questions and, if possible, to be able to assess their quality automatically.

Detecting the quality of questions also benefits the users of a CQA site. First, askers can know in advance whether the questions they ask will be graded as high quality. This would allow them to learn to ask better questions and, ideally, improve the satisfaction that they get from the site. Second, the moderators of a CQA site can monitor the quality of recently posted questions; this would allow them to detect and improve those that are of low quality and to highlight the high-quality questions so that they receive more attention from the community. We call the first application the online scenario, when the question is being asked, and the second the offline scenario, after the question has been posted and the community has started to participate.

Although several problems in CQA have been addressed using diverse machine learning techniques, predicting question quality poses challenges that have not been covered in much detail. The main difficulty arises from the nature of the two possible applications. In the offline scenario, machine learning algorithms can use features extracted from the community reaction to the question, which is a reliable indicator of the quality of the content, whereas in the online scenario this information is not available and the algorithms have to rely on the asker's profile and the text of the question, which requires NLP techniques to extract informative features about the quality.

In this project we present a model for predicting question quality in the online scenario. First, we extend the existing work on predicting asker satisfaction [3] and test its applicability on a different dataset. Second, we improve it by using richer linguistic features extracted from the question content.
Then, after showing the high predictability of the models on this task, we move to the related problem of predicting question quality. For this task we use manually labeled questions to train the models again. To overcome the problem of labeling a large set of questions, we use co-training to generate more training instances that allow us to improve both models.

2 Related Work

The interest in CQA sites has also increased within research areas related to information retrieval. Much of that work has focused on content ranking and recommendation, content analysis, social network analysis, user modeling and quality evaluation. [1] and [3] present an overview of the research done in those areas. We discuss here the works that are most closely related to our tasks.

A framework for automatically classifying content of high quality in social media sites is described in [1], where quality is modeled in terms of the content itself, the relations between the user and the generated content, and its usage statistics. However, they treat the features extracted from the content and the features derived from the community as complementary, and they do not study the differences between the online and offline scenarios.

The problem of predicting user satisfaction is studied in [3]. It can be argued that modeling user satisfaction can serve as an approximate measure of the quality of the content users create. We believe that the satisfaction of an asker depends on the response from the community generated by his or her questions, which depends in turn on the quality of the question itself. Liu et al. present a prediction model that uses features based on the content and the community structure and evaluate it in the two scenarios that we are considering. We extend their work by using richer text-based features and by exploiting additional interactions from the community. In [5], Shah and Pomerantz present a study in which they train a classifier that accurately predicts the quality of an answer based on human judgment. We take a subset of the criteria used by the human judges who participated in their study and use it to assess the quality of questions.

3 Predicting Question Quality

The problem of assessing the quality of a question is subjective, since it depends on several factors that can vary with the context of the evaluation. For this reason, we decided to address two related tasks: (1) predict asker satisfaction as an indicator of question quality, and (2) predict the quality assessments assigned by humans.

Predicting asker satisfaction: We define that an asker is satisfied if he selects one of the posted answers he received as the best one for his question; additionally, this answer must have at least 2 votes. We base this definition on the one proposed by Liu et al. [3] and adapt it to the data that we have for this task. We add the constraint on the number of votes in order to also take into account the judgment of the community. Thus, we have a binary classification task in which we have to predict whether the asker of a question was satisfied or not.

Predicting question quality based on human assessments: We use human judgments to assess question quality based on five aspects: readability, conciseness, detail, politeness and appropriateness. They are a subset of the criteria used by [5] to measure answer quality on a CQA site. The questions were annotated with a value on a 1 to 5 scale for each of the selected criteria, and we aggregate these values to define an indicator of the overall question quality. Under this aggregated measure, we define that a high-quality question has values greater than or equal to three for at least 3 of the criteria. This is again a binary classification task, where our target label is this aggregated measure.
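To make the two target labels concrete, the sketch below shows one way they could be computed. The field names and data structures are hypothetical; only the thresholds follow the definitions above.

```python
# Minimal sketch of the two binary labels defined in Section 3.
# The dictionary fields (accepted_answer, votes, criterion scores) are
# illustrative names, not the actual schema used in the project.

QUALITY_CRITERIA = ["readability", "conciseness", "detail",
                    "politeness", "appropriateness"]  # each rated 1-5

def asker_is_satisfied(question):
    """Satisfied: the asker accepted an answer and it has at least 2 votes."""
    accepted = question.get("accepted_answer")
    return accepted is not None and accepted["votes"] >= 2

def is_high_quality(annotation, min_score=3, min_criteria=3):
    """High quality: at least 3 of the 5 criteria are rated 3 or higher."""
    satisfied = sum(1 for c in QUALITY_CRITERIA if annotation[c] >= min_score)
    return satisfied >= min_criteria

# Example with made-up values:
question = {"accepted_answer": {"votes": 3}}
annotation = {"readability": 4, "conciseness": 2, "detail": 3,
              "politeness": 5, "appropriateness": 2}
print(asker_is_satisfied(question), is_high_quality(annotation))  # True True
```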
4 Task 1: Predicting Asker Satisfaction

4.1 Experimental Setup

In this section we present the experimental setup for the asker satisfaction task. We describe the dataset, the features, the classification algorithms and the evaluation metrics used for each of the experiments.
Dataset: Our dataset is based on the Stack Exchange network of CQA sites (http://stackexchange.com/). It contains 2.2 million questions and 4.8 million answers, along with the complete information for the questions, the answers posted, the answer selected by the asker, the user information (askers and answerers) and the community response (votes, comments and modifications) for 35 of its sites. We selected StackOverflow, a site dedicated to computer programming. For our experiments we randomly selected 5,000 questions that were at least two months old, together with their corresponding 10,902 answers. There are 1,734 (34.68%) questions where the user is satisfied; this distribution is similar to that of the original dataset (33.72%).

Features: We use two sets of features corresponding to the two scenarios of our task. For the online scenario, we extract features from the text of the question and the profile of the asker; for the offline scenario, we add features from the answers posted and the reaction of the community to the question. As our baseline model we use the features proposed in [3]; we extend them by adding richer features extracted from the text. Table 1 presents the list of the features used.

Online Features
  Question Content Features: Title length; Content punctuation density; Text spacing density; Content body length; Code block counts and total length; Time (hour) posted; Tag count.
  Extended Question Content Features: Text misspelling ratio; Text capitalization ratio; Text blacklisted word count; Words per sentence; Uppercase word length ratio; Number of sentences; Text similarity with the text of questions where the user is satisfied; Similarity of the sequence of POS tags with the questions where the user is satisfied.
  Asker Profile Features: Answers to questions ratio; Answers received; Membership age; Solved question count; Average past score; Recent past score.

Offline Features
  Community Interaction Features: Question favourite count; Question's community score; Number of question revisions; Question new tag change count.
  Community Answers Features: Answers count; Answers score (max); Answers score (total); Best answer body length; Best answer body spacing density; Accepted count; Accepted ratio; Answers to question ratio; Answers reputation.
  Answerers Profile Features: Average answerer membership age; Most voted / most reputed answerer's accepted answer count; Most voted / most reputed answerer's answer acceptance ratio; Most voted / most reputed answerer's reputation; Most voted / most reputed answerer's question solved count.

Table 1: Feature classes and their features.

Algorithms: We trained three classifiers based on decision trees, logistic regression and Naive Bayes. We chose these algorithms since they have been used successfully in CQA-related problems. We are also particularly interested in decision trees for their readability, given the application of our task. We used the algorithm implementations provided in the Scikit-learn toolkit [4].

Metrics: We report the overall accuracy of the classifiers, along with the averaged measures of precision, recall and the F1 score over the two classes. We perform 10-fold cross-validation over the 5,000 training instances.
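As a reference for this setup, the following is a minimal sketch of the pipeline described above: a handful of the question-content features from Table 1, the three classifier families, and 10-fold cross-validated metrics. The toy data and the exact feature definitions are placeholders; the real feature set is the one listed in Table 1.

```python
# Sketch of the Task 1 pipeline: a few content features, three classifier
# families, and accuracy/precision/recall/F1 under 10-fold cross-validation.
import re
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate

def content_features(title, body):
    """A handful of the online question-content features from Table 1."""
    words = re.findall(r"[A-Za-z']+", body)
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    return [
        len(title),                                            # title length
        len(body),                                             # body length
        sum(c in ",.;:!?" for c in body) / max(len(body), 1),  # punctuation density
        body.count(" ") / max(len(body), 1),                   # spacing density
        sum(w.isupper() for w in words) / max(len(words), 1),  # uppercase word ratio
        len(words) / max(len(sentences), 1),                   # words per sentence
    ]

print(content_features("How do I sort a list?",
                       "I have a list of ints. WHAT is the fastest way? Any ideas."))

# Placeholder feature matrix and labels standing in for the 5,000 questions.
rng = np.random.default_rng(0)
X = rng.random((5000, 6))
y = rng.integers(0, 2, 5000)

models = {
    "Decision Tree (depth 8)": DecisionTreeClassifier(max_depth=8),
    "Logistic Regression (L1)": LogisticRegression(penalty="l1", solver="liblinear"),
    "Logistic Regression (L2)": LogisticRegression(penalty="l2", solver="liblinear"),
    # Naive Bayes needs non-negative, comparably scaled inputs, hence the scaler.
    "Multinomial NB": make_pipeline(MinMaxScaler(), MultinomialNB()),
}
metrics = ["accuracy", "precision", "recall", "f1"]
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=10, scoring=metrics)
    print(name, {m: round(scores["test_" + m].mean(), 3) for m in metrics})
```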
4.2 Experiment Results

We present in this section the results of the asker satisfaction prediction task. First, we evaluated the performance of the algorithms while varying the number of training instances and, for decision trees, their parameters. We then evaluated the best algorithm, logistic regression, using the different sets of features for each of the scenarios. Finally, we report the features with the highest predictability according to their information gain.

Algorithm evaluation: Since we are mainly interested in the online scenario, we compared the algorithms using the corresponding set of features. In the case of decision trees, we first evaluated the maximum depth of the learned tree in order to choose an appropriate value for our task. Figure 1 shows how the complexity of the tree affects the accuracy on the training and the test sets. We chose a maximum depth of 8 for the rest of our experiments. The algorithm performs poorly below depth 8 and starts to over-fit beyond it. The error bars in this and the upcoming figures correspond to the standard error of the sample mean.
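The depth sweep behind Figure 1 can be reproduced along these lines; the feature matrix and labels here are placeholders rather than the project's data.

```python
# Sketch of the max_depth sweep used to select the tree depth (cf. Figure 1):
# for each depth, record the mean train and test accuracy under 10-fold CV.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.random((5000, 40))        # placeholder features
y = rng.integers(0, 2, 5000)      # placeholder satisfaction labels

for depth in [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 30, 50]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_validate(tree, X, y, cv=10, return_train_score=True)
    print(f"depth={depth:>2}  train={scores['train_score'].mean():.3f}  "
          f"test={scores['test_score'].mean():.3f}")
```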
Although decision trees have a poor performance in this scenario (accuracy 66.06%), they achieve better results in the offline scenario, where a tree of depth 8 was 87% accurate. This can be explained by examining the features with the highest information gain, presented in Table 3.

We compared all the algorithms while varying the number of training instances (Figure 2). The logistic regression based learner performed best overall. In fact, using L1 regularization we achieve the highest accuracy (70%); this suggests that some of the features might be redundant and that we can perform further experiments using feature selection. For the Naive Bayes learner, we normalized the values of the features in order to adjust their scales and to use the same smoothing value for all the features when there are zero values.

Different feature sets: We evaluated the logistic regression classifier (L1) using five sets of features: two sets based on [3], which are considered as baselines for each scenario, two sets with the extended features, and one considering only the features extracted from the question content. Table 2 presents these results. Using the new features, the classifiers are slightly better than the baselines.

Features            Accuracy   Precision   Recall   F1
Baseline offline    0.8177     0.8028      0.6411   0.7126
Offline             0.8175     0.7994      0.6451   0.7135
Baseline online     0.6841     0.5822      0.3747   0.4544
Online              0.6887     0.5886      0.3912   0.4692
Question only       0.6607     0.5337      0.2976   0.3811

Table 2: Classification results for the different sets of features.

Most informative features: Of the new text-based features that we added to our satisfaction prediction model, only the misspelling ratio appears (with a weak value of 0.008) among the top information gain features presented in Table 3. The features in the baseline, such as the asker information, are much stronger than the ones from the text. This could be a reason why we observe a decrease in performance when supplementing the baseline with richer textual features.

We also observed that all algorithms were better at predicting the unsatisfied questions than the satisfied ones. This could be attributed to the inherently skewed distribution: there are 1.85 times as many unsatisfied questions as satisfied ones. It is also possible that there exist questions for which users have not selected an answer despite having received one.

Another interesting phenomenon we observe is that the seniority and past success of users carry the most information about asker satisfaction. We can see that the total number of questions solved by the asker and his membership age are the two most important features in the online scenario. Importantly, this is in accord with the general trend of success and satisfaction patterns we would expect in CQA sites.

[Figure 1: Decision tree accuracy on the training and test sets as the maximum tree depth varies.]
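The rankings reported in Table 3 below are based on information gain. A rough equivalent can be computed with scikit-learn's mutual information estimator, as in this sketch with hypothetical feature names; the project's exact information gain computation may differ.

```python
# Sketch: rank features by an information-gain-style score.
# mutual_info_classif estimates the mutual information between each feature
# and the class label; the feature names, X and y are placeholders.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

feature_names = ["questions_solved", "membership_age",
                 "answers_to_questions_ratio", "misspelling_ratio",
                 "code_length"]                       # hypothetical names
rng = np.random.default_rng(0)
X = rng.random((5000, len(feature_names)))            # placeholder features
y = rng.integers(0, 2, 5000)                          # placeholder labels

scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(feature_names, scores), key=lambda t: -t[1]):
    print(f"{name:<28} {score:.4f}")
```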
[Figure 2: Accuracy, precision, recall and F1 of the four classifiers on the online satisfaction prediction task as the number of training instances grows.]

Online features
Feature                                          Information Gain
Asker's total number of questions solved         0.0453
Asker's membership age                           0.0424
Asker's answers to question ratio                0.0338
Asker's average past question score              0.0307
Asker's recent questions score                   0.0166
Question code length                             0.0112
Question text unigram "entity"                   0.0083
Question text misspelling ratio                  0.0080
Question text bigram "the top"                   0.0059
Question tag "android"                           0.0057

Offline features
Feature                                          Information Gain
Best answer score                                0.5449
Highest answerer reputation                      0.1139
Community score for question                     0.11168
Answerer's reputation                            0.10283
Top value of answerers' accepted answer count    0.09611
Top value of answerers' question solved count    0.09234
Top value of answerers' answer count             0.09218
Answerers' answer accepted count                 0.09153
Reputation of most voted answerer                0.09129
Top answerer's answer count                      0.08176

Question content only
Feature                                          Information Gain
Question code length                             0.01122
Question text unigram "entity"                   0.00838
Question text misspelling ratio                  0.00809
Question text bigram "the top"                   0.00594
Question tag "android"                           0.00571
Question body length                             0.00515
Question text unigram "url"                      0.00481
Question text bigram "using this"                0.00458
Question text bigram "reference to"              0.00455
Question text bigram "any ideas"                 0.00444

Table 3: Feature sets ranked by information gain.

5 Task 2: Predicting Question Quality Based on Human Assessments

Our main interest in the question quality task is to evaluate the classifiers in a semi-supervised setting as a solution to the scarcity of training data. We continue the experiments performed for the previous task, but this time we evaluate the classifiers on the dataset of annotated questions. In addition to evaluating the different algorithms using the features available in each scenario (online and offline), we apply co-training in order to expand the training examples. In the following sections we present these experiments and their results.

Definition of Question Quality

The merit or quality of a question is a highly subjective factor that is difficult to quantify. Since it cannot be measured directly or derived from any available feature(s), we had to hand-annotate it. We define the quality of a question as a combination of five metrics: conciseness, politeness, readability, detail and relevance, rated on a scale of 1 to 5.
In addition to these metrics, we also annotated each question with a quality label that represents our judgment of the question's quality. We used this measure to understand the importance of the five metrics and their combined influence on question quality. We found that conciseness, readability, politeness and detail are reliable estimators of question quality. We looked at different patterns of values that the metrics took. The most consistent one was that high-quality questions had at least three metrics with values greater than or equal to 3, and we used this as our label for question quality. This rough labeling rule was followed because high-quality questions occurred in different forms: concise and readable but not detailed, detailed and polite but not concise, readable and polite but not relevant, et cetera. In addition, our question quality labeling rule also had considerable correlation with our hand-annotated quality label.

5.1 Experimental Setup

We use the algorithms, features and evaluation metrics defined for the asker satisfaction task, but we apply them to the dataset of manually labeled questions. This dataset contains 172 instances, of which 127 (73.83%) are labeled as high quality and the rest (45, 26.16%) as low quality. Another difference in this task is that we perform 4-fold cross-validation, since the number of training instances is much smaller.

Co-training: To increase the number of training examples, we apply this technique as presented in [2], with one adjustment for our task: since the target classes are not balanced, we ensure that the classifiers add a number of instances that preserves the class distribution. We train two classifiers, each using one of the two sets of features of the online and offline scenarios. The following are the values that we assign to the four parameters of the co-training algorithm (a sketch of the resulting procedure is given at the end of this subsection):

  • p and n: the number of positive and negative examples labeled by the classifiers and added to the training pool. We set these values to p = 3 and n = 1.
  • k: the number of iterations. We set this parameter to k = 100.
  • u: the number of unlabeled examples made available to the classifiers for labeling in each iteration. We perform our experiments using different values.

For the evaluation, we use 4-fold cross-validation as follows: we partition the set of labeled instances into 4 subsets, maintaining the same class distribution in each subset. We select 3 subsets for training the classifiers and leave the remaining subset for testing them after each iteration. The subsets selected for training are extended with the new instances labeled by the classifiers, while the test subset is not modified. In each iteration we evaluate the accuracy, precision, recall and F1 score of each classifier.
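The following is a minimal sketch of the co-training loop described above, assuming logistic regression classifiers for the two views and precomputed online/offline feature matrices. Variable names and data are placeholders, the pool handling is simplified, and the stratified 4-fold evaluation is omitted.

```python
# Sketch of co-training (Blum and Mitchell [2]) over the two feature views.
# X_online / X_offline are row-aligned feature matrices over the same
# questions; labels maps the indices of labeled questions to 0/1; unlabeled
# holds the remaining indices. The values of p, n, k and u follow Section 5.1.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_online, X_offline, labels, unlabeled, p=3, n=1, k=100, u=75, seed=0):
    rng = np.random.default_rng(seed)
    labels = dict(labels)
    labeled = list(labels)
    unlabeled = list(unlabeled)

    def fit(X):
        clf = LogisticRegression(penalty="l1", solver="liblinear")
        return clf.fit(X[labeled], [labels[i] for i in labeled])

    for _ in range(k):
        if not unlabeled:
            break
        # Draw the working pool U' of at most u unlabeled examples.
        pool = list(rng.choice(unlabeled, size=min(u, len(unlabeled)),
                               replace=False))
        for X in (X_online, X_offline):
            if not pool:
                break
            proba = fit(X).predict_proba(X[pool])[:, 1]
            order = np.argsort(proba)
            # Add the p most confident positives and the n most confident
            # negatives, which roughly preserves the 3:1 class distribution.
            picks = ([(pool[i], 1) for i in order[-p:]] +
                     [(pool[i], 0) for i in order[:n]])
            for idx, label in picks:
                if idx in unlabeled:
                    labels[idx] = label
                    labeled.append(idx)
                    unlabeled.remove(idx)
            pool = [i for i in pool if i in unlabeled]
    return labels

# Usage with placeholder data: 172 labeled and 300 unlabeled questions.
rng = np.random.default_rng(0)
X_on, X_off = rng.random((472, 20)), rng.random((472, 15))
seed_labels = {i: int(rng.random() < 0.74) for i in range(172)}
expanded = co_train(X_on, X_off, seed_labels, unlabeled=range(172, 472))
print(len(expanded), "labeled instances after co-training")
```

In the experiments, the held-out fold of the 4-fold split is never added to, and both view-specific classifiers are evaluated on it after every iteration.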
5.2 Experiment Results

Question quality and feature sets: We evaluated the same classifiers that we used for the asker satisfaction task, but now for predicting the label assigned after the annotation; we also compared them using the two sets of features (online and offline). The averaged results are presented in Figure 3. Again, the best results are obtained with the logistic regression based learner using L1 regularization, with an accuracy of 0.74418 and an F1 score of 0.83851. However, for this task the classifiers are more accurate, and the features from the online scenario have higher predictability.

Co-training: We evaluated the overall improvement across the iterations and the effect of varying the parameter u (the number of unlabeled instances available to the classifiers) on the performance of each classifier. Figure 4 shows the F1 score of each classifier at each iteration of the experiments that run co-training with different values of the parameter u.

We note that the accuracies of both classifiers improved; however, the one for the offline scenario, which was initially the weakest, improved by larger margins, to the point of matching the performance of the other classifier (when u = 75 and u = 100). Regarding the number of iterations, in general the improvements occur within the first 50. At this point, the training set varies from 213 to 255 instances, depending on the value of u. This variation is the effect of the random sampling of the set of unlabeled data, which is controlled by this parameter.
[Figure 3: Accuracy and F1 of the classifiers on the question quality task for the online, offline and combined feature sets.]

[Figure 4: F1 score of the online and offline classifiers across 100 co-training iterations, for u = 25, 50, 75, 100, 200 and 500.]

The best performance was obtained when u = 75. In this case, the F1 score of the classifier that uses the offline features increased from 0.77662 to 0.85222, while for the online classifier the improvement was from 0.83852 to 0.86456. This coincides with the remark made in [2] about the training pool size, which notes that selecting the unlabeled instances from a smaller set forces the classifiers to learn from examples that are more representative of the underlying distribution. We also examined the confidence of the classifiers for each target class, i.e.
the minimum probability predicted by the classifiers for the instances that were added to the training set. For the positive class (high quality), the lowest value was 0.92328, while for the negative class it was 0.20913. This uncertainty can be attributed to the fact that the classes are unbalanced. Table 4 presents a summary of the results of the experiments.
Feature set   Initial F1   u     Max. F1   Iteration with Max. F1   Gain
Offline       0.77662      25    0.83277   30                       0.05615
                           50    0.82055   15                       0.04393
                           75    0.85222   80                       0.07560
                           100   0.83854   17                       0.06192
                           200   0.81701   84                       0.04040
                           500   0.81256   22                       0.03594
Online        0.83852      25    0.86227   76                       0.02375
                           50    0.85710   11                       0.01858
                           75    0.86456   27                       0.02604
                           100   0.85415   70                       0.01563
                           200   0.85130   8                        0.01278
                           500   0.86388   8                        0.02536

Table 4: Maximum F1 values achieved in co-training when varying the parameter u.

6 Conclusions

From the above experiments we have learnt valuable insights about the performance of the algorithms on asker satisfaction and question quality prediction. We observed the effect of the skew in asker satisfaction on the classification accuracy values. We have also seen certain well-known CQA trends manifest in the feature analysis. The offline features clearly predict asker satisfaction better. Our approach of looking at a question as having offline and online views has been successful, giving us insights into question quality.

We characterized question quality, an abstract measure, as a combination of particular aspects. Our approach to quantifying question quality showed the importance of the contributing metrics individually, besides revealing its high subjectivity, and we were able to train classifiers that predicted the values of the annotations.

We used co-training to expand the training data; this increased the predictive performance of both classifiers, but mainly of the one based on the offline set of features. Furthermore, we learnt that the quality and consistency of the annotations are a limitation of this technique in this scenario.

While we experimented with different learning algorithms and feature sets, we found that both asker satisfaction and question quality can be modeled similarly. Through our experiments we were able to show that we can improve question quality prediction by defining more specific features and by expanding the training set using semi-supervised learning.

References

[1] Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad Mishne. Finding high-quality content in social media. In Proceedings of the International Conference on Web Search and Web Data Mining (WSDM '08), page 183, 2008.

[2] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers, 1998.

[3] Yandong Liu, Jiang Bian, and Eugene Agichtein. Predicting information seeker satisfaction in community question answering. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '08), page 483, 2008.

[4] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[5] C. Shah and J. Pomerantz. Evaluating and predicting answer quality in community QA. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '10), pages 411–418, 2010.
