Transformation Functions for Text Classification: A case study with StackOverflow

1. Transformation Functions for Text Classification: A Case Study with StackOverflow
   Natural Language Processing Dublin Meetup, 28th Sept 2016
   The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
2. Piyush Arora, Debasis Ganguly, Gareth J.F. Jones
   ADAPT Centre, School of Computing, Dublin City University
   {parora@computing.dcu.ie, piyusharora07@gmail.com}
   https://computing.dcu.ie/~parora/
3. Overview of the Talk
   ❏ Informal overview of the problem
   ❏ StackOverflow data characteristics
   ❏ A more technical introduction to the problem
   ❏ Text-based classification
   ❏ Vector-embedding-based classification
   ❏ Conclusions
4. Overview of the Problem
   ❏ Parametric approach: draws a 'decision boundary' (a vector in the parameter space) based on labelled samples.
   ❏ Consider the role of additional (unlabelled) samples.
5. Overview of the Problem
   ❏ Apply a transformation function, which maps each labelled sample to another point depending on its neighbourhood (4).
   ❏ Retrain a standard parametric classifier on the transformed samples.
   ❏ Hypothesis: classification effectiveness improves after the 'transformation'.
   [Figure: a labelled point, its transformed position, and the re-trained decision boundary]
6. StackOverflow Question Quality Prediction
   ❏ Motivation:
     ❏ Rapid increase in the number of questions posted on CQA forums.
     ❏ Need for automated question-quality moderation to improve user experience and forum effectiveness.
7. StackOverflow Question Quality Prediction
   ❏ Problem statement:
     ❏ How to classify a new question as suitable or unsuitable without using community feedback such as votes, comments or answers, unlike previous work (2).
     ❏ This addresses the "cold start problem" (1).
   ❏ Proposed solution:
     ❏ An approach based on nearest-neighbour-based transformation functions.
   ❏ Application:
     ❏ Automatic moderation for online forums: saving cost and resources and improving user experience.
     ❏ A general transformation model that can be adapted to any dataset.
8. SO Data Selection (1)
9. SO Question Classification
   ❏ Predict whether a question is good (net votes positive) or bad (net votes negative).

   Questions distribution:
   Category              | All       | Views > 1000
   Bad (-ve net score)   | 380,800   | 30,163
   Good (+ve net score)  | 3,780,301 | 1,315,731
   Vocabulary overlap    | 59.5%     | 34.6%
10. SO Question Classification (M1)
   ❏ Imbalanced class distribution.
   ❏ High vocabulary overlap between the classes.
   ❏ Relatively short documents (average length of 69 words).
   ❏ Lack of informative and discriminative content for classification.
   ❏ Training with 'all' labelled samples creates a biased classifier that labels almost every question as 'good'.
   ❏ Problem: high accuracy but low recall and precision for the negative class (illustrated in the sketch below).

   Raw classification results:
   Question text | Accuracy | F-measure
   Titles only   | 0.9707   | 0.503 (almost random)
   Title + Body  | 0.9735   | 0.503 (almost random)
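The imbalance issue can be seen with a small, purely hypothetical example (numbers chosen for illustration, not taken from the SO data): a classifier that predicts 'good' for everything scores very high accuracy but a near-random macro F-measure.

    from sklearn.metrics import accuracy_score, f1_score

    # 97 'good' (1) questions and 3 'bad' (0) questions; the biased classifier
    # predicts 'good' for every one of them.
    y_true = [1] * 97 + [0] * 3
    y_pred = [1] * 100

    print(accuracy_score(y_true, y_pred))             # 0.97 -- looks impressive
    print(f1_score(y_true, y_pred, average="macro"))  # ~0.49 -- close to random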
11. SO Question Classification (M2)
   ❏ K-NN classification (a non-parametric approach); a minimal baseline sketch follows the results table.
   ❏ Why not just K-NN?
     ❏ Its performance is poor.
     ❏ It is purely non-parametric.
     ❏ It makes no use of the rich textual information for classification.

   K-NN results:
   K | Accuracy | F-score
   1 | 0.4668   | 0.4766
   3 | 0.4594   | 0.4548
   5 | 0.4599   | 0.4523
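For reference, a plain K-NN baseline of this kind can be set up in a few lines. This is a minimal sketch assuming scikit-learn and a TF-IDF bag-of-words representation; the exact features used in M2 are not specified on the slide.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    # Bag-of-words K-NN baseline (K = 3); no parametric decision boundary is learned.
    knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
    # knn.fit(train_texts, train_labels); knn.predict(test_texts)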
12. SO Question Classification
   ❏ Use other 'similar' questions previously asked in the forum to:
     ❏ Transform every question Q to Q' (3).
     ❏ Retrain the classification model on the Q' instances.
   ❏ This combines a parametric with a non-parametric approach.
13. Transformation Function
   ❏ The transformation function φ operating on a vector x depends on the neighbourhood of x:
     φ(x) = Φ(x, N(x)), where N(x) = {x_i : d(x, x_i) <= r}
   ❏ There are various choices for defining Φ, e.g. a weighted centroid (a generic sketch follows).
   ❏ The choice mainly depends on the type of x, i.e. categorical (text) or real-valued (vectors).
   ❏ Experiments cover both categorical x (term-space representations of documents) and real-valued x (embedded document vectors (5)).
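A minimal sketch of this neighbourhood-based transformation; all names are illustrative, and 'combine' stands for whichever Φ is chosen (concatenation for text, weighted centroid for embedded vectors).

    def neighbourhood(x, samples, dist, r):
        """N(x): all samples within distance r of x."""
        return [xi for xi in samples if dist(x, xi) <= r]

    def transform(x, samples, dist, r, combine):
        """phi(x) = Phi(x, N(x)): the transformed point that replaces x for retraining."""
        return combine(x, neighbourhood(x, samples, dist, r))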
14. Question Fields
15. Text-based Classification (M3)
   ❏ Baseline: Multinomial Naïve Bayes (MNB) on the text (title + body) of each question.
   ❏ Obtain the neighbourhood of each document as follows (see the sketch after the results table):
     ❏ Treat the title of each question as a query.
     ❏ Use this query to retrieve the top-K similar documents with BM25 (k1 = 1.2, b = 0.75) (6).
   ❏ Choose the Φ(x, N(x)) function as the 'concatenation' operator.

   Text expansion results:
   K       | Accuracy | F-measure
   0 (MNB) | 0.713    | 0.704
   1       | 0.718    | 0.710
   3       | 0.719    | 0.713
   5       | 0.715    | 0.710
   9       | 0.715    | 0.711
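A sketch of the M3 pipeline, assuming the rank_bm25 package for BM25 retrieval and scikit-learn's MultinomialNB; the whitespace tokenisation and self-match handling are simplifications, not the exact experimental setup.

    from rank_bm25 import BM25Okapi
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def expand_questions(titles, bodies, k=3):
        """Concatenate each question with its top-k BM25 neighbours (title used as query)."""
        docs = [t + " " + b for t, b in zip(titles, bodies)]
        bm25 = BM25Okapi([d.lower().split() for d in docs], k1=1.2, b=0.75)
        expanded = []
        for title, doc in zip(titles, docs):
            # in practice the question itself should be excluded from its own neighbours
            neighbours = bm25.get_top_n(title.lower().split(), docs, n=k)
            expanded.append(doc + " " + " ".join(neighbours))  # Phi = concatenation
        return expanded

    # Retrain the parametric classifier (MNB) on the transformed documents:
    # X = CountVectorizer().fit_transform(expand_questions(titles, bodies, k=3))
    # MultinomialNB().fit(X, labels)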
16. Text-based Classification
   ❏ Query: title field.
   ❏ For retrieval: BM25F with different weights for the title and body fields.
   ❏ Two-step grid search for finding optimal parameter settings.
   ❏ Best results obtained with w(T), w(B) = 1, 3.
17. Text-based Classification (M4)
   ❏ Obtain the neighbourhood of each document as follows:
     ❏ Treat the title of each question as a query.
     ❏ Use this query to retrieve the top-K similar documents with BM25F (k1 = 1.2, b = 0.75) and w(T), w(B) = 1, 3 (the optimised settings from the grid search).
   ❏ Choose the Φ(x, N(x)) function as the 'concatenation' operator; a field-weighting sketch follows the table.

   Textual space results:
   K         | Accuracy | F-measure
   0 (MNB)   | 0.713    | 0.704
   3 (BM25)  | 0.719    | 0.713
   3 (BM25F) | 0.738    | 0.733
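A hand-rolled sketch of the simple BM25F extension (reference 6): per-field term frequencies are merged with the field weights before the usual BM25 saturation is applied. The variable names are illustrative; w(T) = 1, w(B) = 3 follow the grid search above.

    import math

    def bm25f_score(query_terms, doc, avg_len, df, n_docs, weights, k1=1.2, b=0.75):
        # doc: {"title": [tokens], "body": [tokens]}; weights: {"title": 1.0, "body": 3.0}
        doc_len = sum(weights[f] * len(doc[f]) for f in weights)
        avg = sum(weights[f] * avg_len[f] for f in weights)
        score = 0.0
        for t in query_terms:
            tf = sum(weights[f] * doc[f].count(t) for f in weights)  # weighted pseudo-frequency
            if tf == 0:
                continue
            idf = math.log((n_docs - df.get(t, 0) + 0.5) / (df.get(t, 0) + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg))
        return score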
18. Doc2vec Embeddings
19. Embedding-based Classification
   ❏ Motivation: document embeddings capture the semantic similarity between questions.
   ❏ Embed the text (title + body) of each SO question with doc2vec (a minimal sketch follows).
   ❏ Components of each vector lie in [-1, 1].
   ❏ Use an SVM to classify these vectors.
   ❏ Best results obtained with the number of dimensions set to 200.
   ❏ Transformation function: weighted centroid.
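A minimal embedding sketch, assuming gensim 4.x; vector_size = 200 and dm = 0 (the dbow variant) match the settings reported on these slides, while the remaining parameters and the tokenisation are illustrative.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    def embed_questions(texts):
        corpus = [TaggedDocument(words=t.lower().split(), tags=[i])
                  for i, t in enumerate(texts)]
        model = Doc2Vec(corpus, vector_size=200, dm=0,  # dm=0 -> dbow variant
                        min_count=2, epochs=20, workers=4)
        return [model.dv[i] for i in range(len(texts))], model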
20. Embedding-based Classification (M5)
   ❏ For the document-embedded-vector experiments, we use an SVM classifier (Gaussian kernel with default parameters); a weighted-centroid sketch follows the table.
   ❏ The SVM classification effectiveness obtained with the dbow document vectors outperforms that obtained with the dmm model.

   Results (textual-space rows included for comparison):
   K                | Space     | Accuracy | F-measure
   0 (MNB)          | textual   | 0.713    | 0.704
   3 (BM25F)        | textual   | 0.738    | 0.733
   0 (SVM baseline) | embedding | 0.743    | 0.743
   1 (SVM)          | embedding | 0.740    | 0.739
   3 (SVM)          | embedding | 0.747    | 0.746
   5 (SVM)          | embedding | 0.750    | 0.749
   9 (SVM)          | embedding | 0.769    | 0.768
   11 (SVM)         | embedding | 0.765    | 0.764
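A sketch of the embedding-space transformation and retraining step, assuming scikit-learn and NumPy; the cosine weighting and the 50/50 blend of each vector with its neighbourhood centroid are assumptions, not the exact Φ used in the experiments.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.svm import SVC

    def weighted_centroid_transform(X, k=9):
        X = np.asarray(X)
        nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(X)
        dist, idx = nn.kneighbors(X)              # the first neighbour is the point itself
        sims = 1.0 - dist[:, 1:]                  # cosine similarities of the k true neighbours
        weights = sims / sims.sum(axis=1, keepdims=True)
        centroids = np.einsum("ij,ijk->ik", weights, X[idx[:, 1:]])
        return (X + centroids) / 2.0              # blend each vector with its weighted centroid

    # X_new = weighted_centroid_transform(doc_vectors, k=9)
    # SVC(kernel="rbf").fit(X_new, labels)        # Gaussian-kernel SVM, default parameters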
21. Summary
   ❏ A general framework for applying a non-parametric transformation function.
   ❏ Empirical investigation on StackOverflow questions to predict question quality.
   ❏ Two domains investigated: text and real-valued vectors.
   ❏ Two neighbourhood functions:
     ❏ Text: concatenation.
     ❏ Document vectors: weighted centroid.
   ❏ Interpretation of the transformation function φ operating on a vector x:
     φ(x) = Φ(x, N(x)), where N(x) = {x_i : d(x, x_i) <= r}
22. Conclusions
   ❏ BM25F with more weight on the 'body' field of a question improves results by 4.1% relative to the MNB baseline.
   ❏ For document vectors, results improve by 3.4% relative to the SVM baseline.
   ❏ Consistent trends of improvement in classification results for both text and document vectors.
   ❏ Future work: explore alternative transformation functions, and different ways of combining the neighbourhood and transformation functions of the textual and document-vector spaces.
23. References
   1. S. Ravi, B. Pang, V. Rastogi, and R. Kumar. Great Question! Question Quality in Community Q&A. In Proc. of ICWSM '14, 2014.
   2. D. Correa and A. Sureka. Chaff from the Wheat: Characterization and Modeling of Deleted Questions on Stack Overflow. In Proc. of WWW '14, pages 631–642, 2014.
   3. M. Efron, P. Organisciak, and K. Fenlon. Improving Retrieval of Short Texts through Document Expansion. In Proc. of SIGIR '12, pages 911–920, 2012.
   4. Q. V. Le and T. Mikolov. Distributed Representations of Sentences and Documents. In Proc. of ICML '14, pages 1188–1196, 2014.
   5. K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from Distributions via Support Measure Machines. In Proc. of NIPS '12, 2012.
   6. S. E. Robertson, H. Zaragoza, and M. J. Taylor. Simple BM25 Extension to Multiple Weighted Fields. In Proc. of CIKM '04, pages 42–49, 2004.
   7. P. Arora, D. Ganguly, and G. J. F. Jones. The Good, the Bad and Their Kins: Identifying Questions with Negative Scores in StackOverflow. In Proc. of ASONAM '15, pages 1232–1239, 2015.
   8. P. Arora, D. Ganguly, and G. J. F. Jones. Nearest Neighbour Based Transformation Functions for Text Classification: A Case Study with StackOverflow. In Proc. of ICTIR '16, pages 299–302, 2016.
24. Q & A
