Transformation Functions for Text
Classification: A case study with
StackOverflow
The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
Natural Language Processing, Dublin Meetup
28th Sept, 2016
Piyush Arora, Debasis Ganguly, Gareth J.F. Jones
ADAPT Centre, School of Computing,
Dublin City University
{parora@computing.dcu.ie, piyusharora07@gmail.com}
https://computing.dcu.ie/~parora/
Overview of the Talk
❏ Informal Overview of the problem.
❏ StackOverflow data characteristics.
❏ A more technical introduction to the problem.
❏ Text based Classification.
❏ Vector Embedding based Classification.
❏ Conclusions
Overview of the Problem
❏ Parametric approach: Draws a ‘decision boundary’ (a vector in the
parameter space) based on labelled samples.
❏ Consider the role of additional (unlabelled) samples.
Overview of the Problem
❏ Apply a transformation function, which transforms a labelled sample to
another point, depending on its neighbourhood(4).
❏ Retrain a standard parametric classifier on the transformed samples.
❏ Hypothesis: The classification effectiveness after the ‘transformation’ will
improve.
[Figure labels: Transformed point; Transformed and retrained]
StackOverflow Question Quality Prediction
❏ Motivation:
❏ Rapid increase in the number of questions posted on CQA forums.
❏ Need for automated methods of question quality moderation to
improve user experience and forum effectiveness.
StackOverflow Question Quality Prediction
❏ Problem Statement:
❏ How to classify a new question as suitable or unsuitable without the
use of community feedback such as votes, comments or answers,
unlike previous work(2).
❏ Addresses the “cold start problem”(1).
❏ Proposed solution:
❏ Approach based on “Nearest Neighbour based Transformation
Functions”.
❏ Application:
❏ Automatic moderation for online forums: saving cost, resources and
improving user experience.
❏ General transformation model: can be adapted for any dataset.
SO Data Selection(1)
SO Question Classification
❏ To predict if a question is good (net votes +ve) or bad (net votes –ve).
Questions Distribution
Category               All         Views > 1000
Bad (-ve net score)    380,800     30,163
Good (+ve net score)   3,780,301   1,315,731
Vocab Overlap          59.5%       34.6%
SO Question Classification (M1)
❏ Imbalanced class distribution.
❏ High vocabulary overlap between the classes.
❏ Relatively short documents (avg. length of 69 words).
❏ Lack of informative and discriminative content for classification.
❏ Training with ‘all’ labelled samples: Creates a biased classifier which
outputs almost every question as ‘good’.
❏ Problem: High accuracy but low recall and precision for the –ve class.
Raw classification results
Question Text   Accuracy   F-measure
Titles only     0.9707     0.503 (almost random)
Title + Body    0.9735     0.503 (almost random)
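The gap between accuracy and F-measure can be reproduced with a short calculation. The sketch below is illustrative only: it assumes a degenerate classifier that labels every question ‘good’, takes the Views > 1000 class counts from the Questions Distribution table, and assumes the F-measure is macro-averaged over the two classes (the slides do not state the averaging). The function name is hypothetical.

```python
from typing import Tuple


def majority_baseline_metrics(n_good: int, n_bad: int) -> Tuple[float, float]:
    """Accuracy and macro-averaged F1 of a classifier that always predicts 'good'."""
    total = n_good + n_bad
    accuracy = n_good / total

    # 'good' class: every good question is recalled; precision equals the class prior.
    precision_good = n_good / total
    recall_good = 1.0
    f1_good = 2 * precision_good * recall_good / (precision_good + recall_good)

    # 'bad' class: never predicted, so its recall is 0 and its F1 is taken as 0.
    f1_bad = 0.0

    # Macro-averaging is an assumption of this sketch.
    macro_f1 = (f1_good + f1_bad) / 2
    return accuracy, macro_f1


# Class counts for questions with more than 1000 views.
accuracy, macro_f1 = majority_baseline_metrics(n_good=1_315_731, n_bad=30_163)
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}")
```

Under these assumptions the accuracy is close to 0.98 while the macro F-measure stays just below 0.5, the same pattern as in the reported results.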
SO Question Classification (M2)
❏ K-NN classification (a non-parametric approach).
❏ Why not just K-NN?
❏ Because its performance is not good.
❏ Because it is solely non-parametric.
❏ No use of rich textual information for classification.
K-NN results
K   Accuracy   F-score
1   0.4668     0.4766
3   0.4594     0.4548
5   0.4599     0.4523
SO Question Classification
❏ Use other ‘similar’ questions previously asked in the forum to:
❏ Transform every question Q to Q’ (3)
❏ Retrain classification model on the Q’ instances.
❏ Combines parametric with non-parametric approach.
Transformation Function
❏ The transformation function φ operating on a vector x depends on the
neighbourhood of x.
❏ φ(x) = Φ(x, N(x)), where N(x) = {x_i : d(x, x_i) ≤ r}
❏ There are various choices for defining Φ, e.g. weighted centroid etc.
❏ Mainly depends on the type of x, i.e. categorical (text) or real (vectors).
❏ Experiments on both categorical x (i.e. term space representation of
documents) and real valued x (embedded vectors of documents(5)).
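A minimal sketch of the definition φ(x) = Φ(x, N(x)) for real-valued vectors follows. Everything beyond the formula itself is an assumption made for illustration: Euclidean distance for d, an unweighted mean as one possible Φ, a neighbour pool that does not contain x itself, and all function names.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Distance d(x, x_i); Euclidean distance is an assumption of this sketch."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def neighbourhood(x: Vector, pool: List[Vector], r: float) -> List[Vector]:
    """N(x) = {x_i : d(x, x_i) <= r}, drawn from a pool of other samples."""
    return [xi for xi in pool if euclidean(x, xi) <= r]


def mean_with_neighbours(x: Vector, neighbours: List[Vector]) -> List[float]:
    """One possible Phi: the unweighted mean of x and its neighbours."""
    points = [x] + list(neighbours)
    return [sum(p[j] for p in points) / len(points) for j in range(len(x))]


def transform(
    x: Vector,
    pool: List[Vector],
    r: float,
    combine: Callable[[Vector, List[Vector]], List[float]] = mean_with_neighbours,
) -> List[float]:
    """phi(x) = Phi(x, N(x)), with Phi passed in as `combine`."""
    return combine(x, neighbourhood(x, pool, r))


# Two pool points lie within r = 1.5 of the origin; the third is too far away.
pool = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
x_prime = transform([0.0, 0.0], pool, r=1.5)
print(x_prime)
```

Passing Φ as an argument keeps the neighbourhood definition separate from the combination step, so a different Φ can be swapped in without touching the rest.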
Question Fields
Text based Classification (M3)
❏ Baseline: Multinomial Naïve Bayes (MNB) on the text (title+body) of
each question.
❏ Obtain neighbourhood of each document by:
❏ Treat the title of each question as a query.
❏ Use this query to retrieve top K similar documents by BM25 (k1=1.2,
b=0.75)(6)
❏ Choose the Φ(x, N(x)) function as the ‘concatenation’ operator.
Text Expansion Results
K         Accuracy   F-measure
0 (MNB)   0.713      0.704
1         0.718      0.710
3         0.719      0.713
5         0.715      0.710
9         0.715      0.711
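The text-space transformation (title as query, top K documents by BM25, concatenation as Φ) can be sketched as below. This is not the authors' implementation: the IDF variant (the non-negative `log(1 + ...)` form), whitespace tokenisation, the toy collection and the function names are all assumptions made for illustration. In a real run the question itself would also have to be excluded from its own result list.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every tokenised document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Non-negative IDF variant (an assumption of this sketch).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def expand(title: List[str], body: List[str],
           collection: List[List[str]], k: int) -> List[str]:
    """Phi(x, N(x)) as concatenation: own text plus the text of the top-k hits."""
    scores = bm25_scores(title, collection)
    ranked = sorted(range(len(collection)), key=lambda i: scores[i], reverse=True)
    top_k = [i for i in ranked if scores[i] > 0][:k]
    expanded = list(title) + list(body)
    for i in top_k:
        expanded += collection[i]
    return expanded


# Toy collection of previously asked questions (hypothetical data).
collection = [
    "how to parse json in python".split(),
    "segmentation fault in c pointer".split(),
    "python json decode error".split(),
]
expanded = expand("parse json python".split(), "my code fails".split(),
                  collection, k=1)
print(expanded)
```

The expanded token list would then be fed to the classifier in place of the original question text; K = 0 leaves the question unchanged, which corresponds to the MNB baseline row.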
Text based Classification
❏ Query: Title field
❏ For retrieval: BM25F with different weights for title and body.
❏ Two-step grid search for finding optimal parameter settings.
❏ Best results obtained with w(T), w(B) = 1, 3.
Text based Classification (M4)
❏ Obtain neighbourhood of each document by:
❏ Treat the title of each question as a query.
❏ Use this query to retrieve top K similar documents by BM25F (k1=1.2,
b=0.75) with w(T), w(B) = 1, 3.
❏ Optimized search
❏ Choose the Φ(x, N(x)) function as the ‘concatenation’ operator.
Textual Space results
K           Accuracy   F-measure
0 (MNB)     0.713      0.704
3 (BM25)    0.719      0.713
3 (BM25F)   0.738      0.733
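Field weighting can be illustrated with a simplified BM25F in the spirit of the weighted-field extension (6): term frequencies from each field are multiplied by the field weight and summed before the usual BM25 saturation is applied. The sketch is an approximation under stated assumptions, not the configuration used in the experiments: the IDF variant, the use of weighted lengths for normalisation, the toy documents and all names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

Doc = Dict[str, List[str]]  # field name -> tokens


def weighted_tf(doc: Doc, weights: Dict[str, float]) -> Counter:
    """Field-weighted term frequencies: tf'(t) = sum over fields of w_f * tf_f(t)."""
    tf = Counter()
    for field, tokens in doc.items():
        for term in tokens:
            tf[term] += weights[field]
    return tf


def bm25f_scores(query: List[str], docs: List[Doc], weights: Dict[str, float],
                 k1: float = 1.2, b: float = 0.75) -> List[float]:
    """BM25 applied to field-weighted pseudo-documents (positive weights assumed)."""
    tfs = [weighted_tf(d, weights) for d in docs]
    lengths = [sum(tf.values()) for tf in tfs]  # weighted document lengths
    avg_len = sum(lengths) / len(docs)
    df = Counter(term for tf in tfs for term in tf)
    scores = []
    for tf, length in zip(tfs, lengths):
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * length / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical documents: the query term is in the title of one, the body of the other.
docs = [
    {"title": ["json", "error"], "body": ["help", "please"]},
    {"title": ["help", "please"], "body": ["json", "error"]},
]
body_heavy = bm25f_scores(["json"], docs, {"title": 1.0, "body": 3.0})
uniform = bm25f_scores(["json"], docs, {"title": 1.0, "body": 1.0})
print(body_heavy, uniform)
```

With w(T), w(B) = 1, 3 the document that matches in its body scores higher, while equal weights make the two documents indistinguishable.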
Doc2vec embeddings
Embedding based Classification
❏ Motivation: Document embedding captures the semantic similarity
between questions.
❏ Embed the text (title + body) of each SO question by doc2vec.
❏ Components of each vector in [-1, 1].
❏ Use SVM for classifying these vectors.
❏ Best results obtained when #dimensions set to 200.
❏ Transformation function: Weighted centroid.
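One way to realise a weighted centroid over embedded vectors is sketched below. The slides do not specify the weighting, so the choices here are assumptions: neighbours are the K most cosine-similar vectors, each neighbour is weighted by its (non-negative) cosine similarity, and the question's own vector keeps weight 1. Function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weighted_centroid(x: Vector, pool: List[Vector], k: int) -> List[float]:
    """Similarity-weighted centroid of x and its k most similar pool vectors."""
    sims = sorted(((cosine(x, v), v) for v in pool),
                  key=lambda pair: pair[0], reverse=True)[:k]
    # x keeps weight 1; neighbours are weighted by similarity, clipped at 0.
    weighted = [(1.0, x)] + [(max(s, 0.0), v) for s, v in sims]
    total = sum(w for w, _ in weighted)
    return [sum(w * v[j] for w, v in weighted) / total for j in range(len(x))]


# Toy 2-d vectors standing in for 200-dimensional doc2vec embeddings.
pool = [[0.0, 1.0], [1.0, 1.0]]
x_prime = weighted_centroid([1.0, 0.0], pool, k=1)
print(x_prime)
```

With k = 0 the function returns the original vector, which corresponds to training the SVM on untransformed embeddings.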
Embedding based Classification (M5)
❏ For the experiments on document embedding vectors, we use an SVM
classifier (Gaussian kernel with default parameters).
❏ The SVM classification effectiveness obtained with the dbow
document vectors is higher than that obtained with the dmm model.
K    Model          Space       Accuracy   F-measure
0    MNB            Text        0.713      0.704
3    BM25F          Text        0.738      0.733
0    SVM baseline   Embedding   0.743      0.743
1    SVM            Embedding   0.740      0.739
3    SVM            Embedding   0.747      0.746
5    SVM            Embedding   0.750      0.749
9    SVM            Embedding   0.769      0.768
11   SVM            Embedding   0.765      0.764
Textual space results are included for comparison.
Summary
❏ A general framework for applying a non-parametric based transformation
function.
❏ Empirical investigation on StackOverflow questions to predict question
quality.
❏ Two domains investigated: text and real vectors.
❏ Two neighbourhood functions:
❏ Text: Concatenation
❏ Docvecs: Weighted centroid
❏ Interpretation of the transformation function φ operating on a vector x
❏ φ(x) = Φ(x, N(x)), where N(x) = {x_i : d(x, x_i) ≤ r}
Conclusions
❏ BM25F with more weight to the ‘body’ field of a question improves the
F-measure by 4.1% relative to the MNB baseline (0.704 to 0.733).
❏ For docvecs, the F-measure improves by 3.4% relative to the SVM baseline
(0.743 to 0.768 at K = 9).
❏ Consistent trends in improvements of classification results for both text
and document vectors.
❏ Future work: explore alternative transformation functions, and different
ways of combining the neighbourhood and the transformation functions of the
textual and the document vector spaces.
References
1. S. Ravi, B. Pang, V. Rastogi, and R. Kumar. Great Question! Question Quality in
Community Q&A. In Proceedings of ICWSM ’14, 2014.
2. D. Correa and A. Sureka. Chaff from the wheat: characterization and modeling of
deleted questions on stack overflow. In Proceedings of WWW ’14, pages 631–642,
2014.
3. M. Efron, P. Organisciak, and K. Fenlon. Improving retrieval of short texts through
document expansion. In Proceedings of SIGIR ’12, pages 911–920, 2012.
4. Q. V. Le and T. Mikolov. Distributed representations of sentences and documents.
In Proceedings of ICML ’14, pages 1188–1196, 2014.
5. K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from distributions
via support measure machines. In Proceedings of NIPS ’12, 2012.
6. S. E. Robertson, H. Zaragoza, and M. J. Taylor. Simple BM25 extension to multiple
weighted fields. In Proceedings of CIKM ’04, pages 42–49, 2004.
7. P. Arora, D. Ganguly, and G. J. F. Jones. The good, the bad and their kins:
Identifying questions with negative scores in StackOverflow. In Proceedings of
ASONAM ’15, pages 1232–1239, 2015.
8. P. Arora, D. Ganguly, and G. J. F. Jones. Nearest Neighbour based Transformation
Functions for Text Classification: A Case Study with StackOverflow. In Proceedings of
ICTIR ’16, pages 299–302, 2016.
Q & A
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana IfrimHashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
 
Transfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingTransfer Learning for Natural Language Processing
Transfer Learning for Natural Language Processing
 
Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...
 
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer GilmartinSpoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
 
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
 
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
 
A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment AnalysisA Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
 
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
 

Recently uploaded

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 

Recently uploaded (20)

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 

Transformation Functions for Text Classification: A case study with StackOverflow

  • 1. Transformation Functions for Text Classification: A case study with StackOverflow. Natural Language Processing, Dublin Meetup, 28th Sept, 2016. The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
  • 2. Piyush Arora, Debasis Ganguly, Gareth J.F. Jones. ADAPT Centre, School of Computing, Dublin City University. {parora@computing.dcu.ie, piyusharora07@gmail.com} https://computing.dcu.ie/~parora/
  • 3. Overview of the Talk
      ❏ Informal overview of the problem.
      ❏ StackOverflow data characteristics.
      ❏ A more technical introduction to the problem.
      ❏ Text based Classification.
      ❏ Vector Embedding based Classification.
      ❏ Conclusions
  • 4. Overview of the Problem
      ❏ Parametric approach: Draws a ‘decision boundary’ (a vector in the parameter space) based on labelled samples.
      ❏ Consider the role of additional (unlabelled) samples.
  • 5. Overview of the Problem
      ❏ Apply a transformation function, which transforms a labelled sample to another point, depending on its neighbourhood(4).
      ❏ Retrain a standard parametric classifier on the transformed samples.
      ❏ Hypothesis: The classification effectiveness after the ‘transformation’ will improve.
      [Figure labels: Transformed point; Transformed and retrained]
  • 6. StackOverflow Question Quality Prediction
      ❏ Motivation:
          ❏ Rapid increase in the number of questions posted on CQA forums.
          ❏ Need for automated methods of question quality moderation to improve user experience and forum effectiveness.
  • 7. StackOverflow Question Quality Prediction
      ❏ Problem Statement:
          ❏ How to classify a new question as suitable or unsuitable without the use of community feedback such as votes, comments or answers, unlike previous work(2).
          ❏ Addressing the “Cold Start Problem”(1)
      ❏ Proposed solution:
          ❏ Approach based on “Nearest Neighbour based Transformation Functions”.
      ❏ Application:
          ❏ Automatic moderation for online forums: saving cost, resources and improving user experience.
          ❏ General transformation model: can be adapted for any dataset.
  • 9. SO Question Classification
      ❏ To predict if a question is good (net votes +ve) or bad (net votes –ve).
      Questions Distribution:
          Category                All          Views > 1000
          Bad (-ve net score)     380,800      30,163
          Good (+ve net score)    3,780,301    1,315,731
          Vocab Overlap           59.5%        34.6%
  • 10. SO Question Classification (M1)
      ❏ Imbalanced class distribution.
      ❏ High vocabulary overlap between the classes.
      ❏ Relatively short documents (avg. length of 69 words).
      ❏ Lack of informative and discriminative content for classification.
      ❏ Training with ‘all’ labelled samples: Creates a biased classifier which outputs almost every question as ‘good’.
      ❏ Problem: High accuracy but low recall and precision for the –ve class.
      Raw classification results:
          Question Text    Accuracy    F-measure
          Titles only      0.9707      0.503 (almost random)
          Title + Body     0.9735      0.503 (almost random)
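To see why high accuracy can coexist with an F-measure near 0.5, consider a degenerate classifier that labels every question ‘good’. The sketch below is an illustration only: it assumes the ‘Views > 1000’ counts from the Questions Distribution table and a macro-averaged F-measure, neither of which the slides state explicitly for these results.

```python
def always_good_scores(n_good, n_bad):
    """Accuracy and macro-averaged F1 of a classifier that predicts 'good' for everything."""
    total = n_good + n_bad
    accuracy = n_good / total
    # 'good' class: recall is 1.0 and precision equals the share of good questions.
    precision_good = n_good / total
    f1_good = 2 * precision_good * 1.0 / (precision_good + 1.0)
    # 'bad' class: never predicted, so its F1 is taken as 0.
    f1_bad = 0.0
    return accuracy, (f1_good + f1_bad) / 2


# Counts assumed from the 'Views > 1000' column of the distribution table.
acc, macro_f1 = always_good_scores(n_good=1_315_731, n_bad=30_163)
print(f"accuracy={acc:.3f}  macro-F1={macro_f1:.3f}")
```

Under these assumptions the accuracy comes out close to 0.98 while the macro F1 stays just below 0.5, which is the same pattern as the raw classification results.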
  • 11. SO Question Classification (M2)
      ❏ K-NN classification (a non-parametric approach).
      ❏ Why not just K-NN?
          ❏ Because its performance is not good.
          ❏ Because it is solely non-parametric.
          ❏ No use of rich textual information for classification.
      K-NN results:
          K    Accuracy    F-score
          1    0.4668      0.4766
          3    0.4594      0.4548
          5    0.4599      0.4523
  • 12. SO Question Classification
      ❏ Use other ‘similar’ questions previously asked in the forum to
          ❏ Transform every question Q to Q’ (3)
          ❏ Retrain classification model on the Q’ instances.
      ❏ Combines parametric with non-parametric approach.
  • 13. Transformation Function
      ❏ The transformation function φ operating on a vector x depends on the neighbourhood of x.
      ❏ φ(x) = Φ(x, N(x)), where N(x) = {xi: d(x, xi) <= r}
      ❏ There are various choices for defining Φ, e.g. weighted centroid etc.
      ❏ Mainly depends on the type of x, i.e. categorical (text) or real (vectors).
      ❏ Experiments on both categorical x (i.e. term space representation of documents) and real valued x (embedded vectors of documents(5)).
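The definition φ(x) = Φ(x, N(x)) can be sketched for real-valued vectors as follows. This is a minimal illustration, not the authors' implementation: the Euclidean distance, the 1/(1 + d) neighbour weights and all function names are assumptions chosen for the example.

```python
import math


def euclidean(a, b):
    """Distance d(x, xi); Euclidean is an assumption, any metric could be used."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def neighbourhood(x, pool, r, d=euclidean):
    """N(x) = {xi : d(x, xi) <= r}, taken over a pool of other samples."""
    return [xi for xi in pool if d(x, xi) <= r]


def weighted_centroid(x, neighbours, d=euclidean):
    """One possible Φ: centroid of x and its neighbours, closer neighbours weigh more.

    The weighting 1 / (1 + distance) is a hypothetical choice for illustration.
    """
    points = [x] + neighbours
    weights = [1.0] + [1.0 / (1.0 + d(x, xi)) for xi in neighbours]
    total = sum(weights)
    return [sum(w * p[j] for w, p in zip(weights, points)) / total
            for j in range(len(x))]


def transform(x, pool, r, combine=weighted_centroid):
    """φ(x) = Φ(x, N(x)); `combine` plays the role of Φ."""
    return combine(x, neighbourhood(x, pool, r))


pool = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(transform([0.0, 0.0], pool, r=1.5))  # the far point [5, 5] is ignored
```

A sample with an empty neighbourhood is returned unchanged, so the transformation only moves points that have close neighbours in the pool.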
  • 15. Text based Classification (M3)
      ❏ Baseline: Multinomial Naïve Bayes (MNB) on the text (title+body) of each question.
      ❏ Obtain neighbourhood of each document by:
          ❏ Treat the title of each question as a query.
          ❏ Use this query to retrieve top K similar documents by BM25 (k=1.2, b=0.75)(6)
      ❏ Choose the Φ(x, N(x)) function as the ‘concatenation’ operator.
      Text Expansion Results:
          K          Accuracy    F-measure
          0 (MNB)    0.713       0.704
          1          0.718       0.710
          3          0.719       0.713
          5          0.715       0.710
          9          0.715       0.711
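The expansion step for text can be sketched as: score every other question against the title with BM25, then append the top K to the original text. The code below is a toy version under stated assumptions: documents are pre-tokenised lists, the corpus excludes the question being expanded, and the IDF variant (log(1 + ...)) is one common form that the slides do not specify. The function names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """BM25 score of each tokenised document in `docs` for the tokenised `query`."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in set(query):
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores


def expand(title, body, corpus, k=3):
    """Concatenation Φ: title + body followed by the top-k retrieved documents."""
    scores = bm25_scores(title, corpus)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    top = [corpus[i] for i in ranked[:k] if scores[i] > 0]
    return title + body + [tok for doc in top for tok in doc]


corpus = [["python", "list", "sort"],
          ["java", "null", "pointer"],
          ["sort", "dict", "keys"]]
print(expand(["python", "sort"], ["how"], corpus, k=1))
```

The expanded token list would then be fed to the same MNB classifier as the baseline, so K = 0 reduces to the unexpanded question.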
  • 16. Text based Classification
      ❏ Query: Title field
      ❏ For retrieval: BM25F with different weights for title and body.
      ❏ Two step grid search for finding optimal parameter settings.
      ❏ Best results obtained with w(T), w(B) = 1, 3.
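The slides do not say what the two steps of the grid search were. One common reading is a coarse pass over a wide grid followed by a finer pass around the best coarse setting; the sketch below illustrates that reading only. The `evaluate` callback, the grid values and the toy objective are all hypothetical stand-ins for retraining and scoring the classifier.

```python
from itertools import product


def two_step_grid_search(evaluate, coarse=(1, 2, 3, 4, 5), step=0.25, span=2):
    """Coarse-then-fine search over (w_title, w_body); `evaluate` returns a score to maximise."""
    # Step 1: coarse grid over all pairs of candidate weights.
    best = max(product(coarse, repeat=2), key=lambda w: evaluate(*w))
    # Step 2: finer grid centred on the best coarse setting (positive weights only).
    offsets = [i * step for i in range(-span, span + 1)]
    fine = [(best[0] + dt, best[1] + db)
            for dt, db in product(offsets, repeat=2)
            if best[0] + dt > 0 and best[1] + db > 0]
    return max(fine, key=lambda w: evaluate(*w))


# Toy objective standing in for a validation F-measure; it peaks at (1.25, 3.0).
def toy_score(w_title, w_body):
    return -((w_title - 1.25) ** 2 + (w_body - 3.0) ** 2)


print(two_step_grid_search(toy_score))
```

In practice `evaluate` would run retrieval with the given field weights, expand the training questions, retrain the classifier and return a held-out score.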
  • 17. Text based Classification (M4)
      ❏ Obtain neighbourhood of each document by:
          ❏ Treat the title of each question as a query.
          ❏ Use this query to retrieve top K similar documents by BM25F (k=1.2, b=0.75) with w(T), w(B) = 1, 3.
          ❏ Optimized search
      ❏ Choose the Φ(x, N(x)) function as the ‘concatenation’ operator.
      Textual Space results:
          K            Accuracy    F-measure
          0 (MNB)      0.713       0.704
          3 (BM25)     0.719       0.713
          3 (BM25F)    0.738       0.733
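A simplified view of field weighting in the spirit of the simple BM25F extension(6): term frequencies from the title and body are combined linearly with the field weights before the usual BM25 saturation is applied. The sketch assumes tokenised (title, body) pairs and the same IDF variant as a plain BM25 scorer; per-field length normalisation is omitted, and the function name is hypothetical.

```python
import math
from collections import Counter


def bm25f_scores(query, docs, w_title=1.0, w_body=3.0, k1=1.2, b=0.75):
    """Score (title_tokens, body_tokens) pairs using field-weighted term frequencies."""
    n = len(docs)
    # Field-weighted document lengths and their average.
    lengths = [w_title * len(t) + w_body * len(bd) for t, bd in docs]
    avgdl = sum(lengths) / n
    df = Counter(term for t, bd in docs for term in set(t) | set(bd))
    scores = []
    for (t, bd), dl in zip(docs, lengths):
        tf_title, tf_body = Counter(t), Counter(bd)
        s = 0.0
        for term in set(query):
            # Combine the per-field counts before saturation.
            tf = w_title * tf_title[term] + w_body * tf_body[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores


docs = [(["sort", "list"], ["how", "to"]),    # query term appears in the title
        (["how", "to"], ["sort", "list"])]    # query term appears in the body
print(bm25f_scores(["sort"], docs))                          # body match scores higher
print(bm25f_scores(["sort"], docs, w_title=3.0, w_body=1.0))  # title match scores higher
```

With w(T), w(B) = 1, 3 a match in the body counts three times as much as a match in the title, which is the setting the slides report as best.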
  • 19. Embedding based Classification
      ❏ Motivation: Document embedding captures the semantic similarity between questions.
      ❏ Embed the text (title + body) of each SO question by doc2vec.
      ❏ Components of each vector in [-1, 1].
      ❏ Use SVM for classifying these vectors.
      ❏ Best results obtained when #dimensions set to 200.
      ❏ Transformation function: Weighted centroid.
  • 20. Embedding based Classification (M5)
      ❏ For document embedded vector based experiments, we use an SVM classifier (Gaussian kernel with default parameters).
      ❏ The SVM classification effectiveness obtained with the dbow document vectors is higher than that obtained with the dmm model.
      Results (textual space results are included for comparison):
          K     Method          Space        Accuracy    F-measure
          0     MNB             text         0.713       0.704
          3     BM25F           text         0.738       0.733
          0     SVM Baseline    embedding    0.743       0.743
          1     SVM             embedding    0.740       0.739
          3     SVM             embedding    0.747       0.746
          5     SVM             embedding    0.750       0.749
          9     SVM             embedding    0.769       0.768
          11    SVM             embedding    0.765       0.764
  • 21. Summary
      ❏ A general framework for applying a non-parametric based transformation function.
      ❏ Empirical investigation on StackOverflow questions to predict question quality.
      ❏ Two domains investigated: text and real vectors.
      ❏ Two neighbourhood functions:
          ❏ Text: Concatenation
          ❏ Docvecs: Weighted centroid
      ❏ Interpretation of the transformation function φ operating on a vector x
          ❏ φ(x) = Φ(x, N(x)), where N(x) = {xi: d(x, xi) <= r}
  • 22. Conclusions
      ❏ BM25F with more weight to the ‘body’ field of a question improves results by 4.1% relative to MNB baseline.
      ❏ For docvecs, results are improved by 3.4% relative to SVM baseline.
      ❏ Consistent trends in improvements of classification results for both text and document vectors.
      ❏ Future work: explore alternative transformation functions, and different ways of combining the neighbourhood and the transformation functions of the textual and the document vector spaces.
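The relative improvements quoted above appear to correspond to the F-measure columns of the results tables (K = 3 with BM25F against MNB, and K = 9 against the SVM baseline). That mapping is an inference from the reported numbers rather than something the slides state; the check is a short calculation:

```python
def relative_gain(baseline, improved):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


# F-measure values taken from the results tables.
text_gain = relative_gain(0.704, 0.733)       # MNB baseline vs. BM25F expansion, K=3
embedding_gain = relative_gain(0.743, 0.768)  # SVM baseline vs. weighted centroid, K=9
print(round(text_gain, 1), round(embedding_gain, 1))  # 4.1 3.4
```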
  • 23. References
      1. S. Ravi, B. Pang, V. Rastogi, and R. Kumar. Great Question! Question Quality in Community Q&A. In Proceedings of ICWSM ’14, 2014.
      2. D. Correa and A. Sureka. Chaff from the wheat: characterization and modeling of deleted questions on stack overflow. In Proceedings of WWW ’14, pages 631–642, 2014.
      3. M. Efron, P. Organisciak, and K. Fenlon. Improving retrieval of short texts through document expansion. In Proceedings of SIGIR ’12, pages 911–920, 2012.
      4. Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of ICML ’14, pages 1188–1196, 2014.
      5. K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from distributions via support measure machines. In Proceedings of NIPS ’12, 2012.
      6. S. E. Robertson, H. Zaragoza, and M. J. Taylor. Simple BM25 extension to multiple weighted fields. In Proceedings of CIKM ’04, pages 42–49, 2004.
      7. P. Arora, D. Ganguly, and G. J. F. Jones. The good, the bad and their kins: Identifying questions with negative scores in StackOverflow. In Proceedings of ASONAM ’15, pages 1232–1239, 2015.
      8. P. Arora, D. Ganguly, and G. J. F. Jones. Nearest Neighbour based Transformation Functions for Text Classification: A Case Study with StackOverflow. In Proceedings of ICTIR ’16, pages 299–302, 2016.